Kubernetes AI Inference Services: Deployment and Optimization in Practice | 极客日志


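AI-generated summary: Deploying AI inference services on Kubernetes involves GPU resource management, choosing a model-serving framework, and performance optimization. The article covers configuration details for TensorFlow Serving and Triton Inference Server, including writing YAML deployment manifests, mounting PVCs, HPA autoscaling, and hardening with network policies. It discusses ways to raise throughput, such as model quantization and batching configuration, and gives practical guidance on monitoring, logging, and troubleshooting, with the aim of helping engineers build a highly available, production-grade inference platform.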

板砖工程师 · Published 2026/4/8 · Updated 2026/4/26


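Core Concepts of AI Inference Services

An AI inference service exposes a trained model behind a callable interface and handles requests in real time or in batches. When you run such a service on Kubernetes, resource scheduling, performance bottlenecks, and a highly available architecture are the core concerns.

Common inference frameworks include Google's TensorFlow Serving, the PyTorch project's TorchServe, Microsoft's ONNX Runtime, and NVIDIA's high-performance Triton Inference Server. Which one to choose usually depends on where your model comes from and on your hardware.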

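GPU Resource Management

Driver and Plugin Configuration

To use GPUs on Kubernetes nodes, first make sure the host has the correct NVIDIA driver installed, then deploy the NVIDIA Device Plugin in the cluster so that GPUs are exposed as schedulable resources.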

# Install the driver on the node
apt-get install -y nvidia-driver-535

# Deploy the NVIDIA Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify that the GPUs are visible
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
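
The same check can be scripted when there are many nodes. The sketch below is only an illustration: it assumes you have captured the output of `kubectl get nodes -o json` as a string, and the `SAMPLE` document with its node names is hypothetical, not taken from a real cluster.

```python
import json


def gpu_capacity(nodes_json):
    """Map node name -> advertised nvidia.com/gpu count (0 when absent)."""
    nodes = json.loads(nodes_json)
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        capacity = item.get("status", {}).get("capacity", {})
        # Kubernetes reports extended resources as strings, e.g. "4".
        result[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"capacity": {"cpu": "32", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"capacity": {"cpu": "16"}}},
]})

if __name__ == "__main__":
    for node, count in gpu_capacity(SAMPLE).items():
        print(f"{node}\t{count}")
```

A node that reports 0 here either has no GPU or is not yet running the device plugin.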

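Resource Allocation Strategy

When deploying an inference service, always declare GPU requests and limits explicitly in the Pod spec so that contention for resources does not cause performance jitter.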

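# GPU-backed TensorFlow Serving Deployment.
# The original listing was cut off after "ports:"; the ports and resources
# below are completed from the other manifests in this article and are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        # The plain "latest" tag is the CPU build; GPU inference needs a -gpu tag
        image: tensorflow/serving:latest-gpu
        ports:
        - containerPort: 8500  # gRPC
        - containerPort: 8501  # REST
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1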
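
Kubernetes treats GPUs as an extended resource: a container must declare them under `limits`, and if `requests` is also given it has to equal the limit. As a rough sketch, a manifest linter could check this before anything is applied. The function name `check_gpu_resources` is hypothetical, and the input is assumed to be a container spec already parsed into a dictionary.

```python
GPU_KEY = "nvidia.com/gpu"


def check_gpu_resources(container):
    """Return a list of problems with the container's GPU declaration."""
    resources = container.get("resources", {})
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    problems = []
    if GPU_KEY not in limits:
        # GPUs cannot be requested without a limit.
        problems.append("no nvidia.com/gpu limit declared")
    elif GPU_KEY in requests and str(requests[GPU_KEY]) != str(limits[GPU_KEY]):
        # When both are present, the two values must match.
        problems.append("nvidia.com/gpu request must equal its limit")
    return problems


if __name__ == "__main__":
    container = {
        "name": "tensorflow-serving",
        "resources": {"limits": {GPU_KEY: 1}, "requests": {GPU_KEY: 1}},
    }
    print(check_gpu_resources(container) or "GPU declaration looks consistent")
```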

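mkdir -p models/mnist/1 /tmp/model
# The URL is a tar.gz archive of a ResNet SavedModel (served here under the name "mnist"), so unpack it instead of saving it as saved_model.pb
wget -qO- https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar -xz -C /tmp/model
cp -r "$(dirname "$(find /tmp/model -name saved_model.pb | head -n 1)")"/. models/mnist/1/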

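# Create the PVC that holds the model repository.
# ReadWriteOnce lets a single node mount the volume; replicas on other nodes need a volume that supports multi-node access.
kubectl create -f - <<EOF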
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# TensorFlow Serving Deployment; the model is read from the PVC mounted at /models
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8500
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mnist
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc

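# Service that exposes the TensorFlow Serving REST API
apiVersion: v1
kind: Service
metadata:
  name: tf-serving
  namespace: default
  labels:
    app: tf-serving  # lets a ServiceMonitor select this Service by label
spec:
  selector:
    app: tf-serving
  ports:
  - name: rest  # named so that monitoring resources can refer to the port
    port: 8501
    targetPort: 8501
  type: LoadBalancer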

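# Look up the external IP of the Service
MODEL_SERVICE=$(kubectl get svc tf-serving -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Build a valid JSON body (one all-zero 28x28 instance); a Python expression inside the JSON string is not evaluated by curl
PAYLOAD=$(python3 -c 'import json; print(json.dumps({"instances": [[[0.0] * 28 for _ in range(28)]]}))')
curl -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" \
     "http://$MODEL_SERVICE:8501/v1/models/mnist:predict"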
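
For callers that are not shell scripts, the same request can be sent from Python with only the standard library. This is a minimal sketch rather than a complete client: the helper names `build_predict_body` and `predict` are hypothetical, the 28x28 input shape simply mirrors the curl example, and the host is whatever address your Service exposes.

```python
import json
import urllib.request


def build_predict_body(height=28, width=28):
    """Return the JSON body for one all-zero instance of the given shape."""
    instance = [[0.0] * width for _ in range(height)]
    return json.dumps({"instances": [instance]}).encode("utf-8")


def predict(host, model="mnist", port=8501, timeout=5.0):
    """POST the body to the TensorFlow Serving REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    request = urllib.request.Request(
        url,
        data=build_predict_body(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    body = json.loads(build_predict_body())
    print(len(body["instances"]), len(body["instances"][0]))
```

Calling `predict("<service-ip>")` needs a reachable server; the `__main__` block only inspects the payload so it can run anywhere.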

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
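      containers:
      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        # Start the server explicitly and point it at the mounted model repository
        command: ["tritonserver"]
        args: ["--model-repository=/models"]
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Prometheus metrics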
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc

apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: default
spec:
  selector:
    app: triton-server
  ports:
  # A Service with more than one port must name each of them
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: LoadBalancer

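# Container environment fragment for performance tuning
env:
# Let TensorFlow grow GPU memory on demand instead of reserving it all at startup
- name: TF_FORCE_GPU_ALLOW_GROWTH
  value: "true"
# Not read by the stock tensorflow/serving entrypoint; only meaningful if your own wrapper consumes it
- name: BATCH_SIZE
  value: "32"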
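
Batching raises throughput by sending several inputs through the model in one call. Server-side batching in TensorFlow Serving and Triton is configured through each server's own settings, which this article does not show, so the sketch below only illustrates the idea on the client side: grouping queued inputs into batches capped at a fixed size. The helper name `batched` is hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    pending = range(70)  # stand-in for 70 queued inputs
    print([len(b) for b in batched(pending, 32)])
```

Each yielded list could then be sent as the `instances` array of a single predict request.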

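# HPA for tf-serving. Utilization targets are percentages of the containers' resource requests,
# so the target Deployment must declare CPU and memory requests for these metrics to work.
apiVersion: autoscaling/v2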
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
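
To get a feel for how these targets turn into replica counts, the documented HPA rule is desired = ceil(current replicas × current metric / target metric), evaluated per metric with the largest result winning. The sketch below applies that rule to the 70% CPU and 80% memory targets above. It is a simplification: it assumes the default 10% tolerance and ignores stabilization windows, pod readiness, and missing metrics. The function names are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Apply the basic HPA scaling rule for a single metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_decision(current_replicas, cpu, memory, min_replicas=2, max_replicas=10):
    """Combine the CPU (70%) and memory (80%) targets, then clamp to the bounds."""
    proposals = [
        desired_replicas(current_replicas, cpu, 70),
        desired_replicas(current_replicas, memory, 80),
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    print(hpa_decision(current_replicas=2, cpu=140, memory=40))
```

With 2 replicas at 140% CPU utilization, the CPU metric proposes 4 replicas and memory proposes fewer, so the larger proposal is used.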

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tf-serving-monitor
  namespace: monitoring
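spec:
  # The monitor lives in "monitoring" but the Service is in "default"
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: tf-serving  # matched against labels on the Service object, not on the pods
  endpoints:
  # Name of the Service port (assumes the tf-serving Service names its 8501 port "rest"); a bare number is not accepted here
  - port: rest
    # Must match the Prometheus path enabled in TensorFlow Serving's monitoring config file
    path: /v1/monitoring/prometheus
    interval: 15s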

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: tf-serving
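  policyTypes:
  - Ingress
  # Egress is listed with no egress rules, so all outbound traffic from these pods (including DNS) is denied
  - Egress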
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8501

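# Ingress metadata fragment for an ingress-nginx canary: route 20% of traffic to the canary backend
annotations: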
  nginx.ingress.kubernetes.io/canary: "true"
  nginx.ingress.kubernetes.io/canary-weight: "20"
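
A canary weight of 20 means roughly one request in five reaches the new version. The simulation below is purely conceptual and is not how ingress-nginx is implemented. It draws a random number per request to show what share of traffic a given weight produces over many requests. The names `route` and `simulate` are hypothetical.

```python
import random


def route(weight_percent, rng):
    """Pick a backend for one request given the canary weight (0-100)."""
    return "canary" if rng.randrange(100) < weight_percent else "stable"


def simulate(total_requests=10_000, weight_percent=20, seed=0):
    """Return the observed fraction of requests sent to the canary."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    hits = sum(route(weight_percent, rng) == "canary"
               for _ in range(total_requests))
    return hits / total_requests


if __name__ == "__main__":
    print(f"canary share: {simulate():.3f}")
```

Over a small number of requests the observed share can drift well away from the configured weight, which is worth remembering when judging a canary from a few minutes of traffic.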

# Check GPU status from inside a pod
kubectl exec -it <pod-name> -- nvidia-smi
# Show recent logs from all tf-serving pods
kubectl logs -l app=tf-serving