Kubernetes AI Inference Service Deployment and Optimization Best Practices | 极客日志


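Best practices for deploying AI inference services on Kubernetes, covering GPU resource management, deployment configuration for TensorFlow Serving and Triton Inference Server, performance optimization (quantization, batching, autoscaling), monitoring and observability, and security policies. With resources and models configured sensibly, you can build a high-performance, reliable AI service.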

宁静 · Published 2026/4/6 · Updated 2026/5/2137 views


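1. Core Concepts of AI Inference Services

1.1 What Is an AI Inference Service

An AI inference service exposes a trained AI model as an accessible service that handles inference requests in real time or in batches. On Kubernetes, such a service has to account for resource management, performance optimization, and high availability.

1.2 Common AI Inference Frameworks

TensorFlow Serving: Google's open-source serving framework for machine learning models
TorchServe: the official model serving framework for PyTorch
ONNX Runtime: Microsoft's open-source, cross-platform inference engine
Triton Inference Server: NVIDIA's open-source, high-performance inference server

2. GPU Resource Management

2.1 Installing the GPU Driver and NVIDIA Device Plugin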

# Install the NVIDIA driver (run on the node)
apt-get install -y nvidia-driver-535

# Install the NVIDIA Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify that the nodes report GPU capacity
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'

2.2 GPU Resource Allocation

Deploy an inference service that uses a GPU:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      -


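3. TensorFlow Serving Deployment

3.1 Preparing the Model

# Download the sample model. The URL points to a tar.gz archive (a ResNet SavedModel),
# not a bare saved_model.pb, so save it as an archive and unpack it.
mkdir -p models/mnist/1
wget -O /tmp/savedmodel.tar.gz https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz
tar -xzf /tmp/savedmodel.tar.gz -C models/mnist/1
# Check the unpacked layout: saved_model.pb and variables/ should sit directly under models/mnist/1.

# Create the model storage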
kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
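TensorFlow Serving loads a model from a directory shaped like `<base path>/<model name>/<numeric version>/`, with `saved_model.pb` inside the version directory. Before copying files onto the volume, it can help to check that layout locally. The following is a minimal sketch, not part of the original setup: the function name `check_model_layout` is hypothetical, and the demo uses an empty placeholder file in a temporary directory instead of a real SavedModel.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if at least one numeric version directory
# under the given model directory contains saved_model.pb.
check_model_layout() {
  local model_dir="$1"
  local found=1
  local version
  for version_dir in "$model_dir"/*/; do
    [ -d "$version_dir" ] || continue
    version="$(basename "$version_dir")"
    # Version directories must be named with digits only.
    case "$version" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ -f "$version_dir/saved_model.pb" ]; then
      echo "ok: version $version"
      found=0
    fi
  done
  return "$found"
}

# Demo with a placeholder file standing in for a real SavedModel.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/mnist/1"
touch "$demo_root/mnist/1/saved_model.pb"
if check_model_layout "$demo_root/mnist"; then
  layout_status="valid"
else
  layout_status="invalid"
fi
echo "layout is $layout_status"
```

Running the same check against the real `models/mnist` directory before uploading is a cheap way to catch an archive that unpacked one level too deep.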

3.2 Deploying TensorFlow Serving

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8500
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mnist
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc

apiVersion: v1
kind: Service
metadata:
  name: tf-serving
  namespace: default
spec:
  selector:
    app: tf-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: LoadBalancer

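# Deploy the service
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Test the inference service with a 28x28 all-zero input.
# The payload is generated with python3 because JSON cannot contain a list comprehension.
MODEL_SERVICE=$(kubectl get svc tf-serving -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
PAYLOAD=$(python3 -c 'import json; print(json.dumps({"instances": [[[0.0] * 28 for _ in range(28)]]}))')
curl -d "$PAYLOAD" -X POST "http://$MODEL_SERVICE:8501/v1/models/mnist:predict"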

4. Triton Inference Server Deployment

4.1 Installing Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
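      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        # Start Triton against the mounted model repository
        args: ["tritonserver", "--model-repository=/models"]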
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc

apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: default
spec:
  selector:
    app: triton-server
  ports:
  - port: 8000
    targetPort: 8000
  - port: 8001
    targetPort: 8001
  - port: 8002
    targetPort: 8002
  type: LoadBalancer

# Deploy the service
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Check the service status
kubectl get pods -l app=triton-server

5. Performance Optimization

5.2 Inference Service Optimization

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-batched
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving-batched
  template:
    metadata:
      labels:
        app: tf-serving-batched
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mnist
        - name: TF_FORCE_GPU_ALLOW_GROWTH
          value: "true"
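        # BATCH_SIZE is kept from the original manifest; the stock tensorflow/serving
        # entrypoint does not appear to read it, and server-side batching is
        # normally enabled with the --enable_batching flag instead.
        - name: BATCH_SIZE
          value: "32"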
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1

5.3 Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving
  minReplicas: 2
  maxReplicas: 10
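  # Utilization targets are computed against container resource requests,
  # so the target Deployment must set CPU and memory requests.
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80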
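To see how the targets in this manifest turn into a replica count, recall the formula given in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies it with integer arithmetic and clamps the result to `minReplicas` and `maxReplicas`. The function name `hpa_desired_replicas` and the 105% utilization figure are illustrative assumptions, and the real controller also applies a tolerance and stabilization behavior that are not modeled here.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the documented HPA formula:
#   desired = ceil(current_replicas * current_metric / target_metric)
# clamped to [min_replicas, max_replicas]. Tolerance and stabilization
# used by the real controller are not modeled.
hpa_desired_replicas() {
  local current_replicas="$1"
  local current_metric="$2"
  local target_metric="$3"
  local min_replicas="$4"
  local max_replicas="$5"
  # Integer ceiling division.
  local desired=$(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
  if [ "$desired" -lt "$min_replicas" ]; then desired="$min_replicas"; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired="$max_replicas"; fi
  echo "$desired"
}

# 2 replicas at an illustrative 105% average CPU utilization, target 70%.
scaled_up="$(hpa_desired_replicas 2 105 70 2 10)"
echo "desired replicas: $scaled_up"
```

When several metrics are configured, as with CPU and memory here, the HPA evaluates each one and uses the largest resulting replica count.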

6. Monitoring and Observability

6.1 Monitoring Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tf-serving-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: tf-serving
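  # Look for matching Services in the default namespace, where tf-serving runs
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  # "port" must be the name of a Service port, not a number. This assumes the
  # tf-serving Service names its 8501 port "http" and carries the label app: tf-serving.
  - port: http
    path: /v1/monitoring/prometheus
    interval: 15s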

6.2 Log Management

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: default
spec:
  # ...
  template:
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        # ...
        env:
        - name: TF_CPP_MIN_LOG_LEVEL
          value: "0"
        - name: TF_ENABLE_GPU_GARBAGE_COLLECTION
          value: "true"
        args:
        - --model_name=mnist
        - --model_base_path=/models/mnist
        - --enable_batching=true
        - --batching_parameters_file=/models/batching_parameters.txt
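The manifest above points `--batching_parameters_file` at `/models/batching_parameters.txt`, but the file itself is not shown. TensorFlow Serving reads it as a protobuf text-format message. The sketch below writes one such file locally; every value in it is an illustrative placeholder rather than a tuned recommendation.

```shell
#!/usr/bin/env bash
# Write an illustrative batching parameters file in protobuf text format.
# All values are placeholders to tune against real traffic.
params_file="batching_parameters.txt"
cat > "$params_file" <<'EOF'
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
EOF
param_count="$(wc -l < "$params_file" | tr -d ' ')"
echo "wrote $param_count parameters to $params_file"
```

The file then needs to be copied to the root of the model volume so that it appears at `/models/batching_parameters.txt` inside the container. A larger `max_batch_size` or `batch_timeout_micros` generally trades latency for throughput.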

7. Security Best Practices

7.2 Network Security

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: tf-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8501
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: monitoring
    ports:
    - protocol: TCP
      port: 9090

8. Real-World Scenarios

8.1 Multi-Model Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-multi-model
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-multi-model
  template:
    metadata:
      labels:
        app: triton-multi-model
    spec:
      containers:
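      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        # Start Triton against the mounted model repository
        args: ["tritonserver", "--model-repository=/models"]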
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: models-pvc

8.2 A/B Testing

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-inference-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"
spec:
  rules:
  - host: inference.example.com
    http:
      paths:
      - path: /v1/models
        pathType: Prefix
        backend:
          service:
            name: tf-serving-v2
            port:
              number: 8501
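With ingress-nginx, a canary Ingress is meant to sit alongside a regular Ingress for the same host and path; the canary then receives the configured share of traffic (here roughly 20%) and the rest goes to the regular backend. The source shows only the canary side, so the following is a sketch of what the companion could look like. The Ingress name `ai-inference-ingress-primary` and the `ingressClassName` value are assumptions; the backend reuses the `tf-serving` Service defined earlier.

```shell
#!/usr/bin/env bash
# Write a hypothetical primary Ingress that the canary Ingress pairs with.
# The Ingress name and ingressClassName are assumptions.
manifest="primary-ingress.yaml"
cat > "$manifest" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-inference-ingress-primary
  namespace: default
spec:
  ingressClassName: nginx
  rules:
  - host: inference.example.com
    http:
      paths:
      - path: /v1/models
        pathType: Prefix
        backend:
          service:
            name: tf-serving
            port:
              number: 8501
EOF
echo "wrote $manifest; apply it with: kubectl apply -f $manifest"
```

Apply the primary Ingress before the canary one, then raise `canary-weight` gradually while watching error rates on `tf-serving-v2`.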

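9. Troubleshooting

9.2 Debugging Tips

# Check GPU usage
kubectl exec -it <pod-name> -- nvidia-smi

# View the inference service logs
kubectl logs -l app=tf-serving

# Check the model status
curl http://<service-ip>:8501/v1/models/mnist

# Test the inference service with a 28x28 all-zero input.
# The payload is generated with python3 because JSON cannot contain a list comprehension.
PAYLOAD=$(python3 -c 'import json; print(json.dumps({"instances": [[[0.0] * 28 for _ in range(28)]]}))')
curl -d "$PAYLOAD" -X POST "http://<service-ip>:8501/v1/models/mnist:predict"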

