Kubernetes and AI Inference Services: Best Practices
1. Core Concepts of AI Inference Services
1.1 What Is an AI Inference Service
An AI inference service exposes a trained AI model as an accessible service that handles inference requests in real time or in batches. In a Kubernetes environment, an AI inference service must account for resource management, performance optimization, and high availability.
1.2 Common AI Inference Frameworks
- TensorFlow Serving: Google's open-source serving framework for machine learning models
- Triton Inference Server: NVIDIA's open-source inference server, covered later in this article
This article walks through best practices for deploying AI inference services on Kubernetes: GPU resource management, deployment workflows for TensorFlow Serving and Triton Inference Server, model and service performance optimization (quantization, batching, autoscaling), monitoring and observability, and security policies. With properly configured resources and parameters, you can run high-performance, highly available AI inference services.
```bash
# Install the NVIDIA driver (run on each GPU node)
apt-get install -y nvidia-driver-535

# Install the NVIDIA Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify that GPU resources are visible to the scheduler
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```
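The same inventory can be scripted. A minimal Python sketch (the `gpu_capacity` helper and the trimmed sample document are ours; the real input would come from `kubectl get nodes -o json`):

```python
import json

def gpu_capacity(nodes_json: str) -> dict:
    """Map node name -> nvidia.com/gpu capacity (0 if the node has no GPUs)."""
    nodes = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: int(item["status"]["capacity"].get("nvidia.com/gpu", "0"))
        for item in nodes["items"]
    }

# Sample output of `kubectl get nodes -o json`, trimmed to the fields we read.
sample = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"capacity": {"cpu": "16", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"capacity": {"cpu": "8"}}},
]})

print(gpu_capacity(sample))  # {'gpu-node-1': 1, 'cpu-node-1': 0}
```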
Deploying an Inference Service That Uses GPUs
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        # Use the GPU build of TensorFlow Serving when requesting GPUs
        image: tensorflow/serving:latest-gpu
        ports:
        - containerPort: 8501
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
```
```bash
# Download an example model (this archive contains a ResNet SavedModel; it is
# served under the model name "mnist" here only to match the manifests below)
mkdir -p models/mnist/1
wget -O resnet_savedmodel.tar.gz https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz
# The archive nests <name>/<version>/saved_model.pb; adjust --strip-components
# if its layout differs
tar -xzf resnet_savedmodel.tar.gz --strip-components=2 -C models/mnist/1
```
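TensorFlow Serving expects each model directory to contain numeric version subdirectories, each holding a `saved_model.pb` (plus a `variables/` directory). A small sketch that validates this layout before the files are copied onto the PVC (the `valid_savedmodel_layout` helper is ours):

```python
from pathlib import Path
import os
import tempfile

def valid_savedmodel_layout(model_dir: str) -> bool:
    """True if <model_dir>/<numeric version>/saved_model.pb exists for some version."""
    root = Path(model_dir)
    versions = [d for d in root.iterdir() if d.is_dir() and d.name.isdigit()]
    return any((v / "saved_model.pb").is_file() for v in versions)

# Build a fake layout to demonstrate (stands in for models/mnist).
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "1"))
open(os.path.join(tmp, "1", "saved_model.pb"), "w").close()
print(valid_savedmodel_layout(tmp))  # True
```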
```bash
# Create the model storage
kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
```
deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8500
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mnist
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
```
service.yaml
```yaml
apiVersion: v1
kind: Service
metadata:
  name: tf-serving
  namespace: default
  labels:
    app: tf-serving
spec:
  selector:
    app: tf-serving
  ports:
  # Naming the port lets a ServiceMonitor reference it by name
  - name: http
    port: 8501
    targetPort: 8501
  type: LoadBalancer
```
```bash
# Deploy the service
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Test the inference service (a 28x28 all-zero input; JSON cannot contain
# list comprehensions, so the matrix is generated with python3)
MODEL_SERVICE=$(kubectl get svc tf-serving -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST "http://$MODEL_SERVICE:8501/v1/models/mnist:predict" \
  -d "{\"instances\": [$(python3 -c 'print([[0.0]*28]*28)')]}"
```
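The same request can be issued from Python with only the standard library. A minimal client sketch (`predict` is an illustrative helper, and the host IP is a placeholder for the service address):

```python
import json
import urllib.request

def predict(host: str, model: str, instances) -> dict:
    """POST a TensorFlow Serving REST predict request, return the parsed response."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# A 28x28 all-zero image, matching the curl example above.
zeros = [[0.0] * 28 for _ in range(28)]
payload = json.dumps({"instances": [zeros]})
print(len(json.loads(payload)["instances"][0]))  # 28
# predict("203.0.113.10", "mnist", [zeros])  # requires a reachable service
```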
deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        # Point Triton at the mounted model repository
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
```
service.yaml
```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  namespace: default
spec:
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: LoadBalancer
```
```bash
# Deploy the service
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Check service status
kubectl get pods -l app=triton-server
```
Configuring Batching
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-batched
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving-batched
  template:
    metadata:
      labels:
        app: tf-serving-batched
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest-gpu
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: mnist
        - name: TF_FORCE_GPU_ALLOW_GROWTH
          value: "true"
        # TensorFlow Serving configures batching via flags, not a BATCH_SIZE
        # environment variable; max_batch_size lives in the parameters file
        args:
        - --enable_batching=true
        - --batching_parameters_file=/models/batching_parameters.txt
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
```
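Why batching raises throughput can be seen with a toy cost model: every batch pays a fixed overhead (kernel launch, I/O) once, so larger batches amortize it across more samples. The numbers below are made up for illustration:

```python
def throughput(batch_size: int, fixed_overhead_ms: float = 5.0,
               per_sample_ms: float = 1.0) -> float:
    """Requests/second under a linear latency model:
    batch latency = fixed overhead + per-sample cost * batch size."""
    batch_latency_ms = fixed_overhead_ms + per_sample_ms * batch_size
    return batch_size / batch_latency_ms * 1000.0

# Throughput grows with batch size, with diminishing returns.
for b in (1, 8, 32):
    print(b, round(throughput(b)))
```

The flip side is latency: a request may wait in the queue until the batch fills, which is why TensorFlow Serving's batching parameters also bound the batch timeout.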
HPA Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
Note that utilization-based metrics only work when the target Deployment's containers declare `resources.requests` for cpu and memory; without requests the HPA cannot compute utilization.
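The scaling decision follows the HPA formula documented by Kubernetes, `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetValue)`, clamped to the min/max bounds. A quick worked example:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 2, max_r: int = 10) -> int:
    """Kubernetes HPA scaling formula, clamped to minReplicas/maxReplicas."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# 2 replicas at 140% average CPU against a 70% target -> scale out to 4
print(desired_replicas(2, 140, 70))  # 4
# Load drops to 30%: scale-in is clamped at minReplicas
print(desired_replicas(4, 30, 70))   # 2
```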
Prometheus Configuration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tf-serving-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: tf-serving
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  # "port" must reference a Service port by name (assumed here to be "http"),
  # not by number
  - port: http
    path: /v1/monitoring/prometheus
    interval: 15s
```
TensorFlow Serving only exposes this Prometheus endpoint when started with a monitoring config (`--monitoring_config_file`).
Logging Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: default
spec:
  # ...
  template:
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        # ...
        env:
        - name: TF_CPP_MIN_LOG_LEVEL
          value: "0"
        - name: TF_ENABLE_GPU_GARBAGE_COLLECTION
          value: "true"
        args:
        - --model_name=mnist
        - --model_base_path=/models/mnist
        - --enable_batching=true
        - --batching_parameters_file=/models/batching_parameters.txt
```
Network Policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-inference-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: tf-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8501
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: monitoring
    ports:
    - protocol: TCP
      port: 9090
```
Multi-Model Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-multi-model
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-multi-model
  template:
    metadata:
      labels:
        app: triton-multi-model
    spec:
      containers:
      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        # One model repository can hold many models; Triton serves them all
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: models-pvc
```
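Triton discovers models from the repository layout `model-repository/<model>/<version>/...`, with an optional `config.pbtxt` per model. A small sketch that inventories a repository before mounting it (the `list_triton_models` helper is ours):

```python
from pathlib import Path
import os
import tempfile

def list_triton_models(repo: str) -> dict:
    """Map model name -> sorted numeric versions found under the repository."""
    out = {}
    for model_dir in Path(repo).iterdir():
        if model_dir.is_dir():
            out[model_dir.name] = sorted(
                int(d.name) for d in model_dir.iterdir()
                if d.is_dir() and d.name.isdigit())
    return out

# Fake repository with two models, standing in for the models-pvc contents.
repo = tempfile.mkdtemp()
for name, version in (("resnet", 1), ("bert", 3)):
    os.makedirs(os.path.join(repo, name, str(version)))
print(list_triton_models(repo))  # {'resnet': [1], 'bert': [3]} (order may vary)
```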
A/B Testing Configuration
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-inference-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"
spec:
  rules:
  - host: inference.example.com
    http:
      paths:
      - path: /v1/models
        pathType: Prefix
        backend:
          service:
            name: tf-serving-v2
            port:
              number: 8501
```
The canary annotations only take effect alongside a primary (non-canary) Ingress for the same host and path that routes to the stable service; with `canary-weight: "20"`, roughly 20% of requests reach `tf-serving-v2`.
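The effect of the canary weight can be pictured with a deterministic stand-in for the router (real NGINX canary routing is probabilistic per request; the `route` helper below is ours):

```python
def route(request_id: int, canary_weight: int = 20) -> str:
    """Deterministic sketch of weighted canary routing: send
    canary_weight percent of request ids to the canary backend."""
    return "tf-serving-v2" if request_id % 100 < canary_weight else "tf-serving"

counts = {"tf-serving": 0, "tf-serving-v2": 0}
for i in range(1000):
    counts[route(i)] += 1
print(counts)  # {'tf-serving': 800, 'tf-serving-v2': 200}
```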
```bash
# Inspect GPU usage inside a pod
kubectl exec -it <pod-name> -- nvidia-smi

# View inference service logs
kubectl logs -l app=tf-serving

# Check model status
curl http://<service-ip>:8501/v1/models/mnist

# Test the inference service (28x28 all-zero input, generated with python3)
curl -X POST "http://<service-ip>:8501/v1/models/mnist:predict" \
  -d "{\"instances\": [$(python3 -c 'print([[0.0]*28]*28)')]}"
```
Kubernetes provides powerful deployment and management capabilities for AI inference services. With properly configured GPU resources and well-tuned model and service parameters, you can build high-performance, reliable AI inference services.
Key takeaways:
- Expose GPUs through the NVIDIA Device Plugin and request them explicitly in pod specs
- Enable server-side batching and autoscaling (HPA) to balance latency and throughput
- Wire up Prometheus monitoring and logging so inference behavior stays observable
- Restrict traffic with NetworkPolicies and roll out new model versions gradually via canary/A-B routing
Following these best practices lets you take full advantage of Kubernetes to run more efficient and reliable AI inference services.
