Hybrid Cloud Architecture: K8s Automated Deployment, Monitoring, and Operations in Practice | 极客日志
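Summary (AI-generated): This article walks through building a hybrid cloud K8s automated deployment platform that combines local virtualization with the Alibaba Cloud public cloud. It covers environment planning and virtual machine setup, installation of the Kubernetes cluster and containerd, hybrid cloud networking over WireGuard, the CI/CD pipeline (GitLab, Jenkins, ArgoCD), the monitoring stack (Prometheus, Grafana, Alertmanager), and log service integration (Alibaba Cloud SLS). It provides command-line walkthroughs, sample configuration files, and troubleshooting notes, with the goal of an operations setup that is securely isolated, automatically delivered, elastic, stable, and observable.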
苹果系统 · Published 2026/3/30 · Updated 2026/5/24
Cloud-Native Hybrid Architecture: K8s Automated Deployment Platform
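This project builds a hybrid cloud-native K8s automated deployment platform on 'local virtualization + Alibaba Cloud public cloud'. The core goal is an operations setup that is securely isolated, automatically delivered, elastic, stable, and observable, covering the full path from base environment setup through cloud-native cluster deployment, service delivery, and hybrid cloud networking.
1 Environment Setup
The goal of this stage is to use virtualization to create a 4-node local cluster (1 master node + 3 worker nodes) as the base environment for later cloud-native testing and CI/CD component deployment.
1.1 Environment Planning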
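| Node role | CPU | Memory | Disk | IP plan (bridged mode) |
| --- | --- | --- | --- | --- |
| master node (master) | 2 cores | 8G | 50G | 192.168.0.200 |
| node1 (node1) | 2 cores | 8G | 50G | 192.168.0.201 |
| node2 (node2) | 2 cores | 8G | 50G | 192.168.0.202 |
| node3 (node3) | 2 cores | 8G | 50G | 192.168.0.203 |
| Alibaba Cloud ECS (Jenkins) | 2 cores | 4G | 40G | Elastic public IP |
| Alibaba Cloud ECS (GitLab) | 2 cores | 8G | 40G | Elastic public IP |
| Alibaba Cloud ACR container registry | - | - | - | - |
| Alibaba Cloud SLS log service | - | - | - | - |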
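1.2 Technology Stack Overview
Virtualization layer: VMware Workstation 17 Pro, Ubuntu 22.04;
Cloud-native core: Kubernetes 1.32.10, containerd 1.7.18, Calico CNI;
Public cloud services (Alibaba Cloud): ECS, SLS, ACR;
CI/CD chain: GitLab, Jenkins, ArgoCD;
Monitoring stack: Prometheus, Grafana, Alertmanager.
1.3 Virtual Machine Creation and OS Installation
Open VMware, create a new virtual machine, choose 'Custom (advanced)' mode, and keep the default hardware compatibility;
Select the Ubuntu image (22.04.5) and set the VM name and storage path;
Configure CPU and memory as planned, and choose 'bridged mode' for networking (so the VM can reach the internet; bridging also makes later connectivity with the ECS network easier);
For the disk, choose 'Create a new virtual disk' with 50G capacity and check 'Split virtual disk into multiple files';
Start the VM and install Ubuntu: set the root password (Root@123456 on all nodes, simplified for the test environment), choose automatic partitioning, and reboot once installation finishes;
Clone the VM: right-click the finished master VM, choose 'Clone', create full clones, and name them node1, node2, and node3 to avoid reinstalling the OS;
Modify the network configuration on each node (the example below is for the master node; use 192.168.0.201 to 192.168.0.203 on the worker nodes):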
vim /etc/netplan/50-cloud-init.yaml
network:
  ethernets:
    ens32:
      dhcp4: no
      addresses: [192.168.0.200/24]
      routes:
        - to: default
          via: 192.168.0.1
      nameservers:
        addresses: [223.5.5.5, 114.114.114.114]
  version: 2
netplan apply
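Since every node needs the same file with a different address, the netplan file can also be generated. The following is a minimal sketch, assuming the interface name ens32, the gateway 192.168.0.1, and the DNS servers used in this article; the function name render_netplan is illustrative and not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: print a static-IP netplan config for one node.
# Assumptions (taken from this article): /24 network, gateway 192.168.0.1,
# DNS 223.5.5.5 and 114.114.114.114. render_netplan is a hypothetical helper.
render_netplan() {
  local ip="$1"               # node address, e.g. 192.168.0.201
  local iface="${2:-ens32}"   # interface name, defaults to ens32
  cat <<EOF
network:
  version: 2
  ethernets:
    ${iface}:
      dhcp4: no
      addresses: [${ip}/24]
      routes:
        - to: default
          via: 192.168.0.1
      nameservers:
        addresses: [223.5.5.5, 114.114.114.114]
EOF
}

# Example (as root on node1):
#   render_netplan 192.168.0.201 > /etc/netplan/50-cloud-init.yaml && netplan apply
```

The function only prints to standard output, so the result can be reviewed before it is written to /etc/netplan.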
Disable SELinux on every node (not enabled by default on Ubuntu)
Disable the firewall (not enabled by default on Ubuntu)
Permanently disable the swap partition
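The article does not list the commands for disabling swap. A minimal sketch follows, assuming swap is declared in /etc/fstab (Ubuntu 22.04 typically uses a /swap.img entry); the helper name disable_swap_in_fstab is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: comment out every active swap entry in an fstab-style file.
# disable_swap_in_fstab is a hypothetical helper; it defaults to /etc/fstab.
disable_swap_in_fstab() {
  local fstab="${1:-/etc/fstab}"
  # Prefix '#' to non-comment lines that contain a whitespace-delimited "swap" field.
  sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
}

# Typical usage on each node (as root):
#   swapoff -a               # turn swap off for the running system
#   disable_swap_in_fstab    # keep it off after a reboot
```

kubelet refuses to start with swap enabled under its default settings, which is why both the runtime and the fstab change are needed.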
# Set the hostname; run the matching command on its own node
hostnamectl set-hostname master
hostnamectl set-hostname node1
hostnamectl set-hostname node2
hostnamectl set-hostname node3
vim /etc/hosts
192.168.0.200 master
192.168.0.201 node1
192.168.0.202 node2
192.168.0.203 node3
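Editing /etc/hosts by hand on every node is easy to get wrong. As a rough sketch, assuming the node addresses from the plan in 1.1, an idempotent helper could append an entry only when the hostname is not mapped yet; ensure_hosts_entry is an illustrative name, not part of the original article.

```shell
#!/usr/bin/env bash
# Sketch: add "ip name" to a hosts file unless the hostname is already mapped.
# ensure_hosts_entry is a hypothetical helper; it does not update an existing
# entry that points at a different address.
ensure_hosts_entry() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}"
  if grep -qE "^[0-9.]+[[:space:]]+${name}([[:space:]]|\$)" "$file"; then
    return 0   # hostname already present, nothing to do
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example (as root on each node):
#   ensure_hosts_entry 192.168.0.200 master
#   ensure_hosts_entry 192.168.0.201 node1
#   ensure_hosts_entry 192.168.0.202 node2
#   ensure_hosts_entry 192.168.0.203 node3
```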
vim ~/.bashrc
export http_proxy="http://[proxy IP]:7890"
export https_proxy="http://[proxy IP]:7890"
export no_proxy="192.168.0.0/24,localhost,127.0.0.1,10.96.0.0/12,10.20.0.0/16,cluster.local,.svc,.svc.cluster.local,192.168.0.200"
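2 Cloud-Native Core Layer Deployment (Local K8s Cluster)
The goal of this stage is to build a cloud-native cluster based on K8s 1.32.10 and containerd 1.7.18.
2.1 Deploy containerd 1.7.18 (K8s cluster)
2.1.1 Prerequisites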
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
apt update && apt upgrade -y
apt install -y ca-certificates curl gnupg lsb-release apt-transport-https software-properties-common
modprobe overlay
modprobe br_netfilter
cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF
cat > /etc/sysctl.d/99-containerd.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
2.1.2 Add the official Docker apt repository
lsb_release -cs
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
2.1.3 Refresh the package index and install the pinned version
apt update
apt install -y containerd.io=1.7.18-1
2.1.4 Switch to the systemd cgroup driver
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sed -i 's/registry.k8s.io\/pause/registry.aliyuncs.com\/google_containers\/pause/g' /etc/containerd/config.toml
systemctl restart containerd && systemctl enable containerd
containerd could not pull images. Troubleshooting:
root@master:/# crictl pull docker.io/library/busybox:alpine
WARN[0000] Config "/etc/crictl.yaml" does not exist, trying next: "/usr/bin/crictl.yaml"
WARN[0000] Image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1202 00:06:16.457412 16804 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/library/busybox:alpine\": failed to resolve reference \"docker.io/library/busybox:alpine\": failed to do request: Head \"https://registry-1.docker.io/v2/library/busybox/manifests/alpine\": dial tcp 54.89.135.129:443: connect: connection refused" image="docker.io/library/busybox:alpine"
FATA[0020] pulling image: failed to pull and unpack image "docker.io/library/busybox:alpine": failed to resolve reference "docker.io/library/busybox:alpine": failed to do request: Head "https://registry-1.docker.io/v2/library/busybox/manifests/alpine": dial tcp 54.89.135.129:443: connect: connection refused
vim /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
pull-image-on-create: false
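# Caution: regenerating the default config overwrites the SystemdCgroup and pause-image edits from 2.1.4; re-run those two sed commands afterwards.
containerd config default > /etc/containerd/config.toml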
mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://[proxy IP]:7890"
Environment="HTTPS_PROXY=http://[proxy IP]:7890"
Environment="NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,172.16.0.0/12,*.local,kubernetes.default,service,*.cluster.local,192.168.0.200,192.168.0.*,crpi-2pnpj68s945gixnz.cn-shenzhen.personal.cr.aliyuncs.com"
EOF
systemctl daemon-reload
systemctl restart containerd
root@master:~# crictl pull docker.io/library/nginx:latest
Image is up to date for sha256:d4918ca7576a537caa7b0c043051c8efc1796de33fee8724ee0fff4a1cabed9
2.1.5 Install the nerdctl tool
nerdctl is compatible with the Docker CLI syntax. Note that containerd separates images and containers by namespace.
curl -L https://github.com/containerd/nerdctl/releases/download/v1.7.0/nerdctl-1.7.0-linux-amd64.tar.gz -o nerdctl.tar.gz
sudo tar Cxzvf /usr/local/bin nerdctl.tar.gz nerdctl
nerdctl version
# Images pulled for Kubernetes live in the k8s.io namespace
nerdctl -n <namespace> images
nerdctl -n <namespace> rm <container>
nerdctl -n <namespace> (images/rm/tag/rmi/stop/pull/push)
2.2 Deploy the K8s cluster
2.2.1 Install kubelet, kubeadm, and kubectl 1.32.10
Run on every server:
sudo tee /etc/modules-load.d/k8s.conf <<EOF
overlay
br_netfilter
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
sudo modprobe overlay && sudo modprobe br_netfilter && sudo modprobe ip_vs
sudo tee /etc/sysctl.d/k8s.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
apt-get update
apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
2.2.2 Initialize the K8s cluster with kubeadm
sed -i 's#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.8"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"#g' /etc/containerd/config.toml
root@master:~# vim kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.32.10
imageRepository: registry.aliyuncs.com/google_containers
networking:
  podSubnet: 10.20.0.0/16
controlPlaneEndpoint: "192.168.0.200:6443"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  ignorePreflightErrors:
    - SystemVerification
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    - name: cgroup-driver
      value: "systemd"
localAPIEndpoint:
  advertiseAddress: 192.168.0.200
  bindPort: 6443
root@master:~# kubeadm init --config=kubeadm-config.yaml
kubeadm join 192.168.0.200:6443 --token zwf3h4.qcy63iq2avjnflvt --discovery-token-ca-cert-hash sha256:7de5455af5d69939dfb49379f85d7f4f96e9a7962920569a8f29b4ca3079d21e
2.2.3 Configure kubectl access
root@master:~# mkdir -p $HOME/.kube
root@master:~# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
root@master:~# sudo chown $(id -u):$(id -g) $HOME/.kube/config
2.2.4 Join the worker nodes
Run the join command printed by kubeadm init on each worker node:
kubeadm join 192.168.0.200:6443 --token zwf3h4.qcy63iq2avjnflvt --discovery-token-ca-cert-hash sha256:7de5455af5d69939dfb49379f85d7f4f96e9a7962920569a8f29b4ca3079d21e
2.2.5 Label the worker nodes with the work role
root@master:~# kubectl label node node1 node-role.kubernetes.io/work=work
root@master:~# kubectl label node node2 node-role.kubernetes.io/work=work
root@master:~# kubectl label node node3 node-role.kubernetes.io/work=work
2.2.6 Install the Kubernetes network plugin: Calico
root@master:~# curl -O https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/calico.yaml
root@master:~# kubectl apply -f calico.yaml
kubectl get pods -n kube-system -w
2.2.7 Command completion
apt update && apt install -y bash-completion
echo "source <(kubectl completion bash)" >> ~/.bashrc
source ~/.bashrc
2.3 Environment check
systemctl status containerd
kubectl get nodes
ping oss-cn-hangzhou.aliyuncs.com
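2.4 Deploy the core microservice (run on the master node)
The service is deployed with native K8s resources and uses a PVC for data persistence. Before deployment, build the service image and push it to Alibaba Cloud ACR, as follows:
2.4.1 Prerequisites: confirm the base environment and resources
Environment: containerd is installed (already deployed in the local K8s environment), and the local machine can reach Alibaba Cloud ACR (network reachable, no firewall restrictions);
Resources: (1) the product service source code (including the pom.xml/mvn configuration used for compilation); this article substitutes the official nginx image for it; (2) an Alibaba Cloud account (ACR enabled, with namespace permissions); (3) Alibaba Cloud credentials configured locally (or enter the username and password when logging in to ACR in a later step).
2.4.2 Step 1: Enable the Alibaba Cloud ACR service
Open the ACR console
Enter the Personal Edition instance and create a namespace
Create a private image repository (local repository)
2.4.3 Step 2: Build the image and push it to the ACR registry (master node)
This section uses the official nginx image as a stand-in for the product service image, simulating a microservice in a real production environment.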
root@master:~# crictl pull nginx:latest
Image is up to date for sha256:058f4935d1cbc026f046e4c7f6ef3b1d778170ac61f293709a2fc89b1cff7009
root@master:~# crictl images
IMAGE TAG IMAGE ID SIZE
docker.io/calico/cni v3.30.0 15f996c472622 71.8MB
docker.io/calico/node v3.30.0 d12dae9bc0999 156MB
docker.io/library/nginx latest 058f4935d1cbc 59.8MB
registry.aliyuncs.com/google_containers/coredns v1.11.3 c69fa2e9cbf5f 18.6MB
registry.aliyuncs.com/google_containers/etcd 3.5.24-0 8cb12dd0c3e4 23.7MB
registry.aliyuncs.com/google_containers/kube-apiserver v1.32.10 77f8b0de97da9 29.1MB
registry.aliyuncs.com/google_containers/kube-controller-manager v1.32.10 34e0beef266f 26.6MB
registry.aliyuncs.com/google_containers/kube-proxy v1.32.10 db4bcdca85a39 31.2MB
registry.aliyuncs.com/google_containers/kube-scheduler v1.32.10 fd6f6aae834c2 21.1MB
registry.aliyuncs.com/google_containers/pause 3.10 873ed75102791 320kB
registry.aliyuncs.com/google_containers/pause 3.8 4873874c08ef 311kB
root@master:~# nerdctl login --username=[your username] [your ACR registry address]
Enter Password:
WARNING: Your password will be stored unencrypted in /root/.docker/config.json. Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
root@master:~# nerdctl -n k8s.io tag docker.io/library/nginx:latest [your ACR registry address]/product-service-test/product-service:v1
root@master:~# nerdctl -n k8s.io images
REPOSITORY TAG IMAGE ID CREATED PLATFORM SIZE BLOB SIZE
[your ACR registry address]/product-service-test/product-service v1 ca871a86d45a 9 seconds ago linux/amd64 157.5 MiB 57.0 MiB
root@master:~# nerdctl -n k8s.io push [your ACR registry address]/product-service-test/product-service:v1
INFO[0000] pushing as a reduced-platform image (application/vnd.oci.image.index.v1+json, sha256:32502741bf9dbc4ad2c22e24f46c001506711f5bb7d674ac043aaa3242326ef3)
index-sha256:32502741bf9dbc4ad2c22e24f46c001506711f5bb7d674ac043aaa3242326ef3: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:8c39d819008c669731d333c44c766c1d9de3492beb03f8fc035bb5ef7081000: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:058f4935d1cbc026f046e4c7f6ef3b1d778170ac61f293709a2fc89b1cff7009: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 1.3 s
2.4.4 Step 3: Deploy the service (from the ACR image)
1. ConfigMap (Nginx home page)
root@master:~/yaml/product-service# vim product-service-welcome-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: welcome-nginx-cm
  namespace: product
data:
  index.html: |
    <!DOCTYPE html><html><head><title>Welcome</title></head><body><h1>v1</h1></body></html>
2. Create a StorageClass for dynamic volume provisioning
root@master:~# mkdir yaml
root@master:~# cd yaml
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.24/deploy/local-path-storage.yaml
root@master:~/yaml# vim sc-local-path.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  pathPattern: "/var/lib/local-path-provisioner"
root@master:~/yaml# kubectl apply -f sc-local-path.yaml
root@master:~/yaml# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer true 68m
3. Create the PVC (persistent storage)
root@master:~/yaml# kubectl create ns product
root@master:~/yaml# vim product-service-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: product-service-pvc
  namespace: product
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
root@master:~/yaml# kubectl apply -f product-service-pvc.yaml
root@master:~/yaml# kubectl get pvc -n product
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
product-service-pvc Bound pvc-f9f2916d-98ba-4435-aa80-ffcfb342cd6a 10Gi RWO local-path <unset> 69m
root@master:~/yaml# kubectl get pv -n product
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-f9f2916d-98ba-4435-aa80-ffcfb342cd6a 10Gi RWO Delete Bound product/product-service-pvc local-path <unset> 68m
4. Deploy the service (Deployment)
kubectl create secret docker-registry acr-pull-secret \
  --namespace=product \
  --docker-server=[your ACR registry address] \
  --docker-username=[your username] \
  --docker-password='[your password]'
root@master:~/yaml# vim product-service-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-service
  namespace: product
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: product-service
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - node2
      containers:
        - name: product-service
          image: [your ACR registry address]/product-service-test/product-service:v1
          ports:
            - containerPort: 80
          volumeMounts:
            - name: welcome-page
              mountPath: /usr/share/nginx/html/index.html
              subPath: index.html
            - name: product-data
              mountPath: /data
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 200m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
      volumes:
        - name: welcome-page
          configMap:
            name: welcome-nginx-cm
            items:
              - key: index.html
                path: index.html
        - name: product-data
          persistentVolumeClaim:
            claimName: product-service-pvc
root@master:~/yaml# kubectl apply -f product-service-deploy.yaml
deployment.apps/product-service configured
root@master:~/yaml# kubectl get pod -n product
NAME READY STATUS RESTARTS AGE
product-service-65dff7d8d4-b8lc7 1/1 Running 0 6s
product-service-65dff7d8d4-czc7w 1/1 Running 0 4s
product-service-65dff7d8d4-gcpsp 1/1 Running 0 5s
5. Create Services to expose the service port
root@master:~/yaml# vim product-service-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: product-service-nodeport
  namespace: product
  labels:
    app: product-service
spec:
  type: NodePort
  selector:
    app: product-service
  ports:
    - name: http
      port: 80
      targetPort: 80
      nodePort: 30080
      protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: product-service
  namespace: product
spec:
  selector:
    app: product-service
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP
root@master:~/yaml# kubectl apply -f product-service-svc.yaml
service/product-service created
root@master:~/yaml/product-service# kubectl get svc -n product
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
product-service-nodeport NodePort 10.107.131.224 <none> 80:30080/TCP 17h
Open 192.168.0.201:30080 in a browser
6. Create an HPA for autoscaling
root@master:~/yaml# wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
root@master:~/yaml# vim components.yaml
# In the metrics-server Deployment, add --kubelet-insecure-tls to the container args
spec:
  containers:
  - args:
    - --kubelet-insecure-tls
    - --cert-dir=/tmp
    - --secure-port=10250
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
root@master:~/yaml# kubectl apply -f components.yaml
root@master:~/yaml# vim product-service-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: product-service-hpa
  namespace: product
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 30
          periodSeconds: 60
root@master:~/yaml# kubectl apply -f product-service-hpa.yaml
horizontalpodautoscaler.autoscaling/product-service-hpa created
root@master:~/yaml# kubectl get hpa -n product
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
product-service-hpa Deployment/product-service cpu: 0%/50% 2 10 3 97s
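To get a feel for how this HPA reacts, the core scaling rule documented for Kubernetes, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), can be worked through by hand. The sketch below is a simplification: it applies only that formula plus the minReplicas/maxReplicas clamp and ignores the controller's tolerance band and the scaleUp/scaleDown rate policies. The function name hpa_desired_replicas is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: HPA core formula with min/max clamping (integer arithmetic).
# hpa_desired_replicas is a hypothetical helper; defaults match this HPA (min 2, max 10).
hpa_desired_replicas() {
  local current="$1" util="$2" target="$3" min="${4:-2}" max="${5:-10}"
  # ceil(current * util / target) without floating point
  local desired=$(( (current * util + target - 1) / target ))
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}

hpa_desired_replicas 3 100 50   # 3 Pods at 100% CPU, target 50% -> 6
hpa_desired_replicas 3 0 50     # idle Pods -> clamped to minReplicas (2)
hpa_desired_replicas 8 100 50   # 16 would be needed -> clamped to maxReplicas (10)
```

In the real controller the scaleUp policy above (50% per 60 seconds) would spread a jump from 3 to 6 Pods over more than one period.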
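3 CI/CD Pipeline Setup
Linking the cloud ECS instances with the local Kubernetes cluster requires network connectivity. A WireGuard VPN connects the local VMware environment to the Alibaba Cloud VPC so that the local K8s cluster can reach Alibaba Cloud resources such as RDS and OSS.
3.1 Hybrid cloud network connectivity
Server: an Alibaba Cloud ECS instance (publicly reachable) running WireGuard as the VPN server.
Client: the master node of the local K8s cluster, running the WireGuard client and joining the VPN network.
Core goal:
Two-way connectivity between the local K8s nodes and ECS;
3.1.1 Alibaba Cloud ECS configuration
Create the ECS instance:
Operating system: Ubuntu 22.04
Network: VPC (10.0.0.0/16), vSwitch (10.0.10.0/24), and an elastic public IP
Security group inbound rules:
| Service | Protocol | Source | Destination port |
| --- | --- | --- | --- |
| WireGuard listen port | UDP | Local machine IP | 51820 |
| GitLab | TCP | Local machine IP + VPC CIDR + Jenkins public IP | 443 |
| GitLab | TCP | Local machine IP + VPC CIDR + Jenkins public IP | 80 |
| GitLab SSH port | TCP | All IPs | 2222 |
| Jenkins | TCP | Local machine IP + VPC CIDR + GitLab public IP | 8080 |
| Jenkins | TCP | Local machine IP + VPC CIDR | 50000 |
3.1.2 WireGuard installation and configuration
(1) Install WireGuard on the Alibaba Cloud ECS instance and on the local node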
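# Common WireGuard commands, for reference once wg0.conf exists
wg-quick up wg0                         # bring the tunnel up
wg-quick down wg0                       # bring the tunnel down
wg-quick down wg0 && wg-quick up wg0    # restart the tunnel
wg show                                 # status of all interfaces
wg show wg0                             # status of wg0
wg show wg0 dump                        # machine-readable status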
apt update
apt install wireguard -y
mkdir -p /etc/wireguard
cd /etc/wireguard
sudo wg genkey | sudo tee private.key | sudo wg pubkey > public.key
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# cat private.key
YG9CkSAnVIy4F8hIiE6ugma5xcgDiT5bMqqTRcy0M2M=
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# cat public.key
k5FafPFqLcQG6MhkIrHy8U2fg5bhN/VgDpXqmiVgwls=
# Server side (Alibaba Cloud ECS)
vim /etc/wireguard/wg0.conf
[Interface]
Address = 10.255.255.1/24
ListenPort = 51820
PrivateKey = YJUSqwLfS/VZWsC8qBXPxdIiilsRBUnbZszPtrKoN0A=
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostUp = ip6tables -A FORWARD -i wg0 -j ACCEPT
PostUp = ip6tables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
PostDown = ip6tables -D FORWARD -i wg0 -j ACCEPT
PostDown = ip6tables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
PublicKey = 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
AllowedIPs = 10.255.255.2/32, 192.168.0.0/24
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
root@jenkins:/etc/wireguard# wg show wg0
interface: wg0
public key: fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
private key: (hidden)
listening port: 51820
peer: 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
allowed ips: 10.255.255.2/32, 192.168.0.0/24
# Client side (local master node)
vim /etc/wireguard/wg0.conf
[Interface]
Address = 10.255.255.2/24
PrivateKey = iHhpTPwdNSl4cCYCPmOGyUDU46gcAtuNlsRn1QqTOVg=
PostUp = sysctl -w net.ipv4.ip_forward=1
[Peer]
PublicKey = fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
AllowedIPs = 10.255.255.1/32, 10.0.10.0/24
Endpoint = [ECS public IP]:51820
PersistentKeepalive = 25
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
root@master:/etc/wireguard# wg show
interface: wg0
public key: 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
private key: (hidden)
listening port: 37352
peer: fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
endpoint: [ECS public IP]:51820
allowed ips: 10.255.255.1/32, 10.0.10.0/24
latest handshake: 4 seconds ago
transfer: 92 B received, 180 B sent
persistent keepalive: every 25 seconds
# Replace ens32 with the actual interface name
iptables -A FORWARD -i wg0 -o ens32 -j ACCEPT
iptables -A FORWARD -i ens32 -o wg0 -j ACCEPT
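Problem encountered: after starting WireGuard with systemctl start wg-quick@wg0, the calico-node-t4r7h Pod on the master node stopped running.
Stopping the wg0 interface restored Calico. The Pod network (10.20.0.0/16, for example the block 10.20.219.64/26) does not overlap with the VPC range 10.0.0.0/16 or with the ranges routed through wg0, so a plain route conflict is unlikely. A more plausible cause, consistent with the fix that follows, is that once wg0 existed Calico's IP autodetection could select the wg0 address instead of the cluster-facing ens32 address, so BGP traffic used the wrong interface and the BGP sessions failed.
Solution: edit the Calico DaemonSet and force it to use the cluster-facing ens32 interface (not wg0) for BGP:
kubectl edit ds calico-node -n kube-system
- name: IP_AUTODETECTION_METHOD
  value: "interface=ens32"
- name: CALICO_NETWORK_INTERFACE
  value: "ens32"
Finally, start WireGuard again with systemctl start wg-quick@wg0 and check the Calico Pods; the problem is resolved.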
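When diagnosing a suspected overlap between VPN routes and Pod or VPC ranges, it helps to check whether a given address actually falls inside a CIDR block. The following is a minimal pure-bash sketch for IPv4; the function names ip_to_int and ip_in_cidr are illustrative and not part of any tool used in this article.

```shell
#!/usr/bin/env bash
# Sketch: test whether an IPv4 address belongs to a CIDR block.
# ip_to_int and ip_in_cidr are hypothetical helper names.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  # Exit status 0 when both addresses share the same network prefix.
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

# The ECS private address is inside the range routed through wg0:
ip_in_cidr 10.0.10.45 10.0.10.0/24 && echo "10.0.10.45 is in 10.0.10.0/24"
# The Calico block address is inside the Pod CIDR but not inside the VPC CIDR:
ip_in_cidr 10.20.219.64 10.20.0.0/16 && echo "10.20.219.64 is in 10.20.0.0/16"
ip_in_cidr 10.20.219.64 10.0.0.0/16 || echo "10.20.219.64 is not in 10.0.0.0/16"
```

On a live node, `ip route get <address>` shows which interface the kernel would actually pick for that destination.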
3.1.4 Connectivity test
root@master:~# ping 10.255.255.1
PING 10.255.255.1 (10.255.255.1) 56(84) bytes of data.
64 bytes from 10.255.255.1: icmp_seq=1 ttl=64 time=20.8 ms
64 bytes from 10.255.255.1: icmp_seq=2 ttl=64 time=22.3 ms
64 bytes from 10.255.255.1: icmp_seq=3 ttl=64 time=21.3 ms
^C
--- 10.255.255.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.815/21.464/22.275/0.606 ms
root@master:~# ping 10.0.10.45
PING 10.0.10.45 (10.0.10.45) 56(84) bytes of data.
64 bytes from 10.0.10.45: icmp_seq=1 ttl=64 time=22.2 ms
64 bytes from 10.0.10.45: icmp_seq=2 ttl=64 time=20.9 ms
64 bytes from 10.0.10.45: icmp_seq=3 ttl=64 time=21.3 ms
^C
--- 10.0.10.45 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.913/21.498/22.240/0.552 ms
root@master:~# telnet 10.0.10.45 22
Trying 10.0.10.45...
Connected to 10.0.10.45.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.13
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# ping 10.255.255.2
PING 10.255.255.2 (10.255.255.2) 56(84) bytes of data.
64 bytes from 10.255.255.2: icmp_seq=1 ttl=64 time=20.9 ms
64 bytes from 10.255.255.2: icmp_seq=2 ttl=64 time=21.4 ms
64 bytes from 10.255.255.2: icmp_seq=3 ttl=64 time=21.0 ms
^C
--- 10.255.255.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.873/21.103/21.424/0.233 ms
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# ping 192.168.121.100
PING 192.168.121.100 (192.168.121.100) 56(84) bytes of data.
64 bytes from 192.168.121.100: icmp_seq=1 ttl=64 time=21.0 ms
64 bytes from 192.168.121.100: icmp_seq=2 ttl=64 time=20.8 ms
64 bytes from 192.168.121.100: icmp_seq=3 ttl=64 time=20.5 ms
^C
--- 192.168.121.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 20.541/20.778/20.956/0.174 ms
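3.2 Deploy GitLab (Alibaba Cloud ECS)
3.2.1 Create an ECS instance on Alibaba Cloud
2 cores, 8 GB (per the plan in 1.1)
Bind an elastic public IP
Install Docker Community Edition
Ubuntu 22.04, the same as the local environment
Rename the instance, copy the public IP, and connect remotely
Configure the security group to allow access from the local machine IP and the Jenkins server's public IP
3.2.2 Deploy GitLab with docker-compose
root@iZwz90hzjc4m3pd9ick3miZ:~# hostnamectl set-hostname gitlab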
root@iZwz90hzjc4m3pd9ick3miZ:~# su
root@gitlab:~# apt install -y docker-compose
root@gitlab:~# docker-compose --version
docker-compose version 1.25.0, build unknown
root@gitlab:~# mkdir -p /data/gitlab/{config,data,logs}
root@gitlab:~# chmod -R 777 /data/gitlab
root@gitlab:~# vim docker-compose.yml
version: '3'
services:
  gitlab:
    image: gitlab/gitlab-ce:14.3.6-ce.0
    container_name: gitlab
    privileged: true
    restart: always
    ports:
      - "80:80"
      - "443:443"
      - "2222:22"
    volumes:
      - /data/gitlab/config:/etc/gitlab
      - /data/gitlab/data:/var/opt/gitlab
      - /data/gitlab/logs:/var/log/gitlab
    environment:
      - TZ=Asia/Shanghai
      - GITLAB_OMNIBUS_CONFIG=external_url 'http://[ECS public IP]'; gitlab_rails['gitlab_shell_ssh_port']=2222;
root@gitlab:~# docker-compose up -d
Creating network "root_default" with the default driver
Pulling gitlab (gitlab/gitlab-ce:latest)...
latest: Pulling from gitlab/gitlab-ce
7b1a6ab2e44d: Pull complete
6c37b8f20a77: Pull complete
f509191f201: Pull complete
bb6bfd7806: Pull complete
2c03ae5f5fcd: Pull complete
8311111743: Pull complete
499fee924bc: Pull complete
6667fb304: Pull complete
Digest: sha256:5a0b03f09ab2f2634ecc6bfeb41521d19329cf4c9bbf330227117c048e75163
Status: Downloaded newer image for gitlab/gitlab-ce:latest
Creating gitlab ... done
root@gitlab:~# docker-compose logs -f gitlab
root@gitlab:~# docker exec -it gitlab /bin/bash
root@4c054babda87:/# cat /etc/gitlab/initial_root_password
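Password: jqV6Dmlo+kbke3pLVFP0PTV2ttWiFPnDq54uX4WQ0Hc=
# NOTE: This file will be automatically deleted in the first reconfigure run after 24 hours.
Open a browser and go to http://[ECS public IP];
Enter the username root, paste the initial password above, and click Sign in;
Change the password to a new strong one.
If GitLab cannot be reached, check whether the security group allows the required ports; if not, add rules allowing ports 80 and 443.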
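Because the initial password file is removed after the first reconfigure following 24 hours, it can be convenient to extract just the password in one step. A minimal sketch follows, assuming the file keeps the `Password: <value>` line format shown above; the function name gitlab_initial_password is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: read an initial_root_password file from stdin and print only the password.
# gitlab_initial_password is a hypothetical helper name.
gitlab_initial_password() {
  awk '/^Password:/ { print $2; exit }'
}

# Example on the GitLab host (the file lives in the mounted config volume):
#   gitlab_initial_password < /data/gitlab/config/initial_root_password
# or through the container:
#   docker exec gitlab cat /etc/gitlab/initial_root_password | gitlab_initial_password
```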
3.2.3 Initialize GitLab
3.2.4 Configure passwordless SSH authentication for the master node
root@master:~/gitlab/e-commerce-platform# cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC2/jHMzETQsYS0+IkoKsZGDvqF3mmjEMYS1hjGfJnMin+mPRKH0quZll/4RuHFky3sbn3WSDonCcgvXP0TWUZTvCe9CGvlnU+zkkCMuOwCRqXNb/pXeAjzOCDBkUX+vXYHrhkmtNPylS8JDAuOdr+6qnIKG8GBjRVFmu7tl6+NFgjgpEGbgTE6vowWK+J3zKx6iN7FCKx+oMcWdEvcOy/WNnYWq7uCfQQgXerONTKHTJ6I9z6x/MMHnCTszSAYHSr7D9HV9un0k9tnoV5cSTA0tuDmFzNWX288v702DWDxgDJeaJLSeQTAAu6lm93GAdNC77QpI7IPDcZ/NkO3/AQoE5yIdCX8ApE7hobNQVL/24+8n+EmzfYsP+IWK/SWf7WZV4BR7v1QTz2M7HqPiYNR5rxOniCAhJ4dwnoS4LjeYMknGoB4SBqPcnpoUZT9q1iYf02JunKgCpAHSdNJ4IfbdiKYeO6IlCPL78xjvEAfOuqwSjOgUbiH70OXWfrJKmj5j/4J4crWm7cApCcevx6dzqo072rQtZLLoOZSBf114EkjCglE5W0hlnh6/sivBt/Yq0iNMAGVBsexJ8c8n5+saKuY+T1SU5JQiIeoISgVG/Ssv1913RRravFj5Fme3A8UnyYri0/4k3PYGu7QBBTytFmuim3sBYaQIzmqpRBLbw== root@master
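3.3 Deploy Jenkins (Alibaba Cloud ECS)
3.3.1 Create an ECS instance on Alibaba Cloud
2 cores, 4 GB
Assign a public IP
Install Docker Community Edition
Ubuntu 22.04, the same as the local environment
3.3.2 Deploy Jenkins with docker-compose
root@iZwz9749p6a8r7y1673ypyZ:~# hostnamectl set-hostname jenkins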
root@iZwz9749p6a8r7y1673ypyZ:~# su
root@jenkins:~# apt install -y docker-compose
root@jenkins:~# mkdir -p /opt/jenkins/data
root@jenkins:~# chown -R 1000:1000 /opt/jenkins/data
root@jenkins:~# chmod -R 755 /opt/jenkins/data
root@jenkins:~# cd /opt/jenkins
root@jenkins:/opt/jenkins# vim docker-compose.yml
version: '2.2'
services:
  jenkins:
    image: jenkins/jenkins:2.528.2
    container_name: jenkins
    restart: always
    privileged: true
    user: root
    ports:
      - "8080:8080"
      - "50000:50000"
    volumes:
      - ./data:/var/jenkins_home
      - /var/run/docker.sock:/var/run/docker.sock
      - /usr/bin/docker:/usr/bin/docker
      - /usr/local/bin/docker-compose:/usr/local/bin/docker-compose
    environment:
      - TZ=Asia/Shanghai
root@jenkins:/opt/jenkins# docker-compose up -d
Creating network "jenkins_default" with the default driver
Pulling jenkins (jenkins/jenkins:2.528.2)...
2.528.2: Pulling from jenkins/jenkins
13cc3f8244a: Pull complete
dc27f462ea: Pull complete
33300af18dd0: Pull complete
c2759c6dffa: Pull complete
e4beac6dffa: Pull complete
a37b858bb47: Pull complete
744b792e083: Pull complete
05d79a8b608: Pull complete
8d27b2b2b2: Pull complete
65e4ba86bc: Pull complete
5dc073277a: Pull complete
7718ff1022: Pull complete
Digest: sha256:7b1c378278279c8688efd6168c25a1c2723a6bd6f0420beb5ccefabee3cc3bb1
Status: Downloaded newer image for jenkins/jenkins:2.528.2
Creating jenkins ... done
root@jenkins:/opt/jenkins# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6e126cdd99b jenkins/jenkins:2.528.2 "/sbin/tini -- /usr/…" 2 seconds ago Up 2 seconds 0.0.0.0:8080->8080/tcp, [::]:8080->8080/tcp, 0.0.0.0:50000->50000/tcp, [::]:50000->50000/tcp jenkins
3.3.3 Jenkins initial configuration
Open http://[ECS public IP]:8080 in a browser (the security group must allow port 8080 for web access and port 50000 for agents).
root@jenkins:/opt/jenkins# docker exec -it jenkins cat /var/jenkins_home/secrets/initialAdminPassword
de747fc1faa540cabfcd937c36e71ac6
If some plugins fail to install, click 'Retry' or install them manually later in Jenkins plugin management.
root@jenkins:/opt/jenkins# mv plugins.tar data/
root@jenkins:/opt/jenkins# cd data/
root@jenkins:/opt/jenkins/data/# tar -xvf plugins.tar
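Create a user
Configure the instance URL
Install additional plugins: GitLab Plugin, Kubernetes Plugin, Nexus Plugin;
Configure the GitLab connection: in Jenkins, go to Manage Jenkins → System → GitLab, add the GitLab server, and enter the GitLab address and an Access Token (created under GitLab user settings → Access Tokens);
If the connection fails, check whether the ECS security group allows the Jenkins server's public IP to reach the GitLab server's public IP on port 80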
3.4 Deploy Argo CD (local K8s cluster)
3.4.1 Install ArgoCD
root@master:~/yaml# kubectl create namespace argocd
root@master:~/yaml# mkdir argocd/ && cd argocd/
root@master:~/yaml/argocd# kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.8.3/manifests/install.yaml
root@master:~/yaml/argocd# kubectl patch svc argocd-server -n argocd -p '{"spec":{"type":"NodePort"}}'
root@master:~/yaml/argocd# kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d && echo
eyd7NOqAVLDGak1o
root@master:~/yaml/argocd# kubectl get svc -n argocd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-applicationset-controller ClusterIP 10.106.160.96 <none> 7000/TCP,8080/TCP 27h
argocd-dex-server ClusterIP 10.107.111.20 <none> 5556/TCP,5557/TCP,5558/TCP 27h
argocd-metrics ClusterIP 10.97.249.73 <none> 8082/TCP 27h
argocd-notifications-controller-metrics ClusterIP 10.110.61.50 <none> 9001/TCP 27h
argocd-redis ClusterIP 10.105.69.236 <none> 6379/TCP 27h
argocd-repo-server ClusterIP 10.99.240.50 <none> 8081/TCP,8084/TCP 27h
argocd-server NodePort 10.108.75.197 <none> 80:31375/TCP,443:32324/TCP 27h
argocd-server-metrics ClusterIP 10.110.198.250 <none> 8083/TCP 27h
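Access the ArgoCD UI at http://<local K8s node IP>:<NodePort of argocd-server> and log in as admin with the initial password.
3.4.2 Configure ArgoCD access to GitLab
In the ArgoCD UI: Settings → Repositories → Connect Repo
Repository URL: the GitLab repository address (http://[GitLab public IP]/root/e-commerce-platform.git)
Type: Git
Authentication: Username + Password → use the GitLab account root as the username and the previously generated personal access token as the password
3.5 Jenkins pipeline configuration
3.5.1 Clone the GitLab repository with git
First clone the GitLab repository locally, then write and commit the code files.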
root@master:~# mkdir gitlab
root@master:~# cd gitlab
root@master:~/gitlab# git config --global user.name "[your username]"
root@master:~/gitlab# git config --global user.email "[your email]"
root@master:~/gitlab# git config --global color.ui true
root@master:~/gitlab# git config --list
root@master:~/gitlab# git init
root@master:~/gitlab# git clone ssh://git@[GitLab public IP]:2222/root/e-commerce-platform.git
Cloning into 'e-commerce-platform'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 17 (delta 2), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (17/17), done.
Resolving deltas: 100% (2/2), done.
root@master:~/gitlab# ls e-commerce-platform
root@master:~/gitlab# cd e-commerce-platform/
3.5.2 编写 Jenkinsfile(存放在 GitLab 的仓库根目录) root@master:~/gitlab/e-commerce-platform# vim Jenkinsfile
pipeline {
    agent any
    environment {
        ACR_REGISTRY = "[your Alibaba Cloud registry address]/product-service-test"
        APP_NAME = "product-service"
        GITLAB_REPO_URL = "http://[GitLab public IP]/root/e-commerce-platform.git"
        GITLAB_REPO_HOST = "[GitLab public IP]/root/e-commerce-platform.git"
        GIT_CRED_ID = "Gitlab-token-Secret" // credential of type "Secret text"
        ACR_CRED_ID = "acr-cred"
        MANIFEST_FILE = "product-service-deploy.yaml"
        MANIFEST_CLONE_DIR = "e-commerce-platform-manifests"
        VERSION_FILE = "version.txt"
    }
    options {
        timeout(time: 30, unit: 'MINUTES')
        retry(1)
        skipDefaultCheckout(false)
        disableConcurrentBuilds()
    }
    stages {
        stage('Check Skip Conditions') {
            steps {
                script {
                    // Only detect commits made by Jenkins; changes to version.txt are no longer checked
                    def commitMessage = sh(script: 'git log -1 --pretty=%B || echo ""', returnStdout: true).trim()
                    def commitAuthor = sh(script: 'git log -1 --pretty=%an || echo ""', returnStdout: true).trim()
                    // Skip the build when the last commit came from Jenkins
                    if (commitMessage.contains('[Jenkins]') || commitMessage.contains('[ci skip]') || commitAuthor == 'jenkins-bot') {
                        echo "===== Jenkins commit detected, skipping build ====="
                        currentBuild.result = 'SUCCESS'
                        env.SKIP_BUILD = 'true'
                        return
                    }
                    // User commit: continue the build (even if version.txt was modified)
                    echo "===== User commit, continuing build ====="
                }
            }
        }
        stage('Get Version') {
            when {
                expression { env.SKIP_BUILD != 'true' }
            }
            steps {
                script {
                    // Read the version from the source tree (set manually by the user)
                    env.NEXT_VERSION = sh(script: 'cat ${VERSION_FILE} 2>/dev/null || echo "v0"', returnStdout: true).trim()
                    // An empty file yields an empty string, so check for that as well as the v0 fallback
                    if (!env.NEXT_VERSION || env.NEXT_VERSION == 'v0') {
                        error "version.txt is missing or empty; create and commit it first"
                    }
                    echo "Using manually specified version: ${env.NEXT_VERSION}"
                }
            }
        }
        stage('Build Docker Image') {
            when {
                expression { env.SKIP_BUILD != 'true' }
            }
            steps {
                echo "===== Building image: ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION} ====="
                sh """
                    if [ ! -f Dockerfile ]; then
                        echo 'Error: Dockerfile not found'
                        exit 1
                    fi
                    docker build --no-cache -t ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION} .
                """
            }
        }
        stage('Push to ACR') {
            when {
                expression { env.SKIP_BUILD != 'true' }
            }
            steps {
                echo "===== Pushing image to ACR ====="
                withCredentials([usernamePassword(credentialsId: "${ACR_CRED_ID}", passwordVariable: 'ACR_PWD', usernameVariable: 'ACR_USER')]) {
                    // \$ACR_PWD and \$ACR_USER are expanded by the shell, not by Groovy, so the secret stays out of the script text
                    sh """
                        echo "\$ACR_PWD" | docker login --username "\$ACR_USER" --password-stdin ${ACR_REGISTRY.split('/')[0]}
                        docker push ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}
                        docker logout ${ACR_REGISTRY.split('/')[0]}
                    """
                }
            }
        }
        stage('Update K8s Manifest') {
            when {
                expression { env.SKIP_BUILD != 'true' }
            }
            steps {
                echo "===== Updating K8s manifest ====="
                withCredentials([string(credentialsId: "${GIT_CRED_ID}", variable: 'GITLAB_TOKEN')]) {
                    script {
                        sh "rm -rf ${MANIFEST_CLONE_DIR} 2>/dev/null || true"
                        sh """
                            git clone http://oauth2:\${GITLAB_TOKEN}@${GITLAB_REPO_HOST} ${MANIFEST_CLONE_DIR} || { echo 'Failed to clone repository'; exit 1; }
                        """
                        dir("${MANIFEST_CLONE_DIR}") {
                            // Point the manifest at the new image tag
                            sh """
                                sed -i.bak 's|image: .*${APP_NAME}:.*|image: ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}|g' ${MANIFEST_FILE}
                                rm -f ${MANIFEST_FILE}.bak
                            """
                            // Optional: keep version.txt in the manifest repository in sync
                            sh """
                                echo "${NEXT_VERSION}" > ${VERSION_FILE}
                            """
                            // Commit to the manifest repository
                            sh """
                                git config user.email "[your email]"
                                git config user.name "jenkins-bot"
                                if git status --porcelain | grep -q .; then
                                    git add ${MANIFEST_FILE} ${VERSION_FILE}
                                    git commit -m "[Jenkins] Update ${APP_NAME} to ${NEXT_VERSION} [ci skip]"
                                    git push origin main
                                    echo "Pushed changes to the manifest repository"
                                else
                                    echo "No changes, skipping commit"
                                fi
                            """
                        }
                    }
                }
            }
        }
    }
    post {
        always {
            echo "===== Cleaning up ====="
            sh "rm -rf ${MANIFEST_CLONE_DIR} || true"
            script {
                if (env.NEXT_VERSION) {
                    sh "docker rmi ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION} 2>/dev/null || true"
                }
            }
        }
        success {
            script {
                if (env.SKIP_BUILD != 'true') {
                    echo "Pipeline succeeded. Image: ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}"
                } else {
                    echo "Build skipped (automated commit)"
                }
            }
        }
        failure {
            echo "Pipeline failed. Check the configuration."
        }
    }
}
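The image-tag rewrite in the 'Update K8s Manifest' stage can be tried outside Jenkins. The following is a minimal sketch, assuming a throwaway manifest; the registry host registry.example.com and the helper name update_image_tag are placeholders for illustration, not values from the pipeline.

```bash
#!/bin/sh
# Sketch of the sed-based image rewrite used in the 'Update K8s Manifest' stage.
# update_image_tag and registry.example.com are hypothetical names for this demo.
update_image_tag() {
    # $1 = manifest file, $2 = registry/namespace, $3 = app name, $4 = new version
    sed -i.bak "s|image: .*$3:.*|image: $2/$3:$4|g" "$1"
    rm -f "$1.bak"
}

# Build a small stand-in manifest in a temporary file.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
spec:
  containers:
    - name: product-service
      image: registry.example.com/product-service-test/product-service:v1
EOF

update_image_tag "$manifest" "registry.example.com/product-service-test" "product-service" "v2"
cat "$manifest"   # the image line now ends in :v2
rm -f "$manifest"
```

Because the pattern starts at `image:`, the leading indentation of the line is preserved, which keeps the YAML structure intact.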
3.5.3 Write the Dockerfile
root@master:~/gitlab/e-commerce-platform# vim Dockerfile
FROM docker.io/library/nginx:latest
3.5.4 Prepare the K8s deployment manifests and commit them to the repository
The repository should contain the following files:
product-service-deploy.yaml
product-service-welcome-cm.yaml
Dockerfile
Jenkinsfile
root@master:~/gitlab/e-commerce-platform# cp /root/yaml/product-service/product-service-deploy.yaml ./
root@master:~/gitlab/e-commerce-platform# cp /root/yaml/product-service/product-service-welcome-cm.yaml ./
root@master:~/gitlab/e-commerce-platform# ls
Dockerfile Jenkinsfile product-service-deploy.yaml product-service-welcome-cm.yaml
root@master:~/gitlab/e-commerce-platform# git add ./
root@master:~/gitlab/e-commerce-platform# git commit -m "v1"
[main 3429cff] v1
 4 files changed, 156 insertions(+)
 create mode 100644 Dockerfile
 create mode 100644 Jenkinsfile
 create mode 100644 product-service-deploy.yaml
 create mode 100644 product-service-welcome-cm.yaml
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 2.51 KiB | 2.51 MiB/s, done.
Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
To ssh://[GitLab public IP]:2222/root/e-commerce-platform.git
   62ecaae..3429cff  main -> main
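3.6 Configure a GitLab webhook to trigger Jenkins
3.6.1 Create the pipeline job in Jenkins
New Item → choose "Pipeline" → name it product-service-ci.
Pipeline → for Definition choose "Pipeline script from SCM" → SCM: Git → enter the app-code repository URL and select the gitlab-token credential (it must be of the username/password type, otherwise it is not listed) → Branch: main → Script Path: Jenkinsfile → Save.
3.6.2 Configure the GitLab webhook
In the GitLab app-code repository, go to Settings → Webhooks → Add webhook:
URL: the Jenkins trigger address, in the form http://<username>:<token>@[Jenkins public IP]:8080/project/<job name>
Trigger: check "Push events"
Click "Add webhook", then test it (click "Test" → choose "Push events") and verify that Jenkins starts a build.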
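The webhook URL is easy to get wrong because the credentials, host, port and job name are packed into one string. A minimal sketch of assembling it follows; the helper name build_webhook_url and all sample values (user, token, the documentation address 203.0.113.10) are placeholders, not values from this setup.

```bash
#!/bin/sh
# Sketch: assemble the Jenkins trigger URL described above.
# build_webhook_url is a hypothetical helper; the arguments in the call are sample values.
build_webhook_url() {
    # $1 = Jenkins user, $2 = Jenkins API token, $3 = Jenkins host or IP, $4 = job name
    printf 'http://%s:%s@%s:8080/project/%s\n' "$1" "$2" "$3" "$4"
}

build_webhook_url "admin" "example-token" "203.0.113.10" "product-service-ci"
# prints http://admin:example-token@203.0.113.10:8080/project/product-service-ci
```

The `@` between the token and the host is what separates the credentials from the address; without it the URL does not parse as intended.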
3.6.3 Test the Jenkins automatic build
root@master:~/gitlab/e-commerce-platform# vim product-service-deploy.yaml
image: [your Alibaba Cloud registry address]/product-service-test/product-service:v2
root@master:~/gitlab/e-commerce-platform# git add .
root@master:~/gitlab/e-commerce-platform# git commit -m "test automatic build, changed the version number"
[main 84df6a9] test automatic build, changed the version number
 1 file changed, 1 insertion(+), 1 deletion(-)
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 334 bytes | 334.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
To ssh://[GitLab public IP]:2222/root/e-commerce-platform.git
   8e98ca7..84df6a9  main -> main
The version number increments automatically with each build.
The image is pushed to the ACR registry successfully.
Started by GitLab push by Administrator
Obtained Jenkinsfile from git http://[GitLab public IP]/root/e-commerce-platform.git
[Pipeline] Start of Pipeline
[Pipeline] node
Running on Jenkins in /var/jenkins_home/workspace/product-service-ci
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Declarative: Checkout SCM)
[Pipeline] checkout
The recommended git tool is: NONE
using credential Gitlab-token-us
 > git rev-parse --resolve-git-dir /var/jenkins_home/workspace/product-service-ci/.git
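Since the pipeline itself pushes a commit back to GitLab, every build would otherwise trigger the webhook again. The skip check from the 'Check Skip Conditions' stage can be sketched as a plain shell function; should_skip_build is a hypothetical name, and the messages passed in are sample inputs.

```bash
#!/bin/sh
# Sketch of the loop-prevention check: skip when the last commit was made by Jenkins.
# should_skip_build is a hypothetical helper mirroring the Groovy condition.
should_skip_build() {
    # $1 = last commit message, $2 = last commit author
    case "$1" in
        *"[Jenkins]"*|*"[ci skip]"*) return 0 ;;
    esac
    [ "$2" = "jenkins-bot" ]
}

if should_skip_build "[Jenkins] Update product-service to v2 [ci skip]" "jenkins-bot"; then
    echo "skip"
else
    echo "build"
fi

if should_skip_build "v2" "root"; then
    echo "skip"
else
    echo "build"
fi
```

The first call prints skip and the second prints build. Quoting the patterns inside `case` matters: unquoted, `[Jenkins]` would be read as a bracket expression rather than literal text.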
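3.7 Configure ArgoCD automatic sync
1. In the ArgoCD UI, click New App and fill in:
Application Name: my-demo-app
Project: default
Sync Policy: check "Automatic" (automatic sync), "Prune Resources" and "Self Heal"
Source:
Repository URL: address of the GitLab k8s-manifests repository
Revision: main
Path: ./ (the path that holds the deployment manifests)
Destination:
Cluster URL: https://kubernetes.default.svc (the local K8s cluster)
Namespace: default
2. Click Create. ArgoCD then syncs the deployment manifests to the local K8s cluster automatically.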
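The same settings can also be expressed declaratively as an ArgoCD Application resource. The sketch below only prints such a manifest; it assumes ArgoCD runs in the argocd namespace (the usual default, not stated in this article), and the function name render_argocd_app and the repository URL in the call are placeholders.

```bash
#!/bin/sh
# Sketch: render an Application manifest equivalent to the UI settings above.
# Assumptions: ArgoCD lives in the "argocd" namespace; the repo URL is a placeholder.
render_argocd_app() {
    # $1 = Git URL of the manifest repository
    cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-demo-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: $1
    targetRevision: main
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}

render_argocd_app "http://gitlab.example.com/root/e-commerce-platform.git"
```

Piping the output to `kubectl apply -f -` would create the application; keeping the manifest in Git is one way to make the ArgoCD setup itself reproducible.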
3.8 End-to-end test
3.8.1 Trigger the CI/CD flow
root@master:~/gitlab/e-commerce-platform# vim product-service-welcome-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: welcome-nginx-cm
  namespace: product
data:
  # change the heading to v2
  index.html: |
    <!DOCTYPE html>
    <html>
    <head><title>Welcome</title></head>
    <body><h1>v2</h1></body>
    </html>
Update the version number:
root@master:~/gitlab/e-commerce-platform# vim version.txt
v2
root@master:~/gitlab/e-commerce-platform# git add .
root@master:~/gitlab/e-commerce-platform# git commit -m "v2"
[main 2c5da2e] v2
 2 files changed, 2 insertions(+), 2 deletions(-)
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 344 bytes | 344.00 KiB/s, done.
Total 4 (delta 2), reused 2 (delta 1), pack-reused 0
To ssh://[GitLab public IP]:2222/root/e-commerce-platform.git
   f8f2f1b..2c5da2e  main -> main
After the commit is pushed, Jenkins triggers an automatic build.
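The build depends on version.txt being present and non-empty, so it can help to check the file locally before pushing. A minimal sketch of the same check the 'Get Version' stage performs; read_version is a hypothetical helper name.

```bash
#!/bin/sh
# Sketch: validate version.txt the way the 'Get Version' stage expects it.
# read_version is a hypothetical helper, not part of the pipeline.
read_version() {
    # $1 = path to the version file; prints the version or fails with status 1
    if [ ! -f "$1" ]; then
        echo "version file '$1' does not exist" >&2
        return 1
    fi
    # Use the first line and strip all whitespace.
    version="$(head -n 1 "$1" | tr -d '[:space:]')"
    if [ -z "$version" ]; then
        echo "version file '$1' is empty" >&2
        return 1
    fi
    printf '%s\n' "$version"
}

workdir="$(mktemp -d)"
echo "v2" > "$workdir/version.txt"
read_version "$workdir/version.txt"                 # prints v2
read_version "$workdir/missing.txt" || echo "no version, the build should stop"
rm -rf "$workdir"
```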
4 Monitoring stack setup (local deployment)
4.1 Prerequisites
4.1.1 Create a namespace in Alibaba Cloud ACR
4.1.2 Pull the images and push them to the corresponding repositories
Pushing to a path made of the monitoring_k8s namespace followed by a repository name creates that repository automatically.
root@master:~# ctr images pull docker.io/prom/node-exporter:v1.8.1
root@master:~# nerdctl tag prom/node-exporter:v1.8.1 [your Alibaba Cloud registry address]/monitoring_k8s/node-exporter:v1.8.1
root@master:~# nerdctl push [your Alibaba Cloud registry address]/monitoring_k8s/node-exporter:v1.8.1
root@master:~# ctr images pull docker.io/prom/prometheus:v2.53.1
root@master:~# nerdctl tag prom/prometheus:v2.53.1 [your Alibaba Cloud registry address]/monitoring_k8s/prometheus:v2.53.1
root@master:~# nerdctl push [your Alibaba Cloud registry address]/monitoring_k8s/prometheus:v2.53.1
root@master:~# ctr images pull docker.io/grafana/grafana:11.2.0
root@master:~# nerdctl tag grafana/grafana:11.2.0 [your Alibaba Cloud registry address]/monitoring_k8s/grafana:11.2.0
root@master:~# nerdctl push [your Alibaba Cloud registry address]/monitoring_k8s/grafana:11.2.0
root@master:~# ctr images pull docker.io/prom/blackbox-exporter:v0.24.0
root@master:~# nerdctl tag prom/blackbox-exporter:v0.24.0 [your Alibaba Cloud registry address]/monitoring_k8s/blackbox-exporter:v0.24.0
root@master:~# nerdctl push [your Alibaba Cloud registry address]/monitoring_k8s/blackbox-exporter:v0.24.0
root@master:~/yaml/monitoring# ctr images pull docker.io/prom/alertmanager:v0.26.0
root@master:~/yaml/monitoring# nerdctl tag prom/alertmanager:v0.26.0 [your Alibaba Cloud registry address]/monitoring_k8s/alertmanager:v0.26.0
root@master:~/yaml/monitoring# nerdctl push [your Alibaba Cloud registry address]/monitoring_k8s/alertmanager:v0.26.0
root@master:~/yaml/filebeat# ctr images pull docker.io/elastic/filebeat:8.11.0
root@master:~/yaml/filebeat# nerdctl tag elastic/filebeat:8.11.0 [your Alibaba Cloud registry address]/logging_k8s/filebeat:8.11.0
root@master:~/yaml/filebeat# nerdctl push [your Alibaba Cloud registry address]/logging_k8s/filebeat:8.11.0
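The pull, tag and push steps repeat the same pattern for every image, so they can be generated from a list. The sketch below is a dry run that only prints the commands; mirror_commands is a hypothetical helper and registry.example.com stands in for the real ACR address.

```bash
#!/bin/sh
# Sketch: print the pull/tag/push commands for one source image (dry run only).
# mirror_commands and registry.example.com are placeholders for illustration.
mirror_commands() {
    # $1 = target registry, $2 = target namespace, $3 = source image (e.g. prom/prometheus:v2.53.1)
    name_tag="${3##*/}"          # strip the org prefix, leaving e.g. prometheus:v2.53.1
    target="$1/$2/$name_tag"
    echo "ctr images pull docker.io/$3"
    echo "nerdctl tag $3 $target"
    echo "nerdctl push $target"
}

for image in prom/node-exporter:v1.8.1 prom/prometheus:v2.53.1 grafana/grafana:11.2.0 \
             prom/blackbox-exporter:v0.24.0 prom/alertmanager:v0.26.0; do
    mirror_commands "registry.example.com" "monitoring_k8s" "$image"
done
```

Printing first and executing later (for example by piping the reviewed output to `sh`) keeps a typo in the registry address from pushing images to the wrong place.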
4.1.4 Create the monitoring namespace
root@master:~# kubectl create ns monitoring
4.2 Deploy node-exporter
root@master:~/yaml# ls
product-service secret
root@master:~/yaml# mkdir monitoring
root@master:~/yaml# kubectl create secret docker-registry acr-pull-secret \
  --namespace=monitoring \
  --docker-server=[your Alibaba Cloud registry address] \
  --docker-username=[your username] \
  --docker-password='[your password]'
secret/acr-pull-secret created
root@master:~/yaml/monitoring# vim node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      hostNetwork: true
      hostPID: true
      imagePullSecrets:
        - name: acr-pull-secret
      containers:
        - name: node-exporter
          image: [your Alibaba Cloud registry address]/monitoring_k8s/node-exporter:v1.8.1
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)
          securityContext:
            privileged: true
          volumeMounts:
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
  type: ClusterIP
root@master:~/yaml/monitoring# kubectl apply -f node-exporter.yaml
root@master:~/yaml# kubectl get pods -n monitoring -l app=node-exporter -o wide
NAME                  READY   STATUS    RESTARTS   AGE    IP              NODE     NOMINATED NODE   READINESS GATES
node-exporter-7kcrl   1/1     Running   0          2m3s   192.168.0.200   master   <none>           <none>
node-exporter-gknxb   1/1     Running   0          2m3s   192.168.0.202   node2    <none>           <none>
node-exporter-p99j6   1/1     Running   0          2m4s   192.168.0.203   node3    <none>           <none>
node-exporter-q5m95   1/1     Running   0          2m3s   192.168.0.201   node1    <none>           <none>
4.3 Deploy Prometheus
4.3.1 Configure RBAC permissions for Prometheus
root@master:~/yaml/monitoring# vim prometheus-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: default
    namespace: monitoring
4.3.2 Configure the Prometheus scrape rules
root@master:~/yaml/monitoring# vim prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    # ========== Global settings (defaults for all scrape jobs) ==========
    global:
      # Scrape interval: collect metrics from every target every 15 seconds (a job can override it)
      scrape_interval: 15s
      # Rule evaluation interval: evaluate alerting and recording rules every 15 seconds
      evaluation_interval: 15s
    # ========== Scrape configs (all monitored targets) ==========
    scrape_configs:
      # 1. Prometheus's own runtime metrics (monitoring the monitoring system itself)
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']  # Prometheus's own metrics port (9090 by default)
      # 2. Node Exporter metrics from the K8s nodes (discovered automatically)
      - job_name: 'k8s-node-exporter'
        # K8s service discovery: find targets through Endpoints
        kubernetes_sd_configs:
          - role: endpoints  # Endpoints role: the backend Pod endpoints of a Service
            namespaces:
              # Only discover Endpoints in the monitoring namespace (where Node Exporter runs)
              names: ['monitoring']
        # Relabeling: filter or rewrite target labels so that only the wanted targets remain
        relabel_configs:
          # Rule 1: keep only Endpoints whose Service carries the label app=node-exporter
          - source_labels: [__meta_kubernetes_service_label_app]  # source: the Service's app label
            regex: node-exporter  # value must be node-exporter
            action: keep          # keep matching targets, drop the rest
          # Rule 2: keep only the endpoint port named metrics (the Node Exporter port name)
          - source_labels: [__meta_kubernetes_endpoint_port_name]  # source: the endpoint port name
            regex: metrics        # value must be metrics
            action: keep          # keep matching targets
      # 3. Blackbox Exporter metrics (page and API availability probing)
      - job_name: 'blackbox-exporter'
        # Metrics path: the Blackbox Exporter probe endpoint (/probe by default)
        metrics_path: /probe
        # Request parameter: use the http_2xx module (checks that the HTTP endpoint answers with a 2xx status)
        params:
          module: [http_2xx]
        # K8s service discovery: find the Blackbox Exporter Endpoints in the monitoring namespace
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names: ['monitoring']
        # Relabeling: adapt the targets to the Blackbox Exporter probe request
        relabel_configs:
          # Rule 1: keep only targets whose Service carries the label app=blackbox-exporter
          - source_labels: [__meta_kubernetes_service_label_app]
            regex: blackbox-exporter
            action: keep
          # Rule 2: pass the target address (__address__) as the target parameter of the probe request
          - source_labels: [__address__]
            target_label: __param_target
          # Rule 3: use the target parameter as the instance label (the name shown in the Prometheus UI)
          - source_labels: [__param_target]
            target_label: instance
          # Rule 4: rewrite the scrape address to the Blackbox Exporter Service (all probes go through it)
          - target_label: __address__
            replacement: blackbox-exporter.monitoring.svc:9115  # in-cluster address of the Blackbox Service
          # Rule 5: copy the instance label to a target label (handy for filtering in Grafana)
          - source_labels: [instance]
            regex: (.*)
            target_label: target
            replacement: ${1}
      # 4. K8s core component: APIServer metrics
      - job_name: 'kubernetes-apiservers'
        # K8s service discovery: discover Endpoints in all namespaces (the APIServer is in default)
        kubernetes_sd_configs:
          - role: endpoints
        # Protocol: the APIServer only serves HTTPS
        scheme: https
        # TLS: use the CA certificate of the ServiceAccount (mounted into the Pod by default)
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # Authentication: use the ServiceAccount token mounted into the Pod (authorized through RBAC)
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        # Relabeling: keep only the https port of the kubernetes Service in the default namespace
        relabel_configs:
          # Match: namespace=default, Service name=kubernetes, port name=https
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            regex: default;kubernetes;https
            action: keep  # keep only the APIServer Endpoints and drop unrelated targets
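The keep rules above work by joining the values of source_labels with a semicolon and matching the result against a fully anchored regex. A rough shell emulation of that behaviour is sketched below; relabel_keep is a hypothetical helper, and grep's extended regex syntax only approximates the RE2 syntax Prometheus uses.

```bash
#!/bin/sh
# Sketch: emulate a relabel "keep" decision for illustration purposes.
# relabel_keep is hypothetical; grep -E stands in for Prometheus's RE2 engine.
relabel_keep() {
    # $1 = regex from the rule, remaining arguments = values of the source labels
    regex="$1"; shift
    joined="$1"; shift
    for value in "$@"; do
        joined="$joined;$value"      # default separator is ";"
    done
    # The regex must match the whole joined string, not just a part of it.
    printf '%s\n' "$joined" | grep -Eq "^($regex)\$"
}

relabel_keep 'default;kubernetes;https' default kubernetes https \
    && echo "keep: APIServer endpoint"
relabel_keep 'default;kubernetes;https' monitoring node-exporter metrics \
    || echo "drop: unrelated endpoint"
relabel_keep 'node-exporter' node-exporter-extra \
    || echo "drop: the regex is anchored at both ends"
```

The last call shows why `regex: node-exporter` does not accidentally keep a Service labelled `node-exporter-extra`.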
4.3.3 Deploy the Prometheus Deployment and Service
root@master:~/yaml/monitoring# vim prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      containers:
        - name: prometheus
          image: [your Alibaba Cloud registry address]/monitoring_k8s/prometheus:v2.53.1
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --web.console.libraries=/usr/share/prometheus/console_libraries
            - --web.console.templates=/usr/share/prometheus/consoles
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
  type: NodePort
4.3.4 Apply all Prometheus configuration
root@master:~/yaml/monitoring# kubectl apply -f .
root@master:~/yaml# kubectl get pod -n monitoring -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
node-exporter-7kcrl           1/1     Running   0          11m   192.168.0.200   master   <none>           <none>
node-exporter-gknxb           1/1     Running   0          11m   192.168.0.202   node2    <none>           <none>
node-exporter-p99j6           1/1     Running   0          12m   192.168.0.203   node3    <none>           <none>
node-exporter-q5m95           1/1     Running   0          11m   192.168.0.201   node1    <none>           <none>
prometheus-68f95956cf-v5bh2   1/1     Running   0          32s   10.20.166.132   node1    <none>           <none>
4.3.5 Verify access
Open the node1 address in a browser: 192.168.0.201:30090. Because the Service is of type NodePort, the IP of any other node works as well.
4.4 Deploy Grafana
4.4.1 Deploy the Grafana Deployment and Service
root@master:~/yaml/monitoring# vim grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      containers:
        - name: grafana
          image: [your Alibaba Cloud registry address]/monitoring_k8s/grafana:11.2.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: "admin123"
            - name: GF_USERS_ALLOW_SIGN_UP
              value: "false"
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 200m
              memory: 256Mi
      volumes:
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30030
  type: NodePort
4.4.2 Apply the Grafana configuration
root@master:~/yaml/monitoring# kubectl apply -f grafana-deployment.yaml
root@master:~/yaml# kubectl get pods -n monitoring -o wide -l app=grafana
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
grafana-57596f6bcb-5lw47   1/1     Running   0          2m39s   10.20.135.4   node3   <none>           <none>
4.4.3 Verify access
Open 192.168.0.201:30030 in a browser.
4.4.4 Configure the Grafana data source
To switch the UI language to Chinese: on the home page, click the avatar in the upper-right corner → Profile → Language.
After logging in to Grafana, click Connections → Data sources → Add new data source in the left menu.
Select Prometheus and set the URL to http://prometheus.monitoring.svc:9090 (the in-cluster Service address).
4.4.5 Import Grafana dashboards
Click Dashboards in the left menu, then New → Import on the right.
Enter a dashboard ID and click Load:
Node status: 1860 (Node Exporter Full; node CPU, memory and disk);
K8s cluster: 7249 (Kubernetes Cluster Monitoring; cluster components).
4.5 Deploy Alertmanager
4.5.1 Write the Alertmanager core configuration file
The core configuration of Alertmanager is alertmanager.yml. It mainly contains the routing rules, receivers, notification channels, and inhibition and silencing rules.
root@master:~/yaml/monitoring# vim alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[your email]'
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_auth_username: '[your email]'
  smtp_auth_password: '[your SMTP password]'
  smtp_require_tls: true
route:
  receiver: 'chenjun'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
receivers:
  - name: 'chenjun'
    email_configs:
      - to: '[your email]'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
4.5.2 Store the configuration file as a ConfigMap
root@master:~/yaml/monitoring# kubectl create configmap alertmanager-config \
  --namespace=monitoring \
  --from-file=alertmanager.yml=./alertmanager.yml
4.5.3 Write the Alertmanager deployment manifest (Deployment + Service)
root@master:~/yaml/monitoring# vim alertmanager-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    app: alertmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      containers:
        - name: alertmanager
          image: [your Alibaba Cloud registry address]/monitoring_k8s/alertmanager:v0.26.0
          imagePullPolicy: IfNotPresent
          args:
            - --config.file=/etc/alertmanager/alertmanager.yml
            - --storage.path=/alertmanager
          volumeMounts:
            - name: alertmanager-config
              mountPath: /etc/alertmanager
            - name: alertmanager-storage
              mountPath: /alertmanager
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9093
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9093
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: alertmanager-config
          configMap:
            name: alertmanager-config
        - name: alertmanager-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: alertmanager
  ports:
    - name: web
      port: 9093
      targetPort: 9093
      nodePort: 30093
4.5.4 Deploy Alertmanager to the cluster
root@master:~/yaml/monitoring# kubectl apply -f alertmanager-deploy.yaml
root@master:~/yaml/monitoring# kubectl get pod -n monitoring
NAME                            READY   STATUS    RESTARTS   AGE
alertmanager-5757855787-6p69n   1/1     Running   0          58m
grafana-5d56cd8487-s5z22        1/1     Running   0          3h1m
node-exporter-7kcrl             1/1     Running   0          3h27m
node-exporter-gknxb             1/1     Running   0          3h27m
node-exporter-p99j6             1/1     Running   0          3h28m
node-exporter-q5m95             1/1     Running   0          3h27m
prometheus-6d756fcfff-4tc7h     1/1     Running   0          7m26s
root@master:~/yaml/monitoring# kubectl get svc -n monitoring
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
alertmanager    NodePort    10.100.74.196   <none>        9093:30093/TCP   61m
grafana         NodePort    10.98.127.132   <none>        3000:30030/TCP   3h1m
node-exporter   ClusterIP   10.107.29.187   <none>        9100/TCP         3h28m
prometheus      NodePort    10.96.26.78     <none>        9090:30090/TCP   3h16m
4.5.5 Access the Alertmanager web UI
Open 192.168.0.201:30093 in a browser.
If the Alertmanager page loads, the deployment succeeded.
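Besides opening each UI in a browser, the three NodePort services can be probed from the command line through their health endpoints. The sketch below only prints the curl commands instead of running them; health_url is a hypothetical helper, and 192.168.0.201 is the node1 address used throughout this article.

```bash
#!/bin/sh
# Sketch: build the health-check URL of each monitoring component (dry run).
# health_url is a hypothetical helper; ports are the NodePorts defined in the manifests above.
health_url() {
    # $1 = component name, $2 = node IP
    case "$1" in
        prometheus)   echo "http://$2:30090/-/healthy" ;;
        grafana)      echo "http://$2:30030/api/health" ;;
        alertmanager) echo "http://$2:30093/-/healthy" ;;
        *) echo "unknown component: $1" >&2; return 1 ;;
    esac
}

for component in prometheus grafana alertmanager; do
    url="$(health_url "$component" 192.168.0.201)"
    echo "curl -fsS --max-time 5 $url"
done
```

With `-f`, curl exits non-zero on an HTTP error status, so the printed commands can be dropped into a script that fails fast when a component is down.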
4.5.6 Connect Prometheus to Alertmanager
root@master:~/yaml/monitoring# vim prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    # ---------- add the following ----------
    alerting:
      alertmanagers:
        - static_configs:
            - targets:  # Service address of Alertmanager
                - alertmanager.monitoring.svc:9093
    rule_files:
      - "alert_rules.yml"
    # ---------- end ----------
    # scrape configs omitted here
  # Finally add the alert rules below, at the same level as prometheus.yml:
  alert_rules.yml: |
    groups:
      # 1. Node-level alerts (host resources)
      - name: node-resource-alerts
        rules:
          # 1.1 High node memory usage
          - alert: NodeHighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High node memory usage"
              description: "Memory usage on node {{ $labels.instance }} has been above 85% (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
          # 1.2 Critical node memory usage
          - alert: NodeCriticalMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 95
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Critical node memory usage"
              description: "Memory usage on node {{ $labels.instance }} has been above 95% (current value: {{ printf \"%.2f\" $value }}%) for 2 minutes; services may become unavailable."
          # 1.3 High node CPU usage
          - alert: NodeHighCPUUsage
            expr: 100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High node CPU usage"
              description: "CPU usage on node {{ $labels.instance }} has been above 80% (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
          # 1.4 High root filesystem usage
          - alert: NodeRootDiskHighUsage
            expr: 100 * (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High root filesystem usage"
              description: "Disk usage of / on node {{ $labels.instance }} has been above 85% (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
          # 1.5 High node disk IO utilization
          - alert: NodeHighDiskIO
            expr: 100 * rate(node_disk_io_time_seconds_total{device!~"loop.*|sr.*"}[5m]) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High node disk IO utilization"
              description: "IO utilization of disk {{ $labels.device }} on node {{ $labels.instance }} has been above 80% (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
          # 1.6 Node unreachable (Node Exporter down)
          - alert: NodeDown
            expr: up{job=~"k8s-node-exporter|harbor-node-exporter|lb-node-exporter"} == 0
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "Node monitoring lost"
              description: "Node Exporter on node {{ $labels.instance }} has been unreachable for more than 3 minutes; metrics cannot be collected."
      # 2. K8s Pod and container alerts
      # These rules rely on kube-state-metrics (kube_*) and cAdvisor (container_*) metrics,
      # which the scrape jobs shown in 4.3.2 do not collect; they stay silent until such targets are added.
      - name: k8s-pod-alerts
        rules:
          # 2.1 Pod restarting too often (3 or more restarts within 1 hour)
          - alert: PodRestartTooFrequent
            expr: increase(kube_pod_container_restarts_total[1h]) >= 3
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Pod restarting too often"
              description: "Container {{ $labels.container }} of Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted {{ $value }} times within 1 hour; the service may be unhealthy."
          # 2.2 Abnormal Pod phase (Pending/Failed/Unknown; kube-state-metrics has no Error phase)
          - alert: PodStatusAbnormal
            expr: kube_pod_status_phase{phase=~"Pending|Failed|Unknown"} == 1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Abnormal Pod phase"
              description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in phase {{ $labels.phase }} for 5 minutes."
          # 2.3 High container CPU usage (kube-state-metrics v2 exposes limits as kube_pod_container_resource_limits)
          - alert: ContainerHighCPUUsage
            expr: (sum by (namespace, pod, container)(rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace, pod, container)(kube_pod_container_resource_limits{resource="cpu", container!=""})) * 100 > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High container CPU usage"
              description: "CPU usage of container {{ $labels.container }} in Pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) has been above 80% of its limit (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
          # 2.4 High container memory usage
          - alert: ContainerHighMemoryUsage
            expr: (sum by (namespace, pod, container)(container_memory_usage_bytes{container!=""}) / sum by (namespace, pod, container)(kube_pod_container_resource_limits{resource="memory", container!=""})) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High container memory usage"
              description: "Memory usage of container {{ $labels.container }} in Pod {{ $labels.pod }} (namespace {{ $labels.namespace }}) has been above 85% of its limit (current value: {{ printf \"%.2f\" $value }}%) for 5 minutes."
      # 3. K8s core component alerts
      - name: k8s-component-alerts
        rules:
          # 3.1 High APIServer request latency (current APIServer versions expose apiserver_request_duration_seconds)
          - alert: K8sAPIServerHighRequestLatency
            expr: sum by (instance, verb)(rate(apiserver_request_duration_seconds_sum{verb!~"LIST|WATCH"}[5m])) / sum by (instance, verb)(rate(apiserver_request_duration_seconds_count{verb!~"LIST|WATCH"}[5m])) > 0.5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High K8s APIServer request latency"
              description: "Average latency of {{ $labels.verb }} requests on APIServer {{ $labels.instance }} has been above 500ms (current value: {{ printf \"%.3f\" $value }}s) for 5 minutes."
          # 3.2 High APIServer error rate
          - alert: K8sAPIServerHighErrorRate
            expr: sum by (instance)(rate(apiserver_request_total{code=~"5.."}[5m])) / sum by (instance)(rate(apiserver_request_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High K8s APIServer error rate"
              description: "The APIServer 5XX error rate has been above 5% (current value: {{ $value | humanizePercentage }}) for 5 minutes."
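To make the thresholds concrete, the memory rule can be worked through by hand. The sketch below repeats the arithmetic of the NodeHighMemoryUsage expression with awk; the helper names and the sample byte counts (8 GiB total, 1 GiB available) are made up for illustration.

```bash
#!/bin/sh
# Sketch: reproduce the memory-usage expression of the alert rules with sample numbers.
# mem_usage_pct and fires are hypothetical helpers; the byte counts are invented inputs.
mem_usage_pct() {
    # $1 = MemTotal in bytes, $2 = MemAvailable in bytes
    awk -v total="$1" -v avail="$2" 'BEGIN { printf "%.2f\n", (total - avail) / total * 100 }'
}

fires() {
    # $1 = value, $2 = threshold; succeeds when value > threshold
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

usage="$(mem_usage_pct 8589934592 1073741824)"   # 8 GiB total, 1 GiB available
echo "memory usage: ${usage}%"                   # prints memory usage: 87.50%
fires "$usage" 85 && echo "NodeHighMemoryUsage condition holds (it must persist for 5m to fire)"
fires "$usage" 95 || echo "NodeCriticalMemoryUsage condition does not hold"
```

With these inputs only the warning rule matches; if the critical rule also fired for the same alertname, cluster and service labels, the inhibit rule in alertmanager.yml would suppress the warning notification.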
Update the Prometheus ConfigMap and restart the Prometheus Pod:
root@master:~/yaml/monitoring# kubectl apply -f prometheus-config.yaml
root@master:~/yaml/monitoring# kubectl rollout restart deployment prometheus -n monitoring
root@master:~/yaml/monitoring# kubectl get pod -n monitoring
NAME                            READY   STATUS    RESTARTS   AGE
alertmanager-5757855787-6p69n   1/1     Running   0          64m
grafana-5d56cd8487-s5z22        1/1     Running   0          3h7m
node-exporter-7kcrl             1/1     Running   0          3h33m
node-exporter-gknxb             1/1     Running   0          3h33m
node-exporter-p99j6             1/1     Running   0          3h33m
node-exporter-q5m95             1/1     Running   0          3h33m
prometheus-6d756fcfff-4tc7h     1/1     Running   0          13m
Open the Prometheus web page and check that the rules are loaded.
4.5.7 Test email alerting
4.6 Alibaba Cloud SLS log service
4.6.1 Deploy LoongCollector (on the local cluster master)
root@master:~/yaml# mkdir logging
root@master:~/yaml# cd logging
root@master:~/yaml/logging# wget https://aliyun-observability-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/loongcollector/k8s-custom-pkg/3.0.12/loongcollector-custom-k8s-package.tgz; tar xvf loongcollector-custom-k8s-package.tgz; chmod 744 ./loongcollector-custom-k8s-package/k8s-custom-install.sh
root@master:~/yaml/logging/loongcollector-custom-k8s-package# vim loongcollector/values.yaml
projectName: "k8s-pod-logs"
region: "cn-shenzhen"
aliUid: "[your Alibaba Cloud account ID]"
net: Internet
accessKeyID: "[your AccessKey ID]"
accessKeySecret: "[your AccessKey Secret]"
clusterID: "k8s-pod"
root@master:~/yaml/logging/loongcollector-custom-k8s-package# bash k8s-custom-install.sh install
root@master:~/yaml/logging/loongcollector-custom-k8s-package# kubectl get po -n kube-system -o wide | grep loongcollector-ds
loongcollector-ds-6hcvp   1/1   Running   0   78s   10.20.166.154   node1    <none>   <none>
loongcollector-ds-hhklj   1/1   Running   0   78s   10.20.104.20    node2    <none>   <none>
loongcollector-ds-jx4ll   1/1   Running   0   78s   10.20.135.23    node3    <none>   <none>
loongcollector-ds-wj8c7   1/1   Running   0   78s   10.20.219.71    master   <none>   <none>
After the component is installed, Log Service automatically creates the related resources, which can be viewed in the Log Service console.
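4.6.2 Create log collection rules
1. Standard output log collection (container logs)
Select the Project → create a Logstore → Data Import → choose the "K8s - Standard Output - New Version" template.
2. Configure the machine group
Scenario: K8s
Deployment method: self-managed cluster, DaemonSet
Add machine group: k8s-group-k8s-pod
3. Logtail configuration
Global settings: enter a collection configuration name (for example k8s-stdout).
Container filtering: select the target logs by Pod label, namespace or container name (for example the kube-system namespace).
4. Host log collection for the local K8s cluster (all host nodes of the local cluster)
Prerequisites
If no Project is available, create a basic one with the steps below; see "Manage a Project" in the Log Service documentation for the full set of options.
Log in to the Log Service console, click Create Project and complete the following basic settings, leaving everything else at its default:
Region: choose an Alibaba Cloud region that fits the log source. It cannot be changed after creation.
Project name: set a name. It is globally unique within the Alibaba Cloud region and cannot be changed after creation.
If no Logstore is available, create a basic one with the steps below; see "Manage a Logstore" in the Log Service documentation for the full set of options.
Log in to the Log Service console and click the target Project in the Project list.
On the Log Storage > Logstores tab, click the + icon.
Enter the Logstore name and leave the other settings at their defaults.
Choose the transfer method and run the installation commands, replacing ${region_id} with the RegionID of the region the Project belongs to.
Download the installation package: run the download command on the server. In the sample code, ${region_id} can be replaced with cn-hangzhou, for example.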
root@master:~# mkdir logotail
root@master:~# cd logotail/
root@master:~/logotail# wget https://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/loongcollector.sh -O loongcollector.sh;
--2026-01-09 09:16:04-- https://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/loongcollector.sh
Internet: suitable for most scenarios, typically cross-region access or servers in other clouds or on premises, but limited by bandwidth and possibly unstable.
root@master:~/logotail# chmod +x loongcollector.sh; ./loongcollector.sh install cn-shenzhen-internet
loongcollector.sh version: 1.7.0
OS Arch: x86_64
OS Distribution: Ubuntu
current glibc version is :2.35
glibc >=2.12, and cpu flag meet
BIN_DIR: /usr/local/ilogtail
CONTROLLER_FILE: loongcollectord
update-rc.d del loongcollectord successfully.
Uninstall loongcollector successfully.
RUNUSER:root
Download package from region cn-shenzhen-internet ...
Package address: http://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/x86_64/main/loongcollector-linux64.tar.gz
[1] Download loongcollector-linux64.tar.gz successfully.
Generate config successfully.
Installing loongcollector in /usr/local/ilogtail ...
sysom-cn-shenzhenPreparing eBPF enviroment ...
Found valid btf file: /sys/kernel/btf/vmlinux
Prepare eBPF enviroment successfully
agent stub for telegraf has been installed
agent stub for jvm has been installed
Install loongcollector files successfully.
Configuring loongcollector service...
Use systemd for startup
service_file_path: /etc/systemd/system/loongcollectord.service
Synchronizing state of loongcollectord.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable loongcollectord
Created symlink /etc/systemd/system/default.target.wants/loongcollectord.service → /etc/systemd/system/loongcollectord.service.
systemd startup successfully.
Synchronizing state of ilogtaild.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable ilogtaild
Created symlink /etc/systemd/system/default.target.wants/ilogtaild.service → /etc/systemd/system/ilogtaild.service.
Configure loongcollector successfully.
Starting loongcollector ...
Start loongcollector successfully.
{"UUID" :"DD64E1D0-ECF9-11F0-92B1-9D94276D7AA7" , "compiler" :"GCC 9.3.1" , "host_id" :"DCCBAF1A-ECF9-11F0-92B1-9D94276D7AA7" , "hostname" :"master" , "instance_id" :"DD64D532-ECF9-11F0-92B1-9D94276D7AA7_192.168.0.200_1767921834" , "ip" :"192.168.0.200" , "loongcollector_version" :"3.2.6" , "os" :"Linux; 5.15.0-164-generic; #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025; x86_64" , "update_time" :"2026-01-09 09:23:55" }
Check the startup status: run the command below. The output loongcollector is running indicates that the service started successfully.
root@master:~/logotail# sudo /etc/init.d/loongcollectord status
loongcollector is running
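The service may take a moment to report as running right after installation, so a short retry loop around the status command can be handy. The following is a minimal sketch, not part of the original procedure: the function name wait_for_collector, its parameters, and the retry count are hypothetical, and it only assumes that the status command prints loongcollector is running as shown above.

```shell
# Hypothetical helper: poll a status command until it reports
# "loongcollector is running", or give up after N attempts.
# $1: status command (assumed default: the init script used above)
# $2: number of attempts (assumed default: 5)
wait_for_collector() {
  status_cmd=${1:-/etc/init.d/loongcollectord status}
  attempts=${2:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # $status_cmd is intentionally unquoted so it splits into command + args
    if $status_cmd 2>/dev/null | grep -q "loongcollector is running"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

On a host with LoongCollector installed, running wait_for_collector as root with no arguments would poll the init script up to five times and return a non-zero status if the service never reports as running.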
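Configure the user ID: the user ID file contains the ID of the Alibaba Cloud main account that owns the Project, and marks that account as authorized to access and collect logs from this server.
A user ID is only required when collecting logs from an ECS instance owned by another account, a self-managed server, or a server hosted by another cloud provider. When several accounts collect logs from the same server, multiple user ID files can be created on that server.
Log in to the Log Service console, hover over the user avatar in the upper-right corner, and copy the account ID from the panel that appears. Make sure to copy the main account ID.
On the server where LoongCollector is installed, create the user ID file, using the main account ID as the file name.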
root@master:~/logotail# touch /etc/ilogtail/users/[your Alibaba Cloud account ID]
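When this step is repeated on several nodes, it can help to wrap it in a small function that rejects an obviously wrong ID before creating the file. This is a sketch under assumptions: the function name create_user_id_file is hypothetical, the directory argument exists only so the function can be tried outside /etc, and the check assumes the main account ID consists solely of digits.

```shell
# Hypothetical helper: create the user ID file for a given account ID.
# $1: Alibaba Cloud main account ID (assumed to be digits only)
# $2: users directory (defaults to the path used in this article)
create_user_id_file() {
  account_id=$1
  users_dir=${2:-/etc/ilogtail/users}
  case $account_id in
    ''|*[!0-9]*)
      echo "invalid account ID: $account_id" >&2
      return 1
      ;;
  esac
  # Create the directory if it is missing, then the empty marker file
  mkdir -p "$users_dir" && touch "$users_dir/$account_id"
}
```

Calling the function once per account ID creates one marker file per account, which matches the multi-account case described above.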
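Configure the machine group: Log Service uses the machine group to discover the user-defined identifier and to establish a heartbeat connection with LoongCollector on the host.
On the server, write the custom string user-defined-test-1 to the user-defined identifier file. This string is used again in the following steps.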
root@master:~/logotail# echo "user-defined-test-1" > /etc/ilogtail/user_defined_id
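Because the identifier entered in the console must match the file content exactly, reading the file back after writing it is a cheap sanity check. The sketch below assumes nothing beyond the path shown above; the function name set_user_defined_id and its optional path argument are hypothetical.

```shell
# Hypothetical helper: write the user-defined identifier and verify it.
# $1: identifier string, e.g. user-defined-test-1
# $2: target file (defaults to the path used in this article)
set_user_defined_id() {
  id=$1
  target=${2:-/etc/ilogtail/user_defined_id}
  [ -n "$id" ] || return 1
  mkdir -p "$(dirname "$target")" || return 1
  printf '%s\n' "$id" > "$target" || return 1
  # Read the file back and compare it with the intended value
  [ "$(cat "$target")" = "$id" ]
}
```

The function returns a non-zero status if the identifier is empty or the file content does not match what was written.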
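Log in to the Log Service console. In the Project list, click the target Project.
Configure the following settings, then click OK.
Machine group name: the name must be unique within the Project, must start and end with a lowercase letter or digit, may contain only lowercase letters, digits, hyphens (-), and underscores (_), and must be 3 to 128 characters long.
Machine group identifier: select User-defined Identifier.
User-defined identifier: enter the identifier you configured. It must match the custom string in the user-defined identifier file on the server exactly. In this example it is user-defined-test-1.
After the machine group is created, click it in the machine group list and check the heartbeat status under Machine Group Status. If the status is FAIL, wait about two minutes and refresh manually. A heartbeat status of OK means the machine group was created successfully.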
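The naming rule for machine groups can be checked locally before typing a name into the console. The function below is a minimal sketch of that rule as stated above; the name is_valid_machine_group_name is hypothetical, and the console remains the authority on what it accepts.

```shell
# Hypothetical helper: check a machine group name against the stated rule:
# 3 to 128 characters, lowercase letters, digits, '-' and '_' only,
# starting and ending with a lowercase letter or digit.
# The body runs in a subshell so the LC_ALL change does not leak out.
is_valid_machine_group_name() (
  LC_ALL=C   # keep [a-z] limited to ASCII lowercase letters
  name=$1
  len=${#name}
  [ "$len" -ge 3 ] && [ "$len" -le 128 ] || return 1
  case $name in
    *[!a-z0-9_-]*) return 1 ;;          # contains a disallowed character
    [!a-z0-9]*|*[!a-z0-9]) return 1 ;;  # bad first or last character
  esac
  return 0
)
```

For example, is_valid_machine_group_name "k8s-master-group" succeeds, while names such as "-abc", "abc_", or "Abc" are rejected.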
5. ECS server log collection follows the same steps as the host log collection above.