Cloud-Native Hybrid-Architecture K8s Automated Deployment Platform
This project builds a hybrid cloud-native Kubernetes automated deployment platform spanning local virtualization and Alibaba Cloud. The core goal is an operations system that is securely isolated, automatically delivered, elastic, stable, and observable, covering the full workflow from base environment setup through cloud-native cluster deployment, service delivery, and hybrid-cloud network interconnection.
1 Environment Setup
The goal of this stage is to use virtualization to create a 4-node local cluster (1 master node + 3 worker nodes) as the base environment for later cloud-native testing and CI/CD component deployment.
1.1 Environment Plan
| Node role | CPU | Memory | Disk | IP plan (bridged mode) |
|---|---|---|---|---|
| master node (master) | 2 cores | 8 GB | 50 GB | 192.168.0.200 |
| node1 (node1) | 2 cores | 8 GB | 50 GB | 192.168.0.201 |
| node2 (node2) | 2 cores | 8 GB | 50 GB | 192.168.0.202 |
| node3 (node3) | 2 cores | 8 GB | 50 GB | 192.168.0.203 |
| Alibaba Cloud ECS (Jenkins) | 2 cores | 4 GB | 40 GB | Elastic public IP |
| Alibaba Cloud ECS (GitLab) | 2 cores | 8 GB | 40 GB | Elastic public IP |
| Alibaba Cloud ACR (container registry) | - | - | - | - |
| Alibaba Cloud SLS (log service) | - | - | - | - |
1.2 Technology Stack Overview
- Virtualization layer: VMware Workstation 17 Pro, Ubuntu 22.04;
- Cloud-native core: Kubernetes 1.32.10, containerd 1.7.18, Calico CNI;
- Public cloud services (Alibaba Cloud): ECS, SLS, ACR;
- CI/CD pipeline: GitLab, Jenkins, Argo CD;
- Monitoring stack: Prometheus, Grafana, Alertmanager.
1.3 VM Creation and OS Installation
- Open VMware, create a new VM in "Custom (advanced)" mode, keep the default hardware compatibility;
- Select the Ubuntu 22.04.5 ISO and set the VM name and storage path;
- Configure CPU and memory per the plan; choose "Bridged" networking (the VM can reach the Internet, and bridging simplifies later connectivity with ECS);
- Create a new 50 GB virtual disk and split it into multiple files;
- Start the VM and install Ubuntu: set the root password (Root@123456 on all nodes, simplified for a test environment), use automatic partitioning, and reboot after the installation completes;
- Clone the VM: right-click the master VM, choose "Clone" → full clone, and name the clones node1, node2, and node3 to avoid reinstalling the OS;
Modify the network configuration on each node (master shown as the example):
vim /etc/netplan/50-cloud-init.yaml
network:
  ethernets:
    ens32:
      dhcp4: no
      addresses: [192.168.0.200/24]
      routes:
        - to: default
          via: 192.168.0.1
      nameservers:
        addresses: [223.5.5.5, 114.114.114.114]
  version: 2
netplan apply
- Disable SELinux on each node (not enabled by default on Ubuntu)
- Disable the firewall (not enabled by default on Ubuntu)
- Permanently disable the swap partition
vim /etc/fstab
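A minimal sketch of the swap change (assuming a standard fstab swap entry): turn swap off for the current boot and comment out the swap line so it stays off after reboots:
swapoff -a
sed -ri '/^[^#].*\sswap\s/s/^/#/' /etc/fstab
free -h   # swap should now report 0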
- Set the hostname on each node
hostnamectl set-hostname master
hostnamectl set-hostname node1
hostnamectl set-hostname node2
hostnamectl set-hostname node3
- Add the Docker official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
- Update the system and install base dependencies
apt update && apt upgrade -y
apt install -y ca-certificates curl gnupg lsb-release apt-transport-https software-properties-common
- Load kernel modules and sysctl parameters
modprobe overlay
modprobe br_netfilter
cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF
cat > /etc/sysctl.d/99-containerd.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
- Add the Docker official APT repository
lsb_release -cs
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
- Configure /etc/hosts name mapping
vim /etc/hosts
192.168.0.200 master
192.168.0.201 node1
192.168.0.202 node2
192.168.0.203 node3
- Configure an HTTP proxy
vim ~/.bashrc
export http_proxy="http://[代理 IP]:7890"
export https_proxy="http://[代理 IP]:7890"
export no_proxy="192.168.0.0/24, localhost, 127.0.0.1, 10.96.0.0/12, 10.20.0.0/16, cluster.local, .svc, .svc.cluster.local, 192.168.0.200"
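After editing, reload the shell configuration so the proxy variables take effect in the current session:
source ~/.bashrc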
2 Cloud-Native Core Layer Deployment (local K8s cluster)
The goal of this stage is to build a cloud-native cluster based on Kubernetes 1.32.10 and containerd 1.7.18.
2.1 Deploy containerd 1.7.18 (all K8s cluster nodes)
2.1.1 Prerequisites
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
apt update && apt upgrade -y
apt install -y ca-certificates curl gnupg lsb-release apt-transport-https software-properties-common
modprobe overlay
modprobe br_netfilter
cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF
cat > /etc/sysctl.d/99-containerd.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
2.1.2 Add the Docker official APT repository
lsb_release -cs
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
2.1.3 Update the sources and install the pinned version
apt update
apt install -y containerd.io=1.7.18-1
2.1.4 Switch to the systemd cgroup driver
Reference: Container Runtimes | Kubernetes documentation
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
sed -i 's/registry.k8s.io\/pause/registry.aliyuncs.com\/google_containers\/pause/g' /etc/containerd/config.toml
systemctl restart containerd && systemctl enable containerd
containerd ran into a problem pulling images; troubleshooting and fix:
root@master:/# ctr image pull docker.io/library/busybox:alpine
WARN[0000] Config "/etc/crictl.yaml" does not exist, trying next: "/usr/bin/crictl.yaml"
WARN[0000] Image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1202 00:06:16.457412 16804 log.go:32]"PullImage from image service failed"err="rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/library/busybox:alpine\": failed to resolve reference \"docker.io/library/busybox:alpine\": failed to do request: Head \"https://registry-1.docker.io/v2/library/busybox/manifests/alpine\": dial tcp 54.89.135.129:443: connect: connection refused"image="docker.io/library/busybox:alpine"
FATA[0020] pulling image: failed to pull and unpack image "docker.io/library/busybox:alpine": failed to resolve reference "docker.io/library/busybox:alpine": failed to do request: Head "https://registry-1.docker.io/v2/library/busybox/manifests/alpine": dial tcp 54.89.135.129:443: connect: connection refused
vim /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
pull-image-on-create: false
containerd config default > /etc/containerd/config.toml
mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://[代理 IP]:7890"
Environment="HTTPS_PROXY=http://[代理 IP]:7890"
Environment="NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,172.16.0.0/12,*.local,kubernetes.default,service,*.cluster.local,192.168.0.200,192.168.0.*,crpi-2pnpj68s945gixnz.cn-shenzhen.personal.cr.aliyuncs.com"
EOF
systemctl daemon-reload
systemctl restart containerd
root@master1:~# ctr image pull docker.io/library/nginx:latest
Image is up to date for sha256:d4918ca7576a537caa7b0c043051c8efc1796de33fee8724ee0fff4a1cabed9
2.1.5 Deploy the nerdctl tool
nerdctl is compatible with the docker CLI syntax; note that containerd separates its resources into namespaces
curl -L https://github.com/containerd/nerdctl/releases/download/v1.7.0/nerdctl-1.7.0-linux-amd64.tar.gz -o nerdctl.tar.gz
sudo tar Cxzvf /usr/local/bin nerdctl.tar.gz nerdctl
nerdctl version
nerdctl -n <namespace> images
nerdctl -n <namespace> rm
nerdctl -n <namespace> (images/rm/tag/rmi/stop/pull/push)
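For example, images pulled through the Kubernetes CRI live in the k8s.io namespace, so they are listed with:
nerdctl -n k8s.io images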
2.2 Deploy the K8s Cluster
Official docs: Installing kubeadm | Kubernetes
2.2.1 Install kubelet, kubeadm, and kubectl 1.32.10 (run on every node)
sudo tee /etc/modules-load.d/k8s.conf <<EOF
overlay
br_netfilter
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
sudo modprobe overlay && sudo modprobe br_netfilter && sudo modprobe ip_vs
sudo tee /etc/sysctl.d/k8s.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
apt-get update
apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
2.2.2 Initialize the K8s cluster with kubeadm
During initialization there was an error caused by a pause image version mismatch; fix:
sed -i 's#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.8"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.10"#g' /etc/containerd/config.toml
Initialize the K8s cluster:
root@master:~# vim kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.32.10
imageRepository: registry.aliyuncs.com/google_containers
networking:
  podSubnet: 10.20.0.0/16
controlPlaneEndpoint: "192.168.0.200:6443"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  ignorePreflightErrors:
    - SystemVerification
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    - name: cgroup-driver
      value: "systemd"
localAPIEndpoint:
  advertiseAddress: 192.168.0.200
  bindPort: 6443
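The file above is then passed to kubeadm init on the master node (a sketch; any extra flags used in this environment are not recorded here). On success, kubeadm prints the kubeadm join command used in 2.2.4:
kubeadm init --config kubeadm-config.yaml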
2.2.3 Configure kubectl admin access
root@master:~# mkdir -p $HOME/.kube
root@master:~# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
root@master:~# sudo chown $(id -u):$(id -g) $HOME/.kube/config
2.2.4 Join the worker nodes
kubeadm join 192.168.0.200:6443 --token e6p5bq.bqju9z9dqwj2ydvy --discovery-token-ca-cert-hash sha256:9b3750aedaed5c1c3f95f689ce41d7da1951f2bebba6e7974a53e0b20754a09d
2.2.5 Label the worker nodes with the work role
root@master:~# kubectl label node node1 node-role.kubernetes.io/work=work
root@master:~# kubectl label node node2 node-role.kubernetes.io/work=work
root@master:~# kubectl label node node3 node-role.kubernetes.io/work=work
2.2.6 Install the Kubernetes network plugin: Calico
root@master:~# curl -O https://raw.githubusercontent.com/projectcalico/calico/v3.30.0/manifests/calico.yaml
root@master:~# kubectl apply -f calico.yaml
kubectl get pods -n kube-system -w
2.2.7 kubectl command completion
apt update && apt install -y bash-completion
echo "source <(kubectl completion bash)" >> ~/.bashrc
source ~/.bashrc
2.3 Environment Verification
systemctl status containerd
kubectl get nodes
ping oss-cn-hangzhou.aliyuncs.com
2.4 Deploy the Core Microservice (run on the master node)
The service is deployed with native K8s resources and uses a PVC for data persistence. Before deploying, the service image must be built and pushed to Alibaba Cloud ACR, as follows:
2.4.1 Prerequisites: confirm the base environment and resources
- Environment: containerd is installed (already in place in the local K8s environment), and the local machines can reach Alibaba Cloud ACR (network reachable, no firewall restrictions);
- Resources: ① the product-service source code (with pom.xml / Maven configuration for building) — in this walkthrough the official nginx image stands in for it; ② an Alibaba Cloud account with ACR enabled and namespace permissions; ③ Alibaba Cloud access credentials configured locally (or log in to ACR with a username/password in a later step).
2.4.2 Step 1: Enable Alibaba Cloud ACR
- Open the ACR console
- Enter the personal edition instance and create a namespace
- Create a private image repository
2.4.3 Step 2: Build the image and push it to the ACR repository (master node)
In this section the official nginx image is used in place of a real product-service image, simulating a production microservice
- Pull the official nginx image
root@master:~# crictl pull nginx:latest
Image is up to date for sha256:058f4935d1cbc026f046e4c7f6ef3b1d778170ac61f293709a2fc89b1cff7009
root@master:~# crictl images
IMAGE TAG IMAGE ID SIZE
docker.io/calico/cni v3.30.0 15f996c472622 71.8MB
docker.io/calico/node v3.30.0 d12dae9bc0999 156MB
docker.io/library/nginx latest 058f4935d1cbc 59.8MB
registry.aliyuncs.com/google_containers/coredns v1.11.3 c69fa2e9cbf5f 18.6MB
registry.aliyuncs.com/google_containers/etcd 3.5.24-0 8cb12dd0c3e4 23.7MB
registry.aliyuncs.com/google_containers/kube-apiserver v1.32.10 77f8b0de97da9 29.1MB
registry.aliyuncs.com/google_containers/kube-controller-manager v1.32.10 34e0beef266f 26.6MB
registry.aliyuncs.com/google_containers/kube-proxy v1.32.10 db4bcdca85a39 31.2MB
registry.aliyuncs.com/google_containers/kube-scheduler v1.32.10 fd6f6aae834c2 21.1MB
registry.aliyuncs.com/google_containers/pause 3.10 873ed75102791 320kB
registry.aliyuncs.com/google_containers/pause 3.8 4873874c08ef 311kB
- Log in to Alibaba Cloud ACR
root@master:~# nerdctl login --username=[您的用户名] [您的阿里云镜像仓库地址]
Enter Password:
WARNING: Your password will be stored unencrypted in /root/.docker/config.json. Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
- Tag the image and push it to the ACR repository
root@master:~# nerdctl -n k8s.io tag docker.io/library/nginx:latest [您的阿里云镜像仓库地址]/product-service-test/product-service:v1
root@master:~# nerdctl -n k8s.io images
REPOSITORY TAG IMAGE ID CREATED PLATFORM SIZE BLOB SIZE
[您的阿里云镜像仓库地址]/product-service-test/product-service v1 ca871a86d45a 9 seconds ago linux/amd64 157.5 MiB 57.0 MiB
root@master:~# nerdctl -n k8s.io push [您的阿里云镜像仓库地址]/product-service-test/product-service:v1
INFO[0000] pushing as a reduced-platform image (application/vnd.oci.image.index.v1+json, sha256:32502741bf9dbc4ad2c22e24f46c001506711f5bb7d674ac043aaa3242326ef3) index-sha256:32502741bf9dbc4ad2c22e24f46c001506711f5bb7d674ac043aaa3242326ef3: done|++++++++++++++++++++++++++++++++++++++| manifest-sha256:8c39d819008c669731d333c44c766c1d9de3492beb03f8fc035bb5ef7081000: done|++++++++++++++++++++++++++++++++++++++| config-sha256:058f4935d1cbc026f046e4c7f6ef3b1d778170ac61f293709a2fc89b1cff7009: done|++++++++++++++++++++++++++++++++++++++| elapsed: 1.3 s
In the console, the repository's tag list shows that the image was pushed successfully
2.4.4 Step 3: Deploy the service (from the ACR image)
1. ConfigMap (Nginx home page)
root@master:~/yaml/product-service# vim product-service-welcome-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: welcome-nginx-cm
  namespace: product
data:
  index.html: |
    <!DOCTYPE html><html><head><title>Welcome</title></head><body><h1>v1</h1></body></html>
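Assuming the product namespace created in step 3 below already exists, the ConfigMap is applied with:
kubectl apply -f product-service-welcome-cm.yaml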
2. Create a StorageClass for dynamic volume provisioning
root@master:~# mkdir yaml
root@master:~# cd yaml
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.24/deploy/local-path-storage.yaml
root@master:~/yaml# vim sc-local-path.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  pathPattern: "/var/lib/local-path-provisioner"
root@master:~/yaml# kubectl apply -f sc-local-path.yaml
root@master:~/yaml# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer true 68m
3. Create the PVC (persistent storage)
root@master:~/yaml# kubectl create ns product
root@master:~/yaml# vim product-service-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: product-service-pvc
  namespace: product
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
root@master:~/yaml# kubectl apply -f product-service-pvc.yaml
root@master:~/yaml# kubectl get pvc -n product
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
product-service-pvc Bound pvc-f9f2916d-98ba-4435-aa80-ffcfb342cd6a 10Gi RWO local-path <unset> 69m
root@master:~/yaml# kubectl get pv -n product
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-f9f2916d-98ba-4435-aa80-ffcfb342cd6a 10Gi RWO Delete Bound product/product-service-pvc local-path <unset> 68m <unset> 60m
4. Deploy the service (Deployment)
kubectl create secret docker-registry acr-pull-secret \
--namespace=product \
--docker-server=[您的阿里云镜像仓库地址] \
--docker-username=[您的用户名] \
--docker-password='[您的密码]'
root@master:~/yaml# vim product-service-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-service
  namespace: product
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: product-service
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - node2
      containers:
        - name: product-service
          image: [您的阿里云镜像仓库地址]/product-service-test/product-service:v1
          ports:
            - containerPort: 80
          volumeMounts:
            - name: welcome-page
              mountPath: /usr/share/nginx/html/index.html
              subPath: index.html
            - name: product-data
              mountPath: /data
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 200m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
      volumes:
        - name: welcome-page
          configMap:
            name: welcome-nginx-cm
            items:
              - key: index.html
                path: index.html
        - name: product-data
          persistentVolumeClaim:
            claimName: product-service-pvc
root@master:~/yaml# kubectl apply -f product-service-deploy.yaml
deployment.apps/product-service configured
root@master:~/yaml# kubectl get pod -n product
NAME READY STATUS RESTARTS AGE
product-service-65dff7d8d4-b8lc7 1/1 Running 0 6s
product-service-65dff7d8d4-czc7w 1/1 Running 0 4s
product-service-65dff7d8d4-gcpsp 1/1 Running 0 5s
5. Create Services to expose the port
root@master:~/yaml# vim product-service-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: product-service-nodeport
  namespace: product
  labels:
    app: product-service
spec:
  type: NodePort
  selector:
    app: product-service
  ports:
    - name: http
      port: 80
      targetPort: 80
      nodePort: 30080
      protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: product-service
  namespace: product
spec:
  selector:
    app: product-service
  ports:
    - port: 80
      targetPort: 80
Open 192.168.0.201:30080 in a browser
6. Create an HPA for automatic scaling
root@master:~/yaml# wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
root@master:~/yaml# vim components.yaml
spec:
  containers:
    - args:
        - --kubelet-insecure-tls
        - --cert-dir=/tmp
        - --secure-port=10250
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
root@master:~/yaml# kubectl apply -f components.yaml
root@master:~/yaml# vim product-service-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: product-service-hpa
  namespace: product
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 30
          periodSeconds: 60
root@master:~/yaml# kubectl apply -f product-service-hpa.yaml
horizontalpodautoscaler.autoscaling/product-service-hpa created
root@master:~/yaml# kubectl get hpa -n product
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
product-service-hpa Deployment/product-service cpu: 0%/50% 2 10 3 97s
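To watch the HPA react, a throwaway load generator can be pointed at the ClusterIP Service (a sketch, not part of the original steps; the pod name load-gen is arbitrary):
kubectl run load-gen -n product --image=busybox:1.36 --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://product-service.product.svc.cluster.local/ > /dev/null; done"
kubectl get hpa -n product -w
kubectl delete pod load-gen -n product   # clean up afterwards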
3 CI/CD Pipeline Setup
Linking the cloud ECS instances to the local Kubernetes cluster requires network connectivity: WireGuard VPN is used to connect the local VMware environment to the Alibaba Cloud VPC, so the local K8s cluster can reach Alibaba Cloud resources such as RDS and OSS.
3.1 Hybrid Cloud Network Interconnection
Server side: an Alibaba Cloud ECS instance (publicly reachable) runs WireGuard as the VPN server.
Client side: the master node of the local K8s cluster runs the WireGuard client and joins the VPN.
Core goal:
Local K8s nodes ↔ ECS reachable in both directions;
3.1.1 Alibaba Cloud ECS Configuration
Create the ECS instance:
OS: Ubuntu 22.04
Network: VPC (10.0.0.0/16), vSwitch (10.0.10.0/24), elastic public IP
Inbound security group rules:
| Service | Protocol | Source | Destination port |
|---|---|---|---|
| WireGuard listen port | UDP | My workstation IP | 51820 |
| GitLab | TCP | My workstation IP + VPC CIDR + Jenkins public IP | 443 |
| GitLab | TCP | My workstation IP + VPC CIDR + Jenkins public IP | 80 |
| GitLab SSH port | TCP | All IPs | 2222 |
| Jenkins | TCP | My workstation IP + VPC CIDR + GitLab public IP | 8080 |
| Jenkins | TCP | My workstation IP + VPC CIDR | 50000 |
3.1.2 WireGuard Installation and Configuration
(1) Install WireGuard on the Alibaba Cloud ECS and on the local node
Common commands
wg-quick up wg0
wg-quick down wg0
wg-quick down wg0 && wg-quick up wg0
wg show
wg show wg0
wg show wg0 dump
apt update
apt install wireguard -y
mkdir -p /etc/wireguard
cd /etc/wireguard
sudo wg genkey | sudo tee private.key | sudo wg pubkey > public.key
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# cat private.key
YG9CkSAnVIy4F8hIiE6ugma5xcgDiT5bMqqTRcy0M2M=
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# cat public.key
k5FafPFqLcQG6MhkIrHy8U2fg5bhN/VgDpXqmiVgwls=
(2) Create the configuration file on the ECS node
vim /etc/wireguard/wg0.conf
[Interface]
Address = 10.255.255.1/24
ListenPort = 51820
PrivateKey = YJUSqwLfS/VZWsC8qBXPxdIiilsRBUnbZszPtrKoN0A=
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostUp = ip6tables -A FORWARD -i wg0 -j ACCEPT
PostUp = ip6tables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
PostDown = ip6tables -D FORWARD -i wg0 -j ACCEPT
PostDown = ip6tables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
[Peer]
PublicKey = 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
AllowedIPs = 10.255.255.2/32, 192.168.0.0/24
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
root@jenkins:/etc/wireguard# wg show wg0
interface: wg0
public key: fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
private key: (hidden)
listening port: 51820
peer: 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
allowed ips: 10.255.255.2/32, 192.168.0.0/24
(3) Create the configuration file on the local node
vim /etc/wireguard/wg0.conf
[Interface]
Address = 10.255.255.2/24
PrivateKey = iHhpTPwdNSl4cCYCPmOGyUDU46gcAtuNlsRn1QqTOVg=
PostUp = sysctl -w net.ipv4.ip_forward=1
[Peer]
PublicKey = fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
AllowedIPs = 10.255.255.1/32, 10.0.10.0/24
Endpoint = [ECS 公网 IP]:51820
PersistentKeepalive = 25
systemctl enable wg-quick@wg0
systemctl start wg-quick@wg0
root@master:/etc/wireguard# wg show
interface: wg0
public key: 8JAEThs8LkcYv27YBc1ROVX2QMD9TODwsYKuUmLHyRI=
private key: (hidden)
listening port: 37352
peer: fwNl1Us9Hk0oEebqGLdi8Bo9NyeiFoUAIYYeX5qdsHI=
endpoint: [ECS 公网 IP]:51820
allowed ips: 10.255.255.1/32, 10.0.10.0/24
latest handshake: 4 seconds ago
transfer: 92 B received, 180 B sent
persistent keepalive: every 25 seconds
Replace ens32 with the actual NIC name
iptables -A FORWARD -i wg0 -o ens32 -j ACCEPT
iptables -A FORWARD -i ens32 -o wg0 -j ACCEPT
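These FORWARD rules do not survive a reboot; one common option (an assumption, not part of the original steps) is to persist them with the iptables-persistent package:
apt install -y iptables-persistent
netfilter-persistent save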
Problem encountered:
After systemctl start wg-quick@wg0, the calico-node-t4r7h pod on the master node was no longer running.
Stopping the wg0 interface restored Calico. Root cause: the WireGuard wg0 interface took over the route for the 10.0.0.0/16 range, and Calico's Pod subnet (10.20.219.64/26) happened to fall within the captured range, so Calico's BGP traffic was routed out via wg0 instead of the cluster's ens32 interface and the BGP sessions failed.
Solution
Edit the Calico DaemonSet and force it to use the cluster's ens32 interface (not wg0) for BGP communication:
kubectl edit ds calico-node -n kube-system
- name: IP_AUTODETECTION_METHOD
value: "interface=ens32"
- name: CALICO_NETWORK_INTERFACE
value: "ens32"
Finally restart wg0 with systemctl start wg-quick@wg0 and watch the Calico pods; the problem is resolved.
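A quick way to watch Calico recover after the change (the k8s-app=calico-node label comes from the upstream Calico manifest):
kubectl rollout status ds/calico-node -n kube-system
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide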
3.1.3 Connectivity Tests
From the local VM to the Alibaba Cloud ECS
root@master:~# ping 10.255.255.1
PING 10.255.255.1 (10.255.255.1) 56(84) bytes of data.
64 bytes from 10.255.255.1: icmp_seq=1 ttl=64 time=20.8 ms
64 bytes from 10.255.255.1: icmp_seq=2 ttl=64 time=22.3 ms
64 bytes from 10.255.255.1: icmp_seq=3 ttl=64 time=21.3 ms
^C
--- 10.255.255.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.815/21.464/22.275/0.606 ms
root@master:~# ping 10.0.10.45
PING 10.0.10.45 (10.0.10.45) 56(84) bytes of data.
64 bytes from 10.0.10.45: icmp_seq=1 ttl=64 time=22.2 ms
64 bytes from 10.0.10.45: icmp_seq=2 ttl=64 time=20.9 ms
64 bytes from 10.0.10.45: icmp_seq=3 ttl=64 time=21.3 ms
^C
--- 10.0.10.45 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.913/21.498/22.240/0.552 ms
root@master:~# telnet 10.0.10.45 22
Trying 10.0.10.45...
Connected to 10.0.10.45.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.13
From the Alibaba Cloud ECS to the local network
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# ping 10.255.255.2
PING 10.255.255.2 (10.255.255.2) 56(84) bytes of data.
64 bytes from 10.255.255.2: icmp_seq=1 ttl=64 time=20.9 ms
64 bytes from 10.255.255.2: icmp_seq=2 ttl=64 time=21.4 ms
64 bytes from 10.255.255.2: icmp_seq=3 ttl=64 time=21.0 ms
^C
--- 10.255.255.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.873/21.103/21.424/0.233 ms
root@iZwz9cnnlu0g55olnxfuw4Z:/etc/wireguard# ping 192.168.121.100
PING 192.168.121.100 (192.168.121.100) 56(84) bytes of data.
64 bytes from 192.168.121.100: icmp_seq=1 ttl=64 time=21.0 ms
64 bytes from 192.168.121.100: icmp_seq=2 ttl=64 time=20.8 ms
64 bytes from 192.168.121.100: icmp_seq=3 ttl=64 time=20.5 ms
^C
--- 192.168.121.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 20.541/20.778/20.956/0.174 ms
3.2 Deploy GitLab (Alibaba Cloud ECS)
3.2.1 Create an ECS instance
2 cores, 4 GB RAM
Bind an elastic public IP
Install Docker CE
Ubuntu 22.04, same as the local environment
Rename the instance, copy the public IP, and connect remotely
Security group: allow access from this workstation's IP and the Jenkins server's public IP
3.2.2 Deploy GitLab with docker-compose
- Install docker-compose
root@iZwz90hzjc4m3pd9ick3miZ:~# hostnamectl set-hostname gitlab
root@iZwz90hzjc4m3pd9ick3miZ:~# su
root@gitlab:~# apt install -y docker-compose
root@gitlab:~# docker-compose --version
docker-compose version 1.25.0, build unknown
- Create the GitLab data persistence directories
root@gitlab:~# mkdir -p /data/gitlab/{config,data,logs}
root@gitlab:~# chmod -R 777 /data/gitlab
- Create the docker-compose.yml file
root@gitlab:~# vim docker-compose.yml
version: '3'
services:
  gitlab:
    image: gitlab/gitlab-ce:14.3.6-ce.0
    container_name: gitlab
    privileged: true
    restart: always
    ports:
      - "80:80"
      - "443:443"
      - "2222:22"
    volumes:
      - /data/gitlab/config:/etc/gitlab
      - /data/gitlab/data:/var/opt/gitlab
      - /data/gitlab/logs:/var/log/gitlab
    environment:
      - TZ=Asia/Shanghai
      - GITLAB_OMNIBUS_CONFIG=external_url 'http://[ECS 公网 IP]'; gitlab_rails['gitlab_shell_ssh_port']=2222;
- Start the container
root@gitlab:~# docker-compose up -d
Creating network "root_default" with the default driver
Pulling gitlab (gitlab/gitlab-ce:latest)...
latest: Pulling from gitlab/gitlab-ce
7b1a6ab2e44d: Pull complete
6c37b8f20a77: Pull complete
f509191f201: Pull complete
bb6bfd7806: Pull complete
2c03ae5f5fcd: Pull complete
8311111743: Pull complete
499fee924bc: Pull complete
6667fb304: Pull complete
Digest: sha256:5a0b03f09ab2f2634ecc6bfeb41521d19329cf4c9bbf330227117c048e75163
Status: Downloaded newer image for gitlab/gitlab-ce:latest
Creating gitlab ... done
root@gitlab:~# docker-compose logs -f gitlab
- Get the initial root password
root@gitlab:~# docker exec -it gitlab /bin/bash
root@4c054babda87:/# cat /etc/gitlab/initial_root_password
Password: jqV6Dmlo+kbke3pLVFP0PTV2ttWiFPnDq54uX4WQ0Hc=
# NOTE: This file will be automatically deleted in the first reconfigure run after 24 hours.
- Open a browser and go to http://[ECS 公网 IP];
- Enter the username root, paste the initial password above, and log in;
- Change the password to a new strong one.
Access GitLab and log in
If the page cannot be reached, check whether the security group allows the required ports; if not, add rules for ports 80 and 443
3.2.3 Initialize GitLab
Create a project
3.2.4 Configure SSH key authentication for the master node
root@master:~/gitlab/e-commerce-platform# cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC2/jHMzETQsYS0+IkoKsZGDvqF3mmjEMYS1hjGfJnMin+mPRKH0quZll/4RuHFky3sbn3WSDonCcgvXP0TWUZTvCe9CGvlnU+zkkCMuOwCRqXNb/pXeAjzOCDBkUX+vXYHrhkmtNPylS8JDAuOdr+6qnIKG8GBjRVFmu7tl6+NFgjgpEGbgTE6vowWK+J3zKx6iN7FCKx+oMcWdEvcOy/WNnYWq7uCfQQgXerONTKHTJ6I9z6x/MMHnCTszSAYHSr7D9HV9un0k9tnoV5cSTA0tuDmFzNWX288v702DWDxgDJeaJLSeQTAAu6lm93GAdNC77QpI7IPDcZ/NkO3/AQoE5yIdCX8ApE7hobNQVL/24+8n+EmzfYsP+IWK/SWf7WZV4BR7v1QTz2M7HqPiYNR5rxOniCAhJ4dwnoS4LjeYMknGoB4SBqPcnpoUZT9q1iYf02JunKgCpAHSdNJ4IfbdiKYeO6IlCPL78xjvEAfOuqwSjOgUbiH70OXWfrJKmj5j/4J4crWm7cApCcevx6dzqo072rQtZLLoOZSBf114EkjCglE5W0hlnh6/sivBt/Yq0iNMAGVBsexJ8c8n5+saKuY+T1SU5JQiIeoISgVG/Ssv1913RRravFj5Fme3A8UnyYri0/4k3PYGu7QBBTytFmuim3sBYaQIzmqpRBLbw== root@master
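If the master node does not have a key pair yet, it can be generated first and the public key pasted into GitLab under Profile → SSH Keys (a sketch; the comment string is arbitrary):
ssh-keygen -t rsa -b 4096 -C "root@master"
cat ~/.ssh/id_rsa.pub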
3.3 Deploy Jenkins (Alibaba Cloud ECS)
3.3.1 Create an ECS instance
2 cores, 4 GB RAM
Assign a public IP
Install Docker CE
Ubuntu 22.04, same as the local environment
Rename the instance, copy the elastic public IP, and connect remotely
3.3.2 Deploy Jenkins with docker-compose
- Install docker-compose
root@iZwz9749p6a8r7y1673ypyZ:~# hostnamectl set-hostname jenkins
root@iZwz9749p6a8r7y1673ypyZ:~# su
root@jenkins:~# apt install -y docker-compose
- Create the Jenkins data directory (persistent data)
root@jenkins:~# mkdir -p /opt/jenkins/data
root@jenkins:~# chown -R 1000:1000 /opt/jenkins/data
root@jenkins:~# chmod -R 755 /opt/jenkins/data
- Write the docker-compose.yml file
root@jenkins:~# cd /opt/jenkins
root@jenkins:/opt/jenkins# vim docker-compose.yml
version: '2.2'
services:
  jenkins:
    image: jenkins/jenkins:2.528.2
    container_name: jenkins
    restart: always
    privileged: true
    user: root
    ports:
      - "8080:8080"
      - "50000:50000"
    volumes:
      - ./data:/var/jenkins_home
      - /var/run/docker.sock:/var/run/docker.sock
      - /usr/bin/docker:/usr/bin/docker
      - /usr/local/bin/docker-compose:/usr/local/bin/docker-compose
    environment:
      - TZ=Asia/Shanghai
- Start the Jenkins container
root@jenkins:/opt/jenkins# docker-compose up -d
Creating network "jenkins_default" with the default driver
Pulling jenkins (jenkins/jenkins:2.528.2)...
2.528.2: Pulling from jenkins/jenkins
13cc3f8244a: Pull complete
dc27f462ea: Pull complete
33300af18dd0: Pull complete
c2759c6dffa: Pull complete
e4beac6dffa: Pull complete
a37b858bb47: Pull complete
744b792e083: Pull complete
05d79a8b608: Pull complete
8d27b2b2b2: Pull complete
65e4ba86bc: Pull complete
5dc073277a: Pull complete
7718ff1022: Pull complete
Digest: sha256:7b1c378278279c8688efd6168c25a1c2723a6bd6f0420beb5ccefabee3cc3bb1
Status: Downloaded newer image for jenkins/jenkins:2.528.2
Creating jenkins ... done
root@jenkins:/opt/jenkins# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6e126cdd99b jenkins/jenkins:2.528.2 "/sbin/tini -- /usr/…" 2 seconds ago Up 2 seconds 0.0.0.0:8080->8080/tcp, [::]:8080->8080/tcp, 0.0.0.0:50000->50000/tcp, [::]:50000->50000/tcp jenkins
3.3.3 Jenkins Initial Configuration
- Open the Jenkins page
In a browser, go to http://[ECS 公网 IP]:8080 (the security group must allow port 8080 and the agent port 50000).
- Enter the initial admin password
root@jenkins:/opt/jenkins# docker exec -it jenkins cat /var/jenkins_home/secrets/initialAdminPassword
de747fc1faa540cabfcd937c36e71ac6
- Install plugins
Choose "Install suggested plugins" and wait for the installation to finish.
If some plugins fail to install, click "Retry", or install them manually later from the Jenkins plugin manager.
Here a prepared plugin bundle is uploaded instead, to save download time
root@jenkins:/opt/jenkins# mv plugins.tar data/
root@jenkins:/opt/jenkins# cd data/
root@jenkins:/opt/jenkins/data/# tar -xvf plugins.tar
- Create a user
- Configure the instance URL
- Install extra plugins: GitLab Plugin, Kubernetes Plugin, Nexus Plugin;
- Configure the GitLab connection: in Manage Jenkins → System → GitLab, add the GitLab server with its address and an Access Token (created in GitLab under Personal Settings → Access Tokens);
Test the connection; "Success" means it works
Save
If the connection fails, check that the ECS security group allows the Jenkins server's public IP to reach the GitLab server's public IP on port 80
3.4 Deploy Argo CD (local K8s cluster)
3.4.1 Install Argo CD
root@master:~/yaml# kubectl create namespace argocd
root@master:~/yaml# mkdir argocd/
root@master:~/yaml/argocd# kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.8.3/manifests/install.yaml
root@master:~/yaml/argocd# kubectl patch svc argocd-server -n argocd -p '{"spec":{"type":"NodePort"}}'
root@master:~/yaml/argocd# kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d && echo
eyd7NOqAVLDGak1o
root@master:~/yaml/argocd# kubectl get svc -n argocd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-applicationset-controller ClusterIP 10.106.160.96 <none>7000/TCP,8080/TCP 27h
argocd-dex-server ClusterIP 10.107.111.20 <none>5556/TCP,5557/TCP,5558/TCP 27h
argocd-metrics ClusterIP 10.97.249.73 <none>8082/TCP 27h
argocd-notifications-controller-metrics ClusterIP 10.110.61.50 <none>9001/TCP 27h
argocd-redis ClusterIP 10.105.69.236 <none>6379/TCP 27h
argocd-repo-server ClusterIP 10.99.240.50 <none>8081/TCP,8084/TCP 27h
argocd-server NodePort 10.108.75.197 <none>80:31375/TCP,443:32324/TCP 27h
argocd-server-metrics ClusterIP 10.110.198.250 <none>8083/TCP 27h
Access the Argo CD UI at http://<local K8s node IP>:<argocd-server NodePort> and log in as admin with the initial password.
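Optionally, the same login can be done with the argocd CLI if it is installed (a sketch, not part of the original steps; the port is the HTTPS NodePort from the listing above):
argocd login 192.168.0.200:32324 --username admin --password <initial-password> --insecure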
3.4.2 Connect Argo CD to GitLab
- In the Argo CD UI → Settings → Repositories → Connect Repo
Repository URL: the GitLab repository address (http://[GitLab 公网 IP]/root/e-commerce-platform.git)
Type: Git
Authentication: Username + Password → username root, password the personal access token generated earlier
Click Connect and verify that the connection succeeds.
3.5 Jenkins Pipeline Configuration
3.5.1 Clone the GitLab repository
First clone the GitLab repository locally, then write and commit the code files
root@master:~# mkdir gitlab
root@master:~# cd gitlab
root@master:~/github# git config --global user.name "[您的用户名]"
root@master:~/github# git config --global user.email "[您的邮箱]"
root@master:~/github# git config --global color.ui true
root@master:~/github# git config --list
root@master:~/github# git init
root@master:~/gitlab# git clone ssh://git@[GitLab 公网 IP]:2222/root/e-commerce-platform.git
Cloning into 'e-commerce-platform'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 17(delta 2), reused 0(delta 0), pack-reused 3
Receiving objects: 100% (17/17), done.
Resolving deltas: 100% (2/2), done.
root@master:~/gitlab# ls e-commerce-platform
root@master:~/gitlab# cd e-commerce-platform/
3.5.2 Write the Jenkinsfile (stored in the root of the GitLab repository)
root@master:~/gitlab/e-commerce-platform# vim Jenkinsfile
pipeline {
agent any
environment {
ACR_REGISTRY = "[您的阿里云镜像仓库地址]/product-service-test"
APP_NAME = "product-service"
GITLAB_REPO_URL = "http://[GitLab 公网 IP]/root/e-commerce-platform.git"
GITLAB_REPO_HOST = "[GitLab 公网 IP]/root/e-commerce-platform.git"
GIT_CRED_ID = "Gitlab-token-Secret" // "Secret text" credential
ACR_CRED_ID = "acr-cred"
MANIFEST_FILE = "product-service-deploy.yaml"
MANIFEST_CLONE_DIR = "e-commerce-platform-manifests"
VERSION_FILE = "version.txt"
}
options {
timeout(time: 30, unit: 'MINUTES')
retry(1)
skipDefaultCheckout(false)
disableConcurrentBuilds()
}
stages {
stage('Check Skip Conditions') {
steps {
script {
// Only check for Jenkins commits; no longer check whether version.txt changed
def commitMessage = sh(script: 'git log -1 --pretty=%B || echo ""', returnStdout: true).trim()
def commitAuthor = sh(script: 'git log -1 --pretty=%an || echo ""', returnStdout: true).trim()
// Skip the build if this is a Jenkins commit
if (commitMessage.contains('[Jenkins]') || commitMessage.contains('[ci skip]') || commitAuthor == 'jenkins-bot') {
echo "===== Jenkins commit detected, skipping build ====="
currentBuild.result = 'SUCCESS'
env.SKIP_BUILD = 'true'
return
}
// User commit: continue the build (even if version.txt changed)
echo "===== User commit, continuing build ====="
}
}
}
stage('Get Version') {
when {
expression { env.SKIP_BUILD !='true'}
}
steps {
script {
// Read the version from the source repository (manually specified by the user)
env.NEXT_VERSION = sh(script: 'cat ${VERSION_FILE} 2>/dev/null || echo "v0"', returnStdout: true).trim()
if (env.NEXT_VERSION == 'v0') { error "version.txt is missing or empty; create and commit it first" }
echo "Using the manually specified version: ${env.NEXT_VERSION}"
}
}
}
stage('Build Docker Image') {
when {
expression { env.SKIP_BUILD !='true'}
}
steps {
echo"===== 构建镜像:${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION} ====="
sh """
if[! -f Dockerfile ];then
echo'错误:Dockerfile 不存在'
exit 1
fi
docker build --no-cache -t ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}.
"""
}
}
stage('Push to ACR') {
when {
expression { env.SKIP_BUILD !='true'}
}
steps {
echo"===== 推送镜像到 ACR ====="
withCredentials([usernamePassword(credentialsId: "${ACR_CRED_ID}", passwordVariable: 'ACR_PWD', usernameVariable: 'ACR_USER')]){
sh """
echo${ACR_PWD}|docker login --username ${ACR_USER} --password-stdin ${ACR_REGISTRY.split('/')[0]}
docker push ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}
docker logout${ACR_REGISTRY.split('/')[0]}
"""
}
}
}
stage('Update K8s Manifest') {
when {
expression { env.SKIP_BUILD !='true'}
}
steps {
echo"===== 更新 K8s 清单 ====="
withCredentials([string(credentialsId: "${GIT_CRED_ID}", variable: 'GITLAB_TOKEN')]){
script {
sh"rm -rf ${MANIFEST_CLONE_DIR} 2>/dev/null || true"
sh """
git clone http://oauth2:${GITLAB_TOKEN}@${GITLAB_REPO_HOST}${MANIFEST_CLONE_DIR}||{echo'克隆仓库失败';exit 1}
"""
dir("${MANIFEST_CLONE_DIR}"){
// 更新 K8s 清单
sh """
sed -i.bak 's|image: .*${APP_NAME}:.*|image: ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}|g'${MANIFEST_FILE}
rm -f ${MANIFEST_FILE}.bak
"""
// 可选:同步更新 manifest 仓库的 version.txt(保持一致)
sh """
echo"${NEXT_VERSION}">${VERSION_FILE}
"""
// 提交到 manifest 仓库
sh """
git config user.email "[您的邮箱]"
git config user.name "jenkins-bot"
if git status --porcelain |grep -q .;then
git add${MANIFEST_FILE}${VERSION_FILE}
git commit -m "[Jenkins] Update ${APP_NAME} to ${NEXT_VERSION} [ci skip]"
git push origin main
echo"已推送修改到 manifest 仓库"
else
echo"无修改,跳过提交"
fi
"""
}
}
}
}
}
}
post {
always {
echo"===== 清理资源 ====="
sh"rm -rf ${MANIFEST_CLONE_DIR} || true"
script {
if(env.NEXT_VERSION){sh"docker rmi ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION} || true 2>/dev/null"}
}
}
success {
script {
if (env.SKIP_BUILD != 'true') {
echo "Pipeline succeeded! Image: ${ACR_REGISTRY}/${APP_NAME}:${NEXT_VERSION}"
} else {
echo "Build skipped (automated commit)"
}
}
}
failure {
echo "Pipeline failed! Check the configuration"
}
}
}
3.5.3 Write the Dockerfile
root@master:~/gitlab/e-commerce-platform# vim Dockerfile
FROM docker.io/library/nginx:latest
3.5.4 Prepare the K8s deployment manifests and commit them to the repository
product-service-deploy.yaml
product-service-welcome-cm.yaml
Dockerfile
Jenkinsfile
root@master:~/gitlab/e-commerce-platform# cp /root/yaml/product-service/product-service-deploy.yaml ./
root@master:~/gitlab/e-commerce-platform# cp /root/yaml/product-service/product-service-welcome-cm.yaml ./
root@master:~/gitlab/e-commerce-platform# ls Dockerfile Jenkinsfile product-service-deploy.yaml product-service-welcome-cm.yaml
root@master:~/gitlab/e-commerce-platform# git add ./
root@master:~/gitlab/e-commerce-platform# git commit -m "v1"[main 3429cff] v1 4 files changed, 156 insertions(+) create mode 100644 Dockerfile create mode 100644 Jenkinsfile create mode 100644 product-service-deploy.yaml create mode 100644 product-service-welcome-cm.yaml
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 2.51 KiB |2.51 MiB/s, done.
Total 6(delta 0), reused 0(delta 0), pack-reused 0
To ssh://[GitLab 公网 IP]:2222/root/e-commerce-platform.git
62ecaae..3429cff main -> main
3.6 Configure a GitLab Webhook to Trigger Jenkins
3.6.1 Create the pipeline job in Jenkins
New Item → choose "Pipeline" → name it product-service-ci
Pipeline → Definition: "Pipeline script from SCM" → SCM: Git → enter the app-code repository URL, select the gitlab-token credential (must be a username/password credential or it will not be listed) → branch main → script path Jenkinsfile → Save.
3.6.2 Configure the GitLab webhook
Generate a Jenkins API token
In the GitLab app-code repository → Settings → Webhooks → Add webhook:
URL: the Jenkins trigger address (format: http://<username>:<API token>@[Jenkins 公网 IP]:8080/project/<job name>)
Trigger: check "Push events"
Click "Add webhook" → test it (Test → Push events) and verify that Jenkins starts a build.
3.6.3 Test the automatic Jenkins build
- Modify the image version locally and commit
root@master:~/gitlab/e-commerce-platform# vim product-service-deploy.yaml
image: [您的阿里云镜像仓库地址]/product-service-test/product-service:v2
root@master:~/gitlab/e-commerce-platform# git add .
root@master:~/gitlab/e-commerce-platform# git commit -m "test 自动构建,修改了版本号"[main 84df6a9] test 自动构建,修改了版本号 1file changed, 1 insertion(+), 1 deletion(-)
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 334 bytes |334.00 KiB/s, done.
Total 3(delta 2), reused 0(delta 0), pack-reused 0
To ssh://[GitLab 公网 IP]:2222/root/e-commerce-platform.git
8e98ca7..84df6a9 main -> main
The image version is bumped with each build
The image was uploaded to the ACR repository successfully
Build log output
Started by GitLab push by Administrator Obtained Jenkinsfile from git http://[GitLab 公网 IP]/root/e-commerce-platform.git [Pipeline] Start of Pipeline [Pipeline]node Running on Jenkins in /var/jenkins_home/workspace/product-service-ci [Pipeline]{[Pipeline] stage [Pipeline]{(Declarative: Checkout SCM)[Pipeline] checkout The recommended git tool is: NONE using credential Gitlab-token-us >git rev-parse --resolve-git-dir /var/jenkins_home/workspace/product-service-ci/.git
3.7 Configure Argo CD Auto-Sync
1. In the Argo CD UI → New App:
Application Name: my-demo-app
Project: default
Sync Policy: check "Automatic", "Prune Resources", and "Self Heal"
Source:
- Repository URL: the GitLab k8s-manifests repository address
- Revision: main
- Path: ./ (the directory containing the deployment manifests)
Destination:
- Cluster URL: https://kubernetes.default.svc (the local K8s cluster)
- Namespace: default
2. Click Create; Argo CD automatically syncs the deployment manifests into the local K8s cluster.
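The same Application can also be declared in YAML instead of through the UI (a sketch mirroring the fields above; the repository URL placeholder is this project's GitLab address):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-demo-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: http://[GitLab 公网 IP]/root/e-commerce-platform.git
    targetRevision: main
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true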
Rolling update
Argo CD shows the sync as completed
3.8 End-to-End Test
3.8.1 Trigger the CI/CD flow
- Modify the ConfigMap locally, commit, and push to GitLab:
Original page content
root@master:~/gitlab/e-commerce-platform# vim product-service-welcome-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: welcome-nginx-cm
  namespace: product
data:
  index.html: |
    <!DOCTYPE html><html><head><title>Welcome</title></head><body><h1>v2</h1></body></html>
The page heading is changed from v1 to v2, and version.txt is bumped accordingly:
root@master:~/gitlab/e-commerce-platform# vim version.txt
v2
root@master:~/gitlab/e-commerce-platform# git add .
root@master:~/gitlab/e-commerce-platform# git commit -m "v2"[main 2c5da2e] v2 2 files changed, 2 insertions(+), 2 deletions(-)
root@master:~/gitlab/e-commerce-platform# git push origin main
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 344 bytes |344.00 KiB/s, done.
Total 4(delta 2), reused 2(delta 1), pack-reused 0
To ssh://[GitLab 公网 IP]:2222/root/e-commerce-platform.git
f8f2f1b..2c5da2e main -> main
After the push, Jenkins triggered the automatic build
Build status: completed
The new version of the image was pushed successfully
Argo CD is continuously deploying
The ConfigMap content changed from v1 to v2
The Pod image also changed to v2
The page in the browser now shows v2
4 Monitoring Stack (local deployment)
4.1 Prerequisites
4.1.1 Create the ACR namespace
4.1.2 Pull the images and push them to the corresponding ACR repositories
Pushing with the monitoring_k8s namespace prefix plus a repository name creates the repository automatically
root@master:~# ctr images pull docker.io/prom/node-exporter:v1.8.1
root@master:~# nerdctl tag prom/node-exporter:v1.8.1 [您的阿里云镜像仓库地址]/monitoring_k8s/node-exporter:v1.8.1
root@master:~# nerdctl push [您的阿里云镜像仓库地址]/monitoring_k8s/node-exporter:v1.8.1
root@master:~# ctr images pull docker.io/prom/prometheus:v2.53.1
root@master:~# nerdctl tag prom/prometheus:v2.53.1 [您的阿里云镜像仓库地址]/monitoring_k8s/prometheus:v2.53.1
root@master:~# nerdctl push [您的阿里云镜像仓库地址]/monitoring_k8s/prometheus:v2.53.1
root@master:~# ctr images pull docker.io/grafana/grafana:11.2.0
root@master:~# nerdctl tag grafana/grafana:11.2.0 [您的阿里云镜像仓库地址]/monitoring_k8s/grafana:11.2.0
root@master:~# nerdctl push [您的阿里云镜像仓库地址]/monitoring_k8s/grafana:11.2.0
root@master:~# ctr images pull docker.io/prom/blackbox-exporter:v0.24.0
root@master:~# nerdctl tag prom/blackbox-exporter:v0.24.0 [您的阿里云镜像仓库地址]/monitoring_k8s/blackbox-exporter:v0.24.0
root@master:~# nerdctl push [您的阿里云镜像仓库地址]/monitoring_k8s/blackbox-exporter:v0.24.0
root@master:~/yaml/monitoring# ctr images pull docker.io/prom/alertmanager:v0.26.0
root@master:~/yaml/monitoring# nerdctl tag prom/alertmanager:v0.26.0 [您的阿里云镜像仓库地址]/monitoring_k8s/alertmanager:v0.26.0
root@master:~/yaml/monitoring# nerdctl push [您的阿里云镜像仓库地址]/monitoring_k8s/alertmanager:v0.26.0
root@master:~/yaml/filebeat# ctr images pull docker.io/elastic/filebeat:8.11.0
root@master:~/yaml/filebeat# nerdctl tag elastic/filebeat:8.11.0 [您的阿里云镜像仓库地址]/logging_k8s/filebeat:8.11.0
root@master:~/yaml/filebeat# nerdctl push [您的阿里云镜像仓库地址]/logging_k8s/filebeat:8.11.0
4.1.3 Create the monitoring namespace
root@master:~# kubectl create ns monitoring
4.2 Deploy node-exporter
root@master:~/yaml# ls product-service secret
root@master:~/yaml# mkdir monitoring
root@master:~/yaml# kubectl create secret docker-registry acr-pull-secret \
--namespace=monitoring \
--docker-server=[您的阿里云镜像仓库地址] \
--docker-username=[您的用户名] \
--docker-password='[您的密码]'
secret/acr-pull-secret created
root@master:~/yaml/monitoring# vim node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
hostNetwork: true
hostPID: true
imagePullSecrets:
- name: acr-pull-secret
containers:
- name: node-exporter
image: [您的阿里云镜像仓库地址]/monitoring_k8s/node-exporter:v1.8.1
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)
securityContext:
privileged: true
volumeMounts:
- name: proc
mountPath: /host/proc
- name: sys
mountPath: /host/sys
- name: rootfs
mountPath: /rootfs
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
app: node-exporter
ports:
- name: metrics
port: 9100
targetPort: 9100
type: ClusterIP
root@master:~/yaml/monitoring# kubectl apply -f node-exporter.yaml
root@master:~/yaml# kubectl get pods -n monitoring -l app=node-exporter -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-7kcrl 1/1 Running 0 2m3s 192.168.0.200 master <none><none>
node-exporter-gknxb 1/1 Running 0 2m3s 192.168.0.202 node2 <none><none>
node-exporter-p99j6 1/1 Running 0 2m4s 192.168.0.203 node3 <none><none>
node-exporter-q5m95 1/1 Running 0 2m3s 192.168.0.201 node1 <none><none>
4.3 Deploy Prometheus
4.3.1 Configure Prometheus RBAC permissions
root@master:~/yaml/monitoring# vim prometheus-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
(the remaining ClusterRole rules, the ServiceAccount, and the ClusterRoleBinding were garbled here and are omitted; see the sketch below)
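A typical completion of this RBAC manifest (a sketch based on the standard Prometheus-on-K8s pattern, not the exact original) binds the ClusterRole to a prometheus ServiceAccount in the monitoring namespace:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring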
4.3.2 Configure the Prometheus scrape configuration
root@master:~/yaml/monitoring# vim prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
# ========== 全局配置(所有抓取任务的默认规则) ==========
global:
# 抓取指标的间隔:每 15 秒抓取一次所有监控目标的指标(默认值,可被单个 job 覆盖)
scrape_interval: 15s
# 规则评估间隔:每 15 秒评估一次告警规则/记录规则(如 PromQL 告警表达式)
evaluation_interval: 15s
# ========== 抓取配置列表(定义所有需要监控的目标) ==========
scrape_configs:
# 1. 抓取 Prometheus 自身的运行指标(监控监控系统本身)
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090'] # Prometheus 自身的指标端口(9090 为默认端口)
# 2. 抓取 K8s 集群节点的 Node Exporter 指标(K8s 自动发现)
- job_name: 'k8s-node-exporter' # K8s 服务发现配置:基于 K8s 的 Endpoints 自动发现监控目标
kubernetes_sd_configs:
- role: endpoints # 发现角色:Endpoints(Service 对应的后端 Pod 端点)
namespaces:
# 仅发现 monitoring 命名空间下的 Endpoints(Node Exporter 部署在此)
names: ['monitoring']
# 标签重写规则:过滤/修改目标的标签,只保留需要的监控目标
relabel_configs:
# 规则 1:仅保留 Service 标签包含 app=node-exporter 的 Endpoints
- source_labels: [__meta_kubernetes_service_label_app] # 源标签:K8s Service 的 app 标签
regex: node-exporter # 匹配规则:值为 node-exporter
action: keep # 动作:保留匹配的目标(不匹配的丢弃)
# 规则 2:仅保留端口名称为 metrics 的 Endpoints(Node Exporter 的端口名)
- source_labels: [__meta_kubernetes_endpoint_port_name] # 源标签:Endpoints 的端口名称
regex: metrics # 匹配规则:值为 metrics
action: keep # 动作:保留匹配的目标
# 3. 抓取 Blackbox Exporter 指标(页面/接口可用性监控)
- job_name: 'blackbox-exporter' # 指标路径:Blackbox Exporter 的探针接口(默认/probe)
metrics_path: /probe
# 请求参数:指定检测模块为 http_2xx(检测 HTTP 接口是否返回 200 状态码)
params:
module: [http_2xx]
# K8s 服务发现:自动发现 monitoring 命名空间下的 Blackbox Exporter Endpoints
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: ['monitoring']
# 标签重写规则:适配 Blackbox Exporter 的探针请求逻辑
relabel_configs:
# 规则 1:仅保留 Service 标签为 app=blackbox-exporter 的目标
- source_labels: [__meta_kubernetes_service_label_app]
regex: blackbox-exporter
action: keep
# 规则 2:将目标地址(__address__)作为探针请求的 target 参数
- source_labels: [__address__]
target_label: __param_target
# 规则 3:将 target 参数值作为 instance 标签(Prometheus UI 中显示的实例名)
- source_labels: [__param_target]
target_label: instance
# 规则 4:修改目标地址为 Blackbox Exporter 的 Service 地址(所有探针请求转发到这里)
- target_label: __address__
replacement: blackbox-exporter.monitoring.svc:9115 # Blackbox Service 的集群内地址
# 规则 5:将 instance 标签值赋值给 target 标签(便于在 Grafana 中筛选目标)
- source_labels: [instance]
regex: (.*)
target_label: target
replacement: ${1}
# 4. 抓取 K8s 集群核心组件:APIServer 指标
- job_name: 'kubernetes-apiservers' # K8s 服务发现:全局发现所有 Endpoints(APIServer 在 default 命名空间)
kubernetes_sd_configs:
- role: endpoints
# 访问协议:APIServer 仅支持 HTTPS
scheme: https
# TLS 配置:使用 K8s ServiceAccount 的 CA 证书(Pod 内默认挂载的证书)
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# 认证配置:使用 Pod 内默认挂载的 ServiceAccount Token(RBAC 权限认证)
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# 标签重写规则:仅保留 default 命名空间下 kubernetes Service 的 https 端口
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] # 匹配规则:命名空间=default、Service 名=kubernetes、端口名=https
regex: default;kubernetes;https
action: keep # 仅保留 APIServer 的 Endpoints(过滤其他无关目标)
4.3.3 Deploy Prometheus (Deployment + Service)
root@master:~/yaml/monitoring# vim prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
imagePullSecrets:
- name: acr-pull-secret
containers:
- name: prometheus
(the Prometheus container image, args, ports, volume mounts, volumes, and Service definition were garbled here and are omitted; see the sketch below)
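A minimal sketch of what the elided part typically looks like in this setup (assumptions: the ACR image pushed in 4.1.2, the prometheus-config ConfigMap from 4.3.2, the prometheus ServiceAccount from the RBAC sketch, and the NodePort 30090 used in 4.3.5):
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: [您的阿里云镜像仓库地址]/monitoring_k8s/prometheus:v2.53.1
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090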
4.3.4 Apply all Prometheus manifests
root@master:~/yaml/monitoring# kubectl apply -f .
root@master:~/yaml# kubectl get pod -n monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-7kcrl 1/1 Running 0 11m 192.168.0.200 master <none><none>
node-exporter-gknxb 1/1 Running 0 11m 192.168.0.202 node2 <none><none>
node-exporter-p99j6 1/1 Running 0 12m 192.168.0.203 node3 <none><none>
node-exporter-q5m95 1/1 Running 0 11m 192.168.0.201 node1 <none><none>
prometheus-68f95956cf-v5bh2 1/1 Running 0 32s 10.20.166.132 node1 <none><none>
4.3.5 Access verification
Open the node1 IP in a browser: 192.168.0.201:30090
4.4 Deploy Grafana
4.4.1 Deploy Grafana (Deployment + Service)
root@master:~/yaml/monitoring# vim grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
labels:
app: grafana
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
imagePullSecrets:
- name: acr-pull-secret
containers:
- name: grafana
(the Grafana container image, ports, environment, volumes, and Service definition were garbled here and are omitted; see the sketch below)
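A sketch of the elided Grafana container and Service (assumptions: the ACR image from 4.1.2, the admin password from 4.4.3, and the NodePort 30030 shown in the Service listing in 4.5.4):
      containers:
        - name: grafana
          image: [您的阿里云镜像仓库地址]/monitoring_k8s/grafana:11.2.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: "admin123"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30030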
4.4.2 Apply the Grafana manifests
root@master:~/yaml/monitoring# kubectl apply -f grafana-deployment.yaml
root@master:~/yaml# kubectl get pods -n monitoring -o wide -l app=grafana
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
grafana-57596f6bcb-5lw47 1/1 Running 0 2m39s 10.20.135.4 node3 <none><none>
4.4.3 Access verification
Open 192.168.0.201:30030 in a browser
Username: admin
Password: admin123
4.4.4 Configure the Grafana data source
Set the UI language
Home page → avatar in the top-right corner → Profile → Language
- After logging in to Grafana, go to Connections → Data sources → Add new data source;
- Choose Prometheus and set the URL to http://prometheus.monitoring.svc:9090 (the in-cluster Service address);
4.4.5 Import Grafana dashboards
- Dashboards in the left menu → New → Import
- Enter a dashboard ID and click Load:
- Node metrics: 1860 (Node Exporter Full: node CPU / memory / disk);
- K8s cluster monitoring: 7249 (Kubernetes Cluster Monitoring, cluster components);
4.5 Deploy Alertmanager
4.5.1 Write the Alertmanager core configuration file
The core Alertmanager configuration is alertmanager.yml, which contains routing rules, receivers, notification channels, and inhibition/silence rules.
root@master:~/yaml/monitoring# vim alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[您的邮箱]'
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_auth_username: '[您的邮箱]'
  smtp_auth_password: '[您的 SMTP 密码]'
  smtp_require_tls: true
route:
  receiver: 'chenjun'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
receivers:
  - name: 'chenjun'
    email_configs:
      - to: '[您的邮箱]'
        send_resolved: true
(a trailing inhibition/grouping line was garbled here and is omitted)
4.5.2 Store the configuration file as a ConfigMap
root@master:~/yaml/monitoring# kubectl create configmap alertmanager-config \
--namespace=monitoring \
--from-file=alertmanager.yml=./alertmanager.yml
4.5.3 Write the Alertmanager manifests (Deployment + Service)
root@master:~/yaml/monitoring# vim alertmanager-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
labels:
app: alertmanager
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
imagePullSecrets:
- name: acr-pull-secret
containers:
- name: alertmanager
image: [您的阿里云镜像仓库地址]/monitoring_k8s/alertmanager:v0.26.0
imagePullPolicy: IfNotPresent
args:
- --config.file=/etc/alertmanager/alertmanager.yml
- --storage.path=/alertmanager
(the remaining container fields — ports and the ConfigMap volume mount — plus the volumes and the NodePort Service were garbled here and are omitted; see the sketch below)
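A sketch of the elided part, consistent with the alertmanager-config ConfigMap above and the NodePort 30093 shown in 4.5.4 (an assumption, not the exact original manifest):
          ports:
            - containerPort: 9093
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: alertmanager
  ports:
    - port: 9093
      targetPort: 9093
      nodePort: 30093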
4.5.4 Deploy Alertmanager to the cluster
root@master:~/yaml/monitoring# kubectl apply -f alertmanager-deploy.yaml
root@master:~/yaml/monitoring# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-5757855787-6p69n 1/1 Running 0 58m
grafana-5d56cd8487-s5z22 1/1 Running 0 3h1m
node-exporter-7kcrl 1/1 Running 0 3h27m
node-exporter-gknxb 1/1 Running 0 3h27m
node-exporter-p99j6 1/1 Running 0 3h28m
node-exporter-q5m95 1/1 Running 0 3h27m
prometheus-6d756fcfff-4tc7h 1/1 Running 0 7m26s
root@master:~/yaml/monitoring# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager NodePort 10.100.74.196 <none>9093:30093/TCP 61m
grafana NodePort 10.98.127.132 <none>3000:30030/TCP 3h1m
node-exporter ClusterIP 10.107.29.187 <none>9100/TCP 3h28m
prometheus NodePort 10.96.26.78 <none>9090:30090/TCP 3h16m
4.5.5 Access the Alertmanager web UI
Open 192.168.0.201:30093 in a browser
If the Alertmanager page loads, the deployment succeeded.
4.5.6 Connect Prometheus to Alertmanager
root@master:~/yaml/monitoring# vim prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
      # Add the following block ---------------------------------
alerting:
alertmanagers:
- static_configs:
- targets: # Alertmanager 的 Service 地址
- alertmanager.monitoring.svc:9093
rule_files:
- "alert_rules.yml"
      # End of the added block -------------------------------------------
      # (the scrape_configs section in between is unchanged and omitted here)
  # Finally, add the alert rules below as a new key at the same level as prometheus.yml:
alert_rules.yml: |
groups:
# 1. 节点级告警(服务器资源)
- name: node-resource-alerts
rules:
# 1.1 节点内存使用率过高
- alert: NodeHighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "节点内存使用率过高"
description: "节点 {{ $labels.instance }} 内存使用率超过 85% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 1.2 节点内存使用率紧急(临界值)
- alert: NodeCriticalMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 95
for: 2m
labels:
severity: critical
annotations:
summary: "节点内存使用率紧急"
description: "节点 {{ $labels.instance }} 内存使用率超过 95% (当前值:{{ printf \"%.2f\"$value }}%),已持续 2 分钟,可能导致服务不可用!"
# 1.3 节点 CPU 使用率过高
- alert: NodeHighCPUUsage
expr: 100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "节点 CPU 使用率过高"
description: "节点 {{ $labels.instance }} CPU 使用率超过 80% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 1.4 节点根磁盘使用率过高
- alert: NodeRootDiskHighUsage
expr: 100 * (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} > 85
for: 5m
labels:
severity: warning
annotations:
summary: "节点根磁盘使用率过高"
description: "节点 {{ $labels.instance }} 根目录 / 磁盘使用率超过 85% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 1.5 节点磁盘 IO 使用率过高
- alert: NodeHighDiskIO
expr: 100 * rate(node_disk_io_time_seconds_total{device!~"loop.*|sr.*"}[5m]) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "节点磁盘 IO 使用率过高"
description: "节点 {{ $labels.instance }} 的磁盘 {{ $labels.device }} IO 使用率超过 80% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 1.6 节点不可达(NodeExporter 失联)
- alert: NodeDown
expr: up{job=~"k8s-node-exporter|harbor-node-exporter|lb-node-exporter"} == 0
for: 3m
labels:
severity: critical
annotations:
summary: "节点监控失联"
description: "节点 {{ $labels.instance }} 的 NodeExporter 已失联超过 3 分钟,无法采集指标!"
# 2. K8s Pod/容器级告警
- name: k8s-pod-alerts
rules:
# 2.1 Pod 重启次数过多(1 小时内重启≥3 次)
- alert: PodRestartTooFrequent
expr: increase(kube_pod_container_restarts_total[1h]) >= 3
for: 10m
labels:
severity: warning
annotations:
summary: "Pod 重启次数过多"
description: "命名空间 {{ $labels.namespace }} 的 Pod {{ $labels.pod }} 容器 {{ $labels.container }} 1 小时内重启 {{ $value }} 次,可能存在服务异常。"
# 2.2 Pod 状态异常(Pending/Failed/Error)
- alert: PodStatusAbnormal
expr: kube_pod_status_phase{phase=~"Pending|Failed|Error"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Pod 状态异常"
description: "命名空间 {{ $labels.namespace }} 的 Pod {{ $labels.pod }} 状态为 {{ $labels.phase }},已持续 5 分钟。"
# 2.3 容器 CPU 使用率过高
- alert: ContainerHighCPUUsage
expr: (sum by (namespace, pod, container)(rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace, pod, container)(kube_pod_container_resource_limits_cpu_cores{container!=""})) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "容器 CPU 使用率过高"
description: "命名空间 {{ $labels.namespace }} 的 Pod {{ $labels.pod }} 容器 {{ $labels.container }} CPU 使用率超过 80% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 2.4 容器内存使用率过高
- alert: ContainerHighMemoryUsage
expr: (sum by (namespace, pod, container)(container_memory_usage_bytes{container!=""}) / sum by (namespace, pod, container)(kube_pod_container_resource_limits_memory_bytes{container!=""})) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "容器内存使用率过高"
description: "命名空间 {{ $labels.namespace }} 的 Pod {{ $labels.pod }} 容器 {{ $labels.container }} 内存使用率超过 85% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
# 3. K8s 核心组件告警
- name: k8s-component-alerts
rules:
# 3.1 APIServer 请求延迟过高
- alert: K8sAPIServerHighRequestLatency
expr: (apiserver_request_latency_seconds_sum{verb!~"LIST|WATCH"} / apiserver_request_latency_seconds_count{verb!~"LIST|WATCH"}) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "K8s APIServer 请求延迟过高"
description: "APIServer {{ $labels.instance }} {{ $labels.verb }} 请求平均延迟超过 500ms (当前值:{{ printf \"%.3f\"$value }}s),已持续 5 分钟。"
# 3.2 APIServer 错误率过高
- alert: K8sAPIServerHighErrorRate
expr: sum by (instance)(rate(apiserver_request_total{code=~"5.."}[5m])) / sum by (instance)(rate(apiserver_request_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "K8s APIServer 错误率过高"
description: "APIServer 5XX 错误率超过 5% (当前值:{{ printf \"%.2f\"$value }}%),已持续 5 分钟。"
Update the Prometheus ConfigMap and restart the Prometheus Pod:
root@master:~/yaml/monitoring# kubectl apply -f prometheus-config.yaml
root@master:~/yaml/monitoring# kubectl rollout restart deployment prometheus -n monitoring
root@master:~/yaml/monitoring# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-5757855787-6p69n 1/1 Running 0 64m
grafana-5d56cd8487-s5z22 1/1 Running 0 3h7m
node-exporter-7kcrl 1/1 Running 0 3h33m
node-exporter-gknxb 1/1 Running 0 3h33m
node-exporter-p99j6 1/1 Running 0 3h33m
node-exporter-q5m95 1/1 Running 0 3h33m
prometheus-6d756fcfff-4tc7h 1/1 Running 0 13m
Open the Prometheus web UI and confirm that the rules are active
4.5.7 Test e-mail alerting
To force a test alert, temporarily lower one of the thresholds (for example, change the NodeHighMemoryUsage expression to "> 1"), then re-apply the configuration and restart Prometheus:
root@master:~/yaml/monitoring# kubectl apply -f prometheus-config.yaml
root@master:~/yaml/monitoring# kubectl rollout restart deployment prometheus -n monitoring
Once the tightened rule has been firing for its configured duration, the alert e-mail arrives in the configured mailbox.
After the test, restore the original threshold and re-apply the configuration; Alertmanager then sends the corresponding "resolved" notification.
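Instead of editing thresholds, a synthetic alert can also be pushed straight to the Alertmanager v2 API to exercise the e-mail route. This is only a sketch; it assumes the Service is named alertmanager, lives in the monitoring namespace, and listens on port 9093. The test alert resolves on its own after Alertmanager's resolve_timeout.
# Push a one-off test alert to Alertmanager to verify the e-mail receiver.
kubectl -n monitoring port-forward svc/alertmanager 9093:9093 &
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"MailRouteTest","severity":"warning"},"annotations":{"summary":"Test alert for the e-mail route"}}]'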
4.6 Alibaba Cloud SLS Log Service
4.6.1 Deploying LoongCollector (local cluster, run on master)
Reference: Install LoongCollector as a DaemonSet or Sidecar in a Kubernetes cluster - Log Service - Alibaba Cloud
Uninstall guide: Operation and maintenance - Log Service - Alibaba Cloud
Create an AccessKey for the primary (root) account.
root@master:~/yaml# mkdir logging
root@master:~/yaml# cd logging
root@master:~/yaml/logging# wget https://aliyun-observability-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/loongcollector/k8s-custom-pkg/3.0.12/loongcollector-custom-k8s-package.tgz; tar xvf loongcollector-custom-k8s-package.tgz; chmod 744 ./loongcollector-custom-k8s-package/k8s-custom-install.sh
root@master:~/yaml/logging/loongcollector-custom-k8s-package# vim loongcollector/values.yaml
projectName: "k8s-pod-logs"
region: "cn-shenzhen"
aliUid: "[your Alibaba Cloud account ID]"
net: Internet
accessKeyID: "[your AccessKey ID]"
accessKeySecret: "[your AccessKey Secret]"
clusterID: "k8s-pod"
root@master:~/yaml/logging/loongcollector-custom-k8s-package# bash k8s-custom-install.sh install
root@master:~/yaml/logging/loongcollector-custom-k8s-package# kubectl get po -n kube-system -o wide | grep loongcollector-ds
loongcollector-ds-6hcvp 1/1 Running 0 78s 10.20.166.154 node1 <none> <none>
loongcollector-ds-hhklj 1/1 Running 0 78s 10.20.104.20 node2 <none> <none>
loongcollector-ds-jx4ll 1/1 Running 0 78s 10.20.135.23 node3 <none> <none>
loongcollector-ds-wj8c7 1/1 Running 0 78s 10.20.219.71 master <none> <none>
After the components are installed, Log Service automatically creates the required resources (project, machine group, and so on), which can be viewed in the Log Service console.
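A couple of quick checks from the master node confirm the DaemonSet is healthy before moving on to the console. This assumes the default resource names used by the loongcollector-custom-k8s-package: the DaemonSet is named loongcollector-ds, matching the Pods listed above.
# Confirm the DaemonSet has the expected number of ready Pods and is not erroring.
kubectl -n kube-system get daemonset loongcollector-ds
# Add -c <container> if the Pod runs more than one container.
kubectl -n kube-system logs daemonset/loongcollector-ds --tail=20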
4.6.2 Creating log collection rules
1. Stdout log collection (container logs)
Select the project -> Create Logstore -> Data Import -> choose the "K8s stdout - new template".
2. Configure the machine group
Scenario: Kubernetes
Deployment mode: self-managed cluster, DaemonSet
Add machine group: k8s-group-k8s-pod
3. Logtail configuration
Global settings: enter a collection config name (e.g. k8s-stdout).
Container filter: select the target containers by Pod label, namespace, or container name (e.g. namespace kube-system).
Click Next.
Container stdout log collection is now configured; a quick verification sketch follows.
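To verify the stdout collection config end to end, a throwaway Pod that prints a recognizable string can be used. The Pod name and the busybox image here are arbitrary test choices; run the Pod in a namespace that matches the container filter configured above (add -n <namespace> if needed).
# Emit 20 test lines to stdout, then search the Logstore for "stdout-collect-test".
kubectl run stdout-logger --image=busybox --restart=Never -- \
  sh -c 'for i in $(seq 1 20); do echo "stdout-collect-test $i"; sleep 1; done'
# Clean up once the lines show up in the SLS console.
kubectl delete pod stdout-logger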
4. Host log collection for the local K8s cluster (all host nodes)
Prerequisites
Create a Project
If you have no Project available, create a basic one as follows; for configuration details see "Manage Projects" in the SLS documentation.
Log in to the Log Service console, click Create Project, complete the basic settings below, and keep the defaults for everything else:
- Region: choose an appropriate region based on where the logs originate; it cannot be changed after creation.
- Project name: must be globally unique within the region; it cannot be changed after creation.
Create a Logstore
If you have no Logstore available, create a basic one as follows; for configuration details see "Manage Logstores".
- Log in to the Log Service console and click the target Project in the Project list.
- In the Log Storage > Logstores tab, click the + icon.
- Enter a Logstore name and keep the remaining defaults.
Reference: Install the LoongCollector agent on Linux servers (by scenario) - Log Service - Alibaba Cloud
- Choose a transfer method and run the install command, replacing ${region_id} with the RegionID of the region the Project belongs to.
Download the install package: run the download command on the server; in the sample command, ${region_id} can be replaced with cn-hangzhou, for example.
root@master:~# mkdir logotail
root@master:~# cd logotail/
root@master:~/logotail# wget https://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/loongcollector.sh -O loongcollector.sh
--2026-01-09 09:16:04--  https://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/loongcollector.sh
The public (Internet) endpoint is used here: it suits most scenarios, typically cross-region access or servers in other clouds / self-hosted servers, but it is subject to bandwidth limits and may be less stable.
root@master:~/logotail# chmod +x loongcollector.sh; ./loongcollector.sh install cn-shenzhen-internet
loongcollector.sh version: 1.7.0
OS Arch: x86_64
OS Distribution: Ubuntu
current glibc version is :2.35
glibc >=2.12, and cpu flag meet
BIN_DIR: /usr/local/ilogtail
CONTROLLER_FILE: loongcollectord
update-rc.d del loongcollectord successfully.
Uninstall loongcollector successfully.
RUNUSER:root
Download package from region cn-shenzhen-internet ...
Package address: http://aliyun-observability-release-cn-shenzhen.oss-cn-shenzhen.aliyuncs.com/loongcollector/linux64/latest/x86_64/main/loongcollector-linux64.tar.gz
[1] Download loongcollector-linux64.tar.gz successfully.
Generate config successfully.
Installing loongcollector in /usr/local/ilogtail ...
sysom-cn-shenzhenPreparing eBPF enviroment ...
Found valid btf file: /sys/kernel/btf/vmlinux
Prepare eBPF enviroment successfully
agent stub for telegraf has been installed
agent stub for jvm has been installed
Install loongcollector files successfully.
Configuring loongcollector service...
Use systemd for startup
service_file_path: /etc/systemd/system/loongcollectord.service
Synchronizing state of loongcollectord.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable loongcollectord
Created symlink /etc/systemd/system/default.target.wants/loongcollectord.service → /etc/systemd/system/loongcollectord.service.
systemd startup successfully.
Synchronizing state of ilogtaild.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable ilogtaild
Created symlink /etc/systemd/system/default.target.wants/ilogtaild.service → /etc/systemd/system/ilogtaild.service.
Configure loongcollector successfully.
Starting loongcollector ...
Start loongcollector successfully.
{"UUID":"DD64E1D0-ECF9-11F0-92B1-9D94276D7AA7", "compiler":"GCC 9.3.1", "host_id":"DCCBAF1A-ECF9-11F0-92B1-9D94276D7AA7", "hostname":"master", "instance_id":"DD64D532-ECF9-11F0-92B1-9D94276D7AA7_192.168.0.200_1767921834", "ip":"192.168.0.200", "loongcollector_version":, :, :}
- Check the startup status: run the command below; a response of `loongcollector is running` means the agent started successfully.
root@master:~/logotail# sudo /etc/init.d/loongcollectord status
loongcollector is running
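Since the installer registered a systemd unit (see the service_file_path line in the transcript above), the agent can equally be inspected with systemctl and journalctl:
# Same status check via systemd; useful when the SysV wrapper is not convenient.
systemctl status loongcollectord
journalctl -u loongcollectord --since "10 min ago"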
- Configure the user ID: the user ID file contains the ID of the Alibaba Cloud primary account that owns the Project, and marks that this account is allowed to access and collect logs from this server.
A user ID only needs to be configured when collecting logs from ECS instances belonging to another account, self-hosted servers, or servers in other clouds. If multiple accounts collect logs from the same server, multiple user ID files can be created on it.
- Log in to the Log Service console, hover over the avatar in the upper-right corner, and copy the account ID from the pop-up. Make sure you copy the primary account ID.
- On the server where LoongCollector is installed, create a user ID file whose file name is the primary account ID.
root@master:~/logotail# touch /etc/ilogtail/users/[your Alibaba Cloud account ID]
- Configure the machine group: Log Service uses the machine group to discover the custom identifier and establish a heartbeat with the LoongCollector agent on the host.
- On the server, write the custom string user-defined-test-1 into the custom identifier file; this string is used again in the following steps.
root@master:~/logotail# echo "user-defined-test-1" > /etc/ilogtail/user_defined_id
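Before creating the machine group, it is worth confirming that both identification files are in place; the bracketed file name stands for the real account ID used above.
# The users directory should contain a file named after the primary account ID,
# and user_defined_id should contain the custom identifier string.
ls /etc/ilogtail/users/
cat /etc/ilogtail/user_defined_id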
- Create the machine group in the console:
- Log in to the Log Service console and click the target Project in the Project list.
- In the left navigation, click Resources, then click Machine Groups; click the icon next to Machine Groups and choose Create Machine Group.
- Complete the following settings and click OK:
- Machine group name: unique within the Project; must start and end with a lowercase letter or digit, may contain only lowercase letters, digits, hyphens (-) and underscores (_), and must be 3-128 characters long.
- Machine group identifier: select Custom identifier.
- Custom identifier: enter the identifier configured above; it must match the string in the identifier file on the server, in this case user-defined-test-1.
- After the machine group is created, click it in the machine group list and check the heartbeat status under Machine Group Status. If it shows FAIL, wait about two minutes and refresh; OK means the group was created successfully.
Repeat the same installation and configuration steps on the remaining servers (a scripted sketch follows).
After installation, a collection configuration still has to be created before any logs are actually collected.
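A minimal sketch for rolling the same host-level installation out to the remaining nodes over SSH. It assumes password-less root SSH to node1/node2/node3, the same region parameter as above, and that [your Alibaba Cloud account ID] is replaced with the real ID before running.
# Install LoongCollector and write the identification files on each remaining node.
for host in node1 node2 node3; do
  scp loongcollector.sh root@${host}:/root/
  ssh root@${host} '
    chmod +x /root/loongcollector.sh &&
    /root/loongcollector.sh install cn-shenzhen-internet &&
    mkdir -p /etc/ilogtail/users &&
    touch "/etc/ilogtail/users/[your Alibaba Cloud account ID]" &&
    echo "user-defined-test-1" > /etc/ilogtail/user_defined_id'
done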
5. ECS server log collection (same procedure as host log collection)
Log collection on the Alibaba Cloud ECS instances follows exactly the same steps as the host log collection above: install LoongCollector with the install script, check its status, configure the user ID file and the custom identifier, add the servers to a machine group, and create a collection configuration. One ECS-specific note on the network endpoint follows.
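For ECS instances in the same region as the SLS Project, the agent can usually use the intranet endpoint instead of the public one by dropping the -internet suffix from the region parameter, which avoids public bandwidth limits; verify the exact endpoint form for your region in the SLS documentation before relying on it.
# Intranet install on an ECS instance in the same region as the Project (sketch).
./loongcollector.sh install cn-shenzhen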