SRE Daily Topic #12: CI/CD Pipeline Design and Practice
Date: 2026-03-12
Topic index: 12 (12 % 12 = 0 → topic 12)
Author: SRE Team
Table of Contents
- Overview
- CI/CD Architecture Design
- Complete GitLab CI/CD Configuration
- GitHub Actions Configuration Templates
- Jenkins Pipeline Best Practices
- Docker Image Build Optimization
- Kubernetes Deployment Strategies
- Environment Management and Configuration
- Quality Gates and Testing Strategy
- Monitoring and Observability
- Troubleshooting Handbook
- Security Best Practices
Overview
Core CI/CD value
- Continuous Integration (CI): merge code to trunk frequently, with automated builds and tests on every merge
- Continuous Delivery (CD): automate the release process so any passing build can be deployed to production at will
- Continuous Deployment: automatically deploy to production once all tests pass
Key metrics
| Metric | Target | Notes |
| --- | --- | --- |
| Build time | < 10 min | Duration of a single CI pipeline run |
| Deployment frequency | Multiple per day | Number of production deployments |
| Change failure rate | < 5% | Share of deployments that cause an incident |
| Mean time to recovery (MTTR) | < 1 hour | Average time to recover from a failure |
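The metrics above can be derived from a deploy log. A minimal sketch, assuming a hypothetical log format of one `<date> <status>` line per deployment (the log path and format are illustrative, not a real tool's output):

```shell
#!/bin/sh
# Sketch: compute deployment frequency and change failure rate
# from a hypothetical deploy log (one "<date> <status>" line per deploy).
cat > /tmp/deploys.log <<'EOF'
2026-03-10 success
2026-03-10 success
2026-03-11 failed
2026-03-11 success
EOF

total=$(wc -l < /tmp/deploys.log)                         # all deployments
failed=$(grep -c ' failed$' /tmp/deploys.log)             # failed deployments
days=$(cut -d' ' -f1 /tmp/deploys.log | sort -u | wc -l)  # distinct deploy days

echo "deploys per day: $((total / days))"
echo "change failure rate: $((failed * 100 / total))%"
```

In practice the log would come from your deploy tooling (e.g. pipeline job exports), and MTTR would need incident open/close timestamps rather than deploy statuses.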
CI/CD Architecture Design
Reference architecture
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Developer  │───▶│  Git Repo   │───▶│  CI Runner  │
└─────────────┘    └─────────────┘    └─────────────┘
                                             │
                                             ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Production  │◀───│   Staging   │◀───│  Registry   │
└─────────────┘    └─────────────┘    └─────────────┘
Component selection
| Component | Recommended | Alternatives |
| --- | --- | --- |
| Code hosting | GitLab / GitHub | Bitbucket |
| CI engine | GitLab CI / GitHub Actions | Jenkins / CircleCI |
| Image registry | Harbor / Docker Hub | ECR / GCR |
| Deployment target | Kubernetes | ECS / VM |
| Config management | Helm / Kustomize | Ansible |
Complete GitLab CI/CD Configuration
Complete .gitlab-ci.yml example
stages:
- validate
- build
- test
- security
- deploy-staging
- deploy-production
variables:
DOCKER_REGISTRY: registry.example.com
DOCKER_IMAGE: $CI_PROJECT_PATH
KUBE_NAMESPACE: $CI_PROJECT_NAME-$CI_COMMIT_REF_SLUG
HELM_RELEASE: $CI_PROJECT_NAME
MAVEN_OPTS: "-Dmaven.repo.local=.m2/repository"
PIP_CACHE_DIR: "$CI_PROJECT_DIR/.pip-cache"
cache:
key: "$CI_COMMIT_REF_SLUG"
paths:
- .m2/repository/
- .pip-cache/
- node_modules/
policy: pull-push
default:
image: docker:24.0
services:
- docker:24.0-dind
tags:
- kubernetes
retry:
max: 2
when:
- runner_system_failure
- stuck_or_timeout_failure
validate:
stage: validate
image: node:20-alpine
script:
- npm ci
- npm run lint
- npm run format:check
rules:
- changes:
- "**/*.js"
- "**/*.ts"
- "**/*.jsx"
- "**/*.tsx"
- package*.json
build:
stage: build
image: docker:24.0
script:
- docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
- docker build
-t $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
-t $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
--build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
--build-arg VCS_REF=$CI_COMMIT_SHA
--cache-from $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
.
- docker push $DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
- docker push $DOCKER_REGISTRY/$DOCKER_IMAGE:latest
only:
- main
- develop
- /^release\/.*$/
artifacts:
reports:
dotenv: build.env
unit-test:
stage: test
image: node:20-alpine
script:
- npm ci
- npm run test:unit -- --coverage
coverage: '/Lines\s*:\s*\d+.\d+\s*\(.*?\)/'
artifacts:
reports:
junit: test-results/junit.xml
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
integration-test:
stage: test
image: docker:24.0
services:
- postgres:15-alpine
- redis:7-alpine
variables:
POSTGRES_DB: test_db
POSTGRES_USER: test_user
POSTGRES_PASSWORD: test_pass
REDIS_URL: redis://redis:6379
script:
- docker run --rm
-e DATABASE_URL=postgresql://test_user:test_pass@postgres:5432/test_db
-e REDIS_URL=$REDIS_URL
$DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
npm run test:integration
needs:
- build
security-scan:
stage: security
image: aquasec/trivy:latest
script:
- trivy image --exit-code 0 --severity UNKNOWN,LOW,MEDIUM
$DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
- trivy image --exit-code 1 --severity HIGH,CRITICAL
$DOCKER_REGISTRY/$DOCKER_IMAGE:$CI_COMMIT_SHA
allow_failure: true
rules:
- if: $CI_COMMIT_BRANCH == "main"
sast:
stage: security
image: semgrep/semgrep:latest
script:
- semgrep --config auto --json --output semgrep.json .
artifacts:
reports:
sast: semgrep.json
rules:
- if: $CI_COMMIT_BRANCH == "main"
dependency-check:
stage: security
image: node:20-alpine
script:
- npm audit --audit-level=high
allow_failure: true
rules:
- changes:
- package*.json
deploy-staging:
stage: deploy-staging
image: bitnami/kubectl:latest
script:
- kubectl config use-context staging
- kubectl create namespace $KUBE_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
- helm upgrade --install $HELM_RELEASE ./charts/$CI_PROJECT_NAME
--namespace $KUBE_NAMESPACE
--set image.tag=$CI_COMMIT_SHA
--set environment=staging
--wait --timeout 5m
environment:
name: staging
url: https://staging.$CI_PROJECT_NAME.example.com
rules:
- if: $CI_COMMIT_BRANCH == "develop"
- if: $CI_COMMIT_BRANCH == "main"
deploy-production:
stage: deploy-production
image: bitnami/kubectl:latest
script:
- kubectl config use-context production
- kubectl create namespace $KUBE_NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
- helm upgrade --install $HELM_RELEASE ./charts/$CI_PROJECT_NAME
--namespace $KUBE_NAMESPACE
--set image.tag=$CI_COMMIT_SHA
--set environment=production
--wait --timeout 10m
--atomic
environment:
name: production
url: https://$CI_PROJECT_NAME.example.com
when: manual
only:
- main
before_script:
    - echo "⚠️ Production deployment requires approval"
after_script:
- kubectl rollout status deployment/$HELM_RELEASE --namespace $KUBE_NAMESPACE --timeout=5m
rollback:
stage: deploy-production
image: bitnami/kubectl:latest
script:
- kubectl config use-context production
- helm rollback $HELM_RELEASE --namespace $KUBE_NAMESPACE
when: manual
variables:
GIT_STRATEGY: none
GitLab CI variable management
variables:
CI_DEBUG_TRACE: "false"
GIT_DEPTH: "50"
GIT_STRATEGY: clone
DOCKER_TLS_CERTDIR: "/certs"
DOCKER_BUILDKIT: "1"
DOCKER_DRIVER: overlay2
NODE_OPTIONS: "--max-old-space-size=4096"
NPM_CONFIG_LOGLEVEL: warn
GitHub Actions Configuration Templates
Complete workflow example
name: CI/CD Pipeline
on:
push:
branches: [main, develop]
tags: ['v*']
pull_request:
branches: [main, develop]
workflow_dispatch:
inputs:
environment:
description: 'Deploy environment'
required: true
default: 'staging'
type: choice
options:
- staging
- production
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
HELM_VERSION: 3.12.0
jobs:
validate:
name: Validate
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run linter
run: npm run lint
- name: Check formatting
run: npm run format:check
build:
name: Build
runs-on: ubuntu-latest
needs: validate
permissions:
contents: read
packages: write
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=
type=ref,event=branch
type=semver,pattern={{version}}
- name: Build and push
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
BUILD_DATE=${{ github.event.head_commit.timestamp }}
VCS_REF=${{ github.sha }}
test:
name: Test
runs-on: ubuntu-latest
needs: build
services:
postgres:
image: postgres:15-alpine
env:
POSTGRES_DB: test_db
POSTGRES_USER: test_user
POSTGRES_PASSWORD: test_pass
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
redis:
image: redis:7-alpine
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 6379:6379
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run unit tests
run: npm run test:unit -- --coverage
env:
CI: true
- name: Run integration tests
run: npm run test:integration
env:
DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
REDIS_URL: redis://localhost:6379
- name: Upload coverage
uses: codecov/codecov-action@v3
with:
files: ./coverage/lcov.info
fail_ci_if_error: false
security:
name: Security Scan
runs-on: ubuntu-latest
needs: build
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
- name: Run npm audit
run: npm audit --audit-level=high
continue-on-error: true
deploy-staging:
name: Deploy Staging
runs-on: ubuntu-latest
needs: [build, test, security]
if: github.ref == 'refs/heads/develop' || github.ref == 'refs/heads/main'
environment:
name: staging
url: https://staging.example.com
steps:
- uses: actions/checkout@v4
- name: Setup Helm
uses: azure/setup-helm@v3
with:
version: ${{ env.HELM_VERSION }}
- name: Configure kubectl
uses: azure/k8s-set-context@v3
with:
method: kubeconfig
kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}
- name: Deploy to Staging
run: |
helm upgrade --install ${{ github.event.repository.name }} ./charts/${{ github.event.repository.name }} \
--namespace ${{ github.event.repository.name }}-staging \
--set image.tag=${{ github.sha }} \
--set environment=staging \
--wait --timeout 5m
- name: Verify deployment
run: |
kubectl rollout status deployment/${{ github.event.repository.name }} \
--namespace ${{ github.event.repository.name }}-staging \
--timeout=5m
deploy-production:
name: Deploy Production
runs-on: ubuntu-latest
needs: deploy-staging
if: github.ref == 'refs/heads/main'
environment:
name: production
url: https://example.com
steps:
- uses: actions/checkout@v4
- name: Setup Helm
uses: azure/setup-helm@v3
with:
version: ${{ env.HELM_VERSION }}
- name: Configure kubectl
uses: azure/k8s-set-context@v3
with:
method: kubeconfig
kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}
- name: Deploy to Production
run: |
helm upgrade --install ${{ github.event.repository.name }} ./charts/${{ github.event.repository.name }} \
--namespace ${{ github.event.repository.name }}-prod \
--set image.tag=${{ github.sha }} \
--set environment=production \
--wait --timeout 10m \
--atomic
- name: Verify deployment
run: |
kubectl rollout status deployment/${{ github.event.repository.name }} \
--namespace ${{ github.event.repository.name }}-prod \
--timeout=5m
- name: Notify success
if: success()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "✅ Production deployment successful: ${{ github.event.repository.name }}@${{ github.sha }}"
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
rollback:
name: Rollback Production
runs-on: ubuntu-latest
needs: deploy-production
if: failure()
environment: production
steps:
- uses: actions/checkout@v4
- name: Setup Helm
uses: azure/setup-helm@v3
with:
version: ${{ env.HELM_VERSION }}
- name: Configure kubectl
uses: azure/k8s-set-context@v3
with:
method: kubeconfig
kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}
- name: Rollback deployment
run: |
helm rollback ${{ github.event.repository.name }} \
--namespace ${{ github.event.repository.name }}-prod
Jenkins Pipeline Best Practices
Jenkinsfile (Declarative Pipeline)
pipeline {
agent {
kubernetes {
yaml '''
apiVersion: v1
kind: Pod
spec:
containers:
- name: docker
image: docker:24.0
command:
- cat
tty: true
volumeMounts:
- name: docker-sock
mountPath: /var/run/docker.sock
- name: kubectl
image: bitnami/kubectl:latest
command:
- cat
tty: true
volumes:
- name: docker-sock
hostPath:
path: /var/run/docker.sock
'''
}
}
environment {
DOCKER_REGISTRY = 'registry.example.com'
DOCKER_IMAGE = "${env.JOB_NAME}/${env.BRANCH_NAME}".toLowerCase()
KUBE_NAMESPACE = "${env.JOB_NAME}-${env.BRANCH_NAME}".toLowerCase()
}
options {
timeout(time: 60, unit: 'MINUTES')
retry(2)
disableConcurrentBuilds()
buildDiscarder(logRotator(numToKeepStr: '20'))
}
triggers {
pollSCM('H/5 * * * *')
cron('H 2 * * 1-5')
}
parameters {
        string(name: 'DEPLOY_ENV', defaultValue: 'staging', description: 'Deploy environment')
        booleanParam(name: 'SKIP_TESTS', defaultValue: false, description: 'Skip tests')
        choice(name: 'DEPLOY_TYPE', choices: ['blue-green', 'rolling', 'canary'], description: 'Deployment strategy')
}
stages {
stage('Checkout') {
steps {
checkout scm
script {
env.GIT_COMMIT_SHORT = sh(script: 'git rev-parse --short HEAD', returnStdout: true).trim()
}
}
}
stage('Validate') {
steps {
script {
if (fileExists('package.json')) {
sh 'npm ci'
sh 'npm run lint'
}
}
}
}
stage('Build') {
steps {
script {
withCredentials([usernamePassword(credentialsId: 'docker-registry', usernameVariable: 'DOCKER_USER', passwordVariable: 'DOCKER_PASS')]) {
sh """
docker login -u ${DOCKER_USER} -p ${DOCKER_PASS} ${DOCKER_REGISTRY}
docker build -t ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT} .
docker push ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT}
"""
}
}
}
post {
success {
archiveArtifacts artifacts: '**/target/*.jar', allowEmptyArchive: true
}
}
}
stage('Test') {
when {
expression { return !params.SKIP_TESTS }
}
parallel {
stage('Unit Tests') {
steps {
script {
if (fileExists('package.json')) {
sh 'npm run test:unit -- --coverage'
}
}
}
post {
always {
junit allowEmptyResults: true, testResults: '**/test-results/*.xml'
publishHTML(target: [
allowMissing: true,
alwaysLinkToLastBuild: true,
keepAllHistory: true,
reportDir: 'coverage',
reportFiles: 'index.html',
reportName: 'Code Coverage Report'
])
}
}
}
stage('Integration Tests') {
steps {
script {
withCredentials([string(credentialsId: 'db-url', variable: 'DATABASE_URL')]) {
sh 'npm run test:integration'
}
}
}
}
}
}
stage('Security Scan') {
steps {
script {
sh """
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \\
aquasec/trivy:latest image \\
--exit-code 0 --severity UNKNOWN,LOW,MEDIUM \\
${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT}
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \\
aquasec/trivy:latest image \\
--exit-code 1 --severity HIGH,CRITICAL \\
${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${GIT_COMMIT_SHORT}
"""
}
}
post {
always {
recordIssues enabledForFailure: true, tool: trivy(pattern: '**/trivy-report.json')
}
}
}
stage('Deploy Staging') {
when {
branch 'develop'
}
steps {
script {
withKubeConfig([credentialsId: 'kube-staging', serverUrl: '']) {
sh """
kubectl create namespace ${KUBE_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
helm upgrade --install ${JOB_NAME} ./charts/${JOB_NAME} \\
--namespace ${KUBE_NAMESPACE} \\
--set image.tag=${GIT_COMMIT_SHORT} \\
--set environment=staging \\
--wait --timeout 5m
"""
}
}
}
}
stage('Deploy Production') {
when {
allOf {
branch 'main'
expression { return params.DEPLOY_ENV == 'production' }
}
}
steps {
input message: 'Deploy to production?', ok: 'Deploy'
script {
withKubeConfig([credentialsId: 'kube-production', serverUrl: '']) {
sh """
kubectl create namespace ${KUBE_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
helm upgrade --install ${JOB_NAME} ./charts/${JOB_NAME} \\
--namespace ${KUBE_NAMESPACE} \\
--set image.tag=${GIT_COMMIT_SHORT} \\
--set environment=production \\
--wait --timeout 10m \\
--atomic
"""
}
}
}
post {
success {
script {
slackSend(channel: '#deployments', color: 'good', message: "✅ Production deployment successful: ${JOB_NAME}@${GIT_COMMIT_SHORT}")
}
}
failure {
script {
slackSend(channel: '#deployments', color: 'danger', message: "❌ Production deployment failed: ${JOB_NAME}@${GIT_COMMIT_SHORT}")
}
}
}
}
}
post {
always {
cleanWs()
script {
currentBuild.description = "Build ${GIT_COMMIT_SHORT} - ${currentBuild.result}"
}
}
success {
script {
if (env.BRANCH_NAME == 'main') {
sh 'git tag -a v${BUILD_NUMBER} -m "Release ${BUILD_NUMBER}" || true'
sh 'git push origin v${BUILD_NUMBER} || true'
}
}
}
failure {
script {
emailext(
subject: "Build failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
body: """Build failed. Check console output at ${BUILD_URL}""",
to: 'team@example.com',
recipientProviders: [[$class: 'DevelopersRecipientProvider']]
)
}
}
}
}
Docker Image Build Optimization
Multi-stage Dockerfile
# Dockerfile - multi-stage build
ARG NODE_VERSION=20-alpine
ARG ALPINE_VERSION=3.19
# ========== Stage 1: Dependencies ==========
FROM node:${NODE_VERSION} AS deps
WORKDIR /app
# Copy the package manifests first so this layer stays cached until they change
COPY package*.json ./
# Install all dependencies here: the builder stage needs devDependencies
# (e.g. the TypeScript compiler) and prunes them after the build
RUN npm ci && \
    npm cache clean --force
# ========== Stage 2: Build ==========
FROM node:${NODE_VERSION} AS builder
WORKDIR /app
# Reuse the dependency layer
COPY --from=deps /app/node_modules ./node_modules
COPY package*.json ./
COPY tsconfig*.json ./
COPY src/ ./src/
# Build the app, then drop devDependencies from node_modules
RUN npm run build && \
    npm prune --production
# ========== Stage 3: Production ==========
FROM node:${NODE_VERSION} AS production
WORKDIR /app
# Create a non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
# Copy build artifacts only - no sources, no build tooling
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./
# Drop privileges
USER nodejs
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"
# Expose the application port
EXPOSE 3000
# Start command
CMD ["node", "dist/index.js"]
# ========== Labels (build args) ==========
ARG BUILD_DATE
ARG VCS_REF
LABEL org.label-schema.build-date="${BUILD_DATE}" \
      org.label-schema.vcs-ref="${VCS_REF}" \
      org.label-schema.schema-version="1.0"
Docker build optimization tips
#!/bin/bash
# Enable BuildKit for better layer caching and parallel stages
export DOCKER_BUILDKIT=1
IMAGE_NAME="myapp"
IMAGE_TAG="${GIT_COMMIT:-latest}"
REGISTRY="registry.example.com"
# Build, reusing layers from the last published image as cache
docker build \
  --progress=plain \
  --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
  --build-arg VCS_REF=${GIT_COMMIT} \
  --cache-from ${REGISTRY}/${IMAGE_NAME}:latest \
  --tag ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \
  --tag ${REGISTRY}/${IMAGE_NAME}:latest \
  --file Dockerfile \
  .
docker push ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}
docker push ${REGISTRY}/${IMAGE_NAME}:latest
# Remove dangling images to keep runner disk usage in check
docker image prune -f
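On top of `--cache-from`, BuildKit cache mounts keep package-manager caches across builds without enlarging any image layer. A sketch, assuming a Node.js app (the base image and paths are illustrative):

```dockerfile
# syntax=docker/dockerfile:1
# Sketch: the npm cache persists between builds via a BuildKit cache
# mount, but is never baked into a layer (requires DOCKER_BUILDKIT=1).
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN --mount=type=cache,target=/root/.npm \
    npm ci --omit=dev
```

The same pattern applies to `~/.m2` for Maven or `~/.cache/pip` for Python builds.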
.dockerignore optimization
# .dockerignore
# Git
.git
.gitignore
.gitattributes
# Documentation
*.md
!README.md
docs/
# Tests
test/
tests/
__tests__/
*.test.js
*.spec.js
coverage/
# Development
.env
.env.*
!.env.example
.vscode/
.idea/
*.log
npm-debug.log*
# Dependencies (will be installed in container)
node_modules/
# Build artifacts
dist/
build/
*.tgz
# Docker
Dockerfile*
docker-compose*.yml
.docker/
# CI/CD
.github/
.gitlab-ci.yml
Jenkinsfile
.travis.yml
.circleci/
# OS files
.DS_Store
Thumbs.db
Kubernetes Deployment Strategies
Helm Chart template
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "myapp.fullname" . }}
labels:
{{- include "myapp.labels" . | nindent 4 }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "myapp.selectorLabels" . | nindent 6 }}
strategy:
type: {{ .Values.deploymentStrategy.type }}
{{- if eq .Values.deploymentStrategy.type "RollingUpdate" }}
rollingUpdate:
maxSurge: {{ .Values.deploymentStrategy.rollingUpdate.maxSurge }}
maxUnavailable: {{ .Values.deploymentStrategy.rollingUpdate.maxUnavailable }}
{{- end }}
template:
metadata:
annotations:
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
labels:
{{- include "myapp.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ include "myapp.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.port }}
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }}
failureThreshold: {{ .Values.probes.readiness.failureThreshold }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
env:
- name: NODE_ENV
value: {{ .Values.environment }}
- name: PORT
value: {{ .Values.service.port | quote }}
envFrom:
- configMapRef:
name: {{ include "myapp.fullname" . }}-config
- secretRef:
name: {{ include "myapp.fullname" . }}-secret
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
values-production.yaml
replicaCount: 3
image:
repository: registry.example.com/myapp
pullPolicy: IfNotPresent
tag: ""
deploymentStrategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
environment: production
probes:
liveness:
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readiness:
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
podDisruptionBudget:
enabled: true
minAvailable: 2
service:
type: ClusterIP
port: 80
ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/rate-limit: "100"
nginx.ingress.kubernetes.io/rate-limit-window: "1m"
hosts:
- host: myapp.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: myapp-tls
hosts:
- myapp.example.com
podSecurityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
nodeSelector: {}
tolerations: []
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/name: myapp
topologyKey: kubernetes.io/hostname
Deployment strategy comparison
| Strategy | Use case | Pros | Cons |
| --- | --- | --- | --- |
| RollingUpdate | Default choice | Zero downtime, resource-efficient | Old and new versions coexist briefly |
| Blue-Green | Critical applications | Instant rollback, full isolation | Doubles resource usage |
| Canary | High-risk changes | Controlled, incremental risk | More complex configuration |
| Recreate | Development environments | Simple | Causes downtime |
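In plain Kubernetes, Blue-Green often reduces to switching a Service selector between two parallel Deployments (`myapp-blue` / `myapp-green`). A minimal sketch; the resource names and label scheme are hypothetical:

```yaml
# Blue-Green sketch: both Deployments run side by side; the Service
# selector decides which color receives live traffic.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app.kubernetes.io/name: myapp
    color: blue        # flip to "green" to cut over; flip back to roll back
  ports:
    - port: 80
      targetPort: 3000
```

Cutover is then a one-line patch, e.g. `kubectl patch service myapp -p '{"spec":{"selector":{"color":"green"}}}'`, and rollback is the same patch in reverse.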
Canary deployment example
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: myapp
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
progressDeadlineSeconds: 600
service:
port: 80
targetPort: 3000
analysis:
interval: 1m
threshold: 10
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
webhooks:
- name: load-test
type: pre-rollout
url: http://flagger-loadtester.testing/
timeout: 60s
Environment Management and Configuration
Environment variable management
apiVersion: v1
kind: ConfigMap
metadata:
name: myapp-config
data:
LOG_LEVEL: "info"
LOG_FORMAT: "json"
ENABLE_METRICS: "true"
CACHE_TTL: "3600"
MAX_CONNECTIONS: "100"
  TIMEOUT_MS: "30000"
---
apiVersion: v1
kind: Secret
metadata:
name: myapp-secret
type: Opaque
stringData:
DATABASE_URL: "postgresql://user:pass@db:5432/myapp"
REDIS_URL: "redis://redis:6379"
JWT_SECRET: "change-me-in-production"
API_KEY: "sk-xxx"
Configuration management best practices
#!/bin/bash
ENV=${1:-staging}
# Apply the Kustomize overlay for the target environment
kubectl apply -k overlays/${ENV}/
# Layer environment-specific values on top of the base values file
helm upgrade --install myapp ./charts/myapp \
  -f values.yaml \
  -f values-${ENV}.yaml \
  --namespace myapp-${ENV}
# Verify the rendered config objects
kubectl get configmap,secret -l app=myapp -n myapp-${ENV}
Quality Gates and Testing Strategy
Quality gate criteria
| Check | Threshold | Stage |
| --- | --- | --- |
| Unit test coverage | ≥ 80% | CI |
| Code duplication | ≤ 5% | CI |
| Critical security vulnerabilities | 0 | CI |
| Build time | ≤ 10 min | CI |
| E2E test pass rate | 100% | CD-Staging |
| Performance regression | ≤ 5% | CD-Staging |
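A gate such as the coverage threshold can be enforced by a small script that fails the CI job. A sketch, with a hard-coded value standing in for the figure a real job would parse from the test runner's summary:

```shell
#!/bin/sh
# Sketch of a coverage quality gate: exit nonzero when line coverage
# drops below the threshold, which fails the CI job.
THRESHOLD=80
COVERAGE=83   # illustrative; normally parsed with grep/jq from the coverage report
if [ "$COVERAGE" -lt "$THRESHOLD" ]; then
  echo "coverage gate FAILED: ${COVERAGE}% < ${THRESHOLD}%"
  exit 1
fi
echo "coverage gate passed: ${COVERAGE}% >= ${THRESHOLD}%"
```

Duplication and vulnerability gates follow the same shape: extract one number, compare it to the threshold, and let the exit code drive the pipeline.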
Test pyramid configuration
test-strategy:
unit-tests:
count: 500+
execution-time: "< 2 min"
run-on: "every commit"
integration-tests:
count: 50+
execution-time: "< 5 min"
run-on: "PR to main/develop"
e2e-tests:
count: 20+
execution-time: "< 10 min"
run-on: "before production deploy"
performance-tests:
count: 10+
execution-time: "< 15 min"
run-on: "weekly / before major release"
Monitoring and Observability
Pipeline monitoring metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: cicd-monitor
spec:
selector:
matchLabels:
app: gitlab-runner
endpoints:
- port: metrics
interval: 30s
path: /metrics
Key alerts
groups:
- name: cicd-alerts
rules:
- alert: HighBuildFailureRate
expr: |
sum(rate(gitlab_ci_builds_failed_total[5m]))
/ sum(rate(gitlab_ci_builds_total[5m])) > 0.2
for: 5m
labels:
severity: warning
annotations:
          summary: "CI build failure rate above 20%"
- alert: LongBuildDuration
expr: |
histogram_quantile(0.95,
rate(gitlab_ci_build_duration_seconds_bucket[5m])) > 600
for: 10m
labels:
severity: warning
annotations:
          summary: "P95 build duration above 10 minutes"
- alert: DeploymentFailure
expr: |
sum(rate(deployment_failures_total[5m])) > 0
for: 1m
labels:
severity: critical
annotations:
          summary: "Production deployment failed"
Monitoring commands
gitlab-ci-stats --project myapp --days 30
kubectl get pods -n gitlab -l app=gitlab-runner -o json | \
jq '.items[].status.containerStatuses[].state'
kubectl rollout status deployment/myapp --namespace production
kubectl get deployments -A --sort-by='.metadata.creationTimestamp'
kubectl get events --namespace production --field-selector type=Warning
Troubleshooting Handbook
Common issues and fixes
1. Build failure: dependency install timeout
curl -I https://registry.npmjs.org
npm config set registry https://registry.npmmirror.com
npm cache clean --force
rm -rf node_modules package-lock.json
npm install
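When the registry is merely slow rather than down, a retry wrapper with exponential backoff often avoids failing the whole pipeline on one timeout. A sketch; in a real job the wrapped command would be `retry npm ci` or similar:

```shell
#!/bin/sh
# Sketch: retry a flaky command up to 3 times with exponential backoff.
retry() {
  attempt=1
  max=3
  delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1          # give up after the last attempt
    fi
    echo "attempt $attempt failed, retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Real use would be e.g.: retry npm ci
retry true && echo "retry: command succeeded"
```

The backoff keeps retries from hammering an already-struggling registry; cap `max` low so genuinely broken builds still fail fast.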
2. Docker build running out of memory
export DOCKER_BUILDKIT=1
export BUILDKIT_STEP_LOG_MAX_SIZE=10485760
3. Kubernetes deployment stuck
kubectl describe deployment myapp -n production
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl logs -f deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production
kubectl describe quota -n production
kubectl describe limitrange -n production
4. Helm deployment failure
helm history myapp -n production
helm status myapp -n production
helm rollback myapp 1 -n production
kubectl delete secret -n production -l owner=helm
helm upgrade --install myapp ./charts/myapp --atomic --timeout 10m
5. CI runner unresponsive
gitlab-runner verify --delete
gitlab-runner restart
journalctl -u gitlab-runner -f
gitlab-runner list
gitlab-runner register --url https://gitlab.example.com/
Troubleshooting flowchart
Build failure
│
├─→ Check the build log → locate the failing line
│
├─→ Reproduce locally → docker build --no-cache
│
├─→ Check dependencies → npm install / mvn dependency:tree
│
├─→ Check resources → memory / disk / CPU
│
└─→ Escalate to the platform team → runner configuration issue
Security Best Practices
CI/CD security checklist
Secret management
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: myapp-secret
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secretsmanager
kind: ClusterSecretStore
target:
name: myapp-secret
data:
- secretKey: DATABASE_URL
remoteRef:
key: myapp/database
property: url
Image security
cosign sign --key cosign.key registry.example.com/myapp:latest
cosign verify --key cosign.pub registry.example.com/myapp:latest
trivy image --severity HIGH,CRITICAL registry.example.com/myapp:latest
syft registry.example.com/myapp:latest -o spdx-json > sbom.json
Appendix: Quick Reference
Command cheat sheet
gitlab-runner register
gitlab-runner verify
gitlab-ci-local
act
java -jar jenkins-cli.jar -s http://localhost:8080/
docker build -t myapp:latest .
docker push myapp:latest
docker scout cves myapp:latest
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp
kubectl describe pod myapp-xxx
helm lint ./charts/myapp
helm template myapp ./charts/myapp
helm upgrade --install myapp ./charts/myapp
helm rollback myapp 1
推荐工具链
| 类别 |
工具 |
用途 |
| 本地 CI 测试 |
gitlab-ci-local, act |
本地验证流水线 |
| 镜像扫描 |
Trivy, Grype |
漏洞扫描 |
| 镜像签名 |
Cosign, Notary |
签名验证 |
| Secret 管理 |
External Secrets, SealedSecrets |
密钥管理 |
| 配置管理 |
Helm, Kustomize |
部署配置 |
| 渐进式交付 |
Flagger, Argo Rollouts |
Canary/蓝绿部署 |
| 工作流引擎 |
Argo Workflows, Tekton |
复杂流水线 |
Document version: 1.0
Last updated: 2026-03-12
Maintained by: SRE Team
Feedback: #sre-feedback