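SRE Daily Topic: Higress Cloud-Native Gateway Deployment and Production Practices
Date: 2026-03-13
Topic number: 1 (13 % 12 = 1)
Difficulty: ⭐⭐⭐⭐
Applies to: deploying a cloud-native gateway in production
1. Higress Overview
Higress is a cloud-native gateway open-sourced by Alibaba and built on Envoy and Istio. It provides:
- Traffic gateway: entry point for north-south traffic
- Microservice gateway: east-west service governance
- Security gateway: WAF, authentication, rate limiting
- AI gateway: unified access to large-model APIs
Key advantages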
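| Feature | Description |
|---|---|
| High performance | Built on Envoy; 100,000+ QPS per instance |
| Hot reload | Configuration changes take effect without a restart |
| Multi-protocol | HTTP/HTTPS/gRPC/Dubbo |
| Observability | Built-in Prometheus metrics |
| Pluggable | Extensible through WASM plugins |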
2. Production Deployment
2.1 Prerequisites
# Kubernetes version
kubectl version
# Required: v1.20+
# Helm version
helm version
# Required: v3.0+
# Node resources (minimum for production)
# CPU: 4 cores × 3 nodes
# Memory: 8Gi × 3 nodes
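The version minimums above can be checked in a script rather than by eye. The following is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils and recent BSD versions do); `version_ge` is a hypothetical helper name, not part of kubectl or Helm.

```shell
# version_ge A B: succeeds (exit 0) when version A >= version B.
# A leading "v" (as in v1.28.3) is ignored. Relies on `sort -V`.
version_ge() (
  a="${1#v}"
  b="${2#v}"
  # After a version sort the smaller version comes first; A >= B when that is B.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
)

# Example: compare a reported server version against the v1.20 minimum.
if version_ge "v1.28.3" "v1.20.0"; then
  echo "Kubernetes version OK"
fi
```

The version strings passed in would normally be extracted from the `kubectl version` and `helm version` output; how to extract them depends on the client version and is left out here.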
2.2 Add the Helm Chart Repository
helm repo add higress https://higress.io/helm-charts
helm repo update
2.3 Create the Namespace
kubectl create namespace higress-system
2.4 Production values.yaml
# higress-production-values.yaml
# ========== Global ==========
global:
  # Image registry (Alibaba Cloud mirror, for use in mainland China)
  imageRepository: registry.cn-hangzhou.aliyuncs.com/higress
  # Image pull policy
  imagePullPolicy: IfNotPresent
# ========== Gateway ==========
gateway:
  # Replica count (at least 3 in production)
  replicas: 3
  # Resource limits (critical: prevents OOM kills)
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  # Pod anti-affinity (spread replicas across nodes)
  antiAffinity:
    enabled: true
    type: "preferred"
  # Tolerations (allow scheduling onto master nodes, if needed;
  # newer clusters taint these nodes with node-role.kubernetes.io/control-plane instead)
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
  # Node selector
  nodeSelector:
    gateway-node: "true"
  # Health checks
  livenessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  # Service (LoadBalancer type)
  service:
    type: LoadBalancer
    # Alibaba Cloud load balancer annotations
    annotations:
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-type: "nlb"
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-spec: "slb.s3.small"
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-charge-type: "paybytraffic"
    # External IP (if using a fixed IP)
    # loadBalancerIP: "192.168.1.100"
    ports:
      - name: http2
        port: 80
        targetPort: 80
        protocol: TCP
      - name: https
        port: 443
        targetPort: 443
        protocol: TCP
  # Logging
  logging:
    level: "warning"  # production: warning, debug: debug
    format: "json"    # JSON in production, for easier log collection
# ========== Controller ==========
controller:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
  # Leader election
  leaderElection:
    enabled: true
    leaseDuration: 30s
    renewDeadline: 20s
    retryPeriod: 5s
# ========== Monitoring ==========
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: higress-system
    interval: 30s
    scrapeTimeout: 10s
# ========== TLS/SSL ==========
tls:
  # Enable automatic certificates (Let's Encrypt)
  autoCert:
    enabled: true
    email: "admin@example.com"
    server: "https://acme-v02.api.letsencrypt.org/directory"
  # Or reference a certificate Secret manually
  # secretName: "higress-tls"
# ========== Rate Limiting ==========
rateLimit:
  enabled: true
  redis:
    # Redis address (use a dedicated Redis in production)
    host: "redis-master.redis.svc.cluster.local"
    port: 6379
    password: "your-redis-password"  # placeholder; do not commit real credentials
    db: 0
# ========== WAF ==========
waf:
  enabled: true
  # Custom rules
  customRules:
    - name: "block-sql-injection"
      action: "block"
      conditions:
        - field: "uri_query"
          operator: "contains"
          value: "union select"
        - field: "uri_query"
          operator: "contains"
          value: "or 1=1"
# ========== Authentication ==========
auth:
  enabled: true
  # JWT authentication
  jwt:
    enabled: true
    issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
    audiences:
      - "higress-gateway"
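To see what the autoscaling values in this file imply, it helps to work through the replica calculation the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current × currentUtilization / targetUtilization), clamped to the min/max range. The sketch below is illustrative only; `hpa_desired` is a hypothetical helper, and it ignores the HPA tolerance band and stabilization window. When both CPU and memory targets are set, the HPA uses the larger of the two results.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL [MIN] [MAX]
# Utilization values are integer percentages; MIN/MAX default to the
# minReplicas/maxReplicas used in the values file above (3 and 10).
hpa_desired() (
  current=$1
  util=$2
  target=$3
  min=${4:-3}
  max=${5:-10}
  desired=$(( (current * util + target - 1) / target ))   # integer ceiling
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

hpa_desired 3 90 70   # 3 replicas at 90% CPU against a 70% target: prints 4
```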
2.5 Deployment Commands
# Install Higress (production)
helm install higress higress/higress \
  -n higress-system \
  -f higress-production-values.yaml \
  --wait \
  --timeout 10m
# Verify the deployment
kubectl get pods -n higress-system
kubectl get svc -n higress-system
# Show release details
helm status higress -n higress-system
2.6 Upgrade Commands
# Rolling upgrade (zero downtime)
helm upgrade higress higress/higress \
  -n higress-system \
  -f higress-production-values.yaml \
  --reuse-values \
  --wait
# Roll back to the previous revision
helm rollback higress -n higress-system
# List release history
helm history higress -n higress-system
3. Routing Examples
3.1 Basic HTTP Routing
# http-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    # Rewrites the matched request path to "/" before forwarding;
    # remove this if backends expect the original path (e.g. /api/...)
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            # pathType is one of: Exact, Prefix, ImplementationSpecific
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080
3.2 Canary Release
# canary-release.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-canary
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    # Header-based canary: requests carrying X-Canary: true go to v2
    higress.io/canary: "true"
    higress.io/canary-by-header: "X-Canary"
    higress.io/canary-by-header-value: "true"
    # Or split by weight (10% of traffic)
    # higress.io/canary-weight: "10"
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service-v2
                port:
                  number: 80
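The canary annotations describe a routing decision that can be modeled in a few lines. The sketch below assumes the precedence used by nginx-style canary annotations, where a matching header wins and the weight applies to the remaining traffic; whether Higress evaluates them in exactly this order should be confirmed against its documentation. `pick_backend` is a hypothetical helper, and the service names are the ones used in the examples above.

```shell
# pick_backend HEADER_VALUE ROLL WEIGHT
#   HEADER_VALUE: value of the X-Canary request header ("" if absent)
#   ROLL:         integer 0-99, e.g. a per-request random draw
#   WEIGHT:       canary weight in percent (the canary-weight annotation)
pick_backend() (
  header=$1
  roll=$2
  weight=$3
  if [ "$header" = "true" ]; then
    echo "web-service-v2"     # header match always selects the canary
  elif [ "$roll" -lt "$weight" ]; then
    echo "web-service-v2"     # weighted share of the remaining traffic
  else
    echo "web-service"        # everything else stays on the stable version
  fi
)

pick_backend "true" 99 10   # header present: canary
pick_backend "" 50 10       # no header, roll outside the 10% share: stable
```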
3.3 gRPC Routing
# grpc-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-service
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  tls:
    - hosts:
        - grpc.example.com
      secretName: grpc-tls
  rules:
    - host: grpc.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grpc-backend
                port:
                  number: 50051
3.4 WebSocket Support
# websocket-route.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-app
  namespace: default
  annotations:
    kubernetes.io/ingress.class: higress
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  rules:
    - host: ws.example.com
      http:
        paths:
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: websocket-service
                port:
                  number: 8080
4. Key Parameter Tuning
4.1 Envoy Connection Parameters
# Add under gateway.extraEnvoyConfig in values.yaml
gateway:
  extraEnvoyConfig: |
    # Connection timeout
    connect_timeout: 5s
    # Connection pool
    max_connections: 1024
    max_pending_requests: 1024
    max_requests: 1024
    max_retries: 3
    # HTTP/2
    http2_protocol_options:
      max_concurrent_streams: 100
      initial_stream_window_size: 65536
      initial_connection_window_size: 1048576
    # Keepalive
    keepalive:
      time: 30s
      interval: 10s
      timeout: 5s
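4.2 Timeouts (Recommended Production Values)
| Parameter | Recommended value | Description |
|---|---|---|
| connect_timeout | 5s | Connection establishment timeout |
| request_timeout | 60s | Total request timeout |
| idle_timeout | 300s | Idle connection timeout |
| stream_idle_timeout | 30s | Stream idle timeout |
| max_stream_duration | 3600s | Maximum stream duration (WebSocket) |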
4.3 Rate Limiting
# Global rate limit
apiVersion: networking.higress.io/v1
kind: HigressRateLimit
metadata:
  name: global-rate-limit
  namespace: higress-system
spec:
  # Rate-limit scope: global, route, cluster
  domain: higress
  descriptors:
    - key: remote_address
      rate_limit:
        unit: second
        requests_per_unit: 100   # 100 requests per second per IP
    - key: header_match
      value: "api-key"
      rate_limit:
        unit: minute
        requests_per_unit: 1000  # 1000 requests per minute per API key
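Before enforcing a per-IP limit such as the 100 requests per second above, it can be useful to check existing traffic against it offline. The following sketch counts requests per client IP in one-second fixed windows, which is one simple reading of "requests per unit"; the gateway's actual algorithm may differ. The input format (`epoch_second client_ip` per line) and the `over_limit` helper are assumptions for illustration.

```shell
# over_limit LIMIT: reads "epoch_second client_ip" lines on stdin and prints
# "client_ip epoch_second count" for every (ip, second) pair above LIMIT.
over_limit() {
  awk -v limit="$1" '
    { count[$2 " " $1]++ }                      # key: "ip second"
    END {
      for (k in count)
        if (count[k] > limit) print k, count[k]
    }
  ' | sort
}

# Example with a limit of 2 requests per second per IP.
printf '100 10.0.0.1\n100 10.0.0.1\n100 10.0.0.1\n100 10.0.0.2\n101 10.0.0.1\n' \
  | over_limit 2
# prints: 10.0.0.1 100 3
```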
4.4 Circuit Breaking
# Circuit breaker configuration (HigressRoute)
apiVersion: networking.higress.io/v1
kind: HigressRoute
metadata:
  name: api-route
  namespace: default
spec:
  hosts:
    - "api.example.com"
  routes:
    - match:
        uri:
          prefix: /api
      route:
        - destination:
            host: api-service
            port: 8080
          # Circuit breaking
          outlierDetection:
            consecutive5xxErrors: 5   # eject after 5 consecutive 5xx errors
            interval: 30s             # detection interval
            baseEjectionTime: 30s     # base ejection time
            maxEjectionPercent: 50    # eject at most 50% of hosts
            minHealthPercent: 30      # minimum share of healthy instances
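The outlier detection values translate into concrete numbers for a given backend size. The sketch below assumes Envoy's documented behavior: at most `maxEjectionPercent` of hosts are ejected at once (but at least one host can always be ejected), and each ejection lasts `baseEjectionTime` multiplied by the number of times the host has been ejected. It ignores Envoy's cap on the maximum ejection time, and both helper names are hypothetical.

```shell
# max_ejected HOSTS MAX_EJECTION_PERCENT: upper bound on hosts ejected at once.
max_ejected() (
  n=$(( $1 * $2 / 100 ))
  if [ "$n" -lt 1 ]; then n=1; fi   # at least one host can always be ejected
  echo "$n"
)

# ejection_seconds BASE_SECONDS TIMES_EJECTED: duration of the current ejection.
ejection_seconds() (
  echo $(( $1 * $2 ))
)

max_ejected 10 50        # 10 backends, maxEjectionPercent 50: prints 5
ejection_seconds 30 3    # third ejection with a 30s base: prints 90
```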
5. Monitoring and Alerting
5.1 Prometheus Metrics
# Key metrics
# Request volume
higress_gateway_requests_total{route, status_code}
# Latency (histogram; queried through the _bucket series)
higress_gateway_request_duration_seconds_bucket{route, le}
# Connections
higress_gateway_connections_active
higress_gateway_connections_total
# Rate limiting
higress_gateway_rate_limited_requests_total
# Circuit breaking
higress_gateway_circuit_breaker_open
# Certificates
higress_gateway_ssl_cert_expiry_timestamp_seconds
5.2 Grafana Dashboard Configuration
{
  "dashboard": {
    "title": "Higress Gateway Monitoring",
    "panels": [
      {
        "title": "QPS",
        "targets": [{
          "expr": "sum(rate(higress_gateway_requests_total[1m]))"
        }]
      },
      {
        "title": "P99 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum by (le) (rate(higress_gateway_request_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(higress_gateway_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(higress_gateway_requests_total[5m]))"
        }]
      },
      {
        "title": "Active Connections",
        "targets": [{
          "expr": "higress_gateway_connections_active"
        }]
      }
    ]
  }
}
5.3 Alerting Rules (Prometheus, routed through Alertmanager)
# higress-alerts.yaml
groups:
  - name: higress-alerts
    rules:
      # High error rate
      - alert: HigressHighErrorRate
        expr: |
          sum(rate(higress_gateway_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(higress_gateway_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Higress error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      # High latency
      - alert: HigressHighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(higress_gateway_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Higress P99 latency above 1 second"
          description: "Current P99 latency: {{ $value }}s"
      # Pod restarts
      - alert: HigressPodRestarting
        expr: |
          increase(kube_pod_container_status_restarts_total{namespace="higress-system"}[1h]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Higress Pod restarting frequently"
          description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
      # Certificate about to expire
      - alert: HigressCertExpiring
        expr: |
          (higress_gateway_ssl_cert_expiry_timestamp_seconds - time()) < 86400 * 7
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires within 7 days"
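The threshold arithmetic behind two of these rules can be checked by hand before deploying them. The sketch below mirrors the error-rate comparison (`> 0.05`) and the certificate check (`< 86400 * 7` seconds remaining) on plain numbers; it does not talk to Prometheus, and both helper names are hypothetical.

```shell
# error_rate_firing ERRORS TOTAL: succeeds when ERRORS / TOTAL exceeds 5%.
error_rate_firing() {
  awk -v e="$1" -v t="$2" 'BEGIN { exit !(t > 0 && e / t > 0.05) }'
}

# cert_expiring EXPIRY_EPOCH NOW_EPOCH: succeeds when fewer than 7 days remain.
cert_expiring() {
  [ $(( $1 - $2 )) -lt $(( 86400 * 7 )) ]
}

# Example: 60 errors out of 1000 requests is 6%, so the alert condition holds.
if error_rate_firing 60 1000; then
  echo "HigressHighErrorRate condition met"
fi
```

Note that the real rules also require the condition to hold for the `for:` duration before firing, which this arithmetic does not model.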
5.4 Monitoring Commands
# Gateway Pod status
kubectl get pods -n higress-system -o wide
# Resource usage
kubectl top pods -n higress-system
# Follow live logs
kubectl logs -n higress-system -l app=higress-gateway -f --tail=100
# Dump the Envoy configuration
kubectl exec -n higress-system $(kubectl get pod -n higress-system -l app=higress-gateway -o jsonpath='{.items[0].metadata.name}') -- pilot-agent request GET /config_dump
# Connection statistics
kubectl exec -n higress-system $(kubectl get pod -n higress-system -l app=higress-gateway -o jsonpath='{.items[0].metadata.name}') -- pilot-agent request GET /stats | grep connection
# Measure average latency over 100 requests
for i in {1..100}; do curl -s -o /dev/null -w "%{time_total}\n" https://app.example.com; done | awk '{sum+=$1} END {print "avg:", sum/NR}'
# Load test (ab)
ab -n 10000 -c 100 https://app.example.com/
# Load test (wrk)
wrk -t12 -c400 -d30s https://app.example.com/
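The curl loop above reports only the average, which hides tail latency. A small extension, sketched here, also reports P99 using the nearest-rank method on the same one-value-per-line input; `latency_stats` is a hypothetical helper, and the nearest-rank definition is a choice made for this example.

```shell
# latency_stats: reads one latency value (seconds) per line on stdin and
# prints the average and the nearest-rank P99.
latency_stats() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1                       # no samples
      idx = int(NR * 0.99)
      if (idx < NR * 0.99) idx++                # ceiling for nearest rank
      printf "avg=%.3f p99=%.3f\n", sum / NR, v[idx]
    }'
}

# Usage with the curl loop from the commands above:
#   for i in $(seq 1 100); do
#     curl -s -o /dev/null -w "%{time_total}\n" https://app.example.com
#   done | latency_stats
```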
6. Troubleshooting
6.1 General Troubleshooting Flow
1. Check Pod status
kubectl get pods -n higress-system
2. Inspect Pod events
kubectl describe pod <pod-name> -n higress-system
3. Read the logs
kubectl logs <pod-name> -n higress-system
4. Check Service/Endpoints
kubectl get svc,ep -n higress-system
5. Check the Ingress configuration
kubectl get ingress -A
kubectl describe ingress <ingress-name>
6. Check the route configuration
kubectl get higressroute -A
7. Verify DNS resolution
nslookup app.example.com
dig app.example.com
8. Test connectivity
curl -v https://app.example.com
6.2 Typical Failure Scenarios
Scenario 1: 502 Bad Gateway
# Cause: backend service unavailable
# Steps:
# 1. Check backend Pod status
kubectl get pods -n default -l app=web-service
# 2. Check Endpoints
kubectl get endpoints web-service -n default
# 3. Look for upstream errors in the gateway logs
kubectl logs -n higress-system -l app=higress-gateway | grep "upstream"
# 4. Test the backend directly
kubectl exec -n default <backend-pod> -- curl localhost:8080/health
Scenario 2: 503 Service Unavailable
# Cause: no available backend instance, or the circuit breaker has tripped
# Steps:
# 1. Check circuit breaker state
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep circuit_breaker
# 2. Check rate-limit state
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep rate_limit
# 3. Look for failed health checks
kubectl logs -n higress-system -l app=higress-gateway | grep "health_check"
Scenario 3: SSL/TLS Certificate Problems
# Cause: expired certificate or misconfiguration
# Steps:
# 1. Check the certificate validity period
echo | openssl s_client -connect app.example.com:443 2>/dev/null | openssl x509 -noout -dates
# 2. Check the certificate stored in the Secret
kubectl get secret higress-tls -n higress-system -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# 3. Verify the certificate chain
openssl s_client -connect app.example.com:443 -showcerts
# 4. Check automatic certificate status (if using Let's Encrypt)
kubectl get certificaterequest -n higress-system
kubectl describe certificaterequest <request-name> -n higress-system
Scenario 4: Route Not Matching
# Cause: Ingress misconfiguration or path mismatch
# Steps:
# 1. Inspect the Ingress
kubectl get ingress <name> -o yaml
# 2. Inspect the Higress route configuration
kubectl get higressroute -A -o yaml
# 3. Inspect the Envoy route table (the routes section of the config dump)
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /config_dump | jq '.configs[] | select(."@type" | endswith("RoutesConfigDump"))'
# 4. Test different paths
curl -v -H "Host: app.example.com" http://<gateway-ip>/api
curl -v -H "Host: app.example.com" http://<gateway-ip>/static
Scenario 5: Performance Degradation
# Cause: insufficient resources or poor configuration
# Steps:
# 1. Check resource usage
kubectl top pods -n higress-system
# 2. Check connection counts
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep connection
# 3. Check request queues
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /stats | grep queue
# 4. Find slow requests (100 ms or longer) in the logs
kubectl logs -n higress-system -l app=higress-gateway | grep -E "duration.*[1-9][0-9]{2,}ms"
# 5. Check for OOM kills
kubectl describe pod -n higress-system | grep -A5 "OOM"
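The status-code scenarios above lend themselves to a small lookup that an on-call script can print next to a failing probe. This sketch only restates the causes and checks listed in Scenarios 1 and 2; `triage_hint` is a hypothetical helper and covers no other status codes.

```shell
# triage_hint STATUS_CODE: prints the likely cause and first checks for the
# HTTP status codes covered by the scenarios in this section.
triage_hint() {
  case "$1" in
    502) echo "backend unavailable: check backend Pods, Endpoints, and upstream errors in gateway logs" ;;
    503) echo "no available backend or circuit breaker tripped: check circuit_breaker and rate_limit stats, then health checks" ;;
    *)   echo "no scenario mapped: start from the general flow in 6.1" ;;
  esac
}

# Example: feed it the status code returned by a probe request.
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://app.example.com)
#   triage_hint "$code"
```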
6.3 Debugging Tools
# Enable debug logging (sets LOG_LEVEL without replacing the container's other environment variables)
kubectl set env deployment/higress-gateway -n higress-system LOG_LEVEL=debug
# Capture an Envoy configuration snapshot
kubectl exec -n higress-system <gateway-pod> -- pilot-agent request GET /config_dump > envoy-config.json
# Check admin readiness and export Prometheus-format stats
kubectl exec -n higress-system <gateway-pod> -- curl -s localhost:15000/ready
kubectl exec -n higress-system <gateway-pod> -- curl -s localhost:15000/stats/prometheus > metrics.prom
# Packet capture (requires a debug container)
kubectl debug -n higress-system <gateway-pod> -it --image=nicolaka/netshoot -- tcpdump -i any port 80 or 443
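7. Best Practices
7.1 Deployment Best Practices
| Practice | Description | Recommended setting |
|---|---|---|
| Multiple replicas | Avoid a single point of failure | At least 3 replicas |
| Cross-AZ deployment | Improve disaster tolerance | Pod anti-affinity + multiple AZs |
| Resource limits | Prevent resource exhaustion | Set requests/limits |
| PDB | Preserve availability during upgrades | minAvailable: 2 |
| Health checks | Fast failure detection | 5s interval, 3 failures |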
7.2 Security Best Practices
# 1. Enable mTLS
apiVersion: security.higress.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: higress-system
spec:
  mtls:
    mode: STRICT
---
# 2. Configure WAF rules
apiVersion: security.higress.io/v1
kind: WafPolicy
metadata:
  name: default-waf
  namespace: higress-system
spec:
  rules:
    - name: sql-injection
      action: BLOCK
      conditions:
        - field: ARGS
          operator: CONTAINS
          value: "(?i)(union.*select|select.*from)"
    - name: xss-protection
      action: BLOCK
      conditions:
        - field: ARGS
          operator: CONTAINS
          value: "(?i)(<script|javascript:)"
---
# 3. IP allowlist
apiVersion: networking.higress.io/v1
kind: HigressGateway
metadata:
  name: internal-gateway
spec:
  accessLog:
    - filter:
        remoteIp:
          cidr: "10.0.0.0/8"
7.3 Performance Best Practices
# 1. Enable HTTP/2
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/http2: "true"
---
# 2. Enable gzip compression
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-gzip: "true"
    nginx.ingress.kubernetes.io/gzip-types: "text/plain,text/css,application/json,application/javascript"
    nginx.ingress.kubernetes.io/gzip-min-length: "256"
---
# 3. Configure the connection pool
# Inside a HigressRoute
spec:
  routes:
    - route:
        - destination:
            host: backend-service
          connectionPool:
            tcp:
              maxConnections: 100
            http:
              h2UpgradePolicy: UPGRADE
              http1MaxPendingRequests: 100
              http2MaxRequests: 1000
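7.4 Operations Best Practices
# 1. Back up configuration regularly
kubectl get ingress,higressroute,virtualservice -A -o yaml > higress-config-backup-$(date +%Y%m%d).yaml
# 2. Certificate monitoring (alert 30 days ahead)
# Use cert-manager + Prometheus
# 3. Configuration change auditing
# Enable the Kubernetes audit log
# 4. Regular load testing
# Run a full end-to-end load test once a month
# 5. Disaster recovery drills
# Run a failover drill once a quarter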
8. Configuration Template Quick Reference
8.1 Complete Ingress Template
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: production-app
  namespace: production
  annotations:
    kubernetes.io/ingress.class: higress
    # TLS
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    # Redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://*.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization"
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: backend
                port:
                  number: 8080
          - path: /health
            pathType: Exact
            backend:
              service:
                name: frontend
                port:
                  number: 80
8.2 HigressRoute Template
apiVersion: networking.higress.io/v1
kind: HigressRoute
metadata:
  name: api-route
  namespace: production
spec:
  hosts:
    - "api.example.com"
  http:
    - name: "api-v1"
      match:
        - uri:
            prefix: "/api/v1"
      route:
        - destination:
            host: api-v1-service
            port:
              number: 8080
          weight: 90
        - destination:
            host: api-v2-service
            port:
              number: 8080
          weight: 10
      timeout: 30s
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: "5xx,reset,connect-failure"
      fault:
        delay:
          percentage:
            value: 0.1
          fixedDelay: 100ms
      corsPolicy:
        allowOrigins:
          - exact: "https://app.example.com"
        allowMethods:
          - GET
          - POST
        allowHeaders:
          - Authorization
          - Content-Type
        exposeHeaders:
          - X-Request-Id
        maxAge: 24h
        allowCredentials: true
      rateLimit:
        type: Local
        qps: 100
        burst: 200
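9. References
- Official documentation: https://higress.io/docs/
- GitHub: https://github.com/alibaba/higress
- Helm Chart: https://higress.io/helm-charts/
- Best practices: https://higress.io/docs/latest/overview/what-is-higress/
- Performance benchmarks: https://higress.io/docs/latest/benchmark/
10. Today's Checklist
- Check Gateway Pod health
- Verify SSL certificate validity (> 30 days remaining)
- Check the error rate (< 1%)
- Check P99 latency (< 500ms)
- Review how often rate limiting was triggered
- Check circuit breaker state
- Back up the current configuration
- Review recently changed Ingress configuration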
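The three numeric items in the checklist can be evaluated mechanically once the values have been collected. The sketch below applies the thresholds stated above (more than 30 days of certificate validity, error rate below 1%, P99 below 500 ms); how the inputs are gathered is left open, and `checklist` is a hypothetical helper.

```shell
# checklist CERT_DAYS_LEFT ERROR_RATE_PERCENT P99_MS
# Prints PASS/FAIL per item and exits non-zero if any item fails.
checklist() {
  awk -v d="$1" -v e="$2" -v p="$3" 'BEGIN {
    fail = 0
    if (d > 30)  print "PASS cert";       else { print "FAIL cert";       fail = 1 }
    if (e < 1)   print "PASS error_rate"; else { print "FAIL error_rate"; fail = 1 }
    if (p < 500) print "PASS p99";        else { print "FAIL p99";        fail = 1 }
    exit fail
  }'
}

# Example: 45 days of validity left, 0.4% errors, 320 ms P99.
checklist 45 0.4 320
```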
Document generated: 2026-03-13 10:00 CST
Next topic: 2026-03-14 - Redis Production Configuration and Performance Tuning