
Monitoring and Alerting with Prometheus

1. Configuring Traefik Monitoring

Prometheus Operator provides the ServiceMonitor CRD for configuring how metrics are scraped. Here we define the following object:

008-traefik-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik
  namespace: default
  labels:
    app: traefik
    release: prometheus-stack
spec:
  jobLabel: traefik-metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik-dashboard
      app: traefik
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: admin
    path: /metrics

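# Note that the traefik-dashboard Service is created in the kube-system namespace,
# while the ServiceMonitor is deployed in the default namespace, so namespaceSelector is used to match the Service's namespace.
1. Once the resource is created, Prometheus will scrape the /metrics endpoint of the traefik-dashboard Service.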
kubectl apply -f 008-traefik-service-monitor.yaml

Verify that Prometheus has started scraping the Traefik metrics:

image-20240416142614717
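
The same check can also be made from the command line through the Prometheus HTTP API. The sketch below is only an illustration: it assumes Prometheus has been made reachable at http://localhost:9090 (for example with `kubectl -n monitoring port-forward svc/prometheus-operated 9090`), and both the `PROM_URL` variable and the `prom_query` helper are names introduced here, not part of the original setup.

```shell
# Assumption: Prometheus is reachable at $PROM_URL (default: a local port-forward).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query against /api/v1/query and print the raw JSON.
prom_query() {
  # -G sends the URL-encoded query as a GET parameter.
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query 'up{job="traefik-dashboard"}'` should return a sample with value 1 for each target that is being scraped successfully. The `job` value here is assumed to match the one used in the alerting rule of this article.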

2. Configuring Traefik Alerting

Next, add an alerting rule that fires when its condition is met. Prometheus Operator also provides a CRD named PrometheusRule for configuring alerting rules:

009-traefik-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: kps
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: kps
  name: traefik-alert-rules
  namespace: monitoring
spec:
  groups:
  - name: Traefik
    rules:
    - alert: TooManyRequest
      expr: avg(traefik_entrypoint_open_connections{job="traefik-dashboard",namespace="kube-system"}) > 5
      for: 1m
      labels:
        severity: critical
1. This defines one rule: if the average number of open connections stays above 5 for 1 minute, a TooManyRequest alert fires.
kubectl apply -f 009-traefik-rules.yaml
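
To make the rule's condition concrete, the sketch below reproduces the same arithmetic outside Prometheus. It averages the `traefik_entrypoint_open_connections` samples found in text scraped from a `/metrics` endpoint and compares the result with the threshold of 5. This is not how Prometheus evaluates rules (the `for: 1m` hold period is not modelled), the helper names are hypothetical, and it assumes the samples carry no trailing timestamps.

```shell
# avg_open_connections: read Prometheus text-format metrics on stdin and print
# the average of all traefik_entrypoint_open_connections samples (0 if none).
avg_open_connections() {
  awk '
    # Match sample lines only; "# HELP" and "# TYPE" lines start with "#".
    /^traefik_entrypoint_open_connections[{ ]/ { sum += $NF; n++ }
    END { if (n > 0) printf "%g\n", sum / n; else print 0 }
  '
}

# exceeds_threshold VALUE LIMIT: succeed (exit 0) only when VALUE > LIMIT.
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}
```

Piping the output of the Traefik `/metrics` endpoint into `avg_open_connections` and passing the result to `exceeds_threshold "$avg" 5` mirrors the `> 5` comparison in the rule's `expr`.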

Note: the annotations and labels of the PrometheusRule can be modeled on those of a rule that is already running:

kubectl get PrometheusRule kps-kube-prometheus-stack-prometheus-operator -n monitoring -oyaml |head

Check that the rule was loaded successfully:

kubectl exec -n monitoring prometheus-kps-kube-prometheus-stack-prometheus-0 -- ls /etc/prometheus/rules/prometheus-kps-kube-prometheus-stack-prometheus-rulefiles-0/

Once the rule has been created, it should appear on the Status > Rules page of the Prometheus dashboard:

image-20240416150415896
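
As a command-line alternative to the Rules page, the Prometheus HTTP API exposes the loaded rules at `/api/v1/rules`. The following is a minimal sketch that assumes Prometheus is reachable at http://localhost:9090 (for example through a port-forward); `PROM_URL` and `rule_loaded` are hypothetical names. The match is a plain text search on the JSON, so it also succeeds for a rule group with the same name.

```shell
# rule_loaded NAME: succeed if a rule (or rule group) called NAME appears in
# the JSON returned by the /api/v1/rules endpoint.
rule_loaded() {
  curl -s "${PROM_URL:-http://localhost:9090}/api/v1/rules" \
    | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}
```

Running `rule_loaded TooManyRequest && echo loaded` prints `loaded` once the rule from 009-traefik-rules.yaml has been picked up.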

3. Grafana Configuration

image-20240416152545861

4. Testing

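Traefik is now up and running, and its metrics are being collected by Prometheus and displayed in Grafana. Next we need an application to test with. Here we deploy the HTTPBin service, which provides many endpoints that can be used to simulate different kinds of user traffic. The resource manifest is shown below: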

010-traefik-httpbin.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
  labels:
    app: httpbin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin
  template:
    metadata:
      labels:
        app: httpbin
    spec:
      containers:
      - image: kennethreitz/httpbin
        name: httpbin
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: httpbin
spec:
  entryPoints:
    - web
  routes:
  - match: Host(`httpbin.local`)
    kind: Rule
    services:
    - name: httpbin
      port: 8000
1. The httpbin route matches the hostname httpbin.local and forwards requests to the httpbin Service.
kubectl apply -f 010-traefik-httpbin.yaml 
2. Look up the address of Traefik
kubectl get svc traefik -n kube-system
3. Send a request with the Host header set to httpbin.local to confirm that the route forwards it to the httpbin Service:
curl -I http://10.96.190.2  -H "host:httpbin.local"
4. Use ab to send simulated traffic to the HTTPBin service. These requests generate the corresponding metrics. Run the following script:
host=10.96.190.2 
ab -c 5 -n 10000  -m PATCH -H "host:httpbin.local" -H "accept: application/json" http://${host}/patch
ab -c 5 -n 10000  -m GET -H "host:httpbin.local" -H "accept: application/json" http://${host}/get
ab -c 5 -n 10000  -m POST -H "host:httpbin.local" -H "accept: application/json" http://${host}/post
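
The three ab invocations differ only in the HTTP method and the matching HTTPBin path, so they can be wrapped in a loop, and the address can be read from the cluster rather than hard-coded. The sketch below assumes the traefik Service in kube-system exposes a ClusterIP that is reachable from where the script runs; `traefik_ip` and `run_load` are hypothetical helper names.

```shell
# traefik_ip: print the ClusterIP of the traefik Service in kube-system.
traefik_ip() {
  kubectl get svc traefik -n kube-system -o jsonpath='{.spec.clusterIP}'
}

# run_load HOST: send 10000 requests (concurrency 5) per HTTP method to HTTPBin.
run_load() {
  local host="$1" method path
  for method in PATCH GET POST; do
    # HTTPBin exposes one endpoint per method: /patch, /get, /post.
    path="$(printf '%s' "$method" | tr '[:upper:]' '[:lower:]')"
    ab -c 5 -n 10000 -m "$method" \
      -H "host:httpbin.local" -H "accept: application/json" \
      "http://${host}/${path}"
  done
}
```

Calling `run_load "$(traefik_ip)"` then produces the same traffic as the script above.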