k8s (7): Monitoring - Prometheus Deployment

Preface

The previous posts in this series covered k8s deployment, external services, cluster networking, and microservice support. Running in production is impossible without status monitoring, so starting with this post we deploy Prometheus, the container monitoring tool that is widely used across the industry.

How it works

Prometheus workflow diagram:

In k8s, cluster resources expose metrics (measurement values), and a variety of exporters publish up-to-date metric data through API endpoints. When Prometheus is integrated with k8s, its job is to talk to these metric-providing exporters, collect the data, aggregate it, display it, and trigger alerts.

1. Collecting metrics:
1) For short-lived jobs, metrics are pushed to an intermediary (the Pushgateway) and scraped from there (uncommon).
2) For exporters, metrics are pulled (the usual way). Common exporters include kube-apiserver, cadvisor, and node-exporter; you can also deploy an exporter matched to your application type to expose that application's state. Currently supported applications include nginx, haproxy, mysql, redis, memcache, and others.

2. Aggregation and on-demand querying:
Metrics can be filtered, listed, and graphed using the officially defined expr expression format and the PromQL query language. The built-in web UI is fairly bare-bones, but Prometheus also exposes its data through an HTTP API; Grafana can use Prometheus as a data source through that API and draw far more polished dashboards.

3. Alert delivery:
Prometheus supports multiple notification channels. Alerts fire automatically once their conditions are met, and the delivery rules (repeat interval, routing, and so on) can be customized, which makes alerting very flexible.

Deployment

1. Create the ConfigMap. Prepare the Prometheus main configuration file before deploying, and mount it into the Deployment as a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:9093"]
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: 'kubernetes-services'
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    - job_name: 'kubernetes-ingresses'
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes_node'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      # endpoint-based service discovery, no longer going through the service proxy layer
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_endpoint_port_name]
        regex: true;prometheus-node-exporter
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      # drop the __meta_kubernetes_service_label_ prefix from label names
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      # to tell which node a target belongs to, replace instance (the node-exporter
      # endpoint address) with the IP of the node the endpoint runs on
      - source_labels: [__meta_kubernetes_pod_host_ip]
        regex: '(.*)'
        replacement: '${1}'
        target_label: instance

2. Deploy the Prometheus main program; note that it mounts the ConfigMap created above:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    name: prometheus-deployment
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - image: prom/prometheus:v2.0.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2500Mi
      serviceAccountName: prometheus
      volumes:
      - name: data
        emptyDir: {}
      - name: config-volume
        configMap:
          name: prometheus-config
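A quick sanity check after the first two steps (a minimal sketch; the file names are placeholders for wherever you saved the manifests above, and the Deployment's serviceAccountName refers to the ServiceAccount created in step 3 below, so the pod will only start cleanly once those RBAC objects exist):

# apply the ConfigMap first so the Deployment can mount it
kubectl apply -f prometheus-configmap.yaml      # hypothetical file name for the ConfigMap above
kubectl apply -f prometheus-deployment.yaml     # hypothetical file name for the Deployment above

# the pod should land in kube-system and reach Running
kubectl -n kube-system get pods -l app=prometheus

# check the logs to confirm the mounted prometheus.yml was loaded without parse errors
kubectl -n kube-system logs $(kubectl -n kube-system get pods -l app=prometheus -o name | head -n 1)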
3. Deploy the Service, Ingress, and RBAC objects. Note: locally I use traefik as the external reverse proxy, so the Service type is changed from the default NodePort to ClusterIP; after adding the Ingress, the UI can be reached directly by domain name. If you don't use a proxy, you can skip the Ingress, keep the default NodePort, and access the UI via node IP + port. For how to use Ingress, see the earlier post in this series.

kind: Service
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
  - port: 80
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: kube-system
  selfLink: /apis/extensions/v1beta1/namespaces/default/ingresses/prometheus
spec:
  rules:
  - host:
    http:
      paths:
      - backend:
          serviceName: prometheus
          servicePort: 80
        path: /
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

Deploy the YAML files above in order. Once initialization finishes and the DNS record for the Ingress host is in place, open the web UI in a browser. Pick any metric, click Execute, and check that results come back normally. Under Status -> Targets you can see where the metrics come from, i.e. the individual exporters; clicking an exporter's link shows the detailed metrics that exporter provides.

For nicer graphs you need Grafana. I already have Grafana deployed, so it is not redeployed here; the manifest is included for reference:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana-core
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
      - image: grafana/grafana:4.2.0
        name: grafana-core
        imagePullPolicy: IfNotPresent
        # env:
        resources:
          # keep request = limit to keep this container in guaranteed class
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        # The following env variables set up basic auth with the default admin user and admin password.
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
        #   value: Admin
        # does not really work, because of template variables in exported dashboards:
        # - name: GF_DASHBOARDS_JSON_ENABLED
        #   value: "true"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          # initialDelaySeconds: 30
          # timeoutSeconds: 1
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  type: NodePort
  ports:
  - port: 3000
  selector:
    app: grafana
    component: core

Open Grafana and add Prometheus as a data source (the default admin account and password are admin/admin): choose the data source type, fill in the Prometheus service address and port, and click Save.

Import a dashboard template: click Dashboards -> Import dashboard, enter the number 315 in the dialog, and the official template No. 315 is loaded automatically; then select the data source you just added and the dashboard is ready. Very easy.

That's it for the basic deployment. The next post will cover Prometheus alerting rules.

===========================================================================================

Update 7.19:
I recently found that node-exporter deployed as a DaemonSet was collecting inaccurate metric values; it turned out the host's /proc and /sys directories need to be mounted into the node-exporter container. (Solved; updated file below.)

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    k8s-app: prometheus-node-exporter
  name: prometheus-node-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: prometheus-node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: prometheus-node-exporter
    spec:
      containers:
      - args:
        - -collector.procfs
        - /host/proc
        - -collector.sysfs
        - /host/sys
        - -collector.filesystem.ignored-mount-points
        - ^/(proc|sys|host|etc|dev)($|/)
        - -collector.filesystem.ignored-fs-types
        - ^(tmpfs|cgroup|configfs|debugfs|devpts|efivarfs|nsfs|overlay|sysfs|proc)$
        image: prom/node-exporter:v0.14.0
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9101
          name: http
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /rootfs
          name: root
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
      - hostPath:
          path: /
          type: ""
        name: root
  templateGeneration: 17
  updateStrategy:
    type: OnDelete
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/app-metrics: 'true'
    prometheus.io/app-metrics-path: '/metrics'
  name: prometheus-node-exporter
  namespace: kube-system
  labels:
    app: prometheus-node-exporter
spec:
  clusterIP: None
  ports:
  - name: prometheus-node-exporter
    port: 9100
    protocol: TCP
  selector:
    k8s-app: prometheus-node-exporter
  type: ClusterIP
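To spot-check that every node is running its own exporter and actually serving metrics, something like the following works (a hedged sketch; <node-ip> is a placeholder for one of your node IPs, and 9101 is the hostPort declared in the DaemonSet above):

# one exporter pod per node, each with its own pod/host IP
kubectl -n kube-system get pods -l k8s-app=prometheus-node-exporter -o wide

# pull a few raw samples straight from one node's exporter (hostPort 9101 -> containerPort 9100)
curl -s http://<node-ip>:9101/metrics | head -n 20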
However, I found that even after this deployment the collected node metrics were still inaccurate, which was very strange. Running node-exporter directly with docker, outside of k8s, produced accurate numbers, and I did not understand why at first, so I planned to keep digging. (Update 11-12: the inaccurate-data problem is solved. Because scraping went through the service proxy, the data actually collected came from a random backend endpoint rather than the endpoint on the specific host you wanted. The fix is to change the resource type that Prometheus's service discovery uses to endpoints, bypassing the service.)

========================================================

The collection problem is solved; the docker run method below is kept only for reference. Don't use it anymore, just deploy with the YAML files above.

docker run command:

docker run -d -p 9100:9100 --name node-exporter \
  -v "/proc:/host/proc" -v "/sys:/host/sys" -v "/:/rootfs" \
  --net="host" \
  prom/node-exporter:v0.14.0 \
  -collector.procfs /host/proc \
  -collector.sysfs /host/sys \
  -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"

Finally, remember to adjust the job/targets configuration in the ConfigMap accordingly. Why node metrics collected from inside the k8s cluster were inaccurate was something I needed to study further; that was it for that round.

Supplement 11.12:
The cause of the inaccurate node-exporter data above has been found. Thanks to @架势糖007 in the comments for pointing out that accessing node-exporter through a Service means each request is load-balanced to a random backend endpoint pod, rather than reaching the specific pod you actually want. That reminded me that the earlier miscalculated data was most likely coming from other nodes. So I modified the Prometheus ConfigMap above and the node-exporter deployment YAML: the scrape target was changed from the service to the endpoints, bypassing the proxy layer and hitting the endpoint layer directly. After the change, the inaccurate node data problem was resolved.
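To confirm the endpoint-based setup behaves as the fix describes, you can check that the headless Service exposes one endpoint per node and that each scrape target's instance label is now a node IP (a rough sketch; prometheus.example.com stands in for the Ingress host configured earlier, and it assumes curl can reach the Prometheus HTTP API from where you run this):

# one endpoint per node, with no ClusterIP load balancing in between
kubectl -n kube-system get endpoints prometheus-node-exporter

# every kubernetes_node target should report up=1, with instance relabeled to the node's IP
curl -s -G 'http://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=up{job="kubernetes_node"}'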