Background
While a Kubernetes cluster is running, nodes sometimes become unavailable because of component failures, kernel deadlocks, resource exhaustion, and similar problems. By default, the kubelet monitors node resource conditions such as PIDPressure, MemoryPressure, and DiskPressure, but by the time the kubelet reports these statuses the node may already be unusable, and the kubelet may even have started evicting pods. In such scenarios, vanilla Kubernetes' node-health detection is incomplete. To catch node problems earlier, we need finer-grained signals that describe node health, paired with corresponding recovery strategies, so that remediation can be automated and the burden on developers and operators reduced.
Fault Detection with NPD
NPD (node-problem-detector) is an open-source node health detection component from the Kubernetes community. NPD can detect node anomalies by matching system logs or files against regular expressions. Drawing on operational experience, users configure regular expressions for log lines that indicate problems and choose how detections should be reported. NPD parses these configuration files and, whenever a log line matches a configured expression, reports the detected anomaly as a NodeCondition, an Event, or a Prometheus metric. Besides log matching, NPD also accepts custom detection plugins: users can integrate their own scripts or executables as NPD plugins, and NPD will run them periodically.
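As an illustration of the regex-based configuration described above, a system-log monitor definition for NPD looks roughly like the following sketch. It watches the kernel log and raises a `ReadonlyFilesystem` condition when a matching line appears (this condition also shows up in the test later in this post); treat the exact field values here as an example rather than a drop-in config:

```json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "ReadonlyFilesystem",
      "reason": "FilesystemIsNotReadOnly",
      "message": "Filesystem is not read-only"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ReadonlyFilesystem",
      "reason": "FilesystemIsReadOnly",
      "pattern": "Remounting filesystem read-only"
    }
  ]
}
```

A `permanent` rule flips the named NodeCondition to `True`; `temporary` rules emit Events instead.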
Draino: Automatically Cordoning and Draining Nodes
Draino automatically drains Kubernetes nodes based on labels and node conditions. A node that matches all of the specified labels and any of the specified node conditions is cordoned immediately, and the pods on it are drained after the drain-buffer interval has elapsed.
Draino is typically used together with the Node Problem Detector and the Cluster Autoscaler. NPD probes node health by watching node logs or running a script, and when it detects a problem on a node it sets a node condition on that node. The Cluster Autoscaler can be configured to delete underutilized nodes. Combined with Draino, the three can automatically remediate failures in scenarios like this:
- NPD detects a permanent problem on a node and sets the corresponding node condition on it.
- Draino notices the node condition, immediately cordons the node so that no new pods are scheduled onto it, and schedules a task to drain the node.
- Once the faulty node has been drained, the Cluster Autoscaler considers it underutilized and, after a waiting period, scales it down.
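The cordon-and-drain step that Draino automates in the flow above corresponds to the standard kubectl operations below (the node name is hypothetical); this can be useful when reasoning about, or manually reproducing, Draino's behavior:

```shell
# Step 1: mark the node unschedulable so no new pods land on it
kubectl cordon worker-1

# Step 2: evict the pods currently running on it
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
```

Draino additionally waits for the configured drain-buffer between cordoning and draining, and rate-limits drains across nodes.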
Deploying NPD
```shell
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector -n kube-system

kubectl -n kube-system get pod | grep node-problem-detector
node-problem-detector-bkbrl   1/1   Running   0   23h
node-problem-detector-prfqb   1/1   Running   0   29h
node-problem-detector-tdk47   1/1   Running   0   22h
node-problem-detector-xj86c   1/1   Running   0   2d5h
node-problem-detector-xm8ff   1/1   Running   0   3d20h

kubectl describe node 10.4.83.25
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason      Message
  ----            ------  -----------------                ------------------               ------      -------
  KernelDeadlock  True    Tue, 25 Apr 2023 14:29:05 +0800  Tue, 25 Apr 2023 04:15:45 +0800  DockerHung  kernel: INFO: task docker:20744 blocked for more than 120 seconds.

Events:
  Type     Reason      Age  From            Message
  ----     ------      ---  ----            -------
  Warning  KernelOops  7s   kernel-monitor  kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING
  Warning  TaskHung    2s   kernel-monitor  kernel: INFO: task docker:20744 blocked for more than 120 seconds.
  Warning  DockerHung  2s   kernel-monitor  Node condition KernelDeadlock is now: True, reason: DockerHung, message: "kernel: INFO: task docker:20744 blocked for more than 120 seconds."
```
Deploying and Configuring Draino
Deployment
```shell
kubectl label node 10.192.177.34 draino=node

cat draino-deployment.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels: {component: draino}
  name: draino
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels: {component: draino}
  name: draino
roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
subjects:
- {kind: ServiceAccount, name: draino, namespace: kube-system}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {component: draino}
  template:
    metadata:
      labels: {component: draino}
      name: draino
      namespace: kube-system
    spec:
      imagePullSecrets:
      - name: pull-docker-image-secret
      containers:
      - command:
        - /draino
        - --debug
        - --node-label-expr=metadata['labels']['draino'] in ['master','node']
        - --max-grace-period=8m0s
        - --eviction-headroom=30s
        - --drain-buffer=10m0s
        - --namespace=kube-system
        - ReadonlyFilesystem
        image: registry.ljohn.cn/kgcr/draino:9d39b53
        livenessProbe:
          httpGet: {path: /healthz, port: 10002}
          initialDelaySeconds: 30
        name: draino
      serviceAccountName: draino

kubectl apply -f draino-deployment.yaml

kubectl -n kube-system get pod | grep drain
draino-5775bc9466-zvp4f   1/1   Running   0   10m

kubectl -n kube-system logs -f draino-5775bc9466-zvp4f
2023-04-23T06:44:30.918Z  INFO   draino/draino.go:134  web server is running  {"listen": ":10002"}
2023-04-23T06:44:30.922Z  DEBUG  draino/draino.go:197  label expression  {"expr": "metadata['labels']['draino'] in ['master','node']"}
I0423 06:44:30.922508       1 leaderelection.go:235] attempting to acquire leader lease  kube-system/draino...
I0423 06:44:48.346548       1 leaderelection.go:245] successfully acquired lease  kube-system/draino
2023-04-23T06:44:48.346Z  INFO   draino/draino.go:236  node watcher is running
```
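In the Deployment above, the trailing `ReadonlyFilesystem` argument is the node condition Draino reacts to; Draino accepts any number of conditions as positional arguments after its flags. A sketch of a command line watching several conditions at once (the set of conditions here is illustrative):

```shell
/draino --debug \
        --node-label-expr="metadata['labels']['draino'] in ['master','node']" \
        --max-grace-period=8m0s \
        --eviction-headroom=30s \
        --drain-buffer=10m0s \
        KernelDeadlock ReadonlyFilesystem   # any of these conditions triggers cordon + drain
```

Any listed condition being `True` on a label-matched node is enough to trigger the cordon-and-drain sequence.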
Testing and Verification
```shell
kubectl get pod -owide | grep 10.192.177.34
nettools-deploy-57c9465d74-29vqj   1/1   Running   1 (13d ago)   13d     10.194.82.131   10.192.177.34   <none>   <none>
nettools-deploy-57c9465d74-jt7dg   1/1   Running   0             6h50m   10.194.82.144   10.192.177.34   <none>   <none>
nettools-deploy-57c9465d74-kclmj   1/1   Running   0             5h30m   10.194.82.154   10.192.177.34   <none>   <none>
nettools-deploy-57c9465d74-l7n24   1/1   Running   1 (13d ago)   13d     10.194.82.130   10.192.177.34   <none>   <none>
nettools-deploy-57c9465d74-mz6mk   1/1   Running   0             5h30m   10.194.82.155   10.192.177.34   <none>   <none>
npd-node-problem-detector-7bllt    1/1   Running   0             26h     10.194.82.134   10.192.177.34   <none>   <none>

# On node 10.192.177.34, inject a fake kernel message that NPD's monitor matches:
sudo sh -c "echo 'Remounting filesystem read-only' >> /dev/kmsg"

kubectl describe node 10.192.177.34
...
Conditions:
  Type                   Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----                   ------  -----------------                ------------------               ------                      -------
  CorruptDockerOverlay2  False   Mon, 24 Apr 2023 16:56:15 +0800  Sun, 23 Apr 2023 14:32:51 +0800  NoCorruptDockerOverlay2     docker overlay2 is functioning properly
  KernelDeadlock         False   Mon, 24 Apr 2023 16:56:15 +0800  Sun, 23 Apr 2023 14:32:51 +0800  KernelHasNoDeadlock         kernel has no deadlock
  ReadonlyFilesystem     True    Mon, 24 Apr 2023 16:56:15 +0800  Mon, 24 Apr 2023 16:56:15 +0800  FilesystemIsReadOnly        Remounting filesystem read-only
  NetworkUnavailable     False   Tue, 11 Apr 2023 15:22:53 +0800  Tue, 11 Apr 2023 15:22:53 +0800  CalicoIsUp                  Calico is running on this node
  MemoryPressure         False   Mon, 24 Apr 2023 16:58:06 +0800  Tue, 11 Apr 2023 15:22:48 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure           False   Mon, 24 Apr 2023 16:58:06 +0800  Tue, 11 Apr 2023 15:22:48 +0800  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure            False   Mon, 24 Apr 2023 16:58:06 +0800  Tue, 11 Apr 2023 15:22:48 +0800  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready                  True    Mon, 24 Apr 2023 16:58:06 +0800  Tue, 11 Apr 2023 15:22:48 +0800  KubeletReady                kubelet is posting ready status. AppArmor enabled
  DrainScheduled         False   Mon, 24 Apr 2023 16:57:04 +0800  Mon, 24 Apr 2023 16:56:15 +0800  Draino                      Drain activity scheduled 2023-04-24T16:56:26+08:00 | Completed: 2023-04-24T16:57:04+08:00
...
Events:
  Type     Reason              Age                    From     Message
  ----     ------              ----                   ----     -------
  Warning  CordonStarting      2m29s                  draino   Cordoning node
  Warning  CordonSucceeded     2m29s                  draino   Cordoned node
  Warning  DrainScheduled      2m29s                  draino   Will drain node after 2023-04-24T16:56:26.540741556+08:00
  Normal   NodeNotSchedulable  2m28s (x2 over 6d23h)  kubelet  Node 10.192.177.34 status is now: NodeNotSchedulable
  Warning  DrainStarting       2m18s                  draino   Draining node
  Warning  DrainSucceeded      100s                   draino   Drained node

kubectl get node 10.192.177.34
NAME            STATUS                     ROLES   AGE   VERSION
10.192.177.34   Ready,SchedulingDisabled   node    13d   v1.22.14

kubectl get pod -owide -A | grep 10.192.177.34
default                        npd-node-problem-detector-7bllt   1/1   Running   0             26h     10.194.82.134   10.192.177.34   <none>   <none>
kube-system                    calico-node-7dwv8                 1/1   Running   1 (13d ago)   13d     10.192.177.34   10.192.177.34   <none>   <none>
kube-system                    node-local-dns-rdj24              1/1   Running   1 (13d ago)   13d     10.192.177.34   10.192.177.34   <none>   <none>
kubesphere-monitoring-system   node-exporter-c8xkv               2/2   Running   0             2d23h   10.192.177.34   10.192.177.34   <none>   <none>
kubesphere-system              node-shell-jzvcp                  1/1   Running   0             5d7h    10.192.177.34   10.192.177.34   <none>   <none>
```
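As the output above shows, after the drain only DaemonSet-managed pods remain on the node, and it stays in `Ready,SchedulingDisabled`. Draino cordons nodes but does not uncordon them, so if no autoscaler removes the node, it typically has to be returned to service manually once the underlying problem is fixed:

```shell
# After repairing the node, clear SchedulingDisabled so pods can schedule again
kubectl uncordon 10.192.177.34
```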
References
draino
node-problem-detector
https://www.jianshu.com/p/eeba98425307
https://www.jianshu.com/p/cc6a45cf3208