Background

While a Kubernetes cluster is running, nodes occasionally become unavailable due to component failures, kernel deadlocks, resource exhaustion, and similar problems. By default the kubelet monitors resource-related node conditions such as PIDPressure, MemoryPressure, and DiskPressure, but by the time the kubelet reports these conditions the node may already be unusable, and the kubelet may even have started evicting Pods. In such scenarios the node health detection built into vanilla Kubernetes is not sufficient. To surface node problems earlier, we need finer-grained signals describing node health, together with corresponding recovery actions, so that operations can be automated and the burden on developers and operators reduced.

NPD Fault Detection

NPD (node-problem-detector) is an open-source node health detection component from the Kubernetes community. NPD can discover node anomalies by matching system logs or files against regular expressions. Drawing on operational experience, users configure regular expressions for the log lines that indicate trouble and choose how matches should be reported. NPD parses these configuration files and, whenever a log line matches one of the configured regular expressions, reports the detected abnormal state as a NodeCondition, an Event, or a Prometheus metric. Besides log matching, NPD also accepts user-written custom detection plugins: you can integrate your own scripts or executables as NPD plugins and have NPD run them periodically.
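To make the rule format concrete, the sketch below writes a minimal kmsg-based monitor configuration. It is only an illustration: the file name custom-kernel-monitor.json, the condition, and the pattern are assumptions modeled on the shape of NPD's default kernel-monitor.json, not a verbatim copy of it.

# Sketch: a minimal kmsg monitor configuration (illustrative values, adapt to your cluster)
cat <<'EOF' > custom-kernel-monitor.json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [
    {"type": "KernelDeadlock", "reason": "KernelHasNoDeadlock", "message": "kernel has no deadlock"}
  ],
  "rules": [
    {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
     "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."}
  ]
}
EOF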

Draino: Automatic Node Cordon and Drain

Draino automatically drains Kubernetes nodes based on labels and node conditions. A node that matches all of the configured labels and any of the configured node conditions is cordoned immediately and, after the drain-buffer interval has elapsed, the pods on it are drained.

Draino is typically used together with the Node Problem Detector and the Cluster Autoscaler. NPD probes node health by watching node logs or running a script, and when it detects a problem on a node it sets a node condition on that node. The Cluster Autoscaler can be configured to delete under-utilized nodes. Combined with Draino, the three can automatically remediate failures in scenarios such as:

  1. NPD detects a permanent problem on a node and sets the corresponding node condition on it.
  2. Draino notices this node condition, immediately cordons the node so that no new pods are scheduled onto the faulty node, and schedules a drain of the node (a quick way to inspect the conditions Draino matches against is sketched after this list).
  3. Once the faulty node has been drained, the Cluster Autoscaler considers it under-utilized and, after a waiting period, scales it down.
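Since the node condition names set by NPD are exactly what Draino matches on, it helps to check which conditions your nodes currently report. One possible way to list them (the custom-columns expression is just one formatting choice):

# List each node's condition types (the values NPD sets and Draino matches against)
kubectl get nodes -o custom-columns='NODE:.metadata.name,CONDITIONS:.status.conditions[*].type'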

NPD Deployment

# Install via Helm, following the chart's documentation
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector -n kube-system
# Check that the DaemonSet pods are running properly
kubectl -n kube-system get pod | grep node-problem-detector
node-problem-detector-bkbrl 1/1 Running 0 23h
node-problem-detector-prfqb 1/1 Running 0 29h
node-problem-detector-tdk47 1/1 Running 0 22h
node-problem-detector-xj86c 1/1 Running 0 2d5h
node-problem-detector-xm8ff 1/1 Running 0 3d20h
# The chart can also be pulled locally to customize the configuration and add extra checks; this is not covered in detail here.
# Simulate kernel events to test detection:
# sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
# sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"
kubectl describe node 10.4.83.25
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock True Tue, 25 Apr 2023 14:29:05 +0800 Tue, 25 Apr 2023 04:15:45 +0800 DockerHung kernel: INFO: task docker:20744 blocked for more than 120 seconds.

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning KernelOops 7s kernel-monitor kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING
Warning TaskHung 2s kernel-monitor kernel: INFO: task docker:20744 blocked for more than 120 seconds.
Warning DockerHung 2s kernel-monitor Node condition KernelDeadlock is now: True, reason: DockerHung, message: "kernel: INFO: task docker:20744 blocked for more than 120 seconds."
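As mentioned in the comment above, detection can also be extended with custom checks. Besides log matching, NPD can periodically run a custom plugin script and turn its result into a condition or event: the script's exit code signals the outcome (0 = OK, 1 = problem, 2 = unknown) and its stdout becomes the message. The script below is only a hedged sketch; the time-sync check and the service names are illustrative assumptions, and the script still has to be referenced from a custom-plugin-monitor configuration before NPD will run it.

#!/bin/bash
# Illustrative NPD custom plugin: exit 0 = healthy, exit 1 = problem, exit 2 = unknown;
# whatever is printed to stdout is used as the condition/event message.
if ! systemctl is-active --quiet chronyd && ! systemctl is-active --quiet ntpd; then
  echo "no time-sync daemon (chronyd or ntpd) is running"
  exit 1
fi
echo "a time-sync daemon is running"
exit 0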

Draino Deployment and Configuration

Deployment

# Label the node so that draino is allowed to drain it automatically
kubectl label node 10.192.177.34 draino=node
# Deployment manifest
cat draino-deployment.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels: {component: draino}
  name: draino
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels: {component: draino}
  name: draino
roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
subjects:
- {kind: ServiceAccount, name: draino, namespace: kube-system}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels: {component: draino}
  name: draino
  namespace: kube-system
spec:
  # Draino does not currently support locking/master election, so you should
  # only run one draino at a time. Draino won't start draining nodes immediately
  # so it's usually safe for multiple drainos to exist for a brief period of
  # time.
  replicas: 1
  selector:
    matchLabels: {component: draino}
  template:
    metadata:
      labels: {component: draino}
      name: draino
      namespace: kube-system
    spec:
      imagePullSecrets:
      - name: pull-docker-image-secret
      containers:
      # You'll want to change these labels and conditions to suit your deployment.
      #- command: [/draino, --debug, --dry-run, --node-label=draino-enabled=true, BadCondition, ReallyBadCondition, KernelDeadlock, ReadonlyFilesystem]
      #  --node-label-expr="metadata['labels']['draino'] in ['master','node']" --evict-unreplicated-pods --evict-emptydir-pods --evict-daemonset-pods KernelDeadlock ReadonlyFilesystem
      - command:
        - /draino
        - --debug
        #- --node-label=draino-enabled=true
        #- --dry-run
        - --node-label-expr=metadata['labels']['draino'] in ['master','node']
        - --max-grace-period=8m0s
        - --eviction-headroom=30s
        - --drain-buffer=10m0s
        - --namespace=kube-system
        #- --evict-unreplicated-pods
        #- --evict-daemonset-pods
        #- --evict-emptydir-pods
        #- KernelDeadlock
        - ReadonlyFilesystem
        image: registry.ljohn.cn/kgcr/draino:9d39b53
        livenessProbe:
          httpGet: {path: /healthz, port: 10002}
          initialDelaySeconds: 30
        name: draino
      serviceAccountName: draino
# Check the pod and its logs
kubectl -n kube-system get pod | grep drain
draino-5775bc9466-zvp4f 1/1 Running 0 10m
kubectl -n kube-system logs -f draino-5775bc9466-zvp4f
2023-04-23T06:44:30.918Z INFO draino/draino.go:134 web server is running {"listen": ":10002"}
2023-04-23T06:44:30.922Z DEBUG draino/draino.go:197 label expression {"expr": "metadata['labels']['draino'] in ['master','node']"}
I0423 06:44:30.922508 1 leaderelection.go:235] attempting to acquire leader lease kube-system/draino...
I0423 06:44:48.346548 1 leaderelection.go:245] successfully acquired lease kube-system/draino
2023-04-23T06:44:48.346Z INFO draino/draino.go:236 node watcher is running
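A few of the flags above are worth calling out: the trailing positional argument ReadonlyFilesystem is the node condition draino reacts to, --drain-buffer sets the minimum interval between starting drains, and --max-grace-period plus --eviction-headroom bound how long each eviction may take. Before letting draino evict real workloads, it can be prudent to enable the commented-out --dry-run flag first so that cordon/drain decisions only show up in the logs; one way to follow them (a sketch, the grep filter is merely illustrative):

# With --dry-run enabled in the args above, watch draino's decisions without real cordon/drain
kubectl -n kube-system logs -f deploy/draino | grep -Ei 'cordon|drain'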

Test and Verification

# Check the nettools-deploy pods running on the node
kubectl get pod -owide | grep 10.192.177.34
nettools-deploy-57c9465d74-29vqj 1/1 Running 1 (13d ago) 13d 10.194.82.131 10.192.177.34 <none> <none>
nettools-deploy-57c9465d74-jt7dg 1/1 Running 0 6h50m 10.194.82.144 10.192.177.34 <none> <none>
nettools-deploy-57c9465d74-kclmj 1/1 Running 0 5h30m 10.194.82.154 10.192.177.34 <none> <none>
nettools-deploy-57c9465d74-l7n24 1/1 Running 1 (13d ago) 13d 10.194.82.130 10.192.177.34 <none> <none>
nettools-deploy-57c9465d74-mz6mk 1/1 Running 0 5h30m 10.194.82.155 10.192.177.34 <none> <none>
npd-node-problem-detector-7bllt 1/1 Running 0 26h 10.194.82.134 10.192.177.34 <none> <none>
# Manually inject a kernel message on the node to simulate a read-only filesystem
sudo sh -c "echo 'Remounting filesystem read-only' >> /dev/kmsg"
# Check the node conditions and events
kubectl describe node 10.192.177.34
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Mon, 24 Apr 2023 16:56:15 +0800 Sun, 23 Apr 2023 14:32:51 +0800 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
KernelDeadlock False Mon, 24 Apr 2023 16:56:15 +0800 Sun, 23 Apr 2023 14:32:51 +0800 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem True Mon, 24 Apr 2023 16:56:15 +0800 Mon, 24 Apr 2023 16:56:15 +0800 FilesystemIsReadOnly Remounting filesystem read-only
NetworkUnavailable False Tue, 11 Apr 2023 15:22:53 +0800 Tue, 11 Apr 2023 15:22:53 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Mon, 24 Apr 2023 16:58:06 +0800 Tue, 11 Apr 2023 15:22:48 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 24 Apr 2023 16:58:06 +0800 Tue, 11 Apr 2023 15:22:48 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 24 Apr 2023 16:58:06 +0800 Tue, 11 Apr 2023 15:22:48 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 24 Apr 2023 16:58:06 +0800 Tue, 11 Apr 2023 15:22:48 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
DrainScheduled False Mon, 24 Apr 2023 16:57:04 +0800 Mon, 24 Apr 2023 16:56:15 +0800 Draino Drain activity scheduled 2023-04-24T16:56:26+08:00 | Completed: 2023-04-24T16:57:04+08:00
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning CordonStarting 2m29s draino Cordoning node
Warning CordonSucceeded 2m29s draino Cordoned node
Warning DrainScheduled 2m29s draino Will drain node after 2023-04-24T16:56:26.540741556+08:00
Normal NodeNotSchedulable 2m28s (x2 over 6d23h) kubelet Node 10.192.177.34 status is now: NodeNotSchedulable
Warning DrainStarting 2m18s draino Draining node
Warning DrainSucceeded 100s draino Drained node

# Finally, the node has been cordoned (scheduling disabled) and, apart from DaemonSet pods, all pods on the host have been evicted; the nettools-deploy pods have been rescheduled onto other nodes.
kubectl get node 10.192.177.34
NAME STATUS ROLES AGE VERSION
10.192.177.34 Ready,SchedulingDisabled node 13d v1.22.14
kubectl get pod -owide -A | grep 10.192.177.34
default npd-node-problem-detector-7bllt 1/1 Running 0 26h 10.194.82.134 10.192.177.34 <none> <none>
kube-system calico-node-7dwv8 1/1 Running 1 (13d ago) 13d 10.192.177.34 10.192.177.34 <none> <none>
kube-system node-local-dns-rdj24 1/1 Running 1 (13d ago) 13d 10.192.177.34 10.192.177.34 <none> <none>
kubesphere-monitoring-system node-exporter-c8xkv 2/2 Running 0 2d23h 10.192.177.34 10.192.177.34 <none> <none>
kubesphere-system node-shell-jzvcp 1/1 Running 0 5d7h 10.192.177.34 10.192.177.34 <none> <none>
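Note that draino only cordons and drains; in the workflow described earlier it is the Cluster Autoscaler that eventually removes the node. Without an autoscaler, the node stays in SchedulingDisabled until the underlying problem is fixed and an operator uncordons it, for example:

# After repairing the node, bring it back into scheduling
kubectl uncordon 10.192.177.34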

References

draino: https://github.com/planetlabs/draino

node-problem-detector: https://github.com/kubernetes/node-problem-detector

https://www.jianshu.com/p/eeba98425307

https://www.jianshu.com/p/cc6a45cf3208