2023-06-25 10:48:34 Source: cnblogs (博客園)
The main job of taintManager is this: once a node is given a NoExecute taint, taintManager evicts every pod on that node that does not tolerate the taint, and a newly created pod must also tolerate the taint before it can be scheduled onto the node.
The kcm (kube-controller-manager) startup flag --enable-taint-manager determines whether taintManager is started. It is started when the flag is true (the default value is true).
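The kcm startup flag --feature-gates=TaintBasedEvictions=xxx, which defaults to true, works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
When a NoExecute taint appears on a node, taintManager checks whether each pod on the node tolerates the node's taints. A pod that does not tolerate them is deleted immediately. A pod that tolerates all of them is deleted once the smallest tolerationSeconds among its matching tolerations has elapsed; if none of the matching tolerations sets a time limit, the pod is tolerated indefinitely and no deletion is scheduled.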
NoExecuteTaintManager is the main struct of taintManager. Its key fields are:
(1) taintEvictionQueue: a pod that cannot tolerate a NoExecute taint on its node is added to this queue and is then deleted;
(2) taintedNodes: records the NoExecute taints of each node;
(3) nodeUpdateQueue: a node enters this queue on an add, delete, or update event (for an update, only when the old and new node objects have different taints);
(4) podUpdateQueue: a pod enters this queue on an add, delete, or update event (for an update, only when the NodeName or Tolerations of the old and new pod objects differ);
(5) nodeUpdateChannels: 8 channels of type nodeUpdateItem. A worker consumes nodeUpdateQueue, computes an index from the node name, and puts the node into one of these channels;
(6) podUpdateChannels: 8 channels of type podUpdateItem. A worker consumes podUpdateQueue, computes an index from the pod's node name, and puts the pod into one of these channels.
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
type NoExecuteTaintManager struct {
	client                clientset.Interface
	recorder              record.EventRecorder
	getPod                GetPodFunc
	getNode               GetNodeFunc
	getPodsAssignedToNode GetPodsByNodeNameFunc
	taintEvictionQueue    *TimedWorkerQueue
	// keeps a map from nodeName to all noExecute taints on that Node
	taintedNodesLock   sync.Mutex
	taintedNodes       map[string][]v1.Taint
	nodeUpdateChannels []chan nodeUpdateItem
	podUpdateChannels  []chan podUpdateItem
	nodeUpdateQueue    workqueue.Interface
	podUpdateQueue     workqueue.Interface
}
1.2 taintEvictionQueue analysis
The taintEvictionQueue field is a queue of type TimedWorkerQueue. Calling tc.taintEvictionQueue.AddWork adds a pod to the queue and sets a timer; when the timer expires, workFunc runs automatically. When taintEvictionQueue is initialized, the workFunc passed in is the function returned by deletePodHandler, whose job is to delete the pod.
So a pod that enters taintEvictionQueue is deleted at the scheduled time.
pod.Spec.Tolerations holds the pod's taint toleration settings.
// vendor/k8s.io/api/core/v1/types.go
type Toleration struct {
	Key               string             `json:"key,omitempty" protobuf:"bytes,1,opt,name=key"`
	Operator          TolerationOperator `json:"operator,omitempty" protobuf:"bytes,2,opt,name=operator,casttype=TolerationOperator"`
	Value             string             `json:"value,omitempty" protobuf:"bytes,3,opt,name=value"`
	Effect            TaintEffect        `json:"effect,omitempty" protobuf:"bytes,4,opt,name=effect,casttype=TaintEffect"`
	TolerationSeconds *int64             `json:"tolerationSeconds,omitempty" protobuf:"varint,5,opt,name=tolerationSeconds"`
}
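The fields of a toleration are:
(1) Key: matches the Key of the node taint;
(2) Operator: describes how the toleration's Value relates to the taint's Value when their Keys are the same. The default, Equal, requires the two values to be equal; Exists requires only that the Keys match, and the Values are not compared;
(3) Value: matches the Value of the node taint;
(4) Effect: matches the Effect of the node taint;
(5) TolerationSeconds: how long the node taint is tolerated.
Configuration example: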
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
With this configuration, if the pod is running and a matching taint is added to its node, the pod keeps running on that node for another 3600 seconds and is then evicted (if the matching node taint is removed before then, the pod is not evicted).
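To make the Operator semantics concrete, here is a minimal, self-contained sketch of how a toleration might be matched against a taint. The Taint and Toleration types are simplified stand-ins defined only for this example (they are not the k8s.io/api types), and toleratesTaint is a hypothetical helper that follows the field rules listed above, assuming that an empty Effect or Key on the toleration matches any value.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the k8s.io/api/core/v1
// types; they exist only for this illustration.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// toleratesTaint is a hypothetical helper. Effect and Key must match when
// they are set on the toleration; Operator then decides whether Value is
// compared.
func toleratesTaint(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // Equal is the default operator
		return tol.Value == taint.Value
	case "Exists": // only the Key has to match
		return true
	default:
		return false
	}
}

func main() {
	taint := Taint{Key: "key1", Value: "value1", Effect: "NoExecute"}
	equal := Toleration{Key: "key1", Operator: "Equal", Value: "value1", Effect: "NoExecute"}
	exists := Toleration{Key: "key1", Operator: "Exists", Effect: "NoExecute"}
	wrong := Toleration{Key: "key1", Operator: "Equal", Value: "other", Effect: "NoExecute"}
	fmt.Println(toleratesTaint(equal, taint), toleratesTaint(exists, taint), toleratesTaint(wrong, taint))
}
```

In this sketch the first two tolerations match the taint and the third does not, because Equal compares Value while Exists ignores it.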
2. Initialization analysis
2.1 NewNodeLifecycleController
NewNodeLifecycleController is the initialization function of NodeLifecycleController. It registers pod and node EventHandlers for taintManager; Add, Update, and Delete events all call taintManager's PodUpdated or NodeUpdated method for handling.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func NewNodeLifecycleController(
	...
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			...
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(nil, pod)
			}
		},
		UpdateFunc: func(prev, obj interface{}) {
			...
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(prevPod, newPod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			...
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(pod, nil)
			}
		},
	})
	...
	if nc.runTaintManager {
		podGetter := func(name, namespace string) (*v1.Pod, error) { return nc.podLister.Pods(namespace).Get(name) }
		nodeLister := nodeInformer.Lister()
		nodeGetter := func(name string) (*v1.Node, error) { return nodeLister.Get(name) }
		nc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient, podGetter, nodeGetter, nc.getPodsAssignedToNode)
		nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
				nc.taintManager.NodeUpdated(nil, node)
				return nil
			}),
			UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
				nc.taintManager.NodeUpdated(oldNode, newNode)
				return nil
			}),
			DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
				nc.taintManager.NodeUpdated(node, nil)
				return nil
			}),
		})
	}
	...
}
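2.1.1 tc.NodeUpdated
The tc.NodeUpdated method compares the NoExecute taints of the old and new node objects. If they differ, or if the event is an add or delete (one of the two objects is nil), it calls tc.nodeUpdateQueue.Add to put the node into nodeUpdateQueue.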
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) NodeUpdated(oldNode *v1.Node, newNode *v1.Node) {
	nodeName := ""
	oldTaints := []v1.Taint{}
	if oldNode != nil {
		nodeName = oldNode.Name
		oldTaints = getNoExecuteTaints(oldNode.Spec.Taints)
	}
	newTaints := []v1.Taint{}
	if newNode != nil {
		nodeName = newNode.Name
		newTaints = getNoExecuteTaints(newNode.Spec.Taints)
	}
	if oldNode != nil && newNode != nil && helper.Semantic.DeepEqual(oldTaints, newTaints) {
		return
	}
	updateItem := nodeUpdateItem{
		nodeName: nodeName,
	}
	tc.nodeUpdateQueue.Add(updateItem)
}
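2.1.2 tc.PodUpdated
The tc.PodUpdated method compares the NodeName and Tolerations of the old and new pod objects. If either differs, or if the event is an add or delete (one of the two objects is nil), it calls tc.podUpdateQueue.Add to put the pod into podUpdateQueue.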
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) PodUpdated(oldPod *v1.Pod, newPod *v1.Pod) {
	podName := ""
	podNamespace := ""
	nodeName := ""
	oldTolerations := []v1.Toleration{}
	if oldPod != nil {
		podName = oldPod.Name
		podNamespace = oldPod.Namespace
		nodeName = oldPod.Spec.NodeName
		oldTolerations = oldPod.Spec.Tolerations
	}
	newTolerations := []v1.Toleration{}
	if newPod != nil {
		podName = newPod.Name
		podNamespace = newPod.Namespace
		nodeName = newPod.Spec.NodeName
		newTolerations = newPod.Spec.Tolerations
	}
	if oldPod != nil && newPod != nil && helper.Semantic.DeepEqual(oldTolerations, newTolerations) && oldPod.Spec.NodeName == newPod.Spec.NodeName {
		return
	}
	updateItem := podUpdateItem{
		podName:      podName,
		podNamespace: podNamespace,
		nodeName:     nodeName,
	}
	tc.podUpdateQueue.Add(updateItem)
}
2.2 taintEvictionQueue
In NewNoExecuteTaintManager, the initialization function of TaintManager, CreateWorkerQueue is called to initialize taintEvictionQueue.
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func NewNoExecuteTaintManager(...) ... {
	...
	tm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent))
	...
}
The CreateWorkerQueue function initializes and returns a TimedWorkerQueue struct.
// pkg/controller/nodelifecycle/scheduler/timed_workers.go
func CreateWorkerQueue(f func(args *WorkArgs) error) *TimedWorkerQueue {
	return &TimedWorkerQueue{
		workers:  make(map[string]*TimedWorker),
		workFunc: f,
	}
}
2.2.1 deletePodHandler
When taintEvictionQueue is initialized, deletePodHandler is passed in as the handler for the items in the queue. Its main logic is to send a request to the apiserver to delete the pod object, retrying on failure, which is why a pod placed in taintEvictionQueue ends up deleted.
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func deletePodHandler(c clientset.Interface, emitEventFunc func(types.NamespacedName)) func(args *WorkArgs) error {
	return func(args *WorkArgs) error {
		ns := args.NamespacedName.Namespace
		name := args.NamespacedName.Name
		klog.V(0).Infof("NoExecuteTaintManager is deleting Pod: %v", args.NamespacedName.String())
		if emitEventFunc != nil {
			emitEventFunc(args.NamespacedName)
		}
		var err error
		for i := 0; i < retries; i++ {
			err = c.CoreV1().Pods(ns).Delete(name, &metav1.DeleteOptions{})
			if err == nil {
				break
			}
			time.Sleep(10 * time.Millisecond)
		}
		return err
	}
}
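2.2.2 tc.taintEvictionQueue.AddWork
Next, look at the tc.taintEvictionQueue.AddWork method. It adds a pod to taintEvictionQueue by calling CreateWorker to create a worker that will delete the pod. If work for the same key already exists in the queue, the call logs a warning and returns without adding anything.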
// pkg/controller/nodelifecycle/scheduler/timed_workers.go
func (q *TimedWorkerQueue) AddWork(args *WorkArgs, createdAt time.Time, fireAt time.Time) {
	key := args.KeyFromWorkArgs()
	klog.V(4).Infof("Adding TimedWorkerQueue item %v at %v to be fired at %v", key, createdAt, fireAt)
	q.Lock()
	defer q.Unlock()
	if _, exists := q.workers[key]; exists {
		klog.Warningf("Trying to add already existing work for %+v. Skipping.", args)
		return
	}
	worker := CreateWorker(args, createdAt, fireAt, q.getWrappedWorkerFunc(key))
	q.workers[key] = worker
}
The CreateWorker function first decides whether workFunc should run immediately. If the delay is zero or negative, it starts a goroutine to run workFunc right away and returns nil; otherwise it creates a timer that runs workFunc in its own goroutine once the delay has elapsed.
// pkg/controller/nodelifecycle/scheduler/timed_workers.go
func CreateWorker(args *WorkArgs, createdAt time.Time, fireAt time.Time, f func(args *WorkArgs) error) *TimedWorker {
	delay := fireAt.Sub(createdAt)
	if delay <= 0 {
		go f(args)
		return nil
	}
	timer := time.AfterFunc(delay, func() { f(args) })
	return &TimedWorker{
		WorkItem:  args,
		CreatedAt: createdAt,
		FireAt:    fireAt,
		Timer:     timer,
	}
}
2.2.3 tc.taintEvictionQueue.Cancel
The Cancel method stops the timer of the pod's worker, so the workFunc for that pod never runs (the pod is not deleted).
// pkg/controller/nodelifecycle/scheduler/timed_workers.go
func (w *TimedWorker) Cancel() {
	if w != nil {
		w.Timer.Stop()
	}
}
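3. Core processing logic analysis
nc.taintManager.Run
nc.taintManager.Run is the start method of taintManager, and all of the processing logic lives here. It checks whether the pods on a node tolerate the node's NoExecute taints: a pod that does not is deleted, and a pod that tolerates all of the taints is deleted after the smallest of its matching toleration times has elapsed.
Main logic:
(1) Create 8 channels of type nodeUpdateItem (buffer size 10) and assign them to tc.nodeUpdateChannels; create 8 channels of type podUpdateItem (buffer size 1) and assign them to tc.podUpdateChannels;
(2) Consume tc.nodeUpdateQueue, compute a hash from the node name, and put the node into the corresponding tc.nodeUpdateChannels[hash];
(3) Consume tc.podUpdateQueue, compute a hash from the pod's node name, and put the pod into the corresponding tc.podUpdateChannels[hash];
(4) Start 8 goroutines, each calling tc.worker to process one channel from tc.nodeUpdateChannels and the channel at the same index in tc.podUpdateChannels. Each worker checks whether the pods on a node tolerate the node's NoExecute taints: a pod that does not is deleted, and a pod that tolerates all of the taints is deleted after the smallest of its matching toleration times has elapsed.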
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {
	klog.V(0).Infof("Starting NoExecuteTaintManager")
	for i := 0; i < UpdateWorkerSize; i++ {
		tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan nodeUpdateItem, NodeUpdateChannelSize))
		tc.podUpdateChannels = append(tc.podUpdateChannels, make(chan podUpdateItem, podUpdateChannelSize))
	}
	// Functions that are responsible for taking work items out of the workqueues and putting them
	// into channels.
	go func(stopCh <-chan struct{}) {
		for {
			item, shutdown := tc.nodeUpdateQueue.Get()
			if shutdown {
				break
			}
			nodeUpdate := item.(nodeUpdateItem)
			hash := hash(nodeUpdate.nodeName, UpdateWorkerSize)
			select {
			case <-stopCh:
				tc.nodeUpdateQueue.Done(item)
				return
			case tc.nodeUpdateChannels[hash] <- nodeUpdate:
				// tc.nodeUpdateQueue.Done is called by the nodeUpdateChannels worker
			}
		}
	}(stopCh)
	go func(stopCh <-chan struct{}) {
		for {
			item, shutdown := tc.podUpdateQueue.Get()
			if shutdown {
				break
			}
			// The fact that pods are processed by the same worker as nodes is used to avoid races
			// between node worker setting tc.taintedNodes and pod worker reading this to decide
			// whether to delete pod.
			// It's possible that even without this assumption this code is still correct.
			podUpdate := item.(podUpdateItem)
			hash := hash(podUpdate.nodeName, UpdateWorkerSize)
			select {
			case <-stopCh:
				tc.podUpdateQueue.Done(item)
				return
			case tc.podUpdateChannels[hash] <- podUpdate:
				// tc.podUpdateQueue.Done is called by the podUpdateChannels worker
			}
		}
	}(stopCh)
	wg := sync.WaitGroup{}
	wg.Add(UpdateWorkerSize)
	for i := 0; i < UpdateWorkerSize; i++ {
		go tc.worker(i, wg.Done, stopCh)
	}
	wg.Wait()
}
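The hash helper called in Run is not shown in this article. As a rough illustration of how a node name could be mapped to one of the 8 channel indexes, the sketch below assumes an FNV-1a hash taken modulo the worker count; shardIndex and updateWorkerSize are names made up for this example. The point it demonstrates is that every update for the same node name lands on the same index, so node and pod updates for one node are handled by the same worker.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// updateWorkerSize mirrors the 8 workers described above (assumed value).
const updateWorkerSize = 8

// shardIndex is a hypothetical helper that maps a node name to a worker
// index in [0, max) using a 32-bit FNV-1a hash.
func shardIndex(nodeName string, max int) int {
	h := fnv.New32a()
	h.Write([]byte(nodeName)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(max))
}

func main() {
	for _, name := range []string{"node-a", "node-b", "node-c"} {
		fmt.Printf("%s -> worker %d\n", name, shardIndex(name, updateWorkerSize))
	}
}
```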
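tc.worker
The tc.worker method consumes nodeUpdateChannels and podUpdateChannels, calling tc.handleNodeUpdate and tc.handlePodUpdate respectively for further processing. Node updates take priority: when a pod update arrives, the worker first drains its node channel and only then handles the pod update.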
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) worker(worker int, done func(), stopCh <-chan struct{}) {
	defer done()
	// When processing events we want to prioritize Node updates over Pod updates,
	// as NodeUpdates that interest NoExecuteTaintManager should be handled as soon as possible -
	// we don't want user (or system) to wait until PodUpdate queue is drained before it can
	// start evicting Pods from tainted Nodes.
	for {
		select {
		case <-stopCh:
			return
		case nodeUpdate := <-tc.nodeUpdateChannels[worker]:
			tc.handleNodeUpdate(nodeUpdate)
			tc.nodeUpdateQueue.Done(nodeUpdate)
		case podUpdate := <-tc.podUpdateChannels[worker]:
			// If we found a Pod update we need to empty Node queue first.
		priority:
			for {
				select {
				case nodeUpdate := <-tc.nodeUpdateChannels[worker]:
					tc.handleNodeUpdate(nodeUpdate)
					tc.nodeUpdateQueue.Done(nodeUpdate)
				default:
					break priority
				}
			}
			// After Node queue is emptied we process podUpdate.
			tc.handlePodUpdate(podUpdate)
			tc.podUpdateQueue.Done(podUpdate)
		}
	}
}
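The "drain the node channel before handling a pod update" pattern in tc.worker can be isolated in a few lines. This is a minimal sketch with string payloads; handlePodAfterNodes is a hypothetical function that returns the order in which items would be handled, instead of calling the real handlers.

```go
package main

import "fmt"

// handlePodAfterNodes drains every node update already buffered in nodeCh
// before it handles podUpdate, and returns the resulting handling order.
func handlePodAfterNodes(podUpdate string, nodeCh <-chan string) []string {
	var order []string
drain:
	for {
		select {
		case n := <-nodeCh:
			order = append(order, "node:"+n)
		default:
			break drain // the node channel is empty, stop draining
		}
	}
	return append(order, "pod:"+podUpdate)
}

func main() {
	nodeCh := make(chan string, 10)
	nodeCh <- "node-1"
	nodeCh <- "node-2"
	fmt.Println(handlePodAfterNodes("default/pod-a", nodeCh))
}
```

The select with a default branch makes the receive non-blocking, which is what lets the loop stop as soon as no node update is pending rather than waiting for the next one.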
3.1 tc.handleNodeUpdate
The tc.handleNodeUpdate method checks whether the pods on a node tolerate the node's NoExecute taints: a pod that does not is deleted, and a pod that tolerates all of the taints is deleted after the smallest of its matching toleration times has elapsed.
Main logic:
(1) Get the node object from the informer's local cache;
(2) Extract the NoExecute taints from node.Spec.Taints;
(3) Update tc.taintedNodes with the node's NoExecute taints;
(4) Call tc.getPodsAssignedToNode to get all pods on the node; if there are none, return directly;
(5) If the node has no NoExecute taints, iterate over all pods on the node, call tc.cancelWorkWithEvent to remove each pod from taintEvictionQueue, and then return;
(6) Iterate over all pods on the node and call tc.processPodOnNode for further processing of each pod.
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate nodeUpdateItem) {
	node, err := tc.getNode(nodeUpdate.nodeName)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// Delete
			klog.V(4).Infof("Noticed node deletion: %#v", nodeUpdate.nodeName)
			tc.taintedNodesLock.Lock()
			defer tc.taintedNodesLock.Unlock()
			delete(tc.taintedNodes, nodeUpdate.nodeName)
			return
		}
		utilruntime.HandleError(fmt.Errorf("cannot get node %s: %v", nodeUpdate.nodeName, err))
		return
	}
	// Create or Update
	klog.V(4).Infof("Noticed node update: %#v", nodeUpdate)
	taints := getNoExecuteTaints(node.Spec.Taints)
	func() {
		tc.taintedNodesLock.Lock()
		defer tc.taintedNodesLock.Unlock()
		klog.V(4).Infof("Updating known taints on node %v: %v", node.Name, taints)
		if len(taints) == 0 {
			delete(tc.taintedNodes, node.Name)
		} else {
			tc.taintedNodes[node.Name] = taints
		}
	}()
	// This is critical that we update tc.taintedNodes before we call getPodsAssignedToNode:
	// getPodsAssignedToNode can be delayed as long as all future updates to pods will call
	// tc.PodUpdated which will use tc.taintedNodes to potentially delete delayed pods.
	pods, err := tc.getPodsAssignedToNode(node.Name)
	if err != nil {
		klog.Errorf(err.Error())
		return
	}
	if len(pods) == 0 {
		return
	}
	// Short circuit, to make this controller a bit faster.
	if len(taints) == 0 {
		klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
		for i := range pods {
			tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
		}
		return
	}
	now := time.Now()
	for _, pod := range pods {
		podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
		tc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now)
	}
}
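3.1.1 tc.processPodOnNode
The tc.processPodOnNode method checks whether a pod tolerates all of the NoExecute taints on its node. If it does not, the pod is added to taintEvictionQueue for immediate deletion. If it tolerates all of them, the pod is added to taintEvictionQueue with a timer set to the smallest of its matching toleration times.
Main logic:
(1) If the node has no NoExecute taints, call tc.cancelWorkWithEvent to remove the pod from taintEvictionQueue;
(2) Call v1helper.GetMatchingTolerations to determine whether the pod tolerates all of the NoExecute taints on the node, and to get the list of tolerations that match those taints;
(3) If the pod does not tolerate all of the taints, call tc.taintEvictionQueue.AddWork with the current time as the fire time, so the pod is deleted right away;
(4) If the pod tolerates all of the taints, compute the smallest toleration time among the matching tolerations. A negative result means the pod tolerates the taints forever, and nothing is scheduled. Otherwise call tc.taintEvictionQueue.AddWork with the fire time set to the start time plus that minimum, so the pod is deleted when the time is reached.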
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) processPodOnNode(
	podNamespacedName types.NamespacedName,
	nodeName string,
	tolerations []v1.Toleration,
	taints []v1.Taint,
	now time.Time,
) {
	if len(taints) == 0 {
		tc.cancelWorkWithEvent(podNamespacedName)
	}
	allTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations)
	if !allTolerated {
		klog.V(2).Infof("Not all taints are tolerated after update for Pod %v on %v", podNamespacedName.String(), nodeName)
		// We're canceling scheduled work (if any), as we're going to delete the Pod right away.
		tc.cancelWorkWithEvent(podNamespacedName)
		tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now())
		return
	}
	minTolerationTime := getMinTolerationTime(usedTolerations)
	// getMinTolerationTime returns negative value to denote infinite toleration.
	if minTolerationTime < 0 {
		klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
		return
	}
	startTime := now
	triggerTime := startTime.Add(minTolerationTime)
	scheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String())
	if scheduledEviction != nil {
		startTime = scheduledEviction.CreatedAt
		if startTime.Add(minTolerationTime).Before(triggerTime) {
			return
		}
		tc.cancelWorkWithEvent(podNamespacedName)
	}
	tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime)
}
3.2 tc.handlePodUpdate
The tc.handlePodUpdate method also ends up calling tc.processPodOnNode for further processing of the pod.
tc.processPodOnNode was analyzed above, so it is not covered again here.
Main logic:
(1) Get the pod object from the informer's local cache;
(2) Get the pod's node name; if it is empty, return directly;
(3) Look up the node's taints in tc.taintedNodes by node name; if the node has no entry, return directly;
(4) Call tc.processPodOnNode for further processing of the pod.
// pkg/controller/nodelifecycle/scheduler/taint_manager.go
func (tc *NoExecuteTaintManager) handlePodUpdate(podUpdate podUpdateItem) {
	pod, err := tc.getPod(podUpdate.podName, podUpdate.podNamespace)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// Delete
			podNamespacedName := types.NamespacedName{Namespace: podUpdate.podNamespace, Name: podUpdate.podName}
			klog.V(4).Infof("Noticed pod deletion: %#v", podNamespacedName)
			tc.cancelWorkWithEvent(podNamespacedName)
			return
		}
		utilruntime.HandleError(fmt.Errorf("could not get pod %s/%s: %v", podUpdate.podName, podUpdate.podNamespace, err))
		return
	}
	// We key the workqueue and shard workers by nodeName. If we don't match the current state we should not be the one processing the current object.
	if pod.Spec.NodeName != podUpdate.nodeName {
		return
	}
	// Create or Update
	podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
	klog.V(4).Infof("Noticed pod update: %#v", podNamespacedName)
	nodeName := pod.Spec.NodeName
	if nodeName == "" {
		return
	}
	taints, ok := func() ([]v1.Taint, bool) {
		tc.taintedNodesLock.Lock()
		defer tc.taintedNodesLock.Unlock()
		taints, ok := tc.taintedNodes[nodeName]
		return taints, ok
	}()
	// It's possible that Node was deleted, or Taints were removed before, which triggered
	// eviction cancelling if it was needed.
	if !ok {
		return
	}
	tc.processPodOnNode(podNamespacedName, nodeName, pod.Spec.Tolerations, taints, time.Now())
}
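Summary
The main job of taintManager is this: once a node is given a NoExecute taint, taintManager evicts every pod on that node that does not tolerate the taint, and a newly created pod must also tolerate the taint before it can be scheduled onto the node.
The kcm startup flag --enable-taint-manager determines whether taintManager is started. It is started when the flag is true (the default value is true).
The kcm startup flag --feature-gates=TaintBasedEvictions=xxx, which defaults to true, works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
When a NoExecute taint appears on a node, taintManager checks whether each pod on the node tolerates the node's taints. A pod that does not tolerate them is deleted immediately. A pod that tolerates all of them is deleted once the smallest of its matching toleration times has elapsed.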