Configure pod resource request for k8s etcd pod - kubernetes

When running k8s 1.18 with the default “on cluster” etcd pod deployment, what is the way to assign a resource (CPU/memory) request to, or otherwise influence, the pod spec for the etcd container?
The default configuration provides no resource requests or limits.
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system etcd-172-25-87-82-hybrid.com 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m
I’m aware that one can pass extra args to etcd via the kubeadm extraArgs config, but these do not cover the etcd pod resources.
etcd:
local:
extraArgs:
heartbeat-interval: "1000"
election-timeout: "5000"
The question extends to the other resources in the kube-system namespace, e.g. coredns.

After the cluster is initialized, you can find the generated /etc/kubernetes/manifests/etcd.yaml. Have you tried editing it? The kubelet should pick up the changes and restart the etcd instance.
root@kube-1:~# cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.154.0.33:2379
creationTimestamp: null
labels:
component: etcd
tier: control-plane
name: etcd
namespace: kube-system
spec:
containers:
- command:
- etcd
- --advertise-client-urls=https://10.154.0.33:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --initial-advertise-peer-urls=https://10.154.0.33:2380
- --initial-cluster=kube-1=https://10.154.0.33:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://10.154.0.33:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://10.154.0.33:2380
- --name=kube-1
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
image: k8s.gcr.io/etcd:3.4.13-0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /health
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: etcd
resources: {}
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /health
port: 2381
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /var/lib/etcd
name: etcd-data
- mountPath: /etc/kubernetes/pki/etcd
name: etcd-certs
hostNetwork: true
priorityClassName: system-node-critical
volumes:
- hostPath:
path: /etc/kubernetes/pki/etcd
type: DirectoryOrCreate
name: etcd-certs
- hostPath:
path: /var/lib/etcd
type: DirectoryOrCreate
name: etcd-data
status: {}
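For example, you could replace the empty resources: {} under the etcd container in that manifest with a stanza like the following (the values are only illustrative assumptions, so size them for your own cluster):
resources:
  requests:
    cpu: 200m        # illustrative request
    memory: 512Mi    # illustrative request
  limits:
    memory: 2Gi      # illustrative limit
After saving the file, the kubelet should recreate the etcd static pod with the new requests. For the other kube-system components mentioned in the question, coredns is an ordinary Deployment, so its resources can be changed with kubectl -n kube-system edit deployment coredns.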

Related

Multiple kube-schedulers

I'm trying to create a kube-scheduler that will be used by my organisation's pod deployments (a data-center). The objective is to have the pods of any deployment split equally between the two sites of the data-center. To achieve this I created a new kube-scheduler manifest (see below), in which I changed the ports that are used and added a new configuration file (see below) via the --config parameter. The problem is that even after setting the --port parameter to a port that isn't already used by the original kube-scheduler, it still tries to use the old one. As such the new kube-scheduler can't start:
I1109 20:27:59.225996 1 registry.go:173] Registering SelectorSpread plugin
I1109 20:27:59.226097 1 registry.go:173] Registering SelectorSpread plugin
I1109 20:27:59.926994 1 serving.go:331] Generated self-signed cert in-memory
failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use
How can I force the use of the specified port, or how can I delete the original kube-scheduler so that there is no conflict between the two? The first solution would be preferable.
Configuration of the default kube-scheduler (manifest /etc/kubernetes/manifests/kube-scheduler.yaml):
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-scheduler
tier: control-plane
name: kube-scheduler
namespace: kube-system
spec:
containers:
- command:
- kube-scheduler
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=true
#- --port=0
image: k8s.gcr.io/kube-scheduler:v1.19.3
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: kube-scheduler
resources:
requests:
cpu: 100m
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
volumes:
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
status: {}
Configuration of kube-custom-scheduler.yaml (/etc/kubernetes/manifests/kube-custom-scheduler.yaml):
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-custom-scheduler
tier: control-plane
name: kube-custom-scheduler
namespace: kube-system
spec:
containers:
- command:
- kube-scheduler
- --config=/etc/kubernetes/scheduler-custom.conf
- --master=true
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=false
- --secure-port=10269
- --scheduler-name=kube-custom-scheduler
- --port=10261
image: k8s.gcr.io/kube-scheduler:v1.19.3
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10269
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
name: kube-scheduler
resources:
requests:
cpu: 100m
startupProbe:
failureThreshold: 24
httpGet:
host: 127.0.0.1
path: /healthz
port: 10269
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 15
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
- mountPath: /etc/kubernetes/scheduler-custom.conf
name: customconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
volumes:
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
- hostPath:
path: /etc/kubernetes/scheduler-custom.conf
type: FileOrCreate
name: customconfig
status: {}
Custom configuration mentioned in the kube-custom-scheduler.yaml (/etc/kubernetes/scheduler-custom.conf):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kube.io/datacenter
whenUnsatisfiable: ScheduleAnyway
Please don't hesitate to ask if you need more information about the cluster.
According to https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/, the --port flag is overridden by the file specified with --config. Adding healthzBindAddress and metricsBindAddress to /etc/kubernetes/scheduler-custom.conf works:
healthzBindAddress: 0.0.0.0:10261
metricsBindAddress: 0.0.0.0:10261
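Put together with the profile from the question, a scheduler-custom.conf along these lines should serve health and metrics on the custom port (a sketch; the profile's schedulerName takes the place of the --scheduler-name flag used in the question, and the topology key is copied from there):
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
healthzBindAddress: 0.0.0.0:10261
metricsBindAddress: 0.0.0.0:10261
profiles:
- schedulerName: kube-custom-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 1
        topologyKey: topology.kube.io/datacenter
        whenUnsatisfiable: ScheduleAnyway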
If you want to remove the default scheduler, you can move kube-scheduler.yaml out of /etc/kubernetes/manifests/.
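For example (the backup location is just a suggestion; the kubelet stops the static pod as soon as its manifest leaves the directory):
$ sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/kube-scheduler.yaml.bak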

kubectl get componentstatus shows unhealthy

I've finished setting up my HA k8s cluster using kubeadm.
Everything seems to be working fine, but after checking with the command kubectl get componentstatus I get:
NAME STATUS MESSAGE
scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 12
controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 12
etcd-0 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}
I see that the manifests for the scheduler and the controller manager have different ports set up for the health check:
kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-scheduler
tier: control-plane
name: kube-scheduler
namespace: kube-system
spec:
containers:
- command:
- kube-scheduler
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=true
- --port=0
image: k8s.gcr.io/kube-scheduler:v1.18.6
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 15
timeoutSeconds: 15
name: kube-scheduler
resources:
requests:
cpu: 100m
volumeMounts:
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-cluster-critical
volumes:
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
status: {}
kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-controller-manager
tier: control-plane
name: kube-controller-manager
namespace: kube-system
spec:
containers:
- command:
- kube-controller-manager
- --allocate-node-cidrs=true
- --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
- --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
- --bind-address=127.0.0.1
- --client-ca-file=/etc/kubernetes/pki/ca.crt
- --cluster-cidr=10.244.0.0/16
- --cluster-name=kubernetes
- --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
- --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
- --controllers=*,bootstrapsigner,tokencleaner
- --kubeconfig=/etc/kubernetes/controller-manager.conf
- --leader-elect=true
- --node-cidr-mask-size=24
- --port=0
- --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
- --root-ca-file=/etc/kubernetes/pki/ca.crt
- --service-account-private-key-file=/etc/kubernetes/pki/sa.key
- --service-cluster-ip-range=10.96.0.0/12
- --use-service-account-credentials=true
image: k8s.gcr.io/kube-controller-manager:v1.18.6
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 8
httpGet:
host: 127.0.0.1
path: /healthz
port: 10257
scheme: HTTPS
initialDelaySeconds: 15
timeoutSeconds: 15
name: kube-controller-manager
resources:
requests:
cpu: 200m
volumeMounts:
- mountPath: /etc/ssl/certs
name: ca-certs
readOnly: true
- mountPath: /etc/pki
name: etc-pki
readOnly: true
- mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
name: flexvolume-dir
- mountPath: /etc/kubernetes/pki
name: k8s-certs
readOnly: true
- mountPath: /etc/kubernetes/controller-manager.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-cluster-critical
volumes:
- hostPath:
path: /etc/ssl/certs
type: DirectoryOrCreate
name: ca-certs
- hostPath:
path: /etc/pki
type: DirectoryOrCreate
name: etc-pki
- hostPath:
path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
type: DirectoryOrCreate
name: flexvolume-dir
- hostPath:
path: /etc/kubernetes/pki
type: DirectoryOrCreate
name: k8s-certs
- hostPath:
path: /etc/kubernetes/controller-manager.conf
type: FileOrCreate
name: kubeconfig
status: {}
So these are using ports 10259 and 10257 respectively.
Any idea why kubectl is trying to perform the health check using 10251 and 10252?
version:
kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
PS: I am able to make deployments and expose services, no problem there.
This is a known issue which, unfortunately, is not going to be fixed, since the feature is planned for deprecation. Also, see this source:
I wouldn't expect a change for this issue. Upstream Kubernetes wants to deprecate component status and does not plan on enhancing it. If you need to check for cluster health, using other monitoring sources is recommended.
kubernetes/kubernetes#93171 - 'fix component status server address', which is being recommended for closure due to the deprecation discussion.
kubernetes/enhancements#553 - Deprecate ComponentStatus.
kubernetes/kubeadm#2222 - kubeadm default init; they are looking to 'start printing a warning in kubectl get componentstatus that this API object is no longer supported and there are plans to remove it.'
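Since componentstatus is on its way out, one alternative (sketched here) is to query the components' health endpoints directly from a control-plane node; the ports match the manifests above, and /healthz on the secure ports is normally reachable anonymously:
$ kubectl get --raw='/readyz?verbose'        # aggregate API server health
$ curl -k https://127.0.0.1:10259/healthz    # kube-scheduler
$ curl -k https://127.0.0.1:10257/healthz    # kube-controller-manager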

Why does my GKE node pool not auto-scale down?

I've got a preemptible node pool which is clearly under-utilized.
The node pool hosts a deployment with HPA with the following setup:
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend
labels:
app: backend
spec:
replicas: 1
selector:
matchLabels:
app: backend
template:
metadata:
labels:
app: backend
spec:
initContainers:
- name: wait-for-database
image: ### IMAGE ###
command: ['bash', 'init.sh']
containers:
- name: backend
image: ### IMAGE ###
command: ["bash", "entrypoint.sh"]
imagePullPolicy: Always
resources:
requests:
memory: "200M"
cpu: "50m"
ports:
- name: probe-port
containerPort: 8080
hostPort: 8080
volumeMounts:
- name: static-shared-data
mountPath: /static
readinessProbe:
httpGet:
path: /readiness/
port: probe-port
failureThreshold: 5
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
- name: nginx
image: nginx:alpine
resources:
requests:
memory: "400M"
cpu: "20m"
ports:
- containerPort: 80
volumeMounts:
- name: nginx-proxy-config
mountPath: /etc/nginx/conf.d/default.conf
subPath: app.conf
- name: static-shared-data
mountPath: /static
volumes:
- name: nginx-proxy-config
configMap:
name: backend-nginx
- name: static-shared-data
emptyDir: {}
nodeSelector:
cloud.google.com/gke-nodepool: app-dev
tolerations:
- effect: NoSchedule
key: workload
operator: Equal
value: dev
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: backend
namespace: default
spec:
maxReplicas: 12
minReplicas: 8
scaleTargetRef:
apiVersion: extensions/v1beta1
kind: Deployment
name: backend
metrics:
- resource:
name: cpu
targetAverageUtilization: 50
type: Resource
---
The node pool also carries the matching taint and label.
The HPA utilization shows this:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
backend-develop Deployment/backend-develop 10%/50% 8 12 8 38d
But the node pool has not scaled down for about a day, even though there is no heavy load on this deployment:
NAME STATUS ROLES AGE VERSION
gke-dev-app-dev-fee1a901-fvw9 Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-gls7 Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-lf3f Ready <none> 24h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-lgw9 Ready <none> 3d10h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-qxkz Ready <none> 3h35m v1.14.10-gke.36
gke-dev-app-dev-fee1a901-s10l Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-sj4d Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-vdnw Ready <none> 27h v1.14.10-gke.36
There are no affinity settings for this deployment or node pool. Some of the nodes easily pack several identical pods, but others hold just one pod for hours and no scale-down happens.
What could be wrong?
The issue was:
hostPort: 8080
This led to FailedScheduling events with the message "didn't have free ports".
That's why the nodes were kept online.
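In other words, removing the hostPort (and probing the containerPort or a Service instead) lets several replicas share a node, so the autoscaler can consolidate pods and scale down. A sketch of the changed fragment:
ports:
- name: probe-port
  containerPort: 8080
  # hostPort: 8080  -- removed: only one pod per node can bind a given hostPort,
  # which pinned each replica to its own node and blocked scale-down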

K8s pod priority & outOfPods

We had a situation where the k8s cluster was running out of pods after an update (Kubernetes, or more specifically ICP), resulting in "OutOfPods" error messages. The reason was a lower "podsPerCore" setting, which we corrected afterwards. Until then there were pods with a priorityClass (1000000) that could not be scheduled, while others without a priorityClass (0) were scheduled. I assumed a different behaviour: I thought the K8s scheduler would kill pods with no priority so that a pod with priority could be scheduled. Was I wrong?
That's just a question for my understanding, because I want to guarantee that the priority pods are running, no matter what.
Thanks
Pod with Prio:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-12-16T13:39:21Z"
generateName: dms-config-server-555dfc56-
labels:
app: config-server
pod-template-hash: 555dfc56
release: dms-config-server
name: dms-config-server-555dfc56-2ssxb
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: dms-config-server-555dfc56
uid: c29c40e1-1da7-11ea-b646-005056a72568
resourceVersion: "65065735"
selfLink: /api/v1/namespaces/dms/pods/dms-config-server-555dfc56-2ssxb
uid: 7758e138-2009-11ea-9ff4-005056a72568
spec:
containers:
- env:
- name: CONFIG_SERVER_GIT_USERNAME
valueFrom:
secretKeyRef:
key: username
name: dms-config-server-git
- name: CONFIG_SERVER_GIT_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: dms-config-server-git
envFrom:
- configMapRef:
name: dms-config-server-app-env
- configMapRef:
name: dms-config-server-git
image: docker.repository..../infra/config-server:2.0.8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: config-server
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 250m
memory: 600Mi
requests:
cpu: 10m
memory: 300Mi
securityContext:
capabilities:
drop:
- MKNOD
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 1000000
priorityClassName: infrastructure
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
Pod without Prio (just an example within the same namespace):
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-09-10T09:09:28Z"
generateName: produkt-service-57d448979d-
labels:
app: produkt-service
pod-template-hash: 57d448979d
release: dms-produkt-service
name: produkt-service-57d448979d-4x5qs
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: produkt-service-57d448979d
uid: 4096ab97-5cee-11e9-97a2-005056a72568
resourceVersion: "65065755"
selfLink: /api/v1/namespaces/dms/pods/produkt-service-57d448979d-4x5qs
uid: b112c5f7-d3aa-11e9-9b1b-005056a72568
spec:
containers:
- image: docker-snapshot.repository..../dms/produkt-service:0b6e0ecc88a28d2a91ffb1db61f8ca99c09a9d92
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: produkt-service
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
securityContext:
capabilities:
drop:
- MKNOD
procMount: Default
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
There are many circumstances that can alter the work of the scheduler. There is documentation about it: Pod priority and preemption.
Be aware that this feature was deemed stable in version 1.14.0.
From the IBM perspective, please keep in mind that version 1.13.9 will be supported until 19 February 2020.
You are correct that pods with lower priority should be replaced with higher priority pods.
Let me elaborate on that with an example:
Example
Let's assume a Kubernetes cluster with 3 nodes (1 master and 2 worker nodes):
By default you cannot schedule normal pods on the master node.
The only worker node that can schedule the pods has 8 GB of RAM.
The 2nd worker node has a taint that disables scheduling.
This example is based on RAM usage, but it works in the same manner for CPU time.
Priority Class
There are 2 priority classes:
zero-priority (0)
high-priority (1 000 000)
YAML definition of zero priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: zero-priority
value: 0
globalDefault: false
description: "This is priority class for hello pod"
globalDefault controls whether a class is assigned by default to pods that do not specify a priority class; it is set to false for both classes here.
YAML definition of high priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This is priority class for goodbye pod"
To apply these priority classes you will need to invoke:
$ kubectl apply -f FILE.yaml
Deployments
With the above objects you can create two deployments:
Hello - deployment with low priority
Goodbye - deployment with high priority
YAML definition of hello deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: hello
spec:
selector:
matchLabels:
app: hello
version: 1.0.0
replicas: 10
template:
metadata:
labels:
app: hello
version: 1.0.0
spec:
containers:
- name: hello
image: "gcr.io/google-samples/hello-app:1.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
Please take a close look at this fragment:
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
It limits the number of schedulable pods through the requested resources and assigns a low priority to this deployment.
YAML definition of goodbye deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: goodbye
spec:
selector:
matchLabels:
app: goodbye
version: 2.0.0
replicas: 3
template:
metadata:
labels:
app: goodbye
version: 2.0.0
spec:
containers:
- name: goodbye
image: "gcr.io/google-samples/hello-app:2.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
Also, please take a close look at this fragment:
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
These pods will have a much higher RAM request and a high priority.
Testing and troubleshooting
There is not enough information to properly troubleshoot issues like this without extensive logs from many components, from the kubelet to the pods, nodes, and the deployments themselves.
Apply hello deployment and see what happens:
$ kubectl apply -f hello.yaml
Get basic information about the deployment with command:
$ kubectl get deployments hello
Output should look like that after a while:
NAME READY UP-TO-DATE AVAILABLE AGE
hello 10/10 10 10 9s
As you can see all of the pods are ready and available. The requested resources were assigned to them.
To get more details for troubleshooting purposes you can invoke:
$ kubectl describe deployment hello
$ kubectl describe node NAME_OF_THE_NODE
Example information about allocated resources from the above command:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 1280Mi (17%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Apply goodbye deployment and see what happens:
$ kubectl apply -f goodbye.yaml
Get basic information about the deployments with command:
$ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 1/3 3 1 25s
hello 9/10 10 9 11m
As you can see, the goodbye deployment exists but only 1 pod is available, and despite the fact that goodbye has a much higher priority, the hello pods are still running.
Why is that?
$ kubectl describe node NAME_OF_THE_NODE
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default goodbye-575968c8d6-bnrjc 0 (0%) 0 (0%) 6Gi (83%) 0 (0%) 15m
default hello-fdfb55c96-6hkwp 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
default hello-fdfb55c96-djrwf 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
Take a look at the requested memory for the goodbye pod: it is 6Gi, as described above.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 7296Mi (98%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
The requested memory is near 100% of the node's allocatable memory.
Describing a specific goodbye pod that is in the Pending state will yield some more information ($ kubectl describe pod NAME_OF_THE_POD_IN_PENDING_STATE):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38s (x3 over 53s) default-scheduler 0/3 nodes are available: 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
The goodbye pod was not scheduled because its resource request could not be satisfied, yet there were still enough resources left for the hello pods.
It is, however, possible to create a situation where lower-priority pods are preempted so that higher-priority pods can be scheduled.
Change the requested memory for the goodbye pods to 2304Mi. This will allow the scheduler to place all of the required pods (3):
resources:
requests:
memory: "2304Mi"
You can delete the previous deployment and apply a new one with the memory parameter changed.
Then invoke: $ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 3/3 3 3 5m59s
hello 3/10 10 3 48m
As you can see, all of the goodbye pods are now available.
The hello pods were reduced to make room for the higher-priority goodbye pods.
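If you want to double-check which priority the scheduler actually sees on each pod (and therefore which pods are preemption candidates), one way, sketched here, is to print the resolved fields:
$ kubectl get pods -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority,CLASS:.spec.priorityClassName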

All containers showing ready, but pod is not ready

When I describe my pod, I can see the following conditions:
$ kubectl describe pod blah-84c6554d77-6wn42
...
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
...
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
blah-84c6554d77-6wn42 1/1 Running 46 23h 10.247.76.179 xxx-x-x-xx-123.nodes.xxx.d.ocp.xxx.xxx.br <none> <none>
...
I wonder how this can be possible: all the containers in the pod are showing ready=true, but the pod is ready=false.
Has anyone experienced this before? Do you know what else could be causing the pod to not be ready?
I'm running kubernetes version 1.15.4.
I can see in the code that
// The status of "Ready" condition is "True", if all containers in a pod are ready
// AND all matching conditions specified in the ReadinessGates have status equal to "True".
but I haven't defined any custom readiness gates.
I wonder how I can check the reason for this failure.
I couldn't find this in the docs for pod-readiness-gate.
Here is the full pod YAML:
$ kubectl get pod blah-84c6554d77-6wn42 -o yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2019-10-17T04:05:30Z"
generateName: blah-84c6554d77-
labels:
app: blah
commit: cb511424a5ec43f8dghdfdwervxz8a19edbb
pod-template-hash: 84c6554d77
name: blah-84c6554d77-6wn42
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: blah-84c6554d77
uid: 514da64b-c242-11e9-9c5b-0050123234523
resourceVersion: "19780031"
selfLink: /api/v1/namespaces/blah/pods/blah-84c6554d77-6wn42
uid: 62be74a1-541a-4fdf-987d-39c97644b0c8
spec:
containers:
- env:
- name: URL
valueFrom:
configMapKeyRef:
key: url
name: external-mk9249b92m
image: myregistry/blah:3.0.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthcheck
port: 8080
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 3
name: blah
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 10
httpGet:
path: /healthcheck
port: 8080
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 3
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-4tp6z
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: xxxxxxxxxxx
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-4tp6z
secret:
defaultMode: 420
secretName: default-token-4tp6z
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2019-10-17T04:14:22Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2019-10-17T09:47:15Z"
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2019-10-17T07:54:55Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2019-10-17T04:14:18Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://33820f432a5a372d028c18f1b0376e2526ef65871f4f5c021e2cbea5dcdbe3ea
image: myregistry/blah:3.0.0
imageID: docker-pullable://myregistry/blah:#sha256:5c0634f03627bsdasdasdasdasdasdc91ce2147993a0181f53a
lastState:
terminated:
containerID: docker://5c8d284f79aaeaasdasdaqweqweqrwt9811e34da48f355081
exitCode: 1
finishedAt: "2019-10-17T07:49:36Z"
reason: Error
startedAt: "2019-10-17T07:49:35Z"
name: blah
ready: true
restartCount: 46
state:
running:
startedAt: "2019-10-17T07:54:39Z"
hostIP: 10.247.64.115
phase: Running
podIP: 10.247.76.179
qosClass: BestEffort
startTime: "2019-10-17T04:14:22Z"
Thanks
You have a readinessProbe configured:
readinessProbe:
failureThreshold: 10
httpGet:
path: /healthcheck
port: 8080
scheme: HTTP
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes
Readiness probes are configured similarly to liveness probes. The only difference is that you use the readinessProbe field instead of the livenessProbe field.
tl;dr: check whether /healthcheck on port 8080 is returning a successful HTTP status code and, if the probe is not used or not necessary, drop it.
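To dig into why the readiness check is failing, you could, for example, look at the probe events and call the endpoint yourself through a port-forward (a sketch; the pod name is the one from the question):
$ kubectl describe pod blah-84c6554d77-6wn42        # look for "Unhealthy ... Readiness probe failed" events
$ kubectl port-forward pod/blah-84c6554d77-6wn42 8080:8080 &
$ curl -i http://127.0.0.1:8080/healthcheck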