My Kubernetes cluster does not scale down

I have a Kubernetes cluster with one master and one worker.
I installed metrics-server for autoscaling and then ran a stress test:
$ kubectl run autoscale-test --image=ubuntu:16.04 --requests=cpu=1000m --command sleep 1800
deployment "autoscale-test" created
$ kubectl autoscale deployment autoscale-test --cpu-percent=25 --min=1 --max=5
deployment "autoscale-test" autoscaled
$ kubectl get hpa
NAME             REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
autoscale-test   Deployment/autoscale-test   0%/25%    1         5         1          1m
$ kubectl get pod
NAME                              READY   STATUS    RESTARTS   AGE
autoscale-test-59d66dcbf7-9fqr8   1/1     Running   0          9m
$ kubectl exec autoscale-test-59d66dcbf7-9fqr8 -- apt-get update
$ kubectl exec autoscale-test-59d66dcbf7-9fqr8 -- apt-get install stress
$ kubectl exec autoscale-test-59d66dcbf7-9fqr8 -- stress --cpu 2 --timeout 600s &
stress: info: [227] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
Everything works fine and the deployment was autoscaled, but the pods created by the autoscaler keep running and do not terminate after the stress test ends.
The HPA shows that 0% of the CPU is in use, but the 5 autoscaled pods are still running:
# kubectl get hpa
NAME             REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
autoscale-test   Deployment/autoscale-test   0%/25%    1         5         5          74m
# kubectl get pods --all-namespaces
NAMESPACE   NAME                             READY   STATUS    RESTARTS   AGE
default     autoscale-test-8f4d84bbf-7ddjw   1/1     Running   0          61m
default     autoscale-test-8f4d84bbf-bmr59   1/1     Running   0          61m
default     autoscale-test-8f4d84bbf-cxt26   1/1     Running   0          61m
default     autoscale-test-8f4d84bbf-x9jws   1/1     Running   0          61m
default     autoscale-test-8f4d84bbf-zbhvk   1/1     Running   0          71m
I waited for an hour, but nothing happened.

From the documentation:
--horizontal-pod-autoscaler-downscale-delay: The value for this option is a duration that specifies how long the autoscaler has to wait before another downscale operation can be performed after the current one has completed. The default value is 5 minutes (5m0s).
Note: When tuning these parameter values, a cluster operator should be
aware of the possible consequences. If the delay (cooldown) value is
set too long, there could be complaints that the Horizontal Pod
Autoscaler is not responsive to workload changes. However, if the
delay value is set too short, the scale of the replicas set may keep
thrashing as usual.
Finally, just before HPA scales the target, the scale recommendation
is recorded. The controller considers all recommendations within a
configurable window choosing the highest recommendation from within
that window. This value can be configured using the
--horizontal-pod-autoscaler-downscale-stabilization flag, which defaults to 5 minutes. This means that scaledowns will occur
gradually, smoothing out the impact of rapidly fluctuating metric
values.
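In other words, with the target reporting 0%/25%, the HPA should scale the deployment back down to minReplicas once the downscale delay / stabilization window has elapsed and the metrics keep reporting low usage. On clusters running Kubernetes 1.18 or newer, the stabilization window can also be set per HPA through the autoscaling/v2 behavior field instead of a cluster-wide flag. A minimal sketch (the 60-second window is an example value, not a recommendation):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: autoscale-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: autoscale-test
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 25
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # wait 60s of consistently low usage before scaling down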

Related

Reason for repeated pod eviction

A node on my 5-node cluster had memory usage peak at ~90% last night. Looking around with kubectl, I found that a single pod (in a 1-replica deployment) was the culprit of the high memory usage and had been evicted.
However, the logs show that the pod was evicted about 10 times (the AGE corresponds roughly to when memory usage peaked, and all evictions happened on the same node):
NAMESPACE NAME READY STATUS RESTARTS AGE
example-namespace example-deployment-84f8d7b6d9-2qtwr 0/1 Evicted 0 14h
example-namespace example-deployment-84f8d7b6d9-6k2pn 0/1 Evicted 0 14h
example-namespace example-deployment-84f8d7b6d9-7sbw5 0/1 Evicted 0 14h
example-namespace example-deployment-84f8d7b6d9-8kcbg 0/1 Evicted 0 14h
example-namespace example-deployment-84f8d7b6d9-9fw2f 0/1 Evicted 0 14h
example-namespace example-deployment-84f8d7b6d9-bgrvv 0/1 Evicted 0 14h
...
Node memory usage graph (image not included). The evicted pod's status shows:
Status: Failed
Reason: Evicted
Message: Pod The node had condition: [MemoryPressure].
My question is about how or why this situation would happen, and what steps I can take to debug and figure out why the pod was repeatedly evicted. The pod uses an in-memory database, so it makes sense that it eats up a lot of memory after some time, but its memory usage on boot shouldn't be abnormal at all.
My intuition would have been that the high memory usage pod gets evicted, deployment replaces the pod, new pod isn't using that much memory, all is fine. But the eviction happened many times, which doesn't make sense to me.
The simplest steps are to run the following commands to debug and read the logs from the specific Pod.
Look at the Pod's states and last restarts:
kubectl describe pods ${POD_NAME}
Look for its node name and run the same for the node:
kubectl describe node ${NODE_NAME}
You will see some information in the Conditions section.
Examine pod logs:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
If you want to watch the logs directly while the pod runs, restart the pod and run:
kubectl logs ${POD_NAME} -f
More info on the kubectl logs command and its flags is available here.
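In the eviction case above, a likely pattern is that the Deployment immediately recreates the pod on the same node while it is still under MemoryPressure, so the replacement gets evicted again until the pressure clears. The namespace events usually confirm this; a small sketch using the namespace from the question (the field selectors are standard kubectl features):

kubectl get events -n example-namespace --field-selector reason=Evicted --sort-by=.lastTimestamp
kubectl get pods -n example-namespace --field-selector=status.phase=Failed   # evicted pods end up in phase Failed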

kubectl get deployment READY 0/0 meaning

I see a lot of heavy documentation online related to Kubernetes deployment but still can't find the definition of 0/0.
$ k get deployment
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
async-handler-redis-master         1/1     1            1           211d
bbox-inference-4k-pilot-2d-boxes   0/0     0            0           148d
What exactly does it mean to be 0/0? It's deployed but not ready? Why is it not ready? How do I make this deployment READY?
It means the replica count of your deployment is 0. In other words, you don't have any pods under this deployment, so 0/0 means 0 out of 0 pods are ready.
You can scale it up with:
kubectl scale deployment <deployment-name> --replicas=1
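If you manage the Deployment declaratively, the equivalent is to set spec.replicas in the manifest and kubectl apply it. As a quick imperative alternative to kubectl scale, a patch also works (deployment name taken from the question; this is just a sketch):

kubectl patch deployment bbox-inference-4k-pilot-2d-boxes -p '{"spec":{"replicas":1}}'
kubectl get deployment bbox-inference-4k-pilot-2d-boxes   # READY should now move towards 1/1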

Is there some way to only increase a StatefulSet's replicas and never decrease them?

I do not want to decrease the number of pods controlled by a StatefulSet; I think that decreasing pods is a dangerous operation in a production environment.
So, is there some way to do this? Thanks.
I'm not sure if this is what you are looking for, but you can scale a StatefulSet.
Use kubectl to scale StatefulSets
First, find the StatefulSet you want to scale.
kubectl get statefulsets <stateful-set-name>
Change the number of replicas of your StatefulSet:
kubectl scale statefulsets <stateful-set-name> --replicas=<new-replicas>
To show you an example, I've deployed a 2 pod StatefulSet called web:
$ kubectl get statefulsets.apps web
NAME READY AGE
web 2/2 60s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 63s
web-1 1/1 Running 0 44s
$ kubectl describe statefulsets.apps web
Name: web
Namespace: default
CreationTimestamp: Wed, 23 Oct 2019 13:46:33 +0200
Selector: app=nginx
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"annotations":{},"name":"web","namespace":"default"},"spec":{"replicas":2,"select...
Replicas: 824643442664 desired | 2 total
Update Strategy: RollingUpdate
Partition: 824643442984
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
...
Now if we do scale this StatefulSet up to 5 replicas:
$ kubectl scale statefulset web --replicas=5
statefulset.apps/web scaled
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 3m41s
web-1 1/1 Running 0 3m22s
web-2 1/1 Running 0 59s
web-3 1/1 Running 0 40s
web-4 1/1 Running 0 27s
$ kubectl get statefulsets.apps web
NAME READY AGE
web 5/5 3m56s
There is no downtime for the pods that are already running.
I think that decreasing pods is a dangerous operation in a production environment.
I agree with you.
As Crou wrote, it is possible to do this with kubectl scale statefulsets <stateful-set-name>, but that is an imperative operation, and imperative operations are not recommended in a production environment.
In a production environment it is better to use a declarative operation, e.g. keep the number of replicas in a text file (e.g. stateful-set-name.yaml) and deploy it with kubectl apply -f <stateful-set-name>.yaml. With this way of working it is easy to store the YAML files in Git, so you have full control of all changes and can revert/rollback to a previous configuration. When you store the declarative files in a Git repository you can use a CI/CD solution, e.g. Jenkins or ArgoCD, to 1) validate the operation (e.g. not allow a decrease) and 2) first deploy to a test environment and see that it works before applying the changes to the production environment.
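A minimal sketch of such a declarative manifest, reusing the web/nginx StatefulSet from the example above (the image tag and the headless Service it refers to are placeholders, not part of the original answer):

# stateful-set-name.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web        # assumes a headless Service named "web" already exists
  replicas: 5             # the only line you change when scaling; review it in Git before applying
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.17   # placeholder image

Applied with kubectl apply -f stateful-set-name.yaml, every change to the replica count goes through the file and therefore through Git review.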
I recommend the book Kubernetes: Up and Running, 2nd edition, which describes this procedure in Chapter 18 (a new chapter in that edition).

How to get a list of terminated and running pods using a kubectl command

I want to see the details of terminated and running pods in Kubernetes.
The command below only shows the running pods; I also want to see the history of all pods that have terminated so far.
$ ./kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
POD1 1/1 Running 0 3d 10.333.33.333 node123
POD2 1/1 Running 0 4d 10.333.33.333 node121
POD3 1/1 Running 0 1m 10.333.33.333 node124
I expect to get the list of terminated pods with a kubectl command.
Since v1.10 kubectl prints terminated pods by default:
--show-all (which only affected pods and only for human
readable/non-API printers) is now defaulted to true and deprecated.
The flag determines whether pods in a terminal state are displayed. It
will be inert in 1.11 and removed in a future release.
If you are running a version older than v1.10, you still need to use the --show-all flag.
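Another option on current versions is to filter by pod phase, since terminated pods end up in the Succeeded or Failed phase (standard kubectl field selectors; the last command is only for clusters older than v1.10):

kubectl get pods --field-selector=status.phase=Succeeded
kubectl get pods --field-selector=status.phase=Failed
kubectl get pods --show-all        # pre-v1.10 clusters only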

Kubernetes Rolling Update not obeying 'maxUnavailable' replicas when redeployed in autoscaled conditions

In a nutshell, most of our apps are configured with the following strategy in the Deployment -
strategy:
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 25%
  type: RollingUpdate
The Horizontal Pod Autoscaler is configured like so:
spec:
  maxReplicas: 10
  minReplicas: 2
Now when our application was redeployed, instead of running a rolling update, it instantly terminated 8 of our pods and dropped the number of pods to 2, which is the minimum number of replicas. This happened in a fraction of a second, as you can see here.
Here is the output of kubectl get hpa (screenshot not included).
As maxUnavailable is 25%, shouldn't only about 2-3 pods go down at most? Why did so many pods go down at once? It seems as though a rolling update is useless if it works this way.
What am I missing?
After looking at this question, I decided to try this in a test environment to see whether I could reproduce it.
I set up the metrics-server to provide resource metrics and configured an HPA. I followed these steps to set up the HPA and deployment:
How to Enable KubeAPI server for HPA Autoscaling Metrics
Once I had a working HPA and the maximum of 10 pods running on the system, I updated the image using:
[root@ip-10-0-1-176 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache 49%/50% 1 10 10 87m
[root@ip-10-0-1-176 ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
load-generator-557649ddcd-6jlnl 1/1 Running 0 61m
php-apache-75bf8f859d-22xvv 1/1 Running 0 91s
php-apache-75bf8f859d-dv5xg 1/1 Running 0 106s
php-apache-75bf8f859d-g4zgb 1/1 Running 0 106s
php-apache-75bf8f859d-hv2xk 1/1 Running 0 2m16s
php-apache-75bf8f859d-jkctt 1/1 Running 0 2m46s
php-apache-75bf8f859d-nlrzs 1/1 Running 0 2m46s
php-apache-75bf8f859d-ptg5k 1/1 Running 0 106s
php-apache-75bf8f859d-sbctw 1/1 Running 0 91s
php-apache-75bf8f859d-tkjhb 1/1 Running 0 55m
php-apache-75bf8f859d-wv5nc 1/1 Running 0 106s
[root@ip-10-0-1-176 ~]# kubectl set image deployment php-apache php-apache=hpa-example:v1 --record
deployment.extensions/php-apache image updated
[root@ip-10-0-1-176 ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
load-generator-557649ddcd-6jlnl 1/1 Running 0 62m
php-apache-75bf8f859d-dv5xg 1/1 Terminating 0 2m40s
php-apache-75bf8f859d-g4zgb 1/1 Terminating 0 2m40s
php-apache-75bf8f859d-hv2xk 1/1 Terminating 0 3m10s
php-apache-75bf8f859d-jkctt 1/1 Running 0 3m40s
php-apache-75bf8f859d-nlrzs 1/1 Running 0 3m40s
php-apache-75bf8f859d-ptg5k 1/1 Terminating 0 2m40s
php-apache-75bf8f859d-sbctw 0/1 Terminating 0 2m25s
php-apache-75bf8f859d-tkjhb 1/1 Running 0 56m
php-apache-75bf8f859d-wv5nc 1/1 Terminating 0 2m40s
php-apache-847c8ff9f4-7cbds 1/1 Running 0 6s
php-apache-847c8ff9f4-7vh69 1/1 Running 0 6s
php-apache-847c8ff9f4-9hdz4 1/1 Running 0 6s
php-apache-847c8ff9f4-dlltb 0/1 ContainerCreating 0 3s
php-apache-847c8ff9f4-nwcn6 1/1 Running 0 6s
php-apache-847c8ff9f4-p8c54 1/1 Running 0 6s
php-apache-847c8ff9f4-pg8h8 0/1 Pending 0 3s
php-apache-847c8ff9f4-pqzjw 0/1 Pending 0 2s
php-apache-847c8ff9f4-q8j4d 0/1 ContainerCreating 0 4s
php-apache-847c8ff9f4-xpbzl 0/1 Pending 0 1s
Also, I kept a job running in the background that appended the output of kubectl get pods to a file every second. At no point until all images were upgraded did the number of pods go below 8.
I believe you need to check how you're setting up your rolling upgrade. Are you using a Deployment or a ReplicaSet? I kept the rolling update strategy the same as yours (maxUnavailable: 25% and maxSurge: 25%) with a Deployment, and it works well for me.
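If you want to double-check which strategy is actually applied to the live Deployment, the following works (php-apache is the test deployment from above; substitute your own name):

kubectl get deployment php-apache -o jsonpath='{.spec.strategy}'
kubectl describe deployment php-apache | grep -A 3 StrategyType   # shows StrategyType and RollingUpdateStrategy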
I want to point out the minReadySeconds property.
The minReadySeconds property specifies how long a newly created pod should be ready before it is treated as available. The redeployment without the minReadySeconds property actually completed successfully in a very short time, but shortly afterwards the readiness probe started failing for some reason and the pods were scaled down.
The maxUnavailable property is only honored while the RollingUpdate is in progress; after the RollingUpdate has finished, it is ignored.
Note from the book Kubernetes in Action: If you only define the readiness probe without setting minReadySeconds properly, new pods are considered available immediately when the first invocation of the readiness probe succeeds. If the readiness probe starts failing shortly after, the bad version is rolled out across all pods. Therefore, you should set minReadySeconds appropriately.
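A minimal sketch of where minReadySeconds sits in a Deployment, with an illustrative readiness probe (the 10-second value, names, image, and probe path are placeholders, not taken from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                     # placeholder name
spec:
  minReadySeconds: 10              # a new pod must stay Ready for 10s before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2         # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz       # illustrative endpoint
              port: 8080
            periodSeconds: 5

With this in place, a pod whose readiness probe fails within the first 10 seconds never counts as available, so the rollout pauses instead of replacing every replica with a bad version.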
In our case we added the replicas field a while ago and forgot to remove it when we added the HPA. The HPA does not play nicely with the replicas field during deployments, so if you have an HPA, remove the replicas field. See https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#migrating-deployments-and-statefulsets-to-horizontal-autoscaling
When an HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s). If this isn't done, any time a change to that object is applied, for example via kubectl apply -f deployment.yaml, this will instruct Kubernetes to scale the current number of Pods to the value of the spec.replicas key. This may not be desired and could be troublesome when an HPA is active.
Keep in mind that the removal of spec.replicas may incur a one-time degradation of Pod counts as the default value of this key is 1 (reference Deployment Replicas). Upon the update, all Pods except 1 will begin their termination procedures. Any deployment application afterwards will behave as normal and respect a rolling update configuration as desired.
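As a sketch of what that looks like in practice (names and image are placeholders; the HPA itself is defined in its own manifest as usual):

# deployment.yaml -- note: no spec.replicas, the HPA owns the replica count
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2   # placeholder image

Applying a change to this manifest no longer resets the replica count, so the HPA's current scale is preserved and the rolling update proceeds within the maxSurge/maxUnavailable bounds.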