Kubernetes deployment not scaling down even though usage is below threshold

I’m having a hard time understanding what’s going on with my horizontal pod autoscaler.
I’m trying to scale up my deployment if the memory or cpu usage goes above 80%.
Here’s my HPA template:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: my-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
The thing is, it’s been sitting at 3 replicas for days even though the usage is below 80% and I don’t understand why.
$ kubectl get hpa --all-namespaces
NAMESPACE NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-ns my-hpa Deployment/my-deployment 61%/80%, 14%/80% 2 10 3 2d15h
Here’s the output of the top command:
$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
my-deployment-86874588cc-chvxq 3m 146Mi
my-deployment-86874588cc-gkbg9 5m 149Mi
my-deployment-86874588cc-nwpll 7m 149Mi
Each pod consumes approximately 60% of its requested memory, so they are below the 80% target:
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "200m"
Here's my deployment:
kind: Deployment
apiVersion: apps/v1
metadata:
name: my-deployment
labels:
app: my-app
spec:
replicas: 2
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: my-app
image: ...
imagePullPolicy: Always
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "200m"
livenessProbe:
httpGet:
path: /liveness
port: 3000
initialDelaySeconds: 10
periodSeconds: 3
timeoutSeconds: 3
readinessProbe:
httpGet:
path: /readiness
port: 3000
initialDelaySeconds: 10
periodSeconds: 3
timeoutSeconds: 3
ports:
- containerPort: 3000
protocol: TCP
When I manually scale down to 2 replicas, it goes back up to 3 right away for no apparent reason:
Normal SuccessfulRescale 28s (x4 over 66m) horizontal-pod-autoscaler New size: 3; reason:
Anyone have any idea what’s going on?

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
As per your current numbers, it will never scale down unless your memory usage drops to about half of the desired percentage, i.e. the current utilization of both CPU and memory should go to 40% (in your case) or below, as per the formula below:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
                = ceil[3 * (61 / 80)]
                = ceil[3 * 0.7625]
                = ceil[2.2875]
                = 3
You might be wondering why it is not downscaling even though your CPU is well below 40%. The HPA does not work that way: when multiple metrics are configured, it computes a desired replica count for each metric separately and always goes with the larger number.
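Plugging the 40% figure into the same formula shows that at that point the HPA would finally recommend 2 replicas:
desiredReplicas = ceil[3 * (40 / 80)] = ceil[1.5] = 2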

I had the same issue because, like you, I was using two metrics: CPU and memory. Memory is never released by the application as quickly as CPU. Once I removed the memory metric, the pods scaled down when utilization dropped.
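For reference, a minimal sketch of the HPA from the question with the memory metric removed (everything else copied as-is):
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80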

Related

GCP GKE Kubernetes HPA: horizontal-pod-autoscaler failed to get memory utilization: missing request for memory. How to fix it?

I have a cluster in Google Kubernetes Engine and want to make one of the deployments autoscale based on memory.
After deploying, I check the horizontal scaling with the following command:
kubectl describe hpa -n my-namespace
With this result:
Name: myapi-api-deployment
Namespace: my-namespace
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 15 Feb 2022 12:21:44 +0100
Reference: Deployment/myapi-api-deployment
Metrics: ( current / target )
resource memory on pods (as a percentage of request): <unknown> / 50%
Min replicas: 1
Max replicas: 5
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get memory utilization: missing request for memory
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 2m22s (x314 over 88m) horizontal-pod-autoscaler failed to get memory utilization: missing request for memory
When I use the kubectl top command I can see the memory and cpu usage. Here is my deployment including the autoscale:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-api-deployment
namespace: my-namespace
annotations:
reloader.stakater.com/auto: "true"
spec:
replicas: 1
selector:
matchLabels:
app: my-api
version: v1
template:
metadata:
labels:
app: my-api
version: v1
annotations:
sidecar.istio.io/rewriteAppHTTPProbers: "true"
spec:
serviceAccountName: my-api-sa
containers:
- name: esp
image: gcr.io/endpoints-release/endpoints-runtime:2
imagePullPolicy: Always
args: [
"--listener_port=9000",
"--backend=127.0.0.1:8080",
"--service=myproject.company.ai"
]
ports:
- containerPort: 9000
- name: my-api
image: gcr.io/myproject/my-api:24
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: "/healthcheck"
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: "/healthcheck"
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
resources:
limits:
cpu: 500m
memory: 2048Mi
requests:
cpu: 300m
memory: 1024Mi
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: my-api-deployment
namespace: my-namespace
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-api-deployment
minReplicas: 1
maxReplicas: 5
metrics:
- type: Resource
resource:
name: memory
target:
type: "Utilization"
averageUtilization: 50
---
I am using autoscaling/v2beta2 as recommended by the GKE documentation.
When using the HPA with memory or CPU, you need to set resource requests for whichever metric(s) your HPA is using, on every container in the target pods. See How does a HorizontalPodAutoscaler work, specifically:
For per-pod resource metrics (like CPU), the controller fetches the
metrics from the resource metrics API for each Pod targeted by the
HorizontalPodAutoscaler. Then, if a target utilization value is set,
the controller calculates the utilization value as a percentage of the
equivalent resource request on the containers in each Pod. If a target
raw value is set, the raw metric values are used directly.
Your HPA targets my-api-deployment, which has two containers. You have resource requests set for my-api but not for esp, so the controller cannot compute a pod-level memory utilization. You just need to add a memory request to the esp container.
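In concrete terms, the controller can turn my-api's usage into a percentage (e.g. 512Mi in use against its 1024Mi request would be 50%), but with no request on esp there is nothing to divide by for that container, hence the <unknown> / 50% in the describe output. A sketch of the esp container with a request added (the 256Mi / 100m figures are placeholders I picked, not values from your setup):
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:2
        imagePullPolicy: Always
        args: [
          "--listener_port=9000",
          "--backend=127.0.0.1:8080",
          "--service=myproject.company.ai"
        ]
        ports:
        - containerPort: 9000
        resources:
          requests:
            memory: 256Mi   # placeholder, size to what ESP actually needs
            cpu: 100m       # optional for a memory-only HPA, required if you add a CPU metric later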

HPA scaling is triggered during Spring boot startup. Current CPU metric in HPA is high though CPU utilization at node level is low

We have a Spring Boot application on GKE with autoscaling (HPA) enabled. During startup, the HPA kicks in and starts scaling the pods even though there is no traffic.
The result of 'kubectl get hpa' shows a high current average CPU utilization, while CPU utilization at the node and pod level is quite low.
The behavior is the same during scale-up: multiple pods are created, eventually leading to node scaling.
Application deployment Yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-api
labels:
app: myapp
spec:
replicas: 3
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
serviceAccount: myapp-ksa
containers:
- name: myapp
image: gcr.io/project/myapp:126
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: my-db-credentials
key: username
- name: DB_PASS
valueFrom:
secretKeyRef:
name: my-db-credentials
key: password
- name: DB_NAME
valueFrom:
secretKeyRef:
name: my-db-credentials
key: database
- name: INSTANCE_CONNECTION
valueFrom:
configMapKeyRef:
name: connectionname
key: connectionname
resources:
requests:
cpu: "200m"
memory: "256Mi"
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 90
periodSeconds: 5
readinessProbe:
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.17
env:
- name: INSTANCE_CONNECTION
valueFrom:
configMapKeyRef:
name: connectionname
key: connectionname
command: ["/cloud_sql_proxy",
"-ip_address_types=PRIVATE",
"-instances=$(INSTANCE_CONNECTION)=tcp:5432"]
securityContext:
runAsNonRoot: true
runAsUser: 2
allowPrivilegeEscalation: false
resources:
requests:
memory: "128Mi"
cpu: "100m"
Yaml for HPA:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: myapp-api
labels:
app: myapp
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp-api
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
result of various commands:
$ kubectl get pods
$ kubectl get hpa
$ kubectl top nodes
$ kubectl top pods --all-namespaces
As the problem has already been resolved in the comments section, I decided to provide a Community Wiki answer just for better visibility to other community members.
The kubectl get hpa command from the question shows a high current memory utilization, which is what causes the Pods to scale.
Reading the TARGETS column from the kubectl get hpa command can be confusing:
NOTE: Does the value of 33% apply to memory or to cpu?
$ kubectl get hpa app-1
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
app-1 Deployment/app-1 33%/80%, 4%/80% 2 5 5 5m37s
I recommend using the kubectl describe hpa <HPA_NAME> command with grep to determine the current metrics without any doubts:
$ kubectl describe hpa app-1 | grep -A 2 "Metrics:"
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 33% (3506176) / 80%
resource cpu on pods (as a percentage of request): 4% (0) / 80%

Kubernetes : How to attain 0 downtime

I have 2 pods running, each with CPU: 0.2 core and memory: 1 Gi.
My node has a limit of 0.4 core and 2 Gi, and I can't increase the node limits.
For zero downtime I have the following config:
apiVersion: apps/v1
kind: Deployment
metadata:
name: abc-deployment
spec:
selector:
matchLabels:
app: abc
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
template:
metadata:
labels:
app: abc
collect_logs_with_filebeat: "true"
annotations:
sidecar.istio.io/rewriteAppHTTPProbers: "false"
spec:
containers:
- name: abc
image: abc-repository:latest
ports:
- containerPort: 8087
readinessProbe:
httpGet:
path: /healthcheck
port: 8087
initialDelaySeconds: 540
timeoutSeconds: 10
periodSeconds: 10
failureThreshold: 20
successThreshold: 1
imagePullPolicy: Always
resources:
limits:
cpu: 0.2
memory: 1000Mi
requests:
cpu: 0.2
memory: 1000Mi
On a new build deployment, two new pods get created on a new node, say node2 (because node1 doesn't have enough memory and CPU to accommodate the new pods). Once the new containers on node2 are in the running state, the old pods (running on node1) get destroyed, and node1 now has some free CPU and memory.
The issue I am facing is that, since node1 now has free memory and CPU, Kubernetes destroys the newly created pods (running on node2), then creates pods on node1 and starts the app container there, which causes downtime.
So basically, even with the RollingUpdate strategy and a health check in place, I am not able to achieve zero downtime.
Please help here!
You could look at the concept of a Pod Disruption Budget, which is used mostly for achieving zero downtime for an application.
You could also read a related answer of mine which shows an example of how to achieve zero downtime for an application using PDBs.
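As a rough sketch, a PDB for the deployment above could look like this (the selector reuses the app: abc label from your manifest; minAvailable: 1 is my assumption, tune it to your availability needs, and use policy/v1beta1 on clusters older than 1.21):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: abc-pdb
spec:
  minAvailable: 1        # assumption: keep at least one pod available during disruptions
  selector:
    matchLabels:
      app: abc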

K8s pod priority & outOfPods

We had a situation where the k8s cluster (Kubernetes, or more specifically ICP) was running out of pods after an update, resulting in "OutOfPods" error messages. The reason was a lower "podsPerCore" setting, which we corrected afterwards. Until then, there were pods with a priorityClass set (1000000) which could not be scheduled, while others without a priorityClass (0) were scheduled. I assumed a different behaviour: I thought the K8s scheduler would evict pods with no priority so that a pod with priority could be scheduled. Was I wrong?
This is just a question for my understanding, because I want to guarantee that the priority pods keep running, no matter what.
Thanks
Pod with Prio:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-12-16T13:39:21Z"
generateName: dms-config-server-555dfc56-
labels:
app: config-server
pod-template-hash: 555dfc56
release: dms-config-server
name: dms-config-server-555dfc56-2ssxb
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: dms-config-server-555dfc56
uid: c29c40e1-1da7-11ea-b646-005056a72568
resourceVersion: "65065735"
selfLink: /api/v1/namespaces/dms/pods/dms-config-server-555dfc56-2ssxb
uid: 7758e138-2009-11ea-9ff4-005056a72568
spec:
containers:
- env:
- name: CONFIG_SERVER_GIT_USERNAME
valueFrom:
secretKeyRef:
key: username
name: dms-config-server-git
- name: CONFIG_SERVER_GIT_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: dms-config-server-git
envFrom:
- configMapRef:
name: dms-config-server-app-env
- configMapRef:
name: dms-config-server-git
image: docker.repository..../infra/config-server:2.0.8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: config-server
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 250m
memory: 600Mi
requests:
cpu: 10m
memory: 300Mi
securityContext:
capabilities:
drop:
- MKNOD
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 1000000
priorityClassName: infrastructure
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
Pod without Prio (just an example within the same namespace):
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-09-10T09:09:28Z"
generateName: produkt-service-57d448979d-
labels:
app: produkt-service
pod-template-hash: 57d448979d
release: dms-produkt-service
name: produkt-service-57d448979d-4x5qs
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: produkt-service-57d448979d
uid: 4096ab97-5cee-11e9-97a2-005056a72568
resourceVersion: "65065755"
selfLink: /api/v1/namespaces/dms/pods/produkt-service-57d448979d-4x5qs
uid: b112c5f7-d3aa-11e9-9b1b-005056a72568
spec:
containers:
- image: docker-snapshot.repository..../dms/produkt-service:0b6e0ecc88a28d2a91ffb1db61f8ca99c09a9d92
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: produkt-service
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
securityContext:
capabilities:
drop:
- MKNOD
procMount: Default
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
There are a lot of circumstances that can alter the work of the scheduler. There is documentation about it: Pod priority and preemption.
Be aware that this feature was deemed stable at version 1.14.0.
From the IBM perspective, please keep in mind that version 1.13.9 will be supported until 19 February 2020.
You are correct that pods with lower priority should be preempted in favour of higher-priority pods.
Let me elaborate on that with an example:
Example
Let's assume a Kubernetes cluster with 3 nodes (1 master and 2 nodes):
By default you cannot schedule normal pods on the master node.
The only worker node that can schedule the pods has 8GB of RAM.
The 2nd worker node has a taint that disables scheduling.
This example is based on RAM usage, but it works the same way for CPU time.
Priority Class
There are 2 priority classes:
zero-priority (0)
high-priority (1 000 000)
YAML definition of zero priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: zero-priority
value: 0
globalDefault: false
description: "This is priority class for hello pod"
The globalDefault field indicates whether this class should be assigned to pods that do not specify a priority class; it is set to false here, so it will not be applied by default.
YAML definition of high priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This is priority class for goodbye pod"
To apply these priority classes you will need to invoke:
$ kubectl apply -f FILE.yaml
Deployments
With above objects you can create deployments:
Hello - deployment with low priority
Goodbye - deployment with high priority
YAML definition of hello deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: hello
spec:
selector:
matchLabels:
app: hello
version: 1.0.0
replicas: 10
template:
metadata:
labels:
app: hello
version: 1.0.0
spec:
containers:
- name: hello
image: "gcr.io/google-samples/hello-app:1.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
Please take a closer look at this fragment:
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
The resource request will limit the number of pods that fit on the node, and the priorityClassName assigns low priority to this deployment.
YAML definition of goodbye deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: goodbye
spec:
selector:
matchLabels:
app: goodbye
version: 2.0.0
replicas: 3
template:
metadata:
labels:
app: goodbye
version: 2.0.0
spec:
containers:
- name: goodbye
image: "gcr.io/google-samples/hello-app:2.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
Again, please take a closer look at this fragment:
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
These pods will have a much higher RAM request and high priority.
Testing and troubleshooting
There is not enough information here to properly troubleshoot issues like this without extensive logs from many components, from the kubelet down to the pods, nodes and the deployments themselves.
Apply hello deployment and see what happens:
$ kubectl apply -f hello.yaml
Get basic information about the deployment with command:
$ kubectl get deployments hello
Output should look like that after a while:
NAME READY UP-TO-DATE AVAILABLE AGE
hello 10/10 10 10 9s
As you can see all of the pods are ready and available. The requested resources were assigned to them.
To get more details for troubleshooting purposes you can invoke:
$ kubectl describe deployment hello
$ kubectl describe node NAME_OF_THE_NODE
Example information about allocated resources from the above command:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 1280Mi (17%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Apply goodbye deployment and see what happens:
$ kubectl apply -f goodbye.yaml
Get basic information about the deployments with command:
$ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 1/3 3 1 25s
hello 9/10 10 9 11m
As you can see, the goodbye deployment exists but only 1 pod is available, and despite the fact that goodbye has much higher priority, the hello pods are still running.
Why is that?
$ kubectl describe node NAME_OF_THE_NODE
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default goodbye-575968c8d6-bnrjc 0 (0%) 0 (0%) 6Gi (83%) 0 (0%) 15m
default hello-fdfb55c96-6hkwp 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
default hello-fdfb55c96-djrwf 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
Take a look at the requested memory for the goodbye pod: it is 6Gi, as described above.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 7296Mi (98%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
The memory usage is near 100%.
Getting information about the specific goodbye pod that is in the Pending state will yield some more information ($ kubectl describe pod NAME_OF_THE_POD_IN_PENDING_STATE):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38s (x3 over 53s) default-scheduler 0/3 nodes are available: 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
The goodbye pod was not created because there were not enough resources to satisfy its request, while there were still some resources left for the hello pods.
There is, however, a possible situation in which lower-priority pods are killed so that higher-priority pods can be scheduled.
Change the requested memory for the goodbye pods to 2304Mi. This will allow the scheduler to assign all of the required pods (3):
resources:
requests:
memory: "2304Mi"
You can delete the previous deployment and apply a new one with the memory parameter changed.
Then invoke the command: $ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 3/3 3 3 5m59s
hello 3/10 10 3 48m
As you can see, all of the goodbye pods are available.
The hello pods were reduced to make space for the pods with higher priority (goodbye).
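If you want to double-check which priority each of the remaining pods ended up with, one quick way is:
$ kubectl get pods -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority,STATUS:.status.phase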

Kubectl rollout restart for statefulset

As per the kubectl docs, kubectl rollout restart is applicable to deployments, daemonsets and statefulsets. It works as expected for deployments, but for statefulsets it restarts only one of the 2 pods.
✗ k rollout restart statefulset alertmanager-main (playground-fdp/monitoring)
statefulset.apps/alertmanager-main restarted
✗ k rollout status statefulset alertmanager-main (playground-fdp/monitoring)
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 2 pods at revision alertmanager-main-59d7ccf598...
✗ kgp -l app=alertmanager (playground-fdp/monitoring)
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 21h
alertmanager-main-1 2/2 Running 0 20s
As you can see, the pod alertmanager-main-1 has been restarted and its age is 20s, whereas the other pod in the alertmanager statefulset, i.e. pod alertmanager-main-0, has not been restarted and its age is 21h. Any idea how we can restart a statefulset after a configmap used by it has been updated?
[Update 1] Here is the statefulset configuration. As you can see the .spec.updateStrategy.rollingUpdate.partition is not set.
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"monitoring.coreos.com/v1","kind":"Alertmanager","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"main","namespace":"monitoring"},"spec":{"baseImage":"10.47.2.76:80/alm/alertmanager","nodeSelector":{"kubernetes.io/os":"linux"},"replicas":2,"securityContext":{"fsGroup":2000,"runAsNonRoot":true,"runAsUser":1000},"serviceAccountName":"alertmanager-main","version":"v0.19.0"}}
creationTimestamp: "2019-12-02T07:17:49Z"
generation: 4
labels:
alertmanager: main
name: alertmanager-main
namespace: monitoring
ownerReferences:
- apiVersion: monitoring.coreos.com/v1
blockOwnerDeletion: true
controller: true
kind: Alertmanager
name: main
uid: 3e3bd062-6077-468e-ac51-909b0bce1c32
resourceVersion: "521307"
selfLink: /apis/apps/v1/namespaces/monitoring/statefulsets/alertmanager-main
uid: ed4765bf-395f-4d91-8ec0-4ae23c812a42
spec:
podManagementPolicy: Parallel
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
alertmanager: main
app: alertmanager
serviceName: alertmanager-operated
template:
metadata:
creationTimestamp: null
labels:
alertmanager: main
app: alertmanager
spec:
containers:
- args:
- --config.file=/etc/alertmanager/config/alertmanager.yaml
- --cluster.listen-address=[$(POD_IP)]:9094
- --storage.path=/alertmanager
- --data.retention=120h
- --web.listen-address=:9093
- --web.external-url=http://10.47.0.234
- --web.route-prefix=/
- --cluster.peer=alertmanager-main-0.alertmanager-operated.monitoring.svc:9094
- --cluster.peer=alertmanager-main-1.alertmanager-operated.monitoring.svc:9094
env:
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
image: 10.47.2.76:80/alm/alertmanager:v0.19.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 10
httpGet:
path: /-/healthy
port: web
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 3
name: alertmanager
ports:
- containerPort: 9093
name: web
protocol: TCP
- containerPort: 9094
name: mesh-tcp
protocol: TCP
- containerPort: 9094
name: mesh-udp
protocol: UDP
readinessProbe:
failureThreshold: 10
httpGet:
path: /-/ready
port: web
scheme: HTTP
initialDelaySeconds: 3
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 3
resources:
requests:
memory: 200Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/alertmanager/config
name: config-volume
- mountPath: /alertmanager
name: alertmanager-main-db
- args:
- -webhook-url=http://localhost:9093/-/reload
- -volume-dir=/etc/alertmanager/config
image: 10.47.2.76:80/alm/configmap-reload:v0.0.1
imagePullPolicy: IfNotPresent
name: config-reloader
resources:
limits:
cpu: 100m
memory: 25Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/alertmanager/config
name: config-volume
readOnly: true
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
serviceAccount: alertmanager-main
serviceAccountName: alertmanager-main
terminationGracePeriodSeconds: 120
volumes:
- name: config-volume
secret:
defaultMode: 420
secretName: alertmanager-main
- emptyDir: {}
name: alertmanager-main-db
updateStrategy:
type: RollingUpdate
status:
collisionCount: 0
currentReplicas: 2
currentRevision: alertmanager-main-59d7ccf598
observedGeneration: 4
readyReplicas: 2
replicas: 2
updateRevision: alertmanager-main-59d7ccf598
updatedReplicas: 2
You did not provide the whole scenario. It might depend on the readiness probe or the update strategy.
During a rolling update, a StatefulSet restarts its pods in reverse ordinal order, from index n-1 down to 0. Details can be found here.
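A quick way to watch that order while the rollout happens (reusing the label selector from your question and the namespace from your manifest):
$ kubectl get pods -l app=alertmanager -n monitoring -w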
Reason 1
A StatefulSet has 4 update strategies:
On Delete
Rolling Updates
Partitions
Forced Rollback
In the Partitions section you can find the following information:
If a partition is specified, all Pods with an ordinal that is greater
than or equal to the partition will be updated when the StatefulSet’s
.spec.template is updated. All Pods with an ordinal that is less
than the partition will not be updated, and, even if they are deleted,
they will be recreated at the previous version. If a StatefulSet’s
.spec.updateStrategy.rollingUpdate.partition is greater than its
.spec.replicas, updates to its .spec.template will not be
propagated to its Pods. In most cases you will not need to use a
partition, but they are useful if you want to stage an update, roll
out a canary, or perform a phased roll out.
So if somewhere in the StatefulSet you have set updateStrategy.rollingUpdate.partition: 1, it will restart only pods with an index of 1 or higher.
Example of partition: 3
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 30m
web-1 1/1 Running 0 30m
web-2 1/1 Running 0 31m
web-3 1/1 Running 0 2m45s
web-4 1/1 Running 0 3m
web-5 1/1 Running 0 3m13s
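For reference, a partition like in the example above would be set in the StatefulSet spec roughly like this (a sketch, rest of the spec omitted):
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 3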
Reason 2
Configuration of Readiness probe.
If your values for initialDelaySeconds and periodSeconds are high, it might take a while before the next pod is restarted. Details about those parameters can be found here.
In the example below, the pod will wait 10 seconds before the first readiness check, and the probe then checks every 2 seconds. Depending on the values, this might be the cause of the behavior.
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
Reason 3
I saw that you have 2 containers in each pod.
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 21h
alertmanager-main-1 2/2 Running 0 20s
As describe in docs:
Running - The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.
It would be good to check whether everything is OK with both containers (readinessProbe/livenessProbe, restarts, etc.).
You would need to delete it. StatefulSet pods are removed following their ordinal index, with the highest ordinal index first.
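For example, to force the pod that was not restarted to be recreated at the new revision (pod name and namespace taken from your output and manifest):
$ kubectl delete pod alertmanager-main-0 -n monitoring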
Also, you do not need to restart a pod for it to re-read an updated config map. This happens automatically (after some period of time).
This might be related to your ownerReferences definition. You can try it without any owner and do the rollout again.
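To see which controller currently owns the StatefulSet (per the manifest above it is the Alertmanager custom resource), something like this works:
$ kubectl get statefulset alertmanager-main -n monitoring -o jsonpath='{.metadata.ownerReferences[*].kind}'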