We are working on an akka-cluster based application that we run in a Kubernetes cluster. We would now like the application to scale up when load on the cluster increases, and we are using a HorizontalPodAutoscaler to achieve this. Our manifest files look like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
  namespace: some-namespace
  labels:
    componentName: our-component
    app: our-component
    version: some-version
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9252"
spec:
  serviceName: our-component
  replicas: 2
  selector:
    matchLabels:
      componentName: our-component
      app: our-app
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        componentName: our-component
        app: our-app
    spec:
      containers:
        - name: our-component-container
          image: image-path
          imagePullPolicy: Always
          resources:
            requests:
              cpu: .1
              memory: 500Mi
            limits:
              cpu: 1
              memory: 1Gi
          command:
            - "/microservice/bin/our-component"
          ports:
            - name: remoting
              containerPort: 8080
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /ready
              port: 9085
            initialDelaySeconds: 40
            periodSeconds: 30
            failureThreshold: 3
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /alive
              port: 9085
            initialDelaySeconds: 130
            periodSeconds: 30
            failureThreshold: 3
            timeoutSeconds: 5
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: some-namespace
  labels:
    componentName: our-component
    app: our-app
spec:
  minReplicas: 2
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: our-component
  targetCPUUtilizationPercentage: 75
The issue we face is that as soon as we deploy the application, it scales to the maxReplicas defined, even if there is no load on the application. Also, the application never seems to scale down.
Can someone who has faced a similar issue share why this happens and whether they were able to resolve it?
The issue is with the resource requests and limits. You have set the requested CPU to "1" and the limit to "0.1". As a result, as soon as your pod runs, the limit is exceeded, autoscaling kicks in, and it keeps scaling to the maximum number of replicas.
You need to switch the parameter names so that your request becomes 0.1 and your limit becomes 1.0. This way, your pod will start with a CPU request of 0.1, and once average usage grows larger than 75% across all pods (your targetCPUUtilizationPercentage), you will get more replicas. If the usage drops, you will get a scale down, just as expected.
As a general rule of thumb, the request should be less than the limit, or at least equal to it. It cannot be higher, because you end up with an infinitely scaling infrastructure.
I suspect this is just because you have such a low CPU request value. The HPA measures utilization relative to the request, so with a 0.1 request and a 75% target it is going to create new pods any time the Pods average more than about 0.075 vCPU. Startup activity could easily use more than 0.1 CPU, so even startup is enough to force the HPA to spawn more pods. Those new pods then get added to the cluster, so there is consensus activity, which, again, might push the average across all of the pods above the target.
It's a little surprising to me, because you'd think an idle application would stabilize below 0.1 vCPU, but 0.1 vCPU is very tiny.
I'd test:
First, just bumping up to more reasonable CPU requests. If you have 1.0 CPU request and 2.0 CPU limit, does this still happen? If not, then I was right and the request value was just set so low that "overhead" style activity could exceed the target.
If you are still seeing this behavior even then, I'd verify your HPA settings. The defaults should be OK, but I'd just validate everything. Run kubectl describe on the HPA to see the status and events. Maybe play around with periodSeconds and the stabilization settings.
If that still doesn't give you any clues, I'd just run the HPA samples and make sure that they work. Maybe there is a problem with the collected metrics or something similar.
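To make the stabilization suggestion above concrete, here is a minimal sketch of the same HPA written against the autoscaling/v2 API, which exposes those settings through its behavior field. It assumes a cluster recent enough to serve autoscaling/v2, and it assumes the target name should match the StatefulSet's metadata.name (app in the manifest as posted, whereas the posted HPA points at our-component). The window and policy values are illustrative placeholders, not tuned recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: some-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: app                  # assumption: must match the StatefulSet's metadata.name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # percentage of the CPU *request*, not the limit
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # illustrative: ignore short startup spikes
      policies:
        - type: Pods
          value: 1                      # illustrative: add at most one pod per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # illustrative: wait 5 minutes before shrinking
```

After applying something like this, kubectl describe hpa app-hpa -n some-namespace should show the current and target utilization together with the scaling events, which helps confirm whether startup activity is what drives the scale-up.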
I wonder if anyone is able to help me understand what I'm doing wrong here. I have 3 deployments in my namespace, and each of them has a Horizontal Pod Autoscaler configured. However, despite each HPA being configured to target its respective deployment, they all seem to be reacting to the same CPU metric:
$ kubectl get hpa -n my-namespace
NAME       REFERENCE         TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
app1-hpa   Deployment/app1   54%/100%   2         5         5          11h
app2-hpa   Deployment/app2   54%/100%   10        40        39         11h
app3-hpa   Deployment/app3   54%/100%   10        50        39         11h
In my example, app3 is the only one that is busy, but if you look at the TARGETS column, the % utilisation for all 3 HPAs is calculated the same, so they have ALL scaled up. app1, for example, which is completely idle (using only 1m of CPU per pod), has scaled up to 5 because the metric says that it's at 54%/100%.
What I'm trying to achieve is that each HPA reacts ONLY to the CPU metrics of the deployment it's paired with. In the above example, app1 would therefore clearly stay at 2 instances.
My HPA configuration looks like this (below is an example for app1):
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: app1-hpa
  namespace: my-namespace
spec:
  maxReplicas: 5
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app1
  targetCPUUtilizationPercentage: 100
Here is the deployment code for app1 (app2 and app3 are the same in all but name):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  labels:
    app: my-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: app
    spec:
      imagePullSecrets:
        - name: container-registry-access
      containers:
        - name: app
          image: "path_to_registry/location/imagename"
          resources:
            requests:
              memory: 512Mi
              cpu: 500m
            limits:
              memory: 512Mi
              cpu: 1000m
          env:
            - name: LOG_LEVEL
              value: info
          command: ["/entrypoint-app1.sh"]
          imagePullPolicy: Always
      restartPolicy: Always
Does anyone know what I'm doing wrong here? It seems to be scaling on an overall CPU average of all pods or something like that. If app3 is really busy, I don't want app1 and app2 to scale up as well when they are actually idle.
Any help on this would be greatly appreciated.
If all of your deployments are the same except for the name, I would suggest that you change spec.template.metadata.labels.app and spec.selector.matchLabels.app to correspond to the right application, meaning that each of those values for app1 would be app1 and not app.
My guess is that, because every deployment selects pods with the same app: app label, each HPA averages CPU over the pods of all three deployments, which is why the CPU average is the same for all of them.
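As a sketch of that change for app1 (a fragment only, with the rest of the pod spec left exactly as in the question), the labels and selector could look like this. The same pattern would apply to app2 and app3 with their own names.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  labels:
    app: app1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1          # unique per Deployment, so the HPA only sees app1's pods
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: app1        # must match spec.selector.matchLabels
    # spec: unchanged from the question (containers, resources, env, ...)
```

Keep in mind that spec.selector of an apps/v1 Deployment is immutable, so the existing Deployments would most likely have to be deleted and recreated rather than updated in place.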
I have 2 pods running, each with CPU: 0.2 core and memory: 1 Gi.
My node has a limit of 0.4 core and 2 Gi. I can't increase the node limits.
For zero downtime I have used the following config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: abc-deployment
spec:
  selector:
    matchLabels:
      app: abc
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: abc
        collect_logs_with_filebeat: "true"
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: "false"
    spec:
      containers:
        - name: abc
          image: abc-repository:latest
          ports:
            - containerPort: 8087
          readinessProbe:
            httpGet:
              path: /healthcheck
              port: 8087
            initialDelaySeconds: 540
            timeoutSeconds: 10
            periodSeconds: 10
            failureThreshold: 20
            successThreshold: 1
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 0.2
              memory: 1000Mi
            requests:
              cpu: 0.2
              memory: 1000Mi
On a new build deployment, two new pods get created on a new node, say node2 (because node1 doesn't have enough memory and CPU to accommodate the new pods). Once the new containers on node2 are in the running state, the old pods (running on node1) get destroyed, and node1 now has some free memory and CPU.
The issue I am facing is that, since node1 has free memory and CPU, Kubernetes destroys the newly created pods (running on node2), then creates pods on node1 and starts the app containers there, which causes downtime.
So, basically, in my case, even after using the RollingUpdate strategy and a health check endpoint, I am not able to achieve zero downtime.
Please help!
You could look at the concept of a Pod Disruption Budget, which is mostly used for achieving zero downtime for an application.
You could also read a related answer of mine, which shows an example of how to achieve zero downtime for an application using PDBs.
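For reference, a minimal PDB for the Deployment in the question might look like the sketch below. The name abc-pdb and the minAvailable value are assumptions for illustration, and policy/v1 assumes a reasonably recent cluster (older ones serve policy/v1beta1 instead).

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: abc-pdb            # hypothetical name
spec:
  minAvailable: 1          # illustrative: keep at least one abc pod running during evictions
  selector:
    matchLabels:
      app: abc             # matches the pod template labels of abc-deployment
```

A PDB only limits voluntary evictions such as node drains, so whether it helps here depends on what is actually removing the pods from node2.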
I created a deployment with liveness and readiness probes and an initial delay, which works fine. If I try to replace the initial delay with a startup probe, the startupProbe key and its nested elements are never included in the deployment descriptor when it is created with kubectl apply, and they get deleted from the deployment YAML in the GKE deployment editor after saving.
An example:
apiVersion: v1
kind: Namespace
metadata:
  name: "test"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-sleep
  namespace: test
spec:
  selector:
    matchLabels:
      app: postgres-sleep
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    metadata:
      labels:
        app: postgres-sleep
    spec:
      containers:
        - name: postgres-sleep
          image: krichter/microk8s-startup-probe-ignored:latest
          ports:
            - name: postgres
              containerPort: 5432
          readinessProbe:
            tcpSocket:
              port: 5432
            periodSeconds: 3
          livenessProbe:
            tcpSocket:
              port: 5432
            periodSeconds: 3
          startupProbe:
            tcpSocket:
              port: 5432
            failureThreshold: 60
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-sleep
  namespace: test
spec:
  selector:
    app: httpd
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432
---
with krichter/microk8s-startup-probe-ignored:latest being
FROM postgres:11
CMD sleep 30 && postgres
I'm reusing this example from the same issue with microk8s, where I could solve it by changing the kubelet and kube-apiserver configuration files (see https://github.com/ubuntu/microk8s/issues/770 in case you're interested). I assume this is not possible with GKE clusters, as they don't expose these files, probably for good reasons.
I assume that the feature needs to be enabled since it's behind a feature gate. How can I enable it on Google Kubernetes Engine (GKE) clusters with version >= 1.16? Currently I'm using the default from the regular channel, 1.16.8-gke.15.
As I mentioned in my comments, I was able to reproduce the same behavior in my test environment, and after some research I found the reason.
In GKE, feature gates are only permitted if you are using an alpha cluster. You can see a complete list of feature gates here.
I created an alpha cluster and applied the same YAML, and it works for me: the startupProbe is there in place.
So, you will only be able to use startupProbe in a GKE alpha cluster. Follow this documentation to create a new one.
Be aware of the limitations in alpha clusters:
Alpha clusters have the following limitations:
Not covered by the GKE SLA
Cannot be upgraded
Node auto-upgrade and auto-repair are disabled on alpha clusters
Automatically deleted after 30 days
Do not receive security updates
Also, Google doesn't recommend using them for production workloads:
Warning: Do not use Alpha clusters or alpha features for production workloads. Alpha clusters expire after thirty days and do not receive security updates. You must migrate your data from alpha clusters before they expire. GKE does not automatically save data stored on alpha clusters.
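If an alpha cluster is not an option, one fallback is the approach from the question that already worked: keep the liveness probe and give it an initial delay instead of a startup probe. The fragment below is only a sketch of the container's probes; the 60 second delay is an assumed value chosen to cover the 30 second sleep in the example image. As far as I know, the StartupProbe feature gate is enabled by default from Kubernetes 1.18 onwards, so this workaround should only be needed on older versions.

```yaml
# Fragment of spec.template.spec.containers[0] for postgres-sleep
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 3
livenessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 60   # assumed value: covers the "sleep 30" before postgres starts
  periodSeconds: 3
```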
I have deployed InfluxDB 2.0.0 as a StatefulSet with EBS volume persistence. I've noticed that if, for some reason, the pod gets rescheduled to another node, or even if we scale the StatefulSet down to replicas = 0 and then scale it back up, the effect on persisted data is the same: it is lost.
Initially, in the case of a pod that gets rescheduled to another node, I thought the problem was with the EBS volume, that it doesn't get unmounted and then mounted to the other node where the pod replica is running, but that is NOT the case. The EBS volume is present and the same PV/PVC exists, but the data is lost.
To figure out what might be the problem, I purposely set up InfluxDB, added data, and then did this:
kubectl scale statefulsets influxdb --replicas=0
...
kubectl scale statefulsets influxdb --replicas=1
The effect was the same as when the InfluxDB pod got rescheduled: data was lost.
Is there any specific reason why something like that would happen?
My environment:
I'm using an EKS environment with Kubernetes version 1.15 for the control plane and workers.
Fortunately, the problem was due to the big changes between InfluxDB 1.x and the 2.0.0 beta version in terms of where the actual data is persisted.
In the 1.x version, data was persisted in:
/var/lib/influxdb
while in the 2.x version, data is persisted, by default, in:
/root/.influxdbv2
My EBS volume was mounted at the 1.x location, so with every restart of the pod (caused either by scaling down or by scheduling to another node), the EBS volume was attached as usual but at the wrong location. That was the reason why there was no data.
Also, one difference that I see is that configuration params cannot be provided to the 2.x version via a configuration file (as with 1.x, where I had the configuration file mounted into the container as a ConfigMap). We have to provide additional configuration params inline. This link explains how: https://v2.docs.influxdata.com/v2.0/reference/config-options/
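As a rough illustration of providing params inline, the container could receive them as environment variables. This sketch assumes that the beta image honors the INFLUXD_-prefixed variables described on the config options page (INFLUXD_BOLT_PATH and INFLUXD_ENGINE_PATH, mirroring the --bolt-path and --engine-path flags); I have not verified that against this specific beta build, and the values shown are simply the default locations under /root/.influxdbv2.

```yaml
# Fragment of the StatefulSet's pod template; only the env section is new
containers:
  - name: influxdb
    image: quay.io/influxdb/influxdb:2.0.0-beta
    env:
      - name: INFLUXD_BOLT_PATH        # assumption: equivalent of --bolt-path
        value: /root/.influxdbv2/influxd.bolt
      - name: INFLUXD_ENGINE_PATH      # assumption: equivalent of --engine-path
        value: /root/.influxdbv2/engine
    volumeMounts:
      - mountPath: /root/.influxdbv2   # the persistent volume must cover both paths
        name: influxdb-data
```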
In the end, this is the working version of the StatefulSet:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: influxdb
  name: influxdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  serviceName: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      containers:
        - image: quay.io/influxdb/influxdb:2.0.0-beta
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /ping
              port: api
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: influxdb
          ports:
            - containerPort: 9999
              name: api
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /ping
              port: api
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "800m"
              memory: 1200Mi
            requests:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - mountPath: /root/.influxdbv2
              name: influxdb-data
  volumeClaimTemplates:
    - metadata:
        name: influxdb-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        volumeMode: Filesystem
I am experimenting with GKE cluster upgrades in a 6-node (in two node pools) test cluster before I try it on our staging or production cluster. Upgrading when I only had a 12-replica nginx deployment, the nginx ingress controller and cert-manager (as a helm chart) installed took 10 minutes per node pool (3 nodes). I was very satisfied. I decided to try again with something that looks more like our setup. I removed the nginx deployment and added 2 node.js deployments and the following helm charts: mongodb-0.4.27, mcrouter-0.1.0 (as a statefulset), redis-ha-2.0.0, and my own www-redirect-0.0.1 chart (a simple nginx which does a redirect). The problem seems to be with mcrouter. Once the node starts draining, the status of that node changes to Ready,SchedulingDisabled (which seems normal), but the following pods remain:
mcrouter-memcached-0
fluentd-gcp-v2.0.9-4f87t
kube-proxy-gke-test-upgrade-cluster-default-pool-74f8edac-wblf
I do not know why those two kube-system pods remain, but that mcrouter one is mine and it won't go quickly enough. If I wait long enough (1 hour+) then it eventually works, I am not sure why. The current node pool (of 3 nodes) started upgrading 2 hours 46 minutes ago and 2 nodes are upgraded; the 3rd one is still upgrading but nothing is moving... I presume it will complete in the next 1-2 hours...
I tried to run the drain command with --ignore-daemonsets --force, but it told me the node was already drained.
I tried to delete the pods, but they just come back and the upgrade does not move any faster.
Any thoughts?
Update #1
The mcrouter helm chart was installed like this:
helm install stable/mcrouter --name mcrouter --set controller=statefulset
The statefulset it created for the mcrouter part is:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-mcrouter
    chart: mcrouter-0.1.0
    heritage: Tiller
    release: mcrouter
  name: mcrouter-mcrouter
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-mcrouter
      chart: mcrouter-0.1.0
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-mcrouter
  template:
    metadata:
      labels:
        app: mcrouter-mcrouter
        chart: mcrouter-0.1.0
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: mcrouter-mcrouter
                  release: mcrouter
              topologyKey: kubernetes.io/hostname
      containers:
        - args:
            - -p 5000
            - --config-file=/etc/mcrouter/config.json
          command:
            - mcrouter
          image: jphalip/mcrouter:0.36.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: mcrouter-port
            timeoutSeconds: 5
          name: mcrouter-mcrouter
          ports:
            - containerPort: 5000
              name: mcrouter-port
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: mcrouter-port
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 256m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/mcrouter
              name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 420
            name: mcrouter-mcrouter
          name: config
  updateStrategy:
    type: OnDelete
and here is the memcached statefulset:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-memcached
    chart: memcached-1.2.1
    heritage: Tiller
    release: mcrouter
  name: mcrouter-memcached
spec:
  podManagementPolicy: OrderedReady
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-memcached
      chart: memcached-1.2.1
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-memcached
  template:
    metadata:
      labels:
        app: mcrouter-memcached
        chart: memcached-1.2.1
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: mcrouter-memcached
                  release: mcrouter
              topologyKey: kubernetes.io/hostname
      containers:
        - command:
            - memcached
            - -m 64
            - -o
            - modern
            - -v
          image: memcached:1.4.36-alpine
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: memcache
            timeoutSeconds: 5
          name: mcrouter-memcached
          ports:
            - containerPort: 11211
              name: memcache
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: memcache
            timeoutSeconds: 1
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: OnDelete
status:
  replicas: 0
That is a somewhat complex question, and I am definitely not sure that it is how I think it is, but let's try to understand what is happening.
You have an upgrade process and 6 nodes in the cluster. The system will upgrade them one by one, using drain to remove all workload from the node.
The drain process itself respects your settings: the number of replicas and the desired state of the workload have higher priority than the drain of the node itself.
During the drain process, Kubernetes will try to schedule all your workload on resources where scheduling is available. Scheduling on a node which the system wants to drain is disabled; you can see it in its state, Ready,SchedulingDisabled.
So, the Kubernetes scheduler tries to find the right place for your workload on all available nodes. It will wait as long as it needs to place everything you describe in the cluster configuration.
Now the most important thing. You set replicas: 5 for your mcrouter-memcached. It cannot run more than one replica per node because of podAntiAffinity, and a node running it must have enough resources for it, which is calculated using the resources: block of the StatefulSet.
So, I think that your cluster just does not have enough resources to run a new replica of mcrouter-memcached on the remaining 5 nodes. For example, on the last node where a replica is not yet running, you do not have enough memory because of other workloads.
I think that if you set replicas for mcrouter-memcached to 4, it will solve the problem. Or you can try to use slightly more powerful instances for that workload, or add one more node to the cluster; that should also help.
I hope I have explained my logic well enough; ask me if something is not clear to you. But first, please try to solve the issue with the suggested solution :)
The problem was a combination of the minAvailable value from a PodDisruptionBudget (that was part of the memcached helm chart, which is a dependency of the mcrouter helm chart) and the replicas value for the memcached statefulset. Both were set to 5, and therefore none of the pods could be deleted during the drain. I tried changing minAvailable to 4, but PDBs are immutable at this time. What I did was remove the helm chart and replace it.
helm delete --purge myproxy
helm install ./charts/mcrouter-0.1.0-croy.1.tgz --name myproxy --set controller=statefulset --set memcached.replicaCount=5 --set memcached.pdbMinAvailable=4
Once that was done, I was able to get the cluster to upgrade normally.
What I should have done (but only thought of afterwards) was to change the replicas value to 6; that way I would not have needed to delete and replace the whole chart.
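For illustration, a PDB that leaves room for the drain looks roughly like the sketch below. The object name is hypothetical (the chart renders its own), the selector labels are taken from the memcached statefulset above, and policy/v1 assumes a recent cluster; at the time of this question the API group would have been policy/v1beta1.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcrouter-memcached     # hypothetical name
spec:
  minAvailable: 4              # with replicas: 5 this allows one pod to be evicted at a time
  selector:
    matchLabels:
      app: mcrouter-memcached
      release: mcrouter
```

The important part is the gap between replicas and minAvailable: with both at 5 the budget allows zero disruptions, while 5 and 4 (or 6 and 5) allow one pod to be evicted at a time.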
Thank you @AntonKostenko for trying to help me find this issue.
This issue also helped me.
Thanks to the folks in Slack#Kubernetes, especially to Paris, who tried to get my issue more visibility, and to the volunteers of the Kubernetes Office Hours (which happened to be yesterday, lucky me!) for also taking a look.
Finally, thank you to psycotica0 from Kubernetes Canada for also giving me some pointers.