GKE Kubernetes Node Pool Upgrade very slow - kubernetes

I am experimenting with GKE cluster upgrades in a 6-node test cluster (two node pools) before I try it on our staging or production cluster. When I only had a 12-replica nginx deployment, the nginx ingress controller and cert-manager (as helm charts) installed, upgrading took 10 minutes per node pool (3 nodes). I was very satisfied. I decided to try again with something that looks more like our setup. I removed the nginx deployment and added 2 node.js deployments and the following helm charts: mongodb-0.4.27, mcrouter-0.1.0 (as a statefulset), redis-ha-2.0.0, and my own www-redirect-0.0.1 chart (a simple nginx that does redirects). The problem seems to be with mcrouter. Once the node starts draining, the status of that node changes to Ready,SchedulingDisabled (which seems normal) but the following pods remain:
mcrouter-memcached-0
fluentd-gcp-v2.0.9-4f87t
kube-proxy-gke-test-upgrade-cluster-default-pool-74f8edac-wblf
I do not know why those two kube-system pods remain, but the mcrouter one is mine and it won't go away quickly enough. If I wait long enough (1 hour+) it eventually works, and I am not sure why. The current node pool (of 3 nodes) started upgrading 2 hours 46 minutes ago and 2 nodes are upgraded; the 3rd one is still upgrading but nothing is moving... I presume it will complete in the next 1-2 hours...
I tried to run the drain command with --ignore-daemonsets --force but it told me it was already drained.
I tried to delete the pods, but they just come back and the upgrade does not move any faster.
Any thoughts?
Update #1
The mcrouter helm chart was installed like this:
helm install stable/mcrouter --name mcrouter --set controller=statefulset
The StatefulSet it created for the mcrouter part is:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-mcrouter
    chart: mcrouter-0.1.0
    heritage: Tiller
    release: mcrouter
  name: mcrouter-mcrouter
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-mcrouter
      chart: mcrouter-0.1.0
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-mcrouter
  template:
    metadata:
      labels:
        app: mcrouter-mcrouter
        chart: mcrouter-0.1.0
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-mcrouter
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -p 5000
        - --config-file=/etc/mcrouter/config.json
        command:
        - mcrouter
        image: jphalip/mcrouter:0.36.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 5
        name: mcrouter-mcrouter
        ports:
        - containerPort: 5000
          name: mcrouter-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 256m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mcrouter
          name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: mcrouter-mcrouter
        name: config
  updateStrategy:
    type: OnDelete
And here is the memcached StatefulSet:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-memcached
    chart: memcached-1.2.1
    heritage: Tiller
    release: mcrouter
  name: mcrouter-memcached
spec:
  podManagementPolicy: OrderedReady
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-memcached
      chart: memcached-1.2.1
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-memcached
  template:
    metadata:
      labels:
        app: mcrouter-memcached
        chart: memcached-1.2.1
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-memcached
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - command:
        - memcached
        - -m 64
        - -o
        - modern
        - -v
        image: memcached:1.4.36-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 5
        name: mcrouter-memcached
        ports:
        - containerPort: 11211
          name: memcache
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: OnDelete
status:
  replicas: 0

This is a fairly complex question, and I am not certain my reading is right, but let's try to work out what is happening.
You are upgrading a cluster with 6 nodes. The system upgrades the nodes one by one, using drain to remove all workloads from each node.
The drain process respects your settings: the replica count and desired state of a workload take priority over draining the node.
During the drain, Kubernetes tries to reschedule your workloads on nodes where scheduling is available. Scheduling is disabled on the node being drained, which you can see in its status: Ready,SchedulingDisabled.
So the Kubernetes scheduler tries to find a suitable place for your workloads on the remaining nodes, and it will wait as long as it needs to place everything described in the cluster configuration.
Now the most important part. You set replicas: 5 for mcrouter-memcached. Because of podAntiAffinity, it cannot run more than one replica per node, and a node that runs a replica must have enough free resources, as calculated from the resources: block of the StatefulSet.
So I think your cluster simply does not have enough resources to run a new replica of mcrouter-memcached on the remaining 5 nodes. For example, the one node that is not yet running a replica may not have enough memory left because of other workloads.
I think setting replicas for mcrouter-memcached to 4 would solve the problem. Alternatively, you could use more powerful instances for that workload or add one more node to the cluster; either should also help.
I hope that explains my reasoning; ask me if anything is unclear. But first, please try the solution above :)

The problem was a combination of the minAvailable value from a PodDisruptionBudget (which is part of the memcached helm chart, a dependency of the mcrouter helm chart) and the replicas value of the memcached StatefulSet. Both were set to 5, so none of the pods could be evicted during the drain. I tried changing minAvailable to 4, but PDBs are immutable at this time. What I did instead was remove the helm chart and reinstall it.
helm delete --purge myproxy
helm install ./charts/mcrouter-0.1.0-croy.1.tgz --name myproxy --set controller=statefulset --set memcached.replicaCount=5 --set memcached.pdbMinAvailable=4
Once that was done, I was able to get the cluster to upgrade normally.
What I should have done (but only thought of afterwards) was change the replicas value to 6. That way I would not have needed to delete and replace the whole chart.
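The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of how a minAvailable budget limits evictions, not the actual controller code; the function name allowed_disruptions is made up for this example, and the numbers are the ones from this answer.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods a drain may evict right now under a minAvailable budget."""
    # Evictions are only allowed while at least min_available pods stay healthy.
    return max(0, healthy_pods - min_available)


# Original chart values: 5 replicas, minAvailable 5 -> the drain can never evict.
print(allowed_disruptions(5, 5))  # 0
# After reinstalling with pdbMinAvailable=4.
print(allowed_disruptions(5, 4))  # 1
# The alternative mentioned above: keep minAvailable at 5, scale to 6 replicas.
print(allowed_disruptions(6, 5))  # 1
```

With zero allowed disruptions the eviction request is refused every time, which matches the node sitting in Ready,SchedulingDisabled while the memcached pod stays put.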
Thank you @AntonKostenko for trying to help me find this issue.
This issue also helped me.
Thanks to the folks in the Kubernetes Slack, especially to Paris, who tried to get my issue more visibility, and to the volunteers of the Kubernetes Office Hours (which happened to be yesterday, lucky me!) for also taking a look.
Finally, thank you to psycotica0 from Kubernetes Canada for also giving me some pointers.

Related

kubernetes k8 unable to pull latest image

Hi, I am working with Kubernetes. Below is my k8s manifest for the deployment.
apiVersion: apps/v1
kind: Deployment
metadata: # Dictionary
  name: webapp
spec: # Dictionary
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # maxUnavailable defines how many pods can be unavailable during the rolling update
      maxUnavailable: 50%
      # maxSurge defines how many extra pods can be added at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: webapp
      instance: app
  template:
    metadata: # Dictionary
      name: webapplication
      labels: # Dictionary
        app: webapp # Key-value pairs
        instance: app
      annotations:
        vault.security.banzaicloud.io/vault-role: al-dev
    spec:
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      containers: # List
      - name: al-webapp-container
        image: ghcr.io/my-org/al.web:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "1Gi"
            cpu: "900m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      imagePullSecrets:
      - name: githubpackagesecret
Whenever I deploy this into Kubernetes, it does not pick up the latest image from GitHub Packages. What should I do in order to pull the latest image and update the currently running pod with it? Can someone help me fix this issue? Any help would be appreciated. Thank you.
If you redeploy with the same latest tag, the Deployment may not be updated at all, because the image tag in the pod template has not changed.
A pod restart is required so that the new image is downloaded. If the pod still runs the old code after a restart, the problem is in the image build (for example, a cached image being pushed).
What you can try for now is:
kubectl rollout restart deploy <deployment-name> -n <namespace>
This will restart the pods so that they all fetch the new image; then check whether the latest code is running.
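To see why a restart helps even though the manifest is unchanged: kubectl rollout restart works by stamping the pod template with a kubectl.kubernetes.io/restartedAt annotation, so the template differs and new pods are rolled out (and, with imagePullPolicy: Always, the image is pulled again). The Python sketch below only builds the equivalent patch body for illustration; the helper name restart_patch is made up for this example and nothing here talks to a cluster.

```python
import json
from datetime import datetime, timezone


def restart_patch(now=None):
    """Build a patch body that changes only a pod-template annotation."""
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # Same annotation key that `kubectl rollout restart` sets.
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


# Print the patch as JSON; the timestamp changes on every call,
# so every call produces a "new" pod template.
print(json.dumps(restart_patch(), indent=2))
```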
Since you have imagePullPolicy: Always set, it should always pull the image. Can you run kubectl describe on the pod while it is starting so we can see the events?

Akka: Application not scaling down when using Kubernetes HorizontalPodAutoscaler

We are working on an akka-cluster based application which we run in a Kubernetes cluster. We would now like the application to scale up when the load on the cluster increases. We are using a HorizontalPodAutoscaler to achieve this. Our manifest files look like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
  namespace: some-namespace
  labels:
    componentName: our-component
    app: our-component
    version: some-version
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9252"
spec:
  serviceName: our-component
  replicas: 2
  selector:
    matchLabels:
      componentName: our-component
      app: our-app
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        componentName: our-component
        app: our-app
    spec:
      containers:
      - name: our-component-container
        image: image-path
        imagePullPolicy: Always
        resources:
          requests:
            cpu: .1
            memory: 500Mi
          limits:
            cpu: 1
            memory: 1Gi
        command:
        - "/microservice/bin/our-component"
        ports:
        - name: remoting
          containerPort: 8080
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /ready
            port: 9085
          initialDelaySeconds: 40
          periodSeconds: 30
          failureThreshold: 3
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /alive
            port: 9085
          initialDelaySeconds: 130
          periodSeconds: 30
          failureThreshold: 3
          timeoutSeconds: 5
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: some-namespace
  labels:
    componentName: our-component
    app: our-app
spec:
  minReplicas: 2
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: our-component
  targetCPUUtilizationPercentage: 75
The issue we face is that as soon as we deploy the application, it scales to the maxReplicas defined, even if there is no load on the application. Also, the application never seems to scale down.
Can someone who faced a similar issue in their application share their experience of why this happens and if they were able to resolve this?
The issue is with the resource request and limits. You have set requested CPU to be "1" whereas limit to be "0.1". Therefore, what is happening is, as soon as your pod runs, the limit is naturally exceeded, and the autoscaling kicks in and keeps scaling to the max number of replicas.
You need to switch the parameter names so that your request becomes 0.1 and limit becomes 1.0. This way, your pod will start with 0.1 units of CPU shares, and once average usage grows larger than 75% across all pods, you will get more replicas, and if the usage drops, you will have a scale down, just as expected.
As a general rule of thumb, request is less than limit, or at least equal to it. It cannot be higher, because you end up with an infinitely scaling infrastructure.
I suspect this is just because you have such a low CPU request value. With a 0.1 request and a 75% target, any time a Pod uses more than 0.075 vCPU on average, the HPA is going to create new pods. And startup activity could easily use more than 0.1 CPU, so even the startup activity is enough to force the HPA to spawn more pods. But then those new pods get added to the cluster, so there is consensus activity, which, again, might push all of the pods above an average of 0.1 CPU.
It's a little surprising to me, because you'd think an idle application would stabilize below 0.1 vCPU, but 0.1 vCPU is very tiny.
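To make the reasoning above concrete, here is a rough Python sketch of the replica calculation described in the Kubernetes HPA documentation (desired = ceil(current * currentUtilization / targetUtilization), with utilization measured against the CPU request). It deliberately ignores the tolerance band and stabilization windows, the function name is made up, and the 300m average usage is a hypothetical figure, not something measured in the question.

```python
def desired_replicas(current, usage_m, request_m, target_pct, min_r, max_r):
    """Simplified HPA math; usage_m and request_m are millicores per pod."""
    # utilization % = 100 * usage / request; desired = ceil(current * util / target)
    numerator = current * usage_m * 100
    denominator = request_m * target_pct
    desired = -(-numerator // denominator)  # integer ceiling division
    return max(min_r, min(max_r, desired))


# Request 100m (0.1 CPU), target 75%, hypothetical 300m average usage per pod:
print(desired_replicas(2, 300, 100, 75, 2, 8))   # 8, i.e. straight to maxReplicas
# Same usage with a 1000m (1.0 CPU) request stays at minReplicas:
print(desired_replicas(2, 300, 1000, 75, 2, 8))  # 2
```

The same absolute CPU usage gives a very different utilization percentage depending on the request, which is why raising the request is the first thing to test.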
I'd test:
First, just bumping up to more reasonable CPU requests. If you have a 1.0 CPU request and a 2.0 CPU limit, does this still happen? If not, then I was right and the request value was just set so low that "overhead" style activity could exceed the target.
If you are still seeing this behavior even then, I'd verify your HPA settings. The defaults should be OK, but I'd just validate everything. Run a describe on the HPA to see the status and events. Maybe play around with periodSeconds and the stabilization settings.
If that still doesn't give you any clues, I'd just run the HPA samples and make sure that they work. Maybe there is a problem with the collected metrics or something similar.

Re-route traffic in kubernetes to a working pod

I'm not sure whether this has been asked before, so pardon me if I couldn't find an existing question.
I have a cluster based on 3 nodes, my application consists of a frontend and a backend with each running 2 replicas:
front1 - running on node1
front2 - running on node2
be1 - node1
be2 - node2
Both FE pods are served behind frontend-service.
Both BE pods are served behind be-service.
When I shut down node2, the application stopped and I could see application errors in my UI.
I checked the logs and found that my application attempted to reach the backend pods through the service and got no response, because be2 wasn't running and the scheduler had yet to terminate the existing pod.
Only when the node was terminated and removed from the cluster, the pods were rescheduled to the 3rd node and the application was back online.
I know a service mesh can help by removing unresponsive pods from the traffic. However, I don't want to implement one yet, and I am trying to understand what the best solution is for routing traffic to the healthy pods in a fast and easy way; 5 minutes of downtime is a lot of time.
Here's my backend deployment spec:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: backend
  name: backend
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: backend
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: backend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-Application
                operator: In
                values:
                - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - backend
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: SSL_ENABLED
          value: "false"
        image: quay.io/something:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /liveness
            port: 16006
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 10
        name: backend
        ports:
        - containerPort: 16006
          protocol: TCP
        - containerPort: 8457
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 16006
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 1500m
            memory: 8500Mi
          requests:
            cpu: 6m
            memory: 120Mi
      dnsPolicy: ClusterFirst
Here's my backend service:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: identity
  name: backend
  namespace: default
spec:
  clusterIP: 10.233.34.115
  ports:
  - name: tcp
    port: 16006
    protocol: TCP
    targetPort: 16006
  - name: internal-http-rpc
    port: 8457
    protocol: TCP
    targetPort: 8457
  selector:
    app: backend
  sessionAffinity: None
  type: ClusterIP
This is a community wiki answer. Feel free to expand it.
As already mentioned by @TomerLeibovich, the main issue here was due to the Probes Configuration:
Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup Probes. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness probe means restarting the container. In case of readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
Plus the proper Pod eviction configuration:
The kubelet needs to preserve node stability when available compute
resources are low. This is especially important when dealing with
incompressible compute resources, such as memory or disk space. If
such resources are exhausted, nodes become unstable.
Changing failureThreshold to 1 instead of 3 and reducing the pod eviction time solved the issue, as the Pod is now evicted sooner.
EDIT:
Another possible solution in this scenario is to label the other nodes for the backend app, to make sure that each backend pod is deployed on a different node. In your current situation, one pod deployed on the node was removed from the endpoint and the application became unresponsive.
Also, a workaround for triggering pod eviction from the unhealthy node sooner is to add tolerations to deployment.spec.template.spec:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
instead of using the default value, tolerationSeconds: 300.
You can find more information in this documentation.
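As a back-of-the-envelope illustration of why the toleration matters, the delay before pods are evicted from an unreachable node is roughly the time it takes the control plane to notice the node is gone plus tolerationSeconds. The Python sketch below only does that addition; the function name is made up, and the 40-second detection delay is an assumed example value that depends on how the cluster is configured.

```python
DEFAULT_TOLERATION_S = 300  # default tolerationSeconds mentioned above


def seconds_until_eviction(node_detection_s: int, toleration_s: int) -> int:
    """Rough delay between a node dying and its pods being evicted."""
    # The pod is only evicted once the node is marked unreachable AND the
    # toleration for that taint has run out.
    return node_detection_s + toleration_s


ASSUMED_DETECTION_S = 40  # assumption for illustration, not taken from the question

print(seconds_until_eviction(ASSUMED_DETECTION_S, DEFAULT_TOLERATION_S))  # 340
print(seconds_until_eviction(ASSUMED_DETECTION_S, 60))                    # 100
```

With the default, the result lands in the same range as the roughly 5 minutes of downtime described in the question.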

Kubernetes : How to attain 0 downtime

I have 2 pods running, each with CPU: 0.2 core and memory: 1 Gi.
My node has a limit of 0.4 core and 2 Gi. I can't increase the node limits.
For zero downtime I have the following config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: abc-deployment
spec:
  selector:
    matchLabels:
      app: abc
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: abc
        collect_logs_with_filebeat: "true"
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: "false"
    spec:
      containers:
      - name: abc
        image: abc-repository:latest
        ports:
        - containerPort: 8087
        readinessProbe:
          httpGet:
            path: /healthcheck
            port: 8087
          initialDelaySeconds: 540
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 20
          successThreshold: 1
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 0.2
            memory: 1000Mi
          requests:
            cpu: 0.2
            memory: 1000Mi
On a new build deployment, two new pods get created on a new node, say node2 (because node1 doesn't have enough memory and CPU to accommodate the new pods). Once the new containers on node2 are in the running state, the old pods (running on node1) get destroyed, and node1 then has free CPU and memory again.
The issue I am facing is that, since node1 now has free memory and CPU, Kubernetes destroys the newly created pods (running on node2), then creates pods on node1 and starts the app container there, which causes downtime.
So, in my case, even with the RollingUpdate strategy and a health check endpoint, I am not able to achieve zero downtime.
Please help here!
You could look at the concept of Pod Disruption Budget that is used mostly for achieving zero downtime for an application.
You could also read a related answer of mine which shows an example of how to achieve the zero down time for an application using the PDBs.
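For context on the rollout settings in the question, the following Python sketch works out how many pods can exist at the peak of a rolling update and how many must stay available. It is a simplified illustration (the function names are made up); the rounding follows the documented behaviour that a percentage maxSurge rounds up and a percentage maxUnavailable rounds down.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute number or a 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        fraction = replicas * int(value[:-1]) / 100
        return math.ceil(fraction) if round_up else math.floor(fraction)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (max pods during rollout, min pods that must stay available)."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


# Values from the question: replicas 2, maxSurge 2, maxUnavailable 0.
peak, floor_available = rollout_bounds(2, 2, 0)
print(peak, floor_available)  # 4 2
# Each pod requests 0.2 CPU (200m); the node only has 0.4 CPU (400m).
print(peak * 200 > 400)       # True: the surge pods cannot fit on node1
```

This is why the new pods land on node2 in the first place: at the peak the Deployment needs room for four pods, twice what node1 can hold.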

Influxdb 2.0 running in K8s gets data lost every time Statefulset pod is scaled down or rescheduled

I have deployed Influxdb 2.0.0 as a StatefulSet with EBS volume persistence. I've noticed that if, for some reason, the pod gets rescheduled to another node, or even if we scale the StatefulSet down to replicas = 0 and then scale it up again, the effect on the persisted data is the same: it is lost.
Initially, in the case of a pod rescheduled to another node, I thought the problem was with the EBS volume, that it didn't get unmounted and then mounted on the node where the pod replica is running, but that is NOT the case. The EBS volume is present and the same pv/pvc exists, but the data is lost.
To figure out what might be the problem, I've purposely done influxdb setup and added data and then did this:
kubectl scale statefulsets influxdb --replicas=0
...
kubectl scale statefulsets influxdb --replicas=1
The effect was the same just like when influxdb pod got rescheduled. Data was lost.
Any specific reason why would something like that happen?
My environment:
I'm using EKS k8s environment with 1.15 k8s version of control plane/workers.
Fortunately, the problem was due to the big changes between influxdb 1.x and the 2.0.0 beta version in terms of where the actual data is persisted.
In 1.x version, data was persisted in:
/var/lib/influxdb
while on the 2.x version, data is persisted, by default, on:
/root/.influxdbv2
My EBS volume was mounted on the 1.x location, so with every restart of the pod (caused either by scaling down or by rescheduling to another node), the EBS volume was attached as usual but at the wrong location. That was why there was no data.
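A small Python sketch of the check that would have caught this: given the InfluxDB major version, does the volume's mountPath actually cover the directory where data is written? The two default paths are the ones listed above; the function and dictionary names are made up for illustration.

```python
from pathlib import PurePosixPath

# Default data directories as described in this answer.
DEFAULT_DATA_DIR = {1: "/var/lib/influxdb", 2: "/root/.influxdbv2"}


def mount_covers_data(major_version: int, mount_path: str) -> bool:
    """True if data written to the default directory lands on the mounted volume."""
    data_dir = PurePosixPath(DEFAULT_DATA_DIR[major_version])
    mount = PurePosixPath(mount_path)
    # The mount must be the data directory itself or one of its parents.
    return mount == data_dir or mount in data_dir.parents


print(mount_covers_data(2, "/var/lib/influxdb"))  # False: the situation described above
print(mount_covers_data(2, "/root/.influxdbv2"))  # True: the fixed mountPath
```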
Also, one difference I see is that configuration params cannot be provided to the 2.x version via a configuration file (as in 1.x, where I had the configuration file mounted into the container as a configmap). We have to provide additional configuration params inline. This link explains how: https://v2.docs.influxdata.com/v2.0/reference/config-options/
In the end, this is the working version of the StatefulSet:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: influxdb
  name: influxdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  serviceName: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      containers:
      - image: quay.io/influxdb/influxdb:2.0.0-beta
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: api
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: influxdb
        ports:
        - containerPort: 9999
          name: api
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: api
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "800m"
            memory: 1200Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - mountPath: /root/.influxdbv2
          name: influxdb-data
  volumeClaimTemplates:
  - metadata:
      name: influxdb-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem