Not sure if there is already a question like this, so pardon me if I couldn't find one.
I have a cluster of 3 nodes; my application consists of a frontend and a backend, each running 2 replicas:
front1 - running on node1
front2 - running on node2
be1 - node1
be2 - node2
Both FE pods are served behind frontend-service
Both BE pods are served behind be-service
When I shut down node-2, the application stopped working and I could see application errors in my UI.
I checked the logs and found that my application attempted to reach the backend pods through the backend Service and the requests failed: be2 was no longer responding, but its pod had not yet been terminated and removed from the endpoints.
Only when the node was terminated and removed from the cluster were the pods rescheduled to the 3rd node and the application came back online.
I know a service mesh can help by removing unresponsive pods from the traffic, but I don't want to implement one yet. I'm trying to understand the best way to route traffic to the healthy pods quickly and easily; 5 minutes of downtime is a lot of time.
Here's my backend Deployment spec:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: backend
  name: backend
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: backend
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: backend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-Application
                operator: In
                values:
                - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - backend
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: SSL_ENABLED
          value: "false"
        image: quay.io/something:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /liveness
            port: 16006
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 10
        name: backend
        ports:
        - containerPort: 16006
          protocol: TCP
        - containerPort: 8457
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 16006
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 1500m
            memory: 8500Mi
          requests:
            cpu: 6m
            memory: 120Mi
      dnsPolicy: ClusterFirst
Here's my backend service:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: identity
  name: backend
  namespace: default
spec:
  clusterIP: 10.233.34.115
  ports:
  - name: tcp
    port: 16006
    protocol: TCP
    targetPort: 16006
  - name: internal-http-rpc
    port: 8457
    protocol: TCP
    targetPort: 8457
  selector:
    app: backend
  sessionAffinity: None
  type: ClusterIP
This is a community wiki answer. Feel free to expand it.
As already mentioned by #TomerLeibovich, the main issue here was due to the probe configuration:
Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup probes. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of a liveness probe means restarting the container. In case of a readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
Plus the proper Pod eviction configuration:
The kubelet needs to preserve node stability when available compute resources are low. This is especially important when dealing with incompressible compute resources, such as memory or disk space. If such resources are exhausted, nodes become unstable.
Changing the failureThreshold to 1 instead of 3 and reducing the pod eviction timeout solved the issue, as the Pod is now evicted sooner.
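For illustration, here is a minimal sketch of the tightened readiness probe for the backend container (same endpoint and port as in the question; only failureThreshold is reduced to 1 as described above):

readinessProbe:
  failureThreshold: 1   # mark the Pod Unready after a single failed probe
  httpGet:
    path: /readiness
    port: 16006
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5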
EDIT:
The other possible solution in this scenario is to label additional nodes for the backend application, so that each backend pod can be scheduled on a different node. In your current situation, one pod deployed on the failed node was removed from the Service endpoints and the application became unresponsive.
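For example, using the node label already referenced by the Deployment's nodeAffinity rule (node-Application=true) and a hypothetical node name node3, labelling the third node would look like this:

kubectl label node node3 node-Application=true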
Also, the workaround for triggering pod eviction from the unhealthy node sooner is to add tolerations to deployment.spec.template.spec:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60

instead of using the default value of tolerationSeconds: 300.
You can find more information in this documentation.
Related
I'm trying to deploy Metabase on my GKE cluster but I get "Readiness probe failed".
I built the image locally, and a GET to localhost:3000/api/health returns status 200, but on k8s it doesn't work.
Dockerfile (I created my own to build and push to my GitLab registry):
FROM metabase/metabase:v0.41.6
EXPOSE 3000
CMD ["/app/run_metabase.sh" ]
my deployment.yaml
# apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metaba-dev
spec:
  selector:
    matchLabels:
      app: metaba-dev
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
      maxSurge: 100%
  template:
    metadata:
      labels:
        app: metaba-dev
    spec:
      restartPolicy: Always
      imagePullSecrets:
      - name: gitlab-login
      containers:
      - name: metaba-dev
        image: registry.gitlab.com/team/metabase:dev-{{BUILD_NUMBER}}
        command: ["/app/run_metabase.sh"]
        livenessProbe:
          httpGet:
            path: /api/health
            port: 3000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /api/health
            port: 3000
          initialDelaySeconds: 60
          periodSeconds: 10
        imagePullPolicy: Always
        ports:
        - name: metaba-dev-port
          containerPort: 3000
      terminationGracePeriodSeconds: 90
I got this error from
kubectl describe pod metaba-dev
Warning Unhealthy 61s (x3 over 81s) kubelet Readiness probe failed: Get "http://10.207.128.197:3000/api/health": dial tcp 10.207.128.197:3000: connect: connection refused
Warning Unhealthy 61s (x3 over 81s) kubelet Liveness probe failed: Get "http://10.207.128.197:3000/api/health": dial tcp 10.207.128.197:3000: connect: connection refused
kubectl logs
Picked up JAVA_TOOL_OPTIONS: -Xmx1g -Xms1g -Xmx1g
Warning: environ value jdk-11.0.13+8 for key :java-version has been overwritten with 11.0.13
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-01-28 15:32:23,966 INFO metabase.util :: Maximum memory available to JVM: 989.9 MB
2022-01-28 15:33:09,703 INFO util.encryption :: Saved credentials encryption is ENABLED for this Metabase instance. 🔐
For more information, see https://metabase.com/docs/latest/operations-guide/encrypting-database-details-at-rest.html
Here is the solution:
I increased initialDelaySeconds to 1200 and checked the logs; the cause was a network issue where my pod could not connect to the database. I hadn't seen that cause earlier because the container had already been restarted and I was looking at a fresh log.
Try changing your initialDelaySeconds from 60 to 100.
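For illustration only, the suggested change would look like this in the deployment.yaml (same /api/health endpoint and port as in the question):

readinessProbe:
  httpGet:
    path: /api/health
    port: 3000
  initialDelaySeconds: 100
  periodSeconds: 10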
You should also always set resource requests and limits on your container to avoid probe failures, because when your app starts hitting its resource limits, Kubernetes starts throttling your container.
containers:
- name: app
  image: images.my-company.example/app:v4
  resources:
    requests:
      memory: "128Mi"
      cpu: "250m"
    limits:
      memory: "256Mi"   # limits must be >= requests; the original 100Mi would be rejected by the API server
      cpu: "500m"
I have 2 pods running, each with CPU: 0.2 cores and memory: 1 Gi.
My node has a limit of 0.4 cores and 2 Gi. I can't increase the node limits.
For zero downtime I have the following config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: abc-deployment
spec:
  selector:
    matchLabels:
      app: abc
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: abc
        collect_logs_with_filebeat: "true"
      annotations:
        sidecar.istio.io/rewriteAppHTTPProbers: "false"
    spec:
      containers:
      - name: abc
        image: abc-repository:latest
        ports:
        - containerPort: 8087
        readinessProbe:
          httpGet:
            path: /healthcheck
            port: 8087
          initialDelaySeconds: 540
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 20
          successThreshold: 1
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 0.2
            memory: 1000Mi
          requests:
            cpu: 0.2
            memory: 1000Mi
On a new build deployment, two new pods get created on a new node, say node2, because node1 doesn't have enough memory and CPU to accommodate the new pods. Once the new containers on node2 are in the Running state, the old pods (running on node1) get destroyed, and node1 then has some free CPU and memory.
The issue I am facing is that, since node1 now has free memory and CPU, Kubernetes destroys the newly created pods (running on node2) and then creates pods on node1 and starts the app containers there, which causes downtime.
So basically, in my case, even with a RollingUpdate strategy and a health check endpoint, I am not able to achieve zero downtime.
Please help here!
You could look at the concept of a Pod Disruption Budget, which is mostly used for achieving zero downtime for an application.
You could also read a related answer of mine which shows an example of how to achieve zero downtime for an application using PDBs.
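As a minimal sketch for the abc deployment above (the name abc-pdb is illustrative, minAvailable should be tuned to your replica count, and clusters older than 1.21 use apiVersion: policy/v1beta1):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: abc-pdb
spec:
  minAvailable: 1   # keep at least one abc pod available during voluntary disruptions
  selector:
    matchLabels:
      app: abc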
I have deployed InfluxDB 2.0.0 as a StatefulSet with EBS volume persistence. I've noticed that if, for some reason, the pod gets rescheduled to another node, or even if we scale the StatefulSet down to replicas = 0 and then scale back up, the effect on the persisted data is the same: it is lost.
Initially, in the case of the pod being rescheduled to another node, I thought the problem was with the EBS volume not being unmounted and then mounted to the node where the new pod replica was running, but that is NOT the case. The EBS volume is present and the same PV/PVC exists, but the data is lost.
To figure out what might be the problem, I purposely set up InfluxDB, added data, and then did this:
kubectl scale statefulsets influxdb --replicas=0
...
kubectl scale statefulsets influxdb --replicas=1
The effect was the same as when the InfluxDB pod got rescheduled: the data was lost.
Any specific reason why would something like that happen?
My environment:
I'm using an EKS environment with Kubernetes 1.15 on the control plane and workers.
Fortunately, the problem was due to the big changes between InfluxDB 1.x and the 2.0.0 beta in terms of where the actual data is persisted.
In the 1.x version, data was persisted in:
/var/lib/influxdb
while in the 2.x version, data is persisted, by default, in:
/root/.influxdbv2
My EBS volume was mounted at the 1.x location, so with every restart of the pod (whether caused by scaling down or by rescheduling to another node), the EBS volume was attached correctly but at the wrong location. That is why there was no data.
Also, one difference I see is that configuration parameters cannot be provided to the 2.x version via a configuration file (as in 1.x, where I had the configuration file mounted into the container as a ConfigMap). Additional configuration parameters have to be provided inline. This link explains how: https://v2.docs.influxdata.com/v2.0/reference/config-options/
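For example, a sketch of passing the data paths as environment variables in the container spec (variable names taken from the linked config-options page; please verify them against your InfluxDB version):

env:
- name: INFLUXD_BOLT_PATH
  value: /root/.influxdbv2/influxd.bolt
- name: INFLUXD_ENGINE_PATH
  value: /root/.influxdbv2/engine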
In the end, this is the working version of the StatefulSet:
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: influxdb
  name: influxdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  serviceName: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      containers:
      - image: quay.io/influxdb/influxdb:2.0.0-beta
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: api
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: influxdb
        ports:
        - containerPort: 9999
          name: api
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: api
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "800m"
            memory: 1200Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - mountPath: /root/.influxdbv2
          name: influxdb-data
  volumeClaimTemplates:
  - metadata:
      name: influxdb-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
I have a Kubernetes deployment on GCP and a ClusterIP service to discover pods in this deployment. The deployment contains multiple replica pods which come and go based on our horizontal pod autoscaler configuration (based on CPU utilization).
Now, when a new replica pod is created, it takes some time for the application to start serving requests. But the ClusterIP Service already starts distributing requests to the new pod before the application is ready, which causes those requests to go unserviced.
ClusterIP service yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: service-name
    tier: backend
    environment: "dev"
    creator: internal
  name: service-name
spec:
  clusterIP: None
  ports:
  - name: https
    protocol: TCP
    port: 7070
    targetPort: 7070
  selector:
    app: dep-name
    tier: "backend"
    environment: "dev"
    creator: "ME"
  type: ClusterIP
How can the ClusterIP Service be told to start distributing requests to the new pod only after the application has started? Can an initial delay or a probe be set for this purpose?
Kubernetes provides readiness probes for this. With readiness probes, Kubernetes will not send traffic to a pod until the probe succeeds. When updating a deployment, it will also leave the old replica(s) running until probes have succeeded on the new replica. That means that if your new pods are broken in some way, they'll never see traffic; your old pods will continue to serve all traffic for the deployment.
You need to update the deployment file with the following readiness probe:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
If your application exposes an HTTP health endpoint, you can set the readiness probe in HTTP mode as well.
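A minimal sketch, assuming the application serves a health check at /healthz on the container port 7070 used as targetPort in the Service above (the path is hypothetical; adjust it to whatever your application actually serves):

readinessProbe:
  httpGet:
    path: /healthz
    port: 7070
  initialDelaySeconds: 5
  periodSeconds: 5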
For more information on how to use readiness probes, refer to:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes
You should have a readiness probe as defined in the documentation at
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes.
As described in the documentation, you should be able to configure it using initialDelaySeconds and periodSeconds.
Your current behavior is probably because the Service sees that all containers in the pod have started. You can define your readiness checks like the example below, taken from the documentation.
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: k8s.gcr.io/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
I am experimenting with GKE cluster upgrades in a 6-node test cluster (two node pools) before I try it on our staging or production cluster. Upgrading when I only had a 12-replica nginx deployment, the nginx ingress controller and cert-manager (as helm charts) installed took 10 minutes per node pool (3 nodes). I was very satisfied. I decided to try again with something that looks more like our setup. I removed the nginx deployment and added 2 node.js deployments plus the following helm charts: mongodb-0.4.27, mcrouter-0.1.0 (as a statefulset), redis-ha-2.0.0, and my own www-redirect-0.0.1 chart (a simple nginx which does a redirect). The problem seems to be with mcrouter. Once the node starts draining, the status of that node changes to Ready,SchedulingDisabled (which seems normal) but the following pods remain:
mcrouter-memcached-0
fluentd-gcp-v2.0.9-4f87t
kube-proxy-gke-test-upgrade-cluster-default-pool-74f8edac-wblf
I do not know why those two kube-system pods remain, but the mcrouter one is mine and it won't go away quickly enough. If I wait long enough (1 hour+) it eventually works, and I am not sure why. The current node pool (of 3 nodes) started upgrading 2 hours 46 minutes ago and 2 nodes are upgraded; the 3rd one is still upgrading but nothing is moving... I presume it will complete in the next 1-2 hours...
I tried to run the drain command with --ignore-daemonsets --force but it told me it was already drained.
I tried to delete the pods, but they just come back and the upgrade does not move any faster.
Any thoughts?
Update #1
The mcrouter helm chart was installed like this:
helm install stable/mcrouter --name mcrouter --set controller=statefulset
The StatefulSet it created for the mcrouter part is:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-mcrouter
    chart: mcrouter-0.1.0
    heritage: Tiller
    release: mcrouter
  name: mcrouter-mcrouter
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-mcrouter
      chart: mcrouter-0.1.0
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-mcrouter
  template:
    metadata:
      labels:
        app: mcrouter-mcrouter
        chart: mcrouter-0.1.0
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-mcrouter
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -p 5000
        - --config-file=/etc/mcrouter/config.json
        command:
        - mcrouter
        image: jphalip/mcrouter:0.36.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 5
        name: mcrouter-mcrouter
        ports:
        - containerPort: 5000
          name: mcrouter-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 256m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mcrouter
          name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: mcrouter-mcrouter
        name: config
  updateStrategy:
    type: OnDelete
and here is the memcached statefulset:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-memcached
    chart: memcached-1.2.1
    heritage: Tiller
    release: mcrouter
  name: mcrouter-memcached
spec:
  podManagementPolicy: OrderedReady
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-memcached
      chart: memcached-1.2.1
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-memcached
  template:
    metadata:
      labels:
        app: mcrouter-memcached
        chart: memcached-1.2.1
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-memcached
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - command:
        - memcached
        - -m 64
        - -o
        - modern
        - -v
        image: memcached:1.4.36-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 5
        name: mcrouter-memcached
        ports:
        - containerPort: 11211
          name: memcache
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: OnDelete
status:
  replicas: 0
This is a somewhat complex question and I am definitely not sure it works exactly the way I think it does, but... let's try to understand what is happening.
You have an upgrade process and 6 nodes in the cluster. The system upgrades the nodes one by one, using drain to remove all workloads from each node.
The drain process itself respects your settings and the number of replicas; the desired state of the workload has a higher priority than the drain of the node itself.
During the drain process, Kubernetes will try to schedule all your workloads onto other nodes where scheduling is available. Scheduling on a node which the system wants to drain is disabled; you can see it in its state: Ready,SchedulingDisabled.
So, the Kubernetes scheduler is trying to find the right place for your workload on all available nodes. It will wait as long as it needs to in order to place everything described in your cluster configuration.
Now the most important thing: you set replicas: 5 for your mcrouter-memcached. Because of the podAntiAffinity rule it cannot run more than one replica per node, and the node it runs on must have enough resources for it, which is calculated from the resources: block of the pod spec.
So I think your cluster simply does not have enough resources to run a new replica of mcrouter-memcached on the remaining 5 nodes. For example, on the last node where a replica is still not running, there is not enough memory because of other workloads.
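To verify this (not part of the original answer, just a way to check), you can inspect how much CPU and memory is already requested on each node, and the current usage if metrics are available:

# show allocatable capacity and what is already requested on a node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# show current usage (requires metrics-server, or Heapster on older clusters)
kubectl top nodes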
I think that if you set the replicas for mcrouter-memcached to 4, it will solve the problem. Or you can try to use slightly more powerful instances for that workload, or add one more node to the cluster; that should also help.
I hope I have explained my logic well enough; ask me if something is not clear to you. But first, please try to solve the issue with the provided solution. :)
The problem was a combination of the minAvailable value from a PodDisruptionBudget (which was part of the memcached helm chart, a dependency of the mcrouter helm chart) and the replicas value for the memcached StatefulSet. Both were set to 5, so none of the pods could be deleted during the drain. I tried changing minAvailable to 4, but PDBs were immutable at that time. What I did was remove the helm chart and replace it.
helm delete --purge myproxy
helm install ./charts/mcrouter-0.1.0-croy.1.tgz --name myproxy --set controller=statefulset --set memcached.replicaCount=5 --set memcached.pdbMinAvailable=4
Once that was done, I was able to get the cluster to upgrade normally.
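For reference, the blocking combination looked roughly like this (a sketch, not the exact chart output): a PDB whose minAvailable equals the replica count, so the drain could never voluntarily evict a memcached pod:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mcrouter-memcached
spec:
  minAvailable: 5   # equal to replicas: 5, so no pod may be evicted during a drain
  selector:
    matchLabels:
      app: mcrouter-memcached
      release: mcrouter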
What I should have done (but only thought of afterwards) was to change the replicas value to 6; that way I would not have needed to delete and replace the whole chart.
Thank you #AntonKostenko for trying to help me find this issue.
This issue also helped me.
Thanks to the folks in the Kubernetes Slack, especially to Paris, who tried to get my issue more visibility, and to the volunteers of the Kubernetes Office Hours (which happened to be yesterday, lucky me!) for also taking a look.
Finally, thank you to psycotica0 from Kubernetes Canada for also giving me some pointers.