Is it normal for bokeh serve on Kubernetes to restart periodically?

I have a Bokeh dashboard served in a Docker container, which is running on Kubernetes. I can access my dashboard remotely with no problems, but I noticed that the pod containing my bokeh serve code restarts a lot (14 times in the past 2 hours). Sometimes the status comes back as 'CrashLoopBackOff' and sometimes it is 'Running' normally.
My question is: is there something about the way bokeh serve works that requires Kubernetes to restart it so frequently? Or is it something to do with memory (OOMKilled)?
Here is a section of my describe pod:
Name: bokeh-744d4bc9d-5pkzq
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: 10.183.226.51/10.183.226.51
Start Time: Tue, 18 Feb 2020 11:55:44 +0000
Labels: name=bokeh
pod-template-hash=744d4bc9d
Annotations: kubernetes.io/psp: xyz-privileged-psp
Status: Running
IP: 172.30.255.130
Controlled By: ReplicaSet/bokeh-744d4bc9d
Containers:
dashboard-application:
Container ID: containerd://16d10dc5dd89235b0xyz2b5b31f8e313f3f0bb7efe82a12e00c1f01708e2f894
Image: us.icr.io/oss-data-science-np-dal/bokeh:118
Image ID: us.icr.io/oss-data-science-np-dal/bokeh@sha256:037a5b52a6e7c792fdxy80b01e29772dbfc33b10e819774462bee650cf0da
Port: 5006/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 18 Feb 2020 14:25:36 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 18 Feb 2020 14:15:26 +0000
Finished: Tue, 18 Feb 2020 14:23:54 +0000
Ready: True
Restart Count: 17
Limits:
cpu: 800m
memory: 600Mi
Requests:
cpu: 600m
memory: 400Mi
Liveness: http-get http://:5006/ delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:5006/ delay=10s timeout=1s period=3s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cjhfk (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-cjhfk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cjhfk
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 600s
node.kubernetes.io/unreachable:NoExecute for 600s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 36m (x219 over 150m) kubelet, 10.183.226.51 Liveness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 21m (x34 over 134m) kubelet, 10.183.226.51 Back-off restarting failed container
Warning Unhealthy 10m (x72 over 150m) kubelet, 10.183.226.51 Readiness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 6m4s (x957 over 150m) kubelet, 10.183.226.51 Readiness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 68s (x23 over 147m) kubelet, 10.183.226.51 Liveness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I'm new to k8s, so any information you have to spare on this kind of issue will be much appreciated!

If a Container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure. This is documented here.
You may have to increase limits and requests in your pod spec. Check the official doc here.
Another way to look at it is to optimize your code so that it does not exceed the memory specified in limits.
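If raising the limit is the way to go, here is a minimal sketch of the container's resources block in the pod spec (the new memory value is purely illustrative, not a recommendation; tune it to what the dashboard actually needs):
resources:
  requests:
    cpu: 600m
    memory: 400Mi
  limits:
    cpu: 800m
    memory: 1Gi    # raised from the 600Mi shown in the describe output, since the container was OOMKilled at that limit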

OOMKill means your pod is consuming too much RAM and was killed in order to avoid disruption of the other workload running on the node.
You can either edit your code to use less RAM if feasible, or increase limits.memory.
You generally want to have requests = limits, except if your pod runs something heavy at startup and then does nothing afterwards; a sketch of this pattern follows below.
You may want to take a look at the official documentation.
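A minimal sketch of the requests = limits pattern mentioned above (the values are illustrative; setting them equal gives the pod the Guaranteed QoS class instead of Burstable):
resources:
  requests:
    cpu: 800m
    memory: 1Gi
  limits:
    cpu: 800m     # equal to the request
    memory: 1Gi   # equal to the request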

Related

Kubernetes CoreDNS in CrashLoopBackOff on worker node

I've searched for CoreDNS in CrashLoopBackOff, but nothing has helped me.
My Setup
k8s - v1.20.2
CoreDNS - 1.7.0
Installed by Kubespray, following https://kubernetes.io/ko/docs/setup/production-environment/tools/kubespray
My Problem
The CoreDNS pods on the master node are in a Running state,
but on the worker node the CoreDNS pods are in a CrashLoopBackOff state.
kubectl logs -f coredns-847f564ccf-msbvp -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 5b233a0166923d642fdbca0794b712ab
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
The CoreDNS container runs the command "/coredns -conf /etc/resolv.conf" for a while, and then it is destroyed.
Here is the Corefile:
Corefile: |
  .:53 {
      errors
      health {
          lameduck 5s
      }
      ready
      kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
      }
      prometheus :9153
      forward . /etc/resolv.conf {
          prefer_udp
      }
      cache 30
      loop
      reload
      loadbalance
  }
And here are the events from one of the crashed pods:
kubectl get event --namespace kube-system --field-selector involvedObject.name=coredns-847f564ccf-lqnxs
LAST SEEN TYPE REASON OBJECT MESSAGE
4m55s Warning Unhealthy pod/coredns-847f564ccf-lqnxs Liveness probe failed: Get "http://10.216.50.2:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
9m59s Warning BackOff pod/coredns-847f564ccf-lqnxs Back-off restarting failed container
And here is the CoreDNS pod description:
Containers:
coredns:
Container ID: docker://a174cb3a3800181d1c7b78831bfd37bbf69caf60a82051d6fb29b4b9deeacce9
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Wed, 21 Apr 2021 21:51:44 +0900
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 21 Apr 2021 21:44:42 +0900
Finished: Wed, 21 Apr 2021 21:46:32 +0900
Ready: False
Restart Count: 9943
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=0s timeout=5s period=10s #success=1 #failure=10
Readiness: http-get http://:8181/ready delay=0s timeout=5s period=10s #success=1 #failure=10
Environment: <none>
Mounts:
/etc/coredns from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qqhn6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 18m (x9940 over 30d) kubelet Container image "k8s.gcr.io/coredns:1.7.0" already present on machine
Warning Unhealthy 8m37s (x99113 over 30d) kubelet Liveness probe failed: Get "http://10.216.50.2:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning BackOff 3m35s (x121901 over 30d) kubelet Back-off restarting failed container
At this point, any suggestion at all will be helpful.
I found something weird.
Testing from node1, I can access the CoreDNS pod on node2, but I cannot access the CoreDNS pod on node1.
I use Calico for CNI.
On node1: coredns1 - 1.1.1.1
On node2: coredns2 - 2.2.2.2
From node1:
access 1.1.1.1:8080/health -> timeout
access 2.2.2.2:8080/health -> ok
From node2:
access 1.1.1.1:8080/health -> ok
access 2.2.2.2:8080/health -> timeout
If containerd and kubelet are running behind a proxy, add the private IP range 10.0.0.0/8 to the NO_PROXY configuration so that they can pull images and so that in-cluster traffic (such as the probes above) is not routed through the proxy.
E.g.:
[root@dev-systrdemo301z phananhtuan01]# cat /etc/systemd/system/containerd.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=dev-proxy.prod.xx.local:8300"
Environment="HTTPS_PROXY=dev-proxy.prod.xx.local:8300"
Environment="NO_PROXY=localhost,127.0.0.0/8,100.67.253.157/24,10.0.0.0/8"
Please refer to this article.
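After editing that drop-in file, the change only takes effect once systemd reloads it and the services restart. A minimal sketch, assuming containerd and kubelet are both managed by systemd:
sudo systemctl daemon-reload        # pick up the edited proxy.conf drop-in
sudo systemctl restart containerd
sudo systemctl restart kubelet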

kubernetes ingress-controller CrashLoopBackOff Error

I've set up a Kubernetes (1.17.11) cluster (Azure), and I've installed nginx-ingress-controller via
helm install nginx-ingress --namespace z1 stable/nginx-ingress --set controller.publishService.enabled=true
The setup seems to be OK and it's working, but every now and then it fails. When I check the running pods (kubectl get pod -n z1), I see a number of restarts for the ingress-controller pod.
I thought maybe there was a heavy load, so it would be better to increase the replicas, and I ran helm upgrade --namespace z1 stable/ingress --set controller.replicasCount=3, but still only one of the pods (out of 3) seems to be in use, and one sometimes fails with CrashLoopBackOff (not constantly).
One thing worth mentioning: the installed nginx-ingress version is 0.34.1, but 0.41.2 is also available. Do you think the upgrade will help, and how can I upgrade the installed version to the new one? (AFAIK helm upgrade won't replace the chart with a newer version, but I may be wrong.)
Any idea?
kubectl describe pod result:
Name: nginx-ingress-controller-58467bccf7-jhzlx
Namespace: z1
Priority: 0
Node: aks-agentpool-41415378-vmss000000/10.240.0.4
Start Time: Thu, 19 Nov 2020 09:01:30 +0100
Labels: app=nginx-ingress
app.kubernetes.io/component=controller
component=controller
pod-template-hash=58467bccf7
release=nginx-ingress
Annotations: <none>
Status: Running
IP: 10.244.1.18
IPs:
IP: 10.244.1.18
Controlled By: ReplicaSet/nginx-ingress-controller-58467bccf7
Containers:
nginx-ingress-controller:
Container ID: docker://719655d41c1c8cdb8c9e88c21adad7643a44d17acbb11075a1a60beb7553e2cf
Image: us.gcr.io/k8s-artifacts-prod/ingress-nginx/controller:v0.34.1
Image ID: docker-pullable://us.gcr.io/k8s-artifacts-prod/ingress-nginx/controller@sha256:0e072dddd1f7f8fc8909a2ca6f65e76c5f0d2fcfb8be47935ae3457e8bbceb20
Ports: 80/TCP, 443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
/nginx-ingress-controller
--default-backend-service=z1/nginx-ingress-default-backend
--election-id=ingress-controller-leader
--ingress-class=nginx
--configmap=z1/nginx-ingress-controller
State: Running
Started: Thu, 19 Nov 2020 09:54:07 +0100
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Thu, 19 Nov 2020 09:50:41 +0100
Finished: Thu, 19 Nov 2020 09:51:12 +0100
Ready: True
Restart Count: 8
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-controller-58467bccf7-jhzlx (v1:metadata.name)
POD_NAMESPACE: z1 (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-token-7rmtk (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nginx-ingress-token-7rmtk:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-token-7rmtk
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned z1/nginx-ingress-controller-58467bccf7-jhzlx to aks-agentpool-41415378-vmss000000
Normal Killing 58m kubelet, aks-agentpool-41415378-vmss000000 Container nginx-ingress-controller failed liveness probe, will be restarted
Warning Unhealthy 57m (x4 over 58m) kubelet, aks-agentpool-41415378-vmss000000 Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 57m kubelet, aks-agentpool-41415378-vmss000000 Readiness probe failed: Get http://10.244.1.18:10254/healthz: read tcp 10.244.1.1:54126->10.244.1.18:10254: read: connection reset by peer
Normal Pulled 57m (x2 over 59m) kubelet, aks-agentpool-41415378-vmss000000 Container image "us.gcr.io/k8s-artifacts-prod/ingress-nginx/controller:v0.34.1" already present on machine
Normal Created 57m (x2 over 59m) kubelet, aks-agentpool-41415378-vmss000000 Created container nginx-ingress-controller
Normal Started 57m (x2 over 59m) kubelet, aks-agentpool-41415378-vmss000000 Started container nginx-ingress-controller
Warning Unhealthy 57m kubelet, aks-agentpool-41415378-vmss000000 Liveness probe failed: Get http://10.244.1.18:10254/healthz: dial tcp 10.244.1.18:10254: connect: connection refused
Warning Unhealthy 56m kubelet, aks-agentpool-41415378-vmss000000 Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 23m (x10 over 58m) kubelet, aks-agentpool-41415378-vmss000000 Liveness probe failed: Get http://10.244.1.18:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 14m (x6 over 57m) kubelet, aks-agentpool-41415378-vmss000000 Readiness probe failed: Get http://10.244.1.18:10254/healthz: dial tcp 10.244.1.18:10254: connect: connection refused
Warning BackOff 9m28s (x12 over 12m) kubelet, aks-agentpool-41415378-vmss000000 Back-off restarting failed container
Warning Unhealthy 3m51s (x24 over 58m) kubelet, aks-agentpool-41415378-vmss000000 Readiness probe failed: Get http://10.244.1.18:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Some logs from the controller
NGINX Ingress controller
Release: v0.34.1
Build: v20200715-ingress-nginx-2.11.0-8-gda5fa45e2
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.1
-------------------------------------------------------------------------------
I1119 08:54:07.267185 6 main.go:275] Running in Kubernetes cluster version v1.17 (v1.17.11) - git (clean) commit 3a3612132641768edd7f7e73d07772225817f630 - platform linux/amd64
I1119 08:54:07.276120 6 main.go:87] Validated z1/nginx-ingress-default-backend as the default backend.
I1119 08:54:07.430459 6 main.go:105] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W1119 08:54:07.497816 6 store.go:659] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I1119 08:54:07.617458 6 nginx.go:263] Starting NGINX Ingress controller
I1119 08:54:08.748938 6 backend_ssl.go:66] Adding Secret "z1/z1-tls-secret" to the local store
I1119 08:54:08.801385 6 event.go:278] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"z2", Name:"zalenium", UID:"8d395a18-811b-4852-8dd5-3cdd682e2e6e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"13667218", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress z2/zalenium
I1119 08:54:08.801908 6 backend_ssl.go:66] Adding Secret "z2/z2-tls-secret" to the local store
I1119 08:54:08.802837 6 event.go:278] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"z1", Name:"zalenium", UID:"244ae6f5-897e-432e-8ec3-fd142f0255dc", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"13667219", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress z1/zalenium
I1119 08:54:08.839946 6 nginx.go:307] Starting NGINX process
I1119 08:54:08.840375 6 leaderelection.go:242] attempting to acquire leader lease z1/ingress-controller-leader-nginx...
I1119 08:54:08.845041 6 controller.go:141] Configuration changes detected, backend reload required.
I1119 08:54:08.919965 6 status.go:86] new leader elected: nginx-ingress-controller-58467bccf7-5thwb
I1119 08:54:09.084800 6 controller.go:157] Backend successfully reloaded.
I1119 08:54:09.096999 6 controller.go:166] Initial sync, sleeping for 1 second.
As the OP confirmed in the comment section, I am posting the solution for this issue:
"Yes I tried and I replaced the deprecated version with the latest version, it completely solved the nginx issue."
In this setup the OP used the Helm chart from the stable repository. On the GitHub page dedicated to stable/nginx-ingress there is a notice that this specific chart is DEPRECATED. It was updated 12 days ago, so this is a fresh change.
This chart is deprecated as we have moved to the upstream repo ingress-nginx The chart source can be found here: https://github.com/kubernetes/ingress-nginx/tree/master/charts/ingress-nginx
The Helm option in the NGINX Ingress Controller deployment guide already uses the new repository.
To list the repositories currently configured for Helm, use the command $ helm repo list.
$ helm repo list
NAME URL
stable https://kubernetes-charts.storage.googleapis.com
ingress-nginx https://kubernetes.github.io/ingress-nginx
If you don't have the new ingress-nginx repository, you have to:
Add the new repository:
$ helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
Update it:
$ helm repo update
Deploy Nginx Ingress Controller:
$ helm install my-release ingress-nginx/ingress-nginx
Disclaimer!
The above commands are specific to Helm v3.
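A hedged sketch of what the migration could look like for the release in the question (the release name, namespace, and publishService flag are taken from the question; any other custom values from the old release would need to be carried over explicitly):
$ helm uninstall nginx-ingress --namespace z1
$ helm install nginx-ingress ingress-nginx/ingress-nginx --namespace z1 --set controller.publishService.enabled=true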

GCP AI Platform - Pipelines - Clusters - Does not have minimum availability

I can't create pipelines. I can't even load the samples / tutorials on the AI Platform Pipelines Dashboard because it doesn't seem to be able to proxy to whatever it needs to.
An error occurred
Error occured while trying to proxy to: ...
I looked into the cluster's details and found 3 components with errors:
Deployment metadata-grpc-deployment Does not have minimum availability
Deployment ml-pipeline Does not have minimum availability
Deployment ml-pipeline-persistenceagent Does not have minimum availability
Creating the clusters involves approximately 3 clicks in GCP Kubernetes Engine, so I don't think I messed up this step.
Anyone have an idea of how to achieve "minimum availability"?
UPDATE 1
Nodes have adequate resources and are Ready.
YAML file looks good.
I have 2 clusters in different regions/zones and both have the deployment errors listed above.
2 Pods are not ok.
Name: ml-pipeline-65479485c8-mcj9x
Namespace: default
Priority: 0
Node: gke-cluster-3-default-pool-007784cb-qcsn/10.150.0.2
Start Time: Thu, 17 Sep 2020 22:15:19 +0000
Labels: app=ml-pipeline
app.kubernetes.io/name=kubeflow-pipelines-3
pod-template-hash=65479485c8
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ml-pipeline-api-server
Status: Running
IP: 10.4.0.8
IPs:
IP: 10.4.0.8
Controlled By: ReplicaSet/ml-pipeline-65479485c8
Containers:
ml-pipeline-api-server:
Container ID: ...
Image: ...
Image ID: ...
Ports: 8888/TCP, 8887/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Fri, 18 Sep 2020 10:27:31 +0000
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 18 Sep 2020 10:20:38 +0000
Finished: Fri, 18 Sep 2020 10:27:31 +0000
Ready: False
Restart Count: 98
Requests:
cpu: 100m
Liveness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Environment:
HAS_DEFAULT_BUCKET: true
BUCKET_NAME:
PROJECT_ID: <set to the key 'project_id' of config map 'gcp-default-config'> Optional: false
POD_NAMESPACE: default (v1:metadata.namespace)
DEFAULTPIPELINERUNNERSERVICEACCOUNT: pipeline-runner
OBJECTSTORECONFIG_SECURE: false
OBJECTSTORECONFIG_BUCKETNAME:
DBCONFIG_DBNAME: kubeflow_pipelines_3_pipeline
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-credential'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-credential'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-77xl8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ml-pipeline-token-77xl8:
Type: Secret (a volume populated by a Secret)
SecretName: ml-pipeline-token-77xl8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 52m (x409 over 11h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
Warning Unhealthy 31m (x94 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Readiness probe failed:
Warning Unhealthy 31m (x29 over 10h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn (combined from similar events): Readiness probe failed: cannot exec in a stopped state: unknown
Warning Unhealthy 17m (x95 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe failed:
Normal Pulled 7m26s (x97 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Container image "gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/apiserver:1.0.0" already present on machine
Warning Unhealthy 75s (x78 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
And the other pod:
Name: ml-pipeline-persistenceagent-67db8b8964-mlbmv
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 32s (x2238 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
SOLUTION
Do not let Google handle any storage. Uncheck "Use managed storage" and set up your own artifact collections manually. You don't actually need to enter anything in these fields, since the pipeline will be launched anyway.
The Does not have minimum availability error is generic. There could be many issues that trigger it, and you need to analyse it in more depth to find the actual problem. Here are some possible causes:
Insufficient resources: check if your node has adequate resources (CPU/memory). If the node is OK, then check the pod's status.
Liveness probe and/or readiness probe failure: execute kubectl describe pod <pod-name> to check if they failed and why (see the commands sketched after this list).
Deployment misconfiguration: review your deployment YAML file to see if there are any errors or leftovers from previous configurations.
You can also try waiting a bit, as sometimes it takes a while to deploy everything, and/or try changing your region/zone.
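A minimal sketch of the triage commands implied by the list above (<node-name>, <pod-name>, and <namespace> are placeholders; kubectl top requires metrics-server to be installed in the cluster):
$ kubectl top nodes                                   # quick view of CPU/memory pressure per node
$ kubectl describe node <node-name>                   # compare Allocatable with Allocated resources
$ kubectl describe pod <pod-name> -n <namespace>      # probe failures and restarts show up under Events
$ kubectl logs <pod-name> -n <namespace> --previous   # logs from the last crashed container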

Readiness probe failed: timeout: failed to connect service ":8080" within 1s

I am trying to build and deploy microservices images to a single-node Kubernetes cluster running on my development machine using minikube. I am using the cloud-native microservices demo application Online Boutique by Google to understand the use of technologies like Kubernetes, Istio etc.
Link to github repo: microservices-demo
I have followed the entire installation process to locally build and deploy the microservices, and I am able to access the web frontend through my browser. However, when I click on any of the product images, for example, I see this error page.
HTTP Status: 500 Internal Server Error
On doing a check using kubectl get pods,
I realize that one of my pods (the recommendation service) has status CrashLoopBackOff.
Running kubectl describe pods recommendationservice-55b4d6c477-kxv8r:
Namespace: default
Priority: 0
Node: minikube/192.168.99.116
Start Time: Thu, 23 Jul 2020 19:58:38 +0530
Labels: app=recommendationservice
app.kubernetes.io/managed-by=skaffold-v1.11.0
pod-template-hash=55b4d6c477
skaffold.dev/builder=local
skaffold.dev/cleanup=true
skaffold.dev/deployer=kubectl
skaffold.dev/docker-api-version=1.40
skaffold.dev/run-id=49913ced-e8df-40a7-9336-a227b56bcb5f
skaffold.dev/tag-policy=git-commit
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/recommendationservice-55b4d6c477
Containers:
server:
Container ID: docker://2d92aa966a82fbe58c8f40f6ecf9d6d55c29f8081cb40e0423a2397e1419350f
Image: recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Image ID: docker://sha256:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 23 Jul 2020 21:09:33 +0530
Finished: Thu, 23 Jul 2020 21:09:53 +0530
Ready: False
Restart Count: 29
Limits:
cpu: 200m
memory: 450Mi
Requests:
cpu: 100m
memory: 220Mi
Liveness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Readiness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Environment:
PORT: 8080
PRODUCT_CATALOG_SERVICE_ADDR: productcatalogservice:3550
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sbpcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-sbpcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sbpcx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 44m (x15 over 74m) kubelet, minikube Container image "recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb" already present on machine
Warning Unhealthy 9m33s (x99 over 74m) kubelet, minikube Readiness probe failed: timeout: failed to connect service ":8080" within 1s
Warning BackOff 4m25s (x294 over 72m) kubelet, minikube Back-off restarting failed container
In Events, I see Readiness probe failed: timeout: failed to connect service ":8080" within 1s.
What is the reason and how can I resolve this?
Thanks for the help!
Answer
The timeout of the Readiness Probe (1 second) was too short.
More Info
The relevant Readiness Probe is defined such that /bin/grpc_health_probe -addr=:8080 is run inside the server container.
You would expect a 1 second timeout to be sufficient for such a probe, but since this is running on Minikube, that could be what is pushing the probe past its timeout.
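A minimal sketch of what relaxing the probe could look like in the recommendationservice Deployment manifest (field names follow the standard Kubernetes probe spec; the 5 second timeout is an illustrative value, not something prescribed by the demo):
readinessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]
  periodSeconds: 5
  timeoutSeconds: 5   # raised from the 1s that was timing out on Minikube
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]
  periodSeconds: 5
  timeoutSeconds: 5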

Kafka Pod doesn't start on GKE

I followed this tutorial, and when I tried to run it on GKE I was not able to start the Kafka pod.
It returns CrashLoopBackOff all the time, and I don't know how to show the pod's error logs.
Here is the result when I run kubectl describe pod my-pod-xxx:
Name: kafka-broker1-54cb95fb44-hlj5b
Namespace: default
Node: gke-xxx-default-pool-f9e313ed-zgcx/10.146.0.4
Start Time: Thu, 25 Oct 2018 11:40:21 +0900
Labels: app=kafka
id=1
pod-template-hash=1076519600
Annotations: kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container kafka
Status: Running
IP: 10.48.8.10
Controlled By: ReplicaSet/kafka-broker1-54cb95fb44
Containers:
kafka:
Container ID: docker://88ee6a1df4157732fc32b7bd8a81e329dbdxxxx9cbe614689e775d183dbcd61
Image: wurstmeister/kafka
Image ID: docker-pullable://wurstmeister/kafka@sha256:4f600a95fa1288f7b1xxxxxa32ca00b4fb13b83b31533fa6b40499bd9bdf192f
Port: 9092/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 25 Oct 2018 14:35:32 +0900
Finished: Thu, 25 Oct 2018 14:35:51 +0900
Ready: False
Restart Count: 37
Requests:
cpu: 100m
Environment:
KAFKA_ADVERTISED_PORT: 9092
KAFKA_ADVERTISED_HOST_NAME: 35.194.100.32
KAFKA_ZOOKEEPER_CONNECT: zoo1:2181
KAFKA_BROKER_ID: 1
KAFKA_CREATE_TOPICS: topic1:3:3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-w6s7n (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-w6s7n:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-w6s7n
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 5m (x716 over 2h) kubelet, gke-xxx-default-pool-f9e313ed-zgcx Back-off restarting failed container
Normal Pulling 36s (x38 over 2h) kubelet, gke-xxxdefault-pool-f9e313ed-zgcx pulling image "wurstmeister/kafka"
I noticed that the first run goes well, but after that the Node changes status to NotReady and the Kafka pod enters the CrashLoopBackOff state.
Here is the log before it goes down:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned kafka-broker1-54cb95fb44-wwf2h to gke-xxx-default-pool-f9e313ed-8mr6
Normal SuccessfulMountVolume 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 MountVolume.SetUp succeeded for volume "default-token-w6s7n"
Normal Pulling 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 pulling image "wurstmeister/kafka"
Normal Pulled 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Successfully pulled image "wurstmeister/kafka"
Normal Created 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Created container
Normal Started 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Started container
Normal NodeControllerEviction 38s node-controller Marking for deletion Pod kafka-broker1-54cb95fb44-wwf2h from Node gke-dev-centurion-default-pool-f9e313ed-8mr6
Could anyone tell me what's wrong with my pod and how I can catch the error behind the pod failure?
I just figured out that my cluster's nodes did not have enough resources.
After creating a new cluster with more memory, it works.
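A hedged sketch of checking node capacity and recreating the cluster with bigger nodes (the cluster name, zone, and machine type are placeholders; e2-standard-4 is just one example of a GKE node size with more memory):
$ kubectl describe nodes | grep -A 6 "Allocated resources"   # how much CPU/memory is already committed per node
$ gcloud container clusters create my-cluster --zone us-central1-a --num-nodes 3 --machine-type e2-standard-4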