GCP AI Platform - Pipelines - Clusters - Does not have minimum availability - kubernetes

I can't create pipelines. I can't even load the samples / tutorials on the AI Platform Pipelines Dashboard because it doesn't seem to be able to proxy to whatever it needs to.
An error occurred
Error occured while trying to proxy to: ...
I looked into the cluster's details and found 3 components with errors:
Deployment metadata-grpc-deployment Does not have minimum availability
Deployment ml-pipeline Does not have minimum availability
Deployment ml-pipeline-persistenceagent Does not have minimum availability
Creating the clusters involves approx. 3 clicks in GCP Kubernetes Engine, so I don't think I messed up this step.
Does anyone have an idea how to achieve "minimum availability"?
UPDATE 1
Nodes have adequate resources and are Ready.
YAML file looks good.
I have 2 clusters in different regions/zones and both have the deployment errors listed above.
2 Pods are not ok.
Name: ml-pipeline-65479485c8-mcj9x
Namespace: default
Priority: 0
Node: gke-cluster-3-default-pool-007784cb-qcsn/10.150.0.2
Start Time: Thu, 17 Sep 2020 22:15:19 +0000
Labels: app=ml-pipeline
app.kubernetes.io/name=kubeflow-pipelines-3
pod-template-hash=65479485c8
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ml-pipeline-api-server
Status: Running
IP: 10.4.0.8
IPs:
IP: 10.4.0.8
Controlled By: ReplicaSet/ml-pipeline-65479485c8
Containers:
ml-pipeline-api-server:
Container ID: ...
Image: ...
Image ID: ...
Ports: 8888/TCP, 8887/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Fri, 18 Sep 2020 10:27:31 +0000
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 18 Sep 2020 10:20:38 +0000
Finished: Fri, 18 Sep 2020 10:27:31 +0000
Ready: False
Restart Count: 98
Requests:
cpu: 100m
Liveness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Environment:
HAS_DEFAULT_BUCKET: true
BUCKET_NAME:
PROJECT_ID: <set to the key 'project_id' of config map 'gcp-default-config'> Optional: false
POD_NAMESPACE: default (v1:metadata.namespace)
DEFAULTPIPELINERUNNERSERVICEACCOUNT: pipeline-runner
OBJECTSTORECONFIG_SECURE: false
OBJECTSTORECONFIG_BUCKETNAME:
DBCONFIG_DBNAME: kubeflow_pipelines_3_pipeline
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-credential'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-credential'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-77xl8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ml-pipeline-token-77xl8:
Type: Secret (a volume populated by a Secret)
SecretName: ml-pipeline-token-77xl8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 52m (x409 over 11h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
Warning Unhealthy 31m (x94 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Readiness probe failed:
Warning Unhealthy 31m (x29 over 10h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn (combined from similar events): Readiness probe failed: cannot exec in a stopped state: unknown
Warning Unhealthy 17m (x95 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe failed:
Normal Pulled 7m26s (x97 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Container image "gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/apiserver:1.0.0" already present on machine
Warning Unhealthy 75s (x78 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
And the other pod:
Name: ml-pipeline-persistenceagent-67db8b8964-mlbmv
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 32s (x2238 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
SOLUTION
Do not let Google handle any storage. Uncheck "Use managed storage" and set up your own artifact collections manually. You don't actually need to enter anything in these fields since the pipeline will be launched anyway.

The "Does not have minimum availability" error is generic; many different issues can trigger it, so you need to dig deeper to find the actual problem. Here are some possible causes:
Insufficient resources: check whether your Node has adequate resources (CPU/memory). If the Node is OK, then check the Pod's status.
Liveness probe and/or Readiness probe failure: run kubectl describe pod <pod-name> to check whether they failed and why.
Deployment misconfiguration: review your deployment YAML file for errors or leftovers from previous configurations.
You can also try waiting a bit, as it sometimes takes a while to deploy everything, and/or try changing your Region/Zone.
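The kubectl describe pod check above can also be scripted. What follows is a minimal Python sketch, assuming the output of kubectl get pods -o json has already been parsed into a dictionary. The function name unready_containers and the sample data are hypothetical and only illustrate the idea. The sketch reads the status.containerStatuses fields (ready, restartCount, lastState.terminated) that Kubernetes reports for each pod.

```python
def unready_containers(pod_list):
    """List containers that are not Ready in a parsed `kubectl get pods -o json` document.

    Returns tuples of (pod, container, restarts, last_reason, last_exit_code).
    """
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("ready"):
                continue  # healthy container, nothing to report
            terminated = cs.get("lastState", {}).get("terminated", {})
            rows.append((
                pod_name,
                cs["name"],
                cs.get("restartCount", 0),
                terminated.get("reason"),
                terminated.get("exitCode"),
            ))
    return rows


# Hypothetical sample, shaped like the ml-pipeline pod described in the question.
sample_pods = {
    "items": [{
        "metadata": {"name": "ml-pipeline-65479485c8-mcj9x"},
        "status": {"containerStatuses": [{
            "name": "ml-pipeline-api-server",
            "ready": False,
            "restartCount": 98,
            "lastState": {"terminated": {"reason": "Error", "exitCode": 255}},
        }]},
    }]
}

for row in unready_containers(sample_pods):
    print(row)
```

A high restart count together with a terminated last state suggests looking at the container logs next, rather than at node capacity.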

Related

Readiness probe failed: timeout: failed to connect service ":8080" within 1s

I am trying to build and deploy microservices images to a single-node Kubernetes cluster running on my development machine using minikube. I am using the cloud-native microservices demo application Online Boutique by Google to understand the use of technologies like Kubernetes, Istio etc.
Link to github repo: microservices-demo
I have followed the whole installation process to build and deploy the microservices locally, and I am able to access the web frontend through my browser. However, when I click on any of the product images, I see this error page.
HTTP Status: 500 Internal Server Error
Checking with kubectl get pods,
I realized that one of my pods (the Recommendation service) has status CrashLoopBackOff.
Running kubectl describe pods recommendationservice-55b4d6c477-kxv8r:
Namespace: default
Priority: 0
Node: minikube/192.168.99.116
Start Time: Thu, 23 Jul 2020 19:58:38 +0530
Labels: app=recommendationservice
app.kubernetes.io/managed-by=skaffold-v1.11.0
pod-template-hash=55b4d6c477
skaffold.dev/builder=local
skaffold.dev/cleanup=true
skaffold.dev/deployer=kubectl
skaffold.dev/docker-api-version=1.40
skaffold.dev/run-id=49913ced-e8df-40a7-9336-a227b56bcb5f
skaffold.dev/tag-policy=git-commit
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/recommendationservice-55b4d6c477
Containers:
server:
Container ID: docker://2d92aa966a82fbe58c8f40f6ecf9d6d55c29f8081cb40e0423a2397e1419350f
Image: recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Image ID: docker://sha256:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 23 Jul 2020 21:09:33 +0530
Finished: Thu, 23 Jul 2020 21:09:53 +0530
Ready: False
Restart Count: 29
Limits:
cpu: 200m
memory: 450Mi
Requests:
cpu: 100m
memory: 220Mi
Liveness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Readiness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Environment:
PORT: 8080
PRODUCT_CATALOG_SERVICE_ADDR: productcatalogservice:3550
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sbpcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-sbpcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sbpcx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 44m (x15 over 74m) kubelet, minikube Container image "recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb" already present on machine
Warning Unhealthy 9m33s (x99 over 74m) kubelet, minikube Readiness probe failed: timeout: failed to connect service ":8080" within 1s
Warning BackOff 4m25s (x294 over 72m) kubelet, minikube Back-off restarting failed container
In Events, I see Readiness probe failed: timeout: failed to connect service ":8080" within 1s.
What is the reason and how can I resolve this?
Thanks for the help!
Answer
The timeout of the Readiness Probe (1 second) was too short.
More Info
The relevant Readiness Probe is defined such that /bin/grpc_health_probe -addr=:8080 is run inside the server container.
You would expect a 1-second timeout to be sufficient for such a probe, but this is running on Minikube, which could be slow enough for the probe to exceed it.
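To see which numbers the kubelet is working with, the probe summary printed by kubectl describe pod can be parsed. This is a minimal Python sketch; parse_probe is a hypothetical helper name, and the input string is copied from the pod description in the question. In the pod spec, the timeout value corresponds to the probe's timeoutSeconds field, which is the one to raise if the diagnosis above applies.

```python
import re


def parse_probe(summary):
    """Extract the timing fields from a Liveness/Readiness line of `kubectl describe pod`."""
    # Only look at the part after the exec command, if there is one,
    # so that arguments such as -addr=:8080 are never mistaken for fields.
    tail = summary.rsplit("]", 1)[-1]
    return {name: int(value) for name, value in re.findall(r"#?(\w+)=(\d+)s?", tail)}


readiness = "exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3"
probe = parse_probe(readiness)
print(probe)
print("probe timeout in seconds:", probe["timeout"])
```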

CockroachDB Cluster on Kubernetes Pods Crashing

I'm trying to install a CockroachDB Helm chart on a 2 node Kubernetes cluster using this command:
helm install my-release --set statefulset.replicas=2 stable/cockroachdb
I have already created 2 persistent volumes:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv00001 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-0 11m
pv00002 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-1 11m
I'm getting a weird error and I'm new to Kubernetes so I'm not sure what I'm doing wrong. I've tried creating a StorageClass and using it with my PVs but then the CockroachDB PVCs won't bind to them. I suspect there may be something wrong with my PV setup?
I've tried using kubectl logs but the only error I'm seeing is this:
standard_init_linux.go:211: exec user process caused "exec format error"
and the pods are crashing over and over:
NAME READY STATUS RESTARTS AGE
my-release-cockroachdb-0 0/1 Pending 0 11m
my-release-cockroachdb-1 0/1 CrashLoopBackOff 7 11m
my-release-cockroachdb-init-tfcks 0/1 CrashLoopBackOff 5 5m29s
Any idea why the pods are crashing?
Here's kubectl describe for the init pod:
Name: my-release-cockroachdb-init-tfcks
Namespace: default
Priority: 0
Node: axon/192.168.1.7
Start Time: Sat, 04 Apr 2020 00:22:19 +0100
Labels: app.kubernetes.io/component=init
app.kubernetes.io/instance=my-release
app.kubernetes.io/name=cockroachdb
controller-uid=54c7c15d-eb1c-4392-930a-d9b8e9225a45
job-name=my-release-cockroachdb-init
Annotations: <none>
Status: Running
IP: 10.44.0.1
IPs:
IP: 10.44.0.1
Controlled By: Job/my-release-cockroachdb-init
Containers:
cluster-init:
Container ID: docker://82a062c6862a9fd5047236feafe6e2654ec1f6e3064fd0513341a1e7f36eaed3
Image: cockroachdb/cockroach:v19.2.4
Image ID: docker-pullable://cockroachdb/cockroach@sha256:511b6d09d5bc42c7566477811a4e774d85d5689f8ba7a87a114b96d115b6149b
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
while true; do initOUT=$(set -x; /cockroach/cockroach init --insecure --host=my-release-cockroachdb-0.my-release-cockroachdb:26257 2>&1); initRC="$?"; echo $initOUT; [[ "$initRC" == "0" ]] && exit 0; [[ "$initOUT" == *"cluster has already been initialized"* ]] && exit 0; sleep 5; done
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 04 Apr 2020 00:28:04 +0100
Finished: Sat, 04 Apr 2020 00:28:04 +0100
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cz2sn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-cz2sn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cz2sn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/my-release-cockroachdb-init-tfcks to axon
Normal Pulled 5m9s (x5 over 6m45s) kubelet, axon Container image "cockroachdb/cockroach:v19.2.4" already present on machine
Normal Created 5m8s (x5 over 6m45s) kubelet, axon Created container cluster-init
Normal Started 5m8s (x5 over 6m44s) kubelet, axon Started container cluster-init
Warning BackOff 92s (x26 over 6m42s) kubelet, axon Back-off restarting failed container
When Pods crash, the most important things to check are their descriptions (kubectl describe) and logs.
The logs of the failed Pod show that the architecture of the cockroach image doesn't match that of the nodes.
Run kubectl get po -o wide to find the nodes where cockroach runs, then check their architecture.
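The architecture check can be sketched in Python as follows, assuming the output of kubectl get nodes -o json has been parsed into a dictionary. Each node reports its CPU architecture under status.nodeInfo.architecture. The helper names, the node names and the assumed image architecture (amd64) are hypothetical and are not taken from the question.

```python
def node_architectures(node_list):
    """Map node name -> CPU architecture from a parsed `kubectl get nodes -o json` document."""
    return {
        node["metadata"]["name"]: node["status"]["nodeInfo"]["architecture"]
        for node in node_list.get("items", [])
    }


def mismatched_nodes(node_list, image_arch):
    """Return the names of nodes whose architecture differs from the image's."""
    archs = node_architectures(node_list)
    return sorted(name for name, arch in archs.items() if arch != image_arch)


# Hypothetical two-node cluster; real values come from kubectl.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"}, "status": {"nodeInfo": {"architecture": "amd64"}}},
        {"metadata": {"name": "node-b"}, "status": {"nodeInfo": {"architecture": "arm64"}}},
    ]
}

# Assumption: the image was built for amd64 only.
print(mismatched_nodes(sample_nodes, "amd64"))
```

Any node in the result would be expected to fail with an "exec format error" when it tries to start a container from that image.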
A 2-node CockroachDB cluster is an anti-pattern. You need 3 or more nodes to avoid data or cluster-wide unavailability when a single node fails. Consider checking out these videos explaining how data in CockroachDB is organized and then how the nodes in a cluster work together to keep data available in the face of node failure.
Only with 3 nodes (or more) do you avoid the risk of losing data if any of the nodes gets corrupted. Apart from that, it's easier to explain how to do it right than to find out what went wrong, and finding out what went wrong means going through the logs.
If you attach the log, I can take a look.
I also wrote a detailed guide that may address the "doing it right" part of my answer. I elaborated even more about the entire process here.

Is it normal for bokeh serve on Kubernetes to restart periodically?

I have a bokeh dashboard served in a docker container, which is running on kubernetes. I can access my dashboard remotely, no problems. But I noticed my pod containing the bokeh serve code restarts a lot, i.e. 14 times in the past 2 hours. Sometimes the status will come back as 'CrashLoopBackOff' and sometimes it will be 'Running' normally.
My question is, is there something about the way bokeh serve works that requires kubernetes to restart it so frequently? Is it something to do with memory (OOMKilled)?
Here is a section of my describe pod:
Name: bokeh-744d4bc9d-5pkzq
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: 10.183.226.51/10.183.226.51
Start Time: Tue, 18 Feb 2020 11:55:44 +0000
Labels: name=bokeh
pod-template-hash=744d4bc9d
Annotations: kubernetes.io/psp: xyz-privileged-psp
Status: Running
IP: 172.30.255.130
Controlled By: ReplicaSet/bokeh-744d4bc9d
Containers:
dashboard-application:
Container ID: containerd://16d10dc5dd89235b0xyz2b5b31f8e313f3f0bb7efe82a12e00c1f01708e2f894
Image: us.icr.io/oss-data-science-np-dal/bokeh:118
Image ID: us.icr.io/oss-data-science-np-dal/bokeh@sha256:037a5b52a6e7c792fdxy80b01e29772dbfc33b10e819774462bee650cf0da
Port: 5006/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 18 Feb 2020 14:25:36 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 18 Feb 2020 14:15:26 +0000
Finished: Tue, 18 Feb 2020 14:23:54 +0000
Ready: True
Restart Count: 17
Limits:
cpu: 800m
memory: 600Mi
Requests:
cpu: 600m
memory: 400Mi
Liveness: http-get http://:5006/ delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:5006/ delay=10s timeout=1s period=3s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cjhfk (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-cjhfk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cjhfk
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 600s
node.kubernetes.io/unreachable:NoExecute for 600s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 36m (x219 over 150m) kubelet, 10.183.226.51 Liveness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 21m (x34 over 134m) kubelet, 10.183.226.51 Back-off restarting failed container
Warning Unhealthy 10m (x72 over 150m) kubelet, 10.183.226.51 Readiness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 6m4s (x957 over 150m) kubelet, 10.183.226.51 Readiness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 68s (x23 over 147m) kubelet, 10.183.226.51 Liveness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I'm new to k8s, so any information you have to spare on this kind of issue will be much appreciated!
If a Container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure. This is documented here.
You may have to increase limits and requests in your pod spec. Check the official doc here.
Another way to look at it is to optimize your code so that it does not exceed the memory specified in limits.
OOMKill means your pod is consuming too much RAM and was killed in order to avoid disruption of the other workload running on the node.
You can either edit your code to use less RAM if feasible, or increase limits.memory.
You generally want requests = limits, unless your pod runs some heavy work at startup and then does nothing.
You may want to take a look at the official documentation.
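When comparing requests, limits and observed usage, it helps to convert the Kubernetes memory quantities to bytes first. This is a minimal Python sketch under stated assumptions: to_bytes is a hypothetical helper that handles only a subset of the suffixes Kubernetes accepts (Ki, Mi, Gi, Ti and k, M, G, T). The two values are the limit and request from the pod description in the question.

```python
# Multipliers for the binary (power-of-two) and decimal suffixes handled here.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def to_bytes(quantity):
    """Convert a memory quantity such as '600Mi' to a number of bytes."""
    # Try two-letter suffixes before one-letter ones so 'Mi' is not read as 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain number of bytes


limit = to_bytes("600Mi")
request = to_bytes("400Mi")
print("limit bytes:", limit)
print("headroom between request and limit:", limit - request)
```

The container is killed when its usage goes past the limit, not the request, so the limit is the figure to compare against the application's peak memory use.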

docker-registry deployment to K8S fails with "CrashLoopBackOff"

I am stuck with a docker-registry deployment to K8S. Here are the details of what I did. I hope you can give me some ideas.
My K8S version:
ii kubeadm 1.14.1-00 amd64 Kubernetes Cluster Bootstrapping Tool
ii kubectl 1.14.1-00 amd64 Kubernetes Command Line Tool
ii kubelet 1.14.1-00 amd64 Kubernetes Node Agent
ii kubernetes-cni 0.7.5-00 amd64 Kubernetes CNI
What I did:
Create a self-signed certificate
$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout cert.key -out cert.crt
Import the self-signed certificate into K8S
$ kubectl create secret tls registry-cert-secret --key cert.key --cert cert.crt
$ vim chart_values.yaml
ingress:
  enabled: true
  hosts:
    - registry.mgmt.home.local
  annotations:
    kubernetes.io/ingress.class: traefik
  tls:
    - secretName: registry-cert-secret
      hosts:
        - registry.mgmt.home.local
secrets:
  htpasswd: "admin:$2y$05$f95dCd6fRxQdDoPJ6mJIb.YMvR0qfhddSl3NSL1wCk1ZMl4JyFBDW"
  s3:
    accessKey: "admin"
    secretKey: "admin2019"
storage: s3
s3:
  region: us-east-1
  regionEndpoint: http://minio.home.local:9000
  secure: true
  bucket: registry
then install with helm
$ helm install stable/docker-registry -f chart_values.yaml --name docker-registry
NAME: docker-registry
LAST DEPLOYED: Thu Oct 31 16:29:31 2019
NAMESPACE: default
STATUS: DEPLOYED
show the kubectl deployments
$ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
docker-registry 0/1 1 0 35m
get pods
$ kubectl get pods --namespace default
NAME READY STATUS RESTARTS AGE
docker-registry-6989668db6-78d84 0/1 CrashLoopBackOff 7 13m
docker-registry-6989668db6-jttrz 1/1 Terminating 0 37m
describe pod
$ kubectl describe pod docker-registry-6989668db6-78d84 --namespace default
Name: docker-registry-6989668db6-78d84
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: k8s-worker-promox/10.102.11.223
Start Time: Thu, 31 Oct 2019 18:03:13 +0800
Labels: app=docker-registry
pod-template-hash=6989668db6
release=docker-registry
Annotations: checksum/config: 89b20bb43a348d6b8dedacac583a596ccef4e570a935e7c5b464ba746eb88307
Status: Running
IP: 10.244.52.10
Controlled By: ReplicaSet/docker-registry-6989668db6
Containers:
docker-registry:
Container ID: docker://9a40c5e100711b122ddd78439c9fa21790f04f5a442b704140639f8fbfbd8929
Image: registry:2.7.1
Image ID: docker-pullable://registry@sha256:8004747f1e8cd820a148fb7499d71a76d45ff66bac6a29129bfdbfdc0154d146
Port: 5000/TCP
Host Port: 0/TCP
Command:
/bin/registry
serve
/etc/docker/registry/config.yml
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 31 Oct 2019 18:14:21 +0800
Finished: Thu, 31 Oct 2019 18:15:19 +0800
Ready: False
Restart Count: 7
Liveness: http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
REGISTRY_AUTH: htpasswd
REGISTRY_AUTH_HTPASSWD_REALM: Registry Realm
REGISTRY_AUTH_HTPASSWD_PATH: /auth/htpasswd
REGISTRY_HTTP_SECRET: <set to the key 'haSharedSecret' in secret 'docker-registry-secret'> Optional: false
REGISTRY_STORAGE_S3_ACCESSKEY: <set to the key 's3AccessKey' in secret 'docker-registry-secret'> Optional: false
REGISTRY_STORAGE_S3_SECRETKEY: <set to the key 's3SecretKey' in secret 'docker-registry-secret'> Optional: false
REGISTRY_STORAGE_S3_REGION: us-east-1
REGISTRY_STORAGE_S3_REGIONENDPOINT: http://10.102.11.218:9000
REGISTRY_STORAGE_S3_BUCKET: registry
REGISTRY_STORAGE_S3_SECURE: true
Mounts:
/auth from auth (ro)
/etc/docker/registry from docker-registry-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qfwkm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
auth:
Type: Secret (a volume populated by a Secret)
SecretName: docker-registry-secret
Optional: false
docker-registry-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: docker-registry-config
ingress:
Optional: false
default-token-qfwkm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qfwkm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned default/docker-registry-6989668db6-78d84 to k8s-worker-promox
Normal Pulled 12m (x3 over 14m) kubelet, k8s-worker-promox Container image "registry:2.7.1" already present on machine
Normal Created 12m (x3 over 14m) kubelet, k8s-worker-promox Created container docker-registry
Normal Started 12m (x3 over 14m) kubelet, k8s-worker-promox Started container docker-registry
Normal Killing 12m (x2 over 13m) kubelet, k8s-worker-promox Container docker-registry failed liveness probe, will be restarted
Warning Unhealthy 12m (x7 over 14m) kubelet, k8s-worker-promox Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 9m8s (x15 over 13m) kubelet, k8s-worker-promox Readiness probe failed: HTTP probe failed with statuscode: 503
Warning BackOff 4m26s (x18 over 8m40s) kubelet, k8s-worker-promox Back-off restarting failed container
I see that the issue is related to the liveness and readiness probes: they fail, so the pod is restarted again and again until it goes into "Back-off".
From troubleshooting, it looks like this could be related to DNS, but DNS should not have any issues. I tried a lookup on the K8S host:
$ nslookup minio.home.local
Server: 10.102.11.201
Address: 10.102.11.201#53
Non-authoritative answer:
Name: minio.home.local
Address: 10.101.12.213
Update, November 1st: I went into another pod and ran nslookup there; that pod could not find minio.home.local. Is that related to this issue? I also tried replacing minio.home.local with the IP address in the *.yaml, but I get the same issue.
$ kubectl exec -it net-utils-5b5f89f777-2cwgq bash
root@net-utils-5b5f89f777-2cwgq:/# nslookup minio.home.local
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find minio.skylab.local: NXDOMAIN
root@net-utils-5b5f89f777-2cwgq:/# ping minio.home.local
ping: unknown host
I searched Google and GitHub discussions, but I still could not fix it. Do you have any ideas?
Thank you so much.

Kafka Pod doesn't start on GKE

I followed this tutorial, but when I tried to run it on GKE I was not able to start the kafka pod.
It goes into CrashLoopBackOff all the time, and I don't know how to show the pod's error logs.
Here is the result when I run kubectl describe pod my-pod-xxx:
Name: kafka-broker1-54cb95fb44-hlj5b
Namespace: default
Node: gke-xxx-default-pool-f9e313ed-zgcx/10.146.0.4
Start Time: Thu, 25 Oct 2018 11:40:21 +0900
Labels: app=kafka
id=1
pod-template-hash=1076519600
Annotations: kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container kafka
Status: Running
IP: 10.48.8.10
Controlled By: ReplicaSet/kafka-broker1-54cb95fb44
Containers:
kafka:
Container ID: docker://88ee6a1df4157732fc32b7bd8a81e329dbdxxxx9cbe614689e775d183dbcd61
Image: wurstmeister/kafka
Image ID: docker-pullable://wurstmeister/kafka@sha256:4f600a95fa1288f7b1xxxxxa32ca00b4fb13b83b31533fa6b40499bd9bdf192f
Port: 9092/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 25 Oct 2018 14:35:32 +0900
Finished: Thu, 25 Oct 2018 14:35:51 +0900
Ready: False
Restart Count: 37
Requests:
cpu: 100m
Environment:
KAFKA_ADVERTISED_PORT: 9092
KAFKA_ADVERTISED_HOST_NAME: 35.194.100.32
KAFKA_ZOOKEEPER_CONNECT: zoo1:2181
KAFKA_BROKER_ID: 1
KAFKA_CREATE_TOPICS: topic1:3:3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-w6s7n (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-w6s7n:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-w6s7n
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 5m (x716 over 2h) kubelet, gke-xxx-default-pool-f9e313ed-zgcx Back-off restarting failed container
Normal Pulling 36s (x38 over 2h) kubelet, gke-xxxdefault-pool-f9e313ed-zgcx pulling image "wurstmeister/kafka"
I noticed that the first run goes well, but after that the Node changes status to NotReady and the kafka pod enters the CrashLoopBackOff state.
Here is the log before it goes down:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned kafka-broker1-54cb95fb44-wwf2h to gke-xxx-default-pool-f9e313ed-8mr6
Normal SuccessfulMountVolume 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 MountVolume.SetUp succeeded for volume "default-token-w6s7n"
Normal Pulling 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 pulling image "wurstmeister/kafka"
Normal Pulled 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Successfully pulled image "wurstmeister/kafka"
Normal Created 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Created container
Normal Started 5m kubelet, gke-xxx-default-pool-f9e313ed-8mr6 Started container
Normal NodeControllerEviction 38s node-controller Marking for deletion Pod kafka-broker1-54cb95fb44-wwf2h from Node gke-dev-centurion-default-pool-f9e313ed-8mr6
Could anyone tell me what's wrong with my pod, and how I can catch the error behind the pod failure?
I just figured out that my cluster's nodes did not have enough resources.
After creating a new cluster with more memory, it works.
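On the question of catching the error: the exit code in the pod description already carries a hint. By the usual Unix convention, a code above 128 means the process was ended by signal number (code - 128), so 137 is 128 + 9, i.e. SIGKILL. The code alone does not say who sent the signal (the kernel's out-of-memory killer and the kubelet both use SIGKILL), so treat it as a pointer rather than a diagnosis. The output of the previous, crashed container can be read with kubectl logs <pod-name> --previous. Below is a minimal Python sketch of the decoding; describe_exit_code is a hypothetical helper name.

```python
import signal


def describe_exit_code(code):
    """Give a short human-readable reading of a container exit code."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            # Codes above 128 conventionally mean "ended by signal (code - 128)".
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit status {code}"


for code in (137, 255, 1):
    print(code, "->", describe_exit_code(code))
```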