Troubleshoot Google Kubernetes load balancer unhealthy nodes

What steps should I take to troubleshoot why a Google load balancer sees the nodes within a cluster as unhealthy?
Using Google Kubernetes Engine, I have a cluster with 3 nodes, and all deployments are running readiness and liveness checks. All are reporting that they are healthy.
The load balancer is built from the helm nginx-ingress:
https://github.com/helm/charts/tree/master/stable/nginx-ingress
It's used as a single ingress for all the deployment applications within the cluster.
Visually scanning the ingress controllers logs:
kubectl logs <ingress-controller-name>
shows only the usual nginx output ... HTTP/1.1" 200 ... I can't see any health checks within these logs. I'm not sure whether I should, but there is nothing to suggest anything is unhealthy.
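For reference, a couple of commands that could be used to look for the health checks directly (a rough sketch; the GoogleHC user-agent string and the 10254 health port are assumptions based on GCP and chart defaults, and curl may not be present in the controller image):
# look for GCP health-check requests in the nginx access log (GCP's checker usually identifies itself as GoogleHC)
kubectl logs <ingress-controller-name> | grep -i googlehc
# hit the controller's own health endpoint from inside the pod
kubectl exec <ingress-controller-name> -- curl -s http://localhost:10254/healthz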
Running a describe against the ingress controller shows no events, but it does show a liveness and readiness check which I'm not too sure would actually pass:
Name: umbrella-ingress-controller-****
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: gke-multi-client-n1--2cpu-4ram-****/10.154.0.50
Start Time: Fri, 15 Nov 2019 21:23:36 +0000
Labels: app=ingress
component=controller
pod-template-hash=7c55db4f5c
release=umbrella
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ingress-controller
Status: Running
IP: ****
Controlled By: ReplicaSet/umbrella-ingress-controller-7c55db4f5c
Containers:
ingress-controller:
Container ID: docker://****
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
Image ID: docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:****
Ports: 80/TCP, 443/TCP
Host Ports: 0/TCP, 0/TCP
Args:
/nginx-ingress-controller
--default-backend-service=default/umbrella-ingress-default-backend
--election-id=ingress-controller-leader
--ingress-class=nginx
--configmap=default/umbrella-ingress-controller
State: Running
Started: Fri, 15 Nov 2019 21:24:38 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: umbrella-ingress-controller-**** (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from umbrella-ingress-token-**** (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
umbrella-ingress-token-2tnm9:
Type: Secret (a volume populated by a Secret)
SecretName: umbrella-ingress-token-****
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
However, using Google's console, I navigate to the load balancer's details and can see the following:
Above, 2 of the nodes seem to be having issues, although I can't find what the issues are.
At this point the load balancer is still serving traffic via the third, healthy node, however it will occasionally drop that and show me the following:
At this point no traffic gets past the load balancer, so all the applications on the nodes are unreachable.
Any help with where I should be looking to troubleshoot this would be great.
---- edit 17/11/19
Below is the nginx-ingress config passed via helm:
ingress:
  enabled: true
  rbac.create: true
  controller:
    service:
      externalTrafficPolicy: Local
      loadBalancerIP: ****
    configData:
      proxy-connect-timeout: "15"
      proxy-read-timeout: "600"
      proxy-send-timeout: "600"
      proxy-body-size: "100m"

This is expected behavior. Using externalTrafficPolicy: Local configures the service so that only nodes running a serving pod accept traffic; any node that does not have a serving pod will drop packets sent to the service.
The GCP network load balancer still sends health checks to every node, using the service NodePort. Any node that runs an nginx ingress controller pod responds to the health check; any node that does not will drop the packet, so the check fails.
This results in only certain nodes showing as healthy.
For the nginx ingress controller, I recommend using the default value of Cluster instead of changing it to Local.
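If you do switch back, a minimal sketch of the values change against the config shown in the question (the service name below is an assumption based on the release name, and healthCheckNodePort is only populated while the policy is Local):
ingress:
  controller:
    service:
      externalTrafficPolicy: Cluster

# if you keep Local instead, this shows the port GCP's health check should be probing
kubectl get svc umbrella-ingress-controller -o jsonpath='{.spec.healthCheckNodePort}'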

Related

Kubernetes fix gke-metrics-agent stuck in terminating state on GKE

GKE had an outage about 2 days ago in their London datacentre (https://status.cloud.google.com/incident/compute/20013), since which time one of my nodes has been acting up. I've had to manually terminate a number of pods running on it, and I'm having issues with a couple of sites, which I assume is due to their liveness checks failing temporarily. Might that have something to do with the gke-metrics-agent error below?
Looking at the system pods I can see one instance of gke-metrics-agent is stuck in a terminating state and has been since last night:
kubectl get pods -n kube-system
reports:
...
gke-metrics-agent-k47g8 0/1 Terminating 0 32d
gke-metrics-agent-knr9h 1/1 Running 0 31h
gke-metrics-agent-vqkpw 1/1 Running 0 32d
...
I've looked at the describe output for the pod but can't see anything that helps me understand what needs to be done:
kubectl describe pod gke-metrics-agent-k47g8 -n kube-system
Name: gke-metrics-agent-k47g8
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <node-name>/<IP>
Start Time: Mon, 09 Nov 2020 03:41:14 +0000
Labels: component=gke-metrics-agent
controller-revision-hash=f8c5b8bfb
k8s-app=gke-metrics-agent
pod-template-generation=4
Annotations: components.gke.io/component-name: gke-metrics-agent
components.gke.io/component-version: 0.27.1
configHash: <config-hash>
Status: Terminating (lasts 15h)
Termination Grace Period: 30s
IP: <IP>
IPs:
IP: <IP>
Controlled By: DaemonSet/gke-metrics-agent
Containers:
gke-metrics-agent:
Container ID: docker://<id>
Image: gcr.io/gke-release/gke-metrics-agent:0.1.3-gke.0
Image ID: docker-pullable://gcr.io/gke-release/gke-metrics-agent@sha256:<hash>
Port: <none>
Host Port: <none>
Command:
/otelsvc
--config=/conf/gke-metrics-agent-config.yaml
--metrics-level=NONE
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Nov 2020 03:41:17 +0000
Finished: Thu, 10 Dec 2020 21:16:50 +0000
Ready: False
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 3m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: gke-metrics-agent-k47g8 (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBELET_HOST: 127.0.0.1
ARG1: ${1}
ARG2: ${2}
Mounts:
/conf from gke-metrics-agent-config-vol (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gke-metrics-agent-token-cn6ss (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gke-metrics-agent-config-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gke-metrics-agent-conf
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
gke-metrics-agent-token-cn6ss:
Type: Secret (a volume populated by a Secret)
SecretName: gke-metrics-agent-token-cn6ss
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoExecute
:NoSchedule
components.gke.io/gke-managed-components
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>
I'm not used to having to work on the system pods; in the past my troubleshooting has often fallen back on force deleting pods when all else fails:
kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force
My concern is that I don't fully understand what this might do to a system pod, and I was hoping someone with expertise could advise on a sensible way forward.
I'm also looking at draining this node so Kubernetes can rebuild a new one. Would that potentially be the easiest way to go?
Following up on this, the node that was experiencing the issues with gke-metrics-agent became even less stable as the day went on.
I therefore had to drain it. The resources it was running are now on new nodes, which are working as expected, and all system pods are running as expected (including gke-metrics-agent).
Prior to draining the node I ensured Pod Disruption Budgets were in place, as a number of services run on 1 or 2 instances:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
This meant I could run:
kubectl drain <node-name>
The deployments then ensured they had enough live pods before the bad node was taken offline, which seems to have avoided any downtime.
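For anyone following the same route, this is roughly what I mean by a Pod Disruption Budget (a minimal sketch; the name, label and minAvailable value are placeholders rather than my real config), plus the drain flags typically needed when DaemonSet pods and emptyDir volumes are involved:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service

kubectl drain <node-name> --ignore-daemonsets --delete-local-data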

Readiness probe failed: timeout: failed to connect service ":8080" within 1s

I am trying to build and deploy microservices images to a single-node Kubernetes cluster running on my development machine using minikube. I am using the cloud-native microservices demo application Online Boutique by Google to understand the use of technologies like Kubernetes, Istio etc.
Link to github repo: microservices-demo
I have followed the entire installation process to locally build and deploy the microservices, and I am able to access the web frontend through my browser. However, when I click on any of the product images, I see this error page.
HTTP Status: 500 Internal Server Error
On doing a check using kubectl get pods
I realize that one of my pods (the recommendation service) has the status CrashLoopBackOff.
Running kubectl describe pods recommendationservice-55b4d6c477-kxv8r:
Namespace: default
Priority: 0
Node: minikube/192.168.99.116
Start Time: Thu, 23 Jul 2020 19:58:38 +0530
Labels: app=recommendationservice
app.kubernetes.io/managed-by=skaffold-v1.11.0
pod-template-hash=55b4d6c477
skaffold.dev/builder=local
skaffold.dev/cleanup=true
skaffold.dev/deployer=kubectl
skaffold.dev/docker-api-version=1.40
skaffold.dev/run-id=49913ced-e8df-40a7-9336-a227b56bcb5f
skaffold.dev/tag-policy=git-commit
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/recommendationservice-55b4d6c477
Containers:
server:
Container ID: docker://2d92aa966a82fbe58c8f40f6ecf9d6d55c29f8081cb40e0423a2397e1419350f
Image: recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Image ID: docker://sha256:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 23 Jul 2020 21:09:33 +0530
Finished: Thu, 23 Jul 2020 21:09:53 +0530
Ready: False
Restart Count: 29
Limits:
cpu: 200m
memory: 450Mi
Requests:
cpu: 100m
memory: 220Mi
Liveness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Readiness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Environment:
PORT: 8080
PRODUCT_CATALOG_SERVICE_ADDR: productcatalogservice:3550
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sbpcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-sbpcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sbpcx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 44m (x15 over 74m) kubelet, minikube Container image "recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb" already present on machine
Warning Unhealthy 9m33s (x99 over 74m) kubelet, minikube Readiness probe failed: timeout: failed to connect service ":8080" within 1s
Warning BackOff 4m25s (x294 over 72m) kubelet, minikube Back-off restarting failed container
In Events, I see Readiness probe failed: timeout: failed to connect service ":8080" within 1s.
What is the reason and how can I resolve this?
Thanks for the help!
Answer
The timeout of the Readiness Probe (1 second) was too short.
More Info
The relevant Readiness Probe is defined such that /bin/grpc_health_probe -addr=:8080 is run inside the server container.
You would expect a 1-second timeout to be sufficient for such a probe, but since this is running on Minikube the added overhead could be pushing the probe past that timeout.
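A minimal sketch of the kind of change that works around it, assuming you edit the readiness and liveness probes in the recommendationservice manifest (the 5-second values are arbitrary examples, not values from the repo):
readinessProbe:
  periodSeconds: 5
  timeoutSeconds: 5   # was 1s, which is too tight under Minikube
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]
livenessProbe:
  periodSeconds: 5
  timeoutSeconds: 5
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]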

Kubernetes master doesn't attach FlexVolume

I'm trying to attach the dummy-attachable FlexVolume sample for Kubernetes which seems to initialize normally according to my logs on both the nodes and master:
Loaded volume plugin "flexvolume-k8s/dummy-attachable
But when I try to attach the volume to a pod, the attach method never gets called from the master. The logs from the node read:
flexVolume driver k8s/dummy-attachable: using default GetVolumeName for volume dummy-attachable
operationExecutor.VerifyControllerAttachedVolume started for volume "dummy-attachable"
Operation for "\"flexvolume-k8s/dummy-attachable/dummy-attachable\"" failed. No retries permitted until 2019-04-22 13:42:51.21390334 +0000 UTC m=+4814.674525788 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"dummy-attachable\" (UniqueName: \"flexvolume-k8s/dummy-attachable/dummy-attachable\") pod \"nginx-dummy-attachable\"
Here's how I'm attempting to mount the volume:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-dummy-attachable
  namespace: default
spec:
  containers:
    - name: nginx-dummy-attachable
      image: nginx
      volumeMounts:
        - name: dummy-attachable
          mountPath: /data
      ports:
        - containerPort: 80
  volumes:
    - name: dummy-attachable
      flexVolume:
        driver: "k8s/dummy-attachable"
Here is the output of kubectl describe pod nginx-dummy-attachable:
Name: nginx-dummy-attachable
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: [node id]
Start Time: Wed, 24 Apr 2019 08:03:21 -0400
Labels: <none>
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container nginx-dummy-attachable
Status: Pending
IP:
Containers:
nginx-dummy-attachable:
Container ID:
Image: nginx
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/data from dummy-attachable (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hcnhj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
dummy-attachable:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: k8s/dummy-attachable
FSType:
SecretRef: nil
ReadOnly: false
Options: map[]
default-token-hcnhj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hcnhj
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 41s (x6 over 11m) kubelet, [node id] Unable to mount volumes for pod "nginx-dummy-attachable_default([id])": timeout expired waiting for volumes to attach or mount for pod "default"/"nginx-dummy-attachable". list of unmounted volumes=[dummy-attachable]. list of unattached volumes=[dummy-attachable default-token-hcnhj]
I added debug logging to the FlexVolume, so I was able to verify that the attach method was never called on the master node. I'm not sure what I'm missing here.
I don't know if this matters, but the cluster is being launched with kOps. I've tried with both k8s 1.11 and 1.14 with no success.
So this is a fun one.
Even though kubelet initializes the FlexVolume plugin on the master, kube-controller-manager, which is containerized in kOps, is the component actually responsible for attaching the volume to the pod. kOps doesn't mount the default plugin directory /usr/libexec/kubernetes/kubelet-plugins/volume/exec into the kube-controller-manager pod, so it doesn't know anything about your FlexVolume plugins on the master.
There doesn't appear to be a non-hacky way around this other than using a different Kubernetes deployment tool until kOps addresses the problem.
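For clusters where you control the kube-controller-manager manifest directly (kubeadm-style static pods rather than kOps), the workaround looks roughly like this; treat it as a sketch of the idea, not a kOps-supported fix:
# excerpt from a static kube-controller-manager pod manifest (sketch)
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        - --flex-volume-plugin-dir=/usr/libexec/kubernetes/kubelet-plugins/volume/exec
      volumeMounts:
        - name: flexvolume-dir
          mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
          readOnly: true
  volumes:
    - name: flexvolume-dir
      hostPath:
        path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
        type: DirectoryOrCreate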

Installing kong-ingress-controller to manage ingress on kubernetes

I am installing the Kong ingress controller on my AKS cluster, but I don't want the Postgres StatefulSet service inside my cluster. Instead, I have a Postgres database in my Azure infrastructure, and I want to connect to it from my kong-ingress-controller deployment by creating the Postgres credentials as secrets in my AKS cluster and storing them in environment variables.
I've created the secret:
⟩ kubectl create secret generic az-pg-db-user-pass --from-literal=username='az-pg-username' --from-literal=password='az-pg-password' --namespace kong
secret/az-pg-db-user-pass created
In my kongwithingress.yaml file I have the deployment manifest declarations, which I've shared via this gist link so as not to fill the question body with a lot of YAML.
The gist is based on this all-in-one AKS deployment, but with the Postgres StatefulSet and Service removed for the reasons above; my objective is to set up the connection to my own Azure managed Postgres service.
I've referenced the az-pg-db-user-pass generic secret in the kong-ingress-controller deployment, the kong deployment and the kong-migrations job (all present in the gist) in order to create the following environment variables:
KONG_PG_USERNAME
KONG_PG_PASSWORD
These environment variables have been created and referenced as secrets in the kong-ingress-controller deployment, the kong deployment and the kong-migrations job, all of which need to connect to the Postgres database.
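For context, the wiring in the gist looks roughly like the following (a trimmed sketch of the env block only, matching the describe output further down):
env:
  - name: KONG_PG_USERNAME
    valueFrom:
      secretKeyRef:
        name: az-pg-db-user-pass
        key: username
  - name: KONG_PG_PASSWORD
    valueFrom:
      secretKeyRef:
        name: az-pg-db-user-pass
        key: password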
When I execute the kubectl apply -f kongwithingres.yaml command I get the following output:
The kong-ingress-controller deployment, kong deployment and kong-migrations job were created successfully.
⟩ kubectl apply -f kongwithingres.yaml
namespace/kong unchanged
customresourcedefinition.apiextensions.k8s.io/kongplugins.configuration.konghq.com unchanged
customresourcedefinition.apiextensions.k8s.io/kongconsumers.configuration.konghq.com unchanged
customresourcedefinition.apiextensions.k8s.io/kongcredentials.configuration.konghq.com unchanged
customresourcedefinition.apiextensions.k8s.io/kongingresses.configuration.konghq.com unchanged
serviceaccount/kong-serviceaccount unchanged
clusterrole.rbac.authorization.k8s.io/kong-ingress-clusterrole unchanged
role.rbac.authorization.k8s.io/kong-ingress-role unchanged
rolebinding.rbac.authorization.k8s.io/kong-ingress-role-nisa-binding unchanged
clusterrolebinding.rbac.authorization.k8s.io/kong-ingress-clusterrole-nisa-binding unchanged
service/kong-ingress-controller created
deployment.extensions/kong-ingress-controller created
service/kong-proxy created
deployment.extensions/kong created
job.batch/kong-migrations created
But their respective pods have the CrashLoopBackOff status
NAME READY STATUS RESTARTS AGE
pod/kong-d8b88df99-j6hvl 0/1 Init:CrashLoopBackOff 5 4m24s
pod/kong-ingress-controller-984fc9666-cd2b5 0/2 Init:CrashLoopBackOff 5 4m24s
pod/kong-migrations-t6n7p 0/1 CrashLoopBackOff 5 4m24s
Checking the respective logs of each pod, I found this:
The pod/kong-d8b88df99-j6hvl:
⟩ kubectl logs pod/kong-d8b88df99-j6hvl -p -n kong
Error from server (BadRequest): previous terminated container "kong-proxy" in pod "kong-d8b88df99-j6hvl" not found
And in its describe information, this pod is getting the environment variables and the image:
⟩ kubectl describe pod/kong-d8b88df99-j6hvl -n kong
Name: kong-d8b88df99-j6hvl
Namespace: kong
Status: Pending
IP: 10.244.1.18
Controlled By: ReplicaSet/kong-d8b88df99
Init Containers:
wait-for-migrations:
Container ID: docker://7007a89ada215daf853ec103d79dca60ccc5fb3a14c51ac6c5c56655da6da62f
Image: kong:1.0.0
Image ID: docker-pullable://kong@sha256:8fd6a312d7715a9cc85c49625a4c2f53951f6e4422926091e4d2ae67c480b6d5
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
kong migrations list
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 26 Feb 2019 16:25:01 +0100
Finished: Tue, 26 Feb 2019 16:25:01 +0100
Ready: False
Restart Count: 6
Environment:
KONG_ADMIN_LISTEN: off
KONG_PROXY_LISTEN: off
KONG_PROXY_ACCESS_LOG: /dev/stdout
KONG_ADMIN_ACCESS_LOG: /dev/stdout
KONG_PROXY_ERROR_LOG: /dev/stderr
KONG_ADMIN_ERROR_LOG: /dev/stderr
KONG_PG_HOST: zcrm365-postgresql1.postgres.database.azure.com
KONG_PG_USERNAME: <set to the key 'username' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_PASSWORD: <set to the key 'password' in secret 'az-pg-db-user-pass'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gnkjq (ro)
Containers:
kong-proxy:
Container ID:
Image: kong:1.0.0
Image ID:
Ports: 8000/TCP, 8443/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
KONG_PG_USERNAME: <set to the key 'username' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_PASSWORD: <set to the key 'password' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_HOST: zcrm365-postgresql1.postgres.database.azure.com
KONG_PROXY_ACCESS_LOG: /dev/stdout
KONG_PROXY_ERROR_LOG: /dev/stderr
KONG_ADMIN_LISTEN: off
KUBERNETES_PORT_443_TCP_ADDR: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
KUBERNETES_PORT: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_SERVICE_HOST: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gnkjq (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-gnkjq:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gnkjq
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m44s default-scheduler Successfully assigned kong/kong-d8b88df99-j6hvl to aks-default-75800594-1
Normal Pulled 7m9s (x5 over 8m40s) kubelet, aks-default-75800594-1 Container image "kong:1.0.0" already present on machine
Normal Created 7m8s (x5 over 8m40s) kubelet, aks-default-75800594-1 Created container
Normal Started 7m7s (x5 over 8m40s) kubelet, aks-default-75800594-1 Started container
Warning BackOff 3m34s (x26 over 8m38s) kubelet, aks-default-75800594-1 Back-off restarting failed container
The pod/kong-ingress-controller-984fc9666-cd2b5:
kubectl logs pod/kong-ingress-controller-984fc9666-cd2b5 -p -n kong
Error from server (BadRequest): a container name must be specified for pod kong-ingress-controller-984fc9666-cd2b5, choose one of: [admin-api ingress-controller] or one of the init containers: [wait-for-migrations]
And here is its respective description:
⟩ kubectl describe pod/kong-ingress-controller-984fc9666-cd2b5 -n kong
Name: kong-ingress-controller-984fc9666-cd2b5
Namespace: kong
Status: Pending
IP: 10.244.2.18
Controlled By: ReplicaSet/kong-ingress-controller-984fc9666
Init Containers:
wait-for-migrations:
Container ID: docker://8eb035f755322b3ac72792d922974811933ba9a71afb1f4549cfe7e0a6519619
Image: kong:1.0.0
Image ID: docker-pullable://kong@sha256:8fd6a312d7715a9cc85c49625a4c2f53951f6e4422926091e4d2ae67c480b6d5
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
kong migrations list
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 26 Feb 2019 16:29:56 +0100
Finished: Tue, 26 Feb 2019 16:29:56 +0100
Ready: False
Restart Count: 7
Environment:
KONG_ADMIN_LISTEN: off
KONG_PROXY_LISTEN: off
KONG_PROXY_ACCESS_LOG: /dev/stdout
KONG_ADMIN_ACCESS_LOG: /dev/stdout
KONG_PROXY_ERROR_LOG: /dev/stderr
KONG_ADMIN_ERROR_LOG: /dev/stderr
KONG_PG_HOST: zcrm365-postgresql1.postgres.database.azure.com
KONG_PG_USERNAME: <set to the key 'username' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_PASSWORD: <set to the key 'password' in secret 'az-pg-db-user-pass'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kong-serviceaccount-token-rc4sp (ro)
Containers:
admin-api:
Container ID:
Image: kong:1.0.0
Image ID:
Port: 8001/TCP
Host Port: 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Liveness: http-get http://:8001/status delay=30s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8001/status delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
KONG_PG_USERNAME: <set to the key 'username' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_PASSWORD: <set to the key 'password' in secret 'az-pg-db-user-pass'> Optional: false
KONG_PG_HOST: zcrm365-postgresql1.postgres.database.azure.com
KONG_ADMIN_ACCESS_LOG: /dev/stdout
KONG_ADMIN_ERROR_LOG: /dev/stderr
KONG_ADMIN_LISTEN: 0.0.0.0:8001, 0.0.0.0:8444 ssl
KONG_PROXY_LISTEN: off
KUBERNETES_PORT_443_TCP_ADDR: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
KUBERNETES_PORT: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_SERVICE_HOST: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kong-serviceaccount-token-rc4sp (ro)
ingress-controller:
Container ID:
Image: kong-docker-kubernetes-ingress-controller.bintray.io/kong-ingress-controller:0.3.0
Image ID:
Port: <none>
Host Port: <none>
Args:
/kong-ingress-controller
--kong-url=https://localhost:8444
--admin-tls-skip-verify
--default-backend-service=kong/kong-proxy
--publish-service=kong/kong-proxy
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Liveness: http-get http://:10254/healthz delay=30s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: kong-ingress-controller-984fc9666-cd2b5 (v1:metadata.name)
POD_NAMESPACE: kong (v1:metadata.namespace)
KUBERNETES_PORT_443_TCP_ADDR: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
KUBERNETES_PORT: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_PORT_443_TCP: tcp://zcrm365-d73ab78d.hcp.westeurope.azmk8s.io:443
KUBERNETES_SERVICE_HOST: zcrm365-d73ab78d.hcp.westeurope.azmk8s.io
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kong-serviceaccount-token-rc4sp (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
kong-serviceaccount-token-rc4sp:
Type: Secret (a volume populated by a Secret)
SecretName: kong-serviceaccount-token-rc4sp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned kong/kong-ingress-controller-984fc9666-cd2b5 to aks-default-75800594-2
Normal Pulled 10m (x5 over 12m) kubelet, aks-default-75800594-2 Container image "kong:1.0.0" already present on machine
Normal Created 10m (x5 over 12m) kubelet, aks-default-75800594-2 Created container
Normal Started 10m (x5 over 12m) kubelet, aks-default-75800594-2 Started container
Warning BackOff 2m14s (x49 over 12m) kubelet, aks-default-75800594-2 Back-off restarting failed container
I don't know the reason for the CrashLoopBackOff status, nor why the containers are stuck in Waiting: PodInitializing.
How can I debug this behavior?
Is it possible that Kong cannot talk to the Postgres database?
Both my AKS cluster and my Postgres database are on Azure, and they can communicate with each other.
UPDATE
These are the logs of the containers in the pods that were created:
⟩ kubectl logs pod/kong-ingress-controller-984fc9666-w4vvn -p -n kong -c ingress-controller
Error from server (BadRequest): previous terminated container "ingress-controller" in pod "kong-ingress-controller-984fc9666-w4vvn" not found
⟩ kubectl logs pod/kong-d8b88df99-qsq4j -p -n kong -c kong-proxy
Error from server (BadRequest): previous terminated container "kong-proxy" in pod "kong-d8b88df99-qsq4j" not found
My kong-ingress-controller deployment pods were in CrashLoopBackOff (and sometimes in Waiting: PodInitializing) because I hadn't kept a few things in mind, as follows:
The main reason, as @Amityo says, is that the kong-ingress-controller and kong deployments have an init container called wait-for-migrations, which waits for the kong-migrations job to complete before they start. From this I could see that I needed to run my Kong migrations first.
But my kong-migrations job was not working, because I had not set the KONG_DATABASE environment variable to configure the connection.
Another reason my deployment was not working is that Kong, in order to connect to Postgres, may expect the user environment variable defined in the container to be called KONG_PG_USER. I had called it KONG_PG_USERNAME, and that was another reason the script failed. (I am not completely sure about this.)
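Putting that together, the env block I ended up aiming for looks roughly like this (a sketch only; it reuses the same secret, with the user variable renamed to KONG_PG_USER and KONG_DATABASE set explicitly):
env:
  - name: KONG_DATABASE
    value: "postgres"
  - name: KONG_PG_HOST
    value: zcrm365-postgresql1.postgres.database.azure.com
  - name: KONG_PG_USER
    valueFrom:
      secretKeyRef:
        name: az-pg-db-user-pass
        key: username
  - name: KONG_PG_PASSWORD
    valueFrom:
      secretKeyRef:
        name: az-pg-db-user-pass
        key: password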
⟩ kubectl create -f kongwithingres.yaml
namespace/kong created
secret/az-pg-db-user-pass created
customresourcedefinition.apiextensions.k8s.io/kongplugins.configuration.konghq.com created
customresourcedefinition.apiextensions.k8s.io/kongconsumers.configuration.konghq.com created
customresourcedefinition.apiextensions.k8s.io/kongcredentials.configuration.konghq.com created
customresourcedefinition.apiextensions.k8s.io/kongingresses.configuration.konghq.com created
serviceaccount/kong-serviceaccount created
clusterrole.rbac.authorization.k8s.io/kong-ingress-clusterrole created
role.rbac.authorization.k8s.io/kong-ingress-role created
rolebinding.rbac.authorization.k8s.io/kong-ingress-role-nisa-binding created
clusterrolebinding.rbac.authorization.k8s.io/kong-ingress-clusterrole-nisa-binding created
service/kong-ingress-controller created
deployment.extensions/kong-ingress-controller created
service/kong-proxy created
deployment.extensions/kong created
job.batch/kong-migrations created
By the way, to get started with Kong I recommend installing Konga, which is a front-end dashboard tool for managing Kong and checking the things we configure via YAML files.
We have this konga.yaml manifest to install it as a Deployment in our Kubernetes cluster:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: konga
  namespace: kong
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: konga
    spec:
      containers:
        - env:
            - name: NODE_TLS_REJECT_UNAUTHORIZED
              value: "0"
          image: pantsel/konga:latest
          name: konga
          ports:
            - containerPort: 1337
Then we can reach the service locally on our machines via the kubectl port-forward command:
⟩ kubectl port-forward pod/konga-85b66cffff-mxq85 1337:1337 -n kong
Forwarding from 127.0.0.1:1337 -> 1337
Forwarding from [::1]:1337 -> 1337
We go to http://localhost:1337/, where it is necessary to register an admin user in order to use Konga.
Create a Kong connection (Connections in the dashboard) with the following Kong admin URL: http://kong-ingress-controller:8001/

Kubernetes api/dashboard issue

I posted this on serverfault, too, but will hopefully get more views/feedback here:
I'm trying to get the Dashboard UI working in a kubeadm cluster, using kubectl proxy for remote access. I'm getting
Error: 'dial tcp 192.168.2.3:8443: connect: connection refused'
Trying to reach: 'https://192.168.2.3:8443/'
when accessing http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/ via remote browser.
Looking at API logs, I see that I'm getting the following errors:
I1215 20:18:46.601151 1 log.go:172] http: TLS handshake error from 10.21.72.28:50268: remote error: tls: unknown certificate authority
I1215 20:19:15.444580 1 log.go:172] http: TLS handshake error from 10.21.72.28:50271: remote error: tls: unknown certificate authority
I1215 20:19:31.850501 1 log.go:172] http: TLS handshake error from 10.21.72.28:50275: remote error: tls: unknown certificate authority
I1215 20:55:55.574729 1 log.go:172] http: TLS handshake error from 10.21.72.28:50860: remote error: tls: unknown certificate authority
E1215 21:19:47.246642 1 watch.go:233] unable to encode watch object *v1.WatchEvent: write tcp 134.84.53.162:6443->134.84.53.163:38894: write: connection timed out (&streaming.encoder{writer:(*metrics.fancyResponseWriterDelegator)(0xc42d6fecb0), encoder:(*versioning.codec)(0xc429276990), buf:(*bytes.Buffer)(0xc42cae68c0)})
I presume this is related to not being able to get the Dashboard working, and if so am wondering what the issue with the API server is. Everything else in the cluster appears to be working.
NB, I have admin.conf running locally and am able to access the cluster via kubectl with no issue.
Also of note is that this had been working when I first got the cluster up. However, I was having networking issues and had to apply the fix from "Coredns service do not work, but endpoint is ok, the other SVCs are normal only except dns" in order to get CoreDNS to work, so I am wondering if this may have broken the proxy service?
* EDIT *
Here is output for the dashboard pod:
[gms@thalia0 ~]$ kubectl describe pod kubernetes-dashboard-77fd78f978-tjzxt --namespace=kube-system
Name: kubernetes-dashboard-77fd78f978-tjzxt
Namespace: kube-system
Priority: 0
PriorityClassName: <none>
Node: thalia2.hostdoman/hostip<redacted>
Start Time: Sat, 15 Dec 2018 15:17:57 -0600
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=77fd78f978
Annotations: cni.projectcalico.org/podIP: 192.168.2.3/32
Status: Running
IP: 192.168.2.3
Controlled By: ReplicaSet/kubernetes-dashboard-77fd78f978
Containers:
kubernetes-dashboard:
Container ID: docker://ed5ff580fb7d7b649d2bd1734e5fd80f97c80dec5c8e3b2808d33b8f92e7b472
Image: k8s.gcr.io/kubernetes-dashboard-amd64:v1.10.0
Image ID: docker-pullable://k8s.gcr.io/kubernetes-dashboard-amd64@sha256:1d2e1229a918f4bc38b5a3f9f5f11302b3e71f8397b492afac7f273a0008776a
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
State: Running
Started: Sat, 15 Dec 2018 15:18:04 -0600
Ready: True
Restart Count: 0
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kubernetes-dashboard-token-mrd9k (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
kubernetes-dashboard-token-mrd9k:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-token-mrd9k
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
I checked the service:
[gms@thalia0 ~]$ kubectl -n kube-system get service kubernetes-dashboard
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes-dashboard ClusterIP 10.103.93.93 <none> 443/TCP 4d23h
And also of note, if I curl http://localhost:8001/api from the master node, I do get a valid response.
So, in summary, I'm not sure which if any of these errors are the source of not being able to access the dashboard.
I just upgraded my cluster to 1.13.1, in hopes that this issue would be resolved, but alas, no.
When you run kubectl proxy, the default port 8001 is only reachable from localhost. If kubectl proxy is running on the machine where Kubernetes is installed, you must map this port to your laptop or whatever device you are using to ssh.
You can ssh to the master node and map the 8001 port to your local box with:
ssh -L 8001:localhost:8001 user@master_node_IP
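Alternatively, kubectl proxy itself can be told to listen on all interfaces and accept remote hosts (a sketch; note this exposes the unauthenticated proxy beyond localhost, so only do it on a trusted network):
kubectl proxy --address=0.0.0.0 --accept-hosts='^.*$' --port=8001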
I upgraded all nodes in the cluster to version 1.13.1 and voila, the dashboard now works, and so far I have not had to apply the CoreDNS fix noted above.