I am trying to deploy Metabase on my GKE cluster, but I get Readiness probe failed.
When I build and run the image locally, a request to localhost:3000/api/health returns status 200, but on Kubernetes it does not work.
Dockerfile (I created my own image, built it, and pushed it to my GitLab registry):
FROM metabase/metabase:v0.41.6
EXPOSE 3000
CMD ["/app/run_metabase.sh" ]
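For reference, this is roughly how I verified the image locally (the image tag here is just a placeholder):
docker build -t metabase-local .
docker run -p 3000:3000 metabase-local
curl -i http://localhost:3000/api/health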
my deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: metaba-dev
spec:
selector:
matchLabels:
app: metaba-dev
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 50%
maxSurge: 100%
template:
metadata:
labels:
app: metaba-dev
spec:
restartPolicy: Always
imagePullSecrets:
- name: gitlab-login
containers:
- name: metaba-dev
image: registry.gitlab.com/team/metabase:dev-{{BUILD_NUMBER}}
command: ["/app/run_metabase.sh" ]
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 60
periodSeconds: 10
imagePullPolicy: Always
ports:
- name: metaba-dev-port
containerPort: 3000
terminationGracePeriodSeconds: 90
I got this error from
kubectl describe pod metaba-dev
Warning Unhealthy 61s (x3 over 81s) kubelet Readiness probe failed: Get "http://10.207.128.197:3000/api/health": dial tcp 10.207.128.197:3000: connect: connection refused
Warning Unhealthy 61s (x3 over 81s) kubelet Liveness probe failed: Get "http://10.207.128.197:3000/api/health": dial tcp 10.207.128.197:3000: connect: connection refused
kubectl logs
Picked up JAVA_TOOL_OPTIONS: -Xmx1g -Xms1g -Xmx1g
Warning: environ value jdk-11.0.13+8 for key :java-version has been overwritten with 11.0.13
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-01-28 15:32:23,966 INFO metabase.util :: Maximum memory available to JVM: 989.9 MB
2022-01-28 15:33:09,703 INFO util.encryption :: Saved credentials encryption is ENABLED for this Metabase instance. 🔐
For more information, see https://metabase.com/docs/latest/operations-guide/encrypting-database-details-at-rest.html
Here is the solution:
I increased initialDelaySeconds to 1200 and checked the logs again. The real cause was a network issue: my pod could not connect to its database. I had not seen that before because the container kept getting restarted, so every time I looked I was reading a fresh log.
Try to change your initialDelaySeconds: 60 to 100
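For example, the readiness probe from the deployment above with a longer delay would look something like this (100 is only a starting point; tune it to how long your Metabase instance actually needs to start):
readinessProbe:
  httpGet:
    path: /api/health
    port: 3000
  initialDelaySeconds: 100
  periodSeconds: 10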
You should also always set resource requests and limits on your container to avoid probe failures: when your app starts hitting its CPU limit, Kubernetes starts throttling the container and probes can time out.
containers:
- name: app
image: images.my-company.example/app:v4
resources:
requests:
memory: "128Mi"
cpu: "250m"
limits:
memory: "256Mi"
cpu: "500m"
Related
I'm deploying a simple application image that has readiness, startup, and liveness probes, using the Docker Desktop Kubernetes cluster. I searched for similar issues, but none of them matched the one I'm facing, so I created this post.
Image : rahulwagh17/kubernetes:jhooq-k8s-springboot
Below is the deployment manifest used.
apiVersion: apps/v1
kind: Deployment
metadata:
name: jhooq-springboot
spec:
replicas: 2
selector:
matchLabels:
app: jhooq-springboot
template:
metadata:
labels:
app: jhooq-springboot
spec:
containers:
- name: springboot
image: rahulwagh17/kubernetes:jhooq-k8s-springboot
resources:
requests:
memory: "128Mi"
cpu: "512m"
limits:
memory: "128Mi"
cpu: "512m"
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /hello
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
livenessProbe:
httpGet:
path: /hello
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
startupProbe:
httpGet:
path: /hello
port: 8080
failureThreshold: 60
periodSeconds: 10
env:
- name: PORT
value: "8080"
---
apiVersion: v1
kind: Service
metadata:
name: jhooq-springboot
spec:
type: NodePort
ports:
- port: 80
targetPort: 8080
selector:
app: jhooq-springboot
After deploying, the pod status is CrashLoopBackOff due to Startup probe failed: Get "http://10.1.0.36:8080/hello": dial tcp 10.1.0.36:8080: connect: connection refused
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> Successfully assigned default/jhooq-springboot-6dbc755d48-4pqcz to docker-desktop
Warning Unhealthy 7m22s (x7 over 8m42s) kubelet, docker-desktop Startup probe failed: Get "http://10.1.0.36:8080/hello": dial tcp 10.1.0.36:8080: connect: connection refused
Normal Pulled 6m56s (x4 over 8m51s) kubelet, docker-desktop Container image "rahulwagh17/kubernetes:jhooq-k8s-springboot" already present on machine
Normal Created 6m56s (x4 over 8m51s) kubelet, docker-desktop Created container springboot
Normal Started 6m56s (x4 over 8m51s) kubelet, docker-desktop Started container springboot
Warning BackOff 3m40s (x19 over 8m6s) kubelet, docker-desktop Back-off restarting failed container
I have a multi-container pod running on AWS EKS. One web app container running on port 80 and a Redis container running on port 6379.
Once the deployment goes through, manual curl requests to the pod's IP address and port from within the cluster always return good responses.
The ingress to the service is fine as well.
However, the kubelet's probes are failing, leading to restarts, and I'm not sure yet how to reproduce that probe failure or fix it.
Thanks for reading!
Here are the events:
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Warning Unhealthy pod/app-7cddfb865b-gsvbg Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s Normal Killing pod/app-7cddfb865b-gsvbg Container app failed liveness probe, will be restarted
0s Normal Pulling pod/app-7cddfb865b-gsvbg Pulling image "registry/app:latest"
0s Normal Pulled pod/app-7cddfb865b-gsvbg Successfully pulled image "registry/app:latest"
0s Normal Created pod/app-7cddfb865b-gsvbg Created container app
Making things generic, this is my deployment yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "16"
creationTimestamp: "2021-05-26T22:01:19Z"
generation: 19
labels:
app: app
chart: app-1.0.0
environment: production
heritage: Helm
owner: acme
release: app
name: app
namespace: default
resourceVersion: "234691173"
selfLink: /apis/apps/v1/namespaces/default/deployments/app
uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: app
release: app
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 100%
type: RollingUpdate
template:
metadata:
annotations:
kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
creationTimestamp: null
labels:
app: app
environment: production
owner: acme
release: app
spec:
containers:
- image: redis:5.0.6-alpine
imagePullPolicy: IfNotPresent
name: redis
ports:
- containerPort: 6379
hostPort: 6379
name: redis
protocol: TCP
resources:
limits:
cpu: 500m
memory: 500Mi
requests:
cpu: 500m
memory: 500Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- env:
- name: SYSTEM_ENVIRONMENT
value: production
envFrom:
- configMapRef:
name: app-production
- secretRef:
name: app-production
image: registry/app:latest
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 20
successThreshold: 1
timeoutSeconds: 1
name: app
ports:
- containerPort: 80
hostPort: 80
name: app
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "1"
memory: 500Mi
requests:
cpu: "1"
memory: 500Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
priorityClassName: critical-app
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2021-08-10T17:34:18Z"
lastUpdateTime: "2021-08-10T17:34:18Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: "2021-05-26T22:01:19Z"
lastUpdateTime: "2021-09-17T16:48:54Z"
message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
observedGeneration: 19
readyReplicas: 1
replicas: 1
updatedReplicas: 1
This is my service yaml:
apiVersion: v1
kind: Service
metadata:
creationTimestamp: "2021-05-05T20:11:33Z"
labels:
app: app
chart: app-1.0.0
environment: production
heritage: Helm
owner: acme
release: app
name: app
namespace: default
resourceVersion: "163989104"
selfLink: /api/v1/namespaces/default/services/app
uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
clusterIP: 172.20.184.161
externalTrafficPolicy: Cluster
ports:
- name: http
nodePort: 32648
port: 80
protocol: TCP
targetPort: 80
selector:
app: app
release: app
sessionAffinity: None
type: NodePort
status:
loadBalancer: {}
Update 10/20/2021:
So I went with the advice to tinker with the readiness probe, using these generous settings:
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
initialDelaySeconds: 300
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
These are the events:
5m21s Normal Scheduled pod/app-686494b58b-6cjsq Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s Normal Created pod/app-686494b58b-6cjsq Created container redis
5m20s Normal Started pod/app-686494b58b-6cjsq Started container redis
5m20s Normal Pulling pod/app-686494b58b-6cjsq Pulling image "registry/app:latest"
5m20s Normal Pulled pod/app-686494b58b-6cjsq Successfully pulled image "registry/app:latest"
5m20s Normal Created pod/app-686494b58b-6cjsq Created container app
5m20s Normal Pulled pod/app-686494b58b-6cjsq Container image "redis:5.0.6-alpine" already present on machine
5m19s Normal Started pod/app-686494b58b-6cjsq Started container app
0s Warning Unhealthy pod/app-686494b58b-6cjsq Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Oddly, I do see the readiness probe kick into action when I request the health-check page (the root page) manually. But be that as it may, the probe failure is not because the containers aren't running fine -- they are -- the cause is somewhere else.
Let's go over your probes so you can understand what is going on and might find a way to fix it:
### Readiness probe - "waiting" for the container to be ready
### to get to work.
###
### Liveness is executed once the pod is running which means that
### you have passed the readinessProbe so you might want to start
### with the readinessProbe first
livenessProbe:
### - Define how many retries to test the URL before restarting the pod.
### Try to increase this number and once your pod is restarted reduce
### it back to a lower value
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
###
### Delay before executing the first test
### As before - try to increase the delay and reduce it
### back when you figured out the correct value
###
initialDelaySeconds: 90
### How often (in seconds) to perform the test.
periodSeconds: 20
successThreshold: 1
### Number of seconds after which the probe times out.
### Since the value is 1 I assume that you did not change it.
### Same as before - increase the value and figure out what
### the correct value should be
timeoutSeconds: 1
### Same comments as above + `initialDelaySeconds`
### Readiness is "waiting" for the container to be ready to
### get to work.
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
### Again, nothing new here, same comments to increase the value
### and then reduce it until you figure out what is desired value
### for this probe
initialDelaySeconds: 90
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
View the logs/events
If you are not sure that the probes are the root cause, view the logs and the events to figure out the actual root cause of those failures.
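For example, a few commands that should help here (pod and container names taken from the events and deployment above; --previous shows the log of the container instance that was killed by the liveness probe):
kubectl describe pod app-7cddfb865b-gsvbg
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl logs app-7cddfb865b-gsvbg -c app --previous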
Linking my answer for
liveness and readiness probe for multiple containers in a pod
I am getting the error "No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1)" for kube-state-metrics. Please guide me on how to troubleshoot this issue.
admin@ip-172-20-58-79:~/kubernetes-prometheus$ kubectl describe po -n kube-system kube-state-metrics-747bcc4d7d-kfn7t
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s (x20 over 4m) default-scheduler No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1).
Is this issue related to memory on a node? If yes, how do I confirm it?
I checked all nodes; only one node seems to be above 80%, the rest are between 45% and 70% memory usage.
Node with 44% memory usage:
Total cluster memory usage:
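(Something like the following also shows node memory from the CLI, assuming a metrics source such as Heapster or metrics-server is available for kubectl top:)
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 "Allocated resources"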
The following screenshot shows kube-state-metrics (0/1 up):
Furthermore, Prometheus shows kubernetes-pods (0/0 up); is that due to kube-state-metrics not working, or some other reason? And why is kubernetes-apiservers (0/1 up), seen in the above screenshot, not up? How do I troubleshoot it?
admin@ip-172-20-58-79:~/kubernetes-prometheus$ sudo tail -f /var/log/kube-apiserver.log | grep error
I0110 10:15:37.153827 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60828: remote error: tls: bad certificate
I0110 10:15:42.153543 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60854: remote error: tls: bad certificate
I0110 10:15:47.153699 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60898: remote error: tls: bad certificate
I0110 10:15:52.153788 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60936: remote error: tls: bad certificate
I0110 10:15:57.154014 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60992: remote error: tls: bad certificate
E0110 10:15:58.929167 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58104: write: connection reset by peer
E0110 10:15:58.931574 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58098: write: connection reset by peer
E0110 10:15:58.933864 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58088: write: connection reset by peer
E0110 10:16:00.842018 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58064: write: connection reset by peer
E0110 10:16:00.844301 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58058: write: connection reset by peer
E0110 10:18:17.275590 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37402: write: connection reset by peer
E0110 10:18:17.275705 7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.276401 7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.277808 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37392: write: connection reset by peer
Update after MaggieO's reply:
admin@ip-172-20-58-79:~/kubernetes-prometheus/kube-state-metrics-configs$ cat deployment.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.8.0
name: kube-state-metrics
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: kube-state-metrics
template:
metadata:
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.8.0
spec:
containers:
- image: quay.io/coreos/kube-state-metrics:v1.8.0
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
name: kube-state-metrics
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: kube-state-metrics
Furthermore, I want to add the following command to the deployment.yaml above, but I am getting an indentation error. Please show me where exactly I should add it.
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
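For context, a command (or args) block sits at the same indentation level as image inside the container list entry, roughly like this (flags copied from above; whether those metrics-server flags are appropriate for the kube-state-metrics image is a separate question):
spec:
  containers:
  - name: kube-state-metrics
    image: quay.io/coreos/kube-state-metrics:v1.8.0
    command:
    - /metrics-server
    - --kubelet-insecure-tls
    - --kubelet-preferred-address-types=InternalIP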
Update 2:
@MaggieO even after adding the command/args it shows the same error and the pod is stuck in a Pending state:
Updated deployment.yaml:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "3"
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"},"name":"kube-state-metrics","namespace":"kube-system"},"spec":{"replicas":1,"selector":{"matchLabels":{"app.kubernetes.io/name":"kube-state-metrics"}},"template":{"metadata":{"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"}},"spec":{"containers":[{"args":["--kubelet-insecure-tls","--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"],"image":"quay.io/coreos/kube-state-metrics:v1.8.0","imagePullPolicy":"Always","livenessProbe":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":5,"timeoutSeconds":5},"name":"kube-state-metrics","ports":[{"containerPort":8080,"name":"http-metrics"},{"containerPort":8081,"name":"telemetry"}],"readinessProbe":{"httpGet":{"path":"/","port":8081},"initialDelaySeconds":5,"timeoutSeconds":5}}],"nodeSelector":{"kubernetes.io/os":"linux"},"serviceAccountName":"kube-state-metrics"}}}}
creationTimestamp: 2020-01-10T05:33:13Z
generation: 4
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.8.0
name: kube-state-metrics
namespace: kube-system
resourceVersion: "178851301"
selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-state-metrics
uid: b20aa645-336a-11ea-9618-0607d7cb72ed
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 2
selector:
matchLabels:
app.kubernetes.io/name: kube-state-metrics
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.8.0
spec:
containers:
- args:
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
image: quay.io/coreos/kube-state-metrics:v1.8.0
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
name: kube-state-metrics
ports:
- containerPort: 8080
name: http-metrics
protocol: TCP
- containerPort: 8081
name: telemetry
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 8081
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: kube-state-metrics
serviceAccountName: kube-state-metrics
terminationGracePeriodSeconds: 30
status:
conditions:
- lastTransitionTime: 2020-01-10T05:33:13Z
lastUpdateTime: 2020-01-10T05:33:13Z
message: Deployment does not have minimum availability.
reason: MinimumReplicasUnavailable
status: "False"
type: Available
- lastTransitionTime: 2020-01-15T07:24:27Z
lastUpdateTime: 2020-01-15T07:29:12Z
message: ReplicaSet "kube-state-metrics-7f8c9c6c8d" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
observedGeneration: 4
replicas: 2
unavailableReplicas: 2
updatedReplicas: 1
Update 3: The pod is not able to get scheduled onto a node, as shown in the following screenshot. Please let me know how to troubleshoot this issue.
The error on kubernetes-apiservers, Get https:// ...: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 172.20.58.79, means that control-plane nodes are targeted randomly and the apiEndpoint only changes when a node is deleted from the cluster; it is not immediately noticeable because it requires changes to the nodes in the cluster.
Workaround fix: manually synchronize kube-apiserver.pem between master nodes and restart kube-apiserver container.
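A minimal sketch of that manual sync, assuming the certificate path and hosts from your own setup (replace the path, host, and container ID with yours):
scp /path/to/kube-apiserver.pem root@<other-master>:/path/to/kube-apiserver.pem
docker ps | grep kube-apiserver   # find the running apiserver container
docker restart <container-id>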
You can also remove the apiserver and apiserver-kubelet-client certificates and recreate them with these commands:
$ kubeadm init phase certs apiserver --config=/etc/kubernetes/kubeadm-config.yaml
$ kubeadm init phase certs apiserver-kubelet-client --config=/etc/kubernetes/kubeadm-config.yaml
$ systemctl stop kubelet
delete the docker container with kubelet
$ systemctl restart kubelet
Similar problems: x509 certificate, kubelet-x509.
Then solve the problem with the metrics server.
Change the metrics-server-deployment.yaml file, and set the following args:
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
The metrics-server is now able to talk to the node (it was failing before because it could not resolve the hostname of the node).
You can find more information here: metrics-server-issue.
For the past couple of days we have been experiencing an intermittent deployment failure when deploying (via Helm) to Kubernetes v1.11.2.
When it fails, kubectl describe <deployment> usually reports that the container failed to create:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1s default-scheduler Successfully assigned default/pod-fc5c8d4b8-99npr to fh1-node04
Normal Pulling 0s kubelet, fh1-node04 pulling image "docker-registry.internal/pod:0e5a0cb1c0e32b6d0e603333ebb81ade3427ccdd"
Error from server (BadRequest): container "pod" in pod "pod-fc5c8d4b8-99npr" is waiting to start: ContainerCreating
and the only issue we can find in the kubelet logs is:
58468 kubelet_pods.go:146] Mount cannot be satisfied for container "pod", because the volume is missing or the volume mounter is nil: {Name:default-token-q8k7w ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}
58468 kuberuntime_manager.go:733] container start failed: CreateContainerConfigError: cannot find volume "default-token-q8k7w" to mount container start failed: CreateContainerConfigError: cannot find volume "default-token-q8k7w" to mount into container "pod"
It's intermittent, meaning it fails roughly once in every 20 or so deployments. Re-running the deployment works as expected.
The cluster and node health all look fine at the time of the deployment, so we are at a loss as to where to go from here. Looking for advice on where to start next on diagnosing the issue.
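One sanity check that should be possible right after a failed rollout (secret name taken from the kubelet log above) is whether the default service-account token secret actually exists at that moment:
kubectl get serviceaccount default -o yaml
kubectl get secret default-token-q8k7w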
EDIT: As requested, the deployment file is generated via a Helm template and the output is shown below. For further information, the same Helm template is used for a lot of our services, but only this particular service has this intermittent issue:
apiVersion: apps/v1beta2
kind: Deployment
metadata:
name: pod
labels:
app: pod
chart: pod-0.1.0
release: pod
heritage: Tiller
environment: integration
annotations:
kubernetes.io/change-cause: https://github.com/path_to_release
spec:
replicas: 2
revisionHistoryLimit: 3
selector:
matchLabels:
app: pod
release: pod
environment: integration
strategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
labels:
app: pod
release: pod
environment: integration
spec:
containers:
- name: pod
image: "docker-registry.internal/pod:0e5a0cb1c0e32b6d0e603333ebb81ade3427ccdd"
env:
- name: VAULT_USERNAME
valueFrom:
secretKeyRef:
name: "pod-integration"
key: username
- name: VAULT_PASSWORD
valueFrom:
secretKeyRef:
name: "pod-integration"
key: password
imagePullPolicy: IfNotPresent
command: ['mix', 'phx.server']
ports:
- name: http
containerPort: 80
protocol: TCP
envFrom:
- configMapRef:
name: pod
livenessProbe:
httpGet:
path: /api/health
port: http
initialDelaySeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: http
initialDelaySeconds: 10
resources:
limits:
cpu: 750m
memory: 200Mi
requests:
cpu: 500m
memory: 150Mi
When running a deployment I get downtime: requests start failing after a variable amount of time (20-40 seconds).
The readiness check for the entry container fails when the preStop hook sends SIGUSR1, waits 31 seconds, and then sends SIGTERM. Within that timeframe the pod should be removed from the service, since the readiness check is set to fail after 2 failed attempts at 5-second intervals.
How can I see the events for pods being added and removed from the service to find out what's causing this?
And events around the readiness checks themselves?
I use Google Container Engine version 1.2.2 and use GCE's network load balancer.
service:
apiVersion: v1
kind: Service
metadata:
name: myapp
labels:
app: myapp
spec:
type: LoadBalancer
ports:
- name: http
port: 80
targetPort: http
protocol: TCP
- name: https
port: 443
targetPort: https
protocol: TCP
selector:
app: myapp
deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
strategy:
type: RollingUpdate
revisionHistoryLimit: 10
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
version: 1.0.0-61--66-6
spec:
containers:
- name: myapp
image: ****
resources:
limits:
cpu: 100m
memory: 250Mi
requests:
cpu: 10m
memory: 125Mi
ports:
- name: http-direct
containerPort: 5000
livenessProbe:
httpGet:
path: /status
port: 5000
initialDelaySeconds: 30
timeoutSeconds: 1
lifecycle:
preStop:
exec:
# SIGTERM triggers a quick exit; gracefully terminate instead
command: ["sleep 31;"]
- name: haproxy
image: travix/haproxy:1.6.2-r0
imagePullPolicy: Always
resources:
limits:
cpu: 100m
memory: 100Mi
requests:
cpu: 10m
memory: 25Mi
ports:
- name: http
containerPort: 80
- name: https
containerPort: 443
env:
- name: "SSL_CERTIFICATE_NAME"
value: "ssl.pem"
- name: "OFFLOAD_TO_PORT"
value: "5000"
- name: "HEALT_CHECK_PATH"
value: "/status"
volumeMounts:
- name: ssl-certificate
mountPath: /etc/ssl/private
livenessProbe:
httpGet:
path: /status
port: 443
scheme: HTTPS
initialDelaySeconds: 30
timeoutSeconds: 1
readinessProbe:
httpGet:
path: /readiness
port: 81
initialDelaySeconds: 0
timeoutSeconds: 1
periodSeconds: 5
successThreshold: 1
failureThreshold: 2
lifecycle:
preStop:
exec:
# SIGTERM triggers a quick exit; gracefully terminate instead
command: ["kill -USR1 1; sleep 31; kill 1"]
volumes:
- name: ssl-certificate
secret:
secretName: ssl-c324c2a587ee-20160331
When a probe fails, the prober emits a Warning event with the reason Unhealthy and a message like xx probe errored: xxx.
You should be able to find those events using either kubectl get events or kubectl describe pods -l app=myapp,version=1.0.0-61--66-6 (filtering pods by their labels).
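For example, against the deployment above (the label selector comes straight from the pod template):
kubectl get events --watch
kubectl describe pods -l app=myapp,version=1.0.0-61--66-6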