I'm using AWS EKS Fargate to deploy my work. After applying the deployment YAML file, everything works for the first 10 minutes, but after that I can no longer access the pod with kubectl exec <podname> -- bash. When I run kubectl describe pod <podname>, both the readinessProbe and the livenessProbe report messages similar to the ones below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled 16m fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled 15m fargate-scheduler Successfully assigned k8s-fargate/k8s-api-5765846f76-d7nws to fargate-ip-10-0-130-250.ap-east-1.compute.internal
Normal Pulling 15m kubelet Pulling image "awsaccid.dkr.ecr.ap-east-1.amazonaws.com/k8s-api-test:1.0.0"
Normal Pulled 14m kubelet Successfully pulled image "awsaccid.dkr.ecr.ap-east-1.amazonaws.com/k8s-api-test:1.0.0" in 1m18.703187993s
Normal Created 14m kubelet Created container k8s-api
Normal Started 14m kubelet Started container k8s-api
Warning Unhealthy 2m18s kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "c2a2e9750a44684104a7e76a92bf7abe814ba29f306b092a48e17b90aab7f2dd": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: resource temporarily unavailable: unknown
Warning Unhealthy 2m13s kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "bcb60f638e2c364adc8694bc12f00660e2b0d7647d3861d3462727976d2df08c": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: resource temporarily unavailable: unknown
Warning Unhealthy 2m8s kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "988d29870b88fdcaae3cedf1071e79d2a786638c801364d71b6c7886f0be79e1": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: resource temporarily unavailable: unknown
Moreover, the livenessProbe hasn't restarted the pod even though it is unhealthy.
I have spent a whole day on this and still can't solve it. Does anyone know what the problem is? Thank you so much.
Here's my deploy.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: k8s-fargate
  name: k8s-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-api
  template:
    metadata:
      labels:
        app: k8s-api
    spec:
      volumes:
        - name: k8s-properties
          configMap:
            name: k8s-properties
      containers:
        - name: k8s-api
          image: awsaccountid.dkr.ecr.ap-east-1.amazonaws.com/k8s-test:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8443
          resources:
            requests:
              memory: "1024Mi"
              cpu: "200m"
            limits:
              memory: "2500Mi"
              cpu: "1000m"
          volumeMounts:
            - name: k8s-properties
              mountPath: "/usr/local/folder"
              readOnly: false
          livenessProbe:
            exec:
              command:
                - cat
                - /usr/local/folder/file
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            exec:
              command:
                - cat
                - /usr/local/folder/file
            initialDelaySeconds: 5
            periodSeconds: 5
The problem was solved by building a new Docker image. I still don't know what caused the error, but it most likely came from the container image itself.
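For anyone hitting the same message: "resource temporarily unavailable" is the text of the EAGAIN error, which is commonly returned when a new process cannot be created, for example because a process or thread limit has been reached inside the container. That cause is only an assumption here, since it was never confirmed for this image. As far as I understand, the kubelet also discards the result of a probe that "errored" (as opposed to one that "failed"), which would explain why the liveness probe never restarted the pod. One possible workaround is a probe that does not exec anything inside the container. The sketch below assumes the application accepts TCP connections on containerPort 8443; note that it checks the port, not the existence of /usr/local/folder/file.

```yaml
# Sketch only: probes that do not start a process inside the container.
# Assumption: the application listens on containerPort 8443.
livenessProbe:
  tcpSocket:
    port: 8443
  initialDelaySeconds: 5
  periodSeconds: 30
readinessProbe:
  tcpSocket:
    port: 8443
  initialDelaySeconds: 5
  periodSeconds: 5
```

If the file check itself matters, the exec probes would have to stay, and the process or thread usage of the image would need to be investigated instead.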
I have successfully created a custom Kafka connector image containing Confluent Hub connectors.
I am trying to create a pod and a service to launch it in GCP with Kubernetes.
How should I configure the YAML file? I took the following code from the quick-start guide. This is what I've tried:
Dockerfile:
FROM confluentinc/cp-kafka-connect-base:latest
ENV CONNECT_PLUGIN_PATH="/usr/share/java,/usr/share/confluent-hub-components,/usr/share/java/kafka-connect-jdbc"
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:10.2.6
RUN confluent-hub install --no-prompt debezium/debezium-connector-mysql:1.7.1
RUN confluent-hub install --no-prompt debezium/debezium-connector-postgresql:1.7.1
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-oracle-cdc:1.5.0
RUN wget -O /usr/share/confluent-hub-components/confluentinc-kafka-connect-jdbc/lib/mysql-connector-java-8.0.26.jar https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar
Modified part of confluent-platform.yaml:
apiVersion: platform.confluent.io/v1beta1
kind: Connect
metadata:
  name: connect
  namespace: confluent
spec:
  replicas: 1
  image:
    application: maxprimeaery/kafka-connect-jdbc:latest #confluentinc/cp-server-connect:7.0.1
    init: confluentinc/confluent-init-container:2.2.0-1
  configOverrides:
    server:
      - config.storage.replication.factor=1
      - offset.storage.replication.factor=1
      - status.storage.replication.factor=1
  podTemplate:
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
    probe:
      liveness:
        periodSeconds: 10
        failureThreshold: 5
        timeoutSeconds: 500
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
      runAsNonRoot: true
And this is the error I get in the console for the connect-0 pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 45m default-scheduler Successfully assigned confluent/connect-0 to gke-my-kafka-cluster-default-pool-6ee97fb9-fh9w
Normal Pulling 45m kubelet Pulling image "confluentinc/confluent-init-container:2.2.0-1"
Normal Pulled 45m kubelet Successfully pulled image "confluentinc/confluent-init-container:2.2.0-1" in 17.447881861s
Normal Created 45m kubelet Created container config-init-container
Normal Started 45m kubelet Started container config-init-container
Normal Pulling 45m kubelet Pulling image "maxprimeaery/kafka-connect-jdbc:latest"
Normal Pulled 44m kubelet Successfully pulled image "maxprimeaery/kafka-connect-jdbc:latest" in 23.387676944s
Normal Created 44m kubelet Created container connect
Normal Started 44m kubelet Started container connect
Warning Unhealthy 41m (x5 over 42m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 404
Normal Killing 41m kubelet Container connect failed liveness probe, will be restarted
Warning Unhealthy 5m (x111 over 43m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 404
Warning BackOff 17s (x53 over 22m) kubelet Back-off restarting failed container
Should I create a separate pod and service for the custom Kafka connector, or do I have to change the configuration above?
UPDATE to my question
I've found out how to configure it in Kubernetes by adding this to the Connect pod:
apiVersion: platform.confluent.io/v1beta1
kind: Connect
metadata:
  name: connect
  namespace: confluent
spec:
  replicas: 1
  image:
    application: confluentinc/cp-server-connect:7.0.1
    init: confluentinc/confluent-init-container:2.2.0-1
  configOverrides:
    server:
      - config.storage.replication.factor=1
      - offset.storage.replication.factor=1
      - status.storage.replication.factor=1
  build:
    type: onDemand
    onDemand:
      plugins:
        locationType: confluentHub
        confluentHub:
          - name: kafka-connect-jdbc
            owner: confluentinc
            version: 10.2.6
          - name: kafka-connect-oracle-cdc
            owner: confluentinc
            version: 1.5.0
          - name: debezium-connector-mysql
            owner: debezium
            version: 1.7.1
          - name: debezium-connector-postgresql
            owner: debezium
            version: 1.7.1
      storageLimit: 4Gi
  podTemplate:
    resources:
      requests:
        cpu: 200m
        memory: 1024Mi
    probe:
      liveness:
        periodSeconds: 180 #DONT CHANGE THIS
        failureThreshold: 5
        timeoutSeconds: 500
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
      runAsNonRoot: true
But I still can't add the mysql-connector from the Maven repo.
I also tried building a new Docker image, but that doesn't work either. I also tried this additional piece of configuration:
locationType: url #NOT WORKING. NO IDEA HOW TO CONFIGURE THAT
url:
  - name: mysql-connector-java
    archivePath: https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar
    checksum: sha512sum #definitely wrong
After some retries I found out that I just had to wait a little bit longer.
probe:
  liveness:
    periodSeconds: 180 #DONT CHANGE THIS
    failureThreshold: 5
    timeoutSeconds: 500
Setting periodSeconds: 180 gives the pod more time to reach the Running state, so I can just use my own image:
image:
  application: maxprimeaery/kafka-connect-jdbc:5.0
  init: confluentinc/confluent-init-container:2.2.0-1
The build part can be removed after those changes.
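The effect of periodSeconds can be worked out roughly: the kubelet restarts a container after failureThreshold consecutive liveness failures, so the container gets about failureThreshold x periodSeconds before the first restart. The annotated fragment below repeats the liveness settings from this thread with that arithmetic as comments. It ignores any initial delay and assumes the operator passes these values through to the pod's liveness probe unchanged.

```yaml
probe:
  liveness:
    # Original setting was periodSeconds: 10, i.e. roughly 5 x 10s = 50s
    # of failing probes before the container is killed and restarted.
    periodSeconds: 180   # one probe every 3 minutes
    failureThreshold: 5  # roughly 5 x 180s = 900s (15 minutes) before a restart
    timeoutSeconds: 500
```

A Connect worker that needs a few minutes to load its plugins fits inside the second window but not the first, which is consistent with the restarts seen in the events above.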
I was trying to test a scenario where a pod mounts a volume and tries to write a file to it. The YAML below works fine when I exclude command and args. However, with command and args it fails with "CrashLoopBackOff".
The describe command doesn't provide much information about the failure. What's wrong here?
Note: I was running this YAML on Katacoda.
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    run: voltest
  name: voltest
spec:
  replicas: 1
  selector:
    matchLabels:
      run: voltest
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: voltest
    spec:
      containers:
      - image: nginx
        name: voltest
        volumeMounts:
        - mountPath: /var/local/aaa
          name: mydir
        command: ["/bin/sh"]
        args: ["-c", "echo 'test complete' > /var/local/aaa/testOut.txt"]
      volumes:
      - name: mydir
        hostPath:
          path: /var/local/aaa
          type: DirectoryOrCreate
Describe command output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 49s default-scheduler Successfully assigned default/voltest-78678dd56c-h5frs to controlplane
Normal Pulling 19s (x3 over 48s) kubelet, controlplane Pulling image "nginx"
Normal Pulled 17s (x3 over 39s) kubelet, controlplane Successfully pulled image "nginx"
Normal Created 17s (x3 over 39s) kubelet, controlplane Created container voltest
Normal Started 17s (x3 over 39s) kubelet, controlplane Started container voltest
Warning BackOff 5s (x4 over 35s) kubelet, controlplane Back-off restarting failed container
You've configured your pod to run a single shell command:
command: ["/bin/sh"]
args: ["-c", "echo 'test complete' > /var/testOut.txt"]
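This means that the pod starts up, runs echo 'test complete' > /var/testOut.txt, and then immediately exits. From the perspective
of Kubernetes, this is a crash: the pods of a Deployment are expected to keep running, so the container is restarted
each time it exits, with an increasing back-off delay.
You've replaced the default behavior of the nginx image ("run
nginx") with a shell command.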
If you want the pod to continue running, you'll need to arrange for it
to run some sort of long-running command. A simple solution would be
something like:
command: ["/bin/sh"]
args: ["-c", "echo 'test complete' > /var/testOut.txt; sleep 3600"]
This will cause the pod to sleep for an hour before exiting, giving
you time to inspect the results of your shell command.
Note that your shell command isn't testing anything useful; you've
mounted your mydir volume on /var/local/aaa, but your shell
command is writing to /var/testOut.txt, so it's not making any use
of the volume.
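As a sketch of both points combined, the container section below writes into the mounted path and then sleeps. It assumes the intent was to write to the mydir volume mounted at /var/local/aaa; the one-hour sleep is an arbitrary choice to leave time for inspection.

```yaml
containers:
- image: nginx
  name: voltest
  volumeMounts:
  - mountPath: /var/local/aaa
    name: mydir
  command: ["/bin/sh"]
  # Write into the mounted volume, then keep the container alive for an hour
  # so that it does not exit immediately and get restarted.
  args: ["-c", "echo 'test complete' > /var/local/aaa/testOut.txt; sleep 3600"]
```

Because mydir is a hostPath volume, the file should then also be visible under /var/local/aaa on the node that runs the pod.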
I'm running Kubernetes on an EC2 machine on AWS.
The node runs Ubuntu.
My metrics-server version:
wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.7/components.yaml
components.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  serviceAccountName: metrics-server
  volumes:
  # mount in tmp so we can safely use from-scratch images and/or read-only containers
  - name: tmp-dir
    emptyDir: {}
  containers:
  - name: metrics-server
    image: k8s.gcr.io/metrics-server/metrics-server:v0.3.7
    imagePullPolicy: IfNotPresent
    args:
      - --cert-dir=/tmp
      - --secure-port=4443
      - --kubelet-preferred-address-type=InternalIP,ExternalIP,Hostname
      - --kubelet-insecure-tls
Even after adding these args, the error still appears.
Error:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
or
error: metrics not available yet
No matter how long I wait, the error keeps appearing.
My kops version: Version 1.18.0 (git-698bf974d8)
I use Calico for networking.
Please help.
++
I also tried wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.0/components.yaml
and looked at the logs:
kubectl logs -n kube-system deploy/metrics-server
"Failed to scrape node" err="GET "https://172.20.51.226:10250/stats/summary?only_cpu_and_memory=true": bad status code "401 Unauthorized"" node="ip-172-20-51-226.ap-northeast-2.compute.internal"
"Failed probe" probe="metric-storage-ready" err="not metrics to serve"
Download the components.yaml file manually:
wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Then edit the args section under Deployment:
spec:
  containers:
  - args:
    - --cert-dir=/tmp
    - --secure-port=443
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
Add two more lines there:
- --kubelet-insecure-tls=true
- --kubelet-preferred-address-types=InternalIP
The kubelet's port 10250 uses HTTPS, so the connection has to be verified with a TLS certificate. Adding --kubelet-insecure-tls tells metrics-server not to verify the serving certificate presented by the kubelet.
After this modification, apply the manifest:
kubectl apply -f components.yaml
Wait a minute and you will see that the metrics-server pod is up.
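One caveat: the "401 Unauthorized" in the log quoted in the question is an authentication failure rather than a TLS one, so it may persist after adding --kubelet-insecure-tls. On kops clusters, a commonly reported remedy is to enable webhook token authentication on the kubelet. The fragment below is a sketch of the relevant part of the cluster spec (opened with kops edit cluster); it is an assumption that this applies to the cluster in the question, and it has not been verified there.

```yaml
# Fragment of the kops Cluster spec, not a complete manifest.
# Assumption: the kubelet currently rejects the metrics-server service
# account token, which is what the 401 response suggests.
spec:
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
```

A change like this only takes effect once the cluster has been updated and the nodes have been rolled, so expect the metrics to stay unavailable until then.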
The last comment is useful. You can also edit the Deployment directly; adding the line "--kubelet-insecure-tls=true" was enough for me.
Edit the Deployment:
$ kubectl edit deployment.apps/metrics-server -n kube-system
Add the line:
- --kubelet-insecure-tls=true
The result looks similar to this:
containers:
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls=true
Save with ":wq" and enjoy:
~$ kubectl top pods -n kube-system
NAME CPU(cores) MEMORY(bytes)
coredns-6d4b75cb6d-k8dmc 3m 18Mi
coredns-6d4b75cb6d-wxxn6 3m 17Mi
kube-apiserver-k8s-master1 82m 306Mi
kube-apiserver-k8s-master2 65m 247Mi
kube-controller-manager-k8s-master1 32m 47Mi
kube-controller-manager-k8s-master2 4m 19Mi
kube-proxy-9dbgk 1m 9Mi
kube-proxy-bwhdm 1m 14Mi
kube-proxy-fz8v8 1m 15Mi
kube-proxy-vcnrc 1m 9Mi
kube-scheduler-k8s-master1 7m 18Mi
kube-scheduler-k8s-master2 4m 16Mi
metrics-server-79576f7ff-97tpc 6m 15Mi
metrics-server-79576f7ff-qzczp 4m 13Mi
~$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k8s-master1 318m 15% 1047Mi 55%
k8s-master2 208m 10% 1002Mi 52%
k8s-worker1 30m 3% 804Mi 42%
k8s-worker2 35m 3% 550Mi 29%
I'm using K8S 1.14 and Helm 3.3.1.
I have an app which works when deployed without probes. Then I set two trivial probes:
livenessProbe:
  exec:
    command:
      - ls
      - /mnt
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  exec:
    command:
      - ls
      - /mnt
  initialDelaySeconds: 5
  periodSeconds: 5
When I deploy via helm upgrade, the command eventually (~5 mins) fails with:
Error: UPGRADE FAILED: release my-app failed, and has been rolled back due to atomic being set: timed out waiting for the condition
But in the events log there is no trace of any probe:
5m21s Normal ScalingReplicaSet deployment/my-app Scaled up replica set my-app-7 to 1
5m21s Normal Scheduled pod/my-app-7-6 Successfully assigned default/my-app-7-6 to gke-foo-testing-foo-testing-node-po-111-r0cu
5m21s Normal LoadBalancerNegNotReady pod/my-app-7-6 Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-222-default-my-app-80-54]
5m21s Normal SuccessfulCreate replicaset/my-app-7 Created pod: my-app-7-6
5m20s Normal Pulling pod/my-app-7-6 Pulling image "my-registry/my-app:v0.1"
5m20s Normal Pulled pod/my-app-7-6 Successfully pulled image "my-registry/my-app:v0.1"
5m20s Normal Created pod/my-app-7-6 Created container my-app
5m20s Normal Started pod/my-app-7-6 Started container my-app
5m15s Normal Attach service/my-app Attach 1 network endpoint(s) (NEG "k8s1-222-default-my-app-80-54" in zone "europe-west3-a")
19s Normal ScalingReplicaSet deployment/my-app Scaled down replica set my-app-7 to 0
19s Normal SuccessfulDelete replicaset/my-app-7 Deleted pod: my-app-7-6
19s Normal Killing pod/my-app-7-6 Stopping container my-app
Hence the question: what are the probes doing and where?
Try deleting the release and then re-applying it: helm uninstall <APPNAME> (on Helm 2 this was helm del --purge <APPNAME>).
Also, which Helm version are you using? If it is older than v3.2.1, try upgrading; there is an open issue that tries to fix this problem with previously failed upgrades: https://github.com/helm/helm/issues/5939
I reproduced the same scenario here and everything went fine: the release was deployed and the pod is running. Did you check inside the container whether /mnt really exists?
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m41s Successfully assigned default/nginx-deployment2-5cdd568667-blsc7 to minikube
Normal Pulling 3m41s kubelet, minikube Pulling image "nginx"
Normal Pulled 3m38s kubelet, minikube Successfully pulled image "nginx" in 2.769840982s
Normal Created 3m38s kubelet, minikube Created container nginx
Normal Started 3m38s kubelet, minikube Started container nginx
NAME READY STATUS RESTARTS AGE
nginx-deployment2-5cdd568667-blsc7 1/1 Running 0 4m59s
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment2
spec:
  selector:
    matchLabels:
      app: ameba
  replicas: 1
  template:
    metadata:
      labels:
        app: ameba
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: nginx-port
        livenessProbe:
          exec:
            command:
              - ls
              - /mnt
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          exec:
            command:
              - ls
              - /mnt
          initialDelaySeconds: 5
          periodSeconds: 5
I don't know if your image includes bash, but if you just want to verify that the directory exists, you can do the same thing with other shell commands. Try this:
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - ls /mnt
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - ls /mnt
  initialDelaySeconds: 5
  periodSeconds: 5
In bash you can also use the built-in test syntax:
[[ -d /mnt ]] = -d checks whether the directory /mnt exists.
As an alternative, there is also the stat command:
stat /mnt
If you want to check whether the directory contains a specific file, use the complete path, including the filename.
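A sketch of the directory check as a probe, assuming the image ships /bin/sh. It uses the POSIX test -d form rather than [[ -d ]], since [[ is not available in every shell. The probe succeeds only when the command exits with status 0.

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Exit status 0 only if /mnt exists and is a directory.
      - test -d /mnt
  initialDelaySeconds: 5
  periodSeconds: 5
```

To check for a specific file instead, the same shape works with test -f and the full path to the file.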
A Linux container pod, running a Docker image from Azure Container Registry, keeps restarting with restartPolicy set to Always. The pod description is below.
kubectl describe pod example-pod
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jun 2020 03:27:11 +0000
Finished: Thu, 11 Jun 2020 03:27:12 +0000
...
Back-off restarting failed container
This pod is created with a secret to access the ACR repository.
The reason is that the pod completes execution successfully with exit code 0, whereas it should keep listening on a particular port. The Microsoft documentation covers this in Container Group Runtime, under the header "Container continually exits and restarts".
The content of deployment-example.yml is below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  namespace: development
  labels:
    app: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: contentocr.azurecr.io/example:latest
        #command: ["ping -t localhost"]
        imagePullPolicy: Always
        ports:
        - name: http-port
          containerPort: 3000
      imagePullSecrets:
      - name: regpass
      restartPolicy: Always
      nodeSelector:
        agent: linux
---
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: development
  labels:
    app: example
spec:
  ports:
  - name: http-port
    port: 3000
    targetPort: 3000
  selector:
    app: example
  type: LoadBalancer
The output of kubectl get events is below.
3m39s Normal Scheduled pod/example-deployment-5dc964fcf8-gbm5t Successfully assigned development/example-deployment-5dc964fcf8-gbm5t to aks-agentpool-18342716-vmss000000
2m6s Normal Pulling pod/example-deployment-5dc964fcf8-gbm5t Pulling image "contentocr.azurecr.io/example:latest"
2m5s Normal Pulled pod/example-deployment-5dc964fcf8-gbm5t Successfully pulled image "contentocr.azurecr.io/example:latest"
2m5s Normal Created pod/example-deployment-5dc964fcf8-gbm5t Created container example
2m49s Normal Started pod/example-deployment-5dc964fcf8-gbm5t Started container example
2m20s Warning BackOff pod/example-deployment-5dc964fcf8-gbm5t Back-off restarting failed container
6m6s Normal SuccessfulCreate replicaset/example-deployment-5dc964fcf8 Created pod: example-deployment-5dc964fcf8-2fdt5
3m39s Normal SuccessfulCreate replicaset/example-deployment-5dc964fcf8 Created pod: example-deployment-5dc964fcf8-gbm5t
6m6s Normal ScalingReplicaSet deployment/example-deployment Scaled up replica set example-deployment-5dc964fcf8 to 1
3m39s Normal ScalingReplicaSet deployment/example-deployment Scaled up replica set example-deployment-5dc964fcf8 to 1
3m38s Normal EnsuringLoadBalancer service/example Ensuring load balancer
3m34s Normal EnsuredLoadBalancer service/example Ensured load balancer
The Dockerfile entry point is ENTRYPOINT ["npm", "start"] with CMD ["tail -f /dev/null/"].
It runs locally, where the CI="true" flag is assigned implicitly. With docker-compose, however, stdin_open: true or tty: true has to be set, and in the Kubernetes deployment file an environment variable named CI has to be set to "true".
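In the Deployment from the question, setting that variable could look like the sketch below. Only the env entry is new; the container name and image are copied from deployment-example.yml, and whether CI="true" is enough for a given npm start script is an assumption that depends on the application.

```yaml
containers:
- name: example
  image: contentocr.azurecr.io/example:latest
  imagePullPolicy: Always
  env:
  # Quoted so that YAML treats the value as a string, not a boolean.
  - name: CI
    value: "true"
  ports:
  - name: http-port
    containerPort: 3000
```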
The command below solved my problem:
az aks update -n aks-nks-k8s-cluster -g aks-nks-k8s-rg --attach-acr aksnksk8s
After executing the command, the following is displayed:
Add ROLE Propagation done [###############] 100.0000%
and then
Running.. followed by the response after some time.
Here:
aks-nks-k8s-cluster : the name of the cluster I created and am using
aks-nks-k8s-rg : the resource group I created and am using
aksnksk8s : the container registry I created and am using