I'm trying to create pods in a GKE cluster. There is a Docker container with some Python code, plus a sidecar container for access to the SQL database. The deployment worked perfectly before, but after a few weeks I tried to redeploy with kubectl apply -f file_name.yaml.
The pods get created with a 'Pending' status and disappear after about 15 seconds. This happens every time, and I can't access any logs; kubectl get pods also returns nothing after those 15 seconds.
Not sure where to go from here... Any help would be appreciated!
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
pyxy-web-v1'
creationTimestamp: "2020-05-14T00:38:09Z"
labels:
run: pyxy-web-v1
name: pyxy-web-v1
namespace: default
resourceVersion: "1215073"
selfLink: /api/v1/namespaces/default/pods/pyxy-web-v1
uid: *omitted
spec:
containers:
- image: gcr.io/my-project-{*omitted}/pyxy-web:latest
imagePullPolicy: Always
name: pyxy-web-v1
ports:
- containerPort: 8080
protocol: TCP
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: cloudsql-db-credentials
key: *omitted
- name: DB_PASS
valueFrom:
secretKeyRef:
name: cloudsql-db-credentials
key: *omitted
resources:
requests:
cpu: 100m
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-94bct
readOnly: true
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.16
command: ["/cloud_sql_proxy",
"-instances=my-project-{*omitted}:us-central1:routing-app-v1=tcp:3306",
# If running on a VPC, the Cloud SQL proxy can connect via Private IP. See:
# https://cloud.google.com/sql/docs/mysql/private-ip for more info.
# "-ip_address_types=PRIVATE",
"-credential_file=/secrets/cloudsql/credentials.json"]
# [START cloudsql_security_context]
securityContext:
runAsUser: 2 # non-root user
allowPrivilegeEscalation: false
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: gke-pyxy-cluster-default-pool-{*omitted}
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 180
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-94bct
secret:
defaultMode: 420
secretName: default-token-94bct
- name: cloudsql-instance-credentials
secret:
secretName: cloudsql-instance-credentials
During the ~15-second Pending period, kubectl describe pods returns the following.
Name: pyxy-web-v1
Namespace: default
Priority: 0
Node: gke-pyxy-cluster-default-pool-{*omitted}/
Labels: run=pyxy-web-v1
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container cloudsql-proxy
Status: Pending
IP:
IPs: <none>
Containers:
pyxy-web-v1:
Image: gcr.io/my-project-{*omitted}/pyxy-web:latest
Port: 8080/TCP
Host Port: 0/TCP
Requests:
cpu: 100m
Environment:
DB_USER: <set to the key '*omitted' in secret 'cloudsql-db-credentials'> Optional: false
DB_PASS: <set to the key '*omitted' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-94bct (ro)
cloudsql-proxy:
Image: gcr.io/cloudsql-docker/gce-proxy:1.16
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=my-project-{*omitted}:us-central1:routing-app-v1=tcp:3306
-credential_file=/secrets/cloudsql/credentials.json
Requests:
cpu: 100m
Environment: <none>
Mounts:
/secrets/cloudsql from cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-94bct (ro)
Volumes:
default-token-94bct:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-94bct
Optional: false
cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-instance-credentials
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
However, after that window it returns
'No resources found in default namespace.'
Answer
The Pod spec had a node name for a Node that was no longer in the cluster (it had been removed during an upgrade). That is to say, pod.spec.nodeName was stale.
From kubectl explain pod.spec:
nodeName <string>
NodeName is a request to schedule this pod onto a specific node. If it is
non-empty, the scheduler simply schedules this pod onto that node, assuming
that it fits resource requirements.
During the ~15-second window in which the Pod was Pending, the following error pointed to the solution:
Error from server (NotFound): pods "gke-pyxy-cluster-default-pool-94aa0302-pm35" not found
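A minimal sketch of the fix, assuming (as the metadata above suggests) the manifest was exported from a live pod, which is how server-populated fields such as nodeName, resourceVersion and uid end up in it:
# confirm that the node referenced by pod.spec.nodeName no longer exists
kubectl get nodes

# remove the server-populated fields (nodeName, resourceVersion, uid,
# creationTimestamp, selfLink) from file_name.yaml, then recreate the pod
# so the scheduler is free to place it on a live node
kubectl delete pod pyxy-web-v1 --ignore-not-found
kubectl apply -f file_name.yaml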
Related
After I installed Prometheus using Helm in the Kubernetes cluster, the pod shows an error like this:
0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
This is the pod YAML:
apiVersion: v1
kind: Pod
metadata:
name: kube-prometheus-1660560589-node-exporter-n7rzg
generateName: kube-prometheus-1660560589-node-exporter-
namespace: reddwarf-monitor
uid: 73986565-ccd8-421c-bcbb-33879437c4f3
resourceVersion: '71494023'
creationTimestamp: '2022-08-15T10:51:07Z'
labels:
app.kubernetes.io/instance: kube-prometheus-1660560589
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: node-exporter
controller-revision-hash: 65c69f9b58
helm.sh/chart: node-exporter-3.0.8
pod-template-generation: '1'
ownerReferences:
- apiVersion: apps/v1
kind: DaemonSet
name: kube-prometheus-1660560589-node-exporter
uid: 921f98b9-ccc9-4e84-b092-585865bca024
controller: true
blockOwnerDeletion: true
status:
phase: Pending
conditions:
- type: PodScheduled
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-08-15T10:51:07Z'
reason: Unschedulable
message: >-
0/1 nodes are available: 1 node(s) didn't have free ports for the
requested pod ports.
qosClass: BestEffort
spec:
volumes:
- name: proc
hostPath:
path: /proc
type: ''
- name: sys
hostPath:
path: /sys
type: ''
- name: kube-api-access-9fj8v
projected:
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
name: kube-root-ca.crt
items:
- key: ca.crt
path: ca.crt
- downwardAPI:
items:
- path: namespace
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
defaultMode: 420
containers:
- name: node-exporter
image: docker.io/bitnami/node-exporter:1.3.1-debian-11-r23
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--web.listen-address=0.0.0.0:9100'
- >-
--collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
- >-
--collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
ports:
- name: metrics
hostPort: 9100
containerPort: 9100
protocol: TCP
resources: {}
volumeMounts:
- name: proc
readOnly: true
mountPath: /host/proc
- name: sys
readOnly: true
mountPath: /host/sys
- name: kube-api-access-9fj8v
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
livenessProbe:
httpGet:
path: /
port: metrics
scheme: HTTP
initialDelaySeconds: 120
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
readinessProbe:
httpGet:
path: /
port: metrics
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 1001
runAsNonRoot: true
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
serviceAccountName: kube-prometheus-1660560589-node-exporter
serviceAccount: kube-prometheus-1660560589-node-exporter
hostNetwork: true
hostPID: true
securityContext:
fsGroup: 1001
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- k8smasterone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/instance: kube-prometheus-1660560589
app.kubernetes.io/name: node-exporter
namespaces:
- reddwarf-monitor
topologyKey: kubernetes.io/hostname
schedulerName: default-scheduler
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/disk-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/unschedulable
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/network-unavailable
operator: Exists
effect: NoSchedule
priority: 0
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
I checked the host machine and found that port 9100 is free, so why does it still say there is no free port for this pod? What should I do to avoid this problem? This is the host port 9100 check:
[root@k8smasterone grafana]# lsof -i:9100
[root@k8smasterone grafana]#
This is the pod describe output:
➜ ~ kubectl describe pod kube-prometheus-1660560589-node-exporter-n7rzg -n reddwarf-monitor
Name: kube-prometheus-1660560589-node-exporter-n7rzg
Namespace: reddwarf-monitor
Priority: 0
Node: <none>
Labels: app.kubernetes.io/instance=kube-prometheus-1660560589
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=node-exporter
controller-revision-hash=65c69f9b58
helm.sh/chart=node-exporter-3.0.8
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kube-prometheus-1660560589-node-exporter
Containers:
node-exporter:
Image: docker.io/bitnami/node-exporter:1.3.1-debian-11-r23
Port: 9100/TCP
Host Port: 9100/TCP
Args:
--path.procfs=/host/proc
--path.sysfs=/host/sys
--web.listen-address=0.0.0.0:9100
--collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
--collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
Liveness: http-get http://:metrics/ delay=120s timeout=5s period=10s #success=1 #failure=6
Readiness: http-get http://:metrics/ delay=30s timeout=5s period=10s #success=1 #failure=6
Environment: <none>
Mounts:
/host/proc from proc (ro)
/host/sys from sys (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fj8v (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
kube-api-access-9fj8v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m54s (x233 over 3h53m) default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
This is the netstat check:
[root@k8smasterone ~]# netstat -plant |grep 9100
[root@k8smasterone ~]#
I also tried to allow the pods to run on the master node by adding this toleration:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
but that still did not fix the problem.
When you configure your pod with hostNetwork: true, the containers running in this pod can directly see the network interfaces of the host machine where the pod was started.
The container port is exposed on the external network at <hostIP>:<hostPort>, where hostPort is the port requested in the container's ports configuration.
To get around your problem, you have two options:
set hostNetwork: false
choose a different hostPort (preferably in the range 49152 to 65535)
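As a rough sketch of the second option for this node-exporter container (the port number is arbitrary, and with hostNetwork: true the listen address has to move together with the port):
args:
  - '--web.listen-address=0.0.0.0:52100'   # hypothetical free port
ports:
  - name: metrics
    containerPort: 52100
    hostPort: 52100   # any free port in the 49152-65535 range
    protocol: TCP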
I have a problem with one of the pods. It says that it is in a pending state.
If I describe the pod, this is what I can see:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 1m (x58 over 11m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector
Warning FailedScheduling 1m (x34 over 11m) default-scheduler 0/6 nodes are available: 6 node(s) didn't match node selector.
If I check the logs, there is nothing in there (the output is just empty).
--- Update ---
This is my pod yaml file
apiVersion: v1
kind: Pod
metadata:
annotations:
checksum/config: XXXXXXXXXXX
checksum/dashboards-config: XXXXXXXXXXX
creationTimestamp: 2020-02-11T10:15:15Z
generateName: grafana-654667db5b-
labels:
app: grafana-grafana
component: grafana
pod-template-hash: "2102238616"
release: grafana
name: grafana-654667db5b-tnrlq
namespace: monitoring
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: grafana-654667db5b
uid: xxxx-xxxxx-xxxxxxxx-xxxxxxxx
resourceVersion: "98843547"
selfLink: /api/v1/namespaces/monitoring/pods/grafana-654667db5b-tnrlq
uid: xxxx-xxxxx-xxxxxxxx-xxxxxxxx
spec:
containers:
- env:
- name: GF_SECURITY_ADMIN_USER
valueFrom:
secretKeyRef:
key: xxxx
name: grafana
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
key: xxxx
name: grafana
- name: GF_INSTALL_PLUGINS
valueFrom:
configMapKeyRef:
key: grafana-install-plugins
name: grafana-config
image: grafana/grafana:5.0.4
imagePullPolicy: Always
name: grafana
ports:
- containerPort: 3000
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/health
port: 3000
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
resources:
requests:
cpu: 200m
memory: 100Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/grafana
name: config-volume
- mountPath: /var/lib/grafana/dashboards
name: dashboard-volume
- mountPath: /var/lib/grafana
name: storage-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-tqb6j
readOnly: true
dnsPolicy: ClusterFirst
initContainers:
- command:
- sh
- -c
- cp /tmp/config-volume-configmap/* /tmp/config-volume 2>/dev/null || true; cp
/tmp/dashboard-volume-configmap/* /tmp/dashboard-volume 2>/dev/null || true
image: busybox
imagePullPolicy: Always
name: copy-configs
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp/config-volume-configmap
name: config-volume-configmap
- mountPath: /tmp/dashboard-volume-configmap
name: dashboard-volume-configmap
- mountPath: /tmp/config-volume
name: config-volume
- mountPath: /tmp/dashboard-volume
name: dashboard-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-tqb6j
readOnly: true
nodeSelector:
nodePool: cluster
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 300
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir: {}
name: config-volume
- emptyDir: {}
name: dashboard-volume
- configMap:
defaultMode: 420
name: grafana-config
name: config-volume-configmap
- configMap:
defaultMode: 420
name: grafana-dashs
name: dashboard-volume-configmap
- name: storage-volume
persistentVolumeClaim:
claimName: grafana
- name: default-token-tqb6j
secret:
defaultMode: 420
secretName: default-token-tqb6j
status:
conditions:
- lastProbeTime: 2020-02-11T10:45:37Z
lastTransitionTime: 2020-02-11T10:15:15Z
message: '0/6 nodes are available: 6 node(s) didn''t match node selector.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
Do you know how I should debug this further?
Solution: You can do one of two things to allow the scheduler to fulfill your pod creation request.
You can remove these lines from your pod YAML and create the pod again from scratch (if you need the selector for a reason, use the approach in step 2 below):
nodeSelector:
nodePool: cluster
or
You can ensure that the nodePool: cluster label is added to your nodes so the pod can be scheduled using the existing selector.
You can use this command to label a node:
kubectl label nodes <your node name> nodePool=cluster
Run the above command for each node (or only for the nodes you want the selector to match), replacing the node name with the one from your cluster.
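To confirm the label is in place and that the selector can now match at least one node, something like:
kubectl get nodes -l nodePool=cluster
# or inspect all labels
kubectl get nodes --show-labels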
Your pod probably uses a node selector which cannot be fulfilled by the scheduler.
Check the pod description for something like this:
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
...
nodeSelector:
disktype: ssd
And check whether your nodes are labeled accordingly.
The simplest option would be to use "nodeName" in the Pod yaml.
First, get the node where you want to run the Pod:
kubectl get nodes
Use the attribute below inside the Pod definition (YAML) so that the Pod is forced to run only on the specified node.
nodeName: seliiuvd05714
I have an issue with a Job in kubernetes. When I want to debug, I do:
kubectl describe job -n influx pamela-1578898800
Name: pamela-1578898800
Namespace: influx
Selector: controller-uid=xxx
Labels: controller-uid=xxx
job-name=pamela-1578898800
Annotations: <none>
Controlled By: CronJob/pamela
Parallelism: 1
Completions: 1
Start Time: Mon, 13 Jan 2020 08:00:04 +0100
Pods Statuses: 0 Running / 0 Succeeded / 5 Failed
Pod Template:
Labels: controller-uid=53110b24-35d2-11ea-bca1-06ecc706e86a
job-name=pamela-1578898800
Containers:
pamela:
Image: registry.gitlab.com/xxx/pamela:latest
Port: <none>
Host Port: <none>
Limits:
cpu: 800m
memory: 1000Mi
Requests:
cpu: 800m
memory: 1000Mi
Environment Variables from:
pamela-env Secret Optional: false
Environment: <none>
Mounts:
/config from pamela-keys (rw)
/log from pamela-claim (rw,path="log")
/raw from pamela-claim (rw,path="raw")
Volumes:
pamela-claim:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pamela-claim
ReadOnly: false
pamela-keys:
Type: Secret (a volume populated by a Secret)
SecretName: pamela-keys
Optional: false
Events: <none>
Here you can see 5 failed pods, but I don't know how to see their logs.
When I run:
kubectl get po -A
there are no "pamela-xxx" pods.
How can I see what the issue is?
EDIT:
Here are the manifests I use to trigger the job.
# job.yaml - THIS ONE WORKS, LAUNCHING MANUALLY
apiVersion: batch/v1
kind: Job
metadata:
name: pamela-singlerun
namespace: influx
spec:
template:
spec:
containers:
- image: registry.gitlab.com/xxx/pamela:latest
envFrom:
- secretRef:
name: pamela-env
name: pamela
volumeMounts:
- mountPath: /raw
name: pamela-claim
subPath: raw
- mountPath: /log
name: pamela-claim
subPath: log
- mountPath: /config
name: pamela-keys
restartPolicy: Never
volumes:
- name: pamela-claim
persistentVolumeClaim:
claimName: pamela-claim
- name: pamela-keys
secret:
secretName: pamela-keys
items:
- key: keys.yml
path: keys.yml
nodeSelector:
kops.k8s.io/instancegroup: pamela-nodes
imagePullSecrets:
- name: gitlab-registry
And cronjob.yml, THIS ONE DOESN'T WORK
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: pamela
namespace: influx
spec:
schedule: "0 7,19 * * *"
concurrencyPolicy: Replace
jobTemplate:
spec:
template:
spec:
containers:
- image: registry.gitlab.com/xxx/pamela:latest
envFrom:
- secretRef:
name: pamela-env
name: pamela
resources:
limits:
cpu: 800m
memory: 1000Mi
requests:
cpu: 800m
memory: 1000Mi
volumeMounts:
- mountPath: /raw
name: pamela-claim
subPath: raw
- mountPath: /log
name: pamela-claim
subPath: log
- mountPath: /config
name: pamela-keys
restartPolicy: Never
volumes:
- name: pamela-claim
persistentVolumeClaim:
claimName: pamela-claim
- name: pamela-keys
secret:
secretName: pamela-keys
items:
- key: keys.yml
path: keys.yml
nodeSelector:
kops.k8s.io/instancegroup: pamela-nodes
imagePullSecrets:
- name: gitlab-registry
EDIT 2: After setting the cron job to run every 10 minutes, I can see my jobs, and I get the expected results (meaning it works):
pamela-1578992400-ppgtx 0/1 Completed 0 21m
pamela-1578993000-kn8nd 0/1 Completed 0 11m
But right after this, when I try to get the logs, I get:
Error from server (NotFound): pods "pamela-1578992400-ppgtx" not found
which suggests the TTL is about 10 minutes. When I try to increase the TTL, I get a feature-gates disabled error; I'm still checking how to fix that.
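(For reference, the per-Job TTL referred to here appears to be spec.ttlSecondsAfterFinished, which on Kubernetes versions of this era sits behind the TTLAfterFinished feature gate, matching the feature-gates error above. A sketch of where the field goes:)
apiVersion: batch/v1
kind: Job
metadata:
  name: pamela-singlerun
  namespace: influx
spec:
  ttlSecondsAfterFinished: 3600   # keep the finished Job and its pods around for an hour
  template:
    # ... same pod template as in job.yaml above ...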
It is weird; after setting the cron job to every 10 minutes, I get:
➜ kubectl get jobs -n influx
NAME COMPLETIONS DURATION AGE
pamela-1578898800 0/1 32h 32h
pamela-1579007400 1/1 99s 159m
pamela-1579011000 1/1 97s 99m
pamela-1579014600 1/1 108s 39m
I use: schedule: "10 * * * *"
Don't understand what's going on here...
I'm running a microk8s cluster on an Ubuntu server at home, and I have it connected to a local NAS server for persistent storage. I've been using it as my personal proving grounds for learning Kubernetes, but I seem to encounter problem after problem at just about every step of the way.
I've got the NFS Client Provisioner Helm chart installed which I've confirmed works - it will dynamically provision PVCs on my NAS server. I later was able to successfully install the Postgres Helm chart, or so I thought. After creating it I was able to connect to it using a SQL client, and I was feeling good.
Until a couple of days later, I noticed the pod was showing 0/1 containers ready. Although interestingly, the nfs-client-provisioner pod was still showing 1/1. Long story short: I've deleted/purged the Postgres Helm chart, and attempted to reinstall it, but now it no longer works. In fact, nothing new that I try to deploy works. Everything looks as though it's going to work, but then just hangs on either Init or ContainerCreating forever.
With Postgres in particular, the command I've been running is this:
helm install --name postgres stable/postgresql -f postgres.yaml
And my postgres.yaml file looks like this:
persistence:
storageClass: nfs-client
accessMode: ReadWriteMany
size: 2Gi
But if I do a kubectl get pods, I still see this:
NAME READY STATUS RESTARTS AGE
nfs-client-provisioner 1/1 Running 1 11d
postgres-postgresql-0 0/1 Init:0/1 0 3h51m
If I do a kubectl describe pod postgres-postgresql-0, this is the output:
Name: postgres-postgresql-0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: stjohn/192.168.1.217
Start Time: Thu, 28 Mar 2019 12:51:02 -0500
Labels: app=postgresql
chart=postgresql-3.11.7
controller-revision-hash=postgres-postgresql-5bfb9cc56d
heritage=Tiller
release=postgres
role=master
statefulset.kubernetes.io/pod-name=postgres-postgresql-0
Annotations: <none>
Status: Pending
IP:
Controlled By: StatefulSet/postgres-postgresql
Init Containers:
init-chmod-data:
Container ID:
Image: docker.io/bitnami/minideb:latest
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
chown -R 1001:1001 /bitnami
if [ -d /bitnami/postgresql/data ]; then
chmod 0700 /bitnami/postgresql/data;
fi
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Environment: <none>
Mounts:
/bitnami/postgresql from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h4gph (ro)
Containers:
postgres-postgresql:
Container ID:
Image: docker.io/bitnami/postgresql:10.7.0
Image ID:
Port: 5432/TCP
Host Port: 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Liveness: exec [sh -c exec pg_isready -U "postgres" -h localhost] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [sh -c exec pg_isready -U "postgres" -h localhost] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
PGDATA: /bitnami/postgresql
POSTGRES_USER: postgres
POSTGRES_PASSWORD: <set to the key 'postgresql-password' in secret 'postgres-postgresql'> Optional: false
Mounts:
/bitnami/postgresql from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h4gph (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-postgres-postgresql-0
ReadOnly: false
default-token-h4gph:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-h4gph
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
And if I do a kubectl get pod postgres-postgresql-0 -o yaml, this is the output:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2019-03-28T17:51:02Z"
generateName: postgres-postgresql-
labels:
app: postgresql
chart: postgresql-3.11.7
controller-revision-hash: postgres-postgresql-5bfb9cc56d
heritage: Tiller
release: postgres
role: master
statefulset.kubernetes.io/pod-name: postgres-postgresql-0
name: postgres-postgresql-0
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: postgres-postgresql
uid: 0d3ef673-5182-11e9-bf14-b8975a0ca30c
resourceVersion: "1953329"
selfLink: /api/v1/namespaces/default/pods/postgres-postgresql-0
uid: 0d4dfb56-5182-11e9-bf14-b8975a0ca30c
spec:
containers:
- env:
- name: PGDATA
value: /bitnami/postgresql
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
key: postgresql-password
name: postgres-postgresql
image: docker.io/bitnami/postgresql:10.7.0
imagePullPolicy: Always
livenessProbe:
exec:
command:
- sh
- -c
- exec pg_isready -U "postgres" -h localhost
failureThreshold: 6
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
name: postgres-postgresql
ports:
- containerPort: 5432
name: postgresql
protocol: TCP
readinessProbe:
exec:
command:
- sh
- -c
- exec pg_isready -U "postgres" -h localhost
failureThreshold: 6
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
resources:
requests:
cpu: 250m
memory: 256Mi
securityContext:
procMount: Default
runAsUser: 1001
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bitnami/postgresql
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-h4gph
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: postgres-postgresql-0
initContainers:
- command:
- sh
- -c
- |
chown -R 1001:1001 /bitnami
if [ -d /bitnami/postgresql/data ]; then
chmod 0700 /bitnami/postgresql/data;
fi
image: docker.io/bitnami/minideb:latest
imagePullPolicy: Always
name: init-chmod-data
resources:
requests:
cpu: 250m
memory: 256Mi
securityContext:
procMount: Default
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bitnami/postgresql
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-h4gph
readOnly: true
nodeName: stjohn
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1001
serviceAccount: default
serviceAccountName: default
subdomain: postgres-postgresql-headless
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: data
persistentVolumeClaim:
claimName: data-postgres-postgresql-0
- name: default-token-h4gph
secret:
defaultMode: 420
secretName: default-token-h4gph
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with incomplete status: [init-chmod-data]'
reason: ContainersNotInitialized
status: "False"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with unready status: [postgres-postgresql]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with unready status: [postgres-postgresql]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: docker.io/bitnami/postgresql:10.7.0
imageID: ""
lastState: {}
name: postgres-postgresql
ready: false
restartCount: 0
state:
waiting:
reason: PodInitializing
hostIP: 192.168.1.217
initContainerStatuses:
- image: docker.io/bitnami/minideb:latest
imageID: ""
lastState: {}
name: init-chmod-data
ready: false
restartCount: 0
state:
waiting:
reason: PodInitializing
phase: Pending
qosClass: Burstable
startTime: "2019-03-28T17:51:02Z"
I don't see anything obvious in these to be able to pinpoint what might be going on. And I've already rebooted the server just to see if that might help. Any thoughts? Why won't my containers start?
It looks like your initContainer is stuck in the PodInitializing state. The most likely scenario is that your PVCs are not ready. I recommend you describe your data-postgres-postgresql-0 PVC to make sure that the volume has actually been provisioned and that the claim is Bound. Your NFS provisioner may be working, but that specific PV/PVC may not have been created due to an error. I have run into similar phenomena with the EFS provisioner on AWS.
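For example, something along these lines should show whether the claim from the pod spec above ever bound (the claim name comes from the volumes section):
kubectl get pvc data-postgres-postgresql-0
kubectl describe pvc data-postgres-postgresql-0
kubectl get pv    # a Bound PV should exist for the claim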
You can use kubectl's event listing. This will give you the events for your pod.
To filter for a specific pod you can use a field selector:
kubectl get event --namespace abc-namespace --field-selector involvedObject.name=my-pod-zl6m6
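If the exact pod name is no longer known, listing a namespace's recent events sorted by creation time can also help; note that events are only retained for a limited time (about an hour by default):
kubectl get event --namespace abc-namespace --sort-by=.metadata.creationTimestamp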
I'm running Spinnaker on Kubernetes 1.10.11 [1]. One of the Spinnaker services is a Pod running a service called Clouddriver. This Pod was running fine, but then the readinessProbe started erroring continuously. Kubernetes docs say
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.
— https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
But this Pod's IP is still in the Service's endpoints. Why?
Clouddriver Pod YAML
kubectl -n spinnaker-test get pods spin-clouddriver-5559d44484-mp8q9 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: spotify.backend-service
creationTimestamp: 2019-02-15T20:46:38Z
generateName: spin-clouddriver-5559d44484-
labels:
app: spin
app.kubernetes.io/managed-by: halyard
app.kubernetes.io/name: clouddriver
app.kubernetes.io/part-of: spinnaker
app.kubernetes.io/version: 1.12.1
cluster: spin-clouddriver
pod-template-hash: "1115800040"
name: spin-clouddriver-5559d44484-mp8q9
namespace: spinnaker-test
ownerReferences:
- apiVersion: extensions/v1beta1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: spin-clouddriver-5559d44484
uid: ce79561c-3161-11e9-acdf-42010a800082
resourceVersion: "53541277"
selfLink: /api/v1/namespaces/spinnaker-test/pods/spin-clouddriver-5559d44484-mp8q9
uid: caa66d7c-3162-11e9-acdf-42010a800082
spec:
containers:
- env:
- name: JAVA_OPTS
value: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
- name: SPRING_PROFILES_ACTIVE
value: local
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imagePullPolicy: IfNotPresent
lifecycle: {}
name: clouddriver
ports:
- containerPort: 7002
protocol: TCP
readinessProbe:
exec:
command:
- wget
- --no-check-certificate
- --spider
- -q
- http://localhost:7002/health
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "20"
memory: 5000Mi
requests:
cpu: "20"
memory: 5000Mi
securityContext:
allowPrivilegeEscalation: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /opt/spinnaker/config
name: spin-clouddriver-files-1952526246
- mountPath: /home/halyard/.hal/k8s-spinnaker/staging/dependencies
name: spin-clouddriver-files-1757773194
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-w2lt5
readOnly: true
dnsPolicy: ClusterFirst
nodeName: gke-production-us-ce-terraform-201812-d63606d6-9vq9
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 720
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: spin-clouddriver-files-1952526246
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1952526246
- name: spin-clouddriver-files-1757773194
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1757773194
- name: default-token-w2lt5
secret:
defaultMode: 420
secretName: default-token-w2lt5
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:53:40Z
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imageID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
lastState: {}
name: clouddriver
ready: true
restartCount: 0
state:
running:
startedAt: 2019-02-15T20:47:26Z
hostIP: 10.178.32.98
phase: Running
podIP: 10.179.34.24
qosClass: Guaranteed
startTime: 2019-02-15T20:46:38Z
Describing the Pod shows the readinessProbe has been continuously erroring for over a day.
kubectl -n spinnaker-test describe pods spin-clouddriver-5559d44484-mp8q9
Name: spin-clouddriver-5559d44484-mp8q9
Namespace: spinnaker-test
Node: gke-production-us-ce-terraform-201812-d63606d6-9vq9/10.178.32.98
Start Time: Fri, 15 Feb 2019 15:46:38 -0500
Labels: app=spin
app.kubernetes.io/managed-by=halyard
app.kubernetes.io/name=clouddriver
app.kubernetes.io/part-of=spinnaker
app.kubernetes.io/version=1.12.1
cluster=spin-clouddriver
pod-template-hash=1115800040
Annotations: kubernetes.io/psp=spotify.backend-service
Status: Running
IP: 10.179.34.24
Controlled By: ReplicaSet/spin-clouddriver-5559d44484
Containers:
clouddriver:
Container ID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
Image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
Image ID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
Port: 7002/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 15 Feb 2019 15:47:26 -0500
Ready: True
Restart Count: 0
Limits:
cpu: 20
memory: 5000Mi
Requests:
cpu: 20
memory: 5000Mi
Readiness: exec [wget --no-check-certificate --spider -q http://localhost:7002/health] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
JAVA_OPTS: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
SPRING_PROFILES_ACTIVE: local
Mounts:
/home/halyard/.hal/k8s-spinnaker/staging/dependencies from spin-clouddriver-files-1757773194 (rw)
/opt/spinnaker/config from spin-clouddriver-files-1952526246 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-w2lt5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
spin-clouddriver-files-1952526246:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1952526246
Optional: false
spin-clouddriver-files-1757773194:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1757773194
Optional: false
default-token-w2lt5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-w2lt5
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3m (x321 over 1d) kubelet, gke-production-us-ce-terraform-201812-d63606d6-9vq9 Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
But the Service still has the Pod's IP of 10.179.34.24 in its Endpoints.
kubectl -n spinnaker-test describe services spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"spin","cluster":"spin-clouddriver"},"name":"spin-clouddriver","namesp...
Selector: app=spin,cluster=spin-clouddriver
Type: ClusterIP
IP: 10.178.65.100
Port: <unset> 7002/TCP
TargetPort: 7002/TCP
Endpoints: 10.179.34.24:7002
Session Affinity: None
Events: <none>
kubectl -n spinnaker-test describe endpoints spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: <none>
Subsets:
Addresses: 10.179.34.24
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
<unset> 7002 TCP
Events: <none>
Footnotes
[1] GKE 1.10.11-gke.1 to be exact, but the fact that it's GKE shouldn't matter.
A probe by the kubelet can end in one of three states:
successful
failed (command returned a non-0 exit code)
errored (command did not return before the timeout elapsed, the command does not exist inside the container, etc)
Here is the code (in 1.10.11) where the "probe errored" event is recorded. Note that err != nil.
Here is the code that calls the above function - when err != nil (the probe returned an error), the result is discarded.
Only probes that fail will actually cause the pod's ready state to be changed.
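Given that this probe shells out to wget with timeoutSeconds: 1, one possible mitigation (a sketch, not a guaranteed fix) is to switch to an httpGet probe so the kubelet hits the endpoint directly and a slow response counts as a failure rather than an error; this assumes /health on port 7002 serves plain HTTP, as the wget command implies:
readinessProbe:
  httpGet:
    path: /health
    port: 7002
    scheme: HTTP
  timeoutSeconds: 5      # more generous than the current 1s
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1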