I'm running Spinnaker on Kubernetes 1.10.11 [1]. One of the Spinnaker services is a Pod running a service called Clouddriver. This Pod was running fine, but then the readinessProbe started erroring continuously. The Kubernetes docs say:
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.
— https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
But this Pod's IP is still in the Service's endpoints. Why?
Clouddriver Pod YAML
kubectl -n spinnaker-test get pods spin-clouddriver-5559d44484-mp8q9 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: spotify.backend-service
creationTimestamp: 2019-02-15T20:46:38Z
generateName: spin-clouddriver-5559d44484-
labels:
app: spin
app.kubernetes.io/managed-by: halyard
app.kubernetes.io/name: clouddriver
app.kubernetes.io/part-of: spinnaker
app.kubernetes.io/version: 1.12.1
cluster: spin-clouddriver
pod-template-hash: "1115800040"
name: spin-clouddriver-5559d44484-mp8q9
namespace: spinnaker-test
ownerReferences:
- apiVersion: extensions/v1beta1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: spin-clouddriver-5559d44484
uid: ce79561c-3161-11e9-acdf-42010a800082
resourceVersion: "53541277"
selfLink: /api/v1/namespaces/spinnaker-test/pods/spin-clouddriver-5559d44484-mp8q9
uid: caa66d7c-3162-11e9-acdf-42010a800082
spec:
containers:
- env:
- name: JAVA_OPTS
value: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
- name: SPRING_PROFILES_ACTIVE
value: local
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imagePullPolicy: IfNotPresent
lifecycle: {}
name: clouddriver
ports:
- containerPort: 7002
protocol: TCP
readinessProbe:
exec:
command:
- wget
- --no-check-certificate
- --spider
- -q
- http://localhost:7002/health
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "20"
memory: 5000Mi
requests:
cpu: "20"
memory: 5000Mi
securityContext:
allowPrivilegeEscalation: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /opt/spinnaker/config
name: spin-clouddriver-files-1952526246
- mountPath: /home/halyard/.hal/k8s-spinnaker/staging/dependencies
name: spin-clouddriver-files-1757773194
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-w2lt5
readOnly: true
dnsPolicy: ClusterFirst
nodeName: gke-production-us-ce-terraform-201812-d63606d6-9vq9
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 720
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: spin-clouddriver-files-1952526246
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1952526246
- name: spin-clouddriver-files-1757773194
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1757773194
- name: default-token-w2lt5
secret:
defaultMode: 420
secretName: default-token-w2lt5
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:53:40Z
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imageID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver#sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
lastState: {}
name: clouddriver
ready: true
restartCount: 0
state:
running:
startedAt: 2019-02-15T20:47:26Z
hostIP: 10.178.32.98
phase: Running
podIP: 10.179.34.24
qosClass: Guaranteed
startTime: 2019-02-15T20:46:38Z
Describing the Pod shows the readinessProbe has been continuously erroring for over a day.
kubectl -n spinnaker-test describe pods spin-clouddriver-5559d44484-mp8q9
Name: spin-clouddriver-5559d44484-mp8q9
Namespace: spinnaker-test
Node: gke-production-us-ce-terraform-201812-d63606d6-9vq9/10.178.32.98
Start Time: Fri, 15 Feb 2019 15:46:38 -0500
Labels: app=spin
app.kubernetes.io/managed-by=halyard
app.kubernetes.io/name=clouddriver
app.kubernetes.io/part-of=spinnaker
app.kubernetes.io/version=1.12.1
cluster=spin-clouddriver
pod-template-hash=1115800040
Annotations: kubernetes.io/psp=spotify.backend-service
Status: Running
IP: 10.179.34.24
Controlled By: ReplicaSet/spin-clouddriver-5559d44484
Containers:
clouddriver:
Container ID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
Image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
Image ID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver#sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
Port: 7002/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 15 Feb 2019 15:47:26 -0500
Ready: True
Restart Count: 0
Limits:
cpu: 20
memory: 5000Mi
Requests:
cpu: 20
memory: 5000Mi
Readiness: exec [wget --no-check-certificate --spider -q http://localhost:7002/health] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
JAVA_OPTS: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
SPRING_PROFILES_ACTIVE: local
Mounts:
/home/halyard/.hal/k8s-spinnaker/staging/dependencies from spin-clouddriver-files-1757773194 (rw)
/opt/spinnaker/config from spin-clouddriver-files-1952526246 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-w2lt5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
spin-clouddriver-files-1952526246:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1952526246
Optional: false
spin-clouddriver-files-1757773194:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1757773194
Optional: false
default-token-w2lt5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-w2lt5
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3m (x321 over 1d) kubelet, gke-production-us-ce-terraform-201812-d63606d6-9vq9 Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
But the Service still has the Pod's IP, 10.179.34.24, in its Endpoints.
kubectl -n spinnaker-test describe services spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"spin","cluster":"spin-clouddriver"},"name":"spin-clouddriver","namesp...
Selector: app=spin,cluster=spin-clouddriver
Type: ClusterIP
IP: 10.178.65.100
Port: <unset> 7002/TCP
TargetPort: 7002/TCP
Endpoints: 10.179.34.24:7002
Session Affinity: None
Events: <none>
kubectl -n spinnaker-test describe endpoints spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: <none>
Subsets:
Addresses: 10.179.34.24
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
<unset> 7002 TCP
Events: <none>
Footnotes
[1] GKE 1.10.11-gke.1 to be exact, but the fact that it's GKE shouldn't matter.
A probe by the kubelet can end in one of three states:
successful
failed (command returned a non-0 exit code)
errored (command did not return before the timeout elapsed, the command does not exist inside the container, etc)
Here is the code (in 1.10.11) where the "probe errored" event is recorded. Note that err != nil.
Here is the code that calls the above function: when err != nil (the probe returned an error), the result is discarded.
Only probes that fail actually cause the Pod's ready state to change. A probe that errors, as in the DeadlineExceeded events above, leaves the ready state as it was.
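The three outcomes above can be modelled with a small simulation. This is a simplified sketch of the described behaviour, not the kubelet's actual Go code; the names `ProbeResult` and `readiness_after` are made up for illustration, and the thresholds mirror `failureThreshold: 3` and `successThreshold: 1` from the Pod spec in the question.

```python
from enum import Enum


class ProbeResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"  # command returned a non-0 exit code
    ERROR = "error"      # probe could not complete (e.g. it timed out)


def readiness_after(results, ready=True, failure_threshold=3, success_threshold=1):
    """Replay probe results and return the final readiness state.

    Simplified model (an assumption, not kubelet code): errored probes are
    skipped entirely, so they neither count towards failure_threshold nor
    reset the current streak.
    """
    last = None  # last counted (non-error) result
    streak = 0   # consecutive occurrences of `last`
    for result in results:
        if result is ProbeResult.ERROR:
            continue  # result discarded: readiness stays as it was
        if result is last:
            streak += 1
        else:
            last, streak = result, 1
        if result is ProbeResult.FAILURE and streak >= failure_threshold:
            ready = False
        elif result is ProbeResult.SUCCESS and streak >= success_threshold:
            ready = True
    return ready


# 321 errored probes, as in the event log in the question: the Pod stays Ready.
print(readiness_after([ProbeResult.ERROR] * 321))  # True
# Three consecutive failures flip it to not ready.
print(readiness_after([ProbeResult.FAILURE] * 3))  # False
```

Under this model, a probe that times out never moves the Pod out of the Service's endpoints, which matches what the question observed.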
After I installed Prometheus using Helm in my Kubernetes cluster, the Pod shows an error like this:
0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
This is the Pod YAML:
apiVersion: v1
kind: Pod
metadata:
name: kube-prometheus-1660560589-node-exporter-n7rzg
generateName: kube-prometheus-1660560589-node-exporter-
namespace: reddwarf-monitor
uid: 73986565-ccd8-421c-bcbb-33879437c4f3
resourceVersion: '71494023'
creationTimestamp: '2022-08-15T10:51:07Z'
labels:
app.kubernetes.io/instance: kube-prometheus-1660560589
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: node-exporter
controller-revision-hash: 65c69f9b58
helm.sh/chart: node-exporter-3.0.8
pod-template-generation: '1'
ownerReferences:
- apiVersion: apps/v1
kind: DaemonSet
name: kube-prometheus-1660560589-node-exporter
uid: 921f98b9-ccc9-4e84-b092-585865bca024
controller: true
blockOwnerDeletion: true
status:
phase: Pending
conditions:
- type: PodScheduled
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-08-15T10:51:07Z'
reason: Unschedulable
message: >-
0/1 nodes are available: 1 node(s) didn't have free ports for the
requested pod ports.
qosClass: BestEffort
spec:
volumes:
- name: proc
hostPath:
path: /proc
type: ''
- name: sys
hostPath:
path: /sys
type: ''
- name: kube-api-access-9fj8v
projected:
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
name: kube-root-ca.crt
items:
- key: ca.crt
path: ca.crt
- downwardAPI:
items:
- path: namespace
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
defaultMode: 420
containers:
- name: node-exporter
image: docker.io/bitnami/node-exporter:1.3.1-debian-11-r23
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--web.listen-address=0.0.0.0:9100'
- >-
--collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
- >-
--collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
ports:
- name: metrics
hostPort: 9100
containerPort: 9100
protocol: TCP
resources: {}
volumeMounts:
- name: proc
readOnly: true
mountPath: /host/proc
- name: sys
readOnly: true
mountPath: /host/sys
- name: kube-api-access-9fj8v
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
livenessProbe:
httpGet:
path: /
port: metrics
scheme: HTTP
initialDelaySeconds: 120
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
readinessProbe:
httpGet:
path: /
port: metrics
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 1001
runAsNonRoot: true
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
serviceAccountName: kube-prometheus-1660560589-node-exporter
serviceAccount: kube-prometheus-1660560589-node-exporter
hostNetwork: true
hostPID: true
securityContext:
fsGroup: 1001
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- k8smasterone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/instance: kube-prometheus-1660560589
app.kubernetes.io/name: node-exporter
namespaces:
- reddwarf-monitor
topologyKey: kubernetes.io/hostname
schedulerName: default-scheduler
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/disk-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/unschedulable
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/network-unavailable
operator: Exists
effect: NoSchedule
priority: 0
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
I checked the host machine and found that port 9100 is free, so why am I still told there is no free port for this Pod? What should I do to avoid this problem? This is the command I used to check host port 9100:
[root#k8smasterone grafana]# lsof -i:9100
[root#k8smasterone grafana]#
This is the output of kubectl describe pod:
➜ ~ kubectl describe pod kube-prometheus-1660560589-node-exporter-n7rzg -n reddwarf-monitor
Name: kube-prometheus-1660560589-node-exporter-n7rzg
Namespace: reddwarf-monitor
Priority: 0
Node: <none>
Labels: app.kubernetes.io/instance=kube-prometheus-1660560589
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=node-exporter
controller-revision-hash=65c69f9b58
helm.sh/chart=node-exporter-3.0.8
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/kube-prometheus-1660560589-node-exporter
Containers:
node-exporter:
Image: docker.io/bitnami/node-exporter:1.3.1-debian-11-r23
Port: 9100/TCP
Host Port: 9100/TCP
Args:
--path.procfs=/host/proc
--path.sysfs=/host/sys
--web.listen-address=0.0.0.0:9100
--collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
--collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
Liveness: http-get http://:metrics/ delay=120s timeout=5s period=10s #success=1 #failure=6
Readiness: http-get http://:metrics/ delay=30s timeout=5s period=10s #success=1 #failure=6
Environment: <none>
Mounts:
/host/proc from proc (ro)
/host/sys from sys (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fj8v (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
kube-api-access-9fj8v:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m54s (x233 over 3h53m) default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
This is the netstat output:
[root#k8smasterone ~]# netstat -plant |grep 9100
[root#k8smasterone ~]#
I also tried to allow Pods to run on the master node by adding this config:
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
This still did not fix the problem.
When you configure your pod with hostNetwork: true, the containers running in this pod can directly see the network interfaces of the host machine where the pod was started.
The container port will be exposed to the external network at <hostIP>:<hostPort>, where hostPort is the port requested by the user in the hostPort field of the configuration.
To work around your problem, you have two options:
set hostNetwork: false
choose a different hostPort (preferably in the range 49152 to 65535)
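To illustrate why an empty lsof or netstat result does not rule out this error: as far as I understand, the scheduler's free-port check compares the hostPort values declared by Pods already assigned to the node, and does not look at listening sockets. The sketch below is a simplified model of that comparison. The function name and the "existing" Pod are hypothetical, and hostIP is ignored.

```python
def host_port_conflicts(requested, pods_on_node):
    """Return requested (protocol, hostPort) pairs already declared on the node.

    Simplified model (an assumption): only declared ports are compared,
    hostIP is ignored, and no sockets are inspected.
    """
    declared = {
        (port.get("protocol", "TCP"), port["hostPort"])
        for pod in pods_on_node
        for port in pod["ports"]
        if port.get("hostPort")
    }
    wanted = {(port.get("protocol", "TCP"), port["hostPort"]) for port in requested}
    return sorted(wanted & declared)


# Port taken from the Pod spec in the question.
node_exporter_ports = [{"hostPort": 9100, "containerPort": 9100, "protocol": "TCP"}]

# Hypothetical Pod already assigned to the node that declares the same hostPort.
existing = [{"name": "other-node-exporter", "ports": [{"hostPort": 9100, "protocol": "TCP"}]}]

print(host_port_conflicts(node_exporter_ports, existing))  # [('TCP', 9100)]
print(host_port_conflicts(node_exporter_ports, []))        # []
```

If this model applies, listing the Pods on the node across all namespaces (for example with kubectl get pods --all-namespaces -o wide) may reveal another Pod that declares hostPort 9100.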
I have a Helm chart that has one Deployment (one Pod) and one Service. I set the Deployment's terminationGracePeriodSeconds to 300.
I don't have any Pod lifecycle hooks, so if I terminate the Pod, it should terminate immediately. However, the Pod does not terminate until the grace period ends!
Below is the deployment template for my pod:
$ kubectl get pod hpa-poc---jcc-7dbbd66d86-xtfc5 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: eks.privileged
creationTimestamp: "2021-02-01T18:12:34Z"
generateName: hpa-poc-jcc-7dbbd66d86-
labels:
app.kubernetes.io/instance: hpa-poc
app.kubernetes.io/name: -
pod-template-hash: 7dbbd66d86
name: hpa-poc-jcc-7dbbd66d86-xtfc5
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: hpa-poc-jcc-7dbbd66d86
uid: 66db29d8-9e2d-4097-94fc-b0b827466e10
resourceVersion: "127938945"
selfLink: /api/v1/namespaces/default/pods/hpa-poc-jcc-7dbbd66d86-xtfc5
uid: 82ed4134-95de-4093-843b-438e94e408dd
spec:
containers:
- env:
- name: _CONFIG_LINK
value: xxx
- name: _USERNAME
valueFrom:
secretKeyRef:
key: username
name: hpa-jcc-poc
- name: _PASSWORD
valueFrom:
secretKeyRef:
key: password
name: hpa-jcc-poc
image: xxx
imagePullPolicy: IfNotPresent
name: -
resources:
limits:
cpu: "2"
memory: 8Gi
requests:
cpu: 500m
memory: 2Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-hzmwh
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: xxx
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 300
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-hzmwh
secret:
defaultMode: 420
secretName: default-token-hzmwh
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-02-01T18:12:34Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-02-01T18:12:36Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-02-01T18:12:36Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-02-01T18:12:34Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://c4c969ec149f43ff4494339930c8f0640d897b461060dd810c63a5d1f17fdc47
image: xxx
imageID: xxx
lastState: {}
name: -
ready: true
restartCount: 0
state:
running:
startedAt: "2021-02-01T18:12:35Z"
hostIP: 10.0.35.137
phase: Running
podIP: 10.0.21.35
qosClass: Burstable
startTime: "2021-02-01T18:12:34Z"
When I terminated the Pod (using helm delete), you can see from the timestamps that it terminated after 5 minutes, which is the grace period.
$ helm delete hpa-poc
release "hpa-poc" uninstalled
$ kubectl get pod -w | grep hpa
hpa-poc-jcc-7dbbd66d86-xtfc5 1/1 Terminating 0 3h10m
hpa-poc-jcc-7dbbd66d86-xtfc5 0/1 Terminating 0 3h15m
hpa-poc-jcc-7dbbd66d86-xtfc5 0/1 Terminating 0 3h15m
So I suspect the issue is in my Pod/container configuration, because I tried another simple Java app deployment and it terminates immediately once I terminate the Pod.
BTW, I am using an AWS EKS cluster. I'm not sure whether this is AWS-specific.
Any suggestions?
I found the issue. When I exec'd into the container, I noticed one process still running: the process tailing the log.
So I needed to kill that process, and I added that to a preStop hook. After that, my container shuts down immediately.
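For background on the behaviour above: when a Pod is deleted, its containers receive SIGTERM and are only killed with SIGKILL once terminationGracePeriodSeconds has elapsed, so a process that never exits on SIGTERM keeps the Pod in Terminating for the full grace period. The sketch below shows a process that reacts to SIGTERM and stops promptly. It is an illustration of the mechanism on a POSIX system, not the fix used above (which was a preStop hook), and the names `handle_sigterm` and `run` are made up.

```python
import os
import signal
import time

shutdown_requested = False


def handle_sigterm(signum, frame):
    """Record that termination was requested so the main loop can exit."""
    global shutdown_requested
    shutdown_requested = True


# Register the handler (must be called from the main thread).
signal.signal(signal.SIGTERM, handle_sigterm)


def run(poll_interval=0.05, max_wait=2.0):
    """Work loop that stops as soon as SIGTERM has been received.

    max_wait only bounds this demo; a real service would loop until asked
    to stop. Returns True if the loop ended because of SIGTERM.
    """
    deadline = time.monotonic() + max_wait
    while not shutdown_requested and time.monotonic() < deadline:
        time.sleep(poll_interval)
    return shutdown_requested


# Simulate the signal sent on Pod deletion by signalling ourselves.
os.kill(os.getpid(), signal.SIGTERM)
print("clean shutdown:", run())
```

A process such as tail that is started in the background and kept alive by the container's entrypoint would need the same treatment, or an explicit kill in a preStop hook as described above.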
I'm trying to create Pods in a GKE cluster. There is a Docker container containing some Python code, with a sidecar container to access the SQL database. The deployment worked perfectly previously; however, after a few weeks I tried to redeploy with kubectl apply -f file_name.yaml.
The Pods are created briefly with a 'Pending' status and disappear after 15 seconds. This happens every time. I am unable to access logs, and kubectl get pods returns nothing after those 15 seconds.
Not sure where to go from here... Any help would be appreciated!
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
pyxy-web-v1'
creationTimestamp: "2020-05-14T00:38:09Z"
labels:
run: pyxy-web-v1
name: pyxy-web-v1
namespace: default
resourceVersion: "1215073"
selfLink: /api/v1/namespaces/default/pods/pyxy-web-v1
uid: *omitted
spec:
containers:
- image: gcr.io/my-project-{*omitted}/pyxy-web:latest
imagePullPolicy: Always
name: pyxy-web-v1
ports:
- containerPort: 8080
protocol: TCP
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: cloudsql-db-credentials
key: *omitted
- name: DB_PASS
valueFrom:
secretKeyRef:
name: cloudsql-db-credentials
key: *omitted
resources:
requests:
cpu: 100m
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-94bct
readOnly: true
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.16
command: ["/cloud_sql_proxy",
"-instances=my-project-{*omitted}:us-central1:routing-app-v1=tcp:3306",
# If running on a VPC, the Cloud SQL proxy can connect via Private IP. See:
# https://cloud.google.com/sql/docs/mysql/private-ip for more info.
# "-ip_address_types=PRIVATE",
"-credential_file=/secrets/cloudsql/credentials.json"]
# [START cloudsql_security_context]
securityContext:
runAsUser: 2 # non-root user
allowPrivilegeEscalation: false
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: gke-pyxy-cluster-default-pool-{*omitted}
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 180
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-94bct
secret:
defaultMode: 420
secretName: default-token-94bct
- name: cloudsql-instance-credentials
secret:
secretName: cloudsql-instance-credentials
During the 15-second pending period, kubectl describe pods returns the following.
Name: pyxy-web-v1
Namespace: default
Priority: 0
Node: gke-pyxy-cluster-default-pool-{*omitted}/
Labels: run=pyxy-web-v1
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container cloudsql-proxy
Status: Pending
IP:
IPs: <none>
Containers:
pyxy-web-v1:
Image: gcr.io/my-project-{*omitted}/pyxy-web:latest
Port: 8080/TCP
Host Port: 0/TCP
Requests:
cpu: 100m
Environment:
DB_USER: <set to the key '*omitted' in secret 'cloudsql-db-credentials'> Optional: false
DB_PASS: <set to the key '*omitted' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-94bct (ro)
cloudsql-proxy:
Image: gcr.io/cloudsql-docker/gce-proxy:1.16
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=my-project-{*omitted}:us-central1:routing-app-v1=tcp:3306
-credential_file=/secrets/cloudsql/credentials.json
Requests:
cpu: 100m
Environment: <none>
Mounts:
/secrets/cloudsql from cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-94bct (ro)
Volumes:
default-token-94bct:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-94bct
Optional: false
cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-instance-credentials
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
However after this time, it returns
'No resources found in default namespace.'
Answer
The Pod spec had a Node Name for a Node that was no longer in the cluster (due to an upgrade). That is to say the pod.spec.nodeName was erroneous.
From kubectl explain pod.spec:
nodeName <string>
NodeName is a request to schedule this pod onto a specific node. If it is
non-empty, the scheduler simply schedules this pod onto that node, assuming
that it fits resource requirements.
During the ~15 second window the Pod was in Pending state, the following error log pointed to the solution:
Error from server (NotFound): pods "gke-pyxy-cluster-default-pool-94aa0302-pm35" not found
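One way to avoid re-applying a stale nodeName is to strip cluster-populated fields from an exported manifest before using it again. The sketch below does this on a manifest already loaded into a Python dict; the helper name `strip_cluster_state` and the chosen field list are my own assumptions, and the sample data is a shortened version of the Pod in the question.

```python
import copy

# Metadata fields that the API server fills in (a non-exhaustive, assumed list).
SERVER_SET_METADATA = ("creationTimestamp", "resourceVersion", "selfLink", "uid")


def strip_cluster_state(pod):
    """Return a copy of an exported Pod manifest without cluster-assigned state."""
    cleaned = copy.deepcopy(pod)
    cleaned.pop("status", None)
    for key in SERVER_SET_METADATA:
        cleaned.get("metadata", {}).pop(key, None)
    # Drop the node binding so the scheduler picks a node that still exists.
    cleaned.get("spec", {}).pop("nodeName", None)
    return cleaned


exported = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "pyxy-web-v1",
        "namespace": "default",
        "resourceVersion": "1215073",
        "creationTimestamp": "2020-05-14T00:38:09Z",
    },
    "spec": {
        "nodeName": "gke-pyxy-cluster-default-pool-{*omitted}",
        "containers": [{"name": "pyxy-web-v1"}],
    },
}

cleaned = strip_cluster_state(exported)
print("nodeName" in cleaned["spec"])  # False
print(sorted(cleaned["metadata"]))    # ['name', 'namespace']
```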
I am adding Alertmanager to Prometheus (prom/prometheus:v2.16.0). I have now added a rule config in prometheus-configmap.xml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
      - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
    scrape_configs:
      - job_name: traefik
        metrics_path: /metrics
        static_configs:
          - targets:
              - traefik.kube-system.svc.cluster.local:8080
  rules.yml: |
    groups:
      - name: test-rule
        rules:
          - alert: NodeFilesystemUsage
            expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High Filesystem usage detected"
              description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
Then I reloaded the config:
kubectl apply -f prometheus-configmap.xm
kubectl exec -it soa-room-service-686959b94d-9g5q2 /bin/bash
curl -X POST http://prometheus.kube-system.svc.cluster.local:9090/-/reload
The config page of the Prometheus dashboard shows this:
global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      scheme: http
      timeout: 10s
      api_version: v1
rule_files:
  - /etc/prometheus/rules.yml
scrape_configs:
  - job_name: traefik
    honor_timestamps: true
    scrape_interval: 1m
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - traefik.kube-system.svc.cluster.local:8080
The alert rules do not take effect. What should I do to make them work?
This is how I installed Prometheus:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
namespace: kube-system
labels:
k8s-app: prometheus
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
version: v2.2.1
spec:
serviceName: "prometheus"
replicas: 1
podManagementPolicy: "Parallel"
updateStrategy:
type: "RollingUpdate"
selector:
matchLabels:
k8s-app: prometheus
template:
metadata:
labels:
k8s-app: prometheus
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ''
spec:
priorityClassName: system-cluster-critical
serviceAccountName: prometheus
initContainers:
- name: "init-chown-data"
image: "busybox:latest"
imagePullPolicy: "IfNotPresent"
command: ["chown", "-R", "65534:65534", "/data"]
volumeMounts:
- name: prometheus-data
mountPath: /data
subPath: ""
containers:
- name: prometheus-server-configmap-reload
image: "jimmidyson/configmap-reload:v0.1"
imagePullPolicy: "IfNotPresent"
args:
- --volume-dir=/etc/config
- --webhook-url=http://localhost:9090/-/reload
volumeMounts:
- name: config-volume
mountPath: /etc/config
readOnly: true
resources:
limits:
cpu: 10m
memory: 10Mi
requests:
cpu: 10m
memory: 10Mi
- name: prometheus-server
image: "prom/prometheus:v2.16.0"
imagePullPolicy: "IfNotPresent"
args:
- --config.file=/etc/config/prometheus.yml
- --storage.tsdb.path=/data
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
- --web.enable-lifecycle
ports:
- containerPort: 9090
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
# based on 10 running nodes with 30 pods each
resources:
limits:
cpu: 200m
memory: 1000Mi
requests:
cpu: 200m
memory: 1000Mi
volumeMounts:
- name: config-volume
mountPath: /etc/config
- name: prometheus-data
mountPath: /data
subPath: ""
terminationGracePeriodSeconds: 300
volumes:
- name: config-volume
configMap:
name: prometheus-config
volumeClaimTemplates:
- metadata:
name: prometheus-data
spec:
storageClassName: standard
accessModes:
- ReadWriteOnce
resources:
requests:
storage: "16Gi"
This is my pod describe output:
kubectl describe pods prometheus-0 -n kube-system
Name: prometheus-0
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: azshara-k8s01/172.19.104.231
Start Time: Wed, 11 Mar 2020 19:28:28 +0800
Labels: controller-revision-hash=prometheus-cf5dc9d8b
k8s-app=prometheus
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 172.30.224.4
IPs: <none>
Controlled By: StatefulSet/prometheus
Init Containers:
init-chown-data:
Container ID: docker://a3adc4bce1dccbdd6adb27ca38c54b7ae670d605b6273d53e85f601649357709
Image: busybox:latest
Image ID: docker-pullable://busybox#sha256:b26cd013274a657b86e706210ddd5cc1f82f50155791199d29b9e86e935ce135
Port: <none>
Host Port: <none>
Command:
chown
-R
65534:65534
/data
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 11 Mar 2020 19:28:29 +0800
Finished: Wed, 11 Mar 2020 19:28:29 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data from prometheus-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-k8d22 (ro)
Containers:
prometheus-server-configmap-reload:
Container ID: docker://9d31d10c9246ddfa94d84d59737edd03f06e008960657b000461ae886d030516
Image: jimmidyson/configmap-reload:v0.1
Image ID: docker-pullable://jimmidyson/configmap-reload#sha256:2d40c2eaa6f435b2511d0cfc5f6c0a681eeb2eaa455a5d5ac25f88ce5139986e
Port: <none>
Host Port: <none>
Args:
--volume-dir=/etc/config
--webhook-url=http://localhost:9090/-/reload
State: Running
Started: Wed, 11 Mar 2020 19:28:30 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 10m
memory: 10Mi
Requests:
cpu: 10m
memory: 10Mi
Environment: <none>
Mounts:
/etc/config from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-k8d22 (ro)
prometheus-server:
Container ID: docker://65d2870debb187a20a102786cac3725745e5bc0d60f3e04cb38c2beea6f5c128
Image: prom/prometheus:v2.16.0
Image ID: docker-pullable://prom/prometheus#sha256:e4ca62c0d62f3e886e684806dfe9d4e0cda60d54986898173c1083856cfda0f4
Port: 9090/TCP
Host Port: 0/TCP
Args:
--config.file=/etc/config/prometheus.yml
--storage.tsdb.path=/data
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles
--web.enable-lifecycle
State: Running
Started: Wed, 11 Mar 2020 19:28:30 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 200m
memory: 1000Mi
Requests:
cpu: 200m
memory: 1000Mi
Liveness: http-get http://:9090/-/healthy delay=30s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get http://:9090/-/ready delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/data from prometheus-data (rw)
/etc/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-k8d22 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
prometheus-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-data-prometheus-0
ReadOnly: false
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-config
Optional: false
prometheus-token-k8d22:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-token-k8d22
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 360s
node.kubernetes.io/unreachable:NoExecute for 360s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 50m default-scheduler Successfully assigned kube-system/prometheus-0 to azshara-k8s01
Normal Pulled 50m kubelet, azshara-k8s01 Container image "busybox:latest" already present on machine
Normal Created 50m kubelet, azshara-k8s01 Created container init-chown-data
Normal Started 50m kubelet, azshara-k8s01 Started container init-chown-data
Normal Pulled 50m kubelet, azshara-k8s01 Container image "jimmidyson/configmap-reload:v0.1" already present on machine
Normal Created 50m kubelet, azshara-k8s01 Created container prometheus-server-configmap-reload
Normal Started 50m kubelet, azshara-k8s01 Started container prometheus-server-configmap-reload
Normal Pulled 50m kubelet, azshara-k8s01 Container image "prom/prometheus:v2.16.0" already present on machine
Normal Created 50m kubelet, azshara-k8s01 Created container prometheus-server
Normal Started 50m kubelet, azshara-k8s01 Started container prometheus-server
There are a few ways to check your configuration:
https://prometheus.io/docs/alerting/configuration/ (check the documentation, and try running Alertmanager in a console on your own computer to see the log messages during startup)
https://prometheus.io/webtools/alerting/routing-tree-editor/ (a visualization of the alerting routes, which can be handy; a parsing error shows up as a wrong visualization)
https://github.com/prometheus/alertmanager/issues/333 (a tool for checking the config directly)
I am not familiar with your Kubernetes setup, so I am not able to verify it for you. I hope these links help.
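Alertmanager also ships a CLI, amtool, whose check-config subcommand validates a configuration file before it is deployed. A minimal sketch, assuming amtool is installed locally; the helper name check_alertmanager_config and the default file name alertmanager.yml are placeholders, not taken from the question:

```shell
# Validate an Alertmanager config file before deploying it.
# Assumes amtool (shipped with Alertmanager) is on PATH.
# The default file name alertmanager.yml is a placeholder.
check_alertmanager_config() {
  amtool check-config "${1:-alertmanager.yml}"
}

# Usage:
#   check_alertmanager_config ./alertmanager.yml
```

A non-zero exit status means the file failed validation, so the helper can also be used as a gate in a deploy script.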
If it is deployed using the Prometheus Operator, you need to create a PrometheusRule object. Once the object exists, the operator picks up the new alert rules automatically. Here is a sample:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: service-prometheus
    role: alert-rules
  name: prometheus-service-rules
  namespace: monitoring
spec:
  groups:
  - name: general.rules
    rules:
    - alert: TargetDown-serviceprom
      annotations:
        description: '{{ $value }}% of {{ $labels.job }} targets are down.'
        summary: Targets are down
      expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
      for: 10m
      labels:
        severity: warning
    - alert: DeadMansSwitch-serviceprom
      annotations:
        description: This is a DeadMansSwitch meant to ensure that the entire Alerting
          pipeline is functional.
        summary: Alerting DeadMansSwitch
      expr: vector(1)
      labels:
        severity: none
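The operator generally loads a PrometheusRule only when its labels match the ruleSelector of the Prometheus object, so it can help to compare the two if the rules do not show up. A minimal sketch, assuming the Prometheus object lives in the monitoring namespace like the sample rule; the helper name show_rule_selectors is made up for illustration:

```shell
# Print the name and ruleSelector of every Prometheus object in a namespace,
# so the selector can be compared with the labels on the PrometheusRule.
# The default namespace "monitoring" is an assumption based on the sample.
show_rule_selectors() {
  kubectl --namespace "${1:-monitoring}" get prometheus \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.ruleSelector}{"\n"}{end}'
}

# Usage:
#   show_rule_selectors monitoring
#   kubectl --namespace monitoring get prometheusrule prometheus-service-rules --show-labels
```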
The rules file is in /etc/config, not in /etc/prometheus, so change the path that rule_files reads from. The configuration should look like this:
rule_files:
  - /etc/config/rules.yml
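To confirm that Prometheus can actually find the rule file at the new path, the config can be validated from inside the running pod. A minimal sketch, assuming promtool is available on the PATH of the prom/prometheus image; the pod, namespace, container, and config path are taken from the describe output, and the helper name check_prometheus_config is made up for illustration:

```shell
# Validate the Prometheus config, including the rule files it references,
# inside the running pod. Names come from the describe output in the
# question; adjust them if yours differ.
# Assumption: promtool is on PATH inside the prometheus-server container.
check_prometheus_config() {
  kubectl --namespace kube-system exec prometheus-0 -c prometheus-server -- \
    promtool check config "${1:-/etc/config/prometheus.yml}"
}

# Usage:
#   check_prometheus_config
```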
I'm running a microk8s cluster on an Ubuntu server at home, and I have it connected to a local NAS server for persistent storage. I've been using it as my personal proving grounds for learning Kubernetes, but I seem to encounter problem after problem at just about every step of the way.
I've got the NFS Client Provisioner Helm chart installed which I've confirmed works - it will dynamically provision PVCs on my NAS server. I later was able to successfully install the Postgres Helm chart, or so I thought. After creating it I was able to connect to it using a SQL client, and I was feeling good.
Until a couple of days later, I noticed the pod was showing 0/1 containers ready. Although interestingly, the nfs-client-provisioner pod was still showing 1/1. Long story short: I've deleted/purged the Postgres Helm chart, and attempted to reinstall it, but now it no longer works. In fact, nothing new that I try to deploy works. Everything looks as though it's going to work, but then just hangs on either Init or ContainerCreating forever.
With Postgres in particular, the command I've been running is this:
helm install --name postgres stable/postgresql -f postgres.yaml
And my postgres.yaml file looks like this:
persistence:
  storageClass: nfs-client
  accessMode: ReadWriteMany
  size: 2Gi
But if I do a kubectl get pods, I still see this:
NAME READY STATUS RESTARTS AGE
nfs-client-provisioner 1/1 Running 1 11d
postgres-postgresql-0 0/1 Init:0/1 0 3h51m
If I do a kubectl describe pod postgres-postgresql-0, this is the output:
Name: postgres-postgresql-0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: stjohn/192.168.1.217
Start Time: Thu, 28 Mar 2019 12:51:02 -0500
Labels: app=postgresql
chart=postgresql-3.11.7
controller-revision-hash=postgres-postgresql-5bfb9cc56d
heritage=Tiller
release=postgres
role=master
statefulset.kubernetes.io/pod-name=postgres-postgresql-0
Annotations: <none>
Status: Pending
IP:
Controlled By: StatefulSet/postgres-postgresql
Init Containers:
init-chmod-data:
Container ID:
Image: docker.io/bitnami/minideb:latest
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
chown -R 1001:1001 /bitnami
if [ -d /bitnami/postgresql/data ]; then
chmod 0700 /bitnami/postgresql/data;
fi
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Environment: <none>
Mounts:
/bitnami/postgresql from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h4gph (ro)
Containers:
postgres-postgresql:
Container ID:
Image: docker.io/bitnami/postgresql:10.7.0
Image ID:
Port: 5432/TCP
Host Port: 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 256Mi
Liveness: exec [sh -c exec pg_isready -U "postgres" -h localhost] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [sh -c exec pg_isready -U "postgres" -h localhost] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
PGDATA: /bitnami/postgresql
POSTGRES_USER: postgres
POSTGRES_PASSWORD: <set to the key 'postgresql-password' in secret 'postgres-postgresql'> Optional: false
Mounts:
/bitnami/postgresql from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-h4gph (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-postgres-postgresql-0
ReadOnly: false
default-token-h4gph:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-h4gph
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
And if I do a kubectl get pod postgres-postgresql-0 -o yaml, this is the output:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2019-03-28T17:51:02Z"
generateName: postgres-postgresql-
labels:
app: postgresql
chart: postgresql-3.11.7
controller-revision-hash: postgres-postgresql-5bfb9cc56d
heritage: Tiller
release: postgres
role: master
statefulset.kubernetes.io/pod-name: postgres-postgresql-0
name: postgres-postgresql-0
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: postgres-postgresql
uid: 0d3ef673-5182-11e9-bf14-b8975a0ca30c
resourceVersion: "1953329"
selfLink: /api/v1/namespaces/default/pods/postgres-postgresql-0
uid: 0d4dfb56-5182-11e9-bf14-b8975a0ca30c
spec:
containers:
- env:
- name: PGDATA
value: /bitnami/postgresql
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
key: postgresql-password
name: postgres-postgresql
image: docker.io/bitnami/postgresql:10.7.0
imagePullPolicy: Always
livenessProbe:
exec:
command:
- sh
- -c
- exec pg_isready -U "postgres" -h localhost
failureThreshold: 6
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
name: postgres-postgresql
ports:
- containerPort: 5432
name: postgresql
protocol: TCP
readinessProbe:
exec:
command:
- sh
- -c
- exec pg_isready -U "postgres" -h localhost
failureThreshold: 6
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
resources:
requests:
cpu: 250m
memory: 256Mi
securityContext:
procMount: Default
runAsUser: 1001
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bitnami/postgresql
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-h4gph
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: postgres-postgresql-0
initContainers:
- command:
- sh
- -c
- |
chown -R 1001:1001 /bitnami
if [ -d /bitnami/postgresql/data ]; then
chmod 0700 /bitnami/postgresql/data;
fi
image: docker.io/bitnami/minideb:latest
imagePullPolicy: Always
name: init-chmod-data
resources:
requests:
cpu: 250m
memory: 256Mi
securityContext:
procMount: Default
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bitnami/postgresql
name: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-h4gph
readOnly: true
nodeName: stjohn
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1001
serviceAccount: default
serviceAccountName: default
subdomain: postgres-postgresql-headless
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: data
persistentVolumeClaim:
claimName: data-postgres-postgresql-0
- name: default-token-h4gph
secret:
defaultMode: 420
secretName: default-token-h4gph
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with incomplete status: [init-chmod-data]'
reason: ContainersNotInitialized
status: "False"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with unready status: [postgres-postgresql]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
message: 'containers with unready status: [postgres-postgresql]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2019-03-28T17:51:02Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: docker.io/bitnami/postgresql:10.7.0
imageID: ""
lastState: {}
name: postgres-postgresql
ready: false
restartCount: 0
state:
waiting:
reason: PodInitializing
hostIP: 192.168.1.217
initContainerStatuses:
- image: docker.io/bitnami/minideb:latest
imageID: ""
lastState: {}
name: init-chmod-data
ready: false
restartCount: 0
state:
waiting:
reason: PodInitializing
phase: Pending
qosClass: Burstable
startTime: "2019-03-28T17:51:02Z"
I don't see anything obvious in these to be able to pinpoint what might be going on. And I've already rebooted the server just to see if that might help. Any thoughts? Why won't my containers start?
It looks like your initContainer is stuck in the PodInitializing state. The most likely scenario is that your PVCs are not ready. I recommend you describe your data-postgres-postgresql-0 PVC to make sure that the volume has actually been provisioned and that the claim is in the Bound state. Your NFS provisioner may be working, but that specific PV/PVC may not have been created due to an error. I have run into similar behavior with the EFS provisioner on AWS.
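The PVC check can be sketched as a small helper that prints only the claim's phase. The claim name and the default namespace come from the question; the helper name pvc_phase is made up for illustration:

```shell
# Print the phase of a PersistentVolumeClaim (Pending, Bound, or Lost).
# $1 = claim name, $2 = namespace (defaults to "default").
pvc_phase() {
  kubectl get pvc "$1" --namespace "${2:-default}" -o jsonpath='{.status.phase}'
}

# Usage:
#   pvc_phase data-postgres-postgresql-0
#   kubectl describe pvc data-postgres-postgresql-0   # also lists provisioning events
```

A claim that stays in Pending points at the provisioner; a Bound claim with a pod still stuck suggests looking at the mount itself on the node instead.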
You can list events with kubectl, which shows you what has happened to your pod.
To filter for a specific pod, use a field selector:
kubectl get event --namespace abc-namespace --field-selector involvedObject.name=my-pod-zl6m6
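When a pod has many events, sorting them by time makes the most recent failure easier to spot. A minimal sketch that wraps the command above; the helper name pod_events is made up for illustration, and the namespace defaults to default:

```shell
# List the events for one pod, oldest first.
# $1 = pod name, $2 = namespace (defaults to "default").
pod_events() {
  kubectl get events --namespace "${2:-default}" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage:
#   pod_events postgres-postgresql-0
```

Events are only retained for a limited time, so an empty list for a pod that has been stuck for hours does not mean nothing went wrong.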