K8S events: restarting container, pods: zzzz? - kubernetes

k = kubectl. I'm getting these events:
$ k get events -w
...snip
2018-02-03 13:46:06 +0100 CET 2018-02-03 13:46:06 +0100 CET 1 consul-0.150fd18470775752 Pod spec.containers{consul} Normal Started kubelet, gke-projectid-default-pool-2de02f1c-059w Started container
2018-02-03 13:46:06 +0100 CET 2018-02-03 13:46:06 +0100 CET 1 consul-0.150fd184668e88a6 Pod spec.containers{consul} Normal Created kubelet, gke-projectid-default-pool-2de02f1c-059w Created container
2018-02-03 13:47:35 +0100 CET 2018-02-03 13:47:35 +0100 CET 1 consul-0.150fd1993877443c Pod Warning FailedMount kubelet, gke-projectid-default-pool-2de02f1c-059w Unable to mount volumes for pod "consul-0_staging(1f35ac42-08e0-11e8-850a-42010af001f0)": timeout expired waiting for volumes to attach/mount for pod "staging"/"consul-0". list of unattached/unmounted volumes=[data config tls default-token-93wx3]
Meanwhile, at the same time:
$ k get pods
consul-0 1/1 Running 0 49m
consul-1 1/1 Running 0 1h
consul-2 1/1 Running 0 1h
...snip
What is going on? Why are the events telling me it's restarting/starting the container? k logs pods/consul-0 (and -1 and -2) doesn't say anything about them being restarted.

The third column of the events output tells you the number of times an event has been seen. In your case, that value is 1. So it's not restarting your container: it's just telling you that at some point in the past, it created and started the container. That's why you can see it's running when you kubectl get pods.
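If you want to double-check from the pod status itself, the restart count can be read directly; a minimal example (the jsonpath below assumes consul-0 has a single container):
$ kubectl get pod consul-0 -o jsonpath='{.status.containerStatuses[0].restartCount}'
0
A value of 0 matches the RESTARTS column in kubectl get pods and confirms the kubelet has never restarted the container, regardless of what the event stream shows.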

Related

Kubelet Master stays in KubeletNotReady because of cni missing

The kubelet has been initialized with the pod network for Calico:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --image-repository=someserver
Then I fetched calico.yaml v3.11 and applied it:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml
Right after, I checked the node status:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" get nodes
NAME STATUS ROLES AGE VERSION
master-1 NotReady master 7m21s v1.17.2
On describe I get "cni config uninitialized", but I thought Calico should have taken care of that?
MemoryPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
In fact I have nothing under /etc/cni/net.d/, so it seems something was skipped?
ll /etc/cni/net.d/
total 0
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-f7lqq 0/1 Pending 0 3h
calico-node-f4xzh 0/1 Init:ImagePullBackOff 0 3h
coredns-7fb8cdf968-bbqbz 0/1 Pending 0 3h24m
coredns-7fb8cdf968-vdnzx 0/1 Pending 0 3h24m
etcd-master-1 1/1 Running 0 3h24m
kube-apiserver-master-1 1/1 Running 0 3h24m
kube-controller-manager-master-1 1/1 Running 0 3h24m
kube-proxy-9m879 1/1 Running 0 3h24m
kube-scheduler-master-1 1/1 Running 0 3h24m
As explained, I'm pulling through a local repo, and journalctl says:
kubelet[21935]: E0225 14:30:54.830683 21935 pod_workers.go:191] Error syncing pod cec2f72b-844a-4d6b-8606-3aff06d4a36d ("calico-node-f4xzh_kube-system(cec2f72b-844a-4d6b-8606-3aff06d4a36d)"), skipping: failed to "StartContainer" for "upgrade-ipam" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://repo:10000/v2/calico/cni/manifests/v3.11.2: no basic auth credentials"
kubelet[21935]: E0225 14:30:56.008989 21935 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
It feels like CNI is not the only issue.
The CoreDNS pods will stay Pending and the master will stay NotReady until the Calico pods are running successfully and CNI is set up properly.
It looks like a network issue downloading the Calico images from docker.io. You can pull the Calico images from docker.io, push them to your internal container registry, modify the image references in calico.yaml to point at that registry, and finally apply the modified calico.yaml to the Kubernetes cluster.
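For example, a rough sketch of that workflow, using the repo:10000 registry and the v3.11.2 tag that appear in the journalctl error above (adjust names and add registry credentials as needed for your environment):
sudo docker pull calico/cni:v3.11.2
sudo docker tag calico/cni:v3.11.2 repo:10000/calico/cni:v3.11.2
sudo docker push repo:10000/calico/cni:v3.11.2
# repeat for calico/node, calico/pod2daemon-flexvol and calico/kube-controllers,
# then change every image: line in calico.yaml to point at repo:10000/... and re-apply:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml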
So the issue with Init:ImagePullBackOff was that it could not pull the images from my private repo automatically. I had to pull all the Calico images from Docker myself. Then I deleted the calico pod and it recreated itself with the newly pushed images:
sudo docker pull private-repo/calico/pod2daemon-flexvol:v3.11.2
sudo docker pull private-repo/calico/node:v3.11.2
sudo docker pull private-repo/calico/cni:v3.11.2
sudo docker pull private-repo/calico/kube-controllers:v3.11.2
sudo kubectl -n kube-system delete po/calico-node-y7g5
After that, the node redid all the init phases and:
sudo kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-qkf47 1/1 Running 0 11s
calico-node-mkcsr 1/1 Running 0 21m
coredns-7fb8cdf968-bgqvj 1/1 Running 0 37m
coredns-7fb8cdf968-v85jx 1/1 Running 0 37m
etcd-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-apiserver-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-controller-manager-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-proxy-9hkns 1/1 Running 0 37m
kube-scheduler-lin-1k8w1dv-vmh 1/1 Running 0 38m

kubernetes pod restart count shows inconsistent values when kubectl get pod -w is run

I have been playing around with minikube, and after a set of operations the output of kubectl get pod -w is like this:
nginx 1/1 Running 2 10m
nginx 1/1 Running 3 10m
nginx 0/1 Completed 2 10m
nginx 0/1 CrashLoopBackOff 2 11m
nginx 1/1 Running 3 11m
nginx 1/1 Running 3 12m
I don't understand the counts shown at lines 3 and 4. What exactly does the restart count convey?
About the CrashLoopBackOff Status:
A CrashLoopBackOff means that you have a pod starting, crashing, starting again, and then crashing again.
Failed containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s, ...) capped at five minutes, which is reset after ten minutes of successful execution.
CrashLoopBackOff events occur for different reasons, most commonly related to the following:
- The application inside the container keeps crashing
- Some parameter of the pod or container has been configured incorrectly
- An error was made during the deployment
Whenever you face a CrashLoopBackOff, do a kubectl describe to investigate:
kubectl describe pod POD_NAME --namespace NAMESPACE_NAME
user@minikube:~$ kubectl describe pod ubuntu-5d4bb4fd84-8gl67 --namespace default
Name: ubuntu-5d4bb4fd84-8gl67
Namespace: default
Priority: 0
Node: minikube/192.168.39.216
Start Time: Thu, 09 Jan 2020 09:51:03 +0000
Labels: app=ubuntu
pod-template-hash=5d4bb4fd84
Status: Running
Controlled By: ReplicaSet/ubuntu-5d4bb4fd84
Containers:
ubuntu:
Container ID: docker://c4c0295e1e050b5e395fc7b368a8170f863159879821dd2562bc2938d17fc6fc
Image: ubuntu
Image ID: docker-pullable://ubuntu@sha256:250cc6f3f3ffc5cdaa9d8f4946ac79821aafb4d3afc93928f0de9336eba21aa4
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 09 Jan 2020 09:54:37 +0000
Finished: Thu, 09 Jan 2020 09:54:37 +0000
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 09 Jan 2020 09:53:05 +0000
Finished: Thu, 09 Jan 2020 09:53:05 +0000
Ready: False
Restart Count: 5
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxst (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-xxxst:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xxxst
Optional: false
QoS Class: BestEffort
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m16s default-scheduler Successfully assigned default/ubuntu-5d4bb4fd84-8gl67 to minikube
Normal Created 5m59s (x4 over 6m52s) kubelet, minikube Created container ubuntu
Normal Started 5m58s (x4 over 6m52s) kubelet, minikube Started container ubuntu
Normal Pulling 5m17s (x5 over 7m5s) kubelet, minikube Pulling image "ubuntu"
Normal Pulled 5m15s (x5 over 6m52s) kubelet, minikube Successfully pulled image "ubuntu"
Warning BackOff 2m2s (x24 over 6m43s) kubelet, minikube Back-off restarting failed container
The Events section will provide you with a detailed explanation of what happened.
RestartCount represents the number of times the container inside a pod has been restarted; it is based on the number of dead containers that have not yet been removed.
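If you only want that number, it can also be read straight from the pod status; for example, for the single-container nginx pod used below:
$ kubectl get pod nginx -o jsonpath='{.status.containerStatuses[0].restartCount}'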
The -w flag on the command means watch, and the various headers are as listed below:
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 21m
To get more detailed output, use the -o wide flag:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 1 21h 10.244.2.36 worker-node-2 <none> <none>
So the READY field represents the containers inside the pod; it can be seen in detail with the describe pod command. Refer to the Pod Lifecycle documentation.
$ kubectl describe pod nginx| grep -i -A6 "Conditions"
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
The RESTARTS field is tracked under Restart Count; grep it from the pod description as below:
$ kubectl describe pod nginx | grep -i "Restart"
Restart Count: 0
So as a test, we now restart the above container and see which fields are updated.
We find the node where our container is running and kill the container from the node using docker; it should be restarted automatically by Kubernetes.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 21h 10.244.2.36 worker-node-2 <none> <none>
ubuntu@worker-node-2:~$ sudo docker ps -a | grep -i nginx
4c8e2e6bf67c nginx "nginx -g 'daemon of…" 22 hours ago Up 22 hours
ubuntu@worker-node-2:~$ sudo docker kill 4c8e2e6bf67c
4c8e2e6bf67c
The pod STATUS changes to Error
The READY count goes to 0/1
ubuntu@cluster-master:~$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 0/1 Error 0 21h 10.244.2.36 worker-node-2 <none> <none>
Once the pod recovers the failed container:
The READY count is 1/1 again
The STATUS changes back to Running
The RESTARTS count is incremented by 1
ubuntu@cluster-master:~$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 1 21h 10.244.2.36 worker-node-2 <none> <none>
Check the restart count with the describe command as well:
$ kubectl describe pods nginx | grep -i "Restart"
Restart Count: 1
The values in your output are not inconsistent; that is how a pod with a restartPolicy of Always works: it will keep trying to bring the failed container back, using the back-off delay described above.
Refer to the Pod State Examples (a minimal manifest showing where restartPolicy is set is sketched after the list below):
Pod is running and has one Container. Container exits with success.
Log completion event.
If restartPolicy is:
Always: Restart Container; Pod phase stays Running.
OnFailure: Pod phase becomes Succeeded.
Never: Pod phase becomes Succeeded.
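As a reference for where restartPolicy lives, here is a minimal pod manifest sketch (the name, image and command are just placeholders): with OnFailure or Never, a container that exits 0 leaves the pod in the Succeeded phase, as in the list above.
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo
spec:
  restartPolicy: OnFailure   # Always (default) | OnFailure | Never
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 0"]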
List the restarted pods across all namespaces:
kubectl get pods -A |awk '$5 != "0" {print $0}'

Kubernetes kubectl shows pods restarts as zero but pods age has changed

Can somebody explain why the following command shows that there have been no restarts, but the age is 2 hours when the pod was started 17 days ago?
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE
api-depl-nm-xxx 1/1 Running 0 17d xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal
ei-depl-nm-xxx 1/1 Running 0 2h xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal
jenkins-depl-nm-xxx 1/1 Running 0 2h xxx.xxx.xxx.xxx ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal
The deployments have been running for 17 days:
kubectl get deploy -o wide
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE CONTAINER(S) IMAGE(S) SELECTOR
api-depl-nm 1 1 1 1 17d api-depl-nm xxx name=api-depl-nm
ei-depl-nm 1 1 1 1 17d ei-depl-nm xxx name=ei-depl-nm
jenkins-depl-nm 1 1 1 1 17d jenkins-depl-nm xxx name=jenkins-depl-nm
The start time was 2 hours ago:
kubectl describe po ei-depl-nm-xxx | grep Start
Start Time: Tue, 24 Jul 2018 09:07:05 +0100
Started: Tue, 24 Jul 2018 09:10:33 +0100
The application logs show it restarted.
So why is the restart count 0?
Updated with more information in response to the answer.
I may be wrong, but I don't think the deployment was updated or scaled; it certainly was not done by me, and no one else has access to the system.
kubectl describe deployment ei-depl-nm
...
CreationTimestamp: Fri, 06 Jul 2018 17:06:24 +0100
Labels: name=ei-depl-nm
...
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
...
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: ei-depl-nm-xxx (1/1 replicas created)
Events: <none>
I may be wrong, but I don't think the worker node was restarted or shut down:
kubectl describe nodes ip-xxx.eu-west-1.compute.internal
Taints: <none>
CreationTimestamp: Fri, 06 Jul 2018 16:39:40 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 06 Jul 2018 16:39:45 +0100 Fri, 06 Jul 2018 16:39:45 +0100 RouteCreated RouteController created a route
OutOfDisk False Wed, 25 Jul 2018 16:30:36 +0100 Fri, 06 Jul 2018 16:39:40 +0100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 25 Jul 2018 16:30:36 +0100 Wed, 25 Jul 2018 02:23:01 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 25 Jul 2018 16:30:36 +0100 Wed, 25 Jul 2018 02:23:01 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Wed, 25 Jul 2018 16:30:36 +0100 Wed, 25 Jul 2018 02:23:11 +0100 KubeletReady kubelet is posting ready status
......
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default ei-depl-nm-xxx 100m (5%) 0 (0%) 0 (0%) 0 (0%)
default jenkins-depl-nm-xxx 100m (5%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-dns-xxx 260m (13%) 0 (0%) 110Mi (1%) 170Mi (2%)
kube-system kube-proxy-ip-xxx.eu-west-1.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
560m (28%) 0 (0%) 110Mi (1%) 170Mi (2%)
Events: <none>
There are two things that might have happened:
The deployment was updated or scaled:
the age of the deployment does not change
a new ReplicaSet is created and the old ReplicaSet is deleted. You can check this by running
$ kubectl describe deployment <deployment_name>
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 1m deployment-controller Scaled up replica set testdep1-75488876f6 to 1
Normal ScalingReplicaSet 1m deployment-controller Scaled down replica set testdep1-d4884df5f to 0
pods created by the old ReplicaSet are terminated; the new ReplicaSet creates a brand-new pod with 0 restarts and an age of 0 seconds.
The worker node was restarted or shut down:
the pod on the old worker node disappears
the scheduler creates a brand-new pod on the first available node (it can be the same node after a reboot) with 0 restarts and an age of 0 seconds.
You can check the node start events by running
kubectl describe nodes <node_name>
...
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 32s kubelet, <node-name> Starting kubelet.
Normal NodeHasSufficientPID 31s (x5 over 32s) kubelet, <node-name> Node <node-name> status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 31s kubelet, <node-name> Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 30s (x6 over 32s) kubelet, <node-name> Node <node-name> status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 30s (x6 over 32s) kubelet, <node-name> Node <node-name> status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 30s (x6 over 32s) kubelet, <node-name> Node <node-name> status is now: NodeHasNoDiskPressure
Normal Starting 10s kube-proxy, <node-name> Starting kube-proxy.
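One way to tell the two cases apart (a sketch, using the pod name from the question): compare the pod's own creation time and its ReplicaSet's age with the deployment's age.
kubectl get rs -o wide
kubectl get pod ei-depl-nm-xxx -o jsonpath='{.metadata.creationTimestamp}'
If the pod's creationTimestamp is about two hours old while the owning ReplicaSet is still 17 days old, the pod object was recreated without a rollout (for example after a node problem); if a new ReplicaSet exists, it was a deployment update.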

kubernetes HA cluster masters nodes not ready

I have deployed a Kubernetes HA cluster using the following config.yaml:
etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease
When I check kubectl get nodes:
NAME STATUS ROLES AGE VERSION
master1 Ready master 22m v1.10.4
master2 NotReady master 17m v1.10.4
master3 NotReady master 16m v1.10.4
If I check the pods, I can see that too many are failing:
[ikerlan#master1 ~]$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-etcd-5jftb 0/1 NodeLost 0 16m
calico-etcd-kl7hb 1/1 Running 0 16m
calico-etcd-z7sps 0/1 NodeLost 0 16m
calico-kube-controllers-79dccdc4cc-vt5t7 1/1 Running 0 16m
calico-node-dbjl2 2/2 Running 0 16m
calico-node-gkkth 0/2 NodeLost 0 16m
calico-node-rqzzl 0/2 NodeLost 0 16m
kube-apiserver-master1 1/1 Running 0 21m
kube-controller-manager-master1 1/1 Running 0 22m
kube-dns-86f4d74b45-rwchm 1/3 CrashLoopBackOff 17 22m
kube-proxy-226xd 1/1 Running 0 22m
kube-proxy-jr2jq 0/1 ContainerCreating 0 18m
kube-proxy-zmjdm 0/1 ContainerCreating 0 17m
kube-scheduler-master1 1/1 Running 0 21m
If I run kubectl describe node master2:
[ikerlan#master1 ~]$ kubectl describe node master2
Name: master2
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=master2
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Mon, 11 Jun 2018 12:06:03 +0200
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:00 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 172.16.8.237
Hostname: master2
Capacity:
cpu: 2
ephemeral-storage: 37300436Ki
Then if I check the pod with kubectl describe pod -n kube-system calico-etcd-5jftb:
[ikerlan#master1 ~]$ kubectl describe pod -n kube-system calico-etcd-5jftb
Name: calico-etcd-5jftb
Namespace: kube-system
Node: master2/
Labels: controller-revision-hash=4283683065
k8s-app=calico-etcd
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Terminating (lasts 20h)
Termination Grace Period: 30s
Reason: NodeLost
Message: Node master2 which was running pod calico-etcd-5jftb is unresponsive
IP:
Controlled By: DaemonSet/calico-etcd
Containers:
calico-etcd:
Image: quay.io/coreos/etcd:v3.1.10
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/etcd
Args:
--name=calico
--data-dir=/var/etcd/calico-data
--advertise-client-urls=http://$CALICO_ETCD_IP:6666
--listen-client-urls=http://0.0.0.0:6666
--listen-peer-urls=http://0.0.0.0:6667
--auto-compaction-retention=1
Environment:
CALICO_ETCD_IP: (v1:status.podIP)
Mounts:
/var/etcd from var-etcd (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tj6d7 (ro)
Volumes:
var-etcd:
Type: HostPath (bare host directory volume)
Path: /var/etcd
HostPathType:
default-token-tj6d7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tj6d7
Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events: <none>
I have tried updating the etcd cluster to version 3.3, and now I see the following logs (and some more timeouts):
2018-06-12 09:17:51.305960 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices/v1beta1.authentication.k8s.io\" " took too long (190.475363ms) to execute
2018-06-12 09:18:06.788558 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (109.543763ms) to execute
2018-06-12 09:18:34.875823 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (136.649505ms) to execute
2018-06-12 09:18:41.634057 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " took too long (106.00073ms) to execute
2018-06-12 09:18:42.345564 W | etcdserver: request "header:<ID:4449666326481959890 > lease_revoke:<ID:4449666326481959752 > " took too long (142.771179ms) to execute
I have checked kubectl get events:
22m 22m 1 master2.15375fdf087fc69f Node Normal Starting kube-proxy, master2 Starting kube-proxy.
22m 22m 1 master3.15375fe744055758 Node Normal Starting kubelet, master3 Starting kubelet.
22m 22m 5 master3.15375fe74d47afa2 Node Normal NodeHasSufficientDisk kubelet, master3 Node master3 status is now: NodeHasSufficientDisk
22m 22m 5 master3.15375fe74d47f80f Node Normal NodeHasSufficientMemory kubelet, master3 Node master3 status is now: NodeHasSufficientMemory
22m 22m 5 master3.15375fe74d48066e Node Normal NodeHasNoDiskPressure kubelet, master3 Node master3 status is now: NodeHasNoDiskPressure
22m 22m 5 master3.15375fe74d481368 Node Normal NodeHasSufficientPID kubelet, master3 Node master3 status is now: NodeHasSufficientPID
I see multiple calico-etcd pods attempting to run. If you have used a calico.yaml that deploys etcd for you, that will not work in a multi-master environment.
That manifest is not intended for production deployment and will not work in a multi-master environment, because the etcd it deploys is not configured to form a cluster.
You could still use that manifest but you would need to remove the etcd pods it deploys and set the etcd_endpoints to an etcd cluster you have deployed.
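In the etcd-datastore variant of the Calico manifest this is normally a one-line change in the calico-config ConfigMap; a sketch (the exact ConfigMap layout depends on the Calico version you downloaded, and the endpoints below are simply the ones from your config.yaml):
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  etcd_endpoints: "http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379"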
I have solved it by:
Adding all the masters' IPs and the LB IP to apiServerCertSANs
Copying the Kubernetes certificates from the first master to the other masters.
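For reference, the certSANs part of that fix, sketched in the same kubeadm config style as the config.yaml above (the apiVersion/kind lines and the assumption that the masters are the etcd hosts are mine; the LB IP is a placeholder):
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
apiServerCertSANs:
- "172.16.8.236"
- "172.16.8.237"
- "172.16.8.238"
- "<LB_IP>"
# ...etcd, networking and apiServerExtraArgs as in the original config.yaml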

Pods hang in pending state indefinitely

I've been working with a 6 node cluster for the last few weeks without issue. Earlier today we ran into an open file issue (https://github.com/kubernetes/kubernetes/pull/12443/files) and I patched and restarted kube-proxy.
Since then, all rc-deployed pods scheduled to ALL BUT node-01 get stuck in the Pending state, and there are no log messages stating the cause.
Looking at the docker daemon on the nodes, the containers in the pods are actually running, and deleting the rc removes them. It appears to be some sort of callback issue between the state according to the kubelet and the kube-apiserver.
Cluster is running v1.0.3
Here's an example of the state
docker run --rm -it lachie83/kubectl:prod get pods --namespace=kube-system -o wide
NAME READY STATUS RESTARTS AGE NODE
kube-dns-v8-i0yac 0/4 Pending 0 4s 10.1.1.35
kube-dns-v8-jti2e 0/4 Pending 0 4s 10.1.1.34
get events
Wed, 16 Sep 2015 06:25:42 +0000 Wed, 16 Sep 2015 06:25:42 +0000 1 kube-dns-v8 ReplicationController successfulCreate {replication-controller } Created pod: kube-dns-v8-i0yac
Wed, 16 Sep 2015 06:25:42 +0000 Wed, 16 Sep 2015 06:25:42 +0000 1 kube-dns-v8-i0yac Pod scheduled {scheduler } Successfully assigned kube-dns-v8-i0yac to 10.1.1.35
Wed, 16 Sep 2015 06:25:42 +0000 Wed, 16 Sep 2015 06:25:42 +0000 1 kube-dns-v8-jti2e Pod scheduled {scheduler } Successfully assigned kube-dns-v8-jti2e to 10.1.1.34
Wed, 16 Sep 2015 06:25:42 +0000 Wed, 16 Sep 2015 06:25:42 +0000 1 kube-dns-v8 ReplicationController successfulCreate {replication-controller } Created pod: kube-dns-v8-jti2e
scheduler log
I0916 06:25:42.897814 10076 event.go:203] Event(api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-dns-v8-jti2e", UID:"c1cafebe-5c3b-11e5-b3c4-020443b6797d", APIVersion:"v1", ResourceVersion:"670117", FieldPath:""}): reason: 'scheduled' Successfully assigned kube-dns-v8-jti2e to 10.1.1.34
I0916 06:25:42.904195 10076 event.go:203] Event(api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-dns-v8-i0yac", UID:"c1cafc69-5c3b-11e5-b3c4-020443b6797d", APIVersion:"v1", ResourceVersion:"670118", FieldPath:""}): reason: 'scheduled' Successfully assigned kube-dns-v8-i0yac to 10.1.1.35
Tailing the kubelet log file during pod creation:
tail -f kubelet.kube-node-03.root.log.INFO.20150916-060744.10668
I0916 06:25:04.448916 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:25:24.449253 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:25:44.449522 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:26:04.449774 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:26:24.450400 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:26:44.450995 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:27:04.451501 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:27:24.451910 10668 config.go:253] Setting pods for source file : {[] 0 file}
I0916 06:27:44.452511 10668 config.go:253] Setting pods for source file : {[] 0 file}
kubelet process
root@kube-node-03:/var/log/kubernetes# ps -ef | grep kubelet
root 10668 1 1 06:07 ? 00:00:13 /opt/bin/kubelet --address=10.1.1.34 --port=10250 --hostname_override=10.1.1.34 --api_servers=https://kube-master-01.sj.lithium.com:6443 --logtostderr=false --log_dir=/var/log/kubernetes --cluster_dns=10.1.2.53 --config=/etc/kubelet/conf --cluster_domain=prod-kube-sjc1-1.internal --v=4 --tls-cert-file=/etc/kubelet/certs/kubelet.pem --tls-private-key-file=/etc/kubelet/certs/kubelet-key.pem
node list
docker run --rm -it lachie83/kubectl:prod get nodes
NAME LABELS STATUS
10.1.1.30 kubernetes.io/hostname=10.1.1.30,name=node-1 Ready
10.1.1.32 kubernetes.io/hostname=10.1.1.32,name=node-2 Ready
10.1.1.34 kubernetes.io/hostname=10.1.1.34,name=node-3 Ready
10.1.1.35 kubernetes.io/hostname=10.1.1.35,name=node-4 Ready
10.1.1.42 kubernetes.io/hostname=10.1.1.42,name=node-5 Ready
10.1.1.43 kubernetes.io/hostname=10.1.1.43,name=node-6 Ready
The issue turned out to be an MTU mismatch between the node and the master. Once that was fixed, the problem was resolved.
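If you need to verify the MTU yourself, a quick check from a node looks like this (the interface name eth0 and the master address are placeholders; 1472 bytes of ICMP payload plus 28 bytes of headers corresponds to a 1500-byte MTU):
ip link show eth0
ping -M do -s 1472 <master-ip>
If the non-fragmentable ping fails while smaller payload sizes succeed, a hop in between is using a smaller MTU.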
It looks like you were building your cluster from scratch. Have you run the conformance test against your cluster yet? If not, could you please run it? The details can be found at:
https://github.com/kubernetes/kubernetes/blob/e8009e828c864a46bf2e1d5c7dab8ef413c8bbe5/hack/conformance-test.sh
The conformance test should fail, or at least give us more information about your cluster setup. Please post the test results somewhere so that we can diagnose your problem further.
The problem is most likely that your kubelet and your kube-apiserver don't agree on the node name here. I also noticed that you are using hostname_override.
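A quick way to check for such a mismatch (a sketch reusing the commands already shown in the question): compare the node names the API server knows about with the name the kubelet is overriding itself to:
docker run --rm -it lachie83/kubectl:prod get nodes
ps -ef | grep kubelet | grep -o 'hostname_override=[^ ]*'
If the scheduler assigns a pod to a name (for example 10.1.1.34) that does not exactly match the name the kubelet registered with, the kubelet never picks the pod up and it stays Pending.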