I recently upgraded my GKE cluster from 1.10.x to 1.11.x, and since then my calico-node pods fail to connect to the etcd cluster and end up in CrashLoopBackOff due to a livenessProbe error.
I saw that the calico-etcd DaemonSet has a desired state of 0 and was wondering about that. Its nodeSelector is node-role.kubernetes.io/master=.
From the logs of such calico-nodes:
2018-12-19 19:18:28.989 [INFO][7] etcd.go 373: Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
2018-12-19 19:18:28.989 [INFO][7] startup.go 254: Unable to query node configuration Name="gke-brokerme-ubuntu-pool-852d0318-j5ft" error=client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
State of the DaemonSets:
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
calico-etcd   0         0         0       0            0           node-role.kubernetes.io/master=   3d
calico-node   2         2         0       2            0           <none>                            3d
kubectl get nodes --show-labels:
NAME STATUS ROLES AGE VERSION LABELS
gke-brokerme-ubuntu-pool-852d0318-7v4m Ready <none> 4d v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-7v4m,os=ubuntu
gke-brokerme-ubuntu-pool-852d0318-j5ft Ready <none> 1h v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-j5ft,os=ubuntu
I did not modify any calico manifests; they should be provisioned 1:1 by GKE.
I would expect the calico-nodes to connect either to the etcd of my Kubernetes cluster or to a calico-etcd provisioned by the DaemonSet. As there is no master node that I can control in GKE, I kind of get why calico-etcd is at state 0, but then, which etcd are the calico-nodes supposed to connect to? What's wrong with my small and basic setup?
We are aware of the issue of calico crash looping in GKE 1.11.x. You can fix it by upgrading to a newer version. I would recommend upgrading to version '1.11.4-gke.12' or '1.11.3-gke.23', which do not have this issue.
After upgrading Kubernetes node pool from 1.21 to 1.22, ingress-nginx-controller pods started crashing. The same deployment has been working fine in EKS. I'm just having this issue in GKE. Does anyone have any ideas about the root cause?
$ kubectl logs ingress-nginx-controller-5744fc449d-8t2rq -c controller
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.3.1
Build: 92534fa2ae799b502882c8684db13a25cde68155
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
W0219 21:23:08.194770 8 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0219 21:23:08.194995 8 main.go:209] "Creating API client" host="https://10.1.48.1:443"
Ingress pod events:
Events:
Type     Reason             Age                  From               Message
----     ------             ----                 ----               -------
Normal   Scheduled          27m                  default-scheduler  Successfully assigned infra/ingress-nginx-controller-5744fc449d-8t2rq to gke-infra-nodep-ffe54a41-s7qx
Normal   Pulling            27m                  kubelet            Pulling image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974"
Normal   Started            27m                  kubelet            Started container controller
Normal   Pulled             27m                  kubelet            Successfully pulled image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" in 6.443361484s
Warning  Unhealthy          26m (x6 over 26m)    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 502
Normal   Killing            26m                  kubelet            Container controller failed liveness probe, will be restarted
Normal   Created            26m (x2 over 27m)    kubelet            Created container controller
Warning  FailedPreStopHook  26m                  kubelet            Exec lifecycle hook ([/wait-shutdown]) for Container "controller" in Pod "ingress-nginx-controller-5744fc449d-8t2rq_infra(c4c166ff-1d86-4385-a22c-227084d569d6)" failed - error: command '/wait-shutdown' exited with 137: , message: ""
Normal   Pulled             26m                  kubelet            Container image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" already present on machine
Warning  BackOff            7m7s (x52 over 21m)  kubelet            Back-off restarting failed container
Warning  Unhealthy          2m9s (x55 over 26m)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 502
The Beta API versions (extensions/v1beta1 and networking.k8s.io/v1beta1) of Ingress are no longer served (removed) for GKE clusters created on versions 1.22 and later. Please refer to the official GKE ingress documentation for changes in the GA API version.
Also refer to Official Kubernetes documentation for API removals for Kubernetes v1.22 for more information.
Before upgrading your Ingress API as a client, make sure that every ingress controller that you use is compatible with the v1 Ingress API. See Ingress Prerequisites for more context about Ingress and ingress controllers.
Also check the following possible causes of CrashLoopBackOff:
Increasing the initialDelaySeconds value of the livenessProbe may help alleviate the issue, as it gives the container more time to start up and finish its initial work before the liveness probe checks its health.
Check the “Container restart policy”: the spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
Out of memory or resources: try increasing the VM size. Containers may crash because of memory limits; new ones are then spun up, the health check fails, and Ingress serves a 502.
Check whether externalTrafficPolicy=Local is set on the NodePort service; this setting prevents nodes from forwarding traffic to other nodes.
Refer to the GitHub issue "Document how to avoid 502s" (#34) for more information.
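To put rough numbers on the initialDelaySeconds suggestion, here is a minimal Python sketch of how long a container that never passes its liveness probe survives before the kubelet restarts it. The function name and the probe values are illustrative assumptions, not settings read from the cluster in the question, and the estimate ignores probe timeouts and scheduling jitter.

```python
def earliest_liveness_restart(initial_delay_seconds, period_seconds, failure_threshold):
    """Approximate seconds after container start at which a container
    whose liveness probe never succeeds gets restarted."""
    # The first probe runs after the initial delay; every further
    # consecutive failure arrives one probe period later.
    return initial_delay_seconds + (failure_threshold - 1) * period_seconds


# Illustrative values only (assumed, not taken from the question).
before = earliest_liveness_restart(initial_delay_seconds=10, period_seconds=10, failure_threshold=5)
after = earliest_liveness_restart(initial_delay_seconds=60, period_seconds=10, failure_threshold=5)
print(f"restart after ~{before}s with a 10s delay, ~{after}s with a 60s delay")
```

If the controller needs longer than the first figure to become healthy, it is killed before it ever gets the chance, which matches the restart loop in the events above.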
I've deployed the redis helm chart on k8s with Sentinel enabled.
I've set up the master-replicas with Sentinel topology, meaning one master and two slaves. Each pod is running both the redis and sentinel containers successfully:
NAME             READY   STATUS    RESTARTS   AGE     IP             NODE
my-redis-pod-0   2/2     Running   0          5d22h   10.244.0.173   node-pool-u
my-redis-pod-1   2/2     Running   0          5d22h   10.244.1.96    node-pool-j
my-redis-pod-2   2/2     Running   0          3d23h   10.244.1.145   node-pool-e
Now, I have a Python script that connects to Redis and discovers the master from the pods' IPs, which I pass in:
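from redis import StrictRedis
from redis.sentinel import Sentinel

# Sentinel endpoints, currently hard-coded as pod IPs
sentinel = Sentinel([('10.244.0.173', 26379),
                     ('10.244.1.96', 26379),
                     ('10.244.1.145', 26379)],
                    sentinel_kwargs={'password': 'redispswd'})
# Ask Sentinel for the address of the current master
host, port = sentinel.discover_master('mymaster')
redis_client = StrictRedis(
    host=host,
    port=port,
    password='redispswd')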
Let's suppose the master is on my-redis-pod-0. When I run kubectl delete pod to simulate a problem that makes me lose the pod, Sentinel promotes one of the other slaves to master and Kubernetes gives me a new pod with redis and sentinel.
NAME             READY   STATUS    RESTARTS   AGE     IP             NODE
my-redis-pod-0   2/2     Running   0          3m      10.244.0.27    node-pool-u
my-redis-pod-1   2/2     Running   0          5d22h   10.244.1.96    node-pool-j
my-redis-pod-2   2/2     Running   0          3d23h   10.244.1.145   node-pool-e
The question is: how can I tell Sentinel to add this new IP to the list automatically (without code changes)?
Thanks!
Instead of using IPs, you can use the DNS entries of a headless service.
A headless service is created by explicitly specifying
clusterIP: None
Then you will be able to use the DNS entries as follows, where redis-0 will be the master
#syntax
pod_name.service_name.namespace.svc.cluster.local
#Example
redis-0.redis.redis.svc.cluster.local
redis-1.redis.redis.svc.cluster.local
redis-2.redis.redis.svc.cluster.local
References:
What is a headless service, what does it do/accomplish, and what are some legitimate use cases for it?
https://www.containiq.com/post/deploy-redis-cluster-on-kubernetes
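As a rough illustration of the naming scheme above, here is a small Python sketch that builds the per-pod DNS names and pairs them with the Sentinel port. The helper name pod_dns_names is hypothetical, and the values "redis", "redis", "redis" just mirror the example entries; your StatefulSet, service and namespace names will differ.

```python
def pod_dns_names(statefulset, service, namespace, replicas, cluster_domain="cluster.local"):
    """Build the stable DNS names that a headless service gives to StatefulSet pods."""
    return [
        f"{statefulset}-{index}.{service}.{namespace}.svc.{cluster_domain}"
        for index in range(replicas)
    ]


# Assumed names matching the example above; 26379 is the Sentinel port from the question.
sentinels = [(name, 26379) for name in pod_dns_names("redis", "redis", "redis", 3)]
for host, port in sentinels:
    print(host, port)
```

A list like this could be passed to Sentinel(...) in place of the hard-coded pod IPs, so a recreated pod is found under the same name even though its IP changed.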
I am a beginner with Kubernetes. I am trying to install minikube because I want to run my application in Kubernetes. I am using Ubuntu 16.04.
I have followed the installation instructions provided here
https://kubernetes.io/docs/setup/learning-environment/minikube/#using-minikube-with-an-http-proxy
Issue1:
After installing kubectl, virtualbox and minikube I have run the command
minikube start --vm-driver=virtualbox
It fails with the following error:
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
E0912 17:39:12.486830 17689 start.go:305] Error restarting
cluster: restarting kube-proxy: waiting for kube-proxy to be
up for configmap update: timed out waiting for the condition
But when I check VirtualBox, I see the minikube VM running, and when I run kubectl:
kubectl create deployment hello-minikube --image=k8s.gcr.io/echoserver:1.10
I see the deployments
kubectl get deployment
NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
hello-minikube   1         1         1            1           27m
I exposed the hello-minikube deployment as a service
kubectl get service
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
hello-minikube   LoadBalancer   10.102.236.236   <pending>     8080:31825/TCP   15m
kubernetes       ClusterIP      10.96.0.1        <none>        443/TCP          19h
I got the url for the service
minikube service hello-minikube --url
http://192.168.99.100:31825
When I try to curl the url I am getting the following error
curl http://192.168.99.100:31825
curl: (7) Failed to connect to 192.168.99.100 port 31825: Connection refused
1) If the minikube cluster failed while starting, how was kubectl able to connect to minikube to create deployments and services?
2) If the cluster is fine, then why am I getting connection refused?
I was looking at this proxy (https://kubernetes.io/docs/setup/learning-environment/minikube/#starting-a-cluster). What is my_proxy in this?
Is it the minikube IP and some port?
I have tried this
Error restarting cluster: restarting kube-proxy: waiting for kube-proxy to be up for configmap update: timed out waiting for the condition
but I do not understand how #3 (set proxy) in the solution is done. Can someone help me with instructions for the proxy?
Adding the command output which was asked in the comments
kubectl get po -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
etcd-minikube                           1/1     Running   0          4m
kube-addon-manager-minikube             1/1     Running   0          5m
kube-apiserver-minikube                 1/1     Running   0          4m
kube-controller-manager-minikube        1/1     Running   0          6m
kube-dns-86f4d74b45-sdj6p               3/3     Running   0          5m
kube-proxy-7ndvl                        1/1     Running   0          5m
kube-scheduler-minikube                 1/1     Running   0          5m
kubernetes-dashboard-5498ccf677-4x7sr   1/1     Running   0          5m
storage-provisioner                     1/1     Running   0          5m
I deleted minikube and removed all files under ~/.minikube and
reinstalled minikube. Now it is working fine. I did not get the output
before, but I have attached it to the question now that it is working. Can
you tell me what the output of this command tells?
It will be very difficult or even impossible to tell exactly what was wrong with your Minikube Kubernetes cluster now that it has been removed and set up again.
Basically, there were a few things that you could have done to properly troubleshoot or debug your issue.
Adding the command output which was asked in the comments
The output you posted is actually only part of the task that @Eduardo Baitello asked you to do. The kubectl get po -n kube-system command simply shows you a list of Pods in the kube-system namespace. In other words, this is the list of system pods forming your Kubernetes cluster and, as you can imagine, proper functioning of each of these components is crucial. As you can see in your output, the STATUS of your kube-proxy pod is Running:
kube-proxy-7ndvl 1/1 Running 0 5m
You were also asked in @Eduardo's question to check its logs. You can do it by issuing (note the namespace, since the pod lives in kube-system):
kubectl logs kube-proxy-7ndvl -n kube-system
It could tell you what was wrong with this particular pod at the time when the problem occurred. Additionally, in such a case you may use the describe command to see other pod details (sometimes looking at pod events is very helpful for figuring out what's going on with it):
kubectl describe pod kube-proxy-7ndvl -n kube-system
The suggestion to check this particular Pod status and logs was most probably motivated by this fragment of the error messages shown during your Minikube startup process:
E0912 17:39:12.486830 17689 start.go:305] Error restarting
cluster: restarting kube-proxy: waiting for kube-proxy to be
up for configmap update: timed out waiting for the condition
As you can see, this message clearly suggests that there was, in short, "something wrong" with kube-proxy, so it made a lot of sense to check it first.
There is one more thing you may have not noticed:
kubectl get service
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
hello-minikube   LoadBalancer   10.102.236.236   <pending>     8080:31825/TCP   15m
kubernetes       ClusterIP      10.96.0.1        <none>        443/TCP          19h
Your hello-minikube service was not completely ready. In the EXTERNAL-IP column you can see that its state was pending. Just as you can use the describe command on Pods, you can use it to get details of the service. Simply running:
kubectl describe service hello-minikube
could tell you quite a lot in such a case.
1) If the minikube cluster failed while starting, how was kubectl
able to connect to minikube to create deployments and services? 2) If the
cluster is fine, then why am I getting connection refused?
Remember that a Kubernetes cluster is not a monolithic structure; it consists of many parts that depend on one another. The fact that kubectl worked and you could create a deployment doesn't mean that the whole cluster was working fine. As you can see, the error message suggested that one of its components, namely kube-proxy, might not have been functioning properly.
Going back to the beginning of your question...
I have followed the installation instructions provided here
https://kubernetes.io/docs/setup/learning-environment/minikube/#using-minikube-with-an-http-proxy
Issue1: After installing kubectl, virtualbox and minikube I have run
the command
minikube start --vm-driver=virtualbox
As far as I understand, you don't use an HTTP proxy, so you didn't follow the instructions from this particular fragment of the docs that you posted, did you?
I have the impression that you are mixing up two concepts: kube-proxy, which is a Kubernetes cluster component deployed as a pod in the kube-system namespace, and the HTTP proxy server mentioned in this fragment of the documentation.
I was looking at this
proxy (https://kubernetes.io/docs/setup/learning-environment/minikube/#starting-a-cluster).
What is my_proxy in this?
If you don't know your HTTP proxy address, you most probably don't use one, and if you don't use one to connect to the Internet from your computer, this doesn't apply to your case in any way.
Otherwise you need to set it up for your Minikube by providing additional flags when you start it, as follows:
minikube start --docker-env http_proxy=http://$YOURPROXY:PORT \
--docker-env https_proxy=https://$YOURPROXY:PORT
If you were able to start your Minikube and now it works properly only using the command:
minikube start --vm-driver=virtualbox
your issue was caused by something else, and you don't need to provide the above-mentioned flags to tell Minikube which HTTP proxy server you're using.
As far as I understand, everything is currently up and running and you can access the URL returned by the command minikube service hello-minikube --url without any problem, right? You can also run the command kubectl get service hello-minikube and check whether its output differs from what you posted before. As you didn't attach any yaml definition files, it's difficult to tell whether there was anything wrong with your service definition. Also note that LoadBalancer is a service type designed to work with external load balancers provided by cloud providers, so in minikube the service is reached through its NodePort instead.
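To make the NodePort remark a bit more concrete, here is a small Python sketch that derives the URL printed by minikube service --url from the PORT(S) column of kubectl get service. The function name node_port_url is made up for illustration; the IP and port values are the ones from your output.

```python
def node_port_url(node_ip, ports_column):
    """Turn a PORT(S) value such as '8080:31825/TCP' into the NodePort URL."""
    # '8080:31825/TCP' -> port pair '8080:31825' and protocol 'TCP'
    port_pair, _, _protocol = ports_column.partition("/")
    # '8080:31825' -> service port '8080' and node port '31825'
    _service_port, _, node_port = port_pair.partition(":")
    return f"http://{node_ip}:{node_port}"


url = node_port_url("192.168.99.100", "8080:31825/TCP")
print(url)
```

The external IP stays pending, but the service is still reachable on the node IP at the second port of the pair, provided kube-proxy is healthy.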
I have a 3-node Kubernetes 1.11 cluster deployed with kubeadm, with weave (CNI) version 2.5.1. I am giving weave a CIDR range of 128 IPs. After rebooting the nodes twice, some of the pods are stuck in the ContainerCreating state.
Once you run kubectl describe pod <pod_name> you will see the following errors:
Events:
Type     Reason                  Age                From                Message
----     ------                  ----               ----                -------
Normal   SandboxChanged          20m (x20 over 1h)  kubelet, 10.0.1.63  Pod sandbox changed, it will be killed and re-created.
Warning  FailedCreatePodSandBox  30s (x25 over 1h)  kubelet, 10.0.1.63  Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
If I check how many containers are running and how many IP addresses are allocated to them, I can see 24 containers:
[root@ip-10-0-1-63 centos]# weave ps | wc -l
26
The total number of IPs assigned to weave on that node is 42.
[root@ip-10-0-1-212 centos]# kubectl exec -n kube-system -it weave-net-6x4cp -- /home/weave/weave --local status ipam
Defaulting container name to weave.
Use 'kubectl describe pod/weave-net-6x4cp -n kube-system' to see all of the containers in this pod.
6e:0d:f3:d7:f5:49(10.0.1.63) 42 IPs (32.8% of total) (42 active)
7a:24:6f:3c:1b:be(10.0.1.212) 40 IPs (31.2% of total)
ee:00:d4:9f:9d:79(10.0.1.43) 46 IPs (35.9% of total)
You can see that all 42 IPs are active, so no more IPs are available to allocate to new containers. But out of the 42, only 26 are actually allocated to containers; I am not sure where the remaining IPs are. This is happening on all three nodes.
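The arithmetic behind these numbers can be checked with a short Python sketch. It only uses the figures shown above (the /25 range, the per-peer IP counts, and the 26 lines from weave ps | wc -l); the variable names are mine, and it makes no assumption about how weave accounts for addresses internally.

```python
import ipaddress

# The weave allocation range reported by `weave status`.
pool = ipaddress.ip_network("192.168.13.0/25")
total = pool.num_addresses

# IPs owned by each peer, as reported by `weave status ipam`.
owned = {"10.0.1.63": 42, "10.0.1.212": 40, "10.0.1.43": 46}
shares = {peer: 100 * count / total for peer, count in owned.items()}

# Line count from `weave ps | wc -l` on 10.0.1.63.
in_use = 26
unaccounted = owned["10.0.1.63"] - in_use

print(f"{total} addresses in the range, {sum(owned.values())} owned by peers")
for peer, share in shares.items():
    print(f"{peer}: {share:.1f}% of total")
print(f"{unaccounted} active IPs on 10.0.1.63 are not attached to a listed container")
```

So the three peers own the whole range between them, and on 10.0.1.63 there are 16 active addresses with no matching container.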
Here is the output of weave status for your reference:
[root@ip-10-0-1-212 centos]# weave status
Version: 2.5.1 (version 2.5.2 available - please upgrade!)
Service: router
Protocol: weave 1..2
Name: 7a:24:6f:3c:1b:be(10.0.1.212)
Encryption: disabled
PeerDiscovery: enabled
Targets: 3
Connections: 3 (2 established, 1 failed)
Peers: 3 (with 6 established connections)
TrustedSubnets: none
Service: ipam
Status: waiting for IP(s) to become available
Range: 192.168.13.0/25
DefaultSubnet: 192.168.13.0/25
If you need any more information, I would be happy to provide it. Any clue?
Not sure if we have the same problem.
But before I reboot a node, I drain it first, so all pods on that node are evicted and it is safe to reboot the node.
After the node is up, you need to uncordon it so that it is available for scheduling pods again.
My reference: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
I guess that 16 IPs have been reserved for pod reuse purposes. These are the maximum pods per node based on CIDR ranges.
Maximum Pods per Node   CIDR Range per Node
8                       /28
9 to 16                 /27
17 to 32                /26
33 to 64                /25
65 to 110               /24
If your weave IPs are exhausted and some of the IPs are not released after a reboot, you can delete the file /var/lib/weave/weave-netdata.db and restart the weave pods.
In my case, I added a systemd script that removes the /var/lib/weave/weave-netdata.db file on every reboot or shutdown of the system. Once the system comes up, weave allocates new IPs to all the pods, and the IP exhaustion was never seen again.
Posting this here in hope someone else will find it useful for their use case.
The istio-pilot pod on a minikube Kubernetes cluster is always in the Pending state. I increased the CPU to 4 and memory to 8GB, but the status of the istio-pilot pod is still Pending.
Is any specific change required to run Istio on minikube other than the ones mentioned in the documentation?
Resolved the issue. I'm running minikube with VirtualBox, and running minikube with higher memory and CPU does not take effect until minikube is deleted and started with the new parameters. Without this, scheduling failed with Insufficient memory.
I saw istio-pilot in 1.1 rc3 consume a lot of CPU; it was in the Pending state due to the following message in kubectl describe pod <istio-pilot pod name> -n istio-system:
Warning FailedScheduling 1m (x25 over 3m) default-scheduler 0/2 nodes are available:
1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate.
I was able to reduce it by passing --set pilot.resources.requests.cpu=30m when installing Istio using helm.
https://github.com/istio/istio/blob/1.1.0-rc.3/install/kubernetes/helm/istio/charts/pilot/values.yaml#L16
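For anyone wondering why lowering the request helps, here is a minimal Python sketch of the comparison behind an "Insufficient cpu" message: a pod fits on a node only if its CPU request is no larger than the CPU left after the requests already placed there. The helper names and the node numbers are hypothetical, chosen only to illustrate the units ("30m" means 30 millicores, "4" means 4 full cores); this is not the scheduler's actual code.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '30m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # Plain numbers are whole or fractional cores.
    return round(float(quantity) * 1000)


def fits(request, allocatable, already_requested):
    """Return True if the request fits into the CPU still unrequested on the node."""
    free = cpu_to_millicores(allocatable) - cpu_to_millicores(already_requested)
    return cpu_to_millicores(request) <= free


# Hypothetical node: 4 cores allocatable, 3600m already requested by other pods.
print(fits("500m", allocatable="4", already_requested="3600m"))
print(fits("30m", allocatable="4", already_requested="3600m"))
```

With these made-up numbers, a 500m request does not fit into the remaining 400m, while a 30m request does.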