Kubernetes cluster down error - kubernetes

When running the following command to bring the cluster down in Kubernetes, I am getting the following error:
KUBERNETES_PROVIDER=ubuntu ./kube-down.sh
rm: cannot remove ‘/var/lib/kubelet/pods/16981b98-a3bb-11e5-99fb-00505622b20d/volumes/kubernetes.io~secret/default-token-0i2n6’: Device or resource busy
I tried to remove it forcefully, but it still isn't getting removed.

This isn't taking into account pods that are Terminating, nor pods in namespaces other than the default namespace. I filed an issue:
https://github.com/kubernetes/kubernetes/issues/20469
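As a workaround on the node itself, the directory can usually be removed once the leftover tmpfs secret mount is unmounted, which is what causes the "Device or resource busy" error. A hedged sketch (the pod UID is the one from the error above, so adjust the paths for your node):
mount | grep /var/lib/kubelet/pods    # list leftover pod volume mounts
sudo umount /var/lib/kubelet/pods/16981b98-a3bb-11e5-99fb-00505622b20d/volumes/kubernetes.io~secret/default-token-0i2n6
sudo rm -rf /var/lib/kubelet/pods/16981b98-a3bb-11e5-99fb-00505622b20d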

Related

EKS kubectl logs <podname> suddenly stop working

I have pods running on EKS, and pulling the container logs worked fine a couple of days ago, but today when I tried to run kubectl logs podname I got a TLS error:
Error from server: Get "https://host:10250/containerLogs/dev/pod-748b649458-bczdq/server": remote error: tls: internal error
Does anyone know how to fix this? The other answers on Stack Overflow seem to suggest deleting the Kubernetes cluster and rebuilding it... is there no better solution?
This could probably be due to firewall rules or security settings that were recently introduced. I would encourage you to check those, along with the following troubleshooting steps (a quick diagnostic sketch follows the list):
Ensure all EKS nodes are in the Running state.
Restart nodes as required.
Check the networking configuration and see if other kubectl commands are working.
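For example, a quick diagnostic pass might look like this (a sketch; it assumes kubectl is already configured for the EKS cluster, and <node-name> is a placeholder):
kubectl get nodes -o wide            # confirm all nodes are Ready
kubectl describe node <node-name>    # look for kubelet and networking conditions
kubectl get pods -n kube-system      # check that cluster add-ons are running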

How to find the pod that led to an error in GKE

If I look at my logs in GCP Cloud Logging, I see, for instance, that I got a request that returned a 500:
log_message: "Method: some_cloud_goo.Endpoint failed: INTERNAL_SERVER_ERROR"
I would like to quickly go to that pod and run kubectl logs on it, but I did not find a way to do this.
I am fairly new to k8s and GKE. Is there any way to trace back the pod that handled that request?
You could run kubectl get pods to check the status of all pods, and then get a detailed description of the error with kubectl describe pod <pod-name>.
As mentioned in Neelam's answer, you can get the pod names with the command kubectl get pods -A and look into each of your pods to find the error, for example with a loop like the one sketched below.
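A hedged sketch of that loop (my-namespace and the search string are placeholders; adjust them to your namespace and to the message seen in Cloud Logging):
for pod in $(kubectl get pods -n my-namespace -o name); do
  echo "== $pod"
  kubectl logs -n my-namespace "$pod" | grep "INTERNAL_SERVER_ERROR"
done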
Or, alternatively, you could deploy a monitoring stack such as Elastic GKE Logging, available as a click-to-deploy solution in the Google Cloud Marketplace.
It can be installed from the Marketplace in a few clicks.
It is a free way to get a complete logging setup, and once deployed you can filter your logs in the Kibana dashboard.

After uninstalling calico, new pods are stuck in container creating state

After uninstalling Calico with kubectl delete -f calico.yaml, I am not able to create new pods in the cluster. Any new pods get stuck in the ContainerCreating state, and kubectl describe shows the errors below:
Warning FailedCreatePodSandBox 2m kubelet, 10.0.12.2 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to set up pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to teardown pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
The main issue is that Calico has an init container but does not have a cleanup container.
To undeploy Calico, we have to do the usual kubectl delete -f <yaml>, and then delete the Calico conf file from /etc/cni/net.d/ on each of the nodes. That configuration file, along with other binaries, is loaded onto the host by the init container.
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
From this link we can see that the kubelet reads the configuration from the default directory, and if there are multiple configuration files, it applies the CNI plugin from the config file that comes first in alphabetical order (why, oh god why??).
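A quick way to see which config the kubelet will pick on a node (a sketch; the Calico file name is the usual default and may differ by version):
ls /etc/cni/net.d/
# e.g. 10-calico.conflist sorts ahead of most other configs, so the stale Calico
# config keeps being applied until the file itself is removed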
So, in our case, after uninstalling Calico its credentials and permissions are gone (hence the Unauthorized errors), but the nodes still try to apply Calico networking based on the config file the kubelet picks up from the default directory. Restarting the node afterwards also gets rid of the leftover iptables rules.
Removing the file and restarting the node solves the issue and brings things back to normal. Another way to solve the same problem is to simply terminate the node, if you are on a managed Kubernetes cluster: since the cloud provider automatically boots up a replacement node to maintain the desired state, the new node no longer has the Calico configuration file.
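A hedged per-node cleanup sketch (the exact Calico file names vary between versions, so check what is actually in the directory first):
sudo ls /etc/cni/net.d/
sudo rm /etc/cni/net.d/10-calico.conflist /etc/cni/net.d/calico-kubeconfig
sudo reboot    # the reboot also clears the leftover iptables rules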

fluentd daemon set container for papertrail failing to start in kubernetes cluster

I am trying to set up Fluentd in a Kubernetes cluster to aggregate logs in Papertrail, as per the documentation provided here.
The configuration file is fluentd-daemonset-papertrail.yaml
It basically creates a daemon set for fluentd container and a config map for fluentd configuration.
When I apply the configuration, the pod is assigned to a node and the container is created. However, it either does not complete initialization or the pod gets killed immediately after it starts.
As the pods are getting killed, I am losing the logs too, so I couldn't investigate the cause of the issue.
Looking through the events for the kube-system namespace shows the errors below:
Error: failed to start container "fluentd": Error response from daemon: OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/75026/ns/ipc\" caused \"lstat /proc/75026/ns/ipc: no such file or directory\"": unknown
Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9559643bf77e29d270c23bddbb17a9480ff126b0b6be10ba480b558a0733161c" network for pod "fluentd-papertrail-b9t5b": NetworkPlugin kubenet failed to set up pod "fluentd-papertrail-b9t5b_kube-system" network: Error adding container to network: failed to open netns "/proc/111610/ns/net": failed to Statfs "/proc/111610/ns/net": no such file or directory
I am not sure what is causing these errors. I would appreciate any help in understanding and troubleshooting them.
Also, is it possible to look at logs/events that could tell us why a pod was sent a terminate signal?
Please ensure that /etc/cni/net.d and its /opt/cni/bin friend both exist and are correctly populated with the CNI configuration files and binaries on all Nodes.
Take a look: sandbox.
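For example, a quick check on each node (a sketch; the paths are the kubelet/CNI defaults):
ls -l /etc/cni/net.d/    # CNI network configuration files
ls -l /opt/cni/bin/      # CNI plugin binaries (bridge, host-local, loopback, ...)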
With help from the Papertrail support team, I was able to resolve the issue by removing the entry below from the manifest file.
kubernetes.io/cluster-service: "true"
The entry above seems to have been deprecated.
Relevant github issues:
https://github.com/fluent/fluentd-kubernetes-daemonset/issues/296
https://github.com/kubernetes/kubernetes/issues/72757
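To verify the fix after editing the manifest (a sketch; the file name follows the Papertrail example, and the app=fluentd-papertrail label selector is an assumption that may differ in your copy):
kubectl apply -f fluentd-daemonset-papertrail.yaml
kubectl -n kube-system get pods -l app=fluentd-papertrail
kubectl -n kube-system logs -l app=fluentd-papertrail --tail=20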

See wrong client URL when listing the etcd member

I have a stacked-master K8s cluster (etcd is also local/internal) with three master and 9 worker nodes.
My cluster version is currently 1.12.3. While going through etcd commands, I tried listing the etcd members by executing
ETCDCTL_API=3 etcdctl member list
and found that the client URLs of master2 and master3 are wrong.
(Screenshot of the etcdctl member list output omitted.)
As per my understanding, the IPs for peers and clients should be the same, but the client IP is 127.0.0.1 in the case of master2 and master3.
When I check the endpoint status, I get the error below:
Failed to get the status of endpoint :2379 (context deadline exceeded)
while I successfully get the status for master1.
Could anyone please help me out in solving this?
Things I tried:
1) Edited the manifest file; the etcd pods got restarted, but still nothing changed when I listed the members.
2) I also successfully removed and re-added master3 in the etcd cluster, and this worked (the IPs got corrected and I can get the status of master3), but when I did the same for master2 I got the error:
"error validating peerURLs {{ID: xyz, PeerUrls:xyz, clienturl:xyz},{&ID:xyz......}}: member count is unequal"
Editing the etcd manifest file and correcting the IP worked for me.
Previously it wasn't working because there was an etcd.yml.bkp file present in the manifests folder (probably I took a backup of the etcd manifest there itself before upgrading), and I found that the etcd pods were referring to that yml file; removing that file from the manifests folder resolved the issue.
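A quick check of the static pod directory (a sketch; the path is the kubeadm default):
ls /etc/kubernetes/manifests/
# only etcd.yaml, kube-apiserver.yaml, kube-controller-manager.yaml and
# kube-scheduler.yaml are expected here; the kubelet can pick up any extra
# manifest it finds (such as a stray etcd.yml.bkp), which is what happened above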
I also found that the IP mentioned in the kube-apiserver.yml files was incorrect. To correct it I tried the two methods below; both worked:
Manually edit the file and correct the IP.
Or, generate a new manifest file for the kube-apiserver by executing kubeadm init phase control-plane apiserver --kubernetes-version 1.14.5
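A hedged sketch for re-checking the members afterwards (the endpoint and certificate paths are the kubeadm defaults for a stacked control plane and may differ in your setup):
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table
# The client URL each member advertises comes from --advertise-client-urls in
# /etc/kubernetes/manifests/etcd.yaml on that master; fixing it there and letting
# the static pod restart updates the value shown here.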