Kubernetes is Down

All of a sudden, I get this error when I run kubectl commands:
The connection to the server xx.xx.xxx.xx:6443 was refused - did you specify the right host or port?
I can see kubelet running, but none of the other Kubernetes-related services are running.
docker ps -a | grep kube-api
returns nothing.
What I tried after searching Google for a resolution:
Turned off swap --> the issue persists.
Restarted the Linux machine --> right after the restart, kubectl commands gave output, but after about 15 seconds they went back to the error above.
Restarted kubelet --> for about a second, kubectl commands gave output, but then it was back to square one.
I'm not sure what exactly I'm supposed to do here.
NB: the cluster was installed with kubeadm.
Also, during the brief window when kubectl get pods worked, I could see pods in the Evicted state.
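For context, on a kubeadm cluster the API server runs as a static pod managed by kubelet, so its container coming and going is a symptom of kubelet killing or failing to restart it. A minimal sketch of the next checks, assuming a Docker runtime as above (the container ID is a placeholder):
docker ps -a | grep kube-apiserver
docker logs --tail 50 <exited-kube-apiserver-container-id>
journalctl -u kubelet -f
df -h / && free -m
The last line is there because Evicted pods usually point at disk or memory pressure on the node.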


Periodic problem with kubernetes: An error: "The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?"

Good afternoon!
Setting up my first k8s cluster :)
I set up a virtual machine on VMware, set up a control plane, connected a worker node, and set up kubectl on the master node and on a laptop (the one VMware is installed on). I observe the following problem: periodically, every 2-5 minutes, the api-server stops responding; when you run any kubectl command (for example, kubectl get nodes), an error appears: "The connection to the server 192.168.131.133:6443 was refused - did you specify the right host or port?" A few minutes pass and everything is restored; in response to kubectl get nodes, the system shows the nodes. A few more minutes, and the same error again. The error appears synchronously on both the master node and the laptop.
This is what it looks like (it goes on for about 10 minutes; output omitted).
At the same time, if I execute these commands on the master node,
$ sudo systemctl stop kubelet
$ sudo systemctl start kubelet
everything is immediately restored. And after a few minutes, the same error again.
I would be grateful if you could help interpret these logs and tell me how to fix this problem?
kubectl logs at the time of the error (20:42:55):
(log output omitted)
I could imagine that the process on 192.168.131.133 is restarting, which leads to a connection refused while it is no longer listening on the API port.
You should start by investigating whether you can see any hardware issues:
either CPU usage is climbing and triggering a restart, or there is a memory leak.
You can check the running processes with
ps -ef
and use
top
to see CPU consumption.
There should be some logs and events available in Kubernetes as well.
It does not look like a connectivity issue, since you are receiving a clear failure back.
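A rough checklist along those lines (a sketch only; the grep patterns and the ten-minute window are assumptions):
ps -ef | grep kube-apiserver
top -b -n 1 | head -n 20
journalctl -u kubelet --since "10 minutes ago" | grep -iE "apiserver|oom|error"
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20
If the apiserver's PID keeps changing between runs of ps, something is repeatedly killing or crashing it, and the kubelet log usually says why (an OOM kill, a failed liveness probe, certificate problems, and so on).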

Kubernetes: view logs of crashed Airflow worker pod

Pods on our k8s cluster are scheduled with Airflow's KubernetesExecutor, which runs all Tasks in a new pod.
I have such a Task for which the pod instantly (after 1 or 2 seconds) crashes, and for which I of course want to see the logs.
This seems hard. As soon as the pod crashes, it gets deleted, along with the ability to retrieve crash logs. I have already tried all of the following:
kubectl logs -f <pod> -p: cannot be used since these pods are named uniquely (courtesy of KubernetesExecutor).
kubectl logs -l label_name=label_value: I struggle to apply the labels to the pod (if this is a known/used way of working, I'm happy to try further).
A shared NFS is mounted on all pods at a fixed log directory. The failing pod, however, does not log to this folder.
When I am really quick, I run kubectl logs -f -l dag_id=sample_dag --all-containers (the dag_id label is added by Airflow) between running and crashing, and see Error from server (BadRequest): container "base" in pod "my_pod" is waiting to start: ContainerCreating. This might give me some clue, but:
these are only the last log lines
and this is really backwards.
I'm basically looking for the canonical way of retrieving logs from transient pods.
You need to enable remote logging. The code sample below uses S3. In airflow.cfg, set the following:
remote_logging = True
remote_log_conn_id = my_s3_conn
remote_base_log_folder = s3://airflow/logs
The my_s3_conn connection can be set up in the Airflow UI under Admin > Connections. In the Conn Type dropdown, select S3.
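If the UI is not convenient, the same connection can also be created from the command line. A minimal sketch, assuming the Airflow 2.x CLI; my_s3_conn matches the config above, the credentials are placeholders, and the exact connection type string (s3 vs. aws) depends on the Airflow/provider version:
airflow connections add my_s3_conn \
    --conn-type s3 \
    --conn-extra '{"aws_access_key_id": "<your-key>", "aws_secret_access_key": "<your-secret>"}'
Once remote logging is active, task logs are uploaded to remote_base_log_folder when the task finishes or fails, so they stay readable in the Airflow UI even after KubernetesExecutor deletes the pod.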

After some time my Kubernetes cluster stops working

I have a Kubernetes cluster and everything worked fine, but after 8 days, when I run kubectl get pods, it shows:
The connection to the server <host>:6443 was refused - did you specify the right host or port?
I have one master and one worker.
I run them in my lab without any cloud.
systemctl status kubelet
shows "node not found".
My /etc/hosts was checked and it is correct.
I am short on hardware. I ran these commands to try to solve the issue:
sudo -i
swapoff -a
exit
strace -eopenat kubectl version
Most likely the servers were rebooted; I had a similar problem.
Check the kubelet logs on the master server and take action.
If you can share the kubelet logs, we will be able to offer you further help.
A reboot by itself should not be a problem, but if you did not disable swap permanently, a reboot will enable swap again and the API server will not launch - that would be my first guess.
Second, check free disk space: the API server will not respond if the disk is full (kubelet will raise a disk-pressure event and try to evict pods). A rough sketch of both checks follows below.
If that does not help, please add logs from kubelet (systemctl and journalctl).
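A minimal sketch of those two checks, assuming systemd and the usual /etc/fstab layout (the directories in the df line are common defaults for kubelet and Docker):
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
df -h /var/lib/kubelet /var/lib/docker
Commenting the swap entry out of /etc/fstab is what keeps swap disabled across reboots; swapoff -a alone is lost on the next restart.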
Verify /var/log/messages to get further information about the error, or run
systemctl status kubelet
Alternatively, journalctl will also show the details.
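For example (paths and units assume a systemd-based distro; /var/log/messages exists on RHEL/CentOS-style systems, while Debian/Ubuntu use /var/log/syslog):
sudo tail -n 100 /var/log/messages
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 100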

calico-policy-controller on the worker node is in a restart loop. How can I check why?

I have two CoreOS stable machines (with the latest stable version installed) to test Kubernetes. I installed Kubernetes 1.5.1 using the script from https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/generic and patched it with https://github.com/kfirufk/coreos-kubernetes-multi-node-generic-install-script.
I installed the controller script on one and the worker script on the other. kubectl get nodes shows both servers.
kubectl get pods --namespace=kube-system shows that calico-policy-controller-2j5dn restarts a lot, and on the worker server I can indeed see calico-policy-controller restarting a lot. Any idea how to investigate this issue further?
How can I check why it restarts? Are there any logs for this container?
kubectl logs --previous $id --namespace=kube-system
I added --previous because when the controller restarts, its pod name gets different random characters appended to it.
In my case, the calico-policy-controller started on one server but requested the etcd2 certificates that were generated on a different server.
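Putting that together, a small sketch (the kube-system namespace and the grep on the pod name are assumptions based on the output above):
id=$(kubectl get pods --namespace=kube-system -o name | grep calico-policy-controller)
kubectl logs --previous $id --namespace=kube-system
kubectl describe $id --namespace=kube-system
kubectl describe shows the restart count and the last termination reason (for example OOMKilled or Error), plus recent events, which is usually enough to see why the container keeps looping.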

Kubernetes pods spawn across all servers but kubectl only shows 1 running and 1 pending

I have a new Kubernetes setup and created a replication controller with 2 replicas. However, what I see when I run kubectl get pods is that one pod is Running and the other is Pending. Yet when I go to my 7 test nodes and run docker ps, I see containers running on all of them.
What I think is happening is that I had to change the default insecure port from 8080 to 7080 (the Docker app actually runs on 8080), but I don't know how to tell if I am right, or where else to look.
In the same vein, is there any way to set up config for kubectl where I can specify the port? Using kubectl --server="" every time is a bit annoying (yes, I know I can alias it).
If you changed the API port, did you also update the nodes to point them at the new port?
For the kubectl --server=... question, you can use kubectl config set-cluster to set cluster info in your ~/.kube/config file and avoid having to pass --server every time (a short example follows the links below). See the following docs for details:
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-cluster.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-context.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_use-context.html
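As a concrete example of that (the cluster and context names are arbitrary placeholders; the port matches the custom 7080 from the question):
kubectl config set-cluster lab --server=http://<master-ip>:7080
kubectl config set-context lab --cluster=lab
kubectl config use-context lab
kubectl get pods
After use-context, kubectl reads the server and port from ~/.kube/config, so --server is no longer needed on every call.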