Kubernetes upgrade from 1.8.7 to 1.13.0

Context
We currently have 3 stable clusters on Kubernetes (v1.8.7). These clusters were created by an external team which is no longer available, and we have limited documentation. We are trying to upgrade to a higher stable version (v1.13.0). We're aware that we need to upgrade one version at a time, so 1.8 -> 1.9 -> 1.10 and so on.
Solved Questions
Any pointers on how to upgrade from 1.8 to 1.9?
We tried to install kubeadm v1.8.7 and run kubeadm upgrade plan, but it fails with this output:
[preflight] Running pre-flight checks
couldn't create a Kubernetes client from file "/etc/kubernetes/admin.conf": failed to load admin kubeconfig [open /etc/kubernetes/admin.conf: no such file or directory]
We cannot find the file admin.conf. Any suggestions on how we can regenerate it, or what information it would need?
New Question
Since we now have the admin.conf file, we installed kubectl, kubeadm and kubelet v1.9.0:
apt-get install kubelet=1.9.0-00 kubeadm=1.9.0-00 kubectl=1.9.0-00
When I run kubeadm upgrade plan v1.9.0
I get
root@k8s-master-dev-0:/home/azureuser# kubeadm upgrade plan v1.9.0
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/health] FATAL: [preflight] Some fatal errors occurred:
[ERROR APIServerHealth]: the API Server is unhealthy; /healthz didn't return "ok"
[ERROR MasterNodesReady]: couldn't list masters in cluster: Get https://<k8s-master-dev-0 ip>:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D: dial tcp <k8s-master-dev-0 ip>:6443: getsockopt: connection refused
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
root@k8s-master-dev-0:/home/azureuser# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
heapster-75f8df9884-nxn2z 2/2 Running 0 42d
kube-addon-manager-k8s-master-dev-0 1/1 Running 2 1d
kube-addon-manager-k8s-master-dev-1 1/1 Running 4 123d
kube-addon-manager-k8s-master-dev-2 1/1 Running 2 169d
kube-apiserver-k8s-master-dev-0 1/1 Running 100 1d
kube-apiserver-k8s-master-dev-1 1/1 Running 4 123d
kube-apiserver-k8s-master-dev-2 1/1 Running 2 169d
kube-controller-manager-k8s-master-dev-0 1/1 Running 3 1d
kube-controller-manager-k8s-master-dev-1 1/1 Running 4 123d
kube-controller-manager-k8s-master-dev-2 1/1 Running 4 169d
kube-dns-v20-5d9fdc7448-smf9s 3/3 Running 0 42d
kube-dns-v20-5d9fdc7448-vtjh4 3/3 Running 0 42d
kube-proxy-cklcx 1/1 Running 1 123d
kube-proxy-dldnd 1/1 Running 4 169d
kube-proxy-gg89s 1/1 Running 0 169d
kube-proxy-mrkqf 1/1 Running 4 149d
kube-proxy-s95mm 1/1 Running 10 169d
kube-proxy-zxnb7 1/1 Running 2 169d
kube-scheduler-k8s-master-dev-0 1/1 Running 2 1d
kube-scheduler-k8s-master-dev-1 1/1 Running 6 123d
kube-scheduler-k8s-master-dev-2 1/1 Running 4 169d
kubernetes-dashboard-8555bd85db-4txtm 1/1 Running 0 42d
tiller-deploy-6677dc8d46-5n5cp 1/1 Running 0 42d

Let's go step by step and first generate the admin.conf file in your cluster.
You can generate the admin.conf file using the following command:
kubeadm alpha phase kubeconfig admin --cert-dir /etc/kubernetes/pki --kubeconfig-dir /etc/kubernetes/
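Once it is generated, a quick way to confirm the file works (assuming the API server on this master is reachable) is:
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes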
Now, you can check out my following answer on how to upgrade a Kubernetes cluster with kubeadm (the answer covers 1.10.0 to 1.10.11, but it applies to 1.8 to 1.9 as well; you just need to change the version of the packages you download):
how to upgrade kubernetes from v1.10.0 to v1.10.11
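In short, the general kubeadm upgrade flow on the master looks roughly like this (a sketch; substitute the exact 1.9.x versions you need):
apt-get update && apt-get install -y kubeadm=1.9.0-00
kubeadm upgrade plan
kubeadm upgrade apply v1.9.0
apt-get install -y kubelet=1.9.0-00 kubectl=1.9.0-00
systemctl daemon-reload && systemctl restart kubelet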
Hope this helps.

Any pointers on how to upgrade from 1.8 to 1.9 ?
Definitely kubeadm
We tried to install kubeadm v1.8.7 & run kubeadm upgrade plan, but it fails with output -
[preflight] Running pre-flight checks
couldn't create a Kubernetes client from file "/etc/kubernetes/admin.conf": failed to load admin kubeconfig [open /etc/kubernetes/admin.conf: no such file or directory]
we cannot find the file admin.conf. Any suggestions on how we can regenerate this or what information would it need?
kubeadm requires a couple of things:
ConfigMap in-cluster
Authentication / Credentials file
Firstly, I'd check the kube-system namespace for a kubeadm-config ConfigMap. If that exists, you should be able to continue relatively painlessly.
If this doesn't exist, you will need to go ahead and create it.
kubeadm config upload from-flags would be a good starting point. You can specify the kubelet flags from your systemd unit file and it should get you in good shape.
https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-config/#cmd-config-from-flags
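For example, to check whether the ConfigMap already exists, and to upload one if it doesn't (the flag values below are placeholders for whatever your control plane actually uses):
kubectl -n kube-system get configmap kubeadm-config -o yaml
kubeadm config upload from-flags --apiserver-advertise-address=<master-ip> --pod-network-cidr=<your-pod-cidr>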
Secondly, kubeadm needs a conf file with credentials. I'd imagine there's one of these in your /etc/kubernetes directory somewhere; so poke around.
This file will look like your local kubeconfigs, starting with:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
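If you're not sure which file to use, a quick way to hunt for candidates (the paths are just the usual kubeadm defaults, so adjust as needed):
ls -l /etc/kubernetes/*.conf
grep -l certificate-authority-data /etc/kubernetes/*.conf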

Related

Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]

My pod is stuck in ContainerCreating status with this message:
Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898" network for pod "my-nginx": networkPlugin cni failed to set up pod "my-nginx_default" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898", failed to clean up sandbox container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898" network for pod "my-nginx": networkPlugin cni failed to teardown pod "my-nginx_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
The state of the worker node is Ready,
but the output of kubectl get pods -n kube-system seems to have issues:
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6dfcd885bf-ktbhb 1/1 Running 0 22h
calico-node-4fs2v 0/1 Init:RunContainerError 1 22h
calico-node-l9qvc 0/1 Running 0 22h
coredns-f9fd979d6-8pzcd 1/1 Running 0 23h
coredns-f9fd979d6-v4cq8 1/1 Running 0 23h
etcd-k8s-master 1/1 Running 1 23h
kube-apiserver-k8s-master 1/1 Running 128 23h
kube-controller-manager-k8s-master 1/1 Running 4 23h
kube-proxy-bwtwj 0/1 CrashLoopBackOff 342 23h
kube-proxy-stq7q 1/1 Running 1 23h
kube-scheduler-k8s-master 1/1 Running 4 23h
and the result of the command kubectl -n kube-system logs kube-proxy-bwtwj was:
failed to try resolving symlinks in path "/var/log/pods/kube-system_kube-proxy-bwtwj_1a0f4b93-cc6f-46b9-bf29-125feba593cb/kube-proxy/348.log": lstat /var/log/pods/kube-system_kube-proxy-bwtwj_1a0f4b93-cc6f-46b9-bf29-125feba593cb/kube-proxy/348.log: no such file or directory
I see two topics here:
The default --pod-network-cidr for Calico is 192.168.0.0/16. You can use a different one, but always make sure that there are no overlaps with existing networks. However, I have tested with the default one and my cluster runs with no problems. In order to start over with a proper config, you should remove the node and clean up the control plane (a sketch of those steps appears after the commands below). Then proceed with:
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml
After that, join your worker nodes with kubeadm join
Use sudo where/if needed. All necessary details can be found in the official documentation.
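For completeness, the "remove the node and clean up the control plane" part mentioned above might look like this (a sketch; <node-name> is a placeholder for the node you are removing):
kubectl drain <node-name> --delete-local-data --force --ignore-daemonsets
kubectl delete node <node-name>
sudo kubeadm reset   # run this on the node itself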
The failed to try resolving symlinks error means that the kubelet is looking for the pod logs in the wrong directory. In order to fix it, you need to pass the --log-dir=/var/log flag to the kubelet. After adding the flag, run systemctl daemon-reload and restart the kubelet. This has to be done on all of your nodes.
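A minimal sketch of doing that, assuming a kubeadm-style install where the kubelet picks up KUBELET_EXTRA_ARGS from /etc/default/kubelet (the file location differs per distribution, and this overwrites the file, so append instead if it already has content):
echo 'KUBELET_EXTRA_ARGS=--log-dir=/var/log' | sudo tee /etc/default/kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet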
Make sure you deploy Calico before joining other nodes to your cluster. When you have other nodes in your cluster, calico-kube-controllers sometimes gets pushed to a worker node. This can lead to issues.
You need to carefully check the logs of the calico-node pods.
In my case I have some other network interfaces, and the autodetection mechanism in Calico was detecting the wrong interface (IP address).
You need to consult this documentation: https://projectcalico.docs.tigera.io/reference/node/configuration.
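For example, to see which address Calico autodetected on each node (the label and log wording below follow the standard Calico manifests, so adjust if yours differ):
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node | grep -i autodetect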
What I did in my case was simply:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=cidr=172.16.8.0/24
Here cidr=172.16.8.0/24 is my "working" network.
After this, all calico-node pods restarted and suddenly everything was fine.
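To confirm the rollout after changing the environment variable, something like the following works:
kubectl -n kube-system rollout status daemonset/calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide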

How to install kube-apiserver on centos?

I have installed etcd and Kubernetes on CentOS, and now I want to install kube-apiserver. I installed kube-apiserver via snap:
sudo yum install epel-release
sudo yum install snapd
sudo systemctl enable --now snapd.socket
sudo ln -s /var/lib/snapd/snap /snap
sudo snap install kube-apiserver
I started kube-apiserver following the guide at this link.
Unfortunately, it failed with the error etcd certificate file not found in /etc/kubernetes/apiserver/apiserver.pem. But I found that the certificate file exists. How can I run kube-apiserver successfully?
I don't know the reason for your failure, but I suggest you install Kubernetes with kubeadm; it's a great k8s tool. If you install k8s with kubeadm, kube-apiserver will be installed as a k8s pod. The guide to install kubeadm is at this link.
I ran the command kubectl get pods -A:
[karl#centos-linux ~]$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-66bff467f8-64pt6 1/1 Running 6 4d18h
kube-system coredns-66bff467f8-xpnsr 1/1 Running 6 4d18h
kube-system etcd-centos-linux.shared 1/1 Running 6 4d18h
kube-system kube-apiserver-centos-linux.shared 1/1 Running 6 4d18h
kube-system kube-controller-manager-centos-linux.shared 1/1 Running 6 4d18h
kube-system kube-flannel-ds-amd64-48stf 1/1 Running 8 4d18h
kube-system kube-proxy-9w8gh 1/1 Running 6 4d18h
kube-system kube-scheduler-centos-linux.shared 1/1 Running 6 4d18h
kube-apiserver-centos-linux.shared is the kube-apiserver pod, so it was installed successfully.
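For reference, a minimal sketch of that kubeadm-based install on CentOS (the yum repo definition below follows the older packages.cloud.google.com layout from the install guide of that era, so check the current docs; the pod CIDR is just the flannel default already used in this cluster):
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
sudo yum install -y kubelet kubeadm kubectl
sudo systemctl enable --now kubelet
sudo kubeadm init --pod-network-cidr=10.244.0.0/16   # kube-apiserver then runs as a static pod managed by the kubelet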
I suggest using a standard tool such as kubeadm to install Kubernetes on CentOS. kubeadm init will generate the necessary certificates and install all the Kubernetes control plane components, including the Kubernetes API server.
Following this guide, you should be able to install a single-control-plane Kubernetes cluster.
kubeadm also supports clusters with multiple control plane nodes, as well as clusters with completely separate etcd nodes.
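For the multi-control-plane case, the first node is bootstrapped along these lines (a sketch; LOAD_BALANCER_DNS:6443 is a placeholder for your API server load balancer endpoint):
sudo kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:6443" --upload-certs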

Enabling NodeLocalDNS fails

We have 2 clusters on GKE: dev and production. I tried to run this command on dev cluster:
gcloud beta container clusters update "dev" --update-addons=NodeLocalDNS=ENABLED
And everything went great: the node-local-dns pods are running and all works. The next morning I decided to run the same command on the production cluster, and node-local-dns fails to run. I noticed that both PILLAR__LOCAL__DNS and PILLAR__DNS__SERVER in the YAML aren't changed to proper IPs. I tried to change those variables in the config YAML, but GKE keeps overwriting them back to the PILLAR__DNS__SERVER placeholders...
The only difference between clusters is that dev runs on 1.15.9-gke.24 and production 1.15.11-gke.1.
Apparently the 1.15.11-gke.1 version has a bug.
I recreated it first on 1.15.11-gke.1 and can confirm that the node-local-dns pods fall into a CrashLoopBackOff state:
node-local-dns-28xxt 0/1 CrashLoopBackOff 5 5m9s
node-local-dns-msn9s 0/1 CrashLoopBackOff 6 8m17s
node-local-dns-z2jlz 0/1 CrashLoopBackOff 6 10m
When I checked the logs:
$ kubectl logs -n kube-system node-local-dns-msn9s
2020/04/07 21:01:52 [FATAL] Error parsing flags - Invalid localip specified - "__PILLAR__LOCAL__DNS__", Exiting
Solution:
Upgrading to 1.15.11-gke.3 helped. First you need to upgrade your master node and then your node pool (see the gcloud sketch after the output below). On this version everything runs nicely and smoothly:
$ kubectl get daemonsets -n kube-system node-local-dns
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
node-local-dns 3 3 3 3 3 addon.gke.io/node-local-dns-ds-ready=true 44m
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME READY STATUS RESTARTS AGE
node-local-dns-8pjr5 1/1 Running 0 11m
node-local-dns-tmx75 1/1 Running 0 19m
node-local-dns-zcjzt 1/1 Running 0 19m
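For reference, the two upgrades can be done with gcloud along these lines (a sketch; "production" and the node pool name are placeholders, and you may need to add --zone or --region):
gcloud container clusters upgrade production --master --cluster-version 1.15.11-gke.3
gcloud container clusters upgrade production --node-pool <pool-name> --cluster-version 1.15.11-gke.3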
As for manually fixing this particular DaemonSet YAML file, I wouldn't recommend it, as you can be sure that GKE's auto-repair and auto-upgrade features will overwrite it sooner or later anyway.
I hope it was helpful.

New kubernetes install has remnants of old cluster

I did a complete tear down of a v1.13.1 cluster and am now running v1.15.0 with calico cni v3.8.0. All pods are running:
[gms#thalia0 ~]$ kubectl get po --namespace=kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-59f54d6bbc-2mjxt 1/1 Running 0 7m23s
calico-node-57lwg 1/1 Running 0 7m23s
coredns-5c98db65d4-qjzpq 1/1 Running 0 8m46s
coredns-5c98db65d4-xx2sh 1/1 Running 0 8m46s
etcd-thalia0.ahc.umn.edu 1/1 Running 0 8m5s
kube-apiserver-thalia0.ahc.umn.edu 1/1 Running 0 7m46s
kube-controller-manager-thalia0.ahc.umn.edu 1/1 Running 0 8m2s
kube-proxy-lg4cn 1/1 Running 0 8m46s
kube-scheduler-thalia0.ahc.umn.edu 1/1 Running 0 7m40s
But, when I look at the endpoint, I get the following:
[gms#thalia0 ~]$ kubectl get ep --namespace=kube-system
NAME ENDPOINTS AGE
kube-controller-manager <none> 9m46s
kube-dns 192.168.16.194:53,192.168.16.195:53,192.168.16.194:53 + 3 more... 9m30s
kube-scheduler <none> 9m46s
If I look at the log for the apiserver, I get a ton of TLS handshake errors, along the lines of:
I0718 19:35:17.148852 1 log.go:172] http: TLS handshake error from 10.x.x.160:45042: remote error: tls: bad certificate
I0718 19:35:17.158375 1 log.go:172] http: TLS handshake error from 10.x.x.159:53506: remote error: tls: bad certificate
These IP addresses were from nodes in a previous cluster. I had deleted them and done a kubeadm reset on all nodes, including master, so I have no idea why these are showing up. I would assume this is why the endpoints for the controller-manager and the scheduler are showing up as <none>.
In order to completely wipe your cluster, you should do the following:
1) Reset the cluster:
$ sudo kubeadm reset (or use the command appropriate to your cluster)
2) Wipe your local directory with configs:
$ rm -rf .kube/
3) Remove /etc/kubernetes/:
$ sudo rm -rf /etc/kubernetes/
4) And one of the main points is to get rid of your previous etcd state:
$ sudo rm -rf /var/lib/etcd/
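Note that kubeadm reset also prints a reminder that it does not clean up CNI configuration or iptables rules; if you want those gone too, a sketch (adjust to your CNI plugin):
sudo rm -rf /etc/cni/net.d
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X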

Rook Ceph Operator hangs when checking for cluster status

I've set up a k8s cluster on DigitalOcean Ubuntu 18.04 LTS droplets using Calico on top of a WireGuard VPN, and was able to set up nginx-ingress with Traefik as an external LB. I'm now at the step of setting up distributed storage using Rook Ceph, by following the quickstart at https://rook.io/docs/rook/master/ceph-quickstart.html, but it seems like the monitors never reach a quorum (even when it's just one). Actually, monitor a reaches quorum by itself, but neither the operator nor any other monitors seem to know that, and the operator hangs when trying to check the status.
I've tried troubleshooting network issues, all the way from WireGuard to Calico and ufw. I've even set ufw to temporarily allow all traffic by default, just to make sure I wasn't allowing one port while the traffic was on another interface (I have wg0, eth1, tunl0 and the Calico interfaces).
Then I followed the Ceph troubleshooting guide, unsuccessfully: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap
I've been at this for 4 days and I'm out of solutions.
Here's how I set up the storage cluster:
cd cluster/examples/kubernetes/ceph
kubectl apply -f common.yaml
kubectl apply -f operator.yaml
kubectl apply -f cluster-test.yaml
Running kubectl get pods returns
NAME READY STATUS RESTARTS AGE
pod/rook-ceph-agent-9ws2p 1/1 Running 0 24s
pod/rook-ceph-agent-v6v9n 1/1 Running 0 24s
pod/rook-ceph-agent-x2jv4 1/1 Running 0 24s
pod/rook-ceph-mon-a-74cc6db5c8-8s5l5 1/1 Running 0 9s
pod/rook-ceph-operator-7cd5d8bd4c-pclxp 1/1 Running 0 25s
pod/rook-discover-24cfj 1/1 Running 0 24s
pod/rook-discover-6xsnp 1/1 Running 0 24s
pod/rook-discover-hj4tc 1/1 Running 0 24s
However, when I try to check the status of the monitors, from the operator pod I get:
# This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph status
# This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.a
# This returns [errno 2] error calling ping_monitor
# Which I guess it should, because mon.b does/should not exist
# But I expected a response such as "mon.b does not exist"
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.b
Pinging the monitor pod from the operator works just fine, by the way.
Operator logs
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-operator-log
Monitor a logs
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-log
Monitor a status, obtained directly from the monitor pod via its socket:
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-status
You can execute the ceph status command inside the Ceph toolbox pod:
https://github.com/rook/rook/blob/master/Documentation/ceph-toolbox.md
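With the toolbox deployed from that page, the status check looks roughly like this (names assume the default rook-ceph namespace and the app=rook-ceph-tools label from the linked toolbox manifest):
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status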