Kubernetes deployment on a local Ubuntu cluster

I'm simply trying to install Kubernetes on a local Ubuntu cluster using the official documentation (http://kubernetes.io/docs/getting-started-guides/ubuntu/).
The problem is that when I run kube-up, after the binaries are created, I get the following error:
Deploying master and node on machine 10.86.108.150
make-ca-cert.sh 100% 4136 4.0KB/s 00:00
easy-rsa.tar.gz 100% 42KB 42.4KB/s 00:00
config-default.sh 100% 5438 5.3KB/s 00:00
util.sh 100% 29KB 28.9KB/s 00:00
kubelet.conf 100% 644 0.6KB/s 00:00
kube-proxy.conf 100% 684 0.7KB/s 00:00
kubelet 100% 2158 2.1KB/s 00:00
kube-proxy 100% 2233 2.2KB/s 00:00
kube-controller-manager.conf 100% 744 0.7KB/s 00:00
kube-scheduler.conf 100% 674 0.7KB/s 00:00
kube-apiserver.conf 100% 674 0.7KB/s 00:00
etcd.conf 100% 709 0.7KB/s 00:00
kube-apiserver 100% 2358 2.3KB/s 00:00
kube-scheduler 100% 2360 2.3KB/s 00:00
etcd 100% 2073 2.0KB/s 00:00
kube-controller-manager 100% 2672 2.6KB/s 00:00
reconfDocker.sh 100% 2082 2.0KB/s 00:00
etcdctl 100% 12MB 12.3MB/s 00:00
kube-apiserver 100% 58MB 58.2MB/s 00:00
kube-scheduler 100% 42MB 42.0MB/s 00:00
etcd 100% 14MB 13.8MB/s 00:00
flanneld 100% 11MB 10.8MB/s 00:01
kube-controller-manager 100% 52MB 51.8MB/s 00:00
kubelet 100% 60MB 60.3MB/s 00:00
kube-proxy 100% 35MB 34.8MB/s 00:01
flanneld 100% 11MB 10.8MB/s 00:00
flanneld.conf 100% 577 0.6KB/s 00:00
flanneld 100% 2121 2.1KB/s 00:00
flanneld.conf 100% 568 0.6KB/s 00:00
flanneld 100% 2131 2.1KB/s 00:00
sudo: unable to resolve host kubernetes-master
etcd start/stopping
Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
error #1: dial tcp 127.0.0.1:4001: getsockopt: connection refused
Thank you for all your answers.

Use kubeadm to set up the Kubernetes cluster. It is stable now and is the recommended approach. You can start with a single-node cluster and scale it out later if you get spare nodes.
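A minimal sketch of that flow, assuming kubeadm, kubelet and kubectl are already installed from the Kubernetes apt repository (the pod CIDR, the flannel manifest URL and the control-plane taint name are examples and may vary by Kubernetes version):

# On the master node: initialise a single-node control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Let kubectl talk to the new cluster
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install a pod network add-on (flannel shown as an example) and, for a
# single-node cluster, allow normal workloads on the control-plane node
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

# To scale out later, run the join command that kubeadm init prints on each new node:
# sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>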

Related

Ceph Monitor out of quorum

We're experiencing a problem with one of our Ceph monitors. The cluster uses 3 monitors and they are all up and running. They can communicate with each other and give a relevant ceph -s output. However, the quorum shows the second monitor as down. The ceph -s output from the supposedly down monitor is below:
cluster:
id: bb1ab46a-d282-4530-bf5c-021e9c940958
health: HEALTH_WARN
insufficient standby MDS daemons available
noout flag(s) set
9 large omap objects
47 pgs not deep-scrubbed in time
application not enabled on 2 pool(s)
1/3 mons down, quorum mon1,mon3
services:
mon: 3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
mgr: mon1(active, since 3d)
mds: filesystem:1 {0=mon1=up:active}
osd: 77 osds: 77 up (since 3d), 77 in (since 2w)
flags noout
rbd-mirror: 1 daemon active (12512649)
rgw: 1 daemon active (mon1)
data:
pools: 13 pools, 1500 pgs
objects: 65.36M objects, 23 TiB
usage: 85 TiB used, 701 TiB / 785 TiB avail
pgs: 1500 active+clean
io:
client: 806 KiB/s wr, 0 op/s rd, 52 op/s wr
systemctl status ceph-mon@2.service shows:
ceph-mon@2.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 2681 (code=exited, status=1/FAILURE)
Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon@2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon@2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service failed.
Restarting, stopping/starting, and enabling/disabling the monitor daemon did not work. The docs mention the monitor asok file in /var/run/ceph, and I don't have it in that directory, yet the other monitors have their asok files right in place. Now I'm in a state where I can't even stop the monitor daemon on the second monitor; it just stays in the failed state. There are no monitor logs in /var/log/ceph. What am I supposed to do? I don't have much experience with Ceph, so I don't want to change things without being absolutely sure, in order to avoid messing up the cluster.
Try to start the service manually on MON2 with:
/usr/bin/ceph-mon -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
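If that exits immediately without a useful message, a hedged variant (assuming the default cluster name 'ceph' and mon id 2) is to run it in the foreground with verbose monitor logging so the failure reason is printed to stderr:

# -d: run in the foreground and log to stderr; --debug_mon 10 raises mon log verbosity
/usr/bin/ceph-mon -d --cluster ceph --id 2 --setuser ceph --setgroup ceph --debug_mon 10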

How to rejoin Ceph mon and mgr to the cluster

I have this situation and can't access the Ceph dashboard. I had 5 mons, but 2 of them went down, and one of them is the bootstrap mon node that hosts the mgr. I got this from that node:
2020-10-14T18:59:46.904+0330 7f9d2e8e9700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum srv4,srv5,srv6 (age 2d)
mgr: no daemons active (since 2d)
mds: heyatfs:1 {0=heyfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 47h), 54 in (since 3w)
task status:
scrub status:
mds.heyfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 223.95k objects, 386 GiB
usage: 1.2 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr
I have to tell the whole story. I used cephadm to create my cluster at first, and I'm very new to Ceph. I have 15 servers; 14 of them have an OSD container, 5 of them had a mon, and my bootstrap mon node, srv2, has the mgr.
2 of these servers have public IPs and I used one of them as a client (I know this structure raises a lot of questions, but my company forces me to do it, and I'm new to Ceph, so that's how it is now). 2 weeks ago I lost 2 OSDs, and I asked the datacenter that provides these servers to change those 2 HDDs. They restarted those servers, and unfortunately those servers were my mon servers. After they were restarted, one of them, srv5, came back, but I could see that srv3 was out of quorum.
So I began to solve this problem, and I ran these commands inside ceph shell --fsid ...:
ceph orch apply mon srv3
ceph mon remove srv3
After a while I saw in my dashboard that srv2, my bootstrap mon, and the mgr were not working. When I ran ceph -s, srv2 wasn't there, and I can see the srv2 mon in the removed directory:
root@srv2:/var/lib/ceph/e97c1944-e132-11ea-9bdd-e83935b1c392# ls
crash crash.srv2 home mgr.srv2.xpntaf osd.0 osd.1 osd.2 osd.3 removed
But mgr.srv2.xpntaf is running, and unfortunately I have now lost my access to the Ceph dashboard.
I tried to add srv2 and srv3 to the monmap with:
576 ceph orch daemon add mon srv2:172.32.X.3
577 history | grep dump
578 ceph mon dump
579 ceph -s
580 ceph mon dump
581 ceph mon add srv3 172.32.X.4:6789
And now:
root@srv2:/# ceph -s
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_WARN
no active mgr
2/5 mons down, quorum srv4,srv5,srv6
services:
mon: 5 daemons, quorum srv4,srv5,srv6 (age 16h), out of quorum: srv2, srv3
mgr: no daemons active (since 2d)
mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 2d), 54 in (since 3w)
task status:
scrub status:
mds.heyatfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 223.95k objects, 386 GiB
usage: 1.2 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr
I must say that ceph orch host ls doesn't work; it hangs when I run it, and I think that's because of the "no active mgr" error. Also, in that removed directory, mon.srv2 is there and you can see its unit.run file, so I used that to run the container again, but it says mon.srv2 isn't in the monmap and doesn't have a specific IP. By the way, after ceph orch apply mon srv3 I could see a new container with a new fsid on the srv3 server.
I now know my whole problem is because I ran this command: ceph orch apply mon srv3
Because when you look at the installation document:
To deploy monitors on a specific set of hosts:
# ceph orch apply mon *<host1,host2,host3,...>*
Be sure to include the first (bootstrap) host in this list.
And I didn't see that line!
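For reference, a hedged sketch of what that line means in practice, using the host names from this question (verify the list against your own intended mon placement before running anything):

# Inside the cephadm shell: declare the full mon placement in one command,
# including the bootstrap host (srv2), instead of a single host
ceph orch apply mon "srv2,srv3,srv4,srv5,srv6"

Because ceph orch apply declares the complete desired placement, passing only srv3 told the orchestrator that srv3 should be the only monitor.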
Now I have managed to get another mgr running, but I got this:
root@srv2:/var/lib/ceph/mgr# ceph -s
2020-10-15T13:11:59.080+0000 7f957e9cd700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
cluster:
id: e97c1944-e132-11ea-9bdd-e83935b1c392
health: HEALTH_ERR
1 stray daemons(s) not managed by cephadm
2 mgr modules have failed
2/5 mons down, quorum srv4,srv5,srv6
services:
mon: 5 daemons, quorum srv4,srv5,srv6 (age 20h), out of quorum: srv2, srv3
mgr: srv4(active, since 8m)
mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
osd: 54 osds: 54 up (since 2d), 54 in (since 3w)
task status:
scrub status:
mds.heyatfs.srv10.lxizhc: idle
data:
pools: 3 pools, 65 pgs
objects: 301.77k objects, 537 GiB
usage: 1.6 TiB used, 97 TiB / 98 TiB avail
pgs: 65 active+clean
io:
client: 180 KiB/s rd, 597 B/s wr, 0 op/s rd, 0 op/s wr
And when I run ceph orch host ls I see this:
root@srv2:/var/lib/ceph/mgr# ceph orch host ls
HOST ADDR LABELS STATUS
srv10 172.32.x.11
srv11 172.32.x.12
srv12 172.32.x.13
srv13 172.32.x.14
srv14 172.32.x.15
srv15 172.32.x.16
srv2 srv2
srv3 172.32.x.4
srv4 172.32.x.5
srv5 172.32.x.6
srv6 172.32.x.7
srv7 172.32.x.8
srv8 172.32.x.9
srv9 172.32.x.10

What is the difference between the IOPS of ceph status and rbd perf image iostat?

I'm trying to find which rbd image is generating the most write IOPS, but I can't make any sense of the "rbd perf" output compared to "ceph status". What is the difference between the IOPS counters of ceph status (160 op/s) vs rbd perf (WR 1/s)?
ceph status | grep client:
client: 493 KiB/s rd, 2.4 MiB/s wr, 10 op/s rd, 160 op/s wr
rbd perf image iostat:
NAME WR RD WR_BYTES RD_BYTES WR_LAT RD_LAT
ceph/vm-152-disk-0 1/s 0/s 71 KiB/s 0 B/s 13.04 ms 0.00 ns
ceph/vm-136-disk-0 0/s 0/s 819 B/s 0 B/s 919.79 us 0.00 ns
ceph status is summing I/O over all pools. As your rbd images are in the pool 'ceph', you can run 'ceph osd pool stats ceph' to get the specific stats for that pool. If you have only 1 WR/s on ceph/vm-152-disk-0 and 160 op/s wr on the whole cluster, it means that the other 159 op/s wr happen elsewhere, in another pool.
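A short sketch of that comparison, using the pool name 'ceph' from the question:

# Cluster-wide client I/O, summed over every pool
ceph status | grep client

# I/O for only the pool that holds the rbd images
ceph osd pool stats ceph

# Per-image I/O within that pool
rbd perf image iostat ceph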

Basic ContainerCreating Failure

Occasionally I see problems where creating my deployments takes much longer than usual (this one typically takes a minute or two). How do people normally deal with this? Is it best to remove the offending node? What's the right way to debug this?
error: deployment "hillcity-twitter-staging-deployment" exceeded its progress deadline
Waiting for rollout to complete (been 500s)...
NAME READY STATUS RESTARTS AGE IP NODE
hillcity-twitter-staging-deployment-5bf6b48779-5jvgv 2/2 Running 0 8m 10.168.41.12 gke-charles-test-cluster-default-pool-be943055-mq4j
hillcity-twitter-staging-deployment-5bf6b48779-knzkw 2/2 Running 0 8m 10.168.34.34 gke-charles-test-cluster-default-pool-be943055-czqr
hillcity-twitter-staging-deployment-5bf6b48779-qxmg8 0/2 ContainerCreating 0 8m <none> gke-charles-test-cluster-default-pool-be943055-rzg2
I've ssh-ed into the "rzg2" node but didn't see anything particularly wrong with it. Here's the k8s view:
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-charles-test-cluster-default-pool-be943055-2q9f 385m 40% 2288Mi 86%
gke-charles-test-cluster-default-pool-be943055-35fl 214m 22% 2030Mi 76%
gke-charles-test-cluster-default-pool-be943055-3p95 328m 34% 2108Mi 79%
gke-charles-test-cluster-default-pool-be943055-67h0 204m 21% 1783Mi 67%
gke-charles-test-cluster-default-pool-be943055-czqr 342m 36% 2397Mi 90%
gke-charles-test-cluster-default-pool-be943055-jz8v 149m 15% 2299Mi 86%
gke-charles-test-cluster-default-pool-be943055-kl9r 246m 26% 1796Mi 67%
gke-charles-test-cluster-default-pool-be943055-mq4j 123m 13% 1523Mi 57%
gke-charles-test-cluster-default-pool-be943055-mx18 276m 29% 1755Mi 66%
gke-charles-test-cluster-default-pool-be943055-pb48 200m 21% 1667Mi 63%
gke-charles-test-cluster-default-pool-be943055-rzg2 392m 41% 2270Mi 85%
gke-charles-test-cluster-default-pool-be943055-wkxk 274m 29% 1954Mi 73%
Added: Here's some of the output of "$ sudo journalctl -u kubelet"
Sep 04 22:14:11 gke-charles-test-cluster-default-pool-be943055-rzg2 kubelet[1442]: E0904 22:14:11.882166 1442 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/docker/overlay/83ed56fdfae736d5b1bd3afc3649555916a2ef24a287415256a408c463186107 with output stdout: , stderr: - signal: killed, rootInodeErr: <nil>, extraDiskErr: <nil>
[...repeated a lot...]
Sep 04 22:25:19 gke-charles-test-cluster-default-pool-be943055-rzg2 kubelet[1442]: E0904 22:25:19.917177 1442 kube_docker_client.go:324] Cancel pulling image "gcr.io/able-store-864/hillcity-worker:0.0.1" because of no progress for 1m0s, latest progress: "43f9fd4bd389: Extracting [=====> ] 32.77 kB/295.9 kB"
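For debugging a pod stuck in ContainerCreating, a hedged sketch of the usual first checks (the pod and node names are taken from the output above):

# Events for the stuck pod, including image-pull progress, volume mounts and sandbox errors
kubectl describe pod hillcity-twitter-staging-deployment-5bf6b48779-qxmg8

# Recent cluster events, newest last, filtered to the affected node
kubectl get events --sort-by=.metadata.creationTimestamp | grep rzg2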

Kubernetes install on Ubuntu closes the connection while deploying

When installing Kubernetes on 3 Ubuntu 14.04 nodes, it starts deploying and then suddenly stops.
I have 3 nodes in this cluster:
172.25.2.31 ukub01
172.25.2.32 ukub02
172.25.2.33 ukub03
And I followed this document to install:
http://kubernetes.io/v1.0/docs/getting-started-guides/ubuntu.html
The config-default.sh settings are:
export nodes=${nodes:-"root#ukub01 root#ukub02 root#ukub03 "}
role=${role:-"ai i i"}
export NUM_MINIONS=${NUM_MINIONS:-3}
export SERVICE_CLUSTER_IP_RANGE=${SERVICE_CLUSTER_IP_RANGE:-172.25.3.0/24}
export FLANNEL_NET=${FLANNEL_NET:-172.16.0.0/16}
Deployment messages:
root#ukub01:/opt/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-up.sh
... Starting cluster using provider: ubuntu
... calling verify-prereqs
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
... calling kube-up
Deploying master and node on machine ukub01
make-ca-cert.sh 100% 3398 3.3KB/s 00:00
config-default.sh 100% 3232 3.2KB/s 00:00
util.sh 100% 19KB 19.4KB/s 00:00
kubelet.conf 100% 644 0.6KB/s 00:00
kube-proxy.conf 100% 684 0.7KB/s 00:00
flanneld.conf 100% 577 0.6KB/s 00:00
kube-proxy 100% 2230 2.2KB/s 00:00
kubelet 100% 2155 2.1KB/s 00:00
flanneld 100% 2159 2.1KB/s 00:00
kube-controller-manager.conf 100% 744 0.7KB/s 00:00
kube-apiserver.conf 100% 674 0.7KB/s 00:00
kube-scheduler.conf 100% 674 0.7KB/s 00:00
etcd.conf 100% 664 0.7KB/s 00:00
flanneld.conf 100% 568 0.6KB/s 00:00
kube-controller-manager 100% 2672 2.6KB/s 00:00
etcd 100% 2073 2.0KB/s 00:00
flanneld 100% 2159 2.1KB/s 00:00
kube-apiserver 100% 2358 2.3KB/s 00:00
kube-scheduler 100% 2360 2.3KB/s 00:00
reconfDocker.sh 100% 1759 1.7KB/s 00:00
kube-controller-manager 100% 31MB 30.8MB/s 00:00
etcd 100% 6494KB 6.3MB/s 00:00
flanneld 100% 8695KB 8.5MB/s 00:00
kube-apiserver 100% 37MB 36.9MB/s 00:00
etcdctl 100% 6041KB 5.9MB/s 00:00
kube-scheduler 100% 16MB 16.2MB/s 00:01
kube-proxy 100% 16MB 16.1MB/s 00:01
kubelet 100% 33MB 33.1MB/s 00:00
flanneld 100% 8695KB 8.5MB/s 00:00
Connection to ukub01 closed.
I checked the logs in /var/log/upstart. There are two files, and I could not find the reason for the error.
flanneld.log:
I1010 14:47:40.249071 05088 main.go:292] Exiting...
systemd-logind.log:
New session 3 of user root.
New session 4 of user root.
Removed session 4.
New session 5 of user root.
Removed session 3.
I think kubernetes/etcd/flannel can be installed manually on Ubuntu if there are documents describing the option settings. I installed etcd and flannel on the 3 nodes, but I still can't figure out the Kubernetes part.
Can you help me with this error, or tell me where I can find the Kubernetes install and option settings documentation, please?
My guess is that you're suffering from a network issue (most likely the GFW).
You can execute the following command to verify it:
$ curl -L -O https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz
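If that download stalls, a hedged workaround (the proxy address below is a placeholder, assuming you have a reachable HTTP proxy) is to route the download through it, since curl honours the https_proxy environment variable:

# Replace the placeholder with a proxy you can actually reach
export https_proxy=http://proxy.example.com:8080
curl -L -O https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz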