GlusterFS volume with replica 3 arbiter 1 mounted in Kubernetes pods contains zero-size files

I was planning to migrate from replica 3 to replica 3 with arbiter 1, but ran into a strange issue on my third node (the one that acts as the arbiter).
When I mounted the new volume endpoint on the node where the Gluster arbiter pod is running, I got strange behavior: some files are fine, but some have zero size. When I mount the same share on another node, all files are fine.
GlusterFS is running as a Kubernetes daemonset and I'm using heketi to manage GlusterFS from Kubernetes automatically.
I'm using glusterfs 4.1.5 and Kubernetes 1.11.1.
gluster volume info vol_3ffdfde93880e8aa39c4b4abddc392cf
Type: Replicate
Volume ID: e67d2ade-991a-40f9-8f26-572d0982850d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.2.70:/var/lib/heketi/mounts/vg_426b28072d8d0a4c27075930ddcdd740/brick_35389ca30d8f631004d292b76d32a03b/brick
Brick2: 192.168.2.96:/var/lib/heketi/mounts/vg_3a9b2f229b1e13c0f639db6564f0d820/brick_953450ef6bc25bfc1deae661ea04e92d/brick
Brick3: 192.168.2.148:/var/lib/heketi/mounts/vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b27af182cb69e108c1652dc85b04e44a/brick (arbiter)
Options Reconfigured:
user.heketi.id: 3ffdfde93880e8aa39c4b4abddc392cf
user.heketi.arbiter: true
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
Status Output:
gluster volume status vol_3ffdfde93880e8aa39c4b4abddc392cf
Status of volume: vol_3ffdfde93880e8aa39c4b4abddc392cf
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 192.168.2.70:/var/lib/heketi/mounts/vg_426b28072d8d0a4c27075930ddcdd740/brick_35389ca30d8f631004d292b76d32a03b/brick   49152   0   Y   13896
Brick 192.168.2.96:/var/lib/heketi/mounts/vg_3a9b2f229b1e13c0f639db6564f0d820/brick_953450ef6bc25bfc1deae661ea04e92d/brick   49152   0   Y   12111
Brick 192.168.2.148:/var/lib/heketi/mounts/vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b27af182cb69e108c1652dc85b04e44a/brick   49152   0   Y   25045
Self-heal Daemon on localhost N/A N/A Y 25069
Self-heal Daemon on worker1-aws-va N/A N/A Y 12134
Self-heal Daemon on 192.168.2.70 N/A N/A Y 13919
Task Status of Volume vol_3ffdfde93880e8aa39c4b4abddc392cf
------------------------------------------------------------------------------
There are no active volume tasks
Heal output:
gluster volume heal vol_3ffdfde93880e8aa39c4b4abddc392cf info
Brick 192.168.2.70:/var/lib/heketi/mounts/vg_426b28072d8d0a4c27075930ddcdd740/brick_35389ca30d8f631004d292b76d32a03b/brick
Status: Connected
Number of entries: 0
Brick 192.168.2.96:/var/lib/heketi/mounts/vg_3a9b2f229b1e13c0f639db6564f0d820/brick_953450ef6bc25bfc1deae661ea04e92d/brick
Status: Connected
Number of entries: 0
Brick 192.168.2.148:/var/lib/heketi/mounts/vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b27af182cb69e108c1652dc85b04e44a/brick
Status: Connected
Number of entries: 0
Any ideas how to resolve this issue?

The issue was fixed after updating the glusterfs-client and glusterfs-common packages on the Kubernetes workers to a more recent version.
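For reference, a rough sketch of that kind of client-side upgrade, assuming Debian/Ubuntu workers (add the appropriate GlusterFS repository for the release series you want first):
# check the installed client version on each worker
dpkg -l | grep glusterfs
# then pull in the newer client packages
sudo apt-get update
sudo apt-get install --only-upgrade glusterfs-client glusterfs-common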

Related

How does OpenEBS determine the location of my volume?

I am experimenting with OpenEBS as a storage provider for our Kubernetes cluster. OpenEBS is installed via Helm on a cluster consisting of 5 nodes, created by Rancher. It seems to work; however, I don't really understand how the volume itself is provisioned.
Each node is created with 2 disks, with logical volumes spanning over the disks. For example:
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 19G 0 part
├─centos_intern--rancher--node05-root 253:0 0 50G 0 lvm /
└─centos_intern--rancher--node05-swap 253:1 0 7,9G 0 lvm [SWAP]
sdb 8:16 0 80G 0 disk
└─sdb1 8:17 0 80G 0 part
├─centos_intern--rancher--node05-root 253:0 0 50G 0 lvm /
└─centos_intern--rancher--node05-home 253:2 0 41,1G 0 lvm /home
The node disk manager (NDM) is configured with a filter excluding loop,fd0,sr0,/dev/ram,/dev/dm-,/dev/md. So far, so good.
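For reference, that exclude list typically lives in the NDM ConfigMap (commonly named openebs-ndm-config); a rough sketch of the relevant section is shown below, and the exact field names may differ between OpenEBS versions:
filterconfigs:
  - key: path-filter          # filters block devices by device path
    name: path filter
    state: true
    include: ""
    exclude: "loop,fd0,sr0,/dev/ram,/dev/dm-,/dev/md"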
When we list the block device resources created by NDM, it lists 2 resources for this node (other nodes are omitted)
> kubectl get blockdevice --all-namespaces
NAMESPACE NAME NODENAME SIZE CLAIMSTATE STATUS AGE
openebs blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5 intern-rancher-node05 85899345920 Unclaimed Active 18h
openebs sparse-e4ea6423e7d139104049e67566a2b634 intern-rancher-node05 10737418240 Unclaimed Active 18
Exploring the created blockdevice, we see that it uses /dev/sdb as disk:
> kubectl describe blockdevice blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5 -n openebs
Name: blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5
...
Node Attributes:
Node Name: intern-rancher-node05
Partitioned: No
Path: /dev/sdb
Status:
Claim State: Unclaimed
State: Active
Events: <none>
This is where my understanding stops. Why did NDM pick /dev/sdb and not /dev/sda? What is the difference between the two disks such that one is used and the other is not? Should /dev/sdb not be skipped because it is in use by the logical volumes? If I create a persistent volume, does this limit the size of my logical volumes (/home)?
Also, if I create a persistent volume claim (using Jiva), a persistent volume is created in /var/openebs, for example /var/openebs/pvc-cdc4c5a2-89e1-41ed-b9e7-c672f27a8bed. Does this mean it doesn't use the disk at all, but stores everything in the filesystem on the logical volume?

After rebooting nodes, pods stuck in containerCreating state due to insufficient weave IP

I have a 3-node Kubernetes cluster on 1.11, deployed with kubeadm, with Weave (CNI) version 2.5.1. I am giving Weave a CIDR with a range of 128 IPs. After two reboots of the nodes, some of the pods are stuck in the ContainerCreating state.
Once you run kubectl describe pod <pod_name> you will see the following errors:
Events:
Type     Reason                   Age                From                Message
----     ------                   ----               ----                -------
Normal SandboxChanged 20m (x20 over 1h) kubelet, 10.0.1.63 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 30s (x25 over 1h) kubelet, 10.0.1.63 Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
If I check how many containers are running and how many IP addresses are allocated to them, I can see 24 containers:
[root@ip-10-0-1-63 centos]# weave ps | wc -l
26
The total number of IPs assigned to Weave on that node is 42:
[root@ip-10-0-1-212 centos]# kubectl exec -n kube-system -it weave-net-6x4cp -- /home/weave/weave --local status ipam
Defaulting container name to weave.
Use 'kubectl describe pod/weave-net-6x4cp -n kube-system' to see all of the containers in this pod.
6e:0d:f3:d7:f5:49(10.0.1.63) 42 IPs (32.8% of total) (42 active)
7a:24:6f:3c:1b:be(10.0.1.212) 40 IPs (31.2% of total)
ee:00:d4:9f:9d:79(10.0.1.43) 46 IPs (35.9% of total)
You can see that all 42 IPs are active, so no more IPs are available to allocate to new containers. But out of the 42, only 26 are actually allocated to containers; I am not sure where the remaining IPs are. This is happening on all three nodes.
Here is the output of weave status for your reference:
[root@ip-10-0-1-212 centos]# weave status
Version: 2.5.1 (version 2.5.2 available - please upgrade!)
Service: router
Protocol: weave 1..2
Name: 7a:24:6f:3c:1b:be(10.0.1.212)
Encryption: disabled
PeerDiscovery: enabled
Targets: 3
Connections: 3 (2 established, 1 failed)
Peers: 3 (with 6 established connections)
TrustedSubnets: none
Service: ipam
Status: waiting for IP(s) to become available
Range: 192.168.13.0/25
DefaultSubnet: 192.168.13.0/25
If you need any more information, I would be happy to provide it. Any clue?
Not sure if we have the same problem, but before I reboot a node I drain it first, so all pods on that node are evicted and it is safe to reboot. After the node is up, you need to uncordon it again so the node is available for scheduling pods; see the commands below.
My reference: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
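For example (the node name is a placeholder; note that --delete-local-data also removes emptyDir data, so use it with care):
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
# ... reboot the node ...
kubectl uncordon <node-name>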
I guess that the remaining 16 IPs are reserved for pod reuse purposes. These are the maximum pods per node based on the CIDR range per node:
Maximum Pods per Node    CIDR Range per Node
8                        /28
9 to 16                  /27
17 to 32                 /26
33 to 64                 /25
65 to 110                /24
In case your Weave IPs are exhausted and some of them are not released after a reboot, you can delete the file /var/lib/weave/weave-netdata.db and restart the Weave pods.
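A rough sketch of that procedure, assuming the standard weave-net DaemonSet labels:
# on each affected node, remove the persisted IPAM data
sudo rm -f /var/lib/weave/weave-netdata.db
# then restart the Weave pods so they rebuild their IPAM state
kubectl -n kube-system delete pod -l name=weave-net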
In my case, I added a systemd unit which removes the /var/lib/weave/weave-netdata.db file on every reboot or shutdown of the system. Once the system comes up, Weave allocates new IPs to all the pods, and the IP exhaustion was never seen again.
Posting this here in the hope that someone else will find it useful for their use case.
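As an illustration, one way to wire that up is a small systemd unit (the unit name here is made up; the file path comes from the answer above):
# /etc/systemd/system/weave-ipam-cleanup.service  (hypothetical unit name)
[Unit]
Description=Remove stale Weave IPAM data on shutdown/reboot

[Service]
Type=oneshot
RemainAfterExit=yes
# ExecStart is a no-op at boot; ExecStop runs when the unit is stopped during shutdown/reboot
ExecStart=/bin/true
ExecStop=/bin/rm -f /var/lib/weave/weave-netdata.db

[Install]
WantedBy=multi-user.target
Enable it once with systemctl enable weave-ipam-cleanup.service so the cleanup runs on every shutdown or reboot.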

CockroachDB on Single Cluster Kube PODs fail with CrashLoopBackOff

Using VirtualBox and 4 x CentOS 7 OS installs.
Following a basic single-cluster Kubernetes install:
https://kubernetes.io/docs/setup/independent/install-kubeadm/
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
[root@k8s-master cockroach]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 41m v1.13.2
k8s-slave1 Ready <none> 39m v1.13.2
k8s-slave2 Ready <none> 39m v1.13.2
k8s-slave3 Ready <none> 39m v1.13.2
I have created 3 x NFS PVs on the master for my slaves to pick up as part of the cockroachdb-statefulset.yaml, as described here:
https://www.cockroachlabs.com/blog/running-cockroachdb-on-kubernetes/
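(For context, a rough sketch of what one of those NFS PVs could look like; the server address and export path below are placeholders, not taken from the question:)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cockroachdbpv0
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: <master-ip>          # placeholder NFS server address
    path: /srv/nfs/cockroachdb0  # placeholder export path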
However, my CockroachDB pods just continually fail to communicate with each other.
[root@k8s-slave1 kubernetes]# kubectl get pods
NAME READY STATUS RESTARTS AGE
cockroachdb-0 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-1 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-2 0/1 CrashLoopBackOff 6 8m47s
[root@k8s-slave1 kubernetes]# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
datadir-cockroachdb-0 Bound cockroachdbpv0 10Gi RWO 17m
datadir-cockroachdb-1 Bound cockroachdbpv2 10Gi RWO 17m
datadir-cockroachdb-2 Bound cockroachdbpv1 10Gi RWO 17m
...the cockroach pod logs do not really tell me why...
[root@k8s-slave1 kubernetes]# kubectl logs cockroachdb-0
++ hostname -f
+ exec /cockroach/cockroach start --logtostderr --insecure --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
W190113 17:00:46.589470 1 cli/start.go:1055 RUNNING IN INSECURE MODE!
- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.
Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.1/secure-a-cluster.html
I190113 17:00:46.595544 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.600386 1 cli/start.go:1069 CockroachDB CCL v2.1.3 (x86_64-unknown-linux-gnu, built 2018/12/17 19:15:31, go1.10.3)
I190113 17:00:46.759727 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.759809 1 server/config.go:386 system total memory: 3.7 GiB
I190113 17:00:46.759872 1 server/config.go:388 server configuration:
max offset 500000000
cache size 947 MiB
SQL memory pool size 947 MiB
scan interval 10m0s
scan min idle time 10ms
scan max idle time 1s
event log enabled true
I190113 17:00:46.759896 1 cli/start.go:913 using local environment variables: COCKROACH_CHANNEL=kubernetes-insecure
I190113 17:00:46.759909 1 cli/start.go:920 process identity: uid 0 euid 0 gid 0 egid 0
I190113 17:00:46.759919 1 cli/start.go:545 starting cockroach node
I190113 17:00:46.762262 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data/cockroach-temp632709623"
I190113 17:00:46.803749 22 server/server.go:851 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I190113 17:00:46.804168 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data"
I190113 17:00:46.828487 22 server/config.go:494 [n?] 1 storage engine initialized
I190113 17:00:46.828526 22 server/config.go:497 [n?] RocksDB cache size: 947 MiB
I190113 17:00:46.828536 22 server/config.go:497 [n?] store 0: RocksDB, max size 0 B, max open file limit 60536
W190113 17:00:46.838175 22 gossip/gossip.go:1499 [n?] no incoming or outgoing connections
I190113 17:00:46.838260 22 cli/start.go:505 initial startup completed, will now wait for `cockroach init`
or a join to a running cluster to start accepting clients.
Check the log file(s) for progress.
I190113 17:00:46.841243 22 server/server.go:1402 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W190113 17:01:16.841095 89 cli/start.go:535 The server appears to be unable to contact the other nodes in the cluster. Please try:
- starting the other nodes, if you haven't already;
- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
- running the 'cockroach init' command if you are trying to initialize a new cluster.
If problems persist, please see https://www.cockroachlabs.com/docs/v2.1/cluster-setup-troubleshooting.html.
I190113 17:01:31.357765 1 cli/start.go:756 received signal 'terminated'
I190113 17:01:31.359529 1 cli/start.go:821 initiating graceful shutdown of server
initiating graceful shutdown of server
I190113 17:01:31.361064 1 cli/start.go:872 too early to drain; used hard shutdown instead
too early to drain; used hard shutdown instead
...any ideas how to debug this further?
I have gone through the *.yaml file at https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/cockroachdb-statefulset.yaml
I noticed that towards the bottom there is no storageClassName mentioned, which means that during the volume claim process the pods are going to look for the standard storage class.
I am not sure if you used the annotation below while provisioning the 3 NFS volumes:
storageclass.kubernetes.io/is-default-class=true
You should be able to check this using:
kubectl get storageclass
If the output does not show a standard (default) storage class, then I would suggest either adjusting the persistent volume definitions by adding that annotation, or adding an empty string as the storageClassName in the volume claim template of the cockroachdb-statefulset.yaml file, as sketched below.
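A minimal sketch of that second option, with an explicit empty storageClassName in the volume claim template (the storage size is illustrative):
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: ""   # bind to pre-created PVs instead of the default StorageClass
    resources:
      requests:
        storage: 10Gi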
More details can be viewed using:
kubectl describe statefulset cockroachdb
OK, it came down to the fact that I had NAT as my VirtualBox external-facing network adapter. I changed it to Bridged and it all started working perfectly. If anyone can tell me why, that would be awesome :)
In my case, I used the Helm chart, like below:
$ helm install stable/cockroachdb \
-n cockroachdb \
--namespace cockroach \
--set Storage=10Gi \
--set NetworkPolicy.Enabled=true \
--set Secure.Enabled=true
After waiting for the CSRs for CockroachDB to be added:
$ watch kubectl get csr
Several CSRs are pending:
$ kubectl get csr
NAME AGE REQUESTOR CONDITION
cockroachdb.client.root 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-0 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-1 129m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-2 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
To approve them, run the following command:
$ kubectl get csr -o json | \
jq -r '.items[] | select(.metadata.name | contains("cockroachdb.")) | .metadata.name' | \
xargs -n 1 kubectl certificate approve

Ceph status HEALTH_WARN while adding an RGW Instance

I want to create a Ceph cluster and then connect to it through the S3 RESTful API.
So, I've deployed a Ceph cluster (Mimic 13.2.4) on "Ubuntu 16.04.5 LTS" with 3 OSDs (one per 10 GB HDD).
Using these tutorials:
1) http://docs.ceph.com/docs/mimic/start/quick-start-preflight/#ceph-deploy-setup
2) http://docs.ceph.com/docs/mimic/start/quick-ceph-deploy/
At this point, ceph status is OK:
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster:
id: d7459118-8c16-451d-9774-d09f7a926d0e
health: HEALTH_OK
services:
mon: 1 daemons, quorum ubuntu-srv
mgr: ubuntu-srv(active)
osd: 3 osds: 3 up, 3 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 27 GiB / 30 GiB avail
pgs:
3) "To use the Ceph Object Gateway component of Ceph, you must deploy an instance of RGW. Execute the following to create an new instance of RGW:"
root@ubuntu-srv:/home/slavik/my-cluster# ceph-deploy rgw create ubuntu-srv
....
[ceph_deploy.rgw][INFO ] The Ceph Object Gateway (RGW) is now running on host ubuntu-srv and default port 7480
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster:
id: d7459118-8c16-451d-9774-d09f7a926d0e
health: HEALTH_WARN
too few PGs per OSD (2 < min 30)
services:
mon: 1 daemons, quorum ubuntu-srv
mgr: ubuntu-srv(active)
osd: 3 osds: 3 up, 3 in
data:
pools: 1 pools, 8 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 27 GiB / 30 GiB avail
pgs: 37.500% pgs unknown
62.500% pgs not active
5 creating+peering
3 unknown
The Ceph status has changed to HEALTH_WARN - why, and how do I resolve it?
Your issue is
health: HEALTH_WARN
too few PGs per OSD (2 < min 30)
Look at your current PG configuration by running:
ceph osd dump | grep pool
See what PG count each pool is configured with, then go to https://ceph.com/pgcalc/ to calculate what your pools should be configured for.
The warning means you have a low number of PGs per OSD: right now you have 2 per OSD, while the minimum should be 30. See the example below for raising the PG count once you know the target values.
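For example, the PG count of a pool can be raised like this (the pool name and the value 32 are placeholders; use the values from the calculator):
ceph osd pool set <pool-name> pg_num 32
ceph osd pool set <pool-name> pgp_num 32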

calico-etcd not scheduled on GKE 1.11 k8s

I recently upgraded my GKE cluster from 1.10.x to 1.11.x, and since then my calico-node pods fail to connect to the etcd cluster and end up in CrashLoopBackOff due to a livenessProbe error.
I saw that the calico-etcd DaemonSet has a desired state of 0 and was wondering about that. Its nodeSelector is node-role.kubernetes.io/master=.
From the logs of such calico-nodes:
2018-12-19 19:18:28.989 [INFO][7] etcd.go 373: Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
2018-12-19 19:18:28.989 [INFO][7] startup.go 254: Unable to query node configuration Name="gke-brokerme-ubuntu-pool-852d0318-j5ft" error=client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.96.232.136:6666 exceeded header timeout
State of the DaemonSets:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-etcd 0 0 0 0 0 node-role.kubernetes.io/master= 3d
calico-node 2 2 0 2 0 <none> 3d
k get nodes --show-labels:
NAME STATUS ROLES AGE VERSION LABELS
gke-brokerme-ubuntu-pool-852d0318-7v4m Ready <none> 4d v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-7v4m,os=ubuntu
gke-brokerme-ubuntu-pool-852d0318-j5ft Ready <none> 1h v1.11.5-gke.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-2,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=ubuntu-pool,cloud.google.com/gke-os-distribution=ubuntu,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/hostname=gke-brokerme-ubuntu-pool-852d0318-j5ft,os=ubuntu
I did not modify any calico manifests, they should be 1:1 provisioned by GKE.
I would expect the calico-node pods to connect either to the etcd of my Kubernetes cluster or to a calico-etcd provisioned by the DaemonSet. As there is no master node that I can control in GKE, I kind of get why calico-etcd is at desired state 0, but then, which etcd are the calico-node pods supposed to connect to? What's wrong with my small and basic setup?
We are aware of the issue of Calico crash-looping in GKE 1.11.x. You can fix this issue by upgrading to a newer version; I would recommend upgrading to version '1.11.4-gke.12' or '1.11.3-gke.23', which do not have this issue.
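For example, the control plane can be upgraded with gcloud (the cluster name and zone are placeholders):
gcloud container clusters upgrade <cluster-name> --zone <zone> --master --cluster-version 1.11.4-gke.12
# then upgrade the node pools to match
gcloud container clusters upgrade <cluster-name> --zone <zone> --node-pool <pool-name>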