Rook Ceph Provisioning issue - ceph

I am having an issue when trying to create my PVC. The provisioner appears to be unable to provision the volume.
k describe pvc avl-vam-pvc-media-ceph
Name: avl-vam-pvc-media-ceph
Namespace: default
StorageClass: rook-ceph-block
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 10s (x5 over 67s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
Normal Provisioning 5s (x8 over 67s) rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6799bd4cb7-sv4gz_73756eff-f42e-4d8f-8448-d5dedd94d1f2 External provisioner is provisioning volume for claim "default/avl-vam-pvc-media-ceph"
Warning ProvisioningFailed 5s (x8 over 67s) rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6799bd4cb7-sv4gz_73756eff-f42e-4d8f-8448-d5dedd94d1f2 failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
Below is my PVC YAML:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: avl-vam-pvc-media-ceph
spec:
  storageClassName: "rook-ceph-block"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
I used ./rook/cluster/examples/kubernetes/ceph/csi/rbd/storageclass.yaml to create my storage class. I am confused about where this is going wrong.
One other thing I find odd in my Ceph cluster is that my PGs appear to be stuck undersized:
ceph health detail
HEALTH_WARN Degraded data redundancy: 33 pgs undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 33 pgs undersized
pg 1.0 is stuck undersized for 51m, current state active+undersized, last acting [1,0]
pg 2.0 is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.1 is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.2 is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.3 is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.4 is stuck undersized for 44m, current state active+undersized, last acting [2,1]
pg 2.5 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.6 is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.7 is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.8 is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.9 is stuck undersized for 44m, current state active+undersized, last acting [4,1]
pg 2.a is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.b is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.c is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.d is stuck undersized for 44m, current state active+undersized, last acting [0,1]
pg 2.e is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.f is stuck undersized for 44m, current state active+undersized, last acting [1,0]
pg 2.10 is stuck undersized for 44m, current state active+undersized, last acting [2,1]
pg 2.11 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.12 is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.13 is stuck undersized for 44m, current state active+undersized, last acting [0,5]
pg 2.14 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.15 is stuck undersized for 44m, current state active+undersized, last acting [4,3]
pg 2.16 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.17 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.18 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.19 is stuck undersized for 44m, current state active+undersized, last acting [0,3]
pg 2.1a is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.1b is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.1c is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.1d is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.1e is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.1f is stuck undersized for 44m, current state active+undersized, last acting [4,3]
I do have OSDs up
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 10.47958 root default
-3 5.23979 host hostname1
0 ssd 1.74660 osd.0 up 1.00000 1.00000
2 ssd 1.74660 osd.2 up 1.00000 1.00000
4 ssd 1.74660 osd.4 up 1.00000 1.00000
-5 5.23979 host hostname2
1 ssd 1.74660 osd.1 up 1.00000 1.00000
3 ssd 1.74660 osd.3 up 1.00000 1.00000
5 ssd 1.74660 osd.5 up 1.00000 1.00000

You should set accessModes to ReadWriteOnce when using RBD; ReadWriteMany is supported by CephFS (or by RBD only with raw block volumes, as the error message states).
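For example, the claim from the question should provision once the access mode is changed; a minimal sketch reusing the names and size from the manifest above:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: avl-vam-pvc-media-ceph
spec:
  storageClassName: "rook-ceph-block"
  accessModes:
    - ReadWriteOnce   # RWX on RBD would require volumeMode: Block
  resources:
    requests:
      storage: 10Gi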
Also, because your replica count is 3 and the failure domain (the level across which Ceph spreads the copies of each object) is host, you need at least three nodes to resolve the stuck PGs.
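To confirm this, you can check the pool size and its CRUSH rule; a sketch, assuming the default pool name replicapool from the Rook example storage class (your pool name may differ):
# replica count of the pool
ceph osd pool get replicapool size
# which CRUSH rule the pool uses, and the failure domain of that rule
ceph osd pool get replicapool crush_rule
ceph osd crush rule dump
With size 3, a host failure domain, and only two hosts, each PG can only place two copies, which matches the active+undersized state shown above.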

Related

Prometheus metrics yield multiplied values for Kubernetes monitoring on cores, memory and storage

I'm trying to import some pre-built Grafana dashboards for Kubernetes monitoring but I don't get why some metrics seem to be duplicated or multiplied.
For example, this metric is yielding 6 nodes:
sum(kube_node_info{node=~"$node"})
This is double what the cluster actually has:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-agentpool-vmss000000 Ready agent 57d v1.23.5
aks-agentpool-vmss000001 Ready agent 57d v1.23.5
aks-agentpool-vmss000007 Ready agent 35d v1.23.5
Another example:
This metric is yielding a total of 36 cores, when in reality there are only 12 (3 nodes x 4 cores each):
sum (machine_cpu_cores{kubernetes_io_hostname=~"^$Node$"})
Capacity:
cpu: 4
ephemeral-storage: 129900528Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16393292Ki
pods: 110
If I filter the query by system_uuid, each of the 3 UUIDs yields 12 cores.
The same goes for memory usage, filesystem storage and so on. Any idea why the metrics are multiplied?
The dashboard in question is this: https://grafana.com/grafana/dashboards/10000
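One way to narrow this down is to check whether the same series is being exported or scraped more than once; a hedged example built on the queries above (node name taken from the output earlier):
# should return 1 per node; a value of 2 means each node's series is duplicated
count by (node) (kube_node_info)
# inspect the labels (job, instance) of the duplicated series to find the extra scrape
kube_node_info{node="aks-agentpool-vmss000000"}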

Single controlplane node shows not ready

What happened:
The master node does not show Ready anymore. Maybe that happened after a failed upgrade (I downloaded kubeadm and kubelet in a version that was way too high).
s-rtk8s01 Ready Node 2y120d v1.14.1
s-rtk8s02 Ready Node 2y173d v1.14.1
s-rtk8s03 Ready Node 2y174d v1.14.1
s-rtk8s04 Ready Node 2y174d v1.14.1
s-rtk8s05 Ready Node 2y174d v1.14.1
s-rtk8sma01 NotReady,SchedulingDisabled master 2y174d v1.14.1
The scheduler does not show up in the list of pods anymore (after it was deleted forcefully), but docker ps shows that the static pods are being started in the background. The pods in kube-system:
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-hvh6b 1/1 Running 56 288d
coredns-fb8b8dccf-x5r5h 1/1 Running 58 302d
etcd-s-rtk8sma01 1/1 Running 45 535d
kube-apiserver-s-rtk8sma01 1/1 Running 13 535d
kube-controller-manager-s-rtk8sma01 1/1 Running 7 485d
kube-flannel-ds-2fmj4 1/1 Running 6 485d
kube-flannel-ds-5g47f 1/1 Running 5 485d
kube-flannel-ds-5k27n 1/1 Running 5 485d
kube-flannel-ds-cj967 1/1 Running 8 485d
kube-flannel-ds-drjff 1/1 Running 9 485d
kube-flannel-ds-v4sfg 1/1 Running 5 485d
kube-proxy-6ngn6 1/1 Running 11 535d
kube-proxy-85g6c 1/1 Running 10 535d
kube-proxy-gd5jb 1/1 Running 13 535d
kube-proxy-grvsk 1/1 Running 11 535d
kube-proxy-lpht9 1/1 Running 13 535d
kube-proxy-pmdmj 0/1 Pending 0 25h
The systemd logs for kubelet show the following (I see the errors with the hostname casing remarks and an error about a missing mirror pod - maybe the scheduler?):
kubelet_node_status.go:94] Unable to register node "s-rtk8sma01" with API server: nodes "s-rtk8sma01" is forbidden: node "S-RTK8SMA01" is not allowed to modify node "s-rtk8sma01"
setters.go:739] Error getting volume limit for plugin kubernetes.io/azure-disk
setters.go:739] Error getting volume limit for plugin kubernetes.io/cinder
setters.go:739] Error getting volume limit for plugin kubernetes.io/aws-ebs
setters.go:739] Error getting volume limit for plugin kubernetes.io/gce-pd
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml
Reading config file "/etc/kubernetes/manifests/kube-scheduler.yaml_bck"
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Setting pods for source file
anager.go:445] Static pod "56ba6ffcb6b23178170f8063052292ee" (kube-scheduler-s-rtk8sma01/kube-system) does not have a corresponding mirror pod; skipping
anager.go:464] Status Manager: syncPod in syncbatch. pod UID: "24db95fbbd2e618dc6ed589132ed7158"
docker ps shows
aec23e01ee2a 2c4adeb21b4f "etcd --advertise-cl…" 7 hours ago Up 7 hours k8s_etcd_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_59
97910491f3b2 20a2d7035165 "/usr/local/bin/kube…" 26 hours ago Up 26 hours k8s_kube-proxy_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
37d87cdd8886 k8s.gcr.io/pause:3.1 "/pause" 26 hours ago Up 26 hours k8s_POD_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
83a8af0407e5 cfaa4ad74c37 "kube-apiserver --ad…" 39 hours ago Up 39 hours k8s_kube-apiserver_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
85250c421db4 k8s.gcr.io/pause:3.1 "/pause" 39 hours ago Up 39 hours k8s_POD_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
984a3628068c 3fa2504a839b "kube-scheduler --bi…" 40 hours ago Up 40 hours k8s_kube-scheduler_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_7
4d5446906cc5 efb3887b411d "kube-controller-man…" 40 hours ago Up 40 hours k8s_kube-controller-manager_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_3
544423226bed k8s.gcr.io/pause:3.1 "/pause" 40 hours ago Up 40 hours k8s_POD_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_4
a75feece56b5 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_20
1b17cb3ef1c1 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_0
c7c7235ed0dc ff281650a721 "/opt/bin/flanneld -…" 2 months ago Up 2 months k8s_kube-flannel_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_8
d56fe3708565 k8s.gcr.io/pause:3.1 "/pause" 2 months ago Up 2 months k8s_POD_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_7
What you expected to happen:
The master becomes Ready again, and the static pods and DaemonSets are generated again, so I can start upgrading the cluster.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I am really lost at this point; I have spent many hours trying to find a solution on my own and hope to get a little help from the experts, to understand the problem and maybe find some kind of workaround.
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
OnPremise
OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Kernel (e.g. uname -a):
Linux S-RTK8SMA01 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Network plugin and version (if this is a network-related bug):
flannel quay.io/coreos/flannel:v0.11.0-amd64
Does anybody know how to fix those mirror pod problems, and how I can fix the problem with the node name casing?
What I have tried so far: I started kubelet with a hostname override, but this did not have any effect.
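For reference, the override was configured roughly along these lines (a sketch; the exact drop-in file and variable name depend on how kubeadm set up kubelet on the host):
# /etc/default/kubelet (sourced by the kubeadm systemd drop-in)
KUBELET_EXTRA_ARGS=--hostname-override=s-rtk8sma01
# then reload and restart kubelet
systemctl daemon-reload
systemctl restart kubelet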

CockroachDB on Single Cluster Kube PODs fail with CrashLoopBackOff

Using VirtualBox and 4 x Centos7 OS installs.
Following a basic Single cluster kubernetes install:
https://kubernetes.io/docs/setup/independent/install-kubeadm/
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
[root@k8s-master cockroach]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 41m v1.13.2
k8s-slave1 Ready <none> 39m v1.13.2
k8s-slave2 Ready <none> 39m v1.13.2
k8s-slave3 Ready <none> 39m v1.13.2
I have created 3 NFS PVs on the master for my slaves to pick up as part of the cockroachdb-statefulset.yaml, as described here:
https://www.cockroachlabs.com/blog/running-cockroachdb-on-kubernetes/
However, my CockroachDB pods just continually fail to communicate with each other.
[root@k8s-slave1 kubernetes]# kubectl get pods
NAME READY STATUS RESTARTS AGE
cockroachdb-0 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-1 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-2 0/1 CrashLoopBackOff 6 8m47s
[root@k8s-slave1 kubernetes]# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
datadir-cockroachdb-0 Bound cockroachdbpv0 10Gi RWO 17m
datadir-cockroachdb-1 Bound cockroachdbpv2 10Gi RWO 17m
datadir-cockroachdb-2 Bound cockroachdbpv1 10Gi RWO 17m
...the cockroach pod logs do not really tell me why...
[root@k8s-slave1 kubernetes]# kubectl logs cockroachdb-0
++ hostname -f
+ exec /cockroach/cockroach start --logtostderr --insecure --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
W190113 17:00:46.589470 1 cli/start.go:1055 RUNNING IN INSECURE MODE!
- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.
Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.1/secure-a-cluster.html
I190113 17:00:46.595544 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.600386 1 cli/start.go:1069 CockroachDB CCL v2.1.3 (x86_64-unknown-linux-gnu, built 2018/12/17 19:15:31, go1.10.3)
I190113 17:00:46.759727 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.759809 1 server/config.go:386 system total memory: 3.7 GiB
I190113 17:00:46.759872 1 server/config.go:388 server configuration:
max offset 500000000
cache size 947 MiB
SQL memory pool size 947 MiB
scan interval 10m0s
scan min idle time 10ms
scan max idle time 1s
event log enabled true
I190113 17:00:46.759896 1 cli/start.go:913 using local environment variables: COCKROACH_CHANNEL=kubernetes-insecure
I190113 17:00:46.759909 1 cli/start.go:920 process identity: uid 0 euid 0 gid 0 egid 0
I190113 17:00:46.759919 1 cli/start.go:545 starting cockroach node
I190113 17:00:46.762262 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data/cockroach-temp632709623"
I190113 17:00:46.803749 22 server/server.go:851 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I190113 17:00:46.804168 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data"
I190113 17:00:46.828487 22 server/config.go:494 [n?] 1 storage engine initialized
I190113 17:00:46.828526 22 server/config.go:497 [n?] RocksDB cache size: 947 MiB
I190113 17:00:46.828536 22 server/config.go:497 [n?] store 0: RocksDB, max size 0 B, max open file limit 60536
W190113 17:00:46.838175 22 gossip/gossip.go:1499 [n?] no incoming or outgoing connections
I190113 17:00:46.838260 22 cli/start.go:505 initial startup completed, will now wait for `cockroach init`
or a join to a running cluster to start accepting clients.
Check the log file(s) for progress.
I190113 17:00:46.841243 22 server/server.go:1402 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W190113 17:01:16.841095 89 cli/start.go:535 The server appears to be unable to contact the other nodes in the cluster. Please try:
- starting the other nodes, if you haven't already;
- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
- running the 'cockroach init' command if you are trying to initialize a new cluster.
If problems persist, please see https://www.cockroachlabs.com/docs/v2.1/cluster-setup-troubleshooting.html.
I190113 17:01:31.357765 1 cli/start.go:756 received signal 'terminated'
I190113 17:01:31.359529 1 cli/start.go:821 initiating graceful shutdown of server
initiating graceful shutdown of server
I190113 17:01:31.361064 1 cli/start.go:872 too early to drain; used hard shutdown instead
too early to drain; used hard shutdown instead
...any ideas how to debug this further?
I have gone through *.yaml file at https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/cockroachdb-statefulset.yaml
I noticed that towards the bottom there is no storageClassName mentioned, which means that during the volume claim process the pods are going to look for the standard (default) storage class.
I am not sure whether you used the below annotation (which marks a StorageClass as the default) while provisioning the 3 NFS volumes:
storageclass.kubernetes.io/is-default-class=true
You should be able to check this using:
kubectl get storageclass
If the output does not show a standard (default) storage class, then I would suggest either readjusting the persistent volume definitions by adding that annotation, or adding an empty string as storageClassName in the volumeClaimTemplates towards the end of the cockroachdb-statefulset.yaml file (see the sketch below).
More details can be viewed using:
kubectl describe statefulset cockroachdb
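As a sketch of the second option (the template name datadir and the 10Gi request come from the claims above; only storageClassName is added), the volumeClaimTemplates section would look like this:
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes:
      - ReadWriteOnce
    storageClassName: ""   # empty string: bind to the pre-created NFS PVs instead of waiting for dynamic provisioning
    resources:
      requests:
        storage: 10Gi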
OK, it came down to the fact that I had NAT as my VirtualBox external-facing network adapter. I changed it to Bridged and it all started working perfectly. If anyone can tell me why, that would be awesome :)
In my case, I used the Helm chart, like below:
$ helm install stable/cockroachdb \
-n cockroachdb \
--namespace cockroach \
--set Storage=10Gi \
--set NetworkPolicy.Enabled=true \
--set Secure.Enabled=true
Then wait for the CSRs for CockroachDB to be created:
$ watch kubectl get csr
Several CSRs are pending:
$ kubectl get csr
NAME AGE REQUESTOR CONDITION
cockroachdb.client.root 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-0 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-1 129m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-2 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
To approve them, run the following command:
$ kubectl get csr -o json | \
jq -r '.items[] | select(.metadata.name | contains("cockroachdb.")) | .metadata.name' | \
xargs -n 1 kubectl certificate approve

Ceph status HEALTH_WARN while adding an RGW Instance

I want to create a Ceph cluster and then connect to it through the S3 RESTful API.
So I've deployed a Ceph cluster (Mimic 13.2.4) on "Ubuntu 16.04.5 LTS" with 3 OSDs (one per 10 GB HDD).
Using these tutorials:
1) http://docs.ceph.com/docs/mimic/start/quick-start-preflight/#ceph-deploy-setup
2) http://docs.ceph.com/docs/mimic/start/quick-ceph-deploy/
At this point, ceph status is OK:
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster:
id: d7459118-8c16-451d-9774-d09f7a926d0e
health: HEALTH_OK
services:
mon: 1 daemons, quorum ubuntu-srv
mgr: ubuntu-srv(active)
osd: 3 osds: 3 up, 3 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 27 GiB / 30 GiB avail
pgs:
3) "To use the Ceph Object Gateway component of Ceph, you must deploy an instance of RGW. Execute the following to create an new instance of RGW:"
root@ubuntu-srv:/home/slavik/my-cluster# ceph-deploy rgw create ubuntu-srv
....
[ceph_deploy.rgw][INFO ] The Ceph Object Gateway (RGW) is now running on host ubuntu-srv and default port 7480
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster:
id: d7459118-8c16-451d-9774-d09f7a926d0e
health: HEALTH_WARN
too few PGs per OSD (2 < min 30)
services:
mon: 1 daemons, quorum ubuntu-srv
mgr: ubuntu-srv(active)
osd: 3 osds: 3 up, 3 in
data:
pools: 1 pools, 8 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 27 GiB / 30 GiB avail
pgs: 37.500% pgs unknown
62.500% pgs not active
5 creating+peering
3 unknown
The Ceph status has changed to HEALTH_WARN - why, and how do I resolve it?
Your issue is
health: HEALTH_WARN
too few PGs per OSD (2 < min 30)
Look at your current PG configuration by running:
ceph osd dump|grep pool
See what PG count each pool is configured with, then go to https://ceph.com/pgcalc/ to calculate what your pools should be set to.
The warning means you have too few PGs per OSD: right now you have 2 per OSD, while the minimum should be 30. You can raise the PG count on the pool, as sketched below.
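For example (the pool name here is illustrative; use the name reported by ceph osd dump, typically the pool RGW just created), raising the PG count to the value suggested by the calculator would look like:
# increase placement groups for the pool, then match pgp_num to it
ceph osd pool set .rgw.root pg_num 32
ceph osd pool set .rgw.root pgp_num 32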

Kubernetes stops updating current CPU utilization in HPA

I am having an issue where some (but not all) HPAs in my cluster stop updating their CPU utilization. This appears to happen after a different HPA scales its target deployment.
Running kubectl describe hpa on the affected HPA yields these events:
56m <invalid> 453 {horizontal-pod-autoscaler } Warning FailedUpdateStatus Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "sync-api": the object has been modified; please apply your changes to the latest version and try again
The controller-manager logs show affected HPAs start having problems right after a scaling event on another HPA:
I0920 03:50:33.807951 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:33.821044 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:34.982382 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:35.002736 1 horizontal.go:403] Successfully updated status for greyhound-api
I0920 03:50:35.014838 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:35.035785 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:48.873503 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:48.949083 1 horizontal.go:403] Successfully updated status for greyhound-api
I0920 03:50:49.005793 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:49.103726 1 horizontal.go:346] Successfull rescale of monolith, old size: 7, new size: 6, reason: All metrics below target
I0920 03:50:49.135993 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:49.137008 1 event.go:216] Event(api.ObjectReference{Kind:"Deployment", Namespace:"default", Name:"monolith", UID:"086bfbee-7ec7-11e6-a6f5-0240c833a143", APIVersion:"extensions", ResourceVersion:"4210077", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set monolith-1803096525 to 6
E0920 03:50:49.169382 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:49.172986 1 replica_set.go:463] Too many "default"/"monolith-1803096525" replicas, need 6, deleting 1
E0920 03:50:49.222184 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:50.573273 1 event.go:216] Event(api.ObjectReference{Kind:"ReplicaSet", Namespace:"default", Name:"monolith-1803096525", UID:"086e56d0-7ec7-11e6-a6f5-0240c833a143", APIVersion:"extensions", ResourceVersion:"4210080", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: monolith-1803096525-gaz5x
E0920 03:50:50.634225 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:50.666270 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:50.955971 1 horizontal.go:403] Successfully updated status for greyhound-api
W0920 03:50:50.980039 1 horizontal.go:99] Failed to reconcile greyhound-api: failed to update status for greyhound-api: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "greyhound-api": the object has been modified; please apply your changes to the latest version and try again
I0920 03:50:50.995372 1 horizontal.go:403] Successfully updated status for sync-api
W0920 03:50:51.017321 1 horizontal.go:99] Failed to reconcile sync-api: failed to update status for sync-api: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "sync-api": the object has been modified; please apply your changes to the latest version and try again
I0920 03:50:51.032596 1 horizontal.go:403] Successfully updated status for aurora
W0920 03:50:51.084486 1 horizontal.go:99] Failed to reconcile monolith: failed to update status for monolith: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "monolith": the object has been modified; please apply your changes to the latest version and try again
Manually updating the affected HPAs using kubectl edit fixes the problem, but this makes me worry about how reliable HPAs are for autoscaling.
Any help is appreciated. I am running v1.3.6.
It is not correct to set up more than one HPA pointing at the same target Deployment. When two different HPAs point at the same target (as described here), the behavior of the system can become erratic.
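A quick way to spot this is to list every HPA together with its scale target and look for Deployments that appear more than once (the field paths below follow the autoscaling/v1 API; older API versions may name the field differently):
kubectl get hpa --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TARGET:.spec.scaleTargetRef.name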