Rook Ceph Provisioning issue - ceph
I am having an issue when trying to create my PVC. It appears the provisioner is unable to provision the volume.
k describe pvc avl-vam-pvc-media-ceph
Name: avl-vam-pvc-media-ceph
Namespace: default
StorageClass: rook-ceph-block
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 10s (x5 over 67s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
Normal Provisioning 5s (x8 over 67s) rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6799bd4cb7-sv4gz_73756eff-f42e-4d8f-8448-d5dedd94d1f2 External provisioner is provisioning volume for claim "default/avl-vam-pvc-media-ceph"
Warning ProvisioningFailed 5s (x8 over 67s) rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6799bd4cb7-sv4gz_73756eff-f42e-4d8f-8448-d5dedd94d1f2 failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
Below is my PVC YAML:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: avl-vam-pvc-media-ceph
spec:
  storageClassName: "rook-ceph-block"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
I used ./rook/cluster/examples/kubernetes/ceph/csi/rbd/storageclass.yaml to create my StorageClass. I am confused about where this is going wrong.
One other thing I find odd in my Ceph cluster is that my PGs appear to be stuck undersized:
ceph health detail
HEALTH_WARN Degraded data redundancy: 33 pgs undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 33 pgs undersized
pg 1.0 is stuck undersized for 51m, current state active+undersized, last acting [1,0]
pg 2.0 is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.1 is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.2 is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.3 is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.4 is stuck undersized for 44m, current state active+undersized, last acting [2,1]
pg 2.5 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.6 is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.7 is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.8 is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.9 is stuck undersized for 44m, current state active+undersized, last acting [4,1]
pg 2.a is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.b is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.c is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.d is stuck undersized for 44m, current state active+undersized, last acting [0,1]
pg 2.e is stuck undersized for 44m, current state active+undersized, last acting [2,3]
pg 2.f is stuck undersized for 44m, current state active+undersized, last acting [1,0]
pg 2.10 is stuck undersized for 44m, current state active+undersized, last acting [2,1]
pg 2.11 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.12 is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.13 is stuck undersized for 44m, current state active+undersized, last acting [0,5]
pg 2.14 is stuck undersized for 44m, current state active+undersized, last acting [3,4]
pg 2.15 is stuck undersized for 44m, current state active+undersized, last acting [4,3]
pg 2.16 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.17 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.18 is stuck undersized for 44m, current state active+undersized, last acting [5,2]
pg 2.19 is stuck undersized for 44m, current state active+undersized, last acting [0,3]
pg 2.1a is stuck undersized for 44m, current state active+undersized, last acting [3,2]
pg 2.1b is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.1c is stuck undersized for 44m, current state active+undersized, last acting [5,4]
pg 2.1d is stuck undersized for 44m, current state active+undersized, last acting [3,0]
pg 2.1e is stuck undersized for 44m, current state active+undersized, last acting [2,5]
pg 2.1f is stuck undersized for 44m, current state active+undersized, last acting [4,3]
I do have OSDs up:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 10.47958 root default
-3 5.23979 host hostname1
0 ssd 1.74660 osd.0 up 1.00000 1.00000
2 ssd 1.74660 osd.2 up 1.00000 1.00000
4 ssd 1.74660 osd.4 up 1.00000 1.00000
-5 5.23979 host hostname2
1 ssd 1.74660 osd.1 up 1.00000 1.00000
3 ssd 1.74660 osd.3 up 1.00000 1.00000
5 ssd 1.74660 osd.5 up 1.00000 1.00000
You should set accessModes to ReadWriteOnce when using RBD with volumeMode: Filesystem. ReadWriteMany is only supported by RBD when volumeMode is Block, or by CephFS. That is exactly what the provisioner error is telling you: multi node access modes are only supported on rbd `block` type volumes.
Also, because your pool's replica count is 3 and the failure domain (the level at which Ceph spreads the copies of each object) is host, you need 3 or more nodes to resolve the stuck PGs; with only 2 hosts, the third replica has nowhere to be placed.
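For reference, a minimal sketch of the same claim with only the access mode changed:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: avl-vam-pvc-media-ceph
spec:
  storageClassName: "rook-ceph-block"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
If you really need ReadWriteMany, you would have to request a raw block volume (volumeMode: Block) on RBD, or use a CephFS-backed StorageClass instead. You can also verify the pool's replica count and failure domain with ceph osd pool get replicapool size and ceph osd crush rule dump (replicapool being the pool name used in the Rook example storageclass.yaml; substitute your own pool name if it differs).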
Related
Prometheus metrics yield multiplied values for Kubernetes monitoring on cores, memory and storage
I'm trying to import some pre-built Grafana dashboards for Kubernetes monitoring, but I don't get why some metrics seem to be duplicated or multiplied. For example, this metric yields 6 nodes:
sum(kube_node_info{node=~"$node"})
which is double what the cluster actually has:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-agentpool-vmss000000 Ready agent 57d v1.23.5
aks-agentpool-vmss000001 Ready agent 57d v1.23.5
aks-agentpool-vmss000007 Ready agent 35d v1.23.5
Another example: this metric yields a total of 36 cores, when in reality there are only 12 (3 nodes x 4 cores each):
sum (machine_cpu_cores{kubernetes_io_hostname=~"^$Node$"})
Capacity: cpu: 4 ephemeral-storage: 129900528Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 16393292Ki pods: 110
If I filter the query by system_uuid, each of the 3 uuids yields 12 cores. The same goes for memory usage, filesystem storage and so on. Any idea why the metrics are multiplied? The dashboard in question is this: https://grafana.com/grafana/dashboards/10000
Single controlplane node shows not ready
What happened: The master node does not show Ready anymore. Maybe that happened after a failed update (downloaded kubeadm and kubelet in a way too high version).
s-rtk8s01 Ready Node 2y120d v1.14.1
s-rtk8s02 Ready Node 2y173d v1.14.1
s-rtk8s03 Ready Node 2y174d v1.14.1
s-rtk8s04 Ready Node 2y174d v1.14.1
s-rtk8s05 Ready Node 2y174d v1.14.1
s-rtk8sma01 NotReady,SchedulingDisabled master 2y174d v1.14.1
The scheduler does not show up in the list of pods (after it was deleted forcefully), but docker ps shows that the static pods are getting started in the background.
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-hvh6b 1/1 Running 56 288d
coredns-fb8b8dccf-x5r5h 1/1 Running 58 302d
etcd-s-rtk8sma01 1/1 Running 45 535d
kube-apiserver-s-rtk8sma01 1/1 Running 13 535d
kube-controller-manager-s-rtk8sma01 1/1 Running 7 485d
kube-flannel-ds-2fmj4 1/1 Running 6 485d
kube-flannel-ds-5g47f 1/1 Running 5 485d
kube-flannel-ds-5k27n 1/1 Running 5 485d
kube-flannel-ds-cj967 1/1 Running 8 485d
kube-flannel-ds-drjff 1/1 Running 9 485d
kube-flannel-ds-v4sfg 1/1 Running 5 485d
kube-proxy-6ngn6 1/1 Running 11 535d
kube-proxy-85g6c 1/1 Running 10 535d
kube-proxy-gd5jb 1/1 Running 13 535d
kube-proxy-grvsk 1/1 Running 11 535d
kube-proxy-lpht9 1/1 Running 13 535d
kube-proxy-pmdmj 0/1 Pending 0 25h
The systemd logs for kubelet show the following (I see those errors with the hostname case remarks and an error about a missing mirror pod - maybe the scheduler?):
kubelet_node_status.go:94] Unable to register node "s-rtk8sma01" with API server: nodes "s-rtk8sma01" is forbidden: node "S-RTK8SMA01" is not allowed to modify node "s-rtk8sma01"
setters.go:739] Error getting volume limit for plugin kubernetes.io/azure-disk
setters.go:739] Error getting volume limit for plugin kubernetes.io/cinder
setters.go:739] Error getting volume limit for plugin kubernetes.io/aws-ebs
setters.go:739] Error getting volume limit for plugin kubernetes.io/gce-pd
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml
Reading config file "/etc/kubernetes/manifests/kube-scheduler.yaml_bck"
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Setting pods for source file
anager.go:445] Static pod "56ba6ffcb6b23178170f8063052292ee" (kube-scheduler-s-rtk8sma01/kube-system) does not have a corresponding mirror pod; skipping
anager.go:464] Status Manager: syncPod in syncbatch. pod UID: "24db95fbbd2e618dc6ed589132ed7158"
docker ps shows:
aec23e01ee2a 2c4adeb21b4f "etcd --advertise-cl…" 7 hours ago Up 7 hours k8s_etcd_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_59
97910491f3b2 20a2d7035165 "/usr/local/bin/kube…" 26 hours ago Up 26 hours k8s_kube-proxy_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
37d87cdd8886 k8s.gcr.io/pause:3.1 "/pause" 26 hours ago Up 26 hours k8s_POD_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
83a8af0407e5 cfaa4ad74c37 "kube-apiserver --ad…" 39 hours ago Up 39 hours k8s_kube-apiserver_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
85250c421db4 k8s.gcr.io/pause:3.1 "/pause" 39 hours ago Up 39 hours k8s_POD_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
984a3628068c 3fa2504a839b "kube-scheduler --bi…" 40 hours ago Up 40 hours k8s_kube-scheduler_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_7
4d5446906cc5 efb3887b411d "kube-controller-man…" 40 hours ago Up 40 hours k8s_kube-controller-manager_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_3
544423226bed k8s.gcr.io/pause:3.1 "/pause" 40 hours ago Up 40 hours k8s_POD_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_4
a75feece56b5 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_20
1b17cb3ef1c1 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_0
c7c7235ed0dc ff281650a721 "/opt/bin/flanneld -…" 2 months ago Up 2 months k8s_kube-flannel_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_8
d56fe3708565 k8s.gcr.io/pause:3.1 "/pause" 2 months ago Up 2 months k8s_POD_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_7
What you expected to happen: The master becomes Ready again, and the static pods and daemonsets are generated again, so I can start to upgrade the cluster.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: I am really lost at this point, have tried for many hours to find a solution by myself, and hope to get a little bit of help from the experts to understand the problem and maybe get some kind of workaround.
Environment:
Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: OnPremise
OS (e.g: cat /etc/os-release): NAME="Ubuntu" VERSION="18.04.2 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.2 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
Kernel (e.g. uname -a): Linux S-RTK8SMA01 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Network plugin and version (if this is a network-related bug): flannel quay.io/coreos/flannel:v0.11.0-amd64
Does anybody know how to fix those mirror pod problems, and how I can fix the problem with the node name casing? What I tried so far: I started kubelet with a hostname override, but this did not have any effect.
CockroachDB on Single Cluster Kube PODs fail with CrashLoopBackOff
Using VirtualBox and 4 x CentOS 7 OS installs, following a basic single-cluster Kubernetes install:
https://kubernetes.io/docs/setup/independent/install-kubeadm/
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
[root@k8s-master cockroach]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 41m v1.13.2
k8s-slave1 Ready <none> 39m v1.13.2
k8s-slave2 Ready <none> 39m v1.13.2
k8s-slave3 Ready <none> 39m v1.13.2
I have created 3 x NFS PVs on the master for my slaves to pick up as part of the cockroachdb-statefulset.yaml as described here: https://www.cockroachlabs.com/blog/running-cockroachdb-on-kubernetes/
However, my cockroach pods just continually fail to communicate with each other.
[root@k8s-slave1 kubernetes]# kubectl get pods
NAME READY STATUS RESTARTS AGE
cockroachdb-0 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-1 0/1 CrashLoopBackOff 6 8m47s
cockroachdb-2 0/1 CrashLoopBackOff 6 8m47s
[root@k8s-slave1 kubernetes]# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
datadir-cockroachdb-0 Bound cockroachdbpv0 10Gi RWO 17m
datadir-cockroachdb-1 Bound cockroachdbpv2 10Gi RWO 17m
datadir-cockroachdb-2 Bound cockroachdbpv1 10Gi RWO 17m
...the cockroach pod logs do not really tell me why...
[root@k8s-slave1 kubernetes]# kubectl logs cockroachdb-0
++ hostname -f
+ exec /cockroach/cockroach start --logtostderr --insecure --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
W190113 17:00:46.589470 1 cli/start.go:1055 RUNNING IN INSECURE MODE! - Your cluster is open for any client that can access <all your IP addresses>. - Any user, even root, can log in without providing a password. - Any user, connecting as root, can read or write any data in your cluster. - There is no network encryption nor authentication, and thus no confidentiality. Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.1/secure-a-cluster.html
I190113 17:00:46.595544 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.600386 1 cli/start.go:1069 CockroachDB CCL v2.1.3 (x86_64-unknown-linux-gnu, built 2018/12/17 19:15:31, go1.10.3)
I190113 17:00:46.759727 1 server/status/recorder.go:609 available memory from cgroups (8.0 EiB) exceeds system memory 3.7 GiB, using system memory
I190113 17:00:46.759809 1 server/config.go:386 system total memory: 3.7 GiB
I190113 17:00:46.759872 1 server/config.go:388 server configuration: max offset 500000000 cache size 947 MiB SQL memory pool size 947 MiB scan interval 10m0s scan min idle time 10ms scan max idle time 1s event log enabled true
I190113 17:00:46.759896 1 cli/start.go:913 using local environment variables: COCKROACH_CHANNEL=kubernetes-insecure
I190113 17:00:46.759909 1 cli/start.go:920 process identity: uid 0 euid 0 gid 0 egid 0
I190113 17:00:46.759919 1 cli/start.go:545 starting cockroach node
I190113 17:00:46.762262 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data/cockroach-temp632709623"
I190113 17:00:46.803749 22 server/server.go:851 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I190113 17:00:46.804168 22 storage/engine/rocksdb.go:574 opening rocksdb instance at "/cockroach/cockroach-data"
I190113 17:00:46.828487 22 server/config.go:494 [n?] 1 storage engine initialized
I190113 17:00:46.828526 22 server/config.go:497 [n?] RocksDB cache size: 947 MiB
I190113 17:00:46.828536 22 server/config.go:497 [n?] store 0: RocksDB, max size 0 B, max open file limit 60536
W190113 17:00:46.838175 22 gossip/gossip.go:1499 [n?] no incoming or outgoing connections
I190113 17:00:46.838260 22 cli/start.go:505 initial startup completed, will now wait for `cockroach init` or a join to a running cluster to start accepting clients. Check the log file(s) for progress.
I190113 17:00:46.841243 22 server/server.go:1402 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W190113 17:01:16.841095 89 cli/start.go:535 The server appears to be unable to contact the other nodes in the cluster. Please try: - starting the other nodes, if you haven't already; - double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly; - running the 'cockroach init' command if you are trying to initialize a new cluster. If problems persist, please see https://www.cockroachlabs.com/docs/v2.1/cluster-setup-troubleshooting.html.
I190113 17:01:31.357765 1 cli/start.go:756 received signal 'terminated'
I190113 17:01:31.359529 1 cli/start.go:821 initiating graceful shutdown of server initiating graceful shutdown of server
I190113 17:01:31.361064 1 cli/start.go:872 too early to drain; used hard shutdown instead too early to drain; used hard shutdown instead
...any ideas how to debug this further?
I have gone through the YAML file at https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/cockroachdb-statefulset.yaml and noticed that towards the bottom there is no storageClassName mentioned, which means that during the volume claim process the pods are going to look for the default (standard) storage class. I am not sure if you used the below annotation while provisioning the 3 NFS volumes:
storageclass.kubernetes.io/is-default-class=true
You should be able to check this using:
kubectl get storageclass
If the output does not show a default storage class, then I would suggest either readjusting the persistent volume definitions by adding the annotation, or adding an empty string as storageClassName towards the end of the cockroachdb-statefulset.yaml file. More details can be viewed using:
kubectl describe statefulset cockroachdb
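For illustration, a minimal sketch of the two options mentioned above (the "nfs" StorageClass name is a placeholder, not taken from the question): either pin the claims to the pre-created PVs with an empty storageClassName in the StatefulSet's volumeClaimTemplates, or mark an existing StorageClass as the cluster default.
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: ""    # empty string disables dynamic provisioning and binds to matching pre-created PVs
    resources:
      requests:
        storage: 10Gi
# or, mark an existing StorageClass as the default:
kubectl patch storageclass nfs -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'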
OK, it came down to the fact that I had NAT as my VirtualBox external-facing network adapter. I changed it to Bridged and it all started working perfectly. If anyone can tell me why, that would be awesome :)
In my case, using the Helm chart, like below:
$ helm install stable/cockroachdb \
  -n cockroachdb \
  --namespace cockroach \
  --set Storage=10Gi \
  --set NetworkPolicy.Enabled=true \
  --set Secure.Enabled=true
After waiting for the CSRs for cockroach to be added:
$ watch kubectl get csr
Several CSRs are pending:
$ kubectl get csr
NAME AGE REQUESTOR CONDITION
cockroachdb.client.root 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-0 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-1 129m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
cockroachdb.node.cockroachdb-cockroachdb-2 130m system:serviceaccount:cockroachdb:cockroachdb-cockroachdb Pending
To approve them, run the following command:
$ kubectl get csr -o json | \
  jq -r '.items[] | select(.metadata.name | contains("cockroach.")) | .metadata.name' | \
  xargs -n 1 kubectl certificate approve
Ceph status HEALTH_WARN while adding an RGW Instance
I want to create a Ceph cluster and then connect to it through the S3 RESTful API. So, I've deployed a Ceph cluster (mimic 13.2.4) on "Ubuntu 16.04.5 LTS" with 3 OSDs (one per 10 GB HDD), using these tutorials:
1) http://docs.ceph.com/docs/mimic/start/quick-start-preflight/#ceph-deploy-setup
2) http://docs.ceph.com/docs/mimic/start/quick-ceph-deploy/
At this point, ceph status is OK:
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster: id: d7459118-8c16-451d-9774-d09f7a926d0e health: HEALTH_OK
services: mon: 1 daemons, quorum ubuntu-srv mgr: ubuntu-srv(active) osd: 3 osds: 3 up, 3 in
data: pools: 0 pools, 0 pgs objects: 0 objects, 0 B usage: 3.0 GiB used, 27 GiB / 30 GiB avail pgs:
3) "To use the Ceph Object Gateway component of Ceph, you must deploy an instance of RGW. Execute the following to create a new instance of RGW:"
root@ubuntu-srv:/home/slavik/my-cluster# ceph-deploy rgw create ubuntu-srv
....
[ceph_deploy.rgw][INFO ] The Ceph Object Gateway (RGW) is now running on host ubuntu-srv and default port 7480
root@ubuntu-srv:/home/slavik/my-cluster# ceph -s
cluster: id: d7459118-8c16-451d-9774-d09f7a926d0e health: HEALTH_WARN too few PGs per OSD (2 < min 30)
services: mon: 1 daemons, quorum ubuntu-srv mgr: ubuntu-srv(active) osd: 3 osds: 3 up, 3 in
data: pools: 1 pools, 8 pgs objects: 0 objects, 0 B usage: 3.0 GiB used, 27 GiB / 30 GiB avail
pgs: 37.500% pgs unknown 62.500% pgs not active 5 creating+peering 3 unknown
Ceph status has changed to HEALTH_WARN - why, and how do I resolve it?
Your issue is:
health: HEALTH_WARN too few PGs per OSD (2 < min 30)
Look at your current PG config by running:
ceph osd dump | grep pool
See what PG count each pool is configured with, then go to https://ceph.com/pgcalc/ to calculate what your pools should be configured for. The warning means you have a low number of PGs per OSD: right now you have 2 per OSD, where the minimum should be 30.
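For example, once the calculator gives you a target, you can raise the PG count of the pool that was just created. The pool name below is a placeholder (use whatever ceph osd dump | grep pool reports, e.g. the RGW pool), and 32 is only an illustrative value:
ceph osd pool set .rgw.root pg_num 32
ceph osd pool set .rgw.root pgp_num 32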
Kubernetes stops updating current CPU utilization in HPA
I am having an issue with some (but not all) HPAs in my cluster stopping updating their CPU utilization. This appears to happen after some different HPA scales its target deployment. Running kubectl describe hpa on the affected HPA yields these events:
56m <invalid> 453 {horizontal-pod-autoscaler } Warning FailedUpdateStatus Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "sync-api": the object has been modified; please apply your changes to the latest version and try again
The controller-manager logs show affected HPAs start having problems right after a scaling event on another HPA:
I0920 03:50:33.807951 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:33.821044 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:34.982382 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:35.002736 1 horizontal.go:403] Successfully updated status for greyhound-api
I0920 03:50:35.014838 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:35.035785 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:48.873503 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:48.949083 1 horizontal.go:403] Successfully updated status for greyhound-api
I0920 03:50:49.005793 1 horizontal.go:403] Successfully updated status for sync-api
I0920 03:50:49.103726 1 horizontal.go:346] Successfull rescale of monolith, old size: 7, new size: 6, reason: All metrics below target
I0920 03:50:49.135993 1 horizontal.go:403] Successfully updated status for monolith
I0920 03:50:49.137008 1 event.go:216] Event(api.ObjectReference{Kind:"Deployment", Namespace:"default", Name:"monolith", UID:"086bfbee-7ec7-11e6-a6f5-0240c833a143", APIVersion:"extensions", ResourceVersion:"4210077", FieldPath:""}): type: 'Normal' reason: 'Scaling ReplicaSet' Scaled down replica set monolith-1803096525 to 6
E0920 03:50:49.169382 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:49.172986 1 replica_set.go:463] Too many "default"/"monolith-1803096525" replicas, need 6, deleting 1
E0920 03:50:49.222184 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:50.573273 1 event.go:216] Event(api.ObjectReference{Kind:"ReplicaSet", Namespace:"default", Name:"monolith-1803096525", UID:"086e56d0-7ec7-11e6-a6f5-0240c833a143", APIVersion:"extensions", ResourceVersion:"4210080", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: monolith-1803096525-gaz5x
E0920 03:50:50.634225 1 deployment_controller.go:400] Error syncing deployment default/monolith: Deployment.extensions "monolith" is invalid: status.unavailableReplicas: Invalid value: -1: must be greater than or equal to 0
I0920 03:50:50.666270 1 horizontal.go:403] Successfully updated status for aurora
I0920 03:50:50.955971 1 horizontal.go:403] Successfully updated status for greyhound-api
W0920 03:50:50.980039 1 horizontal.go:99] Failed to reconcile greyhound-api: failed to update status for greyhound-api: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "greyhound-api": the object has been modified; please apply your changes to the latest version and try again
I0920 03:50:50.995372 1 horizontal.go:403] Successfully updated status for sync-api
W0920 03:50:51.017321 1 horizontal.go:99] Failed to reconcile sync-api: failed to update status for sync-api: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "sync-api": the object has been modified; please apply your changes to the latest version and try again
I0920 03:50:51.032596 1 horizontal.go:403] Successfully updated status for aurora
W0920 03:50:51.084486 1 horizontal.go:99] Failed to reconcile monolith: failed to update status for monolith: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "monolith": the object has been modified; please apply your changes to the latest version and try again
Manually updating the affected HPAs using kubectl edit fixes the problem, but this makes me worry about how reliable HPAs are for autoscaling. Any help is appreciated. I am running v1.3.6.
It is not correct to set up more than one HPA pointing to the same target deployment. When two different HPAs point to the same target (as described here), the behavior of the system may be unpredictable.
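As a quick way to spot this, the sketch below (hypothetical, not part of the original answer) lists every HPA together with the workload it scales, so duplicate targets stand out:
kubectl get hpa --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,HPA:.metadata.name,TARGET:.spec.scaleTargetRef.name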