1 pg undersized health warn in rook-ceph on a single-node cluster (minikube) - kubernetes

I'm deploying rook-ceph into a minikube cluster. Everything seems to be working: I added 3 unformatted disks to the VM and they are connected. The problem I'm having is that when I run ceph status, I get a health warning that says "1 pg undersized". How exactly do I fix this?
The documentation (https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/) states: "If you are trying to create a cluster on a single node, you must change the default of the osd crush chooseleaf type setting from 1 (meaning host or node) to 0 (meaning osd) in your Ceph configuration file before you create your monitors and OSDs." I don't know where to make this configuration change, but if there's any other way to fix this that I should know of, please let me know. Thanks!

I came across this problem installing Ceph using Rook (v1.5.7) with a single data-bearing host holding multiple OSDs.
The install shipped with a default CRUSH rule, replicated_rule, which had host as the default failure domain:
$ ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
I had to find the pool associated with PG 1, the one that was "undersized"; luckily, in a default rook-ceph install there's only one pool:
$ ceph osd pool ls
device_health_metrics
$ ceph pg ls-by-pool device_health_metrics
PG OBJECTS DEGRADED ... STATE
1.0 0 0 ... active+undersized+remapped
And to confirm the pg is using the default rule:
$ ceph osd pool get device_health_metrics crush_rule
crush_rule: replicated_rule
Instead of modifying the default CRUSH rule, I opted to create a new replicated rule, this time specifying the osd (a.k.a. device) type (docs: CRUSH map Types and Buckets), and assuming the default CRUSH root of default:
# osd crush rule create-replicated <name> <root> <type> [<class>]
$ ceph osd crush rule create-replicated replicated_rule_osd default osd
$ ceph osd crush rule dump replicated_rule_osd
{
    "rule_id": 1,
    "rule_name": "replicated_rule_osd",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}
And then assigning the new rule to the existing pool:
$ ceph osd pool set device_health_metrics crush_rule replicated_rule_osd
set pool 1 crush_rule to replicated_rule_osd
$ ceph osd pool get device_health_metrics crush_rule
crush_rule: replicated_rule_osd
Finally confirming pg state:
$ ceph pg ls-by-pool device_health_metrics
PG OBJECTS DEGRADED ... STATE
1.0 0 0 ... active+clean

As you mentioned in your question, you should change your CRUSH failure domain type to osd, which means data will be replicated between OSDs rather than between hosts. By default it is host, and when you have only one host there are no other hosts to replicate to, so your PG will always stay undersized.
You should set osd crush chooseleaf type = 0 in your ceph.conf before you create your monitors and OSDs.
This makes Ceph replicate your data between OSDs rather than hosts.
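For reference, this is what that looks like as a minimal ceph.conf fragment (in Rook you would normally inject such options through the rook-config-override ConfigMap rather than editing ceph.conf by hand; treat the exact mechanism as something to verify for your Rook version):
[global]
# single-node cluster: pick OSDs (type 0) instead of hosts (type 1) as CRUSH leaves
osd crush chooseleaf type = 0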

New account, so I can't add this as a comment. I wanted to expound on @zamnuts' answer, as I hit the same issue in my cluster with rook:v1.7.2. If you want to change the default device_health_metrics pool in the Rook/Ceph Helm chart or in the YAML, the following documents are relevant:
https://github.com/rook/rook/blob/master/deploy/examples/pool-device-health-metrics.yaml
https://github.com/rook/rook/blob/master/Documentation/helm-ceph-cluster.md
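For illustration, a rough sketch of what such an override looks like, modeled on that example manifest (field names such as failureDomain, replicated.size and the rook-ceph namespace are assumptions to verify against your Rook version):
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  # CR name Rook expects for the built-in health-metrics pool (see the linked example)
  name: device-health-metrics
  namespace: rook-ceph
spec:
  name: device_health_metrics   # actual Ceph pool name
  failureDomain: osd            # replicate across OSDs instead of hosts
  replicated:
    size: 3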


Get usage of processes using system-reserved CPU/memory

Updated Question:
Is there an easy way to tell how much of a node's CPU/memory I've used up, broken down between its system-reserved allocation and its node-allocatable allocation?
I'd like to know how much of my node allocatable memory is remaining/available for use at any given time or how much system-reserved memory is remaining/available for use. Same for the CPU usage.
Original Question:
Can I easily see how much CPU/memory is used by an OpenShift worker node by system processes only?
I'm using commands such as oc get nodemetrics to get the CPU/memory usage, which I expect is analogous to a simple Linux top command, i.e. it reports the total for system processes as well as "user" pod processes on the node.
I'm trying to get an idea of how much of my system-reserved allocation is used vs. free.
Here are the pages I have been looking at for reference:
https://docs.openshift.com/container-platform/4.7/nodes/nodes/nodes-nodes-viewing.html
https://docs.openshift.com/container-platform/4.6/nodes/nodes/nodes-nodes-resources-configuring.html
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
I currently have no custom resource set for the machine config, so I'm expecting the allocation to be the default 500m for CPU and 1Gi for memory as described in the docs. Here is what I see when I run:
$ oc get node ... -o json
{
    ...
    "status": {
        ...
        "allocatable": {
            "cpu": "23500m",
            ...
            "memory": "40036164Ki",
            ...
        },
        "capacity": {
            "cpu": "24",
            ...
            "memory": "41187140Ki",
            ...
        },
        ...
From above, 24000m - 23500m = 500m, so I'm expecting that's the system-reserved CPU allocation.
I tried this to get the current usage, which I expect gives me the total of all system and pod processes on the node.
$ oc get nodemetrics ... -o json
{
    ...
    "usage": {
        "cpu": "916m",
        "memory": "14314036Ki"
    },
    ...
}
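For reference, here is the rough arithmetic I'm doing by hand, as a jsonpath sketch (NODE is a placeholder for my node's name):
$ NODE=worker-0   # hypothetical node name
$ oc get node "$NODE" -o jsonpath='{.status.capacity.cpu} {.status.capacity.memory}{"\n"}'
$ oc get node "$NODE" -o jsonpath='{.status.allocatable.cpu} {.status.allocatable.memory}{"\n"}'
$ oc get nodemetrics "$NODE" -o jsonpath='{.usage.cpu} {.usage.memory}{"\n"}'
# reserved = capacity - allocatable: 24000m - 23500m = 500m CPU,
# and 41187140Ki - 40036164Ki = 1150976Ki ≈ 1Gi + 100Mi memory (the extra 100Mi is likely the eviction threshold),
# while, as far as I understand, the nodemetrics usage covers everything on the node, system daemons and pods combined.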

How to pass a flag to klog for structured logging

As part of Kubernetes 1.19, structured logging has been implemented.
I've read that Kubernetes' logging engine is klog and that structured logs follow this format:
<klog header> "<message>" <key1>="<value1>" <key2>="<value2>" ...
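For illustration, a klog line in that format would look something like this (hypothetical values, not an actual log excerpt):
I1025 00:15:15.525108       1 kubelet.go:116] "Pod status updated" pod="kube-system/kubedns" status="ready"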
Cool! But even better, you can apparently pass a --logging-format=json flag to klog so logs are generated directly in JSON!
{
    "ts": 1580306777.04728,
    "v": 4,
    "msg": "Pod status updated",
    "pod": {
        "name": "nginx-1",
        "namespace": "default"
    },
    "status": "ready"
}
Unfortunately, I haven't been able to find out how and where I should specify that --logging-format=json flag.
Is it a kubectl command? I'm using Azure's AKS.
--logging-format=json is a flag which needs to be set on all Kubernetes system components (kubelet, API server, controller manager & scheduler). You can check all the flags here.
Unfortunately you can't do it right now with AKS, as the control plane is managed by Microsoft.
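For clusters where you do control the components (e.g. kubeadm-based rather than AKS), the flag goes on each component's own command line; a sketch for the API server's static pod manifest, assuming the standard kubeadm path:
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --logging-format=json    # structured JSON logs, available from Kubernetes 1.19
The kubelet accepts the same flag on its own command line.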

Is it possible, and how, to limit a Kubernetes Job to create a maximum number of pods if they always fail?

As a QA in our company I am a daily user of Kubernetes, and we use Kubernetes Jobs to create performance-test pods. One advantage of a Job, according to the docs, is
to create one Job object in order to reliably run one Pod to completion
But in our tests this feature will create pods indefinitely if the previous ones fail, which occupies resources in our team's shared cluster, and deleting such pods takes a lot of time. See this image:
Currently the job manifest is like this:
{
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "upgradeperf",
        "namespace": "ntg6-grpc26-tts"
    },
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "upgradeperfjob",
                        "image": "mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1",
                        "command": [
                            "python",
                            "/jmeterwork/jmeter.py",
                            "-gu",
                            "git#gitlab-pri-eastus2.dev.mycompany.net:mobility-ncs-tools/tts-cdqa-tool.git",
                            "-gb",
                            "upgradeperf",
                            "-t",
                            "JMeter/testcases/ttssvc/JMeterTestPlan_ttssvc_cmpsize.jmx",
                            "-JtestDataFile",
                            "JMeter/testcases/ttssvc/testData/avaml_opus.csv",
                            "-JthreadNum",
                            "3",
                            "-JthreadLoopCount",
                            "1500",
                            "-JresultsFile",
                            "results_upgradeperf_cavaml_opus_t3_l1500.csv",
                            "-Jhost",
                            "mtl-blade32-03.mycompany.com",
                            "-Jport",
                            "28416"
                        ]
                    }
                ],
                "restartPolicy": "Never",
                "imagePullSecrets": [
                    {
                        "name": "docker-registry-secret"
                    }
                ]
            }
        }
    }
}
In some cases, such as a misconfigured IP or port, 'reliably run one Pod to completion' is impossible, and recreating pods is a waste of time and resources.
So is it possible, and how, to limit a Kubernetes Job to create a maximum number (say 3) of pods if they always fail?
Depending on your Kubernetes version, you can resolve this problem with one of these methods:
Set the option restartPolicy: OnFailure; then the failed container will be restarted in the same Pod, so you will not get lots of failed Pods, but instead one Pod with lots of restarts.
From Kubernetes 1.8 on, there is a parameter backoffLimit to control the retry behaviour of a failed Job. It defines how many times the Job is retried before it is marked as failed (default 6). For this parameter to work you must set restartPolicy: Never, as in the sketch below.
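A minimal sketch of how this fits the manifest above, trimmed to the relevant fields (backoffLimit: 3 is just an example value):
apiVersion: batch/v1
kind: Job
metadata:
  name: upgradeperf
spec:
  backoffLimit: 3            # stop retrying after 3 failed attempts (default is 6)
  template:
    spec:
      restartPolicy: Never   # each retry then runs as a new pod
      containers:
      - name: upgradeperfjob
        image: mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1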
You probably didn't set restartPolicy: Never in your pod spec; add that and I would expect it to match your expected behavior better.

Unable to fully collect metrics, when installing metric-server

I have installed the metrics server on Kubernetes, but it's not working and it logs:
unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:xxx: unable to fetch metrics from Kubelet ... (X.X): Get https:....: x509: cannot validate certificate for 1x.x.
x509: certificate signed by unknown authority
I was able to get metrics after I modified the deployment YAML and added:
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
This now collects metrics, and kubectl top node returns results...
but the logs still show:
E1120 11:58:45.624974 1 reststorage.go:144] unable to fetch pod metrics for pod dev/pod-6bffbb9769-6z6qz: no metrics known for pod
E1120 11:58:45.625289 1 reststorage.go:144] unable to fetch pod metrics for pod dev/pod-6bffbb9769-rzvfj: no metrics known for pod
E1120 12:00:06.462505 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-1x.x.x.eu-west-1.compute.internal: unable to get CPU for container ...discarding data: missing cpu usage metric, unable to fully scrape metrics from source
So, questions:
1) All this works on minikube, but not on my dev cluster. Why would that be?
2) In production I don't want to use insecure TLS, so can someone please explain why this issue is arising, or point me to some resource?
Kubeadm generates the kubelet certificate at /var/lib/kubelet/pki, and those certificates (kubelet.crt and kubelet.key) are signed by a different CA from the one used to generate all the other certificates at /etc/kubernetes/pki.
You need to regenerate the kubelet certificates so that they are signed by your root CA (/etc/kubernetes/pki/ca.crt).
You can use openssl or cfssl to generate the new certificates (I am using cfssl here).
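If cfssl and cfssljson are not installed yet, one way to get them (a sketch, assuming a Go toolchain is available; any other install method works too):
$ go install github.com/cloudflare/cfssl/cmd/cfssl@latest
$ go install github.com/cloudflare/cfssl/cmd/cfssljson@latest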
$ mkdir certs; cd certs
$ cp /etc/kubernetes/pki/ca.crt ca.pem
$ cp /etc/kubernetes/pki/ca.key ca-key.pem
Create a file kubelet-csr.json:
{
    "CN": "kubernetes",
    "hosts": [
        "127.0.0.1",
        "<node_name>",
        "kubernetes",
        "kubernetes.default",
        "kubernetes.default.svc",
        "kubernetes.default.svc.cluster",
        "kubernetes.default.svc.cluster.local"
    ],
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [{
        "C": "US",
        "ST": "NY",
        "L": "City",
        "O": "Org",
        "OU": "Unit"
    }]
}
Create a ca-config.json file:
{
    "signing": {
        "default": {
            "expiry": "8760h"
        },
        "profiles": {
            "kubernetes": {
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ],
                "expiry": "8760h"
            }
        }
    }
}
Now generate the new certificates using the above files:
$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem \
--config=ca-config.json -profile=kubernetes \
kubelet-csr.json | cfssljson -bare kubelet
Replace the old certificates with the newly generated ones:
$ scp kubelet.pem <nodeip>:/var/lib/kubelet/pki/kubelet.crt
$ scp kubelet-key.pem <nodeip>:/var/lib/kubelet/pki/kubelet.key
Now restart the kubelet so that the new certificates take effect on the node.
$ systemctl restart kubelet
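To confirm the kubelet is serving the newly signed certificate, a quick check against the kubelet's serving port (assuming the default 10250):
$ openssl s_client -connect <nodeip>:10250 </dev/null 2>/dev/null | openssl x509 -noout -issuer -dates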
Look at the following ticket for more context on the issue:
https://github.com/kubernetes-incubator/metrics-server/issues/146
Hope this helps.

Failed to add new osd into monitor node in ceph

I am trying to add an OSD node with the following command:
ceph-deploy osd prepare ceph-02:/dev/sdb
It fails with an error saying the config file /etc/ceph/ceph.conf exists with different content; use --overwrite-conf to overwrite.
How do I use --overwrite-conf?
Error log:
[ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.4.1708 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph-02
[ceph-02][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][ERROR ] RuntimeError: config file /etc/ceph/ceph.conf exists
with different content; use --overwrite-conf to overwrite
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
This is normal behavior for ceph-deploy. Just run ceph-deploy --overwrite-conf osd prepare ceph-02:/dev/sdb; this will replace the existing /etc/ceph/ceph.conf on the node and resolve the issue.
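If you prefer to sync the configuration explicitly before retrying the prepare step, something like this should also work (a sketch; ceph-deploy's config push subcommand distributes the admin node's /etc/ceph/ceph.conf to the target host):
$ ceph-deploy --overwrite-conf config push ceph-02
$ ceph-deploy osd prepare ceph-02:/dev/sdb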