Conditions for "mds cluster is degraded" - ceph

I am looking into Ceph, and after running the command
$ ceph status
I see the message "mds cluster is degraded". I searched the official Ceph docs for it but found nothing.
If I understand correctly, the MDS is needed to mount Ceph's data and present it as a regular file system.
Can someone explain the conditions under which this error appears?


Config DBConfig.ExtraParams not specified for ml-pipeline pod

I have installed Kubeflow using the manifests. After installing ml-pipeline, the pod is in the "CrashLoopBackOff" state. I changed the DestinationRule for ml-pipeline, ml-pipeline-ui and ml-pipeline-msql to DISABLE, but no luck. Can anyone help with this?
Thanks in advance.
There are many possible root causes for this pod status, so I will focus on the most common ones. To find the one that applies to your situation, look at the “describe” output and the log of the pod in the "CrashLoopBackOff" state.
Check whether the “describe” output says something like “Back-off restarting failed container” and the log says something like “a container name must be specified for …” or “F ml_metadata/metadata_store/] Non-OK-status …”.
If so, the problem is usually dynamic volume provisioning, possibly because no volume provisioner is installed.
Also check your cluster’s size: on anything smaller than 8 CPUs, Kubeflow will run only if you reduce each service’s requested CPU in the manifest files.
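As a rough illustration of the sizing check, here is a minimal Python sketch that adds up CPU requests written in Kubernetes quantity notation ("500m", "2") and compares the total with the CPUs available. The service names and request values are hypothetical placeholders, not values taken from the Kubeflow manifests.

```python
from decimal import Decimal


def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # "m" means millicores: 1000m == 1 core
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def total_requested_cpu(requests):
    """Sum the CPU requests (service name -> quantity string) in cores."""
    return sum((parse_cpu(q) for q in requests.values()), Decimal(0))


def fits(requests, available_cores):
    """Return True if the summed requests do not exceed the available cores."""
    return total_requested_cpu(requests) <= Decimal(str(available_cores))


# Hypothetical example values, for illustration only.
example_requests = {
    "ml-pipeline": "250m",
    "mysql": "500m",
    "katib-controller": "1",
}
print(total_requested_cpu(example_requests), fits(example_requests, 8))
```

If the total exceeds what the cluster offers, lowering the per-service requests in the manifests is the adjustment mentioned above.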
You have not given details on the affected pod yet, but another option is to install only Katib (without Kubeflow or other resources) on your Kubernetes cluster to verify that other Kubernetes resources are not interfering. For more real-world troubleshooting cases and solutions, see: "Multiple Pods stuck in CrashLoopBackOff", "katib-mysql", and "ml-pipeline-persistenceagent pod keeps crashing".
Finally, confirm that you followed the correct instructions for the distribution you used to deploy Kubeflow; see the Kubeflow Distributions page.

Greenplum install on GKE

I am trying to install Greenplum on GKE using the directions here
I made it to step 12, but my operator pod is failing because it cannot pull the secret:
kubectl logs -l app=greenplum-operator -n greenplum
{"level":"INFO","ts":"2020-03-10T18:20:50.803Z","logger":"operator-setup","msg":"Go Info","Version":"go1.13.7","GOOS":"linux","GOARCH":"amd64"}
{"level":"INFO","ts":"2020-03-10T18:20:50.803Z","logger":"operator-setup","msg":"creating operator"}
W0310 18:20:50.803978 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W0310 18:20:50.804036 1 client_config.go:546] error creating inClusterConfig, falling back to default config: open /var/run/secrets/ permission denied
It looks like a permissions issue pulling the image, but the image pull test earlier in the instructions succeeded:
job.batch/greenplum-operator-fetch-test created
job.batch "greenplum-operator-fetch-test" deleted
Has anyone else run into this issue?
There's a bug in the current documentation. You most likely did everything right. However, creating a GKE cluster with "Enable Kubernetes alpha features in this cluster", as listed on the prerequisites page, is no longer necessary. In fact, it is currently causing the exact issue you seem to be having. Try creating a GKE cluster following all of the documentation, except make sure NOT to enable GKE "alpha features".

How to change kubelet configuration via kubeadm

I'm fairly new to Kubernetes and trying to wrap my head around how to manage ComponentConfigs in already running clusters.
For example:
Recently I initialized a kubeadm cluster in a test environment running Ubuntu. When I did that, I found CoreDNS in CrashLoopBackOff. The cause turned out to be that Ubuntu was configured to use systemd-resolved, so resolv.conf had a loopback resolver configured. After reading the CoreDNS docs, I found that one solution is to change the resolvConf parameter for the kubelet, either via command-line arguments or in the config.
So how would one do this properly in a kubeadm-managed cluster?
Reading [this page in the documentation][1] I didn't really get a clue, because it seems to be tailored to the case of initializing a new cluster or joining new nodes.
Of course, in this particular situation I could just use "kubeadm reset" and initialize it again with a --config parameter, but that doesn't seem to be the right solution for a running cluster.
So after digging a bit deeper I found several pieces of information:
I could change /var/lib/kubelet/kubeadm-flags.env on the node directly, but AFAICT this only makes sense for node-specific changes.
There is a ConfigMap in the kube-system namespace named kubelet-config-1.14. This seems promising for making sure that nodes joining the cluster later get the right configuration, but would changing that CM affect the already running kubelet?
There is a marshalled version of the running config in /var/lib/config/kubelet.yaml that I could change, but AFAIU this would be overridden by the kubelet itself periodically (?) or at least during a kubeadm upgrade.
There seems to be an option to specify a ConfigMap in the node object, to let the kubelet dynamically load the configuration from there, but given that there is already an existing ConfigMap it seems more sensible to change that one.
I seemingly had success by some combination of changing the aforementioned CM, running kubeadm upgrade something afterwards and rebooting the machine (since restarting the kubelet did not fix the CoreDNS issue ... but maybe I was too impatient).
So I am now asking:
What is the recommended way to carry out changes to the kubelet configuration (or any other configuration I could affect via kubeadm-config.yaml) that works and is upgrade-safe for cases where the configuration is not node-specific?
And if this involves running kubeadm ... config --config - how do I extract the existing kubeadm config in a way that I can feed it back to kubeadm?
I am entirely happy with pointers to the right documentation, I just didn't find the right clues myself.
What you are looking for is well described in official documentation.
The basic workflow for configuring a Kubelet is as follows:
Write a YAML or JSON configuration file containing the Kubelet’s configuration.
Wrap this file in a ConfigMap and save it to the Kubernetes control plane.
Update the Kubelet’s corresponding Node object to use this ConfigMap.
In addition, the DynamicKubeletConfig feature gate is enabled by default starting from Kubernetes v1.11, but some additional steps are needed to activate it. Remember that the kubelet’s --dynamic-config-dir flag must be set to a writable directory on the node.
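The three workflow steps above can be sketched in Python as plain data structures: a kubelet configuration that sets resolvConf, a ConfigMap wrapping it, and the patch that points a Node at that ConfigMap. This is only a sketch for the Kubernetes versions discussed here. The ConfigMap name "my-kubelet-config" is a hypothetical placeholder, and the path /run/systemd/resolve/resolv.conf is an assumption about where systemd-resolved keeps the non-loopback resolver list on the node.

```python
import json


def kubelet_config(resolv_conf):
    """Step 1: a minimal kubelet configuration that only sets resolvConf."""
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        "resolvConf": resolv_conf,
    }


def config_map(name, config, namespace="kube-system", key="kubelet"):
    """Step 2: wrap the configuration (serialized as JSON) in a ConfigMap."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {key: json.dumps(config, indent=2)},
    }


def node_patch(config_map_name, namespace="kube-system", key="kubelet"):
    """Step 3: patch body that makes a Node use the ConfigMap."""
    return {
        "spec": {
            "configSource": {
                "configMap": {
                    "name": config_map_name,
                    "namespace": namespace,
                    "kubeletConfigKey": key,
                }
            }
        }
    }


# Hypothetical name and assumed resolv.conf path, for illustration only.
cm = config_map("my-kubelet-config",
                kubelet_config("/run/systemd/resolve/resolv.conf"))
patch = node_patch("my-kubelet-config")
print(json.dumps(cm, indent=2))
print(json.dumps(patch))
```

The printed ConfigMap could be saved to a file and applied with kubectl apply -f, and the patch body passed to kubectl patch node with -p; both would need to be adapted to your cluster.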

Kubernetes Node NotReady: ContainerGCFailed / ImageGCFailed context deadline exceeded

Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Ubuntu, 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is an issue about this on the Kubernetes GitHub (k8 git), which was closed on the grounds that it is related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
It is related to this open k8 issue: Ready/NotReady with PLEG issues
Check node health on AWS with CloudWatch - everything seems to be fine.
journalctl -fu docker.service : check Docker for errors/issues -
the output doesn't show any errors related to that.
systemctl restart docker - after restarting Docker, the node gets into the "Ready" state but becomes "NotReady" again within 3-5 minutes.
It all seems to have started when I deployed more pods to the node (close to its resource capacity, though I don't think that is a direct dependency) or when I was stopping/starting instances (after a restart it is OK, but after some time the node is NotReady).
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no established mitigation or monitoring for this. The best approach appears to be making sure your node is not overloaded with pods. I have seen that the overload does not always show up as disk or memory pressure on the node; the likely problem is that Docker does not have enough resources and fails to respond in time. The proposed solution is to set limits for your pods to prevent overloading the node.
On managed Kubernetes in GKE (I am not sure, but other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and redeploy the node(s).
If you already have requests and limits set, the best way to make sure this does not happen seems to be to increase the memory requests for pods. This means fewer pods per node, and the actual memory used on each node should be lower.
Another way to monitor or recognize this is to SSH into the node and check the memory, check the processes with ps, watch the syslog, and run $ docker stats --all
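For basic monitoring, one option is to poll the node list and flag any node whose Ready condition is not "True". The Python sketch below assumes the input is the already parsed JSON from kubectl get nodes -o json; the node names and the reason string in the sample are hypothetical.

```python
def not_ready_nodes(node_list):
    """Return (name, reason) for every node whose Ready condition is not "True"."""
    result = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # Each node reports several conditions; pick the one of type "Ready".
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            result.append((name, "NoReadyCondition"))
        elif ready.get("status") != "True":
            result.append((name, ready.get("reason", "")))
    return result


# Hypothetical sample shaped like the output of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "worker-1"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False"},
             {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
         ]}},
    ]
}
print(not_ready_nodes(sample))
```

A check like this only detects the symptom; it does not tell you whether Docker was overloaded, so it complements the manual inspection described above rather than replacing it.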
I had the same issue. I cordoned the node and evicted the pods.
Then I rebooted the server, and the node came back into the Ready state automatically.

Failed to create pod sandbox kubernetes cluster

I have the Weave network plugin.
Inside my folder /etc/cni/net.d there is a 10-weave.conf:
{
"name": "weave",
"type": "weave-net",
"hairpinMode": true
}
My Weave pods are running and the DNS pod is also running.
But when I want to run a pod, like a simple nginx pod which will pull an nginx image,
the pod gets stuck at ContainerCreating, and describe pod gives me the error: failed to create pod sandbox.
When I run journalctl -u kubelet I get this error:
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Is my network plugin not configured properly?
I used this command to configure my Weave network:
kubectl apply -f
After that did not work, I also tried this command:
kubectl apply -f "$(kubectl version | base64 | tr -d '\n')"
I even tried flannel and that gives me the same error.
The system I am setting Kubernetes up on is a Raspberry Pi.
I am trying to build a Raspberry Pi cluster with 3 nodes and 1 master with Kubernetes.
Does anyone have ideas on this?
Thank you all for responding to my question. I have solved my problem now. For anyone who comes to this question in the future, the solution was as follows.
I had cloned my Raspberry Pi images because I wanted a basicConfig.img for when I need to add a new node to my cluster or when one goes down.
Weave Net (the plugin I used) got confused because the OS on every node and on the master had the same machine-id. When I deleted the machine-id and created a new one (and rebooted the nodes), my error was fixed. The commands to do this were:
sudo rm /etc/machine-id
sudo rm /var/lib/dbus/machine-id
sudo dbus-uuidgen --ensure=/etc/machine-id
Once again my patience was being tested, because my Kubernetes setup was normal and my Raspberry Pi OS was normal. I found this with the help of someone in the Kubernetes community. This again shows how important and great our IT community is. To the people who come to this question in the future: I hope this solution fixes your error and reduces the time you spend searching for a stupid small thing.
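To spot this situation before it causes trouble, one could collect the contents of /etc/machine-id from every host and look for repeats. The Python sketch below only does the comparison; how the IDs are gathered (for example over SSH) is left out, and the host names and ID values are hypothetical.

```python
from collections import defaultdict


def duplicate_machine_ids(ids_by_host):
    """Return machine-id -> sorted host list for every ID used by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, machine_id in ids_by_host.items():
        # Strip the trailing newline that reading /etc/machine-id would include.
        hosts_by_id[machine_id.strip()].append(host)
    return {
        machine_id: sorted(hosts)
        for machine_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Hypothetical values: three hosts cloned from one image, one regenerated.
collected = {
    "master": "0123456789abcdef0123456789abcdef\n",
    "node1": "0123456789abcdef0123456789abcdef\n",
    "node2": "0123456789abcdef0123456789abcdef\n",
    "node3": "fedcba9876543210fedcba9876543210\n",
}
print(duplicate_machine_ids(collected))
```

Any ID that appears in the result belongs to hosts that were probably cloned from the same image and still need a fresh machine-id.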
Looking at the pertinent code in Kubernetes and in CNI, the specific error you see seems to indicate that it cannot find any files ending in .json, .conf or .conflist in the directory given.
This makes me think it could be something as simple as the conf file not being present on all the hosts, so I would verify that as a first step.
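A quick way to run that first check on each host is to list the directory and keep only files with the extensions mentioned above. This Python sketch is a simplified stand-in for the lookup, not the actual Kubernetes or CNI code; the demo uses a temporary directory instead of /etc/cni/net.d so it can run anywhere.

```python
import os
import tempfile

# Extensions the error message refers to, per the explanation above.
CNI_EXTENSIONS = (".conf", ".conflist", ".json")


def cni_config_files(conf_dir):
    """Return the sorted names of candidate CNI config files in conf_dir."""
    try:
        entries = os.listdir(conf_dir)
    except FileNotFoundError:
        # A missing directory means there are no config files to find.
        return []
    return sorted(
        name
        for name in entries
        if name.endswith(CNI_EXTENSIONS)
        and os.path.isfile(os.path.join(conf_dir, name))
    )


# Demo with a throwaway directory standing in for /etc/cni/net.d.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "10-weave.conf"), "w").close()
    open(os.path.join(demo_dir, "README.txt"), "w").close()
    found = cni_config_files(demo_dir)

print(found)
```

On a real node you would call cni_config_files("/etc/cni/net.d"); an empty list on any host would match the "No networks found" message from the kubelet.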