I have installed Kubeflow using manifest. After installing ml-pipeline, the pod is in "CrashLoopBackOff" state. I changed the destinationrule for ml-pipeline, ml-pipeline-ui and ml-pipeline-msql to DISABLE but no luck. Can anyone help with this?
Thanks in advance.
There are a bunch of possible root causes for this POD’s status, but I am going to try to focus on the most common ones. To choose the correct one for your accurate situation, you are going to need to take a look into the “describe” and the log from the POD with "CrashLoopBackOff" state.
Verify if the “describe” says something like “Back-off restarting failed container” and the log says something like “a container name must be specified for …”, “F ml_metadata/metadata_store/metadata_store_server_main.cc:219] Non-OK-status …”.
If yes, the problem is the dynamic volume provisioning regularly, maybe because no volume provisioner is installed.
On the other hand, you can verify your cluster’s size, because anything less than 8 CPUs is going to run only if you reduce each service’s requested cpu in the manifest files.
You do not give details on the affected POD yet; but another option is to try to install Katib only (without Kubeflow or other resources) on your K8s cluster to verify other Kubernetes resources do not affect this connection. You can use the following URL’s information for more empirical cases’ troubleshooting and solutions: Multiple Pods stuck in CrashLoopBackOff, katib-mysql , ml-pipeline-persistenceagent pod keeps crashing.
Finally just confirm that you followed the correct instructions, based on the Distribution you used to deploy Kubeflow, you can visit the following URL: Kubeflow Distributions
Related
I have deployed a k8s service, however its not showing any pods. This is what I see
kubectl get deployments
It should create on the default namespace
kubectl get nodes (this shows me nothing)
How do I troubleshoot a failed deployment. The test-control-plane is the one deployed by kind this is the k8s one I'm using.
kubectl get nodes
If above command is not showing anything which mean there is no Nodes in your cluster so where your workload will run ?
You need to have at least one worker node in K8s cluster so deployment can schedule the POD on it and run the application.
You can check worker node using same command
kubectl get nodes
You can debug more and check the reason of issue further using
kubectl describe deployment <name of your deployment>
To find out what really went wrong, first follow the steps described in Harsh Manvar in his answer. Perhaps obtaining that information can help you find the problem. If not, check the logs of your deployment. Try to list your pods and see which ones did not boot properly, then check their logs.
You can also use the kubectl describe on pods to see in more detail what went wrong. Since you are using kind, I include a list of known errors for you.
You can also see this visual guide on troubleshooting Kubernetes deployments and 5 Tips for Troubleshooting Kubernetes Deployments.
Am actually new to kubernetes, but as now am good with the terms such as deployment, pods etc.
Well i was trying an example of HPA (Horizontal pod autoscaler), and as prerequisite metrics-servers is already integrated, but after all those things am not able to see HPA working as expected
enter image description here
When execute below cmd-
Kubectl get hpa
Unknown in the target, although i have tried all my luck referring online forum but didn't got any break through
Any help would be really appreciated
Thank you
I was getting same issue, fixed after adding cpu requests in my pod defination. Below points can be the reason in most of the cases
Metric server is not installed in your kubernates cluster, you can check with command
(kubectl get deploy,svc -n kube-system | egrep metrics-server)
Check if you have provided resources for deployment/sts/pod definations
I'm trying to keep the execution logs of containers in Kubernetes.
I added in my cronjob yaml the successfulJobsHistoryLimit: 5 failedJobsHistoryLimit: 5 in order to see the execution history, but when I try to view the logs of the pods I get this error
I assume it is because the pods have been deleted because when I go to a running pod I can see the logs.
So is there a way of keeping the logs in this part of Kubernetes or is there something that I have to setup in order to have this functionality?
Sorry if the question have been asked but I didn't really find something and I'm new to Kubernetes.
Thanks for the replies.
Looking at this problem in a bigger picture it's generally a good idea to have your logs stored via logging agents or directly pushed into an external service as per the official documentation.
Taking advantage of Kubernetes logging architecture explained here you can also try to fetch the logs directly from the log-rotate files in the node hosting the pods. Please note that this option might depend on the specific Kubernetes implementation as log files might be deleted when the pod eviction is triggered.
I'm running Traefik on a Kubernetes cluster to manage Ingress, which has been running ok for a long time.
I recently implemented Cluster-Autoscaling, which works fine except that on one Node (newly created by the Autoscaler) Traefik won't start. It sits in CrashLoopBackoff, and when I log the Pod I get: [date] [time] command traefik error: field not found, node: redirect.
Google found no relevant results, and the error itself is not very descriptive, so I'm not sure where to look.
My best guess is that it has something to do with the RedirectRegex Middleware configured in Traefik's config file:
[entryPoints.http.redirect]
regex = "^http://(.+)(:80)?/(.*)"
replacement = "https://$1/$3"
Traefik actually works still - I can still access all of my apps from their urls in my browser, even those which are on the node with the dead Traefik Pod.
The other Traefik Pods on other Nodes still run happily, and the Nodes are (at least in theory) identical.
After further googling, I found this on Reddit. Turns out Traefik updated a few days ago to v2.0, which is not backwards compatible.
Only this pod had the issue, because it was the only one for which a new (v2.0) image was pulled (being the only recently created Node).
I reverted to v1.7 until I have time to fix it properly. Had update the Daemonset to use v1.7, then kill the Pod so it could be recreated from the old image.
The devs have a Migration Guide that looks like it may help.
"redirect" is gone but now there is "RedirectScheme" and "RedirectRegex" as a new concept of "Middlewares".
It looks like they are moving to a pipeline approach, so you can define a chain of "middlewares" to apply to an "entrypoint" to decide how to direct it and what to add/remove/modify on packets in that chain. "backends" are now "providers", and they have a clearer, modular concept of configuration. It looks like it will offer better organization than earlier versions.
I'm fairly new to Kubernetes and trying to wrap my head around how to manage ComponentConfigs in already running clusters.
For example:
Recently I initialized a kubeadm cluster in a test environment running Ubuntu. When I did that, I found CoreDNS to be in a CrashLoopBackoff which turned out to be the case because Ubuntu was configured to use systemd-resolved and so the resolv.conf had a loopback resolver configured. After reading the docs for coredns, I found out that a solution for that would be to change the resolvConf parameter for kubelet - either via commandline arguments or in the config.
So how would one do this properly in a kubeadm-managed cluster?
Reading [this page in the documentation][1] I didn't really get a clue, because it seems to be tailored to the case of initializing a new cluster or joining new nodes.
Of course, in this particular situation I could just use "Kubeadm reset" and initialize it again with a --config parameter but that doesn't seem to be the right solution for a running cluster.
So after digging a bit deeper I found several infos:
I could change the /var/lib/kubelet/kubeadm-flags.env on the node directly, but AFAICT this only makes sense for node-specific changes.
There is a ConfigMap in the kube-system namespace named kubelet-config-1.14. This seems promising for upcoming nodes joining the cluster to get the right configuration - but would changing that CM affect the already running Kubelet?
There is a marshalled version of the running config in /var/lib/config/kubelet.yaml that I could change, but AFAIU this would be overriden by kubelet itself periodically (?) or at least during a kubeadm upgrade.
There seems to be an option to specify a configmap in the node object, to let kubelet dynamically load the configuration from there, but given that there is already an existing configmap it seems more sensible to change that one.
I seemingly had success by some combination of changing aforementioned CM, running kubeadm upgrade something afterwards and rebooting the machine (since restarting the kubelet did not fix the CoreDNS issue ... but maybe I was to impatient).
So I am now asking:
What is the recommended way to carry out changes to the kubelet configuration (or any other configuration I could affect via kubeadm-config.yaml) that works and is upgrade-safe for cases where the configuration is not node-specific?
And if this involves running kubeadm ... config --config - how do I extract the existing Kubeadm-config in a way that I can feed it back to to kubeadm?
I am entirely happy with pointers to the right documentation, I just didn't find the right clues myself.
TIA
What you are looking for is well described in official documentation.
The basic workflow for configuring a Kubelet is as follows:
Write a YAML or JSON configuration file containing the Kubelet’s configuration.
Wrap this file in a ConfigMap and save it to the Kubernetes control plane.
Update the Kubelet’s corresponding Node object to use this ConfigMap.
In addition there is DynamicKubeletConfig Feature Gate is enabled by default starting from Kubernetes v1.11, but you need some additional steps to activate it. You need to remember about, that Kubelet’s --dynamic-config-dir flag must be set to a writable directory on the Node.