Broken parameters persisting in Deis deployments - kubernetes

An invalid command parameter got into the deployment for a worker process in a Deis app. Now whenever I run deis pull for a new image, this broken parameter gets passed to the deployment, so the worker doesn't start up successfully.
If I inspect the worker deployment with kubectl, I can see the following parameter set on the container (path /spec/template/spec/containers/0):
"command": [
"/bin/bash",
"-c"
],
This results in the pod not starting up properly:
Error: failed to start container "worker": Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"
Error syncing pod
Back-off restarting failed container
This means that for every release/pull I've been going in and manually removing that parameter from the worker deployment. I've run kubectl delete deployment and recreated it with valid JSON (kubectl create -f deployment.json). This fixes things until I run deis pull again, at which point the broken parameter is back.
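For reference, the manual removal amounts to a JSON patch against that path. A rough sketch of automating it with the kubernetes Python client (the deployment name and namespace below are placeholders, not necessarily what Deis generates):
from kubernetes import client, config

# Hypothetical sketch: strip the injected "command" from the worker deployment.
# "myapp" / "myapp-worker" are placeholder names; assumes the cluster serves
# Deployments under apps/v1.
config.load_kube_config()
apps = client.AppsV1Api()
patch = [{"op": "remove", "path": "/spec/template/spec/containers/0/command"}]
apps.patch_namespaced_deployment(name="myapp-worker", namespace="myapp", body=patch)
This only mirrors the manual workaround, though; deis pull still re-applies the broken value.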
My thinking is that the broken command parameter is persisted somewhere in the deis-database (or similar) and is being re-applied when I run deis pull.
I've tried the troubleshooting guide and dug around in the deis-database, but I can't find where the deployment for the worker process is created, or where the deployment parameters that get passed to Kubernetes on a deis pull come from.
Running deis v2.10.0 on Google Cloud

Related

Error creating pod on master node: Error failed to get sandbox container task: no task found

I'm very new to K3s and am just practicing by creating a deployment with 3 replicas of an nginx pod. Two of the pods are created on my worker nodes, but the one scheduled on my master node fails with a CreateContainerError.
After digging further I found the following error: Error: failed to get sandbox container task: no running task found: task e2829c0383965aa4556c9eecf1ed72feb145211d23f714bdc0962b188572f849 not found: not found.
Any help would be greatly appreciated
After running kubectl describe node and checking the taints for the master node, it shows <none>
In the end, all it needed was a fresh install, and that seems to have solved everything. I probably should have tried that first.

How to pull image from private Docker registry in KubernetesPodOperator of Google Cloud Composer?

I'm trying to run a task in an environment built from an image in a private Google Container Registry, through the KubernetesPodOperator of Google Cloud Composer.
The Container Registry and Cloud Composer instances are under the same project.
My code is below.
import datetime

import airflow
from airflow.contrib.operators import kubernetes_pod_operator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

# Create the Airflow DAG for the pipeline
with airflow.DAG(
        'my_dag',
        schedule_interval=datetime.timedelta(days=1),
        start_date=YESTERDAY) as dag:

    my_task = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='my_task',
        name='my_task',
        cmds=['echo 0'],
        namespace='default',
        image=f'gcr.io/<my_private_repository>/<my_image>:latest')
The task fails and I get the following error message in the logs in the Airflow UI and in the logs folder in the storage bucket.
[2020-09-21 08:39:12,675] {taskinstance.py:1147} ERROR - Pod Launching failed: Pod returned a failure: failed
Traceback (most recent call last)
File "/usr/local/lib/airflow/airflow/contrib/operators/kubernetes_pod_operator.py", line 260, in execut
'Pod returned a failure: {state}'.format(state=final_state
airflow.exceptions.AirflowException: Pod returned a failure: failed
This is not very informative...
Any idea what I could be doing wrong?
Or anywhere I can find more informative log messages?
Thank you very much!
In general, the way to start troubleshooting GCP Composer after a DAG failure is explained well in the dedicated chapter of the GCP documentation.
Moving on to issues specifically related to KubernetesPodOperator, the investigation might consist of:
Verifying the status of the particular task for the corresponding DAG file;
Inspecting the task's logs and events (the logs can also be found in the Composer environment's storage bucket);
For any errors on Kubernetes resources/objects, checking the log/event journals of the Composer environment's GKE cluster.
Analyzing the error context and the kubernetes_pod_operator.py source code further, I assume this issue occurs when the Pod fails to launch on the Airflow worker's GKE node, ending in the 'Pod returned a failure: {state}'.format(state=final_state) message whenever the Pod execution is not successful.
Personally, I prefer to check that the image runs correctly before executing the Airflow task in a Kubernetes Pod. With that in mind, and based on the task command provided, you can verify the Pod launching process by connecting to the GKE cluster and redrafting the kubernetes_pod_operator.KubernetesPodOperator definition as a kubectl command-line invocation:
kubectl run test-app --image=eu.gcr.io/<Project_ID>/image --command -- "/bin/sh" "-c" "echo 0"
This simplifies image validation, and you'll also be able to take a closer look at the Pod logs or event records:
kubectl describe po test-app
Or
kubectl logs test-app
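If the image itself checks out, the same shell invocation can be carried back into the operator. A hedged sketch, assuming the stock cmds/arguments parameters of the contrib operator (the image path is the same placeholder as above):
# cmds maps to the container entrypoint, arguments to its args.
my_task = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='my_task',
    name='my_task',
    namespace='default',
    image='gcr.io/<my_private_repository>/<my_image>:latest',
    cmds=['/bin/sh', '-c'],
    arguments=['echo 0'])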
If you want to pull (or push) an image from a private registry in KubernetesPodOperator, you should create a Secret in Kubernetes that contains a service account (SA) key. The SA should have permission to pull, or possibly push, images (read-only or read-write, respectively).
Then reference that Secret in KubernetesPodOperator via the image_pull_secrets argument:
my_task = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='my_task',
    name='my_task',
    cmds=['echo 0'],
    namespace='default',
    image=f'gcr.io/<my_private_repository>/<my_image>:latest',
    image_pull_secrets='your_secret_name')
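For completeness, the pull secret itself can be created with kubectl create secret docker-registry or programmatically. A rough sketch with the kubernetes Python client, assuming a downloaded SA key file; the file path, secret name, and namespace are placeholders, and the SA needs registry read access:
import base64, json
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

with open("sa-key.json") as f:  # placeholder: downloaded SA key file
    sa_key = f.read()

# GCR accepts "_json_key" as the username with the SA key JSON as the password.
docker_config = {"auths": {"gcr.io": {
    "username": "_json_key",
    "password": sa_key,
    "auth": base64.b64encode(("_json_key:" + sa_key).encode()).decode(),
}}}

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="your_secret_name", namespace="default"),
    type="kubernetes.io/dockerconfigjson",
    data={".dockerconfigjson": base64.b64encode(
        json.dumps(docker_config).encode()).decode()},
)
v1.create_namespaced_secret(namespace="default", body=secret)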

fluentd daemon set container for papertrail failing to start in kubernetes cluster

I'm trying to set up fluentd in a Kubernetes cluster to aggregate logs in Papertrail, as per the documentation provided here.
The configuration file is fluentd-daemonset-papertrail.yaml
It basically creates a daemon set for the fluentd container and a config map for the fluentd configuration.
When I apply the configuration, the pod is assigned to a node and the container is created. However, either it doesn't complete initialization or the pod gets killed immediately after it starts.
As the pods are getting killed, I'm losing the logs too, so I couldn't investigate the cause of the issue.
Looking through the events for the kube-system namespace shows the errors below:
Error: failed to start container "fluentd": Error response from daemon: OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/75026/ns/ipc\" caused \"lstat /proc/75026/ns/ipc: no such file or directory\"": unknown
Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9559643bf77e29d270c23bddbb17a9480ff126b0b6be10ba480b558a0733161c" network for pod "fluentd-papertrail-b9t5b": NetworkPlugin kubenet failed to set up pod "fluentd-papertrail-b9t5b_kube-system" network: Error adding container to network: failed to open netns "/proc/111610/ns/net": failed to Statfs "/proc/111610/ns/net": no such file or directory
I'm not sure what's causing these errors. I'd appreciate any help understanding and troubleshooting them.
Also, is it possible to look at logs/events that could tell us why a pod is given a terminate signal?
Please ensure that /etc/cni/net.d and /opt/cni/bin both exist on all nodes and are correctly populated with the CNI configuration files and binaries, respectively.
Take a look: sandbox.
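As for the question about seeing why a pod was terminated, the pod's event stream is usually the first place to look. A minimal sketch with the kubernetes Python client, reusing the pod name and namespace from the error above:
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Events for the failing pod, roughly what `kubectl describe pod` reports.
events = v1.list_namespaced_event(
    namespace="kube-system",
    field_selector="involvedObject.name=fluentd-papertrail-b9t5b")
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)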
With help from the Papertrail support team, I was able to resolve the issue by removing the entry below from the manifest file:
kubernetes.io/cluster-service: "true"
The above annotation seems to have been deprecated.
Relevant github issues:
https://github.com/fluent/fluentd-kubernetes-daemonset/issues/296
https://github.com/kubernetes/kubernetes/issues/72757

OpenShift 3 mkdir error for DockerFile

I am trying to deploy a Spring Boot app from my Docker Hub account (https://hub.docker.com/r/sonamsamdupkhangsar/springboot-docker/~/dockerfile/) on OpenShift. I selected the YAML section and pasted my YAML config from my GitHub repo (https://github.com/sonamsamdupkhangsar/springboot-docker/blob/master/springboot-hello-deployment.yaml).
After waiting a while, I used to get an error saying that "mkdir" failed. Now this morning I am seeing another error: warning: The pod has been stuck in the pending state for more than five minutes.
Any ideas?
thanks

Regarding Scheduling of K8S

This is about the kubelet status and the kube-scheduler's policy.
The kubelet on each of my eight workers was Ready at the time I spawned containers through an RC.
The scheduler reported that the RC's pods were scheduled across all eight worker nodes, but the pod status was Pending.
I waited long enough for the image to download, but the state didn't change to Running. So I restarted the kubelet service on a worker that had a pending pod, and then all of the pending pods changed to Running.
Scheduled (pod) -> Pending (pod) -> restart kubelet -> Running (pod)
Why was it resolved by restarting kubelet?
The kubelet log looks like this:
factory.go:71] Error trying to work out if we can handle /docker-daemon/docker: error inspecting container: No such container: docker
factory.go:71] Error trying to work out if we can handle /docker: error inspecting container: No such container: docker
factory.go:71] Error trying to work out if we can handle /: error inspecting container: unexpected end of JSON input
factory.go:71] Error trying to work out if we can handle /docker-daemon: error inspecting container: No such container: docker-daemon
factory.go:71] Error trying to work out if we can handle /kubelet: error inspecting container: No such container: kubelet
factory.go:71] Error trying to work out if we can handle /kube-proxy: error inspecting container: No such container: kube-proxy
Another symptom is shown in the picture below (taken from "kubectl describe ~~"): the scheduled pod works well, but a condition in the middle of the output is False.
It's working well but shows False... what does the False mean?
Thanks