Kubernetes pod marked as `Completed` despite the exit code `255` - kubernetes

Situation:
I've got a CronJob that often fails (this is expected at the moment). Because the container performing the job has a sidecar, the dependencies between the containers are expressed through bash scripts and a shared emptyDir mount in the /etc/liveness folder:
spec:
  containers:
  - args:
    - -c
    - set -x;
      ...
      ./process; # execute the main process
      rc=$?;
      rm /etc/liveness; # clean-up
      exit $rc;
    command:
    - /bin/bash
Problem:
In scenarios where the job fails, I see the following in the logs:
+ rc=255
+ rm /etc/liveness
+ exit 255
With restartPolicy set to Never, the failed pod enters the Completed status, which is misleading:
scheduler-1594015200-wl9xc 0/2 Completed 0 24m

According to the official documentation,
A Job creates one or more Pods and ensures that a specified number of
them successfully terminate.
And a container enters the Terminated state when
it has successfully completed execution or when it has failed for some
reason.
So with restartPolicy set to Never, the container is not restarted after it fails; it simply ends up in the Terminated state.

A Pod's status field is a PodStatus object, which has a phase field.
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
Status and Phase are not the same. So I learned that what happens above is that my pods end up with status Completed and phase Failed.
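To see the difference for a concrete pod, the phase and the per-container termination details can be read side by side. This is a minimal sketch, assuming the dict mirrors the `status` section of `kubectl get pod -o json`. The sample values, the container names, and the `summarize` helper are hypothetical and only illustrate the field layout (`phase`, `containerStatuses[].state.terminated`):

```python
from typing import Any

# Hypothetical sample, shaped like the "status" section of a pod object.
# Container names and values are made up for illustration.
sample_status: dict[str, Any] = {
    "phase": "Failed",
    "containerStatuses": [
        {"name": "job", "state": {"terminated": {"exitCode": 255, "reason": "Error"}}},
        {"name": "sidecar", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
    ],
}


def summarize(status: dict[str, Any]) -> dict[str, Any]:
    """Return the pod phase plus (exit code, reason) for each terminated container."""
    containers = {}
    for cs in status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated is not None:
            containers[cs["name"]] = (terminated["exitCode"], terminated.get("reason"))
    return {"phase": status.get("phase"), "containers": containers}


summary = summarize(sample_status)
print(summary["phase"])
for name, (code, reason) in summary["containers"].items():
    print(f"{name}: exit code {code}, reason {reason}")
```

Reading the phase or the exit codes, rather than the STATUS column alone, is one way to avoid treating a failed run as a successful one.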

Related

How to fix Unsupported Config Type "" error in Hyperledger Fabric on Kubernetes?

I am trying to follow this tutorial on deploying Hyperledger Fabric on Kubernetes. But instead of IBM Cloud, I'm doing it with Google Cloud. I encountered this same issue (see my logs below) and tried:
changing docker image to docker:18.09-dind in docker.yaml.
setting FABRIC_CFG_PATH=$PWD/configFiles instead of FABRIC_CFG_PATH=$PWD in create_channel.yaml according to another StackOverflow answer.
However, these workarounds did not work for me and I still encounter the error.
How do I fix this to be able to successfully deploy the network?
> ./setup_blockchainNetwork.sh
peersDeployment.yaml file was configured to use Docker in a container.
Creating Docker deployment
persistentvolume/docker-pv created
persistentvolumeclaim/docker-pvc created
service/docker created
deployment.apps/docker-dind created
Creating volume
The Persistant Volume does not seem to exist or is not bound
Creating Persistant Volume
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/createVolume.yaml
persistentvolume/shared-pv created
persistentvolumeclaim/shared-pvc created
Success creating Persistant Volume
Creating Copy artifacts job.
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/copyArtifactsJob.yaml
job.batch/copyartifacts created
Wating for container of copy artifact pod to run. Current status of copyartifacts-dcg4m is Pending
copyartifacts-dcg4m is now Running
Starting to copy artifacts in persistent volume.
Waiting for 10 more seconds for copying artifacts to avoid any network delay
Waiting for copyartifacts job to complete
Copy artifacts job completed
Generating the required artifacts for Blockchain network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/generateArtifactsJob.yaml
job.batch/utils created
Waiting for generateArtifacts job to complete
Waiting for generateArtifacts job to complete
Creating Services for blockchain network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/blockchain-services.yaml
service/blockchain-ca created
service/blockchain-orderer created
service/blockchain-org1peer1 created
service/blockchain-org2peer1 created
service/blockchain-org3peer1 created
service/blockchain-org4peer1 created
Creating new Deployment to create four peers in network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/peersDeployment.yaml
deployment.apps/blockchain-orderer created
deployment.apps/blockchain-ca created
deployment.apps/blockchain-org1peer1 created
deployment.apps/blockchain-org2peer1 created
deployment.apps/blockchain-org3peer1 created
deployment.apps/blockchain-org4peer1 created
Checking if all deployments are ready
Waiting for 15 seconds for peers and orderer to settle
Creating channel transaction artifact and a channel
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/create_channel.yaml
job.batch/createchannel created
Waiting for createchannel job to be completed
Waiting for createchannel job to be completed
Create Channel Failed
> kubectl get pods
NAME READY STATUS RESTARTS AGE
blockchain-ca-58b4bbbcc7-dqmnw 1/1 Running 0 30s
blockchain-orderer-ddc9466d-2sqt8 1/1 Running 0 30s
blockchain-org1peer1-ffbf698bb-fd6nf 1/1 Running 0 29s
blockchain-org2peer1-98f7fb5f9-mb5m7 1/1 Running 0 29s
blockchain-org3peer1-75d6b8bf5c-bxd24 1/1 Running 0 29s
blockchain-org4peer1-675669ffff-b4dxj 1/1 Running 0 29s
copyartifacts-dcg4m 0/1 Completed 0 60s
createchannel-9wt54 1/2 Error 0 12s
docker-dind-54767c54c5-crk7b 0/1 CrashLoopBackOff 3 73s
utils-wbpcz 0/2 Completed 0 37s
> kubectl logs createchannel-9wt54 -c createchanneltx
/shared
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-networkd.service-QHqKfL
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-resolved.service-NuNfWF
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-timesyncd.service-SzE37R
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen] main -> INFO 001 Loading configuration
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen.localconfig] Load -> PANI 002 Error reading configuration: Unsupported Config Type ""
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen] func1 -> PANI 003 Error reading configuration: Unsupported Config Type ""
panic: Error reading configuration: Unsupported Config Type "" [recovered]
panic: Error reading configuration: Unsupported Config Type ""
...
The FABRIC_CFG_PATH setting is wrong.
This error occurs when there is a syntax problem in the configtx.yaml file, or when the file path is wrong and the file cannot be found.
configtxgen reads the configtx.yaml file under FABRIC_CFG_PATH.
In the tutorial you linked, configtx.yaml is not in the configFiles directory; it is in the artifacts directory.
Here are two of the simplest solutions out of many.
Move artifacts/configtx.yaml to configFiles/configtx.yaml:
mv ./artifacts/configtx.yaml configFiles/configtx.yaml
Or set FABRIC_CFG_PATH to the artifacts directory:
export FABRIC_CFG_PATH=${PWD}/artifacts
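Before rerunning the job, it may help to confirm which directory actually contains configtx.yaml. This is a minimal sketch, not part of the tutorial: the `has_configtx` helper is hypothetical, and the demonstration builds a throwaway directory layout that only mimics the artifacts/configFiles split described above.

```python
import os
import tempfile


def has_configtx(cfg_dir: str) -> bool:
    """Return True if cfg_dir directly contains a configtx.yaml file."""
    return os.path.isfile(os.path.join(cfg_dir, "configtx.yaml"))


# Throwaway layout: configtx.yaml lives in artifacts/, not in configFiles/.
with tempfile.TemporaryDirectory() as root:
    artifacts = os.path.join(root, "artifacts")
    config_files = os.path.join(root, "configFiles")
    os.makedirs(artifacts)
    os.makedirs(config_files)
    open(os.path.join(artifacts, "configtx.yaml"), "w").close()

    results = {
        name: has_configtx(os.path.join(root, name))
        for name in ("artifacts", "configFiles")
    }
    print(results)
```

In a real checkout, the same check could be pointed at whatever directory FABRIC_CFG_PATH is set to.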

How to wait for Tekton pipelineRun conditions

I have the following code within a GitLab pipeline which results in some kind of race condition:
kubectl apply -f pipelineRun.yaml
tkn pipelinerun logs -f pipeline-run
The tkn command immediately exits, since the pipelineRun object is not yet created. There is one very nice solution for this problem:
kubectl apply -f pipelineRun.yaml
kubectl wait --for=condition=Running --timeout=60s pipelinerun/pipeline-run
tkn pipelinerun logs -f pipeline-run
Unfortunately this is not working as expected, since Running does not seem to be a valid condition for a pipelineRun object. So my question is: what are the valid conditions of a pipelineRun object?
I didn't search too far and wide, but it looks like they only have two condition types imported from the knative.dev project?
https://github.com/tektoncd/pipeline/blob/main/vendor/knative.dev/pkg/apis/condition_types.go#L32
The link above points to the condition types vendored into the pipeline source code; of these, it looks like Tekton only uses "Ready" and "Succeeded".
const (
	// ConditionReady specifies that the resource is ready.
	// For long-running resources.
	ConditionReady ConditionType = "Ready"
	// ConditionSucceeded specifies that the resource has finished.
	// For resource which run to completion.
	ConditionSucceeded ConditionType = "Succeeded"
)
But there may be other imports of this nature elsewhere in the project.
Tekton TaskRuns and PipelineRuns only use a condition of type Succeeded.
Example:
conditions:
- lastTransitionTime: "2020-05-04T02:19:14Z"
  message: "Tasks Completed: 4, Skipped: 0"
  reason: Succeeded
  status: "True"
  type: Succeeded
The possible statuses and messages for the Succeeded condition are listed in the documentation:
TaskRun: https://tekton.dev/docs/pipelines/taskruns/#monitoring-execution-status
PipelineRun: https://tekton.dev/docs/pipelines/pipelineruns/#monitoring-execution-status
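Since there is only the one condition type, the state of a run has to be read from the `status` of the Succeeded condition: `Unknown` while the run is in progress, `True` on success, `False` on failure. A minimal sketch of that mapping, assuming the conditions list has the shape shown in the example above. The `succeeded_state` helper and the labels it returns are hypothetical:

```python
def succeeded_state(conditions: list[dict]) -> str:
    """Map the Succeeded condition of a TaskRun/PipelineRun to a simple label."""
    for cond in conditions:
        if cond.get("type") == "Succeeded":
            status = cond.get("status")
            if status == "True":
                return "succeeded"
            if status == "False":
                return "failed"
            return "running"  # status "Unknown": the run has not finished yet
    return "pending"  # no Succeeded condition has been reported yet


# The example condition from above.
example = [
    {
        "lastTransitionTime": "2020-05-04T02:19:14Z",
        "message": "Tasks Completed: 4, Skipped: 0",
        "reason": "Succeeded",
        "status": "True",
        "type": "Succeeded",
    }
]
print(succeeded_state(example))
```

An empty conditions list corresponds to the situation in the question, where the object exists but nothing has been reported yet.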
As a side note, there is an activity timeout available in the API. That timeout is not surfaced to the CLI options though. You could create a tkn feature request for that.

Kubernetes pod created through Airflow remains in running state

I've set up Airflow in a Kubernetes cluster. To run tasks, I'm using the KubernetesPodOperator.
When I run a task and take a look at kubectl get pods, I see a pod is created correctly and it also completes. However, when I look at Airflow, I see the state isn't updated and it says it's still in the running state.
[2019-01-27 12:43:56,580] {models.py:1595} INFO - Executing <Task(KubernetesPodOperator): xxx> on 2019-01-20T00:00:00+00:00
[2019-01-27 12:43:56,581] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run xxx xxx 2019-01-20T00:00:00+00:00 --job_id 15 --raw -sd DAGS_FOLDER/xxx.py --cfg_path /tmp/tmpxx39wldz']
[2019-01-27 12:45:21,603] {models.py:1355} INFO - Dependencies not met for <TaskInstance: xxx.xxx 2019-01-20T00:00:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-01-27 12:43:56.565328+00:00.
[2019-01-27 12:45:21,639] {models.py:1355} INFO - Dependencies not met for <TaskInstance: xxx.xxx 2019-01-20T00:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-01-27 12:45:21,641] {logging_mixin.py:95} INFO - [2019-01-27 12:45:21,641] {jobs.py:2614} INFO - Task is not able to be run
Is there anything specific I should do to return the pod's state back to Airflow? The KubernetesPodOperator is defined as follows:
do_something = KubernetesPodOperator(
    task_id='xxx',
    image='gcr.io/project/image',
    namespace='default',
    name='xxx',
    arguments=['dummy'],
    xcom_push=True,
    in_cluster=True,
    image_pull_policy='Always',
    trigger_rule='dummy',
    dag=dag,
)
Edit: It appears that the base container has completed, but airflow-xcom-sidecar is still running. Is there anything specific I should do to stop that one?
Hard to tell exactly without looking at your setup, but it looks like the pod is done and it's trying to do an xcom push to your main Airflow and isn't able to connect. I would check the logs for airflow-xcom-sidecar. Something like:
$ kubectl logs <airflow-job-pod> -c airflow-xcom-sidecar
You can also try running your KubernetesPodOperator with xcom_push=False:
do_something = KubernetesPodOperator(
    task_id='xxx',
    image='gcr.io/project/image',
    namespace='default',
    name='xxx',
    arguments=['dummy'],
    xcom_push=False,
    in_cluster=True,
    image_pull_policy='Always',
    trigger_rule='dummy',
    dag=dag,
)
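To confirm which container is keeping the pod alive, the container states can be inspected directly. This is a minimal sketch, assuming a dict shaped like the `status` section of `kubectl get pod -o json`. The sample data and the `running_containers` helper are hypothetical; the container names `base` and `airflow-xcom-sidecar` are taken from the question.

```python
def running_containers(status: dict) -> list[str]:
    """Return the names of containers whose state is still 'running'."""
    return [
        cs["name"]
        for cs in status.get("containerStatuses", [])
        if "running" in cs.get("state", {})
    ]


# Hypothetical sample: the main container has finished, the sidecar has not.
sample = {
    "phase": "Running",
    "containerStatuses": [
        {"name": "base", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
        {"name": "airflow-xcom-sidecar", "state": {"running": {"startedAt": "2019-01-27T12:44:00Z"}}},
    ],
}
still_running = running_containers(sample)
print(still_running)
```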

Codeship Pro on_fail across steps

Is the on_fail directive of a step run when a previous step has failed?
I'm using these steps:
- name: fail intentionally
  service: busybox
  command: false
- name: check if onfail is called
  service: busybox
  command: true
  on_fail:
  - command: echo reporting failure
Calling jet steps produces the following output:
(step: fail intentionally)
(image: busybox) (service: busybox) Image exists, using cached image
(step: fail intentionally) error ✗
(step: fail intentionally) container exited with a 1 code
My on_fail is not run.
Is that an issue with the jet utility, or would things behave the same in Codeship?
You have defined an on_fail contingency for the second step (a step that will not fail). If the on_fail had been set for the first step (which fails and stops the build), you would have seen the echoed statement.
This behavior would be consistent with a build running in CodeShip Pro.

Pod Completion without completing process

I have a cluster running some jobs, and one of the jobs runs a pod. That pod is marked as completed while it is still in the middle of its work. For example, when computing 1+3 it should display 4, but it stops at 1+3 and its status becomes Completed. I don't know what can cause a pod to complete without executing the whole code. Any help or thoughts would help a lot.
Detail:
I have a case now,
console.log("Opening in ECS "); // in one case the pod terminates successfully here
try {
  await funcy1(); // an async function
  console.log("opening in ECS end"); // in the second case the pod terminates successfully here
} catch (error) {
  throw error;
}
Now the pod completes at the indicated line. If there were an error, it should be thrown (and it would be logged), but I cannot see any log. The pod simply completes at the indicated line, which shouldn't be the case.
Some errors from the pod descriptions are:
State: Terminated
Reason: Error
Exit Code: 255
and
State: Terminated
Reason: Error
Exit Code: 137
and
State: Terminated
Reason: Completed
Exit Code: 0
So the issue turned out to be that no resources were specified for the pod. Posting this in case it helps someone.
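The exit codes listed above carry a hint of their own. By convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9 (SIGKILL). That is the signal the kernel sends when a process is killed for exceeding available memory, which would be consistent with a resources-related cause, although the post does not say which resource was involved. A minimal sketch of the decoding, assuming a POSIX system; the `describe_exit_code` helper is hypothetical:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a rough interpretation of a container exit code."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            # Convention: 128 + N means "terminated by signal N".
            name = signal.Signals(code - 128).name
            return f"terminated by signal {name}"
        except ValueError:
            pass  # code - 128 is not a known signal number on this system
    return f"application error (exit code {code})"


for code in (0, 137, 255):
    print(code, "->", describe_exit_code(code))
```

Exit code 255 does not map to a signal, so it is reported as an ordinary application error.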