Kubernetes pod created through Airflow remains in running state

I've set up Airflow in a Kubernetes cluster. To run tasks, I'm using the KubernetesPodOperator.
When I run a task and take a look at kubectl get pods, I see a pod is created correctly and it also completes. However, when I look at Airflow, I see the state isn't updated and it says it's still in the running state.
[2019-01-27 12:43:56,580] {models.py:1595} INFO - Executing <Task(KubernetesPodOperator): xxx> on 2019-01-20T00:00:00+00:00
[2019-01-27 12:43:56,581] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run xxx xxx 2019-01-20T00:00:00+00:00 --job_id 15 --raw -sd DAGS_FOLDER/xxx.py --cfg_path /tmp/tmpxx39wldz']
[2019-01-27 12:45:21,603] {models.py:1355} INFO - Dependencies not met for <TaskInstance: xxx.xxx 2019-01-20T00:00:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-01-27 12:43:56.565328+00:00.
[2019-01-27 12:45:21,639] {models.py:1355} INFO - Dependencies not met for <TaskInstance: xxx.xxx 2019-01-20T00:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-01-27 12:45:21,641] {logging_mixin.py:95} INFO - [2019-01-27 12:45:21,641] {jobs.py:2614} INFO - Task is not able to be run
Is there anything specific I should do to return the pod's state back to Airflow? The KubernetesPodOperator is defined as follows:
do_something = KubernetesPodOperator(
    task_id='xxx',
    image='gcr.io/project/image',
    namespace='default',
    name='xxx',
    arguments=['dummy'],
    xcom_push=True,
    in_cluster=True,
    image_pull_policy='Always',
    trigger_rule='dummy',
    dag=dag,
)
Edit: It appears that the base container has completed, but airflow-xcom-sidecar is still running. Is there anything specific I should do to stop that one?

Hard to tell exactly without looking at your setup, but it looks like the pod is done and it's trying to do an XCom push to your main Airflow instance but isn't able to connect. I would check the logs of airflow-xcom-sidecar. Something like:
$ kubectl logs <airflow-job-pod> -c airflow-xcom-sidecar
You can also try running your KubernetesPodOperator with xcom_push=False:
do_something = KubernetesPodOperator(
    task_id='xxx',
    image='gcr.io/project/image',
    namespace='default',
    name='xxx',
    arguments=['dummy'],
    xcom_push=False,
    in_cluster=True,
    image_pull_policy='Always',
    trigger_rule='dummy',
    dag=dag,
)
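For background on that sidecar: with xcom_push=True the operator injects the airflow-xcom-sidecar container, which keeps the pod alive until Airflow has read the result file, and it expects the main container to write the value to push as JSON to /airflow/xcom/return.json. A minimal sketch of what the image's final step might do (the payload here is purely hypothetical):

# Minimal sketch of the contract used by xcom_push=True: the main container
# writes its result as JSON to /airflow/xcom/return.json before exiting;
# the airflow-xcom-sidecar then holds the pod open until Airflow extracts it.
import json

result = {"rows_processed": 42}  # hypothetical payload

with open("/airflow/xcom/return.json", "w") as f:
    json.dump(result, f)

If the image never writes that file, or Airflow cannot reach the pod to read it, the sidecar (and therefore the pod) keeps running, which would match what you're seeing.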

Related

Filebeat automatically stops without kill

I use Filebeat with ELK. I started it with the nohup command:
nohup ./filebeat -e -c filebeat.yml -d "publish" > filebeat.log &
The application stopped automatically after one day. The close_inactive parameter does not work. Is there any configuration that I missed for this problem?
2020-10-22T09:55:36.814+0100 INFO crawler/crawler.go:165 Crawler stopped
2020-10-22T09:55:36.815+0100 INFO registrar/registrar.go:367 Stopping Registrar
2020-10-22T09:55:36.815+0100 INFO registrar/registrar.go:293 Ending Registrar
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10540,"time":{"ms":10547}},"total":{"ticks":68190,"time":{"ms":68203},"value":68190},"user":{"ticks":57650,"time":{"ms":57656}}},"handles":{"limit":{"hard":16000,"soft":16000},"open":10},"info":{"ephemeral_id":"b57f1c4d-7a80-4f1f-aaba-5ab9ee057757","uptime":{"ms":7119571}},"memstats":{"gc_next":22377264,"memory_alloc":11462592,"memory_total":18240359416,"rss":50831360},"runtime":{"goroutines":21}},"filebeat":{"events":{"added":528063,"done":528063},"harvester":{"closed":77,"open_files":0,"running":0,"started":77},"input":{"log":{"files":{"truncated":38}}}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"events":{"acked":527884,"batches":4732,"failed":51426,"total":579310},"read":{"bytes":32364,"errors":4},"type":"logstash","write":{"bytes":180629879,"errors":19}},"pipeline":{"clients":0,"events":{"active":0,"filtered":179,"published":527884,"retry":99719,"total":528063},"queue":{"acked":527884}}},"registrar":{"states":{"cleanup":8,"current":38,"update":528063},"writes":{"success":4356,"total":4356}},"system":{"cpu":{"cores":8},"load":{"1":0.66,"15":0.52,"5":0.56,"norm":{"1":0.0825,"15":0.065,"5":0.07}}}}}}
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:154 Uptime: 1h58m39.572210325s
2020-10-22T09:55:36.820+0100 INFO [monitoring] log/log.go:131 Stopping metrics logging.
2020-10-22T09:55:36.820+0100 INFO instance/beat.go:432 filebeat stopped.
What is the content of "filebeat.yml"? It can stop, for example, if you didn't define any paths.
Also, you might want to change the logging level to get more information as to what happened:
logging.level: debug
Stop the Filebeat service and run Filebeat in debug mode from the command line, using the command below from the Filebeat home directory, to check for any issues in your configuration.
filebeat -e -c filebeat.yml -d "*"

Kubernetes pod marked as `Completed` despite the exit code `255`

Situation:
I've got a CronJob that often fails (this is expected at the moment). Because the container performing the job has a side-car, the dependencies between the containers are expressed through bash scripts and a shared emptyDir mount at the /etc/liveness folder:
spec:
  containers:
  - args:
    - -c
    - set -x;
      ...
      ./process; # execute the main process
      rc=$?;
      rm /etc/liveness; # clean-up
      exit $rc;
    command:
    - /bin/bash
Problem:
In the scenarios where the job fails, I see the following in the logs:
+ rc=255
+ rm /etc/liveness
+ exit 255
With restartPolicy set to Never, the failed pod enters the Completed status, which is misleading:
scheduler-1594015200-wl9xc 0/2 Completed 0 24m
According to the official docs,
A Job creates one or more Pods and ensures that a specified number of
them successfully terminate.
And a container enters the terminated state when
it has successfully completed execution or when it has failed for some reason.
So if you set restartPolicy to Never, this is what will happen.
A Pod's status field is a PodStatus object, which has a phase field.
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
Status and phase are not the same. So I learned that what happens above is that my pods end up with the status Completed and the phase Failed.
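To see this distinction for yourself, you can read both the pod phase and the per-container termination details with the official Kubernetes Python client; the sketch below is illustrative, and the pod name and namespace are placeholders.

# Minimal sketch: compare the pod phase with per-container exit codes
# using the official Kubernetes Python client. Pod name and namespace
# below are placeholders for your own job pod.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="scheduler-1594015200-wl9xc", namespace="default")
print("phase:", pod.status.phase)  # can be Failed even though kubectl shows Completed
for cs in pod.status.container_statuses or []:
    term = cs.state.terminated
    if term:
        print(cs.name, "exit code:", term.exit_code, "reason:", term.reason)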

Airflow- dag_id could not be found issue when using kubernetes executor

I am using the Airflow stable Helm chart with the Kubernetes Executor. A new pod gets scheduled for the DAG, but it fails with a "dag_id could not be found" error. I am using git-sync to fetch the DAGs. Below are the error and the Kubernetes config values. Can someone please help me resolve this issue?
Error:
[2020-07-01 23:18:36,939] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-01 23:18:36,940] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/dags/etl/sampledag_dag.py
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 37, in <module>
args.func(args)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 523, in run
dag = get_dag(args)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 149, in get_dag
'parse.'.format(args.dag_id))
airflow.exceptions.AirflowException: dag_id could not be found: sampledag . Either the dag did not exist or it failed to parse.
Config:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: false
AIRFLOW__KUBERNETES__GIT_REPO: git@git.com/dags.git
AIRFLOW__KUBERNETES__GIT_BRANCH: master
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags
AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME: git-secret
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: airflow-repo
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: tag
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
sampledag
import logging
import datetime
from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
import os

args = {
    'owner': 'airflow'
}

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

try:
    print("Entered try block")
    with models.DAG(
            dag_id='sampledag',
            schedule_interval=datetime.timedelta(days=1),
            start_date=YESTERDAY) as dag:
        print("Initialized dag")
        kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
            # The ID specified for the task.
            task_id='trigger-task',
            # Name of task you want to run, used to generate Pod ID.
            name='trigger-name',
            namespace='scheduler',
            in_cluster=True,
            cmds=["./docker-run.sh"],
            is_delete_operator_pod=False,
            image='imagerepo:latest',
            image_pull_policy='Always',
            dag=dag)
        print("done")
except Exception as e:
    print(str(e))
    logging.error("Error at {}, error={}".format(__file__, str(e)))
    raise
I had the same issue. I solved it by adding the following to my config:
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: repo/
What was happening is that the init container downloads your DAGs into [AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT]/[AIRFLOW__KUBERNETES__GIT_SYNC_DEST], and AIRFLOW__KUBERNETES__GIT_SYNC_DEST defaults to repo (https://airflow.apache.org/docs/stable/configurations-ref.html#git-sync-dest).
I am guessing that the problem stems from a mismatch in your setup: the worker is parsing /opt/airflow/dags/dags/etl/sampledag_dag.py while AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT is /dags.
I'd double-check that these are what you want and what you expect.
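In other words, with the default git-sync settings the cloned repository lands one directory below the mount point, so the worker needs the extra repo/ subpath. A rough sketch of how the pieces compose (the values mirror the config above and assume the default GIT_SYNC_DEST of repo):

# Rough sketch of how the effective DAG path is composed under git-sync.
# Values mirror the config above; the "repo" default is taken from the
# git-sync-dest setting referenced earlier.
import os

mount_point = "/dags"          # AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT
git_sync_dest = "repo"         # AIRFLOW__KUBERNETES__GIT_SYNC_DEST (default)

repo_root = os.path.join(mount_point, git_sync_dest)
print(repo_root)                                        # /dags/repo
# A DAG file inside the repo therefore lives under /dags/repo/..., which is
# why AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: repo/ points the worker there.
print(os.path.join(repo_root, "etl/sampledag_dag.py"))  # /dags/repo/etl/sampledag_dag.py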
I was facing the same issue while trying to use the Kubernetes Executor with the stable Airflow Helm chart. In my case, I was able to resolve it by changing
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000" to AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER: "65533" in the env section of the Helm chart.
The same value is mentioned in this link.
I came to this conclusion because the init container (git-sync), which runs before the temporary worker pod comes up, was not able to clone/sync the git DAGs to the worker pod. In my case, there was a permissions error (even though the kube secret for the SSH clone was passed correctly).
Getting the same issue, I solved it with the suggestion from @gtrip to set the UID of the git-sync run user to 65533.
I would add the following debug hints:
the git-sync init container returns no error even if it fails to fetch the DAGs
Kubernetes debugging information for init containers
kubectl get pods -n [NAMESPACE]
kubectl logs -n [NAMESPACE] [POD_ID] -c git-sync

How to set kube-scheduler print log to file

The Kubernetes version is 1.2.
I want to watch the scheduler's logs. How do I set kube-scheduler to print its logs to a file?
The kube-scheduler's configuration is at this path: /etc/kubernetes/scheduler.
And the global configuration is at this path: /etc/kubernetes/config.
So we can see these notes:
# logging to stderr means we get it in the systemd journal
KUBE_LOGTOSTDERR="--logtostderr=true"
# journal message level, 0 is debug
KUBE_LOG_LEVEL="--v=0"
You can tail the service's output (if it is running under systemd): journalctl -u kube-scheduler -f
Or, if it runs as a container, find the container ID of the scheduler and tail it with Docker: docker logs -f <container-id>

Django celery db scheduler not working after version upgrade

I'm upgrading celery and django-celery from:
celery==2.4.5
django-celery==2.3.3
To:
celery==3.0.24
django-celery==3.0.23
After the pip upgrade I ran the migrations and all was well.
I then restarted celery worker and celery beat with the below commands:
django-admin.py celery worker --loglevel=DEBUG --config=portal.settings.development -E
django-admin.py celery beat --loglevel=DEBUG --config=portal.settings.development
The celery beat initial output shows it knows about the tasks:
Configuration ->
. broker -> amqp://zonza:**@localhost:5672/zonza
. loader -> djcelery.loaders.DjangoLoader
. scheduler -> djcelery.schedulers.DatabaseScheduler
. logfile -> [stderr]#%DEBUG
. maxinterval -> now (0s)
[INFO] Wed, 18 Jun 2014 13:31:18 +0000 celery.beat 2184 140177823078144 beat: Starting...
[2014-06-18 13:31:18,332: DEBUG/MainProcess] DatabaseScheduler: intial read
[2014-06-18 13:31:18,332: INFO/MainProcess] Writing entries...
[2014-06-18 13:31:18,333: DEBUG/MainProcess] DatabaseScheduler: Fetching database schedule
[2014-06-18 13:31:18,366: DEBUG/MainProcess] Current schedule:
<ModelEntry: SOON_EXPIRY_ALERT SOON_EXPIRY_ALERT(*[], **{}) {4}>
<ModelEntry: celery.backend_cleanup celery.backend_cleanup(*[], **{}) {4}>
<ModelEntry: REFRESH_DB_CACHE REFRESH_DB_CACHE(*[], **{}) {4}>
Now none of my Periodic Tasks run :/ Any ideas?
Edit: if I change the scheduler setting to the default celery.beat.PersistentScheduler, the tasks work. But I think we need to use the djcelery one in this project for a number of reasons.
Edit 2: after about 40 minutes of nothing, the tasks now start running properly. This obviously is not ideal, and I have no idea why.
It should be in the changelogs somewhere, but Celery changed from storing dates in local time to storing them in UTC.
The database scheduler is not able to automatically convert to the new format, so you need to reset the last_run_at fields for every periodic task.
Something like:
UPDATE djcelery_periodictask SET last_run_at=NULL
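If you prefer not to run raw SQL, the same reset can be done through the django-celery model from a Django shell; a small sketch, assuming the standard djcelery installation:

# Equivalent reset of last_run_at via the django-celery ORM instead of raw SQL.
# Run this from `django-admin.py shell` (with your settings module) so the
# Django environment is loaded.
from djcelery.models import PeriodicTask

PeriodicTask.objects.update(last_run_at=None)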