Celery beat running tasks every minute even though it's set for every two hours - celery

I'm trying to use celery beat to run tasks daily at a specific time.
However, for testing purposes, I'm setting up two tasks to run every two hours. This is what my config looks like:
CELERYBEAT_SCHEDULE = {
    'daily-google-connect': {
        'task': 'app.engine.schedule_fetcher',
        'schedule': crontab(hour='*/2'),
        'args': (['G'])
    },
    'daily-facebook-connect': {
        'task': 'app.engine.schedule_fetcher',
        'schedule': crontab(hour='*/2'),
        'args': (['F'])
    }
}
This is how I run celery:
celery beat -A app.engine.celery --schedule=/tmp/celerybeat-schedule --pidfile=/tmp/celerybeat.pid -l info
Everything runs in Docker containers using docker-compose so I make sure that I re-build the app's image and restart the containers.
I even entered the running container and confirmed that the crontab setup is in the code. However, in my logs I see the task running every minute.
What else can I do to debug this?
I appreciate any help,
Thanks

Your crontab is configured to run "at every minute past every 2nd hour", because the minute field defaults to '*' when it is not given.
from celery.schedules import crontab
str(crontab(hour='*/2'))
'<crontab: * */2 * * * (m/h/d/dM/MY)>'
Ref: https://crontab.guru/#*_*/2_*_*_*
The correct crontab for "every two hours" is: 0 */2 * * *.
Ref: https://crontab.guru/every-2-hours
This should fix your issue:
CELERYBEAT_SCHEDULE = {
    'daily-google-connect': {
        'task': 'app.engine.schedule_fetcher',
        'schedule': crontab(minute='0', hour='*/2'),
        'args': (['G'])
    },
    'daily-facebook-connect': {
        'task': 'app.engine.schedule_fetcher',
        'schedule': crontab(minute='0', hour='*/2'),
        'args': (['F'])
    }
}
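To see why the missing minute field matters, here is a small self-contained sketch in plain Python. It is not Celery's own parser: expand_field and runs_per_day are hypothetical helpers that only handle '*', '*/n' and plain integers, which is enough to count how often each schedule fires per day.

```python
def expand_field(expr, lo, hi):
    """Expand one cron field ('*', '*/n' or an integer) into the set of matching values."""
    if expr == "*":
        return set(range(lo, hi + 1))
    if expr.startswith("*/"):
        step = int(expr[2:])
        return set(range(lo, hi + 1, step))
    return {int(expr)}


def runs_per_day(minute, hour):
    """Count the (hour, minute) combinations that match within one day."""
    minutes = expand_field(minute, 0, 59)
    hours = expand_field(hour, 0, 23)
    return len(minutes) * len(hours)


# minute='*' (the default) with hour='*/2': every minute of every 2nd hour
print(runs_per_day("*", "*/2"))  # 720

# minute='0' with hour='*/2': once at the top of every 2nd hour
print(runs_per_day("0", "*/2"))  # 12
```

With the minute left at '*', the task fires 60 times in each matching hour, which is what shows up in the logs as "every minute".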

Related

Dynamically change celery beat schedule params

I get the schedule values from a .env file, and sometimes the parameters in that file change.
Is it possible to change the schedule values of celery beat tasks that are already running?
My celery.py:
import os
from celery import Celery
from celery.schedules import crontab
from dotenv import load_dotenv

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproj.settings')
app = Celery('myproj')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
load_dotenv()

# Fall back to 60 seconds when a variable is missing, empty or zero.
orders_update_time = float(os.getenv("ORDERS_UPDATE_TIME") or 0)
if not orders_update_time:
    orders_update_time = 60.0
remains_send_time = float(os.getenv("REMAINS_SEND_TIME") or 0)
if not remains_send_time:
    remains_send_time = 60.0

app.conf.beat_schedule = {
    'wb_orders_autosaver': {
        'task': 'myapp.tasks.orders_autosave',
        'schedule': orders_update_time,
    },
    'wb_remains_autosender': {
        'task': 'myapp.tasks.remains_autosend',
        'schedule': remains_send_time,
    },
}
Yes, use django-celery-beat. It lets you save your schedule to the database, and you can use the Django admin UI to modify the schedule.
From Django's shell_plus, you can run the following commands to create your schedule:
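# shell_plus imports the models automatically; in a plain shell, import them first.
from django_celery_beat.models import CrontabSchedule, PeriodicTask, PeriodicTasks

schedule = CrontabSchedule(minute='0', hour='10')
schedule.save()
PeriodicTask.objects.create(
    crontab=schedule,
    task='myapp.tasks.orders_autosave',
    name='autosave orders',
)
schedule = CrontabSchedule(minute='15', hour='10')
schedule.save()
PeriodicTask.objects.create(
    crontab=schedule,
    task='myapp.tasks.remains_autosend',
    name='autosend remains',
)
# Tell beat that the schedule changed so that it reloads it.
PeriodicTasks.update_changed()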
Or you can use the UI in the Django admin panel:
1. Select "Add periodic task".
2. Enter the information about your task and select "Save".

airflow dag - task is immediately put into 'up_for_retry' state ('start_date' is 1 day ago)

I don't know whether I lack knowledge of the Airflow scheduler or whether this is a potential Airflow bug.
The situation is as follows:
My DAG's start date is set to "start_date": airflow.utils.dates.days_ago(1).
I uploaded the DAG to the folder that Airflow scans for DAGs.
I then turned the DAG on (it was 'off' by default).
The tasks in the pipeline immediately go into 'up_for_retry', and you cannot really see what was tried before.
Airflow version: 1.10.14, running on Kubernetes in Azure.
It uses the Celery executor with Redis.
The task instance details are listed below:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency | Reason
Task Instance State | Task is in the 'up_for_retry' state which is not a valid state for execution. The task must be cleared in order to be run.
Not In Retry Period | Task is not ready for retry yet but will be retried automatically. Current date is 2021-05-17T09:06:57.239015+00:00 and task will be retried at 2021-05-17T09:09:50.662150+00:00.
Am I missing something that would tell me whether this is a bug or expected behavior?
In addition, the DAG definition is below, as requested.
import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable

dag_args = {
    "owner": "our_project_team_name",
    "retries": 1,
    "email": ["ouremail_address_replaced_by_this_string"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}
# Implement cluster reuse on Databricks, pick from light, medium, heavy cluster type based on workloads
clusters = Variable.get("our_project_team_namejob_cluster_config", deserialize_json=True)
databricks_connection = "our_company_databricks"
adl_connection = "our_company_wasb"
pipeline_name = "process_our_data_from_boomi"
dag = DAG(dag_id=pipeline_name, default_args=dag_args, schedule_interval="0 3 * * *")
notebook_dir = "/Shared/our_data_name"
lib_path_sub = ""
lib_name_dev_plus_branch = ""
atlas_library = {
    "whl": f"dbfs:/python-wheels/atlas{lib_path_sub}/atlas_library-0{lib_name_dev_plus_branch}-py3-none-any.whl"
}
create_our_data_name_source_data_from_boomi_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_source_data_from_boomi",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}
create_our_data_name_standardized_table_from_source_xml_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_standardized_table_from_source_xml",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}
create_our_data_name_enriched_table_from_standardized_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_enriched",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}
layer_1_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_source",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_source_data_from_boomi_notebook_params,
    libraries=[atlas_library],
)
layer_2_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_standardized",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_standardized_table_from_source_xml_notebook_params,
    libraries=[
        {"maven": {"coordinates": "com.databricks:spark-xml_2.11:0.5.0"}},
        {"pypi": {"package": "inflection"}},
        atlas_library,
    ],
)
layer_3_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_enriched",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_enriched_table_from_standardized_notebook_params,
    libraries=[atlas_library],
)
layer_1_task >> layer_2_task >> layer_3_task
After getting some help from @AnandVidvat, who suggested an experiment with retries set to 0, and from a friend, who suggested changing the operator to either DummyOperator or PythonOperator, I can confirm that the issue has nothing to do with the Databricks operator or Airflow version 1.10.x, i.e. it is not an Airflow bug.
In summary: when a DAG has a meaningful operator, my setup fails on the first execution without any task log, and the retry then works OK (the task log hides the fact that the task was retried, because the failure produced no logs).
To reduce the total run time, the workaround until the real cause is found is to set retry_delay to 10 seconds (the default is 5 minutes, which makes the DAG run unnecessarily long).
The next step is to figure out what causes this first failure by checking the logs on the scheduler or worker pods in our current setup (Azure K8s, PostgreSQL, Redis, Celery executor).
P.S. I used the DAG below for testing and to reach this conclusion.
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import time
from pprint import pprint

dag_args = {
    "owner": "min_test",
    "retries": 1,
    "email": ["c243d70b.domain.onmicrosoft.com@emea.teams.ms"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}
pipeline_name = "min_test_debug_airflow_baseline_PythonOperator_1_retry"
dag = DAG(
    dag_id=pipeline_name,
    default_args=dag_args,
    schedule_interval="0 3 * * *",
    tags=["min_test_airflow"],
)


def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return "Whatever you return gets printed in the logs"


run_this = PythonOperator(
    task_id="print_the_context",
    provide_context=True,
    python_callable=print_context,
    dag=dag,
)
# Generate 3 sleeping tasks, sleeping 0.0, 0.1 and 0.2 seconds respectively
for i in range(3):
    task = PythonOperator(
        task_id="sleep_for_" + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={"random_base": float(i) / 10},
        dag=dag,
    )
    task.set_upstream(run_this)
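The retry_delay workaround described above could look roughly like this in the default arguments. This is only a sketch: retry_delay is a standard operator argument that takes a timedelta, and the other values are copied from the test DAG; the dictionary would be passed as default_args to the DAG exactly as before.

```python
from datetime import timedelta

# Sketch of the workaround: shorten the wait between the silent first
# failure and the retry. Without retry_delay, Airflow waits 5 minutes.
dag_args = {
    "owner": "min_test",
    "retries": 1,
    "retry_delay": timedelta(seconds=10),
    "depends_on_past": False,
}

print(dag_args["retry_delay"].total_seconds())  # 10.0
```

This only shortens the symptom; the first attempt still fails, so the scheduler and worker logs remain the place to look for the cause.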

using supervisord to run lsyncd script

I'm trying to run my lsyncd script with supervisord so that it is always running.
I've written this conf for my supervisor:
[program:autostart_lsyncd]
command=bash -c "lsyncd /home/sync/lsyncd_script.lua"
autostart=true
autorestart=unexpected
numprocs=1
startsecs = 0
stderr_logfile=/var/log/autostart_sync.err.log
stdout_logfile=/var/log/autostart_sync.out.log
The script runs OK at startup, but it always exits:
2018-04-09 09:48:49,638 INFO success: autostart_lsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-04-09 09:48:49,639 INFO exited: autostart_lsyncd (exit status 0; expected)
I can't tell whether this is the correct way to keep an lsyncd script alive or not.
Any suggestions?
I'm using this supervisord configuration, in the file /etc/supervisor/conf.d/lsyncd.conf. The -nodaemon flag keeps lsyncd in the foreground so that supervisord can monitor it:
[program:lsyncd]
command=/usr/bin/lsyncd -nodaemon /etc/lsyncd/lsyncd.conf.lua
autostart=true
autorestart=unexpected
startretries=3
And this configuration for lsyncd (/etc/lsyncd/lsyncd.conf.lua):
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status"
}
sync {
    default.rsync,
    source="/var/www/html/sites/default/files",
    target="root@cdn:/var/www/html/sites/default/files",
    exclude = {"*.php", "*.po", "\.ht*"},
    rsync = {
        archive = false,
        acls = false,
        compress = true,
        links = false,
        owner = false,
        perms = false,
        verbose = true,
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
I also had to configure SSH keys and install rsync on the servers.

Can celery's beat tasks execute at timed intervals?

This is the beat tasks setting:
celery_app.conf.update(
    CELERYBEAT_SCHEDULE = {
        'taskA': {
            'task': 'crawlerapp.tasks.manual_crawler_update',
            'schedule': timedelta(seconds=3600),
        },
        'taskB': {
            'task': 'crawlerapp.tasks.auto_crawler_update_day',
            'schedule': timedelta(seconds=3600),
        },
        'taskC': {
            'task': 'crawlerapp.tasks.auto_crawler_update_hour',
            'schedule': timedelta(seconds=3600),
        },
    })
Normally taskA, taskB and taskC execute at the same time as beat tasks after I run celery -A myproj beat. But now I want taskA to execute first, then taskB some time later, and taskC last. After 3600 seconds they should execute again in the same order, and so on. Is it possible?
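Yeah, it's possible. Create a chain of all three tasks, wrap it in a task of its own (beat schedules tasks by name), and schedule that task.
In your tasks.py file:
from celery import chain, shared_task

@shared_task
def chained_task():
    # .si() builds immutable signatures, so no task receives the previous task's result.
    chain(
        manual_crawler_update.si(),
        auto_crawler_update_day.si(),
        auto_crawler_update_hour.si(),
    ).apply_async()
Then schedule the chained_task:
celery_app.conf.update(
    CELERYBEAT_SCHEDULE = {
        'chained_task': {
            'task': 'crawlerapp.tasks.chained_task',
            'schedule': timedelta(seconds=3600),
        },
    })
This way, your tasks will execute in order once every 3600 seconds.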

want to find when the job will be started in celery

I am new to Celery. I have some configuration in celeryconfig.py as follows:
from datetime import timedelta

BROKER_URL='redis://localhost:6379/0'
CELERY_RESULT_BACKEND="redis"
CELERY_REDIS_HOST="localhost"
CELERY_REDIS_PORT=6379
CELERY_REDIS_DB=0
# The setting is named CELERY_IMPORTS and expects a sequence of module names.
CELERY_IMPORTS=("mail",)
CELERYBEAT_SCHEDULE={
    'runs-every-30-seconds': {
        'task': 'mail.mail',
        'schedule': timedelta(seconds=30),
    },
}
I have scheduled the job to run periodically every 30 seconds. Now I want the jobs to start on 29 Aug at 4:00 PM. How should I configure this?
You should use a crontab schedule instead of timedelta. The Celery documentation discusses this specifically and provides some useful examples. See "Crontab schedules" there.
Here is an example from Celery:
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    # Executes every Monday morning at 7:30 A.M
    'every-monday-morning': {
        'task': 'tasks.add',
        'schedule': crontab(hour=7, minute=30, day_of_week=1),
        'args': (16, 16),
    },
}
To make this work for your case, you will also need to specify the crontab day_of_month and month_of_year parameters.
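As a rough sketch of what that could mean for 29 August at 4:00 PM: the keys below are the keyword argument names of celery.schedules.crontab, while the matches helper is hypothetical and only illustrates, in plain Python, which moments such a schedule selects. Note the assumption that a yearly run is acceptable: a crontab recurs every year on that date, and it does not express "start on this date and then run every 30 seconds".

```python
from datetime import datetime

# Values one could pass as crontab(**fields) for 29 August, 16:00.
fields = {"minute": 0, "hour": 16, "day_of_month": 29, "month_of_year": 8}


def matches(dt, f):
    """Return True when dt falls on the minute selected by the given fields."""
    return (dt.minute, dt.hour, dt.day, dt.month) == (
        f["minute"],
        f["hour"],
        f["day_of_month"],
        f["month_of_year"],
    )


print(matches(datetime(2024, 8, 29, 16, 0), fields))  # True
print(matches(datetime(2024, 8, 29, 16, 1), fields))  # False
print(matches(datetime(2025, 8, 29, 16, 0), fields))  # True
```

The last line shows the yearly recurrence: the same fields match again one year later.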