In Celery tasking, is the database scheduler sufficient, or do I need to specify the run_every property as well?

I have a periodic task that is supposed to run once a day, but currently it runs twice a day, and I'm not sure why. The second run occurs milliseconds after the intended one.
My periodic task has the run_every property specified:
run_every = crontab(hour=1, minute=1)
but in my settings file, the database scheduler is specified:
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
Furthermore, the database contains tables with the task names and their crontab schedules. For example, there is a table called djcelery_crontabschedule, and it also specifies that the same task should run at 1:01 am.
Could this be causing my task to run twice every day?

I never use run_every... Here is an example from the beatconfig.py file that I use:
beat_schedule = {
    'company-data-report': {
        'task': 'report.company_data_report',
        'schedule': crontab(minute=0, hour=7),
        'args': [],
        'options': {'expires': 120 * 60}
    },
    # etc.
}
This particular task runs every day at the specified time. We use the default Celery scheduler, not a third-party implementation.
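To make that concrete, here is a minimal, self-contained sketch of such a setup (the app name 'proj' and the broker URL are placeholders, not taken from the original): the schedule lives in exactly one place, the app configuration, and beat runs with its default scheduler, so nothing in the database can register the same task a second time.

# beatconfig.py - single source of truth for the schedule
from celery import Celery
from celery.schedules import crontab

app = Celery('proj', broker='redis://localhost:6379/0')  # placeholder broker URL

app.conf.beat_schedule = {
    'company-data-report': {
        'task': 'report.company_data_report',
        'schedule': crontab(minute=0, hour=7),   # every day at 07:00
        'args': [],
        'options': {'expires': 120 * 60},        # drop the task if not picked up within 2 hours
    },
}

Beat is then started with the default scheduler (celery -A beatconfig beat) rather than with --scheduler djcelery.schedulers.DatabaseScheduler. In the asker's situation, having the same crontab registered both via run_every and via the djcelery_crontabschedule row can plausibly produce the observed double run; either way, keeping a single source for the schedule removes the ambiguity.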

Generating a UUID and using it across an Airflow DAG

I'm trying to create a dynamic Airflow DAG that has the following 2 tasks:
Task 1: Creates files with a generated UUID as part of their name
Task 2: Runs a check on those files
So I define a variable FILE_UUID and set it as follows: str(uuid.uuid4()). I also created a constant file name:
MY_FILE = '{file_uuid}_file.csv'.format(file_uuid=FILE_UUID)
Then Task 1 is a BashOperator that gets MY_FILE as part of its command, and it creates a file successfully; I can see that the generated files include a specific UUID in the name.
Task 2 is a PythonOperator that gets MY_FILE as an op_args entry, but it fails: it can't access the file. The logs show that it tries to access files with a different UUID.
Why is my "constant" being re-evaluated separately for every task? Is there any way to prevent that from happening?
I'm using Airflow 1.10, my executor is LocalExecutor.
I tried setting the constant outside the "with DAG" block and inside it, and also tried working with macros, but then the PythonOperator just uses the macro strings literally rather than the values they hold.
You have to keep in mind that the DAG definition file is a sort of "configuration script", not an actual executable that runs your DAGs. The tasks are executed in completely different environments, most of the time not even on the same machine. Think of it like a configuration XML which sets up your tasks, which are then built and executed on some other machine in the cloud - but it's Python instead of XML.
In conclusion: your DAG code is Python, but it is not what runs at task execution time. So if you generate a random uuid there, it will be evaluated at an unknown time and multiple times - for each task, on different machines.
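To make that concrete, here are the question's own two lines again, with comments only (no new behaviour); the assignment runs at parse time, once per parse, in whichever process happens to parse the file:

import uuid

# Top-level DAG-file code is executed every time the file is parsed: by the
# scheduler, by the webserver and by each worker before it runs a task.
# Every parse produces a fresh value, so different tasks see different UUIDs.
FILE_UUID = str(uuid.uuid4())
MY_FILE = '{file_uuid}_file.csv'.format(file_uuid=FILE_UUID)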
To have it consistent across tasks you need to find another way, for example:
use XCom: the first task generates the uuid and writes it to XCom for all downstream tasks to use.
anchor your uuid with something constant across your pipeline - a source, a date, or whatever (e.g. if it's a daily task, you can build your uuid from date parts, mixing in some dag/task specifics - whatever makes your uuid the same for all tasks but unique for each day); a sketch of this approach follows the example DAG below.
Example DAG using the first method (XComs):
from datetime import datetime
import uuid

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id='global_uuid',
         schedule_interval='@daily',
         start_date=...) as dag:

    # The return value of the callable is pushed to XCom automatically.
    generate_uuid = PythonOperator(
        task_id='generate_uuid',
        python_callable=lambda: str(uuid.uuid4())
    )

    print_uuid1 = BashOperator(
        task_id='print1',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )

    print_uuid2 = BashOperator(
        task_id='print2',
        bash_command='echo {{ task_instance.xcom_pull("generate_uuid") }}'
    )

    generate_uuid >> print_uuid1 >> print_uuid2
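And a minimal sketch of the second method, assuming a daily schedule: derive the UUID deterministically from values that are constant within one run, here the dag id and Airflow's execution-date string ds (the helper name run_uuid is made up for this example):

import uuid

def run_uuid(dag_id, ds):
    # ds is the templated execution-date string, e.g. '2019-01-01'.
    # uuid5 is deterministic: the same dag_id and ds always yield the same
    # UUID, while a different day yields a different one.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, '{}-{}'.format(dag_id, ds)))

Each task can call this helper (from a PythonOperator callable or a templated field) with its own dag_id and ds and arrive at the same file name, with no XCom traffic at all.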

bitbake do_image dependency not cached

I have a task do_image_custom that has a dependency on task do_image_ext4.
That task (do_image_ext4) generates an image file with DATETIME in its name.
The first time I build my image, no errors.
dependency_DATETIME.rootfs.ext4 is generated and used by its dependents.
Then I make a change to the task that consumes the ext4 file, which has to reference its dependency by name, DATETIME.rootfs.ext4.
When I build a second time (without cleaning), I get an error that do_image_custom cannot find newer_datetime.rootfs.ext4.
I check the IMGDEPLOYDIR and sure enough, that file doesn't exist and the do_image_ext4 task still has the first timestamp.
My question is, what am I doing wrong here in do_image_custom such that it re-evaluates DATETIME every time it is run without checking with (perhaps) the sstate cache?
The problem was that my custom task (do_image_custom) depended on the output of a prior task, and that prior task generates an ext4 image with a timestamp in the name.
do_image_custom re-evaluated DATETIME on every run, even though the dependency (the ext4 file with the earlier DATETIME) did not, and therefore was not rebuilt (correctly so, because the basehash for the dependency task was unchanged). Hence when do_image_custom executed, it referenced a file that did not exist.
The solution (in front of me all along) was to modify my custom task (do_image_custom) to refer to a symlink that is generated in the same step as the ext4 image but has no DATETIME in its name, which makes do_image_custom invariant to any changes, or none, in the step it depends on.

Spring batch, handle failed files again, skip successfully handled

What is the correct way to organize file handling?
I have a folder for new files (NEW), a folder for old files (OLD), and a folder for failed files (ERR). A new file is put in NEW; if handling succeeds the file goes to OLD, and if handling fails the file goes to ERR. Then we take the failed file, correct it, and put it back in NEW; if all is ok it goes to OLD, if it fails again it goes back to ERR. And so on.
I have a job with the constant name "fileHandlingJob"; the job has some steps: "extract", "handling", "utilize", and I have the job parameters "filePath" and "fileName".
Thanks!
If the uniqueness criterion for a file is its name, then you are on the right track.
If a job ended in the FAILED state (ERR folder), you can retrigger it with the same set of parameters. If it COMPLETED, you can't run it again; Spring Batch will complain.
You can ensure this behaviour by using the unique file name as an identifying job parameter, so no other job instance can be started with the same file name; Spring Batch will simply prevent it.
The second parameter, filePath, can be an additional non-identifying parameter.
JobParametersBuilder jobParametersBuilder = new JobParametersBuilder()
    .addString("fileName", "myfile.xml", true)
    .addString("filePath", "C:\\new\\myfile.xml", false);
The true/false flag here marks whether the parameter is identifying (part of the job instance's identity) or not.

Stopping SpringBatch jobs started from the command line

Spring Batch jobs can be started from the command line by telling the JVM to run CommandLineJobRunner. According to the JavaDoc, running the same command with the added parameter -stop will stop the Job:
The arguments to this class can be provided on the command line (separated by spaces), or through stdin (separated by new line). They are as follows:
jobPath jobIdentifier (jobParameters)*
The command line options are as follows:
jobPath: the xml application context containing a Job
-restart: (optional) to restart the last failed execution
-stop: (optional) to stop a running execution
-abandon: (optional) to abandon a stopped execution
-next: (optional) to start the next in a sequence according to the JobParametersIncrementer in the Job
jobIdentifier: the name of the job or the id of a job execution (for -stop, -abandon or -restart).
jobParameters: 0 to many parameters that will be used to launch a job, specified in the form of key=value pairs.
However, on the JavaDoc for the main() method the -stop parameter is not specified. Looking through the code on docjar.com I can't see any use of the -stop parameter where I would expect it to be.
I suspect that it is possible to stop a batch that has been started from the command line, but only if the batch being run is backed by a non-transient jobRepository? If running a batch on the command line that only stores its data in HSQL (i.e. in memory), is there no way to stop the job other than Ctrl-C etc.?
The stop command is implemented; see the source for CommandLineJobRunner, around line 300:
if (opts.contains("-stop")) {
    List<JobExecution> jobExecutions = getRunningJobExecutions(jobIdentifier);
    if (jobExecutions == null) {
        throw new JobExecutionNotRunningException("No running execution found for job=" + jobIdentifier);
    }
    for (JobExecution jobExecution : jobExecutions) {
        jobExecution.setStatus(BatchStatus.STOPPING);
        jobRepository.update(jobExecution);
    }
    return exitCodeMapper.intValue(ExitStatus.COMPLETED.getExitCode());
}
The stop switch will work, but it will only stop the job after the currently executing step completes. It won't kill the job immediately.

Inherit roles from parent tasks in Capistrano callbacks

I have several tasks which all must check that the machines serving the relevant roles have a certain file with certain contents. The logic reasonably belongs in a prerequisite, or a callback.
task :t1, :roles => [:r1] do
  ...
end
task :t2, :roles => [:r2, :r3] do
  ...
end
before <what?> do
  # must only run on :r1 when triggered by t1,
  # and only on :r2 and :r3 when triggered by t2!
  <ensure the role given to the parent task has a given file>
end
How do we do that in Capistrano?
It turns out that a before callback can invoke a regular Ruby method (a plain def), in which case the method runs for the roles of the parent task. If, however, you call another task there, and that task has no roles of its own, all roles will be used to run it. The real question is where the dependencies across tasks should live...