When using the official Celery S3 results backend, how do I conveniently visualise the logs (state, input, and output) of my tasks?
Here are some example actions I'd like to be easy:
filter over failed tasks that happened over the last day
check the input argument of a failed task to attempt to reproduce it.
calculate success rate over a period of time?
has a periodic task run?
I previously used django-celery-results and there is an example screenshot of the interface I get in the admin.
I am struggling to optimize my Data Factory pipeline to minimize the time spent spinning up compute for data flows.
My understanding is that if we set up a runtime with a TTL of, say, 15 minutes, then all subsequent flows executed in sequence after it should see very short compute acquisition times. But does this also hold true when switching from one pipeline to another? In the image below, would flow 3 benefit from the runtime already spun up by flow 1? I ask because I see very sporadic behavior.
Pipeline example
If you are using the same Azure IR inside the same factory, yes. However, the activities must be executed sequentially; otherwise ADF will spin up another pool for you. That's because parallel job executions are not supported on Databricks job clusters. I describe the techniques in this video and in this document.
We are considering using Airflow to replace our current custom rq-based workflow, but I am unsure of the best way to design it, or if it even makes sense to use Airflow.
The use case is:
We get a data upload from a user.
Given the data types received, we run zero or more jobs.
Each job runs if a certain combination of datatypes was received. It runs for that user over a time frame determined from the data received.
The job reads data from a database and writes results to a database.
Further jobs are potentially triggered as a result of these jobs.
e.g.
After a data upload we put an item on a queue:
upload:
  user: 'a'
  data:
    - type: datatype1
      start: 1
      end: 3
    - type: datatype2
      start: 2
      end: 3
And this would trigger:
job1, user 'a', start: 1, end: 3
job2, user 'a', start: 2, end: 3
and then maybe job1 would have some clean up job that runs after it.
(Also, it would be good to be able to restrict jobs so that they only run if no other jobs are running for the same user.)
Approaches I have considered:
1.
Trigger a DAG when data upload arrives on message queue.
Then this DAG determines which dependent jobs to run and passes the user and the time range as arguments (or via XCom).
2.
Trigger a DAG when data upload arrives on message queue.
Then this DAG dynamically creates DAGs for the jobs based on the datatypes received, templating in the user and timeframe.
So you get a dynamic DAG for each user, job, and timerange combination (a rough sketch of what I mean follows below).
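For illustration, here is a rough sketch of what I imagine for approach 2 (Airflow 2-style imports; the make_job_dag factory, the job/user/timerange values, and the print-only callable are just placeholders, and the globals() registration is the usual dynamic-DAG pattern):

```python
# Hypothetical sketch: generate one DAG per (job, user, timerange) combination.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def make_job_dag(job_name, user, start, end):
    dag_id = f"{job_name}_{user}_{start}_{end}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        PythonOperator(
            task_id="run_job",
            python_callable=lambda: print(f"{job_name} for {user}: {start}-{end}"),
        )
    return dag


# Register each generated DAG at module level so the scheduler picks it up.
for job_name, user, start, end in [("job1", "a", 1, 3), ("job2", "a", 2, 3)]:
    dag = make_job_dag(job_name, user, start, end)
    globals()[dag.dag_id] = dag
```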
I'm not even sure how to trigger DAGs from a message queue... and I'm finding it hard to find examples similar to this use case. Maybe that is because Airflow is not suited to it?
Any help/thoughts/advice would be greatly appreciated.
Thanks.
Airflow is built around time-based schedules. It is not built to trigger runs based on the landing of data. There are other systems designed to do this instead; I've heard of things like pachyderm.io and dvs.org. Even repurposing a CI tool or customizing a Jenkins setup could trigger on file change events or a message queue.
However, you can try to make it work with Airflow by having an external queue listener make REST API calls to Airflow to trigger DAGs. For example, if the messages arrive via AWS SNS, you could have a simple Python AWS Lambda listener do this.
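As a rough sketch (assuming the Airflow 2 stable REST API with basic auth enabled; the Airflow URL, the credentials, and the process_upload DAG id are placeholders), such a listener could look something like this:

```python
# Hypothetical AWS Lambda handler: for each SNS message, trigger an Airflow DAG run
# via the stable REST API, passing the upload payload as dag_run.conf.
import json
import urllib.request
from base64 import b64encode

AIRFLOW_URL = "https://airflow.example.com/api/v1/dags/process_upload/dagRuns"
AUTH_HEADER = "Basic " + b64encode(b"api_user:api_password").decode()


def handler(event, context):
    for record in event["Records"]:
        upload = json.loads(record["Sns"]["Message"])      # the upload document from the queue
        body = json.dumps({"conf": upload}).encode()       # becomes dag_run.conf in Airflow
        request = urllib.request.Request(
            AIRFLOW_URL,
            data=body,
            headers={"Content-Type": "application/json", "Authorization": AUTH_HEADER},
        )
        with urllib.request.urlopen(request) as response:  # POST because data is set
            print(response.status, response.read())
```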
I would recommend one DAG per job type (or per user, whichever there are fewer of), which the trigger logic selects based on the queue message. If there are common clean-up tasks and the like, the DAG might use a TriggerDagRunOperator to start those, or you might just have a common library of clean-up tasks that each DAG includes. I think the latter is cleaner.
DAGs can have their tasks limited to certain pools. You could make a pool per user to limit the runs of jobs per user. Alternatively, if you have a DAG per user, you could set the maximum concurrent DAG runs for that DAG to something reasonable.
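A minimal sketch of how these ideas could fit together, assuming Airflow 2 (the job1 and job1_cleanup DAG ids, the user_a_pool pool, which would need to be created up front, and the callable are all hypothetical):

```python
# Hypothetical per-job-type DAG: throttles work through a per-user pool, caps
# concurrent runs with max_active_runs, and triggers a separate cleanup DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def run_job1(**context):
    # user and timerange arrive via the triggering payload (dag_run.conf)
    conf = context["dag_run"].conf or {}
    print("job1 for user", conf.get("user"), conf.get("start"), conf.get("end"))


with DAG(
    "job1",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # triggered externally, not on a time schedule
    max_active_runs=1,        # at most one concurrent run of this DAG
) as dag:
    job = PythonOperator(
        task_id="run_job1",
        python_callable=run_job1,
        pool="user_a_pool",   # pool created beforehand to limit per-user concurrency
    )
    cleanup = TriggerDagRunOperator(
        task_id="trigger_cleanup",
        trigger_dag_id="job1_cleanup",
    )
    job >> cleanup
```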
I already saw the question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't seem to find the total anywhere.
Edit: I know there's a REST endpoint for getting all the jobs in an app, but:
I would prefer not to use REST but to get it in the app itself (Spark running on AWS EMR/YARN - getting the address is probably doable, but I'd prefer not to)
that REST endpoint seems to return only jobs that are running/finished/failed, so it doesn't give the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront (as jobs are submitted by each action independently, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever code running on the driver node encounters an action (e.g. collect(), take(), etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So Spark does need to know the stages and tasks upfront for a single job in order to build its DAG, but it doesn't need to build a DAG of jobs; it can just create them "as we go".
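For what it's worth, within the application you can still poll the progress of whatever jobs have been submitted so far, e.g. via the status tracker. A PySpark sketch (the monitoring thread, the polling interval, and the toy RDD work are illustrative; it only ever sees jobs that already exist, for the reason above):

```python
# PySpark sketch: poll the status tracker from a side thread to report per-stage
# task progress of whatever jobs are currently active.
import threading
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("progress-demo").getOrCreate()
sc = spark.sparkContext
done = threading.Event()


def report_progress(interval=1.0):
    tracker = sc.statusTracker()
    while not done.is_set():
        for job_id in tracker.getActiveJobsIds():
            job = tracker.getJobInfo(job_id)
            if job is None:
                continue
            for stage_id in job.stageIds:
                stage = tracker.getStageInfo(stage_id)
                if stage is not None:
                    print(f"job {job_id} stage {stage_id}: "
                          f"{stage.numCompletedTasks}/{stage.numTasks} tasks done")
        time.sleep(interval)


monitor = threading.Thread(target=report_progress, daemon=True)
monitor.start()

# Some toy work that produces a couple of jobs (one per action).
rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
print("count:", rdd.count())
print("sum:", rdd.sum())
done.set()
```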
I am using celery.contrib.batches to execute a batch of celery tasks. I know it's experimental but still wanted to give it a try, and I am pretty close. While executing individual tasks in the batch, I deliberately send signals like backend.mark_as_started(request.id) and backend.mark_as_done(request.id, True). But the signals are not being received at the worker. Note that everything works if I get rid of batches and execute tasks one at a time; that is, my signal handler functions do get executed.
celery.contrib.batches indeed does not send these signals. The solution is to send them from inside the Batches task.
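A sketch of what that could look like (assuming the experimental celery.contrib.batches API, where the batch task receives a list of SimpleRequest objects; the broker/backend URLs, the process_batch task, and the flush settings are placeholders):

```python
# Hypothetical Batches task that marks each wrapped request as started/done itself,
# since the batch machinery does not emit these state updates on its own.
from celery import Celery
from celery.contrib.batches import Batches

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")


@app.task(base=Batches, flush_every=100, flush_interval=10)
def process_batch(requests):
    for request in requests:
        app.backend.mark_as_started(request.id)
        try:
            result = sum(request.args)  # stand-in for the real per-request work
            app.backend.mark_as_done(request.id, result, request)
        except Exception as exc:
            app.backend.mark_as_failure(request.id, exc)
```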
I have some tasks, which are further divided into runnables. Runnables execute as task instances. Runnables have dependencies within their task and also on other tasks' runnables. I have the deadlines and periods of the tasks and the execution order of tasks and runnables, i.e. I can extract the data flow. The only point where I am stuck is how to determine whether the task instances execute within their period, i.e. obey their deadlines, and, if not, whether a late task instance will execute in the next cycle or period.
Any ideas? Suggestions?
P.S. I don't have timing information for the execution of the runnables.