Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive, or alternatively at certain scheduled times; i.e. I want to be able to insert new "jobs" into the workflow as they come in, and process the files by taking them through multiple different tasks and steps. I want to be able to monitor each file as it goes through those tasks.
The queues and the load distribution for each task might be managed by Celery, but that isn't decided yet either (a rough sketch of what that could look like follows this question).
I've looked at Apache Airflow, and as far as I understand it at the moment, it is geared more towards monitoring many different workflows, where each workflow mostly runs from start to end rather than having new files added to the beginning of the flow before the previous run has ended.
Cadence Workflow seems like it can do what I need, but it also seems to be a bit of overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions for other solutions that fit the above and that I can look into.
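Not an answer in itself, but to make the Celery option concrete, here is a minimal, hypothetical sketch of a per-file pipeline built as a chain of Celery tasks; the broker URL, task names and processing steps are placeholders, not anything from the question.

```python
# Hypothetical sketch: each incoming file becomes its own chain of Celery tasks,
# so individual files can be tracked (and retried) independently.
from celery import Celery, chain

app = Celery("file_pipeline",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def extract(path):
    # e.g. read and validate the incoming file
    return {"path": path}

@app.task
def transform(payload):
    # e.g. clean/enrich the data
    return payload

@app.task
def load(payload):
    # e.g. write the result to a database or bucket
    return payload["path"]

def submit_file(path):
    """Call this whenever a new file arrives (from a watcher or a scheduler)."""
    return chain(extract.s(path), transform.s(), load.s()).apply_async()
```

Monitoring each file's progress would then come down to inspecting the chain's task states, for example with Flower.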

Luigi - https://luigi.readthedocs.io/en/stable/
Extremely lightweight and fast compared to Airflow.
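To give a feel for it, a minimal, hypothetical Luigi task for a single incoming file might look like this; the paths and the marker-file convention are assumptions.

```python
# Hypothetical sketch: one Luigi task instance per incoming file; the marker
# target makes reruns idempotent. Paths and processing logic are placeholders.
import os
import luigi

class ProcessFile(luigi.Task):
    path = luigi.Parameter()

    def output(self):
        # One marker file per processed input, so completed files are skipped
        return luigi.LocalTarget("processed/" + os.path.basename(str(self.path)) + ".done")

    def run(self):
        # ... run the real processing steps on self.path here ...
        with self.output().open("w") as marker:
            marker.write("done")

if __name__ == "__main__":
    # Each newly arrived file simply becomes a new parameterised task, e.g.:
    luigi.build([ProcessFile(path="incoming/report.csv")], local_scheduler=True)
```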

Related

Workflow system for both ETL and Queries by Users

I am looking for a workflow system that supports the following needs:
1. dealing with a complex ETL pipeline with various kinds of APIs (file-based, REST, console, databases, ...)
2. offers automated scheduling/orchestration on different execution environments (AWS, Azure, on-premise clusters, local machine, ...)
3. has an option for "reactive" workflows, i.e. workflows that can be triggered and executed instantaneously without unnecessary delay, are executed with highest priority, and the same workflow can be started several times simultaneously
The third requirement in particular seems to be tricky to satisfy. The purpose of this requirement is that a user should be able to send a query that activates a (computationally non-heavy) workflow and get a result back immediately, instead of waiting for seconds or even minutes, and multiple users might want to use the same workflow simultaneously. The reason this is important is that the ETL workflows and the user ("reactive") workflows share a substantial overlap, and I intend to reuse parts of these workflows instead of maintaining two sets of workflows that are executed by different tools.
Apache Airflow appears to be the natural choice for requirements 1 and 2, but it does not seem to support the third requirement, since it starts execution in (lengthy) fixed time slots and does not allow the simultaneous execution of several instances of the same DAG (workflow).
Are there any tools out there that support all these requirements or do I have to use two different workflow management tools or even have to stick to a (Python) script for the user workflows?
You can trigger a DAG manually by using the CLI or the API. Have a look at this post: https://medium.com/@ntruong/airflow-externally-trigger-a-dag-when-a-condition-match-26cae67ecb1a
You'll have to test if you can execute multiple dag runs at the same time.
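For reference, a hedged sketch of such an external trigger against Airflow's stable REST API (Airflow 2.x, assuming the basic-auth API backend is enabled); the host, credentials, DAG id and conf payload are placeholders, and on Airflow 1.x you would use the experimental API or the `airflow trigger_dag` CLI instead.

```python
# Hypothetical sketch: trigger a DAG run from outside Airflow via the stable REST API.
# Host, credentials, DAG id and the conf payload below are placeholders.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "reactive_workflow"

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),  # assumes the basic-auth API backend
    json={"conf": {"query": "user supplied parameters"}},
)
response.raise_for_status()
print("started run:", response.json()["dag_run_id"])
```

Whether several runs of the same DAG may execute at once is controlled by the DAG's max_active_runs setting (and core.max_active_runs_per_dag globally), so simultaneous execution is a configuration question rather than an impossibility.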

How to (properly) Create Jobs On Demand

What I would like to do
I would like to create a Kubernetes workflow where users could POST jobs whenever they want, at any time, without necessarily scheduling anything (CronJobs) or specifying parallelism or completion requirements; i.e., users could create Jobs on demand.
How I would do it right now
The way I'm thinking about accomplishing this is by simply applying the Jobs to the Kubernetes cluster (I also have to make sure each job doesn't have the same name as an existing one, because otherwise Kubernetes will think it's a mistake and won't create another one); a sketch of that is below. However, this feels improper, because the Jobs will be somewhat scattered across the cluster and I would lose control over them (though Kubernetes would supposedly manage them optimally).
Is there a better, proper way?
I imagine a more proper way of configuring all this is to create some sort of Deployment and Service on top of the Jobs, but is that an existing feature in Kubernetes? Huge companies have probably had this problem in the past, so I wonder: what are the best practices for this Kubernetes Jobs-on-demand use case?
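Not an official pattern, just a minimal sketch of the "apply a uniquely named Job" idea using the official Python Kubernetes client; the image, command, namespace, label and naming scheme are all assumptions.

```python
# Hypothetical sketch: create a uniquely named Job on demand with the Python client.
# Image, command, namespace, label and the naming scheme are placeholders.
import uuid
from kubernetes import client, config

def create_job_on_demand(image="busybox", namespace="default"):
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    name = f"on-demand-{uuid.uuid4().hex[:8]}"  # unique name so Jobs never clash

    container = client.V1Container(name="task", image=image, command=["echo", "hello"])
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(restart_policy="Never", containers=[container])
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name, labels={"app": "on-demand-job"}),
        spec=client.V1JobSpec(
            template=template,
            backoff_limit=2,
            ttl_seconds_after_finished=3600,  # let the cluster clean up finished Jobs
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
    return name
```

A small HTTP frontend that calls something like this on each POST, with all Jobs in one namespace under a common label, also mitigates the "scattered on the cluster" concern, since you can list and clean them up by label.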
Not a full answer but you might be interested in this project: https://github.com/ivoscc/kubernetes-task-runner.
It provides an API to launch one-time tasks as Jobs on a Kubernetes cluster, handles input/output files via GCS and periodically cleans up finished Jobs.

Dataprep jobs running for over 72 hours since 6/20 update. Job status reads complete but not published

I have been running daily Dataprep jobs and since the update last week, approximately half of my jobs are now hanging and not being published. They appear as jobs in progress although when I go to the actual job page, the job appears to be complete. There is no publishing action and the publishing target does not appear updated. Some jobs have now been going on for over 72 hours since Friday.
I've seen traces of other users having the same issue online but have not seen any sort of response or recognition from either Google or Trifacta.
I have tried restarting the jobs without success, and it appears that there is no way to cancel the hanging jobs because, from Google's perspective, the jobs themselves were successful, just not published. This problem appears both on jobs that publish to BigQuery and on jobs that publish to Google Cloud Storage, and on both manual and scheduled jobs.
This may impact only jobs that were pushed during the upgrade and should be rather cosmetic in nature. Please note that you won't get charged.
Did the exact same job work before with no changes? If so, please contact support and give them, as a reference, the IDs of the previously successful job and the now-failing job so it can be investigated further.
Cheers,
Sebastian
I have come across the same problem! The output of the jobs is placed in a temp folder in Cloud Storage, with the output mostly consisting of multiple files without headers...
It is also creating huge issues here. Instead of the normal output file, it places multiple parts of it in a temp folder without headers. This makes new scheduled jobs that rely on these outputs useless, because they do not load the new output.
If you manually merge the files in the temp folder, add headers (in the case of CSV) and place them in the correct folder, the output can be created manually (for CSV); a rough sketch of that merge step is below.
Also no response from Google yet.
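If it helps anyone applying that workaround, a rough sketch of the merge step (after downloading the temp-folder shards locally) could look like this; the paths, shard pattern and header line are placeholders that depend entirely on your own flow.

```python
# Hypothetical sketch: stitch headerless CSV part files back into one CSV with a header.
# Paths, shard pattern and header row are placeholders.
import glob

PARTS_PATTERN = "temp_output/part-*"   # shards downloaded from the temp folder
MERGED_PATH = "merged/output.csv"      # reconstructed output file
HEADER = "id,name,value\n"             # the header row the shards are missing

with open(MERGED_PATH, "w", encoding="utf-8") as merged:
    merged.write(HEADER)
    for part in sorted(glob.glob(PARTS_PATTERN)):
        with open(part, "r", encoding="utf-8") as shard:
            merged.writelines(shard)
```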
We're seeing the exact same thing, down to the destinations and job types... it's almost like Dataprep is losing track of the underlying Dataflow job and not finishing on its completion (that's why you see the temp files: that's the output, and Dataprep then handles the formatting of the output file separately).
Someone was kind enough to already post this on the issue tracker, so please go star it and add any additional details that may be helpful to the Dataprep team:
https://issuetracker.google.com/issues/135865374

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-created scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers' emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate job for each event? I've done something similar for a newsletter, with a cron job that just runs once per hour and, if there are any newsletters to be sent, handles them. In your case you'd have a script that runs once every hour and gets a list of users for events that happen within the desired time interval.
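The question is about a LAMP stack, but this hourly-sweep idea is language-agnostic; here is a rough sketch in Python with an assumed schema, reminder window and local mail server, and the same logic maps directly onto a php-cli script run from cron.

```python
# Hypothetical sketch of the hourly sweep script (cron: "0 * * * *").
# Table/column names, DB and SMTP settings are placeholders.
import smtplib
import sqlite3
from email.message import EmailMessage

WINDOW = "+3 hours"  # mail subscribers roughly three hours before the event

def send_due_reminders(db_path="events.db"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT e.id, e.title, u.email
        FROM events e
        JOIN subscriptions s ON s.event_id = e.id
        JOIN users u ON u.id = s.user_id
        WHERE e.reminded = 0
          AND e.starts_at BETWEEN datetime('now') AND datetime('now', ?)
        """,
        (WINDOW,),
    ).fetchall()

    with smtplib.SMTP("localhost") as smtp:
        for event_id, title, email in rows:
            msg = EmailMessage()
            msg["Subject"] = f"Reminder: {title} starts soon"
            msg["From"] = "noreply@example.com"
            msg["To"] = email
            msg.set_content(f"Your event '{title}' starts in a few hours.")
            smtp.send_message(msg)

    # Mark the handled events so the next hourly run doesn't mail them again.
    conn.executemany("UPDATE events SET reminded = 1 WHERE id = ?",
                     [(event_id,) for event_id, _, _ in rows])
    conn.commit()

if __name__ == "__main__":
    send_due_reminders()
```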
It will work. As far as scalability goes, at the minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background-processing libraries there. One of them is Resque; it uses the Redis server to keep track of the jobs.
I found a PHP clone: https://github.com/chrisboulton/php-resque
It might be overkill for your use case, but give it a shot perhaps.
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day. This is a recursive workflow.
It will run on half a million entities each day and deactivate each record that has not been updated in the past 3 days.
I'm worried about performance. Has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.