Workflow system for both ETL and Queries by Users

I am looking for a workflow system that supports the following needs:
1. dealing with a complex ETL pipeline with various kinds of APIs (file-based, REST, console, databases, ...)
2. offers automated scheduling/orchestration on different execution environments (AWS, Azure, on-Premise clusters, local machine, ...)
3. has an option for "reactive" workflows, i.e. workflows that can be triggered and executed instantaneously without unnecessary delay, are executed with highest priority, and where the same workflow can be started several times simultaneously
Especially the third requirement seems to be tricky to find. Its purpose is that a user should be able to send a query that activates a (computationally non-heavy) workflow and get back a result immediately, instead of waiting seconds or even minutes, and multiple users might want to use the same workflow simultaneously. This is important because the ETL workflows and the user ("reactive") workflows share a substantial overlap, and I intend to reuse parts of these workflows instead of maintaining two sets of workflows that are executed by different tools.
Apache Airflow appears to be the natural choice for requirements 1. and 2., but it does not seem to support the third requirement: it starts execution in (lengthy) fixed time slots and does not allow the simultaneous execution of several instances of the same DAG (workflow).
Are there any tools out there that support all these requirements, or do I have to use two different workflow management tools, or even stick to a (Python) script for the user workflows?

You can trigger a DAG manually by using the CLI or the API. Have a look at this post: https://medium.com/#ntruong/airflow-externally-trigger-a-dag-when-a-condition-match-26cae67ecb1a
You'll have to test whether you can execute multiple DAG runs at the same time.
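Airflow can run several instances of the same DAG at once if you raise the DAG-level concurrency limit. As a minimal sketch (assuming Airflow 2.x; the DAG id and callable are hypothetical), an unscheduled DAG with `max_active_runs` raised can be triggered on demand, with each trigger carrying its own parameters:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def handle_query(**context):
    # Parameters passed with the trigger (e.g. the user's query) arrive in dag_run.conf.
    params = context["dag_run"].conf or {}
    print("running reactive workflow with:", params)


with DAG(
    dag_id="reactive_query",        # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,         # no fixed time slots: runs only when triggered
    catchup=False,
    max_active_runs=16,             # allow several simultaneous runs of this DAG
) as dag:
    PythonOperator(task_id="run_query", python_callable=handle_query)
```

Each user request then becomes a trigger, e.g. `airflow dags trigger reactive_query --conf '{"user": "alice"}'` from the CLI or the equivalent `POST /api/v1/dags/reactive_query/dagRuns` call against the REST API. Note that scheduler polling still adds at least a few seconds of latency, so truly instantaneous responses may remain out of reach.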

Related

Way to persist process spawned by a task on an agent?

I'm developing an Azure DevOps extension with tasks in it. In one of the tasks, I start a process and do some configuration. In another task, I access the same process's API to consume it. This works perfectly fine, but I notice that after the job is done, my process is killed. I was planning to let the user do the configuration on an agent and then access it in another job or pipeline.
Is there a way to persist a process on an agent? It feels like the agent kills every child process it created during cleanup. Where can I find documentation on this?
Edit: I managed to find this thread that talks about a certain Process.clean variable, but there isn't any more information about it and I didn't find documentation on it.
Your feeling is correct. Agents clean up spawned processes when the job finishes, and that's by design. A single machine can have multiple agents on it, and multiple agents can be running tasks in parallel. What if you have one machine with 10 agents on it, and they all start up this process at once?
IMO, the approach you're taking is suspect. If you need to persist information across jobs, there are numerous ways to do so (for example, an output variable containing JSON) that don't involve spawning a service that continues running outside the scope of the job that started it.
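For instance (a minimal sketch; the variable name and configuration contents are made up), the first task could serialize its configuration to JSON and publish it as an output variable using Azure Pipelines' `##vso[task.setvariable]` logging command, instead of leaving a process running:

```python
# Step in the first job (give the step a name, e.g. "configure", so later jobs can reference it).
import json

config = {"endpoint": "http://localhost:8080", "api_key": "abc123"}  # hypothetical configuration

# isOutput=true makes the variable visible to later jobs as
# dependencies.<jobName>.outputs['configure.processConfig'].
print(f"##vso[task.setvariable variable=processConfig;isOutput=true]{json.dumps(config)}")
```

A later job can map that variable in and parse the JSON, with no process having to outlive the job that produced it.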

Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive or alternatively at certain scheduled times, i.e. I want to be able to insert new "jobs" to the workflow as they come, and process the files by going through multiple different tasks and steps. I want to be able to monitor each file going through the tasks.
The queues and distributing the load for each task might be managed by Celery, but it's not decided yet either.
I've looked at Apache Airflow, and as far as I understand at the moment, it is geared more towards monitoring many different workflows, where each workflow mostly runs from start to end, rather than adding new files to the beginning of the flow before the previous run has ended.
Cadence workflow seems like it can do what I need, but it also seems to be a bit of overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions to more such solutions that I can look into and can fit the above.
Luigi - https://luigi.readthedocs.io/en/stable/
Extremely light-weight and fast compared to Airflow.
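As a rough sketch of the per-file model in Luigi (file names and the transformation are placeholders), each incoming file becomes its own task instance, so new files can be enqueued while earlier ones are still being processed:

```python
import luigi


class ProcessFile(luigi.Task):
    """One task instance per incoming file; further steps would require() this one."""
    input_path = luigi.Parameter()

    def output(self):
        # Marks the file as done; Luigi skips files whose output already exists.
        return luigi.LocalTarget(self.input_path + ".processed")

    def run(self):
        with open(self.input_path) as src, self.output().open("w") as dst:
            dst.write(src.read().upper())  # placeholder transformation


if __name__ == "__main__":
    # A file watcher or cron job would build one task per newly arrived file.
    luigi.build([ProcessFile(input_path="incoming/example.csv")], local_scheduler=True)
```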

How to modify the scheduler of Pegasus WMS

I'm interested in scientific workflow scheduling. I'm trying to figure out and modify the existing scheduling algorithm inside Pegasus workflow management system from http://pegasus.isi.edu/, but I don't know where it is and how to do so. Thanks!
Pegasus has a notion of site selection during its mapping phase, where it maps the jobs to the various sites defined in the site catalog. The site selection is explained in the documentation here:
https://pegasus.isi.edu/wms/docs/latest/running_workflows.php#mapping_refinement_steps
Internally, there is a site selector interface that you can implement to incorporate your own scheduling algorithms.
You can access the javadoc at
https://pegasus.isi.edu/wms/docs/latest/javadoc/edu/isi/pegasus/planner/selector/SiteSelector.html
There are some implementations included in this package.
There is a version of Heft also implemented there. The algorithm is implemented in the following class:
edu.isi.pegasus.planner.selector.site.heft.Algorithm
Looking at the Heft implementation of the site selector will provide you with a good template for how to incorporate other site selection algorithms.
However, you need to keep in mind that Pegasus maps the workflow to the various sites and then hands it over to Condor DAGMan for execution. Condor DAGMan looks at which jobs are ready to run and releases them to the local Condor queue (managed by the Condor schedd). The jobs are then submitted to the remote sites by the schedd. The actual node on which a job gets executed is determined by the local resource scheduler on that site. For example, if you submit the jobs in a workflow to a site running PBS, then PBS decides the actual nodes on which a job runs.
In case of Condor you can associate requirements with your jobs that can help you steer jobs to specific nodes etc.
With a workflow, you can also associate job priorities that determine the priority of each job in the local Condor queue on the submit host. You can use that to control which job the schedd submits first when there are multiple jobs in the queue.
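As an illustration (hedged: this assumes the older DAX3 Python API and the standard Condor "priority" profile; the method names are from memory, so check them against the javadoc linked above), a priority can be attached to a job as a Condor profile when generating the DAX:

```python
from Pegasus.DAX3 import ADAG, Job, Profile, Namespace

dax = ADAG("example-workflow")   # hypothetical workflow name

job = Job(name="process")        # hypothetical job
# The Condor "priority" profile controls ordering in the local queue on the submit host.
job.addProfile(Profile(Namespace.CONDOR, "priority", "10"))
dax.addJob(job)

with open("example.dax", "w") as f:
    dax.writeXML(f)
```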

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduling tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate job for each event? I've done something similar for a newsletter with a cron job running just once per hour; if there are any newsletters to be sent, it handles them. In your case, you'd have a script that runs once every hour and gets a list of users for events that happen within the desired time interval.
It will work. As far as scalability goes, at the minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
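To make that concrete, here's a rough sketch of the hourly job in Python (table and column names are invented, and SQLite stands in for the MySQL you'd have in a LAMP stack, purely to keep the example self-contained):

```python
#!/usr/bin/env python3
# Hourly cron job: email subscribers of events starting in the next few hours.
# Hypothetical schema: events(id, starts_at, reminded), subscriptions(event_id, email).
import sqlite3          # stand-in; a LAMP setup would use a MySQL driver instead
import smtplib
from email.message import EmailMessage

LOOKAHEAD = "+3 hours"  # how far ahead of the event to send reminders


def main():
    db = sqlite3.connect("app.db")
    rows = db.execute(
        "SELECT e.id, e.starts_at, s.email "
        "FROM events e JOIN subscriptions s ON s.event_id = e.id "
        "WHERE e.reminded = 0 "
        "  AND e.starts_at BETWEEN datetime('now') AND datetime('now', ?)",
        (LOOKAHEAD,),
    ).fetchall()

    with smtplib.SMTP("localhost") as smtp:
        for event_id, starts_at, email in rows:
            msg = EmailMessage()
            msg["From"], msg["To"] = "noreply@example.com", email
            msg["Subject"] = "Your event starts soon"
            msg.set_content(f"Reminder: your event starts at {starts_at}.")
            smtp.send_message(msg)

    # Mark the handled events so the next run doesn't mail them again.
    db.executemany("UPDATE events SET reminded = 1 WHERE id = ?",
                   [(event_id,) for event_id, _, _ in rows])
    db.commit()


if __name__ == "__main__":
    main()
```

A single crontab entry such as `0 * * * * /usr/bin/python3 /path/to/send_reminders.py` then replaces the per-event at jobs.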
I'm doing most of my work in Rails nowadays, and there's a wealth of background-processing libraries; one of them is Resque, which uses the Redis server to keep track of the jobs.
I found a PHP clone: https://github.com/chrisboulton/php-resque
It might be overkill for your use case, but give it a shot perhaps.
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day. This is a recursive workflow.
It will run on half a million entities each day and deactivate each record that has not been updated in the past 3 days.
I’m worried about performance. Has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.