How to modify the scheduler of Pegasus WMS - workflow

I'm interested in scientific workflow scheduling. I'm trying to figure out and modify the existing scheduling algorithm inside Pegasus workflow management system from http://pegasus.isi.edu/, but I don't know where it is and how to do so. Thanks!

Pegasus has a notion of site selection during its mapping phase, in which it maps the jobs to the various sites defined in the site catalog. Site selection is explained in the documentation here:
https://pegasus.isi.edu/wms/docs/latest/running_workflows.php#mapping_refinement_steps
Internally, there is a site selector interface that you can implement to incorporate your own scheduling algorithms.
You can access the javadoc at
https://pegasus.isi.edu/wms/docs/latest/javadoc/edu/isi/pegasus/planner/selector/SiteSelector.html
Some implementations are included in this package.
A version of HEFT is also implemented there. The algorithm lives in the following class:
edu.isi.pegasus.planner.selector.site.heft.Algorithm
The HEFT site selector is a good template for incorporating other site selection algorithms.
However, keep in mind that Pegasus only maps the workflow to the various sites and then hands the workflow over to Condor DAGMan for execution. Condor DAGMan looks at which jobs are ready to run and releases them to the local Condor queue (managed by the Condor Schedd). The Schedd then submits the jobs to the remote sites. The actual node on which a job executes is determined by the local resource scheduler on the site. For example, if you submit the jobs in a workflow to a site that is running PBS, then PBS decides the actual nodes on which each job runs.
In case of Condor you can associate requirements with your jobs that can help you steer jobs to specific nodes etc.
With a workflow, you can also associate job priorities that determine the priority of the job in the local Condor Queue on the submit host. You can use that to control what job gets submitted by schedd first if there are multiple jobs in the queue.
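To make the idea of site selection concrete, here is a toy sketch in Python. It is not Pegasus code: real site selectors are Java classes that implement the SiteSelector interface, and HEFT additionally ranks jobs by their dependencies and accounts for data transfer. The function name, the (job_id, work) tuples and the per-site speed numbers are all hypothetical; the sketch only shows a greedy "earliest finish time" mapping of independent jobs to sites.

```python
def map_jobs_to_sites(jobs, site_speeds):
    """Greedily map each job to the site where it would finish earliest.

    jobs:        list of (job_id, work) tuples, processed in the given order
    site_speeds: dict of site name -> relative speed (work units per second)
    Both structures are hypothetical stand-ins, not Pegasus data types.
    """
    ready_at = {site: 0.0 for site in site_speeds}  # when each site becomes free
    mapping = {}
    for job_id, work in jobs:
        # Estimated finish time of this job on every candidate site.
        best = min(site_speeds, key=lambda s: ready_at[s] + work / site_speeds[s])
        ready_at[best] += work / site_speeds[best]
        mapping[job_id] = best
    return mapping


if __name__ == "__main__":
    jobs = [("a", 10), ("b", 4), ("c", 10)]
    print(map_jobs_to_sites(jobs, {"fast": 2.0, "slow": 1.0}))
```

Even with a mapping like this, the choice of node within a site is still left to that site's local scheduler, as described above.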

Related

Workflow system for both ETL and Queries by Users

I am looking for a workflow system that supports the following needs:
1. dealing with a complex ETL pipeline with various kinds of APIs (file-based, REST, console, databases, ...)
2. offers automated scheduling/orchestration on different execution environments (AWS, Azure, on-Premise clusters, local machine, ...)
3. has an option for "reactive" workflows i.e. workflows that can be triggered and executed instantaneously without unnecessary delay, are executed with highest priority and the same workflow can be started several times simultaneously
Especially the third requirement seems to be tricky to find. The purpose of this requirement is that a user should be able to send a query to activate a (computationally non-heavy) workflow and get back a result immediately instead of waiting some seconds or even minutes and multiple users might want to use the same workflow simultaneously. The reason this is important is that the ETL workflows and the user ("reactive") workflows share a substantial overlap and I do intend to reuse parts of these workflows instead of maintaining two sets of workflows that are executed by different tools.
Apache Airflow appears to be the natural choice for requirements 1. and 2. but does not seem to support the third requirement, since it starts the execution in (lengthy) fixed time slots and does not allow for the simultaneous execution of several instances of the same DAG (workflow).
Are there any tools out there that support all these requirements or do I have to use two different workflow management tools or even have to stick to a (Python) script for the user workflows?
You can trigger a DAG manually by using the CLI or the API. Have a look at this post: https://medium.com/#ntruong/airflow-externally-trigger-a-dag-when-a-condition-match-26cae67ecb1a
You'll have to test whether you can execute multiple DAG runs at the same time.
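As a rough sketch of the API route, assuming an Airflow 2.x deployment with the stable REST API and basic authentication enabled: a DAG run is created by POSTing to /api/v1/dags/<dag_id>/dagRuns. The helper name, host, DAG id and credentials below are placeholders, and the function only builds the request so that you can inspect it before sending.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf, user, password):
    """Build (but do not send) a request that creates a new DAG run.

    Assumes the Airflow 2.x stable REST API with basic auth; all argument
    values are supplied by the caller and are placeholders in this sketch.
    """
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_trigger_request(
        "http://localhost:8080", "user_query", {"query": "example"}, "admin", "admin"
    )
    print(req.get_method(), req.full_url)
    # To actually trigger the run: urllib.request.urlopen(req)
```

Whether several runs of the same DAG execute in parallel depends on the DAG's settings (for example the max_active_runs argument) and on the executor, so that part still needs testing on your installation.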

Way to persist process spawned by a task on an agent?

I'm developing an Azure DevOps extension with tasks in it. In one of the tasks, I start a process and configure it. In another task, I access the same process's API to consume it. This works perfectly fine, but I notice that after the job is done, my process is killed. I was planning to let the user do the configuration on an agent and then access it from another job or pipeline.
Is there a way to persist a process on an agent? I feel like the agent kills every child process it created during cleanup. Where can I find documentation on this?
Edit: I managed to find this thread that talks about a certain Process.clean variable but there's not any more information about it and I didn't find documentation on it.
Your feeling is correct. Agents clean up spawned processes when the job finishes, and that's by design. A single machine can have multiple agents on it, and multiple agents can be running tasks in parallel. What if you have one machine with 10 agents on it, and they all start up this process at once?
IMO, the approach you're taking is suspect. If you need to persist information across jobs, there are numerous ways to do so (for example, an output variable containing JSON) that don't involve spawning a service that continues running outside the scope of the job that started it.
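A minimal sketch of the output-variable idea, written in Python only for illustration (extension tasks are more commonly written in Node or PowerShell, but the mechanism is just a line printed to stdout). It uses the task.setvariable logging command with isOutput=true; the variable name serviceConfig and the payload are made up.

```python
import json


def set_output_variable_command(name, payload):
    """Return the logging command that publishes `payload` as a JSON output variable.

    `name` and `payload` are hypothetical; keep the payload small and free of
    newlines, since the whole command must fit on a single line.
    """
    value = json.dumps(payload, separators=(",", ":"))
    return f"##vso[task.setvariable variable={name};isOutput=true]{value}"


if __name__ == "__main__":
    # The agent picks the command up from the task's standard output.
    print(set_output_variable_command("serviceConfig", {"port": 8080}))
```

A later task or job can then read the variable and parse the JSON instead of talking to a process that no longer exists.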

How to (properly) Create Jobs On Demand

What I would like to do
I would like to create a Kubernetes workflow where users could POST jobs whenever they wanted, and they might do it at any time, not necessarily scheduling anything (CronJobs), or specifying parallelism or completion requirements, i.e., users could create Jobs on demand.
How I would do it right now
The way I'm thinking of accomplishing this is by simply applying the Jobs to the Kubernetes cluster (I also have to make sure a new Job doesn't have the same name as an existing one, because otherwise Kubernetes will think it's a mistake and won't create another one). However, this feels improper because the Jobs will be kind of scattered across the cluster and I would lose control over them (though Kubernetes would supposedly manage them optimally on its own).
Is there a better, proper way?
I imagine a more proper way of configuring all this is to create some sort of Deployment and Service on top of the Jobs, but is that an existing feature on Kubernetes? Huge companies probably have had this problem in the past so I wonder: what are the best practices for this Kubernetes Jobs On Demand use case?
Not a full answer but you might be interested in this project: https://github.com/ivoscc/kubernetes-task-runner.
It provides an API to launch one-time tasks as Jobs on a Kubernetes cluster, handles input/output files via GCS and periodically cleans up finished Jobs.
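If you end up creating the Jobs yourself, the name clash and the cleanup can both be handled in the manifest. The sketch below is an assumption-laden example, not part of the project above: it builds a batch/v1 Job as a plain Python dict, uses metadata.generateName so the API server appends a unique suffix, adds a label so all on-demand Jobs can be listed together, and sets ttlSecondsAfterFinished so finished Jobs are removed automatically. The prefix, label, image and command are placeholders.

```python
import json


def build_job_manifest(prefix, image, command, ttl_seconds=3600):
    """Build a one-off Kubernetes Job manifest as a dict (all values hypothetical)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            # The API server appends a random suffix, so names never collide.
            "generateName": f"{prefix}-",
            # A common label keeps the "scattered" Jobs easy to list and delete.
            "labels": {"app": "on-demand-jobs"},
        },
        "spec": {
            "ttlSecondsAfterFinished": ttl_seconds,  # auto-delete finished Jobs
            "backoffLimit": 0,                       # do not retry failed pods
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "task", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest("user-task", "busybox", ["echo", "hello"])
    print(json.dumps(manifest, indent=2))
```

Note that generateName works with kubectl create (for example by piping the JSON into kubectl create -f -) but not with kubectl apply, and the Jobs can later be found with a label selector such as app=on-demand-jobs.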

Quartz scheduler - external Trigger configuration through AdoJobStore and Clustering

While exploring AdoJobStore (and database job stores in general), I came across topics like clustering, load balancing and sharing jobs' work data state across multiple applications.
But I don't think I found a JobStore topic that covers my scenario.
I need to run Quartz jobs in a Windows Service, and I need to be able to change the configuration of the Triggers from another application (an admin panel in a web application) and have those changes applied automatically by Quartz in my Windows Service (Quartz tracks the changes and applies them).
Is it possible to do this by using AdoJobStore/Clustering mechanism? I mean in terms of JobStore's features, so by using Quartz scheduler API. Not by using SQL and changing data in Quartz tables directly or any other workarounds (according to Quartz's Best Practices doc).
The Quartz.NET scheduler can be accessed remotely, independently of job stores. Since you already have a web app you can add a reference to the remote scheduler and use the API to administer jobs, triggers etc.

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate job for each event? I've done something similar for a newsletter, with a single cron job that runs once per hour and, if there are any newsletters to be sent, handles them. In your case you'd have a script that runs once every hour and gets the list of users for events that fall into the desired time interval.
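The selection step of such an hourly script could look roughly like this. It is a language-neutral sketch written in Python rather than PHP, and the event structure, the three-hour lead time and the function name are assumptions made for the example; in practice the filter would be a WHERE clause on the events table.

```python
from datetime import datetime, timedelta


def events_due_for_reminder(events, now,
                            lead_time=timedelta(hours=3),
                            interval=timedelta(hours=1)):
    """Return ids of events whose start falls in [now + lead_time, now + lead_time + interval).

    `events` is a list of dicts with hypothetical keys 'id' and 'starts_at';
    `interval` should match how often cron runs the script.
    """
    window_start = now + lead_time
    window_end = window_start + interval
    return [e["id"] for e in events if window_start <= e["starts_at"] < window_end]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    events = [
        {"id": 1, "starts_at": datetime(2024, 1, 1, 15, 30)},
        {"id": 2, "starts_at": datetime(2024, 1, 1, 16, 0)},
        {"id": 3, "starts_at": datetime(2024, 1, 1, 14, 59)},
    ]
    print(events_due_for_reminder(events, now))
```

Because the window is half-open and as wide as the cron interval, consecutive runs do not pick the same event twice; storing a "reminder sent" flag per event would be a more robust alternative if a run can be skipped or delayed.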
It will work. As far as scalability goes, at a minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background processing libraries. One of them is Resque, which uses a Redis server to keep track of the jobs.
I found a PHP clone: https://github.com/chrisboulton/php-resque
It might be overkill for your use case, but give it a shot perhaps.
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.