Way to persist process spawned by a task on an agent? - azure-devops

I'm developing an Azure DevOps extension with tasks in it. In one of the tasks, I start a process and do some configuration. In another task, I access that same process's API to consume it. This works perfectly fine, but I noticed that after the job is done, my process is killed. I was planning to let the user do the configuration on an agent and then access it from another job or pipeline.
Is there a way to persist a process on an agent? I feel like the agent kills every child process it created when it cleans up. Where can I find documentation on this?
Edit: I managed to find this thread that talks about a certain Process.clean variable, but there isn't any more information about it and I couldn't find documentation on it.

Your feeling is correct. Agents clean up spawned processes when the job finishes, and that's by design. A single machine can have multiple agents on it, and multiple agents can be running tasks in parallel. What if you have one machine with 10 agents on it, and they all start up this process at once?
IMO, the approach you're taking is suspect. If you need to persist information across jobs, there are numerous ways to do so (for example, an output variable containing JSON) that don't involve spawning a service that continues running outside the scope of the job that started it.
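For example, a minimal sketch of the output-variable approach, assuming the producing task is a Python script and the configuration fits into JSON (the variable name and settings below are made up):

```python
# Producing task: emit the configuration as an output variable using the
# Azure DevOps logging command for setting variables.
import json

config = {"endpoint": "http://localhost:8080", "mode": "fast"}  # hypothetical settings

# isOutput=true makes the variable addressable from later jobs.
print(f"##vso[task.setvariable variable=myConfig;isOutput=true]{json.dumps(config)}")
```

A later job can then read the value back (roughly via dependencies.<job>.outputs['<stepName>.myConfig'], provided the producing step has a name) and recreate whatever state it needs, instead of relying on a process that outlives the job.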

Related

Workflow system for both ETL and Queries by Users

I am looking for a workflow system that supports the following needs:
1. dealing with a complex ETL pipeline with various kinds of APIs (file-based, REST, console, databases, ...)
2. offers automated scheduling/orchestration on different execution environments (AWS, Azure, on-premise clusters, local machine, ...)
3. has an option for "reactive" workflows, i.e. workflows that can be triggered and executed instantaneously without unnecessary delay, are executed with the highest priority, and where the same workflow can be started several times simultaneously
Especially the third requirement seems to be tricky to find. The purpose of this requirement is that a user should be able to send a query to activate a (computationally non-heavy) workflow and get back a result immediately instead of waiting some seconds or even minutes and multiple users might want to use the same workflow simultaneously. The reason this is important is that the ETL workflows and the user ("reactive") workflows share a substantial overlap and I do intend to reuse parts of these workflows instead of maintaining two sets of workflows that are executed by different tools.
Apache Airflow appears to be the natural choice for requirements 1 and 2, but it does not seem to support the third requirement, since it starts execution in (lengthy) fixed time slots and does not allow for the simultaneous execution of several instances of the same DAG (workflow).
Are there any tools out there that support all these requirements or do I have to use two different workflow management tools or even have to stick to a (Python) script for the user workflows?
You can trigger a DAG manually by using the CLI or the API. Have a look at this post: https://medium.com/@ntruong/airflow-externally-trigger-a-dag-when-a-condition-match-26cae67ecb1a
You'll have to test whether you can execute multiple DAG runs at the same time.
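For illustration, a rough sketch of triggering a DAG run on demand through the Airflow 2.x stable REST API (host, credentials and DAG id are placeholders, and basic auth has to be enabled on the webserver); how many runs of the same DAG may execute concurrently is governed by the DAG's max_active_runs setting:

```python
import requests

AIRFLOW = "http://localhost:8080"       # placeholder host
DAG_ID = "reactive_query_workflow"      # placeholder DAG id

# Each POST creates a new DAG run; several runs of the same DAG can be
# active at once as long as max_active_runs allows it.
resp = requests.post(
    f"{AIRFLOW}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),            # placeholder credentials
    json={"conf": {"query": "user-supplied parameters"}},
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```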

Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive, or alternatively at certain scheduled times, i.e. I want to be able to insert new "jobs" into the workflow as they come, and process the files by going through multiple different tasks and steps. I want to be able to monitor each file going through the tasks.
The queues and load distribution for each task might be managed by Celery, but that's not decided yet either.
I've looked at Apache Airflow, and as far as I understand at the moment, it is geared more towards monitoring many different workflows, where each workflow mostly runs from start to end, rather than adding new files to the beginning of the flow before the previous run has ended.
Cadence Workflow seems like it could do what I need, but it also seems to be a bit of overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions for more such solutions that I can look into and that could fit the above.
Luigi - https://luigi.readthedocs.io/en/stable/
Extremely light-weight and fast compared to Airflow.
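For orientation, a minimal Luigi sketch of the "one pipeline run per incoming file" idea; the task name, paths and processing step are illustrative only:

```python
import luigi


class ProcessFile(luigi.Task):
    """One pipeline run per incoming file, parameterized by its path."""
    path = luigi.Parameter()

    def output(self):
        # A marker/output target makes each run idempotent and easy to monitor.
        return luigi.LocalTarget(self.path + ".processed")

    def run(self):
        with open(self.path) as src, self.output().open("w") as dst:
            dst.write(src.read().upper())  # placeholder processing step


# Schedule a new run as each file arrives, e.g.:
#   luigi.build([ProcessFile(path="incoming/file1.csv")], local_scheduler=True)
```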

How to (properly) Create Jobs On Demand

What I would like to do
I would like to create a Kubernetes workflow where users could POST Jobs whenever they want, and they might do so at any time, without necessarily scheduling anything (CronJobs) or specifying parallelism or completion requirements, i.e., users could create Jobs on demand.
How I would do it right now
The way I'm thinking about accomplishing this is by simply applying the Jobs to the Kubernetes cluster (I also have to make sure each job doesn't have the same name as a current one, because otherwise Kubernetes will think it's a mistake and won't create another one). However, this feels improper because the Jobs will be kind of scattered around the cluster and I would lose control over them (though Kubernetes would supposedly manage them optimally on its own).
Is there a better, proper way?
I imagine a more proper way of configuring all this is to create some sort of Deployment and Service on top of the Jobs, but is that an existing feature in Kubernetes? Huge companies have probably had this problem in the past, so I wonder: what are the best practices for this Kubernetes Jobs on demand use case?
Not a full answer but you might be interested in this project: https://github.com/ivoscc/kubernetes-task-runner.
It provides an API to launch one-time tasks as Jobs on a Kubernetes cluster, handles input/output files via GCS and periodically cleans up finished Jobs.
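For reference, a rough sketch of the "apply Jobs directly" approach from the question using the official Kubernetes Python client; generateName lets the API server append a random suffix so name clashes are not an issue, and ttlSecondsAfterFinished lets the cluster garbage-collect finished Jobs (image, namespace and command are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="on-demand-job-"),  # unique name per request
    spec=client.V1JobSpec(
        ttl_seconds_after_finished=3600,  # auto-clean finished Jobs after an hour
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="busybox",                          # placeholder image
                        command=["sh", "-c", "echo processing"],  # placeholder work
                    )
                ],
            )
        ),
    ),
)

created = batch.create_namespaced_job(namespace="default", body=job)
print("created", created.metadata.name)
```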

How to use Cron to run commands when jobs in a particular folder complete?

I am running a long simulation on our cluster. I submit dozens of jobs first; each job is put on hold until its predecessor finishes, so that the simulation can be extended to the period that I want.
Due to the limit on the total number of jobs we can submit, I have to submit more jobs every day, once the previous ones have completed.
I feel it is time-consuming to do this every day, so I wonder if Cron could:
1. monitor whether all the jobs launched in a particular folder have been completed on the cluster
2. if yes, execute the commands written in a job.sh file to submit more jobs within that particular folder.
I am also happy to use methods other than Cron.
Thank you.
You may also be interested in trying BeyondCron, which is currently available for early adopters. Using conditions, you should be able to solve your problem.
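As a plain-cron alternative, here is a hedged sketch that assumes a Slurm cluster (squeue/sbatch) and that the jobs launched from a folder share that folder's name as a job-name prefix; if your scheduler is SGE or PBS, the query command differs:

```python
#!/usr/bin/env python3
# Run from cron, e.g.: */30 * * * * python3 check_and_submit.py /path/to/folder
import getpass
import subprocess
import sys
from pathlib import Path

folder = Path(sys.argv[1])
prefix = folder.name  # assumption: jobs from this folder carry this job-name prefix

# List this user's queued/running job names (Slurm assumed).
out = subprocess.run(
    ["squeue", "-u", getpass.getuser(), "-h", "-o", "%j"],
    capture_output=True, text=True, check=True,
).stdout
remaining = [name for name in out.splitlines() if name.startswith(prefix)]

if not remaining:
    # Everything launched from this folder has finished: submit the next batch.
    subprocess.run(["bash", "job.sh"], cwd=folder, check=True)
```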

Using Kafka instead of Redis for the queue purposes

I have a small project that uses Redis for the task queue purposes. Here is how it basically works.
I have two components in the system: a desktop client (there can be more than one) and a server-side app. The server-side app has a pool of tasks for the desktop client(s). When a client comes, the first available task from the pool is given to it. As each task has an id, when the desktop client gets back with the results, the server-side app can recognize the task by its id. Basically, I do the following in Redis:
Keep all the tasks as objects.
Keep the queue (pool) of tasks in several lists: queue, provided, processing.
When a task is being provided to the desktop client, I use RPOPLPUSH in Redis to move the id from the queue list to the provided list.
When I get a response from the desktop client, I use LREM to remove the given task id from the provided list (if that fails, I got a task that was not provided, was already processed, or just never existed, so I stop the execution). Then I use LPUSH to add the task id to the processing list. Given that I have unique task ids (controlled at the level of my app), I avoid duplicates in the Redis lists.
When the task is finished (the result got from the desktop client is processed and somehow saved), I remove the task from the processing list and delete the task object from Redis.
If anything goes wrong on any step (i.e. the task gets stuck on the processing or provided list), I can move the task back to the queue list and re-process it.
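For reference, the Redis side of this flow condensed into a redis-py sketch (the list names follow the description above; the key layout for task objects is an assumption):

```python
import redis

r = redis.Redis()


def provide_task():
    """Atomically hand the oldest queued task id to a desktop client."""
    return r.rpoplpush("queue", "provided")  # None if the queue is empty


def accept_result(task_id):
    """Move a task from 'provided' to 'processing' when the client reports back."""
    if r.lrem("provided", 1, task_id) == 0:
        # Not in 'provided': unknown, duplicate, or already processed - stop here.
        raise ValueError(f"unexpected task id {task_id!r}")
    r.lpush("processing", task_id)


def finish_task(task_id):
    """Result saved: drop the task from 'processing' and delete its object."""
    r.lrem("processing", 1, task_id)
    r.delete(f"task:{task_id}")  # assumed key layout for the task object
```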
Now, the question: is it somehow possible to do similar stuff in Apache Kafka? I do not need the exact behavior I have in Redis - all I need is to be able to provide a task to the desktop client (it shouldn't be possible to provide the same task twice) and to mark/change its state according to the actual processing status (new, provided, processing), so that I can control the process and restore tasks that were not processed due to some problem. If that's possible, could anyone please describe the applicable workflow?
It is possible for Kafka to act as a standard queue. Check the consumer group feature.
If the question is about appropriateness, please also refer to Is Apache Kafka appropriate for use as a task queue?
We are using Kafka as a task queue; one of the considerations in favor of Kafka was that it is already in our application ecosystem, and we found that easier than adding one more component.
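A rough sketch of that consumer-group pattern with kafka-python, with auto-commit disabled so an offset is only committed once a task has been fully handled; note that Kafka itself only tracks offsets, so the new/provided/processing state from the question would still live in your app or a separate store:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Enqueue a task (placeholder topic, broker and payload).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("tasks", {"task_id": "42", "payload": "..."})
producer.flush()

# Consume tasks as a group: each task is delivered to only one consumer in the group.
consumer = KafkaConsumer(
    "tasks",
    bootstrap_servers="localhost:9092",
    group_id="desktop-clients",   # one group behaves like one logical queue
    enable_auto_commit=False,     # commit only after the work is done
    value_deserializer=lambda v: json.loads(v.decode()),
)

for message in consumer:
    task = message.value
    # ... hand the task to the client and process the result ...
    consumer.commit()  # uncommitted tasks are redelivered after a restart/failure
```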