SpringXD Job split flow steps running in separate containers in distributed mode - spring-batch

I am aware of the nested job support (XD-1972) work and looking forward to that. A question regarding split flow support. Is there a plan to support running parallel steps, as defined in split flows, in separate containers?
Would it be as simple as providing a custom implementation of a suitable `TaskExecutor`, or is it something more involved?

I'm not aware of support for splits to be executed across multiple containers being on the roadmap currently. Given the orchestration needs of something like that, I'd probably recommend a more composed approach anyway.
A custom `TaskExecutor` could be used to farm out the work, but it would be pretty specialized. Each step within the flows in a split is executed within the scope of a job. That scope (and all its rights and responsibilities) would need to be carried over to the "child" containers.

Related

Multi-instance and Loop in BPMN

I am trying to model a certain behaviour where a couple of activities in different swimlanes are supposed to be processed in a loop. Now BPMN uses tokens to illustrate the flow and the paths taken. I wonder how such tokens work in the case of loops. Does every activity iteration create a token which consequently travels through the connected activities?
E.g. let's say Activity1 will be performed in a loop 10 times. Will that create 10 tokens, each of which will travel through the remaining activities of the process? Such behaviour would be undesirable; however, if I am not mistaken, multi-instance activities work that way.
The only solution on my mind which would comply with BPMN specification would be to create a Call activity for the whole block of activities and then run the Call activity in a loop.
Can anyone clarify for me the use of loops and multi-instances in BPMN from the view of tokens?
Thank you in advance!
Based upon my reading of the documentation (https://www.omg.org/spec/BPMN/2.0/PDF), the answer from #qwerty_so does not seem to conform to the standard, although in part this seems to be because the question itself is imprecise, or at least underspecified.
A token (see glossary) is simply an imaginary object that represents the flow unit in the process diagram. There are at least three different types of loops specified in the standard, which suggest different implications for the flow unit.
Sections 13.2.6 and 13.2.7 describe Loop Activities and Multiple Instance Activities respectively. While the latter, on its face, might not seem like a loop, the standard defines attributes of the activity that suggest otherwise, including MultipleInstanceLoopCharacteristics and its loopCardinality Expression.
In the former case, it seems that the operational semantics suggest a single flow unit that repeats multiple times according to some policy or even unbounded.
In the latter case, the activity has "multiple instances spawned," including a parallel variant.
That multiple instances can flow forward in parallel, on its face, suggests that the system must at least allow for the possibility of spawning multiple tokens (or conceptually splitting the original token) to support multiple threads proceeding simultaneously along different paths.
That said, the Loop Activity (13.2.6) appears to support the OP's desired semantics.

Celery - running a set of tasks with complex dependencies

In the application I'm working on, a user can perform a "transition" which consists of "steps". A step can have an arbitrary number of dependencies on other steps. I'd like to be able to call a transition and have the steps execute in parallel as separate Celery tasks.
Ideally, I'd like something along the lines of celery-tasktree, except for directed acyclic graphs in general, rather than only trees, but it doesn't appear that such a library exists as yet.
The first solution that comes to mind is a parallel adaptation of a standard topological sort - rather than determining a linear ordering of steps which satisfy the dependency relation, we determine the entire set of steps that can be executed in parallel at the beginning, followed by the entire set of steps that can be executed in round 2, and so on.
However, this is not optimal when tasks take a variable amount of time and workers have to idle waiting for a longer running task while there are tasks that are now ready to run. (For my specific application, this solution is probably fine for now, but I'd still like to figure out how to optimise this.)
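The round-based approach above can be sketched in plain Python; the step names and the dependency-dict shape are illustrative, not anything Celery-specific:

```python
def parallel_rounds(deps):
    """Group steps into rounds; every step in a round can run in parallel.

    deps maps each step to the collection of steps it depends on.
    """
    pending = {step: set(d) for step, d in deps.items()}
    rounds = []
    done = set()
    while pending:
        # A step is ready once all of its dependencies have completed.
        ready = {s for s, d in pending.items() if d <= done}
        if not ready:
            raise ValueError("cycle detected in step dependencies")
        rounds.append(ready)
        done |= ready
        for s in ready:
            del pending[s]
    return rounds

# Example: b and c depend on a; d depends on both b and c.
rounds = parallel_rounds({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]})
```

Here `rounds` comes out as `[{"a"}, {"b", "c"}, {"d"}]` - exactly the "everything runnable in round 2" grouping described above, including its weakness: `d` waits for the whole second round even if `b` finishes early.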
As noted in https://cs.stackexchange.com/questions/2524/getting-parallel-items-in-dependency-resolution, a better way is to operate directly off the DAG - after each task finishes, check whether any of its dependent tasks are now able to run, and if so, run them.
What would be the best way to go about implementing something like this? It's not clear to me that there's an easy way to do this.
From what I can tell, Celery's group/chain/chord primitives aren't flexible enough to allow me to express a full DAG - though I might be wrong here?
I'm thinking I could create a wrapper for tasks which notifies dependent tasks once the current task finishes - I'm not sure what the best way to handle such a notification would be though. Accessing the application's Django database isn't particularly neat, and would make it hard to spin this out into a generic library, but Celery itself doesn't provide obvious mechanisms for this.
I also faced this problem but couldn't really find a better solution or library, except for one. For anyone still interested, you can check out
https://github.com/selinon/selinon. Although it's only for Python 3, it seems to be the only thing that does exactly what you want.
Airflow is another option, but Airflow is used in a more static environment, just like other DAG libraries.

Should I override the default ExecutionContext?

When using futures in Scala, the default behaviour is to use the default Implicits.global execution context. It seems this defaults to making one thread available per processor. In a more traditional threaded web application this seems like a poor default when the futures are performing a task such as waiting on a database (as opposed to some CPU-bound task).
I'd expect that overriding the default context would be fairly standard in production but I can find so little documentation about doing it that it seems that it might not be that common. Am I missing something?
Instead of thinking of it as overriding the default execution context, why not ask instead "Should I use multiple execution contexts for different things?" If that's the question, then my answer would be yes. Where I work, we use Akka. Within our app, we use the default Akka execution context for non-blocking functionality. Then, because there is currently no good non-blocking JDBC driver, all of our blocking SQL calls use a separate execution context, where we take a thread-per-connection approach. Keeping the main execution context (a fork-join pool) free from blocking led to a significant increase in throughput for us.
I think it's perfectly ok to use multiple different execution contexts for different types of work within your system. It's worked well for us.
The "correct" answer is that methods that need an ExecutionContext should require one in their signature, so you can supply ExecutionContext(s) from the "outside" to control execution at a higher level.
Yes, creating and using other execution contexts in your application is definitely a good idea.
Execution contexts will modularize your concurrency model and isolate the different parts of your application, so that if something goes wrong in one part of your app, the other parts will be less impacted. To take your example, you would have one execution context for DB-specific operations and another for, say, processing web requests.
In this presentation by Jonas Boner this pattern is referred to as creating "Bulkheads" in your application for greater stability & fault tolerance.
I must admit I haven't heard much about execution context usage by itself. However, I do see this principle applied in some frameworks. For example, Play will use different execution contexts for different types of jobs and they encourage you to split your tasks into different pools if necessary: Play Thread Pools
The Akka middleware also suggests splitting your app into different contexts for the different concurrency zones in your application. It uses the concept of a Dispatcher, which is an execution context with batteries included.
Also, most operators in the Scala concurrency library require an execution context. This is by design, to give you the flexibility you need when modularizing your application concurrency-wise.

UML Deployment Diagram for IaaS and PaaS Cloud Systems

I would like to model the following situation using a UML deployment diagram.
A small command and control machine instance is spawned on an Infrastructure as a Service cloud platform such as Amazon EC2. This instance is in turn responsible for spawning additional instances and providing them with a control script NumberCruncher.py either via something like S3 or directly as a start up script parameter if the program is small enough to fit into that field. My attempt to model the situation using UML deployment diagrams under the working assumption that a Machine Instance is a Node is unsatisfying for the following reasons.
The diagram seems to suggest that there will be exactly three number cruncher nodes. Is it possible to illustrate a multiplicity of Nodes in a deployment diagram, like one would illustrate a multiplicity of object instances using a multi-object? If this is not possible for Nodes, then this seems to be a Long Standing Issue.
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
Lastly:
What about Platform as a Service? The whole Machine Instance is a Node idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
Regarding your first question:
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
I generally use Notes for that.
And your second question:
What about Platform as a Service? The whole Machine Instance is a Node idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
I would say yes to your last question. And I suppose you could take more details from the definition of the deployment model and its elements, especially at the end of this paragraph:
They [Nodes] can be nested and can be connected into systems of arbitrary complexity using communication paths. Typically, Nodes represent either hardware devices or software execution environments.
and
ExecutionEnvironments represent standard software systems that application components may require at execution time.
source: http://www.omg.org/spec/UML/2.5/Beta1/

Do all cluster schedulers take array jobs, and if they do, do they set SGE_TASK_ID array id?

When using qsub to put array jobs on a cluster, the environment variable SGE_TASK_ID gets set to the array task ID. I use this in a shell script that I run on a cluster, where each array task needs to do something different based on SGE_TASK_ID. Is this a common way for cluster schedulers to do this, or do they all have a different approach?
Most schedulers have a way to do this, although it can be slightly different in different setups. In TORQUE the variable is called $PBS_ARRAYID but it works the same.
Do all cluster schedulers take array jobs
No. Many do, but not all.
and if they do, do they set SGE_TASK_ID array id?
Only Grid Engine will set SGE_TASK_ID because this is simply what the variable is called in Grid Engine. Other cluster middlewares have a different name for it, with different semantics.
It's a bit unclear what you are aiming at with your question, but if you want to write a program/system that runs on many different cluster middlewares / load balancers / schedulers, you should look into DRMAA, which abstracts away variables like SGE_TASK_ID.
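Short of adopting DRMAA, a common lightweight approach is to probe the scheduler-specific environment variables yourself. A minimal sketch - SGE_TASK_ID and PBS_ARRAYID come from the discussion above; SLURM_ARRAY_TASK_ID (SLURM) and LSB_JOBINDEX (LSF) are the documented names for those schedulers:

```python
import os

# Array-task-index variable names used by common schedulers.
ARRAY_ID_VARS = ["SGE_TASK_ID", "PBS_ARRAYID", "SLURM_ARRAY_TASK_ID", "LSB_JOBINDEX"]

def array_task_id(environ=os.environ):
    """Return the array task index as an int, or None if not in an array job."""
    for name in ARRAY_ID_VARS:
        value = environ.get(name)
        if value and value.isdigit():
            return int(value)
    return None
```

For example, `array_task_id({"SGE_TASK_ID": "7"})` returns `7`. Note the semantics still differ between schedulers (e.g. the starting index and step size of the range), so this only papers over the naming, not the behaviour.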