How is a distributed (e.g., cluster-based) program normally represented/prepared? - distributed-computing

I am not familiar with distributed computing/programming.
I know that a large-scale program should be divided into tasks and then described as a directed acyclic graph (DAG) so that a scheduler can distribute the tasks to nodes.
My questions are:
(1) What programming paradigms do we normally use to build such a program? Do we need to manually write both the code and the corresponding DAG description, or is there some automatic tool that generates a DAG from the source code?
(2) What runtime libraries do we normally use to schedule/distribute the tasks?

Chaos engineering best practice

I studied the principles of chaos and looked for some open-source projects, such as chaosblade, which is open-sourced by Alibaba, and mangle, by VMware.
Both of these are fault injection tools, and they do no analysis of the tested system.
According to the principles of chaos, we should:
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?
Are there any good suggestions or best practices?
As always, the answer is: it depends. It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But normally it makes total sense to introduce metrics to improve observability.
If your hypothesis is something like "Our service can process 120 requests per second, even if one node fails", then yes, you could verify it via metrics, but you could also verify it via the requests you send and the responses you receive. It is up to you.
But if your hypothesis is "I get a response for a request that was sent before a node goes down", then it makes more sense to verify it directly with the requests and responses.
In our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML together with the related actions to test it.
So you can say: I have a steady state X, and if I do Y, then steady state X should still be valid. The toolkit is also able to verify metrics if you want it to.
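To make the "steady state X, do Y, check X again" idea concrete, here is a minimal sketch in plain Python. It is not chaostoolkit's experiment format; `run_experiment`, `FakeCluster`, and the numbers are hypothetical stand-ins for a real probe (metrics query or request loop) and a real fault injection.

```python
from typing import Callable


def run_experiment(
    probe: Callable[[], float],
    hypothesis: Callable[[float], bool],
    inject_fault: Callable[[], None],
) -> dict:
    """Check the steady-state hypothesis before and after a fault."""
    before = probe()
    if not hypothesis(before):
        # The system is not in steady state to begin with, so do not inject.
        return {"before": before, "after": None, "steady_state_held": False}
    inject_fault()
    after = probe()
    return {"before": before, "after": after, "steady_state_held": hypothesis(after)}


class FakeCluster:
    """Toy system (an assumption for illustration): throughput scales with nodes."""

    def __init__(self, nodes: int, rps_per_node: float) -> None:
        self.nodes = nodes
        self.rps_per_node = rps_per_node

    def throughput(self) -> float:
        return self.nodes * self.rps_per_node

    def kill_node(self) -> None:
        self.nodes -= 1


cluster = FakeCluster(nodes=4, rps_per_node=50.0)
result = run_experiment(
    probe=cluster.throughput,
    hypothesis=lambda rps: rps >= 120,  # "120 requests per second"
    inject_fault=cluster.kill_node,     # "even if one node fails"
)
print(result)
```

Whether the probe reads a monitoring system or counts responses directly is exactly the choice discussed above; the comparison logic stays the same.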
The Principles of Chaos sit a level above actual testing. They reflect the philosophy of designed versus actual system and of system under injection versus baseline, but they are too abstract to apply in everyday testing: they are a way of reasoning, not a work process methodology.
I think the control group versus experimental group wording is an especially doubtful part. You stage a test (injection) in a controlled environment and try to catch any user-facing incident, SLA breach, or degradation. I do not see where a control group comes in if you test on a test stand or in a dedicated environment.
We use a very linear variety of chaos methodology:
1. Find failure points in the system (based on architecture, critical user scenarios, and history of incidents).
2. Design chaos test scenarios (each may be a single attack or a more elaborate sequence).
3. Run the tests, record the results, and reuse the green ones for new releases.
4. Open tasks to fix the red tests, and verify the fixes when they are available.
One may say we are actually using the Principles of Chaos in 1 and 2, but we tend to think of chaos testing as a quite linear and simple process.
Mangle 3.0 was released with an option for analysis using a resiliency score. Detailed documentation is available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

How to represent part of BPMN workflow that is automated by system?

I am documenting a user workflow where part of the flow is automated by a system (e.g. if the order quantity is less than 10, approve the order immediately rather than sending it to a staff member for review).
I have swim lanes that go from person to person, but I'm not sure where this system task/decision path fits. What's the best practice? Possibly a dumb idea, but I'm inclined to create a new swim lane and call it the "system".
Any thoughts?
Moving system tasks into a separate lane is perfectly acceptable, because the BPMN 2.0 specification does not prescribe the meaning of lanes. It says something like this:
Lanes are used to organize and categorize Activities within a Pool.
The meaning of the Lanes is up to the modeler. BPMN does not specify
the usage of Lanes. Lanes are often used for such things as internal
roles (e.g., Manager, Associate), systems (e.g., an enterprise
application), an internal department (e.g., shipping, finance), etc.
So you are free to use them for whatever you want. However, your case is quite straightforward and doesn't require such a separation at all. From your description, this is a typical conditional activity, which can be expressed via a Service Task or a Sub-process. These are two different approaches with different semantics.
According to the BPMN specification, a Service Task is a task that uses some sort of service, which could be a Web service or an automated application. That is, it is usually used when the modeler doesn't want to decompose some part of the process and intends to hand it off to an external tool or agent.
A Sub-process is a different matter: it is typically used when you
want to wrap some complex piece of workflow for reuse or if that
piece of workflow can be decomposed into sub-elements.
In your use case, a Sub-process is the better choice. It is highly adjustable, transparent, and maintainable. For example, inside the sub-process you can use a Business Rules Engine for your condition parameter (Order Quantity) and adjust its value on the fly.
You can learn more about the difference between these approaches from this blog.
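As a rough illustration of keeping the condition parameter adjustable, the decision inside such a sub-process could look like the sketch below. `ApprovalRule` and its field name are hypothetical and only stand in for whatever a real business rules engine would hold; the threshold of 10 comes from the question.

```python
from dataclasses import dataclass


@dataclass
class ApprovalRule:
    """Hypothetical stand-in for a rule stored in a business rules engine."""

    auto_approve_below: int = 10  # threshold from the question; adjustable

    def route(self, order_quantity: int) -> str:
        # Small orders are approved by the system; the rest go to staff review.
        if order_quantity < self.auto_approve_below:
            return "auto-approve"
        return "staff-review"


rule = ApprovalRule()
print(rule.route(3))    # system path
print(rule.route(25))   # human path

# Changing the threshold does not require touching the workflow diagram.
rule.auto_approve_below = 50
print(rule.route(25))
```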
Another technique is to express system tasks/decisions via a dedicated participant/lane. All system tasks are then collocated on a system lane.
System tasks (service tasks in BPMN) are usually done on behalf of an actor, so in my opinion it is useful to position them in the lane for that actor.
Usually such a design also helps keep the diagram easy to read by limiting the number of transitions between the "users" lanes and the "system" lane.

Celery - running a set of tasks with complex dependencies

In the application I'm working on, a user can perform a "transition" which consists of "steps". A step can have an arbitrary number of dependencies on other steps. I'd like to be able to call a transition and have the steps execute in parallel as separate Celery tasks.
Ideally, I'd like something along the lines of celery-tasktree, except for directed acyclic graphs in general, rather than only trees, but it doesn't appear that such a library exists as yet.
The first solution that comes to mind is a parallel adaptation of a standard topological sort - rather than determining a linear ordering of steps which satisfy the dependency relation, we determine the entire set of steps that can be executed in parallel at the beginning, followed by the entire set of steps that can be executed in round 2, and so on.
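A minimal sketch of that round-based variant, in plain Python with no Celery involved. The function name `parallel_rounds` and the `deps` mapping (step name to the set of steps it depends on) are assumptions made for illustration.

```python
def parallel_rounds(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into rounds; each round depends only on earlier rounds."""
    remaining = {step: set(d) for step, d in deps.items()}
    # Steps that appear only as dependencies have no dependencies themselves.
    for d in deps.values():
        for step in d:
            remaining.setdefault(step, set())

    rounds: list[list[str]] = []
    while remaining:
        # Everything with no unmet dependency can run in this round.
        ready = sorted(step for step, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        rounds.append(ready)
        for step in ready:
            del remaining[step]
        for d in remaining.values():
            d.difference_update(ready)
    return rounds


steps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(parallel_rounds(steps))
```

Each inner list is one round whose steps could be dispatched together; the next round starts only once the whole previous round has finished.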
However, this is not optimal when tasks take a variable amount of time and workers have to idle waiting for a longer running task while there are tasks that are now ready to run. (For my specific application, this solution is probably fine for now, but I'd still like to figure out how to optimise this.)
As noted in https://cs.stackexchange.com/questions/2524/getting-parallel-items-in-dependency-resolution, a better way is to operate directly off the DAG: after each task finishes, check whether any of its dependent tasks are now able to run, and if so, run them.
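To show the shape of that approach without committing to any Celery mechanism, here is a sketch that uses a local thread pool from the standard library as the "workers". `run_dag`, `tasks`, and `deps` are hypothetical names, and every dependency is assumed to be a key of `tasks`.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable


def run_dag(
    tasks: dict[str, Callable[[], object]],
    deps: dict[str, set[str]],
    max_workers: int = 4,
) -> list[str]:
    """Start each task as soon as its dependencies finish; return completion order."""
    pending = {name: set(deps.get(name, ())) for name in tasks}
    dependents: dict[str, list[str]] = {name: [] for name in tasks}
    for name, unmet in pending.items():
        for dep in unmet:
            dependents[dep].append(name)

    finished: list[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Tasks with no dependencies start immediately.
        running = {pool.submit(tasks[n]): n for n, unmet in pending.items() if not unmet}
        while running:
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                finished.append(name)
                # A finished task may unblock some of its dependents.
                for child in dependents[name]:
                    pending[child].discard(name)
                    if not pending[child]:
                        running[pool.submit(tasks[child])] = child
    if len(finished) != len(tasks):
        raise ValueError("dependency cycle detected")
    return finished


log: list[str] = []
demo_tasks = {name: (lambda name=name: log.append(name)) for name in "abcd"}
demo_deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(run_dag(demo_tasks, demo_deps))
```

Unlike the round-based version, a ready task here never waits for unrelated slow tasks. In a distributed setting the `pending` bookkeeping would have to live in shared storage, which is the notification problem raised below.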
What would be the best way to go about implementing something like this? It's not clear to me that there's an easy way to do this.
From what I can tell, Celery's group/chain/chord primitives aren't flexible enough to allow me to express a full DAG - though I might be wrong here?
I'm thinking I could create a wrapper for tasks which notifies dependent tasks once the current task finishes - I'm not sure what the best way to handle such a notification would be though. Accessing the application's Django database isn't particularly neat, and would make it hard to spin this out into a generic library, but Celery itself doesn't provide obvious mechanisms for this.
I also faced this problem and couldn't find a better solution or library, with one exception. For anyone still interested, you can check out
https://github.com/selinon/selinon. Although it is only for Python 3, it seems to be the only thing that does exactly what you want.
Airflow is another option, but Airflow is used in more static environments, just like other DAG libraries.

SpringXD Job split flow steps running in separate containers in distributed mode

I am aware of the nested job support (XD-1972) work and looking forward to that. A question regarding split flow support. Is there a plan to support running parallel steps, as defined in split flows, in separate containers?
Would it be as simple as providing custom implementation of a proper taskExecutor, or it is something more involved?
I'm not aware of support for splits to be executed across multiple containers being on the roadmap currently. Given the orchestration needs of something like that, I'd probably recommend a more composed approach anyways.
A custom 'TaskExecutor' could be used to farm out the work, but it would be pretty specialized. Each step within the flows in a split is executed within the scope of a job. That scope (and all its rights and responsibilities) would need to be carried over to the "child" containers.

How does one represent multiple threads in a flow chart

I have been tasked with creating a flow chart for some client-server and startup processes in our organization's software. Many of our processes run concurrently, as they have no impact on one another. How is this traditionally represented in a flow chart?
I was thinking that flowcharts aren't really intended for this, but as it turns out, there actually is a notation for concurrency. Wikipedia says:
Concurrency symbol
Represented by a double transverse line with any number of entry and exit arrows. These symbols are used whenever two or more control flows must operate simultaneously. The exit flows are activated concurrently, when all of the entry flows have reached the concurrency symbol. A concurrency symbol with a single entry flow is a fork; one with a single exit flow is a join.
I did some looking around on Google images and found this notation:
But this will only apply for a specific type of parallelism (what if you don't spawn all your threads at once?), and won't apply to a multiprocess model at all. In case of a multiprocess model, I would just make a separate flowchart for each process.
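For reference, the fork and join that the concurrency symbol describes correspond roughly to the following pattern in code. This is only a sketch with hypothetical flow names, using Python's standard `threading` module: all flows are started together (the fork) and control continues only after every one of them has finished (the join).

```python
import threading


def fork_join(names: list[str]) -> dict[str, str]:
    """Start one thread per flow (fork), then wait for all of them (join)."""
    results: dict[str, str] = {}
    lock = threading.Lock()

    def flow(name: str) -> None:
        # Placeholder for the real work of one concurrent control flow.
        with lock:
            results[name] = f"{name} finished"

    threads = [threading.Thread(target=flow, args=(n,)) for n in names]
    for t in threads:   # fork: one entry flow, several exit flows
        t.start()
    for t in threads:   # join: several entry flows, one exit flow
        t.join()
    return results


print(fork_join(["load config", "open socket"]))
```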
Examples speak louder than words! See the flow chart in a paper.