UML Deployment Diagram for IaaS and PaaS Cloud Systems

I would like to model the following situation using a UML deployment diagram.
A small command-and-control machine instance is spawned on an Infrastructure-as-a-Service cloud platform such as Amazon EC2. This instance is in turn responsible for spawning additional instances and providing them with a control script, NumberCruncher.py, either via something like S3 or directly as a startup-script parameter if the program is small enough to fit into that field. My attempt to model the situation using UML deployment diagrams, under the working assumption that a machine instance is a Node, is unsatisfying for the following reasons.
The diagram seems to suggest that there will be exactly three number-cruncher nodes. Is it possible to illustrate a multiplicity of Nodes in a deployment diagram, the way one would illustrate a multiplicity of object instances using a multi-object? If this is not possible for Nodes, then this seems to be a long-standing issue.
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
Lastly:
What about Platform as a Service? The whole "machine instance is a Node" idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
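For reference, the spawning step I am trying to model looks roughly like the boto3 sketch below; the AMI ID, instance type, and bucket name are all made up.

    # Rough sketch of the control instance spawning workers. The AMI ID,
    # instance type, and bucket name are hypothetical.
    import boto3

    ec2 = boto3.client("ec2")
    with open("NumberCruncher.py") as f:
        script = f.read()

    # If the control script fits in the user-data field (16 KB limit),
    # pass it directly as the startup-script parameter...
    ec2.run_instances(
        ImageId="ami-12345678",
        InstanceType="t3.micro",
        MinCount=3,
        MaxCount=3,
        UserData=script,
    )

    # ...otherwise stage it in S3 and let the workers fetch it on boot.
    s3 = boto3.client("s3")
    s3.upload_file("NumberCruncher.py", "my-control-bucket", "NumberCruncher.py")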

Regarding your first question:
Is there any way to show the equivalent of deployment regions /
data-centres in the deployment diagram?
I generally use Notes for that.
And your second question:
What about Platform as a Service? The whole "machine instance is a Node"
idea completely breaks down at that point. What on earth do you do in
that case? Treat the entire PaaS provider as a single node and forget
about the details?
I would say yes to your last question. And I suppose you could glean more detail from the definition of the deployment model and its elements, especially at the end of this paragraph:
They [Nodes] can be nested and can be connected into systems of arbitrary
complexity using communication paths. Typically, Nodes represent
either hardware devices or software execution environments.
and
ExecutionEnvironments represent standard software systems that
application components may require at execution time.
source: http://www.omg.org/spec/UML/2.5/Beta1/


Amazon ECS, capacity provider not able to provide required capacity

I want to create an ECS cluster with two capacity providers:
standard, which uses on-demand instances
spot, which uses spot instances
ECS is going to be linked to auto-scaling groups and will handle scaling for the above providers.
When defining a service, I am going to use a custom capacity provider strategy. The sample configuration could be as follows:
base: 2 for the standard provider 
weight: 0 for the standard provider, 1 for the spot provider 
If I am not mistaken, with that configuration my service should create 2 instances on the standard (on-demand) provider and the rest on the spot one.
Assume I want to manage 10 tasks under my service.
In the happy path, 2 of them run on my standard provider and 8 on the spot one.
Here is the question: how is the unhappy scenario handled when spot instances are not available? Will my service contain only the 2 tasks that were placed on the on-demand instances?
If yes, how can I dynamically adjust my service to temporarily use only the on-demand provider?
Or maybe the above configuration doesn't make any sense, and there is a better way to utilize spot instances and ECS to cut costs?
Currently, the capacity provider and service combination does not consider whether the instances tasks run on are spot or on-demand: https://github.com/aws/containers-roadmap/issues/773
Your configuration seems reasonable for using spot. Assuming that you pick a range of instance types and availability zones, there is typically sufficient spot capacity. However, Amazon always states that you shouldn't run production workloads on spot :shrug:
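For what it's worth, the strategy described in the question could be wired up like this with boto3; the cluster, service, task definition, and provider names are placeholders.

    # Sketch of registering the base/weight strategy from the question.
    # All names here are hypothetical.
    import boto3

    ecs = boto3.client("ecs")
    ecs.create_service(
        cluster="my-cluster",
        serviceName="my-service",
        taskDefinition="my-task:1",
        desiredCount=10,
        capacityProviderStrategy=[
            # base=2: the first two tasks always land on on-demand capacity.
            {"capacityProvider": "standard", "base": 2, "weight": 0},
            # weight 1 vs. 0: every task beyond the base goes to spot.
            {"capacityProvider": "spot", "base": 0, "weight": 1},
        ],
    )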

Chaos engineering best practice

I have studied the Principles of Chaos and looked at some open-source projects, such as ChaosBlade, open-sourced by Alibaba, and Mangle, by VMware.
Both are fault-injection tools and do no analysis of the system under test.
According to the Principles of Chaos, we should:
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
So how do we do step 4? Should we use a monitoring system to track some major metrics and check the status of the system after fault injection?
Are there any good suggestions or best practices?
So how do we do step 4? Should we use a monitoring system to track some major metrics and check the status of the system after fault injection?
As always, the answer is: it depends. It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But normally it makes total sense to introduce metrics to improve observability.
If your hypothesis is something like "Our service can process 120 requests per second, even if one node fails", then yes, you could measure that via metrics, but you could also measure it via the requests you send and the responses you receive back. It is up to you.
But if your hypothesis is "I get a response for a request which was sent before a node goes down", then it makes more sense to verify it directly with the requests and responses.
On our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML together with the related actions to probe it.
So you can say: I have a steady state X, and if I do Y, then steady state X should still be valid. The toolkit is also able to verify metrics if you want it to.
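If you roll your own verification instead, step 4 can be as simple as probing a steady-state metric before and after the injection. Here is a minimal sketch in plain Python; the endpoint URL, the 99% threshold, and inject_fault() are all assumptions:

    # Minimal sketch of trying to disprove a steady-state hypothesis around
    # a fault injection. Endpoint and threshold are hypothetical.
    import urllib.error
    import urllib.request

    URL = "http://my-service.example/health"

    def success_rate(url, attempts=100):
        """Probe the service; return the fraction of successful responses."""
        ok = 0
        for _ in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status == 200:
                        ok += 1
            except (urllib.error.URLError, OSError):
                pass
        return ok / attempts

    def inject_fault():
        # Placeholder: trigger the actual injection here, e.g. by shelling
        # out to chaosblade or calling the Mangle API.
        pass

    baseline = success_rate(URL)
    assert baseline >= 0.99, "not in steady state; abort the experiment"

    inject_fault()

    # The hypothesis holds only if the steady state survives the injection.
    after = success_rate(URL)
    print("hypothesis", "holds" if after >= 0.99 else "is disproved")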
The Principles of Chaos sit a bit above the actual testing: they reflect the philosophy of the designed vs. the actual system, and of the system under injection vs. the baseline, but they are a bit too abstract to apply in everyday testing. They are a way of reasoning, not a work-process methodology.
I think the control group vs. experimental group wording is an especially doubtful part: you stage a test (injection) in a controlled environment and try to catch whether there is a user-facing incident, an SLA breach of any kind, or a degradation. I do not see where the control group is if you test on a test stand or a dedicated environment.
We use a very linear variety of the chaos methodology, which is:
find failure points in the system (based on architecture, critical user scenarios, and the history of incidents)
design chaos test scenarios (maybe a single attack or a more elaborate sequence)
run the tests, register the results, and reuse green tests for new releases
start tasks to fix red tests, and verify the solutions when they become available
One may say we are actually using the Principles of Chaos in 1 and 2, but we tend to think of chaos testing as a quite linear and simple process.
Mangle 3.0 has been released with an option for analysis using a resiliency score. Detailed documentation is available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

How to represent a part of a BPMN workflow that is automated by a system?

I am documenting a user workflow where part of the flow is automated by a system (e.g. if the order quantity is less than 10, approve the order immediately rather than sending it to a staff member for review).
I have swim lanes that go from person to person, but I am not sure where I can fit this system task/decision path. What's the best practice? Possibly a dumb idea, but I'm inclined to create a new swim lane and call it the "system".
Any thoughts?
The approach of detaching system tasks into a separate lane is quite possible, as the BPMN 2.0 specification does not explicitly define the meaning of lanes and says something like this:
Lanes are used to organize and categorize Activities within a Pool.
The meaning of the Lanes is up to the modeler. BPMN does not specify
the usage of Lanes. Lanes are often used for such things as internal
roles (e.g., Manager, Associate), systems (e.g., an enterprise
application), an internal department (e.g., shipping, finance), etc.
So you are completely free to fill them with whatever you want. However, your case is quite straightforward and doesn't require such separation at all. According to your description, we have a typical conditional activity, which can be expressed via a Service Task or a Sub-Process. These are two different approaches, and they carry different semantics.
According to the BPMN specification, a Service Task is a task that uses some sort of service, which could be a web service or an automated application. That is, it is usually used when the modeller doesn't want to decompose some process and intends to outsource it to an external tool or agent.
Another cup of tea is the Sub-Process, which is typically used when you want to wrap some complex piece of workflow for reuse, or when that piece of workflow can be decomposed into sub-elements.
In your use case, a Sub-Process is the tool of choice. It is highly adjustable, transparent, and maintainable. For example, inside the sub-process you can use a business rules engine for your condition parameter (order quantity) and flexibly adjust its value on the fly.
You can learn the differences between these approaches in greater detail from this blog.
There is a technique of expressing system tasks/decisions via a dedicated participant/lane; all system tasks are then collocated in a system lane.
System tasks (service tasks in BPMN) are usually done on behalf of an actor, so in my opinion it is useful to position them in the lane for that actor.
Usually such a design also helps keep the diagram easy to read by limiting the number of transitions between the "user" lanes and the "system" lane.

SpringXD Job split flow steps running in separate containers in distributed mode

I am aware of the nested-job support work (XD-1972) and am looking forward to it. A question regarding split-flow support: is there a plan to support running parallel steps, as defined in split flows, in separate containers?
Would it be as simple as providing a custom implementation of a proper taskExecutor, or is it something more involved?
I'm not aware of support for executing splits across multiple containers being on the roadmap currently. Given the orchestration needs of something like that, I'd probably recommend a more composed approach anyway.
A custom `TaskExecutor` could be used to farm out the work, but it would be pretty specialized. Each step within the flows of a split is executed within the scope of a job. That scope (and all its rights and responsibilities) would need to be carried over to the "child" containers.

What's a unique, persistent alternative to MAC address?

I need to be able to repeatably, non-randomly, uniquely identify a server host, which may be arbitrarily virtualized and over which I have no control.
A MAC address doesn't work because in some virtualized environments, network interfaces don't have hardware addresses.
Generating a state file and saving it to disk doesn't work because the virtual machine may be cloned, thus duplicating the file.
The server's SSH host keys may be a candidate. They can be cloned like a state file, but in practice they generally aren't, because cloning them is such a security problem that it's a mistake rarely made.
There's also /var/lib/dbus/machine-id, but that's dependent on dbus. (Thanks Preetam).
There's a cpuid but that's apparently deprecated. (Thanks Bruno Aguirre on Twitter).
Hostname is worth considering. Many systems like Chef already require unique hostnames. (Thanks Alfie John)
I'd like the solution to persist a long time, certainly across server reboots and software restarts. Ultimately, I also know that users of my software will deprecate a host and want to replace it with another while keeping continuity of the associated data, so there are reasons a UUID might be considered mutable over the long term. But I don't particularly want a host to start considering itself unknown and re-register itself for no reason.
Are there any alternative persistent, unique identifiers for a host?
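For reference, the candidates mentioned above can be gathered like this on a typical Linux host (a stdlib-only sketch; the paths vary by distribution):

    # Sketch of collecting the candidate identifiers discussed above; each
    # one carries the caveats noted in the question.
    import socket
    import uuid

    def candidate_ids():
        ids = {
            "hostname": socket.gethostname(),
            "mac": uuid.getnode(),  # may be absent/randomized when virtualized
        }
        try:
            with open("/var/lib/dbus/machine-id") as f:
                ids["machine-id"] = f.read().strip()  # depends on dbus
        except FileNotFoundError:
            pass
        try:
            with open("/etc/ssh/ssh_host_ed25519_key.pub") as f:
                ids["ssh-host-key"] = f.read().strip()  # cloned with the disk
        except FileNotFoundError:
            pass
        return ids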
It really depends on what is meant by "persistent". For example, two VMs can't each open the same network socket to you, so even if they are bit-level clones of each other it is possible to tell them apart.
So, all that is required is sufficient information to tell the machines apart for whatever the duration of the persistence is.
If the duration of the persistence is the length of a network connection, then you don't need any identifiers at all -- the sockets themselves are unique.
If the persistence needs to be longer -- say, for the length of a boot -- then you can regenerate UUIDs whenever the system boots. (Note that a VM that is cloned would still have to reboot, unless you're hot-copying it.)
If it needs to be longer than that -- say, indefinitely -- then you can generate a UUID identifier on boot and save it to disk, but only use this as part of the identifying information of the machine. If the virtual machine is subsequently cloned, you will know this since you will have two machines reporting the same ID from different sources -- for instance, two different network sockets, different boot times, etc. Since you can tell them apart, you have enough information to differentiate the two cloned machines, which means you can take a subsequent action that forces further differentiation, like instructing each machine to regenerate its state file.
Ultimately, if a machine is perfectly cloned, then by definition you cannot tell which one was the "real one" to begin with, only that there are now two distinguishable machines.
Implying that you can tell the difference between the "real one" and the "cloned one" means that there is some state you can use to record the difference between the two, like the timestamp of when the virtual machine itself was created, in which case you can incorporate that into the state record.
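A minimal sketch of that bookkeeping, assuming a hypothetical registry keyed by reported ID: a clone is flagged once the same ID is observed from two different sources.

    # Sketch of server-side clone detection via ID-plus-source, as described
    # above. The registry shape and the source tuples are hypothetical.
    registry = {}  # machine_id -> set of observed sources (peer addresses)

    def observe(machine_id, source):
        """Record a report; return True if this ID now looks cloned."""
        sources = registry.setdefault(machine_id, set())
        sources.add(source)
        # In practice you would also compare boot times etc. before telling
        # each machine to regenerate its state file.
        return len(sources) > 1

    observe("0b0e-example-id", ("10.0.0.5", 42100))           # first sighting: False
    cloned = observe("0b0e-example-id", ("10.0.0.7", 42188))  # second source: True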
It looks like simple solutions have been ruled out.
So that could lead to complex solutions, like this protocol:
- Client sends tuple [ MAC addr, SSH public host key, sequence number ]
- If server receives this tuple as expected, server and client both increment sequence number.
- Otherwise server must determine what happened (was client cloned? did client move?), perhaps reaching a tentative conclusion and alerting a human to verify it.
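A rough client-side sketch of that protocol; send_tuple() stands in for the real transport and is hypothetical, as is the state-file path.

    # Client side of the sequence-number protocol sketched above.
    import uuid

    SEQ_FILE = "/var/lib/myagent/sequence"          # hypothetical state file
    HOST_KEY = "/etc/ssh/ssh_host_ed25519_key.pub"  # standard OpenSSH path

    def read_identity():
        mac = uuid.getnode()  # MAC address as a 48-bit integer
        with open(HOST_KEY) as f:
            host_key = f.read().strip()
        try:
            with open(SEQ_FILE) as f:
                seq = int(f.read())
        except FileNotFoundError:
            seq = 0
        return mac, host_key, seq

    def report(send_tuple):
        """Send [MAC, host key, seq]; bump seq only if the server agreed."""
        mac, host_key, seq = read_identity()
        if send_tuple((mac, host_key, seq)):  # True: tuple was as expected
            with open(SEQ_FILE, "w") as f:
                f.write(str(seq + 1))
        # Otherwise the server investigates a possible clone or move.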
I don't think there is a straightforward "use X" solution based on the info available, but here are some general suggestions that might get you to a better spot.
If cloning from a "gold image", consider using some "first boot" logic to generate a unique ID; a sketch follows this list. Configuration management systems like Chef, Puppet, or CFEngine provide some scaffolding to achieve this.
Consider a global state manager like ZooKeeper, specifically its atomic counter functionality. The same system could get a new ID over time, but it would be unique.
This Stack Overflow question might also give you some other direction; it references Twitter's approach to a similar problem.
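The "first boot" idea from the first suggestion can be as small as this; the file path is hypothetical, and note that cloning the disk clones the file too.

    # Minimal "first boot" sketch: generate an ID once, then reuse it.
    import os
    import uuid

    ID_FILE = "/etc/machine-uuid"  # hypothetical location; requires root

    def get_machine_id():
        if not os.path.exists(ID_FILE):
            with open(ID_FILE, "w") as f:
                f.write(str(uuid.uuid4()))
        with open(ID_FILE) as f:
            return f.read().strip()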
If I understand correctly, you want a durable, globally unique identifier under these conditions:
The OS installation can be cloned while running, so any state inside the VM won't work, and
it could be running in an arbitrary virtualization environment, so any state outside the VM won't work.
I realize this doesn't directly answer your question, but it really seems like either the design or the constraints need some substantial adjustment to accommodate a solution.