Amazon ECS, capacity provider not able to provide required capacity

I want to create an ECS cluster with two capacity providers:
standard, which uses on-demand instances
spot, which uses spot instances
Each provider will be linked to an auto-scaling group, and ECS will handle scaling for the above providers.
When defining a service, I am going to use a custom capacity provider strategy. A sample configuration could be as follows:
base: 2 for the standard provider 
weight: 0 for the standard provider, 1 for the spot provider 
If I am not mistaken, with that configuration my service should place 2 tasks on the standard (on-demand) provider and the rest on the spot one.
Assume I want to run 10 tasks under my service.
In the happy path, 2 of them run on my standard provider and 8 on the spot one.
Here is the question: how is the unhappy scenario handled when spot instances are not available? Will my service contain only the 2 tasks that were placed on the on-demand instances?
If yes, how can I dynamically adjust my service to temporarily use only the on-demand provider?
Or maybe the above configuration doesn't make any sense, and there is a better way to utilize spot instances with ECS to cut costs?
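A minimal sketch of the strategy described above, using the AWS SDK for Java v2 (the cluster, service, task definition and capacity provider names are placeholders for whatever the providers linked to the auto-scaling groups are actually called):

```java
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.CapacityProviderStrategyItem;
import software.amazon.awssdk.services.ecs.model.CreateServiceRequest;

public class CreateServiceWithStrategy {
    public static void main(String[] args) {
        try (EcsClient ecs = EcsClient.create()) {
            ecs.createService(CreateServiceRequest.builder()
                    .cluster("my-cluster")                    // placeholder names
                    .serviceName("my-service")
                    .taskDefinition("my-task:1")
                    .desiredCount(10)
                    .capacityProviderStrategy(
                            CapacityProviderStrategyItem.builder()
                                    .capacityProvider("standard") // on-demand ASG provider
                                    .base(2)                      // always place 2 tasks here
                                    .weight(0)
                                    .build(),
                            CapacityProviderStrategyItem.builder()
                                    .capacityProvider("spot")     // spot ASG provider
                                    .weight(1)                    // everything above the base
                                    .build())
                    .build());
        }
    }
}
```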

Currently, the combination of capacity providers and services does not take into account whether the instances the tasks run on are spot or on-demand: https://github.com/aws/containers-roadmap/issues/773
Your configuration seems reasonable for using spot. Assuming that you pick a range of instance types and availability zones, there is typically sufficient spot capacity. However, Amazon always states that you shouldn't run production workloads on spot :shrug:
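If you do end up needing to temporarily pin the service to the on-demand provider, one manual option (there is no automatic failover) is to update the service with a strategy that only references that provider and force a new deployment. A sketch with the AWS SDK for Java v2, assuming the same placeholder names as above:

```java
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.CapacityProviderStrategyItem;
import software.amazon.awssdk.services.ecs.model.UpdateServiceRequest;

public class PinServiceToOnDemand {
    public static void main(String[] args) {
        try (EcsClient ecs = EcsClient.create()) {
            ecs.updateService(UpdateServiceRequest.builder()
                    .cluster("my-cluster")
                    .service("my-service")
                    .capacityProviderStrategy(
                            CapacityProviderStrategyItem.builder()
                                    .capacityProvider("standard") // on-demand only for now
                                    .weight(1)
                                    .build())
                    .forceNewDeployment(true) // replace tasks so the new strategy takes effect
                    .build());
        }
    }
}
```

Switching back to the mixed strategy later is the same call with the original two-provider strategy.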

Related

Cosmos DB Change Feeds in a Kubernetes Cluster with arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. There are many documents (official and unofficial) explaining how to do this. There is one thing, though, that I cannot get to work in a reliable way: how do I deliver the same changes to multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name. It's usually in the form <deployment-name>-<random string>, e.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, the instances do not all receive the same changes (which is my requirement); only one instance will receive each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do is use the pod name as the feed name instead; then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random-string part of the name), I fear the container will grow over time, accumulating a heap of garbage. I know Cosmos can handle huge workloads, but you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse for that matter!) some protocol between my instances to vote for which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some "index" of some sort for my pods. I know stateful sets give me that, but I don't want to use stateful sets, as the pods themselves aren't really stateful (except for this particular aspect!).
There is a new Change Feed pull model (which is in preview at this time). The main difference is that with the pull model you read the feed at your own pace and keep track of the continuation tokens yourself, rather than having the processor distribute changes across instances via the lease container.
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state-storing model that can maintain the continuation tokens (or not; maybe you don't care to continue if a pod goes down and then restarts).
I would suggest that you continue to use the pod name as the unique ID. If you are concerned about sprawl of the data, you could monitor the container and devise a clean-up mechanism for the metadata.
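As a rough sketch of such a clean-up with the Azure Cosmos Java SDK (the database and container names, the "Owner" property on the lease documents, and the /id partition key are assumptions about your lease schema; the set of live pod names would come from the Kubernetes API or kubectl):

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.JsonNode;

import java.util.Set;

public class LeaseCleanup {
    // Deletes lease/metadata documents whose owning pod no longer exists.
    public static void cleanUp(Set<String> livePodNames) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint(System.getenv("COSMOS_ENDPOINT"))
                .key(System.getenv("COSMOS_KEY"))
                .buildClient();
        CosmosContainer leases = client.getDatabase("mydb").getContainer("leases");

        leases.queryItems("SELECT * FROM c", new CosmosQueryRequestOptions(), JsonNode.class)
              .forEach(doc -> {
                  // "Owner" is assumed to hold the pod name used as instance/feed name.
                  String owner = doc.path("Owner").asText("");
                  if (!owner.isEmpty() && !livePodNames.contains(owner)) {
                      leases.deleteItem(doc.path("id").asText(),
                              new PartitionKey(doc.path("id").asText()), // assumes /id partition key
                              new CosmosItemRequestOptions());
                  }
              });
    }
}
```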
In order to have at-least-once delivery, some metadata will need to be persisted somewhere to track which items have been acknowledged, the position in a partition, etc. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/re-scheduling during data flow.
As another option, Azure offers an implementation of checkpoint-based message sharing from partitioned Event Hubs via EventProcessorClient. With EventProcessorClient, a bit of metadata is also kept in a storage account.

Drools global variable initialization and scaling for performance

Thanks in advance. We are trying to adopt Drools as the rules engine in our enterprise. After evaluating basic functionality in POC mode, we are exploring further. We have the following challenges, and I am trying to validate some of the options we are considering. Any help is greatly appreciated.
Scenario 1: Say you get a USA state (TX, CA, CO, etc.) in a fact's field. Now you want the rule to check whether the state value on the fact exists in a predetermined static list of state values (say the list contains the three values TX, TN, MN).
Possible solution to Scenario 1: the static list of state values can be set as a global variable, and the rule can access the global while performing the check.
Questions on Scenario 1:
Is the possible solution to Scenario 1 the standard practice? If so, is it possible to load the value of this global variable from a database during rule engine (KIE Server) startup? If yes, could you let me know which Drools feature enables loading global variables from a database? Or should a client application (one that calls kie-server) initialize the globals instead?
Scenario 2: We want to horizontally scale the rule execution server. Say we have one rule engine server (kie-server) exposing a REST API. Can we have multiple instances running behind a load balancer to scale horizontally? Is there any other way of achieving scalability?
Q1: It depends. The usual solution for a small, rarely (if ever) changing set that is used in just a single rule is to put it into the rule, using the in operator. If you think you might have to change it or use it frequently, a global would be one way of achieving that, but you must make sure that the global is initialized before any facts are inserted.
There is nothing out-of-the-box for accessing a DB.
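A minimal sketch of doing that from the application that owns the KieSession. The JDBC URL, table name, kmodule session name and the Address fact class are all placeholders; the corresponding DRL would declare `global java.util.List validStates`, and a rule could test the field with `state memberOf validStates` (or simply `state in ("TX", "TN", "MN")` if the list is hard-coded in the rule):

```java
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StateRulesRunner {

    // Hypothetical fact class used only for illustration.
    public static class Address {
        private final String state;
        public Address(String state) { this.state = state; }
        public String getState() { return state; }
    }

    public static void main(String[] args) throws Exception {
        // Load the allowed states from a database at startup (placeholder URL and table).
        List<String> validStates = new ArrayList<>();
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://db/config", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT code FROM allowed_states")) {
            while (rs.next()) {
                validStates.add(rs.getString("code"));
            }
        }

        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        KieSession session = container.newKieSession("rulesSession"); // session name from kmodule.xml

        session.setGlobal("validStates", validStates); // must be set before inserting facts
        session.insert(new Address("TX"));
        session.fireAllRules();
        session.dispose();
    }
}
```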
Q2: A server running a Drools session is just another Java server program, so any load balancing applicable to this class of programs should apply to a Drools app as well. What exactly are you worried about?

SpringXD Job split flow steps running in separate containers in distributed mode

I am aware of the nested job support (XD-1972) work and am looking forward to it. A question regarding split flow support: is there a plan to support running parallel steps, as defined in split flows, in separate containers?
Would it be as simple as providing a custom implementation of a suitable taskExecutor, or is it something more involved?
I'm not aware of support for executing splits across multiple containers being on the roadmap currently. Given the orchestration needs of something like that, I'd probably recommend a more composed approach anyway.
A custom `TaskExecutor` could be used to farm out the work, but it would be pretty specialized. Each step within the flows of a split is executed within the scope of a job. That scope (and all its rights and responsibilities) would need to be carried over to the "child" containers.
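For context, a sketch of what the in-JVM version of a split looks like in Spring Batch (which is what an XD batch job is built on); the TaskExecutor below only provides local thread-level parallelism, and it is the hook a remote-farming implementation would have to replace, while also propagating the job/step scope mentioned above:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.support.SimpleFlow;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.core.task.TaskExecutor;

public class SplitFlowSketch {

    // step1 and step2 run in parallel, each on a thread handed out by the executor.
    public Flow splitFlow(Step step1, Step step2) {
        TaskExecutor executor = new SimpleAsyncTaskExecutor("split-");
        Flow flow1 = new FlowBuilder<SimpleFlow>("flow1").start(step1).build();
        Flow flow2 = new FlowBuilder<SimpleFlow>("flow2").start(step2).build();
        return new FlowBuilder<SimpleFlow>("split")
                .split(executor)          // a custom TaskExecutor would be swapped in here
                .add(flow1, flow2)
                .build();
    }
}
```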

UML Deployment Diagram for IaaS and PaaS Cloud Systems

I would like to model the following situation using a UML deployment diagram.
A small command-and-control machine instance is spawned on an Infrastructure as a Service cloud platform such as Amazon EC2. This instance is in turn responsible for spawning additional instances and providing them with a control script, NumberCruncher.py, either via something like S3 or directly as a start-up script parameter if the program is small enough to fit into that field. My attempt to model the situation using UML deployment diagrams, under the working assumption that a machine instance is a Node, is unsatisfying for the following reasons.
The diagram seems to suggest that there will be exactly three number-cruncher nodes. Is it possible to illustrate a multiplicity of Nodes in a deployment diagram, the way one would illustrate a multiplicity of object instances using a multi-object? If this is not possible for Nodes, then this seems to be a long-standing issue.
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
Lastly:
What about Platform as a Service? The whole "machine instance is a Node" idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
Regarding your first question:
Is there any way to show the equivalent of deployment regions / data-centres in the deployment diagram?
I generally use Notes for that.
And your second question:
What about Platform as a Service? The whole "machine instance is a Node" idea completely breaks down at that point. What on earth do you do in that case? Treat the entire PaaS provider as a single node and forget about the details?
I would say yes to your last question. And I suppose you could take more details from the definition of the deployment model and its elements, especially at the end of this paragraph:
They [Nodes] can be nested and can be connected into systems of arbitrary complexity using communication paths. Typically, Nodes represent either hardware devices or software execution environments.
and
ExecutionEnvironments represent standard software systems that application components may require at execution time.
source: http://www.omg.org/spec/UML/2.5/Beta1/

Do all cluster schedulers take array jobs, and if they do, do they set SGE_TASK_ID array id?

When using qsub to put array jobs on a cluster, the environment variable SGE_TASK_ID gets set to the task's index within the array job. I use this in a shell script that I run on the cluster, where each array task needs to do something different based on SGE_TASK_ID. Is this a common way for cluster schedulers to do this, or do they all have a different approach?
Most schedulers have a way to do this, although it can be slightly different in different setups. In TORQUE the variable is called $PBS_ARRAYID but it works the same.
Do all cluster schedulers take array jobs
No. Many do, but not all.
and if they do, do they set SGE_TASK_ID array id?
Only Grid Engine will set SGE_TASK_ID because this is simply what the variable is called in Grid Engine. Other cluster middlewares have a different name for it, with different semantics.
It's a bit unclear what you are aiming at with your question, but if you want to write a program/system that runs on many different cluster middlewares / load balancers / schedulers, you should look into DRMAA, which will abstract away variables like SGE_TASK_ID.
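If DRMAA is overkill, a small abstraction over the environment variables also goes a long way. A sketch (SGE_TASK_ID, PBS_ARRAYID and SLURM_ARRAY_TASK_ID are the names used by Grid Engine, TORQUE and Slurm respectively; the fallback value of 1 is an arbitrary choice for non-array runs):

```java
public class ArrayTaskId {

    // Resolve the array task index across a few common schedulers.
    public static int resolve() {
        String[] candidates = {"SGE_TASK_ID", "PBS_ARRAYID", "SLURM_ARRAY_TASK_ID"};
        for (String name : candidates) {
            String value = System.getenv(name);
            // Grid Engine sets SGE_TASK_ID to the literal string "undefined" for non-array jobs.
            if (value != null && !value.isEmpty() && !"undefined".equals(value)) {
                return Integer.parseInt(value);
            }
        }
        return 1; // not running as an array task
    }

    public static void main(String[] args) {
        System.out.println("Array task index: " + resolve());
    }
}
```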