Resource Maintenance - simulation

I have a resource pool with 10 units, and I want only 1 unit at a time to go for scheduled maintenance. But the Maintenance option under the resource tab is sending all 10 units for maintenance at the same time. Can I select only 1 at a time? I just want 1 of the 10 to go under maintenance according to the schedule. How can I do this?

Just split the 10 resources into two resource pools, one with a maintenance schedule and one without.
Seize from both pools for all blocks where you were seizing from the original one.

Please show your setup for the resource pool. By default, maintenance is created for each unit in the resource pool individually; depending on your settings, they should not all go for maintenance at the same time.
See simple example below. I changed the default maintenance settings slightly to get a better example.
I never get a situation where all 10 go for maintenance at the same time. See below; the time plot is simply plotting resourcePool.idle().
If, however, you use deterministic values instead of distributions in your maintenance setup, you will find that they all go on and off maintenance at exactly the same time.

Related

Dynamic number of replicas in a Kubernetes cron-job

I've been looking for days for a way to set up a cron-job with a dynamic number of jobs.
I've read all these solutions and it seems that, in order to initialise a dynamic number of jobs, I need to do it manually with a script and a job template, but I need it to be automatic.
A bit of context:
I have a database / message queue / whatever that can store "items"
I would like to start a job (so a single replica of a container) every 5 minutes to process each item
So, let's say there is a Kafka topic / a db table / a folder containing 5 records / rows / files, I would like Kubernetes to start 5 replicas of the job (with the cron-job) automatically. After 5 minutes, there will be 2 items, so Kubernetes will just start 2 replicas.
The most feasible solution seems to be using a static number of pods and making them process multiple items, but I feel like there is a better way to accomplish this while keeping it inside Kubernetes, which I can't figure out due to my lack of experience. 🤔
What would you do to solve this problem?
P.S. Sorry for my English.
There are two ways I can think of:
1. Using a CronJob that is parallelised (1 work-item/pod or 1+ work-items/pod). This is what you're trying to achieve. Somewhat.
2. Using a data processing application. This, I believe, is the recommended approach.
Why and Why Not CronJobs
For (1), there are a few things that I would like to mention. There is no upside to having multiple Job/CronJob items when you are trying to perform the same operation from all of them. You think you are getting parallelism, but not really; you are only increasing management overhead. If your workload grows too large (which it will), there will be too many Job objects in the cluster and the API server will be slowed down drastically.
Job and CronJob items are only for stand-alone work items that need to be performed regularly. They are house-keeping tasks. So, selecting CronJobs for data processing is a very bad idea. Even if you run a parallelized set of pods (as provided here and here in the docs like you mentioned), it would still be best to have a single Job handle all the pods that are working on the same work-item. So, you should not be thinking of "scaling Jobs" in those terms. Instead, think of scaling Pods. If you really want to move ahead with the Job and CronJob mechanisms, go ahead; the message-queue-based design is your best bet. But you will have to reinvent a lot of wheels to get it to work (read below for why that is the case).
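As a minimal sketch of that message-queue pattern, assuming a Redis list as the queue: the CronJob would create a Job with parallelism set to however many pods you want, and each pod runs a worker like the one below until the queue is drained, then exits so the Job can complete. The queue name, Redis host and processing logic are placeholders, not a specific recommended setup.

    # Work-queue worker sketch: each Pod claims items from a shared queue until
    # it is empty, then exits 0 so the owning Job can complete.
    # "work_queue" and the QUEUE_HOST default are hypothetical names.
    import os
    import redis

    def process(item: bytes) -> None:
        print(f"processing {item!r}")  # replace with the real per-item work

    def main() -> None:
        queue = redis.Redis(host=os.environ.get("QUEUE_HOST", "redis"))
        while True:
            item = queue.lpop("work_queue")  # atomically claim one work item
            if item is None:
                break  # queue drained: let the Pod exit
            process(item)

    if __name__ == "__main__":
        main()

This keeps the number of Job objects fixed (one per CronJob run) while the pods simply stop when there is nothing left to process.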
Recommended Solution
For (2), I only say this since I see you are trying to perform data processing, and doing this with a one-off mechanism like a Job is not a good idea (Jobs are basically stateless, since they perform an operation that can be repeated simply without any repercussions). Say you start a pod and it fails processing; how will other pods know that this item was not processed successfully? What if the pod dies? The Job cannot keep track of the items in your data store, since the Job is not aware of the nature of the work you're performing. Therefore, it is natural for you to pursue a solution where the system components are specifically designed for data processing.
You will want to look into a system that understands the nature of your data: how to keep track of the processing queues that have finished successfully, how to restart a new Pod with the same item as input when the previous Pod crashes, etc. This is a lot of application/use-case-specific functionality that is best served through the means of an operator, or a CustomResource and a controller. And obviously, since this is not a new problem, there are plenty of solutions out there that can do this well for you.
The best course of action would be to have that system in place, deployed as a Deployment with auto-scaling enabled; that way you achieve real parallelism that is also best suited for data-processing batch jobs.
And remember, when we talk about scaling in Kubernetes, it is always the pods that scale, not containers, not deployments, not services. Always Pods. That is because at the bottom of the chain there is always a Pod somewhere that is working on something, whether it is owned by a Job, a Deployment, a Service, a DaemonSet or whatever. And it is obviously a bad idea to have multiple application containers in a Pod, for many reasons. (Side-car and adapter patterns are just helpers; they don't run the application.)
Perhaps this blog that discusses data processing in Kubernetes can help.

Determine ideal number of workers and EC2 sizing for master

I have a requirement to use locust to simulate 20,000 (and higher) users in a 10 minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should run on each EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal. The master, however, suffers big time: a 4-CPU, 16 GB RAM system acting as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each locust master process has grown to 12 GB of virtual RAM. OK, so that EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront, randomly selected in one of the tasks. This means the more open connections to CloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile device and there are a lot of factors affecting the download speed for each mobile, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other testing, since with bandwidth or system resource exhaustion every other test would be negatively impacted.
Using this approach I successfully executed a 10,000-user test in a ten-minute run time. I used 4 EC2 t2.xlarge instances with 4 workers per instance. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
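As a rough illustration of that GET-to-HEAD swap, here is a hedged locustfile sketch; the CloudFront host, file keys and authorization header are placeholders, not the actual test plan described above.

    # Sketch of the GET -> HEAD swap: HEAD exercises the CloudFront/Lambda@Edge
    # path and measures response time without pulling the file body, so the
    # load generators' bandwidth is not exhausted. Host, paths and the auth
    # header are placeholders.
    import random
    from locust import HttpUser, task, between

    FILES = ["assets/a.bin", "assets/b.bin", "assets/c.bin"]  # hypothetical keys

    class MobileAppUser(HttpUser):
        host = "https://example.cloudfront.net"  # placeholder distribution
        wait_time = between(1, 3)

        @task
        def check_download(self):
            self.client.head(
                "/" + random.choice(FILES),
                headers={"Authorization": "token-from-earlier-step"},
                name="/download [HEAD]",
            )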
The answer for the question in the title is: "It depends"
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests).
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noise. In general, start low and work your way up)
Increase the number of loadgens
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.
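For reference, switching to FastHttpUser is usually just a change of base class; a minimal sketch (with a placeholder endpoint) looks like this:

    # FastHttpUser uses a much lighter HTTP client than the default HttpUser,
    # so a single worker can simulate many more users per CPU core.
    # The endpoint path below is a placeholder.
    from locust import task, between
    from locust.contrib.fasthttp import FastHttpUser

    class ApiUser(FastHttpUser):
        wait_time = between(1, 3)

        @task
        def call_api(self):
            self.client.get("/api/v1/ping")  # placeholder endpoint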

How to structure an elastic Azure Batch application?

I am evaluating Batch for a project and, while it seems like it will do what I'm looking for, I'm not sure if what I am assuming is really correct.
I have what is basically a job runner from a queue. The current solution works but when the pool of nodes scales down, it just blindly kills off machines. I am looking for something that, when scaling down, will allow currently-running jobs to complete and then remove the node(s) from the pool. I also want to preemptively increase the pool size if a spike is likely to occur (and not have those nodes shut down). I can adjust the pool size externally if that makes sense (seems like the best option so far).
My current idea is to have one pool with one job & task per node, and that task listens to a queue in a loop for messages and processes them. After an iteration count and/or time limit, it shuts down, removing that node from the pool. If the pool size didn't change, I would like to replace that node with a new one. If the pool was shrunk, it should just go away. If the pool size increases, new nodes should run and start up the task.
I'm not planning on running something that continually adds pools, nodes to a pool, or tasks to a job, though I will probably have something that sets the pool size periodically based on queue length or something similar. What I would rather not do is something like "there are 10 things in the queue, add a pool with x nodes, then delete it".
Is this possible or are my expectations incorrect? So far, from reading the docs, it seems like it should be doable, and I have a simple task working, but I'm not sure about the scaling mechanics or exactly how to structure the tasks/jobs/pools.
Here's one possible way to lean into the strengths of Azure Batch and achieve what you've described.
Create your job with a JobManagerTask that monitors your queue for incoming work and adds a new Batch Task for each item of your workload. Each task will process a single piece of work, then exit.
The Batch Scheduler will then take care of allocating tasks to compute nodes. It can also take care of retrying tasks that fail, and so on.
Configure your pool with an AutoScale formula to dynamically resize your pool to meet your load. Your formula can set the node deallocation option to taskcompletion to ensure running tasks complete before any compute node is removed.
If your workload peaks are predictable (say, 9am every day), your AutoScale expression could scale up your pool in anticipation. If those spikes are not predictable, your external monitoring (or your JobManager) can change the AutoScale expression at any time to suit.
If appropriate, your job manager can terminate once all the required tasks have been added; set onAllTasksComplete to terminateJob, ensuring your job completes after all your tasks have finished.
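A hedged sketch of how these pieces might fit together with the azure-batch Python SDK is below; the account details, pool and job IDs, VM size and image, autoscale limits and the job-manager command line are all assumptions for illustration, not a tested configuration.

    # Sketch only: an autoscaled pool plus a queue-driven job, per the answer
    # above. All IDs, sizes, images and the command line are placeholders.
    from datetime import timedelta
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials
    import azure.batch.models as batchmodels

    client = BatchServiceClient(
        SharedKeyCredentials("<account>", "<key>"),
        "https://<account>.<region>.batch.azure.com",
    )

    # Autoscale formula: size the pool from the pending-task count, and only
    # remove a node once its running tasks have completed.
    formula = """
    $tasks = max(0, $PendingTasks.GetSample(1));
    $TargetDedicatedNodes = min(20, $tasks);
    $NodeDeallocationOption = taskcompletion;
    """

    client.pool.add(batchmodels.PoolAddParameter(
        id="worker-pool",
        vm_size="standard_d2s_v3",
        virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
            image_reference=batchmodels.ImageReference(
                publisher="canonical",
                offer="0001-com-ubuntu-server-focal",
                sku="20_04-lts",
            ),
            node_agent_sku_id="batch.node.ubuntu 20.04",
        ),
        enable_auto_scale=True,
        auto_scale_formula=formula,
        auto_scale_evaluation_interval=timedelta(minutes=5),
    ))

    # Job whose JobManagerTask watches the queue and adds one Batch task per
    # work item; the job terminates itself once every task has finished.
    client.job.add(batchmodels.JobAddParameter(
        id="queue-driven-job",
        pool_info=batchmodels.PoolInformation(pool_id="worker-pool"),
        job_manager_task=batchmodels.JobManagerTask(
            id="job-manager",
            command_line="python add_tasks_from_queue.py",  # hypothetical script
            kill_job_on_completion=False,  # let remaining tasks finish first
        ),
        on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job,
    ))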
A single pool can process tasks from multiple jobs, so if you have multiple concurrent workloads, they could share the same pool. You can give jobs different values for priority if you want certain jobs to be processed first.

How to troubleshoot and reduce communication overhead on Rockwell ControlLogix

Need help. We have a PLC whose CPU keeps getting maxed out. We've already upgraded it once; now we need to work on optimizing it.
We have over 50 outgoing MSG instructions, 60 incoming, and 103 Ethernet devices (flow meters, drives, etc.). I've gone through and tried to make sure everything that can be cached is cached, only instructions that are currently needed are running, and communications to the same PLC happen in the same scan, but I haven't made a dent.
I'm having trouble identifying which instructions are significant. It seems the connections get consolidated, so lots of MSGs shouldn't be too big of a problem. I'm considering Produced & Consumed tags, but our team isn't very familiar with them, and I believe you have to do a download to modify them, which is a problem. Our I/O module RPIs are all set to around 200 ms (up from 5 ms), but that didn't seem to make a difference.
We have a shutdown this weekend and I plan on disabling everything and turning it back on one part at a time to see where the load is really coming from.
Does anyone have any suggestions? The task monitor doesn't have a lot of detail that I can understand, i.e. it's either too summarized or too instantaneous for me to make heads or tails of it. Here are a couple of screens from the Task Monitor to shed some light on what I'm seeing.
The first question that comes to mind is: are you using the Continuous Task, or is everything in Periodic tasks?
I had a similar issue many years ago with a CLX. Rockwell suggested increasing the System Overhead Time Slice to around 40 to 50%. The default is 20%.
Some details:
Look at the System Overhead Time Slice (go to Advanced tab under Controller Properties). Default is 20%. This determines the time the controller spends running its background tasks (communications, messaging, ASCII) relative to running your continuous task.
From Rockwell:
For example, at 25%, your continuous task accrues 3 ms of run time. Then the background tasks can accrue up to 1 ms of run time, then the cycle repeats. Note that the allotted time is interrupted, but not reduced, by higher priority tasks (motion, user periodic or event tasks).
Here is a detailed Word Doc from Rockwell:
https://rockwellautomation.custhelp.com/ci/fattach/get/162759/
And here is a detailed KB from Rockwell:
https://rockwellautomation.custhelp.com/app/answers/detail/a_id/42964

How to evenly distribute the DAGs runs execution during the day

I have a huge number of DAGs (>>100,000) that should each run once a day.
In order to avoid big spikes in processing at certain times during the day (and for other reasons), I would like the actual DAG runs to be distributed evenly throughout the day.
Do I need to do this programmatically myself by distributing the start_date values throughout the day, or is there a better way where Airflow does that for me?
One possible solution: if you create one or more pools, each with a limited number of slots, you can effectively set a "maximum parallelism" of execution, and tasks will wait until a slot is available. However, it may not give you quite the flexibility you need.
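For what it's worth, assigning a DAG's tasks to a pool is just one extra argument on the operator. A minimal sketch, with an assumed pool name and slot count (the pool itself created once via the UI or CLI), could look like this:

    # Minimal sketch of the pool approach. "daily_throttle" and its 50 slots are
    # assumptions; the pool would be created once, e.g.:
    #   airflow pools set daily_throttle 50 "limit concurrent daily runs"
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="item_12345",                 # one of the many generated DAGs
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="process",
            bash_command="echo processing item 12345",
            pool="daily_throttle",           # waits here until a slot frees up
        )

Tasks queued beyond the slot limit simply wait, which caps the spikes but does not by itself spread the scheduled run times across the day.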