Kubernetes orchestration depending on the number of rows / records / input files

The requirement is to orchestrate ETL containers depending on the number of records present in the source system (SQL Server / Google Analytics / SaaS APIs / CSV files).
As a use case: an ETL job has to process 50K records stored in SQL Server, but it takes considerable time for a single server/node to execute this job, since that one server has to connect to SQL Server, fetch the data, and process the records.
The problem is how to orchestrate this ETL job in Kubernetes so that it scales the containers up and down depending on the number of records/input. In the case above, if there are 50K records to process in parallel, Kubernetes should scale up the containers, process the records, and then scale back down.

You would generally use a queue of some kind and a Horizontal Pod Autoscaler (HPA) that watches the queue size and adjusts the number of queue-consumer replicas automatically. The specifics depend on the exact tools you use.
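For example, assuming the queue backlog is already exposed to Kubernetes through an external-metrics adapter (the adapter, the metric name, and the queue label below are illustrative assumptions, not part of the question), the HPA could look roughly like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: etl-worker-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: etl-worker              # the Deployment running the queue consumers
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready    # assumed metric exposed by a metrics adapter
        selector:
          matchLabels:
            queue: etl-records        # assumed label identifying the queue
      target:
        type: AverageValue
        averageValue: "1000"          # aim for ~1000 pending records per worker pod
```

The records themselves are still pulled from the queue by the workers; the HPA only varies how many consumer pods run, scaling up for a 50K-record backlog and back down once it drains.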

Related

How to estimate RAM and CPU per Kubernetes Pod for a Spring Batch processing job?

I'm trying to estimate hardware resources for a Kubernetes Cluster to be able to handle the following scenarios:
On a daily basis I need to read 46.3 million XML messages, roughly 10 KB each, from a queue and then insert them into a Spark instance and a Sybase DB instance. I need to come up with an estimate of how many pods I will need to process this amount of data, and how much RAM and how many vCPUs will be required per pod, in order to determine the characteristics of the nodes of the cluster. The reason behind all this is that we have some budget restrictions, and we need an idea of the sizing before starting the corresponding development.
The second scenario is the same as the one already described but 18.65 times bigger, i.e. 833.33 million XML messages per day. This is expected to be the case within a couple of years.
So far we plan to use Spring Batch with partitioned steps. I need guidance on how to determine the ideal Spring Batch configuration, the required RAM and CPU per pod, and the number of pods.
I will greatly appreciate any comments from your side.
Thanks in advance.
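As a rough frame of reference using only the volumes stated above: 46,300,000 messages/day ÷ 86,400 s ≈ 536 messages per second sustained, and 833,330,000 messages/day ÷ 86,400 s ≈ 9,645 messages per second. Any pod count then follows from measuring the throughput of a single pod (one partition of the Spring Batch job) against a representative workload and dividing these rates by that figure, with headroom for peaks and retries.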

Dynamic number of replicas in a Kubernetes cron-job

I've been looking for days for a way to set up a cron job with a dynamic number of jobs.
I've read all these solutions, and it seems that in order to initialise a dynamic number of jobs I need to do it manually with a script and a job template, but I need it to be automatic.
A bit of context:
I have a database / message queue / whatever else can store "items"
I would like to start a job (i.e. a single replica of a container) every 5 minutes to process each item
So, let's say there is a Kafka topic / a DB table / a folder containing 5 records / rows / files; I would like Kubernetes to start 5 replicas of the job (via the cron job) automatically. Five minutes later there might be 2 items, so Kubernetes would start just 2 replicas.
The most feasible solution seems to be using a static number of pods and making them process multiple items, but I feel there is a better way to accomplish this while keeping it inside Kubernetes, which I can't figure out due to my lack of experience. 🤔
What would you do to solve this problem?
P.S. Sorry for my English.
There are two ways I can think of:
1. Using a CronJob that is parallelised (one work item per pod, or several work items per pod). This is, more or less, what you're trying to achieve.
2. Using a data-processing application. This, I believe, is the recommended approach.
Why and Why Not CronJobs
For (1), there are a few things that I would like to mention. There is no upside to having multiple Job/CronJob items when you are trying to perform the same operation from all of them. You think you are getting parallelism, but not really; you are only increasing management overhead. If your workload grows too large (which it will), there will be too many Job objects in the cluster and the API server will be slowed down drastically.
Job and CronJob items are meant for stand-alone work items that need to be performed regularly, i.e. housekeeping tasks, so selecting CronJobs for data processing is a very bad idea. Even if you run a parallelized set of pods (as provided here and here in the docs, like you mentioned), it is still best to have a single Job handle all the pods working on the same work item. So you should not be thinking in terms of "scaling Jobs"; think of scaling Pods instead. If you really want to move ahead with the Job and CronJob mechanisms, go ahead: the message-queue-based design is your best bet, but you will have to reinvent a lot of wheels to get it to work (read below why that is the case).
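As a rough sketch of that message-queue-based design (the image name and queue endpoint below are hypothetical), a single Job runs several worker pods in parallel, each pulling items until the queue is empty:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items          # hypothetical name
spec:
  parallelism: 5               # how many worker pods pull from the queue at once
  # 'completions' is deliberately omitted: each worker exits once the queue is
  # empty, and the Job is complete when all workers have exited successfully.
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: registry.example.com/item-worker:latest   # hypothetical image
        env:
        - name: QUEUE_URL                                # hypothetical queue endpoint
          value: amqp://queue.example.com
```

A CronJob would then only be responsible for creating such a Job every 5 minutes; whether the queue holds 5 items or 2, the workers simply drain it and exit, so the number of Job objects never has to track the number of items.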
Recommended Solution
For (2), I only suggest this because I see you are trying to perform data processing, and doing that with a one-off mechanism like a Job is not a good idea (Jobs are basically stateless: they perform an operation that can simply be repeated without any repercussions). Say you start a pod and it fails mid-processing: how will the other pods know that this item was not processed successfully? If the pod dies, the Job cannot keep track of the items in your data store, since the Job is not aware of the nature of the work you're performing. It is therefore natural to pursue a solution whose components are specifically designed for data processing.
You will want to look into a system that understands the nature of your data: how to keep track of which items have been processed successfully, how to restart a new Pod with the same input item as the Pod that just crashed, and so on. This is a lot of application- and use-case-specific functionality that is best served by an operator, or by a CustomResource and a controller. And since this is not a new problem, there are plenty of solutions out there that can do this the best way for you.
The best course of action would be to have such a system in place, deployed as a Deployment with autoscaling enabled; that gives you real parallelism and is also well suited to batch data processing.
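Concretely, the consumer side of such a system usually ends up as nothing more exotic than a Deployment that an autoscaler (for example an HPA driven by queue depth, as sketched earlier on this page) can target; all names below are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: item-processor                     # hypothetical name
spec:
  replicas: 1                              # baseline; the autoscaler adjusts this
  selector:
    matchLabels:
      app: item-processor
  template:
    metadata:
      labels:
        app: item-processor
    spec:
      containers:
      - name: processor
        image: registry.example.com/item-processor:latest   # hypothetical image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
```

Scaling that Deployment up and down is what gives you the pod-level parallelism discussed next.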
And remember, when we talk about scaling in Kubernetes, it is always the pods that scale, not containers, not deployments, not services. Always Pods. That is because at the bottom of the chain there is always a Pod somewhere doing the work, whether it is owned by a Job, a Deployment, or a DaemonSet, or fronted by a Service, or whatever. And it is a bad idea to run multiple application containers in one Pod, for many reasons (the sidecar and adapter patterns are just helpers; they don't run the application).
Perhaps this blog that discusses data processing in Kubernetes can help.

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers (ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
The cluster often goes into restricted mode:
All calls to the cluster get rejected with a 502 error.
Even with 30 available slots for index_parallel tasks, the cluster only runs 10 at a time and the other tasks go into a waiting state.
Loader task submission time has been increasing monotonically from 1s, 2s, ..., 6s, ..., 10s (we submit a job to load the data in S3); after recycling the cluster the submission time decreases and then increases again over a period of time.
We submit around 100 jobs per minute, but we need to scale to 300 to catch up with our current incoming load.
Could someone help us with our queries:
How should we tune the specs of the cluster?
What parameters should be optimized to run the maximum number of tasks in parallel without increasing the load on the master nodes?
Why is the loader task submission time increasing, and what parameters should be monitored here?
At 100 jobs per minute, the Overlord is probably being overloaded.
The Overlord initiates a job by communicating with the Middle Managers across the cluster. It defines the tasks that each Middle Manager will need to complete and it monitors task progress until completion. The startup of each job has some overhead, so that many jobs will likely keep the Overlord busy and prevent it from processing the other jobs you are submitting. This might explain why the job submission time increases over time. You could increase the resources on the Overlord, but it sounds like there may be a better way to ingest the data.
The recommendation would be to use far fewer jobs and have each job do more work.
If the flow of data is as continuous as you describe, perhaps a Kafka queue would be the best target, followed by a Druid Kafka ingestion job, which is fully scalable.
If you need to do batch ingestion, a single index_parallel job that reads many files would be much more efficient than many small jobs.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments, which is not ideal for query performance. Here is some info on how to think about segment size optimization, which I think might help.

How to structure an elastic Azure Batch application?

I am evaluating Batch for a project and, while it seems like it will do what I'm looking for, I'm not sure if what I am assuming is really correct.
I have what is basically a job runner from a queue. The current solution works but when the pool of nodes scales down, it just blindly kills off machines. I am looking for something that, when scaling down, will allow currently-running jobs to complete and then remove the node(s) from the pool. I also want to preemptively increase the pool size if a spike is likely to occur (and not have those nodes shut down). I can adjust the pool size externally if that makes sense (seems like the best option so far).
My current idea is to have one pool with one job & task per node, and that task listens to a queue in a loop for messages and processes them. After an iteration count and/or time limit, it shuts down, removing that node from the pool. If the pool size didn't change, I would like to replace that node with a new one. If the pool was shrunk, it should just go away. If the pool size increases, new nodes should run and start up the task.
I'm not planning on running something that continually adds pools, or nodes to the pool, or tasks to a job, though I will probably have something that sets the pool size periodically based on queue length or something similar. What I would rather not do is something like "there are 10 things in the queue, add a pool with x nodes, then delete it".
Is this possible or are my expectations incorrect? So far, from reading the docs, it seems like it should be doable, and I have a simple task working, but I'm not sure about the scaling mechanics or exactly how to structure the tasks/jobs/pools.
Here's one possible way to lean into the strengths of Azure Batch and achieve what you've described.
Create your job with a JobManagerTask that monitors your queue for incoming work and adds a new Batch Task for each item of your workload. Each task will process a single piece of work, then exit.
The Batch Scheduler will then take care of allocating tasks to compute nodes. It can also take care of retrying tasks that fail, and so on.
Configure your pool with an AutoScale formula to dynamically resize your pool to meet your load. Your formula can specify taskcompletion to ensure tasks get to complete before any one compute node is removed.
If your workload peaks are predictable (say, 9am every day) your AutoScale expression could scale up your pool in anticipation. If those spikes are not predicable, your external monitoring (or your JobManager) can change the AutoScale expression at any time to suit.
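For instance, a minimal autoscale formula along these lines (written in Batch's autoscale-formula syntax; the cap of 25 nodes is an arbitrary placeholder) sizes the pool from the pending-task count and only removes a node once its current task has completed:

```
// Pending = tasks that are queued or currently running
$pending = max(0, $PendingTasks.GetSample(1));
// Size the pool to the backlog, capped at 25 nodes
$TargetDedicatedNodes = min(25, $pending);
// Never evict a node until the task it is running has finished
$NodeDeallocationOption = taskcompletion;
```

Batch re-evaluates the formula on the interval you configure for the pool, so the pool tracks the task backlog without any external orchestration.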
If appropriate, your job manager can terminate once all the required tasks have been added; set onAllTasksComplete to terminatejob, ensuring your job is completed after all your tasks have finished.
A single pool can process tasks from multiple jobs, so if you have multiple concurrent workloads, they could share the same pool. You can give jobs different values for priority if you want certain jobs to be processed first.

Apache Geode scaling

I'm trying to measure the performance of Geode.
I have 3 identical hosts to test it.
I created a partitioned region.
I started a Geode cluster with one server.
I do "get" and "put" operations in a loop.
I get about 50,000 ops/sec.
Then I started a cluster with three Geode nodes.
I do get and put operations in a loop.
I get the same 50,000 ops/sec.
I would expect to see increased performance, but it is surprisingly the same for the 1-node cluster and the 3-node cluster.
Could you please help? What settings could be changed in order to get horizontal scalability?
Thank you.
Well, you just got horizontal scalability for data storage at no loss of throughput :)
As for horizontally scaling your throughput: I suspect your workload was simply not enough to max out a single server. You need to start multiple clients (or multiple threads in a single client) against a single server until adding new clients no longer increases throughput. At that point, start a new server. The new server should allow you to add more clients and horizontally scale your throughput.
You may find the YCSB benchmark useful; it allows you to start multiple threads in a client to perform operations.
You should set up an environment where you see performance degrade with a single node, and then run the same test against the partitioned setup.