Overhead in running multiple pods vs multiple (identical) processes in 1 pod? [closed] - kubernetes

Context
I am running a job-processing task (specifically, Resque) in a Kubernetes setup. This task sets up one or more workers to take job messages off a queue and process them. A typical invocation is to set your desired worker count, e.g. COUNT=8 QUEUE=* resque:work.
Of course, in Kubernetes, I am going to add autoscaling to the Deployment running this task.
There's a prevailing recommendation to run 1 process per pod (see below). My concern is that doing so can be wasteful when the process I wish to run has a built-in multi-process management system to run identical processes. I am trying to understand the theory & docs to inform decisions and experiments.
My motivating question is: is there any reason to continue setting a worker count, or does it make more sense to use only one worker process per pod? I.e., is there significant overhead per pod instance compared to letting Resque spawn multiple processes?
Question
The objective question is: where should I expect / look for overhead in running 1 process per pod vs letting 1 pod's main process spawn multiple children?
E.g., IIUC each pod runs its own copy of the OS userland and other utilities installed in the container image. So that at least is some memory overhead vs running a single-container, single-OS, multi-Resque-worker setup; is that correct? What else should I be looking at, prior to simply benchmarking a bunch of guesses, to model resource consumption for this setup?
More Context
I understand that a small process count allows for more granular scaling. I don't consider scaling at a finer resolution than, say, 4 processes at a time to be of much benefit, so I'd start there if pod overhead should be considered. Am I overthinking it, and should I forget about pod overhead and just use a worker count of 1 per pod?
This question is informed by the many "one process per pod" references out there, many of which are listed in this similar question and in a Stack Exchange question linked therein.
The linked question was concerned with scaling processes inside a pod to optimize node compute usage, which I get is well managed by k8s already
The nested links are more about limiting to one concern per pod, which is the case in my question.
My question is about overhead of running 4 identical worker processes in 4 pods vs in 1 pod.

Either way is fine and I wouldn't expect it to make a huge difference, except maybe at large scale.
There is nothing architecturally wrong with running multiple worker tasks inside a single container, particularly within a framework that's specifically designed to do it. As you note it's usually considered good form to run one concern per container, and there are various issues with running multiple processes (recovering from failed processes, collecting logs); but if you have a system that's specifically designed to launch and manage subprocesses, running it in a container is fine.
The questions I'd start asking here are around how many Resque workers you're planning to run at the same time. If it's "thousands", then you can start hitting limits around the number of pods per node and pressure on the Kubernetes scheduler; in that case, using multiple workers per container to cut down the number of pods can make some sense. If it's "dozens", then limiting it to one worker per pod could make things a little easier to visualize and manage.
Starting up a new container can be somewhat expensive (I'm used to seeing 30-60s startup times, though that depends heavily on the image), but keeping a container running isn't especially so. It looks like Resque has a manager process on top of some number of workers, so you'll have those extra Ruby processes, but that's probably not significant in memory or storage.
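To make the tradeoff concrete, here is a rough sketch (not anything from the question) of what a "several workers per pod" Deployment could look like, written with the official Python Kubernetes client; a plain YAML manifest would be equivalent. The image, rake invocation, namespace, replica count and per-worker resource numbers are all invented placeholders; the only point is that the pod's requests should scale with COUNT.

```python
# Hypothetical sketch only: a Deployment running COUNT Resque workers per pod,
# with the pod's resource requests sized as a multiple of one worker's needs.
# Image, command, namespace and the per-worker numbers are placeholders.
from kubernetes import client, config

WORKERS_PER_POD = 4        # the COUNT passed to the resque rake task
CPU_PER_WORKER = 250       # assumed millicores per worker
MEM_PER_WORKER = 300       # assumed MiB per worker


def resque_deployment(name: str = "resque-worker") -> client.V1Deployment:
    container = client.V1Container(
        name="resque",
        image="example.com/myapp:latest",                     # placeholder image
        command=["bundle", "exec", "rake", "resque:workers"],  # placeholder invocation
        env=[
            client.V1EnvVar(name="COUNT", value=str(WORKERS_PER_POD)),
            client.V1EnvVar(name="QUEUE", value="*"),
        ],
        resources=client.V1ResourceRequirements(
            requests={
                "cpu": f"{WORKERS_PER_POD * CPU_PER_WORKER}m",
                "memory": f"{WORKERS_PER_POD * MEM_PER_WORKER}Mi",
            },
        ),
    )
    return client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1DeploymentSpec(
            replicas=2,  # scale the number of pods here (or via an HPA)
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": name}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )


if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=resque_deployment()
    )
```

Going to one worker per pod would just mean COUNT=1, smaller per-pod requests, and more replicas for the Deployment (or its autoscaler) to manage.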

Related

Dynamic number of replicas in a Kubernetes cron-job

I've been looking for days for a way to set-up a cron-job with a dynamic number of jobs.
I've read all these solutions and it seems that, in order to initialise a dynamic number of jobs, I need to do it manually with a script and a job template, but I need it to be automatic.
A bit of context:
I have a database / message queue / whatever can store "items"
I would like to start a job (so a single replica of a container) every 5 minutes to process each item
So, let's say there is a Kafka topic / a db table / a folder containing 5 records / rows / files, I would like Kubernetes to start 5 replicas of the job (with the cron-job) automatically. After 5 minutes, there will be 2 items, so Kubernetes will just start 2 replicas.
The most feasible solution seems to be using a static number of pods and making them process multiple items, but I feel like there is a better way to accomplish this while keeping it inside Kubernetes, which I can't figure out due to my lack of experience. 🤔
What would you do to solve this problem?
P.S. Sorry for my English.
There are two ways I can think of:
Using a CronJob that is parallelised (1 work-item/pod or 1+ work-items/pod). This is what you're trying to achieve. Somewhat.
Using a data processing application. This I believe is the recommended approach.
Why and Why Not CronJobs
For (1), there are a few things that I would like to mention. There is no upside to having multiple Job/CronJob objects when you are trying to perform the same operation from all of them. You think you are getting parallelism, but really you are only increasing management overhead. If your workload grows too large (which it will), there will be too many Job objects in the cluster and the API server will be slowed down drastically.
Job and CronJob objects are only for stand-alone work items that need to be performed regularly; they are house-keeping tasks. So selecting CronJobs for data processing is a very bad idea. Even if you run a parallelized set of pods (as described here and here in the docs, like you mentioned), it is still better to have a single Job handle all the pods that are working on the same work-item. So you should not be thinking of "scaling Jobs" in those terms; instead, think of scaling Pods. If you really do want to move ahead with the Job and CronJob mechanisms, go ahead: the message-queue-based design is your best bet, but you will have to reinvent a lot of wheels to get it to work (read below for why that is the case).
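For reference, the work-queue variant boils down to a worker loop like the sketch below: every pod of a parallel Job claims items until the queue is empty and then exits. Redis as the queue, the queue name, and process_item() are illustrative assumptions, not anything from the question.

```python
# Hypothetical worker for the "parallel Job with a work queue" pattern:
# each pod runs this loop, claims one item at a time, and exits when the
# queue is drained so the Job can complete. Redis host/queue are placeholders.
import os

import redis


def process_item(item: bytes) -> None:
    # Real processing of one record/row/file would go here.
    print(f"processing {item!r}")


def main() -> None:
    queue = redis.Redis(host=os.environ.get("REDIS_HOST", "redis"))
    while True:
        item = queue.lpop("jobs")   # atomically claim one work item
        if item is None:            # nothing left: let this pod finish
            break
        process_item(item)


if __name__ == "__main__":
    main()
```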
Recommended Solution
For (2), I only say this because you are trying to perform data processing, and doing that with a one-off mechanism like a Job is not a good idea (Jobs are essentially stateless; they perform an operation that can simply be repeated without any repercussions). Say you start a pod and it fails processing: how will the other pods know that this item was not processed successfully? If the pod dies, the Job cannot keep track of the items in your data store, since the Job is not aware of the nature of the work you're performing. Therefore, it is natural for you to pursue a solution whose components are specifically designed for data processing.
You will want to look into a system that understands the nature of your data, knows how to keep track of which work items have been processed successfully, and knows how to restart a new Pod with the same item as input when the Pod working on it crashes, etc. This is a lot of application/use-case-specific functionality that is best served by an operator, or by a CustomResource and a controller. And since this is not a new problem, there are plenty of solutions out there that can do this for you in the best way.
The best course of action would be to have that system in place, deployed as a Deployment with auto-scaling enabled; you will get real parallelism that is also well suited to batch data processing.
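As a rough illustration of that Deployment-plus-autoscaling shape (again a sketch, not a prescription): a HorizontalPodAutoscaler targeting the worker Deployment, built with the Python Kubernetes client. The names and the CPU-based metric are placeholders; a real queue processor would more likely scale on queue depth via custom or external metrics.

```python
# Hypothetical HPA for a worker Deployment named "item-processor":
# scales the number of pods between 1 and 20 on average CPU utilisation.
from kubernetes import client, config

hpa = client.V2HorizontalPodAutoscaler(
    api_version="autoscaling/v2",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="item-processor"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="item-processor"
        ),
        min_replicas=1,
        max_replicas=20,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=70
                    ),
                ),
            )
        ],
    ),
)

if __name__ == "__main__":
    config.load_kube_config()
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa
    )
```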
And remember, when we talk about scaling in Kubernetes, it is always the pods that scale, not containers, not deployments, not services. Always Pods. That is because at the bottom of the chain there is always a Pod somewhere working on something, whether it is owned by a Job, a Deployment, a Service, a DaemonSet or whatever. And it is obviously a bad idea to have multiple application containers in a Pod, for many reasons (side-car and adapter containers are just helpers; they don't run the application).
Perhaps this blog that discusses data processing in Kubernetes can help.

Celery: dynamically allocate concurrency based on worker memory

My celery use case: spin up a cluster of celery workers and send many tasks to that cluster, and then terminate the cluster when all of the tasks have completed (usually ~2 hrs).
I currently have it setup to use the default concurrency, which is not optimal for my use case. I see it is possible to specify a --concurrency argument in celery, which specifies the number of tasks that a worker will run in parallel. This is also not ideal for my use case, because, for example:
cluster A might have very memory intensive tasks and --concurrency=1 makes sense, but
cluster B might be memory light, and --concurrency=50 would optimize my workers.
Because I use these clusters very often for very different types of tasks, I don't want to have to manually profile the task beforehand and manually set the concurrency each time.
My desired behaviour is have memory thresholds. So for example, I can set in a config file:
min_worker_memory = .6
max_worker_memory = .8
Meaning that the worker will increment concurrency by 1 until the worker crosses over the threshold of using more than 80% memory. Then, it will decrement concurrency by 1. It will keep that concurrency for the lifetime of the cluster unless the worker memory falls below 60%, at which point it will increment concurrency by 1 again.
Are there any existing celery settings that I can leverage to do this, or will I have to implement this logic on my own? max memory per child seems somewhat close to what I want, but this ends in killed processes which is not what I want.
Unfortunately Celery does not provide an autoscaler that scales up/down depending on memory usage. However, being a well-designed piece of software, it gives you an interface that you may implement however you like. I am sure that with the help of the psutil package you can easily create your own autoscaler. Documentation reference.
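To give a feel for what that interface looks like, here is a rough sketch of a psutil-based autoscaler, using the 60%/80% thresholds from the question. The class and attribute names come from celery.worker.autoscale as of Celery 5.x and may differ in your version; treat it as a starting point rather than tested code.

```python
# Hypothetical memory-aware autoscaler: grow the pool while system memory is
# below 60% used, shrink it above 80%. It only kicks in when the worker is
# started with --autoscale=<max>,<min> and worker_autoscaler points at it.
import psutil
from celery import Celery
from celery.worker.autoscale import Autoscaler

MIN_WORKER_MEMORY = 0.6
MAX_WORKER_MEMORY = 0.8


class MemoryAwareAutoscaler(Autoscaler):
    def _maybe_scale(self, req=None):
        used = psutil.virtual_memory().percent / 100.0
        if used > MAX_WORKER_MEMORY and self.processes > self.min_concurrency:
            self.scale_down(1)
            return True
        if used < MIN_WORKER_MEMORY and self.processes < self.max_concurrency:
            self.scale_up(1)
            return True
        return False


app = Celery("tasks", broker="redis://localhost:6379/0")
# Assuming this file is tasks.py:
app.conf.worker_autoscaler = "tasks:MemoryAwareAutoscaler"

# Start with e.g.:  celery -A tasks worker --autoscale=50,1
```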

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and have not found any reasonable post about this.
Straight to the point, which of the following approaches is better and why:
Having multiple node pools with multiple machine types and always specify in which pool each deployment should be done; or
Having a single pool with lots of machines and let Kubernetes scheduler do the job without worrying about where my deployments will be done; or
Having BIG machines (in multiple zones to improve clusters' availability and resilience) and let Kubernetes deploy everything there.
A list of considerations to be taken merely as hints; I do not pretend to describe best practice.
Each pod you add brings some overhead with it, but you gain flexibility and availability, making node failures and maintenance less impactful on production.
Nodes that are too small cause a big waste of resources, since sometimes it will not be possible to schedule a pod even though the total amount of free RAM or CPU across the nodes would be enough; you can think of this issue as similar to memory fragmentation.
I guess that your pods' sizes and their memory and CPU requests are not all similar, but I do not see this as a big issue in principle, or as a reason to go for 1). I do not see why a big pod should run only on big machines while a small one is scheduled on small nodes. I would rather use 1) if you need different memory-to-CPU ratios to support different workloads.
I would advise you to run some tests in the initial phase to understand the size of your biggest pod and the average size of the workload, in order to choose the machine types properly. Consider that having one pod that exactly fits one node, assigned to it one-to-one, is not the right way to proceed (virtual machines exist for that kind of scenario), since resource fragmentation would easily make it impossible to schedule such a large pod.
Consider that their size will likely increase in the future, and that scaling vertically is not always immediate: you need to switch off machines and terminate pods. I would oversize a bit to take this into account, and because scaling horizontally is far easier.
Talking about machine types, you can decide to go for a machine 5x the size of the biggest pod you have (or 3x? or 10x?). Also oversize the number of nodes in the cluster a bit, to account for overhead and fragmentation and to still have free resources.
Remember that you have a hard limit of 110 pods per node and 5000 nodes per cluster.
Remember that in GCP the network egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. However each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.
Regarding the prices of virtual machines, notice that there is no difference in price between buying two machines of size x and one of size 2x. Avoid customising machine sizes, because it is rarely convenient; if you feel your workload needs more CPU or memory, go for a high-memory or high-CPU machine type.
P.S. Since you are going to build a pretty big Cluster, check the size of the DNS
I will add any further considerations that come to mind; consider updating your question in the future with a description of the path you chose and the issues you faced.
1) makes a lot of sense: if you want, you can still let your Deployments treat it as one large pool (by not adding a nodeSelector/NodeAffinity), but you can also have machines of different sizes, think about having a pool of spot instances, etc. And you can have pools that are tainted, and therefore excluded from normal scheduling and available only to a particular set of workloads (see the sketch after this list). In my opinion it is preferable to have some proficiency with this approach from the very beginning, yet with many provisioners it should be very easy to migrate from 2) to 1) anyway.
2) As explained above, it's effectively a subset of 1), so it is better to build up experience with the 1) approach from day one; but if you ensure your provisioning solution supports an easy extension to the 1) model, then you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on the requirements and the amount of your workloads. Remember that while you need to plan for the loss of a whole AZ anyway, it will be much more frequent to lose single nodes (reboots, decommissioning of underlying hardware, updates, etc.), so if you have more hosts, the impact of losing one will be smaller. The bottom line is that you need to find your own balance that makes sense for your particular scale. Maybe 50 nodes is too many, would 15 cut it? Who knows but you :)
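As mentioned in point 1), here is a small sketch of what a tainted, dedicated pool can look like from the workload side, shown with the Python Kubernetes client (a YAML pod template would be equivalent). The pool name, taint key/value and container details are invented placeholders.

```python
# Hypothetical pod spec pinned to a dedicated GKE node pool: a nodeSelector
# on the pool's label plus a toleration for the taint that keeps other
# workloads off it. Omit both and the scheduler treats all pools as one.
from kubernetes import client


def pinned_pod_spec() -> client.V1PodSpec:
    return client.V1PodSpec(
        node_selector={"cloud.google.com/gke-nodepool": "batch-pool"},  # placeholder pool
        tolerations=[
            client.V1Toleration(
                key="dedicated", operator="Equal", value="batch", effect="NoSchedule"
            )
        ],
        containers=[
            client.V1Container(name="worker", image="example.com/worker:latest")
        ],
    )
```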

Why do my results become un-reproducible when run from qsub?

I'm running Matlab on a cluster. When I run my .m script from an interactive Matlab session on the cluster, my results are reproducible. But when I run the same script from a qsub command, as part of an array job away from my watchful eye, I get believable but unreproducible results. The .m files are doing exactly the same thing, including saving the results as .mat files.
Does anyone know why the scripts give reproducible results when run one way, and unreproducible results when run the other?
Is this only a problem with reproducibility or is this indicative of inaccurate results?
%%%%%
Thanks to spuder for a helpful answer. Just in case anyone stumbles upon this and is interested here is some further information.
If you use more than one thread in Matlab jobs, this may result in stealing resources from other jobs, which plays havoc with the results. So you have 2 options:
1. Request exclusive access to a node. The cluster I am using does not currently allow parallel array jobs, so doing this was very wasteful for me - I took a whole node but used it serially.
2. Ask Matlab to run with a single computational thread (singleCompThread). This may make your script take longer to complete, but it gets jobs through the queue faster.
There are a lot of variables at play. Ruling out transient issues such as network performance and load, here are a couple of possible explanations:
You are getting assigned a different batch of nodes when you run an interactive job than when you use qsub.
I've seen some sites that assign a policy of 'exclusive' to the nodes that run interactive jobs and 'shared' to nodes that run queued 'qsub' jobs. If that is the case, then you will almost always see better performance on the exclusive nodes.
Another answer might be that the interactive jobs are assigned to nodes that have less network congestion.
Additionally, if you are requesting multiple nodes and you happen to land on nodes that traverse multiple hops, then you could be seeing significant network slowdowns. The solution would be for the cluster administrator to set up nodesets.
Are you using multiple nodes for the job? How are you requesting the resources?

Celery: limit memory usage (large number of django installations)

We have a setup with a large number of separate Django installations on a single box. Each of these has its own code base & Linux user.
We're using Celery for some asynchronous tasks.
Each of the installations has its own setup for Celery, i.e. its own celeryd & worker.
The number of asynchronous tasks per installation is limited, and they are not time-critical.
When a worker starts it takes about 30 MB of memory. When it has run for a while this amount may grow (presumably due to fragmentation).
The last bullet point has already been (somewhat) solved by setting --maxtasksperchild to a low number (say 10). This ensures a restart after 10 tasks, after which the memory at least goes back to 30 MB.
However, each celeryd is still taking up a lot of memory, since the minimum number of workers appears to be 1 as opposed to 0. I also imagine that running python manage.py celery worker does not lead to the smallest possible footprint for the celeryd, since the full stack is loaded even if the only thing that happens is checking for tasks.
In an ideal setup, I'd like to see the following: a process with a very small memory footprint (100k or so) watches the queue for new tasks. When such a task arises, it spins up the (heavy) full Django stack in a separate process. And when the worker is done, the heavy process is spun down.
Is such a setup configurable using (somewhat) standard celery? If not, what points of extension are there?
We're (currently) using Celery 3.0.17 and the associated django-celery.
Just to make sure I understand - you have a lot of different django codebases, each with their own celery, and they take up too much memory when running on a single box simultaneously, all waiting for a celery job to come down the pipe? How many celery instances are we talking about here?
In my experience, you're using django celery in a very different way than it was designed for - all of your different django projects should be condensed to a few (or a single) project(s), composed of multiple applications. Then you set up a small number of queues to field celery tasks from the different apps - this way, you only have as many dormant celery threads taking up 30mb as you have queues, and a single queue can handle multiple tasks (from multiple apps if you want). The memory issue should go away.
To reiterate - you only need one celeryd, driving multiple workers. This way your bottleneck is job concurrency, not dormant memory needs.
Why do you need so many django installations? Please let me know if I'm missing something, or if you need clarification.
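For what it's worth, the consolidation described above might look roughly like the sketch below: one Celery app, one queue per (former) installation, one worker pool draining them all. Project, queue and task names are invented, and the lowercase setting names are the current ones; on the Celery 3.0.x mentioned in the question the uppercase equivalents (CELERY_QUEUES, CELERY_ROUTES, CELERYD_MAX_TASKS_PER_CHILD) apply instead.

```python
# Hypothetical consolidated setup: a single Celery app serving two former
# installations through two queues, drained by one worker process pool.
from celery import Celery
from kombu import Queue

app = Celery("site", broker="redis://localhost:6379/0")

app.conf.update(
    task_queues=(
        Queue("site_a"),
        Queue("site_b"),
    ),
    task_routes={
        "site_a.tasks.*": {"queue": "site_a"},
        "site_b.tasks.*": {"queue": "site_b"},
    },
    worker_max_tasks_per_child=10,  # the --maxtasksperchild workaround from the question
)

# One worker drains both queues:
#   celery -A site worker -Q site_a,site_b --concurrency=4
```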