How to limit memory usage in yocto bitbake? - yocto

While building some packages, I found OOM (out-of-memory) messages in dmesg.
The build process had been killed.
Is there any way to limit memory usage?

The easiest way for me is to limit the number of tasks BitBake runs concurrently when building:
$ BB_NUMBER_THREADS=2 bitbake <target>
where 2 is the maximum number of BitBake tasks running at the same time.
You can also set this in your local.conf.

You can also limit the number of parallel make jobs.
Set PARALLEL_MAKE in your local.conf:
PARALLEL_MAKE ?= "-j 1"
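To pick values for the two settings above, here is a rough sketch of how one might derive them from a memory budget. In the worst case every BitBake task runs a full make -j N at once, so about BB_NUMBER_THREADS times N compiler processes compete for RAM. The function names and the 2 GB-per-job figure are my own assumptions for illustration, not anything BitBake provides; measure your own build before trusting the numbers.

```python
def suggest_bitbake_limits(total_ram_gb, cores, gb_per_job=2.0):
    """Return (bb_number_threads, make_jobs) under a rough memory budget.

    gb_per_job is an ASSUMED average footprint of one compile job;
    it is a guess for illustration, not a BitBake-provided value.
    """
    if total_ram_gb <= 0 or cores < 1 or gb_per_job <= 0:
        raise ValueError("inputs must be positive")
    # Total compile jobs that fit in RAM under the assumed footprint.
    max_jobs = max(1, int(total_ram_gb // gb_per_job))
    # Give make as many jobs as fit, but never more than the core count.
    make_jobs = min(cores, max_jobs)
    # Remaining budget decides how many BitBake tasks may run at once.
    threads = max(1, min(cores, max_jobs // make_jobs))
    return threads, make_jobs


def render_local_conf(threads, make_jobs):
    """Format the two settings as lines suitable for local.conf."""
    return 'BB_NUMBER_THREADS = "%d"\nPARALLEL_MAKE = "-j %d"\n' % (threads, make_jobs)


if __name__ == "__main__":
    # Hypothetical machine: 8 GB of RAM, 8 cores.
    print(render_local_conf(*suggest_bitbake_limits(total_ram_gb=8, cores=8)), end="")
```

This sketch favours make parallelism over BitBake task parallelism. If your OOM kills come from one or two very large recipes, lowering PARALLEL_MAKE further may matter more than the thread count.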

Related

How to add more threads when building?

When running
bitbake core-image-xxxx
the build automatically uses 8 threads (since my CPU has 8 cores) to build the image.
My system has 72 GB of RAM. Can I force bitbake to run with more threads?
Or is there any way to ask bitbake to use more RAM?
To increase the number of threads:
Add the following to your local.conf inside the build/conf directory, replacing x and y with the values you want:
PARALLEL_MAKE = "-j x"
BB_NUMBER_THREADS = "y"
PARALLEL_MAKE holds the options passed to make during do_compile; -j x lets make run up to x jobs in parallel.
BB_NUMBER_THREADS sets the maximum number of tasks BitBake runs at the same time.
I do not know about increasing memory usage. If you want to speed up the build, you could do it with a ramdisk.
https://www.linuxbabe.com/command-line/create-ramdisk-linux
https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-PARALLEL_MAKE
https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-BB_NUMBER_THREADS

Queries regarding celery scalability

I have few questions regarding celery. Please help me with that.
Do we need to put the project code in every celery worker? If yes, if I am increasing the number of workers and also I am updating my code, what is the best way to update the code in all the worker instances (without manually pushing code to every instance everytime)?
Using -Ofair in celery worker as argument disable prefetching in workers even if have set PREFETCH_LIMIT=8 or so?
IMPORTANT: Does rabbitmq broker assign the task to the workers or do workers pull the task from the broker?
Does it make sense to have more than one celery worker (with as many subprocesses as number of cores) in a system? I see few people run multiple celery workers in a single system.
To add to the previous question, whats the performance difference between the two scenarios: single worker (8 cores) in a system or two workers (with concurrency 4)
Please answer my questions. Thanks in advance.
Do we need to put the project code in every celery worker? If yes, if I am increasing the number of workers and also I am updating my code, what is the best way to update the code in all the worker instances (without manually pushing code to every instance everytime)?
Yes. A celery worker runs your code, and so naturally it needs access to that code. How you make the code accessible though is entirely up to you. Some approaches include:
Code updates and restarting of workers as part of deployment
If you run your celery workers in kubernetes pods this comes down to building a new docker image and upgrading your workers to the new image. Using rolling updates this can be done with zero downtime.
Scheduled synchronization from a repository and worker restarts by broadcast
If you run your celery workers in a more traditional environment or for some reason you don't want to rebuild whole images, you can use some central file system available to all workers, where you update the files e.g. syncing a git repository on a schedule or by some trigger. It is important you restart all celery workers so they reload the code. This can be done by remote control.
Dynamic loading of code for every task
For example, in omega|ml we provide lambda-style serverless execution of arbitrary Python scripts, which are dynamically loaded into the worker process.
To avoid module loading and dependency issues it is important to keep max-tasks-per-child=1 and use the prefork pool. While this adds some overhead it is a tradeoff that we find is easy to manage (in particular we run machine learning tasks and so the little overhead of loading scripts and restarting workers after every task is not an issue)
Using -Ofair in celery worker as argument disable prefetching in workers even if have set PREFETCH_LIMIT=8 or so?
-O fair stops workers from prefetching tasks unless there is an idle process. However, there is a quirk with rate limits which I recently stumbled upon. In practice I have not experienced a problem with either prefetching or rate limiting. As with any distributed system, though, it pays off to think about the effects of the asynchronous nature of execution (this is not particular to Celery but applies to all such systems).
IMPORTANT: Does rabbitmq broker assign the task to the workers or do workers pull the task from the broker?
RabbitMQ does not know of the workers (nor do any of the other brokers supported by Celery); brokers just maintain a queue of messages. That is, it is the workers that pull tasks from the broker.
A concern that may come up with this is: what if my worker crashes while executing tasks? There are several aspects to this. There is a distinction between a worker and the worker processes. The worker is the single process started to consume tasks from the broker; it does not execute any of the task code. The task code is executed by one of the worker processes. When using the prefork pool (which is the default), a failed worker process is simply restarted without affecting the worker as a whole or the other worker processes.
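To make the pull model above concrete, here is a toy simulation using only the Python standard library. It is not Celery or RabbitMQ code: queue.Queue stands in for the broker, threads stand in for workers, and run_pull_workers is a name I made up. The point is that the queue never picks a worker; each worker fetches the next message itself.

```python
import queue
import threading


def run_pull_workers(tasks, num_workers=2):
    """Simulate workers pulling tasks from a broker-like queue.

    Returns a dict mapping worker name -> list of tasks it handled.
    """
    broker = queue.Queue()  # stand-in for the broker: it only stores messages
    for task in tasks:
        broker.put(task)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # The worker asks for work; the queue never assigns it.
                task = broker.get_nowait()
            except queue.Empty:
                return  # nothing left to pull
            with lock:
                handled.setdefault(name, []).append(task)

    threads = [
        threading.Thread(target=worker, args=("worker-%d" % i,))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


if __name__ == "__main__":
    result = run_pull_workers(range(10), num_workers=3)
    for name in sorted(result):
        print(name, result[name])
```

Which worker ends up with which task varies from run to run, as it would with real workers competing for messages on a shared queue.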
Does it make sense to have more than one celery worker (with as many subprocesses as number of cores) in a system? I see few people run multiple celery workers in a single system.
That depends on the scale and type of workload you need to run. In general CPU bound tasks should be run on workers with a concurrency setting that doesn't exceed the number of cores. If you need to process more of these tasks than you have cores, run multiple workers to scale out. Note if your CPU bound task uses more than one core at a time (e.g. as is often the case in machine learning workloads/numerical processing) it is the total number of cores used per task, not the total number of tasks run concurrently that should inform your decision.
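As a back-of-the-envelope version of the rule above, the following sketch derives a concurrency value from the core count and the number of cores a single task uses. The function name is hypothetical and the calculation ignores memory and I/O, so treat it as a starting point for tests rather than a recommendation.

```python
def max_concurrency(total_cores, cores_per_task=1):
    """Upper bound on concurrent CPU-bound tasks for one machine.

    cores_per_task is how many cores a single task keeps busy
    (e.g. a numerical library using several threads per task).
    """
    if total_cores < 1 or cores_per_task < 1:
        raise ValueError("core counts must be at least 1")
    # Always allow one task, even if it wants more cores than exist.
    return max(1, total_cores // cores_per_task)


if __name__ == "__main__":
    print(max_concurrency(8))      # single-core tasks on 8 cores
    print(max_concurrency(8, 4))   # tasks that each use 4 cores
```

The result is what you would pass as the worker's concurrency setting if CPU were the only constraint.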
To add to the previous question, whats the performance difference between the two scenarios: single worker (8 cores) in a system or two workers (with concurrency 4)
Hard to say in general, best to run some tests. For example if 4 concurrently run tasks use all the memory on a single node, adding another worker will not help. If however you have two queues e.g. with different rates of arrival (say one for low frequency but high-priority execution, another for high frequency but low-priority) both of which can be run concurrently on the same node without concern for CPU or memory, a single node will do.

Slurm: Why use srun inside sbatch?

In a sbatch script, you can directly launch programs or scripts (for example an executable file myapp) but in many tutorials people use srun myapp instead.
Despite reading some documentation on the topic, I do not understand the difference and when to use each of those syntaxes.
I hope this question is precise enough (1st question on SO), thanks in advance for your answers.
The srun command is used to create job 'steps'.
First, it gives better reporting of resource usage: the sstat command provides real-time resource usage for processes that are started with srun, and each step (each call to srun) is reported individually in the accounting.
Second, it can be used to set up many instances of a serial program (a program that only uses one CPU) within a single job, and to micro-schedule those programs inside the job allocation.
Finally, for parallel jobs, srun also plays the important role of starting the parallel program and setting up the parallel environment. It starts as many instances of the program as were requested with the --ntasks option, on the CPUs that were allocated for the job. In the case of an MPI program, it also handles the communication between the MPI library and Slurm.
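To show what the second and third points look like in a batch script, here is a small Python helper that writes one out. The helper and the input file names are hypothetical; the generated script uses #SBATCH --ntasks, srun and the shell's wait. I am assuming --exclusive is the right flag to keep concurrent steps from sharing CPUs; depending on your Slurm version, --exact may be the appropriate one, so check the srun man page on your cluster.

```python
def build_sbatch_script(program, ntasks, serial_args=()):
    """Return the text of an sbatch script that launches job steps with srun.

    With no serial_args, one step starts `ntasks` instances of `program`.
    With serial_args, one single-task step per argument runs in the background.
    """
    lines = [
        "#!/bin/bash",
        "#SBATCH --ntasks=%d" % ntasks,
        "",
    ]
    if serial_args:
        for arg in serial_args:
            # Each srun call is its own job step, accounted for separately.
            # ASSUMPTION: --exclusive keeps steps off each other's CPUs;
            # newer Slurm releases may want --exact instead.
            lines.append("srun --ntasks=1 --exclusive %s %s &" % (program, arg))
        # Keep the batch script alive until all background steps finish.
        lines.append("wait")
    else:
        # One step; srun starts as many instances as --ntasks requested.
        lines.append("srun %s" % program)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sbatch_script("./myapp", 4), end="")
    print(build_sbatch_script("./myapp", 2, ["a.dat", "b.dat"]), end="")
```

Running myapp directly in the script instead of through srun would start a single instance on the first allocated node, with no separate step in the accounting.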

Rundeck - any command execution fails when running on 5.8k nodes

I'm running a Rundeck server to dispatch a simple script to 5.8k other Linux servers.
The very simple script is below:
#!/bin/bash
A=$(hostname)
echo $A
When I run the same job with a smaller number of targets (4089 nodes), the commands work fine.
I tried looking at my service.log, but nothing new is being written to it.
Any ideas on how to run on all 5.8k nodes? And where should I look for errors?
Rundeck does not have a limit on the number of nodes. What it can handle depends on how many executions you want to run and on how much RAM, how many processors and how much disk space the server has.
Maybe you need to increase the Java heap size:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#java-heap-size
And how to adapt this to your SSH plugin:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#built-in-ssh-plugins

bind9 (named) does not start in multi-threaded mode

From the bind9 man page, I understand that the named process starts one worker thread per CPU if it is able to determine the number of CPUs. If it is unable to determine this, a single worker thread is started.
My question is: how does it calculate the number of CPUs? I presume that by CPU it means cores. The Linux machine I work on is customized, runs kernel 2.6.34, and does not support the lscpu or nproc utilities. named starts a single thread even if I give the -n 4 option. Is there any other way to force named to start multiple threads?
Thanks in advance.
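Not an answer to how named itself counts CPUs, but on a machine without lscpu or nproc you can still check what the system reports. The sketch below assumes a Python interpreter is available on the box. It asks sysconf for the number of online processors and, separately, counts the "processor" entries in /proc/cpuinfo. The function names are mine.

```python
import os


def count_cpuinfo_processors(cpuinfo_text):
    """Count 'processor : N' entries in the text of /proc/cpuinfo."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def online_cpus():
    """Number of online CPUs according to sysconf, or None if unavailable."""
    try:
        return os.sysconf("SC_NPROCESSORS_ONLN")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    print("sysconf reports:", online_cpus())
    try:
        with open("/proc/cpuinfo") as f:
            print("/proc/cpuinfo lists:", count_cpuinfo_processors(f.read()))
    except OSError:
        print("/proc/cpuinfo is not readable on this system")
```

If both report more than one CPU and named still runs a single worker thread, it is worth checking whether that named binary was built with thread support at all.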