Set the process affinity on a large number of processes

I'm running taskset over a large number of PIDs (500+), and the total time is 30 s+. Is there a faster way to set many PIDs to the same CPU? Running the taskset processes in parallel doesn't improve performance much.[1] Thanks!
[1] I tried both backgrounding via & and parallelization via gnu parallel

Running taskset on a single pid takes 2 ms on my computer. GNU Parallel has an overhead of 2-10 ms, so that will slow down running them. On my system I can run 500 tasksets in 5 seconds:
seq 500 | time parallel taskset -p {} $$
(Obviously you will want to feed in your real PIDs as the input rather than using them as masks, and move {} to the PID position of the command, i.e. taskset -p <mask> {}.)
So I am puzzled that your system takes > 30 s to do the same.
If the 5 secs is too long, xargs has fewer safety features, but is faster:
seq 500 | time xargs -P4 -n1 -I{} taskset -p {} $$
If this is still too slow, you are looking at writing your own C program that calls sched_setaffinity() directly.
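Before writing C, note that on Linux the same affinity syscall taskset uses is exposed in Python 3.3+ as os.sched_setaffinity, so a short script can pin hundreds of PIDs from one process with no fork/exec per PID. A minimal sketch, assuming the PIDs arrive one per line on stdin and you want to pin them all to CPU 0 (both assumptions, not part of the question):
import os
import sys

cpus = {0}  # the CPU(s) to pin every PID to
for line in sys.stdin:
    if not line.strip():
        continue
    pid = int(line.strip())
    try:
        os.sched_setaffinity(pid, cpus)  # one affinity syscall per PID, no fork/exec
    except (ProcessLookupError, PermissionError) as e:
        print(f"skipping {pid}: {e}", file=sys.stderr)
You could feed it PIDs from pgrep, e.g. pgrep myprog | python3 pin.py (pin.py being whatever name you save the sketch under).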

Related

How to make handbrake use the cpu with less intensity?

I've recently begun using HandBrake to process some videos I downloaded, to make them lighter. I built a small Python GUI program to automate the processing, making use of the CLI version. What I do is generate the command according to the video and execute it with os.system. Something like this:
import os

def process(args):
    # some algorithm to generate cmd using args
    cmd = "handbrakecli -i raw_video.mp4 -o video.mp4 -O -e x264"  # example command
    os.system(cmd)
    os.remove("raw_video.mp4")
The code works perfectly, but the problem is that it overuses my CPU. It usually sits at 100% CPU usage for a considerable amount of time. I use CoreTemp to keep track of my processor temperature, and it usually hits 78 °C.
I tried using BES (Battle Encoder Shirase) by saving the cmd command into a batch file called exec.bat and doing os.system("BES_1.7.7\BES.exe -J -m exec.exe 20"), but this simply does nothing.
Speed isn't important at all. Even if it takes longer, I just want to use less of my CPU, something around 50% would be great. Any idea on how I could do so?
In HandBrake you can pass advanced parameters so it only uses a certain number of CPU threads.
You can use the threads option; see the HandBrake CLI documentation.
With threads you can specify how many CPU threads to use; the default is auto.
The -x parameter corresponds to the Advanced settings box in the HandBrake GUI; that is where threads goes.
The following tells HandBrake to use only one CPU thread:
-x threads=1
You can also use the veryslow value for the --encoder-preset setting to help with the CPU load.
--encoder-preset=veryslow
I actually prefer using the --encoder-preset=veryslow preset since I see an overall better quality in the encode.
And both together:
--encoder-preset=veryslow -x threads=1
So formatted with your cmd variable:
cmd = "handbrakecli -i raw_video.mp4 -o video.mp4 -O -e x264 --encoder-preset=veryslow -x threads=1" #example command
See if that helps.
One easy way in Linux is to use taskset. You can use the terminal or make a custom shortcut/command.
For example, my CPU has 8 threads but I only want to use 6 for Handbrake.
Just start the program with taskset -c 2,3,4,5,6,7 handbrake; this way threads 0 and 1 stay free for other tasks/processes and the program runs on threads 2,3,4,5,6,7.
In Windows you can change the Target of the shortcut or run this from cmd:
C:\Windows\System32\cmd.exe /C start "" /affinity FC "C:\Program Files\HandBrake\HandBrake.exe"
As far as I understand, the affinity value is a hexadecimal bitmask where bit N corresponds to thread N: the first hex digit (F = 1111) covers threads 7-4 and the second (C = 1100) covers threads 3-0, so threads 1 and 0 are left unset. In my case I have an 8-thread CPU and leave threads 1 and 0 free for other tasks.
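Since the asker is already driving HandBrake from Python, here is a rough sketch of the same idea done from the script itself. It assumes the third-party psutil package is installed (not part of the question) and that handbrakecli, as used in the asker's cmd, is on the PATH; psutil.Process.cpu_affinity() works on both Windows and Linux:
import subprocess
import psutil  # assumption: installed via pip install psutil

# Launch HandBrake, then immediately restrict it to CPUs 2-7,
# leaving CPUs 0 and 1 free for everything else.
proc = subprocess.Popen(
    ["handbrakecli", "-i", "raw_video.mp4", "-o", "video.mp4", "-O", "-e", "x264"]
)
psutil.Process(proc.pid).cpu_affinity([2, 3, 4, 5, 6, 7])
proc.wait()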

How to get the elapsed time of a bunch of PBS Torque jobs?

I'm using PBS Torque to run multiple jobs. The idea is simple: each job works on a chunk of data. The PBS Torque job script used to launch a job is called run_appli.sh.
Here is the simple code (code 1) to launch 10 jobs:
for i in 1 2 3 4 5 6 7 8 9 10; do qsub run_appli.sh; done
Indeed, I can monitor the execution of each of those jobs using qstat (see the command below) and get the elapsed time of each job.
watch -n1 -d qstat
However, I am interested in the overall elapsed time, i.e. the time between when I launched all the jobs (code 1) and when the last job finished executing.
Does anyone have an idea how to do this?
If you know the job id of the first job, you can look at its ctime (creation time, i.e. the time it was queued). You can then check the last job's comp_time for the end time. The difference between the two is the total elapsed time.
qstat -f $first_job_id | grep ctime # Shows the first job's queued time
qstat -f $last_job_id | grep comp_time # Shows the final job's completion time.
If the last job hasn't completed yet, the elapsed time so far is simply the current time minus the first job's queue time.
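If you want the subtraction done for you, a small script can pull the two timestamps out of qstat -f and diff them. A rough sketch, assuming Torque prints its timestamps in the usual "Thu Nov 12 13:22:01 2015" form and that the last job still shows a comp_time (check your own qstat -f output first); the job ids are placeholders:
import re
import subprocess
from datetime import datetime

def qstat_time(job_id, field):
    # Pull a single "field = <timestamp>" line out of qstat -f and parse it.
    out = subprocess.check_output(["qstat", "-f", job_id], text=True)
    stamp = re.search(rf"{field} = (.+)", out).group(1).strip()
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

first_job_id = "1234.server"  # placeholder: id of the first job you submitted
last_job_id = "1243.server"   # placeholder: id of the last job you submitted
elapsed = qstat_time(last_job_id, "comp_time") - qstat_time(first_job_id, "ctime")
print("overall elapsed:", elapsed)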

Celery : CELERYD_CONCURRENCY and number of workers

Following another Stack Overflow answer, I've tried to limit Celery's number of workers.
After I terminated all the celery workers, I restarted celery with the new configuration.
CELERYD_CONCURRENCY = 1 (in Django's settings.py)
Then I typed the following command to check how many celery workers are running.
ps auxww | grep 'celery worker' | grep -v grep | awk '{print $2}'
It returns two PIDs like 24803, 24817.
Then I changed configuration to CELERYD_CONCURRENCY = 2 and restarted celery.
The same command returns three PIDs, e.g. 24944, 24958, 24959 (as you can see, the last two PIDs are sequential).
This implies that the number of workers increases as I expected.
However, why does it return two PIDs when only one celery worker should be running?
Is there some subsidiary process that helps the worker?
I believe one process always acts as the controller: it listens for tasks and distributes them to its child processes, which actually perform the work. Therefore, you will always see one more process than the concurrency setting.
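If you want to see that hierarchy rather than take it on faith, here is a quick sketch (assuming a Linux host, and reusing the same "celery worker" match as the question) that prints each matching PID together with its parent PID; with CELERYD_CONCURRENCY = 1 you should see the controller plus one child whose parent is the controller:
import subprocess

# List every process whose command line mentions "celery worker",
# together with its parent PID, to make the controller/child split visible.
out = subprocess.check_output(["ps", "-eo", "pid,ppid,args"], text=True)
for line in out.splitlines()[1:]:
    pid, ppid, args = line.split(None, 2)
    if "celery worker" in args:
        print(f"pid {pid} (parent {ppid}): {args}")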

GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script calling GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it looks like it only uses one CPU per node, and it also produces the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one pair of errors for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me parallel is only using one CPU per node.
When I add the option -j2 to the parallel call, the errors go away, and I think it is using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:
Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
Is there any meaning to the bash: parallel: command not found message?
Yes: -j is the number of jobs per node.
Yes: Install 'parallel' in your $PATH on the remote hosts.
Yes: it is a consequence of parallel missing from the $PATH on the remote hosts.
GNU Parallel logs into each remote machine and tries to determine the number of cores (using parallel --number-of-cores); this fails because parallel is not found there, so it defaults to 1 CPU core per host. By giving -j2, GNU Parallel does not try to determine the number of cores.
Did you know that you can also give the number of cores in the --sshlogin, as in 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
The shell expands $PBS_O_WORKDIR before parallel ever runs. This means two things happen: (1) --env sees the expanded directory path rather than an environment variable name, so it essentially does nothing, and (2) the path is expanded as part of the command string, which removes the need to pass $PBS_O_WORKDIR to the remote side and is why there was no error.
The latest version of parallel (20151022) has a --workdir option (although the tutorial lists it as alpha testing), which is probably the easiest solution. The parallel command line would look something like:
parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "primes1({})" ::: 10 20 30 40
Final note: PBS_NODEFILE may contain hosts listed multiple times if more than one processor is requested from qsub. This may have implications for the number of jobs run, etc.

Stress testing a command-line application

I have a command line perl script that I want to stress test. Basically what I want to do is to run multiple instances of the same script in parallel so that I can figure out at what point our machine becomes unresponsive.
Currently I am doing something like this:
$ prog > output1.txt 2>err1.txt & \
prog > output2.txt 2>err2.txt &
.
.
.
.
and then I check ps to see which instances finished and which didn't. Is there any open-source application available that can automate this process, preferably with a web interface?
You can use xargs to run commands in parallel:
seq 1 100 | xargs -n 1 -P 0 -I{} sh -c 'prog > output{}.txt 2>err{}.txt'
This will run 100 instances in parallel.
For a better testing framework (including parallel testing via 'spawn') take a look at Expect.
Why not use the crontab or Scheduled Tasks to automatically run the script?
You could write something to automatically parse the output easily.
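If you would rather script it than eyeball ps, a short Python driver can launch the instances, wait for each one, and report which finished cleanly. This is only a sketch: prog, the instance count, and the output file names mirror the question and are otherwise assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

N = 100  # number of instances to run in parallel

def run(i):
    # Launch one instance, redirecting stdout/stderr like in the question.
    with open(f"output{i}.txt", "w") as out, open(f"err{i}.txt", "w") as err:
        return i, subprocess.call(["prog"], stdout=out, stderr=err)

with ThreadPoolExecutor(max_workers=N) as pool:
    for i, rc in pool.map(run, range(1, N + 1)):
        status = "finished" if rc == 0 else f"exited with code {rc}"
        print(f"instance {i}: {status}")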
With GNU Parallel this will run one prog per CPU core:
seq 1 1000 | parallel prog \> output{}.txt 2\>err{}.txt
If you want to run 10 progs per CPU core, do:
seq 1 1000 | parallel -j1000% prog \> output{}.txt 2\>err{}.txt
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ