Setting CPU affinity using taskset

I am using the taskset tool to set CPU affinity for one of my programs. How do I set the affinity to a single CPU only? Since I was not sure about this, I was doing this:
taskset -c 2-2 tests/prog 1 2 3
...expecting that I am scheduling the program to run on CPU #2 only, and following the same pattern for other CPUs. Even if I'm right, this seems like a bad way to do what I want IMO; can I get some help?
Thank you,
Sayan

The easiest way would be to use a CPU bitmask, where bit N selects CPU #N (so 0x1 is CPU #0 and 0x4 would be CPU #2), like:
taskset -p mask pid
#taskset -p 0x00000001 11587
pid 11587's current affinity mask: ff
pid 11587's new affinity mask: 1

taskset -c 2 ... should work to pin the program to CPU #2 (which is the third CPU -- CPUs are numbered from 0).
Even if I'm right, this seems like a bad way to do what I want IMO; can I get some help?
Depends on what you want. What are you trying to accomplish?

Related

Celery: dynamically allocate concurrency based on worker memory

My celery use case: spin up a cluster of celery workers and send many tasks to that cluster, and then terminate the cluster when all of the tasks have completed (usually ~2 hrs).
I currently have it set up to use the default concurrency, which is not optimal for my use case. I see it is possible to pass a --concurrency argument to celery, which specifies the number of tasks that a worker will run in parallel. This is also not ideal for my use case because, for example:
cluster A might have very memory intensive tasks and --concurrency=1 makes sense, but
cluster B might be memory light, and --concurrency=50 would optimize my workers.
Because I use these clusters very often for very different types of tasks, I don't want to have to manually profile the task beforehand and manually set the concurrency each time.
My desired behaviour is to have memory thresholds. So for example, I can set in a config file:
min_worker_memory = .6
max_worker_memory = .8
Meaning that the worker will increment concurrency by 1 until the worker crosses over the threshold of using more than 80% memory. Then, it will decrement concurrency by 1. It will keep that concurrency for the lifetime of the cluster unless the worker memory falls below 60%, at which point it will increment concurrency by 1 again.
Are there any existing celery settings that I can leverage to do this, or will I have to implement this logic on my own? max-memory-per-child seems somewhat close to what I want, but it ends in killed processes, which is not what I want.
Unfortunately, Celery does not provide an Autoscaler that scales up/down depending on memory usage. However, being a well-designed piece of software, it gives you an interface that you can implement however you like. I am sure that with the help of the psutil package you can easily create your own autoscaler. Documentation reference.
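As a rough sketch of what that could look like (assuming the celery.worker.autoscale.Autoscaler interface with its _maybe_scale, scale_up, scale_down and processes members, which may differ between Celery versions, and using a simple hysteresis rather than the exact step-up/step-down behaviour described above):

import psutil
from celery.worker.autoscale import Autoscaler

MIN_WORKER_MEMORY = 0.6  # below 60% memory use, it is safe to scale up
MAX_WORKER_MEMORY = 0.8  # above 80% memory use, scale back down

class MemoryAwareAutoscaler(Autoscaler):
    """Grow the pool while memory is comfortable, shrink it when it is not."""

    def _maybe_scale(self, req=None):
        used = psutil.virtual_memory().percent / 100.0
        if used < MIN_WORKER_MEMORY and self.processes < self.max_concurrency:
            self.scale_up(1)
            return True
        if used > MAX_WORKER_MEMORY and self.processes > self.min_concurrency:
            self.scale_down(1)
            return True
        return False

You would then point the worker at this class (via the worker_autoscaler setting, or its equivalent in your Celery version) and start the worker with --autoscale=<max>,<min> so that autoscaling is actually enabled.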

Running NetLogo headless on HPC, how to increase CPU usage?

I was running NetLogo headless on HPC using BehaviorSpace. Another (non-NetLogo) user on the HPC complained to me that I am utilizing the CPU cores to only a very small extent and should increase my usage. I don't know exactly how to do so; please help. I am guessing renice won't be of any help.
Code:
#!/bin/bash
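# SGE directives below: job name, target queue, and a 30-slot "mpi" parallel environment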
#$ -N NewPara3-d
#$ -q all.q
#$ -pe mpi 30
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/Test_results3-d.nlogo \
--experiment 3-d \
--table /home/abhishekb/csvresults/Test_results3-d.csv
In comments you link your related question, where you're trying to use Linux process priority to make jobs run faster / use more CPU.
There you ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that the scheduling engine for it is Sun's Grid Engine. Man pages are here (you can access them locally too, in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to
reflect each hosts different significance the number of configured slots is
used as a weighting factor when determining cluster queue load. Please note
that only hosts with a np_load_value are considered for this value. When
queue selection is applied only data about selected queues is considered in
this formula. If the load value is not available at any of the hosts '-NA-'
is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output shows 0.84: the average load on processors in all.q is 84%. This doesn't seem too low.
You state colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if it's just because you're using a lot of nodes (even if just for a short time).
You might want to experiment with using fewer nodes (unless your results are very slow) - that is achieved by altering the line #$ -pe mpi 30 - maybe take the number 30 down. You can work out how many nodes you need (roughly) by timing how long 1 model run takes on your computer and then use
N = (time to run 1 job * number of runs in experiment) / (time you want the whole experiment to take)
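For example (a hypothetical back-of-the-envelope calculation in Python, with made-up timings):

minutes_per_run    = 10   # measured by timing one model run on your computer (assumed)
runs_in_experiment = 90   # number of BehaviorSpace runs in the experiment (assumed)
target_minutes     = 180  # how long you are willing to wait for the whole experiment (assumed)

# slots = (time to run 1 job * number of runs) / time you want the run to take
slots = minutes_per_run * runs_in_experiment / target_minutes
print(slots)  # -> 5.0, so something like "#$ -pe mpi 5" would be roughly enough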
I'm not an expert, but the scheduler on the cluster seems to be supported in OpenMole.
OpenMole is a nice solution for embedding your NetLogo model transparently on many environments. It could be one solution ...

Is it possible to control CPU cores?

If I own a quad-core processor, can I "isolate" or totally control 1 core apart from the other cores?
The fourth core's job would be to serve only the allocated thread and nothing else.
What I want is a thread whose job is to get numbers from memory, always from the same physical addresses, calculate on them however I want, and put them back in the same place. (I will disable virtual memory.)
Thank you for your answers.
On Linux, you can use the sched_setaffinity function to do this. Set the affinity for the special thread to 8 (i.e. 0b1000) and the other threads to 7 (i.e. 0b0111).
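A minimal sketch of that idea in Python, using os.sched_setaffinity (Linux only, Python 3.3+); the CPU numbering and thread layout here are assumptions for a quad-core box, with CPU 3 reserved for the special work:

import os
import threading

OTHER_CPUS   = {0, 1, 2}  # mask 0b0111: where everything else may run
RESERVED_CPU = {3}        # mask 0b1000: the core reserved for the special thread

def special_work():
    # On Linux, pid 0 means "the calling thread" at the syscall level,
    # so this pins only this worker thread to the reserved core.
    os.sched_setaffinity(0, RESERVED_CPU)
    total = 0
    for i in range(10_000_000):  # stand-in for the real number-crunching loop
        total += i
    return total

# Restrict the main thread to CPUs 0-2; threads created afterwards inherit
# this affinity until they change it themselves (as special_work does).
os.sched_setaffinity(0, OTHER_CPUS)

worker = threading.Thread(target=special_work)
worker.start()
worker.join()

Note this only demonstrates the affinity calls: in CPython the GIL means the heavy lifting would more realistically live in a separate process or native code, and nothing here stops the kernel from scheduling other processes on CPU 3 (for that, look at the isolcpus boot parameter or cpusets).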
You're probably SOL if you really want to switch off virtual memory or anything like that. However, you may be able to write a kernel driver whose job is to expose the relevant section of physical memory to your user program.

What is the Overhead of matlabpool?

Could anyone tell me what the overhead of running a matlabpool is?
I started a matlabpool:
matlabpool open 132procs 100
Starting matlabpool using the '132procs' configuration ... connected to 100 labs.
And followed cpu usage on the nodes as :
pdsh -A ps aux |grep dmlworker
When I launch the matlabpool, it starts with ~35% cpu usage on average and when the pool
is not being used it slowly (in 5-7 minutes) goes down to ~2% on average.
Is this normal ? What is the typical overhead ? Does that change if matlabpooljob is launched as a "batch" job ?
This is normal. ps aux reports the average CPU utilization since the process was started, not over a rolling window. This means that, although the workers initialize relatively quickly and then become idle, it will take longer for this to reflect in CPU%. This is different to the Linux top command, for example, which will reflect the utilization since the last screen update in %CPU.
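You can see the difference for yourself with a small experiment (a hypothetical illustration using the psutil package; the workload is made up):

import time
import psutil

def ps_style_percent(proc):
    """Average CPU% since the process started, roughly what `ps aux` reports."""
    t = proc.cpu_times()
    return 100.0 * (t.user + t.system) / (time.time() - proc.create_time())

def top_style_percent(proc, window=1.0):
    """Utilization over a short sampling window, roughly what `top` reports."""
    return proc.cpu_percent(interval=window)

p = psutil.Process()           # inspect this script itself as a stand-in for a worker
for _ in range(10_000_000):    # burn some CPU during "initialization"...
    pass
time.sleep(5)                  # ...then sit idle, like an idle lab

print("ps-style :", ps_style_percent(p))   # still high: averaged over the whole lifetime
print("top-style:", top_style_percent(p))  # near zero: the process is idle right now

The ps-style figure only creeps down as idle wall-clock time accumulates, which matches the slow decay from ~35% to ~2% that you observed.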
As for typical overhead, this depends on a number of factors: clearly the number of workers, the rate and data size of jobs submitted (as well as in maintaining the worker processes, there is some overhead in marshalling input and output, which is not part of "useful computation"), whether the Matlab pool is local or attached to a job manager, and the Matlab version and O/S.
From experience, as a rough guide on a modern *nix server, I would think an idle worker should not be consuming more than 20% of a single core (e.g. <~1% total CPU utilization on a 16-core box) after initialization, unless there is a configuration issue. I would not expect this to be influenced by what kind of jobs you are submitting (whether using "createJob" or "batch" or "parfor", for example): the workers and communication mechanisms underneath are essentially the same.

Can memcached make full use of multi-core?

Is memcached capable of making full use of multi-core? Or is there any way tuning this?
memcached has "-t" option:
-t <threads>
      Number of threads to use to process incoming requests. This option is
      only meaningful if memcached was compiled with thread support enabled.
      It is typically not useful to set this higher than the number of CPU
      cores on the memcached server. The default is 4.
So I believe it can use all your CPU cores, provided, of course, that it was compiled with the corresponding option.
memcached is multi-threaded by default and has no problem saturating many cores. It's a bit harder to saturate all cores on more massively parallel boxes (e.g. a 256-core CMT box) just because it gets harder to get the data in and out of the network.
If you find areas where some sort of contention is preventing you from saturating cores, file a bug or start a discussion.
Based on this research by Intel, Memcached v.1.6 beta cannot scale well on a multicore system. Their experiments show that as core counts increase from 1 to 8, maximum throughput (with a median RTT < 1 ms SLA) only doubles.
CAREFUL. This terminology is quite confusing. The Memcached man page says the -t option is only good up to the number of cores. However, this is odd, because threads and processes are VERY different. Threads have NOTHING to do with the number of cores. Processes can definitely run on more than one core, while threads cannot (unless they call into an OS routine, then they can thread-switch and pack in more than 100% cpu usage). Threads share memory and just depend on an instruction pointer to differentiate who is who. Processes share nothing unless it is explicitly declared as shared ahead of time, and sharing occurs via the OS.
Overall, I want MORE CLARITY from the Memcached people about whether their app is multiprocessing or multithreaded and thus whether it can use more than 100% of cpu.