I am using RxJava 2 with Spring Boot.
I have 500 concurrent requests on the server.
Each request spawns 10 threads which call other services (so I/O work).
So in this case, should I use Schedulers.io() or Schedulers.computation()?
Basically, my confusion is that ideally io() should be used, as this is I/O work,
but couldn't this create a large number of threads?
Also, can I specify the pool size of computation threads?
Also, can I specify the pool size of io threads?
should I use Schedulers.io() or Schedulers.computation()?
You want to call other services, which is I/O work, so you should not use computation(). It's best to leave computation() for CPU-intensive work only; otherwise you won't get good CPU utilization.
can I specify the pool size of computation threads?
No, computation() is backed by a bounded thread pool with a size equal to the number of available processors. So you cannot use it to spawn 10 threads on demand.
can I specify the pool size of io threads?
If you need to limit the maximum number of simultaneous network calls, use: Schedulers.from(Executors.newFixedThreadPool(10))
For a single request that runs only 10 tasks at a time this is unnecessary. It is still good practice, because io() is unbounded: if you schedule hundreds of tasks in parallel, each of them gets its own thread, which causes context-switching overhead. With 500 concurrent requests each starting 10 tasks, that could mean up to 5,000 threads.
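The effect of a bounded pool can be sketched outside RxJava. The following is a minimal, language-neutral illustration in Python, not RxJava code: call_service is a hypothetical stand-in for a blocking call to another service, and the cap of 10 simply mirrors the newFixedThreadPool(10) figure above. Tasks beyond the cap wait in the pool's queue instead of each getting a thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_CALLS = 10  # assumed cap, mirrors newFixedThreadPool(10)

_lock = threading.Lock()
_active = 0
_peak = 0


def call_service(i):
    """Hypothetical stand-in for a blocking call to another service."""
    global _active, _peak
    with _lock:
        _active += 1
        _peak = max(_peak, _active)
    time.sleep(0.01)  # simulate network latency
    with _lock:
        _active -= 1
    return i


def run_calls(n_calls):
    # One shared, bounded pool: extra tasks queue up instead of spawning threads.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_CALLS) as pool:
        results = list(pool.map(call_service, range(n_calls)))
    return results, _peak


results, peak = run_calls(200)
print(f"completed={len(results)} peak_concurrency={peak}")
```

However many tasks are submitted, the peak number of calls in flight never exceeds the pool size; the price is that excess tasks wait in the queue.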
For more information, see: rxJava Schedulers Use Cases
Vert.x seems to create up to 2 * NUM_OF_CORES event loop threads by default.
And this seems to be a fairly old change (7+ years).
On a machine with 4 physical cores (8 logical cores with hyper-threading), it creates 16 event loop threads.
Shouldn't NUM_OF_CORES (i.e., 8 in above example) number of event loop threads be ideal?
The only explanation I could find was from Tim Fox (original author of Vert.x):
we use 2 * number of cores by default - in practice this gives better results as OSes don't always distribute threads evenly across cores.
But a few load tests I did gave better results when I used 8 instead of 16. So I want to understand under what conditions the default should give better results.
For CPU-bound calculations, having about the same number of threads as logical cores is good practice, because we want each thread to use as much CPU power as possible without interfering with other threads.
Usually Vert.x is not used for CPU-intensive computations; for the most common use cases of Vert.x it might be helpful to have some more threads ready to be used when needed, rather than having to create new ones on the go.
Why not use 10 * NUM_OF_CORES threads then? Because of the thread creation overhead and the risk of creating too many unused threads (which would lower system performance). So this choice is (probably) the result of a tradeoff between thread responsiveness and waste of system resources.
Your benchmarks can produce bad results with 2 * NUM_OF_CORES for a variety of reasons, such as:
OS thread management (allocation time and context switches);
lack of system resources (a lot of programs running alongside the one you are testing);
measurement issues (did the measurement start before the thread allocation? did the test last long enough to make the thread creation time negligible?);
probably something else I can't figure out rn 😅
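The measurement point in the list above can be sketched as follows. This is a generic Python harness, not Vert.x code: handle is a hypothetical unit of work, and the warm-up and task counts are arbitrary assumptions. The idea is only that the clock starts after the pool has had a chance to create its worker threads, so thread creation is not counted in the result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(_):
    """Hypothetical stand-in for one unit of work."""
    return sum(range(1000))


def measure(pool_size, warmup=200, measured=2000):
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Warm-up run: lets the pool create worker threads before timing starts.
        list(pool.map(handle, range(warmup)))
        start = time.perf_counter()
        done = list(pool.map(handle, range(measured)))
        elapsed = time.perf_counter() - start
    return len(done), elapsed


for size in (8, 16):
    count, elapsed = measure(size)
    print(f"pool_size={size} tasks={count} elapsed={elapsed:.4f}s")
```

The timings this prints say nothing about Vert.x itself; the sketch only shows where to place the warm-up and the timer when comparing two pool sizes.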
Hope it helped!
On a multi-core machine, what is the best metric to understand whether the CPU is loaded or not?
I have a web application that sends a POST request to an Apache CGI server. The CGI server loops over the POST data and launches a Perl process for each item in the loop. Since requests from clients end up hitting a single endpoint, I am concerned that I may end up creating more processes than my server can handle. Hence I want to understand which system metric I should check before launching a new process from the loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find is that it depends on the nature of your processes, and on which system constraint is your limiting factor.
For CPU-intensive work, the metric to look at is load average - load average is a measure of processes in a runnable state - very roughly, if the load average is the same as the number of cores, then you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor - you may have a finite amount of memory, and memory-hungry processes will consume it. 'Spare' memory is used for caching, so filling the whole lot up actually starts to slow things down (because you have a smaller cache). Overspilling the available memory will either cause swapping or trigger the OOM killer.
But as you mention apache and web, then chances are pretty good that your network pipe is a limiting factor - controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk IO - which may also be a factor - I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends what your processes are doing - if they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that all introduce noticeable load.
So the best answer I can give is a very vague estimate - if your processes are CPU-intensive, cap them at 2 per core. If your processes are memory-intensive, aim to consume about 50% of your system RAM. If your processes are IO-intensive, aim to consume about 50% of your IO (either network or disk).
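As a rough illustration of the CPU-bound rule of thumb, a check before launching another process could look like the sketch below. It is written in Python rather than Perl, and can_launch and its threshold are hypothetical names that only encode the "2 per core" estimate above. os.getloadavg() is available on Unix-like systems only.

```python
import os

MAX_RUNNABLE_PER_CORE = 2  # the "2 per core" rule of thumb from the answer


def can_launch(load_avg, cores, per_core=MAX_RUNNABLE_PER_CORE):
    """Return True if the load average leaves room for one more process."""
    return load_avg + 1 <= cores * per_core


if hasattr(os, "getloadavg"):  # not available on every platform
    cores = os.cpu_count() or 1
    one_minute_load = os.getloadavg()[0]
    print("room for another process:", can_launch(one_minute_load, cores))
```

Keep in mind that the 1-minute load average lags behind reality, so a tight loop that launches many processes will not see its own effect immediately. A fixed cap on the number of children running at once is a simpler safeguard for that case.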
I'm unsure how Round Robin scheduling works with I/O Operations. I've learned that CPU bound processes are favoured by Round Robin scheduling, but what happens if a process finishes its time slice early?
Say we neglect the dispatching process itself and a process finishes its time slice early: will the scheduler schedule another process if it's CPU bound, or will the current process start its I/O operation and, since that isn't CPU bound, will the scheduler immediately switch to another (CPU bound) process afterwards? And if CPU bound processes are favoured, will the scheduler schedule ALL CPU bound processes until they are finished and only afterwards schedule the I/O processes?
Please help me understand.
There are two distinct schedulers: the CPU (process/thread ...) scheduler, and the I/O scheduler(s).
CPU schedulers typically employ some hybrid algorithms, because they certainly do regularly encounter both pre-emption and processes which voluntarily give up part of their time-slice. They must service higher-priority work quickly, while not "starving" anyone. (A study of the current Linux scheduler is most interesting. There have been several.)
CPU schedulers identify processes as being either "primarily 'I/O-bound'" or "primarily 'CPU-bound'" at this particular time, knowing that their characteristics can and do change. If your process repeatedly consumes full time slices, it is seen as CPU-bound.
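To make the question's scenario concrete, here is a toy simulation of plain round-robin on a single CPU. It is a teaching sketch under simplifying assumptions (fixed I/O duration, no priorities, no dispatch cost) and not how any real kernel scheduler is implemented; the function name and parameters are made up for illustration. When a process ends its CPU burst before the quantum expires, it blocks for I/O and the CPU goes straight to the next ready process; when the I/O completes, the process rejoins the tail of the ready queue.

```python
from collections import deque


def round_robin(processes, quantum, io_time):
    """Simulate round-robin on one CPU.

    processes: dict of name -> list of CPU burst lengths; an I/O wait of
    io_time ticks separates consecutive bursts of the same process.
    Returns the CPU timeline as a list of (start, end, name) slices.
    """
    remaining = {name: list(bursts) for name, bursts in processes.items()}
    ready = deque(processes)  # FIFO ready queue, in submission order
    blocked = []              # (wake_time, name) for processes doing I/O
    clock, timeline = 0, []

    while ready or blocked:
        # Processes whose I/O has completed rejoin the tail of the queue.
        blocked.sort()
        while blocked and blocked[0][0] <= clock:
            ready.append(blocked.pop(0)[1])
        if not ready:         # CPU idles until the next I/O completes
            clock = blocked[0][0]
            continue

        name = ready.popleft()
        run = min(quantum, remaining[name][0])
        timeline.append((clock, clock + run, name))
        clock += run
        remaining[name][0] -= run

        if remaining[name][0] > 0:
            ready.append(name)          # quantum expired: back of the queue
        else:
            remaining[name].pop(0)      # burst finished within the quantum
            if remaining[name]:         # more work later: block for I/O
                blocked.append((clock + io_time, name))
    return timeline


timeline = round_robin({"cpu": [6], "io": [1, 1]}, quantum=3, io_time=2)
print(timeline)
# [(0, 3, 'cpu'), (3, 4, 'io'), (4, 7, 'cpu'), (7, 8, 'io')]
```

In this run the I/O-bound process gives up the CPU after 1 tick of its 3-tick quantum, and the CPU-bound process is scheduled immediately. The I/O-bound process is not postponed until all CPU-bound work is done; it simply waits its turn in the queue once its I/O completes.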
I/O schedulers seek to order and re-order the I/O request queues for maximum efficiency. For instance, to keep the read/write head of a physical disk-drive moving efficiently in a single direction. (The two components of disk-drive delay are "seek time" and "rotational latency," with "seek time" being by-far the worst of the two. Per contra, solid-state drives have very different timing.) I/O-schedulers also have to be aware of the channels (disk interface cards, cabling, etc.) that provide access to each device: they can't simply watch what any one drive is doing. As with the CPU-scheduler, requests must be efficiently handled but never "starved." Linux's I/O-schedulers are also readily available for your study.
"Pure round-robin," as a scheduling discipline, simply means that all requests have equal priority and will be serviced sequentially in the order that they were originally submitted. Very pretty birds though they are, you rarely encounter Pure Robins in real life.
I am taking a course on distributed systems and we have to make our project using Scala. Our instructor told us that Scala is good in the sense that it uses multiple cores to do the computation and uses parallelism to solve problems while being integrated with the actor model.
This is a theoretical question. I have learned some basics about the actor model using Akka and my question is that, while programming, does the user have to provide the details to the compiler so that various actors work on multiple cores, or does Scala take care of that and use multiple cores for various actors?
In a nutshell my question is: when we declare multiple actors using the Akka libraries in Scala, does Scala compiler automatically use the multi-core CPU power to distribute various actors among cores, or does the programmer have to provide some input to do this?
TL;DR: With the default configuration in Akka you need do nothing to get pretty good parallelism for most use cases.
Longer Answer: Actors in Akka run on a Dispatcher, and that Dispatcher has an ExecutorService, which is typically a pool of Threads. The number of Threads is configured by the developer, but by default is 3 times the number of CPU cores on the machine (see default-dispatcher.parallelism-factor here in the reference configuration).
At any point in time each CPU core can be running an Actor using one of these threads, so provided the number of threads in your Dispatcher's ExecutorService is at least equal to the number of cores on your CPU, you will be able to take advantage of all your cores. The reason that this is set to three times the number of cores in the default configuration is to compensate for blocking IO.
IO is slow, and blocking calls hog threads at times you are doing IO rather than using the CPU. So the key to getting the best level of parallelism is configuring this thread pool:
If you are doing only non-blocking IO, you can set it to the number of CPU cores you have and feel confident you are taking full advantage of your CPU.
The more blocking IO you do, the more threads you will need to keep getting good parallelism, but be warned - the more Threads you use, the more memory you will use and Threads are not the most lightweight things in the world.
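The sizing rule described above amounts to simple arithmetic, sketched here in Python rather than Akka configuration. The factor of 3 comes from the answer; the lower and upper bounds of 8 and 64 are illustrative assumptions, so check the reference configuration of your Akka version for the actual values.

```python
import math
import os


def pool_size(cores, factor, minimum=8, maximum=64):
    """Scale the core count by `factor`, then clamp to [minimum, maximum].

    The default bounds are illustrative assumptions, not confirmed Akka values.
    """
    return min(maximum, max(minimum, math.ceil(cores * factor)))


cores = os.cpu_count() or 1
print("non-blocking IO only:", pool_size(cores, 1.0))
print("some blocking IO    :", pool_size(cores, 3.0))
```

For example, a 4-core machine with a factor of 3 gets 12 threads, while a 32-core machine would hit the assumed upper bound of 64 rather than 96.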
theon's answer is pretty good, but I would just like to point out that actors are not the only way to achieve parallelism in Scala. If you do not need to manage state, Futures are generally a simpler way to perform computation in parallel. You just wrap each snippet of code that can run independently of others in a call to the Future factory function, and you can then compose/transform the results of each snippet (also in parallel) using calls to map, flatMap, fold, etc., or with for comprehensions. All you need to configure is an ExecutionContext as an implicit val, and if you are already using Akka, you can use the same one that your actors use, or you can use the preconfigured global default.
Is it possible to set the concurrency (the number of simultaneous workers) on a per-task level in Celery? I'm looking for something more fine-grained than CELERYD_CONCURRENCY (which sets the concurrency for the whole daemon).
The usage scenario is: I have a single celeryd running different types of tasks with very different performance characteristics - some are fast, some very slow. For some I'd like to do as many as I can as quickly as I can, for others I'd like to ensure only one instance is running at any time (i.e. concurrency of 1).
You can use automatic routing to route tasks to different queues which will be processed by celery workers with different concurrency levels.
celeryd-multi start fast slow -c:slow 3 -c:fast 5
This command launches 2 Celery workers, fast and slow, listening on the fast and slow queues with concurrency levels of 5 and 3 respectively.
CELERY_ROUTES = {"tasks.a": {"queue": "slow"}, "tasks.b": {"queue": "fast"}}
Tasks of type tasks.a will be routed to the slow queue, and tasks.b tasks to the fast queue.