In the textbooks they say that the major advantage of CFS is that it is very fair in allocating CPU to different processes. However, I am unable to see how CFS, with its red-black tree, achieves a better form of fairness than a simple round-robin queue.
If we forget about CFS group scheduling and other features, which could also be incorporated somehow into a simple RR queue, can anybody tell me how CFS is fairer than RR?
Thanks in advance
I believe the key difference relates to the concept of "sleeper fairness".
With RR, each of the processes on the ready queue gets an equal share of CPU time, but what about the processes that are blocked/waiting for I/O? They may sit on the I/O queue for a long time, but they don't get any built-up credit for that once they get back into the ready queue.
With CFS, processes DO get credit for that waiting time, and will get more CPU time once they are no longer blocked. That helps reward more interactive processes (which tend to use more I/O) and promotes system responsiveness.
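To make the difference concrete, here is a toy model (not the kernel's actual code) of CFS-style selection: each task accumulates vruntime for the CPU time it uses, the task with the smallest vruntime runs next, and a task that wakes from sleep is placed near the current minimum vruntime, so it immediately beats tasks that have been burning CPU all along. The task names, slice length and sleeper bonus below are made up for illustration:

```python
import heapq

# Toy CFS-style pick-next: the runnable task with the smallest vruntime
# (accumulated CPU time) runs next. This is an illustration of the idea,
# not the kernel's implementation (which also weights by nice level).

class Task:
    def __init__(self, name, vruntime=0.0):
        self.name = name
        self.vruntime = vruntime
    def __lt__(self, other):
        return self.vruntime < other.vruntime

timeline = []               # stands in for the red-black tree, ordered by vruntime
heapq.heappush(timeline, Task("cpu_hog1", vruntime=0.0))
heapq.heappush(timeline, Task("cpu_hog2", vruntime=0.0))

SLICE = 4.0                 # ms of CPU granted per pick (illustrative)

def run_one_slice():
    task = heapq.heappop(timeline)      # "leftmost node" = smallest vruntime
    task.vruntime += SLICE              # charge it for the CPU time it used
    heapq.heappush(timeline, task)
    return task.name

for _ in range(4):
    run_one_slice()                     # the two hogs alternate fairly

# An interactive task wakes up after sleeping on I/O. It is inserted just
# below the current minimum vruntime (a small, capped sleeper bonus), so on
# the next pick it beats both CPU hogs, which by now have vruntime ~8 ms.
min_vruntime = min(t.vruntime for t in timeline)
heapq.heappush(timeline, Task("interactive", vruntime=min_vruntime - 1.0))
print(run_one_slice())                  # -> "interactive"
```

With plain RR the woken task would simply go to the back of the queue and wait a full round, no matter how long it had been blocked.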
Here is a good detailed article about CFS, which mentions "sleeper fairness": https://developer.ibm.com/tutorials/l-completely-fair-scheduler/
According to the docs:
Keeping them (completed jobs) around in the system will put pressure on the API server.
I understand that regularly going through a long list of completed jobs, only to find that none needs to run, is a waste of CPU, and that the stopped pods waste disk space, but how much of a problem is this really?
Must I clean them up ASAP because the cluster goes down otherwise?
I think the boundary number is somewhere around 150k total pods:
https://kubernetes.io/docs/setup/best-practices/cluster-large/#support (see total pods)
There's a boundary number in any case, so it's a good idea to add some kind of cleaner, especially if you know what is safe to clean.
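If you prefer an explicit sweep over relying on the Job's spec.ttlSecondsAfterFinished field, a periodic cleanup can be a few lines with the official kubernetes Python client. The sketch below assumes the default namespace and that every job reporting a successful completion is safe to delete; adjust both assumptions to your cluster:

```python
# Minimal cleanup sketch using the official kubernetes Python client
# (pip install kubernetes). Assumes jobs live in the "default" namespace
# and that any job with at least one successful completion can go.
from kubernetes import client, config

config.load_kube_config()               # or config.load_incluster_config()
batch = client.BatchV1Api()

for job in batch.list_namespaced_job("default").items:
    if job.status.succeeded:            # completed successfully at least once
        batch.delete_namespaced_job(
            name=job.metadata.name,
            namespace="default",
            # "Background" propagation also removes the job's finished pods
            body=client.V1DeleteOptions(propagation_policy="Background"),
        )
        print("deleted", job.metadata.name)
```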
How does Scylla guarantee/keep write latency low under a write-heavy workload, given that more writes produce more memtable flushes and compaction? Is there some throttling for this? It would be really helpful if someone could answer.
In essence, Scylla provides consistent low latency by parallelizing the problem, and then properly prioritizing user-facing vs. back-office tasks.
Parallelizing - Scylla uses a shard-per-thread architecture. Each thread is responsible for all activities for its token range: reads, writes, compactions, repairs, etc.
Prioritizing - Each thread schedules work according to the priorities of its tasks. Foreground tasks, such as serving reads (queries) and writes (commitlog), receive the highest priority. Back-office tasks such as memtable flushes, compaction and repair are only done when there are spare cycles, which - given the nanosecond scale of modern CPUs - there usually are.
If there are not enough spare cycles and RAM or disk starts to fill, Scylla will bump the priority of the back-office tasks in order to save the node. That will, in fact, inject some latency. But it is an indication that you are probably undersized and should add some resources.
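As a rough illustration of that prioritization idea (this is a conceptual sketch, not Seastar's or Scylla's actual scheduler), one shard's loop might look like this: foreground reads/writes are served first, back-office work gets the leftover cycles, and a memory-pressure threshold lets flushing jump the queue when needed:

```python
from collections import deque

# Toy illustration only: one shard serves foreground (read/write) work
# first, spends leftover iterations on back-office work, and lets the
# back-office work preempt once memory pressure gets too high.

foreground = deque()        # client reads/writes
background = deque()        # memtable flush / compaction / repair chunks
memory_pressure = 0.0       # fraction of memtable budget in use (0..1)

def run_loop_iteration():
    global memory_pressure
    if memory_pressure > 0.9 and background:
        task = background.popleft()     # pressure: save the node first
    elif foreground:
        task = foreground.popleft()
        memory_pressure += 0.01         # each write grows the memtable a bit
    elif background:
        task = background.popleft()     # spare cycles: do back-office work
    else:
        return None
    if task == "flush_chunk":
        memory_pressure = max(0.0, memory_pressure - 0.2)
    return task

foreground.extend(["write"] * 5)
background.extend(["flush_chunk", "compaction_chunk"])
print([run_loop_iteration() for _ in range(8)])
# writes are served first; flush/compaction fill the idle iterations
```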
I would recommend starting with the Scylla Architecture whitepaper at https://go.scylladb.com/real-time-big-data-database-principles-offer.html. There are also many in-depth talks from Scylla developers at https://www.scylladb.com/resources/tech-talks/
For example, https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/ talks at great depth about shard-per-core.
https://www.scylladb.com/tech-talk/oltp-or-analytics-why-not-both/ talks at great depth about task prioritization.
Memtable flush is more urgent than regular compaction: on one hand we want to flush late, in order to create fewer SSTables in level 0, and on the other we want to evacuate memory from RAM. We have a memory controller which automatically determines the flush condition. Flushing is done in the background while operations go to the commitlog and get flushed according to the configured criteria.
Compaction is more of a background operation, and we have controllers for it too. Go ahead and search the blog for compaction.
I'm unsure how round-robin scheduling works with I/O operations. I've learned that CPU-bound processes are favoured by round-robin scheduling, but what happens if a process finishes its time slice early?
Say we neglect the dispatching overhead itself and a process finishes its time slice early: will the scheduler schedule another process if it's CPU-bound, or will the current process start its I/O operation and, since that isn't CPU-bound, immediately switch to another (CPU-bound) process afterwards? And if CPU-bound processes are favoured, will the scheduler schedule ALL the CPU-bound processes until they are finished and only afterwards schedule the I/O processes?
Please help me understand.
There are two distinct schedulers: the CPU (process/thread ...) scheduler, and the I/O scheduler(s).
CPU schedulers typically employ some hybrid algorithms, because they certainly do regularly encounter both pre-emption and processes which voluntarily give up part of their time-slice. They must service higher-priority work quickly, while not "starving" anyone. (A study of the current Linux scheduler is most interesting. There have been several.)
CPU schedulers identify processes as being either "primarily 'I/O-bound'" or "primarily 'CPU-bound'" at this particular time, knowing that their characteristics can and do change. If your process repeatedly consumes full time slices, it is seen as CPU-bound.
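To answer the original question directly: with plain RR, a process that blocks before its quantum expires just leaves the CPU early, the dispatcher picks the next ready process, and the blocked process re-enters the ready queue once its I/O completes. A toy simulation of that behaviour (the quantum, burst lengths and I/O delay are all made up):

```python
from collections import deque

# Toy round-robin dispatcher: a process that blocks for I/O before its
# quantum expires yields immediately; the next ready process runs, and the
# blocked one rejoins the ready queue when its I/O completes.
QUANTUM = 4

ready = deque([
    ("cpu_hog1", [QUANTUM, QUANTUM]),   # uses full slices
    ("io_task",  [1, 1]),               # blocks after 1 time unit
    ("cpu_hog2", [QUANTUM]),
])                                       # each burst here is at most one quantum
blocked = []                             # (wakeup_time, process)
clock = 0

while ready or blocked:
    for wakeup, proc in list(blocked):   # wake anything whose I/O finished
        if wakeup <= clock:
            blocked.remove((wakeup, proc))
            ready.append(proc)
    if not ready:
        clock += 1                       # CPU idles until an I/O completes
        continue
    name, bursts = ready.popleft()
    used = min(bursts.pop(0), QUANTUM)   # runs until it blocks or the slice ends
    clock += used
    print(f"t={clock:2d}  ran {name} for {used}")
    if bursts:                           # still has work left
        if used < QUANTUM:               # ...but it blocked: wait for I/O (5 units)
            blocked.append((clock + 5, (name, bursts)))
        else:                            # used its whole slice: back of the queue
            ready.append((name, bursts))
```

Notice that the CPU-bound processes are not run to completion first; they simply get a full quantum each time around, while the I/O task gives the CPU back early and returns later.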
I/O schedulers seek to order and re-order the I/O request queues for maximum efficiency. For instance, to keep the read/write head of a physical disk-drive moving efficiently in a single direction. (The two components of disk-drive delay are "seek time" and "rotational latency," with "seek time" being by-far the worst of the two. Per contra, solid-state drives have very different timing.) I/O-schedulers also have to be aware of the channels (disk interface cards, cabling, etc.) that provide access to each device: they can't simply watch what any one drive is doing. As with the CPU-scheduler, requests must be efficiently handled but never "starved." Linux's I/O-schedulers are also readily available for your study.
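For a feel of what "re-ordering the request queue" buys on a spinning disk, here is a small SCAN-style sketch (the sector numbers and starting head position are invented): requests are serviced in one sweep upward and then one sweep downward, which costs far less head travel than servicing them in arrival order:

```python
# SCAN-style ("elevator") ordering sketch: keep the head moving in one
# direction, then reverse. Sector numbers and head position are made up.

requests = [98, 183, 37, 122, 14, 124, 65, 67]   # pending request sectors
head = 53

upward   = sorted(r for r in requests if r >= head)
downward = sorted((r for r in requests if r < head), reverse=True)
order = upward + downward

movement = sum(abs(b - a) for a, b in zip([head] + order, order))
print(order)       # [65, 67, 98, 122, 124, 183, 37, 14]
print(movement)    # far less head travel than servicing in arrival order
```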
"Pure round-robin," as a scheduling discipline, simply means that all requests have equal priority and will be serviced sequentially in the order that they were originally submitted. Very pretty birds though they are, you rarely encounter Pure Robins in real life.
I am trying to use hardware to speed up scheduling and dispatching.
Therefore I need to know what exactly is in the ready queue, in order to figure out whether using hardware can indeed help and by how much.
The OS literature just says that the scheduler fetches a process and puts it into the ready queue.
I have some knowledge about processes, like the virtual address space, executable code, PID and so on.
But I just can't connect these together. I don't think the scheduler stores all of this information in the ready queue each time.
So can somebody help? What exactly is stored in the ready queue? How many bytes of data, and what are they? If it is system-dependent, can you give me at least one example for one system?
Thanks
The ready queue stores the processes which can be executed on the processor when given an opportunity, i.e. the processes which are not waiting for any sort of I/O operation, etc. to complete before they can be executed.
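To make that concrete: the ready queue does not hold a copy of the process itself (its address space, code, etc. stay where they are); it holds, or links to, a small per-process descriptor, the process control block. On Linux, for example, the run queue effectively holds pointers to each task's task_struct, whose queue/rb-tree linkage is embedded in the struct itself. A sketch with illustrative field names:

```python
from collections import deque
from dataclasses import dataclass, field

# Sketch of what a ready-queue entry amounts to: a reference to a small
# process control block, not the process's memory image. Field names and
# values below are illustrative, not any particular OS's layout.

@dataclass
class PCB:
    pid: int
    state: str                 # "ready", "running", "blocked", ...
    program_counter: int       # where to resume execution
    registers: dict = field(default_factory=dict)  # saved CPU context
    page_table_base: int = 0   # pointer to the address-space description
    priority: int = 0
    # accounting info, open-file table, signal state, etc. also live here

ready_queue = deque()          # the queue itself is just references to PCBs

ready_queue.append(PCB(pid=101, state="ready", program_counter=0x4005D0))
ready_queue.append(PCB(pid=102, state="ready", program_counter=0x7F10A4))

nxt = ready_queue.popleft()    # "dispatch": restore nxt's registers,
nxt.state = "running"          # switch to its page table, jump to its PC
print(nxt.pid, hex(nxt.program_counter))
```

So the per-entry data is on the order of a pointer plus a few hundred bytes to a few kilobytes of descriptor, depending on the system.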
As far as hardware for faster scheduling and dispatching is concerned, I feel that increasing the main memory capacity can help substantially. More main memory results in fewer swap-ins and swap-outs of memory blocks between secondary and primary memory, and hence less thrashing, which will improve performance considerably.
Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or at least requires benchmarking (and the space of things you would need to benchmark is huge). I believe that it depends on factors such as the following (this list is not exhaustive):
how much time an item picked up from the ready queue takes to process,
how many worker threads there are,
how many producers there are, and how often they produce,
what type of wait primitives you are using: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for - e.g., if you use a linked list to manage such a queue, the add and remove operations take constant time, while a priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing, you could take this question out of the picture by just using:
a process-based architecture, spawning multiple producer/consumer processes and using the file system for communication, or
a non-preemptive cooperative threading language such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause inter-processor cache-coherence traffic, which is not good, and therefore they should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems will try to keep a process on the same CPU, for many reasons you can google for. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to that CPU; load balancing only kicks in over the long run.
Had the situation been different - that is, if keeping thread-to-CPU affinity were not a design goal and thread migration were frequent - then keeping separate per-CPU run queues would be costly.
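A sketch of the contrast (structure and numbers are illustrative, not any particular kernel's code): with one shared ready queue every CPU serializes on a single lock, whereas with per-CPU queues each CPU normally touches only its own lock and work is migrated only occasionally:

```python
import threading
from collections import deque

# Design 1: one queue, one lock -- every dispatch on every CPU contends here.
shared_queue = deque()
shared_lock = threading.Lock()

def dispatch_shared():
    with shared_lock:                       # all CPUs fight for this lock
        return shared_queue.popleft() if shared_queue else None

# Design 2: a queue (and lock) per CPU, plus occasional work stealing.
NCPUS = 4
per_cpu = [deque() for _ in range(NCPUS)]
per_cpu_locks = [threading.Lock() for _ in range(NCPUS)]

def dispatch_local(cpu):
    with per_cpu_locks[cpu]:                # usually uncontended
        if per_cpu[cpu]:
            return per_cpu[cpu].popleft()
    # local queue empty: steal from the busiest other CPU (load balancing);
    # the race between choosing a victim and locking it is ignored here.
    victim = max(range(NCPUS), key=lambda c: len(per_cpu[c]))
    with per_cpu_locks[victim]:
        return per_cpu[victim].popleft() if per_cpu[victim] else None

per_cpu[0].extend(["task_a", "task_b", "task_c"])
print(dispatch_local(2))                    # CPU 2 steals "task_a" from CPU 0
```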