How does Scylla handles agressive memflush & compaction on write work load? - latency

How does Scylla guarantee/keeps write latency low for write workload case, as more write would produce more memflush and compaction? Is there a throttling to it? Would be really helpful if someone can asnwer.

In essence, Scylla provides consistent low latency by parallelizing the problem, and then properly prioritizing user-facing vs. back-office tasks.
Parallelizing - Scylla uses a shard-per-thread architecture. Each thread is responsible for all activities for its token range. Reads, writes, compactions, repairs, etc.
Prioritizing - Each thread is scheduled according to the priorities of the tasks. High priority tasks like dealing with read (query) and write (commitlog) receive the highest priority. Back-office tasks such as memtable flushes, compaction and repair are only done when there are spare cycles. Which - given the nanosecond scale of modern CPUs - there usually are.
If there are not enough spare cycles, and RAM or Disk start to fill, Scylla will bump the priority of the back-office tasks in order to save the node. So that will, in fact, inject some latency. But that is an indication that you are probably undersized, and should add some resources.
I would recommend starting with the Scylla Architecture whitepaper at https://go.scylladb.com/real-time-big-data-database-principles-offer.html. There are also many in-depth talks from Scylla developers at https://www.scylladb.com/resources/tech-talks/
For example, https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/ talks at great depth about shard-per-core.
https://www.scylladb.com/tech-talk/oltp-or-analytics-why-not-both/ talks at great depth about task prioritization.

Memtable flush is more urgent than regular compaction as we on one hand want to flush late, in order to create fewer sstables in level 0 and on the other, we like to evacuate memory from ram. We have a memory controller which automatically determine the flush condition. It is done in the background while operations for to the commitlog and get flushed according to the configured criteria.
Compaction is more of a background operation and we have controllers for it too. Go ahead and search the blog for compaction

Related

Throughput vs latency in computer architecture

I've come across articles on "through-put vs latency" in contexts like networking e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html But in the context of computer architecture / operating systems, I'm not able to understand why would there be a trade-off between latency (response time of a program) and through-put (how many programs we're able to complete in a unit of time, say per hour). Is this solely due to the fact that we can choose to parallelize processing of multiple programs / requests leading to overheads like context switches & sharing of caches which make the start-to-end response time per process to be worse? Or am I missing something here?
In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable in terms of setting it differently between servers and desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
If you alt-tab to a window when many pages of that process have been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults, can be quite slow especially if paging on a rotational disk, not SSD.) So to optimize for latency, you want the kernel to not aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they'd been free, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput, and in fact hurt (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case by case basis.)
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)

How does one core get work in a multi-core system?

Is it the Operating System who delegates any job to core?
What is that specific algorithm or a way, on which it is decided that the next task will be assigned to which cpu core?
Correct, it is the operating system's responsibility to designate tasks for the CPU to complete, regardless of how many cores it has. It does this via a scheduling algorithm, which decides in what order tasks/processes should be executed. In a symmetric multiprocessing environment, the OS views each core as an independent, identical CPU and therefore schedules them individually. When several cores are available, there are a couple important things to keep in mind:
1. Load balancing- For maximum performance, each core should be performing roughly the same amount of work.
2. Affinity- Because of caching, it is best (in terms of performance) for processes to complete the entirety of their execution on just one processor.
These things need to be kept in mind along with the traditional scheduling considerations of priority, fairness etc. Obviously, this topic is far too large for just one post to handle, so here are some resources that go in to further detail:
https://www.tutorialspoint.com/operating_system/os_process_scheduling_algorithms.htm
https://www.geeksforgeeks.org/multiple-processor-scheduling-in-operating-system/

Can Multiprocessor CPUs avoid context-switching?

Today's computer architecture are trying to maximize the number of registers. It is faster to access a register (which is an integrated memory circuit near the cpu) than to access first-level cache. The problem is, that each context switch has to save all registers into cache, because the next thread needs other register values. What a modern CPU is doing is to cycle in one second through 100 tasks and everytime it saves the registers, and fetches the old one until the task can be started.
IMHO it would be nice to use one CPU for one task, and no context switching is happening. That means we get 100 CPUs, each 1000 registers which has to be never saved. Is that possible or have I a ignored an important detail?
The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put multiple of these things in the same system or in a cluster of systems, you can have thousands or more cores. Still, even if you could avoid context switching with such design, these cores are much slower than the traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processors architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned to saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reducing the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.

Multi core machine - cpu load metric

In a multi core machine what is the best metric to understand whether cpu is loaded or not ?
I have a web application that sends a post request to apache CGI server. CGI server loops over the post data and launches perl process for each of the item in the loop. Since requests from clients ends up hitting a single endpoint, I am concerned if I end up creating lots of processes which my server can't handle. Hence I wanted to understand what system metric should I check before launching a new process from loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find, is that it depends on the nature of your processes, and which system constraint is your limiting factor.
For CPU intensive work, then the metric to look at is load average - load average is a measure of processes in a runnable state - very roughly if LA is the same as number of cores, then you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor - you may have a finite amount of memory, and memory hungry processes will consume it. 'spare' memory is used for caching, so filling the whole lot up actually starts to slow things down (because you have a smaller cache). Over spilling the available will either cause swapping or OOMkiller.
But as you mention apache and web, then chances are pretty good that your network pipe is a limiting factor - controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk IO - which may also be a factor - I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends what your processes are doing - if they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that all introduce noticeable load.
So the best answer I can give is a very vague estimate - if your processes are CPU intensive, cap them at 2 per core. If your processes are memory, aim to consume about 50% of your system RAM. If your processes are IO intensive, aim to consume about 50% of your IO (either network or disk).

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definetly. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item which is picked up from the ready qeueue takes to process, and
how many worker threads are their?
how many producers are their, and how often do they produce ?
what type of wait concepts are you using ? spin-locks or kernel-waits (the latter being slower) ?
So, if items are produced often, and if the amount of threads is large, and the processing time is low: the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for -e.g., if you use a linked list to manage such a queue the add and remove oprations take constant time. A prio-queue (heaps) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process based architecure and just spawning multiple producer consumer processes and using the file system for communication,
Using a non-preemtive collaborative threading programming language such as stackless python, Lua or Erlang.
also note: synchronization primitives cause inter-processor cache-cohesion floods which are not good and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-cpu ready queue is a natural selection for the data structure. This is because, most operating systems will try to keep a process on the same CPU, for many reasons, you can google for.What does that imply? If a thread is ready and another CPU is idling, OS will not quickly migrate the thread to another CPU. load-balance kicks in long run only.
Had the situation been different, that is it was not a design goal to keep thread-cpu affinities, rather thread migration was frequent, then keeping separate per-cpu run queues would be costly.