I am evaluating two RTOSes (GHS Integrity and QNX) against a few high-level requirements. One of the requirements is the ability of the RTOS to handle high data throughput, both in inter-process communication (through shared memory) and through the network stack (>10GBPS). Since these are proprietary RTOSes, it is hard to get access to them without buying. Any comments on the performance of the two RTOSes will be greatly appreciated.
Thanks
Anup
How does Scylla keep write latency low under a write-heavy workload, given that more writes produce more memtable flushes and compactions? Is there any throttling? It would be really helpful if someone could answer.
In essence, Scylla provides consistent low latency by parallelizing the problem, and then properly prioritizing user-facing vs. back-office tasks.
Parallelizing - Scylla uses a shard-per-thread architecture. Each thread is responsible for all activities for its token range: reads, writes, compactions, repairs, etc.
Prioritizing - Each thread is scheduled according to the priorities of the tasks. User-facing tasks such as serving reads (queries) and writes (commitlog) receive the highest priority. Back-office tasks such as memtable flushes, compaction and repair are only done when there are spare cycles, which, given the nanosecond scale of modern CPUs, there usually are.
If there are not enough spare cycles, and RAM or Disk start to fill, Scylla will bump the priority of the back-office tasks in order to save the node. So that will, in fact, inject some latency. But that is an indication that you are probably undersized, and should add some resources.
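The prioritization described above can be pictured with a toy model. This is only a sketch under loud assumptions: it is not Scylla's scheduler, the names `pick_next`, `drain` and `PRESSURE_THRESHOLD` are made up for illustration, and the strict "foreground first" rule is a simplification of whatever policy the real system uses.

```python
from collections import deque

# Hypothetical threshold: fraction of memory in use above which
# back-office work is bumped ahead of user-facing work.
PRESSURE_THRESHOLD = 0.8


def pick_next(foreground, background, memory_used):
    """Pick the next task for one shard, or None if the shard is idle.

    foreground: deque of user-facing tasks (reads, writes)
    background: deque of back-office tasks (flush, compaction, repair)
    memory_used: fraction of memory in use, between 0.0 and 1.0
    """
    under_pressure = memory_used >= PRESSURE_THRESHOLD
    # Background work runs when there is nothing user-facing to do,
    # or when resource pressure forces it to the front.
    if background and (under_pressure or not foreground):
        return background.popleft()
    if foreground:
        return foreground.popleft()
    return None


def drain(foreground, background, memory_used):
    """Run the shard until both queues are empty; return execution order."""
    order = []
    while True:
        task = pick_next(foreground, background, memory_used)
        if task is None:
            return order
        order.append(task)
```

With plenty of free memory, `drain(deque(["read", "write"]), deque(["flush"]), 0.3)` serves the read and write before the flush. At `memory_used=0.9` the flush jumps ahead, which is the point where user-facing latency starts to suffer.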
I would recommend starting with the Scylla Architecture whitepaper at https://go.scylladb.com/real-time-big-data-database-principles-offer.html. There are also many in-depth talks from Scylla developers at https://www.scylladb.com/resources/tech-talks/
For example, https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/ talks at great depth about shard-per-core.
https://www.scylladb.com/tech-talk/oltp-or-analytics-why-not-both/ talks at great depth about task prioritization.
Memtable flush is more urgent than regular compaction. On the one hand we want to flush late, in order to create fewer sstables in level 0; on the other hand we want to free memory. We have a memory controller which automatically determines the flush condition. Flushing is done in the background while operations go to the commitlog and get flushed according to the configured criteria.
Compaction is more of a background operation and we have controllers for it too. Go ahead and search the blog for compaction.
Nowadays the concept "real-time" has many different interpretations. In this question two definitions are provided:
The hard real-time definition considers any missed deadline to be a system failure. This scheduling is used extensively in mission critical systems where failure to conform to timing constraints results in a loss of life or property.
and
The soft real-time definition allows for frequently missed deadlines, and as long as tasks are timely executed their results continue to have value. Completed tasks may have increasing value up to the deadline and decreasing value past it.
In my research I came to the following conclusions:
The middleware supports hard real-time if it provides predictable and efficient end-to-end control over system resources, for example by allowing the thread priority of all threads created by the middleware to be set.
It appears to me that good performance is the most relevant factor to support soft real-time applications.
Is this true? Are there other relevant features of communication middleware that support soft real-time applications?
First, for precise definitions of real-time principles and terms, based on first principles and mental models, I refer you to real-time.org.
The real-time practitioner computing community uses a variety of inconsistent and incomplete "definitions" of "real-time," "hard real-time," and "soft real-time." The real-time computing research community has a consensus on "hard real-time" but is confused about "soft real-time."
The core of the research community's "hard" real-time computing model is that tasks have hard deadlines, and all these deadlines must not be missed, else the system has failed. Meeting the deadlines is the "timeliness" criterion, and guaranteeing that all deadlines will be met is the "predictability" criterion--that predictability is "deterministic."
(In some of these models, tasks without deadlines are allowed in the background if they do not interfere with the hard real-time tasks; they usually also are prevented from being starved.)
This model requires everything related to the hard real-time tasks to be static (known in advance)--i.e., it requires that the time evolution of the system is known in advance. This requirement is very strong, and in most cases, it is not feasible. There are important hard real-time systems in which this requirement is satisfied (or at least presumed to be). Well-known examples include digital avionics flight control, certain medical devices, power plant control, railroad crossing control, etc. These examples are safety-critical, but not all hard real-time systems are (and we will see below that most safety-critical systems are not and cannot be hard real-time, although some may include simple low level hard real-time components).
Soft real-time refers to a class of real-time systems which are generalizations of hard real-time ones. The generalizations include weaker timeliness criteria and/or weaker predictability criteria.
For example, consider a model with tasks having deadlines as hard real-time ones do. In this particular model, the timeliness criterion is that any number of tasks are allowed to be up to 15% tardy, and the predictability criterion is that this must be guaranteed (i.e., deterministic) just like for hard real-time systems. If one or more of these tasks is more than 15% tardy, the system has failed. This model is not a conventional hard real-time one, although it may be a safety-critical one.
Consider another model: the timeliness criterion is that no more than 20% of the tasks can be more than 5% tardy, and the predictability criterion is that this is guaranteed to be satisfied with at least probability 0.9. Violation of the timeliness and/or predictability criteria means the system has failed. This is not a hard real-time system, although it may be a safety-critical one.
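To make the second model concrete, here is a minimal sketch of how its two criteria could be checked against recorded runs. Several things are my assumptions, not part of the model as stated: tardiness is measured as a fraction of the task's deadline, the function names are invented for illustration, and the predictability figure is only an observed frequency over past runs, which is weaker than the guarantee the model demands.

```python
def tardiness(completion, deadline):
    """Lateness as a fraction of the deadline; 0.0 if the task was on time.

    Assumption: "5% tardy" means completing 5% of the deadline late.
    """
    return max(0.0, (completion - deadline) / deadline)


def meets_timeliness(tasks, max_tardy_fraction=0.20, tardiness_limit=0.05):
    """Timeliness criterion: no more than 20% of tasks are more than 5% tardy.

    tasks: list of (completion_time, deadline) pairs for one run.
    """
    too_tardy = sum(1 for c, d in tasks if tardiness(c, d) > tardiness_limit)
    return too_tardy <= max_tardy_fraction * len(tasks)


def observed_predictability(runs):
    """Fraction of recorded runs that met the timeliness criterion.

    This is an empirical estimate only, not a guarantee.
    """
    return sum(1 for run in runs if meets_timeliness(run)) / len(runs)


def system_ok(runs, required_probability=0.9):
    """Predictability criterion: timeliness is met with probability >= 0.9."""
    return observed_predictability(runs) >= required_probability
```

A run with one task out of ten finishing 10% late still meets the timeliness criterion; a run with three such tasks does not.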
But consider: what if the utility of that system degrades according to not meeting one or any of those criteria--say, 23% were more than 5% tardy, or less than 20% of the tasks were tardy but 10% of those were more than 5% tardy, or all of the criteria were met except that the predictability is only 0.8. There are many real-time systems having such dynamic properties.
We need to specify how that system degradation (say, the system's "utility" or "value") is related to how many and to what degree any of those timeliness and predictability criteria were or were not met. In fact, this model is a notional example of many actual existing real-time systems that are as safety-critical as possible--for example, for doing defense against nuclear armed hostile missiles (and numerous other military combat systems, because they all have various inherent dynamic uncertainties).
Now we return to that need to specify how a real-time system's timeliness and predictability are related to the system's utility. A successfully used solution to that is called "time/utility functions," (or "time/value functions") and is described in great detail on real-time.org. The functions for each task are derived from the physical nature of the system application(s). The system's timeliness and predictability of timeliness are based on those of the tasks--for example, by weighted accrual of their individual utilities.
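As a rough illustration of the idea (not taken from real-time.org), a time/utility function maps a task's completion time to the utility of completing it then. The two shapes below, a step function for a classic hard deadline and a linear decay after the deadline, and the helper names are my own choices for the sketch; in practice the shapes come from the physical nature of the application.

```python
def hard_deadline_tuf(deadline):
    """Step function: full utility up to the deadline, none afterwards."""
    def tuf(t):
        return 1.0 if t <= deadline else 0.0
    return tuf


def linear_decay_tuf(deadline, zero_at):
    """Full utility up to the deadline, then linear decay to zero at zero_at."""
    def tuf(t):
        if t <= deadline:
            return 1.0
        if t >= zero_at:
            return 0.0
        return (zero_at - t) / (zero_at - deadline)
    return tuf


def accrued_utility(tasks):
    """Weighted accrual of the individual task utilities.

    tasks: list of (weight, tuf, completion_time) triples.
    """
    return sum(weight * tuf(t) for weight, tuf, t in tasks)
```

For example, a weight-2 task with a hard deadline of 10 that completes at 9, plus a weight-1 task whose utility decays from time 10 to 20 and which completes at 15, accrue `2 * 1.0 + 1 * 0.5`. A utility-accrual scheduler would try to order tasks so that this sum is as large as possible.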
Soft real-time systems (in the precisely defined sense described on real-time.org) are the general case, and hard real-time systems are a special case that applies to a much smaller domain of real-time problems. All hard and soft real-time systems can be specified and created with time/utility functions.
All that clarified, now we can address your question about real-time middleware.
One obvious source for an answer is the Object Management Group (OMG) Real-Time CORBA (RTC) standard (search for it; there is a great deal of detailed information available).
RTC can be implemented as a fixed priority infrastructure, with a 15-bit system-wide priority that is mapped onto the node priorities. In that case, the minimum requirements are: respecting thread priorities between client and server for resolving resource contention during the processing of CORBA invocations; bounding the duration of thread priority inversions during end-to-end processing; bounding the latencies of operation invocations. It is possible to build hard real-time RTC distributed systems according to those three requirements (and many exist)--but obviously the underlying network QoS affects the real-time behavior. So RTC provides for pluggable application-specific networking, such as those having deterministic QoS (so hard real-time is possible at and below the RTC layers), and those having non-deterministic QoS (but still the RTC layers have the three essential fixed priority real-time properties).
More generally, RTC provides for soft real-time (in the technical sense defined on real-time.org) at the CORBA layers. It does that by providing a first order scheduling abstraction called "distributed threads." And it provides a scheduling framework that supports not only fixed priorities but also time/utility functions, which are general enough to express a very general class of "utility accrual" soft real-time scheduling algorithms. Such algorithms (or usually heuristics) are needed for distributed systems consisting of application-specific soft real-time system models such as I described above.
What if you don't want to use RTC? The good news is that RTC's principles first appeared publicly in a different distributed real-time system (described on real-time.org), and can be (and have been) transplanted to other real-time middleware for both hard and soft real-time systems.
For soft real-time (again, in the precisely defined sense from real-time.org) middleware, the principles of dynamic timeliness and predictability of timeliness must be applied to resource management at each node of the middleware's system--including being applied to scheduling the middleware's network (e.g., access, routing, etc.). Instances of this approach appear in several Ph.D. theses, and have also been implemented in a number of military combat distributed real-time systems.
From what I understand, the synchronized keyword syncs the local thread cache with main memory. The volatile keyword basically always reads the variable from main memory at every access. Of course accessing main memory is much more expensive than accessing the local thread cache, so these operations are expensive. However, a CAS operation uses low-level hardware operations but still has to access main memory. So how is a CAS operation any faster?
I believe the critical factor is as you state - the CAS mechanisms use low-level hardware instructions that allow for the minimal cache flushing and contention resolution.
The other two mechanisms (synchronization and volatile) use different architectural tricks that are more generally available across all different architectures.
CAS instructions are available in one form or another in most modern architectures but there will be a different implementation in each architecture.
Interesting quote from Brian Goetz (supposedly):
The relative speed of the operations is largely a non-issue. What is relevant is the difference in scalability between lock-based and non-blocking algorithms. And if you're running on a 1 or 2 core system, stop thinking about such things.
Non-blocking algorithms generally scale better because they have shorter "critical sections" than lock-based algorithms.
Note that a CAS does not necessarily have to access memory.
Most modern architectures implement a cache coherency protocol like MESI that allows the CPU to take shortcuts if only one thread is accessing the data at a time. The overhead compared to traditional, unsynchronized memory access is very low in this case.
When doing a lot of concurrent changes to the same value however, the caches are indeed quite worthless and all operations need to access main memory directly. In this case the overhead for synchronizing the different CPU caches and the serialization of memory access can lead to a significant performance drop (this is also known as cache ping-pong), which can be just as bad or even worse than what you experience with lock-based approaches.
So never simply assume that if you switch to atomics all your problems go away. The big advantage of atomics is the progress guarantees for lock-free (someone always makes progress) or wait-free (everyone finishes after a certain number of steps) implementations. However, this is often orthogonal to raw performance: a wait-free solution is likely to be significantly slower than a lock-based solution, but in some situations you are willing to accept that in order to get the progress guarantees.
I have a dual-core Intel processor and would like to use one core for processing certain commands, like SATA writes, and the other for reads. How do we do it? Can this be controlled from the application (with multiple threads), or would this require a change in the kernel to ensure the reads/writes don't get processed by the 'wrong' core?
This will be pretty much totally up to your operating system, which you haven't specified.
Some may offer thread affinity to try and keep one thread on the same execution engine (be that a core or a CPU), but that's only for threads. If two threads both write to disk, then they may well do so on different engines.
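As an example of what such an affinity call looks like, here is a minimal sketch using Python's `os.sched_setaffinity`, which exists only on some platforms (notably Linux). The helper name `pin_current_thread` is made up for illustration, and the assumption that pid 0 refers to the caller is Linux behaviour. Note that this only restricts where the calling thread runs; it says nothing about which core the kernel uses to service the resulting disk I/O.

```python
import os


def pin_current_thread(cpu=None):
    """Try to pin the caller to a single CPU.

    Returns the set of CPUs the caller may now run on, or None if the
    platform does not support (or does not permit) changing affinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    try:
        allowed = os.sched_getaffinity(0)  # 0 means "the caller"
        # Default to the lowest-numbered CPU we are allowed to use.
        target = min(allowed) if cpu is None else cpu
        if target not in allowed:
            raise ValueError(f"CPU {target} is not in the allowed set {allowed}")
        os.sched_setaffinity(0, {target})
        return os.sched_getaffinity(0)
    except OSError:
        return None  # the OS refused the request
```

A writer thread and a reader thread could each call this with a different CPU number at startup, but as noted above, that still does not control where the kernel handles the SATA commands themselves.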
If you want that sort of low level control, it's probably best to do it at the kernel level.
My question to you would be "Why?". A great deal of performance tuning goes into OS kernels and they would generally know better than any application how to efficiently do this low level stuff.
Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or else it requires benchmarking (and the space of things you would need to benchmark is huge). I believe that it depends on factors such as the following (this list is not exhaustive):
how long an item picked up from the ready queue takes to process,
how many worker threads there are,
how many producers there are, and how often they produce,
what type of waiting you use: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is short, the data structure could be locked for a large fraction of the time, causing contention and thrashing.
Other factors may include the data structure used and how long it is locked for. E.g., if you use a linked list to manage such a queue, the add and remove operations take constant time. A priority queue (heap) takes a few more operations on average when items are added.
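To illustrate the data-structure point, here is a small sketch comparing the two options with Python's standard library. The class names are invented for the example. `collections.deque` stands in for the linked list (constant-time add and remove at the ends), and `heapq` provides the binary heap, where each add or remove costs O(log n).

```python
import heapq
from collections import deque
from itertools import count


class FifoReadyQueue:
    """FIFO ready queue: add and remove are both O(1)."""

    def __init__(self):
        self._items = deque()

    def push(self, task):
        self._items.append(task)

    def pop(self):
        return self._items.popleft()


class PriorityReadyQueue:
    """Priority ready queue on a binary heap: add and remove are O(log n).

    A lower number means a higher priority. The sequence counter keeps
    tasks of equal priority in arrival order.
    """

    def __init__(self):
        self._heap = []
        self._seq = count()

    def push(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._seq), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

In a shared ready queue either structure would be protected by a lock, so the longer each operation takes, the longer every other CPU waits.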
If your system is for business processing you could take this question out of the picture by just using:
A process-based architecture: just spawn multiple producer and consumer processes and use the file system for communication,
A non-preemptive, cooperative threading programming language such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause floods of inter-processor cache-coherency traffic, which hurt performance, so they should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons that you can look up. What does that imply? If a thread is ready and another CPU is idling, the OS will not immediately migrate the thread to that CPU. Load balancing kicks in only over the longer run.
Had the situation been different, that is, had it not been a design goal to keep thread-CPU affinity and had thread migration been frequent, then keeping separate per-CPU run queues would be costly.
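A toy model of the per-CPU design might look like the following. It is a sketch under stated assumptions, not how any particular kernel does it: the class and method names are hypothetical, and the balancing rule (move one task when queue lengths differ by more than one) is deliberately simplistic.

```python
from collections import deque


class PerCpuScheduler:
    """One ready queue per CPU, with occasional load balancing."""

    def __init__(self, n_cpus):
        self.queues = [deque() for _ in range(n_cpus)]

    def enqueue(self, cpu, task):
        """Make a task ready on the CPU it last ran on (affinity)."""
        self.queues[cpu].append(task)

    def next_task(self, cpu):
        """Each CPU looks only at its own queue, so no shared lock is needed.

        An idle CPU does not immediately pull work from a busy one.
        """
        queue = self.queues[cpu]
        return queue.popleft() if queue else None

    def balance(self):
        """Periodic load balancing: migrate one task from the busiest
        queue to the emptiest one if they differ by more than one task.
        Returns True if a task was migrated."""
        busiest = max(self.queues, key=len)
        idlest = min(self.queues, key=len)
        if len(busiest) - len(idlest) > 1:
            idlest.append(busiest.pop())
            return True
        return False
```

With three tasks queued on CPU 0 and none on CPU 1, `next_task(1)` returns nothing until `balance()` has run. That is the trade-off described above: less contention and better cache affinity, at the price of some idle time before migration.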