StarCounter and CAP - NoSQL

I have been reading about a database named Starcounter. It claims that it can handle loads that a "NoSQL" database can only handle by dropping consistency - without dropping consistency itself. As far as I understand the CAP theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes StarCounter work?
I can imagine that StarCounter is fast, but the claim that NoSQL needs to drop consistency to keep up seems a little strange to me. Can anyone please explain?
Thanks in advance
Roland

The short answer
The CAP theorem (aka Brewer's theorem) cannot be beaten for a single piece of information (such as a consistent database). If you have a horizontally scaled database, you won't get both consistency and performance. This conclusion follows from the laws of physics and can be deduced from Brewer's theorem and Einstein's theories of relativity. You need to scale in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job of honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code are moved to the database. Only the data in the final response leaves its original place in RAM. This means most of the data stays put even if there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests/sec, each requiring extensive database access.
A more thorough answer
The question is a good one. It is no wonder you find it strange, as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid is when a theoretical physicist tells him to stop looking for a perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem states that no piece of information can have consistency (C), availability (A) and partition tolerance (P) all at once. It applies to a unit of information (such as a database). You can of course have independent pieces of information that operate differently: one piece could be AP, another could be CA and a third could be CP. You just can't have the same information being CAP.
The impossibility of having 'P' in a consistent and available database comes from the fact that a scaled-out database MUST signal between its nodes. The conclusion must be that, even a hundred years from now, CAP dictates that a single piece of consistent data will have to live on hardware interconnected using hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available, consistent database. Since good performance was the very reason to scale horizontally in the first place, this is a very bad thing. Because every node needs to communicate with the other nodes whenever there is database access in order to achieve consistency, and given that signalling is ultimately limited by the speed of light, you are left with the sad but true fact that database scientists (as well as CPU scientists) are not just being stubborn when they fail to see scale-out as a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in an AP set, so remember, we are talking about consistent data here). Add Einstein's theories to the CAP theorem, and for consistent data the small box wins over the cloudy data center.
Perpetual machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage were the way to get to work. Without any theoretical evidence against them, the patent offices granted hundreds of patents for impossible perpetual motion machines. Today we may laugh at this, but we have a similar situation in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot-com crash that mathematicians at MIT proved Brewer right and the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you want, to the way laggards kept trying to invent the perpetual motion machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack Overflow still making drawings of bearings and arms moving ad infinitum of their own accord - I don't mean to be offensive).
CAP and performance
All is not lost, however. Not all pieces of information need to be consistent. Not all pieces need to scale out. You just have to accept Brewer's theorem and make the best of it.
For applications such as Facebook, consistency is dropped. This is okay because data is entered once and then manipulated by a single user. Still, we can see the side effects in everyday Facebook usage, such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to amount to zero. Your stock inventory must equal 8 if you sold 2 out of 10 items, even if there are multiple users buying from the same stock.
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without making your scale-out result in less performance rather than more. Of course, this is only true for consistent data. The implications of this have been known by the engineers of Intel, AMD, Oracle et al. for a long time. It is not that their scientists haven't heard of scale-out; it is just that they have come to accept the world as Einstein described it.
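To put a rough number on the signalling cost described above, here is a back-of-envelope sketch; the distances and the local RAM access cost are illustrative assumptions, not measurements of any particular system:

```python
# Back-of-envelope: lower bound on cross-node signalling cost for consistent data.
# All figures are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_M_PER_S = 299_792_458

def one_way_latency_ns(distance_m: float) -> float:
    """Minimum one-way signal latency imposed by the speed of light."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S * 1e9

# A round trip across an assumed 100 m data-center row costs at least:
rtt_ns = 2 * one_way_latency_ns(100)   # ~667 ns, before any switching or queueing
local_ram_access_ns = 100              # assumed cost of touching local RAM

print(f"cross-node round trip >= {rtt_ns:.0f} ns")
print(f"~{rtt_ns / local_ram_access_ns:.0f}x an assumed {local_ram_access_ns} ns local RAM access")
```

Even with these generous assumptions, every consistency handshake between nodes costs several local RAM accesses, which is the point the answer is making.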
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare for each human being living on Earth for every second it is running (google 'modern CPU' and 'MIPS'). If you do some more math, like taking the total turnover of Amazon.com (you can find it at www.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in the RAM of a single modern PC. The cool thing is that the number of items, customers, orders, products etc. occupies the same amount of space in 2012 as it did in 1950. Images, video and audio have increased in size, but numeric and textual information does not grow per item. Sure, the number of transactions grows, but not at the same pace as computer power grows. So the logical solution is to scale out read-only and AP data and to "scale in/up" business data.
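A minimal sketch of the kind of math suggested above; the turnover, item price and record size below are placeholder assumptions, not actual Amazon figures:

```python
# Rough estimate: can a year of sales transactions fit in one PC's RAM?
# All inputs are illustrative assumptions, not actual figures.

annual_turnover_usd = 50e9       # assumed yearly turnover
avg_item_price_usd = 20          # assumed average price of an item (e.g. a book)
bytes_per_transaction = 200      # assumed size of one numeric/textual sales record

transactions_per_year = annual_turnover_usd / avg_item_price_usd
ram_needed_gb = transactions_per_year * bytes_per_transaction / 1e9

print(f"{transactions_per_year:.2e} transactions/year -> ~{ram_needed_gb:.0f} GB of RAM")
# ~2.5e9 transactions * 200 bytes ~= 500 GB: within reach of a single large-memory server.
```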
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly efficient machine code. This means that moving memory is the dominant bottleneck for the total throughput of a consistent database. This is often referred to as the memory wall (Wikipedia has some useful information).
The trick is to transfer code to the database image instead of data from the database image to the code (if using an MVC or MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (the disk merely secures transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large-scale application with hundreds of millions of simultaneous users, this typically amounts to only a few million requests per second, something a single PC has no problem handling given that the HTTP packaging is done on gateway servers. Fortunately, such servers scale out beautifully as they don't need to share data.
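An illustrative sketch (not Starcounter's actual API) of the difference between shipping data out to the application and shipping the application's controller logic to the process that owns the data; the store and controller names are made up:

```python
# Contrast between "move data to code" and "move code to data".
# Hypothetical in-process store; this is NOT Starcounter's API.

class InProcessStore:
    def __init__(self):
        self.orders = {1: {"customer": "Roland", "total": 120.0}}

    # Traditional approach: copy the record out to the caller (serialization cost).
    def fetch_order(self, order_id):
        return dict(self.orders[order_id])          # data is copied out

    # "Code to data": run the caller's logic next to the data,
    # and only move the small final response.
    def run(self, controller, *args):
        return controller(self.orders, *args)       # data stays put

def apply_discount(orders, order_id, pct):
    orders[order_id]["total"] *= (1 - pct)          # mutate in place
    return orders[order_id]["total"]                # only a tiny response leaves

store = InProcessStore()
print(store.run(apply_discount, 1, 0.10))           # only the new total is "moved"
```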
As it turns out, the disk is fast at sequential writes, so a RAID disk array can persist terabytes of changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale a Starcounter node. It scales in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency, and Starcounter is not designed for partitioning, so it is less elegant than solutions such as VoltDB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simply accumulate all incoming HTTP requests for a millisecond at a time. That might sound like a short amount of time, but it is enough to accumulate thousands of requests if you have decided you need to scale Starcounter. The batches of requests are then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, only one ZLATAN node is active at a time. This is how the CAP theorem is honored. To go above that, you need to consider the same tradeoff as Facebook and others.
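A toy sketch of the batching idea just described: accumulate requests for about a millisecond, then forward the whole batch to the single consistent node in one message. The queue, the window length and the "zlatan_node" function here just echo the answer; this is not Starcounter code:

```python
# Toy gateway that accumulates requests for ~1 ms, then forwards the whole
# batch to the single consistent node in one message. Illustrative only.
import queue
import time

incoming: "queue.Queue[str]" = queue.Queue()

def zlatan_node(batch):
    # Stand-in for the single node that applies all requests consistently.
    return [f"ok:{request}" for request in batch]

def gateway_tick(window_s: float = 0.001) -> list:
    """Drain whatever arrived during one ~1 ms window and send it as one batch."""
    deadline = time.monotonic() + window_s
    batch = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(incoming.get(timeout=remaining))
        except queue.Empty:
            break
    return zlatan_node(batch) if batch else []

# Usage: clients enqueue requests; the gateway calls gateway_tick() in a loop,
# so each round trip to the consistent node carries many requests at once.
incoming.put("GET /inventory/42")
print(gateway_tick())
```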
Another important note is that the ZLATAN node does not serve applications with data. Instead, the application's controller code is run by the ZLATAN node. The cost of serializing/deserializing and sending data to an application is far greater than the cost of processing the controller logic itself. I.e. the code is sent to the database instead of the other way around (the traditional approach is that the application asks for data or sends data).
Making the "shared-everything" node faster by doing less
Using the database as a "heap" for the programming language, instead of as a remote system for serialization and deserialization, is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM, which is what most RAM databases do.

There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL is about scalability. There is a trade-off between features plus scalability on one hand and speed on the other.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users, they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if their transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" Databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment etc. are distributed across different services, whereas others refer to multiple instances of the Order service that together serve customers and possibly use some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes used to store/serve part of the users' data, like records with primary keys 'A-C' on the first node, 'D-F' on the second node, etc. At a high level this looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to mean microservices architecture, horizontal scaling, partitioning (sharding) and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of those details; they must be specified explicitly for each DS. E.g. a distributed database does not necessarily mean sharding is used.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, and multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus, either by implementing it in the application, by using a dedicated service such as ZooKeeper, or by abusing a DB table for that purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute (effectively, a distributed system) and division of storage (sharding), as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
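A minimal sketch of the key-range sharding mentioned above (routing a record to a storage node by the first letter of its primary key); the ranges and node names are made-up examples:

```python
# Minimal key-range sharding: route a record to a storage node by the first
# letter of its primary key. Ranges and node names are made-up examples.
import bisect

# Inclusive upper bound of each key range, and the node that owns it.
RANGE_UPPER_BOUNDS = ["C", "F", "Z"]
NODES = ["node-1 (A-C)", "node-2 (D-F)", "node-3 (G-Z)"]

def shard_for(primary_key: str) -> str:
    first = primary_key[0].upper()
    return NODES[bisect.bisect_left(RANGE_UPPER_BOUNDS, first)]

print(shard_for("Alice"))    # node-1 (A-C)
print(shard_for("Eve"))      # node-2 (D-F)
print(shard_for("Mallory"))  # node-3 (G-Z)
```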
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (i.e. with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all the components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
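A sketch of the "approximate" approach mentioned above: each node enforces a local share of the global limit and only occasionally reconciles with its peers. The shared store below is a plain dict standing in for something like Redis; the numbers and reconciliation rule are assumptions, not a specific product's algorithm:

```python
# Approximate distributed rate limiting: each node enforces its local share of
# a global limit and only occasionally reconciles counts with its peers.
import time

GLOBAL_LIMIT_PER_SEC = 1000
shared_store = {}                      # node_id -> last reported local rate (stand-in for Redis)

class LocalLimiter:
    def __init__(self, node_id, sync_interval_s=1.0):
        self.node_id = node_id
        self.sync_interval_s = sync_interval_s
        self.local_budget = GLOBAL_LIMIT_PER_SEC   # optimistic until the first sync
        self.count = 0
        self.last_sync = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.last_sync >= self.sync_interval_s:
            self._sync(now)
        if self.count < self.local_budget:
            self.count += 1
            return True
        return False                   # over the (approximate) limit

    def _sync(self, now):
        # Occasional coordination: publish my observed rate, read everyone else's,
        # and claim an equal share of whatever headroom is left.
        shared_store[self.node_id] = self.count / max(now - self.last_sync, 1e-9)
        total_rate = sum(shared_store.values())
        peers = max(len(shared_store), 1)
        headroom = max(GLOBAL_LIMIT_PER_SEC - total_rate, 0)
        self.local_budget = shared_store[self.node_id] + headroom / peers
        self.count, self.last_sync = 0, now

limiter = LocalLimiter("node-a")
print(sum(limiter.allow() for _ in range(1500)))   # roughly the global limit
```

Between syncs the enforced limit can drift above or below the global target, which is exactly the consistency-for-latency trade the answer describes.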

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partition tolerant: the system can continue upholding the A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, then something is bothering me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's partition tolerance that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say: we don't care about partitions. You can run them inside racks with server hardware and make partitioning extremely unlikely. The problem is: what if partitions do occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than providing the same service and serving different data - choosing availability (AP)
stop providing the service, because it couldn't guarantee consistency anymore, since it doesn't know if the other server is down or in fact up and running and just the communication between these two broke off - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both Availability AND Consistency, once partitioning occurs, you can either go for availability and hope for the best, or play it safe and be unavailable, but consistent.
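A toy sketch of the two choices just described, under the assumption that a node can only suspect (not know) that its peer is unreachable; the class and field names are made up:

```python
# Toy illustration of the choice a node faces when it suspects a partition:
# keep answering from possibly-stale local data (AP) or refuse to answer (CP).

class Unavailable(Exception):
    """Raised when a CP node refuses to serve during a suspected partition."""

class Node:
    def __init__(self, mode: str):
        self.mode = mode               # "AP" or "CP"
        self.local_copy = {"balance": 100}
        self.peer_reachable = True     # flips to False when the network splits

    def read(self, key):
        if self.peer_reachable:
            return self.local_copy[key]     # normal case: C and A both hold
        if self.mode == "AP":
            return self.local_copy[key]     # answer anyway, possibly stale
        raise Unavailable("cannot confirm consistency during partition")

cp = Node("CP")
cp.peer_reachable = False
try:
    cp.read("balance")
except Unavailable as e:
    print("CP node:", e)                    # sacrifices availability

ap = Node("AP")
ap.peer_reachable = False
print("AP node:", ap.read("balance"))       # answers with possibly-stale data
```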
Here are 2 great posts, which should make it clear:
You Can't Sacrifice Partition Tolerance shows the idea that every truly distributed system needs to deal with partitioning now and then, and hence CA systems will break instantly at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and presents the CAP theorem as more flexible, where developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So to finally answer your question: if you take a CP system and replicate it more often, you either run into the overhead of messages sent between the nodes of the system to keep it consistent, or, in case a substantial part of the nodes fails or a network partition occurs without any part having a clear majority, it won't be able to continue operating, as it wouldn't be able to guarantee consistency anymore. But yes, these lines are getting more blurred now, and I think the references I've provided will give you a much better understanding.

Difference between centralized and distributed computing

Can anyone tell me the differences between centralized and distributed computing?
Centralized
A system with a centralized multiprocessor parallel architecture. Since the late 1980s, centralized systems have been progressively replaced by distributed systems.
Characteristics of a centralized system
Non-autonomous components
Usually homogeneous technology
Multiple users share the same resources at all times
Single point of control
Single point of failure
Distributed
A set of tightly coupled programs executing on one or more computers that are interconnected through a network and coordinate their actions. These programs know about one another and carry out tasks that none could carry out in isolation.
Characteristics of a distributed system
Autonomous components
Mostly built using heterogeneous technology
System components may be used exclusively
Concurrent processes can execute
Multiple points of failure
Requirements of a distributed system
Scalability - possibility of adding new hosts
Openness - easily extended and modified
Heterogeneity - supports various hardware and software platforms
Resource sharing - hardware, software and data
Fault tolerance - ability to function correctly even if faults occur
Centralized: all calculations are done on one particular computer (system). Example: you have a dedicated server for calculating data.
Distributed: the calculation is distributed to multiple computers. Example: when you have a large amount of data then you can divide it and send each part to particular computers which will make the calculations for their part.
Main basic differences are:
distrib-systems have no global state
no shared memory
no shared variables
distrib-systems have no shared time clock
therefore the ordering of events is difficult
distrib-systems can have race conditions
for race conditions, see http://en.wikipedia.org/wiki/Race_condition
So "computing" in a distributed environment is very difficult. Do you have a concrete question about programming models or anything else?
Centralized Systems
"In Centralized Systems,several jobs are done on a particular central processing unit(CPU)"
Distributed Systems
"In Distributed Systems,jobs are distributed among several processor.The Processor are interconnected by a computer network"
(Sheheryar ,NUML)
Briefly: centralized computing, as the name itself suggests, is concerned with just a single server. The particular operation is performed at this server location and nowhere else.
Distributed computing is used where the system requirements are quite large, and the job is distributed to several processors, whose partial solutions are then combined together, keeping in mind that the processors are interconnected by a computer network.
Centralized system: a system in which computing is done at a central location using terminals attached to a central computer; in brief, a mainframe and dumb terminals, where all computation is done on the mainframe through the terminals.
Distributed system: a collection of independent computers that appears to its users as a single coherent system. The hardware is distributed, consisting of n processing elements (processor and memory); the software is also distributed: there is no centralized OS (each processing element has its own), no physically centralized file system, and inter-process communication happens via message passing at the lowest level.
Big note: the main difference is reliability. In a distributed system, if one machine crashes, the system as a whole can still survive.
Method of arbitration: in all but the simplest systems, more than one module may need control of the bus. In a centralized scheme, a single hardware device, referred to as a bus controller or arbiter, is responsible for allocating time on the bus. In a distributed scheme, there is no central controller; rather, each module contains access control logic and the modules act together to share the bus.
In a centralized system, if the server fails it affects the whole system, because the server controls the whole operation.
In a distributed system, if one machine fails it doesn't affect the operations of the other computers, because they are independent and distributed in their operations.
Let us try to understand this with an example.
Say you are carrying a large amount of money. You are in a crowded train, where your pocket may be picked and you might lose money. What is the ideal strategy for carrying money?
Put all money in a single pocket: in this case, it is easy for you to just put the money in the pocket and be done. When you go back home, you can simply take the money out of the pocket and count it. But wait: what if your pocket is picked? You lose ALL the money (bankrupt? eh!). It seems like it is not the best idea to store all the money in a SINGLE pocket. Let us think about what else we can do.
Divide your money: put some of it in the left pocket, put some in the right pocket and maybe put some in your bag (which has a limited capacity). You need to devise a strategy to divide the money you carry. Also, when you go back home, you will have to spend time collecting money from the different pockets and gathering it in one place. However, we are in a better situation now. If one of our pockets (or the bag) is picked, we do not lose ALL of the money. The chances of the bag, the left pocket and the right pocket all being picked are fairly low. With a little overhead of dividing money, you can now avoid losing all of it. Isn't that better?
This is how distributed systems work. They divide the information (money in your case) and keep it on different machines (pockets and bags for us). This way, if one of the machines goes down, we are not at a big loss. That is, we do not have a single point of failure.
Another important thing that distributed systems implement is data replication. They put replicas of the same information on multiple machines. This way, if one of the machines goes down, we do not lose the information. So we now have something called fault tolerance.
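A minimal sketch of that replication idea: write the same record to several nodes so that a single failure loses neither the data nor availability. The node count, class names and failure handling are illustrative, not any particular system's design:

```python
# Minimal replication sketch: write every record to several nodes so that a
# single node failure loses neither availability nor data. Illustrative only.

class FlakyNode:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

def replicated_put(nodes, key, value):
    acks = 0
    for node in nodes:
        try:
            node.put(key, value)
            acks += 1
        except ConnectionError:
            pass
    return acks                         # a caller could require e.g. a majority of acks

def replicated_get(nodes, key):
    for node in nodes:
        try:
            return node.get(key)
        except (ConnectionError, KeyError):
            continue
    raise KeyError(key)

nodes = [FlakyNode("a"), FlakyNode("b"), FlakyNode("c")]
replicated_put(nodes, "wallet", 500)
nodes[0].up = False                     # one "pocket" gets picked
print(replicated_get(nodes, "wallet"))  # 500 -- the data survives
```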

Differences between hard real-time, soft real-time, and firm real-time?

I have read the definitions for the different notions of real-time, and the examples provided for hard and soft real-time systems make sense to me. But, there is no real explanation or example of a firm real-time system. According to the link above:
Firm: Infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline.
Is there a clear distinction between firm real-time vs. hard or soft real-time, and is there a good example that illustrates that distinction?
In comments, Charles asked that I submit tag wikis for the new tags. The example of a "firm real-time system" I provided for the firm-real-time tag was a milk serving system. If the system delivers milk after its expiration time, then the milk is considered "not useful". One can tolerate eating cereal without milk, but the quality of the experience is degraded.
This is just the idea I formed in my head when I initially read the definition. I am looking for a much better example, and perhaps a better definition of firm real-time that will improve my notion of it.
Hard Real-Time
The hard real-time definition considers any missed deadline to be a system failure. This scheduling is used extensively in mission critical systems where failure to conform to timing constraints results in a loss of life or property.
Examples:
Air France Flight 447 crashed into the ocean after a sensor malfunction caused a series of system errors. The pilots stalled the aircraft while responding to outdated instrument readings. All 12 crew and 216 passengers were killed.
Mars Pathfinder spacecraft was nearly lost when a priority inversion caused system restarts. A higher priority task was not completed on time due to being blocked by a lower priority task. The problem was corrected and the spacecraft landed successfully.
An Inkjet printer has a print head with control software for depositing the correct amount of ink onto a specific part of the paper. If a deadline is missed then the print job is ruined.
Firm Real-Time
The firm real-time definition allows for infrequently missed deadlines. In these applications the system can survive task failures so long as they are adequately spaced; however, the value of a task's completion drops to zero once its deadline has passed, or it becomes impossible to use.
Examples:
Manufacturing systems with robot assembly lines where missing a deadline results in improperly assembling a part. As long as ruined parts are infrequent enough to be caught by quality control and not too costly, then production continues.
A digital cable set-top box decodes time stamps for when frames must appear on the screen. Since the frames are time order sensitive a missed deadline causes jitter, diminishing quality of service. If the missed frame later becomes available it will only cause more jitter to display it, so it's useless. The viewer can still enjoy the program if jitter doesn't occur too often.
Soft Real-Time
The soft real-time definition allows for frequently missed deadlines, and as long as tasks are timely executed their results continue to have value. Completed tasks may have increasing value up to the deadline and decreasing value past it.
Examples:
Weather stations have many sensors for reading temperature, humidity, wind speed, etc. The readings should be taken and transmitted at regular intervals, however the sensors are not synchronized. Even though a sensor reading may be early or late compared with the others it can still be relevant as long as it is close enough.
A video game console runs software for a game engine. There are many resources that must be shared between its tasks. At the same time, tasks need to be completed according to the schedule for the game to play correctly. As long as tasks are being completed relatively on time the game will be enjoyable, and if not it may only lag a little.
Siewert: Real-Time Embedded Systems and Components.
Liu & Layland: Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment.
Marchand & Silly-Chetto: Dynamic Scheduling of Soft Aperiodic Tasks and Periodic Tasks with Skips.
Hard real-time means you must absolutely hit every deadline. Very few systems have this requirement. Some examples are nuclear systems, some medical applications such as pacemakers, a large number of defense applications, avionics, etc.
Firm/soft real time systems can miss some deadlines, but eventually performance will degrade if too many are missed. A good example is the sound system in your computer. If you miss a few bits, no big deal, but miss too many and you're going to eventually degrade the system. Similar would be seismic sensors. If you miss a few datapoints, no big deal, but you have to catch most of them to make sense of the data. More importantly, nobody is going to die if they don't work correctly.
The line is fuzzy, because even a pacemaker can be off by a small amount without killing the patient, but that's the general gist.
It's sort of like the difference between hot and warm. There's not a real divide, but you know it when you feel it.
After reading the Wikipedia page and other pages on real-time computing, I made the following inferences:
1> For a hard real-time system, if the system fails to meet a deadline even once, it is considered to have failed.
2> For a firm real-time system, even if the system fails to meet the deadline, possibly more than once (i.e. for multiple requests), it is not considered to have failed. However, the responses to those requests (replies to a query, results of a task, etc.) are worthless once the deadline for that particular request has passed (the usefulness of a result is zero after its deadline). A hypothetical example is a storm forecast system: if a storm is predicted before it arrives, the system has done its job; a prediction after the event has already happened, or while it is happening, is of no value.
3> For a soft real-time system, even if the system fails to meet the deadline, possibly more than once (i.e. for multiple requests), it is not considered to have failed. But in this case the results of the requests are not worthless: the value of a result after its deadline is not zero; rather, it degrades as time passes after the deadline. E.g.: streaming audio-video.
Here is a link to a resource that was very helpful.
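One way to make the three inferences above concrete is to write the "usefulness of a result" as a function of its completion time. The shapes below simply mirror the descriptions; the deadline and decay numbers are arbitrary:

```python
# Usefulness of a result as a function of its completion time, relative to the
# deadline. The shapes mirror the definitions above; the numbers are arbitrary.

def hard_value(t, deadline):
    # Missing the deadline even once is a system failure.
    return 1.0 if t <= deadline else float("-inf")

def firm_value(t, deadline):
    # A late result is simply worthless (value drops to zero).
    return 1.0 if t <= deadline else 0.0

def soft_value(t, deadline, decay=0.1):
    # A late result still has some value, degrading the later it arrives.
    return 1.0 if t <= deadline else max(0.0, 1.0 - decay * (t - deadline))

deadline = 10
for t in (8, 10, 12, 25):
    print(t, hard_value(t, deadline), firm_value(t, deadline), soft_value(t, deadline))
```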
It's popular to associate some great catastrophe with the definition of hard real-time, but this is not relevant. Any failure to meet a hard real-time constraint simply means that the system is broken. The severity of the outcome when something is labelled "broken" isn't material to the definition.
Firm and soft simply fail to be automatically declared broken on failing to meet a single deadline.
For a fair example of hard real-time, from the page you linked:
Early video game systems such as the Atari 2600 and Cinematronics vector graphics had hard real-time requirements because of the nature of the graphics and timing hardware.
If something in the video generation loop missed just a single deadline then the whole display would glitch, which would be intolerable, even if it was rare. That would be a broken system and you'd take it back to the shop for a refund. So it's hard real-time.
Obviously any system can be subject to situations it cannot handle, so it's necessary to restrict the definition to being within the expected operating conditions -- noting that in safety-critical applications people must plan for terrible conditions ("the coolant has evaporated", "the brakes have failed", but rarely "the sun has exploded").
And let's not forget that sometimes there's an implicit "while anybody is watching" operating condition. If nobody sees you break the rules (or if they did but they died in the fire before telling anyone), and nobody can prove that you broke the rules after the fact, then it's kind of the same as if you never broke the rules!
The simplest way to distinguish between the different kinds of real-time systems is to answer the question:
Is a delayed system response (after the deadline) still useful or not?
So depending on the answer you get for this question, your system could be included as one of the following categories:
Hard: No, and delayed answers are considered a system failure
This is the case when missing the deadline will make the system unusable. For example, the system controlling a car's airbag should detect the crash and rapidly inflate the bag. The whole process takes more or less one twenty-fifth of a second. Thus, if the system reacts with, say, 1 second of delay, the consequences could be fatal, and there is no benefit to having the bag inflate once the car has already crashed.
Firm: No, but delayed answers are not necessarily a system failure
This is the case when missing the deadline is tolerable but it affects the quality of the service. As a simple example, consider a video encryption system. Normally the encryption password is generated on the server (the video head end) and sent to the customer's set-top box. This process should be synchronized so that the set-top box receives the password before it starts receiving the encrypted video frames. In this case a delay may lead to video glitches, since the set-top box is not able to decode the frames because it hasn't received the password yet. So the service (a film, an interesting football match, etc.) can be affected by not meeting the deadline. Receiving the password late is not useful, since the frames encrypted with it have already caused the glitches.
Soft: Yes, but the system service is degraded
As in the Wikipedia description, the usefulness of a result degrades after its deadline. That means getting a response from the system past the deadline is still useful for the end user, but its usefulness degrades once the deadline has been reached. A simple example for this case is software that automatically controls the temperature of a room (or a building). In this case, if the system has some delay reading the temperature sensors, it will be a little slow to react to abrupt temperature changes. However, it will eventually react to the change and adjust the temperature to keep it constant. So in this case the delayed reaction is useful, but it degrades the system's quality of service.
Soft real-time is the easiest to understand: even if the result is obtained after the deadline, it is still considered valid.
Example: a web browser. We request a certain URL and it takes some time to load the page. If the system takes more than the expected time to provide the page, the page obtained is not considered invalid; we just say that the system's performance wasn't up to the mark (the system gave low performance!).
In hard real time system, if the result is obtained after the deadline, the system is considered to have failed completely.
Example: a robot doing some job like line tracing. If an obstacle appears in its path and the robot doesn't process this information within some programmed deadline (almost instantly!), the robot is said to have failed in its task (and the robot system may also get completely destroyed!).
In a firm real-time system, if the result of a process execution comes after the deadline, we discard that result, but the system is not considered to have failed.
Example: satellite communication for enemy position monitoring or some other task. If the ground station to which the satellites periodically send frames is overloaded, and the current frame (packet) is not processed in time before the next frame comes up, the current packet (the one that missed its deadline) is dropped/discarded, no matter whether its processing was done, half done, or almost done. But the ground computer is not considered to have completely failed.
To define "soft real-time," it is easiest to compare it with "hard real-time." Below we will see that the term "firm real-time" constitutes a misunderstanding about "soft real-time."
Speaking casually, most people implicitly have an informal mental model that considers information or an event as being "real-time"
• if, or to the extent that, it is manifest to them with a delay (latency) that can be related to its perceived currency
• i.e., in a time frame that the information or event has acceptably satisfactory value to them.
There are numerous different ad hoc definitions of "hard real-time," but in that mental model, hard real-time is represented by the "if" term. Specifically, assuming that real-time actions (such as tasks) have completion deadlines, acceptably satisfactory value of the event that all tasks complete is limited to the special case that all tasks meet their deadlines.
Hard real-time systems make the very strong assumption that everything about the application, system and environment is static and known a priori, e.g., which tasks there are, that they are periodic, their arrival times, their periods, their deadlines, that they won't have resource conflicts, and overall the time evolution of the system. In an aircraft flight control system or automotive braking system and many other cases those assumptions can usually be satisfied, so that all the deadlines will be met.
This mental model is deliberately and very usefully general enough to encompass both hard and soft real-time; soft is accommodated by the "to the extent that" phrase. For example, suppose that the task-completions event has suboptimal but acceptable value if (see the sketch after this list):
no more than 10% of the tasks miss their deadlines
or no task is more than 20% tardy
or the average tardiness of all tasks is no more than 15%
or the maximum tardiness among all tasks is less than 10%
These are all common examples of soft real-time cases in a great many applications.
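A small sketch that evaluates a set of completion times against criteria of the kind listed above; the thresholds are the example numbers from the list, and the function names are made up:

```python
# Check a batch of task completions against soft real-time acceptance criteria
# of the kind listed above. Thresholds are the example numbers from the list.

def tardiness(completion, deadline):
    """Relative lateness: 0 if on time, 0.2 means 20% past the deadline."""
    return max(0.0, (completion - deadline) / deadline)

def acceptable(tasks):
    """tasks: list of (completion_time, deadline) pairs."""
    tard = [tardiness(c, d) for c, d in tasks]
    missed_fraction = sum(t > 0 for t in tard) / len(tard)
    return (missed_fraction <= 0.10            # no more than 10% of tasks miss their deadline
            or max(tard) <= 0.20               # or no task is more than 20% tardy
            or sum(tard) / len(tard) <= 0.15   # or average tardiness is no more than 15%
            or max(tard) < 0.10)               # or maximum tardiness is less than 10%

jobs = [(9, 10), (11, 10), (10, 10), (12, 10)]     # (completion, deadline) pairs
print(acceptable(jobs))                            # suboptimal but acceptable
```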
Consider the single-task application of picking your child up after school. That probably does not have an actual deadline, instead there is some value to you and your child based on when that event takes place. Too early wastes resources (such as your time) and too late has some negative value because your child might be left alone and potentially in harm's way (or at least inconvenienced).
Unlike the static hard real-time special case, soft real-time makes only the minimum necessary application-specific assumptions about the tasks and system, and uncertainties are expected. To pick up your child, you have to drive to the school, and the time to do that is dynamic depending on weather, traffic conditions, etc. You might be tempted to over-provision your system (i.e., allow what you hope is the worst case driving time) but again this is wasting resources (your time, and occupying the family vehicle, possibly denying use by other family members).
That example may not seem to be costly in terms of wasted resources, but consider other examples. All military combat systems are soft real-time. For example, consider performing an aircraft attack on a hostile ground vehicle using a missile guided with updates to it as the target maneuvers. The maximum satisfaction for completing the course update tasks is achieved by a direct destructive strike on the target. But an attempt to over-provision resources to make certain of this outcome is usually far too expensive and may even be impossible. In this case, you may be less but sufficiently satisfied if the missile strikes close enough to the target to disable it.
Obviously combat scenarios have a great many possible dynamic uncertainties that must be accommodated by the resource management. Soft real-time systems are also very common in many civilian systems, such as industrial automation, although obviously military ones are the most dangerous and urgent ones to achieve acceptably satisfactory value in.
The keystone of real-time systems is "predictability." The hard real-time case is interested in only one special case of predictability--i.e., that the tasks will all meet their deadlines and the maximum possible value will be achieved by that event. That special case is named "deterministic."
There is a spectrum of predictability. Deterministic (determinism) is one end-point (maximum predictability) on the predictability spectrum; the other end-point is minimum predictability (maximum non-determinism). The spectrum's metric and end-points have to be interpreted in terms of a chosen predictability model; everything between those two end-points is degrees of unpredictability (= degrees of non-determinism).
Most real-time systems (namely, soft ones) have non-deterministic predictability, for example, of the tasks' completions times and hence the values gained from those events.
In general (in theory), predictability, and hence acceptably satisfactory value, can be made as close to the deterministic end-point as necessary--but at a price which may be physically impossible or excessively expensive (as in combat or perhaps even in picking up your child from school).
Soft real-time requires an application-specific choice of a probability model (not the common frequentist model) and hence predictability model for reasoning about event latencies and resulting values.
Referring back to the above list of events that provide acceptable value, now we can add non-deterministic cases, such as
the probability that no task will miss its deadline by more than 5% is greater than 0.87. (Note the number of scheduling criteria expressed in there.)
In a missile defense application, given the fact that in combat the offense always has the advantage over the defense, which of these two real-time computing scenarios would you prefer:
because the perfect destruction of all the hostile missiles is very unlikely or impossible, assign your defensive resources to maximize the probability that as many of the most threatening (e.g., based on their targets) hostile missiles will be successfully intercepted (close interception counts because it can move the hostile missile off-course);
complain that this is not a real-time computing problem because it is dynamic instead of static, and traditional real-time concepts and techniques do not apply, and it sounds more difficult than static hard real-time, so you are not interested in it.
Despite the various misunderstandings about soft real-time in the real-time computing community, soft real-time is very general and powerful, albeit potentially complex compared with hard real-time. Soft real-time systems as summarized here have a lengthy successful history of use outside the real-time computing community.
To directly answer the OP question:
A hard real-time system can provide deterministic guarantees (most commonly that all tasks will meet their deadlines, that interrupt or system call response time will always be less than x, etc.) IF AND ONLY IF very strong assumptions are made, and are correct, that everything that matters is static and known a priori (in general, such guarantees for hard real-time systems are an open research problem except for rather simple cases)
A soft real-time system does not make deterministic guarantees, it is intended to provide the best possible analytically specified and accomplished probabilistic timeliness and predictability of timeliness that are feasible under the current dynamic circumstances, according to application-specific criteria.
Obviously hard real-time is a simple special case of soft real-time. Obviously soft real-time's analytical non-deterministic assurances can be very complex to provide, but are mandatory in the most common real-time cases (including the most dangerous safety-critical ones such as combat) since most real-time cases are dynamic not static.
"Firm real-time" is an ill-defined special case of "soft real-time." There is no need for this term if the term "soft real-time" is understood and used properly.
I have a more detailed much more precise discussion of real-time, hard real-time, soft real-time, predictability, determinism, and related topics on my web site real-time.org.
real-time - Pertaining to a system or mode of operation in which computation is performed during the actual time that an external process occurs, in order that the computation results can be used to control, monitor, or respond to the external process in a timely manner. [IEEE Standard 610.12.1990]
I know this definition is old, very old. I can't, however, find a more recent definition by the IEEE (Institute of Electrical and Electronics Engineers).
Consider a task that inputs data from the serial port. When new data arrives, the serial port triggers an event. When the software services that event, it reads and processes the new data. The serial port has a hardware buffer to store incoming data (2 entries on the MSP432, 16 on the TM4C123), such that if the buffer is full and more data arrives, the new data is lost. Is this system hard, firm, or soft real time?
It is hard real time because if the response is late, data may be lost.
Consider a hearing aid that inputs sounds from a microphone, manipulates the sound data, and then outputs the data to a speaker. The system usually has small and bounded jitter, but occasionally other tasks in the hearing aid cause some data to be late, causing a noise pulse on the speaker. Is this system hard, firm or soft real time?
It is firm real time because it causes an error that can be perceived but the effect is harmless and does not significantly alter the quality of the experience.
Consider a task that outputs data to a printer. When the printer is idle the printer triggers an event. When the software services that event, it sends more data to the printer. Is this system hard, firm or soft real time?
It is soft real time because the faster it responds the better, but the value of the system (bandwidth is the amount of data printed per second) diminishes with latency.
UTAustinX: UT.RTBN.12.01x Realtime Bluetooth Networks
Maybe the definition is at fault.
From my experience, I would separate the two as being hardware and software dependent.
If you have 200ms to service a hardware driven interrupt, that is what you've got. You stick 300ms of code in there and the system isn't broken, it hasn't been developed. You'll be switched out before you've finished. Your code doesn't work or is not fit for purpose. Many systems have hard defined processing periods. Video, telecoms etc.
If you're writing an application that's real-time, this could be considered soft. If you run out of time you could hope for less load next time, you could tune the OS, add some memory or even upgrade the hardware. You have options.
To look at it from a UX perspective is not helpful. A Skoda might not be broken if it glitches, but a BMW sure as hell will be.
The definition has expanded over the years to the detriment of the term. What is now called "hard" real-time is what used to be simply called real-time. So systems in which missing timing windows (rather than single-sided time deadlines) would result in incorrect data or incorrect behavior should be considered real-time. Systems without that characteristic would be considered non-real-time.
That's not to say that time isn't of interest in non-real-time systems, it just means that timing requirements in such systems don't result in fundamentally incorrect results.
Hard real-time systems use a preemptive version of priority scheduling, so that critical tasks get scheduled immediately, whereas soft real-time systems use a non-preemptive version of priority scheduling, which allows the present task to finish before control is transferred to the higher-priority task, causing additional delays. Thus task deadlines are followed strictly in hard real-time systems, whereas in soft real-time systems they are not handled as strictly.

MongoDB - Cluster of small/many or few/big nodes

Can it be said what is best in the general case (where the database size is really big): to have a MongoDB cluster consisting of a larger number of smaller blade servers, or a few really fat servers?
Given is that the shard key has quite fine granularity, so splitting should not be a problem.
If there is no golden bullet, what are the pros and cons of either setup?
Best in what aspect? From a financial point of view, I'd go for lots of cheap hardware :)
MongoDB has been built to easily scale across nodes, so why not take advantage of this? The reason you'd want just one or a few beefy servers for a SQL server is to minimize the spread of relational data across physical nodes. But since MongoDB uses documents, most of your related data is stored in a single document. This means that it's all stored at the same physical location and you don't have to do costly lookups on other nodes to reconstruct the 'complete picture' of your data.
Another thing to keep in mind is that map-reduce jobs can only run in parallel in a sharded environment. So if you plan to do a lot of map-reducing, more shards/servers will result in better performance.
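If you do go the many-small-servers route, sharding has to be switched on per database and per collection. A hedged sketch using pymongo and the standard enableSharding/shardCollection admin commands against a mongos router; the connection string, database, collection and shard-key names are placeholders, not recommendations:

```python
# Hedged sketch: enabling sharding for a database and a collection through a
# mongos router, using pymongo. Names ("mydb", "orders", "customerId") and the
# connection string are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://my-mongos-host:27017")   # must point at a mongos

# Allow the database to be distributed across shards.
client.admin.command("enableSharding", "mydb")

# Pick a fine-grained shard key so chunks split and balance well,
# as the question already assumes.
client.admin.command("shardCollection", "mydb.orders", key={"customerId": 1})

# Reads/writes then go through mongos as usual; related data stays together
# in one document, so most queries hit a single shard.
client.mydb.orders.insert_one({"customerId": 42, "items": ["book"], "total": 20})
print(client.mydb.orders.find_one({"customerId": 42}))
```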
What if your database outgrows your beefy servers? Are you going to invest in another beefy server that handles that small amount of extra growth? Or what if one of them crashes? With smaller and cheaper servers, you can scale up (or down) more gradually if the need arises. Also, the impact of a server crash is much smaller, as it will affect only a small portion of your data.
To summarize: a large cluster of smaller servers isn't a silver bullet, as managing such a cluster has its own challenges, but it is significantly cheaper and possibly faster as well if you're doing map-reduce.