Out-of-order execution vs. speculative execution - cpu-architecture

I have read the Wikipedia pages about out-of-order execution and speculative execution.
What I fail to understand, though, are the similarities and differences. It seems to me that speculative execution relies on out-of-order execution, for example when it has not yet determined the value of a branch condition.
The confusion came when I read the Meltdown and Spectre papers and did additional research. The Meltdown paper states that Meltdown is based on out-of-order execution, while some other resources, including the Wikipedia page about speculative execution, state that Meltdown is based on speculative execution.
I'd like to get some clarification about this.

Speculative execution and out-of-order execution are orthogonal. One could design a processor that is OoO but not speculative, or speculative but in-order. OoO execution is an execution model in which instructions can be dispatched to execution units in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although it's possible to design an OoO processor that retires instructions in some unnatural order, subject to certain constraints; see the simulation-based study on this idea: Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit.)
Speculative execution is an execution model in which instructions can be fetched and enter the pipeline and begin execution without knowing for sure that they will indeed be required to execute (according to the control flow of the program). The term is often used to specifically refer to speculative execution in the execution stage of the pipeline. The Meltdown paper does define these terms on page 3:
In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.
The authors here specifically mean branch prediction combined with executing instructions past predicted branches in the execution units. This is commonly the intended meaning of the term. However, it's possible to design a processor that executes instructions speculatively without any branch prediction, by using other techniques such as value prediction and speculative memory disambiguation. This would be speculation on data or memory dependencies rather than on control: an instruction could be dispatched to an execution unit with an incorrect operand, or one that loads the wrong value. Speculation can also occur on the availability of execution resources, on the latency of an earlier instruction, or on the presence of a needed value in a particular unit in the memory hierarchy.
Note that instructions can be executed speculatively, yet in-order. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location. But still, instructions can also be executed in-order. However, note that once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions are issued. The Intel Bonnell microarchitecture is an example of a real processor that is in-order and supports branch prediction.
Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. Speculative execution is particularly beneficial when used with OoO.
The confusion came when I read the Meltdown and Spectre papers and did additional research. The Meltdown paper states that Meltdown is based on out-of-order execution, while some other resources, including the Wikipedia page about speculative execution, state that Meltdown is based on speculative execution.
The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is a somewhat vague statement since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, the ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.
See Peter's answer for more details on Meltdown and his other answer.
Related: What is the difference between Superscalar and OoO execution?.

I'm still having a hard time figuring out how Meltdown uses speculative execution. The example in the paper (the same one I mentioned here earlier) uses, IMO, only OoO - #Name in a comment
Meltdown is based on Intel CPUs optimistically speculating that loads won't fault, and that if a faulting load reaches the load ports, that it was the result of an earlier mispredicted branch. So the load uop gets marked so it will fault if it reaches retirement, but execution continues speculatively using data the page table entry says you aren't allowed to read from user-space.
Instead of triggering a costly exception-recovery when the load executes, it waits until it definitely reaches retirement, because that's a cheap way for the machinery to handle the branch miss -> bad load case. In hardware, it's easier for the pipe to keep piping unless you need it to stop / stall for correctness. e.g. A load where there's no page-table entry at all, and thus a TLB miss, has to wait. But waiting even on a TLB hit (for an entry with permissions that block using it) would be added complexity. Normally a page-fault is only ever raised after a failed page walk (which doesn't find an entry for the virtual address), or at retirement of a load or store that failed the permissions of the TLB entry it hit.
In a modern OoO pipelined CPU, all instructions are treated as speculative until retirement. Only at retirement do instructions become non-speculative. The Out-of-Order machinery doesn't really know or care whether it's speculating down one side of a branch that was predicted but not executed yet, or speculating past potentially-faulting loads. "Speculating" that loads don't fault or ALU instructions don't raise exceptions happens even in CPUs that aren't really considered speculative, but fully out-of-order execution turns that into just another kind of speculation.
I'm not too worried about an exact definition for "speculative execution", and what counts / what doesn't. I'm more interested in how modern out-of-order designs actually work, and that it's actually simpler to not even try to distinguish speculative from non-speculative until the end of the pipeline. This answer isn't even trying to address simpler in-order pipelines with speculative instruction-fetch (based on branch prediction) but not execution, or anywhere in between that and full-blown Tomasulo's algorithm with a ROB + scheduler with OoO exec + in-order retirement for precise exceptions.
For example, only after retirement can a store ever commit from the store buffer to L1d cache, not before. And to absorb short bursts and cache misses, it doesn't have to happen as part of retirement either. So one of the only non-speculative out-of-order things is committing stores to L1d; they have definitely happened as far as the architectural state is concerned, so they have to be completed even if an interrupt / exception happens.
The fault-if-reaching-retirement mechanism is a good way to avoid expensive work in the shadow of a branch mispredict. It also gives the CPU the right architectural state (register values, etc.) if the exception does fire. You do need that whether or not you let the OoO machinery keep churning on instructions beyond a point where you've detected an exception.
Branch-misses are special: there are buffers that record micro-architectural state (like register-allocation) on branches, so branch-recovery can roll back to that instead of flushing the pipeline and restarting from the last known-good retirement state. Branches do mispredict a fair amount in real code. Other exceptions are very rare.
Modern high-performance CPUs can keep (out-of-order) executing uops from before a branch miss, while discarding uops and execution results from after that point. Fast recovery is a lot cheaper than discarding and restarting everything from a retirement state that's potentially far behind the point where the mispredict was discovered.
E.g. in a loop, the instructions that handle the loop counter might get far ahead of the rest of the loop body, and detect the mispredict at the end soon enough to redirect the front-end and maybe not lose much real throughput, especially if the bottleneck was the latency of a dependency chain or something other than uop throughput.
This optimized recovery mechanism is only used for branches (because the state-snapshot buffers are limited), which is why branch misses are relatively cheap compared to full pipeline flushes. (e.g. on Intel, memory-ordering machine clears, performance counter machine_clears.memory_ordering: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
Exceptions are not unheard-of, though; page-faults do happen in the normal course of operation. e.g. store to a read-only page triggers copy-on-write. Load or store to an unmapped page triggers page-in or handling the lazy mapping. But thousands to millions of instructions usually run between every page fault even in a process that's allocating new memory frequently. (1 per micro or milli-second on a 1GHz CPU). In code that doesn't map new memory, you can go far longer without exceptions. Mostly just a timer interrupt occasionally in pure number crunching without I/O.
But anyway, you don't want to trigger a pipeline flush or anything expensive until you're sure that an exception will really fire. And that you're sure you have the right exception. e.g. maybe the load address for an earlier faulting load wasn't ready as soon, so the first faulting load to execute wasn't the first in program order. Waiting until retirement is a cheap way to get precise exceptions. Cheap in terms of additional transistors to handle this case, and letting the usual in-order retirement machinery figure out exactly which exception fires is fast.
The useless work done executing instructions after an instruction marked to fault on retirement costs a tiny bit of power, and isn't worth blocking because exceptions are so rare.
This explains why it makes sense to design hardware that was vulnerable to Meltdown in the first place. Obviously it's not safe to keep doing this, now that Meltdown has been thought of.
Fixing Meltdown cheaply
We don't need to block speculative execution after a faulting load; we just need to make sure it doesn't actually use sensitive data. It's not the load succeeding speculatively that's the problem; Meltdown is based on the following instructions using that data to produce data-dependent microarchitectural effects (e.g. touching a cache line based on the data).
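To make the "data-dependent microarchitectural effects" concrete, here is a hedged user-space sketch (C++, x86 intrinsics, GCC/Clang only) of just that encode/probe pattern: a byte value selects which line of a probe array gets touched, and a timing pass later reveals which line is hot. In a real Meltdown attack the encode step runs transiently on a value the program isn't allowed to read; here it runs on an ordinary local value, and the array size, 4096-byte stride, and use of rdtscp are illustrative choices, not anything specified by the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // _mm_clflush, __rdtscp (GCC/Clang, x86 only)

static uint8_t probe[256 * 4096];   // one page-sized slot per possible byte value

// Touch one cache line selected by the byte value. This is the data-dependent
// microarchitectural effect; in Meltdown this step runs under speculation on
// data the program isn't architecturally allowed to read.
static void encode(uint8_t value) {
    (void)*(volatile uint8_t *)&probe[value * 4096];
}

// Time a reload of each slot; the slot that was touched comes back much faster.
static int recover() {
    unsigned aux;
    int best = -1;
    uint64_t best_t = ~0ull;
    for (int v = 0; v < 256; v++) {
        volatile uint8_t *p = &probe[v * 4096];
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;
        uint64_t dt = __rdtscp(&aux) - t0;
        if (dt < best_t) { best_t = dt; best = v; }
    }
    return best;
}

int main() {
    for (size_t i = 0; i < sizeof(probe); i++) probe[i] = 1;      // fault the pages in
    for (int v = 0; v < 256; v++) _mm_clflush(&probe[v * 4096]);  // evict all probe lines
    encode(42);                 // stand-in for the transient, unauthorized access
    printf("hottest slot: %d\n", recover());   // expect 42 (modulo timing noise)
}
```

Flushing the probe array first and timing the reloads afterwards is the Flush+Reload pattern the Meltdown paper uses as its covert channel; any other cache side channel would work as well.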
So if the load ports mask the loaded data to zero or something as well as setting the fault-on-retirement flag, execution continues but can't gain any info about the secret data. This should take about 1 extra gate delay of critical path, which is probably possible in the load ports without limiting the clock speed or adding an extra cycle of latency. (1 clock cycle is long enough for logic to propagate through many AND/OR gates within a pipeline stage, e.g. a full 64-bit adder).
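Expressed in C-like terms purely for illustration (real hardware would do this in the load port's data path, not in software; the function and its parameters are made up), the masking idea amounts to ANDing the data with the result of the permission check:

```cpp
#include <cstdint>

// Hedged sketch of the proposed load-port behaviour: on a TLB permission
// failure, mark the load to fault at retirement and squash the value that
// dependent uops see to zero, so no data-dependent effect can leak it.
uint64_t load_result(uint64_t data_from_cache, bool permission_ok, bool &fault_on_retire) {
    fault_on_retire = !permission_ok;
    uint64_t mask = permission_ok ? ~0ull : 0ull;   // roughly 1 extra gate delay in hardware
    return data_from_cache & mask;                  // dependent uops see 0, not the secret
}
```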
Related: I suggested the same mechanism for a HW fix for Meltdown in Why are AMD processors not/less vulnerable to Meltdown and Spectre?.

Related

What happens when a core writes to its L1 cache while another core holds the same line in its L1 too?

What happens when a core writes to its L1 cache while another core holds the same line in its L1 too?
Let's say for an Intel Skylake CPU.
How does the cache system preserve consistency? Does it update in real time, or does it stop one of the cores?
What's the performance cost of two cores continuously writing to the same cache line?
In general, modern CPUs use some variant [1] of the MESI protocol to preserve cache coherency.
In your scenario of an L1 write, the details depend on the existing state of the cache lines: is the line already in the cache of the writing core? If it is in the other core, what state is it in there, e.g., has it been modified?
Let's take the simple case where the line isn't already in the writing core (C1), and it is in the "exclusive" state in the other core (C2). At the point where the address for the write is known, C1 will issue an RFO (request for ownership) transaction onto the "bus" with the address of the line, and the other cores will snoop the bus and notice the transaction. The other core that has the line will then transition its line from the exclusive to the invalid state, and the value of the line will be provided to the requesting core, which will have it in the modified state, at which point the write can proceed.
Note that at this point, further writes to that line from the writing core proceed quickly, since it is in the M state which means no bus transaction needs to take place. That will be the case until the line is evicted or some other core requests access.
Now, there are a lot of additional details in actual implementations which aren't covered above or even in the wikipedia description of the protocol.
For example, the basic model involves a single private cache per CPU, and shared main memory. In this model, core C2 would usually provide the value of the shared line onto the bus, even though it has not modified it, since that would be much faster than waiting to read the value from main memory. In all recent x86 implementations, however, there is a shared last-level L3 cache which sits between all the private L2 and L1 caches and main memory. This cache has typically been inclusive so it can provide the value directly to C1 without needing to do a cache-to-cache transfer from C2. Furthermore, having this shared cache means that each CPU may not actually need to snoop the "bus" since the L3 cache can be consulted first to determine which, if any, cores actually have the line. Only the cores that have the line will then be asked to make a state transition. Kind of a push model rather than pull.
Despite all these implementation details, the basics are the same: each cache line has some "per core" state (even though this state may be stored or duplicated in some central place like the LLC), and this state atomically undergoes logical transitions that ensure that the cache line remains consistent at all times.
Given that background, here are some specific answers to your final two sub-questions:
Does it update in real time, or does it stop one of the cores?
Any modern core is going to do this in real time, and also in parallel for different cache lines. That doesn't mean it is free! For example, in the description above, the write by C1 is stalled until the cache coherence protocol is complete, which is likely dozens of cycles. Contrast that with a normal write which takes only a couple cycles. There are also possible bandwidth issues: the requests and responses used to implement the protocol use shared resources that may have a maximum throughput; if the rate of coherence transactions passes some limit, all requests may slow down even if they are independent.
In the past, when there was truly a shared bus, there may have been some partial "stop the world" behavior in some cases. For example, the lock prefix for x86 atomic instructions is apparently named based on the lock signal that a CPU would assert on the bus while it was doing an atomic transaction. During that entire period other CPUs are not able to fully use the bus (but presumably they could still proceed with CPU-local instructions).
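As a small aside on that lock prefix: an atomic read-modify-write in C++ (a hedged sketch, not tied to the question's Skylake scenario) compiles on x86 to a lock-prefixed instruction, and on modern CPUs the atomicity for cacheable memory is provided by holding the cache line rather than by locking the whole bus:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> counter{0};

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++)
        workers.emplace_back([] {
            for (int j = 0; j < 1'000'000; j++)
                counter.fetch_add(1, std::memory_order_relaxed);   // x86: a lock-prefixed add/xadd
        });
    for (auto &t : workers) t.join();
    printf("%ld\n", counter.load());   // always 4000000, regardless of interleaving
}
```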
What's the performance cost of two cores continuously writing to the same cache line?
The cost is very high because the line will continuously ping-pong between the two cores as described above (at the end of the process described, just reverse the roles of C1 and C2 and restart). The exact details vary a lot by CPU and even by platform (e.g., a 2-socket configuration will change this behavior a lot), but basically you are probably looking at a penalty of tens of cycles per write, versus a non-shared throughput of 1 write per cycle.
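Here is a hedged microbenchmark sketch of that ping-pong (the 64-byte line size, iteration count, and the assumption that the two adjacent counters land on one line are mine; real numbers depend heavily on the CPU): two threads hammer counters that either share a cache line or sit on separate lines.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine { std::atomic<long> a{0}, b{0}; };   // adjacent: very likely one 64-byte line
struct PaddedLine { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

template <class Layout>
static double time_two_writers() {
    Layout s;
    auto writer = [](std::atomic<long> &c) {
        for (long i = 0; i < 20'000'000; i++)
            c.fetch_add(1, std::memory_order_relaxed);   // each RMW needs the line in M state
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(writer, std::ref(s.a)), t2(writer, std::ref(s.b));
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    printf("same cache line: %.2f s\n", time_two_writers<SharedLine>());  // line ping-pongs between cores
    printf("separate lines:  %.2f s\n", time_two_writers<PaddedLine>());  // each core keeps its own line
}
```

On typical hardware the padded version is several times faster, which is exactly the tens-of-cycles-per-write penalty described above.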
You can find some specific numbers in the answers to this question which covers both the "two threads on the same physical core" case and the "separate cores" case.
If you want more details about specific performance scenarios, you should probably ask a separate specific question that lays out the particular behavior you are interested in.
[1] The variations on MESI often introduce new states, such as the "owned" state in MOESI or the "forwarded" state in MESIF. The idea is usually to make certain transitions or usage patterns more efficient than in the plain MESI protocol, but the basic idea is largely the same.

Making a pipelined processor with instructions issued in alternate clock cycles

Why can't we design a (semi)pipelined processor that issues instruction at every alternate clock tick, instead of the pipelined processor that issues instruction at every clock tick?
Having the instructions wait would probably reduce the hazards and stalls that we otherwise try to solve in complicated ways. It could completely eliminate branch stalls and thus avoid the expensive pipeline flush.
You answered your own question in the comments. You could design one, but you're essentially sacrificing potential performance in order to simplify your design. A slight variation on what you're suggesting is something called a barrel processor. Each cycle the processor takes one instruction from a different thread and this allows the pipeline to be simplified. The HEP architecture is another variant on this idea.

A program resistant to power/hardware/OS failures

I need to write a program that performs a parallel search in a large space of possible states, with new areas being discovered (and their exploration started) in the process, and exploration of some areas being terminated early as intermediate results obtained elsewhere eliminate the possibility of discovering new useful results in them. The search is performed using multiple threads running in heavy cooperation with each other to avoid recalculation of intermediate data.
A complex internal state (including call stacks of several threads and state synchronization primitives they use) has to be maintained and updated during the whole process, and there is no apparent way to split the computation into isolated chunks that can be executed sequentially, each saving and passing a small intermediate result to the next. Also, there is no way to split the computation into independent parallel threads not communicating with each other, without imposing a prohibitive overhead due to recalculation of large amount of intermediate data.
Because of the large search domain, the program would possibly run for months before producing a final result. Hence, there is a significant risk of power, hardware or OS failure during the program execution that can lead to a complete loss of all work that has been done to the moment. In such a case the program will need to restart all its computations from scratch.
I need a solution that can prevent a complete data loss in such cases. I thought of an execution engine/platform that continuously saves the current state of the process to failure-resistant storage like a redundant disk array or a database. But I understand that this approach can significantly slow down the process, even to the degree that there would be no benefit compared to the expected computation time including restarts due to possible failures.
In fact, I do not need an ideal solution that continuously saves the program state, and I can easily bear a loss of hours or maybe even days of work. A possible heavyweight solution that comes to my mind is to run the program inside a virtual machine, saving its snapshots from time to time, and restoring the machine after a possible host failure from a recent snapshot. This approach can also help to recover the program state after a random or preventable guest OS failure.
Is there a similar, but more lightweight solution limited to preserving a state of a single process? Or could you suggest any other approaches that can solve my problem?
You may want to look at using Erlang which allows large numbers of threads to run at relatively low cost. Because the thread cost is low, redundancy can be used to achieve increased reliability.
For the problem you present, a triple-redundancy scheme may be the way to go, where periodic checks for synchronization across the three (or more) systems would determine by vote who has failed.
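As a hedged illustration of the voting step (in C++ rather than Erlang only to match the other sketches in this thread; the function names and the use of a checkpoint digest are made up), a 2-of-3 comparison of periodic checkpoint digests is enough to identify the replica that has drifted:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

// Given three checkpoint digests from three replicas, return the index of the
// odd one out, or nothing if all agree (a simple 2-of-3 majority vote).
std::optional<int> odd_one_out(const std::array<uint64_t, 3> &digest) {
    if (digest[0] == digest[1] && digest[1] == digest[2]) return std::nullopt;
    if (digest[0] == digest[1]) return 2;
    if (digest[0] == digest[2]) return 1;
    if (digest[1] == digest[2]) return 0;
    return std::nullopt;   // all three differ: no majority, needs manual recovery
}

int main() {
    std::array<uint64_t, 3> d{0xabcd, 0xabcd, 0x1234};   // replica 2 has diverged
    if (auto bad = odd_one_out(d))
        printf("replica %d diverged; restart it from a healthy snapshot\n", *bad);
}
```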

Differences between hard real-time, soft real-time, and firm real-time?

I have read the definitions for the different notions of real-time, and the examples provided for hard and soft real-time systems make sense to me. But, there is no real explanation or example of a firm real-time system. According to the link above:
Firm: Infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline.
Is there a clear distinction between firm real-time vs. hard or soft real-time, and is there a good example that illustrates that distinction?
In comments, Charles asked that I submit tag wikis for the new tags. The example of a "firm real-time system" I provided for the firm-real-time tag was a milk serving system. If the system delivers milk after its expiration time, then the milk is considered "not useful". One can tolerate eating cereal without milk, but the quality of the experience is degraded.
This is just the idea I formed in my head when I initially read the definition. I am looking for a much better example, and perhaps a better definition of firm real-time that will improve my notion of it.
Hard Real-Time
The hard real-time definition considers any missed deadline to be a system failure. This scheduling is used extensively in mission critical systems where failure to conform to timing constraints results in a loss of life or property.
Examples:
Air France Flight 447 crashed into the ocean after a sensor malfunction caused a series of system errors. The pilots stalled the aircraft while responding to outdated instrument readings. All 12 crew and 216 passengers were killed.
Mars Pathfinder spacecraft was nearly lost when a priority inversion caused system restarts. A higher priority task was not completed on time due to being blocked by a lower priority task. The problem was corrected and the spacecraft landed successfully.
An Inkjet printer has a print head with control software for depositing the correct amount of ink onto a specific part of the paper. If a deadline is missed then the print job is ruined.
Firm Real-Time
The firm real-time definition allows for infrequently missed deadlines. In these applications the system can survive task failures so long as they are adequately spaced; however, once a deadline is missed, the value of the task's completion drops to zero or its completion becomes impossible.
Examples:
Manufacturing systems with robot assembly lines where missing a deadline results in improperly assembling a part. As long as ruined parts are infrequent enough to be caught by quality control and not too costly, then production continues.
A digital cable set-top box decodes time stamps for when frames must appear on the screen. Since the frames are time order sensitive a missed deadline causes jitter, diminishing quality of service. If the missed frame later becomes available it will only cause more jitter to display it, so it's useless. The viewer can still enjoy the program if jitter doesn't occur too often.
Soft Real-Time
The soft real-time definition allows for frequently missed deadlines, and as long as tasks are executed in a timely manner their results continue to have value. Completed tasks may have increasing value up to the deadline and decreasing value past it.
Examples:
Weather stations have many sensors for reading temperature, humidity, wind speed, etc. The readings should be taken and transmitted at regular intervals, however the sensors are not synchronized. Even though a sensor reading may be early or late compared with the others it can still be relevant as long as it is close enough.
A video game console runs software for a game engine. There are many resources that must be shared between its tasks. At the same time, tasks need to be completed according to the schedule for the game to play correctly. As long as tasks are completed relatively on time the game will be enjoyable, and if not it may only lag a little.
Siewert: Real-Time Embedded Systems and Components.
Liu & Layland: Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment.
Marchand & Silly-Chetto: Dynamic Scheduling of Soft Aperiodic Tasks and Periodic Tasks with Skips.
Hard real-time means you must absolutely hit every deadline. Very few systems have this requirement. Some examples are nuclear systems, some medical applications such as pacemakers, a large number of defense applications, avionics, etc.
Firm/soft real time systems can miss some deadlines, but eventually performance will degrade if too many are missed. A good example is the sound system in your computer. If you miss a few bits, no big deal, but miss too many and you're going to eventually degrade the system. Similar would be seismic sensors. If you miss a few datapoints, no big deal, but you have to catch most of them to make sense of the data. More importantly, nobody is going to die if they don't work correctly.
The line is fuzzy, because even a pacemaker can be off by a small amount without killing the patient, but that's the general gist.
It's sort of like the difference between hot and warm. There's not a real divide, but you know it when you feel it.
After reading the Wikipedia page and other pages on real-time computing, I made the following inferences:
1) For a hard real-time system, if the system fails to meet the deadline even once, the system is considered to have failed.
2) For a firm real-time system, even if the system fails to meet the deadline, possibly more than once (i.e. for multiple requests), the system is not considered to have failed. However, the responses to the requests (replies to a query, the result of a task, etc.) are worthless once the deadline for that particular request has passed (the usefulness of a result is zero after its deadline). A hypothetical example is a storm forecast system: if a storm is predicted before it arrives, the system has done its job; a prediction made after the event has already happened, or while it is happening, is of no value.
3) For a soft real-time system, even if the system fails to meet the deadline, possibly more than once (i.e. for multiple requests), the system is not considered to have failed. But in this case the results of the requests are not worthless: the value of a result after its deadline is not zero, rather it degrades as time passes after the deadline. E.g.: streaming audio/video.
Here is a link to a resource that was very helpful.
It's popular to associate some great catastrophe with the definition of hard real-time, but this is not relevant. Any failure to meet a hard real-time constraint simply means that the system is broken. The severity of the outcome when something is labelled "broken" isn't material to the definition.
Firm and soft simply fail to be automatically declared broken on failing to meet a single deadline.
For a fair example of hard real-time, from the page you linked:
Early video game systems such as the Atari 2600 and Cinematronics vector graphics had hard real-time requirements because of the nature of the graphics and timing hardware.
If something in the video generation loop missed just a single deadline then the whole display would glitch, which would be intolerable, even if it was rare. That would be a broken system and you'd take it back to the shop for a refund. So it's hard real-time.
Obviously any system can be subject to situations it cannot handle, so it's necessary to restrict the definition to being within the expected operating conditions -- noting that in safety-critical applications people must plan for terrible conditions ("the coolant has evaporated", "the brakes have failed", but rarely "the sun has exploded").
And let's not forget that sometimes there's an implicit "while anybody is watching" operating condition. If nobody sees you break the rules (or if they did, but they died in the fire before telling anyone), and nobody can prove that you broke the rules after the fact, then it's kind of the same as if you never broke the rules!
The simplest way to distinguish between the different kinds of real-time system types is answering the question:
Is a delayed system response (after the deadline) still useful or not?
So depending on the answer you get for this question, your system could be included as one of the following categories:
Hard: No, and delayed answers are considered a system failure
This is the case when missing the deadline will make the system unusable. For example, the system controlling a car's airbag must detect the crash and rapidly inflate the bag; the whole process takes roughly one twenty-fifth of a second. Thus, if the system reacted with, say, a one-second delay, the consequences could be fatal, and there would be no benefit in inflating the bag once the car has already crashed.
Firm: No, but delayed answers are not necessarily a system failure
This is the case when missing the deadline is tolerable but it affects the quality of the service. As a simple example, consider a video encryption system. Normally the encryption password is generated on the server (the video head end) and sent to the customer's set-top box. This process should be synchronized so that the set-top box normally receives the password before it starts receiving the encrypted video frames. In this case a delay may lead to video glitches, since the set-top box is not able to decode the frames because it hasn't received the password yet. In this case the service (a film, an interesting football match, etc.) could be affected by not meeting the deadline. Receiving the password late is not useful, since the frames encrypted with it have already caused the glitches.
Soft: Yes, but the system service is degraded
As per the Wikipedia description, the usefulness of a result degrades after its deadline. That means that getting a response from the system after the deadline is still useful for the end user, but its usefulness degrades once the deadline has passed. A simple example for this case is software that automatically controls the temperature of a room (or a building). In this case, if the system is a little slow reading the temperature sensors, it will be a little slow to react to abrupt temperature changes. However, in the end it will react to the change and adjust the temperature accordingly to keep it constant, for example. So in this case the delayed reaction is useful, but it degrades the system's quality of service.
Soft real-time is the easiest to understand: even if the result is obtained after the deadline, the result is still considered valid.
Example: a web browser. We request a certain URL, and it takes some time to load the page. If the system takes longer than expected to provide us with the page, the page obtained is not considered invalid; we just say that the system's performance wasn't up to the mark (the system gave low performance!).
In a hard real-time system, if the result is obtained after the deadline, the system is considered to have failed completely.
Example: a robot doing some job like line tracing. If an obstacle appears in its path and the robot doesn't process this information within some programmed deadline (almost instantly!), the robot is said to have failed in its task (and the robot may also get completely destroyed!).
In a firm real-time system, if the result of process execution comes after the deadline, we discard that result, but the system is not deemed to have failed.
Example: satellite communication for enemy position monitoring or some other task. If the ground station to which the satellites periodically send frames is overloaded, and the current frame (packet) is not processed in time before the next frame comes up, the current packet (the one that missed the deadline) is dropped/discarded regardless of whether its processing was done (or half done, or almost done). But the ground computer is not deemed to have completely failed.
To define "soft real-time," it is easiest to compare it with "hard real-time." Below we will see that the term "firm real-time" constitutes a misunderstanding about "soft real-time."
Speaking casually, most people implicitly have an informal mental model that considers information or an event as being "real-time"
• if, or to the extent that, it is manifest to them with a delay (latency) that can be related to its perceived currency
• i.e., in a time frame that the information or event has acceptably satisfactory value to them.
There are numerous different ad hoc definitions of "hard real-time," but in that mental model, hard real-time is represented by the "if" term. Specifically, assuming that real-time actions (such as tasks) have completion deadlines, acceptably satisfactory value of the event that all tasks complete is limited to the special case that all tasks meet their deadlines.
Hard real-time systems make the very strong assumptions that everything about the application, system, and environment is static and known a priori, e.g., which tasks there are, that they are periodic, their arrival times, their periods, their deadlines, that they won't have resource conflicts, and overall the time evolution of the system. In an aircraft flight control system or automotive braking system and many other cases those assumptions can usually be satisfied, so that all the deadlines will be met.
This mental model is deliberately and very usefully general enough to encompass both hard and soft real-time--soft is accommodated by the "to the extent that" phrase. For example, suppose that the task completions event has suboptimal but acceptable value if
no more than 10% of the tasks miss their deadlines
or no task is more than 20% tardy
or the average tardiness of all tasks is no more than 15%
or the maximum tardiness among all tasks is less than 10%
These are all common examples of soft real-time cases in a great many applications.
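Criteria like the ones in that list are easy to evaluate mechanically. Here is a hedged sketch (the task set and the thresholds are invented purely for illustration) that checks them from per-task deadlines and completion times:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Task { double deadline, completion; };   // times in the same unit, e.g. ms

int main() {
    std::vector<Task> tasks = {{10, 9}, {20, 23}, {15, 15}, {30, 28}, {25, 27}};  // made-up data

    int missed = 0;
    double max_tardy = 0, sum_tardy = 0;
    for (const auto &t : tasks) {
        double tardy = std::max(0.0, t.completion - t.deadline) / t.deadline;  // relative tardiness
        if (tardy > 0) missed++;
        max_tardy = std::max(max_tardy, tardy);
        sum_tardy += tardy;
    }
    double miss_frac = double(missed) / tasks.size();
    double avg_tardy = sum_tardy / tasks.size();

    printf("missed deadlines: %.0f%%  (criterion: <= 10%%)\n", 100 * miss_frac);
    printf("max tardiness:    %.0f%%  (criterion: < 10%%)\n", 100 * max_tardy);
    printf("avg tardiness:    %.0f%%  (criterion: <= 15%%)\n", 100 * avg_tardy);
}
```

With this made-up task set the average-tardiness criterion passes while the miss-rate and maximum-tardiness criteria fail; deciding which combination is acceptable is exactly the application-specific judgment being described here.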
Consider the single-task application of picking your child up after school. That probably does not have an actual deadline, instead there is some value to you and your child based on when that event takes place. Too early wastes resources (such as your time) and too late has some negative value because your child might be left alone and potentially in harm's way (or at least inconvenienced).
Unlike the static hard real-time special case, soft real-time makes only the minimum necessary application-specific assumptions about the tasks and system, and uncertainties are expected. To pick up your child, you have to drive to the school, and the time to do that is dynamic depending on weather, traffic conditions, etc. You might be tempted to over-provision your system (i.e., allow what you hope is the worst case driving time) but again this is wasting resources (your time, and occupying the family vehicle, possibly denying use by other family members).
That example may not seem to be costly in terms of wasted resources, but consider other examples. All military combat systems are soft real-time. For example, consider performing an aircraft attack on a hostile ground vehicle using a missile guided with updates to it as the target maneuvers. The maximum satisfaction for completing the course update tasks is achieved by a direct destructive strike on the target. But an attempt to over-provision resources to make certain of this outcome is usually far too expensive and may even be impossible. In this case, you may be less but sufficiently satisfied if the missile strikes close enough to the target to disable it.
Obviously combat scenarios have a great many possible dynamic uncertainties that must be accommodated by the resource management. Soft real-time systems are also very common in many civilian systems, such as industrial automation, although obviously military ones are the most dangerous and urgent ones to achieve acceptably satisfactory value in.
The keystone of real-time systems is "predictability." The hard real-time case is interested in only one special case of predictability--i.e., that the tasks will all meet their deadlines and the maximum possible value will be achieved by that event. That special case is named "deterministic."
There is a spectrum of predictability. Deterministic (determinism) is one end-point (maximum predictability) on the predictability spectrum; the other end-point is minimum predictability (maximum non-determinism). The spectrum's metric and end-points have to be interpreted in terms of a chosen predictability model; everything between those two end-points is degrees of unpredictability (= degrees of non-determinism).
Most real-time systems (namely, soft ones) have non-deterministic predictability, for example, of the tasks' completions times and hence the values gained from those events.
In general (in theory), predictability, and hence acceptably satisfactory value, can be made as close to the deterministic end-point as necessary--but at a price which may be physically impossible or excessively expensive (as in combat or perhaps even in picking up your child from school).
Soft real-time requires an application-specific choice of a probability model (not the common frequentist model) and hence predictability model for reasoning about event latencies and resulting values.
Referring back to the above list of events that provide acceptable value, now we can add non-deterministic cases, such as
the probability that no task will miss its deadline by more than 5% is greater than 0.87. (Note the number of scheduling criteria expressed in there.)
In a missile defense application, given the fact that in combat the offense always has the advantage over the defense, which of these two real-time computing scenarios would you prefer:
because the perfect destruction of all the hostile missiles is very unlikely or impossible, assign your defensive resources to maximize the probability that as many of the most threatening (e.g., based on their targets) hostile missiles will be successfully intercepted (close interception counts because it can move the hostile missile off-course);
complain that this is not a real-time computing problem because it is dynamic instead of static, and traditional real-time concepts and techniques do not apply, and it sounds more difficult than static hard real-time, so you are not interested in it.
Despite the various misunderstandings about soft real-time in the real-time computing community, soft real-time is very general and powerful, albeit potentially complex compared with hard real-time. Soft real-time systems as summarized here have a lengthy successful history of use outside the real-time computing community.
To directly answer the OP question:
A hard real-time system can provide deterministic guarantees (most commonly that all tasks will meet their deadlines, that interrupt or system call response time will always be less than x, etc.) IF AND ONLY IF very strong assumptions are made, and are correct, that everything that matters is static and known a priori (in general, such guarantees for hard real-time systems are an open research problem except for rather simple cases)
A soft real-time system does not make deterministic guarantees, it is intended to provide the best possible analytically specified and accomplished probabilistic timeliness and predictability of timeliness that are feasible under the current dynamic circumstances, according to application-specific criteria.
Obviously hard real-time is a simple special case of soft real-time. Obviously soft real-time's analytical non-deterministic assurances can be very complex to provide, but are mandatory in the most common real-time cases (including the most dangerous safety-critical ones such as combat) since most real-time cases are dynamic not static.
"Firm real-time" is an ill-defined special case of "soft real-time." There is no need for this term if the term "soft real-time" is understood and used properly.
I have a more detailed much more precise discussion of real-time, hard real-time, soft real-time, predictability, determinism, and related topics on my web site real-time.org.
real-time - Pertaining to a system or mode of operation in which computation is performed during the actual time that an external process occurs, in order that the computation results can be used to control, monitor, or respond to the external process in a timely manner. [IEEE Standard 610.12.1990]
I know this definition is old, very old. I can't, however, find a more recent definition by the IEEE (Institute of Electrical and Electronics Engineers).
Consider a task that inputs data from the serial port. When new data arrives, the serial port triggers an event. When the software services that event, it reads and processes the new data. The serial port has a hardware buffer to store incoming data (2 entries deep on the MSP432, 16 on the TM4C123), such that if the buffer is full and more data arrives, the new data is lost. Is this system hard, firm, or soft real time?
It is hard real time because if the response is late, data may be lost.
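A hedged software analogy of that hardware receive FIFO (the depth and all names here are illustrative; the real FIFO lives in the UART peripheral): if the consuming task misses its deadline while the FIFO is full, the newest byte is simply lost, which is what makes the deadline hard.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Software analogy of a small hardware receive FIFO: when the consumer is
// late and the FIFO is full, incoming data is dropped and an overrun is flagged.
struct RxFifo {
    std::array<uint8_t, 16> buf{};   // 16-deep, like the TM4C123 UART FIFO
    int head = 0, count = 0;
    bool overrun = false;

    void on_rx_interrupt(uint8_t byte) {                 // producer: hardware/ISR side
        if (count == (int)buf.size()) { overrun = true; return; }   // data lost
        buf[(head + count) % buf.size()] = byte;
        count++;
    }
    bool pop(uint8_t &out) {                             // consumer: the real-time task
        if (count == 0) return false;
        out = buf[head];
        head = (head + 1) % (int)buf.size();
        count--;
        return true;
    }
};

int main() {
    RxFifo f;
    for (int i = 0; i < 20; i++) f.on_rx_interrupt(uint8_t(i));   // consumer was too slow
    printf("overrun: %s\n", f.overrun ? "yes, bytes were lost" : "no");
}
```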
Consider a hearing aid that inputs sounds from a microphone, manipulates the sound data, and then outputs the data to a speaker. The system usually has small and bounded jitter, but occasionally other tasks in the hearing aid cause some data to be late, causing a noise pulse on the speaker. Is this system hard, firm or soft real time?
It is firm real time because it causes an error that can be perceived but the effect is harmless and does not significantly alter the quality of the experience.
Consider a task that outputs data to a printer. When the printer is idle the printer triggers an event. When the software services that event, it sends more data to the printer. Is this system hard, firm or soft real time?
It is soft real time because the faster it responds the better, but the value of the system (bandwidth, i.e. the amount of data printed per second) diminishes with latency.
UTAustinX: UT.RTBN.12.01x Realtime Bluetooth Networks
Maybe the definition is at fault.
From my experience, I would separate the two as being hardware- and software-dependent.
If you have 200ms to service a hardware-driven interrupt, that is what you've got. Stick 300ms of code in there and the system isn't broken; it simply hasn't been developed properly. You'll be switched out before you've finished. Your code doesn't work or is not fit for purpose. Many systems have hard, defined processing periods: video, telecoms, etc.
If you're writing an application that's real-time, this could be considered soft. If you run out of time, you could hope for less load next time, you could tune the OS, add some memory, or even upgrade the hardware. You have options.
To look at it from a UX perspective is not helpful. A Skoda might not be broken if it glitches, but a BMW sure as hell will be.
The definition has expanded over the years to the detriment of the term. What is now called "Hard" real-time is what used to be simply called real-time. So systems in which missing timing windows (rather than single-sided time deadlines) would result in incorrect data or incorrect behavior should be considered real-time. Systems without that characteristic would be considered non-real-time.
That's not to say that time isn't of interest in non-real-time systems, it just means that timing requirements in such systems don't result in fundamentally incorrect results.
Hard real-time systems use a preemptive version of priority scheduling, so that critical tasks get scheduled immediately, whereas soft real-time systems use a non-preemptive version of priority scheduling, which allows the present task to finish before control is transferred to the higher-priority task, causing additional delays. Thus task deadlines are followed strictly in hard real-time systems, whereas in soft real-time systems they are not handled as strictly.

Can two processes simultaneously run on one CPU core?

Can two processes simultaneously run on one CPU core which has hyper-threading? I have been learning from the Internet, but I do not see a clear, straight answer.
Edit:
Thanks for the discussion and sharing! My purpose in posting my question here is not to discuss parallel computing; that would be too big to discuss here. I just want to know if a multithreaded application can benefit more from hyper-threading than a multi-process application. After further reading, I have the following as my learning notes:
1) A Hyper-Threading-enabled CPU core has two sets of CPU state and interrupt logic. Meanwhile, it has only one set of execution units and caches. (I have not studied what a pipeline is yet.)
2) Multithreading benefits from hyper-threading only if there is latency (a stall) in one of the executing threads. I think this point maps exactly to the common reason why and when software programmers use multiple threads. If the multithreaded application has already been optimized, it may not gain any benefit from hyper-threading.
3) If the CPU state maps to process state, I believe Marc is correct that a multi-process application can benefit even more from hyper-threading technology.
4) When CPU vendors say "thread", it looks like their "thread" is different from the threads I know as a Java programmer?
No, a hyperthreaded CPU core still only has a single execution pipeline. Even though it appears as two CPUs to the overlying OS, there's still only ever one instruction being executed at any given time.
Hyperthreading was intended to allow the CPU to continue executing one thread while another thread was stalled waiting for a resource or other operation to complete, without leaving too many stages of the pipeline empty and useless. This goes back to the Pentium 4 days, with its absurdly long pipeline - a stall was essentially catastrophic for efficiency and throughput, and hyperthreading allowed Intel to keep the cpu busy doing other things while it cleaned up from the stall.
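As a tiny, hedged check of the "appears as two CPUs to the overlying OS" point: the logical-processor count a C++ program can query counts hardware threads, so on an HT-enabled part it is typically twice the number of physical cores (the physical-core count itself has to come from an OS-specific source, e.g. lscpu or /proc/cpuinfo on Linux).

```cpp
#include <cstdio>
#include <thread>

int main() {
    // Counts logical processors: on an HT-enabled CPU this is typically 2x the
    // number of physical cores, because each core exposes two hardware threads.
    unsigned logical = std::thread::hardware_concurrency();
    printf("logical processors visible to the OS: %u\n", logical);
}
```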
While Marc B's answer is pretty much the definitive summary of how HT works, I just want to make a small contribution by linking this article, which should clear up a lot of things about HT: http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
Short answer, yes.
A single-core CPU (a processor) can run 2 or more threads simultaneously. These threads may belong to one program, or they may belong to different programs and thus different processes. This type of multithreading is called Simultaneous MultiThreading (SMT).
The claim that a CPU core can execute only one instruction at any given time is also not true. Modern CPUs exploit instruction-level parallelism (ILP) by duplicating pipeline resources (e.g. 2 ALUs instead of 1). This type of pipeline is called a "superscalar" pipeline.
Wikipedia page of Simultaneous Multithreading:
Simultaneous multithreading