Google BigTable is a system that uses LSM-tree as its core data structure for storage. LSM-tree can use different merge strategies. The two most common ones are (1) leveled merging which is more read-optimized, and (2) tiered merging which is more write-optimized. These merge strategies can further be configured by adjusting the size ratio between adjacent levels.
I have not been able to find anywhere what BigTable's default behavior is in these respects, or whether it can be tuned. As a result, it is hard to understand its default performance properties and how to adapt them to different workloads.
With tiered merging, a level of LSM-tree gathers runs until it reaches capacity. It then merges these runs and flushes the resulting run to the next larger level. There are at most O(T * log_T(N)) runs in total across all levels (up to T per level), and write cost is O(log_T(N) / B), where N is the data size, B is the block size, and T is the size ratio between levels.
With leveled merging, there is one run at each level of LSM-tree. A merge takes place as soon as a new run comes into the level, and if the level exceeds capacity the resulting run is flushed to the next larger level. There are at most O(log_T(N)) runs in total across all levels, and write cost is O((T * log_T(N)) / B).
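To make the asymptotic expressions above concrete, here is a minimal Python sketch that plugs numbers into them. The function names and the sample values of N, B, and T are illustrative assumptions, the constants hidden by the O-notation are dropped, and nothing here reflects how Bigtable is actually configured.

```python
def num_levels(n: int, t: int) -> int:
    """Approximate log_T(N), rounded up, using integer arithmetic."""
    levels, capacity = 1, t
    while capacity < n:
        capacity *= t
        levels += 1
    return levels


def tiered_costs(n: int, b: int, t: int) -> dict:
    """Run-count bound and write cost quoted above for tiered merging."""
    levels = num_levels(n, t)
    return {"max_runs": t * levels, "write_cost": levels / b}


def leveled_costs(n: int, b: int, t: int) -> dict:
    """Run-count bound and write cost quoted above for leveled merging."""
    levels = num_levels(n, t)
    return {"max_runs": levels, "write_cost": t * levels / b}


if __name__ == "__main__":
    # Hypothetical workload: N = 10**6 entries, B = 100 entries per block.
    for t in (2, 4, 10):
        print(t, tiered_costs(10**6, 100, t), leveled_costs(10**6, 100, t))
```

With these formulas the two schemes differ by a factor of T in opposite directions: tiering has T times more runs to probe on reads, leveling has T times the write cost.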
As a result, these schemes have different read/write performance properties. However, I have been unable to find sources on whether Google's BigTable uses leveled or tiered merging, or on what the default size ratio T is. Also, are these aspects of the system fixed, or are they tunable?
The merge compaction behavior and strategy used by Google Cloud Bigtable is currently not tunable by end users via the Cloud Bigtable APIs, although the underlying system which backs the Cloud Bigtable product is dynamic and tunable by our engineering and operations teams.
Here's a somewhat recent paper on different approaches to merge compaction algorithms which have been explored in Bigtable:
Online Bigtable merge compaction
https://arxiv.org/pdf/1407.3008.pdf
We employ a number of proprietary approaches to dynamically adjusting and tuning merge compaction behavior. If you do have more specific questions related to your use of the system or are experiencing issues with merge compaction behavior you can of course feel free to file a support case.
Related
I am reading about distributed systems and am confused about what the term really means.
I understand that, at a high level, it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of the Order service that serve customers and possibly use some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes, each of which stores/serves part of the user requests, like records with primary keys 'A-C' on the first node, 'D-F' on the second node, etc. At a high level it looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
As commented by #ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the examples mentioned, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any such details; they must be specified explicitly for each DS. E.g., a distributed database does not necessarily use sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate their
actions by passing messages to one another from any system. The
components interact with one another in order to achieve a common
goal. Three significant challenges of distributed systems are:
maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When
a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus by implementing consensus in the application, or using a dedicated service e.g. Zookeeper, or abusing a DB table for that purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
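As a rough illustration of the point that sharding is just a routing decision over storage, here is a minimal Python sketch of range-based shard lookup. The shard boundaries, node names, and the function `shard_for` are hypothetical; a single compute node could run this lookup on its own, which is why sharding by itself does not make a system distributed.

```python
import bisect

# Hypothetical shard map: shard i stores keys whose first letter is
# less than or equal to SHARD_UPPER_BOUNDS[i] (A-C, D-F, G-Z).
SHARD_UPPER_BOUNDS = ["C", "F", "Z"]
SHARD_NODES = ["storage-1", "storage-2", "storage-3"]


def shard_for(primary_key: str) -> str:
    """Return the storage node responsible for the given primary key."""
    first = primary_key[0].upper()
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, first)
    if idx == len(SHARD_NODES):
        raise KeyError(f"no shard covers key {primary_key!r}")
    return SHARD_NODES[idx]


if __name__ == "__main__":
    for key in ("alice", "Carol", "dave", "zoe"):
        print(key, "->", shard_for(key))
```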
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
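The "approximate" rate limiting described above can be sketched as follows. This is a minimal, single-window illustration under stated assumptions: `SharedCounter` is an in-memory stand-in for a shared store such as Redis, `ApproxLimiterNode` and `sync_every` are hypothetical names, and window resets, failures, and real network calls are all left out.

```python
class SharedCounter:
    """In-memory stand-in for a shared store (hypothetical, not a real client)."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, delta: int) -> int:
        self.total += delta
        return self.total


class ApproxLimiterNode:
    """One application node that only synchronizes every `sync_every` requests."""

    def __init__(self, shared: SharedCounter, global_limit: int, sync_every: int) -> None:
        self.shared = shared
        self.global_limit = global_limit
        self.sync_every = sync_every
        self.known_global = 0  # global total as of this node's last sync
        self.unsynced = 0      # requests admitted locally since the last sync

    def allow(self) -> bool:
        # Decide using a possibly stale view of the global count.
        if self.known_global + self.unsynced >= self.global_limit:
            return False
        self.unsynced += 1
        if self.unsynced >= self.sync_every:
            # Occasional synchronization: push local count, pull global total.
            self.known_global = self.shared.add(self.unsynced)
            self.unsynced = 0
        return True


if __name__ == "__main__":
    shared = SharedCounter()
    nodes = [ApproxLimiterNode(shared, global_limit=10, sync_every=5) for _ in range(2)]
    admitted = sum(node.allow() for _ in range(10) for node in nodes)
    print("admitted:", admitted, "limit:", 10)
```

In this toy run the two nodes admit 15 requests against a limit of 10, because each node decides on stale information between syncs. Syncing more often shrinks the overshoot but makes the nodes wait on each other more, which is the trade-off the answer describes.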
There are plenty of good stackoverflow Q&A on the CAP theorem as CP vs AP etc.
In a nutshell the theorem states:
"In the presence of a partition you must sacrifice availability or consistency"
Let's imagine we are talking about storage, databases in particular.
What are the technical reasons to partition in the first place?
(I'll try to take some guesses below):
OS can handle only so many ports/system handles.
Single "N Petabyte" Hard discs do not exist, you need more, until you run out of SATA/PCI ports.
Bringing the data closer to the consumer.
Single Database size is limited to size X.
Please note that there is a difference in meaning between "partitioning" as per CAP and "partitioning" as per physical database design.
"partitioning" as per CAP refers to what happens when a node in a distributed system becomes unavailable/unreachable, thus refers to phenomena that happen "at run time".
"partitioning" as per physical database design refers to the design decision to distribute the physical records representing the rows of one single table across various distinct physical stores, but 'distinct' might even then still mean only 'various distinct segments of one single store'. Anyhow, it thus refers to things that happen at design time.
In particular, it means that if you do "partitioning" as per physical database design, this does not necessarily lead to the existence of a "distributed system" in the sense of CAP. In particular, when "partitioning" as per physical database design, you do not necessarily create a system with various distinct runtime components operating "independently" : if you partition a table, you'll still typically have only one single DBMS that you are communicating with, thus only one sole runtime component.
Also in particular, if you "partition" as per physical database design, it is wrong to conclude that because of CAP theorem, consistency must necessarily have been sacrificed.
I was doing a question on computer architecture in which it was mentioned that the cache is a split cache and that there is no hazard. What exactly does this mean?
A summary and additional discussion can be found at: L1 caches usually have split design, but L2, L3 caches have unified design, why?.
Introduction
A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated for holding instructions and the other, called the data cache, is dedicated for holding data (i.e., instruction memory operands). Both the instruction cache and the data cache are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
The Harvard vs. von Neumann architecture distinction originally applies to main memory. However, most modern computer systems implement the modified Harvard architecture whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, of which one is the split cache design.
To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:
Typically, the best performance occurs with half of the cache devoted
to instructions and half to data.
They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.
As shown in Fig. 6, the performance of the best dedicated cache CUXD
(half allotted to instructions and half to data) in general is quite
similar to that of a homogeneous cache (CUX); the extra complexity of
a dedicated cache control is thus not justifiable.
It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:
Thus far, the cache memory has been assumed to be composed of
homogeneous cells. But conceivably a functionally specialized
partitioning of the cache could give higher performance. For example,
perhaps a cache devoted exactly half to instructions and half to data
would be more effective than a homogeneous one; alternatively, one
that holds just instructions could be better than one holding just
data. To test these hypotheses, the effects of dividing the cache into
sections dedicated to specific uses were investigated.
(This paragraph was formatted automatically by https://www.textfixer.com/tools/remove-white-spaces.php.)
It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.
Note that the paper didn't discuss why the split design may have a better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works, which I believe were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure because the way Alan used these terms gives the impression that they are already well-known. It appears to me from Alan's paper that the first processor that used the split cache design was the IBM 801 around 1975 and probably the second processor was the S-1 (around 1976). It's possible that the engineers of these processors might have come up with the split design idea independently.
Advantages of the Split Cache Design
The split cache design was then extensively studied in the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. (See what it looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and the unified design that supports cache partitioning. That's why it makes sense to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.
Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third alternative to improve the bandwidth is by adding more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency becomes more significant.
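The bandwidth argument can be illustrated with a toy cycle count. This is a deliberately simplified sketch, assuming every access hits with a one-cycle latency, every pipeline step needs one instruction fetch, and some steps also need a data access; the function name and the sample access pattern are hypothetical.

```python
def cycles_needed(data_access_flags, single_shared_port: bool) -> int:
    """Count cache cycles for a sequence of pipeline steps.

    Each step always performs one instruction fetch; data_access_flags[i]
    says whether step i also performs a data access.
    """
    cycles = 0
    for has_data_access in data_access_flags:
        if single_shared_port and has_data_access:
            # Single-ported unified cache: the two accesses are serialized.
            cycles += 2
        else:
            # Split cache (or no data access): both proceed in one cycle.
            cycles += 1
    return cycles


if __name__ == "__main__":
    pattern = [True, False, True, True]  # made-up mix of loads/stores
    print("unified, single port:", cycles_needed(pattern, True))
    print("split:", cycles_needed(pattern, False))
```

For this made-up pattern the single-ported unified cache needs 7 cycles and the split cache needs 4. The gap grows with the fraction of instructions that access memory.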
In addition, processors that use a unified (L1) cache typically include arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion on the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, there will be a need for arbitration logic at the L2 unified cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, it seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This issue, however, may occur in the unified design.
The last sub-section of the section on the differences between the modified Harvard architecture and Harvard and von Neumann in the Wikipedia article mentions that the Mark I machine uses different memory technologies for the instruction and data memories. This made me wonder whether this can constitute an advantage for the split design in modern computer systems. Here are some of the papers that show that this is indeed the case:
LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it results in losing its contents. There is, however, a performance penalty with the proposed design (compared to a pure SRAM cache hierarchy).
Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM. There is no loop cache here. This paper instead targets the high write latency issue of STT-RAM, which is incurred when a line is filled in the cache. The idea is that when a requested line is received from the L2 cache, the L1 cache first buffers the line in the MSHR allocated for its request. The MSHRs are still SRAM-based. Then the instruction cache line can be fed into the pipeline directly from the MSHR without having to potentially stall until it gets written in the STT-RAM cache. Similar to the previous work, the proposed architecture improves energy consumption at the expense of reduced performance.
System level exploration of a STT-MRAM based level 1 data-cache: Proposes using STT-RAM for the L1 data cache while keeping all other caches SRAM-based. This reduces area overhead and energy consumption, but performance is penalized.
Loop optimization in presence of STT-MRAM caches: A study of performance-energy tradeoffs: Compares the energy consumption and performance of pure (only SRAM or only STT-RAM) and hybrid (the L2 and instruction cache are STT-RAM-based) hierarchies. The hybrid cache hierarchy offers a performance-energy tradeoff that is in between the pure SRAM and pure STT-RAM hierarchies.
So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.
There are two other advantages, which will be discussed later in this answer.
Disadvantages of the Split Cache Design
The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate (see the VSC paper discussed later). This is the primary disadvantage of the split design. (If there is a placement contention on a single cache set in the split design, this contention may still occur in the unified design and it may have a worse impact on performance. In such a scenario, the split design would have a lower overall miss rate.)
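The utilization argument can be seen in a toy simulation. The sketch below is only an illustration under strong assumptions: the caches are fully associative with LRU replacement, the access trace is made up, and the names `LRUCache` and `count_misses` are hypothetical. It does not reproduce the results of any cited paper.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with LRU replacement (line addresses only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line: int) -> bool:
        """Access a line; return True on a hit, False on a miss (with fill)."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True
        return False


def count_misses(trace, split: bool, total_lines: int) -> int:
    """trace is a list of (kind, line) pairs with kind 'I' or 'D'."""
    if split:
        caches = {"I": LRUCache(total_lines // 2), "D": LRUCache(total_lines // 2)}
    else:
        unified = LRUCache(total_lines)
        caches = {"I": unified, "D": unified}
    return sum(0 if caches[kind].access(line) else 1 for kind, line in trace)


if __name__ == "__main__":
    # Made-up data-heavy loop: one instruction line, three data lines.
    trace = [("I", 0), ("D", 10), ("D", 11), ("D", 12)] * 3
    print("unified misses:", count_misses(trace, split=False, total_lines=4))
    print("split misses:", count_misses(trace, split=True, total_lines=4))
```

With this trace the unified cache takes only the 4 cold misses, while the split cache takes 10: the data half thrashes while half of the instruction half sits idle. A different trace could favor the split design, as the parenthetical remark about placement contention suggests.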
Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. (An inconsistency may be allowed between the two caches for a small number of cycles, but if the ISA does not allow such inconsistencies to be observable, they have to be detected before the instruction that got modified permanently changes the architectural state.) Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
Third, the design and hardware complexity of a split cache compared against a single-ported unified cache, a fully dual-ported unified cache, and dual-ported banked cache of the same overall organization parameters is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area. This holds true irrespective of the types of the two ports (exclusive-read, exclusive-write, read/write). The dual-ported banked cache has a higher area than the single-ported unified cache. How these two compare against split is less obvious to me. My understanding is that the split design has a higher area than the single-ported unified design [TODO: Explain why]. It may be important to consider the cache organization details, the lengths of the cache buses to the pipeline, and the process technology. One thing to note here is that a single-ported instruction cache has a lower area than a single-ported data cache or unified cache because the instruction cache requires only an exclusive-read port while the others require a read/write port.
Unified L1 and Split L2 Caches in Real Processors
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline. An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively. Later Itanium processors reduced the L2I size to 512 KB. The Itanium 2 9000 manual explains why the L2 was made split:
The separate instruction and data L2 caches provide more efficient
access to the caches compared to Itanium 2 processors where
instruction requests would contend against data accesses for L2
bandwidth against data accesses and potentially impact core execution
as well as L2 throughput.
.
.
.
The L3 receives requests from both the
L2I and L2D but gives priority to the L2I request in the rare case of
a conflict. Moving the arbitration point from the L1-L2 in the Itanium
2 processor to the L2-L3 cache greatly reduces conflicts thanks to the
high hit rates of the L2.
(I think "against data accesses" was written twice by mistake.)
The second paragraph from that quote mentions an advantage that I have missed earlier. A split L2 cache moves the data-instruction conflict point from the L2 to the L3. In addition, some/many requests that miss in the L1 caches may hit in the L2, thereby making contention at the L3 less likely.
By the way, the L2I and L2D in the Itanium 2 9000 both use the NRU replacement policy.
Unified L1 Cache Partitioning
James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data, which was published in 2013. The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. It is for this reason that the unified design has a lower L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation). However, the combined impact on performance of the higher latency and lower miss rate still makes the system with the unified L1 cache slower than the one with the split cache.
The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has an even lower miss rate because partitioning reduces potential space conflict between lines holding instructions and lines holding data. According to the experimental results (all cache designs have the same overall capacity), even if the VSC has the same latency as the unified cache, the VSC has about the same performance as the split design on a single-core system and has a higher performance on a multi-core system because the lower miss rate results in less contention on accessing the shared L2 cache. In addition, in both the single-core and multi-core system configurations, the VSC reduces energy consumption due to the lower miss rate.
A VSC could have a lower latency than a unified cache. Although both are dual ported (to have the same bandwidth as the single-ported split cache), in the VSC design, only the interface needs to be dual ported because no part of the cache can be accessed more than once at the same time. (The paper doesn't explicitly say so, but I think the VSC allows the same line to be present in both partitions if it holds both instructions and data, so it still has the consistency problem that exists in the split design.) Assuming that each bank of the cache represents one cache way, then each bank can be single-ported in VSC. This leads to a simpler design (see: Fast quadratic increase of multiport-storage-cell area with port number) and may allow reducing the latency. Moreover, assuming that the difference in latency between the unified design and the split design is small (because the instruction cache and data cache in the split design are physically close to each other), the VSC design can store instructions and data in banks that are physically close to where they are needed in the pipeline and support variable-latency access depending on how many banks are allocated for each. The larger the number of banks, the higher the latency, up to the latency of the unified design. This would require, however, a pipeline design that can handle such a variable-latency cache.
I think one important thing that this paper is missing is evaluating the VSC design with higher access latencies with respect to the split design (not just 2 cycles vs. 3 cycles). I think increasing the latency by even only one cycle would make VSC slower than split.
The Case of the Z80000 Unified Cache
The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles (loads that miss in the cache and other complex instructions may take more cycles). Each pair of consecutive clock cycles constitutes a processor cycle. The cache design of the Z80000 has a number of unique properties that I've not seen anywhere else:
There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite being unified and single-ported, is designed in such a way as to have no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case: time-multiplexing accesses to the cache provides the same bandwidth, and the downsides of the split design don't exist. The full associativity minimizes space contention between instruction and data lines. This design was made possible by the small cache size and relatively shallow pipeline with respect to the cache latency.
The System Configuration Control Longword (SCCL) offers Cache Instruction (CI) and Cache Data (CD) control bits. If CI is 1, instruction fetches that miss in the cache can be filled in the cache. If CD is 1, data loads that miss in the cache can be filled in the cache. The cache uses a write-no-allocate policy so write misses never allocate in the cache. If both CI and CD are set to 1, the cache effectively works like a unified cache. If only one of the flags is 1, the cache effectively works like a data-only or instruction-only cache. Applications can tune these flags to improve performance.
This property is not relevant to the question, but I found it interesting. SCCL also offers a Cache Replacement (CR) control bit. Setting this bit to zero disables replacement on a miss, so lines are never replaced. If all entries in a set are occupied and a load/fetch miss occurs in that set, the line is simply not filled in the cache.
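The fill rules described by the CI, CD, and CR bits above can be summarized as a small decision function. This is only a sketch of the behavior as described in this answer, not of the actual Z80000 hardware; the function name, the string encoding of access types, and the `set_full` parameter are hypothetical.

```python
def may_fill_on_miss(access: str, ci: int, cd: int, cr: int, set_full: bool) -> bool:
    """Decide whether a missing line may be filled into the cache.

    access is 'fetch' (instruction fetch), 'load' (data read), or 'store'.
    ci, cd, cr are the Cache Instruction, Cache Data, and Cache Replacement
    control bits; set_full says whether every entry in the target set is
    already occupied.
    """
    if access not in ("fetch", "load", "store"):
        raise ValueError(f"unknown access type: {access!r}")
    if access == "store":
        return False  # write-no-allocate: write misses never allocate
    enabled = ci if access == "fetch" else cd
    if not enabled:
        return False  # caching is disabled for this kind of access
    if set_full and not cr:
        return False  # replacement disabled, so no line may be evicted
    return True


if __name__ == "__main__":
    # Instruction-only configuration: CI = 1, CD = 0.
    print(may_fill_on_miss("fetch", ci=1, cd=0, cr=1, set_full=False))
    print(may_fill_on_miss("load", ci=1, cd=0, cr=1, set_full=False))
```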
The Cases of the R3000, 80486, and Pentium P5
I came across the following complementary question on SE Retrocomputing: Why did Intel abandon unified CPU cache?. There are a number of issues with the accepted answer on that question. I'll address these issues here and explain why the 80486 and Pentium caches were designed like that based on information from Intel.
The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after the 80386 and is of the same generation as the 80486. The designers of the R3000 opted for a large external cache instead of a small on-chip cache to improve performance, according to Section 1.8 of PaceMips R3000 32-Bit, 25 MHz RISC CPU with Integrated Memory Management Unit. The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a read or write data access in the same "clock phase." It's not clear to me exactly how this works, though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. (Also, some of the address wires are used to provide cache line tags to the on-chip cache controller for tag matching.) Both caches are direct-mapped, maybe to achieve a single-cycle access latency. A unified external cache design with the same bandwidth, associativity, and capacity would require the cache to be fully dual-ported, or the VSC design could be used, but VSC was invented many years later. Such a unified cache would be more expensive and might have a latency larger than the single cycle required to keep the pipeline filled with instructions.
Another issue with the linked answer from Retro is that just because the 80486 evolved directly from the 80386 doesn't necessarily mean that it also had to use the unified design. According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle, Intel evaluated both designs and deliberately chose to go for the unified on-chip design. Compared to the same-gen R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than the total cache capacity of the R3000 (up to 16KB vs. up to 256KB+256KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses. In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (a double-precision FP operand or a segment descriptor) in one access. The R3000 caches share a single 4-byte bus. The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle in the best-case scenario of a cache hit.
The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus help keep contention between instruction fetches and data accesses to a minimum. Intel mentions that, according to simulation results, the unified design provides a hit rate that is higher than that of a split cache by enough to compensate for the bandwidth contention.
Intel explained in another paper titled Design of the Intel Pentium processor why they decided to change the cache in the Pentium to split. There are two reasons: (1) The 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) Branch prediction increases cache bandwidth demand. The paper doesn't mention whether Intel considered using a triple-ported banked unified cache, but they probably did and found that it was not feasible at the time, so they went for a split cache with a dual-ported 8-banked data cache and a single-ported instruction cache. With today's fab technology, the triple-ported unified design may be better.
Wider pipelines in later microarchitectures required higher parallelism at the data cache. Now we're at 4 64-byte ports in Sunny Cove.
Answering the Second Part of the Question
it was mentioned that the cache is a split cache, and no hazard what
does this exactly means?
It's probably about the structural hazard mentioned in Paul's comment. That is, a unified single-ported cache cannot be accessed by the instruction fetch unit and the memory unit at the same time.
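A minimal sketch of that structural hazard follows. It is a deliberately simplified, hypothetical model (the function name `cycles_needed` and the request format are made up for the example): each processor cycle may want an instruction fetch, a data access, or both, and the cache can serve a fixed number of accesses per processor cycle. With one slot per cycle, a cycle that needs both accesses costs an extra stall cycle. With two slots per cycle, as in a split cache or the time-multiplexed unified cache described earlier, there is no stall.

```python
def cycles_needed(requests, slots_per_cycle):
    """Count processor cycles needed to serve a stream of cache requests.

    requests: list of (wants_fetch, wants_data) booleans, one pair per
              instruction issue slot.
    slots_per_cycle: how many accesses the cache serves per processor cycle.
    """
    total = 0
    for wants_fetch, wants_data in requests:
        needed = int(wants_fetch) + int(wants_data)
        # At least one cycle always passes; anything beyond that is a stall
        # caused by contention for the cache port (ceiling division).
        total += max(1, -(-needed // slots_per_cycle))
    return total


stream = [(True, True), (True, False), (True, True)]
single_ported = cycles_needed(stream, slots_per_cycle=1)  # stalls twice
no_hazard = cycles_needed(stream, slots_per_cycle=2)      # never stalls
```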
I am learning about the characteristics of distributed databases and I came across this website that describes some of their advantages:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed databases are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (by which I mean not concurrent writes, but processing data in parallel on each node) is not on the list. This makes me wonder: are all distributed databases (e.g. MongoDB, Cassandra, HBase) designed to process data in parallel? If not, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution
I have been reading about a database named Starcounter. It claims that it can handle loads that otherwise only a "NoSql" database can handle, and that it does so without dropping consistency. As far as I understand the CAP theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes Starcounter work?
I can imagine that StarCounter is fast, but the claim that NoSql needs to drop consistency to keep up seems a little bit strange to me. Can anyone please explain?
Thanks in advance
Roland
The short answer
The CAP theorem (aka Brewer's theorem) cannot be beaten for a single piece of information (like a consistent database). If you have a horizontally scaled database, you won't get both consistency and performance. This conclusion comes from the laws of physics and can be deduced from Brewer's theorem and Einstein's theories of relativity. You need to scale in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job at honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code are moved to the database. Only data in the final response gets moved from its original place in RAM (where the data was in the first place). This makes most of the data stay put even if there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests/sec, each requiring extensive database access.
A more thorough answer
The question is a good one. It is no wonder you find it strange, as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid would be when a theoretical physicist tells him to stop looking for the perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem says that no piece of information can have consistency (C), availability (A) and partition tolerance (P) all at once. It applies to a unit of information (such as a database). You can of course have independent pieces of information that operate differently. One piece could be AP, another could be CA and a third could be CP. You just can't have the same information being CAP.
Because a consistent and available database cannot have the 'P', a scaled-out database MUST do signalling between the nodes. The conclusion must be that, even a hundred years from now, CAP dictates that a single piece of consistent data will have to live on hardware interconnected using hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available, consistent database. As good performance was the very reason for scaling horizontally in the first place, this is a very bad thing. As every node needs to communicate with the other nodes whenever there is database access to achieve consistency, and given that signalling is ultimately limited by the speed of light, you are left with the sad but true fact that database scientists (as well as CPU scientists) are not just being stubborn when they fail to see scale-out as a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in an AP set, so remember, we are talking about consistent data here). Add the theories of Einstein to the CAP theorem and, for consistent data, the small box wins over the cloudy data center.
Perpetual motion machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage was the way to get to work. Without any theoretical evidence against it, the patent offices granted hundreds of patents for impossible perpetual motion machines. Today, we may laugh at this, but we have a similar situation in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot-com crash that mathematicians at MIT proved Brewer right and the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you want, to the way laggards kept trying to invent the perpetual motion machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack Overflow still making drawings of bearings and arms moving ad infinitum of their own accord - I don't mean to be offensive).
CAP and performance
All is not lost, however. Not all pieces of information need to be consistent. Not all pieces need to scale out. You just have to accept Brewer's theorem and make the best of it.
For applications such as Facebook, consistency is dropped. This is okay as data is entered once and then is manipulated by a single user. Still, we can experience the side effects in everyday Facebook usage, such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to amount to zero. Your stock inventory must equal to 8 if you sold 2 out of 10 items even if there are multiple users buying from the same stock.
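The inventory example can be sketched inside a single process. This is only an analogy for the guarantee a consistent database has to provide, not how any particular database implements it; the class `Inventory` and the helper `run_buyers` are hypothetical names for the example. The lock forces concurrent purchases to be applied one at a time, so two buyers taking one item each from a stock of 10 always leave 8.

```python
import threading


class Inventory:
    """Stock counter whose updates are serialized by a lock."""

    def __init__(self, stock):
        self._stock = stock
        self._lock = threading.Lock()

    def buy(self, quantity=1):
        """Atomically check and decrement; return False if out of stock."""
        with self._lock:
            if self._stock < quantity:
                return False
            self._stock -= quantity
            return True

    @property
    def stock(self):
        with self._lock:
            return self._stock


def run_buyers(inventory, buyers):
    """Start one thread per buyer, each buying a single item, and wait."""
    threads = [threading.Thread(target=inventory.buy) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


shop = Inventory(10)
run_buyers(shop, 2)
remaining = shop.stock
```

Once the stock lives on several nodes, the lock has to become communication between those nodes, which is where the cost discussed in this answer comes from.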
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without making your scale-out result in less performance rather than more performance. Of course, this is only true for consistent data. The implications of this have been known to the engineers of Intel, AMD, Oracle et al. for a long time. It is not that their scientists haven't heard of scale-out. It is just that they have come to accept the world as Einstein described it.
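The "few nanoseconds per meter" figure is easy to check. The sketch below uses only the speed of light in vacuum; the function names and the 100 m example distance are assumptions for illustration. Real signals in copper or fibre travel slower than this, and switches add further delay, so these numbers are best-case bounds.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, in vacuum


def one_way_delay_ns(distance_m):
    """Minimum time, in nanoseconds, for a signal to cover distance_m."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S * 1e9


def max_round_trips_per_second(distance_m):
    """Upper bound on strictly sequential request/reply exchanges per second."""
    return SPEED_OF_LIGHT_M_PER_S / (2 * distance_m)


per_meter_ns = one_way_delay_ns(1)                 # a little over 3.3 ns
across_hall = max_round_trips_per_second(100)      # roughly 1.5 million per second
```

So two nodes 100 m apart cannot complete more than about 1.5 million dependent round trips per second, no matter how fast the machines themselves are.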
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare for each human being living on Earth for each second it is running (search for 'modern CPU' and 'MIPS'). If you do some more math, like taking the total turnover of Amazon.com (you can find it at www.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in the RAM of a single modern PC. The cool thing is that the data for items, customers, orders, products etc. occupies the same amount of space per item in 2012 as it did in 1950. Images, video and audio have increased in size, but numeric and textual information does not grow per item. Sure, the number of transactions grows, but not at the same pace as computer power grows. So the logical solution is to scale out read-only and AP data and "scale-in/up" business data.
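The first of those calculations can be written out as follows. The inputs are illustrative assumptions, not measurements: a CPU rated at 100,000 MIPS and a world population of about 7 billion (roughly the 2012 figure). Plug in the rating of whatever CPU you have in mind.

```python
def instructions_per_person_per_second(cpu_mips, population):
    """Instructions one CPU can spend on each person, every second."""
    return cpu_mips * 1_000_000 / population


# Assumed inputs, for illustration only.
ASSUMED_CPU_MIPS = 100_000
ASSUMED_POPULATION = 7_000_000_000

budget = instructions_per_person_per_second(ASSUMED_CPU_MIPS, ASSUMED_POPULATION)
```

Under these assumptions the budget comes out at about 14 instructions per person per second.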
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly efficient machine code. This means that moving memory is the overshadowing bottleneck of total throughput for a consistent database. This is often referred to as the memory wall (Wikipedia has some useful information).
The trick is to transfer code to the database image instead of data from the database image to the code (if using a MVC or a MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (and the disk is merely securing transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large-scale application with hundreds of millions of simultaneous users, this typically amounts to only a few million requests per second, something that a single PC has no problem handling given that the HTTP packaging is done on gateway servers. Fortunately, such servers scale out beautifully as they don't need to share data.
As it turns out, the disk is fast at sequential writes, so a RAID array can persist terabytes of changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale a Starcounter node. It scales in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency, and Starcounter is not designed for partitioning, so it is less elegant than solutions such as Volt DB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simply accumulate all incoming HTTP requests for a millisecond at a time. This might sound like a short amount of time, but it is enough to accumulate thousands of requests if you have decided that you need to scale Starcounter. The batch of requests is then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, there is only one active ZLATAN node at a time. This is how the CAP theorem is honored. To go above that, you need to consider the same tradeoff as Facebook and others.
Another important note is that the ZLATAN node does not serve applications with data. Instead, the application's controller code is run by the ZLATAN node. The cost of serializing/deserializing and sending data to an application is far greater than the cost of processing the controller logic. I.e. the code is sent to the database instead of the other way around (in a traditional approach, the application asks for data or sends data).
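The gateway batching described above can be sketched as a function that groups requests by the millisecond in which they arrived. This is a hypothetical illustration of the idea only, not Starcounter's implementation or API: the function name `batch_requests`, the `(arrival_time_us, payload)` request format, and the use of integer microsecond timestamps are all assumptions made for the example.

```python
from collections import defaultdict


def batch_requests(requests, window_us=1000):
    """Group requests into one batch per time window.

    requests: iterable of (arrival_time_us, payload) pairs.
    window_us: window length in microseconds (1000 us = 1 ms).
    Returns the batches in time order, skipping empty windows.
    """
    windows = defaultdict(list)
    for arrival_us, payload in requests:
        # Integer division maps every arrival to the window it falls in.
        windows[arrival_us // window_us].append(payload)
    return [windows[index] for index in sorted(windows)]


incoming = [(0, "a"), (500, "b"), (999, "c"), (1000, "d"), (2500, "e")]
batches = batch_requests(incoming)
```

Each inner list is what a gateway would forward to the single active node in one go, so that node receives about a thousand messages per second per gateway regardless of how many individual requests arrive.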
Making the "shared-everything" node faster by doing less
The use of the database as a "heap" for the programming language instead of a remote system for serialization and deserialization is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM which is the case with most RAM databases.
There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL are talking about scalability. There is a trade-off between features+scalability vs speed.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users, they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if their transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" Databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html