Resource pool for a data structure - pool

When implementing an elementary data structure like a stack, queue, or linked list, should I create a resource pool (of nodes) by dynamically allocating memory in bunches, or should I allocate memory separately every time I need a node?

This entirely depends on your goals. By default (i.e. unless you really need to do otherwise), just do a normal allocation for each next node.
Memory pools, as compared to just allocating nodes individually:
Make allocation faster. Depending on the underlying allocation mechanism, sometimes significantly faster.
Usually cause less memory fragmentation, though this may not be a problem with some allocators.
Major drawback: they waste memory on reserved but unused nodes. This is very important if you use the data structure indiscriminately (e.g. thousands of instances) as opposed to just having a couple of instances.
Because of that drawback, memory pools are not suitable for generic situations.
In C++ all standard containers have an allocator template parameter.
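As a rough illustration of the pool idea, here is a minimal sketch (the NodePool name and Node layout are made up for this example, not any standard API): one big allocation up front, with nodes handed out from a free list instead of a per-node new/delete.

#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Minimal fixed-capacity pool: allocate the nodes in one bunch,
// then serve allocation requests from an internal free list.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) : storage_(capacity) {
        for (std::size_t i = 0; i + 1 < capacity; ++i)
            storage_[i].next = &storage_[i + 1];
        if (capacity > 0) storage_[capacity - 1].next = nullptr;
        free_ = capacity > 0 ? &storage_[0] : nullptr;
    }

    Node* allocate() {                         // O(1), no call into the heap allocator
        if (free_ == nullptr) return nullptr;  // pool exhausted (could fall back to new)
        Node* n = free_;
        free_ = free_->next;
        return n;
    }

    void release(Node* n) {                    // return a node to the free list
        n->next = free_;
        free_ = n;
    }

private:
    std::vector<Node> storage_;                // the memory allocated "in a bunch"
    Node* free_;                               // head of the free list
};

The trade-offs described above show up directly here: allocate() is a couple of pointer moves, but all capacity nodes stay reserved whether or not they are ever used.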

It's a basic time vs. space trade-off. Choose based on what is more important to you:
If you allocate the pool in advance, then your runtime element insertions will be -- on average -- optimized for speed, i.e. constant time, i.e. O(1). "On average" means that most insertions will be constant time, except those that hit the maximum and require expanding the pool, which are linear time, O(n). You also risk wasting some memory if you end up not using the whole pool.
If you do real-time allocation of every new node, you will always have constant-time insertion, but in this case, the constant time is a little longer than the constant time above, because you not only have to put a value into a memory location, but you also have to first allocate the memory location. Also, this method does not waste any memory by reserving memory locations in advance.
In most situations, I think real-time allocation is sufficiently efficient time-wise that I don't see why you'd use the pooled approach, unless your application requires extreme average speed or you do huge numbers of insertions.

Related

System Design - how to Pick CPU, Memory for an application

I am practicing System Design concepts and I am not clear on what configuration (CPU, memory, disk storage) to pick for an application instance. Also, how many instances are needed (assuming you are running your application on a Kubernetes cluster)?
For back-of-the-envelope calculations, I saw examples of calculating TPS for read and write calls, calculating bandwidth needs, database storage needs, etc., but I have not seen how to determine CPU and memory needs and how many instances are enough. Is there a procedure that guides solving this problem?
My hunch says that we pick a small to medium sized server instance (if we use a cloud provider like AWS), run stress tests for the calculated TPS, look at CPU and memory usage, and then increase or decrease the server configuration based on the results?
I would greatly appreciate any inputs you may have.
I am not clear on what configuration (CPU, memory, disk storage) to pick for an application instance. Also, how many instances are needed (assuming you are running your application on a Kubernetes cluster)?
This is mostly a question about economics. If resources were very cheap, you could use a lot of them - but unfortunately, they have an economic cost.
Scale out horizontally or scale up vertically
The first fundamental question to ask is whether you should scale your app up vertically (e.g. to bigger instances) or scale it out horizontally.
The most important thing here is that scaling out horizontally is much easier. But whether you can scale out horizontally or have to scale up vertically depends on your app. If your app is a stateless webserver, it typically is very easy to scale out, but if you have a stateful cache or database, scaling up vertically might be your only short-term option. Try to design so that you can scale out horizontally, since that is much easier.
Accurate size - use observability
To find the right size, use observability: investigate your bottlenecks and adjust relative to them.
E.g. if your app has too little memory, it will be terminated (out-of-memory killed), and if it has too little CPU, your response times will be slow. Just start somewhere and adjust.
In addition to Jonas's answer:
You have two approaches (which are not mutually exclusive):
Estimate your needs based on expected load, etc.
Adjust your needs based on what you observe in production.
Regarding the first approach:
Have you done any analysis into what your expected load is? E.g. how many users (unique sessions), how many requests on average per hour (page views, API calls, etc), potential peaks in activity leading to increased load, etc.
Have you done any benchmarking?
Have you looked at your system and what it does, and worked out if it has any specific resource (CPU, memory, disk, etc) needs?
Estimating resources ahead of time requires some knowledge (or informed guesses) regarding what the load will be, as per the 3 points above. Having an idea of what the daily or hourly request average is isn't a bad place to start.
Also make sure you are aware of any potential spikes that might catch you out (e.g. end of month for financial systems/services). Whether or not these are significant enough to be worth worrying about is another thing. A friend of mine was working on a ticketing system once, and they had massive traffic spikes for major events that did warrant serious scaling out and back... but your average system probably won't need to be that extreme.
CPU is probably only worth "worrying" about if you have anything that does above-average processing - this should be obvious through benchmarking or if you/your team has good knowledge of your code.
Disk usage can be calculated - e.g.
If on average a user generates 1MB of data in a session (not including system logs), and you get 100 sessions a day, then that's 100MB a day, 500MB a working week, around 2GB (2,000MB) a month, etc.
If a user profile has on average 200KB of data and 300KB of storage space (images), then you can calculate that too.
You can also do this for records, especially for records that you know are "large" (e.g. >25MB) or where there will be lots of them (e.g. millions).
You can also start to forecast growth over time if you allow a growth rate (e.g. number of users and their sessions, and the amount of data generated). A simple way to do that is to have a spreadsheet with some simple formulas that take various inputs like number of users, average requests per user, disk space per user, etc. You can then do what-if modelling by playing with the inputs.
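If it helps to see that what-if modelling in code rather than in a spreadsheet, here is a hedged sketch; every number and variable name below is a made-up input for illustration, not a recommendation.

#include <cstdio>

int main() {
    // Hypothetical what-if inputs - change these and re-run to model scenarios.
    double users             = 10000;   // active users today
    double sessions_per_user = 2.0;     // sessions per user per day
    double mb_per_session    = 1.0;     // data generated per session (MB)
    double monthly_growth    = 0.05;    // 5% user growth per month

    for (int month = 1; month <= 12; ++month) {
        double daily_mb = users * sessions_per_user * mb_per_session;
        double month_gb = daily_mb * 30.0 / 1024.0;   // ~30 days, MB -> GB
        std::printf("month %2d: %8.1f GB of new data\n", month, month_gb);
        users *= 1.0 + monthly_growth;                // apply growth for next month
    }
    return 0;
}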
In terms of the second approach - as Jonas says, observe and adjust. Make sure you know how to do that, and that your solution provides the data you need. This might be using metrics provided by your cloud provider (if applicable) or instrumentation / reporting you have custom built into your solution.
Scaling-Up is probably more relevant in scenarios where you have a central point/resource that cannot be scaled-out, like a central database.

Why are most cache line sizes designed to be 64 byte instead of 32/128byte now?

I found that on Linux my CPU's cache line size is 64 bytes, and I realize that 16/32/128-byte lines exist, but most CPUs are designed with a 64-byte cache line size now. Why not bigger or smaller?
It's a trade-off. Wider caches are more efficient (in terms of area/power for a given cache size), but result in more memory traffic for random (non-sequential/strided) access, and more false sharing contention between parallel caches.
If you have a memory access pattern that only needs a few bytes from each cache line (e.g., iterating along a linked list that is scattered widely across memory), each access will need to pull an entire line into the cache. So doubling the line size will double the memory traffic.
If different CPUs, each with its own cache, are accessing memory on the same cache line, that line will have to "bounce" back and forth between the caches. Avoiding this means putting more padding between objects.
In both cases, these problems can be avoided by tuning the software to want memory in chunks that are multiples of the cache line size. The bigger the cache line size, the more work that is.
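To make the padding cost concrete, here is a minimal C++ sketch; the 64-byte kLineSize is an assumption about the target machine, and the point is that doubling the line size doubles the padding wasted per object.

#include <atomic>
#include <cstddef>

constexpr std::size_t kLineSize = 64;   // assumed cache line size of the target CPU

// Give each counter its own cache line so threads updating different
// counters never contend on the same line (no false sharing).
struct alignas(kLineSize) PaddedCounter {
    std::atomic<long> value{0};
    // alignas pads the struct out to kLineSize bytes; with 128-byte lines
    // the same trick would waste twice as much memory per counter.
};

PaddedCounter per_thread_counters[8];   // e.g., one per worker thread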
As Chris Dodd's answer points out, the sizing of cache lines involves trade-offs.
Larger cache lines reduce the number of tag bits per data byte, provide prefetching, and facilitate higher bandwidth (particularly at the memory and the L1 interfaces) at the cost of excessive prefetch (wasting bandwidth and cache capacity), false sharing, higher miss latency (especially without critical word first/early restart), and higher conflict misses (for smaller caches, with fewer sets the probability that more accesses than the associativity will map to a particular set increases). (Larger cache lines can also provide greater performance predictability by guaranteeing a cache hit within a larger address range and number of bytes.)
Modern systems would not noticeably benefit from such prefetching; configurable static prefetcher logic would provide the same behavior and dynamic prefetching can exploit variable resource availability (e.g., cache capacity and memory channel occupancy) and utility as well as provide more flexible prefetching (such as non-unit stride).
Tag overhead is not as significant a concern in terms of area for modern caches using SRAM for data as well as tags. (IBM's Power and zArchitecture implementations use eDRAM for outer cache data storage and SRAM for tags, which more than doubles the area cost of tags relative to data.) However, access latency and access energy are affected by the size of the tag arrays. For L1 caches, way prediction is more effective with larger cache lines both because there are fewer cache lines for a given cache capacity and because spatial locality tends to apply even beyond reasonable cache line sizes; only having to check one set of tags for a wider or larger number of accesses reduces the cost of higher bandwidth (this is most noticeable in GPUs which exploit spatial locality and sacrifice latency for bandwidth). For outer cache levels, phased tag-data access is often used (tags are checked before data access begins, saving energy — especially given higher associativity and miss rates); smaller tag arrays for a given capacity both reduce access energy and latency (especially for misses — 50% hit rates are not unheard of). (Note that one can use partial tags to provide early miss detection in the common case where a miss has no matching partial tags. Other filtering mechanisms are also possible.)
False sharing can be countered by using sectored caches, where more than one validity (or coherence state) entry is provided for each address portion of the tag. This provides an intermediate design point between larger cache lines with more frequent false sharing and smaller cache lines with higher tag overhead. Such also inherently supports reducing cache line fill delay. For traditional layouts, this has the substantial effective capacity cost of large cache lines when false sharing or less spatial locality is more common. For designs using indirection, such as proposed Non-Uniform Cache Architectures and V-Way caches, the capacity utilization issue can be reduced by allocating data storage at a finer granularity at the cost of more indirection pointer storage.
Larger cache lines provide three bandwidth benefits. The command overhead is less (address and action information is nearly constant — address is one bit smaller for each doubling in size) so the bandwidth overhead per data byte is lower; this is more significant for coherence traffic where many messages carry only metadata. (Obviously with more numerous coherence nodes false sharing can be more problematic acting against this advantage.) Other per-request overheads also do not scale with request size (e.g., DRAM row activation with random-access, close-on-completion row management). ECC (or check codes with retransmission) also have less per payload byte overhead with larger payloads (this can be used to store extra metadata while using commodity width memory modules).
A larger cache line also facilitates wider memory interfaces when burst length is fixed. Increasing DRAM burst length facilitates higher bandwidth; DDR5 moved to a burst length of 16, pushing DIMMs into using two 32-bit wide channels to be compatible with x86's de facto standardization on 64-byte cache lines. While this change can be viewed positively as increasing available memory level parallelism (MLP) — doubling the number of channels and reducing DRAM bank conflicts — MLP is more important when relative latency of memory is greater (large on-chip caches and faster processing), thread-level parallelism is available (multicore and multithreading), and out-of-order execution (and multithreading) expose more memory accesses to hide latency. Multicore (when used with significant data memory sharing as opposed to multiprogramming or large chunk/stream communication such as pipeline-style multithreading) also increases the importance of false sharing, further reducing the benefits of larger cache lines (beyond MLP benefits of narrower channels). With lower (on-chip) communication latency and near-unavoidability of multicore processors, multithreaded programming becomes more attractive.
For L1 caches, the microarchitecture (and, to some extent, the ISA) can influence cache line sizes. Higher-frequency designs favor smaller-capacity L1 caches both for latency and access energy, especially with less latency tolerance from out-of-order execution or skewed pipelines (where the execution pipeline phase is one or more stages delayed from the address generation phase).
The relative sizes of the various tradeoffs also depend on the workload and software design. Larger-capacity caches targeting workloads that benefit from that capacity reduce the excessive-prefetch and conflict disadvantages of larger cache lines; higher associativity reduces the conflict disadvantage, but workloads that are more likely to have conflicts are less likely (in general) to benefit from spatial locality (and the conflict disadvantage is typically less important for outer cache levels). Pointer-chasing workloads tend to favor lower latency and thus lower capacity, favoring smaller cache lines (at least in L1).
Software design is a significant factor. Avoiding false sharing tends to increase padding as cache line size increases, discouraging larger cache lines. Once a cache line size assumption is established in a software community (which is somewhat segregated according to ISA, OS, and hardware/system vendor) the effect of legacy code and legacy conceptualizations constrains cache line size.
Speculation: x86's orientation toward generic software and personal computer uses (cost and workload characteristics biasing toward smaller caches and workload perhaps generally having lower spatial locality) probably biased the choice toward a smaller cache line than ISA/hardware vendors targeting workstation and server workloads with a higher expectation of software development effort. x86 has standardized on 64-byte cache lines, IBM POWER9 uses 128-byte cache blocks (divided into four sectors for L1 caches) and IBM z15 uses 256-byte cache blocks.
(Latency vs. hit rate, access energy, and other tradeoffs as well as software and programmer legacy seem to lead to less strict standardization on 32KiB L1 cache capacity. The performance impact for a smaller or larger cache can be less significant than for false sharing, so the software constraints are less significant than for cache line size.)

Why LRU implementation is expensive in full associative TLB?

I have a book statement:
Implementation of LRU in full associative TLB is very expensive, so the general way is to use random substitution.
I don't understand why it's expensive in a fully associative cache. Isn't that just adding an additional reference bit...?
LRU requires maintaining a total order relation between all valid cache lines in a cache set. For example, consider a 3-way cache set with the following lines A, B, and C ordered from the most recently accessed to the least recently accessed (represented as ABC). If C is accessed next, then the order becomes CAB. If a new line, D, needs to be filled in the same cache set, since there are no invalid lines, the LRU replacement policy will choose B to be evicted and replaced by the new line. Then the order becomes DCA.
For a 3-way cache, there are up to 3*2*1 = 6 possible orders for the lines in each set. In general, for an N-way cache, there are up to N! (N factorial) possible orders. Theoretically, you need at least log2(N!) bits (rounded up to the nearest integer) per cache set to maintain the LRU order exactly. Note that log2(N!) is Θ(N log N), so it grows superlinearly with respect to the number of ways. No normal person likes anything whose cost grows superlinearly.
A particularly cheap case is a 2-way cache, where the LRU state requires only log2(2!) = 1 bit, i.e., a single bit. It is much more expensive for any other number of ways though.
In practice, though, there is no easy way to maintain a single number that represents the LRU state of a set. If the current LRU state is X and then some access to a line occurs, how can the next LRU state be determined? There is no simple mathematical relation that can be implemented in hardware. So instead of using a single number, a realistic implementation would use multiple numbers, one per cache line. In this case, these numbers are called ages. Such design would even require (many) more bits than the theoretical minimum log2(N!) to maintain the LRU state.
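A minimal sketch of that ages scheme for a single set (the LruSet name and the 8-way example are illustrative only) makes the cost visible: every access has to compare and possibly update the age of every way, which in hardware means one small counter and comparator per way, all working in parallel.

#include <cstddef>
#include <cstdint>

// LRU state for one N-way set, kept as one age per way.
// Age 0 = most recently used, N-1 = least recently used.
template <std::size_t N>
struct LruSet {
    std::uint8_t age[N];

    LruSet() {
        for (std::size_t w = 0; w < N; ++w)
            age[w] = static_cast<std::uint8_t>(w);   // start from an arbitrary full order
    }

    void touch(std::size_t way) {            // called on every access that hits `way`
        std::uint8_t old = age[way];
        for (std::size_t w = 0; w < N; ++w)  // every way's age may need adjusting
            if (age[w] < old) ++age[w];
        age[way] = 0;
    }

    std::size_t victim() const {             // way to evict on a miss
        std::size_t v = 0;
        for (std::size_t w = 1; w < N; ++w)
            if (age[w] > age[v]) v = w;
        return v;
    }
};

// In a fully associative TLB, the "set" is the entire TLB (often 32-64+ entries),
// so this all-entries update on every access is what makes true LRU expensive there.
LruSet<8> example_set;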
Aside from the hardware overhead, the LRU replacement policy is not necessarily optimal for performance. It depends on the memory access patterns of the applications in the target market domain and the rest of the cache hierarchy.
LRU has been used in many real processors. Caches that are 2-way associative typically use LRU. For example, AMD SledgeHammer uses LRU for both L1I and L1D caches. The Itanium 2 processor's L1 instruction cache uses LRU and it is 4-way associative. Usually, when the number of ways is larger than two, caches don't use LRU.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute maximum speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. Streaming SIMD Extensions and InterlockedExchange in Visual C++.
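A minimal sketch of that suggestion (the Slot layout and scatter_add name are illustrative, not a vetted production design): one spinlock per destination slot, SSE for the add, and 16-byte alignment so each accumulator is a single aligned load/store.

#include <immintrin.h>   // SSE intrinsics
#include <atomic>
#include <cstddef>

struct Slot {
    alignas(16) float xyz[4];            // x, y, z, padding - one aligned SSE register's worth
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
};

// Sum one input vector into its destination slot under a per-slot spinlock.
void scatter_add(Slot* memory, std::size_t id, const float* v /* 4 floats */) {
    Slot& s = memory[id];
    while (s.lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
    __m128 dst = _mm_load_ps(s.xyz);     // aligned load of the running sum
    __m128 src = _mm_loadu_ps(v);        // input record may not be aligned
    _mm_store_ps(s.xyz, _mm_add_ps(dst, src));
    s.lock.clear(std::memory_order_release);
}

With little contention the lock is almost always free, so the cost is dominated by the cache miss on memory[id], which matches the bandwidth-bound point below.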
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute destination data between these threads. And provide each thread with the same input data. This allows better use of NUMA architecture. And avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter, it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve performance or degrade it, I don't know. It avoids cache thrashing, but leaves most of the cache unused.
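For what it's worth, here is a hedged sketch of that prefetch idea, building on the scatter_add and Slot sketch above; the look-ahead distance of 8 records is a guess that would need tuning.

#include <immintrin.h>
#include <cstddef>

// Process an input stream, prefetching the destination slot ~8 records ahead
// with the non-temporal hint to limit cache pollution.
void scatter_add_stream(Slot* memory, const std::size_t* ids,
                        const float* values /* 4 floats per record */,
                        std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (i + 8 < count)
            _mm_prefetch(reinterpret_cast<const char*>(&memory[ids[i + 8]]),
                         _MM_HINT_NTA);
        scatter_add(memory, ids[i], &values[i * 4]);
    }
}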

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers at the top of the hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1), as soon as memory is accessed, things get more complex and cache controllers / memory buses get involved and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, Xtensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing — with optional post-increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy), and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so would have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort are more profitably spent improving the performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS — then more than three times as many of the (more timing-critical) reads are from registers as from memory (1.3 per instruction vs. 0.4), and register writes outnumber memory writes by an even wider margin (0.75 vs. 0.1 per instruction), so optimizing register access pays off on almost every instruction.
Complicating Factors
Not all processors are designed for "general purpose" workloads. E.g., a processor using in-memory vectors and targeting dot-product performance, using registers for the vector start address, vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU, as Bill mentioned, which has the basic components of an ALU, some RAM, as well as other forms of memory to assist with its operations. The RAM is what you are referring to as main memory.
The ALU handles all of the arithmetic and logical operations; to operate on any operands and perform these calculations, it loads the operands into registers, performs the operations on them, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the least number of clock cycles.