SIMD programming: hybrid approch for data structure layout - cpu-architecture

The Intel Optimization Reference Manual
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
discusses the advantage of Structure-Of-Arrays (SoA) data layout for SIMD processing compared to the traditional Array-Of-Structures (AoS) layout. This is clear.
However, there's one argument I don't understand. On page 4-23 it says "SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays X, Y, and Z (see Example 4-20) would require three separate data streams. This can require the use of more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access efficiency." To mitigate this problem they recommend a hybrid approach (Example 4-22).
Can somebody please explain the "three separate data streams", the "prefetches" and "additional address generation calculations", and "impact on DRAM page access efficiency"?
At https://stackoverflow.com/a/40169187/3852630 Peter Cordes discusses two effects: Three different data streams for X, Y, and Z would tie up three registers for the addresses, and if the three arrays would be mapped to the same cache lines, frequent cache eviction would be a problem. However, registers are not a sparse resource on modern CPUs, and multi-way caches should mitigate the cache problem.

Related

What kind of program(s) cannot maintain its complexity structure when the program or data is partitioned?

I received this question in my operating systems class and after some research I still cannot find an answer to this question.
I understand complexity structure to be the min complexity (number of computation steps) necessary to compute the given level of partitioning of data or program.
The answers are is in the question, namely programs that need more steps to tackle partitioned data or processing units.
If data access pattern (granularity, scope, cardinality) necessitates access and integration of the results.
Accessing and integrating products of division of computation (Threads, Processes, Nodes) (IO, Integration)
programs that have X level complexity utilising indexed access to all the parts and granularities of data at the same time. If the data were partitioned more steps would be necessary to access and query partitions individually Y and still more to integrate W, resulting in f(X, Y, W) level of complexity depending on the level of integration and access.
One example would be programs that perform table join queries optimising searches by indexing (SQL joins). Such programs could not remain in the same complexity operate if the tables or columns were in different databases or nodes (NoSQL(Key value, Columnar ...)).
Another example would be a program calling (threads, processes, nodes) and combining the results. Calling the threads and combining the results would take more computation steps than doing it sequentially.
The question is a bit out of context you would do well do add context!

How Finagle aperture algorithm chooses "non overlapping" subsets?

I have been reading about Finagle and trying to understand the code to figure out how Aperture's subset choice works.
I have seen that ApertureLeastLoaded has a "useDeterministicOrdering" and an "EndpointFactory" which I guess should be the key points to make the decision of which clients to take in the subset.
While reading the "deterministic subsetting" section of Google SRE's book, I understood that the best way to pick a subset of servers from the client point of view, is to know the total number of clients, and a unique sequential identifier of the current client, that can be used as seed of the subset generator.
In Finagle I can't understand how this process is done (I'm not super familiar with Scala) and the documentation both on the website and in the code, explain just how the aperture paradigm works, but not very clear how the initial subset is chosen
I hope somebody can enlighten me
One of the unique properties of Aperture is that its window is sized dynamically based on a clients offered load. That is, clients have a built in controller which can expand or shrink their window at runtime. This property is important as it allows clients to operate more efficiently and better adapt to a changing environment, but it does make it more complex to achieve a uniform load distribution across servers.
To contrast, the subsetting algorithm, as proposed by the Google SRE book, suggests that operators choose a static subset size which allows a uniform load distribution to be calculated analytically but introduces another static configuration that needs to be revisited as a system evolves.
Deterministic Aperture is, to the best of our knowledge, a novel algorithm for achieving a uniform load distribution while maintaining the dynamic properties of the window sizing mentioned above. From a high level, clients construct a topology of their peer cluster (which gives them a sense of ordering and proximity) and then derive a unique per-client permutation of the servers from the topology such that each server is uniformly represented across the permutations.
We are still in the early stages of testing this in production at Twitter, but early results look very promising. After we gather more empirical results, we hope to publish some more detailed content on how the algorithm works and its properties.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. The input data is a destination ID and an XYZ float vectors, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. Streaming SIMD Extensions and InterlockExchange in Visual C++.
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute destination data between these threads. And provide each thread with the same input data. This allows better use of NUMA architecture. And avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter, it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Will it improve performance or degrade it, I don't know. It avoids cache trashing, but leaves most of the cache unused.

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1), as soon as memory is accessed, things get more complex and cache controllers / memory buses get involved and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, XTensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing — with option post increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy) and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notable x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort is more profitably spent to improve performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS —, then more than three times as many of the (more timing-critical) reads are from registers than memory (1.3 per instruction vs. 0.4) and
Complicating Factors
Not all processors are design for "general purpose" workloads. E.g., processor using in-memory vectors and targeting dot product performance using registers for vector start address, vector length, and an accumulator might have little reason to optimize register latency (extreme parallelism simplifies hiding latency) and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU as Bill mentioned, that has the basic components of ALU, some RAM as well as other forms of memory to assist with its operations. The RAM is what you are referring to as Main memory.
The ALU handles all of the arthimetic logical operations and to operate on any operands to perform these calculations, it loads the operands into registers, performs the operations on these, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a the brain of your processor), they are higher up in the chain and ofcourse operations performed directly on registers take the least amount of clock cycles.

Data Scrambling Purpose

Can someone please explain to me what data scrambling is when it comes to a memory controller? According to Wikipedia, it somehow masks the user data with random patterns to prevent reverse engineering of a DRAM. But, it is also is used to finding electrical problems. Can someone please elaborate on these features of data scrambling? Thanks!
The Wikipedia article claimed:
Memory controllers integrated into certain Intel Core processors also
provide memory scrambling as a feature that turns user data written to
the main memory into pseudo-random patterns.[6][7] As such, memory
scrambling prevents forensic and reverse-engineering analysis based on
DRAM data remanence, by effectively rendering various types of cold
boot attacks ineffective. However, this feature has been designed to
address DRAM-related electrical problems, not to prevent security
issues, so it may not be rigorously cryptographically secure.[8]
However, I think that this claim is somewhat misleading because it implies that the purpose of data scrambling is to prevent reverse engineering. In fact the cited sources (listed as [6][7] in the quote) say the following:
The memory controller incorporates a DDR3 Data Scrambling feature to
minimize the impact of excessive di/dt on the platform DDR3 VRs due to
successive 1s and 0s on the data bus. Past experience has demonstrated
that traffic on the data bus is not random and can have energy
concentrated at specific spectral harmonics creating high di/dt that
is generally limited by data patterns that excite resonance between
the package inductance and on-die capacitances. As a result, the
memory controller uses a data scrambling feature to create
pseudo-random patterns on the DDR3 data bus to reduce the impact of
any excessive di/dt.
Basically the purpose of scrambling is to limit fluctuations in the current draw that is used on the DRAM data bus. There is nothing in the cited source to support the claim that it is designed to prevent reverse-engineering, though I suppose it is reasonable to assume that it might make reverse engineering more difficult. I'm not an expert in this area so I don't know for sure.
I have edited the Wikipedia article to remove the improperly sourced claim. Though I suppose someone could add it back it, but if so hopefully they can provide better sourcing.
It's not reverse engineering of the DRAM, it's reverse engineering of the data in the DRAM that scrambing is designed to prevent (e.g. forensics like cold-boot attacks), according to that article.
The electrical properties thing made me think of Row Hammer. Scrambling might make that harder, but IDK if that's what the author of that paragraph had in mind.