Does memory copying on APUs (e.g. Apple M1 Mac) use GPU-specific wide vector instructions?

I was reading this article Why mmap is faster than system calls, where the main difference appeared to be mmap's ability to use vector instructions like AVX-2, something system calls can't.
I understand that the SIMD instructions used by GPUs tend to be much wider. An Nvidia warp of size 32 operating on float32 = 1024 bits (?) vs. 256 bits for AVX2, so potentially a 4x speedup. I guess this is not used in traditional discrete GPU settings because the host-to-device (and back) copy would outweigh any benefit from wide registers.
However in APUs, GPU shares memory with CPU, eliminating the need for these expensive copies. I was wondering if those GPU instructions can therefore be used to accelerate mmap like vector operations further (numpy is another example). Has it already been done (in M1 mac or any CPUs with integrated graphics)? or can you please detail the architectural issues that prevent this?

You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmaped region). And separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.
For the 2nd part Apple has said the answer is yes for MacOS on M1 thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion IIRC?), but IDK if software ever took full advantage.
For the first part; doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not CPU-core <-> L1d cache bandwidth (which scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses that a core can track, not by doing the copy with fewer instructions. That mostly helps in terms of seeing farther with the same size out-of-order execution window, to start page walks sooner across page boundaries and stuff like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
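To put rough numbers on the "outstanding cache misses" limit, here is a minimal sketch using Little's law (sustained bandwidth = bytes in flight / latency). The function name and the figures (64-byte lines, 80 ns latency, 10 or 20 lines in flight) are assumptions for illustration, not measurements of any real CPU. Note that vector width does not appear in the formula at all.

```python
def single_core_bandwidth_gb_per_s(outstanding_lines, line_bytes, latency_ns):
    # Little's law: bytes in flight divided by latency.
    # Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
    return outstanding_lines * line_bytes / latency_ns

# Assumed, illustrative figures (not measurements of any particular CPU).
LINE_BYTES = 64
LATENCY_NS = 80

ten_in_flight = single_core_bandwidth_gb_per_s(10, LINE_BYTES, LATENCY_NS)
twenty_in_flight = single_core_bandwidth_gb_per_s(20, LINE_BYTES, LATENCY_NS)
print(ten_in_flight, twenty_in_flight)  # 8.0 16.0
```

In this model, only tracking more misses or lowering latency raises the ceiling; copying with fewer, wider instructions does not.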
GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in L1d cache of the CPU. And/or not be able to take advantage of the source or destination already being hot in L1d cache of a CPU working with the data.
Also, setup overhead (to communicate with the GPU, going outside the current core) would dominate any faster copying for small copies.
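A toy cost model makes the setup-overhead argument concrete. Everything here is hypothetical: the function names, the 5 microsecond setup cost, and the bandwidth figures are assumptions chosen for illustration only. It also ignores the cache-locality costs mentioned above.

```python
def cpu_copy_ns(nbytes, cpu_gb_per_s):
    # 1 GB/s equals 1 byte per nanosecond, so bytes / (GB/s) gives nanoseconds.
    return nbytes / cpu_gb_per_s

def offload_copy_ns(nbytes, gpu_gb_per_s, setup_ns):
    # Offloading pays a fixed setup cost before any bytes move.
    return setup_ns + nbytes / gpu_gb_per_s

def break_even_bytes(cpu_gb_per_s, gpu_gb_per_s, setup_ns):
    # If both sides are limited by the same DRAM bandwidth, offload never wins.
    if gpu_gb_per_s <= cpu_gb_per_s:
        return None
    return setup_ns / (1 / cpu_gb_per_s - 1 / gpu_gb_per_s)

# Assumed figures: same DRAM-bound bandwidth on both sides.
print(break_even_bytes(20, 20, 5000))  # None
# Assumed figures: GPU somehow 25% faster; copies still need to be large to win.
print(round(break_even_bytes(20, 25, 5000)))
```

With equal (DRAM-bound) bandwidth there is no break-even point, and even with an optimistic GPU advantage only large copies would amortize the setup cost.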

Related

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers (see Footnote 1) for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
(Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs built around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
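As a rough sketch of the timing model described so far: with single-cycle stages, one instruction completes per cycle once the pipeline is full, and every cache miss adds its full penalty because the whole pipeline stalls. The function name and the 20-cycle miss penalty below are assumptions for illustration, and the model ignores load-use and branch delays.

```python
def total_cycles(n_instructions, stages=5, stall_cycles=()):
    # In-order scalar pipeline: (stages - 1) cycles to fill, then one
    # instruction completes per cycle, plus every stall cycle in full.
    if n_instructions == 0:
        return 0
    return (stages - 1) + n_instructions + sum(stall_cycles)

no_misses = total_cycles(100)
two_misses = total_cycles(100, stall_cycles=(20, 20))  # assumed 20-cycle penalties
print(no_misses, two_misses)  # 104 144
print(100 / no_misses, 100 / two_misses)  # IPC in each case
```

So each stage still "takes" one cycle in the diagrams; a slow memory access shows up as extra stall cycles rather than as a longer stage.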
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
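The cycle cost of a miss is just latency times clock frequency, which is where the 350-cycle figure above comes from. In this sketch the 70 ns / 5 GHz pair is from the text; the 200 ns DRAM latency for an R2000-era system is purely an assumed placeholder, not a sourced number.

```python
def miss_cycles(latency_ns, clock_ghz):
    # Nanoseconds times cycles-per-nanosecond (GHz) gives core clock cycles.
    return latency_ns * clock_ghz

modern = miss_cycles(70, 5.0)     # figures from the text
early = miss_cycles(200, 0.015)   # 15 MHz R2000; 200 ns is an assumption
print(round(modern), round(early))  # 350 3
```

Even with a pessimistic guess for old DRAM latency, stalling on every miss cost only a handful of cycles at those clock speeds.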
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors: A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
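The arithmetic behind that minimum size can be sketched as follows, assuming (as guessed above) one 4-byte word per line. The constant and function names are made up for this example.

```python
ADDRESS_BITS = 32       # physical address size, from the text
TAG_BITS = 20           # fixed tag width, from the text
BYTE_IN_WORD_BITS = 2   # 4-byte words

def min_index_bits():
    # tag + index + byte-in-word must cover the whole address.
    return ADDRESS_BITS - TAG_BITS - BYTE_IN_WORD_BITS

def cache_bytes(index_bits):
    # One 4-byte word per index value (assumed line size of one word).
    return (1 << index_bits) * 4

def split_address(addr, index_bits):
    # Break a physical address into (tag, index, byte offset).
    offset = addr & 0b11
    index = (addr >> BYTE_IN_WORD_BITS) & ((1 << index_bits) - 1)
    tag = addr >> (BYTE_IN_WORD_BITS + index_bits)
    return tag, index, offset

print(min_index_bits(), cache_bytes(min_index_bits()))  # 10 4096
print([hex(x) for x in split_address(0x12345678, 10)])
```

With 14 index bits the same formula gives 64 KiB, at which point only 16 of the 20 tag bits would strictly be needed.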
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
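The factor of 8 falls straight out of dividing tag bits by data bits per line. A tiny sketch (the function name is just for illustration):

```python
def tag_overhead(tag_bits, line_bytes):
    # Tag storage as a fraction of data storage for one cache line.
    return tag_bits / (line_bytes * 8)

one_word_lines = tag_overhead(20, 4)     # 20 tag bits per 32 data bits
eight_word_lines = tag_overhead(20, 32)  # 20 tag bits per 256 data bits
print(one_word_lines, eight_word_lines, one_word_lines / eight_word_lines)
# 0.625 0.078125 8.0
```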
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't fit well for things like integrated GPUs that have some specialized hardware and can filter out which pixels even need to be processed using dedicated hardware instead of software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
Also for GPUs, a GPU has more computational throughput than you can fit down the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than a single core of a modern Intel CPU, even with 2x 256-bit FMA units. So you'd need to have a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than even a normal in-order CPU.
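For a rough sense of scale, peak single-precision throughput per clock can be estimated as units x lanes x 2 (an FMA counts as a multiply plus an add). The CPU side below follows the "2x 256-bit FMA units" from the text; the GPU figures (24 execution units at 16 FLOPs per cycle each) are an assumed example configuration, not a claim about any specific product.

```python
def cpu_flops_per_cycle(fma_units, vector_bits, element_bits=32):
    # Each FMA unit does one multiply and one add per lane per cycle.
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2

def gpu_flops_per_cycle(execution_units, flops_per_unit):
    # Hypothetical integrated GPU: total is just units times per-unit rate.
    return execution_units * flops_per_unit

cpu = cpu_flops_per_cycle(fma_units=2, vector_bits=256)
gpu = gpu_flops_per_cycle(execution_units=24, flops_per_unit=16)  # assumed
print(cpu, gpu)  # 32 384
```

GPU clocks are usually lower than CPU core clocks, so the gap in FLOPS per second is smaller than the per-cycle gap, but under these assumptions it is still several times one core's peak.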

Do modern CPUs have compression instructions?

I have been curious about this for a while, since compression is used in just about everything.
Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?
They don’t have general-purpose compression instructions.
AES operates on very small data blocks, it accepts two 128 bit inputs, does some non-trivial computations on them, produces single 128 bit output. A dedicated instruction to speed up computation helps a lot.
On modern hardware, lossless compression speed is often limited by RAM latency. A dedicated instruction can't improve speed; bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches. They already work well enough for compression.
If you need to compress many gigabits/second, there are several standalone accelerators, but these are not parts of processors; they are usually standalone chips connected via PCIe. And they are very niche products because most users just don't need to compress that much data that fast.
However, modern CPUs have a lot of stuff for lossy multimedia compression.
Most of them have multiple vector instruction set extensions (mmx, sse, avx), and some of these instructions help a lot for e.g. video compression use case. For example, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), _mm256_sad_epu8 (AVX2) are very helpful for estimating compression errors of 8x8 blocks of 8 bit pixels. The AVX2 version processes 4 rows of the block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
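For readers unfamiliar with SAD (sum of absolute differences), here is a plain scalar sketch of the quantity those instructions compute for an 8x8 block of 8-bit pixels. This is only a reference model of the math, not how the intrinsics are used; the function names and test blocks are made up for the example.

```python
def row_sad(row_a, row_b):
    # Sum of absolute differences over one row of 8 unsigned 8-bit pixels.
    # The hardware instruction produces one such sum per 8-byte group.
    return sum(abs(a - b) for a, b in zip(row_a, row_b))

def sad_8x8(block_a, block_b):
    # Total error between two 8x8 blocks: add up the eight row sums.
    return sum(row_sad(ra, rb) for ra, rb in zip(block_a, block_b))

# Two hypothetical blocks that differ by 3 in every pixel.
original = [[10] * 8 for _ in range(8)]
candidate = [[13] * 8 for _ in range(8)]
print(sad_8x8(original, candidate))  # 192
```

A video encoder evaluates this for many candidate blocks during motion search, which is why doing several rows per instruction matters.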
Finally, many CPUs have integrated GPUs which include specialized silicon for hardware video encoding and decoding, usually h.264, newer ones also h.265. Here's a table for Intel GPUs, AMD has separate names for encoding and decoding parts. That silicon is even more power efficient than SIMD instructions in the cores.
Many applications in all kinds of domains certainly can benefit from and do use data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to having hardware support for other popular functions such as encryption/decryption, various mathematical transformations, bit counting, and others. However, compression/decompression typically operate on large amounts of data (many MBs or more) and different algorithms exhibit different memory access patterns that are potentially either not friendly to traditional memory hierarchies or even adversely impacted by them. In addition, as a result of operating on large amounts of data and if implemented directly in the main CPU pipeline, the CPU would almost be fully busy for long periods of time doing compression or decompression. On the other hand, consider encryption for example, encrypting small amounts of data is typical, and so it would make sense to have hardware support for encryption directly in the CPU.
It is precisely for these reasons why hardware compression/decompression engines (accelerators) have been implemented either as ASICs or on FPGAs by many companies as coprocessors (on-die, on-package, or external) or expansion cards (connected through PCIe/NVMe) including:
Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic proposals.
That said, it is possible to achieve very high throughputs on a single modern x86 core. Intel published a paper in 2010 in which it discusses the results of an implementation, called igunzip, of the DEFLATE decompression algorithm. They used a single Nehalem-based physical core and experimented with using a single logical core and two logical cores. They achieve impressive decompression throughputs of more than 2 Gbits/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can perform about 10 times faster.
Intel has a number of related patents:
Apparatus for Hardware Implementation of Lossless Data Compression.
Hardware apparatuses and methods for data decompression.
Systems, Methods, and Apparatuses for Decompression using Hardware and Software.
Systems, methods, and apparatuses for compression using hardware and software.
Although it's hard to determine which Intel products employed the techniques or designs proposed in these patents.

MATLAB program simulation with the given processor requirements

I have a system with the configuration Intel(R) Core(TM) i3-5020U CPU @ 2.2 GHz, 4 GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with the configuration Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz, 16 GB RAM. Is there a way to perform this kind of simulation?
TL:DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
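To spell out that hypothetical extrapolation: if the code were purely bound by FP multiply latency, run time would scale with the latency ratio (in cycles) times the inverse clock ratio. The sketch below uses the latency and clock figures quoted above as defaults; the function name and the 10-second input are made up for illustration, and the whole thing only holds under that single-bottleneck assumption.

```python
def extrapolated_ivb_seconds(bdw_seconds,
                             bdw_latency_cycles=3, ivb_latency_cycles=5,
                             bdw_ghz=2.2, ivb_ghz=3.4):
    # Same number of dependent multiplies on both CPUs, so cycle count scales
    # with latency; wall time is cycles divided by clock frequency.
    cycle_ratio = ivb_latency_cycles / bdw_latency_cycles
    clock_ratio = bdw_ghz / ivb_ghz
    return bdw_seconds * cycle_ratio * clock_ratio

estimate = extrapolated_ivb_seconds(10.0)  # hypothetical 10 s measured on BDW
print(round(estimate, 2))  # 10.78
```

Under these assumptions the higher clock on the i5 is more than cancelled by the longer multiply latency, which shows why clock speed alone is a poor predictor.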
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to measure practical efficiency across different computers. That's why you usually reason about theoretical efficiency with Big-O; check the wiki page for algorithm efficiency and Big-O notation.
If you have access to both codes (yours and the other person's), you can test them on the same computer with the methods for measuring performance proposed by MathWorks, which are mainly timing functions for wall-clock time and CPU time.
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4GB), which is useful for a large simulation. Interesting architectural hardware advantages:
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering where (if anywhere) there is an opportunity to reduce simulation times with 64-bit CPUs. I expect that the software should be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate), nor does it use floating-point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instruction set and would accelerate simulation (computationally intensive and lengthy 32-bit algorithms)?
If you have experience implementing simulations and have transitioned from 32 to 64 bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community
The most immediate computational benefit I can think of regarding CPU instructions would be AVX, although this is only loosely related to x86_64 and is more of a CPU-generational issue.
In our company, we developed multiple, highly-complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly due to memory addressing, allowing for larger caches and wider choice of algorithms (e.g. data-centric design, concurrency), graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE