l2 cache CPU comparisions - cpu-cache

So, I have a older q8300 Quad Core CPU that uses a L2 cache of 4mb and I have been thinking of upgrading soon to a 6600k.
I took a look at all the specs of the 6600k and noticed it has a smaller L2 cache. It is around 1 MB, does this matter or does the 6600k's lower L2 cache not matter even in scenarios where applications depend on a higher L2 cache to put data through? I want to make sure I know how the L2 cache pipelines work in a streamlined way, that way I can figure out if I should just keep my q8300 quad for L2 dependent applications.
I was thinking, maybe due to the more advanced architecture of the 6600k maybe it passes data through to L3 more efficiently? I'm not sure about all this and I am still learning, so if someone could explain this that would be awesome.

It's a little confusing since the q8300 is a several generations older Merom/Conroe based architecture which has only 2 cache levels, while the i5-6600k is Skylake based and has 3 cache levels (actually some of them have 4, with an additional 128MB worth of EDRam, but not the 6600 I think).
What you should be comparing between is the "last-level" shared caches of the two products - this is usually designed as some co-located cache strip aside each core, so the size usually scales with core count (except when you have chopped products). Your old system had 2MB per core as last level (L2) cache, and the new one has 1.5MB per core as the last level (L3), but with 2x the cores you have a higher overall. If you would run a single thread it would enjoy the higher capacity, and if you activate all cores, you would probably have a higher aggregated performance, depending on the task.
On top of that, you get a whole new cache level with 256k per core, which is now the L2. This is not really added to your overall capacity, as the last level cache is usually inclusive (see - Inclusive or exclusive ? L1, L2 cache in Intel Core IvyBridge processor), so you're duplicating the same space in the L3, but it does serve to compensate for the additional distance of the L3. Also note that the performance should have grown quite dramatically over the 7 (!) generations that passed between these 2 CPUs, so you probably shouldn't worry too much -
http://cpuboss.com/cpus/Intel-Core2-Quad-Q8300-vs-Intel-Core-i5-6600K

Related

Why are most cache line sizes designed to be 64 byte instead of 32/128byte now?

i found in linux, it shows my cpu's cache line size is 64 byte, and i realized 16/32/128 byte is existing, but most cpu are designed to 64 byte cache line size now. why not bigger or smaller?
It's a trade-off. Wider caches are more efficient (in terms of area/power for a given cache size), but result in more memory traffic for random (non-sequential/strided) access, and more false sharing contention between parallel caches.
if you have a memory access pattern that only needs a few bytes from each cache line (eg, iterating along a linked list that is scattered widely across memory), each access will need to pull an entire line into the cache. So doubling the line size will double the memory traffic.
if different CPUs, each with its own cache, are accessing memory on the same cache line, that line will have to "bounce" back and forth between the caches. Avoiding this means putting more padding between objects.
In both cases, these problems can be avoided by tuning the software to want memory in chunks that are multiples of the cache line size. The bigger the cache line size, the more work that is.
As Chris Dodd's answer points out, the sizing of cache lines involves trade-offs.
Larger cache lines reduce the number of tag bits per data byte, provide prefetching, and facilitate higher bandwidth (particularly at the memory and the L1 interfaces) at the cost of excessive prefetch (wasting bandwidth and cache capacity), false sharing, higher miss latency (especially without critical word first/early restart), and higher conflict misses (for smaller caches, with fewer sets the probability that more accesses than the associativity will map to a particular set increases). (Larger cache lines can also provide greater performance predictability by guaranteeing a cache hit within a larger address range and number of bytes.)
Modern systems would not noticeably benefit from such prefetching; configurable static prefetcher logic would provide the same behavior and dynamic prefetching can exploit variable resource availability (e.g., cache capacity and memory channel occupancy) and utility as well as provide more flexible prefetching (such as non-unit stride).
Tag overhead is not as significant a concern in terms of area for modern caches using SRAM for data as well as tags. (IBM's Power and zArchitecture implementations use eDRAM for outer cache data storage and SRAM for tags, which more than doubles the area cost of tags relative to data.) However, access latency and access energy are effected by the size of the tag arrays. For L1 caches, way prediction is more effective with larger cache lines both because there are fewer cache lines for a given cache capacity and because spatial locality tends to apply even beyond reasonable cache line sizes; only having to check one set of tags for a wider or larger number of accesses reduces the cost of higher bandwidth (this is most noticeable in GPUs which exploit spatial locality and sacrifice latency for bandwidth). For outer cache levels, phased tag-data access is often used (tags are checked before data access begins, saving energy — especially given higher associativity and miss rates); smaller tag arrays for a given capacity both reduce access energy and latency (especially for misses — 50% hit rates are not unheardof). (Note that one can use partial tags to provide early miss detection in the common case where a miss has no matching partial tags. Other filtering mechanisms are also possible.)
False sharing can be countered by using sectored caches, where more than one validity (or coherence state) entry is provided for each address portion of the tag. This provides an intermediate design point between larger cache lines with more frequent false sharing and smaller cache lines with higher tag overhead. Such also inherently supports reducing cache line fill delay. For traditional layouts, this has the substantial effective capacity cost of large cache lines when false sharing or less spatial locality is more common. For designs using indirection, such as proposed Non-Uniform Cache Architectures and V-Way caches, the capacity utilization issue can be reduced by allocating data storage at a finer granularity at the cost of more indirection pointer storage.
Larger cache lines provide three bandwidth benefits. The command overhead is less (address and action information is nearly constant — address is one bit smaller for each doubling in size) so the bandwidth overhead per data byte is lower; this is more significant for coherence traffic where many messages carry only metadata. (Obviously with more numerous coherence nodes false sharing can be more problematic acting against this advantage.) Other per-request overheads also do not scale with request size (e.g., DRAM row activation with random-access, close-on-completion row management). ECC (or check codes with retransmission) also have less per payload byte overhead with larger payloads (this can be used to store extra metadata while using commodity width memory modules).
A larger cache line also facilitates wider memory interfaces when burst length is fixed. Increasing DRAM burst length facilitates higher bandwidth; DDR5 moved to a burst length of 16, pushing DIMMs into using two 32-bit wide channels to be compatible with x86's de facto standardization on 64-byte cache lines. While this chance can be viewed positively as increasing available memory level parallelism (MLP)— doubling the number of channels and reducing DRAM bank conflicts — MLP is more important when relative latency of memory is greater (large on-chip caches and faster processing), thread-level parallelism is available (multicore and multithreading), and out-of-order execution (and multithreading) expose more memory accesses to hide latency. Multicore (when used with significant data memory sharing as opposed to multiprogramming or large chunk/stream communication such as pipeline-style mulithreading) also increases the importance of false sharing, further reducing the benefits of larger cache lines (beyond MLP benefits of narrower channels). With lower (on-chip) communication latency and near-unavoidability of multicore processors, multithreaded programming becomes more attractive.
For L1 caches the microarchitecture (and somewhat the ISA) can be influence cache line sizes. Higher frequency designs favor smaller capacity L1 caches both for latency and access energy, especially with less latency tolerance from out-of-order execution or skewed pipelines (where the execution pipeline phase is one or more stages delayed from the address generation phase).
The relative sizes of the various tradeoffs also depends on the workload and software design. Larger cache capacity caches targeting workloads that benefit from such reduce the excessive prefetch and conflict disadvantages of larger cache lines; higher associativity reduces the conflict disadvantage, but workloads that are more likely to have conflicts are less likely (in general) to benefit from spatial locality (and the conflict disadvantage is typically less important for outer cache levels). Pointer-chasing workloads tend to favor lower latency and thus lower capacity, favoring smaller cache lines (at least in L1).
Software design is a significant factor. Avoiding false sharing tends to increase padding as cache line size increases, discouraging larger cache lines. Once a cache line size assumption is established in a software community (which is somewhat segregated according to ISA, OS, and hardware/system vendor) the effect of legacy code and legacy conceptualizations constrains cache line size.
Speculation: x86's orientation toward generic software and personal computer uses (cost and workload characteristics biasing toward smaller caches and workload perhaps generally having lower spatial locality) probably biased the choice toward a smaller cache line than ISA/hardware vendors targeting workstation and server workloads with a higher expectation of software development effort. x86 has standardized on 64-byte cache lines, IBM POWER9 uses 128-byte cache blocks (divided into four sectors for L1 caches) and IBM z15 uses 256-byte cache blocks.
(Latency vs. hit rate, access energy, and other tradeoffs as well as software and programmer legacy seem to lead to less strict standardization on 32KiB L1 cache capacity. The performance impact for a smaller or larger cache can be less significant than for false sharing, so the software constraints are less significant than for cache line size.)

How many clock cycles do the stages of a simple 5 stage processor take?

A 5 stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at RD state is
used.
WB – Register write back.
Now I know that an instruction fetch, for example, is from memory which can take 4 cycles (L1 cache) or up to ~150 cycles (RAM). However, in every pipelining diagram, I see something like this, where each stage is assigned a single cycle.
Now, I know of course real processors have complex pipelines with over 19 stages and every architecture is different. However, am I missing something here? With memory accesses in IF and MA, can this 5 stage pipeline take dozens of cycles?
Classic 5-stage RISC pipelines are designed around single-cycle latency L1d / L1i, allowing 1 IPC (instruction per clock) in code without cache misses or other stalls. i.e. the hopefully common / good case. Every stage must have a worst-case critical path latency of 1 cycle, or trigger a stall.
Clock speeds were lower back then (even relative to 1 gate delay) so you could get more done in a single cycle, and the caches were simpler, often 8k direct-mapped, single port, sometimes even virtually tagged (VIVT) so TLB lookup wasn't part of the access latency.
First-gen MIPS, the R2000 (and R3000), had on-chip controllers1 for its direct-mapped PIPT split L1i/L1d write-through caches, but the actual tags+data were off-chip, from 4K to 64K. Achieving the required single-cycle latency with this setup limited clock speeds to 15 MHz (R2000) or 33 MHz (R3000) with available SRAM technology. The TLB was fully on-chip.
vs. modern Intel/AMD using 32kiB 8-way VIPT L1d/L1i caches, with at least 2 read + 1 write port for L1d, at such high clock speed that access latency is 4 cycles best-case on Intel SnB-family, or 5 cycles including address-generation. Modern CPUs have larger TLBs, too, which also adds to the latency. This is ok when out-of-order execution and/or other techniques can usually hide that latency, but classic 5-stage RISCs just had one single pipeline, not separately pipelined memory access. See also Cycles/cost for L1 Cache hit vs. Register on x86? for some more links about how performance on modern superscalar out-of-order exec x86 CPUs differs from classic-RISC CPUs.
If you wanted to raise clock speeds for the same transistor performance (gate delay), you'd divide the fetch and mem stages into multiple pipeline stages (i.e. pipeline them more heavily), if cache access was even on the critical path (i.e. if cache access could no longer be done in one clock period). The downside of lengthening the pipeline is raising branch latency (cost of a mispredict, and the amount of latency a correct prediction has to hide), as well as raising total transistor cost.
Note that classic-RISC pipelines do address-generation in the EX stage, using the ALU there to calculate register + immediate, the only addressing mode supported by most RISC ISAs build around such a pipeline. So load-use latency is effectively 2 cycles for pointer-chasing, due to the load delay for forwarding back to EX.)
On a cache miss, the entire pipeline would just stall: those early pipelines lacked scoreboarding of loads to allow hit-under-miss or miss-under-miss for loads from L1d cache.
MIPS R2000 did have a 4-entry store buffer to decouple execution from cache-miss stores. (Apparently built from 4 separate R2020 write-buffer chips, according to wikipedia.) The LSI datasheet says the write-buffer chips were optional, but with write-through caches, every store has to go to DRAM and would create a stall without write buffering. Most modern CPUs use write-back caches, allowing multiple writes of the same line without creating DRAM traffic.
Also remember that CPU speed wasn't as high relative to memory for early CPUs like MIPS R2000, and single-core machines didn't need an interconnect between cores and memory controllers. (Although they maybe did have a frontside bus to a memory controller on a separate chip, a "northbridge".) But anyway, back then a cache miss to DRAM cost a lot fewer core clock cycles. It sucks to fully stall on every miss, but it wasn't like modern CPUs where it can be in the 150 to 350 cycles range (70 ns * 5 GHz). DRAM latency hasn't improved nearly as much as bandwidth and CPU clocks. See also http://www.lighterra.com/papers/modernmicroprocessors/ which has a "memory wall" section, and Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: why modern CPUs need multi-level caches as the mismatch between CPU speed and memory latency has grown.
Later CPUs allowed progressively more memory-level parallelism by doing things like allowing execution to continue after a non-faulting load (successful TLB lookup), only stalling when you actually read a register that was last written by a load, if the load result isn't ready yet. This allows hiding load latency on a still-short and fairly simple in-order pipeline, with some number of load buffers to track outstanding loads. And with register renaming + OoO exec, the ROB size is basically the "window" over which you can hide cache-miss latency: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
Modern x86 CPUs even have buffers between pipeline stages in the front-end to hide or partially absorb fetch bubbles (caused by L1i misses, decode stalls, low-density code, e.g. a jump to another jump, or even just failure to predict a simple always-taken branch. i.e. only detecting it when it's eventually decoded, after fetching something other than the correct path. That's right, even unconditional branches like jmp foo need some prediction for the fetch stage.)
https://www.realworldtech.com/haswell-cpu/2/ has some good diagrams. Of course, Intel SnB-family and AMD Zen-family use a decoded-uop cache because x86 machine code is hard to decode in parallel, so often they can bypass some of that front-end complexity, effectively shortening the pipeline. (wikichip has block diagrams and microarchitecture details for Zen 2.)
See also Modern Microprocessors
A 90-Minute Guide! re: modern CPUs and the "memory wall": the increasing mismatch between DRAM latency and core clock cycle time. DRAM latency has only dropped a little bit (in absolute nanoseconds) as bandwidth has continued to climb tremendously in recent years.
Footnote 1: MIPS R2000 cache details:
An R2000 datasheet shows the D-cache was write-through, and various other interesting things.
According to a 1992 usenet message from an SGI engineer, the control logic just sends 18 index bits, receiving a word of data + 8 tags bits to determine hit or not. The CPU is oblivious to the cache size; you connect up the right number of index lines to SRAM address lines. (So I guess a line-size of one 4-byte word?)
You have to use at least 10 index bits because the tag is only 20 bits wide, and you need tag+index+2(byte-in-word) to be 32, the physical address-space size. That sets a minimum cache size of 4K.
20 bits of tag for every 32 bits of data is very inefficient. With a larger cache, fewer tag bits are actually needed, since more of the address is used up as part of the index. But Paul Ries posted that R2000/R3000 does not support comparing fewer tag bits. IDK if you could wire up some of the address output lines to the tag input lines, to generate matching bits instead of storing them in SRAMs.
A 32-byte cache line would still only need 20-bit tags (at most), but would have one tag per 8 words, a factor of 8 improvement in tag overhead. CPUs with larger caches, especially L2 caches, would definitely want to use larger line sizes.
But you're probably more likely to get conflict misses with fewer larger lines, especially with a direct-mapped cache. And the memory bus can still be busy filling a previous line when you encounter another miss, even if you have critical-word-first / early-restart so the miss latency wasn't worse if the memory bus was idle to start with.

Does memory copying on APUs (e.g. apple m1 mac) use GPU-specific wide vector instructions?

I was reading this article Why mmap is faster than system calls, where the main difference appeared to be mmap's ability to use vector instructions like AVX-2, something system calls can't.
I understand that the SIMD instructions used by GPUs tend to be much wider. A Nvidia warp of size 32 operating on float32 = 1024 bits (?) vs 256 bits of AVX-2. So potentially a 4x speedup. I guess this is not used in traditional discrete gpu settings as host-to-device (and back) copy would outweigh any benefit from wide registers.
However in APUs, GPU shares memory with CPU, eliminating the need for these expensive copies. I was wondering if those GPU instructions can therefore be used to accelerate mmap like vector operations further (numpy is another example). Has it already been done (in M1 mac or any CPUs with integrated graphics)? or can you please detail the architectural issues that prevent this?
You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmaped region). And separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.
For the 2nd part Apple has said the answer is yes for MacOS on M1 thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion IIRC?), but IDK if software ever took full advantage.
For the first part; doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not CPU-core <-> L1d cache bandwidth (which scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses that a core can track, not by doing the copy with fewer instructions. That mostly helps in terms of seeing farther with the same size out-of-order execution window, to start page walks sooner across page boundaries and stuff like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in L1d cache of the CPU. And/or not be able to take advantage of the source or destination already being hot in L1d cache of a CPU working with the data.
Also, setup overhead (to communicate with the GPU, going outside the current core) would dominate any faster copying for small copies.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. The input data is a destination ID and an XYZ float vectors, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. Streaming SIMD Extensions and InterlockExchange in Visual C++.
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute destination data between these threads. And provide each thread with the same input data. This allows better use of NUMA architecture. And avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter, it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Will it improve performance or degrade it, I don't know. It avoids cache trashing, but leaves most of the cache unused.

Is Scala doing anything in parallel on its own?

I have little program creating a maze. It uses lots of collections (the default variant, which is immutable, or at least used as an immutable).
The program calculates 30 mazes with increasing dimensions. Using a for comprehension over (1 to 30)
Since with the latest versions the parallel collections framework became available I thought to give it a spin, hoping for some performance gain.
This failed and when I investigated a little, I found the following:
When run without any call to anything remotely parallel it still showed a processor load of about 30% on each of the 4 cores of my machine.
When I replaced the Range 1 to 30 with (1 to 30).par CPU load went up to about 80% on all cores (which I expected). The order in which the mazes completed became more or less random (which I expected). The total time for all mazes stayed the same.
Replacing some of the internally used collections with their parallel counter parts did seem to have an effect.
I now have 2 questions:
Why do I have all 4 cores spinning, although there isn't anything that runs in parallel.
What might be likely reasons for the program to still take the same time, no matter if running in parallel or not. There are no obvious other bottlenecks but CPU cycles (no IO, no Network, plenty of Memory via -Xmx setting)
Any ideas on this?
The 30% per core version is just a poor scheduler (sounds like Windows 7) migrating the process from core to core very frequently. It's probably closer to 25% per core (1/4) for your process plus misc other load making 30%. If you run the same example under Linux you would probably see one core pegged.
When you converted to (1 to 30).par, you started really using threads across all cores but the synchronization overhead of distributing such a small amount of work and then collecting the results cancelled out the parallelism gains. You need to break your work into larger independent chunks.
EDIT: If each of 1..30 represents some larger amount of work (solving a maze, say) then automatic parallelization will work much better if each unit of work is about the same. Imagine you had 29 easy mazes and one very very hard maze. The 30th maze will still run serially (or very nearly) with everything else). If your mazes increase in complexity by number try spawning them in the order 30 to 1 by -1 so that the biggest tasks will go first. Think of it as a braindead solution to the knapsack problem.