Do modern CPU's have compression instructions - cpu-architecture

I have been curious about this for awhile since compression is used in about everything.
Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?

They don’t have general-purpose compression instructions.
AES operates on very small data blocks, it accepts two 128 bit inputs, does some non-trivial computations on them, produces single 128 bit output. A dedicated instruction to speed up computation helps a lot.
On modern hardware, lossless compression speed is often limited by RAM latency. Dedicated instruction can’t improve speed, bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches. They work good enough for compression already.
If you need to compress many gigabits/second, there’re several standalone accelerators, but these are not parts of processors, usually standalone chips connected to PCIx. And they are very niche products because most users just don't need to compress that much data that fast.
However, modern CPUs have a lot of stuff for lossy multimedia compression.
Most of them have multiple vector instruction set extensions (mmx, sse, avx), and some of these instructions help a lot for e.g. video compression use case. For example, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), _mm256_sad_epu8 (AVX2) are very helpful for estimating compression errors of 8x8 blocks of 8 bit pixels. The AVX2 version processes 4 rows of the block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
Finally, many CPUs have integrated GPUs which include specialized silicon for hardware video encoding and decoding, usually h.264, newer ones also h.265. Here's a table for Intel GPUs, AMD has separate names for encoding and decoding parts. That silicon is even more power efficient than SIMD instructions in the cores.

Many applications in all kinds of domains certainly can benefit from and do use data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to having hardware support for other popular functions such as encryption/decryption, various mathematical transformations, bit counting, and others. However, compression/decompression typically operate on large amounts of data (many MBs or more) and different algorithms exhibit different memory access patterns that are potentially either not friendly to traditional memory hierarchies or even adversely impacted by them. In addition, as a result of operating on large amounts of data and if implemented directly in the main CPU pipeline, the CPU would almost be fully busy for long periods of time doing compression or decompression. On the other hand, consider encryption for example, encrypting small amounts of data is typical, and so it would make sense to have hardware support for encryption directly in the CPU.
It is precisely for these reasons why hardware compression/decompression engines (accelerators) have been implemented either as ASICs or on FPGAs by many companies as coprocessors (on-die, on-package, or external) or expansion cards (connected through PCIe/NVMe) including:
Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic porposals.
That said, it is possible to achieve very high throughputs on a single modern x86 core. Intel published a paper in 2010 in which it discusses the results of an implementation, called igunzip, of the DEFLATE decompression algorithm. They used a single Nehalem-based physical core and experimented with using a single logical core and two logical cores. They achieve impressive decompression throughputs of more than 2 Gbits/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can perform about 10 times faster.
Intel has a number of related patents:
Apparatus for Hardware Implementation of Lossless Data Compression.
Hardware apparatuses and methods for data decompression.
Systems, Methods, and Apparatuses for Decompression using Hardware and Software.
Systems, methods, and apparatuses for compression using hardware and software.
Although it's hard to determine which Intel products employed the techniques or designs proposed in these patents.

Related

Does memory copying on APUs (e.g. apple m1 mac) use GPU-specific wide vector instructions?

I was reading this article Why mmap is faster than system calls, where the main difference appeared to be mmap's ability to use vector instructions like AVX-2, something system calls can't.
I understand that the SIMD instructions used by GPUs tend to be much wider. A Nvidia warp of size 32 operating on float32 = 1024 bits (?) vs 256 bits of AVX-2. So potentially a 4x speedup. I guess this is not used in traditional discrete gpu settings as host-to-device (and back) copy would outweigh any benefit from wide registers.
However in APUs, GPU shares memory with CPU, eliminating the need for these expensive copies. I was wondering if those GPU instructions can therefore be used to accelerate mmap like vector operations further (numpy is another example). Has it already been done (in M1 mac or any CPUs with integrated graphics)? or can you please detail the architectural issues that prevent this?
You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmaped region). And separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.
For the 2nd part Apple has said the answer is yes for MacOS on M1 thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion IIRC?), but IDK if software ever took full advantage.
For the first part; doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not CPU-core <-> L1d cache bandwidth (which scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses that a core can track, not by doing the copy with fewer instructions. That mostly helps in terms of seeing farther with the same size out-of-order execution window, to start page walks sooner across page boundaries and stuff like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in L1d cache of the CPU. And/or not be able to take advantage of the source or destination already being hot in L1d cache of a CPU working with the data.
Also, setup overhead (to communicate with the GPU, going outside the current core) would dominate any faster copying for small copies.

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't fit well for things like integrated GPUs that have some specialized hardware and can filter out which pixels even need to be processed using dedicated hardware instead of software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
Also for GPUs, a GPU has more computational throughput than you can fit down the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than a single core of a modern Intel CPU, even with 2x 256-bit FMA units. So you'd need to have a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than even a normal in-order CPU.

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4GB), which is useful for a large simulation. Interesting architectural hardware advantages::
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering if where (if any) is opportunity to reduce simulation times with 64-bit CPUs: I expect that software should be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate) nor does it use floating point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instructions set that would accelerate simulation (computationally intensive and lengthy 32-BIT algorithms)?
If you have experience implementing simulations and have transitioned from 32 to 64 bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community
The most immediate computational benefits to expect regarding CPU instructions I can think of would be AVX although this is only loosely related to x86_64, but more of an CPU-generational issue.
In our company, we developed multiple, highly-complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly due to memory addressing, allowing for larger caches and wider choice of algorithms (e.g. data-centric design, concurrency), graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE

Difference: LZ77 vs. LZ4 vs. LZ4HC (compression algorithms)?

I understand the LZ77 and LZ78 algorithms.
I read about LZ4 here and here and found code for it.
Those links described the LZ4 block format. But it would be great if someone could explain (or direct me to some resource explaining):
How LZ4 is different from LZ77?
How is LZ4HC different from LZ4?
What idea makes the LZ4HC algorithm so fast?
LZ4 is built to compress fast, at hundreds of MB/s per core. It's a fit for applications where you want compression that's very cheap: for example, you're trying to make a network or on-disk format more compact but can't afford to spend a bunch of CPU time on compression. It's in a family with, for example, snappy and LZO.
The natural comparison point is zlib's DEFLATE algorithm, which uses LZ77 and Huffman coding and is used in gzip, the .ZIP and .PNG formats, and too many other places to count.
These fast compressors differ because:
They use repetition-detection code that's faster (often a simple hashtable with no collision detection) but doesn't search through multiple possible matches for the best one (which would take time but result in higher compression), and can't find some short matches.
They only try to compress repetitions in input--they don't try to take advantage of some bytes being more likely than others outside of repetitions.
Closely related to 2, they generate bytes of output at a time, not bits; allowing fraction-of-a-byte codes would allow for more compression sometimes, but would require more CPU work (often bit-shifting and masking and branching) to encode and decode.
Lots of practical work has gone into making their implementations fast on modern CPUs.
By comparison, DEFLATE gets better compression but compresses and decompresses slower, and high-compression algorithms like LZMA, bzip2, LZHAM, or brotli tend to take even more time (though Brotli at its faster settings can compete with zlib). There's a lot of variation among the high-compression algorithms, but broadly, they tend to capture redundancies over longer distances, take more advantage of context to determine what bytes are likely, and use more compact but slower ways to express their results in bits.
LZ4HC is a "high-compression" variant of LZ4 that, I believe, changes point 1 above--the compressor finds more than one match between current and past data and looks for the best match to ensure the output is small. This improves compression ratio but lowers compression speed compared to LZ4. Decompression speed isn't hurt, though, so if you compress once and decompress many times and mostly want extremely cheap decompression, LZ4HC would make sense.
Note that even a fast compressor might not allow one core to saturate a large amount of bandwidth, like that provided by SSDs or fast in-datacenter links. There are even quicker compressors with lower ratios, sometimes used to temporarily pack data in RAM. WKdm and Density are two such compressors; one trait they share is acting on 4-byte machine words of input at a time rather than individual bytes. Sometimes specialized hardware can achieve very fast compression, like in Samsung's Exynos chips or Intel's QuickAssist technology.
If you're interested in compressing more than LZ4 but with less CPU time than deflate, the author of LZ4 (Yann Collet) wrote a library called Zstd--here's a blog post from Facebook at its stable release, background on the finite state machines used for compactly encoding info on repetitions, and a detailed description in an RFC. Its fast modes could work in some LZ4-ish use cases. (Also, Apple developed lzfse on similar principles, and Google developed gipfeli as a 'medium' packer. Neither seemed to get much use in the outside world.) Also, a couple projects aim to provide faster/lighter DEFLATE: SLZ, patches to zlib by CloudFlare and Intel. (There's also been work on fast decompression on large modern CPU cores.)
Compared to the fastest compressors, those "medium" packers add a form of entropy encoding, which is to say they take advantage of how some bytes are more common than others and (in effect) put fewer bits in the output for the more common byte values.
If you're compressing one long stream, and going faster using more cores is potentially helpful, parallel compression is available for gzip through pigz and the zstd through the command-line tool's -T option (and in the library). (There are various experimental packers out there too, but they exist more to push boundaries on speed or density, rather than for use today.)
So, in general, you have a pretty good spectrum of alternative compressors for different apps:
For very fast compression: LZ4, zstd's lowest settings, or even weaker memory compressors
For balanced compression: DEFLATE is the old standard; Zstd and brotli on low-to-medium settings are good alternatives for new uses
For high compression: brotli, or Zstd on high settings
For very high compression (like static content that's compressed once and served many times): brotli
As you move from LZ4 through DEFLATE to brotli you layer on more effort to predict and encode data and get more compression out at the cost of some speed.
As an aside, algorithms like brotli and zstd can generally outperform gzip--compress better at a given speed, or get the same compression faster--but this isn't actually because zlib did anything wrong. The main secret is probably that newer algos can use more memory: zlib dates to 1995 (and DEFLATE to 1993). Back then RAM cost >3,000x as much as today, so just keeping 32KB of history made sense. Changes in CPUs over time may also be a factor: tons of of arithmetic (as used in finite state machines) is relatively cheaper than it used to be, and unpredictable ifs (branches) are relatively more expensive.

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far I learned that there the MIPS is highly dependent upon the application being run, or the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetical benchmark (i.e. it is not based on a useful program) and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real world programs. If you know these benchmarks and your own applications very well, you can make resonable predictions of performance without actually running your application on the CPU in question.
They key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard codes knowledge about the Board Game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use hughe amounts of memory that does not fit into any cache. The performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS # 2.5 GHz. However, real numbers can be easily ten or even a hundred times lower.