64-bit Advantages for Discrete Event Simulation - simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4GB), which is useful for a large simulation. Interesting architectural hardware advantages::
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering if where (if any) is opportunity to reduce simulation times with 64-bit CPUs: I expect that software should be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply and accumulate) nor does it use floating point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instructions set that would accelerate simulation (computationally intensive and lengthy 32-BIT algorithms)?
If you have experience implementing simulations and have transitioned from 32 to 64 bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community

The most immediate computational benefits to expect regarding CPU instructions I can think of would be AVX although this is only loosely related to x86_64, but more of an CPU-generational issue.
In our company, we developed multiple, highly-complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software and everything related). They are all built with or ported to x86_64. The reasons are mostly due to memory addressing, allowing for larger caches and wider choice of algorithms (e.g. data-centric design, concurrency), graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to actually refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE

Related

Does memory copying on APUs (e.g. apple m1 mac) use GPU-specific wide vector instructions?

I was reading this article Why mmap is faster than system calls, where the main difference appeared to be mmap's ability to use vector instructions like AVX-2, something system calls can't.
I understand that the SIMD instructions used by GPUs tend to be much wider. A Nvidia warp of size 32 operating on float32 = 1024 bits (?) vs 256 bits of AVX-2. So potentially a 4x speedup. I guess this is not used in traditional discrete gpu settings as host-to-device (and back) copy would outweigh any benefit from wide registers.
However in APUs, GPU shares memory with CPU, eliminating the need for these expensive copies. I was wondering if those GPU instructions can therefore be used to accelerate mmap like vector operations further (numpy is another example). Has it already been done (in M1 mac or any CPUs with integrated graphics)? or can you please detail the architectural issues that prevent this?
You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmaped region). And separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.
For the 2nd part Apple has said the answer is yes for MacOS on M1 thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion IIRC?), but IDK if software ever took full advantage.
For the first part; doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not CPU-core <-> L1d cache bandwidth (which scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses that a core can track, not by doing the copy with fewer instructions. That mostly helps in terms of seeing farther with the same size out-of-order execution window, to start page walks sooner across page boundaries and stuff like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in L1d cache of the CPU. And/or not be able to take advantage of the source or destination already being hot in L1d cache of a CPU working with the data.
Also, setup overhead (to communicate with the GPU, going outside the current core) would dominate any faster copying for small copies.

What is the point of on-chip hardware accelerators, instead of that functionality being added as an instruction to the ISA?

I get that if a specialized operation is known to be common, it makes sense to do it in hardware. But at that point, why not make it a part of the ISA so it can be even faster?
Is there a benefit to making it a co-processor that communicates through shared memory?
This is a bit hand-wavy because I don't actually design hardware, but I think I know enough to say something that's at least plausible.
Adding it to the ISA means it has to be fairly tightly coupled to the pipeline, which doesn't fit well for things like integrated GPUs that have some specialized hardware and can filter out which pixels even need to be processed using dedicated hardware instead of software branching.
Even considering less complicated accelerators (e.g. for crypto):
Especially on simpler CPUs without out-of-order exec and large reordering windows, high-latency HW accelerators could stall the pipeline and stop it from getting other work done while waiting for a result.
Intel does tend to add things to the ISA, such as AES and SHA, because mainstream x86 CPUs do have the instruction throughput and vector registers to feed data to execution units that do one round of AES, for example.
If the accelerator is physically large but usually not needed by multiple cores at once, having groups of cores share one is more natural with some kind of co-processor arrangement to insulate the core from the round-trip latency of going off-core to compute something.
Also for GPUs, a GPU has more computational throughput than you can fit down the superscalar pipeline of a normal CPU. The FLOPS of an integrated GPU is typically much greater than a single core of a modern Intel CPU, even with 2x 256-bit FMA units. So you'd need to have a CPU instruction like "run shader" that runs a GPU program using its own separately-programmable machine code. GPU instruction scheduling is lighter weight than even a normal in-order CPU.

Do modern CPU's have compression instructions

I have been curious about this for awhile since compression is used in about everything.
Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?
They don’t have general-purpose compression instructions.
AES operates on very small data blocks, it accepts two 128 bit inputs, does some non-trivial computations on them, produces single 128 bit output. A dedicated instruction to speed up computation helps a lot.
On modern hardware, lossless compression speed is often limited by RAM latency. Dedicated instruction can’t improve speed, bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches. They work good enough for compression already.
If you need to compress many gigabits/second, there’re several standalone accelerators, but these are not parts of processors, usually standalone chips connected to PCIx. And they are very niche products because most users just don't need to compress that much data that fast.
However, modern CPUs have a lot of stuff for lossy multimedia compression.
Most of them have multiple vector instruction set extensions (mmx, sse, avx), and some of these instructions help a lot for e.g. video compression use case. For example, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), _mm256_sad_epu8 (AVX2) are very helpful for estimating compression errors of 8x8 blocks of 8 bit pixels. The AVX2 version processes 4 rows of the block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
Finally, many CPUs have integrated GPUs which include specialized silicon for hardware video encoding and decoding, usually h.264, newer ones also h.265. Here's a table for Intel GPUs, AMD has separate names for encoding and decoding parts. That silicon is even more power efficient than SIMD instructions in the cores.
Many applications in all kinds of domains certainly can benefit from and do use data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to having hardware support for other popular functions such as encryption/decryption, various mathematical transformations, bit counting, and others. However, compression/decompression typically operate on large amounts of data (many MBs or more) and different algorithms exhibit different memory access patterns that are potentially either not friendly to traditional memory hierarchies or even adversely impacted by them. In addition, as a result of operating on large amounts of data and if implemented directly in the main CPU pipeline, the CPU would almost be fully busy for long periods of time doing compression or decompression. On the other hand, consider encryption for example, encrypting small amounts of data is typical, and so it would make sense to have hardware support for encryption directly in the CPU.
It is precisely for these reasons why hardware compression/decompression engines (accelerators) have been implemented either as ASICs or on FPGAs by many companies as coprocessors (on-die, on-package, or external) or expansion cards (connected through PCIe/NVMe) including:
Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic porposals.
That said, it is possible to achieve very high throughputs on a single modern x86 core. Intel published a paper in 2010 in which it discusses the results of an implementation, called igunzip, of the DEFLATE decompression algorithm. They used a single Nehalem-based physical core and experimented with using a single logical core and two logical cores. They achieve impressive decompression throughputs of more than 2 Gbits/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can perform about 10 times faster.
Intel has a number of related patents:
Apparatus for Hardware Implementation of Lossless Data Compression.
Hardware apparatuses and methods for data decompression.
Systems, Methods, and Apparatuses for Decompression using Hardware and Software.
Systems, methods, and apparatuses for compression using hardware and software.
Although it's hard to determine which Intel products employed the techniques or designs proposed in these patents.

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:
bench
3.6424 0.5267 0.2114 5.0303 1.5557 3.4980
[columns = LU, FFT, ODE, Sparse, 2-D, 3-D]
Note that it is particularly slow for LU and Sparse.
The server has this hardware:
CPU: Intel Xeon E7320 # 2.13GHz (4 physical processors, 16 logical)
128 GB RAM
64-bit operating system
Matlab Version: 9.2.0.556344 (R2017a)
Java version: Java 1.7.0_60-b19 with Oracle corporation Java Hotspot(TM) 64-Bit Server VM mixed mode.
There are also other users that can be online on the server but I can see that they are not stressing the system and have verified that these running times are stable (have tested multiple times the past week.
My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.
Alternative explanation I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to also have worked on such a server in Matlab, link https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that it's a similar issue with the Intel Xeon E7320?
Edit: Xeon E7320 as suggested by Peter.
Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).
As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmarks, but only 2 or 1.5x faster on some other components. (But I'm not sure if the version of Matlab was controlled for; that thread you linked mentioned that different version of Matlab do a different amount of work in their bench).
The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.
I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).
You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a memory controller built-in, so all 4 sockets are sharing the memory controller hub (MCH) via separate 1066MHz Front-Side Buses. Memory bandwidth doesn't scale with number of sockets, and is not great. Memory bandwidth has grown faster than per-core performance over the years. According to that link, a quad-socket 16-core Tigerton (like you have) is barely better than a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.
As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.
(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)
A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).
For FP-specific stuff that optimized libraries will take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a mul+add in one instruction, potentially doubling FLOPS. Microarchitectural improvements are important, too: Haswell has two FMA units vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only contiguous memory and high computation vs. memory workloads will fully take advantage of this, e.g. a dense matmul, but in that case one Haswell core is doing as much work as 8 Tigerton cores.
I assume Matlab can take advantage of AVX + FMA if the CPU has it.
And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad core" chips are actually two separate dual-core dies in the same package, according to wikipedia.
So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).
Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" part of this answer about optimizing memcpy/memset and how store bandwidth works in Intel CPUs.

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far I learned that there the MIPS is highly dependent upon the application being run, or the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetical benchmark (i.e. it is not based on a useful program) and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real world programs. If you know these benchmarks and your own applications very well, you can make resonable predictions of performance without actually running your application on the CPU in question.
They key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard codes knowledge about the Board Game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use hughe amounts of memory that does not fit into any cache. The performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS # 2.5 GHz. However, real numbers can be easily ten or even a hundred times lower.