Do all ISAs and their implementations require word alignment?

Does Intel's IA-32 require word alignment? If not, which ISAs and chips do require word alignment?

No. Intel x86-based chips accept misaligned memory references. SPARC-based chips don't. There are other ISAs that also strictly refuse misaligned memory references.
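To make this concrete, here is a minimal C sketch (the buffer contents are just illustrative): casting an unaligned address to a wider pointer type is undefined behavior in C and can raise a bus error on strict-alignment ISAs such as SPARC, while copying through memcpy into an aligned object works everywhere.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    unsigned char buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Misaligned load: buf + 1 is not 4-byte aligned.  On x86 this usually
       "just works"; on strict-alignment ISAs it can raise a bus error, and
       it is undefined behavior in C either way. */
    /* uint32_t bad = *(uint32_t *)(buf + 1); */

    /* Portable alternative: let memcpy move the bytes into an aligned
       object; compilers turn this into a single load where that is safe. */
    uint32_t ok;
    memcpy(&ok, buf + 1, sizeof ok);
    printf("%u\n", ok);
    return 0;
}
```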

Related

Why does SIMD have single data instructions when it's called SIMD?

I've been wondering: it's called SIMD, as in single instruction, multiple data. So why does it have single-data instructions?
For example, vaddss is the single-data equivalent of the multiple-data vaddps. Just about every SIMD instruction has a single-data version.
Why?
Why does SIMD have single data instructions when it's called SIMD?
It isn't a SIMD instruction in that sense
vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)
Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)
With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (a factor of ~100) as the CPU takes a microcode assist to handle subnormal input or output in many cases (or, when SSE1 was new in Pentium III, probably in all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormals are zero) like gcc -ffast-math does.
For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.
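To make the scalar/packed distinction concrete, here is a minimal sketch using SSE intrinsics (assuming a compiler that provides immintrin.h and targets SSE): _mm_add_ss (addss) only adds the low elements and copies the rest from the first operand, while _mm_add_ps (addps) adds all four lanes, including whatever happens to sit in the upper elements.

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_setr_ps(1.0f, 100.0f, 200.0f, 300.0f);
    __m128 b = _mm_setr_ps(2.0f,  -1.0f,  -1.0f,  -1.0f);

    __m128 scalar = _mm_add_ss(a, b);  /* addss: low lane = 3.0, upper lanes copied from a */
    __m128 packed = _mm_add_ps(a, b);  /* addps: all four lanes added, upper lanes participate */

    float s[4], p[4];
    _mm_storeu_ps(s, scalar);
    _mm_storeu_ps(p, packed);
    printf("addss: %g %g %g %g\n", s[0], s[1], s[2], s[3]);
    printf("addps: %g %g %g %g\n", p[0], p[1], p[2], p[3]);
    return 0;
}
```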
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss, cvtsi2ss, and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? It was a short-sighted design decision to make them single-uop on Pentium 3, which Intel unfortunately followed in SSE2 for double precision, and stuck with for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround; see my answer on the linked duplicate.
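The merging behavior is visible at the intrinsics level too. A minimal sketch (assuming SSE) of the usual workaround: pass a freshly zeroed register as the merge target, which compilers typically materialize as an xorps zeroing idiom that breaks any dependency on the register's old value.

```c
#include <immintrin.h>

/* int -> float conversion via cvtsi2ss.  The instruction merges its result
   into the destination register, so which register we merge into matters
   for dependency chains. */
static inline float int_to_float(int i)
{
    /* Using a zeroed vector as the merge target; compilers typically emit
       a xorps zeroing idiom here, which is dependency-breaking. */
    __m128 v = _mm_cvtsi32_ss(_mm_setzero_ps(), i);
    return _mm_cvtss_f32(v);
}
```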
It's normal for scalar FP to share registers with SIMD
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; some ISAs (like ARM) that didn't have a weirdo design like x87 didn't need new architectural state to introduce SIMD. For example, ARM's NEON q0..q15 16-byte registers map to pairs of the d0..d31 double-precision FP registers that already existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).
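As a rough illustration (not taken from the actual RAID code), kernel-side SIMD use is bracketed like this; xor_blocks_sse() is a hypothetical helper standing in for the real routines, and the header path assumes a reasonably recent kernel.

```c
#include <linux/types.h>
#include <asm/fpu/api.h>     /* kernel_fpu_begin() / kernel_fpu_end() */

/* Hypothetical helper that clobbers XMM registers (e.g. via inline asm). */
void xor_blocks_sse(void *dst, const void *src, size_t len);

void xor_one_stripe(void *dst, const void *src, size_t len)
{
    /* Save the FPU/SIMD state and disable preemption so the XMM registers
       are safe to use inside the kernel. */
    kernel_fpu_begin();
    xor_blocks_sse(dst, src, len);
    kernel_fpu_end();
}
```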

Microcode terminology: are there names for different "styles" of microcode?

I've been looking at microcode and wondered about terminology.
The "classic" use of microcode is to replace the processor control logic with microcode to generate the processor control signals. But there are some systems that go much further and implement low-level parts of the operating system in microcode, most famously the Xerox Alto, but also systems like the Datapoint 6600 and to a smaller extent the IBM 360. In these systems, executing instructions is just one task for the microcode, rather than the point of the microcode. Is there a word for this style of microcode? "Microprogrammed" almost fits, but is used for microcode programming in general.
The second dimension I'm wondering about: in some systems the microarchitecture is pretty much the same as the programmer-level architecture, maybe with a few extra internal registers, for example the 68000. But in other systems, the visible architecture is essentially unrecognizable in the microarchitecture. For example, the different IBM 360 models have completely different microarchitectures but identical programmer-level architectures. My second question is whether there is a term to describe systems where the microarchitecture is completely different from the visible architecture.
(I know about vertical vs. horizontal microcode, but this is different. Also, the examples I use are old, but this isn't a retrocomputing question.)
Maurice Wilkes' original microcode paper doesn't mention horizontal vs vertical. But according to this taxonomy,
a horizontal microinstruction controls multiple resources in one cycle
a vertical microinstruction controls a single resource
There are other microcode features, such as being writable; these don't change the microinstruction encoding.
Horizontal vs vertical microcode is a spectrum rather than a dichotomy. A strictly horizontal microinstruction would consist solely of control bits and fields. Such a pure horizontal microinstruction for any real architecture would be very wide since there are a lot of functions to control in a complex processor. Moreover, these control bits would be quite sparse. The resulting microstore would be large and expensive and not necessarily fast.
Instead, modern microarchitectures like the P6 use opcodes. An opcode decoder is a combinational circuit that takes opcode bits and emits control values. This costs some gate delay but provides significant width compression, allowing a much smaller microstore. A vertical microarchitecture simply takes this to an extreme: each opcode controls a single resource.
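A purely illustrative sketch of the difference, with made-up field names and widths (no real machine is being described): the horizontal format dedicates bits to every control point and is therefore wide and sparse, while the vertical format packs one operation per microinstruction and relies on a decoder to expand it into control signals.

```c
#include <stdint.h>

/* Hypothetical horizontal microinstruction: one field per control point,
   so the word is wide and mostly zero on any given cycle. */
struct horizontal_uinst {
    uint64_t alu_op       : 4;   /* ALU function select lines */
    uint64_t reg_read_a   : 5;   /* register-file read port A select */
    uint64_t reg_read_b   : 5;   /* register-file read port B select */
    uint64_t reg_write    : 5;   /* register-file write port select */
    uint64_t mem_read     : 1;
    uint64_t mem_write    : 1;
    uint64_t branch_cond  : 3;
    uint64_t next_address : 12;  /* explicit microprogram sequencing */
};

/* Hypothetical vertical microinstruction: a small opcode names a single
   operation; a combinational decoder expands it into the control signals. */
struct vertical_uinst {
    uint16_t opcode  : 6;   /* e.g. "ALU add", "read register", "write memory" */
    uint16_t operand : 10;  /* register number, immediate, or micro-address */
};
```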
Writing complex instructions and low-level OS components in microcode was actually efficient in the 1960s, and that led to CISC ISAs. However, when VLSI, caches, and superscalar designs came along, this design decision was revisited, which gave rise to RISC ISAs. But again, this historical progression of ISAs doesn't change the taxonomy of microcode.

Do modern CPUs have compression instructions?

I have been curious about this for a while, since compression is used in just about everything.
Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?
They don’t have general-purpose compression instructions.
AES operates on very small data blocks: it accepts two 128-bit inputs, does some non-trivial computation on them, and produces a single 128-bit output. A dedicated instruction to speed up that computation helps a lot.
On modern hardware, lossless compression speed is often limited by RAM latency. A dedicated instruction can't improve that; bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches, and they work well enough for compression already.
If you need to compress many gigabits per second, there are several standalone accelerators, but these are not part of the processor; they are usually separate chips connected over PCIe. And they are very niche products, because most users just don't need to compress that much data that fast.
However, modern CPUs have a lot of stuff for lossy multimedia compression.
Most of them have multiple vector instruction set extensions (MMX, SSE, AVX), and some of these instructions help a lot for, e.g., video compression. For example, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), and _mm256_sad_epu8 (AVX2) are very helpful for estimating compression errors of 8x8 blocks of 8-bit pixels. The AVX2 version processes 4 rows of the block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
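As a minimal sketch of how that is used (assuming AVX2, and for simplicity that each 8x8 block is packed contiguously as 64 bytes rather than strided through a frame):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sum of absolute differences between two 8x8 blocks of 8-bit pixels.
   Requires AVX2 (compile with e.g. -mavx2). */
static uint32_t sad_8x8(const uint8_t a[64], const uint8_t b[64])
{
    __m256i va0 = _mm256_loadu_si256((const __m256i *)(a));       /* rows 0..3 */
    __m256i vb0 = _mm256_loadu_si256((const __m256i *)(b));
    __m256i va1 = _mm256_loadu_si256((const __m256i *)(a + 32));  /* rows 4..7 */
    __m256i vb1 = _mm256_loadu_si256((const __m256i *)(b + 32));

    /* Each vpsadbw produces four 64-bit partial sums, one per 8-byte row. */
    __m256i sad = _mm256_add_epi64(_mm256_sad_epu8(va0, vb0),
                                   _mm256_sad_epu8(va1, vb1));

    /* Horizontal reduction of the four 64-bit lanes. */
    __m128i sum128 = _mm_add_epi64(_mm256_castsi256_si128(sad),
                                   _mm256_extracti128_si256(sad, 1));
    return (uint32_t)(_mm_cvtsi128_si64(sum128) + _mm_extract_epi64(sum128, 1));
}
```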
Finally, many CPUs have integrated GPUs which include specialized silicon for hardware video encoding and decoding, usually H.264, and on newer ones also H.265. Here's a table for Intel GPUs; AMD has separate names for the encoding and decoding parts. That silicon is even more power-efficient than SIMD instructions in the cores.
Many applications in all kinds of domains certainly can benefit from and do use data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to the hardware support for other popular functions such as encryption/decryption, various mathematical transformations, and bit counting. However, compression and decompression typically operate on large amounts of data (many MBs or more), and different algorithms exhibit different memory access patterns that are potentially either unfriendly to traditional memory hierarchies or even adversely impacted by them. In addition, because it operates on large amounts of data, compression or decompression implemented directly in the main CPU pipeline would keep the CPU almost fully busy for long periods of time. Encryption, on the other hand, typically works on small amounts of data at a time, so it makes sense to have hardware support for it directly in the CPU.
It is precisely for these reasons that hardware compression/decompression engines (accelerators) have been implemented, either as ASICs or on FPGAs, by many companies as coprocessors (on-die, on-package, or external) or expansion cards (connected through PCIe/NVMe), including:
Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic proposals.
That said, it is possible to achieve very high throughputs on a single modern x86 core. Intel published a paper in 2010 discussing the results of an implementation of the DEFLATE decompression algorithm called igunzip. They used a single Nehalem-based physical core and experimented with one and with two logical cores, achieving impressive decompression throughputs of more than 2 Gbit/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can perform about 10 times faster.
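PCLMULQDQ is exposed to C as _mm_clmulepi64_si128. Here is a minimal sketch of calling it (this does not reproduce the paper's CRC folding scheme, only the carry-less multiply primitive; compile with something like gcc -mpclmul):

```c
#include <wmmintrin.h>   /* _mm_clmulepi64_si128 */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Carry-less (GF(2)) multiplication of two 64-bit values into a
       128-bit product.  CRC implementations "fold" message chunks with
       precomputed constants this way; here the operands are arbitrary. */
    __m128i a = _mm_set_epi64x(0, 0xdeadbeefULL);
    __m128i b = _mm_set_epi64x(0, 0x101ULL);

    /* imm8 = 0x00 selects the low 64-bit half of each operand. */
    __m128i prod = _mm_clmulepi64_si128(a, b, 0x00);

    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, prod);
    printf("0x%016llx%016llx\n",
           (unsigned long long)out[1], (unsigned long long)out[0]);
    return 0;
}
```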
Intel has a number of related patents:
Apparatus for Hardware Implementation of Lossless Data Compression.
Hardware apparatuses and methods for data decompression.
Systems, Methods, and Apparatuses for Decompression using Hardware and Software.
Systems, methods, and apparatuses for compression using hardware and software.
It is hard to determine, though, which Intel products employ the techniques or designs proposed in these patents.

A method of checking my CPU's bitness

My OS is Windows 10 x86_64.
I had checked that my CPU supports arm64, so I knew it was a 64-bit CPU.
But sometimes I get error messages about the OS bitness.
So I did a CPU bitness test in C:
printf("%zu", sizeof(int*));
I expected the result to be 8, but the result was 4.
1. Is my CPU 32-bit or 64-bit?
2. If my CPU is 32-bit, can it use more than 4 GB of memory? My CPU supports arm64.
I'm very confused.
Your CPU almost certainly can't be both arm64 and run x86_64 Windows, because the Intel and ARM instruction sets are not the same. Perhaps you meant AMD64? If you search the web for your CPU model, you will probably be able to find out whether it is 32-bit or 64-bit.
Further, keep in mind that the C standard only requires that int be at least 16 bits, not the machine's native word size. I suspect that the compiler you were testing with might not have been targeting the 64-bit capabilities of your CPU, and compiled your code as though your CPU were a 32-bit CPU.
As for memory support: as far as I know, the motherboard and CPU model determine the actual amount of memory your system will support.
Most likely your CPU supports amd64.
The size of the C standard types depends on the data model.
The size of a pointer depends on the execution mode (long-mode vs compatibility-mode) and can be 32-bit even on 64-bit OSes.
If your CPU were 32-bit, you could still use more than 4 GiB of memory, but since that premise is almost surely false, the easiest solution is simply to recompile for a 64-bit environment.
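A small sketch of a more informative check: the pointer-size test only reports what the program was compiled for, not what the CPU supports, and the compiler's predefined macros make that explicit (the macro names below cover common Windows/GCC/Clang targets).

```c
#include <stdio.h>

int main(void)
{
    /* This reports the target the *program* was built for, not what the
       CPU is capable of.  A 32-bit build running under WOW64 on 64-bit
       Windows still prints 4 here. */
    printf("sizeof(void *) = %zu\n", sizeof(void *));

#if defined(_WIN64) || defined(__x86_64__) || defined(__aarch64__)
    puts("compiled as a 64-bit program");
#else
    puts("compiled as a 32-bit program");
#endif
    return 0;
}
```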

64-bit Advantages for Discrete Event Simulation

As I understand it, Intel 64-bit CPUs offer the ability to address a larger address space (>4 GB), which is useful for a large simulation. Interesting architectural hardware advantages:
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
BACKGROUND
Historically, the simulations have been performed on 32-bit IA (Intel Architecture) systems. I am wondering where (if anywhere) there is an opportunity to reduce simulation times with 64-bit CPUs; I expect that the software would need to be recompiled to take advantage of 64-bit capability. This type of simulation would not benefit from a MAC (multiply-accumulate), nor does it use floating-point calculations.
QUESTION
That being said, is there an Intel 64-bit instruction or capability that offers an appreciable advantage over the 32-bit instruction set and would accelerate simulation (computationally intensive and lengthy 32-bit algorithms)?
If you have experience implementing simulations and have transitioned from 32- to 64-bit CPUs, please state this in your response (relevant experience is important). I look forward to insightful responses from the community.
The most immediate computational benefit regarding CPU instructions that I can think of would be AVX, although this is only loosely related to x86_64; it is more of a CPU-generational issue.
In our company, we have developed multiple highly complex discrete event simulations, simulating aircraft (including electrics, hydraulics, avionics software, and everything related). They are all built with or ported to x86_64. The reasons are mostly memory addressing, allowing for larger caches and a wider choice of algorithms (e.g. data-centric design, concurrency); graphics content also tends to be huge nowadays. However, optimizations regarding x86_64 instructions themselves, such as AVX, are left to compilers. I never saw code written in assembler or using compiler intrinsics to refer to specific x86_64 instructions explicitly.
To summarize, based on my experience, x86_64 CPUs allow for certain optimizations, often sacrificing memory consumption in favor of CPU processing:
Wider choice of algorithms, especially regarding concurrency, where data may need to be laid out in a way favoring parallel processing at the cost of occupied memory
Intermediate results or other processing output may be cached more easily in memory to avoid recomputation or to optimize for temporal or state-related coherence
AVX instructions may help compilers to vectorize more code than with MMX/SSE (see the sketch below)
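As an illustration of that last point, a loop like the following (a toy integer accumulation, not taken from any real simulator) is the kind of code gcc or clang can auto-vectorize when AVX2 is enabled (e.g. -O3 -mavx2), without any intrinsics or assembly in the source:

```c
#include <stddef.h>

/* Simple element-wise accumulation over simulation state arrays.  With
   -O3 -mavx2, the compiler can turn this scalar loop into 256-bit vector
   adds; the restrict qualifiers tell it the arrays don't overlap. */
void accumulate(int *restrict totals, const int *restrict deltas, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        totals[i] += deltas[i];
}
```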