Why does CMSIS oblige a maximum of 64 priorities? - real-time

I am trying to implement CMSIS RTOS on my project using ThreadX. However, I found in the file cmsis_os2.c that it is obligatory to have a maximum priority count of 64. I would like to keep it at 32 (RAM optimisation), so does anyone have an explanation of why I should use 64 and not 32? And is it a problem to use 32 and simply modify the CMSIS file? This is the code I found:
/* Ensure the maximum number of priorities is modified by the user to 64. */
#if(TX_MAX_PRIORITIES != 64)
#error "CMSIS RTOS ThreadX Wrapper: TX_MAX_PRIORITIES must be fixed to 64 in tx_user.h file"
#endif

CMSIS enumerates the priority with the enum osPriority_t. This is a bad idea in my opinion, as it rather constrains the implementation, and it would be hard to change without breaking the abstraction.
In ThreadX, having 64 rather than 32 priorities carries a 128 byte overhead (so not much of a RAM optimisation). If that is really a problem, then you could, in the porting layer, map the 64 CMSIS priorities to 32 levels simply by dividing the priority by 2 when creating the task with the native API. That might however modify the scheduling, because tasks at priorities Nx2 and (Nx2)+1 would both map to the same priority N.
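As an illustration only, here is a minimal sketch of that divide-by-two mapping in the porting layer. The function name is made up, and the inversion via osPriorityISR is an assumption about how a wrapper might convert CMSIS priorities (where larger values mean higher priority) to ThreadX priorities (where 0 is the highest):

#include "tx_api.h"
#include "cmsis_os2.h"

/* Hypothetical helper: fold 64 CMSIS levels onto 32 native ThreadX levels.
 * CMSIS: larger value = higher priority. ThreadX: 0 = highest priority,
 * so the scale is inverted before halving. Priorities 2N and 2N+1 collapse
 * onto the same native level. */
static UINT map_cmsis_priority_to_tx32(osPriority_t prio)
{
    if (prio < osPriorityLow || prio > osPriorityRealtime7) {
        prio = osPriorityNormal;                   /* clamp out-of-range requests */
    }
    UINT inverted = (UINT)(osPriorityISR - prio);  /* 1 (highest) .. 48 (lowest)   */
    return inverted / 2u;                          /* fits within 0 .. 31          */
}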
Another issue with changing the number of priorities is that porting your code to a different CMSIS RTOS2 implementation could change the scheduling behaviour, which rather defeats the object of the abstraction.
You have to take care with CMSIS RTOS2 priorities because in practice only the 48 levels from 8 to 55 are normally used for user tasks, as can be seen from the enumeration, with 0, 1 and 56 reserved and 2 to 7 given no enumeration. How those map to native priorities is implementation dependent, and if you were to change the implementation you would still have to account for the reserved values. It is therefore not advisable to simply pass integer priorities without ensuring that they are in the range osPriorityLow to osPriorityRealtime7. It is on the whole not a perfect abstraction.
ThreadX is perhaps unusual in having this overhead related to the number of priority levels. It is also unusual in having a configurable number of priority levels at all; in many RTOSes the priority is simply an 8 or 16 bit integer.

This is a CMSIS issue, not an Azure RTOS issue. You'll have to ask the CMSIS folks.

Related

Why is the number of registers restricted to 32?

I wonder why the number of registers must be only 32.
I vaguely know the reason, but I want to know it more exactly.
Let's look at what we have with 32 general purpose integer-oriented registers, and what would happen if we went to 256 registers:
Diminishing returns
Normal compiled code demonstrates that with 32 registers, most functions leave some of the registers unused.  So, adding more registers than 32 doesn't help most code.
Encoding size
On a register machine, binary operators like addition, subtraction, comparison and others require three operands: left source, right source, and target.  On a RISC machine, each of these uses a register operand, so that means 3 register operands in one instruction.  This means that 3 x 5 bits = 15 bits are used in such an instruction on a machine with 32 registers.
If we were to increase the number of registers to, say 256, then we would need 8 bits for each register operand.  That would mean 3 x 8 bits = 24 bits.  Instructions become larger, and this decreases the efficiency of the instruction cache — a critical component to performance.
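To make the arithmetic concrete, here is a small illustrative C snippet (not from the original answer) that computes the bit budget for three register fields in a 32-bit instruction word as the register count grows; the 32-registers-times-5-bits case matches classic RISCs:

#include <stdio.h>

int main(void)
{
    const int instr_bits = 32;
    const int fields = 3;                       /* two sources + one destination */

    for (int regs = 32; regs <= 256; regs *= 2) {
        int bits_per_field = 0;
        while ((1 << bits_per_field) < regs)
            bits_per_field++;                   /* ceil(log2(regs)) */
        int used = fields * bits_per_field;
        printf("%3d registers: %d bits per field, %2d bits total, %2d bits left for opcode etc.\n",
               regs, bits_per_field, used, instr_bits - used);
    }
    return 0;
}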
Many instruction sets do have more than 32 registers
They add specialized registers, such as a whole second set for floating point, and also another set of extra wide registers for SIMD and vector operations.
In context, these additional register sets don't necessarily suffer the same code expansion as described above because these additional register sets don't intermix with each other: in other words we can have 32 integer registers and also 32 floating point registers, and still maintain 5 bit register fields in the instructions, because the instructions involved know which register set they are using and don't support mixing of the register sets in the same instruction.
Also, to be clear, many instruction sets have used different numbers of registers, many less than 32 yet some more than 32.

What is the advantage of having instructions in a uniform format?

Many processors have instructions which are of uniform format and width, such as the ARM where all instructions are 32 bits long. Other processors have instructions in multiple widths of, say, 2, 3, or 4 bytes long, such as the 8086.
What is the advantage of having all instructions the same width and in a uniform format?
What is the advantage of having instructions in multiple widths?
Fixed Length Instruction Trade-offs
The advantages of fixed length instructions with a relatively uniform formatting is that fetching and parsing the instructions is substantially simpler.
For an implementation that fetches a single instruction per cycle, a single aligned memory (cache) access of the fixed size is guaranteed to provide one (and only one) instruction, so no buffering or shifting is required. There is also no concern about crossing a cache line or page boundary within a single instruction.
The instruction pointer is incremented by a fixed amount (except when executing control flow instructions--jumps and branches) independent of the instruction type, so the location of the next sequential instruction can be available early with minimal extra work (compared to having to at least partially decode the instruction). This also makes fetching and parsing more than one instruction per cycle relatively simple.
Having a uniform format for each instruction allows trivial parsing of the instruction into its components (immediate value, opcode, source register names, destination register name). Parsing out the source register names is the most timing critical; with these in fixed positions it is possible to begin reading the register values before the type of instruction has been determined. (This register reading is speculative since the operation might not actually use the values, but this speculation does not require any special recovery in the case of mistaken speculation but does take extra energy.) In the MIPS R2000's classic 5-stage pipeline, this allowed reading of the register values to be started immediately after instruction fetch providing half of a cycle to compare register values and resolve a branch's direction; with a (filled) branch delay slot this avoided stalls without branch prediction.
(Parsing out the opcode is generally a little less timing critical than source register names, but the sooner the opcode is extracted the sooner execution can begin. Simple parsing out of the destination register name makes detecting dependencies across instructions simpler; this is perhaps mainly helpful when attempting to execute more than one instruction per cycle.)
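As a small illustration (assuming a MIPS-like R-type layout, which the answer does not spell out), fixed field positions mean the register specifiers can be pulled out with constant shifts and masks before the opcode has even been examined:

#include <stdint.h>

/* Field extraction for an assumed MIPS-style 32-bit R-type instruction:
 * opcode[31:26], rs[25:21], rt[20:16], rd[15:11]. The positions never move,
 * so register-file reads can start in parallel with opcode decode. */
static inline unsigned rs_field(uint32_t insn) { return (insn >> 21) & 0x1F; }  /* source 1     */
static inline unsigned rt_field(uint32_t insn) { return (insn >> 16) & 0x1F; }  /* source 2     */
static inline unsigned rd_field(uint32_t insn) { return (insn >> 11) & 0x1F; }  /* destination  */
static inline unsigned op_field(uint32_t insn) { return  insn >> 26; }          /* opcode       */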
In addition to providing the parsing sooner, simpler encoding makes parsing less work (energy use and transistor logic).
A minor advantage of fixed length instructions compared to typical variable length encodings is that instruction addresses (and branch offsets) use fewer bits. This has been exploited in some ISAs to provide a small amount of extra storage for mode information. (Ironically, in cases like MIPS/MIPS16, to indicate a mode with smaller or variable length instructions.)
Fixed length instruction encoding and uniform formatting do have disadvantages. The most obvious disadvantage is relatively low code density. Instruction length cannot be set according to frequency of use or how much distinct information is required. Strict uniform formatting would also tend to exclude implicit operands (though even MIPS uses an implicit destination register name for the link register) and variable-sized operands (most RISC variable length encodings have short instructions that can only access a subset of the total number of registers).
(In a RISC-oriented ISA, this has the additional minor issue of not allowing more work to be bundled into an instruction to equalize the amount of information required by the instruction.)
Fixed length instructions also make using large immediates (constant operands included in the instruction) more difficult. Classic RISCs limited immediate lengths to 16-bits. If the constant is larger, it must either be loaded as data (which means an extra load instruction with its overhead of address calculation, register use, address translation, tag check, etc.) or a second instruction must provide the rest of the constant. (MIPS provides a load high immediate instruction, partially under the assumption that large constants are mainly used to load addresses which will later be used for accessing data in memory. PowerPC provides several operations using high immediates, allowing, e.g., the addition of a 32-bit immediate in two instructions.) Using two instructions is obviously more overhead than using a single instruction (though a clever implementation could fuse the two instructions in the front-end [What Intel calls macro-op fusion]).
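A minimal sketch of that two-instruction constant construction, modelled in C on the MIPS-style lui/ori immediates (the constant 0x12345678 is just an example value):

#include <stdint.h>
#include <assert.h>

int main(void)
{
    uint32_t want = 0x12345678u;
    uint32_t hi16 = want >> 16;        /* immediate carried by "lui  rX, hi16"     */
    uint32_t lo16 = want & 0xFFFFu;    /* immediate carried by "ori  rX, rX, lo16" */

    uint32_t reg = hi16 << 16;         /* effect of lui: upper half set, lower zeroed */
    reg |= lo16;                       /* effect of ori: lower half filled in         */

    assert(reg == want);
    return 0;
}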
Fixed length instructions also make it more difficult to extend an instruction set while retaining binary compatibility (and not requiring additional modes of operation). Even strictly uniform formatting can hinder extension of an instruction set, particularly for increasing the number of registers available.
Fujitsu's SPARC64 VIIIfx is an interesting example. It uses a two-bit opcode (in its 32-bit instructions) to indicate a loading of a special register with two 15-bit instruction extensions for the next two instructions. These extensions provide extra register bits and indication of SIMD operation (i.e., extending the opcode space of the instruction to which the extension is applied). This means that the full register name of an instruction not only is not entirely in a fixed position, but not even in the same "instruction". (Similarities to x86's REX prefix--which provides bits to extend register names encoded in the main part of the instruction--might be noted.)
(One aspect of fixed length encodings is the tyranny of powers of two. Although it is possible to use non-power-of-two instruction lengths [Tensilica's XTensa now has fixed 24-bit instructions as its base ISA--with 16-bit short instruction support being an extension, previously they were part of the base ISA; IBM had an experimental ISA with 40-bit instructions.], such lengths add a little complexity. If one size, e.g., 32 bits, is a little too short, the next available size, e.g., 64 bits, is likely to be too long, sacrificing too much code density.)
For implementations with deep pipelines the extra time required for parsing instructions is less significant. The extra dynamic work done by hardware and the extra design complexity are reduced in significance for high performance implementations which add sophisticated branch prediction, out-of-order execution, and other features.
Variable Length Instruction Trade-offs
For variable length instructions, the trade-offs are essentially reversed.
Greater code density is the most obvious advantage. Greater code density can improve static code size (the amount of storage needed for a given program). This is particularly important for some embedded systems, especially microcontrollers, since it can be a large fraction of the system cost and influence the system's physical size (which has impact on fitness for purpose and manufacturing cost).
Improving dynamic code size reduces the amount of bandwidth used to fetch instructions (both from memory and from cache). This can reduce cost and energy use and can improve performance. Smaller dynamic code size also reduces the size of caches needed for a given hit rate; smaller caches can use less energy and less chip area and can have lower access latency.
(In a non- or minimally pipelined implementation with a narrow memory interface, fetching only a portion of an instruction in a cycle in some cases does not hurt performance as much as it would in a more pipelined design less limited by fetch bandwidth.)
With variable length instructions, large constants can be used in instructions without requiring all instructions to be large. Using an immediate rather than loading a constant from data memory exploits spatial locality, provides the value earlier in the pipeline, avoids an extra instruction, and removes a data cache access. (A wider access is simpler than multiple accesses of the same total size.)
Extending the instruction set is also generally easier given support for variable length instructions. Additional information can be included by using extra long instructions. (In the case of some encoding techniques--particularly using prefixes--, it is also possible to add hint information to existing instructions allowing backward compatibility with additional new information. x86 has exploited this not only to provide branch hints [which are mostly unused] but also the Hardware Lock Elision extension. For a fixed length encoding, it would be difficult to choose in advance which operations should have additional opcodes reserved for possible future addition of hint information.)
Variable length encoding clearly makes finding the start of the next sequential instruction more difficult. This is somewhat less of a problem for implementations that only decode one instruction per cycle, but even in that case it adds extra work for the hardware (which can increase cycle time or pipeline length as well as use more energy). For wider decode several tricks are available to reduce the cost of parsing out individual instructions from a block of instruction memory.
One technique that has mainly been used microarchitecturally (i.e., not included in the interface exposed to software but only as an implementation technique) is to use marker bits to indicate the start or end of an instruction. Such marker bits would be set for each parcel of instruction encoding and stored in the instruction cache. This delays the availability of such information on an instruction cache miss, but the delay is typically small compared to the ordinary delay in filling a cache miss. The extra (pre)decoding work is only needed on a cache miss, so time and energy are saved in the common case of a cache hit (at the cost of some extra storage and bandwidth, which has some energy cost).
(Several AMD x86 implementations have used marker bit techniques.)
Alternatively, marker bits could be included in the instruction encoding. This places some constraints on opcode assignment and placement since the marker bits effectively become part of the opcode.
Another technique, used by the IBM zSeries (S/360 and descendants), is to encode the instruction length in a simple way in the opcode in the first parcel. The zSeries uses two bits to encode three different instruction lengths (16, 32, and 48 bits), with two of the encodings used for the 32 bit length. By placing this in a fixed position, it is relatively easy to quickly determine where the next sequential instruction begins.
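A sketch of that length rule in C, assuming the standard S/360-style convention that the top two bits of the first halfword select the length:

#include <stdint.h>

/* S/360 / zSeries style instruction-length rule: the two high bits of the
 * first halfword give the length (00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes). */
static int insn_length_bytes(uint16_t first_halfword)
{
    switch (first_halfword >> 14) {   /* top two bits of the opcode */
        case 0:  return 2;            /* one halfword                     */
        case 3:  return 6;            /* three halfwords                  */
        default: return 4;            /* two halfwords (two encodings)    */
    }
}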
(More aggressive predecoding is also possible. The Pentium 4 used a trace cache containing fixed-length micro-ops and recent Intel processors use a micro-op cache with [presumably] fixed-length micro-ops.)
Obviously, variable length encodings require addressing at the granularity of a parcel which is typically smaller than an instruction for a fixed-length ISA. This means that branch offsets either lose some range or must use more bits. This can be compensated by support for more different immediate sizes.
Likewise, fetching a single instruction can be more complex since the start of the instruction is likely to not be aligned to a larger power of two. Buffering instruction fetch reduces the impact of this, but adds (trivial) delay and complexity.
With variable length instructions it is also more difficult to have uniform encoding. This means that part of the opcode must often be decoded before the basic parsing of the instruction can be started. This tends to delay the availability of register names and other, less critical information. Significant uniformity can still be obtained, but it requires more careful design and weighing of trade-offs (which are likely to change over the lifetime of the ISA).
As noted earlier, with more complex implementations (deeper pipelines, out-of-order execution, etc.), the extra relative complexity of handling variable length instructions is reduced. After instruction decode, a sophisticated implementation of an ISA with variable length instructions tends to look very similar to one of an ISA with fixed length instructions.
It might also be noted that much of the design complexity for variable length instructions is a one-time cost; once an organization has learned techniques (including the development of validation software) to handle the quirks, the cost of this complexity is lower for later implementations.
Because of the code density concerns for many embedded systems, several RISC ISAs provide variable length encodings (e.g., microMIPS, Thumb2). These generally only have two instruction lengths, so the additional complexity is constrained.
Bundling as a Compromise Design
One (sort of intermediate) alternative chosen for some ISAs is to use a fixed length bundle of instructions with different length instructions. By containing instructions in a bundle, each bundle has the advantages of a fixed length instruction and the first instruction in each bundle has a fixed, aligned starting position. The CDC 6600 used 60-bit bundles with 15-bit and 30-bit operations. The M32R uses 32-bit bundles with 16-bit and 32-bit instructions.
(Itanium uses fixed length power-of-two bundles to support non-power of two [41-bit] instructions and has a few cases where two "instructions" are joined to allow 64-bit immediates. Heidi Pan's [academic] Heads and Tails encoding used fixed length bundles to encode fixed length base instruction parts from left to right and variable length chunks from right to left.)
Some VLIW instruction sets use a fixed size instruction word but individual operation slots within the word can be a different (but fixed for the particular slot) length. Because different operation types (corresponding to slots) have different information requirements, using different sizes for different slots is sensible. This provides the advantages of fixed size instructions with some code density benefit. (In addition, a slot might be allocated to optionally provide an immediate to one of the operations in the instruction word.)

Ghz to MIPS? Rough estimate anyone?

From the research I have done so far, I learned that the MIPS value is highly dependent upon the application being run, and the language.
But can anyone give me their best guess for a 2.5 Ghz computer in MIPS? Or any other number of Ghz?
C++ if that helps.
MIPS stands for "Million Instructions Per Second", but that value becomes difficult to calculate for modern computers. Many processor architectures (such as x86 and x86_64, which make up most desktop and laptop computers) fall into the CISC category of processors. CISC architectures often contain instructions that perform several different tasks at once. One of the consequences of this is that some instructions take more clock cycles than other instructions. So even if you know your clock frequency (in this case 2.5 gigahertz), the number of instructions run per second depends mostly on which instructions a program uses. For this reason, MIPS has largely fallen out of use as a performance metric.
For some of my many benchmarks, identified in
http://www.roylongbottom.org.uk/
I produce an assembly code listing from which actual assembler instructions used can be calculated (Note that these are not actual micro instructions used by the RISC processors). The following includes %MIPS/MHz calculations based on these and other MIPS assumptions.
http://www.roylongbottom.org.uk/cpuspeed.htm
The results only apply for Intel CPUs. You will see that MIPS results depend on whether CPU, cache or RAM data is being used. For a modern CPU at 2500 MHz, likely MIPS are between 1250 and 9000 using CPU/L1 cache but much less accessing data in RAM. Then there are SSE SIMD integer instructions. Real integer MIPS for simple register based additions are in:
http://www.roylongbottom.org.uk/whatcpu%20results.htm#anchorC2D
Where my 2.4 GHz Core 2 CPU is shown to run at up to 17531 MIPS.
Roy
MIPS officially stands for Million Instructions Per Second but the Hacker's Dictionary defines it as Meaningless Indication of Processor Speed. This is because many companies use the theoretical maximum for marketing which is never achieved in real applications. E.g. current Intel processors can execute up to 4 instructions per cycle. Following this logic at 2.5 GHz it achieves 10,000 MIPS. In real applications, of course, this number is never achieved. Another problem, which slavik already mentions, is that instructions do different amounts of useful work. There are even NOPs, which–by definition–do nothing useful yet contribute to the MIPS rating.
To correct this, people began using Dhrystone MIPS in the 1980s. Dhrystone is a synthetic benchmark (i.e. it is not based on a useful program), and one Dhrystone MIPS is defined relative to the benchmark performance of a VAX 11/780. This is only slightly less ridiculous than the definition above.
Today, performance is commonly measured by SPEC CPU benchmarks, which are based on real world programs. If you know these benchmarks and your own applications very well, you can make reasonable predictions of performance without actually running your application on the CPU in question.
The key is to understand that performance will vary widely based on a number of characteristics. E.g. there used to be a program called The Many Faces of Go which essentially hard codes knowledge about the board game in many conditional if-clauses. The performance of this program is almost entirely determined by the branch predictor. Other programs use huge amounts of memory that does not fit into any cache; the performance of these programs is determined by the bandwidth and/or latency of the main memory. Some applications may depend heavily on the throughput of floating point instructions while other applications never use any floating point instructions. You get the idea. An accurate prediction is impossible without knowing the application.
Having said all that, an average number would be around 2 instructions per cycle and 5,000 MIPS at 2.5 GHz. However, real numbers can easily be ten or even a hundred times lower.
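For illustration, the arithmetic behind those figures is just clock rate times instructions per cycle; the IPC values below are the rough assumptions used in this answer, not measurements:

#include <stdio.h>

int main(void)
{
    double ghz = 2.5;
    double ipc_marketing = 4.0;   /* theoretical peak issue width           */
    double ipc_typical   = 2.0;   /* rough average for ordinary code        */
    double ipc_memory    = 0.2;   /* illustrative figure for RAM-bound code */

    printf("peak     : %8.0f MIPS\n", ghz * 1000.0 * ipc_marketing);  /* 10000 */
    printf("typical  : %8.0f MIPS\n", ghz * 1000.0 * ipc_typical);    /*  5000 */
    printf("RAM-bound: %8.0f MIPS\n", ghz * 1000.0 * ipc_memory);     /*   500 */
    return 0;
}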

What is the exact meaning of an 'N-bit' processor? (clarification for Freescale arch)

While reading a Freescale processor manual I got stuck somewhere: it specifies that it is a 32-bit processor.
May I know the exact meaning and logic behind that?
Update:
Does it specify the ALU width, the address width, or the register width specifically, or are all of them N bits each?
Update:
I hope you have heard of Freescale processors. I just came across their site, which describes one of their latest Starcore-based processors, known as the SC3850, as a 16-bit processor. As far as I know, it has a 32-bit program counter and ALU, 40-bit wide registers, and a 2x64-bit address bus. Also, the SC3850 can handle SIMD(2) instructions which are 32 bit or 64 bit wide.
For more details please go through this link
One of the major reasons you would care about the register width of the processor is performance. Generally doubling the number of bits doubles the rate at which a processor can move data around, and compute. This is why we're not all using 8 bit processors.
The other major reason is address space. A 16 bit program counter limits you to 64k of address space, and a 32 bit counter limits you to 4 gigabytes. The new 64 bit processors make it possible, if all the address lines are present, to support 17,179,869,184 gigabytes of memory.
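The address-space arithmetic from that paragraph, as a tiny purely illustrative C check:

#include <stdio.h>

int main(void)
{
    /* An N-bit program counter / address bus reaches 2^N bytes. */
    printf("16-bit: %llu KiB\n", (1ULL << 16) >> 10);   /* 64 KiB                  */
    printf("32-bit: %llu GiB\n", (1ULL << 32) >> 30);   /* 4 GiB                   */
    printf("64-bit: %llu GiB\n", ((~0ULL) >> 30) + 1);  /* 17,179,869,184 GiB      */
    return 0;
}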
Firstly, I don't have a definitive answer, but I would guess that 8 being a power of 2 is an important factor. Being a power of 2 also means that certain optimisations may be performed by dividing the 8 bits into groups, which also means lookup tables can be used for certain operations. 8 bits was also, in the past, the perfect size when dealing with plain old ASCII characters. I can imagine that using 5 bit bytes and encoding a string of ASCII characters across memory would be a pain.
Please check out the Wikipedia entry on 32-bit processors, from the entry:
In computer architecture, 32-bit integers, memory addresses, or other data units are those that are at most 32 bits (4 octets) wide. Also, 32-bit CPU and ALU architectures are those that are based on registers, address buses, or data buses of that size. 32-bit is also a term given to a generation of computers in which 32-bit processors were the norm.
Read and understand the article - then the answer for N will be obvious.

Why doesn't my processor have built-in BigInt support?

As far as I understand it, BigInts are usually implemented in most programming languages as arrays containing digits, where, when adding two of them, each digit is added one after another like we learned it in school, e.g.:
   246
 + 816
   * *
 -----
  1062
Where * marks that there was a carry (overflow) out of that column. I learned it this way at school, and all the BigInt adding functions I've implemented work similarly to the example above.
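A minimal sketch of the same schoolbook algorithm in C, using base-10 digit arrays (real BigInt libraries do the same thing with full machine words, "limbs", instead of decimal digits):

#include <stddef.h>

/* dst = a + b; digits are stored least-significant first; returns the final carry. */
static int bigint_add_digits(int *dst, const int *a, const int *b, size_t n)
{
    int carry = 0;
    for (size_t i = 0; i < n; i++) {
        int sum = a[i] + b[i] + carry;
        dst[i]  = sum % 10;   /* digit that stays in this column            */
        carry   = sum / 10;   /* overflow that moves to the next column     */
    }
    return carry;
}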
So we all know that our processors can only natively manage ints from 0 to 2^32 / 2^64.
That means that most scripting languages, in order to be high-level and offer arithmetic with big integers, have to implement/use BigInt libraries that work with integers as arrays like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
1. It'll wrap around inside the space it does have, or
2. It'll use more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arithmetic operations.
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do dst += src, taking 2 pointers and a length as input (like rep movsd for memcpy), it would still be implemented as a loop in microcode.
It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.
Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
adc: add with carry / sbb: subtract with borrow
mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.
Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)
imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.
ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.
In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.
Most high-level languages (including C/C++) don't expose a "carry" flag
Usually there are no add-with-carry intrinsics directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
unsigned __int64 b, unsigned __int64 * out);
So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.
IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
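For example, a sketch of a limb-by-limb add using the _addcarry_u64 intrinsic mentioned above (x86-64 only; the function name and limb layout are my own, and compilers vary in how tight a loop they generate from this):

#include <immintrin.h>
#include <stddef.h>

/* dst += src, both arrays of 'len' 64-bit limbs, least-significant limb first. */
static unsigned char bigint_add_u64(unsigned long long *dst,
                                    const unsigned long long *src, size_t len)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < len; i++) {
        carry = _addcarry_u64(carry, dst[i], src[i], &dst[i]);
    }
    return carry;   /* carry out of the most significant limb */
}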
Here's an example of a BigInteger add function, in NASM syntax:
;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
;   void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
;   len is an element-count, not byte-count

global bigint_add
bigint_add:                   ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
    ; set up for using dst as an index for src
    sub     rsi, rdi          ; rsi -= dst. So orig_src = rsi + rdi
    clc                       ; CF=0 to set up for the first adc
                              ; alternative: peel the first iteration and use add instead of adc
.loop:
    mov     rax, [rsi + rdi]  ; load from src
    adc     rax, [rdi]        ; <================= ADC with dst
    mov     [rdi], rax        ; store back into dst. This appears to be cheaper than
                              ; adc [rdi], rax since we're using a non-indexed
                              ; addressing mode that can micro-fuse
    lea     rdi, [rdi + 8]    ; pointer-increment without clobbering CF
    dec     rdx               ; preserves CF
    jnz     .loop             ; loop while(--len)
    ret
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but would not be worth it on SnB-family.
Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).
According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.
Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.
Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with-carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result? How would users of that result, including all pointers to it, know that its size suddenly changed? And changing the size might require relocating it in memory, because extending the current location would clash with another variable.
This would create a lot of interaction between the processor, OS memory management, and the compiler that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
As I see it, the main idea behind not including BigInt support in modern processors is the desire to reduce the ISA and leave as few instructions as possible, each fetched, decoded and executed at full throttle.
By the way, in x86 family processors there is a set of instructions that makes writing a big int library a single day's matter.
Another reason, I think, is price. It's much more efficient to save some space on the wafer by dropping the redundant operations that can be easily implemented at a higher level.
It seems Intel is adding (or has added, as of the time of this post - 2015) new instruction support for large integer arithmetic.
New instructions are being introduced on Intel® Architecture Processors to enable fast implementations of large integer arithmetic. Large Integer Arithmetic is widely used in multi-precision libraries for high-performance technical computing, as well as for public key cryptography (e.g., RSA). In this paper, we describe the critical operations required in large integer arithmetic and their efficient implementations using the new instructions.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that in the end those that are used more often/deemed more useful will push out those that aren't. The instructions necessary for implementing BigInt functionality are there and the math is straight-forward.
BigInt: the fundamental function required is:
unsigned integer multiplication, adding in the previous high-order word.
I wrote one in Intel 16-bit assembler, then 32 bit...
C code is usually fast enough... i.e. for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned integers as a top priority.
If you want to write your own BigInt...
Division is done via Knuth's Vol. 2 (it's a bunch of multiplies and subtracts, with some tricky add-backs).
Add with carry and subtract are easier, etc. etc.
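A minimal sketch of that "unsigned multiply, add in the previous high order" step, using the GCC/Clang unsigned __int128 extension for the 64x64 -> 128 bit product (the function name and limb layout are illustrative, not from any particular library):

#include <stdint.h>
#include <stddef.h>

/* dst = src * w for a little-endian array of 64-bit limbs; the high half of
 * each product is carried into the next limb; returns the final carry. */
static uint64_t bigint_mul_word(uint64_t *dst, const uint64_t *src,
                                size_t len, uint64_t w)
{
    uint64_t carry = 0;                            /* high half from the previous limb */
    for (size_t i = 0; i < len; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * w + carry;
        dst[i] = (uint64_t)p;                      /* low 64 bits stay in this limb    */
        carry  = (uint64_t)(p >> 64);              /* high 64 bits move to the next    */
    }
    return carry;                                  /* becomes the most significant limb */
}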
I just posted this in Intel:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SSE4 - is there a BigInt library?
i5 2410M processor - I suppose it can NOT use AVX [AVX is only on very recent Intel CPUs]
but can use SSE4.2
Is there a BigInt Library for SSE?
I guess I am looking for something that implements unsigned integer
PMULUDQ (with 128-Bit operands)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
It's a laptop so I can't buy an NVIDIA GTX 550, which isn't so grand on unsigned ints, I hear.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx