Count Cycles not matching on STM32F103C8? Prefetch buffer not working as I think?


I have been fighting with this for a while. I am using an STM32F103C8 with the ST-Link V2 in Atollic.
I wrote some delay functions in assembly. I have tested this code with an oscilloscope on an ATSAM (84 MHz, where it works perfectly), and on the STM32 I also use the DWT (Data Watchpoint and Trace) cycle counter to see the exact number of cycles while debugging.
When I configure the STM32 CPU clock to 24 MHz, the number of cycles matches what I designed the delay for: 1 cycle for the decrement instruction and 2 cycles for the branch instruction (in most cases). So the main loop takes 3 cycles.
When I change the CPU clock to 72 MHz, each assembly instruction takes twice as long!
Well, the prefetch buffer is 2x64 bits, so the flash wait states should not influence the CPU execution time on this microcontroller (leaving aside prediction and other code stalls). Should they?
At 24 MHz the flash memory has no wait states; with a higher clock, the CPU should still not have to wait to execute any code. Should it?
I flashed the release hex to see whether it made a difference and did not find any.
My only explanation would be the ST-LINK V2. Am I right?
Thanks a lot for your time and attention.
This is the piece of the code that matters:
asm (".equ fcpu, 72000000\n\t"); //72 MHz
asm (".equ const_ms, fcpu/3000 \n\t");
asm (".equ const_us, fcpu/3000000 \n\t");
void delay_us(uint32_t valor)
{
asm volatile ( "movw r1, #:lower16:const_us \n\t"
"movt r1, #:upper16:const_us \n\t"
"mul r0, r0, r1 \n\t"
"r_us: subs r0, r0, #1 \n\t"
"bne r_us \n\t");
}
void delay_ms(uint32_t valor)
{
asm volatile ("movw r1, #:lower16:const_ms \n\t"
"movt r1, #:upper16:const_ms \n\t"
"mul r0, r0, r1 \n\t"
"r_ms: subs r0, r0, #1 \n\t"
"bne r_ms \n\t");
}

It is because of the wait states of the FLASH memory when running at 72 MHz. It is good to read the documentation :).
Place the code in SRAM and you will get what you want.
For good results from FLASH, avoid branching, as it flushes the pipeline. This kind of delay is good only for very short delays. Anything longer should be implemented using timers.
I advise avoiding delays in the code.
PS: the ST-Link is not guilty :)

I have been doing several tests. My first conclusion is that the overhead depends on the alignment of the instructions in memory (the prefetch buffer is 2x64 bits).
Second, the branch behaves deterministically: when taken, it flushes the prefetch buffer and also the pipeline.
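A quick arithmetic check of the numbers in the question, as a sketch in C. The function names are made up for illustration. The 3 cycles per iteration come from the question; the 6-cycle figure used in the note below is simply the "twice as long" observation, not a value taken from the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Iterations the delay loop runs per microsecond, mirroring
 * ".equ const_us, fcpu/3000000" (3 cycles per iteration assumed). */
static uint32_t loop_iterations_per_us(uint32_t fcpu_hz)
{
    return fcpu_hz / 3000000u;
}

/* Actual delay in nanoseconds when each iteration really costs
 * cycles_per_iter CPU cycles (hypothetical helper). */
static uint64_t actual_delay_ns(uint32_t fcpu_hz, uint32_t us,
                                uint32_t cycles_per_iter)
{
    assert(fcpu_hz >= 3000000u);
    uint64_t iterations = (uint64_t)us * loop_iterations_per_us(fcpu_hz);
    uint64_t cycles = iterations * cycles_per_iter;
    return cycles * 1000000000ull / fcpu_hz;
}
```

With fcpu = 72000000 the loop runs 24 iterations per microsecond. A requested 10 us delay comes out at 10000 ns if an iteration costs 3 cycles, and at 20000 ns if flash wait states and the refetch after each taken branch push it to 6 cycles.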

Related

GNU Assembler and Exception Vector Table

I have been doing the Baking Pi tutorial and studying the SVC system call. The tutorial sets the base of my program to 0x8000, but the vector table base is 0. How do I place something at 0x0 with the GNU assembler, and which kernel.ld should I use?

Depending on the Pi, you can start at 0x8000 or 0x80000 by default. There are now different filenames that tell the bootloader which mode you want the processor in: kernel.img, kernel7.img, kernel32.img, or various combinations; you can easily look this up.
The Baking Pi tutorial had issues as written, but these have been asked and answered many times in the Raspberry Pi website's bare metal forum (a very good resource, the best I have seen in a long time, if not ever). You will need an old, old Pi or a Pi Zero to get the tutorial to work unless it has been updated.
This is bare metal: you own the whole address space. If you want to put something at zero, you simply do that.
Another approach is to create a config.txt file and in it tell the bootloader in the GPU to load your image at 0x00000000 in the ARM's address space. Depending on the ARM core you are using, you can also use a VTOR register, if present, to change where the vector table is (so set it to 0x80000 instead of 0x0000). I don't think the ARM11 in the Pi Zero or the old, old Pis allows for that, though. 32-bit mode on the newer ones does, but they are multi-core, and that will unravel any learning exercise: you have to "sort the cores", as I like to say, on boot, isolating one to continue and putting the others in an infinite loop so they don't interfere. The boot code that the GPU lays down for you on those Pis does this for you, so that only one core hits 0x8000 or 0x80000. So the config.txt approach is something folks contemplate, but I would recommend against it for a while.
There are a number of tutorials linked in the Raspberry Pi bare metal forum that should take you well beyond the Baking Pi ones, and/or help you through those, as folks struggled with them for some time.
A linker script like this
MEMORY
{
ram : ORIGIN = 0x8000, LENGTH = 0x10000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : { *(.bss*) } > ram
.data : { *(.data*) } > ram
}
with a bootstrap like this
.globl _start
_start:
mov sp,#0x8000
bl main
hang: b hang
should get you booted.
For the linker script you may need 0x80000 instead of 0x8000. If you have at least one .data item, like a global variable:
unsigned int x = 5;
then the bootstrap doesn't have to zero .bss (if your programming style is such that you rely on that). objcopy will pad the -O binary file with zeros between .rodata and .data if there is .data there, which takes care of zeroing .bss.
You can let the tools do the work for you as far as an exception table goes:
.globl _start
_start:
ldr pc,reset_handler
ldr pc,undefined_handler
ldr pc,swi_handler
ldr pc,prefetch_handler
ldr pc,data_handler
ldr pc,unused_handler
ldr pc,irq_handler
ldr pc,fiq_handler
reset_handler: .word reset
undefined_handler: .word hang
swi_handler: .word hang
prefetch_handler: .word hang
data_handler: .word hang
unused_handler: .word hang
irq_handler: .word irq
fiq_handler: .word hang
reset:
mov r0,#0x8000
mov r1,#0x0000
ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
Now, if this is not a Pi Zero, then the vector table works differently. You need to read the ARM docs anyway before going off into stuff like this; read up on the core and mode as well as the architecture docs for whichever one you are using. The newer Pis have an armv7 mode and an armv8 mode (aarch32 and aarch64) and each has its own challenges, but they have all been covered in the forum.

When does the pipeline take 2 decode stages when there is a RAW dependency in 2 successive instructions

Consider a RISC pipeline with 5 stages. Find how many cycles are required for the instructions given below. Assume operand forwarding is used and that branch prediction predicts the branch as not taken. ACS is the branch instruction, and the five stages are Instruction fetch, Decode, Execute, Memory and Write back.
I1: ACS R0, R1,X
I2: LOAD R2, 0(R3)
I3: SUB R4, R2, R2
I4: X: ADD R5, R1, R2
I5: LOAD R1, 0(R5)
I6: SUB R1, R1, R4
I7: ADD R1, R1, R5
A. 11
B. 12
C. 13
D. 14
Solution:
In the solution, I couldn't understand why they have neglected the 2 DECODE cycles in I6 and I7 although they have a RAW dependency.
Source of the question:
Question 41 of https://practice.geeksforgeeks.org/contest-quiz/sudo-gate-2020-mock-iii

I think the answer gives the right total (13 cycles) but put the stall in the wrong instruction.
I5 doesn't need to stall; I4 (ADD R5, R1, R2) produces R5 in time to forward it to the next instruction's EX for address calculation (LOAD R1, 0(R5)). (Your 5-stage classic RISC pipeline has bypass forwarding).
But I6 reads the result of a load instruction, and loads produce their result a cycle later than the ALU in EX. So like I3, I6 needs to stall, not I5.
(I7 depends on I6, but I6 is an ALU instruction so it can forward without stalling.)
They stall in the D stage because the ID stage can't fetch registers that the I2 / I5 load hasn't produced yet.
Separately from that, your diagram shows I4 (and what should be I7) not even being fetched when the previous instruction stalls. That doesn't make sense to me. At the start of that cycle, the pipeline doesn't even know that it needs to stall because it hasn't yet decoded I3 (and I6) and detected that it reads a not-ready register so an interlock is needed.
Fetch doesn't wait until after decoding the previous instruction to see if it stalled or not; that would defeat the entire purpose of pipelining. It should look like
I3   IF   D    D    EX   MEM  WB
I4        IF   IF   D    EX   MEM  WB
BTW, load latency is the reason that classic MIPS has a load-delay slot (unpredictable behaviour if you try to use a register in the next instruction after loading into it). Later MIPS added interlocks to stall if you do that, instead of making it an error, so you can keep static code-size smaller (no NOP filler) in cases where you can't find any other instruction to put in that slot. (And some even later MIPS did out-of-order exec which can hide latency.)
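The counting argument above can be sketched in C. This is a minimal model, assuming full forwarding so that the only hazard that costs a cycle is a load immediately followed by an instruction that reads the loaded register. The struct layout and function name are illustrative, not from the quiz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One instruction: destination register, two source registers
 * (-1 = unused), and whether it is a load. */
struct instr {
    int dst, src1, src2;
    bool is_load;
};

/* Cycles for n instructions on a classic 5-stage pipeline with
 * forwarding: 5 cycles for the first instruction, 1 more for each
 * following one, plus 1 stall per load-use pair. */
static int pipeline_cycles(const struct instr *code, size_t n)
{
    assert(code != NULL && n > 0);
    int stalls = 0;
    for (size_t i = 1; i < n; i++) {
        const struct instr *prev = &code[i - 1];
        if (prev->is_load &&
            (code[i].src1 == prev->dst || code[i].src2 == prev->dst))
            stalls++;
    }
    return 5 + (int)(n - 1) + stalls;
}

/* I1..I7 from the question; the branch is predicted not taken and
 * falls through, so all seven instructions execute. */
static const struct instr question_code[] = {
    { -1, 0,  1, false },  /* I1: ACS  R0, R1, X   */
    {  2, 3, -1, true  },  /* I2: LOAD R2, 0(R3)   */
    {  4, 2,  2, false },  /* I3: SUB  R4, R2, R2  (load-use stall) */
    {  5, 1,  2, false },  /* I4: ADD  R5, R1, R2  */
    {  1, 5, -1, true  },  /* I5: LOAD R1, 0(R5)   */
    {  1, 1,  4, false },  /* I6: SUB  R1, R1, R4  (load-use stall) */
    {  1, 1,  5, false },  /* I7: ADD  R1, R1, R5  */
};
```

Calling pipeline_cycles(question_code, 7) gives 5 + 6 + 2 = 13, with the two stalls charged to I3 and I6.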

What sort of data is stored in cpu registers

I know CPU registers are used for fast access. But could anyone give me an example of the data stored in them? Why is this data so important that it has to be saved by the operating system during context switching?

I would place registers in two groups:
1. System registers
2. Registers that define the process state
System registers do not change with process contexts. Classically, the second group of registers includes:
1. A processor status register
2. General registers
3. Memory mapping registers
You seem to be most interested in #2, judging from your question. For simplicity, I will use the VAX processor as the working example (the Intel Kludge-On-A-Chip is overly complex).
The VAX has 16 32-bit registers (R0–R15). Some of those registers (R12–R15) have special purposes:
PC = Program Counter points to the next instruction to execute
SP = Stack pointer points to bottom of the stack for the current mode.
AP = Argument Pointer points to the arguments to a function
FP = Frame Pointer used to restore the stack after a function call completes.
That leaves R0–R11 for general use.
R6-R11 can be used by programmers at will.
R0-R5 can be used by programmers but some instructions change their values.
The registers are 32 bits. They can then store:
One-Byte signed or unsigned integer
Two-byte signed or unsigned integer
Four-byte signed or unsigned integer
Four-byte floating point
You can do something like these:
ADDL3 R0, R1, R2 ; Add contents of R0 and R1 and store the result in R2
ADDF3 R0, R1, R2
In the first case, the processor treats the contents of R0 and R1 as 32-bit signed integers. In the second case, it treats the contents of R0 and R1 as 32-bit floating point values.
The interpretation of the register contents depends upon the instruction being executed. Thus, the two instructions above are likely to store different values in R2, even if they have the same values in R0 and R1.
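The same point can be shown on an ordinary host machine with a small C sketch. It assumes the host uses 32-bit IEEE 754 floats, which is not the VAX F_floating format, so the bit patterns differ from what a real VAX would produce. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Add two 32-bit register images as integers (in the spirit of ADDL3). */
static uint32_t add_as_int(uint32_t r0, uint32_t r1)
{
    return r0 + r1;   /* wraps modulo 2^32 */
}

/* Add the same two register images as single-precision floats
 * (in the spirit of ADDF3). memcpy reinterprets the bits without
 * converting the value. */
static uint32_t add_as_float(uint32_t r0, uint32_t r1)
{
    float a, b, sum;
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&a, &r0, sizeof a);
    memcpy(&b, &r1, sizeof b);
    sum = a + b;
    memcpy(&bits, &sum, sizeof bits);
    return bits;
}
```

With r0 = 0x3F800000 (1.0f) and r1 = 0x40000000 (2.0f), the integer add leaves 0x7F800000 in the result, while the float add leaves 0x40400000 (3.0f): same inputs, different instruction, different contents.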
For larger data types, adjacent registers can be combined:
ADDD3 R0, R2, R4
This adds the contents of R0/R1, to the contents of R2/R3, and stores the result in R4/R5, treating the contents of all the register pairs as 64-bit floating point values.
You can even do
ADDH3 R0, R4, R8
This adds the contents of R0/R1/R2/R3 to the contents of R4/R5/R6/R7, and stores the result in R8/R9/R10/R11, treating the contents of all the register quads as 128-bit floating point values.
The VAX has character and some complex matching instructions that use R0-R5 for special purposes (such as loop counters). These are instructions with long execution times that can be interrupted. Using the registers to maintain the state of the instruction allows the instruction to be restarted midstream when the process is restarted.
Programmers use R0-R5. There is no problem with that as long as you don't use the instructions that disrupt them.
By Convention R0 and R1 are used for function return values.
So these are the kinds of things you do with registers.

They are not for fast access. They are the core of the CPU, and every operation must be done on them. For example, the CPU can add two numbers after you move them from memory into registers.

Maximum speed from IOS/iPad/iPhone

I wrote a compute-intensive app using OpenCV for iOS. Of course it was slow, but it was something like 200 times slower than my PC prototype, so I started optimizing it. From the initial 15 seconds I got down to 0.4 seconds. I wonder whether I found everything and what others may want to share. What I did:
1) Replaced "double" data types inside OpenCV with "float". Double is 64-bit and a 32-bit CPU cannot easily handle it, so float gave me some speed. OpenCV uses double very often.
2) Added "-mfpu=neon" to the compiler options. A side effect was that the emulator build no longer works, so everything can be tested on native hardware only.
3) Replaced the sin() and cos() implementations with 90-value lookup tables. The speedup was huge! This is somewhat opposite to the PC, where such optimizations do not give any speedup. There was code working in degrees that converted the value to radians for sin() and cos(). That code was removed too, but the lookup tables did the job.
4) Enabled "thumb optimizations". Some blog posts recommend exactly the opposite, but that is because thumb usually makes things slower on armv6. armv7 is free of those problems, and thumb just makes things faster and smaller.
5) To make sure thumb optimizations and -mfpu=neon work at their best and do not introduce crashes, I removed the armv6 target completely. All my code is compiled for armv7, and this is also listed as a requirement in the App Store. This means the minimum iPhone will be the 3GS. I think it is OK to drop older ones. Anyway, older ones have slower CPUs, and a CPU-intensive app provides a bad user experience if installed on an old device.
6) Of course I use the -O3 flag.
7) I deleted "dead code" from OpenCV. Often when optimizing OpenCV I see code which is clearly not needed for my project. For example, there is often an extra "if()" to check whether the pixel size is 8 bit or 32 bit, and I know that I need 8 bit only. This removes some code and gives the optimizer a better chance to remove something more or replace it with constants. The code also fits better into cache.
Any other tricks and ideas? For me, enabling thumb and replacing trigonometry with lookups gave the biggest boosts and surprised me. Maybe you know something more that makes apps fly?

If you are doing a lot of floating point calculations, it would benefit you greatly to use Apple's Accelerate framework. It is designed to use the floating point hardware to do calculations on vectors in parallel.
I will also address your points one by one:
1) This is not because of the CPU; it is because, as of the armv7 era, only 32-bit floating point operations are calculated in the floating point processor hardware (because Apple replaced the hardware). 64-bit ones are calculated in software instead. In exchange, 32-bit operations got much faster.
2) NEON is the name of the new floating point processor instruction set
3) Yes, this is a well known method. An alternative is to use Apple's framework that I mentioned above. It provides sin and cos functions that calculate 4 values in parallel. The algorithms are fine tuned in assembly and NEON so they give the maximum performance while using minimal battery.
4) The new armv7 implementation of thumb doesn't have the drawbacks of armv6. The disabling recommendation only applies to v6.
5) Yes, considering 80% of users are on iOS 5.0 or above now (armv6 devices ended support at 4.2.1), that is perfectly acceptable for most situations.
6) This happens automatically when you build in release mode.
7) Yes, this won't have as large an effect as the above methods though.
My recommendation is to check out Accelerate. That way you can make sure you are leveraging the full power of the floating point processor.
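For reference, the lookup-table approach from point 3 can be sketched in plain C as follows. This is only an illustration, not the asker's code: it assumes angles are whole degrees, stores 91 entries (0 to 90 inclusive) and uses quadrant symmetry for the rest. The table is filled with a recurrence so the sketch does not need the math library; all names are made up.

```c
#include <assert.h>

#define SIN_TABLE_SIZE 91   /* entries for 0..90 degrees inclusive */

static float sin_table[SIN_TABLE_SIZE];
static int sin_table_ready = 0;

/* Fill the table once, using
 * sin((n+1)d) = 2*cos(d)*sin(n*d) - sin((n-1)*d) with d = 1 degree. */
static void sin_table_init(void)
{
    const double sin1 = 0.017452406437283512; /* sin(1 degree) */
    const double cos1 = 0.9998476951563913;   /* cos(1 degree) */
    double prev = 0.0, cur = sin1;
    sin_table[0] = 0.0f;
    for (int i = 1; i < SIN_TABLE_SIZE; i++) {
        sin_table[i] = (float)cur;
        double next = 2.0 * cos1 * cur - prev;
        prev = cur;
        cur = next;
    }
    sin_table_ready = 1;
}

/* Sine of a whole number of degrees via table lookup. */
static float sin_deg(int deg)
{
    if (!sin_table_ready)
        sin_table_init();
    deg %= 360;
    if (deg < 0)
        deg += 360;
    assert(deg >= 0 && deg < 360);
    if (deg <= 90)  return sin_table[deg];
    if (deg <= 180) return sin_table[180 - deg];
    if (deg <= 270) return -sin_table[deg - 180];
    return -sin_table[360 - deg];
}

/* Cosine is the same table shifted by 90 degrees. */
static float cos_deg(int deg)
{
    return sin_deg(deg + 90);
}
```

Because the code already worked in degrees, the lookup also removes the degree-to-radian conversion, which matches what the question describes.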

Some feedback on the previous posts. This explains the idea about dead code that I tried to convey in point 7, which was meant to be a slightly wider idea. I need formatting, so the comment form cannot be used. This code was in OpenCV:
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
I wanted to see how it looks on assembly. To make sure I can find it in assembly, I wrapped it like this:
__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
__asm__("#stop");
Now I press "Product -> Generate Output -> Assembly file" and what I get is:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
ldr r0, [sp, #84]
movs r1, #0
ldr r0, [r0, #16]
ldr r0, [r0, #28]
cmp r0, #4
mov r0, r4
blo LBB14_71
LBB14_70:
Ltmp1916:
ldr r3, [sp, #84]
movs r2, #0
Ltmp1917:
str r2, [r0], #4
adds r1, #1
Ltmp1918:
Ltmp1919:
ldr r2, [r3, #16]
ldr r2, [r2, #28]
lsrs r2, r2, #2
cmp r2, r1
bgt LBB14_70
LBB14_71:
Ltmp1920:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
A lot of code. I printed the value of (int)(descriptors->elem_size/sizeof(vec[0])) and it was always 64. So I hardcoded it to 64 and generated the assembly again:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
vldr.32 s16, LCPI14_7
mov r0, r4
movs r1, #0
mov.w r2, #256
blx _memset
# InlineAsm Start
#stop
# InlineAsm End
As you can see, the optimizer now got the idea and the code became much shorter: it was able to replace the whole loop with a single call to memset. The point is that the compiler does not always know which inputs are constants when they are things like the webcam frame size or pixel depth, but in my context they usually are constant and all I care about is speed.
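The two variants, reduced to a standalone C sketch. VEC_LEN and both function names are made up for illustration; the real code reads the size from descriptors->elem_size. Whether a particular compiler turns the second loop into a memset call depends on the compiler and flags, so treat that as something to check in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LEN 64   /* assumed fixed length, as observed at run time */

/* Generic version: the trip count is computed from a value that is
 * only known at run time. */
static void clear_generic(float *vec, const size_t *elem_size)
{
    for (size_t k = 0; k < *elem_size / sizeof vec[0]; k++)
        vec[k] = 0.0f;
}

/* Specialised version: the trip count is a compile-time constant,
 * which gives the optimizer the chance to emit one bulk clear. */
static void clear_fixed(float *vec, size_t elem_size)
{
    /* Guard the assumption that was hardcoded. */
    assert(elem_size / sizeof vec[0] == VEC_LEN);
    for (size_t k = 0; k < VEC_LEN; k++)
        vec[k] = 0.0f;
}
```

The assert keeps the hardcoded constant honest in debug builds and disappears when compiled with NDEBUG.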
I also tried Accelerate as suggested replacing three lines with:
__asm__("#start");
vDSP_vclr(vec,1,64);
__asm__("#stop");
Assembly now looks:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1917:
str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
movs r1, #1
movs r2, #64
blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
I am unsure whether this is faster than bzero, though. In my context this part does not take much time, and the two variants seemed to run at the same speed.
One more thing I learned is to use the GPU. More about it here: http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework

Why doesn't my processor have built-in BigInt support?

As far as I understand it, BigInts are implemented in most programming languages as arrays of digits where, when adding two of them, each digit is added one after another, as we know it from school, e.g.:
 246
 816
 * *
----
1062
Here * marks a column where there was an overflow. I learned it this way at school, and all the BigInt adding functions I've implemented work similarly to the example above.
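The schoolbook method above, written out as a minimal C sketch. It assumes one decimal digit per array element with the least significant digit first; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Add two numbers stored as decimal digits, least significant digit
 * first. out must have room for max(na, nb) + 1 digits.
 * Returns the number of digits written. */
static size_t digits_add(const unsigned char *a, size_t na,
                         const unsigned char *b, size_t nb,
                         unsigned char *out)
{
    size_t n = na > nb ? na : nb;
    unsigned carry = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned da = i < na ? a[i] : 0;
        unsigned db = i < nb ? b[i] : 0;
        assert(da <= 9 && db <= 9);
        unsigned sum = da + db + carry;
        out[i] = (unsigned char)(sum % 10);
        carry = sum / 10;   /* 1 where the worked example shows '*' */
    }
    if (carry)
        out[i++] = (unsigned char)carry;
    return i;
}
```

For 246 + 816 the inputs are {6, 4, 2} and {6, 1, 8}; the result is {2, 6, 0, 1}, that is 1062, with a carry out of the ones and the hundreds columns.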
So we all know that our processors can only natively manage ints from 0 to 2^32 / 2^64.
That means that most scripting languages, in order to be high-level and offer arithmetic with big integers, have to implement or use BigInt libraries that work with integers as arrays, like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?

There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
1) It'll wrap around inside the space it does have
or
2) It'll use more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?
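To make option 2) concrete, here is a small C sketch of what a software library has to do when a result outgrows its storage. The struct and function are hypothetical, not taken from any particular library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical length-prefixed big integer: len limbs,
 * least significant limb first. */
struct bigint {
    size_t len;
    uint64_t *limbs;
};

/* Add one word to x in place. If the carry runs off the top, the limb
 * array has to grow, and realloc may move it: any other pointer to the
 * old storage is then stale.
 * Returns 0 on success, -1 if no memory could be obtained. */
static int bigint_add_word(struct bigint *x, uint64_t w)
{
    assert(x != NULL);
    uint64_t carry = w;
    for (size_t i = 0; i < x->len && carry; i++) {
        uint64_t sum = x->limbs[i] + carry;
        carry = sum < carry;   /* wrapped around, so carry out is 1 */
        x->limbs[i] = sum;
    }
    if (carry) {
        uint64_t *grown = realloc(x->limbs, (x->len + 1) * sizeof *grown);
        if (!grown)
            return -1;
        grown[x->len] = carry;
        x->limbs = grown;
        x->len++;
    }
    return 0;
}
```

The realloc call is exactly the part a CPU instruction could not do on its own: it depends on the allocator, and only the program knows which pointers refer to the old block.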

Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arithmetic operations.

It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do dst += src, taking 2 pointers and a length as input (like rep movsd to memcpy), it would still be implemented as a loop in microcode.
It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.
Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
adc: add with carry / sbb: subtract with borrow
mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.
Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)
imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.
ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.
In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.
Most high-level languages (including C/C++) don't expose a "carry" flag
Usually there aren't intrinsics for add-with-carry directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).
unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
unsigned __int64 b, unsigned __int64 * out);
So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.
IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic, when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
Here's an example of a BigInteger add function, in NASM syntax:
;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
; len is an element-count, not byte-count
global bigint_add
bigint_add: ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
; set up for using dst as an index for src
sub rsi, rdi ; rsi -= dst. So orig_src = rsi + rdi
clc ; CF=0 to set up for the first adc
; alternative: peel the first iteration and use add instead of adc
.loop:
mov rax, [rsi + rdi] ; load from src
adc rax, [rdi] ; <================= ADC with dst
mov [rdi], rax ; store back into dst. This appears to be cheaper than adc [rdi], rax since we're using a non-indexed addressing mode that can micro-fuse
lea rdi, [rdi + 8] ; pointer-increment without clobbering CF
dec rdx ; preserves CF
jnz .loop ; loop while(--len)
ret
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.
Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).
According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.
Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.
Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
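As an illustration of building on a widening multiply, here is a sketch in C of multiplying a multi-limb number by a single 64-bit word. It assumes the unsigned __int128 extension of GCC and Clang to express the 64b*64b => 128b product; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply an n-limb number (least significant limb first) by one
 * 64-bit word, in place. Returns the limb that carries out of the top.
 * unsigned __int128 is a compiler extension, assumed available. */
static uint64_t bigint_mul_word(uint64_t *x, size_t n, uint64_t w)
{
    uint64_t high = 0;   /* high half of the previous product */
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Full 128-bit product plus the previous high half; this
         * cannot overflow 128 bits. */
        unsigned __int128 p = (unsigned __int128)x[i] * w + high;
        x[i] = (uint64_t)p;           /* low half stays in this limb */
        high = (uint64_t)(p >> 64);   /* high half feeds the next limb */
    }
    return high;
}
```

A full BigInteger by BigInteger multiply repeats this for every limb of the second operand and accumulates the shifted partial results, which is where most of the extra complexity comes from.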
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
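For comparison with the asm loop, here is a portable C sketch of the same dst += src operation that generates the carry in software by comparing for unsigned wraparound, as described for MIPS. The name bigint_add_c is made up; unlike the asm version it also returns the final carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst += src over len limbs, least significant limb first.
 * Returns the carry out of the most significant limb (0 or 1). */
static unsigned bigint_add_c(uint64_t *dst, const uint64_t *src, size_t len)
{
    unsigned carry = 0;
    assert(len == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < len; i++) {
        uint64_t sum = dst[i] + src[i];
        unsigned c1 = sum < src[i];       /* the add itself wrapped */
        uint64_t with_carry = sum + carry;
        unsigned c2 = with_carry < sum;   /* adding the carry-in wrapped */
        dst[i] = with_carry;
        carry = c1 | c2;                  /* both cannot be set at once */
    }
    return carry;
}
```

Each limb costs two adds and two compares here, where a flags-based ISA needs a single adc; a compiler may or may not recognise the pattern and emit adc on x86.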

Suppose the result of the multiplication needed 3 times the space (memory) to be stored: where would the processor store that result? How would users of that result, including all pointers to it, know that its size had suddenly changed? Changing the size might also require relocating it in memory, because extending the current location would clash with another variable.
This would create a lot of interaction between the processor, OS memory management, and the compiler that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.

As I see it, the main idea behind not including BigInt support in modern processors is the desire to keep the ISA small and leave as few instructions as possible, so that they can be fetched, decoded and executed at full throttle.
By the way, the x86 family has a set of instructions that makes writing a big int library a single day's work.
Another reason, I think, is price. It is much more efficient to save some space on the wafer by dropping redundant operations that can easily be implemented at a higher level.

It seems Intel is adding (or has added, as of the time of this post in 2015) new instruction support for large integer arithmetic.
New instructions are being introduced on Intel® Architecture
Processors to enable fast implementations of large integer arithmetic.
Large Integer Arithmetic is widely used in multi-precision libraries
for high-performance technical computing, as well as for public key
cryptography (e.g., RSA). In this paper, we describe the critical
operations required in large integer arithmetic and their efficient
implementations using the new instructions.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

There are so many instructions and functionalities jockeying for area on a CPU chip that in the end those that are used more often/deemed more useful will push out those that aren't. The instructions necessary for implementing BigInt functionality are there and the math is straight-forward.

BigInt: the fundamental operation required is:
unsigned integer multiplication, adding the previous high-order word.
I wrote one in Intel 16-bit assembler, then 32-bit...
C code is usually fast enough, i.e. for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned integers as top priority.
If you want to write your own BigInt...
Division is done via Knuth's Vol. 2 (it's a bunch of multiplies and subtracts, with some tricky add-backs).
Add with carry and subtract are easier, etc.
I just posted this in Intel:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
SSE4: is there a BigInt library?
I suppose the i5 2410M processor can NOT use AVX [AVX is only on very recent Intel CPUs]
but it can use SSE4.2.
Is there a BigInt library for SSE?
I guess I am looking for something that implements unsigned integer
PMULUDQ (with 128-bit operands)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
It's a laptop, so I can't buy an NVIDIA GTX 550, which isn't so grand on unsigned ints, I hear.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
