I have been doing the Baking Pi tutorial and have been studying the SVC system call. The tutorial sets the base of my program to 0x8000, but the vector table base is 0. How do I access 0x0 with the GNU assembler, and which kernel.ld should I use now?
Depending on the Pi, you start at 0x8000 or 0x80000 by default. There are now different filenames that tell the bootloader what mode you want the processor in (kernel.img, kernel7.img, kernel8.img, and various combinations); you can easily look this up.
First off, Baking Pi had issues as written, but these have been asked and answered many times in the Raspberry Pi baremetal forums on their website (a very good resource, the best I have seen in a long time if not ever). You will need to be using an old old Pi or a Pi Zero to get the tutorial to work, unless it has been updated.
This is bare metal: you own the whole address space, so if you want to put something at zero you simply do that.
Another approach: you can create a config.txt file and tell the bootloader in the GPU to load your image to 0x00000000 in the ARM's address space. Depending on the ARM core you are using, you can also use a vector table base register, if present (VBAR on the ARMv7-A cores), to change where the vector table lives (so set it at 0x80000 instead of 0x0000). I don't think the ARM11 in the Pi Zero or the old old Pis allows for that, though. 32-bit mode on the newer ones does, but they are multi-core, and that will unravel any learning exercise: you have to "sort the cores", as I like to say, on boot, isolating one to continue and putting the others in an infinite loop so they don't interfere. The boot code that the GPU lays down for you on those Pis does this for you, so that only one core hits 0x8000 or 0x80000. So the config.txt approach is something folks contemplate, but I would recommend against it for a while.
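For reference, that config.txt route looks something like this (a sketch; kernel_address is the bootloader option involved, and the filename depends on your Pi):
kernel=kernel.img
kernel_address=0x0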
There are a number of tutorials linked in the Raspberry Pi baremetal forum on their website that should take you well beyond the Baking Pi one(s), and/or help you through those, as folks struggled with them for some time.
A linker script like this
MEMORY
{
    /* the GPU loads the kernel image at 0x8000 on the older Pis */
    ram : ORIGIN = 0x8000, LENGTH = 0x10000
}
SECTIONS
{
    .text : { *(.text*) } > ram
    .rodata : { *(.rodata*) } > ram
    .bss : { *(.bss*) } > ram
    .data : { *(.data*) } > ram
}
with a bootstrap like this
.globl _start
_start:
    mov sp,#0x8000    @ stack grows down from the load address
    bl main           @ call into the C entry point
hang: b hang          @ trap here if main ever returns
should get you booted.
For the linker script you may need 0x80000 instead of 0x8000. And if you have at least one .data item, like a global variable:
unsigned int x = 5;
then the bootstrap doesn't have to zero .bss (if your programming style is such that you rely on that): objcopy will pad the -O binary file with zeros between .rodata and .data when .data is present, and that padding takes care of zeroing .bss.
You can let the tools do the work for you as far as an exception table goes:
.globl _start
_start:
ldr pc,reset_handler
ldr pc,undefined_handler
ldr pc,swi_handler
ldr pc,prefetch_handler
ldr pc,data_handler
ldr pc,unused_handler
ldr pc,irq_handler
ldr pc,fiq_handler
reset_handler: .word reset
undefined_handler: .word hang
swi_handler: .word hang
prefetch_handler: .word hang
data_handler: .word hang
unused_handler: .word hang
irq_handler: .word irq
fiq_handler: .word hang
reset:
    @ copy the 16 words above (8 vectors + 8 handler addresses)
    @ from the load address 0x8000 down to 0x0000
    mov r0,#0x8000
    mov r1,#0x0000
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
    mov sp,#0x8000
    bl main
hang: b hang          @ parked here by the unused handlers; define your
                      @ own irq: handler if you enable interrupts
Now, if this is not a Pi Zero, the vector table works differently. You need to read the ARM docs anyway before going off into stuff like this, but read up on the core and its modes as well as the architecture docs for whichever chip you are using. The newer Pis have an ARMv7 mode and an ARMv8 mode (AArch32 and AArch64), and each has its own challenges, but they have all been covered in the forum.
Indexed addressing mode is usually used for accessing arrays, as arrays are stored contiguously. We have an index register that gets incremented on every iteration and, when added to the base address, gives the address of the current array element.
I don't understand the actual need for this addressing mode. Why can't we do this with direct addressing? We have the base address, and we can just add 1 to it every time we access. Why do we need an indexed addressing mode, which has the overhead of an index register?
I am not sure about the instruction format for implied addressing mode. Suppose we have an instruction INC AC. Is the address of AC specified in the instruction, or is there a special opcode which means 'INC AC' so we don't include the address of AC (the accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-operand ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
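In C, the histogram loop being described is essentially this (hist, idx, and n are placeholder names of mine):
#include <stddef.h>
/* count occurrences of each byte value; the increment line is the
   indexed read-modify-write that x86-64 can do in one instruction,
   e.g. add dword [rbx + rdi*4], 1 */
void histogram(unsigned hist[256], const unsigned char *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hist[idx[i]]++;
}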
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: it's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift/add for each array index, you'd use ldr r0, [r1, r2, lsl #2] / add r0, r0, #1 / str r0, [r1, r2, lsl #2], reusing the same addressing mode for the store.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
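As a C sketch of the two styles for C[i] = A[i] + B[i] (names are placeholders; the compiler picks the actual addressing modes):
#include <stddef.h>
/* one index counter, three indexed accesses per iteration */
void add_indexed(float *C, const float *A, const float *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}
/* pointer-bumping: simple addressing, but three pointer increments */
void add_pointers(float *C, const float *A, const float *B, size_t n)
{
    const float *end = A + n;
    while (A < end)
        *C++ = *A++ + *B++;
}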
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low; see for example the various One Instruction Set Computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.)
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].)
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have an instruction INC AC. Is the address of AC specified in the instruction, or is there a special opcode which means 'INC AC' so we don't include the address of AC (the accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (STMDB / LDMIA), implementing a full-descending stack. See ARM's docs for LDM/STM, which explain this and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).
I want to use the Raspberry Pi Compute Module 3 (CM3) for an industrial project.
The problem is that the 4GB of eMMC (connected to SD0, the Broadcom private bus) is not enough.
I want to connect an additional SD card (8GB) through the second SD interface, SD1 (GPIO 22 to 27 in ALT3).
The problem is that with this connection and with the default Raspbian Lite jessie (kernel 4.4), the connected SD card is not recognized.
I tried to set the GPIO alternate function (ALT3) with the raspi-gpio CLI, but with no results.
What is the problem?
We are using the CM3L version (no on-board flash), and my references are to the schematic titled "Raspberry Pi Compute Module 3 (reduced)", dated 10-13-2016.
The CM3 (with on-board flash) cannot access an external SD card because the control lines are not brought out to the card-edge pins. We modified our CM3 samples, turning them into CM3L units, with the following steps to remove the on-board flash and to bring the control lines to the card-edge pins (notes taken from my marked-up schematic):
To turn CM3 into CM3L:
Move R24 to R25 position
Short R12, R16, R17, R18, R19
Remove U7 (BGA Flash)
Not documented, but seems to be necessary: R9 should be zero ohms, and R8 is listed as a 2.2k pull-up but seems to be zero ohms. Move R8 to the R9 position (or maybe just short across the R9 pads).
Possible using other GPIO, but not SD0; e.g. the dev board won't do it without modification.
See this thread. The other answer isn't ideal IMHO, as you can't use both interfaces and you are permanently modifying your Compute Module.
You can have the 2nd SDIO peripheral at GPIO 22-27 or 34-39.
https://www.raspberrypi.org/forums/viewtopic.php?t=172406
I'm working with the STM32F427 USART1 via the following class:
void DebugUartOperator::Init() {
// for USART1 and USART6
::RCC_APB2PeriphClockCmd(RCC_APB2Periph_USART1, ENABLE);
// USART1 via PORTA
::RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_GPIOA, ENABLE);
::GPIO_PinAFConfig(GPIOA, GPIO_PinSource9, GPIO_AF_USART1);
::GPIO_PinAFConfig(GPIOA, GPIO_PinSource10, GPIO_AF_USART1);
GPIO_InitTypeDef GPIO_InitStruct;
// fills the struct with the default vals:
// all pins, mode IN, 2MHz, PP, NOPULL
::GPIO_StructInit(&GPIO_InitStruct);
// mission-specific settings:
GPIO_InitStruct.GPIO_Pin = GPIO_Pin_9 | GPIO_Pin_10;
GPIO_InitStruct.GPIO_Mode = GPIO_Mode_AF;
::GPIO_Init (GPIOA, &GPIO_InitStruct);
USART_InitTypeDef USART_InitStruct;
// 9600/8/1/no parity/no HWCtrl/rx+tx
::USART_StructInit(&USART_InitStruct);
USART_InitStruct.USART_BaudRate = 921600;
USART_InitStruct.USART_WordLength = USART_WordLength_9b;
USART_InitStruct.USART_StopBits = USART_StopBits_1;
USART_InitStruct.USART_Parity = USART_Parity_Odd;
::USART_Init(USART1, &USART_InitStruct);
::USART_Cmd(USART1, ENABLE);
}
void DebugUartOperator::SendChar(char a) {
// wait for TX register to become empty
while(::USART_GetFlagStatus(USART1, USART_FLAG_TXE) != SET);
::USART_SendData(USART1, static_cast<uint8_t>(a));
}
The problem is that every now and then the USART starts ignoring the actual 8th data bit and setting it as a parity bit (odd parity, to be specific). The strangest thing of all is that it sometimes happens even after a long power-off, without any prior reprogramming or anything. For example, yesterday evening it was all OK; then the next morning I come to work, switch the device on, and it starts working the way described. But it's not limited to this; it may randomly appear after some later restart.
That effect is clearly visible with the oscilloscope and with different UART-USB converters used with different programs. Once this effect has appeared, it is even possible to reprogram the microcontroller to transmit test data sets, for example 0x00 to 0xFF in an endless cycle; it does not affect the problem. Changing speeds (down to 9600 bps), bits per word, or parity control does not help. The effect remains intact even after reprogramming (resulting, for example, in a really abnormal 2 parity bits per byte), at least while the USART is being initialized and used in the usual order according to my program's workflow.
The only way to fix it is to make the main() function do the following:
int main() {
DebugUartOperator dua;
dua.Init();
while(1) {
uint8_t i = 0; // must be initialized; ++i then walks 1..255
while(++i)
dua.SendChar(i);
dua.SendChar(i);
}
}
With this, after reprogramming and restart, the first few bytes (up to 5) are transmitted rotten, but then everything works pretty well and continues to work well through further restarts and reprograms.
This effect is observed on 2 different STM32F427s on 2 physically different boards of the same layout. No regularity is noticed in its appearance. Signal polarity and levels conform to the USART requirements; no noise or bad contacts were detected during investigation. There seems to be no interference with USART1 from the rest of the code in my program (either mine or library code), or it is buried deeply. CMSIS-RTOS is used as the RTOS in the project, with Keil uVision 5.0.5's RTX.
Need help.
In STM32s you can specify the word length for USART/UART transmission, but the word length is the sum of the data bits and the parity bit. So if you would like to have 8 data bits and an even parity bit, you have to specify UART_WORDLENGTH_9B and UART_PARITY_EVEN.
You can also have 9 bits of data with no parity. In the reference manual for the F427, section 30.6.4, Bit 12, we see that it is possible to set 9 data bits, where the term "data bits" also covers the parity bit:
Bit 12 M: Word length
This bit determines the word length. It is set or cleared by software.
0: 1 Start bit, 8 Data bits, n Stop bit
1: 1 Start bit, 9 Data bits, n Stop bit
The final answer is in section 30.6.4, Bit 10:
This bit selects the hardware parity control (generation and
detection). When the parity control is enabled, the computed parity is
inserted at the MSB position (9th bit if M=1; 8th bit if M=0) and
parity is checked on the received data. This bit is set and cleared by
software. Once it is set, PCE is active after the current byte (in
reception and in transmission).
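In terms of the SPL calls used in the question, the two sane configurations for 8 data bits look like this (a sketch, reusing the question's init code):
USART_InitTypeDef init;
::USART_StructInit(&init);                   // defaults: 9600/8/1/no parity
// 8 data bits, no parity: M = 0
init.USART_WordLength = USART_WordLength_8b;
init.USART_Parity     = USART_Parity_No;
// ...or 8 data bits + odd parity: parity occupies the 9th bit, so M = 1
// init.USART_WordLength = USART_WordLength_9b;
// init.USART_Parity     = USART_Parity_Odd;
::USART_Init(USART1, &init);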
I have trouble understanding the following assembly code, which is used to add two integers using registers. It's not a very cumbersome question; it's just that I lack any good reference to learn the syntax. If you can provide me with insight line by line, I would be extremely grateful.
MOV R1, #100
MOV R2, #100
MOV (R1), #50
ADD R2,(R1)
I get the first two lines, which will store the number 100 in the given registers; I just don't get the purpose of the brackets in the next two lines.
And this is not homework, just a question to clarify the theory behind it.
The question is: what are the values of R1 and R2 after the instructions have been executed?
I found the following explanation on another website, which helped me a lot to understand the use of brackets. I believe it would be very clarifying for other people too, so I will post it below:
Let's analyze this program:
MOV AX, 47104
MOV DS, AX
MOV [3998], 36
INT 32
... The first instruction, MOV AX, 47104, tells the computer to copy the number 47104 into the location AX. The next instruction, MOV DS, AX, tells the computer to copy the number in AX into the location DS. The next instruction, MOV [3998], 36 tells the computer to put the number 36 into memory location 3998. Finally, INT 32 exits the program by returning to the operating system.
Before we go on, I would like to explain just how this program works. Inside the CPU are a number of locations, called registers, which can store a number. Some registers, such as AX, are general purpose, and don't do anything special. Other registers, such as DS, control the way the CPU works.
DS just happens to be a segment register, and is used to pick which area of memory the CPU can write to. In our program, we put the number 47104 into DS, which tells the CPU to access the memory on the video card.
The next thing our program does is to put the number 36 into location 3998 of the video card's memory. Since 36 is the code for the dollar sign, and 3998 is the memory location of the bottom right hand corner of the screen, a dollar sign shows up on the screen a few microseconds later.
Finally, our program tells the CPU to perform what is called an interrupt. An interrupt is used to stop one program and execute another in its place. In our case, we want interrupt 32, which ends our program and goes back to MS-DOS, or whatever other program was used to start our program.
We can see from this example that the use of brackets resulted in writing a value into a memory location, not into a register. Later, this value was read by the video card to display a symbol on the screen.
Credits to the writer on: http://www.swansontec.com/sprogram.html
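To tie that back to the question's four instructions, here is the same idea modeled as a C sketch (mem stands for memory, r1/r2 for the registers):
unsigned mem[256];
unsigned r1, r2;
r1 = 100;        /* MOV R1, #100 */
r2 = 100;        /* MOV R2, #100 */
mem[r1] = 50;    /* MOV (R1), #50 : brackets mean "at the address held in R1" */
r2 += mem[r1];   /* ADD R2, (R1)  : R2 = 100 + 50 = 150 */
So afterwards R1 = 100 and R2 = 150; R1 itself never changes, it is only used as an address.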
I did a computing-intensive app using OpenCV for iOS. Of course it was slow, something like 200 times slower than my PC prototype. So I set about optimizing it, and from the initial 15 seconds I got it down to 0.4 seconds. I wonder if I found everything, and what others may want to share. What I did:
Replaced "double" data types inside OpenCV to "float". Double is 64bit and 32bit CPU cannot easily handle them, so float gave me some speed. OpenCV uses double very often.
Added "-mpfu=neon" to compiler options. Side-effect was new problem that emulator compiler does not work anymore and anything can be tested on native hardware only.
Replaced the sin() and cos() implementations with 90-value lookup tables (see the sketch just after this list). The speedup was huge! This is somewhat the opposite of the PC, where such optimizations give no speedup. There was code working in degrees that converted values to radians for sin() and cos(); that code was removed too. The lookup tables did the job.
Enabled "thumb optimizations". Some blog posts recommend exactly opposite but this is because thumb makes things usually slower on armv6. armv7 is free of any problems and makes things just faster and smaller.
To make sure the Thumb optimizations and -mfpu=neon work at their best and do not introduce crashes, I removed the armv6 target completely. All my code is compiled for armv7, and this is also listed as a requirement in the App Store. This means the minimum iPhone will be the 3GS. I think it is OK to drop the older ones; anyway, they have slower CPUs, and a CPU-intensive app provides a bad user experience if installed on an old device.
Of course I use the -O3 flag.
I deleted "dead code" from OpenCV. Often when optimizing OpenCV I see code which is clearly not needed for my project. For example often there is a extra "if()" to check for pixel size being 8 bit or 32 bit and I know that I need 8bit only. This removes some code, provides optimizer better chance to remove something more or replace with constants. Also code fits better into cache.
Any other tricks and ideas? For me, enabling Thumb and replacing trigonometry with lookups were the big boosts and surprised me. Maybe you know something more that makes apps fly?
If you are doing a lot of floating point calculations, it would benefit you greatly to use Apple's Accelerate framework. It is designed to use the floating point hardware to do calculations on vectors in parallel.
I will also address your points one by one:
1) This is not because of the CPU per se; it is because, as of the armv7 era, only 32-bit floating-point operations are calculated in the floating-point hardware (Apple changed the hardware). 64-bit operations are calculated in software instead. In exchange, 32-bit operations got much faster.
2) NEON is the name of the new SIMD floating-point instruction set.
3) Yes, this is a well-known method. An alternative is to use Apple's framework that I mentioned above. It provides sin and cos functions that calculate 4 values in parallel. The algorithms are fine-tuned in assembly and NEON, so they give the maximum performance while using minimal battery.
4) The new armv7 implementation of Thumb (Thumb-2) doesn't have the drawbacks of armv6. The disabling recommendation only applies to v6.
5) Yes, considering 80% of users are on iOS 5.0 or above now (armv6 devices ended support at 4.2.1), that is perfectly acceptable for most situations.
6) This happens automatically when you build in release mode.
7) Yes, this won't have as large an effect as the above methods though.
My recommendation is to check out Accelerate. That way you can make sure you are leveraging the full power of the floating point processor.
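For example, the vectorized sine in Accelerate's vForce fills a whole buffer in one call; a minimal sketch:
#include <Accelerate/Accelerate.h>
/* out[i] = sinf(in[i]) for i in [0, n), computed with the vector FPU */
void buffer_sin(float *out, const float *in, int n)
{
    vvsinf(out, in, &n);
}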
Some feedback on the previous posts. This expands on the idea about dead code in point 7, which was meant as a slightly wider point. I need formatting, so the comment form can't be used. This code was in OpenCV:
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
I wanted to see how it looks in assembly. To make sure I could find it in the assembly output, I wrapped it like this:
__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
vec[kk] = 0;
}
__asm__("#stop");
Now I press "Product -> Generate Output -> Assembly file" and what I get is:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
ldr r0, [sp, #84]
movs r1, #0
ldr r0, [r0, #16]
ldr r0, [r0, #28]
cmp r0, #4
mov r0, r4
blo LBB14_71
LBB14_70:
Ltmp1916:
ldr r3, [sp, #84]
movs r2, #0
Ltmp1917:
str r2, [r0], #4
adds r1, #1
Ltmp1918:
Ltmp1919:
ldr r2, [r3, #16]
ldr r2, [r2, #28]
lsrs r2, r2, #2
cmp r2, r1
bgt LBB14_70
LBB14_71:
Ltmp1920:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
A lot of code. I printf-ed out the value of (int)(descriptors->elem_size/sizeof(vec[0])) and it was always 64. So I hardcoded it to be 64 and looked at the assembly again:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
vldr.32 s16, LCPI14_7
mov r0, r4
movs r1, #0
mov.w r2, #256
blx _memset
# InlineAsm Start
#stop
# InlineAsm End
As you can see, the optimizer now got the idea and the code became much shorter; it was able to vectorize this. The point is that the compiler does not always know which inputs are constants, if it is something like a webcam frame size or pixel depth, but in reality, in my contexts, they are usually constant, and all I care about is speed.
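A minimal sketch of the idea (the 64 and the enum name are mine, standing in for whatever your context guarantees):
enum { DESC_ELEMS = 64 };   /* known at build time for my camera/format */
for (int kk = 0; kk < DESC_ELEMS; kk++)
    vec[kk] = 0;            /* the compiler can now emit a vectorized clear */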
I also tried Accelerate as suggested replacing three lines with:
__asm__("#start");
vDSP_vclr(vec,1,64);
__asm__("#stop");
Assembly now looks:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1917:
str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
movs r1, #1
movs r2, #64
blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
Unsure if this is faster than bzero, though. In my context this part does not take much time, and the two variants seemed to work at the same speed.
One more thing I learned is using the GPU. More about it here: http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework