Configuring USART to send a character in ARM Thumb assembly - STM32

I am trying to send a character using my STM32. I am using the RealTerm serial capture program with a baud rate of 9600.
I have attempted to write the initialization for the USART and GPIOA. So far, when I reset my device, it sends a NULL character to the serial capture program, so I think I am on the right track at least. But I have tried writing a character to USART_DR and have had no luck seeing the character on the serial capture side.
I have been following this link as a guide (http://www.micromouseonline.com/2009/12/31/stm32-usart-basics)
And here's a little guide for the GPIO registers:
#; GPIOx
#; MODER (two bits per pin)
#; '00' -> input mode: the pin is used as an input
#; '01' -> general-purpose output mode: the pin is used as an output
#; '10' -> alternate function mode: the pin is used by a peripheral such as the USART, SPI, etc.
#; '11' -> analog mode: the pin is used as an analog input
#; OTYPER
#; '0' -> output push-pull
#; '1' -> output open-drain
#; OSPEEDR
#; 'x0' -> 2 MHz low speed
#; '01' -> 10 MHz medium speed
#; '11' -> 50 MHz high speed
#; PUPDR
#; '00' -> no pull-up/pull-down
#; '01' -> pull-up
#; '10' -> pull-down
#; '11' -> reserved
How I have configured the USART:
According to the guide, I needed to set up PA9 in alternate function mode, output push-pull, low output speed, and no pull-up/pull-down.
I then set up PA10 as a general-purpose floating input. (I do not need it at this point; I am just trying to get a character to send first.)
Next, I had to make sure the USART1 clock was enabled.
I found that RCC_APB2ENR (the RCC APB2 peripheral clock enable register) is located at offset 0x44 from RCC_BASE. I enabled it like so:
#; make sure the USART1 clock is enabled
ldr r3, =RCC_BASE
ldr r2, [r3, #RCC_APB2ENR]
orr r2, #(1<<4) #; set USART1EN (bit 4)
str r2, [r3, #RCC_APB2ENR]
I then set the baud rate and enabled the UE, TE, and RE bits in CR1.
#; load the baud rate (9600): baud = fclk/(16*USARTDIV), fclk = 16*10^6
#; USARTDIV = 16e6/(16*9600) = 104.17 -> mantissa 104 (0x68), fraction 0.17*16 ~= 3 (0x3)
ldr r3, =USART1_BASE
mov r2, #0x683 #; mantissa [15:4] = 0x68, fraction [3:0] = 0x3
str r2, [r3, #USART_BRR]
#; enable the UE bit (USART_CR1 bit 13)
ldr r2, [r3, #USART_CR1]
orr r2, #(1<<13)
str r2, [r3, #USART_CR1]
#; enable the TE bit (USART_CR1 bit 3)
ldr r2, [r3, #USART_CR1]
orr r2, #(1<<3)
str r2, [r3, #USART_CR1]
#; enable the RE bit (USART_CR1 bit 2)
ldr r2, [r3, #USART_CR1]
orr r2, #(1<<2)
str r2, [r3, #USART_CR1]
I think I have set it up correctly, but to no avail: I am not seeing any characters show up in my serial capture program.
I just tried putting the character 'A' in the data register, but have received nothing.
ldr r3, =USART1_BASE
mov r2, #0x41 #; A
str r2, [r3, #USART_DR]
It's been a couple of days now trying to debug what's wrong, and I haven't found an answer. Any help would be appreciated! Thank you.

I did need to set the MODER register to alternate function mode. But what I did not do was select the alternate function for USART1.
Specifically, for pin 9 (TX) the alternate function for USART1 is set in the alternate function high register (AFRH).
More detail here: http://web.eece.maine.edu/~zhu/book/Appendix_I_Alternate_Functions.pdf
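For reference, here is a minimal C sketch of the same fix at the register level. It assumes an STM32F4-class device header, where USART1_TX on PA9 is alternate function AF7; check the alternate function table in your own datasheet:

#include "stm32f4xx.h" /* assumed device header */

void usart1_tx_pin_init(void)
{
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;       /* GPIOA clock on */
    GPIOA->MODER &= ~(3u << (9 * 2));          /* clear PA9 mode bits */
    GPIOA->MODER |=  (2u << (9 * 2));          /* '10' = alternate function */
    GPIOA->AFR[1] &= ~(0xFu << ((9 - 8) * 4)); /* clear the PA9 field in AFRH */
    GPIOA->AFR[1] |=  (7u  << ((9 - 8) * 4));  /* AF7 = USART1 */
}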

Related

STM32 SPI Data Packing

I can't get the SPI on my STM32F3 Discovery board (datasheet) to work with the gyroscope sensor (I3G4250D) at the register level. I know I'm sending data, since I'm in full duplex and receive dummy bytes from the sensor when using 16-bit data packing. But when I try to receive using 8-bit access to the DR register, I get inconsistent values from the sensor: sometimes one byte 0xff, other times 2 bytes 0xffff (at least I think that's what's happening), but never real values from the sensor register I want to read. I think this has to do with the automatic data packing of the SPI on my chip, and I thought I was addressing that by accessing the DR register through a uint8_t*, but it doesn't seem to work.
I also want to ask: comparing the SPI protocol in the sensor datasheet (page 24) and the STM32 datasheet (page 729), I infer that both the CPOL (clock polarity) and CPHA (clock phase) bits should be set in the STM32 SPI, yet I seem to be able to at least send data with or without these bits set.
Here is my SPI Initialization function which includes trying to read bytes at the end of it and a write a byte to sensor register function:
/* forward declaration so the snippet compiles as shown */
static void SPI_WriteByte(uint8_t regAdd, uint8_t data);

void SPI_Init() {
    /* Peripheral clock enable */
    RCC->AHBENR |= RCC_AHBENR_GPIOEEN | RCC_AHBENR_GPIOAEN;
    RCC->APB2ENR |= RCC_APB2ENR_SPI1EN;
    /* GPIO configuration */
    GPIOA->MODER |= GPIO_MODER_MODER5_1 | GPIO_MODER_MODER6_1 | GPIO_MODER_MODER7_1; // alternate function
    GPIOA->OSPEEDR |= GPIO_OSPEEDER_OSPEEDR5 | GPIO_OSPEEDER_OSPEEDR6 | GPIO_OSPEEDER_OSPEEDR7; // high speed
    GPIOA->AFR[0] |= 0x00500000 | 0x05000000 | 0x50000000; // AF5 for SCK, MISO, MOSI
    GPIOE->MODER |= GPIO_MODER_MODER3_0; // port E pin 3 as output for the NSS pin
    /* SPI configuration */
    SPI1->CR2 |= SPI_CR2_FRXTH | SPI_CR2_RXDMAEN; // enable DMA, but DMA is not used
    // not sure if I need this? |SPI_CR1_CPOL|SPI_CR1_CPHA;
    SPI1->CR1 |= SPI_CR1_BR_1 | SPI_CR1_SSM | SPI_CR1_SSI | SPI_CR1_MSTR | SPI_CR1_SPE; // BR = fPCLK/8 (~6 MHz); software NSS, so SSI is held high for master mode
    /* Slave device initialization */
    SPI_WriteByte(CTRL_REG1_G, 0x9f);
    SPI_WriteByte(CTRL_REG4_G, 0x10);
    SPI_WriteByte(CTRL_REG5_G, 0x10);
    // receive test
    uint8_t test = 0xff;
    volatile uint8_t* spiDrPtr = (volatile uint8_t*)&SPI1->DR;
    *spiDrPtr = 0x80 | CTRL_REG1_G;
    while (!(SPI1->SR & SPI_SR_TXE)) {}
    //SPI1->CR2 &= ~(SPI_CR2_FRXTH); // this is done in HAL, not sure why though
    *spiDrPtr = test; // send dummy
    while (!(SPI1->SR & SPI_SR_RXNE)) {}
    test = *spiDrPtr;
}

static void SPI_WriteByte(uint8_t regAdd, uint8_t data) {
    uint8_t arr[2] = {regAdd, data}; // 16-bit data packing
    SPI1->DR = *((uint16_t*)arr);
}
Any suggestions?
Do not enable DMA if you do not use it.
You need to force 16-bit access (not 32-bit):
static void SPI_WriteByte(uint8_t regAdd, uint8_t data) {
    uint8_t arr[2] = {regAdd, data}; // 16-bit data packing
    *(volatile uint16_t *)&SPI1->DR = *((volatile uint16_t*)arr);
}
Try using FRXTH = 0, do all DR reads and writes as 16-bit words, and just discard the first byte.
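For illustration, a minimal sketch of that approach, assuming the STM32F3 CMSIS names and the documented FIFO packing (the first byte received ends up in the low half of a 16-bit read):

static uint16_t SPI_Transfer16(uint16_t out)
{
    while (!(SPI1->SR & SPI_SR_TXE)) {}     /* wait for TX space */
    *(volatile uint16_t *)&SPI1->DR = out;  /* 16-bit write: two bytes packed */
    while (!(SPI1->SR & SPI_SR_RXNE)) {}    /* FRXTH = 0: RXNE fires at 16 bits */
    return *(volatile uint16_t *)&SPI1->DR; /* 16-bit read drains both bytes */
}

uint8_t SPI_ReadReg(uint8_t regAdd)
{
    /* low byte is sent first: address with read bit, then a dummy byte */
    uint16_t rx = SPI_Transfer16((uint16_t)(0x80u | regAdd) | 0xFF00u);
    return (uint8_t)(rx >> 8);              /* discard the address-phase byte */
}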
As for the later question about CPOL/CPHA: these bits control the SPI transfer format at the bit level (see the SPI mode timing diagram on Wikipedia; the image is not reproduced here).
CPOL = 0, CPHA = 0 is also called "SPI mode 0"; data bits are sampled on the rising SCK edge.
CPOL = 1, CPHA = 1 is also called "SPI mode 3"; data bits are likewise sampled on the rising edge.
The difference between the two modes is the SCK level between transfers, and an extra falling edge before the first bit.
Some chips state explicitly that both mode 0 and mode 3 are supported. Yet the I3G4250D datasheet, section 5.2, says: "SDI and SDO are, respectively, the serial port data input and output. These lines are driven at the falling edge of SPC and should be captured at the rising edge of SPC."
When data is sent from the MCU to the chip in mode 0, the MCU drives the MOSI line before the first rising edge, so a slave chip can receive valid data in both mode 0 and mode 3. But when data is transferred from the chip to the MCU, the chip may need the first falling SCK edge to shift/latch the first data bit onto the MISO line, and in mode 0 you would then receive the readings shifted by one bit.
I've emphasized the word 'may' because the chip could still work correctly in both modes, latching the first bit on the falling nSS edge; the manufacturer may just not have done the testing, or does not guarantee that it will work with other revisions and in all conditions.
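If you do want to follow the datasheet wording literally, selecting mode 3 is a two-bit change (a sketch; configure the bits while the peripheral is disabled):

SPI1->CR1 &= ~SPI_CR1_SPE;                /* disable before reconfiguring */
SPI1->CR1 |= SPI_CR1_CPOL | SPI_CR1_CPHA; /* CPOL = 1, CPHA = 1 -> SPI mode 3 */
SPI1->CR1 |= SPI_CR1_SPE;                 /* re-enable */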

GNU Assembler and Exception Vector Table

I have worked through the Baking Pi tutorial and studied the SVC system call. The tutorial sets the base of my program to 0x8000, but the vector table base is 0. How do I place code at 0x0 with the GNU assembler, and which kernel.ld should I use now?
Depending on the Pi, you start at 0x8000 or 0x80000 by default. There are now different filenames that guide the bootloader as to what mode you want the processor in: kernel.img, kernel7.img, kernel32.img, and various combinations you can easily look up.
Baking Pi had issues as written, but these were asked and answered many times in the Raspberry Pi website's bare-metal forums (a very good resource, the best I have seen in a long time if not ever). You will need to be using an old, old Pi or a Pi Zero to get the tutorial to work, unless it has been updated.
This is bare metal: you own the whole address space. If you want to put something at zero, you simply do that.
Another approach is to create a config.txt file and tell the bootloader in the GPU to load your image at 0x00000000 in the ARM's address space. Depending on the ARM core, you can also use the VTOR register, if present, to change where the vector table lives (for example at 0x80000 instead of 0x0000); I don't think the ARM11 in the Pi Zero or the old, old Pis allows that, though. The 32-bit mode on the newer ones does, but those are multi-core, and that will unravel any learning exercise: you have to "sort the cores," as I like to say, on boot, isolating one core to continue and putting the others in an infinite loop so they don't interfere. The boot code the GPU lays down on those Pis does this for you, so that only one core hits 0x8000 or 0x80000. So the config.txt approach is something folks contemplate, but I would recommend against it for a while.
There are a number of tutorials linked in the Raspberry Pi bare-metal forum that should take you well beyond the Baking Pi one(s) and/or help you through them, as folks struggled with those for some time.
A linker script like this
MEMORY
{
ram : ORIGIN = 0x8000, LENGTH = 0x10000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : { *(.bss*) } > ram
.data : { *(.data*) } > ram
}
with a bootstrap like this
.globl _start
_start:
    mov sp,#0x8000
    bl main
hang: b hang
should get you booted.
For the linker script you may need 0x80000 instead of 0x8000. And if you have at least one .data item, like a global variable:
unsigned int x = 5;
then the bootstrap doesn't have to zero .bss (if your programming style is such that you rely on that). Because .bss is placed before .data in the script above, objcopy will pad the -O binary file with zeros between .rodata and .data, taking care of zeroing .bss.
You can let the tools do the work for you as far as an exception table goes:
.globl _start
_start:
    ldr pc,reset_handler     @ 0x00 reset
    ldr pc,undefined_handler @ 0x04 undefined instruction
    ldr pc,swi_handler       @ 0x08 swi/svc
    ldr pc,prefetch_handler  @ 0x0C prefetch abort
    ldr pc,data_handler      @ 0x10 data abort
    ldr pc,unused_handler    @ 0x14 unused
    ldr pc,irq_handler       @ 0x18 irq
    ldr pc,fiq_handler       @ 0x1C fiq
reset_handler:      .word reset
undefined_handler:  .word hang
swi_handler:        .word hang
prefetch_handler:   .word hang
data_handler:       .word hang
unused_handler:     .word hang
irq_handler:        .word irq
fiq_handler:        .word hang
reset:
    @ copy the 16 words above (8 vectors + 8 handler addresses)
    @ from the load address 0x8000 down to the vector base at 0x0000
    mov r0,#0x8000
    mov r1,#0x0000
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}
Now, if this is not a Pi Zero, the vector table works differently. You need to read the ARM documentation before going off into stuff like this anyway: read up on the core and its modes, as well as the architecture documents, for whichever chip you are using. The newer Pis have an armv7 mode and an armv8 mode (aarch32 and aarch64), and each has its own challenges, but they have all been covered in the forum.

How to disable stm32f405 jtag interface

I have a board using an STM32F405RG. My client designed the hardware and had to use a couple of the JTAG pins (PA15 and PB4) as GPIO. I use SWD for flashing and debugging, so I would like to disable the JTAG interface and, as stated in the ST docs, "release" PA15 and PB4 to be used as GPIO outputs.
Most of my search results on disabling the JTAG interface refer to the STM32F1xx, and the F4 is much different in this area.
Since for PA15 and PB4 an AFR setting of zero selects the JTAG pin functions, how does one release them to be used as GPIO outputs?
It's true that the F1 JTAG port settings are different from the F4 series.
In the F1 series, you need to disable them in the AF remap and debug I/O configuration register. For example, the following code disables the JTAG pins but leaves SWD enabled:
RCC->APB2ENR |= RCC_APB2ENR_AFIOEN;          // enable the AFIO clock
AFIO->MAPR |= AFIO_MAPR_SWJ_CFG_JTAGDISABLE; // JTAG is disabled, SWD is enabled
In the F4 series it's easier. It's true that AF 0 selects the JTAG pin functions, but all you have to do is not select alternate function mode in the MODER registers. On power-up, PA13, PA14, PA15, PB3 and PB4 are set to alternate function mode by their corresponding MODER bits. Just select another mode (input, output, or analog) for those pins using the MODER registers.
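A sketch of that on the F4, using the CMSIS register names (PA15 and PB4 go from their reset alternate function mode to general-purpose output):

RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN | RCC_AHB1ENR_GPIOBEN;

GPIOA->MODER &= ~(3u << (15 * 2)); /* clear PA15 mode ('10' = AF after reset) */
GPIOA->MODER |=  (1u << (15 * 2)); /* '01' = general-purpose output */

GPIOB->MODER &= ~(3u << (4 * 2));  /* clear PB4 mode ('10' = AF after reset) */
GPIOB->MODER |=  (1u << (4 * 2));  /* '01' = general-purpose output */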
You have to, as you said, "release" PA15 and PB4 to be used as GPIO outputs.
I don't think the F1 and F4 are different in this matter. Configure the pins as outputs and force them high or low; this disables their ability to be driven by the JTAG adapter.

Count Cycles not matching on STM32F103C8? Prefetch buffer not working as I think?

I have been fighting this subject for a while. I am using an STM32F103C8 with the ST-Link V2 on Atollic.
I made some delay functions in assembly. I tested this piece of code with an oscilloscope on an ATSAM (84 MHz, works perfectly), and on the STM32 I also use a CPU debug unit, the DWT (Data Watchpoint and Trace), to see the exact number of cycles while debugging.
When I configure the STM32 CPU clock to 24 MHz, the cycle count I designed the delay around is exact: 1 cycle for the decrement instruction and 2 cycles for the branch instruction (in most cases), so the main loop spends 3 cycles per iteration.
When I change the CPU clock to 72 MHz, each assembly instruction takes twice as long!
The prefetch buffer is 2x64 bits, so shouldn't the flash wait states have no influence on CPU execution time (ignoring prediction and other code stalls) on this microcontroller?
At 24 MHz the flash memory needs no wait states; with a higher clock, shouldn't the CPU still not have to wait to execute code?
I also flashed the release hex to see if it made any difference and found none.
My only remaining explanation would be the ST-Link V2. Am I right?
Thanks a lot for your time and attention.
This is the piece of the code that matters:
asm (".equ fcpu, 72000000\n\t"); // 72 MHz
asm (".equ const_ms, fcpu/3000 \n\t");
asm (".equ const_us, fcpu/3000000 \n\t");

void delay_us(uint32_t valor)
{
    asm volatile ("movw r1, #:lower16:const_us \n\t"
                  "movt r1, #:upper16:const_us \n\t"
                  "mul r0, r0, r1 \n\t"
                  "r_us: subs r0, r0, #1 \n\t"
                  "bne r_us \n\t");
}

void delay_ms(uint32_t valor)
{
    asm volatile ("movw r1, #:lower16:const_ms \n\t"
                  "movt r1, #:upper16:const_ms \n\t"
                  "mul r0, r0, r1 \n\t"
                  "r_ms: subs r0, r0, #1 \n\t"
                  "bne r_ms \n\t");
}
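For reference, a sketch of how the DWT cycle counter mentioned above can be read around a call, using the standard CMSIS names on a Cortex-M3 (illustrative, not necessarily my exact measurement code):

uint32_t cycles_for(void (*fn)(uint32_t), uint32_t arg)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable the trace block */
    DWT->CYCCNT = 0;                                /* reset the cycle counter */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting */
    fn(arg);
    return DWT->CYCCNT;                             /* cycles spent in the call */
}

/* e.g. uint32_t c = cycles_for(delay_us, 10); */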
It is because of the wait states of the flash memory when running at 72 MHz. It is good to read the documentation :).
Place the code in SRAM and you will get what you expect.
For good results from flash, avoid branching, as a taken branch flushes the pipeline. This kind of delay is only good for very short waits; anything longer should be implemented using timers.
I advise avoiding delays in code altogether.
PS: the ST-Link is not guilty :)
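One way to do the SRAM placement with GNU tools is a sketch like the following. It assumes your startup code copies the .data section to RAM (the standard Atollic/CubeIDE startup files do); many ST linker scripts also provide a dedicated .RamFunc input section you could use instead. The loop label is a local numeric label so it cannot collide with the flash version:

__attribute__((section(".data"), noinline))
void delay_us_ram(uint32_t valor)
{
    asm volatile ("movw r1, #:lower16:const_us \n\t"
                  "movt r1, #:upper16:const_us \n\t"
                  "mul r0, r0, r1 \n\t"
                  "1: subs r0, r0, #1 \n\t" /* runs from SRAM: no flash wait states */
                  "bne 1b \n\t");
}

The linker inserts a long-branch veneer for the flash-to-RAM call automatically.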
I have done several tests since. My first conclusion is that the overhead depends on the alignment of the instructions in memory (the prefetch buffer is 2x64 bits).
Second, because the branch behaves deterministically, when taken it flushes the prefetch buffer and also the pipeline.

Maximum speed from IOS/iPad/iPhone

I built a computation-intensive app using OpenCV for iOS. Of course it was slow, something like 200 times slower than my PC prototype. So I optimized it down: from the initial 15 seconds I got it to 0.4 seconds. I wonder if I have found everything, and what others may want to share. What I did:
Replaced "double" data types inside OpenCV with "float". A double is 64 bits and a 32-bit CPU cannot handle them easily, so float gave me some speed. OpenCV uses double very often.
Added "-mfpu=neon" to the compiler options. A side effect was that the simulator build no longer works, so everything must be tested on real hardware.
Replaced the sin() and cos() implementations with 90-value lookup tables (a sketch of this kind of table follows the list). The speedup was huge! This is somewhat the opposite of the PC, where such optimizations do not give any speedup. There was also code working in degrees that converted values to radians for sin() and cos(); that conversion code was removed too. The lookup tables did the job.
Enabled "thumb optimizations". Some blog posts recommend exactly the opposite, but that is because Thumb usually makes things slower on armv6. armv7 is free of those problems and makes things just faster and smaller.
To make sure the Thumb optimizations and -mfpu=neon work at their best and do not introduce crashes, I removed the armv6 target completely. All my code is compiled for armv7, and this is also listed as a requirement in the App Store. This means the minimum iPhone is the 3GS, and I think it is OK to drop the older ones; they have slower CPUs anyway, and a CPU-intensive app provides a bad user experience on an old device.
Of course I use the -O3 flag.
I deleted "dead code" from OpenCV. When optimizing OpenCV I often see code that is clearly not needed for my project. For example, there is often an extra "if()" checking whether the pixel size is 8-bit or 32-bit, and I know I need 8-bit only. Removing such code gives the optimizer a better chance to remove more or replace values with constants, and the code also fits better into the cache.
Any other tricks and ideas? For me, enabling Thumb and replacing trigonometry with lookups were the boost makers and surprised me. Maybe you know something more that makes apps fly?
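Here is a sketch of the kind of degree-based lookup table I mean in point 3 (the table size and quadrant folding are illustrative, not my exact code):

#include <math.h>

static float sin_lut[91]; /* one entry per whole degree, 0..90 */

void lut_init(void)
{
    for (int d = 0; d <= 90; d++)
        sin_lut[d] = sinf((float)d * (float)M_PI / 180.0f);
}

float sin_deg(int deg) /* whole-degree sine, folded into the first quadrant */
{
    deg = ((deg % 360) + 360) % 360;
    if (deg < 90)  return  sin_lut[deg];
    if (deg < 180) return  sin_lut[180 - deg];
    if (deg < 270) return -sin_lut[deg - 180];
    return -sin_lut[360 - deg];
}

/* cos_deg(deg) is just sin_deg(deg + 90) */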
If you are doing a lot of floating point calculations, it would benefit you greatly to use Apple's Accelerate framework. It is designed to use the floating point hardware to do calculations on vectors in parallel.
I will also address your points one by one:
1) This is not because of the CPU as such: as of the armv7 era, only 32-bit floating-point operations are calculated in the floating-point hardware (because Apple replaced the hardware); 64-bit ones are calculated in software instead. In exchange, 32-bit operations got much faster.
2) NEON is the name of the new floating-point processor instruction set.
3) Yes, this is a well-known method. An alternative is to use Apple's framework that I mentioned above. It provides sin and cos functions that calculate 4 values in parallel. The algorithms are fine-tuned in assembly and NEON, so they give maximum performance while using minimal battery.
4) The new armv7 implementation of Thumb doesn't have the drawbacks of armv6. The disabling recommendation only applies to v6.
5) Yes. Considering 80% of users are on iOS 5.0 or above now (armv6 device support ended at 4.2.1), that is perfectly acceptable for most situations.
6) This happens automatically when you build in release mode.
7) Yes, though this won't have as large an effect as the methods above.
My recommendation is to check out Accelerate. That way you can make sure you are leveraging the full power of the floating point processor.
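For example, the vForce part of Accelerate computes a whole buffer of sines in one call (a sketch; note that vvsinf takes the element count by pointer):

#include <Accelerate/Accelerate.h>

void sines(float *out, const float *in, int n)
{
    vvsinf(out, in, &n); /* vectorized sinf over all n elements */
}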
Some feedback on the previous post. This expands the dead-code idea from point 7, which was meant to be a slightly broader point; I need formatting, so the comment form cannot be used. OpenCV contained code like this:
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}
I wanted to see what it looks like in assembly. To make sure I could find it in the assembly output, I wrapped it like this:
__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}
__asm__("#stop");
Now I choose "Product -> Generate Output -> Assembly File", and what I get is:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
ldr r0, [sp, #84]
movs r1, #0
ldr r0, [r0, #16]
ldr r0, [r0, #28]
cmp r0, #4
mov r0, r4
blo LBB14_71
LBB14_70:
Ltmp1916:
ldr r3, [sp, #84]
movs r2, #0
Ltmp1917:
str r2, [r0], #4
adds r1, #1
Ltmp1918:
Ltmp1919:
ldr r2, [r3, #16]
ldr r2, [r2, #28]
lsrs r2, r2, #2
cmp r2, r1
bgt LBB14_70
LBB14_71:
Ltmp1920:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
A lot of code. I printf'd the value of (int)(descriptors->elem_size/sizeof(vec[0])) and it was always 64. So I hardcoded it to 64.
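The hardcoded loop looked essentially like this (reconstructed for illustration, not the exact code):

__asm__("#start");
for( kk = 0; kk < 64; kk++ ) { /* constant trip count instead of elem_size/sizeof */
    vec[kk] = 0;
}
__asm__("#stop");

Generating the assembly output again gives: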
# InlineAsm Start
#start
# InlineAsm End
Ltmp1915:
vldr.32 s16, LCPI14_7
mov r0, r4
movs r1, #0
mov.w r2, #256
blx _memset
# InlineAsm Start
#stop
# InlineAsm End
As you can see, the optimizer now got the idea, and the code became much shorter; it turned the loop into a single memset call. The point is that the compiler does not always know which inputs are constants. Something like a camera frame size or a pixel depth can vary in principle, but in my contexts these are usually constant, and all I care about is speed.
I also tried Accelerate, as suggested, replacing the three lines with:
__asm__("#start");
vDSP_vclr(vec,1,64);
__asm__("#stop");
The assembly now looks like this:
# InlineAsm Start
#start
# InlineAsm End
Ltmp1917:
str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
movs r1, #1
movs r2, #64
blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
add.w r0, r4, #8
# InlineAsm Start
#stop
# InlineAsm End
I'm unsure whether this is faster than bzero, though. In my context this part does not take much time, and the two variants seemed to run at the same speed.
One more thing I learned is to use the GPU. More about it here: http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework