Copy function from IAR stm32f2/f4 flash to ram and run it - stm32

I want to copy a function from Flash to RAM and RUN it.
I know that IAR includes the __ramfunc type for functions that allows you to define a function in RAM but i dont want to use it for 2 reasons:
RAM funcs are using RAM memory that i use only at initialization
After upgrading 2 times the code (i'm doing a firmware update system) the __ramfunc is giving me a wrong location.
Basically what i want is to declare the function as flash and then in runtime copy it to memory and run it. I have the next code:
void (*ptr)(int size);
ptr=(void (*)(int size))&CurrentFont;
memset((char *) ptr,0xFF,4096);
Debugprintf("FLASH FUNC %X",GrabarFirmware);
Debugprintf("RAM FUNC %X",ptr);
char *ptr1=(char *)ptr,*ptr2=(char *)GrabarFirmware;
//Be sure that alignment is right
unsigned int p=(int )ptr2;
p&=0xFFFFFFFE;
ptr2=(char *)p;
for(int i=0;i<4096;i++,ptr1++,ptr2++)
*ptr1=*ptr2;
FLASH_Unlock();
// Clear pending flags (if any)
FLASH_ClearFlag(FLASH_FLAG_EOP | FLASH_FLAG_OPERR | FLASH_FLAG_WRPERR | FLASH_FLAG_PGAERR | FLASH_FLAG_PGPERR|FLASH_FLAG_PGSERR);
ptr(*((unsigned int *)(tempptrb+8)));
As details:
sizeof of a function doesn't work
linker returned me wrong functions addresses (odd addresses). Checking with the debugging tools i noticed that it was wrong, this is why i do the &0xFFFFFFFE.
After this code the function is perfectly copied to RAM, exactly the same code but when i run it with this:
ptr(*((unsigned int *)(tempptrb+8)));
I get an exception HardFault_Handler. After a lot of tests I was not able to fix this hardfault exception.
Checking the asm code I noticed that calls to __ramfunc and to normal flash functions is different and maybe the reason to get the HardFault exception.
This is the way that is is being called when defined as flash:
4782 ptr(*((unsigned int *)(tempptrb+8)));
\ 000000C6 0x6820 LDR R0,[R4, #+0]
\ 000000C8 0x6880 LDR R0,[R0, #+8]
\ 000000CA 0x47A8 BLX R5
4783 //(*ptr)();
Now if i call directly it define the code as a __ramfunc and directly i call it:
4786 GrabarFirmware(*((unsigned int *)(tempptrb+8)));
\ 0000007A 0x6820 LDR R0,[R4, #+0]
\ 0000007C 0x6880 LDR R0,[R0, #+8]
\ 0000007E 0x.... 0x.... BL GrabarFirmware
The reason for the exception is probably that I'm jumping from Flash to RAM and probably it is a cortex protection but when using the __ramfunc modifier I'm doing exactly that too, and debugging step by step, it doesnt jumps to the function in RAM, directly jumps to the exception as soon as I call the function.
A way to skip this would be a "goto" to the RAM memory. I tried it mixing C and ASM in C with asm("...") function but getting errors, and probably I would get the hardfault exception.
Any tip would be welcomed.

ptr was EVEN address, the first thing the processor is going to do is FAULT, you must jump to an ODD address to indicate it is 16-bit Thumb code, not 32-bit ARM code.
This was the problem here too, but not easy to find since it made a reference to a BOARD. Thanks to imbearr for finding it.
Here in official stm32 forums you can find more information about this

Related

ARM Eclipse debugging code in the RAM. Is it possible to see the source code`

I have a problem when try to debug the code which is copied to the SRAM and executed from there.
The code is overwriting the data - but it is done only during the system update. The sections where code is placed are correctly defined in the linker script file and the debuger correctly see the addresses. But when I step into the function (and the code in RAM is the correct one) it does not connect the source files with the code executed in the memory.
Do you know how can it be done. Debugging C code on the assembler level is not something which makes me happy :)
Any help appreciated.
The problem is a bit silly. When you call RAM function from the FLASH (the first call has to be done this way) it has to be done by the veneer. It was messing up the debugger. But having own calling macro (because of the distance it has to be done via the pointer) everything works fine
example calling macro.
#define RAMFCALL(func, ...) {unsigned (* volatile fptr)() = (unsigned (* volatile)())func; fptr(__VA_ARGS__);}

pycuda shared memory up to device hard limit

This is an extension of the discussion here: pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Is there a method in pycuda that is equivalent to the following C++ API call?
#define SHARED_SIZE 0x18000 // 96 kbyte
cudaFuncSetAttribute(func, cudaFuncAttributeMaxDynamicSharedMemorySize, SHARED_SIZE)
Working on a recent GPU (Nvidia V100), going beyond 48 kbyte shared memory requires this function attribute be set. Without it, one gets the same launch error as in the topic above. The "hard" limit on the device is 96 kbyte shared memory (leaving 32 kbyte for L1 cache).
There's a deprecated method Fuction.set_shared_size(bytes) that sounds promising, but I can't find what it's supposed to be replaced by.
PyCUDA uses the driver API, and the corresponding function call for setting a function dynamic memory limits is cuFuncSetAttribute.
I can't find that anywhere in the current PyCUDA tree, and therefore suspect that it has not been implemented.
I'm not sure if this is what you're looking for, but this might help someone looking in this direction.
The dynamic shared memory size in PyCUDA can be set either using:
shared argument in the direct kernel call (the "unprepared call"). For example:
myFunc(arg1, arg2, shared=numBytes, block=(1,1,1), grid=(1,1))
shared_size argument in the prepared kernel call. For example:
myFunc.prepared_call(grid, block, arg1, arg2, shared_size=numBytes)
where numBytes is the amount of memory in bytes you wish to allocate at runtime.

stm32f4 HardFault_Handler - need debugging advice

I'm working on a project based on the stm32f4discovery board using IAR Embedded Workbench (though I'm very close to the 32kb limit on the free version so I'll have to find something else soon). This is a learning project for me and so far I've been able to solve most of my issues with a few google searches and a lot of trial and error. But this is the first time I've encountered a run-time error that doesn't appear to be caused by a problem with my logic and I'm pretty stuck. Any general debugging strategy advice is welcome.
So here's what happens. I have an interrupt on a button; each time the button is pressed, the callback function runs my void cal_acc(uint16_t* data) function defined in stm32f4xx_it.c. This function gathers some data, and on the 6th press, it calls my void gn(float32_t* data, float32_t* beta) function. Eventually, two functions are called, gn_resids and gn_jacobian. The functions are very similar in structure. Both take in 3 pointers to 3 arrays of floats and then modify the values of the first array based on the second two. Unfortunately, when the second function gn_jacobian exits, I get the HardFault.
Please look at the link (code structure) for a picture showing how the program runs up to the fault.
Thank you very much! I appreciate any advice or guidance you can give me,
-Ben
Extra info that might be helpful below:
Running in debug mode, I can step into the function and run through all the lines click by click and it's OK. But as soon as I run the last line and it should exit and move on to the next line in the function where it was called, it crashes. I have also tried rearranging the order of the calls around this function and it is always this one that crashes.
I had been getting a similar crash on the first function gn_resids when one of the input pointers pointed to an array that was not defined as "static". But now all the arrays are static and I'm quite confused - especially since I can't tell what is different between the gn_resids function that works and the gn_jacobian function that does not work.
acc1beta is declared as a float array at the beginning of main.c and then also as extern float32_t acc1beta[6] at the top of stm32f4xx_it.c. I want it as a global variable; there is probably a better way to do this, but it's been working so far with many other variables defined in the same way.
Here's a screenshot of what I see when it crashes during debug (after I pause the session) IAR view at crash
EDIT: I changed the code of gn_step to look like this for a test so that it just runs gn_resids twice and it crashes as soon as it gets to the second call - I can't even step into it. gn_jacobian is not the problem.
void gn_step(float32_t* data, float32_t* beta) {
static float32_t resids[120];
gn_resids(resids, data, beta);
arm_matrix_instance_f32 R;
arm_mat_init_f32(&R, 120, 1, resids);
// static float32_t J_f32[720];
// gn_jacobian(J_f32, data, beta);
static float32_t J_f32[120];
gn_resids(J_f32, data, beta);
arm_matrix_instance_f32 J;
arm_mat_init_f32(&J, 120, 1, J_f32);
Hardfaults on Cortex M devices can be generated by various error conditions, for example:
Access of data outside valid memory
Invalid instructions
Division by zero
It is possible to gather information about the source of the hardfault by looking into some processor registers. IAR provides a debugger macro that helps to automate that process. It can be found in the IAR installation directory arm\config\debugger\ARM\vector_catch.mac. Please refer to this IAR Technical Note on Debugging Hardfaults for details on using this macro.
Depending on the type of the hardfault that occurs in your program you should try to narrow down the root cause within the debugger.

Unaligned accesses are not detected by Raspberry PI version 1

I'm performing a set of activities to make sure Redis runs well in a set of embedded systems, including the Raspberry PI. In order to fix certain code paths of Redis where unaligned memory accesses are performed (due to a change introduced in Redis 3.2) I'm trying to force the PI to either log a message on unaligned memory accesses or send a signal to the process when this happens. In this way I can both make sure that Redis will run well where unaligned accesses are a violation, and that it will run faster in platforms where instead such accesses can be performed but are slower. ARM v6, the one used in the PI v1, is apparently able to deal with unaligned memory accesses, so if I use following command to configure Linux in order to sent a signal to the process performing the unaligned access:
echo 4 > /proc/cpu/alignment
And then run the following program:
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv) {
char *buf = "foobareklsjdfklsjdfslkjfskdljfskdfjdslkjfdslkjfsd";
uint32_t *l = (uint32_t*) (buf+1);
printf("%p\n", l);
printf("%d\n", (int)*l);
return 0;
}
I can't see any signal received by the process, or the counters at /proc/cpu/alignment incrementing.
My guess is that this is due to ARM v6 ability to deal with unaligned addresses automatically, if a given CPU configuration flag is set. My question is, is my hypothesis correct? And if so, how to force a PI version 1 to actually raise an exception in case of unaligned accesses so that the Linux kernel can trap it and send a signal, log the access, and so forth, according to /proc/cpu/alignment settings?
EDIT: It is worth to note that not all the instructions can perform unaligned accesses even in ARM v6. For instance STMDB, STMFD, LDMDB, LDMEA and similar multiple words instructions will indeed raise an exception and will be trapped by the Linux kernel.
I think I eventually found my answers:
Yes I'm correct, up to the word size ARM v6 (or greater) can silently handle the unaligned accesses so no trap is generated and is completely transparent for the Linux kernel. Nothing will be logged, nor the traps counter in /proc/cpu/alignment will be incremented.
AFAIK there is no way I can force the kernel to trap word-sized unaligned accesses, since to do that apparently the CPU should be configured in order to trap the unaligned addresses in every case, but the Linux kernel does not do that AFAIK, probably because there is alignment unsafe code inside the kernel itself. Checking the Linux kernel source code indeed one can see:
if (cpu_is_v6_unaligned()) {
set_cr(__clear_cr(CR_A));
ai_usermode = safe_usermode(ai_usermode, false);
}
What this means is that the SCTLR.A bit is always cleared, so no trap
will be generated for unaligned accesses ARM v6 can handle.
There are a great deal of instructions that will still generate traps when used with unaligned addresses, for example multi store/load instructions, loading and storing of double values.
However, there are instructions that GCC (the version shipped in the default Raspberry Linux distribution) will happily produced that are not handled by the Linux kernel correctly, that will result in a SIGBUS generated even when /proc/cpu/alignment is set to fix the access.
So point number 4 basically means that, it is not a good idea to fix programs to run in ARM v6 just letting the Linux kernel handle unaligned addresses for us, even when the performance implications of unaligned addresses are not a problem: the program can still crash since not all the instructions are handled.
How to reliably find all the unaligned accesses in a program remains an open question AFAIK, since unfortunately, the otherwise wonderful valgrind program, never implemented this feature. In the past I had to use QEMU emulating Sparc, however this is a very slow process. Valgrind would be the trivial way to do that.

release build variable corruption when using ne10 math library assembly function

has anyone experience the following issue?
A stack variable getting changed/corrupted after calling ne10 assembly function such as ne10_len_vec2f_neon?
e.g
float gain = 8.0;
ne10_len_vec2f_neon(src, dst, len);
after the call to ne10_len_vec2f_neon, the value of gain changes as its memory is getting corrupted.
1. Note this only happens when the project is compiled in release build but not debug build.
2. Does Ne10 assembly functions preserve registers?
3. Replacing the assembly function call to c equivalent such as ne10_len_vec2f_c and both release and debug build seem to work OK.
thanks for any help on this. Not sure if there's an inherent issue within the program or it is really the call to ne10_len_vec2f_neon causing the corruption with release build.enter code here
I had a quick rummage through the master NEON code here:
https://github.com/projectNe10/Ne10/blob/master/modules/math/NE10_len.neon.s
... and it doesn't really touch address-based stack at all, so not sure it's a stack problem in memory.
However based on what I remember of the NEON procedure call standard q4-q7 (alias d8-d15 or s16-s31) should be preserved by the callee, and as far as I can tell that code is clobbering q4-6 without the necessary save/restore, so it does indeed look like it's clobbering the stack in registers.
In the failed case do you know if gain is still stored in FPU registers, and if yes which ones? If it's stored in any of s16/17/18/19 then this looks like the problem. It also seems plausible that a compiler would choose to use s16 upwards for things it needs to keep across a function call, as it avoids the need to touch in-RAM stack memory.
In terms of a fix, if you perform the following replacements:
s/q4/q8/
s/q5/q9/
s/q6/q10/
in that file, then I think it should work; no means to test here, but those higher register blocks are not callee saved.