How to print the size argument of malloc with perf probe - trace

I want to trace my program to understand its memory allocation. The idea is that whenever malloc is called, the call stack is printed together with the allocated size.
This is the command I used to create the event:
perf probe -x /lib64/libc.so.6 'malloc allocated=-8(%bp):u64'
but perf report shows that the size recorded by this event is not correct. How can I fix this?
I think the problem is that the offset to the size (-8(%bp)) is wrong, but I don't know assembly, so I cannot make sense of the libc binary.
UPDATE: With a simple program like:
for (int i=0; i<10; i++)
malloc(i);
the result is correct if I compile with -O0, but not if I compile with -O3. My big program (on the order of a hundred thousand lines of code) does not give the correct result even when compiled with -O0.

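Just take the argument from the rdi register instead of looking at the frame. The probe fires at the entry of malloc, where the frame pointer still belongs to the caller, and on x86-64 the first integer argument is passed in rdi:
perf probe -x /lib64/libc.so.6 'malloc allocated=%di:u64'
On 32-bit x86, the suggestion is to use the eax register instead (the i386 System V calling convention passes arguments on the stack, so check the values against known allocation sizes):
perf probe -x /lib/i386-linux-gnu/libc.so.6 'malloc allocated=%ax:u32'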
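To check a probe like this, it can help to trace a workload whose allocation sizes are known in advance. The sketch below is one way to write it; run_allocations and g_sink are hypothetical names. It stores every pointer in a volatile variable because an optimizing compiler is allowed to drop malloc calls whose result is never used, which may be one reason (an assumption, not a diagnosis) why the toy loop behaves differently at -O3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Volatile sink: storing each pointer here is an observable side effect,
   so the compiler cannot discard the malloc calls as dead code. */
static void *volatile g_sink;

/* Allocate blocks of 0, 1, ..., count-1 bytes, free each one, and
   return the total number of bytes that were requested. */
size_t run_allocations(size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        void *p = malloc(i);
        g_sink = p;
        total += i;
        free(p);
    }
    return total;
}
```

Calling run_allocations(10) from main and recording the probe should then show the sizes 0 through 9 in order, if the argument expression is right.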

pycuda shared memory up to device hard limit

This is an extension of the discussion here: pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Is there a method in pycuda that is equivalent to the following C++ API call?
#define SHARED_SIZE 0x18000 // 96 kbyte
cudaFuncSetAttribute(func, cudaFuncAttributeMaxDynamicSharedMemorySize, SHARED_SIZE)
Working on a recent GPU (Nvidia V100), going beyond 48 kbyte shared memory requires this function attribute be set. Without it, one gets the same launch error as in the topic above. The "hard" limit on the device is 96 kbyte shared memory (leaving 32 kbyte for L1 cache).
There's a deprecated method Function.set_shared_size(bytes) that sounds promising, but I can't find what it's supposed to be replaced by.
PyCUDA uses the driver API, and the corresponding driver call for setting a function's dynamic shared memory limit is cuFuncSetAttribute.
I can't find that anywhere in the current PyCUDA tree, and therefore suspect that it has not been implemented.
I'm not sure if this is what you're looking for, but this might help someone looking in this direction.
The dynamic shared memory size in PyCUDA can be set using either:
the shared argument in the direct kernel call (the "unprepared call"). For example:
myFunc(arg1, arg2, shared=numBytes, block=(1,1,1), grid=(1,1))
the shared_size argument in the prepared kernel call. For example:
myFunc.prepared_call(grid, block, arg1, arg2, shared_size=numBytes)
where numBytes is the amount of memory, in bytes, that you wish to allocate at runtime.

Copy function from IAR stm32f2/f4 flash to ram and run it

I want to copy a function from flash to RAM and run it.
I know that IAR provides the __ramfunc keyword, which lets you place a function in RAM, but I don't want to use it for two reasons:
RAM functions occupy RAM that I only use at initialization.
After upgrading the code twice (I'm writing a firmware update system), __ramfunc gives me a wrong location.
Basically, I want to declare the function in flash, copy it to RAM at run time, and run it from there. I have the following code:
void (*ptr)(int size);
ptr=(void (*)(int size))&CurrentFont;
memset((char *) ptr,0xFF,4096);
Debugprintf("FLASH FUNC %X",GrabarFirmware);
Debugprintf("RAM FUNC %X",ptr);
char *ptr1=(char *)ptr,*ptr2=(char *)GrabarFirmware;
//Be sure that alignment is right
unsigned int p=(int )ptr2;
p&=0xFFFFFFFE;
ptr2=(char *)p;
for(int i=0;i<4096;i++,ptr1++,ptr2++)
*ptr1=*ptr2;
FLASH_Unlock();
// Clear pending flags (if any)
FLASH_ClearFlag(FLASH_FLAG_EOP | FLASH_FLAG_OPERR | FLASH_FLAG_WRPERR | FLASH_FLAG_PGAERR | FLASH_FLAG_PGPERR|FLASH_FLAG_PGSERR);
ptr(*((unsigned int *)(tempptrb+8)));
Some details:
sizeof does not work on a function.
The linker gave me function addresses that looked wrong (odd addresses). Checking with the debugging tools, I saw that the code starts at the even address, which is why I apply & 0xFFFFFFFE.
After this code the function is copied to RAM correctly, byte for byte, but when I run it with this:
ptr(*((unsigned int *)(tempptrb+8)));
I get a HardFault_Handler exception. After a lot of tests I was not able to fix it.
Checking the asm code, I noticed that calls to __ramfunc functions and to normal flash functions are different, which may be the reason for the HardFault exception.
This is how the function is called when it is defined in flash:
4782 ptr(*((unsigned int *)(tempptrb+8)));
\ 000000C6 0x6820 LDR R0,[R4, #+0]
\ 000000C8 0x6880 LDR R0,[R0, #+8]
\ 000000CA 0x47A8 BLX R5
4783 //(*ptr)();
Now, if I define the function as __ramfunc and call it directly:
4786 GrabarFirmware(*((unsigned int *)(tempptrb+8)));
\ 0000007A 0x6820 LDR R0,[R4, #+0]
\ 0000007C 0x6880 LDR R0,[R0, #+8]
\ 0000007E 0x.... 0x.... BL GrabarFirmware
The reason for the exception is probably that I'm jumping from flash to RAM, and perhaps that triggers some Cortex protection. But when using the __ramfunc modifier I'm doing exactly that too. Debugging step by step, it doesn't jump to the function in RAM; it goes straight to the exception as soon as I call the function.
A way around this might be a "goto" to the RAM address. I tried mixing C and assembly with asm("...") but got errors, and I would probably get the HardFault exception anyway.
Any tips would be welcome.
ptr held an EVEN address, so the first thing the processor does is FAULT. You must jump to an ODD address to indicate that the target is 16-bit Thumb code, not 32-bit ARM code. The odd addresses reported by the linker were therefore correct: clear bit 0 only to find where the code bytes start, and keep it set in the pointer you call.
This was the problem here too, but it was not easy to find since it made a reference to a board. Thanks to imbearr for finding it.
More information about this can be found in the official STM32 forums.
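The address handling described above can be sketched with two small helpers. thumb_code_start and thumb_entry are hypothetical names, and the sketch only covers the pointer arithmetic; it does not copy or execute anything, and it assumes the copied function is position independent enough to run from its new location.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first instruction byte: clear bit 0 (the Thumb flag)
   of a function pointer value before using it as a copy source. */
static uintptr_t thumb_code_start(uintptr_t fn_addr)
{
    return fn_addr & ~(uintptr_t)1;
}

/* Address to branch to: set bit 0 so that BLX stays in Thumb state
   when it jumps to code located at code_addr. */
static uintptr_t thumb_entry(uintptr_t code_addr)
{
    return code_addr | (uintptr_t)1;
}
```

On the target, the copy would read from thumb_code_start((uintptr_t)GrabarFirmware), and the call would go through a function pointer built from thumb_entry() applied to the RAM buffer address.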

GTKMM Monitoring I/O example 100% CPU load

I am trying the Gtkmm Monitoring I/O example from here.
After something has been written to the fifo, the CPU load goes to 100%.
The code I used for testing is the code shown at the example link (copy/paste); I only removed the build.config.h header so that it would compile.
I compiled it using:
g++ -Wall -o test main.cc `pkg-config gtkmm-3.0 sigc++-2.0 --cflags --libs`
After converting the code to Gtkmm 2 the behaviour is the same, CPU load still goes to 100% after something has been written to the fifo.
Is this a bug, a known issue, or maybe a non-issue?
I finally figured out how to get the CPU usage down. I changed the following line in the example:
read_fd = open("testfifo", O_RDONLY);
to:
read_fd = open("testfifo", O_RDWR);
Hope this helps someone.
The answer in this thread on the gtkmm mailing list describes what is going on.
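A likely mechanism, stated here as an assumption rather than a summary of that thread: once the last writer closes the FIFO, a read-only descriptor stays permanently "ready" (hang-up, with read returning 0), so the main loop keeps invoking the handler. Opening with O_RDWR makes the reading process count as a writer too, so that state is never reached. The plain C sketch below shows the effect with poll(); open_fifo_rdwr and fifo_is_ready are hypothetical helpers, and the behaviour is Linux-specific because POSIX leaves O_RDWR on a FIFO undefined.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the FIFO if it does not exist and open it read/write, so this
   process always counts as a writer and never sees end-of-file. */
int open_fifo_rdwr(const char *path)
{
    if (mkfifo(path, 0666) != 0 && errno != EEXIST)
        return -1;
    return open(path, O_RDWR);
}

/* Return nonzero if poll() reports the descriptor as ready right now
   (data available or hang-up). An event loop would call the I/O
   handler in exactly this situation. */
int fifo_is_ready(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, 0) > 0;
}
```

With this in place, fifo_is_ready() should go back to 0 after the pending data has been read, even when the writing process has already closed its end.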

STM32F103 Ram issue with FreeRTOS+Trace

I'm just starting with FreeRTOS and am having a problem with a task, so I thought this was a good time to start learning how to debug.
While trying to use the Trace library to assess the situation, I got stuck at the compilation step.
I am using CooCox IDE with ST-LinkV2.
Target device is STM32F103C8T6.
FreeRTOS is V8.2.2.
Tracealyzer Recorder Library is v2.7.7.
Error is:
[cc] c:/arm_development/gcc-arm-none-eabi-4_9-2015q1-20150306-win32/bin/../lib/gcc/arm-none-eabi/4.9.3/../../../../arm-none-eabi/bin/ld.exe: FreeRTOSDemo.elf section `.bss' will not fit in region `ram'
[cc] c:/arm_development/gcc-arm-none-eabi-4_9-2015q1-20150306-win32/bin/../lib/gcc/arm-none-eabi/4.9.3/../../../../arm-none-eabi/bin/ld.exe: region ram overflowed with stack
[cc] c:/arm_development/gcc-arm-none-eabi-4_9-2015q1-20150306-win32/bin/../lib/gcc/arm-none-eabi/4.9.3/../../../../arm-none-eabi/bin/ld.exe: region `ram' overflowed by 6000 bytes
[cc] collect2.exe: error: ld returned 1 exit status
BUILD FAILED
Total time: 11 seconds
Any hints on this would be helpful. Thanks in advance.
This is a basic tools question, not a FreeRTOS or FreeRTOS+Trace question, although you can fix it by changing the FreeRTOS configuration and/or FreeRTOS+Trace configuration.
The error is telling you that you have tried to use more RAM than the part you are using actually has, or at least, the amount of RAM you have told the linker your part actually has.
If you look at the map file for your application you will see which variables are consuming RAM. Probably the single largest will be the FreeRTOS heap. The FreeRTOS documentation tells you how to reduce that. Probably the second largest will be the trace buffer, and the trace configuration header file contains lots of documentation that will tell you how to reduce that.
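As a rough illustration of that budgeting, the sketch below adds up the two big consumers against the 20 KB of SRAM on an STM32F103C8T6. configTOTAL_HEAP_SIZE is the FreeRTOS setting in FreeRTOSConfig.h. EVENT_BUFFER_SIZE and the 4 bytes per event are assumptions about the recorder's trcConfig.h, so check the header that ships with your version. All values and the helper names are examples only.

```c
#include <assert.h>
#include <stddef.h>

/* The STM32F103C8T6 has 20 KB of SRAM. */
#define RAM_SIZE_BYTES ((size_t)(20 * 1024))

/* FreeRTOSConfig.h: size of the FreeRTOS heap. Example value. */
#define configTOTAL_HEAP_SIZE ((size_t)(8 * 1024))

/* trcConfig.h: number of events in the trace buffer. The macro name
   and the per-event size below are assumptions; example values. */
#define EVENT_BUFFER_SIZE 500
#define TRACE_BYTES_PER_EVENT ((size_t)4)

/* Estimate static RAM use: heap + trace buffer + everything else
   (other .data/.bss and the stack), as read from the map file. */
static size_t static_ram_estimate(size_t other_bytes)
{
    return configTOTAL_HEAP_SIZE
         + (size_t)EVENT_BUFFER_SIZE * TRACE_BYTES_PER_EVENT
         + other_bytes;
}

/* Nonzero if the estimate fits in the part's RAM. */
static int fits_in_ram(size_t other_bytes)
{
    return static_ram_estimate(other_bytes) <= RAM_SIZE_BYTES;
}
```

With an overflow of 6000 bytes as in the log above, the two settings together would have to shrink by at least that much, or the other RAM users would have to.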

Memory warning when using the accelerator for fft

I have posted here a function that I use to compute an FFT with the Accelerate framework:
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is that I use it in real time, so I call this function with each new audio buffer.
I get a memory warning, probably because of these lines:
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
Questions:
Is there another way besides calling malloc again and again? (Don't forget that I have to feed it a new buffer many times a second.)
How exactly do I free them? (Code lines, please.)
Can it be caused by the FFT being heavy on the system?
Any way to get rid of this warning would help me a lot.
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program, or handle the error in some other way.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
All memory that is allocated should be freed eventually.
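The allocate-once, reuse, release-once structure described above might look like the following portable C sketch. FftBuffers and the fft_buffers_* functions are hypothetical names. The vDSP setup handle would live next to the buffers on iOS but is left out here so that the code compiles anywhere, and the per-buffer step only de-interleaves the input as a stand-in for the real FFT calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Buffers that live for the whole program. */
typedef struct {
    float *realp;
    float *imagp;
    size_t n_over_2;
} FftBuffers;

/* Allocate once at startup. Returns 0 on success, -1 on failure. */
int fft_buffers_init(FftBuffers *b, size_t n_over_2)
{
    b->realp = malloc(n_over_2 * sizeof *b->realp);
    b->imagp = malloc(n_over_2 * sizeof *b->imagp);
    b->n_over_2 = n_over_2;
    if (b->realp == NULL || b->imagp == NULL) {
        free(b->realp);
        free(b->imagp);
        b->realp = b->imagp = NULL;
        return -1;
    }
    return 0;
}

/* Per-buffer work: reuses the existing memory and allocates nothing.
   Placeholder for the real processing: even samples go to realp,
   odd samples go to imagp. */
void fft_buffers_load(FftBuffers *b, const float *interleaved)
{
    for (size_t i = 0; i < b->n_over_2; i++) {
        b->realp[i] = interleaved[2 * i];
        b->imagp[i] = interleaved[2 * i + 1];
    }
}

/* Release once at shutdown. */
void fft_buffers_destroy(FftBuffers *b)
{
    free(b->realp);
    free(b->imagp);
    b->realp = b->imagp = NULL;
    b->n_over_2 = 0;
}
```

The audio callback would then receive a pointer to the FftBuffers struct and call only fft_buffers_load (plus the FFT itself), so no allocation happens while samples are being processed.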
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
*Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
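// Example:
// Allocate memory and check that the allocation succeeded
int *intpointer = malloc(sizeof *intpointer);
if (intpointer == NULL) {
    // handle the error, e.g. report it and return
}
// ... do stuff...
// 'Free' it when you are done
free(intpointer);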