Memory usage growing seems to be linked to Metal objects

Memory usage growing seems to be linked to Metal objects - swift

I am currently building an app that uses Metal to efficiently render triangles and evaluate a fitness function on textures, I noticed that the memory usage of my Metal app is growing and I can't really understand why.
First of all I am surprised to see that in debug mode, according to Xcode debug panel, memory usage grows really slowly (about 20 MB after 200 generated images), whereas it grows way faster in release (about 100 MB after 200 generated images).
I don't store the generated images (at least not intentionally... but maybe there is some leak I am unaware of).
I am trying to understand where the leak (if it one) comes from but I don't really know where to start, I took a a GPU Frame capture to see the objects used by Metal and it seems suspicious to me:
Looks like there are thousands of objects (the list is way longer than what you can see on the left panel).
Each time I draw an image, there is a moment when I call this code:
trianglesVerticesCoordiantes = device.makeBuffer(bytes: &positions, length: bufferSize , options: MTLResourceOptions.storageModeManaged)
triangleVerticiesColors = device.makeBuffer(bytes: &colors, length: bufferSize, options: MTLResourceOptions.storageModeManaged)
I will definitely make it a one time allocation and then simply copy data into this buffer when needed, but could it cause the memory leak or not at all ?
EDIT with screenshot of instruments :
EDIT #2 : Tons of command encoder objects present when using Inspector:
EDIT #3 : Here is what seems to be the most suspect memory graph when analysed which Xcode visual debugger:
And some detail :
I don't really know how to interpret this...
Thank you.

Related

High Memory Allocation debugging with Apple Instruments

I have an app written in swift which works fine initially, but throughout time the app gets sluggish. I have opened an instruments profiling session using the Allocation and Leaks profile.
What I have found is that the allocation increases dramatically, doing something that should only overwrite the current data.
The memory in question is in the group < non-object >
Opening this group gives hundreds of different allocations, with the responsible library all being libvDSP. So with this I can conclude it is a vDSP call that is not releasing the memory properly. However, double clicking on any of these does not present me with any code, but the raw language I do not understand.
The function that callas vDSP is wrapped like this:
func outOfPlaceComplexFourierTransform(
setup: FFTSetup,
resultSize:Int,
logSize: UInt,
direction: FourierTransformDirection) -> ComplexFloatArray {
let result = ComplexFloatArray.zeros(count:resultSize)
self.useAsDSPSplitComplex { selfPointer in
result.useAsDSPSplitComplex { resultPointer in
vDSP_fft_zop(
setup,
&selfPointer,
ComplexFloatArray.strideSize,
&resultPointer,
ComplexFloatArray.strideSize,
logSize,
direction.rawValue)
}
}
return result
}
This is called from another function:
var mags1 = ComplexFloatArray.zeros(count: measurement.windowedImpulse!.count)
mags1 = (measurement.windowedImpulse?.outOfPlaceComplexFourierTransform(setup: fftSetup, resultSize: mags1.count, logSize: UInt(logSize), direction: ComplexFloatArray.FourierTransformDirection(rawValue: 1)!))!
Within this function, mags1 is manipulated and overwrites an existing array. It was my understanding that mags1 would be deallocated once this function has finished, as it is only available inside this function.
This is the function that is called, many times per second at times. Any help would be appreciated, as what should only take 5mb, very quickly grows by two hundred megabytes in a couple of seconds.
Any pointers to either further investigate the source of the leak, or to properly deallocate this memory once finished would be appreciated.

I cannot believe I solved this so quickly after posting this. (I genuinely had several hours of pulling my hair out).
Not included in my code here, I was creating a new FFTSetup every time this was called. Obviously this is memory intensive, and it was not reusing this memory.
In instruments looking at the call tree I was able to see the function utilising this memory.

Can't get max ram size - STM32 with rtos

i'm using STM32F103R8T6,I'm currently setting max heap size for RTOS
When i try setting 12000
#define configTOTAL_HEAP_SIZE ((size_t)12000)
ERROR Compilation
region `RAM' overflowed by 780 bytes Project-STM32 C/C++ Problem
so what's the max i can use ?

Look in the linker (.ld) file. You'll see section defining RAM. That will tell you how much RAM you have, assuming the linker file was properly generated.

The error message you've pasted indicates that linker went 780 bytes past the end of available RAM area. In your case (STM32F103R8T6), it tried to place 21260 bytes (20KB + 780) into RAM which is defined to only fit 20KB. If you decrease configTOTAL_HEAP_SIZE by the amount reported by linker, it'll likely link successfully. There will however be 0 remaining space for regular / non-RTOS heap so no malloc or new will succeed, in case any part of your code wanted to use it.
You can determine exactly what gets put into RAM by your linker by analyzing your *.map file (sidenote: map file is created only if your program gets linked successfully, so you need to at least get it to that state). When you open it, search for 20000000 (start of your RAM region) and there you should see what exactly gets put there, including size of each chunk.
Unless you did something out-of-ordinary to your project (which I think is safe to assume you didn't as you mention using generated project), your RAM area during linking will need to at least fit the following sections:
.data segment where things like global variables initialized by value live
.bss segment which is similar to the one above except values are zero-initialized. This is where eventually the byte array of size configTOTAL_HEAP_SIZE will be put that RTOS uses as its own heap
Stack (don't confuse with RTOS stack sizes, this one is totally separate) - stack used outside of RTOS tasks. This has a constant size - consult your sections.ld file to find the value.
Heap segment that has a size calculated dynamically by the linker and which is equal to total size of RAM minus size of all other sections. The bigger you make your other segments, the smaller your regular heap will be.
Having said that, apart from going through the *.map file to determine what else other than the RTOS heap occupies your RAM, I'd also think twice about why you'd need 12KB (out of 20KB total) assigned only to RTOS heap. Things like do you need so many tasks, do they need such large stacks, do you need so many/so large queues/mutexes/semaphores.

How to memory manage a CMSampleBuffer

I'm getting frames from my camera in the following way:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
guard let imageBuffer: CVImageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
}
From the Apple documentation ...
If you need to reference the CMSampleBuffer object outside of the scope of this method, you must CFRetain it and then CFRelease it when you are finished with it.
To maintain optimal performance, some sample buffers directly reference pools of memory that may need to be reused by the device system and other capture inputs. This is frequently the case for uncompressed device native capture where memory blocks are copied as little as possible. If multiple sample buffers reference such pools of memory for too long, inputs will no longer be able to copy new samples into memory and those samples will be dropped.
Is it okay to hold a reference to CVImageBuffer without explicitly setting sampleBuffer = nil? I only ask because the latest version of Swift automatically memory manages CF data structures so CFRetain and CFRelease are not available.
Also, what is the reasoning behind "This is frequently the case for uncompressed device native capture where memory blocks are copied as little as possible." ? Why would a memory block be copied in the first place?

Is it okay to hold a reference to CVImageBuffer without explicitly setting sampleBuffer = nil?
If you're going to keep a reference to the image buffer, then keeping a reference to its "containing" CMSampleBuffer definitely cannot hurt. Will the "right thing" be done if you keep a reference to the CVImageBuffer but not the CMSampleBuffer? Maybe.
Also, what is the reasoning behind "This is frequently the case for uncompressed device native capture where memory blocks are copied as little as possible." ? Why would a memory block be copied in the first place?
There are questions on SO about how to do a deep copy on an image CMSampleBuffer, and the answers are not straightforward, so the chances of unintentionally copying one's memory block are very low. I think the intention of this documentation is to inform you that AVCaptureVideoDataOutput is efficient! and that this efficiency (via fixed size frame pools) can have the surprising side effect of dropped frames if you hang onto too many CMSampleBuffers for too long, so don't do that.
The warning is slightly redundant however, because even without the spectre of dropped frames, uncompressed video CMSampleBuffers are already a VERY hot potato due to their size and frequency. You only need to reference a few seconds' worth to use up gigabytes of RAM, so it is imperative to process them as quickly possible and then release/nil any references to them.

How to reduce the size of javacard applet

I wrote an applet which has 19 KB size on disk. It has three classes. The first one is extended from Applet, the second one has static functions and third one is a class that i create an instance from it in my applet.
I have three questions:
Is there any way to find out how much size is taken by my applet instance in my javacard?
Is there any tool to reduce the size of a javacard applet (.cap file)?
Can you explain rules that help me to reduce my applet size?

Is there any way to find out how much size is taken by my applet instance in my javacard?
(AFAIK) There is no official way to do that (in GlobalPlatform / Java Card).
You can estimate the real memory usage from the difference in free memory before applet loading and after installation (and most likely after personalization -- as you probably will create some objects during the personalization). Some ways to find out free memory information are:
Using JCSystem.getAvailableMemory() (see here) which gives information for all memory types (if implemented).
Using Extended Card Resources Information tag retrievable with GET DATA (see TS 102 226) (if implemented).
Using proprietary command (ask you vendor).
You can have a look inside your .cap file and see the sizes of the parts that are loaded into the card -- this one is surely VERY INACCURATE as card OS is free to deal with the content at its own discretion.
I remember JCOP Tools have some special eclipse view which shows various statistics for the current applet -- might be informative as well.
The Reference Implementation offers an option to get some resource consumption statistics -- might be useful as well (I have never used this, though).
Is there any tool to reduce the size of a javacard applet (.cap file)?
I used ProGuard in the past to improve applet performance (which in fact increased applet size as I used it mostly for method inlining) -- but it should work to reduce the applet size as well (e.g. eliminate dead code -- see shrinking options). There are many different optimizations as well -- just have a look, but do not expect miracles.
Can you explain rules that help me to reduce my applet size?
I would emphasize good design and proper code re-use, but there are definitely many resources regarding generic optimization techniques -- I don't know any Java Card specific ones -- can't help here :(
If you have more applets loaded into a single card you might place common code into a shared library.
Some additional (random) notes:
It might be more practical to just get a card with a larger memory.
Free memory information given by the card might be inaccurate.
I wonder you have problems with your applet size as usually there are problems with transient memory size (AFAIK).
Your applet might be simply leaking memory and thus use more and more memory.
Do not sacrifice security for lesser applet size!
Good luck!

To answer your 3rd Question
3.Can you explain rules that help me to reduce my applet size?
Some basic Guidelines are :
Keep the number of methods minimum as you know we have very limited resources for smart cards and method calling is an overhead so with minimum method calls,performance of the card will increase.Avoid using get/set methods.Recursive calls should also be avoided as the stack size in most of the cards is around 200 Bytes.
Avoid using more than 3 parameters for virtual methods and 4 for static methods. This way, the compiler will use a number or bytecode shortcuts, which reduces code size.
To work on temporary data, one can use APDU buffer or transient arrays as writing on EEPROM is about 1,000 times slower than writing on RAM.
Inheritance is also an overhead in javacard particularly when the hierarchy is complex.
Accessing array elements is also an overhead to card.So, in situations where there is repeated accessing of an array element try to store the element in a local variable and use it.
Instead of doing this:
if (arr[index1] == 1) do this;
OR
if (arr[index1] == 2) do this;
OR
if (arr[index1] == 3) do this;
Do this:
temp = arr[index1];
if (temp == 1) do this;
OR
if (temp == 2) do this;
OR
if (temp == 3) do this;
Replace nested if-else statements with equivalent switch statements as switch executes faster and takes less memory.

The stack size used in kernel development

I'm developing an operating system and rather than programming the kernel, I'm designing the kernel. This operating system is targeted at the x86 architecture and my target is for modern computers. The estimated number of required RAM is 256Mb or more.
What is a good size to make the stack for each thread run on the system? Should I try to design the system in such a way that the stack can be extended automatically if the maximum length is reached?
I think if I remember correctly that a page in RAM is 4k or 4096 bytes and that just doesn't seem like a lot to me. I can definitely see times, especially when using lots of recursion, that I would want to have more than 1000 integars in RAM at once. Now, the real solution would be to have the program doing this by using malloc and manage its own memory resources, but really I would like to know the user opinion on this.
Is 4k big enough for a stack with modern computer programs? Should the stack be bigger than that? Should the stack be auto-expanding to accommodate any types of sizes? I'm interested in this both from a practical developer's standpoint and a security standpoint.
Is 4k too big for a stack? Considering normal program execution, especially from the point of view of classes in C++, I notice that good source code tends to malloc/new the data it needs when classes are created, to minimize the data being thrown around in a function call.
What I haven't even gotten into is the size of the processor's cache memory. Ideally, I think the stack would reside in the cache to speed things up and I'm not sure if I need to achieve this, or if the processor can handle it for me. I was just planning on using regular boring old RAM for testing purposes. I can't decide. What are the options?

Stack size depends on what your threads are doing. My advice:
make the stack size a parameter at thread creation time (different threads will do different things, and hence will need different stack sizes)
provide a reasonable default for those who don't want to be bothered with specifying a stack size (4K appeals to the control freak in me, as it will cause the stack-profligate to, er, get the signal pretty quickly)
consider how you will detect and deal with stack overflow. Detection can be tricky. You can put guard pages--empty--at the ends of your stack, and that will generally work. But you are relying on the behavior of the Bad Thread not to leap over that moat and start polluting what lays beyond. Generally that won't happen...but then, that's what makes the really tough bugs tough. An airtight mechanism involves hacking your compiler to generate stack checking code. As for dealing with a stack overflow, you will need a dedicated stack somewhere else on which the offending thread (or its guardian angel, whoever you decide that is--you're the OS designer, after all) will run.
I would strongly recommend marking the ends of your stack with a distinctive pattern, so that when your threads run over the ends (and they always do), you can at least go in post-mortem and see that something did in fact run off its stack. A page of 0xDEADBEEF or something like that is handy.
By the way, x86 page sizes are generally 4k, but they do not have to be. You can go with a 64k size or even larger. The usual reason for larger pages is to avoid TLB misses. Again, I would make it a kernel configuration or run-time parameter.

Search for KERNEL_STACK_SIZE in linux kernel source code and you will find that it is very much architecture dependent - PAGE_SIZE, or 2*PAGE_SIZE etc (below is just some results - many intermediate output are deleted).
./arch/cris/include/asm/processor.h:
#define KERNEL_STACK_SIZE PAGE_SIZE
./arch/ia64/include/asm/ptrace.h:
# define KERNEL_STACK_SIZE_ORDER 3
# define KERNEL_STACK_SIZE_ORDER 2
# define KERNEL_STACK_SIZE_ORDER 1
# define KERNEL_STACK_SIZE_ORDER 0
#define IA64_STK_OFFSET ((1 << KERNEL_STACK_SIZE_ORDER)*PAGE_SIZE)
#define KERNEL_STACK_SIZE IA64_STK_OFFSET
./arch/ia64/include/asm/mca.h:
u64 mca_stack[KERNEL_STACK_SIZE/8];
u64 init_stack[KERNEL_STACK_SIZE/8];
./arch/ia64/include/asm/thread_info.h:
#define THREAD_SIZE KERNEL_STACK_SIZE
./arch/ia64/include/asm/mca_asm.h:
#define MCA_PT_REGS_OFFSET ALIGN16(KERNEL_STACK_SIZE-IA64_PT_REGS_SIZE)
./arch/parisc/include/asm/processor.h:
#define KERNEL_STACK_SIZE (4*PAGE_SIZE)
./arch/xtensa/include/asm/ptrace.h:
#define KERNEL_STACK_SIZE (2 * PAGE_SIZE)
./arch/microblaze/include/asm/processor.h:
# define KERNEL_STACK_SIZE 0x2000

I'll throw my two cents in to get the ball rolling:
I'm not sure what a "typical" stack size would be. I would guess maybe 8 KB per thread, and if a thread exceeds this amount, just throw an exception. However, according to this, Windows has a default reserved stack size of 1MB per thread, but it isn't committed all at once (pages are committed as they are needed). Additionally, you can request a different stack size for a given EXE at compile-time with a compiler directive. Not sure what Linux does, but I've seen references to 4 KB stacks (although I think this can be changed when you compile the kernel and I'm not sure what the default stack size is...)
This ties in with the first point. You probably want a fixed limit on how much stack each thread can get. Thus, you probably don't want to automatically allocate more stack space every time a thread exceeds its current stack space, because a buggy program that gets stuck in an infinite recursion is going to eat up all available memory.

If you are using virtual memory, you do want to make the stack growable. Forcing static allocation of stack sized, like is common in user-level threading like Qthreads and Windows Fibers is a mess. Hard to use, easy to crash. All modern OSes do grow the stack dynamically, I think usually by having a write-protected guard page or two below the current stack pointer. Writes there then tell the OS that the stack has stepped below its allocated space, and you allocate a new guard page below that and make the page that got hit writable. As long as no single function allocates more than a page of data, this works fine. Or you can use two or four guard pages to allow larger stack frames.
If you want a way to control stack size and your goal is a really controlled and efficient environment, but do not care about programming in the same style as Linux etc., go for a single-shot execution model where a task is started each time a relevant event is detected, runs to completion, and then stores any persistent data in its task data structure. In this way, all threads can share a single stack. Used in many slim real-time operating systems for automotive control and similar.

Why not make the stack size a configurable item, either stored with the program or specified when a process creates another process?
There are any number of ways you can make this configurable.
There's a guideline that states "0, 1 or n", meaning you should allow zero, one or any number (limited by other constraints such as memory) of an object - this applies to sizes of objects as well.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse