How memory fragmentation is avoided in Swift - swift

GC's compaction, sweep and mark avoids heap memory fragmentation. So how is memory fragmentation is avoided in Swift?
Are these statements correct?
Every time a reference count becomes zero, the allocated space gets added to an 'available' list.
For the next allocation, the frontmost chunk of memory which can fit the size is used.
Chunks of previously used up memory will be used again to be best extent possible
Is the 'available list' sorted by address location or size?
Will live objects be moved around for better compaction?

I did some digging in the assembly of a compiled Swift program, and I discovered that swift:: swift_allocObject is the runtime function called when a new Class object is instantiated. It calls SWIFT_RT_ENTRY_IMPL(swift_allocObject) which calls swift::swift_slowAlloc, which ultimately calls... malloc() from the C standard library. So Swift's runtime isn't doing the memory allocation, it's malloc() that does it.
malloc() is implemented in the C library (libc). You can see Apple's implementation of libc here. malloc() is defined in /gen/malloc.c. If you're interested in exactly what memory allocation algorithm is used, you can continue the journey down the rabbit hole from
there.
Is the 'available list' sorted by address location or size?
That's an implementation detail of malloc that I welcome you to discover in the source code linked above.
1. Every time a reference count becomes zero, the allocated space gets added to an 'available' list.
Yes, that's correct. Except the "available" list might not be a list. Furthermore, this action isn't necessarily done by the Swift runtime library, but it could be done by the OS kernel through a system call.
2. For the next allocation, the frontmost chunk of memory which can fit the size is used.
Not necessarily frontmost. There are many different memory allocation schemes. The one you've thought of is called "first fit". Here are some example memory allocation techniques (from this site):
Best fit: The allocator places a process in the smallest block of unallocated memory in which it will fit. For example, suppose a process requests 12KB of memory and the memory manager currently has a list of unallocated blocks of 6KB, 14KB, 19KB, 11KB, and 13KB blocks. The best-fit strategy will allocate 12KB of the 13KB block to the process.
First fit: There may be many holes in the memory, so the operating system, to reduce the amount of time it spends analyzing the available spaces, begins at the start of primary memory and allocates memory from the first hole it encounters large enough to satisfy the request. Using the same example as above, first fit will allocate 12KB of the 14KB block to the process.
Worst fit: The memory manager places a process in the largest block of unallocated memory available. The idea is that this placement will create the largest hold after the allocations, thus increasing the possibility that, compared to best fit, another process can use the remaining space. Using the same example as above, worst fit will allocate 12KB of the 19KB block to the process, leaving a 7KB block for future use.
Objects will not be compacted during their lifetime. The libc handles memory fragmentation by the way it allocates memory. It has no way of moving around objects that have already been allocated.

Alexander’s answer is great, but there are a few other details somewhat related to memory layout. Garbage Collection requires memory overhead for compaction, so the wasted space from malloc fragmentation isn’t really putting it at much of a disadvantage. Moving memory around also has a hit on battery and performance since it invalidates the processor cache. Apple’s memory management implementation can compress memory that hadn’t been accessed in awhile. Even though virtual address space can be fragmented, the actual RAM is less fragmented due to compression. The compression also allows faster swaps to disk.
Less related, but one of the big reasons Apple picked reference counting has more to do with c-calls then memory layout. Apple’s solution works better if you are heavily interacting with c-libraries. A garbage collected system is normally in order of magnitude slower interacting with c since it needs to halt garbage collection operations before the call. The overhead is usually about the same as a syscall in any language. Normally that doesn’t matter unless you are calling c functions in a loop such as with OpenGL or SQLite. Other threads/processes can normally use the processor resources while a c-call is waiting for the garbage collector so the impact is minimal if you can do your work in a few calls. In the future there may be advantages to Swift’s memory management when it comes to system programming and a rust-like lifecycle memory management. It is on Swift’s roadmap, but Swift 4 is not yet suited for systems programming. Typically in C# you would drop to managed c++ for systems programming and operations that make heavy use of c libraries.

Related

Where are multiple stacks and heaps put in virtual memory?

I'm writing a kernel and need (and want) to put multiple stacks and heaps into virtual memory, but I can't figure out how to place them efficiently. How do normal programs do it?
How (or where) are stacks and heaps placed into the limited virtual memory provided by a 32-bit system, such that they have as much growing space as possible?
For example, when a trivial program is loaded into memory, the layout of its address space might look like this:
[ Code Data BSS Heap-> ... <-Stack ]
In this case the heap can grow as big as virtual memory allows (e.g. up to the stack), and I believe this is how the heap works for most programs. There is no predefined upper bound.
Many programs have shared libraries that are put somewhere in the virtual address space.
Then there are multi-threaded programs that have multiple stacks, one for each thread. And .NET programs have multiple heaps, all of which have to be able to grow one way or another.
I just don't see how this is done reasonably efficient without putting a predefined limit on the size of all heaps and stacks.
I'll assume you have the basics in your kernel done, a trap handler for page faults that can map a virtual memory page to RAM. Next level up, you need a virtual memory address space manager from which usermode code can request address space. Pick a segment granularity that prevents excessive fragmentation, 64KB (16 pages) is a good number. Allow usermode code to both reserve space and commit space. A simple bitmap of 4GB/64KB = 64K x 2 bits to keep track of segment state gets the job done. The page fault trap handler also needs to consult this bitmap to know whether the page request is valid or not.
A stack is a fixed size VM allocation, typically 1 megabyte. A thread usually only needs a handful of pages of it, depending on function nesting level, so reserve the 1MB and commit only the top few pages. When the thread nests deeper, it will trip a page fault and the kernel can simply map the extra page to RAM to allow the thread to continue. You'll want to mark the bottom few pages as special, when the thread page faults on those, you declare this website's name.
The most important job of the heap manager is to prevent fragmentation. The best way to do that is to create a lookaside list that partitions heap requests by size. Everything less than 8 bytes comes from the first list of segments. 8 to 16 from the second, 16 to 32 from the third, etcetera. Increasing the size bucket as you go up. You'll have to play with the bucket sizes to get the best balance. Very large allocations come directly from the VM address manager.
The first time an entry in the lookaside list is hit, you allocate a new VM segment. You subdivide the segment into smaller blocks with a linked list. When such an allocation is released, you add the block to the list of free blocks. All blocks have the same size regardless of the program request so there won't be any fragmentation. When the segment is fully used and no free blocks are available you allocate a new segment. When a segment contains nothing but free blocks you can return it to the VM manager.
This scheme allows you to create any number of stacks and heaps.
Simply put, as your system resources are always finite, you can't go limitless.
Memory management always consists of several layers each having its well defined responsibility. From the perspective of the program, the application-level manager is visible that is usually concerned only with its own single allocated heap. A level above could deal with creating the multiple heaps if needed out of (its) one global heap and assigning them to subprograms (each with its own memory manager). Above that could be the standard malloc()/free() that it uses and above those the operating system dealing with pages and actual memory allocation per process (it is basically not concerned not only about multiple heaps, but even user-level heaps in general).
Memory management is costly and so is trapping into the kernel. Combining the two could impose severe performance hit, so what seems to be the actual heap management from the application's point of view is actually implemented in user space (the C runtime library) for the sake of performance (and other reason out of scope for now).
When loading a shared (DLL) library, if it is loaded at program startup, it will of course be most probably loaded to CODE/DATA/etc so no heap fragmentation occurs. On the other hand, if it is loaded at runtime, there's pretty much no other chance than using up heap space.
Static libraries are, of course, simply linked into the CODE/DATA/BSS/etc sections.
At the end of the day, you'll need to impose limits to heaps and stacks so that they're not likely to overflow, but you can allocate others.
If one needs to grow beyond that limit, you can either
Terminate the application with error
Have the memory manager allocate/resize/move the memory block for that stack/heap and most probably defragment the heap (its own level) afterwards; that's why free() usually performs poorly.
Considering a pretty large, 1KB stack frame on every call as an average (might happen if the application developer is unexperienced) a 10MB stack would be sufficient for 10240 nested call -s. BTW, besides that, there's pretty much no need for more than one stack and heap per thread.

overhead of reserving address space using mmap

I have a program that routinely uses massive arrays, where the memory is allocated using mmap
Does anyone know the typical overheads of allocating address space in large amounts before the memory is committed, either if allocating with MAP_NORESERVE or backing the space with a sparse file? It5 strikes me mmap can't be free since it must make page table entries for the allocated space. I want to have some idea of this overhead before implementing an algorithm I'm considering.
Obviously the answer is going to be platform dependent, im most interested in x64 linux, sparc solaris and sparc linux. I'm thinking that the availability of 1mb pages makes the overhead rather less on a sparc than x64.
The overhead of mmap depends on the way you use it. And it is typically negligible when you use it in an appropriate way.
In linux kernel, mmap operation can be divided into two parts:
look for a free address range that can hold the mapping
Create/enlarge vma struct in address space (mm_struct)
So allocate large amount of memory use mmap do not introduce more
overhead than small ones.
So you should allocate memory as larger as possible in each time. (avoid mutiple times of small mmap)
And you may provide the start address explicitly (if possible). This could save some time in kernel in looking for an large enough free space.
If your application is an multi-threaded program. You should avoid concurrent calls to mmap. That is because the address space is protected by a reader-writer lock and the mmap always takes the writer lock. mmap latency will be orders of magnitude greater in this case.
Moreover, mmap only create the mapping but not the page table. Pages are allocated in the page fault handler when being touched. Page fault handler would take the reader lock that protects address space and can also affects mmap performance.
In this case, you should always try to reuse your large array instead of munmap it and mmap again. (Avoid pagefaults)

How does external fragmentation happen?

As processes are loaded and removed from memory , the free memory space is broken into little pieces ,causing fragmentation ... but how does this happen ?
And what is the best solution to external fragmentation ?
External fragmentation exists when there is enough total memory to satisfy a request (from a process usually), but the total required memory is not available at a contiguous location i.e, its fragmented.
Solution to external fragmentation :
1) Compaction : shuffling the fragmented memory into one contiguous location.
2) Virtual memory addressing by using paging and segmentation.
External Fragmentation
External fragmentation happens when a dynamic memory allocation algorithm allocates some memory and a small piece is left over that cannot be effectively used. If too much external fragmentation occurs, the amount of usable memory is drastically reduced. Total memory space exists to satisfy a request, but it is not contiguous.
see following example
0x0000 0x1000 0x2000
A B C //Allocated three blocks A, B, and C, of size 0x1000.
A C //Freed block B
Now Notice that the memory that B used cannot be included for an allocation larger than B's size
External fragmentation can be reduced by compaction or shuffle memory contents to place all free memory together in one large block. To make compaction feasible, relocation should be dynamic.External fragmentation is also avoided by using paging technique.
The best solution to avoid external fragmentation is Paging.
Paging is a memory management technique usually used by virtual memory operating systems to help ensure that the data you need is available as quickly as possible.
for more see this : What's the difference between operating system "swap" and "page"?
In case of Paging there is no external fragmentation but it doesn't avoid internal fragmentation.

What is the rationale for small stack even when memory is available?

Recently, I was asked in an interview, why would you have a smaller stack when the available memory has no limit? Why would you have it in 1KB range even when you might have 4GB physical memory? Is this a standard design practice?
The other answers are good; I just thought I'd point out an important misunderstanding inherent in the question. How much physical memory you have is completely irrelevant. Having more physical memory is just an optimization; it prevents having to use disk as storage. The precious resource consumed by a stack is address space, not physical memory. The bits of the stack that aren't being used right now are not even going to reside in physical memory; they'll be paged out to disk. But as soon as they are committed, they are consuming virtual address space.
The smaller your stacks, the more of them you can have. A 1kB stack is pretty useless, as I can't think of an architecture that has pages that small. A more typical size is 128kB-1MB.
Since each thread has its own stack, the number of stacks you can have is an upper limit on the number of threads you can have. Some people complain about the fact that they can't create more than 2000 threads in a standard 2GB address space of a 32-bit Windows process, so it's not surprising that some people would want even smaller stacks to allow even more threads.
Also, consider that if a stack has to be completely reserved ahead of time, it is carving a chunk out of your address space that can't be returned until the stack isn't used anymore (i.e. the thread exits). That chunk of reserved address space then limits the size of a contiguous allocation you can make.
I don't know the "real" answer, but my guess is:
It's committed on-demand.
Do you really need it?
If the system uses 1 MiB for a stack, then a typical system with 1024 threads would be using 1 GiB of memory for (mostly) nothing... which may not be what you want, especially since you don't really need it.
One reason is, even though memory is huge these days, it is still not unlimited. A 32-bit process is normally limited to 4GB of address space (yes, you can use PAE to increase that, but that requires support from the OS and a return to a segmented memory model.) Each thread uses up some of that memory for its stack, and if a stack is megabytes in size -- whether it's paged in or not -- it's taking up a significant part of the app's address space.
The smaller the stack, the more threads you can squeeze into the app, and the more memory you have available for everything else. Ideally, you want a stack just large enough to handle all possible control flows through the thread, but small enough that you don't have wasted address space.
There are two things here. First, the limit on the stack size will put the limit on number of processes/threads in the system. And then too, the limit is not because of the size of physical memory but because of the limit on addressable virtual memory. Secondly, rarely processes/threads need more stack size then that, and if they do, they can ask for it (libraries handle this seamlessly). So, when starting a new process/thread, it makes sense to give them a small stack space.
Other answers here already mention the core concept, that the most significant consumed resource of a stack is address space (since its implementation requires chunks of contiguous address space) and that the default space consumed on windows for each thread is not insignificant.
However the full story is extremely nuanced (and can and will change over time) over many layers and levels.
This article by Mark Russinovich as part of his "Pushing the limits of Windows" series goes into extremely detailed levels of analysis. The work is in no way an introductory article though, and most people would not consider it the sort of thing that would be expected to be known in a job interview unless perhaps you were interviewing for a job in that particular field.
Maybe because everytime you call a function the OS has to allocate memory to be the stack of that function. Because functions can chain, several function calls will incur more stack allocations. A large default stack size, like 4GiB, would be impractical. But that's just my guess...

Does it make sense to cache data obtained from a memory mapped file?

Or it would be faster to re-read that data from mapped memory once again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
i wanted to mention a few things i've read on the subject. The answer is no, you don't want to second guess the operating system's memory manager.
The first comes from the idea that you want your program (e.g. MongoDB, SQL Server) to try to limit your memory based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next comes from some notes from the author of Varnish, and reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squids elaborate memory management is that it gets into fights with the kernels elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates a HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it get no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you do cache something from a memory-mapped file. At some point in the future that memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
Again from Raymon Chen: If your application is closing - close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program
allocates a lot of memory during the course of its life, and when I
exit the program, it just sits there for several minutes, sometimes
spinning at 100% CPU, sometimes churning the hard drive (sometimes
both). When I break in with the debugger to see what's going on, I
discover that the program isn't doing anything productive. It's just
methodically freeing every last byte of memory it had allocated during
its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the
memory the program had allocated during its lifetime hasn't yet been
paged out, so freeing every last drop of memory is a CPU-bound
operation. On the other hand, if I had kicked off a build or done
something else memory-intensive, then most of the memory the program
had allocated during its lifetime has been paged out, which means that
the program pages all that memory back in from the hard drive, just so
it could call free on it. Sounds kind of spiteful, actually. "Come
here so I can tell you to go away."
All this anal-rententive memory management is pointless. The process
is exiting. All that memory will be freed when the address space is
destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM", they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
When you add 64-byte quad-word aligned cache-lines, there's even more incentive to adjust your data structure layout. But then you don't want it too compact, or you'll start suffering performance penalties of cache flushes from False Sharing.
The answer is highly OS-specific. Generally speaking, there will be no sense in caching this data. Both the "cached" data as well as the memory-mapped can be paged away at any time.
If there will be any difference it will be specific to an OS - unless you need that granularity, there is no sense in caching the data.