Memory usage of zfs for mapped files - solaris

I read the following on https://blogs.oracle.com/roch/entry/does_zfs_really_use_more
There is one peculiar workload that does lead ZFS to consume more
memory: writing (using syscalls) to pages that are also mmaped. ZFS
does not use the regular paging system to manage data that passes
through reads and writes syscalls. However mmaped I/O which is closely
tied to the Virtual Memory subsystem still goes through the regular
paging code . So syscall writting to mmaped pages, means we will keep
2 copies of the associated data at least until we manage to get the
data to disk. We don't expect that type of load to commonly use large
amount of ram
What does this mean exactly? does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file? or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
If so, am I better off keeping the working directories of files written this way on a ufs partition?

Does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file?
Hopefully, no.
or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
That method is just regular low level write(fd, buf, nbytes) system calls and similars and not what memory mapped files are designed to support: accessing file content just with reading / writing memory by using pointers, using the file data as a byte array or whatever.
If so, am I better off keeping the working directories of files written this way on a ufs partition?
No, unless memory mapped files that are also written to using system calls sum to a significant part of your RAM workload, which is quite unlikely to happen.
PS: Note that this blog is almost ten years old. There might have been changes in the implementation since that time.

Related

why kafka index files use memory mapped files ,but log files don't?

We know that kafka use memory mapped files for it's index files ,however it's log files don't use the memory mapped files technology.
My question is why index files use memory mapped files, however log files don't ?
Implementing both log and index appending with mmap approach will bring data consistency problem. mmap is not 100% guarantee to flush the data from memory to file(assuming the flush reply on OS instead of an explicitly calling on munmap(2)), if the index update get flushed but log data not get flushed successfully due to some reason, the data in the log can not be understood anymore.
BTW, for a append-only data, in the write direction, we only need to care about next-to-write block(buffer), so the huge data should not impact this.
That how many bytes can be mapped into the memory relates to the address space. For example, a 32-bit architecture can only address 4GB or even smaller portions of files. Kafka logs which are often larger enough might have only portions mapped at a time, therefore complicating reading them.
However, index files are sparse which means they are relatively small in size. Mapping them into the memory could speed up the lookup process and that's the primary benefit memory-mapped files offer.
Logs are where the messages are stored, the index files point to the position in the logs.
There is a nice, colorful blog post, explaining what is going on.
Having a fast index to improve read performance is a common optimization in databases where writes are append-only(Almost all LSTM databases do some form of this). Also as others have pointed out:
indexes are sparse, so smaller memory footprint. Even the sparsity of the index is configurable, which is useful as data grows.
Append only write patterns are faster than random seeks(especially true for SSDs), and therefore don't need a lot of attention for optimization.
if mmap log file, as physical memory is limited, it may cause page fault frequently which is a seriously expensive overhead. use sendFile system call is more suitable

What is memory map in mongodb?

I read about this topic at
http://docs.mongodb.org/manual/faq/storage/#faq-storage-memory-mapped-files
But didn't understand point .Does it is used to keep query data in physical memory ? How it is related with virtual memory ? Why it is important and how it effect at performance ?
I'll try to explain in a simple way.
MongoDB (and other storage systems) stores data in files. Each database has its own files, created as they are needed. The first file weights 64 MB, the next 128 and so up to 2 GB. Then, new files created weigh 2 GB. Each of these files are logically divided into different blocks, that correspond with one virtual memory block.
When MongoDB needs to access a file or a part of it, loads all virtual blocks corresponding to that file or parts of the files into memory using mmap.On the other hand, mmap is a way for applications to leverage the system cache (linux).
So what really happens when you are doing a query is that MongoDB "tells" the OS to load the part it needs with the data requested, so the next time is requested will be faster. As you can imagine this is a very important feature to boost performance in databases like MongoDB, because accessing RAM is way faster than hard drive.
Another benefit of using mmap is that MongoDB memory will grow as it needs and the system memory is free.

mmap writes to file on disk(synchronous/asynchronous)

I havea question regarding mmap functionality. when mmap is used in asynchronous mode where the kernel takes care of persisting the data to the mapped file on the disk , is it possible to have the former updates overwrite the later updates ?
Lets say at time T, we modify a location in memory that is memory mapped to a file on disk and again at time T+1 we modify the same location in memory. As the writes to the file are not synchronous, is it possible that kernel first picks up the modifications at time T+1 and then picks up the modifications at time T resulting in inconsistency in the memory mapped file ?
It's not exactly possible. The file is allowed to be inconsistent till msync(2) or munmap(2) - when that happens, dirty (modified) pages are written to disk page by page (sometimes more, depends on filesystem in newer kernels). msync() allows you to specify synchronous operation and invalidation of caches after finished write, which allows you to ensure that the data in cache is the same as data in file. Without that, it's possible that your program will see newer data but file contains older - exact specifics of the rather hairy situation depend on CPU architecture and specific OS implementation of those routines.

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in a OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already io bound, so the computational cost of hashing is not a bottleneck. If you hashed at the filesystem level, all io operations which are already slow will get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what its called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in the enterprise filesystems today (and NetApp stands out because they support this on their primary storage also as compared to other similar products which support it only on their backup or secondary storage; they are too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what the filesystem created in that LUN. Cheers. Find out how it works here and here.
btrfs supports online de-duplication of data at the block level. I'd recommend duperemove as an external tool is needed.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy, while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to give COW semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives current running less than $100US/terabyte, I find it hard to believe that this would save most people a whole dollar worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.

mmap() internals

It's widely known that the most significant mmap() feature is that file mapping is shared between many processes. But it's not less widely known that every process has its own address space.
The question is where are memmapped files (more specifically, its data) truly kept, and how processes can get access to this memory?
I mean not *(pa+i) and other high-level stuff, but I mean the internals of the process.
This happens at the virtual memory management layer in the operating system. When you memory map a file, the memory manager basically treats the file as if it were swap space for the process. As you access pages in your virtual memory address space, the memory mapper has to interpret them and map them to physical memory. When you cross a page boundary, this may cause a page fault, at which time the OS must map a chunk of disk space to a chunk of physical memory and resolve the memory mapping. With mmap, it simply does so from your file instead of its own swap space.
If you want lots of details of how this happens, you'll have to tell us which operating system you're using, as implementation details vary.
This is very implementation-dependent, but the following is one possible implementation:
When a file is a first memory-mapped, the data isn't stored anywhere at first, it's still on disk. The virtual memory manager (VMM) allocates a range of virtual memory addresses to the process for the file, but those addresses aren't immediately added to the page table.
When the program first tries to read or write to one of those addresses, a page fault occurs. The OS catches the page fault, figures out that that address corresponds to a memory-mapped file, and reads the appropriate disk sector into an internal kernel buffer. Then, it maps the kernel buffer into the process's address space, and restarts the user instruction that caused the page fault. If the faulting instruction was a read, we're all done for now. If it was a write, the data is written to memory, and the page is marked as dirty. Subsequent reads or writes to data within the same page do not require reading/writing to/from disk, since the data is in memory.
When the file is flushed or closed, any pages which have been marked dirty are written back to disk.
Using memory-mapped files is advantageous for programs which read or write disk sectors in a very haphazard manner. You only read disk sectors which are actually used, instead of reading the entire file.
I'm not really sure what you are asking, but mmap() sets aside a chunk of virtual memory to hold the given amount of data (usually. It can be file-backed sometimes).
A process is an OS entity, and it gains access to memory mapped areas through the OS-proscribed method: calling mmap().
The kernel has internal buffers representing chunks of memory. Any given process is assigned a memory mapping in its own address space which refers to that buffer. A number of proccesses may have their own mappings, but they all end up resolving to the same chunk (via the kernel buffer).
This is a simple enough concept, but it can get a little tricky when processes write. To keep things simple in the read-only case there's usually a copy-on-write functionality that's only used as needed.
Any data will be in some form of memory or others, some cases in HDD, in embedded systems may be some flash memory or even the ram (initramfs), barring the last one, the data in the memory are frequently cached in the RAM, RAM is logical divided into pages and the kernel maintains a list of descriptors which uniquely identify an page.
So at best accessing data would be accessing the physical pages. Process gets there own process address space which consists of many vm_are_struct which identifies a mapped section in the address space. In a call to mmap, new vm_area_struct may be created or may be merged with an existing one if the addresses are adjacent.
A new virtual address is returned to the call to mmap. Also new page tables are created which consists the mapping of the newly created virtual addresses to the physical address where the real data resides. mapping can be done on a file, or anonymously like malloc. The process address space structure mm_struct uses the pointer of pgd_t (Page global directory) to reach the physical page and access the data.