what is the suggested number of bytes each time for files too large to be memory mapped at one time? - memory-mapped-files

I am opening files using memory map. The files are apparently too big (6GB on a 32-bit PC) to be mapped in one ago. So I am thinking of mapping part of it each time and adjusting the offsets in the next mapping.
Is there an optimal number of bytes for each mapping or is there a way to determine such a figure?
Thanks.

There is no optimal size. With a 32-bit process, there is only 4 GB of address space total, and usually only 2 GB is available for user mode processes. This 2 GB is then fragmented by code and data from the exe and DLL's, heap allocations, thread stacks, and so on. Given this, you will probably not find more than 1 GB of contigous space to map a file into memory.
The optimal number depends on your app, but I would be concerned mapping more than 512 MB into a 32-bit process. Even with limiting yourself to 512 MB, you might run into some issues depending on your application. Alternatively, if you can go 64-bit there should be no issues mapping multiple gigabytes of a file into memory - you address space is so large this shouldn't cause any issues.
You could use an API like VirtualQuery to find the largest contigous space - but then your actually forcing out of memory errors to occur as you are removing large amounts of address space.
EDIT: I just realized my answer is Windows specific, but you didn't which platform you are discussing. I presume other platforms have similar limiting factors for memory-mapped files.

Does the file need to be memory mapped?
I've edited 8gb video files on a 733Mhz PIII (not pleasant, but doable).

Related

Page table entries in Virtual Memory

I have a simple question regarding page table entries. Suppose that we are given a 32 - bit Virtual address with 4 KiB page and a physical memory size of 2^28 bits.
Since the page offset is 12 bits, we would have 2^20 page table entries which would be mapped to 2^16 physical frames. But how is it possible for 2^20 entries to map to 2^16 entries. There would run out of physical frame addresses. Suppose that the process uses the full 2^20 pages, then assuming that the whole RAM consists of memory from only this process, all the 2^16 frames in the RAM would contain this processes memory. Am i right to say that 2^4 of the page table entries would show that it maps to disk ?
Also, if the process uses only one page table, then the remaining 2^20 - 1 page table entries would be invalid ?
There are a lot of assumptions in your question. To being with, you assume the page table entries are 32-bits. They could be 64 or even 128 bits.
You also assume that there are 20 bits available for indicating into page frames. Any real system is going to need some of those bits for control and protection purposes.
But how is it possible for 2^20 entries to map to 2^16 entries. There would run out of physical frame addresses.
That is the whole point of virtual memory system. Assuming you have 2^20 pages mapped to a process but only 2^16 physical pages, then not all of the process's pages are going to be mapped to page frames at the same time.
Am i right to say that 2^4 of the page table entries would show that it maps to disk ?
A rationally designed virtual memory system maintains a copy of all process pages on disk somewhere. The pages get copied from disk into memory ad mapped to the address space as needed.
But how is it possible for 2^20 entries to map to 2^16 entries. There would run out of physical frame addresses.
The important thing to understand about virtual memory is that it's "virtual" - it's an illusion that doesn't actually (physically) exist. This allows the OS to do various tricks, like:
marking some/lots of virtual pages as "unused/not present" so that RAM isn't wasted when it's not needed (and so that programs get an error when they try to access something that doesn't exist - e.g. SIGSEGV signal).
move pages between RAM and swap space to pretend that there's more RAM than there actually is. Note that this isn't limited to "swap space on disk" - e.g. it could be memory built into some kind of device (e.g. unused memory in a video card), memory that isn't in the current machine (e.g. using a network to store data in the RAM of a different computer) and it could be RAM in the same computer (e.g. if half the data can be compressed to half it's size, then "compression as swap space" would let you store 4 MiB of data in 3 MiB of RAM).
pretend RAM was allocated by mapping the same page full of zeros everywhere, and then allocate it later (if and only if the page is written to); so that you can have a large area of zeros (e.g. a program's ".bss" section) that costs almost nothing (until/unless it's written to).
pretend that a file is mapped/loaded into memory without allocating memory and without loading the file; and then allocating page/s and loading data into them if/when the data is accessed later, and also (to get more free RAM if the OS needs it for other things) freeing the memory if it wasn't modified (knowing that you can just get it from disk again later if it's accessed again later)
mapping the same pages of RAM into multiple processes. For example, if you have 10 processes that are all executing the same executable file; then the executable file might be stored in RAM once and then mapped into the virtual file system's caches plus mapped into 10 different processes.
"copy on write" tricks; where the same page of RAM is mapped into lots of processes (so any process can read from the page), and then if a process writes to the page the OS can allocate a new page and make a copy of the old page and replace the original (shared, read only) page with the new (not shared, writeable) copy.
If each virtual address space is 1 MiB, then you could have 100 processes (with 100 virtual address spaces and 1 GiB of total space) where the processes only use (on average) 512 KiB of virtual address space each, so that it looks like a total of 51200 KiB of virtual memory is being used; but the computer might only have 64 KiB of RAM where the remaining 51136 KiB of virtual memory is just trickery.

How to keep 32 bit mongodb memory usage down on changing dataset

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the mongoDB team, the best way to track the memory usage is to use your operating system's process tracking system (i.e. ps or htop on Unix systems; Process Explorer on Windows.) for virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStats(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this dataset-size-irrespective gradual increase in memory usage, it seems to me that the 32-bit version of Mongo is unuseable. Note: I don't mind losing a bit of performance if it solves the problem.
To answer regarding why the mapped and virtual memory usage does not decrease with the deletes, the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, they are not themselves reduced in size - the files are just more empty afterwards.
Virtual will include journal files, and connections, and other non-data related memory usage also, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it was run. It will also need up to 2x the original size in terms of free disk space to be able to run the repair, since it essentially represents writing out the files again from scratch.
This limitation, and the problems it causes, is why the 32-bit version should not be run in production, it is just not suitable. I would recommend getting onto a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured to the OOM Killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
Your options for stretching 32-bit are limited, but you can try some things. The thing that you run out of is address space, and the increases in the sizes of additional database files mean that you would like to avoid crossing over the boundary from "n" files to "n+1". It may be worth structuring your data into more or fewer databases so that you can get the maximum amount of actual data into memory and as little as possible "dead space".
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
It is a well-known message that 32-bit should not be used for production.
Use 64-bit systems.
Point.

What's the difference between "virtual memory" and "swap space"?

Can any one please make me clear what is the difference between virtual memory and swap space?
And why do we say that for a 32-bit machine the maximum virtual memory accessible is 4 GB only?
There's an excellent explantation of virtual memory over on superuser.
Simply put, virtual memory is a combination of RAM and disk space that running processes can use.
Swap space is the portion of virtual memory that is on the hard disk, used when RAM is full.
As for why 32bit CPU is limited to 4gb virtual memory, it's addressed well here:
By definition, a 32-bit processor uses
32 bits to refer to the location of
each byte of memory. 2^32 = 4.2
billion, which means a memory address
that's 32 bits long can only refer to
4.2 billion unique locations (i.e. 4 GB).
There is some confusion regarding the term Virtual Memory, and it actually refers to the following two very different concepts
Using disk pages to extend the conceptual amount of physical memory a computer has - The correct term for this is actually Paging
An abstraction used by various OS/CPUs to create the illusion of each process running in a separate contiguous address space.
Swap space, OTOH, is the name of the portion of disk used to store additional RAM pages when not in use.
An important realization to make is that the former is transparently possible due to the hardware and OS support of the latter.
In order to make better sense of all this, you should consider how the "Virtual Memory" (as in definition 2) is supported by the CPU and OS.
Suppose you have a 32 bit pointer (64 bit points are similar, but use slightly different mechanisms). Once "Virtual Memory" has been enabled, the processor considers this pointer to be made as three parts.
The highest 10 bits are a Page Directory Entry
The following 10 bits are a Page Table Entry
The last 12 bits make up the Page Offset
Now, when the CPU tries to access the contents of a pointer, it first consults the Page Directory table - a table consisting of 1024 entries (in the X86 architecture the location of which is pointed to by the CR3 register). The 10 bits Page Directory Entry is an index in this table, which points to the physical location of the Page Table. This, in turn, is another table of 1024 entries each of which is a pointer in physical memory, and several important control bits. (We'll get back to these later). Once a page has been found, the last 12 bits are used to find an address within that page.
There are many more details (TLBs, Large Pages, PAE, Selectors, Page Protection) but the short explanation above captures the gist of things.
Using this translation mechanism, an OS can use a different set of physical pages for each process, thus giving each process the illusion of having all the memory for itself (as each process gets its own Page Directory)
On top of this Virtual Memory the OS may also add the concept of Paging. One of the control bits discussed earlier allows to specify whether an entry is "Present". If it isn't present, an attempt to access that entry would result in a Page Fault exception. The OS can capture this exception and act accordingly. OSs supporting swapping/paging can thus decide to load a page from the Swap Space, fix the translation tables, and then issue the memory access again.
This is where the two terms combine, an OS supporting Virtual Memory and Paging can give processes the illusion of having more memory than actually present by paging (swapping) pages in and out of the swap area.
As to your last question (Why is it said 32 bit CPU is limited to 4GB Virtual Memory). This refers to the "Virtual Memory" of definition 2, and is an immediate result of the pointer size. If the CPU can only use 32 bit pointers, you have only 32 bit to express different addresses, this gives you 2^32 = 4GB of addressable memory.
Hope this makes things a bit clearer.
IMHO it is terribly misleading to use the concept of swap space as equivalent to virtual memory. VM is a concept much more general than swap space. Among other things, VM allows processes to reference virtual addresses during execution, which are translated into physical addresses with the support of hardware and page tables. Thus processes do not concern about how much physical memory the system has, or where the instruction or data is actually resident in the physical memory hierarchy. VM allows this mapping. The referenced item (instruction or data) may be resident in L1, or L2, or RAM, or finally on disk, in which case it is loaded into main memory.
Swap space it is just a place on secondary memory where pages are stored when they are inactive. If there is no sufficient RAM, the OS may decide to swap-out pages of a process, to make room for other process pages. The processor never ever executes instruction or read/write data directly from swap space.
Notice that it would be possible to have swap space in a system with no VM. That is, processes that directly access physical addresses, still could have portions of it on
disk.
Though the thread is quite old and has already been answered. Still would like to share this link as this is the simplest explanation I have found so far. Below link has got diagrams for better visualization.
Key Difference: Virtual memory is an abstraction of the main memory. It extends the available memory of the computer by storing the inactive parts of the content RAM on a disk. Whenever the content is required, it fetches it back to the RAM. Swap memory or swap space is a part of the hard disk drive that is used for virtual memory. Thus, both are also used interchangeably.
Virtual memory is quiet different from the physical memory. Programmers get direct access to the virtual memory rather than physical memory. Virtual memory is an abstraction of the main memory. It is used to hide the information of the real physical memory of the system. It extends the available memory of the computer by storing the inactive parts of the RAM's content on a disk. When the content is required, it fetches it back to the RAM. Virtual memory creates an illusion of a whole address space with addresses beginning with zero. It is mainly preferred for its optimization feature by which it reduces the space requirements. It is composed of the available RAM and disk space.
Swap memory is generally called as swap space. Swap space refers to the portion of the virtual memory which is reserved as a temporary storage location. Swap space is utilized when available RAM is not able to meet the requirement of the system’s memory. For example, in Linux memory system, the kernel locates each page in the physical memory or in the swap space. The kernel also maintains a table in which the information regarding the swapped out pages and pages in physical memory is kept.
The pages that have not been accessed since a long time are sent to the swap space area. The process is referred to as swapping out. In case the same page is required, it is swapped in physical memory by swapping out a different page. Thus, one can conclude that swap memory and virtual memory are interconnected as swap memory is used for the technique of virtual memory.
difference-between-virtual-memory-and-swap-memory
"Virtual memory" is a generic term. In Windows, it is called as Paging or pagination. In Linux, it is called as Swap.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, Mmap cannot scale beyond 4 GB on a 32 bit
architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try optimize the basic I/O by tuning
buffering size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound not CPU bound, thus just read this file through normal Perl I/O and process it in single thread. Unless you prove that you can do more I/O than your single CPU work, don't waste your time with anything more. Anyway, you should ask: Why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? It would be magnitude more worth work. Then you can put it in separate I/O channels and use more CPU's (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that serialized I/O is a magnitude faster than random.
This all depends on what kind of preprocessing you can do and and when.
On some of systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is we don't need to process these files
until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing.(well it's done over unix sockets though with a custom made zcat). It trades cpu time for disk i/o time, and for our system that has been well worth it. There's ofcourse a lot of variables that can make this a very poor design for a particular system.
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS, the fastest read of any file is a linear read. cat file > /dev/null should be the speed that the file can be read.
Have you thought of streaming the file and filtering out to a secondary file any interesting results? (Repeat until you have a manageble size file).
Basically need to "Divide and conquer", if you have a network of computers, then copy the 10G file to as many client PCs as possible, get each client PC to read an offset of the file. For added bonus, get EACH pc to implement multi threading in addition to distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 Gb file, transferring it across the (even if local) network, exploring complicated solutions etc all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
hmmm, but what's wrong with the read() command in C? Usually has a 2GB limit,
so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want a stable performance that is as fast as the source medium allows for, the only way I am aware that this can be done on Windows is by overlapped non-OS-buffered aligned sequential reads. You can probably get to some GB/s with two or three buffers, beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on the thread (both in kernel and usermode) dealing with the IO, obviously the larger buffer is to copy, the more time is wasted on that rather than doing the IO. So the optimal buffer size depends on the firmware and driver. On windows good values to try are multiples of 32 KB for disk IO. Windows file buffering, memory mapping and all that stuff adds overhead. Only good if doing either (or both) multiple reads of same data in random access manner. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C# there's also penalties for calling into the OS due to marshaling, so the interop code may need bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer level computer than a 1000 enterprise priced computers. The reason is that if the processing is also latency sensitive, going beyond using two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software is ends stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software do can probably also be done at gigabytes/s on two core consumer CPU, if written like you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring buffer based code, which was said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at readfile source from winimage, unless you want to dig into sdk/driver samples. It may need some source code fixes to calculate perf correctly at SSD speeds. Experiment with buffer sizes also.
The switches /h multi-threaded and /o overlapped (completion port) IO with optimal buffer size (try 32,64,128 KB etc) using no windows file buffering in my experience give best perf when reading from SSD (cold data) while simultaneously processing (use the /a for Adler processing as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files, Our implementation used multithreading - basically n * worker_threads were starting at incrementing offsets of the file (0, chunk_size, 2xchunk_size, 3x chunk_size ... n-1x chunk_size) and was reading smaller chunks of information. I can't exactly recall our reasoning for this as someone else was desining the whole thing - the workers weren't the only thing to it, but that's roughly how we did it.
Hope it helps
Its not stated in the problem that sequence matters really or not. So,
divide the file into equal parts say 1GB each, and since you are using multiple CPUs, then multiple threads wont be a problem, so read each file using separate thread, and use RAM of capacity > 10 GB, then all your contents would be stored in RAM read by multiple threads.

Total memory consumption of the system

Is it correct to assume that the total memory consumption (virtual + physical) of a system is sum of "Memory Usage" and "VM Size" columns shown by the task manager in windows?
Read these posts by Mark Russinovich:
http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx
http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx
In modern Windows there really is no single truth about "Total Memory Consumption". It depends of course on the definition, but the real question is what you want to do with the answer.
Some processes like SQL-Server tend to use every byte of memory they can get their hands on, if you let them. The .NET CLR garbage collector monitors memory use and acts accordingly, trying to free more memory when it gets scarce.
So for instance you can have a system with 8 GB of physical memory, of which 90% is "used". How much of that memory is actually needed, is very hard to say. The same system may run on a 4 GB machine with no noticeable performance loss or any other issues.
If you want to explore some of the complexities of memory management under Windows, download "VMMap v2.0" from the former sysinternals site. It shows very detailed memory usage per process and may aid you in your quest.
To quote from VMMaps Help:
VMMap categorizes memory into one of several types:
Image
The memory represents an executable file such as a .exe or .dll. The Details column shows the file's path.
Private
Private memory cannot be shared with other processes, is charged against the system commit limit, and typically contains application data.
Shareable
Shareable memory can be shared with other processes, is charged against the system commit limit and typically contains data shared between DLLs in different processes or inter-process communication messages. The Windows APIs refer to this type of memory as pagefile-backed sections.
Mapped File
The memory represents a file on disk and the Details column shows the file's path. Mapped files typically contain application data.
Heap
Heaps represent memory managed by the user-mode heap manager and, like Private memory, is charged against the system commit limit and contains application data.
Managed Heap
Managed heap represents memory that's allocated and used by the .NET garbage collector.
Stack
Stacks are memory used to store function parameters, local function variables and function invocation records for individual threads. Stacks are charged agains the commit limit and typically grow on demand.
System
System memory is kernel-mode physical memory associated with the process. The vast majority of System memory consists of the process page tables.
Free
Free memory regions are spaces in the process address space that are not allocated.
Now you just need to define what types of memory you consider as "used", add these up for all processes, remove multiple duplicates and look at the number... There is a reason why in task manager or other tools, there is no single number labeled "Total Memory Consumption" :-)
No, physical memory and virtual memory may overlap. If a page of memory is in virtual memory and then paged in to physical memory the virtual memory is not necessarily freed, it may be reserved for when the page gets paged out again.