How to keep 32 bit mongodb memory usage down on changing dataset - mongodb

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the MongoDB team, the best way to track the memory usage is to use your operating system's process tracking tools (e.g. ps or htop on Unix systems; Process Explorer on Windows) and look at the virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStatus(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this gradual increase in memory usage, which happens irrespective of dataset size, it seems to me that the 32-bit version of Mongo is unusable. Note: I don't mind losing a bit of performance if it solves the problem.

To answer regarding why the mapped and virtual memory usage does not decrease with the deletes, the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, they are not themselves reduced in size - the files are just more empty afterwards.
Virtual will include journal files, and connections, and other non-data related memory usage also, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
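The point about the mapped figure can be illustrated with a small sketch. This is not MongoDB code; it only assumes that "deleting records" leaves the file length unchanged, as described above, and the function names (mapped_length, demo) are made up for the illustration. Mapping the whole file gives the same size before and after the contents are blanked out.

```python
import mmap
import os
import tempfile


def mapped_length(path):
    """Map the whole file read-only and return the size of the mapping."""
    with open(path, "rb") as f:
        # Length 0 means "map the entire file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return len(m)


def demo(size=1024 * 1024):
    """Return the mapped size before and after 'emptying' a file in place."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * size)  # a "full" data file
        before = mapped_length(path)

        # Stand-in for deleting records: overwrite the contents in place.
        # The file is emptier afterwards, but it is not any shorter.
        with open(path, "r+b") as f:
            f.write(b"\0" * size)
        after = mapped_length(path)
    finally:
        os.remove(path)
    return before, after
```

Calling demo() returns two equal numbers, which is the same effect as the mapped figure staying put after a large delete.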
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it runs. It will also need up to 2x the original size in terms of free disk space to be able to run the repair, since it essentially writes out the files again from scratch.
This limitation, and the problems it causes, is why the 32-bit version should not be run in production; it is just not suitable. I would recommend getting onto a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured so the OOM Killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
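For a quick look at virtual versus resident size on Linux without MMS or mongostat, the same two figures can be read from /proc/<pid>/status. The sketch below is only an illustration: the helper names are made up, it assumes a Linux system, and it relies on the VmSize and VmRSS lines that the kernel writes to that file (values in kB).

```python
def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("VmSize", "VmRSS"):
            # Lines look like "VmRSS:     350000 kB"; keep the number only.
            fields[key] = int(rest.split()[0])
    return fields


def read_vm_fields(pid):
    """Linux only: read the status file of a running process, e.g. mongod."""
    with open("/proc/{}/status".format(pid)) as f:
        return parse_vm_fields(f.read())
```

VmSize corresponds to the virtual figure and VmRSS to the resident one, so logging both over time shows whether it is really resident memory that is growing.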

Your options for stretching 32-bit are limited, but you can try some things. The thing that you run out of is address space, and the increases in the sizes of additional database files mean that you would like to avoid crossing over the boundary from "n" files to "n+1". It may be worth structuring your data into more or fewer databases so that you can get the maximum amount of actual data into memory and as little as possible "dead space".
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
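The arithmetic in that example can be checked with a short sketch. The 16 MB namespace file and the 64 MB first data file that doubles each time come from the example above; the 2 GB cap on a single data file is an assumption that the answer does not state, and the function names are made up for the illustration.

```python
NS_FILE_MB = 16            # namespace file, from the example above
FIRST_DATA_FILE_MB = 64    # size of <db>.0, from the example above
MAX_DATA_FILE_MB = 2048    # assumption: data files stop doubling at 2 GB


def data_file_size_mb(index):
    """Size in MB of data file <db>.<index>, doubling from 64 MB."""
    return min(FIRST_DATA_FILE_MB * 2 ** index, MAX_DATA_FILE_MB)


def database_footprint_mb(num_data_files):
    """Total on-disk size in MB of a database with the given number of data files."""
    return NS_FILE_MB + sum(data_file_size_mb(i) for i in range(num_data_files))


# The existing database (.ns, .0, .1, .2) and the file it would allocate next.
existing_mb = database_footprint_mb(3)
next_file_mb = data_file_size_mb(3)

# Four brand-new databases, each with only .ns and .0.
four_new_databases_mb = 4 * database_footprint_mb(1)
```

With these numbers, four new databases take 320 MB against 512 MB for the next file of the original database, which is the comparison made above.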

It is a well-known message that 32-bit should not be used for production.
Use 64-bit systems.
Period.

Related

PostgreSQL Table in memory

I created a database containing a total of 3 tables for a specific purpose. The total size of all tables is about 850 MB - very lean... out of which one single table contains about 800 MB (including index) of data and 5 million records (daily addition of about 6000 records).
The system is PG-Windows with 8 GB RAM Windows 7 laptop with SSD.
I allocated 2048MB as shared_buffers, 256MB as temp_buffers and 128MB as work_mem.
I execute a single query multiple times against the single table - hoping that the table stays in RAM (hence the above parameters).
But, although I see a spike in memory usage during execution (by about 200 MB), I do not see memory consumption staying at 500 MB or more (for the data to stay in memory). All running postgres.exe processes show 2-6 MB in Task Manager. Hence, I suspect the LRU does not keep the data in memory.
Average query execution time is about 2 seconds (very simple single table query)... but I need to get it down to about 10-20 ms or even less if possible, purely because the same query is going to be executed a great many times, and that can be achieved only by keeping stuff in memory.
Any advice?
Regards,
Kapil
You should not expect postgres processes to show large memory use, even if the whole database is cached in RAM.
That is because PostgreSQL relies on buffered reads from the operating system buffer cache. In simplified terms, when PostgreSQL does a read(), the OS looks to see whether the requested blocks are cached in the "free" RAM that it uses for disk cache. If the block is in cache, the OS returns it almost instantly. If the block is not in cache the OS reads it from disk, adds it to the disk cache, and returns the block. Subsequent reads will fetch it from the cache unless it's displaced from the cache by other blocks.
That means that if you have enough free memory to fit the whole database in "free" operating system memory, you won't tend to hit the disk for reads.
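The read path described above can be sketched as a toy cache. This is only a model: real operating systems use more elaborate replacement policies than strict LRU, and the class and parameter names here are made up for the illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Toy model of an OS disk cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of blocks that fit in "free" RAM
        self.blocks = OrderedDict()   # block id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id, read_from_disk):
        if block_id in self.blocks:
            # Cached: return it straight away and mark it as recently used.
            self.hits += 1
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Not cached: go to disk, then keep a copy for later reads.
        self.misses += 1
        data = read_from_disk(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            # Displace the least recently used block.
            self.blocks.popitem(last=False)
        return data
```

If the capacity is at least the number of distinct blocks in the table, every read after the first pass is a hit, even though the process doing the reads never holds the data itself.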
Depending on the OS, behaviour for disk writes may differ. Linux will write-back cache "dirty" buffers, and will still return blocks from cache even if they've been written to. It'll write these back to the disk lazily unless forced to write them immediately by an fsync() as Pg uses at COMMIT time. When it does that it marks the cached blocks clean, but doesn't flush them. I don't know how Windows behaves here.
The point is that PostgreSQL can be running entirely out of RAM with a 1GB database, even though no PostgreSQL process seems to be using much RAM. Having shared_buffers too high just leads to double-caching and can reduce the amount of RAM available for the OS to cache blocks.
It isn't easy to see exactly what's cached in RAM because Pg relies on the OS cache. That's why I referred you to pgfincore.
If you're on Windows and this won't work, you really just have to rely on observing disk activity. Does performance monitor show lots of uncached disk reads? Does operating system memory monitoring show lots of memory used for disk cache in the OS?
Make sure that effective_cache_size correctly reflects the RAM used for disk cache. It will help PostgreSQL choose appropriate query plans.
You are making the assumption, without apparent evidence, that the query performance you are experiencing is explained by disk read delays, and that it can be improved by in-memory caching. This may not be the case at all. You need to look at explain analyze output and system performance metrics to see what's going on.

MongoDB, NUMA hardware, page faults but enough RAM for working set, touch command or vmtouch/dd does not load into memory

MongoDB 2.4.6 & 2.4.8
Use case:
Load up 100,000 documents on a collection with 2 indexes. Resident memory increases (mongostat), and no page faults happen.
Restart mongod. Resident memory is low (this is expected)
Try to 'preheat' mongo, with touch command db.runCommand({ touch: collection, data: true, index: true }) or other means (on OS, vmtouch / dd)
a) On this step, on my development machine (MacOS), I see in mongostat a lot of page faults trying to heat it up (expected) and the resident memory is raised. From that point on, any updates do not raise page faults
b) On a NUMA server (256 GB RAM), even though I started up mongo with this guide: http://docs.mongodb.org/manual/administration/production-notes/#mongodb-on-numa-hardware (note: I do not have superuser access. However, the 2nd step, echoing 0 in /proc/sys/vm/zone_reclaim_mode, is already 0 so I left it like that), I cannot seem to be able to pre-heat the memory with the 'touch' command. Nothing happens, even though it returns successfully. In mongostat, only 'mapped' and 'vsize' are getting higher, and resident memory is the same (35m). I even tried to load up the data files in OS's memory with vmtouch and dd commands. Only re-indexing the collection changed the resident memory.
The problem started a while after I began to load up data into the server. I do a lot of upserts and the performance was awesome in the beginning (3000 - 4000 upserts/sec). This was expected because the working set would be able to fit in memory. After 30,000,000 documents the process seems to make a lot of page faults and I do not know why. The data files are approx. 33GB and the performance is about 500 upserts/sec, with a lot of page faults. That should mean that the working set is not in memory. However, 256GB RAM should be more than enough. I tried the 'touch' command, but resident memory was low (I even restarted the mongod process, ran the touch command, and even though 'mapped' and 'vsize' skyrocketed to a lot of GB, resident memory kept low, 35m). I tried to reIndex the collection and voilà, resident memory went from 35m -> 20GB. However, again, I saw page faults. Then I tried to vmtouch the data files (or with dd). Again, a lot of page faults.
The problem is that I cannot have 'only' 500 upserts/sec. Should I change my application logic? I thought with 256GB memory my 'active' working set (expected 60GB) should fit in memory. I am in the middle (30GB) and it seems that I cannot do anything to fix this. Is it the numa hardware? Should I make any other changes?
Thanks in advance
I just wrote a pretty detailed answer over on ServerFault regarding resident memory, page faulting, and how to troubleshoot, tweak and tune etc. so I will not re-hash that here.
I will say that Sammaye's comment is correct, the touch (or dd, vmtouch etc.) command will not cause memory to be reported as resident against the mongod process until the process actually accesses the data (until then it is just in the FS cache), and then you can hit the issue in SERVER-9415 which can cause resident memory to be under reported.
I think you are already looking at the key metrics here, and you should be able to achieve higher resident memory than you are reporting (or at least, get more data into memory without significant page faults being seen). The situation you are describing sounds like memory pressure from elsewhere but I am assuming you would have noticed another process eating significant amounts of memory.
What I will note is that I have previously spent days (literally) attempting to make a particular AWS instance go above a 30% memory threshold without success.
When we finally gave up and tried on another instance, without changing a thing (we just added a new instance as a secondary and failed over to it) it instantly went to over 70% resident memory. Granted, that was on m2.4xlarge instances, so not at the same scale as yours, but it's always worth bearing in mind. If you can try it on another instance, I would recommend giving it a shot.

Mongo Server Status - "Resident" memory

After starting Mongo via mongod, I ran a Mongo query that took 300 seconds. Calling db.serverStatus() on my "admin" db showed Mongo having resident memory of 1 GB. The docs explain that "resident" memory is the amount of physical disk/RAM that Mongo uses.
Then, I re-ran the same query, but it took 8 seconds. Looking at the resident memory this time, I saw 5 GB.
The large increase in RAM, I believe, helps to explain why the query time shrank from 300 to 8 seconds, but why did the resident memory jump so quickly?
Is there some type of "warming" step recommended to prepare Mongo so as to avoid 300 second queries?
The reason behind this is that MongoDB uses the mmap functionality of the operating system. This means, at least on Linux systems, that MongoDB's memory handling is based on an operating system feature called memory-mapped files.
Memory on Linux systems is addressed at several levels. Basically, any program sees an address space of 2GB overall on 32-bit systems and 128TB on 64-bit systems. This is a virtual address space, which means that on 32/64-bit that amount of memory can be addressed, in 4 KB memory pages (a page is the individually handled unit of memory). That is why, if you start MongoDB on a 32-bit system, it will raise a warning that the database on such a system can handle only 2GB of data. This virtual address space is usually bigger than the amount of physical memory, so there is a mapping between the virtual addresses and the physical ones. Some of the virtual addresses are backed by real physical memory at any given time, and the algorithm that decides which ones lives in the kernel. Programs running on Linux systems can deal only with virtual addresses; if one tries to access a virtual memory address which is not in physical memory, a page fault occurs (you can track this in the extra_info field of the serverStatus command). (You can find a short explanation of this here)
If the virtual address resides in physical memory, accessing it is as fast as the memory itself. Accessing a virtual address that currently has no physical backing means paging it in from disk and then reading the memory, so it is only as fast as the disk's random reads. (This makes the difference in your case.)
There is a command in MongoDB with which you can force a collection or an index to be cached: the touch command.
If you use this command to load the data into memory before the first query, you will get the results in 8 seconds on the first try. Unfortunately you cannot really force the OS to always keep this in memory, so if other things are using up memory, the OS will page this data out after some time.
If you have enough physical memory, MongoDB will keep everything, data and indexes, in memory. This is not always needed. There is a portion of the data which needs to be in memory to avoid an excessive number of page faults: this is the working set. You can check the size of the working set with the db.runCommand( { serverStatus: 1, workingSet: 1 } ) command.
You cannot control the paging, since it happens at the OS level, but if you have enough memory the kernel usually likes to keep as much stuff cached as it can. If the working set fits in memory you are more or less OK. If some documents are accessed really rarely and there is not enough memory to keep everything there, they will be paged out anyway.
When you run a query, several things can happen. An index can cover the query, which means no documents will be touched at all; if your query is selective in some sense, only a part of the index will be touched. Unfortunately it is really hard to define how much memory is sufficient, and the only thing you can do is monitor (the workingSet metric is an estimate). The symptoms of running out of memory can be identified; check this presentation. And use MMS.

What's the difference between "virtual memory" and "swap space"?

Can anyone please clarify the difference between virtual memory and swap space?
And why do we say that for a 32-bit machine the maximum virtual memory accessible is 4 GB only?
There's an excellent explanation of virtual memory over on superuser.
Simply put, virtual memory is a combination of RAM and disk space that running processes can use.
Swap space is the portion of virtual memory that is on the hard disk, used when RAM is full.
As for why 32bit CPU is limited to 4gb virtual memory, it's addressed well here:
By definition, a 32-bit processor uses 32 bits to refer to the location of each byte of memory. 2^32 = 4.2 billion, which means a memory address that's 32 bits long can only refer to 4.2 billion unique locations (i.e. 4 GB).
There is some confusion regarding the term Virtual Memory, and it actually refers to the following two very different concepts:
1. Using disk pages to extend the conceptual amount of physical memory a computer has. The correct term for this is actually Paging.
2. An abstraction used by various OS/CPUs to create the illusion of each process running in a separate contiguous address space.
Swap space, OTOH, is the name of the portion of disk used to store additional RAM pages when not in use.
An important realization to make is that the former is transparently possible due to the hardware and OS support of the latter.
In order to make better sense of all this, you should consider how the "Virtual Memory" (as in definition 2) is supported by the CPU and OS.
Suppose you have a 32 bit pointer (64 bit pointers are similar, but use slightly different mechanisms). Once "Virtual Memory" has been enabled, the processor considers this pointer to be made up of three parts.
The highest 10 bits are a Page Directory Entry
The following 10 bits are a Page Table Entry
The last 12 bits make up the Page Offset
Now, when the CPU tries to access the contents of a pointer, it first consults the Page Directory table - a table consisting of 1024 entries (in the X86 architecture the location of which is pointed to by the CR3 register). The 10 bits Page Directory Entry is an index in this table, which points to the physical location of the Page Table. This, in turn, is another table of 1024 entries each of which is a pointer in physical memory, and several important control bits. (We'll get back to these later). Once a page has been found, the last 12 bits are used to find an address within that page.
There are many more details (TLBs, Large Pages, PAE, Selectors, Page Protection) but the short explanation above captures the gist of things.
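The 10/10/12 split described above can be sketched in a few lines. This only illustrates the bit arithmetic for classic 32-bit paging with 4 KB pages, not how any particular OS implements it, and the function name is made up for the illustration.

```python
def split_virtual_address(addr):
    """Split a 32-bit virtual address into (directory index, table index, page offset)."""
    if not 0 <= addr < 2 ** 32:
        raise ValueError("not a 32-bit address")
    directory_index = addr >> 22        # highest 10 bits
    table_index = (addr >> 12) & 0x3FF  # following 10 bits
    page_offset = addr & 0xFFF          # last 12 bits
    return directory_index, table_index, page_offset


# 1024 directory entries x 1024 table entries x 4096 bytes per page
# covers the whole 32-bit address space.
ADDRESSABLE_BYTES = 1024 * 1024 * 4096
```

For example, the address 0x00403004 selects entry 1 of the Page Directory, entry 3 of that Page Table, and byte 4 within the page. The product of the three ranges is exactly 2^32 bytes.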
Using this translation mechanism, an OS can use a different set of physical pages for each process, thus giving each process the illusion of having all the memory for itself (as each process gets its own Page Directory)
On top of this Virtual Memory the OS may also add the concept of Paging. One of the control bits discussed earlier specifies whether an entry is "Present". If it isn't present, an attempt to access that entry would result in a Page Fault exception. The OS can capture this exception and act accordingly. OSs supporting swapping/paging can thus decide to load a page from the Swap Space, fix the translation tables, and then issue the memory access again.
This is where the two terms combine, an OS supporting Virtual Memory and Paging can give processes the illusion of having more memory than actually present by paging (swapping) pages in and out of the swap area.
As to your last question (Why is it said 32 bit CPU is limited to 4GB Virtual Memory): this refers to the "Virtual Memory" of definition 2, and is an immediate result of the pointer size. If the CPU can only use 32 bit pointers, you have only 32 bits to express different addresses, which gives you 2^32 bytes = 4GB of addressable memory.
Hope this makes things a bit clearer.
IMHO it is terribly misleading to use the concept of swap space as equivalent to virtual memory. VM is a concept much more general than swap space. Among other things, VM allows processes to reference virtual addresses during execution, which are translated into physical addresses with the support of hardware and page tables. Thus processes need not be concerned about how much physical memory the system has, or where the instruction or data is actually resident in the physical memory hierarchy. VM allows this mapping. The referenced item (instruction or data) may be resident in L1, or L2, or RAM, or finally on disk, in which case it is loaded into main memory.
Swap space is just a place on secondary memory where pages are stored when they are inactive. If there is not sufficient RAM, the OS may decide to swap out pages of a process, to make room for other processes' pages. The processor never ever executes instructions or reads/writes data directly from swap space.
Notice that it would be possible to have swap space in a system with no VM. That is, processes that directly access physical addresses could still have portions of their memory on disk.
Though the thread is quite old and has already been answered, I would still like to share this link, as it is the simplest explanation I have found so far. The link below has diagrams for better visualization.
Key Difference: Virtual memory is an abstraction of the main memory. It extends the available memory of the computer by storing the inactive parts of the content RAM on a disk. Whenever the content is required, it fetches it back to the RAM. Swap memory or swap space is a part of the hard disk drive that is used for virtual memory. Thus, both are also used interchangeably.
Virtual memory is quite different from the physical memory. Programmers get direct access to the virtual memory rather than physical memory. Virtual memory is an abstraction of the main memory. It is used to hide the information of the real physical memory of the system. It extends the available memory of the computer by storing the inactive parts of the RAM's content on a disk. When the content is required, it fetches it back to the RAM. Virtual memory creates an illusion of a whole address space with addresses beginning with zero. It is mainly preferred for its optimization feature by which it reduces the space requirements. It is composed of the available RAM and disk space.
Swap memory is generally called swap space. Swap space refers to the portion of the virtual memory which is reserved as a temporary storage location. Swap space is used when the available RAM cannot meet the system's memory requirements. For example, in the Linux memory system, the kernel locates each page in the physical memory or in the swap space. The kernel also maintains a table in which the information regarding the swapped out pages and pages in physical memory is kept.
The pages that have not been accessed for a long time are sent to the swap space area. This process is referred to as swapping out. If the same page is required again, it is swapped back into physical memory by swapping out a different page. Thus, one can conclude that swap memory and virtual memory are interconnected, as swap memory is used for the technique of virtual memory.
difference-between-virtual-memory-and-swap-memory
"Virtual memory" is a generic term. In Windows, it is called as Paging or pagination. In Linux, it is called as Swap.

what is the suggested number of bytes each time for files too large to be memory mapped at one time?

I am opening files using memory map. The files are apparently too big (6GB on a 32-bit PC) to be mapped in one go. So I am thinking of mapping part of it each time and adjusting the offsets in the next mapping.
Is there an optimal number of bytes for each mapping or is there a way to determine such a figure?
Thanks.
There is no optimal size. With a 32-bit process, there is only 4 GB of address space total, and usually only 2 GB is available for user mode processes. This 2 GB is then fragmented by code and data from the exe and DLLs, heap allocations, thread stacks, and so on. Given this, you will probably not find more than 1 GB of contiguous space to map a file into memory.
The optimal number depends on your app, but I would be concerned mapping more than 512 MB into a 32-bit process. Even with limiting yourself to 512 MB, you might run into some issues depending on your application. Alternatively, if you can go 64-bit there should be no issues mapping multiple gigabytes of a file into memory - your address space is so large this shouldn't cause any issues.
You could use an API like VirtualQuery to find the largest contiguous space - but then you're actually forcing out of memory errors to occur as you are removing large amounts of address space.
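Whatever window size you settle on, the mechanics of mapping a file piece by piece look roughly like the sketch below. It is only an illustration under some assumptions: the function name is made up, the window size is rounded to a multiple of the platform's allocation granularity because mapping offsets must be aligned to it, and each window is copied out as bytes to keep the example short (real code would work on the mapping in place).

```python
import mmap
import os


def iter_mapped_windows(path, window_size):
    """Yield (offset, data) for successive memory-mapped windows of a file."""
    granularity = mmap.ALLOCATIONGRANULARITY
    # Offsets passed to mmap must be multiples of the allocation granularity,
    # so round the window down to a multiple of it (at least one unit).
    window_size = max(granularity, (window_size // granularity) * granularity)
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < file_size:
            # The last window may be shorter than the others.
            length = min(window_size, file_size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                yield offset, m[:]  # copy for simplicity
            offset += length
```

Only one window is mapped at any time, so the address space needed is bounded by the window size rather than by the size of the file.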
EDIT: I just realized my answer is Windows specific, but you didn't say which platform you are discussing. I presume other platforms have similar limiting factors for memory-mapped files.
Does the file need to be memory mapped?
I've edited 8gb video files on a 733Mhz PIII (not pleasant, but doable).