What is mongodb behavior regarding keeping loaded indexes in ram? - mongodb

Say I have a single collection in mongodb with only one index, and I require the index for the entire life cycle of the application using that mongo collection.
I would like to know about the behaviour of mongodb.
In this case once the index is loaded into memory, will mongodb keep it in the ram?
Thanks

The first thing MongoDB will knock out of RAM will be the LRU (least recently used) piece of data. So if you only have one index, chances are it will continue to be used pretty regularly and it should stay in memory.
Source

Unfortunately you cannot currently pin a collection or index in memory. MongoDB uses memory mapped files to load collections and indexes into memory. As your activities touch various pieces of your database thru queries, updates, insertions and deletions, that data will get loaded into memory. This is referred to as the working set. If the total memory required to load the working set is less than available memory, no problem.
If not, MongoDB is going to use an LRU algorithm to pick what to unload from memory. This is why it's so important to understand the concept of the working set and how it relates to your available memory.
This writeup from the documentation should be helpful:
How do I calculate how much RAM I need for my application?
The amount of RAM you need depends on several factors, including but
not limited to:
The relationship between database storage and working set.
The operating system’s cache strategy for LRU (Least Recently Used)
The impact of journaling
The number or rate of page faults and other MMS gauges to detect when you need more RAM
Each database connection thread will need up to 1 MB of RAM. MongoDB
defers to the operating system when loading data into memory from
disk. It simply memory maps all its data files and relies on the
operating system to cache data. The OS typically evicts the
least-recently-used data from RAM when it runs low on memory. For
example if clients access indexes more frequently than documents, then
indexes will more likely stay in RAM, but it depends on your
particular usage.
To calculate how much RAM you need, you must calculate your working
set size, or the portion of your data that clients use most often.
This depends on your access patterns, what indexes you have, and the
size of your documents. Because MongoDB uses a thread per connection
model, each database connection also will need up to 1MB of RAM,
whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault
rates rise higher than that, you risk performance degradation. This is
less critical with SSD drives than with spinning disks.
http://docs.mongodb.org/manual/faq/diagnostics/
You can use the serverStatus command to get an estimate of your current working set:
db.runCommand( { serverStatus: 1, workingSet: 1 } )

Related

Memory usage remains 99 % even after create index is done on MongoDB Collection

While doing indexing on MongoDB. Now we have nearly 350 GBs of data in the database and its deployed as a windows service in AWS EC2.
And we are doing indexing for some experimentation. But every time I run the indexing command the memory usage goes to 99% and even after the indexing is done the memory usage keeps like that until I restart the service.
The instance has 30 GB of RAM and SSD drive. And right now we have the DB setup as stand alone (not sharded till now). And we are using the latest version of MongoDB.
Any feedback related to this will be helpful.
Thanks,
Arpan
That's normal behavior for MongoDB.
MongoDB grabs all the RAM it can get to cache each accessed document as long as possible. When you add an index to a collection, each document needs to be read once to build the index, which causes MongoDB to load each document into RAM. It then keeps them in RAM in case you want to access them later. But MongoDB will not squat the RAM. When another process needs memory, MongoDB will willingly release it.
This is explained in the FAQ:
Does MongoDB require a lot of RAM?
Not necessarily. It’s certainly
possible to run MongoDB on a machine with a small amount of free RAM.
MongoDB automatically uses all free memory on the machine as its
cache. System resource monitors show that MongoDB uses a lot of
memory, but its usage is dynamic. If another process suddenly needs
half the server’s RAM, MongoDB will yield cached memory to the other
process.
Technically, the operating system’s virtual memory subsystem manages
MongoDB’s memory. This means that MongoDB will use as much free memory
as it can, swapping to disk as needed. Deployments with enough memory
to fit the application’s working data set in RAM will achieve the best
performance.
See also: FAQ: MongoDB Diagnostics for answers to additional questions
about MongoDB and Memory use.

Mongo Server Status - "Resident" memory

After starting Mongo via mongod, I ran a Mongo query that took 300 seconds. Calling db.serverStatus() on my "admin" db showed Mongo having resident memory of 1 GB. The docs explain that "resident" memory is the amount of physical disk/RAM that Mongo uses.
Then, I re-ran the same query, but it took 8 seconds. Looking at the resident memory this time, I saw 5 GB.
The large increase in RAM, I believe, helps to explain why the query time shrank from 300 to 8 seconds, but why did the resident memory jump so quickly?
Is there some type of "warming" step recommended to prepare Mongo so as to avoid 300 second queries?
There reason behind that MongoDB uses mmap functionality of the operating system. This means, at least on Linux systems That the memory handling of the mongodb is based on some functionality of the operating system called memory mapped files.
The memory in Linux systems is addressed in several levels basically any program will see an address space on 32-bit systems of 2GB over all, on 64-bit systems 128TB. This is a virtual address space which means on 32/64bit that amount of memory can be addressed with 4kb memory pages(page is the individually handled part of the memory). That is why if you start mongoDB on a 32 bit system it will rise a warning that the database on such system can handle only 2GB of data. Obviously this virtual address space is bigger than the amount of physical memory so there is a mapping between these virtual addresses and the physical ones. Some of the virtual addresses reside in really physical memory so they are in the real memory,but the algorithm which ensures this is on the side of the kernel. Programs running on Linux systems can deal only with virtual addresses, if one tries to access a virtual memory address which is not in physical memory, a pagefault occurs (you can track this on the serverStatus commands extra info field). (You can find short explanation of this here)
Accessing memory in case if the virtual address reside in physical memory is as fast as the memory, accessing a virtual address which has no physical currently means a paging from disk to memory and read the memory so as fast as the disks random read. (This makes the different in your case)
There is a command in mongoDB which with you can enforce the caching of a collection or an index this command is the touch
If you use this command to load the data into memory before the first query you will get the results in 8sec at first try. Unfortunately you cannot really force the OS to keep always this in memory, so if you have others things using up the memory OS will page out this data in some time.
IF you have enough physical memory mongoDB will keep everything the data and indexes in memory. This not always needed. There is a portion of data which need to be in memory to avoid extensive amount of pagefaults this is the workingset. You can check the size of the working set with the db.runCommand( { serverStatus: 1, workingSet: 1 } ) command.
You cannot handle the paging while it is OS level, but if you have enough memory usually the kernel likes to keep as much stuff cached as it can. If the workingset fits in memory you are more or less ok. If some documents really rarely accessed and there is not enough memory to keep everything there they will be paged out anyway.
When you run a query several things can happen. An index can cover which means no documents will be touched at all, if your query is selective in some notion only a part of the index will be touched. unfortunately it is really hard to define memory is sufficient and the only thing what you can do is to monitor (the workingset metric is an estimation). The symptom of running out of memory can be identified check this presentation. And use MMS.

what constitutes "large amount of write activity" for Mongodb?

I am currently working on an online ordering application using Mongodb as the backend. In looking into sharding, the Mongo docs say that you should consider sharding if
"your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention."
So my question is: what constitutes a large amount of write activity? are we talking 1000's of writes per second? 100's?
I know that sharding introduces a level of infrastructure complexity that I'd rather not get into if I don't have to.
thanks!
R
The "large amount of write activity" is not defined in terms of a specific number .. but rather when your common usage pattern exceeds the resources of your server hardware. For example, when average I/O flush time or iowait indicates that I/O has become a significant limiting factor.
You do have other options to consider before sharding:
if your working set is larger than RAM and you have significant page faults, upgrade your RAM
if your disk I/O isn't keeping up, consider upgrading to faster disks, RAID, or SSD
review and adjust your readahead settings
look into optimization of slow or inefficient queries
review your indexes and remove unnecessary ones

How to keep 32 bit mongodb memory usage down on changing dataset

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the mongoDB team, the best way to track the memory usage is to use your operating system's process tracking system (i.e. ps or htop on Unix systems; Process Explorer on Windows.) for virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStats(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this dataset-size-irrespective gradual increase in memory usage, it seems to me that the 32-bit version of Mongo is unuseable. Note: I don't mind losing a bit of performance if it solves the problem.
To answer regarding why the mapped and virtual memory usage does not decrease with the deletes, the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, they are not themselves reduced in size - the files are just more empty afterwards.
Virtual will include journal files, and connections, and other non-data related memory usage also, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it was run. It will also need up to 2x the original size in terms of free disk space to be able to run the repair, since it essentially represents writing out the files again from scratch.
This limitation, and the problems it causes, is why the 32-bit version should not be run in production, it is just not suitable. I would recommend getting onto a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured to the OOM Killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
Your options for stretching 32-bit are limited, but you can try some things. The thing that you run out of is address space, and the increases in the sizes of additional database files mean that you would like to avoid crossing over the boundary from "n" files to "n+1". It may be worth structuring your data into more or fewer databases so that you can get the maximum amount of actual data into memory and as little as possible "dead space".
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
It is a well-known message that 32-bit should not be used for production.
Use 64-bit systems.
Point.

Can MongoDB work when size of database larger then RAM?

or when index larger then RAM?
Yes it can work. To what level it will perform is more of an "It Depends"
The key thing is to ensure your working set can fit in RAM. So if you have 16GB of RAM and 20GB database (inc. indexes) for example, if you need to only access half of all the data as the other half is older/never actually queried then you'll be fine as only half of your database needs to be in RAM (10GB).
Working set is key here. For example, if you have a logging application outputting to MongoDB, it may be that your working set is the amount of data (and indexes) from the past 3 months and that all data before that you don't access.
When your working set exceeds the amount of RAM, then it will carry on working but with noticeably degraded performance as things will then be constantly having to go to disk which is far less performant. If you're in this situation of exceeding RAM constraints on a machine, then this is where sharding comes into play - so you can balance the data out over a number of machines therefore increasing the amount of data that can be kept in RAM.
Nicely explained here: http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/