How do I have Mongo 3.0 / WiredTiger load my whole database into RAM? - mongodb

I have a static database (that will never even receive a write) of around 5 GB, while my server RAM is 30 GB. I'm focusing on returning complicated aggregations to the user as fast as possible, so I don't see a reason why I shouldn't have (a) the indexes and (b) the entire dataset stored entirely in RAM, and (c) automatically stored there whenever the Mongo server boots up. Currently my main bottleneck is running group commands to find unique elements out of millions of rows.
My question is, how can I do either (a), (b), or (c) while running on the new Mongo/WiredTiger? I know the "touch" command doesn't work with WiredTiger, so most information on the Internet seems out of date. Are (a), (b), or (c) already done automatically? Should I not be doing each of these steps with this use case?

Normally you shouldn't have to do anything. Disk pages are loaded into RAM on demand and stay there; if there is no free memory left, the oldest (unused) pages are evicted so other programs that need the memory can use it.
If you must have your whole DB in RAM, you could use a ramdisk and point MongoDB's storage path at it.
I would recommend that you revisit your indexes and/or data structures. Having the correct ones can make a huge difference in performance: we are talking about seconds versus hours.
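For example, if the bottleneck is extracting unique values from millions of documents, an index on that field plus a $group stage in the aggregation pipeline is usually much faster than the legacy group command. A minimal pymongo sketch, assuming a hypothetical events collection with a user_id field:

    # Sketch only: database, collection and field names are placeholders.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["mydb"]

    # An index on the grouped field lets the server answer from the index
    # instead of scanning every document.
    db.events.create_index("user_id")

    # Distinct values via the aggregation pipeline.
    unique_ids = [doc["_id"] for doc in db.events.aggregate([
        {"$group": {"_id": "$user_id"}}
    ])]
    print(len(unique_ids))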

Related

What is memory map in mongodb?

I read about this topic at
http://docs.mongodb.org/manual/faq/storage/#faq-storage-memory-mapped-files
but didn't understand the point. Is it used to keep query data in physical memory? How is it related to virtual memory? Why is it important, and how does it affect performance?
I'll try to explain in a simple way.
MongoDB (and other storage systems) stores data in files. Each database has its own files, created as they are needed. The first file is 64 MB, the next 128 MB, and so on up to 2 GB; after that, every new file is 2 GB. Each of these files is logically divided into blocks, and each block corresponds to a block of virtual memory.
When MongoDB needs to access a file or a part of it, it maps the corresponding blocks into memory using mmap. mmap, in turn, is a way for applications to leverage the system page cache (on Linux).
So what really happens when you run a query is that MongoDB "tells" the OS to load the parts of the files that hold the requested data, so the next time that data is requested it can be served faster. As you can imagine, this is very important for performance in databases like MongoDB, because accessing RAM is far faster than accessing the hard drive.
Another benefit of using mmap is that MongoDB's memory usage grows only as it is needed, and the OS remains free to reclaim that memory when other processes require it.
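To make the mechanism concrete, here is a tiny Python illustration of memory-mapping a file. This is only a sketch of the general mmap idea (the file name is a placeholder), not MongoDB's actual file layout:

    # Illustrates the mmap mechanism itself, not MongoDB's storage format.
    import mmap

    with open("datafile.bin", "r+b") as f:    # placeholder file
        mapped = mmap.mmap(f.fileno(), 0)     # map the whole file into virtual memory
        header = mapped[:16]                  # touching bytes makes the OS page them in
        mapped.close()                        # the OS keeps those pages cached afterwards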

understand MongoDB cache system

This is a basic question, but a very important one, and I am not sure I really get the point.
In the official documentation we can read:
MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
The part i am not sure to understand is
If you have created indexes for your queries and your working data set fits in RAM
What does "indexes" mean here?
For example, if I update a document and then query it, it is now in RAM because I just updated it, so it will be served from memory; but this is not very clear in my mind.
How can we be sure whether the data we query will come from memory or not? I understand that MongoDB uses the free memory of the moment to cache data, but could someone explain the overall behavior further?
In which cases could it be better to store data in a variable in our Node server rather than trust the MongoDB cache system?
In general, how do you advise using MongoDB for heavy traffic?
Note: This was written back in 2013, when MongoDB was still quite young and didn't have the features it does today. While this answer still holds true for mmap, it does not for the other storage engines MongoDB now implements, such as WiredTiger or Percona's.
A good place to start to understand exactly what an index is: http://docs.mongodb.org/manual/core/indexes/
After you have brushed up on that you will understand why they are so good. Skipping forward to some of the more intricate questions:
How can we be sure whether the data we query will come from memory or not?
One way is to look at the yields field in any query's explain() output. This tells you how many times the reader yielded its lock because the data was not in RAM.
Another, more in-depth way is to use tools like mongostat. Such tools report the page faults (when data needs to be paged into RAM from disk) happening on your mongod.
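For instance, with pymongo you can pull the explain output for a query. This is only a sketch; the counter names shown (nYields, nscanned) come from the old mmapv1-era explain format and may not exist on newer servers:

    # Sketch: inspect a query's explain() output (old explain format assumed).
    from pymongo import MongoClient

    db = MongoClient()["mydb"]
    plan = db.mycollection.find({"status": "active"}).explain()

    # Older servers report counters such as "nYields" (lock yields, often due to
    # page faults) and "nscanned" (index entries examined).
    print(plan.get("nYields"), plan.get("nscanned"))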
I understand that MongoDB uses the free memory of the moment to cache data, but could someone explain the overall behavior further?
This is actually incorrect. It is easier to just say that MongoDB does this, but in reality it does not. It is in fact the OS and its own paging algorithms, usually LRU, that do this for MongoDB. MongoDB does cache query plans for a certain period of time, though, so that it doesn't have to constantly re-evaluate which index to use.
In which cases could it be better to store data in a variable in our Node server rather than trust the MongoDB cache system?
Not sure how you expect that to work... the two do quite different things, and if you intend to read your data from MongoDB into such a variable in your application on startup, then I definitely would not recommend it.
Besides, the OS's memory-management algorithms are extremely mature and fast, so it is fine to rely on them.
In general, how do you advise using MongoDB for heavy traffic?
Hmm, this is a huge question. I would really recommend you research this subject a little further, but as the documentation states, for one thing you need to ensure your working set fits into RAM.
Here is a good starting point: What does it mean to fit "working set" into RAM for MongoDB?
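As a rough first check, you can compare your data and index sizes against the server's RAM with the dbStats command. A pymongo sketch (the database name is a placeholder):

    # Rough sizing check: compare data/index size against available RAM.
    from pymongo import MongoClient

    db = MongoClient()["mydb"]
    stats = db.command("dbStats")
    print("data size (bytes):  ", stats["dataSize"])
    print("index size (bytes): ", stats["indexSize"])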
MongoDB attempts to keep entire collections in memory: it memory-maps each collection page. For everything to be in memory, both the data pages and the indexes that reference them must be kept in memory.
If MongoDB returns a record, you can rest assured that it is now in memory (whether it was before your query or not).
MongoDB doesn't keep a "cache" of records in the same way that, say, a web browser does. When you commit a change, both the memory and the disk are updated.
Mongo is great when matched to the appropriate use cases. It is very high performance if you have sufficient server memory to cache everything, and declines rapidly past that point. Many, many high-volume websites use MongoDB: it's a good thing that memory is so cheap, now.

mongod clean memory used in ram

I have a huge amount of data in my MongoDB. It's filled with tweets (50 GB) and my RAM is 8 GB. When querying, it retrieves all tweets and MongoDB starts filling the RAM; when it reaches 8 GB it starts paging to disk, and this is the part where it gets really slow. So I changed the query from using skip() to using indexes. Now I have indexes, I query only 8 GB into my program, save the id of the last tweet used in a file, and the program stops. Then I restart the program and it reads the id of that tweet from the file. But the mongod server is still occupying the RAM with the first 8 GB, which will no longer be used, because I have an index pointing past it. How can I clean the memory of the MongoDB server without restarting it?
(running on Windows)
I am a bit confused by your logic here.
So I changed the query from using skip() to using indexes. Now I have indexes, I query only 8 GB into my program, save the id of the last tweet used in a file, and the program stops.
Using ranged queries will not reduce the amount of data you have to page in (in fact it might worsen it because of the index); it merely makes the query faster server-side by using an index instead of a huge skip (like a 42K+ row skip). If you are doing the same work as that skip() but via an index (without a covered index), then you are still paging in exactly the same data.
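For reference, the ranged-query pattern being described looks roughly like this (a pymongo sketch; the tweets collection name and batch size are placeholders). It avoids the server-side cost of skip() but still touches the same documents:

    # "Resume from the last _id" pattern instead of skip().
    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["tweets"]

    def next_batch(last_id, batch_size=1000):
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        return list(coll.find(query).sort("_id", 1).limit(batch_size))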
It is slow due to memory mapping and your working set. You have more data than RAM, and on top of that you are actively using more of that data than you have RAM, so you are probably page faulting all the time.
Restarting the program will not solve this, nor will clearing MongoDB's cached data on the OS side (with a restart or a specific command), because of your queries. You probably need to either:
Rethink your queries so that your working set is more in line with your memory
Or shard your data across many servers so that you don't have to build up your primary server
Or get a bigger primary server (moar RAM!!!!!)
Edit
Your OS's LRU should already be swapping out old data, since MongoDB is using its full allocation. That means that if that 8 GB isn't being evicted, it is because your working set occupies the full 8 GB (most likely with some swap on the end).

MongoDB consumes a lot of memory

My war with MongoDB has been going on for more than a month. So far I'm losing =] ...
Battle 1. Battle 2.
And now a new problem. Again, not enough memory.
Initially, this was solved by simply increasing the memory tier of my VPS. Then by setting journal = false. But now I have reached the top plan, and increasing the memory further is not possible.
My database is short about 4 GB of memory.
When I was choosing a database for the project, it was nowhere written that MongoDB needs so much memory. With about 10 million records, MongoDB is missing 4 GB of memory, while my MySQL database with 10 million records easily copes with 1.4 GB of memory.
The problem, as I understand it, is a large number of indexed fields. But since I cannot start the database, I accordingly cannot remove them. I needed them in the early stages of development; now they are not important to me.
Tell me please, can I remove them somehow?
I have a dump of the database: the complete /data/db folder.
On my PC with 4 GB of memory the database does not start; on a VPS with 4 GB it's the same.
As an alternative, I am thinking of taking a trial period at some VPS/VDS to run mongod and delete the indexes.
Do you know of a hosting provider with a trial period and 6 GB of memory?
Or, if there is another alternative, could you tell me what it is?
The issue has very little to do with the size of your data set. MongoDB uses memory-mapped files for its storage engine. As such it will page hot data into memory when it can, and it does so fairly aggressively (or, more accurately, the OS memory management does).
Basically it uses as much memory as is available to it, and there's very little you can do to avoid it. All data pages (be they actual data or indexes) that are accessed during operation will be paged into memory if there is space available.
There are plenty of references to this on the internet and on mongodb.org by the way. Saying it isn't mentioned anywhere isn't really true.
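If you can get mongod started somewhere (a machine with more RAM, or with a large swap file), dropping the unneeded indexes is straightforward. A pymongo sketch (the collection name is a placeholder):

    # Sketch: drop every non-_id index once mongod is running somewhere.
    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["mycollection"]
    print(coll.index_information())   # inspect what exists first
    coll.drop_indexes()               # drops all indexes except the one on _id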

How does MongoDB stack up for very large data sets where only some of the data is volatile

I'm working on a project where we periodically collect large quantities of e-mail via IMAP or POP, perform analysis on it (such as clustering into conversations, extracting important sentences etc.), and then present views via the web to the end user.
The main view will be a Facebook-like profile page for each contact, showing the most recent (20 or so) conversations each of them has had in the e-mail we capture.
For us, it's important to be able to retrieve the profile page and recent 20 items frequently and quickly. We may also be frequently inserting recent e-mails into this feed. For this, document storage and MongoDB's low-cost atomic writes seem pretty attractive.
However we'll also have a LARGE volume of old e-mail conversations that won't be frequently accessed (since they won't appear in the most recent 20 items, folks will only see them if they search for them, which will be relatively rare). Furthermore, the size of this data will grow more quickly than the contact store over time.
From what I've read, MongoDB seems to more or less require the entire data set to remain in RAM, and the only way to work around this is to use virtual memory, which can carry a significant overhead. Particularly if Mongo isn't able to differentiate between the volatile data (profiles/feeds) and non-volatile data (old emails), this could end up being quite nasty (and since it seems to delegate the virtual memory allocation to the OS, I don't see how this would be possible for Mongo to do).
It would seem that the only choices are to either (a) buy enough RAM to store everything, which is fine for the volatile data, but hardly cost efficient for capturing TB of e-mails, or (b) use virtual memory and see reads/writes on our volatile data slow to a crawl.
Is this correct, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?
MongoDB does not "require the entire data set to remain in RAM". See http://www.mongodb.org/display/DOCS/Caching for an explanation as to why/how it uses virtual memory the way it does.
It would be fine for this application. If your sorting and filtering were more complex you might, for example, want to use a Map-Reduce operation to create a collection that's "display ready" but for a simple date ordered set the existing indexes will work just fine.
MongoDB uses mmap to map documents into virtual memory (not physical RAM). Mongo does not require the entire dataset to be in RAM but you will want your 'working set' in memory (working set should be a subset of your entire dataset).
If you want to avoid mapping large amounts of email into virtual memory you could have your profile document include an array of ObjectIds that refer to the emails stored in a separate collection.
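That layout might look something like this (a sketch only; database, collection and field names are invented for illustration):

    # Sketch of the "profile holds references, emails live elsewhere" layout.
    from pymongo import MongoClient

    db = MongoClient()["mailapp"]

    email_id = db.emails.insert_one({"subject": "Hello", "body": "..."}).inserted_id

    # The profile stays small and hot; old emails are only paged in when fetched.
    db.profiles.update_one(
        {"contact": "alice@example.com"},
        {"$push": {"recent_email_ids": {"$each": [email_id], "$slice": -20}}},
        upsert=True,
    )

Keeping only the last 20 ObjectIds on the profile keeps the hot document small, while the bulky email bodies stay in their own, rarely touched collection.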
@Andrew J
Typically you need enough RAM to hold your working set; this is as true for MongoDB as it is for an RDBMS. So if you want to hold the last 20 emails for all users without going to disk, then you need that much memory. If this exceeds the memory of a single system, you can use MongoDB's sharding feature to spread data across multiple machines, thereby aggregating the memory, CPU and I/O bandwidth of the machines in the cluster.
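Roughly, enabling sharding on a collection looks like this (a sketch with pymongo connected to a mongos router; the database, collection and shard-key names are placeholders):

    # Sketch: shard a collection so memory, CPU and I/O are spread across machines.
    from pymongo import MongoClient

    client = MongoClient("mongodb://my-mongos:27017")   # connect to a mongos router
    client.admin.command("enableSharding", "mailapp")
    client.admin.command("shardCollection", "mailapp.emails",
                         key={"contact_id": 1})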
@mP
MongoDB allows you, as the application developer, to specify the durability of your writes, from a single node in memory to multiple nodes on disk. The choice is yours, depending on what your needs are and how critical the data is; not all data is created equal. In addition, in MongoDB 1.8 you can specify --dur, which writes a journal file for all writes. This further improves the durability of writes and speeds up recovery if there is a crash.
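With a driver, that durability choice is expressed through the write concern. A pymongo sketch (the collection name is a placeholder); j=True asks the server to acknowledge the write only after it has been journaled:

    # Sketch: require the write to be journaled before it is acknowledged.
    from pymongo import MongoClient, WriteConcern

    coll = MongoClient()["mydb"].get_collection(
        "orders", write_concern=WriteConcern(w=1, j=True))
    coll.insert_one({"item": "example"})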
And what happens to all the stuff Mongo had in memory if your computer crashes? I'm guessing that it has no logs, so the answer is probably bad luck.