GridFS and standard collections, memory usage

As far as I know, MongoDB is optimized for the situation where all data fits into memory. And as I understand it, GridFS uses standard collections and all the standard storage methods. Is that right?
Does that mean that storing a large set of data (images, in my case) that is bigger than the available memory will force my real data out of memory?
Or is MongoDB smart enough to give the GridFS collections lower priority?

MongoDB uses memory-mapped files to manage its data files. If you use data, it will stay in memory. If you don't use it, it will eventually be flushed to disk (and be read back, when you request it next time). If you need to read all your data, you better fit it all in RAM or your system might enter the deadly swap spiral (depends on your load, of course).
If you just store data and don't do much with it, MongoDB will use only a fraction of memory. For example, in one of my projects total dataset size is over 300 GB and mongo takes only 800 MB of RAM (because I almost don't read data, only write it).
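Since GridFS stores file data in ordinary collections, a file is simply split into fixed-size chunk documents (the chunks live in `fs.chunks`, with one metadata document per file in `fs.files`). A minimal sketch of that arithmetic, assuming the 255 kB default chunk size used by current drivers (older versions used 256 kB):

```python
import math

# Default GridFS chunk size (assumption: 255 kB in current drivers;
# older MongoDB versions used 256 kB).
CHUNK_SIZE = 255 * 1024

def gridfs_chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunk documents GridFS needs for a file of this size."""
    return math.ceil(file_size_bytes / chunk_size)

# A 1 MB image becomes 5 chunk documents in fs.chunks,
# plus one metadata document in fs.files.
chunks_for_1mb = gridfs_chunk_count(1024 * 1024)
```

Because chunks are plain documents, they compete for the OS page cache exactly like any other collection's documents do, which is why unread GridFS data naturally ages out of RAM.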

Related

MongoDB collection size before/after dump

I have a question regarding MongoDB's collection size.
I did a small stress test in which my MongoDB server was constantly inserting, deleting and updating data for about 48 hours. The documents were small: just a numerical value, a timestamp, and an ID.
Now, after those 48 hours, the collection used for inserting, deleting and updating data was 98,000 bytes and the preallocated storage size was 696,320 bytes. The storage size grew that much larger than the actual collection size because of one input spike during an insertion phase. The subsequent deletions shrank the actual collection size again, but the preallocated storage size stayed the same (AFAIK a common database management problem; it's the same with e.g. MySQL).
After the stress test was completed I created a dump of my MongoDB database and dropped the database completely, so I could import the dump again afterwards and see how the stats would look then. As I suspected, the collection size was still the same (98,000 bytes) but the preallocated storage size went down to 40,960 bytes (from 696,320 bytes before).
Since we want to try out MongoDB for an application that produces hundreds of MB of data, and therefore I/O traffic, every day, we need to keep the database and its occupied space to a minimum, and preferably without having to dump, drop and re-import the whole database every now and then.
Now my question is: is there a way to invoke MongoDB's garbage collection programmatically from code? The software behind it is written in Java, and my idea was to trigger it after a certain amount of time/operations, or after the preallocated storage size reaches a certain threshold.
Or maybe there's an even better (more elegant) way to minimize the occupied space?
Any help would be appreciated and I'll try to provide any further information if needed. Thanks in advance.

What is memory map in mongodb?

I read about this topic at
http://docs.mongodb.org/manual/faq/storage/#faq-storage-memory-mapped-files
But I didn't understand the point. Is it used to keep query data in physical memory? How is it related to virtual memory? Why is it important, and how does it affect performance?
I'll try to explain in a simple way.
MongoDB (and other storage systems) stores data in files. Each database has its own files, created as they are needed. The first file weighs 64 MB, the next 128 MB, and so on up to 2 GB; every new file after that weighs 2 GB. Each of these files is logically divided into blocks that correspond to virtual memory blocks.
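The doubling scheme described above can be sketched in a few lines (illustrative only; journal files and options such as `smallfiles` follow different rules):

```python
def data_file_size_mb(n):
    """Size in MB of the n-th data file (0-based) for a database:
    64 MB, 128 MB, 256 MB, ... doubling up to a 2 GB cap."""
    return min(64 * 2 ** n, 2048)

# The first seven files a busy database would allocate:
sizes = [data_file_size_mb(n) for n in range(7)]
```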
When MongoDB needs to access a file or a part of it, it loads the virtual memory blocks corresponding to that file (or the parts it needs) using mmap. mmap is also a way for applications to leverage the operating system's page cache (on Linux).
So what really happens when you run a query is that MongoDB "tells" the OS to load the part of the file that holds the requested data, so the next time that data is requested the access will be faster. As you can imagine, this is a very important feature for boosting performance in databases like MongoDB, because accessing RAM is far faster than accessing the hard drive.
Another benefit of using mmap is that MongoDB's memory usage grows only as needed, and the OS remains free to reclaim that memory for other processes.
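The mmap mechanism itself isn't MongoDB-specific; the same system call is exposed in most languages. A small self-contained Python sketch of mapping a file and reading it through memory:

```python
import mmap
import os
import tempfile

# Write a small data file, then map it into the process's virtual
# address space -- the same mechanism MongoDB's storage engine uses.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello mongodb" * 100)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing the map triggers a page fault on first access; the OS
        # loads that page into its cache, so later reads come from RAM.
        first = mm[:13]

os.remove(path)
```

Note that no explicit `read()` happens: the bytes appear in the process's address space and the OS decides when to fault pages in and when to evict them, which is exactly why MongoDB's resident memory looks large but is reclaimable.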

design mongodb to load entire content in memory

I am involved in a project where they got enough RAM to store the entire database in memory. According to the manager, that is what 10gen recommended. This is counter-intuitive. Is that really the way you want to use MongoDB?
It is not counter-intuitive... I find it quite intuitive, actually.
In How much faster is the memory usually than the disk? you can read:
(...) memory is only about 6 times faster when you're doing sequential access (350 Mvalues/sec for memory compared with 58 Mvalues/sec for disk); but it's about 100,000 times faster when you're doing random access.
So if you can fit all your data in RAM, that is quite good, because your reads will be really fast.
Regarding MongoDB, from the FAQ:
It’s certainly possible to run MongoDB on a machine with a small amount of free RAM.
MongoDB automatically uses all free memory on the machine as its cache. System resource monitors show that MongoDB uses a lot of memory, but its usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to the other process.
Technically, the operating system’s virtual memory subsystem manages MongoDB’s memory. This means that MongoDB will use as much free memory as it can, swapping to disk as needed. Deployments with enough memory to fit the application’s working data set in RAM will achieve the best performance.
The problem is that you usually have much more data than available memory. Then you have to go to disk, and disk I/O is slow. For database performance, avoiding full-scan queries is key (and it matters far more once you have to hit disk). Therefore, if your data set does not fit in memory, you should aim to have indexes for the vast majority of your access patterns, and try to fit those indexes in memory:
If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
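As a toy illustration (plain Python, not MongoDB code, so take the analogy loosely): an index behaves like a precomputed lookup structure, while an unindexed query is a full scan over every document:

```python
# Hypothetical "collection" of 10,000 small documents.
docs = [{"_id": i, "value": i * 2} for i in range(10_000)]

def find_by_scan(docs, _id):
    """Full collection scan: examines documents one by one (O(n))."""
    for d in docs:
        if d["_id"] == _id:
            return d
    return None

# "Index": a precomputed _id -> document map. Lookups are O(1),
# provided the index itself fits in memory.
index = {d["_id"]: d for d in docs}

result = index.get(9_999)  # same document, without scanning
```

The scan touches every document (and, on a real database, potentially every page on disk), while the index lookup touches only the index entry and the one matching document, which is why a RAM-resident index makes such a difference.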
It all depends on the size of your database. I am guessing your database is actually quite small, otherwise I cannot see how someone at 10gen would give such advice; I mean, not even @Stennie gives such advice (and he is at 10gen, by the way).
Even if your database is small, I don't see why the manager would recommend that. MongoDB does not do memory management of its own, so it does not "pin" data into pages the way memcached or other memory-based databases do.
This means that the paging of mongod's data can be quite unpredictable, i.e. you may spend more time trying to keep things in RAM than paging in data. This is why it is better to just make sure your working set fits and can be loaded quickly; whether it does depends on your hardware and queries.
@Stennie's comment pretty much sums up the stance you should take with MongoDB.

What's the typical size of data in memory vs data in a tuple of a database?

I am working on an application that uses OpenStreetMap data on a local server. To improve search speed inside the OSM database, I am considering caching some of the data in RAM (i.e. the Java heap of my application). I want to determine how much RAM will be consumed to cache various amounts of data. The complete data file is around 330 GB and growing. How much RAM would that translate to in memory? In general, is there a way to tell how much RAM each gig of data in a Postgres database would consume (if cached outside of the database)?
Thank you folks.
Sachin
In-memory representation generally needs more space than the on-disk representation. It depends on the data types involved and can be around a factor of 2.
More info in this thread on pgsql-general by Tom Lane and Ondrej Ivanic.
That's for the memory PostgreSQL itself uses. I'm not sure about the "outside the database" part of the question. Do you mean the disk cache? The OS cache? I'm not sure what the factor would be there.
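As a rough illustration of why in-memory representations cost more, here is the same effect in Python (the exact factor for PostgreSQL tuples is different, but the principle of per-value overhead is the same):

```python
import sys

# A timestamp stored as plain text on disk costs its byte length;
# held as a Python object it also carries an object header.
value = "2021-01-01 12:00:00"
on_disk = len(value.encode("utf-8"))   # serialized size in bytes
in_memory = sys.getsizeof(value)       # size including object overhead

factor = in_memory / on_disk           # > 1 on any CPython build
```

The overhead per value is roughly constant, so small values (like OSM node tags) inflate by a much larger factor than big blobs do; measuring your actual cached objects is the only reliable way to size the heap.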

MongoDB: Save files on memory instead of disk

Can I use MongoDB like Redis or Memcache?
My goal is to have everything in memory and make it faster to access. We already use MongoDB but we need to improve the speed of reads.
What's the best way to do that?
You can't force MongoDB to keep everything in RAM. It will keep hot and recently used data in RAM and page out the rest. If you can't afford a delay on a page fault, then use Redis or memcached.
Alternatively, you can put MongoDB's data directory on a RAM disk. That will effectively keep everything in memory, but you'll duplicate some data (one copy on the RAM disk, another in MongoDB's memory-mapped files).
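A sketch of that RAM-disk setup on Linux (the mount point and size are illustrative, and everything on tmpfs is lost on reboot, so this only suits reproducible or disposable data):

```shell
# Mount a tmpfs RAM disk and point mongod's dbpath at it.
# Assumptions: Linux, root access, and that 16 GB fits your data set.
mkdir -p /mnt/mongo-ramdisk
mount -t tmpfs -o size=16g tmpfs /mnt/mongo-ramdisk
mongod --dbpath /mnt/mongo-ramdisk
```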