What is difference MongoDB wiredTiger cache with in-memory DB - mongodb

As i know MonogoDB cache working set in RAM.
Then if i increase wiredTigerCacheSizeGB as much as all of data in disk, does it work as fast as in-memory db?
if no, what is difference?

See In-Memory Storage Engine and WiredTiger Storage Engine
(In-memory) By avoiding disk I/O, the in-memory storage engine allows for more predictable latency of database operations.
Keep in mind that you are limited a 10000 GB when setting wiredTigerCacheSizeGB. You should also disable journaling and set storage.syncPeriodSecs to 0 in order to increase performance of WiredTiger. But, still WiredTiger has to create WiredTiger.wt and WiredTiger.turtle at least...
PS. I think this link might answer your question

I cannot answer all your questions.
A cache reads data from disk and keeps it in the RAM. When you access such data again then you read it from RAM instead of reading it again from disk - which would be much slower.
So, a cache is useless if you have to read the data only once. Some applications anticipate the data you may read in future and put it into the cache in advance.
The MongoDB in-memory DB puts all data into RAM only, it does not read or write anything from disk, apart from some logging data. When you stop an in-memory MongoDB process then all data is lost.
The wiredTiger storage engine is a data format used by MongoDB to store data persistently on disk.

If you set wiredTigerCacheSizeGB high enough to hold all of your data, then all of your reads will be satisfied from the cache. Writes will update the cache and also be written to storage.
If you use the in-memory configuration then all of your reads will be satisfied from memory. Writes will only go to memory and will not be stored on disk.
So if your workload is mostly reads, then the large cache will behave similarly to an in-memory DB. If your workload has a lot of writes, then the large cache configuration may be slower because it needs to write to disk.
Also, the in-memory DB will not preserve your data in the event of a crash, since it only holds data in memory.

Related

PostgreSQL benchmarking over a RAMdisk?

I have been considering the idea of moving to a RAMdisk for a while. I know its risks, but just wanted to do a little benchmark. I just had two questions: (a) when reading the query plan, will it still differentiate between disk and buffers hits? If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
(b) a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take? Is it the same as usual e.g. COPY command?
I do not recommend using RAM disks in PostgreSQL for persistent storage. With careful tuning, you can get PostgreSQL not to use more disk I/O than what is required to make your data persistent.
I recommend doing this:
Have more RAM in your machine than the size of the database.
Define shared_buffers big enough to contain the database (on Linux, define memory hugepages to contain them).
Increase checkpoint_timeout and max_wal_size to get fewer checkpoints.
Set synchronous_commit = off to keep PostgreSQL from syncing WAL to disk on every commit.
If you are happy to lose all your data in the case of a crash, define your tables UNLOGGED. The data will survive a normal shutdown.
Anyway, to answer your questions:
(a) You should set seq_page_cost and random_page_cost way lower to tell PostgreSQL how fast your storage is.
(b) You could run backups with either pg_dump or pg_basebackup, they don't care what kind of storage you have got.
when reading the query plan, will it still differentiate between disk and buffers hits?
It never distinguished between them in the first place. It distinguishes between "hit" and "read", but the "read" can't tell which are truly from disk and which are from OS/FS cache.
PostgreSQL has no idea you are running on a RAM disk, so will continue to report those as it always has.
If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
This is a question that should be answered through your benchmarking. On some systems, memory can be read-ahead from main memory into the faster caches, making sequential reads still faster than random reads. If you care, you will have to benchmark it on your own system.
Reading data from RAM into shared_buffers is still surprisingly expensive due to things like lock management. So as a rough starting point, maybe seq_page_cost=0.1 and random_page_cost=0.15.
a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take?
The risk would be that your system crashes before the export has finished. But what precaution can you take against that?

Mongodb terminates when it runs out of memory

I have the following configuration:
a host machine that runs three docker containers:
Mongodb
Redis
A program using the previous two containers to store data
Both Redis and Mongodb are used to store huge amounts of data. I know Redis needs to keep all its data in RAM and I am fine with this. Unfortunately, what happens is that Mongo starts taking up a lot of RAM and as soon as the host RAM is full (we're talking about 32GB here), either Mongo or Redis crashes.
I have read the following previous questions about this:
Limit MongoDB RAM Usage: apparently most RAM is used up by the WiredTiger cache
MongoDB limit memory: here apparently the problem was log data
Limit the RAM memory usage in MongoDB: here they suggest to limit mongo's memory so that it uses a smaller amount of memory for its cache/logs/data
MongoDB using too much memory: here they say it's WiredTiger caching system which tends to use as much RAM as possible to provide faster access. They also state it's completely okay to limit the WiredTiger cache size, since it handles I/O operations pretty efficiently
Is there any option to limit mongodb memory usage?: caching again, they also add MongoDB uses the LRU (Least Recently Used) cache algorithm to determine which "pages" to release, you will find some more information in these two questions
MongoDB index/RAM relationship: quote: MongoDB keeps what it can of the indexes in RAM. They'll be swaped out on an LRU basis. You'll often see documentation that suggests you should keep your "working set" in memory: if the portions of index you're actually accessing fit in memory, you'll be fine.
how to release the caching which is used by Mongodb?: same answer as in 5.
Now what I appear to understand from all these answers is that:
For faster access it would be better for Mongo to fit all indices in RAM. However, in my case, I am fine with indices partially residing on disk as I have a quite fast SSD.
RAM is mostly used for caching by Mongo.
Considering this, I was expecting Mongo to try and use as much RAM space as possible but being able to function also with few RAM space and fetching most things from disk. However, I limited Mongo Docker container's memory (to 8GB for instance), by using --memory and --memory-swap, but instead of fetching stuff from disk, Mongo just crashed as soon as it ran out of memory.
How can I force Mongo to use only the available memory and to fetch from disk everything that does not fit into memory?
Thanks to #AlexBlex's comment I solved my issue. Apparently the problem was that Docker limited the container's RAM to 8GB but the wiredTiger storage engine was still trying to use up 50% - 1GB of the total system RAM for it's cache (which in my case would have been 15 GB).
Capping wiredTiger's cache size by using this configuration option to a value less than what Docker was allocating solved the problem.

How can we read from journal on MongoDb?

In a write concern mechanism, Suppose our data to be written is now available at journal and not yet passed to hard drive. Meanwhile if a read/write operation comes to the same data then how the engine will handle them ?
If the data has been written to the journal, it has also been written in memory and would be available for reads.
There is no mechanism for you to access/read the journal, but this is not necessary.

What is mongodb behavior regarding keeping loaded indexes in ram?

Say I have a single collection in mongodb with only one index, and I require the index for the entire life cycle of the application using that mongo collection.
I would like to know about the behaviour of mongodb.
In this case once the index is loaded into memory, will mongodb keep it in the ram?
Thanks
The first thing MongoDB will knock out of RAM will be the LRU (least recently used) piece of data. So if you only have one index, chances are it will continue to be used pretty regularly and it should stay in memory.
Source
Unfortunately you cannot currently pin a collection or index in memory. MongoDB uses memory mapped files to load collections and indexes into memory. As your activities touch various pieces of your database thru queries, updates, insertions and deletions, that data will get loaded into memory. This is referred to as the working set. If the total memory required to load the working set is less than available memory, no problem.
If not, MongoDB is going to use an LRU algorithm to pick what to unload from memory. This is why it's so important to understand the concept of the working set and how it relates to your available memory.
This writeup from the documentation should be helpful:
How do I calculate how much RAM I need for my application?
The amount of RAM you need depends on several factors, including but
not limited to:
The relationship between database storage and working set.
The operating system’s cache strategy for LRU (Least Recently Used)
The impact of journaling
The number or rate of page faults and other MMS gauges to detect when you need more RAM
Each database connection thread will need up to 1 MB of RAM. MongoDB
defers to the operating system when loading data into memory from
disk. It simply memory maps all its data files and relies on the
operating system to cache data. The OS typically evicts the
least-recently-used data from RAM when it runs low on memory. For
example if clients access indexes more frequently than documents, then
indexes will more likely stay in RAM, but it depends on your
particular usage.
To calculate how much RAM you need, you must calculate your working
set size, or the portion of your data that clients use most often.
This depends on your access patterns, what indexes you have, and the
size of your documents. Because MongoDB uses a thread per connection
model, each database connection also will need up to 1MB of RAM,
whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault
rates rise higher than that, you risk performance degradation. This is
less critical with SSD drives than with spinning disks.
http://docs.mongodb.org/manual/faq/diagnostics/
You can use the serverStatus command to get an estimate of your current working set:
db.runCommand( { serverStatus: 1, workingSet: 1 } )

MongoDB RAM Consumption on Connections

I'm using pymongo to insert a big amount of jsons to MongoDB gridFS + some data to collection.
What I noticed some time ago is that MongoDB consumes just crazy amount of RAM within using single connection. As soon as I close this connection it releases it.
RAM consumption is like 10-12GB in total within connection and 200MB without. The actual size of collection is actually ~300MB with 10-18GB gridFS storage.
Why does it happen? How can opening new connection for any bulky operation can be lot less resource-dependent than using one single connection for everything?
Is it somehow related to Journaling?
I will have to break down this problem into multiple smaller problems for ease of understanding:
It is well known that MongoDB is RAM hungry, it will try to use as much RAM as possible.
GridFS tends to store files in collection fs.chunks and corresponding meta-data in fs.files. The files stored in GridFS are split into chunks of 256KB each.
When you read GridFS data by opening a connection, the chunks belonging to file(s) have to be loaded into the RAM from the disk(if it is not already present in RAM). So , RAM usage is directly proportional to the amount of data stored and importantly frequency of GridFS data access. Just to re-iterate GridFS data gets pulled into RAM if the query references it.
If you have a active connection for large amounts of GridFS data then you should expect heavy RAM usage. But if your query frequency is low(just write, but read rarely) then RAM usage will be relatively lower.If you are mostly writing data, then ensure the connection is closed after the operation in done.
The more the number of open connections, your RAM usage will increase.
This is no-way related to journaling.
Note: GridFS also supports sharding which will tend to solve your problem of excessive RAM usage.
Hope this clarifies.
Since MongoDB 2.0. each connection consumes about 1MB of RAM.
You can read more here.