Our software design assumes one database per user, so that we can partition the data and later distribute and load-balance per user.
We noticed that the mongod process is taking a lot of memory even when no user has ever logged in.
So I would like to know how/when the loading occurs, if there is a setting that could do some lazy loading or if there is a better strategy to achieve what we want.
Thank you
I believe I found the answer by digging into the MongoDB documentation. Databases are always "loaded", regardless of whether they are active or idle.
To calculate how much RAM you need, you must calculate your working set size, or the portion of your data that clients use most often. This depends on your access patterns, what indexes you have, and the size of your documents. Because MongoDB uses a thread per connection model, each database connection also will need up to 1 MB of RAM, whether active or idle.
So with a very high number of databases, memory could be a big concern even when only a few are active, and it seems our strategy needs to change.
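As a rough illustration of the rule of thumb quoted above, here is a minimal sketch that adds the working set to the per-connection overhead. The function name and the example numbers are made up for illustration, and the 1 MB figure is the "up to" value from the documentation quote, so treat the result as an upper-bound guess rather than a measurement.

```python
MB = 1024 * 1024


def estimate_ram_bytes(working_set_bytes, connections, per_connection_bytes=1 * MB):
    """Rough upper bound: working set plus up to 1 MB per connection thread."""
    if working_set_bytes < 0 or connections < 0:
        raise ValueError("sizes and counts must be non-negative")
    return working_set_bytes + connections * per_connection_bytes


# Hypothetical example: a 2 GB working set and 500 open connections.
estimate = estimate_ram_bytes(2 * 1024 * MB, 500)
print(f"{estimate / MB:.0f} MB")
```

This only covers the two factors named in the quote; it says nothing about any per-database overhead, which would have to be measured separately.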
I have the same 5000 key/value pairs being read/written continuously (every 150ms or so) on a Debian system equivalent to a Raspberry Pi 3.
I don't care about persisting this data; it's recreated whenever my application server is launched.
Initially I used SQLite for this, using an in-memory table. However, now I want to access the data from multiple processes (using a tmpfs didn't work out great) and even from a remote client, as well as add an HTTP API, use LISTEN/NOTIFY for change notifications, so I'd like to switch to PG which is more appropriate for these.
Given these circumstances:
small dataset that fits in RAM
no need for persistence
low power PC
running 24/7 forever
don't want to thrash the flash storage
...what would be a good approach to configuring PG?
I found this 10-year-old question, last updated 5 years ago, which says to use a third-party extension; I'm not too excited about that.
You should create few indexes apart from the primary keys and keep the fillfactor of all your tables low, perhaps around 50. That should get you HOT updates, which will reduce the need for VACUUM and the amount of data written.
You may want to reduce shared_buffers to conserve memory, but keep it big enough to contain the database.
Set synchronous_commit to off to have less disk I/O. If you are ready to ditch the database after an unclean shutdown or system crash, you can set fsync = off, but then you have to remove the cluster after each crash. If you take it that far, you could reduce the write load further by using unlogged tables.
Set checkpoint_timeout high for fewer writes.
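To make the suggestions above concrete, here is a small sketch that renders them as postgresql.conf lines and a table definition. The parameter names (shared_buffers, synchronous_commit, fsync, checkpoint_timeout, fillfactor) are the ones named in this answer; the values, the table name kv, and its columns are assumptions for illustration and would need tuning on the actual device.

```python
# Assumed values for a small, disposable dataset; adjust for the real device.
SETTINGS = {
    "shared_buffers": "32MB",      # assumption: small, but larger than the dataset
    "synchronous_commit": "off",   # fewer synchronous WAL flushes
    "fsync": "off",                # only if the cluster is thrown away after a crash
    "checkpoint_timeout": "1h",    # fewer checkpoints, so fewer writes
}


def render_conf(settings):
    """Render settings as postgresql.conf lines."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


def create_table_sql(table="kv", fillfactor=50):
    """Unlogged table with a low fillfactor to leave room for HOT updates."""
    return (
        f"CREATE UNLOGGED TABLE {table} "
        f"(key text PRIMARY KEY, value text) "
        f"WITH (fillfactor = {fillfactor});"
    )


print(render_conf(SETTINGS))
print(create_table_sql())
```

Keep in mind that an unlogged table is truncated after a crash, which matches the "no need for persistence" requirement in the question.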
Say I have a single collection in mongodb with only one index, and I require the index for the entire life cycle of the application using that mongo collection.
I would like to know about the behaviour of mongodb.
In this case, once the index is loaded into memory, will MongoDB keep it in RAM?
Thanks
The first thing MongoDB will knock out of RAM will be the LRU (least recently used) piece of data. So if you only have one index, chances are it will continue to be used pretty regularly and it should stay in memory.
Source
Unfortunately you cannot currently pin a collection or index in memory. MongoDB uses memory mapped files to load collections and indexes into memory. As your activities touch various pieces of your database through queries, updates, insertions and deletions, that data will get loaded into memory. This is referred to as the working set. If the total memory required to load the working set is less than available memory, no problem.
If not, MongoDB is going to use an LRU algorithm to pick what to unload from memory. This is why it's so important to understand the concept of the working set and how it relates to your available memory.
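To see why a frequently used index tends to stay resident, here is a toy LRU simulation. It is only a sketch of the eviction idea described above: the class and page names are invented, and a real operating system page cache is more elaborate than strict LRU.

```python
from collections import OrderedDict


class LRUCache:
    """Toy model of a fixed number of RAM pages with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # oldest access first, newest last

    def touch(self, page):
        """Access a page; return the page evicted to make room, if any."""
        if page in self.pages:
            self.pages.move_to_end(page)  # mark as most recently used
            return None
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            evicted, _ = self.pages.popitem(last=False)  # drop the oldest
            return evicted
        return None


cache = LRUCache(capacity=2)
for page in ["index", "doc1", "index"]:
    cache.touch(page)
evicted = cache.touch("doc2")
print(evicted, list(cache.pages))
```

Because the index is touched between document reads, it is never the least recently used entry, so the document page is the one that gets evicted.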
This writeup from the documentation should be helpful:
How do I calculate how much RAM I need for my application?
The amount of RAM you need depends on several factors, including but
not limited to:
The relationship between database storage and working set.
The operating system’s cache strategy for LRU (Least Recently Used)
The impact of journaling
The number or rate of page faults and other MMS gauges to detect when you need more RAM
Each database connection thread will need up to 1 MB of RAM.
MongoDB defers to the operating system when loading data into memory
from disk. It simply memory maps all its data files and relies on the
operating system to cache data. The OS typically evicts the
least-recently-used data from RAM when it runs low on memory. For
example if clients access indexes more frequently than documents, then
indexes will more likely stay in RAM, but it depends on your
particular usage.
To calculate how much RAM you need, you must calculate your working
set size, or the portion of your data that clients use most often.
This depends on your access patterns, what indexes you have, and the
size of your documents. Because MongoDB uses a thread per connection
model, each database connection also will need up to 1MB of RAM,
whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault
rates rise higher than that, you risk performance degradation. This is
less critical with SSD drives than with spinning disks.
http://docs.mongodb.org/manual/faq/diagnostics/
You can use the serverStatus command to get an estimate of your current working set:
db.runCommand( { serverStatus: 1, workingSet: 1 } )
I have a Mongo cluster that backs an application that I use in production. It's very important to my business and clustered across a number of boxes to optimize for speed and redundancy. I'd like to make the data in said cluster available for running analytical queries and enqueued tasks, but I definitely don't want these to harm production performance. Is it possible to just mirror all of my data against a single box I throw into the cluster with some special tag that I can then use for analytics? It's fine if it's slow. I just want it to be cheap and not to affect production read/write speeds.
Since you're talking about redundancy, I assume you have a replica set.
In that case you can use a hidden replica set member to perform the calculations you need.
Just keep in mind that the number of voting members should be odd. If you add a node, you might also need to add an arbiter. Alternatively, you could hide one of the existing members.
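For illustration, a hidden member is described in the replica set configuration by a member document with hidden set to true and priority set to 0. The sketch below builds such documents as plain Python dicts and checks the voter count; the host names and helper functions are hypothetical, and the resulting config would still have to be applied with your driver or the shell.

```python
def hidden_member(member_id, host):
    """Member document for a hidden node; hidden members need priority 0."""
    return {"_id": member_id, "host": host, "priority": 0, "hidden": True}


def has_odd_voters(members):
    """True if the number of voting members is odd (votes defaults to 1)."""
    voters = sum(1 for m in members if m.get("votes", 1) > 0)
    return voters % 2 == 1


# Hypothetical three-node production set plus one hidden analytics node.
members = [
    {"_id": 0, "host": "db0.example.net:27017"},
    {"_id": 1, "host": "db1.example.net:27017"},
    {"_id": 2, "host": "db2.example.net:27017"},
    hidden_member(3, "analytics.example.net:27017"),
]
print(has_odd_voters(members))
```

With four voters the check fails, which is the situation where adding an arbiter (or giving the hidden member zero votes) comes into play.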
If you are looking for a way to increase query speed on a large amount of data, you might look into sharding with MongoDB. Basically, it divides your data into smaller shards and stores them on different machines.
If you are looking to increase redundancy (to make backups or to do offline processing without touching the primary servers), you should look into replication with MongoDB. If you use replication, keep in mind that the data on the replicas will always lag slightly behind the primary (nothing to worry about, but you need to know this to decide whether you can allow reads from the replicas). As Rafa pointed out, hidden replica set members are well suited for backup and offline data processing. They still receive all the data from the primary (with a small lag), but they are invisible to secondary reads and cannot become primary.
There is a nice MongoDB course that covers replication and sharding in depth, so it may be worth taking.
We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired and of course no other reads or writes are happening while a write lock is held. MongoDB operations yield periodically to keep other threads waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? Because there is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize the requests (at least not yet), whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data) try running smaller aggregation queries on short time slices. This will accomplish two things:
Read jobs will be shorter-lived and will finish sooner, which gives you a chance to assess what impact the queries have on your "live" performance.
You won't be pulling all the old data into RAM at once - by spacing out these analytical queries over time you will minimize their impact on current write performance.
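The time-slicing idea above could look roughly like the following sketch, which splits a date range into short windows and builds one range filter per window. The field name created_at is an assumption about the schema, and the actual query and the pause between slices are left to your driver code.

```python
from datetime import datetime, timedelta


def time_slices(start, end, step):
    """Yield half-open (lower, upper) windows covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def slice_filter(lower, upper, field="created_at"):
    """Range filter for one slice; the field name is hypothetical."""
    return {field: {"$gte": lower, "$lt": upper}}


day = datetime(2012, 10, 1)
for lower, upper in time_slices(day, day + timedelta(days=1), timedelta(hours=6)):
    print(slice_filter(lower, upper))
```

Running one small query per window, with a pause in between, keeps each read lock short and lets you stop early if the live system starts to suffer.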
Depending on what it is you can't afford about getting another server, you might consider a short-lived AWS instance, which may not be very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when creating that copy: a full sync off the production system will place a heavy load on it (a more effective way would be to seed it from a recent backup/file snapshot and resume from there).
Such operations are best left for slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, MongoDB yields on long-running read/write queries. Their concurrency docs should help.
If you can't afford another server, you can set up a slave on the same machine, provided you have some spare RAM/disk headroom and you use the slave lightly/occasionally. You must be careful, though: your disk I/O will increase significantly.
I am working on a project which has some important data in it. This means we cannot afford to lose any of it if the power or the server goes down. We are using MongoDB for the database. I'd like to be sure that my data is in the database after the insert, and roll back the whole batch if one element was not inserted. I know the philosophy behind Mongo is that we do not need transactions, but how can I make sure that my data is really safely stored after an insert rather than sent to some "black hole"?
Should I make a search?
Should I use some specific mongoDB commands?
Should I use sharding even if one server is enough for satisfying
the speed and by the way it doesn't guarantee anything if the light
goes down?
What is the best solution?
Your best bet is to use Write Concerns - these allow you to tell MongoDB how important a piece of data is. The quickest Write Concern is also the least safe - the data is not flushed to disk until the next scheduled flush. The safest will confirm that the data has been written to disk on a number of machines before returning.
The write concern you are looking for is FSYNC_SAFE (at least that is what it is called from the point of view of the Java driver) or REPLICAS_SAFE which confirms that your data has been replicated.
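As a rough sketch of the trade-off described above, the levels can be written as write concern documents using the w, j, and wtimeout fields. The level names below are made up for this example, and the mapping from the older FSYNC_SAFE and REPLICAS_SAFE constants to these fields is an approximation, so check your driver's documentation before relying on it.

```python
def write_concern(level, replicas=2, timeout_ms=5000):
    """Map an illustrative safety level to a write concern document."""
    if level == "fast":
        return {"w": 0}                      # fire and forget, least safe
    if level == "acknowledged":
        return {"w": 1}                      # server has applied the write
    if level == "journaled":
        return {"w": 1, "j": True}           # write is in the on-disk journal
    if level == "replicated":
        # wait for `replicas` nodes, give up after timeout_ms
        return {"w": replicas, "wtimeout": timeout_ms}
    raise ValueError(f"unknown level: {level}")


for level in ("fast", "acknowledged", "journaled", "replicated"):
    print(level, write_concern(level))
```

Each step down the list waits for more to happen before returning, which is why the safer levels are also the slower ones.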
Bear in mind that MongoDB does not have transactions in the traditional sense - your rollback will have to be rolled by hand as you can't tell the Mongo database to do this for you.
The other thing you need to do is either use the relatively new --journal option (which uses a Write Ahead Log), or use replica sets to share your data across many machines in order to maximise data integrity in the event of a crash/power loss.
Sharding is not so much a protection against hardware failure as a method for sharing the load when dealing with particularly large datasets - sharding shouldn't be confused with replica sets which is a way of writing data to more than one disk on more than one machine.
Therefore, if your data is valuable enough, you should definitely be using replica sets, perhaps even siting slaves in other data centres/availability zones/racks/etc in order to provide the resilience you require.
There is/will be (can't remember offhand whether this has been implemented yet) a way to specify the priority of individual nodes in a replica set such that if the master goes down the new master that is elected is one in the same data centre if such a machine is available (ie to stop a slave on the other side of the country from becoming master unless it really is the only other option).
I received a really nice answer from a person called GVP on Google Groups. I will quote it (basically, it adds to Rich's answer):
I'd like to be sure that my data is in the database after the
insert and rollback the whole batch if one element was not inserted.
This is a complex topic and there are several trade-offs you have to
consider here.
Should I use sharding?
Sharding is for scaling writes. For data safety, you want to look at
replica sets.
Should I use some specific mongoDB commands?
First thing to consider is "safe" mode or "getLastError()" as
indicated by Andreas. If you issue a "safe" write, you know that the
database has received the insert and applied the write. However,
MongoDB only flushes to disk every 60 seconds, so the server can fail
without the data on disk.
Second thing to consider is "journaling"
(v1.8+). With journaling turned on, data is flushed to the journal
every 100ms. So you have a smaller window of time before failure. The
drivers have an "fsync" option (check that name) that goes one step
further than "safe", it waits for acknowledgement that the data has
been flushed to the disk (i.e. the journal file). However, this only
covers one server. What happens if the hard drive on the server just
dies? Well you need a second copy.
Third thing to consider is
replication. The drivers support a "W" parameter that says "replicate
this data to N nodes" before returning. If the write does not reach
"N" nodes before a certain timeout, then the write fails (exception
is thrown). However, you have to configure "W" correctly based on the
number of nodes in your replica set. Again, because a hard drive
could fail, even with journaling, you'll want to look at replication.
Then there's replication across data centers which is too long to get
into here. The last thing to consider is your requirement to "roll
back". From my understanding, MongoDB does not have this "roll back"
capacity. If you're doing a batch insert the best you'll get is an
indication of which elements failed.
Here's a link to the PHP driver on this one: http://it.php.net/manual/en/mongocollection.batchinsert.php You'll have to check the details on replication and the W parameter. I believe the same limitations apply here.
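Since both answers say the rollback has to be done by hand, here is a minimal sketch of what that could look like: insert documents one by one and, on the first failure, remove the ones that already went in. InMemoryCollection is a hypothetical stand-in used so the example runs on its own; it is not a MongoDB driver API, and with a real driver you would combine this with a safe write concern.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a driver collection (not a real MongoDB API)."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        if doc["_id"] in self.docs:
            raise KeyError(f"duplicate _id: {doc['_id']}")
        self.docs[doc["_id"]] = doc

    def remove(self, doc_id):
        self.docs.pop(doc_id, None)


def insert_all_or_nothing(collection, batch):
    """Insert every document, or undo the partial batch and return False."""
    inserted = []
    try:
        for doc in batch:
            collection.insert(doc)
            inserted.append(doc["_id"])
    except Exception:
        for doc_id in reversed(inserted):  # manual compensation
            collection.remove(doc_id)
        return False
    return True


col = InMemoryCollection()
col.insert({"_id": 2, "v": "existing"})
ok = insert_all_or_nothing(col, [{"_id": 1}, {"_id": 2}, {"_id": 3}])
print(ok, sorted(col.docs))
```

This is compensation, not a transaction: other clients can see the partial batch before it is undone, and a crash in the middle of the cleanup would leave it in place.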