MongoDB: does a huge collection affect the performance of other collections?

In my application I'm about to save some files to the DB.
I've seen the debate over whether to save files on the filesystem or in the database, and chose to save them in the database.
My database for this project is MongoDB.
I would like to know: if I have, let's say, 20 collections in my MongoDB,
and exactly one of them is extremely big,
will I see a performance impact when I work on the other (smaller) collections?
If so, should I separate this collection from the others (i.e. create another DB for this huge collection alone)?
Does MySQL suffer from the same effect?
Thanks.

There are two key considerations here:
1. Ensure that your working set fits in memory. This means your available memory should exceed at least the total size of the indexes your reads use.
2. As of v2.2, MongoDB takes a database-level write lock (earlier versions used a single global lock). During a write the entire database is locked for reads, so a large bulk insert into a single collection that takes a while blocks reads on every other collection in that database for the duration. Therefore, if you move your large collection into a separate database, the key advantage is that inserts to that collection will not block reads to collections in other databases.
I'd suggest firstly ensuring that you have enough memory for your working set, and secondly separating the large collection into its own database if you intend to write to it a lot.
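As a rough check of the first point, you can compare total index size against available RAM. A minimal pymongo sketch, assuming a local mongod and using the standard collStats command (the database name "mydb" is a placeholder):

```python
# Sketch: sum index sizes across a database to size the working set.
# "mydb" is a placeholder database name.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

total_index_bytes = 0
for name in db.list_collection_names():
    stats = db.command("collStats", name)
    total_index_bytes += stats.get("totalIndexSize", 0)

print(f"Total index size: {total_index_bytes / 1024**2:.1f} MiB")
# If this alone approaches your RAM, reads will start paging to disk.
```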

Related

What are the production best practices for storing a large number of documents when using MongoDB?

I need to store application transaction logs and decided to use MongoDB. Every day roughly 200,000 documents are stored in a single-node MongoDB.
We have some reports and operations ("if something happened, then do something") that depend on those logs, so we need to find documents matching different criteria. At that pace, is this sustainable? Will queries become slow to execute?
Any suggestions for using MongoDB efficiently here?
By the way, all of that data is in a single collection, and the MongoDB server version is 4.2.6.
Mongo collections can grow to many terabytes without much issue. To query that data quickly, you will have to analyze your queries and create indexes on the fields those queries use.
Indexes are not free, though: they take disk space and use up RAM, because for an index to be useful it needs to fit entirely in RAM.
In most cases, if indexes and collections grow beyond what your hardware can handle, you will have to archive/evict old data and trim down the collections.
If your queries need that evicted data in order to generate your reports, you will have to keep another collection of summarized values/data for the evicted records, which you combine with the current data when generating the reports.
Alternatively, sharding can help with big data, but there are some limitations on the queries you can run against sharded collections.
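For concreteness, a minimal pymongo sketch of the indexing and eviction ideas above; the field names ("level", "createdAt"), the collection name, and the 90-day TTL window are assumptions for illustration:

```python
# Sketch: index a log collection for report queries, and let a TTL index
# evict old entries automatically. All names/values here are made up.
from pymongo import MongoClient, ASCENDING, DESCENDING

logs = MongoClient()["appdb"]["transaction_logs"]

# Compound index matching a typical report query shape.
logs.create_index([("level", ASCENDING), ("createdAt", DESCENDING)])

# TTL index: mongod removes documents ~90 days after their createdAt.
logs.create_index("createdAt", expireAfterSeconds=90 * 24 * 3600)

# A report query the compound index can serve efficiently.
recent_errors = logs.find({"level": "error"}).sort("createdAt", -1).limit(100)
```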

What is the exact limit on the number of collections in a MongoDB database based on the WiredTiger (MongoDB 4.0) engine as of today (May 2019)?

I'm trying to model a forum-like structure in a MongoDB 4.0 database: multiple threads under the same "topic", where each thread consists of a bunch of posts, so there is usually no limit on the number of threads or posts. I want to fully exploit the NoSQL features and grab the list of posts under any specified thread in one go, without scanning for matching "thread_id" and "post_id" values in an RDBMS table the traditional way. So my idea is to make every thread a collection in the database, with the code-generated thread_id as the collection name, and to store all the posts of a thread as ordinary documents in that collection. Accessing a post would then look like:
forum_db (database name) . thread_id (collection name) . post_id (document ID)
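In pymongo terms, that proposed access pattern would look roughly like this (the thread and post IDs are hypothetical placeholders):

```python
# Sketch of the collection-per-thread access pattern described above.
from pymongo import MongoClient

client = MongoClient()
thread_id = "thread_8f3a01"   # hypothetical code-generated collection name
post = client["forum_db"][thread_id].find_one({"_id": "post_42"})
```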
But my concern is the rather obscure phrasing at https://docs.mongodb.com/manual/reference/limits/#data:
Number of Collections in a Database
Changed in version 3.0.
For the MMAPv1 storage engine, the maximum number of collections in a database is a function of the size of the namespace file and the number of indexes of collections in the database.
The WiredTiger storage engine is not subject to this limitation.
Is it safe to do it this way in terms of performance and scalability? Can we safely assume that there is no limit on the number of collections in a WiredTiger database (MongoDB 4.0+) today, just as there is practically no limit on the number of documents in a collection? Many thanks in advance.
To estimate how many collections you can store in a MongoDB database, you also need to account for the number of indexes in each collection.
The WiredTiger engine keeps an open file handle for each collection in use (and for each of its indexes). A large number of open file handles can cause extremely long checkpoint operations.
Furthermore, each handle takes roughly 22 KB of memory outside the WiredTiger cache; this means that just to keep the files open, the mongod process will need about NUM_OF_FILE_HANDLES * 22 KB of RAM.
Heavy memory swapping will then degrade performance.
As you can probably tell from the above, different hardware (RAM size and disk speed) will behave differently.
From my point of view, you first need to understand the behavior of your application, and then calculate the hardware your MongoDB database server requires.
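To put the ~22 KB figure in perspective, a back-of-the-envelope calculation for the collection-per-thread design (all inputs are hypothetical):

```python
# Back-of-envelope: RAM spent purely on WiredTiger file handles.
threads = 100_000                 # one collection per forum thread
indexes_per_coll = 2              # _id index plus one secondary index
handles = threads * (1 + indexes_per_coll)   # one file per collection/index
ram_bytes = handles * 22 * 1024               # ~22 KB per open handle

print(f"{handles:,} handles -> ~{ram_bytes / 1024**3:.1f} GiB of RAM")
# 300,000 handles -> ~6.3 GiB, before any cache or working set at all.
```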

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query against a collection, filtering by one field. I thought I could speed up the query if, based on this field, I split the data into many separate collections whose names contain the field value I previously filtered on. Practically, I could then remove the filter from the query, because I would only need to pick the right collection and return its documents as the response. But this way documents are stored redundantly: a document that was previously stored once might now be stored in several collections. Is this approach worth following? I use Heroku as the cloud provider; by increasing the number of dynos it is easy to serve more user requests. As I understand it, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists on that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and their associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 secondary indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each collection also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization than a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.
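If you do go the sharding route, enabling it is a couple of admin commands. A minimal sketch run against a mongos router; the database/collection names and the hashed shard key are placeholders:

```python
# Sketch: shard a busy collection (run against mongos, not a plain mongod).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "appdb")
client.admin.command(
    "shardCollection",
    "appdb.events",
    key={"userId": "hashed"},   # hashed key spreads writes across shards
)
```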

How does MongoDB manage data after inserting?

I know that after data is inserted into the DB, MongoDB stores it in files; however, I'm confused about memory.
Suppose I insert 50 million records into the DB: will all of this data be loaded into memory? If not, how does MongoDB behave to keep its performance up?
In that case documents are loaded into memory on request, in blocks; that is, the collection is split into chunks, and the most frequently used chunks reside in memory.
To gain performance, Mongo uses indexes, and there is a special kind of query called a covered query, where all the data needed is available in the index itself, which is smaller than the collection.
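A minimal sketch of a covered query in pymongo; the collection and field names are placeholders:

```python
# Sketch: a covered query. With an index on "status" and a projection
# limited to indexed fields (excluding _id), the index alone answers it.
from pymongo import MongoClient

users = MongoClient()["appdb"]["users"]
users.create_index("status")

cursor = users.find({"status": "active"}, {"status": 1, "_id": 0})
# cursor.explain() should show totalDocsExamined == 0 for a covered plan.
```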

Multiple single-collection databases, or database with multiple collections?

Is there any advantage to using multiple collections within a database, when multiple databases each with a single collection would accomplish the same thing? From what I can gather, using multiple databases reduces lock contention because locks are per-database, so I wonder why you'd ever want to put more than one collection in a single database.
The only downside I've found mentioned is that there's some overhead (~200MB) per database, and that with a large number of databases, OS file handles can become a limitation, but I imagine that if you have enough collections/databases for those to be issues, you've got too many databases. These overheads are bearable in my case; I'd like to know if there's something else I should know about.
EDIT: In my situation there are currently 30 collections spread across 8 databases. I'm asking this question because I think it may be better to make this 30 collections across 30 databases. There's no real reason for the current structure; it was set up by a team who didn't know much about databases, so it was picked arbitrarily. It's now used frequently enough for lock contention to be a factor (profiling shows some operations spending >1 second waiting for locks). We'll scale horizontally too; I just saw this as potential low-hanging fruit, since it only means using a different database name for some operations instead of a different collection name.
Apologies if this has been asked before; the only similar questions I've found have been about whether to use e.g. "a collection per user", which isn't quite the same thing. In my case I have heterogeneous documents which I definitely do want stored in different collections; I'd just like to know whether to store those collections in the same database or not.
This may be a duplicate of: creating a different database for each collection in MongoDB 2.2
In my solution I created a separate database for each large, high-load collection, and put the remaining collections in one common database. MongoDB implements locks on a per-database basis for most read and write operations (http://docs.mongodb.org/manual/faq/concurrency/), but locks in MongoDB are not as nasty as in SQL.
This setup increased MongoDB throughput for me.
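In application code the split is just a matter of which database handle you use; a minimal sketch with made-up names, under the per-database locking described above:

```python
# Sketch: route the heavy-write collection to its own database so its
# per-database write lock doesn't block reads elsewhere. Names are made up.
from pymongo import MongoClient

client = MongoClient()
hot = client["bulk_db"]["big_events"]   # high-load collection, own DB
app = client["app_db"]                  # everything else shares this DB

hot.insert_one({"payload": "x"})        # locks only bulk_db while writing
user = app["users"].find_one({"name": "alice"})  # unaffected by the insert
```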