What is the exact limit on the number of collections in a MongoDB database with the WiredTiger engine (MongoDB 4.0) as of today (May 2019)?

I'm trying to model a forum-like structure in a MongoDB 4.0 database: multiple threads belong to the same "topic", and each thread consists of a bunch of posts, so there is usually no limit on the number of threads or posts. I want to fully utilize the benefits of the NoSQL features and grab the list of posts under any specified thread in one go, without having to scan for and match the "thread_id" and "post_id" in an RDBMS table the traditional way. So my idea is to make every thread a collection in the database, using the code-generated thread_id as the collection name, and to store all the posts of that thread as ordinary documents in that collection, so the way to access a post would look like:
forum_db [database name].thread_id [collection name].post_id [document ID]
But my concern is the rather vague statement at https://docs.mongodb.com/manual/reference/limits/#data:
Number of Collections in a Database
Changed in version 3.0.
For the MMAPv1 storage engine, the maximum number of collections in a database is a function of the size of the namespace file and the number of indexes of collections in the database.
The WiredTiger storage engine is not subject to this limitation.
Is it safe to do it this way in terms of performance and scalability? Can we safely take it that there is no limit on the number of collections in a WiredTiger database (MongoDB 4.0+) today, just as there is practically no limit on the number of documents in a collection? Many thanks in advance.

To work out how many collections you can store in a MongoDB database, you also need to account for the number of indexes in each collection.
The WiredTiger engine keeps an open file handle for each collection it uses (and for each of its indexes). A large number of open file handles can cause extremely long checkpoint operations.
Furthermore, each handle takes roughly 22KB of memory outside the WiredTiger cache; this means that just to keep the files open, the mongod process needs approximately NUM_OF_FILE_HANDLES * 22KB of RAM.
Heavy memory swapping will then lead to a decrease in performance.
As you can probably see from the above, different hardware (RAM size and disk speed) will behave differently.
From my point of view, you first need to understand the behavior of your application and then calculate the required hardware for your MongoDB database server.
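For a rough feel of those numbers on an existing deployment, something like this in the mongo shell counts the data and index files per database and applies the ~22KB-per-handle figure quoted above (an approximation, not an exact cost):
var handles = 0;
db.getCollectionNames().forEach(function (name) {
    handles += 1 + db.getCollection(name).getIndexes().length;  // one data file plus one file per index
});
print("open file handles for this db: " + handles);
print("approx. RAM outside the WT cache: " + (handles * 22 / 1024).toFixed(1) + " MB");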

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDB (64-bit version) to handle a large number of users (around 100,000), and each user will have a large amount of data (around 1 million records).
What is the best strategy of design?
Dump all records in single collection
Have a collection for each user
Have a database for each user.
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers, which are presented as a single logical unit via the mongo client.
Therefore the answer to your question is put all your records in a single sharded collection.
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up, and test the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.
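As a rough illustration of what that looks like, a minimal sketch in the mongo shell (the database, collection, and shard key names are placeholders, not something from the question):
use mydb
sh.enableSharding("mydb")
sh.shardCollection("mydb.records", { user_id: 1, _id: 1 })   // compound key keeps a user's records together while distributing users
db.records.createIndex({ user_id: 1, created_at: 1 })        // supports the common per-user queries
The right shard key depends entirely on your read/write distribution, as the answer says.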
So you are looking for 100,000,000 detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling is normally classed as scaling huge single collections of data across many (many) servers in a huge cluster.
So already, if you use a single collection for common data (i.e. one collection called user and one called detail) you are suiting MongoDB's core purpose and build.
MongoDB, as mentioned by others, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though roughly 12K initial collections is the usual estimate, in reality, because indexes take up namespace slots too, you can have as few as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having individual collections per user.
I have never personally seen anyone scale MongoDB to the billions, or anywhere close to the 100s of billions (or maybe beyond), on an optimised set-up; however, I do not see why it cannot be done. After all, Facebook is able to make MySQL scale into the 100s of billions of rows for them (across 32K+ shards), and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema, shard concept, and shard key (and servers, network, and so on).
If you were to hit problems, you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time, and ensure that this data is always hot; that way, queries that don't have to do a global scatter-gather operation should be quite fast.
About a collection for each user:
With the default configuration, MongoDB is limited to about 12k collections. You can increase this with --nssize, but it is not unlimited.
And you have to count indexes towards this 12k (check the "namespaces" concept in the MongoDB documentation).
About a database for each user:
From a modelling point of view, that is a very curious choice.
Technically there is no limit in Mongo, but you will probably hit a file descriptor limit (set by your OS/settings).
So, as @Rohit says, the last two options are not good.
Maybe you should explain more about your case.
Maybe you can split users into different collections (e.g. one for each first letter of the name, or one per service of the company...); a small sketch of that idea follows below.
And, of course, use sharding.
Edit: maybe MongoDb is not the best database for your use case.
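Purely as an illustration of the "one collection per first letter" idea above, something like this in the mongo shell would route users to a bounded number of collections (the names and fields are made up for the example):
function collectionFor(user) {
    var letter = user.name.charAt(0).toLowerCase();
    return "users_" + (/[a-z]/.test(letter) ? letter : "other");   // at most 27 collections
}
db.getCollection(collectionFor({ name: "Alice" })).insertOne({ name: "Alice", created_at: new Date() });
This keeps the collection count small and fixed, unlike one collection (or database) per user.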

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query against a collection, filtering by one field. I thought I could speed up the query if, based on this field, I created many separate collections whose names contain the value I previously filtered on. In practice I could then drop the filter component from the query, because I would only need to pick the right collection and return the documents in it as the response. But this way documents would be stored redundantly: a document that was previously stored only once might now be stored in several collections. Is this approach worth following? I use Heroku as my cloud provider; by increasing the number of dynos it is easy to serve more user requests. As I understand it, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists on that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 secondary indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each of those collections also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
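If you want to confirm this mapping on your own WiredTiger deployment, the collection stats expose the underlying table for the data and for each index; something along these lines (field names as of recent versions, so treat it as a sketch):
db.mycoll.stats().wiredTiger.uri                       // the WT table backing the collection data
db.mycoll.stats({ indexDetails: true }).indexDetails   // one WT table (file) per index, including _id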
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

MongoDB: does a huge collection affect the performance of other collections?

In my application I'm about to save some files in the DB.
I've seen the debate about whether to save files on the filesystem or in the database, and chose to save the files in the database.
My database for the project is MongoDB.
I would like to know: if I have, let's say, 20 collections in my MongoDB,
and exactly one of them is extremely big,
will I see a performance impact when I work on the other (less large) collections?
If so, should I separate this collection from the other collections (create another DB for this huge collection alone)?
Does MySQL suffer from the same effect?
Thanks.
There are two key considerations here:
Ensure that your working set fits in memory. This will mean that your available memory should exceed at least the total size of the indexes you use for your reads.
MongoDB has a database-level write lock as of v2.2. This means that during any write operation, the entire database is locked for reads. So for large bulk inserts into a single collection that may take a while, all other collections in that database are locked for the duration of the bulk insert. Therefore, if you separate your large collection into a separate database, your key advantage will be that inserts to that collection will not block reads to collections in other databases.
I'd suggest firstly ensuring that you have enough memory for your working set, and secondly I'd separate the large collection into a separate DB if you intend to write to it a lot.
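A few shell commands that help with the first point (checking whether your indexes, and hence your working set, fit in RAM); the collection name is just a placeholder:
db.stats().indexSize                 // total index size for the current database, in bytes
db.hugeCollection.totalIndexSize()   // index size of the one large collection
db.serverStatus().mem                // resident/virtual memory used by the mongod process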

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert batch size is; this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
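One way to experiment with batch sizes from the mongo shell looks roughly like this (the document shape and the batch size of 1000 are just made up for the example):
var batch = [];
for (var i = 0; i < 1000; i++) {
    batch.push({ w1: "word" + i, w2: "word" + (i + 1) });   // hypothetical bigram document
}
db.bigrams.insertMany(batch, { ordered: false });           // unordered inserts usually go faster
Time a few different batch sizes against your real documents before settling on one.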
Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying: a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
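For reference, pre-splitting and pausing the balancer before a bulk load looks something like this in the mongo shell (assuming a collection mydb.bigrams already sharded on { w1: 1 }; the names and split points are placeholders):
sh.stopBalancer()                                        // pause automatic chunk migrations during the load
sh.splitAt("mydb.bigrams", { w1: "m" })                  // split a chunk at a chosen shard-key value
sh.moveChunk("mydb.bigrams", { w1: "m" }, "shard0001")   // place the new chunk on a specific shard
// ...repeat for the other ranges, then sh.startBalancer() once the load is done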
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

Limits on the number of collections in databases

Can anyone say whether there are any practical limits on the number of collections in MongoDB?
They write here https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections:
Generally, having a large number of collections has no significant
performance penalty, and results in very good performance.
But for some reason MongoDB sets a limit of 24000 on the number of namespaces in a database. It looks like it can be increased, but I wonder why there is such a limit in the default configuration if having many collections in a database doesn't cause any performance penalty?
Does this mean it is a viable solution to have a practically unlimited number of collections in one database, for example to keep each account's data in its own collection in a multi-tenant application, ending up with, say, hundreds of thousands of collections in the database?
If a very large number of collections (one per tenant) is a viable solution, what are the benefits of that versus, for example, keeping each tenant's documents in one shared collection?
Thank you very much for your answers.
This answer is late however the other answers seem a bit...weak in terms of reliability and factual information so I will attempt to remedy that a little.
But for some reason MongoDB sets a limit of 24000 on the number of namespaces in a database.
That is merely the default setting. Yes, there is a default setting.
It does say on the limits page that 24000 is the limit ( http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Namespaces ), as though there is no way to expand that but there is.
However there is a maximum limit on how big a namespace file can be ( http://docs.mongodb.org/manual/reference/limits/#Size%20of%20Namespace%20File ) which is 2GB. That gives you roughly 3 million namespaces to play with in most cases which is quite impressive and I am unsure if many people will hit that limit quickly.
You can modify the default value to go higher than 16MB by using the nssize parameter either within the configuration ( http://docs.mongodb.org/manual/reference/configuration-options/#nssize ) or at runtime by manipulating the command used to run MongoDB ( http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--nssize ).
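On an MMAPv1 deployment, you can check the current namespace file size from the shell before deciding whether to restart mongod with a larger --nssize (this field does not apply to WiredTiger):
db.stats().nsSizeMB   // size of this database's .ns file, in MB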
There is no real reason why MongoDB uses 16MB by default for its nssize as far as I know; I have never heard the motto of "not bothering the user with every single detail", so I don't buy that one.
I think, in my opinion, the main reason why MongoDB hides this is that, even though the documentation states:
Distinct collections are very important for high-throughput batch processing.
using multiple collections as a means to scale vertically, rather than horizontally through a cluster as MongoDB is designed to, is quite often considered bad practice for large-scale websites; as such, 12K collections is normally considered something that people will never, and should never, reach.
No More Limits!
As other answers have stated, this is determined by the size of the namespace file. This was previously an issue, because it had a default limit of 16MB and a maximum of 2GB. However, with the release of MongoDB 3.0 and the WiredTiger storage engine, it looks like this limit has been removed. WiredTiger seems to be better in almost every way, so I see little reason for anyone to use the old engine, except for legacy support reasons. From the site:
For the MMAPv1 storage engine, namespace files can be no larger than
2047 megabytes.
By default namespace files are 16 megabytes. You can configure the
size using the nsSize option.
The WiredTiger storage engine is not subject to this limitation.
http://docs.mongodb.org/manual/reference/limits/
A little background:
Every time mongo creates a database, it creates a namespace (<database>.ns) file for it. The namespace (or collections, as you might want to call it) file holds the metadata about the collections. By default the namespace file is 16MB in size, though you can increase the size manually. The metadata for each collection is 648 bytes plus some overhead bytes. Divide 16MB by that and you get approximately 24000 namespaces per database (16MB / ~700 bytes ≈ 24000). You can start mongo with a larger namespace file, and that will let you create more collections per database.
The idea behind any default configuration is to not bother the user with every single detail (and configurable knob) and choose one that generally works for most people. Also, viability does go hand in hand with best/good design practices. As Chris said, consider the shape of your data and decide accordingly.
As others mention, the default namespace size is 16MB and you can get about 24000 namespace entries. Actually, my 64-bit instance on Ubuntu topped out at 23684 using the default 16MB namespace file.
One important thing that isn't mentioned in the FAQ is that indexes also use namespace slots.
You can count the namespace entries with:
db.system.namespaces.count()
And it's also interesting to actually take a look at what's in there:
db.system.namespaces.find()
Set your limit higher than what you think you need because once a database is created, the namespace file cannot be extended (as far as I understand - if there is a way, please tell me!!!).
Practically, I have never run across a maximum. But I've definitely never gone beyond the 24,000 collection limit. I'm pretty sure I've never hit more than 200, other than when I was performance testing the thing. I have to admit, I think it sounds like an awful lot of chaos to have that many collections in a single database, rather than grouping like data into its own collections.
Consider the shape of your data and your business rules. If your data needs to be laid out such that you must have the data separated into different logical groupings for your multi-tenant app, then you should probably consider other data stores. Because while Mongo is great, the fact that they put a limit on the number of collections at all tells me that they know there is some theoretical limit where performance is affected.
Perhaps you should consider a store that matches the data shape? Riak, for example, has an unlimited number of 'buckets' (without a theoretical maximum) that you can have in your application. One bucket per account is perfectly doable, but you sacrifice some queryability by going in that direction.
Otherwise, you may want to follow a more relational model of grouping like with like. In my view, Mongo feels like a halfway point between relational databases and key-value stores, which means it is easier to conceptualize coming from a relational database world.
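As a minimal sketch of that "group like with like" approach for a multi-tenant app: one shared collection with a tenant field and a compound index, so each tenant's queries stay selective (collection and field names are illustrative only):
db.accountData.createIndex({ tenant_id: 1, created_at: -1 })
db.accountData.insertOne({ tenant_id: "acme", created_at: new Date(), payload: {} })
db.accountData.find({ tenant_id: "acme" }).sort({ created_at: -1 }).limit(20)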
There seems to be a massive overhead to maintaining collections. I've just reduced a database which had around 1.5 million documents in 11,000 collections to one with the same number of documents in around 300 collections; this reduced the size of the database from 8GB to 1GB. I'm not familiar with the inner workings of MongoDB, so this may be obvious, but I thought it might be worth noting in this context.