mongodb database schema for many-users-application

Question: There are tens of thousands of users (but fewer than 500K) in an application.
Proposed solution: store every user's collections (10-20) in a separate per-user namespace (just one for every client) in order to save disk space by dropping the user id 'column' from each document; speed up queries because each namespace has a small index; reduce the lock ratio (https://jira.mongodb.org/browse/SERVER-1240); and simplify sharding (https://jira.mongodb.org/browse/SERVER-939).
Is this OK? Or maybe I should use one general collection instead of per-user namespaces?
Thanks for your answers.

I think I understand your question, but correct me if I'm wrong. It seems like you're looking to store the users of each Application in their own collection. This has several advantages and disadvantages that you have to weigh based on complex DBA decisions like R/W ratio, load, etc.
Advantages
Like you've mentioned, indexes will take less time to update because they only have a segment of users.
Queries on non-indexed fields (if there are any) will be quicker because each collection holds fewer documents.
The global write lock won't play as much of a role since you're only locking per application.
Disadvantages
Since indexes are scoped by collection you will have (# of Applications) times more indexes to keep in memory (indexes do little good if you page them out).
Because indexes and collections occupy their own namespaces and each namespace occupies about 628 bytes, you need to worry about the default 16MB namespace limit. This will limit the number of applications you can have. For example, with 2 indexes per collection you're limited to about 8,000 collections.
Finally, since your users will be in different collections, you won't be able to query across applications. This can be subverted by MapReduce, but adds more complexity.
At the end of the day you can achieve most of these benefits while circumventing the disadvantages by simply sharding on some application key. The many-collection scenario is tempting, but I think it is ultimately not what MongoDB is optimized for.
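The namespace arithmetic behind the "about 8,000 collections" figure can be sketched as follows. This is a rough estimate only: it uses the 628 bytes per namespace and the 16MB default quoted above, ignores any per-file overhead, and the function names are made up for illustration.

```javascript
// Rough estimate of how many collections fit in one namespace file.
// Assumptions (taken from the answer above, not measured): 628 bytes per
// namespace entry, one namespace per collection plus one per index.
const BYTES_PER_NAMESPACE = 628;

// Number of namespace entries that fit in a file of the given size (in MB).
function namespaceCapacity(nsFileMb) {
  return Math.floor((nsFileMb * 1024 * 1024) / BYTES_PER_NAMESPACE);
}

// Number of collections that fit if each one carries the given number of indexes.
function maxCollections(nsFileMb, indexesPerCollection) {
  const namespacesPerCollection = 1 + indexesPerCollection;
  return Math.floor(namespaceCapacity(nsFileMb) / namespacesPerCollection);
}

console.log(namespaceCapacity(16)); // 26715
console.log(maxCollections(16, 2)); // 8905
```

Under these assumptions, with 10-20 collections per user as in the question, the default budget would be used up after a few hundred users, far short of tens of thousands.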

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDb (64 bit version) to handle a large amount of users (around 100,000) and each user will have large amounts of data (around 1 million records).
What is the best strategy of design?
Dump all records in single collection
Have a collection for each user
Have a database for each user.
Many Thanks,

So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers that are presented as a single logical unit via the mongo client.
Therefore the answer to your question is: put all your records in a single sharded collection.
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up, and then test the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.

So you are looking at 100,000,000,000 (100 billion) detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling is normally classed as scaling huge single collections of data across many (many) servers in a huge cluster.
So if you use a single collection for common data (i.e. one collection called user and one called detail), you are already working with MongoDB's core purpose and design.
MongoDB, as mentioned by others, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though the usual estimate is 12K initial collections, in reality, because indexes also take up namespace room, you can end up with as few as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered someone not being able to scale MongoDB to the billions or even close to the 100s of billions (or maybe beyond) on an optimised set-up, and I do not see why it could not; after all, Facebook is able to make MySQL scale into the 100s of billions per user (across 32K+ shards), and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema, shard concept and shard key (and servers, network, etc.).
If you were to witness problems, you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, make sure that MongoDB knows where each segment of your huge dataset is at any given point in time on the master, and ensure that this data is always hot; that way, queries that don't do a global scatter-gather operation should be quite fast.

About a collection for each user:
With the default configuration, MongoDB is limited to 12k collections. You can increase this with --nssize, but it's not unlimited.
And you have to count indexes against this 12k (check the "namespaces" concept in the MongoDB documentation).
About a database for each user:
From a modeling point of view, that's very odd.
Technically, MongoDB imposes no limit, but you will probably hit a file descriptor limit (set by your OS/settings).
So as @Rohit says, the last two options are not good.
Maybe you should explain more about your case.
Maybe you can cut users into different collections (ex: one for each first letter of name etc., or for each service of the company...).
And, of course use sharding.
Edit: maybe MongoDb is not the best database for your use case.

120 mongodb collections vs single collection - which one is more efficient?

I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected to hold millions of documents. If I use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance

Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
You do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (while knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding, so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for a HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point: if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).

The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could easily go for 120 collections; however, you won't really gain anything currently because https://jira.mongodb.org/browse/SERVER-1240 is not implemented (and won't be anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you were to house this in separate collections it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to a multi-server setup will not matter in this case either.
In my personal opinion, using a single collection is easier on everything.

Is it better to model data as a Single collection or separate collections?

As an example, imagine a trivial "helpdesk" type app where there are support tickets, and the app supports multiple companies logging in and managing their tickets.
Given that companies won't interact with each other's "Tickets"...
Is it better to have one collection of "Tickets" and query or is it better to create collections of Tickets per Company?

There are a couple of things to consider here.
The first thing is pre-allocation of space. You will find a couple of threads on the mongodb-user group whereby the OP is confused about why their database is taking so much space when their data is taking so little space. This is because when you reach a certain point of pre-alloc within a collection it will create files 2GB in size by default, even if you are only using 100meg of that space.
Now imagine this pre-alloc pattern for 1000 companies; this quickly creates inefficient use of disk space and, in most of the threads, performance and cost problems.
The second thing to consider here is the nssize, which is 2GB maximum. This may seem crazy but what if you do have more than 3 million members (assume a company is a "registered user")? You will quickly use up the maximum namespace file size that MongoDB can give.
Also, you will gain no benefit from the lock (which is at the DB level) without splitting the collections out into separate databases. This, of course, creates an operational overhead in maintaining the database connections for each company.
MongoDB is typically designed to scale through a cluster rather than scale vertically and scaling vertically is normally considered a bad idea for large websites.

I haven't spent much time with MongoDB, but I'll give some arguments so we can discuss it. I think you should create just one Tickets collection, for the following reasons:
Creating a collection for each company seems redundant.
You will have to create and configure a collection every time you add a new company to your system before it can create tickets, whereas with a single collection you only have to create the company.
I don't know how you were planning to link your company document to its corresponding ticket collection, but I think it is more straightforward to create the link by storing the id of the company document in an idcompany attribute in the Tickets collection.
I think one of the reasons that might make you consider a ticket collection per company is that the large amount of data could slow down your queries (all the companies inserting into the same Tickets collection). You can counter this by creating a sharded cluster, using a compound shard key with idcompany and some useful attribute from the Tickets document. This way it is very likely that all the documents of a given company remain on the same shard, so the common queries will perform relatively quickly.
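To illustrate why a shard key that leads with idcompany keeps common queries quick, here is a toy model of range-based routing. It is not MongoDB's implementation: the chunk boundaries, the shard names and the shardsForQuery function are invented for the example, idcompany is assumed to be numeric, and only the leading field of the compound key is modelled.

```javascript
// Toy model of range-based sharding. Each chunk covers a half-open range
// [min, max) of idcompany values and lives on one (hypothetical) shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Return the list of shards a query has to visit.
function shardsForQuery(query) {
  if (!Number.isFinite(query.idcompany)) {
    // No usable shard key value in the query: every shard must be asked.
    return [...new Set(chunks.map((c) => c.shard))];
  }
  // Shard key present: exactly one chunk, and so one shard, can hold the data.
  const chunk = chunks.find(
    (c) => query.idcompany >= c.min && query.idcompany < c.max
  );
  return [chunk.shard];
}

console.log(shardsForQuery({ idcompany: 1500, status: "open" })); // [ 'shardB' ]
console.log(shardsForQuery({ status: "open" })); // [ 'shardA', 'shardB', 'shardC' ]
```

In this model, a query for one company's tickets touches a single shard, while a query that omits idcompany has to be sent to all of them.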

My $0.02:
By separating out each company into their own collections, or better, databases... it makes customer migration and individualized backups, restores, imports and exports much easier at the expense of making your code a tad crappier.
Isolating customer data may reduce your data storage requirements, as you won't need to embed the customer ID into every single document. Of course, with separate databases, most drivers will treat that as a separate network connection.
As with everything, there are tradeoffs.
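On the single-collection side of that tradeoff, the customer ID embedded in each document becomes a mandatory part of every query. A minimal sketch of a helper that enforces this, assuming the tenant field is called idcompany as in the previous answer (the helper name and the sample values are hypothetical):

```javascript
// Build a query filter that is always restricted to one company.
// The tenant field is applied last so a caller-supplied filter cannot override it.
function scopedFilter(companyId, filter = {}) {
  if (companyId === undefined || companyId === null) {
    throw new Error("companyId is required for every Tickets query");
  }
  return { ...filter, idcompany: companyId };
}

const openTickets = scopedFilter(42, { status: "open" });
console.log(openTickets); // { status: 'open', idcompany: 42 }
// With a driver, this object would be passed as the filter to find() on Tickets.
```

Per-company collections or databases make this helper unnecessary, which is the storage and isolation benefit described above; the price is the extra code needed to pick the right collection or connection.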

MongoDB -- large number of documents

This is related to my last question.
We have an app where we are storing large amounts of data per user. Because of the nature of the data, we previously decided to create a new database for each user. This would have required a large no. of databases (probably millions), and, as someone pointed out in a comment, this indicated a wrong design.
So we changed the design and now we are thinking about storing each user's entire information in one collection. This means one collection exactly maps to one user. Since there are 12,000 collections available per database, we can store 12,000 users per DB (and this limit could be increased).
But now my question is: is there any limit on the no. of documents a collection can have? Because of the way we need to store data per user, we expect to have a huge (tens of millions in extreme cases) no. of documents per collection. Is that OK for MongoDB and design-wise?
EDIT
Thanks for the answers. I guess then it's OK to use large no of documents per collection.
The app is a specialized inventory control system. Each user has a large no. of little pieces of information related to them. Each piece of information has a category and some related stuff under that category. Moreover, no two collections need to see each other's data, hence an index that touches more than one collection is not needed.

To adjust the number of collections/indexes you can have (~24k is the limit--~12k is what they say for collections because you have the _id index by default, but keep in mind, if you have more indexes on the collections, that will use namespace up as well), you can use the --nssize option when you start up mongod.
There are plenty of implementations around with billions of documents in a collection (and I'm sure there are several with trillions), so "tens of millions" should be fine. There are some numbers such as counts returned that have constraints of 64 bits, so after you hit 2^64 documents you might find some issues.
What sort of query and update load are you going to be looking at?

Your design still doesn't make much sense. Why store each user in a separate collection?
What indexes do you have on the data? If you are indexing by some field that has content that's common across all the users you'll get a significant saving in total index size by having a single collection with one index.
Index size is often the limiting factor not total database size when it comes to performance.
Why do you have so many documents per user? How large are they?
Craigslist put 2+ billion documents in MongoDB so that shouldn't be an issue if you have the hardware to support it and aren't being inefficient with your indexes.
If you posted more of your schema here you'd probably get better advice.

limits of number of collections in databases

Can anyone say are there any practical limits for the number of collections in mongodb?
They write here https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections:
Generally, having a large number of collections has no significant
performance penalty, and results in very good performance.
But for some reason MongoDB sets a limit of 24000 on the number of namespaces in a database. It looks like it can be increased, but I wonder why there is a limit in the default configuration at all if having many collections in a database doesn't cause any performance penalty?
Does it mean that it's a viable solution to have a practically unlimited number of collections in one database, for example one collection per account in a multitenant application, leading to hundreds of thousands of collections in the database?
If having a very large number of collections, one for every tenant, is a viable solution, what are the benefits of it compared with, for example, keeping the documents of all tenants in one collection?
Thank you very much for your answers.

This answer is late however the other answers seem a bit...weak in terms of reliability and factual information so I will attempt to remedy that a little.
But for some reason mongodb set limit 24000 for the number of namespaces in the database,
That is merely the default setting. Yes, there is a default setting.
It does say on the limits page that 24000 is the limit ( http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Namespaces ), as though there is no way to expand that but there is.
However there is a maximum limit on how big a namespace file can be ( http://docs.mongodb.org/manual/reference/limits/#Size%20of%20Namespace%20File ) which is 2GB. That gives you roughly 3 million namespaces to play with in most cases which is quite impressive and I am unsure if many people will hit that limit quickly.
You can modify the default value to go higher than 16MB by using the nssize parameter either within the configuration ( http://docs.mongodb.org/manual/reference/configuration-options/#nssize ) or at runtime by manipulating the command used to run MongoDB ( http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--nssize ).
There is no real reason for why MongoDB implements 16MB by default for its nssize as far as I know, I have never heard about the motto of "not bother the user with every single detail" so I don't buy that one.
In my opinion, the main reason why MongoDB hides this is that, even though the documentation states:
Distinct collections are very important for high-throughput batch processing.
using multiple collections as a means to scale vertically, rather than horizontally through a cluster as MongoDB is designed to, is (quite often) considered bad practice for large-scale websites. As such, 12K collections is normally considered something that people will never, and should never, reach.

No More Limits!
As other answers have stated - this is determined by the size of the namespace file. This was previously an issue, because it had a default limit of 16MB and a max of 2GB. However, with the release of MongoDB 3.0 and the WiredTiger storage engine, it looks like this limit has been removed. WiredTiger seems to be better in almost every way, so I see little reason for anyone to use the old engine, except for legacy support reasons. From the site:
For the MMAPv1 storage engine, namespace files can be no larger than
2047 megabytes.
By default namespace files are 16 megabytes. You can configure the
size using the nsSize option.
The WiredTiger storage engine is not subject to this limitation.
http://docs.mongodb.org/manual/reference/limits/

A little background:
Every time MongoDB creates a database, it creates a namespace (db.ns) file for it. The namespace (or collections, as you might want to call it) file holds the metadata about the collections. By default the namespace file is 16MB in size, though you can increase the size manually. The metadata for each collection is 648 bytes plus some overhead bytes. Divide 16MB by that and you get approximately 24000 namespaces per database. You can start mongod with a larger namespace file size, and that will let you create more collections per database.
The idea behind any default configuration is to not bother the user with every single detail (and configurable knob) and choose one that generally works for most people. Also, viability does go hand in hand with best/good design practices. As Chris said, consider the shape of your data and decide accordingly.

As others mention, the default namespace size is 16MB and you can get about 24000 namespace entries. Actually my 64 bit instance in Ubuntu topped out at 23684 using the default 16MB namespace file.
One important thing that isn't mentioned in the FAQ is that indexes also use namespace slots.
You can count the namespace entries with:
db.system.namespaces.count()
And it's also interesting to actually take a look at what's in there:
db.system.namespaces.find()
Set your limit higher than what you think you need because once a database is created, the namespace file cannot be extended (as far as I understand - if there is a way, please tell me!!!).
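One way to pick that higher limit is to work backwards from the number of collections and indexes you expect. The sketch below is an estimate, not an official formula: it scales from the 23684 entries observed above for a 16MB file, assumes capacity grows linearly with file size, and caps the result at the 2047MB maximum mentioned in the other answers. The function name and the default headroom factor of 2 are illustrative choices.

```javascript
// Estimate the --nssize value (in MB) needed for a planned schema.
// Assumption: about 23684 namespace entries per 16MB, scaling linearly.
const ENTRIES_PER_16MB = 23684;
const DEFAULT_NSSIZE_MB = 16;
const MAX_NSSIZE_MB = 2047;

function requiredNsSizeMb(collections, indexesPerCollection, headroom = 2) {
  // Each collection and each index (including the _id index) uses one slot.
  const namespaces = collections * (1 + indexesPerCollection);
  const mb = Math.ceil((namespaces * headroom * 16) / ENTRIES_PER_16MB);
  if (mb > MAX_NSSIZE_MB) {
    throw new RangeError(`needs ${mb}MB, above the ${MAX_NSSIZE_MB}MB maximum`);
  }
  // Never suggest less than the default file size.
  return Math.max(mb, DEFAULT_NSSIZE_MB);
}

// 100,000 collections, each with the _id index plus one more index.
console.log(requiredNsSizeMb(100000, 2)); // 406
```

The result would be passed as the --nssize value when mongod first creates the database, since the file cannot be extended afterwards.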

Practically, I have never run across a maximum. But I've definitely never gone beyond the 24,000 collection limit. I'm pretty sure I've never hit more than 200, other than when I was performance testing the thing. I have to admit, I think it sounds like an awful lot of chaos to have that many collections in a single database, rather than grouping like data in to their own collections.
Consider the shape of your data and business rules. If your data needs to be laid out such that you must have the data separated into different logical groupings for your multi-tenant app, then you probably should consider other data stores. Because while Mongo is great, the fact that they put a limit on the number of collections at all tells me that they know there is some theoretical limit where performance is affected.
Perhaps you should consider a store that would match the data shape? Riak, for example, has an unlimited number of 'buckets' (without theoretical maximum) that you can have in your application. One bucket per account is perfectly doable, but you sacrifice some queryability by going in that direction.
Otherwise, you may want to follow a more relational model of grouping like with like. In my view, Mongo feels like a half-way point between relational databases and key-value stores. That means that it's easier to conceptualize when coming from a relational database world.

There seems to be a massive overhead for maintaining collections. I've just reduced a database which had around 1.5 million documents in 11000 collections to one with the same number of documents in around 300 collections; this has reduced the size of the database from 8GB to 1GB. I'm not familiar with the inner workings of MongoDB so this may be obvious, but I thought it might be worth noting in this context.
