MongoDB Atlas performance / collection and index limits - mongodb

I am building a multi-tenant app where each tenant has its own database with its own collections. All databases are stored in the same M10 cluster.
For now, a tenant represents around 56 collections and 208 indexes.
I have seen there is a recommended maximum for an M10 cluster of 5000 collections and indexes (https://www.mongodb.com/docs/atlas/reference/atlas-limits/).
So if my understanding is correct, an M10 cluster is best suited to a maximum of 18 tenants (5000 / (56 + 208) = 18.93).
The documentation says: "The performance of a cluster might degrade if it serves a large number of collections and indexes." Has anyone tried to exceed this limit? How big is the drop in performance?
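The arithmetic in the question can be checked with a small sketch. The function name `maxTenants` is made up for illustration, and the 5000 figure is the recommended maximum quoted above, not a value read from the cluster.

```javascript
// Rough capacity estimate: how many whole tenants fit under a combined
// collection + index budget. All inputs come from the question above.
function maxTenants(budget, collectionsPerTenant, indexesPerTenant) {
  const perTenant = collectionsPerTenant + indexesPerTenant;
  return Math.floor(budget / perTenant);
}

const tenants = maxTenants(5000, 56, 208);
const used = tenants * (56 + 208);

console.log(tenants); // 18
console.log(used);    // 4752 collections and indexes in total
```

Rounding down matters here: the 19th tenant would push the total past the recommended maximum.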

Apart from a hard limit on the number of collections and indexes you can have, the performance impact of a large number of collections and indexes comes from having too many data handles open. On top of that, maintenance becomes a nightmare.
In addition, a large number of indexes will slow down write operations on those collections. The indexes will either permanently occupy space in memory or, if memory is insufficient, be continuously evicted from and reloaded into memory. To know more about the internal cache, see the official documentation here.
In conclusion, having more than 3500 indexes (for 18 tenants) and ~1000 collections will have a serious adverse impact on the overall performance of your cluster. You can monitor this via the Cache Activity metric, among others. Since separate databases are only a logical separation anyway, you're advised to implement multi-tenancy via a tenant_id field in shared collections instead of having different collections for different tenants.
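A minimal sketch of the tenant_id approach, assuming every document in a shared collection carries a `tenant_id` field. The helper names `scopedFilter` and `tenantIndex`, the collection name `orders`, and the other field names are hypothetical.

```javascript
// Hypothetical helper: add the tenant's id to every query filter so one
// tenant cannot read another tenant's documents by accident.
function scopedFilter(tenantId, filter = {}) {
  if (!tenantId) throw new Error("tenantId is required");
  return { ...filter, tenant_id: tenantId };
}

// Hypothetical helper: put tenant_id first in each index key pattern so
// tenant-scoped queries can use the index.
function tenantIndex(keys) {
  return { tenant_id: 1, ...keys };
}

// In mongosh these could be used roughly as follows:
//   db.orders.createIndex(tenantIndex({ created_at: -1 }));
//   db.orders.find(scopedFilter("tenant-42", { status: "open" }));
console.log(scopedFilter("tenant-42", { status: "open" }));
console.log(tenantIndex({ created_at: -1 }));
```

With this layout the number of collections and indexes stays constant as tenants are added; only the document count grows.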

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDB (64-bit version) to handle a large number of users (around 100,000), and each user will have a large amount of data (around 1 million records).
What is the best strategy of design?
1. Dump all records in a single collection
2. Have a collection for each user
3. Have a database for each user
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers that are presented as single logical unit via the mongo client.
Therefore the answer to your question is put all your records in a single sharded collection.
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up, then testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.
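As a rough illustration of "a single sharded collection", the sketch below builds the command document for the `shardCollection` admin command. The database name `app`, the collection name `records`, and the hashed `userId` shard key are assumptions made for the example; the answer deliberately does not prescribe a shard key.

```javascript
// Build the document for the shardCollection admin command.
// The choice of a hashed key on a per-user field is an assumption, not
// something the answer above recommends.
function shardCollectionCommand(dbName, collName, keyField) {
  return {
    shardCollection: `${dbName}.${collName}`,
    key: { [keyField]: "hashed" },
  };
}

const cmd = shardCollectionCommand("app", "records", "userId");
console.log(cmd);

// In a mongo shell connected to a mongos, this could be run as:
//   sh.enableSharding("app");   // needed on older server versions
//   db.adminCommand(cmd);
```

A hashed key spreads users evenly across shards while keeping each user's records together, but whether that suits the read and write pattern has to be tested, as the answer says.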
So you are looking at 100,000,000,000 detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling is normally classed as scaling huge single collections of data across many (many) servers in a huge cluster.
So already, if you use a single collection for common data (i.e. one collection called user and one called detail), you are suiting MongoDB's core purpose and build.
MongoDB, as mentioned by others, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though 12K initial collections is the estimate, in reality, due to index size, you can have as few as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered someone being unable to scale MongoDB to the billions, or even close to the 100s of billions (or maybe beyond), on an optimised set-up, and I do not see why it could not be done; after all, Facebook is able to make MySQL scale into the 100s of billions per user (across 32K+ shards), and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema, shard concept and shard key (and servers, network, etc.).
If you were to witness problems you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time on the master, and ensure that this data is always hot. That way, queries that don't do a global scatter-gather operation should be quite fast.
About a collection for each user:
With the default configuration, MongoDB is limited to 12k collections. You can increase this with --nssize, but it's not unlimited.
And you have to count indexes in this 12k (check the "namespaces" concept in the MongoDB documentation).
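To make the namespace counting concrete, here is a small sketch that treats each collection and each index as one namespace. The 12,000 figure is the one quoted in this answer, and the assumption of exactly one index per collection (the default `_id` index) is the most optimistic case.

```javascript
// Each collection and each of its indexes consumes one namespace.
function namespacesNeeded(collections, indexesPerCollection) {
  return collections * (1 + indexesPerCollection);
}

const NAMESPACE_LIMIT = 12000; // figure quoted in the answer above
const needed = namespacesNeeded(100000, 1); // one collection per user, _id index only

console.log(needed);                    // 200000
console.log(needed <= NAMESPACE_LIMIT); // false
```

Even before any secondary indexes are added, a collection per user for 100,000 users is far beyond that budget.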
About a database for each user:
From a modelling point of view, that's very unusual.
Technically, MongoDB has no limit here, but you will probably hit a file descriptor limit (from your OS/settings).
So as @Rohit says, the last two are not good.
Maybe you should explain more about your case.
Maybe you can split users across different collections (e.g. one for each first letter of the name, or one for each department of the company).
And, of course use sharding.
Edit: maybe MongoDB is not the best database for your use case.

mongodb - Reclaim disk space regularly with no downtime

We have a replica set of 1 primary, 1 secondary, and 1 arbiter. We delete collections often, so I am looking for a fast way to reclaim the disk space used by deleted collections with no downtime. The current database size is close to 3TB.
I've been researching various ways of doing this. Two common approaches are:
1. repairDatabase(): this needs free space equal to the size of the used space in order to run. It takes the primary offline and then starts an initial sync on the secondary, which is a very lengthy process. Only one node is available during this time: read-only from the secondary during repairDatabase, and read/write during the initial sync.
2. Run an initial sync on a new node, then promote it to primary and retire the old one. Repeat the process for the secondary. With this option both primary and secondary remain available, but the process is very lengthy: it takes almost 1 week to run the initial sync twice.
Is there a better solution to reclaim disk space on a regular basis that is relatively faster than the above?
Note that every single collection is subject to deletion.
Thanks
There's no easy way to achieve this, unless you design your DB structure to keep different collections in different databases, which in turn means storing them in different paths on your HDD, as long as you have directoryPerDB set to true in your mongo.conf. This is a workaround, and depending on your app it might be impractical.
While it's true that dropping a collection won't free the HDD space, it's also true that the used space is not lost. It will eventually be reused for new collections.
That being said, unless you are really short on space, don't reclaim that space. The CPU and I/O cost of doing that regularly is far higher than the cost of storage capacity at every provider I know of.
I'd take a look at using MongoDB's sharding functionality to address some of your issues. To quote from the documentation:
"Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations."
While sharding is frequently used to balance throughput for a large collection across more servers, to avoid hot spots and spread the overall load, it's also useful for managing storage for large collections. In your specific case I'd investigate the use of shard tags to pin a collection to a specific shard.
Again, to quote the documentation, shard tags are useful to
isolate a specific subset of data on a specific set of shards.
For example, let's say you split your production environment into a couple of shards, shard1 and shard2. You could, using shard tags and the sharding tools, pin the collections that you frequently delete onto shard2. In this use case shard1 contains all your normal collections. When you then choose to reclaim disk storage via your second option, you'd perform this only on the shard that has the deleted collections. That way you avoid having to recreate more static data. It should run faster that way (how much faster is a function of how much data is in the deleted-collections shard at any given time).
It also has the secondary benefit that each shard (actually the replica set within each shard) requires smaller servers, as they only contain a subset of the overall data.
The specifics of the best way to do this will be driven by your exact use case - number and size of collections, insert, update, query and deletion frequency, etc. I described a simple 2 shard case but you can do this with many more shards. You can also have some shards running on higher performance hardware for collections that have more transaction volume.
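The two-shard example can be sketched as a helper that prints the mongo shell commands for pinning whole collections to a tagged shard. The tag name `volatile`, the namespaces, and the helper `pinCommands` are made up for illustration, and the sketch assumes each collection is already sharded on the given key field. `sh.addShardTag` and `sh.addTagRange` are the shell helpers for tag-aware sharding; newer releases call the same feature zones.

```javascript
// Build shell commands that tag one shard and assign the full key range
// of each listed collection to that tag. Assumes every namespace is
// already sharded on `keyField`.
function pinCommands(shardName, tag, namespaces, keyField = "_id") {
  const cmds = [`sh.addShardTag("${shardName}", "${tag}")`];
  for (const ns of namespaces) {
    cmds.push(
      `sh.addTagRange("${ns}", { ${keyField}: MinKey }, { ${keyField}: MaxKey }, "${tag}")`
    );
  }
  return cmds;
}

const cmds = pinCommands("shard2", "volatile", ["app.tmp_a", "app.tmp_b"]);
console.log(cmds.join("\n"));
```

The output is meant to be reviewed and pasted into a shell connected to a mongos, rather than executed blindly.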
I can't really do sharding justice within the limited space here, other than to point you in the right direction to investigate it. MongoDB has a lot of good information in its documentation, and its 2 online DBA courses (which are free) get into this in some detail.
Some useful links:
http://docs.mongodb.org/manual/core/sharding-introduction/
http://docs.mongodb.org/manual/core/tag-aware-sharding/

MongoDB - Store Indexes on SSD & Data Collections on Magnetic?

Is it possible to store index collections on separate high-performance storage (i.e. flash/SSD) while keeping data collections on lower-cost conventional storage? The performance issues I am seeing using MongoDB appear to be related to index maintenance operations, and I am having to partition my data across database buckets on a single instance in order to avoid drastic dips in write performance - a solution that will only scale for so long. Therefore I would like to use SSD for indexes, but it doesn't make sense to pay for high-performance storage where it's not warranted (data collections).
The only discussion I have found on this subject is somewhat dated:
https://jira.mongodb.org/browse/SERVER-965
You cannot do what you are asking with MongoDB.
The workaround is splitting your collections into multiple databases. MongoDB cannot do joins, so there is not really a trade-off at the data level. Your application will have to manage multiple connections to multiple databases.
Put your high-performance collections on SSD servers with larger RAM allocations.
Put your lower-performance collections on slower spindle disk or AWS EBS servers.

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is; this partly depends on the size of the objects you're inserting and on other factors that are hard to measure. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore if the data is in BSON format.
Mongo can easily handle billions of documents, and can have billions of documents in a single collection, but remember that the maximum document size is 16 MB. There are many folk with billions of documents in MongoDB and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
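One way to "try a few ranges" is to feed the data to the server in fixed-size batches and time each batch size. The generator below is a minimal sketch; the name `batches`, the collection name `bigrams` and the batch size of 1000 are assumptions, not recommendations.

```javascript
// Split any iterable of documents into arrays of at most `size` items.
// The size check runs when iteration starts, because this is a generator.
function* batches(docs, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch = [];
  for (const doc of docs) {
    batch.push(doc);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // final, possibly shorter batch
}

// In mongosh, with `source` being some iterable of documents:
//   for (const b of batches(source, 1000)) {
//     db.bigrams.insertMany(b, { ordered: false });
//   }
const sizes = [...batches([1, 2, 3, 4, 5], 2)].map((b) => b.length);
console.log(sizes); // [ 2, 2, 1 ]
```

Passing `ordered: false` lets the server continue with the rest of a batch when one document fails, which is usually what a bulk load wants.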
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users run multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying: a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
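Pre-splitting means deciding the chunk boundaries up front. The sketch below picks evenly spaced split points from a sorted sample of shard key values; the function name `splitPoints`, the sample values, and the namespace `corpus.bigrams` with shard key field `bigram` are hypothetical.

```javascript
// Pick (chunks - 1) boundaries from a sorted sample of shard key values
// so that each resulting chunk covers roughly the same number of samples.
function splitPoints(sortedKeys, chunks) {
  const points = [];
  for (let i = 1; i < chunks; i++) {
    points.push(sortedKeys[Math.floor((i * sortedKeys.length) / chunks)]);
  }
  return points;
}

const sample = ["a b", "c d", "f g", "k l", "p q", "t u"];
const points = splitPoints(sample, 3);
console.log(points); // [ 'f g', 'p q' ]

// In a mongo shell connected to a mongos, before loading the data:
//   sh.stopBalancer();
//   for (const p of points) sh.splitAt("corpus.bigrams", { bigram: p });
```

A small or skewed sample can yield duplicate boundaries, so the sample should be representative of the full data set.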
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

MongoDB: Sharding on single machine. Does it make sense?

I created a collection in MongoDB consisting of 11446615 documents.
Each document has the following form:
{
"_id" : ObjectId("4e03dec7c3c365f574820835"),
"httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
"words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
"howMany" : 3
}
httpReferer: just an url
words: words parsed from the url above. Size of the list is between 15 and 90.
I am planning to use this database to obtain list of webpages which have similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
1. Inserting, then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data. Indexing took 30 hours.
2. Indexing before inserting: it would take a few days to insert all the data into the collection.
My main focus is to decrease the time needed to generate the collection. I don't need replication (at least for now). Querying also doesn't have to be lightning-fast.
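Part of the cost comes from `words` being an array: an index on an array field is a multikey index, which stores an entry per array element rather than per document. The back-of-the-envelope sketch below uses the figures from the question and assumes the words within one document are distinct, so the results are rough bounds.

```javascript
// Estimate the number of multikey index entries for the `words` index.
const DOCS = 11446615; // document count from the question

function indexEntries(docCount, wordsPerDoc) {
  return docCount * wordsPerDoc;
}

const low = indexEntries(DOCS, 15);  // shortest word lists
const high = indexEntries(DOCS, 90); // longest word lists

console.log(low);  // 171699225
console.log(high); // 1030195350
```

So the index holds somewhere between roughly 170 million and 1 billion entries, far more than the document count alone suggests.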
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per mongodb server.
Running multiple servers frees each server from the others' locks.
If you run a multi-core machine with separate NUMA nodes, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk insertion benchmark program and spinning up varying numbers of MongoDB server shards. I have a 16-core RAIDed machine and I've found that 3-4 shards seem to be ideal for my write-heavy database. I'm finding that my two NUMA nodes are my bottleneck.
In modern day (2015) with MongoDB v3.0.x there is collection-level locking with mmap, which increases write throughput slightly (assuming you're writing to multiple collections), but if you use the WiredTiger engine there is document-level locking, which has a much higher write throughput. This removes the need for sharding across a single machine. You can technically still increase the performance of mapReduce by sharding across a single machine, but in that case you'd be better off just using the aggregation framework, which can exploit multiple cores. If you rely heavily on map-reduce algorithms, it might make most sense to just use something like Hadoop.
The only reason for sharding MongoDB is to scale horizontally. So in the event that a single machine cannot provide enough disk space, memory, or CPU power (rare), sharding becomes beneficial. I think it's really seldom that someone has enough data that they need to shard, even a large business, especially since WiredTiger added compression support that can reduce disk usage by over 80%. It's also infrequent that someone uses MongoDB to perform really CPU-heavy queries at a large scale, because there are much better technologies for this. In most cases IO is the most important factor in performance; not many queries are CPU intensive unless you're running a lot of complex aggregations, and even geo-spatial data is indexed upon insertion.
The most likely reason you'd need to shard is if you have a lot of indexes that consume a large amount of RAM. WiredTiger reduces this, but it's still the most common reason to shard. Sharding across a single machine, on the other hand, is likely just going to cause undesired overhead, with very little or possibly no benefit.
This doesn't have to be a mongo question, it's a general operating system question. There are three possible bottlenecks for your database use.
network (i.e. you're on a gigabit line, you're using most of it at peak times, but your database isn't really loaded down)
CPU (your CPU is near 100% but disk and network are barely ticking over)
disk
In the case of network, rewrite your network protocol if possible, otherwise shard to other machines. In the case of CPU, if you're at 100% on a few cores but others are free, sharding on the same machine will improve performance. If disk is fully utilized, add more disks and shard across them, which is way cheaper than adding more machines.
No, it does not make sense to shard on a single server.
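The triage in this answer can be written down as a small decision function. This is only a sketch: the function name `suggestFix`, the input fields, and the 0.9 utilisation threshold are assumptions, with utilisation expressed as a fraction between 0 and 1.

```javascript
// Map measured utilisation to the remedy suggested in the answer above.
function suggestFix({ network, busiestCore, idlestCore, disk }, limit = 0.9) {
  if (network >= limit) {
    return "rewrite the network protocol, or shard to other machines";
  }
  if (busiestCore >= limit && idlestCore < limit) {
    return "shard on the same machine";
  }
  if (disk >= limit) {
    return "add more disks and shard across them";
  }
  return "no single bottleneck stands out";
}

console.log(
  suggestFix({ network: 0.2, busiestCore: 1.0, idlestCore: 0.1, disk: 0.3 })
); // shard on the same machine
```

The order of the checks is arbitrary; on a real system more than one resource can be saturated at once.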
There are a few exceptional cases but they mostly come down to concurrency issues related to things like running map/reduce or javascript.
This is answered in the first paragraph of the Replica set tutorial
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial