MongoDB Cluster and 100,000 Capped Collections - mongodb

How does a MongoDB cluster distribute Capped Collections across nodes for balancing load? I am planning to use a Capped Collection for comments of each Post in a MongoDB based CMS. Lets assume we have 100,000 Posts and hence 100,000 Capped Collections storing comments for each post. Will these Capped Collections be distributed evenly across cluster for read and write scalability?
I dont want to shard a capped collection. I want to distribute all the capped collections evenly across the cluster for read and write scalability.
Lets assume we have 5 machines. When we create new collections, I need them to be created on different machines/nodes and also redistribute them when new machines are added.

1) When creating a collection (capped or not) it is set on the primary shard of the database. The solution would be to set a collection per database so that mongo equilibrate the databases across ythe cluster. The rule for equilibrium is not clear but depends mainly on the current load on each shard.
2) Believe me, you should use one big collection for all your post and shard it in a clever way. It will ensure really efficient and automatic balance of your data across your cluster.
More over capped collection are not really space efficient because it will pre-allocate all the space for all your collections (meaning that you'll have a lot of wasted space for nothing)
Unless you have a very good reason to go for capping, you have better try sharding.
One advice : use the 'postId' field in your shard key, it will probably the most performance.

Apparently it is not implemented yet for mongodb: Issue
Quote from similar question:
But you can create multiple capped collections on different shards to
increase write throughput; however, you must then run multiple queries
to access all your data.

Related

How to efficiently Shard a MongoDb collections that already has millions of documents?

I have a collection named order_error. Which has over 60 million documents. Today I was trying to shard it. I have 3 replica sets. Initially, no issues were there. The balancer was distributing the chunks among the clusters. But eventually, it has started to consume all Ram space and after all swap space too. Now everything is unresponsive. We can't follow this procedure in production. We need a better solution for that. How can I do the sharding in a better way?
If someone could help me with that please let me know
When you insert documents into an empty collection, then initially all date will be written to the primary shard, so it will not solve your issue.
But you can use sh.splitAt on empty collection to pre-split the it.
Note, even if the collection is empty it will take some time till chunks are distributed over all shards! When you split a chunk, then it still remains on the current shard. Check with db.collection.getShardDistribution() whether chunks are evenly distributed.

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDb (64 bit version) to handle a large amount of users (around 100,000) and each user will have large amounts of data (around 1 million records).
What is the best strategy of design?
Dump all records in single collection
Have a collection for each user
Have a database for each user.
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers that are presented as single logical unit via the mongo client.
Therefore the answer to your question is put all your records in a single sharded collection.
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up and testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster
So you are looking for 100,000,000 detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling is normally classed as scaling huge single collections of data across many (many) servers in a huge cluster.
So already if you use a single collection for common data (i.e. one collection called user and one called detail) you are suiting MongoDBs core purpose and build.
MongoDB, as mentioned, by others is not so good at scaling vertically across many collections. It has a nssize limit to begin with and even though 12K initial collections is estimated in reality due to index size you can have as little as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered some one not being able to scale MongoDB to the billions or even close to the 100s of billions (or maybe beyond) on a optimised set-up, however, I do not see why it cannot; after all Facebook is able to make MySQL scale into the 100s of billions per user (across 32K+ shards) for them and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema and shard concept and key (and severs and network etc etc etc etc).
If you were to witness problems you could go for splitting archive collections, or deleted items away from the main collection but I think that is overkill, instead you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time on the master and ensure that this data is always hot, that way queries that don't do a global and scatter OP should be quite fast.
About a collection on each users:
By default configuration, MongoDB is limited to 12k collections. You can increase the size of this with --nssize but it's not unlimited.
And you have to count index into this 12k. (check "namespaces" concept on mongo documentation).
About a database for each user:
For a model point of view, that's very curious.
For technical, there is no limit on mongo, but you probably have a limit with file descriptor (limit from you OS/settings).
So asĀ #Rohit says, the two last are not good.
Maybe you should explain more about your case.
Maybe you can cut users into different collections (ex: one for each first letter of name etc., or for each service of the company...).
And, of course use sharding.
Edit: maybe MongoDb is not the best database for your use case.

Should I shard all collections in my MongoDB or just some

I am running MongoDB cluster (backend to my website). I am converting my previous DB from being plain into sharded structure.
Question is: should I shard all my collections or only those that I expect to grow a lot. I have some collections that will never get bigger than a few thousands documents, few hundred thousands at most, should I shard them anyway? If yes when? Right now during conversion or convert it without shading and shard later?
To rephrase the question : if a table is not too big, are there any benefits for it to be sharded?
A common misconception is that sharding is based upon the size of a collection. This is totally untrue. It is however, true that common sense dictates that when a collection reaches a certain size it is possibly too much to store on a single server, but on the other hand the cause to shard is decided by operations not size.
It makes sense that those that will "grow a lot" should be sharded to distribute those operations within a cluster however those that might be a lot quieter, such as your smaller collections can happily remain on the primary shard.
As to when to shard them: that depends on the operations. Sharding is designed to scale out reads and writes so it is merely a question of when a collection needs to be scaled out.
You could have a collection of maybe a 1,000 items but if the operations call for it to be sharded then it needs sharding. Vice versa you could have a collection of 1 billion items and it still doesn't merit sharding.

Mongodb - sharded and unsharded collections

I'm a bit confused as to how this works.
When sharding MySQL, we had some tables, usually small ones with reference data, whole in each shard. This was to enable joins.
If we have small collections in MongoDB, that we don't shard in a sharded setup, what happens to them? Do they get sent to each shard, or just stay in the first shard?
This strikes me as a possible potential bottleneck, if all processes in a heavily sharded system with many application servers were hitting on one server.
In MongoDB with the autosharding feature, a sharded collection will be distributed somehow evenly along all the shards you have.
With those collections which you not likely to shard (which are not sharded) you can specify a primary shard which will they reside on. This primary shard is a given one for a specific database, so it is on per database level. Can be moved and can be different for different databases.
There is the notion of shard tagging which with you can influence for sharded collections where to be placed. Basicly you can constraint a collection or a part of a collection to be stored on a specific set of shards. (Reference)

MongoDB: BIllions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be json or csv. There's obviously mongodrestore, if the data is in BSON format.
Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, there are other users use multiple mongods to get around locking limits of a single mongod with lots of writes. It's obvious but still worth saying but a multi-mongod setup is more complex to manage than a single server. If your IO or cpu isn't maxed out here, your working set is smaller than RAM and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As a FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of it's core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.