How does MongoDB distribute data across a cluster

I've read about sharding a collection in MongoDB. MongoDB lets me shard a collection explicitly by calling the shardCollection method. There I can choose whether I want it to be range sharded or hash sharded.
My question is, what would happen if I didn't call the shardCollection method, and I had, say, 100 nodes?
Would MongoDB keep the collections intact and distribute them across the cluster?
Would MongoDB keep all the collections in a single node?
Do I completely not understand how this works?

A database can have a mixture of sharded and unsharded collections. Sharded collections are partitioned and distributed across shards in the cluster. As of MongoDB 3.4, each database has a primary shard where the unsharded collections are stored. If your deployment has a number of databases, this may result in some distribution of unsharded collections, but there is no balancing activity for unsharded data. For more information on expected behaviours, see the Sharding section in the MongoDB manual.
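For illustration, a minimal mongo shell sketch of the difference (the database, collection, and key names here are hypothetical):

// Enable sharding on the database. Unsharded collections in it
// still live together on the database's primary shard.
sh.enableSharding("mydb")

// Only collections you shard explicitly are partitioned across shards.
sh.shardCollection("mydb.events", { _id: "hashed" })

// sh.status() reports each database's primary shard and which
// collections are sharded.
sh.status()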
If you are interested in distribution of unsharded collections within a sharded database, there is a relevant feature request you can watch/upvote in the MongoDB issue tracker: SERVER-939: Ability to distribute collections in a single DB.

Related

MongoDB sharding with repeated documents

I am new to MongoDB and wish to create a distributed database environment using docker-compose with MongoDB. I've created multiple Docker containers with shards to simulate multiple sites. However, I have a problem replicating the same set of documents into multiple shards.
For example, I have a collection with a key that has values "A" and "B". I want to distribute this collection into 2 shards where
Shard 1 = A & B
Shard 2 = B only
However, when I run the balancer, it distributes all the A's into shard 1 and the B's into shard 2. Is there any way I can do the sharding with repeated data, or am I using the wrong approach for my problem?
You might be approaching sharding (horizontal scaling) incorrectly. What makes sharding in MongoDB work is that the shard key is chosen such that it results in shards with a roughly even distribution of data, i.e. a similar number of documents. Sharding also works best when queries are typically directed to only a single shard. If you have queries which need to return documents across the different values A and B, it implies that this field should not be the shard key. Queries can go across shards, but certain cross-shard operations, such as joins, can be very costly. In your particular case, perhaps some other field could be used as the shard key.
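As a sketch of that last point: if the documents also carried some higher-cardinality field, say a hypothetical userId, sharding on it would let the balancer spread documents evenly regardless of their A/B value:

// Hypothetical shard key with many distinct values; A and B documents
// then end up mixed across both shards rather than segregated.
sh.enableSharding("mydb")
sh.shardCollection("mydb.mycollection", { userId: "hashed" })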
Redundancy in MongoDB is provided by replica sets, not sharded clusters.
Each shard can be backed by a replica set with your desired number of nodes to provide the required redundancy level.
It is not possible to have the same document be (authoritatively) located in multiple shards.
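If what you actually want is every document stored on several nodes, a minimal sketch of that layout, with hypothetical replica set and host names:

// Initiate a three-member replica set that will back one shard.
rs.initiate({
  _id: "shard1rs",
  members: [
    { _id: 0, host: "shard1a:27018" },
    { _id: 1, host: "shard1b:27018" },
    { _id: 2, host: "shard1c:27018" }
  ]
})

// From mongos, add the whole replica set as a single shard; each
// document in that shard is then replicated to all three members.
sh.addShard("shard1rs/shard1a:27018,shard1b:27018,shard1c:27018")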

MongoDB db.collection.count() vs db.collection.find().length()

I would like to understand why these commands, when run from a mongos instance against the same MongoDB collection, return different numbers.
db.users.count()
db.users.find().length()
What can be the reason and can it be a sign of underlying issues?
I believe your collection is sharded.
Most sharded database solutions have such discrepancies, because some commands consider the entire collection (all the documents across all the shards), while other commands only consider the documents of the shard they are connected to.
This is something to always keep in mind. It mostly applies to commands which:
count
return the document with the lowest value for a given field
return the document with the highest value for a given field
...
Found in the MongoDB docs:
count() is equivalent to the db.collection.find(query).count() construct. ...
Sharded Clusters
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress. ...
So in the case of MongoDB, the discrepancy comes from the balancer, which runs in the background and migrates chunks of documents between shards in order to keep the distribution compliant with the shard key defined on the collection. While a migration is in progress (or after one has failed partway), copies of the same documents can temporarily exist on more than one shard as orphaned documents.
db.collection.count() sums fast per-shard counts taken from collection metadata, which can include those orphaned or in-flight documents, while find().length() actually fetches the documents through mongos, and each shard filters out documents it does not own. That is why the two numbers can differ.
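If you need an accurate count on a sharded cluster, the docs recommend counting through the aggregation pipeline instead, since shards filter out orphaned documents for aggregation reads. A minimal sketch against the users collection from the question:

// Aggregation-based count: not subject to the orphaned-document
// inaccuracy of the metadata-based count().
db.users.aggregate([{ $group: { _id: null, n: { $sum: 1 } } }])

// On modern MongoDB versions, countDocuments() wraps the same idea:
db.users.countDocuments({})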

MongoDB sharding is possible on collections?

Is it possible to shard only collections? If yes, then how?
What is the difference between sharding a database and sharding a collection?
MongoDB shards collections. You enable sharding on a database, but just enabling sharding on the database will not distribute data across shards. To distribute data across shards, you need to tell MongoDB which collection to distribute. So you have to shard your collection, and then only that collection will be spread across the shards.
Remember, MongoDB distributes data on the basis of which collections are sharded. If you have 2 collections in your database and you shard one of them, the data of the sharded collection will be spread out across the shards, but the other collection will have all its data on one shard.
In plain language, MongoDB doesn't shard a whole database automatically. MongoDB sharding works at the collection level.
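A minimal shell sketch of those two steps (database, collection, and key names are hypothetical):

// Step 1: enable sharding on the database. This alone distributes
// nothing; it only makes the database eligible for sharding.
sh.enableSharding("mydb")

// Step 2: shard a specific collection. Only this collection is
// spread across shards; every other collection in mydb stays
// whole on the database's primary shard.
sh.shardCollection("mydb.orders", { customerId: "hashed" })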

Mongodb - sharded and unsharded collections

I'm a bit confused as to how this works.
When sharding MySQL, we had some tables, usually small ones with reference data, kept whole in each shard. This was to enable joins.
If we have small collections in MongoDB, that we don't shard in a sharded setup, what happens to them? Do they get sent to each shard, or just stay in the first shard?
This strikes me as a possible potential bottleneck, if all processes in a heavily sharded system with many application servers were hitting on one server.
In MongoDB, with the autosharding feature, a sharded collection will be distributed roughly evenly across all the shards you have.
Collections which you are not likely to shard (which are not sharded) reside on a primary shard, which you can specify. The primary shard is set for a specific database, so it works at the per-database level. It can be moved and can be different for different databases.
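For example, from a mongos you can inspect and move a database's primary shard (the database and shard names here are hypothetical):

// sh.status() lists each database with its primary shard.
sh.status()

// Move mydb's unsharded collections to a different shard.
db.adminCommand({ movePrimary: "mydb", to: "shard0001" })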
There is also the notion of shard tagging, with which you can influence where sharded collections are placed. Basically, you can constrain a collection, or a part of a collection, to be stored on a specific set of shards. (Reference)
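A short sketch of tagging, assuming a hypothetical collection sharded on a numeric year field:

// Tag a shard, then pin a range of the shard key to that tag;
// the balancer keeps chunks in that range on matching shards.
sh.addShardTag("shard0000", "archive")
sh.addTagRange("mydb.records", { year: MinKey }, { year: 2015 }, "archive")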

MongoDB Cluster and 100,000 Capped Collections

How does a MongoDB cluster distribute capped collections across nodes for load balancing? I am planning to use a capped collection for the comments of each post in a MongoDB based CMS. Let's assume we have 100,000 posts and hence 100,000 capped collections storing the comments for each post. Will these capped collections be distributed evenly across the cluster for read and write scalability?
I don't want to shard a capped collection. I want to distribute all the capped collections evenly across the cluster for read and write scalability.
Let's assume we have 5 machines. When we create new collections, I need them to be created on different machines/nodes and also redistributed when new machines are added.
1) When you create a collection (capped or not), it is placed on the primary shard of its database. The workaround would be to use one collection per database so that MongoDB balances the databases across the cluster. The rule for this balancing is not clearly documented, but it depends mainly on the current load on each shard.
2) Believe me, you should use one big collection for all your posts and shard it in a clever way. It will ensure a really efficient and automatic balancing of your data across your cluster.
Moreover, capped collections are not really space efficient, because MongoDB pre-allocates all the space for each of them (meaning you'll have a lot of wasted space for nothing).
Unless you have a very good reason to go for capping, you had better try sharding.
One piece of advice: use the 'postId' field in your shard key; it will probably give the best performance.
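A sketch of that single-collection approach, with hypothetical database, collection, and field names:

// One comments collection for every post; the compound key keeps
// each post's comments together so a read hits a single shard.
sh.enableSharding("cms")
sh.shardCollection("cms.comments", { postId: 1, _id: 1 })

// Fetching the latest comments for one post is a single-shard query:
db.comments.find({ postId: 12345 }).sort({ _id: -1 }).limit(50)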
Apparently it is not implemented yet for MongoDB: Issue
Quote from a similar question:
But you can create multiple capped collections on different shards to increase write throughput; however, you must then run multiple queries to access all your data.