MongoDB - sharded and unsharded collections

I'm a bit confused as to how this works.
When sharding MySQL, we had some tables, usually small ones with reference data, whole in each shard. This was to enable joins.
If we have small collections in MongoDB, that we don't shard in a sharded setup, what happens to them? Do they get sent to each shard, or just stay in the first shard?
This strikes me as a potential bottleneck if all processes in a heavily sharded system with many application servers were hitting one server.

In MongoDB, with the auto-sharding feature, a sharded collection will be distributed roughly evenly across all the shards you have.
Collections that you are not likely to shard (i.e., that are not sharded) reside on a primary shard, which you can specify. The primary shard is assigned per database, so it works at the database level; it can be moved and can be different for different databases.
There is also the notion of shard tagging, with which you can influence where sharded collections are placed. Basically, you can constrain a collection, or a part of a collection, to be stored on a specific set of shards. (Reference)
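For illustration, here is a minimal PyMongo sketch of inspecting and moving a database's primary shard; the connection string, database name and shard names are assumptions, not values from your cluster:

```python
from pymongo import MongoClient

# Connect to a mongos router (address is an assumption).
client = MongoClient("mongodb://localhost:27017")

# Each database's primary shard is recorded in the config database.
for db in client.config.databases.find():
    print(db["_id"], "-> primary shard:", db["primary"])

# Unsharded collections of "reference_data" all live on its primary shard;
# movePrimary relocates them to another shard if one shard gets too hot.
client.admin.command("movePrimary", "reference_data", to="shard0001")
```

Spreading the primary shards of different databases over different shards is the usual way to keep a single shard from becoming the hotspot the question worries about.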

Related

MongoDB shard zones for performance

The data model has the following one-to-many relationships:
UserProfile - UserActivity,
UserProfile - UserItem,
UserProfile - ... ,
and so on.
Since there are many documents such as UserActivity and UserItem, separate collections are used instead of embedded arrays.
As far as I know, even if documents in different collections share the same _id, they are distributed and stored independently, so they may end up on different shards.
(Related: Same shards across different MongoDB collections)
What I'm curious about is whether using a shard zone to keep a specific user's documents on one shard and accessing them in a single-shard transaction is faster than a distributed transaction, for both reads and writes.
(The shards are physically close to each other.)
https://docs.mongodb.com/manual/tutorial/sharding-segmenting-shards/
Pay attention to Sharding Query Pattern:
The ideal shard key distributes data evenly across the sharded cluster while also facilitating common query patterns. When you choose a shard key, consider your most common query patterns and whether a given shard key covers them.
In a sharded cluster, the mongos routes queries to only the shards that contain the relevant data if the queries contain the shard key. When the queries do not contain the shard key, the queries are broadcast to all shards for evaluation. These types of queries are called scatter-gather queries. Queries that involve multiple shards for each request are less efficient and do not scale linearly when more shards are added to the cluster.
This does not apply for aggregation queries that operate on a large amount of data. In these cases, scatter-gather can be a useful approach that allows the query to run in parallel on all shards.
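As a rough illustration of the difference (collection name and shard key are assumptions), a query that includes the shard key is routed to one shard, while one that omits it is broadcast:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
activity = client.app.UserActivity  # assumed sharded on {"userId": 1}

# Targeted: the filter contains the shard key, so mongos hits one shard.
targeted = list(activity.find({"userId": 42, "type": "login"}))

# Scatter-gather: no shard key in the filter, so every shard is queried.
broadcast = list(activity.find({"type": "login"}))
```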
See also Zones:
Some common deployment patterns where zones can be applied are as follows:
Isolate a specific subset of data on a specific set of shards (for example, where required by data protection laws).
Ensure that the most relevant data reside on shards that are geographically closest to the application servers.
Route data to shards based on the hardware / performance of the shard hardware.
Your question does not provide enough information to tell whether any of the above applies in your case.
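If zones do turn out to fit your case, a hedged PyMongo sketch of pinning one user range to one shard could look like this (database, collection, shard and zone names, and the numeric userId range are all assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
admin = client.admin

# Put each shard into a zone (names are assumptions).
admin.command("addShardToZone", "shard0000", zone="zoneA")
admin.command("addShardToZone", "shard0001", zone="zoneB")

admin.command("enableSharding", "app")
for coll in ("UserActivity", "UserItem"):
    # Shard every per-user collection on userId ...
    admin.command("shardCollection", f"app.{coll}", key={"userId": 1})
    # ... and pin the same userId range to the same zone, so one user's
    # documents from all collections end up on the same shard.
    admin.command(
        "updateZoneKeyRange",
        f"app.{coll}",
        min={"userId": 0},
        max={"userId": 100000},
        zone="zoneA",
    )
```

Whether the resulting single-shard access actually beats distributed transactions still has to be measured against your own workload.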

MongoDB sharding with repeated documents

I am new to MongoDB and wish to create a distributed database environment using docker-compose with MongoDB. I've created multiple Docker containers with shards to simulate multiple sites. However, I am having trouble replicating the same set of documents into multiple shards.
For example, I have a collection with a key that has the values "A" and "B". I want to distribute this collection across 2 shards where
Shard 1 = A & B
Shard 2 = B only
However, when I run the balancer it distributes all the A's into shard 1 and all the B's into shard 2. Is there any way I can shard with repeated data, or am I using the wrong approach for my problem?
You might be approaching sharding (horizontal scaling) incorrectly. What makes sharding work in MongoDB is choosing a shard key such that the resulting shards hold a roughly even distribution of data, i.e. a similar number of documents. A further requirement that makes sharding work well is that queries are typically directed to only a single shard. If you have queries which need to return documents with both values A and B of a field, it implies that this field should not be the shard key. Queries can go across shards, but certain cross-shard operations, such as joins, can be very costly. In your particular case, perhaps some other field could be used as the shard key.
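For example, a minimal sketch (database, collection and connection details are assumptions) of sharding on a hashed _id instead of the A/B field, so documents spread across shards without being duplicated:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
admin = client.admin

admin.command("enableSharding", "mydb")
# A hashed _id distributes documents evenly regardless of whether their
# other fields hold "A" or "B"; no document is copied to a second shard.
admin.command("shardCollection", "mydb.items", key={"_id": "hashed"})
```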
Redundancy in MongoDB is provided by replica sets, not sharded clusters.
Each shard can be backed by a replica set with your desired number of nodes to provide the required redundancy level.
It is not possible to have the same document be (authoritatively) located in multiple shards.

Should I shard all collections in my MongoDB or just some

I am running a MongoDB cluster (the backend to my website). I am converting my previous DB from a plain into a sharded structure.
Question is: should I shard all my collections or only those that I expect to grow a lot? I have some collections that will never get bigger than a few thousand documents, a few hundred thousand at most; should I shard them anyway? If yes, when: right now during the conversion, or convert them without sharding and shard them later?
To rephrase the question: if a collection is not too big, are there any benefits to it being sharded?
A common misconception is that sharding is based upon the size of a collection. This is totally untrue. It is, however, true that common sense dictates that when a collection reaches a certain size it may be too much to store on a single server; but even then, the decision to shard is driven by operations, not size.
It makes sense that the collections that will "grow a lot" should be sharded so those operations are distributed across the cluster, while the ones that are a lot quieter, such as your smaller collections, can happily remain on the primary shard.
As to when to shard them: that depends on the operations. Sharding is designed to scale out reads and writes, so it is merely a question of when a collection needs to be scaled out.
You could have a collection of maybe 1,000 documents, but if the operations call for it to be sharded, then it needs sharding. Conversely, you could have a collection of 1 billion documents and it still might not merit sharding.
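As a sketch (database, collection and key names are assumptions): enable sharding on the database, but shard only the collection whose operations warrant it; the small collections then simply stay on the database's primary shard:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
admin = client.admin

admin.command("enableSharding", "mysite")

# Only the busy, fast-growing collection gets a shard key; the quiet
# reference collections remain whole on the primary shard.
admin.command("shardCollection", "mysite.events", key={"userId": "hashed"})
```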

MongoDB 'Manually Sharding' for multi-tenancy

We are using Mongo to host a multi-tenant application. Each tenant is going to have their own database. To get around resource utilization issues the approach that we are taking is to shard by database (as opposed to by collection - if that is the correct term to use).
This means for every x tenants we will create a new 3-node replica set. So we may have for example 1000 tenants on 1 shard and another 1000 tenants on another shard.
My question is regarding the placement of the databases for new signups. The approach we were going to take was to flag a shard as being the 'active' shard and creating all new tenants on that shard. When it reaches capacity, create a new shard, flag that as the active shard and continue on.
Can you choose which shard you create a new database on in Mongo directly? If left to Mongo, from what I understand, it will assign databases in round-robin fashion when there is more than one shard, which may leave our shards imbalanced.
Is this the right approach or is there an alternative better approach?
You can use shard tags to force some collections to reside only on specific shards. So you could, for example, tag each shard with its serial number, and tag the collections/databases you want to have on that shard with that tag, until the shard fills up, at which point you create a new shard, increase the counter and use the new tag for new data.
Another option then is to not enable sharding on the individual databases at all, and use the movePrimary command to force a specific shard to act as the primary shard for a specific database. Since the database won't be sharded, all its data will remain on its designated primary shard, which is exactly what you want.
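A hedged sketch of that second approach with PyMongo (the shard name, database naming scheme and the bootstrap insert are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
admin = client.admin

ACTIVE_SHARD = "shard0002"  # the shard currently flagged as "active"

def create_tenant(tenant_id: str) -> None:
    """Create a tenant database and pin it to the active shard."""
    db_name = f"tenant_{tenant_id}"
    # The database only comes into existence on its first write; mongos
    # picks an initial primary shard for it on its own.
    client[db_name]["settings"].insert_one({"tenantId": tenant_id})
    # Then force the primary shard to the one we consider "active".
    admin.command("movePrimary", db_name, to=ACTIVE_SHARD)

create_tenant("1001")
```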
That being said, it seems to me like this approach conflicts with the very concept of sharding, which is meant to evenly distribute data across multiple machines automatically.

MongoDB Cluster and 100,000 Capped Collections

How does a MongoDB cluster distribute capped collections across nodes to balance load? I am planning to use a capped collection for the comments of each post in a MongoDB-based CMS. Let's assume we have 100,000 posts and hence 100,000 capped collections storing the comments for each post. Will these capped collections be distributed evenly across the cluster for read and write scalability?
I don't want to shard a capped collection. I want to distribute all the capped collections evenly across the cluster for read and write scalability.
Let's assume we have 5 machines. When we create new collections, I need them to be created on different machines/nodes, and also redistributed when new machines are added.
1) When a collection (capped or not) is created, it is placed on the primary shard of its database. A workaround would be to use one collection per database so that Mongo balances the databases across the cluster. The balancing rule is not clearly documented, but it depends mainly on the current load on each shard.
2) Believe me, you should use one big collection for all your posts and shard it in a clever way. It will ensure a really efficient and automatic balance of your data across your cluster.
Moreover, capped collections are not really space-efficient, because all the space for each collection is pre-allocated (meaning you would have a lot of wasted space for nothing).
Unless you have a very good reason to go for capping, you had better try sharding.
One piece of advice: use the 'postId' field in your shard key; it will probably give the best performance.
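A minimal sketch of that advice, assuming the big collection holds all comments keyed by postId (database, collection and field names are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router (assumed)
admin = client.admin

admin.command("enableSharding", "cms")
# One comments collection for every post; postId in the shard key keeps a
# post's comments together while posts spread across the whole cluster.
admin.command("shardCollection", "cms.comments", key={"postId": 1, "_id": 1})

# Reads for one post's comments are then targeted to a single shard.
latest = client.cms.comments.find({"postId": 12345}).sort("_id", -1).limit(20)
```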
Apparently it is not implemented yet for MongoDB: Issue
Quote from a similar question:
But you can create multiple capped collections on different shards to increase write throughput; however, you must then run multiple queries to access all your data.