Need help to select sharding key in MongoDB - mongodb

For my application I need to shard a fairly big collection; the entire collection will contain approximately 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserting, either sharding key will distribute documents evenly throughout the cluster, so it does not matter which field I use as the sharding key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality, so there won't be any difference in that respect. The number of documents returned by a query is usually very low (typically fewer than 20-30 documents).

In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserting either Sharding Key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
Sharding Key.
When you insert a document it will have a shard key and the document will be stored on a designated shard.
Field(1) is usually part of the query filter condition, thus query
would be processed usually on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation, which is typically slower than a targeted query. The most important query operations of the application using the sharded collection should therefore be able to use the shard key.
Field(2) is typically not part of the query filter condition, thus
query would be processed over all shards and typically several shards
will contribute to final query result.
When the shard key is not part of the query filter, the operation spans multiple shards (a scatter-gather operation) and is typically slower than a targeted one. The mongos router cannot determine which shards hold the target data, so all shards in the cluster are queried to return the final result.
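The difference between a targeted and a scatter-gather query can be sketched with a toy router. This is only an illustration, not how mongos is implemented: the chunk bounds, the shard names and the field names `field1`/`field2` are made up, and shard-key values are assumed to be non-negative integers.

```python
from bisect import bisect_right

# Hypothetical chunk layout: each chunk owns the shard-key values in
# [its lower bound, the next chunk's lower bound) and lives on one shard.
# Assumption: shard-key values are non-negative integers.
CHUNK_LOWER_BOUNDS = [0, 250, 500, 750]
CHUNK_OWNERS = ["shardA", "shardB", "shardC", "shardD"]


def target_shards(query_filter, shard_key="field1"):
    """Return the shards a toy router would contact for an equality filter."""
    if shard_key in query_filter:
        # Shard key present: look up the single chunk that owns the value.
        idx = bisect_right(CHUNK_LOWER_BOUNDS, query_filter[shard_key]) - 1
        return [CHUNK_OWNERS[idx]]
    # Shard key absent: the router cannot narrow the target set,
    # so every shard has to be asked (scatter-gather).
    return sorted(set(CHUNK_OWNERS))


print(target_shards({"field1": 300}))  # one shard is contacted
print(target_shards({"field2": 300}))  # all shards are contacted
```

With a filter on `field1` the lookup ends at one shard; with a filter on `field2` only, the toy router has no choice but to return every shard.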
Which one is the better field to be used as Sharding Key?
Given the above, Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.

Related

Mongodb shard zones for performance

My data has the following one-to-many relationships:
UserProfile - UserActivity,
UserProfile - UserItem,
UserProfile - ... ,
and so on.
Since there are many documents such as UserActivity and UserItem, collections are used instead of arrays.
As far as I know, documents in different collections are distributed and stored independently, even if their _id is the same (see "Same shards across different MongoDB collections").
What I'm curious about is whether using a shard zone to store all documents of a specific user on one shard, and accessing them in a single-shard transaction, is faster than a distributed transaction, for both reads and writes.
(Shards are physically close)
https://docs.mongodb.com/manual/tutorial/sharding-segmenting-shards/
Pay attention to Sharding Query Pattern:
The ideal shard key distributes data evenly across the sharded cluster while also facilitating common query patterns. When you choose a shard key, consider your most common query patterns and whether a given shard key covers them.
In a sharded cluster, the mongos routes queries to only the shards that contain the relevant data if the queries contain the shard key. When the queries do not contain the shard key, the queries are broadcast to all shards for evaluation. These types of queries are called scatter-gather queries. Queries that involve multiple shards for each request are less efficient and do not scale linearly when more shards are added to the cluster.
This does not apply for aggregation queries that operate on a large amount of data. In these cases, scatter-gather can be a useful approach that allows the query to run in parallel on all shards.
See also Zones:
Some common deployment patterns where zones can be applied are as follows:
Isolate a specific subset of data on a specific set of shards. (Maybe enforced by some data protection laws)
Ensure that the most relevant data reside on shards that are geographically closest to the application servers.
Route data to shards based on the hardware / performance of the shard hardware.
Your question does not provide sufficient information to tell whether any of the above applies in your case.
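To make the zone idea concrete, here is a rough sketch of how zone ranges restrict where a shard-key value may live. It assumes the collections are sharded on a numeric user id; the zone names, ranges and shard names are invented for illustration. Note that a zone pins data to a set of shards, so a user's documents end up on one shard only if that zone contains a single shard.

```python
# Hypothetical zone layout; names and ranges are made up for illustration.
ZONE_RANGES = [
    # (inclusive lower bound, exclusive upper bound, zone name)
    (0, 1000, "EU"),
    (1000, 2000, "US"),
]
SHARD_ZONES = {
    "shard0": {"EU"},
    "shard1": {"US"},
    "shard2": {"US"},
    "shard3": set(),  # not assigned to any zone
}


def eligible_shards(user_id):
    """Return the shards that may hold a document with this shard-key value."""
    for lower, upper, zone in ZONE_RANGES:
        if lower <= user_id < upper:
            # The value falls in a zone range: only shards in that zone qualify.
            return sorted(s for s, zones in SHARD_ZONES.items() if zone in zones)
    # The value is outside every zone range: any shard may hold it.
    return sorted(SHARD_ZONES)


print(eligible_shards(42))    # user in the "EU" range
print(eligible_shards(1500))  # user in the "US" range
print(eligible_shards(5000))  # user outside all zone ranges
```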

MongoDB sharding with repeated documents

I am new to MongoDB and wish to create a distributed database environment using docker-compose. I've created multiple Docker containers with shards to simulate multiple sites. However, I am having trouble replicating the same set of documents into multiple shards.
For example I have a collection with a key that has value "A" and "B". I want to distribute this collection into 2 shards where
Shard 1 = A & B
Shard 2 = B only
However, when I run the balancer it distributes all A's into shard 1 and B's into shard 2. Is there any way I can do the sharding with repeated data or am I using the wrong approach for my problem?
You might be approaching sharding (horizontal scaling) incorrectly. What makes sharding in MongoDB work is that the shard key is chosen so that the resulting shards hold a roughly even distribution of data, that is, a similar number of documents. Sharding also works best when queries are typically directed to a single shard. If you have queries which need to return documents with both values A and B of some field, that suggests this field should not be the shard key. Queries can go across shards, but certain cross-shard operations, such as joins, can be very costly. In your particular case, perhaps some other field could be used as the shard key.
Redundancy in MongoDB is provided by replica sets, not sharded clusters.
Each shard can be backed by a replica set with your desired number of nodes to provide the required redundancy level.
It is not possible to have the same document be (authoritatively) located in multiple shards.
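A toy model of the distinction, with invented shard and member names: the shard key value decides which single shard owns a document, and replication then copies it to the members of that shard's replica set only.

```python
# Hypothetical two-shard cluster; each shard is a replica set of three members.
CLUSTER = {
    "shard1": ["s1-a", "s1-b", "s1-c"],
    "shard2": ["s2-a", "s2-b", "s2-c"],
}
# Assumed chunk ownership after balancing, as described in the question.
KEY_TO_SHARD = {"A": "shard1", "B": "shard2"}


def placement(doc):
    """Return (owning shard, members that hold a copy of the document)."""
    shard = KEY_TO_SHARD[doc["key"]]    # exactly one shard owns the value
    return shard, list(CLUSTER[shard])  # copies exist only within that shard


print(placement({"key": "A"}))
print(placement({"key": "B"}))
```

In this model there is no setting that would make a "B" document appear on both shards; copies only ever exist inside the owning shard's replica set.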

Writing on multiple shards in mongodb

Generally, if a query spreads across multiple shards, it is considered less optimized. It takes more time than reading from a single shard.
Does it hold true for writing as well? If I am writing some data and it will distribute among multiple shards, will it be considered less optimized?
If yes, what is the best way to write a batch that should go to different shard?
It depends on the operations, see https://docs.mongodb.com/manual/core/sharded-cluster-query-router/#sharding-mongos-targeted.
All insertOne() operations target one shard. Each document in the insertMany() array targets a single shard, but there is no guarantee that all documents in the array are inserted into the same shard.
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
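As an illustration of why a multi-document batch may touch several shards, here is a simplified sketch that groups a batch by owning shard. The hash-and-modulo placement is a stand-in of my own, not MongoDB's actual hashed sharding, and the shard names and the `customer_id` field are hypothetical.

```python
import hashlib
from collections import defaultdict

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(key):
    """Toy stand-in for hashed sharding: hash the key, then pick a shard."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def split_batch(docs, shard_key="customer_id"):
    """Group a batch into one sub-batch per shard that owns some document."""
    per_shard = defaultdict(list)
    for doc in docs:
        per_shard[shard_for(doc[shard_key])].append(doc)
    return dict(per_shard)


batch = [{"customer_id": i, "amount": 10 * i} for i in range(12)]
for shard, docs in sorted(split_batch(batch).items()):
    print(shard, [d["customer_id"] for d in docs])
```

Each document still lands on exactly one shard; the batch as a whole simply fans out to every shard that owns at least one of its documents.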

MongoDB Shard key Selection

I have a scenario in which I don't know in advance what the structure and fields of the collections in MongoDB will be. There will also be a separate database per user (a multi-tenant setup).
I'll be deploying a replicated sharded cluster in production. For scaling and better machine utilization, I'm enabling sharding per database when each database is created, and each collection under the same database will be sharded across different shards. In this scenario I'm not sure which key would be the best choice, since the structure and fields of the collections created under each database are unknown. Because the structure of the databases and collections is unknown, I can't forecast which type of query will be used most of the time. So I want to select a shard key which fulfills all the criteria for shard key selection, such as: cardinality, query isolation, monotonically increasing values, write scaling, easy divisibility.
What would be the solution in this scenario?
Also, what if I select all the fields of that collection for the shard key, along with a hashed _id field, as a compound key?
Once you create a shard key you cannot edit it.
So keep loading data into the unsharded collection; once you have clarity on the fields, you can shard the collection at any time.
Rebalancing happens automatically after sharding.

MongoDB- Compound shard key using three values

I am creating a collection which stores JSON objects using MongoDB. I am stuck on the sharding part.
I have a Case ID, Customer ID and Location for each record in the collection.
The Case ID is a 10-digit number (digits only, no letters).
The Customer ID is a combination of customer name and Case ID.
The location is a 2dsphere value and I expect many distinct location values.
In addition to this, each record has a customer name and a case description.
All my search queries have search criteria of either Case ID, Customer ID or location.
Given this scenario, can I create a compound key based on all three values (Case ID, Customer ID and location)? I believe this gives high cardinality and makes the records easy to retrieve.
Could anyone please tell me whether this is a good approach? I have not been able to find an example of a compound shard key comprising three values.
Thanks for your time and let me know if you need any information
The first thing to consider is whether it's necessary to shard. If your data set fits on a single server, then start out with an unsharded deployment. It's easy and seamless to convert this to a sharded cluster later on if needed.
Assuming you do indeed need to shard, your choice of shard key should be based on the following criteria:
Cardinality - choose a shard key that is not limited to a small number of possible values, so that MongoDB can evenly distribute data among the shards in your cluster.
Write distribution - choose a shard key that evenly distributes write operations among shards in the cluster, to prevent any single shard from becoming a bottleneck.
Query isolation - choose a shard key that is included in your most frequent queries, so that those queries may be efficiently routed to a single target shard that holds the data, as opposed to being broadcast to all shards.
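The first criterion can be checked roughly on a sample of documents before committing to a key. The sketch below is just one way to look at it; the sample data and the `status` field are invented, and a non-empty sample is assumed.

```python
from collections import Counter


def key_stats(docs, field):
    """Summarise how a candidate shard key field is distributed in a sample."""
    counts = Counter(doc[field] for doc in docs)
    total = sum(counts.values())
    return {
        "distinct_values": len(counts),             # cardinality
        "max_share": max(counts.values()) / total,  # share of the most frequent value
    }


# Hypothetical sample: 'status' has two values, 'case_id' is unique per document.
sample = [{"case_id": 1000 + i, "status": "open" if i % 4 else "closed"}
          for i in range(100)]
print(key_stats(sample, "status"))
print(key_stats(sample, "case_id"))
```

A field with few distinct values, or one value that dominates the sample, is a poor candidate because the data can only be split into as many pieces as there are distinct values.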
You mention that all your queries contain either Case ID, Customer ID or location, but haven't described your use cases. By way of an example let's suppose your most frequent queries are to:
retrieve a customer case
retrieve all cases for a given customer
In that case, a good shard key candidate would be a compound shard key on (name, caseID), in that order, with a corresponding compound index. Consider whether this satisfies the above criteria:
Cardinality - each document has a different value for the shard key so cardinality is excellent.
Write distribution - cases for all customers are distributed across all shards.
Query isolation:
To retrieve a specific case, name and caseID should be included in the query. This query will be routed to the specific shard that holds the document.
To retrieve all cases for a given customer, include name in the query. This query therefore includes a prefix of the shard key so will also be efficiently routed only to the specific shard(s) that hold documents that match the query.
Note that you cannot use a geospatial index as part of a shard key index (as noted in the MongoDB documentation). However, you can still create and use a geospatial index on a sharded collection if using some other fields for the shard key. So for example, with the above shard key:
a geospatial query that also includes customer name will be targeted at the relevant shard(s).
a geospatial query that doesn't include customer name will be broadcast to all shards (a 'scatter/gather' query).
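The routing behaviour for the (name, caseID) key can be sketched as follows. This is a simplified model rather than the real router: the chunk boundaries, shard names and customer names are invented, and names are assumed to be plain lowercase strings.

```python
from bisect import bisect_right

# Hypothetical chunks on the compound key (name, caseID). Each chunk covers
# [its lower bound, the next chunk's lower bound) in tuple order.
MIN_CASE, MAX_CASE = 0, 9_999_999_999  # caseID is a 10-digit number
CHUNK_LOWER_BOUNDS = [("", MIN_CASE), ("acme", 5_000_000_000),
                      ("globex", MIN_CASE), ("initech", MIN_CASE)]
CHUNK_OWNERS = ["shard0", "shard1", "shard2", "shard0"]


def chunk_index(key):
    """Index of the chunk whose range contains the compound key."""
    return bisect_right(CHUNK_LOWER_BOUNDS, key) - 1


def target_shards(name=None, case_id=None):
    """Shards a toy router would contact for an equality query."""
    if name is None:
        # No shard-key prefix in the query: broadcast to every shard.
        return sorted(set(CHUNK_OWNERS))
    # With only the prefix, scan from the smallest to the largest caseID.
    low = chunk_index((name, MIN_CASE if case_id is None else case_id))
    high = chunk_index((name, MAX_CASE if case_id is None else case_id))
    return sorted(set(CHUNK_OWNERS[low:high + 1]))


print(target_shards("acme", 1234567890))  # full shard key
print(target_shards("acme"))              # prefix only
print(target_shards())                    # no shard key at all
```

A geospatial query that also filters on name is routed like the prefix-only case; one without name falls into the broadcast branch.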
Additional documentation on shard key considerations can be found in the MongoDB sharding documentation.