I have a collection A containing one type of documents, and a second collection B containing another kind of documents.
There are multiple documents in collection B that have the same value for the field "b" which references field "a" in the collection A.
If we shard the two collections A and B on "a" and "b" respectively, can we be assured that documents in collection A having "a=foobar" will be co-located with documents in collection B having "b=foobar"?
Shard key indexes are defined per collection, and (as of MongoDB 4.0) collections are balanced independently. Even if two collections have identical shard keys, there is no guarantee that their chunk ranges or shard assignments will align.
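To make this concrete, here is a minimal Python sketch. It is a toy model, not MongoDB code: the chunk boundaries, shard names, and the `shard_for` helper are all invented for illustration. It shows how the same key value can land on different shards when each collection has its own chunk layout.

```python
from bisect import bisect_right

def shard_for(key, split_points, chunk_owners):
    """Return the shard owning the chunk that contains `key`.

    split_points: sorted boundaries between chunks (hypothetical values).
    chunk_owners: one shard name per chunk, len(split_points) + 1 entries.
    """
    return chunk_owners[bisect_right(split_points, key)]

# Both collections use the same key space, but each one gets its own
# chunk boundaries and its own chunk-to-shard assignment.
a_splits, a_owners = ["g", "p"], ["shard0", "shard1", "shard2"]
b_splits, b_owners = ["c", "k"], ["shard2", "shard1", "shard0"]

key = "foobar"
shard_a = shard_for(key, a_splits, a_owners)  # first chunk of A (keys below "g")
shard_b = shard_for(key, b_splits, b_owners)  # middle chunk of B ("c" up to "k")
print(shard_a, shard_b)
```

In this made-up layout the value "foobar" lives on shard0 for collection A and on shard1 for collection B, even though both collections are sharded on the same values.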
If you plan to use server-side queries to combine data from these collections using $lookup or $graphLookup, note that additional collections you are looking up from cannot currently be sharded. For this use case you would only shard one of the collections. For sharded lookup support there are some relevant improvements to watch/upvote in the MongoDB issue tracker: SERVER-29159 (sharded $lookup) and SERVER-27533 (sharded $graphLookup).
There are a few possible approaches to co-locating data, but all have caveats:
Denormalize: duplicate the most commonly used fields from A into B. This can speed up data retrieval by avoiding the need for joins, but adds some overhead for updates and data storage.
Embed the related data so you have a single sharded collection. This will not be ideal if your collections have very different growth or access patterns, or a large one-to-many relationship.
Manage the data distribution manually: disable balancing for these collections, manually split (or pre-split) chunks so the chunk ranges are identical, and use zone sharding for shard affinity.
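As a rough illustration of the manual approach, the Python sketch below only builds the command documents for the split and zone steps. It does not connect to a cluster. The `split`, `addShardToZone` and `updateZoneKeyRange` commands are real admin commands, but the helper name, namespaces, boundaries, zone names and the `lo`/`hi` placeholders are assumptions made for this example. With a real driver the outer bounds would be MinKey/MaxKey values rather than strings, and disabling balancing (for example with `sh.disableBalancing()` in the shell) is a separate step not shown here.

```python
def colocation_commands(namespaces, boundaries, shards, lo, hi):
    """Build command documents that give every namespace identical chunk
    boundaries and tie each key range to the same zone (one zone per shard).

    namespaces: list of (namespace, shard_key_field) pairs
    boundaries: sorted split points; needs len(shards) - 1 entries
    lo, hi:     placeholders for the lowest and highest possible key
    """
    if len(shards) != len(boundaries) + 1:
        raise ValueError("need exactly one shard per key range")
    edges = [lo, *boundaries, hi]
    # One zone per shard (hypothetical naming scheme).
    commands = [{"addShardToZone": s, "zone": f"zone_{s}"} for s in shards]
    for ns, field in namespaces:
        # Same split points for every collection.
        commands += [{"split": ns, "middle": {field: b}} for b in boundaries]
        # Same range-to-zone mapping for every collection.
        for i, s in enumerate(shards):
            commands.append({
                "updateZoneKeyRange": ns,
                "min": {field: edges[i]},
                "max": {field: edges[i + 1]},
                "zone": f"zone_{s}",
            })
    return commands

commands = colocation_commands(
    namespaces=[("app.A", "a"), ("app.B", "b")],
    boundaries=["m"],
    shards=["shard0", "shard1"],
    lo="<MinKey>",
    hi="<MaxKey>",
)
for cmd in commands:
    print(cmd)
```

Each printed document could then be sent to the admin database through a driver. The point of the sketch is only that both collections receive the same boundaries and the same range-to-zone mapping.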
For more information on relationship patterns, the Six Rules of Thumb for MongoDB Schema Design blog series is a helpful read. It doesn't cover sharding but the general data model considerations still apply.
It has the following one-to-many relationship.
UserProfile - UserActivity,
UserProfile - UserItem,
UserProfile - ... ,
and so on.
Since there are many documents such as UserActivity and UserItem, collections are used instead of arrays.
As far as I know, documents in different collections are distributed across shards independently, even if they have the same _id (see "Same shards across different MongoDB collections").
What I'm curious about is whether using zones to keep all documents of a specific user on one shard, and accessing them in a transaction, is faster than a distributed transaction across shards. This applies to both reads and writes.
(Shards are physically close)
https://docs.mongodb.com/manual/tutorial/sharding-segmenting-shards/
Pay attention to Sharding Query Pattern:
The ideal shard key distributes data evenly across the sharded cluster while also facilitating common query patterns. When you choose a shard key, consider your most common query patterns and whether a given shard key covers them.
In a sharded cluster, the mongos routes queries to only the shards that contain the relevant data if the queries contain the shard key. When the queries do not contain the shard key, the queries are broadcast to all shards for evaluation. These types of queries are called scatter-gather queries. Queries that involve multiple shards for each request are less efficient and do not scale linearly when more shards are added to the cluster.
This does not apply for aggregation queries that operate on a large amount of data. In these cases, scatter-gather can be a useful approach that allows the query to run in parallel on all shards.
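The difference between a targeted and a scatter-gather query can be sketched in a few lines of Python. This is a toy model of the routing decision only. The shard key name `user_id`, the chunk boundaries and the shard names are hypothetical, and only equality filters are considered.

```python
from bisect import bisect_right

SPLITS = [100, 200]                      # hypothetical chunk boundaries on user_id
OWNERS = ["shard0", "shard1", "shard2"]  # one owning shard per chunk

def target_shards(query_filter, shard_key="user_id"):
    """Return the shards a router would need to contact for an equality filter."""
    if shard_key in query_filter:
        # Shard key present: only one chunk, hence one shard, can match.
        return [OWNERS[bisect_right(SPLITS, query_filter[shard_key])]]
    # Shard key absent: no shard can be ruled out, so all are contacted.
    return list(OWNERS)

targeted = target_shards({"user_id": 150, "status": "active"})
broadcast = target_shards({"status": "active"})
print(targeted)
print(broadcast)
```

The first filter includes the shard key and resolves to a single shard. The second does not, so every shard has to evaluate it.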
See also Zones:
Some common deployment patterns where zones can be applied are as follows:
Isolate a specific subset of data on a specific set of shards. (Maybe enforced by some data protection laws)
Ensure that the most relevant data reside on shards that are geographically closest to the application servers.
Route data to shards based on the hardware / performance of the shard hardware.
Your question does not provide enough information to tell whether any of the above applies in your case.
For my application I need to shard a fairly big collection. The entire collection will contain approximately 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserts, either Sharding Key will distribute documents evenly throughout the cluster, so it does not matter which field I use as Sharding Key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality, so there won't be any difference in that respect. Usually the number of documents returned by a query is very low (typically fewer than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserts, either Sharding Key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
Sharding Key.
When you insert a document it will have a shard key and the document will be stored on a designated shard.
Field(1) is usually part of the query filter condition, thus query
would be processed usually on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, it will be a scatter-gather operation, which is typically slower because every shard has to be queried. It is important that the most important query operations of the application using the sharded collection are able to use the shard key.
Field(2) is typically not part of the query filter condition, thus
query would be processed over all shards and typically several shards
will contribute to final query result.
When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.
Which one is the better field to be used as Sharding Key?
It can be concluded that the Field(1) must be used as a shard key.
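One way to check this reasoning against a real workload is to count how many queries each candidate key would let the router target. The Python sketch below does that for a made-up list of query filters. The field names `field1`, `field2` and `field3`, and the sample workload, are placeholders and not taken from the question.

```python
def targeted_fraction(query_filters, candidate_key):
    """Fraction of queries whose filter includes the candidate shard key."""
    if not query_filters:
        return 0.0
    hits = sum(1 for f in query_filters if candidate_key in f)
    return hits / len(query_filters)

# Hypothetical sample of query filters taken from an application log.
workload = [
    {"field1": 17},
    {"field1": 42, "field3": "x"},
    {"field1": 7},
    {"field2": 99},
]

scores = {key: targeted_fraction(workload, key) for key in ("field1", "field2")}
best = max(scores, key=scores.get)
print(scores, best)
```

With this sample, three of four queries filter on `field1`, so sharding on it would let most queries hit a single shard. Sharding on `field2` would turn those same queries into scatter-gather operations.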
See the MongoDB documentation on shard keys and on choosing a shard key.
My system is built on multi-tenancy, and I'm intending to apply database sharding and replica set on it. This is new to me, so I have some questions below:
Is it possible to keep each collection entirely on a single shard? That is, instead of splitting a collection so that some documents are on one shard and some on another, I want to put one collection completely on one shard, and another collection completely on another shard. My multi-tenant system is built on schema-per-tenant, so one collection represents one tenant. Putting each of them completely on one shard would make aggregate queries more reliable within that tenant's scope.
If MongoDB is unable to support the answer of question 1, how can I aggregate the queried data among shards correctly if a collection's documents are scattered?
I want to know the full extent of support provided by the DBMS instead of delegating the logic to the backend. Thank you very much.
Generally, if a query spreads across multiple shards, it is considered less optimized. It takes more time than reading from single shard.
Does it hold true for writing as well? If I am writing some data and it will distribute among multiple shards, will it be considered less optimized?
If yes, what is the best way to write a batch whose documents should go to different shards?
It depends on the operations, see https://docs.mongodb.com/manual/core/sharded-cluster-query-router/#sharding-mongos-targeted.
All insertOne() operations target one shard. Each document in the insertMany() array targets a single shard, but there is no guarantee that all documents in the array are inserted into the same shard.
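To see why a single insertMany() batch can touch several shards, here is a small Python sketch that groups a batch by destination shard. The placement function is a stand-in (an MD5 digest modulo the shard count) and is not MongoDB's actual hashed-sharding algorithm. The shard names and the `user_id` key are also assumptions.

```python
import hashlib
from collections import defaultdict

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names

def shard_for(key):
    """Toy hashed placement: MD5 of the key, modulo the number of shards."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def group_by_shard(docs, shard_key):
    """Split one batch into per-shard sub-batches, keyed by shard name."""
    groups = defaultdict(list)
    for doc in docs:
        groups[shard_for(doc[shard_key])].append(doc)
    return dict(groups)

batch = [{"user_id": i, "n": i * i} for i in range(12)]
groups = group_by_shard(batch, "user_id")
for shard, docs in sorted(groups.items()):
    print(shard, [d["user_id"] for d in docs])
```

The router does this kind of grouping for you, so the application does not need to pre-sort a batch. The sketch only shows that each document is targeted individually.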
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
I have a mongodb which links documents (the data cannot be embedded)
Does the mongos cluster (http://docs.mongodb.org/manual/core/sharding-introduction/) support sharding when the documents are linked?
How does this impact performance?
Thanks!
Since there is nothing special about referenced documents (the relationship is purely logical and is inferred by the application layer, not by MongoDB itself), sharding is supported. This applies to "manual" references as well as DBRefs. You can even shard on a DBRef property, although I'm not sure why you'd want to, considering a DBRef should have inherently low cardinality.
There is an impact in performance for both manual and DBRefs, in that multiple queries must be performed to "join" the data. From the docs:
To resolve DBRefs, your application must perform additional queries to
return the referenced documents. Many drivers have helper methods that
form the query for the DBRef automatically. The drivers do not
automatically resolve DBRefs into documents.
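The extra round trip can be illustrated with a short Python sketch that resolves a manual reference in application code. The in-memory lists stand in for two collections and `find_one` is a simplified stand-in for a driver call. The collection names, fields and data are invented for this example.

```python
# In-memory stand-ins for two collections (hypothetical data).
authors = [{"_id": 1, "name": "Ada"}, {"_id": 2, "name": "Grace"}]
posts = [
    {"_id": 10, "title": "Sharding notes", "author_id": 1},
    {"_id": 11, "title": "Zones", "author_id": 2},
]

def find_one(collection, **criteria):
    """Return the first document matching all criteria, or None."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            return doc
    return None

def post_with_author(post_id):
    """Resolve a manual reference: one query per collection involved."""
    post = find_one(posts, _id=post_id)                 # query 1
    if post is None:
        return None
    author = find_one(authors, _id=post["author_id"])   # query 2
    return {**post, "author": author}

result = post_with_author(10)
print(result)
```

Each lookup is an independent query, so on a sharded cluster each one is routed on its own, according to the shard key of the collection it reads.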
There is no such thing as "document links" in MongoDB, just fields in documents of collection A which happen to have the same values as fields of documents in collection B. DBRefs are just a convention on the application layer and get no special treatment whatsoever from the database.
What matters for sharding efficiency is how you define the shard key for the referenced collection. When the field you search by is part of the shard key of the collection, mongos can accelerate it by redirecting the query to the correct shard.
You likely want all documents of collection A which belong to the same document of collection B to reside on the same shard. That means the shard key of A should include the field of A which holds the unique identifier of B (ObjectId, name or whatever).