While browsing MongoDB sharding tutorials I came across the following assertion:
"If you use the shard key in the query, it's going to hit a small number of shards, often only ONE."
On the other hand, from my earlier elementary knowledge of sharding, I was under the impression that the mongos routing service can uniquely identify the target shard if the query filters on the shard key. My question is: under what circumstances does a shard-key-based query stand a chance of hitting multiple shards?
A query using the shard key will be targeted to the subset of shards that can contain matching data, but depending on the query and the data distribution this could be as few as one shard or as many as all of them.
Borrowing a helpful image from the MongoDB documentation on shard keys (the diagram shows a collection partitioned by shard key x into four chunks: Chunk 1 covering [minKey, -75), Chunk 2 covering [-75, 25), Chunk 3 covering [25, 175), and Chunk 4 covering [175, maxKey)):
MongoDB uses the shard key to automatically partition data into logical ranges of shard key values called chunks. Each chunk represents approximately 64MB of data by default, and is associated with a single shard that currently owns that range of shard key values. Chunk counts are balanced across available shards, and there is no expectation of adjacent chunks being on the same shard.
If you query for a shard key value (or range of values) that falls within a single chunk, the mongos can definitely target a single shard.
Assuming chunk ranges as in the image above:
// Targeted query to the shard with Chunk 3
db.collection.find( { x: 50 } )
// Targeted query to the shard with Chunk 4
db.collection.find( { x: { $gte: 200 } } )
If your query spans multiple chunk ranges, the mongos can target the subset of shards that contain relevant documents:
// Targeted query to the shard(s) with Chunks 3 and 4
db.collection.find( { x: { $gte: 50 } } )
The two chunks in this example will either be on the same shard or two different shards. You can review the explain results for a query to find out more information about which shards were accessed.
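For example, a minimal check in the mongo shell, reusing the collection from the examples above:
// Run explain to see which shards served the query; the "shards"
// section of the output lists every shard that was accessed
db.collection.find( { x: { $gte: 50 } } ).explain("executionStats")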
It's also possible to construct a query that would require data from all shards (for example, based on a large range of shard key values):
// Query includes data from all chunk ranges
db.collection.find( { x: { $gte: -100 } } )
Note: the above information describes range-based sharding. MongoDB also supports hash-based shard keys which will (intentionally) distribute adjacent shard key values to different chunk ranges after hashing. Range queries on hashed shard keys are expected to include multiple shards. See: Hashed vs Ranged Sharding.
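As a quick illustration, enabling hashed sharding looks like this (the namespace here is hypothetical):
// Shard on a hashed key: adjacent values of x hash to different
// chunk ranges, so equality lookups stay targeted but range
// queries on x become scatter-gather
sh.shardCollection( "test.collection", { x: "hashed" } )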
I want to shard a collection that already contains data. When I run sh.shardCollection("myDb.myCollection", { id: "hashed" }), the collection is sharded, but it does not spread across all the shards; it stays on the primary shard only. For example:
When I shard an empty collection (sh.status() output omitted), the chunks are distributed across all the shards, and data added afterwards spreads across the whole cluster.
When I shard a collection that already has data (sh.status() output omitted), the chunks stay on the primary shard, and data added afterwards also goes only to the primary shard.
My question is: how do I correctly shard a collection with existing data in MongoDB? Is there an alternative way?
I agree with @Wernfried Domscheit in the comments that the cluster will take care of distributing the data once the collection is sharded. As mentioned, that happens in response to writes to the collection and takes place over time. Your test may have too little data or too few writes to trigger the migrations.
To your specific question about the initial distribution of chunks, this is covered in the documentation. Applying a hashed shard key on an empty collection in your first example is covered here:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster. You can use numInitialChunks option to specify a different number of initial chunks. This initial creation and distribution of chunks allows for faster setup of sharding.
And behavior on the collection with data is covered just above it here:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
Both of these described behaviors match what you have demonstrated in your question.
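As a sketch, here is how you could use the numInitialChunks option mentioned in the quoted documentation (the chunk count is illustrative):
// Request 8 initial chunks instead of the default 2 per shard;
// this option applies to hashed shard keys on empty collections
sh.shardCollection( "myDb.myCollection", { id: "hashed" }, false, { numInitialChunks: 8 } )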
For my application I need to shard a fairly big collection; the entire collection will contain approximately 500 billion documents.
I have two potential fields which can be used as the shard key:
For inserting, either shard key would distribute documents evenly throughout the cluster, so it does not matter which field I use as the shard key.
For queries it is different.
Field(1) is usually part of the query filter condition, so a query would usually be processed on a single shard only.
Field(2) is typically not part of the query filter condition, so a query would be processed over all shards and typically several shards will contribute to the final query result.
Which field is better to use as the shard key? I did not find anything in the MongoDB documentation about this topic.
Both fields have the same range and very similar cardinality figures, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically fewer than 20-30 documents).
In a sharded cluster the mongos router determines which shard(s) to target for a read or write operation, based on the shard key metadata stored on the config servers.
For inserting, either shard key would distribute documents evenly throughout the cluster, so it does not matter which field I use as the shard key.
When you insert a document, its shard key value determines the designated shard where the document will be stored.
Field(1) is usually part of the query filter condition, so a query would usually be processed on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation (a long-running query). It is therefore important that the application's most frequent query operations on the sharded collection are able to use the shard key.
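A minimal sketch of a targeted query, assuming the collection is sharded on a hypothetical field1:
// Targeted: the filter includes the shard key, so the mongos
// routes the query to the single shard owning this key value
db.collection.find( { field1: 1234 } )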
Field(2) is typically not part of the query filter condition, so a query would be processed over all shards and typically several shards will contribute to the final query result.
When the shard key is not part of the query filter, the operation spans multiple shards (a scatter-gather operation) and will be a slow-running operation. The mongos router cannot determine which shards hold the target data, so all the shards in the cluster are queried to return the final result.
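For contrast, a sketch of a query filtering only on the hypothetical non-shard-key field2:
// Scatter-gather: no shard key in the filter, so the mongos
// broadcasts the query to every shard and merges the results
db.collection.find( { field2: "some value" } )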
Which field is better to use as the shard key?
It can be concluded that Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.
I have a collection with 170 million+ documents, and it is only going to increase. The size of the collection is not that huge, currently around 70 GB.
The collection has two fields indexed on: {AgentId: 1, PropertyId: 1}.
Generally one imports a huge file (millions of documents) belonging to a particular AgentId, but the PropertyId (non-numeric, nullable) is mostly a random unique value.
Currently I have two shards with the shard key based on {_id: "hashed"}. But I am planning to change the shard key to the compound index {AgentId: 1, PropertyId: 1} because I think it will improve query performance (most of the queries are based on an AgentId filter). I am not sure whether one can have a nullable field in the shard key. If that is a problem, the app will make sure that PropertyId is a random number.
So I am looking to get a picture as to:
How will the data be distributed to shards during insertion, and how are the ranges of the chunks calculated during insertion?
Since PropertyId is a random value, does the compound key fit the definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path. Thanks!
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
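A minimal sketch of steps 3 and 4 in the mongo shell, with a hypothetical namespace and split points (the dump and restore steps would use the mongodump and mongorestore tools):
// Shard on the new key, then pre-split so that chunk ranges exist
// on multiple shards before the restore begins
sh.shardCollection( "myDb.myCollection", { newKey: 1 } )
sh.splitAt( "myDb.myCollection", { newKey: 1000 } )
sh.splitAt( "myDb.myCollection", { newKey: 2000 } )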
Requirements:
up to one billion documents per chunk (a single shard key value)
tens of thousands of chunks (30k)
queries run only within chunk scope, i.e. filtered by the shard key
3 indexes - single: hashed shard key; compound: shard key + _id; compound: shard key + 3 fields
all access paths are writes: insert, find-and-update, find-and-delete
What sharding strategy should I pick for MongoDB?
Mongo hash-based sharding with shard key (String)
Application level pseudo-sharding with each chunk going to its separate collection
Concerns about MongoDB:
indexes won't fit in memory for a billion documents
all queries are writes, and MongoDB replication is master-slave
Is option 1 a good idea?
With option 2, is it possible to randomly distribute the collections across the Mongo cluster?
I would like to shard my collection by range across MongoDB shards. My question: if the shard key is a string field, how do we divide a string-based shard key into different chunks for range-based sharding?
You can divide a string-valued shard key across shards using tag-aware sharding. You create "tags" denoting the ranges of the key to assign to a specific shard. Mongo's balancer will handle the distribution of the data, and when you write a query for the key in question, Mongo will know to target only that shard.
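A minimal sketch in the mongo shell, with hypothetical shard names, namespace, and key range:
// Tag a shard, then assign it the range of string shard key
// values from "a" (inclusive) to "n" (exclusive)
sh.addShardTag( "shard0000", "A-M" )
sh.addTagRange( "mydb.users", { name: "a" }, { name: "n" }, "A-M" )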
For more information, see the vendor's documentation: sharding-introduction/