Requirements:
up to one billion documents per chunk (single shard key)
tens of thousands of chunks (30k)
queries are run only in chunk scope - filtered by shard key
3 indexes - single: hashed shard key, compound: shard key + _id, compound: shard key + 3 fields
all access paths are writes: insert, find-and-update, find-and-delete
What sharding strategy should I pick for MongoDB?
1. Mongo hash-based sharding with shard key (String)
2. Application-level pseudo-sharding, with each chunk going to its own separate collection
Concerns about MongoDB:
Indexes won't fit in memory for a billion documents
All queries are writes and Mongo is master-slave
Is option 1 a good idea?
With option 2, is it possible to randomly distribute collections across the Mongo cluster?
Related
I have a collection with a lot of documents. I sharded that collection. I have 2 shards. Yet all documents still reside in the primary shard. Why is it not split across the different shards?
100 % of data is still in the primary shard
For my application I need to shard a fairly big collection; the entire collection will contain approx. 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserting, either Sharding Key will distribute documents evenly throughout the cluster, so it does not matter which field I use as Sharding Key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically less than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserting, either Sharding Key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
Sharding Key.
When you insert a document it will have a shard key and the document will be stored on a designated shard.
Field(1) is usually part of the query filter condition, thus query
would be processed usually on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation (a long-running query). It is important that the most frequent and most critical query operations of the application using the sharded collection are able to use the shard key.
Field(2) is typically not part of the query filter condition, thus
query would be processed over all shards and typically several shards
will contribute to final query result.
When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.
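The difference between the two cases can be sketched in plain JavaScript. This is a simplified model for illustration only, not how mongos is implemented: the function name `classifyQuery` and the field names `field1` and `field2` are hypothetical, and the check only looks at whether the filter contains the leading shard key field.

```javascript
// Simplified model of the routing decision (an assumption for illustration;
// the real mongos also consults chunk metadata from the config servers).
function classifyQuery(filter, shardKeyFields) {
  // A query can be targeted when it filters on the shard key
  // (or on the leading field of a compound shard key).
  const leadingField = shardKeyFields[0];
  const usesShardKey = Object.prototype.hasOwnProperty.call(filter, leadingField);
  return usesShardKey ? "targeted" : "scatter-gather";
}

// Hypothetical collection sharded on field1.
const shardKey = ["field1"];
console.log(classifyQuery({ field1: "abc", status: "open" }, shardKey)); // targeted
console.log(classifyQuery({ field2: 42 }, shardKey));                    // scatter-gather
```

On a real cluster, the explain output of a query shows which shards were actually contacted.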
Which one is the better field to be used as Sharding Key?
It can be concluded that Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.
First, below is the sharded cluster I created with Ops Manager:
I have 2 mongos instances and 2 shards (each shard is configured as a replica set). I have not configured any shard key; that is, no sharded collections exist in my cluster.
When I insert a database through mongos for testing purposes, the database is stored on only one shard.
I want the data to be split and stored in a balanced way across both shards when I insert a database, and I want to be able to query through mongos and get accurate data.
Does anyone have the same issue?
Databases and collections are not sharded automatically: a sharded deployment can contain both unsharded and sharded data. Unsharded collections will be created on the primary shard for a given database.
If you want to shard a collection you need to take a few steps in the mongo shell connected to a mongos process for your sharded deployment:
Run sh.enableSharding(<database>) for a database (this is a one-off action per database)
Choose a shard key and shard the collection using sh.shardCollection()
See Shard a Collection in the MongoDB manual for specific steps.
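As a minimal sketch, the two steps could be wrapped in a small helper. The helper name `shardExistingCollection` and the database, collection, and key names in the example call are placeholders, not values from the question; `sh` is the sharding helper object that the mongo shell provides when connected to a mongos.

```javascript
// Sketch of the two steps above. `sh` is passed in explicitly so the
// helper can also be exercised outside the shell with a stand-in object.
function shardExistingCollection(sh, dbName, collName, shardKey) {
  // Step 1: enable sharding for the database (one-off action per database).
  sh.enableSharding(dbName);
  // Step 2: shard the collection on the chosen shard key.
  sh.shardCollection(dbName + "." + collName, shardKey);
}

// Example call in the mongo shell (hypothetical names):
// shardExistingCollection(sh, "test", "events", { userId: "hashed" });
```

If the collection already contains data, an index that starts with the shard key has to exist before the collection can be sharded.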
It is important to choose a good shard key for your data distribution and use case. Poor choices of shard key may result in unequal data distribution or limit your sharding performance. The MongoDB documentation has more information on the considerations and options for choosing a shard key.
If you are not sure whether a collection is sharded, or want to see a summary of the current data distribution, you can use db.collection.getShardDistribution() in the mongo shell.
You need to define zone ranges so that data is stored on each shard according to the range it falls into.
The code below helps you create the zones:
For zone01:
sh.addShardTag("rs1", "zone01")
sh.addTagRange("myDB.col01", { num: 1 }, { num: 10 }, "zone01")
For zone02:
sh.addShardTag("rs2", "zone02")
sh.addTagRange("myDB.col01", { num: 11 }, { num: 20 }, "zone02")
See Manage Shard Zones in the MongoDB manual for more details.
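One detail worth checking with ranges like these: MongoDB zone ranges include the lower bound and exclude the upper bound. The sketch below models that rule in plain JavaScript, using the two ranges from the commands above and assuming the collection is sharded on `num`; the function name `zoneFor` is hypothetical.

```javascript
// Zone ranges copied from the commands above: [min, max) on the `num` field.
const zoneRanges = [
  { zone: "zone01", min: 1, max: 10 },
  { zone: "zone02", min: 11, max: 20 },
];

// Returns the zone whose range contains `num`, or null if no range matches.
// Lower bound inclusive, upper bound exclusive.
function zoneFor(num) {
  const match = zoneRanges.find((r) => num >= r.min && num < r.max);
  return match ? match.zone : null;
}

console.log(zoneFor(5));  // zone01
console.log(zoneFor(10)); // null: 10 is not covered by either range
console.log(zoneFor(15)); // zone02
```

Documents whose shard key falls outside every zone range, such as `num: 10` here, can be placed on any shard, so contiguous ranges (for example 1 to 11 and 11 to 21) avoid the gap.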
While browsing mongodb sharding tutorials I came across the following assertion :
"If you use shard key in the query, its going to hit a small number of shards, often only ONE"
On the other hand from some of my earlier elementary knowledge of sharding, I was under the impression that mongos routing service can uniquely point out the target shard if the query is fired on Shard Key. My question is - under what circumstances, a shard key based query stands a chance of hitting multiple shards?
A query using the shard key will target the subset of shards to retrieve data for your query, but depending on the query and data distribution this could be as few as one or as many as all shards.
Borrowing a helpful image from the MongoDB documentation on shard keys:
MongoDB uses the shard key to automatically partition data into logical ranges of shard key values called chunks. Each chunk represents approximately 64MB of data by default, and is associated with a single shard that currently owns that range of shard key values. Chunk counts are balanced across available shards, and there is no expectation of adjacent chunks being on the same shard.
If you query for a shard key value (or range of values) that falls within a single chunk, the mongos can definitely target a single shard.
Assuming chunk ranges as in the image above:
// Targeted query to the shard with Chunk 3
db.collection.find( { x: 50 } )
// Targeted query to the shard with Chunk 4
db.collection.find( {x: { $gte: 200} } )
If your query spans multiple chunk ranges, the mongos can target the subset of shards that contain relevant documents:
// Targeted query to the shard(s) with Chunks 3 and 4
db.collection.find( {x: { $gte: 50} } )
The two chunks in this example will either be on the same shard or two different shards. You can review the explain results for a query to find out more information about which shards were accessed.
It's also possible to construct a query that would require data from all shards (for example, based on a large range of shard key values):
// Query includes data from all chunk ranges
db.collection.find( {x: { $gte: -100} } )
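The targeting behaviour in these examples can be modelled with a short plain JavaScript sketch. Since the image is not reproduced here, the chunk boundaries (-75, 25, 175) are an assumption chosen to agree with the comments in the examples, and the shard names and the helpers `chunksFor` and `shardsFor` are hypothetical.

```javascript
// ASSUMED chunk boundaries; each chunk covers [min, max) of shard key x.
// Shard names are hypothetical.
const chunks = [
  { name: "Chunk 1", min: -Infinity, max: -75, shard: "shardA" },
  { name: "Chunk 2", min: -75, max: 25, shard: "shardB" },
  { name: "Chunk 3", min: 25, max: 175, shard: "shardA" },
  { name: "Chunk 4", min: 175, max: Infinity, shard: "shardB" },
];

// Chunks whose range overlaps the closed query range [lo, hi].
function chunksFor(lo, hi) {
  return chunks.filter((c) => c.min <= hi && lo < c.max);
}

// Distinct shards that own at least one overlapping chunk.
function shardsFor(lo, hi) {
  return [...new Set(chunksFor(lo, hi).map((c) => c.shard))];
}

const names = (lo, hi) => chunksFor(lo, hi).map((c) => c.name).join(", ");
console.log(names(50, 50));         // Chunk 3
console.log(names(200, Infinity));  // Chunk 4
console.log(names(50, Infinity));   // Chunk 3, Chunk 4
console.log(names(-100, Infinity)); // Chunk 1, Chunk 2, Chunk 3, Chunk 4
console.log(shardsFor(50, Infinity).join(", ")); // shardA, shardB
```

With this hypothetical chunk placement, the `$gte: 50` query touches two shards; if Chunks 3 and 4 happened to live on the same shard, it would touch only one.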
Note: the above information describes range-based sharding. MongoDB also supports hash-based shard keys which will (intentionally) distribute adjacent shard key values to different chunk ranges after hashing. Range queries on hashed shard keys are expected to include multiple shards. See: Hashed vs Ranged Sharding.
I am a bit confused about shard keys in Mongo.
Is it possible to use several shard keys when you shard a collection?
Shard key indexes are defined at the collection level and each collection within a database can only have a single shard key index. Within a sharded cluster you have the choice of sharding some or all collections.
It is important to note that shard keys are immutable and once the shard key is created, it cannot be modified.
For more information see:
Deploy a Sharded Cluster
Considerations for Selecting Shard Keys