I am creating a collection which stores JSON objects using MongoDB. I am stuck on the sharding part.
I have a Case ID, Customer ID and Location for each record in the collection.
The Case ID is a 10-digit number (digits only, no letters).
The CustomerID is a combination of the customer name and the case ID.
The location is a 2dsphere value, and I expect many distinct location values.
In addition to this, each record has a customer name and a case description.
All my search queries have search criteria of either Case ID, CustomerID or location.
Given this scenario, can I create a compound shard key based on all three values (CaseID, CustomerID and location)? I believe this gives high cardinality and makes the records easy to retrieve.
Could anyone please suggest whether this is a good approach? I have not found examples of a compound shard key comprising three values.
Thanks for your time, and let me know if you need any more information.
The first thing to consider is whether it's necessary to shard. If your data set fits on a single server, then start out with an unsharded deployment. It's easy and seamless to convert this to a sharded cluster later on if needed.
Assuming you do indeed need to shard, your choice of shard key should be based on the following criteria:
Cardinality - choose a shard key that is not limited to a small number of possible values, so that MongoDB can evenly distribute data among the shards in your cluster.
Write distribution - choose a shard key that evenly distributes write operations among shards in the cluster, to prevent any single shard from becoming a bottleneck.
Query isolation - choose a shard key that is included in your most frequent queries, so that those queries may be efficiently routed to a single target shard that holds the data, as opposed to being broadcast to all shards.
You mention that all your queries contain either Case ID, Customer ID or location, but haven't described your use cases. By way of an example let's suppose your most frequent queries are to:
retrieve a customer case
retrieve all cases for a given customer
In that case, a good shard key candidate would be a compound shard key on (name, caseID), in that order, with a corresponding compound index (a sample setup is sketched after the list below). Consider whether this satisfies the above criteria:
Cardinality - each document has a different value for the shard key so cardinality is excellent.
Write distribution - cases for all customers are distributed across all shards.
Query isolation:
To retrieve a specific case, name and caseID should be included in the query. This query will be routed to the specific shard that holds the document.
To retrieve all cases for a given customer, include name in the query. This query therefore includes a prefix of the shard key so will also be efficiently routed only to the specific shard(s) that hold documents that match the query.
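As a minimal sketch of that setup in the mongo shell, assuming a hypothetical "mydb.cases" namespace with fields "name" and "caseID" (these names are placeholders, not taken from the question):
sh.enableSharding("mydb")
db.cases.createIndex({ name: 1, caseID: 1 })              // supporting compound index
sh.shardCollection("mydb.cases", { name: 1, caseID: 1 })  // shard on (name, caseID)

// Both queries include a prefix of the shard key, so mongos routes them
// only to the shard(s) owning the matching range:
db.cases.find({ name: "Acme Corp", caseID: 1234567890 })  // a specific case
db.cases.find({ name: "Acme Corp" })                      // all cases for a customer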
Note that you cannot use a geospatial index as part of a shard key index (as documented here). However, you can still create and use a geospatial index on a sharded collection if using some other fields for the shard key. So for example, with the above shard key:
a geospatial query that also includes customer name will be targeted at the relevant shard(s).
a geospatial query that doesn't include customer name will be broadcast to all shards (a 'scatter/gather' query).
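A sketch of both cases, continuing with the hypothetical "mydb.cases" collection from above (the "location" field and the coordinates are placeholders):
db.cases.createIndex({ location: "2dsphere" })  // geospatial index on the sharded collection

// Includes the customer name (a shard key prefix), so this geo query is targeted:
db.cases.find({
  name: "Acme Corp",
  location: { $geoWithin: { $centerSphere: [ [ -73.98, 40.75 ], 5 / 6378.1 ] } }
})

// The same query without "name" would be broadcast to all shards (scatter/gather).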
Additional documentation on shard key considerations can be found here.
For my application I need to shard a fairly big collection; the entire collection will contain approx. 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserting, either Sharding Key will distribute documents evenly throughout the cluster, so it does not matter which field I use as Sharding Key.
For queries it is different.
Field(1) is usually part of the query filter condition, so the query would usually be processed on a single shard only.
Field(2) is typically not part of the query filter condition, so the query would be processed over all shards and typically several shards will contribute to the final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in the MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality figures, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically fewer than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserting, either Sharding Key will distribute documents evenly throughout the cluster, so it does not matter which field I use as Sharding Key.
When you insert a document, its shard key value determines the designated shard on which the document is stored.
Field(1) is usually part of the query filter condition, so the query would usually be processed on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation (a long-running query). It is important that the most frequent query operations of the application using the sharded collection are able to use the shard key.
Field(2) is typically not part of the query filter condition, so the query would be processed over all shards and typically several shards will contribute to the final query result.
When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.
Which one is the better field to be used as Sharding Key?
It can be concluded that Field(1) should be used as the shard key.
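For illustration, a minimal mongo shell sketch, using the placeholder namespace "mydb.bigcoll" with "field1"/"field2" standing in for Field(1)/Field(2) (all names and values here are assumptions):
sh.enableSharding("mydb")
sh.shardCollection("mydb.bigcoll", { field1: 1 })  // Field(1) as the shard key

db.bigcoll.find({ field1: "abc123" })  // targeted: mongos routes this to a single shard
db.bigcoll.find({ field2: "xyz789" })  // scatter-gather: broadcast to every shard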
See the MongoDB documentation on shard keys and on choosing a shard key.
I have a collection with 170 million+ documents and it is only going to increase. The size of the collection is not that huge, currently around 70 GB.
The collection has two fields indexed on: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to a particular AgentId, but the PropertyId (non-numeric, nullable) is mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}. But I am planning to change the shard key to the compound index {AgentId:1, PropertyId:1} because I think it will improve query performance (most of the queries are based on an AgentId filter). I am not sure whether one can have a nullable field in the shard key. If that is an issue, the app will make sure that the PropertyId is a random number.
So I am looking to get a picture as to:
How will the data be distributed to shards during insertion, and how are the ranges of the chunks calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option (sketched in shell commands after the list below) is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
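A rough sketch of those steps, using the hypothetical namespace "mydb.properties" with the target shard key {AgentId: 1, PropertyId: 1}; the namespace, split point and file paths are placeholders, and sharding is assumed to be already enabled for the database:
// 1. Dump the data (from the OS shell):
//      mongodump --db mydb --collection properties --out /backup
// 2. Drop the original sharded collection (mongo shell):
db.properties.drop()
// 3. Configure sharding on the better key and pre-split the range:
sh.shardCollection("mydb.properties", { AgentId: 1, PropertyId: 1 })
sh.splitAt("mydb.properties", { AgentId: "agent-5000", PropertyId: MinKey })
// 4. Restore the dump (from the OS shell):
//      mongorestore --db mydb --collection properties /backup/mydb/properties.bson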
I'm working on a multi-tenant application running on MongoDB. Each tenant can create multiple applications. The schemas for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash-based shard key with constantly changing input (an ObjectID can generally be considered unique for each record), then you will get no locality of data on the shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That is basically the trade-off with this kind of approach, and the same is true of the built-in hash-based sharding; those trade-offs do not change just because the hash is constructed manually from two fields.
Basically, because MongoDB uses range-based chunks to split up the data for a given shard key, you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, the data within a single sequential range will be essentially random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you wanted applications grouped together in ranges, and hence more likely to be on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Sharding on something like the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
Though the ObjectID is monotonically increasing over time (more discussion on that here), if there are a decent number of application IDs and you are going to be doing any range-based or targeted queries on the ObjectID, then it might still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you ideally want the shard key to satisfy it if at all possible. It has to be indexed, and it has to be usable by the mongos to route the query (if not, then it is scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
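A minimal sketch of that setup, assuming a hypothetical "mydb.events" namespace (the database, collection and literal values below are placeholders):
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { app_id: 1, _id: 1 })

// Targeted, because the filter covers the leftmost shard key field plus a range on _id:
db.events.find({ app_id: "app-123", _id: { $gte: ObjectId("5e8f8f8f8f8f8f8f8f8f8f8f") } })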
If you go with the manual hashed key approach not only will you have a random distribution, but unless you are going to be querying on that hash it's not going to be very useful.
I am building a tracking platform which has the following use cases.
Need to track 50,000 vehicles
Each vehicle relays its location every 60 secs.
A GET API which returns all the vehicles within an X km range.
So, I need to scale writes and also achieve query isolation.
I can create a sharded cluster with the geographical region as the shard key (geohash). This will help me balance the writes and also achieve query isolation. But what happens when a vehicle moves across regions? Does MongoDB automatically move the document to the new shard in this case?
You cannot change the shard key fields for a record once written. Using the region as the shard key would prevent you from moving a record across regions unless you delete the record in the original region and then insert it using the new one.
On choosing a shard key, look for one which matches your most common query pattern. Querying on the shard key will allow you to retrieve a record directly from a shard. Queries which don't use the shard key will have to perform a scatter gather query against all shards.
If you are on or can use MongoDB 2.4, and you don't need to perform range-based queries, you may want to consider using a hashed shard key, which will allow for even distribution even if your shard key is monotonically increasing. See this page for advice on choosing a shard key.
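A sketch of that hashed approach for the tracking workload, assuming a hypothetical "tracking.positions" collection in which each document has a "vehicleId" and a GeoJSON "loc" field (all names and coordinates are placeholders):
sh.enableSharding("tracking")
sh.shardCollection("tracking.positions", { vehicleId: "hashed" })  // evenly distributes writes
db.positions.createIndex({ loc: "2dsphere" })                      // geo queries still supported

// "All vehicles within X km" does not include the shard key,
// so it runs as a scatter/gather query across all shards:
db.positions.find({ loc: { $geoWithin: { $centerSphere: [ [ 77.59, 12.97 ], 10 / 6378.1 ] } } })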
From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is, is that correct? Shouldn't it be the opposite for better writes?
Also from the book:
This pattern continues: everything will always be added to the “last” chunk, meaning everything will be added to one shard. This shard key gives you a single, undistributable hot spot.
So say that my app always searches by user_id and the last entries in the collection.
What is the best shard key I should have? This:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So say that my app always searches by user_id and the last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.
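A minimal sketch of that pattern, assuming a hypothetical "mydb.entries" namespace and a date-based (or default ObjectId) _id:
sh.enableSharding("mydb")
sh.shardCollection("mydb.entries", { user_id: 1, _id: 1 })

// Targeted to the shard(s) holding this user's range; the { user_id: 1, _id: 1 }
// shard key index also supports the sort, so the "last entries" come from one shard:
db.entries.find({ user_id: "user-42" }).sort({ _id: -1 }).limit(20)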