I am building a tracking platform which has the following use cases.
Need to track 50,000 vehicles
Each vehicle relays its location every 60 secs.
A GET API which returns all the vehicles within a range of X km.
So I need to scale writes and also achieve query isolation.
I can create a sharded cluster with the geographical region (geohash) as the shard key. This will help me balance the writes and also achieve query isolation. But what happens when a vehicle moves across regions? Does MongoDB automatically move the document to the new shard in that case?
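In MongoDB versions before 4.2, you cannot change the value of a shard key field once a document has been written. With the region as the shard key, moving a vehicle across regions means deleting the document in the original region and then inserting it again under the new one. (From 4.2 onward, an update that changes the shard key value is allowed, and the document is moved to the matching shard.)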
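The delete-then-insert approach described above can be sketched in plain Python. This is only an in-memory illustration, not MongoDB code: `region_of` is a hypothetical coarse grid used in place of a real geohash, and `shards` is a dict standing in for the cluster.

```python
# In-memory stand-in for a collection sharded by region (illustrative only).
# region_of() is a hypothetical coarse grid, used here instead of a real geohash.

def region_of(lat, lon, cell_deg=1.0):
    """Return a coarse grid cell id for a coordinate."""
    return (int(lat // cell_deg), int(lon // cell_deg))

shards = {}  # region id -> {vehicle_id: document}

def upsert_location(vehicle_id, lat, lon, last_region=None):
    """Store the latest position; delete from the old region first if it changed."""
    new_region = region_of(lat, lon)
    if last_region is not None and last_region != new_region:
        # The shard key value would change, so remove the old document first.
        shards.get(last_region, {}).pop(vehicle_id, None)
    # Insert (or overwrite) the document under the new region.
    shards.setdefault(new_region, {})[vehicle_id] = {
        "vehicle_id": vehicle_id, "region": new_region, "lat": lat, "lon": lon,
    }
    return new_region

r1 = upsert_location("v1", 12.3, 77.5)
r2 = upsert_location("v1", 13.1, 77.6, last_region=r1)  # crosses into a new cell
```

The caller has to remember the vehicle's previous region, which is one of the costs of using a value that changes over time as the shard key.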
On choosing a shard key, look for one which matches your most common query pattern. Querying on the shard key will allow you to retrieve a record directly from a shard. Queries which don't use the shard key will have to perform a scatter gather query against all shards.
If you are on, or can use, MongoDB 2.4 and you don't need to perform range-based queries on the shard key, you may want to consider a hashed shard key, which allows for even distribution even if the shard key value is monotonically increasing. See this page for advice on choosing a shard key.
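To see why hashing helps with a monotonically increasing key, here is a small simulation. It is a sketch under assumptions: four shards, an integer key, and MD5 plus modulo as a stand-in hash; MongoDB's real chunk mapping works differently in detail.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def range_shard(key, max_key):
    """Range partitioning: split [0, max_key) into NUM_SHARDS equal contiguous ranges."""
    return min(key * NUM_SHARDS // max_key, NUM_SHARDS - 1)

def hashed_shard(key):
    """Hash partitioning: hash the key, then map the hash to a shard.
    MD5 + modulo is an illustrative stand-in, not MongoDB's actual mapping."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Monotonically increasing keys: the most recent 1,000 inserts in a key space of 100,000.
recent_keys = range(99_000, 100_000)

by_range = Counter(range_shard(k, 100_000) for k in recent_keys)
by_hash = Counter(hashed_shard(k) for k in recent_keys)

print("range:", dict(by_range))  # all recent writes land in the last range
print("hash: ", dict(by_hash))   # recent writes are spread over the shards
```

With range partitioning every new insert hits the shard that owns the top of the key range, while the hashed variant spreads the same inserts across all shards, at the price of losing efficient range queries on the key.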
For my application I need to shard a fairly big collection; the entire collection will contain approximately 500 billion documents.
I have two potential fields which can be used as Sharding Key:
For inserts, either sharding key will distribute documents evenly throughout the cluster, so it does not matter which field I use as the sharding key.
For query it is different.
Field(1) is usually part of the query filter condition, thus query would be processed usually on a single shard only.
Field(2) is typically not part of the query filter condition, thus query would be processed over all shards and typically several shards will contribute to final query result.
Which one is the better field to be used as Sharding Key? I did not find anything in MongoDB documentation about that topic.
Both fields have the same range and very similar cardinality figures, so there won't be any difference there. Usually the number of documents returned by a query is very low (typically fewer than 20-30 documents).
In a sharded cluster the mongos router determines which shard is to be targeted for a read or write operation - based on the available shard key meta-data stored on the config servers.
For inserts, either sharding key will distribute documents evenly
throughout the cluster, so it does not matter which field I use as
the sharding key.
When you insert a document it will have a shard key and the document will be stored on a designated shard.
Field(1) is usually part of the query filter condition, thus query
would be processed usually on a single shard only.
The shard key's main purposes are (a) to distribute data evenly across shards in a cluster, and (b) to be able to query the data in such a way that the query targets a single shard.
For a query to target a single shard, the shard key must be part of the query's filter criteria. The mongos router will target the single shard using the shard key.
If the shard key is not part of the filter criteria, the query becomes a scatter-gather operation, which tends to run longer. The most important query operations of the application using the sharded collection must be able to use the shard key.
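The routing decision can be illustrated with a toy router. This is a minimal sketch, not the mongos implementation: the chunk map, shard names, and the field names `field1` and `field2` are made up, and it only handles equality filters on a non-negative integer shard key.

```python
import bisect

# Hypothetical chunk map, as a router might cache it from the config servers:
# chunk i covers shard key values in [lower_bounds[i], lower_bounds[i + 1]).
lower_bounds = [0, 1000, 2000, 3000]
chunk_owner = ["shardA", "shardB", "shardC", "shardA"]
all_shards = sorted(set(chunk_owner))

def shards_for_query(query, shard_key="field1"):
    """Return the list of shards a query has to visit."""
    if shard_key in query:
        # Targeted: find the chunk whose range contains the key value.
        idx = max(bisect.bisect_right(lower_bounds, query[shard_key]) - 1, 0)
        return [chunk_owner[idx]]
    # No shard key in the filter: every shard has to be asked (scatter-gather).
    return all_shards

print(shards_for_query({"field1": 1500}))  # one shard
print(shards_for_query({"field2": 7}))     # every shard
```

A filter on `field1` resolves to the single shard owning the matching chunk, while a filter on any other field falls through to the full shard list.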
Field(2) is typically not part of the query filter condition, thus
query would be processed over all shards and typically several shards
will contribute to final query result.
When the shard key is not part of the query filter, the operation will span across multiple shards (a scatter-gather operation) and it will be a slow running operation. The mongos router will not be able to determine which shards have the target data, and all the shards in the cluster will be queried to return the final result.
Which one is the better field to be used as Sharding Key?
It can be concluded that Field(1) should be used as the shard key.
See the MongoDB documentation on shard keys and on choosing a shard key.
I have a collection with 170 million+ documents and it is only going
to increase. The size of the collection is not that huge, currently
around 70 GB.
The collection has an index on two fields: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to
a particular AgentId, but the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}. But
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If that is a problem, the app
will make sure that the PropertyId is a random number.
So I am looking to get a picture of:
How the data will be distributed to shards during insertion,
and how the ranges of chunks are calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am new to MongoDB and wanted to know if I am on the right path.
Thanks
Older MongoDB versions have no automatic support for changing a shard key after sharding a collection (refineCollectionShardKey arrived in 4.4 and reshardCollection in 5.0).
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
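For the pre-split step, the split points have to be chosen before any data is restored. The following is a rough sketch of how evenly spaced boundaries could be computed, assuming an integer shard key with a known value range; the function names and shard names are hypothetical.

```python
def split_points(min_key, max_key, num_chunks):
    """Return num_chunks - 1 evenly spaced boundaries inside [min_key, max_key)."""
    if num_chunks < 1 or max_key <= min_key:
        raise ValueError("need num_chunks >= 1 and max_key > min_key")
    width = max_key - min_key
    return [min_key + width * i // num_chunks for i in range(1, num_chunks)]

def assign_chunks(points, shard_names):
    """Pair each chunk's lower bound with a shard, round-robin."""
    lowers = [None] + points  # None stands for the open lower end of the key range
    return [(low, shard_names[i % len(shard_names)]) for i, low in enumerate(lowers)]

points = split_points(0, 1000, 4)
layout = assign_chunks(points, ["s0", "s1"])
print(points)
print(layout)
```

On a real cluster each boundary would then be applied with a split and each chunk placed with a chunk move (for example via the shell helpers `sh.splitAt` and `sh.moveChunk`) before the restore starts. For a key that is not a uniformly distributed integer, the boundaries would need to come from a sample of the dumped data instead.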
I have a scenario in which I don't know in advance what the structure and fields of the collections in MongoDB will be. There will also be a separate database per user (a multi-tenant setup).
I'll be deploying a replicated sharded cluster in production. For scaling and better machine utilization, I'm applying sharding on a per-database basis when each database is created, and each collection under the same database will be sharded across different shards. In this scenario I'm not sure which key would be the best choice, since the structure and fields of the collections created under each database are unknown. Because the structure of the databases and collections is unknown, I can't forecast which type of query will be used most of the time. So I want to select a shard key which fulfills all the criteria for shard key selection, such as: cardinality, query isolation, monotonically increasing values, write scaling, and being easily divisible.
What would be the solution in this scenario?
Also, what if I select all the fields of the collection for the shard key, along with a hashed _id field, as a compound key?
Once you create a shard key you cannot edit it.
So keep pumping the data into the collection, once you get clarity on the fields you can shard the collections any time.
Rebalancing happens automatically after sharding.
We are using Mongo to host a multi-tenant application. Each tenant is going to have their own database. To get around resource utilization issues the approach that we are taking is to shard by database (as opposed to by collection - if that is the correct term to use).
This means for every x tenants we will create a new 3-node replica set. So we may have for example 1000 tenants on 1 shard and another 1000 tenants on another shard.
My question is regarding the placement of the databases for new signups. The approach we were going to take was to flag a shard as being the 'active' shard and creating all new tenants on that shard. When it reaches capacity, create a new shard, flag that as the active shard and continue on.
Can you choose which shard you create a new database on in Mongo directly? If left to Mongo, from what I understand, it will do it in round-robin fashion when there is more than one shard, which may leave our shards imbalanced.
Is this the right approach or is there an alternative better approach?
You can use shard tags to force some collections to reside only on specific shards. So you could, for example, tag each shard with its serial number, and tag the collections/databases you want to have on that shard with that tag, until it runs full at which point you create a new shard, increase the counter and use that for new data.
Another option then is to not enable sharding on the individual databases at all, and use the movePrimary command to force a specific shard to act as the primary shard for a specific database. Since the database won't be sharded, all its data will remain on its designated primary shard, which is exactly what you want.
That being said, it seems to me like this approach conflicts with the very concept of sharding, which is meant to evenly distribute data across multiple machines automatically.
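The "active shard" idea from the question, combined with the movePrimary option from the answer, could look roughly like this. It is a sketch only: the shard names, tenant counts, and helper functions are hypothetical, and the code only builds the command document rather than sending it to a cluster.

```python
def pick_active_shard(tenant_counts, capacity):
    """Return the first shard (in insertion order) with room, or None if all are full."""
    for shard, count in tenant_counts.items():
        if count < capacity:
            return shard
    return None

def move_primary_command(db_name, shard_name):
    """Build the admin command document that makes shard_name the primary shard of db_name."""
    return {"movePrimary": db_name, "to": shard_name}

# Hypothetical shard names and current tenant counts.
tenant_counts = {"shard0001": 1000, "shard0002": 412}

target = pick_active_shard(tenant_counts, capacity=1000)
cmd = move_primary_command("tenant_4711", target)
print(cmd)
```

With a driver such as pymongo, a document like `cmd` would typically be run against the admin database through a mongos, for example with `client.admin.command(cmd)`. When `pick_active_shard` returns None, that is the signal to add a new shard before creating more tenants.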
I am creating a collection which stores JSON object using MongoDB. I am stuck in Sharding part.
I have a Case ID, a Customer ID and a Location for each record in the collection.
The Case ID is a 10 digit number (only number and no alphabets).
The CustomerID is a combination of customer name and case ID.
The location is a 2dsphere value and I am expecting a location of different distinct values.
In addition to this I have customer name and case description to the record.
All my search queries have search criteria of either Case ID, CustomerID or location.
Given this scenario, can I create a compound key based on all three of these values (CaseID, CustomerID and location)? I believe this gives high cardinality and makes it easy to retrieve the records.
Could anyone please tell me whether this is a good approach, as I have not found an example of a compound shard key comprising three values.
Thanks for your time and let me know if you need any information
The first thing to consider is whether it's necessary to shard. If your data set fits on a single server, then start out with an unsharded deployment. It's easy and seamless to convert this to a sharded cluster later on if needed.
Assuming you do indeed need to shard, your choice of shard key should be based on the following criteria:
Cardinality - choose a shard key that is not limited to a small number of possible values, so that MongoDB can evenly distribute data among the shards in your cluster.
Write distribution - choose a shard key that evenly distributes write operations among shards in the cluster, to prevent any single shard from becoming a bottleneck.
Query isolation - choose a shard key that is included in your most frequent queries, so that those queries may be efficiently routed to a single target shard that holds the data, as opposed to being broadcast to all shards.
You mention that all your queries contain either Case ID, Customer ID or location, but haven't described your use cases. By way of an example let's suppose your most frequent queries are to:
retrieve a customer case
retrieve all cases for a given customer
In such a case, a good shard key candidate would be a compound shard key on (name, caseID), in that order (with a corresponding compound index). Consider whether this satisfies the above criteria:
Cardinality - each document has a different value for the shard key so cardinality is excellent.
Write distribution - cases for all customers are distributed across all shards.
Query isolation:
To retrieve a specific case, name and caseID should be included in the query. This query will be routed to the specific shard that holds the document.
To retrieve all cases for a given customer, include name in the query. This query therefore includes a prefix of the shard key so will also be efficiently routed only to the specific shard(s) that hold documents that match the query.
Note that you cannot use a geospatial index as part of a shard key index (as documented here). However, you can still create and use a geospatial index on a sharded collection if using some other fields for the shard key. So for example, with the above shard key:
a geospatial query that also includes customer name will be targeted at the relevant shard(s).
a geospatial query that doesn't include customer name will be broadcast to all shards (a 'scatter/gather' query).
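The prefix rule behind these routing cases can be written down as a small classifier. This is an illustrative sketch, not MongoDB code: it assumes equality matches on the listed fields and uses the (name, caseID) shard key from the example above.

```python
SHARD_KEY = ("name", "caseID")  # compound shard key from the example above

def routing(filter_fields, shard_key=SHARD_KEY):
    """Classify a query by how much of the shard key prefix its filter covers."""
    covered = 0
    for field in shard_key:
        if field not in filter_fields:
            break  # a prefix must be contiguous from the first shard key field
        covered += 1
    if covered == len(shard_key):
        return "targeted (single shard)"
    if covered > 0:
        return "targeted (shards holding the prefix)"
    return "broadcast (scatter/gather)"

print(routing({"name", "caseID"}))    # full shard key
print(routing({"name", "location"}))  # geospatial query that also includes name
print(routing({"location"}))          # geospatial query without name
print(routing({"caseID"}))            # caseID alone is not a prefix
```

Note the last case: a filter on caseID alone does not include a prefix of the shard key, so it would be broadcast just like the location-only query.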
Additional documentation on shard key considerations can be found here.