MongoDB Shard Key vs Query Index - mongodb

I have set up my first mongodb sharded cluster and am finally at the stage where I create a db/collection and choose the shard key. I’ve read about how to choose an appropriate shard key and am likely going with a hashed index but I might be having some conceptual misunderstandings.
My documents are super simple and contain a document id (some natural number), a document version id (a natural number), and a string of the raw text itself. If I understand correctly from the documentation, I can choose to shard on the document id but this can lead to jumbo shards since the document id will be incremented and new documents will be added to the same shard. And so I could set the shard key as a hashed value of the document id.
My question is whether or not I can still continue to query by the document id? My brain is making me doubt this and making me think that the indexing of the documents is over the hashed shard key and not over the document id. I am hoping that the hashed shard key is used strictly for sharding and that I can set any key (i.e., document id) to be indexed. Is this correct?

Yes, you can still query by the value of the shard key.
If you are referring to _id, that will be automatically indexed with its natural value; otherwise you could explicitly create an index on the document id that is not hashed, in addition to the shard key index.
As long as you test for equality against a single value or an explicit list of values, the query should be handled by the minimum number of shards.
However, if you use a ranged test such as $gte, the query will have to be forwarded to every shard to be processed.
Using the hashed document id as the shard key will result in the creation of an index for the hashed value in addition to any other indexes.
There is a pretty good description of hashed sharding in the documentation.
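To make the routing difference concrete, here is a minimal Python sketch of how a router could pick shards under a hashed shard key. It is a toy model, not MongoDB's implementation: NUM_SHARDS, toy_hash and the equal-width split of the hash space are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def toy_hash(doc_id: int) -> int:
    """Stand-in for the server-side hash: first 8 bytes of md5 as an integer."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(doc_id: int) -> int:
    """Map a document id to a shard by cutting the 64-bit hash space into equal ranges."""
    return toy_hash(doc_id) * NUM_SHARDS // 2**64


def shards_for_equality(doc_ids):
    """Equality or explicit-list match: hash each value, visit only the owning shards."""
    return {shard_for(d) for d in doc_ids}


def shards_for_range(low: int, high: int):
    """Range match: hashing destroys ordering, so every shard may hold a match."""
    return set(range(NUM_SHARDS))


print(shards_for_equality([42]))       # a single shard
print(shards_for_equality([1, 2, 3]))  # at most three shards
print(shards_for_range(10, 20))        # all shards
```

In this model an equality lookup on the document id hashes the value and goes straight to one shard, while a range such as "id >= 10" cannot be mapped to a contiguous slice of the hash space and is sent everywhere.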

Related

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 millions+ documents and it is only going
to increase. The size of the collection is not that huge, currently
around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to
a particular AgentId, but the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}. But
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If one cannot, the app
will make sure that the PropertyId is always a random number.
So I am looking to get a picture of:
How the data will be distributed to shards during insertion,
and how the ranges of the chunks are calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to mongodb and wanted to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
1. Dump all data from MongoDB into an external format.
2. Drop the original sharded collection.
3. Configure sharding using a more ideal shard key.
4. Pre-split the shard key range to ensure initial even distribution.
5. Restore the dumped data into MongoDB.
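The pre-split step comes down to choosing chunk boundaries before any data is loaded. As a rough sketch, assuming an integer shard key whose overall range is known in advance, evenly spaced boundaries could be computed as below. The function name split_points and the example range are hypothetical; in a real cluster each boundary would then be handed to the cluster's split command.

```python
from typing import List


def split_points(low: int, high: int, num_chunks: int) -> List[int]:
    """Return num_chunks - 1 boundaries that cut [low, high) into near-equal ranges."""
    if num_chunks < 1 or high <= low:
        raise ValueError("need num_chunks >= 1 and high > low")
    width = high - low
    return [low + width * i // num_chunks for i in range(1, num_chunks)]


# Example: four chunks over keys 0..99 need three boundaries.
print(split_points(0, 100, 4))  # [25, 50, 75]
```

Evenly spaced boundaries only give an even distribution if the key values themselves are spread evenly over the range; for skewed data the boundaries would have to be taken from the dumped data instead.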

MongoDB - shard collection by hashed index on custom _id field

Problem: How to shard collection by hashed index on custom _id field?
Problem description:
I need to store pairs url => my_value in MongoDB
Url must be unique
I will execute a lot of queries to check if I already have a document with such a url by matching {_id : md5(url_to_check)}
Collection will be huge (billions of pairs url => my_value), so I want to shard it by url.
Solution I am considering:
Create collection with such fields:
_id : md5(url)
url : url
value : my_value
I don't create any index. _id is indexed by default by mongo.
Questions:
I would like to shard the collection by _id. A hashed shard key would be perfect, but do I have to create a hashed shard key or can I just shard by the regular _id key? I insert an md5 I have already computed myself into _id.
What do you think about storing the non-hashed url in _id and querying by it? I would use less space (I don't have to store md5(url)), but sharding would be on a bigger text field and the index would be on a bigger string (a url usually has more than 32 characters).
What is the best solution to this problem? Best for me means fast queries and using as little space for indexes as possible.
I would like to shard the collection by _id. A hashed shard key would be perfect, but do I have to create a hashed shard key or can I just shard by the regular _id key? I insert an md5 I have already computed myself into _id.
A hashed shard key is intended to be used with fields that increase monotonically (like ObjectId() values or timestamps) in order to provide more uniform distribution of write load across your shards. If you've already hashed your _id values (or a field you want to shard on) you can use this as your shard key instead of requesting the server to calculate this for you.
FYI, MongoDB (as at 2.6) uses md5 to compute a hashed shard key, so effectively you are doing the same work in your application code already and making more effective use of the _id index. With your use case of a pre-hashed _id value you only need a single _id index as compared to two indexes (the default index of {_id:1} plus an extra hashed index {_id:hashed}).
What do you think about storing the non-hashed url in _id and querying by it? I would use less space (I don't have to store md5(url)), but sharding would be on a bigger text field and the index would be on a bigger string (a url usually has more than 32 characters).
If index size is a concern, the smaller precomputed values will definitely save you space in the _id index (especially if you are storing billions of urls and only want to find documents by the md5 hash).
What is the best solution to this problem? Best for me means fast queries and using as little space for indexes as possible.
Best solution is highly subjective, but it seems like this is a reasonable solution given what you've shared of your use case.
It's worth noting that any hashed namespace can potentially have collisions, so you may want to consider the collision resistance of your hash algorithm relative to the namespace. Although collisions should be extremely unlikely, with the hash value as your _id you will only store the first url observed for any hash collisions (or have to add something less efficient, like a comparison of the document url vs original url you were expecting).
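A small Python sketch of the pre-hashed _id approach with the url comparison mentioned above. A plain dict stands in for the collection, and the helper names url_id and insert_url are hypothetical; with a real driver the insert would go to the database and a duplicate _id would surface as a duplicate key error instead.

```python
import hashlib


def url_id(url: str) -> str:
    """Hex md5 of the url, used as the document _id (always 32 characters)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def insert_url(store: dict, url: str, value) -> str:
    """Insert into a dict standing in for the collection.

    Returns "inserted", "exists" (the same url is already stored) or
    "collision" (a different url already occupies this _id).
    """
    _id = url_id(url)
    existing = store.get(_id)
    if existing is None:
        store[_id] = {"_id": _id, "url": url, "value": value}
        return "inserted"
    return "exists" if existing["url"] == url else "collision"


store = {}
print(insert_url(store, "http://example.com/", 1))  # inserted
print(insert_url(store, "http://example.com/", 2))  # exists
```

Keeping the original url in the document is what makes the collision check possible; dropping that field saves space but leaves no way to tell "exists" from "collision".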

MongoDB- Compound shard key using three values

I am creating a collection which stores JSON objects using MongoDB. I am stuck on the sharding part.
I have a Case ID, Customer ID and Location for each record in the collection.
The Case ID is a 10-digit number (digits only, no letters).
The CustomerID is a combination of customer name and case ID.
The location is a 2dsphere value and I am expecting many distinct location values.
In addition to this I have a customer name and case description in the record.
All my search queries have search criteria of either Case ID, CustomerID or location.
Given this scenario, can I create a compound key based on all three of these values (CaseID, CustomerID and location)? I believe this gives high cardinality and makes the records easy to retrieve.
Could anyone please tell me if this is a good approach, as I am not finding any example of a compound shard key comprising three values.
Thanks for your time and let me know if you need any more information.
The first thing to consider is whether it's necessary to shard. If your data set fits on a single server, then start out with an unsharded deployment. It's easy and seamless to convert this to a sharded cluster later on if needed.
Assuming you do indeed need to shard, your choice of shard key should be based on the following criteria:
Cardinality - choose a shard key that is not limited to a small number of possible values, so that MongoDB can evenly distribute data among the shards in your cluster.
Write distribution - choose a shard key that evenly distributes write operations among shards in the cluster, to prevent any single shard from becoming a bottleneck.
Query isolation - choose a shard key that is included in your most frequent queries, so that those queries may be efficiently routed to a single target shard that holds the data, as opposed to being broadcast to all shards.
You mention that all your queries contain either Case ID, Customer ID or location, but haven't described your use cases. By way of an example let's suppose your most frequent queries are to:
retrieve a customer case
retrieve all cases for a given customer
In such a case, a good shard key candidate would be a compound shard key on (name, caseID) in that order (and a corresponding compound index). Consider whether this satisfies the above criteria:
Cardinality - each document has a different value for the shard key so cardinality is excellent.
Write distribution - cases for all customers are distributed across all shards.
Query isolation:
To retrieve a specific case, name and caseID should be included in the query. This query will be routed to the specific shard that holds the document.
To retrieve all cases for a given customer, include name in the query. This query therefore includes a prefix of the shard key so will also be efficiently routed only to the specific shard(s) that hold documents that match the query.
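The routing behaviour described above can be sketched in Python with a toy chunk table on the compound key (name, caseID). The boundaries, the shard names and the route function are all made up for illustration; a real router works from the cluster's chunk metadata.

```python
import bisect

# Hypothetical chunk boundaries on the compound key (name, caseID).
# Chunk i covers keys from BOUNDS[i-1] (inclusive) up to BOUNDS[i] (exclusive).
BOUNDS = [("f", 0), ("m", 0), ("s", 0)]
CHUNK_TO_SHARD = ["shardA", "shardB", "shardC", "shardA"]  # hypothetical placement


def chunk_of(key):
    """Index of the chunk whose range contains the given (name, caseID) key."""
    return bisect.bisect_right(BOUNDS, key)


def route(name=None, case_id=None):
    """Return the set of shards a query must visit."""
    if name is None:
        # No shard key prefix in the query: broadcast to every shard.
        return set(CHUNK_TO_SHARD)
    if case_id is not None:
        # Full shard key: exactly one chunk, hence one shard.
        return {CHUNK_TO_SHARD[chunk_of((name, case_id))]}
    # Prefix only: every chunk that overlaps the keys (name, anything).
    first = chunk_of((name, float("-inf")))
    last = chunk_of((name, float("inf")))
    return {CHUNK_TO_SHARD[i] for i in range(first, last + 1)}


print(route("alice", 1234567890))  # one shard
print(route("m"))                  # only the shards holding customer "m"
print(route(case_id=1234567890))   # caseID alone is not a prefix: all shards
```

In this toy table customer "m" straddles a chunk boundary, so a prefix-only query for "m" touches two shards rather than one, which is still cheaper than a broadcast.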
Note that you cannot use a geospatial index as part of a shard key index (as documented here). However, you can still create and use a geospatial index on a sharded collection if using some other fields for the shard key. So for example, with the above shard key:
a geospatial query that also includes customer name will be targeted at the relevant shard(s).
a geospatial query that doesn't include customer name will be broadcast to all shards (a 'scatter/gather' query).
Additional documentation on shard key considerations can be found here.

Does MongoDB ensure unique _id field values when using a compound shard key with _id

I want to start sharding. As you know, the shard key is very important. I found that MongoDB doesn't ensure unique _id field values when using a shard key other than _id.
In our collections, username should be the shard key. If I create a compound shard key and use _id as the second part of the shard key, does MongoDB guarantee uniqueness of _id?
MongoDB does not ensure unique _id values across shards when _id is only one part of a compound shard key.
The documentation states :
MongoDB can enforce uniqueness for the shard key. For compound shard
keys, MongoDB will enforce uniqueness on the entire key combination,
and not for a specific component of the shard key.
So if you want mongo to enforce uniqueness of the email, then simply use the email as the shard key.
An email address has some randomness, which is good (_id has some predictability built in), but I suggest you use the email field as a hashed shard key.
MongoDB ensures uniqueness of the combination of all fields in a compound key. But I'm not sure if this is what you are looking for. What happens if you have these documents? {_id:12345,email:"test@test.com"} and {_id:4567,email:"test@test.com"}. The shard key is unique, but you have the same email twice.
If you are looking for a unique email across all shards, you can try to create a proxy collection with a unique index.
You have more information here
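The difference between uniqueness on the whole compound key and uniqueness on one field can be shown with a toy model in Python. UniqueIndex is a hypothetical in-memory stand-in for a unique index, and the field values are invented; the "proxy" instance plays the role of a separate collection that holds only the field that must be globally unique.

```python
class UniqueIndex:
    """Toy unique index: rejects a second entry with the same key tuple."""

    def __init__(self, *fields):
        self.fields = fields
        self.seen = set()

    def insert(self, doc) -> bool:
        """Return True if the document was accepted, False on a duplicate key."""
        key = tuple(doc[f] for f in self.fields)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


compound = UniqueIndex("email", "_id")  # uniqueness on the combination only
proxy = UniqueIndex("email")            # proxy collection keyed on email alone

a = {"_id": 12345, "email": "a@example.com"}
b = {"_id": 4567, "email": "a@example.com"}

print(compound.insert(a), compound.insert(b))  # True True: duplicate email slips through
print(proxy.insert(a), proxy.insert(b))        # True False: duplicate email rejected
```

With the proxy approach the application would insert into the proxy first and write the real document only if that insert succeeds.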

Mongodb choose shard key

I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
_id : ObjectId,
conversationId: ObjectId,
created: DateTime
}
All queries will be done using the conversationId property and sorted by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId, which won't scale very well to many inserts).
Sharding by conversationId would be a logical choice in terms of query isolation, but I'm afraid that it won't scale very well to many inserts (even if I use a hashed shard key on conversationId, or if I change the type of the property from ObjectId to some other type which isn't incremental, like GUID) because some conversations might be much more active than others (i.e. have many more messages added to them).
From what I see in the mongo documentation: "The shard key is either an indexed field or an indexed compound field that exists in every document in the collection."
Does this mean that I can create a shard key on a compound index?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, none of the fields looks like a good choice for a ranged shard key. For instance, if you shard on conversationId, it will result in hot spotting, i.e. most of your inserts will go to the last shard, as conversationId increases monotonically over time. The other two fields have the same problem.
Also, conversationId will not offer a high degree of isolation, because conversationId increases monotonically over time and newer conversations are updated much more frequently than very old ones, so the active conversations sit close together in the key range.
In your case, a hashed shard key (version 2.4 onwards) over conversationId would be the smart choice, as one would imagine that there can be tons of conversations going on in parallel.
Refer to the following link for details on creating a hashed shard key: http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/
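The hot-spotting argument can be checked with a short Python simulation. Plain integers stand in for monotonically increasing ObjectId values, and NUM_SHARDS, BOUNDARIES and the md5-based hashed_shard function are assumptions for illustration, not MongoDB's actual hashing or chunk layout.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def ranged_shard(key: int, boundaries) -> int:
    """Ranged sharding: the shard index is the number of boundaries at or below the key."""
    return sum(1 for b in boundaries if key >= b)


def hashed_shard(key: int) -> int:
    """Toy hashed sharding: cut the 64-bit hash space into NUM_SHARDS equal ranges."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") * NUM_SHARDS // 2**64


def distribution(keys, shard_fn):
    """Count how many of the given keys land on each shard."""
    counts = Counter(shard_fn(k) for k in keys)
    return [counts.get(s, 0) for s in range(NUM_SHARDS)]


# Boundaries chosen when the collection only held keys 0..999.
BOUNDARIES = [250, 500, 750]
new_keys = range(1000, 2000)  # newly inserted, ever-increasing keys

ranged = distribution(new_keys, lambda k: ranged_shard(k, BOUNDARIES))
hashed = distribution(new_keys, hashed_shard)
print("ranged:", ranged)  # every new key lands on the last shard
print("hashed:", hashed)  # new keys are spread over all shards
```

The simulation only covers insert distribution. Messages of one conversation still share a single hashed value, so a very active conversation keeps writing to one shard even with the hashed key.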