MongoDB - shard collection by hashed index on custom _id field

Problem: How to shard a collection by a hashed index on a custom _id field?
Problem description:
I need to store pairs url => my_value in MongoDB.
The url must be unique.
I will execute a lot of queries to check whether I already have a document with a given url, by matching {_id : md5(url_to_check)}.
The collection will be huge (billions of pairs url => my_value), so I want to shard it by url.
Solution I am considering:
Create a collection with these fields:
_id : md5(url)
url : url
value : my_value
I don't create any index; _id is indexed by default by MongoDB.
Questions:
I would like to shard the collection by _id. A hashed shard key would be perfect, but do I have to create a hashed shard key, or can I just shard by the regular _id key? The value I insert into _id is an md5 I have already computed myself.
What do you think about storing the un-hashed url in _id and querying by it? I would use less space (no need to store md5(url)), but sharding would be on a bigger text field and the index would be on a bigger string (a url is usually longer than 32 characters).
What is the best solution to this problem? Best for me means fast queries and using as little space for indexes as possible.

I would like to shard the collection by _id. A hashed shard key would be perfect, but do I have to create a hashed shard key, or can I just shard by the regular _id key? The value I insert into _id is an md5 I have already computed myself.
A hashed shard key is intended to be used with fields that increase monotonically (like ObjectId() values or timestamps) in order to provide more uniform distribution of write load across your shards. If you've already hashed your _id values (or a field you want to shard on) you can use this as your shard key instead of requesting the server to calculate this for you.
FYI, MongoDB (as at 2.6) uses md5 to compute a hashed shard key, so effectively you are doing the same work in your application code already and making more effective use of the _id index. With your use case of a pre-hashed _id value you only need a single _id index as compared to two indexes (the default index of {_id:1} plus an extra hashed index {_id:hashed}).
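As a minimal sketch of the pre-hashed approach, assuming Node.js and its built-in crypto module: the application computes the md5 once and stores it as _id, so the default _id index is the only index involved. The helper names md5Hex and makeDoc, the collection name urls, and the sample values are hypothetical.

```javascript
const crypto = require("crypto");

// Hex md5 of a url: 32 characters, used directly as _id.
function md5Hex(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// Hypothetical helper: build the document shape described in the question.
function makeDoc(url, value) {
  return { _id: md5Hex(url), url: url, value: value };
}

const doc = makeDoc("http://example.com/", 42);
console.log(doc);

// A lookup then matches on the precomputed hash, e.g. in the mongo shell:
//   db.urls.find({ _id: md5Hex("http://example.com/") })
// ("urls" is an assumed collection name.)
```

The raw md5 digest is 16 bytes, while its hex form is 32 characters, so storing the digest as binary data instead of a hex string is one further way to shrink the _id index if space matters.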
What do you think about storing the un-hashed url in _id and querying by it? I would use less space (no need to store md5(url)), but sharding would be on a bigger text field and the index would be on a bigger string (a url is usually longer than 32 characters).
If index size is a concern, the smaller precomputed values will definitely save you space in the _id index (especially if you are storing billions of urls and only want to find documents by the md5 hash).
What is the best solution to this problem? Best for me means fast queries and using as little space for indexes as possible.
Best solution is highly subjective, but it seems like this is a reasonable solution given what you've shared of your use case.
It's worth noting that any hashed namespace can potentially have collisions, so you may want to consider the collision resistance of your hash algorithm relative to the namespace. Although collisions should be extremely unlikely, with the hash value as your _id you will only store the first url observed for any hash collisions (or have to add something less efficient, like a comparison of the document url vs original url you were expecting).
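The url comparison mentioned above could look roughly like this. It is a sketch only: an in-memory Map stands in for the collection, and the function name insertUrl and its return values are made up for illustration.

```javascript
const crypto = require("crypto");

function md5Hex(s) {
  return crypto.createHash("md5").update(s, "utf8").digest("hex");
}

// In-memory stand-in for the collection, keyed by _id (assumption for illustration).
const store = new Map();

// Insert only if the _id is free; returns "inserted", "exists" or "collision".
// The hash function is a parameter so a collision can be forced in a demo.
function insertUrl(url, value, hash = md5Hex) {
  const id = hash(url);
  const existing = store.get(id);
  if (existing === undefined) {
    store.set(id, { _id: id, url: url, value: value });
    return "inserted";
  }
  // Same hash already stored: compare the stored url with the expected one.
  return existing.url === url ? "exists" : "collision";
}

console.log(insertUrl("http://example.com/a", 1));
console.log(insertUrl("http://example.com/a", 1));
```

Without the comparison, a colliding url would be reported as already present, which is the "first url wins" behaviour described above.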

Related

MongoDB Shard Key vs Query Index

I have set up my first mongodb sharded cluster and am finally at the stage where I create a db/collection and choose the shard key. I’ve read about how to choose an appropriate shard key and am likely going with a hashed index but I might be having some conceptual misunderstandings.
My documents are super simple and contain a document id (some natural number), a document version id (a natural number), and a string of the raw text itself. If I understand correctly from the documentation, I can choose to shard on the document id but this can lead to jumbo shards since the document id will be incremented and new documents will be added to the same shard. And so I could set the shard key as a hashed value of the document id.
My question is whether or not I can still continue to query by the document id? My brain is making me doubt this and making me think that the indexing of the documents is over the hashed shard key and not over the document id. I am hoping that the hashed shard key is used strictly for sharding and that I can set any key (i.e., document id) to be indexed. Is this correct?
Yes, you can still query by the value of the shard key.
If you are referring to _id, that will be automatically indexed with its natural value; otherwise you could explicitly create an index on the document id that is not hashed, in addition to the shard key index.
As long as you test for equality to a single or explicit list of values, the query should be handled by the minimum number of shards.
However, if you use a ranged test such as $gte, the query will have to be forwarded to every shard to be processed.
Using the hashed document id as the shard key will result in the creation of an index for the hashed value in addition to any other indexes.
There is a pretty good description of hashed sharding in the documentation
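To see why equality tests can be targeted while range tests cannot, here is a rough simulation in plain JavaScript (Node.js). It is illustrative only: the function shardFor, the shard count of 4, and the modulo placement are assumptions, and MongoDB's real hashing and chunk ranges differ in detail.

```javascript
const crypto = require("crypto");

// Illustrative only: map a document id to one of `shards` buckets via a hash.
function shardFor(docId, shards) {
  const digest = crypto.createHash("md5").update(String(docId)).digest();
  return digest.readUInt32BE(0) % shards;
}

// Equality on a single id resolves to exactly one bucket.
const single = new Set([shardFor(1001, 4)]);

// A range such as {$gte: 1000, $lt: 1100} covers ids whose hashes are
// scattered, so the matching documents are spread over several buckets.
const touched = new Set();
for (let id = 1000; id < 1100; id++) {
  touched.add(shardFor(id, 4));
}

console.log("shards for equality:", single.size);
console.log("shards for range:", touched.size);
```

Neighbouring document ids hash to unrelated values, so a range over the ids gives the router nothing contiguous to target.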

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 million+ documents and it is only going
to increase. The size of the collection is not that huge, currently
around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to
a particular AgentId, but the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}. But
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If not, the app
will make sure that the PropertyId is a random number.
So I am looking to get a picture of:
How the data will be distributed to shards during insertion,
and how the ranges of chunks are calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.

MongoDB - Compound Secondary Index vs Concatenated _id Index

I am designing my database with MongoDb thinking in the scalability in the future. My main concern right now is about representing the indexes, as I have read, it is a crucial factor while scaling huge collections, in terms of RAM consumption, and sharding efficiency.
For simplicity, I have two different collections. A user collection which stores the user username, email, and some metadata, and a devices collection, that contains a device name, some metadata, and should be related with its owner. One user can have millions of devices (so it is not worth to store all in a single user document).
The devices collection should support queries in term of the whole device identifier by (username, device_name), or also by the username.
In this case I see some different approaches for storing the indexes:
1. Use a secondary compound index with username and device_name (in this order)
2. Use a primary index with an _id containing a string of the form username#device_name
3. Use an object in the _id field with both values {owner:username, device:device_name}
To test these indexes, I generated some server load. I created three collections, one per alternative, and filled each with 5M documents. Some data:
1. I do not use the automatically generated _id created by mongo, as all my queries require username/device. So this approach takes some extra space for indexing. The index size is 524MB. It is efficient when querying by user or by user/device.
2. As I am replacing the _id with my own string, the index takes less space, in this case 352MB. I am still able to query efficiently by user (with a regex like /^username#/, explain() reports almost the same results as in alternative 1) and by the exact username/device.
3. The _id index cannot be changed to a compound index, so it is required to create a secondary compound index with {_id.owner, _id.device}. This results in a huge index size of 1059MB! Queries perform as well as in the previous cases.
So, I can discard alternative 3, as this is not so much efficient. Between alternative 1 and 2, I prefer 1 as this approach is more clean, but it uses a _id field I will not use. So at this moment, the winning approach seems to be the number 2, as it allows me query efficiently by username or username/device, and it also takes less index space.
Is there a good reason to not use number 2 and follow with number 1, like when selecting the sharding key? Is there something I am missing? I am new with mongoDB and do not want to have problems when scaling my schema.

Duplicate documents on _id (in mongo)

I have a sharded mongo collection, with over 1.5 mil documents. I use the _id column as a shard key, and the values in this column are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow, I have duplicate documents on the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others still appear.
Do you have any ideas where could they come from, or what should I start looking at?
(Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).
This actually isn't a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.
If the "unique: true" option is not used, the shard key does not have to be unique.
How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
function counter(name) {
    var ret = db.counters.findAndModify({
        query: {_id: name},
        update: {$inc: {next: 1}},
        "new": true,
        upsert: true
    });
    return ret.next;
}
db.users.insert({_id: counter("users"), name: "Sarah C."}) // _id : 1
db.users.insert({_id: counter("users"), name: "Bob D."}) // _id : 2
If you are generating your Ids by reading the most recent record in the document store, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into timing issues.
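The timing issue can be illustrated without a database. The sketch below is plain JavaScript with in-memory stand-ins (the arrays and the names readMaxId and nextId are assumptions for illustration), contrasting read-then-increment with an increment-and-read done in one step.

```javascript
// In-memory stand-ins for two collections (assumption for illustration).
const users = [];               // documents with an integer _id
const counters = { users: 0 };  // one counter per collection name

// Unsafe: read the latest _id, then increment it in application code.
function readMaxId() {
  return users.reduce((max, d) => Math.max(max, d._id), 0);
}

// Two clients read before either one writes, so both compute the same _id.
const seenByA = readMaxId();
const seenByB = readMaxId();
users.push({ _id: seenByA + 1, name: "Sarah C." });
users.push({ _id: seenByB + 1, name: "Bob D." });

// Safer: increment and read in one step, which is the role findAndModify
// plays in the counter() function above. Here a plain function stands in.
function nextId(name) {
  counters[name] += 1;
  return counters[name];
}
const safeIds = [nextId("users"), nextId("users")];

console.log(users.map((d) => d._id), safeIds);
```

Here both unsafe inserts end up with _id 1, while the counter hands out 1 and then 2.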

Do multiple shard keys help performance in MongoDB?

Since a sharded database uses the shard key to split chunks AND to route queries, I think more shard keys might help make more queries targeted.
I tried to specify multiple keys like this:
db.runCommand( { shardcollection : "test.users" , key : {_id:1, email : 1 ,address:1} } )
but I have no idea if it works or what the downsides of doing this are.
To be clear here, you can only have one shard key. So you cannot have multiple shard keys.
However, you are suggesting a compound index as the shard key. This can be done, but there are some limitations.
For example the combination of _id, email and address must be unique.
See the documentation on choosing a shard key. There are several more considerations that I cannot list here.
Selection of shard key based on :
{coarseLocality : 1, search : 1}
coarseLocality is whatever locality you want for your data,
search is a common search on your data.
You must have an index on the key you shard by, so if you choose a randomly-valued
key that you don’t query by, you’re basically wasting an index. Every additional index
makes writes slower, so it’s important to keep the number of indexes as low as possible.
So, adding more fields to the shard key doesn't help much.
Extract taken from Kristina Chodorow's book "Scaling MongoDB".