I'm unable to find documentation about the algorithm that MongoDB uses for collection or shard keys.
Can anyone help with this or post a reference?
If you are more interested in how indexing works in general, check this presentation about the internals: http://www.mongodb.com/presentations/storage-engine-internals or this one: http://www.mongodb.com/presentations/mongodbs-storage-engine-bit-bit
An individual shard knows little about the overall structure of the cluster, so it uses the same indexing algorithm internally; on top of that there is a metadata layer that knows which part of the data belongs to which shard.
There are some special cases, which are described in these docs: http://docs.mongodb.org/manual/core/indexes/
What the presentations above do not cover are the geospatial indexes and the special case of the hashed index. A hashed index can also be used as a shard key, in which case the sharding is hash-based sharding.
The hashing algorithm used for this is md5. It is used in this file:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/hasher.cpp
and implemented here:
https://github.com/mongodb/mongo/blob/master/src/mongo/util/md5.cpp
Currently it works only with a single field as the shard key; at least, that is what the comments in the https://github.com/mongodb/mongo/blob/master/src/mongo/db/index/hash_access_method.cpp source file suggest.
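To make the idea concrete, here is a rough Python sketch of "md5 the value, keep 64 bits". The function name, the seed handling and the byte layout are my assumptions for illustration; the server hashes the BSON-encoded element, so the numbers below will not match what MongoDB actually stores.

```python
import hashlib
import struct

def hashed_key_sketch(value: str, seed: int = 0) -> int:
    """Illustrative only: md5 over (seed + value), truncated to a signed 64-bit int.

    This is NOT MongoDB's exact byte-level algorithm; it only shows the general shape.
    """
    md5 = hashlib.md5()
    md5.update(struct.pack("<i", seed))       # assumed: a 32-bit seed goes in first
    md5.update(value.encode("utf-8"))         # assumed: raw UTF-8 bytes instead of BSON
    (result,) = struct.unpack("<q", md5.digest()[:8])  # first 8 bytes, little-endian
    return result

for v in ["user-1", "user-2", "user-3"]:
    print(v, hashed_key_sketch(v))
```

The point is only that neighbouring input values end up far apart in the hashed key space, which is what lets a hashed shard key spread inserts across chunks.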
The official doc about shard keys is
http://docs.mongodb.org/manual/core/sharded-clusters/
If by 'algorithm' you mean how the cluster itself operates, you can find help here:
http://docs.mongodb.org/manual/core/sharded-cluster-operations/
Since version 4.0 you can use convertShardKeyToHashed to convert a key to its hash value.
From that reference you can browse the source code and read its implementation.
We are building our first MongoDB deployment and are currently trying to choose the right shard key.
Each document in our main collection contains around 40 voice-call-related fields, and the main field that we use in queries is the UserId field. This is why we are thinking about a compound shard key of UserId and CallStartTime.
We are not sure about the second field, since CallStartTime is always advancing and one might argue that it is not random enough. This led us to consider replacing it, giving a key of UserId and the hashed _id (Mongo's internal id after hashing).
Is the first option OK, or would we be better off with the latter?
Consider the recommendations in the documentation here: http://docs.mongodb.org/manual/core/sharded-cluster-internals/#shard-keys
Or, if there is no natural choice, consider using a hashed shard key (MongoDB 2.4+):
http://docs.mongodb.org/manual/reference/glossary/#term-hashed-shard-key
What sort of queries are you performing? What are the access patterns?
Ideally you want a key with good cardinality, write scaling and query isolation.
In your examples above you would need to know the CallStartTime or the hash to avoid scatter-gather operations.
I looked through the docs and couldn't find a clear answer.
Say I have a sparse index on [a,b,c].
Will documents with the "a" and "b" fields but not "c" be added to the index?
Is having the shard key indexed obligatory in the latest MongoDB version?
If so, is it possible to shard on [a] using the above compound sparse index?
(Say a and b will always exist.)
If c is not present and the query plan uses a sparse index on c, the document will not be found, because it is not present in that index.
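A toy model of that behaviour, for a single-field sparse index on c only. The helper below is made up for illustration and does not model how MongoDB treats compound sparse indexes.

```python
from collections import defaultdict

def build_sparse_index(docs, field):
    """Hypothetical helper: index only the documents that actually contain `field`."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:                      # documents without the field are skipped
            index[doc[field]].append(doc["_id"])
    return index

docs = [
    {"_id": 1, "a": 1, "b": 2, "c": 3},
    {"_id": 2, "a": 1, "b": 2},               # no "c" field
]

c_index = build_sparse_index(docs, "c")
indexed_ids = sorted(i for ids in c_index.values() for i in ids)
print(indexed_ids)  # [1]
```

Any lookup that goes only through `c_index` can never return document 2, which is the effect described above.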
The shard key must be indexed and be unique. Also have a look at the notes on shard keys in the sharding reference doc, which says:
The ideal shard key:
is easily divisible which makes it easy for MongoDB to distribute content among the shards. Shard keys that have a limited number of possible values are not ideal as they can result in some chunks that are “unsplitable.” See the Cardinality section for more information.
will distribute write operations among the cluster, to prevent any single shard from becoming a bottleneck. Shard keys that have a high correlation with insert time are poor choices for this reason; however, shard keys that have higher “randomness” satisfy this requirement better. See the Write Scaling section for additional background.
will make it possible for the mongos to return most query operations directly from a single specific mongod instance. Your shard key should be the primary field used by your queries, and fields with a high degree of “randomness” are poor choices for this reason. See the Query Isolation section for specific examples.
So if, hypothetically, Mongo accepted a sparse index as the shard key, it would not know where to place documents that are not in the index. One could argue that they should all go to a separate shard set aside for this purpose. The counter-argument is: what happens when that shard outgrows its capacity? Hence I don't think it would make sense to do it, even if it were allowed.
3- I doubt a sparse index will work, because shards require a unique index and a sparse index does not fulfil that criterion. I haven't found the unique index requirement in the docs, but the help in the mongo admin shell tells you about it.
I'm new to NoSQL databases and am now using MongoDB. I have a question about the MongoDB shard key: what does it actually do? Is it related to query performance? And how can we choose a good shard key for a collection?
Thanks in advance
From 10gen's docs: http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
Choosing a shard-key is very dependent on your data and your use case.
Here's some more documentation you may find relevant:
http://docs.mongodb.org/manual/faq/sharding/
http://docs.mongodb.org/manual/sharding/
Specifically:
http://docs.mongodb.org/manual/core/sharding/
Essentially sharding allows you to partition your data across different servers. This means different writes/reads are going to different servers -- distributing the load of the application across multiple servers.
The shard key is the value in the document that is evaluated to determine which shard/server the document is routed to.
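As a rough mental model (not the actual mongos code), range-based routing can be sketched like this. The chunk boundaries and shard names are hypothetical.

```python
import bisect

# Hypothetical chunk layout: chunk i covers [CHUNK_LOWER_BOUNDS[i], CHUNK_LOWER_BOUNDS[i + 1]).
CHUNK_LOWER_BOUNDS = ["", "g", "n", "t"]
CHUNK_TO_SHARD = ["shard0", "shard1", "shard0", "shard2"]

def route(shard_key_value: str) -> str:
    """Return the shard that owns the chunk containing this shard key value."""
    idx = bisect.bisect_right(CHUNK_LOWER_BOUNDS, shard_key_value) - 1
    return CHUNK_TO_SHARD[idx]

print(route("alice"))    # shard0
print(route("mallory"))  # shard1
print(route("zoe"))      # shard2
```

A query that includes the shard key can be sent to one shard in the same way; a query without it has to be sent to every shard.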
You can find more explanation of how shard keys work and how to select one in Kristina Chodorow's book "Scaling MongoDB".
I'm looking for a good approach to do the following :
Given a document with some field F, I want to set up sharding so that my application can generate a static hash for that field value (meaning the hash will always be the same if the value is the same) and then use that hash to target the appropriate shard in a normal MongoDB sharded setup.
Questions:
Is this a safe/good approach?
What is a good way to go about implementing it?
Are there any gotchas concerning shard cluster setup that I should be aware of?
Thanks!
I've actually implemented this and it's very doable and results in very good write performance. I'm assuming you have the same reasons for implementing it as I did (instant shard targeting without warmup/balancing, write throughput, no performance degradation during chunk moves/splits, etc.).
Your questions :
Yes, it is, provided you implement it correctly.
What I do in our in-house ORM layer is mark certain fields in a document as hash sharded fields. Our ORM then automatically generates a hash value for that field value just prior to writing or reading the document. Outgoing queries are then decorated with that hash value (in our case always called "hash"), which MongoDB sharding then uses for shard targeting. Obviously, in this scenario "hash" is always the only shard key.
Most important by far is to generate good hashes. A lot of field values (most commonly an ObjectId-based _id field) are incremental, so your hash algorithm must be such that the hashes generated for incremental values hit different shards. Other issues include selecting the appropriate chunk size.
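A minimal sketch of that decoration step, in Python rather than our ORM. The function names, the SHA-1 choice and the bucket count are assumptions for illustration, not what we run in production.

```python
import hashlib

NUM_BUCKETS = 4096  # hypothetical size of the hash space

def stable_hash(value) -> int:
    """Same value gives the same hash in every process (unlike Python's built-in hash() for str)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def decorate_document(doc: dict, hashed_field: str) -> dict:
    """Add the "hash" shard key to a copy of the document just before writing it."""
    out = dict(doc)
    out["hash"] = stable_hash(doc[hashed_field])
    return out

def decorate_query(query: dict, hashed_field: str) -> dict:
    """Add "hash" to a query that filters on the hashed field, so it can be targeted."""
    out = dict(query)
    if hashed_field in query:
        out["hash"] = stable_hash(query[hashed_field])
    return out

doc = decorate_document({"_id": "call-1001", "UserId": 42}, "_id")
query = decorate_query({"_id": "call-1001"}, "_id")
print(doc["hash"] == query["hash"])  # True
```

Queries that do not filter on the hashed field are left untouched and will still be scatter-gather.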
Some downsides to consider :
Default MongoDB chunk balancing becomes less useful since you typically set up your initial cluster with a lot of chunks (this facilitates adding shards to your cluster while maintaining good chunk spread across all shards). This means the balancer will only start splitting if you have enough data in your premade chunks to require splitting.
It's likely to become an officially supported MongoDB feature in the near future which may make this entire effort a bit wasteful. Like me you may not have the luxury of waiting though.
Good luck.
UPDATE 25/03/2013 : As of version 2.4 MongoDB supports hash indexes natively.
This is a good and safe idea.
However, the choice of the hash function is crucial:
do you want it to be uniform? (you smooth the load across all your shards, but you lose some semantic bulk access)
do you want it to be human readable? (you lose efficiency compared to binary hashes, which are very fast, but you gain, well, readability)
can you make it consistent? (beware of language-provided hash functions)
can you enforce uniqueness if you want it?
I have successfully chosen uniformity, binary form, consistency and uniqueness with a murmurHash3 function:
value -> murmurHash(valueInBinaryForm) followed by the valueInBinaryForm
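Roughly, in Python. murmurHash3 is not in the Python standard library, so this sketch swaps in `hashlib.blake2b` with a 4-byte digest as a stand-in; the function names are made up.

```python
import hashlib

PREFIX_LEN = 4  # bytes of hash placed in front of the value (assumed size)

def prefixed_key(value: bytes) -> bytes:
    """hash(value) followed by the value itself.

    The prefix spreads keys uniformly; the appended value keeps the key unique
    even if two different values share the same hash prefix.
    """
    prefix = hashlib.blake2b(value, digest_size=PREFIX_LEN).digest()
    return prefix + value

def original_value(key: bytes) -> bytes:
    """Recover the value by dropping the fixed-length hash prefix."""
    return key[PREFIX_LEN:]

key = prefixed_key(b"alice")
print(key.hex())
print(original_value(key))  # b'alice'
```

Because the digest comes from a fixed algorithm rather than a per-process hash, the same value maps to the same key from any client or language.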
I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an English string (which will obviously be unique) as the unique key, but I want to make sure that it won't tie my hands later on.
I've only recently been getting familiar with MongoDB myself, so take this with a grain of salt, but I suspect that sharding is more efficient with ObjectId than with your own key values, because part of the ObjectId indicates which machine or shard the document was created on. The bottom of this page in the Mongo docs explains what each portion of the ObjectId means.
I asked this question on the Mongo user list, and basically the reply was that it's OK to generate your own _id values and that it will not make sharding more difficult. Sometimes I need numeric _id values, for example when I'm going to use them in URLs, so I generate my own _id in some collections.
ObjectId is designed to be globally unique. So when it is used as the primary key and a new record is added to the dataset without a primary key value, each shard can generate a new ObjectId without worrying about collisions with other shards. This somewhat simplifies life for everyone :)
A shard key does not have to be unique. We can't conclude that sharding a collection on ObjectId is always efficient.
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/, the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, any inserts that are keyed by OID will land on the same machine, and the write I/O capacity of that one machine will be the total I/O of your entire cluster. (This is true not just of OIDs but of any predictable key: timestamps, autoincrementing numbers, etc.)
Contrariwise, if you chose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)
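A small simulation of the effect, under made-up assumptions: a key space of [0, 1000), four chunks with hypothetical split points, and md5 used only as a convenient way to scramble keys.

```python
import bisect
import hashlib
from collections import Counter

SPLIT_POINTS = [250, 500, 750]  # hypothetical chunk boundaries, giving chunks 0..3

def chunk_for(key: int) -> int:
    """Index of the chunk whose range contains this key."""
    return bisect.bisect_right(SPLIT_POINTS, key)

def scrambled(key: int) -> int:
    """Stand-in for a 'random' shard key: a stable pseudo-random value in [0, 1000)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 1000

# 100 new inserts whose keys keep increasing, like OIDs or timestamps.
new_keys = range(900, 1000)

monotonic = Counter(chunk_for(k) for k in new_keys)
spread = Counter(chunk_for(scrambled(k)) for k in new_keys)

print(monotonic)  # Counter({3: 100})
print(spread)     # inserts spread over several chunks
```

With increasing keys every insert hits the last chunk, so one shard takes all the writes; with scrambled keys the same 100 inserts are shared among the chunks.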