I use an 'id' field in my MongoDB documents which is the hash of '_id' (the ObjectId field generated by Mongo). I want to use range sharding on the 'id' field. My question is:
How can I set ranges for each shard when 'shardKey' is some long String (for example 64 chars)?
If you want your data to be distributed based on a hash key, MongoDB has a built-in way of doing that:
sh.shardCollection("yourDB.yourCollection", { _id: "hashed" })
This way, data will be distributed between your shards randomly and uniformly (or very close to it).
Please note that you can't have both logical key ranges and random data distribution. The two are mutually exclusive, so:
If you want random data distribution, use { fieldName: "hashed" } as your shard key definition.
If you want to manually control how data is distributed and accessed, use a normal shard key and define shard tags.
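For the range option with a long hex string key, the boundaries can be computed from a short prefix of the key space. Below is a minimal sketch, assuming the 'id' values are fixed-length lowercase hex strings compared byte by byte (so string order matches numeric order). hexSplitPoints, shardRanges and the tag names are hypothetical helpers for illustration, not part of MongoDB. The resulting min/max values are what you would pass to sh.addTagRange, with the open ends replaced by MinKey and MaxKey.

```javascript
// Compute evenly spaced split points over a hex prefix of the shard key.
// prefixLen hex characters give 16^prefixLen possible prefixes.
function hexSplitPoints(numShards, prefixLen) {
  const space = Math.pow(16, prefixLen);
  const points = [];
  for (let i = 1; i < numShards; i++) {
    const boundary = Math.floor((space * i) / numShards);
    // Left-pad so every boundary has the same length and sorts correctly.
    points.push(boundary.toString(16).padStart(prefixLen, "0"));
  }
  return points;
}

// Turn the split points into one [min, max) range per shard tag.
// null marks an open end; in the shell you would use MinKey / MaxKey there.
function shardRanges(numShards, prefixLen) {
  const points = hexSplitPoints(numShards, prefixLen);
  const ranges = [];
  for (let i = 0; i < numShards; i++) {
    ranges.push({
      tag: "tag" + i, // hypothetical tag name
      min: i === 0 ? null : points[i - 1],
      max: i === numShards - 1 ? null : points[i],
    });
  }
  return ranges;
}

console.log(shardRanges(4, 2));
```

With 4 shards and a 2-character prefix the split points are "40", "80" and "c0", so a key starting with "3f" falls into the first range and a key starting with "c1" into the last one.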
Related
I have documents with the following schema:
{
idents: {
list: ['foo', 'bar', ...],
id: 123
}
...
}
The field idents.list is an array of strings and always contains at least one element.
The field idents.id may or may not exist.
Over time, more entries are added to idents.list, and at some point in the future idents.id may be set too.
These two fields uniquely identify a document and are therefore relevant for a shard key.
Is it possible to use sharding with this schema?
UPDATE:
Documents are always queried via { 'idents.list': 'foo' } OR { $or: [ { 'idents.list': 'foo' }, { 'idents.id': 42 } ] }
Yes, you can do this. The documentation says:
Use a compound shard key that uses two or three values from all documents that provide the right mix of cardinality with scalable write operations and query isolation.
https://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
I am trying to create a collection with 50+ fields. I understand that the purpose of the primary key is to uniquely identify a record. Since the primary key in MongoDB is the _id that gets created automatically, isn't it obvious that all my records, including duplicates, would go into my DB with a unique _id for every record? Tell me where I'm going wrong. Other articles and discussions are more confusing.
How can I set one or more of the other fields as a primary key? I don't want the default _id as the primary key.
In what way are compound indexes different from a compound/primary key?
There is no such notion as a primary key in MongoDB. Terminology matters. Not knowing the terminology is a sure sign someone hasn't read the docs or at least not carefully.
A document in a collection must have an _id field which may be and by default is an ObjectId. This field has an index on it which enforces a unique constraint, so there can not be any two documents with the same value or combination of values in the _id field. Which, by what you describe, presumably is what you want.
My suggestion is to reuse the default _id as often as you can. Additional indices are expensive (RAM-wise). You have two options here: either use a different single value as _id or use multiple values, if the cardinality of the single field isn't enough.
Let us assume you want to record a clickstream per user. Obviously, you need the unique user. But that alone would not be enough, since a user could then only have one entry. Since you need a timestamp for each click anyway, you move both into the _id field:
{
_id:{
user: "some user",
ts: new ISODate()
},
...
}
Unless your Mongo installation is sharded, you can create a unique compound index on multiple fields and use this as a surrogate composite primary key.
db.collection.createIndex( { a: 1, b: 1 }, { unique: true } )
Alternatively you could create your own _id values. However, as the default ObjectId is also a timestamp, personally I find it useful for auditing purposes.
Regarding the difference between a compound index and a composite primary key: by definition, a primary key cannot be defined on missing (null) fields, and there can only be one primary key per document. In MongoDB only the _id field can be used as a primary key, as it is added by default when missing. In contrast, a compound index can be applied to missing fields by defining it as sparse, and you can define multiple compound indices on the same collection.
I am trying to shard a collection with approximately 6M documents. Following are some details about the sharded cluster
Mongod version 2.6.7, two shards, 40 % writes, 60% reads.
My database has a collection events with around 6M documents. The normal document looks like below:
{
_id : ObjectId,
sector_id : ObjectId,
subsector_id: ObjectId,
.
.
.
Many event specific fields go here
.
.
created_at: Date,
updated_at: Date,
uid : 16DigitRandomKey
}
Each sector has multiple (1,2, ..100) subsectors and each subsector has multiple events. There are 10000 such sectors, 30000 subsectors and 6M events. The numbers keep growing.
The normal read query includes sector_id, subsector_id. Every write operation includes sector_id, subsector_id, uid (randomly generated unique key) and rest of the data.
I tried/considered following shard keys and the results are described below:
a. _id:hashed --> will not provide query isolation, reason: _id is not passed to read query.
b. sector_id :1, subsector_id:1, uid:1 --> Strange distribution: a few sectors with old ObjectIds go to shard 1, a few sectors whose sector_id (ObjectId) is of middle age are well balanced across both shards, and a few sectors with recent ObjectIds stay on shard 0.
c. subsector_id: hashed --> results were same as shard key b.
d. subsector_id:1, uid:1 --> same as b.
e. subsector_id:hashed, uid:1 --> can not create such index
f. uid:1 --> writes are distributed but no query isolation
What may be the reason for this uneven distribution? What would be the right shard key based on the given data?
I see this as expected behaviour, Astro. The sectorIds and subsectorIds are of type ObjectId, which contains a timestamp in its first 4 bytes. That value is monotonic in nature, so new values always go to the same chunk (and hence the same shard); it fails to provide the randomness needed for even distribution, as you also observed in point (b).
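To make the timestamp prefix concrete, here is a small sketch that reads the creation time out of an ObjectId written as a 24-character hex string. The two sample ids and the helper name objectIdSeconds are made up for illustration.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId hold the creation
// time as seconds since the Unix epoch, most significant byte first.
function objectIdSeconds(hex) {
  return parseInt(hex.slice(0, 8), 16);
}

// Hypothetical ids: the second one was "created" later than the first.
const olderId = "507f1f77bcf86cd799439011";
const newerId = "5f000000bcf86cd799439011";

console.log(new Date(objectIdSeconds(olderId) * 1000).toISOString());
console.log(new Date(objectIdSeconds(newerId) * 1000).toISOString());

// Because the timestamp comes first, newer ids also sort after older ones,
// which is why a ranged key on such a field keeps growing at one end.
console.log(olderId < newerId);
```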
The best way to choose a shard key is to pick a key that has business meaning (unlike an ObjectId field) and mix it with some hash as the suffix to ensure a good random mix for equal distribution. If you have a sectorName and subsectorName, please try those out and let us know whether it works.
You may consider this link to choose the right shard key:
MongoDB shard by date on a single machine
In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"errmsg" : "exception: Currently only single field hashed index supported.",
"code" : 16763,
"ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant? Is the hashed key O(1)?
How I got to the question:
Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (a hash lookup is usually O(1)). I left the index as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of whether the hashed index is faster for lookups by key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page db.products.find( { key: "a" } ).limit(10).sort({ "sortOrder": 1 })
And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({ "sortOrder": 1 })
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.
For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar, as the value of a would be hashed and stored in the index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching the value in the index tree. See also mongo/db/index/hash_access_method.h
Also, a hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (e.g., country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of the second field.
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific case of sharding. When a shard key value is monotonically increasing or decreasing, the data is likely to go to one shard only. This is where a hashed shard key can improve the distribution of writes. It's a minor trade-off that greatly improves your sharded cluster. See also Hashed vs Ranged Sharding.
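The effect can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: two shards, a single fixed split point for the ranged key (a real cluster would keep splitting and migrating chunks), and MD5 from Node's crypto module as a stand-in hash. The function names are hypothetical.

```javascript
const crypto = require("crypto");

// Ranged key: values below the split point go to shard 0, the rest to shard 1.
function rangedShard(value, splitPoint) {
  return value < splitPoint ? 0 : 1;
}

// Hashed key: route on the first byte of a hash of the value (stand-in hash).
function hashedShard(value) {
  const digest = crypto.createHash("md5").update(String(value)).digest();
  return digest[0] < 128 ? 0 : 1;
}

// Insert `count` monotonically increasing values (like timestamps or
// counters) and count how many land on each shard under both schemes.
function simulate(startValue, count, splitPoint) {
  const ranged = [0, 0];
  const hashed = [0, 0];
  for (let i = 0; i < count; i++) {
    const value = startValue + i;
    ranged[rangedShard(value, splitPoint)]++;
    hashed[hashedShard(value)]++;
  }
  return { ranged, hashed };
}

// All new values are at or above the existing split point of 1000.
console.log(simulate(1000, 1000, 1000));
```

In this run every insert on the ranged key lands on the same shard, while the hashed key spreads the same values across both.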
Is it worth inserting a random hash or value with the document, and using that for sharding instead of a hash generated on the _id?
Whether it's worth it depends on the use case. A custom hash value would mean that any query on the hash value has to go through custom hashing code, i.e. the application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.
In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
You don't need range based queries or sorting for the indexed field.
Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document holds a random string with 64 characters.
{
"myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
The index is about 1/4 of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
// A regular index. This one is sorted by the value of myLargeRandomString
"myLargeRandomString_-1" : 23074062336,
// The hashed version of the index for the same field. It is around 1/4 of the original size.
"myLargeRandomString_hashed" : 6557511680,
}
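As a quick check of the "about 1/4" figure, the ratio can be computed from the two sizes reported above. The variable names are only for illustration.

```javascript
// Index sizes in bytes, copied from the stats() output above.
const regularIndexBytes = 23074062336;
const hashedIndexBytes = 6557511680;

// Fraction of the regular index size that the hashed index occupies.
const sizeRatio = hashedIndexBytes / regularIndexBytes;

const GiB = 1024 ** 3;
console.log("regular: " + (regularIndexBytes / GiB).toFixed(1) + " GiB");
console.log("hashed:  " + (hashedIndexBytes / GiB).toFixed(1) + " GiB");
console.log("ratio:   " + sizeRatio.toFixed(2));
```

The ratio comes out a little above one quarter, so the hashed index here is somewhat more than 1/4 of the regular one.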
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.
Have some data that looks like this:
widget:
{
categories: ['hair', 'nails', 'dress']
colors: ['red', 'white']
}
The data needs to be queried like this:
SELECT * FROM widget_table WHERE categories == 'hair' AND colors == 'red'
Would like to put this data into a MongoDB sharded cluster. However, it seems like an ideal shard key would not be a list field. In this case, that is not possible because all of the fields are list fields.
Is it possible to use a list field, such as the field categories as the shard key in MongoDB?
If so, what things should I look out for / be aware of?
Thanks so much!
Based on some of the feedback I am getting, which seems to assert that it is not possible to shard using a list field as a shard key, I wanted to illustrate how this use case could be sharded within the limitations of MongoDB:
Original object:
widget:
{
primary_key: '2389sdjsdafnlfda'
categories: ['hair', 'nails', 'dress']
colors: ['red', 'white']
#All the other fields in the document that don't need to be queried upon:
...
...
}
Data layer splits object into multiple pointer objects based on the number of elements in the field chosen for the shard key:
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'hair',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'nails',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'dress',
colors: ['red', 'white']
}
Explanation:
The field categories can now be the shard key in MongoDB.
The original object will now be stored in a key-value store. Queries against the data in MongoDB will return a pointer object that will be used to get the object from the key-value store.
Queries on the MongoDB data will hit only one shard.
Insertions on the MongoDB data will hit as many shards as there are elements in the list. In most cases, only a small subset of the total number of shards will be affected.
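The split performed by the data layer could look roughly like this. It is a sketch only: toWidgetPointers is a hypothetical helper, and it assumes categories is the field chosen for the shard key, as in the example above.

```javascript
// Produce one pointer object per category, so that each pointer has a
// single scalar value in the field used as the shard key.
function toWidgetPointers(widget) {
  return widget.categories.map((category) => ({
    primary_key: widget.primary_key,
    categories: category,
    colors: widget.colors.slice(), // copy so pointers do not share the array
  }));
}

const widget = {
  primary_key: "2389sdjsdafnlfda",
  categories: ["hair", "nails", "dress"],
  colors: ["red", "white"],
};

// One pointer per element of widget.categories.
console.log(toWidgetPointers(widget));
```

The full widget itself would still be written to the key-value store under primary_key; only the pointer objects go into the sharded collection.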
Sharding in MongoDB (as at 2.4) works by partitioning your documents into ranges of values based on the shard key. A list or array field does not make sense as a shard key because it contains multiple values.
It's also worth noting that the shard key is immutable (cannot be changed once set for a document), so you do not want to choose fields that you intend to update.
If you do not have any candidate fields in your documents, you could always add one. A straightforward solution in your case could be to use the new hashed sharding in MongoDB 2.4:
The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.
An obvious question to consider before sharding is "do you need to shard?". Sharding is an approach for scaling out writes with MongoDB, but can be overkill if you aren't yet pushing the limits of your current configuration.