MongoDB shard key as (ObjectId, ObjectId, RandomKey). Unbalanced collections - mongodb

I am trying to shard a collection with approximately 6M documents. Here are some details about the sharded cluster:
mongod version 2.6.7, two shards, 40% writes, 60% reads.
My database has an events collection with around 6M documents. A typical document looks like this:
{
  _id: ObjectId,
  sector_id: ObjectId,
  subsector_id: ObjectId,
  ...                     // many event-specific fields go here
  created_at: Date,
  updated_at: Date,
  uid: 16DigitRandomKey
}
Each sector has multiple (1, 2, ..., 100) subsectors and each subsector has multiple events. There are 10,000 such sectors, 30,000 subsectors and 6M events. The numbers keep growing.
The normal read query includes sector_id, subsector_id. Every write operation includes sector_id, subsector_id, uid (randomly generated unique key) and rest of the data.
I tried/considered the following shard keys; the results are described below:
a. _id: hashed --> will not provide query isolation; reason: _id is not passed in the read queries.
b. sector_id: 1, subsector_id: 1, uid: 1 --> strange distribution: the few sectors with old ObjectIds go to shard 1, the sectors with mid-range ObjectIds are well balanced and equally distributed between both shards, and the few sectors with recent ObjectIds stay on shard 0.
c. subsector_id: hashed --> results were the same as for shard key b.
d. subsector_id: 1, uid: 1 --> same as b.
e. subsector_id: hashed, uid: 1 --> cannot create such an index.
f. uid: 1 --> writes are distributed, but there is no query isolation.
What may be the reason for this uneven distribution? What would be the right shard key given this data?

I see it as expected behaviour, Astro: the sector_ids and subsector_ids are of ObjectId type, which contains a timestamp in its first 4 bytes. That prefix is monotonic in nature, so new documents would always go to the same chunk (and hence the same shard); it fails to provide the randomness needed, which is also what you pointed out in (b).
The best way to choose a shard key is to pick a key that has business meaning (unlike some ObjectId field) and mix in some hash as the suffix to ensure a good random spread for equal distribution. If you have a sectorName and subsectorName, please try them out and let us know whether that works.
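For illustration only, a minimal sketch of that kind of key, assuming the collection is mydb.events and that sectorName and subsectorName exist on every document (both names are placeholders from the suggestion above):
// enable sharding on the database first
sh.enableSharding("mydb")
// business fields give query isolation; uid acts as the random suffix for distribution
sh.shardCollection("mydb.events", { sectorName: 1, subsectorName: 1, uid: 1 })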
You may consider this link to help choose the right shard key:
MongoDB shard by date on a single machine

Related

MongoDB - Weird difference in _id Index size

I have two sharded collections on 12 shards, with the same number of documents. The shard key of Collection1 is compound (two fields are used), and its documents consist of 4 fields. The shard key of Collection2 is a single field, and its documents consist of 5 fields.
Via the db.collection.stats() command, I get information about the indexes.
What seems strange to me is that for Collection1 the total size of the _id index is 1342 MB, while the total size of the _id index for Collection2 is 2224 MB. Is this difference reasonable? I was expecting the total sizes to be more or less the same because of the identical number of documents. Note that the shard key of neither collection includes the _id field.
MongoDB uses prefix compression for indexes.
This means that if sequential values in the index begin with the same series of bytes, those bytes are stored once for the first value, and subsequent values contain a tag indicating the length of the shared prefix.
Depending on the datatype of the _id value, this could be quite a bit.
There may also be orphaned documents causing one node to have more entries in its _id index.
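As a quick check (a sketch only; the collection names are the ones from the question), you can compare the per-index sizes reported by db.collection.stats():
// indexSizes lists the size of each index in bytes, keyed by index name
db.Collection1.stats().indexSizes["_id_"]
db.Collection2.stats().indexSizes["_id_"]
// passing a scale factor reports the sizes in MB instead
db.Collection1.stats(1024 * 1024).indexSizes["_id_"]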

mongodb range sharding with string field

I use an 'id' field in my MongoDB documents which is the HASH of '_id' (the ObjectId field generated by Mongo). I want to use RANGE sharding on the 'id' field. The question is the following:
How can I set the ranges for each shard when the shard key is some long string (for example, 64 characters)?
If you want your data to be distributed based on a hash key, MongoDB has a built-in way of doing that:
sh.shardCollection("yourDB.yourCollection", { _id: "hashed" })
This way, data will be distributed between your shards randomly, as well as uniformly (or very close to it).
Please note that you can't have both logical key ranges and random data distribution. It's either one or the other, they are mutually exclusive. So:
If you want random data distribution, use { fieldName: "hashed" } as your shard key definition.
If you want to manually control how data is distributed and accessed, use a normal shard key and define shard tags.
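If you do go the manual route, a rough sketch with tag ranges might look like the following (the shard names and the split point "8" are purely illustrative; your own 64-character hash strings would define the real boundaries):
sh.shardCollection("yourDB.yourCollection", { id: 1 })
sh.addShardTag("shard0000", "LOW")
sh.addShardTag("shard0001", "HIGH")
// strings sorting below "8" go to LOW, the rest to HIGH
sh.addTagRange("yourDB.yourCollection", { id: MinKey }, { id: "8" }, "LOW")
sh.addTagRange("yourDB.yourCollection", { id: "8" }, { id: MaxKey }, "HIGH")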

Slow creation of four-field index in MongoDB

I have a ProductRequest collection in MongoDB. It is a somewhat large collection, but does not have that many documents. The number of documents is a bit over 300,000, but the average size of a document is close to 1 MB, so the data footprint is large.
To speed up certain queries I am setting up index on this collection:
db.ProductRequest.ensureIndex({ processed: 1, parsed: 1, error: 1, processDate: 1 })
The first three fields are Boolean; the last one is a datetime.
The command has been running for almost 24 hours now and has not come back.
I already have an index on the 'processed' and 'parsed' fields (together) and a separate one on 'error'. Why does the creation of this four-field index take forever? My understanding is that the size of an individual record should not matter in this case; am I wrong?
Additional Info:
MongoDB version 2.6.1 64-bit
Host OS Centos 6.5
Sharding: yes, the shard key is _id. Number of shards: 2; each shard is a replica set of 3 members.
I believe it is because of putting an index on the Boolean fields.
Since there are only two values (true or false), with 300,000 rows an index on such a field still has to scan roughly 150,000 rows to find all matching documents, and in your case you have 3 Boolean fields, which makes it even slower.
You won't see a huge benefit from an index on those three fields and processDate compared to an index just on processDate. Indexes on boolean fields aren't very useful in the presence of other index-able fields because they aren't very selective. If you give a process date, there are only 8 possibilities for the combination of the other fields to further narrow down the results via the index.
Also, you should switch the order. Put processDate first as it is much more selective than a boolean field. That should greatly simplify the index and speed up the index build.
Finally, index creation in MongoDB is sometimes unavoidably slow and expensive because it involves creating large B-trees. The payoff, which is absolutely worth it, of course, is faster queries. It's possible that more than 24 hours are needed for an index build. Have you checked what the saturated resource is? It's likely the CPU for an index build. Your best option for this case is to create the index in the background. Background index builds
don't block read and write operations for the duration, unlike foreground index builds
take longer
produce initially larger indexes that will converge to the size of an equivalent foreground index over time
You set an index build to occur in the background with an extra option to the ensureIndex call:
db.myCollection.ensureIndex({ "myField" : 1 }, { "background" : 1 })
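Putting both suggestions together, one possible (untested) variant for the collection in the question would be:
// processDate first for selectivity; built in the background so reads and writes aren't blocked
db.ProductRequest.ensureIndex(
    { processDate: 1, processed: 1, parsed: 1, error: 1 },
    { background: true }
)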

Update document record to the corresponding tag aware shard db

Does Mongo move a newly updated document to the correct shard in a tag-aware sharded setup?
We have the following setup with MongoDB 2.4.6, using C# driver 1.8.3, and it did not return the expected result for the update scenario on tag-aware sharding. Please help us review the following scenario and let us know whether MongoDB is capable of this.
We have the following setup for the experiment:
//use the default 'test' database
db = db.getSiblingDB('test');
//Add shards
sh.addShard( "shard0001.local:27017" );
sh.addShard( "shard0002.local:27017" );
//Enable sharding for the database,
sh.enableSharding("test");
//Enable sharding for a collection,
sh.shardCollection("test.persons", { "countryCode": 1, "_id": 1 } );
//Add shard tags,
sh.addShardTag("shard0001", "USA");
sh.addShardTag("shard0002", "JPN");
//Tag shard key ranges,
sh.addTagRange("test.persons", { countryCode: 0 }, { countryCode: 1 }, "USA");
sh.addTagRange("test.persons", { countryCode: 1 }, { countryCode: 2 }, "JPN");
Then we execute the following script for the initial data population:
//MongoDB sharding test,
db = db.getSiblingDB('test')
//Load data
//USA: countryCode 0
//JPN: countryCode 1
for (var i = 0; i < 1000; i++) {
  db.persons.insert( { name: "USA-" + i, countryCode: 0 } )
  db.persons.insert( { name: "JPN-" + i, countryCode: 1 } )
}
At this point, we have 1000 records on each shard: 1000 records with the USA country code on shard0001 and 1000 records with the JPN country code on shard0002.
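For reference, one way to check this kind of split from a mongos shell is, for example:
// prints per-shard document counts and data sizes for the sharded collection
db.persons.getShardDistribution()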
From C#, we have the following pseudo code:
collection.insert( 1 document of countryCode=0)
collection.insert( 1 document of countryCode=1)
Upon execution, we have 1001 documents for each shard, so far so good.
Then we updated one document in shard0001 from countryCode=0 to countryCode=1, using its _id. However, we ended up with 1002 records in the JPN shard (shard0002) and 1001 records in the USA shard (shard0001). It appears that mongos routed the update to shard0002 based on the new countryCode of 1 and executed it as an insert, and never updated the document in shard0001. Hence we now have 2 documents with the same _id in two different shards.
We were expecting Mongo to update the actual document in shard0001 and then realize that changing countryCode from 0 to 1 should move that document to shard0002 instead. Does Mongo do this automatically?
We know we can manually remove the document from shard0001, but do we really have to do this on our own?
If you check the documentation for the key range assignment, it has this notice: "To assign a tag to a range of shard keys use the sh.addTagRange() method when connected to a mongos instance. Any given shard key range may only have one assigned tag. You cannot overlap defined ranges, or tag the same range more than once."
The reason is that in the background MongoDB will make splits so that each chunk relates to only one specific tag's key range, and this way the chunks are able to move around according to the tagging. So the alignment defined by the tags is ensured in two steps:
The engine makes splits so that each chunk's key range is exclusive to one shard tag.
The chunks are then aligned/moved around in a balancing round according to the current shard-tag mapping.
I assume you checked the number of docs from the side of the shards, connecting to them directly rather than through the mongos instance. Whether or not it is a bug that you ended up with the two docs, due to the shard key mapping you are only able to access one of them through mongos, because of the key-range-based alignment. If the old one is not deleted automatically it would definitely be a bug and should be addressed. I cannot check the JIRA from my current location; I will set up some tests and get back to you with the results. (That was due to my misunderstanding of the described behaviour; see below.)
Based on your comments that you used the save command to perform the update, and on this documentation, the situation is the following: when you save a document with the combination
{ countrycode: 0, _id: x }
which is a new one (you previously had a { countrycode: 1, _id: x }), the _id is the same as for another document (the old one), and the new document resides on another shard (which is the case here, because the tagging is based on the countryCode). It will be inserted without a problem, because the uniqueness of the _id field is only ensured within a given shard and collection. Across different shards, if _id is not the shard key, or not the first field of a compound shard key, there is no guarantee that the values of the _id field are globally unique. Basically, as it is generated, an _id will most likely be unique, except in a situation like this one where you deliberately supplied the same _id to get update-like behaviour.
To answer your question: if this situation is not what you expect, you have to delete the old document and then create the new one, or, more safely, mark the old one as deleted (with a flag or similar, handled on the application side) and later look for deleted docs and really remove them if needed (for example when you run out of space).
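A rough sketch of that delete-then-reinsert approach (2.4-era servers cannot change a shard key value in place; oldId below is just a placeholder for the document's _id):
// fetch and remove the old document from the USA range, then reinsert it with the new countryCode
var doc = db.persons.findOne({ countryCode: 0, _id: oldId })
db.persons.remove({ countryCode: 0, _id: oldId })
doc.countryCode = 1
db.persons.insert(doc)   // mongos now routes it to the JPN-tagged range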

Is MongoDB _id unique by default?

Is MongoDB _id unique by default, or do I have to set it to unique?
For the most part, _id in mongodb is unique enough. There is one edge case where it's possible to generate a duplicate id though:
If you generate more than 16,777,215 ObjectIds in the same single second, in the same process, on the same machine, then you will get a duplicate id, due to how ObjectIds are generated. If you change any one of those things, like generating an id in a different process, on a different machine, or at a different Unix time, then you won't get a duplicate.
Let me know if you ever manage to pull this off with a realistic use case. Apparently Google only gets around 70,000 searches a second, and those are all processed on different machines.
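That 16-million figure falls straight out of the ObjectId layout; a back-of-the-envelope check in the shell (this is just arithmetic, not a MongoDB API):
// the counter portion of an ObjectId is 3 bytes
Math.pow(2, 24)   // 16777216 possible counter values within one (second, machine, process) tuple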
All documents contain an _id field. All collections (except for capped ones) automatically create a unique index on _id.
Try this:
db.system.indexes.find()
OK, the short version:
YES, YES, YES.
_id is unique by default; MongoDB creates an index on _id by default and you do not need any settings.
According to MongoDB's manual the answer is yes, it's unique by default:
MongoDB creates the _id index, which is an ascending unique index on the _id field, for all collections when the collection is created. You cannot remove the index on the _id field.
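You can see that default index on any collection; the exact output varies by version, but it looks roughly like this (myCollection is a placeholder):
db.myCollection.getIndexes()
// [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.myCollection" } ]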
Sharing what I learned about this topic here:
Unlike in most RDBMSs, a MongoDB document's id is generated on the client side. This functionality is usually implemented in the drivers for the various programming languages.
The id is a 12-byte value (usually shown as a 24-character hex string), which consists of several parts as follows:
TimeStamp(4 bytes) + MachineId(3 bytes) + ProcessId(2 bytes) + Counter(3 bytes)
Based on this pattern, it's extremely unlikely to have two Ids duplicated.
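In the mongo shell you can poke at that structure directly, for example:
var oid = ObjectId()
oid.getTimestamp()   // ISODate derived from the 4-byte timestamp prefix
oid.str              // 24-character hex representation of the 12 bytes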