MongoDB Compound Key Sharding And chunks vs disk size - mongodb

After going through the 10Gen manual, I can't seem to understand how sharding works in the following scenarios. I will use a document with userid, lastupdatetime and data fields as the example:
Chunks contain an ordered list of shard key values. So if my shard key is userid, I expect chunk1 to contain a list of ids user1...user999 (up to the 64 MB limit) and chunk2 to hold user1000...user1999. Is that correct?
In the previous case, let's say that chunk1 is on shard1 and chunk2 is on shard2. If user1 (which is on shard1) has lots and lots of documents and all other users have 1-2 documents, shard1's disk usage will be a lot bigger than shard2's. If this is correct, what is MongoDB's mitigation in that case?
How is a compound shard key ordered inside the chunks? For example, if the compound shard key is userid+lastupdatetime, is it safe to assume the following (assuming user1 has lots of documents):
chunk1 contains a list of values: user1, 10:00:00; user1, 10:01:00; ... user1, 14:04:11 (up to the 64 MB limit) and chunk2 holds user1, 14:05:33; user2, 9:00:00; ... user34, 19:00:00; ...
Is that correct?

Yes, you are correct.
Your shard key determines where chunks can be split. If your shard key is "userid", then the finest granularity at which a chunk can be split is a single userid value, so all of one user's documents always stay in the same chunk. MongoDB sizes chunks based on the data they hold, not on the number of key values. So it is very likely that chunk1 (on shard1) only holds, for example, the documents with userids in the range 1..10, and chunk2 (on shard2) the documents with userids 11..1000. MongoDB automatically picks the range boundaries for each chunk.
That is correct as well. With a compound shard key, the "unit" by which documents can be divided is the combination of both fields. So you can have { MinValue } to { user1, 12:00:00 } in chunk one, { user1, 12:00:01 } to { user2, 04:00:00 } in chunk two and { user2, 04:00:01 } to { MaxValue } in chunk three. MinValue and MaxValue are special values that are smaller or larger, respectively, than everything else. The first chunk doesn't actually start with the first value (in your example { user1, 10:00:00 }) but rather with MinValue.
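To make the compound-key ordering concrete, here is a minimal sketch in plain JavaScript (not MongoDB code). It assumes the three example chunks from the answer, models each chunk as a half-open range (lower bound included, upper bound excluded), and uses the made-up names MIN, MAX, compareKey and chunkFor for illustration. Keys are compared field by field, userid first and lastupdatetime second.

```javascript
// Sentinels standing in for the special smallest/largest values (names are hypothetical).
const MIN = Symbol("MinValue");
const MAX = Symbol("MaxValue");

// Compare two scalar key parts, treating MIN/MAX as below/above everything.
function comparePart(a, b) {
  if (a === b) return 0;
  if (a === MIN || b === MAX) return -1;
  if (a === MAX || b === MIN) return 1;
  return a < b ? -1 : 1;
}

// Compare compound keys field by field: userid first, then lastupdatetime.
function compareKey(a, b) {
  for (let i = 0; i < a.length; i++) {
    const c = comparePart(a[i], b[i]);
    if (c !== 0) return c;
  }
  return 0;
}

// Example chunk boundaries; each chunk covers [min, max).
const chunks = [
  { name: "chunk1", min: [MIN, MIN], max: ["user1", "12:00:01"] },
  { name: "chunk2", min: ["user1", "12:00:01"], max: ["user2", "04:00:01"] },
  { name: "chunk3", min: ["user2", "04:00:01"], max: [MAX, MAX] },
];

// Find the single chunk whose range contains the given compound key.
function chunkFor(key) {
  const hit = chunks.find(
    (c) => compareKey(c.min, key) <= 0 && compareKey(key, c.max) < 0
  );
  return hit.name;
}

console.log(chunkFor(["user1", "10:00:00"])); // chunk1
console.log(chunkFor(["user1", "14:05:33"])); // chunk2
console.log(chunkFor(["user2", "09:00:00"])); // chunk3
```

Note that user1's documents end up in two different chunks, which is only possible because the second field gives the key a split point inside a single userid. The sketch compares strings, so "user10" would sort before "user2"; zero-padded times are used for the same reason.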

Related

MongoDB - Weird difference in _id Index size

I have two sharded collections on 12 shards, with the same number of documents. The shard key of Collection1 is compound (two fields are used), and its documents consist of 4 fields. The shard key of Collection2 is a single field, and its documents consist of 5 fields.
Via the db.collection.stats() command, I get the information about the indexes.
What seems strange to me is that for Collection1, the total size of the _id index is 1342MB.
For Collection2, however, the total size of the _id index is 2224MB. Is this difference reasonable? I was expecting the total size to be more or less the same because both hold the same number of documents. Note that the shard key for both collections does not include the _id field.
MongoDB uses prefix compression for indexes.
This means that if sequential values in the index begin with the same series of bytes, the bytes are stored for the first value, and subsequent values contain a tag indicating the length of the prefix.
Depending on the datatype of the _id values, the resulting savings could be quite substantial.
There may also be orphaned documents causing one node to have more entries in its _id index.
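As a rough illustration of why key shape matters, the following sketch estimates the size of a prefix-compressed list of sorted keys. It is a toy model and not the storage engine's real on-disk format: it assumes one byte per character and a one-byte prefix-length tag per entry, and the function names are made up.

```javascript
// Length of the shared leading run between two strings.
function commonPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// Toy size estimate: every entry costs a 1-byte prefix-length tag plus the
// characters that differ from the previous key (the first key is stored in full).
function prefixCompressedSize(sortedKeys) {
  let total = 0;
  let prev = "";
  for (const key of sortedKeys) {
    const shared = commonPrefixLength(prev, key);
    total += 1 + (key.length - shared);
    prev = key;
  }
  return total;
}

// Sequential-looking keys share long prefixes; random-looking keys do not.
const sequential = ["id000001", "id000002", "id000003"];
const random = ["a1b2c3d4", "k9z8y7x6", "q5w4e3r2"];

console.log(prefixCompressedSize(sequential)); // 13
console.log(prefixCompressedSize(random)); // 27
```

Both lists hold three 8-character keys, yet the estimate differs by roughly a factor of two, which is the kind of gap described in the question.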

Best shard key (or optimised query) for range query on sub-document array

Below is a simplified version of a document in my database:
{
  _id : 1,
  main_data : 100,
  sub_docs: [
    {
      _id : "a",
      data : 100
    },
    {
      _id : "b",
      data : 200
    },
    {
      _id : "c",
      data : 150
    }
  ]
}
So imagine I have lots of these documents with varied data values (say 0 - 1000).
Currently my query is something like:
db.myDb.find(
  { sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Is there any shard key I could use to help this query? As currently it is querying all shards.
If not is there a better way to structure my query?
Jackson,
You are thinking about this problem the right way. The problem with broadcast queries in MongoDB is that they don't scale.
Any MongoDB query that does not filter on the shard key will be broadcast to all shards. Also, range queries are likely to either cause broadcasts or, at the very least, be sent to multiple shards.
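The routing rule can be sketched in plain JavaScript. This is a simplified model of the decision and not the actual mongos implementation; the shard key customer_id, the chunk boundaries and the function name targetShards are all hypothetical.

```javascript
// Hypothetical chunk map for a collection sharded on { customer_id: 1 }.
// Each chunk covers the half-open range [min, max).
const chunkMap = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Return the shards a query has to visit. The filter may hold an equality or
// a { $gte, $lt } range on the shard key, or not mention the shard key at all.
function targetShards(filter, shardKey) {
  const cond = filter[shardKey];
  let overlaps;
  if (cond === undefined) {
    overlaps = () => true; // no shard key in the filter: broadcast
  } else if (typeof cond === "object") {
    const lo = cond.$gte ?? -Infinity;
    const hi = cond.$lt ?? Infinity;
    overlaps = (c) => c.min < hi && lo < c.max; // range query
  } else {
    overlaps = (c) => c.min <= cond && cond < c.max; // equality match
  }
  return [...new Set(chunkMap.filter(overlaps).map((c) => c.shard))];
}

console.log(targetShards({ customer_id: 1500 }, "customer_id"));
console.log(targetShards({ customer_id: { $gte: 900, $lt: 1100 } }, "customer_id"));
console.log(targetShards({ main_data: 100 }, "customer_id"));
```

The equality filter reaches one shard, the range reaches the two shards whose chunks it overlaps, and the filter without the shard key reaches all three.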
Here are some things to think about:
Query Frequency -- Is the range query your most frequent query? What is the expected workload?
Range Logic -- Is there any intrinsic logic to how you are going to apply the ranges? Say 0-200 is small and 200-400 is medium; you could then add a field for that category to your document and shard on it.
Additional shard key candidates -- Sometimes there are other fields that can be included in all or most of your queries and that would provide good distribution. By combining such a filter with your range queries you could restrict each query to one or a few shards.
Break array -- You could have multiple documents instead of an array. In this scenario, you would have one document for each element of the array, with the main data duplicated across those documents. A range query on this item would still be a problem, but it could involve only some shards rather than all of them (it depends on your data demographics and query patterns).
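The "Range Logic" idea could look roughly like the sketch below. Because a shard key field cannot hold an array, it assumes the "Break array" layout (one document per array element). It also assumes integer data values, a bucket width of 200, and a made-up field name data_bucket. The helper only builds a filter object and does not talk to a server.

```javascript
// Hypothetical bucket width: 0-199 is bucket 0, 200-399 is bucket 1, and so on.
const BUCKET_WIDTH = 200;

// Bucket number that would be stored in each document's data_bucket field.
const bucketOf = (value) => Math.floor(value / BUCKET_WIDTH);

// Buckets touched by the half-open integer range [gte, lt).
function bucketsForRange(gte, lt) {
  const buckets = [];
  for (let b = bucketOf(gte); b <= bucketOf(lt - 1); b++) buckets.push(b);
  return buckets;
}

// Build a filter that names the buckets explicitly, so a router could
// target only the shards owning those bucket values.
function buildQuery(gte, lt) {
  return {
    data_bucket: { $in: bucketsForRange(gte, lt) },
    data: { $gte: gte, $lt: lt },
  };
}

console.log(JSON.stringify(buildQuery(110, 160)));
console.log(JSON.stringify(buildQuery(150, 450)));
```

With only a handful of distinct bucket values the key has low cardinality, so in practice the bucket would probably need to be combined with a second field to distribute well.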
It boils down to the nature of your data and queries. The sample document that you provided is very anonymized so it is harder to know what would be good shard key candidates in your domain.
One last piece of advice: be careful with your insert/update patterns if you plan to update your documents frequently to add more entries to the array. Growing documents present scaling problems for MongoDB. See this article on this topic.

MongoDB shard key as (ObjectId, ObjectId, RandomKey). Unbalanced collections

I am trying to shard a collection with approximately 6M documents. Following are some details about the sharded cluster
Mongod version 2.6.7, two shards, 40 % writes, 60% reads.
My database has a collection events with around 6M documents. The normal document looks like below:
{
_id : ObjectId,
sector_id : ObjectId,
subsector_id: ObjectId,
.
.
.
Many event specific fields go here
.
.
created_at: Date,
updated_at: Date,
uid : 16DigitRandomKey
}
Each sector has multiple (1,2, ..100) subsectors and each subsector has multiple events. There are 10000 such sectors, 30000 subsectors and 6M events. The numbers keep growing.
The normal read query includes sector_id, subsector_id. Every write operation includes sector_id, subsector_id, uid (randomly generated unique key) and rest of the data.
I tried/considered following shard keys and the results are described below:
a. _id:hashed --> will not provide query isolation, because _id is not passed to the read query.
b. sector_id:1, subsector_id:1, uid:1 --> strange distribution: a few sectors with old ObjectIds go to shard 1, a few sectors with mid-age sector_id ObjectIds are well balanced and equally distributed among both shards, and a few sectors with recent ObjectIds stay on shard 0.
c. subsector_id:hashed --> results were the same as for shard key b.
d. subsector_id:1, uid:1 --> same as b.
e. subsector_id:hashed, uid:1 --> cannot create such an index
f. uid:1 --> writes are distributed but there is no query isolation
What may be the reason for this uneven distribution? What would be the right shard key for the given data?
I see this as expected behaviour, Astro. The sector_id and subsector_id values are ObjectIds, which contain a timestamp in their first 4 bytes. That makes them monotonically increasing, so new values always go to the same chunk (and hence the same shard); they fail to provide the needed randomness, which you also pointed out in (b).
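The timestamp prefix is easy to see by decoding the hex form of an ObjectId. This is a small sketch in plain JavaScript; the helper name objectIdSeconds is made up, and the second id is a hypothetical later value that differs only in its leading bytes.

```javascript
// An ObjectId is 12 bytes, usually shown as 24 hex characters.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function objectIdSeconds(hex) {
  return parseInt(hex.slice(0, 8), 16);
}

const older = "507f1f77bcf86cd799439011"; // example value
const newer = "5f7f1f77bcf86cd799439011"; // hypothetical later value

console.log(objectIdSeconds(older)); // 1350508407
console.log(objectIdSeconds(newer) > objectIdSeconds(older)); // true
console.log(older < newer); // true: the ids sort in creation order
```

Because newer ids always sort after older ones, a range-based shard key that leads with such a field keeps sending fresh values to the chunk at the top of the key range.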
The best way to choose a shard key is to pick a key that has business meaning (unlike an ObjectId field) and to combine it with a hash as a suffix to ensure a good random mix for equal distribution. If you have a sectorName and subsectorName, please try those out and let us know if it works.
You may want to consult this link when choosing the right shard key:
MongoDB shard by date on a single machine

MongoDB Sharded collection shard key confusion

Suppose I have a DB called 'mydb' and a collection in that DB called 'people', and the documents in mydb.people all have a 5-digit US zip code field, e.g. 90210. If I set up a sharded collection by splitting this collection into 2 shards using the zip code as the shard key, how would document insertion be handled?
So if I insert a document with zipcode = 00000, would that go to the first shard because this zip code value is less than the center zipcode value of 50000? And if I insert a document with zipcode = 99999, would it be inserted into the second shard?
I set up a sharded cluster according to http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ with a collection sharded on the zipcode key across 2 DBs, and I am not finding this even distribution of documents.
Also what do they mean by Chunk size? A chunk is basically a range of the shard index, right? Why do they talk about chunk sizes in sizes of MB and not in terms of range of the shard key?
Confusing

Update document record to the corresponding tag aware shard db

Does Mongo move the newly updated document in tag aware shard db set up to the correct shard?
We have the following setup with MongoDB ver. 2.4.6 and C# driver 1.8.3. It did not return the expected result for the update scenario with tag aware sharding. Please help review the following scenario and let us know whether MongoDB is capable of this.
We have the following setup for the experiment:
//use the default 'test' database
db = db.getSiblingDB('test');
//Add shards
sh.addShard( "shard0001.local:27017" );
sh.addShard( "shard0002.local:27017" );
//Enable sharding for the database,
sh.enableSharding("test");
//Enable sharding for a collection,
sh.shardCollection("test.persons", { "countryCode": 1, "_id": 1 } );
//Add shard tags,
sh.addShardTag("shard0001", "USA");
sh.addShardTag("shard0002", "JPN");
//Tag shard key ranges,
sh.addTagRange("test.persons", { countryCode: 0 }, { countryCode: 1 }, "USA");
sh.addTagRange("test.persons", { countryCode: 1 }, { countryCode: 2 }, "JPN");
Then we execute the following script for the initial data population:
//MongoDB sharding test,
db = db.getSiblingDB('test')
//Load data
//USA: countryCode 0
//JPN: countryCode 1
for (var i = 0; i < 1000; i++) {
  db.persons.insert( { name: "USA-" + i, countryCode: 0 } );
  db.persons.insert( { name: "JPN-" + i, countryCode: 1 } );
}
At this point, we have 1000 records on each shard: 1000 records with the USA country code on shard0001 and 1000 records with the JPN country code on shard0002.
From C#, we have the following pseudo code:
collection.insert( 1 document of countryCode=0)
collection.insert( 1 document of countryCode=1)
Upon execution, we have 1001 documents for each shard, so far so good.
Then we updated one document on shard0001 from countryCode=0 to countryCode=1 by its _id. However, we ended up with 1002 records on the JPN shard (shard0002) and 1001 records on the USA shard (shard0001). It appears that mongos routed the update to shard0002 based on the new countryCode of 1 and executed an insert there, and never updated the document on shard0001. Hence we now have 2 documents with the same _id on two different shards.
We were expecting that mongo would update the actual document on shard0001 and then realize that changing countryCode from 0 to 1 requires moving that document to shard0002. Does Mongo do this automatically?
We know we can manually remove the document from shard0001, but do we really have to do this on our own?
If you check the documentation for the key range assignment, it has this notice: "To assign a tag to a range of shard keys use the sh.addTagRange() method when connected to a mongos instance. Any given shard key range may only have one assigned tag. You cannot overlap defined ranges, or tag the same range more than once."
The reason is that in the background MongoDB will make splits so that each chunk relates only to one specific tag's key range, and this way chunks can be moved around according to the tagging. So the alignment defined by the tags is ensured in two steps:
The engine makes splits so that each chunk's key range belongs exclusively to one shard tag.
In a balancing round, the chunks are aligned/moved around according to the current shard-tag mapping.
I assume you checked the number of docs on the shards by connecting to them directly, not through the mongos instance. Whether or not it is a bug that you ended up with the two docs, due to the shard key mapping you are only able to access one of them through mongos because of the key range based alignment. If it is not deleted automatically, it is definitely a BUG and it should be addressed. I cannot check the jira from my current location. I will set up some tests and get back to you with the results. That was due to a misunderstanding of the described behaviour.
Based on your comments that you used the save command to perform the update, and on this documentation, the situation is as follows. When you save a document with the
{countryCode: 1, _id: x}
combination, which is a new one (you previously had {countryCode: 0, _id: x}), the _id is the same as that of another document (the old one), and the new document resides on another shard (this is true because the tagging is based on countryCode). It will be inserted without a problem because the uniqueness of the _id field is only ensured inside a given shard and collection. Across different shards, if _id is not the shard key, or not the first field in a compound shard key, there is no guarantee that the values of the _id field are globally unique. Basically, when _id is generated it will most likely be unique; this situation is the exception, because you gave the same _id on purpose to get update-like behaviour.
To answer your question: if this is not the behaviour you expect, you have to delete the old document and then create the new one, or, more safely, mark the old one as deleted (with a flag or similar, handled on the application side) and later look for the flagged docs and actually remove them if needed (for example if you run out of space).
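The behaviour can be reproduced with a toy model in plain JavaScript. This is not MongoDB or driver code: each shard is modelled as a Map keyed by _id, so _id is unique only within a shard, and the helper names (shardFor, save, changeCountry, countCopies) are made up for illustration.

```javascript
// Toy model: each shard enforces _id uniqueness only locally.
const shards = { shard0001: new Map(), shard0002: new Map() };

// Tag ranges from the question: countryCode 0 -> USA shard, 1 -> JPN shard.
const shardFor = (doc) => (doc.countryCode === 0 ? "shard0001" : "shard0002");

// save(): upsert by _id on whichever shard the document's shard key routes to.
function save(doc) {
  shards[shardFor(doc)].set(doc._id, doc);
}

// Delete-then-insert: remove the copy stored under the old shard key,
// then save the document under its new shard key.
function changeCountry(oldDoc, newCountryCode) {
  shards[shardFor(oldDoc)].delete(oldDoc._id);
  save({ ...oldDoc, countryCode: newCountryCode });
}

// Number of shards that currently hold a document with this _id.
const countCopies = (id) =>
  Object.values(shards).filter((s) => s.has(id)).length;

// Naive "update" via save with a changed shard key: the old copy survives.
save({ _id: "x", name: "USA-1", countryCode: 0 });
save({ _id: "x", name: "USA-1", countryCode: 1 });
console.log(countCopies("x")); // 2

// Explicit delete-then-insert leaves exactly one copy.
const doc = { _id: "y", name: "USA-2", countryCode: 0 };
save(doc);
changeCountry(doc, 1);
console.log(countCopies("y")); // 1
```

The flag-based variant from the answer would replace the delete call with an update that marks the old copy, leaving the physical removal to a later cleanup job.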