Compound shard key chunk boundaries - MongoDB

How are compound shard keys used to generate new chunks? I mean, if I have a shard key like
{key1 : 1, key2 : 1}
and the collection is being populated as we go, how does MongoDB create the new chunk boundaries given that there are two keys? I can see them in the config server, BUT I cannot read them. They look like
[min key1,min key2] ---> [max key1, max key2]
and quite often, min key2 > max key2. How does that make sense?
In other words, how are the min and max being set on new chunks given that the shard key is compound?
key1 is of type string and key2 is of type int.
I would appreciate it if you could explain it with an example.

The initial chunk's boundary always runs from negative to positive infinity (MinKey to MaxKey). As you insert documents, MongoDB splits that initial chunk into smaller ones.
Here is a thread which should answer your question.
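To make that concrete, here is a minimal mongo shell sketch (the "mydb.mycol" namespace is a placeholder) showing how the chunk boundaries can be read from the config server, and why the second key's minimum can legitimately be larger than its maximum:
// list the chunk boundaries the balancer has created for the collection
db.getSiblingDB("config").chunks.find(
    { ns: "mydb.mycol" },
    { min: 1, max: 1, shard: 1, _id: 0 }
).sort({ min: 1 })
// example shape of one boundary (values are purely illustrative):
// { "min" : { "key1" : "apple", "key2" : 900 },
//   "max" : { "key1" : "banana", "key2" : 5 },
//   "shard" : "shard0000" }
Boundaries are compared field by field, like words in a dictionary: key2 is only consulted when two documents share the same key1. So a chunk from { "apple", 900 } to { "banana", 5 } is perfectly valid even though 900 > 5: it holds every document whose key1 sorts strictly between "apple" and "banana", plus documents with key1 = "apple" and key2 >= 900, plus documents with key1 = "banana" and key2 < 5.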

Related

Get Redis values while scanning

I've created a Redis key/value index this way:
set 7:12:321 '{"some":"JSON"}'
The key is delimited by a colon separator, each part of the key represents a hierarchic index.
get 7:12:321 means that I know the exact hierarchy and want only one single item
scan 0 MATCH 7:12:* means that I want every item under id 7 in the first level of the hierarchy and id 12 in the second level.
Problem is: if I want the JSON values, I have to first scan (~50000 entries in a few ms) and then get every key returned by scan one by one (800 ms).
This is not very efficient. And this is the only answer I found on Stack Overflow when searching for "scanning Redis values".
1/ Is there another way of scanning Redis to get values or key/value pairs and not only keys? I tried hscan as follows:
hset myindex 7:12:321 '{"some":"JSON"}'
hscan myindex 0 MATCH 7:12:*
 
But it destroys the performance (almost 4 s for the 50000 entries).
2/ Is there another data structure in Redis I could use in the same way but which could "scan for values" (hset?)
3/ Should I go with another data storage solution (PostgreSQL ltree, for instance?) to suit my use case with much better performance?
I must be missing something really obvious, 'cause this sounds like a common use case.
Thanks for your answers.
Optimization for your current solution
Instead of getting every key returned by SCAN one by one, you should use MGET to batch-fetch key-value pairs, or use a pipeline to reduce round trips (RTT).
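For illustration, a minimal redis-cli sketch reusing the example key from the question (the second key 7:12:322 and the COUNT hint are made up; in practice you would keep calling SCAN until the cursor returns 0 and pipeline the MGETs):
127.0.0.1:6379> SCAN 0 MATCH 7:12:* COUNT 1000
1) "0"
2) 1) "7:12:321"
   2) "7:12:322"
127.0.0.1:6379> MGET 7:12:321 7:12:322
1) "{\"some\":\"JSON\"}"
2) "{\"some\":\"JSON\"}"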
Efficiency problem of your current solution
The SCAN command iterates over all keys in the database, even if only a few keys match the pattern, so performance degrades as the total number of keys grows.
Another solution
Since the hierarchic index is an integer, you can encode the hierarchic indexes into a number, and use that number as the score of a sorted set. In this way, instead of searching by pattern, you can search by score range, which is very fast with a sorted set. Take the following as an example.
Say the first (right-most) hierarchic index is less than 1000 and the second index is less than 100. Then you can encode the index (e.g. 7:12:321) into a score (321 + 12 * 1000 + 7 * 100 * 1000 = 712321) and add the score and the value to a sorted set: zadd myindex 712321 '{"some":"JSON"}'.
When you want to search keys that match 7:12:*, just use the ZRANGEBYSCORE command to get data with a score between 712000 and 712999: zrangebyscore myindex 712000 712999 withscores.
In this way, you can get the key (decoded from the returned score) and the value together. It should also be faster than the scan solution.
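A short redis-cli transcript of that round trip, reusing the example values above:
127.0.0.1:6379> ZADD myindex 712321 '{"some":"JSON"}'
(integer) 1
127.0.0.1:6379> ZRANGEBYSCORE myindex 712000 712999 WITHSCORES
1) "{\"some\":\"JSON\"}"
2) "712321"
Decoding the score back into the key is plain integer arithmetic: 712321 % 1000 = 321, (712321 / 1000) % 100 = 12 and 712321 / 100000 = 7, giving 7:12:321.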
UPDATE
The solution has a small problem: members of a sorted set must be unique, so you cannot have 2 keys with the same value (i.e. the same JSON string). A second ZADD with an existing member does not create a new entry; it only updates that member's score:
// adds a new member, returns 1
zadd myindex 712321 '{"the_same":"JSON"}'
// does NOT add a second member: it just moves the existing one to score 712322 and returns 0
zadd myindex 712322 '{"the_same":"JSON"}'
In order to solve this problem, you can combine the key with the JSON string to make each member unique:
zadd myindex 712321 '7:12:321-{"the_same":"JSON"}'
zadd myindex 712322 '7:12:322-{"the_same":"JSON"}'
You could consider using a Sorted Set and lexicographical ranges as long as you only need to perform prefix searches. For more information about this and indexing in general refer to http://redis.io/topics/indexes
Updated with an example:
Consider the following -
$ redis-cli
127.0.0.1:6379> ZADD anotherindex 0 '7:12:321:{"some":"JSON"}'
(integer) 1
127.0.0.1:6379> ZRANGEBYLEX anotherindex [7:12: [7:12:\xff
1) "7:12:321:{\"some:\"JSON\"}"
Now go and read about this so you 1) understand what it does and 2) know how to avoid possible pitfalls :)

MongoDB add fields of low cardinality to compound indexes?

I have read that putting indexes on low-cardinality fields is pointless.
Would this hold true for a compound index as such:
db.perms.createIndex({"owner": 1, "object_type": 1, "target": 1});
With queries as such:
db.perms.find({"owner": "me", "object_type": "square"});
db.perms.find({"owner": "me", "object_type": "circle", "target": "you"});
The number of distinct object_types would grow over time (probably no more than 10 or 20 max) but would only start out at about 2 or 3.
Similarly, would a hashed index be worth looking into?
UPDATE:
owner and target would grow immensely. Think of this like a file system wherein the owner would "own" a target (i.e. a file). But, like Unix systems, a file could be a folder, a symlink, or a regular file (hence the type). So although there are only 3 object_types, an owner and target combination could have thousands of entries with an even distribution of types.
I may not be able to answer your question, but here are my two cents on index cardinality:
Index cardinality refers to the number of index points for each of the different index types that MongoDB supports.
Regular - for every single key that we put in the index, there's certainly going to be an index point. In addition, if there is no key, then there's going to be an index point under the null entry. We get a 1:1 ratio relative to the number of documents in the collection in terms of index cardinality, which makes the index a certain size: it's proportional to the collection size in terms of its pointers to documents.
Sparse - when a document is missing the key being indexed, it's not in the index, because a sparse index doesn't keep entries for missing (null) keys. We're going to have index points that could potentially be less than or equal to the number of documents.
Multikey - this is an index on array values. There'll be multiple index points (for each element of the array) for each document. So, it'll be greater than the number of documents.
Let's say you update a document with a key called tags, and that update causes the document to be moved on disk (assume you are using the MMAPv1 storage engine). If the document has 100 tags in it, and the tags array is indexed with a multikey index, then 100 index points need to be updated to accommodate the move.
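To make those three cases concrete, a short mongo shell sketch against the perms collection from the question (the tags array field is hypothetical):
// regular index on "owner": one index entry per document,
// including a null entry for documents that lack the field
db.perms.createIndex({ owner: 1 })
// sparse index on "target": documents without "target" are simply left out,
// so the index can have fewer entries than the collection has documents
db.perms.createIndex({ target: 1 }, { sparse: true })
// multikey index: indexing an array field creates one index entry per
// array element, so index points can outnumber documents
db.perms.createIndex({ tags: 1 })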

mongodb range sharding with string field

I use an 'id' field in my MongoDB documents which is the hash of '_id' (the ObjectId field generated by Mongo). I want to use RANGE sharding on the 'id' field. The question is the following:
How can I set ranges for each shard when the shard key is some long string (for example, 64 chars)?
If you want your data to be distributed based on a hash key, MongoDB has a built-in way of doing that:
sh.shardCollection("yourDB.yourCollection", { _id: "hashed" })
This way, data will be distributed between your shards randomly, as well as uniformly (or very close to it).
Please note that you can't have both logical key ranges and random data distribution. It's either one or the other; they are mutually exclusive. So:
If you want random data distribution, use { fieldName: "hashed" } as your shard key definition.
If you want to manually control how data is distributed and accessed, use a normal shard key and define shard tags.
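For the second option, a minimal sketch of what tag-aware sharding could look like (the namespace, shard names, tag names and the split point "m" are all placeholders):
// range-shard on the string field itself
sh.shardCollection("mydb.mycol", { id: 1 })
// tag each shard, then pin key ranges to the tags
sh.addShardTag("shard0000", "A-M")
sh.addShardTag("shard0001", "N-Z")
sh.addTagRange("mydb.mycol", { id: MinKey }, { id: "m" }, "A-M")
sh.addTagRange("mydb.mycol", { id: "m" }, { id: MaxKey }, "N-Z")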

MongoDB shard key as (ObjectId, ObjectId, RandomKey). Unbalanced collections

I am trying to shard a collection with approximately 6M documents. Following are some details about the sharded cluster
Mongod version 2.6.7, two shards, 40% writes, 60% reads.
My database has a collection events with around 6M documents. The normal document looks like below:
{
  _id: ObjectId,
  sector_id: ObjectId,
  subsector_id: ObjectId,
  // ... many event-specific fields go here ...
  created_at: Date,
  updated_at: Date,
  uid: 16DigitRandomKey
}
Each sector has multiple (1,2, ..100) subsectors and each subsector has multiple events. There are 10000 such sectors, 30000 subsectors and 6M events. The numbers keep growing.
The normal read query includes sector_id, subsector_id. Every write operation includes sector_id, subsector_id, uid (randomly generated unique key) and rest of the data.
I tried/considered following shard keys and the results are described below:
a. _id:hashed --> will not provide query isolation, reason: _id is not passed to read query.
b. sector_id: 1, subsector_id: 1, uid: 1 --> Strange distribution: a few sectors with old ObjectIds go to shard 1, a few sectors with mid-age sector_ids (ObjectId) are well balanced and equally distributed between both shards, and a few sectors with recent ObjectIds stay on shard 0.
c. subsector_id: hashed --> results were same as shard key b.
d. subsector_id:1, uid:1 --> same as b.
e. subsector_id: hashed, uid: 1 --> cannot create such an index
f. uid:1 --> writes are distributed but no query isolation
What may be the reason for this uneven distribution? What would be the right shard key based on the given data?
I see it as expected behaviour, Astro: the sectorIds and subsectorIds are of ObjectId type, which stores a timestamp in its first 4 bytes and is therefore monotonic in nature. New documents would always go to the same chunk (and hence the same shard), failing to provide the randomness, as you also pointed out in point (b).
The best way to choose a shard key is to pick a key that has business meaning (unlike some ObjectId field), mixed with some hash as a suffix to ensure a good random spread for equal distribution. If you have a sectorName and subsectorName, please try those out and let us know if it works.
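For example, a hedged sketch of that suggestion, assuming sectorName and subsectorName exist on the documents and keeping the existing random uid as the suffix (the database name is a placeholder):
// build the supporting index first, since the collection already has data
db.events.createIndex({ sectorName: 1, subsectorName: 1, uid: 1 })
sh.shardCollection("mydb.events", { sectorName: 1, subsectorName: 1, uid: 1 })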
You may consider this link to choose the right shard key:
MongoDB shard by date on a single machine

MongoDB Compound Key Sharding And chunks vs disk size

After going through the 10gen manual, I can't seem to fully understand how sharding works in the following scenarios. I will use a document with userid, lastupdatetime, and data for the example:
Chunks contain an ordered list of shard key values. So if my shard key is userid, I expect chunk1 to contain a list of ids user1...user999 (up to the 64 MB limit) and chunk2 to hold user1000...user1999. Is that correct?
In the previous case, let's say that chunk1 is on shard1 and chunk2 is on shard2. If user1 (which is on shard1) has lots and lots of documents and all other users have 1-2 documents, it will make shard1's disk usage a lot bigger than shard2's. If this is correct, what is MongoDB's mitigation in that case?
How is a compound shard key ordered inside the chunks? For example, if the compound shard key is userid + lastupdatetime, is it safe to assume the following (assuming user1 has lots of documents):
chunk1 contains a list of values user1, 10:00:00; user1, 10:01:00; ...; user1, 14:04:11 (up to the 64 MB limit) and chunk2 holds user1, 14:05:33; user2, 9:00:00; ...; user34, 19:00:00; ...
Is that correct?
Yes, you are correct.
Your shard key determines where chunks can be split. If your shard key is "userid" then the smallest unit it can split on is the userid. MongoDB automatically sizes chunks based on the document sizes. So it's very likely that chunk1 (on shard1) only has, for example, documents with userIDs in the range 1..10, and chunk2 (on shard2) the documents where the userIDs are 11..1000. MongoDB automatically picks the best-fitting range that maps to each chunk.
That is correct as well. With a compound shard key, the "unit" in which documents can be divided is the combination of both fields. So you can have { MinKey } to { user1, 12:00:00 } in chunk one, { user1, 12:00:01 } to { user2, 04:00:00 } in chunk two and { user2, 04:00:01 } to { MaxKey } on chunk three. MinKey and MaxKey are special values that sort before or after everything else. The first chunk actually doesn't start with the first value (in your example { user1, 10:00:00 }) but rather with MinKey.
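As a side note, you can see that "unit" in action by splitting a chunk manually; a minimal sketch with a hypothetical namespace and illustrative values:
// a split point is always a full compound-key value (userid AND lastupdatetime),
// never a value of a single field on its own
sh.splitAt("mydb.mycol", { userid: "user1", lastupdatetime: ISODate("2015-01-01T12:00:01Z") })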