To quote the docs:
When creating an index, the number associated with a key specifies the
direction of the index, so it should always be 1 (ascending) or -1
(descending). Direction doesn't matter for single key indexes or for
random access retrieval but is important if you are doing sorts or
range queries on compound indexes.
However, I see no reason why direction of the index should matter on compound indexes. Can someone please provide a further explanation (or an example)?
MongoDB concatenates the compound key in some way and uses it as the key in a BTree.
When finding single items, the order of the nodes in the tree is irrelevant.
If you are returning a range of nodes, the elements close to each other will be down the same branches of the tree. The closer the nodes are in the range, the quicker they can be retrieved.
With a single field index, the order won't matter. If they are close together in ascending order, they will also be close together in descending order.
When you have a compound key, the order starts to matter.
For example, if the key is A ascending B ascending the index might look something like this:
Row  A  B
1    1  1
2    2  6
3    2  7
4    3  4
5    3  5
6    3  6
7    5  1
A query for A ascending B descending will need to jump around the index out of order to return the rows and will be slower. For example it will return Row 1, 3, 2, 6, 5, 4, 7
A ranged query in the same order as the index will simply return the rows sequentially in the correct order.
Finding a record in a BTree takes O(log n) time. Finding a range of records in order takes only O(log n + k), where k is the number of records to return.
If the records are out of order, the cost could be as high as O(k log n).
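The difference can be illustrated with a small simulation. This is only a sketch in plain JavaScript, not MongoDB's actual B-tree: the index is modeled as an array holding the seven rows from the table above, and `visitOrder` and `jumps` are hypothetical helpers introduced here to show how often a sort has to leave the sequential path through the index.

```javascript
// Toy model of the { A: 1, B: 1 } index: entries sorted by A, then B.
// Row numbers match the table above and double as index positions.
const index = [
  { row: 1, A: 1, B: 1 },
  { row: 2, A: 2, B: 6 },
  { row: 3, A: 2, B: 7 },
  { row: 4, A: 3, B: 4 },
  { row: 5, A: 3, B: 5 },
  { row: 6, A: 3, B: 6 },
  { row: 7, A: 5, B: 1 },
];

// Hypothetical helper: the rows in the order a given sort needs them.
// dirA and dirB are 1 (ascending) or -1 (descending).
function visitOrder(entries, dirA, dirB) {
  return [...entries]
    .sort((x, y) => dirA * (x.A - y.A) || dirB * (x.B - y.B))
    .map((e) => e.row);
}

// Counts how often the next row is not adjacent to the previous one.
function jumps(rows) {
  let count = 0;
  for (let i = 1; i < rows.length; i++) {
    if (Math.abs(rows[i] - rows[i - 1]) !== 1) count++;
  }
  return count;
}

const sameOrder = visitOrder(index, 1, 1);   // [1, 2, 3, 4, 5, 6, 7]
const mixedOrder = visitOrder(index, 1, -1); // [1, 3, 2, 6, 5, 4, 7]
console.log(sameOrder, jumps(sameOrder));    // 0 jumps
console.log(mixedOrder, jumps(mixedOrder));  // 3 jumps
```

A fully reversed sort (A descending, B descending) also yields zero jumps in this model, because the index can simply be walked backwards.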
The simple answer that you are looking for is that the direction only matters when you are sorting on two or more fields.
If you are sorting on {a : 1, b : -1}:
Index {a : 1, b : 1} will be slower than index {a : 1, b : -1}
Why indexes
Understand two key points.
While an index is better than no index, the correct index is much better than either.
MongoDB will generally use only one index per query, making compound indexes with proper field ordering what you probably want to use.
Indexes aren't free. They take memory, and impose a performance penalty when doing inserts, updates and deletes. Normally the performance hit is negligible (especially compared to gains in read performance), but that doesn't mean that we can't be smart about creating our indexes.
How Indexes
Identifying what group of fields should be indexed together is about understanding the queries that you are running. The order of the fields used to create your index is critical. The good news is that, if you get the order wrong, the index won't be used at all, so it'll be easy to spot with explain.
Why Sorting
Your queries might need sorting. But sorting can be an expensive operation, so it's important to treat the fields that you are sorting on just like fields that you are querying: the sort will be faster if it can use an index. There is one important difference, though: the field that you are sorting on must be the last field in your index. The only exception to this rule is when the field is also part of your query; in that case the must-be-last rule doesn't apply.
How Sorting
You can specify a sort on all the keys of the index or on a subset; however, the sort keys must be listed in the same order as they appear in the index. For example, an index key pattern { a: 1, b: 1 } can support a sort on { a: 1, b: 1 } but not on { b: 1, a: 1 }.
The sort must specify the same sort direction (i.e. ascending/descending) for all its keys as the index key pattern or specify the reverse sort direction for all its keys as the index key pattern. For example, an index key pattern { a: 1, b: 1 } can support a sort on { a: 1, b: 1 } and { a: -1, b: -1 } but not on { a: -1, b: 1 }.
Suppose there are these indexes:
{ a: 1 }
{ a: 1, b: 1 }
{ a: 1, b: 1, c: 1 }
Example                                                   Index Used
db.data.find().sort( { a: 1 } )                           { a: 1 }
db.data.find().sort( { a: -1 } )                          { a: 1 }
db.data.find().sort( { a: 1, b: 1 } )                     { a: 1, b: 1 }
db.data.find().sort( { a: -1, b: -1 } )                   { a: 1, b: 1 }
db.data.find().sort( { a: 1, b: 1, c: 1 } )               { a: 1, b: 1, c: 1 }
db.data.find( { a: { $gt: 4 } } ).sort( { a: 1, b: 1 } )  { a: 1, b: 1 }
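The two sort rules quoted above can be expressed as a small check. This is a sketch, not MongoDB's query planner: `indexSupportsSort` is a hypothetical function, and it assumes the query has no equality filter on leading index fields (which would relax the prefix requirement).

```javascript
// Sketch of the two sort rules quoted above; not MongoDB's query planner.
// Key patterns are plain objects such as { a: 1, b: -1 }.
function indexSupportsSort(indexPattern, sortPattern) {
  const indexKeys = Object.entries(indexPattern);
  const sortKeys = Object.entries(sortPattern);
  if (sortKeys.length === 0 || sortKeys.length > indexKeys.length) return false;

  // Rule 1: sort keys must appear in the same order as the index keys.
  // Rule 2: directions must all match the index, or all be reversed.
  let allSame = true;
  let allReversed = true;
  for (let i = 0; i < sortKeys.length; i++) {
    const [indexField, indexDir] = indexKeys[i];
    const [sortField, sortDir] = sortKeys[i];
    if (indexField !== sortField) return false;
    if (sortDir !== indexDir) allSame = false;
    if (sortDir !== -indexDir) allReversed = false;
  }
  return allSame || allReversed;
}

console.log(indexSupportsSort({ a: 1, b: 1 }, { a: 1, b: 1 }));   // true
console.log(indexSupportsSort({ a: 1, b: 1 }, { a: -1, b: -1 })); // true
console.log(indexSupportsSort({ a: 1, b: 1 }, { a: -1, b: 1 }));  // false
console.log(indexSupportsSort({ a: 1, b: 1 }, { b: 1, a: 1 }));   // false
console.log(indexSupportsSort({ a: 1, b: 1, c: 1 }, { a: 1 }));   // true
```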
Related
I'm using mongodb (any version, I'd happily upgrade to make this work as I'm asking). I have a large collection (~50 million documents) with several numeric fields, e.g.:
{_id: 'doc1', a: 5, b: 2, c: 0},
{_id: 'doc2', a: 4, b: 9, c: 6},
{_id: 'doc3', a: 1, b: 7, c: 4},
{_id: 'doc4', a: 8, b: 1, c: 1},
...
I'd like to sort this collection by various linear combinations of a, b, and c (in reality there are something like 20 fields I'd like to combine). So for example I'd like to sort by 3*a + 4*b + 10*c. The weights (3, 4 and 10 in my example) are something I'd like to experiment with rapidly.
I didn't see a simple way for indexes to support efficient sorting on linear combinations of fields. I know I can do this with the aggregation pipeline, but I think it will still require a collection scan for every new set of weights (?).
If I were implementing mongodb I can imagine that perhaps I could compute an index on the expression 3*a + 4*b + 10*c efficiently by using indices on a, b and c, i.e. I wouldn't need to do a full collection scan to compute a new linearly weighted index. I'm not sure if that's true theoretically, and if it translates into anything I can do practically to solve this problem.
Any input is welcome!
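For reference, the aggregation-pipeline approach mentioned in the question could look roughly like the sketch below. `weightedSortPipeline` is a hypothetical helper, the field name `score` and the descending sort direction are assumptions, and the pipeline still computes the score for every document, so it does not avoid the scan the question is worried about.

```javascript
// Hypothetical helper: builds an aggregation pipeline that sorts by a
// weighted sum of fields, e.g. 3*a + 4*b + 10*c. The field name "score"
// and the descending direction are assumptions made for this sketch.
function weightedSortPipeline(weights) {
  const terms = Object.entries(weights).map(([field, weight]) => ({
    $multiply: [weight, "$" + field],
  }));
  return [
    { $addFields: { score: { $add: terms } } },
    { $sort: { score: -1 } },
  ];
}

const pipeline = weightedSortPipeline({ a: 3, b: 4, c: 10 });
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell, the result could be passed to aggregate() on the
// collection in question.
```

Changing the weights only means calling the helper with a different object, which suits rapid experimentation, at the cost of recomputing the score on each run.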
If I choose {a:1,b:1,c:1} as my shard key with a hashed sharding strategy, and my query filters on {a:1}, is the query a targeted operation, or is it broadcast to every shard in the cluster?
If it is a targeted operation, how does MongoDB determine the target? The hash of {a:1} is completely different from the hash of {a:1,b:1,c:1}.
The simple answer is: yes, the query has to go to more than one shard.
Look at it this way:
Let's assume you have got the following collection:
//1
{
a: 1,
b: 1,
c: 1,
d: 1
},
//2
{
a: 1,
b: 1,
c: 1,
d: 2
},
//3
{
a: 1,
b: 1,
c: 2,
d: 5
}
According to your shard key, docs 1 and 2 must be in the same chunk (let's say, on shard number 1), while doc 3 could be stored in a different chunk (let's say, on shard number 2).
Now, if you search for {a: 1}, all three docs should be returned, meaning that mongo has to send the query to both shard no. 1 and shard no. 2.
As for your second question: in MongoDB versions before 4.4 you cannot create a compound hashed index at all (and even if you could, then yes, the hashed value would probably be different).
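The routing argument can be made concrete with a toy model. Everything here is an assumption for illustration: `toyHash` is a made-up hash (not the one MongoDB uses), there are two shards numbered 0 and 1, and `targetShards` is a hypothetical stand-in for the router's decision.

```javascript
// Toy model of hashed routing. The hash below is made up for this sketch
// and is NOT the function MongoDB uses; shard numbering is also assumed.
const SHARD_COUNT = 2;
const KEY_FIELDS = ["a", "b", "c"];

function toyHash(values) {
  return values.reduce((h, v) => h * 31 + v, 0);
}

// A document's shard is derived from the hash of the complete key.
function shardFor(doc) {
  return toyHash(KEY_FIELDS.map((f) => doc[f])) % SHARD_COUNT;
}

// A filter can be routed to one shard only if it fixes every key field;
// otherwise the hash cannot be computed and all shards must be asked.
function targetShards(filter) {
  const hasFullKey = KEY_FIELDS.every((f) => f in filter);
  if (hasFullKey) return [shardFor(filter)];
  return Array.from({ length: SHARD_COUNT }, (_, i) => i);
}

const docs = [
  { a: 1, b: 1, c: 1, d: 1 },
  { a: 1, b: 1, c: 1, d: 2 },
  { a: 1, b: 1, c: 2, d: 5 },
];
console.log(docs.map(shardFor));                 // [1, 1, 0]
console.log(targetShards({ a: 1 }));             // [0, 1]
console.log(targetShards({ a: 1, b: 1, c: 1 })); // [1]
```

In this model docs 1 and 2 share a shard because their keys are identical, doc 3 lands elsewhere, and a filter on {a: 1} alone cannot be narrowed to a single shard.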
I have an object that holds three Double values. Two objects are equal if all three values are equal in any possible combination (that is, regardless of order). I need a function to determine the number of "unique" objects in an array.
I thought about making a Set from the array and returning its count, but conforming to the Hashable protocol requires a hashValue function...
I'm coding in Swift, but an algorithm in any language (except alien) would be appreciated :)
So I need a hashValue for three doubles (where the order of the values doesn't matter), or any other way of determining the number of unique objects in an array.
UPDATE: "Unique" means not equal. As I said above, two objects with double values (a, b, c for example) are equal when all three values are equal in any possible combination. For example:
obj1 = (a: 1, b: 2, c: 3)
obj2 = (a: 3, b: 2, c: 1)
// obj1 is equal obj2
Here is an example of making a class holding 3 Doubles Hashable so that you can use Set to determine how many unique ones are held in an array:
class Triple: Hashable {
var a: Double
var b: Double
var c: Double
// hashValue need only be the same for "equal" instances of the class,
// so the hashValue of the smallest property will suffice
var hashValue: Int {
return min(a, b, c).hashValue
}
init(a: Double, b: Double, c: Double) {
self.a = a
self.b = b
self.c = c
}
}
// Protocol Hashable includes Equatable, so implement == for type Triple:
// Compare sorted values to determine equality
func ==(lhs: Triple, rhs: Triple) -> Bool {
return [lhs.a, lhs.b, lhs.c].sorted() == [rhs.a, rhs.b, rhs.c].sorted()
}
let t1 = Triple(a: 3, b: 2, c: 1)
let t2 = Triple(a: 1, b: 2, c: 3)
let t3 = Triple(a: 1, b: 3, c: 2)
let t4 = Triple(a: 3, b: 3, c: 2)
let arr = [t1, t2, t3, t4]
// Find out how many unique Triples are in the array:
let set = Set(arr)
print(set.count) // 2
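Since the question accepts an algorithm in any language, here is the same count sketched in JavaScript using a different technique: instead of a custom hash, each triple is reduced to a canonical key (its values sorted numerically), and the keys are collected in a Set. `countUnique` is a hypothetical helper, and the sketch assumes the values are ordinary finite numbers.

```javascript
// Sketch: count triples that are distinct regardless of value order.
// Each triple is mapped to a canonical string key built from its sorted
// values; equal triples therefore produce equal keys.
function countUnique(triples) {
  const keys = triples.map(({ a, b, c }) =>
    [a, b, c].sort((x, y) => x - y).join(",")
  );
  return new Set(keys).size;
}

const triples = [
  { a: 3, b: 2, c: 1 },
  { a: 1, b: 2, c: 3 },
  { a: 1, b: 3, c: 2 },
  { a: 3, b: 3, c: 2 },
];
console.log(countUnique(triples)); // 2
```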
Discussion: A better hashing function
As @PatriciaShanahan noted in the comments, using min in the computation of hashValue has a couple of drawbacks:
min can be expensive since it involves comparisons.
Only considering the smallest item when computing the hashValue for Triple will result in many "common" Triples with the same hashValue. For instance, any Triple with a value of 0 and two positive values would have the same hashValue.
I chose min because I felt it was easy to understand that the min(a, b, c) would result in the same value for all orderings of a, b, and c which means we'd get the same hashValue for all orderings. That is important because since we are considering Triples to be equal independent of the orderings of the 3 values, the hashValue must be the same for any ordering of the values since a == b implies a.hashValue == b.hashValue.
I had considered other hashing functions such as:
(a + b + c).hashValue
(a * b * c).hashValue
but these are flawed. Mathematically speaking, addition and multiplication are commutative, so theoretically these would result in the same hashValue no matter the order of a, b, and c. But, in practice, changing the order of operations could result in an overflow or underflow.
Consider the following example:
let a = Int.max
let b = Int.min
let c = 5
let t1 = (a + b) + c // 4
let t2 = (a + c) + b // Overflow!
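For floating-point values there is a related problem even without overflow: addition is not associative, so summing the same three values in a different order can round differently. The sketch below shows this in JavaScript, whose numbers are IEEE 754 doubles; `sumInOrder` is a hypothetical helper that mirrors a hash based on (a + b + c).

```javascript
// Floating-point addition is not associative, so the order in which the
// three values are summed can change the result, and with it the hash.
function sumInOrder(a, b, c) {
  return (a + b) + c;
}

const forward = sumInOrder(0.1, 0.2, 0.3);  // 0.6000000000000001
const backward = sumInOrder(0.3, 0.2, 0.1); // 0.6
console.log(forward, backward, forward === backward); // false
```

Two triples holding the same values in a different order would therefore be equal but could hash differently, which violates the requirement stated above.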
An ideal hashing function for class Triple would:
Guarantee the same hashValue for all orderings of a, b, and c.
Be fast to compute.
Change the result if any of a, b, or c changes.
Never overflow or underflow.
One good mathematical operation for combining numbers is the bitwise exclusive OR (XOR) operator ^. It combines two values by comparing them bitwise and setting the resulting bit to 0 if both bits are the same and to 1 if the bits are different.
a b result
--- --- ------
0 0 0
0 1 1
1 0 1
1 1 0
Extending this to 3 values:
a b c result
--- --- --- ------
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 1
As shown in the table above, XOR(0, 0, 0) = 0 for all orderings, XOR(1, 0, 0) = 1 for all orderings, XOR(1, 1, 0) = 0 for all orderings, and XOR(1, 1, 1) = 1 for all orderings. So using exclusive OR to combine the values meets the first criterion of providing the same result for all orderings.
Exclusive OR is a fast operation. It is implemented by a single assembly instruction. So it meets the second criterion of a good hashing function.
If any one of a, b, or c changes, then the result of a ^ b ^ c changes. So exclusive OR meets the third criterion of a good hashing function.
Exclusive OR cannot overflow or underflow because it simply sets the bits. So it meets the fourth criterion of a good hashing function.
Thus, a better hashing function would be:
var hashValue: Int {
return a.hashValue ^ b.hashValue ^ c.hashValue
}
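The order independence of XOR is easy to check. The following sketch uses JavaScript, where ^ is the same bitwise exclusive OR on integers; the input values stand in for the three hash codes and are arbitrary, and `xorHash` is a hypothetical helper.

```javascript
// Sketch: XOR-combining three integer hash codes gives the same result
// for every ordering of the inputs. The input values are arbitrary.
function xorHash(a, b, c) {
  return a ^ b ^ c;
}

const permutations = [
  [5, 9, 3], [5, 3, 9], [9, 5, 3],
  [9, 3, 5], [3, 5, 9], [3, 9, 5],
];
const results = permutations.map(([a, b, c]) => xorHash(a, b, c));
console.log(results);                     // [15, 15, 15, 15, 15, 15]
console.log(new Set(results).size === 1); // true
```

One caveat worth keeping in mind: two equal inputs cancel each other out, so for example (1, 1, 2) and (3, 3, 2) both combine to 2. Such collisions are allowed for a hash, but they make it weaker for data with repeated values.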
I have mongo db collection which looks like the following
collection {
X: 1,
Y: 2,
Z: 3,
T_update: 123,
T_publish: 243,
T_insert: 342
}
I have to create an index like
{X: 1, Y: 1, Z: 1, T_update: 1}
{X: 1, Y: 1, Z: 1, T_publish: 1}
{X: 1, Y: 1, Z: 1, T_insert: 1}
But what I see is that the prefix X: 1, Y: 1, Z: 1 is redundant across the indexes; only the time parameter, which I intend to use for sorting, changes. Is there a better way to create the above indexes so that I do not have to create three separate indexes?
Also, say I have an index like
{X: 1, Y: 1, Z: 1, T_update: 1}
and I want Mongo to return results such that X = 5, Y = any value, Z = 4, sort = T_update
will the above index be useful or should I create an index such as
{X:1, Z:1, T_update: 1},
I hope that I can avoid it.
The answer here is going to depend on the selectivity of the fields you are indexing - if the criteria you will be using to filter X, Y, or Z are not very selective then they can essentially be left out (or moved to the right of the compound key).
Let's say you are using a filter like Y is not equal to 1, where 1 is a rare value. Since you will be traversing almost the entire index to return most of the values, and scanning the data, having an index on Y first will be of less benefit than having the sort field first in the index. Given that scenario, if sorting on T_update, it would probably be beneficial to have an index like: {T_update: 1, Y : 1}.
In the end, there are lots and lots of permutations here in terms of what might be the most efficient way to index. The real way to figure out the best indexes for your data set is to use explain() and hint() to test the various indexes with your specific query pattern and data set.