How do I index the following query in mongodb? - mongodb

I am trying to figure out the best index to use for this in mongodb:
db.articles.find({"images.url":{"$exists":true}, \
"source_id": {"$in":[ObjectId("511baf3aa56bde8e94000002"), ObjectId("511baf3aa56bde8e94000999")]}}) \
.sort({"published_at": -1})
I only want to include articles where images.url exists, so I'm wondering whether this should be a sparse index. I'm also not sure in which order to index the fields, as I've read the following pointers:
First, fields on which you will query for exact values. ("images.url": exists)
Second, fields on which you will sort. (:published_at)
Finally, fields on which you will query for a range of values. (source_id)
Also, in the example above, I am not sure whether source_id would be a range of values or not?
I was thinking:
index "images.url": -1, published_at: -1, source_id: 1, {sparse: true}
But I'm also torn on maximizing exclusivity for an index, so I am considering:
index source_id: 1, "images.url": -1, published_at: -1, {sparse: true}

If we have a collection like this
{ a:1, b:1, c:1 }
{ a:1, b:1, c:2 }
{ a:1, b:1, c:3 }
{ a:1, b:2, c:1 }
... // all permutations up to:
{ a:3, b:3, c:3 }
Imagine this collection stored in random order.
This is what the compound index on ({a:1,b:1,c:1}) would look like:
a: |        1        |        2        |        3        |
   |-----------------+-----------------+-----------------|
b: |  1  |  2  |  3  |  1  |  2  |  3  |  1  |  2  |  3  |
   |-----+-----+-----+-----+-----+-----+-----+-----+-----|
c: |1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|1|2|3|
for each a you have all its b with all its c in turn, okay?
For the query: db.xx.find({a:2}).sort({b:1}), you can see that the b elements are in order below the a=2; the index will be used for sorting - "scanAndOrder" : false in explain(). The same happens, if your query is db.xx.find({a:2,c:{$in:[1,3]}}).sort({b:1})
But this: db.xx.find({a:{$in:[1,3]}}).sort({b:1}).explain() will tell you "scanAndOrder" : true, which means that the index was not used for sorting (it was used for the query, though) - from the schema above you can see, that "b" is not in sequence for a=[1,3].
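The ordering argument can be checked without a database. The following is a minimal sketch in plain JavaScript that only mimics how a {a:1, b:1, c:1} index orders its keys; byKeys and alreadySorted are hypothetical helper names, and nothing here calls MongoDB or reproduces its query planner.

```javascript
// Build the 27 example documents (all combinations of a, b, c in 1..3).
const docs = [];
for (const a of [1, 2, 3])
  for (const b of [1, 2, 3])
    for (const c of [1, 2, 3]) docs.push({ a, b, c });

// Hypothetical helper: compare documents field by field, like a compound index key.
function byKeys(keys) {
  return (x, y) => {
    for (const k of keys) {
      if (x[k] !== y[k]) return x[k] - y[k];
    }
    return 0;
  };
}

// The documents in the order a {a:1, b:1, c:1} index would hold them.
const indexOrder = [...docs].sort(byKeys(["a", "b", "c"]));

// Walk the index in order, keep the matching entries, and report whether
// the sort field already comes out in non-decreasing order.
function alreadySorted(match, sortField) {
  const hits = indexOrder.filter(match);
  return hits.every((d, i) => i === 0 || hits[i - 1][sortField] <= d[sortField]);
}

const eqA = alreadySorted(d => d.a === 2, "b");
const eqAInC = alreadySorted(d => d.a === 2 && [1, 3].includes(d.c), "b");
const inA = alreadySorted(d => [1, 3].includes(d.a), "b");

console.log(eqA, eqAInC, inA); // true true false
```

The first two results correspond to "scanAndOrder" : false, the last one to "scanAndOrder" : true.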
That's why the efficient sequence for indexes is:
(1) exact matches (only one!)
(2) sort criteria
(3) matches that point to more than one document
In your case, there is no exact match; both queries return more than one document. Let's try this out in our example:
db.xx.find({a:{$in:[1,3]},b:{$in:[1,3]}}).sort({c:1}).explain(): uses the index for querying, but not for sorting, it scans 15 and returns 12 objects.
db.xx.find({b:{$in:[1,3]},c:{$in:[1,3]}}).sort({a:1}).explain(): uses the index for querying and for sorting, but scans 21 and returns 12 objects.
Which one is better? It will depend on your use case. If your find usually returns many documents, it could be more efficient to have the sort use the index - but if it normally returns only a few (out of many) then you might prefer the more efficient scan. Try it out and see what's better using explain()
Does this help?
regards
Ronald
P.S. I used this to create the example collection:
[1,2,3].forEach(function(a){
  [1,2,3].forEach(function(b){
    [1,2,3].forEach(function(c){
      db.xx.insert({a:a,b:b,c:c});
    })
  })
})

Related

What's the best Mongo index strategy that includes a date range

I have the following schema:
{
a: string;
b: date;
c: number;
}
My query is
find({
a: 'some value',
b: {
$gte: new Date('some date')
}
})
.sort({
c: -1
});
I have an index that is:
{ a: 1, b: 1, c: 1 }
But it's not using this index.
I have several other indexes, and when analyzing my explain(), it shows it's employing multiple other indexes to accomplish my query.
I believe since my "b" query is a date range, that's not considered an equality condition, so maybe that index won't work?
Should I have two indexes:
{ a: 1, c: 1} and separately { b: 1 }
Dates tend to be much more selective than other fields, so when you have an index that looks like {dateField: 1, otherField: 1}, the selectivity of the dateField means that otherField will be useless unless you have multiple items that share the same date.
Depending on what your data distribution actually looks like, you might consider {otherField: 1, dateField: 1} (which means that mongo can go through in sorted order to check whether the docs match your date query). In general, putting your sort field before any fields used in a range query is a good idea.
mLab's indexing docs are the best resource I've seen on index usage, and they recommend:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values
Second, one small $in array
Third, the field(s) on which you will sort in the same order and specification as the sort itself (sorting on multiple fields)
Finally, the field(s) on which you will query for a range of values in the order of most selective to least selective (see range operators below)
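The rule of thumb above can be sketched as a small helper that lays the fields out in that order. This is only an illustration: buildIndexSpec and its parameter names are hypothetical, it is plain JavaScript rather than a MongoDB API, and the field names a, b and c are taken from the question.

```javascript
// Hypothetical helper: orders index fields as equality, small $in, sort, range.
function buildIndexSpec({ equality = [], smallIn = [], sort = {}, range = [] }) {
  const spec = {};
  const add = (field, direction) => {
    // A field keeps the first position it is given.
    if (!(field in spec)) spec[field] = direction;
  };
  equality.forEach(f => add(f, 1));
  smallIn.forEach(f => add(f, 1));
  // Sort fields keep the order and direction of the sort itself.
  Object.entries(sort).forEach(([f, dir]) => add(f, dir));
  range.forEach(f => add(f, 1));
  return spec;
}

// The query from the question: equality on a, range on b, sort on c descending.
const spec = buildIndexSpec({ equality: ["a"], sort: { c: -1 }, range: ["b"] });
console.log(JSON.stringify(spec)); // {"a":1,"c":-1,"b":1}
```

The resulting object could be passed to createIndex(); whether it actually beats the existing indexes is something explain() has to confirm on real data.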

MongoDB compound shard key

I have a doubt regarding Mongo compound shard keys. Let's suppose I have document that is structured like this:
{
"players": [
{
"id": "12345",
"name": "John",
},
{
"id": "23415",
"name": "Doe",
}
]
}
The embedded player documents are always present, and there are always exactly two of them. I think that "players.0.id" and "players.1.id" would be a good choice of shard key because they are not monotonic and are evenly distributed.
What I can't understand from the documentation is if:
All documents with same "players.0.id" OR same "players.1.id" are supposed to be saved into the same Chunk, or
All documents with same "players.0.id" AND same "players.1.id" are supposed to be saved into the same Chunk.
In other words, if I query the Collection to get all games played by John (as player 1 or player 2) the query will be sent to one chunk or to all chunks?
You cannot create a shard key where part of the key is a multikey index (i.e. index on an array field). This is mentioned in Shard Key Index Type:
A shard key index cannot be an index that specifies a multikey index, a text index or a geospatial index on the shard key fields.
If you have exactly two items under the players field, why not create two sub-documents instead of using an array? An array is typically useful for use cases where you have multiple items of indeterminate number in a document. For example, this structure might work for your use case:
{
"players": {
"player_1": {
"id" : 12345,
"name": "John"
},
"player_2": {
"id": 54321,
"name": "Doe"
}
}
}
You can then create an index like:
> db.test.createIndex({'players.player_1.id':1, 'players.player_2.id':1})
To answer your questions, if you're using this shard key, then:
There is no guarantee that the same player_1.id and player_2.id will be on the same chunk. This will depend on your data distribution.
If you query John as player_1 OR player_2, the query will be sent to all shards. This is because you have a compound index as the shard key, and you're searching for an exact match on the non-prefix field.
To elaborate on question 2:
The query you're doing is this:
db.test.find({$or: [
{'players.player_1.id':123},
{'players.player_2.id':123}
]})
In a compound index, the entries are sorted first by player_1.id and then, within each player_1.id, by player_2.id. For example, if you have 10 documents with some combination of values for player_1.id and player_2.id, you can visualize the index like this:
player_1.id | player_2.id
------------|-------------
0 | 10
0 | 123
1 | 100
1 | 123
2 | 123
2 | 150
123 | 10
123 | 100
123 | 123
123 | 150
Note that the value player_2.id: 123 occurs multiple times in the table, once for each player_1.id. Also note that for each player_1.id value, the player_2.id values are sorted within it.
This is how MongoDB's compound index works and how it's sorted. There are more nuances to compound indexes than can be explained here, but the details are covered in the Compound Indexes page.
The effect of this ordering is that there are many, many identical player_2.id values spread across the index. Since the overall index is only sorted in terms of player_1.id, it is not possible to find an exact player_2.id without specifying player_1.id. Hence, the above query will be sent to all shards.
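To make the point concrete, here is a small sketch in plain JavaScript that uses the ten pairs from the table above. The helper names positions and isContiguous are made up for this illustration, and the code only models the key order, not MongoDB's routing logic.

```javascript
// The ten (player_1.id, player_2.id) pairs from the table, already in index order.
const entries = [
  [0, 10], [0, 123], [1, 100], [1, 123], [2, 123],
  [2, 150], [123, 10], [123, 100], [123, 123], [123, 150],
];

// Hypothetical helper: positions of the index entries that satisfy a predicate.
const positions = pred => entries.flatMap((e, i) => (pred(e) ? [i] : []));

// True when the positions form one unbroken run, i.e. a single range of the index.
const isContiguous = pos => pos.every((p, i) => i === 0 || p === pos[i - 1] + 1);

const byPlayer1 = positions(([p1]) => p1 === 123);   // [6, 7, 8, 9]
const byPlayer2 = positions(([, p2]) => p2 === 123); // [1, 3, 4, 8]

console.log(isContiguous(byPlayer1)); // true
console.log(isContiguous(byPlayer2)); // false
```

A match on player_1.id maps to one range of the key space, while a match on player_2.id alone is scattered across it.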

MongoDB index suggestion

I have the following query:
a : true AND (b : 1 OR b : 2) AND ( c: null OR (c > startDate AND c <endDate))
So basically I am thinking of a compound index on all three fields, because I have no sorting at all. In the first step, with the index on the boolean field, I will eliminate the largest portion of documents.
Then, for the second field, I saw that an OR clause creates two separate queries and then combines them while removing duplicates. So this should be pretty fast and efficient.
The last condition is a simple range of dates, so I think that adding the field to the index will be a good option.
Any suggestions on my thoughts? Thanks
This query:
a : true AND (b : 1 OR b : 2) AND ( c: null OR (c > startDate AND c <endDate))
could otherwise be translated as:
db.collection.find({
a:true,
b:{$in:[1,2]},
$or: [
{c:null},
{c: {$gt: startDate, $lt: endDate}}
]
})
Because of that $or you will most likely need two indexes. However, since the $or covers only c, its clauses only need an index on c. So that is our first index:
db.collection.ensureIndex({c:1})
We cannot serve the $or from a compound index, because compound indexes work on prefixes and each $or clause is evaluated as a completely separate query. For the rest of the query it would therefore be best to use a and b as the prefix of our index.
This means you just need an index to cover the other part of your query:
db.collection.ensureIndex({b:1,a:1})
We put b first because a is a boolean and therefore has low cardinality; the index should perform better with b first.
Note: I am unsure about an index on a at all due to its low cardinality.
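When translating a boolean expression into a find() filter, it can help to check the translation against a few in-memory documents first. The sketch below is plain JavaScript and does not touch MongoDB; matches is a hypothetical helper, and the dates and sample documents are made up for the illustration.

```javascript
// Hypothetical helper mirroring the filter above for a single in-memory document.
function matches(doc, startDate, endDate) {
  // In MongoDB, {c: null} also matches documents where c is missing.
  const cIsNull = doc.c === null || doc.c === undefined;
  const cInRange = doc.c instanceof Date && doc.c > startDate && doc.c < endDate;
  return doc.a === true && [1, 2].includes(doc.b) && (cIsNull || cInRange);
}

// Assumed sample range and documents.
const start = new Date("2013-01-01");
const end = new Date("2013-12-31");
const sample = [
  { a: true, b: 1, c: null },                   // matches: c is null
  { a: true, b: 2, c: new Date("2013-06-15") }, // matches: c inside the range
  { a: true, b: 3, c: null },                   // b is neither 1 nor 2
  { a: false, b: 1, c: null },                  // a is not true
  { a: true, b: 1, c: new Date("2014-02-01") }, // c outside the range
  { a: true, b: 2 },                            // matches: c is missing
];

const hits = sample.filter(d => matches(d, start, end)).length;
console.log(hits); // 3
```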

MongoDB Covered Query For Two Fields Without Compound Index

Can you perform a MongoDB covered query for two fields, for example
db.collection.find( { _id: 1, a: 2 } )
without having a compound index such as
db.collection.ensureIndex( { _id: 1, a: 1 } )
but instead having only one index for _id (you get that by default) and another index for field "a", as in
db.collection.ensureIndex( { a: 1 } )
In other words, I'd like to know if in order to perform a covered query for two fields I need a compound index vs. needing only two single (i.e., not compound) indexes, one for each field.
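A query generally uses only one index (index intersection, available since MongoDB 2.6, is the exception), and a covered query needs all of its queried and returned fields to be in the same index.
Your example shows _id as one of the elements of your index? _id needs to be unique in a collection, so it wouldn't make sense to make a compound index of _id and something else.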
If you instead had:
db.collection.ensureIndex( { a: 1, b: 1 })
You could then use the a index as needed, independently, or as a compound index with b.
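Whether a query can be covered comes down to whether one index holds every field the query filters on and returns. Here is a rough sketch of that check in plain JavaScript, assuming simple top-level, non-array fields and inclusion-style projections; isCovered is a hypothetical helper, not a MongoDB function, and it ignores the other restrictions on covered queries.

```javascript
// Hypothetical helper: can this single index answer the query without
// fetching any documents?
function isCovered(indexSpec, filter, projection) {
  const indexed = new Set(Object.keys(indexSpec));
  // Fields explicitly included by the projection.
  const returned = Object.keys(projection).filter(f => projection[f]);
  // _id is returned by default unless the projection mentions it.
  if (!("_id" in projection)) returned.push("_id");
  const queried = Object.keys(filter);
  return returned.length > 0 && [...queried, ...returned].every(f => indexed.has(f));
}

const compound = { a: 1, b: 1 };

console.log(isCovered(compound, { a: 2 }, { a: 1, b: 1, _id: 0 })); // true
console.log(isCovered(compound, { a: 2 }, { a: 1 }));               // false: _id is returned too
console.log(isCovered({ a: 1 }, { a: 2, b: 3 }, { a: 1, _id: 0 })); // false: b is not in the index
```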

Does order of indexes matter in MongoDB?

Is tip #25 in Tips and Tricks for MongoDB Developers correct?
It says that this query:
collection.find({"x" : criteria, "y" : criteria, "z" : criteria})
can be optimized with
collection.ensureIndex({"y" : 1, "z" : 1, "x" : 1})
I think it's false because for this to work, x should be in front. I thought the order of the fields in an index matters.
So where did I go wrong?
The order of the fields in the index only matters if the query doesn't include all of the fields in the index. This query is referencing all three fields so the order of the fields in the index doesn't matter.
See more details in the docs on compound indexes.
The order of the fields in the find query object is not relevant.
For beginners who want to understand it better:
MongoDB says: "The index contains references to documents sorted first by the values of the item field and, within each value of the item field, sorted by values of the stock field." What does this mean?
Let's create a compound index on fields a, b, c, and d in ascending order (1):
Model.createIndex({ a: 1, b: 1, c: 1, d: 1 });
I visualize it as:
at level 1, a list of references sorted in the specified order (1) by the value of the first indexed field (a);
at level 2, each reference at level 1 holds another set of references, sorted in the specified order (1) by the value of the second field in the chain (b);
at level 3, each reference at level 2 holds another set of references, sorted in the specified order (1) by the value of the third field in the chain (c);
at level 4, each reference at level 3 holds another set of references, sorted in the specified order (1) by the value of the fourth field in the chain (d).
This chain forms a tree structure, which is why it is stored in a B-tree data structure.
I would love to call this storage system a compound-index-chain in this context.
Normally we build indexes to support two types of operations: 1. query operations like find() and 2. non-query operations like sort().
Now you have created a compound index on { a: 1, b: 1, c: 1, d: 1 }. But creating the index alone is not enough. It becomes inefficient, and sometimes useless, if you don't structure your database operations (find and sort) in a way that uses the index.
Let's dig deeper into which kinds of operations can use which parts of the index.
1. find():
The following prefixes of the compound index also support indexed find() queries:
{a:1},
{a:1, b:1},
{a:1, b:1, c:1}
// Index prefixes are the beginning subsets of indexed fields
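The prefixes listed above can be derived mechanically from the index specification. A minimal sketch in plain JavaScript follows; indexPrefixes is a hypothetical helper, not part of MongoDB or Mongoose.

```javascript
// Hypothetical helper: list the proper prefixes of a compound index spec,
// i.e. its beginning subsets of 1, 2, ... n-1 fields.
function indexPrefixes(indexSpec) {
  const entries = Object.entries(indexSpec);
  return entries
    .slice(0, -1)
    .map((_, i) => Object.fromEntries(entries.slice(0, i + 1)));
}

const prefixes = indexPrefixes({ a: 1, b: 1, c: 1, d: 1 });
console.log(JSON.stringify(prefixes));
// [{"a":1},{"a":1,"b":1},{"a":1,"b":1,"c":1}]
```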
@JohnnyHK already said "The order of the fields in the find query object is not relevant."
The fields could be in ANY ORDER, like {b:1, a:1} instead of {a:1, b:1}. The index will still be utilized as long as the find() operation targets the compound index or a prefix of the compound index.
However, the performance of the query may not be the same (it may degrade), even though the same index is being used, if the leading index field is less selective than the fields that follow it.
Meaning, if the value of the first field in a query such as find({a: 'red', b: 'tshirt'}) has LOW SELECTIVITY, the query will be less efficient than find({a: 'tshirt', b: 'red'}), where the value of the first field has HIGHER SELECTIVITY, even though both queries use the same index {a:1, b:1}.
However, even the LESS SELECTIVE query will perform better than not having any index at all.
I think @Sushil tried to touch on this topic.
In case if you are still wondering, Query selectivity refers to how well the query predicate excludes or filters out documents in a collection. Query selectivity can determine whether or not queries can use indexes effectively or even use indexes at all.
Now let's come to the prefixes of compound indexes.
Note: find() behaves differently on the field subset {a:1, c:1}, which is not a prefix of the compound index {a:1, b:1, c:1, d:1}.
In this case, the find() operation will not be able to utilize our compound index efficiently.
What happens is that only the a:1 part of the index fully supports the find query. The c:1 part cannot narrow the search in the same way, because the compound-index-chain is broken by the absence of field b in the query.
So if a find() query operates on fields a and c together, the index scan (IXSCAN) is bounded by a, and the condition on c cannot narrow it down to a single contiguous range of keys. This means the query will be slower than with a separate compound index on {a:1, c:1}, but faster than with no index at all.
Conclusion is Index fields are parsed in order; if a query omits a particular index prefix, it is unable to make use of any index fields that follow that prefix.
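The "chain" idea can be sketched as a walk over the index fields that stops at the first field the query omits. This is a simplified model in plain JavaScript, not how MongoDB's planner is implemented, and leadingFieldsUsed is a hypothetical helper name.

```javascript
// Hypothetical helper: which leading index fields can a filter make full use of?
function leadingFieldsUsed(indexSpec, filter) {
  const used = [];
  for (const field of Object.keys(indexSpec)) {
    // The chain stops at the first index field the query does not mention.
    if (!(field in filter)) break;
    used.push(field);
  }
  return used;
}

const index = { a: 1, b: 1, c: 1, d: 1 };

console.log(leadingFieldsUsed(index, { b: 3, a: 4 })); // [ 'a', 'b' ] (order in the filter is irrelevant)
console.log(leadingFieldsUsed(index, { a: 4, c: 5 })); // [ 'a' ] (b is missing, so c is cut off)
console.log(leadingFieldsUsed(index, { c: 5 }));       // [] (no prefix at all)
```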
2. sort():
For a non-query operation (i.e. sort), the sort fields must be in the same order as in the compound index, and their directions must either all match or all be the opposite of the directions specified for each field when the compound index was created.
Let's see how our compound index { a: 1, b: 1, c: 1, d: 1 }, with ascending direction, behaves with the sort() operation.
Let's look at the direction of the indexed fields in sorting.
As we know, a single-field index on {a:1} can support a sort on {a:1} (same direction) and on {a:-1} (reverse direction).
Compound indexes follow the same rules while sorting.
{a:1, b:1, c:1, d:1} // in same-direction as of our compound index
{a:-1, b:-1, c:-1, d:-1 } // in reverse-direction of our compound index
// But these fields are neither all in the same direction nor all in the reverse direction; they are ARBITRARY/MIXED.
// Thus the index will not be used for sorting with these fields and directions:
{a:-1, b:1, c:1, d:1}
Another example: a compound index on {a:1, b:-1} can support indexed sorting on {a:1, b:-1} (same direction) and on {a:-1, b:1} (reverse direction), BUT NOT on {a:-1, b:-1}.
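The direction rule can be written down as a short check. The following is a sketch in plain JavaScript that only handles sorts on a prefix of the index; sortDirectionSupported is a hypothetical helper, not a MongoDB function.

```javascript
// Hypothetical helper: does the sort match the index directions, either
// all the same or all reversed?
function sortDirectionSupported(indexSpec, sortSpec) {
  const indexFields = Object.keys(indexSpec);
  const sortFields = Object.keys(sortSpec);
  // This sketch only handles sorts on a prefix of the index, in index order.
  const isPrefix = sortFields.every((f, i) => indexFields[i] === f);
  if (!isPrefix) return false;
  const same = sortFields.every(f => sortSpec[f] === indexSpec[f]);
  const reversed = sortFields.every(f => sortSpec[f] === -indexSpec[f]);
  return same || reversed;
}

const mixedIndex = { a: 1, b: -1 };

console.log(sortDirectionSupported(mixedIndex, { a: 1, b: -1 }));  // true  (same direction)
console.log(sortDirectionSupported(mixedIndex, { a: -1, b: 1 }));  // true  (reverse direction)
console.log(sortDirectionSupported(mixedIndex, { a: -1, b: -1 })); // false (mixed)
```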
Now let's look at the order of the indexed fields in sorting
OPTIMUM SORTING:
When a sort operation uses the compound index or a prefix of the compound index, the result set does not need to be sorted in memory (RAM). Such a sort is satisfied solely by the order of the index, which gives optimum sorting performance.
For instance:
// compound index
{a:1, b:1, c:1, d:1}
// prefix of the compound index
{a:1},
{a:1, b:1},
{a:1, b:1, c:1}
Compound-Index-Chain-Break:
When a sort operation is only partially covered by the compound index, the matched result set may need to be sorted in memory.
Model.find({ a: 2 }).sort({ c: 1 }); // will not use the index for sorting on field c, but will use it for finding
Model.find({ a: { $gt: 2 } }).sort({ c: 1 }); // will not use the index for sorting, but will use it for finding
// because the compound-index-chain is broken: the query has no equality condition on field b of the prefix {a:1, b:1, c:1} of our compound index {a:1, b:1, c:1, d:1}
Sort on Non-prefix Subset:
When index keys appear in both the query predicate (i.e. find()) and the sort(), the index fields that precede the sort subset MUST have equality conditions in the query (range operators such as $gt, $gte, $lt and $lte do not count as equality). So
a compound index can also support an indexed sort on a non-prefix subset of its keys.
Model.find({ c: 5 }).sort({ c: 1 }); // will not use the index at all, because c alone does not belong to any prefix of our compound index
Model.find({ b: 3, a: 4 }).sort({ c: 1 }); // will use the index for both finding and sorting, as the fields belong to one of our index prefixes, i.e. {a:1, b:1, c:1}
Model.find({ a: 4 }).sort({ a: 1, b: 1 }); // will use the index for both finding and sorting, because the sort {a:1, b:1} is itself a prefix of the index
Model.find({ a: { $gt: 4 } }).sort({ a: 1, b: 1 }); // will use the index for both finding and sorting, because the sort {a:1, b:1} is a prefix of the index, so no equality condition is required
Model.find({ a: 5, b: 3 }).sort({ b: 1 }); // will use the index for both finding and sorting, because a has an equality condition and b is the next index field
Model.find({ a: 5, b: { $lt: 3 } }).sort({ b: 1 }); // will use the index for both finding and sorting
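The position rule for sorting can be sketched in a few lines. This is a simplified model in plain JavaScript that ignores sort direction and treats the sort fields as one contiguous run of index fields; canSortWithIndex is a hypothetical helper, and real query plans should always be confirmed with explain().

```javascript
// Hypothetical helper: can the index deliver documents already in sort order?
// equalityFields lists the fields the query tests for an exact value.
function canSortWithIndex(indexSpec, equalityFields, sortFields) {
  const indexFields = Object.keys(indexSpec);
  const start = indexFields.indexOf(sortFields[0]);
  if (start === -1) return false;
  // Every index field before the sort subset needs an equality condition.
  const gapFree = indexFields.slice(0, start).every(f => equalityFields.includes(f));
  // The sort fields must follow the index order from that point on.
  const inOrder = sortFields.every((f, i) => indexFields[start + i] === f);
  return gapFree && inOrder;
}

const idx = { a: 1, b: 1, c: 1, d: 1 };

console.log(canSortWithIndex(idx, ["a", "b"], ["c"])); // true:  find({ b: 3, a: 4 }).sort({ c: 1 })
console.log(canSortWithIndex(idx, ["a"], ["c"]));      // false: find({ a: 2 }).sort({ c: 1 })
console.log(canSortWithIndex(idx, [], ["c"]));         // false: find({ a: { $gt: 2 } }).sort({ c: 1 })
console.log(canSortWithIndex(idx, ["a"], ["b"]));      // true:  find({ a: 5, b: { $lt: 3 } }).sort({ b: 1 })
```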
Hope this helps somebody
The book states the following scenario:
You have 3 queries to run:
Collection.find({x:criteria, y:criteria,z:criteria})
Collection.find({z:criteria, y:criteria,w:criteria})
Collection.find({y:criteria, w:criteria })
In suggesting
collection.ensureIndex({y:1,z:1,x:1})
it is considering how often each field occurs. Since y occurs most often, it wants all the queries to hit y first, followed by z, and x comes last because you will be running the first query a thousand times more often than the other two. If this were not the case and you ran all three queries equally often, the suggestion would be
Collection.ensureIndex({y:1,w:1,z:1})
Moreover, as per the MongoDB documentation, “The order of fields in a compound index is very important.” But the use case in the above scenario is different: it is trying to optimize all of the queries with one index.