Order of query fields in MongoDB using Perl driver - perl

I found a strange behavior in the MongoDB::Collection->find function and wanted to ask whether it results from my misunderstanding of how the driver works or whether it really is odd behavior.
I want to make a query that searches on the indexed fields but also on some of the other ones. To be sure that MongoDB would use the existing index, I thought of passing the search terms as an array_ref, with the fields in the order I wanted them to be, but the explain() function shows that it uses the BasicCursor and scans all documents. If, on the other hand, I pass the search terms as a hash_ref (which doesn't guarantee any field order), MongoDB gets the fields in the correct order and uses the existing index.
I have a collection with documents as follows:
{
"customer" : NumberLong(),
"date" : ISODate(),
field_a : 1,
field_b: 2,
[...]
field_n: 1
}
There is an index called customer_1_date_1 on fields customer and date.
When I give the search terms as an array ref, the explain command states:
my $SearchTerms = [
'customer', $customer,
'date' , $date ,
'field_a' , $field_a ,
'field_b' , $field_b ,
];
cursor "BasicCursor",
[...]
nscanned 5802,
nscannedAllPlans 5802,
[...]
On the other side, when I give the search terms as a hash ref, the explain command states:
my $SearchTerms = {
customer => $customer,
date => $date,
field_a => $field_a,
field_b => $field_b,
};
allPlans [
[0] {
cursor "BtreeCursor customer_1_date_1",
[...]
n 1,
nscanned 4,
nscannedObjects 4
},
[1] {
cursor "BasicCursor",
[...]
n 0,
nscanned 4,
nscannedObjects 4
}
],
[...]
nscanned 4,
nscannedAllPlans 8,
nscannedObjects 4,
nscannedObjectsAllPlans 8,
So, my question is: should I trust the hash ref to always get the existing index?
Is there any way, without using additional modules, to ensure this?
Thank you all,
Juanma

Related

Can I do a second 'query' on a MongoDB cursor?

Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can, however, do a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
.. but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?
In practice, you rarely need to run second queries on a cursor. In particular, you don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you whether the query is using an index, n is the number of documents returned by the query, and nscannedObjects is the number of documents scanned during the query. In this case, MongoDB was able to scan exactly the right number of documents.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
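The same two-stage behaviour can be imitated in plain JavaScript to make the counters concrete. This is only an in-memory sketch, not how MongoDB is implemented: the data, the sorted array standing in for the _id index, and the helper name findWithIndex are all made up for illustration.

```javascript
// In-memory sketch of "index range scan first, then filter the candidates".
// The data set and the helper name findWithIndex are hypothetical.
const users = [];
for (let id = 1; id <= 100; id++) {
  users.push({ _id: id, gender: id % 2 === 0 ? "F" : "M" });
}

// Stand-in for the _id index: the ids kept in sorted order.
const idIndex = users.map((u) => u._id).sort((a, b) => a - b);
const byId = new Map(users.map((u) => [u._id, u]));

function findWithIndex(minId, maxId, gender) {
  let nscannedObjects = 0;
  const results = [];
  for (const id of idIndex) {
    if (id < minId) continue; // a real B-tree would seek straight to minId
    if (id >= maxId) break;   // past the upper bound, stop scanning
    nscannedObjects++;        // document fetched because the index matched
    const doc = byId.get(id);
    if (doc.gender === gender) results.push(doc); // filter on the non-indexed field
  }
  return { n: results.length, nscannedObjects };
}

// Same shape as { _id : { $gte : 1, $lt : 10 }, gender : "F" }
const stats = findWithIndex(1, 10, "F");
console.log(stats);
```

With this made-up data, 9 documents are fetched through the index range and 4 of them survive the gender filter, which mirrors the relationship between nscannedObjects and n in the explain output above.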

MongoDB with 1B documents, what is most optimum filter to return recently updated documents

I have a production MongoDB database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Redshift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" within a given period.
i.e., I want to return all the objects that were published (p) or had an li.it or co.it item added between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether I want to add an 'u' property with an updated date to account for when the document was updated
(ie., li or co item added)
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant than Option 1? I'm looking at this option as using COPY FROM JSON from a mongoexport
Option #1 (multiple dates)
There isn't a good option to index this, as it looks like you would ideally want a compound index that includes p (date) plus two date arrays (li.it and co.it). A compound index can include at most one array field. Even if you could do this, the index would be very large given the suggested number of dates, and the query would involve checking multiple fields to infer the last updated date.
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
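As a rough sketch of the backfill, the value of u for an existing document could be derived from the dates it already carries. The field names p, li.it[].d and co.it[].d come from the sample document in the question; the helper name latestUpdate and the variable missingUpdatedDate are hypothetical, and the code runs on plain JavaScript objects rather than against a live database.

```javascript
// Sketch: derive the proposed "u" (last updated) value for an existing document.
// Field names follow the sample document; latestUpdate is a hypothetical helper.
function latestUpdate(doc) {
  const dates = [];
  if (doc.p) dates.push(doc.p);
  for (const key of ["li", "co"]) {
    const items = (doc[key] && doc[key].it) || [];
    for (const item of items) {
      if (item.d) dates.push(item.d);
    }
  }
  if (dates.length === 0) return null; // nothing to infer a date from
  return new Date(Math.max(...dates.map((d) => d.getTime())));
}

// Filter selecting the documents that do not have the field yet.
const missingUpdatedDate = { u: { $exists: false } };

const sample = {
  p: new Date("2013-10-27T14:30:18.000Z"),
  co: { ct: 2, it: [
    { t: "loreum", u: 300, d: new Date("2013-10-28T14:30:18.000Z") },
    { t: "loreum", u: 400, d: new Date("2013-10-29T14:30:18.000Z") },
  ] },
  li: { ct: 2, it: [
    { u: 500, d: new Date("2013-10-30T14:30:18.000Z") },
    { u: 501, d: new Date("2013-10-29T14:30:18.000Z") },
  ] },
};

const sampleUpdated = latestUpdate(sample);
console.log(sampleUpdated.toISOString()); // 2013-10-30T14:30:18.000Z
```

In the shell, one would presumably iterate over the documents matching missingUpdatedDate and $set u to the computed value for each; on a collection of this size that is worth doing in batches.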
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of being able to do range queries (all queries become scatter-gather). If your common queries are range-based on date (and you are concerned about performance) then you could possibly choose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.

Continuing a Query (paginating) on a compound index

I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
1. Using skip() anyway
2. Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : { $lt : lastDate}}, {"score" : {$lt : lastScore}}]}
3. Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason that $lt is in general preferred over $max is because $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: max is equivalent to $lte, not $lt, so using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.
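For comparison, the $or range query from the question (continuing after the last seen score and date) can be sketched as a small filter builder. The helper names nextPageFilter and matchesNextPage are hypothetical, and the in-memory check below only imitates what the server-side query would select; it says nothing about which index the planner would pick.

```javascript
// Comments already sorted by score descending, then date descending,
// matching the sample data in the question.
const comments = [
  { score: 10, date: new Date("2014-02-24T00:00:00.000Z") },
  { score: 10, date: new Date("2014-02-18T00:00:00.000Z") },
  { score: 10, date: new Date("2014-02-12T00:00:00.000Z") },
  { score: 9, date: new Date("2014-02-22T00:00:00.000Z") },
  { score: 9, date: new Date("2014-02-16T00:00:00.000Z") },
];

// Hypothetical helper: builds the filter for "everything after the last seen item".
function nextPageFilter(lastScore, lastDate) {
  return {
    $or: [
      { score: lastScore, date: { $lt: lastDate } }, // same score, older date
      { score: { $lt: lastScore } },                 // any lower score
    ],
  };
}

// In-memory equivalent of that filter, used here only to check the logic.
function matchesNextPage(doc, lastScore, lastDate) {
  return (doc.score === lastScore && doc.date < lastDate) || doc.score < lastScore;
}

const pageSize = 2;
const firstPage = comments.slice(0, pageSize);
const last = firstPage[firstPage.length - 1];
const secondPage = comments
  .filter((c) => matchesNextPage(c, last.score, last.date))
  .slice(0, pageSize);

console.log(JSON.stringify(nextPageFilter(last.score, last.date)));
console.log(secondPage.map((c) => `${c.score} ${c.date.toISOString()}`));
```

In the shell the filter would be passed to find() together with the same sort and limit as the first page. Because both branches are strict comparisons, the last seen document is not returned again, provided the (score, date) pair is unique, which is the reason the question uses an ObjectID as the second field.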

mongodb index (reverse) optimization

I have a mongodb collection, "features", having 3 fields: name, active, weight.
I will sort features by weight descending:
db.features.find({active:true},{name:1, weight:1}).sort({weight:-1})
For optimization, I created an index for it:
db.features.ensureIndex({'active': 1, 'weight': -1})
I can see it works well when using explain() in query.
However, when I query by weight ascending, I assumed the index I just created would not work and that I would need to create another index on weight ascending.
Query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
When I use explain() to show how the index is used, it prints out:
"cursor" : "BtreeCursor active_1_weight_-1 reverse",
Does "reverse" mean the query is optimized by the index?
In general, do I need to create two indexes, one ascending on weight and one descending on weight, if I sort by weight ascending in some cases and descending in others?
I know I'm late, but I would like to add a little more detail. When explain() outputs cursor: BtreeCursor, that does not guarantee that only the index was used to satisfy your query. You also have to check the "indexOnly" field in the results of explain(). If indexOnly is true, the query was satisfied using the index only and the documents in the collection were not read at all. This is called a covered query: http://docs.mongodb.org/manual/applications/indexes/
But if the results of explain() are cursor: BtreeCursor and indexOnly: false, it means that in addition to using the index, the collection was also read. In your case, for the query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
Mongo would have used the index 'active': 1, 'weight': -1 to satisfy the initial part of the query i.e. db.features.find({active:true}) and would have done the sort without using the index. So to know exactly, you have to look at the indexOnly result within explain().
As you can see from this document, when explain() outputs BtreeCursor, it means that an index was used. When an index is used, indexBounds will be set to indicate the key bounds for scanning in the index. However, if the output showed BasicCursor, it would indicate a table-scan-style operation.
So based on what you've said, from the explain() results, you can see that you're using a BTree Cursor on the index named active_1_weight_-1 and the reverse means that you're iterating over the index in reverse order.
So no, you don't need to create separate indexes.
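To see why one index can serve both sort directions, here is a toy model in which a sorted array plays the role of the { active: 1, weight: -1 } index. The sample documents and the helper name scanActive are invented for illustration; a real B-tree is more involved, but the idea of walking the same ordered entries backwards is the same.

```javascript
// Toy model: a sorted array stands in for the index { active: 1, weight: -1 }.
// Sample documents and the helper name scanActive are hypothetical.
const features = [
  { name: "a", active: true, weight: 3 },
  { name: "b", active: false, weight: 9 },
  { name: "c", active: true, weight: 7 },
  { name: "d", active: true, weight: 5 },
];

// Order entries by active ascending, then weight descending.
const activeWeightIndex = [...features].sort((x, y) =>
  x.active !== y.active
    ? Number(x.active) - Number(y.active)
    : y.weight - x.weight
);

// Walk the index forwards or backwards, keeping only active entries.
function scanActive(index, reverse) {
  const entries = reverse ? [...index].reverse() : index;
  return entries.filter((e) => e.active === true).map((e) => e.weight);
}

const descending = scanActive(activeWeightIndex, false); // like sort({weight:-1})
const ascending = scanActive(activeWeightIndex, true);   // like sort({weight:1})
console.log(descending, ascending);
```

Both orderings come out of the same structure without any extra sorting step, which is what the "reverse" suffix in the cursor name signals.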
This is very confusing. There is an example from the MongoDB class, see below.
Notice that "BtreeCursor ... reverse" is used ONLY for sorting, together with the skip and limit commands,
NOT for locating the records.
The lesson is that if nscanned is around 40k and n is 10, the B-tree index is not being used to locate the records.
So when you see "BtreeCursor ... reverse", it does not necessarily mean the index was used to locate the records.
Suppose you have a collection called tweets whose documents contain information about the created_at time of the tweet and the user's followers_count at the time they issued the tweet. What can you infer from the following explain output?
db.tweets.find({"user.followers_count":{$gt:1000}}).sort({"created_at" : 1 }).limit(10).skip(5000).explain()
{
"cursor" : "BtreeCursor created_at_-1 reverse",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 46462,
"nscanned" : 46462,
"nscannedObjectsAllPlans" : 49763,
"nscannedAllPlans" : 49763,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 205,
"indexBounds" : {
"created_at" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost.localdomain:27017"
}
This query performs a collection scan: yes
The query uses an index to determine the order in which to return result documents: yes
The query uses an index to determine which documents match: no
The query returns 46462 documents: no
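The reasoning behind those answers can be written down as a small helper that reads the legacy explain() fields shown above. The function name summarizeExplain and the labels it returns are hypothetical, and the rule "full $minElement to $maxElement bounds means the index does not narrow the match" is a simplification for this sketch rather than an official definition.

```javascript
// Sketch: interpret a legacy explain() document like the one above.
// summarizeExplain and its result labels are hypothetical names.
function summarizeExplain(explain) {
  const usesIndex = explain.cursor.startsWith("BtreeCursor");
  const bounds = Object.values(explain.indexBounds || {});
  // Bounds running from $minElement to $maxElement cover the whole index,
  // so the index is not narrowing down which documents match.
  const fullRange =
    bounds.length > 0 &&
    bounds.every((ranges) =>
      ranges.every(
        ([lo, hi]) =>
          lo && hi &&
          typeof lo === "object" && typeof hi === "object" &&
          "$minElement" in lo && "$maxElement" in hi
      )
    );
  return {
    usesIndex,
    reverse: explain.cursor.endsWith(" reverse"),
    sortedByIndex: usesIndex && explain.scanAndOrder === false,
    selectsByIndex: usesIndex && !fullRange,
    covered: explain.indexOnly === true,
  };
}

// Trimmed copy of the explain output from the example above.
const tweetsExplain = {
  cursor: "BtreeCursor created_at_-1 reverse",
  n: 10,
  nscanned: 46462,
  scanAndOrder: false,
  indexOnly: false,
  indexBounds: { created_at: [[{ $minElement: 1 }, { $maxElement: 1 }]] },
};

const summary = summarizeExplain(tweetsExplain);
console.log(summary);
```

For the tweets example this reports that the index is walked in reverse and provides the sort order (scanAndOrder is false), but that it does not select the matching documents, which agrees with the quiz answers.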
Assuming you are using 2.0+ then reverse traversal is not more costly to MongoDB, so for this case you don't need to create separate indexes for the forward/reverse sort. You can confirm by creating it and using hint() if you wish (the optimizer will cache the current index for a while, so will not automatically select the other index).

How to improve query performance with operators like $nin, $in for Mongodb

I have a reasonably large dataset of over 3 million documents that have tags similar to StackOverflow that uses tags for each question. The schema that I use for storing the tags is as follows:
{"id": 12345, "tags":["tag1", "tag2", "tag3"]}, {"id": 12346, "tags":["tag2", "tag3"]}
I have a multi-key index created on tags field. When I am performing queries using $in or $nin operators to find the intersection, union of the tags, the performance is around 7 seconds on a server class machine. Is there anything that I can do to improve the speed of query search?
EDIT 1:
Here is the explain plan as requested. What I observed is that the queries returned much faster (< 50ms) after I restarted my server and ran only the MongoDB server. I suspect the indexes were not cached in memory, although I had ample unused RAM available and my index (800MB) could easily fit in memory.
db.tagsCollection.find( { "tags" : { $in : ['tag1', 'tag2'], $nin : ['tag4', 'tag5', 'tag6', 'tag7'] } } ).explain();
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 6145193,
"nscannedObjects" : 6145192,
"n" : 969386,
"millis" : 19640,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"tags" : [
[
"tag1",
"tag1"
],
[
"tag2",
"tag2"
]
]
}
}
Note
This is what I thought of as an optimization ( though you might need to test it )
Instead of storing the tags themselves, store a small key which identifies all the tags a particular document has.
Say for post #125 the tags are: PHP, MongoDb, database.
a) Clean the tags (for example, convert all of them to lowercase)
and then sort them alphabetically.
The current tags will become: database,mongodb,php
b) Have a separate collection which stores the integer-to-tag mapping:
{ "_id" : 1 , "t" : "mongodb" }
{ "_id" : 2 , "t" : "php" } and so on store all the possible tags for your website
c) To store a document, create the tag key using the tag-to-number map from the previous collection.
So the current database,mongodb,php will become something like 1-12-2
d) store your document like :
{ "id" : 12345 , "tags" : [1,12,3] }
QUERYING :
The use of integers instead of strings on an indexed field would reduce the index size by great extent, and also make querying faster as compared to a string index.
Not sure about amount of performance gain, but still worth a try to compare to your current implementation.
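Steps a) to d) can be sketched as follows. The mapping values for mongodb and php follow the example above, while the id 12 for database, the Map named tagIds, and the helper name encodeTags are assumptions made for this sketch; in practice the mapping would be loaded from the separate collection.

```javascript
// Hypothetical tag-to-integer mapping; in practice it would be read from the
// separate mapping collection. The id 12 for "database" is an assumption.
const tagIds = new Map([
  ["mongodb", 1],
  ["php", 2],
  ["database", 12],
]);

// Clean the tags, sort them alphabetically, then replace each by its integer id.
function encodeTags(tags, ids) {
  const cleaned = tags.map((t) => t.trim().toLowerCase()).sort();
  return cleaned.map((t) => {
    if (!ids.has(t)) throw new Error(`unknown tag: ${t}`);
    return ids.get(t);
  });
}

const encoded = encodeTags(["PHP", "MongoDb ", "database"], tagIds);
const post = { id: 12345, tags: encoded };
console.log(JSON.stringify(post)); // {"id":12345,"tags":[12,1,2]}
```

The tags used in $in and $nin queries would have to be passed through the same mapping before querying, so that the integers are compared against the integer index.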
Check the size of your multi-key tags index using db.col.stats(). If it doesn't fit in RAM then you might be disk-bound and incurring some disk IO cost. If the index fits entirely in memory then I'm not sure what else you can do, apart from throw more hardware at it, unless you can optimise the queries themselves.
Do you need to search through all the data, or can you query a subset that's filtered by another indexed field? Or can you eliminate the $nin queries, which will tend to be slower because they have to iterate over every tag, whereas $in only has to iterate until it finds a match.
If you want performance to be super fast and don't have space constraints, I would suggest having a separate collection of tags, each with a video id array, and an index on the tag name.
Here is another suggestion, but I have not had a chance to test it.
{
tags:{
items:[ 'a', 'b', 'c' ],
mixed:{
a:1, // hash value for a tag
b:2, // hash value for b tag
c:3 // hash value for c tag
}
}
}
and search query is
db.demo.find({ 'tags.mixed.a':1, 'tags.mixed.b':2 })
If possible, create a compound index on the tags.mixed fields.