Can I do a second 'query' on a MongoDB cursor? - mongodb

Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can however do a bTreeCursor query on a few indexed fields that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
... but it seems that I cannot do this either. Is there an alternative way to do what I want?
The reason I am asking is that this second query will vary a lot and there is no way I can index all the fields. But I do want to make that first query use a bTreeCursor; otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input, the subselection of ~500 documents will be queried in different ways during a session, with unpredictable basicCursor queries using various combinations of $in, $eq, $gt and $lt. During all of this, the bTreeCursor subselection remains the same. Should I just keep running both queries for every user query, or is there a more efficient way to keep a reference to this subselection?

In practice, you rarely need to run second queries on a cursor. You especially don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you whether an index was used, n tells you the number of documents returned by the query, and nscannedObjects is the number of objects scanned during the query. In this case, MongoDB was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
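Applied to the original question, this means you don't need two cursors: the indexed and non-indexed conditions can simply be combined into one find() call. A minimal sketch, assuming bTreeCursorMatch and basicCursorMatch are plain query documents as in the question:
// Combine both filters in a single query document; MongoDB uses the index for the
// bTreeCursorMatch fields and filters the remaining candidates in memory.
var cursor = collection.find({ $and: [ bTreeCursorMatch, basicCursorMatch ] });
Running explain() on that cursor should show an index being used, with nscannedObjects around 500 and n around 100.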

Related

DocumentDb Compound Query really slow for Date

I am using a DocumentDb database (total size ~13 TB).
This is the schema of the database; all the keys are indexed individually (single-field indexes):
[
{ key : { "customerId" : 1 } },
{ key : { "typeOfProduct" : 1 } },
{ key : { "date" : 1 } }
]
1st query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>}).count()
returns 500 documents
2nd query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "typeOfProduct" : <TYPE>}).count()
returns 200 documents
3rd query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "date" : { "$gte" : NumberLong(1584055385383), "$lte" : NumberLong(1584141785383)}}).count()
The query just keeps running indefinitely.
In the 2nd query, my understanding is that DocumentDb first gets all the documents matching customerId (500 of them), and then checks each of those for a matching typeOfProduct.
In the 3rd query, I would expect it to work in a similar manner, but the query keeps running indefinitely.
Can someone explain why this is happening? Why is it so slow with the Date? Is it because of the size of the database or am I writing the query wrong?
I'm not sure whether you have one three-field compound index or three separate single-field indexes.
Both the first and the second query use that index because they prefix-match it. In general, the third query could also use the index, skipping the second field. You can check this using explain().
It's not obvious which index Mongo will pick, but knowing your data, you can suggest one using hint(index) ({customerId: 1} in your case). To check which one is used by default for this query, use explain().
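For example, a minimal sketch of forcing that index and checking the plan, reusing the 3rd query from the question (the customer value is the question's own placeholder):
db.collectionName.find({ "customerId" : <SAMPLE-CUSTOMER>, "date" : { "$gte" : NumberLong(1584055385383), "$lte" : NumberLong(1584141785383) } }).hint({ "customerId" : 1 }).explain("executionStats")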

Mongodb compound index not being used

I have a MongoDB collection with close to 100k documents. On each document, there are the following 3 fields.
arrayX: [ObjectId]
someID: ObjectId
timestamp: Date
I have created a compound index for the 3 fields in that order.
When I try to then fire an aggregate query (written below in pseudocode), as
match(
and(
arrayX: (elematch: A),
someId: Y
)
)
sort (timestamp: 1)
it does not end up using the compound index.
The way I know this is that when I use .explain(), the winningPlan stage is FETCH, the inputStage is IXSCAN and the index name is timestamp_1,
which means it's only using the other single-key index I created for the timestamp field.
What's interesting is that if I remove the sort stage, and keep everything the exact same, mongodb ends up using the compound index.
What gives?
Multi-key indexes are not useful for sorting. I would expect that a plan using the other index was listed in rejectedPlans.
If you run explain with the allPlansExecution option, the response will also show you the execution times for each plan, among other things.
Since the multi-key index can't be used for sorting the results, that plan would require a blocking sort stage. This means that all of the matching documents must be retrieved and then sorted in memory before sending a response.
On the other hand, using the timestamp_1 index means that documents will be encountered in a presorted order while traversing the index. The tradeoff here is that there is no blocking sort stage, but every document must be examined to see if it matches the query.
For data sets that are not huge, or when the query will match a significant fraction of the collection, the plan without a blocking sort will return results faster.
You might test creating another index on { someID:1, timestamp:1 } as this might reduce the number of documents scanned while still avoiding the blocking sort.
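A minimal sketch of that experiment, assuming the collection is named test (as in the example further down) and reusing the sample ObjectId values shown there:
// Suggested additional index: equality field first, then the sort field.
db.test.createIndex({ someID: 1, timestamp: 1 })
// Compare the candidate plans and their execution times.
db.test.explain("allPlansExecution").aggregate([
    { $match: { arrayX: ObjectId("5e44f9ed221e963909537848"), someID: ObjectId("5e44f9e7221e963909537845") } },
    { $sort: { timestamp: 1 } }
])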
The reason the compound index is selected when you remove the sort stage is because that stage probably accounts for the majority of the execution time.
The fields in the executionStats section of the explain output are explained in Explain Results. Comparing the estimated execution times for each stage may help you determine where you can tune the queries.
I am using documents like this (based on the question post) for discussion:
{
_id: 1,
fld: "One",
arrayX: [ ObjectId("5e44f9ed221e963909537848"), ObjectId("5e44f9ed221e963909537849") ],
someID: ObjectId("5e44f9e7221e963909537845"),
timestamp: ISODate("2020-02-12T01:00:00.0Z")
}
The Indexes:
I created two indexes, as mentioned in the question post:
{ timestamp: 1 } and { arrayX:1, someID:1, timestamp:1 }
The Query:
db.test.find(
{
someID: ObjectId("5e44f9e7221e963909537845"),
arrayX: ObjectId("5e44f9ed221e963909537848")
}
).sort( { timestamp: 1 } )
In the above query I am not using $elemMatch. A query filter using $elemMatch with single field equality condition can be written without the $elemMatch. From $elemMatch Single Query Condition:
If you specify a single query predicate in the $elemMatch expression,
$elemMatch is not necessary.
The Query Plan:
I ran the query with explain, and found that it uses the arrayX_1_someID_1_timestamp_1 index. The index is used for the filter as well as the sort operations of the query.
Sample plan details:
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"arrayX" : 1,
"someID" : 1,
"timestamp" : 1
},
"indexName" : "arrayX_1_someID_1_timestamp_1",
...
The IXSCAN stage shows that the query uses the index. The FETCH stage shows that documents are then retrieved, using the pointers from the index entries, to get the remaining fields. This means that both the query's filter and the sort use the index. The way to know that the sort uses an index is that the plan will not have a SORT stage, as is the case here.
Reference:
From Sort and Non-prefix Subset of an Index:
An index can support sort operations on a non-prefix subset of the
index key pattern. To do so, the query must include equality
conditions on all the prefix keys that precede the sort keys.

Created indexes on a mongodb collection, still fails while sorting a large data set

My Query below:
db.chats.find({ bid: 'someID' }).sort({start_time: 1}).limit(10).skip(82560).pretty()
I have indexes on the chats collection, on the fields in this order:
{
"cid" : 1,
"bid" : 1,
"start_time" : 1
}
I am trying to perform a sort, but when I run the query and check the result of explain(), I still get the winningPlan as
{
    "stage":"SKIP",
    "skipAmount":82560,
    "inputStage":{
        "stage":"SORT",
        "sortPattern":{
            "start_time":1
        },
        "limitAmount":82570,
        "inputStage":{
            "stage":"SORT_KEY_GENERATOR",
            "inputStage":{
                "stage":"COLLSCAN",
                "filter":{
                    "bid":{
                        "$eq":"someID"
                    }
                },
                "direction":"forward"
            }
        }
    }
}
I was expecting not to have a SORT stage in the winning plan, since I have indexes created for that collection.
Having no indexes results in the following error:
MongoError: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM
However, I managed to make the sort work by increasing the RAM allocation from 32 MB to 64 MB. I'm looking for help with adding the indexes properly.
The order of fields in an index matters. To sort query results by a field that is not at the start of the index key pattern, the query must include equality conditions on all the prefix keys that precede the sort keys. The cid field is neither in the query nor used for sorting, so you must leave it out. Put the bid field first in the index definition, since you use it in an equality condition, and put start_time after it, to be used for sorting. The index should therefore look like this:
{"bid" : 1, "start_time" : 1}
See the documentation for further reference.
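A minimal sketch of creating that index and re-checking the plan with the query from the question:
// Equality field first, then the sort field; the unused cid field is left out.
db.chats.createIndex({ bid: 1, start_time: 1 })
// The winning plan should now be an IXSCAN with no SORT stage.
db.chats.find({ bid: 'someID' }).sort({ start_time: 1 }).limit(10).skip(82560).explain()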

Continuing a Query (paginating) on a compound index

I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
Using skip() anyway
Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : { $lt : lastDate }}, {"score" : { $lt : lastScore }}]}
Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason $lt is generally preferred over $max is that $max selects the index based on the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: max is equivalent to $lte, not $lt, so using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.
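A minimal sketch of that approach in the shell (comments is the example data set from the question; lastScore, lastDate and n come from the previous page):
// Force the compound index and start scanning from the last-seen key values;
// per the note above, drop the first returned document if it repeats the previous page.
db.comments.find().hint({ score: -1, date: -1 }).max({ score: lastScore, date: lastDate }).sort({ score: -1, date: -1 }).limit(n + 1)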

mongodb index (reverse) optimization

I have a mongodb collection, "features", having 3 fields: name, active, weight.
I will sort features by weight descending:
db.features.find({active:true},{name:1, weight:1}).sort({weight:-1})
For optimization, I create an index for it:
db.features.ensureIndex({'active': 1, 'weight': -1})
I can see it works well when using explain() on the query.
However, when I query by weight ascending, I suppose the index I just created will not work and I will need to create another index on weight ascending.
Query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
When I use explain() to show how the index is working, I find it prints out:
"cursor" : "BtreeCursor active_1_weight_-1 reverse",
Does the reverse mean the query is optimized by the index?
Generally, do I need to create two indexes, one ascending on weight and one descending on weight, if I will sort by weight ascending in some cases and descending in others?
I know I'm late, but I would like to add a little more detail. When you use explain() and it outputs cursor: BtreeCursor, it doesn't always guarantee that only the index was used to satisfy your query. You also have to check the indexOnly field in the results of explain(). If indexOnly is true, it means that your query was satisfied using the index only and the documents in the collection were not referred to at all. This is called a 'covered index query': http://docs.mongodb.org/manual/applications/indexes/
But if the results of explain are cursor: BtreeCursor and indexOnly: false, it means that in addition to using the index, the collection was also referred to. In your case, for the query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
Mongo would have used the index {'active': 1, 'weight': -1} to satisfy the initial part of the query, i.e. db.features.find({active:true}), and would have done the sort without using the index. So to know exactly what happened, you have to look at the indexOnly result within explain().
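For illustration, a minimal sketch of a covered version of this query; it assumes adding name to the index, which is not part of the original setup:
// An index that also contains the projected name field (illustrative only).
db.features.ensureIndex({ active: 1, weight: -1, name: 1 })
// Exclude _id so every returned field comes from the index; explain() should then report indexOnly: true.
db.features.find({ active: true }, { _id: 0, name: 1, weight: 1 }).sort({ weight: 1 }).explain()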
As you can see from this document, when explain() outputs BtreeCursor, it means that an index was used. When an index is used, indexBounds will be set to indicate the key bounds for scanning in the index. However, if the output showed BasicCursor, it would indicate a table-scan-style operation.
So based on what you've said, from the explain() results, you can see that you're using a BTree Cursor on the index named active_1_weight_-1 and the reverse means that you're iterating over the index in reverse order.
So no, you don't need to create separate indexes.
This is very confusing. In the MongoDB class there is an example, see below.
Notice that the BtreeCursor reverse is used ONLY for the purpose of sorting (for the skip and limit),
NOT for the purpose of locating the records.
The lesson is: if nscanned is ~40k and n = 10, it means the btree index was not used to locate the records.
So when you see a BtreeCursor index reverse, it does not necessarily mean the index was used to locate the records.
Suppose you have a collection called tweets whose documents contain information about the created_at time of the tweet and the user's followers_count at the time they issued the tweet. What can you infer from the following explain output?
db.tweets.find({"user.followers_count":{$gt:1000}}).sort({"created_at" : 1 }).limit(10).skip(5000).explain()
{
    "cursor" : "BtreeCursor created_at_-1 reverse",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 46462,
    "nscanned" : 46462,
    "nscannedObjectsAllPlans" : 49763,
    "nscannedAllPlans" : 49763,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 205,
    "indexBounds" : {
        "created_at" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    },
    "server" : "localhost.localdomain:27017"
}
This query performs a collection scan: yes.
The query uses an index to determine the order in which to return result documents: yes.
The query uses an index to determine which documents match: no.
The query returns 46462 documents: no.
Assuming you are using 2.0+, reverse traversal is not more costly to MongoDB, so in this case you don't need to create separate indexes for the forward and reverse sorts. You can confirm this by creating the second index and using hint() if you wish (the optimizer will cache the current index for a while, so it will not automatically select the other index).
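A minimal sketch of that check (the ascending index here is created purely for comparison; as noted, it isn't required):
// Create the ascending variant and force it with hint(), since the optimizer may keep the cached plan otherwise.
db.features.ensureIndex({ active: 1, weight: 1 })
db.features.find({ active: true }, { name: 1, weight: 1 }).sort({ weight: 1 }).hint({ active: 1, weight: 1 }).explain()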