MongoDB aggregation $lookup to a field that is an indexed array

I am trying a fairly complex aggregate command on two collections involving a $lookup pipeline. This normally works just fine on a simple aggregation as long as an index is set on foreignField.
But my $lookup is more complex, as the indexed field is not a plain Int64 field but an array of Int64. When doing a simple find(), it is easy to verify with explain() that the index is being used. But explaining the aggregate pipeline does not show whether the index is being used inside the $lookup pipeline. All my timing tests seem to indicate that it is not. MongoDB version is 3.6.2. Db compatibility is set to 3.6.
As I said earlier, I am not using the simple foreignField lookup but the 3.6-specific pipeline + $match + $expr...
Could using pipeline be a showstopper for the index? Does anyone have deep experience with the new $lookup pipeline syntax and/or an index on an array field?
Examples
Either of the following works fine and if explained, shows that index on followers is being used.
db.col1.find({followers: {$eq : 823778}})
db.col1.find({followers: {$in : [823778]}})
But the following one does not seem to make use of the index on followers [there are more steps in the pipeline, stripped for readability].
db.col2.aggregate([
    {$match: {field: "123"}},
    {$lookup: {
        from: "col1",
        let: {follower: "$follower"},
        pipeline: [{
            $match: {
                $expr: {
                    $or: [
                        { $eq: ["$follower", "$$follower"] },
                        { $in: ["$$follower", "$followers"] }
                    ]
                }
            }
        }],
        as: "followers_all"
    }}
])

This is a missing feature which is going to be part of version 3.8.
Currently, eq matches in the $lookup sub-pipeline are optimised to use indexes.
Refer to the Jira ticket fixed in 3.7.1 (dev version).
Also, this may be relevant as well for non-multikey indexes.
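Until the sub-pipeline form can use the multikey index, one possible workaround is to fall back to the classic localField/foreignField form of $lookup, which does an equality match and, against an array field, matches individual elements. The sketch below is an assumption and not something taken from the question: it runs two lookups and merges them with $setUnion, and the temporary field names by_follower and by_followers are made up for illustration. It only builds the pipeline array so it can be inspected; verify with timing tests that it actually helps on your data.

```javascript
// Hypothetical workaround sketch: two classic $lookup stages instead of one
// pipeline-style $lookup. The names by_follower / by_followers are invented.
const lookupPipeline = [
  { $match: { field: "123" } },
  // Equality on the scalar field col1.follower
  { $lookup: { from: "col1", localField: "follower", foreignField: "follower", as: "by_follower" } },
  // Equality against the array field col1.followers (matches any element)
  { $lookup: { from: "col1", localField: "follower", foreignField: "followers", as: "by_followers" } },
  // Merge both result sets, dropping documents matched by both lookups
  { $addFields: { followers_all: { $setUnion: ["$by_follower", "$by_followers"] } } },
  // Remove the temporary arrays
  { $project: { by_follower: 0, by_followers: 0 } }
];

// In the mongo shell this would be run as: db.col2.aggregate(lookupPipeline)
console.log(JSON.stringify(lookupPipeline, null, 2));
```

Note that $setUnion does not guarantee the order of the merged array, so add a sort afterwards if the order of followers_all matters.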

Related

Will $or query in mongo findOne operation use index if not all expressions are indexed?

Mongo's documentation on $or operator says:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So effectively, if you want the query to be efficient, both of the attributes used in the $or condition should be indexed.
However, I'm not sure if this applies to "findOne" operations as well since I can't use Mongo's explain functionality for a findOne operation. It seems logical to me that if you only care about returning one document, you would check an indexed condition first, since you could bail after just finding one without needing to care about the non-indexed fields.
Example
Let's pretend in the below that "email" is indexed, but username is not. It's a contrived example, but bear with me.
db.users.findOne(
    {
        $or: [
            { email: 'someUser@gmail.com' },
            { username: 'some-user' }
        ]
    }
)
Would the above use the email index, or would it do a full collection scan until it finds a document that matches the criteria?
I could not find anything documenting what would be expected here. Does someone know the answer?
Well I feel a little silly - it turns out that the docs I found saying you can only run explain on "find" may have been just referring to when you're in MongoDB Compass.
I ran the below (userEmail not indexed)
this.userModel
    .findOne()
    .where({ $or: [{ _id: id }, { userEmail: email }] })
    .explain();
this.userModel
    .getModelInstance()
    .findOne()
    .where({ _id: id })
    .explain();
...and found that the first does do a COLLSCAN and the second's plan was IDHACK (basically it does a normal ID lookup).
When running performance tests, I can see that the first is about 4x slower on a collection with 20K+ documents (it depends on where in the natural order the document you're looking for sits).
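For anyone who wants to check this programmatically rather than by eye, here is a minimal sketch of a helper that walks a plan tree and lists its stages. It assumes the usual explain() shape, where plans nest through inputStage (one child) or inputStages (several children). The samplePlan object is hand-written for illustration and is not real server output.

```javascript
// Collect every stage name in an explain() plan tree.
// Plans nest through `inputStage` (one child) or `inputStages` (several children).
function collectStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage].concat(...children.map(child => collectStages(child)));
}

// Hand-written sample shaped like queryPlanner.winningPlan; not real output.
const samplePlan = {
  stage: "SUBPLAN",
  inputStage: {
    stage: "FETCH",
    inputStage: {
      stage: "OR",
      inputStages: [
        { stage: "IXSCAN", indexName: "_id_" },
        { stage: "IXSCAN", indexName: "userEmail_1" }
      ]
    }
  }
};

const stages = collectStages(samplePlan);
console.log(stages);
console.log("uses collection scan:", stages.includes("COLLSCAN"));
```

With a real result you would pass in the winningPlan from the explain output instead of samplePlan.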

MongoDB conditional query on nested document array

Hi, I'm trying to write a conditional query on a nested document array.
I've been reading the documentation for days and couldn't figure out how to make this work.
The DB looks like this:
[
  {
    "id": 1,
    "team": "team1",
    "players": [
      {
        "name": "Mario",
        "substitutes": ["Luigi", "Yoshi"]
      },
      {
        "name": "Wario",
        "substitutes": []
      }
    ]
  },
  {
    "id": 2,
    "team": "team2",
    "players": [
      {
        "name": "Bowser",
        "substitutes": ["Toad", "Mario"]
      },
      {
        "name": "Wario",
        "substitutes": []
      }
    ]
  }
]
English is not my first language, so this is hard to phrase, but what I'm trying to do is find teams that include all queried players.
Some of the objects in the players array have substitutes.
For each object in the players array, if the main player ("players.name") is not one of the queried players, I want the query to check whether one of the substitutes ("players.substitutes") is.
Team.find({players:{$in:[ 'Mario', 'Wario' ]}}) (mongoose query)
This gives me an array with 'team1'.
But what I want is both teams, because 'Mario' is one of the substitutes for 'Bowser' (team2).
I haven't managed to write a working query. I have been trying to avoid $where, since the official MongoDB docs say:
AGGREGATION ALTERNATIVES PREFERRED
Starting in MongoDB 3.6, the $expr operator allows the use of aggregation expressions within the query language. And, starting in MongoDB 4.4, the $function and $accumulator allows users to define custom aggregation expressions in JavaScript if the provided pipeline operators cannot fulfill your application’s needs.
Given the available aggregation operators:
The use of $expr with aggregation operators that do not use JavaScript (i.e. non-$function and non-$accumulator operators) is faster than $where because it does not execute JavaScript and should be preferred if possible. However, if you must create custom expressions, $function is preferred over $where.
But if it can easily be written with the $where operator, that is totally fine.
Any suggestions or ideas that point me in the right direction would be highly appreciated.
Firstly, your query is incorrect, and it is not obvious what exactly your filter criteria are. So here are a few suggestions:
If you want all documents that have one of the given names as a main player (this returns both documents):
db.Team.find({"players.name":{$in:[ 'Mario', 'Wario' ]}}).pretty()
If you want all documents that have any of the given names in a substitutes array (this returns only one, because team1 has neither Mario nor Wario as a substitute):
db.Team.find({"players.substitutes":{$in:[ 'Mario', 'Wario' ]}}).pretty()
If the names could be present in either name or substitutes:
db.Team.find({ $or: [{"players.substitutes":{$in:[ 'Mario', 'Wario' ]}}, {"players.name":{$in:[ 'Mario', 'Wario' ]}}] }).pretty()
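To see which teams each of the three filters picks, here is a small plain-JavaScript emulation run against the sample data from the question. It is only a sketch of the matching logic for this data shape, not how MongoDB evaluates the queries; the helper names byName, bySubstitute and teamNames are made up.

```javascript
// Sample data copied from the question.
const teams = [
  { id: 1, team: "team1", players: [
    { name: "Mario", substitutes: ["Luigi", "Yoshi"] },
    { name: "Wario", substitutes: [] }
  ] },
  { id: 2, team: "team2", players: [
    { name: "Bowser", substitutes: ["Toad", "Mario"] },
    { name: "Wario", substitutes: [] }
  ] }
];
const wanted = ["Mario", "Wario"];

// Emulates {"players.name": {$in: wanted}}
const byName = t => t.players.some(p => wanted.includes(p.name));
// Emulates {"players.substitutes": {$in: wanted}}
const bySubstitute = t => t.players.some(p => p.substitutes.some(s => wanted.includes(s)));

// Return the team names selected by a predicate.
const teamNames = pred => teams.filter(pred).map(t => t.team);

console.log(teamNames(byName));
console.log(teamNames(bySubstitute));
console.log(teamNames(t => byName(t) || bySubstitute(t)));
```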

Mongoose aggregate pipeline: sorting indexed date in MongoDB is slow

I've been working with this error for some time on my App here and was hoping someone can lend a hand finding the error of this aggregation query.
I'm using a docker container running MongoDB shell version v4.2.8. The app uses an Express.js backend with Mongoose middleware to interface with the database.
I want to make an aggregation pipeline that first matches by an indexed field called 'platform_number'. We then sort that by the indexed field 'date' (stored as an ISODate type). The remaining pipeline does not seem to influence the performance; it's just some projections and filtering.
{$sort: {date: -1}} bottlenecks the entire aggregate, even though only around 250 documents are returned. I do have an unindexed key called 'cycle_number' that correlates directly with the 'date' field. Replacing {date: -1} with {cycle_number: -1} speeds up the query, but then I get an out of memory error. Sorting has a 100MB cap on RAM, and this sort fails with 250 documents.
A possible solution would be to include the additional option { "allowDiskUse": true }. But before I do, I want to know why 'date' isn't sorting properly in the first place. Another option would be to index 'cycle_number' but again, why does 'date' throw up its hands?
The aggregation pipeline is provided below. It is first a match, followed by the sort and so on. I'm happy to explain what the other functions are doing, but they don't make much difference when I comment them out.
let agg = [ {$match: {platform_number: platform_number}} ] // indexed number
agg.push({$sort: {date: -1}}) // date is indexed in descending order
if (xaxis && yaxis) {
    agg.push(helper.drop_missing_bgc_keys([xaxis, yaxis]))
    agg.push(helper.reduce_bgc_meas([xaxis, yaxis]))
}
const query = Profile.aggregate(agg)
query.exec(function (err, profiles) {
    if (err) return next(err)
    if (profiles.length === 0) { res.send('platform not found') }
    else {
        res.json(profiles)
    }
})
Once again, I've been tiptoeing around this issue for some time. Solving the issue would be great, but understanding the issue better is also awesome, Thank you for your help!
The query executor is not able to use a different index for the second stage. MongoDB indexes map the key values to the location of documents in the data files.
Once the $match stage has completed, the documents are in the pipeline, so no further index use is possible.
However, if you create a compound index on {platform_number:1, date:-1} the query planner can combine the $match and $sort stages into a single stage that will not require a blocking sort, which should greatly improve the performance of this pipeline.
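As a rough illustration of why the compound index helps, the sketch below checks whether a single index can provide both the equality $match and the $sort order in one scan. It is a deliberately simplified model that treats every matched field as an equality predicate (the real query planner considers far more cases), and indexCoversMatchThenSort is a made-up helper name.

```javascript
// Simplified check: can one index provide both the equality $match and the
// $sort order in a single scan? Assumes all matched fields are equality
// predicates; the real planner handles many more cases than this.
function indexCoversMatchThenSort(index, matchFields, sort) {
  const keys = Object.keys(index);
  // The leading index keys must be the equality-matched fields.
  let i = 0;
  while (i < keys.length && matchFields.includes(keys[i])) i++;
  if (i < matchFields.length) return false;
  // The next index keys must be the sort keys, in the same order.
  const sortKeys = Object.keys(sort);
  const rest = keys.slice(i, i + sortKeys.length);
  if (rest.length !== sortKeys.length) return false;
  if (!sortKeys.every((k, j) => k === rest[j])) return false;
  // Directions must all agree or all be reversed (an index can be walked backwards).
  const same = sortKeys.every(k => index[k] === sort[k]);
  const reversed = sortKeys.every(k => index[k] === -sort[k]);
  return same || reversed;
}

const matchFields = ["platform_number"];
const sortSpec = { date: -1 };
console.log(indexCoversMatchThenSort({ platform_number: 1 }, matchFields, sortSpec));
console.log(indexCoversMatchThenSort({ date: -1 }, matchFields, sortSpec));
console.log(indexCoversMatchThenSort({ platform_number: 1, date: -1 }, matchFields, sortSpec));

// In the mongo shell the compound index would be created with something like
// (the collection name "profiles" is an assumption based on the Profile model):
// db.profiles.createIndex({ platform_number: 1, date: -1 })
```

Only the compound index passes the check; each single-field index covers just one of the two stages.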

MongoDB: How To Save Returned Results To Another Collection?

Consider the following:
I have a MongoDB collection named C_a. It contains a very large number of documents (e.g., more than 50,000,000).
For the sake of simplicity let's assume that each document has the following schema:
{
    "username": "Aventinus",
    "text": "I love StackOverflow!",
    "tags": [
        "programming",
        "mongodb"
    ]
}
Using text index I can return all documents which contain the keyword StackOverflow like this:
db.C_a.find({$text:{$search:"StackOverflow"}})
My question is the following:
Considering that the query above may return hundreds of thousands of documents, what is the easiest/fastest way to directly save the returned results into another collection named C_b?
Note: This post explains how to use aggregate to find exact matches (i.e., specific dates). I'm interested in using Text Index to save all the posts which include a specific keyword.
The referenced answer is correct. The example query from that answer can be updated to use your criteria:
db.C_a.aggregate([
{$match: {$text: {$search:"StackOverflow"}}},
{$out:"C_b"}
]);
From the MongoDB documentation for $text:
If using the $text operator in aggregation, the following restrictions also apply.
The $match stage that includes a $text must be the first stage in the pipeline.
A text operator can only occur once in the stage.
The text operator expression cannot appear in $or or $not expressions.
The text search, by default, does not return the matching documents in order of matching scores. Use the $meta aggregation expression in the $sort stage.
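If the order of the copied documents matters, the restrictions quoted above could be combined into a pipeline like the following sketch. The $sort stage is optional and only there to illustrate the $meta note; on hundreds of thousands of documents it may need allowDiskUse. The textMatchIsFirst helper is a made-up sanity check that only looks at top-level $match stages.

```javascript
// Copies text-search matches into C_b, best matches first.
const textPipeline = [
  // The $match containing $text has to be the first stage.
  { $match: { $text: { $search: "StackOverflow" } } },
  // Optional: order by relevance score before writing.
  { $sort: { score: { $meta: "textScore" } } },
  // $out has to be the last stage; it replaces C_b if it already exists.
  { $out: "C_b" }
];

// Sanity check for the first restriction quoted above.
function textMatchIsFirst(stages) {
  const idx = stages.findIndex(s => Boolean(s.$match && s.$match.$text));
  return idx <= 0; // -1: no $text at all, 0: $text is in the first stage
}

console.log(textMatchIsFirst(textPipeline));
// In the mongo shell: db.C_a.aggregate(textPipeline, { allowDiskUse: true })
```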

Aggregate framework can't use indexes

I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view: 500,000 documents.
This query takes 1.8s. This is its profile: https://gist.github.com/afecec63a994f8f7fd8a
Indexed with: db.ads_view.ensureIndex({campaign: 1});
But MongoDB doesn't use the index. Does anyone know whether the aggregation framework can use indexes, and how to index for this query?
This is a late answer, but since $group in Mongo as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, perform a $sort before $group.
So your query would become:
db.ads_view.aggregate([{$sort:{"campaign":1}},{$group: {_id : "$campaign", "action" : {$sum: 1} }}]);
This assumes an index on campaign, which should have been created according to your question. In Mongo 4.0, create the index with db.ads_view.createIndex({campaign:1}).
I tested this on a collection containing 5.5+ million documents. Without $sort, the aggregation would not have finished even after several hours; with $sort preceding $group, the aggregation takes a couple of seconds.
The $group operator is not one of the ones that will use an index currently. The list of operators that do (as of 2.2) are:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will usually yield on a page fault too), hence the 1.8s.
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the group, and would likely not be terribly fast anyway.
$group doesn't use an index because it doesn't have to. When you $group your items you're essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index so it's the same amount of work.
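To make that concrete, here is a minimal in-memory emulation of {$group: {_id: "$campaign", action: {$sum: 1}}} in plain JavaScript. It is only a sketch of the idea, not MongoDB's implementation, and groupCount and the sample views array are made up: every document has to be visited once, and the Map keyed by _id plays the role of the "index" being built on the fly.

```javascript
// In-memory emulation of {$group: {_id: "$campaign", action: {$sum: 1}}}.
function groupCount(docs, field) {
  const counts = new Map(); // keyed by the group _id
  for (const doc of docs) { // every document is visited exactly once
    const key = doc[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts, ([_id, action]) => ({ _id, action }));
}

// Made-up sample documents.
const views = [{ campaign: "a" }, { campaign: "b" }, { campaign: "a" }];
console.log(groupCount(views, "campaign"));
```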