MongoDB index use on mongoose aggregate() call

I have the following aggregate() call in Mongoose 6.9.0:
const latestTradeForPair = await Trade.aggregate([
  {
    $lookup: {
      from: 'tradablepairs',
      localField: 'tradablePair',
      foreignField: '_id',
      as: 'tradablePairDetails',
    },
  },
  {
    $match: {
      'tradablePairDetails.pair_name': pairName,
    },
  },
  {
    $sort: {
      time: -1,
    },
  },
  {
    $limit: 1,
  },
]);
I have been trying, unsuccessfully, to create an index to speed up this query, since I have millions of Trade documents. I have this:
tradeSchema.index({ tradablePair: 1, time: -1 });
However, the query does not appear to be hitting this index (according to Compass). I have been quizzing ChatGPT about this, but it is not very helpful, and every suggestion gives rise to errors about writeConcerns and other such things.
How do I create an index that will be targeted by this query?
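Since the pipeline starts with $lookup, the $match on the joined field cannot use any index on Trade. A minimal sketch of one way to restructure the call so the { tradablePair: 1, time: -1 } index can be used: resolve the pair's _id first, then match, sort, and limit on Trade before the $lookup (the TradablePair model name is an assumption, not from the question):

// Sketch: resolve the pair's _id up front so the leading $match/$sort can be
// served by the { tradablePair: 1, time: -1 } compound index.
// "TradablePair" is an assumed model name for the 'tradablepairs' collection.
const pair = await TradablePair.findOne({ pair_name: pairName }).select('_id').lean();

const latestTradeForPair = await Trade.aggregate([
  { $match: { tradablePair: pair._id } }, // equality on the index prefix
  { $sort: { time: -1 } },                // the index provides the sort order
  { $limit: 1 },
  {
    $lookup: {
      from: 'tradablepairs',
      localField: 'tradablePair',
      foreignField: '_id',
      as: 'tradablePairDetails',
    },
  },
]);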

Related

Mongo performance is extremely slow for an aggregation query

Hope someone can help with this slow Mongo query - it runs fine against smaller collections, but once we test it against the larger production collections it fails with the message "Not enough disk space", even though we had limited the result set to 100.
I feel like there is an issue with the query structure and/or missing indexes.
Both collections are ~5 million records.
We need help to make this query fast.
// divide these by 1000 because the ts field isn't javascript milliseconds
const startDate = (ISODate("2022-07-01T00:00:00.000Z").getTime() / 1000)
const endDate = (ISODate("2022-08-10T00:00:00.000Z").getTime() / 1000)
const clientId = xxxx
const ordersCollection = "orders"
const itemsCollection = "items"

db[ordersCollection].aggregate([
  {
    $lookup: {
      from: itemsCollection,
      localField: "data.id",
      foreignField: "data.orders_id",
      as: "item"
    }
  },
  { $unwind: "$item" },
  { $match: { "data.client_id": clientId } },
  { $match: { "item.ts": { $gt: startDate, $lt: endDate } } },
  {
    $project: {
      order_id: "$data.id",
      parent_id: "$data.parent_id",
      owner_id: "$data.owner_id",
      client_id: "$data.client_id",
      ts: "$item.ts",
      status: {
        $cond: {
          if: { $eq: ["$item.data.status", 10] },
          then: 3,
          else: {
            $cond: {
              if: { $eq: ["$item.data.status", 4] },
              then: 2,
              else: "$item.data.status"
            }
          }
        }
      }
    }
  },
  {
    $group: {
      _id: { "order_id": "$order_id", "status": "$status" },
      order_id: { $first: "$order_id" },
      parent_id: { $first: "$parent_id" },
      owner_id: { $first: "$owner_id" },
      client_id: { $first: "$client_id" },
      ts: { $first: "$ts" },
      status: { $first: "$status" }
    }
  },
  { $sort: { "ts": 1 } }
]).limit(100).allowDiskUse(true)
Try pulling the $match on the main collection up.
This way you limit the number of documents you need to $lookup (otherwise it will try to look up each of the 5 million orders in an items collection that also holds 5 million documents).
Be sure to have an index on data.client_id.
db[ordersCollection].aggregate([
  { $match: { "data.client_id": clientId } },
  {
    $lookup: {
      from: itemsCollection,
      localField: "data.id",
      foreignField: "data.orders_id",
      as: "item"
    }
  },
  { $unwind: "$item" },
  { $match: { "item.ts": { $gt: startDate, $lt: endDate } } },
  ...
As a side note, limiting the result set to 100 does not help much: the heaviest part (the aggregation with lookups and grouping) cannot be limited.
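For reference, a minimal sketch of the indexes this reordered pipeline benefits from (collection and field names taken from the question):

// Supports the leading $match on the orders collection.
db.orders.createIndex({ "data.client_id": 1 })

// Supports the $lookup's equality match on the foreignField in items.
db.items.createIndex({ "data.orders_id": 1 })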

How to count number of root documents in intermediate aggregate stage in mongo?

I want to implement pagination on a website, and I'd like my MongoDB query to first perform the lookup between the 2 collections, sort the documents, calculate the total number of matched documents, and then return the relevant documents after $skip and $limit stages in the aggregation. This is my query:
const res = await Product.aggregate([
  {
    $lookup: {
      from: 'Brand',
      localField: 'a',
      foreignField: 'b',
      as: 'brand'
    }
  },
  {
    $sort: {
      c: 1,
      'brand.d': -1
    }
  },
  { $skip: offset },
  { $limit: productsPerPage }
])
I don't want to make 2 queries which are essentially the same only for the first one to return the count of documents and for the other to return the documents themselves.
So the result would be something like this:
{
documents: [...],
totalMatchedDocumentsCount: x
}
such that there will be for example 10 documents but totalMatchedDocumentsCount may be 500.
I can't figure out how to do this; I don't see that the aggregate method returns a cursor. Is it possible to achieve what I want in one query?
You need $facet: you can run your pipeline with $limit and $skip as one sub-pipeline while $count runs simultaneously in another:
const res = await Product.aggregate([
  // $match here if needed
  {
    $facet: {
      documents: [
        {
          $lookup: {
            from: 'Brand',
            localField: 'a',
            foreignField: 'b',
            as: 'brand'
          }
        },
        {
          $sort: {
            c: 1,
            'brand.d': -1
          }
        },
        { $skip: offset },
        { $limit: productsPerPage }
      ],
      total: [
        { $count: "count" }
      ]
    }
  },
  { $unwind: "$total" }
])
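Reading the result is then a matter of unpacking the single document the pipeline returns; a small sketch, using the field names above:

// res is a one-element array; after the trailing $unwind, total is a plain object.
// Note: when nothing matches, $unwind drops the document, so res is empty.
const { documents, total } = res[0];
console.log(`showing ${documents.length} of ${total.count} matched documents`);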

Use $match on fields from two separate collections in an aggregate query mongodb

I have an aggregate query where I join 3 collections. I'd like to filter the search based on fields from two of those collections. The problem is, I'm only able to use $match on the initial collection that mongoose initialized with.
Here's the query:
var pipeline = [
  {
    $lookup: {
      from: 'blurts',
      localField: 'followee',
      foreignField: 'author.id',
      as: 'followerBlurts'
    }
  },
  { $unwind: '$followerBlurts' },
  {
    $lookup: {
      from: 'users',
      localField: 'followee',
      foreignField: '_id',
      as: 'usertbl'
    }
  },
  { $unwind: '$usertbl' },
  {
    $match: {
      'follower': { $eq: req.user._id },
      //'blurtDate': { $gte: qryDateFrom, $lte: qryDateTo }
    }
  },
  { $sample: { 'size': 42 } },
  {
    $project: {
      _id: '$followerBlurts._id',
      name: '$usertbl.name',
      smImg: '$usertbl.smImg',
      text: '$followerBlurts.text',
      vote: '$followerBlurts.vote',
      blurtDate: '$followerBlurts.blurtDate',
      blurtImg: '$followerBlurts.blurtImg'
    }
  }
];
keystone.list('Follow').model.aggregate(pipeline)
  .sort({ blurtDate: -1 })
  .cursor().exec()
  .toArray(function (err, data) {
    if (!err) {
      res.json(data);
    } else {
      console.log('Error getting following blurts --> ' + err);
    }
  });
Within the pipeline, I can only use $match on the 'Follow' model. When I use $match on the 'Blurt' model, it simply ignores the condition (you can see where I tried to include it in the commented line under $match).
What's perplexing is that I can utilize this field in the .sort method, but not in the $match conditions.
Any help much appreciated.
You can use the mongo dot notation to access elements of the collection that is being looked up via $lookup.
https://docs.mongodb.com/manual/core/document/#dot-notation
So, in this case followerBlurts.blurtDate should give you the value you are looking for.
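Applied to the pipeline above, the commented-out date filter would become something like this (a sketch using the qryDateFrom/qryDateTo variables from the question):

{
  $match: {
    'follower': { $eq: req.user._id },
    // dot notation reaches into the joined documents after the $unwind
    'followerBlurts.blurtDate': { $gte: qryDateFrom, $lte: qryDateTo }
  }
},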

mongodb aggregate skip limit count for pagination

I want to apply pagination to this aggregated data (all the documents that matched, projected together with fields from collections 2 and 3). I have tried multiple queries; I am passing a limit of 25 but get only 20 documents back. What changes does this query need for pagination?
var pipeline = [
  { $match: query },
  { $limit: limit },
  { $skip: skip },
  {
    $lookup: {
      from: "collection2",
      localField: "collection1Field",
      foreignField: "collection2Field",
      as: "combined1"
    }
  },
  { $unwind: "$combined1" },
  {
    $lookup: {
      from: "collection3",
      localField: "collection1Field",
      foreignField: "collection3Field",
      as: "combined2"
    }
  },
  { $unwind: "$combined2" },
  {
    $project: {
      "collection1Field1": 1,
      "collection1Field2": 1,
      "collection1Field3": 1,
      "collection2Field.field1": 1,
      "collection2Field.field2": 1,
      "collection3Field.field1": 1,
      "collection3Field.field2": 1
    }
  }
];
I just had a similar issue. I was getting no results back on "page 2" due to the fact that I ran limit, then skip.
The docs for $skip:
Skips over the specified number of documents that pass into the stage
and passes the remaining documents to the next stage in the pipeline.
The docs for $limit:
Limits the number of documents passed to the next stage in the
pipeline.
This means that if you run limit before skip, the results returned by the limit are potentially truncated by the skip.
For example, if limit is 50 and skip is 50 (say, for page 2), then the match will find items, limit the results to 50, and then skip 50, thus feeding 0 results into any stages afterwards.
I WOULD NOT RECOMMEND running $skip and $limit at the end of the query, as the DB will have done pipeline operations on a substantial amount of data that is then skipped/limited out at the end. Additionally, there is a limit to the amount of memory aggregations can use, which will end queries in an error if the limit is exceeded (100MB - see bottom). There is an option to turn on disk use if you exceed this limit, but a good way to optimize your code without relying on disk is to skip + limit your results before entering any $lookup or $unwind stages.
The one exception to the rule is $sort which is intelligent enough to use $limit if $limit immediately follows $sort (docs):
When a $sort immediately precedes a $limit in the pipeline, the $sort
operation only maintains the top n results as it progresses, where n
is the specified limit, and MongoDB only needs to store n items in
memory. This optimization still applies when allowDiskUse is true and
the n items exceed the aggregation memory limit.
From your code I would recommend the following (skip then limit):
var pipeline = [
  { $match: query },
  { $skip: skip },
  { $limit: limit },
  {
    $lookup: {
      from: "collection2",
      localField: "collection1Field",
      foreignField: "collection2Field",
      as: "combined1"
    }
  },
  { $unwind: "$combined1" },
  {
    $lookup: {
      from: "collection3",
      localField: "collection1Field",
      foreignField: "collection3Field",
      as: "combined2"
    }
  },
  { $unwind: "$combined2" },
  {
    $project: {
      "collection1Field1": 1,
      "collection1Field2": 1,
      "collection1Field3": 1,
      "collection2Field.field1": 1,
      "collection2Field.field2": 1,
      "collection3Field.field1": 1,
      "collection3Field.field2": 1
    }
  }
];
Docs regarding aggregation pipeline limits:
Memory Restrictions (changed in version 2.6):
Pipeline stages have a limit of 100 megabytes of RAM. If a stage
exceeds this limit, MongoDB will produce an error. To allow for the
handling of large datasets, use the allowDiskUse option to enable
aggregation pipeline stages to write data to temporary files.
Changed in version 3.4.
The $graphLookup stage must stay within the 100 megabyte memory limit.
If allowDiskUse: true is specified for the aggregate() operation, the
$graphLookup stage ignores the option. If there are other stages in
the aggregate() operation, allowDiskUse: true option is in effect for
these other stages.
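If you do hit that limit, a small sketch of enabling disk use on the call (shell syntax; collection1 stands in for the base collection, and Mongoose's aggregate exposes the same option via .allowDiskUse(true)):

// Pass allowDiskUse as an option so stages may spill to temporary files.
db.collection1.aggregate(pipeline, { allowDiskUse: true })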
You want to paginate after you get the results.
var pipeline = [
  { $match: query },
  {
    $lookup: {
      from: "collection2",
      localField: "collection1Field",
      foreignField: "collection2Field",
      as: "combined1"
    }
  },
  { $unwind: "$combined1" },
  {
    $lookup: {
      from: "collection3",
      localField: "collection1Field",
      foreignField: "collection3Field",
      as: "combined2"
    }
  },
  { $unwind: "$combined2" },
  {
    $project: {
      "collection1Field1": 1,
      "collection1Field2": 1,
      "collection1Field3": 1,
      "collection2Field.field1": 1,
      "collection2Field.field2": 1,
      "collection3Field.field1": 1,
      "collection3Field.field2": 1
    }
  },
  { $skip: skip },
  { $limit: limit }
];
If you use some npm modules for pagination, you can implement pagination very easily. For example, if you use mongoose-aggregate-paginate, then you just add it to the schema like...
const mongoose = require('mongoose');
const mongooseAggregatePaginate = require('mongoose-aggregate-paginate');

mongoose.Promise = global.Promise;

const chatSchema = new mongoose.Schema({
  text: {
    type: String,
  },
},
{
  collection: 'chat',
  timestamps: true
});

chatSchema.plugin(mongooseAggregatePaginate);

const chat = mongoose.model('chat', chatSchema);
module.exports = chat;
After this, whenever you need pagination, your query should be:
UserCtr.get = (req, res) => {
  const limit = 10;
  const page = req.query.page;
  const aggregateRules = [
    {
      $match: {
        _id: req.user.id
      }
    },
    {
      // Perform your query
    }
  ];
  Chat.aggregatePaginate(aggregateRules, { page, limit }, (err, docs, pages, total) => {
    if (!err) {
      const results = {
        docs,
        total,
        limit,
        page,
        pages,
      };
      res.status(200).json(results);
    } else {
      res.status(500).json(err);
    }
  });
};
It gives a response like:
{
  "docs": [
    { "_id": "5a7676c938c185142f99c4c3" },
    { "_id": "5a7676c938c185142f99c4c4" },
    { "_id": "5a7676cf38c185142f99c4c5" }
  ],
  "total": 3,
  "limit": 50,
  "page": "1",
  "pages": 1
}
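One small thing to note in that response: page comes back as the string "1" because Express query parameters are strings. A sketch of coercing it before the call, inside the controller above:

// req.query values are strings; coerce the page number up front.
const page = parseInt(req.query.page, 10) || 1;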

Poor lookup aggregation performance

I have two collections
Posts:
{
  "_Id": "1",
  "_PostTypeId": "1",
  "_AcceptedAnswerId": "192",
  "_CreationDate": "2012-02-08T20:02:48.790",
  "_Score": "10",
  ...
  "_OwnerUserId": "6",
  ...
},
...
and users:
{
  "_Id": "1",
  "_Reputation": "101",
  "_CreationDate": "2012-02-08T19:45:13.447",
  "_DisplayName": "Geoff Dalgas",
  ...
  "_AccountId": "2"
},
...
and I want to find users who wrote between 5 and 15 posts.
This is what my query looks like:
db.posts.aggregate([
  {
    $lookup: {
      from: "users",
      localField: "_OwnerUserId",
      foreignField: "_AccountId",
      as: "X"
    }
  },
  {
    $group: {
      _id: "$X._AccountId",
      posts: { $sum: 1 }
    }
  },
  { $match: { posts: { $gte: 5, $lte: 15 } } },
  { $sort: { posts: -1 } },
  { $project: { posts: 1 } }
])
and it is terribly slow. For 6k users and 10k posts it takes over 40 seconds to get a response, while in a relational database I get a response in a split second.
Where's the problem? I'm just getting started with MongoDB, and it's quite possible that I messed up this query.
from https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
foreignField Specifies the field from the documents in the from
collection. $lookup performs an equality match on the foreignField to
the localField from the input documents. If a document in the from
collection does not contain the foreignField, the $lookup treats the
value as null for matching purposes.
This will be performed the same as any other query.
If you don't have an index on the field _AccountId, it will do a full table scan for each one of the 10,000 posts. The bulk of the time will be spent in those scans.
db.users.ensureIndex({ "_AccountId": 1 })
speeds up the process so it's doing 10,000 index hits instead of 10,000 table scans.
In addition to bauman.space's suggestion to put an index on the _accountId field (which is critical), you should also do your $match stage as early as possible in the aggregation pipeline (i.e. as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before doing the $lookup (join) stage.
The reason why your query is terribly slow is that for every post, it is doing a non-indexed lookup (sequential read) for every user. That's around 60m reads!
Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.
First use $match, then $lookup. $match filters the rows that need to be examined by $lookup, which is more efficient.
As long as you're going to group by user _AccountId, you should do the $group first (by _OwnerUserId) and run the $lookup only after filtering to accounts having 5 <= postsCount <= 15; this reduces the number of lookups:
db.posts.aggregate([
  {
    $group: {
      _id: "$_OwnerUserId",
      postsCount: { $sum: 1 },
      posts: { $push: "$$ROOT" } // if you need to keep original posts data
    }
  },
  {
    $match: {
      postsCount: { $gte: 5, $lte: 15 }
    }
  },
  {
    $lookup: {
      from: "users",
      localField: "_id",
      foreignField: "_AccountId",
      as: "X"
    }
  },
  { $unwind: "$X" },
  { $sort: { postsCount: -1 } },
  {
    $project: {
      postsCount: 1,
      X: 1
    }
  }
])