MongoDB $group performance

I have two collections: Account and MyCollection.
MyCollection has a one-to-many relationship with Account.
This relation is set by storing the id of the linked Account in MyCollection.
I need to filter MyCollection by an attribute of Account, and I need to extract only one MyCollection object per Account.
Here is the request I am making and an overview of my data structure:
// The document structure
Account: {
types: ["type_A"], // An indexed array
}
MyCollection: {
accountId: "accountId", // Id of the linked account, indexed
date: new ISODate(), // an indexed date
// Other data that we need
}
// The actual request
db.MyCollection.aggregate([
{
$match: {someData: "foo"},
},
{
$sort: {date: -1},
},
// Inject foreign account
{
$lookup: {
from: "Account",
localField: "accountId",
foreignField: "_id",
as: "account",
},
},
{
$unwind: "$account",
},
// match account type
{
$match: {
"account.type": {$in: ["type_A", "type_B"]},
},
},
// And extract latest document
{
$group: {
_id: "$accountId",
myCollection: {$first: "$$ROOT"},
},
},
]);
This request gives me the correct results, but it takes several minutes to process...
How can I speed this up? I have an index on Account.type, but it is not used here.
My tests showed that the $group stage is the most time-consuming part (the request takes half a second without it).
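One restructuring that is often suggested for this shape of query is to start from Account, so the indexed type filter runs first, and to pull only the newest matching MyCollection document per account with a correlated $lookup and a $limit (MongoDB 3.6+). This is only a sketch under the field names shown above (note the schema shows types while the pipeline matches account.type; use whichever is the real field name), not a verified fix:

db.Account.aggregate([
  // Indexed filter on the Account side runs first
  { $match: { types: { $in: ["type_A", "type_B"] } } },
  {
    $lookup: {
      from: "MyCollection",
      let: { accId: "$_id" },
      pipeline: [
        // Match the linked account plus the extra filter from the question
        { $match: { someData: "foo", $expr: { $eq: ["$accountId", "$$accId"] } } },
        // Indexed sort, then keep only the latest document per account
        { $sort: { date: -1 } },
        { $limit: 1 },
      ],
      as: "myCollection",
    },
  },
  // Drop accounts that have no matching document
  { $unwind: "$myCollection" },
]);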

Related

MongoDB aggregate - return as separate objects

So I have 2 collections (tenants and campaigns) and I'm trying to compose a query that returns 1 tenant and 1 campaign. As input, there is a tenant domain and a campaign slug. Since I first need the tenant _id to query the campaign (based on both tenantId and slug), aggregation seems a more performant option than making 2 consecutive queries.
Technically speaking, I know how to do that:
[{
$match: { 'domains.name': '<DOMAIN_HERE>' },
}, {
$lookup: {
from: 'campaigns',
localField: '_id',
foreignField: 'tenantId',
as: 'campaign',
pipeline: [{
$match: { slug: '<SLUG_HERE>' },
}],
},
}]
which returns:
{
_id: ObjectId('...'),
campaign: [{
_id: ObjectId('...'),
}],
}
But it feels very awkward, because for one thing the campaign is returned as a field of the tenant, and for another it is returned as a single-item array. I know I can process and reformat the result programmatically afterwards. But is there any way to "hack" the aggregation to achieve a result that looks more like this?
{
tenant: {
_id: ObjectId('...'),
},
campaign: {
_id: ObjectId('...'),
},
}
This is just a simplified example; in reality this aggregation query is a bit more complicated (it spans more collections, on a few of which I need to perform a very similar query), so it's not just about this one simple query. So the ability to return an aggregated document as a separate object, rather than as an array field on the parent document, would be quite helpful - if not, the world won't fall apart :)
To all those whom it may concern...
Thanks to answers from some good Samaritans here, I've figured it out as a combination of $addFields, $project and $unwind. Extending my original aggregation query, the final pipeline looks like this:
[{
$match: { 'domains.name': '<DOMAIN_HERE>' },
}, {
$addFields: { tenant: '$$ROOT' },
}, {
$project: { _id: 0, tenant: 1 },
}, {
$lookup: {
from: 'campaigns',
localField: 'tenant._id',
foreignField: 'tenantId',
as: 'campaign',
pipeline: [{
$match: { slug: '<SLUG_HERE>' },
}],
},
}, {
$unwind: {
path: '$campaign',
preserveNullAndEmptyArrays: true,
},
}]
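As a possible shortening (an untested sketch, assuming MongoDB 4.2+), the $addFields + $project pair that wraps the root document can be collapsed into a single $replaceWith stage; the $lookup on tenant._id and the $unwind stay the same:

{
  $replaceWith: { tenant: '$$ROOT' },
},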
Thanks for the help! 😊

Access root document map in the $filter aggregation (MongoDb)

I apologize for the vague question description, but I have quite a complex question regarding filtering in MongoDB aggregations. Please see my data schema to understand the question better:
Company {
_id: ObjectId
name: string
}
License {
_id: ObjectId
companyId: ObjectId
userId: ObjectId
}
User {
_id: ObjectId
companyId: ObjectId
email: string
}
The goal:
I would like to query all non-licensed users. In order to do this, you would need these plain MongoDB queries:
const licenses = db.licenses.find({ companyId }); // Get all licenses for specific company
const userIds = licenses.toArray().map(l => l.userId); // Collect all licensed user ids
const nonLicensedUsers = db.users.find({ _id: { $nin: userIds } }); // Query all users that don't hold a license
The problem:
The code above works perfectly fine. However, in our system companies may have hundreds of thousands of users, so the first and last steps become exceptionally expensive. To elaborate: first, you need to fetch a large number of documents from the DB and transmit them over the network, which is fairly expensive. Then you need to send a huge $nin query back to MongoDB over the network, which doubles the overhead.
So I would like to perform all of the above operations on the MongoDB side and return only a small slice of non-licensed users, to avoid the network transmission costs. Are there any ideas on how to achieve this?
I was able to come pretty close using the following aggregation (pseudo-code):
db.company.aggregate([
{ $match: { _id: id } }, // Step 1. Find the company entity by id
{ $lookup: {...} }, // Step 2. Joins 'users' collection by `companyId` field
{ $lookup: {...} }, // Step 3. Joins 'licenses' collection by `companyId` field
{
$project: {
licensesMap: // Step 4. Convert 'licenses' array to the map with the shape { 'user-id': true }. Could be done with $arrayToObject operator
}
},
{
$project: {
unlicensedUsers: {
$filter: {...} // And this is the place, where I stopped
}
}
}
]);
Let's have a closer look at the final stage of the above aggregation. I tried to use the $filter operator in the following manner:
{
$filter: {
input: "$users",
as: "user",
cond: {
$ne: ["$licensesMap[$$user._id]", true]
}
}
}
But, unfortunately, that didn't work. It seemed like MongoDB didn't interpolate the bracket expression and just tried to compare the raw "$licensesMap[$$user._id]" string with the boolean value true.
Note #1:
Unfortunately, we're not in a position to change the current data schema. It would be costly for us.
Note #2:
I didn't include this in the aggregation example above, but I did convert the Mongo ObjectIds to strings to be able to create the licensesMap. I also stringified the ids in the users list so I could access licensesMap properly.
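For what it's worth, here is a sketch of a $filter condition that avoids bracket-style interpolation (which the aggregation language does not support): instead of building licensesMap, build the array of licensed user ids with $map and test membership with $in. The users and licenses field names follow the pseudo-code above and are assumptions:

{
  $project: {
    name: 1,
    unlicensedUsers: {
      $filter: {
        input: "$users",
        as: "user",
        cond: {
          $not: [
            {
              $in: [
                "$$user._id",
                // Array of licensed user ids, built from the joined licenses
                { $map: { input: "$licenses", as: "l", in: "$$l.userId" } }
              ]
            }
          ]
        }
      }
    }
  }
}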
Sample data:
Companies collection:
[
{ _id: "1", name: "Acme" }
]
Licenses collection
[
{ _id: "1", companyId: "1", userId: "1" },
{ _id: "2", companyId: "1", userId: "2" }
]
Users collection:
[
{ _id: "1", companyId: "1" },
{ _id: "2", companyId: "1" },
{ _id: "3", companyId: "1" },
{ _id: "4", companyId: "1" },
]
The expected result is:
[
{
_id: "1", // company id
name: "Acme",
unlicensedUsers: [
{ _id: "3", companyId: "1" },
{ _id: "4", companyId: "1" },
]
}
]
Explanation: unlicensedUsers list contains the third and the fourth users because they don't have corresponding entries in the licenses collection.
How about something simple like:
db.usersCollection.aggregate([
{
$lookup: {
from: "licensesCollection",
localField: "_id",
foreignField: "userId",
as: "licensedUsers"
}
},
{$match: {"licensedUsers.0": {$exists: false}}},
{
$group: {
_id: "$companyId",
unlicensedUsers: {$push: {_id: "$_id", companyId: "$companyId"}}
}
},
{
$lookup: {
from: "companiesCollection",
localField: "_id",
foreignField: "_id",
as: "company"
}
},
{$project: {unlicensedUsers: 1, company: {$arrayElemAt: ["$company", 0]}}},
{$project: {unlicensedUsers: 1, name: "$company.name"}}
])
playground example
The users and licenses collections both contain everything you need about the users, so after the first $lookup that "merges" them and a simple $match to keep only the unlicensed users, all that's left is formatting the output into the shape you requested.
Bonus: this solution can work with any type of id. For example, playground
If you're facing a similar situation, bear in mind that the above solution will only be fast with the hashed index.
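A sketch of the index that note refers to, using the collection and field names from the playground example above (both are assumptions about your real names):

// Makes the $lookup's foreignField probe cheap
db.licensesCollection.createIndex({ userId: 1 })
// or, for the arbitrary-id variant mentioned in the bonus, a hashed index
db.licensesCollection.createIndex({ userId: "hashed" })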

How to make a query in two different collections in MongoDB? (without using an ORM)

Suppose in MongoDB I have two collections: one is "Students" and the other is "Course".
Students has documents such as
{"id":"1","name":"Alex"},..
and Course has documents such as
{"course_id":"111","course_name":"React"},..
and there is a third collection named "students-courses" where I keep each student's id with their corresponding course ids, like this:
{"student_id":"1","course_id":"111"}
I want to make a query with a student's id so that it outputs his/her enrolled courses, like this:
{
"id": "1",
"name":"Alex",
"taken_courses": [
{"course_id":"111","course_name":"React"},
{"course_id":"112","course_name":"Vue"}
]
}
It is a many-to-many relationship in MongoDB, without using an ORM. How can I make this query?
You need to use $lookup with a pipeline.
First, $group by student_id, because we are going to get the courses of each student; $push all the course_id values into course_ids for the lookup in the next step.
db.StudentCourses.aggregate([
{
$group: {
_id: "$student_id",
course_ids: {
$push: "$course_id"
}
}
},
$lookup into the Student collection and get the student details in student
$unwind student because it is an array and we need only one document per student group
$project the required fields
{
$lookup: {
from: "Student",
localField: "_id",
foreignField: "id",
as: "student"
}
},
{
$unwind: "$student"
},
{
$project: {
id: "$_id",
name: "$student.name",
course_ids: 1
}
},
$lookup into the Course collection and get all courses whose course_id is contained in course_ids, which we prepared in the $group above
$project the required fields
the course details will be stored in taken_courses
{
$lookup: {
from: "Course",
let: {
cId: "$course_ids"
},
pipeline: [
{
$match: {
$expr: {
$in: [
"$course_id",
"$$cId"
]
}
}
},
{
$project: {
_id: 0
}
}
],
as: "taken_courses"
}
},
$project the details and remove the fields that are not required
{
$project: {
_id: 0,
course_ids: 0
}
}
])
Working Playground: https://mongoplayground.net/p/FMZgkyKHPEe
For more details on syntax and usage, check the aggregation documentation.
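If only a single student is needed, as the question implies, a hedged tweak is to prepend a $match on that student's id so the $group only processes that student's enrollment rows (the id value here is just a placeholder):

db.StudentCourses.aggregate([
  { $match: { student_id: "1" } }, // placeholder student id
  // ...followed by the $group, $lookup, $unwind and $project stages shown above
])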

How to lookup a field in an array of subdocuments in mongoose?

I have an array of review objects like this:
"reviews": {
"author": "5e9167c5303a530023bcae42",
"rate": 5,
"spoiler": false,
"content": "This is a comment This is a comment This is a comment.",
"createdAt": "2020-04-12T16:08:34.966Z",
"updatedAt": "2020-04-12T16:08:34.966Z"
},
What I want to achieve is to look up the author field and get the user data, but the problem is that the lookup I am trying to use only returns this to me:
Code:
.lookup({
from: 'users',
localField: 'reviews.author',
foreignField: '_id',
as: 'reviews.author',
})
Response:
Is there any way to get the author's data in that field? That's where the author's id is.
Try executing the query below on your database:
db.reviews.aggregate([
/** $unwind is generally not needed before `$lookup`; it is needed here because we want to match the lookup result against a specific element of the array */
{
$unwind: { path: "$reviews", preserveNullAndEmptyArrays: true },
},
{
$lookup: {
from: "users",
localField: "reviews.author",
foreignField: "_id",
as: "author", // Pull lookup result into 'author' field
},
},
/** Update the 'reviews.author' field in the 'reviews' object by checking whether the 'author' field got a match from the 'users' collection.
* If yes - since $lookup returns an array, take its first element and assign it (only one element will be returned, as _id is unique),
* If no - keep 'reviews.author' as is */
{
$addFields: {
"reviews.author": {
$cond: [
{ $ne: ["$author", []] },
{ $arrayElemAt: ["$author", 0] },
"$reviews.author",
],
},
},
},
/** Group back the documents based on '_id' field & push back all individual 'reviews' objects to 'reviews' array */
{
$group: {
_id: "$_id",
reviews: { $push: "$reviews" },
},
},
]);
Test : MongoDB-Playground
Note: in case you have other fields in the document along with reviews that need to be preserved in the output, then starting at $group use these stages instead:
{
$group: {
_id: "$_id",
data: {
$first: "$$ROOT"
},
reviews: {
$push: "$reviews"
}
}
},
{
$addFields: {
"data.reviews": "$reviews"
}
},
{
$project: {
"data.author": 0
}
},
{
$replaceRoot: {
newRoot: "$data"
}
}
Test : MongoDB-Playground
Note: try to keep the query running on smaller datasets, for example by adding a $match as the first stage to filter documents, and also make sure you have proper indexes.
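A minimal sketch of that note, assuming you are after a single parent document from the same collection used in the pipeline above (the ObjectId value is a placeholder):

db.reviews.aggregate([
  // Narrow the input set before the $unwind / $lookup stages
  { $match: { _id: ObjectId("...") } }, // placeholder _id
  // ...then the $unwind, $lookup, $addFields and $group stages from above
]);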
You should use mongoose's populate('author') method on the request to the server; it takes the id of that author and adds the user's data to the mongoose response.
And don't forget to set up your schemas so that these two collections are connected:
in your review schema you should add a ref to the schema in which the author user is saved
author: { type: Schema.Types.ObjectId, ref: 'users' },
You can follow this code:
$lookup: {
from: 'users',
localField: 'reviews.author',
foreignField: '_id',
as: 'reviews.author'
}
OR, when you find the doc, use populate:
reviews.find().populate("author")
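For the populate route, a sketch assuming the parent model is called Movie, the embedded array is reviews as in the question, and reviews.author has ref: 'users' set as shown above; Mongoose can populate a field inside an array of subdocuments by its dotted path:

// Populate the author of every review subdocument in one call
const movies = await Movie.find({}).populate('reviews.author');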

Poor lookup aggregation performance

I have two collections
Posts:
{
"_Id": "1",
"_PostTypeId": "1",
"_AcceptedAnswerId": "192",
"_CreationDate": "2012-02-08T20:02:48.790",
"_Score": "10",
...
"_OwnerUserId": "6",
...
},
...
and users:
{
"_Id": "1",
"_Reputation": "101",
"_CreationDate": "2012-02-08T19:45:13.447",
"_DisplayName": "Geoff Dalgas",
...
"_AccountId": "2"
},
...
and I want to find users who have written between 5 and 15 posts.
This is what my query looks like:
db.posts.aggregate([
{
$lookup: {
from: "users",
localField: "_OwnerUserId",
foreignField: "_AccountId",
as: "X"
}
},
{
$group: {
_id: "$X._AccountId",
posts: { $sum: 1 }
}
},
{
$match : {posts: {$gte: 5, $lte: 15}}
},
{
$sort: {posts: -1 }
},
{
$project : {posts: 1}
}
])
and it is terribly slow. For 6k users and 10k posts it takes over 40 seconds to get a response, while in a relational database I get a response in a split second.
Where's the problem? I'm just getting started with MongoDB, and it's quite possible that I messed up this query.
From https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/:
foreignField - Specifies the field from the documents in the from collection. $lookup performs an equality match on the foreignField to the localField from the input documents. If a document in the from collection does not contain the foreignField, the $lookup treats the value as null for matching purposes.
This match will be performed like any other query.
If you don't have an index on the field _AccountId, it will do a full collection scan for each one of the 10,000 posts, and the bulk of the time will be spent in those scans.
db.users.createIndex({ _AccountId: 1 })
speeds up the process so it does 10,000 index hits instead of 10,000 collection scans.
In addition to bauman.space's suggestion to put an index on the _AccountId field (which is critical), you should also place your $match stage as early as possible in the aggregation pipeline (i.e. as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before the $lookup (join) stage.
The reason your query is terribly slow is that, for every post, it is doing a non-indexed, sequential read over every user. That's around 60m reads!
Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.
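A quick way to confirm where the time goes and whether any index gets used is to run the pipeline through explain; this is just a sketch, with the pipeline being the one from the question:

db.posts.explain("executionStats").aggregate([
  // ...the $lookup / $group / $match / $sort / $project stages from the question
]);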
First use $match, then $lookup. The $match filters the rows that need to be examined by the $lookup, which is more efficient.
As long as you're going to group by user, you should do the $group first (by _OwnerUserId), then $lookup only after filtering the accounts with between 5 and 15 posts; this will reduce the number of lookups:
db.posts.aggregate([{
$group: {
_id: "$_OwnerUserId",
postsCount: {
$sum: 1
},
posts: {
$push: "$$ROOT"
} //if you need to keep original posts data
}
},
{
$match: {
postsCount: {
$gte: 5,
$lte: 15
}
}
},
{
$lookup: {
from: "users",
localField: "_id",
foreignField: "_AccountId",
as: "X"
}
},
{
$unwind: "$X"
},
{
$sort: {
postsCount: -1
}
},
{
$project: {
postsCount: 1,
X: 1
}
}
])