MongoDB aggregation query optimization: $match, $lookup and double $unwind - mongodb

Let's say we have two collections:
devices: the objects in this collection have (among others) the fields name (string) and cards (array); each card in that array has the fields model and slot. The cards are not another collection; they are just nested data.
interfaces: the objects from this collection have (among others) the fields name and owner.
Extra info:
for cards, I'm only interested in the ones where slot is a number
for a card of a device that matches the previous condition, there is an interface object in the other collection whose owner field holds the name of the device in question and whose name is s[slot]p1 (the character 's' + the slot of that card + 'p1')
My job is to create a query that generates a summary of all the existing cards in all of those devices, each entry enriched with information from the interfaces collection. I also need to be able to parametrize the query (in case I'm interested only in a certain device with a certain name, only a certain card model, etc.).
So far, I have this:
mongo_client.devices.aggregate([
# Retrieve all the devices having the cards field
{
"$match": {
# "name": "<device-name>",
"cards": {
"$exists": "true"
}
}
},
# Create one document for every element of cards
{
"$unwind": "$cards"
},
# Only take the ones having "slot" a number
{
"$match": {
"cards.slot": {
"$regex": "^\d+$"
}
}
},
# Retrieve the device's interfaces
{
"$lookup": {
"from": "interfaces",
"let": {
"owner": "$name",
},
"as": "interfaces",
"pipeline": [{
"$match": {
"$expr": {
"$eq": ["$owner", "$$owner"]
},
},
}]
}
},
{
"$unwind": "$interfaces"
},
{
"$match": {
"$expr": {
"$eq": ["$interfaces.name", {
"$concat": ["s", "$cards.slot", "p1"]
}]
}
}
},
# Build the final object
{
"$project": {
# Card related fields
"slot": "$cards.slot",
"model": "$cards.model",
# Device related fields
"device_name": "$name",
# Fields from interfaces
"interface_field_x": "$interfaces.interface_field_x",
"interface_field_y": "$interfaces.interface_field_y",
}
},
])
The query works and it's quite fast, but I have a few questions:
Is there any way I can avoid the 2nd $unwind? If for every device there are 50-150 interface objects whose owner is the name of that device, I feel that I'm slowing things down. Every device has a unique interface named s[slot]p1. How can I get that specific object in a better way? I tried using two $eq expressions in the $match inside the $lookup, and even $regex or $regexMatch, but I couldn't reference the outer slot field, even after putting it inside let.
If I want to parametrize my query to filter the data if needed, would you add match expressions as intermediary steps or just filter at the end?
Any other improvements to the query are welcome. I'm also interested in how to make it error-proof (e.g. if by mistake cards is missing or the s[slot]p1 interface is not found).
Thanks!

Your question is missing sample data for the query, but:
Merge the third stage (the $match on cards.slot) into the first stage and drop $exists; matching on cards.slot already implies that cards exists
Instead of pipeline, use localField + foreignField; the pipeline form of $lookup is much slower
The number of unwinds in the query should correspond to what objects you want in the result set:
0 unwinds for devices
1 unwind for cards
2 unwinds for interfaces
To merely match the desired conditions, no unwinds are needed.
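A sketch of how those suggestions could fit together, shown in mongo shell syntax (the question's pymongo dicts translate directly); interface_field_x and interface_field_y are the placeholder field names from the question, and $filter + $arrayElemAt stand in for the second $unwind + $match:
db.devices.aggregate([
    // Devices that have at least one card with a numeric slot
    // (add e.g. "name": "<device-name>" here to parametrize by device)
    { "$match": { "cards.slot": { "$regex": "^\\d+$" } } },
    // One document per card
    { "$unwind": "$cards" },
    // Re-check the slot per card: the first $match only guarantees
    // that *some* card of the device matched
    { "$match": { "cards.slot": { "$regex": "^\\d+$" } } },
    // Plain localField/foreignField join; it can use an index on interfaces.owner
    { "$lookup": {
        "from": "interfaces",
        "localField": "name",
        "foreignField": "owner",
        "as": "interfaces"
    } },
    // Instead of a second $unwind + $match, pick the single s<slot>p1
    // interface out of the joined array; if it is not found, the field
    // simply stays absent instead of the whole document being dropped
    { "$addFields": {
        "interface": {
            "$arrayElemAt": [
                { "$filter": {
                    "input": "$interfaces",
                    "cond": { "$eq": [ "$$this.name", { "$concat": [ "s", "$cards.slot", "p1" ] } ] }
                } },
                0
            ]
        }
    } },
    { "$project": {
        "slot": "$cards.slot",
        "model": "$cards.model",
        "device_name": "$name",
        "interface_field_x": "$interface.interface_field_x",
        "interface_field_y": "$interface.interface_field_y"
    } }
])
As for parametrization, adding the optional filters (device name, card model) as early $match conditions is generally preferable to filtering at the end: the earlier documents are discarded, the less work the later stages and the $lookup have to do.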

Related

Remove duplicates by field based on secondary field

I have a use case where I am working with objects that appear as such:
{
"data": {
"uuid": 0001-1234-5678-9101
},
"organizationId": 10192432,
"lastCheckin": 2022-03-19T08:23:02.435+00:00
}
Due to some old bugs in our application, we've accumulated many duplicates for these items in the database. The origin of the duplicates has been resolved in an upcoming release, but I need to ensure that prior to the release there are no such duplicates because the release includes a unique constraint on the "data.uuid" property.
I am trying to delete records based on the following criteria:
Any duplicate record based on "data.uuid" WHERE lastCheckin is NOT the most recent OR organizationId is missing.
Unfortunately, I am rather new to MongoDB and do not know how to express this in a query. I have tried aggregation to obtain the duplicate records and, while I've been able to do so, I have so far been unable to exclude the record in each duplicate group containing the most recent "lastCheckin" value, or even to include "organizationId" as part of the aggregation. Here's what I came up with:
db.collection.aggregate([
{ $match: {
"_id": { "$ne": null },
"count": { "$gt": 1 }
}},
{ $group: {
_id: "$data.uuid",
"count": {
"$sum": 1
}
}},
{ $project: {
"uuid": "$_id",
"_id": 0
}}
])
The above was mangled together based on various other stackoverflow posts describing the aggregation of duplicates. I am not sure whether this is the right way to approach this problem. One immediate problem that I can identify is that simply getting the "data.uuid" property without any additional criteria allowing me to identify the invalid duplicates makes it hard to envision a single query that can delete the invalid records without taking the valid records.
Thanks for any help.
I am not sure if this is possible via a single query, but this is how I would approach it: first sort the documents by lastCheckin and then group them by data.uuid, like this:
db.collection.aggregate([
{
$sort: {
lastCheckin: -1
}
},
{
$group: {
_id: "$data.uuid",
"docs": {
"$push": "$$ROOT"
}
}
},
]);
Once you have these results, you can filter out the documents you want to delete according to your criteria and collect their _id values. The documents in each group will be sorted by lastCheckin in descending order, so filtering should be easy.
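For instance, the filtering could look roughly like this in the shell; this is only a sketch, under the assumption that per data.uuid group you keep the newest document that has an organizationId and delete everything else:
var idsToDelete = [];
db.collection.aggregate([
    { $sort: { lastCheckin: -1 } },
    { $group: { _id: "$data.uuid", docs: { $push: "$$ROOT" } } }
]).forEach(function (group) {
    var kept = false;
    group.docs.forEach(function (doc) {
        if (!kept && doc.organizationId != null) {
            kept = true; // the newest document with an organizationId survives
        } else {
            idsToDelete.push(doc._id); // older duplicate, or organizationId missing
        }
    });
});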
Finally, delete the documents, using this query:
db.collection.remove({ _id: { $in: [ /* array of _ids collected above */ ] } });

Incredibly slow query performance with $lookup and "sub" aggregation pipeline

Let's say I have two collections, tasks and customers.
Customers have a 1:n relation with tasks via a "customerId" field on the task documents.
I now have a view where I need to display tasks with customer names, AND I also need to be able to filter and sort by customer name. That means I can't do the $limit or $match stage before $lookup in the following query.
So here is my example query:
db.task.aggregate([
{
"$match": {
"_deleted": false
}
},
"$lookup": {
"from": "customer",
"let": {
"foreignId": "$customerId"
},
"pipeline": [
{
"$match": {
"$expr": {
"$and": [
{
"$eq": [
"$_id",
"$$foreignId"
]
},
{
"$eq": [
"$_deleted",
false
]
}
]
}
}
}
],
"as": "customer"
}
},
{
"$unwind": {
"path": "$customer",
"preserveNullAndEmptyArrays": true
}
},
{
"$match": {
"customer.name": 'some_search_string'
}
},
{
"$sort": {
"customer.name": -1
}
},
{
"$limit": 35
},
{
"$project": {
"_id": 1,
"customer._id": 1,
"customer.name": 1,
"description": 1,
"end": 1,
"start": 1,
"title": 1
}
}
])
This query gets incredibly slow as the collections grow in size. With 1000 tasks and 20 customers it already takes about 500 ms to deliver a result.
I'm aware that this happens because the $lookup operator has to do a table scan for each row that enters the aggregation pipeline's lookup stage.
I have tried to set indexes as described here: Poor lookup aggregation performance, but that doesn't seem to have any impact.
My next guess was that the "sub"-pipeline in the $lookup stage is not capable of using indexes, so I replaced it with a simple
"$lookup": {
"from": "customer",
"localField": "customerId",
"foreignField": "_id",
"as": "customer"
}
But still the indexes are not used, or they don't have any impact on performance. (To be honest, I don't know which of the two is the case, since .explain() won't work with aggregation pipelines.)
I have tried the following indexes:
Ascending, descending, hashed and text index on customerId
Ascending, descending, hashed and text index on customer.name
I'm grateful for any ideas on what I'm doing wrong or how I could achieve the same thing with a better aggregation pipeline.
Additional info:
I'm using a three member replica set. I'm on MongoDB 4.0.
Please note: I'm aware that I'm using a non-relational database to achieve highly relational objectives, but in this project MongoDB was our choice due to its Change Streams feature. If anybody knows a different database with a comparable feature (realtime push notifications on changes) that can be run on-premises (so Firebase drops out), I would love to hear about it!
Thanks in advance!
I found out why my indexes weren't used.
I queried the collection using a different collation than the collection's own collation.
But the _id index on a collection always uses the collection's default collation (and other indexes inherit it by default).
Therefore the indexes were not used.
I changed the collection's collation to match the one used by the queries, and now the query takes just a fraction of the time (but is still slow :)).
(Yes, you have to recreate the collection to change its collation; no on-the-fly change is possible.)
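If recreating the collection is the route you take, you can make the two collations match explicitly. A minimal sketch, with the locale "en" and strength 2 purely as example values (collation options require MongoDB 3.4+, and here pipeline stands for the stages shown above):
// Recreate the collection with the collation the queries use
db.createCollection("customer", { collation: { locale: "en", strength: 2 } })

// ...and/or pass the same collation to the aggregation itself
db.task.aggregate(pipeline, { collation: { locale: "en", strength: 2 } })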
Have you considered having a single collection for customer with tasks as an embedded array in each document? That way, you would be able to index search on both customer and task fields.
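For illustration, the embedded shape that suggestion implies could look roughly like this (the example values are made up; the field names come from the query above):
{
    "_id": ObjectId("..."),
    "name": "Some customer",
    "_deleted": false,
    "tasks": [
        {
            "title": "Follow-up call",
            "description": "...",
            "start": ISODate("2020-01-01T09:00:00Z"),
            "end": ISODate("2020-01-01T10:00:00Z"),
            "_deleted": false
        }
    ]
}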

Accelerate mongo update within two collections

I have a Payments collection with a playerId field, which references the _id key of the Person collection. I need to compute, once, the maximal payment of each person and save the value to that person's document. This is how I do it now:
db.Person.find().forEach( function(person) {
var cursor = db.Payment.aggregate([
{$match: {playerId: person._id}},
{$group: {
_id:"$playerId",
maxp: {$max:"$amount"}
}}
]);
var maxPay = 0;
if (cursor.hasNext()) {
maxPay = cursor.next().maxp;
}
person.maxPay = maxPay;
db.Person.save(person);
});
I suppose computing maxPay on the Payments collection once for all Persons should be faster, but I don't know how to write that in code. Could you help me, please?
You can run a single aggregation pipeline which starts with a $lookup stage to do a "left join" on the Payment collection. This is necessary in order to get the data from the right collection (payments) embedded within the resulting documents, as an array called payments.
The $unwind stage that follows deconstructs the embedded payments array, i.e. it generates a new record for each element of the payments field. It basically flattens the data, which is useful for the next $group stage.
In this $group pipeline stage, you calculate your desired aggregates by applying the accumulator expression(s). If for instance your Person schema has other fields you wish to retain, then the $first accumulator operator should suffice in addition to the $max operator for the extra maxPay field.
UPDATE
Unfortunately, there is no operator to "include all fields" in the $group aggregation pipeline stage. This is because $group is mostly used to group and calculate/aggregate data from collection fields (sum, avg, etc.), and returning all the collection's fields is not its intended purpose. The $group stage is similar to SQL's GROUP BY clause, where you can't use GROUP BY unless you use one of the aggregation functions (accumulator operators in MongoDB). In the same way, if you need to retain most fields, you have to use an accumulator in MongoDB as well; in this case, you apply $first to each field you want to keep.
You can also use the $$ROOT system variable which references the root document. Keep all fields of this document in a field within the $group pipeline, for example:
{
"$group": {
"_id": "$_id",
"maxPay": { "$max": "$payments.amount" },
"doc": { "$first": "$$ROOT" }
}
}
The drawback with this approach is that you need a further $project pipeline stage to reshape the fields so that they match the original schema, because the documents coming out of the pipeline will have only three fields: _id, maxPay and the embedded doc field.
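For example, one way to do that reshaping is to promote the preserved document back to the top level and merge the computed field into it. A sketch ($replaceRoot requires MongoDB 3.4+, $mergeObjects 3.6+):
{
    "$replaceRoot": {
        "newRoot": { "$mergeObjects": [ "$doc", { "maxPay": "$maxPay" } ] }
    }
}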
The final pipeline stage, $out, writes the resulting documents of the aggregation pipeline to the same collection, akin to updating the Person collection by atomically replacing the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection:
db.Person.aggregate([
{
"$lookup": {
"from": "Payment",
"localField": "_id",
"foreignField": "playerId",
"as": "payments"
}
},
{ "$unwind": {
"path": "$payments",
"preserveNullAndEmptyArrays": true
} },
{
"$group": {
"_id": "$_id",
"maxPay": { "$max": "$payments.amount" },
/* extra fields for demo purposes
"firstName": { "$first": "$firstName" },
"lastName": { "$first": "$lastName" }
*/
}
},
{ "$out": "Person" }
])

Can I use populate before aggregate in mongoose?

I have two models, one is user
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score, recorded several times every day for all users
ScoreSchema = new Schema({
userID: {type: String, ref: 'User'},
score: Number,
created_date: Date,
....
})
I would like to do some queries/calculations on the scores for users meeting a specific requirement; say, I would like to calculate the average score, day by day, for all users older than 20.
My thought was to first do the populate on Scores to populate the users' ages and then do the aggregate after that.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries ( or more accurately an $in query ) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach here is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline operation. Also probably best to handle from the User collection in this case in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foriegnField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
var userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userId": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So getting the list of valid users from the other collection to the client, and then feeding that list to the other collection in a query, is the only way to get this to happen in earlier releases.

Get distinct records with specified fields that match a value, paginated

I'm trying to get all documents in my MongoDB collection
by distinct customer ids (custID)
where status code == 200
paginated (skipped and limit)
return specified fields
var Order = mongoose.model('Order', orderSchema());
My original thought was to use a mongoose db query, but you can't use distinct with skip and limit, as distinct is a method that returns an "array", and therefore you cannot modify something that is not a "Cursor":
Order
.distinct('request.headers.custID')
.where('response.status.code').equals(200)
.limit(limit)
.skip(skip)
.exec(function (err, orders) {
callback({
data: orders
});
});
So then I thought to use aggregate, using $group to get distinct custID records, $match to return all unique custID records that have a status code of 200, and $project to include the fields that I want:
Order.aggregate(
[
{
"$project" :
{
'request.headers.custID' : 1,
//other fields to include
}
},
{
"$match" :
{
"response.status.code" : 200
}
},
{
"$group": {
"_id": "$request.headers.custID"
}
},
{
"$skip": skip
},
{
"$limit": limit
}
],
function (err, order) {}
);
This returns an empty array though. If I remove the $project stage, only the request.headers.custID field is returned, when in fact I need more.
Any thoughts?
The thing you need to understand about aggregation pipelines is that the word "pipeline" generally means each stage only receives the input emitted by the preceding stage, in order of execution. The best analogy to think of here is the unix pipe |, where the output of one command is "piped" to the next:
ps aux | grep mongo | tee out.txt
So aggregation pipelines work in much the same way, and the other main thing to consider is that both the $project and $group stages emit only those fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.
Another thing to get used to is that stages like $match are best placed at the beginning of a pipeline, ahead of field selection. The primary reason for this is possible index selection and usage, which speeds things up immensely. Also, field selection with $project followed by $group is somewhat redundant, as both essentially select fields, so the two are usually best combined where appropriate.
Hence, most optimally, you do:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"otherField": { "$first": "$otherField" },
// and so on for each field to select
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
The main thing to remember about $group here is that every field other than _id (which is the grouping key) requires an accumulator to be selected, since there are in fact always multiple occurrences of the values for the grouping key.
In this case we are using $first as an accumulator, which will take the first occurance from the grouping boundary. Commonly this is used following a $sort, but does not need to be so, just as long as you understand the behavior of what is selected.
Other accumulators, like $max, simply take the largest value of the field from within the values inside the grouping key, and are therefore independent of the "current record/document", unlike $first or $last. So it all depends on your needs.
Of course you can shortcut the selection in modern MongoDB releases (after MongoDB 2.6) with the $$ROOT variable:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"document": { "$first": "$$ROOT" }
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
This takes a copy of all fields in the document and places them under the named key ("document" in this case). It's a shorter way to write it, but of course the resulting documents have a different structure, with everything now under that one key as sub-fields.
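If that nested shape is inconvenient, on MongoDB 3.4 or later you could promote the stored document back to the top level after the pagination stages, for example:
// appended after the $skip and $limit stages (MongoDB 3.4+)
{ "$replaceRoot": { "newRoot": "$document" } }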
But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.