Incredibly slow query performance with $lookup and "sub" aggregation pipeline - mongodb

Let's say I have two collections, tasks and customers.
Customers have a 1:n relation with tasks via a "customerId" field stored on each task.
I now have a view where I need to display tasks together with their customer names, and I also need to be able to filter and sort by customer name. That means I can't put the $limit or $match stage before the $lookup in the following query.
So here is my example query:
db.task.aggregate([
  {
    "$match": {
      "_deleted": false
    }
  },
  {
    "$lookup": {
      "from": "customer",
      "let": {
        "foreignId": "$customerId"
      },
      "pipeline": [
        {
          "$match": {
            "$expr": {
              "$and": [
                { "$eq": [ "$_id", "$$foreignId" ] },
                { "$eq": [ "$_deleted", false ] }
              ]
            }
          }
        }
      ],
      "as": "customer"
    }
  },
  {
    "$unwind": {
      "path": "$customer",
      "preserveNullAndEmptyArrays": true
    }
  },
  {
    "$match": {
      "customer.name": 'some_search_string'
    }
  },
  {
    "$sort": {
      "customer.name": -1
    }
  },
  {
    "$limit": 35
  },
  {
    "$project": {
      "_id": 1,
      "customer._id": 1,
      "customer.name": 1,
      "description": 1,
      "end": 1,
      "start": 1,
      "title": 1
    }
  }
])
This query gets incredibly slow as the collections grow. With 1,000 tasks and 20 customers it already takes about 500 ms to deliver a result.
I'm aware that this happens because the $lookup operator has to do a collection scan for each document that enters the pipeline's lookup stage.
I have tried to set up indexes as described here: Poor lookup aggregation performance, but that doesn't seem to have any impact.
My next guess was that the "sub"-pipeline in the $lookup stage is not able to use indexes, so I replaced it with a simple:
"$lookup": {
"from": "customer",
"localField": "customerId",
"foreignField": "_id",
"as": "customer"
}
But still the indexes are either not used or have no impact on performance. (To be honest, I don't know which of the two is the case, since .explain() won't work with aggregation pipelines.)
I have tried the following indexes:
Ascending, descending, hashed and text index on customerId
Ascending, descending, hashed and text index on customer.name
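For reference, these attempts correspond roughly to index definitions like the following (the exact definitions and target collections are assumptions, since only the fields are named above):
db.task.createIndex({ "customerId": 1 })          // ascending
db.task.createIndex({ "customerId": -1 })         // descending
db.task.createIndex({ "customerId": "hashed" })   // hashed
db.customer.createIndex({ "name": 1 })            // presumably on the customer collection's name field
db.customer.createIndex({ "name": "text" })       // text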
I'm grateful for any ideas on what I'm doing wrong or how I could achieve the same thing with a better aggregation pipeline.
Additional info:
I'm using a three-member replica set on MongoDB 4.0.
Please note: I'm aware that I'm using a non-relational database to achieve highly relational objectives, but in this project MongoDB was our choice due to its Change Streams feature. If anybody knows a different database with a comparable feature (realtime push notifications on changes) that can be run on-premises (which rules out Firebase), I would love to hear about it!
Thanks in advance!

I found out why my indexes weren't used.
I queried the collection using a different collation than the collection's own collation.
But the indexes on a collection are built using the collection's default collation.
Therefore the indexes were not used.
I changed the collection's collation to match the one used by the queries, and now the query takes just a fraction of the time (but is still slow :)).
(Yes, you have to recreate the collection to change its collation; no on-the-fly change is possible.)
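A minimal sketch of that fix, with locale "en" purely as an example (use whatever collation your queries actually send):
// Recreate the collection with the collation the queries use (example collation only)
db.createCollection("customer", { collation: { locale: "en", strength: 2 } })
// Indexes created without an explicit collation inherit the collection's default collation
db.customer.createIndex({ "name": 1 })
// The aggregation must be run with the same collation so the indexes can be used
db.task.aggregate(pipeline, { collation: { locale: "en", strength: 2 } })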

Have you considered having a single customer collection with tasks as an embedded array in each document? That way you would be able to index and search on both customer and task fields.
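For illustration, the embedded shape could look roughly like this (field names borrowed from the question, values made up); a multikey index then covers both levels:
// Hypothetical embedded-document shape: one customer document holds its tasks
{
  "_id": ObjectId(),
  "name": "Some Customer",
  "tasks": [
    { "title": "Task A", "description": "...", "start": ISODate("2020-01-01"), "end": ISODate("2020-01-02") }
  ]
}
// A compound multikey index supports filtering/sorting on customer and task fields together
db.customer.createIndex({ "name": 1, "tasks.title": 1 })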

Related

Mongodb aggregation that returns all documents where $lookup foreign doc DOESN'T exist

I'm working with a CMS right now where, when pages are deleted, the associated content isn't. For one of my clients this has become burdensome: we now have millions of content docs accumulated over time, and it makes daily operations like restoring and backing up databases prohibitive.
Consider this structure:
Page document:
{
  _id: pageId,
  contentDocumentId: someContentDocId
}
Content document:
{
  _id: someContentDocId,
  page_id: pageId,
  content: [someContent, ...etc]
}
Is there a way to craft a MongoDB aggregation where we aggregate Content docs based on checking page_id, and if our check for page_id returns null, then we aggregate that doc? It's not something as simple as foreignField in a $lookup being set to null, is it?
This should do the trick:
db.content.aggregate([
  {
    "$lookup": {
      "from": "pages",
      "localField": "page_id",
      "foreignField": "_id",
      "as": "pages"
    }
  },
  {
    "$addFields": {
      "pages_length": { "$size": "$pages" }
    }
  },
  {
    "$match": { "pages_length": 0 }
  },
  {
    "$unset": [ "pages", "pages_length" ]
  }
])
We run an aggregation on the content collection and do a normal $lookup against the pages collection. When no matching page is found, the pages array will be [], so we just keep every document where that array is empty.
The aggregation $size expression can't be used directly inside $match (which expects query syntax), so we create a temporary field pages_length to hold the length of the array.
In the end we remove the temporary fields with $unset.
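If you prefer to skip the helper field, an equivalent variant (same collections as above) simply checks that the joined array has no first element:
db.content.aggregate([
  { "$lookup": { "from": "pages", "localField": "page_id", "foreignField": "_id", "as": "pages" } },
  // an empty "pages" array means the referenced page no longer exists
  { "$match": { "pages.0": { "$exists": false } } },
  { "$unset": "pages" }
])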

MongoDB aggregation query optimization: $match, $lookup and double $unwind

Let's say we have two collections:
devices: the documents in this collection have (among others) the fields name (string) and cards (array); each element of that array has the fields model and slot. The cards are not a separate collection, just nested data.
interfaces: the documents in this collection have (among others) the fields name and owner.
Extra info:
for cards, I'm only interested in the ones where slot is a number
for a card of a device that matches the previous condition, there is an interface document in the other collection whose owner field holds the name of the device in question and whose name is s[slot]p1 (the character 's' + the slot of that card + 'p1')
My job is to create a query to generate a summary of all the existing cards in all of those devices, each entry being enriched with information from the interfaces collection. I also need to be able to parametrize the query (in case I'm interested only in a certain device with a certain name, only a certain model for cards etc.)
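For illustration (sample data is not part of the question, so these documents are made up), the shapes described above might look like this:
// Hypothetical sample documents matching the description above
db.devices.insertOne({
  "name": "router-01",
  "cards": [
    { "model": "X-100", "slot": "1" },     // numeric slot -> of interest
    { "model": "MGMT-1", "slot": "mgmt" }  // non-numeric slot -> ignored
  ]
})
db.interfaces.insertOne({
  "name": "s1p1",            // "s" + slot + "p1"
  "owner": "router-01",      // the device's name
  "interface_field_x": "x",
  "interface_field_y": "y"
})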
So far, I have this:
mongo_client.devices.aggregate([
    # Retrieve all the devices having the cards field
    {
        "$match": {
            # "name": "<device-name>",
            "cards": {"$exists": True}
        }
    },
    # Unwind: one document per element of the cards array
    {"$unwind": "$cards"},
    # Only keep the ones where "slot" is a number
    {
        "$match": {
            "cards.slot": {"$regex": r"^\d+$"}
        }
    },
    # Retrieve the device's interfaces
    {
        "$lookup": {
            "from": "interfaces",
            "let": {"owner": "$name"},
            "as": "interfaces",
            "pipeline": [
                {"$match": {"$expr": {"$eq": ["$owner", "$$owner"]}}}
            ]
        }
    },
    {"$unwind": "$interfaces"},
    # Keep only the interface named s<slot>p1 for this card
    {
        "$match": {
            "$expr": {
                "$eq": ["$interfaces.name", {"$concat": ["s", "$cards.slot", "p1"]}]
            }
        }
    },
    # Build the final object
    {
        "$project": {
            # Card related fields
            "slot": "$cards.slot",
            "model": "$cards.model",
            # Device related fields
            "device_name": "$name",
            # Fields from interfaces
            "interface_field_x": "$interfaces.interface_field_x",
            "interface_field_y": "$interfaces.interface_field_y",
        }
    },
])
The query works and it's quite fast, but I have a few questions:
Is there any way I can avoid the 2nd $unwind? If for every device there are 50-150 interface documents where owner is the name of that device, I feel that it slows things down. Every device has a unique interface named s[slot]p1. How can I get that specific document in a better way? I tried using two $eq expressions in the $match inside the $lookup, and even $regex or $regexMatch, but I couldn't use the outer slot field, even when I put it inside let.
If I want to parametrize my query to filter the data if needed, would you add the match expressions as intermediate steps or just filter at the end?
Any other improvements to the query are welcome. I'm also interested in how to make it error-proof (e.g. if cards is missing by mistake, or the s[slot]p1 interface is not found).
Thanks!
Your question is missing sample data for the query, but:
Merge the third stage into the first stage, get rid of $exists
Instead of pipeline, use localField + foreignField; pipeline is much slower
The number of unwinds in the query should correspond to what objects you want in the result set:
0 unwinds for devices
1 unwind for cards
2 unwinds for interfaces
To merely match the desired conditions, no unwinds are needed (see the sketch below).
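A possible restructuring along those lines might look like this (field names as in the question; $filter + $arrayElemAt replace the second $unwind, and one $unwind remains because the result should contain one entry per card):
db.devices.aggregate([
  // slot condition merged into the first match; $elemMatch also implies the field exists
  { "$match": { "cards": { "$elemMatch": { "slot": { "$regex": "^\\d+$" } } } } },
  // plain localField/foreignField join: one lookup per device instead of a correlated pipeline
  { "$lookup": { "from": "interfaces", "localField": "name", "foreignField": "owner", "as": "interfaces" } },
  // one unwind, so that every card becomes its own document
  { "$unwind": "$cards" },
  { "$match": { "cards.slot": { "$regex": "^\\d+$" } } },
  // pick the single s<slot>p1 interface out of the joined array instead of unwinding it
  { "$addFields": {
      "interface": {
        "$arrayElemAt": [
          { "$filter": {
              "input": "$interfaces",
              "cond": { "$eq": [ "$$this.name", { "$concat": [ "s", "$cards.slot", "p1" ] } ] }
          }},
          0
        ]
      }
  }},
  { "$project": {
      "device_name": "$name",
      "slot": "$cards.slot",
      "model": "$cards.model",
      "interface_field_x": "$interface.interface_field_x",
      "interface_field_y": "$interface.interface_field_y"
  }}
])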

How to convert an sql data retrieval into mongodb?

Imagine there are two collections in MongoDB, User and History. Below is the SQL query I would write if these were in a relational DB. I want to prepare a similar query in MongoDB.
(The History table contains many records for a particular User.)
SELECT U.id FROM User U
WHERE EXISTS (SELECT * FROM History H
WHERE H.userId = U.id AND H.usage > 25 AND H.balance < 100)
AND U.category = 'VIP' AND U.area = 'XXX';
Based on my understanding of the question, I have written a query using $lookup. Hope this is what you are looking for.
db.user.aggregate([
  {
    "$lookup": {
      "from": "history",
      "localField": "id",
      "foreignField": "userId",
      "as": "history"
    }
  },
  {
    "$match": {
      "category": "VIP",
      "area": "XXX",
      // both conditions must hold on the same history document, as in the SQL EXISTS
      "history": {
        "$elemMatch": {
          "usage": { "$gt": 25 },
          "balance": { "$lt": 100 }
        }
      }
    }
  },
  {
    "$project": {
      "id": 1
    }
  }
])
With a plain query there is no way to retrieve users from one collection based on a condition on a different collection.
But the following aggregation filters users and joins in the history data you asked for. If no matching history is found, an empty array is returned, so we filter those users out.
I'll show you a way with $graphLookup:
// from users
var pipeline = [
  { $match: { category: "VIP", area: "XXX" } },   // get a subset of users
  { $graphLookup: {                               // include docs from History as "myNewData"
      from: "History",
      as: "myNewData",
      startWith: "$id",                           // match the user's id ...
      connectToField: "userId",                   // ... against userId in History
      connectFromField: "id",                     // irrelevant here, only used for depth > 0
      maxDepth: 0,
      restrictSearchWithMatch: { usage: { $gt: 25 }, balance: { $lt: 100 } }  // condition
  }},
  { $match: { "myNewData": { $elemMatch: { $exists: true } } } }
]
db.Users.aggregate(pipeline)
Collections have to be in the same database.
No data is permanently moved unless you use $out or a similar stage.
This should perform better than $lookup (only a subset of docs is brought over).
(I just tweaked the example provided by #zac786.)

MongoDB aggregate ID's efficiently for bulk searches?

I have more than 8 references in a MongoDB document. Those are ObjectIds stored in the origin document, and in order to get the real data of the foreign documents I have to run an aggregation query, something like this:
{
  $lookup: {
    from: "departments",
    let: { "department": "$_department" },
    pipeline: [
      { $match: { $expr: { $eq: ["$_id", "$$department"] } } }
    ],
    as: "department"
  }
},
{
  $unwind: { "path": "$department", "preserveNullAndEmptyArrays": true }
},
That works, and instead of the ObjectId I get the real department object.
However, this takes time and makes the find queries slow.
I have noticed that the same IDs appear multiple times, so it would be better to collect all of the unique IDs, fetch them from the DB only once, and then reuse the same object.
I don't know of any plugin or service that does this with MongoDB. I can build one myself; I just want to know, before I work on something like this, whether there is any kind of service or package on GitHub.
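A minimal sketch of that batching idea, using the Node.js driver (collection and field names like "employees" and "_department" are assumptions for illustration, not a real package):
// Collect the unique referenced ids, fetch them once with $in, then join in memory.
const { ObjectId } = require("mongodb");

async function loadWithDepartments(db) {      // db: an already-connected Db instance
  const docs = await db.collection("employees").find({}).toArray();
  // unique department ids referenced by the documents
  const ids = [...new Set(docs.map(d => String(d._department)))];
  // fetch each department exactly once
  const departments = await db.collection("departments")
    .find({ _id: { $in: ids.map(id => new ObjectId(id)) } })
    .toArray();
  // reuse the same department object for every document that references it
  const byId = new Map(departments.map(dep => [String(dep._id), dep]));
  return docs.map(d => ({ ...d, department: byId.get(String(d._department)) }));
}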

MongoDB query for finding number of people with conflicting schedules [duplicate]

I have startTime and endTime for all records like this:
{
  startTime: 21345678,
  endTime: 31345678
}
I am trying to find the number of all conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].
As an algorithm I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with start time less than that end time. This will be O(n²) time. A better approach would be using an interval tree, inserting each record into the tree and counting the overlaps as they occur. This will be O(n log n) time.
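For reference, a rough client-side sketch of that sort-and-scan idea (plain JavaScript, not a MongoDB query; keeping the open end times in a priority queue instead of an array would bring it closer to O(n log n)):
// Count overlapping pairs by sorting on startTime and tracking intervals that are still "open".
function countConflicts(records) {
  const sorted = [...records].sort((a, b) => a.startTime - b.startTime);
  const openEnds = [];              // endTimes of intervals that may still overlap
  let conflicts = 0;
  for (const r of sorted) {
    // drop intervals that ended before this one starts
    for (let i = openEnds.length - 1; i >= 0; i--) {
      if (openEnds[i] < r.startTime) openEnds.splice(i, 1);
    }
    conflicts += openEnds.length;   // this record conflicts with every still-open interval
    openEnds.push(r.endTime);
  }
  return conflicts;
}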
I have not used mongoDB much so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done, and which one you implement really depends on which approach your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
  { "$lookup": {
    "from": "collection",
    "let": {
      "_id": "$_id",
      "starttime": "$starttime",
      "endtime": "$endtime"
    },
    "pipeline": [
      { "$match": {
        "$expr": {
          "$and": [
            { "$ne": [ "$$_id", "$_id" ] },
            { "$or": [
              { "$and": [
                { "$gte": [ "$$starttime", "$starttime" ] },
                { "$lte": [ "$$starttime", "$endtime" ] }
              ]},
              { "$and": [
                { "$gte": [ "$$endtime", "$starttime" ] },
                { "$lte": [ "$$endtime", "$endtime" ] }
              ]}
            ]}
          ]
        }
      }},
      { "$count": "count" }
    ],
    "as": "overlaps"
  }},
  { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime"
$or "endtime" values of the "current document" falls between the same properties of the "processed document". Noting here that these as well as the respective $gte and $lte operators are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real case. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
var overlaps = db.getCollection('collection').find({
"_id": { "$ne": d._id },
"$or": [
{ "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
{ "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
]
}).toArray();
return ( overlaps.length !== 0 )
? Object.assign(
d,
{
"overlaps": {
"count": overlaps.length,
"documents": overlaps
}
}
)
: null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
The output is then simply merged with the existing document, where the other important distinction is that since these are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering there will be results where the count ( or array length ) is 0 and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server and this filtering is "post processing" in some form or the other.
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution "where you can make it fit" is to instead store a "hard value" representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where your data allows it for the sort of analysis required, this is the far more efficient approach. So if you can keep the "granularity" fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, with the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.