MongoDB: $lookup before $match. Is there a performance issue?

I have 2 Collections:
Users:
{
_id: "unique_user_id",
name: "developer"
}
Reports:
{
_id: "unique_report_id",
createdBy: "unique_user_id"
}
The requirement I have:
extract all reports created by the user named developer, preferably in a single request to MongoDB. This is the query I implemented:
db.getCollection('reports').aggregate(
[
{
"$lookup" : {
from: "users",
localField: "createdBy",
foreignField: "_id",
as: "user"
}
},
{"$unwind": "$user"},
{"$addFields": {"createdBy": "$user.name"}},
{"$match": {"createdBy": "developer"}}
]
);
On my local machine it works fine, but I am worried about the PROD environment, where thousands of Reports documents are already stored.
My query runs the $lookup against every document in the collection and only then applies $match, and I am afraid this will be a huge performance issue.
Could you please confirm or reject my fears,
or provide alternative examples?

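You can modify your lookup stage as:
{
"$lookup": {
"from": "users",
"let": {"userId": "$createdBy"},
"pipeline": [{"$match": {
"name": "developer",
"$expr": {"$eq": ["$_id", "$$userId"]}
}}],
"as": "user"
}
}
This joins only the user documents that match, so only those are brought into memory. Reports created by other users get an empty "user" array, so follow the stage with a $match (or $unwind) to drop them. It will help if the fields used in the match are indexed. After that, use $project to get only the fields you need.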

Since you only need developer, reduce the number of documents passed to the next stages by applying $match at the beginning.
Currently you pass every document in the collection through all of your stages, which will certainly hurt performance.
Either:
Run the aggregation on the users collection,
match the user by name (developer),
then add a $lookup into the reports collection.
Or:
Filter the developer's reports from the reports collection first (for example by createdBy, if you already know the user's _id),
then $lookup into the users collection.
This way you can also avoid $unwind and $addFields if you don't need a flattened output structure.
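The first option above could look roughly like the sketch below. It assumes the user is identified by the name field shown in the question's Users document, and that indexes exist on users.name and reports.createdBy; the variable name reportsByDeveloper is only illustrative. The last two stages are optional and only flatten the result back into plain report documents.

```javascript
// Start from users so that $match can narrow the input before the join.
const reportsByDeveloper = [
  // Keep only the user(s) named "developer".
  { $match: { name: "developer" } },
  // Join the reports created by each remaining user.
  {
    $lookup: {
      from: "reports",
      localField: "_id",
      foreignField: "createdBy",
      as: "reports",
    },
  },
  // Optional: emit one document per report instead of one per user.
  { $unwind: "$reports" },
  { $replaceRoot: { newRoot: "$reports" } },
];

// Runs only inside a Mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.getCollection("users").aggregate(reportsByDeveloper).toArray());
}
```

With this ordering the $lookup runs once per matching user rather than once per report.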

Related

How to prevent retrieval of data from disk in a $lookup?

In the $lookup stage of my aggregation pipeline, it needs to join based on _id, which is indexed on the joined collection. The purpose is to simply check whether there are any matches in joined collection. The actual data of the joined document(s) does not matter, and hence does not need to be retrieved at all whether on disk or RAM.
So, how do I write the $lookup to ensure that data is never retrieved from disk. Instead, a value of true can be returned if matching records were found?
EDITED:
Retrieving data from disk is expensive, hence the reason to avoid it.
In the $lookup, use a pipeline that starts with a $project stage that includes only the _id field.
{$lookup: {
from: "targetCollection",
localField: "fieldName",
foreignField: "_id",
as: "matched",
pipeline: [
{ $project: {_id: 1}},
{ $count: "count" }
]
}}
The query executor should realize that all of the data needed to satisfy that part of the query can be obtained from the index, and not load any documents.
Note: this assumes you are using MongoDB 5.0+
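To turn the result into the true/false value asked for, one possible follow-up is sketched below. It reuses the $lookup from the answer and relies on $count emitting no document when nothing matched, so "matched" is an empty array in that case. The collection name sourceCollection and the field name hasMatch are assumptions made for illustration.

```javascript
const existsPipeline = [
  {
    $lookup: {
      from: "targetCollection",
      localField: "fieldName",
      foreignField: "_id",
      as: "matched",
      pipeline: [{ $project: { _id: 1 } }, { $count: "count" }],
    },
  },
  // "matched" is [] when nothing joined, or [{ count: n }] otherwise.
  { $addFields: { hasMatch: { $gt: [{ $size: "$matched" }, 0] } } },
  // Drop the helper array so only the boolean remains.
  { $project: { matched: 0 } },
];

// Runs only inside a Mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.getCollection("sourceCollection").aggregate(existsPipeline).toArray());
}
```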

Will the joined field use an Index in MongoDB aggregation lookup query [duplicate]

I have the following code snippet to run the aggregation command.
console.time("something");
const cursor = await db.collection("main").aggregate([
{
$match: {
mainField: mainField,
},
},
{
$lookup: {
from: "reference",
localField: "referenceId",
foreignField: "referenceField",
as: "something",
},
},
]);
const results = await cursor.toArray();
console.timeEnd("something");
I have a cheap cloud server for testing purposes (2 GB RAM, 1 CPU, etc.) where MongoDB runs.
I inserted 10k documents into each of the main and reference collections (20k documents combined).
Without indexes, the above aggregation query takes more than 30 seconds to return the result.
With the following index on the reference collection, the same aggregation query takes around 1.2 seconds.
await db.collection("reference").createIndex({ referenceField: 1 });
Unfortunately the MongoDB manual doesn't currently mention potential index usage for $lookup, but this is definitely the case.
A simple $lookup query similar to your example performs an equality match on the foreignField in another collection, so you've added the correct index to improve performance (assuming this field also is reasonably selective).
As of MongoDB 4.0, the index usage for $lookup is not reported in aggregation explain output. There is a relevant issue to watch/upvote in the MongoDB issue tracker: SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection.
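As a rough in-memory analogy for why the index matters (this is not how MongoDB implements $lookup, and the sample documents are made up), compare scanning the whole reference collection for every input document with probing a prebuilt lookup structure:

```javascript
// Hypothetical sample data shaped like the "main" and "reference" collections.
const main = [
  { _id: 1, referenceId: "a" },
  { _id: 2, referenceId: "b" },
  { _id: 3, referenceId: "z" },
];
const reference = [
  { _id: 10, referenceField: "a" },
  { _id: 11, referenceField: "b" },
  { _id: 12, referenceField: "a" },
];

// No index: each input document triggers a full scan of "reference".
function lookupWithScan(mainDocs, refDocs) {
  return mainDocs.map((doc) => ({
    ...doc,
    something: refDocs.filter((ref) => ref.referenceField === doc.referenceId),
  }));
}

// Index analogy: group the reference documents by referenceField once.
function buildIndex(refDocs) {
  const index = new Map();
  for (const ref of refDocs) {
    if (!index.has(ref.referenceField)) index.set(ref.referenceField, []);
    index.get(ref.referenceField).push(ref);
  }
  return index;
}

// With the index: each input document needs only a single probe.
function lookupWithIndex(mainDocs, index) {
  return mainDocs.map((doc) => ({
    ...doc,
    something: index.get(doc.referenceId) || [],
  }));
}
```

Both functions return the same joined documents. The scan does work proportional to main × reference, while the indexed version does one probe per input document, which is in line with the drop from 30 seconds to about 1.2 seconds reported in the question.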

Joins in mongodb using aggregate $lookup

I need help here.
I have two collections. The first isn't very big and has pretty small documents.
The second has many more items (thousands, possibly many more) with medium-sized documents.
A property in the first collection's documents matches a property in the second's.
The relation is one-to-many: each item in the first collection is referenced by many items in the second.
For example, say the first collection represents persons and the second represents credit card transactions. A person may have many transactions.
PersonId is the id in the persons collection, and every document in the transactions collection stores it as well.
I want to write a query that counts how many transactions each person has.
I've seen that aggregate with $lookup is recommended.
But when I try that, I get a message that the document size exceeds the limit.
I'm guessing this is because it aggregates a person with all of their transactions into one document... I'm not sure, as this is my first time experimenting with MongoDB.
What would be the best approach? Is the aggregate method the right choice?
Thanks!
Gili
You can use simple grouping to get the transaction count for each person:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}}
])
The output will contain person ids and the transaction count for each person. You can then match people with their transaction stats in application memory. You can also use $lookup to return the transaction stats together with some person data. But keep in mind that you will not get an entry for people without transactions:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}},
{ $lookup:
{
from: "people",
localField: "_id",
foreignField: "_id",
as: "person"
}
},
{$unwind: "$person"},
{$project:{name:"$person.name", count:"$count"}}
])
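If people without transactions should also appear, one possible variation is to start from people and count inside the $lookup, so the joined array never holds more than one small document. This is only a sketch: it assumes MongoDB 5.0+ (for combining localField/foreignField with pipeline), and the names countsPerPerson and stats are illustrative.

```javascript
const countsPerPerson = [
  {
    $lookup: {
      from: "transactions",
      localField: "_id",
      foreignField: "personId",
      as: "stats",
      // Collapse the joined transactions to a single { count: n } document.
      pipeline: [{ $count: "count" }],
    },
  },
  {
    $project: {
      name: 1,
      // "stats" is empty for people without transactions, so default to 0.
      count: { $ifNull: [{ $arrayElemAt: ["$stats.count", 0] }, 0] },
    },
  },
];

// Runs only inside a Mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.people.aggregate(countsPerPerson).toArray());
}
```

An index on transactions.personId is assumed here; without it every person would trigger a scan of the transactions collection.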

Why isn't there join relation in MongoDB?

I'm learning MongoDB these days, and I found that MongoDB doesn't support joins.
I just want to know why MongoDB chose to do this.
Thanks in advance.
MongoDB is not a relational database and does not have physical relations or constraints.
Joins kill scalability.
Usually denormalization replaces the SQL join.
For example, on Stack Overflow a question has an owner; in MongoDB it is normal to denormalize the owner data into the question and avoid a join:
question
{
_id,
text,
user_short :
{
id,
full_name
}
}
This certainly adds complexity on updates, but it gives you significant performance improvements when you read the data. And for most applications reads are 95% of the load and writes only 5% or even less.
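To make the update cost concrete: when a user changes their name, every question that embeds a copy has to be updated as well. A minimal sketch, assuming a hypothetical questions collection that stores the user_short sub-document shown above (the id and name values are placeholders):

```javascript
// Placeholder values for illustration only.
const userId = "unique_user_id";
const newFullName = "New Name";

// Select every question that embeds this user...
const filter = { "user_short.id": userId };
// ...and overwrite the denormalized copy of the name.
const update = { $set: { "user_short.full_name": newFullName } };

// Runs only inside a Mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.questions.updateMany(filter, update);
}
```

An index on user_short.id would keep this fan-out update from scanning the whole collection.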
Because MongoDB is a non-relational database. A non-relational database does not support joins by design.
You can now do it, as of MongoDB 3.2, using $lookup.
$lookup takes four arguments:
from: Specifies the collection in the same database to perform the join with. The from collection cannot be sharded (this restriction was lifted in MongoDB 5.1).
localField: Specifies the field from the documents input to the $lookup stage. $lookup performs an equality match on the localField to the foreignField from the documents of the from collection.
foreignField: Specifies the field from the documents in the from collection.
as: Specifies the name of the new array field to add to the input documents. The new array field contains the matching documents from the from collection.
db.Foo.aggregate([
{$unwind: "$bars"},
{$lookup: {
from: "bar",
localField: "bars",
foreignField: "_id",
as: "bar"
}},
{$match: {
"bar.testprop": true
}}
])