how to prevent retrieval of data from disk in a $lookup? - mongodb

In the $lookup stage of my aggregation pipeline, it needs to join based on _id, which is indexed on the joined collection. The purpose is to simply check whether there are any matches in joined collection. The actual data of the joined document(s) does not matter, and hence does not need to be retrieved at all whether on disk or RAM.
So, how do I write the $lookup to ensure that data is never retrieved from disk. Instead, a value of true can be returned if matching records were found?
EDITED:
Retrieving data from disk is expensive, hence the reason to avoid it.

In the $lookup, use a pipeline that starts with a $project stage that includes only the _id field.
{$lookup: {
from: "targetCollection",
localField: "fieldName",
foreignField: "_id",
as: "matched",
pipeline: [
{ $project: {_id: 1}},
{ $count: "count" }
]
}}
The query executor should realize that all of the data needed to satisfy that part of the query can be obtained from the index, and not load any documents.
Note: this assumes you are using MongoDB 5.0+

Related

MongoDB: $lookup before $match. Is there performance issue?

I have 2 Collections:
Users:
{
_id: "unique_user_id",
name: "developer"
}
Reports:
{
_id: "unique_report_id",
createdBy: "unique_user_id"
}
The requirement that I have got:
extract all reports that created by the user with the name developer and it will be preferable to do this in the scope of one request to Mongo DB, the query that was implemented by me:
db.getCollection('reports').aggregate(
[
{
"$lookup" : {
from: "users",
localField: "createdBy",
foreignField: "_id",
as: "user"
}
},
{"$unwind": "$user"},
{"$addFields": {"createdBy": "$user.email"}},
{"$match": {"createdBy": "developer"}}
]
);
On the local machine, it works fine, but I think about the PROD environment, where thousands Reports documents already stored:
My query executes $lookup operation against all Documents that available in Collection and only after that applies $match operation and I am afraid that in this case, it is a huge performance issue.
could you please confirm or reject my fears?
or provide alternative examples?
You can modify your lookup stage as:
{
"$lookup": {
"from": "users",
"pipeline":[{$match:{
"createdBy":"developer"
}}],
"as":"user"
}
This will bring in memory only the matching documents. It will help if the "createdBy" field is indexed. After that, use $project to get only the fields you need.
Since you need only developer, first you reduce the number of documents passed to the next stage by applying match at the beginning.
Currently you are passing all the documents in your collection to all of your stages. It will impact performance for sure.
Use users collection in aggregate
Match by email with developer
Then add lookup with reports collection
Or
Filter developer reports from reports collection
Lookup with users collection
You can avoid unwind, addFields by this if you dont need flattened output structure.

Will the joined field use an Index in MongoDB aggregation lookup query [duplicate]

I have the following code snippet to run the aggregation command.
console.time("something");
const cursor = await db.collection("main").aggregate([
{
$match: {
mainField: mainField,
},
},
{
$lookup: {
from: "reference",
localField: "referenceId",
foreignField: "referenceField",
as: "something",
},
},
]);
const results = await cursor.toArray();
console.timeEnd("something");
I have a cheap cloud server for testing purposes (2gb ram, 1 cpu etc.) where mongodb is stored.
I insert 10k documents into main and reference collections (so combined 20k documents inserted).
Without using indexes and running the above aggregation query it takes more than 30 seconds to return the result.
If I have the following index on the reference collection and run the above aggregation query the results take around 1.2 seconds.
await db.collection("reference").createIndex({ referenceField: 1 });
Unfortunately the MongoDB manual doesn't currently mention potential index usage for $lookup, but this is definitely the case.
A simple $lookup query similar to your example performs an equality match on the foreignField in another collection, so you've added the correct index to improve performance (assuming this field also is reasonably selective).
As at MongoDB 4.0 the index usage for $lookup is not reported in aggregation explain output. There is a relevant issue to watch/upvote in the MongoDB issue tracker: SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection.

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: 2019-05-01T15:35:00,
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
{$unwind: naps},
{$group: { _id: {$day: "$naps.time"}, napsOnDay: {"$sum": 1 } }
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been. Why is this? How can I get mongo to use the index for the more optimal query?
Indexes stores pointers to actual documents, and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match or $sort does not mutate the actual documents, and thus indexes can be used in these stages.
In contrast, $unwind, $group, or any other stages that changes the actual document representations basically loses the connection between the index and the material documents.
Additionally, when those stages are processed without $match, you're basically saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.

If mongodb aggregate does not use indexes for $lookup why does my performance increase when using indexes?

I have the following code snippet to run the aggregation command.
console.time("something");
const cursor = await db.collection("main").aggregate([
{
$match: {
mainField: mainField,
},
},
{
$lookup: {
from: "reference",
localField: "referenceId",
foreignField: "referenceField",
as: "something",
},
},
]);
const results = await cursor.toArray();
console.timeEnd("something");
I have a cheap cloud server for testing purposes (2gb ram, 1 cpu etc.) where mongodb is stored.
I insert 10k documents into main and reference collections (so combined 20k documents inserted).
Without using indexes and running the above aggregation query it takes more than 30 seconds to return the result.
If I have the following index on the reference collection and run the above aggregation query the results take around 1.2 seconds.
await db.collection("reference").createIndex({ referenceField: 1 });
Unfortunately the MongoDB manual doesn't currently mention potential index usage for $lookup, but this is definitely the case.
A simple $lookup query similar to your example performs an equality match on the foreignField in another collection, so you've added the correct index to improve performance (assuming this field also is reasonably selective).
As at MongoDB 4.0 the index usage for $lookup is not reported in aggregation explain output. There is a relevant issue to watch/upvote in the MongoDB issue tracker: SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection.

Why isn't there join relation in MongoDB?

I'm learning MongoDB these days. I find that MongoDB doesn't support join.
I just want to know why MongoDB choose to do this?
THANKS in advance..
Mongo - is not relational database and does not have physical relations and constraints.
Join kills scalability.
Usually denormalization replace sql join.
For example, on stackoverflow you have question and his owner, in mongodb it is normal case to denormilize owner data into question and avoid a join:
question
{
_id,
text,
user_short :
{
id,
full_name
}
}
It is for sure lead to additional complexity on updates, but it give you significant performance improvements when you read the data. And for the most applications read is 95% and writes only 5% or even less.
Because MongoDb is a non relational database. Non-relational database does not support join it is by design.
You can now do it in Mongo 3.2 using $lookup
$lookup takes four arguments
from: Specifies the collection in the same database to perform the join with. The from collection cannot be sharded.
localField: Specifies the field from the documents input to the $lookup stage. $lookup performs an equality match on the localField to the foreignField from the documents of the from collection.
foreignField: Specifies the field from the documents in the from collection.
as: Specifies the name of the new array field to add to the input documents. The new array field contains the matching documents from the from collection.
db.Foo.aggregate(
{$unwind: "$bars"},
{$lookup: {
from:"bar",
localField: "bars",
foreignField: "_id",
as: "bar"
}},
{$match: {
"bar.testprop": true
}}
)