Joins in mongodb using aggregate $lookup - mongodb

I need some help here.
I have two collections. The first isn't very big and holds fairly small documents.
The second has many more items (thousands, possibly far more) with medium-sized documents.
A property in the first collection's documents matches a property in the second collection's documents.
The relation is one-to-many: each item in the first collection is referenced by many items in the second.
For example, say the first collection represents persons and the second represents credit card transactions. A person may have many transactions.
PersonId is the id of a document in the persons collection, and every document in the transactions collection carries it.
I want to write a query that counts how many transactions each person has.
I've seen that aggregate with $lookup is recommended.
But when I try that, I get a message that the document size exceeds the limit.
I'm guessing this is because it aggregates a person with all of its transactions into one document... I'm not sure, as this is my first time working with MongoDB.
What would be the best approach to achieve this? Is the aggregate method the right choice?
Thanks!
Gili

You can use a simple $group to get the transaction count for each person:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}}
])
The output contains each person id and the number of transactions for that person. You can then match people with these transaction stats in memory. You can also use $lookup to return the transaction stats together with some person data. But keep in mind that you will not get an entry for people without transactions:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}},
{ $lookup:
{
from: "people",
localField: "_id",
foreignField: "_id",
as: "person"
}
},
{$unwind: "$person"},
{$project:{name:"$person.name", count:"$count"}}
])
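The in-memory matching mentioned above could look roughly like the sketch below. It is plain JavaScript rather than a shell query; the sample arrays are made up and stand in for the results of `db.people.find()` and the `$group` aggregation, and `attachCounts` is a hypothetical helper name. Unlike the `$lookup` variant, it keeps people who have no transactions.

```javascript
// Hypothetical sample data standing in for two query results:
//   people: db.people.find({}, {name: 1}).toArray()
//   counts: db.transactions.aggregate([{$group: {_id: "$personId", count: {$sum: 1}}}]).toArray()
const people = [
  { _id: 1, name: "Ann" },
  { _id: 2, name: "Bob" },
  { _id: 3, name: "Cy" },
];
const counts = [
  { _id: 1, count: 4 },
  { _id: 2, count: 1 },
];

// Index the counts by person id, then walk the (small) people list.
// This assumes primitive ids; ObjectId values would need String(_id) as the Map key.
function attachCounts(people, counts) {
  const byPerson = new Map(counts.map((c) => [c._id, c.count]));
  return people.map((p) => ({
    _id: p._id,
    name: p.name,
    count: byPerson.get(p._id) ?? 0, // people without transactions get 0
  }));
}

const result = attachCounts(people, counts);
console.log(result);
```

Because only the small people collection and one count per person are held in memory, this stays well clear of the document size limit that the original $lookup ran into.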

Related

how to prevent retrieval of data from disk in a $lookup?

The $lookup stage of my aggregation pipeline needs to join on _id, which is indexed on the joined collection. The purpose is simply to check whether there are any matches in the joined collection. The actual data of the joined document(s) does not matter, and hence does not need to be retrieved at all, whether from disk or RAM.
So, how do I write the $lookup so that data is never retrieved from disk and, instead, a value of true is returned if matching records are found?
EDITED:
Retrieving data from disk is expensive, hence the reason to avoid it.
In the $lookup, use a pipeline that starts with a $project stage that includes only the _id field.
{$lookup: {
from: "targetCollection",
localField: "fieldName",
foreignField: "_id",
as: "matched",
pipeline: [
{ $project: {_id: 1}},
{ $count: "count" }
]
}}
The query executor should realize that all of the data needed to satisfy that part of the query can be obtained from the index, and not load any documents.
Note: this assumes you are using MongoDB 5.0+
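To get the true/false value the question asks for, the `matched` array can be turned into a flag in a later stage. The following is a sketch only: `hasMatch` and `sourceCollection` are illustrative names that do not appear in the question, and whether the server actually answers the sub-pipeline from the index alone is something to confirm with explain().

```javascript
// "targetCollection" and "fieldName" come from the answer above;
// "hasMatch" is an illustrative output field name (an assumption).
const pipeline = [
  {
    $lookup: {
      from: "targetCollection",
      localField: "fieldName",
      foreignField: "_id",
      as: "matched",
      pipeline: [{ $project: { _id: 1 } }, { $count: "count" }],
    },
  },
  // $count emits no document when nothing matched, so "matched" is [] in that case.
  { $addFields: { hasMatch: { $gt: [{ $size: "$matched" }, 0] } } },
  // Drop the helper array once the flag is computed.
  { $project: { matched: 0 } },
];

// In mongosh (hypothetical collection name): db.sourceCollection.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```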

Return & sort users depending on lookup results

I have a MongoDB collection for users and a collection for messages. My goal is to return a result of users that meets these conditions:
1. The users have the role of patient
2. The users have messages (filter out any users that don't have messages)
3. The result is sorted by the latest message (i.e. the user with the latest message should be at the top of the result)
I'm fairly new to aggregations but am attempting to use one to solve this. The query I have so far returns users with their messages via a lookup, but it does not filter out users without messages (2), nor does it sort the result (3).
db.users.aggregate([
{$match: {role: "patient"}},
{$lookup:
{
from: "messages",
localField: "phoneNumber",
foreignField: "phoneNumber",
as: "messages"
}
}
])
While you can do that via the aggregation pipeline, a much more efficient solution is to write the most recent message's timestamp into the user document, at which point no join is needed to retrieve the data you need.
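For completeness, the aggregation route the answer alludes to might be sketched as follows. The question does not show the message schema, so `createdAt` is an assumed timestamp field on each message, and `latestMessageAt` is an illustrative name for the computed value.

```javascript
const pipeline = [
  // (1) only patients
  { $match: { role: "patient" } },
  {
    $lookup: {
      from: "messages",
      localField: "phoneNumber",
      foreignField: "phoneNumber",
      as: "messages",
    },
  },
  // (2) keep only users whose joined array has at least one element
  { $match: { "messages.0": { $exists: true } } },
  // (3) newest message timestamp per user; "createdAt" is an assumed field name
  { $addFields: { latestMessageAt: { $max: "$messages.createdAt" } } },
  { $sort: { latestMessageAt: -1 } },
];

// In mongosh: db.users.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

This still joins every patient against the messages collection on each run, which is the cost the denormalized timestamp avoids.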

MongoDB: $lookup before $match. Is there performance issue?

I have 2 Collections:
Users:
{
_id: "unique_user_id",
name: "developer"
}
Reports:
{
_id: "unique_report_id",
createdBy: "unique_user_id"
}
The requirement I have been given:
extract all reports that were created by the user with the name developer, preferably in a single request to MongoDB. This is the query I implemented:
db.getCollection('reports').aggregate(
[
{
"$lookup" : {
from: "users",
localField: "createdBy",
foreignField: "_id",
as: "user"
}
},
{"$unwind": "$user"},
{"$addFields": {"createdBy": "$user.email"}},
{"$match": {"createdBy": "developer"}}
]
);
On my local machine it works fine, but I am thinking about the PROD environment, where thousands of Reports documents are already stored.
My query executes the $lookup operation against all documents in the collection and only then applies the $match operation. I am afraid that this is a huge performance issue.
Could you please confirm or reject my fears,
or provide alternative examples?
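You can modify your lookup stage to filter inside the join (matching on the name field from the schema above):
{
"$lookup": {
"from": "users",
"let": {"userId": "$createdBy"},
"pipeline": [{"$match": {
"$expr": {"$eq": ["$_id", "$$userId"]},
"name": "developer"
}}],
"as": "user"
}
}
This brings into memory only the matching user documents; reports created by other users get an empty user array, which a following $match can drop. The join is on the users _id field, which is always indexed. After that, use $project to get only the fields you need.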
Since you only need the developer's reports, reduce the number of documents passed to later stages by applying $match at the beginning.
Currently you pass every document in the collection through every stage, which will certainly hurt performance.
Either:
1. Run the aggregate on the users collection
2. Match the developer user
3. Then add a $lookup against the reports collection
Or:
1. Filter the developer's reports from the reports collection
2. $lookup against the users collection
This way you can avoid $unwind and $addFields if you don't need a flattened output structure.
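The first option above (start from users, then join reports) might look like this sketch. It assumes the user is identified by the `name` field shown in the question's schema; swap in another field if that is where "developer" actually lives.

```javascript
const pipeline = [
  // Narrow to the one user first, so the join runs for a single document.
  { $match: { name: "developer" } },
  {
    $lookup: {
      from: "reports",
      localField: "_id",
      foreignField: "createdBy", // an index on reports.createdBy helps this lookup
      as: "reports",
    },
  },
];

// In the shell: db.getCollection("users").aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

The result is one user document with a nested `reports` array, so no $unwind or $addFields is needed unless a flat list of reports is required.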

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: ISODate("2019-05-01T15:35:00"),
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
{$unwind: "$naps"},
{$group: { _id: {$dayOfWeek: "$naps.time"}, napsOnDay: {"$sum": 1 } } }
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been. Why is this? How can I get mongo to use the index for the more optimal query?
Indexes store pointers to actual documents, and can only be used when working with a material document (i.e. a document as it is actually stored on disk).
$match and $sort do not mutate the documents, so indexes can be used in these stages when they come at the start of the pipeline.
In contrast, $unwind, $group, and any other stage that changes the document representation lose the connection between the index and the material documents.
Additionally, when those stages run without a preceding $match, you are saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.
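As an illustration of the point about $match, here is a sketch of a pipeline where the index on naps.time has a chance of being used: a leading $match restricts the work to a date range before any reshaping happens. The date range and the `db.people` collection name are assumptions for the example, and whether the planner really picks the index should be checked with explain().

```javascript
// Assumed date range for the example.
const from = new Date("2019-05-01T00:00:00Z");
const to = new Date("2019-06-01T00:00:00Z");

const pipeline = [
  // First stage: operates on stored documents, so the naps.time index can apply.
  { $match: { naps: { $elemMatch: { time: { $gte: from, $lt: to } } } } },
  { $unwind: "$naps" },
  // After $unwind, filter again to drop naps outside the range (no index here).
  { $match: { "naps.time": { $gte: from, $lt: to } } },
  { $group: { _id: { $dayOfWeek: "$naps.time" }, napsOnDay: { $sum: 1 } } },
  { $sort: { napsOnDay: -1 } },
];

// In mongosh (hypothetical collection name): db.people.aggregate(pipeline)
console.log(pipeline.length);
```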

Why isn't there join relation in MongoDB?

I'm learning MongoDB these days, and I find that MongoDB doesn't support joins.
I just want to know why MongoDB chose to do this.
Thanks in advance.
MongoDB is not a relational database and does not have physical relations or constraints.
Joins kill scalability.
Usually denormalization replaces the SQL join.
For example, on Stack Overflow you have a question and its owner. In MongoDB it is normal to denormalize the owner data into the question and avoid a join:
question
{
_id,
text,
user_short :
{
id,
full_name
}
}
This certainly adds complexity on updates, but it gives you a significant performance improvement when you read the data. And for most applications reads are 95% of the traffic and writes only 5% or even less.
Because MongoDB is a non-relational database. Non-relational databases do not support joins; this is by design.
As of MongoDB 3.2 you can do this using $lookup.
$lookup takes four arguments:
from: Specifies the collection in the same database to perform the join with. In 3.2 the from collection cannot be sharded (this restriction was lifted in MongoDB 5.1).
localField: Specifies the field from the documents input to the $lookup stage. $lookup performs an equality match on the localField to the foreignField from the documents of the from collection.
foreignField: Specifies the field from the documents in the from collection.
as: Specifies the name of the new array field to add to the input documents. The new array field contains the matching documents from the from collection.
db.Foo.aggregate([
{$unwind: "$bars"},
{$lookup: {
from:"bar",
localField: "bars",
foreignField: "_id",
as: "bar"
}},
{$match: {
"bar.testprop": true
}}
])