Using $group in aggregate Spring Data equivalent - mongodb

Following is the MongoDB query. I am stuck on finding the right Spring Data API equivalent for the $group and $project stages.
db.users.aggregate([
  { $lookup: { from: "organizations", localField: "orgId", foreignField: "_id", as: "organization" } },
  { $unwind: "$organization" },
  { $group: {
      _id: "$userOrgMap.orgId",
      userCount: { $sum: 1 },
      name: { $first: "$organization.name" },
      status: { $first: "$organization.status" }
  }},
  { $project: { _id: 1, name: 1, status: 1, userCount: 1 } },
  { $sort: { userCount: -1 } }
]);
I need details on the Spring Data MongoDB API that should be used for the $group and $project stages.

Related

MongoDB aggregations group by and count with a join

I have a MongoDB model called candidates:
appliedJobs: [
{
job: { type: Schema.ObjectId, ref: "JobPost" },
date:Date
},
],
A candidate may have multiple entries in the appliedJobs array, each referring to a JobPost.
A JobPost has a companyName property:
companyName: String,
What I want is to get the company names with the count of job applications sent to each. For example:
|Company|Applications|
|--------|---------------|
|Facebook|10 applications|
|Google|5 applications|
I created this query
Candidate.aggregate([
  { $match: { appliedJobs: { $exists: true } } },
  { $group: { _id: '$companyName', count: { $sum: 1 } } },
])
The problem here is that I can't access companyName like this, because it's in another collection. How do I solve this?
In order to get data from another collection you can use $lookup (more efficient) or populate (mongoose, considered more organized), so one option is:
db.candidate.aggregate([
  { $match: { appliedJobs: { $exists: true } } },
  { $unwind: "$appliedJobs" },
  { $lookup: {
      from: "JobPost",
      localField: "appliedJobs.job",
      foreignField: "_id",
      as: "appliedJobs"
  }},
  { $project: { companyName: { $first: "$appliedJobs.companyName" } } },
  { $group: { _id: { candidate: "$_id", company: "$companyName" }, count: { $sum: 1 } } },
  { $group: {
      _id: "$_id.candidate",
      appliedJobs: { $push: { k: "$_id.company", v: "$count" } }
  }},
  { $project: { appliedJobs: { $arrayToObject: "$appliedJobs" } } }
])
See how it works on the playground example
Simply $unwind the appliedJobs array. Perform $lookup to get the companyName. Then, $group to get count of applications by company.
db.Candidate.aggregate([
  { $match: { appliedJobs: { $exists: true } } },
  { $unwind: "$appliedJobs" },
  { $lookup: {
      from: "JobPost",
      localField: "appliedJobs.job",
      foreignField: "_id",
      as: "JobPostLookup"
  }},
  { $unwind: "$JobPostLookup" },
  { $group: {
      _id: "$JobPostLookup.companyName",
      Applications: { $sum: 1 }
  }}
])
Here is the Mongo Playground for your reference.

Mongodb group by a nested field

db.logs.aggregate([
  { $match: { _device: { $in: devices }, takenTimestamp: { $gt: 1547650800 } } },
  { $lookup: {
      from: "users",
      localField: "_deviceTakenByUser",
      foreignField: "_id",
      as: "user"
  }},
  { $project: { _id: 0, user: { email: 1 } } }
])
I now want to get distinct values for the user email. Is it possible to do so? Can I do it with grouping?
After your $lookup, user is an array of users.
You can do an $unwind to split it, and then use a $group on the email field to get a list of all the distinct emails:
...
{
"$unwind": "$user"
},
{
"$group": {
"_id": "$user.email"
}
}

Operation timeout for a MongoDB aggregation pipeline

I have a MongoDB database on MongoDB Atlas.
It has "orders", "products", "itemTypes" and "brands" collections.
"orders" only keeps track of the ordered product ids.
"products" only keeps track of the brand id and itemType id.
"itemTypes" keeps track of the item type name.
"brands" keeps track of the brand name.
If I aggregate orders + products + itemTypes, it works fine:
[{
$unwind: {
path: '$orders'
}
}, {
$lookup: {
from: 'products',
localField: 'orders.productId',
foreignField: 'productId',
as: 'products'
}
}, {
$lookup: {
from: 'itemTypes',
localField: 'products.typeId',
foreignField: 'typeId',
as: 'itemTypes'
}
}, {
$set: {
'orders.price': {
$arrayElemAt: ['$products.price', 0]
},
'orders.brandId': {
$arrayElemAt: ['$products.brandId', 0]
},
'orders.typeId': {
$arrayElemAt: ['$products.typeId', 0]
},
'orders.typeName': {
$arrayElemAt: ['$itemTypes.name', 0]
}
}
}, {
$group: {
_id: '$_id',
createdAt: {
$first: '$createdAt'
},
status: {
$first: '$status'
},
retailerId: {
$first: '$retailerId'
},
retailerName: {
$first: '$retailerName'
},
orderId: {
$first: '$orderId'
},
orders: {
$push: '$orders'
}
}
}]
If I aggregate orders + products + itemTypes + brands, both MongoDB Compass and the Atlas web aggregation builder give an operation timeout error.
[{
$unwind: {
path: '$orders'
}
}, {
$lookup: {
from: 'products',
localField: 'orders.productId',
foreignField: 'productId',
as: 'products'
}
}, {
$lookup: {
from: 'itemTypes',
localField: 'products.typeId',
foreignField: 'typeId',
as: 'itemTypes'
}
}, {
$lookup: {
from: 'brands',
localField: 'products.brandId',
foreignField: 'brandId',
as: 'brands'
}
}, {
$set: {
'orders.price': {
$arrayElemAt: ['$products.price', 0]
},
'orders.brandId': {
$arrayElemAt: ['$products.brandId', 0]
},
'orders.typeId': {
$arrayElemAt: ['$products.typeId', 0]
},
'orders.typeName': {
$arrayElemAt: ['$itemTypes.name', 0]
},
'orders.brandName': {
$arrayElemAt: ['$brands.name', 0]
}
}
}, {
$group: {
_id: '$_id',
createdAt: {
$first: '$createdAt'
},
status: {
$first: '$status'
},
retailerId: {
$first: '$retailerId'
},
retailerName: {
$first: '$retailerName'
},
orderId: {
$first: '$orderId'
},
orders: {
$push: '$orders'
}
}
}]
This is a demo of the aggregation that timed out:
https://mongoplayground.net/p/Jj6EhSl58MS
We have approximately 50k orders, 14k products, 200 brands, 89 item types.
Is there any way to optimise this aggregation so that it won't time out?
P.S.: My ultimate goal is to visualise the most popular brands and item types ordered, using a nice chart in MongoDB Charts.
If you are on Mongo Atlas, you can use Triggers to run the aggregation query in the background - either when the database is updated or as a scheduled trigger (https://docs.mongodb.com/realm/triggers/).
When the trigger runs, you can save the result of the aggregation pipeline in a new collection using the "$merge" operation.
exports = function() {
  const mongodb = context.services.get(CLUSTER_NAME);
  const orders = mongodb.db(DATABASE_NAME).collection("orders");
  const ordersSummary = mongodb.db(DATABASE_NAME).collection("orders.summary");

  const pipeline = [
    {
      YOUR_PIPELINE
    },
    { $merge: { into: "orders.summary", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
  ];

  orders.aggregate(pipeline);
};
This way, your charts will be very fast, since they only have to do a simple query from the new collection.
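For example, a chart data source or any other consumer can then read the pre-aggregated documents directly. A minimal sketch, assuming the "orders.summary" collection produced by the trigger above:
db.getCollection("orders.summary")
  .find({})
  .sort({ createdAt: -1 })   // createdAt is carried through by the $group stage
  .limit(100)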
Do you have indexes on the collections you $lookup from: products (productId), itemTypes (typeId) and brands (brandId)?
Otherwise, the lookups can take a long time to complete.
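A minimal sketch of creating those indexes from the mongo shell, assuming the field names used in the pipeline above:
// index the foreign fields used by each $lookup
db.products.createIndex({ productId: 1 })
db.itemTypes.createIndex({ typeId: 1 })
db.brands.createIndex({ brandId: 1 })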

MongoDB: slow performance pipeline lookup compared to basic lookup

I have two collections:
matches:
[{
date: "2020-02-15T17:00:00Z",
players: [
{_id: "5efd9485aba4e3d01942a2ce"},
{_id: "5efd9485aba4e3d01942a2cf"}
]
},
{...}]
and players:
[{
_id: "5efd9485aba4e3d01942a2ce",
name: "Rafa Nadal"
},
{
_id: "5efd9485aba4e3d01942a2ce",
name: "Roger Federer"
},
{...}]
I need to use the pipeline form of $lookup because I'm building a GraphQL resolver with recursive functions and I need nested lookups. I've followed this example: https://docs.mongodb.com/datalake/reference/pipeline/lookup-stage#nested-example
My problem is that the pipeline lookup takes 11 seconds, while the basic lookup takes only 0.67 seconds. And my test database is very small: about 1300 players and 700 matches.
This is my pipeline lookup (11 seconds to resolve)
db.collection('matches').aggregate([{
  $lookup: {
    from: 'players',
    let: { ids: '$players' },
    pipeline: [{ $match: { $expr: { $in: ['$_id', '$$ids'] } } }],
    as: 'players'
  }
}]);
And this my basic lookup (0.67 seconds to resolve)
db.collection('matches').aggregate([{
  $lookup: {
    from: "players",
    localField: "players",
    foreignField: "_id",
    as: "players"
  }
}]);
Why is there so much difference? How can I make the pipeline lookup faster?
The thing is that when you do a $lookup using a pipeline with a $match stage, the index is used only for the fields matched with the $eq operator; for the rest, the index is not used.
The example you specified with a pipeline would work like this (again, the index will not be used here as it is not $eq):
db.matches.aggregate([
  {
    $lookup: {
      from: "players",
      let: {
        ids: {
          $map: {
            input: "$players",
            in: "$$this._id"
          }
        }
      },
      pipeline: [
        { $match: { $expr: { $in: ["$_id", "$$ids"] } } }
      ],
      as: "players"
    }
  }
])
Since players is an array of objects, it needs to be mapped to an array of ids first.
MongoDB Playground
As #namar sood comments there are several tickets that refer to this issue:
https://jira.mongodb.org/browse/SERVER-37470
https://jira.mongodb.org/browse/SERVER-32549
Meanwhile a solution could be (also works nested):
db.collection('matches').aggregate([
  { $unwind: '$players' },
  {
    $lookup: {
      from: 'players',
      let: { id: '$players._id' },
      pipeline: [{ $match: { $expr: { $eq: ['$_id', '$$id'] } } }],
      as: 'players'
    }
  },
  { $unwind: '$players' },
  {
    $group: {
      "_id": "$_id",
      "data": { "$first": "$$ROOT" },
      "players": { $push: "$players" }
    }
  },
  { $addFields: { "data.players": "$players" } },
  { $replaceRoot: { newRoot: "$data" } }
]);

mongodb 2 level aggregate lookup

I have these collection schemas:
Schema.users = {
name : "string",
username : "string",
[...]
}
Schema.rooms = {
name : "string",
hidden: "boolean",
user: "string",
sqmt: "number",
service: "string"
}
Schema.room_price = {
morning : "string",
afternoon: "string",
day: "string",
room:'string'
}
I need to aggregate the users with their rooms and, for each room, the specific room prices.
The expected result would be:
[{
_id:"xXXXXX",
name:"xyz",
username:"xyz",
rooms:[
{
_id: 1111,
name:'room1',
sqmt: '123x',
service:'ppp',
room_prices: [{morning: 123, afternoon: 321}]
}
]}]
The first part of the aggregate could be
db.collection('users').aggregate([
  { $match: cond },
  { $lookup: {
      from: 'rooms',
      let: { "user_id": "$_id" },
      pipeline: [{ $match: { $expr: { $eq: ["$user", "$$user_id"] } } }],
      as: "rooms"
  }}
])
but I can't figure out how to get the room prices within the same aggregate
Presuming that room in the room_prices collection holds the matching data for the name of the rooms collection, then that would be the expression to match on in the "inner" pipeline of the $lookup expression, with yet another $lookup:
db.collection('users').aggregate([
{ $match: cond },
{ $lookup: {
from: 'rooms',
let: { "user_id": "$_id" },
pipeline: [
{ $match:{ $expr: { $eq: ["$user", "$$user_id"] } } },
{ $lookup: {
from: 'room_prices',
let: { 'name': '$name' },
pipeline: [
{ $match: { $expr: { $eq: [ '$room', '$$name'] } } },
{ $project: { _id: 0, morning: 1, afternoon: 1 } }
],
as: 'room_prices'
}}
],
as: "rooms"
}}
])
That's also adding a $project in there to select only the fields you want from the prices. When using the expressive form of $lookup you actually do get to express a "pipeline", which can be any aggregation pipeline combination. This allows for complex manipulation and such "nested lookups".
Note that using mongoose you can also get the collection name from the model object using something like:
from: RoomPrice.collection.name
This is generally future proofing against possible model configuration changes which might possibly change the name of the underlying collection.
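For illustration, the inner $lookup from the answer above could use the model-derived name like this (a sketch, assuming RoomPrice is the mongoose model registered for the room_prices collection):
{ $lookup: {
    from: RoomPrice.collection.name,   // the collection name as registered on the model
    let: { 'name': '$name' },
    pipeline: [
      { $match: { $expr: { $eq: ['$room', '$$name'] } } },
      { $project: { _id: 0, morning: 1, afternoon: 1 } }
    ],
    as: 'room_prices'
}}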
You can also do pretty much the same with the "legacy" form of $lookup prior to the sub-pipeline syntax available from MongoDB 3.6 and upwards. It's just a bit more processing and reconstruction:
db.collection('users').aggregate([
{ $match: cond },
// in legacy form
{ $lookup: {
from: 'rooms',
localField: '_id',
foreignField: 'user',
as: 'rooms'
}},
// unwind the output array
{ $unwind: '$rooms' },
// lookup for the second collection
{ $lookup: {
from: 'room_prices',
localField: 'rooms.name',
foreignField: 'room',
as: 'rooms.room_prices'
}},
// Select array fields with $map
{ $addFields: {
    'rooms.room_prices': {
      $map: {
        input: '$rooms.room_prices',
        in: {
          morning: '$$this.morning',
          afternoon: '$$this.afternoon'
        }
      }
    }
}},
// now group back to 'users' data
{ $group: {
_id: '$_id',
name: { $first: '$name' },
username: { $first: '$username' },
// same for any other fields, then $push 'rooms'
rooms: { $push: '$rooms' }
}}
])
That's a bit more overhead, mostly from the use of $unwind, and note that the "field selection" here means you actually return the "whole documents" from room_prices first, and only after that is complete can you select the fields.
So there are advantages to the newer syntax, but it still could be done with earlier versions if you wanted to.