Windowing function in MongoDB - mongodb

I have a collection that is made up of companies. Each company has a "number_of_employees" field as well as an array of "offices" subdocuments, each of which includes "state_code" and "country_code". For example:
{
'_id': ObjectId('52cdef7c4bab8bd675297da5'),
'name': 'Technorati',
'number_of_employees': 35,
'offices': [
{'description': '',
'address1': '360 Post St. Ste. 1100',
'address2': '',
'zip_code': '94108',
'city': 'San Francisco',
'state_code': 'CA',
'country_code': 'USA',
'latitude': 37.779558,
'longitude': -122.393041}
]
}
I'm trying to get the number of employees per state across all companies. My latest attempt looks like:
db.research.aggregate([
{ "$match": {"offices.country_code": "USA" } },
{ "$unwind": "$offices" },
{ "$project": { "_id": 1, "number_of_employees": 1, "offices.state_code": 1 } }
])
But now I'm stuck on how to do the $group. Because number_of_employees is at the company level and not the office level, I want to split it evenly across the offices. For example, if Technorati has 5 offices in 5 different states, then each state would be allocated 7 employees.
In SQL I could do this easily enough using a windowed function to get average employees across offices by company and then summing those while grouping by state. I can't seem to find any clear examples of similar functionality in MongoDB though.
Note, this is for a school assignment, so the use of third-party libraries isn't feasible. Also, I'm hoping that this can all be done in a simple snippet of code, possibly even one call. I could certainly create new intermediate collections or do this in Python and process data there, but that's probably outside of the scope of the homework.
Anything to point me in the right direction would be greatly appreciated!

You are actually on the right track. You just need to derive an extra field, numOfEmpPerOffice, by using $divide with the $size of the offices array, and then $sum it when you $group by state.
db.collection.aggregate([
{
"$match": {
"offices.country_code": "USA"
}
},
{
"$addFields": {
"numOfEmpPerOffice": {
"$divide": [
"$number_of_employees",
{
"$size": "$offices"
}
]
}
}
},
{
"$unwind": "$offices"
},
{
$group: {
_id: "$offices.state_code",
totalEmp: {
$sum: "$numOfEmpPerOffice"
}
}
}
])
Here is the Mongo playground for your reference.
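To check the arithmetic of the pipeline above without a database, the same steps can be sketched in plain JavaScript. The sample companies and the function name employeesPerState are hypothetical and only serve as an illustration; this is a sketch of the even-split idea, not a replacement for the aggregation.

```javascript
// Hypothetical sample data shaped like the documents in the question
const companies = [
  {
    name: "Company A",
    number_of_employees: 35,
    offices: ["CA", "NY", "TX", "WA", "OR"].map((state) => ({
      state_code: state,
      country_code: "USA",
    })),
  },
  {
    name: "Company B",
    number_of_employees: 10,
    offices: [
      { state_code: "CA", country_code: "USA" },
      { state_code: "NY", country_code: "USA" },
    ],
  },
];

function employeesPerState(docs) {
  const totals = {};
  for (const company of docs) {
    const offices = company.offices || [];
    // $match: keep companies with at least one USA office
    if (!offices.some((o) => o.country_code === "USA")) continue;
    // $addFields: even share of the company's employees per office
    const perOffice = company.number_of_employees / offices.length;
    // $unwind + $group: add the share to the state of each office
    for (const office of offices) {
      totals[office.state_code] = (totals[office.state_code] || 0) + perOffice;
    }
  }
  return totals;
}

const totals = employeesPerState(companies);
console.log(totals);
```

Like the pipeline, this sketch gives a share to every office of a matching company, including offices outside the USA. If only USA offices should count, an extra filter on offices.country_code after the $unwind would be needed, and the divisor would arguably have to count only those offices as well.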

Related

MongoDB, Panache, Quarkus: How to do aggregate, $sum and filter

I have a collection in MongoDB with sales transactions, each containing a userId, a timestamp and the revenue value of that specific sales transaction.
Now I would like to query these transactions and get the minimum, maximum, sum and average revenue per user. Only transactions between two given timestamps should be considered, and the result should only include users whose sum of revenue is greater than a specified value.
I have composed the corresponding query in mongosh:
db.salestransactions.aggregate([
{
"$match": {
"timestamp": {
"$gte": new ISODate("2020-01-01T19:28:38.000Z"),
"$lte": new ISODate("2020-03-01T19:28:38.000Z")
}
}
},
{
$group: {
_id: { userId: "$userId" },
minimum: {$min: "$revenue"},
maximum: {$max: "$revenue"},
sum: {$sum: "$revenue"},
avg: {$avg: "$revenue"}
}
},
{
$match: { "sum": { $gt: 10 } }
}
]
)
This query works absolutely fine.
How do I implement this query in a PanacheMongoRepository using Quarkus?
Any ideas?
Thanks!
A bit late, but you could do it something like this.
Define a repository (this code is in Kotlin):
class YourRepositoryReactive : ReactivePanacheMongoRepository<YourEntity> {
    // The reactive collection returns a Multi (io.smallrye.mutiny.Multi), not a List
    fun getDomainDocuments(): Multi<YourView> {
        val aggregationPipeline = mutableListOf<Bson>()
        // Create each stage with Document.parse("stage_obj") and add it to aggregationPipeline
        return mongoCollection().aggregate(aggregationPipeline, YourView::class.java)
    }
}
mongoCollection() automatically operates on the collection of your entity.
YourView is a class that maps the properties that are part of your output. Make sure that this class has the
@ProjectionFor(YourEntity::class)
annotation.
Hope this helps.
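As a rough illustration of the "one string per stage" idea, the stages of the question's query can be written as plain objects and serialized, giving the strings that would be handed to Document.parse. This sketch assumes that Document.parse reads MongoDB Extended JSON, so the shell-only ISODate(...) is replaced by a $date wrapper; the variable names (from, to, minSum, stageJson) are made up for the example.

```javascript
// Hypothetical parameters; in the repository they would be method arguments
const from = "2020-01-01T19:28:38.000Z";
const to = "2020-03-01T19:28:38.000Z";
const minSum = 10;

// Same three stages as the mongosh query in the question
const stages = [
  // ISODate is shell syntax, so the dates use the Extended JSON $date wrapper
  { $match: { timestamp: { $gte: { $date: from }, $lte: { $date: to } } } },
  {
    $group: {
      _id: { userId: "$userId" },
      minimum: { $min: "$revenue" },
      maximum: { $max: "$revenue" },
      sum: { $sum: "$revenue" },
      avg: { $avg: "$revenue" },
    },
  },
  { $match: { sum: { $gt: minSum } } },
];

// One JSON string per stage, in pipeline order
const stageJson = stages.map((stage) => JSON.stringify(stage));
stageJson.forEach((json) => console.log(json));
```

Keeping the stages as data makes it easy to compare them with the query that already works in mongosh before wiring them into the repository.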

Merge Names From Data For Message Application

Hello, I'm writing a messaging application with Node.js and Mongoose. I keep the data in MongoDB like this:
I want to list the users that the logged-in user has exchanged messages with, so I need to filter my 'Messages' collection, but I can't get exactly what I want. If he sent a message to a person, I need that person's name, and if he received a message from a person, I also need that person's name. In the first case the name is in reciever, in the second case it is in sender. I made a table to explain it more easily: I have the left table and I need the 3 names shown in the second table (one of the duplicate John entries needs to be eliminated).
Sorry if this has been asked before, but I don't know how to search for this problem.
I tried this, but it also returns the name of the logged-in user and duplicates some names.
Message.find({$or: [{sender: req.user.username}, {reciever: req.user.username}]})
One option is to use an aggregation pipeline to create two sets and simply union them:
db.collection.aggregate([
{$match: {$or: [{sender: req.user.username}, {reciever: req.user.username}]}},
{$group: {
_id: 0,
recievers: {$addToSet: "$reciever"},
senders: {$addToSet: "$sender"}
}},
{$project: {
_id: req.user.username,
previousChats: {"$setDifference":
[
{$setUnion: ["$recievers", "$senders"]},
[req.user.username]
]
}
}}
])
See how it works on the playground example
This is a tricky one, but can be solved with a fairly simple aggregation pipeline.
Explanation
On our first stage of the pipeline, we will want to get all the messages sent or received by the user (in our case David), for that we will use a $match stage:
{
$match: {
$or: [
{sender: 'David'},
{receiver: 'David'}
]
}
}
After we have found all the messages from or to David, we can start collecting the people he talks to. For that we will use a $group stage with two operators that help us achieve this:
$addToSet - This adds all the names to a set. A set contains only one instance of each value and ignores any further attempt to add the same value.
$cond - This is used to add either the receiver or the sender, depending on which one of them is David.
The stage will look like this:
{
$group: {
_id: null,
chats: {$addToSet: {$cond: {
if: {$eq: ['$sender', 'David']},
then: '$receiver',
else: '$sender'
}}}
}
}
Combining these 2 stages together will give us the expected result, one document looking like this:
{
"_id": null, // We don't care about this
"chats": [
"John",
"James",
"Daniel"
]
}
Final Solution
Message.aggregate([{
$match: {
$or: [
{
sender: req.user.username
},
{
receiver: req.user.username
}
]
}
}, {
$group: {
_id: null,
chats: {
$addToSet: {
$cond: {
'if': {
$eq: [
'$sender',
req.user.username
]
},
then: '$receiver',
'else': '$sender'
}
}
}
}
}])
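The effect of the $match, $cond and $addToSet combination can be checked on a handful of messages in plain JavaScript. The sample messages and the function name previousChats are hypothetical, and the field is spelled receiver as in this answer (the question's own code spells it reciever), so treat this as a sketch of the logic rather than of the real schema.

```javascript
// Hypothetical messages; only sender and receiver matter here
const messages = [
  { sender: "David", receiver: "John" },
  { sender: "John", receiver: "David" },
  { sender: "David", receiver: "James" },
  { sender: "Daniel", receiver: "David" },
  { sender: "James", receiver: "Daniel" }, // does not involve David
];

function previousChats(msgs, user) {
  const chats = new Set(); // plays the role of $addToSet
  for (const m of msgs) {
    // $match: skip messages the user neither sent nor received
    if (m.sender !== user && m.receiver !== user) continue;
    // $cond: take whichever side of the message is not the user
    chats.add(m.sender === user ? m.receiver : m.sender);
  }
  return [...chats];
}

const chats = previousChats(messages, "David");
console.log(chats);
```

John appears in two messages but only once in the result, which is the deduplication the question asks for.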

MongoDB $lookup with conditional foreignField

Playground: https://mongoplayground.net/p/OxMnsCFZpmQ
My MongoDB version: 4.2.
I have a collection car_parts and customers.
As the name suggests, car_parts holds car parts, some of which have a field sub_parts: a list of the car_parts._id values that the part consists of.
Every customer that bought something from us is stored in customers. The parts field of a customer contains lists of parts the customer bought together on a certain date.
I would like to have an aggregate query in MongoDB that returns a mapping of which car parts were bought (bought_parts) by which customers. However, if a car part has the field sub_parts, the customer should show up for the sub-parts only.
So the query in the playground already gives almost the correct result, except for the handling of sub_parts.
Example for customer_3:
{
"_id": "customer_3",
"parts": [
{
"bought_parts": [
3
],
date: "15.07.2020"
}
]
}
Since bought_parts has car_parts._id = 3:
{
"_id": 3,
"name": "steering wheel",
"sub_parts": [
1, // other car_parts._id s
2
]
}
The result should show customer_3 as a customer of car parts 1 and 2.
I'm not sure how to accomplish this, but I assume a "temporary" replacement of the id 3 in bought_parts with the actual ids [1,2] might solve it.
Expected output:
[
{
"_id": 1,
"customers": [
"customer_1",
"customer_2",
"customer_3" // <- since customer_3 bought car part 3 which has 1 in sub_parts
]
},
{
"_id": 2,
"customers": [
"customer_3" // <- since customer_3 bought car part 3 which has 2 in sub_parts
]
},
{
"_id": 3,
"customers": [
"customer_1", // <- since car_parts.id = 3 has [1, 2] in sub_parts, show customers of ids [1, 2]
"customer_2",
"customer_3"
]
},
{
"_id": 4,
"customers": [
"customer_1",
"customer_2"
]
}
]
Thanks a lot in advance!
EDIT: One way to do it is:
db.car_parts.aggregate([
{
$project: {
topLevel: {$concatArrays: [{$ifNull: ["$sub_parts", []]}, ["$_id"]]},
sub_parts: 1
}
},
{$unwind: "$topLevel"},
{
$group: {
_id: "$topLevel",
parts: {$push: "$_id"},
sub_parts: {$first: "$sub_parts"}
}
},
{
$project: {
parts: {$concatArrays: [{"$ifNull": ["$sub_parts", []]}, "$parts"]}
}
},
{
$lookup: {
from: "customers",
localField: "parts",
foreignField: "parts.bought_parts",
as: "customers"
}
},
{$project: {customers: "$customers._id"}}
])
You can see it working on this playground.
Since you said there is only one level of sub-parts, I used another idea: creating a top level before the $lookup. Since you want customers that used part 3, for example, to be registered under parts 1 and 2, which are sub-parts of 3, the idea is to group them. This connection is a bit clumsy to make after the $lookup, but if we use the data in the car_parts collection before the $lookup, we already know that parts 1 and 2 are sub-parts of 3. Creating a temporary topLevel field allows us to group, in advance, all the parts and sub-parts for which a customer who used one of them should be registered under this top-level part. This makes things much more elegant.
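The grouping idea can be tried out in plain JavaScript before touching the database. The sketch below is not a literal translation of the pipeline: for each part it collects the part itself, its sub-parts and any part that lists it as a sub-part, and then looks for customers who bought any of those. The part list and the purchases of customer_1 and customer_2 are assumptions chosen to reproduce the expected output in the question; only customer_3 is taken from it.

```javascript
// Parts as described in the question: part 3 consists of parts 1 and 2
const carParts = [{ _id: 1 }, { _id: 2 }, { _id: 3, sub_parts: [1, 2] }, { _id: 4 }];

// customer_3 comes from the question; the other two are hypothetical
const customers = [
  { _id: "customer_1", parts: [{ bought_parts: [1, 4] }] },
  { _id: "customer_2", parts: [{ bought_parts: [1, 4] }] },
  { _id: "customer_3", parts: [{ bought_parts: [3] }] },
];

function customersByPart(parts, buyers) {
  return parts.map((part) => {
    // The part itself plus its own sub-parts
    const related = new Set([part._id, ...(part.sub_parts || [])]);
    // Plus every part that contains this part as a sub-part
    for (const other of parts) {
      if ((other.sub_parts || []).includes(part._id)) related.add(other._id);
    }
    // Stand-in for the $lookup: customers who bought any related part
    const matching = buyers
      .filter((c) => c.parts.some((p) => p.bought_parts.some((id) => related.has(id))))
      .map((c) => c._id);
    return { _id: part._id, customers: matching };
  });
}

const mapping = customersByPart(carParts, customers);
console.log(JSON.stringify(mapping, null, 2));
```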

mongoDB find document greatest date and check value

I have a Conversation collection that looks like this:
[
{
"_id": "QzuTQYkGDBkgGnHrZ",
"participants": [
{
"id": "YymyFZ27NKtuLyP2C"
},
{
"id": "d3y7uSA2aKCQfLySw",
"lastVisited": "2016-02-04T02:59:10.056Z",
"lastMessage": "2016-02-04T02:59:10.056Z"
}
]
},
{
"_id": "e4iRefrkqrhnokH7Y",
"participants": [
{
"id": "d3y7uSA2aKCQfLySw",
"lastVisited": "2016-02-04T03:26:33.953Z",
"lastMessage": "2016-02-04T03:26:53.509Z"
},
{
"id": "SRobpwtjBANPe9hXg",
"lastVisited": "2016-02-04T03:26:35.210Z",
"lastMessage": "2016-02-04T03:15:05.779Z"
}
]
},
{
"_id": "twXPHb76MMxQ3MQor",
"participants": [
{
"id": "d3y7uSA2aKCQfLySw"
},
{
"id": "SRobpwtjBANPe9hXg",
"lastMessage": "2016-02-04T03:27:35.281Z",
"lastVisited": "2016-02-04T03:57:51.036Z"
}
]
}
]
Each conversation (document) can have a participant object with the properties of id, lastMessage, lastVisited.
Sometimes, depending on how new the conversation is, some of these values don't exist just yet (such as lastMessage, lastVisited).
What I'm trying to do is compare the participants in each individual conversation (document) and see if, out of all the participants, the greatest lastMessage field value belongs to the logged-in user. Otherwise, I'm assuming that the conversation has messages that the logged-in user hasn't seen yet. I want to get the count of conversations that the user possibly hasn't seen yet.
In the example above, say we're logged in as d3y7uSA2aKCQfLySw. We can see that he was the last person to send a message for conversation 1, 2 BUT not 3. The count returning for how many updated conversations that d3y7uSA2aKCQfLySw hasn't seen should be 1.
Can someone point me in the right direction? I haven't the slightest clue as to how to approach the issue. My apologies for the lengthy question.
It is always advisable to store dates as ISODate rather than strings to leverage the flexibility provided by various date operators in the aggregation framework.
One way of getting the count is to,
$match the conversations in which the user is involved.
$unwind the participants field.
$sort by the lastMessage field in descending order
$group by the _id to get back the original conversations intact, and get the latest message per group(conversation) using the $first operator.
$project a field with value 0, for each group where the top most record is of the user we are looking for and 1 for others.
$group again to get the total count of the conversations in which he has not been the last one to send a message.
sample code:
var userId = "d3y7uSA2aKCQfLySw";
db.t.aggregate([
{
$match:{"participants.id":userId}
},
{
$unwind:"$participants"
},
{
$sort:{"participants.lastMessage":-1}
},
{
$group:{"_id":"$_id","lastParticipant":{$first:"$$ROOT.participants"}}
},
{
$project:{
"hasNotSeen":{$cond:[
{$eq:["$lastParticipant.id",userId]},
0,
1
]},
"_id":0}
},
{
$group:{"_id":null,"count":{$sum:"$hasNotSeen"}}
},
{
$project:{"_id":0,"numberOfConversationsNotSeen":"$count"}
}
])
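The steps above can be sketched in plain JavaScript against the three conversations from the question, which makes it easy to confirm the expected count. The function name countUnseen is made up for this example, and a missing lastMessage is treated as older than any real timestamp, which is an assumption that mirrors how missing values end up last in the descending $sort.

```javascript
// The three conversations from the question, reduced to the fields used here
const conversations = [
  {
    _id: "QzuTQYkGDBkgGnHrZ",
    participants: [
      { id: "YymyFZ27NKtuLyP2C" },
      { id: "d3y7uSA2aKCQfLySw", lastMessage: "2016-02-04T02:59:10.056Z" },
    ],
  },
  {
    _id: "e4iRefrkqrhnokH7Y",
    participants: [
      { id: "d3y7uSA2aKCQfLySw", lastMessage: "2016-02-04T03:26:53.509Z" },
      { id: "SRobpwtjBANPe9hXg", lastMessage: "2016-02-04T03:15:05.779Z" },
    ],
  },
  {
    _id: "twXPHb76MMxQ3MQor",
    participants: [
      { id: "d3y7uSA2aKCQfLySw" },
      { id: "SRobpwtjBANPe9hXg", lastMessage: "2016-02-04T03:27:35.281Z" },
    ],
  },
];

function countUnseen(docs, userId) {
  return docs
    // $match: only conversations the user takes part in
    .filter((c) => c.participants.some((p) => p.id === userId))
    // $unwind + $sort + $group/$first: find who sent the latest message
    .filter((c) => {
      const latest = c.participants.reduce((best, p) =>
        // ISO-8601 strings in the same format compare correctly as text
        (p.lastMessage || "") > (best.lastMessage || "") ? p : best
      );
      // $cond: count the conversation when someone else was last
      return latest.id !== userId;
    }).length;
}

const unseen = countUnseen(conversations, "d3y7uSA2aKCQfLySw");
console.log(unseen);
```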
You could try this function:
function findUnseen(uId) {
var numMessages = db.demo.aggregate(
[
{
$project: {
"participants.lastMessage": 1,
"participants.id": 1
}
},
{$unwind: "$participants"},
{$sort: {"participants.lastMessage": -1}},
{
$group: {
_id: "$_id",
participantsId: {$first: "$participants.id"},
lastMessage: {$max: "$participants.lastMessage"}
}
},
{$match: {participantsId: {$ne: uId}}},
]
).toArray().length;
return numMessages;
}
calling findUnseen("d3y7uSA2aKCQfLySw") will return 1.
I have adapted this function to return just the count, but as you can see it's easy to tweak it to return all the unseen message metadata too.

How to get (or aggregate) distinct keys of array in MongoDB

I'm trying to get MongoDB to aggregate for me over an array with different key-value pairs, without knowing keys (Just a simple sum would be ok.)
Example docs:
{data: [{a: 3}, {b: 7}]}
{data: [{a: 5}, {c: 12}, {f: 25}]}
{data: [{f: 1}]}
{data: []}
So basically each doc (or its array, really) can have 0 or many entries, and I don't know the keys for those objects, but I want to sum and average the values over those keys.
Right now I'm just loading a bunch of docs and doing it myself in Node, but I'd like to offload that work to MongoDB.
I know I can unwind those first, but how to proceed from there? How to sum/avg/min/max the values if I don't know the keys?
If you do not know the keys and cannot make a reasonable educated guess, then you are basically stuck from going any further with the aggregation framework. You could supply "all of the keys" for consideration, but I suspect your actual data looks more like this:
{ "data": [{ "film": 10 }, { "television": 5 }, { "boardGames": 1 }] }
So there would be little point in finding out all the "key names" and then throwing them at an aggregation statement.
For the record though, "this is why you do not structure your data storage like this". Information like "film" here should not be used as a "key" name, because it is useful "data" that could be searched upon and, most importantly, "indexed" in a database system.
So your data should really look like this:
{
"data": [
{ "type": "film", "value": 10 },
{ "type": "television", "value": 5 },
{ "type": "boardGames", "value": 1 }
]
}
Then the aggregation statement is simple, as are many other things:
db.collection.aggregate([
{ "$unwind": "$data" },
{ "$group": {
"_id": null,
"sum": { "$sum": "$data.value" },
"avg": { "$avg": "$data.value" }
}}
])
But since the key names are constantly changing between documents and do not have a uniform structure, you need JavaScript processing on the server to traverse the keys, and that means mapReduce:
db.collection.mapReduce(
function() {
this.data.forEach(function(data) {
Object.keys(data).forEach(function(key) {
emit(null,data[key]); // emit the value regardless of key name
});
});
},
function(key,values) {
return Array.sum(values); // Just summing for example
},
{ "out": { "inline": 1 } }
)
And of course the JavaScript execution here will work much more slowly than the native coded operators available to the aggregation framework.
So this should be an object lesson as to why you do not use "data" as "key names" when storing data in a database. The aggregation framework works with standard structures and is fast; falling back to JavaScript processing is more flexible, but the cost is mostly in speed and other features.
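To move existing documents toward the { type, value } layout recommended above, a small client-side reshaping step is one option. The sketch below uses the example docs from the question; the function names reshape and statsByType are hypothetical, and the per-key sum and average are computed in memory only to show that the reshaped data is easy to aggregate.

```javascript
// Example docs from the question
const docs = [
  { data: [{ a: 3 }, { b: 7 }] },
  { data: [{ a: 5 }, { c: 12 }, { f: 25 }] },
  { data: [{ f: 1 }] },
  { data: [] },
];

// Turn each { key: value } entry into { type: key, value: value }
function reshape(doc) {
  const data = doc.data.flatMap((entry) =>
    Object.entries(entry).map(([type, value]) => ({ type, value }))
  );
  return { ...doc, data };
}

// Sum, count and average per type over the reshaped documents
function statsByType(documents) {
  const stats = {};
  for (const doc of documents.map(reshape)) {
    for (const { type, value } of doc.data) {
      const s = stats[type] || (stats[type] = { sum: 0, count: 0 });
      s.sum += value;
      s.count += 1;
    }
  }
  for (const s of Object.values(stats)) s.avg = s.sum / s.count;
  return stats;
}

const stats = statsByType(docs);
console.log(stats);
```

Once the documents are stored in the reshaped form, the same grouping can be done on the server with $unwind and a $group on "$data.type".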