MongoDB $setUnion on object ($setUnion but with additional information) - mongodb

stackoverflow community,
I do not often work with big arrays of objects in MongoDB,
so I have no idea how to solve this problem:
1. I am working within one document, so the aggregation first does a {$match:{"_id" : ObjectId("5c3f5cb04147b3082648278b") }}.
2. Then I have another step that uses $project + $filter to filter out some objects, but I think it is not important here.
I have an array of objects, similar to this:
{
"_id": ObjectId(".."),
"data":
[
{
id : 01,
groupId: 22,
noteId: 876543
},
{
id : 02,
groupId: 33,
noteId: 767676
},
{
id : 03,
groupId: 22,
noteId: 876543
},
{
id : 04,
groupId: 76,
noteId: 876543
}
]
}
but with thousands of entries and more values per object.
Every groupId can have any noteId, but the same group always has the same noteId.
The problem: noteIds can be shared between groups.
I added this:
{ $project: {
"groupIds": {"$setUnion": "$data.groupId"}
}}
which gives me all the groupIds,
but it is very important that I also get all the related noteIds, because
a noteId is an arbitrary ID that is related to nothing else.
Is it possible to somehow union objects by a specified field?
Or is there another way to solve this? If I filter for objects with $in($data.groupId, $setUnion('union from above')), I still would not know how to extract only the 2 fields that I need.
thanks for your help in advance
H.M.

You can use the aggregation below:
db.collection.aggregate([
{ "$unwind": "$data" },
{ "$group": {
"_id": {
"_id": "$_id",
"groupId": "$data.groupId"
},
"noteIds": {
"$push": {
"noteId": "$data.noteId",
}
}
}},
{ "$group": {
"_id": "$_id._id",
"data": {
"$push": {
"groupId": "$_id.groupId",
"noteIds": "$noteIds"
}
}
}}
])
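
As a rough illustration of the result the question is after (one noteId per distinct groupId), here is a plain JavaScript sketch that needs no MongoDB. The helper name distinctGroupNotes is hypothetical and the sample values come from the question. Note that the pipeline above uses $push, so a group that appears several times in "data" gets its noteId repeated; swapping $push for $addToSet would be one way to de-duplicate on the server.

```javascript
// Plain JavaScript, no MongoDB required. Sample values follow the question's document.
const data = [
  { id: 1, groupId: 22, noteId: 876543 },
  { id: 2, groupId: 33, noteId: 767676 },
  { id: 3, groupId: 22, noteId: 876543 },
  { id: 4, groupId: 76, noteId: 876543 },
];

// Hypothetical helper: keep the first noteId seen for each groupId.
// This relies on the question's guarantee that a group always has the same noteId.
function distinctGroupNotes(items) {
  const byGroup = new Map();
  for (const { groupId, noteId } of items) {
    if (!byGroup.has(groupId)) byGroup.set(groupId, noteId);
  }
  return [...byGroup].map(([groupId, noteId]) => ({ groupId, noteId }));
}

const groupNotes = distinctGroupNotes(data);
console.log(groupNotes);
```

Groups 22 and 76 share a noteId here, which is why grouping on groupId alone (rather than on noteId) keeps both of them.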

Related

multi-stage aggregation pipeline matching data based on fields retrieved through $lookup

I'm trying to build a complex, nested aggregation pipeline in MongoDB (4.4.9 Community Edition, using the pymongo driver for Python 3.10).
There are relevant data points in different collections which I want to aggregate into one new view (ideally) or, if that doesn't work, a new collection.
The collections, and the relevant fields therein follow a hierarchy. There is members, which contains the top-level key on which other data is to be merged,
membershipNumber.
> members.find_one()
{'_id': ObjectId('61153299af6122XXXXXXXXXXXXX'), 'membershipNumber': 'N03XXXXXX'}
Then, there's a different collection, which contains membershipNumber, but also a different, linked field, an_user_id. an_user_id is used in other collections to denote records/fields in arrays that pertain to that particular user.
I 'join' members and an_users like so:
result = members.aggregate([
{
'$lookup': {
'from': 'an_users',
'localField': 'membershipNumber',
'foreignField': 'memref',
'as': 'an_users'
}
},
{ '$unwind' : '$an_users' },
{
'$project' : {
'_id' : 1,
'membershipNumber' : 1,
'an_user_id' : '$an_users.user_id'
}
}
]);
So far so good, this returns the desired, aggregated record:
{'_id': ObjectId('61153253aBBBBBBBBBBBB'),
'membershipNumber': 'N0XXXXXXXX',
'an_user_id': '48XXXXXX'}
Now, I have a third collection in which each record is an email. It contains an_user_ids as strings in arrays, denoting which users acted on that email (the an_user_ids in the click array are the users that clicked a link in that email):
{'_id': ObjectId('blah'),
'email_id': '407XXX',
'actions_count': 17,
'administrative_title': 'test',
'bounce': ['3440XXXX'],
'click': ['38294CCC',
'418FFFF',
'48XXXXXX',
'38eGGGG']}
I want to count the number of occurrences of a given an_user_id (which I've obtained from the aggregation) in arrays (e.g. clicks, bounces, opens) in the emails collection, and include it in the .aggregate call, to retrieve something like this:
{'_id': ObjectId('61153253aBBBBBBBBBBBB'),
'membershipNumber': 'N0XXXXXXXX',
'an_user_id': '48XXXXXX',
'n_email_clicks' : 412,
'n_email_bounces' : 12
}
Further, I might want to also attach counts of an_user_id in other collections in my DB.
Consider, e.g., this collection called events:
{
"_id": "617ffa96ee11844e143a63dd",
"id": "12345",
"administrative_title": "my_event",
"created_at": {
"$date": "2020-01-15T16:28:50.000Z"
},
"event_creator_id": "123456",
"event_title": "my_event",
"group_id": "123456",
"permalink": "event_id",
"rsvp_count": 54,
"rsvps": [{
"rsvp_id": "56789",
"display_name": "John Doe",
"rsvp_user_id": "48XXXXXX",
"rsvp_created_at": {
"$date": "2020-01-28T15:38:50.000Z"
},
"rsvp_updated_at": {
"$date": "2020-01-28T15:38:50.000Z"
},
"first_name": "John",
"last_name": "Doe",
}, {
"rsvp_id": "543895",
"display_name": "James Appleslice",
"rsvp_user_id": "N03XXXXXX",
"rsvp_created_at": {
"$date": "2020-02-05T13:15:14.000Z"
},
"rsvp_updated_at": {
"$date": "2020-02-05T13:15:14.000Z"
},
"first_name": "James",
"last_name": "Appleslice"}
]
}
So, the end-product would look something like this:
{'_id': ObjectId('61153253aBBBBBBBBBBBB'),
'membershipNumber': 'N0XXXXXXXX',
'an_user_id': '48XXXXXX',
'n_email_clicks' : 412,
'n_email_bounces' : 12,
'n_rsvps' : 12
}
My idea was to use the $lookup parameter -- however, I only know how to use this for matching on fields that I have in the parent collection that I'm performing the aggregation on, but not on fields that have been generated in the process of the aggregation.
Any help would be hugely appreciated!
You could use a $lookup with a pipeline. First you $lookup the user id, followed by another $lookup to check whether the user id exists in the emails' click array. Finally, a few more stages collect the results and format them as you need. You can also add an $out stage if you would like to write the results into another collection.
db.members.aggregate([{
$lookup: {
from: "an_users",
let: {
membershipNumber: "$membershipNumber"
},
pipeline: [
{
$match: {
$expr: {
$eq: [
"$memref",
"$$membershipNumber"
]
},
}
},
{
"$lookup": {
"from": "emails",
"localField": "user_id",
"foreignField": "click",
"as": "clicks"
}
},
{
"$project": {
"_id": 1,
"membershipNumber": 1,
"an_user_id": "$user_id",
"n_email_clicks": {
$size: "$clicks"
}
}
}
],
as: "details"
}
},
{
$replaceRoot: {
newRoot: {
$mergeObjects: [
{
$arrayElemAt: [
"$details",
0
]
},
"$$ROOT"
]
}
}
},
{
$project: {
details: 0
}
}])
Working example - https://mongoplayground.net/p/yrFsNp44hpi
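
To make the counting step concrete, here is a plain JavaScript sketch (no MongoDB needed) of what taking $size of the inner $lookup result amounts to: the number of email documents whose array contains the user id. The helper countEmailsContaining and the sample emails (including the ids "408XXX" and "409XXX") are made up for illustration. The same idea would apply to the bounce array, presumably via a second $lookup on that field.

```javascript
// Hypothetical sample emails; only the first email_id appears in the question.
const emails = [
  { email_id: "407XXX", click: ["38294CCC", "48XXXXXX"], bounce: ["3440XXXX"] },
  { email_id: "408XXX", click: ["48XXXXXX"], bounce: ["48XXXXXX"] },
  { email_id: "409XXX", click: [], bounce: [] },
];

// Count the documents whose array field contains the given user id.
// This counts emails, not repeated occurrences inside a single array.
function countEmailsContaining(docs, field, userId) {
  return docs.filter(
    (doc) => Array.isArray(doc[field]) && doc[field].includes(userId)
  ).length;
}

const summary = {
  an_user_id: "48XXXXXX",
  n_email_clicks: countEmailsContaining(emails, "click", "48XXXXXX"),
  n_email_bounces: countEmailsContaining(emails, "bounce", "48XXXXXX"),
};
console.log(summary);
```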

Maintaining an embedded array with top 3 elements

I'm currently working on a mobile car racing game.
After a user finishes a track, a new document is added to "Plays" collections.
Also, if the user finishes the track with the 3rd/2nd/1st best time, the user id and time are added to the "best" array of this track (and the new 4th-place user is removed from the array).
Since 2+ users can finish a track at the same time, I'll probably need to make this atomic, so I've used findAndModify.
So far I've managed to do it if I only maintain the 1st position in the array. This is what I did:
db.collection('tracks').findAndModify(
{ $or: [ {_id: track_id, 'best': {$exists: false}}, {_id: track_id,'best.0.time': {$gt: _time}} ] },
[],
{$set : {'best.0' : {'user_id': _userId, 'time': _time} }},
(err, data) => {
if (err) return app_res.send(err);
app_res.send (data.value != null);
}
);
But my goal is to maintain the 3 best.
I've looked in the MongoDB documentation for array operators, but I can't understand how (and if) they can help me achieve my goal.
Is there any way I can do it?
EDIT: Just to make this more clear, the top 3 indicates the top 3 users and their top times. for example, if "best" array is:
1. user: a, time : 5.
2. user: b, time : 9.
3. user: c, time : 20.
and then user c finishes the track in 7 seconds, then "best" changes to:
1. user: a, time : 5.
2. user: c, time : 7.
3. user: b, time : 9.
My Schema:
Users:
{
"_id": {
"$oid": "123"
},
"name": "A name"
}
Tracks:
{
"_id": {
"$oid": "765"
},
"name": "A track name",
"length": 34.65,
"best": [{"user_id": 467,"time": 24},{"user_id": 532,"time": 47},{"user_id": 953,"time": 89}]
}
Plays:
{
"_id": {
"$oid": "1"
},
"time": 300000,
"date": {
"$date": "2018-08-15T14:05:47.872Z"
},
"user_id": {
"$oid": "123"
},
"track_id": {
"$oid": "765"
}
}
Here is how you'd do that - using some special modifiers that can be used with $push:
db.tracks.update({}, {
$push: {
"best": {
$each: [ {"user_id": 123,"time": 1} ], // add a new item to the "best" array
$slice: 3, // keep only top three
$sort: { "time": 1 } // rank/sort based on "time" field
}
}
})
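
For anyone unsure what the three modifiers do together, here is a plain JavaScript sketch of the same semantics: append, sort ascending by "time", keep the first three. The function name pushSortSlice is hypothetical and the new entry (user 123, time 30) is made up. On the server this happens inside a single document update; the sketch only mirrors the resulting array. It does not collapse several entries for the same user, so the case in the question's EDIT would still need extra handling.

```javascript
// Emulates $push with $each + $sort + $slice on an in-memory array.
function pushSortSlice(best, entry, limit) {
  return [...best, entry]              // $each: append the new entry
    .sort((a, b) => a.time - b.time)   // $sort: { time: 1 }
    .slice(0, limit);                  // $slice: keep the first `limit` items
}

// "best" array taken from the Tracks schema in the question.
const best = [
  { user_id: 467, time: 24 },
  { user_id: 532, time: 47 },
  { user_id: 953, time: 89 },
];

const updated = pushSortSlice(best, { user_id: 123, time: 30 }, 3);
console.log(updated);
```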

Get sum of Nested Array in Aggregate

Ok, I have an issue I cannot seem to solve.
I have a document like this:
{
"playerId": "43345jhiuy3498jh4358yu345j",
"leaderboardId": "5b165ca15399c020e3f17a75",
"data": {
"type": "EclecticData",
"holeScores": [
{
"type": "RoundHoleData",
"xtraStrokes": 0,
"strokes": 3,
},
{
"type": "RoundHoleData",
"xtraStrokes": 1,
"strokes": 5,
},
{
"type": "RoundHoleData",
"xtraStrokes": 0,
"strokes": 4
}
]
}
}
Now, what I am trying to accomplish is to use aggregate to sum the strokes and then order by the sum afterwards. I am trying this:
var sortedBoard = db.collection.aggregate(
{$match: {"leaderboardId": boardId}},
{$group: {
_id: "$playerId",
played: { $sum: 1 },
strokes: {$sum: '$data.holeScores.strokes'}
}
},
{$project:{
type: "$SortBoard",
avgPoints: '$played',
sumPoints: "$strokes",
played : '$played'
}}
);
The issue here is that I do not get the correct strokes sum, since it is inside another array.
Hope someone can help me with this and thanks in advance :-)
You need to say $sum twice:
var sortedBoard = db.collection.aggregate([
{ "$match": { "leaderboardId": boardId}},
{ "$group": {
"_id": "$playerId",
"SortBoard": { "$first": "$SortBoard" },
"played": { "$sum": 1 },
"strokes": { "$sum": { "$sum": "$data.holeScores.strokes"} }
}},
{ "$project": {
"type": "$SortBoard",
"avgPoints": "$played",
"sumPoints": "$strokes",
"played": "$played"
}}
])
The reason is that you are using it both as a way to "sum array values" and as an "accumulator" for $group.
The other thing you appear to be missing is that $group only outputs the fields you tell it to, therefore if you want to access other fields in other stages or output, you need to keep them with something like $first or another accumulator. We also appear to be missing a pipeline stage in the question anyway, but it's worth noting just to be sure.
Also note you really should wrap aggregation pipelines as an official array [], because the legacy usage is deprecated and can cause problems in some language implementations.
Returns the correct details of course:
{
"_id" : "43345jhiuy3498jh4358yu345j",
"avgPoints" : 1,
"sumPoints" : 12,
"played" : 1
}
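
The two roles of $sum can be mimicked in plain JavaScript, which may make the nesting easier to see: the inner sum adds up the array inside one document, and the outer sum accumulates those per-document totals across the group. This is only a sketch. The function groupStrokes is hypothetical, and the second document (a second round for the same player) is invented so that the outer sum has something to add.

```javascript
const sum = (nums) => nums.reduce((a, b) => a + b, 0);

const docs = [
  { // holeScores from the question
    playerId: "43345jhiuy3498jh4358yu345j",
    data: { holeScores: [{ strokes: 3 }, { strokes: 5 }, { strokes: 4 }] },
  },
  { // hypothetical second round, not in the question
    playerId: "43345jhiuy3498jh4358yu345j",
    data: { holeScores: [{ strokes: 4 }, { strokes: 4 }] },
  },
];

function groupStrokes(documents) {
  const groups = new Map();
  for (const doc of documents) {
    // Inner $sum: total of the array inside a single document.
    const perDoc = sum(doc.data.holeScores.map((hole) => hole.strokes));
    const group = groups.get(doc.playerId) || { _id: doc.playerId, played: 0, strokes: 0 };
    group.played += 1;       // { $sum: 1 }
    group.strokes += perDoc; // Outer $sum: accumulator across the group.
    groups.set(doc.playerId, group);
  }
  return [...groups.values()];
}

const board = groupStrokes(docs);
console.log(board);
```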

MongoDB Sum Array With Objects

Say I have an aggregation that returns the following:
[
{driverId: 21312asd12, cars: 2, totalMiles: 30000, family: 4},
{driverId: 55512a23a2, cars: 3, totalMiles: 55000, family: 2},
...
]
How would I go about running a summation of each data set on a groupId basis to return the following? Do I use an $unwind? Do another grouping?
For example I would like to return:
{
totalDrivers: 2,
totalCars: 5,
totalMiles: 85000,
totalFamily: 6
}
You seem to just be referring to the documents in the output as an "array", therefore just add another $group to the end of your pipeline:
{ "$group": {
"_id": null,
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$cars" },
"totalMiles": { "$sum": "$totalMiles" },
"totalFamily": { "$sum": "$family" }
}}
Where null is essentially just a blank grouping key that is not a field present in the document to group on. The result should be a single document (albeit in an array, depending on the API method call used or server version).
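
In plain JavaScript terms, grouping on a null key is roughly a single reduce over all documents. A minimal sketch using the two rows from the question (the driverId values are quoted as strings here, which is an assumption about their type):

```javascript
const rows = [
  { driverId: "21312asd12", cars: 2, totalMiles: 30000, family: 4 },
  { driverId: "55512a23a2", cars: 3, totalMiles: 55000, family: 2 },
];

// One accumulator for everything, like { $group: { _id: null, ... } }.
const totals = rows.reduce(
  (acc, row) => ({
    totalDrivers: acc.totalDrivers + 1,             // { $sum: 1 }
    totalCars: acc.totalCars + row.cars,            // { $sum: "$cars" }
    totalMiles: acc.totalMiles + row.totalMiles,    // { $sum: "$totalMiles" }
    totalFamily: acc.totalFamily + row.family,      // { $sum: "$family" }
  }),
  { totalDrivers: 0, totalCars: 0, totalMiles: 0, totalFamily: 0 }
);
console.log(totals);
```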
Or if you actually mean that each document has a field with an array like this, then $unwind and process the group either per document or with a null as above:
{ "$unwind": "$someArray" },
{ "$group": {
"_id": "$_id",
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$someArray.cars" },
"totalMiles": { "$sum": "$someArray.totalMiles" },
"totalFamily": { "$sum": "$someArray.family" }
}}
At any rate, you should really post the code you are using when asking questions like this. It is very likely that your pipeline may not be as efficient to get to your end goal as you think, and if you posted that it both gives a clear picture of what you are doing as well as leaves it open for suggested improvement.

Sub-query in MongoDB

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:
{_id: ObjectId("xxxxx"), country: "UK",...}
and actions like
{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}
I am trying to count events and distinct users split by country. The first half of which is working fine, however when I try to add in a sub-query to pull the country I only get nulls out for country
db.events.aggregate({
$match: {
createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
user: { $exists: true }
}
},
{
$group: {
_id: {
year: { $year: "$createdAt" },
user_obj: "$user"
},
count: { $sum: 1 }
}
},
{
$group: {
_id: {
year: "$_id.year",
country: db.users.findOne({
_id: { $eq: "$_id.user_obj" },
country: { $exists: true }
}).country
},
total: { $sum: "$count" },
distinct: { $sum: 1 }
}
})
No Joins in here, just us bears
So MongoDB "does not do joins". You might have tried something like this in the shell for example:
db.events.find().forEach(function(event) {
event.user = db.users.findOne({ "_id": event.user });
printjson(event)
})
But this does not do what you seem to think it does. It does exactly what it looks like: it runs a query on the "users" collection for every item returned from the "events" collection, with a round trip "to and from" the "client" each time, and none of it runs on the server.
For the same reason, your 'embedded' statement within an aggregation pipeline does not work like that. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So if you did something like this to select "UK" users:
db.events.aggregate([
{ "$match": {
"user": {
"$in": db.users.distinct("_id",{ "country": "UK" })
}
}}
])
Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. So the .distinct() runs first, returns its array as an argument, and then the whole pipeline is sent to the server. That is the order of execution.
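
The order of execution is ordinary JavaScript argument evaluation, and it can be seen without a database. In this sketch fakeDb and its two methods are stand-ins invented for illustration (they are not MongoDB APIs); they only record the order in which they are called.

```javascript
const order = [];

// Hypothetical stand-in for the shell's db object; it only logs call order.
const fakeDb = {
  distinctUserIds(country) {
    order.push("client: distinct() runs and returns an array");
    return ["u1", "u2"]; // made-up ids
  },
  aggregate(pipeline) {
    order.push("client: finished pipeline is sent to the server");
    return pipeline; // what the server would receive
  },
};

// The inner call is evaluated first, so the pipeline already holds a plain array.
const sent = fakeDb.aggregate([
  { $match: { user: { $in: fakeDb.distinctUserIds("UK") } } },
]);

console.log(order);
console.log(JSON.stringify(sent));
```

By the time aggregate is called, "$in" holds literal values, so nothing inside the pipeline can refer back to the call that produced them.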
Correcting
You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:
Embed your whole user object data within the event data.
At least embed "some" of the user object data within the event data. In this case "country", because you are going to use it.
So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:
{
"_id": ObjectId("yyyyy"),
"createdAt": ISODate(),
"user": {
"_id": ObjectId("xxxxx"),
"country": "UK"
}
}
Then the "aggregation" process becomes simple:
db.events.aggregate([
{ "$match": {
"createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
"user": { "$exists": true }
}},
{ "$group": {
"_id": {
"year": { "$year": "$createdAt" },
"user_id": "$user._id",
"country": "$user.country"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.country",
"total": { "$sum": "$count" },
"distinct": { "$sum": 1 }
}}
])
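
Here is how the two $group stages fit together, sketched in plain JavaScript with made-up events (the user ids and dates are hypothetical). The first pass produces one row per (year, user, country) with an event count. The second pass groups those rows by country, so summing "count" gives total events and counting rows gives distinct users.

```javascript
// Hypothetical events with the country embedded, as in the de-normalized shape above.
const events = [
  { createdAt: new Date("2013-02-01T00:00:00Z"), user: { _id: "u1", country: "UK" } },
  { createdAt: new Date("2013-03-01T00:00:00Z"), user: { _id: "u1", country: "UK" } },
  { createdAt: new Date("2013-04-01T00:00:00Z"), user: { _id: "u2", country: "UK" } },
  { createdAt: new Date("2013-05-01T00:00:00Z"), user: { _id: "u3", country: "FR" } },
];

function countByCountry(evts) {
  // First $group: one entry per (year, user, country), counting events.
  const perUser = new Map();
  for (const e of evts) {
    const key = JSON.stringify([e.createdAt.getUTCFullYear(), e.user._id, e.user.country]);
    perUser.set(key, (perUser.get(key) || 0) + 1);
  }
  // Second $group: per country, sum the counts and count the rows.
  const perCountry = new Map();
  for (const [key, count] of perUser) {
    const country = JSON.parse(key)[2];
    const group = perCountry.get(country) || { _id: country, total: 0, distinct: 0 };
    group.total += count;  // { $sum: "$count" }
    group.distinct += 1;   // { $sum: 1 }
    perCountry.set(country, group);
  }
  return [...perCountry.values()];
}

const byCountry = countByCountry(events);
console.log(byCountry);
```

Because the first key includes the year, a user active in two different years contributes two rows, so "distinct" is per year and user unless the year is also kept in the second grouping key.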
We're not normal
Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:
var bulk = db.events.initializeUnorderedBulkOp(),
count = 0;
db.users.find().forEach(function(user) {
// update multiple events for user
bulk.find({ "user": user._id }).update({
"$set": { "user": { "_id": user._id, "country": user.country } }
});
count++;
// Send batch every 1000
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = db.events.initializeUnorderedBulkOp();
}
});
// Clear any queued
if ( count % 1000 != 0 )
bulk.execute();
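
The "send every 1000, then clear what is left" pattern is independent of MongoDB, so here it is on its own as a plain JavaScript sketch. The names processInBatches and flush are hypothetical; flush stands in for bulk.execute() plus re-initializing the bulk object.

```javascript
// Queue items and flush every `batchSize`; flush any remainder at the end.
function processInBatches(items, batchSize, flush) {
  let batch = [];
  let count = 0;
  for (const item of items) {
    batch.push(item);
    count++;
    if (count % batchSize === 0) {
      flush(batch); // like bulk.execute()
      batch = [];   // like re-initializing the bulk operation
    }
  }
  // Clear any queued items that did not fill a whole batch.
  if (count % batchSize !== 0) flush(batch);
}

const flushedSizes = [];
processInBatches(
  Array.from({ length: 2500 }, (_, i) => i),
  1000,
  (batch) => flushedSizes.push(batch.length)
);
console.log(flushedSizes);
```

The final check matters in both directions: without it the last partial batch is lost, and with an unconditional flush an exact multiple of the batch size would send an empty batch.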
So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.
If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just an "_id" and a "country".
But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.