mongodb/meteor: how do I get the value of one field corresponding to the $max value of another field? - mongodb

I have a collection of messages with the following fields: _id, senderId, receiverId, dateSubmittedMs, message, and for a given user I want to return the latest message to him from all other users. So, for example, if there are users Alex, Barb, Chuck, Dora, I would like to return the most recent message between Alex and each of Barb, Chuck and Dora. What is the best way to do this? Can I do it in one step using aggregation?
The aggregation examples in the official online documentation (http://docs.mongodb.org/manual/reference/aggregation/min/) show how to find the lowest age over groups within a collection, but what I need is something analogous to finding the name of the youngest person over groups of people.
Here is my current approach:
Step 1: Find the highest value of submitted over all messages sent or received by Alex, grouping over the other users:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$group: {_id: "$receiverId", lastestSubmitted: {$max: "$submitted"} }}).fetch();
Step 2: Create an array of these highest values of submitted:
var MIds = _.pluck(M, 'latestSubmitted');
Step 3: Find these messages, by senderId, receiverId, and latestSubmitted:
return Messages.find(
{submitted: {$in: MIds}, $or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]},
{sort: {submitted: 1}}
);
There are two problems with this:
Can it be done in one step instead of three? Perhaps through a mapReduce or Aggregate command?
Instead of grouping only over receiverId: 'Alex', is there a way to group over something like $or: [{receiverId: 'Alex', senderId: 'Barb'}, {senderId: 'Alex', receiverId: 'Barb'}], but for EACH of the other users? That would let me get the latest message in a conversation between any two participants that Alex conversed with.
Any suggestions?

The only thing you have to change is the group _id in the grouping phase. You can use a document for this purpose, not just a single field, and you can apply the sort in the same pipeline. However, nothing changes if you group only on receiverId.
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$group:
{_id: {
receiverId: "$receiverId",
senderId: "$senderId"},
latestSubmitted: {$max: "$submitted"} }
},
{$sort: {latestSubmitted: -1}
},
{$limit: 1}
).fetch();
The example above only adds the possibility to check for the second or third most recently pinged connection. If you only want the single most recent message, you do not even need to group. Just run this:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$sort: {submitted: -1}
},
{$limit: 1}
).fetch();
THE ANSWER ABOVE IS FOR GETTING THE MOST RECENT MESSAGE BETWEEN A SINGLE USER AND OTHERS; TO GET THE MOST RECENT MESSAGE BETWEEN ALL PAIRS INVOLVING A USER, READ ON BELOW >>>
Based on the comments, I misinterpreted the question a bit, but the part above is useful anyway. The correct resolution for your problem is below. The difficulty is to get a key for the pair to group on, one that also identifies the switched pair (bob -> tom == tom -> bob). You can use $cond with an ordering comparison to normalize the swaps. (It is certainly a much more difficult question.) The code looks like this:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$project:
{'part1':{$cond:[{$gt:['$senderId','$receiverId']},'$senderId','$receiverId']},
'part2':{$cond:[{$gt:['$senderId','$receiverId']},'$receiverId','$senderId']},
'message':1,
'submitted':1
}
},
{$sort: {submitted: -1}},
{$group:
{_id: {
part1: "$part1",
part2: "$part2"},
latestSubmitted: {$first: "$submitted"},
message: {$first: "$message"} }
}
).fetch();
If you are not familiar with some of the operators used above, like $cond or $first, check out the aggregation operator reference in the MongoDB documentation.

Related

Mongodb - Find coupled documents, where A is following B and B is following A

I'm trying to find all users in my database that follow the user back.
The followers collection has 3 fields: _id, _t, _f.
When a user follows another user, a document is created with the follower's ID in _f and the ID of the target user in _t.
I need to query all documents where the inverse of (_f, _t) also exists in the database, and retrieve the IDs of all users that follow back.
EXAMPLE:
{
_id: "62a0f3fb362c239460e8ee09",
_f: "611531d23039d93be3bf2e2a",
_t: "61bdd50570a6a12866e3297f",
createdAt: "2022-06-08T19:03:35.246Z"
},
{
_id: "62a0f3fb362c239460e8ee09",
_f: "61bdd50570a6a12866e3297f",
_t: "611531d23039d93be3bf2e2a",
createdAt: "2022-06-08T19:03:35.246Z"
}
If these two documents existed in a collection I would want the query to retrieve them both so that I could pull the _f value, thus retrieve all the users that I follow AND follow me back.
There is a difference between finding all couples, and finding all couples that a specific user is a part of.
If you want to find all couples, you can use an aggregation pipeline with $lookup:
db.followers.aggregate([
{
$lookup: {
from: "followers",
let: {f: "$_f", t: "$_t"},
pipeline: [
{
$match: {
$expr: {$and: [{$eq: ["$_t", "$$f"]}, {$eq: ["$_f", "$$t"]}]}
}
},
{$project: {_id: 1}}
],
as: "hasCouple"
}
},
{$match: {$expr: {$gt: [{$size: "$hasCouple"}, 0]}}},
{$set: {hasCouple: {$first: "$hasCouple._id"}}}
])
See how it works on the playground example - all couples
If you want just the couples that a user x is part of, add a $match step at the start to get focused results:
{$match: {_f: x}},
See how it works on the playground example - specific user

How to $addFields in MongoDB retrospectively

I have a schema that has a like array. This array stores all the people who have liked my post.
I just added a likeCount field as well, but the likeCount field's default value is 0.
How can I use $addFields in MongoDB so that I can update likeCount with the length of the like array?
I am on a MERN stack.
I am assuming you have a data structure like this:
{
postId: "post1",
likes: [ "ID1", "ID2", "ID3" ]
}
There is almost no reason to add a likeCount field. You should take the length of the likes array itself. Some examples:
db.foo.insert([
{'post':"P1", likes: ["ID1","ID2","ID3"]},
{'post':"P2", likes: ["ID1","ID2","ID3"]},
{'post':"P3", likes: ["ID4","ID2","ID6","ID7"]}
]);
// Which post has the most likes?
db.foo.aggregate([
{$addFields: {N: {$size: "$likes"}}},
{$sort: {"N":-1}}
//, {$limit: 2} // optionally limit to whatever
]);
// Is ID6 in likes?
// $match of a scalar to an input field ('likes') acts like
// $in for convenience:
db.foo.aggregate([ {$match: {'likes':'ID6'}} ]);
// Is ID6 OR ID3 in likes?
db.foo.aggregate([ {$match: {'likes':{$in:['ID6','ID3']}}} ]);
// Is ID2 AND ID7 in likes?
// This is a fancier way of doing set-to-set compares instead
// of a bunch of expression passed to $and:
var targets = ['ID7','ID2'];
db.foo.aggregate([
{$project: {X: {$eq:[2, {$size:{$setIntersection: ['$likes', targets]}} ]} }}
]);
// Who likes the most across all posts?
db.foo.aggregate([
{$unwind: '$likes'},
{$group: {_id: '$likes', N:{$sum:1}} },
{$sort: {'N':-1}}
]);
This is how to update all the documents with the respective likeCount values the first time:
db.collection.update({},
[
{
$addFields: {
likeCount: {
$size: "$like"
}
}
}
],
{
multi: true
})
Every subsequent time one or more people are added to the like array, you may set likeCount with $size again, or increase the count with an $inc operation; for example:
Of course, as @Buzz pointed out below, it is best to leave the counting to the read code (taking the array's size there), since updating likeCount on every like is an expensive operation with performance implications under heavy load...
playground

MongoDB - Safely sort inner array after group

I'm trying to look up all records that match a certain condition, in this case fk being certain values, and then return only the top 2 results per group, sorted by the name field.
This is what I have
db.getCollection('col1').aggregate([
{$match: {fk: {$in: [1, 2]}}},
{$sort: {fk: 1, name: -1}},
{$group: {_id: "$fk", items: {$push: "$$ROOT"} }},
{$project: {items: {$slice: ["$items", 2]} }}
])
and it works, BUT, it's not guaranteed. According to this Mongo thread $group does not guarantee document order.
This would also mean that all of the suggested solutions here and elsewhere, which recommend using $unwind, followed by $sort, and then $group, would also not work, for the same reason.
What is the best way to accomplish this with Mongo (any version)? I've seen suggestions that this could be accomplished in the $project phase, but I'm not quite sure how.
You are correct in saying that the result of $group is never sorted.
$group does not order its output documents.
Hence doing a;
{$sort: {fk: 1}}
then grouping with
{$group: {_id: "$fk", ... }},
will be a wasted effort.
But there is a silver lining with sorting on name: -1 before the $group stage. Since you are using $push (not $addToSet), pushed objects will retain, in the newly created items array, the order they had in the incoming document stream. You can see this behaviour here (copy of your pipeline)
The items array will always have;
"items": [
{
..
"name": "Michael"
},
{
..
"name": "George"
}
]
in the same order; therefore your nested-array sort is a non-issue! Though I am unable to find an exact quote in the documentation to confirm this behaviour, you can check;
this,
or this where it is confirmed.
Also, the accumulator operator list for $group, where $addToSet has "Order of the array elements is undefined." in its description, whereas the similar operator $push does not, which might be indirect evidence? :)
Just a simple modification of your pipeline where you move the fk: 1 sort from pre-$group stage to post-$group stage;
db.getCollection('col1').aggregate([
{$match: {fk: {$in: [1, 2]}}},
{$sort: {name: -1}},
{$group: {_id: "$fk", items: {$push: "$$ROOT"} }},
{$sort: {_id: 1}},
{$project: {items: {$slice: ["$items", 2]} }}
])
should be sufficient to have the main result array order fixed as well. Check it on mongoplayground
$group doesn't guarantee the overall document order, but it does keep the grouped documents in the sorted order within each bucket. So in your case, even though the documents after the $group stage are not sorted by fk, each group's items array will be sorted by name descending. If you would like to keep the documents sorted by fk, just add {$sort: {fk: 1}} after the $group stage.
You could also sort by the order of the values passed in your $match query, should you need to, by adding an extra field to each document. Something like:
db.getCollection('col1').aggregate([
{$match: {fk: {$in: [1, 2]}}},
{$addFields:{ifk:{$indexOfArray:[[1, 2],"$fk"]}}},
{$sort: {ifk: 1, name: -1}},
{$group: {_id: "$ifk", items: {$push: "$$ROOT"}}},
{$sort: {_id : 1}},
{$project: {items: {$slice: ["$items", 2]}}}
])
Update, to allow an array sort without the $group operator: I've found the JIRA ticket which is going to allow sorting an array.
On older servers, you could try the $project stage below to sort the array. There may be various ways to do it; this one sorts names descending. It works, but it is a slower solution.
{"$project":{"items":{"$reduce":{
"input":"$items",
"initialValue":[],
"in":{"$let":{
"vars":{"othis":"$$this","ovalue":"$$value"},
"in":{"$let":{
"vars":{
//return index 0 when comparing the first value with the initial (empty) value; otherwise return the insertion index, i.e. the number of values in the accumulator array that sort before the current value.
"index":{"$cond":{
"if":{"$eq":["$$ovalue",[]]},
"then":0,
"else":{"$reduce":{
"input":"$$ovalue",
"initialValue":0,
"in":{"$cond":{
"if":{"$lt":["$$othis.name","$$this.name"]},
"then":{"$add":["$$value",1]},
"else":"$$value"}}}}
}}
},
//insert the current value at the found index
"in":{"$concatArrays":[
{"$slice":["$$ovalue","$$index"]},
["$$othis"],
{"$slice":["$$ovalue",{"$subtract":["$$index",{"$size":"$$ovalue"}]}]}]}
}}}}
}}}}
A simple example demonstrating how each iteration works:
db.b.insert({"items":[2,5,4,7,6,3]});
othis  ovalue        index  concat arrays (parts with counts)  return value
2      []            0      [],0  [2]  [],0                    [2]
5      [2]           0      [],0  [5]  [2],-1                  [5,2]
4      [5,2]         1      [5],1  [4]  [2],-1                 [5,4,2]
7      [5,4,2]       0      [],0  [7]  [5,4,2],-3              [7,5,4,2]
6      [7,5,4,2]     1      [7],1  [6]  [5,4,2],-3             [7,6,5,4,2]
3      [7,6,5,4,2]   4      [7,6,5,4],4  [3]  [2],-1           [7,6,5,4,3,2]
Reference - Sorting Array with JavaScript reduce function
There is a bit of a red herring in the question: $group does guarantee that it will process incoming documents in order (and that's why you sort them before $group to get ordered arrays). But there is an issue with the way you propose doing it, since pushing all the documents into a single grouping is (a) inefficient and (b) could potentially exceed the maximum document size.
Since you only want the top two for each of the unique fk values, the most efficient way to accomplish it is via a "subquery" using $lookup, like this:
db.coll.aggregate([
{$match: {fk: {$in: [1, 2]}}},
{$group:{_id:"$fk"}},
{$sort: {_id: 1}},
{$lookup:{
from:"coll",
as:"items",
let:{fk:"$_id"},
pipeline:[
{$match:{$expr:{$eq:["$fk","$$fk"]}}},
{$sort:{name:-1}},
{$limit:2},
{$project:{_id:0, fk:1, name:1}}
]
}}
])
Assuming you have an index on {fk: 1, name: -1}, as you must in order to get an efficient sort in your proposed code, the first two stages here will use that index via a DISTINCT_SCAN plan, which is very efficient; then, for each unique value, $lookup will use that same index to filter by a single value of fk and return results already sorted and limited to the first two. This will be the most efficient way to do this, at least until https://jira.mongodb.org/browse/SERVER-9377 is implemented by the server.

Mongo query: array of objects where a key's value is repeated

I am new to Mongo. Posting this question because I am not sure how to search for this on Google.
I have book documents like the one below:
{
bookId: 1,
title: 'some title',
publicationDate: 'DD-MM-YYYY',
editions: [{
editionId: 1
},{
editionId: 2
}]
}
and another one like this
{
bookId: 2,
title: 'some title 2',
publicationDate: 'DD-MM-YYYY',
editions: [{
editionId: 1
},{
editionId: 1
}]
}
I want to write a query db.books.find({}) which would return only those books where editions.editionId has been duplicated for a book.
So in this example, for bookId: 2 there are two editions with the editionId:1.
Any suggestions?
You can use the aggregation framework; specifically, you can use the $group operator to group the records together by book and edition id, and count how many times each combination occurs: if the count is greater than 1, you've found a duplication.
Here is an example:
db.books.aggregate([
{$unwind: "$editions"},
{$group: {"_id": {"_id": "$_id", "editionId": "$editions.editionId"}, "count": {$sum: 1}}},
{$match: {"count" : {"$gt": 1}}}
])
Note that this does not return the entire book records, but it does return their identifiers; you can then use these in a subsequent query to fetch the entire records, or do some de-duplication for example.

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
'id': ObjectId("50ad8d451d41c8fc58000003")
'name': 'Sample Log 1',
'uploaded_at: ISODate("2013-03-14T01:00:00+01:00"),
'case_id: '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x: ['TAG-1','TAG-2'],
'group_y': ['XYZ']
}
},
{
'id': ObjectId("50ad8d451d41c8fc58000004")
'name': 'Sample Log 2',
'uploaded_at: ISODate("2013-03-15T01:00:00+01:00"),
'case_id: '50ad8d451d41c8fc58000099'
'tag_doc': {
'group_x: ['TAG-1'],
'group_y': ['XYZ']
}
}
> db.cases.findOne()
{
'id': ObjectId("50ad8d451d41c8fc58000099")
'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages but, as much as possible, I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:'$uploaded_at'}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is simply not optimized for my use-case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy, but this proved to be confusing. The above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. But it's not possible with $match, only with one $group pipeline. The trick is to use a compound (multi-key) index with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }, { user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create an index like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregation without a sort stage, which is much faster and will work on shards with replicas. If you don't have an index with the correct sort order, you can add a $sort pipeline stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this - but it does. We use it on production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally such that you only need to pick out the latest post of each author; instead, you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfil an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal; for example:
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first instead, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "Name" and "tag_doc".