Matching for latest documents for a unique set of fields before aggregating - mongodb

Assuming I have the following document structures:
> db.logs.find()
{
  'id': ObjectId("50ad8d451d41c8fc58000003"),
  'name': 'Sample Log 1',
  'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1', 'TAG-2'],
    'group_y': ['XYZ']
  }
},
{
  'id': ObjectId("50ad8d451d41c8fc58000004"),
  'name': 'Sample Log 2',
  'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1'],
    'group_y': ['XYZ']
  }
}
> db.cases.findOne()
{
  'id': ObjectId("50ad8d451d41c8fc58000099"),
  'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages but, as much as possible, I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
  {$match: {...}}, // some match filters here
  {$project: {tag: '$tag_doc.group_x', case: '$case_id', latest: '$uploaded_at'}},
  {$unwind: '$tag'},
  {$group: {_id: {tag: '$tag', case: '$case'}, latest: {$max: '$latest'}}},
  {$group: {_id: '$_id.tag', total: {$sum: 1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves to be costly when handling a large number of documents. That is why I want to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for a MongoDB implementation similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use case I tried to use a simple analogy, but this proved to be confusing. The above is now the simplified form of the actual use case. Sorry for the confusion I created.

I have done something similar. But it's not possible with $match, only with a single $group stage. The trick is to use a compound key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create a key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregate without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order you can add a $sort stage to the pipeline, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
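In shell terms, that key is just an index on the messages collection (a sketch; shells of that era used ensureIndex rather than createIndex):
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 });
With that index in place, you can run the aggregation directly: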
db.user_messages.aggregate(
  { $match: { user_id: 1 } },
  { $group: {
      _id: "$address",
      count: { $sum: 1 },
      date_sent: { $max: "$date_sent" },
      message: { $first: "$message" }
  } }
);
It's not documented that it should work like this, but it does. We use it on a production system.

I'd use another collection to 'create' the search results on the fly, by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)

Hmmm, there is no good way of doing this optimally such that you only pick out the latest post per author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
  {$sort: {created_at: -1}},
  {$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
  {$unwind: '$tags'},
  {$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said, this is not optimal; however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
  _id: {},
  author: {},
  tag: 'something',
  created_at: ISODate(),
  post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
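A hedged sketch of that upsert, run whenever a new post is created (the collection name latestPosts and the literal author id are assumptions; the document shape mirrors the one above):
// For each tag on the new post, upsert the (author, tag) summary document
// so it always carries the newest created_at and post_id.
var post = {
  _id: ObjectId(),
  author: ObjectId("50ad8d451d41c8fc58000099"), // hypothetical author id
  tags: ['TAG-1', 'TAG-2'],
  created_at: new Date()
};
post.tags.forEach(function (tag) {
  db.latestPosts.update(
    { author: post.author, tag: tag }, // the unique author/tag combination
    { $set: { created_at: post.created_at, post_id: post._id } },
    { upsert: true }
  );
});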

Here you go:
db.logs.aggregate(
  { "$sort": { "uploaded_at": -1 } },
  { "$match": { ... } },
  { "$unwind": "$tag_doc.group_x" },
  { "$group": {
      "_id": { "case": "$case_id", "tag": "$tag_doc.group_x" },
      "latest": { "$first": "$uploaded_at" },
      "Name": { "$first": "$Name" },
      "tag_doc": { "$first": "$tag_doc" }
  } }
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "Name" and "tag_doc".
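For example, a descending index on the sort field lets the $sort stage walk the index instead of sorting in memory (a sketch; createIndex is ensureIndex in older shells, and a compound key that also covers your $match fields is an option, depending on your filters):
db.logs.createIndex({ uploaded_at: -1 }); // serves the {"$sort": {"uploaded_at": -1}} stage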

Related

How to build a MongoDB query that combines two fields temporarily?

I have a schema which has one field named ownerId and an array field named participantsIds. In the frontend, users can select participants. I'm using these ids to filter documents by querying participantsIds with the $all operator and the list of participantsIds from the frontend. This is perfect, except that participantsIds in the document doesn't include the ownerId. I thought about using aggregation to add a new field consisting of a list like this one: [participantsIds, ownerId], then querying against this new field with $all, and after that deleting the field again since it isn't needed in the frontend.
What would such a query look like, or is there any better way to achieve this behavior? I'm really lost right now since I've been trying to implement this with mongo_dart for the last 3 hours.
This is what the schema looks like:
{
  _id: ObjectId(),
  title: 'Title of the Event',
  startDate: '2020-09-09T00:00:00.000',
  endDate: '2020-09-09T00:00:00.000',
  startHour: 1,
  durationHours: 1,
  ownerId: '5f57ff55202b0e00065fbd10',
  participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13'],
  classesIds: [],
  categoriesIds: [],
  roomsIds: [],
  creationTime: '2020-09-10T16:42:14.966',
  description: 'Some Desc'
}
Tl;dr I want to query documents with the $all operator on the participantsIds field but the ownerId should be included in this query.
What I want is instead of querying against:
participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13']
I want to query against:
participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13', '5f57ff55202b0e00065fbd10']
Having fun here, by the way. It's better to use Joe's answer if you are running the query frequently, or better yet, to maintain an "all" field on insertion.
Additional notes: use projection at the start/end to get only what you need.
https://mongoplayground.net/p/UP_-IUGenGp
db.collection.aggregate([
  {
    "$addFields": {
      "all": {
        $setUnion: [
          "$participantsIds",
          ["$ownerId"]
        ]
      }
    }
  },
  {
    $match: {
      all: {
        $all: [
          "5f57ff55202b0e00065fbd14",
          "5f57ff55202b0e00065fbd15",
          "5f57ff55202b0e00065fbd13",
          "5f57ff55202b0e00065fbd10"
        ]
      }
    }
  }
])
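Per the note above about projecting at the end, one more stage strips the temporary all field from the output; a sketch (on servers before 4.2, which lack $unset, $project: { all: 0 } does the same job):
db.collection.aggregate([
  { "$addFields": { "all": { $setUnion: ["$participantsIds", ["$ownerId"]] } } },
  { $match: { all: { $all: [
      "5f57ff55202b0e00065fbd14",
      "5f57ff55202b0e00065fbd15",
      "5f57ff55202b0e00065fbd13",
      "5f57ff55202b0e00065fbd10"
  ] } } },
  { $project: { all: 0 } } // drop the helper field from the result
])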
Didn't fully understand what you want to do but maybe this helps:
db.collection.find({
  ownerId: "5f57ff55202b0e00065fbd10",
  participantsIds: {
    $all: [
      '5f57ff55202b0e00065fbd14',
      '5f57ff55202b0e00065fbd15',
      '5f57ff55202b0e00065fbd13'
    ]
  }
})
You could use the pipeline form of update to either add the owner to the participant list or add a new consolidated field:
db.collection.update({}, [{
  $set: {
    allParticipantsIds: {
      $setUnion: [
        "$participantsIds",
        ["$ownerId"]
      ]
    }
  }
}], {multi: true}) // multi so every document gets the consolidated field
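Once that consolidated field exists, the original $all query becomes a plain find against it, which a multikey index on allParticipantsIds could then serve (a sketch; allParticipantsIds is the name chosen above):
db.collection.find({
  allParticipantsIds: {
    $all: [
      "5f57ff55202b0e00065fbd14",
      "5f57ff55202b0e00065fbd15",
      "5f57ff55202b0e00065fbd13",
      "5f57ff55202b0e00065fbd10"
    ]
  }
})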

Best way to do this in MongoDB?

db.post
{user:'50001', content:...,},
{user:'50002', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
{user:'50002', content:...,},
db.catagory
{user:'50001', type:'public',...}
{user:'50002', type:'vip',...}
{user:'50003', type:'public',...}
What I want is to pick up the posts of users whose type is 'public'.
I can do it like this:
publicUsers = db.catagory.distinct('user', {type: 'public'})
db.post.find({user: {$in: publicUsers}}).sort({_id: -1})
or use $lookup in an aggregation,
and get this output:
{user:'50001', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
Is there some better way to do this faster?
Considering situations with a large number of requests, should I create a new collection like post_public just for these reads?
The aggregation pipeline is your best bet. It's quite performant, as it's intended for just these types of situations.
Another Edit
Ok, this is the best solution I have, but it has some limitations set by Mongo.
Firstly, a sort operation on non-indexed fields cannot use more than 100MB of RAM. That's quite a bit of space for text-based posts, but something to be aware of (there's a workaround that uses disk space if needed, but it will slow down the operation).
Second, and more relevant in this case, Mongo sets a hard cap of 16MB on the size of a single document. Given that you wanted the posts in a single array, a large dataset could potentially exceed that if no limits on the output are imposed. If you decide you don't want all posts in a single array, then by removing the last three stages of the pipeline ($group, $addFields, and $project), the result will be the sorted posts, each within their own document. You will need to limit the number of documents manually to get the n posts you want if you go that route (a sketch of that trimmed pipeline follows the access example below). IMO, you should do this. Those last three stages add additional processing burden but offer very little practical benefit to the output. The only difference is iterating over a nested array versus iterating over the root array.
db.categories.aggregate([
{$project: {
user: 1,
type: 1
}},
{$match: {
type: "public"
}},
{$lookup: {
from: 'posts',
localField: 'user',
foreignField: 'user',
as: 'userPosts'
}},
{$unwind: {
path: "$userPosts",
preserveNullAndEmptyArrays: false
}},
{$sort: {
"userPosts.posted": -1,
}},
{$group: {
_id: "public",
posts: {
$push: "$userPosts"
}
}},
{$addFields: {
"latestPosts": {
$slice: ["$posts", 5]
}
}},
{$project: {
posts: 0
}}
]);
You'll access the resulting posts like so:
.then(result => {
let latestPosts = result[0].latestPosts;
});
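If you do trim those last three stages as suggested, the tail of the pipeline becomes a $sort plus a manual $limit; a sketch, assuming the same posts collection and posted field as above (the initial $project is dropped since $match alone suffices, and the 5 is a placeholder page size):
db.categories.aggregate([
  { $match: { type: "public" } },
  { $lookup: {
      from: 'posts',
      localField: 'user',
      foreignField: 'user',
      as: 'userPosts'
  }},
  { $unwind: { path: "$userPosts", preserveNullAndEmptyArrays: false } },
  { $sort: { "userPosts.posted": -1 } },
  { $limit: 5 } // manual cap replacing the $group/$addFields/$project tail
]);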
Edit
I built a mock database and came up with this solution:
db.categories.aggregate([{
$project: {
_id: 0,
user: 1,
type: 1,
}
}, {
$match: {
type: "public"
}
}, {
$group: {
_id: {
$concat: ["$user", "-", "$type"]
},
user: {
$first: "$user"
},
type: {
$first: "$type"
}
}
}, {
$sort: {
user: -1
}
}, {
$limit: 10
}, {
$lookup: {
from: 'posts',
localField: 'user',
foreignField: 'user',
as: 'userPosts'
}
}]);
The result will be an array of objects representing each unique user-type combo with the type specified in the $match filter. Each object will contain a child array with all the posts made by that user. All the users will be sorted in descending order and limited to ten.
If all you want is a single operator to get distinct values from categories:
db.categories.aggregate([{
  $project: {
    _id: 0,
    user: 1,
    type: 1
  }
}, {
  $group: {
    _id: {$concat: ["$user", "-", "$type"]},
    user: {$first: "$user"},
    type: {$first: "$type"}
  }
}]);
The goal is to first filter out the users with non-public types so they don't appear in the result, join the posts belonging to each remaining user into a new field called "userPosts", and then sort all the users in descending order. I don't know the entire structure of your database, so my specific example may not be the best fit, but it would be a good starting point.
Edit
Some other ways to increase performance for large amounts of data:
1. Querying in pieces. Say, for example, you're displaying paginated user posts. Then you only need to query the posts for the page currently being viewed. You can reduce any noticeable loading times by pre-fetching the next few pages as well (see the sketch after this list).
2. Using a "virtual" list. This will display every post in one long list, but only fetch a certain-size chunk that is currently being viewed, or will soon be viewed. Similar to the first pagination example. If you're familiar with the RecyclerView in Android development, that's the idea.
3. Restructuring the way posts are stored in your database. It would be a much faster operation if Mongo didn't need to iterate through so much unrelated data when looking for posts of a specific user or type. The solution is to store posts nested below a common ancestor, such as their poster or their view type. This does mean that you may have duplicate entries spread between documents. This is one of the inherent downsides of a no-SQL database. You will need to find a way to keep the duplicated documents spread throughout the database synchronized.
4. Creating a new index on the document fields that you use to sort your array. Indices help Mongo organize the documents internally for more efficient searches.
5. Running a heavy query once, then $out-ing the result. This way you can store the result of expensive operations in a more permanent fashion than a simple cache.
6. A variant of number 5 is to $out the result of a large sorting operation, and then continue to insert new entries at the end of the now-sorted collection.
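A minimal sketch of option 1 in shell terms, reusing the collections from the question (skip/limit shown for simplicity; the _id-based paging in the final answer below scales better on large collections):
var pageSize = 20;
var page = 2; // zero-based page index requested by the client (assumed)
var publicUsers = db.catagory.distinct('user', { type: 'public' });
db.post.find({ user: { $in: publicUsers } })
       .sort({ _id: -1 })
       .skip(page * pageSize)
       .limit(pageSize);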
Finally, I do it like this.
db.post.aggregate([
  {$match: {
    uid: {$in: db.catagory.distinct('id', {type: 'public'})}, // directly write here
    _id: {$lt: ObjectId("5c6d5494b9f43f020d4122a0")} // for turning the page
  }},
  {$sort: {_id: -1}}, // sort by post _id
  {$limit: 20} // page size
])
It's faster than $lookup or the other options.

Returning whole object in MongoDB aggregation

I have an Item schema in which I have item details with the respective restaurant. I have to find all items of a particular restaurant and group them by 'type' and 'category' (type and category are fields in the Item schema). I am able to group items as I want, but I can't get the complete item object.
My query:
db.items.aggregate([{
  '$match': {
    'restaurant': ObjectId("551111450712235c81620a57")
  }
}, {
  '$group': {
    _id: {
      type: '$type',
      category: '$category'
    },
    id: {
      '$push': '$_id'
    }
  }
}, {
  $project: {
    id: '$id'
  }
}])
I have seen one method: adding each field value to the group and then projecting it. As I have many fields in my Item schema, I don't feel this is a good solution for me. Can I get the complete object instead of just the ids?
Well, you can always use $$ROOT, provided that your server is MongoDB 2.6 or greater:
db.items.aggregate([
{ '$match': {'restaurant': ObjectId("551111450712235c81620a57")}},
{ '$group':{
_id : {
type : '$type',
category : '$category'
},
id: { '$push': '$$ROOT' },
}}
])
This is going to place each whole document into the members of the array.
You need to be careful when doing this as with larger results you are certain to break BSON limits.
I would suggest that you are trying to construct some kind of "search results", with "facet counts" or similar. For that you are better off running a separate query for the "aggregation" part and one for the actual document results.
That is a much safer and flexible approach than trying to group everything together.
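A hedged sketch of that split, reusing the $match from above: one aggregation returns only the facet counts (nothing is pushed into arrays, so the BSON size limit is not a concern), and a plain find returns the page of documents (the limit of 20 is an assumed page size):
db.items.aggregate([
  { '$match': { 'restaurant': ObjectId("551111450712235c81620a57") } },
  { '$group': {
      _id: { type: '$type', category: '$category' },
      count: { $sum: 1 } // facet count per type/category combo
  }}
]);
db.items.find({ 'restaurant': ObjectId("551111450712235c81620a57") }).limit(20);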

MongoDB query to find property of first element of array

I have the following data in MongoDB (simplified down to what is necessary for my question).
{
  _id: 0,
  actions: [
    {
      type: "insert",
      data: "abc, quite possibly very very large"
    }
  ]
}
{
  _id: 1,
  actions: [
    {
      type: "update",
      data: "def"
    },
    {
      type: "delete",
      data: "ghi"
    }
  ]
}
What I would like is to find the first action type for each document, e.g.
{_id:0, first_action_type:"insert"}
{_id:1, first_action_type:"update"}
(It's fine if the data is structured differently, but I need those values present, somehow.)
EDIT: I've tried db.collection.find({}, {'actions.type': 1}), but obviously that returns the type for all elements of the actions array.
NoSQL is quite new to me. Before, I would have stored all this in two tables in a relational database and done something like SELECT id, (SELECT type FROM action WHERE document_id = d.id ORDER BY seq LIMIT 1) action_type FROM document d.
You can use the $slice operator in a projection. (Though for what you're doing, I am not sure the order of the array remains the same when you update it; just something to keep in mind.)
db.collection.find({},{'actions':{$slice:1},'actions.type':1})
You can also use the Aggregation Pipeline introduced in version 2.2:
db.collection.aggregate([
{ $unwind: '$actions' },
{ $group: { _id: "$_id", first_action_type: { $first: "$actions.type" } } }
])
Using the $arrayElemAt operator is actually the most elegant way, although the syntax may be unintuitive:
db.collection.aggregate([
  { $project: { first_action_type: { $arrayElemAt: ["$actions.type", 0] } } }
])
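On the sample documents above, either pipeline yields (in some order) exactly the shape asked for:
{ "_id" : 0, "first_action_type" : "insert" }
{ "_id" : 1, "first_action_type" : "update" }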

Combine two fields from different documents in Mongodb

I have these documents in a collection :
{topic : "a",
messages : [ObjectId("21312321321323"),ObjectId("34535345353"),...]
},
{topic : "b,
messages : [ObjectId("1233232323232"),ObjectId("6556565656565"),...]
}
Is there a possibility to get a result with the combination of the messages fields? I'd like to get this, for example:
[
  ObjectId(""), ObjectId(""), ObjectId(""), ObjectId("")
]
I thought this was possible with MapReduce, but in my case the documents don't have anything in common. Right now I'm doing this in the backend using JavaScript and loops, but I think this isn't the best option. Thanks.
You could use the $group operator in the Aggregation Framework. To use the Aggregation Framework you will want to be sure you're running on MongoDB 2.2 or newer, of course.
If used with $push, you will get all the messages arrays collected together in one result document (note that each array is pushed whole, so the result is an array of arrays).
db.myCollection.aggregate({ $group: { _id: null, messages: { $push: '$messages' } } });
If used with $addToSet, you will get only the distinct arrays (see the note after this example for getting distinct individual values).
db.myCollection.aggregate({ $group: { _id: null, messages: { $addToSet: '$messages' } } });
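Since messages is itself an array, the two pipelines above collect whole arrays rather than individual ids. If you want one flat list of distinct ObjectIds, a sketch of one way is to $unwind first:
db.myCollection.aggregate([
  { $unwind: '$messages' }, // one document per message ObjectId
  { $group: { _id: null, messages: { $addToSet: '$messages' } } } // flat, distinct list
]);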
And if you want to filter down the candidate documents first, you can use $match.
db.myCollection.aggregate([
  { $match: { topic: { $in: ['a', 'b'] } } },
  { $group: { _id: null, matches: { $sum: 1 }, messages: { $push: '$messages' } } }
]);
One option is to use the aggregation framework.
However, if you're planning on having a large number of results (beyond just a "lightweight" result), a result document exceeding 16MB in size, or excessive system memory use, you'll need to loop through the objects in the collection and concatenate the results manually (as you suggest you might be doing now), or risk MongoDB throwing an exception.
Aggregation limits may be found at the bottom of this page:
http://docs.mongodb.org/manual/applications/aggregation/
Given the limitations, you may want to just use find with a projection to return just messages.
(And with anything like this, I'd strongly recommend you do some performance benchmarks to compare the options with your data on your servers, as the "Internet" would suggest right now that some people have found the aggregation support to be slower than other techniques.)
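For instance, a rough shell-side timing sketch comparing the two approaches (collection name as used above; this is crude wall-clock timing, not a rigorous benchmark):
var t0 = Date.now();
db.myCollection.aggregate([{ $group: { _id: null, messages: { $push: '$messages' } } }]);
print('aggregate: ' + (Date.now() - t0) + ' ms');

var t1 = Date.now();
var merged = [];
db.myCollection.find({}, { messages: 1, _id: 0 }).forEach(function (doc) {
  merged = merged.concat(doc.messages); // manual concatenation path
});
print('manual concat: ' + (Date.now() - t1) + ' ms');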