How to add a counter in a MongoDB aggregate stage?

I have a problem:
I have a set of documents which represent "completions of a task".
Each such completion has a user assigned to it, and a time the completion took.
I need to group my documents by user and then sort them by the accumulated time, and this works fine:
const chartsAggregation = [
  {
    $group: {
      _id: '$user',
      totalTime: { $sum: '$totalTime' },
    },
  },
  {
    $sort: {
      totalTime: -1,
    },
  },
  {
    $addFields: {
      placement: { $inc: 1 }, // This does not work
    },
  },
];
However, I need to "burn in" the placement after sorting, the "rank" so to speak.
The reason is that I want to display a kind of "charts page" with the people who took the most time on top. This page needs to be searchable and paginated, so people can find themselves and their placement.
As I need to apply search queries and limits (for the pagination) later, the actual positions of my users in the resulting array are of no use to me.
I want to add a field (I tried this in the $addFields stage) that associates the placement in the list with the data set, so that even if I later filter and limit the results, the original placement stays intact.
All I need for this is an incrementing counter within the $addFields stage, but I can't find a way to do this. There doesn't seem to be anything like that in the documentation.
Can you help me?
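One possible direction (a minimal sketch, assuming MongoDB 5.0 or newer, which adds $setWindowFields; the collection name completions is hypothetical) is to let $setWindowFields assign the placement via the $documentNumber window operator instead of the $addFields stage:
db.completions.aggregate([
  { $group: { _id: '$user', totalTime: { $sum: '$totalTime' } } },
  { $sort: { totalTime: -1 } },
  {
    $setWindowFields: {
      sortBy: { totalTime: -1 },
      output: { placement: { $documentNumber: {} } }, // 1, 2, 3, ... burned in before any later $match/$skip/$limit
    },
  },
]);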

Related

MongoDB Aggregation to get events in timespan, plus the previous event

I have timeseries data as events coming in at random times. They are not ongoing metrics, but rather events. "This device went online." "This device went offline."
I need to report on the number of actual transitions within a time range. Because there are occasionally same-state events, for example two "went online" events in a row, I need to "seed" the data with the state previous to the time range. If I have events in my time range, I need to compare them to the state before the time range in order to determine if something actually changed.
I already have aggregation stages that remove same-state events.
Is there a way to add "the latest, previous event" to the data in the pipeline without writing two queries? A $facet stage totally ruins performance.
For "previous", I'm currently trying something like this in a separate query, but it's very slow on the millions of records:
// Get the latest event before a given date
db.devicemetrics.aggregate([
{
$match: {
'device.someMetadata': '70b28808-da2b-4623-ad83-6cba3b20b774',
time: {
$lt: ISODate('2023-01-18T07:00:00.000Z'),
},
someValue: { $ne: null },
},
},
{
$group: {
_id: '$device._id',
lastEvent: { $last: '$$ROOT' },
},
},
{
$replaceRoot: { newRoot: '$lastEvent' },
}
]);
You are looking for something akin to the LAG window function in SQL. Mongo has $setWindowFields for this, combined with the $shift window operator.
Not sure about the fields in your collection, but this should give you an idea:
{
  $setWindowFields: {
    partitionBy: "$device._id",  // 1. partition the data based on $device._id
    sortBy: { time: 1 },         // 2. within each partition, sort based on $time
    output: {
      "shiftedEvent": {          // 3. add a new field shiftedEvent to each document
        $shift: {
          output: "$event",      // 4. whose value is the previous $event
          by: -1
        }
      }
    }
  }
}
Then, you can compare the event and shiftedEvent fields.
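For example, a follow-up stage along these lines (a sketch; it assumes the event/shiftedEvent field names used above) would keep only the documents where the state actually changed:
{
  $match: {
    $expr: { $ne: ["$event", "$shiftedEvent"] } // drop same-state events, keep real transitions
  }
}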

Mongodb selecting every nth of a given sorted aggregation

I want to be able to retrieve every nth item of a given collection which is quite large (millions of records)
Here is a sample of my collection
{
  _id: ObjectId("614965487d5d1c55794ad324"),
  hour: ISODate("2021-09-21T17:21:03.259Z"),
  searches: [
    ObjectId("614965487d5d1c55794ce670")
  ]
}
My start of aggregation is like so
[
  {
    $match: {
      searches: {
        $in: [ObjectId('614965487d5d1c55794ce670')],
      },
    },
  },
  { $sort: { hour: -1 } },
  { $project: { hour: 1 } },
  ...
]
I have tried many things, including:
$sample, which does not pick items in the right order
Using $skip, which becomes very slow as the number given to skip grows
Using _id instead of $skip, but my ids are unfortunately not created in an ordered manner
My goal is thus to retrieve the hour of every 20,000th record, so that I can then make calls to retrieve data in chunks of approximately 20,000 records.
I imagine it would be possible to sort and number every record, then keep only the first, the 20,000th, the 40,000th, ..., and the last.
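A minimal sketch of that idea, assuming MongoDB 5.0 or newer for $setWindowFields (the collection name here is hypothetical): number the sorted documents, then keep every 20,000th one.
db.mycollection.aggregate([
  { $match: { searches: { $in: [ObjectId('614965487d5d1c55794ce670')] } } },
  { $sort: { hour: -1 } },
  { $project: { hour: 1 } },
  {
    $setWindowFields: {
      sortBy: { hour: -1 },
      output: { rowNumber: { $documentNumber: {} } }
    }
  },
  // keeps rows 1, 20001, 40001, ...; the last row would still need to be fetched separately
  { $match: { $expr: { $eq: [{ $mod: ['$rowNumber', 20000] }, 1] } } }
]);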
Thanks for your help and let me know if you need more information

MongoDb aggregate with limit and without limit

There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        {
          "isOfficial": true
        },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ],
    }
  },
  {
    $sort: { timeline: -1 }
  }
])
This request never finishes.
But if you add a $limit before the sort, with the limit higher than the total number of records, for example 1,000,000,000,000,000, the request is processed instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        {
          "isOfficial": true
        },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ],
    }
  },
  {
    $limit: 10000000000000000
  },
  {
    $sort: { timeline: -1 }
  }
])
Please tell me why this is happening?
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically, for every query you run Mongo simulates a quick "competition" between the relevant indexes in order to choose which one to use; the first index to retrieve 1001 documents "wins".
Usually this situation of picking the wrong index occurs when a sorted ascending or descending field has a matching index, which lets that index win the fetching competition under certain conditions. This is very risky, as stable code can suddenly become a huge bottleneck.
What can we do?
You have a few options:
Use the hint option and make Mongo use the compound index you have ready for this pipeline (a sketch follows after this list).
Drop the rogue index to ensure this will never happen again elsewhere (which is my recommended option).
Keep doing what you're doing. Basically, by adding this random $limit stage you're throwing Mongo's competition off and ensuring the right index is picked.
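A sketch of option 1, assuming a compound index such as { creator: 1, timeline: -1 } already exists (the exact index spec is an assumption, not something stated in the question):
db.getCollection('feedposts').aggregate(
  [
    {
      "$match": {
        "$or": [
          { "isOfficial": true },
          {
            "creator": ObjectId("537f267c984539401ff448d2"),
            type: { $nin: ['challenge_answer', 'challenge_win'] }
          }
        ]
      }
    },
    { $sort: { timeline: -1 } }
  ],
  { hint: { creator: 1, timeline: -1 } } // force the intended index instead of letting the "competition" decide
);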

Best way of doing this in MongoDB?

db.post
{user:'50001', content:...,},
{user:'50002', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
{user:'50002', content:...,},
db.catagory
{user:'50001', type:'public',...}
{user:'50002', type:'vip',...}
{user:'50003', type:'public',...}
What I want is to pick up the posts of users whose type is 'public'.
I can do it like this:
publicUsers = db.catagory.distinct('user', {type: 'public'})
db.post.find({user: {$in: publicUsers}}).sort({_id: -1})
or use $lookup in an aggregation.
The output is:
{user:'50001', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
Is there some other, better way to do this faster?
Considering the situation of a large volume of requests, should I create a new collection like post_public for these queries?
The aggregation pipeline is your best bet. It's quite performant, as it's intended for just these types of situations.
Another Edit
Ok, this is the best solution I have, but it has some limitations set by Mongo.
Firstly, a sort operation on non-indexed fields cannot use more than 100MB of RAM. That's quite a bit of space for text-based posts, but something to be aware of (there's a workaround that uses disk space if needed, but it will slow down the operation).
Second, and more relevant in this case, Mongo sets a hard cap of 16MB on the size of a single document. Given that you wanted the posts in a single array, a large dataset can potentially exceed that if no limits on the output are imposed. If you decide you don't want all posts in a single array, remove the last three stages of the pipeline ($group, $addFields, and $project) and the result will be the sorted posts, each within its own document. You will then need to limit the number of documents manually to get the n posts you want if you go that route (a sketch of this trimmed variant follows after the pipeline below). IMO, you should do this. Those last three stages add additional processing burden but offer very little practical benefit in the output. The only difference is iterating over a nested array vs. iterating over the root array.
db.categories.aggregate([
  {$project: {
    user: 1,
    type: 1
  }},
  {$match: {
    type: "public"
  }},
  {$lookup: {
    from: 'posts',
    localField: 'user',
    foreignField: 'user',
    as: 'userPosts'
  }},
  {$unwind: {
    path: "$userPosts",
    preserveNullAndEmptyArrays: false
  }},
  {$sort: {
    "userPosts.posted": -1,
  }},
  {$group: {
    _id: "public",
    posts: {
      $push: "$userPosts"
    }
  }},
  {$addFields: {
    "latestPosts": {
      $slice: ["$posts", 5]
    }
  }},
  {$project: {
    posts: 0
  }}
]);
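For reference, a sketch of the trimmed variant described above: keep the pipeline as-is up to the $sort, drop the last three stages, and limit the unwound, sorted posts directly (the limit of 5 is just an example value).
db.categories.aggregate([
  {$project: {
    user: 1,
    type: 1
  }},
  {$match: {
    type: "public"
  }},
  {$lookup: {
    from: 'posts',
    localField: 'user',
    foreignField: 'user',
    as: 'userPosts'
  }},
  {$unwind: {
    path: "$userPosts",
    preserveNullAndEmptyArrays: false
  }},
  {$sort: {
    "userPosts.posted": -1,
  }},
  {$limit: 5} // manual limit instead of $group/$addFields/$project
]);
Each resulting document then carries one post under userPosts, already in sorted order.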
You'll access the resulting posts like so:
.then(result => {
let latestPosts = result[0].latestPosts;
});
Edit
I built a mock database and came up with this solution:
db.categories.aggregate([{
  $project: {
    _id: 0,
    user: 1,
    type: 1,
  }
}, {
  $match: {
    type: "public"
  }
}, {
  $group: {
    _id: {
      $concat: ["$user", "-", "$type"]
    },
    user: {
      $first: "$user"
    },
    type: {
      $first: "$type"
    }
  }
}, {
  $sort: {
    user: -1
  }
}, {
  $limit: 10
}, {
  $lookup: {
    from: 'posts',
    localField: 'user',
    foreignField: 'user',
    as: 'userPosts'
  }
}]);
The result will be an array of objects representing each unique user-type combo with the type specified in the $match filter. Each object will contain a child array with all the posts made by that user. All the users will be sorted in descending order and limited to ten.
If all you want is a single operator to get distinct values from categories:
db.categories.aggregate([{
  $project: {
    _id: 0,
    user: 1,
    type: 1
  }
}, {
  $group: {
    _id: {$concat: ["$user", "-", "$type"]},
    user: {$first: "$user"},
    type: {$first: "$type"}
  }
}]);
The goal is to first filter out the users with non-public types so they don't appear in the result, join the posts belonging to that user to a new field called "user_posts", and then sort all the users in ascending order. I don't know the entire structure of your database so my specific example may not be the best fit, but it would be a good starting point.
Edit
Some other ways to increase performance for large amounts of data:
Querying in pieces. Say, for example, you're displaying paginated user posts; then you would only need to query the posts for the page currently being viewed. You can reduce any noticeable loading times by pre-fetching the next few pages as well.
Using a "virtual" list. This will display every post in one long list, but only fetch a certain size chunk that is currently being viewed, or will soon be viewed. Similar to the first pagination example. If you're familiar with the RecyclerView in Android development, that's the idea.
Restructuring the way posts are stored in your database. It would be a much faster operation if Mongo didn't need to iterate through so much unrelated data when looking for posts of a specific user or type. The solution is to store posts nested below a common ancestor, such as their poster, or their view type. This does mean that you may have duplicate entries spread between documents. This is one of the inherent downsides of a no-SQL database. You will need to find a solution for handling synchronizing the documents spread throughout the database.
Creating a new index on document fields that you use to sort your array. Indices help Mongo organize the documents internally for more efficient searches.
Running a heavy query once, then $out-ing the result (a sketch follows after this list). In this way you can store the result of expensive operations in a more permanent fashion than a simple cache.
A variant of number 5 is to $out the result of a large sorting operation, and then continue inserting new entries at the end of the now-sorted collection.
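As an illustration of number 5, a rough sketch using the collections from the question (post, catagory) and materializing into a post_public collection, as the question itself suggests:
db.post.aggregate([
  { $match: { user: { $in: db.catagory.distinct('user', { type: 'public' }) } } },
  { $sort: { _id: -1 } },
  { $out: 'post_public' } // writes the sorted public posts into a separate collection
]);
Subsequent reads can then be cheap find() queries against post_public.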
Finally, I do it like this.
db.post.aggregate([
  {$match: {
    uid: {$in: db.catagory.distinct('id', {type: 'public'})}, // directly write here
    _id: {$lt: ObjectId("5c6d5494b9f43f020d4122a0")}          // for turning the page
  }},
  {$sort: {_id: -1}}, // sort by post _id
  {$limit: 20}        // page size
])
It's faster than $lookup or the other approaches.

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
  'id': ObjectId("50ad8d451d41c8fc58000003"),
  'name': 'Sample Log 1',
  'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1', 'TAG-2'],
    'group_y': ['XYZ']
  }
},
{
  'id': ObjectId("50ad8d451d41c8fc58000004"),
  'name': 'Sample Log 2',
  'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1'],
    'group_y': ['XYZ']
  }
}
> db.cases.findOne()
{
  'id': ObjectId("50ad8d451d41c8fc58000099"),
  'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages, but as much as possible I want to immediately limit the number of documents that will pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
  {$match: {...}}, // some match filters here
  {$project: {tag: '$tag_doc.group_x', case: '$case_id', latest: {uploaded_at: 1}}},
  {$unwind: '$tag'},
  {$group: {_id: {tag: '$tag', case: '$case'}, latest: {$max: '$latest'}}},
  {$group: {_id: '$_id.tag', total: {$sum: 1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy but this proves to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. It's not possible with $match, only with one $group stage. The trick is to use a compound (multi-field) key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id & address and I want the message with the latest date, we need to create a key like this:
{ user_id: 1, address: 1, date_sent: -1 }
Then you are able to run the aggregation without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order you can add a $sort stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
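For reference, creating that key as an index would look like this (user_messages is the collection used in the example below):
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 });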
db.user_messages.aggregate(
  { $match: { user_id: 1 } },
  { $group: {
      _id: "$address",
      count: { $sum: 1 },
      date_sent: { $max: "$date_sent" },
      message: { $first: "$message" },
  } }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally such that you only need to pick out the latest for each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
  {$sort: {created_at: -1}},
  {$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
  {$unwind: '$tags'},
  {$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query for what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
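A rough sketch of that upsert, assuming the pre-aggregated collection is called latest_author_tags and that newPost is the freshly created post (both names are hypothetical, following the document shape suggested above):
// For each tag of the newly created post, upsert the matching pre-aggregated document.
newPost.tag_doc.tags.forEach(function (tag) {
  db.latest_author_tags.updateOne(
    { author: newPost.author_id, tag: tag },
    { $set: { created_at: newPost.created_at, post_id: newPost._id } },
    { upsert: true }
  );
});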
Here you go:
db.logs.aggregate(
  {"$sort": { "uploaded_at": -1 } },
  {"$match": { ... } },
  {"$unwind": "$tag_doc.group_x" },
  {"$group": {
    "_id": { "case": '$case_id', tag: '$tag_doc.group_x' },
    "latest": { "$first": "$uploaded_at" },
    "Name": { "$first": "$Name" },
    "tag_doc": { "$first": "$tag_doc" }
  }}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "Name" and "tag_doc".
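For completeness, a minimal sketch of such an index on the logs collection from the question:
db.logs.createIndex({ uploaded_at: -1 }); // lets the initial $sort walk the index instead of sorting in memory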