Sorting before doing GROUP BY in MongoDB

I have a mongo database for bug tracking. It contains 2 collections:

Project

{
    "_id": 1,
    "name": "My Project"
}

Bug

{
    "_id": 1,
    "project": 1,
    "title": "we have a bug",
    "timestamp": 1400215183000
}
On the dashboard, I want to display the latest bug of each project - up to a total of 10.
So basically, when doing GROUP BY on the "project" field, I need to make sure it always selects the LATEST bug (by pre-sorting on "timestamp").
I'm not sure how to combine sorting and grouping together. Thanks.

In order to get the "latest" bug per project while limiting to 10 results:

db.collection.aggregate([
    { "$sort": { "timestamp": -1, "project": 1 } },
    { "$group": {
        "_id": "$project",
        "bug": {
            "$first": {
                "_id": "$_id",
                "title": "$title",
                "timestamp": "$timestamp"
            }
        }
    }},
    { "$limit": 10 }
])
So the sorting is done by timestamp and project (as an optimization), then you do the $group and the $limit. The grouping here just looks at the "boundaries" using $first and returns the rest of the document, which you may or may not need.
Try to actually restrict your "timestamp" range using $match first in order to optimize this.
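For example, a minimal sketch of the same pipeline with a $match stage prepended (the cutoff value below is hypothetical, purely for illustration):

db.collection.aggregate([
    // Hypothetical cutoff: only consider reasonably recent bugs
    { "$match": { "timestamp": { "$gte": 1400000000000 } } },
    { "$sort": { "timestamp": -1, "project": 1 } },
    { "$group": {
        "_id": "$project",
        "bug": {
            "$first": {
                "_id": "$_id",
                "title": "$title",
                "timestamp": "$timestamp"
            }
        }
    }},
    { "$limit": 10 }
])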

Related

mongodb group by multiple fields, return newest document with all fields

I have a table with data that looks like the following
project | environment | timestamp
----------------------------------------
project1 | dev | 1644515845
project1 | dev | 1644513211
project1 | qa | 1644515542
project2 | dev | 1644513692
project2 | qa | 1644514822
There are multiple projects and each project has multiple environments. There are multiple timestamps associated with each (project, environment) pair which correspond to the last time changes were made to the project.
Is there a query to group by (project, environment), get the entry with the newest timestamp for each combination of (project, environment), and then return the entire document?
Something like
db.collection.aggregate([
    {
        "$group": {
            "_id": {
                "env": "$env",
                "project": "$project"
            },
            "timestamp": {
                "$max": "$timestamp"
            }
        }
    }
])
However, it should return the entire document.
My attempts can be found here and here.
The first attempt does not return the entire document. The second attempt returns the document with the wrong timestamp.
{
    "_id": {
        "env": "dev",
        "project": "project1"
    },
    "doc": {
        "_id": 1,
        "env": "dev",
        "project": "project1",
        "timestamp": 1.644515845e+09
    },
    "timestamp": 1.644519211e+09
},
Possible working solution here, although I'm wondering if there is a better way to do it.
{
    "_id": 1,
    "project": "project1",
    "env": "dev",
    "timestamp": 1644515845
}
Is there a query to group by (project, environment), get the entry with the newest timestamp for each combination of (project, environment), and then return the entire document?
Here is the aggregation query to get the desired result. The query runs in the newer mongosh or mongo shell client.
// Define the aggregation pipeline with various stages.
var pipeline = [
    // Sorting by project+env+timestamp makes the last document of each
    // group (project+env) the one with the latest (highest) timestamp.
    {
        $sort: {
            project: 1,
            env: 1,
            timestamp: 1
        }
    },
    // Group on project+env and take the last document of the group -
    // this is the latest of the group - using the "$last" operator.
    // The aggregation system variable "$$ROOT" references
    // the current top-level document (with all fields).
    {
        $group: {
            _id: { project: "$project", env: "$env" },
            // newest_timestamp: { "$last": "$timestamp" },
            newest_document: { "$last": "$$ROOT" }
        }
    },
    // Make the "newest_document" the root (top-level) document.
    {
        $replaceWith: "$newest_document"
    },
    // Optionally, sort the documents by project+env.
    {
        $sort: {
            project: 1,
            env: 1
        }
    }
]

// Run the query using the pipeline.
db.collection.aggregate(pipeline)
There are a few corrections:
First, you need to sort by timestamp in descending order.
Second, use the $first operator in the $group stage to select the latest document.
db.collection.aggregate([
    { "$sort": { "timestamp": -1 } },
    {
        "$group": {
            "_id": {
                "env": "$env",
                "project": "$project"
            },
            "doc": { "$first": "$$ROOT" }
        }
    }
])
Playground
As an optional stage, use $project or $unset to remove the _id field from the result, since it is not needed:
{ "$project": { "_id": 0 } }
Playground
or
{ "$unset": "_id" }
Playground
For better performance you can create an index on the timestamp field, in descending order!
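For example, a one-line sketch (assuming the same generic collection name used above):

db.collection.createIndex({ "timestamp": -1 })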
Query
We can also use $max directly on the document, as below, without the need to sort or create an index (documents can also be compared based on the order of their fields).
Test code here
aggregate([
    { "$group": {
        "_id": { "env": "$env", "project": "$project" },
        "latestDoc": { "$max": { "timestamp": "$timestamp", "doc": "$$ROOT" } }
    }},
    { "$set": { "latestDoc": "$latestDoc.doc" } }
])
The fast way to do this with an index is described here, but in that example the "_id" has only 1 field and the accumulator has the other field of the compound index (here you have 2 fields and the accumulator holds $$ROOT), so the group does not use the index.
I tried all the answers, and all were slow: 9-10 seconds on 1 million documents. Test them all yourself to be sure; if you find a way to use the index in the group, send some feedback if you can.
MongoDB 5.3 will also have $top for getting the top document, or $topN for the top N.
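As a sketch of what that might look like, based on the documented $top accumulator syntax (an assumption on my part, not tested here, and it requires a server version that supports the operator):

aggregate([
    { "$group": {
        "_id": { "env": "$env", "project": "$project" },
        // $top sorts within the group itself, so no separate $sort stage is needed
        "latestDoc": {
            "$top": { "sortBy": { "timestamp": -1 }, "output": "$$ROOT" }
        }
    }}
])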

Mongodb get document that has max value for each subdocument

I have some data looking like this:
{
    'Type': 'A',
    'Attributes': [
        { 'Date': '2021-10-02', 'Value': 5 },
        { 'Date': '2021-09-30', 'Value': 1 },
        { 'Date': '2021-09-25', 'Value': 13 }
    ]
},
{
    'Type': 'B',
    'Attributes': [
        { 'Date': '2021-10-01', 'Value': 36 },
        { 'Date': '2021-09-15', 'Value': 14 },
        { 'Date': '2021-09-10', 'Value': 18 }
    ]
}
I would like to query, for each document, the attribute with the newest date. With the data above the desired result would be:
{'Type':'A', 'Date':'2021-10-02', 'Value':5}
{'Type':'B', 'Date':'2021-10-01', 'Value':36}
I managed to find some queries that find only the global max over all sub-documents, but not the max for each document.
Thanks a lot for your help.
Storing dates as strings is generally considered bad practice. I suggest you change your date field to the date type. Fortunately for your case, you are using the ISO date format, so some effort can be saved.
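If you do convert, here is a minimal sketch using a pipeline-style update (assumes MongoDB 4.2+ and the generic collection name db.collection; $toDate parses the ISO strings shown in the question):

db.collection.updateMany({}, [
    { "$set": {
        "Attributes": {
            "$map": {
                "input": "$Attributes",
                // convert each Date string to a real date, keep Value as-is
                "in": {
                    "Date": { "$toDate": "$$this.Date" },
                    "Value": "$$this.Value"
                }
            }
        }
    }}
])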
You can do this in an aggregation pipeline:
use $max to find out the max date
use $filter to filter the Attributes array to contain only the latest element
$unwind the array
$project to your expected output
Here is the Mongo playground for your reference.
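In case the playground is unavailable, here is a minimal sketch of those four stages (the exact stage shapes are my reconstruction from the list above, not a copy of the playground):

db.collection.aggregate([
    // 1. find the max date in the Attributes array
    { "$set": { "maxDate": { "$max": "$Attributes.Date" } } },
    // 2. keep only the element(s) carrying that max date
    { "$set": {
        "Attributes": {
            "$filter": {
                "input": "$Attributes",
                "cond": { "$eq": ["$$this.Date", "$maxDate"] }
            }
        }
    }},
    // 3. flatten the now single-element array
    { "$unwind": "$Attributes" },
    // 4. shape the expected output
    { "$project": {
        "_id": 0,
        "Type": 1,
        "Date": "$Attributes.Date",
        "Value": "$Attributes.Value"
    }}
])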
This keeps only 1 member from Attributes: the one with the max date.
If you want to keep multiple, use the #ray solution, which keeps all members that have the max date.
*mongoplayground can lose the order of fields in a document; if you see a wrong result, test it on your driver. It's a bug of the mongoplayground tool.
Query1 (local-way)
Test code here
aggregate([
    {
        "$project": {
            "maxDateValue": {
                "$max": {
                    "$map": {
                        "input": "$Attributes",
                        "in": { "Date": "$$this.Date", "Value": "$$this.Value" }
                    }
                }
            },
            "Type": 1
        }
    },
    {
        "$project": {
            "Date": "$maxDateValue.Date",
            "Value": "$maxDateValue.Value"
        }
    }
])
Query2 (unwind-way)
Test code here
aggregate([
    {
        "$unwind": { "path": "$Attributes" }
    },
    {
        "$group": {
            "_id": "$Type",
            "maxDate": {
                "$max": {
                    "Date": "$Attributes.Date",
                    "Value": "$Attributes.Value"
                }
            }
        }
    },
    {
        "$project": {
            "_id": 0,
            "Type": "$_id",
            "Date": "$maxDate.Date",
            "Value": "$maxDate.Value"
        }
    }
])

sum key and group by key with Mongodb and Laravel

Having this collection -
{
    "_id": "5b508587de796c0006207fa7",
    "id": "1",
    "status": "pending",
    "updated_at": "2018-07-19 13:02:40",
    "created_at": "2018-07-19 12:35:19"
},
{
    "_id": "5b508587de796c0006207fa5",
    "id": "2",
    "status": "completed",
    "updated_at": "2018-07-19 13:02:40",
    "created_at": "2018-07-19 12:35:19"
}
I want to have a query that will sum the status key by the id key.
For example -
{
    "id": "1",
    "pending": "1"
}
I am using Laravel 5.5 with MongoDB
Here is a working MongoPlayground. Check out Mongo's reference for Aggregations, as well as the $group operator.
db.collection.aggregate([
    {
        $group: {
            _id: "$status",
            sumOfStatus: {
                $sum: 1
            }
        }
    }
])
EDIT: After proofreading, I'm not really sure that was what you were looking for. This example will return you a list of statuses for each id, such as:
[
    {
        "_id": "5",
        "completed": 3,
        "pending": 3
    }
]
To do so, I'm leveraging the $cond operator in order to conditionally $sum documents depending on their status value. One drawback is that you have to repeat this for each value. Not sure of a way around that.
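In case the playground does not load, here is a minimal sketch of that approach (the two status values are taken from the question; any other value would need its own conditional line):

db.collection.aggregate([
    { "$group": {
        "_id": "$id",
        // count a document only when its status matches
        "pending": {
            "$sum": { "$cond": [{ "$eq": ["$status", "pending"] }, 1, 0] }
        },
        "completed": {
            "$sum": { "$cond": [{ "$eq": ["$status", "completed"] }, 1, 0] }
        }
    }}
])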
Regarding the Laravel implementation, I'm definitely not a Laravel expert, but check out this answer, which shows an example of how to access the aggregate() method.

MongoDB Sum Array With Objects

Say I have an aggregation that returns the following:
[
    { driverId: "21312asd12", cars: 2, totalMiles: 30000, family: 4 },
    { driverId: "55512a23a2", cars: 3, totalMiles: 55000, family: 2 },
    ...
]
How would I go about running a summation of each data set on a groupId basis to return the following? Do I use an $unwind? Do another grouping?
For example I would like to return:
{
    totalDrivers: 2,
    totalCars: 5,
    totalMiles: 85000,
    totalFamily: 6
}
You seem to just be referring to the documents in the output as an "array", therefore just add another $group to the end of your pipeline:
{ "$group": {
"_id": null,
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$cars" },
"totalMiles": { "$sum": "$totalMiles" },
"totalFamily": { "$sum": "$family" }
}}
Where null is essentially just a blank grouping key that is not a field present in the document to group on. The result should be a single document (albeit in an array, depending on the API method call used or server version).
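As a concrete usage sketch (the collection name driverStats is hypothetical, standing in for wherever documents shaped like the sample output live):

db.driverStats.aggregate([
    { "$group": {
        "_id": null,
        "totalDrivers": { "$sum": 1 },
        "totalCars": { "$sum": "$cars" },
        "totalMiles": { "$sum": "$totalMiles" },
        "totalFamily": { "$sum": "$family" }
    }}
])
// => { "_id": null, "totalDrivers": 2, "totalCars": 5, "totalMiles": 85000, "totalFamily": 6 }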
Or if you actually mean that each document has a field with an array like this, then $unwind and process the group either per document or with a null as above:
{ "$unwind": "$someArray" },
{ "$group": {
"_id": "$_id",
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$someArray.cars" },
"totalMiles": { "$sum": "$someArray.totalMiles" },
"totalFamily": { "$sum": "$someArray.family" }
}}
At any rate, you should really post the code you are using when asking questions like this. It is very likely that your pipeline is not as efficient at reaching your end goal as you think, and posting it both gives a clear picture of what you are doing and leaves it open for suggested improvements.

Sub-query in MongoDB

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:
{_id: ObjectId("xxxxx"), country: "UK",...}
and actions like
{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}
I am trying to count events and distinct users, split by country. The first half of this is working fine; however, when I try to add in a sub-query to pull the country, I only get nulls out for country:
db.events.aggregate({
    $match: {
        createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
        user: { $exists: true }
    }
},
{
    $group: {
        _id: {
            year: { $year: "$createdAt" },
            user_obj: "$user"
        },
        count: { $sum: 1 }
    }
},
{
    $group: {
        _id: {
            year: "$_id.year",
            country: db.users.findOne({
                _id: { $eq: "$_id.user_obj" },
                country: { $exists: true }
            }).country
        },
        total: { $sum: "$count" },
        distinct: { $sum: 1 }
    }
})
No Joins in here, just us bears
So MongoDB "does not do joins". You might have tried something like this in the shell for example:
db.events.find().forEach(function(event) {
event.user = db.user.findOne({ "_id": eventUser });
printjson(event)
})
But this does not do what you seem to think it does. It actually does exactly what it looks like: it runs a query on the "users" collection for every item that is returned from the "events" collection, both "to and from" the client, and is not run on the server.
For the same reasons, your "embedded" statement within an aggregation pipeline does not work like that. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So if you did something like this to select "UK" users:
db.events.aggregate([
    { "$match": {
        "user": {
            "$in": db.users.distinct("_id", { "country": "UK" })
        }
    }}
])
Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. So the .distinct() runs first, returns its array as an argument, and then the whole pipeline is sent to the server. That is the order of execution.
Correcting
You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:
Embed your whole user object data within the event data.
At least embed "some" of the user object data within the event data. In this case "country", because you are going to use it.
So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:
{
    "_id": ObjectId("yyyyy"),
    "createdAt": ISODate(),
    "user": {
        "_id": ObjectId("xxxxx"),
        "country": "UK"
    }
}
Then the "aggregation" process becomes simple:
db.events.aggregate([
    { "$match": {
        "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
        "user": { "$exists": true }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$createdAt" },
            "user_id": "$user._id",
            "country": "$user.country"
        },
        "count": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.country",
        "total": { "$sum": "$count" },
        "distinct": { "$sum": 1 }
    }}
])
We're not normal
Fixing your data to include the information it needs on a single collection, where we "do not do joins", is a relatively simple process. It's really just a variant on the original query sample above:
var bulk = db.events.initializeUnorderedBulkOp(),
    count = 0;

db.users.find().forEach(function(user) {
    // update multiple events for user
    bulk.find({ "user": user._id }).update({
        "$set": { "user": { "_id": user._id, "country": user.country } }
    });
    count++;
    // Send batch every 1000
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.events.initializeUnorderedBulkOp();
    }
});

// Clear any queued
if ( count % 1000 != 0 )
    bulk.execute();
So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.
If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just an "_id" and a "country".
But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.