Implementing a MongoDB aggregation $sum with Spring Data MongoDB

The HOUR_COUNTS collection contains documents of the form {docId, hour, count}.
It is easy to get the sum of count per docId with the following MongoDB query:
db.HOUR_COUNTS.aggregate(
[
{
$match: { hour: { $gte: 10 } }
},
{
$group: { _id: "$docId", total: { $sum: "$count" } }
},
{
$sort: { total: -1, _id: -1 }
},
{
$limit: 20
}
]
)
Then I can get the following result:
{ "_id" : 6831, "total" : 6 }
{ "_id" : 6830, "total" : 6 }
{ "_id" : 6849, "total" : 4 }
{ "_id" : 6848, "total" : 4 }
{ "_id" : 6847, "total" : 3 }
Now I want to do the same thing using Spring Data.
I tried the following, but it does not work:
Aggregation agg = newAggregation(
match(where("hour").gte(0)),
project("docId"),
group("docId").sum("count").as("total"),
project("total").and("docId").previousOperation(),
sort(Sort.Direction.DESC, "total", "docId"),
limit(20)
);
The error is:
java.lang.IllegalArgumentException: Invalid reference 'count'!
Therefore, I would like to know how to make the query work on Spring Data. Thank you.

Why would this be expected to work? That is the question you should really be asking yourself.
In the aggregation pipeline, stages such as $project and $group only ever "return" the fields that you explicitly ask them to. Because it is a "pipeline", each stage sees only the "output" of the previous stage, and the same holds for every later stage until the documents are modified again.
So what you wrote in your Java code is not the same as what you ran in the shell. You refer to a field ("count") that you excluded with the prior $project operation. So don't do that. You seem to have a mistaken idea of how the aggregation pipeline works. Drop the projections:
Aggregation agg = newAggregation(
match(Criteria.where("hour").gte(10)),
group("docId").sum("count").as("total"),
sort(Sort.Direction.DESC, "total","docId"),
limit(20)
);
This is the actual equivalent of the pipeline you ran in the shell. You don't need the additional "project" operations, and they are detrimental to your intended result.
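To make the "only the previous stage's output is visible" point concrete, here is a minimal sketch in plain JavaScript with no MongoDB driver involved. The sample documents and the helper names (matchHourGte, projectFields, groupSum, runPipeline) are made up for illustration. Throwing on a missing field is a modeling choice that mirrors Spring Data's client-side "Invalid reference" check; the server itself would sum a missing field as 0.

```javascript
// Plain-JavaScript model of pipeline stages: each stage receives only
// the documents emitted by the stage before it.

// Hypothetical sample data shaped like HOUR_COUNTS documents.
const hourCounts = [
  { docId: 6831, hour: 10, count: 4 },
  { docId: 6831, hour: 11, count: 2 },
  { docId: 6849, hour: 12, count: 4 },
  { docId: 6849, hour: 3, count: 9 }, // dropped by the hour >= 10 match
];

// $match-like stage: keep documents with hour >= min.
const matchHourGte = (min) => (docs) => docs.filter((d) => d.hour >= min);

// $project-like stage: keep only the listed fields.
const projectFields = (...fields) => (docs) =>
  docs.map((d) =>
    Object.fromEntries(fields.filter((f) => f in d).map((f) => [f, d[f]]))
  );

// $group + $sum-like stage. It fails loudly when the summed field is
// absent, which imitates Spring Data's reference check (an assumption
// of this sketch, not server behavior).
const groupSum = (key, field) => (docs) => {
  const totals = new Map();
  for (const d of docs) {
    if (!(field in d)) throw new Error(`Invalid reference '${field}'!`);
    totals.set(d[key], (totals.get(d[key]) || 0) + d[field]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

// Feed the output of each stage into the next one.
const runPipeline = (docs, ...stages) =>
  stages.reduce((acc, stage) => stage(acc), docs);

// Without the projection, "count" is still there when grouping.
const working = runPipeline(
  hourCounts,
  matchHourGte(10),
  groupSum("docId", "count")
);

// With project("docId") in between, "count" is gone before grouping.
let failure = null;
try {
  runPipeline(
    hourCounts,
    matchHourGte(10),
    projectFields("docId"),
    groupSum("docId", "count")
  );
} catch (e) {
  failure = e.message;
}

console.log(working, failure);
```

The second run fails for the same reason the original Java pipeline does: the projection removed the field that the grouping stage needs.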

Related

How to select only a field of an embedded document in MongoDB?

I have a collection that contains a document like the following:
I just want to get the id of the quiz.
My expected result is:
{"id":1}
How to do that?
This solution works with MongoDB version 4.4 or higher:
Input document:
{ "_id" : 1, "quiz" : { "id" : 1, "time_limit" : 10 } }
The query uses the new functionality in projection:
db.names.find( { }, { "id" : "$quiz.id", "_id": 0 } )
{ "id" : 1 } // desired output
For more information see Projection.
On versions before 4.4, you'll have to use an aggregation $project stage, as the query language there does not allow restructuring of data.
db.collection.aggregate([
{
$project: {
_id: 0,
id: "$quiz.id"
}
}
])

Double aggregation with distinct count in MongoDB

We have a collection which stores log documents.
Is it possible to have multiple aggregations on different attributes?
A document looks like this in its simplest form:
{
_id : int,
agent : string,
username: string,
date : string,
type : int,
subType: int
}
With the following query I can easily count all documents and group them by subtype for a specific type during a specific time period:
db.logs.aggregate([
{
$match: {
$and : [
{"date" : { $gte : new ISODate("2020-11-27T00:00:00.000Z")}}
,{"date" : { $lte : new ISODate("2020-11-27T23:59:59.000Z")}}
,{"type" : 906}
]
}
},
{
$group: {
"_id" : '$subType',
count: { "$sum": 1 }
}
}
])
My output so far is perfect:
{
_id: 4,
count: 5
}
However, I also want to add a second counter that holds a distinct count as a third attribute.
Say I want to extend the result set above with a distinct count of usernames. The result set would then contain the subType as _id, a count of the total number of documents, and a second counter for the number of distinct usernames that have entries. In my case, that is the number of people who have created documents.
A "pseudo resultset" would look like:
{
_id: 4,
countOfDocumentsOfSubstype4: 5,
distinctCountOfUsernamesInDocumentsWithSubtype4: ?
}
Does this make any sense?
Please help me improve the question as well, since it's difficult to google it when you're not a MongoDB expert.
You can first group at the finest level, then perform a second grouping to achieve what you need:
db.logs.aggregate([
{
$match: {
$and : [
{"date" : { $gte : new ISODate("2020-11-27T00:00:00.000Z")}}
,{"date" : { $lte : new ISODate("2020-11-27T23:59:59.000Z")}}
,{"type" : 906}
]
}
},
{
$group: {
"_id" : {
subType : "$subType",
username : "$username"
},
count: { "$sum": 1 }
}
},
{
$group: {
"_id" : "$_id.subType",
"countOfDocumentsOfSubstype4" : {$sum : "$count"},
"distinctCountOfUsernamesInDocumentsWithSubtype4" : {$sum : 1}
}
}
])
Here are the test cases I used:
And here is the aggregate result:
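For illustration only, the two-step grouping can be reproduced in plain JavaScript. The log entries below are made up (they are not the answerer's test cases), the $match on date and type is assumed to have already been applied, and the output field names are shortened for readability.

```javascript
// Hypothetical log entries, already filtered by type and date.
const logs = [
  { username: "alice", subType: 4 },
  { username: "alice", subType: 4 },
  { username: "bob", subType: 4 },
  { username: "bob", subType: 4 },
  { username: "carol", subType: 4 },
  { username: "alice", subType: 7 },
];

// First $group: one bucket per (subType, username) pair, counting
// how many documents fall into each pair.
const perUser = new Map();
for (const { subType, username } of logs) {
  const key = JSON.stringify([subType, username]);
  const entry = perUser.get(key) || { subType, username, count: 0 };
  entry.count += 1;
  perUser.set(key, entry);
}

// Second $group: one bucket per subType. Summing the pair counts gives
// the document total; adding 1 per pair gives the distinct usernames.
const perSubType = new Map();
for (const { subType, count } of perUser.values()) {
  const entry =
    perSubType.get(subType) ||
    { _id: subType, documents: 0, distinctUsernames: 0 };
  entry.documents += count;
  entry.distinctUsernames += 1;
  perSubType.set(subType, entry);
}

const distinctResult = [...perSubType.values()];
console.log(distinctResult);
```

Each (subType, username) pair appears exactly once after the first grouping, which is why counting pairs in the second grouping yields the distinct username count.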

$facet of mongodb returning full sorted documents instead of count based on match

I have documents like the following:
{
_id:1234,
userId:"90oi",
tag:"self"
},
{
_id:5678,
userId:"65yd",
tag:"other"
},
{
_id:9012,
userId:"78hy",
tag:"something"
},
{
_id:3456,
userId:"60oy",
tag:"self"
}
I need a response like the following:
[{
tag : "self",
count : 2
},
{
tag : "something",
count : 1
},
{
tag : "other",
count : 1
}
]
I was using $facet to query the documents, but it returns entire documents instead of the counts. My query is as follows:
db.data.aggregate({
$facet: {
categorizedByGrade : [
{ $match: {userId:ObjectId(userId)}},
{$sortByCount: "$tag"}
]
}
})
Let me know what I am doing wrong. Thanks in advance for the help.
You don't need $facet for this one. $facet is for when you really need to run multiple aggregation pipelines in a single aggregation query (see the MongoDB $facet documentation). Please try this:
db.yourCollectionName.aggregate([
{ $project: { tag: 1, _id: 0 } },
{ $group: { _id: "$tag", count: { $sum: 1 } } },
{ $project: { tag: "$_id", _id: 0, count: 1 } }
])
Explanation:
The first $project keeps only the fields we need in each document, so there is less data to process. $group then iterates over all documents and groups them by the specified field, while $sum: 1 counts how many documents fall into each group. Finally, the second $project reshapes the result into the format we need.
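As a rough sketch of what the group-and-count step computes, the same tally can be written in plain JavaScript over the four sample documents from the question. The final sort is an extra step added here for deterministic output (count descending, then tag name); the pipeline above does not sort.

```javascript
// The sample documents from the question, with userId values quoted.
const taggedDocs = [
  { _id: 1234, userId: "90oi", tag: "self" },
  { _id: 5678, userId: "65yd", tag: "other" },
  { _id: 9012, userId: "78hy", tag: "something" },
  { _id: 3456, userId: "60oy", tag: "self" },
];

// $group with { $sum: 1 }: add one to the bucket of each tag.
const counts = new Map();
for (const { tag } of taggedDocs) {
  counts.set(tag, (counts.get(tag) || 0) + 1);
}

// Final $project: rename the group key to "tag" and keep the count.
// The sort is not part of the pipeline above; it only makes the
// output order predictable.
const tagCounts = [...counts]
  .map(([tag, count]) => ({ tag, count }))
  .sort((a, b) => b.count - a.count || a.tag.localeCompare(b.tag));

console.log(tagCounts);
```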
You can also retrieve the correct records using $facet; please have a look at the query below:
db.data.aggregate({
$facet: {
categorizedByGrade : [
{
$sortByCount:"$tag"
},
{
$project:{
_id:0,
tag:"$_id",
count:1,
}
}]
}
})

mongodb - aggregate failed with memory error

I'm trying to find duplicates in my sharded collection using the id field, which is of this pattern -
"id" : {
"idInner" : {
"k1" : "v1",
"k2" : "v2",
"k3" : "v3",
"k4" : "v4"
}
}
I used the below query, but received the "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." error, even though I used "allowDiskUse : true" in my query.
db.collection.aggregate([
{ $group: {
_id: { id: "$id" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
],
{
allowDiskUse : true
});
Is there another way to get what I want, or something else I should pass in the above query? Thanks.
Please pass allowDiskUse: true through runCommand:
db.runCommand(
{ aggregate: "collection",
pipeline: [
{ $group: {
_id: { id: "$id" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
],
allowDiskUse: true,
cursor: {} // the aggregate command requires the cursor option on MongoDB 3.6+
}
)
Let me know if this works for you.
Run a $match first in the pipeline to keep only the documents whose id.idInner.k1, say, falls within a given range, so that you get results for that range only. Since you are interested in duplicates on the id key, all copies of a duplicated document satisfy the same criterion and end up in the same range. See how far you need to narrow the range, then run the pipeline again for the next range, and so on, until you have covered all documents.
If this is something you must do frequently, automate it: declare the ranges, feed them in a loop, keep the duplicates from every run, and merge the results at the end.
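The range-by-range loop described above could look roughly like the following plain-JavaScript sketch. It works on an in-memory array rather than a real collection, and the documents, the range boundaries, and the helper name duplicatesInRange are all hypothetical.

```javascript
// Hypothetical documents; only k1 and k2 are shown inside idInner.
const shardedDocs = [
  { _id: 1, id: { idInner: { k1: "a1", k2: "x" } } },
  { _id: 2, id: { idInner: { k1: "a1", k2: "x" } } },
  { _id: 3, id: { idInner: { k1: "m7", k2: "y" } } },
  { _id: 4, id: { idInner: { k1: "t3", k2: "z" } } },
  { _id: 5, id: { idInner: { k1: "t3", k2: "z" } } },
  { _id: 6, id: { idInner: { k1: "t3", k2: "z" } } },
];

// Half-open k1 ranges [from, to). The boundaries are arbitrary and
// would need tuning against the real data ("{" sorts after "z").
const ranges = [["a", "h"], ["h", "p"], ["p", "{"]];

// One run of the pipeline restricted to a single range.
function duplicatesInRange(allDocs, [from, to]) {
  const groups = new Map();
  for (const d of allDocs) {
    const k1 = d.id.idInner.k1;
    if (k1 < from || k1 >= to) continue; // the leading $match
    const key = JSON.stringify(d.id); // the $group key (whole id)
    const g = groups.get(key) || { id: d.id, uniqueIds: [], count: 0 };
    g.uniqueIds.push(d._id); // stands in for $addToSet: "$_id"
    g.count += 1;
    groups.set(key, g);
  }
  // The trailing $match: keep only ids seen at least twice.
  return [...groups.values()].filter((g) => g.count >= 2);
}

// Loop over the ranges, then merge and sort the per-range results.
const duplicates = ranges
  .flatMap((range) => duplicatesInRange(shardedDocs, range))
  .sort((a, b) => b.count - a.count);

console.log(duplicates);
```

Because every copy of a duplicated id shares the same k1, no duplicate group is split across two ranges, so merging the per-range results is just a concatenation.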
Another quick hack would be to bypass mongos and run the aggregation directly on each shard. Doing so limits the documents per run to roughly docs/number_of_shards (assuming well-balanced shards), which may be enough to stay under the memory limit. This second approach assumes that your shard key is the id key; if it is not, the approach will not work, since copies of the same duplicated document will be scattered among the shards.

MongoDB sum() data

I am new to MongoDB and NoSQL. What is the syntax to get a sum?
In MySQL, I would do something like this:
SELECT SUM(amount) from my_table WHERE member_id = 61;
How would I convert that to MongoDB? Here is what I have tried:
db.bigdata.aggregate({
$group: {
_id: {
memberId: 61,
total: {$sum: "$amount"}
}
}
})
Using http://docs.mongodb.org/manual/tutorial/aggregation-zip-code-data-set/ for reference you want:
db.bigdata.aggregate(
{
$match: {
memberId: 61
}
},
{
$group: {
_id: "$memberId",
total : { $sum : "$amount" }
}
})
From the MongoDB docs:
The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated results.
It is better to match first and then group, so that the system performs the group operation only on the filtered records. If you group first, the system groups all records and only then selects the ones with memberId=61.
db.bigdata.aggregate(
{ $match : {memberId : 61 } },
{ $group : { _id: "$memberId" , total : { $sum : "$amount" } } }
)
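A small plain-JavaScript sketch of the same match-then-group order, using made-up records and a hypothetical helper name (sumForMember), shows that only the filtered records reach the summing step:

```javascript
// Hypothetical records shaped like the bigdata documents.
const bigdata = [
  { memberId: 61, amount: 10 },
  { memberId: 61, amount: 15 },
  { memberId: 7, amount: 99 },
];

function sumForMember(docs, memberId) {
  // $match: discard everything that is not the requested member.
  const matched = docs.filter((d) => d.memberId === memberId);

  // $group with $sum: a single group keyed by memberId. With no
  // matching documents there is no group, so the result is empty.
  if (matched.length === 0) return { grouped: 0, result: [] };
  const total = matched.reduce((sum, d) => sum + d.amount, 0);
  return { grouped: matched.length, result: [{ _id: memberId, total }] };
}

const memberSum = sumForMember(bigdata, 61);
console.log(memberSum);
```

Here `grouped` reports how many records the grouping step had to touch, which is the quantity the match-first ordering keeps small.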
db.bigdata.aggregate(
{ $match : {memberId : 61 } },
{ $group : { _id: "$memberId" , total : { $sum : "$amount" } } }
)
would work if you are summing data that is not part of an array. If you want to sum data held in an array within a document, use:
db.collectionName.aggregate(
{$unwind:"$arrayName"}, //unwinds the array element
{
$group:{_id: "$arrayName.arrayField", //id which you want to see in the result
total: { $sum: "$arrayName.value"}} //the field of array over which you want to sum
})
and you will get a result like this:
{
"result" : [
{
"_id" : "someFieldvalue",
"total" : someValue
},
{
"_id" : "someOtherFieldvalue",
"total" : someValue
}
],
"ok" : 1
}
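To illustrate what $unwind followed by $group does, here is a minimal plain-JavaScript sketch. The documents and the field names (items, sku, value) are invented stand-ins for arrayName, arrayField, and value.

```javascript
// Hypothetical documents, each holding an array of sub-documents.
const orders = [
  { _id: 1, items: [{ sku: "pen", value: 2 }, { sku: "ink", value: 5 }] },
  { _id: 2, items: [{ sku: "pen", value: 3 }] },
];

// $unwind: emit one copy of the document per array element, with the
// array field replaced by that single element.
const unwound = orders.flatMap((doc) =>
  doc.items.map((item) => ({ ...doc, items: item }))
);

// $group with $sum: total the element values per element key.
const totals = new Map();
for (const d of unwound) {
  totals.set(d.items.sku, (totals.get(d.items.sku) || 0) + d.items.value);
}

const unwindResult = [...totals].map(([_id, total]) => ({ _id, total }));
console.log(unwindResult);
```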