Aggregation query in mongo and spring-data-mongo - mongodb

Hi everyone, I have a problem querying my data. I have documents like this:
{
    "_id" : NumberLong(999789748357864),
    "text" : "#asd #weila #asd2 welcome in my house",
    "date" : ISODate("2016-12-13T21:44:37.000Z"),
    "dateString" : "2016-12-13",
    "hashtags" : [
        "asd",
        "weila",
        "asd2"
    ]
}
and I want to build two queries:
1) count, for each day, the number of occurrences of each hashtag, and get output like this:
{_id:"2016-12-13",
hashtags:[
{hashtag:"asd",count:20},
{hashtag:"weila",count:18},
{hashtag:"asd2",count:10},
....
]
}
{_id:"2016-12-14",
hashtags:[
{hashtag:"asd",count:18},
{hashtag:"asd2",count:14},
{hashtag:"weila",count:10},
....
]
}
2) another query that is the same, but restricted to a period, e.g. from 2016-12-13 to 2016-12-17.
For the first one I wrote the query below and it returns what I'm looking for, but I don't know how to write it in Spring Data Mongo.
db.comment.aggregate([
    { "$unwind": "$hashtags" },
    { "$group": {
        "_id": {
            "date": "$dateString",
            "hashtag": "$hashtags"
        },
        "count": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.date",
        "hashtags": {
            "$push": {
                "hashtag": "$_id.hashtag",
                "count": "$count"
            }
        },
        "count": { "$sum": "$count" }
    }},
    { "$sort": { "count": -1 } },
    { "$unwind": "$hashtags" },
    { "$sort": { "count": -1, "hashtags.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "hashtags": { "$push": "$hashtags" },
        "count": { "$first": "$count" }
    }},
    { "$project": { "name": 1, "hashtags": { "$slice": ["$hashtags", 2] } } }
]);

You can still use most of the same aggregation operation, minus the pipeline steps after the second $group stage, but for the filtering aspect you'd have to introduce a date range query in an initial $match pipeline step. The following mongo shell examples show how to filter the aggregates for a particular date range:
1) Set a period from 2016-12-13 to 2016-12-14:
var startDate = new Date("2016-12-13");
startDate.setHours(0,0,0,0);
var endDate = new Date("2016-12-14");
endDate.setHours(23,59,59,999);
var pipeline = [
    {
        "$match": {
            "date": { "$gte": startDate, "$lte": endDate }
        }
    },
    { "$unwind": "$hashtags" },
    {
        "$group": {
            "_id": {
                "date": "$dateString",
                "hashtag": "$hashtags"
            },
            "count": { "$sum": 1 }
        }
    },
    {
        "$group": {
            "_id": "$_id.date",
            "hashtags": {
                "$push": {
                    "hashtag": "$_id.hashtag",
                    "count": "$count"
                }
            }
        }
    }
]
db.comment.aggregate(pipeline)
2) Set a period from 2016-12-13 to 2016-12-17:
var startDate = new Date("2016-12-13");
startDate.setHours(0,0,0,0);
var endDate = new Date("2016-12-17");
endDate.setHours(23,59,59,999);
// run the same pipeline as above but with the date range query set as required
Spring Data Equivalent (untested):
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;
import org.springframework.data.mongodb.core.query.Criteria;
import com.mongodb.BasicDBObject;

Aggregation agg = newAggregation(
    match(Criteria.where("date").gte(startDate).lte(endDate)),
    unwind("hashtags"),
    group("dateString", "hashtags").count().as("count"),
    group("_id.dateString")
        .push(new BasicDBObject("hashtag", "$_id.hashtags")
            .append("count", "$count"))
        .as("hashtags")
);
AggregationResults<Comment> results = mongoTemplate.aggregate(agg, Comment.class);
List<Comment> comments = results.getMappedResults();

Related

How can I alter this query to return the average over all hours in the db? mongodb

db.temperature.aggregate([
    { "$match": {
        "$and": [
            { "date": { "$gte": ISODate("2017-10-12T22:00:00Z") } },
            { "date": { "$lt": ISODate("2017-10-12T23:00:00Z") } }
        ]
    }},
    { "$group": {
        "_id": { "$hour": "$date" },
        "temperature": {
            "$avg": "$temperature"
        }
    }}
])
The data looks like
{
    "_id" : ObjectId("5df25dd648bfdfee3906e0cd"),
    "date" : ISODate("2017-10-12T22:00:00Z"),
    "power" : 39
}
There is a record for every minute and I am trying to get the average over every hour in the database. The query above returns the average for one specific hour.
You can simply remove the $match part of your query:
db.temperature.aggregate([
    { "$group": {
        "_id": { "$hour": "$date" },
        "temperature": {
            "$avg": "$temperature"
        }
    }}
])

summation of two columns in Aggregate Method

I am using a mongodb aggregate query. My db is like this:
{
    "_id" : ObjectId("5a81636f017e441d609283cc"),
    "userid" : "123",
    page : 'A',
    newpage : 'A'
},
{
    "_id" : ObjectId("5a81636f017e441d609283cd"),
    "userid" : "123",
    page : 'B',
    newpage : 'A'
},
{
    "_id" : ObjectId("5a81636f017e441d609283ce"),
    "userid" : "123",
    page : 'D',
    newpage : 'D'
}
I want to get the sum of the values across both the page and newpage columns. I am able to get the counts for one column, which gives a precise result, but I am stuck with the two columns. What I did to get the sum/repetition of values for one column is:
db.Collection.aggregate([
    { $match: { "userid": "123" } },
    { $unwind: "$newpage" },
    { $group: { "_id": "$newpage", "count": { "$sum": 1 } } },
    { $project: { _id: 0, pagename: "$_id", count: { $multiply: ["$count", 1] } } },
    { $sort: { count: -1 } },
    //{$limit: 10}
], function(error, data){
    if (error) {
        console.log(error);
    } else {
        console.log(data);
    }
});
Desired Result will be like:
{
"pagename": "A",
"count": 3
},
{
"pagename": "D",
"count": 2
},
{
"pagename": "B",
"count": 1
}
Does anyone have an approach to getting this for two columns? Any help is appreciated.
Use the $facet pipeline stage to process multiple aggregation pipelines within a single stage on the same set of input documents. In your case you need to aggregate the counts separately, then join the two results and calculate the final aggregates.
This can be demonstrated by running the following pipeline:
db.collection.aggregate([
    { "$match": { "userid": "123" } },
    { "$facet": {
        "groupByPage": [
            { "$unwind": "$page" },
            { "$group": {
                "_id": "$page",
                "count": { "$sum": 1 }
            }}
        ],
        "groupByNewPage": [
            { "$unwind": "$newpage" },
            { "$group": {
                "_id": "$newpage",
                "count": { "$sum": 1 }
            }}
        ]
    }},
    { "$project": {
        "pages": {
            "$concatArrays": ["$groupByPage", "$groupByNewPage"]
        }
    }},
    { "$unwind": "$pages" },
    { "$group": {
        "_id": "$pages._id",
        "count": { "$sum": "$pages.count" }
    }},
    { "$sort": { "count": -1 } }
], function(error, data){
    if (error) {
        console.log(error);
    } else {
        console.log(data);
    }
})
There you go:
db.Collection.aggregate([
    { $match: { "userid": "123" } },    // filter out what's not of interest
    { $facet: {                         // process two stages in parallel --> this will give us a single result document with the following two fields
        "newpage": [                    // "newpage" holding the ids and sums per "newpage" field
            { $group: { "_id": "$newpage", "count": { "$sum": 1 } } }
        ],
        "page": [                       // and "page" holding the ids and sums per "page" field
            { $group: { "_id": "$page", "count": { "$sum": 1 } } }
        ]
    }},
    { $project: { x: { $concatArrays: ["$newpage", "$page"] } } },  // merge the two arrays into one
    { $unwind: "$x" },                  // flatten the single result document into multiple ones so we do not need to $reduce but can nicely $group
    { $group: { _id: "$x._id", "count": { $sum: "$x.count" } } },   // perform the final grouping/counting
    { $sort: { count: -1 } }            // and the sort according to your question
]);
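For what it's worth, $facet isn't strictly required here. A minimal alternative sketch (assuming MongoDB 3.4+, where $project accepts array literals, and that page and newpage are plain scalar fields as in the sample documents) packs both values into one array per document, then unwinds and groups once:
db.Collection.aggregate([
    { "$match": { "userid": "123" } },
    // pack both columns into a single array field per document
    { "$project": { "values": ["$page", "$newpage"] } },
    { "$unwind": "$values" },
    { "$group": { "_id": "$values", "count": { "$sum": 1 } } },
    { "$project": { "_id": 0, "pagename": "$_id", "count": 1 } },
    { "$sort": { "count": -1 } }
])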

Get Last Date within Range for Each Id Group

Let's say I have a collection with the following items:
[{myId:0,date:01.01.17,data:1000},
{myId:1,date:01.02.17,data:2000},
{myId:0,date:01.03.17,data:3000},
{myId:1,date:01.04.17,data:4000},
{myId:0,date:01.05.17,data:5000}]
I want to create a query that takes a date as a parameter and returns an array with a single object for every myId that has the maximum date below the requested one.
For example, calling the query with the date 15.03.17 returns:
[{myId:1,date:01.02.17,data:2000},
{myId:0,date:01.03.17,data:3000}]
And calling the query with the date 15.01.17 returns:
[{myId:0,date:01.01.17,data:1000}]
I'm looking for an answer that doesn't use db.eval
Fixing your data to make it valid:
db.junk.insertMany([
    { myId: 0, date: new Date("2017-01-01"), data: 1000 },
    { myId: 1, date: new Date("2017-02-01"), data: 2000 },
    { myId: 0, date: new Date("2017-03-01"), data: 3000 },
    { myId: 1, date: new Date("2017-04-01"), data: 4000 },
    { myId: 0, date: new Date("2017-05-01"), data: 5000 }
])
You run an aggregate statement, filtering the entries via $match, then applying $sort to ensure the order and using $last for the "max" on each grouping boundary:
db.junk.aggregate([
    { "$match": { "date": { "$lte": new Date("2017-03-15") } } },
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": "$myId",
        "date": { "$last": "$date" },
        "data": { "$last": "$data" }
    }}
])
Returns:
/* 1 */
{
    "_id" : 1.0,
    "date" : ISODate("2017-02-01T00:00:00.000Z"),
    "data" : 2000.0
}

/* 2 */
{
    "_id" : 0.0,
    "date" : ISODate("2017-03-01T00:00:00.000Z"),
    "data" : 3000.0
}
And for the other date:
db.junk.aggregate([
    { "$match": { "date": { "$lte": new Date("2017-01-15") } } },
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": "$myId",
        "date": { "$last": "$date" },
        "data": { "$last": "$data" }
    }}
])
Returns:
/* 1 */
{
    "_id" : 0.0,
    "date" : ISODate("2017-01-01T00:00:00.000Z"),
    "data" : 1000.0
}
If you really must, you can add a $sort as the final pipeline stage to ensure the order of the returned _id ( myId ) values:
db.junk.aggregate([
    { "$match": { "date": { "$lte": new Date("2017-03-15") } } },
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": "$myId",
        "date": { "$last": "$date" },
        "data": { "$last": "$data" }
    }},
    { "$sort": { "_id": 1 } }
])
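Since the question asks for something that takes a date as a parameter and returns an array, a minimal sketch wraps the same pipeline in a shell helper (the name latestPerIdBefore is just an illustration, not an existing API):
function latestPerIdBefore(cutoff) {
    return db.junk.aggregate([
        { "$match": { "date": { "$lte": cutoff } } },
        { "$sort": { "date": 1 } },
        { "$group": {
            "_id": "$myId",
            "date": { "$last": "$date" },
            "data": { "$last": "$data" }
        }}
    ]).toArray();   // materialize the cursor into an array
}
latestPerIdBefore(new Date("2017-03-15"));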

Using the aggregation framework to compare array element overlap

I have a collection with documents structured like below:
{
    carrier: "abc",
    flightNumber: 123,
    dates: [
        ISODate("2015-01-01T00:00:00Z"),
        ISODate("2015-01-02T00:00:00Z"),
        ISODate("2015-01-03T00:00:00Z")
    ]
}
I would like to search the collection to see if there are any documents with the same carrier and flightNumber that also have dates in the dates array that overlap. For example:
{
    carrier: "abc",
    flightNumber: 123,
    dates: [
        ISODate("2015-01-01T00:00:00Z"),
        ISODate("2015-01-02T00:00:00Z"),
        ISODate("2015-01-03T00:00:00Z")
    ]
},
{
    carrier: "abc",
    flightNumber: 123,
    dates: [
        ISODate("2015-01-03T00:00:00Z"),
        ISODate("2015-01-04T00:00:00Z"),
        ISODate("2015-01-05T00:00:00Z")
    ]
}
If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.
Typically I would do this by grouping and counting like below:
db.flights.aggregate([
    { $group: {
        _id: { carrier: "$carrier", flightNumber: "$flightNumber" },
        uniqueIds: { $addToSet: "$_id" },
        count: { $sum: 1 }
    }},
    { $match: {
        count: { $gt: 1 }
    }}
])
But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?
You $unwind the array if you want to look at the contents as "grouped" within them:
db.flights.aggregate([
    { "$unwind": "$dates" },
    { "$group": {
        "_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
        "count": { "$sum": 1 },
        "_ids": { "$addToSet": "$_id" }
    }},
    { "$match": { "count": { "$gt": 1 } } },
    { "$unwind": "$_ids" },
    { "$group": { "_id": "$_ids" } }
])
That does in fact tell you in which documents the "overlap" resides, because the "same dates", along with the other grouping key values you are concerned about, have a "count" that occurs more than once, indicating the overlap.
Anything after the $match is really just for "presentation", since there is no point reporting the same _id value multiple times if you just want to see the overlaps. In fact, if you want to see them together, it would probably be best to leave the "grouped set" alone.
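As a minimal sketch of that shorter form, you can simply stop after the $match, which keeps each overlapping (carrier, flightNumber, date) group together with the colliding _ids:
db.flights.aggregate([
    { "$unwind": "$dates" },
    { "$group": {
        "_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
        "count": { "$sum": 1 },
        "_ids": { "$addToSet": "$_id" }
    }},
    // each remaining document is one overlapping date with the _ids that share it
    { "$match": { "count": { "$gt": 1 } } }
])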
Now you could add a $lookup to that if retrieving the actual documents was important to you:
db.flights.aggregate([
    { "$unwind": "$dates" },
    { "$group": {
        "_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
        "count": { "$sum": 1 },
        "_ids": { "$addToSet": "$_id" }
    }},
    { "$match": { "count": { "$gt": 1 } } },
    { "$unwind": "$_ids" },
    { "$group": { "_id": "$_ids" } },
    { "$lookup": {
        "from": "flights",
        "localField": "_id",
        "foreignField": "_id",
        "as": "_ids"
    }},
    { "$unwind": "$_ids" },
    { "$replaceRoot": {
        "newRoot": "$_ids"
    }}
])
The final $replaceRoot (or a $project) is what makes it return the whole document. Or you could even have done $addToSet with $$ROOT, if the size was not a problem.
But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.
Alternately for a more "reporting" like format:
db.flights.aggregate([
    { "$addFields": { "copy": "$$ROOT" } },
    { "$unwind": "$dates" },
    { "$group": {
        "_id": {
            "carrier": "$carrier",
            "flightNumber": "$flightNumber",
            "dates": "$dates"
        },
        "count": { "$sum": 1 },
        "_docs": { "$addToSet": "$copy" }
    }},
    { "$match": { "count": { "$gt": 1 } } },
    { "$group": {
        "_id": {
            "carrier": "$_id.carrier",
            "flightNumber": "$_id.flightNumber"
        },
        "overlaps": {
            "$push": {
                "date": "$_id.dates",
                "_docs": "$_docs"
            }
        }
    }}
])
Which would report the overlapped dates within each group and tell you which documents contained the overlap:
{
    "_id" : {
        "carrier" : "abc",
        "flightNumber" : 123.0
    },
    "overlaps" : [
        {
            "date" : ISODate("2015-01-03T00:00:00.000Z"),
            "_docs" : [
                {
                    "_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
                    "carrier" : "abc",
                    "flightNumber" : 123.0,
                    "dates" : [
                        ISODate("2015-01-03T00:00:00.000Z"),
                        ISODate("2015-01-04T00:00:00.000Z"),
                        ISODate("2015-01-05T00:00:00.000Z")
                    ]
                },
                {
                    "_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
                    "carrier" : "abc",
                    "flightNumber" : 123.0,
                    "dates" : [
                        ISODate("2015-01-01T00:00:00.000Z"),
                        ISODate("2015-01-02T00:00:00.000Z"),
                        ISODate("2015-01-03T00:00:00.000Z")
                    ]
                }
            ]
        }
    ]
}

Correct query for group by user, per month

I have a MongoDB collection that stores documents in this format:
"name" : "Username",
"timeOfError" : ISODate("...")
I'm using this collection to keep track of who got an error and when it occurred.
What I want to do now is create a query that retrieves errors per user, per month or something similar. Something like this:
{
    "result": [
        {
            "_id": "$name",
            "errorsPerMonth": [
                {
                    "month": "0",
                    "errorsThisMonth": 10
                },
                {
                    "month": "1",
                    "errorsThisMonth": 20
                }
            ]
        }
    ]
}
I have tried several different queries, but none have given the desired result. The closest result came from this query:
db.collection.aggregate([
    { $group: {
        _id: { $month: "$timeOfError" },
        name: { $push: "$name" },
        totalErrorsThisMonth: { $sum: 1 }
    }}
]);
The problem here is that the $push just adds the username for each error. So I get an array with duplicate names.
You need to compound the _id value in $group:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "name": "$name",
            "month": { "$month": "$timeOfError" }
        },
        "totalErrors": { "$sum": 1 }
    }}
])
The _id is essentially the "grouping key", so whatever elements you want to group by need to be a part of that.
If you want a different order then you can change the grouping key precedence:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "month": { "$month": "$timeOfError" },
            "name": "$name"
        },
        "totalErrors": { "$sum": 1 }
    }}
])
Or, if you have other conditions in your pipeline with different fields, just add a $sort pipeline stage at the end:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "month": { "$month": "$timeOfError" },
            "name": "$name"
        },
        "totalErrors": { "$sum": 1 }
    }},
    { "$sort": { "_id.name": 1, "_id.month": 1 } }
])
Where you can essentially $sort on whatever you want.
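And if you want the output shaped like the errorsPerMonth array in the question, a minimal sketch follows the same two-$group pattern used in the first answer on this page: a second $group that pushes the per-month totals under each user:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "name": "$name",
            "month": { "$month": "$timeOfError" }
        },
        "totalErrors": { "$sum": 1 }
    }},
    // reshape: one document per user, with an array of per-month counts
    { "$group": {
        "_id": "$_id.name",
        "errorsPerMonth": {
            "$push": { "month": "$_id.month", "errorsThisMonth": "$totalErrors" }
        }
    }}
])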