MongoDB $dayOfYear equivalent for Unix epoch time in aggregation

Is there a way to group Unix epoch times by day, equivalent to $dayOfYear,
or a way to aggregate floats and ints into ranges (quartiles, hundreds, thousands, percentages)?
I'd like to avoid map-reduce, but an example of it would be awesome.

You can almost, but not quite, use Unix time in seconds in the aggregation pipeline by combining the $mod and $divide operators.
The math is Unix time seconds / 86400 to convert seconds into days since the epoch. Then take that result modulo 365.25 for an approximate day of the year (one leap day every 4 years).
So the full aggregation for $dayOfYear using seconds is almost as simple as
db.MyCollection.aggregate([
  { $project : { "day" : { $mod : [ { $divide : ["$unix_seconds", 86400] }, 365.25 ] } } },
  { $group : { _id : "$day", num : { $sum : 1 } } },
  { $sort : { _id : 1 } }
])
The above adds sorting so that the days of the year come out in sequence.
The problem is that $divide returns a fractional number of days and $mod keeps that fraction, and older MongoDB versions have no operator for rounding or truncating it. Therefore the results are grouped by the whole number plus its fraction.
{
"_id" : 235.1864887063916,
"num" : 1
},
{
"_id" : 235.24300889818738,
"num" : 1
},
{
"_id" : 235.60299520864623,
"num" : 3
},
{
"_id" : 235.66453935674085,
"num" : 1
},
{
"_id" : 235.79900382758004,
"num" : 1
},
{
"_id" : 235.80265845312474,
"num" : 1
},
.. when clearly we want only the whole number
{
"_id" : 235,
"num" : 8
},
What would be nice is a $trunc operator, or a modulo that returns only the whole part ($modw) and one that returns only the remainder ($modr). (MongoDB 3.2 and later do provide $trunc and $floor.)
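One possible workaround, sketched here under the assumption that the field is named unix_seconds as in the pipeline above: subtracting a value's own $mod 1 leaves only its whole part, so the grouping works even without a truncation operator. A second option is to turn the seconds into a real date by adding milliseconds to new Date(0) and then apply $dayOfYear directly. Note that the modulo approach is zero-based and only approximate, while $dayOfYear returns 1 to 366. The dayBucket helper is hypothetical; it only mirrors the first pipeline's arithmetic in plain JavaScript so the math can be checked outside the database.

```javascript
// Fractional "day of year" exactly as computed in the answer above.
const fractionalDay = { $mod: [{ $divide: ["$unix_seconds", 86400] }, 365.25] };

// Option 1: truncate by subtracting the fractional part (x - (x mod 1)).
const truncatedPipeline = [
  { $project: { day: { $subtract: [fractionalDay, { $mod: [fractionalDay, 1] }] } } },
  { $group: { _id: "$day", num: { $sum: 1 } } },
  { $sort: { _id: 1 } }
];

// Option 2: convert seconds to a Date, then use the built-in $dayOfYear.
const datePipeline = [
  { $project: { day: { $dayOfYear: { $add: [new Date(0), { $multiply: ["$unix_seconds", 1000] }] } } } },
  { $group: { _id: "$day", num: { $sum: 1 } } },
  { $sort: { _id: 1 } }
];

// Hypothetical plain-JavaScript mirror of option 1, for checking the arithmetic.
function dayBucket(unixSeconds) {
  const d = (unixSeconds / 86400) % 365.25;
  return d - (d % 1);
}

// In the mongo shell: db.MyCollection.aggregate(truncatedPipeline)
```

With option 1, all the 235.x values shown above collapse into a single group with _id 235.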

JavaScript has the Date object, which is available to any server-side JavaScript processing such as mapReduce functions.
You seem to be aware of the $dayOfYear operator in the aggregation pipeline. There are other operators there for processing dates as well.
Unless your needs are very specific, you should be using the aggregation pipeline. It is very flexible and in most cases will be considerably faster than the equivalent actions run under mapReduce.

Related

MongoDB Aggregation - Buckets Boundaries to Referenced Array

To whom this may concern:
I would like to know if there is some workaround in MongoDB to set the "boundaries" field of a "$bucket" aggregation pipeline stage to an array that's already in the previous aggregation stage. (Or some other aggregation pipeline that will get me the same result). I am using this data to create a histogram of a bunch of values. Rather than retrieve 1 million-or-so values, I can receive 20 buckets with their respective counts.
The previous stages yield the following result:
{
"_id" : ObjectId("5cfa6fad883d3a9b8c6ad50a"),
"boundaries" : [ 73.0, 87.25, 101.5, 115.75, 130.0 ],
"value" : 83.58970621935025
},
{
"_id" : ObjectId("5cfa6fe0883d3a9b8c6ad5a8"),
"boundaries" : [ 73.0, 87.25, 101.5, 115.75, 130.0 ],
"value" : 97.3261380262403
},
...
The "boundaries" field of every document is the result of a $facet/$unwind/$addFields sequence with some statistical mathematics involving the "value" fields in the pipeline. Therefore, every "boundaries" field holds the same array: evenly spaced values in ascending order, with the same length and the same values in every document.
The following stage of the aggregation I am trying to perform is:
$bucket: {
groupBy: "$value",
boundaries : "$boundaries" ,
default: "no_group",
output: { count: { $sum: 1 } }
}
I get the following error from the explain when I try to run this aggregation:
{
"ok" : 0.0,
"errmsg" : "The $bucket 'boundaries' field must be an array, but found type: string.",
"code" : NumberInt(40200),
"codeName" : "Location40200"
}
The result I would like to get is something like this, which is the result of a basic "$bucket" pipeline operator:
{
"_id" : 73.0, // range of [73.0,87.25)
"count" : 2 // number of documents with "value" in this range.
}, {
"_id" : 87.25, // range of [87.25,101.5)
"count" : 7 // number of documents with "value" in this range.
}, {
"_id" : 101.5,
"count" : 3
}, ...
What I know:
The JIRA documentation says
'boundaries' must be constant values (can't use "$x", but can use {$add: [4, 5]}), and must be sorted.
What I've tried:
$bucketAuto does not have a linear "granularity" setting. By default, it tries to evenly distribute the values amongst the buckets, and the bucket ranges are therefore spaced differently.
Building the constant array by retrieving the pipeline results, and then passing that constant array into a second pipeline. This works, but it is inefficient and not atomic, since it needs two passes over the data (roughly 2N work). I can live with this solution if need be.
There HAS to be a solution to this. Any workaround or alternative solutions are greatly appreciated.
Thank you for your time!
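One possible alternative, sketched here and not a confirmed answer from the thread: since each document already carries its own "boundaries" array, the lower bound of its bucket can be computed per document with $filter and $max, and the documents can then be grouped on that value with an ordinary $group. The output field name bucket and the helper lowerBoundary are assumptions made for illustration. Unlike $bucket, this sketch has no "default" group: values below the first boundary get a null _id, and values at or above the last boundary are grouped under the last boundary.

```javascript
// Assumes each document has "value" and an ascending "boundaries" array,
// as in the sample documents above.
const histogramPipeline = [
  // Largest boundary that is <= value, i.e. the lower edge of the bucket.
  { $addFields: {
      bucket: { $max: { $filter: {
        input: "$boundaries",
        as: "b",
        cond: { $lte: ["$$b", "$value"] }
      } } }
  } },
  { $group: { _id: "$bucket", count: { $sum: 1 } } },
  { $sort: { _id: 1 } }
];

// Hypothetical plain-JavaScript mirror of the $addFields expression.
function lowerBoundary(boundaries, value) {
  const candidates = boundaries.filter((b) => b <= value);
  return candidates.length > 0 ? Math.max(...candidates) : null;
}

// In the mongo shell, append these stages after the ones that produce "boundaries".
```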

MongoDB - Get aggregated difference between two date fields

I have a collection called lists with the following fields:
{ "_id" : ObjectId("5a7c9f60c05d7370232a1b73"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:40:10Z") }
{ "_id" : ObjectId("5a7c9f85c05d7370232a1b74"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:41:10Z") }
{ "_id" : ObjectId("5a7c9f89c05d7370232a1b75"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:42:10Z") }
{ "_id" : ObjectId("5a7c9f8cc05d7370232a1b76"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:42:20Z") }
I need an aggregated result in the following format (based on the difference between processed_date and created_date):
[{
"30Sec":count_for_diffrence_1,
"<=60Sec":count_for_diffrence_2,
"<=90Sec":count_for_diffrence_3
}]
I would also like to find out how many items took 30 sec, 60 sec and so on. Please also make sure that results counted under <=60Sec are not counted again under <=90Sec.
Any help will be appreciated.
You can try the aggregation query below on MongoDB 3.6 or later.
It uses $match with $expr to keep only the documents where the time difference is 90 seconds or less.
Then $group with $sum counts the occurrences in each time slice. As written, the <=60Sec and <=90Sec counts are cumulative.
db.collection.aggregate([
{"$match":{"$expr":{"$lte":[{"$subtract":["$processed_date","$created_date"]},90000]}}},
{"$group":{
"_id":null,
"30Sec":{"$sum":{"$cond":{"if":{"$eq":[{"$subtract":["$processed_date","$created_date"]},30000]},"then":1,"else":0}}},
"<=60Sec":{"$sum":{"$cond":{"if":{"$lte":[{"$subtract":["$processed_date","$created_date"]},60000]},"then":1,"else":0}}},
"<=90Sec":{"$sum":{"$cond":{"if":{"$lte":[{"$subtract":["$processed_date","$created_date"]},90000]},"then":1,"else":0}}}
}}
])
Note: if the created date can be later than the processed date, you may want to add a condition that only counts values where the difference is between 0 and your requested time slice.
Something like
{$and:[{"$gte":[{"$subtract":["$processed_date","$created_date"]},0]}, {"$lte":[{"$subtract":["$processed_date","$created_date"]},60000]}]}
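If the slices should not overlap, as the question asks, each count can test a lower and an upper bound. The following is a minimal sketch of that idea, not the original answerer's code. The label "<=30Sec" (covering 0 to 30 seconds) and the helper names are assumptions for illustration, and bucketLabel is only a plain-JavaScript mirror of the conditions.

```javascript
// Difference in milliseconds between the two date fields.
const diffMs = { $subtract: ["$processed_date", "$created_date"] };

// Condition: lo < diff <= hi (exclusive lower bound, inclusive upper bound).
const inRange = (lo, hi) => ({ $and: [{ $gt: [diffMs, lo] }, { $lte: [diffMs, hi] }] });

// Accumulator: add 1 when the condition holds, otherwise 0.
const countIf = (cond) => ({ $sum: { $cond: [cond, 1, 0] } });

const slicePipeline = [
  { $group: {
      _id: null,
      // The first slice includes 0 and ignores negative differences.
      "<=30Sec": countIf({ $and: [{ $gte: [diffMs, 0] }, { $lte: [diffMs, 30000] }] }),
      "<=60Sec": countIf(inRange(30000, 60000)),
      "<=90Sec": countIf(inRange(60000, 90000))
  } }
];

// Hypothetical mirror of the conditions above for a single difference.
function bucketLabel(ms) {
  if (ms < 0 || ms > 90000) return null;
  if (ms <= 30000) return "<=30Sec";
  if (ms <= 60000) return "<=60Sec";
  return "<=90Sec";
}

// In the mongo shell: db.lists.aggregate(slicePipeline)
```

With the four sample documents, whose differences are -1, 59, 119 and 129 seconds, only the second one falls into a slice.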

Can sorting before grouping improve query performance in Mongo using the aggregate framework?

I'm trying to aggregate data for 100 accounts for a 14-15 month period, grouping by year and month.
However, the query performance is horrible as it takes 22-27 seconds. There are currently over 15 million records in the collection and I've got an index on the match criteria and can see using explain() that the optimizer uses it.
I tried adding another index on the sort criteria in the query below and after adding the index, the query now takes over 50 seconds! This happens even after I remove the sort from the query.
I'm extremely confused. I thought because grouping can't utilize an index, that if the collection was sorted beforehand, then the grouping could be much faster. Is this assumption correct? If not, what other options do I have? I can bear the query performance to be as much as 5 seconds but nothing more than that.
//Document Structure
{
Acc: 1,
UIC: true,
date: ISODate("2015-12-01T05:00:00Z"),
y: 2015,
mm: 12,
value: 22.3
}
//Query
db.MyCollection.aggregate([
{ "$match" : { "UIC" : true, "Acc" : { "$in" : [1, 2, 3, ..., 99, 100] }, "date" : { "$gte" : ISODate("2015-12-01T05:00:00Z"), "$lt" : ISODate("2017-02-01T05:00:00Z") } } },
//{ "$sort" : { "UIC" : 1, "Acc" : 1, "y" : -1, "mm" : 1 } },
{ "$group" : { "_id" : { "Num" : "$Num", "Year" : "$y", "Month" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
])
What I would suggest is writing a script (it can be in Node.js) that aggregates the data into a separate collection. For long-running queries like this, it is advisable to keep the pre-aggregated data in its own collection and query that instead.
My second piece of advice is to give each document in the aggregated collection a composite _id and search it by regular expression. In your case I would build the _id as accountId:period. For example, for account 1 and February 2016, the _id would be something like 1:201602.
Then you can query by account and period using regular expressions. For example, to get the records for account 1 in 2016, you could do something like:
db.aggregatedCollection.find({_id : /^1:2016/})
I hope my answer was helpful.
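A minimal sketch of how such a script might build the keys, assuming the grouping uses the Acc, y and mm fields from the document structure above. The names periodKey, prefixPattern, toSummaryDoc and summaryPipeline are hypothetical. The month is zero-padded so that every period has six digits, and the pattern is anchored with ^ so that account 1 does not also match account 11.

```javascript
// Groups the raw documents per account, year and month.
const summaryPipeline = [
  { $match: { UIC: true } },
  { $group: {
      _id: { Acc: "$Acc", Year: "$y", Month: "$mm" },
      Sum: { $sum: "$value" }
  } }
];

// Builds a key such as "1:201602" for account 1, February 2016.
function periodKey(acc, year, month) {
  return `${acc}:${year}${String(month).padStart(2, "0")}`;
}

// Anchored prefix pattern for one account and one year.
function prefixPattern(acc, year) {
  return new RegExp(`^${acc}:${year}`);
}

// Converts one $group result into a document for the summary collection.
function toSummaryDoc(group) {
  const { Acc, Year, Month } = group._id;
  return { _id: periodKey(Acc, Year, Month), Sum: group.Sum };
}

// In a script, each result of db.MyCollection.aggregate(summaryPipeline)
// would be passed through toSummaryDoc and upserted into the summary collection.
```

A query such as db.aggregatedCollection.find({ _id: prefixPattern(1, 2016) }) would then return the monthly sums for account 1 in 2016.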

mongo query select only first of month

Is it possible to query only the first (or last, or any single) day of the month on a Mongo date field?
I use the date aggregation operators regularly, but only within a $group clause.
Basically, I have a field that is already aggregated (averaged) for each day of the month. I want to select only one of these days, with its value as a representative of the entire month.
The following is a sample of a record set from Jan 1, 2014 to Feb 1, 2015, with price as the daily price and 28day_avg as the trailing monthly average over 28 days.
{ "date" : ISODate("2014-01-01T00:00:00Z"), "_id" : ObjectId("533b3697574e2fd08f431cff"), "price": 59.23, "28day_avg": 54.21}
{ "date" : ISODate("2014-01-02T00:00:00Z"), "_id" : ObjectId("533b3697574e2fd08f431cff"), "price": 58.75, "28day_avg": 54.15}
...
{ "date" : ISODate("2015-02-01T00:00:00Z"), "_id" : ObjectId("533b3697574e2fd08f431cff"), "price": 123.50, "28day_avg": 122.25}
Method 1.
I'm currently running an aggregation using $month (and summing the price), but one issue is that I want to retrieve the underlying date value, ISODate("2015-02-01T00:00:00Z"), rather than the 0, 1, 2 values that come with several of the date operators (which restart at the first of the week, month or year). mod(28) on a date?
Method 2.
I'd like to simply pluck out a single record of the 28day_avg as representative of the period. The 1st of the month would be adequate.
The desired output is...
_id: ISODate("2015-02-01T00:00:00Z"), value: 122.25,
_id: ISODate("2015-01-01T00:00:00Z"), value: 120.78,
_id: ISODate("2014-12-01T00:00:00Z"), value: 118.71,
...
_id: ISODate("2014-01-01T00:00:00Z"), value: 53.21,
Of course, the value will vary between method 1 and method 2, but that is fine. One is 28 days trailing, while the other accounts for 28, 30 and 31 day months. I don't care about that so much.
A non-aggregation query is OK too, but this one doesn't work: {"date": { "$mod": [ 28, 0 ]} }
To pick the first of the month for each month (method 2), use the following aggregation:
db.test.aggregate([
{ "$project" : { "_id" : "$date", "day" : { "$dayOfMonth" : "$date" }, "28day_avg" : 1 } },
{ "$match" : { "day" : 1 } }
])
The match runs on a computed field, so it can't use an index and is not efficient. I'd suggest adding another field to each document that holds the $dayOfMonth value, so you can index it and do a simple find:
{
"date" : ISODate("2014-01-01T00:00:00Z"),
"price" : 59.23,
"28day_avg" : 54.21,
"dayOfMonth" : 1
}
db.test.createIndex({ "dayOfMonth" : 1 })
db.test.find({ "dayOfMonth" : 1 }, { "_id" : 0, "date" : 1, "28day_avg" : 1 })
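Existing documents would need the new field filled in once. A minimal sketch of that backfill step, assuming the field is named dayOfMonth as above; the helper withDayOfMonth is hypothetical. It uses getUTCDate because $dayOfMonth works in UTC unless a timezone is given.

```javascript
// Returns a copy of the document with the day of the month (UTC) added.
function withDayOfMonth(doc) {
  return { ...doc, dayOfMonth: doc.date.getUTCDate() };
}

// In the mongo shell, a one-off backfill could look like this:
// db.test.find().forEach(function (doc) {
//   db.test.updateOne(
//     { _id: doc._id },
//     { $set: { dayOfMonth: doc.date.getUTCDate() } }
//   );
// });
```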

Need some help completing this aggregation pipeline

I have an analytics collection where I store queries as individual documents. I want to count the number of queries taking place over the past day (24 hours). Here's the aggregation command as it is:
db.analytics.aggregate([{$group:{_id:{"day":{$dayOfMonth:"$datetime"},"hour":{$hour:"$datetime"}},"count":{$sum:1}}},{$sort:{"_id.day":1,"_id.hour":1}}])
The result looks like:
.
.
.
{
"_id" : {
"day" : 17,
"hour" : 19
},
"count" : 8
},
{
"_id" : {
"day" : 17,
"hour" : 22
},
"count" : 1
},
{
"_id" : {
"day" : 18,
"hour" : 0
},
"count" : 1
}
.
.
.
Originally, my plan was to add a $limit operation to simply take the last 24 results. That's a great plan until you realize that there are some hours without any queries at all. So the last 24 documents could go back more than a single day. I thought of using $match, but I'm just not sure how to go about constructing it. Any ideas?
First you need to determine the day, either from the current date or from the most recent document in the collection. Then query for that specific day, like this:
db.analytics.aggregate([
{$project:{datetime:"$datetime",day:{$dayOfMonth:"$datetime"}}},
{$match:{day:3}},
{$group:{_id:{"hour":{$hour:"$datetime"}},"count":{$sum:1}}},
{$sort:{"_id.hour":1}}
]);
where 3 in {$match:{day:3}} is the day of the month.
The idea is to add a day field so that we are able to filter by it, then group that day's documents by hour and sort.
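Filtering on the day of the month alone matches that day in every month and covers a calendar day instead of a rolling 24 hours. An alternative sketch, not part of the answer above, is to $match on the datetime field itself with a cutoff computed in the client. The function name last24hPipeline and the output field first are assumptions for illustration. Sorting on the earliest timestamp in each group keeps the hours in order even across a month boundary.

```javascript
// Builds a pipeline that counts documents per hour for the 24 hours before "now".
function last24hPipeline(now) {
  const since = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  return [
    // A range match on the raw date field can use an index on datetime.
    { $match: { datetime: { $gte: since, $lte: now } } },
    { $group: {
        _id: { day: { $dayOfMonth: "$datetime" }, hour: { $hour: "$datetime" } },
        count: { $sum: 1 },
        first: { $min: "$datetime" }
    } },
    { $sort: { first: 1 } }
  ];
}

// In the mongo shell: db.analytics.aggregate(last24hPipeline(new Date()))
```

Hours without any queries still produce no document, but the result can never reach back more than one day.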