Aggregate Group by Date with Daylight Saving Offset - mongodb

I'm trying to use mongo aggregation to group documents by the week of a timestamp on each document. All timestamps are stored in UTC and I need to calculate the week using the client's local time, not UTC.
I can provide and add the client's UTC offset as shown below, but this doesn't always work due to daylight saving time. The offset differs depending on the date, so adjusting all the timestamps with the offset of the current date won't do.
Does anyone know of a way to group by week that consistently accounts for daylight saving time?
db.collection.aggregate([
    { "$group": {
        "_id": {
            "$week": {
                "$add": [ "$Timestamp", clientsUtcOffsetInMilliseconds ]
            }
        },
        "firstTimestamp": { "$min": "$Timestamp" }
    }}
]);

The basic concept here is to make your query "aware" of when "daylight savings" both "starts" and "ends" for the given query period and simply supply this test to $cond in order to determine which "offset" to use:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "$week": {
                "$add": [
                    "$Timestamp",
                    { "$cond": [
                        { "$and": [
                            { "$gte": [
                                { "$dayOfYear": "$Timestamp" },
                                daylightSavingsStartDay
                            ]},
                            { "$lt": [
                                { "$dayOfYear": "$Timestamp" },
                                daylightSavingsEndDay
                            ]}
                        ]},
                        daylightSavingsOffset,
                        normalOffset
                    ]}
                ]
            }
        },
        "min": { "$min": "$Timestamp" }
    }}
])
So you can make that a little more complex when covering several years, but it is still the same basic principle. In the southern hemisphere you are always spanning a year boundary, so each condition would be a "range" from "start" to "end of year" and from "beginning of year" to "end". Therefore you would use an $or with an inner $and around the "range" operators demonstrated, as sketched further below.
If you have different values to apply, then detect when you "should" choose either and then apply using $cond.
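For illustration, a southern hemisphere variant might look like the sketch below. This is only a sketch: daylightSavingsStartDay, daylightSavingsEndDay and the two offset variables are placeholders you would compute for your own locale and query period, just as above.
db.collection.aggregate([
    { "$group": {
        "_id": {
            "$week": {
                "$add": [
                    "$Timestamp",
                    { "$cond": [
                        { "$or": [
                            // DST runs from the start day through the end of the year
                            { "$and": [
                                { "$gte": [ { "$dayOfYear": "$Timestamp" }, daylightSavingsStartDay ] },
                                { "$lte": [ { "$dayOfYear": "$Timestamp" }, 366 ] }
                            ]},
                            // and from the beginning of the year up to the end day
                            { "$and": [
                                { "$gte": [ { "$dayOfYear": "$Timestamp" }, 1 ] },
                                { "$lt": [ { "$dayOfYear": "$Timestamp" }, daylightSavingsEndDay ] }
                            ]}
                        ]},
                        daylightSavingsOffset,
                        normalOffset
                    ]}
                ]
            }
        },
        "min": { "$min": "$Timestamp" }
    }}
])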

Related

MongoDB slow aggregate time

I'm facing an issue where the aggregate function is performing very slowly, taking about 30 seconds to gather all my data. Assume one of the records has this structure:
{
"_id":{
"$oid":"5909a5cefece40f172895a6b"
},
"Record":1,
"Link":"https://www.google.com",
"Location":["loc1", "loc2", "loc3"],
"Organization":["org1", "org2", "org3"],
"Date":2017,
"PeoplePPL":["ppl1", "ppl2", "ppl3"]
}
And the aggregate query as follows:
db.testdata_4.aggregate([{
"$unwind": "$PeoplePPL"
},{
"$unwind": "$Location"
},{
"$match": {
Date: {
$gte: lowerBoundYear,
$lte: upperBoundYear
}
}
},{
"$group": {
"_id": {
"People": "$PeoplePPL",
"Date": "$Date"
},
Links: {
$addToSet: "$Link"
},
Locations: {
$addToSet: "$Location"
}
}
},{
"$group": {
"_id": "$_id.People",
Record: {
$push: {
"Country": "$Locations",
"Year": "$_id.Date",
"Links": "$Links"
}
}
}
}]).toArray()
There are a total of 154 records in the "testdata_4" collection, and upon aggregation there are 5571 records returned, with a query time of 28 seconds. I have performed ensureIndex() on "Locations" and "Date". Is this supposed to be normal as the number of records returned increases?
If it isn't normal, may I know if there's a workaround to decrease my query time to at most 5 seconds instead of 28 seconds or more?
It's very likely that the index on Date isn't being used.
The $match and $sort operators can take advantage of indexes only when they are used at the beginning of the pipeline. In this case, the filter is applied after several $unwind stages, which means the index likely isn't being used.
Suggestions:
Move the $match stage to the beginning of the pipeline, as in the sketch after this list
If the "Location" and "PeoplePPL" fields are not actually arrays in your data, the $unwind stages on them are unnecessary and you may want to remove them

how to get year wise and month wise documents which are having 'createdat' field

I need to generate monthly and yearly reports, so how can I get the documents grouped by year and by month using the 'createdat' field?
I suggest you use the aggregation framework.
You need to project the createdat field into year and month fields as follows:
db.collection.aggregate([{
$project: {
year: {
$year: "$createdat"
},
month: {
$month: "$createdat"
},
log_field: 1,
other_field: 1,
}
}, {
$match: {
"year": 2016
}
}]);
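If what you actually want for the report is a count per month rather than the matching documents themselves, a $group stage can follow the same projection. A minimal sketch building on the pipeline above:
db.collection.aggregate([{
    $project: {
        year: { $year: "$createdat" },
        month: { $month: "$createdat" }
    }
}, {
    // one output document per year/month combination
    $group: {
        _id: { year: "$year", month: "$month" },
        count: { $sum: 1 }
    }
}]);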
More details are in the documentation for the $project stage.
I realise you might not be able to change the structure of the data you have, but if you do, this might be useful to you...
I do quite a lot of this in our projects, and have always found that the best (as in, simplest and fastest) way is to actually store that kind of info up front, separately.
So as an example, I store the full datetime field (as you do), but then also store the YYYYMM value in its own field, as well as the Day. These values/formats might be different depending on the kind of data, but I've had very good success with it.
(For context, I'm storing financial calculation data, anywhere up to several million records per month, per customer... not something the aggregation framework has handled nicely.)
A little sample of the BSON
....
"cd" : ISODate("2016-02-29T22:59:59.999Z"),
"cdym" : "201602",
"cdd" : "29",
....
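With a field like that in place, a monthly report can often be a simple indexed match rather than an aggregation. A sketch, assuming the field names shown above:
// everything for February 2016, using the precomputed year/month field
db.collection.find({ "cdym": "201602" })

// an index on that field keeps the lookup fast
db.collection.createIndex({ "cdym": 1 })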
I got the answer, and this is it:
db.mycollection.aggregate([
{ "$redact": {
"$cond": [
{ "$and": [
{ "$eq": [ { "$year": "$date" }, 2016 ] },
{ "$eq": [ { "$month": "$date" }, 5 ] }
] },
"$$KEEP",
"$$PRUNE"
] }
}
])

Need to aggregate by hour and $avg not recognized

From a MongoDB collection storing data with time stamps I need to return a single record for each hour.
So far I have selected the set of records between two dates successfully, but I can't figure out how to build the hourly record I need in the $group clause.
var myName = "CollectionName"
//schema for mongoose
var mySchema = new Schema({
dt: Date,
value: Number
});
var myDB = mongoose.createConnection('mongodb://localhost:27017/MYDB');
myDBObj = myDB.model(myName, evalSchema, myName);
The $match in this aggregate call works fine, and the $hour creates a record for each hour in the day, but I don't know how to recreate a full date, and I get the error "unknown group operator $avg"...
myDBObj.aggregate([
    {
        $match: { "dt": { $gt: new Date("October 13, 2010 12:00:00"), $lt: new Date("November 13, 2010 12:00:00") } }
    }, {
        $group: {
            "_id": { "dt": { "$hour": "$dt" }, "price": { "$avg": "$price" } }
        }
    }
], function (err, data) { if (err) { return next(err); } res.json(data); });
I think I need to use $dayOfYear so there are different records for each hour of each day, and include a new Date() somewhere...
Can someone help me do this correctly? any help is appreciated.
The $group pipeline stage works by "grouping" all data by the "key" specified for _id. Other fields you are actually aggregating are separate from the _id value and are their own field properties.
So your $group becomes this instead:
{ "$group": {
"_id": { "$hour": "$dt" },
"price": { "$avg": "$price" }
}}
Or if you want that broken by day then make a compound key:
{ "$group": {
"_id": {
"day": { "$dayOfYear": "$dt" },
"hour": { "$hour": "$dt" }
},
"price": { "$avg": "$price" }
}}
Or just use date math to produce Date objects rounded by hour:
{ "$group": {
"_id": {
"$add": [
{ "$subtract": [
{ "$subtract": [ "$dt", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$dt", new Date(0) ] },
1000 * 60 * 60
]}
]},
new Date(0)
]
},
"price": { "$avg": "$price" }
}}
Where subtracting one date object (the epoch date) from another produces a numeric value you can round (1000 milliseconds, 60 seconds, 60 minutes = 1 hour) with the applied math, and adding a number to a date object produces a date corresponding to that value.
So your problem was that you had everything in the _id, which is where the $avg accumulator is not recognised. All accumulators need to be specified outside of the grouping key. That is the intent.
If you want to make an accumulator value part of a grouping key ( does not seem relevant here though ), you instead follow with another group stage, referencing the field that was produced from the former.
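Putting that back into the mongoose call from the question, the corrected pipeline might look like the sketch below. Note that the schema in the question defines value rather than price, so the averaged field name here is an assumption you should adjust to your actual documents:
myDBObj.aggregate([
    {
        // the working date range filter stays first
        $match: { "dt": { $gt: new Date("October 13, 2010 12:00:00"), $lt: new Date("November 13, 2010 12:00:00") } }
    }, {
        $group: {
            // compound grouping key: one bucket per day-of-year and hour
            "_id": {
                "day": { "$dayOfYear": "$dt" },
                "hour": { "$hour": "$dt" }
            },
            // the accumulator sits outside the _id, so $avg is recognised
            "avgValue": { "$avg": "$value" }
        }
    }
], function (err, data) {
    if (err) { return next(err); }
    res.json(data);
});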

Getting unix timestamp in seconds out of MongoDB ISODate during aggregation

I was searching for this one but I couldn't find anything useful to solve my case. What I want is to get the unix timestamp in seconds out of MongoDB ISODate during aggregation. The problem is that I can get the timestamp out of ISODate but it's in milliseconds. So I would need to cut out those milliseconds. What I've tried is:
> db.data.aggregate([
{$match: {dt:2}},
{$project: {timestamp: {$concat: [{$substr: ["$md", 0, -1]}, '01', {$substr: ["$id", 0, -1]}]}}}
])
As you can see I'm trying to get the timestamp out of 'md' var and also concatenate this timestamp with '01' and the 'id' number. The above code gives:
{
"_id" : ObjectId("52f8fc693890fc270d8b456b"),
"timestamp" : "2014-02-10T16:20:56011141"
}
Then I improved the command with:
> db.data.aggregate([
{$match: {dt:2}},
{$project: {timestamp: {$concat: [{$substr: [{$subtract: ["$md", new Date('1970-01-01')]}, 0, -1]}, '01', {$substr: ["$id", 0, -1]}]}}}
])
Now I get:
{
"_id" : ObjectId("52f8fc693890fc270d8b456b"),
"timestamp" : "1392049256000011141"
}
What I really need is 1392049256011141, so without the 3 extra zeros. I tried with $divide:
> db.data.aggregate([
{$match: {dt:2}},
{$project: {timestamp: {$concat: [{$substr: [{$divide: [{$subtract: ["$md", new Date('1970-01-01')]}, 1000]}, 0, -1]}, '01', {$substr: ["$id", 0, -1]}]}}}
])
What I get is:
{
"_id" : ObjectId("52f8fc693890fc270d8b456b"),
"timestamp" : "1.39205e+009011141"
}
Not exactly what I would expect from the command. Unfortunately the $substr operator doesn't allow negative length. Does anyone have any other solution?
I'm not sure why you think you need the value in seconds rather than milliseconds, as generally both forms are valid and within most language implementations milliseconds are actually preferred. But trying to coerce this into a string is the wrong way to go about it; generally you just do the math:
db.data.aggregate([
{ "$project": {
"timestamp": {
"$subtract": [
{ "$divide": [
{ "$subtract": [ "$md", new Date("1970-01-01") ] },
1000
]},
{ "$mod": [
{ "$divide": [
{ "$subtract": [ "$md", new Date("1970-01-01") ] },
1000
]},
1
]}
]
}
}}
])
Which returns you an epoch timestamp in seconds. When one BSON date object is subtracted from another, the result is the time interval in milliseconds, so subtracting the epoch date of "1970-01-01" essentially extracts the milliseconds value of the current date. The $divide by 1000 converts that to seconds, and subtracting the $mod remainder drops the fractional part to implement the rounding.
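As a quick worked example of that math, using an arbitrary millisecond value for illustration:
// e.g. a BSON date that is 123 ms past a second boundary
// 1392049256123 / 1000      = 1392049256.123   ($divide)
// 1392049256.123 % 1        = 0.123            ($mod)
// 1392049256.123 - 0.123    = 1392049256       (epoch timestamp in whole seconds)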
Really though you are better off doing the work in the native language for your application as all BSON dates will be returned there as a native "date/datetime" type where you can extract the timestamp value. Consider the JavaScript basics in the shell:
var date = new Date()
( date.valueOf() / 1000 ) - ( ( date.valueOf() / 1000 ) % 1 )
Typically with aggregation you want to do this sort of "math" to a timestamp value for use in something like aggregating values within a time period such as a day. There are date operators available to the aggregation framework, but you can also do it the date math way:
db.data.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$md", new Date("1970-01-01") ] },
{ "$mod": [
{ "$subtract": [ "$md", new Date("1970-01-01") ] },
1000 * 60 * 60 * 24
]}
]
},
"count": { "$sum": 1 }
}}
])
That form would be more typical to emit a timestamp rounded to a day, and aggregate the results within those intervals.
So using the aggregation framework just to extract a timestamp does not seem to be the best usage, and indeed it should not be necessary to convert this to seconds rather than milliseconds. Your application code is where I think you should be doing that, unless of course you actually want results for intervals of time, where you can apply the date math as shown.
The methods are there, but unless you are actually aggregating then this would be the worst performance option for your application. Do the conversion in code instead.
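For instance, a minimal sketch of that client-side conversion with the Node.js driver, assuming db is a connected database handle and the md and dt fields from the question:
// fetch the documents normally; "md" comes back as a JavaScript Date
db.collection('data').find({ "dt": 2 }).toArray(function (err, docs) {
    if (err) throw err;
    docs.forEach(function (doc) {
        // valueOf() gives milliseconds since the epoch; floor to whole seconds
        var seconds = Math.floor(doc.md.valueOf() / 1000);
        console.log(seconds);
    });
});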

MongoDB get every second result

In MongoDB all documents have a date field, it is a timestamp.
There is a lot of data, and I want to get only some part of it, for every interval:
e.g. 400ms
1402093316030<----
1402093316123
1402093316223
1402093316400<----
1402093316520
1402093316630
1402093316824<----
Is it possible to get every other, or every third result?
Or better, the first document in every 400 ms?
You can do this with the aggregation framework and a little date math. Let's say you have a "timestamp" field and additional fields "a", "b" and "c":
db.collection.aggregate([
{ "$group": {
"_id": {
"$subtract": [
"$timestamp",
{ "$mod": [ "$timestamp", 400 ] }
]
},
"timestamp": { "$first": "$timestamp" },
"a": { "$first": "$a" },
"b": { "$first": "$b" },
"c": { "$first": "$c" }
}}
])
So the date math there "groups" on the values of the "timestamp" field at 400ms intervals. The rest of the data is identified with the $first operator, which picks the "first" value found for each field on those grouping boundaries.
If you instead want the "last" item on those boundaries, then you switch to the $last operator instead.
The end result is the first document that occurred in each 400 millisecond interval.
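To make the bucketing concrete with the sample timestamps from the question (each bucket key is the timestamp minus its remainder modulo 400):
// 1402093316030 - (1402093316030 % 400) = 1402093316000   <- bucket 1, $first = ...030
// 1402093316123 - (1402093316123 % 400) = 1402093316000
// 1402093316223 - (1402093316223 % 400) = 1402093316000
// 1402093316400 - (1402093316400 % 400) = 1402093316400   <- bucket 2, $first = ...400
// 1402093316520 - (1402093316520 % 400) = 1402093316400
// 1402093316630 - (1402093316630 % 400) = 1402093316400
// 1402093316824 - (1402093316824 % 400) = 1402093316800   <- bucket 3, $first = ...824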
See the aggregate command and the Aggregation Framework operators for additional reference.