MongoDB multiple date range groupings

I'm storing data in hour buckets. Simplified structure:
{
    value: 20,
    date: ISODate("2017-01-01T00:00:00Z")
},
{
    value: 50,
    date: ISODate("2017-01-01T01:00:00Z")
},
{
    value: 90,
    date: ISODate("2017-01-01T02:00:00Z")
},
...
Goal
Make a single aggregate query to get all twelve month totals for the year. I cannot use the $month operator as-is because I'm querying UTC data but want the groupings in my local timezone.
Current Solution
I can currently achieve this by making 12 separate queries like the following:
db.data.aggregate([
    {
        $match: {
            "date": {
                "$gte": ISODate("2017-01-01T08:00:00Z"),
                "$lt": ISODate("2017-02-01T08:00:00Z")
            }
        }
    },
    {
        $group: {
            "_id": null,
            total: {
                $sum: "$value"
            }
        }
    }
])
Note: Notice the 8-hour offsets, which shift the grouping boundaries to PST (UTC-8)
That would get me the total for January, but then I would need to make an additional query for every other month with different match conditions.
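For what it's worth, if you can run MongoDB 3.6 or newer, the date operators accept a timezone parameter, which makes the single-query version straightforward. A sketch, assuming the Olson zone name America/Los_Angeles for PST:
db.data.aggregate([
    {
        $match: {
            "date": {
                "$gte": ISODate("2017-01-01T08:00:00Z"),
                "$lt": ISODate("2018-01-01T08:00:00Z")
            }
        }
    },
    {
        $group: {
            // Month number computed in the local timezone, so each hour
            // bucket falls into its correct local month.
            "_id": { $month: { date: "$date", timezone: "America/Los_Angeles" } },
            total: { $sum: "$value" }
        }
    },
    { $sort: { "_id": 1 } }
])
This returns one document per month (_id 1 through 12) in a single round trip.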

Related

mongodb find between dates (month, year)

I have a collection with two fields, similar to the one below:
{
    year: 2017,
    month: 04
}
How can I select documents between 2017/07 - 2018/04?
Solved with:
db.collection.aggregate(
    // Pipeline
    [
        // Stage 1
        {
            $addFields: {
                "date": {
                    "$dateFromParts": {
                        "year": "$year",
                        "month": { "$toInt": "$month" }
                    }
                }
            }
        },
        // Stage 2
        {
            $match: {
                "date": {
                    "$gte": ISODate("2017-07-01T00:00:00.000Z"),
                    "$lte": ISODate("2018-04-30T00:00:00.000Z")
                }
            }
        }
    ]
);
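If you'd rather not materialize a date field at all, a plain find with $or also works here, assuming year and month are stored as numbers; it relies on the requested range spanning exactly two calendar years:
db.collection.find({
    $or: [
        // The tail of 2017: July through December.
        { year: 2017, month: { $gte: 7 } },
        // The head of 2018: January through April.
        { year: 2018, month: { $lte: 4 } }
    ]
})
Unlike the $addFields approach, this form can be served by an index on { year: 1, month: 1 }.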
Firstly, you have to make sure your db is storing dates in ISO format (a format that mongo supports).
You can then use the following command to find documents:
model.find({
    date: {
        $gte: ISODate("2017-04-29T00:00:00.000Z"),
        $lte: ISODate("2017-07-29T00:00:00.000Z")
    }
})
where model is the name of the collection and date is an attribute of the
document holding dates in ISO format.

Aggregation pipeline slow with large collection

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or average). I'm currently running into some performance issues and I'm hoping to get advice on how I could optimize/scale MongoDB, or suggestions on alternative solutions. I'm running the latest stable MongoDB version using WiredTiger. The documents basically look like the following:
{
    "dimensions": {
        "account_id": ObjectId("590889944befcf34204dbef2"),
        "url": "https://test.com",
        "date": ISODate("2018-03-04T23:00:00.000+0000")
    },
    "metrics": {
        "cost": 155,
        "likes": 200
    }
}
I have three indexes on this collection, as various aggregations are run against it:
account_id
date
account_id and date
The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:
db.large_collection.aggregate(
    [
        {
            $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
        },
        {
            $match: { "dimensions.account_id": { $in: [
                ObjectId("590889944befcf34204dbefc"), ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbf21")
            ] } }
        },
        {
            $group: {
                cost: { $sum: "$metrics.cost" },
                likes: { $sum: "$metrics.likes" },
                _id: {
                    year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
                    week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
                }
            }
        },
        {
            $project: {
                cost: 1,
                likes: 1
            }
        }
    ],
    {
        cursor: {
            batchSize: 50
        },
        allowDiskUse: true
    }
);
This query takes about 25-30 seconds to complete and I'm looking to reduce that to 5-10 seconds at most. It's currently a single MongoDB node, no shards or anything. The explain output can be found here: https://pastebin.com/raw/fNnPrZh0 and executionStats here: https://pastebin.com/raw/WA7BNpgA. As you can see, MongoDB is using indexes, but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.
Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?
The following could improve performance, if and only if precomputing dimensions within each record is an option.
If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.
This hasn't been benchmarked.
One of the costly parts of this query probably comes from working with dates:
first during the $group stage, when computing, for each matching record, the year and the ISO week in a specific time zone;
then, to a lesser extent, during the initial filtering, when keeping dates from the last 3 months.
The idea would be to store in each record the year and the ISO week; for the given example this would be { "year" : 2018, "week" : 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent 1.3M complex date operations).
In a similar fashion, we could also store in each record the associated month, which would be { "month" : "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5] before applying a more precise and costlier filtering on the exact timestamps. This would reduce the initial, costlier Date filtering on 200M records to a simple Int filtering.
Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):
db.large_collection.aggregate([
    { $addFields: {
        "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
        "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
        "prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
    }},
    { "$out": "large_collection_precomputed" }
])
which will store these documents:
{
    "dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
    "metrics" : { "cost" : 155, "likes" : 200 },
    "prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}
And let's query:
db.large_collection_precomputed.aggregate([
    // Initial coarse filtering on dates (months) (on 200M documents):
    { $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
    { $match: {
        "dimensions.account_id": { $in: [
            ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
        ]}
    }},
    // Exact filtering on dates (costlier, but only on ~1.5M documents):
    { $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
    { $group: {
        // The _id is now extremely fast to retrieve:
        _id: { year: "$prec.year", week: "$prec.week" },
        cost: { $sum: "$metrics.cost" },
        likes: { $sum: "$metrics.likes" }
    }},
    ...
])
In this case we would use indexes on account_id and month.
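For instance, a compound index covering both match stages might look like this (a sketch; the field order follows the usual equality-before-range guideline and should be benchmarked):
db.large_collection_precomputed.createIndex(
    // account_id is matched by equality ($in), month by range.
    { "dimensions.account_id": 1, "prec.month": 1 }
)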
Note: Here, months are stored as String ("201803") since I'm not sure how to cast them to Int within an aggregation query. It would be best to store them as Int when the records are inserted.
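For what it's worth, an Int month key can in fact be computed arithmetically inside the pipeline, with no string cast involved. A sketch, reusing the same Europe/Amsterdam timezone:
{ $addFields: {
    // year * 100 + month yields e.g. 201803 as an integer.
    "prec.month": {
        $add: [
            { $multiply: [ { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }, 100 ] },
            { $month: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
        ]
    }
}}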
As a side effect, this will obviously increase the disk/RAM footprint of the collection.

MongoDB Aggregate for a sum on a per week basis for all prior weeks

I've got a series of docs in MongoDB. An example doc would be
{
    createdAt: Mon Oct 12 2015 09:45:20 GMT-0700 (PDT),
    year: 2015,
    week: 41
}
Imagine these span all weeks of the year and there can be many in the same week. I want to aggregate them in such a way that the resulting values are a sum of each week and all its prior weeks counting the total docs.
So if there were something like 10 in the first week of the year and 20 in the second, the result could be something like
[{ week: 1, total: 10, weekTotal: 10},
{ week: 2, total: 30, weekTotal: 20}]
Creating an aggregation to find the weekTotal is easy enough; including a projection to show the first part:
db.collection.aggregate([
    {
        $project: {
            "createdAt": 1,
            year: { $year: "$createdAt" },
            week: { $week: "$createdAt" },
            _id: 0
        }
    },
    {
        $group: {
            _id: { year: "$year", week: "$week" },
            weekTotal: { $sum: 1 }
        }
    }
]);
But getting past this to sum based on that week and those weeks preceding is proving tricky.
The aggregation framework is not able to do this as all operations can only effectively look at one document or grouping boundary at a time. In order to do this on the "server" you need something with access to a global variable to keep the "running total", and that means mapReduce instead:
db.collection.mapReduce(
    function() {
        // ISO week number helper used by the map function.
        Date.prototype.getWeekNumber = function() {
            var d = new Date(+this);
            d.setHours(0, 0, 0);
            d.setDate(d.getDate() + 4 - (d.getDay() || 7));
            return Math.ceil((((d - new Date(d.getFullYear(), 0, 1)) / 8.64e7) + 1) / 7);
        };
        emit({ year: this.createdAt.getFullYear(), week: this.createdAt.getWeekNumber() }, 1);
    },
    function(key, values) {
        return Array.sum(values);
    },
    {
        out: { inline: 1 },
        scope: { total: 0 },
        finalize: function(key, value) {
            // "total" comes from "scope" above and acts as the running global.
            total += value;
            return { total: total, weekTotal: value };
        }
    }
)
If you can live with the operation occurring on the "client", then you need to loop through the aggregation result and similarly sum up the totals:
var total = 0;
db.collection.aggregate([
    { "$group": {
        "_id": {
            "year": { "$year": "$createdAt" },
            "week": { "$week": "$createdAt" }
        },
        "weekTotal": { "$sum": 1 }
    }},
    { "$sort": { "_id": 1 } }
]).map(function(doc) {
    total += doc.weekTotal;
    doc.total = total;
    return doc;
});
It's all a matter of whether it makes more sense to you for this to happen on the server or on the client. But since the aggregation pipeline has no such "globals", you probably should not be looking at it for any further processing without outputting to another collection anyway.
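That said, on MongoDB 5.0 or newer the pipeline can do this on the server after all: $setWindowFields maintains a cumulative window over sorted documents. A sketch, not part of the original answer:
db.collection.aggregate([
    { "$group": {
        "_id": { "year": { "$year": "$createdAt" }, "week": { "$week": "$createdAt" } },
        "weekTotal": { "$sum": 1 }
    }},
    { "$setWindowFields": {
        "partitionBy": "$_id.year",
        "sortBy": { "_id.week": 1 },
        "output": {
            // Running total over the current week and all preceding weeks.
            "total": {
                "$sum": "$weekTotal",
                "window": { "documents": [ "unbounded", "current" ] }
            }
        }
    }}
])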

Find last record of each day

I store data about my power consumption; each minute there is a new record. Here is an example:
{"date":1393156826114,"id":"5309d4cae4b0fbd904cc00e1","adco":"O","hchc":7267599,"hchp":10805900,"hhphc":"g","ptec":"c","iinst":13,"papp":3010,"imax":58,"optarif":"s","isousc":60,"motdetat":"Á"}
such that I have around 1440 records a day.
How can I get the last record of each day?
Note: I use mongodb in spring java, so I need a query like this:
Example to get all measures:
@Query("{ 'date' : { $gt : ?0 }}")
public List<Mesure> findByDateGreaterThan(Date date, Sort sort);
A bit more modern than the original answer:
db.collection.aggregate([
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": {
            "$subtract": [ "$date", { "$mod": [ "$date", 86400000 ] } ]
        },
        "doc": { "$last": "$$ROOT" }
    }},
    { "$replaceRoot": { "newRoot": "$doc" } }
])
The same principle applies that you essentially $sort the collection and then $group on the required grouping key picking up the $last data from the grouping boundary.
What has become a bit clearer since the original writing is that you can use $$ROOT instead of specifying every document property, and the $replaceRoot stage allows you to restore that data fully in its original document form.
But the general solution is still to $sort first, then $group on the required common key, keeping the $last or $first occurrence (depending on sort order) from the grouping boundary for the properties that are required.
Also, for BSON Dates, as opposed to a timestamp value as in the question, see Group result by 15 minutes time interval in MongoDb for different approaches on how to accumulate over different time intervals while actually using and returning BSON Date values.
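Likewise, on MongoDB 5.0+ with real BSON dates, $dateTrunc produces the day boundary directly, timezone included. A sketch, assuming a BSON date field named date rather than the numeric timestamp from the question (the Europe/Paris zone is illustrative):
db.collection.aggregate([
    { "$sort": { "date": 1 } },
    { "$group": {
        // Truncate each BSON date to its local midnight.
        "_id": { "$dateTrunc": { "date": "$date", "unit": "day", "timezone": "Europe/Paris" } },
        "doc": { "$last": "$$ROOT" }
    }},
    { "$replaceRoot": { "newRoot": "$doc" } }
])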
Not quite sure what you are going for here but you could do this in aggregate if my understanding is right. So to get the last record for each day:
db.collection.aggregate([
    // Sort in date order as ascending
    { "$sort": { "date": 1 } },
    // Date math converts to whole day
    { "$project": {
        "adco": 1,
        "hchc": 1,
        "hchp": 1,
        "hhphc": 1,
        "ptec": 1,
        "iinst": 1,
        "papp": 1,
        "imax": 1,
        "optarif": 1,
        "isousc": 1,
        "motdetat": 1,
        "date": 1,
        "wholeDay": { "$subtract": [ "$date", { "$mod": [ "$date", 86400000 ] } ] }
    }},
    // Group on wholeDay ( _id insertion is monotonic )
    { "$group": {
        "_id": "$wholeDay",
        "docId": { "$last": "$_id" },
        "adco": { "$last": "$adco" },
        "hchc": { "$last": "$hchc" },
        "hchp": { "$last": "$hchp" },
        "hhphc": { "$last": "$hhphc" },
        "ptec": { "$last": "$ptec" },
        "iinst": { "$last": "$iinst" },
        "papp": { "$last": "$papp" },
        "imax": { "$last": "$imax" },
        "optarif": { "$last": "$optarif" },
        "isousc": { "$last": "$isousc" },
        "motdetat": { "$last": "$motdetat" },
        "date": { "$last": "$date" }
    }}
])
So the principle here is: given the timestamp value, do the date math to project it as the midnight at the beginning of each day. Then, as the _id key on the document is already monotonic (always increasing), simply group on the wholeDay value while pulling the $last document from the grouping boundary.
If you don't need all the fields then only project and group on the ones you want.
And yes you can do this in the spring data framework. I'm sure there is a wrapped command in there. But otherwise, the incantation to get to the native command goes something like this:
mongoOps.getCollection("yourCollection").aggregate( ... )
For the record, if you actually had BSON date types rather than a timestamp as a number, then you can skip the date math:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "year": { "$year": "$date" },
            "month": { "$month": "$date" },
            "day": { "$dayOfMonth": "$date" }
        },
        "hchp": { "$last": "$hchp" }
    }}
])
It's also possible to format timestamps in the group key as %Y-%m-%d (e.g. 2021-12-05) with $dateToString:
// { timestamp: 1638697946000, value: "a" } <= 2021-12-05 9:52:26
// { timestamp: 1638686311000, value: "b" } <= 2021-12-05 6:38:31
// { timestamp: 1638859111000, value: "c" } <= 2021-12-07 6:38:31
db.collection.aggregate([
    { $sort: { timestamp: 1 } },
    // { timestamp: 1638686311000, value: "b" }
    // { timestamp: 1638697946000, value: "a" }
    // { timestamp: 1638859111000, value: "c" }
    { $group: {
        _id: { $dateToString: { date: { $toDate: "$timestamp" }, format: "%Y-%m-%d" } },
        last: { $last: "$$ROOT" }
    }},
    // { _id: "2021-12-07", last: { timestamp: 1638859111000, value: "c" } }
    // { _id: "2021-12-05", last: { timestamp: 1638697946000, value: "a" } }
    { $replaceWith: "$last" }
])
// { timestamp: 1638697946000, value: "a" } <= 2021-12-05 9:52:26
// { timestamp: 1638859111000, value: "c" } <= 2021-12-07 6:38:31
This pipeline:
- first $sorts documents in chronological order of their timestamps, so that we can later pick the newest document based on that order;
- then $groups documents by their %Y-%m-%d-formatted timestamps:
    - first converting the timestamp into a datetime: { $toDate: "$timestamp" },
    - then converting the associated datetime into a string representing only the year, month and day: { $dateToString: { date: ..., format: "%Y-%m-%d" } },
- such that for each group (i.e. each day) we can pick the $last (i.e. the newest, since chronologically sorted) matching document,
- where the pick is the whole document, as represented by $$ROOT;
- and finally cleans up the group result with a $replaceWith stage (an alias for $replaceRoot).

How to aggregate by year-month-day on a different timezone

I have a MongoDB which stores date objects in UTC. I want to perform aggregation by year, month and day in a different timezone (CET).
Doing this works fine for UTC:
BasicDBObject group_id = new BasicDBObject("_id",
        new BasicDBObject("year", new BasicDBObject("$year", "$tDate"))
            .append("month", new BasicDBObject("$month", "$tDate"))
            .append("day", new BasicDBObject("$dayOfMonth", "$tDate"))
            .append("customer", "$customer"));
BasicDBObject groupFields = group_id
        .append("eventCnt", new BasicDBObject("$sum", "$eventCnt"));
BasicDBObject group = new BasicDBObject("$group", groupFields);
or, if you use the command line (not tested, I only tested the java version):
{
    $group: {
        _id: {
            "year": { "$year": "$tDate" },
            "month": { "$month": "$tDate" },
            "day": { "$dayOfMonth": "$tDate" },
            "customer": "$customer"
        },
        "eventCnt": {
            "$sum": "$eventCnt"
        }
    }
}
How do I convert these dates into CET inside the aggregation framework?
For example '2013-09-16 23:45:00 UTC' is '2013-09-17 00:45:00 CET', this is a different day.
I'm not an expert on CET and its relation to UTC, but the following code (for the shell) should do a proper conversion (adding an hour) to a MongoDB date type:
db.dates.aggregate(
{$project: {"tDate":{$add: ["$tDate", 60*60*1000]}, "eventCount":1, "customer":1}}
)
If you run that project command before the rest of your pipeline, the results should be in CET.
You can provide the timezone to the date operators starting in 3.6.
Replace the timezone with your timezone.
{
    "$group": {
        "_id": {
            "year": { "$year": { "date": "$tDate", "timezone": "America/Chicago" } },
            "month": { "$month": { "date": "$tDate", "timezone": "America/Chicago" } },
            "dayOfMonth": { "$dayOfMonth": { "date": "$tDate", "timezone": "America/Chicago" } }
        },
        "count": { "$sum": 1 }
    }
}
After searching for hours, this is the solution that worked for me. It is also very simple: just convert the timezone by subtracting the timezone offset in milliseconds.
// 25200000 = 7-hour offset (420 min * 60 sec * 1000 ms)
$group: {
    _id: {
        year: { $year: { $subtract: [ "$timestamp", 25200000 ] } },
        month: { $month: { $subtract: [ "$timestamp", 25200000 ] } },
        day: { $dayOfMonth: { $subtract: [ "$timestamp", 25200000 ] } }
    },
    count: {
        $sum: 1
    }
}
Use, for example, moment.js to determine the current timezone offset for CET; this way you get the correct summer and winter offsets:
var offsetCETmillisec = moment.tz.zone('Europe/Berlin').offset(moment()) * 60 * 1000;
{
    $group: {
        _id: {
            'year': { '$year': { $subtract: [ '$createdAt', offsetCETmillisec ] } },
            'month': { '$month': { $subtract: [ '$createdAt', offsetCETmillisec ] } },
            'day': { '$dayOfMonth': { $subtract: [ '$createdAt', offsetCETmillisec ] } }
        },
        count: { $sum: 1 }
    }
}
MongoDB's documentation suggests that you save the timezone offset alongside the timestamp:
var now = new Date();
db.data.save({ date: now, offset: now.getTimezoneOffset() });
This is of course not the ideal solution, but one that works until MongoDB's aggregation pipeline has a proper $utcOffset function.
The solution with timezone is a good one, but in version 3.6 you can also format the output using the timezone, so you get the result ready for use:
[
    {
        "$project": {
            "year_month_day": { "$dateToString": { "format": "%Y-%m-%d", "date": "$tDate", "timezone": "America/Chicago" } }
        }
    },
    {
        "$group": {
            "_id": "$year_month_day",
            "count": { "$sum": 1 }
        }
    }
]
Make sure that your "$match" also considers timezone, or else you will get wrong results.
Mongo stores dates in UTC, so this is the procedure to get them in another zone:
- Check that mongo saves the dates in UTC, insert some records, etc.
- Get the timezone offset for your zone with moment-timezone.js, e.g. moment().tz('Europe/Zagreb').utcOffset() (note that it returns minutes, so convert it to milliseconds).
- Prepare $gte and $lte for the $match stage (e.g. user input for dates 1.1.2019 - 13.1.2019):
    - if the offset is positive, subtract it in the $match stage;
    - if the offset is negative, add it in the $match stage.
- Then normalize the dates (because the $match stage will return them in UTC) to your zone:
    - if the offset is positive, add it in the $project stage;
    - if the offset is negative, subtract it in the $project stage.
- $group goes last; this is important (because we want to group the normalized results, not the $match-ed ones).
Basically it is this: shift the input(s) to $match (UTC), and then normalize to your timezone.
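A sketch of that procedure in the shell; the collection and field names and the fixed one-hour offset for Europe/Zagreb are illustrative, and in practice the offset would come from moment-timezone as above:
// Offset in milliseconds (here UTC+1, i.e. a positive offset).
var offsetMillis = 60 * 60 * 1000;

db.collection.aggregate([
    // Shift the user's local input range to UTC: positive offset => subtract.
    { $match: {
        date: {
            $gte: new Date(ISODate("2019-01-01T00:00:00Z").getTime() - offsetMillis),
            $lt: new Date(ISODate("2019-01-14T00:00:00Z").getTime() - offsetMillis)
        }
    }},
    // Normalize the matched UTC dates back to the zone: positive offset => add.
    { $project: { localDate: { $add: [ "$date", offsetMillis ] } } },
    // Group last, on the normalized (local) day.
    { $group: {
        _id: { $dateToString: { format: "%Y-%m-%d", date: "$localDate" } },
        count: { $sum: 1 }
    }}
])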
For completeness, the current date parts can also be computed on the application side, e.g. in PHP:
<?php
// Build today's date string in a specific timezone.
date_default_timezone_set('Asia/Karachi');
$date = getdate(date("U"));
$day = $date['mday'];
$month = $date['mon'];
$year = $date['year'];
$currentDate = $year.'-'.$month.'-'.$day;
?>