Time difference since previous record in time series aggregation - mongodb

I have a collection of events made by different actors. I now need to count, per actor, the events that occurred at least x amount of time after the previous event.
A more concrete example: a collection of login events made by different actors. Every login event that happened less than 8 hours after the previous login should be ignored. So if I log in at 2pm and again at 11pm, the count should be 2. If I log in at 2pm and again at 5pm, that should count as 1.
I don't really see a viable solution to this problem using the aggregation framework. A possible workaround would be to calculate (and save) the time since the previous event for each record, but I feel there should be a better solution.
Can anyone point me in the right direction? I didn't really find any use cases similar to this one.
If my question isn't clear, let me know!
Edit:
An example:
Simplified events:
[{
  _id: 1,
  actor: "X",
  date: ISODate("2018-09-20T18:00:00.000Z")
},
{
  _id: 2,
  actor: "X",
  date: ISODate("2018-09-21T05:00:00.000Z") // 11 hours since previous
},
{
  _id: 3,
  actor: "X",
  date: ISODate("2018-09-21T07:00:00.000Z") // 2 hours since previous
},
{
  _id: 4,
  actor: "Y",
  date: ISODate("2018-09-21T06:00:00.000Z")
},
{
  _id: 5,
  actor: "Y",
  date: ISODate("2018-09-21T09:00:00.000Z") // 3 hours since previous
}]
Expected output:
[{
  _id: "X",
  count: 2 // 3 events, but one is less than 8 hours after the previous
},
{
  _id: "Y",
  count: 1 // 2 events, but one is less than 8 hours after the previous
}]

You can compare values from different documents by grouping them into an array and iterating over it. In your case $reduce is probably the simplest way:
db.collection.aggregate([
  // ensure chronological order per actor
  { $sort: { date: 1 } },
  // collect all dates per actor
  { $group: { _id: "$actor", dates: { $push: "$date" } } },
  { $addFields: {
    // iterate the dates; the first event always counts (initial count: 1),
    // later events count only if at least 8 hours (28,800,000 ms) have passed
    // since the immediately preceding event
    events: { $reduce: {
      input: "$dates",
      initialValue: { last: null, count: 1 },
      in: {
        last: "$$this",
        count: { $add: [
          "$$value.count",
          { $cond: [
            { $gte: [ { $subtract: [ "$$this", "$$value.last" ] }, 28800000 ] },
            1,
            0
          ] }
        ] }
      }
    } }
  } },
  { $project: { count: "$events.count" } }
])
It's going to be slow on large datasets, though. In that case you may need to pre-aggregate counters at insert time.
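A minimal sketch of that insert-time approach, assuming a hypothetical login_counters collection keyed by actor that stores the timestamp of the last event and a running count (the collection and field names, and the helper itself, are illustrative and not from the original answer):
// Hypothetical insert-time pre-aggregation: per actor, remember the last
// event's timestamp and keep a running count of events that are at least
// 8 hours after the previous one.
function recordLogin(actor, date) {
  db.events.insertOne({ actor: actor, date: date });

  var eightHours = 8 * 60 * 60 * 1000;
  var counter = db.login_counters.findOne({ _id: actor });
  // first event for this actor, or far enough from the previous one
  var countsAsNew = !counter || (date - counter.lastEvent) >= eightHours;

  db.login_counters.updateOne(
    { _id: actor },
    {
      $set: { lastEvent: date },
      $inc: { count: countsAsNew ? 1 : 0 }
    },
    { upsert: true }
  );
}
Note that the read-then-update is not atomic; under concurrent inserts for the same actor you would either serialize per actor or accept small inaccuracies.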

Related

Performance of mongo request for rain/sunshine/raindays on weekends

For the weekends (Sat+Sun) between weeks 20 and 40 over the last 17 years, I want to know:
sum of rain (mm)
sum of sunshine (hours)
probability (%) of a rain day with more than 0.5mm of rain on a weekend
I have 820k documents in 10-minute periods. The request sometimes takes 38 seconds, sometimes more than a minute.
Do you have an idea how to improve performance?
data-Model:
[
'datum',
'regen',
'tempAussen',
'sonnenSchein',
and more...
]
schema:
[
  {
    $project: {
      jahr: { $year: { date: '$datum', timezone: 'Europe/Berlin' } },
      woche: { $week: { date: '$datum', timezone: 'Europe/Berlin' } },
      day: { $isoDayOfWeek: { date: '$datum', timezone: 'Europe/Berlin' } },
      stunde: { $hour: { date: '$datum', timezone: 'Europe/Berlin' } },
      tagjahr: { $dayOfYear: { date: '$datum', timezone: 'Europe/Berlin' } },
      tempAussen: 1,
      regen: 1,
      sonnenSchein: 1,
    },
  },
  {
    $match: {
      $and: [
        { woche: { $gte: 20 } },
        { woche: { $lte: 40 } },
        { day: { $gte: 6 } },
      ],
    },
  },
  {
    $group: {
      _id: ['$tagjahr', '$jahr'],
      woche: { $first: '$woche' },
      regen_sum: { $sum: '$regen' },
      sonnenSchein_sum: { $sum: '$sonnenSchein' },
    },
  },
  {
    $project: {
      _id: '$_id',
      regenTage: {
        $sum: {
          $cond: { if: { $gte: ['$regen_sum', 0.5] }, then: 1, else: 0 },
        },
      },
      woche: 1,
      regen_sum: 1,
      sonnenSchein_sum: 1,
    },
  },
  {
    $group: {
      _id: '$woche',
      regen_sum: { $sum: '$regen_sum' },
      sonnenSchein_sum: { $sum: '$sonnenSchein_sum' },
      regenTage: { $sum: '$regenTage' },
    },
  },
  {
    $project: {
      regenTage: 1,
      regen_sum: { $divide: ['$regen_sum', 34] },
      sonnenSchein_sum: { $divide: ['$sonnenSchein_sum', 2040] },
      probability: { $divide: ['$regenTage', 0.34] },
    },
  },
  {
    $project: {
      regen_sum: { $round: ['$regen_sum', 1] },
      sonnenSchein_sum: { $round: ['$sonnenSchein_sum', 1] },
      wahrscheinlich: { $round: ['$probability', 0] },
    },
  },
  {
    $sort: { _id: 1 },
  },
]
This result is an example for week 20: on the weekends of calendar week 20 there is on average 2.3mm of rain, 11.9h of sunshine, and a 35% probability that it will rain on at least one day of the weekend.
_id: 20
regen_sum: 2.3
sonnenSchein_sum: 11.9
probability: 35
Without having the verbose explain output (.explain("allPlansExecution")), it is hard to say anything for sure. Here are some observations from just taking a look at the aggregation pipeline that was provided (underneath "schema:").
Before going into observations, I must ask what your specific goals are. Are operations like these something you will be running frequently? Is anything faster than 38 seconds acceptable, or is there a specific runtime that you are looking for? As outlined below, there probably isn't much opportunity for direct improvement. Therefore it might be beneficial to look into other approaches to the problem, and I'll outline one at the end.
The first observation is that this aggregation is going to perform a full collection scan. Even if an index existed on the datum field, it could not be used since the filtering in the $match is done on new fields that are calculated from datum. We could make some changes to allow an index to be used, but it probably wouldn't help. You are processing ~38% of your data (20 of the 52 weeks per year) so the overhead of doing the index scan and randomly fetching a significant portion of the data is probably more than just scanning the entire collection directly.
Secondly, you are currently $grouping twice. The only reason for this seems to be so that you can determine if a day is considered 'rainy' first (more than 0.5mm of rain). But the 'rainy day' indicator then effectively gets combined to become a 'rainy weekend' indicator in the second grouping. Although it could technically change the results a little due to the rounding done on the 24 hour basis, perhaps that small change would be worthwhile to eliminate one of the $group stages entirely?
If this were my system, I would consider pre-aggregating some of this data. Specifically having daily summaries as opposed to the 10 minute intervals of raw data would really go a long way here in reducing the amount of processing that is required to generate summaries like this. Details for each day (which won't change) would then be contained in a single document rather than in 144 individual ones. That would certainly allow an aggregation logically equivalent to the one above to process much faster than what you are currently observing.
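A hedged sketch of what such a daily pre-aggregation could look like, using $merge (MongoDB 4.2+) to materialize one summary document per calendar day; the source collection name "weather" and the target "weather_daily" are assumptions for illustration, not from the original answer:
db.weather.aggregate([
  { $group: {
    // one document per calendar day (Europe/Berlin)
    _id: { $dateToString: { format: "%Y-%m-%d", date: "$datum", timezone: "Europe/Berlin" } },
    datum: { $first: "$datum" },
    regen_sum: { $sum: "$regen" },
    sonnenSchein_sum: { $sum: "$sonnenSchein" }
  } },
  { $merge: { into: "weather_daily", whenMatched: "replace", whenNotMatched: "insert" } }
])
The weekly summary pipeline would then run over weather_daily (one document per day instead of 144), applying the same $project/$match/$group logic on the stored datum of each daily document.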

Aggregation pipeline slow with large collection

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or average). I'm currently running into some performance issues and I'm hoping to get advice on how I could optimize/scale MongoDB, or suggestions for alternative solutions. I'm running the latest stable MongoDB version with WiredTiger. The documents basically look like the following:
{
  "dimensions": {
    "account_id": ObjectId("590889944befcf34204dbef2"),
    "url": "https://test.com",
    "date": ISODate("2018-03-04T23:00:00.000+0000")
  },
  "metrics": {
    "cost": 155,
    "likes": 200
  }
}
I have three indexes on this collection, as various aggregations are run against it:
account_id
date
account_id and date
The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:
db.large_collection.aggregate(
  [
    {
      $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
    },
    {
      $match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] } }
    },
    {
      $group: {
        cost: { $sum: "$metrics.cost" },
        likes: { $sum: "$metrics.likes" },
        _id: {
          year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
          week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
        }
      }
    },
    {
      $project: {
        cost: 1,
        likes: 1
      }
    }
  ],
  {
    cursor: {
      batchSize: 50
    },
    allowDiskUse: true
  }
);
This query takes about 25-30 seconds to complete, and I'm looking to reduce this to 5-10 seconds at most. It's currently a single MongoDB node, no shards or anything. The explain output can be found here: https://pastebin.com/raw/fNnPrZh0 and the executionStats here: https://pastebin.com/raw/WA7BNpgA As you can see, MongoDB is using indexes, but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.
Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?
The following could improve performance, if and only if precomputing dimensions within each record is an option.
If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.
This hasn't been benchmarked.
One of the costly parts of this query probably comes from working with dates.
First, during the $group stage, while computing for each matching record the year and the ISO week in a specific time zone.
Then, to a lesser extent, during the initial filtering, when keeping dates from the last 3 months.
The idea would be to store the year and ISO week in each record; for the given example this would be { "year": 2018, "week": 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent ~1.3M complex date operations).
In a similar fashion, we could also store the associated month in each record, which would be { "month": "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5] before applying more precise and costlier filtering on the exact timestamps. This would replace the initial, costlier Date filtering on 200M records with simple integer filtering.
Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):
db.large_collection.aggregate([
  { $addFields: {
    "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    "prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
  }},
  { "$out": "large_collection_precomputed" }
])
which will store these documents:
{
"dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
"metrics" : { "cost" : 155, "likes" : 200 },
"prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}
And let's query:
db.large_collection_precomputed.aggregate([
  // Initial coarse filtering on months (over 200M documents):
  { $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
  { $match: {
    "dimensions.account_id": { $in: [
      ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
    ]}
  }},
  // Exact filtering on dates (costlier, but only on ~1.5M documents):
  { $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
  { $group: {
    // The _id is now extremely fast to compute:
    _id: { year: "$prec.year", week: "$prec.week" },
    cost: { $sum: "$metrics.cost" },
    likes: { $sum: "$metrics.likes" }
  }},
  ...
])
In this case we would use indexes on account_id and month.
Note: here, months are stored as strings ("201803") since I'm not sure how to cast them to an Int within an aggregation query, but it would be best to store them as an Int when the records are inserted.
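For what it's worth, MongoDB 4.0+ has a $toInt operator that can do that cast inside the pipeline, and the suggested indexes can be created once the collection is materialized. A sketch, not benchmarked; the compound index shown is just one possibility:
db.large_collection.aggregate([
  { $addFields: {
    "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    // cast the "201803" string to the integer 201803 (MongoDB 4.0+)
    "prec.month": { $toInt: { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } } }
  }},
  { $out: "large_collection_precomputed" }
])

// One possible index supporting the coarse month filter plus the account and date filters:
db.large_collection_precomputed.createIndex({ "prec.month": 1, "dimensions.account_id": 1, "dimensions.date": 1 })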
As a side effect, this will obviously increase the disk/RAM footprint of the collection.

Aggregation is very slow

I have a collection with a structure similar to this.
{
  "_id": ObjectId("59d7cd63dc2c91e740afcdb"),
  "dateJoined": ISODate("2014-12-28T16:37:17.984Z"),
  "activatedMonth": 5,
  "enrollments": [
    { "month": -10, "enrolled": '00' },
    { "month": -9, "enrolled": '00' },
    { "month": -8, "enrolled": '01' },
    // other months
    { "month": 8, "enrolled": '11' },
    { "month": 9, "enrolled": '11' },
    { "month": 10, "enrolled": '00' }
  ]
}
month in the enrollments subdocuments is a month relative to dateJoined.
activatedMonth is the month of activation relative to dateJoined, so this will be different for each document.
I am using the MongoDB aggregation framework to process queries like "Find all documents that are enrolled from 10 months before activation to 25 months after activation".
"enrolled" values 01, 10, 11 are considered enrolled and 00 is considered not enrolled. For a document to be considered to to be enrolled, it should be enrolled for every month in the range.
I am applying all the filters that I can apply in the match stage but this can be empty in most cases. In projection phase I am trying to find out the all the document with at least one not-enrolled month. if the size is zero, then the document is enrolled.
Below is the query that I am using. It takes 3 to 4 seconds to finish. It is more or less same time with or with out the group phase. My data is relatively smaller in size ( 0.9GB) and total number of documents are 41K and sub document count is approx. 13 million.
I need to reduce the processing time. I tried creating an index on enrollments.month and enrollment.enrolled and is of no use and I think it is because of the fact that project stage cant use indexes. Am I right?
Are there are any other things that I can do to the query or the collection structure to improve performance?
let startMonth = -10;
let endMonth = 25;

mongoose.connection.db.collection("collection").aggregate([
  {
    $match: filters
  },
  {
    $project: {
      _id: 0,
      // count the not-enrolled months that fall inside the relative range
      enrollments: {
        $size: {
          $filter: {
            input: "$enrollments",
            as: "enrollment",
            cond: {
              $and: [
                { $gte: [ "$$enrollment.month", { $add: [ startMonth, "$activatedMonth" ] } ] },
                { $lte: [ "$$enrollment.month", { $add: [ endMonth, "$activatedMonth" ] } ] },
                { $eq: [ "$$enrollment.enrolled", "00" ] }
              ]
            }
          }
        }
      }
    }
  },
  {
    $match: { enrollments: { $eq: 0 } }
  },
  {
    $group: {
      _id: null,
      enrolled: { $sum: 1 }
    }
  }
]).toArray(function (err, result) {
  // some calculations
});
Also, I definitely need the group stage, as I will group the counts based on a different field. I have omitted this for simplicity.
Edit:
I missed a key detail in the initial post. I have updated the question with the actual use case for why I need a projection with a calculation.
Edit 2:
I converted this to just a count query to see how it performs (based on comments on this question by Neil Lunn).
My query:
mongoose.connection.db.collection("collection")
.find({
"enrollment": {
"$not": {
"$elemMatch": { "month": { "$gte": startMonth, "$lte": endMonth }, "enrolled": "00" }
}
}
})
.count(function(e,count){
console.log(count);
});
This query takes 1.6 seconds. I tried the following indexes separately:
1. { 'enrollment.month': 1 }
2. { 'enrollment.month': 1 }, { 'enrollment.enrolled': 1 } -- two separate indexes
3. { 'enrollment.month': 1, 'enrollment.enrolled': 1 } -- just one compound index on both fields
The winning query plan does not use keys in any of these cases; it always does a COLLSCAN. What am I missing here?

MongoDB Aggregate for a sum on a per week basis for all prior weeks

I've got a series of docs in MongoDB. An example doc would be
{
  createdAt: Mon Oct 12 2015 09:45:20 GMT-0700 (PDT),
  year: 2015,
  week: 41
}
Imagine these span all weeks of the year and there can be many in the same week. I want to aggregate them such that the resulting value for each week is the count of docs in that week plus all prior weeks.
So if there were something like 10 in the first week of the year and 20 in the second, the result could be something like
[{ week: 1, total: 10, weekTotal: 10},
{ week: 2, total: 30, weekTotal: 20}]
Creating an aggregation to find the weekTotal is easy enough; here it is, including a projection to show the first part:
db.collection.aggregate([
  {
    $project: {
      "createdAt": 1,
      year: { $year: "$createdAt" },
      week: { $week: "$createdAt" },
      _id: 0
    }
  },
  {
    $group: {
      _id: { year: "$year", week: "$week" },
      weekTotal: { $sum: 1 }
    }
  },
]);
But getting past this to sum based on that week and those weeks preceding is proving tricky.
The aggregation framework is not able to do this as all operations can only effectively look at one document or grouping boundary at a time. In order to do this on the "server" you need something with access to a global variable to keep the "running total", and that means mapReduce instead:
db.collection.mapReduce(
  function () {
    Date.prototype.getWeekNumber = function () {
      var d = new Date(+this);
      d.setHours(0, 0, 0);
      d.setDate(d.getDate() + 4 - (d.getDay() || 7));
      return Math.ceil((((d - new Date(d.getFullYear(), 0, 1)) / 8.64e7) + 1) / 7);
    };
    emit({ year: this.createdAt.getFullYear(), week: this.createdAt.getWeekNumber() }, 1);
  },
  function (key, values) {
    return Array.sum(values);
  },
  {
    out: { inline: 1 },
    scope: { total: 0 },
    finalize: function (key, value) {
      total += value;
      return { total: total, weekTotal: value };
    }
  }
)
If you can live with the operation occurring on the "client", then you need to loop through the aggregation result and similarly sum up the totals:
var total = 0;
db.collection.aggregate([
  { "$group": {
    "_id": {
      "year": { "$year": "$createdAt" },
      "week": { "$week": "$createdAt" }
    },
    "weekTotal": { "$sum": 1 }
  }},
  { "$sort": { "_id": 1 } }
]).map(function (doc) {
  total += doc.weekTotal;
  doc.total = total;
  return doc;
});
It's all a matter of whether it makes more sense to you for this to happen on the server or on the client. But since the aggregation pipeline has no such "globals", you probably should not be looking at it for any further processing without outputting to another collection anyway.
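As an aside, on MongoDB 5.0 or newer the aggregation framework can express a running total directly with $setWindowFields; a hedged sketch, not part of the original answer and untested against this data:
db.collection.aggregate([
  { $group: {
    _id: { year: { $year: "$createdAt" }, week: { $week: "$createdAt" } },
    weekTotal: { $sum: 1 }
  }},
  // running total over all weeks up to and including the current one
  { $setWindowFields: {
    sortBy: { "_id.year": 1, "_id.week": 1 },
    output: {
      total: { $sum: "$weekTotal", window: { documents: ["unbounded", "current"] } }
    }
  }}
])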

How to handle partial week data grouping in mongodb

I have some docs (daily open price for a stock) like the followings:
/* 0 */
{
  "_id": ObjectId("54d65597daf0910dfa8169b0"),
  "D": ISODate("2014-12-29T00:00:00.000Z"),
  "O": 104.98
}
/* 1 */
{
  "_id": ObjectId("54d65597daf0910dfa8169af"),
  "D": ISODate("2014-12-30T00:00:00.000Z"),
  "O": 104.73
}
/* 2 */
{
  "_id": ObjectId("54d65597daf0910dfa8169ae"),
  "D": ISODate("2014-12-31T00:00:00.000Z"),
  "O": 104.51
}
/* 3 */
{
  "_id": ObjectId("54d65597daf0910dfa8169ad"),
  "D": ISODate("2015-01-02T00:00:00.000Z"),
  "O": 103.75
}
/* 4 */
{
  "_id": ObjectId("54d65597daf0910dfa8169ac"),
  "D": ISODate("2015-01-05T00:00:00.000Z"),
  "O": 102.5
}
and I want to aggregate the records by week so I can get the weekly average open price. My first attempt is to use:
db.ohlc.aggregate({
  $match: {
    D: { $gte: new ISODate('2014-12-28') }
  }
}, {
  $project: {
    year: { $year: '$D' },
    week: { $week: '$D' },
    O: 1
  }
}, {
  $group: {
    _id: { year: '$year', week: '$week' },
    O: { $avg: '$O' }
  }
}, {
  $sort: { _id: 1 }
})
But I soon realized the result is incorrect, as both the last week of 2014 (week number 52) and the first week of 2015 (week number 0) are partial weeks. With this aggregation I would get an average price for 12/29-12/31/2014 and another one for 01/02/2015 (which is the only trading date in the first week of 2015), but in my application I need to group the data from 12/29/2014 through 01/02/2015. Any advice?
To answer my own question, the trick is to calculate the number of weeks based on a reference date (1970-01-04) and group by that number. You can check out my new post at http://midnightcodr.github.io/2015/02/07/OHLC-data-grouping-with-mongodb/ for details.
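In case the link goes away, here is a minimal sketch of that idea (my reconstruction, not the linked post's exact code): 1970-01-04 is a Sunday, so dividing the milliseconds elapsed since then by one week and flooring yields a week number that doesn't reset at year boundaries:
db.ohlc.aggregate([
  { $match: { D: { $gte: ISODate("2014-12-28") } } },
  { $project: {
    O: 1,
    // whole weeks elapsed since Sunday 1970-01-04
    weekNumber: {
      $floor: {
        $divide: [
          { $subtract: ["$D", ISODate("1970-01-04T00:00:00Z")] },
          1000 * 60 * 60 * 24 * 7
        ]
      }
    }
  } },
  { $group: { _id: "$weekNumber", O: { $avg: "$O" } } },
  { $sort: { _id: 1 } }
])
With this grouping, 12/29/2014 through 01/02/2015 all land in the same week bucket (the week running Sunday 12/28 to Saturday 01/03).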
I use this to build candles; with allowDiskUse, $out and some date filters it works great. Maybe you can adapt the grouping?
db.getCollection('market').aggregate(
  [
    { $match: { date: { $exists: true } } },
    { $sort: { date: 1 } },
    // `date` is assumed to be a numeric Unix timestamp in seconds, so $mod by 900
    // gives the offset into the current 15-minute bucket
    { $project: { _id: 0, date: 1, rate: 1, amount: 1, tm15: { $mod: [ "$date", 900 ] } } },
    { $project: { _id: 0, date: 1, rate: 1, amount: 1, candleDate: { $subtract: [ "$date", "$tm15" ] } } },
    { $group: { _id: "$candleDate", open: { $first: '$rate' }, low: { $min: '$rate' }, high: { $max: '$rate' }, close: { $last: '$rate' }, volume: { $sum: '$amount' }, trades: { $sum: 1 } } }
  ]
)
In my experience, this is not a really good approach to the problem. Why? It will definitely not scale; the amount of computation needed is considerable, especially for the grouping.
What I would do in your situation is move part of the application logic into the documents in the DB.
My first approach would be to add a "week" field that holds the previous (or next) Sunday of the date the sample belongs to. This is quite easy to do at insertion time; you can then simply run the aggregation grouping by that field. If you want more performance, add an index on { symbol: 1, week: 1 } and sort in the aggregate. A rough sketch of this follows below.
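A hedged sketch of that insert-time computation; the helper name and collection name are illustrative:
// Compute the previous Sunday (at midnight UTC) for a given date and store it
// alongside the sample at insert time.
function previousSunday(d) {
  var sunday = new Date(d);
  sunday.setUTCHours(0, 0, 0, 0);
  sunday.setUTCDate(sunday.getUTCDate() - sunday.getUTCDay()); // getUTCDay(): 0 = Sunday
  return sunday;
}

db.ohlc.insertOne({
  D: ISODate("2015-01-02T00:00:00Z"),
  O: 103.75,
  week: previousSunday(ISODate("2015-01-02T00:00:00Z")) // ISODate("2014-12-28T00:00:00Z")
});

// Weekly averages then become a plain group on the precomputed field:
db.ohlc.aggregate([
  { $group: { _id: "$week", O: { $avg: "$O" } } },
  { $sort: { _id: 1 } }
])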
My second approach, for the case where you plan on running a lot of this type of aggregation, is basically to have documents that group the samples by week, like this:
{
  week: <Day Representing Week>,
  prices: [
    { Day Sample }, ...
  ]
}
Then you can simply work on those documents directly. This will reduce your indexes significantly, thus speeding things up.
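A sketch of how such weekly bucket documents could be maintained at insert time (again, the collection and field names are illustrative and this hasn't been benchmarked):
// Append each daily sample to its week's bucket document, creating the bucket
// on first use. `week` is the previous Sunday, as computed in the sketch above.
function addSample(symbol, day, open) {
  db.ohlc_weekly.updateOne(
    { symbol: symbol, week: previousSunday(day) },
    { $push: { prices: { D: day, O: open } } },
    { upsert: true }
  );
}

// Weekly average open price straight from the buckets:
db.ohlc_weekly.aggregate([
  { $project: { symbol: 1, week: 1, O: { $avg: "$prices.O" } } },
  { $sort: { week: 1 } }
])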