How can I aggregate MongoDB documents by hour? - mongodb

I have documents in a collection called "files" in MongoDB 5.0.2. This collection holds user-uploaded content. I want to get a count of the number of files uploaded within the last 24 hours, at a 1-hour interval.
For example, a user uploaded 3 files at 13:00, 2 files at 14:00, 0 files at 15:00, and so on.
I already store a "timestamp" field in my Mongo documents, which is an ISODate created from a JavaScript Date.
I have looked at countless Stack Overflow questions but I cannot find anything that fits my needs or that I understand.

One option would be:
{ $match: { timestamp: { $gt: moment().startOf('hour').subtract(24, 'hours').toDate() } } },
{ $group: {
  _id: {
    y: { $year: "$timestamp" },
    m: { $month: "$timestamp" },
    d: { $dayOfMonth: "$timestamp" },
    h: { $hour: "$timestamp" }
  },
  count: { $sum: 1 }
} }
or in MongoDB 5.0 you can do
{ $group: {
  _id: { $dateTrunc: { date: "$timestamp", unit: "hour" } },
  count: { $sum: 1 }
} }
For any datetime-related operations I recommend the moment.js library.
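Putting it together, a minimal sketch of the full pipeline on MongoDB 5.0+ (assuming the field really is named timestamp; this uses plain Date arithmetic instead of moment, and the $sort is just for readable output):
// Hourly upload counts for the last 24 hours (sketch).
db.files.aggregate([
  { $match: { timestamp: { $gt: new Date(Date.now() - 24 * 60 * 60 * 1000) } } },
  { $group: {
    _id: { $dateTrunc: { date: "$timestamp", unit: "hour" } },
    count: { $sum: 1 }
  } },
  { $sort: { _id: 1 } }
])
Note that hours with zero uploads won't appear in the output at all; you'd either fill those gaps in application code or look at $densify (MongoDB 5.1+).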

How to create time series of paying customers with MongoDB Aggregate?

I have a customers model:
const CustomerSchema = new Schema({
...
activeStartDate: Date,
activeEndDate: Date
...
});
Now I want to create an aggregate that creates a timeseries of active customers. So an output of:
[
  {
    _id: { year: 2022, month: 7 },
    activeCustomers: 500
  },
  ...
]
The issue I can't figure out is how to get one customer document to count in multiple groups. A customer could be active for years, and should therefore appear in multiple timeframes.
One option is:
Create a list of month dates according to the difference in months between activeStartDate and activeEndDate
$unwind to create one document per month
$group by year and month and count the number of customers
db.collection.aggregate([
{$set: {
months: {$map: {
input: {
$range: [
0,
{$add: [
{$dateDiff: {
startDate: "$activeStartDate",
endDate: "$activeEndDate",
unit: "month"
}},
1]}
]
},
in: {$dateAdd: {
startDate: {$dateTrunc: {date: "$activeStartDate", unit: "month"}},
unit: "month",
amount: "$$this"
}}
}}
}},
{$unwind: "$months"},
{$group: {
_id: {year: {$year: "$months"}, month: {$month: "$months"}},
activeCustomers: {$sum: 1}
}}
])
See how it works on the playground example
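To illustrate with a hypothetical customer (not from the original post): for activeStartDate 2022-07-15 and activeEndDate 2022-09-10, $dateDiff returns 2, $range produces [0, 1, 2], and $map/$dateAdd turns that into the months array below, so this one document ends up counted in three groups:
// Intermediate value of "months" after the $set stage (sketch).
[
  ISODate("2022-07-01T00:00:00Z"),
  ISODate("2022-08-01T00:00:00Z"),
  ISODate("2022-09-01T00:00:00Z")
]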

MongoDB aggregation group by month in big collection - optimize pipeline

I'm aware that this question has been asked before on SO, but I can't seem to find how to handle aggregation grouping in bigger collections. I have a set of over 10 million records, and I just can't get any speed out of it.
Running MongoDB v3.2.
The schema has a field __createDateUtc (ISODate), and I'm trying the following pipeline:
db.transactions.aggregate([
{
$project: {
__createDateUtc: 1
}
},
{
$group: {
'_id': { $year: '$__createDateUtc' },
'count': {$sum: 1},
}
},
{
$limit: 10
},
])
This runs at over 20 seconds. Could it be made faster? This is a fairly simple pipeline, so really, is there any other strategy that might help in this situation?
I did some benchmarking with four different ways of getting the results that I wanted. The results are discouraging.
Again, with a schema looking like:
{
"_id" : ObjectId("5d665491fd5852755236a5dc"),
...
"__createDateUtc" : ISODate("2019-08-28T10:16:49Z"),
"__createDate" : {
"year" : 2019,
"month" : 8,
"day" : 28,
"yearMonth" : 201908,
"yearMonthDay" : 20190829
}
}
The results:
// Group by __createDate.yearMonth
db.transactions.aggregate([
{ $group: {
'_id': '$__createDate.yearMonth',
'count': {$sum: 1},
} },
{ $limit: 10 },
{ $sort: {'_id': -1 } }
])
// 20 169 ms
// Group by year and month
db.transactions.aggregate([
{$group: {
'_id': {year: '$__createDate.year', month: '$__createDate.month' },
'count': {$sum: 1},
}},
{ $limit: 10 },
{ $sort: {'_id': -1 } }
])
// 23 777 ms
// Group by calculating year and month from ISODate
db.transactions.aggregate([
{$group: {
'_id': {year: { $year: '$__createDateUtc' }, month: { $month: '$__createDateUtc' } },
'count': {$sum: 1},
}},
{ $limit: 10 },
{ $sort: {'_id': -1 } }
])
// 16 444 ms
// Last stupid method to just run many queries with count
var years = [2017, 2018, 2019];
var results = {}
years.forEach(year => {
results[year] = {};
for(var i = 1; i < 13; i++) {
var count = db.transactions.find({'__createDate.year': year, '__createDate.month': i}).count();
if(count > 0) results[year][i] = count;
}
})
// 10 701 ms
As you can see, the last method of just running multiple counts is by far the fastest, especially since I'm actually fetching a lot more data compared to the three other methods.
This just seems stupid to me. I know MongoDB is no search engine, but still, aggregation is just not fast at all. It makes me want to sync the data to Elasticsearch and try to aggregate within ES instead.
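One thing worth noting (not from the original post): the per-month count() queries in the last approach can typically be answered from an index on the precomputed date parts, without touching the documents themselves, which may explain much of the gap. A hedged sketch, assuming such an index is acceptable:
// Compound index on the precomputed date parts (sketch).
db.transactions.createIndex({ "__createDate.year": 1, "__createDate.month": 1 })
// An exact-match count on both fields can then usually be satisfied
// from the index alone.
db.transactions.find({ "__createDate.year": 2019, "__createDate.month": 8 }).count()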

Aggregation pipeline slow with large collection

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or average). I'm currently running into some performance issues and I'm hoping to get some advice on how I could optimize/scale MongoDB, or suggestions for alternative solutions. I'm running the latest stable MongoDB version using WiredTiger. The documents basically look like the following:
{
"dimensions": {
"account_id": ObjectId("590889944befcf34204dbef2"),
"url": "https://test.com",
"date": ISODate("2018-03-04T23:00:00.000+0000")
},
"metrics": {
"cost": 155,
"likes": 200
}
}
I have three indexes on this collection (sketched just below), as there are various aggregations being run on it:
account_id
date
account_id and date
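Presumably they were created roughly like this (a sketch; the exact index definitions and options weren't posted):
db.large_collection.createIndex({ "dimensions.account_id": 1 })
db.large_collection.createIndex({ "dimensions.date": 1 })
db.large_collection.createIndex({ "dimensions.account_id": 1, "dimensions.date": 1 })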
The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:
db.large_collection.aggregate(
[
{
$match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
},
{
$match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] }}
},
{
$group: {
cost: { $sum: "$metrics.cost" },
likes: { $sum: "$metrics.likes" },
_id: {
year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
}
}
},
{
$project: {
cost: 1,
likes: 1
}
}
],
{
cursor: {
batchSize: 50
},
allowDiskUse: true
}
);
This query takes about 25-30 seconds to complete and I'm looking to reduce this to at least 5-10 seconds. It's currently a single MongoDB node, no shards or anything. The explain query can be found here: https://pastebin.com/raw/fNnPrZh0 and executionStats here: https://pastebin.com/raw/WA7BNpgA As you can see, MongoDB is using indexes but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.
Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?
The following could improve performance, if and only if precomputing dimensions within each record is an option.
If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.
This hasn't been benchmarked.
One of the costly parts of this query probably comes from working with dates.
First, during the $group stage, while computing for each matching record the year and the ISO week associated with a specific time zone.
Then, to a lesser extent, during the initial filtering, when keeping dates from the last 3 months.
The idea would be to store in each record the year and the ISO week; for the given example this would be { "year" : 2018, "week" : 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent 1.3M complex date operations).
In a similar fashion, we could also store in each record the associated month, which would be { "month" : "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5] before applying a more precise and costlier filter on the exact timestamps. This would reduce the initial, costlier Date filtering on 200M records to a simple Int filtering.
Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):
db.large_collection.aggregate([
{ $addFields: {
"prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
"prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
"prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
}},
{ "$out": "large_collection_precomputed" }
])
which will store these documents:
{
"dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
"metrics" : { "cost" : 155, "likes" : 200 },
"prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}
And let's query:
db.large_collection_precomputed.aggregate([
// Initial coarse filtering on dates (months) (on 200M documents):
{ $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
{ $match: {
"dimensions.account_id": { $in: [
ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
]}
}},
// Exact filtering of dates (costlier, but only on ~1.5M documents).
{ $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
{ $group: {
// The _id is now extremely fast to retrieve:
_id: { year: "$prec.year", "week": "$prec.week" },
cost: { $sum: "$metrics.cost" },
likes: { $sum: "$metrics.likes" }
}},
...
])
In this case we would use indexes on account_id and month.
Note: Here, months are stored as String ("201803") since I'm not sure how to cast them to Int within an aggregation query, but it would be best to store them as Int when the records are inserted.
As a side effect, this will obviously make the disk/RAM footprint of the collection heavier.
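A sketch of the supporting index mentioned above (the field names follow the prec.* convention used in this answer; whether it actually helps should be checked with explain()):
// Equality on account_id first, then the range on the precomputed month (sketch).
db.large_collection_precomputed.createIndex({ "dimensions.account_id": 1, "prec.month": 1 })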

how to get year wise and month wise documents which are having 'createdat' field

I need to generate monthly and yearly reports, so how can I get documents grouped year-wise and month-wise using the 'createdat' field?
I suggest you use the aggregation framework.
You need to project the createdat field into year and month fields, as follows:
db.collection.aggregate([{
$project: {
year: {
$year: "$createdat"
},
month: {
$month: "$createdat"
},
log_field: 1,
other_field: 1,
}
}, {
$match: {
"year": 2016
}
}]);
More details about the $project
I realise you might not be able to change the structure of the data you have, but if you do, this might be useful to you...
I do quite a lot of this in our projects, and have always found that the best (as in, simplest and fastest) way is to actually store that kind of info up front, separately.
So as an example, I store the full datetime field (as you do), but then also store the YYYYMM value in its own field, as well as the Day. These values/formats might be different depending on the kind of data, but I've had very good success with it.
(For context, I'm storing financial calculation data, anywhere up to several million records per month, per customer... not something the aggregation framework has managed nicely)
A little sample of the BSON
....
"cd" : ISODate("2016-02-29T22:59:59.999Z"),
"cdym" : "201602",
"cdd" : "29",
....
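With that in place, monthly lookups become plain equality queries on the precomputed fields instead of date computations. A minimal sketch (the collection name is illustrative, and an index on cdym is assumed):
// All documents created in February 2016 (sketch).
db.mycollection.find({ cdym: "201602" })
// Grouping by month then reduces to a trivial $group on the precomputed field.
db.mycollection.aggregate([
  { $group: { _id: "$cdym", count: { $sum: 1 } } }
])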
I found the answer, and this is it:
db.mycollection.aggregate([
{ "$redact": {
"$cond": [
{ "$and": [
{ "$eq": [ { "$year": "$date" }, 2016 ] },
{ "$eq": [ { "$month": "$date" }, 5 ] }
] },
"$$KEEP",
"$$PRUNE"
] }
}
])

Mongo date aggregation operators with ObjectId

I'm trying to use the ObjectId as a creation date holder, and I'm running into some problems when trying to do aggregation queries. In particular, we want to use the date aggregation operators to group the documents by month, but those operators apparently don't work with an ObjectId. Is there a way around it, or would I have to start using a separate CreationTime field per document?
Here is a query I was trying:
db.People.aggregate([
{
$match: {
$or: [{"Name":"Random1"}, {"Name":"Random2"}]
}
},
{
$project: {
Name: "$Name",
Year: {$year: "$_id"},
Month: {$month: "$_id"}
}
},
{
$group: {
_id: {Name: "$Name", Year: "$Year", Month: "$Month"},
TotalRequests: {$sum:1}
}
}
])
Right now, you need to keep a separate field, as the Aggregation Framework cannot deal with this yet. There is a server ticket at https://jira.mongodb.org/browse/SERVER-9406 to implement it, so I suggest you vote for it (I just did).
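If you do go the separate-field route, the creation time embedded in each _id can be used to backfill it. A minimal shell sketch (the CreationTime field name comes from the question; the backfill loop itself is an assumption):
// Backfill CreationTime from the timestamp embedded in each ObjectId (sketch).
db.People.find({ CreationTime: { $exists: false } }).forEach(function (doc) {
  db.People.update(
    { _id: doc._id },
    { $set: { CreationTime: doc._id.getTimestamp() } }
  );
});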