MongoDB Aggregate Time Series - mongodb

I'm using MongoDB to store time series data using a similar structure to "The Document-Oriented Design" explained here: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
The objective is to query for the top 10 busiest minutes of the day on the whole system. Each document stores 1 hour of data using 60 sub-documents (1 for each minute). Each minute stores various metrics embedded in the "vals" field. The metric I care about is "orders". A sample document looks like this:
{
"_id" : ObjectId("54d023802b1815b6ef7162a4"),
"user" : "testUser",
"hour" : ISODate("2015-01-09T13:00:00Z"),
"vals" : {
"0" : {
"orders" : 11,
"anotherMetric": 15
},
"1" : {
"orders" : 12,
"anotherMetric": 20
},
.
.
.
}
}
Note there are many users in the system.
I've managed to flatten the structure (somewhat) by doing an aggregate with the following group object:
group = {
$group: {
_id: {
hour: "$hour"
},
0: {$sum: "$vals.0.orders"},
1: {$sum: "$vals.1.orders"},
2: {$sum: "$vals.2.orders"},
.
.
.
}
}
But that just gives me 24 documents (1 for each hour) with the # of orders for each minute during that hour, like so:
{
"_id" : {
"hour" : ISODate("2015-01-20T14:00:00Z")
},
"0" : 282086,
"1" : 239358,
"2" : 289188,
.
.
.
}
Now I need to somehow get the top 10 minutes of the day from this but I'm not sure how. I suspect it can be done with $project, but I'm not sure how.

You could aggregate as:
$match the documents for the specific date.
Construct the $group and $project objects before querying.
$group by the hour, accumulating all the documents for each minute in an array. Keep the minute number somewhere within each pushed document.
$project a variable docs as $setUnion of all the documents per
hour.
$unwind the documents.
$sort by orders, descending.
$limit to the top 10 documents, which is what we require.
Code:
// Match the whole day we are interested in.
var dayStart = new ISODate("2015-01-09T00:00:00Z");
var dayEnd = new ISODate("2015-01-10T00:00:00Z");
// Build the $group and $project objects for the 60 minutes programmatically.
var group = {};
var set = [];
for(var i=0;i<60;i++){
    group[i] = {$push:{"doc":"$vals."+i,
                       "hour":"$hour",
                       "min":{$literal:i}}};
    set.push("$"+i);
}
group["_id"] = {$hour:"$hour"};
var project = {"docs":{$setUnion:set}};
db.t.aggregate([
    {$match:{"hour":{$gte:dayStart,$lt:dayEnd}}},
    {$group:group},
    {$project:project},
    {$unwind:"$docs"},
    {$sort:{"docs.doc.orders":-1}},
    {$limit:10},
    {$project:{"_id":0,
               "hour":"$_id",
               "doc":"$docs.doc",
               "min":"$docs.min"}}
])
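For what it's worth, on MongoDB 3.4.4+ the same result can be reached without generating 60 group fields, by turning vals into an array with $objectToArray. A minimal sketch, assuming the collection name t and the day range used in the code above:
// Sketch only: flatten the per-minute sub-documents with $objectToArray (MongoDB 3.4.4+).
var dayStart = new ISODate("2015-01-09T00:00:00Z"); // same illustrative day as above
var dayEnd = new ISODate("2015-01-10T00:00:00Z");
db.t.aggregate([
    {$match: {"hour": {$gte: dayStart, $lt: dayEnd}}},
    // Turn {"0": {...}, "1": {...}} into [{k: "0", v: {...}}, ...].
    {$project: {hour: 1, minutes: {$objectToArray: "$vals"}}},
    {$unwind: "$minutes"},
    // Sum orders over all users for every hour/minute slot of the day.
    {$group: {
        _id: {hour: "$hour", min: "$minutes.k"},
        orders: {$sum: "$minutes.v.orders"}
    }},
    {$sort: {orders: -1}},
    {$limit: 10}
])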

Related

MongoDB - $push accumulator is slowing query down

To develop a web application, I use MongoDB as the back end and I need to retrieve data from it. On a particular page I need to retrieve the price history for specific brands. In my MongoDB collection, here is how a document/product is saved:
{
    brand : "exampleBrand",
    prices : [
        { date : "2022-03-08", price : 1900 },
        { date : "2022-03-09", price : 1910 }
    ]
}
My goal is then to retrieve dates and prices for a specific brand in the following format:
[
    {
        date : "2022-03-08",
        prices : [price_product1, price_product2, ...]
    },
    {
        date : "2022-03-09",
        prices : [price_product1, price_product3, ...]
    }
]
In order to do that, I have designed the following query:
db.Prices.aggregate([
{
$match: {
brand: "exampleBrand",
},
},
{
$project: {
_id: 0,
prices: 1,
},
},
{
$unwind: '$prices',
},
{
$group: {
_id: "$prices.date",
prix: {
$push: "$prices.price",
},
},
},
]);
Once I have these results I can go on with different calculations etc. to display on my page. However, there are approximately 90,000 documents, each of them having on average 30 prices and dates. Thus, the $group stage of the aggregation pipeline is taking a long time.
I have tried different indexes on "prices", "prices.date", and "brand, prices", but none of them seems to speed up the query. I have also tried tweaking and restructuring the query but couldn't find a more efficient way to get my results. Would anyone have an idea of how to achieve this?
Thank you,
For faster querying under these conditions, I think the only way is to use a caching mechanism like Redis or Memcached, because the query, especially the work on the array, costs a lot of I/O and processing in the aggregation pipeline.
P.S. I doubt it is an option, but if you are somehow able to change your data structure so that it is flat, the query would be faster, though not as fast as the caching method.
example:
{
    brand : "exampleBrand",
    price1 : 1900,
    date1 : "2022-03-08",
    price2 : 2100,
    date2 : "2022-03-29"
}
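A minimal sketch of the caching idea, assuming a Node.js back end with the official mongodb and redis drivers; the database name, cache key and TTL are illustrative and not taken from the question:
// Sketch only: cache the per-brand aggregation result in Redis.
const { MongoClient } = require("mongodb");
const { createClient } = require("redis");

async function pricesByDate(db, redis, brand) {
  const cacheKey = `prices-by-date:${brand}`;          // illustrative key name
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);               // serve the cached result

  const result = await db.collection("Prices").aggregate([
    { $match: { brand } },
    { $project: { _id: 0, prices: 1 } },
    { $unwind: "$prices" },
    { $group: { _id: "$prices.date", prix: { $push: "$prices.price" } } },
  ]).toArray();

  await redis.set(cacheKey, JSON.stringify(result), { EX: 3600 }); // cache for 1 hour
  return result;
}

// Usage sketch: connect once at application start-up.
async function main() {
  const mongo = await MongoClient.connect("mongodb://localhost:27017");
  const redis = createClient();                        // assumes a local Redis instance
  await redis.connect();
  console.log(await pricesByDate(mongo.db("shop"), redis, "exampleBrand"));
  await redis.quit();
  await mongo.close();
}
main();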

MongoDB: calculate 90th percentile among all documents

I need to calculate the 90th percentile of the duration where the duration for each document is defined as finish_time - start_time.
My plan was to:
Create $project to calculate that duration in seconds for each document.
Calculate the index (in the sorted documents list) that corresponds to the 90th percentile: 90th_percentile_index = 0.9 * amount_of_documents.
Sort the documents by the $duration variable that was created.
Use the 90th_percentile_index to $limit the documents.
Choose the last document (the maximum duration) out of the limited subset of documents.
I'm new to MongoDB so I guess the query can be improved. So, the query looks like:
db.getCollection('scans').aggregate([
{
$project: {
duration: {
$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000] // duration is in seconds
},
Percentile90Index: {
$multiply: [0.9, "$total_number_of_documents"] // I don't know how to get the total number of documents..
}
}
},
{
$sort : {"duration": 1},
},
{
$limit: "$Percentile90Index"
},
{
$group: {
_id: "_id",
percentiles90 : { $max: "$duration" } // selecting the max, i.e, first document after the limit , should give the result.
}
}
])
The problem I have is that I don't know how to get the total_number_of_documents and therefore I can't calculate the index.
Example:
let's say I have only 3 documents:
{
"_id" : ObjectId("1"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:01:00.000Z"),
}
{
"_id" : ObjectId("2"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:03:00.000Z"),
}
{
"_id" : ObjectId("3"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:08:00.000Z"),
}
So I would expect the result to be something like:
{
percentiles50 : 3 // in minutes, since percentiles50 = 3 is the minimum value such that at least 50% of the documents have duration <= percentiles50
}
I used the 50th percentile in the example because I only gave 3 documents, but it really doesn't matter; please just show me a query for the i-th percentile and it will be fine :-)
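One way to sketch the plan above without knowing the document count in advance is to push all durations into a single array, so that $size gives the total and $arrayElemAt can pick the percentile element; $sortArray requires MongoDB 5.2+ (collection and field names taken from the question):
// Sketch only: collect every duration, sort the array, and pick the element
// at index ceil(0.9 * n) - 1, matching the "limit then take the max" plan.
db.getCollection('scans').aggregate([
    {$project: {duration: {$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000]}}},
    {$group: {_id: null, durations: {$push: "$duration"}}},
    {$project: {
        _id: 0,
        percentile90: {$arrayElemAt: [
            {$sortArray: {input: "$durations", sortBy: 1}},   // MongoDB 5.2+
            {$subtract: [
                {$toInt: {$ceil: {$multiply: [0.9, {$size: "$durations"}]}}},
                1
            ]}
        ]}
    }}
])
Replace 0.9 with whichever i-th percentile you need; on MongoDB 7.0+ the built-in $percentile accumulator is also worth a look.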

MongoDB query for finding schemes with more than a year?

Given the following schema, I want to write a MongoDB query to find all schemes with a duration of more than a year, i.e. (end - start) > 1 year.
I am not sure if I can specify such an expression, (end - start) > 1 year, in a MongoDB query.
{
"update" : ISODate("2017-09-26T15:22:13.172Z"),
"create" : ISODate("2017-09-26T15:22:13.172Z"),
"scheme" : {
"since" : ISODate("2017-09-26T15:22:13.172Z"),
"startDate": ISODate("2017-09-26T15:22:13.172Z"),
"endDate": ISODate("2018-09-26T15:22:13.172Z"),
},
}
You can use aggregation with $subtract:
db.yourcollection.aggregate([
    { $project : { dateDifference: { $subtract: [ "$scheme.endDate", "$scheme.startDate" ] } } },
    { $match : { dateDifference : { $gt : 31536000000 } } }
]);
(* 31536000000 = milliseconds in a 365-day year)
You may use another $project to output any fields you need in the matching documents.
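If you are on MongoDB 5.0 or newer, an alternative sketch avoids the hard-coded milliseconds constant (and handles leap years) by comparing endDate with startDate shifted by one calendar year via $dateAdd:
// Sketch for MongoDB 5.0+: "more than a year" as endDate > startDate + 1 calendar year.
db.yourcollection.aggregate([
    {$match: {$expr: {$gt: [
        "$scheme.endDate",
        {$dateAdd: {startDate: "$scheme.startDate", unit: "year", amount: 1}}
    ]}}}
]);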

Mongodb Aggregation Framework and timestamp

I have a collection
{ "_id" : 1325376000, "value" : 13393}
{ "_id" : 1325462400, "value" : 13393}
The _id values are Unix timestamps, stored manually as numbers at insert time.
Now I'm searching for a way to calculate the sum of the values for each month with the Aggregation Framework.
Here is a way you can do it by generating the aggregation pipeline programmatically:
numberOfMonths=24; /* number of months you want to go back from today's date */
now=new Date();
year=now.getFullYear();
mo=now.getMonth();
months=[];
for (i=0;i<numberOfMonths;i++) {
m1=mo-i+1; m2=m1-1;
d = new Date(year,m1,1);
d2=new Date(year,m2,1);
from= d2.getTime()/1000;
to= d.getTime()/1000;
dt={from:from, to:to, month:d2}; months.push(dt);
}
prev="$nothing";
cond={};
months.forEach(function(m) {
cond={$cond: [{$and :[ {$gte:["$_id",m.from]}, {$lt:["$_id",m.to]} ]}, m.month, prev]};
prev=cond;
} );
/* now you can use "cond" variable in your pipeline to generate month */
db.collection.aggregate([
    { $project: { month: cond, value: 1 } },
    { $group: { _id: "$month", sum: { $sum: "$value" } } }
])
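On newer servers the same monthly totals can be sketched without building a $cond chain, by converting the stored seconds into a date first; $toDate needs MongoDB 4.0+ (collection and field names as above):
// Sketch only: _id holds Unix seconds, while $toDate expects milliseconds.
db.collection.aggregate([
    {$project: {value: 1, ts: {$toDate: {$multiply: ["$_id", 1000]}}}},
    // Group by calendar month, e.g. "2012-01".
    {$group: {
        _id: {$dateToString: {format: "%Y-%m", date: "$ts"}},
        sum: {$sum: "$value"}
    }}
])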

How to count the number of documents on date field in MongoDB

Scenario: Consider that I have the following collection in MongoDB:
{
"_id" : "CustomeID_3723",
"IsActive" : "Y",
"CreatedDateTime" : "2013-06-06T14:35:00Z"
}
Now I want to know the count of documents created on a particular day (say 2013-03-04).
So, I am trying to find a solution using the aggregation framework.
Information:
So far I have the following query built:
collection.aggregate([
{ $group: {
_id: '$CreatedDateTime'
}
},
{ $group: {
_id: null,
count: { $sum: 1 }
}
},
{ $project: {
_id: 0,
"count" :"$count"
}
}
])
Issue: Now, considering the above query, it's giving me a count, but not based only on the date! It's taking the time into consideration as well for the unique count.
Question: Considering the field has an ISO date, can anyone tell me how to count the documents based only on the date (i.e. excluding the time)?
Replace your two groups with
{$project:{day:{$dayOfMonth:'$CreatedDateTime'},month:{$month:'$CreatedDateTime'},year:{$year:'$CreatedDateTime'}}},
{$group:{_id:{day:'$day',month:'$month',year:'$year'}, count: {$sum:1}}}
You can read more about the date operators here: http://docs.mongodb.org/manual/reference/aggregation/#date-operators
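Note that $dayOfMonth and the other date operators require an actual date value; in the sample above CreatedDateTime is an ISO-8601 string, in which case a sketch that groups on the first 10 characters of the string (the calendar date) would work as well:
// Sketch only: works when CreatedDateTime is a string like "2013-06-06T14:35:00Z".
collection.aggregate([
    {$group: {
        _id: {$substrBytes: ["$CreatedDateTime", 0, 10]},   // e.g. "2013-06-06"
        count: {$sum: 1}
    }},
    // Optionally keep just the day you are interested in.
    {$match: {_id: "2013-03-04"}}
])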