MongoDB Aggregation for Time Series data

I have the following document structure (events collection) for time series data. I decided on this structure based on the article at https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports-mmapv1/
{
    "_id" : "06062017/cpu",
    "metadata" : {
        "host" : "localhost",
        "createdDate" : ISODate("2017-06-09T11:18:56.120Z"),
        "metric" : "cpu"
    },
    "hourly" : {
        "0" : { "total" : 2, "used" : 1 },
        "1" : { "total" : 3, "used" : 2 }
    },
    "minute" : {
        "0" : {
            "0" : { "total" : 9789789, "used" : 353 },
            "1" : { "total" : 0, "used" : 0 }
        },
        "1" : {
            "0" : { "total" : 0, "used" : 0 },
            "1" : { "total" : 234234, "used" : 123 }
        }
    }
}
Now I am trying to compute the average CPU used per hour from the minute values and store it in hourly. Aggregation and map functions work well on lists and arrays, but I could not get the per-hour average from the minute subdocuments; I would also like the average per day.
I am using Java as the programming language. Is it better to calculate this within the application or with the aggregation framework?
Any help is greatly appreciated.
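On the application-versus-aggregation question: with dynamic keys like these, a client-side pass is often the simplest. Here is a minimal sketch in plain JavaScript (not Java, and assuming the document shape above; `hourlyAverages` is an illustrative name) that derives the per-hour average of "used" from the minute buckets:

```javascript
// Sketch: derive per-hour average "used" from the nested minute buckets.
// Assumes the document shape from the question; names are illustrative.
function hourlyAverages(doc) {
  const result = {};
  for (const [hour, minutes] of Object.entries(doc.minute)) {
    let sum = 0, count = 0;
    for (const sample of Object.values(minutes)) {
      sum += sample.used;
      count += 1;
    }
    result[hour] = count ? sum / count : 0;
  }
  return result;
}

const doc = {
  minute: {
    "0": { "0": { total: 9789789, used: 353 }, "1": { total: 0, used: 0 } },
    "1": { "0": { total: 0, used: 0 }, "1": { total: 234234, used: 123 } }
  }
};
console.log(hourlyAverages(doc)); // { '0': 176.5, '1': 61.5 }
```

Server-side, MongoDB 3.4.4+ can do the same with $objectToArray plus $avg, but the dynamic keys make that pipeline considerably more verbose than the client-side loop.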

Related

What's the meaning of Mongo $minKey?

This page:
https://docs.mongodb.com/manual/reference/operator/query/type/
shows the example document
{ "date": new Date(1393804800000), "grade": MaxKey(), "score": 2 },
When I inspect MaxKey() in the mongo shell:
MaxKey().help
The MaxKey BSON Class.
For more information on usage: https://mongodb.github.io/node-mongodb-native/3.6/api/MaxKey.html
How can I understand it?
Should I compare it with "$lt" or "$gt", like this?
db.test.find({"grades.grade": {"$gt":"a"}})
MinKey and MaxKey are MongoDB internal types. Their purpose is to represent the theoretical extremes.
MinKey is less than any value, and MaxKey is greater than any value, regardless of type.
See Comparison/Sort Order
I think minKey() or maxKey() is just a special value which can only be queried by { $type : "maxKey" }
If the data is:
{
    "_id" : 2,
    "grades" : [
        {
            "date" : ISODate("2014-03-03T00:00:00.000Z"),
            "grade" : { "$maxKey" : 1 },
            "score" : 2
        },
        {
            "date" : ISODate("2013-01-24T00:00:00.000Z"),
            "grade" : { "$maxKey" : 1 },
            "score" : 3
        }
    ]
}
then
db.test.find({"grades.grade": {"$gt":"A"}})
will return nothing. But
db.test.find({"grades.grade" : { $type : "maxKey" }})
will return:
{
    "_id" : 2,
    "grades" : [
        {
            "date" : ISODate("2014-03-03T00:00:00.000Z"),
            "grade" : { "$maxKey" : 1 },
            "score" : 2
        },
        {
            "date" : ISODate("2013-01-24T00:00:00.000Z"),
            "grade" : { "$maxKey" : 1 },
            "score" : 3
        }
    ]
}

Query Time series data based on Date

If I have the document below, I would like to return the same document but with only the hourly entry selected by the minute inside the date field.
Does anyone know how to do this dynamically? If the date has minute 0, I want hourly.0 returned, and so on up to the 59th minute.
{
    "_id" : "08062017/cpu",
    "date" : ISODate("2018-04-11T02:01:00.000Z"),
    "metadata" : {
        "host" : "localhost",
        "metric" : "cpu"
    },
    "hourly" : {
        "0" : { "total" : 234, "used" : 123 },
        "1" : { "total" : 234, "used" : 123 }
    }
}
RESULT:
{
    "_id" : "08062017/cpu",
    "date" : ISODate("2018-04-11T02:01:00.000Z"),
    "metadata" : {
        "host" : "localhost",
        "metric" : "cpu"
    },
    "hourly" : {
        "0" : { "total" : 234, "used" : 123 }
    }
}
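Following the rule stated in the question (minute N selects hourly.N), here is a client-side sketch, assuming the example document's shape (`pickHourlyByMinute` is an illustrative name, not a driver API):

```javascript
// Sketch: keep only the hourly bucket matching the document's minute.
// Field names follow the example document; this runs client-side.
function pickHourlyByMinute(doc) {
  const minute = String(doc.date.getUTCMinutes());
  return { ...doc, hourly: { [minute]: doc.hourly[minute] } };
}

const doc = {
  _id: "08062017/cpu",
  date: new Date("2018-04-11T02:01:00.000Z"),
  metadata: { host: "localhost", metric: "cpu" },
  hourly: { "0": { total: 234, used: 123 }, "1": { total: 234, used: 123 } }
};
console.log(pickHourlyByMinute(doc).hourly); // { '1': { total: 234, used: 123 } }
```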

MongoDB Query for Time Series data

I am trying to write a find query that retrieves data only for the first hour, i.e. hourly."1", in the following events document. The following is the output from db.events.find().pretty().
In a real scenario, I would be querying by id and hour.
{
    "_id" : "08062017/cpu",
    "metadata" : {
        "host" : "localhost",
        "metric" : "cpu"
    },
    "hourly" : {
        "0" : { "total" : 234, "used" : 123 },
        "1" : { "total" : 234, "used" : 123 }
    }
}
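A dot-path projection in find() answers this; the shell query is shown in the comment below, followed by a runnable plain-JavaScript sketch that mimics the same projection logic (the `project` helper is illustrative, not a driver API):

```javascript
// In the mongo shell, this would be:
//   db.events.find({ _id: "08062017/cpu" }, { metadata: 1, "hourly.1": 1 })
// Below, the same one-level dot-path projection in plain JavaScript:
function project(doc, paths) {
  const out = { _id: doc._id };   // _id is included by default
  for (const path of paths) {
    const [top, sub] = path.split(".");
    if (sub === undefined) {
      out[top] = doc[top];        // whole subdocument
    } else {
      out[top] = out[top] || {};  // only the named nested field
      out[top][sub] = doc[top][sub];
    }
  }
  return out;
}

const doc = {
  _id: "08062017/cpu",
  metadata: { host: "localhost", metric: "cpu" },
  hourly: { "0": { total: 234, used: 123 }, "1": { total: 234, used: 123 } }
};
console.log(project(doc, ["metadata", "hourly.1"]));
```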

Mongo Query - Return docs if data is 0 or null

I have data in the following format:
{
    "__v" : 0,
    "_id" : ObjectId("12367687"),
    "xyzId" : "ADV_ID",
    "date" : ISODate("2013-08-19T10:58:21.473Z"),
    "gId" : "987654",
    "type" : "checks"
}
For the above data I have to plot a graph of daily, weekly, monthly and yearly data, using the "date" field and a count of the "type" field. I have the following query, which works partially.
db.trackads.aggregate(
    { $match : { "gId" : "987654", type : 'checks' } },
    { $group : { _id : { "day" : { "$week" : "$date" } }, cnt : { "$sum" : 1 } } }
);
Result:
{
    "result" : [
        { "_id" : { "day" : 34 }, "cnt" : 734 },
        { "_id" : { "day" : 33 }, "cnt" : 349 }
    ],
    "ok" : 1
}
But with the above query, I do not get results for dates (week numbers) where the count for type = 'impressions' is 0. How should I modify the Mongo query to get results with a count of 0?
If I understand the question correctly, you can do it as follows.
db.trackads.aggregate(
    { $match : { "groupId" : "520a077c62d4b3b00e008905",
                 $or : [ { type : { $exists : false } }, { type : 'impressions' } ] } },
    { $group : { _id : { "day" : { "$week" : "$created_on" } }, cnt : { "$sum" : 1 } } }
);
The easiest option would be to add the zeroes in the code that consumes the results of this query.
MongoDB can't invent data that is not there. For example, how far back in time and forward in time would you expect it to go creating zeroes?
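Client-side zero-filling, as suggested, takes only a few lines. Here is a sketch assuming the result shape above (`fillZeroWeeks` and the chosen week range are illustrative; you decide how far back and forward to go):

```javascript
// Sketch: fill missing week numbers with zero counts after aggregation.
// The week range is an assumption the caller supplies.
function fillZeroWeeks(results, firstWeek, lastWeek) {
  const byWeek = new Map(results.map(r => [r._id.day, r.cnt]));
  const filled = [];
  for (let w = firstWeek; w <= lastWeek; w++) {
    filled.push({ _id: { day: w }, cnt: byWeek.get(w) || 0 });
  }
  return filled;
}

const results = [
  { _id: { day: 34 }, cnt: 734 },
  { _id: { day: 33 }, cnt: 349 }
];
console.log(fillZeroWeeks(results, 32, 35));
// weeks 32 and 35 now appear with cnt 0
```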

How to normalize/reduce time data in mongoDB?

I'm storing minutely performance data in MongoDB, each collection is a type of performance report, and each document is the measurement at that point in time for the port on the array:
{
    "DateTime" : ISODate("2012-09-28T15:51:03.671Z"),
    "array_serial" : "12345",
    "Port Name" : "CL1-A",
    "metric" : 104.2
}
There can be up to 128 different "Port Name" entries per "array_serial".
As the data ages I'd like to be able to average it out over increasing time spans:
Up to 1 week: 1 minute
1 week to 1 month: 5 minutes
1 to 3 months: 15 minutes
etc.
Here's how I'm averaging the times so that they can be reduced:
var resolution = 5; // How many minutes to average over
var map = function(){
    // Note: mapReduce's map function runs server-side and cannot see
    // client-side variables; `resolution` must be passed in via the
    // `scope` option of mapReduce.
    var coeff = 1000 * 60 * resolution;
    var roundTime = new Date(Math.round(this.DateTime.getTime() / coeff) * coeff);
    emit(roundTime, { value : this.metric, count: 1 });
};
I'll be summing the values and counts in the reduce function and computing the average in the finalize function.
As you can see this would average the data for just the time leaving out the "Port Name" value, and I need to average the values over time for each "Port Name" on each "array_serial".
So how can I include the port name in the above map function? Should the key for the emit be a compound "array_serial,PortName,DateTime" value that I split later? Or should I use the query function to query for each distinct serial, port and time? Am I storing this data in the database correctly?
Also, as far as I know this data gets saved out to its own collection; what's the standard practice for replacing the data in the collection with this averaged data?
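On the compound-key question: yes, the emit key can carry the serial, port name, and rounded time together. Here is a runnable plain-JavaScript simulation of that map/reduce/finalize flow (not actual mapReduce; field names follow the question's documents, and Math.floor is used to group into the lower 5-minute bucket):

```javascript
// Simulated mapReduce with a compound emit key (serial + port + rounded time).
// In real mapReduce, pass `resolution` via the `scope` option.
const resolution = 5;               // minutes to average over
const coeff = 1000 * 60 * resolution;

function mapDoc(doc, emit) {
  const roundTime = Math.floor(doc.DateTime.getTime() / coeff) * coeff;
  emit(JSON.stringify([doc.array_serial, doc["Port Name"], roundTime]),
       { value: doc.metric, count: 1 });
}

function reduceVals(values) {
  return values.reduce((a, b) => ({ value: a.value + b.value,
                                    count: a.count + b.count }));
}

const docs = [
  { DateTime: new Date("2012-09-28T15:51:03Z"), array_serial: "12345",
    "Port Name": "CL1-A", metric: 104.2 },
  { DateTime: new Date("2012-09-28T15:52:10Z"), array_serial: "12345",
    "Port Name": "CL1-A", metric: 95.8 }
];

const buckets = new Map();
for (const d of docs) {
  mapDoc(d, (k, v) => buckets.set(k, (buckets.get(k) || []).concat([v])));
}
for (const [k, vs] of buckets) {
  const r = reduceVals(vs);
  console.log(k, r.value / r.count);  // finalize step: the average
}
```

Both sample documents fall into the same 15:50 bucket, so one averaged value comes out per serial/port/time combination.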
Is this what you mean, Asya? Because it's not grouping the documents rounded down to the lower 5 minutes (by the way, I changed 'DateTime' to 'datetime'):
$project: {
    "year"   : { $year : "$datetime" },
    "month"  : { $month : "$datetime" },
    "day"    : { $dayOfMonth : "$datetime" },
    "hour"   : { $hour : "$datetime" },
    "minute" : { $mod : [ { $minute : "$datetime" }, 5 ] },
    array_serial: 1,
    port_name: 1,
    port_number: 1,
    metric: 1
}
From what I can tell the "$mod" operator will return the remainder of the minute divided by five, correct?
This would really help me if I could get the aggregation framework to do this operation rather than mapreduce.
Here is how you could do it in aggregation framework. I'm using a small simplification - I'm only grouping on Year, Month and Date - in your case you will need to add hour and minute for the finer grained calculations. You also have a choice about whether to do weighted average if the point distribution is not uniform in the data sample you get.
project = { "$project" : {
    "year"  : { "$year" : "$DateTime" },
    "month" : { "$month" : "$DateTime" },
    "day"   : { "$dayOfMonth" : "$DateTime" },
    "array_serial" : 1,
    "Port Name" : 1,
    "metric" : 1
} };
group = { "$group" : {
    "_id" : {
        "a" : "$array_serial",
        "P" : "$Port Name",
        "y" : "$year",
        "m" : "$month",
        "d" : "$day"
    },
    "avgMetric" : { "$avg" : "$metric" }
} };
db.metrics.aggregate([project, group]).result
I ran this with some random sample data and got something of this format:
[
    { "_id" : { "a" : "12345", "P" : "CL1-B", "y" : 2012, "m" : 9, "d" : 6 }, "avgMetric" : 100.8 },
    { "_id" : { "a" : "12345", "P" : "CL1-B", "y" : 2012, "m" : 9, "d" : 7 }, "avgMetric" : 98 },
    { "_id" : { "a" : "12345", "P" : "CL1-A", "y" : 2012, "m" : 9, "d" : 6 }, "avgMetric" : 105 }
]
As you can see this is one result per array_serial, port name, year/month/date combination. You can use $sort to get them into the order you want to process them from there.
Here is how you would extend the project step to include hour and minute while rounding minutes to average over every five minutes:
{
    "$project" : {
        "year"  : { "$year" : "$DateTime" },
        "month" : { "$month" : "$DateTime" },
        "day"   : { "$dayOfMonth" : "$DateTime" },
        "hour"  : { "$hour" : "$DateTime" },
        "fmin"  : { "$subtract" : [
            { "$minute" : "$DateTime" },
            { "$mod" : [ { "$minute" : "$DateTime" }, 5 ] }
        ] },
        "array_serial" : 1,
        "Port Name" : 1,
        "metric" : 1
    }
}
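The $subtract/$mod expression computing fmin is the usual floor-to-multiple trick, fmin = minute - (minute % 5). A quick check in plain JavaScript:

```javascript
// fmin rounds a minute value down to the nearest multiple of 5,
// mirroring the $subtract/$mod pipeline expression above.
function fmin(minute) {
  return minute - (minute % 5);
}
console.log(fmin(0), fmin(4), fmin(7), fmin(59)); // 0 0 5 55
```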
Hope you will be able to extend that to your specific data and requirements.
"what's the standard practice for replacing the data in the collection with this averaged data?"
The standard practice is to keep the original data and to store all derived data separately.
In your case it means:
Don't delete the original data
Use another collection (in the same MongoDB database) to store average values