How to normalize/reduce time data in mongoDB? - mongodb

I'm storing per-minute performance data in MongoDB. Each collection is a type of performance report, and each document is the measurement at a point in time for a port on the array:
{
"DateTime" : ISODate("2012-09-28T15:51:03.671Z"),
"array_serial" : "12345",
"Port Name" : "CL1-A",
"metric" : 104.2
}
There can be up to 128 different "Port Name" entries per "array_serial".
As the data ages I'd like to be able to average it out over increasing time spans:
Up to 1 Week : minute
1 Week to 1 month : 5 minute
1 - 3 months: 15 minute
etc..
Here's how I'm averaging the times so that they can be reduced :
var resolution = 5; // How many minutes to average over
var map = function(){
var coeff = 1000 * 60 * resolution;
var roundTime = new Date(Math.round(this.DateTime.getTime() / coeff) * coeff);
emit(roundTime, { value : this.metric, count: 1 } );
};
I'll be summing the values and counts in the reduce function, and getting the average in the finalize function.
As you can see this would average the data for just the time leaving out the "Port Name" value, and I need to average the values over time for each "Port Name" on each "array_serial".
So how can I include the port name in the above map function? Should the key for the emit be a compound "array_serial,PortName,DateTime" value that I split later? Or should I use the query function to query for each distinct serial, port and time? Am I storing this data in the database correctly?
Also, as far as I know this data gets saved out to its own collection. What's the standard practice for replacing the data in the collection with this averaged data?
Is this what you mean Asya? Because it's not grouping the documents rounded to the lower 5 minute (btw, I changed 'DateTime' to 'datetime'):
$project: {
"year" : { $year : "$datetime" },
"month" : { $month : "$datetime" },
"day" : { $dayOfMonth : "$datetime" },
"hour" : { $hour : "$datetime" },
"minute" : { $mod : [ {$minute : "$datetime"}, 5] },
array_serial: 1,
port_name: 1,
port_number: 2,
metric: 1
}
From what I can tell the "$mod" operator will return the remainder of the minute divided by five, correct?
This would really help me if I could get the aggregation framework to do this operation rather than mapreduce.

Here is how you could do it in the aggregation framework. I'm using a small simplification: I'm only grouping on year, month and day of the month. In your case you will need to add hour and minute for the finer-grained calculations. You also have a choice about whether to compute a weighted average if the points are not uniformly distributed in the data sample you get.
project={"$project" : {
"year" : {
"$year" : "$DateTime"
},
"month" : {
"$month" : "$DateTime"
},
"day" : {
"$dayOfMonth" : "$DateTime"
},
"array_serial" : 1,
"Port Name" : 1,
"metric" : 1
}
};
group={"$group" : {
"_id" : {
"a" : "$array_serial",
"P" : "$Port Name",
"y" : "$year",
"m" : "$month",
"d" : "$day"
},
"avgMetric" : {
"$avg" : "$metric"
}
}
};
db.metrics.aggregate([project, group]).result
I ran this with some random sample data and got something of this format:
[
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 100.8
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 7
},
"avgMetric" : 98
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-A",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 105
}
]
As you can see this is one result per array_serial, port name, year/month/date combination. You can use $sort to get them into the order you want to process them from there.
Here is how you would extend the project step to include hour and minute while rounding minutes to average over every five minutes:
{
"$project" : {
"year" : {
"$year" : "$DateTime"
},
"month" : {
"$month" : "$DateTime"
},
"day" : {
"$dayOfMonth" : "$DateTime"
},
"hour" : {
"$hour" : "$DateTime"
},
"fmin" : {
"$subtract" : [
{
"$minute" : "$DateTime"
},
{
"$mod" : [
{
"$minute" : "$DateTime"
},
5
]
}
]
},
"array_serial" : 1,
"Port Name" : 1,
"metric" : 1
}
}
Hope you will be able to extend that to your specific data and requirements.
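To go with the extended $project stage, the $group stage also needs the hour and fmin fields in its _id. A minimal sketch follows, assuming the same field names as the earlier group stage. The floorMinute helper is hypothetical: it only mirrors the $subtract/$mod arithmetic in plain JavaScript so the rounding can be checked outside the database.

```javascript
// Hypothetical helper: the same arithmetic as the "fmin" expression,
// i.e. minute - (minute mod step), which rounds down to a multiple of step.
function floorMinute(minute, step) {
  return minute - (minute % step);
}

// $group stage matching the extended $project: one bucket per
// array_serial / port / year / month / day / hour / five-minute slot.
var group = {
  "$group": {
    "_id": {
      "a": "$array_serial",
      "P": "$Port Name",
      "y": "$year",
      "m": "$month",
      "d": "$day",
      "h": "$hour",
      "fmin": "$fmin"
    },
    "avgMetric": { "$avg": "$metric" }
  }
};

console.log(floorMinute(51, 5)); // minute 51 falls in the slot starting at 50
```

Because the remainder is subtracted rather than kept, every minute from 50 to 54 maps to the same value, which is what lets $group treat them as one bucket.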

"what's the standard practice for replacing the data in the collection with this averaged data?"
The standard practice is to keep the original data and to store all derived data separately.
In your case it means:
Don't delete the original data
Use another collection (in the same MongoDB database) to store average values
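Keeping derived data in its own collection also fits the map-reduce route from the question. The sketch below is one possible shape, not a definitive implementation: it uses an object as the emit key so that array_serial and port name stay separate fields instead of a string to split later. The collection names metrics and metrics_5min are assumptions, and resolution would have to be passed to the server through the scope option of mapReduce.

```javascript
// Bucket width in minutes. In a real mapReduce call this must be supplied
// via the `scope` option, because map runs on the server.
var resolution = 5;

var map = function () {
  var coeff = 1000 * 60 * resolution;
  // Round down to the start of the bucket.
  var bucket = new Date(Math.floor(this.DateTime.getTime() / coeff) * coeff);
  emit(
    { array_serial: this.array_serial, port: this["Port Name"], time: bucket },
    { sum: this.metric, count: 1 }
  );
};

// Returns the same shape that map emits, so it can be re-applied safely.
var reduce = function (key, values) {
  var out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
};

var finalize = function (key, value) {
  value.avg = value.sum / value.count;
  return value;
};

// In the mongo shell (collection names are assumptions, not run here):
// db.metrics.mapReduce(map, reduce, {
//   finalize: finalize,
//   scope: { resolution: 5 },
//   out: "metrics_5min"
// });
```

Writing the output to a separate collection leaves the per-minute documents untouched, so a different resolution can be computed later from the same source data.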

Related

mongodb math operations on two fields in collection

Hello, I have an exercise to filter all countries where the GDP per person is greater than 0.05. I need to use the latest year of population data. Also, the country code should have at least 3 characters. My collection looks like this:
mondial.countries
{
"_id" : ObjectId("581cb5a519ec2deb4ba71c03"),
"name" : "Germany",
"code" : "GER",
"capital" : "RN-Niamey-Niamey",
"area" : 1267000,
"gdp" : 7304,
"inflation" : 1.9,
"unemployment" : null,
"independence" : ISODate("1960-08-03T00:00:00Z"),
"government" : "republic",
"population" : [
{
"year" : 1950,
"value" : 2559703
},
{
"year" : 1960,
"value" : 3337141
},
{
"year" : 1970,
"value" : 4412638
},
{
"year" : 1977,
"value" : 5102990
},
{
"year" : 1988,
"value" : 7251626
},
{
"year" : 1997,
"value" : 9113001
},
{
"year" : 2001,
"value" : 11060291
},
{
"year" : 2012,
"value" : 17138707
}
]
}
For this example I have to take the population from year 2012, divide it by gdp, and then display it if it's greater than 50000. I have been trying with a function in JS, but I don't know how to show only the results of my operation that are greater than 5000. What is the easiest way to do this?
var countries = db.mondial.countries.find({
"code": {$gte: 3},
});
while(countries.hasNext()) {
gdp = countries.next()
gdpresult = countries.population / gdp.gdp
print(gdpresult)
}
I'm not sure I understood correctly, but see if this helps:
db.mondial.countries.aggregate([
{
$match:{
$expr: {
$gte: [{ $strLenCP: '$code' }, 3]
}
}
},
{
$project: {
gdpresult: {
$map: {
input: '$population',
as: 'p',
in: {
value: {
$divide: ["$$p.value", '$gdp']
},
year: '$$p.year'
}
}
}
}
}])
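The pipeline above divides every population entry by gdp, but it does not pick the latest year or apply the threshold. As a sketch of that arithmetic only, here is a plain JavaScript version that works on documents already fetched from the collection. The helper names and the threshold parameter are hypothetical, and the ratio is population divided by gdp, as in the question's own attempt.

```javascript
// Hypothetical helper: the population entry with the highest year, or null.
function latestPopulation(country) {
  return (country.population || []).reduce(function (best, p) {
    return best === null || p.year > best.year ? p : best;
  }, null);
}

// Hypothetical helper: latest population divided by gdp. Returns null when
// the code is shorter than 3 characters or the needed data is missing.
function populationPerGdp(country) {
  if (typeof country.code !== "string" || country.code.length < 3) return null;
  var latest = latestPopulation(country);
  if (latest === null || !country.gdp) return null;
  return latest.value / country.gdp;
}

// Keep only the countries whose ratio exceeds the threshold.
function filterByRatio(countries, threshold) {
  return countries.filter(function (c) {
    var ratio = populationPerGdp(c);
    return ratio !== null && ratio > threshold;
  });
}

// Trimmed copy of the sample document from the question.
var sample = {
  name: "Germany",
  code: "GER",
  gdp: 7304,
  population: [
    { year: 2001, value: 11060291 },
    { year: 2012, value: 17138707 }
  ]
};
console.log(populationPerGdp(sample)); // 17138707 / 7304
```

In the shell, filterByRatio could be applied to db.mondial.countries.find().toArray(); for large collections the same selection is better expressed inside the pipeline.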

How to group MongoDB document data by timestamp quarterly/half-yearly [duplicate]

I am using MongoDB 3.2 and I need to group documents by timestamp, quarterly and half-yearly. The document structure looks like this:
{
"_id" : ObjectId("59312c59bf501118aea587b2"),
"timestamp" : ISODate("2012-01-01T01:00:00Z"),
"value" : 20,
"uniqueId" : ObjectId("59312c59bf501118aea58a6d")
},
{
"_id" : ObjectId("59312c59bf501118aea587b3"),
"timestamp" : ISODate("2012-02-01T01:00:00Z"),
"value" : 20,
"uniqueId" : ObjectId("59312c59bf501118aea58a6d")
},
{
"_id" : ObjectId("59312c59bf501118aea587b4"),
"timestamp" : ISODate("2012-05-01T01:00:00Z"),
"value" : 20,
"uniqueId" : ObjectId("59312c59bf501118aea58a6d")
},
{
"_id" : ObjectId("59312c59bf501118aea587b5"),
"timestamp" : ISODate("2012-06-01T01:00:00Z"),
"value" : 20,
"uniqueId" : ObjectId("59312c59bf501118aea58a6d")
}
I need to group the documents by timestamp, quarterly or half-yearly, and sum the value. The result for quarterly grouping should look like this:
{
"timestamp" : ISODate("2012-01-01T01:00:00Z"),
"value" : 40
},
{
"timestamp" : ISODate("2012-05-01T01:00:00Z"),
"value" : 90
}
Can anyone help me achieve this, and also the half-yearly grouping?
You can aggregate documents by quarter, but calculating the first date of each quarter should be done on the client side:
db.yourCollection.aggregate([
{
$group: {
_id: {
year: {$year: "$timestamp"},
quarter: {$ceil: {$divide:[{$month:"$timestamp"}, 3]}}
},
value: {$sum:"$value"}
}
},
{ $project: { _id: 0, year: "$_id.year", quarter: "$_id.quarter", value: 1 } },
{ $sort: { year: 1, quarter: 1 } }
])
Output:
{
"year" : 2012,
"quarter" : 1,
"value" : 40
}
,
{
"year" : 2012,
"quarter" : 2,
"value" : 40
}
If you want half-year reports, then instead of dividing by 3 you should divide by 6.
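The client-side step and the half-year variant can be sketched as follows. The helper names are hypothetical, and the dates are built in UTC on the assumption that the stored timestamps should be interpreted in UTC, as $year and $month do.

```javascript
// Hypothetical helper: first day (UTC) of a quarter, for quarter = 1..4.
function quarterStart(year, quarter) {
  return new Date(Date.UTC(year, (quarter - 1) * 3, 1));
}

// Hypothetical helper: first day (UTC) of a half-year, for half = 1..2.
function halfYearStart(year, half) {
  return new Date(Date.UTC(year, (half - 1) * 6, 1));
}

// Half-yearly variant of the $group stage: months 1-6 give 1, months 7-12 give 2.
var halfYearGroup = {
  $group: {
    _id: {
      year: { $year: "$timestamp" },
      half: { $ceil: { $divide: [{ $month: "$timestamp" }, 6] } }
    },
    value: { $sum: "$value" }
  }
};

console.log(quarterStart(2012, 2).toISOString()); // 2012-04-01T00:00:00.000Z
```

Each result document's year and quarter (or half) can be passed to these helpers to attach a timestamp to the summed value.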

MongoDB Aggregation for Time Series data

I have the following document structure (events collection) for time series data. I chose this structure based on the article at https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports-mmapv1/
{
"_id" : "06062017/cpu",
"metadata" : {
"host" : "localhost",
"createdDate" : ISODate("2017-06-09T11:18:56.120Z"),
"metric" : "cpu"
},
"hourly" : {
"0" : {
"total" : 2,
"used" : 1
},
"1" : {
"total" : 3,
"used" : 2
}
},
"minute" : {
"0" : {
"0" : {
"total" : 9789789,
"used" : 353
},
"1" : {
"total" : 0,
"used" : 0
}
},
"1" : {
"0" : {
"total" : 0,
"used" : 0
},
"1" : {
"total" : 234234,
"used" : 123
}
}
}
}
Now I am trying to get the average CPU used from the minute data and store it in the hourly data. Aggregation and map functions work well on lists and arrays, but I could not get the average per hour from the minutes. I would also like to get the average per day.
I am using Java as the programming language. Is it better to calculate this within the application or with the aggregation framework?
Any help is greatly appreciated.
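If the averaging is done in the application, the nested minute object can be walked directly. The sketch below is in JavaScript, like the other code on this page, and the same loops translate to Java. The function names are hypothetical, and it assumes every minute entry carries a numeric used field.

```javascript
// Hypothetical helper: average of "used" over the minute entries of each hour.
// Returns an object keyed by hour; an hour with no entries maps to null.
function hourlyAverageUsed(doc) {
  var result = {};
  Object.keys(doc.minute).forEach(function (hour) {
    var minutes = doc.minute[hour];
    var keys = Object.keys(minutes);
    var sum = keys.reduce(function (acc, m) {
      return acc + minutes[m].used;
    }, 0);
    result[hour] = keys.length ? sum / keys.length : null;
  });
  return result;
}

// Hypothetical helper: average of "used" over every minute entry of the day.
function dailyAverageUsed(doc) {
  var sum = 0;
  var count = 0;
  Object.keys(doc.minute).forEach(function (hour) {
    Object.keys(doc.minute[hour]).forEach(function (m) {
      sum += doc.minute[hour][m].used;
      count += 1;
    });
  });
  return count ? sum / count : null;
}

// The "minute" part of the sample document from the question.
var doc = {
  minute: {
    "0": { "0": { total: 9789789, used: 353 }, "1": { total: 0, used: 0 } },
    "1": { "0": { total: 0, used: 0 }, "1": { total: 234234, used: 123 } }
  }
};
console.log(hourlyAverageUsed(doc), dailyAverageUsed(doc));
```

The daily figure is computed from the minute entries rather than from the hourly averages, so hours with different numbers of samples are weighted correctly.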

MongoDB MapReduce Standard Deviation over date range

Good Afternoon SO, I wonder if anybody could help me. I am currently investigating using MongoDB Aggregation Framework and MapReduce functions.
My dataset looks like this
[{
"Name" : "Person 1",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 2",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 5
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 3",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 25
}
]
}
]
I have done a lot of research and as far as I can see there is no out-of-the-box support for doing SD calculations. I have reviewed a few links and SO posts and came up with this URL https://gist.github.com/RedBeard0531/1886960, which seems to be what I am looking for.
So, enough about the background. What I would like to do is generate a chart of SDs for each year.
The current function does not take each year into consideration, only the values as a whole. I have changed the map function to the following, and have no idea where to put the group-by-date logic.
function map() {
emit(1, // Or put a GROUP BY key here
{sum: this.RunningSpeed.value, // the field you want stats for
min: this.RunningSpeed.value,
max: this.RunningSpeed.value,
count:1,
diff: 0, // M2,n: sum((val-mean)^2)
});
}
However, I just get zeros. Could anybody help me adapt this function?
You need to use getFullYear() and a forEach loop through each RunningSpeed entry:
function map() {
this.RunningSpeed.forEach(function(data){
var year = data.Date.getFullYear();
emit(year, // Or put a GROUP BY key here, the group by key is Date.getFullYear()
{sum: data.Value, // the field you want stats for
min: data.Value,
max: data.Value,
count: 1,
diff: 0, // M2,n: sum((val-mean)^2)
});
});
}
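The map function above needs a reduce step that can merge partial statistics. The following is a sketch of a compatible reduce and finalize, not the gist's exact code: it combines partial results with the standard pairwise update for the sum of squared deviations, and finalize derives the population variance and standard deviation from it.

```javascript
// Merge partial statistics. The result has the same shape as the emitted
// values, so MongoDB can call reduce repeatedly on its own output.
function reduce(key, values) {
  var a = values[0];
  for (var i = 1; i < values.length; i++) {
    var b = values[i];
    var delta = a.sum / a.count - b.sum / b.count; // difference of the two means
    var weight = (a.count * b.count) / (a.count + b.count);
    a = {
      sum: a.sum + b.sum,
      min: Math.min(a.min, b.min),
      max: Math.max(a.max, b.max),
      count: a.count + b.count,
      diff: a.diff + b.diff + delta * delta * weight
    };
  }
  return a;
}

// Derive mean, population variance and standard deviation per key.
function finalize(key, value) {
  value.avg = value.sum / value.count;
  value.variance = value.diff / value.count;
  value.stddev = Math.sqrt(value.variance);
  return value;
}

// The 2005 values from the sample data: 10, 5 and 20.
var parts = [10, 5, 20].map(function (v) {
  return { sum: v, min: v, max: v, count: 1, diff: 0 };
});
console.log(finalize(2005, reduce(2005, parts)).stddev);
```

Dividing diff by count gives the population variance; for the sample variance, divide by count - 1 instead.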

How can I aggregate documents by time interval in MongoDB?

I need to aggregate my collection based on a certain time interval.
As you might guess, I don't need to count per hour or per day, for example.
I need to aggregate based on a 30-minute interval (or any other). Let's say the first document was created at 3:45PM. Then there are 5 more documents, created between 3:45PM and 4:15PM.
So in this time interval, I have 6 documents. So the first document of the MapReduce result is a document with the count of 6.
Let's say the next document is created at 4:35PM and three more at 4:40PM.
So the next document of the MapReduce result is a document with the count of 4.
And so on...
Currently my map function looks like this:
var map = function() {
var key = {name: this.name, minute: this.timestamp.getMinutes()};
emit(key, {count: 1})
};
So nothing special. Currently I group by the minute, which is not what I want at the end. Here, instead of minute, I need to be able to check the time-interval described above.
And my reduce function:
var reduce = function(key, values)
{
var sum = 0;
values.forEach(function(value)
{
sum += value['count'];
});
return {count: sum};
};
The output looks like this:
{
0: "{ "_id" : { "name" : "A" , "minute" : 11.0} , "value" : { "count" : 1.0}}",
1: "{ "_id" : { "name" : "B" , "minute" : 41.0} , "value" : { "count" : 6.0}}",
2: "{ "_id" : { "name" : "B" , "minute" : 42.0} , "value" : { "count" : 3.0}}",
3: "{ "_id" : { "name" : "C" , "minute" : 41.0} , "value" : { "count" : 2.0}}",
4: "{ "_id" : { "name" : "C" , "minute" : 42.0} , "value" : { "count" : 2.0}}",
5: "{ "_id" : { "name" : "D" , "minute" : 11.0} , "value" : { "count" : 1.0}}",
6: "{ "_id" : { "name" : "E" , "minute" : 16.0} , "value" : { "count" : 1.0}}"
}
So it counts / aggregates documents per minute, but NOT by my custom time interval.
Any ideas about this?
Edit: My example using map reduce didn't work, but I think this does roughly what you want to do.
I use project to define a variable time to contain the minutes from your timestamp rounded to 5 minute intervals. This would be easy with an integer divide, but I don't think the mongodb query language supports that at this time, so instead I subtract minutes mod 5 from the minutes to get a number that changes every 5 minutes. Then a group by the name and this time counter should do the trick.
query = [
{
"$project": {
"_id":"$_id",
"name":"$name",
"time": {
"$subtract": [
{"$minute":"$timestamp"},
{"$mod": [{"$minute":"$timestamp"}, 5]}
]
}
}
},
{
"$group": {"_id": {"name": "$name", "time": "$time"}, "count":{"$sum":1}}
}
]
db.foo.aggregate(query)
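The minute-based rounding above only works for intervals that divide 60 evenly, and it ignores the hour and the day. A sketch of a more general variant follows, assuming fixed buckets aligned to the Unix epoch, which is not quite the "window starts at the first document" behaviour the question describes. Subtracting two dates in the aggregation framework yields milliseconds, so the same subtract-the-remainder trick can be applied to the whole timestamp. The bucketStart helper is hypothetical and only mirrors the arithmetic in plain JavaScript.

```javascript
var intervalMs = 1000 * 60 * 30; // 30-minute buckets

// Hypothetical helper: start of the bucket containing `date`
// (intended for dates on or after 1970-01-01).
function bucketStart(date, ms) {
  var t = date.getTime();
  return new Date(t - (t % ms));
}

// Milliseconds between the document's timestamp and the epoch.
var msSinceEpoch = { "$subtract": ["$timestamp", new Date(0)] };

var query = [
  {
    "$group": {
      "_id": {
        "name": "$name",
        // Milliseconds since the epoch, rounded down to the bucket start.
        "bucket": {
          "$subtract": [msSinceEpoch, { "$mod": [msSinceEpoch, intervalMs] }]
        }
      },
      "count": { "$sum": 1 }
    }
  }
];

// In the mongo shell: db.foo.aggregate(query)
console.log(bucketStart(new Date("2012-09-28T15:51:03.671Z"), intervalMs).toISOString());
```

The bucket value in each result is a number of milliseconds; new Date(bucket) on the client turns it back into the start time of the interval.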