MongoDB MapReduce Standard Deviation over date range - mongodb

Good Afternoon SO, I wonder if anybody could help me. I am currently investigating using MongoDB Aggregation Framework and MapReduce functions.
My dataset looks like this
[{
"Name" : "Person 1",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 2",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 5
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 3",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 25
}
]
}
]
I have done a lot of research and, as far as I can see, there is no out-of-the-box support for standard deviation (SD) calculations. I have reviewed a few links and SO posts and came across https://gist.github.com/RedBeard0531/1886960, which seems to be what I am looking for.
So, enough background: what I would like to do is generate a chart of SDs for each year.
The current function does not take each year into consideration, only the values as a whole. I have changed the map function to the following, but I have no idea where to put the group-by-date logic.
function map() {
emit(1, // Or put a GROUP BY key here
{sum: this.RunningSpeed.value, // the field you want stats for
min: this.RunningSpeed.value,
max: this.RunningSpeed.value,
count:1,
diff: 0, // M2,n: sum((val-mean)^2)
});
}
However, I just get zeros. Could anybody help me adapt this function?

You need to use getFullYear() and a forEach loop through each RunningSpeed entry:
function map() {
this.RunningSpeed.forEach(function(data){
var year = data.Date.getFullYear();
emit(year, // Or put a GROUP BY key here, the group by key is Date.getFullYear()
{sum: data.Value, // the field you want stats for
min: data.Value,
max: data.Value,
count: 1,
diff: 0, // M2,n: sum((val-mean)^2)
});
});
}
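To see what the per-year grouping produces end to end, here is a minimal sketch in plain JavaScript that runs outside MongoDB. It assumes the reduce step merges partial results with the usual parallel variance formula and that finalize divides diff by count (population SD); I have not reproduced the linked gist verbatim. The names single, merge and statsByYear are made up for this illustration, and getUTCFullYear is used instead of getFullYear only so the result does not depend on the local time zone.

```javascript
// Two years of the question's sample data, with plain JS Dates instead of ISODate.
const people = [
  { Name: "Person 1", RunningSpeed: [
    { Date: new Date("2005-07-23T23:00:00.000Z"), Value: 10 },
    { Date: new Date("2006-07-23T23:00:00.000Z"), Value: 20 } ] },
  { Name: "Person 2", RunningSpeed: [
    { Date: new Date("2005-07-23T23:00:00.000Z"), Value: 5 },
    { Date: new Date("2006-07-23T23:00:00.000Z"), Value: 10 } ] },
  { Name: "Person 3", RunningSpeed: [
    { Date: new Date("2005-07-23T23:00:00.000Z"), Value: 20 },
    { Date: new Date("2006-07-23T23:00:00.000Z"), Value: 10 } ] },
];

// Same shape as the value emitted by map() for one data point.
function single(value) {
  return { sum: value, min: value, max: value, count: 1, diff: 0 };
}

// Merge two partial results; diff accumulates sum((val - mean)^2).
function merge(a, b) {
  const delta = a.sum / a.count - b.sum / b.count;
  const weight = (a.count * b.count) / (a.count + b.count);
  return {
    sum: a.sum + b.sum,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
    count: a.count + b.count,
    diff: a.diff + b.diff + delta * delta * weight,
  };
}

// "map" by year, "reduce" with merge, then "finalize" into avg and stddev.
function statsByYear(docs) {
  const byYear = new Map();
  for (const doc of docs) {
    for (const entry of doc.RunningSpeed) {
      const year = entry.Date.getUTCFullYear();
      const next = single(entry.Value);
      byYear.set(year, byYear.has(year) ? merge(byYear.get(year), next) : next);
    }
  }
  const out = {};
  for (const [year, s] of byYear) {
    out[year] = { count: s.count, avg: s.sum / s.count, stddev: Math.sqrt(s.diff / s.count) };
  }
  return out;
}

const yearlyStats = statsByYear(people);
console.log(yearlyStats);
```

Each year ends up with its own count, average and standard deviation, which is the shape needed for a chart of SDs per year.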

Related

mongodb math operations on two fields in collection

Hello, I have an exercise to filter all countries where GDP per person is greater than 0.05. I need to use the population from the latest year. Also, the country code should have at least 3 characters. My collection looks like this:
mondial.countries
{
"_id" : ObjectId("581cb5a519ec2deb4ba71c03"),
"name" : "Germany",
"code" : "GER",
"capital" : "RN-Niamey-Niamey",
"area" : 1267000,
"gdp" : 7304,
"inflation" : 1.9,
"unemployment" : null,
"independence" : ISODate("1960-08-03T00:00:00Z"),
"government" : "republic",
"population" : [
{
"year" : 1950,
"value" : 2559703
},
{
"year" : 1960,
"value" : 3337141
},
{
"year" : 1970,
"value" : 4412638
},
{
"year" : 1977,
"value" : 5102990
},
{
"year" : 1988,
"value" : 7251626
},
{
"year" : 1997,
"value" : 9113001
},
{
"year" : 2001,
"value" : 11060291
},
{
"year" : 2012,
"value" : 17138707
}
]
}
For this example I have to take the population from the year 2012 and divide it by gdp, and then display it if it's greater than 50000. I have been trying with a JS function, but I don't know how to show only the fields where the result of my operation is greater than 5000. What is the easiest way to do this?
var countries = db.mondial.countries.find({
"code": {$gte: 3},
});
while(countries.hasNext()) {
gdp = countries.next()
gdpresult = countries.population / gdp.gdp
print(gdpresult)
}
I don't know if I understood correctly, but see if this helps:
db.mondial.countries.aggregate([
{
$match:{
$expr: {
// keep countries whose code has at least 3 characters
$gte:[{ $strLenCP: '$code' }, 3 ]
}
}
},
{
$project: {
gdpresult: {
$map: {
input: '$population',
as: 'p',
in: {
value: {
$divide: ["$$p.value", '$gdp']
},
year: '$$p.year'
}
}
}
}
}])
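The pipeline above computes the ratio for every year but does not pick the latest year or apply the threshold. As a rough sketch of those two remaining steps, here is the same logic in plain JavaScript on an in-memory array. The helper names, the two extra sample countries and the threshold passed in are assumptions for illustration only, and the ratio follows the question's wording (population divided by gdp).

```javascript
// One document from the question plus two made-up ones for contrast.
const countries = [
  { name: "Germany", code: "GER", gdp: 7304,
    population: [{ year: 2001, value: 11060291 }, { year: 2012, value: 17138707 }] },
  { name: "Hypothetical A", code: "HA", gdp: 100,          // code shorter than 3 characters
    population: [{ year: 2012, value: 1000000 }] },
  { name: "Hypothetical B", code: "HYB", gdp: 1000000,     // ratio far below the threshold
    population: [{ year: 2012, value: 5000 }] },
];

// Pick the population entry with the highest year.
function latestPopulation(country) {
  return country.population.reduce((a, b) => (b.year > a.year ? b : a));
}

// Keep countries with a long enough code whose population / gdp exceeds the threshold.
function populationPerGdp(list, minCodeLength, threshold) {
  return list
    .filter((c) => typeof c.code === "string" && c.code.length >= minCodeLength)
    .filter((c) => typeof c.gdp === "number" && c.gdp > 0)
    .map((c) => {
      const latest = latestPopulation(c);
      return { name: c.name, year: latest.year, gdpresult: latest.value / c.gdp };
    })
    .filter((r) => r.gdpresult > threshold);
}

const selected = populationPerGdp(countries, 3, 2000);
console.log(selected);
```

The threshold is a parameter because the question mentions several different values; plug in whichever one the exercise actually requires.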

MongoDB Geospatial query within an array returns always the full doc

Have a collection in MongoDB that looks like this :
{
"_id" : "7613035010550",
"purchases" : [
{
"date" : ISODate("2017-04-15T14:15:00.000Z"),
"coords" : {
"lon" : 43.729604,
"lat" : 1.416017
},
"metar" : {},
"quantity" : 1,
"price" : 2.31
},
{
"date" : ISODate("2017-05-02T16:23:00.000Z"),
"coords" : {
"lon" : 43.722862,
"lat" : 1.415837
},
"metar" : {},
"quantity" : 6,
"price" : 12
},
{
"date" : ISODate("2017-05-02T18:32:00.000Z"),
"coords" : {
"lon" : 46.307353,
"lat" : 3.28937
},
"metar" : {},
"quantity" : 2,
"price" : 5
}
],
"rates" : [
{
"value" : 5
},
{
"value" : 4
},
{
"value" : 5
},
{
"value" : 2
}
]
}
I would like to write a query that returns only the purchases made within a defined radius (e.g. 5 km) around a point, and only for a given id, but I don't know how to handle this kind of query.
I tried this query:
db.getCollection('stats').find({"purchases.coords":{$geoWithin:{$centerSphere: [[43.688935, 1.401541], 25 / 6378.1]}}})
But it returns the whole document. I would like to get back something like an array of the purchases made within the defined radius, i.e. only these two in my example:
{
"date" : ISODate("2017-04-15T14:15:00.000Z"),
"coords" : {
"lon" : 43.729604,
"lat" : 1.416017
},
"metar" : {},
"quantity" : 1,
"price" : 2.31
},
{
"date" : ISODate("2017-05-02T16:23:00.000Z"),
"coords" : {
"lon" : 43.722862,
"lat" : 1.415837
},
"metar" : {},
"quantity" : 6,
"price" : 12
}
How can I achieve this kind of query, or how should I define my collection to be able to make this kind of query?
Thx,
JL
For your purpose I'd recommend using aggregation with the $project and $unwind stages.
I can't check it right now, but it should look like this:
db.getCollection('stats').aggregate([
{'$match': {
"purchases.coords": {$geoWithin:{$centerSphere: [[43.688935, 1.401541], 25 / 6378.1]}}
}},
{'$project': {
"_id": 0, // 0 - if you don't need document id
"purchases": 1,
}},
{'$unwind': "$purchases"},
{'$match': {
"purchases.coords": {$geoWithin:{$centerSphere: [[43.688935, 1.401541], 25 / 6378.1]}}
}},
])
I've used 2 identical matches to:
1. Match all documents that contain at least one purchase satisfying the condition.
2. Match the individual unwound purchases that satisfy the condition.
You can use this aggregation without the first match, but it may be a bit slower.
You can see how it works if you comment out all the stages and then uncomment them one by one.
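To check by hand which purchases the second match should keep, the same filter can be sketched in plain JavaScript with a great-circle (haversine) distance. This is a client-side illustration and not what the server does internally. It assumes the first number in the query's centre is the longitude, and it reuses the 6378.1 km Earth radius and the 25 km radius from the query above; distanceKm and purchasesWithin are made-up helper names.

```javascript
const EARTH_RADIUS_KM = 6378.1; // same radius the $centerSphere expression divides by

// Coordinates and quantities copied from the question's document.
const purchases = [
  { coords: { lon: 43.729604, lat: 1.416017 }, quantity: 1, price: 2.31 },
  { coords: { lon: 43.722862, lat: 1.415837 }, quantity: 6, price: 12 },
  { coords: { lon: 46.307353, lat: 3.28937 }, quantity: 2, price: 5 },
];

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Haversine distance between two { lon, lat } points, in kilometres.
function distanceKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Keep only the purchases whose coords lie within radiusKm of the centre.
function purchasesWithin(list, center, radiusKm) {
  return list.filter((p) => distanceKm(center, p.coords) <= radiusKm);
}

const center = { lon: 43.688935, lat: 1.401541 };
const nearby = purchasesWithin(purchases, center, 25);
console.log(nearby.map((p) => p.quantity));
```

With this data the first two purchases are a few kilometres from the centre and the third is several hundred kilometres away, so only the first two are kept.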

Group based on discrete date ranges

I am new to MongoDB and I've been struggling to get a specific query to work without any luck.
I have a collection with millions of documents having a date and an amount, I want to get the aggregations for specific periods of time.
For example, I want to get the count, amount summations for the periods between 1/1/2015 - 15/1/2015 and between 1/2/2015 - 15/2/2015
A sample collection is
{ "_id" : "148404972864202083547392254", "account" : "3600", "amount" : 50, "date" : ISODate("2017-01-01T12:02:08.642Z")}
{ "_id" : "148404972864202085437392254", "account" : "3600", "amount" : 50, "date" : ISODate("2017-01-03T12:02:08.642Z")}
{ "_id" : "148404372864202083547392254", "account" : "3600", "amount" : 70, "date" : ISODate("2017-01-09T12:02:08.642Z")}
{ "_id" : "148404972864202083547342254", "account" : "3600", "amount" : 150, "date" : ISODate("2017-01-22T12:02:08.642Z")}
{ "_id" : "148404922864202083547392254", "account" : "3600", "amount" : 200, "date" : ISODate("2017-02-02T12:02:08.642Z")}
{ "_id" : "148404972155502083547392254", "account" : "3600", "amount" : 30, "date" : ISODate("2017-02-07T12:02:08.642Z")}
{ "_id" : "148404972864202122254732254", "account" : "3600", "amount" : 10, "date" : ISODate("2017-02-10T12:02:08.642Z")}
for date ranges between 1/1/2017 - 10/10/2017 and 1/2/2017 - 10/2/2017 the output would be like this:
1/1/2017 - 10/1/2017 - count =3, amount summation: 170
10/2/2017 - 15/2/2017 - count =2, amount summation: 40
Is it possible to work with such different date ranges? The code would be in Java, but as an example in mongo, can someone please help me?
There must be a more elegant solution than this. Anyway, you can wrap it in a function and generalize the date-related arguments.
First, you make a projection and at the same time decide which range an item falls into (note the huge $switch expression). By default, an item goes into the null range.
Then, you filter out the items that didn't fall into any range (i.e. keep only range != null).
The very last step is to group the items by range and make all the needed calculations.
db.items.aggregate([
{ $project : {
amount : true,
account : true,
date : true,
range : {
$switch : {
branches : [
{
case : {
$and : [
{ $gte : [ "$date", ISODate("2017-01-01T00:00:00.000Z") ] },
{ $lt : [ "$date", ISODate("2017-01-10T00:00:00.000Z") ] }
]
},
then : { $concat : [
{ $dateToString: { format: "%d/%m/%Y", date: ISODate("2017-01-01T00:00:00.000Z") } },
{ $literal : " - " },
{ $dateToString: { format: "%d/%m/%Y", date: ISODate("2017-01-10T00:00:00.000Z") } }
] }
},
{
case : {
$and : [
{ $gte : [ "$date", ISODate("2017-02-01T00:00:00.000Z") ] },
{ $lt : [ "$date", ISODate("2017-02-10T00:00:00.000Z") ] }
]
},
then : { $concat : [
{ $dateToString: { format: "%d/%m/%Y", date: ISODate("2017-02-01T00:00:00.000Z") } },
{ $literal : " - " },
{ $dateToString: { format: "%d/%m/%Y", date: ISODate("2017-02-10T00:00:00.000Z") } }
] }
}
],
default : null
}
}
} },
{ $match : { range : { $ne : null } } },
{ $group : {
_id : "$range",
count : { $sum : 1 },
"amount summation" : { $sum : "$amount" }
} }
])
Based on your data it will give the following results*:
{ "_id" : "01/02/2017 - 10/02/2017", "count" : 2, "amount summation" : 230 }
{ "_id" : "01/01/2017 - 10/01/2017", "count" : 3, "amount summation" : 170 }
*I believe you have a few typos in your question, which is why the data look different.
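The three steps (assign a range, drop items with no range, group by range) can be checked against the sample data with a short plain JavaScript sketch. It is only an in-memory illustration of the same logic, not a replacement for the pipeline on millions of documents. The summarize helper is a made-up name, and the range bounds are the ones used in the $switch above (lower bound inclusive, upper bound exclusive).

```javascript
// Amounts and dates from the question's sample collection.
const items = [
  { amount: 50, date: new Date("2017-01-01T12:02:08.642Z") },
  { amount: 50, date: new Date("2017-01-03T12:02:08.642Z") },
  { amount: 70, date: new Date("2017-01-09T12:02:08.642Z") },
  { amount: 150, date: new Date("2017-01-22T12:02:08.642Z") },
  { amount: 200, date: new Date("2017-02-02T12:02:08.642Z") },
  { amount: 30, date: new Date("2017-02-07T12:02:08.642Z") },
  { amount: 10, date: new Date("2017-02-10T12:02:08.642Z") },
];

// Discrete ranges: from is inclusive, to is exclusive.
const ranges = [
  { from: new Date("2017-01-01T00:00:00.000Z"), to: new Date("2017-01-10T00:00:00.000Z") },
  { from: new Date("2017-02-01T00:00:00.000Z"), to: new Date("2017-02-10T00:00:00.000Z") },
];

// For each range, count the items that fall inside it and sum their amounts.
function summarize(allItems, allRanges) {
  return allRanges.map((range) => {
    const inRange = allItems.filter((item) => item.date >= range.from && item.date < range.to);
    return {
      from: range.from,
      to: range.to,
      count: inRange.length,
      amountSummation: inRange.reduce((sum, item) => sum + item.amount, 0),
    };
  });
}

const summary = summarize(items, ranges);
console.log(summary);
```

Items outside every range (here the 22 January and 10 February entries) simply never match a filter, which plays the role of the default null branch.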

Order by value in timeseries mongodb

I have a timeseries collection like this ( mongodb documentation sample)
_id: "20101010/site-1/apache_pb.gif",
metadata: {
date: ISODate("2000-10-10T00:00:00Z"),
site: "site-1",
page: "/apache_pb.gif" },
daily: 5468426,
hourly: {
"0": 227850,
"1": 210231,
"2" : 12344,
"23": 20457 },
minute: {
"0": 3612,
"1": 3241,
...
"1439": 2819 }
What is the best solution, using the aggregation framework, to sort hourly by value? For example, I want to order hourly from lowest to highest value, to get something like this:
{
"2": 12344,
"23" : 20457,
"1" : 210231,
"0" : 227850
}
Thanks
Hi, I ran into the same problem, and at that time I changed my document structure as below:
{
"_id" : "20101010/site-1/apache_pb.gif",
"metadata" : {
"date" : ISODate("2000-10-10T00:00:00Z"),
"site" : "site-1",
"page" : "/apache_pb.gif"
},
"daily" : 5468426,
"hourly" : [
{
"hour" : 0,
"value" : 227850
},
{
"hour" : 1,
"value" : 210231
},
{
"hour" : 2,
"value" : 12344
},
{
"hour" : 23,
"value" : 20457
}
],
"minute" : [
{
"min" : 0,
"value" : 3612
},
{
"min" : 1,
"value" : 3241
},
{
"min" : 1439,
"value" : 2819
}
]
}
Since you want the hourly data sorted by value from lowest to highest, the following aggregation query may solve your problem:
db.collectionName.aggregate([
{"$unwind":"$hourly"},
{"$project":{"hour":"$hourly.hour","value":"$hourly.value"}},
// sort by value, ascending (lowest first)
{"$sort":{"value":1}},
// "_id":0 puts everything in one group; use "$_id" to keep one result per document
{"$group":{"_id":0,"hourlyData":
{"$push": {"hour":"$hour","value":"$value"}}}}]).pretty()
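If the documents cannot be restructured, the ordering can also be produced on the client once a document has been fetched. The sketch below is plain JavaScript and assumes the original shape where hourly is an object keyed by hour; sortHourlyByValue is a made-up helper name. It returns an array because a JavaScript object with integer-like keys is always iterated in ascending key order, so an object cannot carry a value-based order.

```javascript
// The hourly sub-document from the question (only the hours shown there).
const hourly = { "0": 227850, "1": 210231, "2": 12344, "23": 20457 };

// Turn { hour: value } into [{ hour, value }, ...] sorted by value, lowest first.
function sortHourlyByValue(hourlyObject) {
  return Object.entries(hourlyObject)
    .map(([hour, value]) => ({ hour: Number(hour), value }))
    .sort((a, b) => a.value - b.value);
}

const sortedHourly = sortHourlyByValue(hourly);
console.log(sortedHourly);
```

The result has the same array-of-{hour, value} shape as the restructured document above, so the two approaches can be swapped without changing downstream code.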

How to normalize/reduce time data in mongoDB?

I'm storing minutely performance data in MongoDB, each collection is a type of performance report, and each document is the measurement at that point in time for the port on the array:
{
"DateTime" : ISODate("2012-09-28T15:51:03.671Z"),
"array_serial" : "12345",
"Port Name" : "CL1-A",
"metric" : 104.2
}
There can be up to 128 different "Port Name" entries per "array_serial".
As the data ages I'd like to be able to average it out over increasing time spans:
Up to 1 Week : minute
1 Week to 1 month : 5 minute
1 - 3 months: 15 minute
etc..
Here's how I'm averaging the times so that they can be reduced :
var resolution = 5; // How many minutes to average over
var map = function(){
var coeff = 1000 * 60 * resolution;
var roundTime = new Date(Math.round(this.DateTime.getTime() / coeff) * coeff);
emit(roundTime, { value : this.metric, count: 1 } );
};
I'll be summing the values and counts in the reduce function, and getting the average in the finalize function.
As you can see this would average the data for just the time leaving out the "Port Name" value, and I need to average the values over time for each "Port Name" on each "array_serial".
So how can I include the port name in the above map function? Should the key for the emit be a compound "array_serial,PortName,DateTime" value that I split later? Or should I use the query function to query for each distinct serial, port and time? Am I storing this data in the database correctly?
Also, as far as I know this data gets saved out to its own collection. What's the standard practice for replacing the data in the collection with this averaged data?
Is this what you mean Asya? Because it's not grouping the documents rounded to the lower 5 minute (btw, I changed 'DateTime' to 'datetime'):
$project: {
"year" : { $year : "$datetime" },
"month" : { $month : "$datetime" },
"day" : { $dayOfMonth : "$datetime" },
"hour" : { $hour : "$datetime" },
"minute" : { $mod : [ {$minute : "$datetime"}, 5] },
array_serial: 1,
port_name: 1,
port_number: 2,
metric: 1
}
From what I can tell the "$mod" operator will return the remainder of the minute divided by five, correct?
This would really help me if I could get the aggregation framework to do this operation rather than mapreduce.
Here is how you could do it in aggregation framework. I'm using a small simplification - I'm only grouping on Year, Month and Date - in your case you will need to add hour and minute for the finer grained calculations. You also have a choice about whether to do weighted average if the point distribution is not uniform in the data sample you get.
project={"$project" : {
"year" : {
"$year" : "$DateTime"
},
"month" : {
"$month" : "$DateTime"
},
"day" : {
"$dayOfMonth" : "$DateTime"
},
"array_serial" : 1,
"Port Name" : 1,
"metric" : 1
}
};
group={"$group" : {
"_id" : {
"a" : "$array_serial",
"P" : "$Port Name",
"y" : "$year",
"m" : "$month",
"d" : "$day"
},
"avgMetric" : {
"$avg" : "$metric"
}
}
};
db.metrics.aggregate([project, group]).result
I ran this with some random sample data and got something of this format:
[
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 100.8
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 7
},
"avgMetric" : 98
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-A",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 105
}
]
As you can see this is one result per array_serial, port name, year/month/date combination. You can use $sort to get them into the order you want to process them from there.
Here is how you would extend the project step to include hour and minute while rounding minutes to average over every five minutes:
{
"$project" : {
"year" : {
"$year" : "$DateTime"
},
"month" : {
"$month" : "$DateTime"
},
"day" : {
"$dayOfMonth" : "$DateTime"
},
"hour" : {
"$hour" : "$DateTime"
},
"fmin" : {
"$subtract" : [
{
"$minute" : "$DateTime"
},
{
"$mod" : [
{
"$minute" : "$DateTime"
},
5
]
}
]
},
"array_serial" : 1,
"Port Name" : 1,
"metric" : 1
}
}
Hope you will be able to extend that to your specific data and requirements.
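To make the rounding concrete: subtracting minute mod 5 from the minute floors each timestamp to the start of its five-minute bucket. The plain JavaScript sketch below applies the same idea to a handful of in-memory samples and averages them per array_serial, port name and bucket. It is an illustration only; bucketStart and averageByPortAndBucket are made-up helper names, and every sample after the first is invented test data.

```javascript
// Floor a Date to the start of its N-minute bucket.
function bucketStart(date, minutes) {
  const bucketMs = minutes * 60 * 1000;
  return new Date(Math.floor(date.getTime() / bucketMs) * bucketMs);
}

// Average metric per (array_serial, "Port Name", bucket).
function averageByPortAndBucket(samples, minutes) {
  const groups = new Map();
  for (const s of samples) {
    const bucket = bucketStart(s.DateTime, minutes).toISOString();
    const key = JSON.stringify([s.array_serial, s["Port Name"], bucket]);
    const g = groups.get(key) ||
      { array_serial: s.array_serial, port: s["Port Name"], bucket, sum: 0, count: 0 };
    g.sum += s.metric;
    g.count += 1;
    groups.set(key, g);
  }
  return Array.from(groups.values()).map((g) => ({
    array_serial: g.array_serial,
    port: g.port,
    bucket: g.bucket,
    avgMetric: g.sum / g.count,
  }));
}

const samples = [
  { DateTime: new Date("2012-09-28T15:51:03.671Z"), array_serial: "12345", "Port Name": "CL1-A", metric: 104.2 },
  { DateTime: new Date("2012-09-28T15:54:00.000Z"), array_serial: "12345", "Port Name": "CL1-A", metric: 100 },
  { DateTime: new Date("2012-09-28T15:56:00.000Z"), array_serial: "12345", "Port Name": "CL1-A", metric: 90 },
  { DateTime: new Date("2012-09-28T15:51:00.000Z"), array_serial: "12345", "Port Name": "CL1-B", metric: 50 },
];

const averaged = averageByPortAndBucket(samples, 5);
console.log(averaged);
```

The compound grouping key here corresponds to the multi-field _id in the $group stage, so there is no need to build and later split a single string key.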
"what's the standard practice for replacing the data in the collection with this averaged data?"
The standard practice is to keep the original data and to store all derived data separately.
In your case it means:
Don't delete the original data
Use another collection (in the same MongoDB database) to store average values