I need to aggregate my collection based on a certain time interval.
To be clear, I don't need to count e.g. per hour or per day.
I need to aggregate based on a 30-minute interval (or any other). Let's say the first document was created at 3:45 PM. Then there are 5 more documents, created between 3:45 PM and 4:15 PM.
So in this time interval I have 6 documents, and the first document of the MapReduce result is a document with the count of 6.
Let's say the next document is created at 4:35 PM and three more at 4:40 PM.
So the next document of the MapReduce result is a document with the count of 4.
And so on...
Currently my map function looks like this:
var map = function() {
    var key = {name: this.name, minute: this.timestamp.getMinutes()};
    emit(key, {count: 1});
};
So nothing special. Currently I group by the minute, which is not what I want in the end. Instead of the minute, I need to check against the time interval described above.
And my reduce function:
var reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(value) {
        sum += value['count'];
    });
    return {count: sum};
};
The output looks like this:
{ "_id" : { "name" : "A", "minute" : 11.0 }, "value" : { "count" : 1.0 } }
{ "_id" : { "name" : "B", "minute" : 41.0 }, "value" : { "count" : 6.0 } }
{ "_id" : { "name" : "B", "minute" : 42.0 }, "value" : { "count" : 3.0 } }
{ "_id" : { "name" : "C", "minute" : 41.0 }, "value" : { "count" : 2.0 } }
{ "_id" : { "name" : "C", "minute" : 42.0 }, "value" : { "count" : 2.0 } }
{ "_id" : { "name" : "D", "minute" : 11.0 }, "value" : { "count" : 1.0 } }
{ "_id" : { "name" : "E", "minute" : 16.0 }, "value" : { "count" : 1.0 } }
So it counts / aggregates documents per minute, but NOT by my custom time interval.
Any ideas about this?
Edit: My example using map reduce didn't work, but I think this does roughly what you want to do.
I use $project to define a field time that contains the minutes from your timestamp, rounded down to 5-minute intervals. This would be easy with an integer division, but I don't think the MongoDB query language supports that at this time, so instead I subtract minutes mod 5 from the minutes to get a number that changes every 5 minutes. A $group on the name and this time counter should then do the trick.
query = [
    {
        "$project": {
            "_id": "$_id",
            "name": "$name",
            "time": {
                "$subtract": [
                    {"$minute": "$timestamp"},
                    {"$mod": [{"$minute": "$timestamp"}, 5]}
                ]
            }
        }
    },
    {
        "$group": {"_id": {"name": "$name", "time": "$time"}, "count": {"$sum": 1}}
    }
]
db.foo.aggregate(query)
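The bucketing arithmetic itself is easy to sanity-check outside MongoDB. This is just a plain-JavaScript sketch of the minute - (minute mod 5) idea above, not MongoDB code:

```javascript
// Round a date's minutes down to the start of its 5-minute bucket,
// mirroring the {$subtract: [minute, {$mod: [minute, 5]}]} projection.
function bucketMinute(date, intervalMinutes) {
    var m = date.getUTCMinutes();
    return m - (m % intervalMinutes);
}

console.log(bucketMinute(new Date("2013-08-19T15:43:00Z"), 5)); // 40
console.log(bucketMinute(new Date("2013-08-19T15:58:00Z"), 5)); // 55
```

Note that, like the pipeline, this only buckets within the hour; for intervals that cross hour boundaries (e.g. 30-minute buckets anchored at 3:45 PM) you would have to work from the full timestamp instead.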
Related
Good Afternoon SO, I wonder if anybody could help me. I am currently investigating using MongoDB Aggregation Framework and MapReduce functions.
My dataset looks like this
[{
"Name" : "Person 1",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 2",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 5
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 40
}
]
}, {
"Name" : "Person 3",
"RunningSpeed" : [{
"Date" : ISODate("2005-07-23T23:00:00.000Z"),
"Value" : 20
}, {
"Date" : ISODate("2006-07-23T23:00:00.000Z"),
"Value" : 10
}, {
"Date" : ISODate("2007-07-23T23:00:00.000Z"),
"Value" : 30
}, {
"Date" : ISODate("2008-07-23T23:00:00.000Z"),
"Value" : 25
}
]
}
]
I have done a lot of research and, as far as I can see, there is no out-of-the-box support for doing SD calculations. I have reviewed a few links and SO posts and came up with this URL https://gist.github.com/RedBeard0531/1886960, which seems to be what I am looking for.
So, enough about the background: what I would like to do is generate a chart of SDs over each year.
The current function does not take each year into consideration, only the values as a whole. I have changed the map function to the following, but have no idea where to put the group-by-date logic.
function map() {
    emit(1, // Or put a GROUP BY key here
        {sum: this.RunningSpeed.value, // the field you want stats for
         min: this.RunningSpeed.value,
         max: this.RunningSpeed.value,
         count: 1,
         diff: 0 // M2,n: sum((val-mean)^2)
        });
}
However, I just get zeros. Could anybody help me adapt this function?
You need to use getFullYear() and a forEach loop through each RunningSpeed entry:
function map() {
    this.RunningSpeed.forEach(function(data) {
        var year = data.Date.getFullYear();
        emit(year, // the GROUP BY key is Date.getFullYear()
            {sum: data.Value, // the field you want stats for
             min: data.Value,
             max: data.Value,
             count: 1,
             diff: 0 // M2,n: sum((val-mean)^2)
            });
    });
}
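To see what that map is doing, here is the same per-year grouping sketched in plain JavaScript; sampleDoc is a made-up document in the question's shape, and the stats mirror the fields the map emits:

```javascript
// Accumulate per-year stats (sum/min/max/count) over one document's
// RunningSpeed array, like the map function's per-entry emits.
function statsByYear(doc) {
    var out = {};
    doc.RunningSpeed.forEach(function (data) {
        var year = data.Date.getUTCFullYear();
        var s = out[year] || { sum: 0, min: Infinity, max: -Infinity, count: 0 };
        s.sum += data.Value;
        s.min = Math.min(s.min, data.Value);
        s.max = Math.max(s.max, data.Value);
        s.count += 1;
        out[year] = s;
    });
    return out;
}

var sampleDoc = {
    Name: "Person 1",
    RunningSpeed: [
        { Date: new Date("2005-07-23T23:00:00Z"), Value: 10 },
        { Date: new Date("2006-07-23T23:00:00Z"), Value: 20 }
    ]
};
console.log(statsByYear(sampleDoc)); // one stats object per year
```

In the real mapReduce, the reduce step then merges these partial stats across documents (summing sum and count, taking the min of mins, and so on), as in the gist linked above.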
I have data in the following format:
{
"__v" : 0,
"_id" : ObjectId("12367687"),
"xyzId" : "ADV_ID",
"date" : ISODate("2013-08-19T10:58:21.473Z"),
"gId" : "987654",
"type" : "checks"
}
For the above data I have to plot a graph of daily, weekly, monthly and yearly data using the "created_on" field and a count of the "type" field. I have the following query, which works partially.
db.trackads.aggregate(
    {$match: {"gId": "987654", type: 'checks'}},
    {$group: {_id: {"day": {"$week": "$date"}}, cnt: {"$sum": 1}}}
);
Result:
{
"result" : [
{
"_id" : {
"day" : 34
},
"cnt" : 734
},
{
"_id" : {
"day" : 33
},
"cnt" : 349
}
],
"ok" : 1
}
But with the above query, I do not get results for dates (week numbers) where the count for "type" = impressions is 0. How should I modify the Mongo query to also get results with a count of 0?
If I understand the question correctly, you can do it as follows.
db.trackads.aggregate(
    {$match: {"groupId": "520a077c62d4b3b00e008905",
              $or: [{type: {$exists: false}}, {type: 'impressions'}]}},
    {$group: {_id: {"day": {"$week": "$created_on"}}, cnt: {"$sum": 1}}}
);
The easiest option would be to add the zeroes in the code that consumes the results of this query.
MongoDB can't invent data that is not there. For example, how far back in time and forward in time would you expect it to go creating zeroes?
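For example, the zero-filling on the consuming side could look like this (the week range 32..35 is just an assumption for illustration):

```javascript
// Fill missing week numbers with zero counts, given the
// aggregation result shape {_id: {day: <week>}, cnt: <n>}.
function fillZeroWeeks(result, firstWeek, lastWeek) {
    var byWeek = {};
    result.forEach(function (r) { byWeek[r._id.day] = r.cnt; });
    var filled = [];
    for (var w = firstWeek; w <= lastWeek; w++) {
        filled.push({ week: w, cnt: byWeek[w] || 0 });
    }
    return filled;
}

var result = [{ _id: { day: 34 }, cnt: 734 }, { _id: { day: 33 }, cnt: 349 }];
console.log(fillZeroWeeks(result, 32, 35));
// weeks 32..35 with counts 0, 349, 734, 0
```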
I'm storing minutely performance data in MongoDB, each collection is a type of performance report, and each document is the measurement at that point in time for the port on the array:
{
"DateTime" : ISODate("2012-09-28T15:51:03.671Z"),
"array_serial" : "12345",
"Port Name" : "CL1-A",
"metric" : 104.2
}
There can be up to 128 different "Port Name" entries per "array_serial".
As the data ages I'd like to be able to average it out over increasing time spans:
Up to 1 Week : minute
1 Week to 1 month : 5 minute
1 - 3 months: 15 minute
etc..
Here's how I'm averaging the times so that they can be reduced :
var resolution = 5; // How many minutes to average over
var map = function() {
    var coeff = 1000 * 60 * resolution;
    var roundTime = new Date(Math.round(this.DateTime.getTime() / coeff) * coeff);
    emit(roundTime, {value: this.metric, count: 1});
};
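The rounding step can be checked on its own in plain JavaScript; note it uses Math.round, so a timestamp snaps to the nearest 5-minute mark rather than always rounding down:

```javascript
// Round a timestamp to the nearest 5-minute boundary,
// exactly as the map function above does.
function roundTime(date, resolutionMinutes) {
    var coeff = 1000 * 60 * resolutionMinutes;
    return new Date(Math.round(date.getTime() / coeff) * coeff);
}

console.log(roundTime(new Date("2012-09-28T15:51:03.671Z"), 5).toISOString());
// 15:51:03 is nearer 15:50 than 15:55, so this prints 2012-09-28T15:50:00.000Z
```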
I'll be summing the values and counts in the reduce function and computing the average in the finalize function.
As you can see this would average the data for just the time leaving out the "Port Name" value, and I need to average the values over time for each "Port Name" on each "array_serial".
So how can I include the port name in the above map function? Should the key for the emit be a compound "array_serial,PortName,DateTime" value that I split later? Or should I use the query function to query for each distinct serial, port and time? Am I storing this data in the database correctly?
Also, as far as I know this data gets saved out to its own collection; what's the standard practice for replacing the data in the collection with this averaged data?
Is this what you mean, Asya? Because it's not grouping the documents rounded down to the lower 5-minute mark (btw, I changed 'DateTime' to 'datetime'):
$project: {
    "year": { $year: "$datetime" },
    "month": { $month: "$datetime" },
    "day": { $dayOfMonth: "$datetime" },
    "hour": { $hour: "$datetime" },
    "minute": { $mod: [{ $minute: "$datetime" }, 5] },
    array_serial: 1,
    port_name: 1,
    port_number: 1,
    metric: 1
}
From what I can tell the "$mod" operator will return the remainder of the minute divided by five, correct?
This would really help me if I could get the aggregation framework to do this operation rather than mapreduce.
Here is how you could do it in the aggregation framework. I'm using a small simplification - I'm only grouping on year, month and day - in your case you will need to add hour and minute for finer-grained calculations. You also have a choice about whether to do a weighted average if the point distribution is not uniform in the data sample you get.
project = {"$project": {
    "year": {"$year": "$DateTime"},
    "month": {"$month": "$DateTime"},
    "day": {"$dayOfWeek": "$DateTime"},
    "array_serial": 1,
    "Port Name": 1,
    "metric": 1
}};
group = {"$group": {
    "_id": {
        "a": "$array_serial",
        "P": "$Port Name",
        "y": "$year",
        "m": "$month",
        "d": "$day"
    },
    "avgMetric": {"$avg": "$metric"}
}};
db.metrics.aggregate([project, group]).result
I ran this with some random sample data and got something of this format:
[
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 100.8
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-B",
"y" : 2012,
"m" : 9,
"d" : 7
},
"avgMetric" : 98
},
{
"_id" : {
"a" : "12345",
"P" : "CL1-A",
"y" : 2012,
"m" : 9,
"d" : 6
},
"avgMetric" : 105
}
]
As you can see this is one result per array_serial, port name, year/month/date combination. You can use $sort to get them into the order you want to process them from there.
Here is how you would extend the project step to include hour and minute while rounding minutes to average over every five minutes:
{
    "$project": {
        "year": {"$year": "$DateTime"},
        "month": {"$month": "$DateTime"},
        "day": {"$dayOfWeek": "$DateTime"},
        "hour": {"$hour": "$DateTime"},
        "fmin": {"$subtract": [
            {"$minute": "$DateTime"},
            {"$mod": [{"$minute": "$DateTime"}, 5]}
        ]},
        "array_serial": 1,
        "Port Name": 1,
        "metric": 1
    }
}
Hope you will be able to extend that to your specific data and requirements.
"what's the standard practice for replacing the data in the collection with this averaged data?"
The standard practice is to keep the original data and to store all derived data separately.
In your case it means:
Don't delete the original data
Use another collection (in the same MongoDB database) to store average values
I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form
{"name" : "Jeroen", "gender" : "m", "age" :27.53 }
Now I want to get the distributions for gender and age. Assume for gender, there are only values "m" and "f". What is the most efficient way of getting the total count of males and females in my collection?
And for age, is there a way that does some 'binning' and gives me a histogram like summary; i.e. the number of records where age is in the intervals: [0, 2), [2, 4), [4, 6) ... etc?
I just tried out the new aggregation framework that will be available in MongoDB version 2.2 (2.2.0-rc0 has been released), which should have higher performance than map reduce since it doesn't rely on Javascript.
input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
aggregation command for gender:
db.collection.aggregate(
    {$project: {gender: 1}},
    {$group: {
        _id: "$gender",
        count: {$sum: 1}
    }})
result:
{"result" :
[
{"_id" : "m", "count" : 2},
{"_id" : "f", "count" : 3}
],
"ok" : 1
}
To get the ages in bins:
db.collection.aggregate(
    {$project: {
        ageLowerBound: {$subtract: ["$age", {$mod: ["$age", 2]}]}
    }},
    {$group: {
        _id: "$ageLowerBound",
        count: {$sum: 1}
    }})
result:
{"result" :
[
{"_id" : 26, "count" : 3},
{"_id" : 22, "count" : 2}
],
"ok" : 1
}
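The $subtract/$mod projection is plain remainder arithmetic; the same binning in ordinary JavaScript, using ages from the sample data:

```javascript
// Lower bound of the 2-year-wide age bin, matching the
// {$subtract: ["$age", {$mod: ["$age", 2]}]} projection.
function ageLowerBound(age, binWidth) {
    return age - (age % binWidth);
}

console.log(ageLowerBound(22.34, 2)); // 22
console.log(ageLowerBound(27.4, 2));  // 26
```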
Konstantin's answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.
To count genders, the map function key is the this.gender attribute for every record. The reduce function then simply adds them up:
// count genders
db.persons.mapReduce(
    function() {
        emit(this["gender"], {count: 1});
    },
    function(key, values) {
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    },
    {out: {inline: 1}}
);
To do the binning, we set the key in the map function to round down to the nearest division by two. Therefore e.g. any value between 10 and 11.9999 will get the same key "10-12". And then again we simply add them up:
db.responses.mapReduce(
    function() {
        var x = Math.floor(this["age"] / 2) * 2;
        var key = x + "-" + (x + 2);
        emit(key, {count: 1});
    },
    function(key, values) {
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    },
    {out: {inline: 1}}
);
An easy way to get the total count of males would be db.x.find({"gender": "m"}).count().
If you want both male and female counts in just one query, then there is no easy way. Map/reduce would be one possibility, or perhaps the new aggregation framework. The same is true for your binning requirement.
Mongo is not great for aggregation, but it's fantastic for many small incremental updates.
So the best way to solve this problem with Mongo would be to collect the aggregation data in a separate collection.
So, if you keep a stats collection with one document like this:
stats: [
{
"male": 23,
"female": 17,
"ageDistribution": {
"0_2" : 3,
"2_4" : 5,
"4_6" : 7
}
}
]
... then every time you add or remove a person from the other collection, you count the respective fields up or down in the stats collection:
db.stats.update({}, {"$inc": {"male": 1, "ageDistribution.2_4": 1}})
Queries to stats will be lightning fast this way, and you will hardly notice any performance overhead from counting the stats up and down.
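A sketch of how the application side might build that $inc document when a person is added; the field names simply mirror the hypothetical stats document above:

```javascript
// Build the $inc document for one new person: bump the gender
// counter and the 2-year-wide age bin the person falls into.
function statsInc(person) {
    var bin = Math.floor(person.age / 2) * 2;
    var inc = {};
    inc[person.gender === "m" ? "male" : "female"] = 1;
    inc["ageDistribution." + bin + "_" + (bin + 2)] = 1;
    return inc;
}

console.log(statsInc({ gender: "m", age: 2.5 }));
// { male: 1, 'ageDistribution.2_4': 1 }
```

The returned object would then be passed as the $inc value in the update call.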
Based on the answer of @ColinE, binning for a histogram can be done with:
db.persons.aggregate([
    {
        $bucket: {
            groupBy: "$j.age",
            boundaries: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
            default: "Other",
            output: {
                "count": { $sum: 1 }
            }
        }
    }
],
{allowDiskUse: true})
$bucketAuto did not work for me, since its buckets seem to be chosen on a logarithmic scale.
allowDiskUse is only necessary if you have millions of documents.
Depending on the amount of data, the most effective way to find the number of males and females could be either a naive query or a map/reduce job. Binning is best done via map/reduce:
in the map phase your key is a bin and the value is 1, and in the reduce phase you just sum up the values.
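That map/reduce shape (map emits (bin, 1), reduce sums per key) can be simulated in a few lines of plain JavaScript:

```javascript
// Histogram of ages: the map phase's key is floor(age / width) * width,
// and the reduce phase just sums the emitted 1s per key.
function histogram(ages, width) {
    var counts = {};
    ages.forEach(function (age) {
        var bin = Math.floor(age / width) * width; // map: key = bin
        counts[bin] = (counts[bin] || 0) + 1;      // reduce: sum values
    });
    return counts;
}

console.log(histogram([22.34, 23.9, 27.4, 26.9, 26], 2));
// { '22': 2, '26': 3 }
```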
With Mongo 3.4 this just got even easier, thanks to the new $bucket and $bucketAuto aggregation functions. The following query auto-buckets into two groups:
db.bucket.aggregate( [
{
$bucketAuto: {
groupBy: "$gender",
buckets: 2
}
}
] )
With the following input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
It gives the following result:
{ "_id" : { "min" : "f", "max" : "m" }, "count" : 3 }
{ "_id" : { "min" : "m", "max" : "m" }, "count" : 2 }
Note that $bucket and $bucketAuto are typically used for continuous variables (numeric, date), but in this case $bucketAuto works just fine.
I am using MongoDB, in which I have a collection in the following format:
{ "id" : 1, "name" : "x", "ttm" : 23, "val" : 5 }
{ "id" : 1, "name" : "x", "ttm" : 34, "val" : 1 }
{ "id" : 1, "name" : "x", "ttm" : 24, "val" : 2 }
{ "id" : 2, "name" : "x", "ttm" : 56, "val" : 3 }
{ "id" : 2, "name" : "x", "ttm" : 76, "val" : 3 }
{ "id" : 3, "name" : "x", "ttm" : 54, "val" : 7 }
On that collection I have queried to get records in descending order, like this:
db.foo.find({"id": {"$in": [1, 2, 3]}}).sort({ttm: -1}).limit(3)
But it gives two records with the same id = 1, and I want one record per id.
Is it possible in mongodb?
There is a distinct command in MongoDB that can be used in conjunction with a query. However, I believe it just returns a distinct list of values for a specific key you name (i.e. in your case, you'd only get the id values returned), so I'm not sure this will give you exactly what you want if you need the whole documents - you may require MapReduce instead.
Documentation on distinct:
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Distinct
You want to use aggregation. You could do that like this:
db.test.aggregate([
    // Each object is a pipeline stage.
    {
        $group: {
            originalId: {$first: '$_id'}, // Hold onto original ID.
            _id: '$id', // Set the unique identifier
            val: {$first: '$val'},
            name: {$first: '$name'},
            ttm: {$first: '$ttm'}
        }
    },
    {
        // This receives the output from the first stage.
        // So the (originally) non-unique 'id' field is now
        // present as the _id field. We want to rename it.
        $project: {
            _id: '$originalId', // Restore original ID.
            id: '$_id',
            val: '$val',
            name: '$name',
            ttm: '$ttm'
        }
    }
])
This will be very fast... ~90ms for my test DB of 100,000 documents.
Example:
db.test.find()
// { "_id" : ObjectId("55fb595b241fee91ac4cd881"), "id" : 1, "name" : "x", "ttm" : 23, "val" : 5 }
// { "_id" : ObjectId("55fb596d241fee91ac4cd882"), "id" : 1, "name" : "x", "ttm" : 34, "val" : 1 }
// { "_id" : ObjectId("55fb59c8241fee91ac4cd883"), "id" : 1, "name" : "x", "ttm" : 24, "val" : 2 }
// { "_id" : ObjectId("55fb59d9241fee91ac4cd884"), "id" : 2, "name" : "x", "ttm" : 56, "val" : 3 }
// { "_id" : ObjectId("55fb59e7241fee91ac4cd885"), "id" : 2, "name" : "x", "ttm" : 76, "val" : 3 }
// { "_id" : ObjectId("55fb59f9241fee91ac4cd886"), "id" : 3, "name" : "x", "ttm" : 54, "val" : 7 }
db.test.aggregate(/* from first code snippet */)
// output
{
"result" : [
{
"_id" : ObjectId("55fb59f9241fee91ac4cd886"),
"val" : 7,
"name" : "x",
"ttm" : 54,
"id" : 3
},
{
"_id" : ObjectId("55fb59d9241fee91ac4cd884"),
"val" : 3,
"name" : "x",
"ttm" : 56,
"id" : 2
},
{
"_id" : ObjectId("55fb595b241fee91ac4cd881"),
"val" : 5,
"name" : "x",
"ttm" : 23,
"id" : 1
}
],
"ok" : 1
}
PROS: Almost certainly the fastest method.
CONS: Involves use of the complicated Aggregation API. Also, it is tightly coupled to the original schema of the document. Though, it may be possible to generalize this.
I believe you can use aggregate like this
collection.aggregate({
    $group: {
        "_id": "$id",
        "docs": {
            $first: {
                "name": "$name",
                "ttm": "$ttm",
                "val": "$val"
            }
        }
    }
});
The issue is that you want to distill 3 matching records down to one without providing any logic in the query for how to choose between the matching results.
Your options are basically to specify aggregation logic of some kind (select the max or min value for each column, for example), or to run a select distinct query and only select the fields that you wish to be distinct.
querymongo.com does a good job of translating these distinct queries for you (from SQL to MongoDB).
For example, this SQL:
SELECT DISTINCT columnA FROM collection WHERE columnA > 5
Is returned as this MongoDB:
db.runCommand({
"distinct": "collection",
"query": {
"columnA": {
"$gt": 5
}
},
"key": "columnA"
});
If you want to write the distinct result to a file using JavaScript, this is how you could do it:
cursor = db.myColl.find({'fieldName': 'fieldValue'})
var Arr = new Array();
var count = 0;
cursor.forEach(function(x) {
    var temp = x.id;
    var index = Arr.indexOf(temp); // -1 if we haven't seen this id yet
    if (index == -1) {
        printjson(x.id); // first occurrence: print it and remember it
        Arr[count] = temp;
        count++;
    }
});
Specify Query with distinct.
The following example returns the distinct values for the field sku, embedded in the item field, from the documents whose dept is equal to "A":
db.inventory.distinct( "item.sku", { dept: "A" } )
Reference: https://docs.mongodb.com/manual/reference/method/db.collection.distinct/