Binning and tabulate (unique/count) in Mongo

I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form
{"name" : "Jeroen", "gender" : "m", "age" :27.53 }
Now I want to get the distributions for gender and age. Assume for gender, there are only values "m" and "f". What is the most efficient way of getting the total count of males and females in my collection?
And for age, is there a way that does some 'binning' and gives me a histogram like summary; i.e. the number of records where age is in the intervals: [0, 2), [2, 4), [4, 6) ... etc?

I just tried out the new aggregation framework that will be available in MongoDB 2.2 (2.2.0-rc0 has been released). It should offer better performance than map/reduce, since it doesn't rely on JavaScript.
input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
aggregation command for gender:
db.collection.aggregate(
    {$project: {gender: 1}},
    {$group: {
        _id: "$gender",
        count: {$sum: 1}
    }}
)
result:
{"result" :
[
{"_id" : "m", "count" : 2},
{"_id" : "f", "count" : 3}
],
"ok" : 1
}
To get the ages in bins:
db.collection.aggregate(
    {$project: {
        ageLowerBound: {$subtract: ["$age", {$mod: ["$age", 2]}]}
    }},
    {$group: {
        _id: "$ageLowerBound",
        count: {$sum: 1}
    }}
)
result:
{"result" :
[
{"_id" : 26, "count" : 3},
{"_id" : 22, "count" : 2}
],
"ok" : 1
}

Konstantin's answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.
To count genders, the map function emits this.gender as the key for every record. The reduce function then simply adds up the counts:
// count genders
db.persons.mapReduce(
    function() {
        emit(this["gender"], {count: 1});
    },
    function(key, values) {
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    },
    {out: {inline: 1}}
);
To do the binning, the map function rounds each age down to the nearest multiple of two, so e.g. any value between 10 and 11.9999 gets the same key "10-12". Then we again simply add up the counts:
db.responses.mapReduce(
    function() {
        var x = Math.floor(this["age"] / 2) * 2;
        var key = x + "-" + (x + 2);
        emit(key, {count: 1});
    },
    function(key, values) {
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    },
    {out: {inline: 1}}
);

An easy way to get the total count of males would be db.x.find({"gender": "m"}).count()
If you want both male and female counts in just one query, there is no easy way: map/reduce would be one possibility, or perhaps the new aggregation framework. The same is true for your binning requirement.
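On MongoDB 3.4+, a $facet stage can produce both summaries in a single query. A minimal sketch (the persons collection name and the bin boundaries are assumptions chosen to match the sample data in this thread):
db.persons.aggregate([
    {$facet: {
        // tabulate gender counts
        genders: [
            {$group: {_id: "$gender", count: {$sum: 1}}}
        ],
        // histogram of ages in fixed-width bins
        ageBins: [
            {$bucket: {groupBy: "$age", boundaries: [20, 22, 24, 26, 28], default: "other"}}
        ]
    }}
])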
Mongo is not great for ad-hoc aggregation, but it's fantastic for many small incremental updates.
So the best way to solve this problem with Mongo would be to collect the aggregated data in a separate collection.
So, if you keep a stats collection with one document like this:
{
    "male": 23,
    "female": 17,
    "ageDistribution": {
        "0_2" : 3,
        "2_4" : 5,
        "4_6" : 7
    }
}
... then every time you add or remove a person in the other collection, you increment or decrement the respective counters in the stats collection:
db.stats.update({}, {"$inc": {"male": 1, "ageDistribution.2_4": 1}})
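Removing a person is the symmetric operation; a sketch (field names follow the stats document above):
// undo the counters when a male aged 2-4 is deleted
db.stats.update({}, {"$inc": {"male": -1, "ageDistribution.2_4": -1}})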
Queries to stats will be lightning fast this way, and you will hardly notice any performance overhead from counting the stats up and down.

Based on @ColinE's answer, binning for a histogram can be done with $bucket:
db.persons.aggregate([
    {
        $bucket: {
            groupBy: "$j.age",
            boundaries: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
            default: "Other",
            output: {
                "count": {$sum: 1}
            }
        }
    }
],
{allowDiskUse: true})
$bucketAuto did not work for me, since its buckets seem to be chosen on a logarithmic scale.
allowDiskUse is only necessary if you have millions of documents.

Depending on the amount of data, the most effective way to find the number of males and females could be either a naive query or a map/reduce job. Binning is best done via map/reduce: in the map phase your key is a bin and the value is 1, and in the reduce phase you simply sum up the values (see the full map/reduce implementation earlier in this thread).

With MongoDB 3.4 this got even easier, thanks to the new $bucket and $bucketAuto aggregation stages. The following query auto-buckets into two groups:
db.bucket.aggregate([
    {
        $bucketAuto: {
            groupBy: "$gender",
            buckets: 2
        }
    }
])
With the following input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
It gives the following result:
{ "_id" : { "min" : "f", "max" : "m" }, "count" : 3 }
{ "_id" : { "min" : "m", "max" : "m" }, "count" : 2 }
Note, $bucket and $bucketAuto are typically used for continuous variables (numeric, date), but in this case $bucketAuto works just fine.
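For the fixed-width age bins from the original question, $bucket with explicit boundaries does the job. A sketch against the same sample collection (boundaries chosen to cover the sample ages):
db.bucket.aggregate([
    {
        $bucket: {
            groupBy: "$age",
            boundaries: [22, 24, 26, 28],   // bins [22,24), [24,26), [26,28)
            default: "other"
        }
    }
])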

Related

Indexing MongoDB for sort consistency

The MongoDB documentation says that MongoDB doesn't store documents in a collection in a particular order. So if you have this collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
] );
and sorting like this:
db.restaurants.aggregate(
[
{ $sort : { borough : 1 } }
]
)
Then the sort order can be inconsistent since:
the borough field contains duplicate values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by borough, but the order of those documents with duplicate values for borough might not be the same across multiple executions of the same sort.
To return a consistent result it's recommended to modify the query to:
db.restaurants.aggregate(
[
{ $sort : { borough : 1, _id: 1 } }
]
)
My question relates to the efficiency of such a query. Let's say you have millions of documents, should you create a compound index, something like { borough: 1, _id: -1 }, to make it efficient? Or is it enough to index { borough: 1 } due to the, potentially, special nature of the _id field?
I'm using MongoDB 4.4.
If you need a stable sort, you will have to sort on both fields, and for a performant query you will need a compound index on both fields. Note that the index key directions must match the sort (or its exact reverse), so for the sort { borough: 1, _id: 1 } the index should be:
{ borough: 1, _id: 1 }
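A minimal sketch of creating that index (the key directions mirror the $sort stage above):
db.restaurants.createIndex({ borough: 1, _id: 1 })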

How can I aggregate documents by time interval in MongoDB?

I need to aggregate my collection based on a certain time interval, but not, as you might think, per hour or per day.
I need to aggregate based on a 30-minute interval (or any other). Let's say the first document was created at 3:45PM, and 5 more documents were created between 3:45PM and 4:15PM.
So in this time interval I have 6 documents, and the first document of the MapReduce result should have a count of 6.
Let's say the next document is created at 4:35PM and three more at 4:40PM.
So the next document of the MapReduce result should have a count of 4.
And so on...
Currently my map function looks like this:
var map = function() {
    var key = {name: this.name, minute: this.timestamp.getMinutes()};
    emit(key, {count: 1});
};
So nothing special. Currently I group by the minute, which is not what I want in the end. Here, instead of the minute, I need to be able to use the time interval described above.
And my reduce function:
var reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(value) {
        sum += value['count'];
    });
    return {count: sum};
};
The output of this looks like:
{ "_id" : { "name" : "A", "minute" : 11.0 }, "value" : { "count" : 1.0 } }
{ "_id" : { "name" : "B", "minute" : 41.0 }, "value" : { "count" : 6.0 } }
{ "_id" : { "name" : "B", "minute" : 42.0 }, "value" : { "count" : 3.0 } }
{ "_id" : { "name" : "C", "minute" : 41.0 }, "value" : { "count" : 2.0 } }
{ "_id" : { "name" : "C", "minute" : 42.0 }, "value" : { "count" : 2.0 } }
{ "_id" : { "name" : "D", "minute" : 11.0 }, "value" : { "count" : 1.0 } }
{ "_id" : { "name" : "E", "minute" : 16.0 }, "value" : { "count" : 1.0 } }
So it counts / aggregates documents per minute, but NOT by my custom time interval.
Any ideas about this?
Edit: My example using map reduce didn't work, but I think this does roughly what you want to do.
I use $project to define a variable time containing the minutes from your timestamp, rounded down to 5-minute intervals. This would be easy with an integer division, but I don't think the MongoDB query language supports that at this time, so instead I subtract minutes mod 5 from the minutes to get a number that changes every 5 minutes. A group by the name and this time counter then does the trick.
query = [
{
"$project": {
"_id":"$_id",
"name":"$name",
"time": {
"$subtract": [
{"$minute":"$timestamp"},
{"$mod": [{"$minute":"$timestamp"}, 5]}
]
}
}
},
{
"$group": {"_id": {"name": "$name", "time": "$time"}, "count":{"$sum":1}}
}
]
db.foo.aggregate(query)
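On MongoDB 5.0+ this is simpler: $dateTrunc bins a date by an arbitrary interval directly, and it also keeps documents from different hours or days apart (which the minutes-only key above does not). A sketch for the 30-minute case, using the same collection and field names:
query = [
    {
        "$group": {
            "_id": {
                "name": "$name",
                // truncate each timestamp down to its 30-minute bin
                "interval": {"$dateTrunc": {"date": "$timestamp", "unit": "minute", "binSize": 30}}
            },
            "count": {"$sum": 1}
        }
    }
]
db.foo.aggregate(query)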

MongoDB incorrect results with elemMatch and date fields

I have the following document
{
"_id" : ObjectId("52da43cd6f0a61e8a5059aaf"),
"assignments" : [
{
"project" : "abc",
"start" : ISODate("2012-12-31T18:30:00Z"),
"end" : ISODate("2013-06-29T18:30:00Z")
},
{
"project" : "efg",
"start" : ISODate("2013-06-30T18:30:00Z"),
"end" : ISODate("2014-03-30T18:30:00Z")
}
],
"eid" : "123",
"name" : "n1",
"uid" : "u1"
}
I am trying to find all the assignments ending on or after a certain date with the following query, but the returned data is confusing and doesn't look correct. Please help me understand what I am doing wrong.
> db.test.find( {uid:'u1'},{ assignments: { $elemMatch: {end: {$gte: new Date(2011,5,1) } } }} ).pretty();
{
"_id" : ObjectId("52da43cd6f0a61e8a5059aaf"),
"assignments" : [
{
"project" : "abc",
"start" : ISODate("2012-12-31T18:30:00Z"),
"end" : ISODate("2013-06-29T18:30:00Z")
}
]
}
Shouldn't it return both projects?
The $elemMatch operator limits the results to the first matching array element per document.
If you want to return all matching array elements you can use the Aggregation Framework:
db.courses.aggregate(
// Find matching documents (would benefit from an index on `{uid: 1}`)
{ $match : {
uid:'u1'
}},
// Unpack the assignments array
{ $unwind : "$assignments" },
// Find the assignments ending after given date
{ $match : {
"assignments.end": { $gte: ISODate("2011-05-01") }
}}
)
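On MongoDB 3.2+, a $filter projection keeps all matching array elements without unwinding and regrouping. A sketch using the same match and date (collection name as in the question):
db.test.aggregate([
    { $match : { uid : 'u1' } },
    // keep only the assignments ending on or after the given date
    { $project : {
        assignments : {
            $filter : {
                input : "$assignments",
                as : "a",
                cond : { $gte : [ "$$a.end", ISODate("2011-05-01") ] }
            }
        }
    }}
])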
$elemMatch limits projections in the following way.
Say you have a collection with these documents:
{ "_id" : 1, "prices" : [ 50, 60, 70 ] }
{ "_id" : 2, "prices" : [ 80, 90, 100 ] }
db.t.find({},{'prices':{$elemMatch:{$gt:50}}})
Returns:
{ "_id" : 1, "prices" : [ 60 ] }
{ "_id" : 2, "prices" : [ 80 ] }
The projection is curtailed to the first element that is matched, rather than all elements that are matched.
If you were expecting a result like this:
{ "_id" : 1, "prices" : [ 60,70] }
{ "_id" : 2, "prices" : [ 80, 90, 100 ] }
That is not possible with projection, which only supports $elemMatch, $ and $slice; none of these returns all matching elements, and there is no other projection operator that does.
If you want to reshape the document this way, you can match, filter and project with an unwind and a re-wind (via grouping):
db.t.aggregate([
    { $unwind : '$prices' },
    { $match : { 'prices' : { $gt : 50 } } },
    { $group : { _id : '$_id', prices : { $push : '$prices' } } }
])
This results in what we're looking for:
{ "_id" : 2, "prices" : [ 80, 90, 100 ] }
{ "_id" : 1, "prices" : [ 60, 70 ] }
You can take this pattern and use it for your situation.
With that query you are searching for documents with uid = u1 and projecting the results to show only the first array element whose end date is greater than Date(2011,5,1). Check the $elemMatch projection operator.
I think you need something like this:
db.test.find(
    {
        uid : 'u1',
        "assignments.end" : { $gte : new Date(2011,5,1) }
    }).pretty();

Find maximum date from multiple embedded documents

One of many documents in my collection is like below:
{ "_id" :123,
"a" :[
{ "_id" : 1,
"dt" : ISODate("2013-06-10T19:38:42Z")
},
{ "_id" : 2,
"dt" : ISODate("2013-02-10T19:38:42Z")
}
],
"b" :[
{ "_id" : 1,
"dt" : ISODate("2013-02-10T19:38:42Z")
},
{ "_id" : 2,
"dt" : ISODate("2013-23-10T19:38:42Z")
}
],
"c" :[
{ "_id" : 1,
"dt" : ISODate("2013-03-10T19:38:42Z")
},
{ "_id" : 2,
"dt" : ISODate("2013-13-10T19:38:42Z")
}
]
}
I want to find the maximum date across the whole document (a, b and c combined).
The solution I have right now loops through all root _ids and then does a $match in the aggregation framework for each of a, b and c for every root document. This sounds very inefficient; any better ideas?
Your question is very similar to this question. Check it out.
As with that solution, your issue can be handled with MongoDB's aggregation framework, using the $project and $cond operators to repeatedly flatten your documents while preserving the max value at each step.
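On MongoDB 3.2+, there is a much shorter route: $max accepts an array expression, so the three arrays can be concatenated and reduced in a single $project. A sketch against the document above (the collection name is a placeholder):
db.coll.aggregate([
    { $project : {
        // "$a.dt" resolves to the array of dt values inside a;
        // concatenate all three arrays, then take the maximum
        maxDate : { $max : { $concatArrays : [ "$a.dt", "$b.dt", "$c.dt" ] } }
    }}
])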

mongodb get distinct records

I am using mongoDB in which I have collection of following format.
{"id" : 1 , name : x ttm : 23 , val : 5 }
{"id" : 1 , name : x ttm : 34 , val : 1 }
{"id" : 1 , name : x ttm : 24 , val : 2 }
{"id" : 2 , name : x ttm : 56 , val : 3 }
{"id" : 2 , name : x ttm : 76 , val : 3 }
{"id" : 3 , name : x ttm : 54 , val : 7 }
On that collection I have queried to get records in descending order like this:
db.foo.find({"id" : {"$in" : [1,2,3]}}).sort({ttm : -1}).limit(3)
But it gives two records with the same id = 1, and I want one record per id.
Is this possible in MongoDB?
There is a distinct command in MongoDB that can be used in conjunction with a query. However, it just returns a distinct list of values for a specific key you name (i.e. in your case, you'd only get the id values back), so it won't give you exactly what you want if you need the whole documents; you may require MapReduce instead.
Documentation on distinct:
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Distinct
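A quick illustration of that limitation (a sketch against the sample collection above):
db.foo.distinct("id")
// [ 1, 2, 3 ]  -- only the values, not the documents they came from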
You want to use aggregation. You could do that like this:
db.test.aggregate([
    // each object is an aggregation stage
    {
        $group: {
            originalId: {$first: '$_id'}, // hold onto the original ID
            _id: '$id',                   // set the unique identifier
            val: {$first: '$val'},
            name: {$first: '$name'},
            ttm: {$first: '$ttm'}
        }
    },
    {
        // this receives the output from the first stage, so the
        // (originally) non-unique 'id' field is now present as the
        // _id field; we want to rename it back.
        $project: {
            _id : '$originalId', // restore the original ID
            id : '$_id',
            val : '$val',
            name: '$name',
            ttm : '$ttm'
        }
    }
])
This will be very fast... ~90ms for my test DB of 100,000 documents.
Example:
db.test.find()
// { "_id" : ObjectId("55fb595b241fee91ac4cd881"), "id" : 1, "name" : "x", "ttm" : 23, "val" : 5 }
// { "_id" : ObjectId("55fb596d241fee91ac4cd882"), "id" : 1, "name" : "x", "ttm" : 34, "val" : 1 }
// { "_id" : ObjectId("55fb59c8241fee91ac4cd883"), "id" : 1, "name" : "x", "ttm" : 24, "val" : 2 }
// { "_id" : ObjectId("55fb59d9241fee91ac4cd884"), "id" : 2, "name" : "x", "ttm" : 56, "val" : 3 }
// { "_id" : ObjectId("55fb59e7241fee91ac4cd885"), "id" : 2, "name" : "x", "ttm" : 76, "val" : 3 }
// { "_id" : ObjectId("55fb59f9241fee91ac4cd886"), "id" : 3, "name" : "x", "ttm" : 54, "val" : 7 }
db.test.aggregate(/* from first code snippet */)
// output
{
"result" : [
{
"_id" : ObjectId("55fb59f9241fee91ac4cd886"),
"val" : 7,
"name" : "x",
"ttm" : 54,
"id" : 3
},
{
"_id" : ObjectId("55fb59d9241fee91ac4cd884"),
"val" : 3,
"name" : "x",
"ttm" : 56,
"id" : 2
},
{
"_id" : ObjectId("55fb595b241fee91ac4cd881"),
"val" : 5,
"name" : "x",
"ttm" : 23,
"id" : 1
}
],
"ok" : 1
}
PROS: Almost certainly the fastest method.
CONS: Involves use of the more complicated aggregation API. Also, it is tightly coupled to the original schema of the document, though it may be possible to generalize this.
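One way to generalize it (a sketch; $$ROOT needs MongoDB 2.6+ and $replaceRoot 3.4+): sort first so that $first deterministically picks the highest-ttm document per id, keep the whole document via $$ROOT, then promote it back to the root. This avoids listing every field by hand:
db.test.aggregate([
    { $sort: { ttm: -1 } },                                // makes $first deterministic
    { $group: { _id: '$id', doc: { $first: '$$ROOT' } } },
    { $replaceRoot: { newRoot: '$doc' } }                  // promote the stored document
])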
I believe you can use aggregate like this
collection.aggregate({
    $group : {
        "_id" : "$id",
        "docs" : {
            $first : {
                "name" : "$name",
                "ttm" : "$ttm",
                "val" : "$val"
            }
        }
    }
});
The issue is that you want to distill 3 matching records down to one without providing any logic in the query for how to choose between the matching results.
Your options are basically to specify aggregation logic of some kind (select the max or min value for each column, for example), or to run a select distinct query and only select the fields that you wish to be distinct.
querymongo.com does a good job of translating these distinct queries for you (from SQL to MongoDB).
For example, this SQL:
SELECT DISTINCT columnA FROM collection WHERE columnA > 5
Is returned as this MongoDB:
db.runCommand({
"distinct": "collection",
"query": {
"columnA": {
"$gt": 5
}
},
"key": "columnA"
});
If you want to write the distinct results to a file using JavaScript, this is how you can do it:
var cursor = db.myColl.find({'fieldName': 'fieldValue'});
var seen = [];
cursor.forEach(function(x) {
    // print each id only the first time we encounter it
    if (seen.indexOf(x.id) === -1) {
        printjson(x.id);
        seen.push(x.id);
    }
});
You can also specify a query with distinct. The following example returns the distinct values for the field sku, embedded in the item field, from documents whose dept is equal to "A":
db.inventory.distinct( "item.sku", { dept: "A" } )
Reference: https://docs.mongodb.com/manual/reference/method/db.collection.distinct/