Scope works strangely in MongoDB mapReduce when producing a cumulative frequency

I have a collection called user, and I want the cumulative frequency of the number of users by date, based on the _id field. The desired result should look something like this:
{
{_id: 2013-12-02, value: 10}, // up to 2013-12-02 there are 10 users
{_id: 2014-01-05, value: 20}, // up to 2014-01-05 there are 20 users in total
….
}
I tried to get the above using the following mapReduce call:
db.user.mapReduce(
function(){var date = this._id.getTimestamp();
emit(new Date(date.getFullYear()+"-"+date.getMonth()+"-"+date.getDate()), 1)},
function(key, values) {cum = cum + Array.sum(values); return cum},
{out: "newUserAnalysis",
sort: {_id: 1},
scope: {cum: 0}})
But it seems that the cum variable is reset to zero after the first return statement encountered in the reduce function. Why? Is there another way to get what I want?
Many thanks.

cum should not be reset: a variable in scope is global to the map, reduce and finalize functions for the whole mapReduce run.
However, the reduce function has three requirements that must be observed for processing to be correct, particularly with large data sets, because reduce can be called repeatedly, even for the same key. Normally the values array passed to a single reduce call will not hold more than about 100 elements. In short, your design cannot guarantee that cum is updated in the order you expect, so it will produce incorrect statistics.
The following code is for your reference:
// map and reduce per day then save to a collection.
db.user.mapReduce(function() {
var date = this._id.getTimestamp();
emit(new Date(date.getFullYear() + "-" + (date.getMonth() + 1) + "-"
+ date.getDate()), 1);
}, function(key, values) {
return Array.sum(values);
}, {
out : "newUserAnalysis",
sort : {
_id : 1
}
});
// Do accumulation one by one.
var cursor = db.newUserAnalysis.find().sort({_id:1});
var newValue = 0, first = true;
while (cursor.hasNext()) {
var doc = cursor.next();
newValue += doc.value;
if (first) {
first = false;
} else {
db.newUserAnalysis.update({_id:doc._id}, {$set:{value:newValue}});
}
}
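The accumulation loop above is a running total over the per-day counts in _id order. A minimal sketch in plain JavaScript, runnable without a database; the dailyCounts array and the toCumulative helper are hypothetical stand-ins for the contents of newUserAnalysis and are not part of the original answer:

```javascript
// Hypothetical per-day counts, shaped like the documents mapReduce would
// write to newUserAnalysis (assumed to be sorted by _id already).
const dailyCounts = [
  { _id: "2013-12-02", value: 10 },
  { _id: "2014-01-05", value: 10 },
  { _id: "2014-01-06", value: 5 },
];

// Replace each per-day count with the running total up to that day.
function toCumulative(docs) {
  let runningTotal = 0;
  return docs.map((doc) => {
    runningTotal += doc.value;
    return { _id: doc._id, value: runningTotal };
  });
}

const cumulative = toCumulative(dailyCounts);
console.log(cumulative);
```

Keeping the running total outside reduce means the result no longer depends on how many times, or in which order, reduce is invoked.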

Related

mapreduce between consecutive documents

Setup:
I have a large collection whose documents have the following fields:
Name - String
Begin - time stamp
End - time stamp
Problem:
I want to get the gaps between consecutive documents using the map-reduce paradigm.
Approach:
I'm trying to build a new collection of pairs, mid; after that I can compute the differences from it using $unwind and Pair[1].Begin - Pair[0].End.
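function map() {
    // Emit every document under the same key so they all reach one reduce
    emit(0, this);
}
function reduce(key, values) {
    var i = 0;
    var pairs = [];
    // Pair each document with the one that follows it
    while (i < values.length - 1) {
        pairs.push([values[i], values[i + 1]]);
        i = i + 1;
    }
    return {"pairs": pairs};
}
db.collection.mapReduce(map, reduce, {sort: {Begin: 1}, out: {replace: "mid"}})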
This works only with a limited number of documents because of the 16MB document cap. I'm not sure whether I need to load the collection into memory and do it there. How else can I approach this problem?
MongoDB's mapReduce handles this differently from the approach you are using. The key factor is "keeping" the "previous" document so that it can be compared with the next one.
The mechanism that supports this is the "scope" option, which provides a sort of "global" variable for use throughout the code. As you will see, once that is taken into account, what you are asking needs no "reduction" at all, because there is no "grouping", just emission of document "pair" data:
db.collection.mapReduce(
function() {
if ( last == null ) {
last = this;
} else {
emit(
{
"start_id": last._id,
"end_id": this._id
},
this.Begin - last.End
);
last = this;
}
},
function() {}, // no reduction required
{
"out": { "inline": 1 },
"scope": { "last": null }
}
)
Switch "out" to a collection if the size of your result requires it.
Either way, using a "global" to hold the previous document keeps the code both simple and efficient.
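The "keep the previous document" idea can be tried outside MongoDB. A minimal sketch in plain JavaScript, assuming the documents are already sorted by Begin and that the timestamps are plain numbers; the sample data and the computeGaps helper are hypothetical:

```javascript
// Hypothetical documents, already sorted by Begin (timestamps as numbers).
const docs = [
  { _id: 1, Begin: 0, End: 10 },
  { _id: 2, Begin: 15, End: 20 },
  { _id: 3, Begin: 22, End: 30 },
];

// Mirrors the map function above: remember the previous document in a
// variable that outlives each iteration, and record one gap per pair.
function computeGaps(sortedDocs) {
  const gaps = [];
  let last = null;
  for (const doc of sortedDocs) {
    if (last !== null) {
      gaps.push({
        start_id: last._id,
        end_id: doc._id,
        gap: doc.Begin - last.End,
      });
    }
    last = doc;
  }
  return gaps;
}

const gaps = computeGaps(docs);
console.log(gaps);
```

The result is only meaningful when the documents arrive in Begin order, because each gap is computed against whichever document was seen immediately before.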

MapReduce, MongoDB and node-mongodb-native

I'm using the node-mongodb-native library to run a MapReduce on MongoDB (from node.js).
Here's my code:
var map = function() {
emit(this._id, {'count': this.count});
};
var reduce = function(key, values) {
return {'testing':1};
};
collection.mapReduce(
map,
reduce,
{
query:{ '_id': /s.*/g },
sort: {'count': -1},
limit: 10,
jsMode: true,
verbose: false,
out: { inline: 1 }
},
function(err, results) {
logger.log(results);
}
);
Two questions:
1) Basically, my reduce function is ignored. No matter what I put in it, the output remains just the result of my map function (no 'testing', in this case). Any ideas?
2) I get an error unless an index is defined on the field used for the sort (in this case - the count field). I understand this is to be expected. It seems inefficient as surely the right index would be (_id, count) and not (count), as in theory the _id should be used first (for the query), and only then the sorting should be applied to the applicable results. Am I missing something here? Is MongoDB inefficient? Is this a bug?
Thanks! :)
The reduce function is never called because you emit a single value for each key, so there is nothing for it to reduce. Here is an example of how to trigger the reduce function:
collection.insert([{group: 1, price:41}, {group: 1, price:22}, {group: 2, price:12}], {w:1}, function(err, r) {
// String functions
var map = function() {
emit(this.group, this.price);
};
var reduce = function(key, values) {
return Array.sum(values);
};
collection.mapReduce(
map,
reduce,
{
query:{},
// sort: {'count': -1},
// limit: 10,
// jsMode: true,
// verbose: false,
out: { inline: 1 }
},
function(err, results) {
console.log("----------- 0")
console.dir(err)
console.dir(results)
// logger.log(results);
}
);
});
Notice that we emit by the "group" key, so one or more entries are grouped under each key. Since you are emitting _id, each key is unique and the reduce function is never needed.
http://docs.mongodb.org/manual/reference/command/mapReduce/#requirements-for-the-reduce-function
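To see why unique keys bypass reduce, here is a small in-memory imitation of the map, group and reduce flow in plain JavaScript. It is an illustration only, not MongoDB's actual implementation; runMapReduce, the sample documents and the call counter are all hypothetical:

```javascript
// Minimal in-memory imitation of the map/group/reduce flow.
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => mapFn(doc, emit));

  const results = [];
  for (const [key, values] of groups) {
    // A key with a single emitted value is passed through untouched.
    const value = values.length > 1 ? reduceFn(key, values) : values[0];
    results.push({ _id: key, value });
  }
  return results;
}

const docs = [
  { _id: "a", group: 1, price: 41 },
  { _id: "b", group: 1, price: 22 },
  { _id: "c", group: 2, price: 12 },
];

let reduceCalls = 0;
const sum = (key, values) => {
  reduceCalls += 1;
  return values.reduce((total, v) => total + v, 0);
};

// Emitting by group: key 1 has two values, so reduce runs for that key.
const byGroup = runMapReduce(docs, (doc, emit) => emit(doc.group, doc.price), sum);
const callsByGroup = reduceCalls;

// Emitting by _id: every key is unique, so reduce never runs.
reduceCalls = 0;
const byId = runMapReduce(docs, (doc, emit) => emit(doc._id, doc.price), sum);
const callsById = reduceCalls;

console.log(byGroup, callsByGroup);
console.log(byId, callsById);
```

This is also why a reduce function that returns a different shape, such as {'testing': 1}, leaves no trace in the output when every key has exactly one value.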

MongoDB MapReduce: Not working as expected for more than 1000 records

I wrote a mapreduce function where the records are emitted in the following format
{userid:<xyz>, {event:adduser, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<abc>, {event:adduser, count:1}}
where userid is the key and the remaining are the value for that key.
After the MapReduce function, I want to get the result in following format
{userid:<xyz>,{events: [{adduser:1},{login:2}], allEventCount:3}}
To achieve this I wrote the reduce function below.
I know this can be achieved with a group-by, both in the aggregation framework and in mapReduce, but we need similar functionality for a more complex scenario, so I am taking this approach.
var reducefn = function(key,values){
var result = {allEventCount:0, events:[]};
values.forEach(function(value){
var notfound=true;
for(var n = 0; n < result.events.length; n++){
eventObj = result.events[n];
for(ev in eventObj){
if(ev==value.event){
result.events[n][ev] += value.allEventCount;
notfound=false;
break;
}
}
}
if(notfound==true){
var newEvent={}
newEvent[value.event]=1;
result.events.push(newEvent);
}
result.allEventCount += value.allEventCount;
});
return result;
}
This runs perfectly when I run it on 1000 records. With 3k or 10k records, the result I get is something like this:
{ "_id" : {...}, "value" :{"allEventCount" :30, "events" :[ { "undefined" : 1},
{"adduser" : 1 }, {"remove" : 3 }, {"training" : 1 }, {"adminlogin" : 1 },
{"downgrade" : 2 } ]} }
I can't understand where this undefined came from, and the sum of the individual events is also less than allEventCount. All the docs in the collection have a non-empty event field, so there should be no chance of undefined.
Mongo DB version -- 2.2.1
Environment -- Local machine, no sharding.
In the reduce function, why does the operation result.events[n][ev] += value.allEventCount; fail when the similar operation result.allEventCount += value.allEventCount; succeeds?
The corrected answer as suggested by johnyHK
Reduce function:
var reducefn = function(key,values){
var result = {totEvents:0, event:[]};
values.forEach(function(value){
value.event.forEach(function(eventElem){
var notfound=true;
for(var n = 0; n < result.event.length; n++){
eventObj = result.event[n];
for(ev in eventObj){
for(evv in eventElem){
if(ev==evv){
result.event[n][ev] += eventElem[evv];
notfound=false;
break;
}
}}
}
if(notfound==true){
result.event.push(eventElem);
}
});
result.totEvents += value.totEvents;
});
return result;
}
The shape of the object you emit from your map function must be the same as the object returned from your reduce function, as the results of a reduce can get fed back into reduce when processing large numbers of docs (like in this case).
So you need to change your emit to emit docs like this:
{userid:<xyz>, {events:[{adduser: 1}], allEventCount:1}}
{userid:<xyz>, {events:[{login: 1}], allEventCount:1}}
and then update your reduce function accordingly.
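The same-shape rule can be checked in plain JavaScript by reducing everything at once and comparing that with reducing a partial batch first and feeding the result back in. This is a minimal sketch under the emit format suggested above; the reducefn below is a simplified, hypothetical rewrite for illustration and not the poster's code:

```javascript
// Values emitted by map, already in the same shape that reduce returns.
const emitted = [
  { events: [{ adduser: 1 }], allEventCount: 1 },
  { events: [{ login: 1 }], allEventCount: 1 },
  { events: [{ login: 1 }], allEventCount: 1 },
];

// Merge any mix of raw emits and earlier reduce outputs.
function reducefn(key, values) {
  const counts = {};
  let allEventCount = 0;
  values.forEach((value) => {
    value.events.forEach((eventObj) => {
      Object.keys(eventObj).forEach((name) => {
        counts[name] = (counts[name] || 0) + eventObj[name];
      });
    });
    allEventCount += value.allEventCount;
  });
  const events = Object.keys(counts).map((name) => ({ [name]: counts[name] }));
  return { events, allEventCount };
}

// One pass over everything...
const direct = reducefn("xyz", emitted);
// ...versus reducing a partial batch first and feeding the result back in.
const partial = reducefn("xyz", emitted.slice(0, 2));
const rereduced = reducefn("xyz", [partial, emitted[2]]);

console.log(JSON.stringify(direct));
console.log(JSON.stringify(rereduced));
```

Because the input and output shapes match, it makes no difference whether a value handed to reduce came straight from map or from an earlier reduce call.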

mongodb query with group()?

This is my collection structure:
coll{
id:...,
fieldA:{
fieldA1:[
{
...
}
],
fieldA2:[
{
text: "ciao",
},
{
text: "hello",
},
]
}
}
I want to extract all fieldA2 values in my collection, but if a value occurs two or more times I want to show it only once.
I tried this:
db.runCommand({distinct: 'coll', key: 'fieldA.fieldA2.text'})
but it doesn't work: it returns all fieldA1 in the collection.
So I tried:
db.coll.group( {
key: { 'fieldA.fieldA2.text': 1 },
cond: { },
reduce: function ( curr, result ) { },
initial: { }
} )
but this returns an empty array...
How can I do this and also see the execution time? Thank you very much.
Since you are running 2.0.4 (I recommend upgrading), you must run this through MR (I think, maybe there is a better way). Something like:
map = function() {
    for (var i in this.fieldA.fieldA2) {
        // emit per text value so that this will group unique text values
        emit(this.fieldA.fieldA2[i].text, 1);
    }
}
reduce = function(key, values) {
    // Now let's just do a simple count of how many times that text value was seen
    var count = 0;
    for (var index in values) {
        count += values[index];
    }
    return count;
}
This will then give you a collection of documents in which _id is the unique text value from fieldA2 and the value field is the number of times it appeared in the collection.
Again this is a draft and is not tested.
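The grouping idea in that draft can be checked without a database. A minimal sketch in plain JavaScript, assuming documents shaped like the structure in the question; the sample data and the countTexts helper are hypothetical:

```javascript
// Hypothetical documents following the structure from the question.
const coll = [
  { fieldA: { fieldA2: [{ text: "ciao" }, { text: "hello" }] } },
  { fieldA: { fieldA2: [{ text: "hello" }] } },
];

// Count how often each text value occurs, keyed by the text itself.
function countTexts(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const item of doc.fieldA.fieldA2) {
      counts.set(item.text, (counts.get(item.text) || 0) + 1);
    }
  }
  return counts;
}

const counts = countTexts(coll);
console.log([...counts.keys()]); // the distinct text values
console.log(counts.get("hello")); // how many times "hello" was seen
```

The keys of the map play the role of the _id values in the output collection, and the counts play the role of the value field.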
I think the answer is simpler than a Map/Reduce .. if you just want distinct values plus execution time, the following should work:
var startTime = new Date()
var values = db.coll.distinct('fieldA.fieldA2.text');
var endTime = new Date();
print("Took " + (endTime - startTime) + " ms");
That would result in a values array with a list of distinct fieldA.fieldA2.text values:
[ "ciao", "hello", "yo", "sayonara" ]
And a reported execution time:
Took 2 ms

MongoDB map reduce producing different result to db.collection.find()

I have a map reduce like this:
map:
function() {
emit(this.username, {sent:this.sent, received:this.received});
}
reduce:
function(key, values) {
var result = {sent: 0, received: 0, entries:0};
values.forEach(function (value) {
result.sent += value.sent;
result.received += value.received;
result.entries += 1;
});
return result;
}
I've been monitoring the number of entries processed in the result map, as you can see. I've found I get a much lower number of accessed records than I should.
For my particular data set, the output is like so:
[{u'_id': u'1743', u'value': {u'received': 1406545.0, u'sent': 26251138.0, u'entries': 316.0}}]
I'm running the map reduce with a query option, specifying a username and a date range.
If I perform the same query using db.collection.find() as follows, the count is different:
> db.entire_database.find({username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') }}).count()
1915
The full map reduce query is this:
db.entire_database.mapReduce(m, r, {out: 'myoutput', query: { username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') } } })
So basically, I'm unsure why the count is so radically different. Why does find() give me 1915 when the map reduce gives 316?
Your map function needs to emit an object with the same form as the reduce function (ie. it should have an entries field set to 1). You can read more about this here.
Basically, the values that are passed to the reduce function are not necessarily the raw outputs emitted from map. Rather than being called once, the reduce function is called many times on 'groups' of values produced by map, the results of which are then combined again by being passed into a further call of the reduce function. This is what makes MapReduce horizontally scalable, because any group of emitted values can be farmed out to any server in any order before being combined later.
So I would restructure your functions slightly like this:
map:
function() {
emit(this.username, {sent:this.sent, received:this.received, entries : 1});
}
reduce:
function(key, values) {
var result = {sent: 0, received: 0, entries:0};
values.forEach(function (value) {
result.sent += value.sent;
result.received += value.received;
result.entries += value.entries;
});
return result;
}
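The effect of re-reduce on the entries counter can be reproduced in plain JavaScript. This is a minimal sketch: reduceInBatches imitates reduce being applied to groups of values and then to the partial results, and the batch size of 100 and the sample values are assumptions chosen for illustration, not MongoDB's documented behaviour:

```javascript
// The question's version: `entries` counts the values handed to reduce.
function reduceCountingValues(key, values) {
  const result = { sent: 0, received: 0, entries: 0 };
  values.forEach((value) => {
    result.sent += value.sent;
    result.received += value.received;
    result.entries += 1;
  });
  return result;
}

// The corrected version: `entries` is summed from the values themselves.
function reduceSummingEntries(key, values) {
  const result = { sent: 0, received: 0, entries: 0 };
  values.forEach((value) => {
    result.sent += value.sent;
    result.received += value.received;
    result.entries += value.entries;
  });
  return result;
}

// Imitate re-reduce: reduce fixed-size batches, then reduce the partial
// results, until a single value is left. batchSize must be at least 2.
function reduceInBatches(reduceFn, key, values, batchSize) {
  let current = values;
  while (current.length > 1) {
    const next = [];
    for (let i = 0; i < current.length; i += batchSize) {
      next.push(reduceFn(key, current.slice(i, i + batchSize)));
    }
    current = next;
  }
  return current[0];
}

// 250 emitted values, each already carrying entries: 1.
const emitted = Array.from({ length: 250 }, () => ({ sent: 2, received: 1, entries: 1 }));

const wrong = reduceInBatches(reduceCountingValues, "1743", emitted, 100);
const right = reduceInBatches(reduceSummingEntries, "1743", emitted, 100);

console.log(wrong.entries, right.entries);
```

The sums for sent and received agree in both versions because they are added up from the values; only the counter that ignores its input drifts once partial results are fed back in.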