MongoDB MapReduce: Not working as expected for more than 1000 records

I wrote a MapReduce function where the records are emitted in the following format:
{userid:<xyz>, {event:adduser, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<abc>, {event:adduser, count:1}}
where userid is the key and the rest is the value for that key.
After the MapReduce step, I want to get the result in the following format:
{userid:<xyz>,{events: [{adduser:1},{login:2}], allEventCount:3}}
I know this can be achieved with a group-by, both in the aggregation framework and in MapReduce, but we need similar functionality for a more complex scenario, so I am taking this approach.
To achieve this I wrote the following reduce function:
var reducefn = function(key, values){
    var result = {allEventCount: 0, events: []};
    values.forEach(function(value){
        var notfound = true;
        for (var n = 0; n < result.events.length; n++){
            eventObj = result.events[n];
            for (ev in eventObj){
                if (ev == value.event){
                    result.events[n][ev] += value.allEventCount;
                    notfound = false;
                    break;
                }
            }
        }
        if (notfound == true){
            var newEvent = {};
            newEvent[value.event] = 1;
            result.events.push(newEvent);
        }
        result.allEventCount += value.allEventCount;
    });
    return result;
}
This runs perfectly when I run it for 1000 records, but when there are 3k or 10k records, the result I get is something like this:
{ "_id" : {...}, "value" :{"allEventCount" :30, "events" :[ { "undefined" : 1},
{"adduser" : 1 }, {"remove" : 3 }, {"training" : 1 }, {"adminlogin" : 1 },
{"downgrade" : 2 } ]} }
I am not able to understand where this undefined came from, and the sum of the individual events is also less than allEventCount. All the docs in the collection have a non-empty event field, so there should be no chance of undefined showing up.
MongoDB version: 2.2.1
Environment: local machine, no sharding.
In the reduce function, why does result.events[n][ev] += value.allEventCount; fail when the similar operation result.allEventCount += value.allEventCount; works?
The corrected version, as suggested by johnyHK:
Reduce function:
var reducefn = function(key, values){
    var result = {totEvents: 0, event: []};
    values.forEach(function(value){
        value.event.forEach(function(eventElem){
            var notfound = true;
            for (var n = 0; n < result.event.length; n++){
                eventObj = result.event[n];
                for (ev in eventObj){
                    for (evv in eventElem){
                        if (ev == evv){
                            result.event[n][ev] += eventElem[evv];
                            notfound = false;
                            break;
                        }
                    }
                }
            }
            if (notfound == true){
                result.event.push(eventElem);
            }
        });
        result.totEvents += value.totEvents;
    });
    return result;
}

The shape of the object you emit from your map function must be the same as the object returned from your reduce function, as the results of a reduce can get fed back into reduce when processing large numbers of docs (like in this case).
So you need to change your emit to emit docs like this:
{userid:<xyz>, {events:[{adduser: 1}], allEventCount:1}}
{userid:<xyz>, {events:[{login: 1}], allEventCount:1}}
and then update your reduce function accordingly.
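For completeness, a minimal sketch of what the matching map function could look like, assuming each source document carries userid and event fields; the field names in the emitted value only need to match whatever your reduce function returns (events/allEventCount here, or event/totEvents as in the corrected reduce above):
var mapfn = function() {
    // build a single-entry event object, e.g. { login: 1 }
    var ev = {};
    ev[this.event] = 1;
    // emit a value with the same shape the reduce function returns
    emit(this.userid, { events: [ ev ], allEventCount: 1 });
};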

Related

Unordered bulk update records in MongoDB shell

I've got a collection consisting of millions of documents that resemble the following:
{
    _id: ObjectId('...'),
    value: "0.53",
    combo: [
        {
            h: 0,
            v: "0.42"
        },
        {
            h: 1,
            v: "1.32"
        }
    ]
}
The problem is that the values are stored as strings and I need to convert them to float/double.
I'm trying this and it works, but it'll take days to complete given the volume of data:
db.collection.find({}).forEach(function(obj) {
    if (typeof(obj.value) === "string") {
        obj.value = parseFloat(obj.value);
        db.collection.save(obj);
    }
    obj.combo.forEach(function(hv){
        if (typeof(hv.value) === "string") {
            hv.value = parseFloat(hv.value);
            db.collection.save(obj);
        }
    });
});
I came across bulk updates while reading the Mongo docs and I'm trying this:
var bulk = db.collection.initializeUnorderedBulkOp();
bulk.find({}).update(
    {
        $set: {
            "value": parseFloat("value"),
        }
    });
bulk.execute();
This runs... but I get NaN as the value, which is because it is trying to convert the literal string "value" to a float. I've tried different variations like this.value and "$value", but to no avail. Plus, this approach only attempts to correct the top-level value, not the ones in the array.
I'd appreciate any help. Thanks in advance!
Figured it out the following way:
1) To convert at the document level, I came across this post and the reply by Markus paved the way to my solution:
var bulk = db.collection.initializeUnorderedBulkOp()
var myDocs = db.collection.find()
var ops = 0
myDocs.forEach(
    function(myDoc) {
        bulk.find({ _id: myDoc._id }).updateOne(
            {
                $set : {
                    "value": parseFloat(myDoc.value),
                }
            }
        );
        if ((++ops % 1000) === 0){
            bulk.execute();
            bulk = db.collection.initializeUnorderedBulkOp();
        }
    }
)
bulk.execute();
2) The second part involved updating the array object values, and I discovered the syntax to do so in the accepted answer on this post. In my case, I knew there were 24 values in the array, so I ran this separately from the first query, and it looked like this:
var bulk = db.collection.initializeUnorderedBulkOp()
var myDocs = db.collection.find()
var ops = 0
myDocs.forEach(
    function(myDoc) {
        bulk.find({ _id: myDoc._id }).update(
            {
                $set : {
                    "combo.0.v": parseFloat(myDoc.combo[0].v),
                    "combo.1.v": parseFloat(myDoc.combo[1].v),
                    "combo.2.v": parseFloat(myDoc.combo[2].v),
                    "combo.3.v": parseFloat(myDoc.combo[3].v),
                    "combo.4.v": parseFloat(myDoc.combo[4].v),
                    "combo.5.v": parseFloat(myDoc.combo[5].v),
                    "combo.6.v": parseFloat(myDoc.combo[6].v),
                    "combo.7.v": parseFloat(myDoc.combo[7].v),
                    "combo.8.v": parseFloat(myDoc.combo[8].v),
                    "combo.9.v": parseFloat(myDoc.combo[9].v),
                    "combo.10.v": parseFloat(myDoc.combo[10].v),
                    "combo.11.v": parseFloat(myDoc.combo[11].v),
                    "combo.12.v": parseFloat(myDoc.combo[12].v),
                    "combo.13.v": parseFloat(myDoc.combo[13].v),
                    "combo.14.v": parseFloat(myDoc.combo[14].v),
                    "combo.15.v": parseFloat(myDoc.combo[15].v),
                    "combo.16.v": parseFloat(myDoc.combo[16].v),
                    "combo.17.v": parseFloat(myDoc.combo[17].v),
                    "combo.18.v": parseFloat(myDoc.combo[18].v),
                    "combo.19.v": parseFloat(myDoc.combo[19].v),
                    "combo.20.v": parseFloat(myDoc.combo[20].v),
                    "combo.21.v": parseFloat(myDoc.combo[21].v),
                    "combo.22.v": parseFloat(myDoc.combo[22].v),
                    "combo.23.v": parseFloat(myDoc.combo[23].v)
                }
            }
        );
        if ((++ops % 1000) === 0){
            bulk.execute();
            bulk = db.collection.initializeUnorderedBulkOp();
        }
    }
)
bulk.execute();
Just to give an idea regarding performance, the forEach was going through around 900 documents a minute, which for 15 million records would have taken days, literally! Not only that but this was only converting the types at the document level, not the array level. For that, I would have to loop through each document and loop through each array (15 million x 24 iterations)! With this approach (running both queries side by side), it completed both in under 6 hours.
I hope this helps someone else.
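For what it's worth, the two passes could also be folded into a single bulk pass that rewrites the top-level value and every array element in one update per document. A sketch, assuming the value/combo.v layout shown above:
var bulk = db.collection.initializeUnorderedBulkOp();
var ops = 0;
db.collection.find().forEach(function(myDoc) {
    // build one $set document: the top-level value plus one entry per array element
    var setDoc = { "value": parseFloat(myDoc.value) };
    myDoc.combo.forEach(function(hv, idx) {
        setDoc["combo." + idx + ".v"] = parseFloat(hv.v);
    });
    bulk.find({ _id: myDoc._id }).updateOne({ $set: setDoc });
    if ((++ops % 1000) === 0) {
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});
if (ops % 1000 !== 0) { bulk.execute(); }  // flush any remaining queued updates
This avoids hard-coding the 24 array positions, at the cost of still reading every document on the client.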

mapreduce between consecutive documents

Setup:
I have a large collection with the following fields:
Name - String
Begin - time stamp
End - time stamp
Problem:
I want to get the gaps between consecutive documents, using the map-reduce paradigm.
Approach:
I'm trying to build a new collection, mid, made of document pairs; afterwards I can compute the differences from it using $unwind and Pair[1].Begin - Pair[0].End:
function map(){
    emit(0, this);
}

function reduce(key, values){
    var i = 0;
    var pairs = [];
    while (i < values.length - 1){
        pairs.push([values[i], values[i+1]]);
        i = i + 1;
    }
    return {"pairs": pairs};
}
db.collection.mapReduce(map, reduce, {sort: {Begin: 1}, out: {replace: "mid"}})
This works for a limited number of documents because of the 16MB document cap. I'm not sure if I need to pull the collection into memory and do it there. How else can I approach this problem?
MongoDB's mapReduce handles what you propose differently from the method you are using to solve it. The key factor here is "keeping" the "previous" document in order to compare it to the next one.
The mechanism that supports this is the "scope" functionality, which allows a sort of "global" variable to be used across the overall code. Seen that way, what you are asking for needs no "reduction" at all, since there is no "grouping", just emission of document "pair" data:
db.collection.mapReduce(
    function() {
        if ( last == null ) {
            last = this;
        } else {
            emit(
                {
                    "start_id": last._id,
                    "end_id": this._id
                },
                this.Begin - last.End
            );
            last = this;
        }
    },
    function() {}, // no reduction required
    {
        "out": { "inline": 1 },
        "scope": { "last": null }
    }
)
Use a collection as the output instead of "inline" if your result size requires it.
Either way, by using a "global" to keep the last document, the code stays both simple and efficient.
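Note that this approach assumes documents reach the map function in Begin order. If the collection is not naturally stored that way, the same call can be issued with a sort option (the sort key needs to be indexed) and a collection output. A sketch, with mapFn standing in for the map function above and the Begin field name taken from the question:
db.collection.mapReduce(
    mapFn,                               // the map function shown above
    function() {},                       // still no reduction required
    {
        "sort": { "Begin": 1 },          // feed documents to map in Begin order
        "out": { "replace": "mid" },     // write the gaps to the mid collection
        "scope": { "last": null }
    }
)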

Scope works strangely in mapReduce of MongoDB for the purpose of producing cumulative frequency

I have a collection called user, and I want to get the cumulative frequency of the number of users by date, based on the _id field. The desired result should be something like this:
{
    {_id: 2013-12-02, value: 10},  // up to 2013-12-02 there are 10 users
    {_id: 2014-01-05, value: 20},  // up to 2014-01-05 there are 20 users in total
    ...
}
I tried to get the above using the following mapReduce call:
db.user.mapReduce(
    function(){
        var date = this._id.getTimestamp();
        emit(new Date(date.getFullYear() + "-" + date.getMonth() + "-" + date.getDate()), 1);
    },
    function(key, values) { cum = cum + Array.sum(values); return cum; },
    {
        out: "newUserAnalysis",
        sort: {_id: 1},
        scope: {cum: 0}
    })
But it seems that the cum variable is reset to zero after the first return statement in the reduce function. Why? Is there any other way to get what I want?
Many thanks.
cum should not be reset, as it is a global variable shared by the map, reduce and finalize functions for the whole mapReduce run.
But the reduce function has three requirements that must be observed for processing to be correct, particularly when handling bulky data, since reduce will be called repeatedly, even on the same key. Normally the length of the values array passed to reduce will not exceed 100. In short, your design cannot guarantee that cum is updated in the sequence you expect, which will produce incorrect statistics.
The following code is for your reference:
// map and reduce per day, then save to a collection
db.user.mapReduce(function() {
    var date = this._id.getTimestamp();
    emit(new Date(date.getFullYear() + "-" + (date.getMonth() + 1) + "-" + date.getDate()), 1);
}, function(key, values) {
    return Array.sum(values);
}, {
    out : "newUserAnalysis",
    sort : { _id : 1 }
});
// Do the accumulation one document at a time.
var cursor = db.newUserAnalysis.find().sort({_id: 1});
var newValue = 0, first = true;
while (cursor.hasNext()) {
    var doc = cursor.next();
    newValue += doc.value;
    if (first) {
        first = false;
    } else {
        db.newUserAnalysis.update({_id: doc._id}, {$set: {value: newValue}});
    }
}
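As a quick sanity check after that pass, the most recent document in the output collection should carry the total user count:
// the latest date should now hold the running total for all users
db.newUserAnalysis.find().sort({_id: -1}).limit(1)
// which should match the overall count
db.user.count()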

Group By (Aggregate Map Reduce Functions) in MongoDB using Scala (Casbah/Rogue)

Here's a specific query I'm having trouble with. I'm using Lift-mongo-records so that I can use Rogue. I'm happy to use Rogue-specific syntax, or whatever works.
While there are good examples of using JavaScript strings via the Java driver (noted below), I'd like to know what the best practices might be.
Imagine here that there is a table like
comments {
_id
topic
title
text
created
}
The desired output is a list of topics and their count, for example
cats (24)
dogs (12)
mice (5)
So a user can see a list of distinct topics, grouped and ordered by count.
Here's some pseudo-SQL:
SELECT [DISTINCT] topic, count(topic) as topic_count
FROM comments
GROUP BY topic
ORDER BY topic_count DESC
LIMIT 10
OFFSET 10
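For orientation, the same group-by-count can be written directly in the mongo shell with the group helper; a sketch (group returns an array, so the ordering and limit would be applied to that array afterwards):
db.comments.group({
    key: { topic: 1 },                                 // group by topic
    initial: { count: 0 },                             // start each group at zero
    reduce: function(obj, prev) { prev.count += 1; }   // count documents per topic
}).sort(function(a, b) { return b.count - a.count; }); // order by count, descending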
One approach is to use a DBObject DSL, like:
val cursor = coll.group( MongoDBObject(
"key" -> MongoDBObject( "topic" -> true ) ,
//
"initial" -> MongoDBObject( "count" -> 0 ) ,
"reduce" -> "function( obj , prev) { prev.count += obj.c; }"
"out" -> "topic_list_result"
))
[...].sort( MongoDBObject( "created" ->
-1 )).skip( offset ).limit( limit );
Variations of the above do not compile.
I could just ask "what am I doing wrong", but I thought I could make my confusion more acute:
can I chain the results directly or do I need "out"?
what kind of output can I expect - I mean, do I iterate over a cursor, or over the "out" param?
is "cond" required?
should I be using count() or distinct()?
some examples contain a "map" param...
A recent post I found that covers the Java driver implies I should use strings instead of a DSL:
http://blog.evilmonkeylabs.com/2011/02/28/MongoDB-1_8-MR-Java/
Would this be the preferred method in either Casbah or Rogue?
Update: 9/23
This fails in Scala/Casbah (compiles but produces error {MapReduceError 'None'} )
val map = "function (){ emit({ this.topic }, { count: 1 }); }"
val reduce = "function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; }"
val out = coll.mapReduce( map , reduce , MapReduceInlineOutput )
ConfiggyObject.log.debug( out.toString() )
I settled on the above after seeing
https://github.com/mongodb/casbah/blob/master/casbah-core/src/test/scala/MapReduceSpec.scala
Guesses:
I am misunderstanding the toString method and what the out.object is?
missing finalize?
missing output specification?
https://jira.mongodb.org/browse/SCALA-43 ?
This works as desired from command line:
map = function (){
    emit({ this.topic }, { count: 1 });
}
reduce = function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; };
db.tweets.mapReduce( map, reduce, { out: "results" } );
db.results.ensureIndex( {count : 1});
db.results.find().sort( {count : 1});
Update
The issue has now been filed as a bug at Mongo:
https://jira.mongodb.org/browse/SCALA-55
The following worked for me:
val coll = MongoConnection()("comments")
val reduce = """function(obj,prev) { prev.csum += 1; }"""
val res = coll.group( MongoDBObject("topic"->true),
MongoDBObject(), MongoDBObject( "csum" -> 0 ), reduce)
res was an ArrayBuffer full of coll.T which can be handled in the usual ways.
Appears to be a bug - somewhere.
For now, I have a less-than-ideal workaround in place, using eval() (slower, less safe):
db.eval( "map = function (){ emit( { topic: this.topic } , { count: 1 }); } ; ");
db.eval( "reduce = function(key, values) { var count = 0; values.forEach(function(v) { count += v['count']; }); return {count: count}; }; ");
db.eval( " db.tweets.mapReduce( map, reduce, { out: \"tweetresults\" } ); ");
db.eval( " db.tweetresults.ensureIndex( {count : 1}); ");
Then I query the output collection normally via Casbah.

MongoDB map reduce producing different result to db.collection.find()

I have a map reduce like this:
map:
function() {
    emit(this.username, {sent: this.sent, received: this.received});
}
reduce:
function(key, values) {
    var result = {sent: 0, received: 0, entries: 0};
    values.forEach(function (value) {
        result.sent += value.sent;
        result.received += value.received;
        result.entries += 1;
    });
    return result;
}
As you can see, I've been tracking the number of entries processed in the result. I've found that I get a much lower number of processed records than I should.
For my particular data set, the output is like so:
[{u'_id': u'1743', u'value': {u'received': 1406545.0, u'sent': 26251138.0, u'entries': 316.0}}]
I'm running the map reduce with a query option, specifying a username and a date range.
If I perform the same query using db.collection.find() as follows, the count is different:
> db.entire_database.find({username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') }}).count()
1915
The full map reduce query is this:
db.entire_database.mapReduce(m, r, {out: 'myoutput', query: { username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') } } })
So basically, I'm unsure why the counts are so radically different. Why does find() give me 1915, but the map reduce only 316?
Your map function needs to emit an object with the same form as the one returned by the reduce function (i.e. it should have an entries field set to 1). You can read more about this here.
Basically, the values that are passed to the reduce function are not necessarily the raw outputs emitted from map. Rather than being called once, the reduce function is called many times on 'groups' of values produced by map, the results of which are then combined again by being passed into a further call of the reduce function. This is what makes MapReduce horizontally scalable, because any group of emitted values can be farmed out to any server in any order before being combined later.
So I would restructure your functions slightly like this:
map:
function() {
    emit(this.username, {sent: this.sent, received: this.received, entries: 1});
}
reduce:
function(key, values) {
    var result = {sent: 0, received: 0, entries: 0};
    values.forEach(function (value) {
        result.sent += value.sent;
        result.received += value.received;
        result.entries += value.entries;
    });
    return result;
}
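With that change, the entries counter should line up with the find() count. As a quick check, using the names from the question:
db.entire_database.mapReduce(m, r, {
    out: 'myoutput',
    query: {
        username: '1743',
        time: { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') }
    }
});
// value.entries for this user should now equal the 1915 returned by find().count()
db.myoutput.find({_id: '1743'})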