Why am I losing some values every 100 documents? - mongodb

I'm trying to understand a behavior with map/reduce.
Here's the map function:
function() {
    var klass = this.error_class;
    emit('klass', { model : klass, count : 1 });
}
And the reduce function:
function(key, values) {
    var results = { count : 0, klass: { foo: 'bar' } };
    values.forEach(function(value) {
        results.count += value.count;
        results.klass[value.model] = 0;
        printjson(results);
    });
    return results;
}
Then I run it:
{
    "count" : 85,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0
    }
}
{
    "count" : 86,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0,
        "Stream:DirectMessage" : 0
    }
}
At this point, everything is good, but here comes the yielding of the read lock every 100 documents:
{
    "count" : 100,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0,
        "Stream:DirectMessage" : 0
    }
}
{ "count" : 100, "klass" : { "foo" : "bar", "undefined" : 0 } }
I kept my foo key, and my count attribute kept being incremented. The problem is that everything else became undefined.
So why am I losing the dynamic keys for my object while my count attribute is still good?

A thing to remember about your reduce function is that the values passed to it are either the output of your map function, or the return value of previous calls to reduce.
This is key - it means mapping/reducing of parts of the data can be farmed off to different machines (e.g. different shards of a mongo cluster) and then reduce used again to reassemble the data. It also means that mongo doesn't have to map every value first, keep all the results in memory, and then reduce them all: it can map and reduce in chunks, re-reducing where necessary.
In other words, the following must be true:
reduce(k, [A, B, C]) == reduce(k, [C, reduce(k, [A, B])])
Your reduce function's output doesn't have a model property, so if it gets used in a re-reduce those undefined values will crop up.
You either need to have your reduce function return something similar in format to what your map function emits, so that the two can be processed without distinction (usually the easiest), or else handle re-reduced values differently.
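For instance, one way to take the first option here (a sketch, untested against the original data; note it counts occurrences per class, where the original map only ever stored 0):
function() {
    // emit the same shape that reduce returns
    var klassMap = {};
    klassMap[this.error_class] = 1;
    emit('klass', { count : 1, klass : klassMap });
}
function(key, values) {
    var results = { count : 0, klass : {} };
    values.forEach(function(value) {
        results.count += value.count;
        // merge per-class counts; works identically for mapped and re-reduced values
        for (var model in value.klass) {
            results.klass[model] = (results.klass[model] || 0) + value.klass[model];
        }
    });
    return results;
}
Because the emitted value and the returned value share one shape, a re-reduce is just another merge.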

Related

MongoDB - reduce function does not work properly

My map function returns key-value pairs where key is the name of a field and the value is an object {type: <field type>, count : 1}.
For example suppose I have these documents:
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c8"),
    "index" : NumberInt(0)
}
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c9"),
    "index" : NumberInt(7)
}
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c7"),
    "index" : NumberInt(9)
}
I have to retrieve the name of each field, its type and the number of occurrences of the field in my collection.
My map function works and I get:
"_id", [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
"index",[{type:"number", count:1},{type:"number", count:1},{type:"number", count:1}]
I want to delete duplicates from type.
I have the following reduce function:
function (key, stuff) {
    reduceVal = {type:"", count:0};
    var array = [];
    for(var idx = 0; idx < stuff.length; idx++) {
        reduceVal.count += stuff[idx].count;
        if(array.indexOf(stuff[idx].type) > -1) {
            array.push(stuff[idx].type);
        }
    }
    reduceVal.type = array.toString();
    return reduceVal;
}
The if clause does not work. My goal is to add an element to my array only if it is not a duplicate.
Expected output:
"_id", {type:"ObjectId", count:3}
"index", {type:"number", count:3}
How can I fix this?
The reduce function works. The if statement was wrong: I have to add an element to my array when
array.indexOf(stuff[idx].type) === -1.
It looks like you just jumbled up your reduce function. As far as I can interpret this, you assume that the reducer is called once globally. This is not the case. Instead, it is called per key, i.e. the input to the reducer is something like:
First call:
key = "ObjectId", val = [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
Second call:
key = "number", val = [{type:"number", count:1},...]
Therefore, you need to sum up per key, taking the type from the values themselves (this code is not tested and will have its shortcomings):
function(key, vals) {
    var sum = 0;
    for(var i = 0; i < vals.length; i++) {
        sum += vals[i].count;
    }
    return { "type" : vals[0].type, "count" : sum };
}
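If you also want the deduplication the question asked about, here is a sketch that merges both concerns (hedged and untested; it assumes type may already be a comma-separated list when a value comes back from a previous reduce pass):
function (key, vals) {
    var count = 0;
    var types = [];
    vals.forEach(function (val) {
        count += val.count;
        // splitting handles re-reduced values whose type is already a joined list
        val.type.split(",").forEach(function (t) {
            if (t !== "" && types.indexOf(t) === -1) {
                types.push(t); // keep each type name only once
            }
        });
    });
    return { type : types.toString(), count : count };
}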

Efficient Median Calculation in MongoDB

We have a Mongo collection named analytics and it tracks user visits by a cookie id. We want to calculate medians for several variables as users visit different pages.
Mongo does not yet have an internal method for calculating the median. I have used the below method for determining it, but I'm afraid there may be a more efficient way, as I'm pretty new to JS. Any comments would be appreciated.
// Saves the JS function for calculating the Median. Makes it accessible to the Reducer.
db.system.js.save({_id: "myMedianValue",
    value: function (sortedArray) {
        var m = 0.0;
        if (sortedArray.length % 2 === 0) {
            // Even numbered array, average the middle two values
            var idx2 = sortedArray.length / 2;
            var idx1 = idx2 - 1;
            m = (sortedArray[idx1] + sortedArray[idx2]) / 2;
        } else {
            // Odd numbered array, take the middle value
            var idx = Math.floor(sortedArray.length / 2);
            m = sortedArray[idx];
        }
        return m;
    }
});
var mapFunction = function () {
    var key = this.cookieId;
    var value = {
        // If there is only 1 view it will look like this
        // If there are multiple it gets passed to the reduceFunction
        medianVar1: this.Var1,
        medianVar2: this.Var2,
        viewCount: 1
    };
    emit(key, value);
};
var reduceFunction = function(keyCookieId, valueDicts) {
    var Var1Array = [];
    var Var2Array = [];
    var views = 0;
    for (var idx = 0; idx < valueDicts.length; idx++) {
        Var1Array.push(valueDicts[idx].medianVar1);
        Var2Array.push(valueDicts[idx].medianVar2);
        views += valueDicts[idx].viewCount;
    }
    var reducedDict = {
        medianVar1: myMedianValue(Var1Array.sort(function(a, b){ return a - b; })),
        medianVar2: myMedianValue(Var2Array.sort(function(a, b){ return a - b; })),
        viewCount: views
    };
    return reducedDict;
};
db.analytics.mapReduce(mapFunction,
    reduceFunction,
    {
        out: "analytics_medians",
        query: { Var1: { $exists: true }, Var2: { $exists: true } }
    }
)
The simple way to get the median value is to index on the field, then skip to the value halfway through the results.
> db.test.drop()
> db.test.insert([
{ "_id" : 0, "value" : 23 },
{ "_id" : 1, "value" : 45 },
{ "_id" : 2, "value" : 18 },
{ "_id" : 3, "value" : 94 },
{ "_id" : 4, "value" : 52 },
])
> db.test.ensureIndex({ "value" : 1 })
> var get_median = function() {
    // may want { "value" : { "$exists" : true } } if some documents may be missing the value field
    var T = db.test.count()
    // may want to adjust the skip a bit depending on how you compute the median, e.g. in case of even T
    return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(Math.floor(T / 2)).limit(1).toArray()[0].value
}
> get_median()
45
It's not amazing because of the skip, but at least the query will be covered by the index. For updating the median, you could be fancier. When a new document comes in, or the value of a document is updated, you compare its value to the median. If the new value is higher, you need to adjust the median up by finding the next highest value from the current median doc (or taking an average with it, or whatever you need to compute the new median correctly according to your rules):
> db.test.find({ "value" : { "$gt" : median } }, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).limit(1)
You'd do the analogous thing if the new value is smaller than the current median. This bottlenecks your writes on this updating process, and has various cases to think about (how would you allow yourself to update multiple docs at once? update the doc that has the median value? update a doc whose value is smaller than the median to one whose value is larger than the median?), so it might be better just to update occasionally based on the skip procedure.
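As an aside on the even-count case flagged in the comment inside get_median above, here is one sketch that averages the two middle values (a common convention; same hypothetical test collection as before):
var get_median = function() {
    var T = db.test.count();
    var docs = db.test.find({}, { "_id" : 0, "value" : 1 })
                      .sort({ "value" : 1 })
                      .skip(Math.floor((T - 1) / 2))
                      .limit(T % 2 === 0 ? 2 : 1)
                      .toArray();
    // odd T: the single middle value; even T: the mean of the middle pair
    var sum = 0;
    docs.forEach(function(d) { sum += d.value; });
    return sum / docs.length;
};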
We ended up updating the medians on every page request, rather than in bulk with a cron job or something. We have a Node API that uses Mongo's aggregation framework to match and sort the user's results. The array of results is then passed to a median function within Node, and the result is written back to Mongo for that user. Not super pleased with it, but it doesn't appear to have locking issues and is performing well.

MongoDb Summing multiple columns

I'm dabbling in mongoDb and trying to use map reduce queries. I need to sum up multiple values from different columns (num1, num2, num3, num4, num5). Going off this guide, http://docs.mongodb.org/manual/tutorial/map-reduce-examples/, I'm trying to alter the first example there to sum up all the values.
This is what I tried. I'm not sure if it can take in multiple values like this, I just assumed:
var query1 = function(){ emit("sumValues", this.num1, this.num2, this.num3, this.num4, this.num5)};
var query1query = function(sumValues, totalSumOfValues){ return Array.sum(totalSumOfValues); }
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
This is the error I get.
Sun Dec 1 18:52:24.627 JavaScript execution failed: map reduce failed:{
    "errmsg" : "exception: JavaScript execution failed: Error: fast_emit takes 2 args near 'ction (){ emit(\"sumValues\", this.c' ",
    "code" : 16722,
    "ok" : 0
}
Is there another way to sum up all these values? Or where is the error in what I have?
I also tried the following and it seemed to work, but when I do a .find() on the collection it creates, it only seems to be retrieving the values of sum5 and adding them together.
var query1 = function(){ emit({sum1: this.sum1,sum2: this.sum2,sum3: this.sum3,sum4: this.sum4,sum5: this.sum5},{count:1});}
var query1query = function(key, val){ return Array.sum(val); };
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
{
    "result" : "query_one_results",
    "timeMillis" : 19503,
    "counts" : {
        "input" : 173657,
        "emit" : 173657,
        "reduce" : 1467,
        "output" : 166859
    },
    "ok" : 1
}
Ok, I think I got it. This is what I ended up doing:
> var map1= function(){
... emit("Total Sum", this.sum1);
... emit("Total Sum", this.sum2);
... emit("Total Sum", this.sum3);
... emit("Total Sum", this.sum4);
... emit("Total Sum", this.sum5);
... emit("Total Sum", this.sum6);
... };
> var reduce1 = function(key, val){
... return Array.sum(val)
... };
> db.delayData.mapReduce(map1,reduce1,{out:'query_one_result'});
{
    "result" : "query_one_result",
    "timeMillis" : 9264,
    "counts" : {
        "input" : 173657,
        "emit" : 1041942,
        "reduce" : 1737,
        "output" : 1
    },
    "ok" : 1
}
> db.query_one_result.find()
{ "_id" : "Total Sum", "value" : 250 }
You had the two values that you need to emit in map backwards. The first one represents a unique key value over which you are aggregating something. The second one is the thing you want to aggregate.
Since you are trying to get a single sum for the entire collection (it seems), you need to output a single key, and then for the value you need to output the sum of the five fields of each document.
map = function() { emit(1, this.num1+this.num2+this.num3+this.num4+this.num5); }
reduce = function(key, values) { return Array.sum(values); }
This will work as long as num1 through num5 are set in every document - you can make the function more robust by checking that each of those fields exists and, if so, adding its value in, as in the sketch below.
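A sketch of that more defensive map (hedged; field names as in the question, emitted against the same single key):
map = function() {
    var self = this;
    var total = 0;
    ["num1", "num2", "num3", "num4", "num5"].forEach(function(f) {
        if (typeof self[f] === "number") {
            total += self[f]; // skip missing or non-numeric fields
        }
    });
    emit(1, total);
};
The reduce above works unchanged, since map still emits plain numbers.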
In real life, you can do this faster and simpler with aggregation framework:
db.collection.aggregate({$group:{_id:1, sum:{$sum:{$add:["$num1","$num2","$num3","$num4","$num5"]}}}})

Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing a perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is a large set of documents containing:
{_id: , url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
    emit(this.url, {count: 1, id: this._id});
}
And the reduce (with lots of debugging prints for the logs shown below):
reduce: function (key, values) {
    var count = 0;
    var lastId = null;
    var first = null;
    if (typeof values[0].id == "undefined") {
        print("bad id");
        printjson(key);
        printjson(values[0]);
        return null;
    } else {
        print("good id");
        printjson(key);
        printjson(values[0]);
    }
    first = ObjectId(values[0].id).getTimestamp();
    values.forEach(function(v) {
        count += v.count;
        last = ObjectId(v.id).getTimestamp();
        lastId = v.id;
    });
    return {
        count: count,
        first: first,
        last: lastId,
        lastCounted: lastId
    };
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
    console.log("Finished mr", d);
    callback();
});
This is done in a cron type manner so that every time interval, the job is run on limit number of records beginning with the record after the lastId it was run on the time before.
Very basic incremental map reduce stuff...
But, when I run it, I am seeing the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
    "count" : 2,
    "first" : ISODate("2013-04-22T20:41:11Z"),
    "last" : ObjectId("5175a067b25f029a1d092802"),
    "lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
    "count" : 7,
    "first" : ISODate("2013-04-22T20:41:09Z"),
    "last" : ObjectId("5175a067d7f115dd41097e3c"),
    "lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TL;DR: when I run map reduce, the reducing is going fine until a mongo process runs, then I start seeing the returned values of previous reduce functions passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance
I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection with the same key back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.
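For reference, here is a sketch of one way to make the two formats consistent for this job (untested against the original data; it assumes the legacy shell's ObjectId .str property, and relies on hex ObjectId strings sorting in creation order):
map = function() {
    // emit the same fields that reduce returns
    emit(this.url, {
        count: 1,
        first: this._id.getTimestamp(),
        last: this._id,
        lastCounted: this._id
    });
};
reduce = function(key, values) {
    var out = { count: 0, first: null, last: null, lastCounted: null };
    values.forEach(function(v) {
        out.count += v.count;
        // keep the earliest first-seen timestamp and the latest ObjectId
        if (out.first === null || v.first < out.first) out.first = v.first;
        if (out.last === null || v.last.str > out.last.str) out.last = v.last;
    });
    out.lastCounted = out.last;
    return out;
};
With this, a document coming back out of the output collection reduces exactly like a freshly mapped value.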

Counting documents in MapReduce depending on condition - MongoDB

I am trying to use a Map Reduce to count the number of documents according to one of the field values, per date. First, here are the results from a couple of regular find() queries:
db.errors.find({ "cDate" : ISODate("2012-11-20T00:00:00Z") }).count();
returns 579 (i.e. there are 579 documents for this date)
db.errors.find( { $and: [ { "cDate" : ISODate("2012-11-20T00:00:00Z") }, {"Type":"General"} ] } ).count()
returns 443 (i.e. there are 443 documents for this date where Type="General")
Following is my MapReduce:
db.runCommand({ mapreduce: "errors",
    map : function Map() {
        emit(
            this.cDate, // Holds a date value
            {
                count: 1,
                countGeneral: 1,
                Type: this.Type
            }
        );
    },
    reduce : function Reduce(key, values) {
        var reduced = { count: 0, countGeneral: 0, Type: '' };
        values.forEach(function(val) {
            reduced.count += val.count;
            if (val.Type === 'General')
                reduced.countGeneral += val.countGeneral;
        });
        return reduced;
    },
    finalize : function Finalize(key, reduced) {
        return reduced;
    },
    query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } },
    out : { inline : 1 }
});
For the date 2012-11-20 the map reduce returns:
count: 579
countGeneral: 60 (should be 443 according to the above find query)
Now, I understand that the reduce is unpredictable in the way it loops, so how should I do this?
Thanks
You lose the rest of your values because you don't return Type: 'General' from your reduce function.
Reduce runs more than once, over a mix of the values emitted by map and the values returned by previous reduce calls.
For example, once the first iteration of reduce has run, you've got an output object containing something like:
{count: 15, countGeneral: 3, Type: ''}
Later iterations of reduce collect this object along with others like it, don't see Type: 'General' there, and so never increase countGeneral any further.
Your map function is wrong.
You could do something like this:
function Map() {
    var cG = 0;
    if (this.Type == 'General') { cG = 1; }
    emit(
        this.cDate, // Holds a date value
        {
            count: 1,
            countGeneral: cG
        }
    );
}
This emits countGeneral 1 if Type is 'General' and 0 otherwise.
You can then drop the Type field from the emitted value entirely, since your reduce destroys it anyway: as written, the reduce clobbers the Type information passed from emit during the reduce phase.
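A sketch of the matching reduce for that fixed map: with countGeneral already resolved to 0 or 1 at emit time, reduce just sums both counters, and its output shape matches what map emits, so re-reduces are safe:
function Reduce(key, values) {
    var reduced = { count: 0, countGeneral: 0 };
    values.forEach(function(val) {
        reduced.count += val.count;
        reduced.countGeneral += val.countGeneral;
    });
    return reduced;
}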