In MongoDB, I have a map function as below:
var map = function() {
emit( this.username, {count: 1, otherdata:otherdata} );
}
and reduce function as below:
var reduce = function(key, values) {
    var total = 0; // accumulator must be declared before the loop
    values.forEach(function(value){
        total += value.count; //note this line
    });
    return {count: total, otherdata: values[0].otherdata}; //please ignore otherdata
}
The problem is with the line noted:
total += value.count;
In my dataset, the reduce function is called 9 times, and the expected map-reduce count is 8908.
With the line above, the result is correctly returned as 8908.
But if I changed the line to:
total += 1;
The returned result is only 909, about 1/9 of the expected result.
I also tried print(value.count), and the printed result is 1.
What explains this behavior?
Short answer: value.count is not always equal to one.
Long answer: This is the expected behavior of map reduce. The reduce function aggregates the results of the map function, but it does so in small groups, producing intermediate results (subtotals in your case). The reduce function is then run again on these intermediate results as if they were direct results of the map function, and so on until there is only one intermediate result left for each key. That is the final result.
It can be seen as a pyramid of intermediate results :
emit(...)-|
|- reduce -> |
emit(...)-| |
| |- reduce ->|
emit(...)-| | |
| | |
emit(...)-|- reduce -> | |
| |-> reduce = final result
emit(...)-| |
|
emit(...)--- reduce ------------ >|
|
emit(...)-----------------reduce ->|
The number of reduce calls and their inputs are unpredictable and are meant to remain hidden.
That's why you have to provide a reduce function that returns data of the same type (same schema) as its input.
The reduce function does not only get called on the original input data, but also on its own output, until there is a final result. So it needs to be able to handle these intermediate results, such as [{count: 5}, {count:3}, {count: 4}] coming out of an earlier stage.
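To see why counting elements undercounts, here is a minimal sketch in plain JavaScript (run outside MongoDB) that mimics the pyramid: it reduces the values in small batches and then reduces the intermediate results again. The simulateReduce helper, the key "alice", and the batch size of 3 are assumptions made up for illustration; MongoDB chooses its own batching.

```javascript
// Hypothetical helper: reduce `values` in batches, then re-reduce the
// intermediate results until one value is left (not MongoDB's real algorithm).
function simulateReduce(key, values, reduceFn, batchSize) {
    while (values.length > 1) {
        var next = [];
        for (var i = 0; i < values.length; i += batchSize) {
            var batch = values.slice(i, i + batchSize);
            // a batch holding a single value is passed through untouched
            next.push(batch.length > 1 ? reduceFn(key, batch) : batch[0]);
        }
        values = next;
    }
    return values[0];
}

// Sums the count carried by each value, so subtotals are handled correctly.
function reduceSummingCounts(key, values) {
    var total = 0;
    values.forEach(function (value) { total += value.count; });
    return { count: total };
}

// Counts elements instead, so a subtotal such as {count: 3} is treated as 1.
function reduceCountingElements(key, values) {
    var total = 0;
    values.forEach(function () { total += 1; });
    return { count: total };
}

// Ten emits of {count: 1} for the same key.
var emittedValues = [];
for (var n = 0; n < 10; n++) emittedValues.push({ count: 1 });

var correct = simulateReduce("alice", emittedValues, reduceSummingCounts, 3);
var wrong = simulateReduce("alice", emittedValues, reduceCountingElements, 3);
console.log(correct.count, wrong.count); // 10 and 2
```

With total += 1, the last reduce call only sees two intermediate results, so it reports 2 no matter how many documents were emitted.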
I use a Mongo query to calculate the total price for every item.
My query looks like this:
$queryBuilder = new Query\Builder($this, $documentName);
$queryBuilder->field('created')->gte($startDate);
$queryBuilder->field('is_test_value')->notEqual(true);
..........
$queryBuilder->map('function() {emit(this.item, this.price)}');
$queryBuilder->reduce('function(item, valuesPrices) {
return {sum: Array.sum(valuesPrices)}
}');
This runs without errors. But in some cases (approximately 20 out of 200 results) the sum field holds a strange result: instead of the sum I see something like
[objectObject]444444444444444
where 4 is the price of the item.
I tried replacing the body of the reduce function with this:
var sum = 0;
for (var i = 0; i < valuesPrices.length; i++) {
sum += parseFloat(valuesPrices[i]);
}
return {sum: sum}
In that case I get NaN.
I suspected that some values in the price field were inserted incorrectly (as a string, an object, etc. rather than a float). But when I run my query from the mongo CLI, I see that all price values are integers.
It's not "strange" at all. You "broke the rules" and now you are paying for it.
"MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key."
The primary rule of mapReduce (as cited above) is that you must return exactly the same structure from the "reducer" as you emit from the "mapper". This is because the "reducer" can actually run several times for the same "key". This is how mapReduce processes large lists.
You fix this by returning a single value, just as you did in the emit:
return Array.sum(values);
And then there will not be a problem. Adding an object key to that makes the data inconsistent, and thus you get an error when the "reduced" result gets fed back into the "reducer" again.
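Here is a rough sketch, in plain JavaScript outside MongoDB, of how the odd value comes about. The arraySum function is a hypothetical stand-in for the shell's Array.sum; I am assuming it starts from the first element and applies += to the rest, which is what would produce the output in the question.

```javascript
// Hypothetical stand-in for the mongo shell's Array.sum (assumed behavior:
// start from the first element and += every following one).
function arraySum(values) {
    if (values.length === 0) return null;
    var total = values[0];
    for (var i = 1; i < values.length; i++) total += values[i];
    return total;
}

// Reducer from the question: returns an object although the mapper emits numbers.
function brokenReduce(item, valuesPrices) {
    return { sum: arraySum(valuesPrices) };
}

// Reducer from the answer: returns a number, the same shape the mapper emits.
function fixedReduce(item, values) {
    return arraySum(values);
}

// First pass over three prices, then a second pass in which the earlier
// output is fed back in together with two more prices.
var brokenFirst = brokenReduce("item", [4, 4, 4]);            // {sum: 12}
var brokenSecond = brokenReduce("item", [brokenFirst, 4, 4]);
var fixedSecond = fixedReduce("item", [fixedReduce("item", [4, 4, 4]), 4, 4]);

console.log(brokenSecond.sum);        // "[object Object]44"
console.log(fixedSecond);             // 20
console.log(parseFloat(brokenFirst)); // NaN, matching the parseFloat attempt
```

The object returned by the first pass gets converted to the string "[object Object]" as soon as += is applied to it, and every later price is appended as text.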
I've translated the following SQL statement to map reduce:
select
p_brand, p_type, p_size,
count(ps_suppkey) as supplier_cnt
from
partsupp, part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED %'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand, p_type, p_size
order by
supplier_cnt desc, p_brand, p_type, p_size;
Map reduce function:
db.runCommand({
mapreduce: "partsupp",
query: {
"ps_partkey.p_size": { $in: [49, 14, 23, 45, 19, 3, 36, 9] },
"ps_partkey.p_brand": { $ne: "Brand#45" }
},
map: function() {
var pattern1 = /^MEDIUM POLISHED .*/;
var pattern2 = /.*Customer.*Complaints.*/;
var suppkey = this.ps_suppkey.s_suppkey;
if( this.ps_suppkey.s_comment.match(pattern1) == null ){
if(this.ps_suppkey.s_comment.match(pattern2) != null){
emit({p_brand: this.ps_partkey.p_brand, p_type: this.ps_partkey.p_type, p_size: this.ps_partkey.p_size}, suppkey);
}
}
},
reduce: function(key, values) {
return values.length;
},
out: 'query016'
});
The output (it seems to me) shows that reduce was never called:
{
"result" : "query016",
"timeMillis" : 46862,
"counts" : {
"input" : 122272,
"emit" : 54,
"reduce" : 0,
"output" : 54
},
"ok" : 1
}
What's wrong?
The map function outputs key and value pairs.
The reduce function's purpose is to combine multiple values for the same key. This means that if a particular key is only emitted once, it has only one value and there is nothing to reduce.
This is one of the reasons that the value in your emit statement must have exactly the same format as the value the reduce function returns.
Map outputs:
emit(key1, valueX);
emit(key1, valueY);
emit(key2, valueZ);
Reduce combines valueX and valueY to return new valueXY for key1 and the final result will be:
key1, valueXY
key2, valueZ
Notice that reduce was never called on key2. Reduce function may be called zero, once or multiple times for each key value, so you have to be careful to construct both the map and reduce functions to allow for that possibility.
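As a rough illustration in plain JavaScript (not MongoDB itself), the sketch below groups emitted pairs by key and only calls reduce for keys that collected more than one value. The runReduce helper and its counts object are made up for this example and simplify what the server reports.

```javascript
// Hypothetical stand-in for the engine: group emitted pairs by key and call
// reduce only for keys that collected more than one value.
function runReduce(pairs, reduceFn) {
    var groups = {};
    pairs.forEach(function (pair) {
        (groups[pair.key] = groups[pair.key] || []).push(pair.value);
    });
    var counts = { emit: pairs.length, reduce: 0, output: 0 };
    var results = {};
    Object.keys(groups).forEach(function (key) {
        var values = groups[key];
        if (values.length > 1) {
            results[key] = reduceFn(key, values);
            counts.reduce += 1;
        } else {
            results[key] = values[0]; // passed through, reduce never runs
        }
        counts.output += 1;
    });
    return { results: results, counts: counts };
}

// Adds up already accumulated counts, so its output can be reduced again.
function sumReduce(key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
}

// Each emit carries a count of 1, mirroring emit(key, 1).
var run = runReduce([
    { key: "key1", value: 1 },
    { key: "key1", value: 1 },
    { key: "key2", value: 1 }
], sumReduce);

console.log(run.results); // { key1: 2, key2: 1 }
console.log(run.counts);  // { emit: 3, reduce: 1, output: 2 }
```

When every key is emitted exactly once, as with 54 emits and 54 outputs in the question, the emitted value is stored as the result unchanged. That is why the value has to be a count of 1 rather than the supplier key.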
Your map function doesn't emit a correct value - you want to be counting so you have to output a count. Your reduce function must loop over the already accumulated counts and add them up and return the combined count. You may want to look at some examples provided in the MongoDB documentation.
You can probably do this much simpler using the Aggregation Framework - I don't see the need for MapReduce here unless you are expecting to output a huge amount of results.
I suspect that you called emit(value,key) instead of emit(key,value).
As others have already stated, the mapped value and the reduced value must have the same structure. If you just want to make a count, map a value=1 and in the reduce function just return Array.sum(values).
I'm playing around with Map Reduce in MongoDB and python and I've run into a strange limitation. I'm just trying to count the number of "book" records. It works when there are less than 100 records but when it goes over 100 records the count resets for some reason.
Here is my MR code and some sample outputs:
var M = function () {
book = this.book;
emit(book, {count : 1});
}
var R = function (key, values) {
var sum = 0;
values.forEach(function(x) {
sum += 1;
});
var result = {
count : sum
};
return result;
}
MR output when record count is 99:
{u'_id': u'superiors', u'value': {u'count': 99}}
MR output when record count is 101:
{u'_id': u'superiors', u'value': {u'count': 2.0}}
Any ideas?
Your reduce function should be summing up the count values, not just adding 1 for each value. Otherwise the output of a reduce can't properly be used as input back into another reduce. Try this instead:
var R = function (key, values) {
var sum = 0;
values.forEach(function(x) {
sum += x.count;
});
var result = {
count : sum
};
return result;
}
If the number of emits is 100 or more, the first 100 emits are sent to the reduce function, which produces:
{count: 100}
Then only 1 emit remains, which gives:
{count: 1}
The intermediate results are now:
[{count: 100}, {count: 1}]
The reduce function is then called again on this array (very important!). Because your code does sum += 1 inside the forEach and there are two elements in the array, the result is 2.
ref: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Amoretechnicalexplanation
I've collected about 10 million documents spanning a few weeks in my MongoDB database, and I want to be able to calculate some simple statistics and output them.
The statistic I'm trying to get is the average rating of the documents within a timespan, in one-hour intervals.
To give an idea of what I'm trying to do, here is some pseudocode:
var dateTimeStart;
var dateTimeEnd;
var distinctHoursBetweenDateTimes = getHours(dateTimeStart, dateTimeEnd);
var totalResult=[];
foreach( distinctHour in distinctHoursBetweenDateTimes )
tmpResult = mapreduce_getAverageRating( distinctHour, distinctHour +1 )
totalResult[distinctHour] = tmpResult;
return totalResult;
My document structure is something like:
{_id, rating, topic, created_at}
Created_at is the date I'm gathering my statistics based on (time of insertion and time created are not always the same)
I've created an index on the created_at field.
The following is my mapreduce:
map = function (){
emit( this.Topic , { 'total' : this.Rating , num : 1 } );
};
reduce = function (key, values){
var n = {'total' : 0, num : 0};
for ( var i=0; i<values.length; i++ ){
n.total += values[i].total;
n.num += values[i].num;
}
return n;
};
finalize = function(key, res){
res.avg = res.total / res.num;
return res;
};
I'm pretty sure this can be done more effectively - possibly by letting mongo do more work, instead of running several map-reduce statements in a row.
At this point each map-reduce takes about 20-25 seconds so counting statistics for all the hours over a few days suddenly takes up a very long time.
My impression is that mongo should be suited for this kind of work - hence I must obviously be doing something wrong.
Thanks for your help!
And I assume the time is part of the documents you are MapReducing?
If you run the MapReduce over all documents, determine the hour in the map function, and add it to the key you emit, you can do all of this in a single MapReduce.
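A minimal sketch of that idea, runnable in plain JavaScript outside MongoDB. The field names Topic, Rating and created_at come from the question; the emit stand-in, the sample documents, the grouping loop, and the choice of an ISO string truncated to the hour (UTC) as part of the key are my own assumptions.

```javascript
// Stand-in for emit; inside MongoDB the server provides it.
var emittedPairs = [];
function emit(key, value) { emittedPairs.push({ key: key, value: value }); }

// Map: the key combines the topic with created_at truncated to the hour (UTC).
var map = function () {
    var hour = new Date(this.created_at.getTime());
    hour.setUTCMinutes(0, 0, 0);
    emit({ topic: this.Topic, hour: hour.toISOString() },
         { total: this.Rating, num: 1 });
};

// Reduce and finalize are the ones from the question, unchanged in logic.
var reduce = function (key, values) {
    var n = { total: 0, num: 0 };
    for (var i = 0; i < values.length; i++) {
        n.total += values[i].total;
        n.num += values[i].num;
    }
    return n;
};
var finalize = function (key, res) {
    res.avg = res.total / res.num;
    return res;
};

// Hypothetical sample documents, only for this illustration.
var docs = [
    { Topic: "a", Rating: 4, created_at: new Date("2012-01-01T10:05:00Z") },
    { Topic: "a", Rating: 2, created_at: new Date("2012-01-01T10:55:00Z") },
    { Topic: "a", Rating: 5, created_at: new Date("2012-01-01T11:01:00Z") }
];
docs.forEach(function (doc) { map.call(doc); });

// Group by key and compute one average per (topic, hour) in a single pass.
var groups = {};
emittedPairs.forEach(function (pair) {
    var id = JSON.stringify(pair.key);
    (groups[id] = groups[id] || []).push(pair.value);
});
var averages = {};
Object.keys(groups).forEach(function (id) {
    averages[id] = finalize(id, reduce(id, groups[id])).avg;
});
console.log(averages);
```

In a real run you would presumably also pass a query on created_at, so the existing index limits the documents to the timespan of interest.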
The output from MongoDB's map/reduce includes something like 'counts': {'input': I, 'emit': E, 'output': O}. I thought I clearly understood what those mean, until I hit a weird case which I can't explain.
According to my understanding, counts.input is the number of rows that match the condition (as specified in query). If so, how is it possible that the following two queries have different results?
db.mycollection.find({MY_CONDITION}).count()
db.mycollection.mapReduce(SOME_MAP, SOME_REDUCE, {'query': {MY_CONDITION}}).counts.input
I thought the two should always give the same result, independent of the map and reduce functions, as long as the same condition is used.
The map/reduce pattern is like a GROUP BY in SQL: several results are grouped into one row, so you can't expect the same number of results.
The count in the mapReduce() method is the number of results after the map/reduce functions.
For example, you have 2 rows:
{'id':3,'num':5}
{'id':4,'num':5}
And you apply the map function
function(){
emit(this.num, 1);
}
After this map function you get 2 rows:
{5, 1}
{5, 1}
And now you apply your reduce method :
function(k,vals) {
var sum=0;
for(var i in vals) sum += vals[i];
return sum;
}
You now have only 1 row returned:
2
Is your server in a steady state between the two calls?