Counting rows in MapReduce in MongoDB

I have created the following Map Reduce and came across something curious. I'm counting in 2 different ways the number of documents per date and coming up with different values. Here are my functions:
map : function Map() {
    emit(
        this.cDate, // holds a date value
        { count: 1 }
    );
}
reduce : function Reduce(key, values) {
    var reduced = { count: 0, count1: 0 };
    values.forEach(function(val) {
        reduced.count += val.count;
        reduced.count1++;
    });
    return reduced;
}
finalize : function Finalize(key, reduced) {
    return reduced;
}
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } }
out : { inline : 1 }
So basically, what is strange is that at the end "count" and "count1" return different values. "count" has the correct value, that is, the number of documents for that date, while "count1" has a much lower value. Can anyone explain? (I'm new to MongoDB, so use simple terms. :-)
Thanks.

Two problems (which are really the same problem):
The value you emit must have the same format as the result returned by your reduce function.
Your reduce must be prepared to be called more than once for the same key (i.e. if you reduce five values for a key and then reduce three values for that key, the reduce function may be called again to reduce the results of those two previous reduce operations).
Your example demonstrates exactly what happens if you assume you will always be reducing the originally emitted "1" rather than the result of a previous reduce.
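For example, a reduce that satisfies both rules could look like this (a minimal sketch: it drops count1, because incrementing by one per value counts reduce inputs, not documents):
reduce : function Reduce(key, values) {
    var reduced = { count: 0 };
    values.forEach(function(val) {
        // val may be a value emitted by map OR the result of a
        // previous reduce call, so always add val.count, never 1
        reduced.count += val.count;
    });
    return reduced; // same shape as the emitted values
}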
Reference: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-ReduceFunction

Related

Sort before querying

Is it possible to run a sort on a Mongo collection before running the filtering query? I have older code in which I got a random result from the database by having a field holding a random float between 0 and 1, then querying with findOne to get the first document with a value greater than a random float generated at query time. The sample set was small, so I didn't notice a problem at the time, but recently I noticed that with one query I was almost always getting the same value. The "first" document had a random value > .9, so nearly every query matched it first.
I realized that, for this solution to work, I need to sort by random, then find the first value greater than my random float. As I understand it, this solution isn't as necessary as it was in the past, since $sample exists as of 3.2, but I figure learning how I could do this would be good. Plus, my understanding is that $sample can return the same document multiple times (where N > 1, obviously, so it's not directly applicable to my question).
So for example, the following data:
> db.links.find()
{ "_id" : ObjectId("553c072bc87652a80e00002a"), "random" : 0.9162904409691691 }
{ "_id" : ObjectId("553c3332c87652c80700002a"), "random" : 0.00427396921440959 }
{ "_id" : ObjectId("553c3c5cc87652a80e00002b"), "random" : 0.2409569111187011 }
{ "_id" : ObjectId("553c3c66c876521c10000029"), "random" : 0.35101076657883823 }
{ "_id" : ObjectId("553c3c6ec87652200700002e"), "random" : 0.3234482416883111 }
{ "_id" : ObjectId("553c68d5c87652a80e00002c"), "random" : 0.5221220930106938 }
Any attempt to run db.mycollection.findOne({'random': {'$gte': x}}), where x is any value up to .91, always returns the first object (_id 553c072). Anything greater returns nothing. If I could sort by the random value in ascending order and then filter, the search would keep going until it found the correct value.
I would strongly recommend that you drop your custom solution and simply switch to the built-in $sample aggregation stage, which returns random documents from your collection.
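For example, pulling a single random document looks like this:
db.links.aggregate([ { $sample: { size: 1 } } ])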
EDIT based on your comment:
Here's how you can do what you originally asked for:
db.links.find({ "random": { $gte: /* put your value here */ } })
.sort({ "random": 1 /* sort by "random" field in ascending order */ })
.limit(1)
You can also use the aggregation framework, though you don't need to:
db.links.aggregate([
    {
        $match: {
            "random": {
                $gte: /* put your value here */ // filter the collection
            }
        }
    },
    {
        $sort: {
            "random": 1 // sort by "random" field in ascending order
        }
    },
    {
        $limit: 1 // return only the first element
    }
])

Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is a large set of documents containing:
{_id: , url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
    emit(this.url, { count: 1, id: this._id });
}
And the reduce (with lots of debugging prints for the logs shown below):
reduce: function (key, values) {
    var count = 0;
    var lastId = null;
    var first = null;
    if (typeof values[0].id == "undefined") {
        print("bad id");
        printjson(key);
        printjson(values[0]);
        return null;
    } else {
        print("good id");
        printjson(key);
        printjson(values[0]);
    }
    first = ObjectId(values[0].id).getTimestamp();
    values.forEach(function(v) {
        count += v.count;
        last = ObjectId(v.id).getTimestamp();
        lastId = v.id;
    });
    return {
        count: count,
        first: first,
        last: lastId,
        lastCounted: lastId
    };
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
    console.log("Finished mr", d);
    callback();
});
This is run in a cron-like manner, so that every time interval the job processes the next limit records, beginning with the record after the lastId it stopped at the time before.
Very basic incremental map reduce stuff...
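The bookkeeping that makes this incremental might look roughly like this (a sketch; the mr_state collection, the "inventories" key, and newLastId are hypothetical, not from my actual code):
// sketch: persist the last processed _id between cron runs
var state = db.mr_state.findOne({ _id: "inventories" }); // hypothetical state collection
var lastId = state ? state.lastId : ObjectId("000000000000000000000000");
mrparams.query = { _id: { $gt: lastId } };
// ...run mapReduce, then record the highest _id processed this run:
db.mr_state.update({ _id: "inventories" }, { $set: { lastId: newLastId } }, true); // newLastId: placeholder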
But when I run it, I see the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
"count" : 2,
"first" : ISODate("2013-04-22T20:41:11Z"),
"last" : ObjectId("5175a067b25f029a1d092802"),
"lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
"count" : 7,
"first" : ISODate("2013-04-22T20:41:09Z"),
"last" : ObjectId("5175a067d7f115dd41097e3c"),
"lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TL;DR: when I run map reduce, the reducing goes fine until a mongo process runs, and then I start seeing the returned values of previous reduce calls passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance
I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection with the same key back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
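For this url-counting job, one way to keep the formats consistent (a sketch; it computes the timestamp in map so that reduce can consume emitted and re-reduced values interchangeably) would be:
map: function() {
    // emit the exact shape that reduce returns, so re-reduce and the
    // out: {reduce: ...} merge can both consume either kind of value
    var ts = this._id.getTimestamp();
    emit(this.url, { count: 1, first: ts, last: this._id, lastCounted: this._id });
},
reduce: function (key, values) {
    var out = { count: 0, first: null, last: null, lastCounted: null };
    values.forEach(function(v) {
        out.count += v.count;
        if (out.first === null || v.first < out.first) out.first = v.first;
        if (out.last === null || v.last.getTimestamp() > out.last.getTimestamp()) out.last = v.last;
        out.lastCounted = v.lastCounted; // note: the order of values is not guaranteed
    });
    return out;
}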
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.

Counting documents in MapReduce depending on condition - MongoDB

I am trying to use a Map Reduce to count the number of documents per date according to one of the field values. First, here are the results from a couple of regular find() queries:
db.errors.find({ "cDate" : ISODate("2012-11-20T00:00:00Z") }).count();
returns 579 (i.e. there are 579 documents for this date)
db.errors.find( { $and: [ { "cDate" : ISODate("2012-11-20T00:00:00Z") }, {"Type":"General"} ] } ).count()
returns 443 (i.e. there are 443 documents for this date where Type is "General")
Following is my MapReduce:
db.runCommand({ mapreduce: "errors",
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: 1,
Type: this.Type
}
);
},
reduce : function Reduce(key, values) {
var reduced = {count:0,countGeneral:0,Type:''};
values.forEach(function(val) {
reduced.count += val.count;
if (val.Type === 'General')
reduced.countGeneral += val.countGeneral;
});
return reduced;
},
finalize : function Finalize(key, reduced) {
return reduced;
},
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } },
out : { inline : 1 }
});
For the date 2012-11-20 the map reduce returns:
count: 579
countGeneral: 60 (should be 443 according to the above find query)
Now, I understand that reduce is unpredictable in the way it loops, so how should I do this?
Thanks
I suspect that you lose the rest of your values simply because you don't return Type: 'General' from your reduce function.
Reduce runs more than once, over both the values emitted in the map phase and the values returned from previous reduce calls.
For example, after the first iteration of reduce has run, you've got an output object containing something like:
{count: 15, countGeneral: 3, Type: ''}
Later iterations of reduce collect this object (and others like it), don't see Type: 'General' in it, and so never increase countGeneral again.
Your map function is wrong.
You could do something like this:
function Map() {
    var cG = 0;
    if (this.Type == 'General') { cG = 1; }
    emit(
        this.cDate, // holds a date value
        {
            count: 1,
            countGeneral: cG
        }
    );
}
This emits countGeneral 1 if Type is 'General' and 0 otherwise.
Then you can remove the Type check from your reduce function entirely; your current reduce clobbers the Type information passed from emit during the reduce phase anyway.
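The matching reduce then just sums both counters; a minimal sketch:
function Reduce(key, values) {
    var reduced = { count: 0, countGeneral: 0 };
    values.forEach(function(val) {
        // works for emitted and re-reduced values alike, because
        // both carry numeric count/countGeneral fields
        reduced.count += val.count;
        reduced.countGeneral += val.countGeneral;
    });
    return reduced;
}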

finding duplicates using map reduce from mongodb

I need to find the duplicates in a collection in mongo db which has around 20,000 documents. The result should give me the key (on which I am grouping) and the count of times it is repeated, but only if the count is greater than 1. The below is not complete; moreover, it also gives an error when I run it in the mongo.exe shell:
db.runCommand({ mapreduce: users,
map : function Map() {
emit(this.emailId, 1);
}
reduce : function Reduce(key, vals) {
return Array.sum(vals);
}
finalize : function Finalize(key, reduced) {
return reduced
}
out : { inline : 1 }
});
SyntaxError: missing } after property list (shell):5
Why is the above error occurring?
How can I get only the ones with a count greater than 1?
I'm not sure if that is an exact copy of the code you entered, but it looks like you're missing commas between the fields of the object being passed to runCommand. Try:
db.runCommand({
    mapreduce: "users", // note: the collection name must be a string
    map: function Map() {
        emit(this.emailId, 1);
    },
    reduce: function Reduce(key, vals) {
        return Array.sum(vals);
    },
    finalize: function Finalize(key, reduced) {
        return reduced;
    },
    out: { inline: 1 }
});
Also note that even when using finalize, you can't actually remove entries from the output document (or collection) in a single pass with Map-Reduce. However, whether you're using out: {inline: 1} or out: "some_collection", it is pretty trivial to filter out the results where the count is 1.
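For example, with the output written to a collection (the "user_counts" name here is hypothetical), the duplicates are one query away:
db.runCommand({
    mapreduce: "users",
    map: function Map() { emit(this.emailId, 1); },
    reduce: function Reduce(key, vals) { return Array.sum(vals); },
    out: "user_counts" // hypothetical output collection
});
// keep only the keys that occurred more than once
db.user_counts.find({ value: { $gt: 1 } });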

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 values in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know a fast way to do that using map reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map reduce iterates over the collection almost twice (assuming there is a very small number of duplicates):
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    }
);
db[res.result].find({ value: { $gt: 1 } }).forEach( // $gt: 1, so only actual duplicates
    function (obj) {
        db.duplicates.insert(obj);
    }
);
I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate([
    { $group: { _id: "$extra_info.id", total: { $sum: 1 } } },
    { $match: { total: { $gte: 2 } } },
    { $sort: { total: -1 } },
    { $limit: 5 }
]);
It searches for documents whose extra_info.id is used two or more times, sorts the results in descending order of the count, and prints the first 5 of them.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find({ "md5": { $exists: true } }, { "md5": 1 }).sort({ "md5": 1 }).forEach(function(current) {
    if (current.md5 == previous_md5) {
        db.duplicates.update({ "_id": current.md5 }, { "$inc": { count: 1 } }, true);
    }
    previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the copies will be "back-to-back" after sorting. So we just keep a pointer in previous_md5 and compare it to current.md5. If we find a duplicate, I drop it into the duplicates collection (using $inc with upsert to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
You can do a group by that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although the fastest thing might be to just do a query which returns only that field and then do the aggregation in the client. Group/Map-Reduce need access to the whole document, which is much more costly than providing just the data from the index (covered queries are supported in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.
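A sketch of keeping such a tally collection up to date as files are inserted (the md5_counts name and newFile variable are hypothetical):
// after inserting a file document, bump the tally for its md5
db.md5_counts.update(
    { _id: newFile.md5 }, // newFile: the document just inserted (placeholder)
    { $inc: { count: 1 } },
    true // upsert: create the tally entry the first time an md5 is seen
);
// finding duplicates is then a cheap query on the tally collection:
db.md5_counts.find({ count: { $gt: 1 } });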