Finding duplicates using MapReduce in MongoDB

I need to find the duplicates in a MongoDB collection which has around 20,000 documents. The result should give me the key (on which I am grouping) and the number of times it is repeated, but only if the count is greater than 1. The code below is not complete, and it is also giving an error when I run it in the mongo.exe shell:
db.runCommand({ mapreduce: users,
map : function Map() {
emit(this.emailId, 1);
}
reduce : function Reduce(key, vals) {
return Array.sum(vals);
}
finalize : function Finalize(key, reduced) {
return reduced
}
out : { inline : 1 }
});
SyntaxError: missing } after property list (shell):5
Why is the above error occurring?
How can I get only the ones with a count greater than 1?

I'm not sure if that is an exact copy of the code you've entered, but it looks like you're missing commas between the fields in the object being passed to runCommand; the collection name also needs to be a quoted string. Try:
db.runCommand({ mapreduce: "users",  // the collection name must be a string
    map : function Map() {
        emit(this.emailId, 1);
    },
    reduce : function Reduce(key, vals) {
        return Array.sum(vals);
    },
    finalize : function Finalize(key, reduced) {
        return reduced;
    },
    out : { inline : 1 }
});
Also note that even when using finalize, you can't actually remove entries from the output document (or collection) in a single pass with Map-Reduce. However, whether you're using out: {inline: 1} or out: "some_collection", it is pretty trivial to filter out results where the count is 1.
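For example, with inline output the results come back as an array you can filter directly in the shell (a sketch, reusing the corrected command above):

var res = db.runCommand({
    mapreduce: "users",
    map: function () { emit(this.emailId, 1); },
    reduce: function (key, vals) { return Array.sum(vals); },
    out: { inline: 1 }
});
// keep only the emailIds that occurred more than once
var dups = res.results.filter(function (r) { return r.value > 1; });
printjson(dups);

With out: "some_collection", the equivalent filter would be db.some_collection.find({ value: { $gt: 1 } }).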

Related

Mongodb find field with dot in collection with 1M documents

I am trying to find a query that returns all the fields/keys in a collection with 1M records. The fields are nested (a field under a field under a field). But I end up getting the error below as output. Any suggestions on how to overcome this?
mr = db.runCommand({
"mapreduce" : "MyCollectionName",
"map" : function() {
var f = function() {
for (var key in this) {
if (this.hasOwnProperty(key)) {
emit(key, null)
if (typeof this[key] == 'object') {
f.call(this[key])
}
}
}
}
f.call(this);
},
"reduce" : function(key, stuff) { return null; },
"out": "MyCollectionName" + "_keys"
});
print(db[mr.result].distinct("_id"));
getting this error:
MongoServerError: distinct too big, 16mb cap
I am fairly new to mongodb, so forgive my ignorance…
Once you create the new collection containing all the field names, you seem to be running distinct on its "_id" field. "_id" values are unique by definition, so you can just run db.{collectionName}.find({}) instead of distinct. Also, the Mongo docs explicitly state that distinct results must not exceed the max BSON size (16 MB).
Results must not be larger than the maximum BSON size. If your results exceed the maximum BSON size, use the aggregation pipeline to retrieve distinct values using the $group operator, as described in Retrieve Distinct Values with the Aggregation Pipeline.
You can use the aggregation pipeline for that, like this:
db.{collectionName}.aggregate( [ { $group : { _id : "${fieldName}" } } ] )
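For the find() route mentioned above, a minimal sketch (assuming the output collection name "MyCollectionName_keys" from the question):

db.MyCollectionName_keys.find({}, { _id: 1 }).forEach(function (doc) {
    print(doc._id);  // each _id in the output collection is one distinct field name
});

Unlike distinct, find() streams results through a cursor in batches, so it is not subject to the 16 MB cap on a single result document.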

Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing a perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is a large set of documents containing:
{_id: <id>, url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
emit(this.url, {count: 1, id: this._id});
}
And the reduce (with lots of debugging print for logs shown below):
reduce: function (key, values) {
var count = 0;
var lastId = null;
var first = null;
if (typeof values[0].id == "undefined") {
print("bad id");
printjson(key);
printjson(values[0]);
return null;
} else {
print ("good id");
printjson(key);
printjson(values[0]);
}
first = ObjectId(values[0].id).getTimestamp();
values.forEach(function(v) {
count += v.count;
last = ObjectId(v.id).getTimestamp();
lastId = v.id;
});
return {
count: count,
first: first,
last: lastId,
lastCounted: lastId
};
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
console.log("Finished mr", d);
callback();
});
This is done in a cron-type manner: every time interval, the job runs on limit records, beginning with the record after the lastId from the previous run.
Very basic incremental map reduce stuff...
But when I run it, I am seeing the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
"count" : 2,
"first" : ISODate("2013-04-22T20:41:11Z"),
"last" : ObjectId("5175a067b25f029a1d092802"),
"lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
"count" : 7,
"first" : ISODate("2013-04-22T20:41:09Z"),
"last" : ObjectId("5175a067d7f115dd41097e3c"),
"lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TL;DR: when I run map reduce, the reducing goes fine until a mongo process runs; then I start seeing the returned values of previous reduce functions passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance
I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection (with the same key) back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.
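Concretely, "consistent format" means map should emit the same shape that reduce returns. A minimal sketch of what that might look like for this job (the first/last handling is an assumption about the original intent):

map: function () {
    var ts = this._id.getTimestamp();
    // emit the exact same shape that reduce returns
    emit(this.url, { count: 1, first: ts, last: ts, lastCounted: this._id });
},
reduce: function (key, values) {
    var out = { count: 0, first: null, last: null, lastCounted: null };
    values.forEach(function (v) {
        out.count += v.count;
        if (out.first === null || v.first < out.first) out.first = v.first;
        if (out.last === null || v.last > out.last) out.last = v.last;
        out.lastCounted = v.lastCounted;
    });
    return out;
}

Because both sides use the same shape, a value can safely pass through reduce any number of times, including the merge pass triggered by out: {reduce: ...}.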

Counting documents in MapReduce depending on condition - MongoDB

I am trying to use a MapReduce to count the number of documents per date according to one of the field values. First, here are the results from a couple of regular find() queries:
db.errors.find({ "cDate" : ISODate("2012-11-20T00:00:00Z") }).count();
returns 579 (i.e. there are 579 documents for this date)
db.errors.find( { $and: [ { "cDate" : ISODate("2012-11-20T00:00:00Z") }, {"Type":"General"} ] } ).count()
returns 443 (i.e. there are 443 documents for this date where Type="General")
Following is my MapReduce:
db.runCommand({ mapreduce: "errors",
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: 1,
Type: this.Type
}
);
},
reduce : function Reduce(key, values) {
var reduced = {count:0,countGeneral:0,Type:''};
values.forEach(function(val) {
reduced.count += val.count;
if (val.Type === 'General')
reduced.countGeneral += val.countGeneral;
});
return reduced;
},
finalize : function Finalize(key, reduced) {
return reduced;
},
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } },
out : { inline : 1 }
});
For the date 2012-11-20 the map reduce returns:
count: 579
countGeneral: 60 (should be 443 according to the above find query)
Now, I understand that the reduce is unpredictable in the way it loops, so how should I do this?
Thanks
You lose the rest of your values precisely because you don't return Type: 'General' from your reduce function.
Reduce runs more than once, over both the values emitted in the map phase and the values returned by previous reduce calls.
For example, when the first iteration of reduce has run, you get an output object containing something like:
{count: 15, countGeneral: 3, Type: ''}
Other iterations of reduce then collect this object and others like it, don't see Type: 'General' there, and never increase countGeneral again.
Your map function is wrong.
You could do something like this:
function Map() {
var cG=0;
if (this.Type == 'General') { cG=1; }
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: cG
}
);
}
This emits countGeneral as 1 if Type is 'General' and 0 otherwise.
You can then remove the Type field from your emit entirely, since you're destroying it anyway in your reduce function: currently your reduce clobbers the Type information passed from emit during the reduce phase.
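For completeness, the matching reduce then just sums both counters, which stays correct even when reduce re-runs over its own partial output (a sketch, keeping the question's field names):

function Reduce(key, values) {
    var reduced = { count: 0, countGeneral: 0 };
    values.forEach(function (val) {
        reduced.count += val.count;
        reduced.countGeneral += val.countGeneral;
    });
    return reduced;
}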

Counting rows in MapReduce in MongoDB

I have created the following MapReduce and came across something curious. I'm counting the number of documents per date in two different ways and getting different values. Here are my functions:
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
}
);
}
reduce : function Reduce(key, values) {
var reduced = {count:0,count1:0};
values.forEach(function(val) {
reduced.count += val.count;
reduced.count1++;
});
return reduced;
}
finalize : function Finalize(key, reduced) {
return reduced;
}
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } }
out : { inline : 1 }
So basically, what is strange is that at the end "count" and "count1" return different values. "count" has the correct value, that is, the number of documents for that date, while "count1" has a much lower value. Can anyone explain? (I'm new to MongoDB so use simple terms :-))
Thanks.
Two problems (which are really the same problem):
Your emit format must be the same as the result returned from your reduce function.
Your reduce must be prepared to be called more than once for the same key (i.e. if you reduce five values for a key and then reduce three values for a key, the reduce function may be called again to reduce the results of those two previous reduce operations).
Your example demonstrates exactly what happens if you assume you will always be reducing the emitted value {count: 1} rather than a previously reduced result: reduced.count1++ adds one per value, but a value that is itself the output of an earlier reduce already represents many documents, so count1 undercounts.
Reference: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-ReduceFunction
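A sketch of a re-reduce-safe version: emit count1 from map as well, and have reduce sum it instead of incrementing by one (names follow the question; this is an illustration, not the only possible fix):

map : function Map() {
    emit(this.cDate, { count: 1, count1: 1 });
}
reduce : function Reduce(key, values) {
    var reduced = { count: 0, count1: 0 };
    values.forEach(function (val) {
        reduced.count += val.count;    // correct under re-reduce
        reduced.count1 += val.count1;  // sum, don't ++: val may already cover many docs
    });
    return reduced;
}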

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
"_id" : ObjectId("4e57908c7a044a30dc03a888"),
"path" : "/videos/1/show_invisibles.m4v",
"issued_at" : ISODate("2011-04-08T00:00:00Z"),
"status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"value" : {
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
}
How can I make the output look like this instead:
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.
Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."
AFAIK, by design Mongo's map reduce will spit results out in "value tuples" and I haven't seen anything that will configure that "output format". Maybe the finalize() method can be used.
You could try running a post-process that will reshape the data using
results.find({}).forEach( function(result) {
results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths})
});
Yep, that looks ugly. I know.
You can do Dan's code with a collection reference:
function clean(collection) {
    collection.find().forEach(function (result) {
        var value = result.value;
        delete value._id;
        collection.update({_id: result._id}, value);
        collection.update({_id: result._id}, {$unset: {value: 1}});  // was result.id, a typo
    });
}
A similar approach to that of @ljonas, but with no need to hardcode document fields:
db.results.find().forEach( function(result) {
var value = result.value;
delete value._id;
db.results.update({_id: result._id}, value);
db.results.update({_id: result._id}, {$unset: {value: 1}} )
} );
All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection=function(dbName,collectionName) {
var collection=db.getSiblingDB(dbName)[collectionName];
var i=0;
var bulk=collection.initializeUnorderedBulkOp();
collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
print((++i));
//collection.update({_id: result._id},result.value);
bulk.find({_id: result._id}).replaceOne(result.value);
if(i%1000==0)
{
print("Executing bulk...");
bulk.execute();
bulk=collection.initializeUnorderedBulkOp();
}
});
if (i % 1000 != 0) bulk.execute();  // skip if the last batch was already flushed: executing an empty bulk op throws
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a forEach loop, the update can move the document to the end of the collection and the cursor will reach that document again. This can be circumvented if $snapshot is used. Hence, I am providing a Java example below.
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within foreach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true)).forEach(new Block<Document>() {
@Override
public void apply(final Document document) {
// Note that I used incrementing long values for '_id'. Change to String if
// you used string '_id's
long docId = document.getLong("_id");
Document subDoc = (Document)document.get("value");
WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
bulkUpdate.add(m);
// If you used non-incrementing '_id's, then you need to use a final object with a counter.
if(docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
}
});
// Fixing bug related to Vincent's answer.
if(!bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
Note: This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.