Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing a perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is a large set of documents of the form:
{_id: , url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
    emit(this.url, {count: 1, id: this._id});
}
And the reduce (with lots of debug printing for the logs shown below):
reduce: function (key, values) {
    var count = 0;
    var first = null;
    var last = null;
    var lastId = null;
    if (typeof values[0].id == "undefined") {
        print("bad id");
        printjson(key);
        printjson(values[0]);
        return null;
    } else {
        print("good id");
        printjson(key);
        printjson(values[0]);
    }
    first = ObjectId(values[0].id).getTimestamp();
    values.forEach(function(v) {
        count += v.count;
        last = ObjectId(v.id).getTimestamp();
        lastId = v.id;
    });
    return {
        count: count,
        first: first,
        last: lastId,
        lastCounted: lastId
    };
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
    console.log("Finished mr", d);
    callback();
});
This is run cron-style: at every interval, the job processes limit records, beginning with the record after the lastId that the previous run stopped at.
Very basic incremental map reduce stuff...
But when I run it, I see the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
"count" : 2,
"first" : ISODate("2013-04-22T20:41:11Z"),
"last" : ObjectId("5175a067b25f029a1d092802"),
"lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
"count" : 7,
"first" : ISODate("2013-04-22T20:41:09Z"),
"last" : ObjectId("5175a067d7f115dd41097e3c"),
"lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TL;DR: when I run map reduce, the reducing goes fine until some mongo process runs; then I start seeing the returned values of previous reduce calls passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance

I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection with the same key back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.
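In other words, with out: {reduce: ...} the value returned from reduce must have exactly the same shape as the values emitted from map, because either one can be fed back into reduce on a later run. Here is a minimal sketch of one consistent shape (the field names are illustrative, not taken from the original job):

map: function() {
    // emit the exact shape that reduce returns
    emit(this.url, {count: 1, first: this._id.getTimestamp(), lastId: this._id});
},
reduce: function(key, values) {
    var out = {count: 0, first: null, lastId: null};
    values.forEach(function(v) {
        out.count += v.count; // works for fresh emits and re-reduced docs alike
        if (out.first === null || v.first < out.first) out.first = v.first;
        out.lastId = v.lastId; // last value seen in iteration order
    });
    return out;
}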

Related

MongoDB: phantom records in $unwind results under heavy load

I have a simple collection of elements like this
{_id: n, xs: [...]}
I'm trying to count the total number of elements in all the arrays:
db.testRace.aggregate([{ $unwind : "$xs" }, { $group : { _id : null, count : { $sum : 1 } } }])
And it works great unless I start doing massive updates of this collection. Under a heavy load of update operations I get a wrong total - slightly bigger than it should be.
It can be easily reproduced.
First generate some test data
for (var i = 1; i <= 1000000; i++) {
    db.testRace.insert({_id: i, xs: [i]});
}
Then simulate a lot of updates
while (true) {
    var id = Math.floor((Math.random() * 1000000) + 1);
    var obj = db.testRace.find({_id: id}).next();
    obj.some = "change";
    db.testRace.update({_id: id}, obj);
}
And while it is running, run the aggregate unwind query.
Without load I get the right result - 1000000. But when there are a lot of updates I get bigger numbers, like 1001456.
And if I run a query like this
db.testRace.aggregate([{ $unwind : "$xs" }, {$group: {_id:"$xs", count:{$sum: 1}}}, { $sort : { count : -1 } }, { $limit : 2 }]);
I get
"result" : [
{
"_id" : 996972,
"count" : 2
},
{
"_id" : 997789,
"count" : 2
}
],
So it seems the aggregation counts some records twice.
Is this expected behaviour, or am I doing the aggregation wrong?
I tested on local mongodb instance, version - 2.4.9
It's expected behavior, due to the way MongoDB handles read isolation. When you have a long-running query (and an aggregation that reads every single document is a long-running query) and the data is updated during that query, the updates may or may not be visible to it - depending on what happens when, you could miss a document, receive it once, or receive it twice.
From the source code:
Any data inserted, deleted, or modified during a yield that should be
returned by a query may or may not be returned by that query. The
query could return: nothing; the data before; the data after; or both
the data before and the data after.
In short, there is no isolation between a query and an
insert/delete/update. AKA, READ_UNCOMMITTED.
https://github.com/mongodb/mongo/blob/master/src/mongo/db/exec/plan_stage.h
Your aggregation query yields mid-query, during which some of the data is updated. This impacts the results of the query.
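One possible workaround (a sketch of my own, not from the answer above): walk the collection via the unique _id index with hint(), so each document is returned at most once, and sum the array lengths on the client. Slower than the aggregation, but immune to the double counting caused by document moves.

var total = 0;
// hint() forces the unique _id index, so a moved document is not revisited
db.testRace.find({}, {xs: 1}).hint({_id: 1}).forEach(function(doc) {
    total += doc.xs.length;
});
print(total);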

Can MongoDB aggregate "top x" results in this document schema?

{
    "_id" : "user1_20130822",
    "metadata" : {
        "date" : ISODate("2013-08-22T00:00:00.000Z"),
        "username" : "user1"
    },
    "tags" : {
        "abc" : 19,
        "123" : 2,
        "bca" : 64,
        "xyz" : 14,
        "zyx" : 12,
        "321" : 7
    }
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags: E.g., Top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if i have multiple documents that need to be combined together before getting the top? e.g., top tags for a user in a given month
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in-place $inc's easier (many more of these happening than reads).
Or do I need to return back the whole document, and defer to the client on the sorting/limiting?
When you use object keys as tag names, you make this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects. But there is always MapReduce.
Have your map function emit one document for each key/value pair in the tags subdocument. It should look something like this:
var mapFunction = function() {
    for (var key in this.tags) {
        emit(key, this.tags[key]);
    }
}
Your reduce-function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
    {
        mapReduce: "yourcollection", // the collection where your data is stored
        query: { _id : "user1_20130822" }, // or however you want to limit the results
        map: mapFunction,
        reduce: reduceFunction,
        out: { inline: 1 } // means that the output is returned directly
    }
)
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results on the application level. If you insist on doing the sorting and limiting on the database, define an output collection to store the mapReduce results in (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards, as sketched below.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output collection and do the MapReduce in the background at regular intervals.
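A sketch of that variant (reusing mapFunction and reduceFunction from above; in map-reduce output the summed count lands in the value field, which is what we sort on):

db.runCommand(
    {
        mapReduce: "yourcollection",
        query: { _id : "user1_20130822" },
        map: mapFunction,
        reduce: reduceFunction,
        out: { replace: "temporaryCollectionName" }
    }
)
// top 3 tags, sorted descending by summed count
db.temporaryCollectionName.find().sort({ value: -1 }).limit(3)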

Counting documents in MapReduce depending on condition - MongoDB

I am trying to use a MapReduce to count the number of documents, according to one of the field values, per date. First, here are the results from a couple of regular find() queries:
db.errors.find({ "cDate" : ISODate("2012-11-20T00:00:00Z") }).count();
returns 579 (ie. there are 579 documents for this date)
db.errors.find( { $and: [ { "cDate" : ISODate("2012-11-20T00:00:00Z") }, {"Type":"General"} ] } ).count()
returns 443 (ie. there are 443 documents for this date where Type="General")
Following is my MapReduce:
db.runCommand({ mapreduce: "errors",
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: 1,
Type: this.Type
}
);
},
reduce : function Reduce(key, values) {
var reduced = {count:0,countGeneral:0,Type:''};
values.forEach(function(val) {
reduced.count += val.count;
if (val.Type === 'General')
reduced.countGeneral += val.countGeneral;
});
return reduced;
},
finalize : function Finalize(key, reduced) {
return reduced;
},
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } },
out : { inline : 1 }
});
For the date 2012-11-20 the map reduce returns:
count: 579
countGeneral: 60 (should be 443 according to the above find query)
Now, I understand that reduce is unpredictable in the way it loops, so how should I do this?
Thanks
I suspect that you lose the rest of your values because you don't return 'General' in the Type field from your reduce part.
Reduce can run more than once, over both the values emitted in the map part and values already returned from the reduce function.
For example, when the first iteration of reduce has run, you've got an output object containing something like:
{count: 15, countGeneral: 3, Type: ''}
Other iterations of reduce then collect this object and others like it, don't see Type: 'General' there, and don't increase countGeneral any more.
Your map function is wrong.
You could do something like this:
function Map() {
    var cG = 0;
    if (this.Type == 'General') { cG = 1; }
    emit(
        this.cDate, // holds a date value
        {
            count: 1,
            countGeneral: cG
        }
    );
}
This emits countGeneral 1 if Type is 'General' and 0 otherwise.
Then you can remove the Type field from your emitted value entirely, since you're destroying it anyway in your reduce function: currently your reduce clobbers the Type information passed from emit during the reduce phase.
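A matching reduce, sketched to stay consistent with the fixed map above (no Type field, just sums):

function Reduce(key, values) {
    var reduced = {count: 0, countGeneral: 0};
    values.forEach(function(val) {
        reduced.count += val.count;               // all documents
        reduced.countGeneral += val.countGeneral; // only those that mapped Type 'General' to 1
    });
    return reduced;
}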

finding duplicates using map reduce from mongodb

I need to find the duplicates in a mongodb collection which has around 20000 documents. The result should give me the key (on which I am grouping) and the number of times it is repeated, only if the count is greater than 1. The below is not complete; moreover, it gives an error when I run it in the mongo.exe shell:
db.runCommand({ mapreduce: users,
    map : function Map() {
        emit(this.emailId, 1);
    }
    reduce : function Reduce(key, vals) {
        return Array.sum(vals);
    }
    finalize : function Finalize(key, reduced) {
        return reduced
    }
    out : { inline : 1 }
});
SyntaxError: missing } after property list (shell):5
why is the above error coming?
how can I get only the ones with count greater than 1?
I'm not sure if that is an exact copy of the code you've entered, but it looks like you're missing commas between the fields in the object being passed to runCommand. Try:
db.runCommand({ mapreduce: "users", // note: the collection name must be a string
    map : function Map() {
        emit(this.emailId, 1);
    },
    reduce : function Reduce(key, vals) {
        return Array.sum(vals);
    },
    finalize : function Finalize(key, reduced) {
        return reduced;
    },
    out : { inline : 1 }
});
Also note that even when using finalize, you can't actually remove entries from the output document (or collection) in a single pass with Map-Reduce. However, whether you're using out: {inline: 1} or out: "some_collection", it is pretty trivial to filter out the results where the count is 1.
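For example (a sketch; res is the document returned by the runCommand above):

// inline output: keep only duplicated keys
var res = db.runCommand({ mapreduce: "users",
    map : function Map() { emit(this.emailId, 1); },
    reduce : function Reduce(key, vals) { return Array.sum(vals); },
    out : { inline : 1 }
});
var dupes = res.results.filter(function(r) { return r.value > 1; });
printjson(dupes);
// or, if you used out: "some_collection" instead:
// db.some_collection.find({ value: { $gt: 1 } })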

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
    "_id" : ObjectId("4e57908c7a044a30dc03a888"),
    "path" : "/videos/1/show_invisibles.m4v",
    "issued_at" : ISODate("2011-04-08T00:00:00Z"),
    "status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "value" : {
        "count" : 6,
        "paths" : {
            "/videos/1/show_invisibles.m4v" : {
                "count" : 2
            },
            "/videos/1/show_invisibles.ogv" : {
                "count" : 3
            },
            "/videos/6/buffers_listed_and_hidden.ogv" : {
                "count" : 1
            }
        }
    }
}
How can I make the output look like this instead:
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "count" : 6,
    "paths" : {
        "/videos/1/show_invisibles.m4v" : {
            "count" : 2
        },
        "/videos/1/show_invisibles.ogv" : {
            "count" : 3
        },
        "/videos/6/buffers_listed_and_hidden.ogv" : {
            "count" : 1
        }
    }
}
It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.
Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
    db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."
AFAIK, by design Mongo's map reduce will spit results out in "value tuples" and I haven't seen anything that will configure that "output format". Maybe the finalize() method can be used.
You could try running a post-process that will reshape the data using
results.find({}).forEach(function(result) {
    results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths});
});
Yep, that looks ugly. I know.
You can do Dan's code with a collection reference:
function clean(collection) {
    collection.find().forEach(function(result) {
        var value = result.value;
        delete value._id;
        collection.update({_id: result._id}, value);
        collection.update({_id: result._id}, {$unset: {value: 1}});
    });
}
A similar approach to that of @ljonas, but with no need to hardcode document fields:
db.results.find().forEach(function(result) {
    var value = result.value;
    delete value._id;
    db.results.update({_id: result._id}, value);
    db.results.update({_id: result._id}, {$unset: {value: 1}});
});
All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection = function(dbName, collectionName) {
    var collection = db.getSiblingDB(dbName)[collectionName];

    var i = 0;
    var bulk = collection.initializeUnorderedBulkOp();
    // addOption(16) sets the noTimeout cursor flag (DBQuery.Option.noTimeout)
    collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
        print((++i));
        //collection.update({_id: result._id}, result.value);
        bulk.find({_id: result._id}).replaceOne(result.value);

        if (i % 1000 == 0) {
            print("Executing bulk...");
            bulk.execute();
            bulk = collection.initializeUnorderedBulkOp();
        }
    });
    bulk.execute();
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a forEach loop, this can move the document to the end of the collection, and the cursor will reach that document again. This can be circumvented if $snapshot is used. Hence, I am providing a Java example below.
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within forEach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true)).forEach(new Block<Document>() {
    @Override
    public void apply(final Document document) {
        // Note that I used incrementing long values for '_id'. Change to String if
        // you used string '_id's
        long docId = document.getLong("_id");

        Document subDoc = (Document) document.get("value");
        WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
        bulkUpdate.add(m);

        // If you used non-incrementing '_id's, then you need to use a final object with a counter.
        if (docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
            collection.bulkWrite(bulkUpdate);
            bulkUpdate.removeAll(bulkUpdate);
        }
    }
});
// Fixing bug related to Vincent's answer: flush any remaining operations.
if (!bulkUpdate.isEmpty()) {
    collection.bulkWrite(bulkUpdate);
    bulkUpdate.removeAll(bulkUpdate);
}
Note: This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.