Counting documents in MapReduce depending on condition - MongoDB

I am trying to use a MapReduce to count the number of documents per date according to one of the field values. First, here are the results from a couple of regular find() queries:
db.errors.find({ "cDate" : ISODate("2012-11-20T00:00:00Z") }).count();
returns 579 (ie. there are 579 documents for this date)
db.errors.find( { $and: [ { "cDate" : ISODate("2012-11-20T00:00:00Z") }, {"Type":"General"} ] } ).count()
returns 443 (ie. there are 443 documents for this date where Type="General")
Following is my MapReduce:
db.runCommand({ mapreduce: "errors",
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: 1,
Type: this.Type
}
);
},
reduce : function Reduce(key, values) {
var reduced = {count:0,countGeneral:0,Type:''};
values.forEach(function(val) {
reduced.count += val.count;
if (val.Type === 'General')
reduced.countGeneral += val.countGeneral;
});
return reduced;
},
finalize : function Finalize(key, reduced) {
return reduced;
},
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } },
out : { inline : 1 }
});
For the date 2012-11-20 the MapReduce returns:
count: 579
countGeneral: 60 (should be 443 according to the above find query)
Now, I understand that reduce is unpredictable in the way it loops, so how should I do this?
Thanks

I suspect that you lose the rest of your values because you don't return Type 'General' from your reduce function.
Reduce can run more than once for the same key: its input is a mix of values emitted in the map part and values returned from earlier calls of the reduce function.
For example, when the first iteration of reduce has run, you get an output object containing something like:
{count: 15, countGeneral: 3, Type: ''}
Later iterations of reduce collect this object and others like it, don't see Type: 'General' there, and so don't increase countGeneral anymore.
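To make this concrete, here is a small plain-JavaScript sketch (no MongoDB needed) that runs the question's reduce function in two passes. The sample values and the way they are split into batches are made up for illustration; MongoDB decides the real batching itself.

```javascript
// The reduce function from the question, unchanged.
function reduce(key, values) {
  var reduced = { count: 0, countGeneral: 0, Type: '' };
  values.forEach(function (val) {
    reduced.count += val.count;
    if (val.Type === 'General')
      reduced.countGeneral += val.countGeneral;
  });
  return reduced;
}

// Hypothetical emitted values for one date: three 'General', one 'Other'.
var emitted = [
  { count: 1, countGeneral: 1, Type: 'General' },
  { count: 1, countGeneral: 1, Type: 'General' },
  { count: 1, countGeneral: 1, Type: 'General' },
  { count: 1, countGeneral: 1, Type: 'Other' }
];

// First pass: two partial reduces over two assumed batches.
var partialA = reduce('2012-11-20', emitted.slice(0, 2));
var partialB = reduce('2012-11-20', emitted.slice(2));

// Second pass: reduce is fed its own earlier outputs. Both carry Type '',
// so the 'General' check fails and their countGeneral values are dropped.
var rereduced = reduce('2012-11-20', [partialA, partialB]);
```

In this sketch rereduced.count is still the full total, while rereduced.countGeneral falls back to 0, which is the same kind of undercount the question reports.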

Your map function is wrong.
You could do something like this:
function Map() {
var cG=0;
if (this.Type == 'General') { cG=1; }
emit(
this.cDate,//Holds a date value
{
count: 1,
countGeneral: cG
}
);
}
This emits countGeneral 1 if Type is 'General' and 0 otherwise.
Then you can remove the Type check from your reduce function entirely and just sum countGeneral, since reduce throws the Type away anyway: currently your reduce returns Type: '', clobbering the Type information passed from emit.
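A sketch of how the corrected pair behaves, again in plain JavaScript so it can be run without a server. mapValue is a hypothetical stand-in for the body of Map() that returns the value instead of calling emit, and the sample documents are invented.

```javascript
// Stand-in for the map step: returns the value that would be emitted.
function mapValue(doc) {
  return { count: 1, countGeneral: doc.Type === 'General' ? 1 : 0 };
}

// Reduce without any Type check: it only sums, and its output has the
// same shape as the emitted values, so it is safe to re-reduce.
function reduce(key, values) {
  var reduced = { count: 0, countGeneral: 0 };
  values.forEach(function (val) {
    reduced.count += val.count;
    reduced.countGeneral += val.countGeneral;
  });
  return reduced;
}

// Invented sample documents for a single date.
var docs = [
  { Type: 'General' }, { Type: 'General' }, { Type: 'General' }, { Type: 'Other' }
];
var emitted = docs.map(mapValue);

// Reducing everything at once and reducing in two stages give the same result.
var onePass = reduce('2012-11-20', emitted);
var twoPass = reduce('2012-11-20', [
  reduce('2012-11-20', emitted.slice(0, 2)),
  reduce('2012-11-20', emitted.slice(2))
]);
```

Because the conditional logic lives in map, the result no longer depends on how the values happen to be grouped.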

Related

Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing a perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is large set of documents containing:
{_id: , url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
emit(this.url, {count: 1, id: this._id});
}
And the reduce (with lots of debugging print for logs shown below):
reduce: function (key, values) {
var count = 0;
var lastId = null;
var first = null;
if (typeof values[0].id == "undefined") {
print("bad id");
printjson(key);
printjson(values[0]);
return null;
} else {
print ("good id");
printjson(key);
printjson(values[0]);
}
first = ObjectId(values[0].id).getTimestamp();
values.forEach(function(v) {
count += v.count;
last = ObjectId(v.id).getTimestamp();
lastId = v.id;
});
return {
count: count,
first: first,
last: lastId,
lastCounted: lastId
};
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
console.log("Finished mr", d);
callback();
});
This is done in a cron type manner so that every time interval, the job is run on limit number of records beginning with the record after the lastId it was run on the time before.
Very basic incremental map reduce stuff...
But when I run it, I am seeing the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
"count" : 2,
"first" : ISODate("2013-04-22T20:41:11Z"),
"last" : ObjectId("5175a067b25f029a1d092802"),
"lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
"count" : 7,
"first" : ISODate("2013-04-22T20:41:09Z"),
"last" : ObjectId("5175a067d7f115dd41097e3c"),
"lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TLDR, when I run map reduce, the reducing is going fine until a mongo process runs then I start seeing the returned values of previous reduce functions passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance
I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection with the same key back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.
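As a rough illustration of what a "consistent format" could look like here, the sketch below uses one value shape, { count, firstId, lastId }, for both the map output and the reduce output. The field names are hypothetical, and plain numbers stand in for ObjectIds so the comparison works in ordinary JavaScript; real ObjectIds would need their own comparison.

```javascript
// Stand-in for the map step: returns the value that would be emitted.
function mapValue(doc) {
  return { count: 1, firstId: doc._id, lastId: doc._id };
}

// Reduce returns the same shape it receives, so it does not matter whether
// an input came from map, from an earlier reduce, or from the document
// already stored in the output collection.
function reduce(key, values) {
  var out = { count: 0, firstId: values[0].firstId, lastId: values[0].lastId };
  values.forEach(function (v) {
    out.count += v.count;
    if (v.firstId < out.firstId) out.firstId = v.firstId;
    if (v.lastId > out.lastId) out.lastId = v.lastId;
  });
  return out;
}

// Value assumed to be stored in the output collection by a previous run.
var stored = reduce('some-url', [{ _id: 1 }, { _id: 2 }].map(mapValue));

// Value produced by the current run over newer documents.
var fresh = reduce('some-url', [{ _id: 5 }, { _id: 7 }, { _id: 6 }].map(mapValue));

// With out: {reduce: ...}, both are passed through reduce again.
var merged = reduce('some-url', [stored, fresh]);
```

With this shape there is no "bad id" case to guard against, because every input to reduce has the same fields.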

Mongoose limit/offset and count query

Bit of an odd one on query performance... I need to run a query which does a total count of documents, and can also return a result set that can be limited and offset.
So, I have 57 documents in total, and the user wants 10 documents offset by 20.
I can think of 2 ways of doing this, first is query for all 57 documents (returned as an array), then using array.slice return the documents they want. The second option is to run 2 queries, the first one using mongo's native 'count' method, then run a second query using mongo's native $limit and $skip aggregators.
Which do you think would scale better? Doing it all in one query, or running two separate ones?
Edit:
// 1 query
var limit = 10;
var offset = 20;
Animals.find({}, function (err, animals) {
if (err) {
return next(err);
}
res.send({count: animals.length, animals: animals.slice(offset, limit + offset)});
});
// 2 queries
Animals.find({}, null, {limit: 10, skip: 20}, function (err, animals) {
if (err) {
return next(err);
}
Animals.count({}, function (err, count) {
if (err) {
return next(err);
}
res.send({count: count, animals: animals});
});
});
I suggest you use 2 queries:
db.collection.count() will return the total number of items. When called without a query filter, this value is read from the collection's metadata rather than calculated by scanning documents.
db.collection.find().skip(20).limit(10) returns the page. I assume you will sort by some field, so do not forget to add an index on that field. This query will be fast too.
I don't think you should query all items and then perform skip and take in application code, because later, when you have a lot of data, you will have problems with transferring and processing it.
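A minimal sketch of the two-query approach, with both queries started in parallel. It assumes a Mongoose-style model exposing countDocuments() and find() (recent Mongoose versions; older ones use count() as in the question). paginate and fakeModel are hypothetical names, and fakeModel is only an in-memory stand-in so the sketch runs without a database.

```javascript
// Runs the count query and the page query in parallel and combines them.
async function paginate(model, filter, offset, limit) {
  const [count, items] = await Promise.all([
    model.countDocuments(filter).exec(),
    model.find(filter).sort({ _id: 1 }).skip(offset).limit(limit).exec()
  ]);
  return { count: count, items: items };
}

// In-memory stand-in for a model. It ignores the filter and the sort and
// only mimics the chaining style of the query API.
function fakeModel(docs) {
  return {
    countDocuments: function () {
      return { exec: async function () { return docs.length; } };
    },
    find: function () {
      var skip = 0;
      var limit = docs.length;
      var query = {
        sort: function () { return query; },
        skip: function (n) { skip = n; return query; },
        limit: function (n) { limit = n; return query; },
        exec: async function () { return docs.slice(skip, skip + limit); }
      };
      return query;
    }
  };
}
```

With a real model the call would look like paginate(Animals, {}, 20, 10), and only the requested 10 documents cross the wire.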
Instead of using 2 separate queries, you can use aggregate() in a single query:
The aggregate "$facet" stage can fetch both the total count and the data (with skip & limit) more quickly:
db.collection.aggregate([
//{$sort: {...}}
//{$match:{...}}
{$facet:{
"stage1" : [ {"$group": {_id:null, count:{$sum:1}}} ],
"stage2" : [ { "$skip": 0}, {"$limit": 2} ]
}},
{$unwind: "$stage1"},
//output projection
{$project:{
count: "$stage1.count",
data: "$stage2"
}}
]);
output as follows:-
[{
count: 50,
data: [
{...},
{...}
]
}]
Also, have a look at https://docs.mongodb.com/manual/reference/operator/aggregation/facet/
db.collection_name.aggregate([
{ '$match' : { } },
{ '$sort' : { '_id' : -1 } },
{ '$facet' : {
metadata: [ { $count: "total" } ],
data: [ { $skip: 1 }, { $limit: 10 },{ '$project' : {"_id":0} } ] // add a projection here if you wish to re-shape the docs
} }
] )
This avoids using two queries to find the total count and to skip through the matched records.
$facet is the best and most optimized way. In a single pipeline it can:
Match the records
Find the total count
Skip the records
It can also reshape the data according to our needs in the query.
There is a library that will do all of this for you, check out mongoose-paginate-v2
After having to tackle this issue myself, I would like to build upon user854301's answer.
With Mongoose ^4.13.8 I was able to use a function called toConstructor(), which allowed me to avoid building the query multiple times when filters are applied. I know this function is available in older versions too, but you'll have to check the Mongoose docs to confirm this.
The following uses Bluebird promises:
let schema = Query.find({ name: 'bloggs', age: { $gt: 30 } });
// save the query as a 'template'
let query = schema.toConstructor();
return Promise.join(
schema.count().exec(),
query().limit(limit).skip(skip).exec(),
function (total, data) {
return { data: data, total: total }
}
);
Now the count query will return the total records it matched and the data returned will be a subset of the total records.
Please note the () around query() which constructs the query.
You don't have to use two queries or one complicated query with aggregate and such.
You can use one query
example:
const getNames = async (queryParams) => {
const cursor = db.collection.find(queryParams).skip(20).limit(10);
return {
count: await cursor.count(),
data: await cursor.toArray()
}
}
mongo returns a cursor that has predefined functions such as count, which will return the full count of the queried results regardless of skip and limit.
So in the count property you will get the total number of documents matching the query, and in data you will get just the chunk of 10 documents at an offset of 20.
Thanks to Igor Igeto Mitkovski: the best solution is to use the native connection.
The documentation is here: https://docs.mongodb.com/manual/reference/method/cursor.count/#mongodb-method-cursor.count
Mongoose doesn't support it ( https://github.com/Automattic/mongoose/issues/3283 ),
so we have to use the native connection.
const query = StudentModel.collection.find(
{
age: 13
},
{
projection:{ _id:0 }
}
).sort({ time: -1 })
const count = await query.count()
const records = await query.skip(20)
.limit(10).toArray()

Counting rows in MapReduce in MongoDB

I have created the following Map Reduce and came across something curious. I'm counting in 2 different ways the number of documents per date and coming up with different values. Here are my functions:
map : function Map() {
emit(
this.cDate,//Holds a date value
{
count: 1,
}
);
}
reduce : function Reduce(key, values) {
var reduced = {count:0,count1:0};
values.forEach(function(val) {
reduced.count += val.count;
reduced.count1++;
});
return reduced;
}
finalize : function Finalize(key, reduced) {
return reduced;
}
query : { "cDate" : { "$gte" : ISODate("2012-11-20T00:00:00Z") } }
out : { inline : 1 }
So Basically what is strange is that at the end "count" and "count1" are returning different values. "count" has the correct value, that is, the number of documents for that date while "count1" has a much lower value. Can anyone explain (I'm new to MongoDB so use simple terms :-)
Thanks.
Two problems (which are really the same problem):
Your emit format must be the same as the result returned by the reduce function.
Your reduce must be prepared to be called more than once for the same key (i.e. if you reduce five values for a key and then reduce three values for the same key, the reduce function may be called again to reduce the results of the two previous reduce operations).
Your example just demonstrates what happens if you assume that you will always be reducing the result "1" rather than the actual previously emitted or reduced result.
Reference: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-ReduceFunction
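The five-then-three scenario above can be played through in plain JavaScript with the question's reduce function. The split into batches of five and three is just the example from this answer; the real grouping is up to MongoDB.

```javascript
// The reduce function from the question, unchanged.
function reduce(key, values) {
  var reduced = { count: 0, count1: 0 };
  values.forEach(function (val) {
    reduced.count += val.count;
    reduced.count1++;
  });
  return reduced;
}

// Eight emitted values for one date, each { count: 1 }.
var emitted = [];
for (var i = 0; i < 8; i++) {
  emitted.push({ count: 1 });
}

// Two partial reduces: five values, then three values.
var firstBatch = reduce('2012-11-20', emitted.slice(0, 5));
var secondBatch = reduce('2012-11-20', emitted.slice(5));

// Re-reduce of the two partial results.
var combined = reduce('2012-11-20', [firstBatch, secondBatch]);
```

In the last call count adds up the carried totals (5 + 3), while count1++ only counts the two objects it was handed, so count1 ends up at 2 instead of 8.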

finding duplicates using map reduce from mongodb

I need to find the duplicates in a collection in mongo db which has around 20000 documents. The result should give me the key (on which I am grouping) and the count of times they are repeated only if the count is greater than 1. The below is not complete, however it is giving an error also when I run in mongo.exe shell :
db.runCommand({ mapreduce: users,
map : function Map() {
emit(this.emailId, 1);
}
reduce : function Reduce(key, vals) {
return Array.sum(vals);
}
finalize : function Finalize(key, reduced) {
return reduced
}
out : { inline : 1 }
});
SyntaxError: missing } after property list (shell):5
why is the above error coming?
how can only get the ones with count greater than 1?
I'm not sure if that is an exact copy of the code you've entered, but it looks like you're missing commas between the fields in the object being passed to runCommand. Try:
db.runCommand({ mapreduce: "users",
map : function Map() {
emit(this.emailId, 1);
},
reduce : function Reduce(key, vals) {
return Array.sum(vals);
},
finalize : function Finalize(key, reduced) {
return reduced
},
out : { inline : 1 }
});
Also note that even when using finalize, you can't actually remove entries from the outputted document (or collection) in a single-pass with Map-Reduce. However, whether you're using out: {inline: 1}, or out: "some_collection", it is pretty trivial to filter out results where the count is 1.
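For the inline case, the filtering could look roughly like this. The response object below is a made-up sample in the { results: [ { _id, value } ] } shape that inline map-reduce output uses; the email addresses are invented.

```javascript
// Made-up sample of an inline map-reduce response.
var response = {
  results: [
    { _id: 'a@example.com', value: 3 },
    { _id: 'b@example.com', value: 1 },
    { _id: 'c@example.com', value: 2 }
  ],
  ok: 1
};

// Keep only the keys that occurred more than once.
var duplicates = response.results.filter(function (entry) {
  return entry.value > 1;
});
```

When writing to a collection instead, the equivalent would be a find on that collection with { value: { $gt: 1 } }.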

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
"_id" : ObjectId("4e57908c7a044a30dc03a888"),
"path" : "/videos/1/show_invisibles.m4v",
"issued_at" : ISODate("2011-04-08T00:00:00Z"),
"status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"value" : {
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
}
How can I make the output look like this instead:
{
"_id" : ISODate("2011-04-08T00:00:00Z"),
"count" : 6,
"paths" : {
"/videos/1/show_invisibles.m4v" : {
"count" : 2
},
"/videos/1/show_invisibles.ogv" : {
"count" : 3
},
"/videos/6/buffers_listed_and_hidden.ogv" : {
"count" : 1
}
}
}
It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.
Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."
AFAIK, by design Mongo's map reduce will spit results out in "value tuples" and I haven't seen anything that will configure that "output format". Maybe the finalize() method can be used.
You could try running a post-process that will reshape the data using
results.find({}).forEach( function(result) {
results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths})
});
Yep, that looks ugly. I know.
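If you would rather not list the fields by hand, the reshaping step can be pulled out into a small pure function. This is only a sketch: flattenResult is a hypothetical helper name, and the sample document is a shortened version of the one in the question with a string in place of the ISODate.

```javascript
// Builds the flattened document: keeps _id and lifts every field of
// `value` to the top level.
function flattenResult(result) {
  var flat = { _id: result._id };
  Object.keys(result.value).forEach(function (field) {
    // Never let a nested _id overwrite the document's own _id.
    if (field !== '_id') {
      flat[field] = result.value[field];
    }
  });
  return flat;
}

// Shortened sample in the shape produced by map-reduce.
var sample = {
  _id: '2011-04-08',
  value: {
    count: 6,
    paths: { '/videos/1/show_invisibles.m4v': { count: 2 } }
  }
};
var flattened = flattenResult(sample);
```

Inside the forEach above, the second argument to update would then be flattenResult(result).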
You can do Dan's code with a collection reference:
function clean(collection) {
collection.find().forEach( function(result) {
var value = result.value;
delete value._id;
collection.update({_id: result._id}, value);
collection.update({_id: result._id}, {$unset: {value: 1}});
});
}
A similar approach to that of @ljonas but no need to hardcode document fields:
db.results.find().forEach( function(result) {
var value = result.value;
delete value._id;
db.results.update({_id: result._id}, value);
db.results.update({_id: result._id}, {$unset: {value: 1}} )
} );
All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection=function(dbName,collectionName) {
var collection=db.getSiblingDB(dbName)[collectionName];
var i=0;
var bulk=collection.initializeUnorderedBulkOp();
collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
print((++i));
//collection.update({_id: result._id},result.value);
bulk.find({_id: result._id}).replaceOne(result.value);
if(i%1000==0)
{
print("Executing bulk...");
bulk.execute();
bulk=collection.initializeUnorderedBulkOp();
}
});
bulk.execute();
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a forEach loop, an update can move the document to the end of the collection, and the cursor will then reach that document again. This can be circumvented if $snapshot is used. Hence, I am providing a Java example below.
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within foreach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true)).forEach(new Block<Document>() {
@Override
public void apply(final Document document) {
// Note that I used incrementing long values for '_id'. Change to String if
// you used string '_id's
long docId = document.getLong("_id");
Document subDoc = (Document)document.get("value");
WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
bulkUpdate.add(m);
// If you used non-incrementing '_id's, then you need to use a final object with a counter.
if(docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
}
});
// Fixing bug related to Vincent's answer.
if(!bulkUpdate.isEmpty()) {
collection.bulkWrite(bulkUpdate);
bulkUpdate.removeAll(bulkUpdate);
}
Note : This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.