How to map/reduce two MongoDB collections - mongodb

I am new to map/reduce and am trying to figure out a way to collect the following data using map/reduce instead of doing it in my (slow) application logic:
I have a collection 'projects' with a 1:n relation to a collection 'tasks'. Now I'd like to receive an array of results that gives me the project names where the first is the project with the most tasks, and the last the project with the least tasks.
Or, even better, an array of hashes that also tells me how many tasks every project has (assuming the project name is unique):
[project_1: 23, project_2: 42, project_3: 82]
For map I tried something like:
map = function () {
emit(this.project_id, { count:1 });
}
and reduce:
reduce = function (key, values) {
var sum = 0;
values.forEach(function(doc){ sum += 1; });
return { count:sum };
}
I fired this against my tasks collection:
var mr = db.tasks.mapReduce(map, reduce, { out: "results" });
But I get crucial results when querying:
db[mr.result].find();
I am using Mongoid on Rails and am completely lost with it. Can someone point me in the right direction?
Thx in advance.
Felix

Looks generally right, but I spot at least one problem: The summation step in the reduce function should be
values.forEach(function(doc){ sum += doc.count ; });
because the function may be reducing values that are themselves the product of a prior reduction step, and that therefore have count values > 1.
That's a common oversight, mentioned here: http://www.mongodb.org/display/DOCS/Troubleshooting+MapReduce
Hope that helps!
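To see why the distinction matters, here is a minimal sketch in plain JavaScript that runs without MongoDB. The helper reduceInBatches is hypothetical: it only imitates the way the server may reduce a key's values in several passes and feed earlier results back into reduce. The key name and the number of tasks are made up for illustration.

```javascript
// Emitted values for one project: five tasks, each mapped to { count: 1 }.
const emitted = [1, 2, 3, 4, 5].map(() => ({ count: 1 }));

// Reduce from the question: counts array entries and ignores doc.count.
function buggyReduce(key, values) {
  let sum = 0;
  values.forEach(function (doc) { sum += 1; });
  return { count: sum };
}

// Corrected reduce: adds up the counts carried by each value.
function fixedReduce(key, values) {
  let sum = 0;
  values.forEach(function (doc) { sum += doc.count; });
  return { count: sum };
}

// Hypothetical helper: reduce in batches of `size`, then re-reduce the partial results.
function reduceInBatches(reduce, key, values, size) {
  const partials = [];
  for (let i = 0; i < values.length; i += size) {
    partials.push(reduce(key, values.slice(i, i + size)));
  }
  return partials.length > 1 ? reduce(key, partials) : partials[0];
}

const buggy = reduceInBatches(buggyReduce, "project_1", emitted, 2);
const fixed = reduceInBatches(fixedReduce, "project_1", emitted, 2);
console.log(buggy.count, fixed.count); // prints: 3 5
```

With the corrected reduce in place, the ordering asked for in the question can probably be obtained by sorting the output collection on the count, for example db.results.find().sort({ "value.count": -1 }).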

Related

Using IF/ELSE in map reduce

I am trying to make a simple map/reduce function on one of my MongoDB database collections.
I get data but it looks wrong. I am unsure about the Map part. Can I use IF/ELSE in this way?
UPDATE
I want to get the number of authors that own files. In other words: how many authors own at least one uploaded file and, consequently, how many authors have no files.
The objects in the collection look like this:
{
"_id": {
"$id": "4fa8efe33a34a40e52800083d"
},
"file": {
"author": "john",
"type": "mobile",
"status": "ready"
}
}
The map / reduce looks like this:
$map = new MongoCode ("function() {
if (this.file.type != 'mobile' && this.file.status == 'ready') {
if (!this.file.author) {
return;
}
emit (this.file.author, 1);
}
}");
$reduce = new MongoCode ("function( key , values) {
var count = 0;
for (index in values) {
count += values[index];
}
return count;
}");
$this->cimongo->command (array (
"mapreduce" => "files",
"map" => $map,
"reduce" => $reduce,
"out" => "statistics.photographer_count"
)
);
The map part looks ok to me. I would slightly change the reduce part.
values.forEach(function(v) {
count += v;
});
You should not use for in loop to iterate an array, it was not meant to do this. It is for enumerating object's properties. Here is more detailed explanation.
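A small sketch of the difference, in plain JavaScript. The extra property on Array.prototype is an assumption made for the demonstration; it stands in for any library that extends arrays.

```javascript
const values = [1, 2, 3];

// Simulate a library that adds an enumerable property to every array.
Array.prototype.extra = 100;

let forInSum = 0;
for (const index in values) {
  forInSum += values[index]; // also visits the inherited "extra" key
}

let forEachSum = 0;
values.forEach(function (v) {
  forEachSum += v; // visits only the real elements
});

delete Array.prototype.extra; // clean up the demonstration property

console.log(forInSum, forEachSum); // prints: 106 6
```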
Why do you think your data is wrong? What's your source data? What do you get? What do you expect to get?
I just tried your map and reduce in mongo shell and got correct (reasonable looking) results.
The other way you can do what you are doing is get rid of the inner "if" condition in the map but call your mapreduce function with appropriate query clause, for example:
db.files.mapReduce(map,reduce,{out:'outcollection', query:{"file.author":{$exists:true}}})
or if you happen to have indexes to make the query efficient, just get rid of all ifs and run mapreduce with query:{"file.author":{$exists:true},"file.type":"mobile","file.status":"ready"} clause. Change the conditions to match the actual cases you want to sum up over.
In 2.2 (upcoming version available today as rc0) you can use the aggregation framework for this type of query rather than writing map/reduce functions, hopefully that will simplify things somewhat.
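As a rough sketch of what that could look like, the pipeline below mirrors the conditions in the map function from the question. The filter values are assumptions copied from that function, and $exists is only an approximation of the !this.file.author check (it still lets null or empty-string authors through), so adjust it to the cases you actually want to count.

```javascript
// Aggregation pipeline: filter the files, then count documents per author.
const pipeline = [
  {
    $match: {
      "file.author": { $exists: true },
      "file.type": { $ne: "mobile" },
      "file.status": "ready"
    }
  },
  { $group: { _id: "$file.author", count: { $sum: 1 } } }
];

// In the mongo shell this would be run as: db.files.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Each output document then carries an author in _id and that author's number of matching files in count, which is the same information the map/reduce job writes to its output collection.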

MongoDB MapReduce - Emit one key/one value doesnt call reduce

So I'm new to MongoDB and mapreduce in general and came across this "quirk" (or at least in my mind a quirk)
Say I have objects in my collection like so:
{'key':5, 'value':5}
{'key':5, 'value':4}
{'key':5, 'value':1}
{'key':4, 'value':6}
{'key':4, 'value':4}
{'key':3, 'value':0}
My map function simply emits the key and the value
My reduce function simply adds the values AND before returning them adds 1 (I did this to check to see if the reduce function is even called)
My results follow:
{'_id': 3, 'value': 0}
{'_id':4, 'value': 11.0}
{'_id':5, 'value': 11.0}
As you can see, for the keys 4 & 5 I get the expected answer of 11 BUT for the key 3 (with only one entry in the collection with that key) I get the unexpected 0!
Is this natural behavior of mapreduce in general? For MongoDB? For pymongo (which I am using)?
The reduce function combines documents with the same key into one document. If the map function emits a single document for a particular key (as is the case with key 3), the reduce function will not be called.
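The behaviour can be reproduced without a database. The sketch below uses the documents from the question; simulateMapReduce is a hypothetical stand-in for the server that groups emitted values by key and calls reduce only when a key has more than one value.

```javascript
const docs = [
  { key: 5, value: 5 }, { key: 5, value: 4 }, { key: 5, value: 1 },
  { key: 4, value: 6 }, { key: 4, value: 4 },
  { key: 3, value: 0 }
];

// Reduce from the question: sum the values, then add 1 as a "was I called?" marker.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0) + 1;
}

// Hypothetical stand-in for the server: group values by key and
// skip reduce for keys that have a single value.
function simulateMapReduce(documents, reduceFn) {
  const groups = new Map();
  for (const doc of documents) {
    if (!groups.has(doc.key)) groups.set(doc.key, []);
    groups.get(doc.key).push(doc.value);
  }
  const out = [];
  for (const [key, values] of groups) {
    const value = values.length > 1 ? reduceFn(key, values) : values[0];
    out.push({ _id: key, value: value });
  }
  return out.sort((a, b) => a._id - b._id);
}

const results = simulateMapReduce(docs, reduce);
console.log(results);
// [ { _id: 3, value: 0 }, { _id: 4, value: 11 }, { _id: 5, value: 11 } ]
```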
I realize this is an older question, but I came to it and felt like I still didn't understand why this behavior exists and how to build map/reduce functionality so it's a non-issue.
The reason MongoDB doesn't call the reduce function if there is a single instance of a key is because it isn't necessary (I hope this will make more sense in a moment). The following are requirements for reduce functions:
The reduce function must return an object whose type must be identical to the type of the value emitted by the map function.
The order of the elements in the valuesArray should not affect the output of the reduce function
The reduce function must be idempotent.
The first requirement is very important and it seems a number of people are overlooking it because I've seen a number of people mapping in the reduce function then dealing with the single-key case in the finalize function. This is the wrong way to address the issue, however.
Think about it like this: If there's only a single instance of a key, a simple optimization is to skip the reducer entirely (there's nothing to reduce). Single-key values are still included in the output, but the intent of the reducer is to build an aggregate result of the multi-key documents in your collection. If the mapper and reducer are outputting the same type, you should be blissfully unaware by looking at the object structure of the output from your map/reduce functions. You shouldn't have to use a finalize function to correct the structure of your objects that didn't run through the reducer.
In short, do your mapping in your map function and reduce multi-key values into a single, aggregate result in your reduce functions.
Solution:
Add a new field in map: single: 0
In reduce, change this field to: single: 1
In finalize, check this field and take the required action
$map = new MongoCode("function() {
var value = {
time: this.time,
email_id: this.email_id,
single: 0
};
emit(this.email, value);
}");
$reduce = new MongoCode("function(k, vals) {
// perform any needed processing here
return {
time: vals[0].time,
email_id: vals[0].email_id,
single: 1
};
}");
$finalize = new MongoCode("function(key, reducedVal) {
if (reducedVal.single == 0) {
reducedVal.time = 11111;
}
return reducedVal;
};");
"MongoDB will not call the reduce function for a key that has only a single value. The values argument is an array whose elements are the value objects that are “mapped” to the key."
http://docs.mongodb.org/manual/reference/command/mapReduce/#mapreduce-reduce-cmd
Is this natural behavior of mapreduce in general?
Yes.

Ordering a result set randomly in mongo

I've recently discovered that Mongo has no SQL equivalent to "ORDER BY RAND()" in its command syntax (https://jira.mongodb.org/browse/SERVER-533)
I've seen the recommendation at http://cookbook.mongodb.org/patterns/random-attribute/ and frankly, adding a random attribute to a document feels like a hack. This won't work because this places an implicit limit to any given query I want to randomize.
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
I have a couple ideas on how I could solve this via code, but I feel like I'm missing a more obvious and native solution. Does anyone have a thought or idea on how to solve this more elegantly?
I have to agree: the easiest thing to do is to install a random value into your documents. There need not be a tremendously large range of values, either -- the number you choose depends on the expected result size for your queries (1,000 - 1,000,000 distinct integers ought to be enough for most cases).
When you run your query, don't worry about the random field -- instead, index it and use it to sort. Since there is no correspondence between the random number and the document, you should get fairly random results. Note that collisions will likely result in documents being returned in natural order.
While this is certainly a hack, you have a very easy escape route: given MongoDB's schema-free nature, you can simply stop including the random field once there is support for random sort in the server. If size is an issue, you could run a batch job to remove the field from existing documents. There shouldn't be a significant change in your client code if you design it carefully.
An alternative option would be to think long and hard about the number of results that will be randomized and returned for a given query. It may not be overly expensive to simply do shuffling in client code (i.e., if you only consider the most recent 10,000 posts).
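For the client-side route, a Fisher-Yates shuffle is one common choice. This is a minimal sketch; recentPosts is a made-up stand-in for whatever bounded result set the query returns.

```javascript
// Fisher-Yates shuffle: returns a new array with the same elements in random order.
function shuffle(items) {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Stand-in for the documents returned by a bounded query (for example the latest posts).
const recentPosts = Array.from({ length: 20 }, (_, i) => ({ _id: i }));
const randomTen = shuffle(recentPosts).slice(0, 10);
console.log(randomTen.map((p) => p._id));
```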
What you want cannot be done without picking either of the two solutions you mention. Picking a random offset is a horrible idea if your collection becomes larger than a few thousand documents. The reason for this is that the skip(n) operation takes O(n) time. In other words, the higher your random offset the longer the query will take.
Adding a randomized field to the document is, in my opinion, the least hacky solution there is given the current feature set of MongoDB. It provides stable query times and gives you some say over how the collection is randomized (and allows you to generate a new random value after each query through a findAndModify for example). I also do not understand how this would impose an implicit limit on your queries that make use of randomization.
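The idea can be sketched in plain JavaScript with an in-memory array standing in for the collection. The field name rand and the helper pickRandom are assumptions for illustration; against MongoDB the two filters would be range queries on an indexed rand field, sorted by that field.

```javascript
// Stand-in collection: every document carries a precomputed random value.
const collection = Array.from({ length: 100 }, (_, i) => ({ _id: i, rand: Math.random() }));

// Pick n documents starting at a random pivot, wrapping around to the
// smallest rand values if fewer than n documents lie above the pivot.
function pickRandom(docs, n) {
  const pivot = Math.random();
  const above = docs.filter((d) => d.rand >= pivot).sort((a, b) => a.rand - b.rand);
  const below = docs.filter((d) => d.rand < pivot).sort((a, b) => a.rand - b.rand);
  return above.concat(below).slice(0, n);
}

const picked = pickRandom(collection, 5);
console.log(picked.map((d) => d._id));
```

Because a fresh pivot is drawn per call, repeated calls start at different points of the rand ordering. Documents that sit next to each other in that ordering still tend to be returned together, which is what regenerating the random value after each read would address.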
You can give this a try - it's fast, works with multiple documents and doesn't require populating rand field at the beginning, which will eventually populate itself:
add index to .rand field on your collection
use find and refresh, something like:
// Install packages:
// npm install mongodb async
// Add index in mongo:
// db.ensureIndex('mycollection', { rand: 1 })
var mongodb = require('mongodb')
var async = require('async')
// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
var result = []
var rand = Math.random()
// Append documents to the result based on criteria and options; skip the call once n documents are collected.
var appender = function (criteria, options) {
return function (done) {
// Compute the limit when the step runs, not when the series is built,
// so earlier steps reduce the number of documents still needed.
options.limit = n - result.length
if (options.limit > 0) {
collection.find(criteria, fields, options).toArray(
function (err, docs) {
if (!err && Array.isArray(docs)) {
Array.prototype.push.apply(result, docs)
}
done(err)
}
)
} else {
async.nextTick(done)
}
}
}
async.series([
// Fetch docs with uninitialized .rand.
// NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
appender({ rand: { $exists: false } }, {}),
// Fetch on one side of random number.
appender({ rand: { $gte: rand } }, { sort: { rand: 1 } }),
// Continue fetch on the other side.
appender({ rand: { $lt: rand } }, { sort: { rand: -1 } }),
// Refresh fetched docs, if any.
function (done) {
if (result.length > 0) {
var batch = collection.initializeUnorderedBulkOp({ w: 0 })
for (var i = 0; i < result.length; ++i) {
batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
}
batch.execute(done)
} else {
async.nextTick(done)
}
}
], function (err) {
done(err, result)
})
}
// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
if (!err) {
findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
if (!err) {
console.log(result)
} else {
console.error(err)
}
db.close()
})
} else {
console.error(err)
}
})
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
Why? If you have 7,000 documents and you choose three random offsets from 0 to 6999, the chosen documents will be random, even if the collection itself is sorted alphabetically.
One could insert an id field (the $id field won't work because it's not an actual number) and use modulus math to get a random skip. If you have 10,000 records and you want 10 results, you could randomly pick a modulus between 1 and 1000, such as 253, and then request the documents where mod(id,253)=0. This is reasonably fast if id is indexed. Then randomly sort those results client side and keep 10 of them. Sure, they are evenly spaced out instead of truly random, but it is close to what is desired.
Both options seem like imperfect hacks to me: a random field will always have the same value, and skip will return the same records for the same number.
Why not sort on a random field and then skip randomly? I admit it is also a hack, but in my experience it gives a better sense of randomness.
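A minimal in-memory sketch of the modulus idea follows. The sequential ids 1 to 10,000 and the variable names are assumptions; on the server the filter would probably be expressed with the $mod query operator, for example { id: { $mod: [253, 0] } }.

```javascript
// Stand-in for a collection with a sequential numeric id field, 1..10000.
const ids = Array.from({ length: 10000 }, (_, i) => i + 1);
const wanted = 10;

// Choose a random modulus between 1 and 1000, so at least 10 ids always match.
const modulus = 1 + Math.floor(Math.random() * 1000);

// Equivalent of the server-side filter mod(id, modulus) = 0.
const matches = ids.filter((id) => id % modulus === 0);

// Shuffle client side (Fisher-Yates) and keep the first `wanted` results.
for (let i = matches.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1));
  [matches[i], matches[j]] = [matches[j], matches[i]];
}
const sample = matches.slice(0, wanted);
console.log(modulus, sample);
```

A small modulus matches many more than 10 documents, so in practice it may be worth restricting the range of the modulus or adding a limit to the query.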

selecting all the fields in a row using mapReduce

I am using mongoose with nodejs. I am using mapReduce to fetch data grouped by a field, so all it gives me as a collection is the key with the grouping field only from every row of the database.
I need to fetch all the fields from the database grouped by one field and sorted by another. For example, I have a database with details of places, the fare for travelling to those places, and a few other fields. Now I need to fetch the data grouped by place and sorted by fare. MapReduce helps me get that, but I cannot get the other fields.
Is there a way to get all the fields using map reduce, rather than just the two fields mentioned in the example above?
I must admit I'm not sure I understand completely what you're asking.
But maybe one of the following thoughts helps you:
either) when you iterate over your mapReduce results, you could fetch complete documents from mongodb for each result. That would give you access to all fields in each document for the cost of some network traffic.
or) The value that you send into emit(key, value) can be an object. So you could construct a value object that contains all your desired fields. Just be sure to use the exactly same object structure for your reduce method's return value.
I try to illustrate with an (untested) example.
map = function() {
emit(this.place,
{
'field1': this.field1,
'field2': this.field2,
'count' : 1
});
}
reduce = function(key, values) {
var result = {
'field1': values[0].field1,
'field2': values[0].field2,
'count' : 0 };
values.forEach(function(v) {
result.count += v.count;
});
return result;
}

MongoDB map/reduce over multiple collections?

First, the background. I used to have a collection logs and used map/reduce to generate various reports. Most of these reports were based on data from within a single day, so I always had a condition d: SOME_DATE. When the logs collection grew extremely big, inserting became extremely slow (slower than the app we were monitoring was generating logs), even after dropping lots of indexes. So we decided to have each day's data in a separate collection - logs_YYYY-mm-dd - that way indexes are smaller, and we don't even need an index on date. This is cool since most reports (thus map/reduce) are on daily data. However, we have a report where we need to cover multiple days.
And now the question. Is there a way to run a map/reduce (or more precisely, the map) over multiple collections as if it were only one?
A reduce function may be called once, with a key and all corresponding values (but only if there are multiple values for the key - it won't be called at all if there's only 1 value for the key).
It may also be called multiple times, each time with a key and only a subset of the corresponding values, and the previous reduce results for that key. This scenario is called a re-reduce. In order to support re-reduces, your reduce function should be idempotent.
There are two key features in an idempotent reduce function:
The return value of the reduce function should be in the same format as the values it takes in. So, if your reduce function accepts an array of strings, the function should return a string. If it accepts objects with several properties, it should return an object containing those same properties. This ensures that the function doesn't break when it is called with the result of a previous reduce.
Don't make assumptions based on the number of values it takes in. It isn't guaranteed that the values parameter contains all the values for the given key. So using values.length in calculations is very risky and should be avoided.
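Both points can be illustrated with an average, which is a typical case where values.length is tempting. In this sketch (plain JavaScript, made-up numbers) each value carries its own sum and count, so the reduce output has the same shape as its input and a re-reduce gives the same answer as a single pass.

```javascript
// Values emitted by map for one key: each carries a partial sum and a count.
const emitted = [10, 20, 30, 40].map((n) => ({ sum: n, count: 1 }));

// Output has the same shape as the input and never relies on values.length.
function reduce(key, values) {
  const out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
}

// One pass over all values.
const once = reduce("k", emitted);

// Two partial passes followed by a re-reduce of their results.
const partA = reduce("k", emitted.slice(0, 3));
const partB = reduce("k", emitted.slice(3));
const twice = reduce("k", [partA, partB]);

console.log(once, twice); // both: { sum: 100, count: 4 }
// The average is derived afterwards (for example in finalize): sum / count.
console.log(twice.sum / twice.count); // prints: 25
```

Had the count been taken from values.length, the re-reduce step would have seen only two values and reported a count of 2 instead of 4.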
Update: The two steps below aren't required (or even possible, I haven't checked) on the more recent MongoDB releases. It can now handle these steps for you, if you specify an output collection in the map-reduce options:
{ out: { reduce: "tempResult" } }
If your reduce function is idempotent, you shouldn't have any problems map-reducing multiple collections. Just re-reduce the results of each collection:
Step 1
Run the map-reduce on each required collection and save the results in a single, temporary collection. You can store the results using a finalize function:
finalize = function (key, value) {
db.tempResult.save({ _id: key, value: value });
}
db.someCollection.mapReduce(map, reduce, { finalize: finalize })
db.anotherCollection.mapReduce(map, reduce, { finalize: finalize })
Step 2
Run another map-reduce on the temporary collection, using the same reduce function. The map function is a simple function that selects the keys and values from the temporary collection:
map = function () {
emit(this._id, this.value);
}
db.tempResult.mapReduce(map, reduce)
This second map-reduce is basically a re-reduce and should give you the results you need.
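The two steps can be mimicked in plain JavaScript to check that the approach holds together. Everything here is a stand-in: logsDay1 and logsDay2 are made-up per-day collections, and runMapReduce is a hypothetical helper that plays the role of one map-reduce run counting log entries per user.

```javascript
// Hypothetical per-day collections with one log entry per document.
const logsDay1 = [{ user: "a" }, { user: "b" }, { user: "a" }];
const logsDay2 = [{ user: "a" }, { user: "c" }];

// Same reduce for every pass: sums counts and returns the shape it receives.
function reduce(key, values) {
  return { count: values.reduce((total, v) => total + v.count, 0) };
}

// Hypothetical stand-in for one map-reduce run: emits { count: 1 } per entry.
function runMapReduce(docs) {
  const grouped = {};
  docs.forEach((doc) => {
    (grouped[doc.user] = grouped[doc.user] || []).push({ count: 1 });
  });
  return Object.keys(grouped).map((key) => ({ _id: key, value: reduce(key, grouped[key]) }));
}

// Step 1: run per collection and pool the results, as in tempResult.
const tempResult = runMapReduce(logsDay1).concat(runMapReduce(logsDay2));

// Step 2: re-reduce the pooled results by key.
const merged = {};
tempResult.forEach((row) => {
  merged[row._id] = merged[row._id] ? reduce(row._id, [merged[row._id], row.value]) : row.value;
});
console.log(merged); // { a: { count: 3 }, b: { count: 1 }, c: { count: 1 } }
```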
I used the map-reduce method. Here is an example.
var mapemployee = function () {
// Emit the same shape that reduceF returns, so re-reducing works.
emit(this.jobid, {Name: this.Name, Designation: null});};
var mapdesignation = function () {
emit(this.jobid, {Name: null, Designation: this.Designation});};
var reduceF = function(key, values) {
var outs = {Name: null, Designation: null};
values.forEach(function(v){
// Keep the first non-null value seen for each field.
if(outs.Name == null){
outs.Name = v.Name }
if(outs.Designation == null){
outs.Designation = v.Designation }
});
return outs;
};
result = db.employee.mapReduce(mapemployee, reduceF, {out: {reduce: 'output'}});
result = db.designation.mapReduce(mapdesignation,reduceF, {out: {reduce: 'output'}});
Reference: http://www.itgo.me/a/x3559868501286872152/mongodb-join-two-collections