MongoDB MapReduce strange results - mongodb

When I perform a Mapreduce operation over a MongoDB collection with an small number of documents everything goes ok.
But when I run it with a collection with about 140.000 documents, I get some strange results:
Map function:
function() { emit(this.featureType, this._id); }
Reduce function:
function(key, values) { return { count: values.length, ids: values };
As a result, I would expect something like (for each mapping key):
{
"_id": "FEATURE_TYPE_A",
"value": { "count": 140000,
"ids": [ "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5", .......}
But instead I get this strange document structure:
{
"_id": "FEATURE_TYPE_A",
"value": {
"count": 706,
"ids": [
{
"count": 101,
"ids": [
{
"count": 100,
"ids": [
"9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5".....}
Could someone explain me if this is the expected behavior, or am I doing something wrong?
Thanks in advance!

The case here is un-usual and I'm not sure if this is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
What that basically says here is that your current operation is only expecting that "reduce" function to be called once, but this is not the case. The input will in fact be "broken up" and passed in here as manageable sizes. The multiple calling of "reduce" now makes another point very important.
Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:
the type of the return object must be identical to the type of the value emitted by the map function to ensure that the following operations is true:
Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result. Essentially making sure that the output for the "mapper" is sent in the same form as how it will appear in the "reducer" and the reduce process itself is mindful of this.
So first the mapper revised:
function () { emit(this.type, { count: 1, ids: [this._id] }); }
Which is now consistent with the final output form. This is important when considering the reducer which you now know will be invoked multiple times:
function (key, values) {
var ids = [];
var count = 0;
values.forEach(function(value) {
count += value.count;
value.ids.forEach(function(id) {
ids.push( id );
});
});
return { count: count, ids: ids };
}
What this means is that each invocation of the reduce function expects the same inputs as it is outputting, being a count field and an array of ids. This gets to the final result by essentially
Reduce one chunk of results #chunk1
Reduce another chunk of results #chunk2
Comnine the reduce on the reduced chunks, #chunk1 and #chunk2
That may not seem immediately apparent, but the behavior is by design where the reducer gets called many times in this way to process large sets of emitted data, so it gradually "aggregates" rather than in one big step.
The aggregation framework makes this a lot more straightforward, where from MongoDB 2.6 and upwards the results can even be output to a collection, so if you had more than one result and the combined output was greater than 16MB then this would not be a problem.
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
"ids": { "$push": "$_id" }
}},
{ "$out": "ouputCollection" }
])
So that will not break and will actually return as expected, with the complexity greatly reduced as the operation is indeed very straightforward.
But I have already said that your purpose for returning the array of "_id" values here seems unclear in your intent given the sheer size. So if all you really wanted was a count by the "featureType" then you would use basically the same approach rather than trying to force mapReduce to find the length of an array that is very large:
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
}}
])
In either form though, the results will be correct as well as running in a fraction of the time that the mapReduce operation as constructed will take.

Related

Why is a MongoDB update() slower than find() when no documents are matched and can this be changed?

Consider the following two approaches of (maybe) updating a bunch of records:
Approach 1 ("find and maybe update"):
let ids = db.getCollection("users").find({
"status.lastActivity": {"$lte": timeoutDate}
}, {
"fields": {"_id": 1}
}).fetch().map(doc => {
doc = doc._id;
return doc
});
if (ids.length) {
db.getCollection("users").update({
"_id": {"$in": ids}
}, {
"$set": {
"status.idle": true
}
}, {
"multi": true
});
}
Approach 2 ("directly update"):
db.getCollection("users").update({
"status.lastActivity": {"$lte": timeoutDate}
}, {
"$set": {
"status.idle": true
}
}, {
"multi": true
});
And now to keep it simple let's assume that there are never users with a smaller status.lastActivity than timeoutDate (so ids is also always an empty array).
In that case I get a significantly better performance with Approach 1. Like Approach 1 takes 0.1 to 2 ms while Approach 2 takes 40 to 80 ms.
My question now is, why is that the case? I would have assumed MongoDB is 'clever' enough to do things similar to Approach 1 under the hood when I actually use Approach 2 and doesn't waste resource when there actually is no record matched by the selector...
And can I change it somehow so that it would work that way? Or have I maybe some kind of wrong configuration which is causing this and I could get rid of? Because obviously writing things like in Approach 2 would be leaner...
Is this in JS? db.getCollection("users").find( looks like it should return a Promise, and promises don't have length, so the update code that's gated by ids.length would never run.

MapReduce function to return two outputs. MongoDB

I am currently using doing some basic mapReduce using MongoDB.
I currently have data that looks like this:
db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...
I want to create a mapReduce function to group data by gender and show
Total number of each gender, Male & Female
Average weight of each gender
I can create a map / reduce function to carry out each function seperately, just cant get my head around how to show output for both. I am guessing since the grouping is based on Gender, Map function should stay the same and just alter something ont he reduce section...
Work so far
var map1 = function()
{var key = this.gender;
emit(key, {count:1});}
var reduce1 = function(key, values)
{var sum=0;
values.forEach(function(value){sum+=value["count"];});
return{count: sum};};
db.football_team.mapReduce(map1, reduce1, {out: "gender_stats"});
Output
db.football_team.find()
{"_id" : "f", "value" : {"count": 12} }
{"_id" : "m", "value" : {"count": 18} }
Thanks
The key rule to "map/reduce" in any implementation is basically that the same shape of data needs to be emitted by the mapper as is also returned by the reducer. The key reason for this is part of how "map/reduce" conceptually works by quite possibly calling the reducer multiple times. Which basically means you can call your reducer function on output that was already emitted from a previous pass through the reducer along with other data from the mapper.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
That said, your best approach to "average" is therefore to total the data along with a count, and then simply divide the two. This actually adds another step to a "map/reduce" operation as a finalize function.
db.football_team.mapReduce(
// mapper
function() {
emit(this.gender, { count: 1, weight: this.weight });
},
// reducer
function(key,values) {
var output = { count: 0, weight: 0 };
values.forEach(value => {
output.count += value.count;
output.weight += value.weight;
});
return output;
},
// options and finalize
{
"out": "gender_stats", // or { "inline": 1 } if you don't need another collection
"finalize": function(key,value) {
value.avg_weight = value.weight / value.count; // take an average
delete value.weight; // optionally remove the unwanted key
return value;
}
}
)
All fine because both mapper and reducer are emitting data with the same shape and also expecting input in that shape within the reducer itself. The finalize method of course is just invoked after all "reducing" is finally done and just processes each result.
As noted though, the aggregate() method actually does this far more effectively and in native coded methods which do not incur the overhead ( and potential security risks ) of server side JavaScript interpretation and execution:
db.football_team.aggregate([
{ "$group": {
"_id": "$gender",
"count": { "$sum": 1 },
"avg_weight": { "$avg": "$weight" }
}}
])
And that's basically it. Moreover you can actually continue and do other things after a $group pipeline stage ( or any stage for that matter ) in ways that you cannot do with a MongoDB mapReduce implementation. Notably something like applying a $sort to the results:
db.football_team.aggregate([
{ "$group": {
"_id": "$gender",
"count": { "$sum": 1 },
"avg_weight": { "$avg": "$weight" }
}},
{ "$sort": { "avg_weight": -1 } }
])
The only sorting allowed by mapReduce is solely that the key used with emit is always sorted in ascending order. But you cannot sort the aggregated result in output in any other way, without of course performing queries when output to another collection, or by working "in memory" with returned results from the server.
As a "side note" ( though an important one ), you probably should also consider in "learning" that the reality is the "server-side JavaScript" functionality of MongoDB is really a work-around more than being a feature. When MongoDB was first introduced, it applied a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.
Thus to make up for the lack of the complete implementation of many query operators and aggregation functions which would come later, adding a JavaScript engine was a "quick fix" to allow certain things to be done with minimal implementation.
The result over the years is those JavaScript engine features are gradually being removed. The group() function of the API is removed. The eval() function of the API is deprecated and scheduled for removal at the next major version. The writing is basically "on the wall" for the limited future of these JavaScript on the server features, as the clear pattern is where the native features provide support for something, then the need to continue support for the JavaScript engine basically goes away.
The core wisdom here being that focusing on learning these JavaScript on the server features, is probably not really worth the time invested unless you have a pressing use case that currently cannot be solved by any other means.

Push Item to Array and Delete in the Same Request

I have a document that stores sensor data where the sensor readings are objects stored in an array. Example:
{
"readings": [
{
"timestamp": 1499475320,
"temperature": 121
},
{
"timestamp": 1499475326,
"temperature": 93
},
{
"timestamp": 1499475340,
"temperature": 142
}
]
}
I know how to push/add an item to the "readings" array. But what I need is when I add an item to the array, I also want to "clean" the array by removing items that have "timestamp" value older than a cutoff time.
Is this possible in mongodb?
The way I see this you basically have two options here that have varying approaches.
Restrict Arrays to Capped Size
The first option here is "not exactly" what you are asking for, but it is the option with the least implementation and execution overhead. The variance from your question is that instead of "removing past a certain age", we instead simply place a "limit/cap" on the total number of entries in the array.
This is actually done using the $slice modifier to $push:
Model.update(
{ "_id": docId },
{ "$push": {
"readings": {
"$each": [{ "timestamp": 1499478496679, "temperature": 100 }],
"$slice": -10
}
}
)
In this case the -10 argument restricts the array to only have the "last ten" entries from the end of the array since we are "appending" with $push. If you wanted instead the "latest" as the first entry then you would modify with $position and instead provide the "positive" value to $slice, which means "first ten" in contrast.
So it's not the same thing you asked for, but it is practical since the arrays do not have "unlimited growth" and you can simply "cap" them as each update is made and the "oldest" item will be removed once at the maximum length. This means the overall document never actually grows beyond a set size, and this is a very good thing for MongoDB.
Issue with Bulk Operations
The next case which actually does exactly what you ask uses "Bulk Operations" to issue "two" update operations in a "single" request to the server. The reason why it is "two" is because there is a rule that you cannot have different update operators "assigned to the same path" in a singe update operation.
Therefore what you want actually involves a $push AND a $pull operation, and on the "same array path" we need to issue those as "separate" operations. This is where the Bulk API can help:
Model.collection.bulkWrite([
{ "updateOne": {
"filter": { "_id": docId },
"update": {
"$pull": {
"readings": { "timestamp": { "$lt": cutOff } }
}
}
}},
{ "updateOne": {
"filter": { "_id": docId },
"update": {
"$push": { "timestamp": 1499478496679, "temperature": 100 }
}
}}
])
This uses the .bulkWrite() method from the underlying driver which you access from the model via .collection as shown. This will actually return a BulkWriteOpResult within the callback or Promise which contains information about the actual operations performed within the "batch". In this case it will be the "matched" and "modified" numbers which will be appropriate to the operations that were actually performed.
Hence if the $pull did not actually "remove" anything since the timestamp values were actually newer than the given constraint, then the modified count would only reflect the $push operation. But most of the time this need not concern you, where instead you would just accept that the operations completed without error and did something according to what you actually asked.
Conclude
So the general case of "both" is that it's really all done in one request and one response. The differences come in that "under the hood" the second approach which matches your request actually does do "two" operations per request and therefore takes microseconds longer.
There is actually no reason why you could not "combine" the logic of "both", and remove past your "cutoFF" as well as keeping a "cap" on the overall array size. But the general idea here is that the first implementation, though not exactly the same thing as asked will actually do a "good enough" job of "housekeeping" with little to no additional overhead on the request, or indeed the implementation of the actual code.
Also, whilst you can always "read the data" -> "modify" -> "save". That is not a really great pattern. And for best performance as well as "consistency" without conflict, you should be using the atomic operations to modify in just the same way as is outlined here.

Find largest document size in MongoDB

Is it possible to find the largest document size in MongoDB?
db.collection.stats() shows average size, which is not really representative because in my case sizes can differ considerably.
You can use a small shell script to get this value.
Note: this will perform a full table scan, which will be slow on large collections.
let max = 0, id = null;
db.test.find().forEach(doc => {
const size = Object.bsonsize(doc);
if(size > max) {
max = size;
id = doc._id;
}
});
print(id, max);
Note: this will attempt to store the whole result set in memory (from .toArray) . Careful on big data sets. Do not use in production! Abishek's answer has the advantage of working over a cursor instead of across an in memory array.
If you also want the _id, try this. Given a collection called "requests" :
// Creates a sorted list, then takes the max
db.requests.find().toArray().map(function(request) { return {size:Object.bsonsize(request), _id:request._id}; }).sort(function(a, b) { return a.size-b.size; }).pop();
// { "size" : 3333, "_id" : "someUniqueIdHere" }
Starting Mongo 4.4, the new aggregation operator $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to find the bson size of the document whose size is the biggest:
// { "_id" : ObjectId("5e6abb2893c609b43d95a985"), "a" : 1, "b" : "hello" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a986"), "c" : 1000, "a" : "world" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a987"), "d" : 2 }
db.collection.aggregate([
{ $group: {
_id: null,
max: { $max: { $bsonSize: "$$ROOT" } }
}}
])
// { "_id" : null, "max" : 46 }
This:
$groups all items together
$projects the $max of documents' $bsonSize
$$ROOT represents the current document for which we get the bsonsize
Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers using the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, getting all documents to the client).
You need to know which field(s) in your document might be the largest ones - which you almost always will know. There are only two practical1 MongoDB types that can have variable sizes:
arrays
strings
The aggregation framework can calculate the length of each. Note that you won't get the size in bytes for arrays, but the length in elements. However, what matters more typically is which the outlier documents are, not exactly how many bytes they take.
Here's how it's done for arrays. As an example, let's say we have a collections of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):
db.users.aggregate([
{ $match: {
'friends.ids': { $exists: true }
}},
{ $project: {
sizeLargestField: { $size: '$friends.ids' }
}},
{ $sort: {
sizeLargestField: -1
}},
])
The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:
db.users.aggregate([
{ $match: {
bio: { $exists: true }
}},
{ $project: {
sizeLargestField: { $strLenBytes: '$bio' }
}},
{ $sort: {
sizeLargestField: -1
}},
])
You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
1 Technically, the rarely used binData type can also have variable size.
Well.. this is an old question.. but - I thought to share my cent about it
My approach - use Mongo mapReduce function
First - let's get the size for each document
db.myColection.mapReduce
(
function() { emit(this._id, Object.bsonsize(this)) }, // map the result to be an id / size pair for each document
function(key, val) { return val }, // val = document size value (single value for each document)
{
query: {}, // query all documents
out: { inline: 1 } // just return result (don't create a new collection for it)
}
)
This will return all documents sizes although it worth mentioning that saving it as a collection is a better approach (the result is an array of results inside the result field)
Second - let's get the max size of document by manipulating this query
db.metadata.mapReduce
(
function() { emit(0, Object.bsonsize(this))}, // mapping a fake id (0) and use the document size as value
function(key, vals) { return Math.max.apply(Math, vals) }, // use Math.max function to get max value from vals (each val = document size)
{ query: {}, out: { inline: 1 } } // same as first example
)
Which will provide you a single result with value equals to the max document size
In short:
you may want to use the first example and save its output as a collection (change out option to the name of collection you want) and applying further aggregations on it (max size, min size, etc.)
-OR-
you may want to use a single query (the second option) for getting a single stat (min, max, avg, etc.)
If you're working with a huge collection, loading it all at once into memory will not work, since you'll need more RAM than the size of the entire collection for that to work.
Instead, you can process the entire collection in batches using the following package I created:
https://www.npmjs.com/package/mongodb-largest-documents
All you have to do is provide the MongoDB connection string and collection name. The script will output the top X largest documents when it finishes traversing the entire collection in batches.
Inspired by Elad Nana's package, but usable in a MongoDB console :
function biggest(collection, limit=100, sort_delta=100) {
var documents = [];
cursor = collection.find().readPref("nearest");
while (cursor.hasNext()) {
var doc = cursor.next();
var size = Object.bsonsize(doc);
if (documents.length < limit || size > documents[limit-1].size) {
documents.push({ id: doc._id.toString(), size: size });
}
if (documents.length > (limit + sort_delta) || !cursor.hasNext()) {
documents.sort(function (first, second) {
return second.size - first.size;
});
documents = documents.slice(0, limit);
}
}
return documents;
}; biggest(db.collection)
Uses cursor
Gives a list of the limit biggest documents, not just the biggest
Sort & cut output list to limit every sort_delta
Use nearest as read preference (you might also want to use rs.slaveOk() on the connection to be able to list collections if you're on a slave node)
As Xavier Guihot already mentioned, a new $bsonSize aggregation operator was introduced in Mongo 4.4, which can give you the size of the object in bytes. In addition to that just wanted to provide my own example and some stats.
Usage example:
// I had an `orders` collection in the following format
[
{
"uuid": "64178854-8c0f-4791-9e9f-8d6767849bda",
"status": "new",
...
},
{
"uuid": "5145d7f1-e54c-44d9-8c10-ca3ce6f472d6",
"status": "complete",
...
},
...
];
// and I've run the following query to get documents' size
db.getCollection("orders").aggregate(
[
{
$match: { status: "complete" } // pre-filtered only completed orders
},
{
$project: {
uuid: 1,
size: { $bsonSize: "$$ROOT" } // added object size
}
},
{
$sort: { size: -1 }
},
],
{ allowDiskUse: true } // required as I had huge amount of data
);
as a result, I received a list of documents by size in descending order.
Stats:
For the collection of ~3M records and ~70GB size in total, the query above took ~6.5 minutes.

MongoDB MapReduce : use positional operator $ in map function

I have a collection with entries that look like that :
{"userid": 1, "contents": [ { "tag": "whatever", "value": 100 }, {"tag": "whatever2", "value": 110 } ] }
I'm performing a MapReduce on this collection with queries such as {"contents.tag": "whatever"}.
What I'd like to do in my map function is emiting the field "value" corresponding to the entry in the array "contents" that matched the query without having to iterate through the whole array. Under normal circumstances, I could do that using the $ positional operator with something like contents.$.value. But in the MapReduce case, it's not working.
To summarize, here is the code I have right now :`
map=function(){
emit(this.userid, WHAT DO I WRITE HERE TO EMIT THE VALUE I WANT ?);
}
reduce=function(key,values){
return values[0]; //this reduce function does not make sense, just for the example
}
res=db.runCommand(
{
"mapreduce": "collection",
"query": {'contents.tag':'whatever'},
"map": map,
"reduce": reduce,
"out": "test_mr"
}
);`
Any idea ?
Thanks !
This will not work without iterating over the whole array. In MongoDB a query is intended to match an entire document.
When dealing with Map / Reduce, the query is simply trimming the number of documents that are passed into the map function. However, the map function has no knowledge of the query that was run. The two are disconnected.
The source code around the M/R is here.
There is an upcoming aggregation feature that will more closely match this desire. But there's no timeline on this feature.
No way. I've had the same problem. The iterate is necessary.
You could do this:
map=function() {
for(var i in this.contents) {
if(this.contents[i].tag == "whatever") {
emit(this.userid, this.contents[i].value);
}
}
}