I am currently doing some basic mapReduce with MongoDB.
I currently have data that looks like this:
db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...
I want to create a mapReduce function to group data by gender and show:
- Total number of each gender (male and female)
- Average weight of each gender
I can create a map / reduce function to carry out each task separately, I just can't get my head around how to show output for both. I am guessing that since the grouping is based on gender, the map function should stay the same and I just need to alter something in the reduce section...
Work so far
var map1 = function() {
    var key = this.gender;
    emit(key, { count: 1 });
};

var reduce1 = function(key, values) {
    var sum = 0;
    values.forEach(function(value) { sum += value["count"]; });
    return { count: sum };
};
db.football_team.mapReduce(map1, reduce1, {out: "gender_stats"});
Output
db.football_team.find()
{"_id" : "f", "value" : {"count": 12} }
{"_id" : "m", "value" : {"count": 18} }
Thanks
The key rule of "map/reduce" in any implementation is that the same shape of data must be emitted by the mapper as is returned by the reducer. The reason for this lies in how "map/reduce" conceptually works: the reducer may quite possibly be called multiple times, which means your reducer function can be called on output that was already returned from a previous pass through the reducer, along with other data from the mapper.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
That said, your best approach to "average" is therefore to total the data along with a count, and then simply divide the two. This actually adds another step to a "map/reduce" operation as a finalize function.
db.football_team.mapReduce(
    // mapper
    function() {
        emit(this.gender, { count: 1, weight: this.weight });
    },
    // reducer
    function(key, values) {
        var output = { count: 0, weight: 0 };
        values.forEach(value => {
            output.count += value.count;
            output.weight += value.weight;
        });
        return output;
    },
    // options and finalize
    {
        "out": "gender_stats",  // or { "inline": 1 } if you don't need another collection
        "finalize": function(key, value) {
            value.avg_weight = value.weight / value.count;  // take an average
            delete value.weight;                            // optionally remove the unwanted key
            return value;
        }
    }
)
All fine, because both mapper and reducer are emitting data in the same shape, and the reducer itself expects its input in that shape. The finalize method is of course just invoked after all "reducing" is finally done, and simply processes each result.
As noted though, the aggregate() method actually does this far more effectively, and with native coded methods which do not incur the overhead (and potential security risks) of server-side JavaScript interpretation and execution:
db.football_team.aggregate([
    { "$group": {
        "_id": "$gender",
        "count": { "$sum": 1 },
        "avg_weight": { "$avg": "$weight" }
    }}
])
And that's basically it. Moreover you can actually continue and do other things after a $group pipeline stage ( or any stage for that matter ) in ways that you cannot do with a MongoDB mapReduce implementation. Notably something like applying a $sort to the results:
db.football_team.aggregate([
    { "$group": {
        "_id": "$gender",
        "count": { "$sum": 1 },
        "avg_weight": { "$avg": "$weight" }
    }},
    { "$sort": { "avg_weight": -1 } }
])
The only sorting allowed by mapReduce is that the key used with emit is always sorted in ascending order. You cannot sort the aggregated output in any other way without either performing queries against the output written to another collection, or working "in memory" with the results returned from the server.
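For completeness, a minimal sketch of that workaround, assuming the output was written to the "gender_stats" collection as above (the "value.avg_weight" path reflects the finalized output shape):

// query the mapReduce output collection and sort it there,
// since mapReduce itself only sorts by the emitted key
db.gender_stats.find().sort({ "value.avg_weight": -1 })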
As a "side note" ( though an important one ), you probably should also consider in "learning" that the reality is the "server-side JavaScript" functionality of MongoDB is really a work-around more than being a feature. When MongoDB was first introduced, it applied a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.
Thus to make up for the lack of the complete implementation of many query operators and aggregation functions which would come later, adding a JavaScript engine was a "quick fix" to allow certain things to be done with minimal implementation.
The result over the years is that those JavaScript engine features are gradually being removed. The group() function of the API has been removed. The eval() function of the API is deprecated and scheduled for removal in the next major version. The writing is basically "on the wall" for the limited future of these JavaScript-on-the-server features, as the clear pattern is that where the native features provide support for something, the need to continue supporting the JavaScript engine goes away.
The core wisdom here is that focusing on learning these JavaScript-on-the-server features is probably not worth the time invested, unless you have a pressing use case that currently cannot be solved by any other means.
Related
Use Case:
I've got a mongodb collection with a couple million documents. Documents in this collection must be updated sometimes. Therefore I've set up a monitorFrequency field which defines the interval at which a specific document must be updated: every 6, 12, 24 or 720 hours. Additionally I set up a field called lastRefreshAt which is a timestamp of the last actual update.
The problem:
How can I select all documents from my collection profiles which need to be refreshed again (because lastRefreshAt is further in the past than the monitorFrequency interval allows)?
Should I run a single query which would only return those documents which need to be refreshed, or should I rather iterate over all documents with a cursor and check in my node application whether each document needs to be refreshed or not?
I would know how to do approach #2, but I am not sure which approach to choose and how the query for #1 would look.
There are a couple of approaches depending on available architecture and choices. Some are good choices and some are bad, but we might as well explain them all.
Use $where with multi-update
As a first option to examine, you could use $where to calculate the difference for selection and feed directly to .update() or .updateMany() for that matter:
db.profiles.update(
    {
        "$where": function() {
            return (Date.now() - this.lastRefreshAt.valueOf())
                > ( this.monitorFrequency * 1000 * 60 * 60 );
        }
    },
    { "$currentDate": { "lastRefreshAt": true } },
    { "multi": true }
)
This pretty simply works out the millisecond difference between the stored "lastRefreshAt" value and the current Date, and compares that to the stored "monitorFrequency" converted into milliseconds itself.
The $currentDate is applied because it is a "multi" update applied to all matched documents, so this ensures the "server timestamp" at the actual time of the document update is applied to the document.
It's not fantastic as it does require a full collection scan in order to select the documents via calculation and thus cannot use an index. Plus it's JavaScript evaluation, which not being native code does add some overhead.
Loop the matched selection
So JavaScript is not that great a selection option in general when other options apply. Instead try using the aggregation framework for the calculation and loop the cursor result:
var ops = [];

db.profiles.aggregate([
    { "$redact": {
        "$cond": {
            "if": {
                "$gt": [
                    { "$subtract": [ new Date(), "$lastRefreshAt" ] },
                    { "$multiply": [ "$monitorFrequency", 1000 * 60 * 60 ] }
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
]).forEach(doc => {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$currentDate": { "lastRefreshAt": true } }
        }
    });

    // send the batch to the server once it grows large enough
    if ( ops.length > 1000 ) {
        db.profiles.bulkWrite(ops);
        ops = [];
    }
});

// flush any remaining queued operations
if ( ops.length > 0 ) {
    db.profiles.bulkWrite(ops);
    ops = [];
}
So again that's a collection scan due to the calculation, but it is done with native operators, so that part at least should be a bit faster. Also, from a technical standpoint, it's a little different because new Date() is established once at the time of the request, and not per document iterated as it would be using $where. Lacking an operator to produce the "current date" internally, there is no way for the aggregation framework to do this per iteration.
And of course, instead of just applying our "update" expression as it matches documents, we are looping the result cursor and applying a function. So whilst there are "some" gains, there is also additional overhead. Mileage may vary as to performance and practicality.
Parallel Updates
Personally I would do neither of the above and simply run a query selecting each marked "monitorFrequency" and looking for the dates between the boundaries that exceed the allowed difference.
As a simple example using NodeJS to implement Promise.all() for parallel calls:
const MongoClient = require('mongodb').MongoClient;

const oneHour = 1000 * 60 * 60;

(async function() {

    let db;

    try {
        db = await MongoClient.connect('mongodb://localhost/test');
        let collection = db.collection('profiles');

        let intervals = [6, 12, 24, 720];
        let snapDate = new Date();

        await Promise.all(
            intervals.map( (monitorFrequency, i) =>
                collection.updateMany(
                    {
                        monitorFrequency,
                        // dates between the boundary for this interval and the
                        // boundary for the next larger interval; the largest
                        // interval has no lower bound
                        "lastRefreshAt": Object.assign(
                            { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
                            (i < intervals.length - 1) ?
                                { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                                : {}
                        )
                    },
                    { "$currentDate": { "lastRefreshAt": true } }
                )
            )
        );

    } catch(e) {
        console.error(e);
    } finally {
        db.close();
    }

})();
This would allow you to index on the two fields for optimal selection, and since the "date ranges" are paired to their calculated difference from "monitorFrequency", only those documents that actually "require refresh" get selected for update.
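For reference, a minimal sketch of the compound index that would support these selections (the exact field order here is an assumption; any index covering both fields would serve):

// compound index so each { monitorFrequency, lastRefreshAt } range
// selection can be satisfied with an index scan
db.profiles.createIndex({ "monitorFrequency": 1, "lastRefreshAt": 1 })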
Given the finite number of possible intervals, this is what I would suspect to be the most optimal solution. But the construction, along with the fact that the actual "update" portion remains consistent for each selection, leads to one other option.
Use $or for each selection.
Much the same logic as above, but instead applied to build an $or condition for the "query" portion of a "single" update. It is an "array of criteria" after all, which is essentially the same as the "array of queries" we are running above. So just turn it around a little:
const oneHour = 1000 * 60 * 60;

let intervals = [6, 12, 24, 720];
let snapDate = new Date();

db.profiles.updateMany(
    {
        "$or": intervals.map( (monitorFrequency, i) =>
            ({
                monitorFrequency,
                "lastRefreshAt": Object.assign(
                    { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
                    (i < intervals.length - 1) ?
                        { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                        : {}
                )
            })
        )
    },
    { "$currentDate": { "lastRefreshAt": true } }
)
This then becomes one simple statement, and of course can actually use indexes where available. Generally this is what you should be doing, though as I have suggested, my intuition tells me that four threads of execution constrained only by the slowest one get the job done slightly faster. Again, mileage may vary on that, but logic dictates that this is so.
So the basic lesson here is that whilst you may think the logical approach is to calculate the values and compare within the database itself, it's actually the worst possible thing you can do for query performance.
The simple approach is to work out the criteria that should select the documents you want before you issue the query statement to the server. This means you are comparing against "concrete values" rather than "calculation results". And "concrete values" can actually be indexed, which is generally what you want for database queries.
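As a quick sanity check, here is a minimal sketch (assuming the compound index suggested earlier exists) of verifying that the $or selection can use it, via explain() on the equivalent query:

// the winning plan should show IXSCAN stages on
// { monitorFrequency: 1, lastRefreshAt: 1 } rather than a COLLSCAN
db.profiles.find({
    "$or": [ /* the same interval criteria built above */ ]
}).explain()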
I have this particular scenario where I have to update a certain value in MongoDB depending on different attributes present in the same document. So I am trying to use findAndUpdate with the $where operator, which will be passed a JavaScript function, and I will also be using one of the attributes as find criteria. But it has been mentioned in the MongoDB documentation that one should not use the $where operator unless it cannot be avoided, because of performance issues.
Now let's say I have 3 attributes id, counter1, counter2 in my document, and I am updating counter1 by 1 only when counter1 + counter2 == 2. So I will be writing something like:
db.mydb.findAndUpdate(
    { "_id": id, "$where": function() {
        return this.counter1 + this.counter2 == 2;
    }},
    { "$inc": { "counter1": 1 } }
)
Now my question is:
Will this particular approach create any performance issues, given that I am using id as another, non-$where criterion to search for the document?
Or should I have another attribute in the mydb collection, say sumCounter, which stores the sum of counter1 and counter2?
So the main catch with $where evaluation is that the conditional logic cannot use an "index" in order to filter out matches. In addition, it is JavaScript logic after all, which needs to be compiled, and there needs to be "object translation" from the native forms into something that will work with the evaluation in the JavaScript engine.
So its use should be "very sparing" and only when "absolutely" required, as in there is no other practical way. In your case this is an "update" operation, therefore if you need that logic then fine. If it were just a "query", then I would say to use $redact in the aggregation framework instead:
db.mydb.aggregate([
    { "$match": { "_id": id } },
    { "$redact": {
        "$cond": {
            "if": {
                "$eq": [
                    { "$add": [ "$counter1", "$counter2" ] },
                    2
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
])
That is at least all native operators, and is therefore going to work faster than JavaScript.
As for "performance", then it is all relative. But however in your case where _id is a "unique" lookup, then the actual performance "hit" should be negligible as the "exact match" was already done on the "index" for the primary key.
This is the general advice for $where conditions. In that you "use them" generally in conjuction with other native query operators that do the "bulk" of the filtering. Then if it takes a few more CPU cycles to apply the conditions in your JavaScript logic ( and it is absolutely needed since there is no other way ), then so be it.
But if however your JavaScript based condition needs to scan many documents without the assistance of other filtering, then that is bad indeed.
I'm working on a Meteor application and I have data for a weekly timetable of the format -
events:
  event:
    day: "Monday"
    time: "9am"
    location: "A"
  event:
    day: "Monday"
    time: "10am"
    location: "B"
There are numerous entries for each day. Can I run a query that will return an object of the format -
day: "Monday"
events:
event:
time: "9am"
location: "A"
event:
time: "10am"
location: "B"
I could store the object in the second format but prefer the first for ease of deleting and updating individual events.
I also want to return them ordered by day of week if there's a nice way to do that.
Several options:
You can use an aggregation command but be warned: you will lose reactivity. It means that except when you reload your template, you will not get external updates. You would also need to use a package to add the aggregation command to Meteor in order to achieve that.
My personal favorite: you don't need to aggregate (and lose reactivity) to achieve your data transformation. You can use a simple Collection.find() query and extend/reduce/modify it using a clever mix of cursor.observe() and conditional modifications. Have a look at this answer, it did the trick for me (I needed a sum with blacklisting of some fields, but you can easily adapt it to your group/sorting case): https://stackoverflow.com/a/30813050/3793161. Comment if you need more details on this.
If you plan to have several servers, be warned that each server will have to observe, so it may lead to unnecessary load. So my third solution is to use either collection hooks or methods to update an additional field containing every event for each day/user (whatever you need). See @David Weldon's answer about that here: https://stackoverflow.com/a/31190896/3793161. In your case, it would probably mean rethinking your database structure to fit your needs (i.e. adding more fields so you can update them on insert).
EDIT: Here are some thoughts on your comment:
If you stick to what you described in the question, you would need seven documents, one per day, with an events field where you put all the events. My second solution is good if you need to rework a collection structure before sending it. However, in your case, you just need an object week with 7 sub-objects matching the days of the week.
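For illustration, one of those per-day documents might look like this (a hypothetical shape, based on the two formats shown in the question):

// one document per weekday, holding all that day's events
{
    day: "Monday",
    events: [
        { time: "9am",  location: "A" },
        { time: "10am", location: "B" }
    ]
}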
I advise two possible approaches:
Use the aggregation in a method, as described by @chridam. Be warned that you will not be able to directly get a sorted array, as stated in the mongo $group documentation:
$group does not order its output documents
So you need to sort them (by day, and by hour within each day) using, for example, _.sortBy() before you return your method result. By the way, if you want to know what is going on in your method call client-side, here is how you should write the call:
Meteor.call("getGroupedDailyEvents", userId, function(error, result){
if(error){
console.log("error", error);
}
if(result){
//do whatever you need
}
});
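For completeness, a minimal sketch of the server-side sorting mentioned above, run inside the method just before returning result (the dayOrder table is a hypothetical helper, not part of the original answer):

// hypothetical helper: order results by weekday, then by hour within each day
var dayOrder = { "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
                 "Friday": 5, "Saturday": 6, "Sunday": 7 };

var sorted = _.sortBy(result, function(dayDoc) { return dayOrder[dayDoc.day]; })
    .map(function(dayDoc) {
        // assumes the "time" values sort correctly as strings
        dayDoc.events = _.sortBy(dayDoc.events, "time");
        return dayDoc;
    });
return sorted;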
Make the data sorting client-side. You are looking at an overkill solution otherwise because, afaik, you don't need to filter any data to keep it from the user, and you are going to send the data anyway (just with another structure). It is much easier to make a simple helper in your template like this:
Template.displaySchedule.helpers({
    "monday_events": function() {
        return _.sortBy(Events.find({ day: "Monday" }).fetch(), "time");
    },
    // add other days
});
This assumes the format of your time field is sortable this way. If not, you just need to create a function to sort them according to their format, or change the original format into something better suited.
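For instance, a minimal sketch of such a sort key, assuming times like "9am" / "10am" / "1pm" (the parsing logic here is illustrative, not from the original answer):

// hypothetical helper: convert "9am" / "10am" / "1pm" into a sortable hour number
function timeKey(t) {
    var hour = parseInt(t, 10) % 12;   // "12am" -> 0, "12pm" -> 0 before the offset
    if (/pm$/i.test(t)) hour += 12;    // afternoon times sort after morning ones
    return hour;
}

// usage:
// _.sortBy(Events.find({ day: "Monday" }).fetch(), function(e) { return timeKey(e.time); })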
The rest (HTML) would just be to iterate over the Monday events using a {{#each monday_events}}.
To achieve the desired result, use the aggregation framework, where the $group pipeline operator groups all the input documents and applies the accumulator expression $push to each group to get the events array.
Your pipeline would look like this:
var Events = new Mongo.Collection('events');

var pipeline = [
    {
        "$group": {
            "_id": "$day",
            "events": {
                "$push": {
                    "time": "$time",
                    "location": "$location"
                }
            }
        }
    },
    {
        "$project": {
            "_id": 0, "day": "$_id", "events": 1
        }
    }
];

var result = Events.aggregate(pipeline);
console.log(result);
You can add the meteorhacks:aggregate package to implement the aggregation in Meteor:
Add to your app with
meteor add meteorhacks:aggregate
Since this package exposes the .aggregate method on Mongo.Collection instances, you can define a method that returns the aggregated result array. For example:
if (Meteor.isServer) {
    var Events = new Mongo.Collection('events');

    Meteor.methods({
        getGroupedDailyEvents: function () {
            var pipeline = [
                {
                    "$group": {
                        "_id": "$day",
                        "events": {
                            "$push": {
                                "time": "$time",
                                "location": "$location"
                            }
                        }
                    }
                },
                {
                    "$project": {
                        "_id": 0, "day": "$_id", "events": 1
                    }
                }
            ];
            var result = Events.aggregate(pipeline);
            return result;
        }
    });
}

if (Meteor.isClient) {
    Meteor.call('getGroupedDailyEvents', logResult);

    function logResult(err, res) {
        console.log("Result: ", res);
    }
}
When I perform a mapReduce operation over a MongoDB collection with a small number of documents, everything goes OK.
But when I run it on a collection with about 140,000 documents, I get some strange results:
Map function:
function() { emit(this.featureType, this._id); }
Reduce function:
function(key, values) { return { count: values.length, ids: values }; }
As a result, I would expect something like (for each mapping key):
{
"_id": "FEATURE_TYPE_A",
"value": { "count": 140000,
"ids": [ "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5", .......}
But instead I get this strange document structure:
{
"_id": "FEATURE_TYPE_A",
"value": {
"count": 706,
"ids": [
{
"count": 101,
"ids": [
{
"count": 100,
"ids": [
"9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5".....}
Could someone explain to me whether this is the expected behavior, or am I doing something wrong?
Thanks in advance!
The case here is unusual, and I'm not sure if this is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
What that basically says is that your current operation expects the "reduce" function to be called only once, but this is not the case. The input will in fact be "broken up" and passed in here in manageable sizes. The multiple calling of "reduce" now makes another point very important.
Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:
the type of the return object must be identical to the type of the value emitted by the map function.
Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result. Essentially making sure that the output for the "mapper" is sent in the same form as how it will appear in the "reducer" and the reduce process itself is mindful of this.
So first the mapper revised:
function () { emit(this.featureType, { count: 1, ids: [this._id] }); }
This is now consistent with the final output form, which is important when considering the reducer that you now know will be invoked multiple times:
function (key, values) {
    var ids = [];
    var count = 0;

    values.forEach(function(value) {
        count += value.count;
        value.ids.forEach(function(id) {
            ids.push( id );
        });
    });

    return { count: count, ids: ids };
}
What this means is that each invocation of the reduce function expects the same input as it outputs: a count field and an array of ids. This gets to the final result by essentially:
- Reducing one chunk of results: #chunk1
- Reducing another chunk of results: #chunk2
- Combining the reduce output of the reduced chunks, #chunk1 and #chunk2
That may not seem immediately apparent, but the behavior is by design where the reducer gets called many times in this way to process large sets of emitted data, so it gradually "aggregates" rather than in one big step.
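To make the "gradual aggregation" concrete, here is a small illustrative sketch in plain JavaScript, assuming the reducer above has been assigned to a variable named reduce:

// two "chunks" of mapper output for the same key
var chunk1 = [ { count: 1, ids: ["a"] }, { count: 1, ids: ["b"] } ];
var chunk2 = [ { count: 1, ids: ["c"] } ];

// reducing in two passes, feeding the first result back in...
var partial = reduce("FEATURE_TYPE_A", chunk1);                   // { count: 2, ids: ["a","b"] }
var twoPass = reduce("FEATURE_TYPE_A", [partial].concat(chunk2)); // { count: 3, ids: ["a","b","c"] }

// ...produces the same shape and totals as reducing everything at once
var onePass = reduce("FEATURE_TYPE_A", chunk1.concat(chunk2));    // { count: 3, ids: ["a","b","c"] }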
The aggregation framework makes this a lot more straightforward, where from MongoDB 2.6 and upwards the results can even be output to a collection, so if you had more than one result and the combined output was greater than 16MB then this would not be a problem.
db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 },
        "ids": { "$push": "$_id" }
    }},
    { "$out": "outputCollection" }
])
So that will not break and will actually return as expected, with the complexity greatly reduced as the operation is indeed very straightforward.
But I have already said that your purpose in returning the array of "_id" values here seems unclear given the sheer size. So if all you really wanted was a count by "featureType", then you would use basically the same approach, rather than trying to force mapReduce into finding the length of a very large array:
db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 }
    }}
])
In either form, though, the results will be correct, and will be returned in a fraction of the time that the mapReduce operation as constructed will take.
I was wondering if anyone knows how to sort a mongodb find() result by string length.
I have tried something like db.foo.find().sort({"item.length": -1}) but that obviously doesn't work. Can somebody help me, and also suggest a way to do the same thing in pymongo?
There are a lot of things (and basic API) I would personally love to see in the aggregation framework, such as:
Math functions:
- log (as in logarithm)
- ceil
- floor
Array:
- sum
String:
- length
Just to name a few.
And that is without resorting to obscure usages of the $mod operator or other means in such cases as "ceil" and "floor". But I digress.
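For the curious, a minimal sketch of the kind of "obscure usage" meant here, emulating floor for non-negative values by subtracting the remainder of division by 1 (the "price" field is hypothetical, and this is an illustrative workaround, not an official API):

// floor(price) for non-negative values: x - (x mod 1) strips the fractional part
db.collection.aggregate([
    { "$project": {
        "floorPrice": { "$subtract": [ "$price", { "$mod": [ "$price", 1 ] } ] }
    }}
])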
Your "string length" falls into this category. Raise a JIRA issue about it. But for now you you can use mapReduce and the existing JavaScript functionality:
db.collection.mapReduce(
    function() {
        emit( this.item.length, this.item );
    },
    function(key, values) {
        return values;
    },
    { "out": { "inline": 1 } }
)
So while that does have the "mapReduce" funky style of returning a re-shaped document, with everything matching the same length collected into an array, what it does do is take advantage of the nature of "map/reduce" (not just restricted to MongoDB): the emitted "key" value is always sorted in the response.
There is now a solution for this in MongoDB v3.4+ via the aggregation framework, using $strLenBytes. Given the following document:
{_id: 0, name: "Bob"}
We can use
db.mycollection.aggregate([{
$project: {
byteLength: {$strLenBytes: "$name"}
}
}])
Which will return 3 for the number of bytes.
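Since the original question was about sorting, here is a minimal sketch extending this with a $sort stage (the byteLength field name is carried over from the example above):

// project the length, then sort by it descending
db.mycollection.aggregate([
    { $project: { name: 1, byteLength: { $strLenBytes: "$name" } } },
    { $sort: { byteLength: -1 } }
])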
No, it is actually not possible.
I was dealing with a similar problem; what I did was to store the string length of every object as a property of the object itself. This bypassed the problem.
If you think this should be implemented (I do), I recommend you upvote the issue in JIRA, which, for some reason, has not received many votes:
https://jira.mongodb.org/browse/SERVER-5319