Run map reduce for all keys in collections - mongodb

I am using map reduce in MongoDB to find the number of orders for a customer, like this:
db.order.mapReduce(
    function() {
        emit(this.customer, { count: 1 });
    },
    function(key, values) {
        var sum = 0;
        values.forEach(
            function(value) {
                sum += value['count'];
            }
        );
        return { count: sum };
    },
    {
        query: { customer: ObjectId("552623e7e4b0cade517f9714") },
        out: "order_total"
    }
).find()
which gives me output like this:
{ "_id" : ObjectId("552623e7e4b0cade517f9714"), "value" : { "count" : 13 } }
Currently this works for a single customer, which is the key. Now I want to run this map reduce query for all customers in the order collection, and output the results for all of them in the same form as this single output. Is there any way to do this for all customers in order?

Using a map/reduce for that simple task is a bit like using a (comparatively slow) sledgehammer to crack a nut. The aggregation framework was basically invented for this kind of simple aggregation (and can do a lot more for you!):
db.order.aggregate([
    { "$group": { "_id": "$customer", "orders": { "$sum": 1 } } },
    { "$out": "order_total" }
])
Depending on your use case, you can even omit the $out stage and consume the results directly.
> db.orders.aggregate([{ "$group":{ "_id":"$customer", "orders":{ "$sum": 1 }}}])
{ "_id" : "b", "orders" : 2 }
{ "_id" : "a", "orders" : 3 }
Note that with very large collections this most likely is not suitable, as it may take a while (but it should still be faster than a map/reduce operation).
For finding the number of orders of a single customer, you can use a simple query and use the cursor.count() method:
> db.orders.find({ "customer": "a" }).count()
3
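As a side note not in the original answer: on MongoDB 4.0 and later, cursor.count() is deprecated, and the collection-level countDocuments() helper is the preferred way to get the same number:
db.orders.countDocuments({ "customer": "a" })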

Related

MongoDB return latest full document for each id (Full document Object containing all fields like sub document arrays etc) [duplicate]

I want to get the last document for each station, with all the other fields:
{
    "_id" : ObjectId("535f5d074f075c37fff4cc74"),
    "station" : "OR",
    "t" : 86,
    "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
    "_id" : ObjectId("535f5d114f075c37fff4cc75"),
    "station" : "OR",
    "t" : 82,
    "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
    "_id" : ObjectId("535f5d364f075c37fff4cc76"),
    "station" : "WA",
    "t" : 79,
    "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
I need to have t and station for the latest dt per station.
With the aggregation framework:
db.temperature.aggregate([
    { $sort: { "dt": 1 } },
    { $group: { "_id": "$station", result: { $last: "$dt" }, t: { $last: "$t" } } }
])
returns
{
    "result" : [
        {
            "_id" : "WA",
            "result" : ISODate("2014-04-29T08:02:57.165Z"),
            "t" : 79
        },
        {
            "_id" : "OR",
            "result" : ISODate("2014-04-29T08:02:57.165Z"),
            "t" : 82
        }
    ],
    "ok" : 1
}
Is this the most efficient way to do that ?
Thanks
To directly answer your question, yes it is the most efficient way. But I do think we need to clarify why this is so.
As was suggested in the alternatives, the one thing people look at is "sorting" your results before passing them to a $group stage, and what they are looking at is the "timestamp" value. You would want to make sure that everything is in "timestamp" order, hence the form:
db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$first": "$dt" },
        "t": { "$first": "$t" }
    }}
])
And as stated you will of course want an index to reflect that in order to make the sort efficient:
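The index definition itself was omitted from the original answer; to match the sort above it would presumably be:
db.temperature.ensureIndex({ "station": 1, "dt": -1 })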
However, and this is the real point: what seems to have been overlooked by others (if not by yourself) is that all of this data is likely being inserted already in time order, in that each reading is recorded as it is added.
So the beauty of this is that the _id field (with a default ObjectId) is already in "timestamp" order, as it actually contains a time value itself, and this makes the following statement possible:
db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" }
    }}
])
And it is faster. Why? Well, you don't need to select an index (additional code to invoke), and you also don't need to "load" the index in addition to the document.
We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also "range" query on the _id values, which is equally valid for selecting between two dates.
The only real thing to say here is that in "real world" usage, it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation, as opposed to getting the "first" and "last" _id values to define a "range", or something similar in your actual usage.
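As a rough sketch of that "range on _id between two dates" idea (the helper function here is made up, but the construction is standard, since the first four bytes of a default ObjectId are a Unix timestamp in seconds):
// Hypothetical helper: build an ObjectId whose embedded timestamp
// corresponds to the given Date, with the remaining bytes zeroed.
function objectIdFromDate(d) {
    return ObjectId(
        Math.floor(d.getTime() / 1000).toString(16) + "0000000000000000"
    );
}
db.temperature.aggregate([
    // Only consider documents inserted within the date range
    { "$match": {
        "_id": {
            "$gte": objectIdFromDate(new Date("2014-04-01")),
            "$lt": objectIdFromDate(new Date("2014-05-01"))
        }
    }},
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" }
    }}
])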
So where is the proof of this? Well it is fairly easy to reproduce, so I just did so by generating some sample data:
var stations = [
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
"NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
"OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
"VA", "WA", "WV", "WI", "WY"
];
for ( var i = 0; i < 200000; i++ ) {
    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
    var dt = new Date();
    db.temperature.insert({
        station: station,
        t: t,
        dt: dt
    });
}
On my hardware (8GB laptop with spinny disk, which is not stellar, but certainly adequate) running each form of the statement clearly shows a notable pause with the version using an index and a sort ( same keys on index as the sort statement). It is only a minor pause, but the difference is significant enough to notice.
Even looking at the explain output (version 2.6 and up, though it is actually there in 2.4.9, just not documented) you can see the difference: though the $sort is optimized out due to the presence of an index, the time taken appears to go to index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.
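For reference, and not from the original answer: on 2.6 and up the shell can request the plan directly by passing the explain option to aggregate():
db.temperature.aggregate(
    [
        { "$sort": { "station": 1, "dt": -1 } },
        { "$group": {
            "_id": "$station",
            "result": { "$first": "$dt" },
            "t": { "$first": "$t" }
        }}
    ],
    { "explain": true }
)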
Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.
So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real world mileage may vary on whether this is practical for you or not and it might simply end up being more convenient to implement the index and sorting on the date.
But if you were happy with using _id ranges, or "greater than" the "last" _id in your query, then perhaps one tweak is in order to get the values along with your results, so you can in fact store and use that information in successive queries:
db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt": ObjectId("536076603e70a99790b7845d") }
    }},
    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
])
And if you were actually "following on" the results like that then you can determine the maximum value of ObjectId from your results and use it in the next query.
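A minimal sketch of that "following on" pattern, assuming a 2.6+ shell where aggregate() returns a cursor (the variable names are made up):
// Run the grouping and keep the results client-side.
var result = db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
]).toArray();
// Determine the highest _id seen, to use as the $gt bound next time.
var maxId = result.reduce(function(max, doc) {
    return (max === null || doc.lastDoc.str > max.str) ? doc.lastDoc : max;
}, null);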
Anyhow, have fun playing with that, but again Yes, in this case that query is the fastest way.
An index is all you really need:
db.temperature.ensureIndex({ 'station': 1, 'dt': 1 })
db.temperature.distinct('station').forEach(function(s) {
    printjson(db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1).next());
});
of course using whatever syntax is actually valid for your language.
Edit: You are correct that a loop like this incurs a round-trip per station, and it's great for a few stations, and not so good for 1000. You do still want the compound index on station+dt, though, and to take advantage of a descending sort:
db.temperature.aggregate([
    { $sort: { station: 1, dt: -1 } },
    { $group: { _id: "$station", result: { $first: "$dt" }, t: { $first: "$t" } } }
])
As far as the aggregation query you've posted, I'd make certain that you have an index on dt:
db.temperature.ensureIndex({'dt': 1 })
This will make certain that the $sort at the beginning of the aggregation pipeline is as efficient as possible.
Whether or not this is the most efficient way to get this data, vs. a query in a loop, will likely be a function of how many data points you have. In the beginning, with "thousands of stations" and perhaps hundreds of thousands of data points, I'd think the aggregation approach will be faster.
However, as you add more and more data an issue is that the aggregation query will continue to touch all the documents. This will get increasingly expensive as you scale up to millions or more documents. One approach for that case would be to add a $limit right after the $sort to limit the total number of documents being considered. That's a bit hacky and inexact but it would help to limit the total number of documents that need to be accessed.
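A sketch of that capped form, where the 5000 is an arbitrary made-up cutoff; only the most recent readings are considered, so a station whose last reading falls outside the cap is silently dropped:
db.temperature.aggregate([
    { $sort: { dt: -1 } },
    // Arbitrary cap on documents considered; inexact but bounds the work.
    { $limit: 5000 },
    { $group: { _id: "$station", result: { $first: "$dt" }, t: { $first: "$t" } } }
])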

Counting entries of subdocument in MongoDB documents

I have a document structure like so
{
    "_id" : "3:/content/somepath/test.txt",
    "_revisions" : {
        "r152f47f1daf-0-2" : "c",
        "r152f48413c1-0-2" : "c",
        "r152f4851bf7-0-1" : "c"
    }
}
My task is to find all documents with the following conditions:
The "_id" needs to start with "5:"
The number of revisions needs to be strictly greater than 3
The first part is easy; I have solved it with
db.nodes.find( {'_id': /^5:/} )
But I am struggling with the second part; I am supposed to use $gt.
Since I am new to MongoDB, I was first looking at $size, but _revisions is not an array, it is a subdocument, right?
I was also looking at $unwind and then counting the results, but that does not make sense either, since my result needs to be the documents that match the above two conditions.
Any pointers highly appreciated.
You can use the $where operator:
db.nodes.find(function() {
    return (/^5:/.test(this._id) && Object.keys(this._revisions).length > 3);
})
The problem with this as mentioned in the documentation is that:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
You should definitely consider changing the _revisions field to an array of sub-documents like this:
{
    "_id" : "3:/content/somepath/test.txt",
    "_revisions" : [
        {
            "rev": "r152f47f1daf-0-2",
            "value": "c"
        },
        {
            "rev": "r152f48413c1-0-2",
            "value": "c"
        },
        {
            "rev": "r152f4851bf7-0-1",
            "value": "c"
        }
    ]
}
Then use the $exists operator; "_revisions.3" matches documents that have an element at array index 3, i.e. more than 3 revisions:
db.nodes.find({ "_id": /^5:/, "_revisions.3": { "$exists": true } })
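As an aside not covered by the original answer: if changing the schema is not an option, MongoDB 3.6+ can count the keys of the subdocument natively by combining $expr with $objectToArray and $size:
db.nodes.find({
    "_id": /^5:/,
    // Convert the subdocument to an array of { k, v } pairs and count them
    "$expr": { "$gt": [ { "$size": { "$objectToArray": "$_revisions" } }, 3 ] }
})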

Mongo find query for longest arrays inside object

I currently have objects in mongo set up like this for my application (simplified example, I removed some irrelevant fields for clarity here):
{
    "_id" : ObjectId("529159af5b508dd71500000a"),
    "c" : "somecontent",
    "l" : [
        {
            "d" : "2013-11-24T01:43:11.367Z",
            "u" : "User1"
        },
        {
            "d" : "2013-11-24T01:43:51.206Z",
            "u" : "User2"
        }
    ]
}
What I would like to do is run a find query to return the objects which have the highest array length under "l" and sort highest->lowest, limit to 25 results. Some objects may have 1 object in the array, some may have 100. I'd like to find out which ones have the most under "l". I'm new to mongo and got everything else to work up until this point, but I just can't figure out the right parameters to get this specific query. Where I'm getting confused is how to handle counting the length of the array, sorting, etc. I could manually code this by parsing everything in the collection, but I'm sure there has to be a way for mongo to do this far more efficiently. I'm not against learning, if anyone knows any resources for more advanced queries or could help me out I'd really be thankful as this is the last piece! :-)
As a side note, node.js and mongo together is amazing and I wish I started using them in conjunction a long time ago.
Use the aggregation framework. Here's how:
db.collection.aggregate([
    { $unwind : "$l" },
    { $group : { _id : "$_id", len : { $sum : 1 } } },
    { $sort : { len : -1 } },
    { $limit : 25 }
])
There is no easy way to do this with your existing schema. The reason is that there is nothing in MongoDB's query language to sort by the length of an array. Yes, you have the $size query operator, but the way it works is just to find all arrays of a specific length.
So you cannot sort your find query by the length of the array. The only reasonable way out is to add an additional field to your schema which holds the length of the array (you will have something like "l_length" : 3 in addition to your other fields in every document). The good thing is that you can populate it easily by looking at this relevant answer, and after that you just need to make sure to increment or decrement this value whenever you modify the array.
Once you add this field, you can easily sort by it, and moreover you can take advantage of indexes.
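A minimal sketch of maintaining such a counter (the field name l_length is made up), keeping it in step with the array on every update:
// Push a new entry and bump the counter atomically in the same update.
db.collection.update(
    { "_id": ObjectId("529159af5b508dd71500000a") },
    {
        "$push": { "l": { "d": new Date().toISOString(), "u": "User3" } },
        "$inc": { "l_length": 1 }
    }
)
// The counter can then be sorted on, and indexed.
db.collection.find().sort({ "l_length": -1 }).limit(25)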
There is no direct approach to do this, but you can compute a size field inside the pipeline using $size:
$addFields to add a new field total holding the number of elements in the l array
$sort by total in descending order
$limit to take the top 25 documents
$project to remove the total field if you don't need it
db.collection.aggregate([
    { $addFields: { total: { $size: "$l" } } },
    { $sort: { total: -1 } },
    { $limit: 25 }
    // { $project: { total: 0 } }
])

MongoDB, MapReduce and sorting

I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.
Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (eg. popularity) and then group it by an id. The only way I see I can do this is through Mongo's MapReduce functionality.
I can't use .group() because I'm working with more than 10,000 keys and I also need to be able to sort the dataset.
My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.
db.runCommand({
    'mapreduce': 'products',
    'map': function() {
        emit({
            product_id: this.product_id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': { category_id: 20 },
    'out': { inline: 1 },
    'sort': { popularity: -1 }
});
I already have a descending index on the popularity datapoint, so it's definitely not failing for lack of one:
{
    "v" : 1,
    "key" : { "popularity" : -1 },
    "ns" : "app.products",
    "name" : "popularity_-1"
}
I just cannot figure out why it doesn't want to sort.
I can't output the result set to another collection and then run a .find().sort({popularity: -1}) on that instead of inlining it, because of the way this feature is going to work.
First of all, Mongo's map/reduce is not designed to be used as a query tool (as it is in CouchDB); it is designed for running background tasks. I use it at work to analyze traffic data.
What you are doing wrong, however, is applying the sort() to your input; it is useless, because when the map() stage is done the intermediate documents are sorted by key. Because your key is a document, they are being sorted by product_id, popularity.
This is how I generated my dataset:
function generate_dummy_data() {
    for (var i = 2; i < 1000000; i++) {
        db.foobar.save({
            _id: i,
            category_id: parseInt(Math.random() * 30),
            popularity: parseInt(Math.random() * 50)
        })
    }
}
And this my map/reduce task:
var data = db.runCommand({
    'mapreduce': 'foobar',
    'map': function() {
        emit({
            sorting: this.popularity * -1,
            product_id: this._id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': { category_id: 20 },
    'out': { inline: 1 }
});
And this is the end result (too long to paste here):
http://cesarodas.com/results.txt
This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting however you like; just remember that the final sorting is by key, regardless of how your input is sorted.
Anyway, as I said before, you should avoid doing queries with Map/Reduce; it was designed for background processing. If I were you, I would design my data in such a way that I could access it with simple queries; there is always a trade-off, in this case more complex inserts/updates in exchange for simple queries (that's how I see MongoDB).
As noted in discussion on the original question:
Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include relying on the emitted key order (see #crodas's answer); outputting to a collection and querying that collection with sort order; or sorting the results in your application using something like usort().
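For example, a rough sketch of the "output to a collection" workaround (the collection name product_popularity is made up):
db.runCommand({
    'mapreduce': 'products',
    'map': function() {
        emit(this.product_id, this.popularity);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) { sum += v; });
        return sum;
    },
    'query': { category_id: 20 },
    'out': 'product_popularity'
});
// The output collection supports ordinary sorted queries:
db.product_popularity.find().sort({ 'value': -1 });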
OP's preference is for inline results rather than creating/deleting temporary collections.
The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.
Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:
db.products.aggregate([
    { $match: { category_id: 20 } },
    { $group : {
        _id : "$product_id",
        'popularity' : { $sum : "$popularity" }
    }},
    { $sort: { 'popularity': -1 } }
])
...and sample output:
{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}

Using stored JavaScript functions in the Aggregation pipeline, MapReduce or runCommand

Is there a way to use a user-defined function saved with db.system.js.save(...) in the aggregation pipeline or mapReduce?
Any function you save to system.js is available for use by "JavaScript" processing statements such as the $where operator and mapReduce, and can be referenced by the _id value it was assigned.
db.system.js.save({
    "_id": "squareThis",
    "value": function(a) { return a*a }
})
And some data inserted into the "sample" collection:
{ "_id" : ObjectId("55aafd2bacbed38e06f9eccf"), "a" : 1 }
{ "_id" : ObjectId("55aafea6acbed38e06f9ecd0"), "a" : 2 }
{ "_id" : ObjectId("55aafeabacbed38e06f9ecd1"), "a" : 3 }
Then:
db.sample.mapReduce(
    function() {
        emit(null, squareThis(this.a));
    },
    function(key, values) {
        return Array.sum(values);
    },
    { "out": { "inline": 1 } }
);
Gives:
"results" : [
{
"_id" : null,
"value" : 14
}
],
Or with $where:
db.sample.find(function() { return squareThis(this.a) == 9 })
{ "_id" : ObjectId("55aafeabacbed38e06f9ecd1"), "a" : 3 }
But in "neither" case can you use globals such as the database db reference or other functions. Both the $where and mapReduce documentation contain information on the limits of what you can do here. So if you thought you were going to do something like "look up data in another collection", then you can forget it, because it is "Not Allowed".
Every MongoDB command action is actually a call to a "runCommand" action "under the hood" anyway. But unless what that command is actually doing is "calling a JavaScript processing engine", the usage of stored functions becomes irrelevant. There are only a few commands anyway that do this: mapReduce, group or eval, and of course the find operations with $where.
The aggregation framework does not use JavaScript in any way at all. You might be misreading, just as others have done, a statement like this, which does not do what you think it does:
db.sample.aggregate([
    { "$match": {
        "a": { "$in": db.sample.distinct("a") }
    }}
])
So that is "not running inside" the aggregation pipeline; rather, the "result" of that .distinct() call is "evaluated" before the pipeline is sent to the server, much as is done with an external variable anyway:
var items = [1,2,3];
db.sample.aggregate([
    { "$match": {
        "a": { "$in": items }
    }}
])
Both essentially send to the server in the same way:
db.sample.aggregate([
    { "$match": {
        "a": { "$in": [1,2,3] }
    }}
])
So it is "not possible" to "call" any JavaScript function in the aggregation pipeline, nor is there really any point in "passing in" results in general from something saved in system.js. The "code" needs to be "loaded to the client", and only a JavaScript engine can actually do anything with it.
With the aggregation framework, all of the "operators" available are actually natively coded functions as opposed to the "free form" JavaScript interpretation provided for mapReduce. So instead of writing "JavaScript", you use the operators themselves:
db.sample.aggregate([
    { "$group": {
        "_id": null,
        "squared": { "$sum": {
            "$multiply": [ "$a", "$a" ]
        }}
    }}
])
{ "_id" : null, "squared" : 14 }
So there are limitations on what you can do with functions saved in system.js, and the chances are that what you want to do is either:
Not allowed, such as accessing data from another collection
Not really required as the logic is generally self contained anyway
Or probably better implemented in client logic or other different form anyway
Just about the only practical use I can really think of is that you have a number of "mapReduce" operations that cannot be done any other way and you have various "shared" functions that you would rather just store on the server than maintain within every mapReduce function call.
But then again, the 90% reason for mapReduce over the aggregation framework is usually that the "document structure" of the collections has been poorly chosen and the JavaScript functionality is "required" to traverse the document for search and analysis.
So you can use it under the allowed constraints, but in most cases you probably should not be using this at all, but fixing the other issues that caused you to believe you needed this feature in the first place.