I have documents like this:
{
"_id" : "someid",
"name" : "somename",
"action" : "do something",
"date" : ISODate("2011-08-19T09:00:00Z")
}
I want to map reduce them into something like this:
{
"_id" : "someid",
"value" : {
"count" : 100,
"name" : "somename",
"action" : "do something",
"date" : ISODate("2011-08-19T09:00:00Z")
"firstEncounteredDate" : ISODate("2011-07-01T08:00:00Z")
}
}
I want to group the map-reduced documents by "name", "action", and "date". But every document should have this "firstEncounteredDate" containing the earliest "date" (which is actually grouped by "name" and "action" only).
If I group by name, action and date, firstEncounteredDate would always equal date; that's why I'd like to know if there's any way to get "the earliest date" (grouped by "name" and "action" across the entire collection) while doing map-reduce.
How can I do this in map reduce?
Edit: more detail on firstEncounteredDate (courtesy of @beny23)
Seems like a two-pass map-reduce would fit the bill, somewhat akin to this example: http://cookbook.mongodb.org/patterns/unique_items_map_reduce/
In pass #1, group the original "name"x"action"x"date" documents by just "name" and "action", collecting the various "date" values into a "dates" array during reduce. Use a 'finalize' function to find the minimum of the collected dates.
Untested code:
// phase i map function :
function () {
emit( { "name": this.name, "action": this.action } ,
{ "count": 1, "dates": [ this.date ] } );
}
// phase i reduce function :
function( key, values ) {
var result = { count: 0, dates: [ ] };
values.forEach( function( value ) {
result.count += value.count;
result.dates = result.dates.concat( value.dates );
} );
return result;
}
// phase i finalize function :
function( key, reduced_value ) {
var earliest = new Date( Math.min.apply( Math, reduced_value.dates ) );
reduced_value.firstEncounteredDate = earliest ;
return reduced_value;
}
In pass #2, use the documents generated in pass #1 as input. For each "name"x"action" document, emit a new "name"x"action"x"date" document for each collected date, along with the now determined minimum date common to that "name"x"action" pair. Group by "name"x"action"x"date", summing up the count for each individual date during reduce.
Equally untested code:
// phase ii map function :
function() {
var doc = this; // 'this' is not preserved inside the forEach callback
doc.value.dates.forEach( function( d ) {
emit( { "name": doc._id.name, "action": doc._id.action, "date" : d } ,
{ "count": 1, "firstEncounteredDate" : doc.value.firstEncounteredDate } );
} );
}
// phase ii reduce function :
function( key, values ) {
// note: values[i].firstEncounteredDate should all be identical, so ...
var result = { "count": 0,
"firstEncounteredDate": values[0].firstEncounteredDate };
values.forEach( function( value ) {
result.count += value.count;
} );
return result;
}
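For completeness, here is how the two passes might be wired together (untested like the rest; the variable names and collection names are assumptions):
// assumption : the phase i / phase ii functions above are stored in these
// variables, and the original documents live in an 'events' collection
db.events.mapReduce( phase1Map, phase1Reduce,
{ out: "name_action_summary", finalize: phase1Finalize } );
// pass #2 consumes the output collection produced by pass #1
db.name_action_summary.mapReduce( phase2Map, phase2Reduce,
{ out: "name_action_date_counts" } );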
Pass #2 does not do a lot of heavy lifting, obviously -- it's mostly copying each document N times, one for each unique date. We could easily build a map of unique dates to their incidence counts during the reduce step of pass #1. (In fact, if we don't do this, there's no real point in having a "count" field in the values from pass #1.) But doing the second pass is a fairly effortless way of generating a full target collection containing the desired documents.
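For illustration, here is a sketch of that single-pass variant, replacing the "dates" array with a map of date strings to counts (equally untested; keying dates by their ISO string is my assumption, since object keys must be strings):
// phase i map, emitting a { dateString : count } object instead of an array :
function () {
var dc = { };
dc[ this.date.toISOString() ] = 1;
emit( { "name": this.name, "action": this.action } ,
{ "count": 1, "dateCounts": dc } );
}
// phase i reduce, merging the per-date counts :
function( key, values ) {
var result = { count: 0, dateCounts: { } };
values.forEach( function( value ) {
result.count += value.count;
for ( var d in value.dateCounts ) {
result.dateCounts[ d ] = ( result.dateCounts[ d ] || 0 ) + value.dateCounts[ d ];
}
} );
return result;
}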
Related
I have a data structure like this:
We have some centers. A center has some switches. A switch has some ports.
{
"_id" : ObjectId("561ad881755a021904c00fb5"),
"Name" : "center1",
"Switches" : [
{
"Ports" : [
{
"PortNumber" : 2,
"Status" : "Empty"
},
{
"PortNumber" : 5,
"Status" : "Used"
},
{
"PortNumber" : 7,
"Status" : "Used"
}
]
}
]
}
All I want is to write an update query to change the Status of the port whose PortNumber is 5 to "Empty".
I can update it when I know the array index of the port (here array index is 1) with this query:
db.collection.update(
// query
{
_id: ObjectId("561ad881755a021904c00fb5")
},
// update
{
$set : { "Switches.0.Ports.1.Status" : "Empty" }
}
);
But I don't know the array index of that Port.
Thanks for the help.
You would normally do this using the positional operator $, as described in the answer to this question:
Update field in exact element array in MongoDB
Unfortunately, right now the positional operator only supports matching one array level deep.
There is a JIRA ticket for the sort of behavior that you want: https://jira.mongodb.org/browse/SERVER-831
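As an aside, MongoDB 3.6 and later resolve this via the arrayFilters update option together with the $[] and $[&lt;identifier&gt;] positional operators. A sketch against your original structure, in case upgrading is an option:
// MongoDB 3.6+ only : filtered positional update through nested arrays
db.collection.update(
{ "_id": ObjectId("561ad881755a021904c00fb5") },
{ "$set": { "Switches.$[].Ports.$[p].Status": "Empty" } },
{ "arrayFilters": [ { "p.PortNumber": 5 } ] }
)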
If you can make Switches an object instead, you could do something like this:
db.collection.update(
{
_id: ObjectId("561ad881755a021904c00fb5"),
"Switch.Ports.PortNumber": 5
},
{
$set: {
"Switch.Ports.$.Status": "Empty"
}
}
)
Since you don't know the array index of the Port, I would suggest dynamically creating the $set conditions on the fly, i.e. something that gets you the indexes for the objects so that you can modify them accordingly; for that, consider using MapReduce.
Currently this does not seem to be possible using the aggregation framework, and there is an unresolved open JIRA issue linked to it. However, a workaround is possible with MapReduce. The basic idea with MapReduce is that it uses JavaScript as its query language, but this tends to be considerably slower than the aggregation framework and should not be used for real-time data analysis.
In your MapReduce operation, you need to define a couple of steps, i.e. the map step (which applies an operation to every document in the collection, where the operation can either do nothing or emit some object with keys and projected values) and the reduce step (which takes the list of emitted values and reduces it to a single element).
For the map step, you ideally would want to get, for every document in the collection, the index for each of the Switches and Ports array fields, plus another key that contains the $set paths.
Your reduce step would be a function (which does nothing) simply defined as var reduce = function() {};
The final step in your MapReduce operation will then create a separate collection switches that contains the emitted Switches array object along with a field with the $set conditions. This collection can be refreshed periodically by running the MapReduce operation on the original collection.
Altogether, this MapReduce method would look like:
var map = function(){
for(var i = 0; i < this.Switches.length; i++){
for(var j = 0; j < this.Switches[i].Ports.length; j++){
emit(
{
"_id": this._id,
"switch_index": i,
"port_index": j
},
{
"index": j,
"Switches": this.Switches[i],
"Port": this.Switches[i].Ports[j],
"update": {
"PortNumber": "Switches." + i.toString() + ".Ports." + j.toString() + ".PortNumber",
"Status": "Switches." + i.toString() + ".Ports." + j.toString() + ".Status"
}
}
);
}
}
};
var reduce = function(){};
db.centers.mapReduce(
map,
reduce,
{
"out": {
"replace": "switches"
}
}
);
Querying the output collection switches from the MapReduce operation will typically give you the result:
db.switches.findOne()
Sample Output:
{
"_id" : {
"_id" : ObjectId("561ad881755a021904c00fb5"),
"switch_index" : 0,
"port_index" : 1
},
"value" : {
"index" : 1,
"Switches" : {
"Ports" : [
{
"PortNumber" : 2,
"Status" : "Empty"
},
{
"PortNumber" : 5,
"Status" : "Used"
},
{
"PortNumber" : 7,
"Status" : "Used"
}
]
},
"Port" : {
"PortNumber" : 5,
"Status" : "Used"
},
"update" : {
"PortNumber" : "Switches.0.Ports.1.PortNumber",
"Status" : "Switches.0.Ports.1.Status"
}
}
}
You can then use the cursor from the db.switches.find() method to iterate over and update your collection accordingly:
var newStatus = "Empty";
var cur = db.switches.find({ "value.Port.PortNumber": 5 });
// Iterate through results and update using the update query object set dynamically by using the array-index syntax.
while (cur.hasNext()) {
var doc = cur.next();
var update = { "$set": {} };
// set the update query object
update["$set"][doc.value.update.Status] = newStatus;
db.centers.update(
{
"_id": doc._id._id,
"Switches.Ports.PortNumber": 5
},
update
);
};
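As an illustrative sanity check afterwards (hard-coding the indexes known from the sample document):
// the port that had PortNumber 5 should now show "Status" : "Empty"
db.centers.findOne(
{ "_id": ObjectId("561ad881755a021904c00fb5") }
).Switches[0].Ports[1]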
I'm trying to delete an item inside an object that is categorized under multiple keys.
For example, deleting ObjectId("c") from every "items" section.
This is the structure:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : {
"/" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("b")
]
},
"/folder1" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("d")
]
},
"/folder2" : {
"color" : "112233",
"items" : []
},
"/testing" : {
"color" : "112233",
"items" : [
ObjectId("c"),
ObjectId("f")
]
}
}
}
I tried with $pull and $unset, like:
db.getCollection('col').update(
{},
{ $unset: { 'objects.$.items': ObjectId("c") } },
{ multi: true }
)
and
db.getCollection('col').update(
{},
{ "objects": {"items": { $pull: [ObjectId("c")] } } },
{ multi: true }
)
Any ideas? Thanks!
The problem here is largely with the current structure of your document. MongoDB cannot "traverse paths" in an efficient way, and your structure currently has an "Object" ( 'objects' ) which has named "keys". What this means is that accessing "items" within each "key" needs the explicit path to each key to be able to see that element. There are no wildcards here:
db.getCollection("coll").find({ "objects./.items": Object("c") })
And that is just the basic principle of "matching" something, as you cannot do it "across all keys" without resorting to JavaScript code, which is really bad.
Change the structure. Rather than "object keys", use "arrays" instead, like this:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : [
{
"path": "/",
"color" : "#112233",
"items" : [
"c",
"b"
]
},
{
"path": "/folder1",
"color" : "#112233",
"items" : [
"c",
"d"
]
},
{
"path": "/folder2",
"color" : "112233",
"items" : []
},
{
"path": "/testing",
"color" : "112233",
"items" : [
"c",
"f"
]
}
]
}
It's much more flexible in the long run, and also allows you to "index" fields like "path" for use in query matching.
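For instance, a hypothetical compound index over those fields might look like:
db.getCollection("coll").createIndex({ "objects.path": 1, "objects.items": 1 })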
However, it's not going to help you much here, as even with a consistent query path, i.e:
db.getCollection("coll").find({ "objects.items": Object("c") })
Which is better, but the problem still persists that it is not possible to $pull from multiple sources ( whether object or array ) in the same singular operation. And that is doubly true across multiple documents.
So the best you will ever get here is basically "trying" the "multi-update" concept until the options are exhausted and there is nothing left to "update". With the "modified" structure presented, you can do this:
var bulk = db.getCollection("coll").initializeOrderedBulkOp(),
count = 0,
modified = 1;
while ( modified != 0 ) {
bulk.find({ "objects.items": "c"}).update({
"$pull": { "objects.$.items": "c" }
});
count++;
var result = bulk.execute();
bulk = db.getCollection("coll").initializeOrderedBulkOp();
modified = result.nModified;
}
print("iterated: " + count);
That uses the "Bulk" operations API ( actually all shell methods now use it anyway ) to basically get a "better write response" that gives you useful information about what actually happened on the "update" attempt.
The point is that it basically "loops" and tries to match a document based on the "query" portion of the update, and then tries to $pull from the matched array index an item from the "inner array" that matches the conditions given to $pull ( which acts as a "query" in itself, just upon the array items ).
On each iteration you basically get the "nModified" value from the response, and when this is finally 0, then the operation is complete.
On the sample ( restructured ) given then this will take 4 iterations, being one for each "outer" array member. The updates are "multi" as implied by bulk .update() ( as opposed to .updateOne() ) already, and therefore the "maximum" iterations is determined by the "maximum" array elements present in the "outer" array across the whole collection. So if there is "one" document out of "one thousand" that has 20 entries then the iterations will be 20, and just because that document still has something that can be matched and modified.
The alternate case under your current structure does not bear mentioning. It is just plain "impossible" without:
Retrieving the document individually
Extracting the present keys
Running an individual $pull for the array under that key
Get next document, rinse and repeat
So "multi" is "right out" as an option and cannot be done, without some some possible "foreknowledge" of the possible "keys" under the "object" key in the document.
So please "change your structure" and be aware of the general limitations available.
You cannot possibly do this in "one" update, but at least if the maximum "array entries" your document has was "4", then it is better to do "four" updates over a "thousand" documents than the "four thousand" that would be required otherwise.
Also. Please do not "obfuscate" the ObjectId value in posts. People like to "copy/paste" code and data to test for themselves. Using something like ObjectId("c") which is not a valid ObjectId value would clearly cause errors, and therefore is not practical for people to use.
Do what "I did" in the listing, and if you want to abstract/obfuscate, then do it with "plain values" just as I have shown.
One approach that you could take is using JavaScript native methods like reduce to create the documents that will be used in the update.
You essentially need an operation like the following:
var itemId = ObjectId("55ba3a983857192828978fec");
db.col.find().forEach(function(doc) {
var update = {
"object./.items": itemId,
"object./folder1.items": itemId,
"object./folder2.items": itemId,
"object./testing.items": itemId
};
db.col.update(
{ "_id": doc._id },
{
"$pull": update
}
);
})
Thus, to create the update object, use the reduce method to convert the array of key names into an object:
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
Overall, you would need to use the Bulk operations to achieve the above update:
var bulk = db.col.initializeUnorderedBulkOp(),
itemId = ObjectId("55ba3a983857192828978fec"),
count = 0;
db.col.find().forEach(function(doc) {
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
bulk.find({ "_id": doc._id }).updateOne({
"$pull": update
})
count++;
if (count % 1000 == 0) {
bulk.execute();
bulk = db.col.initializeUnorderedBulkOp();
}
})
if (count % 1000 != 0) { bulk.execute(); }
I am trying to aggregate the total sum of packets in this document.
{
"_id" : ObjectId("51a6cd102769c63e65061bda"),
"capture" : "1369885967",
"packets" : {
"0" : "595",
"1" : "596",
"2" : "595",
"3" : "595",
...
}
}
The closest I can get is this:
db.collection.aggregate({ $match: { capture : "1369885967" } }, {$group: { _id:null, sum: {$sum:"$packets"}}});
However it returns sum 0, which is obviously wrong.
{ "result" : [ { "_id" : null, "sum" : 0 } ], "ok" : 1 }
How do I get the sum of all the packets?
Since you have the values in an object instead of an array, you'll need to use mapReduce.
// Emit the values as integers
var mapFunction =
function() {
for (var key in this.packets) {
emit(null, parseInt(this.packets[key]));
}
}
// Reduce to a simple sum
var reduceFunction =
function(key, values) {
return Array.sum(values);
}
> db.collection.mapReduce(mapFunction, reduceFunction, {out: {inline:1}})
{
"results" : [
{
"_id" : null,
"value" : 2381
}
],
"ok" : 1,
}
If at all possible, you should emit the values as an array of a numeric type instead, since that gives you more options (i.e. aggregation) and (unless the data set is large) probably performance benefits.
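For instance, with a hypothetical numeric-array layout such as "packets" : [ 595, 596, 595, 595 ], the original aggregation attempt becomes straightforward:
// assumes packets is stored as an array of numbers rather than an object of strings
db.collection.aggregate(
{ $match: { capture: "1369885967" } },
{ $unwind: "$packets" },
{ $group: { _id: null, sum: { $sum: "$packets" } } }
);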
If you don't know how many keys are in the packets subdocument, and since you also seem to be storing counts as strings (why???), you will have to use mapReduce.
Something like:
m = function() {
for (var f in this.packets) {
emit(null, +this.packets[f]);
}
};
r = function(k, vals) {
var sum = 0;
vals.forEach(function(v) { sum += v; });
return sum;
};
db.collection.mapReduce(m, r, {out: {inline: 1}, query: { /* your query condition here */ }});
I've got a problem. I have data in MongoDB which looks like this:
{"miejscowosci_str":"OneCity", "wojewodztwo":"FirstRegionName", "ZIP-Code" : "...", ...}
{"miejscowosci_str":"TwoCity", "wojewodztwo":"FirstRegionName", "ZIP-Code" : "...", ...}
{"miejscowosci_str":"ThreeCity", "wojewodztwo":"SecondRegionName", "ZIP-Code" : "...", ...}
{"miejscowosci_str":"FourCity", "wojewodztwo":"SecondRegionName", "ZIP-Code" : "...", ...}
and so on
What I want is to list all regions (wojewodztwo) and to compute the average number of zip codes per region. I know how to count all zip codes in a region:
var map = function() {
emit(this.wojewodztwo,1);
};
var reduce = function(key, val) {
var count = 0;
for(i in val) {
count += val[i];
}
return count;
};
db.kodypocztowe.mapReduce(
map,
reduce,
{ out : "result" }
);
But I don't know how to count the number of cities (miejscowosci_str) so that I could divide the number of ZIP codes in a region by the number of cities in the same region.
One city can have multiple zip codes.
Have you got any ideas?
I'm making a couple of assumptions here :
cities can have multiple zip codes
zip codes are unique
you are not trying to get the answer to M101P week 5 questions !
Rather than just counting the cities in one go, why not build up a list of city/zip objects in the map phase and then reduce this to a list of zips and unique cities in the reduce phase. Then you can use the finalize phase to calculate the averages.
Note : if the data set is large you might want to consider using the aggregation framework instead; this is shown after the map/reduce example.
db.kodypocztowe.drop();
db.result.drop();
db.kodypocztowe.insert([
{"miejscowosci_str":"OneCity", "wojewodztwo":"FirstRegionName", "ZIP-Code" : "1"},
{"miejscowosci_str":"TwoCity", "wojewodztwo":"FirstRegionName", "ZIP-Code" : "2"},
{"miejscowosci_str":"ThreeCity", "wojewodztwo":"SecondRegionName", "ZIP-Code" : "3"},
{"miejscowosci_str":"FourCity", "wojewodztwo":"SecondRegionName", "ZIP-Code" : "4"},
{"miejscowosci_str":"FourCity", "wojewodztwo":"SecondRegionName", "ZIP-Code" : "5"},
]);
// map the data to { region : [{citiy : name , zip : code }] }
// Note : a city can be in multiple zips but zips are assumed to be unique
var map = function() {
emit(this.wojewodztwo, {city:this.miejscowosci_str, zip:this['ZIP-Code']});
};
//
// convert the data to :
//
// {region : {cities: [], zips : []}}
//
// note : always add zips
// note : only add cities if they are not already there
//
var reduce = function(key, val) {
var res = {zips:[], cities:[]};
for (var i in val) {
var v = val[i];
// handle re-reduce : v may be an already partially-reduced value
if (v.zips) {
res.zips = res.zips.concat(v.zips);
v.cities.forEach(function(c) {
if (res.cities.indexOf(c) == -1) { res.cities.push(c); }
});
} else {
res.zips.push(v.zip);
if (res.cities.indexOf(v.city) == -1) { res.cities.push(v.city); }
}
}
return res;
};
//
// finalize the data to get the average number of zips / region
var finalize = function(key, res) {
res.average = res.zips.length / res.cities.length;
delete res.cities;
delete res.zips;
return res;
}
print("==============");
print(" map/reduce")
print("==============");
db.kodypocztowe.mapReduce(
map,
reduce,
{ out : "result" , finalize:finalize}
);
db.result.find().pretty()
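For the sample data above, the finalized map/reduce output (and the aggregation below, which projects into the same shape) should come out as:
{ "_id" : "FirstRegionName", "value" : { "average" : 1 } }
{ "_id" : "SecondRegionName", "value" : { "average" : 1.5 } }
That is, 2 zips across 2 cities, and 3 zips across 2 cities, respectively.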
print("==============");
print(" aggregation")
print("==============");
db.kodypocztowe.aggregate( [
// get the number of zips / [region,city]
{ "$group" :
{
_id : {"region" : "$wojewodztwo", city : "$miejscowosci_str"},
zips:{$sum:1}
}
},
// get the number of cities per region and sum the number of zips
{ "$group" :
{
_id : "$_id.region" ,
cities:{$sum:1},
zips:{$sum:"$zips"},
}
},
// project the data into the same format that map/reduce generated
{ "$project" :
{
"value.average":{$divide: ["$zips","$cities"]}
}
}
]);
I hope that helps.
I have a weird problem with MongoDB (2.0.2) map reduce.
So, the story goes like this:
I have an Ad model (see the model source extract below) and I need to group up to n ads per category in order to have a nice ordered listing I can later use to do more interesting things.
# encoding: utf-8
class Ad
include Mongoid::Document
cache
include Mongoid::Timestamps
field :title
field :slug, :unique => true
def self.aggregate_latest_active_per_category
map = "function () {
emit( this.category, { id: this._id });
}"
reduce = "function ( key, value ) {
return { ads: value };
}"
self.collection.map_reduce(map, reduce, { :out => "categories"} )
end
All fun and games up until now.
What I expect is to get a result in a form which resembles (mongo shell for db.categories.findOne() ):
{
"_id" : "category_name",
"value" : {
"ads" : [
{
"id" : ObjectId("4f2970e9e815f825a30014ab")
},
{
"id" : ObjectId("4f2970e9e815f825a30014b0")
},
{
"id" : ObjectId("4f2970e9e815f825a30014b6")
},
{
"id" : ObjectId("4f2970e9e815f825a30014b8")
},
{
"id" : ObjectId("4f2970e9e815f825a30014bd")
},
{
"id" : ObjectId("4f2970e9e815f825a30014c1")
},
{
"id" : ObjectId("4f2970e9e815f825a30014ca")
},
// ... and it goes on and on
]
}
}
Actually, it would be even better if I could get "value" to contain only the array, but MongoDB complains about not supporting that yet; with later use of a finalize function, though, that is not the big problem I want to ask about.
Now, back to the problem. What actually happens when I do map reduce is that it spits out something like:
{
"_id" : "category_name",
"value" : {
"ads" : [
{
"ads" : [
{
"ads" : [
{
"ads" : [
{
"ads" : [
{
"id" : ObjectId("4f2970d8e815f825a3000011")
},
{
"id" : ObjectId("4f2970d8e815f825a3000017")
},
{
"id" : ObjectId("4f2970d8e815f825a3000019")
},
{
"id" : ObjectId("4f2970d8e815f825a3000022")
},
// ... on and on and on
... and while I could probably work out a way to use this, it just doesn't look like something I should be getting.
So, my questions (finally) are:
Am I doing something wrong and what is it?
Is there something wrong with MongoDB map reduce (I mean, besides all the usual things when compared to Hadoop)?
Yes, you're doing it wrong. Inputs and outputs of map and reduce should be uniform, because they are meant to be executed in parallel, and reduce might be run over partially reduced results. Try these functions:
var map = function() {
emit(this.category, {ads: [this._id]});
};
var reduce = function(key, values) {
var result = {ads: []};
values.forEach(function(v) {
v.ads.forEach(function(a) {
result.ads.push(a)
});
});
return result;
}
This should produce documents like:
{_id: category, value: {ads: [ObjectId("4f2970d8e815f825a3000011"),
ObjectId("4f2970d8e815f825a3000019"),
...]}}
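Since the original goal was to keep up to n ads per category, a finalize function can trim the array once reduction is complete. A sketch (the collection name "ads" and the cap of 10 are assumptions; note that map/reduce gives no ordering guarantee inside the array, so "latest" still has to be handled upstream, e.g. via a sort or a query filter):
var finalize = function(key, reduced) {
// keep at most n ads per category (10 is a placeholder for n)
reduced.ads = reduced.ads.slice(0, 10);
return reduced;
};
db.ads.mapReduce(map, reduce, {out: "categories", finalize: finalize});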