In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
    "_id" : ObjectId("4e57908c7a044a30dc03a888"),
    "path" : "/videos/1/show_invisibles.m4v",
    "issued_at" : ISODate("2011-04-08T00:00:00Z"),
    "status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "value" : {
        "count" : 6,
        "paths" : {
            "/videos/1/show_invisibles.m4v" : {
                "count" : 2
            },
            "/videos/1/show_invisibles.ogv" : {
                "count" : 3
            },
            "/videos/6/buffers_listed_and_hidden.ogv" : {
                "count" : 1
            }
        }
    }
}
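A map/reduce pair of roughly this shape would produce such output (a sketch only; the question doesn't include the actual functions, so the per-path accumulation here is an assumption):

var map = function() {
    // One emit per receipt, keyed by day
    var paths = {};
    paths[this.path] = { count: 1 };
    emit(this.issued_at, { count: 1, paths: paths });
};

var reduce = function(key, values) {
    // Returns the same shape as map emits, so re-reduce is safe
    var result = { count: 0, paths: {} };
    values.forEach(function(v) {
        result.count += v.count;
        for (var p in v.paths) {
            if (!result.paths[p]) {
                result.paths[p] = { count: 0 };
            }
            result.paths[p].count += v.paths[p].count;
        }
    });
    return result;
};

db.receipts.mapReduce(map, reduce, { out: "daily_hits_by_path" });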
How can I make the output look like this instead:
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "count" : 6,
    "paths" : {
        "/videos/1/show_invisibles.m4v" : {
            "count" : 2
        },
        "/videos/1/show_invisibles.ogv" : {
            "count" : 3
        },
        "/videos/6/buffers_listed_and_hidden.ogv" : {
            "count" : 1
        }
    }
}

It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.

Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
    db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."

AFAIK, by design Mongo's map-reduce spits results out in "value tuples", and I haven't seen anything that can configure that "output format". Maybe the finalize() method could be used.
You could try running a post-process that will reshape the data using
results.find({}).forEach(function(result) {
    results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths});
});
Yep, that looks ugly. I know.
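On newer servers the same post-processing can stay server-side. A sketch assuming MongoDB 4.4+, where $merge is allowed to write back into the collection being aggregated:

db.daily_hits_by_path.aggregate([
    // Promote the fields of "value" up next to _id
    { $replaceRoot: { newRoot: { $mergeObjects: [ { _id: "$_id" }, "$value" ] } } },
    // Overwrite each document in place, matching on _id
    { $merge: { into: "daily_hits_by_path", whenMatched: "replace" } }
]);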

You can do Dan's code with a collection reference:
function clean(collection) {
    collection.find().forEach(function(result) {
        var value = result.value;
        delete value._id;
        collection.update({_id: result._id}, value);
        collection.update({_id: result._id}, {$unset: {value: 1}});
    });
}

A similar approach to that of @ljonas, but with no need to hardcode the document fields:
db.results.find().forEach(function(result) {
    var value = result.value;
    delete value._id;
    db.results.update({_id: result._id}, value);
    db.results.update({_id: result._id}, {$unset: {value: 1}});
});

All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection = function(dbName, collectionName) {
    var collection = db.getSiblingDB(dbName)[collectionName];

    var i = 0;
    var bulk = collection.initializeUnorderedBulkOp();
    // addOption(16) is DBQuery.Option.noTimeout, so the cursor survives a long pass
    collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
        print((++i));
        //collection.update({_id: result._id}, result.value);
        bulk.find({_id: result._id}).replaceOne(result.value);

        if (i % 1000 == 0) {
            print("Executing bulk...");
            bulk.execute();
            bulk = collection.initializeUnorderedBulkOp();
        }
    });
    bulk.execute();
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
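For what it's worth, newer shells deprecate the Bulk API used above in favor of bulkWrite. An equivalent sketch, assuming MongoDB 3.2+ (the function name here is mine):

var flattenWithBulkWrite = function(dbName, collectionName) {
    var collection = db.getSiblingDB(dbName)[collectionName];
    var ops = [];
    collection.find({ value: { $exists: true } }).forEach(function(result) {
        ops.push({ replaceOne: { filter: { _id: result._id }, replacement: result.value } });
        // Send one batch per 1000 documents
        if (ops.length === 1000) {
            collection.bulkWrite(ops, { ordered: false });
            ops = [];
        }
    });
    // Flush the final partial batch
    if (ops.length > 0) {
        collection.bulkWrite(ops, { ordered: false });
    }
};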

While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a forEach loop, the update moves the document to the end of the collection, so the cursor can reach the same document again. This can be circumvented by using $snapshot. Hence, I am providing a Java example below.
// Requires: com.mongodb.Block, com.mongodb.client.model.ReplaceOneModel,
// com.mongodb.client.model.WriteModel, org.bson.Document, java.util.ArrayList, java.util.List
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within forEach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true))
          .forEach(new Block<Document>() {
    @Override
    public void apply(final Document document) {
        // Note that I used incrementing long values for '_id'. Change to String if
        // you used string '_id's.
        long docId = document.getLong("_id");
        Document subDoc = (Document) document.get("value");
        WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
        bulkUpdate.add(m);
        // If you used non-incrementing '_id's, then you need to use a final object with a counter.
        if (docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
            collection.bulkWrite(bulkUpdate);
            bulkUpdate.clear();
        }
    }
});
// Flush the remaining models (this also fixes the bug in Vincent's answer).
if (!bulkUpdate.isEmpty()) {
    collection.bulkWrite(bulkUpdate);
    bulkUpdate.clear();
}
Note: This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.

Related

Why is mongo dot notation replacing an entire subdocument?

I've got the following doc in my db:
{
    "_id": ObjectId("ABCDEFG12345"),
    "options" : {
        "foo": "bar",
        "another": "something"
    },
    "date" : {
        "created": 1234567890,
        "updated": 0
    }
}
And I want to update options.foo and date.updated at the same time using dot notation, like so:
var mongojs = require('mongojs');

var optionName = 'foo';
var optionValue = 'baz';
var updates = {};
updates['options.' + optionName] = optionValue;
updates['date.updated'] = new Date().getTime();

db.myCollection.findAndModify({
    query : {
        _id : ObjectId('ABCDEFG12345')
    },
    update : {
        $set : updates
    },
    upsert : false,
    new : true
}, function(error, doc, result) {
    console.log(doc.options);
    console.log(doc.date);
});
And this results in:
{
    foo : 'baz',
    another : 'something'
}
{
    updated : 1234567890
}
Specifically, my pre-existing date.created field is getting clobbered even though I'm using dot notation.
Why is this only partially working? The options sub-document retains its pre-existing data (options.another); why doesn't the date sub-document retain its pre-existing data?
The behavior described typically happens when the object passed to the $set operator has the form { "date" : { "updated" : 1234567890 } } rather than { "date.updated" : 1234567890 }, but I'm not familiar enough with dots in JavaScript to tell whether that could be the cause on the JS side.
Also, that wouldn't explain why it happens with date and not with options.
If you could print the object stored in the updates variable, the one sent to MongoDB in the update field, that would make it possible to tell which side the issue is on (JS or MongoDB).
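To illustrate the distinction being suspected here, a minimal sketch against the document above (the _id is the question's placeholder):

var ts = new Date().getTime();

// Dot notation: only date.updated is touched; date.created survives
db.myCollection.update(
    { _id : ObjectId('ABCDEFG12345') },   // placeholder _id from the question
    { $set : { 'date.updated' : ts } }
);

// Nested object: the whole "date" subdocument is replaced, clobbering date.created
db.myCollection.update(
    { _id : ObjectId('ABCDEFG12345') },
    { $set : { date : { updated : ts } } }
);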
I ported your code to a test environment and used the same library you are using. With the mongojs library, querying by a native ObjectId looks like mongojs.ObjectId("####"); see the official documentation.
As for the callback of the findAndModify function, the docs parameter is an array, so I navigate it like an array.
Note: to concatenate the strings I use template literals (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals).
Everything works fine...
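A sketch of what that test might have looked like; the connection details are placeholders, and the array-style callback handling follows the description above:

var mongojs = require('mongojs');
var db = mongojs('mydb', ['myCollection']);   // placeholder connection string

var optionName = 'foo';
var updates = {};
updates[`options.${optionName}`] = 'baz';     // template literal concatenation
updates['date.updated'] = new Date().getTime();

db.myCollection.findAndModify({
    query : { _id : mongojs.ObjectId('ABCDEFG12345') },  // placeholder id from the question
    update : { $set : updates },
    new : true
}, function(error, docs) {
    // docs is navigated as an array here
    console.log(docs[0].options);
    console.log(docs[0].date);
});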

Searching with dynamic field name in MongoDB

I have a situation where records in Mongo DB are like:
{
    "_id" : "xxxx",
    "_class" : "xxxx",
    "orgId" : xxx,
    "targetKeyToOrgIdMap" : {
        "46784_56139542ecaa34c13ba9e314" : 46784,
        "47530_562f1bc5fc1c1831d38d1900" : 47530,
        "700004280_56c18369fc1cde1e2a017afc" : 700004280
    }
}
I have to find the records where the child nodes of targetKeyToOrgIdMap have a particular value. That is, I know what value will be in the "46784_56139542ecaa34c13ba9e314" : 46784 part of the record, but the field name is variable: a combination of the value and some random string.
In the above example, I have 46784, and I need to find all the records which have 46784 in that respective field.
Is there any way, using a regex or any other means, to get the records that have the value I need in the child nodes of the targetKeyToOrgIdMap field?
Thanks in advance
You could use MongoDB's $where like this:
db.myCollection.find({ $where: function() {
    // inside $where, "this" is the current document ("obj" was a legacy alias)
    for (var key in this.targetKeyToOrgIdMap) {
        if (this.targetKeyToOrgIdMap[key] == 46784) {
            return true;
        }
    }
    return false;
}}).forEach(function(obj) {
    printjson(obj);
});
But be aware that this requires a full collection scan, with the function executed for each document. See documentation.
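If the JavaScript execution is a concern, newer servers can express this without $where. A sketch assuming MongoDB 3.4.4+, where $objectToArray is available (the stage layout and field name here are mine):

db.myCollection.aggregate([
    // Turn the map into an array of { k: <field name>, v: <value> } pairs
    { $addFields: { pairs: { $objectToArray: "$targetKeyToOrgIdMap" } } },
    // Match on the value regardless of the dynamic field name
    { $match: { "pairs.v": 46784 } },
    // Drop the helper field from the output
    { $project: { pairs: 0 } }
]);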

Use MongoDB aggregation to find set intersection of two sets within the same document

I'm trying to use the Mongo aggregation framework to find records that have different unique sets within the same document. An example will best explain this:
Here is a document that is not my real data, but conceptually the same:
db.house.insert(
    {
        houseId : 123,
        rooms : [ { name : 'bedroom',
                    owns : [
                        { name : 'bed' },
                        { name : 'cabinet' }
                    ] },
                  { name : 'kitchen',
                    owns : [
                        { name : 'sink' },
                        { name : 'cabinet' }
                    ] } ],
        uses : [ { name : 'sink' },
                 { name : 'cabinet' },
                 { name : 'bed' },
                 { name : 'sofa' } ]
    }
)
Notice that there are two hierarchies with similar items. It is also possible to use items that are not owned. I want to find documents like this one: where there is a house that uses something that it doesn't own.
So far I've built up the structure using the aggregation framework as below. This gets me to two sets of distinct items. However, I haven't been able to find anything that could give me the result of a set intersection. Note that a simple count of set size will not work, because of cases like ['couch', 'cabinet'] compared to ['sofa', 'cabinet']: same size, different items.
{'$unwind':'$uses'}
{'$unwind':'$rooms'}
{'$unwind':'$rooms.owns'}
{'$group' : {_id:'$houseId',
use:{'$addToSet':'$uses.name'},
own:{'$addToSet':'$rooms.owns.name'}}}
produces:
{ _id : 123,
use : ['sink', 'cabinet', 'bed', 'sofa'],
own : ['bed', 'cabinet', 'sink']
}
How do I then find the set intersection of use and own in the next stage of the pipeline?
You were not very far from the full solution with the aggregation framework - you needed one more thing before the $group step: something that lets you see whether each thing being used matches up with something that is owned.
Here is the full pipeline
> db.house.aggregate(
    {'$unwind':'$uses'},
    {'$unwind':'$rooms'},
    {'$unwind':'$rooms.owns'},
    {$project: { _id:0,
                 houseId:1,
                 uses:"$uses.name",
                 isOkay:{$cond:[{$eq:["$uses.name","$rooms.owns.name"]}, 1, 0]}
               }
    },
    {$group: { _id:{house:"$houseId",item:"$uses"},
               hasWhatHeUses:{$sum:"$isOkay"}
             }
    },
    {$match:{hasWhatHeUses:0}})
and its output on your document
{
    "result" : [
        {
            "_id" : {
                "house" : 123,
                "item" : "sofa"
            },
            "hasWhatHeUses" : 0
        }
    ],
    "ok" : 1
}
Explanation - once you unwind both arrays, you want to flag the elements where a used item is equal to an owned item and give them a non-zero "score". Now when you regroup things by houseId, you can check whether any used items didn't get a match. Using 1 and 0 for the score lets you take a sum, and a sum of 0 for an item means it was used but didn't match anything in "owned". Hope you enjoyed this!
So here is a solution not using the aggregation framework. This uses the $where operator and JavaScript. It feels much clunkier to me, but it seems to work, so I wanted to put it out there in case anyone else comes across this question.
db.houses.find({'$where':
    function() {
        // "this" is the current document ("obj" was a legacy alias for it)
        var ownSet = {};
        var useSet = {};
        for (var i = 0; i < this.uses.length; i++) {
            useSet[this.uses[i].name] = true;
        }
        for (var i = 0; i < this.rooms.length; i++) {
            var room = this.rooms[i];
            for (var j = 0; j < room.owns.length; j++) {
                ownSet[room.owns[j].name] = true;
            }
        }
        // owned but never used
        for (var prop in ownSet) {
            if (ownSet.hasOwnProperty(prop) && !useSet[prop]) {
                return true;
            }
        }
        // used but never owned
        for (var prop in useSet) {
            if (useSet.hasOwnProperty(prop) && !ownSet[prop]) {
                return true;
            }
        }
        return false;
    }
})
For MongoDB 2.6+ Only
As of MongoDB 2.6, there are set operations available in the project pipeline stage. The way to answer this problem with the new operations is:
db.house.aggregate([
    {'$unwind':'$uses'},
    {'$unwind':'$rooms'},
    {'$unwind':'$rooms.owns'},
    {'$group' : {_id:'$houseId',
                 use:{'$addToSet':'$uses.name'},
                 own:{'$addToSet':'$rooms.owns.name'}}},
    {'$project': {int:{$setIntersection:["$use","$own"]}}}
]);
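And since the original goal was the items a house uses but doesn't own, $setDifference (also 2.6+) can give those directly; a variation on the same pipeline:

db.house.aggregate([
    {'$unwind':'$uses'},
    {'$unwind':'$rooms'},
    {'$unwind':'$rooms.owns'},
    {'$group' : {_id:'$houseId',
                 use:{'$addToSet':'$uses.name'},
                 own:{'$addToSet':'$rooms.owns.name'}}},
    // items in "use" that have no match in "own"
    {'$project': {missing:{$setDifference:["$use","$own"]}}},
    // keep only houses where something is used but not owned
    {'$match': {missing:{$ne:[]}}}
]);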

Why is the result of a reduce function fed back into reduce using mongodb mapreduce

I'm seeing a perplexing behavior using mongo to perform progressive map reduce tasks. The input collection is a large set of documents containing:
{_id: , url: 'some url from my swanky site'}
Here's my simple map function:
map: function() {
emit(this.url, {count: 1, id: this._id});
}
And the reduce (with lots of debug printing for the logs shown below):
reduce: function (key, values) {
    var count = 0;
    var lastId = null;
    var first = null;
    if (typeof values[0].id == "undefined") {
        print("bad id");
        printjson(key);
        printjson(values[0]);
        return null;
    } else {
        print("good id");
        printjson(key);
        printjson(values[0]);
    }

    first = ObjectId(values[0].id).getTimestamp();
    values.forEach(function(v) {
        count += v.count;
        last = ObjectId(v.id).getTimestamp();
        lastId = v.id;
    });

    return {
        count: count,
        first: first,
        last: lastId,
        lastCounted: lastId
    };
}
Here's how I call mapreduce:
mrparams.out = {reduce: this.output};
mrparams.limit = 100;
mrparams.query = {'_id': {'$gt': mongoId(lastId.toHexString())}};
mrparams.finalize = null;
mrdb.mapReduce(this.map, this.reduce, mrparams, function(d) {
    console.log("Finished mr", d);
    callback();
});
This is run cron-style: at every interval, the job processes limit records, beginning with the record after the lastId the previous run ended on.
Very basic incremental map reduce stuff...
But, when I run it, I am seeing the return values of the reduce method being passed back into the reduce method. Here's a snapshot of the logs:
XXXgood id
"http://www.nytimes.com/2013/04/23/technology/germany-fines-google-over-data-collection.html"
{ "count" : 1, "id" : ObjectId("5175a065b25f029a1d0927e6") }
good id
"http://www.nytimes.com/2013/04/23/world/middleeast/israel-hagel-iran.html"
{ "count" : 1, "id" : ObjectId("5175a065d7f115dd41097df6") }
good id
"http://www.nytimes.com/interactive/2013/04/22/sports/boston-moment.html"
{ "count" : 1, "id" : ObjectId("5175a0657c9c963654094d25") }
YYYThu Jun 20 11:42:11 [conn19938] query vox.system.indexes query: { ns: "vox.tmp.mr.pi_analytics_spark_trending_inventories_6667_inc" } nreturned:1 reslen:131 0ms
Thu Jun 20 11:42:11 [conn19938] query
vox.tmp.mr.pi_analytics_spark_trending_inventories_6667 nreturned:9 reslen:1716 0ms
ZZZbad id
"http://www.nytimes.com/2013/04/22/business/comedy-central-to-host-comedy-festival-on-twitter.html"
{
"count" : 2,
"first" : ISODate("2013-04-22T20:41:11Z"),
"last" : ObjectId("5175a067b25f029a1d092802"),
"lastCounted" : ObjectId("5175a067b25f029a1d092802")
}
bad id
"http://www.nytimes.com/2013/04/22/business/media/in-boston-cnn-stumbles-in-rush-to-break-news.html"
{
"count" : 7,
"first" : ISODate("2013-04-22T20:41:09Z"),
"last" : ObjectId("5175a067d7f115dd41097e3c"),
"lastCounted" : ObjectId("5175a067d7f115dd41097e3c")
}
XXX - a bunch of records emitted from my map function (containing a value with count and id)
YYY - some sort of mongo event that I'm not familiar with
ZZZ - after the event, reduce gets called with the output of former reduce jobs...
TL;DR: when I run map reduce, the reducing goes fine until a mongo process runs; then I start seeing the return values of previous reduce calls passed into my reduce function.
Any idea why/how this is possible?
Running mongo 2.0.6
Thanks in advance
I figured out the situation. When putting the output of a map reduce job into a collection that already exists, mongo will pass both the newly reduced document and the document that was already in the output collection with the same key back through the reduce function.
This works seamlessly IF you have a consistent format for the value that you emit from map and the value that you return from reduce.
This is not well documented at all, but now that I have figured it out my frustration has transubstantiated into a feeling of smarts. Painful lesson learned. Good times ahead.
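In other words, the fix is to make reduce return exactly the shape that map emits, so that a document already in the output collection can safely pass back through reduce. A sketch of what that could look like for the code above (the first/lastCounted handling is an assumption):

map: function() {
    // Emit the same fields that reduce will return
    emit(this.url, { count: 1, first: this._id, lastCounted: this._id });
}

reduce: function(key, values) {
    // Fold everything into the first value; the result has the same shape
    // as an emitted value, so re-reduce is safe
    var result = values[0];
    for (var i = 1; i < values.length; i++) {
        result.count += values[i].count;
        if (values[i].first.str < result.first.str) {
            result.first = values[i].first;
        }
        if (values[i].lastCounted.str > result.lastCounted.str) {
            result.lastCounted = values[i].lastCounted;
        }
    }
    return result;
}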

Ordering fields from find query with projection

I have a Mongo find query that works well to extract specific fields from a large document like...
db.profiles.find(
    { "profile.ModelID" : 'LZ241M4' },
    {
        _id : 0,
        "profile.ModelID" : 1,
        "profile.AVersion" : 2,
        "profile.SVersion" : 3
    }
);
...this produces the following output. Note how the SVersion comes before the AVersion in the document even though my projection asked for AVersion before SVersion.
{ "profile" : { "ModelID" : "LZ241M4", "SVersion" : "3.5", "AVersion" : "4.0.3" } }
{ "profile" : { "ModelID" : "LZ241M4", "SVersion" : "4.0", "AVersion" : "4.0.3" } }
...the problem is that I want the output to be...
{ "profile" : { "ModelID" : "LZ241M4", "AVersion" : "4.0.3", "SVersion" : "3.5" } }
{ "profile" : { "ModelID" : "LZ241M4", "AVersion" : "4.0.3", "SVersion" : "4.0" } }
What do I have to do get the Mongo JavaScript shell to present the results of my query in the field order that I specify?
I have achieved it by projecting the fields using aliases, instead of including and excluding by 0 and 1s.
Try this:
{
    _id : 0,
    "profile.ModelID" : "$profile.ModelID",
    "profile.AVersion" : "$profile.AVersion",
    "profile.SVersion" : "$profile.SVersion"
}
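The answer doesn't show the full command; presumably the alias projection runs in an aggregation $project stage (my assumption), along the lines of:

db.profiles.aggregate([
    { $match : { "profile.ModelID" : 'LZ241M4' } },
    // Computed fields appear in the order given here
    { $project : {
        _id : 0,
        "profile.ModelID" : "$profile.ModelID",
        "profile.AVersion" : "$profile.AVersion",
        "profile.SVersion" : "$profile.SVersion"
    } }
]);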
I get it now. You want to return results ordered by field name rather than by the value of a field.
The simple answer is that you can't do this. Maybe it's possible with the new aggregation framework, but that seems overkill just to order fields.
The second object in a find query is for including or excluding returned fields, not for ordering them.
{
    _id : 0,                 // 0 means exclude this field from results
    "profile.ModelID" : 1,   // 1 means include this field in the results
    "profile.AVersion" : 2,  // 2 means nothing
    "profile.SVersion" : 3   // 3 means nothing
}
Last point: you shouldn't need to do this; who cares what order the fields come back in?
Your application should be able to make use of the fields it needs regardless of the order they arrive in.
Another solution I applied to achieve this is the following:
db.profiles
    .find({ "profile.ModelID" : 'LZ241M4' })
    .toArray()
    .map(doc => ({
        profile : {
            ModelID : doc.profile.ModelID,
            AVersion : doc.profile.AVersion,
            SVersion : doc.profile.SVersion
        }
    }))
Since version 2.6 (released in 2014), MongoDB preserves the order of document fields following a write operation.
P.S. If you are using Python you might find this interesting.