MongoDB MapReduce : use positional operator $ in map function - mongodb

I have a collection with entries that look like that :
{"userid": 1, "contents": [ { "tag": "whatever", "value": 100 }, {"tag": "whatever2", "value": 110 } ] }
I'm performing a MapReduce on this collection with queries such as {"contents.tag": "whatever"}.
What I'd like to do in my map function is emiting the field "value" corresponding to the entry in the array "contents" that matched the query without having to iterate through the whole array. Under normal circumstances, I could do that using the $ positional operator with something like contents.$.value. But in the MapReduce case, it's not working.
To summarize, here is the code I have right now :`
map=function(){
emit(this.userid, WHAT DO I WRITE HERE TO EMIT THE VALUE I WANT ?);
}
reduce=function(key,values){
return values[0]; //this reduce function does not make sense, just for the example
}
res=db.runCommand(
{
"mapreduce": "collection",
"query": {'contents.tag':'whatever'},
"map": map,
"reduce": reduce,
"out": "test_mr"
}
);`
Any idea ?
Thanks !

This will not work without iterating over the whole array. In MongoDB a query is intended to match an entire document.
When dealing with Map / Reduce, the query is simply trimming the number of documents that are passed into the map function. However, the map function has no knowledge of the query that was run. The two are disconnected.
The source code around the M/R is here.
There is an upcoming aggregation feature that will more closely match this desire. But there's no timeline on this feature.

No way. I've had the same problem. The iterate is necessary.
You could do this:
map=function() {
for(var i in this.contents) {
if(this.contents[i].tag == "whatever") {
emit(this.userid, this.contents[i].value);
}
}
}

Related

MongoDB $pop element which is in 3 time nested array

Here is the data structure for each document in the collection. The datastructure is fixed.
{
'_id': 'some-timestamp',
'RESULT': [
{
'NUMERATION': [ // numeration of divisions
{
// numeration of producttypes
'DIVISIONX': [{'PRODUCTTYPE': 'product xy', COUNT: 100}]
}
]
}
]
}
The query result should be in the same structure but only contain producttypes matching a regular expression.
I tried using an nested $elemMatchoperator but this doesn't get me any closer. I don't know how I can iterate each value in the producttypes array for each division.
How can I do that? Then I could apply $pop, $in and $each.
I looked at:
Querying an array of arrays in MongoDB
https://docs.mongodb.com/manual/reference/operator/update/each/
https://docs.mongodb.com/manual/reference/operator/update/pop/
... and more
The solution I want to avoid is writing something like this:
collection.find().forEach(function(x) { /* more for eaches */ })
Edit:
Here is an example document to copy:
{"_id":"5ab550d7e85d5930b0879cbe","RESULT":[{"NUMERATION":[{"DIVISION":[{"PRODUCTTYPE":"Book","COUNT":10},{"PRODUCTTYPE":"Giftcard","COUNT":"300"}]}]}]}
E.g. the query result should only return the entry with the giftcard:
{"_id":"5ab550d7e85d5930b0879cbe","RESULT":[{"NUMERATION":[{"DIVISION":[{"PRODUCTTYPE":"Giftcard","COUNT":"300"}]}]}]}
Using the forEach approach the result is in the correct format. I'm still looking for a better way which does not involve the use of that function - therefore I will not mark this as an answer.
But for now this works fine:
db.collection.find().forEach(
function(wholeDocument) {
wholeDocument['RESULT'].forEach(function (resultEntry) {
resultEntry['NUMERATION'].forEach(function (numerationEntry) {
numerationEntry['DIVISION'].forEach(function(divisionEntry, index) {
// example condition (will be replaced by regular expression evaluation)
if(divisionEntry['PRODUCTTYPE'] != 'Giftcard'){
numerationEntry['DIVISION'].splice(index, 1);
}
})
})
})
print(wholeDocument);
}
)
UPDATE
Thanks to Rahul Raj's comments I have read up the aggregation with the $redact operator. A prototype of the solution to the issue is this query:
db.getCollection('DeepStructure').aggregate( [
{ $redact: {
$cond: {
if: { $ne: [ "$PRODUCTTYPE", "Giftcard" ] },
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}
]
)
I hope you're trying to update nested array.
You need to use positional operators $[] or $ for that.
If you use $[], you will be able to remove all matching nested array elements.
And if you use $, only the first matching array element will get removed.
Use $regex operator to pass on your regular expression.
Also, you need to use $pull to remove array elements based on matching condition. In your case, its regular expression. Note that $elemMatch is not the correct one to use with $pull as arguments to $pull are direct queries to the array.
db.collection.update(
{/*additional matching conditions*/},
{$pull: {"RESULT.$[].NUMERATION.$[].DIVISIONX":{PRODUCTTYPE: {$regex: "xy"}}}},
{multi: true}
)
Just replace xy with your regular expression and add your own matching conditions as required. I'm not quite sure about your data set, but I came up with the above answer based on my assumptions from the given info. Feel free to change according to your requirements.

MongoDB MapReduce strange results

When I perform a Mapreduce operation over a MongoDB collection with an small number of documents everything goes ok.
But when I run it with a collection with about 140.000 documents, I get some strange results:
Map function:
function() { emit(this.featureType, this._id); }
Reduce function:
function(key, values) { return { count: values.length, ids: values };
As a result, I would expect something like (for each mapping key):
{
"_id": "FEATURE_TYPE_A",
"value": { "count": 140000,
"ids": [ "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5", .......}
But instead I get this strange document structure:
{
"_id": "FEATURE_TYPE_A",
"value": {
"count": 706,
"ids": [
{
"count": 101,
"ids": [
{
"count": 100,
"ids": [
"9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5".....}
Could someone explain me if this is the expected behavior, or am I doing something wrong?
Thanks in advance!
The case here is un-usual and I'm not sure if this is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
What that basically says here is that your current operation is only expecting that "reduce" function to be called once, but this is not the case. The input will in fact be "broken up" and passed in here as manageable sizes. The multiple calling of "reduce" now makes another point very important.
Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:
the type of the return object must be identical to the type of the value emitted by the map function to ensure that the following operations is true:
Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result. Essentially making sure that the output for the "mapper" is sent in the same form as how it will appear in the "reducer" and the reduce process itself is mindful of this.
So first the mapper revised:
function () { emit(this.type, { count: 1, ids: [this._id] }); }
Which is now consistent with the final output form. This is important when considering the reducer which you now know will be invoked multiple times:
function (key, values) {
var ids = [];
var count = 0;
values.forEach(function(value) {
count += value.count;
value.ids.forEach(function(id) {
ids.push( id );
});
});
return { count: count, ids: ids };
}
What this means is that each invocation of the reduce function expects the same inputs as it is outputting, being a count field and an array of ids. This gets to the final result by essentially
Reduce one chunk of results #chunk1
Reduce another chunk of results #chunk2
Comnine the reduce on the reduced chunks, #chunk1 and #chunk2
That may not seem immediately apparent, but the behavior is by design where the reducer gets called many times in this way to process large sets of emitted data, so it gradually "aggregates" rather than in one big step.
The aggregation framework makes this a lot more straightforward, where from MongoDB 2.6 and upwards the results can even be output to a collection, so if you had more than one result and the combined output was greater than 16MB then this would not be a problem.
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
"ids": { "$push": "$_id" }
}},
{ "$out": "ouputCollection" }
])
So that will not break and will actually return as expected, with the complexity greatly reduced as the operation is indeed very straightforward.
But I have already said that your purpose for returning the array of "_id" values here seems unclear in your intent given the sheer size. So if all you really wanted was a count by the "featureType" then you would use basically the same approach rather than trying to force mapReduce to find the length of an array that is very large:
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
}}
])
In either form though, the results will be correct as well as running in a fraction of the time that the mapReduce operation as constructed will take.

Mongodb $elemMatch query inside map

I need to search inside a map element with a certain value in mongodb.
I have this element in data base:
{
"_id": ObjectId("52950e93c4aad399cff0d9f9"),
"_class": "com.company.model.customer.DbCustomer",
"version": NumberLong(0),
"channels": {
"adea3d4e-2a73-4f3e-8a89-a336d6132909": {
"value": "dominik.czech.aal#gmail.com",
"alias": "email1",
"deliveryChannel": "EMAIL",
"status": "GOOD",
"_class": "com.company.model.customer.CustomerEmail"
}
}
}
Where "adea3d4e-2a73-4f3e-8a89-a336d6132909" is a key of a map of channels.
What I want to search is a channel with certain value.
If "channels" were an array the query would be this way:
{ "channels" :
{ "$elemMatch" : { "value" : "dominik.czech.aal#gmail.com" } }
}
But, as channels is a map, I can't use this approach.
Is it possible to search inside a map the same way you search inside an array?
Notice that I want to use a single query, for security reasons I cannot use the map reduce functionality in my database.
Thanks in advance.
AFAIK it's not possible with the current MongoDB operators anyway, without scripting or map/reduce or knowing the keys you want to query in advance.
As a side note, you should think your data structure against how you want to query it - i.e. you should probably consider transform the channels document into an array.
i am not seeing any array set in above sample code, array set must be look like myarray[1,2,3]
so with this sample code if you want to search sub-field value you can try like following.
>db.Collection.find({"channels.value" : "dominik.czech.aal#gmail.com"})
Hope this will help.....

How can I get all the doc ids in MongoDB?

How can I get an array of all the doc ids in MongoDB? I only need a set of ids but not the doc contents.
You can do this in the Mongo shell by calling map on the cursor like this:
var a = db.c.find({}, {_id:1}).map(function(item){ return item._id; })
The result is that a is an array of just the _id values.
The way it works in Node is similar.
(This is MongoDB Node driver v2.2, and Node v6.7.0)
db.collection('...')
.find(...)
.project( {_id: 1} )
.map(x => x._id)
.toArray();
Remember to put map before toArray as this map is NOT the JavaScript map function, but it is the one provided by MongoDB and it runs within the database before the cursor is returned.
One way is to simply use the runCommand API.
db.runCommand ( { distinct: "distinct", key: "_id" } )
which gives you something like this:
{
"values" : [
ObjectId("54cfcf93e2b8994c25077924"),
ObjectId("54d672d819f899c704b21ef4"),
ObjectId("54d6732319f899c704b21ef5"),
ObjectId("54d6732319f899c704b21ef6"),
ObjectId("54d6732319f899c704b21ef7"),
ObjectId("54d6732319f899c704b21ef8"),
ObjectId("54d6732319f899c704b21ef9")
],
"stats" : {
"n" : 7,
"nscanned" : 7,
"nscannedObjects" : 0,
"timems" : 2,
"cursor" : "DistinctCursor"
},
"ok" : 1
}
However, there's an even nicer way using the actual distinct API:
var ids = db.distinct.distinct('_id', {}, {});
which just gives you an array of ids:
[
ObjectId("54cfcf93e2b8994c25077924"),
ObjectId("54d672d819f899c704b21ef4"),
ObjectId("54d6732319f899c704b21ef5"),
ObjectId("54d6732319f899c704b21ef6"),
ObjectId("54d6732319f899c704b21ef7"),
ObjectId("54d6732319f899c704b21ef8"),
ObjectId("54d6732319f899c704b21ef9")
]
Not sure about the first version, but the latter is definitely supported in the Node.js driver (which I saw you mention you wanted to use). That would look something like this:
db.collection('c').distinct('_id', {}, {}, function (err, result) {
// result is your array of ids
})
I also was wondering how to do this with the MongoDB Node.JS driver, like #user2793120. Someone else said he should iterate through the results with .each which seemed highly inefficient to me. I used MongoDB's aggregation instead:
myCollection.aggregate([
{$match: {ANY SEARCHING CRITERIA FOLLOWING $match'S RULES} },
{$sort: {ANY SORTING CRITERIA, FOLLOWING $sort'S RULES}},
{$group: {_id:null, ids: {$addToSet: "$_id"}}}
]).exec()
The sorting phase is optional. The match one as well if you want all the collection's _ids. If you console.log the result, you'd see something like:
[ { _id: null, ids: [ '56e05a832f3caaf218b57a90', '56e05a832f3caaf218b57a91', '56e05a832f3caaf218b57a92' ] } ]
Then just use the contents of result[0].ids somewhere else.
The key part here is the $group section. You must define a value of null for _id (otherwise, the aggregation will crash), and create a new array field with all the _ids. If you don't mind having duplicated ids (according to your search criteria used in the $match phase, and assuming you are grouping a field other than _id which also has another document _id), you can use $push instead of $addToSet.
Another way to do this on mongo console could be:
var arr=[]
db.c.find({},{_id:1}).forEach(function(doc){arr.push(doc._id)})
printjson(arr)
Hope that helps!!!
Thanks!!!
I struggled with this for a long time, and I'm answering this because I've got an important hint. It seemed obvious that:
db.c.find({},{_id:1});
would be the answer.
It worked, sort of. It would find the first 101 documents and then the application would pause. I didn't let it keep going. This was both in Java using MongoOperations and also on the Mongo command line.
I looked at the mongo logs and saw it's doing a colscan, on a big collection of big documents. I thought, crazy, I'm projecting the _id which is always indexed so why would it attempt a colscan?
I have no idea why it would do that, but the solution is simple:
db.c.find({},{_id:1}).hint({_id:1});
or in Java:
query.withHint("{_id:1}");
Then it was able to proceed along as normal, using stream style:
createStreamFromIterator(mongoOperations.stream(query, MortgageDocument.class)).
map(MortgageDocument::getId).forEach(transformer);
Mongo can do some good things and it can also get stuck in really confusing ways. At least that's my experience so far.
Try with an agregation pipeline, like this:
db.collection.aggregate([
{ $match: { deletedAt: null }},
{ $group: { _id: "$_id"}}
])
this gona return a documents array with this structure
_id: ObjectId("5fc98977fda32e3458c97edd")
i had a similar requirement to get ids for a collection with 50+ million rows. I tried many ways. Fastest way to get the ids turned out to be to do mongoexport with just the ids.
One of the above examples worked for me, with a minor tweak. I left out the second object, as I tried using with my Mongoose schema.
const idArray = await Model.distinct('_id', {}, function (err, result) {
// result is your array of ids
return result;
});

Using IF/ELSE in map reduce

I am trying to make a simple map/reduce function on one of my MongoDB database collections.
I get data but it looks wrong. I am unsure about the Map part. Can I use IF/ELSE in this way?
UPDATE
I want to get the amount of authors that ownes the files. In other words how many of the authors own the uploaded files and thus, how many authors has no files.
The objects in the collection looks like this:
{
"_id": {
"$id": "4fa8efe33a34a40e52800083d"
},
"file": {
"author": "john",
"type": "mobile",
"status": "ready"
}
}
The map / reduce looks like this:
$map = new MongoCode ("function() {
if (this.file.type != 'mobile' && this.file.status == 'ready') {
if (!this.file.author) {
return;
}
emit (this.file.author, 1);
}
}");
$reduce = new MongoCode ("function( key , values) {
var count = 0;
for (index in values) {
count += values[index];
}
return count;
}");
$this->cimongo->command (array (
"mapreduce" => "files",
"map" => $map,
"reduce" => $reduce,
"out" => "statistics.photographer_count"
)
);
The map part looks ok to me. I would slightly change the reduce part.
values.forEach(function(v) {
count += v;
}
You should not use for in loop to iterate an array, it was not meant to do this. It is for enumerating object's properties. Here is more detailed explanation.
Why do you think your data is wrong? What's your source data? What do you get? What do you expect to get?
I just tried your map and reduce in mongo shell and got correct (reasonable looking) results.
The other way you can do what you are doing is get rid of the inner "if" condition in the map but call your mapreduce function with appropriate query clause, for example:
db.files.mapreduce(map,reduce,{out:'outcollection', query:{"file.author":{$exists:true}}})
or if you happen to have indexes to make the query efficient, just get rid of all ifs and run mapreduce with query:{"file.author":{$exists:true},"file.type":"mobile","file.status":"ready"} clause. Change the conditions to match the actual cases you want to sum up over.
In 2.2 (upcoming version available today as rc0) you can use the aggregation framework for this type of query rather than writing map/reduce functions, hopefully that will simplify things somewhat.