Select distinct values of more than one field using MongoDB's map-reduce

I want to execute this SQL statement on MongoDB:
SELECT DISTINCT book,author from library
So far MongoDB's DISTINCT only supports one field at a time. For more than one field, we have to use the GROUP command or map-reduce.
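For reference, single-field distinct in the shell looks like this (using the book field from the example below):
db.library.distinct("book")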
I have googled a way to use GROUP command:
db.library.group({
    key: {book: 1, author: 1},
    reduce: function(obj, prev) {
        if (!obj.hasOwnProperty("key")) {
            prev.book = obj.book;
            prev.author = obj.author;
        }
    },
    initial: {}
});
However, this approach only supports up to 10,000 keys. Does anyone know how to use map-reduce to solve this problem?

Take a look at this article which explains how to find unique articles using map-reduce in MongoDB.
Your emit statement is going to look something like:
emit({book: this.book, author: this.author}, {exists: 1});
and your reduce can be even simpler than the example since you don't care about how many there are for each grouping.
return {exists: 1};

In case someone faces a similar problem, this is the full solution:
Map step
map = function() {
    emit(
        {book: this.book, author: this.author}, {exists: 1}
    );
};
Reduce step
reduce = function(key, values) {
    return {exists: 1};
};
Run the command
result = db.runCommand({
    "mapreduce": "library",
    "map": map,
    "reduce": reduce,
    "out": "result"
});
Get the result
db.result.find()
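Each distinct (book, author) pair becomes the compound _id of a document in the output collection, so db.result.find() returns documents shaped like this (the book and author values here are hypothetical):
{ "_id" : { "book" : "Dune", "author" : "Frank Herbert" }, "value" : { "exists" : 1 } }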

Related

Sort by string length in MongoDB/pymongo

I was wondering if anyone knows how to sort a MongoDB find() result by string length.
I have tried something like db.foo.find().sort({"item.length": -1}), but that obviously doesn't work. Can somebody help me, and also suggest a way to do the same thing in pymongo?
There are a lot of things (and basic API) I would personally love to see in the aggregation framework, such as:
Math functions
  log (as in logarithm)
  ceil
  floor
Array
  sum
String
  length
Just to name a few.
And that is without resorting to obscure usages of the $mod operator or other means in such cases as "ceil" and "floor". But I digress.
Your "string length" falls into this category. Raise a JIRA issue about it. But for now you you can use mapReduce and the existing JavaScript functionality:
db.collection.mapReduce(
    function() {
        emit(this.item.length, this.item);
    },
    function(key, values) {
        return values;
    },
    { "out": { "inline": 1 } }
)
So while that does have mapReduce's funky style of returning a re-shaped document, with everything matching the same length collected into an array, it takes advantage of the nature of mapReduce (not restricted to MongoDB): the emitted "key" value comes back sorted in the response.
There is now a solution for this in MongoDB v3.4+ using the aggregation framework's $strLenBytes operator. Given the following document:
{_id: 0, name: "Bob"}
We can use
db.mycollection.aggregate([{
    $project: {
        byteLength: { $strLenBytes: "$name" }
    }
}])
This will return 3 for the number of bytes.
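To actually sort by string length (the original question), the computed length can feed a $sort stage. A minimal sketch, assuming MongoDB 3.4+ and that item is always present and a string:
db.foo.aggregate([
    { $addFields: { itemLength: { $strLenBytes: "$item" } } },
    { $sort: { itemLength: -1 } }
])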
No, actually it is not possible.
I was dealing with a similar problem; what I did was store the string length of every object as a property of the object itself, as sketched below. This bypassed the problem.
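A rough sketch of that workaround in the shell (the field names item and itemLength are just for illustration):
db.foo.find({}, {item: 1}).forEach(function(doc) {
    // store each string's length on the document itself
    db.foo.update({_id: doc._id}, {$set: {itemLength: doc.item.length}});
});
db.foo.find().sort({itemLength: -1})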
If you think this should be implemented (I do), I recommend upvoting the issue in JIRA, which, for some reason, does not have many votes:
https://jira.mongodb.org/browse/SERVER-5319

How can I get all the doc ids in MongoDB?

How can I get an array of all the doc ids in MongoDB? I only need a set of ids but not the doc contents.
You can do this in the Mongo shell by calling map on the cursor like this:
var a = db.c.find({}, {_id:1}).map(function(item){ return item._id; })
The result is that a is an array of just the _id values.
The way it works in Node is similar.
(This is MongoDB Node driver v2.2, and Node v6.7.0)
db.collection('...')
    .find(...)
    .project({_id: 1})
    .map(x => x._id)
    .toArray();
Remember to put map before toArray, as this map is NOT the JavaScript Array map function; it is the one provided by the MongoDB driver and is applied to each document as the cursor is iterated, before toArray materializes the results.
One way is to simply use the runCommand API (note that the value of the distinct field is the collection name; here the collection happens to be named distinct):
db.runCommand({ distinct: "distinct", key: "_id" })
which gives you something like this:
{
    "values" : [
        ObjectId("54cfcf93e2b8994c25077924"),
        ObjectId("54d672d819f899c704b21ef4"),
        ObjectId("54d6732319f899c704b21ef5"),
        ObjectId("54d6732319f899c704b21ef6"),
        ObjectId("54d6732319f899c704b21ef7"),
        ObjectId("54d6732319f899c704b21ef8"),
        ObjectId("54d6732319f899c704b21ef9")
    ],
    "stats" : {
        "n" : 7,
        "nscanned" : 7,
        "nscannedObjects" : 0,
        "timems" : 2,
        "cursor" : "DistinctCursor"
    },
    "ok" : 1
}
However, there's an even nicer way using the actual distinct API:
var ids = db.distinct.distinct('_id', {}, {});
which just gives you an array of ids:
[
    ObjectId("54cfcf93e2b8994c25077924"),
    ObjectId("54d672d819f899c704b21ef4"),
    ObjectId("54d6732319f899c704b21ef5"),
    ObjectId("54d6732319f899c704b21ef6"),
    ObjectId("54d6732319f899c704b21ef7"),
    ObjectId("54d6732319f899c704b21ef8"),
    ObjectId("54d6732319f899c704b21ef9")
]
Not sure about the first version, but the latter is definitely supported in the Node.js driver (which I saw you mention you wanted to use). That would look something like this:
db.collection('c').distinct('_id', {}, {}, function (err, result) {
    // result is your array of ids
})
I was also wondering how to do this with the MongoDB Node.js driver, like #user2793120. Someone else said he should iterate through the results with .each, which seemed highly inefficient to me. I used MongoDB's aggregation instead:
myCollection.aggregate([
    {$match: {ANY SEARCHING CRITERIA FOLLOWING $match'S RULES}},
    {$sort: {ANY SORTING CRITERIA, FOLLOWING $sort'S RULES}},
    {$group: {_id: null, ids: {$addToSet: "$_id"}}}
]).exec()
The sorting stage is optional. The $match stage is optional as well, if you want all of the collection's _ids. If you console.log the result, you'd see something like:
[ { _id: null, ids: [ '56e05a832f3caaf218b57a90', '56e05a832f3caaf218b57a91', '56e05a832f3caaf218b57a92' ] } ]
Then just use the contents of result[0].ids somewhere else.
The key part here is the $group stage. You must set _id to null (otherwise the aggregation will fail) and create a new array field containing all the _ids. If you don't mind duplicated ids (possible when, given your $match criteria, you group on a field other than _id), you can use $push instead of $addToSet.
Another way to do this on mongo console could be:
var arr = [];
db.c.find({}, {_id: 1}).forEach(function(doc) { arr.push(doc._id); });
printjson(arr)
Hope that helps!
I struggled with this for a long time, and I'm answering this because I've got an important hint. It seemed obvious that:
db.c.find({},{_id:1});
would be the answer.
It worked, sort of. It would find the first 101 documents and then the application would pause. I didn't let it keep going. This was both in Java using MongoOperations and also on the Mongo command line.
I looked at the mongo logs and saw it was doing a collection scan (COLLSCAN) on a big collection of big documents. I thought: crazy, I'm projecting the _id, which is always indexed, so why would it attempt a collection scan?
I have no idea why it would do that, but the solution is simple:
db.c.find({},{_id:1}).hint({_id:1});
or in Java:
query.withHint("{_id:1}");
Then it was able to proceed along as normal, using stream style:
createStreamFromIterator(mongoOperations.stream(query, MortgageDocument.class))
    .map(MortgageDocument::getId)
    .forEach(transformer);
Mongo can do some good things and it can also get stuck in really confusing ways. At least that's my experience so far.
Try with an aggregation pipeline, like this:
db.collection.aggregate([
    { $match: { deletedAt: null } },
    { $group: { _id: "$_id" } }
])
This is going to return an array of documents with this structure:
_id: ObjectId("5fc98977fda32e3458c97edd")
I had a similar requirement: getting the ids for a collection with 50+ million documents. I tried many ways; the fastest turned out to be a mongoexport of just the ids, as sketched below.
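A minimal sketch of such an export (the database and collection names here are hypothetical):
mongoexport --db=mydb --collection=library --fields=_id --type=csv --out=ids.csv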
One of the above examples worked for me, with a minor tweak, when I tried it with my Mongoose schema. I left out the options object, and since Mongoose queries can be awaited directly, there is no need for a callback as well:
const idArray = await Model.distinct('_id', {});
// idArray is your array of ids

Using IF/ELSE in map reduce

I am trying to make a simple map/reduce function on one of my MongoDB database collections.
I get data but it looks wrong. I am unsure about the Map part. Can I use IF/ELSE in this way?
UPDATE
I want to get the number of authors that own files. In other words: how many of the authors own the uploaded files, and thus how many authors have no files?
The objects in the collection looks like this:
{
    "_id": {
        "$id": "4fa8efe33a34a40e52800083d"
    },
    "file": {
        "author": "john",
        "type": "mobile",
        "status": "ready"
    }
}
The map / reduce looks like this:
$map = new MongoCode("function() {
    if (this.file.type != 'mobile' && this.file.status == 'ready') {
        if (!this.file.author) {
            return;
        }
        emit(this.file.author, 1);
    }
}");
$reduce = new MongoCode("function(key, values) {
    var count = 0;
    for (index in values) {
        count += values[index];
    }
    return count;
}");
$this->cimongo->command(array(
    "mapreduce" => "files",
    "map" => $map,
    "reduce" => $reduce,
    "out" => "statistics.photographer_count"
));
The map part looks ok to me. I would slightly change the reduce part.
values.forEach(function(v) {
    count += v;
});
You should not use a for...in loop to iterate over an array; it was not meant for that. It is for enumerating an object's properties. Here is a more detailed explanation.
Why do you think your data is wrong? What's your source data? What do you get? What do you expect to get?
I just tried your map and reduce in mongo shell and got correct (reasonable looking) results.
The other way you can do what you are doing is get rid of the inner "if" condition in the map but call your mapreduce function with appropriate query clause, for example:
db.files.mapReduce(map, reduce, {out: 'outcollection', query: {"file.author": {$exists: true}}})
or, if you happen to have indexes that make the query efficient, just get rid of all the ifs and run mapReduce with a query: {"file.author": {$exists: true}, "file.type": "mobile", "file.status": "ready"} clause. Change the conditions to match the actual cases you want to sum over.
In 2.2 (the upcoming version, available today as rc0) you can use the aggregation framework for this type of query rather than writing map/reduce functions; hopefully that will simplify things somewhat.
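A sketch of the equivalent aggregation, assuming the same conditions as the map function above:
db.files.aggregate([
    { $match: { "file.author": { $exists: true }, "file.type": { $ne: "mobile" }, "file.status": "ready" } },
    { $group: { _id: "$file.author", count: { $sum: 1 } } }
])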

mongodb mapreduce function does not provide skip functionality, is there any solution to this?

MongoDB's mapReduce function does not provide any way to skip records from the database the way find does. It offers query, sort & limit options, but I want to skip some records, and I cannot find a way to do it. Please provide solutions.
Thanks in advance.
Ideally a well-structured map-reduce query would allow you to skip particular documents in your collection.
Alternatively, as Sergio points out, you can simply not emit particular documents in map(). Using scope to define a global counter variable is one way to restrict emit to a specified range of documents. As an example, to skip the first 20 docs that are sorted by ObjectID (and thus, sorted by insertion time):
db.collection_name.mapReduce(map, reduce, {out: "example_output", sort: {_id: -1}, scope: {counter: 0}});
Map function:
function() {
    counter++;
    if (counter > 20) {
        emit(key, value);  // substitute your actual key and value here
    }
}
I'm not sure in which version this feature first became available, but certainly in MongoDB 2.6 the mapReduce() function provides a query parameter:
query : document
Optional. Specifies the selection criteria using query operators for determining the documents input to the map function.
Example
Consider the following map-reduce operations on a collection orders that contains documents of the following prototype:
{
    _id: ObjectId("50a8240b927d5d8b5891743c"),
    cust_id: "abc123",
    ord_date: new Date("Oct 04, 2012"),
    status: 'A',
    price: 25,
    items: [ { sku: "mmm", qty: 5, price: 2.5 },
             { sku: "nnn", qty: 5, price: 2.5 } ]
}
Perform the map-reduce operation on the orders collection using the mapFunction2, reduceFunction2, and finalizeFunction2 functions.
db.orders.mapReduce(
    mapFunction2,
    reduceFunction2,
    {
        out: { merge: "map_reduce_example" },
        query: { ord_date: { $gt: new Date('01/01/2012') } },
        finalize: finalizeFunction2
    }
)
This operation uses the query field to select only those documents with ord_date greater than new Date('01/01/2012'). It then outputs the results to the collection map_reduce_example. If the map_reduce_example collection already exists, the operation merges the existing contents with the results of this map-reduce operation.

mongodb find by comparing field values

Is it possible to express the following SQL query in mongodb:
SELECT * FROM table AS t WHERE t.field1 > t.field2;
edit:
To summarize:
using a third field storing "field1 - field2" is almost perfect, but requires a little extra maintenance
$where will load and eval in JavaScript and won't use any indexes; no good for large data
map/reduce has the same problem and will go through all records even if we need only one
You can do this using $where:
db.coll.find( { $where: "this.field1 > this.field2" } );
But:
Javascript executes more slowly than the native operators, but it is very flexible
If performance is an issue, it is better to go with the approach suggested by @yi_H.
You could store in your document field1 - field2 as field3, then search for { field3: { $gt: 0 } }
It is also possible to get matching documents with mapReduce, as sketched below.
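A rough sketch of that approach, emitting only the documents where field1 exceeds field2 (inline output, so only suitable for small result sets):
db.coll.mapReduce(
    function() {
        if (this.field1 > this.field2) {
            emit(this._id, this);  // key is the unique _id, so each group has one value
        }
    },
    function(key, values) {
        return values[0];
    },
    { out: { inline: 1 } }
)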
You can use a $where. Just be aware it will be fairly slow (has to execute Javascript code on every record) so combine with indexed queries if you can.
db.T.find( { $where: function() { return this.Grade1 > this.Grade2 } } );
or more compact:
db.T.find( { $where : "this.Grade1 > this.Grade2" } );
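For completeness: on MongoDB 3.6+ you can compare two fields natively with the $expr operator, which avoids JavaScript evaluation entirely (a later addition, not part of the answers above):
db.T.find({ $expr: { $gt: ["$Grade1", "$Grade2"] } })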