How to convert this map-reduce to the aggregation framework? - mongodb

I have already written this map/reduce/finalize function using MongoDB.
This is the aggregation I need MongoDB to execute:
db.house_results.mapReduce(function(){
emit(this.house_name.toLowerCase(),this);
},function(key,values){
var house = {name:key,address:"",description:"",photo:[],lat:0,lng:0,rooms:[]};
values.forEach(function(house_val) {
/*Address*/
if(house.address=="")
house.address = house_val.house_address;
/*Photo*/
if(!house_val.photo in house.photo)
house.photo.push(house_val.house_photo);
/*Description*/
if(house.description=="")
house.description = house_val.house_description;
/*LAT - LNG*/
if(house.lat==0 || house.lng==0){
var house_position = house_val.house_position;
if(house_position && house_position.lat && house_position.lng){
house.lat = house_position.lat;
house.lng = house_position.lng;
}
}
if(house.lat==0 || house.lng==0){
if(house_val.house_lat && house_val.house_lng){
house.lat = house_val.house_lat;
house.lng = house_val.house_lng;
}
}
if(house_val.rooms)
house.rooms.push(house_val.rooms);
});
return house;
},
{
out : "map_reduce_house_test",
finalize:function(key,house_val){
if(house_val.address==undefined){ // JUST ONE RESULT IN MAP FUNCTION -> REDUCE FUNCTION IS IGNORED -> FINALIZE IS SOLUTION
var house = {name:key,address:"",description:"",photo:[],lat:0,lng:0,rooms:[]};
/*Address*/
if(house.address=="")
house.address = house_val.house_address;
/*Photo*/
if(!house_val.photo in house.photo)
house.photo.push(house_val.house_photo);
/*Description*/
if(house.description=="")
house.description = house_val.house_description;
/*LAT - LNG*/
if(house.lat==0 || house.lng==0){
var house_position = house_val.house_position;
if(house_position && house_position.lat && house_position.lng){
house.lat = house_position.lat;
house.lng = house_position.lng;
}
}
if(house.lat==0 || house.lng==0){
if(house_val.house_lat && house_val.house_lng){
house.lat = house_val.house_lat;
house.lng = house_val.house_lng;
}
}
if(house_val.rooms)
house.rooms.push(house_val.rooms);
return house;
}else
return house_val;
}
}
);
Is there a way to simplify these functions, or is it better to do the same thing with MongoDB's aggregation framework?
Which would be the fastest and simplest method?
Thanks!

There isn't really much going on in this mapReduce other than taking the first values from various fields for the common grouping key and otherwise pushing some other values onto arrays.
Therefore everything is very much the same for aggregation:
db.house_results.aggregate([
    { "$group": {
        "_id": { "$toLower": "$house_name" },
        "name": { "$first": { "$toLower": "$house_name" } },
        "photo": { "$push": "$house_photo" },
        "address": { "$first": "$house_address" },
        "description": { "$first": "$house_description" },
        "lat": {
            "$max": {
                "$cond": [
                    { "$gt": [ "$house_lat", "$house_position.lat" ] },
                    "$house_lat",
                    "$house_position.lat"
                ]
            }
        },
        "lng": {
            "$max": {
                "$cond": [
                    { "$gt": [ "$house_lng", "$house_position.lng" ] },
                    "$house_lng",
                    "$house_position.lng"
                ]
            }
        },
        // the mapReduce in the question reads the "rooms" field
        "rooms": { "$push": "$rooms" }
    }}
])
The only real difference there is the conditional handling of the "lat" and "lng" output using primarily the $cond operator.
Note that "_id" and "name" have the same thing in them, but that is what the mapReduce is doing.
Take a good look at the aggregation operators for reference, but really your data should look like this rather than its present form, which appears to be a de-normalized dump from somewhere.
Also for reference, it probably isn't affecting you in this case, but this is the wrong way to write a mapReduce. The output from the "map" function is different from that of the "reduce" function, notably the arrays.
Even though these will only have one element in them, they "should" be emitted as an array from the "map" function as well, and treated as if they were already an array by the "reduce" function.
This is because with a larger "grouping", not all matching key values are sent into the reduce function at once, and the reducer can be called again to combine other values emitted by "map" with previously reduced output. That is how large data is handled, and with arrays you run the risk of output like this, with the unexpected embedding of an array within an array:
[ [4,5,6], 7, 8, 9 ]
But this is covered in the documentation if you read it carefully.
At any rate, the aggregation pipeline (a single stage) will perform much faster than the present operation. But really, change your data as soon as possible.
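To make the re-reduce problem concrete, here is a minimal plain JavaScript sketch that runs without MongoDB. The names badReduce and goodReduce and the sample values are made up for illustration; the only point is that a reducer has to accept its own output as input.

```javascript
// Hypothetical reducer in the style of the question: it pushes every incoming
// value onto an array, assuming each value is a raw "map" output.
function badReduce(key, values) {
    var out = { rooms: [] };
    values.forEach(function (v) {
        out.rooms.push(v.rooms);
    });
    return out;
}

// Hypothetical corrected reducer: "map" already emits arrays, so the reducer
// concatenates, and its output has the same shape as its input.
function goodReduce(key, values) {
    var out = { rooms: [] };
    values.forEach(function (v) {
        out.rooms = out.rooms.concat(v.rooms);
    });
    return out;
}

// A first pass over part of the emitted values, then a second pass that
// combines the earlier result with newly emitted values (a "re-reduce").
var badFirst = badReduce("k", [{ rooms: 4 }, { rooms: 5 }, { rooms: 6 }]);
var badFinal = badReduce("k", [badFirst, { rooms: 7 }, { rooms: 8 }, { rooms: 9 }]);

var goodFirst = goodReduce("k", [{ rooms: [4] }, { rooms: [5] }, { rooms: [6] }]);
var goodFinal = goodReduce("k", [goodFirst, { rooms: [7] }, { rooms: [8] }, { rooms: [9] }]);

console.log(JSON.stringify(badFinal.rooms));  // [[4,5,6],7,8,9]
console.log(JSON.stringify(goodFinal.rooms)); // [4,5,6,7,8,9]
```

The second reducer stays correct no matter how many times its output is fed back in, which is the property mapReduce relies on.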

Related

Using $sum on an existing field returns a value of 0 [duplicate]

I have a collection students with documents in the following format:-
{
_id:"53fe74a866455060e003c2db",
name:"sam",
subject:"maths",
marks:"77"
}
{
_id:"53fe79cbef038fee879263d2",
name:"ryan",
subject:"bio",
marks:"82"
}
{
_id:"53fe74a866456060e003c2de",
name:"tony",
subject:"maths",
marks:"86"
}
I want to get the count of total marks of all the students with subject = "maths". So I should get 163 as sum.
db.students.aggregate([{ $match : { subject : "maths" } },
{ "$group" : { _id : "$subject", totalMarks : { $sum : "$marks" } } }])
Now I should get the following result-
{"result":[{"_id":"53fe74a866455060e003c2db", "totalMarks":163}], "ok":1}
But I get-
{"result":[{"_id":"53fe74a866455060e003c2db", "totalMarks":0}], "ok":1}
Can someone point out what I might be doing wrong here?
Your current schema stores the marks field as a string, and the aggregation framework needs a numeric data type to work out the sum ($sum ignores non-numeric values, which is why you get 0). On the other hand, you can use MapReduce to calculate the sum since it allows the use of native JavaScript methods like parseInt() on your object properties in its map functions. So overall you have two choices.
Option 1: Update Schema (Change Data Type)
The first would be to change the schema or add another field in your document that has the actual numerical value, not the string representation. If your collection is relatively small, you could use a combination of MongoDB's cursor find(), forEach() and update() methods to change your marks schema:
db.students.find({ "marks": { "$type": 2 } }).snapshot().forEach(function(doc) {
    db.students.update(
        // $type 2 is "string", so already converted documents are skipped
        { "_id": doc._id, "marks": { "$type": 2 } },
        { "$set": { "marks": parseInt(doc.marks, 10) } }
    );
});
For relatively large collections, updating one document at a time will be slow, so it's recommended to use MongoDB bulk updates instead:
MongoDB versions >= 2.6 and < 3.2:
var bulk = db.students.initializeUnorderedBulkOp(),
    counter = 0;
db.students.find({"marks": {"$exists": true, "$type": 2 }}).forEach(function (doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "marks": parseInt(doc.marks, 10) }
    });
    counter++;
    if (counter % 1000 === 0) {
        // Execute per 1000 operations
        bulk.execute();
        // re-initialize every 1000 update statements
        bulk = db.students.initializeUnorderedBulkOp();
    }
});
// Clean up remaining operations in queue
if (counter % 1000 !== 0) bulk.execute();
MongoDB version 3.2 and newer:
var ops = [],
    cursor = db.students.find({"marks": {"$exists": true, "$type": 2 }});
cursor.forEach(function (doc) {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "marks": parseInt(doc.marks, 10) } }
        }
    });
    // Send the queued operations in batches of 1000
    if (ops.length === 1000) {
        db.students.bulkWrite(ops);
        ops = [];
    }
});
// Send whatever is left in the queue
if (ops.length > 0) db.students.bulkWrite(ops);
Option 2: Run MapReduce
The second approach would be to rewrite your query with MapReduce where you can use the JavaScript function parseInt().
In your MapReduce operation, define the map function that processes each input document. This function converts the marks string value to a number and emits the subject and converted marks pair for each document. This is where the JavaScript native function parseInt() can be applied. Note: in the function, this refers to the document that the map-reduce operation is processing:
var mapper = function () {
var x = parseInt(this.marks);
emit(this.subject, x);
};
Next, define the corresponding reduce function with two arguments keySubject and valuesMarks. valuesMarks is an array whose elements are the integer marks values emitted by the map function and grouped by keySubject.
The function reduces the valuesMarks array to the sum of its elements.
var reducer = function(keySubject, valuesMarks) {
return Array.sum(valuesMarks);
};
db.students.mapReduce(
    mapper,
    reducer,
    {
        out : "example_results",
        query: { subject : "maths" }
    }
);
With your collection, the above will put your MapReduce aggregation result in a new collection db.example_results. Thus, db.example_results.find() will output:
/* 0 */
{
"_id" : "maths",
"value" : 163
}
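As a further possibility not covered above, newer servers can convert the string inside the pipeline itself. The sketch below assumes MongoDB 4.0 or later, where the $toInt operator exists, and assumes every marks string holds a base-10 integer; it is an illustration under those assumptions rather than a drop-in fix.

```javascript
// Sketch only: assumes MongoDB 4.0+ (where $toInt is available) and that every
// "marks" string is a valid integer; $toInt raises an error otherwise.
var pipeline = [
    { "$match": { "subject": "maths" } },
    { "$group": {
        "_id": "$subject",
        "totalMarks": { "$sum": { "$toInt": "$marks" } }
    }}
];

// In the mongo shell this would be run as:
// db.students.aggregate(pipeline)
```

Changing the stored type as in Option 1 is still the cleaner long-term fix, since the conversion here is repeated on every query.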
Possible reasons your sum is returned as 0:
The field you are summing is not a number but a string.
Make sure the field contains numeric values.
You are using the wrong syntax for $sum.
db.c1.aggregate([{
$group: {
_id: "$item",
price: {
$sum: "$price"
},
count: {
$sum: 1
}
}
}])
Make sure you use "$price" and not "price".
One of the silliest mistakes that causes this error is:
A space or tab inside the quotes when specifying the field name.
Example - "$price " won't work, but "$price" will.
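The first cause can be reproduced in plain JavaScript without a database. The helper sumNumericOnly below is hypothetical and only mimics the way $sum skips non-numeric values; the sample documents are the ones from the question.

```javascript
// Sample documents from the question; "marks" is stored as a string.
var students = [
    { name: "sam",  subject: "maths", marks: "77" },
    { name: "ryan", subject: "bio",   marks: "82" },
    { name: "tony", subject: "maths", marks: "86" }
];

// Hypothetical helper that mimics how $sum treats its input:
// only numeric values are added, everything else is ignored.
function sumNumericOnly(values) {
    return values.reduce(function (total, v) {
        return typeof v === "number" ? total + v : total;
    }, 0);
}

var mathsMarks = students
    .filter(function (s) { return s.subject === "maths"; })
    .map(function (s) { return s.marks; });

var sumOfStrings = sumNumericOnly(mathsMarks);
var sumOfNumbers = sumNumericOnly(mathsMarks.map(function (m) {
    return parseInt(m, 10);
}));

console.log(sumOfStrings); // 0
console.log(sumOfNumbers); // 163
```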

mongodb: document with the maximum number of matched targets

I need help to solve the following issue. My collection has a "targets" field.
Each user can have 0 or more targets.
When I run my query I'd like to retrieve the document with the maximum number of matched targets.
Ex:
documents=[{
targets:{
"cluster":"01",
}
},{
targets:{
"cluster":"01",
"env":"DC",
"core":"PO"
}
},{
targets:{
"cluster":"01",
"env":"DC",
"core":"PO",
"platform":"IG"
}
}];
userTarget={
"cluster":"01",
"env":"DC",
"core":"PO"
}
You seem to be asking to return the document where the most conditions were met, and possibly not all conditions. The basic process is an $or query to return the documents that can match either of the conditions. Then you basically need a statement to calculate "how many terms" were met in the document, and return the one that matched the most.
So the combination here is an .aggregate() statement using the initial results from $or to calculate and then sort the results:
// initial targets object
var userTarget = {
    "cluster":"01",
    "env":"DC",
    "core":"PO"
};
// Convert to $or condition
// and the calculation condition to match
var orCondition = [],
    scoreCondition = [];
Object.keys(userTarget).forEach(function(key) {
    // the field in the question's documents is named "targets"
    var query = {},
        cond = { "$cond": [{ "$eq": ["$targets." + key, userTarget[key]] },1,0] };
    query["targets." + key] = userTarget[key];
    orCondition.push(query);
    scoreCondition.push(cond);
});
// Run aggregation
Model.aggregate(
[
// Match with condition
{ "$match": { "$or": orCondition } },
// Calculate a "score" based on matched fields
{ "$project": {
"targets": 1,
"score": {
"$add": scoreCondition
}
}},
// Sort on the greatest "score" (descending)
{ "$sort": { "score": -1 } },
// Return the first document
{ "$limit": 1 }
],
function(err,result) {
// check errors
// Remember that result is an array, even if limited to one document
console.log(result[0]);
}
)
So before processing the aggregate statement, we are going to generate the dynamic parts of the pipeline operations based on the input in the userTarget object. This would produce an orCondition like this:
{ "$match": {
    "$or": [
        { "targets.cluster" : "01" },
        { "targets.env" : "DC" },
        { "targets.core" : "PO" }
    ]
}}
And the scoreCondition would expand to something like this:
"score": {
    "$add": [
        { "$cond": [{ "$eq": [ "$targets.cluster", "01" ] },1,0] },
        { "$cond": [{ "$eq": [ "$targets.env", "DC" ] },1,0] },
        { "$cond": [{ "$eq": [ "$targets.core", "PO" ] },1,0] }
    ]
}
Those are going to be used in the selection of possible documents and then for counting the terms that could match. In particular the "score" is made by evaluating each condition within the $cond ternary operator, and then either attributing a score of 1 where there was a match, or 0 where there was not a match on that field.
If desired, it would be simple to alter the logic to assign a higher "weight" to each field with a different value going towards the score depending on the deemed importance of the match. At any rate, you simply $add these score results together for each field for the overall "score".
Then it is just a simple matter of applying the $sort to the returned "score", and then using $limit to just return the top document.
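The scoring idea can be checked in plain JavaScript against the sample data from the question. This is only an in-memory sketch of what the pipeline computes; the score function and the ranked variable are illustrative names, and it uses the targets field name from the question's documents.

```javascript
var documents = [
    { targets: { cluster: "01" } },
    { targets: { cluster: "01", env: "DC", core: "PO" } },
    { targets: { cluster: "01", env: "DC", core: "PO", platform: "IG" } }
];
var userTarget = { cluster: "01", env: "DC", core: "PO" };

// Count how many userTarget fields a document matches
// (the in-memory equivalent of adding up the $cond results).
function score(doc, target) {
    return Object.keys(target).reduce(function (total, key) {
        return total + (doc.targets[key] === target[key] ? 1 : 0);
    }, 0);
}

var ranked = documents
    .map(function (doc) { return { doc: doc, score: score(doc, userTarget) }; })
    .filter(function (r) { return r.score > 0; })          // like the $or match
    .sort(function (a, b) { return b.score - a.score; });  // like $sort descending

console.log(ranked.map(function (r) { return r.score; }).join(",")); // 3,3,1
```

With this sample data the second and third documents tie on a score of 3, so if the extra "platform" field should count against a document, an additional tie-breaker would be needed before taking the first result.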
It's not super efficient, since even though there is a match for all three conditions the basic question you are asking of the data cannot presume that there is, hence it needs to look at all data where "at least one" condition was a match, and then just work out the "best match" from those possible results.
Ideally, I would personally run an additional query "first" to see if all three conditions were met, and if not then look for the other cases. That still is two separate queries, and would be different from simply just pushing the "and" conditions for all fields as the first statement in $or.
So the preferred implementation I think should be:
Look for a document that matches all given field values; if not then
Run the either/or on every field and count the condition matches.
That way, if all fields match then the first query is fastest, and you only need to fall back to the slower but required implementation shown in the listing if there was no actual result.
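The two filters for that two-step approach can be generated from the same input object. The helper names buildExactQuery and buildOrQuery are made up for this sketch, and the field prefix follows the targets field in the question's documents.

```javascript
var userTarget = { cluster: "01", env: "DC", core: "PO" };

// Step 1 filter: every field must match (an implicit AND).
function buildExactQuery(target) {
    var query = {};
    Object.keys(target).forEach(function (key) {
        query["targets." + key] = target[key];
    });
    return query;
}

// Step 2 filter: at least one field must match.
function buildOrQuery(target) {
    return {
        "$or": Object.keys(target).map(function (key) {
            var cond = {};
            cond["targets." + key] = target[key];
            return cond;
        })
    };
}

console.log(JSON.stringify(buildExactQuery(userTarget)));
// {"targets.cluster":"01","targets.env":"DC","targets.core":"PO"}
console.log(buildOrQuery(userTarget)["$or"].length); // 3
```

The first filter could be passed to something like Model.findOne(), and only when that returns nothing would the aggregation with the second filter need to run.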

MongoDB :: Order Search result depend on search condition

I have the following data:
[{ "name":"BS",
"keyword":"key1",
"city":"xyz"
},
{ "name":"AGS",
"keyword":"Key2",
"city":"xyz1"
},
{ "name":"QQQ",
"keyword":"key3",
"city":"xyz"
},
{ "name":"BS",
"keyword":"Keyword",
"city":"city"
}]
and I need to find the records which have name = "BS" OR keyword = "Key2" with this query:
db.collection.find({"$or" : [{"name":"BS"}, {"keyword":"Key2"}]});
I need these records in the following sequence:
[{ "name":"BS",
"keyword":"key1",
"city":"xyz"
},
{ "name":"BS",
"keyword":"Keyword",
"city":"city"
},
{ "name":"AGS",
"keyword":"Key2",
"city":"xyz1"
}]
but I am getting them in the following sequence:
[{ "name":"BS",
"keyword":"key1",
"city":"xyz"
},
{ "name":"AGS",
"keyword":"Key2",
"city":"xyz1"
},
{ "name":"BS",
"keyword":"Keyword",
"city":"city"
}]
Please provide some suggestions; I have been stuck on this problem for 2 days.
Thanks
The order of results returned by MongoDB is not guaranteed unless you explicitly sort your data using the sort function. For smaller datasets you may be "lucky" in the sense that the results are always returned in the same order; however, for bigger datasets, and in particular on sharded MongoDB clusters, this is very unlikely. As proposed by Yathish, you need to explicitly order your results using the sort function. Based on the desired output, it seems you want to sort by name in descending order, so I have set the sort direction to -1 for the name field.
db.collection.find({"$or" : [{"name":"BS"}, {"keyword":"Key2"}]}).sort({"name" : -1});
If you need a more complex sorting algorithm as specified in your comment, you can convert your results to a JavaScript array and apply a custom sort function. This sort function will first list documents with a name equal to "BS" and then the remaining documents, which matched on the keyword "Key2":
db.collection.find({
    "$or": [{
        "name": "BS"
    }, {
        "keyword": "Key2"
    }]
}).toArray().sort(function(doc1, doc2) {
    // Rank name matches (0) ahead of keyword-only matches (1)
    var rank1 = (doc1.name == "BS") ? 0 : 1,
        rank2 = (doc2.name == "BS") ? 0 : 1;
    if (rank1 !== rank2) {
        return rank1 - rank2;
    }
    // Same rank: order by name; a comparator must return a number, not a boolean
    return doc1.name < doc2.name ? -1 : (doc1.name > doc2.name ? 1 : 0);
});
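The same ordering rule can also be expressed on the server instead of sorting in the client. The pipeline below is a sketch that assumes MongoDB 3.4 or later for $addFields, and "rank" is a made-up field name; the second half applies the identical rule in memory to the question's data so the resulting order can be seen without a database.

```javascript
// Sketch only: assumes MongoDB 3.4+ for $addFields; "rank" is a made-up field.
var pipeline = [
    { "$match": { "$or": [ { "name": "BS" }, { "keyword": "Key2" } ] } },
    { "$addFields": {
        "rank": { "$cond": [ { "$eq": [ "$name", "BS" ] }, 0, 1 ] }
    }},
    { "$sort": { "rank": 1 } }
];
// In the mongo shell: db.collection.aggregate(pipeline)

// The same rule applied in memory to the sample data.
var data = [
    { name: "BS",  keyword: "key1",    city: "xyz"  },
    { name: "AGS", keyword: "Key2",    city: "xyz1" },
    { name: "QQQ", keyword: "key3",    city: "xyz"  },
    { name: "BS",  keyword: "Keyword", city: "city" }
];

var ordered = data
    .filter(function (d) { return d.name === "BS" || d.keyword === "Key2"; })
    .map(function (d) { return { doc: d, rank: d.name === "BS" ? 0 : 1 }; })
    .sort(function (a, b) { return a.rank - b.rank; })
    .map(function (r) { return r.doc.name; });

console.log(ordered.join(",")); // BS,BS,AGS
```

Documents with the same rank have no guaranteed relative order on the server, so a second sort key can be added to the $sort stage if that matters.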

MongoDb MapReduce on child array

I've searched the internet long and hard but can't find a solution to this problem. Whilst there are lots of map-reduce examples, I'm getting confused because my document has a property which is an array of objects.
I'm pretty sure this should be easy for someone with experience, but I'm a noob at the minute.
I have a document which looks roughly like this
{
_id:guid,
clientId:guid,
reference:'abc123',
items:
[
{ _id:guid, category:'A', length:100, active:true },
{ _id:guid, category:'B', length:150, active:true },
{ _id:guid, category:'A', length:10, active:false },
{ _id:guid, category:'A', length:111, active:true },
]
}
and I want to produce this output
dateFromIdGuid(day) category countOfItems countOfActive sumOfLength
I'd like to keep the data in this format to reduce the number of write operations (there are already over 1000 writes to this collection per second and rising)
This is driving me insane so any help would be very much appreciated.
Thanks.
If you are talking about extracting a timestamp and reducing that to a discrete day from a GUID, then MongoDB is not going to be of much help to you there. You would need an external language implementation that would support such a function and implement an external mapReduce process such as with Hadoop.
It makes me wonder though if we are in fact talking about a GUID or whether you actually mean an ObjectID which would be the default value for the _id field of your document unless this has been specifically overridden to have a GUID in there.
Even if that is not true, you would be helped by adding a "timestamp" field of some sort to your document and using the correct BSON Date object type as shown below:
{
_id:guid,
"timestamp": ISODate("2014-05-27T00:00:00Z"),
"clientId":guid,
"reference":'abc123',
"items":
[
{ _id:guid, category:'A', length:100, active:true },
{ _id:guid, category:'B', length:150, active:true },
{ _id:guid, category:'A', length:10, active:false },
{ _id:guid, category:'A', length:111, active:true },
]
}
This allows you to use the MongoDB aggregation framework as it can operate on Date objects of this type in order to break down the results to discrete days:
db.collection.aggregate([
{ "$unwind": "$items" },
{ "$group": {
"_id": {
"day": { "$dayOfYear": "$timestamp" },
"category": "$items.category"
},
"countOfItems": { "$sum": 1 },
"countOfActive": {
"$sum": {
"$cond": [
"$items.active",
1,
0
]
}
},
"sumOfLength": { "$sum": "$items.length" }
}}
])
That not only gives you the results in the fastest way MongoDB can do it but that "timestamp" value is also useful for filtering queries within date ranges which is something you cannot easily do from other values.
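As a sanity check on what the $unwind and $group stages compute, here is the same per-category tally done in plain JavaScript for the single sample document from the question. The day part of the grouping key is left out because the sample has no timestamp; the variable names are illustrative.

```javascript
var doc = {
    reference: "abc123",
    items: [
        { category: "A", length: 100, active: true },
        { category: "B", length: 150, active: true },
        { category: "A", length: 10,  active: false },
        { category: "A", length: 111, active: true }
    ]
};

// Equivalent of $unwind + $group on "items.category" for one document.
var totals = {};
doc.items.forEach(function (item) {
    var t = totals[item.category] ||
        (totals[item.category] = { countOfItems: 0, countOfActive: 0, sumOfLength: 0 });
    t.countOfItems += 1;
    t.countOfActive += item.active ? 1 : 0;
    t.sumOfLength += item.length;
});

console.log(JSON.stringify(totals.A)); // {"countOfItems":3,"countOfActive":2,"sumOfLength":221}
console.log(JSON.stringify(totals.B)); // {"countOfItems":1,"countOfActive":1,"sumOfLength":150}
```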
The JavaScript available to MongoDB mapReduce also provides a way to get the date from an ObjectId. This runs slower than the aggregation framework though:
db.collection.mapReduce(
function() {
var date = this._id.getTimestamp();
this.items.forEach(function(item) {
var day =
"" + date.getFullYear() +
"" + ( date.getMonth() + 1 ) +
"" + date.getDate();
emit(
{
day: day,
category: item.category
},
{
countOfItems: 1,
countOfActive: ( item.active ) ? 1 : 0,
sumOfLength: item.length
}
);
});
},
function( key, values ) {
var reduced = {
countOfItems: 0,
countOfActive: 0,
sumOfLength: 0
};
values.forEach(function(value) {
for ( var k in value ) {
reduced[k] += value[k];
}
});
return reduced;
},
{
"out": { "inline": 1 }
}
)
That basically does the same thing, where the mapper breaks apart the array and provides grouping keys while the reducer just sums up the values from the mapper. So even if you had to extract from GUIDs, that gives you a basic layout for a mapper and reducer in a language such as Java when using Hadoop.
Take a look at the aggregate and mapReduce manual pages for more information on options you can apply.
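One detail worth watching in the mapper's day key: concatenating year, month and date without padding is ambiguous, since 11 January 2014 and 1 November 2014 both become "2014111". The sketch below pads the parts and also shows how the timestamp sits in the first four bytes of an ObjectId. The helper names are hypothetical, and the sample id is used purely for illustration.

```javascript
// Hypothetical helper: derive the creation time from an ObjectId's hex string.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function timestampFromObjectIdHex(hex) {
    return new Date(parseInt(hex.substring(0, 8), 16) * 1000);
}

// Left-pad a number to two digits.
function pad2(n) {
    return (n < 10 ? "0" : "") + n;
}

// Zero-padded UTC day key such as "20140828".
function dayKey(date) {
    return "" + date.getUTCFullYear() +
        pad2(date.getUTCMonth() + 1) +
        pad2(date.getUTCDate());
}

var sampleId = "53fe74a866455060e003c2db"; // sample value, for illustration only
console.log(dayKey(timestampFromObjectIdHex(sampleId))); // 20140828
```

UTC getters are used so that the key does not depend on the time zone of the machine running the code.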