MongoDB - Is it possible to write several different functions in MapReduce? - mongodb

As part of my studies, we recently started learning MongoDB, so I am very new to the field.
One of our assignments uses MapReduce, but I cannot get it to work properly.
Given a collection of students of the following form:
{
    _id: ObjectId("4ffmdjd8cfd99k03"),
    tz: "11111",
    FirstName: "Anton",
    LastName: "Gill",
    Department: "Computer Science",
    Year: 1,
    Courses: [ { name: "Linear Algebra 1", grade: 70 }, { name: "OS", grade: 88 } ]
}
The task is to write a mapReduce call that returns, for each department and each year, a list of the names of all students whose grades are greater than 90. Example of the expected output:
[{Department: "Computer Science", Year: 1, students: ["Anton", "John"]},
{Department: "Physics", Year: 2, students: ["Dean", "David"]}]
Can I write more than one map function? That is, would the mapReduce structure look like this:
db.students.mapReduce(
map1,
map2,
map3,
..
reduce,
..
I have tried, unsuccessfully, to create the desired structure.
This is the best I have managed so far, and I am still not sure how to write the reduce function.
var map = function(){
var value = 0, n = this.Courses.length;
for(var i = 0; i < n; i++){
value += this.Courses[i].grade;
}
var avgGrade = value/n;
if(avgGrade > 90){
var key = {
tz:this.tz,
Department:this.Department,
Year:this.Year,
};
value = {students:[this.FirstName]};
}
emit(key, value);
};
var reduce = function(keysObj, valuesObj){
return 1000; //Ignore this function, I've no idea how to deal with it.
};
db.students.mapReduce(
map,
reduce,
{
out: "output",
}
)
I would highly appreciate any assistance :)
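For what it's worth, mapReduce takes exactly one map function and one reduce function (plus an optional finalize), so the filtering and the grouping both have to happen inside the single map. Below is a minimal sketch of that flow in plain JavaScript, so it can be run without a server. It assumes "grades greater than 90" means the average grade, as in the attempt above; the sample students and the runMapReduce helper are made up for illustration. In the real shell, map would read the document from this and call the global emit instead of taking them as parameters.

```javascript
// Hypothetical sample data, shaped like the documents in the question.
const students = [
  { tz: "1", FirstName: "Anton", Department: "Computer Science", Year: 1,
    Courses: [{ name: "OS", grade: 95 }, { name: "Algebra", grade: 93 }] },
  { tz: "2", FirstName: "John", Department: "Computer Science", Year: 1,
    Courses: [{ name: "OS", grade: 92 }] },
  { tz: "3", FirstName: "Dana", Department: "Computer Science", Year: 1,
    Courses: [{ name: "OS", grade: 80 }] },
  { tz: "4", FirstName: "Dean", Department: "Physics", Year: 2,
    Courses: [{ name: "Optics", grade: 95 }] },
];

// map: emit only for qualifying students, keyed by department and year.
// tz is left out of the key, otherwise every key would be unique.
function map(doc, emit) {
  const n = doc.Courses.length;
  if (n === 0) return;
  const total = doc.Courses.reduce((sum, c) => sum + c.grade, 0);
  if (total / n > 90) {
    emit({ Department: doc.Department, Year: doc.Year },
         { students: [doc.FirstName] });
  }
}

// reduce: returns the same shape that map emits, because it may be
// called again on its own output.
function reduce(key, values) {
  return { students: [].concat(...values.map(v => v.students)) };
}

// Stand-in for the server-side grouping step (illustration only).
function runMapReduce(docs) {
  const groups = new Map();
  const emit = (key, value) => {
    const id = JSON.stringify(key);
    if (!groups.has(id)) groups.set(id, { key, values: [] });
    groups.get(id).values.push(value);
  };
  docs.forEach(doc => map(doc, emit));
  // As in MongoDB, reduce is skipped for keys that have a single value.
  return [...groups.values()].map(({ key, values }) => ({
    ...key,
    students: (values.length > 1 ? reduce(key, values) : values[0]).students,
  }));
}

const result = runMapReduce(students);
console.log(JSON.stringify(result));
```

The point to carry over is that the emit call sits inside the if, and that reduce only concatenates the students arrays.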

Related

mongoDB collection.findOne() update the query with its own results

I have a query like this one:
.findOne( {
_id: myObjectId,
}, {
fields: {
smallSingleValue: 1,
HUDGE_OBJECT: 1,
}
} );
I do not need the entire HUDGE_OBJECT (it is roughly 1 MB); I only need HUDGE_OBJECT[ smallSingleValue ] (less than 1 KB).
Right now, I can either make that request and get the entire HUDGE_OBJECT, or make two requests: one to get smallSingleValue and another to get HUDGE_OBJECT[ smallSingleValue ].
Both solutions are poor.
Is there a way to do something like this:
fields: {
smallSingleValue: 1,
`HUDGE_OBJECT.${ $smallSingleValue }`: 1,
}
Obviously not with that syntax, but you get the idea.
I tried aggregation, but it's probably not the solution.
Does the projection operator allow it?
https://docs.mongodb.com/manual/reference/operator/projection/positional/
EDIT in response to the comments:
Example of data:
{
_id: xxx,
firstName: xxx,
lastName: xxx,
currentLevel: bar, // <- that one is important
progress: {
foo: { big object },
bar: { big object },
baz: { big object },
...
}
}
What I need:
{
firstName: xxx,
lastName: xxx,
currentLevel: bar,
progress: {
bar: { big object },
}
}
But the question is: performance-wise, is it better to just get the entire object (easy query) or to get a truncated object (more complex query, but a smaller object to transfer)? There are 50 "levels".
I don't remember what I tried with the aggregation, but it seemed to be a bad idea.
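One possible shape for the "truncated object" query is an aggregation that turns progress into key/value pairs, keeps the pair whose key equals currentLevel, and turns it back into an object. This is only a sketch: it assumes a server recent enough to have $objectToArray and $arrayToObject (3.4.4 or later, as far as I know), the "xxx" id is the placeholder from the example data, and keepCurrentLevel is a made-up plain-JavaScript emulation of what the $project stage is meant to compute, included so the idea can be checked without a database.

```javascript
// Pipeline sketch; field names come from the example document above.
const pipeline = [
  { $match: { _id: "xxx" } }, // placeholder id
  { $project: {
      _id: 0,
      firstName: 1,
      lastName: 1,
      currentLevel: 1,
      progress: {
        $arrayToObject: {
          $filter: {
            input: { $objectToArray: "$progress" }, // [{ k: "foo", v: {...} }, ...]
            as: "entry",
            cond: { $eq: ["$$entry.k", "$currentLevel"] },
          },
        },
      },
  } },
];

// Plain-JavaScript emulation of the $project stage (illustration only).
function keepCurrentLevel(doc) {
  const kept = Object.entries(doc.progress)
    .filter(([level]) => level === doc.currentLevel);
  return {
    firstName: doc.firstName,
    lastName: doc.lastName,
    currentLevel: doc.currentLevel,
    progress: Object.fromEntries(kept),
  };
}

const trimmed = keepCurrentLevel({
  _id: "xxx", firstName: "A", lastName: "B", currentLevel: "bar",
  progress: { foo: { n: 1 }, bar: { n: 2 }, baz: { n: 3 } },
});
console.log(JSON.stringify(trimmed));
```

Whether this beats fetching the whole document depends on how much the network transfer costs compared with the extra server-side work, so it is worth measuring both on real data.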

Embedding fields in all mongodb documents

I have a collection with documents that follow this structure:
child:
{
id: int
name: string
age: int
dob: date
school: string
class: string
}
I would like to embed certain fields, into something like this:
child:
{
id : int
personalInfo {
name: string
age: int
dob: date
}
educationInfo {
school: string
class: string
}
}
How would one go about doing this in code? I am new to MongoDB, so I apologize if my syntax is incorrect. All of the fields have one-to-one relationships with the child (i.e. one child has one id, one name, one age, one school, etc.), so I'm also wondering if embedding is even necessary.
Try using $set to create the new personalInfo and educationInfo fields, and $unset to remove the old fields (age, name, and so on). Before doing so, it is better to check that all of those fields exist with $exists. Sample code:
var personFields = [ "name", "age", "dob" ];
var educationFields = [ "school", "class" ];
// Match only documents that still have every flat field.
var query = {};
personFields.forEach(function(k){ query[k] = {$exists: 1}; });
educationFields.forEach(function(k){ query[k] = {$exists: 1}; });
db.collection.find(query).forEach(function(doc){
    var personalInfo = {};
    var educationInfo = {};
    // Copy each flat field into the sub-document it belongs to.
    for (var k in doc) {
        if (personFields.indexOf(k) !== -1){
            personalInfo[k] = doc[k];
        } else if (educationFields.indexOf(k) !== -1) {
            educationInfo[k] = doc[k];
        }
    }
    // Add the embedded documents and drop the flat fields in one update.
    db.collection.update({_id: doc._id},
        {$set: {
            personalInfo: personalInfo,
            educationInfo: educationInfo},
         $unset: {'name': '',
                  'age': '',
                  'dob': '',
                  'school': '',
                  'class': ''}});
});
It's OK to embed them; that's what document databases are for. So if you need a migration, you'll basically use MongoDB's functions such as update, with $set and $unset.
See more here: https://docs.mongodb.org/manual/reference/method/db.collection.update/
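If it helps to test the regrouping logic before touching real data, the $set/$unset document can be built by a small pure function. This is a sketch in plain JavaScript; GROUPS and buildRegroupUpdate are hypothetical names, and the sample values are invented. Unlike a hard-coded $unset, it only unsets fields that are actually present on the document.

```javascript
// Which flat fields go into which embedded document (from the question).
const GROUPS = {
  personalInfo: ["name", "age", "dob"],
  educationInfo: ["school", "class"],
};

// Build the { $set, $unset } update for one flat document.
function buildRegroupUpdate(doc) {
  const update = { $set: {}, $unset: {} };
  for (const [target, fields] of Object.entries(GROUPS)) {
    const sub = {};
    for (const field of fields) {
      if (field in doc) {
        sub[field] = doc[field];    // copy into the embedded document
        update.$unset[field] = "";  // and schedule the flat field for removal
      }
    }
    update.$set[target] = sub;
  }
  return update;
}

const update = buildRegroupUpdate({
  id: 1, name: "Ann", age: 9, dob: "2015-03-01", school: "North", class: "3B",
});
console.log(JSON.stringify(update));
```

The returned object could then be passed as the second argument of an update on {_id: doc._id}.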

Scope work strangely in mapReduce of MongoDB for the purpose of producing cumulative frequency

I have a collection called user, and I want to get the cumulative frequency of the number of users by date, based on the _id field. The desired result should be something like this:
{
{_id: 2013-12-02, value: 10}, // up to 2013-12-02 there are 10 users
{_id: 2014-01-05, value: 20}, // up to 2014-01-05 there are 20 users in total
….
}
I tried to get the above using the following mapReduce call:
db.user.mapReduce(
function(){var date = this._id.getTimestamp();
emit(new Date(date.getFullYear()+"-"+date.getMonth()+"-"+date.getDate()), 1)},
function(key, values) {cum = cum + Array.sum(values); return cum},
{out: "newUserAnalysis",
sort: {_id: 1},
scope: {cum: 0}})
But it seems that the cum variable is reset to zero after the first return statement in the reduce function. Why? Is there another way to get what I want?
Many thanks.
cum should not be reset, as it is a global variable shared by the map, reduce and finalize functions for the whole mapReduce run.
However, the reduce function has 3 requirements that must be observed for processing to be correct, particularly with bulky data, since reduce can be called repeatedly, even for the same key. Normally the length of the values array passed to reduce would not exceed 100. In short, your design cannot guarantee that cum is updated in the sequence you expect, which produces incorrect statistics.
The following code is for your reference:
// map and reduce per day then save to a collection.
db.user.mapReduce(function() {
var date = this._id.getTimestamp();
emit(new Date(date.getFullYear() + "-" + (date.getMonth() + 1) + "-"
+ date.getDate()), 1);
}, function(key, values) {
return Array.sum(values);
}, {
out : "newUserAnalysis",
sort : {
_id : 1
}
});
// Do accumulation one by one.
var cursor = db.newUserAnalysis.find().sort({_id:1});
var newValue = 0, first = true;
while (cursor.hasNext()) {
var doc = cursor.next();
newValue += doc.value;
if (first) {
first = false;
} else {
db.newUserAnalysis.update({_id:doc._id}, {$set:{value:newValue}});
}
}
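The two phases above (count per day, then a running total over the sorted days) can be tried out in plain JavaScript without a server. This is only a sketch under assumptions: the sign-up dates are invented, days are taken in UTC, and countPerDay and accumulate are hypothetical helper names.

```javascript
// Phase 1 stand-in: count sign-ups per calendar day, sorted by day.
function countPerDay(dates) {
  const counts = new Map();
  for (const d of dates) {
    const day = d.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
    counts.set(day, (counts.get(day) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([_id, value]) => ({ _id, value }))
    .sort((a, b) => (a._id < b._id ? -1 : a._id > b._id ? 1 : 0));
}

// Phase 2: replace each daily count with the running total so far.
function accumulate(rows) {
  let total = 0;
  return rows.map(row => {
    total += row.value;
    return { _id: row._id, value: total };
  });
}

const signUps = [
  new Date("2013-12-02T08:00:00Z"),
  new Date("2013-12-02T17:30:00Z"),
  new Date("2014-01-05T09:15:00Z"),
  new Date("2013-12-20T12:00:00Z"),
];
const cumulative = accumulate(countPerDay(signUps));
console.log(JSON.stringify(cumulative));
```

Keeping the accumulation outside reduce is what makes the result independent of how often, and in which order, reduce is invoked.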

sql server Row Number with partition over in MongoDB for returning a subset of rows

How can I write the query below using the MongoDB C# driver?
SELECT SubSet.*
FROM ( SELECT T.ProductName ,
T.Price ,
ROW_NUMBER() OVER ( PARTITION BY T.ProductName ORDER BY T.ProductName ) AS ProductRepeat
FROM myTable T
) SubSet
WHERE SubSet.ProductRepeat = 1
What I am trying to achieve is
Collection
ProductName|Price|SKU
Cap|10|AB123
Bag|5|ED567
Cap|20|CD345
Cap|5|EC123
The expected result is
ProductName|Price|SKU
Cap|10|AB123
Bag|5|ED567
Here is one attempt (please ignore the object and field names):
public List<ProductOL> Search(ProductOL obj, bool topOneOnly)
{
List<ProductOL> products = new List<ProductOL>();
var database = MyMongoClient.Instance.OpenToRead(dbName: ConfigurationManager.AppSettings["MongoDBDefaultDB"]);
var collection = database.GetCollection<RawBsonDocument>("Products");
List<IMongoQuery> build = new List<IMongoQuery>();
if (!string.IsNullOrEmpty(obj.ProductName))
{
var ProductNameQuery = Query.Matches("ProductName", new BsonRegularExpression(obj.ProductName, "i"));
build.Add(ProductNameQuery);
}
if (!string.IsNullOrEmpty(obj.BrandName))
{
var brandNameQuery = Query.Matches("BrandName", new BsonRegularExpression(obj.BrandName, "i"));
build.Add(brandNameQuery);
}
var fullQuery = Query.And(build.ToArray());
products = collection.FindAs<ProductOL>(fullQuery).SetSortOrder(SortBy.Ascending("ProductName")).ToList();
if (topOneOnly)
{
var tmpProducts = new List<ProductOL>();
foreach (var item in products)
{
if (tmpProducts.Any(x => x.ProductName== item.ProductName)) { }
else
tmpProducts.Add(item);
}
products = tmpProducts;
}
return products;
}
My Mongo query works and gives me the right results, but it is not efficient when I am dealing with huge amounts of data, so I was wondering whether MongoDB has any concepts like SQL Server's ROW_NUMBER() and partitioning.
If your query returns the expected results but isn't efficient, you should look into index usage with explain(). Given your query generation code includes conditional clauses, it seems likely you will need multiple indexes to efficiently cover common variations.
I'm not sure how the C# code you've provided relates to the original SQL query, as they seem to be entirely different. I'm also not clear how grouping is expected to help your query performance, aside from limiting the results returned.
Equivalent of the SQL query
There is no direct equivalent of ROW_NUMBER() .. PARTITION BY grouping in MongoDB, but you should be able to work out the desired result using either the Aggregation Framework (fastest) or Map/Reduce (slower but more functionality). The MongoDB manual includes an Aggregation Commands Comparison as well as usage examples.
As an exercise in translation, I'll focus on your SQL query which is pulling out the first product match by ProductName:
SELECT SubSet.*
FROM ( SELECT T.ProductName ,
T.Price ,
ROW_NUMBER() OVER ( PARTITION BY T.ProductName ORDER BY T.ProductName ) AS ProductRepeat
FROM myTable T
) SubSet
WHERE SubSet.ProductRepeat = 1
Setting up the test data you provided:
db.myTable.insert([
{ ProductName: 'Cap', Price: 10, SKU: 'AB123' },
{ ProductName: 'Bag', Price: 5, SKU: 'ED567' },
{ ProductName: 'Cap', Price: 20, SKU: 'CD345' },
{ ProductName: 'Cap', Price: 5, SKU: 'EC123' },
])
Here's an aggregation query in the mongo shell which will find the first match per group (ordered by ProductName). It should be straightforward to translate that aggregation query to the C# driver using the MongoCollection.Aggregate() method.
I've included comments with the rough equivalent SQL fragment in your original query.
db.myTable.aggregate([
    // Apply a sort order so the $first product is somewhat predictable
    // ("ORDER BY T.ProductName")
    { $sort: {
        ProductName: 1
        // Should really have an additional sort by Price or SKU (otherwise order may change)
    }},
    // Group by Product Name
    // ("PARTITION BY T.ProductName")
    { $group: {
        _id: "$ProductName",
        // Find first matching product details per group (can use $$CURRENT in MongoDB 2.6 or list specific fields)
        // "SELECT SubSet.* ... WHERE SubSet.ProductRepeat = 1"
        Price: { $first: "$Price" },
        SKU: { $first: "$SKU" },
    }},
    // Rename _id to match expected results
    { $project: {
        _id: 0,
        ProductName: "$_id",
        Price: 1,
        SKU: 1,
    }}
])
Results given the test data appear to be what you were looking for:
{ "Price" : 10, "SKU" : "AB123", "ProductName" : "Cap" }
{ "Price" : 5, "SKU" : "ED567", "ProductName" : "Bag" }
Notes:
This aggregation query uses the $first operator, so if you want to find the second or third product per grouping you'd need a different approach (e.g. $group and then take the subset of results needed in your application code).
If you want predictable results for finding the first item in a $group there should be more specific sort criteria than ProductName (for example, sorting by ProductName & Price or ProductName & SKU). Otherwise the order of results may change in future as documents are added or updated.
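To make the second note concrete, here is a plain-JavaScript emulation of "sort, then keep the first row of each ProductName partition", using the test data from above. It is a sketch, not driver code: firstPerProduct and byText are hypothetical helpers, and the secondary sort on SKU is an assumption added so that "first" is deterministic.

```javascript
const rows = [
  { ProductName: "Cap", Price: 10, SKU: "AB123" },
  { ProductName: "Bag", Price: 5, SKU: "ED567" },
  { ProductName: "Cap", Price: 20, SKU: "CD345" },
  { ProductName: "Cap", Price: 5, SKU: "EC123" },
];

// Plain string comparison, independent of locale settings.
function byText(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Equivalent of ROW_NUMBER() ... = 1: sort, then keep the first row per name.
function firstPerProduct(docs) {
  const sorted = [...docs].sort((a, b) =>
    byText(a.ProductName, b.ProductName) || byText(a.SKU, b.SKU));
  const seen = new Set();
  return sorted.filter(doc => {
    if (seen.has(doc.ProductName)) return false;
    seen.add(doc.ProductName);
    return true;
  });
}

const firsts = firstPerProduct(rows);
console.log(JSON.stringify(firsts));
```

With the SKU tie-breaker the Cap row that survives is always AB123; without it, which Cap comes first would depend on the order in which the documents happen to be returned.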
Thanks to @Stennie: with the help of his answer I was able to come up with this C# aggregation code:
var match = new BsonDocument
{
{
"$match",
new BsonDocument{
{"ProductName", new BsonRegularExpression("cap", "i")}
}
}
};
var group = new BsonDocument
{
{"$group",
new BsonDocument
{
{"_id", "$ProductName"},
{"SKU", new BsonDocument{
{
"$first", "$SKU"
}}
}
}}
};
var project = new BsonDocument{
{
"$project",
new BsonDocument
{
{"_id", 0 },
{"ProductName","$_id" },
{"SKU", 1}
}}};
var sort = new BsonDocument{
{
"$sort",
new BsonDocument
{
{
"ProductName",1 }
}
}};
var pipeline = new[] { match, group, project, sort };
var aggResult = collection.Aggregate(pipeline);
var products= aggResult.ResultDocuments.Select(BsonSerializer.Deserialize<ProductOL>).ToList();
Using AggregateArgs
AggregateArgs args = new AggregateArgs();
List<BsonDocument> piple = new List<BsonDocument>();
piple.Add(match);
piple.Add(group);
piple.Add(project);
piple.Add(sort);
args.Pipeline = piple;
// var pipeline = new[] { match, group, project, sort };
var aggResult = collection.Aggregate(args);
products = aggResult.Select(BsonSerializer.Deserialize<ProductOL>).ToList();

Inserting a Document As a Field

Let's say I have a company collection:
[{
name: 'x',
code: 'a',
},
{
name: 'y',
code: 'b',
}]
I want to find the company with code 'a' and insert this to another collection called projects. I wrote something like this:
var collectionP = db.collection('projects');
var collectionC = db.collection('company');
var foundCompany = collectionC.find({code: 'a'});
db.collectionP.insert(name: 'project1', company: foundCompany);
This doesn't work. Any idea?
Calling find returns a cursor, not a result set.
Your options are to iterate over the cursor, if you expect multiple results:
var foundCompany = collectionC.find({code: 'a'});
foundCompany.forEach(function(fc) {
    collectionP.insert({name: 'project1', company: fc});
});
or to convert it into an array, if you want all results in one document:
var foundCompany = collectionC.find({code: 'a'}).toArray();
collectionP.insert({name: 'project1', company: foundCompany});
or, if you only expect to match a single company, to use findOne or its equivalent in your language:
var foundCompany = collectionC.findOne({code: 'a'});
collectionP.insert({name: 'project1', company: foundCompany});