Join Through Map reduce - mongodb

I have one collection in which student_id is the primary key:
test1:{student_id:"xxxxx"},
I have another collection in which student_id is stored inside an array:
class:{"class":"I",students:[{"student_id":"xxxx"}]}
My problem is that I want to join these two collections on student_id.
I am using map reduce with out set to "merge", but it doesn't work.
My MR query is as follows.
db.runCommand({ mapreduce: "test1",
map : function Map() {
emit(this._id,this);
},
reduce : function Reduce(key, values) {
return values;
},
out : { merge: "testmerge" }
});
db.runCommand({ mapreduce: "class",
map : function Map() {
emit(this._id,this);
},
reduce : function Reduce(key, values) {
return values;
},
out : { merge: "testmerge" }
});
But it inserts two rows.
Can someone guide me on this? I am very new to MR.
As in the example, I want to get the details of all students from the "test1" collection who are studying in class "I".

Your requirement seems to be:
As in the example, I want to get the details of all students from the "test1" collection who are studying in class "I".
In order to do that, store which classes a student is in with the student:
{
student_id: "xxxxx",
classes: ["I"],
},
Then you can just ask for all of those students' information with:
db.students.find( { classes: "I" } );
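This avoids the need for a slow and complex map reduce job. In general, you should avoid Map/Reduce: the map and reduce functions run as JavaScript over every input document, only the optional query filter can use an index, and older server versions could not run the JavaScript concurrently. You also need to understand that in MongoDB a query operates on one collection only. There is no join in the relational sense (the $lookup aggregation stage added in MongoDB 3.2 is the closest thing), and trying to emulate one with Map/Reduce is a bad idea. Instead, you can do it with two queries: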
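// find all students in class "I"
// (assumes each class document holds a students array of { student_id: ... } entries):
var ids = [];
db.classes.find( { class: "I" } ).forEach(function(c) {
c.students.forEach(function(s) { ids.push( s.student_id ); });
});
// then with the result, find all of those students' information:
db.students.find( { student_id: { $in: ids } } );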
But I would strongly recommend you redesign your schema and store the classes with each student. As a general hint, in MongoDB you would store the relation between documents on the other side as compared to a relational database.
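For what it's worth, the two-query approach can be sketched in plain JavaScript, with arrays standing in for the two collections. The sample documents, the shape of the students array (objects carrying a student_id), and the helper name studentsInClass are all assumptions made for illustration; nothing here talks to a real server.

```javascript
// In-memory stand-ins for the two collections (hypothetical sample data).
const classes = [
  { class: "I", students: [{ student_id: "s1" }, { student_id: "s2" }] },
  { class: "II", students: [{ student_id: "s3" }] }
];
const students = [
  { student_id: "s1", name: "Ann" },
  { student_id: "s2", name: "Bob" },
  { student_id: "s3", name: "Cid" }
];

// Step 1: collect the ids listed in the matching class documents.
// Step 2: keep the students whose id is in that set (the $in lookup).
function studentsInClass(className) {
  const ids = new Set();
  classes
    .filter(function (c) { return c.class === className; })
    .forEach(function (c) {
      c.students.forEach(function (s) { ids.add(s.student_id); });
    });
  return students.filter(function (s) { return ids.has(s.student_id); });
}

console.log(studentsInClass("I").map(function (s) { return s.name; }));
```

The first pass plays the role of the query on the class collection and the second pass plays the role of the $in query, so the "join" happens in the application rather than in the database.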

Related

Mongodb distinct on a key in a key value pair situation

I have the document structure below in MongoDB, and I want to get the distinct keys from the customData field.
So for the documents below, I want my result to be: key1, key2, key3, key4.
Doing
db.coll.distinct("customData")
returns the values and not the keys.
{
"_id":ObjectId("56c4da4f681ec51d32a4053d"),
"accountUnique":7356464,
"customData":{
"key1":1,
"key2":2,
}
}
{
"_id":ObjectId("56c4da4f681ec51d32a4054d"),
"accountUnique":7356464,
"customData":{
"key3":1,
"key4":2,
}
}
This is possible with Map-Reduce. You need it here because the subdocument keys are dynamic, and the distinct method returns values, not keys.
Running the following mapreduce operation will populate a separate collection with all the keys as the _id values:
var myMapReduce = db.runCommand({
"mapreduce": "coll",
"map" : function() {
for (var key in this.customData) { emit(key, null); }
},
"reduce" : function() {},
"out": "coll_keys"
})
To get a list of all the dynamic keys, run distinct on the resulting collection:
db[myMapReduce.result].distinct("_id")
will give you the sample output
["key1", "key2", "key3", "key4"]
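To make the idea concrete, here is a minimal plain-JavaScript sketch of what the map step and the output collection achieve together. The docs array is a hand-copied stand-in for the two sample documents, and distinctKeys is a hypothetical helper name, not a MongoDB API.

```javascript
// Hypothetical in-memory copy of the two sample documents.
const docs = [
  { accountUnique: 7356464, customData: { key1: 1, key2: 2 } },
  { accountUnique: 7356464, customData: { key3: 1, key4: 2 } }
];

// Mirrors the map step: one "emit" per key of the chosen subdocument.
// The Set plays the role of the output collection's unique _id values.
function distinctKeys(documents, field) {
  const seen = new Set();
  documents.forEach(function (doc) {
    Object.keys(doc[field] || {}).forEach(function (key) { seen.add(key); });
  });
  return Array.from(seen).sort();
}

console.log(distinctKeys(docs, "customData"));
```

A key that appears in several documents is still reported once, which is exactly what keying the output collection on _id gives you.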

Can MongoDB aggregate "top x" results in this document schema?

{
"_id" : "user1_20130822",
"metadata" : {
"date" : ISODate("2013-08-22T00:00:00.000Z"),
"username" : "user1"
},
"tags" : {
"abc" : 19,
"123" : 2,
"bca" : 64,
"xyz" : 14,
"zyx" : 12,
"321" : 7
}
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags: E.g., Top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if I have multiple documents that need to be combined together before getting the top? e.g., top tags for a user in a given month
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in place $inc's easier (many more of these happening than reads).
Or do I need to return back the whole document, and defer to the client on the sorting/limiting?
When you use object keys as tag names, you are making this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects (at least before $objectToArray arrived in MongoDB 3.4.4). But there is always MapReduce.
Have your map function emit one key/value pair for each entry in the tags subdocument. It should look something like this:
var mapFunction = function() {
for (var key in this.tags) {
emit(key, this.tags[key]);
}
}
Your reduce-function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
var sum = 0;
for (var i = 0; i < values.length; i++) {
sum += values[i];
}
return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
{
mapReduce: "yourcollection", // the collection where your data is stored
query: { _id : "user1_20130822" }, // or however you want to limit the results
map: mapFunction,
reduce: reduceFunction,
out: { inline: 1 } // return the results directly instead of writing them to a collection
}
)
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but they apply to the input documents before the map step, and sort only works on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results at the application level. If you insist on doing the sorting and limiting in the database, define an output collection to store the mapReduce results in (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output collection and run the MapReduce in the background at regular intervals.
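If you do end up sorting at the application level, the whole thing is only a few lines. The sketch below sums the tag counters across documents (the job of the reduce function) and then sorts and limits in plain JavaScript. The second daily document and the helper name topTags are made up for illustration.

```javascript
// Hypothetical daily documents for one user; the first mirrors the sample.
const days = [
  { _id: "user1_20130822",
    tags: { abc: 19, "123": 2, bca: 64, xyz: 14, zyx: 12, "321": 7 } },
  { _id: "user1_20130823", tags: { abc: 50, xyz: 1 } }
];

// Sum each tag across the documents, then sort descending and keep `limit`.
function topTags(documents, limit) {
  const totals = {};
  documents.forEach(function (doc) {
    for (const tag in doc.tags) {
      totals[tag] = (totals[tag] || 0) + doc.tags[tag];
    }
  });
  return Object.keys(totals)
    .map(function (tag) { return { _id: tag, value: totals[tag] }; })
    .sort(function (a, b) { return b.value - a.value; })
    .slice(0, limit);
}

console.log(topTags(days.slice(0, 1), 3)); // single day
console.log(topTags(days, 3));             // both days combined
```

The result uses the same { _id, value } shape that mapReduce output has, so the sort and slice at the end could be applied unchanged to inline results returned by the server.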

Counting field names (not values) in mongo database

Is there a way to count field names in mongodb? I have a mongo database of documents with other embedded documents within them. Here is an example of what the data might look like.
{
"incident": "osint181",
"summary":"Something happened",
"actor": {
"internal": {
"motive": [
"Financial"
],
"notes": "",
"role": [
"Malicious"
],
"variety": [
"Cashier"
]
}
}
}
Another document might look like this:
{
"incident": "osint182",
"summary":"Something happened",
"actor": {
"external": {
"motive": [
"Financial"
],
"notes": "",
"role": [
"Malicious"
],
"variety": [
"Hacker"
]
}
}
}
As you can see, the actor has changed from internal to external in the second document. What I would like to be able to do is count the number of incidents for each type of actor. My first attempt looked like this:
db.public.aggregate( { $group : { _id : "$actor", count : { $sum : 1 }}} );
But that gave me the entire subdocument and the count reflected how many documents were exactly the same. Rather I was hoping to get a count for internal and a count for external, etc. Is there an elegant way to do that? If not elegant, can someone give me a dirty way of doing that?
The best option for this kind of problem is MongoDB's map-reduce. It lets you iterate through all the keys of a document, and you can easily add your own complex logic. Check out the map reduce examples here: http://docs.mongodb.org/manual/applications/map-reduce/
This was the answer I came up with based on the hint from Devesh. I created a map function that looks at the value of actor and uses an isEmptyObject helper that I defined to check whether it is an empty object. If actor is empty, the function emits "Unknown"; otherwise, rather than emitting the value of each key, it emits the key itself, which will be named internal, or external, or whatever.
The magic here was the scope option in mapReduce, which puts my isEmptyObject in scope for the map function. The results are written to a collection which I named temp. After gathering the information I want from the temp collection, I drop it.
var isEmptyObject = function(obj) {
for (var name in obj) {
return false;
}
return true;
};
var mapFunction = function() {
if (isEmptyObject(this.actor)) {
emit("Unknown", 1);
} else {
for (var key in this.actor) { emit(key, 1); }
}
};
var reduceFunction = function(inKeys, counter) {
return Array.sum(counter);
};
db.public.mapReduce(mapFunction, reduceFunction, {out:"temp", scope:{isEmptyObject:isEmptyObject}} );
foo = db.temp.aggregate(
{ $sort : { value : -1 }});
db.temp.drop();
printjson(foo)
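For anyone who wants to check the counting logic without a database, here is a rough plain-JavaScript equivalent. The incidents array is a trimmed, partly invented stand-in for the collection (osint183 and osint184 are hypothetical), and countActorTypes is just an illustrative name.

```javascript
// Trimmed stand-ins for incident documents (osint183/osint184 are made up).
const incidents = [
  { incident: "osint181", actor: { internal: { variety: ["Cashier"] } } },
  { incident: "osint182", actor: { external: { variety: ["Hacker"] } } },
  { incident: "osint183", actor: { external: { variety: ["Hacker"] } } },
  { incident: "osint184", actor: {} }
];

// Count one hit per key found under actor; documents with no actor keys
// are counted under "Unknown".
function countActorTypes(documents) {
  const counts = {};
  documents.forEach(function (doc) {
    const keys = Object.keys(doc.actor || {});
    (keys.length ? keys : ["Unknown"]).forEach(function (key) {
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  return counts;
}

console.log(countActorTypes(incidents));
```

The inner loop corresponds to the emit calls and the counts object to the reduced output, so a document with both internal and external actors would be counted once under each.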

$unwind an object in aggregation framework

In the MongoDB aggregation framework, I was hoping to use the $unwind operator on an object (ie. a JSON collection). Doesn't look like this is possible, is there a workaround? Are there plans to implement this?
For example, take the article collection from the aggregation documentation . Suppose there is an additional field "ratings" that is a map from user -> rating. Could you calculate the average rating for each user?
Other than this, I'm quite pleased with the aggregation framework.
Update: here's a simplified version of my JSON collection per request. I'm storing genomic data. I can't really make genotypes an array, because the most common lookup is to get the genotype for a random person.
variants: [
{
name: 'variant1',
genotypes: {
person1: 2,
person2: 5,
person3: 7,
}
},
{
name: 'variant2',
genotypes: {
person1: 3,
person2: 3,
person3: 2,
}
}
]
It is not possible to do the type of computation you are describing with the aggregation framework - and it's not because there is no $unwind method for non-arrays. Even if the person:value objects were documents in an array, $unwind would not help.
The "group by" functionality (whether in MongoDB or in any relational database) is done on the value of a field or column. We group by value of field and sum/average/etc based on the value of another field.
A simple example is a variant of what you suggest: a ratings field added to the example article collection, not as a map from user to rating but as an array like this:
{ title : "title of article", ...
ratings: [
{ voter: "user1", score: 5 },
{ voter: "user2", score: 8 },
{ voter: "user3", score: 7 }
]
}
Now you can aggregate this with:
[ {$unwind: "$ratings"},
{$group : {_id : "$ratings.voter", averageScore: {$avg:"$ratings.score"} } }
]
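To see what that pipeline computes, here is a small plain-JavaScript emulation of the two stages. The articles array is hypothetical sample data in the array-of-ratings shape, and averageScoreByVoter is an illustrative helper, not part of any MongoDB API.

```javascript
// Hypothetical articles using the array-of-ratings shape.
const articles = [
  { title: "a1", ratings: [{ voter: "user1", score: 5 }, { voter: "user2", score: 8 }] },
  { title: "a2", ratings: [{ voter: "user1", score: 7 }] }
];

function averageScoreByVoter(docs) {
  const acc = {}; // voter -> running { sum, count }
  docs.forEach(function (doc) {
    // Emulates $unwind: visit every ratings element on its own.
    doc.ratings.forEach(function (rating) {
      // Emulates $group on the voter value with an $avg accumulator.
      const entry = acc[rating.voter] || (acc[rating.voter] = { sum: 0, count: 0 });
      entry.sum += rating.score;
      entry.count += 1;
    });
  });
  return Object.keys(acc).map(function (voter) {
    return { _id: voter, averageScore: acc[voter].sum / acc[voter].count };
  });
}

console.log(averageScoreByVoter(articles));
```

The grouping works because the voter name is a value (rating.voter) rather than a key, which is the whole point of this schema.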
But this example structured as you describe it would look like this:
{ title : "title of article", ...
ratings: {
user1: 5,
user2: 8,
user3: 7
}
}
or even this:
{ title : "title of article", ...
ratings: [
{ user1: 5 },
{ user2: 8 },
{ user3: 7 }
]
}
Even if you could $unwind this, there is nothing to aggregate on here. Unless you know the complete list of all possible keys (users) you cannot do much with this. [*]
An analogous relational DB schema to what you have would be:
CREATE TABLE T (
user1 integer,
user2 integer,
user3 integer
...
);
That's not what would be done, instead we would do this:
CREATE TABLE T (
username varchar(32),
score integer
);
and now we aggregate using SQL:
select username, avg(score) from T group by username;
There is an enhancement request for MongoDB that may allow you to do this in the aggregation framework in the future: the ability to project values to keys and vice versa. Meanwhile, there is always map/reduce.
[*] There is a complicated way to do this if you know all unique keys (you can find all unique keys with a method similar to this). But if you know all the keys, you may as well just run a sequence of queries of the form db.articles.find({"ratings.user1":{$exists:true}},{_id:0,"ratings.user1":1}) for each userX. Each query returns all of that user's ratings, which you can sum and average simply enough, rather than write the very complex projection the aggregation framework would require.
Since 3.4.4, you can transform an object into an array of key/value pairs using $objectToArray.
See:
https://docs.mongodb.com/manual/reference/operator/aggregation/objectToArray/
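As a sketch of how that could apply to the genotype data in the question: the pipeline constant below is what one might pass to aggregate() on 3.4.4 or later, assuming each variant is stored as its own document in a collection named variants (both assumptions, since the question only shows the nested shape). The pipeline is only defined, not executed; the functions after it emulate the same steps in plain JavaScript so the arithmetic can be checked.

```javascript
// Assumed usage: db.variants.aggregate(pipeline) on MongoDB 3.4.4+.
// $objectToArray turns { person1: 2 } into [ { k: "person1", v: 2 } ].
const pipeline = [
  { $project: { genotype: { $objectToArray: "$genotypes" } } },
  { $unwind: "$genotype" },
  { $group: { _id: "$genotype.k", average: { $avg: "$genotype.v" } } }
];

// Sample data from the question, one document per variant.
const variants = [
  { name: "variant1", genotypes: { person1: 2, person2: 5, person3: 7 } },
  { name: "variant2", genotypes: { person1: 3, person2: 3, person3: 2 } }
];

// Plain-JavaScript emulation of $objectToArray.
function objectToArray(obj) {
  return Object.keys(obj).map(function (k) { return { k: k, v: obj[k] }; });
}

// Emulates $unwind + $group: average the values seen for each key.
function averageByKey(docs) {
  const acc = {};
  docs.forEach(function (doc) {
    objectToArray(doc.genotypes).forEach(function (pair) {
      const entry = acc[pair.k] || (acc[pair.k] = { sum: 0, count: 0 });
      entry.sum += pair.v;
      entry.count += 1;
    });
  });
  const result = {};
  Object.keys(acc).forEach(function (k) { result[k] = acc[k].sum / acc[k].count; });
  return result;
}

console.log(averageByKey(variants));
```

Once the keys have been turned into k values, the grouping is an ordinary group-by on a field value, which is the limitation the accepted answer describes.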
This is an old question, but I've run across a tidbit of information through trial and error that people may find useful.
It's actually possible to unwind on a dummy value by fooling the parser this way:
db.Opportunity.aggregate(
{ $project: {
Field1: 1, Field2: 1, Field3: 1,
DummyUnwindField: { $ifNull: [null, [1.0]] }
}
},
{ $unwind: "$DummyUnwindField" }
);
This will produce 1 row per document, regardless of whether or not the value exists. You may be able to tinker with this to generate the results you want. I had hoped to combine this with multiple $unwinds (sort of like emit() in map/reduce), but alas, the last $unwind wins, or they combine as an intersection rather than a union, which makes it impossible to achieve the results I was looking for.
I am sadly disappointed with the aggregation framework's functionality, as it doesn't fit the one use case I was hoping to use it for (one that, strangely, a lot of the questions on StackOverflow in this area seem to be asking about): ordering results based on match rate. Improving the poor map reduce performance would have made this entire feature unnecessary.
This is what I found & extended.
Let's create an experimental database in mongo:
db.copyDatabase('livedb' , 'experimentdb')
Now use experimentdb and convert the object to an array in your experimentcollection:
db.getCollection('experimentcollection').find({}).forEach(function(e){
// ratings is the name of the object to be wrapped in an array
if (e.ratings && !Array.isArray(e.ratings)) {
e.ratings = [e.ratings];
db.experimentcollection.save(e);
}
})
Some nerdy js code to convert json to flat object
var flatArray = [];
var data = db.experimentcollection.find().toArray();
for (var index = 0; index < data.length; index++) {
var flatObject = {};
for (var prop in data[index]) {
var value = data[index][prop];
if (Array.isArray(value) && prop === 'ratings') {
for (var i = 0; i < value.length; i++) {
for (var inProp in value[i]) {
flatObject[inProp] = value[i][inProp];
}
}
}else{
flatObject[prop] = value;
}
}
flatArray.push(flatObject);
}
printjson(flatArray);

In mongo, how do I use map reduce to get a group by ordered by most recent

The map reduce examples I see use aggregation functions like count, but what is the best way to get, say, the top 3 items in each category using map reduce?
I'm assuming I can also use the group function, but I was curious since the docs state that sharded environments cannot use group(). However, I'm actually interested in seeing a group() example as well.
For the sake of simplification, I'll assume you have documents of the form:
{category: <int>, score: <int>}
I've created 1000 documents covering 100 categories with:
for (var i=0; i<1000; i++) {
db.foo.save({
category: parseInt(Math.random() * 100),
score: parseInt(Math.random() * 100)
});
}
Our mapper is pretty simple, just emit the category as key, and an object containing an array of scores as the value:
mapper = function () {
emit(this.category, {top:[this.score]});
}
MongoDB's reducer cannot return an array, and the reducer's output must be of the same type as the values we emit, so we must wrap it in an object. We need an array of scores, as this will let our reducer compute the top 3 scores:
reducer = function (key, values) {
var scores = [];
values.forEach(
function (obj) {
obj.top.forEach(
function (score) {
scores[scores.length] = score;
});
});
scores.sort(function (a, b) { return b - a; }); // numeric sort, highest first
return {top:scores.slice(0, 3)};
}
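Two properties of a reducer like this are worth checking outside the database. MongoDB may call reduce more than once for the same key, feeding earlier results back in, so reducing partial results has to give the same answer as reducing everything at once. The scores also have to be compared numerically, because Array.prototype.sort without a comparator compares elements as strings. The sketch below is a standalone variant written for this check (topThree and the emitted values are hypothetical), using a numeric comparator.

```javascript
// Standalone reducer variant with a numeric, descending comparator.
function topThree(key, values) {
  var scores = [];
  values.forEach(function (obj) {
    obj.top.forEach(function (score) { scores.push(score); });
  });
  scores.sort(function (a, b) { return b - a; });
  return { top: scores.slice(0, 3) };
}

// Hypothetical values emitted by the mapper for one category.
var emitted = [{ top: [9] }, { top: [82] }, { top: [65] }, { top: [100] }, { top: [6] }];

// Reduce everything at once, then in two passes (a "re-reduce").
var allAtOnce = topThree(1, emitted);
var partial = topThree(1, emitted.slice(0, 2));
var reReduced = topThree(1, [partial].concat(emitted.slice(2)));
console.log(allAtOnce.top, reReduced.top);

// For comparison: the default sort orders the numbers as strings.
var stringSorted = [9, 82, 65, 100, 6].sort().reverse();
console.log(stringSorted.slice(0, 3));
```

Both reductions agree on the same three scores, while the string-ordered list puts 9 ahead of 82 and drops 100 from the top three entirely.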
Finally, invoke the map-reduce:
db.foo.mapReduce(mapper, reducer, "top_foos");
Now we have a collection containing one document per category, and the top 3 scores across all documents from foo in that category:
{ "_id" : 0, "value" : { "top" : [ 93, 89, 86 ] } }
{ "_id" : 1, "value" : { "top" : [ 82, 65, 6 ] } }
(Your exact values will vary, since the data generator above uses Math.random().)
You can now use this to query foo for the actual documents having those top scores:
function find_top_scores(categories) {
var query = [];
db.top_foos.find({_id:{$in:categories}}).forEach(
function (topscores) {
query[query.length] = {
category:topscores._id,
score:{$in:topscores.value.top}
};
});
return db.foo.find({$or:query});
}
This code won't handle ties, or rather, if ties exist, more than 3 documents might be returned in the final cursor produced by find_top_scores.
The solution using group would be somewhat similar, though the reducer will only have to consider two documents at a time, rather than an array of scores for the key.