Can MongoDB aggregate "top x" results in this document schema?

{
    "_id" : "user1_20130822",
    "metadata" : {
        "date" : ISODate("2013-08-22T00:00:00.000Z"),
        "username" : "user1"
    },
    "tags" : {
        "abc" : 19,
        "123" : 2,
        "bca" : 64,
        "xyz" : 14,
        "zyx" : 12,
        "321" : 7
    }
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags, e.g., the top 3 "tags" sorted descending?
Is this possible in a single document? E.g., top tags for a user on a given day.
What if I have multiple documents that need to be combined together before getting the top? E.g., top tags for a user in a given month.
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in-place $inc's easier (many more of these happening than reads).
Or do I need to return the whole document and defer the sorting/limiting to the client?

When you use object keys as tag names, you make this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects. But there is always MapReduce.
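(A note for readers on newer servers: since MongoDB 3.4.4 the aggregation framework does have an escape hatch, $objectToArray, which turns the tags subdocument into an array of {k, v} pairs that $unwind can process. A minimal sketch of the single-document top-3 query, reusing the collection name "yourcollection" from below:
db.yourcollection.aggregate([
    { $match: { _id: "user1_20130822" } },
    { $project: { tags: { $objectToArray: "$tags" } } },
    { $unwind: "$tags" },
    { $sort: { "tags.v": -1 } },
    { $limit: 3 }
])
For the multi-document case you would $group on "$tags.k" with { $sum: "$tags.v" } before sorting. The rest of this answer uses MapReduce, which predates that operator.)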
Have your map function emit one document for each key/value pair in the tags subdocument. It should look something like this:
var mapFunction = function() {
    for (var key in this.tags) {
        emit(key, this.tags[key]);
    }
}
Your reduce function would then sum up the values emitted for the same key:
var reduceFunction = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
    {
        mapReduce: "yourcollection",       // the collection where your data is stored
        query: { _id: "user1_20130822" },  // or however you want to limit the input
        map: mapFunction,
        reduce: reduceFunction,
        out: { inline: 1 }                 // the output is returned directly instead of being written to a collection
    }
)
This will return all tags, in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field that has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results on the application level. If you insist on doing the sorting and limiting on the database, define an output collection to store the MapReduce results in (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. If multiple users want to view your top-3 list, you could let them query the output collection and run the MapReduce in the background at regular intervals.
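A minimal sketch of that output-collection variant, reusing mapFunction and reduceFunction from above (the collection name "temporaryCollectionName" is just a placeholder):
db.runCommand({
    mapReduce: "yourcollection",
    query: { _id: "user1_20130822" },
    map: mapFunction,
    reduce: reduceFunction,
    out: { replace: "temporaryCollectionName" }
})

// MapReduce stores each reduced sum under "value", so the top 3 tags are:
db.temporaryCollectionName.find().sort({ value: -1 }).limit(3)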

How do I update values in a nested array?

I would like to preface this by saying that English is not my mother tongue; if any of my explanations are vague or don't make sense, please let me know and I will attempt to make them clearer.
I have a document containing some nested data. Currently product and customer are arrays; I would prefer to have them as straight-up ObjectIds.
{
    "_id" : ObjectId("5bab713622c97440f287f2bf"),
    "created_at" : ISODate("2018-09-26T13:44:54.431Z"),
    "prod_line" : ObjectId("5b878e4c22c9745f1090de66"),
    "entries" : [
        {
            "order_number" : "123",
            "product" : [
                ObjectId("5ba8a0e822c974290b2ea18d")
            ],
            "customer" : [
                ObjectId("5b86a20922c9745f1a6408d4")
            ],
            "quantity" : "14"
        },
        {
            "order_number" : "456",
            "product" : [
                ObjectId("5b878ed322c9745f1090de6c")
            ],
            "customer" : [
                ObjectId("5b86a20922c9745f1a6408d5")
            ],
            "quantity" : "12"
        }
    ]
}
I tried using the following query to update it; however, that proved unsuccessful, as Mongo didn't behave quite as I had expected.
db.Document.find().forEach(function(doc){
    doc.entries.forEach(function(entry){
        var entry_id = entry.product[0]
        db.Document.update({_id: doc._id}, {$set:{'product': entry_id}});
        print(entry_id)
    })
})
With this query it sets product in the root of the document, which is not quite what I had hoped for. What I was hoping to do was iterate through entries and change each individual product and customer to be only their ObjectId, not an array. Is it possible to do this via the mongo shell, or do I have to look for another way to accomplish this? Thanks!
In order to accomplish your specified behavior, you just need to modify your query structure a bit. Take a look here for the specific MongoDB documentation on how to accomplish this. I will also propose an update to your code below:
db.Document.find().forEach(function(doc) {
    doc.entries.forEach(function(entry, index) {
        var productElementKey = 'entries.' + index + '.product';
        var productSetObject = {};
        productSetObject[productElementKey] = entry.product[0];
        db.Document.update({_id: doc._id}, {$set: productSetObject});
        print(entry.product[0]); // log the value being set
    })
})
The problem you were having is that you were not updating the specific element within the entries array, but rather adding a new key named product to the top level of the document. Generally, in order to set the value of an inner document within an array, you need to specify the array key first (entries in this case) and the inner document key second (product in this case). Since you are trying to set specific elements within the entries array, you also need to specify the index in the field path, as I have done above.
In order to update the customer key in the inner documents, simply swap out product for customer in the code above.
You're trying to add a property 'product' directly into your document with this line:
db.Document.update({_id: doc._id}, {$set:{'product': entry_id}});
Try to modify all your entries first, then update your document with this new array of entries.
db.Document.find().forEach(function(doc){
    let updatedEntries = [];
    doc.entries.forEach(function(entry){
        let newEntry = {};
        newEntry["order_number"] = entry.order_number;
        newEntry["product"] = entry.product[0];
        newEntry["customer"] = entry.customer[0];
        newEntry["quantity"] = entry.quantity;
        updatedEntries.push(newEntry);
    })
    db.Document.update({_id: doc._id}, {$set:{'entries': updatedEntries}});
})
You'll need to enumerate all the documents and then update them one at a time, with the value stored in the first item of the array for product and customer from each entry:
db.documents.find().snapshot().forEach(function (elem) {
    elem.entries.forEach(function(entry){
        db.documents.update(
            {
                _id: elem._id,
                "entries.order_number": entry.order_number
            },
            {
                $set: {
                    "entries.$.product" : entry.product[0],
                    "entries.$.customer" : entry.customer[0]
                }
            }
        );
    });
});
Instead of doing two updates each time, you could possibly use the filtered positional operator to apply all updates to all array items within one update query.
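A minimal sketch of that idea, assuming MongoDB 3.6+ (which introduced arrayFilters); it builds one filtered positional identifier per entry so that both fields of every entry are set in a single update per document:
db.documents.find().forEach(function (doc) {
    var set = {};
    var filters = [];
    doc.entries.forEach(function (entry, i) {
        // one identifier (e0, e1, ...) per array element
        set["entries.$[e" + i + "].product"] = entry.product[0];
        set["entries.$[e" + i + "].customer"] = entry.customer[0];
        var filter = {};
        filter["e" + i + ".order_number"] = entry.order_number;
        filters.push(filter);
    });
    if (filters.length > 0) {
        db.documents.update({ _id: doc._id }, { $set: set }, { arrayFilters: filters });
    }
});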

How to check if a portion of an _id from one collection appears in another

I have a collection where the _id is of the form [message_code]-[language_code] and another where the _id is just [message_code]. What I'd like to do is find all documents from the first collection where the message_code portion of the _id does not appear in the second collection.
Example:
> db.colA.find({})
{ "_id" : "TRM1-EN" }
{ "_id" : "TRM1-ES" }
{ "_id" : "TRM2-EN" }
{ "_id" : "TRM2-ES" }
> db.colB.find({})
{ "_id" : "TRM1" }
I want a query that will return TRM2-EN and TRM2-ES from colA. Of course, in my live data there are thousands of records in each collection.
According to this question, which is trying to do something similar, we have to save the results from a query against colB and use them in an $in condition in a query against colA. In my case, I need to strip the -[language_code] portion before doing this comparison, but I can't find a way to do so.
If all else fails, I'll just create a new field in colA that contains only the message code, but is there a better way to do it?
Edit:
Based on Michael's answer, I was able to come up with this solution:
var arr = db.colB.distinct("_id")
var regexs = arr.map(function(elm){
return new RegExp(elm);
})
var result = db.colA.find({_id : {$nin : regexs}}, {_id : true})
Edit:
Upon closer inspection, the above method doesn't work after all. In the end, I just had to add the new field.
Disclaimer: this is a little hack; it may not end well.
Get the distinct _id values using the collection.distinct method.
Build a regular expression array using Array.prototype.map().
var arr = db.colB.distinct('_id');
arr.map(function(elm, inx, tab) {
    tab[inx] = new RegExp(elm);
});
db.colA.find({ '_id': { '$nin': arr }})
I'd add a new field to colA, since you can index it, and if you have hundreds of thousands of documents in each collection, splitting the strings will be painfully slow.
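A minimal sketch of backfilling such a field, assuming it is called message_code (the name is made up here), so it can be indexed and compared directly against colB's _id values:
db.colA.find().forEach(function (doc) {
    var code = doc._id.split("-")[0];   // "TRM1-EN" -> "TRM1"
    db.colA.update({ _id: doc._id }, { $set: { message_code: code } });
});
db.colA.ensureIndex({ message_code: 1 });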
But if you don't want to do that, you could make use of the aggregation framework's $substr operator to extract the [message_code] and then do a $match on the result.
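A minimal sketch of that approach; if your message codes are all the same length, $substr: ["$_id", 0, <length>] works as described, but since that may not hold, this sketch uses $split (MongoDB 3.4+) to take everything before the "-" instead:
var codes = db.colB.distinct("_id");
db.colA.aggregate([
    { $addFields: { code: { $arrayElemAt: [{ $split: ["$_id", "-"] }, 0] } } },
    { $match: { code: { $nin: codes } } },
    { $project: { _id: 1 } }
]);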

In mongo, how do I use map reduce to get a group by ordered by most recent

The map reduce examples I see use aggregation functions like count, but what is the best way to get, say, the top 3 items in each category using map reduce?
I'm assuming I can also use the group function, but I was curious, since the docs state that sharded environments cannot use group(). However, I'm actually interested in seeing a group() example as well.
For the sake of simplification, I'll assume you have documents of the form:
{category: <int>, score: <int>}
I've created 1000 documents covering 100 categories with:
for (var i=0; i<1000; i++) {
    db.foo.save({
        category: parseInt(Math.random() * 100),
        score: parseInt(Math.random() * 100)
    });
}
Our mapper is pretty simple: just emit the category as the key, and an object containing an array of scores as the value:
mapper = function () {
    emit(this.category, {top:[this.score]});
}
MongoDB's reducer cannot return an array, and the reducer's output must be of the same type as the values we emit, so we must wrap it in an object. We need an array of scores, as this will let our reducer compute the top 3 scores:
reducer = function (key, values) {
    var scores = [];
    values.forEach(
        function (obj) {
            obj.top.forEach(
                function (score) {
                    scores.push(score);
                });
        });
    // sort numerically, descending; the default sort() compares as strings,
    // which would rank e.g. 9 above 45
    scores.sort(function (a, b) { return b - a; });
    return {top: scores.slice(0, 3)};
}
Finally, invoke the map-reduce:
db.foo.mapReduce(mapper, reducer, "top_foos");
Now we have a collection containing one document per category, with the top 3 scores across all documents from foo in that category:
{ "_id" : 0, "value" : { "top" : [ 93, 89, 86 ] } }
{ "_id" : 1, "value" : { "top" : [ 82, 65, 6 ] } }
(Your exact values will vary, since the documents were generated with Math.random() above.)
You can now use this to query foo for the actual documents having those top scores:
function find_top_scores(categories) {
    var query = [];
    db.top_foos.find({_id: {$in: categories}}).forEach(
        function (topscores) {
            query.push({
                category: topscores._id,
                score: {$in: topscores.value.top}
            });
        });
    return db.foo.find({$or: query});
}
This code won't handle ties; or rather, if ties exist, more than 3 documents might be returned in the final cursor produced by find_top_scores.
The solution using group() would be somewhat similar, though the reducer would only have to consider two documents at a time, rather than an array of scores per key.
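For completeness, a hedged sketch of what that group() variant might look like (as the question notes, group() does not work on sharded collections, and it was removed entirely in MongoDB 4.2):
db.foo.group({
    key: { category: true },
    initial: { top: [] },
    reduce: function (doc, out) {
        // fold each document's score into the running top-3 for its category
        out.top.push(doc.score);
        out.top.sort(function (a, b) { return b - a; });
        out.top = out.top.slice(0, 3);
    }
});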

Most efficient way to generate a list of Unigrams from a text field in MongoDB

I need to generate a vector of unigrams, i.e., a vector of all the unique words that appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.
I'm not really sure what the easiest and most efficient way to generate this vector is. I was thinking of writing a simple Java app which could handle the tokenization (using something like OpenNLP); however, I think that a better approach may be to tackle this using Mongo's Map-Reduce feature... However, I'm not really sure how I could go about this.
Another option would be to use Apache Lucene indexing, but it would mean I'd still need to export this data one by one. Which is really the same issue I would have with the custom Java or Ruby approach...
Map reduce sounds good; however, the Mongo data is growing by the day as more documents are inserted. This isn't really a one-off task, as there are new documents being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over the millions of documents every time I want to update my unigram vector, as I fear this would be a very inefficient use of resources...
What would be the most efficient way to generate the unigram vector and then keep it updated?
Thanks!
Since you have not provided a sample document (object) format, take this as a sample collection called 'stories'.
{ "_id" : ObjectId("4eafd693627b738f69f8f1e3"), "body" : "There was a king", "author" : "tom" }
{ "_id" : ObjectId("4eafd69c627b738f69f8f1e4"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd72c627b738f69f8f1e5"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd74e627b738f69f8f1e6"), "body" : "There was a jack", "author" : "tom" }
{ "_id" : ObjectId("4eafd785627b738f69f8f1e7"), "body" : "There was a humpty and dumpty . Humtpy was tall . Dumpty was short .", "author" : "jane" }
{ "_id" : ObjectId("4eafd7cc627b738f69f8f1e8"), "body" : "There was a cat called Mini . Mini was clever cat . ", "author" : "jane" }
For the given dataset, you can use the following JavaScript code to get to your solution. The collection "authors_unigrams" contains the result. All the code is supposed to be run using the mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).
First, we need to mark all the new documents that have come afresh into the 'stories' collection. We do this using the following command. It will add a new attribute called "mr_status" to each document and assign it the value "inprocess". Later, we will see that the map-reduce operation only takes into account those documents that have the value "inprocess" for the field "mr_status". This way, we can avoid reconsidering documents that were already covered in a previous run, making the operation efficient as asked.
db.stories.update({mr_status:{$exists:false}},{$set:{mr_status:"inprocess"}},false,true);
Second, we define both the map() and reduce() functions.
var map = function() {
    uniqueWords = function (words) {
        var arrWords = words.split(" ");
        var arrNewWords = [];
        var seenWords = {};
        for (var i = 0; i < arrWords.length; i++) {
            if (!seenWords[arrWords[i]]) {
                seenWords[arrWords[i]] = true;
                arrNewWords.push(arrWords[i]);
            }
        }
        return arrNewWords;
    }
    var unigrams = uniqueWords(this.body);
    emit(this.author, {unigrams: unigrams});
};
var reduce = function(key, values) {
    Array.prototype.uniqueMerge = function (a) {
        for (var nonDuplicates = [], i = 0, l = a.length; i < l; ++i) {
            if (this.indexOf(a[i]) === -1) {
                nonDuplicates.push(a[i]);
            }
        }
        return this.concat(nonDuplicates);
    };
    var unigrams = [];
    values.forEach(function(i) {
        unigrams = unigrams.uniqueMerge(i.unigrams);
    });
    return { unigrams: unigrams };
};
Third, we actually run the map-reduce function.
var result = db.stories.mapReduce(
    map,
    reduce,
    {
        query: { author: { $exists: true }, mr_status: "inprocess" },
        out: { reduce: "authors_unigrams" }
    }
);
Fourth, we mark all the records that were considered for map-reduce in the last run as processed by setting "mr_status" to "processed".
db.stories.update({mr_status:"inprocess"},{$set:{mr_status:"processed"}},false,true);
Optionally, you can see the result collection "authors_unigrams" by firing the following command.
db.authors_unigrams.find();
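For example, for the sample author "tom" above, the result document comes out along these lines (the array order can vary with how reduces are batched):
{ "_id" : "tom", "value" : { "unigrams" : [ "There", "was", "a", "king", "queen", "jack" ] } }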

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map reduce iterates over the collection almost twice (assuming that there is a very small number of duplicates):
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    }
)
db[res.result].find({value: {$gt: 1}}).forEach(  // value > 1 means the md5 occurred more than once
    function (obj) {
        db.duplicates.insert(obj)
    });
I personally found that on big databases (1TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate([
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } },
    { $limit : 5 }
]);
It searches for documents whose extra_info.id is used two or more times, sorts the results in descending order of the count, and prints the first 5 of them.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists: true}}, {"md5" : 1} ).sort( {"md5" : 1} ).forEach( function(current) {
    if (current.md5 == previous_md5) {
        db.duplicates.update( {"_id" : current.md5}, {"$inc" : {count: 1}}, true );
    }
    previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, its copies will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, I drop it into the duplicates collection (as an upsert, using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Afterwards you can loop through the much smaller duplicates collection and perform clean-up, for example as sketched below.
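A minimal sketch of such a clean-up pass, under an assumed policy of keeping one document per md5 and removing the rest (adjust to your own rules):
db.duplicates.find().forEach(function (dup) {
    var docs = db.files.find({ md5: dup._id }).toArray();
    // keep docs[0], remove every other document carrying the same md5
    for (var i = 1; i < docs.length; i++) {
        db.files.remove({ _id: docs[i]._id });
    }
});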
You can do a group by that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although the fastest thing might be to do a query which only returns the md5 field and then do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document, which is much more costly than just providing the data from the index (covered queries are supported in 1.7.3+).
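A minimal sketch of that client-side tally (whether the query is actually served from the index alone depends on your server version and the exact predicate, so treat the covered-query part as an assumption):
var counts = {};
db.files.find({ md5: { $exists: true } }, { md5: 1, _id: 0 }).forEach(function (doc) {
    counts[doc.md5] = (counts[doc.md5] || 0) + 1;   // tally per md5 on the client
});
for (var md5 in counts) {
    if (counts[md5] > 1) print(md5 + ": " + counts[md5]);
}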
If this is a general problem that you need to handle periodically, you might want to keep a collection which is just {md5: value, count: value}, so you can skip the aggregation; then culling duplicates becomes extremely fast.
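A sketch of that running-count idea; the collection name md5_counts and the helper recordMd5 are made up for illustration:
// call this every time a file document is inserted
function recordMd5(md5) {
    db.md5_counts.update(
        { _id: md5 },               // the md5 itself is the _id, so it is indexed for free
        { $inc: { count: 1 } },
        { upsert: true }
    );
}

// culling duplicates is then a cheap indexed query
db.md5_counts.find({ count: { $gt: 1 } });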