Mongo map-reduce output, how to read results back? - mongodb

I have a map-reduce query that "works" and does what I want, however I have so far spectacularly failed to make use of my output data because I cannot work out how to read it back... let me explain... here is my emit:
emit( { jobid: this.job_id, type: this.type}, { count: 1 })
and the reduce function:
reduce: function (key, values) {
    var total = 0;
    for ( var i = 0; i < values.length; i++ ) {
        total += values[i].count;
    }
    // the key already carries jobid/type; "this" is not the document inside reduce
    return { count: total };
},
It functions and the output I get in the results collection looks like this:
{ "_id" : { "jobid" : "5051ef142a120", "type" : 3 }, "value" : { "count" : 1 } }
{ "_id" : { "jobid" : "5051ef142a120", "type" : 5 }, "value" : { "count" : 43 } }
{ "_id" : { "jobid" : "5051f1a9d5442", "type" : 2 }, "value" : { "count" : 1 } }
{ "_id" : { "jobid" : "5051f1a9d5442", "type" : 3 }, "value" : { "count" : 1 } }
{ "_id" : { "jobid" : "5051f299340b1", "type" : 2 }, "value" : { "count" : 1 } }
{ "_id" : { "jobid" : "5051f299340b1", "type" : 3 }, "value" : { "count" : 1 } }
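The emit/reduce pair above behaves like a group-by count. As a plain JavaScript sketch (the sample documents here are invented for illustration, not from the real collection), the same grouping looks like:

```javascript
// Group documents by (job_id, type) and count them, mirroring the
// map-reduce output shape { _id: { jobid, type }, value: { count } }.
const docs = [
  { job_id: "5051f299340b1", type: 2 },
  { job_id: "5051f299340b1", type: 3 },
  { job_id: "5051ef142a120", type: 5 },
  { job_id: "5051ef142a120", type: 5 },
];

const groups = {};
for (const d of docs) {
  // Serialise the compound key so it can be used as an object property.
  const key = JSON.stringify({ jobid: d.job_id, type: d.type });
  groups[key] = (groups[key] || 0) + 1;
}

const results = Object.entries(groups).map(([k, count]) => ({
  _id: JSON.parse(k),
  value: { count },
}));
console.log(results);
```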
BUT HOW the hell do I issue a query that finds all entries by "jobid" whilst ignoring the type? I tried this initially, expecting two rows of output, but got none!
db.mrtest.find( { "_id": { "jobid" : "5051f299340b1" }} );
and whilst:
db.mrtest.find( { "_id" : { "jobid" : "5051f299340b1", "type" : 2 }} )
does produce one row of output as hoped for, changing it to this again fails to produce anything:
db.mrtest.find( { "_id" : { "jobid" : "5051f299340b1", "type" : { $in: [2] }}} )
I get the impression that you can't do such things with the _id field, or can you? I am thinking I need to reorganise my map-reduce output instead, but that feels like failing somehow?!
Help!
PS: If anybody can explain why the count is contained in a field called "value", that would also be welcome!

Have you tried:
db.mrtest.find( { "_id.jobid": "506ea3a85e126" })
That works for me!
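The reason the earlier queries returned nothing is that querying "_id" with a whole subdocument requires an exact match on every field of the stored subdocument (field order included), whereas dot notation matches a single embedded field. A plain JavaScript sketch of the two semantics (the matching helpers here are illustrative, not MongoDB internals):

```javascript
// Exact subdocument match: the query object must equal the stored _id
// field-for-field, so { jobid: "x" } never matches { jobid: "x", type: 2 }.
const docs = [
  { _id: { jobid: "5051f299340b1", type: 2 }, value: { count: 1 } },
  { _id: { jobid: "5051f299340b1", type: 3 }, value: { count: 1 } },
];

const exactMatch = (doc, query) =>
  JSON.stringify(doc._id) === JSON.stringify(query);

// Dot notation ("_id.jobid") reaches into the subdocument and compares
// just that one field, ignoring the others.
const dotMatch = (doc, jobid) => doc._id.jobid === jobid;

const byWholeId = docs.filter(d => exactMatch(d, { jobid: "5051f299340b1" }));
const byDotPath = docs.filter(d => dotMatch(d, "5051f299340b1"));
console.log(byWholeId.length, byDotPath.length); // 0 and 2
```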



mongodb $unwind empty array

With this data:
{
"_id" : ObjectId("576948b4999274493425c08a"),
"virustotal" : {
"scan_id" : "4a6c3dfc6677a87aee84f4b629303c40bb9e1dda283a67236e49979f96864078-1465973544",
"sha1" : "fd177b8c50b457dbec7cba56aeb10e9e38ebf72f",
"resource" : "4a6c3dfc6677a87aee84f4b629303c40bb9e1dda283a67236e49979f96864078",
"response_code" : 1,
"scan_date" : "2016-06-15 06:52:24",
"results" : [
{
"sig" : "Gen:Variant.Mikey.29601",
"vendor" : "MicroWorld-eScan"
},
{
"sig" : null,
"vendor" : "nProtect"
},
{
"sig" : null,
"vendor" : "CAT-QuickHeal"
},
{
"sig" : "HEUR/QVM07.1.0000.Malware.Gen",
"vendor" : "Qihoo-360"
}
]
}
},
{
"_id" : ObjectId("5768f214999274362f714e8b"),
"virustotal" : {
"scan_id" : "3d283314da4f99f1a0b59af7dc1024df42c3139fd6d4d4fb4015524002b38391-1466529838",
"sha1" : "fb865b8f0227e9097321182324c959106fcd8c27",
"resource" : "3d283314da4f99f1a0b59af7dc1024df42c3139fd6d4d4fb4015524002b38391",
"response_code" : 1,
"scan_date" : "2016-06-21 17:23:58",
"results" : [
{
"sig" : null,
"vendor" : "Bkav"
},
{
"sig" : null,
"vendor" : "ahnlab"
},
{
"sig" : null,
"vendor" : "MicroWorld-eScan"
},
{
"sig" : "Mal/DrodZp-A",
"vendor" : "Qihoo-360"
}
]
}
}
I'm trying to group by and count the vendor when sig is not null in order to obtain something like:
{
"_id" : "Qihoo-360",
"count" : 2
},
{
"_id" : "MicroWorld-eScan",
"count" : 1
},
{
"_id" : "Bkav",
"count" : 0
},
{
"_id" : "CAT-QuickHeal",
"count" : 0
}
At the moment with this code:
db.analysis.aggregate([
{ $unwind: "$virustotal.results" },
{
$group : {
_id : "$virustotal.results.vendor",
count : { $sum : 1 }
}
},
{ $sort : { count : -1 } }
])
I'm getting everything:
{
"_id" : "Qihoo-360",
"count" : 2
},
{
"_id" : "MicroWorld-eScan",
"count" : 2
},
{
"_id" : "Bkav",
"count" : 1
},
{
"_id" : "CAT-QuickHeal",
"count" : 1
}
How can I count 0 if the sig is null?
You need a conditional expression in your $sum operator that checks whether the "$virustotal.results.sig" key is null, using the comparison operator $gt (as specified in the documentation's BSON comparison order).
You can restructure your pipeline by adding this expression as follows:
db.analysis.aggregate([
{ "$unwind": "$virustotal.results" },
{
"$group" : {
"_id": "$virustotal.results.vendor",
"count" : {
"$sum": {
"$cond": [
{ "$gt": [ "$virustotal.results.sig", null ] },
1, 0
]
}
}
}
},
{ "$sort" : { "count" : -1 } }
])
Sample Output
/* 1 */
{
"_id" : "Qihoo-360",
"count" : 2
}
/* 2 */
{
"_id" : "MicroWorld-eScan",
"count" : 1
}
/* 3 */
{
"_id" : "Bkav",
"count" : 0
}
/* 4 */
{
"_id" : "CAT-QuickHeal",
"count" : 0
}
/* 5 */
{
"_id" : "nProtect",
"count" : 0
}
/* 6 */
{
"_id" : "ahnlab",
"count" : 0
}
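The $gt-against-null trick works because any non-null string value compares greater than null in the BSON comparison order, so the $cond yields 1 exactly for non-null sigs. The conditional count can be sketched in plain JavaScript using the result entries from the question's two sample documents:

```javascript
// Per-vendor count of results whose sig is non-null; vendors whose sig is
// always null still appear, with count 0, mirroring $sum of $cond.
const results = [
  { sig: "Gen:Variant.Mikey.29601", vendor: "MicroWorld-eScan" },
  { sig: null, vendor: "nProtect" },
  { sig: null, vendor: "CAT-QuickHeal" },
  { sig: "HEUR/QVM07.1.0000.Malware.Gen", vendor: "Qihoo-360" },
  { sig: null, vendor: "Bkav" },
  { sig: null, vendor: "ahnlab" },
  { sig: null, vendor: "MicroWorld-eScan" },
  { sig: "Mal/DrodZp-A", vendor: "Qihoo-360" },
];

const counts = {};
for (const r of results) {
  // $cond: [ { $gt: [sig, null] }, 1, 0 ], i.e. add 1 only for non-null sigs.
  counts[r.vendor] = (counts[r.vendor] || 0) + (r.sig !== null ? 1 : 0);
}
console.log(counts); // Qihoo-360: 2, MicroWorld-eScan: 1, the rest: 0
```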
I replaced null with None and the numbers increased, but it still doesn't seem correct.
Basically, running the query in the mongo shell I get something like:
{
"_id" : "Kaspersky",
"count" : 176.0
}
from python:
Kaspersky 64
one of these 2 is wrong :)
So I'm trying to work out which part of the Python query is written differently from the mongo shell one.
I did a simple query:
In the mongo shell:
rtmp = results_db.analysis.count( { "virustotal.results" : { "$elemMatch" : { "vendor": "Kaspersky", "sig": {"$ne": "null"} } }})
results: 176
db.analysis.count( { "virustotal.results" : { $elemMatch : { "vendor": "Kaspersky", "sig": {$gt: null} } }})
results: 0
Then I tried in python:
rtmp = results_db.analysis.count( { "virustotal.results" : { "$elemMatch" : { "vendor": "Kaspersky", "sig": {"$ne": "null"} } }})
results: 568
rtmp = results_db.analysis.count( { "virustotal.results" : { "$elemMatch" : { "vendor": "Kaspersky", "sig": {"$ne": "None"} } }})
results: 568
rtmp = results_db.analysis.count( { "virustotal.results" : { "$elemMatch" : { "vendor": "Kaspersky", "sig": {"$gt": "None"} } }})
results: 64
rtmp = results_db.analysis.count( { "virustotal.results" : { "$elemMatch" : { "vendor": "Kaspersky", "sig": {"$gt": "null"} } }})
results: 6
Hard to say which is the correct value! I suppose 176, but I'm not able to reproduce it in Python...
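Part of the confusion above is likely that "null" and "None" in quotes are ordinary strings, not the BSON null: the shell literal is null and the pymongo equivalent is Python's None, both unquoted. The difference is easy to see in plain JavaScript:

```javascript
// A sig that is genuinely null versus comparing against the string "null".
const sigs = [null, "Trojan.Generic"];

// Comparing against the *string* "null" treats a real null as "not equal",
// so it gets counted, which inflates { "$ne": "null" } style counts.
const neStringNull = sigs.filter(s => s !== "null").length;

// Comparing against the null literal is what { "$ne": null } actually means.
const neNull = sigs.filter(s => s !== null).length;

console.log(neStringNull, neNull); // 2 and 1
```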

MongoDB Aggregate Slow Performance When Using Sort

I have a collection (TV show episodes) with more than 1,200,000 documents;
here is my schema:
var episodeSchema = new Schema({
imdbId: { type : String },
showId: {type : String},
episodeId: { type : String },
episodeIdNumber:{ type : Number },
episodeTitle:{ type : String },
showTitle:{type : String},
seasonNumber:{type : Number},
episodeNumber:{type : Number},
airDate : {type : String},
summary:{type : String}
});
I created indexes for episodeTitle, episodeIdNumber, seasonNumber, episodeNumber, episodeId and showId.
Now I use a MongoDB aggregate $group to get every TV show's episodes.
Here is the aggregate query I used:
episode.aggregate( [
{ $match : { showId : "scorpion" } },
{$sort:{"episodeNumber":-1}},
{ $group: {
_id: "$seasonNumber", count: { $sum: 1 } ,
episodes : { $push: { episodeId : "$episodeId" , episodeTitle: "$episodeTitle" , episodeNumber: "$episodeNumber" , seasonNumber: "$seasonNumber" , airDate: "$airDate" } }
} }
,
{ $sort : { _id : -1 } }
] )
Now when I run this query it takes more than 2605.907 ms. After some digging I found out why it is slow: it was because of the {$sort:{"episodeNumber":-1}} stage; without {$sort:{"episodeNumber":-1}} it takes around 19.178 ms to run.
As I mentioned above, I created an index for the episodeNumber field and, based on MongoDB Aggregation Pipeline Optimization, I used $sort after $match, so basically everything should have been fine and I didn't do anything wrong.
After this I thought something was wrong with my indexes, so I removed the episodeNumber index and reindexed, but the timing was the same; nothing changed.
In the end I tried running the aggregate $group query without episodeNumber indexed and, surprisingly, it was faster! It takes around 20.118 ms.
I want to know why this happened; aren't indexes supposed to make queries faster?
Update
query explain output :
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
"showId" : "scorpion"
},
"sort" : {
"episodeNumber" : -1
},
"fields" : {
"airDate" : 1,
"episodeId" : 1,
"episodeNumber" : 1,
"episodeTitle" : 1,
"seasonNumber" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.episodes",
"indexFilterSet" : false,
"parsedQuery" : {
"showId" : {
"$eq" : "scorpion"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$seasonNumber",
"count" : {
"$sum" : {
"$const" : 1
}
},
"episodes" : {
"$push" : {
"episodeId" : "$episodeId",
"episodeTitle" : "$episodeTitle",
"episodeNumber" : "$episodeNumber",
"seasonNumber" : "$seasonNumber",
"airDate" : "$airDate"
}
}
}
},
{
"$sort" : {
"sortKey" : {
"_id" : -1
}
}
}
],
"ok" : 1
}

MongoDB Aggregate - Absolute count for each day

I have the following (already aggregated) collection:
{ "_id" : { "day" : "2015-02-01" }, "total" : 2 }
{ "_id" : { "day" : "2015-02-02" }, "total" : 3 }
{ "_id" : { "day" : "2015-02-03" }, "total" : 10 }
{ "_id" : { "day" : "2015-02-04" }, "total" : 10 }
{ "_id" : { "day" : "2015-02-05" }, "total" : 5 }
What I need is to calculate an absolute value for each day by summing the previous days. So in the case above the expected result would be:
{ "_id" : { "day" : "2015-02-01" }, "absolutetotalforday" : 2 }
{ "_id" : { "day" : "2015-02-02" }, "absolutetotalforday" : 5 }
{ "_id" : { "day" : "2015-02-03" }, "absolutetotalforday" : 15 }
{ "_id" : { "day" : "2015-02-04" }, "absolutetotalforday" : 25 }
{ "_id" : { "day" : "2015-02-05" }, "absolutetotalforday" : 30 }
Currently I have no clue how to achieve this with one query. Of course I could do a sum for each day I'm interested in, but this might be a long time range.
Any help appreciated
Because the aggregation framework has no mechanism for knowing the value of a previous document, or the previous "grouped" value of a document, your best bet would be to use Map-Reduce in this case.
Map-Reduce will give you the "running total" for the current total at end of day that you require, although this won't be in the desired key absolutetotalforday but in a key called value, since reduced values always land in a value key.
The following mapReduce() operation will give you the desired result, assuming the results from the previous aggregation operation were output to a separate collection named agg_results:
db.agg_results.mapReduce(
function() { emit( this._id, this.total ); },
function(key, values) { return Array.sum(values); },
{
"scope": { "total": 0 },
"finalize": function(key, value) {
total += value;
return total;
},
"out": { "inline": 1 }
}
);
Sample Output
{
"results" : [
{
"_id" : {
"day" : "2015-02-01"
},
"value" : 2
},
{
"_id" : {
"day" : "2015-02-02"
},
"value" : 5
},
{
"_id" : {
"day" : "2015-02-03"
},
"value" : 15
},
{
"_id" : {
"day" : "2015-02-04"
},
"value" : 25
},
{
"_id" : {
"day" : "2015-02-05"
},
"value" : 30
}
],
"timeMillis" : 0,
"counts" : {
"input" : 5,
"emit" : 5,
"reduce" : 0,
"output" : 5
},
"ok" : 1
}
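The scope/finalize pair above maintains a running total that is carried across the reduced results as they are finalized in key order. The same cumulative sum can be sketched in plain JavaScript:

```javascript
// Cumulative sum over per-day totals, mirroring the scope + finalize trick:
// `total` plays the role of the scope variable carried across finalize calls.
const days = [
  { _id: { day: "2015-02-01" }, total: 2 },
  { _id: { day: "2015-02-02" }, total: 3 },
  { _id: { day: "2015-02-03" }, total: 10 },
  { _id: { day: "2015-02-04" }, total: 10 },
  { _id: { day: "2015-02-05" }, total: 5 },
];

let total = 0; // the "scope" variable
const running = days.map(d => {
  total += d.total; // finalize: fold this day's value into the running total
  return { _id: d._id, value: total };
});
console.log(running.map(r => r.value)); // [2, 5, 15, 25, 30]
```

Note that, like the mapReduce version, this only works if the documents arrive in day order, which is exactly why the next step converts the day strings to dates and sorts.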
Sorting the results will not work with inline results and with dates of String type. Instead, try converting the date strings to a JavaScript date object, write the results to a collection and then run a sort on that collection:
db.agg_results.mapReduce(
function() { emit( new Date(this._id.day), this.total ); },
function(key, values) { return Array.sum(values); },
{
"scope": { "total": 0 },
"finalize": function(key, value) {
total += value;
return total;
},
"out": "tmpResults"
}
);
Sample Output (with sort)
> db.tmpResults.find().sort({_id: 1})
{ "_id" : ISODate("2015-02-01T00:00:00Z"), "value" : 2 }
{ "_id" : ISODate("2015-02-02T00:00:00Z"), "value" : 5 }
{ "_id" : ISODate("2015-02-03T00:00:00Z"), "value" : 15 }
{ "_id" : ISODate("2015-02-04T00:00:00Z"), "value" : 25 }
{ "_id" : ISODate("2015-02-05T00:00:00Z"), "value" : 30 }
>

MongoDB Length of String or need help in mongoDB mapReduce query

I need some guidance with a query for MongoDB. I have a collection of items which contain an array with multiple fields, but I only care about one field, the barcode field. I need to test the length of the barcode string.
Items.find({this.array['barcode'].length > 6})
It would be great if the above query were possible, but I believe it's not. I only need a list of barcodes. How do I go about solving this problem? Does MongoDB have something to compare the length of a string? Or do I have to use a mapReduce query? If so, could I have some guidance on that? I'm not sure how I would go about writing it.
Thank you
Try using a regular expression.
/^\w{6}/ says: match any string that starts with at least 6 word characters.
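Since JavaScript and MongoDB use compatible regex syntax here, the behaviour of /^\w{6}/ can be checked in plain JavaScript before putting it in a query (a quick sketch with made-up barcodes):

```javascript
// /^\w{6}/ matches any string with at least six word characters at the
// start; shorter barcodes fail the match.
const re = /^\w{6}/;
const barcodes = ["1", "12345", "123456", "1234567", "1234567890"];
const longEnough = barcodes.filter(b => re.test(b));
console.log(longEnough); // ["123456", "1234567", "1234567890"]
```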
Example:
Setup:
use test;
var createProduct = function(name, barcode){
return {
name : name,
detail: {
barcode : barcode
}
};
};
db.products.drop();
for(var i = 0; i < 10; i++){
db.products.insert( createProduct( "product" + i, "1234567890".substring(0,i+1) ));
}
Document structure:
{
"_id" : ObjectId("540d3ba1242ff352caa6154b"),
"name" : "product0",
"detail" : {
"barcode" : "1"
}
}
Query:
db.products.find({ "detail.barcode" : /^\w{6}/ })
Output:
{ "_id" : ObjectId("540d3ba1242ff352caa61550"), "name" : "product5", "detail" : { "barcode" : "123456" } }
{ "_id" : ObjectId("540d3ba1242ff352caa61551"), "name" : "product6", "detail" : { "barcode" : "1234567" } }
{ "_id" : ObjectId("540d3ba1242ff352caa61552"), "name" : "product7", "detail" : { "barcode" : "12345678" } }
{ "_id" : ObjectId("540d3ba1242ff352caa61553"), "name" : "product8", "detail" : { "barcode" : "123456789" } }
{ "_id" : ObjectId("540d3ba1242ff352caa61554"), "name" : "product9", "detail" : { "barcode" : "1234567890" } }
However, if barcode is a key within an object inside an array AND you only want the matched barcode values, then you should use an aggregation pipeline to extract the values.
Setup:
use test;
var createProduct = function(name){
var o = {
name : name,
subProducts: []
};
for(var i = 0; i < 10; i++){
o.subProducts.push({
barcode : "1234567890".substring(0,i+1)
});
}
return o;
};
db.products.drop();
db.products.insert( createProduct( "newBrand") );
Document Structure:
{
"_id" : ObjectId("540d4125242ff352caa61555"),
"name" : "newBrand",
"subProducts" : [
{
"barcode" : "1"
},
...
{
"barcode" : "123456789"
},
{
"barcode" : "1234567890"
}
]
}
Aggregate Query:
db.products.aggregate([
{ $unwind : "$subProducts" },
{ $match : { "subProducts.barcode" : /^\w{6}/ } }
]);
Output:
{ "_id" : ObjectId("540d4125242ff352caa61555"), "name" : "newBrand", "subProducts" : { "barcode" : "123456" } }
{ "_id" : ObjectId("540d4125242ff352caa61555"), "name" : "newBrand", "subProducts" : { "barcode" : "1234567" } }
{ "_id" : ObjectId("540d4125242ff352caa61555"), "name" : "newBrand", "subProducts" : { "barcode" : "12345678" } }
{ "_id" : ObjectId("540d4125242ff352caa61555"), "name" : "newBrand", "subProducts" : { "barcode" : "123456789" } }
{ "_id" : ObjectId("540d4125242ff352caa61555"), "name" : "newBrand", "subProducts" : { "barcode" : "1234567890" } }
More info:
http://docs.mongodb.org/manual/reference/operator/query/regex/
Retrieve only the queried element in an object array in MongoDB collection

MongoDB aggregation - ignore key names

I have a query:
db.events.aggregate(
{ $match: { "camera._id": "1NJE48", "start_timestamp": { $lte: 1407803834.07 } } },
{ $sort: { "start_timestamp": -1 } },
{ $limit: 2 },
{ $project: { "_id": 0, "snapshots": 1 } }
)
It returns data like so:
/* 0 */
{
"result" : [
{
"snapshots" : {
"1401330834010" : {
"uploaded_timestamp" : 1401330895,
"filename_timestamp" : 1401330834.01,
"timestamp" : 1401330834.01
},
"1401330835010" : {
"uploaded_timestamp" : 1401330896,
"filename_timestamp" : 1401330835.01,
"timestamp" : 1401330835.01
},
"1401330837010" : {
"uploaded_timestamp" : 1401330899,
"filename_timestamp" : 1401330837.01,
"timestamp" : 1401330837.01
}
}
},
{
"snapshots" : {
"1401319837010" : {
"uploaded_timestamp" : 1401319848,
"filename_timestamp" : 1401319837.01,
"timestamp" : 1401319837.01
},
"1401319838010" : {
"uploaded_timestamp" : 1401319849,
"filename_timestamp" : 1401319838.01,
"timestamp" : 1401319838.01
},
"1401319839010" : {
"uploaded_timestamp" : 1401319850,
"filename_timestamp" : 1401319839.01,
"timestamp" : 1401319839.01
}
}
}
],
"ok" : 1
}
I would like an array of snapshots:
/* 0 */
{
"result" : [
{
"uploaded_timestamp" : 1401330895,
"filename_timestamp" : 1401330834.01,
"timestamp" : 1401330834.01
},
{
"uploaded_timestamp" : 1401330896,
"filename_timestamp" : 1401330835.01,
"timestamp" : 1401330835.01
},
{
"uploaded_timestamp" : 1401330899,
"filename_timestamp" : 1401330837.01,
"timestamp" : 1401330837.01
},
{
"uploaded_timestamp" : 1401319848,
"filename_timestamp" : 1401319837.01,
"timestamp" : 1401319837.01
},
{
"uploaded_timestamp" : 1401319849,
"filename_timestamp" : 1401319838.01,
"timestamp" : 1401319838.01
},
{
"uploaded_timestamp" : 1401319850,
"filename_timestamp" : 1401319839.01,
"timestamp" : 1401319839.01
}
],
"ok" : 1
}
I.e. no key names. I'm struggling to understand how to deal with the aggregation framework when the key names are unique like they are here.
The problem is that the only way you know the key names is by looking at the document itself. MongoDB does not handle this type of situation well, in general. You are expected to know the structure of your own documents, i.e. to know what the keys are and what their types should be.
I don't know your use case and there's no sample document so I can't evaluate your data model, but having keys-as-values is generally a bad idea as you will run into a host of limitations whenever you can't say what the keys on a document should be a priori. Consider using an array instead of an embedded object for snapshots, or using an array of key-value pairs pattern like
{
...
"result" : [
{
"snapshots" : [
{
"key" : "1401330834010",
"value" : {
"uploaded_timestamp" : 1401330895,
"filename_timestamp" : 1401330834.01,
"timestamp" : 1401330834.01
},
}
]
},
...
}
If you provide a sample document and some detail about what you're trying to accomplish I'd be happy to provide more complete advice.
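Until the documents are remodelled, the keyed snapshots can at least be flattened client-side: Object.entries turns the keys-as-values object into the suggested key/value array. A plain JavaScript sketch, with a made-up snapshots object:

```javascript
// Convert { "<ts>": {...}, ... } into [ { key: "<ts>", value: {...} }, ... ].
const snapshots = {
  "1401330834010": { uploaded_timestamp: 1401330895, timestamp: 1401330834.01 },
  "1401330835010": { uploaded_timestamp: 1401330896, timestamp: 1401330835.01 },
};

const asArray = Object.entries(snapshots).map(([key, value]) => ({ key, value }));
console.log(asArray.length); // 2
```

For a server-side equivalent, newer MongoDB versions (3.4.4+) provide the $objectToArray aggregation operator, which performs the same transformation inside a pipeline.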
Came up with a stopgap solution. We will store an array of the snapshot keys on each event; it essentially acts as an index. We can then perform two queries: one to fetch the keys and do a filter, and another to fetch the single snapshot we need.
It's not pretty, nor backwards compatible, but it will hopefully speed things up.