I've translated the following SQL statement to MapReduce:
select
    p_brand, p_type, p_size,
    count(ps_suppkey) as supplier_cnt
from
    partsupp, part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#45'
    and p_type not like 'MEDIUM POLISHED %'
    and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand, p_type, p_size
order by
    supplier_cnt desc, p_brand, p_type, p_size;
MapReduce command:
db.runCommand({
    mapreduce: "partsupp",
    query: {
        "ps_partkey.p_size": { $in: [49, 14, 23, 45, 19, 3, 36, 9] },
        "ps_partkey.p_brand": { $ne: "Brand#45" }
    },
    map: function() {
        var pattern1 = /^MEDIUM POLISHED .*/;
        var pattern2 = /.*Customer.*Complaints.*/;
        var suppkey = this.ps_suppkey.s_suppkey;
        if (this.ps_suppkey.s_comment.match(pattern1) == null) {
            if (this.ps_suppkey.s_comment.match(pattern2) != null) {
                emit({p_brand: this.ps_partkey.p_brand, p_type: this.ps_partkey.p_type, p_size: this.ps_partkey.p_size}, suppkey);
            }
        }
    },
    reduce: function(key, values) {
        return values.length;
    },
    out: 'query016'
});
The output (it seems to me) shows that no reduce ever ran:
{
"result" : "query016",
"timeMillis" : 46862,
"counts" : {
"input" : 122272,
"emit" : 54,
"reduce" : 0,
"output" : 54
},
"ok" : 1
}
What's wrong?
The map function outputs key and value pairs.
The reduce function's purpose is to combine multiple values emitted for the same key. This means that if a particular key is emitted only once, it has only one value and there is nothing to reduce.
This is one of the reasons that you must output the value in your emit statement in exactly the same format that the reduce function will return.
Map outputs:
emit(key1, valueX);
emit(key1, valueY);
emit(key2, valueZ);
Reduce combines valueX and valueY into a new valueXY for key1, and the final result will be:
key1, valueXY
key2, valueZ
Notice that reduce was never called for key2. The reduce function may be called zero, one, or multiple times for each key, so you have to construct both the map and reduce functions to allow for that possibility.
Your map function doesn't emit a correct value - you want to count, so you have to output a count. Your reduce function must loop over the already accumulated counts, add them up, and return the combined count. You may want to look at the examples in the MongoDB documentation.
You can probably do this much simpler using the Aggregation Framework - I don't see the need for MapReduce here unless you are expecting to output a huge amount of results.
I suspect that you called emit(value,key) instead of emit(key,value).
As others have already stated, the mapped value and the reduced value must have the same structure. If you just want to make a count, map a value=1 and in the reduce function just return Array.sum(values).
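To see why the emit-1-and-sum shape survives being re-reduced, here is a plain JavaScript simulation of the map/reduce contract (runnable outside MongoDB; the split into two batches is artificial, chosen just to force a second reduce pass):

```javascript
// reduce must accept previously reduced values as inputs,
// so its output must have the same shape as the emitted values
function reduceFn(key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0); // like Array.sum(values)
}

// simulate map: five documents each emitted 1 for the same key
var emits = [1, 1, 1, 1, 1];

// simulate MongoDB reducing in two batches, then re-reducing the partials
var batch1 = reduceFn("k", emits.slice(0, 3)); // 3
var batch2 = reduceFn("k", emits.slice(3));    // 2
var total  = reduceFn("k", [batch1, batch2]);  // 5

console.log(total); // 5
```

Had reduce returned values.length instead, the re-reduce step would have returned 2 here, which is exactly the kind of wrong result described above.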
Related
I am new to MongoDB and am trying to count the number of "Delta" values in DELTA_GROUP. In this example there are three "Delta" values under "DELTA_GROUP", so the count for this object is 3. I need to satisfy two conditions here, though.
First, I need to count only data collected within a specific time range I set (e.g. between a start point and an end point, using ISODate with $gte, $lte, etc.).
Second, within that time range, I want to count the number of "Delta" values for every object, and of course there are a handful of objects within the specified range. So if I assume that each object has three "Delta" values (as in the example) and there are 10 objects in total, the count result should be 30. How can I create a query for this with the conditions above?
{
"_id" : ObjectId("5f68a088135c701658c24d62"),
"DELTA_GROUP" : [
{
"Delta" : 105,
},
{
"Delta" : 108,
},
{
"Delta" : 103,
}
],
"YEAR" : 2020,
"MONTH" : 9,
"DAY" : 21,
"RECEIVE_TIME" : ISODate("2020-09-21T21:46:00.323Z")
}
What I have tried so far is shown below. This way I was able to list the counted value for each object, but I still need to total those counts over the specified range of dates.
db.DELTA_DATA.aggregate([
    {$match: {
        'RECEIVE_TIME': {
            $gte: ISODate("2020-09-10T00:00:00"),
            $lte: ISODate("2020-10-15T23:59:59")
        }
    }},
    {$project: {"total": {count: {"$size": "$DELTA_GROUP"}}}}
])
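One way to get the single totalized count is to replace the $project with a $group stage that sums the $size of each document's DELTA_GROUP (a sketch, assuming the collection and field names from the question); the same logic is mirrored here in plain JavaScript so it can run without a live database:

```javascript
// Equivalent mongo shell pipeline (sketch):
// db.DELTA_DATA.aggregate([
//     {$match: {RECEIVE_TIME: {$gte: ISODate("2020-09-10T00:00:00"),
//                              $lte: ISODate("2020-10-15T23:59:59")}}},
//     {$group: {_id: null, total: {$sum: {$size: "$DELTA_GROUP"}}}}
// ])

// Plain JavaScript mirror of the same logic, on made-up sample documents:
var docs = [
    { RECEIVE_TIME: new Date("2020-09-21T21:46:00Z"), DELTA_GROUP: [{Delta: 105}, {Delta: 108}, {Delta: 103}] },
    { RECEIVE_TIME: new Date("2020-09-25T10:00:00Z"), DELTA_GROUP: [{Delta: 99}, {Delta: 101}] },
    { RECEIVE_TIME: new Date("2020-11-01T00:00:00Z"), DELTA_GROUP: [{Delta: 1}] } // outside the range
];
var from = new Date("2020-09-10T00:00:00Z");
var to   = new Date("2020-10-15T23:59:59Z");

var total = docs
    .filter(function (d) { return d.RECEIVE_TIME >= from && d.RECEIVE_TIME <= to; }) // $match
    .reduce(function (sum, d) { return sum + d.DELTA_GROUP.length; }, 0);            // $group + $sum of $size

console.log(total); // 5
```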
I have an array of ids of LEGO parts in a LEGO building.
// building collection
{
"name": "Gingerbird House",
"buildingTime": 45,
"rating": 4.5,
"elements": [
{
"_id": 23,
"requiredElementAmt": 14
},
{
"_id": 13,
"requiredElementAmt": 42
}
]
}
and then
//elements collection
{
"_id": 23,
"name": "blue 6 dots brick",
"availableAmt":20
}
{
"_id": 13,
"name": "red 8 dots brick",
"availableAmt":50
}
{"_id":254,
"name": "green 4 dots brick",
"availableAmt":12
}
How can I find whether it's possible to build a building? I.e. the database should return a building only if every entry in the building document's "elements" array requires less than (or equal to) the amount of that element I have in the warehouse (the elements collection).
In SQL (which I came from recently) I would write something like SELECT * FROM building WHERE id NOT IN (SELECT fk_building FROM building_elemnt_amt WHERE fk_element NOT IN (1, 3))
Thank you in advance!
I won't pretend I get how it works in SQL without any comparison, but in MongoDB you can do something like this:
db.buildings.find({/* building filter, if any */}).map(function(b){
    var ok = true;
    b.elements.forEach(function(e){
        ok = ok && 1 == db.elements.find({_id: e._id, availableAmt: {$gte: e.requiredElementAmt}}).count();
    });
    return ok ? b : false;
}).filter(function(b){ return b; });
or
db.buildings.find({/* building filter, if any */}).map(function(b){
    var condition = [];
    b.elements.forEach(function(e){
        condition.push({_id: e._id, availableAmt: {$gte: e.requiredElementAmt}});
    });
    return db.elements.find({$or: condition}).count() == b.elements.length ? b : false;
}).filter(function(b){ return b; });
The last one should be a bit quicker, but I did not test it. If performance is key, it may be better to mapReduce it so the subqueries run in parallel.
Note: the examples above assume that buildings.elements contains no two elements with the same id. Otherwise the elements array needs to be pre-processed before b.elements.forEach to total the requiredElementAmt for non-unique ids.
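That pre-processing can be a short fold over the array (plain JavaScript sketch; the sample element list with a duplicated id 23 is made up):

```javascript
// Collapse duplicate element ids, summing their requiredElementAmt
function totalRequirements(elements) {
    var totals = {};
    elements.forEach(function (e) {
        totals[e._id] = (totals[e._id] || 0) + e.requiredElementAmt;
    });
    // back to the [{_id, requiredElementAmt}] shape the queries above expect
    return Object.keys(totals).map(function (id) {
        return { _id: Number(id), requiredElementAmt: totals[id] };
    });
}

var merged = totalRequirements([
    { _id: 23, requiredElementAmt: 14 },
    { _id: 13, requiredElementAmt: 42 },
    { _id: 23, requiredElementAmt: 6 }  // duplicate id: 14 + 6 = 20
]);
console.log(merged);
```

After this, merged has one entry per id (13 requiring 42, 23 requiring 20) and can be fed to b.elements.forEach unchanged.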
EDIT: How it works:
Select all/some documents from buildings collection with find:
db.buildings.find({/* building filter, if any */})
returns a cursor, which we iterate with map applying the function to each document:
map(function(b){...})
The function itself iterates over elements array for each buildings document b:
b.elements.forEach(function(e){...})
and find number of documents in elements collection for each element e
db.elements.find({_id:e._id, availableAmt:{$gte:e.requiredElementAmt}}).count();
which match a condition:
elements._id == e._id
and
elements.availableAmt >= e.requiredElementAmt
until the first request that returns 0.
Since elements._id is unique, this subquery returns either 0 or 1.
The first 0 in the expression ok = ok && 1 == 0 turns ok to false, so the rest of the elements array is iterated without touching the db.
The function returns either current buildings document, or false:
return ok ? b : false
So the result of the map function is an array containing the full buildings documents that can be built, and false for the ones that lack at least one resource.
Then we filter this array to get rid of false elements, since they hold no useful information:
filter(function(b){return b})
It returns a new array with all elements for which function(b){return b} doesn't return false, i.e. only the full buildings documents.
We have a problem wherein certain strings appear as 123, 00123, 000123. We need to group by this field and we would like all the above to be considered as one group. I know the length of these values cannot be greater than 6.
The approach I was thinking of was to left-pad all of these fields with 0s in a projection, to a length of 6. One way would be to concat six 0s first and then do a substr - but there is no length operator available for me to calculate the indexes for the substr method.
Is there something more direct? Couldn't find anything here : https://docs.mongodb.org/manual/meta/aggregation-quick-reference/#aggregation-expressions or has anyone solved this some way?
I would convert them to int. E.g.:
For collection:
db.leftpad.insert([
    {key: "123"},
    {key: "0123"},
    {key: "234"},
    {key: "000123"}
])
counting:
db.leftpad.mapReduce(function(){
    emit(this.key * 1, 1);
}, function(key, count) {
    return Array.sum(count);
}, {out: {inline: 1}}
).results
returns an array:
[
{_id : 123, value : 3},
{_id : 234, value : 1}
]
If you can, it may be worth doing the conversion once:
db.leftpad.find({key: {$exists: true}, intKey: {$exists: false}}).forEach(function(d){
    db.leftpad.update({_id: d._id}, {$set: {intKey: d.key * 1}});
})
And then group by intKey.
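The multiply-by-1 trick works because JavaScript's numeric coercion ignores leading zeros, so all spellings of the same number collapse to one key. A quick plain-JavaScript check (sample keys taken from the collection above; note that on MongoDB 4.0+ the aggregation operator $toInt could do the same conversion server-side, which is an alternative rather than something shown in this thread):

```javascript
// Group string keys by their numeric value, as the mapReduce above does
var keys = ["123", "0123", "000123", "234"];
var counts = {};
keys.forEach(function (k) {
    var intKey = k * 1; // "000123" * 1 === 123
    counts[intKey] = (counts[intKey] || 0) + 1;
});
console.log(counts); // { '123': 3, '234': 1 }
```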
I have a collection of documents in mongodb and I want to compute the CDF for some of the attributes and return or store it in the db. Obviously adding a new attribute to each document isn't a good approach, and I'm fine with an approximation I can later use. This is more of a theoretical question.
So I went with computing a sampling of the CDF on discrete intervals with a mapreduce job, like this (just the algorithm):
Get the count, min and max of attribute someAttr
Suppose min = 5, max=70, count = 200.
In map(): for (i=this.someAttr; i < max+1; i++) { emit(i, 1) }
In reduce() just return the sum for each key.
In finalize(), divide the reduced output by the record count: return val / count.
This does output a collection with samples from the CDF, however..
As you can see, the interval step here is 1, but the huge inefficiency in this approach is that a single document can produce a monstrous number of emits; even with just a handful of documents in the collection, this is obviously not scalable and will not work.
The output looks like this:
{ _id: 5, val: 0}
{ _id: 6, val: 0.04}
{ _id: 7, val: 0.04}
...
{ _id: 71, val: 1.0}
From here I can easily get an approximated value of CDF for any of the values or even interpolate between them if that's reasonable.
Could someone give me an insight into how would you compute a (sample of) CDF with MapReduce (or perhaps without MapReduce)?
By definition, the cumulative distribution function F_a for an attribute a is defined by
F_a(x) = (# of documents with value of a <= x) / (total # of documents)
So you can compute the CDF with
F_a(x) = db.collection.count({ "a" : { "$lte" : x } }) / db.collection.count({ "a" : { "$exists" : true } })
The count in the denominator assumes you don't want to count documents missing the a field. An index on a will make this fast.
You can use this to compute samples of the cdf or just compute the cdf on demand. There's no need for map-reduce.
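To materialize samples of the CDF at a handful of points, you can just evaluate the definition at each point. The logic is mirrored here in plain JavaScript over an in-memory array (a sketch; in the shell each filter would be one of the two count() calls above):

```javascript
// CDF samples from the definition: F(x) = count(a <= x) / count(a exists)
function cdfSamples(values, points) {
    var n = values.length;
    return points.map(function (x) {
        var le = values.filter(function (v) { return v <= x; }).length;
        return { _id: x, val: le / n };
    });
}

var attr = [5, 6, 6, 20, 35, 35, 50, 70]; // made-up attribute values
console.log(cdfSamples(attr, [5, 20, 70]));
// [ { _id: 5, val: 0.125 }, { _id: 20, val: 0.5 }, { _id: 70, val: 1 } ]
```

Each document is touched once per sample point instead of emitting once per integer above its value, which removes the scalability problem from the question.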
I'm playing around with MapReduce in MongoDB and Python and I've run into a strange limitation. I'm just trying to count the number of "book" records. It works when there are fewer than 100 records, but when it goes over 100 records the count resets for some reason.
Here is my MR code and some sample outputs:
var M = function () {
    book = this.book;
    emit(book, {count: 1});
}
var R = function (key, values) {
    var sum = 0;
    values.forEach(function(x) {
        sum += 1;
    });
    var result = {
        count: sum
    };
    return result;
}
MR output when record count is 99:
{u'_id': u'superiors', u'value': {u'count': 99}}
MR output when record count is 101:
{u'_id': u'superiors', u'value': {u'count': 2.0}}
Any ideas?
Your reduce function should be summing the count values, not just adding 1 for each value; otherwise the output of one reduce can't properly be used as input to another reduce. Try this instead:
var R = function (key, values) {
    var sum = 0;
    values.forEach(function(x) {
        sum += x.count;
    });
    var result = {
        count: sum
    };
    return result;
}
If 100 or more values are emitted for a key, the first 100 are sent to the reduce function and processed, producing:
{count: 100}
Then only 1 emit remains; it is sent to the reduce function and processed:
{count: 1}
OK, the intermediate result is now:
[{count: 100}, {count: 1}]
And then the reduce function is called again on that array (very important!). Because your forEach does sum += 1 and there are two elements in the array, the result is 2.
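A plain JavaScript simulation of this batching makes the failure concrete (the batch size of 100 matches the behavior described above; everything else here is made up for illustration):

```javascript
function buggyReduce(key, values) {
    var sum = 0;
    values.forEach(function (x) { sum += 1; });      // counts elements, ignores x.count
    return { count: sum };
}

function fixedReduce(key, values) {
    var sum = 0;
    values.forEach(function (x) { sum += x.count; }); // sums the accumulated counts
    return { count: sum };
}

// Reduce in batches of 100, then re-reduce the partial results
function runInBatches(reduce, emits) {
    var partials = [];
    for (var i = 0; i < emits.length; i += 100) {
        partials.push(reduce("superiors", emits.slice(i, i + 100)));
    }
    return partials.length === 1 ? partials[0] : reduce("superiors", partials);
}

// 101 emits of {count: 1}, as in the question
var emits = [];
for (var i = 0; i < 101; i++) emits.push({ count: 1 });

console.log(runInBatches(buggyReduce, emits).count); // 2
console.log(runInBatches(fixedReduce, emits).count); // 101
```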
ref: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Amoretechnicalexplanation