Cumulative distribution in MongoDB using MapReduce - mongodb

I have a collection of documents in mongodb and I want to compute the CDF for some of the attributes and return or store it in the db. Obviously adding a new attribute to each document isn't a good approach, and I'm fine with an approximation I can later use. This is more of a theoretical question.
So I went with computing a sampling of the CDF on discrete intervals with a mapreduce job, like this (just the algorithm):
Get the count, min and max of attribute someAttr
Suppose min = 5, max=70, count = 200.
In map(): for (var i = this.someAttr; i <= max; i++) { emit(i, 1) }
In reduce() just return the sum for each key.
In finalize(), divide the reduced output by the record count: return val / count.
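A minimal mongo-shell sketch of that job, assuming the attribute is called someAttr, the collection and output names are placeholders, and min/max/count are computed up front and passed in via scope:

var stats = db.collection.aggregate([
  { $group: { _id: null,
              min: { $min: "$someAttr" },
              max: { $max: "$someAttr" },
              count: { $sum: 1 } } }
]).toArray()[0];

db.collection.mapReduce(
  function () {
    // one (key, 1) pair per sample point at or above this document's value
    for (var i = this.someAttr; i <= max; i++) { emit(i, 1); }
  },
  function (key, values) { return Array.sum(values); },
  {
    out: "someAttr_cdf",                              // illustrative output collection name
    scope: { max: stats.max, count: stats.count },    // make max/count visible in map() and finalize()
    finalize: function (key, reduced) { return reduced / count; }
  }
);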
This does output a collection with samples from the CDF, however..
As you can see, the interval step here is 1, but the huge inefficiency in this approach is that a single document can produce a monstrous number of emits, even with just a handful of documents in the collection, so this obviously won't scale.
The output looks like this:
{ _id: 5, val: 0}
{ _id: 6, val: 0.04}
{ _id: 7, val: 0.04}
...
{ _id: 70, val: 1.0}
From here I can easily get an approximated value of CDF for any of the values or even interpolate between them if that's reasonable.
Could someone give me an insight into how you would compute a (sample of the) CDF with MapReduce (or perhaps without MapReduce)?

By definition, the cumulative distribution function F_a for an attribute a is defined by
F_a(x) = (# documents with attribute value <= x) / (# of documents)
So you can compute the CDF with
F_a(x) = db.collection.count({ "a" : { "$lte" : x } }) / db.collection.count({ "a" : { "$exists" : true } })
The count in the denominator assumes you don't want to count documents missing the a field. An index on a will make this fast.
You can use this to compute samples of the CDF or just compute the CDF on demand. There's no need for map-reduce.
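For example, a rough sketch of sampling the CDF at integer steps in the mongo shell (the collection, attribute, and output names are placeholders; the 5..70 range comes from the numbers in the question):

db.collection.createIndex({ a: 1 });   // makes the range counts fast

var total = db.collection.count({ a: { $exists: true } });
var samples = [];
for (var x = 5; x <= 70; x++) {
  samples.push({ _id: x, val: db.collection.count({ a: { $lte: x } }) / total });
}
// optionally store the samples for later lookup / interpolation
db.a_cdf.insert(samples);              // "a_cdf" is just an illustrative output collection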

Related

How to calculate closest total price from item prices collection with multiple items

Given a set of prices from a provider I want to work out the closest matching item set that will match a total value.
E.g. the total value to match is $5, and I have a price list from MongoDB as follows:
val items = List(0.05, 0.06, 1.0, 2.0)
How do I work out the set of items that will give $5? The items can be duplicated in order to match the price.
My idea is to use a Stream to evaluate the price list, create permutations, and match with an accuracy of $0.01, but I do not know how to go about doing that.
The other idea is to write an aggregation from MongoDB to provide the permutations and then sort by total price and pick the closest one
UPDATE:
So I am now sampling products from the database using the following:
db.getCollection("shop-items").aggregate([
{
$bucketAuto: {
groupBy: "$prices.avg",
buckets: 50,
output: {
"items" : {
$push: "$$ROOT"
},
"count": { $sum: 1 },
}
}
},
{$addFields: {"startField": {$floor :{$multiply: [ { $rand: {} }, "$count"]}}}},
{$project : { _id: 0, items: {$slice: ["$items", "$startField", 20]}}},
{$unwind: "$items"}
])
This results in about 650 items, which would still leave 650! permutations that I would need to calculate.
This is an algorithm question. Using a DB to generate permutations doesn't sound efficient. What you are looking for is a knapsack solution with memoization. Basically, you need to solve ar(n) = minimum number of items needed to reach total sum n (with a buffer of $0.01, if needed). The algorithm sketch is:
Memoized array ar[500], where ar[y] = minimum number of items to match $(y/100).
All elements are initialized to an infinite/very large value.
val items = [5, 6, 100, 200]
Initialize / populate: ar[0] = 0; items.forEach(item => ar[item] = 1);
while (ar[500] is not populated) {
  for (y where ar[y] is populated) {
    items.forEach(item => ar[y + item] = min(ar[y + item], ar[y] + 1));
  }
}
This gives an upper-bound complexity of O(n * m), where n is the total number of possible totals (here 500) and m is the number of distinct item prices, probably a constant.
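A runnable JavaScript sketch of that memoized approach, with prices in cents (the helper name is just illustrative):

function minItemsForTotal(items, target) {
  var INF = Infinity;
  var ar = [];
  for (var y = 0; y <= target; y++) ar[y] = INF;   // nothing reachable yet
  ar[0] = 0;
  for (var y = 0; y <= target; y++) {
    if (ar[y] === INF) continue;                   // total y is not reachable
    items.forEach(function (item) {
      if (y + item <= target) {
        ar[y + item] = Math.min(ar[y + item], ar[y] + 1);
      }
    });
  }
  return ar[target];                               // Infinity means no exact match exists
}

var items = [5, 6, 100, 200];                      // $0.05, $0.06, $1.00, $2.00
print(minItemsForTotal(items, 500));               // minimum number of items summing to $5.00 -> 3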

search in limited number of records MongoDB

I want to search in the first 1000 records of my collection, whose name is CityDB. I used the following code:
db.CityDB.find({'index.2':"London"}).limit(1000)
but it does not work: it returns the first 1000 matches, whereas I want to search only within the first 1000 records, not all records. Could you please help me.
Thanks,
Amir
Note that there is no guarantee that your documents are returned in any particular order by a query as long as you don't sort explicitly. Documents in a new collection are usually returned in insertion order, but various things can cause that order to change unexpectedly, so don't rely on it. By the way: auto-generated _ids start with a timestamp, so when you sort by _id, the documents are returned by creation date.
Now about your actual question. When you first want to limit the documents and then perform a filter-operation on this limited set, you can use the aggregation pipeline. It allows you to use $limit-operator first and then use the $match-operator on the remaining documents.
db.CityDB.aggregate([
  // { $sort: { _id: 1 } }, // <- uncomment when you want the first 1000 by creation time
  { $limit: 1000 },
  { $match: { 'index.2': "London" } }
])
I can think of two ways to achieve this:
1) Keep a global counter, and every time you insert data into your collection, add a field count = currentCounter and increase currentCounter by 1. When you need to select your first k elements, you can find them this way:
db.CityDB.find({
  'index.2': "London",
  count: { '$gte': currentCounter - k }
})
This is not atomic and might sometimes give you more than k elements on a heavily loaded system (but it can use indexes).
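A rough sketch of maintaining such a counter with findAndModify (the counter collection and field names are illustrative):

var next = db.counters.findAndModify({
  query: { _id: "cityDbCounter" },
  update: { $inc: { seq: 1 } },
  new: true,
  upsert: true
});
db.CityDB.insert({ count: next.seq /* , the rest of the document's fields */ });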
Here is another approach which works nicely in the shell:
2) Create your dummy data:
var k = 100;
for (var i = 1; i < k; i++) {
  db.a.insert({
    _id: i,
    z: Math.floor(1 + Math.random() * 10)
  });
}
var output = [];
And now find in the first k records where z == 3
k = 10;
db.a.find().sort({ $natural: -1 }).limit(k).forEach(function(el) {
  if (el.z == 3) {
    output.push(el);
  }
});
As you can see, output contains the matching elements:
output
I think it is pretty straightforward to modify my example for your needs.
P.S. Also take a look at the aggregation framework; there might be a way to achieve what you need with it.

mongodb - Find document with closest integer value

Let's assume I have a collection with documents with a ratio attribute that is a floating point number.
{'ratio':1.437}
How do I write a query to find the single document with the closest value to a given integer without loading them all into memory using a driver and finding one with the smallest value of abs(x-ratio)?
Interesting problem. I don't know if you can do it in a single query, but you can do it in two:
var x = 1; // given integer
closestBelow = db.test.find({ratio: {$lte: x}}).sort({ratio: -1}).limit(1);
closestAbove = db.test.find({ratio: {$gt: x}}).sort({ratio: 1}).limit(1);
Then you just check which of the two docs has the ratio closest to the target integer.
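For instance, a small sketch of that final comparison (assuming both cursors were obtained as above):

var below = closestBelow.hasNext() ? closestBelow.next() : null;
var above = closestAbove.hasNext() ? closestAbove.next() : null;
var closest = (below && above)
  ? ((x - below.ratio) <= (above.ratio - x) ? below : above)
  : (below || above);   // one side may have no match at all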
MongoDB 3.2 Update
The 3.2 release adds support for the $abs absolute value aggregation operator which now allows this to be done in a single aggregate query:
var x = 1;
db.test.aggregate([
  // Project a diff field that's the absolute difference along with the original doc.
  { $project: { diff: { $abs: { $subtract: [x, '$ratio'] } }, doc: '$$ROOT' } },
  // Order the docs by diff
  { $sort: { diff: 1 } },
  // Take the first one
  { $limit: 1 }
])
I have another idea, but it is very tricky and needs a change to your data structure.
You can use a geospatial index, which is supported by MongoDB.
First, change your data to this structure, keeping the second value at 0:
{'ratio':[1.437, 0]}
Then you can use the $near operator to find the closest ratio value. Because the operator returns a list sorted by distance from the point you give, you have to use limit to get only the closest value.
db.places.find( { ratio : { $near : [50,0] } } ).limit(1)
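Note that the legacy $near query above needs a 2d index on the field, so (following the same example names) you would create it first:

db.places.createIndex({ ratio: "2d" });
db.places.find({ ratio: { $near: [50, 0] } }).limit(1);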
If you don't want to do this, I think you can just use @JohnnyHK's answer :)

MongoDB map/reduce counts

The output from MongoDB's map/reduce includes something like 'counts': {'input': I, 'emit': E, 'output': O}. I thought I clearly understood what those meant, until I hit a weird case which I can't explain.
According to my understanding, counts.input is the number of rows that match the condition (as specified in query). If so, how is it possible that the following two queries have different results?
db.mycollection.find({MY_CONDITION}).count()
db.mycollection.mapReduce(SOME_MAP, SOME_REDUCE, {'query': {MY_CONDITION}}).counts.input
I thought the two should always give the same result, independent of the map and reduce functions, as long as the same condition is used.
The map/reduce pattern is like a group function in SQL: several results are grouped into one row, so you can't expect the same number of results.
The count in the mapReduce() result is the number of results after the map/reduce functions have run.
For example, say you have 2 rows:
{'id':3,'num':5}
{'id':4,'num':5}
And you apply the map function
function() {
  emit(this.num, 1);
}
After this map function you get 2 rows:
{5, 1}
{5, 1}
And now you apply your reduce method:
function(k, vals) {
  var sum = 0;
  for (var i in vals) sum += vals[i];
  return sum;
}
Now only 1 row is returned:
2
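For illustration, a minimal sketch of the complete call on those two example rows (the collection name is made up; the counts shown are roughly what you would expect):

var res = db.example.mapReduce(
  function() { emit(this.num, 1); },
  function(k, vals) {
    var sum = 0;
    for (var i in vals) sum += vals[i];
    return sum;
  },
  { out: { inline: 1 }, query: { num: 5 } }
);
// res.counts would look something like { input: 2, emit: 2, reduce: 1, output: 1 }:
// input follows the query, while output counts the grouped keys.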
Is your server steady-state in between the two calls?