How to perform statistics for every n elements in MongoDB

How do I perform basic statistics for every n elements in MongoDB? For example, say I have a total of 100 records like the following:
Name  Count  Sample
a     10     x
a     20     y
a     10     z
b     10     x
b     10     y
b     5      z
How do I compute the mean, median, and standard deviation for every 10 records, so that I get 10 results? That is, I want to calculate mean/median/std dev for "a" over every 10 samples until the end of the collection, and similarly for "b", "c", and so on.
Excuse me if this is a naive question.

You need some sort of counter to keep track of position. For example, I have added a rownum here and then applied a bucket of 3 (here n = 3), returning the sum and average of each group of 3. This example can be extended with sorting and grouping before creating the buckets to get the desired result.
Please refer to https://mongoplayground.net/p/CL7vQGUWD_S
db.collection.aggregate([
  {
    $set: {
      "rownum": {
        "$function": {
          "body": "function() {try {row_number+= 1;} catch (e) {row_number= 0;}return row_number;}",
          "args": [],
          "lang": "js"
        }
      }
    }
  },
  {
    $bucket: {
      groupBy: "$rownum",                        // Field to group by
      boundaries: [1, 4, 7, 11, 14, 17, 21, 25], // Boundaries for the buckets
      default: "Other",                          // Bucket id for documents which do not fall into a bucket
      output: {                                  // Output for each bucket
        "countSUM": { $sum: "$count" },
        "averagePrice": { $avg: "$count" }
      }
    }
  }
])
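For intuition, here is the same per-n-records statistic computed in plain JavaScript (runnable in Node). The function name and output shape are illustrative, not part of the pipeline above; it chunks one group's values into blocks of n and computes mean, median, and population standard deviation per chunk, which is what the question asks for on top of the sum/avg the bucket stage returns.

```javascript
// Chunk a list of numeric values into groups of n and compute basic stats
// per chunk (mean, median, population std dev). Illustrative helper only.
function chunkStats(values, n) {
  const out = [];
  for (let i = 0; i < values.length; i += n) {
    const chunk = values.slice(i, i + n);
    const mean = chunk.reduce((s, v) => s + v, 0) / chunk.length;
    const sorted = [...chunk].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const median = sorted.length % 2
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
    const variance =
      chunk.reduce((s, v) => s + (v - mean) ** 2, 0) / chunk.length;
    out.push({ mean, median, stdDev: Math.sqrt(variance) });
  }
  return out;
}
```

For example, `chunkStats([1, 2, 3, 4], 2)` yields one stats object per pair of records, mirroring what each $bucket group would feed into further computation.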

Related

How to group by uniform intervals of data between a maximum and minimum using the MongoDB aggregator?

Let's say I have a whole mess of data that yields a range of integer values for a particular field... I'd like to see those ranked by a grouping of intervals of occurrence, perhaps because I am clustering...like so:
[{
    _id: { response_time: "3-4" },
    count: 234,
    countries: ['US', 'Canada', 'UK']
}, {
    _id: { response_time: "4-5" },
    count: 452,
    countries: ['US', 'Canada', 'UK', 'Poland']
}, ...]
How can I write a quick and dirty way to A) group the collection data by equally spaced intervals over B) a minimum and maximum range using a MongoDB aggregator?
Well, in order to quickly formulate a conditional grouping syntax for MongoDB aggregators, we first adopt the pattern, per MongoDB syntax:
{ $cond: [
    { <conditional> },   // test the conditional
    <truthy_value>,      // assign if true
    { $cond: [           // evaluate if false
        { <conditional> },
        <truthy_value>,
        ...              // and so forth
    ]}
]}
To do that very quickly, without having to write every last interval out in a deeply nested conditional, we can use this handy recursive algorithm (imported in your shell script or Node.js script, of course):
var $condIntervalBuilder = function (field, interval, min, max) {
    var cond;
    if (min < max - 1) {
        cond = [
            { '$and': [{ $gt: [field, min] }, { $lte: [field, min + interval] }] },
            [min, '-', (min + interval)].join('')
        ];
        if ((min + interval) > max) {
            // Last interval overshoots the range: shrink it to fit
            cond.push($condIntervalBuilder(field, (max - min), min, max));
        } else {
            min += interval;
            cond.push($condIntervalBuilder(field, interval, min, max));
        }
    } else if (min >= max - 1) {
        cond = [
            { $gt: [field, max] },
            [max, '<'].join(''), // Accounts for all values outside the range
            [min, '<'].join('')  // Lesser upper bound
        ];
    }
    return { $cond: cond };
};
Then, we can invoke it in-line or assign it to a variable that we use elsewhere in our analysis.
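For reference, here is a self-contained, Node-runnable variant of that builder (a sketch, not the answer's exact code: the external namespace is dropped and the last interval is clamped to the range). It returns the nested $cond tree that labels a field value with its interval, e.g. "0-5":

```javascript
// Build a nested $cond expression labelling `field` by interval, e.g. "0-5".
// Illustrative variant: the final interval is clamped so it never overshoots max.
function condIntervalBuilder(field, interval, min, max) {
  if (min < max - 1) {
    const upper = Math.min(min + interval, max);
    return {
      $cond: [
        { $and: [{ $gt: [field, min] }, { $lte: [field, upper] }] },
        [min, '-', upper].join(''),
        condIntervalBuilder(field, interval, upper, max) // evaluate next interval if false
      ]
    };
  }
  // Base case: anything above max falls outside the range
  return { $cond: [{ $gt: [field, max] }, max + '<', min + '<'] };
}
```

Invoking `condIntervalBuilder("$response_time", 5, 0, 10)` yields the deeply nested expression you would otherwise write by hand, ready to drop into a $project or $group stage.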

Ranges in MongoDB documents

I want to define ranges in a MongoDB document so I can query by values belonging to a particular range.
For example, denoting a range by [min, max], if we consider a collection of three documents using this notation:
{
temperature: [-100, 10],
sensation: "Cold"
},
{
temperature: [10, 30],
sensation: "Mild"
},
{
temperature: [30, 50],
sensation: "Hot"
}
I would like to make queries of values within these ranges and see which documents fit in:
temperature = 25.6 -> sensation = "Mild"
I know I can store min and max values as separate values, but maybe someone can think in a more elegant and efficient way of defining indexable ranges in MongoDB.
You need to use dot notation with the array indexes of the min and max values in your array.
db.collection.find(
{
"temperature.0": { "$lt": 25.6 },
"temperature.1": { "$gt": 25.6 }
},
{ "sensation": 1, "_id": 0 }
)
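For comparison, the same lookup done in application code (note the boundary handling here uses an inclusive lower bound, which is a design choice; the query above uses strict comparisons on both ends):

```javascript
// In-memory equivalent of the range query: `temperature` is a [min, max] pair.
const docs = [
  { temperature: [-100, 10], sensation: "Cold" },
  { temperature: [10, 30],   sensation: "Mild" },
  { temperature: [30, 50],   sensation: "Hot" }
];

// Find the sensation whose range contains the value (inclusive min, exclusive max)
function sensationFor(value) {
  const hit = docs.find(d => value >= d.temperature[0] && value < d.temperature[1]);
  return hit ? hit.sensation : null;
}
```

So `sensationFor(25.6)` returns "Mild", matching the query's behavior for values strictly inside a range.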

Map Reduce Mongo DB: Sum of ODD and EVEN numbers with elements

I am trying to process a number series (collection) to get the sum of odd and even numbers separately, along with the elements considered in each calculation.
The numberseries document structure is as follows:
{
_id: <Autogenerated>,
number: <any number, it can repeat. Even if it repeats, it should be added each time. >
}
The output is something like below (not exact, but in general):
{
..
{
"odd":<result>, elements:{n1,n3,n5}
},
{
"even":<result>, elements:{n2,n4,n6}
}
..
}
Map Function:
mapf = function(){
var value = { sum : 0, elements :[] };
value.sum = this.number;
value.elements.push(this.number);
print(tojson(value));
if( this.number % 2 != 0 ){
emit( "odd", value );
}
if( this.number % 2 == 0 ){
emit( "even", value );
}
}
Reduce Values argument:
Values is an array of JSON emitted from map:
[{
"sum": 1,
"elements": [1]
}, {
"sum": 3,
"elements": [3]
} ... ]
Reduce Function:
reducef = function(key, values){
var result = { sum : 0 , elements:[] };
print("K " + key +"Values array " + tojson(values) );
for(var i = 0; i<values.length;i++ ){
v = values[i];
print("Key "+key+"V.JSON"+tojson(v)+" V.SUM -> "+v.sum);
result.sum += v.sum;
result.elements.push(v.elements[0]);
print(tojson(result));
}
return result;
}
I am getting the sum correctly, but the elements array is not getting populated properly. It contains only some of the elements considered in the calculations.
UPDATE
As per the answer given by Neil, I further verified my code. I found that my code, without any modification, works for small datasets but not for large ones.
Below are the points I verified as pointed out; I found my code to be correct.
print("K " + key +"Values array " + tojson(values) );
Above line in reduce function results in following values object printed.
[{
"sum": 1,
"elements": [1]
}, {
"sum": 3,
"elements": [3]
}, {
"sum": 5,
"elements": [5]
}, {
"sum": 7,
"elements": [7]
}, {
"sum": 9,
"elements": [9]
}, {
"sum": 11,
"elements": [11]
}, {
"sum": 13,
"elements": [13]
}, {
"sum": 15,
"elements": [15]
}, {
"sum": 17,
"elements": [17]
}, {
"sum": 19,
"elements": [19]
}]
Hence the line to push elements to array in final results result.elements.push(v.elements[0]); should be correct.
In map function, before emitting, I am modifying value.sum as follows
value.sum = this.number;
This ensures that sum is not zero and numbers are properly getting added due to this.
When I test this code with 20 records, 40 records, 100 records, it works perfectly.
When I test this code with 20000 records, the sum value is correct but the elements array does not contain 10000 elements each (odd and even numbers are equally distributed in the collection).
In the latter case, I get the message below:
query not recording (too large)
Okay, there is a clear reason and you do appear to have read some of the documentation and at least applied this rule:
"the type of the return object must be identical to the type of the value emitted by the map function ..."
This means that the map function and the reduce function must essentially have the same output shape, which you did:
{ sum : 0, elements :[] };
But there was a piece of documentation that has not been understood:
"MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key."
So where the whole thing goes wrong is that you have assumed that since your "map" function only emits one element, that then there will be only one element in the "elements" array. A careful re-read of the above says that this is not true. And in fact the output from "reduce" will very likely be fed back into the "reduce" function again. This is indeed how mapReduce deals with a large number of values for the "values" array.
To fix it, change this in the "reduce" function:
result.elements.push(v.elements[0]);
To this:
v.elements.forEach(function(element) {
    result.elements.push(element);
});
And in that way, when the "reduce" function returns a result that has summed up a few "elements" already and pushed them to the list, then that "input" will be processed correctly and merged with any other "values" that come in with it.
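The quoted rule is easy to check outside MongoDB: a reducer is only safe if feeding its own output back in still gives the right answer. A small Node simulation of the corrected reducer shows that nothing is lost on re-reduce:

```javascript
// The corrected reducer: merges whole `elements` arrays, not just element 0.
function reducef(key, values) {
  const result = { sum: 0, elements: [] };
  for (const v of values) {
    result.sum += v.sum;
    v.elements.forEach(function (element) {
      result.elements.push(element);
    });
  }
  return result;
}

// First pass over two mapped values:
const batch1 = reducef("odd", [
  { sum: 1, elements: [1] },
  { sum: 3, elements: [3] }
]);

// MongoDB may later feed batch1 back in alongside new values:
const total = reducef("odd", [batch1, { sum: 5, elements: [5] }]);
// total: { sum: 9, elements: [1, 3, 5] } -- no element is dropped
```

Had the reducer pushed only `v.elements[0]`, the second call would have silently discarded the `3` inside `batch1`, which is exactly the symptom described above on large datasets.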
BTW, I think you actually meant this in your mapper:
var value = { sum : 1, elements :[] };
Otherwise this code down here would just be summing 0's:
result.sum += v.sum;
But aggregate does this better
All of that said, the following aggregation framework statement does the same thing, but better and faster, with an implementation in native code:
db.collection.aggregate([
{ "$project": {
"type": { "$cond": [
{ "$eq": [ { "$mod": [ "$number", 2 ] }, 0 ] },
"even",
"odd"
]},
"number": 1
}},
{ "$group": {
"_id": "$type",
"sum": { "$sum": 1 },
"elements": { "$push": "$number" }
}}
])
Also note that in both cases you are not really "summing the elements", but rather "counting" them. So if you want the sum, then the mapReduce part becomes:
//result.sum += v.sum;
v.elements.forEach(function(element) {
    result.sum += element;
    result.elements.push(element);
});
And the aggregate part becomes:
{ "$group": {
"_id": "$type",
"sum": { "$sum": "$number" },
"elements": { "$push": "$number" }
}}
Which truly sums the "odd" or "even" numbers as found in your collection.

Moving averages with MongoDB's aggregation framework?

If you have 50 years of daily temperature data (for example), how would you calculate moving averages using 3-month intervals for that time period? Can you do that with one query, or would you need multiple queries?
Example Data
01/01/2014 = 40 degrees
12/31/2013 = 38 degrees
12/30/2013 = 29 degrees
12/29/2013 = 31 degrees
12/28/2013 = 34 degrees
12/27/2013 = 36 degrees
12/26/2013 = 38 degrees
.....
The agg framework now has $map, $reduce, and $range built in, so array processing is much more straightforward. Below is an example of calculating a moving average on a set of data where you wish to filter by some predicate. The basic setup is that each doc contains filterable criteria and a value, e.g.
{sym: "A", d: ISODate("2018-01-01"), val: 10}
{sym: "A", d: ISODate("2018-01-02"), val: 30}
Here it is:
// This controls the number of observations in the moving average:
days = 4;
c=db.foo.aggregate([
// Filter down to what you want. This can be anything or nothing at all.
{$match: {"sym": "S1"}}
// Ensure dates are going earliest to latest:
,{$sort: {d:1}}
// Turn docs into a single doc with a big vector of observations, e.g.
// {sym: "A", d: d1, val: 10}
// {sym: "A", d: d2, val: 11}
// {sym: "A", d: d3, val: 13}
// becomes
// {_id: "A", prx: [ {v:10,d:d1}, {v:11,d:d2}, {v:13,d:d3} ] }
//
// This will set us up to take advantage of array processing functions!
,{$group: {_id: "$sym", prx: {$push: {v:"$val",d:"$d"}} }}
// Nice additional info. Note use of dot notation on array to get
// just scalar date at elem 0, not the object {v:val,d:date}:
,{$addFields: {numDays: days, startDate: {$arrayElemAt: [ "$prx.d", 0 ]}} }
// The Juice! Assume we have a variable "days" which is the desired number
// of days of moving average.
// The complex expression below does this in python pseudocode:
//
// for z in range(0, size of value vector - # of days in moving avg):
// seg = vector[n:n+days]
// values = seg.v
// dates = seg.d
// for v in seg:
// tot += v
// avg = tot/len(seg)
//
// Note that it is possible to overrun the segment at the end of the "walk"
// along the vector, i.e. not enough date-values. So we only run the
// vector to (len(vector) - (days-1)).
// Also, for extra info, we also add the number of days *actually* used in the
// calculation AND the as-of date which is the tail date of the segment!
//
// Again we take advantage of dot notation to turn the vector of
// object {v:val, d:date} into two vectors of simple scalars [v1,v2,...]
// and [d1,d2,...] with $prx.v and $prx.d
//
,{$addFields: {"prx": {$map: {
input: {$range:[0,{$subtract:[{$size:"$prx"}, (days-1)]}]} ,
as: "z",
in: {
avg: {$avg: {$slice: [ "$prx.v", "$$z", days ] } },
d: {$arrayElemAt: [ "$prx.d", {$add: ["$$z", (days-1)] } ]}
}
}}
}}
]);
This might produce the following output:
{
"_id" : "S1",
"prx" : [
{
"avg" : 11.738793632512115,
"d" : ISODate("2018-09-05T16:10:30.259Z")
},
{
"avg" : 12.420766702631376,
"d" : ISODate("2018-09-06T16:10:30.259Z")
},
...
],
"numDays" : 4,
"startDate" : ISODate("2018-09-02T16:10:30.259Z")
}
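The pipeline's $range/$slice/$avg walk can be restated as plain JavaScript over the grouped vector, which may help when sanity-checking results (function name is illustrative):

```javascript
// Walk a vector of {v: value, d: date} observations and emit one moving
// average per starting index, tagged with the segment's tail ("as-of") date.
function movingAvg(prx, days) {
  const out = [];
  // Stop early so every segment has exactly `days` entries
  for (let z = 0; z <= prx.length - days; z++) {
    const seg = prx.slice(z, z + days);
    const avg = seg.reduce((s, o) => s + o.v, 0) / days;
    out.push({ avg, d: seg[days - 1].d });
  }
  return out;
}
```

This is the same loop the python pseudocode in the pipeline comments describes: slice a `days`-wide window, average the values, and record the tail date.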
The way I would tend to do this in MongoDB is maintain a running sum of the past 90 days in the document for each day's value, e.g.
{"day": 1, "tempMax": 40, "tempMaxSum90": 2232}
{"day": 2, "tempMax": 38, "tempMaxSum90": 2230}
{"day": 3, "tempMax": 36, "tempMaxSum90": 2231}
{"day": 4, "tempMax": 37, "tempMaxSum90": 2233}
Whenever a new data point needs to be added to the collection, instead of reading and summing 90 values you can efficiently calculate the next sum with two simple queries, one addition and one subtraction, like this (pseudo-code):
tempMaxSum90(day) = tempMaxSum90(day-1) + tempMax(day) - tempMax(day-90)
The 90-day moving average at each day is then just the 90-day sum divided by 90.
If you also wanted to offer moving averages over different time-scales (e.g. 1 week, 30 days, 90 days, 1 year), you could simply maintain an array of sums in each document instead of a single sum, one for each time-scale required.
This approach costs additional storage space and additional processing when inserting new data, but it is appropriate in most time-series charting scenarios, where new data is collected relatively slowly and fast retrieval is desirable.
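In application code, the update and the average are each one line. A sketch (the function and field names are illustrative, window fixed at 90 days):

```javascript
// O(1) running-sum update: add today's value, drop the value leaving the window.
function nextSum90(prevSum90, todayMax, max90DaysAgo) {
  return prevSum90 + todayMax - max90DaysAgo;
}

// The moving average is then just the maintained sum over the window size.
function movingAvg90(sum90) {
  return sum90 / 90;
}
```

On insert you would read yesterday's document for `prevSum90`, read the document from 90 days ago for the dropped value, and store `nextSum90(...)` alongside the new data point.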
The accepted answer helped me, but it took a while for me to understand how it worked, so I thought I'd explain my method to help others out. Particularly in your context, I think my answer will help.
This works best on smaller datasets.
First group the data by day, then append all days in an array to each day:
{
    "$sort": {
        "Date": -1
    }
},
{
    "$group": {
        "_id": {
            "Day": "$Date",
            "Temperature": "$Temperature"
        },
        "Previous Values": {
            "$push": {
                "Date": "$Date",
                "Temperature": "$Temperature"
            }
        }
    }
},
This will leave you with a record that looks like this (it'll be ordered correctly):
{"_id": {"Day": "2017-02-01", "Temperature": 40},
 "Previous Values": [
     {"Date": "2017-03-01", "Temperature": 20},
     {"Date": "2017-02-11", "Temperature": 22},
     {"Date": "2017-01-18", "Temperature": 3},
     ...
 ]},
Now that each day has all days appended to it, we need to remove the items from the Previous Values array that are more recent than this record's _id.Day, as the moving average is backward-looking:
{
    "$project": {
        "_id": 0,
        "Date": "$_id.Day",
        "Temperature": "$_id.Temperature",
        "Previous Values": 1
    }
},
{
    "$project": {
        "_id": 0,
        "Date": 1,
        "Temperature": 1,
        "Previous Values": {
            "$filter": {
                "input": "$Previous Values",
                "as": "pv",
                "cond": {
                    "$lte": ["$$pv.Date", "$Date"]
                }
            }
        }
    }
},
The Previous Values array of each record will now only contain the dates less than or equal to that record's date:
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": [
     {"Date": "2017-01-31", "Temperature": 33},
     {"Date": "2017-01-30", "Temperature": 36},
     {"Date": "2017-01-29", "Temperature": 33},
     {"Date": "2017-01-28", "Temperature": 32},
     ...
 ]}
Now we can pick our averaging window size. Since the data is daily, for a week we'd take the first 7 records of the array; for monthly, 30; for 3-monthly, 90 days:
{
    "$project": {
        "_id": 0,
        "Date": 1,
        "Temperature": 1,
        "Previous Values": {
            "$slice": ["$Previous Values", 0, 90]
        }
    }
},
To average the previous temperatures we unwind the Previous Values array then group by the date field. The unwind operation does this:
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {"Date": "2017-01-31", "Temperature": 33}
},
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {"Date": "2017-01-30", "Temperature": 36}
},
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {"Date": "2017-01-29", "Temperature": 33}
},
...
See that the Date field is the same, but we now have a document for each of the previous dates from the Previous Values array. Now we can group back on day, then average Previous Values.Temperature to get the moving average:
{"$group": {
"_id": {
"Day": "$Date",
"Temperature": "$Temperature"
},
"3 Month Moving Average": {
"$avg": "$Previous Values.Temperature"
}
}
}
That's it! I know that joining every record to every record isn't ideal, but this works fine on smaller datasets.
Starting in Mongo 5, it's a perfect use case for the new $setWindowFields aggregation operator:
Note that I consider the rolling average to have a 3-day window for simplicity (today and the 2 previous days):
// { date: ISODate("2013-12-26"), temp: 38 }
// { date: ISODate("2013-12-27"), temp: 36 }
// { date: ISODate("2013-12-28"), temp: 34 }
// { date: ISODate("2013-12-29"), temp: 31 }
// { date: ISODate("2013-12-30"), temp: 29 }
// { date: ISODate("2013-12-31"), temp: 38 }
// { date: ISODate("2014-01-01"), temp: 40 }
db.collection.aggregate([
{ $setWindowFields: {
sortBy: { date: 1 },
output: {
movingAverage: {
$avg: "$temp",
window: { range: [-2, "current"], unit: "day" }
}
}
}}
])
// { date: ISODate("2013-12-26"), temp: 38, movingAverage: 38 }
// { date: ISODate("2013-12-27"), temp: 36, movingAverage: 37 }
// { date: ISODate("2013-12-28"), temp: 34, movingAverage: 36 }
// { date: ISODate("2013-12-29"), temp: 31, movingAverage: 33.67 }
// { date: ISODate("2013-12-30"), temp: 29, movingAverage: 31.33 }
// { date: ISODate("2013-12-31"), temp: 38, movingAverage: 32.67 }
// { date: ISODate("2014-01-01"), temp: 40, movingAverage: 35.67 }
This:
sorts documents chronologically: sortBy: { date: 1 }
creates for each document a span of documents (the window) that:
includes the "current" document and all previous documents within a "2"-"day" window
and within that window, averages temperatures: $avg: "$temp"
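Outside the database, the same [-2, "current"] day window can be sketched in plain JavaScript (assuming docs are pre-sorted by date, as sortBy enforces; the function name is illustrative):

```javascript
// For each doc, average `temp` over docs dated within `daysBack` days before
// it, inclusive of the doc itself -- mirroring window: { range: [-daysBack, "current"], unit: "day" }.
function windowedAvg(docs, daysBack) {
  const MS = 24 * 3600 * 1000;
  return docs.map(doc => {
    const inWin = docs.filter(
      d => d.date <= doc.date && doc.date - d.date <= daysBack * MS
    );
    const avg = inWin.reduce((s, d) => s + d.temp, 0) / inWin.length;
    return { ...doc, movingAverage: avg };
  });
}
```

Note how early documents simply have smaller windows (1 or 2 observations), which is also why the first movingAverage values above equal or nearly equal the raw temperatures.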
I think I may have an answer to my own question. Map-reduce would do it. First use emit to map each document to the neighbors it should be averaged with, then use reduce to average each array... and that new array of averages should be the moving-averages plot over time, since its id would be the new date interval that you care about.
I guess I needed to understand map-reduce better...
:)
For instance... if we wanted to do it in memory (later we can create collections)
GIST https://gist.github.com/mrgcohen/3f67c597a397132c46f7
Does that look right?
I don't believe the aggregation framework can do this for multiple dates in the current version (2.6), or at least not without some serious gymnastics. The reason is that the aggregation pipeline processes one document at a time and one document only, so it would be necessary to somehow create a document for each day that contains the previous 3 months' worth of relevant information. There would be a $group stage that calculates the average, meaning that the prior stage would have to produce about 90 copies of each day's record, with some distinguishing key that can be used for the $group.
So I don't see a way to do this for more than one date at a time in a single aggregation. I'd be happy to be wrong and have to edit/remove this answer if somebody finds a way to do it, even if it's so complicated it's not practical. A PostgreSQL PARTITION-type function would do the job here; maybe that function will be added someday.

Select MongoDB documents with custom random probability distribution

I have a collection that looks something like this:
[
{
"id": 1,
"tier": 0
},
{
"id": 2,
"tier": 1
},
{
"id": 3,
"tier": 2
},
{
"id": 4,
"tier": 0
}
]
Is there a standard way to select n elements, where the probability of choosing an element of the lowest tier is p, of the next lowest tier is (1-p)*p, and so on, with standard random selection of elements within a tier?
So for example, if the most likely thing happens and I run the query against the above example with n = 2 and any p > .5 (which I think will always be true), then I'd get back [{"id": 1, ...}, {"id": 4}]; with n = 3, then [{"id": 4}, {"id": 1}, {"id": 2}], etc.
E.g. here's some pseudo-Python code given a dictionary like that as objs:
def f(objs, p, n):
    # get eligible tiers
    tiers_set = set()
    for o in objs:
        tiers_set.add(o["tier"])
    tiers_list = sorted(tiers_set)
    # get the tier for each index of results
    tiers = []
    while len(tiers) < min(n, len(objs)):
        tiers.append(select_random_with_initial_p(tiers_list, p))
    # get res
    res = []
    for tier in tiers:
        res.append(select_standard_random_in_tier(objs, tier))
    return res
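One way to make that sketch concrete in application code (an illustration, not a server-side query): walk the tiers from lowest to highest, taking each with probability p, so tier i is chosen with probability p*(1-p)^i and the last tier absorbs the remaining mass. The random source is injected so the walk is testable.

```javascript
// Pick a tier geometrically: tier i is chosen with probability p*(1-p)^i;
// the last tier absorbs whatever probability mass is left.
function pickTier(sortedTiers, p, rand) {
  for (let i = 0; i < sortedTiers.length - 1; i++) {
    if (rand() < p) return sortedTiers[i];
  }
  return sortedTiers[sortedTiers.length - 1];
}

// Pick one element: choose a tier, then choose uniformly within that tier.
function pickOne(objs, p, rand) {
  const tiers = [...new Set(objs.map(o => o.tier))].sort((a, b) => a - b);
  const tier = pickTier(tiers, p, rand);
  const pool = objs.filter(o => o.tier === tier);
  return pool[Math.floor(rand() * pool.length)];
}
```

In production you would pass `Math.random` as `rand`; repeating the draw n times (removing picks, if selection without replacement is wanted) gives the n-element result described above.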
First, enable geospatial indexing on a collection:
db.docs.ensureIndex( { random_point: '2d' } )
To create a bunch of documents with random points on the X-axis:
for ( i = 0; i < 10; ++i ) {
db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}
Then you can get a random document from the collection like this:
db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )
Or you can retrieve several documents nearest to a random point:
db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
To make your custom random selection, you can change the [Math.random(), 0] part so it best suits your random distribution.
Source: Random record from MongoDB