Ranges in MongoDB documents

I want to define ranges in a MongoDB document so I can query by values belonging to a particular range.
For example, denoting a range by [min, max], if we consider a collection of three documents using this notation:
{
temperature: [-100, 10],
sensation: "Cold"
},
{
temperature: [10, 30],
sensation: "Mild"
},
{
temperature: [30, 50],
sensation: "Hot"
}
I would like to make queries of values within these ranges and see which documents fit in:
temperature = 25.6 -> sensation = "Mild"
I know I can store min and max values as separate values, but maybe someone can think in a more elegant and efficient way of defining indexable ranges in MongoDB.

You need to use dot notation with the indexes of the min and max values in your array.
db.collection.find(
  {
    "temperature.0": { "$lt": 25.6 },
    "temperature.1": { "$gt": 25.6 }
  },
  { "sensation": 1, "_id": 0 }
)
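As a sanity check, the same containment test can be expressed in plain JavaScript against the sample documents above (a sketch, not part of the query itself). Note that because the query uses strict $lt/$gt, a value exactly equal to a shared boundary such as 10 matches neither range:

```javascript
// Plain-JavaScript sketch of the containment test the query above performs.
// Each document stores its range as [min, max]; a value belongs to the
// document when min < value < max (mirroring $lt / $gt on temperature.0/.1).
const docs = [
  { temperature: [-100, 10], sensation: "Cold" },
  { temperature: [10, 30],   sensation: "Mild" },
  { temperature: [30, 50],   sensation: "Hot"  }
];

function sensationFor(value) {
  const match = docs.find(d => d.temperature[0] < value && d.temperature[1] > value);
  return match ? match.sensation : null;
}

console.log(sensationFor(25.6)); // "Mild"
console.log(sensationFor(10));   // null -- boundary value matches neither range
```

If boundary values should belong to one of the two adjacent ranges, use $lte on one side of the query instead.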

Related

how to perform statistics for every n elements in MongoDB

How do I perform basic statistics for every n elements in MongoDB? For example, suppose I have a total of 100 records like the ones below:
Name  Count  Sample
a     10     x
a     20     y
a     10     z
b     10     x
b     10     y
b     5      z
How do I compute the mean, median, and standard deviation for every 10 records, so that I get 10 results? That is, I want to calculate the mean/median/std dev of a for every 10 samples until the end of the collection, and similarly for b, c, and so on.
Excuse me if this is a naive question.
You need some sort of counter to keep track of the count. In the example below I add a rownum field, apply a bucket of 3 (here n = 3), and then return the sum and average of each group of 3. This example can be modified to do some sorting and grouping before creating the buckets to get the desired result.
Please refer to https://mongoplayground.net/p/CL7vQGUWD_S
db.collection.aggregate([
  {
    $set: {
      "rownum": {
        "$function": {
          "body": "function() { try { row_number += 1; } catch (e) { row_number = 0; } return row_number; }",
          "args": [],
          "lang": "js"
        }
      }
    }
  },
  {
    $bucket: {
      groupBy: "$rownum",                        // Field to group by
      boundaries: [1, 4, 7, 11, 14, 17, 21, 25], // Boundaries for the buckets
      default: "Other",  // Bucket id for documents which do not fall into a bucket
      output: {          // Output for each bucket
        "countSUM": { $sum: "$count" },
        "averagePrice": { $avg: "$count" }
      }
    }
  }
])
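The per-bucket arithmetic can be checked client-side with the sample rows from the question (a plain-JavaScript sketch, using n = 3 as in the playground example; field names are simplified). It also sidesteps the pipeline's global-counter $function trick, which depends on mutable state in the server-side JS runtime and is fragile:

```javascript
// Number the rows implicitly by position, cut them into groups of n,
// then compute the sum and average of each group -- the same result the
// rownum + $bucket pipeline produces.
const rows = [
  { name: "a", count: 10 }, { name: "a", count: 20 }, { name: "a", count: 10 },
  { name: "b", count: 10 }, { name: "b", count: 10 }, { name: "b", count: 5 }
];
const n = 3;

const buckets = [];
for (let i = 0; i < rows.length; i += n) {
  const seg = rows.slice(i, i + n).map(r => r.count);
  const sum = seg.reduce((t, v) => t + v, 0);
  buckets.push({ countSUM: sum, average: sum / seg.length });
}

console.log(buckets);
// [ { countSUM: 40, average: 13.33... }, { countSUM: 25, average: 8.33... } ]
```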

Is there an equivalent NTILE function in MongoDB?

I want to divide my collection into n groups with equal number of rows. Is there an equivalent to SQL's NTILE function in MongoDB?
I'd like to create a MongoDB view for my collections that will add an additional column (partition_no)
Yes there is. The $bucketAuto aggregation stage:
db.collection.aggregate([
  {
    '$bucketAuto': {
      "groupBy": "$_id",
      "buckets": n
    }
  }
])
This returns the structure of:
{ _id: { min: minIdOfBucket, max: maxIdOfBucket }, count: number }[]
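The partitioning idea can be sketched in plain JavaScript (an approximation for illustration; $bucketAuto's exact boundary semantics differ slightly, e.g. each bucket's reported max is typically the next bucket's min):

```javascript
// NTILE(n)-style partitioning: sort by the groupBy key, split into n buckets
// of (roughly) equal size, and report each bucket's min/max key and count.
function ntile(ids, n) {
  const sorted = [...ids].sort((a, b) => a - b);
  const size = Math.ceil(sorted.length / n);
  const out = [];
  for (let i = 0; i < sorted.length; i += size) {
    const seg = sorted.slice(i, i + size);
    out.push({ _id: { min: seg[0], max: seg[seg.length - 1] }, count: seg.length });
  }
  return out;
}

console.log(ntile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2));
// [ { _id: { min: 1, max: 5 }, count: 5 }, { _id: { min: 6, max: 10 }, count: 5 } ]
```

The bucket index in the output array can serve as the partition_no column for the view.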

How to group by uniform intervals of data between a maximum and minimum using the MongoDB aggregator?

Let's say I have a whole mess of data that yields a range of integer values for a particular field... I'd like to see those ranked by a grouping of intervals of occurrence, perhaps because I am clustering...like so:
[{
_id: {
response_time: "3-4"
},
count: 234,
countries: ['US', 'Canada', 'UK']
}, {
_id: {
response_time: "4-5"
},
count: 452,
countries: ['US', 'Canada', 'UK', 'Poland']
}, ...
]
How can I write a quick and dirty way to A) group the collection data by equally spaced intervals over B) a minimum and maximum range using a MongoDB aggregator?
Well, in order to quickly formulate a conditional grouping syntax for MongoDB aggregators, we first adopt this pattern, per MongoDB syntax:
$cond: [
  { <conditional> },   // test the conditional
  <truthy_value>,      // assign if true
  { $cond: [           // evaluate if false
    { <conditional> },
    <truthy_value>,
    ...                // and so forth
  ]}
]
In order to do that muy rapidamente, without having to write every last interval out in a deeply nested conditional, we can use this handy recursive algorithm (that you import in your shell script or node.js script of course):
var $condIntervalBuilder = function (field, interval, min, max) {
  var cond;
  if (min < max - 1) {
    cond = [
      { '$and': [{ $gt: [field, min] }, { $lte: [field, min + interval] }] },
      [min, '-', (min + interval)].join('')
    ];
    if ((min + interval) > max) {
      cond.push($condIntervalBuilder(field, (max - min), min, max));
    } else {
      min += interval;
      cond.push($condIntervalBuilder(field, interval, min, max));
    }
  } else if (min >= max - 1) {
    cond = [
      { $gt: [field, max] },
      [max, '<'].join(''), // Accounts for all values above the range
      [min, '<'].join('')  // Lesser upper bound
    ];
  }
  return { $cond: cond };
};
Then, we can invoke it in-line or assign it to a variable that we use elsewhere in our analysis.
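To see what the generated nested $cond evaluates to for a given value, here is a plain-JavaScript equivalent of the interval walk (a sketch that mirrors the builder's labels, not the builder itself):

```javascript
// Walk the intervals from min to max and return the "lo-hi" label the value
// falls in, using the same (lo, hi] semantics as the builder's $gt/$lte pair.
// Values outside the range get the builder's terminal labels.
function intervalLabel(value, interval, min, max) {
  for (let lo = min; lo < max; lo += interval) {
    const hi = Math.min(lo + interval, max);
    if (value > lo && value <= hi) return lo + "-" + hi;
  }
  return value > max ? max + "<" : min + "<"; // above or below the range
}

console.log(intervalLabel(3.5, 1, 0, 10)); // "3-4"
console.log(intervalLabel(11, 1, 0, 10));  // "10<"
```

Grouping on this label (via the generated $cond in a $group stage's _id) yields the "response_time": "3-4" style buckets shown in the desired output.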

Ranking weighted averages in MongoDB

Let's say I have 1,000,000,000 entities in a MongoDB, and each entity has 3 numerical properties, A, B, and C.
for example:
entity1 : { A: 35, B: 60, C: 5 }
entity2 : { A: 15, B: 10, C: 55 }
entity3 : { A: 10, B: 10, C: 10 }
...
Now I need to query the database. The input of the query would be 3 numbers: (a, b, c). The result would be a list of entities in descending order as defined by the weighted average, or A * a + B * b + C * c.
so q(1, 100, 1) would return (entity1, entity2, entity3)
and q(1, 1, 100) would return (entity2, entity3, entity1)
Can something like this be achieved with MongoDB, without calculating the weighted average of every entity on every query? I am not bound to MongoDB, but am learning the MEAN stack. If I have to use something else, that is fine too.
NOTE: I chose 1,000,000,000 entities as an extreme example. My actual use case will only have ~5000 entities, so iterating over everything might be OK, I am just interested in a more clever solution.
Of course you have to calculate it if you are providing the input at query time and cannot use a pre-calculated field; the only real choice here is between returning all items and sorting them in the client, or letting the server do the work:
var a = 1,
b = 1,
c = 100;
db.collection.aggregate(
[
{ "$project": {
"A": 1,
"B": 1,
"C": 1,
"weight": {
"$add": [
{ "$multiply": [ "$A", a ] },
{ "$multiply": [ "$B", b ] },
{ "$multiply": [ "$C", c ] }
]
}
}},
{ "$sort": { "weight": -1 } }
],
{ "allowDiskUse": true }
)
So the key here is that the .aggregate() method allows for document manipulation, which is required to generate the value on which to apply the $sort.
The calculated value is produced in a $project pipeline stage beforehand, using $multiply to combine each field value with the corresponding external variable fed into the pipeline, and then $add to sum the results into a "weight" field to sort on.
You cannot directly feed algorithms to any "sort" methods in MongoDB, as they need to act on a field present in the document. The aggregation framework provides the means to "project" this value, so a later pipeline stage can then perform the sort required.
The other point here is that, given the number of documents you are proposing, it is better to supply "allowDiskUse" as an option to force the aggregation process to temporarily store processed documents on disk rather than in memory, as there is a limit on the amount of memory an aggregation can use without this option.
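The same weighting and ordering can be checked client-side with the three sample entities (a plain-JavaScript sketch; the name labels are added only for illustration):

```javascript
// Weight each entity by A*a + B*b + C*c and sort descending -- the
// client-side equivalent of the $project + $sort pipeline above.
const entities = [
  { name: "entity1", A: 35, B: 60, C: 5 },
  { name: "entity2", A: 15, B: 10, C: 55 },
  { name: "entity3", A: 10, B: 10, C: 10 }
];

function q(a, b, c) {
  return entities
    .map(e => ({ ...e, weight: e.A * a + e.B * b + e.C * c }))
    .sort((x, y) => y.weight - x.weight)
    .map(e => e.name);
}

console.log(q(1, 100, 1)); // [ 'entity1', 'entity2', 'entity3' ]
console.log(q(1, 1, 100)); // [ 'entity2', 'entity3', 'entity1' ]
```

With ~5000 entities either approach is cheap; the aggregation version simply keeps the work (and the memory pressure) on the server.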

Moving averages with MongoDB's aggregation framework?

If you have 50 years of daily temperature data (for example), how would you calculate moving averages over 3-month intervals for that time period? Can you do that with one query, or would you need multiple queries?
Example Data
01/01/2014 = 40 degrees
12/31/2013 = 38 degrees
12/30/2013 = 29 degrees
12/29/2013 = 31 degrees
12/28/2013 = 34 degrees
12/27/2013 = 36 degrees
12/26/2013 = 38 degrees
.....
The agg framework now has $map, $reduce, and $range built in, so array processing is much more straightforward. Below is an example of calculating a moving average on a set of data where you wish to filter by some predicate. The basic setup is that each doc contains filterable criteria and a value, e.g.
{sym: "A", d: ISODate("2018-01-01"), val: 10}
{sym: "A", d: ISODate("2018-01-02"), val: 30}
Here it is:
// This controls the number of observations in the moving average:
days = 4;
c=db.foo.aggregate([
// Filter down to what you want. This can be anything or nothing at all.
{$match: {"sym": "S1"}}
// Ensure dates are going earliest to latest:
,{$sort: {d:1}}
// Turn docs into a single doc with a big vector of observations, e.g.
// {sym: "A", d: d1, val: 10}
// {sym: "A", d: d2, val: 11}
// {sym: "A", d: d3, val: 13}
// becomes
// {_id: "A", prx: [ {v:10,d:d1}, {v:11,d:d2}, {v:13,d:d3} ] }
//
// This will set us up to take advantage of array processing functions!
,{$group: {_id: "$sym", prx: {$push: {v:"$val",d:"$d"}} }}
// Nice additional info. Note use of dot notation on array to get
// just scalar date at elem 0, not the object {v:val,d:date}:
,{$addFields: {numDays: days, startDate: {$arrayElemAt: [ "$prx.d", 0 ]}} }
// The Juice! Assume we have a variable "days" which is the desired number
// of days of moving average.
// The complex expression below does this in python pseudocode:
//
// for z in range(0, size of value vector - # of days in moving avg):
//   seg = vector[z:z+days]
// values = seg.v
// dates = seg.d
// for v in seg:
// tot += v
// avg = tot/len(seg)
//
// Note that it is possible to overrun the segment at the end of the "walk"
// along the vector, i.e. not enough date-values. So we only run the
// vector to (len(vector) - (days-1)).
// Also, for extra info, we also add the number of days *actually* used in the
// calculation AND the as-of date which is the tail date of the segment!
//
// Again we take advantage of dot notation to turn the vector of
// object {v:val, d:date} into two vectors of simple scalars [v1,v2,...]
// and [d1,d2,...] with $prx.v and $prx.d
//
,{$addFields: {"prx": {$map: {
input: {$range:[0,{$subtract:[{$size:"$prx"}, (days-1)]}]} ,
as: "z",
in: {
avg: {$avg: {$slice: [ "$prx.v", "$$z", days ] } },
d: {$arrayElemAt: [ "$prx.d", {$add: ["$$z", (days-1)] } ]}
}
}}
}}
]);
This might produce the following output:
{
"_id" : "S1",
"prx" : [
{
"avg" : 11.738793632512115,
"d" : ISODate("2018-09-05T16:10:30.259Z")
},
{
"avg" : 12.420766702631376,
"d" : ISODate("2018-09-06T16:10:30.259Z")
},
...
],
"numDays" : 4,
"startDate" : ISODate("2018-09-02T16:10:30.259Z")
}
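The windowed-average arithmetic in the $addFields stage can be sanity-checked in plain JavaScript; this sketch mirrors the pseudocode in the comments (slide a days-wide window along the value vector and average each slice):

```javascript
// Equivalent of the $map / $range / $slice / $avg combination above:
// for each starting index z up to (length - days), average values[z..z+days).
function movingAverage(values, days) {
  const out = [];
  for (let z = 0; z <= values.length - days; z++) {
    const seg = values.slice(z, z + days);
    out.push(seg.reduce((t, v) => t + v, 0) / days);
  }
  return out;
}

console.log(movingAverage([10, 30, 20, 40, 50], 4)); // [ 25, 35 ]
```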
The way I would tend to do this in MongoDB is maintain a running sum of the past 90 days in the document for each day's value, e.g.
{"day": 1, "tempMax": 40, "tempMaxSum90": 2232}
{"day": 2, "tempMax": 38, "tempMaxSum90": 2230}
{"day": 3, "tempMax": 36, "tempMaxSum90": 2231}
{"day": 4, "tempMax": 37, "tempMaxSum90": 2233}
Whenever a new data point needs to be added to the collection, instead of reading and summing 90 values, you can efficiently calculate the next sum with two simple queries, one addition, and one subtraction, like this (pseudo-code):
tempMaxSum90(day) = tempMaxSum90(day-1) + tempMax(day) - tempMax(day-90)
The 90-day moving average at each day is then just the 90-day sum divided by 90.
If you wanted to also offer moving averages over different time-scales, (e.g. 1 week, 30 day, 90 day, 1 year) you could simply maintain an array of sums with each document instead of a single sum, one sum for each time-scale required.
This approach costs additional storage space and extra processing when inserting new data, but it is appropriate in most time-series charting scenarios, where new data is collected relatively slowly and fast retrieval is desirable.
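The incremental update rule can be sketched in plain JavaScript (a 3-day window is used here for brevity instead of the answer's 90; the values are the question's sample temperatures):

```javascript
// New window sum = previous sum + today's value - the value that just fell
// out of the window. Until the window fills, nothing is subtracted.
const winDays = 3;
const temps = [40, 38, 29, 31, 34]; // tempMax for days 1..5
const sums = [];
for (let d = 0; d < temps.length; d++) {
  const prev = d === 0 ? 0 : sums[d - 1];
  const dropped = d >= winDays ? temps[d - winDays] : 0;
  sums.push(prev + temps[d] - dropped);
}
console.log(sums); // [ 40, 78, 107, 98, 94 ]
// Once the window is full, the moving average for day d is sums[d] / winDays.
```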
The accepted answer helped me, but it took a while to understand how it worked, so I thought I'd explain my method to help others out. I think it will be particularly helpful in your context.
This works best on smaller datasets.
First group the data by day, then append all days in an array to each day:
{
  "$sort": {
    "Date": -1
  }
},
{
  "$group": {
    "_id": {
      "Day": "$Date",
      "Temperature": "$Temperature"
    },
    "Previous Values": {
      "$push": {
        "Date": "$Date",
        "Temperature": "$Temperature"
      }
    }
  }
},
This will leave you with a record that looks like this (it'll be ordered correctly):
{"_id.Day": "2017-02-01",
 "Temperature": 40,
 "Previous Values": [
   {"Date": "2017-03-01", "Temperature": 20},
   {"Date": "2017-02-11", "Temperature": 22},
   {"Date": "2017-01-18", "Temperature": 3},
   ...
]},
Now that each day has all days appended to it, we need to remove the items from the Previous Values array that are more recent than the day in the _id.Day field, since the moving average is backward-looking:
{
  "$project": {
    "_id": 0,
    "Date": "$_id.Day",
    "Temperature": "$_id.Temperature",
    "Previous Values": 1
  }
},
{
  "$project": {
    "_id": 0,
    "Date": 1,
    "Temperature": 1,
    "Previous Values": {
      "$filter": {
        "input": "$Previous Values",
        "as": "pv",
        "cond": {
          "$lte": ["$$pv.Date", "$Date"]
        }
      }
    }
  }
},
The Previous Values array of each record will now only contain entries whose dates are less than or equal to that record's own date:
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": [
   {"Date": "2017-01-31", "Temperature": 33},
   {"Date": "2017-01-30", "Temperature": 36},
   {"Date": "2017-01-29", "Temperature": 33},
   {"Date": "2017-01-28", "Temperature": 32},
   ...
]}
Now we can pick our averaging window size. Since the data is daily, for a week we'd take the first 7 records of the array; for monthly, 30; and for 3-monthly, 90 days:
{
"$project": {
"_id": 0,
"Date": 1,
"Temperature": 1,
"Previous Values": {
"$slice": ["$Previous Values", 0, 90]
}
}
},
To average the previous temperatures, we unwind the Previous Values array and then group by the date field. The unwind operation does this:
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {
   "Date": "2017-01-31",
   "Temperature": 33}
},
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {
   "Date": "2017-01-30",
   "Temperature": 36}
},
{"Date": "2017-02-01",
 "Temperature": 40,
 "Previous Values": {
   "Date": "2017-01-29",
   "Temperature": 33}
},
...
Notice that each record's own date is the same, but we now have a document for each of the previous dates from the Previous Values array. Now we can group back by day, then average Previous Values.Temperature to get the moving average:
{"$group": {
"_id": {
"Day": "$Date",
"Temperature": "$Temperature"
},
"3 Month Moving Average": {
"$avg": "$Previous Values.Temperature"
}
}
}
That's it! I know that joining every record to every record isn't ideal, but this works fine on smaller datasets.
Starting in Mongo 5, this is a perfect use case for the new $setWindowFields aggregation stage.
Note that I consider the rolling average to have a 3-day window for simplicity (today and the 2 previous days):
// { date: ISODate("2013-12-26"), temp: 38 }
// { date: ISODate("2013-12-27"), temp: 36 }
// { date: ISODate("2013-12-28"), temp: 34 }
// { date: ISODate("2013-12-29"), temp: 31 }
// { date: ISODate("2013-12-30"), temp: 29 }
// { date: ISODate("2013-12-31"), temp: 38 }
// { date: ISODate("2014-01-01"), temp: 40 }
db.collection.aggregate([
{ $setWindowFields: {
sortBy: { date: 1 },
output: {
movingAverage: {
$avg: "$temp",
window: { range: [-2, "current"], unit: "day" }
}
}
}}
])
// { date: ISODate("2013-12-26"), temp: 38, movingAverage: 38 }
// { date: ISODate("2013-12-27"), temp: 36, movingAverage: 37 }
// { date: ISODate("2013-12-28"), temp: 34, movingAverage: 36 }
// { date: ISODate("2013-12-29"), temp: 31, movingAverage: 33.67 }
// { date: ISODate("2013-12-30"), temp: 29, movingAverage: 31.33 }
// { date: ISODate("2013-12-31"), temp: 38, movingAverage: 32.67 }
// { date: ISODate("2014-01-01"), temp: 40, movingAverage: 35.67 }
This:
sorts documents chronologically: sortBy: { date: 1 }
creates for each document a span of documents (the window) that includes the "current" document and all previous documents within a 2-"day" range
and, within that window, averages temperatures: $avg: "$temp"
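The window arithmetic can be verified client-side (a plain-JavaScript sketch; note the real operator windows by calendar day rather than by document position, which only coincides here because the sample data has exactly one reading per day):

```javascript
// For each reading, average the current temperature with the temperatures of
// (up to) the two previous days -- the [-2, "current"] day window above.
const temps = [38, 36, 34, 31, 29, 38, 40]; // 2013-12-26 .. 2014-01-01
const windowAverages = temps.map((_, i) => {
  const win = temps.slice(Math.max(0, i - 2), i + 1);
  return win.reduce((t, v) => t + v, 0) / win.length;
});

console.log(windowAverages.map(v => Math.round(v * 100) / 100));
// [ 38, 37, 36, 33.67, 31.33, 32.67, 35.67 ]
```

These match the movingAverage values in the pipeline's output above.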
I think I may have an answer to my own question. Map-reduce would do it: first use emit to map each document to the neighbors it should be averaged with, then use reduce to average each array. That new array of averages should be the moving-average plot over time, since its id would be the new date interval you care about.
I guess I needed to understand map-reduce better...
:)
For instance... if we wanted to do it in memory (later we can create collections)
GIST https://gist.github.com/mrgcohen/3f67c597a397132c46f7
Does that look right?
I don't believe the aggregation framework can do this for multiple dates in the current version (2.6), or at least can't do it without some serious gymnastics. The reason is that the aggregation pipeline processes one document at a time and one document only, so it would be necessary to somehow create a document for each day that contains the previous 3 months' worth of relevant information. A $group stage would then calculate the average, meaning the prior stage would have to produce about 90 copies of each day's record, each with some distinguishing key that can be used for the $group.
So I don't see a way to do this for more than one date at a time in a single aggregation. I'd be happy to be wrong and to have to edit/remove this answer if somebody finds a way to do it, even if it's so complicated it's not practical. A PostgreSQL PARTITION-type window function would do the job here; maybe that function will be added someday.