Calculate average error via aggregation pymongo


I am trying to find the average error for predictions made n-1, n-2, ... n-5 hours before an event happens. One "row" of my data looks like this.
Note: _id is the time_collected.
{'_id': 5pm,
'value_at_time_collected': 72,
'predictions': [
{'prediction': 77, 'time': 6pm},
...
{'prediction': 79, 'time': 10pm},
]
}
With one aggregate I can come up with a time difference and isolate the predictions:
pipeline1 = [
    {"$unwind": "$predictions"},
    {"$project": {
        # _id is the time_collected, so this is the lead time in milliseconds
        "timeDiff": {"$subtract": ["$predictions.time", "$_id"]},
        "madeFor": "$predictions.time",
        "_id": 0,
        "estimate": "$predictions.prediction"
    }}
]
list(col.aggregate(pipeline1))
Resulting in a dataset like:
[{u'estimate': 77.60785714285714,
u'madeFor': datetime.datetime(2015, 10, 28, 0, 0),
u'timeDiff': 3600000L}, ... ]
But I still need to match these with the corresponding "time_collected" to calculate the errors, i.e.
`$subtract[prediction,value_at_time_collected]`
How can I do this?
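One possible approach, sketched under assumptions rather than taken from the thread: each prediction targets a time that is itself the _id of another document, so a $lookup back into the same collection can fetch the observed value, and a $group can then average the error per lead time. The collection name "readings" is hypothetical, and times are assumed to be stored as dates. The function average_error_by_lead is also introduced here; it mirrors the pipeline in plain Python so the logic can be checked without a server.

```python
from collections import defaultdict
from datetime import timedelta

# Aggregation sketch; "readings" is a hypothetical collection name.
pipeline = [
    {"$unwind": "$predictions"},
    # Self-join: find the document collected at the time the prediction was made for.
    {"$lookup": {
        "from": "readings",
        "localField": "predictions.time",
        "foreignField": "_id",
        "as": "actual",
    }},
    # Drops predictions whose target time has no observed value yet.
    {"$unwind": "$actual"},
    {"$project": {
        "_id": 0,
        "hoursAhead": {"$divide": [
            {"$subtract": ["$predictions.time", "$_id"]}, 3600000]},
        "error": {"$subtract": [
            "$predictions.prediction", "$actual.value_at_time_collected"]},
    }},
    {"$group": {"_id": "$hoursAhead", "avgError": {"$avg": "$error"}}},
]


def average_error_by_lead(docs):
    """Pure-Python mirror of the pipeline: {hours ahead: mean error}."""
    actual = {d["_id"]: d["value_at_time_collected"] for d in docs}
    errors = defaultdict(list)
    for d in docs:
        for p in d["predictions"]:
            if p["time"] in actual:
                hours = (p["time"] - d["_id"]) / timedelta(hours=1)
                errors[hours].append(p["prediction"] - actual[p["time"]])
    return {h: sum(e) / len(e) for h, e in errors.items()}
```

With a live connection the pipeline would be run as list(col.aggregate(pipeline)). Use $abs around the error expression if an absolute error is wanted instead of a signed one.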

Related

PyMongo(3.6.1) - db.collection.aggregate(pipeline) NOT WORKING

I'm having trouble with PyMongo. I have been googling for hours but found no solution.
I'm working with some Python scripts just to practice with MongoDB, which is running on my local machine. I have populated my MongoDB instance with one database, "moviesDB", which contains 3 different collections:
1. Movies collection. Here is an example of a document from this collection:
{'_id': 1,
'title': 'Toy Story (1995)',
'genres': ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
'averageRating': 3.87,
'numberOfRatings': 247,
'tags': [
{'_id': ObjectId('5b04970b87977f1fec0fb6e9'),
'userId': 501,
'tag': 'Pixar',
'timestamp': 1292956344}
]
}
2. Ratings collection, which looks like this:
{ '_id':ObjectId('5b04970c87977f1fec0fb923'),
'userId': 1,
'movieId': 31,
'rating': 2.5,
'timestamp': 1260759144}
3. Tags collection, which I won't use here, so it's not important.
Now, what I'm trying to do is this: given a user (in this example, user 1), find all the genres of the movies that he rated and, for each genre, list all the movieIds belonging to that genre.
Here's my code:
"""
This query basically retrieves movieIds,
so from the result list of several documents like this:
{
'_id': ObjectId('5b04970c87977f1fec0fb923'),
'userId': 1,
'movieId': 31,
'rating': 2.5,
'timestamp': 1260759144},
retrieves only an array of integers, where each number represents a movie
that user 1 rated."""
movies_rated_by_user = list(db.ratings.distinct(movieId, {userId: 1}))
pipeline = [
    {"$match": {"_id ": {"$in": movies_rated_by_user}}},
    {"$unwind": "$genres"},
    {"$group": {"_id": "$genres", "movies": {"$addToSet": "$_id"}}}]
try:
    """HERE IS THE PROBLEM, SINCE db.movies.aggregate() RETURNS NOTHING!
    so the cursor is empty."""
    cursor = db.movies.aggregate(pipeline, cursor={})
except OperationFailure:
    print("Something went Wrong", file=open("operations_log.txt", "a"))
    print(OperationFailure.details, file=open("operations_log.txt", "a"))
    sys.exit(1)
aggregate_genre = []
for c in cursor:
    aggregate_genre.append(c)
print(aggregate_genre)
The point is that the aggregate function on the movies collection retrieves NOTHING, whereas it really should, since I tried this query in the mongo shell and it worked just fine. Here is what the mongo shell query looks like:
db.movies.aggregate(
[
{$match:{_id : {$in: ids}}},
{$unwind : "$genres"},
{$group :
{
_id : "$genres",
movies: { $addToSet : "$_id" }}}
]
);
Here the 'ids' variable is defined like this, just like the movies_rated_by_user variable in the code:
ids= db.ratings.distinct("movieId", {userId : 1});
The result from the aggregate method looks like this (this is what the aggregate_genre variable in the code, should contain):
{ "_id" : "Western", "movies" : [ 3671 ] }
{ "_id" : "Crime", "movies" : [ 1953, 1405 ] }
{ "_id" : "Fantasy", "movies" : [ 2968, 2294, 2193, 1339 ] }
{ "_id" : "Comedy", "movies" : [ 3671, 2294, 2968, 2150, 1405 ] }
{ "_id" : "Sci-Fi", "movies" : [ 2455, 2968, 1129, 1371, 2105 ] }
{ "_id" : "Adventure", "movies" : [ 2193, 2150, 1405, 1287, 2105, 2294,
2968, 1371, 1129 ] }
So the problem is the aggregate method. Is there any error in the pipeline?
Please help!
Thank you

I think you need to cast the cursor to a list before iterating through it. Hope this helps.
cursor = list(db.movies.aggregate(pipeline, cursor={}))
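A separate thing worth checking, offered as a guess from reading the posted code rather than a confirmed diagnosis: the Python $match stage uses the key "_id " with a trailing space, whereas the working shell query uses _id. A $match on a field that no document has matches nothing, which would leave the cursor empty. The helper keys_with_stray_whitespace below is hypothetical, written only to spot such keys; build_pipeline is likewise a name introduced here.

```python
def build_pipeline(movie_ids):
    """Same stages as the shell query, with the key spelled exactly "_id"."""
    return [
        {"$match": {"_id": {"$in": movie_ids}}},
        {"$unwind": "$genres"},
        {"$group": {"_id": "$genres", "movies": {"$addToSet": "$_id"}}},
    ]


def keys_with_stray_whitespace(node):
    """Recursively collect dict keys with leading or trailing whitespace."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key != key.strip():
                found.append(key)
            found.extend(keys_with_stray_whitespace(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(keys_with_stray_whitespace(item))
    return found
```

Note also that in Python the field names passed to distinct need to be strings, i.e. db.ratings.distinct("movieId", {"userId": 1}), unless movieId and userId are variables defined elsewhere in the script.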

Moving averages with MongoDB's aggregation framework?

If you have, for example, 50 years of daily temperature data, how would you calculate moving averages over 3-month intervals for that time period? Can you do that with one query, or would you need multiple queries?
Example Data
01/01/2014 = 40 degrees
12/31/2013 = 38 degrees
12/30/2013 = 29 degrees
12/29/2013 = 31 degrees
12/28/2013 = 34 degrees
12/27/2013 = 36 degrees
12/26/2013 = 38 degrees
.....

The aggregation framework now has $map, $reduce and $range built in, so array processing is much more straightforward. Below is an example of calculating a moving average on a set of data that you wish to filter by some predicate. The basic setup is that each doc contains filterable criteria and a value, e.g.
{sym: "A", d: ISODate("2018-01-01"), val: 10}
{sym: "A", d: ISODate("2018-01-02"), val: 30}
Here it is:
// This controls the number of observations in the moving average:
days = 4;
c=db.foo.aggregate([
// Filter down to what you want. This can be anything or nothing at all.
{$match: {"sym": "S1"}}
// Ensure dates are going earliest to latest:
,{$sort: {d:1}}
// Turn docs into a single doc with a big vector of observations, e.g.
// {sym: "A", d: d1, val: 10}
// {sym: "A", d: d2, val: 11}
// {sym: "A", d: d3, val: 13}
// becomes
// {_id: "A", prx: [ {v:10,d:d1}, {v:11,d:d2}, {v:13,d:d3} ] }
//
// This will set us up to take advantage of array processing functions!
,{$group: {_id: "$sym", prx: {$push: {v:"$val",d:"$d"}} }}
// Nice additional info. Note use of dot notation on array to get
// just scalar date at elem 0, not the object {v:val,d:date}:
,{$addFields: {numDays: days, startDate: {$arrayElemAt: [ "$prx.d", 0 ]}} }
// The Juice! Assume we have a variable "days" which is the desired number
// of days of moving average.
// The complex expression below does this in python pseudocode:
//
// for z in range(0, size of value vector - (days - 1)):
//     seg = vector[z:z+days]
//     values = seg.v
//     dates = seg.d
//     tot = 0
//     for v in values:
//         tot += v
//     avg = tot/len(seg)
//
// Note that it is possible to overrun the segment at the end of the "walk"
// along the vector, i.e. not enough date-values. So we only run the
// vector to (len(vector) - (days-1)).
// Also, for extra info, we also add the number of days *actually* used in the
// calculation AND the as-of date which is the tail date of the segment!
//
// Again we take advantage of dot notation to turn the vector of
// object {v:val, d:date} into two vectors of simple scalars [v1,v2,...]
// and [d1,d2,...] with $prx.v and $prx.d
//
,{$addFields: {"prx": {$map: {
input: {$range:[0,{$subtract:[{$size:"$prx"}, (days-1)]}]} ,
as: "z",
in: {
avg: {$avg: {$slice: [ "$prx.v", "$$z", days ] } },
d: {$arrayElemAt: [ "$prx.d", {$add: ["$$z", (days-1)] } ]}
}
}}
}}
]);
This might produce the following output:
{
"_id" : "S1",
"prx" : [
{
"avg" : 11.738793632512115,
"d" : ISODate("2018-09-05T16:10:30.259Z")
},
{
"avg" : 12.420766702631376,
"d" : ISODate("2018-09-06T16:10:30.259Z")
},
...
],
"numDays" : 4,
"startDate" : ISODate("2018-09-02T16:10:30.259Z")
}
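For PyMongo users, the same stages can be written as plain dictionaries. The sketch below assumes the date field is named d, as in the sample documents; the function names moving_average_pipeline and moving_average are introduced here for illustration, and the second one is a pure-Python mirror of the $map/$slice/$avg step so the windowing can be checked without a server.

```python
def moving_average_pipeline(sym, days):
    """Build the aggregation stages for a `days`-observation moving average."""
    return [
        {"$match": {"sym": sym}},
        {"$sort": {"d": 1}},
        {"$group": {"_id": "$sym",
                    "prx": {"$push": {"v": "$val", "d": "$d"}}}},
        {"$addFields": {"prx": {"$map": {
            # One window per start index z; stop early so no window overruns.
            "input": {"$range": [0, {"$subtract": [{"$size": "$prx"}, days - 1]}]},
            "as": "z",
            "in": {
                "avg": {"$avg": {"$slice": ["$prx.v", "$$z", days]}},
                # As-of date is the last date in the window.
                "d": {"$arrayElemAt": ["$prx.d", {"$add": ["$$z", days - 1]}]},
            },
        }}}},
    ]


def moving_average(values, days):
    """Pure-Python equivalent of the $map step over a list of values."""
    return [sum(values[z:z + days]) / days
            for z in range(0, len(values) - (days - 1))]
```

With a live connection this would be run as list(db.foo.aggregate(moving_average_pipeline("S1", 4))).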

The way I would tend to do this in MongoDB is maintain a running sum of the past 90 days in the document for each day's value, e.g.
{"day": 1, "tempMax": 40, "tempMaxSum90": 2232}
{"day": 2, "tempMax": 38, "tempMaxSum90": 2230}
{"day": 3, "tempMax": 36, "tempMaxSum90": 2231}
{"day": 4, "tempMax": 37, "tempMaxSum90": 2233}
Whenever a new data point needs to be added to the collection, instead of reading and summing 90 values you can efficiently calculate the next sum with two simple queries, one addition and one subtraction, like this (pseudo-code):
tempMaxSum90(day) = tempMaxSum90(day-1) + tempMax(day) - tempMax(day-90)
The 90-day moving average for each day is then just the 90-day sum divided by 90.
If you wanted to also offer moving averages over different time-scales, (e.g. 1 week, 30 day, 90 day, 1 year) you could simply maintain an array of sums with each document instead of a single sum, one sum for each time-scale required.
This approach costs additional storage space and additional processing to insert new data, but it is appropriate in most time-series charting scenarios, where new data is collected relatively slowly and fast retrieval is desirable.
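The update rule above can be sketched in plain Python. This is only an illustration of the arithmetic, not code from the answer: rolling_sums is a name introduced here, and it works on an in-memory list where the real version would read the two needed values from the collection.

```python
def rolling_sums(values, window):
    """Running sum of the last `window` values, updated incrementally."""
    sums = []
    running = 0
    for i, value in enumerate(values):
        running += value                   # add the newest value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        sums.append(running)
    return sums
```

Each step costs one addition and one subtraction regardless of the window size, which is the point of storing the sum alongside each day's value.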

The accepted answer helped me, but it took a while for me to understand how it worked, so I thought I'd explain my method to help others out. I think my answer will be particularly helpful in your context.
This works best on smaller datasets.
First group the data by day, then append all days in an array to each day:
{
    "$sort": {
        "Date": -1
    }
},
{
    "$group": {
        "_id": {
            "Day": "$Date",
            "Temperature": "$Temperature"
        },
        "Previous Values": {
            "$push": {
                "Date": "$Date",
                "Temperature": "$Temperature"
            }
        }
    }
},
This will leave you with a record that looks like this (it'll be ordered correctly):
{"_id.Day": "2017-02-01",
"Temperature": 40,
"Previous Values": [
{"Day": "2017-03-01", "Temperature": 20},
{"Day": "2017-02-11", "Temperature": 22},
{"Day": "2017-01-18", "Temperature": 3},
...
]},
Now that each day has all days appended to it, we need to remove the items from the Previous Values array that are more recent than this record's _id.Day field, as the moving average is backward looking:
{
    "$project": {
        "_id": 0,
        "Date": "$_id.Day",
        "Temperature": "$_id.Temperature",
        "Previous Values": 1
    }
},
{
"$project": {
"_id": 0,
"Date": 1,
"Temperature": 1,
"Previous Values": {
"$filter": {
"input": "$Previous Values",
"as": "pv",
"cond": {
"$lte": ["$$pv.Date", "$Date"]
}
}
}
}
},
The Previous Values array of each record now contains only entries dated on or before that record's own date:
{"Day": "2017-02-01",
"Temperature": 40,
"Previous Values": [
{"Day": "2017-01-31", "Temperature": 33},
{"Day": "2017-01-30", "Temperature": 36},
{"Day": "2017-01-29", "Temperature": 33},
{"Day": "2017-01-28", "Temperature": 32},
...
]}
Now we can pick our averaging window size. Since the data is daily, for a weekly average we'd take the first 7 records of the array; for monthly, 30; and for 3-monthly, 90:
{
"$project": {
"_id": 0,
"Date": 1,
"Temperature": 1,
"Previous Values": {
"$slice": ["$Previous Values", 0, 90]
}
}
},
To average the previous temperatures we unwind the Previous Values array then group by the date field. The unwind operation does this:
{"Day": "2017-02-01",
"Temperature": 40,
"Previous Values": {
"Day": "2017-01-31",
"Temperature": 33}
},
{"Day": "2017-02-01",
"Temperature": 40,
"Previous Values": {
"Day": "2017-01-30",
"Temperature": 36}
},
{"Day": "2017-02-01",
"Temperature": 40,
"Previous Values": {
"Day": "2017-01-29",
"Temperature": 33}
},
...
See that the Day field is the same, but we now have a document for each of the previous dates from the Previous Values array. Now we can group back on day, then average Previous Values.Temperature to get the moving average:
{"$group": {
"_id": {
"Day": "$Date",
"Temperature": "$Temperature"
},
"3 Month Moving Average": {
"$avg": "$Previous Values.Temperature"
}
}
}
That's it! I know that joining every record to every record isn't ideal, but this works fine on smaller datasets.
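The steps above (attach all days to each day, keep only entries on or before that day, slice to the window, average) can be mirrored in plain Python as a sanity check. This is a sketch, not the answer's code: backward_moving_average is a name introduced here, and dates are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
def backward_moving_average(records, window):
    """Backward-looking moving average, newest record first."""
    # Same ordering as the $sort stage: most recent date first.
    ordered = sorted(records, key=lambda r: r["Date"], reverse=True)
    result = []
    for rec in ordered:
        # $filter: keep entries on or before this date; $slice: first `window`.
        previous = [p for p in ordered if p["Date"] <= rec["Date"]][:window]
        average = sum(p["Temperature"] for p in previous) / len(previous)
        result.append({
            "Date": rec["Date"],
            "Temperature": rec["Temperature"],
            "Moving Average": average,
        })
    return result
```

Like the pipeline, this compares every record with every other record, so the cost grows quadratically with the number of days.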

Starting in Mongo 5, it's a perfect use case for the new $setWindowFields aggregation operator:
Note that, for simplicity, I'm using a 3-day window for the rolling average (today and the 2 previous days):
// { date: ISODate("2013-12-26"), temp: 38 }
// { date: ISODate("2013-12-27"), temp: 36 }
// { date: ISODate("2013-12-28"), temp: 34 }
// { date: ISODate("2013-12-29"), temp: 31 }
// { date: ISODate("2013-12-30"), temp: 29 }
// { date: ISODate("2013-12-31"), temp: 38 }
// { date: ISODate("2014-01-01"), temp: 40 }
db.collection.aggregate([
{ $setWindowFields: {
sortBy: { date: 1 },
output: {
movingAverage: {
$avg: "$temp",
window: { range: [-2, "current"], unit: "day" }
}
}
}}
])
// { date: ISODate("2013-12-26"), temp: 38, movingAverage: 38 }
// { date: ISODate("2013-12-27"), temp: 36, movingAverage: 37 }
// { date: ISODate("2013-12-28"), temp: 34, movingAverage: 36 }
// { date: ISODate("2013-12-29"), temp: 31, movingAverage: 33.67 }
// { date: ISODate("2013-12-30"), temp: 29, movingAverage: 31.33 }
// { date: ISODate("2013-12-31"), temp: 38, movingAverage: 32.67 }
// { date: ISODate("2014-01-01"), temp: 40, movingAverage: 35.67 }
This:
- chronologically sorts documents: sortBy: { date: 1 }
- creates for each document a span of documents (the window) that includes the "current" document and all previous documents within a "2"-"day" window
- and within that window, averages temperatures: $avg: "$temp"
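To check the window semantics without a MongoDB 5 server, the same range-based window can be reproduced in plain Python. This is a sketch under the assumption of one reading per day; range_window_average is a name introduced here and not part of any MongoDB API.

```python
from datetime import date, timedelta


def range_window_average(readings, days_back):
    """Average each reading with all readings up to `days_back` days earlier.

    `readings` is a list of (date, temp) tuples.
    """
    ordered = sorted(readings)
    result = []
    for day, _ in ordered:
        start = day - timedelta(days=days_back)
        window = [temp for d, temp in ordered if start <= d <= day]
        result.append((day, sum(window) / len(window)))
    return result


readings = [
    (date(2013, 12, 26), 38), (date(2013, 12, 27), 36),
    (date(2013, 12, 28), 34), (date(2013, 12, 29), 31),
    (date(2013, 12, 30), 29), (date(2013, 12, 31), 38),
    (date(2014, 1, 1), 40),
]
averages = [round(avg, 2) for _, avg in range_window_average(readings, 2)]
```

Because the window is defined by a date range rather than a document count, a missing day simply means fewer values are averaged for the days after the gap.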

I think I may have an answer to my own question. Map-reduce would do it. First use emit to map each document to the neighbors it should be averaged with, then use reduce to average each array. The resulting array of averages should be the moving-average plot over time, since its id would be the new date interval that you care about.
I guess I needed to understand map-reduce better ...
:)
For instance... if we wanted to do it in memory (later we can create collections)
GIST https://gist.github.com/mrgcohen/3f67c597a397132c46f7
Does that look right?

I don't believe the aggregation framework can do this for multiple dates in the current version (2.6), or, at least, can't do this without some serious gymnastics. The reason is that the aggregation pipeline processes one document at a time and one document only, so it would be necessary to somehow create a document for each day that contains the previous 3 months worth of relevant information. This would be as a $group stage that would calculate the average, meaning that the prior stage would have produced about 90 copies of each day's record with some distinguishing key that can be used for the $group.
So I don't see a way to do this for more than one date at a time in a single aggregation. I'd be happy to be wrong and have to edit/remove this answer if somebody finds a way to do it, even if it's so complicated it's not practical. A PostgreSQL PARTITION type function would do the job here; maybe that function will be added someday.

MongoDB $elemMatch of $elemMatch and good practice

Hi,
I'm trying to update the version field in this object, but I'm not able to write a query with 2 nested $elemMatch operators. What I would like to do is get the record with file id 12 and version 1.
I would also like to ask whether it is good practice to have more than one nested array in MongoDB (as in this object)...
Query:
db.collection.find({"my_uuid":"434343"},{"item":{$elemMatch:{"file_id":12,"changes":{$elemMatch:{"version":1}}}}}).pretty()
Object:
{
"my_uuid": "434343",
"item": [
{
"file_id": 12,
"no_of_versions" : 1,
"changes": [
{
"version": 1,
"commentIds": [
4,
5,
7
]
},
{
"version": 2,
"commentIds": [
10,
11,
15
]
}
]
},
{
"file_id": 234,
"unseen_comments": 3,
"no_of_versions" : 2,
"changes": [
{
"version": 1,
"commentIds": [
100,
110,
150
]
}
]
}
]
}
Thank you

If you want the entire documents that satisfy the criteria returned in the result, then I think it's fine. But if you want to limit the array contents of item and changes to just the matching elements, then it could be a problem. That's because you'll have to use the $ positional operator in the projection to limit the contents of the array, and only one such operator can appear in the projection. So you won't be able to limit the contents of multiple arrays within the document.
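If the goal is only to narrow both arrays down to the matching elements when reading, one alternative is an aggregation that unwinds each array and matches after each step. This is an assumption on my part and not something proposed in the answer above; the sketch below writes the stages as Python dictionaries, and find_change is a hypothetical pure-Python mirror of the same narrowing.

```python
# Aggregation sketch: narrow item and then item.changes to the matching elements.
pipeline = [
    {"$match": {"my_uuid": "434343"}},
    {"$unwind": "$item"},
    {"$match": {"item.file_id": 12}},
    {"$unwind": "$item.changes"},
    {"$match": {"item.changes.version": 1}},
]


def find_change(doc, file_id, version):
    """Return the change entry for (file_id, version), or None if absent."""
    for item in doc["item"]:
        if item["file_id"] != file_id:
            continue
        for change in item["changes"]:
            if change["version"] == version:
                return change
    return None
```

Each result document then carries a single item with a single changes entry instead of the full arrays.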

Using sum in Mongo

I'm learning MongoDB and I have the schema below. I would like some help defining a query:
I would like to get the sum of the "entregado" fields for entries that match the code 151001. In this case the result should be 38.
Do I need to change the schema, or is it easy to write a query for what I want?
{
"_id": 101,
"torre": 1,
"standard": {
"mamposteria": [
{
"codigo": 311017,
"descripcion": "LADRILLO ARCILLA H-10",
"cantidad": 1080,
"um": "UN",
"entregado": 1080,
"fecha": new Date('June 10, 2013'),
"vale": [1322]
},
{
"codigo": 311021,
"descripcion": "LADRILLO ARCILLA H-7",
"cantidad": 200,
"um": "UN",
"entregado": 200,
"fecha": new Date('June 10, 2013'),
"vale": [1322]
},
{
"codigo": 151001,
"descripcion": "CEMENTO GRIS 50 KG",
"cantidad": 17,
"um": "KG",
"entregado": 17,
"fecha": new Date('June 10, 2013'),
"vale": [1322]
}
],
"mortero": [
// . . .
],
"estructura":[
// . . .
]
}
}

You don't need to modify your schema to get the sum.
The steps below will give you the total you need. I won't write out the whole query, so you can work it out yourself:
1) use $unwind
2) use $match to find the entries with codigo: 151001
3) use $group to $sum the values present in entregado
This should give you the sum.
If you need any other help, please contact me and I will help.
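The three steps can be sketched for PyMongo as below. This covers only the mamposteria array, since the contents of mortero and estructura are elided in the question; those arrays would presumably need the same treatment to reach the total of 38 the asker expects. The function total_entregado is a name introduced here to mirror the pipeline in plain Python.

```python
# Steps 1-3 as aggregation stages, for the mamposteria array only.
pipeline = [
    {"$unwind": "$standard.mamposteria"},
    {"$match": {"standard.mamposteria.codigo": 151001}},
    {"$group": {
        "_id": "$standard.mamposteria.codigo",
        "total": {"$sum": "$standard.mamposteria.entregado"},
    }},
]


def total_entregado(docs, codigo, section="mamposteria"):
    """Sum entregado over entries of one section whose codigo matches."""
    return sum(
        entry["entregado"]
        for doc in docs
        for entry in doc["standard"].get(section, [])
        if entry["codigo"] == codigo
    )
```

For the single document shown in the question, the mamposteria array contributes 17 for codigo 151001.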

Slice an array in MongoDB's aggregation framework

I am saving game results in MongoDB and would like to calculate the sum of the 3 best results for every player.
With the aggregation framework I am able to build the following intermediate pipeline result from my database of finished games (each player below has finished 5 games with the given scores):
{
"_id" : "Player1",
"points" : [ 324, 300, 287, 287, 227]
},
{
"_id" : "Player2",
"points" : [ 324, 324, 300, 287, 123]
}
Now I need to sum up the three best values for each player. I was able to sort the array so it would also be ok here to get only the first 3 elements of each array to build the sum of the array in the next pipeline step.
$limit would work fine if I only need the result for one player. I also tried using $slice but that doesn't seem to work in the aggregation framework.
So how do I get the sum of the three best results for each player?

You mentioned that it would also be OK to get only the first 3 elements of each array and then build the sum in the next pipeline step. So do that first, then use:
db.test.aggregate([{'$unwind':'$points'},{'$group':{'_id':'$_id','result':{'$sum':'$points'}}}])
to get the result.

The $slice operator for the aggregation framework was added in MongoDB 3.2. For a more detailed answer, take a look here.
And a couple of examples from the mongo page:
{ $slice: [ [ 1, 2, 3 ], 1, 1 ] } // [ 2 ]
{ $slice: [ [ 1, 2, 3 ], -2 ] } // [ 2, 3 ]
{ $slice: [ [ 1, 2, 3 ], 15, 2 ] } // [ ]
{ $slice: [ [ 1, 2, 3 ], -15, 2 ] } // [ 1, 2 ]
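Applied to the question, $slice can feed $sum directly inside a $project stage. The sketch below assumes MongoDB 3.2 or later and that each points array is already sorted in descending order, as the question states; sum_of_best is a name introduced here as a pure-Python check that sorts first so it does not rely on that assumption.

```python
# Take the first three (already sorted) scores and sum them per player.
pipeline = [
    {"$project": {"result": {"$sum": {"$slice": ["$points", 3]}}}},
]


def sum_of_best(points, n=3):
    """Sum of the n highest scores."""
    return sum(sorted(points, reverse=True)[:n])


player1 = sum_of_best([324, 300, 287, 287, 227])
player2 = sum_of_best([324, 324, 300, 287, 123])
```

With a live connection the stage would be appended to the pipeline that produces the per-player points arrays.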
