I want to calculate a moving average for my data in MongoDB. My data structure is as below:
{
"_id" : NUUID("54ab1171-9c72-57bc-ba20-0a06b4f858b3"),
"DateTime" : ISODate("2018-05-30T21:31:05.957Z"),
"Type" : 3,
"Value" : NumberDecimal("15.905414991993847")
}
I want to calculate the average of the values for each type over 2 days, in 5-second intervals. For now I put Type in the $match stage, but I would prefer to group by Type so that the results are separated per type. This is what I have so far:
var start = new Date("2018-05-30T21:31:05.957Z");
var end = new Date("2018-06-01T21:31:05.957Z");
var arr = new Array();
for (var i = 0; i < 34560; i++) {
start.setSeconds(start.getSeconds() + 5);
if (start <= end)
{
var a = new Date(start);
arr.push(a);
}
}
db.Data.aggregate([
{$match:{"DateTime":{$gte:new Date("2018-05-30T21:31:05.957Z"),
$lte:new Date("2018-06-01T21:31:05.957Z")}, "Type":3}},
{$bucket: {
groupBy: "$DateTime",
boundaries: arr,
default: "Other",
output: {
"count": { $sum: 1 },
"Value": {$avg:"$Value"}
}
}
}
])
It seems to work, but it is too slow. How can I make this faster?
I reproduced the behavior you describe with 2 days worth of 1 second observations in the DB and a $match that pulls just one day's worth. The agg works "fine" if you bucket by, say, 60 seconds. But 15 seconds took 6 times as long, to 30 seconds. And every 5 seconds? 144 seconds. 5 seconds yields an array of 17280 buckets. Yep.
So I went client-side and dragged all 43200 docs to the client and created a naive linear search bucket slot finder and calc in javascript.
c=db.foo.aggregate([
{$match:{"date":{$gte:new Date(osv), $lte:new Date(endv) }}}
]);
c.forEach(function(r) {
var x = findSlot(arr, r['date']);
if(buckets[x] == undefined) {
buckets[x] = {lb: arr[x], ub: arr[x+1], n: 0, v:0};
}
var zz = buckets[x];
zz['n']++;
zz['v'] += r['val'];
});
This actually worked somewhat faster, but with the same order of performance: about 92 seconds.
Next, I changed the linear search in findSlot to a bisection search. The 5 second bucket went from 144 seconds to .750 seconds: almost 200x faster. This includes dragging the 43200 records and running the forEach and bucketing logic above. So it stands to reason that $bucket may not be using a great algo and suffers when the bucket array is more than a couple hundred long.
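The findSlot function itself is not shown above, so here is a minimal sketch of what a bisection version might look like. The signature and the "return -1 when out of range" convention are assumptions for illustration, not the code that produced the timings quoted here.

```javascript
// Hypothetical bisection-based findSlot (not the original author's code).
// boundaries: ascending array of Dates (or epoch millis) marking bucket edges.
// Returns the index i such that boundaries[i] <= value < boundaries[i + 1],
// or -1 when the value falls outside every bucket.
function findSlot(boundaries, value) {
  const v = +value; // Date -> epoch millis; numbers pass through unchanged
  const last = boundaries.length - 1;
  if (last < 1 || v < +boundaries[0] || v >= +boundaries[last]) {
    return -1;
  }
  let lo = 0;
  let hi = last;
  // Invariant: boundaries[lo] <= v < boundaries[hi]
  while (hi - lo > 1) {
    const mid = (lo + hi) >>> 1;
    if (+boundaries[mid] <= v) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return lo;
}
```

Each lookup halves the remaining range, so 17280 boundaries need about 15 comparisons per document instead of thousands with a linear scan.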
Acknowledging this, we can instead make use of $floor of the delta between the start time and the observation time to bucket the data:
db.foo.aggregate([
{$match:{"date":{$gte:now, $lte:new Date(endv) }}}
// Bucket by turning offset from "now" into floor divided by the number
// of seconds of grouping. In this way, the resulting number becomes the
// slot into the virtual buckets, e.g.:
// date now diff/1000 floor # 5 seconds:
// 1514764800000 1514764800000 0 0
// 1514764802000 1514764800000 2 0
// 1514764804000 1514764800000 4 0
// 1514764806000 1514764800000 6 1
// 1514764808000 1514764800000 8 1
// 1514764810000 1514764800000 10 2
,{$addFields: {"ff": {$floor: {$divide: [ {$divide: [ {$subtract: [ "$date", now ]}, 1000.0 ]}, secondsBucket ] }} }}
// Now just group by the numeric slot number!
,{$group: {_id: "$ff", n: {$sum:1}, tot: {$sum: "$val"}, avg: {$avg: "$val"}} }
// Get it in 0-n order....
,{$sort: {_id: 1}}
]);
found 17280 in 204 millis
So we now have a server-side solution that is just .204 seconds, or 700x faster. And you don't have to sort the input because $group will take care of bundling the slot numbers. And the $sort after the $group is optional (but sort of handy...)
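The original question also wanted the results separated by Type. A sketch of how the same floor trick could be combined with a compound group key is below. It assumes the question's field names (DateTime, Type, Value); buildBucketPipeline, slotOf and the slot field are hypothetical names introduced here, and this variant has not been timed.

```javascript
// Hypothetical helper: builds a pipeline that buckets by (Type, time slot).
// start/end are Date objects; secondsBucket is the bucket width in seconds.
function buildBucketPipeline(start, end, secondsBucket) {
  return [
    { $match: { DateTime: { $gte: start, $lte: end } } },
    // $subtract on two dates yields milliseconds, so divide by the bucket
    // width in milliseconds and floor to get the slot number.
    { $addFields: {
        slot: { $floor: { $divide: [
          { $subtract: ["$DateTime", start] },
          secondsBucket * 1000
        ] } }
    } },
    // Compound key: one group per Type per slot.
    { $group: {
        _id: { type: "$Type", slot: "$slot" },
        count: { $sum: 1 },
        avg: { $avg: "$Value" }
    } },
    { $sort: { "_id.type": 1, "_id.slot": 1 } }
  ];
}

// The same slot arithmetic client-side, handy for checking the table above.
function slotOf(date, start, secondsBucket) {
  return Math.floor((date - start) / (secondsBucket * 1000));
}

// Usage in the shell (collection name taken from the question):
// db.Data.aggregate(buildBucketPipeline(start, end, 5));
```

To turn a slot number back into the bucket's start time, add slot * secondsBucket * 1000 milliseconds to start.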
I have a MongoDB $bucket query that gets counts between a lower bound and an upper bound. Now I want a "greater than" query that shows counts for discounts > 10%, > 20%, > 50%, > 60%, etc.
I don't know how to express "greater than" in the boundaries. Is there any other way to get these counts for the offers?
db.getCollection('product').aggregate([{$facet: {
"offers":[{$unwind:"$variants"},
{$match:{"variants.prices.discount_percent": { $exists: 1 },
"variants.is_published":{"$ne":false}}},
{
$bucket: {
groupBy: "$variants.prices.discount_percent",
boundaries: [ 0, 20, 30, 40, 100,Infinity ],
output: {
"count": { $sum: 1 }
}
} }
]}}
])
What I want in this case is a cumulative count per discount level, so the 20% count includes discounts of 30% and above, the 30% count includes 40% and above, and so on.
output should be:
20% - (10),
30% - (5),
40% - (2) etc..
Here the 20% count is 10, which includes the 5 and the 2 as well, because those discounts are >= 20. The 30% count includes the 2 from 40%, because those are >= 30.
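One possible approach, sketched here as an assumption rather than an accepted answer: keep the $bucket stage as it is and turn its per-bucket counts into ">= threshold" counts client-side with a running sum from the highest bucket down. cumulativeCounts is a hypothetical helper; it relies on $bucket returning documents of the form { _id: lowerBound, count: n }.

```javascript
// Hypothetical helper: converts per-bucket counts into cumulative
// "discount >= threshold" counts.
// buckets: array of { _id: lowerBound, count: n } as produced by $bucket.
function cumulativeCounts(buckets) {
  // Sort a copy by lower bound so the input order does not matter.
  const sorted = buckets.slice().sort(function (a, b) { return a._id - b._id; });
  const result = [];
  let running = 0;
  // Walk from the highest bucket down, accumulating as we go.
  for (let i = sorted.length - 1; i >= 0; i--) {
    running += sorted[i].count;
    result.unshift({ threshold: sorted[i]._id, count: running });
  }
  return result;
}

// With per-bucket counts of 5, 3 and 2 this gives the totals from the question.
const example = cumulativeCounts([
  { _id: 20, count: 5 },
  { _id: 30, count: 3 },
  { _id: 40, count: 2 }
]);
// example: 20 -> 10, 30 -> 5, 40 -> 2
```

Note that $bucket omits buckets that contain no documents, so a threshold with no matches in its own range will simply be missing from the result rather than showing the count carried over from higher buckets.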
I have a mongodb data storage with 1 minute OHLCV data like below (time, open, high, low, close, volume) stored using mongoose in nodejs.
{
"_id":1,
"__v":0,
"data":[
[
1516597690510,
885000,
885000,
885000,
885000,
121.2982
],
[
1516597739868,
885000,
885000,
885000,
885000,
121.2982
]
...
]
}
I need to extract 5-minute interval data from this in the same format. I am new to this and could not find how to do it in mongodb/mongoose, even after several hours of searching. Kindly help. It is confusing, especially because the data is a nested array with no named fields inside the arrays.
NOTE: Suppose for 5 min data, you will have 4 samples(arrays) of 1 min data from data base, then
time : time element of last 1 min data array (of that 5 min interval)
open : first element of first 1 min data array (of that 5 min interval)
high : max of 2nd element in all 1 min data arrays (of that 5 min interval)
low : min of 3rd element in all 1 min data arrays (of that 5 min interval)
close : last of 4th element in all 1 min data arrays (of that 5 min interval)
volume : last element of last array in all 1 min data arrays (of that 5 min interval)
Please check the visual representation here
Idea is to be able to extract 5 min, 10 min, 30 min, 1 hour, 4 hours, 1 day intervals also in the same manner from the base 1 min database.
You need to use an aggregation pipeline for this. Compare the first element of each inner data array, which is stored as epoch time, against the epoch times of your $start and $end interval, and use those values in the query:
db.col.aggregate(
[
{$project : {
data : {
$filter : {input : "$data", as : "d", cond : {$and : [{$lt : [ {$arrayElemAt : ["$$d" , 0]}, "$end"]}, {$gte : [ {$arrayElemAt : ["$$d" , 0]}, "$start"]}]}}
}
}
}
]
).pretty()
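The $filter above only selects rows in a time window; the roll-up into 5-minute bars still has to happen somewhere. Below is a minimal client-side sketch that follows the rules in the question's NOTE literally (including taking the last volume rather than summing it). resample is a hypothetical helper, and it assumes the rows are sorted by time and that intervals are aligned to the epoch.

```javascript
// Hypothetical helper: rolls 1-minute rows up into larger bars.
// rows: array of [time, open, high, low, close, volume], sorted by time.
// intervalMs: bar width in milliseconds, e.g. 5 * 60 * 1000 for 5 minutes.
function resample(rows, intervalMs) {
  const out = [];
  let currentKey = null;
  let bar = null;
  for (const [time, open, high, low, close, volume] of rows) {
    const key = Math.floor(time / intervalMs); // which interval this row is in
    if (key !== currentKey) {
      if (bar) out.push(bar);
      currentKey = key;
      bar = [time, open, high, low, close, volume]; // open stays from first row
    } else {
      bar[0] = time;                    // time of the last row
      bar[2] = Math.max(bar[2], high);  // highest high
      bar[3] = Math.min(bar[3], low);   // lowest low
      bar[4] = close;                   // close of the last row
      bar[5] = volume;                  // last volume, per the question's rule
    }
  }
  if (bar) out.push(bar);
  return out;
}
```

The same function covers 10 min, 30 min, 1 hour and so on by changing intervalMs, for example resample(doc.data, 60 * 60 * 1000) for hourly bars.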
I have an ISO date in my collection documents.
"start" : ISODate("2015-07-25T17:35:00Z"),
"end" : ISODate("2015-09-01T23:59:00Z"),
Currently they are in GMT+0, but I need them to be GMT+8, so I need to add 8 hours to the existing fields. How do I do this via a MongoDB query?
Advice appreciated.
Updated Code Snippet
var offset = 8,
bulk = db.collection.initializeUnorderedBulkOp(),
count = 0;
db.collection.find().forEach(function(doc) {
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "startDateTime": new Date(
doc.startDateTime.valueOf() + ( 1000 * 60 * 60 * offset )
) }
});
count++;
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
});
if ( count % 1000 !=0 )
bulk.execute();
I agree wholeheartedly with the answer provided by Ewan here: you really should keep all times in a database in UTC, and all the sentiments there are correct. I am only adding to it with practical examples.
As a working example, let's say I have two people using the data, one in New York and one in Sydney, being UTC-5 and UTC+10 respectively. Now consider the following data:
{ "date": ISODate("2015-08-01T04:40:03.389Z") }
Based on that, this is the time the actual "event" takes place. From the perspective of the user in Sydney the event takes place on the 1st August as a whole day, whereas for the person in New York it is still occurring on the 31st July.
If however I construct a "localized" time for Sydney as follows, the UTC consideration is still correct:
new Date("2015/08/01")
ISODate("2015-07-31T14:00:00Z")
This enforces the time difference as it should, by converting from the local timezone to UTC. Therefore a localized date will select the correct values in UTC. So from the Sydney user's perspective, the start of the 1st August includes all times from 2pm on 31st July, and the end date of a range selection is adjusted similarly. With data in UTC, this assertion from the client end is correct, and from their perspective the selected data is in the expected range.
In the case where you were "aggregating" results for a given day, then you build in the "time difference" math into the expression. So for UTC+10 you would do:
var offset = 10;
db.collection.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$add": [
{"$subtract": [ "$date", new Date(0)]},
offset * 1000 * 60 * 60
]},
{ "$mod": [
{ "$add": [
{ "$subtract": [ "$date", new Date(0) ] },
offset * 1000 * 60 * 60
]},
1000 * 60 * 60 * 24
]}
]
},
"count": { "$sum": 1 }
}}
])
This takes the "offset" for the locale into consideration when reporting back the "dates" from the perspective of the client viewing the data. So anything that occurred on an "adjusted date" resulting in a different day, such as the 31st August, would be aggregated into the correct grouping by this adjustment.
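The arithmetic in that $group key can be checked client-side. The sketch below mirrors the same "add the offset, then subtract the remainder modulo one day" steps in plain JavaScript; localDayKey is a hypothetical name, and it assumes a fixed offset with no daylight saving changes and dates after 1970.

```javascript
const HOUR_MS = 1000 * 60 * 60;
const DAY_MS = 24 * HOUR_MS;

// Hypothetical helper mirroring the $group key: shift the UTC instant by
// the offset, then drop the time-of-day part. The result is the start of
// the local day, expressed as epoch milliseconds.
function localDayKey(date, offsetHours) {
  const shifted = date.getTime() + offsetHours * HOUR_MS;
  return shifted - (shifted % DAY_MS);
}

const event = new Date("2015-08-01T04:40:03.389Z");
// Sydney (UTC+10): the event falls on 1st August.
console.log(new Date(localDayKey(event, 10)).toISOString());
// -> 2015-08-01T00:00:00.000Z
// New York (UTC-5): the same event falls on 31st July.
console.log(new Date(localDayKey(event, -5)).toISOString());
// -> 2015-07-31T00:00:00.000Z
```

The printed values carry a "Z" only because the key is a plain number of milliseconds; read them as the local calendar day for the given offset.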
The fact that your data may well be used by people in different timezones is exactly the reason why you should keep dates in UTC. The client will do the work, or you can adjust accordingly where needed.
In short:
Client: Construct in local time, send in UTC
Server: Provide TZ Offset and adjust from UTC to local on return
Leave your dates in the correct format they are already in and use the methods described here to report on them.
But if you made a mistake
If however you made a mistake in the construction of your data and all times are actually "local" times but represented as UTC, i.e.:
ISODate("2015-08-01T11:10:43.569Z") // actually meant to be 11am in UTC+10 :(
Where it should be:
ISODate("2015-08-01T01:10:43.569Z") // This is 11am UTC+10 :)
Then you would correct this as follows:
var offset = 10,
bulk = db.collection.initializeUnorderedBulkOp(),
count = 0;
db.collection.find().forEach(function(doc) {
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "date": new Date(
doc.date.valueOf() - ( 1000 * 60 * 60 * offset )
) }
});
count++;
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
});
if ( count % 1000 !=0 )
bulk.execute();
This reads each document to get the "date" value, adjusts it accordingly, and sends the updated value back to the document.
By default MongoDB stores all DateTimes as UTC.
There are 2 ways of doing this:
App side (Recommended)
When extracting the start and end from the database, in your language of choice just change it from a UTC to a local datetime.
To have a look at a good example in Python, check out this answer
Database side (Not recommended)
The other option is to write a mongodb query which adds 8 hours on to your start and end like you originally wanted. However this then sets the time as UTC but 8 hours in the future and becomes illogical for other developers and when parsing app side.
This requires updating based on another value in your document so you'll have to loop through each document as described here.
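For the app-side route, a minimal JavaScript sketch is below. It leaves the stored UTC value untouched and only changes how it is rendered. formatWithOffset is a hypothetical helper that assumes a fixed offset such as +8 with no daylight saving rules; a real application would more likely use its language's timezone library.

```javascript
// Hypothetical helper: renders a UTC Date as an ISO-8601 string at a fixed
// offset, without modifying the underlying instant.
function formatWithOffset(date, offsetHours) {
  const shifted = new Date(date.getTime() + offsetHours * 60 * 60 * 1000);
  const sign = offsetHours < 0 ? "-" : "+";
  const abs = Math.abs(offsetHours);
  const hh = String(Math.floor(abs)).padStart(2, "0");
  const mm = String(Math.round((abs % 1) * 60)).padStart(2, "0");
  // toISOString() prints the shifted clock time; swap the "Z" for the offset
  // so the string still identifies the original instant.
  return shifted.toISOString().replace("Z", sign + hh + ":" + mm);
}

// The "start" value from the question, shown in GMT+8:
console.log(formatWithOffset(new Date("2015-07-25T17:35:00Z"), 8));
// -> 2015-07-26T01:35:00.000+08:00
```

Because the offset is part of the output string, other developers parsing it still recover the correct UTC instant.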
My objects are of the following structure:
{id: 1234, ownerId: 1, typeId: 3456, date:...}
{id: 1235, ownerId: 1, typeId: 3456, date:...}
{id: 1236, ownerId: 1, typeId: 12, date:...}
I would like to query the database so that it returns all the items that belong to a given ownerId but only the first item of a given typeId. IE the typeId field is unique in the results. I would also like to be able to use skip and limit.
In SQL the query would be something like:
SELECT * FROM table WHERE ownerId=1 SORT BY date GROUP BY typeId LIMIT 10 OFFSET 300
I currently have the following query (using pymongo), but it is giving me errors for using $sort, $limit and $skip:
search_dict['ownerId'] = 1
search_dict['$sort'] = {'date': -1}
search_dict['$limit'] = 10
search_dict['$skip'] = 200
collectionName.group(['typeId'], search_dict, {'list': []}, 'function(obj, prev) {prev.list.push(obj)}')
I have also tried the aggregation route but as I understand grouping will touch all the items in the collection, group them, and then limit and skip. This will be too computationally expensive and slow. I need an iterative grouping algorithm.
search_dict = {'ownerId':1}
collectionName.aggregate([
{
'$match': search_dict
},
{
'$sort': {'date': -1}
},
{
'$group': {'_id': "$typeId"}
},
{
'$skip': skip
},
{
'$limit': 10
}
])
Your aggregation looks correct. You need to include the fields you want in the output in the $group stage using $first.
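A sketch of what that could look like, written as a mongo shell pipeline rather than pymongo (the stage documents are the same in both). buildFirstPerTypePipeline is a hypothetical helper name; the second $sort is an addition made here, on the assumption that you want stable paging, since $group does not guarantee the order of its output.

```javascript
// Hypothetical helper: newest document per typeId for one owner, paged.
function buildFirstPerTypePipeline(ownerId, skip, limit) {
  return [
    { $match: { ownerId: ownerId } },
    { $sort: { date: -1 } },
    // $first picks the first document seen per group, i.e. the newest one
    // given the sort above; $$ROOT keeps the whole document.
    { $group: { _id: "$typeId", doc: { $first: "$$ROOT" } } },
    // Re-sort so that $skip/$limit page through a deterministic order.
    { $sort: { "doc.date": -1, _id: 1 } },
    { $skip: skip },
    { $limit: limit }
  ];
}

// Usage in the shell (collection name is a placeholder):
// db.collectionName.aggregate(buildFirstPerTypePipeline(1, 200, 10));
```

If only a few fields are needed, replace the $$ROOT entry with individual accumulators such as date: { $first: "$date" }.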
grouping will touch all the items in the collection, group them, and then limit and skip. This will be too computationally expensive and slow.
It won't touch all items in the collection. If the match + sort is indexed ({ "ownerId" : 1, "date" : -1 }), the index will be used for the match + sort, and the group will only process the documents that are the result of the match.
The constraint is hardly ever cpu, except in cases of unindexed sort. It's usually disk I/O.
I need an iterative grouping algorithm.
What precisely do you mean by "iterative grouping"? The grouping is iterative, as it iterates over the result of the previous stage and checks which group each document belongs to!
I am not too sure how you got the idea that this operation should be computationally expensive. This isn't really true for most SQL databases, and it surely isn't for MongoDB. All you need is to create an index over your sort criterion.
Here is how to prove it:
Open up a mongo shell and have this executed.
var bulk = db.speed.initializeOrderedBulkOp()
for ( var i = 1; i <= 100000; i++ ){
bulk.insert({field1:i,field2:i*i,date:new ISODate()});
if((i%100) == 0){print(i)}
}
bulk.execute();
The bulk execution may take some seconds. Next, we create a helper function:
Array.prototype.avg = function() {
var av = 0;
var cnt = 0;
var len = this.length;
for (var i = 0; i < len; i++) {
var e = +this[i];
if(!e && this[i] !== 0 && this[i] !== '0') e--;
if (this[i] == e) {av += e; cnt++;}
}
return av/cnt;
}
The troupe is ready, the stage is set:
var times = new Array();
for( var i = 0; i < 10000; i++){
var start = new Date();
db.speed.find().sort({date:-1}).skip(Math.random()*100000).limit(10);
times.push(new Date() - start);
}
print(times.avg() + " msecs");
The output is in msecs. This is the output of 5 runs for comparison:
0.1697 msecs
0.1441 msecs
0.1397 msecs
0.1682 msecs
0.1843 msecs
The test server runs inside a docker image which in turn runs inside a VM (boot2docker) on my 2,13 GHz Intel Core 2 Duo with 4GB of RAM, running OSX 10.10.2, a lot of Safari windows, iTunes, Mail, Spotify and Eclipse additionally. Not quite a production system. And that collection does not even have an index on the date field. With the index, the averages of 5 runs look like this:
0.1399 msecs
0.1431 msecs
0.1339 msecs
0.1441 msecs
0.1767 msecs
qed, hth.
I want to iterate through an array of items, multiplying price by quantity, and then getting a grand total. I've written the below, but I'm wondering if there's a more efficient way than using the for loop I'm using (as in, some magical one-liner).
test = [
{
price: 13
qty: 2
},
{
price: 40
qty: 3
}
]
total = 0
for item in test
total += item.price * item.qty
alert total
Same as the OP, just a tad shorter; a slight adjustment to yours.
test = [
{
price: 13
qty: 2
},
{
price: 40
qty: 3
}
]
total = 0
total += (p.price * p.qty) for p in test
or just a tad shorter:
total += p.price * p.qty for p in test
Looks fine to me. You could do a oneliner
total = (item.price * item.qty for item in test).reduce (t,s) -> t + s
but I don't think it is any improvement, almost certainly less efficient.
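For comparison, here is a sketch of the same idea in plain JavaScript, where reduce can take an initial value and the intermediate array from the comprehension is not needed. The variable name items is just a stand-in for test above.

```javascript
// Same data as in the question.
const items = [
  { price: 13, qty: 2 },
  { price: 40, qty: 3 }
];

// Accumulate price * qty in a single pass, starting from 0.
const total = items.reduce(function (sum, item) {
  return sum + item.price * item.qty;
}, 0);

console.log(total); // 146
```

Starting from 0 also means an empty array gives 0 instead of throwing, which the reduce without an initial value would do.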