Doing Map / Reduce Trailing 7 & 30 Day Calculations - mongodb

We are having an issue writing a map/reduce job in the Mongo shell to process web logs. We have it calculating the daily mobile and desktop user hits, but we are stuck when trying to reference past documents to calculate trailing 7 and 30 day user hits. Any help or advice would be appreciated.
{
    "_id" : {
        "SiteName" : "All Sites",
        "Date" : ISODate("2011-01-18T00:00:00Z")
    },
    "value" : {
        "Day" : {
            "AccessTypeTotal" : 9,
            "AccessTypeDirect" : 0,
            "AccessTypeDirectPerc" : 0,
            "AccessTypeSearch" : 8,
            "AccessTypeSearchPerc" : 88.88888888888889,
            "AccessTypeNavigation" : 1,
            "AccessTypeNavigationPerc" : 11.11111111111111
        }
    }
}

The MongoDB Cookbook has an excellent article that describes this process.
For trailing 30 days use something like this:
thirty_days_ago = new Date(Date.now() - 60 * 60 * 24 * 30 * 1000);
db.pageviews.mapReduce(map, reduce,
    {out: 'pageview_results', query: {Date: {'$gt': thirty_days_ago}}});
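The trailing 7 days works the same way with a different cutoff (the output collection name below is just a placeholder):
seven_days_ago = new Date(Date.now() - 60 * 60 * 24 * 7 * 1000);
db.pageviews.mapReduce(map, reduce,
    {out: 'pageview_results_7d', query: {Date: {'$gt': seven_days_ago}}});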
Read the full article for a better understanding of how to use this with your documents.

Related

MongoDB: maximum number of documents in a capped collection

I'm using a capped collection and I defined max size to be 512000000 (512MB)
stats() says (After 1 insert): size:55, storageSize:16384.
Assuming that all documents are the same size, how many documents can I store?
Is it 512000000 / 55 or 512000000 / 16384?
For a capped collection, it's maxSize / avgObjSize. If your documents are about the same size, then it's practically maxSize / size.
You can verify this using a smaller more manageable number:
// create a capped collection with maxSize of 1024
> db.createCollection('test', {capped: true, size: 1024})
// insert one document to get an initial size
> db.test.insert({a:0})
> db.test.stats().size
33
// with similar documents, the collection should hold 1024/33 ~= 31 documents
// so let's insert 99 more to make sure it's full
> for(i=1; i<100; i++) { db.test.insert({a:i}) }
> db.test.stats()
{
    "ns" : "test.test",
    "size" : 1023,
    "count" : 31,
    "avgObjSize" : 33,
    "storageSize" : 36864,
    "capped" : true,
    "max" : -1,
    "maxSize" : 1024,
    ....
So from the experiment above, the count is 31 as expected, even though we inserted 100 documents in total.
Using your numbers, the max number of documents in your capped collection would be 512000000 / 55 ~= 9,309,090 documents.
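As a quick check on a live collection, the same estimate can be read straight out of stats() (a sketch using the test collection from the experiment above):
> var s = db.test.stats()
// expected maximum document count, assuming roughly uniform document sizes
> Math.floor(s.maxSize / s.avgObjSize)
31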

Calculating moving average for every 5 seconds in MongoDB

I want to calculate a moving average for my data in MongoDB. My data structure is as follows:
{
    "_id" : NUUID("54ab1171-9c72-57bc-ba20-0a06b4f858b3"),
    "DateTime" : ISODate("2018-05-30T21:31:05.957Z"),
    "Type" : 3,
    "Value" : NumberDecimal("15.905414991993847")
}
I want to calculate the average of the values for each Type, over 2 days, in 5 second buckets. For now I put Type in the $match stage, but I would prefer to group and separate the results by Type. What I have so far is below:
var start = new Date("2018-05-30T21:31:05.957Z");
var end = new Date("2018-06-01T21:31:05.957Z");
var arr = new Array();
for (var i = 0; i < 34560; i++) {
    start.setSeconds(start.getSeconds() + 5);
    if (start <= end) {
        var a = new Date(start);
        arr.push(a);
    }
}
db.Data.aggregate([
    {$match: {"DateTime": {$gte: new Date("2018-05-30T21:31:05.957Z"),
                           $lte: new Date("2018-06-01T21:31:05.957Z")},
              "Type": 3}},
    {$bucket: {
        groupBy: "$DateTime",
        boundaries: arr,
        default: "Other",
        output: {
            "count": {$sum: 1},
            "Value": {$avg: "$Value"}
        }
    }}
])
It seems to work, but the performance is too slow. How can I make it faster?
I reproduced the behavior you describe with 2 days worth of 1 second observations in the DB and a $match that pulls just one day's worth. The agg works "fine" if you bucket by, say, 60 seconds. But 15 seconds took 6 times as long, to 30 seconds. And every 5 seconds? 144 seconds. 5 seconds yields an array of 17280 buckets. Yep.
So I went client-side and dragged all 43200 docs to the client and created a naive linear search bucket slot finder and calc in javascript.
// findSlot(arr, d) is a helper (not shown) that returns the index of the
// bucket in the boundary array arr containing date d -- initially a naive linear scan
c = db.foo.aggregate([
    {$match: {"date": {$gte: new Date(osv), $lte: new Date(endv)}}}
]);
c.forEach(function(r) {
    var x = findSlot(arr, r['date']);
    if(buckets[x] == undefined) {
        buckets[x] = {lb: arr[x], ub: arr[x+1], n: 0, v: 0};
    }
    var zz = buckets[x];
    zz['n']++;
    zz['v'] += r['val'];
});
This actually worked somewhat faster but same order of performance, about 92 seconds.
Next, I changed the linear search in findSlot to a bisection search. The 5 second bucket went from 144 seconds to .750 seconds: almost 200x faster. This includes dragging the 43200 records and running the forEach and bucketing logic above. So it stands to reason that $bucket may not be using a great algo and suffers when the bucket array is more than a couple hundred long.
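For reference, a bisection-search version of that slot finder might look like this (a sketch; the original helper isn't shown):
// binary search over the sorted boundary array; returns the index i
// such that arr[i] <= d < arr[i+1]
function findSlot(arr, d) {
    var lo = 0, hi = arr.length - 1;
    while (lo < hi) {
        var mid = Math.floor((lo + hi + 1) / 2);
        if (arr[mid] <= d) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return lo;
}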
Acknowledging this, we can instead make use of $floor of the delta between the start time and the observation time to bucket the data:
// now, endv, and secondsBucket (e.g. 5) are shell variables set beforehand
db.foo.aggregate([
    {$match: {"date": {$gte: now, $lte: new Date(endv)}}}

    // Bucket by turning the offset from "now" into a floor division by the
    // number of seconds of grouping. In this way, the resulting number becomes
    // the slot into the virtual buckets, e.g. for 5 second buckets:
    //   date            now             diff/1000   floor
    //   1514764800000   1514764800000   0           0
    //   1514764802000   1514764800000   2           0
    //   1514764804000   1514764800000   4           0
    //   1514764806000   1514764800000   6           1
    //   1514764808000   1514764800000   8           1
    //   1514764810000   1514764800000   10          2
    ,{$addFields: {"ff": {$floor: {$divide: [{$divide: [{$subtract: ["$date", now]}, 1000.0]}, secondsBucket]}}}}

    // Now just group by the numeric slot number!
    ,{$group: {_id: "$ff", n: {$sum: 1}, tot: {$sum: "$val"}, avg: {$avg: "$val"}}}

    // Get it in 0-n order....
    ,{$sort: {_id: 1}}
]);
found 17280 in 204 millis
So we now have a server-side solution that runs in just 0.204 seconds, or 700x faster. You don't have to sort the input because $group takes care of bundling the slot numbers, and the $sort after the $group is optional (but handy).
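If you need actual timestamps back instead of slot numbers, one option (a sketch reusing the now and secondsBucket variables from above) is to rebuild each bucket's start time from the slot number in a stage appended after the $group:
// adding milliseconds to a date with $add yields a date, so this turns
// the numeric slot back into the bucket's start time
,{$addFields: {bucketStart: {$add: [now, {$multiply: ["$_id", secondsBucket, 1000]}]}}}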

How to aggregate OHLC 5 min from 1 min nested array data (mongodb, mongoose)

I have MongoDB storage with 1 minute OHLCV data (time, open, high, low, close, volume) like below, stored using mongoose in Node.js.
{
    "_id" : 1,
    "__v" : 0,
    "data" : [
        [
            1516597690510,
            885000,
            885000,
            885000,
            885000,
            121.2982
        ],
        [
            1516597739868,
            885000,
            885000,
            885000,
            885000,
            121.2982
        ]
        ...
    ]
}
I need to extract the same format for 5 minute intervals from this data. I could not find out how to do that in mongodb/mongoose, even after several hours of searching, as I am a newbie. Kindly help. It is confusing especially because it is a nested array without named fields inside.
NOTE: Suppose that for 5 min data you have 4 samples (arrays) of 1 min data from the database; then
time : the time element of the last 1 min data array (of that 5 min interval)
open : the first element of the first 1 min data array (of that 5 min interval)
high : the max of the 2nd element across all 1 min data arrays (of that 5 min interval)
low : the min of the 3rd element across all 1 min data arrays (of that 5 min interval)
close : the 4th element of the last 1 min data array (of that 5 min interval)
volume : the last element of the last 1 min data array (of that 5 min interval)
Please check the visual representation here
Idea is to be able to extract 5 min, 10 min, 30 min, 1 hour, 4 hours, 1 day intervals also in the same manner from the base 1 min database.
You need to use the aggregation pipeline for this: the first element of each data array is an epoch timestamp, so get the epoch times of the start and end of your interval and use those values in the query.
// start and end are assumed to be the epoch-millisecond bounds of the interval you want
db.col.aggregate(
    [
        {$project : {
            data : {
                $filter : {
                    input : "$data",
                    as : "d",
                    cond : {$and : [
                        {$lt : [{$arrayElemAt : ["$$d", 0]}, end]},
                        {$gte : [{$arrayElemAt : ["$$d", 0]}, start]}
                    ]}
                }
            }
        }}
    ]
).pretty()
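To then collapse that filtered window into a single 5 minute OHLCV row, a follow-up $project along these lines could work (a sketch rather than part of the original answer; element positions follow the question's [time, open, high, low, close, volume] layout and start/end are the same epoch bounds as above):
db.col.aggregate([
    {$project : {
        data : {$filter : {input : "$data", as : "d", cond : {$and : [
            {$gte : [{$arrayElemAt : ["$$d", 0]}, start]},
            {$lt : [{$arrayElemAt : ["$$d", 0]}, end]}
        ]}}}
    }},
    {$project : {
        time : {$arrayElemAt : [{$arrayElemAt : ["$data", -1]}, 0]},   // time of the last 1 min bar
        open : {$arrayElemAt : [{$arrayElemAt : ["$data", 0]}, 1]},    // open of the first 1 min bar
        high : {$max : {$map : {input : "$data", as : "d", in : {$arrayElemAt : ["$$d", 2]}}}},
        low : {$min : {$map : {input : "$data", as : "d", in : {$arrayElemAt : ["$$d", 3]}}}},
        close : {$arrayElemAt : [{$arrayElemAt : ["$data", -1]}, 4]},  // close of the last 1 min bar
        volume : {$arrayElemAt : [{$arrayElemAt : ["$data", -1]}, 5]}  // volume of the last 1 min bar, per the question's spec
    }}
])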

MongoDB query for count based on some value in other collection

I have a configuration collection with the following fields:
1) Model
2) Threshold
In the above collection, a threshold value is given for every model, as follows:
'model1' 200
'model2' 400
'model3' 600
There is another collection named customer with the following fields:
1) model
2) customerID
3) baseValue
In the above collection, the data is as follows:
'model1' 'BIXPTL098' 300
'model2' 'BIXPTL448' 350
'model3' 'BIXPTL338' 500
Now I need to get the count of customer records whose baseValue is greater than the threshold of their model in the configuration collection.
Example: For the above demo data, the query should return 1, as there is only one customer (BIXPTL098) with a baseValue (300) greater than the Threshold (200) for its model (model1) in the configuration.
There are thousands of records in the configuration collection. Any help is appreciated.
How often does the threshold change? If it doesn't change very often, I'd store the difference between the model threshold and the customer baseValue on each document.
{
    "model" : "model1",
    "customerID" : "BIXPTL098",
    "baseValue" : 300,
    "delta" : 100 // customer baseValue - model1 threshold = 300 - 200 = 100
}
and query for delta > 0:
db.customers.find({ "delta" : { "$gt" : 0 } })
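To backfill delta on the existing customer documents, a small shell loop could work (a sketch, assuming the models/customers collection names used in the snippets here):
db.customers.find().forEach(function(c) {
    // look up this customer's model threshold and store the difference
    var t = db.models.findOne({ "model" : c.model }).threshold;
    db.customers.update({ "_id" : c._id }, { "$set" : { "delta" : c.baseValue - t } });
});
The loop would need to be rerun (or the write path updated) whenever a threshold changes.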
If the threshold changes frequently, the easiest option would be to compute customer documents exceeding their model threshold on a model-by-model basis:
> var mt = db.models.findOne({ "model" : "model1" }).threshold
> db.customers.find({ "model" : "model1", "baseValue" : { "$gt" : mt } })
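To total that count across every model in one pass, a sketch along the same lines (again assuming models/customers as the collection names) would be:
var total = 0;
db.models.find().forEach(function(m) {
    // count the customers of this model whose baseValue exceeds its threshold
    total += db.customers.count({ "model" : m.model, "baseValue" : { "$gt" : m.threshold } });
});
print(total);   // 1 for the demo data above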

Get closest data using centerSphere - MongoDB

I'm trying to get the closest data from the following data
> db.points.insert({name:"Skoda" , pos : { lon : 30, lat : 30 } })
> db.points.insert({name:"Honda" , pos : { lon : -10, lat : -20 } })
> db.points.insert({name:"Skode" , pos : { lon : 10, lat : -20 } })
> db.points.insert({name:"Skoda" , pos : { lon : 60, lat : -10 } })
> db.points.insert({name:"Honda" , pos : { lon : 410, lat : 20 } })
> db.points.ensureIndex({ loc : "2d" })
then I tried
> db.points.find({"loc" : {"$within" : {"$centerSphere" : [[0 , 0],5]}}})
I got the following error:
error: {
    "$err" : "Spherical MaxDistance > PI. Are you sure you are using radians?",
    "code" : 13461
}
Then I tried
> db.points.find({"loc" : {"$within" : {"$centerSphere" : [[10 , 10],2]}}})
and got a different error:
error: {
    "$err" : "Spherical distance would require wrapping, which isn't implemented yet",
    "code" : 13462
}
How can I get this done? I just want to get the closest data within the given radius of a geo point.
Thanks
A few things to note. Firstly, you are storing your coordinates in a field called "pos" but you are doing a query (and have created an index) on a field called "loc."
The $centerSphere takes a set of coordinates and a value that is in radians. So $centerSphere: [[10, 10], 2] searches for items around [10, 10] in a circle that is 2 * (earth's radius) = 12,756 km. The $centerSphere operator is not designed to search for documents in this large of an area (and wrapping around the poles of the Earth is tricky). Try using a smaller value, such as .005.
Finally, it is probably a better idea to store coordinates as elements in an array since some drivers may not preserve the order of fields in a document (and swapping latitude and longitude results in drastically different geo locations).
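Putting those two fixes together, a corrected setup might look like this (a sketch; the array field name is just an example):
// store [longitude, latitude] as an array and index the same field you query
> db.points.insert({ name : "Skoda", pos : [ 30, 30 ] })
> db.points.ensureIndex({ pos : "2d" })
// radius in radians: .005 * 6378 km ~= 32 km around [0, 0]
> db.points.find({ pos : { $within : { $centerSphere : [ [ 0, 0 ], .005 ] } } })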
Hope this helps:
The radius of the Earth is approximately 3963.192 miles or 6378.137 kilometers.
For 1 mile:
db.places.find( { loc: { $within: { $centerSphere: [ [ -74, 40.74 ],
    1 / 3963.192 ] } } } )