I was wondering if someone could help me get my aggregation function right. I'm trying to count the number of times a piece of text appears per hour in a specified day. So far I've got:
db.daily_data.aggregate(
[
{ $project : { useragent: 1, datetime: 1, url: 1, hour: {$hour: new Date("$datetime")} } },
{ $match : { datetime: {$gte: 1361318400000, $lt: 1361404800000}, useragent: /.*LinkCheck by Siteimprove.*/i } },
{ $group : { _id : { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: {$sum: 1} } }
]
);
But I'm obviously getting it wrong as hour is always 0:
{
"result" : [
{
"_id" : {
"useragent" : "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com",
"hour" : 0
},
"queriesPerUseragent" : 94215
}
],
"ok" : 1
}
Here's a trimmed down example of a record too:
{
"_id" : ObjectId("50fe63c70266a712e8663725"),
"useragent" : "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com",
"datetime" : NumberLong("1358848954813"),
"url" : "http://www.somewhere.com"
}
I've also tried using new Date("$datetime").getHours() instead of the $hour operator to try to get the same result, but with no luck. Can someone point me in the direction of where I'm going wrong?
Thanks!
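For what it's worth, the likely culprit is that new Date("$datetime") is evaluated by the mongo shell before the pipeline is sent to the server, so $hour never sees the per-document datetime value. A minimal sketch of one workaround, assuming datetime is stored as milliseconds since the epoch (as in the sample record): convert it inside the pipeline by adding it to a zero date, since adding a number of milliseconds to a Date yields a Date.
db.daily_data.aggregate([
    // Filter first so the conversion only runs on the selected day's documents.
    { $match: { datetime: { $gte: 1361318400000, $lt: 1361404800000 }, useragent: /.*LinkCheck by Siteimprove.*/i } },
    // new Date(0) plus the stored milliseconds produces a proper date, which $hour can consume.
    { $project: { useragent: 1, hour: { $hour: { $add: [ new Date(0), "$datetime" ] } } } },
    { $group: { _id: { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: { $sum: 1 } } }
]);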
This is a recommendation rather than an answer to your problem.
For analytics on MongoDB, it's recommended to pre-aggregate your buckets (hourly buckets in your use case) for every metric you want to calculate.
So, for your metric you can update your pre-aggregated collection (speeding up your query time):
db.user_agent_hourly.update({url: "your_url", useragent: "your user agent", hour: current_HOUR_of_DAY, date: current_DAY_Date}, {$inc: {counter:1}}, {upsert:true})
Take into account that current_DAY_Date has to point to a stable date value for the current day, i.e., current_year/current_month/current_day 00:00:00, using the same hour:minute:second for every metric received on the current day.
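For example, in the mongo shell the two bucket values could be computed like this (a sketch; current_HOUR_of_DAY and current_DAY_Date are just the placeholder names from the update above, and UTC is assumed):
var now = new Date();
var current_HOUR_of_DAY = now.getUTCHours();
// Truncate to midnight so every metric received on the same day shares one stable date value.
var current_DAY_Date = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));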
Then, you can query this collection, extracting aggregated analytics for any given period of time as follows:
db.user_agent_hourly.aggregate(
{$match:{date:{$gte: INITIAL_DATE, $lt: FINAL_DATE}}},
{$group: { _id: { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: {$sum: "$counter"} } },
{$sort:{queriesPerUseragent:-1}}
)
If you want to filter the results by a specific user agent, you can use the following query:
db.user_agent_hourly.aggregate(
{$match: {date: {$gte: INITIAL_DATE, $lt: FINAL_DATE}, useragent: "your_user_agent"}},
{$group: { _id: { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: {$sum: "$counter"} } }
)
PS: We store every single received metric in another collection so we can reprocess it in case of disaster or other needs.
Related
I need to calculate the 90th percentile of the duration where the duration for each document is defined as finish_time - start_time.
My plan was to:
1. Create a $project stage to calculate that duration in seconds for each document.
2. Calculate the index (in the sorted document list) that corresponds to the 90th percentile: 90th_percentile_index = 0.9 * amount_of_documents.
3. Sort the documents by the $duration variable that was created.
4. Use the 90th_percentile_index to $limit the documents.
5. Choose the first document out of the limited subset of documents.
I'm new to MongoDB so I guess the query can be improved. So, the query looks like:
db.getCollection('scans').aggregate([
{
$project: {
duration: {
$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000] // duration is in seconds
},
Percentile90Index: {
$multiply: [0.9, "$total_number_of_documents"] // I don't know how to get the total number of documents..
}
}
},
{
$sort : {"$duration": 1},
},
{
$limit: "$Percentile90Index"
},
{
$group: {
_id: "_id",
percentiles90 : { $max: "$duration" } // selecting the max, i.e, first document after the limit , should give the result.
}
}
])
The problem I have is that I don't know how to get the total_number_of_documents and therefore I can't calculate the index.
Example:
let's say I have only 3 documents:
{
"_id" : ObjectId("1"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:01:00.000Z"),
}
{
"_id" : ObjectId("2"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:03:00.000Z"),
}
{
"_id" : ObjectId("3"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:08:00.000Z"),
}
So I would expect the result to be something like:
{
percentiles50 : 3 // in minutes, since percentiles50=3 is the minimum value that satisfies the requirement that at least 50% of the documents have duration <= percentiles50
}
I used the 50th percentile in the example because I only gave 3 documents, but it really doesn't matter; please just show me a query for the i-th percentile and it will be fine :-)
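For reference, a minimal sketch of one way to implement the plan above entirely inside the pipeline, assuming MongoDB 3.2 or newer (for $arrayElemAt, $ceil and $size): push the sorted durations into an array, use its $size as the total number of documents, and index into it with the nearest-rank formula.
db.getCollection('scans').aggregate([
    // Duration in seconds per document.
    { $project: { duration: { $divide: [ { $subtract: ["$finish_time", "$start_time"] }, 1000 ] } } },
    { $sort: { duration: 1 } },
    // Collect the sorted durations; the array size is the total document count.
    // Note: this assumes the whole array fits in a single 16MB document.
    { $group: { _id: null, durations: { $push: "$duration" } } },
    { $project: {
        _id: 0,
        percentile90: {
            $arrayElemAt: [
                "$durations",
                // Nearest-rank index: ceil(0.9 * n) - 1
                { $subtract: [ { $ceil: { $multiply: [ 0.9, { $size: "$durations" } ] } }, 1 ] }
            ]
        }
    } }
])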
Given the following schema, I want to write a MongoDB query to find all schemes with a duration of more than a year, i.e. (endDate - startDate) > 1 year.
I am not sure if I can specify such an expression, (endDate - startDate) > 1 year, in a MongoDB query.
{
"update" : ISODate("2017-09-26T15:22:13.172Z"),
"create" : ISODate("2017-09-26T15:22:13.172Z"),
"scheme" : {
"since" : ISODate("2017-09-26T15:22:13.172Z"),
"startDate": ISODate("2017-09-26T15:22:13.172Z"),
"endDate": ISODate("2018-09-26T15:22:13.172Z"),
},
}
You can use aggregation with $subtract:
db.yourcollection.aggregate([
    { $project: { dateDifference: { $subtract: [ "$scheme.endDate", "$scheme.startDate" ] } } },
    { $match: { dateDifference: { $gt: 31536000000 } } }
]);
(31536000000 = milliseconds in 365 days, i.e. 365 * 24 * 60 * 60 * 1000)
You may use another $project to output any fields you need in the matching documents.
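On MongoDB 3.6 or newer you could also skip the intermediate $project and filter directly with $expr; a sketch under that version assumption:
db.yourcollection.find({
    $expr: {
        $gt: [ { $subtract: [ "$scheme.endDate", "$scheme.startDate" ] }, 31536000000 ]
    }
})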
I am finding my way through mongodb and have a collection that contains some documents of this shape:
{
"_id" : ObjectId("547a13b70dc5d228db81c475"),
"INSTRUMENT" : "InstrumentA",
"BID" : 5287,
"ASK" : 5290,
"TIMESTAMP" : ISODate("2014-10-01T23:57:27.137Z")
}
{
"_id" : ObjectId("547a0da20dc5d228db2f034d"),
"INSTRUMENT" : "InstrumentB",
"BID" : 0.88078,
"ASK" : 0.88098,
"TIMESTAMP" : ISODate("2014-10-01T23:58:59.637Z")
}
What I am looking to get is the last known mid, (BID + ASK)/2, before a given ISODate for each INSTRUMENT. I got as far as getting the time of the last update across instruments and the last value of that last instrument. Even though the following looks like it works, lastOccurance is being polluted across instruments.
db.runCommand(
{
group:
{
ns: 'collectionTest',
key : { INSTRUMENT : 1} ,
cond: { TIMESTAMP: { $lte: ISODate("2014-10-01 08:30:00") } } ,
$reduce: function( curr, result ) {
if(curr.TIMESTAMP > result.lastOccurance)
{
result.lastOccurance = curr.TIMESTAMP;
result.MID = (curr.BID + curr.ASK)/2;
result.INSTRUMENT = curr.INSTRUMENT;
}else
{
result.lastOccurance = null;
result.MID = null;
result.INSTRUMENT = null;
}
},
initial: { lastOccurance : ISODate("1900-01-01 00:00:00") }
}
}
)
If anybody can see a fix for this code, please let me know.
It's better to use aggregate instead of group whenever possible because it provides better performance and supports sharding.
With aggregate you can do this as:
db.test.aggregate([
// Only include the docs prior to the given date
{$match: {TIMESTAMP: { $lte: ISODate("2014-10-01 08:30:00") }}},
// Sort them in descending TIMESTAMP order
{$sort: {TIMESTAMP: -1}},
// Group them by INSTRUMENT, taking the first one in each group (which will be
// the last one before the given date) and computing the MID value for it.
{$group: {
_id: '$INSTRUMENT',
MID: {$first: {$divide: [{$add: ['$BID', '$ASK']}, 2]}},
lastOccurance : {$first: '$TIMESTAMP'}
}}
])
I am experiencing a very weird issue with MongoDB shell version: 2.4.6. It has to do with creating ISODate objects from strings. See below for a specific example.
Why does this not work?
collection.aggregate({$project: {created_at: 1, ts: {$add: new Date('created_at')}}}, {$limit: 1})
{
"result" : [
{
"_id" : ObjectId("522ff3b075e90018b2e2dfc4"),
"created_at" : "Wed Sep 11 04:38:08 +0000 2013",
"ts" : ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")
}
],
"ok" : 1
}
But this does.
collection.aggregate({$project: {created_at: 1, ts: {$add: new Date('Wed Sep 11 04:38:08 +0000 2013')}}}, {$limit: 1})
{
"result" : [
{
"_id" : ObjectId("522ff3b075e90018b2e2dfc4"),
"created_at" : "Wed Sep 11 04:38:08 +0000 2013",
"ts" : ISODate("2013-09-11T04:38:08Z")
}
],
"ok" : 1
}
The short answer is that you're passing the string 'created_at' to the Date constructor. If you pass a malformed date string to the constructor, you get ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ") in return.
To properly create a new date you'd have to do so by passing in the contents of 'created_at'. Unfortunately, I don't know of a way to run a date constructor on a string using the aggregation framework at this time. If your collection is small enough, you could do this in the client by iterating over your collection and adding a new date field to each document.
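A minimal sketch of that client-side approach, assuming created_at is a parseable date string on every document (the created_at_date field name is just an example):
// Iterate the documents in the shell and materialize a real Date field.
collection.find({ created_at: { $type: 2 } }).forEach(function (doc) {   // BSON type 2 = string
    collection.update(
        { _id: doc._id },
        { $set: { created_at_date: new Date(doc.created_at) } }
    );
});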
Kind of similar problem:
I had documents in MongoDB with NO date fields set, and the DB is filling up, so ultimately I need to go in and delete items that are older than one year.
I did sort of have a date in a crappy human-readable string format, which I can grab from '$ParentMasterItem.Name';
Example: '20211109 VendorName ProductType Pricing Update Workflow'.
So here's my attempt to pull out the dates via substring parsing (thankfully, I happen to know that every one of the 100K documents is formatted the same way):
db.MyCollectionName.aggregate({$project: {
created_at: 1,
ts: {$add: {
$dateFromString: {
dateString: {
/* 0123 (year)
45 (month)
67 (day)
# '20211109 blahblah string'*/
$concat: [ { $substr: [ "$ParentMasterItem.Name", 0, 4 ]}, "-",
{ $substr: [ "$ParentMasterItem.Name", 4, 2 ]}, "-",
{ $substr: [ "$ParentMasterItem.Name", 6, 2 ]},
'T01:00:00Z']
}}}}}}, {$limit: 10})
output:
{ _id: 8445390, ts: ISODate("2022-12-19T01:00:00.000Z") },
I have "users" collection and i want day by day total user count eg:
01.01.2012 -> 5
02.01.2012 -> 9
03.01.2012 -> 18
04.01.2012 -> 24
05.01.2012 -> 38
06.01.2012 -> 48
I have a createdAt attribute for each user. Can you help me with the query?
{
"_id" : ObjectId( "5076d3e70546c971539d9f8a" ),
"createdAt" : Date( 1339964775466 ),
"points" : 200,
"profile" : null,
"userId" : "10002"
}
Here is what works for me for day-by-day count data.
Output I got:
30/3/2016 4
26/3/2016 4
21/3/2016 4
12/3/2016 12
14/3/2016 18
10/3/2016 10
9/3/2016 11
8/3/2016 19
7/3/2016 21
script:
model.aggregate({
$match: {
createdAt: {
$gte: new Date("2016-01-01")
}
}
}, {
$group: {
_id: {
"year": { "$year": "$createdAt" },
"month": { "$month": "$createdAt" },
"day": { "$dayOfMonth": "$createdAt" }
},
count:{$sum: 1}
}
}).exec(function(err,data){
if (err) {
console.log('Error Fetching model');
console.log(err);
} else {
console.log(data);
}
});
You have a couple of options, in order of performance:
1. Maintain the count in separate aggregation documents. Every time you add a user you update the counter for that day (so each day has its own counter document in, say, a users.daycounters collection). This is easily the fastest approach and scales best (see the sketch after this list).
2. In 2.2 or higher you can use the aggregation framework. Examples close to your use case are documented here. Look for the $group operator: http://docs.mongodb.org/manual/applications/aggregation/
3. You can use the map/reduce framework: http://www.mongodb.org/display/DOCS/MapReduce. This is sharding compatible but relatively slow due to the JavaScript context use. Also, it's not very straightforward for something as simple as this.
4. You can use the group() operator documented here: http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group. Since this does not work in a sharded environment and is generally slow due to the use of the single-threaded JavaScript context, it is not recommended.
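A minimal sketch of option 1, using a hypothetical users.daycounters collection where the counter is bumped every time a user is created:
// One counter document per calendar day (UTC); field and collection names are illustrative.
var now = new Date();
var day = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
db.getCollection("users.daycounters").update(
    { day: day },
    { $inc: { count: 1 } },
    { upsert: true }
);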