MongoDB: calculate 90th percentile among all documents - mongodb

I need to calculate the 90th percentile of the duration where the duration for each document is defined as finish_time - start_time.
My plan was to:
Create $project to calculate that duration in seconds for each document.
Calculate the index (in the sorted list of documents) that corresponds to the 90th percentile: 90th_percentile_index = 0.9 * amount_of_documents.
Sort the documents by the $duration variable that was created.
Use the 90th_percentile_index to $limit the documents.
Choose the first document out of the limited subset of documents.
I'm new to MongoDB, so I guess the query can be improved. The query looks like this:
db.getCollection('scans').aggregate([
{
$project: {
duration: {
$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000] // duration is in seconds
},
Percentile90Index: {
$multiply: [0.9, "$total_number_of_documents"] // I don't know how to get the total number of documents..
}
}
},
{
$sort : {"$duration": 1},
},
{
$limit: "$Percentile90Index"
},
{
$group: {
_id: "_id",
percentiles90 : { $max: "$duration" } // selecting the max, i.e, first document after the limit , should give the result.
}
}
])
The problem I have is that I don't know how to get the total_number_of_documents and therefore I can't calculate the index.
Example:
let's say I have only 3 documents:
{
"_id" : ObjectId("1"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:01:00.000Z"),
}
{
"_id" : ObjectId("2"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:03:00.000Z"),
}
{
"_id" : ObjectId("3"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:08:00.000Z"),
}
So I would expect the result to be something like:
{
percentiles50 : 3 // in minutes, since percentiles50=3 is the minimum value that satisfies the requirement that at least 50% of the documents have duration <= percentiles50
}
I used the 50th percentile in the example because I only gave 3 documents, but it really doesn't matter; just show me a query for the i-th percentile, please, and it will be fine :-)
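One way to pin down the expected result is the nearest-rank rule: sort the durations and take the element at rank ceil(p * N). The sketch below is only an illustration under that assumption. `nearestRankPercentile` and `buildPercentilePipeline` are hypothetical names, the pipeline is built but not executed, and it assumes MongoDB 3.2 or later (for $ceil and $arrayElemAt) and that all durations fit in a single 16MB document.

```javascript
// Nearest-rank percentile: the smallest value such that at least a fraction p
// of the values are <= it. `p` is a fraction in (0, 1], e.g. 0.9 for the 90th.
function nearestRankPercentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical helper: builds (but does not run) a pipeline that applies the
// same rule on the server. It relies on $push keeping the order produced by
// the preceding $sort.
function buildPercentilePipeline(p) {
  return [
    { $project: { duration: { $divide: [{ $subtract: ["$finish_time", "$start_time"] }, 1000] } } },
    { $sort: { duration: 1 } },
    { $group: { _id: null, durations: { $push: "$duration" } } },
    { $project: {
        _id: 0,
        percentile: {
          $arrayElemAt: [
            "$durations",
            { $subtract: [{ $ceil: { $multiply: [p, { $size: "$durations" }] } }, 1] }
          ]
        }
    } }
  ];
}

// The three sample documents from the question (ids omitted).
const scans = [
  { start_time: new Date("2019-02-03T12:00:00.000Z"), finish_time: new Date("2019-02-03T12:01:00.000Z") },
  { start_time: new Date("2019-02-03T12:00:00.000Z"), finish_time: new Date("2019-02-03T12:03:00.000Z") },
  { start_time: new Date("2019-02-03T12:00:00.000Z"), finish_time: new Date("2019-02-03T12:08:00.000Z") }
];
const durationsInSeconds = scans.map(d => (d.finish_time - d.start_time) / 1000);
const median = nearestRankPercentile(durationsInSeconds, 0.5);
console.log(median / 60); // 3 (minutes), matching the expected percentiles50
```

In the shell, the pipeline would be passed as db.getCollection('scans').aggregate(buildPercentilePipeline(0.9)); for very large collections, collecting every duration into one array may not be practical.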

Related

MongoDB Array of Locations return only matched location

To give an example scenario: let's say we have a MongoDB collection of companies. Each company document can have multiple addresses (stored in an array of Addresses). I want to search for companies that are near my location, but only show the address matched by the $geoNear operator, not all the other Address array members.
I'm trying something like:
db.Companies.aggregate(
{
'$geoNear': {
near: [ -77.3898602, 38.8735614],
distanceField: 'dist.Distance',
maxDistance: 0.02020712301086133,
spherical: true,
distanceMultiplier: 4948.75,
includeLocs: "dist.location"
}
})
This gives me the coordinates of the array member that was used to calculate the distance, but I really just want only the parent document minus the address array members that weren't matched.
Any ideas or tips??
Thanks in advance!
Perform a Count
The following example selects documents to process using the $match pipeline operator and then pipes the results to the $group pipeline operator to compute a count of the documents:
db.articles.aggregate( [
{ $match : { score : { $gt : 70, $lte : 90 } } },
{ $group: { _id: null, count: { $sum: 1 } } }
] );
In the aggregation pipeline, $match selects the documents where the score is greater than 70 and less than or equal to 90. These documents are then piped to the $group to perform a count. The aggregation returns the following:
{
"result" : [
{
"_id" : null,
"count" : 3
}
],
"ok" : 1
}
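For intuition, the same selection and count can be mirrored in plain JavaScript. This is only a sketch: the sample scores below are hypothetical and are not taken from the articles collection.

```javascript
// Hypothetical sample data; only the `score` field matters here.
const articles = [
  { score: 80 }, { score: 90 }, { score: 85 }, { score: 70 }, { score: 95 }
];

// Equivalent of { $match: { score: { $gt: 70, $lte: 90 } } }
const matched = articles.filter(a => a.score > 70 && a.score <= 90);

// Equivalent of { $group: { _id: null, count: { $sum: 1 } } }
const countResult = { _id: null, count: matched.length };
console.log(countResult.count); // 3
```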

How mongo index intersection works

If I create two indexes on time, one descending and one ascending, and I query both old and new data, will the search use both indexes so that I get good performance in both cases?
Currently I have a descending index, so querying old data is really slow. I am thinking of creating an additional ascending index and trying it out. Since I have a huge number of documents (32 million), I thought I would ask here first.
This is my index and the query that causes the issue when the start/end time is a bit old. I have a TTL of close to 100 days, which keeps my collection at 32 million documents.
index: {
"source_type" : 1.0 ,
"source_id" : 1.0 ,
"key" : 1.0 ,
"start" : -1.0 ,
"end" : -1.0
}
query: keys = diag_db.telemetry_series.aggregate([
{ '$match': {
'source_type': 'SERVER',
'start': { '$gte': start },
'end': { '$lte': end },
'$or': stream_id_query
}},
{ '$project': {
'source_id': 1,
'key': 1
}},
{ '$group':
{ '_id': { 'source': '$source_id', 'key': '$key' }
}}
])['result']

MongoDB Aggregate Time Series

I'm using MongoDB to store time series data using a similar structure to "The Document-Oriented Design" explained here: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
The objective is to query for the top 10 busiest minutes of the day on the whole system. Each document stores 1 hour of data using 60 sub-documents (1 for each minute). Each minute stores various metrics embedded in the "vals" field. The metric I care about is "orders". A sample document looks like this:
{
"_id" : ObjectId("54d023802b1815b6ef7162a4"),
"user" : "testUser",
"hour" : ISODate("2015-01-09T13:00:00Z"),
"vals" : {
"0" : {
"orders" : 11,
"anotherMetric": 15
},
"1" : {
"orders" : 12,
"anotherMetric": 20
},
.
.
.
}
}
Note there are many users in the system.
I've managed to flatten the structure (somewhat) by doing an aggregate with the following group object:
group = {
$group: {
_id: {
hour: "$hour"
},
0: {$sum: "$vals.0.orders"},
1: {$sum: "$vals.1.orders"},
2: {$sum: "$vals.2.orders"},
.
.
.
}
}
But that just gives me 24 documents (1 for each hour) with the # of orders for each minute during that hour, like so:
{
"_id" : {
"hour" : ISODate("2015-01-20T14:00:00Z")
},
"0" : 282086,
"1" : 239358,
"2" : 289188,
.
.
.
}
Now I need to somehow get the top 10 minutes of the day from this. I suspect it can be done with $project, but I'm not sure how.
You could aggregate as:
$match the documents for the specific date.
Construct the $group and $project objects before querying.
$group by the $hour, accumulating all the documents per hour per minute in an array. Keep the minute somewhere within the document.
$project a variable docs as the $setUnion of all the documents per hour.
$unwind the documents.
$sort by orders.
$limit to the top 10 documents, which is what we require.
Code:
var inputDate = new ISODate("2015-01-09T13:00:00Z");
var group = {};
var set = [];
// One accumulator per minute (0-59): collect each document's metrics for that minute.
for(var i=0;i<60;i++){
group[i] = {$push:{"doc":"$vals."+i,
"hour":"$hour",
"min":{$literal:i}}};
set.push("$"+i);
}
group["_id"] = {$hour:"$hour"};
// Merge the 60 per-minute arrays into a single array.
var project = {"docs":{$setUnion:set}}
db.t.aggregate([
{$match:{"hour":{$lte:inputDate,$gte:inputDate}}},
{$group:group},
{$project:project},
{$unwind:"$docs"},
{$sort:{"docs.doc.orders":-1}},
{$limit:10}, // top 10, as the question asks
{$project:{"_id":0,
"hour":"$_id",
"doc":"$docs.doc",
"min":"$docs.min"}}
])
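To sanity-check what "busiest minutes on the whole system" should return, the computation can be mirrored in plain JavaScript on a small in-memory sample. This is a sketch under assumptions: `topBusiestMinutes` is a hypothetical helper, the sample documents are invented, and unlike the pipeline above it sums orders across users for each (hour, minute) pair before ranking.

```javascript
// Sums `orders` per (hour, minute) across all users, then returns the n
// busiest minutes. Assumes `vals` is keyed by minute ("0".."59").
function topBusiestMinutes(docs, n) {
  const totals = new Map();
  for (const doc of docs) {
    for (const [minute, metrics] of Object.entries(doc.vals)) {
      const key = doc.hour.toISOString() + "|" + minute;
      totals.set(key, (totals.get(key) || 0) + metrics.orders);
    }
  }
  return [...totals.entries()]
    .map(([key, orders]) => {
      const [hour, minute] = key.split("|");
      return { hour, minute: Number(minute), orders };
    })
    .sort((a, b) => b.orders - a.orders) // busiest first
    .slice(0, n);
}

// Hypothetical data: two users in the same hour, two minutes each.
const hourDocs = [
  { user: "a", hour: new Date("2015-01-09T13:00:00Z"), vals: { "0": { orders: 11 }, "1": { orders: 12 } } },
  { user: "b", hour: new Date("2015-01-09T13:00:00Z"), vals: { "0": { orders: 5 }, "1": { orders: 1 } } }
];
const busiest = topBusiestMinutes(hourDocs, 1);
console.log(busiest[0].minute, busiest[0].orders); // 0 16
```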

Count and Aggregate in MongoDB

I have a MongoDB collection whose structure is as follows:
{
"_id" : "mongo",
"log" : [
{
"ts" : ISODate("2011-02-10T01:20:49Z"),
"visitorId" : "25850661"
},
{
"ts" : ISODate("2014-11-01T14:35:05Z"),
"visitorId" : NumberLong(278571823)
},
{
"ts" : ISODate("2014-11-01T14:37:56Z"),
"visitorId" : NumberLong(0)
},
{
"ts" : ISODate("2014-11-04T06:23:48Z"),
"visitorId" : NumberLong(225200092)
},
{
"ts" : ISODate("2014-11-04T06:25:44Z"),
"visitorId" : NumberLong(225200092)
}
],
"uts" : ISODate("2014-11-04T06:25:43.740Z")
}
"mongo" is a search term and "ts" indicates the timestamp when it was searched on website.
"uts" indicates the last time it was searched.
So the search term "mongo" was searched 5 times on our website.
I need to get the top 50 most searched items in the past 3 months.
I am no expert in MongoDB aggregation, but I was trying something like this to at least get the data for the past 3 months:
db.collection.aggregate({$group:{_id:"$_id",count:{$sum:1}}},{$match:{"log.ts":{"$gte":new Date("2014-09-01")}}})
It gave me this error:
exception: sharded pipeline failed on shard DSink9: { errmsg: \"exception: aggregation result exceeds maximum document size (16MB)\", code: 16389
Can anyone please help me?
UPDATE
I was able to write a query, but it gives me a syntax error.
db.collection.aggregate(
{$unwind:"$log"},
{$project:{log:"$log.ts"}},
{$match:{log:{"$gte" : new Date("2014-09-01"),"$lt" : new Date("2014-11-04")}}},
{$project:{_id:{val:{"$_id"}}}},
{$group:{_id:"$_id",sum:{$sum:1}}})
You are exceeding a maximum document size in a result, but generally that is an indication that you are "doing it wrong", particularly given your example term of searching for "mongo" in your stored data between two dates:
db.collection.aggregate([
// Always match first, it reduces the workload and can use an index here only.
{ "$match": {
"_id": "mongo",
"log.ts": {
"$gte": new Date("2014-09-01"), "$lt": new Date("2014-11-04")
}
}},
// Unwind the array to de-normalize as documents
{ "$unwind": "$log" },
// Get the count within the range, so match first to "filter"
{ "$match": {
"log.ts": {
"$gte": new Date("2014-09-01"), "$lt": new Date("2014-11-04")
}
}},
// Group the count on `_id`
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}}
]);
Your aggregation result exceeds MongoDB's maximum document size (16MB). Since MongoDB 2.6 the shell returns aggregation results as a cursor, so this exception is not thrown there; the allowDiskUse option can also help with large pipelines. See the aggregate documentation. You can also optimize your query to reduce the size of the pipeline result; for this, see the related question on aggregation results.
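For the original goal (the top 50 most searched terms in a date range), the pipeline above can be extended with $sort and $limit and the `_id` filter dropped. The following is a sketch, not a tested production query: `buildTopSearchPipeline` and `topSearchTerms` are hypothetical helpers, the pipeline is only built (not run), and the "redis" document is invented for the example.

```javascript
// Builds (but does not run) a pipeline counting log entries per search term
// within [since, until) and keeping the `limit` most searched terms.
function buildTopSearchPipeline(since, until, limit) {
  const range = { $gte: since, $lt: until };
  return [
    { $match: { "log.ts": range } },  // pre-filter documents, can use an index
    { $unwind: "$log" },
    { $match: { "log.ts": range } },  // filter the individual log entries
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: limit }
  ];
}

// In-memory mirror of the same logic, handy for checking expectations.
function topSearchTerms(docs, since, until, limit) {
  return docs
    .map(d => ({
      _id: d._id,
      count: d.log.filter(e => e.ts >= since && e.ts < until).length
    }))
    .filter(r => r.count > 0)
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

const searchDocs = [
  { _id: "mongo", log: [
    { ts: new Date("2011-02-10T01:20:49Z") },
    { ts: new Date("2014-11-01T14:35:05Z") },
    { ts: new Date("2014-11-01T14:37:56Z") },
    { ts: new Date("2014-11-04T06:23:48Z") },
    { ts: new Date("2014-11-04T06:25:44Z") }
  ] },
  { _id: "redis", log: [{ ts: new Date("2014-10-15T08:00:00Z") }] } // hypothetical
];
const topTerms = topSearchTerms(
  searchDocs, new Date("2014-09-01"), new Date("2014-11-04"), 50
);
console.log(topTerms[0]._id, topTerms[0].count); // mongo 2
```

In the shell the pipeline would be run as db.collection.aggregate(buildTopSearchPipeline(new Date("2014-09-01"), new Date("2014-11-04"), 50)). Only two of the five "mongo" entries fall in the range because the upper bound is exclusive and the last two entries are on 2014-11-04.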

How to count the number of documents on date field in MongoDB

Scenario: Consider that I have the following collection in MongoDB:
{
"_id" : "CustomeID_3723",
"IsActive" : "Y",
"CreatedDateTime" : "2013-06-06T14:35:00Z"
}
Now I want to know the count of the created document on the particular day (say on 2013-03-04)
So, I am trying to find the solution using aggregation framework.
Information:
So far I have the following query built:
collection.aggregate([
{ $group: {
_id: '$CreatedDateTime'
}
},
{ $group: {
count: { _id: null, $sum: 1 }
}
},
{ $project: {
_id: 0,
"count" :"$count"
}
}
])
Issue: The above query gives me a count, but not one based on the date alone! It also takes the time into consideration for the unique count.
Question: Considering the field has an ISO date, can anyone tell me how to count the documents based on the date only (i.e. excluding the time)?
Replace your two groups with:
{$project:{day:{$dayOfMonth:'$CreatedDateTime'},month:{$month:'$CreatedDateTime'},year:{$year:'$CreatedDateTime'}}},
{$group:{_id:{day:'$day',month:'$month',year:'$year'}, count: {$sum:1}}}
You can read more about the date operators here: http://docs.mongodb.org/manual/reference/aggregation/#date-operators
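The grouping key used above (year, month, day) can be illustrated in plain JavaScript. This is a minimal sketch, assuming days are taken in UTC as the aggregation date operators do by default; `countByUtcDay` is a hypothetical helper and the second and third documents are invented.

```javascript
// Counts documents per calendar day (UTC), ignoring the time of day.
function countByUtcDay(docs) {
  const counts = {};
  for (const doc of docs) {
    // "2013-06-06T14:35:00Z" -> "2013-06-06"
    const day = new Date(doc.CreatedDateTime).toISOString().slice(0, 10);
    counts[day] = (counts[day] || 0) + 1;
  }
  return counts;
}

const customers = [
  { _id: "CustomeID_3723", CreatedDateTime: "2013-06-06T14:35:00Z" },
  { _id: "CustomeID_3724", CreatedDateTime: "2013-06-06T09:10:00Z" }, // hypothetical
  { _id: "CustomeID_3725", CreatedDateTime: "2013-03-04T23:59:59Z" }  // hypothetical
];
const perDay = countByUtcDay(customers);
console.log(perDay["2013-06-06"]); // 2
console.log(perDay["2013-03-04"]); // 1
```

Note that the $dayOfMonth, $month and $year operators expect a date value, so if CreatedDateTime is stored as a string (as in the sample document) it would need to be stored or converted as a date first.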