Can sorting before grouping improve query performance in Mongo using the aggregate framework? - mongodb

I'm trying to aggregate data for 100 accounts for a 14-15 month period, grouping by year and month.
However, the query performance is horrible as it takes 22-27 seconds. There are currently over 15 million records in the collection and I've got an index on the match criteria and can see using explain() that the optimizer uses it.
I tried adding another index on the sort criteria in the query below and after adding the index, the query now takes over 50 seconds! This happens even after I remove the sort from the query.
I'm extremely confused. I thought that because grouping can't use an index, the grouping could be much faster if the collection were sorted beforehand. Is this assumption correct? If not, what other options do I have? I can tolerate query times of up to 5 seconds, but nothing more than that.
//Document Structure
{
Acc: 1,
UIC: true,
date: ISODate("2015-12-01T05:00:00Z"),
y: 2015,
mm: 12,
value: 22.3
}
//Query
db.MyCollection.aggregate([
{ "$match" : { "UIC" : true, "Acc" : { "$in" : [1, 2, 3, ..., 99, 100] }, "date" : { "$gte" : ISODate("2015-12-01T05:00:00Z"), "$lt" : ISODate("2017-02-01T05:00:00Z") } } },
//{ "$sort" : { "UIC" : 1, "Acc" : 1, "y" : -1, "mm" : 1 } },
{ "$group" : { "_id" : { "Num" : "$Num", "Year" : "$y", "Month" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
])

What I would suggest is to write a script (it can be in Node.js) that aggregates the data into a different collection. When you have long-running queries like this, it's advisable to maintain a separate collection containing the pre-aggregated data and query that instead.
My second piece of advice would be to give this aggregated collection a composite key and search it by regular expression. In your case I would build the key as accountId:period. For example, for account 1 and February 2016, the key would be something like 1:201602.
Then you would be able to query by account and timestamp using regular expressions. For instance, if you wanted the records for account 1 in 2016, you could do something like:
db.aggregatedCollection.find({ _id: /^1:2016/ })
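As a rough illustration of that idea, here is a minimal sketch of building such a collection in the mongo shell; the aggregatedCollection name, the zero-padded key format, and the updateOne upserts are assumptions for the example, not part of the original suggestion:
//Build the pre-aggregated collection, keyed by "Acc:YYYYMM"
db.MyCollection.aggregate([
  { "$match" : { "UIC" : true } },
  { "$group" : { "_id" : { "Acc" : "$Acc", "y" : "$y", "mm" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
]).forEach(function (doc) {
  var mm = (doc._id.mm < 10 ? "0" : "") + doc._id.mm; // zero-pad the month
  var key = doc._id.Acc + ":" + doc._id.y + mm;       // e.g. "1:201602"
  db.aggregatedCollection.updateOne(
    { _id: key },
    { $set: { Sum: doc.Sum } },
    { upsert: true }
  );
});
Because the regular expression above is anchored with ^, MongoDB can use the default _id index as a prefix scan for that lookup.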
Hope my answer was helpful

Related

Is searching by _id in mongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping and the cache is very small and in memory [almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found in the MongoDB documentation for explain().
How efficient is search by _id and indexes
To answer your question, using indexes is always more efficient. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since _id is indexed by default in MongoDB, searching by it is efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
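For instance, a minimal sketch of indexing a unique string field and then checking with explain() that the lookup becomes an index scan; the users collection and externalId field are made up for illustration:
db.users.createIndex( { externalId: 1 }, { unique: true } )
// The winning plan should now show an IXSCAN stage instead of COLLSCAN
db.users.find( { externalId: "abc-123" } ).explain("executionStats")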
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Create custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.

MongoDB - Get aggregated difference between two date fields

I have one collection called lists with following fields:
{ "_id" : ObjectId("5a7c9f60c05d7370232a1b73"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:40:10Z") }
{ "_id" : ObjectId("5a7c9f85c05d7370232a1b74"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:41:10Z") }
{ "_id" : ObjectId("5a7c9f89c05d7370232a1b75"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:42:10Z") }
{ "_id" : ObjectId("5a7c9f8cc05d7370232a1b76"), "created_date" : ISODate("2018-11-10T04:40:11Z"), "processed_date" : ISODate("2018-11-10T04:42:20Z") }
I need to find the aggregated result in the following format (counts based on the difference between processed_date and created_date):
[{
"30Sec":count_for_diffrence_1,
"<=60Sec":count_for_diffrence_2,
"<=90Sec":count_for_diffrence_3
}]
One more thing: I also want to find out how many items took 30 sec, 60 sec, and so on, and to make sure that a result counted in <=60Sec does not also appear in <=90Sec.
Any help will be appreciated.
You can try the below aggregation query in MongoDB 3.6.
$match with $expr to limit the documents to those where the time difference is 90 seconds or less.
$group with $sum to count the occurrences in the different time slices.
db.collection.aggregate([
{"$match":{"$expr":{"$lte":[{"$subtract":["$processed_date","$created_date"]},90000]}}},
{"$group":{
"_id":null,
"30Sec":{"$sum":{"$cond":{"if":{"$eq":[{"$subtract":["$processed_date","$created_date"]},30000]},"then":1,"else":0}}},
"<=60Sec":{"$sum":{"$cond":{"if":{"$lte":[{"$subtract":["$processed_date","$created_date"]},60000]},"then":1,"else":0}}},
"<=90Sec":{"$sum":{"$cond":{"if":{"$lte":[{"$subtract":["$processed_date","$created_date"]},90000]},"then":1,"else":0}}}
}}
])
Note: if the created date can be greater than the processed date, you may want to add a condition so you only look at values where the difference is between 0 and the requested time slice.
Something like
{$and:[{"$gte":[{"$subtract":["$processed_date","$created_date"]},0]}, {"$lte":[{"$subtract":["$processed_date","$created_date"]},60000]}]}

mongodb $dayOfYear equivalent Unix epoch time aggregation

Is there a method of grouping a Unix epoch time by day, equivalent to $dayOfYear?
Or a process for aggregating floats and ints (into quartiles, hundreds, thousands, %)?
I'd like to avoid map-reduce, but an example of it would be awesome.
You can almost, but not quite, use Unix time seconds in the aggregation pipeline by utilizing the $mod and $divide operators.
The math is: divide the Unix time seconds by 86400 to convert seconds into days since the epoch, then take that result modulo 365.25 to get the day of the year (one leap day every 4 years).
So the full aggregation for $dayOfYear using seconds is almost as simple as
db.MyCollection.aggregate(
  { $project : { "day" : { $mod : [ { $divide : ["$unix_seconds", 86400] }, 365.25 ] } } },
  { $group : { _id : "$day", num : { $sum : 1 } } },
  { $sort : { _id : 1 } }
)
The above adds sorting for sequential day of year.
The problem is that the $mod operator returns both the whole number and the remainder, and there is no way of rounding or truncating the remainder. Therefore the results are grouped by whole number plus remainder.
{
"_id" : 235.1864887063916,
"num" : 1
},
{
"_id" : 235.24300889818738,
"num" : 1
},
{
"_id" : 235.60299520864623,
"num" : 3
},
{
"_id" : 235.66453935674085,
"num" : 1
},
{
"_id" : 235.79900382758004,
"num" : 1
},
{
"_id" : 235.80265845312474,
"num" : 1
},
... when clearly we want only the whole number:
{
"_id" : 235,
"num" : 8
},
What would be nice would be a $trunc operator, or a modulo that returns only the whole number ($modw) and a mod that returns only the remainder ($modr), in mongo.
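As an aside, newer MongoDB releases (3.2 and later) do provide $floor and $trunc aggregation operators, so on those versions the remainder can simply be dropped; a sketch assuming such a version:
db.MyCollection.aggregate(
  { $project : { "day" : { $floor : { $mod : [ { $divide : ["$unix_seconds", 86400] }, 365.25 ] } } } },
  { $group : { _id : "$day", num : { $sum : 1 } } },
  { $sort : { _id : 1 } }
)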
JavaScript has the Date object, which would be available to any server-side JavaScript processing for MapReduce functions.
You seem to be aware of the $dayOfYear operator in the aggregation pipeline. There are other operators there for processing dates.
Unless your needs are very specific you should be using the aggregation pipeline. It is very flexible and in most cases will be considerably faster than the equivalent actions run under mapReduce.
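For example, since $add with a date operand treats numeric operands as milliseconds and returns a date, epoch seconds can be converted inline and fed straight to $dayOfYear; a sketch using the unix_seconds field from the earlier example:
db.MyCollection.aggregate(
  { $project : { "day" : { $dayOfYear : { $add : [ new Date(0), { $multiply : ["$unix_seconds", 1000] } ] } } } },
  { $group : { _id : "$day", num : { $sum : 1 } } },
  { $sort : { _id : 1 } }
)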

How to store MapReduce result in hierarchically in mongo

I want to perform a map-reduce operation on some metric and store the result both aggregated and as a time series.
Storing the aggregated result seems to be very simple, but how can I store the result in a time-series fashion, i.e. whenever the map-reduce function runs, the value at that interval should also be recorded in the result collection (i.e. time-series data)?
Let's say I have the following result out of my map-reduce aggregation:
> db.result.find()
{ "_id" : { "eventId" : 1}, "value" : { "sum" : 21 } }
{ "_id" : { "eventId" : 2}, "value" : { "sum" : 31 } }
I am able to achieve the above very easily with the map-reduce / aggregation framework.
I want the result to be stored as a time series as well, like below:
> db.result.find()
{ "_id" : { "eventId" : 1}, "value" : { "sum" : 21, "ts": {1: 15, 2: 4, 3: 2 } } }
{ "_id" : { "eventId" : 2}, "value" : { "sum" : 31, "ts": {1: 12, 2: 12, 3: 7 } } }
Now whenever the map-reduce function runs, it should update the result collection.
I tried numerous ways to do so, but was unable to succeed. Any idea how I can achieve this?
Also, if this is possible within the same map-reduce function call, that would be great.
The general recommendation for such time series data is to use pre-aggregated reports.
If that is not possible, first consider using the aggregation pipeline instead of map-reduce. It's faster and easier if your use case allows it.
With both the aggregation pipeline and map-reduce, you can use the results to create the desired document; the $setOnInsert update operator may be helpful.
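For the time-series part, a minimal sketch of folding each run's output into the result collection with upserts; the perEventSums array, the intervalNo counter, and the firstSeen field are hypothetical placeholders for whatever your aggregation or map-reduce run produces:
// After each run, add this interval's sums into the result collection
var intervalNo = 3;                           // hypothetical counter for the current run
perEventSums.forEach(function (doc) {         // e.g. [{ eventId: 1, sum: 15 }, ...]
  var inc = { "value.sum": doc.sum };
  inc["value.ts." + intervalNo] = doc.sum;    // record this interval's contribution
  db.result.updateOne(
    { _id: { eventId: doc.eventId } },
    { $inc: inc, $setOnInsert: { firstSeen: new Date() } }, // illustrative $setOnInsert use
    { upsert: true }
  );
});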

Using Spring Data Mongodb, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can say Query().limit(int).sort(), but this does not perform the desired operation; it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either, since .sort() returns a Sort object.
So, using Spring Data, how do I perform the following as shown in the MongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
You can do this in spring-data-mongodb. Mongo will optimize sort/limit combinations IF the sort field is indexed (or is the @Id field). This produces very fast O(log N) or better results. Otherwise it is still O(N), as opposed to O(N log N), because it uses a top-k algorithm and avoids the global sort (see the MongoDB sort documentation). This is based on Mkyong's example, but I do the sort first and set the limit to one.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
Normally, things that are done with aggregate SQL queries can be approached in (at least) three ways in NoSQL stores:
with Map/Reduce. This is effectively going through all the records, but more optimized (works with multiple threads, and in clusters). Here's the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert, and store it separately. So, whenever you insert a record, you compare it to the previous max value, and if it's greater - update the max value in the db.
fetch everything in memory and do the calculation in the code. That's the most trivial solution. It would probably work well for small data sets.
Choosing one over the other depends on your usage of this max value. If the lookup is performed rarely, for example for some corner-case reporting, you can go with map/reduce. If it is used often, then store the current max.
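If you go with the second approach, the $max update operator (available since MongoDB 2.6) makes the compare-and-keep-the-larger-value step atomic; a sketch with a made-up stats collection:
// Keep a single document that tracks the largest "order" value seen so far
db.stats.updateOne(
  { _id: "maxOrder" },
  { $max: { value: 3 } },   // the value of the record being inserted
  { upsert: true }
)
db.stats.findOne({ _id: "maxOrder" })   // -> { "_id" : "maxOrder", "value" : 3 }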
As far as I am aware Mongo totally supports sort then limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Getting the max/min via map-reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor cur = coll.find(new BasicDBObject()).sort(new BasicDBObject("order", -1))
    .limit(1);
if (cur.hasNext())
    System.out.println(cur.next());
Use the aggregation $max accumulator.
As $max is an accumulator operator available only in the $group stage, you need to do a trick.
In the group operator, use any constant as the _id.
Let's take the example given on the MongoDB site.
Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
If you want to find out the max price among all the items:
db.sales.aggregate(
[
{
$group:
{
_id: "1", //** This is the trick
maxPrice: { $max: "$price" }
}
}
]
)
Please note the value of "_id": it is the constant string "1". You can put any constant there.
Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);