Group Aggregation performance in MongoDB

I have a large amount of data captured by my apis, like this:
{
"_id" : ObjectId("57446a89e5b49e297031fab8"),
"applicationVersion" : "X.X.XXX.X",
"createdDate" : ISODate("2016-05-16T23:00:00.007Z"),
"identifier" : "v2/events/messages",
"durationInMilliseconds" : NumberLong(14)
}
I want to group the whole collection by the identifier, so I use the aggregation framework:
$group : {
_id : {
identifier : "$identifier"
},
count : {
$sum : 1
}
}
I have an index on identifier.
This is a simple count. I may want to work out average API response times and things like that, but the speed is putting me off.
On 7 million documents the aggregation takes around 10 seconds. If I do the equivalent group by in SQL on MSSQL it takes less than a second.
Is there a way I can optimize this type of aggregation, or do I need to think about this differently, e.g.
changing how I collect the data, or
using a different tool?

MongoDB doesn't use indexes in the aggregation framework, except for $match and $sort when they are used as the first stage of the pipeline. This is a limitation, and we can hope for improvement in the future.
See Pipeline Operators and Indexes in MongoDB
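Since only a leading $match (or $sort) can use an index, one way to cut the work is to narrow the input before grouping. The following is a minimal sketch, not a definitive fix: it assumes an index on createdDate, a hypothetical collection name `apiCalls`, and a made-up time window. The $group stage still reads every document that passes $match.

```javascript
// Hypothetical time window; adjust to the range you actually need.
const since = new Date("2016-05-16T00:00:00Z");

const pipeline = [
  // A leading $match can use an index (assumed here: { createdDate: 1 }).
  { $match: { createdDate: { $gte: since } } },
  // $group does not use an index; it processes whatever $match lets through.
  {
    $group: {
      _id: "$identifier",
      count: { $sum: 1 },
      avgDurationMs: { $avg: "$durationInMilliseconds" }
    }
  }
];

// In the mongo shell `db` is defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  db.apiCalls.aggregate(pipeline).forEach(doc => printjson(doc));
}
```

If the whole collection really has to be grouped, this does not help, and maintaining counters at write time may be the better direction.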

Related

Does MongoDB find() query return documents sorted by creation time?

I need documents sorted by creation time (from oldest to newest).
Since ObjectID saves timestamp by default, we can use it to get documents sorted by creation time with CollectionName.find().sort({_id: 1}).
Also, I noticed that regular CollectionName.find() query always returns the documents in same order as CollectionName.find().sort({_id: 1}).
My question is:
Is CollectionName.find() guaranteed to return documents in same order as CollectionName.find().sort({_id: 1}) so I could leave sorting out?
No. Well, not exactly.
A db.collection.find() will give you the documents in the order they appear in the data files most of the time, though this isn't guaranteed.
Result Ordering
Unless you specify the sort() method or use the $near operator, MongoDB does not guarantee the order of query results.
As long as your data files are relatively new and few updates happen, the documents might (and most of the time will) be returned in what appears to be _id order, since ObjectId is monotonically increasing.
Later in the lifecycle, old documents may have been moved from their old position (because they increased in size and documents are never partitioned) and new ones are written in the place formerly occupied by another document. In this case, a newer document may be returned in a position between two old documents.
There is nothing wrong with sorting documents by _id, since the index will be used for that, adding only some latency for document retrieval.
However, I would strongly recommend against using the ObjectId for date operations for several reasons:
ObjectIds cannot be used for date comparison queries. So you couldn't query for all documents created between date x and date y. To achieve that, you'd have to load all documents, extract the date from the ObjectId and compare it, which is extremely inefficient.
If the creation date matters, it should be explicitly addressable in the documents
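To make the cost of that client-side approach concrete, here is a minimal sketch of decoding the timestamp from an ObjectId's hex form. It relies on the first 4 bytes of an ObjectId being a Unix timestamp in seconds. The helper names `objectIdToDate` and `createdBetween` are hypothetical, not part of any MongoDB API.

```javascript
// An ObjectId is 12 bytes; the first 4 bytes are a big-endian Unix timestamp in seconds.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Hypothetical client-side filter: every id must already be loaded into memory.
function createdBetween(hexIds, from, to) {
  return hexIds.filter(id => {
    const created = objectIdToDate(id);
    return created >= from && created < to;
  });
}
```

In the mongo shell, `ObjectId("...").getTimestamp()` returns the same embedded date. Either way the resolution is one second, and the value reflects when the id was generated, which is another reason to store an explicit creation date.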
I see ObjectIds as a choice of last resort for the _id field and tend to use other values (compound on occasions) as _ids, since the field is indexed by default and it is very likely that one can save precious RAM by using a more meaningful value as id.
You could use the following, for example, which utilizes DBRefs:
{
_id: {
creationDate: new ISODate(),
user: {
"$ref" : "creators",
"$id" : "mwmahlberg",
"$db" : "users"
}
}
}
And do a quite cheap sort by using
db.collection.find().sort({"_id.creationDate": 1})
Is CollectionName.find() guaranteed to return documents in same order as CollectionName.find().sort({_id: 1})
No, it's not! If you didn't specify any order, then a so-called "natural" ordering is used. Meaning that documents will be returned in the order in which they physically appear in data files.
Now, if you only insert documents and never modify them, this natural order will coincide with ascending _id order. Imagine, however, that you update a document in such a way that it grows in size and has to be moved to a free slot inside of a data file (usually this means somewhere at the end of the file). If you were to query documents now, they wouldn't follow any sensible (to an external observer) order.
So, if you care about order, make it explicit.
Source: http://docs.mongodb.org/manual/reference/glossary/#term-natural-order
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
Testing script (for the confused)
> db.foo.insert({name: 'Joe'})
WriteResult({ "nInserted" : 1 })
> db.foo.insert({name: 'Bob'})
WriteResult({ "nInserted" : 1 })
> db.foo.find()
{ "_id" : ObjectId("55814b944e019172b7d358a0"), "name" : "Joe" }
{ "_id" : ObjectId("55814ba44e019172b7d358a1"), "name" : "Bob" }
> db.foo.update({_id: ObjectId("55814b944e019172b7d358a0")}, {$set: {answer: "On a sharded collection the $natural operator returns a collection scan sorted in natural order, the order the database inserts and stores documents on disk. Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index. You cannot specify $natural sort order if the query includes a $text expression."}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.foo.find()
{ "_id" : ObjectId("55814ba44e019172b7d358a1"), "name" : "Bob" }
{ "_id" : ObjectId("55814b944e019172b7d358a0"), "name" : "Joe", "answer" : "On a sharded collection the $natural operator returns a collection scan sorted in natural order, the order the database inserts and stores documents on disk. Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index. You cannot specify $natural sort order if the query includes a $text expression." }

How to improve mongodb group query performance

I am currently using Solr to store public tweet information. I have fields such as content, sentiment, keywords, tstamp, language and tweet_id to capture the essence of the tweet. I am also evaluating MongoDB for the same use case. I am trying to benchmark MongoDB and Solr, each holding one million records.
What I have observed is that group queries in MongoDB are 2.5 to 3 times slower than the facet queries of Solr.
The following mongodb query
db.tweets.aggregate(
[
{
$group : {
_id : "$sentiment",
total : { $sum : 1 }
}
}
]
)
takes 481ms. I have an index on the sentiment field.
However the same thing in solr using facet query takes 93ms.
Is there any other configuration in mongodb which needs to be set so as to improve the group query performance in mongodb?
A $group operation and a facet search are not really comparable operations and the $group won't use an index. It looks like you are trying to compute the number of documents with each distinct value of sentiment. MongoDB doesn't have a specific function for this. For a specific value, a much better operation to get the count would be
db.collection.count({ "sentiment" : sentiment })
and you can get all of the distinct values with
db.collection.distinct("sentiment")
Both of these can use an index { "sentiment" : 1 }. You will need multiple queries to get counts for multiple values of sentiment, so it's not as convenient as Solr. Faceted searching is a core competency of full text search engines, so it's not surprising this is easier in Solr than in MongoDB. MongoDB and Solr are meant for totally different uses, so I can't say I see why you'd benchmark one against the other. It's like racing a boat against a car.
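Combining the two calls into a per-value count could look like the sketch below. `countsByValue` is a hypothetical helper; it only assumes the collection object exposes `distinct(field)` and `count(query)` as in the shell commands above (newer shells also offer `countDocuments`). It issues one count query per distinct value, so it is only reasonable for low-cardinality fields.

```javascript
// Hypothetical helper: one distinct() call, then one count() per distinct value.
function countsByValue(coll, field) {
  const counts = {};
  for (const value of coll.distinct(field)) {
    counts[value] = coll.count({ [field]: value });
  }
  return counts;
}

// In the mongo shell `db` is defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  printjson(countsByValue(db.tweets, "sentiment"));
}
```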

Bug for collections that are sharded over a hashed key

When querying for large amounts of data in sharded collections we benefited a lot from querying the shards in parallel.
The following problem only occurs in collections that are sharded over a hashed key.
In Mongo 2.4 it was possible to query with hash borders in order to get all data of one chunk.
We used the query from this post.
It is a range query with hash values as borders:
db.collection.find(
{ "_id" : { "$gte" : -9219144072535768301,
"$lt" : -9214747938866076750}
}).hint({ "_id" : "hashed"})
The same query also works in 2.6 but takes a long time.
The explain() output shows that it is using the index, but the number of scanned objects is way too high.
"cursor" : "BtreeCursor _id_hashed",
Furthermore the borders are wrong.
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
Was there some big change from 2.4 to 2.6 which breaks this query?
Even if the borders are interpreted as non-hash values, why does it take so long?
Is there some other way to get all documents of one chunk or hash index range?
Also the mongo internal hadoop connector has this problem with sharded collections.
Thanks!
The query above working in 2.4 was not supported behavior. See SERVER-14557 with a similar complaint and an explanation of how to properly perform this query. Reformatted for proper behavior, your query becomes:
db.collection.find().min({ _id : -9219144072535768301}).max({ _id : -9214747938866076750}).hint({_id : "hashed"})
As reported in the SERVER ticket, there is an additional bug (SERVER-14400) that prevents this query from being targeted at a single shard. At this point in time there are no plans to address it in 2.6. The rewritten query should, however, prevent the table scan you are seeing under 2.6 and allow for more efficient retrieval.

$natural order avoids indexes. How does orderby affect the use of indexes?

Profiling slow queries I found something really strange: For the following operation the entire collection was scanned (33061 documents) even though there is an index on the query parameter family_id:
{
"ts" : ISODate("2013-11-27T10:20:26.103Z"),
"op" : "query",
"ns" : "mydb.zones",
"query" : {
"$query" : {
"family_id" : ObjectId("52812295ea84d249934f3d12")
},
"$orderby" : {
"$natural" : 1
}
},
"ntoreturn" : 20,
"ntoskip" : 0,
"nscanned" : 33061,
"keyUpdates" : 0,
"numYield" : 91,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(83271),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(388988),
"w" : NumberLong(22362)
}
},
"nreturned" : 7,
"responseLength" : 2863,
"millis" : 393,
"client" : "127.0.0.1",
"user" : "mydb"
}
After some Google searches without results, I found out that when I leave out "$orderby": { "$natural" : 1 } the query is very fast and only 7 documents are scanned instead of 33061. So I assume that using $orderby in my case prevents the index on family_id from being used. The strange thing is that the resulting order is the same in either case. As far as I understand $natural order, it is tautological to use "$orderby": { "$natural" : 1 } or no explicit order. Another very interesting observation is that this issue does not arise on capped collections!
This issue raises the following questions:
If not using any ordering/sorting, shouldn't the resulting order be the order on disk, i.e. $natural order?
Can I create a (compound-)index that would be used sorting naturally?
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
Are the answers of the above questions independent of whether you use sharding/replication or not? What is the natural order of a query over multiple shards?
Note: I am using MongoDB 2.2. There is a ticket related to this issue: https://jira.mongodb.org/browse/SERVER-5672. It seems from that ticket that the issue occurs in capped collections too, which I cannot confirm (maybe due to different mongo versions).
As far as I understand $natural order, it is tautological to use
"$orderby": { "$natural" : 1} or no explicit order.
This is a misdescription of $natural order. MongoDB stores records in some order on disk and keeps track of them via a doubly linked list. $natural order is the order that you get when you traverse the linked list. However, if you do not specify $natural, that is what you will always get: not random order, not insertion order, not physical disk order, but "logical" disk order, the order the records appear in when traversing the linked list.
If not using any ordering/sorting, shouldn't the resulting order be
the order on disk, i.e. $natural order?
Yes, assuming that you understand that "disk order" is not strictly physical order; it's the order the records are in within the linked list.
Can I create a (compound-)index that would be used sorting naturally?
I don't know what you mean by sorting naturally - if you are using an index during a query, the documents are traversed in index order, not in $natural order.
How can I invert the ordering of a simple query that uses an index and no sorting without severe performance losses?
You cannot - if you are using an index then you will get the records in index order - your options are to get them in that order, in reverse of that order or to create a compound index where you index by fields you are searching and field(s) you want to sort on.
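As a rough sketch of the compound index option: the index lists the filter field first and the sort field second, so one index serves both. The sort field `created_at` is an assumption (the question does not say which field should define the order), and on MongoDB 2.2 the shell call would be `ensureIndex` rather than `createIndex`.

```javascript
// Filter field first, then the (hypothetical) sort field.
const indexSpec = { family_id: 1, created_at: -1 };

// Builds the pieces of a query that the index above can serve in index order.
function newestZonesQuery(familyId) {
  return {
    filter: { family_id: familyId },
    sort: { created_at: -1 }, // use { created_at: 1 } to walk the index in reverse
    limit: 20
  };
}

// In the mongo shell `db` and `ObjectId` are defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  db.zones.createIndex(indexSpec);
  const q = newestZonesQuery(ObjectId("52812295ea84d249934f3d12"));
  db.zones.find(q.filter).sort(q.sort).limit(q.limit).forEach(doc => printjson(doc));
}
```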
What happens behind the scenes when using query parameters and orderby? Why is this not happening on capped collections? I would like to understand this strange behaviour.
What happens depends on what indexes are available, but the query optimizer tries to use an index that helps with both filtering and sorting - if that's not possible it will pick the index that has the best actual performance.
Are the answers of the above questions independent of whether you use
sharding/replication or not? What is the natural order of a query over
multiple shards?
It's some non-deterministic merge of $natural orders from each individual shard.

Efficient mongodb query to find the average time in a collection of 10K+ records?

Following is the one record of a collections named outputs.
db.outputs.findOne()
{
"_id" : ObjectId("4e4131e8c7908d3eb5000002"),
"company" : "West Edmonton Mall",
"country" : "Canada",
"created_at" : ISODate("2011-08-09T13:11:04Z"),
"started_at" : ISODate("2011-08-09T11:11:04Z"),
"end_at" : ISODate("2011-08-09T13:09:04Z")
}
The above is just a document. There are around 10K docs and it will keep increasing.
What I need is to find the average hours (taking started_at and end_at) for the past 1 week (taking created_at)?
Right now, you're going to need to query the documents you need to average, likely selecting only the fields you need (started_at and end_at), and do the calculation in your app code.
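A minimal sketch of that application-side calculation, assuming the documents have already been fetched with started_at and end_at as Date values (`averageHours` is a hypothetical helper name):

```javascript
const MS_PER_HOUR = 60 * 60 * 1000;

// Averages (end_at - started_at) over the given documents, in hours.
// Returns null for an empty input so the caller can tell "no data" from 0.
function averageHours(docs) {
  if (docs.length === 0) return null;
  const totalMs = docs.reduce((sum, d) => sum + (d.end_at - d.started_at), 0);
  return totalMs / docs.length / MS_PER_HOUR;
}

// In the mongo shell `db` is defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  const weekAgo = new Date(Date.now() - 7 * 24 * MS_PER_HOUR);
  const docs = db.outputs
    .find({ created_at: { $gte: weekAgo } }, { started_at: 1, end_at: 1 })
    .toArray();
  print(averageHours(docs));
}
```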
If you wait for the next major version of MongoDB, there will be a new aggregation framework that will allow you to build an aggregation pipeline for querying documents, selecting fields, performing calculations on them, and finally returning the calculated value(s). It's very cool.
https://www.mongodb.org/display/DOCS/Aggregation+Framework
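On a server version where the aggregation framework is available and $subtract accepts two dates (it then yields the difference in milliseconds), the calculation might be expressed roughly as follows. This is a sketch under those assumptions; `averageHoursPipeline` is a hypothetical helper, and `now` is passed in only to keep the window explicit.

```javascript
const MS_PER_HOUR = 60 * 60 * 1000;

// Builds a pipeline averaging (end_at - started_at) for documents created in the last 7 days.
function averageHoursPipeline(now) {
  const weekAgo = new Date(now.getTime() - 7 * 24 * MS_PER_HOUR);
  return [
    // A leading $match can use an index on created_at, if one exists.
    { $match: { created_at: { $gte: weekAgo } } },
    // Subtracting two dates gives the duration in milliseconds.
    { $project: { durationMs: { $subtract: ["$end_at", "$started_at"] } } },
    { $group: { _id: null, avgMs: { $avg: "$durationMs" } } },
    { $project: { _id: 0, avgHours: { $divide: ["$avgMs", MS_PER_HOUR] } } }
  ];
}

// In the mongo shell `db` is defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  db.outputs.aggregate(averageHoursPipeline(new Date())).forEach(doc => printjson(doc));
}
```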
You can maintain the sum and count in a separate collection using the $inc operator, with an _id value that represents a week. That way, you don't have to query all 10k records. You can just query the collection maintaining the sum and count, and divide the sum by the count to get the average.
I have explained this in detail in the following post:
http://samarthbhargava.wordpress.com/2012/02/01/real-time-analytics-with-mongodb/
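A minimal sketch of the counter approach described above. The stats collection name `output_stats`, the field names `totalMs` and `count`, and the week key (whole weeks since the Unix epoch, so buckets start on Thursdays UTC) are all assumptions chosen for illustration; any stable per-week key would do.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;
const MS_PER_HOUR = 60 * 60 * 1000;

// Hypothetical week key: number of whole weeks since the Unix epoch.
function weekBucket(date) {
  return Math.floor(date.getTime() / WEEK_MS);
}

// Builds the upsert that adds one output document to its week's counters.
function counterUpdate(doc) {
  return {
    filter: { _id: weekBucket(doc.created_at) },
    update: { $inc: { totalMs: doc.end_at - doc.started_at, count: 1 } },
    options: { upsert: true }
  };
}

// Turns a stored counter document into an average duration in hours.
function averageHoursFromCounter(counter) {
  return counter.count === 0 ? null : counter.totalMs / counter.count / MS_PER_HOUR;
}

// In the mongo shell `db` is defined; anywhere else this part is skipped.
if (typeof db !== "undefined") {
  const u = counterUpdate(db.outputs.findOne());
  db.output_stats.update(u.filter, u.update, u.options);
}
```

Note that these buckets are fixed calendar-style weeks, not a rolling "past 7 days" window; a rolling window would need finer buckets (for example per day) summed at read time.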