How to improve MongoDB group query performance

I am currently using Solr to store public tweet information. I have fields such as content, sentiment, keywords, tstamp, language and tweet_id to capture the essence of each tweet. I am also evaluating MongoDB for the same use case, and I am benchmarking MongoDB and Solr with one million records each.
What I have observed is that group queries in MongoDB are 2.5 to 3 times slower than facet queries in Solr.
The following MongoDB query
db.tweets.aggregate(
[
{
$group : {
_id : "$sentiment",
total : { $sum : 1 }
}
}
]
)
takes 481 ms. I have an index on the sentiment field.
However, the same thing in Solr using a facet query takes 93 ms.
Is there any other configuration that needs to be set in MongoDB to improve group query performance?

A $group operation and a facet search are not really comparable operations and the $group won't use an index. It looks like you are trying to compute the number of documents with each distinct value of sentiment. MongoDB doesn't have a specific function for this. For a specific value, a much better operation to get the count would be
db.collection.count({ "sentiment" : sentiment })
and you can get all of the distinct values with
db.collection.distinct("sentiment")
Both of these can use an index { "sentiment" : 1 }. You will need multiple queries to get counts for multiple values of sentiment, so it's not as convenient as Solr. Faceted searching is a core competency of full-text search engines, so it's not surprising this is easier in Solr than in MongoDB. MongoDB and Solr are meant for totally different uses, so I don't see why you'd benchmark one against the other. It's like racing a boat against a car.
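The two calls above can be combined into a small helper that issues one count per distinct value. This is only a sketch: countsByValue, makeFakeCollection and fakeTweets are hypothetical names, and the in-memory stand-in exists purely so the snippet runs without a server. In the legacy mongo shell you would presumably pass db.tweets instead, since it exposes distinct() and count() with the same shapes.

```javascript
// Hypothetical helper: one count() per distinct value of `field`.
// `coll` is anything exposing distinct(field) and count(filter),
// e.g. db.tweets in the legacy mongo shell (assumption).
function countsByValue(coll, field) {
  const counts = {};
  for (const value of coll.distinct(field)) {
    counts[value] = coll.count({ [field]: value });
  }
  return counts;
}

// In-memory stand-in so the sketch runs without a MongoDB server.
function makeFakeCollection(docs) {
  return {
    distinct: (field) => [...new Set(docs.map((d) => d[field]))],
    count: (filter) =>
      docs.filter((d) =>
        Object.keys(filter).every((k) => d[k] === filter[k])
      ).length,
  };
}

const fakeTweets = makeFakeCollection([
  { tweet_id: 1, sentiment: "positive" },
  { tweet_id: 2, sentiment: "negative" },
  { tweet_id: 3, sentiment: "positive" },
]);

console.log(countsByValue(fakeTweets, "sentiment"));
```

Note that this costs one round trip per distinct value, which is fine for a handful of sentiment labels but not for a field with thousands of values.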

Related

What is the proper index format that I should use in MongoDB for the scenario explained below?

I have the following query to be executed on my MongoDB collection order_error, which has over 60 million documents. My main concern is that the query contains an $in operator. I tried several possible indexes, but none of them gave a significant performance improvement. The query is as follows:
db.getCollection("order_error").find({
"$and":[
{
"type":"order"
},
{
"Origin.SN":{
"$in":[
"4095",
"4100",
"4509",
"4599",
"4510"
]
}
}
]
}).sort({"timestamp.milliseconds" : 1}).skip(1).limit(100).explain("executionStats")
One thing to note is that I allow sorting on timestamp.milliseconds in both directions (ASC and DESC). I have shortened the list of entries within the $in here; usually there are more. So what kind of index would improve performance? I have already tried creating the following indexes:
type_1_Origin.SN_1_timestamp.milliseconds_-1
type_1_timestamp.milliseconds_-1_Origin.SN
Is there a better way to create the index?

MongoDB Aggregate of 120M documents

I have a system that records entries by action. There are more than 120M of them, and I want to group them by id_entry with aggregate. The structure is as follows:
entry
{
id_entry: ObjectId(...),
created_at: Date(...),
action: {object},
}
When I try to aggregate by id_entry and group the actions, it takes more than 3 hours to finish:
db.entry.aggregate([
{ '$match': {'created_at': { $gte:ISODate("2016-02-02"), $lt:ISODate("2016-02-03")}}},
{ '$group': {
'_id' :{'id_entry': '$id_entry'},
actions: {
$push: '$action'
}
}}])
But that date range contains only around 4M documents (id_entry and created_at both have indexes).
What am I doing wrong in the aggregate? How can I group 3-4M documents by id_entry in less than 3 hours?
Thanks
To speed up your particular query, you need an index on the created_at field.
However, the overall performance of the aggregation will also depend on your hardware specification (among other things).
If you find the query's performance to be less than what you require, you can either:
Create a pre-aggregated report (essentially a document that contains the aggregated data you require, updated every time new data is inserted), or
Utilize sharding to spread your data to more servers.
If you need to run this aggregation query all the time, a pre-aggregated report allows you to have an extremely up-to-date aggregated report of your data that is accessible using a simple find() query.
The tradeoff is that for every insertion, you will also need to update the pre-aggregated document to reflect the current state of your data. However, this is a relatively small tradeoff compared to having to run a long or complex aggregation query that could interfere with your day-to-day operations.
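To make the pre-aggregated report idea concrete, here is a minimal in-memory sketch. Everything in it (rawEntries, dailySummaries, insertEntry, reportForDay, and the choice of one summary per day and id_entry) is a hypothetical illustration rather than your actual schema. Against a real server, the update step would most likely be an upsert that uses $push on a summary document, but that mapping is an assumption.

```javascript
// Hypothetical in-memory model of a pre-aggregated report:
// one summary per (day, id_entry), updated on every insert.
const rawEntries = [];            // stands in for the `entry` collection
const dailySummaries = new Map(); // stands in for a summary collection

function dayKey(date) {
  return date.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
}

// Insert the raw entry and update its summary in the same step.
function insertEntry(entry) {
  rawEntries.push(entry);
  const day = dayKey(entry.created_at);
  const key = day + "|" + entry.id_entry;
  if (!dailySummaries.has(key)) {
    dailySummaries.set(key, { day: day, id_entry: entry.id_entry, actions: [] });
  }
  dailySummaries.get(key).actions.push(entry.action);
}

// Reading the report is now a lookup over summaries, not a large $group.
function reportForDay(day) {
  return [...dailySummaries.values()].filter((s) => s.day === day);
}

insertEntry({ id_entry: "a1", created_at: new Date("2016-02-02T10:00:00Z"), action: { type: "open" } });
insertEntry({ id_entry: "a1", created_at: new Date("2016-02-02T11:00:00Z"), action: { type: "close" } });
insertEntry({ id_entry: "b2", created_at: new Date("2016-02-03T09:00:00Z"), action: { type: "open" } });

console.log(reportForDay("2016-02-02"));
```

The write path does slightly more work per insert, which is the tradeoff described above; the read path no longer has to scan and group millions of raw documents.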
One caveat with the aggregation framework: once the aggregation pipeline encounters a $group or a $project stage, no index can be used. This is because MongoDB indexes are tied to how the documents are physically stored. Grouping and projecting transform the documents into a state that no longer has a physical representation on disk.

MongoDB query: Using Limit together with $near skips few documents

I am currently developing an app that fetches a specific number of documents from a collection if their location coordinates fall within a certain distance. I am using an active record library for CodeIgniter, and the generated query is as follows:
db.updates.find({locs: { $near: [72.844102008984, 19.130207090604 ], $maxDistance: 5000 }, posted_on : { $lt :1398425538.1942 },}).sort( { posted_on: -1 } ).limit(10).toArray()
The problem I am facing is that the above query skips a few documents that should actually be returned. But if I remove the limit(10) from the query, the proper documents are returned.
I am not sure, but does using limit() in MongoDB omit a few results? Or does it limit the results to only the closest (nearest) documents?
P.S. The documents skipped when using the limit are not always the same; the results appear random.
I suspect you are running into problems with the special nature of the $near query. $near performs both a limit() and a sort() on the cursor returning the results -
Specifies a point for which a geospatial query returns the closest documents first. The query sorts the documents from nearest to farthest.
By default, queries that use a 2d index return a limit of 100 documents; however you may use limit() to change the number of results.
http://docs.mongodb.org/manual/reference/operator/query/near/
While the documentation does specifically discuss overriding the limit of 100 with your own limit call
You can further limit the number of results using cursor.limit().
It is silent on adding your own sort() or both sorting and overriding the limit at the same time. I suspect you are running into side effects of doing both. Note that it's not incorrect to do both - it just may not produce the results you are looking for. I'd suggest trying the same query using $geoWithin
http://docs.mongodb.org/manual/reference/operator/query/geoWithin/
$geoWithin does not apply a sort or a limit on the results, so it gives you something of a more raw result set.
Do you have any identical posted_on dates in the system? I recommend sorting by a second key, perhaps _id. If the sort order is non-deterministic, the system may skip documents in a non-deterministic manner. Adding the _id field to your sort order is generally not that expensive if you have an index on the other fields, as the documents will already be very close to the correct order and every collection is indexed on _id by default. ("By default, all collections have an index on the _id field, and applications and users may add additional indexes to support important queries and operations." http://docs.mongodb.org/manual/core/index-single/ )
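The effect of a tie-breaking second key can be sketched in plain JavaScript. The comparator below mirrors a sort specification of { posted_on: -1, _id: 1 }; the page helper and the sample documents are hypothetical and only model the behaviour in memory, they do not reproduce how the server executes $near.

```javascript
// Compare by posted_on descending, then by _id ascending as a tie-breaker,
// mirroring a sort specification of { posted_on: -1, _id: 1 }.
function byPostedOnThenId(a, b) {
  if (a.posted_on !== b.posted_on) return b.posted_on - a.posted_on;
  return a._id < b._id ? -1 : a._id > b._id ? 1 : 0;
}

// Hypothetical paging helper over an in-memory array.
function page(docs, skip, limit) {
  return [...docs].sort(byPostedOnThenId).slice(skip, skip + limit);
}

// Three updates share the same posted_on value.
const updates = [
  { _id: "c", posted_on: 100 },
  { _id: "a", posted_on: 100 },
  { _id: "d", posted_on: 90 },
  { _id: "b", posted_on: 100 },
];

// The input order no longer matters: both calls return the same page.
const first = page(updates, 0, 2).map((d) => d._id);
const again = page([...updates].reverse(), 0, 2).map((d) => d._id);
console.log(first, again);
```

Without the _id comparison, the three documents with posted_on 100 could come back in any relative order, so a limit of 2 could return a different pair on each run.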

Difference between aggregate ($match) and find in MongoDB?

What is the difference between the $match operator used inside the aggregate function and the regular find in MongoDB?
Why doesn't the find function allow renaming the field names like the aggregate function?
e.g. In aggregate we can pass the following string:
{ "$project" : { "OrderNumber" : "$PurchaseOrder.OrderNumber" , "ShipDate" : "$PurchaseOrder.ShipDate"}}
Whereas, find does not allow this.
Why isn't the aggregate output returned as a DBCursor or a List? And why can't we get a count of the documents that are returned?
Thank you.
Why isn't the aggregate output returned as a DBCursor or a List?
The aggregation framework was created to solve easy problems that otherwise would require map-reduce.
This framework is commonly used for computations that take the full database as input and produce a few documents as output.
What is the difference between the $match operator used inside the aggregate function and the regular find in Mongodb?
One of the differences, as you stated, is the return type. Find operations return a DBCursor.
Other differences:
The aggregation result must be under 16 MB. If you are using shards, the full data must be collected at a single point after the first $group or $sort.
The main purpose of $match is to make aggregation more powerful, but it has other uses, such as improving aggregation performance.
And why can't we get a count of the documents that are returned?
You can. Just count the number of elements in the resulting array or add the following stage to the end of the pipeline:
{$group: {_id: null, count: {$sum: 1}}}
Why doesn't the find function allow renaming the field names like the aggregate function?
MongoDB is young and features are still coming. Maybe in a future version we'll be able to do that. Renaming fields is more critical in aggregation than in find.
EDIT (2014/02/26):
MongoDB 2.6 aggregation operations will return a cursor.
EDIT (2014/04/09):
MongoDB 2.6 was released with the predicted aggregation changes.
I investigated a few things about the aggregation and find calls.
I did this with a descending sort on a collection of 160k documents and limited the output to a few documents.
1. The aggregation command is slower than the find command.
2. If you access the data, for example with ToList(), the aggregation command is faster than find.
3. If you look at the total times (points 1 + 2), the commands seem to be equal.
Maybe the aggregation automatically calls ToList() and does not have to call it again. If you don't call ToList() afterwards, the find() call will be much faster.
7 [ms] vs 50 [ms] (5 documents)

Mongo group query does not use indexes or slows down queries

I am using MongoDB 1.8.1. I have a collection that contains more than 1.8 million records. All records in this collection are simple objects, with no nested objects or arrays,
like the following:
{ name : "xyz" , "id" : 123 ,"a" : "na" , "c" : "in" , "cmp" : "pq" , "ttl" : "sd"}
All records are like this.
Five queries fire on this collection at a time. Two are simple queries: one contains $exists, and the other uses an index properly.
Another two are group queries whose condition fields are indexed; one of them contains $exists.
The last one is a distinct query with a proper condition that is also indexed.
The queries fire in this order: first the group queries, then one simple query, then the distinct query, and finally the other simple query.
As a result, the data loads slowly.
If 2-3 such calls are made, it loads very slowly and sometimes fails with a read timeout error.
The collection has more than one index.
$exists queries do not use indexes (fixed from 1.9.1 onwards)
group commands use the JS context of mongodb which is exclusively locked while it's being used. This will affect performance of concurrent group queries. A new aggregation framework is under development that should help with this (2.1 onwards). Monitor https://jira.mongodb.org/browse/SERVER-447 for progress. In my experience it's usually more performant to do "group" like aggregation app-side.
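As an illustration of doing the "group" app-side, here is a minimal sketch that tallies documents in application code. groupCount and the sample documents are hypothetical; in practice `docs` would be whatever your driver returns from an ordinary find() that fetches only the grouped field.

```javascript
// Hypothetical app-side replacement for a "group by field, count" query.
// `docs` is any iterable of documents, e.g. the results of a find()
// that returns only the grouped field.
function groupCount(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const key = doc[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Sample documents shaped like the ones in the question.
const sample = [
  { name: "xyz", id: 123, c: "in", cmp: "pq" },
  { name: "abc", id: 124, c: "in", cmp: "rs" },
  { name: "def", id: 125, c: "us", cmp: "pq" },
];

const countsByC = groupCount(sample, "c");
console.log(Object.fromEntries(countsByC));
```

The find() that feeds this can use an ordinary index and does not hold the server's JS context, at the cost of transferring the matching documents to the application.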