MongoDB Aggregation query running very slow

We version most of our collections in MongoDB. The versioning mechanism we selected is as follows:
{ "docId" : 174, "v" : 1, "attr1": 165 } /*version 1 */
{ "docId" : 174, "v" : 2, "attr1": 165, "attr2": "A-1" }
{ "docId" : 174, "v" : 3, "attr1": 184, "attr2" : "A-1" }
So, when we query, we always need to use the aggregation framework in this way to make sure we get the latest version of each object:
db.docs.aggregate( [
    { "$sort": { "docId": -1, "v": -1 } },
    { "$group": { "_id": "$docId", "doc": { "$first": "$$ROOT" } } },
    { "$match": { <query> } }
] );
The problem with this approach is that once you have done the grouping, you are working with a set of data in memory that no longer corresponds to your collection, and thus your indexes cannot be used.
As a result, the more documents your collection has, the slower the query gets.
Is there any way to speed this up?
If not, I will consider moving to one of the approaches described in this good post: http://www.askasya.com/post/trackversions/

Just to complete this question: we went with option 3, one collection to keep the latest versions and one collection to keep the historical ones. It is introduced here: http://www.askasya.com/post/trackversions/ and some further description (with some nice code snippets) can be found at http://www.askasya.com/post/revisitversions/.
It has been running in production for 6 months now. So far so good. The former approach meant we were always using the aggregation framework, which stops using indexes as soon as you reshape the original documents (using $group, $project, ...) because the pipeline output no longer matches the original collection. This made our performance terrible as the data grew.
With the new approach the problem is gone. 90% of our queries go against the latest data, which means we target a collection with a simple ObjectId as the identifier and no longer need the aggregation framework, just regular finds.
Our queries against historical data always include id and version, so by indexing these (we include both in _id, so we get it out of the box), reads against those collections are equally fast. This is a point not to overlook, though: read patterns in your application are crucial when designing how your collections/schemas should look in MongoDB, so you must make sure you know them when making such decisions.
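For anyone curious, here is a minimal sketch in the mongo shell of what the write path can look like with the two-collection approach. The collection names docs_latest and docs_history and the saveNewVersion helper are made up for illustration; the linked posts discuss how to make the two writes robust against failures in between.

    function saveNewVersion(docId, attrs) {
        var current = db.docs_latest.findOne({ _id: docId });
        var nextV = current ? current.v + 1 : 1;
        if (current) {
            // Archive the outgoing version under a compound _id, so historical
            // reads by (docId, version) are covered by the _id index out of the box.
            db.docs_history.insert({ _id: { docId: docId, v: current.v }, doc: current });
        }
        var latest = { _id: docId, v: nextV };
        for (var k in attrs) { latest[k] = attrs[k]; }
        // Exactly one document per docId in the "latest" collection (upsert by _id).
        db.docs_latest.save(latest);
    }

    saveNewVersion(174, { attr1: 184, attr2: "A-1" });

Queries for current data are then plain indexed finds on docs_latest, e.g. db.docs_latest.find({ attr1: 184 }).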

Mongodb - expire subset of data in collection

I am wondering what is the best way to expire only a subset of a collection.
In one collection I store conversion data and click data.
I would like to store the click data for, let's say, a week, and the conversion data for a year.
In my collection "customers" I store something like:
{ "_id" : ObjectId("53f5c0cfeXXXXXd"), "appid" : 2, "action" : "conversion", "uid" : "2_b2f5XXXXXX3ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
and, for the click data:
{ "_id" : ObjectId("53f5c0cfe4b0d9cd24847b7d"), "appid" : 2, "action" : "view", "uid" : "2_b2f58679e6f73ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
So should I execute an ensureIndex or set up something like a cronjob?
Thank you in advance
There are a couple of built-in techniques you can use. The most obvious is a TTL collection, which will automatically remove documents based on a date/time field. The caveat is that you trade some control for that convenience: deletes will be happening automatically all the time, on a schedule you don't control, and deletes are not free - they require a write lock, they need to be flushed to disk, etc. Basically, you will want to test whether your system can handle the level of deletes you will be doing and how it impacts your performance.
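To expire clicks and conversions at different ages from the same collection, one option is to stamp each document with its own expiry date and create a TTL index with expireAfterSeconds: 0, so a document is removed once the wall clock passes that date. A rough sketch, where the expireAt field name is just illustrative:

    // TTL index: remove a document once the current time passes its expireAt value.
    db.customers.ensureIndex({ "expireAt": 1 }, { expireAfterSeconds: 0 });

    var now = new Date();
    // Click/view data: expire one week from now.
    db.customers.insert({ appid: 2, action: "view", uid: "2_b2f58679e6f73ea3", iid: "2_2905040001",
                          t: now, expireAt: new Date(now.getTime() + 7 * 24 * 3600 * 1000) });
    // Conversion data: expire one year from now.
    db.customers.insert({ appid: 2, action: "conversion", uid: "2_b2f5XXXXXX3ea3", iid: "2_2905040001",
                          t: now, expireAt: new Date(now.getTime() + 365 * 24 * 3600 * 1000) });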
Another option is a capped collection - capped collections are pre-allocated on disk and don't grow (except for indexes), and they don't have the same overheads as TTL deletes do (though again, not free). If you have a consistent insert rate and document size, you can work out how much space corresponds to the time frame you wish to keep data for. Perhaps 20GiB is 5 days, so to be safe you allocate 30GiB and monitor from time to time to make sure your data size has not changed.
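A rough sketch of creating such a collection, using the 30GiB figure from the example above (the collection name is illustrative):

    // Pre-allocate 30GiB; once full, the oldest documents are overwritten automatically.
    db.createCollection("clicks", { capped: true, size: 30 * 1024 * 1024 * 1024 });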
After that you are into more manual options. For example, you could simply have a field that marks a document as expired or not, perhaps a boolean - that would mean that expiring a document would be an in-place update and about as efficient as you can get in terms of a MongoDB operation. You could then do a batch delete of your expired documents at a quiet time for your system when the deletes and their effect on performance are less of a concern.
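A sketch of that pattern in the shell (the expired field and the one-week cutoff are illustrative):

    // Cheap update: flag click/view documents older than a week as expired.
    var cutoff = new Date(new Date().getTime() - 7 * 24 * 3600 * 1000);
    db.customers.update({ action: "view", t: { $lt: cutoff }, expired: { $ne: true } },
                        { $set: { expired: true } },
                        { multi: true });

    // At a quiet time, physically remove the flagged documents in one batch.
    db.customers.remove({ expired: true });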
Another alternative: you could start writing to a new database every X days in a predictable pattern so that your application knows what the name of the current database is and can determine the names of the previous 2. When you create your new database, you delete the one older than the previous two and essentially always just have 3 (sub in numbers as appropriate). This sounds like a lot of work, but the benefit is that the removal of the old data is just a drop database command, which just unlinks/deletes the data files at the OS level and is far more efficient from an IO perspective than randomly removing documents from within a series of large files. This model also allows for a very clean backup model - mongodump the old database, compress and archive, then drop etc.
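A sketch of the rotation idea, assuming one database per 7-day window and a retention of three databases (the naming scheme is made up for illustration):

    // Derive a database name from the current 7-day period.
    function dbNameFor(date) {
        var period = Math.floor(date.getTime() / (7 * 24 * 3600 * 1000));
        return "events_p" + period;
    }

    // Write to the current database.
    db.getSiblingDB(dbNameFor(new Date())).clicks.insert({ action: "view", t: new Date() });

    // Keep the current and two previous databases; drop anything older.
    var old = new Date(new Date().getTime() - 3 * 7 * 24 * 3600 * 1000);
    db.getSiblingDB(dbNameFor(old)).dropDatabase();   // just unlinks the data files at the OS level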
As you can see, there are a lot of trade-offs here - you can go for convenience, IO efficiency, database efficiency, or something in between - it all depends on your requirements and what fits best for your particular use case and system.

MongoDB querying performance for over 5 million records

We've recently hit >2 million records in one of our main collections, and we have started to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields which you can filter by using the UI, and the results are supposed to be sorted by the timestamp at which the record was processed.
I've added several compound indexes with the filtered fields and the timestamp,
e.g:
db.events.ensureIndex({somefield: 1, timestamp:-1})
I've also added a couple of indexes for using several filters at once, to hopefully achieve better performance. But some filters still take an awfully long time to perform.
I've made sure, using explain, that the queries do use the indexes I've created, but performance is still not good enough.
I was wondering if sharding is the way to go now... but we will soon start to have about 1 million new records per day in that collection, so I'm not sure if it will scale well.
EDIT: example for a query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey#acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
"cursor" : "BtreeCursor user.userName_1_timestamp_-1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 30060,
"nscanned" : 30060,
"nscannedObjectsAllPlans" : 120241,
"nscannedAllPlans" : 120241,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 26495,
"indexBounds" : {
"user.userName" : [
[
"nickey#acme.com",
"nickey#acme.com"
]
],
"timestamp" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"server" : "yarin:27017"
}
Please note that deviceType has only 2 values in my collection.
This is searching for a needle in a haystack. We'd need some explain() output for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
Ensure it's not because of insufficient RAM and excessive paging
Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for a query or command; anything slower will be logged)
Inspect the slow queries in db.system.profile and run them manually using explain() (a short sketch of these two steps follows this list)
Try to identify the slow operations in the explain() output, such as scanAndOrder or large nscanned, etc.
Reason about the selectivity of the query and whether it's possible to improve the query with an index at all. If not, consider disallowing that filter setting for the end user, or show a warning dialog that the operation might be slow.
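A quick sketch of the profiler steps in the mongo shell (the 100 ms threshold and the audit collection are just examples):

    // Log every operation slower than 100 ms.
    db.setProfilingLevel(1, 100);

    // Inspect the slowest logged operations.
    db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 }).limit(10).pretty();

    // Re-run a suspicious query manually to see its plan.
    db.audit.find({ "userAgent.deviceType": "MOBILE" }).sort({ timestamp: -1 }).explain();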
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents, which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use with time is high, you might end up having to scan thousands of documents before finding a hundred matches.
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
Things get worse when the data is too big for the RAM, because then, having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.
If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.
-- EDIT --
The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey#acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.
I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if he switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done using a compound index that contains the device type, e.g. using
a) ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})
or
b) ensureIndex({'userAgent.deviceType': 1, 'user.userName': 1, 'timestamp': -1})
Unfortunately, that means that queries like find({"user.userName" : "foo"}).sort({"timestamp" : -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
I'm afraid there's no very good solution for this using mongodb at this time.
Mongo only uses 1 index per query.
So if you want to filter on 2 fields, mongo will use the index with one of the fields, but still needs to scan the entire subset.
This means that basically you'll need an index for every type of query in order to achieve the best performance.
Depending on your data, it might not be a bad idea to have one query per field, and process the results in your app.
This way you'll only need an index on each field, but it may be too much data to process.
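A rough sketch of that idea in the shell, fetching only _id for each single-field filter and intersecting client side. This only pays off when each individual result set is reasonably small (so not for a two-value field like deviceType):

    // One indexed query per filter, projecting only _id.
    var byDevice = db.audit.find({ "userAgent.deviceType": "MOBILE" }, { _id: 1 }).toArray();
    var byUser   = db.audit.find({ "user.userName": "nickey#acme.com" }, { _id: 1 }).toArray();

    // Intersect the id sets in the application.
    var seen = {};
    byUser.forEach(function (d) { seen[d._id] = true; });
    var ids = byDevice.filter(function (d) { return seen[d._id]; })
                      .map(function (d) { return d._id; });

    // Fetch the full documents for the intersection.
    db.audit.find({ _id: { $in: ids } }).sort({ timestamp: -1 }).limit(25);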
In some cases $in can keep MongoDB from using an index efficiently. Try changing your query to remove the $in (here it holds a single value, so a plain equality match does the same thing); it may then use the index better and give better performance than what you got earlier.
http://docs.mongodb.org/manual/core/query-optimization/

Design pattern for directed acyclic graphs in MongoDB

The problem
As usual, the problem is to represent a directed acyclic graph in a database. The database choices I had were a relational database like MySQL, or MongoDB. I chose MongoDB because DAGs in relational databases are a mess, but if there is a trick I just didn't find, please tell me.
The goal is to map a DAG into one or multiple MongoDB documents. Because nodes can have multiple children and multiple parents, subdocuments were not an option. I came across multiple design patterns but am not sure which one is the best to go with.
Tree-structure with Ancestors Array
The ancestors array is suggested by the MongoDB docs and is quite easy to understand. As I understand it, my documents would look like this:
{
"_id" : "root",
"ancestors" : [ null ],
"left": 1
}
{
"_id" : "child1",
"ancestors" : [ "root" ],
"left": 2
}
{
"_id" : "child2",
"ancestors" : [ "root", "child1" ],
"left": 1
}
This allows me to find all children of an element like this:
db.Tree.find({ancestors: 'root'}).sort({left: -1})
and all parents like this:
db.Tree.findOne({_id: 'child1'}).ancestors
DBRefs instead of Strings
My second approach would be to replace the string keys with DBRefs. But apart from longer database records, I don't see many advantages over the ancestors array.
String-based array with children and parents
The last idea is to store not only the children of each document but its parents as well. This would give me all the features I want. The downside is the massive overhead of information I would create by storing every relation twice. Furthermore, I am worried about the amount of administration involved, e.g. if a document gets deleted, I have to check all the others for references in multiple fields.
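For illustration, a node in that third pattern and the cleanup a delete would require might look roughly like this (the parents/children field names are my own, not from any official pattern):

    {
        "_id" : "child1",
        "parents" : [ "root" ],
        "children" : [ "child2" ]
    }

    // Deleting a node means scrubbing it from both arrays everywhere, then removing it.
    db.Tree.update({ parents: "child1" },  { $pull: { parents: "child1" } },  { multi: true });
    db.Tree.update({ children: "child1" }, { $pull: { children: "child1" } }, { multi: true });
    db.Tree.remove({ _id: "child1" });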
My Questions
Is MongoDB the right choice over a relational database for this purpose?
Are there any up-/downsides of any of my patterns that I missed?
Which pattern would you suggest and why? Do you maybe have experience with one of them?
Why don't you use a graph database? Check out ArangoDB: you can use documents like in MongoDB, and also graphs. MongoDB is a great database, but not for storing graph-oriented data; ArangoDB handles that.
https://www.arangodb.com/

How to mapreduce on key from another collection

Say I have a collection of users like this:-
{
"_id" : "1234",
"Name" : "John",
"OS" : "5.1",
"Groups" : [{
"_id" : "A",
"Name" : "Group A"
}, {
"_id" : "C",
"Name" : "Group C"
}]
}
And I have a collection of events like this:-
{
"_id" : "15342",
"Event" : "VIEW",
"UserId" : "1234"
}
I'm able to use mapreduce to work out the count of events per user, as I can just emit the "UserId" and count off of that. However, what I want to do now is count events by group.
If I had a "Groups" array in my event document then this would be easy, however I don't and this is only an example, the actual application of this is much more complicated and I don't want to replicate all that data into the event document.
I've seen an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ but I can't see how that applies in this situation, as it is aggregating values from two places... all I really want to do is perform a lookup.
In SQL I would simply JOIN my flattened UserGroup table to the event table and just GROUP BY UserGroup.GroupName
I'd be happy with multiple passes of mapreduce... a first pass to count by UserId into something like { "_id" : "1234", "count" : 9 }, but I get stuck on the next pass... how do I include the group id?
Some potential approaches I've considered:-
Include group info in the event document (not feasible)
Work out how to "join" the user collection or look-up the users groups from within the map function so I can emit the group id's as well (don't know how to do this)
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
What is possible and what are the benefits/issues with each approach?
Your third approach is the way to go:
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
To do this you'll need to create a new collection J with the "joined" data you need for map-reduce. There are several strategies you can use for this:
Update your application to insert/update J in the normal course of business. This is best in the case where you need to run MR very frequently and with up-to-date data. It can add substantially to code complexity. From an implementation standpoint, you can do this either directly (by writing to J) or indirectly (by writing changes to a log collection L and then applying the "new" changes to J). If you choose the log collection approach you'll need a strategy for determining what's changed. There are two common ones: high-watermark (based on _id or a timestamp) and using the log collection as a queue with the findAndModify command.
Create/update J in batch mode. This is the way to go for high-performance systems where the multiple updates from the above strategy would affect performance. It is also the way to go if you do not need to run the MR very frequently and/or you do not have to guarantee up-to-the-second data accuracy (a rough sketch of this batch approach follows the list of iteration options below).
If you go with (2) you will have to iterate over documents in the collections you need to join--as you've figured out, Mongo map-reduce won't help you here. There are many possible ways to do this:
If you don't have many documents and if they are small, you can iterate outside of the DB with a direct connection to the DB.
If you cannot do (1) you can iterate inside the DB using db.eval(). If the number of documents is not small, make sure to use nolock: true, as db.eval is blocking by default. This is typically the strategy I choose, as I tend to deal with very large document sets and I cannot afford to move them over the network.
If you cannot do (1) and do not want to do (2) you can clone the collections to another node with a temporary DB. Mongo has a convenient cloneCollection command for this. Note that this does not work if the DB requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore. Once you have the data local to a new DB you can party on it as you see fit. Once you complete the MR you can update the result collection in your production DB. I use this strategy for one-off map-reduce operations with heavy pre-processing so as to not load the production replica sets.
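A rough sketch of the batch approach, doing a simple client-side join into a new collection and then map-reducing it by group (the joined and events_by_group collection names are illustrative):

    // Build the joined collection J: one document per event, with the user's groups attached.
    db.joined.drop();
    db.events.find().forEach(function (ev) {
        var user = db.users.findOne({ _id: ev.UserId }, { Groups: 1 });
        db.joined.insert({ _id: ev._id, Event: ev.Event, Groups: user ? user.Groups : [] });
    });

    // Map-reduce over J: emit one count per group the event's user belongs to.
    db.joined.mapReduce(
        function () { this.Groups.forEach(function (g) { emit(g._id, 1); }); },
        function (key, values) { return Array.sum(values); },
        { out: "events_by_group" }
    );

    db.events_by_group.find();   // e.g. { "_id" : "A", "value" : 9 }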
Good luck!

MongoDB records retrieval very slow using the C# API

I am trying to retrieve 100,000 documents from MongoDB like below, and it is taking very long to return the collection.
var query = Query.EQ("Status", "E");
var items = collection.Find(query).SetLimit(100000).ToList();
Or
var query = Query.GT("_id", idValue);
var items = collection.Find(query).SetLimit(100000).ToList();
Explain:
{
    "cursor" : "BtreeCursor _id_",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "_id" : [[ObjectId("4f79a64eca98b5fc0e5ae35a"), ObjectId("4f79a64eca98b5fc0e5ae35a")]]
    }
}
Any suggestions to improve query performance? My collection has 2 million documents.
-Venkat
This question was also asked on Google Groups:
https://groups.google.com/forum/?fromgroups#!topicsearchin/mongodb-user/100000/mongodb-user/a6FHFp5aOnA
As I responded on the Google Groups question I tried to reproduce this and was unable to observe any slowness. I was able to read 100,000 documents in 2-3 seconds, depending on whether the documents were near the beginning or near the end of the collection (because I didn't create an index).
My answer to the Google groups question has more details and a link to the test program I used to try and reproduce this.
Given the information you have provided, my best guess is that your document size is too large and the delay is not necessarily on the mongo server but in the transmission of the result set back to your app machine. Take a look at your average document size in the collection; do you have large embedded arrays, for example?
Compare the response time when selecting only one field using the .SetFields method (see the example here: How to retrieve a subset of fields using the C# MongoDB driver?). If the response time is significantly faster, then you know that this is the issue.
Have you defined indices?
http://www.mongodb.org/display/DOCS/Indexes
There are several things to check:
Is your query correctly indexed?
If your query is indexed, what are the odds that the data itself is in memory? If you have 20GB of data and 4GB of RAM, then most of your data is not in memory which means that your disks are doing a lot of work.
How much data do 100k documents represent? If your documents are really big, they could be sucking up all of the available disk IO, or possibly the network. Do you have enough space to store this in RAM on the client?
You can check for disk usage using iostat (a common linux tool) or perfmon (under Windows). If you run these while your query is running, you should get some idea about what's happening with your disks.
Otherwise, you will have to do some reasoning about how much data is moving around here. In general, queries that return 100k objects are not intended to be really fast (not in MongoDB or in SQL). That's more data than humans typically consume in one screen, so you may want to make smaller batches and read 10k objects 10 times instead of 100k objects once.
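A sketch of reading in smaller, indexed batches by walking _id (shown in the mongo shell; the same pattern works with the C# driver, and db.collection stands in for whatever your collection is called). For this to be efficient you would want an index covering { Status: 1, _id: 1 }:

    // Read matching documents in batches of 10k, keyed on _id so each batch
    // continues where the previous one stopped.
    var lastId = null;
    var batch;
    do {
        var query = { Status: "E" };
        if (lastId !== null) { query._id = { $gt: lastId }; }
        batch = db.collection.find(query).sort({ _id: 1 }).limit(10000).toArray();
        if (batch.length > 0) {
            lastId = batch[batch.length - 1]._id;
            // ... process the batch here ...
        }
    } while (batch.length === 10000);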
If you don't create indexes for your collection, MongoDB will do a full collection scan - this is the slowest possible method.
You can run explain() for your query. Explain will tell you which indexes (if any) are used for the query, the number of scanned documents, and the total query duration.
If your query hits all the indexes and its execution is still slow, then you probably have a problem with the size of the collection / RAM.
MongoDB is fastest when the collection data + indexes fit in memory. If your collection size is larger than the available RAM, the performance drop is very large.
You can check the size of your collection with totalSize(), totalIndexSize() or validate() (these are shell commands).
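For example, in the mongo shell:

    db.collection.totalSize();        // data + index size on disk, in bytes
    db.collection.totalIndexSize();   // index size only
    db.collection.stats();            // document count, avgObjSize, storageSize, per-index sizes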