Indexing with MongoDB: bad performance / indexOnly=false

I have MongoDB running on an 8GB Linux machine. Currently it's in test mode, so there are very few other requests coming in, if any at all.
I have a collection items with 1 million documents in it. I am creating an index on the fields PeerGroup and CategoryIds (which is an array of 3-6 elements, so this will be a multikey index): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}).
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
"cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
"isMultiKey" : true,
"n" : 203944,
"nscannedObjects" : 203944,
"nscanned" : 203944,
"nscannedObjectsAllPlans" : 203944,
"nscannedAllPlans" : 203944,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 680,
"indexBounds" : {
"PeerGroup" : [
[
"anonymous",
"anonymous"
]
],
"CategoryIds" : [
[
BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
]
]
},
"server" : "db02:27017"
}
I think 680ms is not very fast. Or is this acceptable?
Also, why does it say "indexOnly:false" ?

I think 680ms is not very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, the next time you run this it will be an in-memory query and will then return basically as fast as possible. The nscanned is high, meaning that this query is not very selective; are most records going to have an "anonymous" value in PeerGroup? If so, and CategoryIds is more selective, then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try one versus the other).
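A rough sketch of that comparison in the shell, reusing the BinData value from the question (the second index is the alternative field order being suggested, not something you necessarily already have):

db.items.ensureIndex({CategoryIds: 1, PeerGroup: 1})
db.items.find({CategoryIds: new BinData(3, "xqScEqwPiEOjQg7tzs6PHA=="), PeerGroup: "anonymous"})
        .hint({CategoryIds: 1, PeerGroup: 1}).explain()
db.items.find({CategoryIds: new BinData(3, "xqScEqwPiEOjQg7tzs6PHA=="), PeerGroup: "anonymous"})
        .hint({PeerGroup: 1, CategoryIds: 1}).explain()

Comparing the nscanned and millis values of the two explain outputs should tell you which field order suits your data better.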
Also, why does it say "indexOnly:false"
This simply indicates that not all of the fields you wish to return are in the index; the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it had not been). For this to be an indexOnly query, you would need to return only the two fields in the index (that is, project {_id : 0, PeerGroup:1, CategoryIds:1}). That would mean it would never have to touch the data itself and could return everything you need from the index alone.
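A minimal sketch of such a covered query, assuming you really only need those two fields back (whether a multikey index can actually cover the query varies by server version, so check explain() rather than assuming indexOnly will flip to true):

db.items.find(
    {CategoryIds: new BinData(3, "xqScEqwPiEOjQg7tzs6PHA=="), PeerGroup: "anonymous"},
    {_id: 0, PeerGroup: 1, CategoryIds: 1}
).explain()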

Related

Getting rid of _id in mongodb collection

I know it is not possible to remove the _id field in a MongoDB collection. However, my collection is so large that the index on the _id field prevents me from loading the other indexes into RAM. My machine has 125GB of RAM, and my collection stats are as follows:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
When I do a query like the following:
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
it takes more than 15 seconds on average to return the results. The indices for both caller_id and receiver_id should be around 90GB, which is OK. However, the 73GB index on the _id makes this query very slow.
You are correct that you cannot remove the _id field from your documents. You also cannot remove the index on this field, so this is something you have to live with.
For some reason you start with the assumption that the _id index makes your query slow, which is unjustified and most probably wrong. That index is not used by this query and just sits there untouched.
A few things I would try in your situation:
You have almost 2 billion documents in your collection; have you thought that this is the right time to start sharding your database? In my opinion you should.
Use explain() with your query to actually figure out what slows it down.
Looking at your query, I would also try to do the following:
change your document from
{
... something else ...
receiver_id: 234,
caller_id: 342
}
to
{
... something else ...
participants: [342, 234]
}
where participants is [caller_id, receiver_id] in that order. Then you can put just one index on this field. I know it will not make your indexes smaller, but I hope that because you no longer need the $or clause, you will get results faster. P.S. If you do this, do not do it straight in production; test whether it gives you a significant improvement, and only then change it in prod.
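A hedged sketch of what that restructuring could look like in the shell; participants is the hypothetical new array, and the backfill loop is only illustrative (it would be very slow over ~1.8 billion documents, so test on a copy first):

db.call_records.ensureIndex({participants: 1})

// one-off backfill of the new array from the existing fields (illustrative only)
db.call_records.find().forEach(function (doc) {
    db.call_records.update(
        {_id: doc._id},
        {$set: {participants: [doc.caller_id, doc.receiver_id]}}
    );
});

// the $or then collapses to a single equality match on one index
db.call_records.find({participants: 125091840205}).explain()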
There are a lot of potential issues here.
The first is that your indexes do not include all of the data returned. This means Mongo is getting the _id from the index and then using the _id to retrieve and return the document in question. So removing the _id index, even if you could, would not help.
Second, the query includes an OR. This forces Mongo to load both indexes so that it can read them and then retrieve the documents in question.
To improve performance, I think you have just a few choices:
Add the additional fields to the indexes and restrict the data returned to what is available in the index (this would make indexOnly true in the explain results); see the sketch after this list.
Explore sharding as Skooppa.com mentioned.
Rework the query and/or the document to eliminate the OR condition.
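For the first option, a sketch of what that could look like; the extra fields (time, duration) are only an assumption about what you actually need back, so substitute your own:

db.call_records.ensureIndex({caller_id: 1, time: 1, duration: 1})
db.call_records.ensureIndex({receiver_id: 1, time: 1, duration: 1})
db.call_records.find(
    {$or: [{caller_id: 125091840205}, {receiver_id: 125091840205}]},
    {_id: 0, time: 1, duration: 1}
).explain()

Check indexOnly in the per-clause explain output to confirm each branch is actually being covered on your server version.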

Speeding up $or query in pymongo

I have a collection of 1.8 billion records stored in mongodb, where each record looks like this:
{
"_id" : ObjectId("54c1a013715faf2cc0047c77"),
"service_type" : "JE",
"receiver_id" : NumberLong("865438083645"),
"time" : ISODate("2012-12-05T23:07:36Z"),
"duration" : 24,
"service_description" : "NQ",
"receiver_cell_id" : null,
"location_id" : "658_55525",
"caller_id" : NumberLong("475035504705")
}
I need to get all the records for 2 million specific users (I have the users of interest id in a text file) and process it before I write the results to a database. I have indices on the receiver_id and on caller_id (each is part of a single index).
The current procedure I have is as the following:
for user in list_of_2million_users:
    user_records = collection.find({"$or": [{"caller_id": user}, {"receiver_id": user}]})
    for record in user_records:
        process(record)
However, it takes 15 seconds on average to consume the user_records cursor (the process function is very simple, with a low running time), which will not be feasible for processing 2 million users. Any suggestions for speeding up the $or query, since it seems to be the most time-consuming step?
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 784,
"nscannedObjects" : 784,
"nscanned" : 784,
"nscannedObjectsAllPlans" : 784,
"nscannedAllPlans" : 784,
"scanAndOrder" : false,
"nYields" : 753,
"nChunkSkips" : 0,
"millis" : 31057,
"server" : "some_server:27017",
"filterSet" : false
}
And this is the collection stats:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
I am running Ubuntu server with 125GB of RAM.
Note that I will run this analysis only once (it is not a periodic thing).
If the indexes on caller_id and receiver_id are a single compound index, this query will do a collection scan instead of an index scan. Make sure each field has its own separate index, i.e.:
db.user_records.ensureIndex({caller_id:1})
db.user_records.ensureIndex({receiver_id:1})
You can confirm that your query is doing an index scan in the mongo shell:
db.user_records.find({'$or':[{caller_id:'example'},{receiver_id:'example'}]}).explain()
If the explain plan returns its cursor type as BtreeCursor, you're using an index scan. If it says BasicCursor, you're doing a collection scan, which is not good.
It would also be interesting to know the size of each index. For best query performance, both indexes should be completely loaded into RAM. If the indexes are so large that only one (or neither!) of them fits into RAM, you will have to page them in from disk to look up the results. If they're too big to fit in your RAM, your options are not great: basically either split up your collection in some manner and re-index it, or get more RAM. You could always get a RAM-heavy AWS instance just for the purpose of this analysis, since it is a one-off thing.
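You can check the index sizes from the shell (values are reported in bytes):

db.user_records.stats().indexSizes   // size of each individual index
db.user_records.totalIndexSize()     // combined size of all indexes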
I am no expert in MongoDB, but I had a similar problem and the following suggestions helped me tackle it. Hope they help you too.
Your query is using the indexes and scanning only the matching documents, so there is no issue with your indexing. Still, I would suggest:
First of all, look at the output of mongostat --discover.
Watch parameters such as page faults and index misses.
Have you tried warming up (that is, checking the performance of the query after executing it a first time)? What is the performance after warming up? If it is the same as the first run, there might be page faults.
If you are going to run this as a one-off analysis, I think warming up the database might help you.
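One way to pre-warm on the MMAPv1 storage engine is the touch command, which loads a collection's data and/or index pages into memory; this is only a sketch, and it is worthwhile only if the pages you touch actually fit in RAM:

db.runCommand({touch: "call_records", data: false, index: true})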
I don't know why your approach is so slow.
But you might want to try these alternative approaches:
Use $in with many ids at once. I'm not sure whether MongoDB handles millions of values well, but if it does not, sort the list of IDs and then split it into batches (see the sketch after this list).
Do a collection scan in the application and check each entry against a hash set containing the interesting IDs. This should have acceptable performance for a one-off script, especially since you're interested in so many IDs.
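A rough shell sketch of the batched $in idea; userIds stands in for the 2 million ids loaded from your text file (a hypothetical variable), and the batch size is arbitrary:

var batchSize = 1000;
for (var i = 0; i < userIds.length; i += batchSize) {
    var batch = userIds.slice(i, i + batchSize);
    db.call_records.find({
        $or: [
            {caller_id: {$in: batch}},
            {receiver_id: {$in: batch}}
        ]
    }).forEach(function (record) {
        // per-record processing goes here
    });
}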

Projection makes query slower

I have over 600k records in MongoDB.
My user schema looks like this:
{
"_id" : ObjectId,
"password" : String,
"email" : String,
"location" : Object,
"followers" : Array,
"following" : Array,
"dateCreated" : Number,
"loginCount" : Number,
"settings" : Object,
"roles" : Array,
"enabled" : Boolean,
"name" : Object
}
The following query:
db.users.find(
{},
{
name:1,
settings:1,
email:1,
location:1
}
).skip(656784).limit(10).explain()
results in this:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 656794,
"nscanned" : 656794,
"nscannedObjectsAllPlans" : 656794,
"nscannedAllPlans" : 656794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5131,
"nChunkSkips" : 0,
"millis" : 1106,
"server" : "shreyance:27017",
"filterSet" : false
}
and after removing the projection, the same query db.users.find().skip(656784).limit(10).explain() results in this:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 656794,
"nscanned" : 656794,
"nscannedObjectsAllPlans" : 656794,
"nscannedAllPlans" : 656794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5131,
"nChunkSkips" : 0,
"millis" : 209,
"server" : "shreyance:27017",
"filterSet" : false
}
As far as I know, projection always increases the performance of a query, so I am unable to understand why MongoDB is behaving like this. Can someone explain this? When should projection be used and when not, and how is projection actually implemented in MongoDB?
You are correct that projection makes this skip query slower in MongoDB 2.6.3. This is related to an optimisation issue with the 2.6 query planner tracked as SERVER-13946.
The 2.6 query planner (as at 2.6.3) is adding SKIP (and LIMIT) stages after projection analysis, so the projection is being unnecessarily applied to results that get thrown out during the skip for this query. I tested a similar query in MongoDB 2.4.10 and the nScannedObjects was equal to the number of results returned by my limit rather than skip + limit.
There are several factors contributing to your query performance:
1) You haven't specified any query criteria ({}), so this query is doing a collection scan in natural order rather than using an index.
2) The query cannot be covered because there is no projection.
3) You have an extremely large skip value of 656,784.
There is definitely room for improvement on the query plan, but I wouldn't expect skip values of this magnitude to be reasonable in normal usage. For example, if this was an application query for pagination with 50 results per page your skip() value would be the equivalent of page number 13,135.
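As an aside, a common alternative to large skip() values is range-based pagination on an indexed field; this is a hedged sketch, with lastSeenId a hypothetical variable holding the last _id from the previous page:

db.users.find(
    {_id: {$gt: lastSeenId}},
    {name: 1, settings: 1, email: 1, location: 1}
).sort({_id: 1}).limit(10)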
Unless your projection produces an "index only" (covered) query, which means that all of the fields projected in the result are present in the index, you are always producing more work for the query engine.
You have to consider the process:
How do I match? On document or index? Find appropriate primary or other index.
Given the index, scan and find things.
Now what do I have to return? Is all of the data in the index? If not go back to the collection and pull the documents.
That is the basic process, so unless one of those stages is optimized in some way, things of course take longer.
You need to look at this as designing a "server engine" and understand the steps that need to be undertaken. Considering that none of your conditions meets anything that would be "optimal" at those steps, you need to accept that.
Your "best" case is where only the projected fields are the fields present in the chosen index. But really, even that has the overhead of loading the index.
So choose wisely, and understand the constraints and memory requirements for what you are writing your query for. That is what "optimization" is all about.

MongoDB OR condition indexing

I have an OR query which I'm currently using for a semi-large update. Essentially my collection is split into two data sets:
1 main repository and 1 subset of the main repository. This is just to allow quicker searching on a small subset of data.
However, I'm finding that the query I use to pull things into the subset is timing out, and looking at the explain it appears that two queries are actually happening.
PRIMARY> var date = new Date(2012,05,01);
PRIMARY> db.col.find(
{"$or":[
{"date":{"$gt":date}},
{"keywords":{"$in":["Help","Support"]}}
]}).explain();
This produces:
{
"clauses" : [
{
"cursor" : "BtreeCursor ldate_-1",
"nscanned" : 1493872,
"nscannedObjects" : 1493872,
"n" : 1493872,
"millis" : 1035194,
"nYields" : 3396,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"ldate" : [
[
ISODate("292278995-01--2147483647T07:12:56.808Z"),
ISODate("2012-06-01T07:00:00Z")
]
]
}
},
{
"cursor" : "BtreeCursor keywords_1 multi",
"nscanned" : 88526,
"nscannedObjects" : 88526,
"n" : 2515,
"millis" : 1071902,
"nYields" : 56,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"keywords" : [
[
"Help",
"Help"
],
[
"Support",
"Support"
]
]
}
}
],
"nscanned" : 1582398,
"nscannedObjects" : 1582398,
"n" : 1496387,
"millis" : 1071902
}
Is there something I can index better to make this faster? It seems just way too slow...
Thanks ahead of time!
An $or query will evaluate each clause separately and combine the results to remove duplicates, so if you want to optimize the queries you should first try to explain() each clause individually.
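For example, using the clauses from the query in the question:

db.col.find({"date": {"$gt": date}}).explain()
db.col.find({"keywords": {"$in": ["Help", "Support"]}}).explain()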
It looks like part of the problem is that you are retrieving a large number of documents while actively writing to that collection, as evidenced by the high nYields (3396). It would be worth reviewing mongostat output while the query is running to consider other factors such as page faulting, lock %, and read/write queues.
If you want to make this query faster for a large number of documents and very active collection updates, two best practice approaches to consider are:
1) Pre-aggregation
Essentially this is updating aggregate stats as documents are inserted/updated so you can make fast real-time queries. The MongoDB manual describes this use case in more detail: Pre-Aggregated Reports.
2) Incremental Map/Reduce
An incremental Map/Reduce approach can be used to calculate aggregate stats in successive batches (for example, from an hourly or daily cron job). With this approach you perform a Map/Reduce using the reduce output option to save results to a new collection, and include a query filter that only selects documents that have been created/updated since the last time this Map/Reduce job was run.
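A rough sketch of that incremental approach, counting documents per keyword as an example; lastRunTime is a hypothetical bookmark you would persist between runs, and the map/reduce functions should be adapted to whatever you actually aggregate:

var lastRunTime = ISODate("2012-06-01T00:00:00Z");  // hypothetical bookmark from the previous run
db.col.mapReduce(
    function () {
        // emit one count per keyword on the document
        (this.keywords || []).forEach(function (k) { emit(k, 1); });
    },
    function (key, values) { return Array.sum(values); },
    {
        query: {date: {$gt: lastRunTime}},    // only documents created/updated since the last run
        out: {reduce: "keyword_stats"}        // merge the new counts into the existing output collection
    }
);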
I think you should create a compound index on both date and keywords. Refer to the post below for more specifics based on your use case:
how to structure a compound index in mongodb

Expected Behaviour of Compound _id in MongoDB?

I have a compound _id containing 3 numeric properties:
_id": {
"KeyA": 0,
"KeyB": 0,
"KeyC": 0
}
The database in question has 2 million identical values for KeyA and clusters of 500k identical values for KeyB.
My understanding is that I can efficiently query for KeyA and KeyB using the command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3 } ).limit(100)
When I explain this query the result is:
"cursor" : "BasicCursor",
"nscanned" : 1000100,
"nscannedObjects" : 1000100,
"n" : 100,
"millis" : 1592,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
Without the limit() the result is:
"cursor" : "BasicCursor",
"nscanned" : 2000000,
"nscannedObjects" : 2000000,
"n" : 500000,
"millis" : 3181,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
As I understand it, BasicCursor means that the index has been ignored, and both queries have a high execution time: even when I've only requested 100 records it takes ~1.5 seconds. It was my intention to use the limit to implement pagination, but this is obviously too slow.
The command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3, , "_id.KeyC": 1000 } )
Correctly uses the BtreeCursor and executes quickly suggesting the compound _id is correct.
I'm using MongoDB release 1.8.3. Could someone clarify whether I'm seeing the expected behaviour, or have I misunderstood how to use/query the compound index?
Thanks,
Paul.
The index is not a compound index, but an index on the whole value of the _id field. MongoDB does not look into an indexed field, and instead uses the raw BSON representation of a field to make comparisons (if I read the docs correctly).
To do what you want, you need an actual compound index over {_id.KeyA: 1, _id.KeyB: 1, _id.KeyC: 1} (which should also be a unique index). Since you cannot avoid having an index on _id, you will probably be better off leaving _id as an ObjectId (that will create a smaller index and waste less space) and keeping KeyA, KeyB and KeyC as top-level properties of your document. E.g. {_id: ObjectId("xyz..."), KeyA: 1, KeyB: 2, KeyC: 3}
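A sketch of that separate compound index; "collection" is a placeholder for your actual collection name, which isn't given in the question. The first form indexes the existing embedded keys, the second assumes the suggested restructuring to top-level fields:

db.collection.ensureIndex({"_id.KeyA": 1, "_id.KeyB": 1, "_id.KeyC": 1}, {unique: true})
db.collection.ensureIndex({KeyA: 1, KeyB: 1, KeyC: 1}, {unique: true})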
You would need a separate compound index for the behavior you desire. In general I recommend against using objects as _id because key order is significant in comparisons, so {a:1, b:1} does not equal {b:1, a:1}. Since not all drivers preserve key order in objects it is very easy to shoot yourself in the foot by doing something like this:
db.foo.save(db.foo.findOne())