Conflict in choosing the perfect index in MongoDB query optimizer - mongodb

My problem is related to the query optimizer of MongoDB and how it picks the perfect index to use. I realized that under some conditions the optimizer doesn't pick the perfect existing index and rather continues using the one that is close enough.
Consider having a simple dataset like:
{ "_id" : 1, "item" : "f1", "type" : "food", "quantity" : 500 }
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
{ "_id" : 5, "item" : "f3", "type" : "food", "quantity" : 300 }
{ "_id" : 6, "item" : "t1", "type" : "toys", "quantity" : 500 }
{ "_id" : 7, "item" : "a1", "type" : "apparel", "quantity" : 250 }
{ "_id" : 8, "item" : "a2", "type" : "apparel", "quantity" : 400 }
{ "_id" : 9, "item" : "t2", "type" : "toys", "quantity" : 50 }
{ "_id" : 10, "item" : "f4", "type" : "food", "quantity" : 75 }
and then wanting to issue the following query:
db.inventory.find({"type": "food","quantity": {$gt: 50}})
I go ahead and create the following index:
db.inventory.ensureIndex({"quantity" : 1, "type" : 1})
The statistics from cursor.explain() show the following performance for this index: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 9). It scanned more index entries than the number of matching documents. Considering the fact that "type" is a more selective attribute with an exact match, it is surely better to create the following index instead:
db.inventory.ensureIndex({ "type" : 1, "quantity" : 1})
The statistics confirm that this index performs better: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 4). That is, the second index scans exactly as many index entries as there are matching documents.
However, I observed that if I don't delete the first index, the query optimizer continues using the first index, although the better index has been created.
According to the documentation, every time a new index is created the query optimizer considers it when making the query plan, but I don't see this happening here.
Can anyone explain how the query optimizer really works?

Considering the fact that "type" is a more selective attribute
Index selectivity is a very important aspect, but in this case, note that you're using an equality query on type and a range query on quantity, which is the more compelling reason to swap the order of the index fields, even if selectivity were lower.
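To make the equality-before-range point concrete, here is a minimal sketch in plain JavaScript (no MongoDB needed) that counts how many index keys a scan would touch for each field order. It is a simplified model, not the server's actual B-tree traversal: it assumes a scan can only be bounded by the leading equality fields plus the first range field, which happens to reproduce the nscanned values reported in the question. The helper names buildIndex and scanIndex are made up for this illustration.

```javascript
// Documents from the question, reduced to the two indexed fields.
const docs = [
  { type: "food", quantity: 500 },
  { type: "food", quantity: 100 },
  { type: "paper", quantity: 200 },
  { type: "paper", quantity: 150 },
  { type: "food", quantity: 300 },
  { type: "toys", quantity: 500 },
  { type: "apparel", quantity: 250 },
  { type: "apparel", quantity: 400 },
  { type: "toys", quantity: 50 },
  { type: "food", quantity: 75 },
];

// The query from the question: equality on type, range on quantity.
const query = {
  type: { kind: "equality", test: (v) => v === "food" },
  quantity: { kind: "range", test: (v) => v > 50 },
};

// Hypothetical helper: one key tuple per document, in index field order.
// A real index keeps these sorted, which is what makes a bounded range
// contiguous; this sketch only counts entries, so it does not sort them.
function buildIndex(documents, fields) {
  return documents.map((d) => fields.map((f) => d[f]));
}

// Count the keys inside the scan bounds and the keys that match the query.
function scanIndex(documents, fields, q) {
  // Assumption: bounds come from leading equality fields plus the first range field.
  const bounding = [];
  for (const f of fields) {
    bounding.push(f);
    if (q[f].kind === "range") break;
  }
  let keysExamined = 0;
  let nReturned = 0;
  for (const key of buildIndex(documents, fields)) {
    const matches = (f) => q[f].test(key[fields.indexOf(f)]);
    if (!bounding.every(matches)) continue; // outside the scan bounds
    keysExamined++;
    if (fields.every(matches)) nReturned++;
  }
  return { keysExamined, nReturned };
}

const quantityFirst = scanIndex(docs, ["quantity", "type"], query);
const typeFirst = scanIndex(docs, ["type", "quantity"], query);
console.log(quantityFirst); // { keysExamined: 9, nReturned: 4 }
console.log(typeFirst); // { keysExamined: 4, nReturned: 4 }
```

With quantity first, every entry with quantity > 50 falls inside the bounds and has to be checked against type; with type first, the bounds already describe exactly the matching entries.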
However, I observed that if I don't delete the first index, the query optimizer continues using the first index, although the better index has been created. [...]
The MongoDB query optimizer is largely statistical. Unlike most SQL engines, MongoDB doesn't attempt to reason about which index could be more or less efficient. Instead, it simply runs the different candidate query plans in parallel from time to time and remembers which one was faster. The faster strategy will then be used. From time to time, MongoDB will run the candidate plans in parallel again and re-evaluate the strategy.
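The "race the plans, remember the winner, re-check now and then" idea can be sketched as a toy model in plain JavaScript. This is only an illustration of the description above, not MongoDB's planner code: racePlans, PlanCache and the reevaluateEvery counter are hypothetical names, and each plan is reduced to the number of work units it needs (taken here from the nscanned values in the question).

```javascript
// Toy model only; this is not MongoDB's planner code.
// A candidate plan is modelled as the number of work units it needs
// to finish the query.
function racePlans(candidates) {
  const names = Object.keys(candidates);
  if (names.length === 0) throw new Error("no candidate plans");
  // Advance every plan one work unit at a time; the first to finish wins.
  for (let done = 1; ; done++) {
    for (const name of names) {
      if (done >= candidates[name]) return name;
    }
  }
}

class PlanCache {
  constructor(reevaluateEvery) {
    this.reevaluateEvery = reevaluateEvery; // hypothetical re-evaluation trigger
    this.runs = 0; // how many queries have been served
    this.races = 0; // how many times the plans were raced
    this.winner = null;
  }

  // Return the plan to use, racing the candidates only when needed.
  pick(candidates) {
    if (this.winner === null || this.runs % this.reevaluateEvery === 0) {
      this.winner = racePlans(candidates);
      this.races++;
    }
    this.runs++;
    return this.winner;
  }
}

// Work units taken from the nscanned values in the question.
const candidates = { quantity_1_type_1: 9, type_1_quantity_1: 4 };
const cache = new PlanCache(3);
const picks = [];
for (let i = 0; i < 4; i++) picks.push(cache.pick(candidates));
console.log(picks, cache.races);
```

In this sketch four queries trigger only two races (the first query and the fourth); the queries in between reuse the remembered winner. The real server uses its own triggers for re-evaluation, so the fixed counter here is purely an assumption.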
One problem of this approach (and maybe the cause of the confusion) is that there's probably not a big difference with such a tiny dataset - it's often better to simply scan elements than to use any kind of index or search strategy if the data isn't large compared to the prefetch / page size / cache size and pipeline length. As a rule of thumb, simple lists of up to maybe 100 or even 1,000 elements often don't benefit from indexing at all.

As with anything worthwhile, designing indexes requires some forward thinking. The goals are:
Efficiency - fast read / write operations
Selectivity - minimize the number of records scanned
Other requirements - e.g. how are sorts handled?
Selectivity is the primary factor that determines how efficiently an index can be used. Ideally, the index enables us to select only those records required to complete the result set, without the need to scan a substantially larger number of index keys (or documents) in order to complete the query. Selectivity determines how many records any subsequent operations must work with. Fewer records means less execution time.
Think about which queries the application will use most frequently. Use the explain command and look specifically at executionStats:
nReturned
totalKeysExamined - if the number of keys examined is much larger than the number of documents returned, we need a better index to reduce it.
Look at queryPlanner and rejectedPlans. Look at winningPlan, whose keyPattern shows which keys were used from the index. Whenever we see stage: SORT, it means that the sort key is not part of the index, or that the database was not able to return documents in the sort order specified in the query and had to perform an in-memory sort. If we add the key on which the sort happens to the index, we will see the winningPlan's stage change from SORT to FETCH.
The keys in the index need to be ordered according to the range of values each of them takes, e.g. a class field will have a smaller range of values than a student field. Doing this involves a trade-off: executionTimeMillis will be much lower, but docsExamined and keysExamined will be somewhat higher. This trade-off is worth making.
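As a rough sketch of the checks described above, the following plain JavaScript walks an explain() result and reports whether an in-memory sort is present and how much work was done per returned document. The explainOutput object is a hand-written, hypothetical example trimmed to the fields mentioned here (it is not output from a real server), and stagesOf and summarize are illustrative helper names.

```javascript
// Hypothetical explain("executionStats") output, trimmed to the fields
// discussed above. The numbers are invented for illustration.
const explainOutput = {
  queryPlanner: {
    winningPlan: {
      stage: "SORT",
      sortPattern: { name: 1 },
      inputStage: { stage: "COLLSCAN", direction: "forward" },
    },
    rejectedPlans: [],
  },
  executionStats: {
    nReturned: 10,
    totalKeysExamined: 0,
    totalDocsExamined: 5000,
  },
};

// Collect the stage names along the winning plan.
// Assumption: this only follows single-child `inputStage` links; plans with
// several children (an `inputStages` array) are not handled in this sketch.
function stagesOf(plan) {
  const stages = [];
  for (let node = plan; node; node = node.inputStage) stages.push(node.stage);
  return stages;
}

// Summarize the signals worth looking at first.
function summarize(explain) {
  const stats = explain.executionStats;
  const stages = stagesOf(explain.queryPlanner.winningPlan);
  const perReturned = (n) => (stats.nReturned === 0 ? null : n / stats.nReturned);
  return {
    stages,
    inMemorySort: stages.includes("SORT"),
    keysExaminedPerReturned: perReturned(stats.totalKeysExamined),
    docsExaminedPerReturned: perReturned(stats.totalDocsExamined),
  };
}

const summary = summarize(explainOutput);
console.log(summary);
```

A ratio far above 1 for keys or documents examined per returned document, or a SORT stage in the winning plan, is the cue to revisit the index.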
There is also a way to force a query to use a particular index, but it is not recommended in a production deployment. The method in question is .hint(), which can be chained after find() or sort(). It takes either the index name or the shape (key pattern) of the index.
In general, compound indexes are built from three kinds of fields:
- equality field: field on which queries will perform an equality test
- sort field: field on which queries will specify a sort
- range field: field on which queries perform a range test
We should keep the following rules of thumb in mind:
Equality fields before range fields
Sort fields before range fields
Equality fields before sort fields
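The three rules amount to ordering the key pattern as equality fields, then sort fields, then range fields. A minimal sketch in plain JavaScript is shown below; compoundIndexSpec is a hypothetical helper, and the example query (equality on type, sort on item, range on quantity) is assumed for illustration using the inventory fields from earlier.

```javascript
// Build a compound index key pattern following the rules of thumb:
// equality fields first, then sort fields, then range fields.
// Relies on JavaScript objects preserving insertion order for string keys.
function compoundIndexSpec({ equality = [], sort = [], range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  // Sort fields carry their direction (1 ascending, -1 descending).
  for (const [field, direction] of sort) {
    if (!(field in spec)) spec[field] = direction;
  }
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

// Assumed example: find({ type: "food", quantity: { $gt: 50 } }).sort({ item: 1 })
const spec = compoundIndexSpec({
  equality: ["type"],
  sort: [["item", 1]],
  range: ["quantity"],
});
console.log(spec); // { type: 1, item: 1, quantity: 1 }
```

The resulting object could be passed to createIndex as the key pattern; whether it is the best index for a given workload still has to be confirmed with explain().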

Related

MongoDB - performance and collection size

I have a question with regards to collection sizes and query performance –
There are 2 dbs– DB1 & DB2. DB1 has 1 collection, and here’s the output from stats() on this collection –
{
…
"count" : 2085217,
"size" : 17048734192,
"avgObjSize" : 8176,
"capped" : false,
"nindexes" : 3,
"indexDetails" : {},
"totalIndexSize" : 606299456,
"indexSizes" : {
"_id_" : 67664576,
"id_1" : 284165056,
"id_2" : 254469824
},
…
}
A query on this collection, using index id_1 comes back in 0.012 secs. Here’s the output from explain() -
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
….
"indexName" : "id_1",
}
In DB2, I have 4 collections, and here’s the output from stats on DB2 –
{
…
"collections" : 4,
"objects" : 152655935,
"avgObjSize" : 8175.998307514215,
"dataSize" : 1248114666192,
"storageSize" : 1257144933456,
"indexes" : 12,
"indexSize" : 19757688272,
"fileSize" : 1283502112768,
…
}
A query on any collection in DB2, using the index, which I confirmed via explain(), takes at least double the time that it does for the previous query against DB1.
Since mongo should scale well, why is there this diff? I read that mongodb loads all the indexes in memory, and since DB2 has a higher volume than DB1, is that why it’s taking much longer?
Any insights would be greatly helpful. Thanks.
Edit 1:
Adding more info re. collection definition, indexes definitions and queries executed...
All collections (in both DBs) contain the same fields; only the values and the size of documents differ between them.
And, here's the relevant index -
"1" : {
"v" : 1,
"unique" : true,
"key" : {
"id" : 1
},
"name" : "id_1",
"ns" : "ns.coll1"
}
And this is what the id field looks like:
"_id" : ObjectId("55f9b6548aefbce6b2fa2fac"),
"id" : {
"pid" : {
"f1" : "val1",
"f2" : "val2"
}
},
And, here's a sample query -
db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}})
Edit 2:
Here's some more info on the hard disk & RAM -
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         386758        3750        1947       25283      381060      355675
Swap:        131071        3194      127877
The hard disk is around 3.5T, out of which 2.9T is already used.
Scaling
MongoDB scales very well. The thing is, it is designed to scale horizontally, not vertically. This means that if your DBs are holding a lot of data, you should shard the collections in order to achieve better parallelization.
Benchmark results
Regarding the difference in query time, I don't think your profiling is conclusive. The DBs are possibly on different machines (with different specs). Supposing the hardware is the same, DB2 apparently holds more documents in its collections, and the size of the documents is not the same in both DBs. The same query can return data sets of different sizes. That will inevitably have an impact on data serialization and other low-level aspects. Unless you profile the queries in a more controlled setup, I think your results are pretty much expected.
Suggestions
Take care if you are using DBRef in your documents. It's possible Mongo will automatically dereference them; that means more data to serialize and more overhead.
Try running the same queries with a limit specification. You have defined the index to be unique, but I don't know if that automatically makes Mongo stop index traversal once it has found a value. Check if db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}) and db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}).limit(1) run in the same time.
Take a look at Indexes on embedded fields and Indexes on embedded documents. Embedded documents seem to incur even more overhead.
Finally, if your document has no embedded documents, only embedded fields (which seems to be the case), then define your index more specifically. Create this index
db.coll1.createIndex({"id.pid.f1": 1, "id.pid.f2": 1}, {unique: true})
and run the query again. If this index doesn't improve performance, then I believe you have done everything properly and it may be time to start sharding.
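One difference between the two query styles is worth illustrating: an equality match on a whole embedded document requires an exact match, including field order, whereas dot-notation conditions test each field on its own. The plain JavaScript below models that difference without a server; getPath, matchesWholeDocument and matchesDottedFields are hypothetical helpers, and JSON.stringify stands in for the order-sensitive comparison. Presumably the query would also need to be written with dot notation for an index on "id.pid.f1" and "id.pid.f2" to be considered, but that is an assumption to verify with explain().

```javascript
// The stored document shape from the question.
const stored = { id: { pid: { f1: "val1", f2: "val2" } } };

// Resolve a dotted path such as "id.pid.f1" inside a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Model of whole-embedded-document equality: field order matters.
// JSON.stringify keeps insertion order, so it is order-sensitive too.
function matchesWholeDocument(doc, path, expected) {
  return JSON.stringify(getPath(doc, path)) === JSON.stringify(expected);
}

// Model of dot-notation conditions: each field is tested independently.
function matchesDottedFields(doc, conditions) {
  return Object.entries(conditions).every(
    ([path, expected]) => getPath(doc, path) === expected
  );
}

const sameOrder = matchesWholeDocument(stored, "id", { pid: { f1: "val1", f2: "val2" } });
const swappedOrder = matchesWholeDocument(stored, "id", { pid: { f2: "val2", f1: "val1" } });
const dotted = matchesDottedFields(stored, { "id.pid.f2": "val2", "id.pid.f1": "val1" });
console.log(sameOrder, swappedOrder, dotted); // true false true
```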

mongodb: will limit() increase query speed?

Is db.inventory.find().limit(10) faster than db.inventory.find()?
I have millions of records in mongodb, and I want to get the top 10 records in some order.
Using limit() you inform the server that you will not retrieve more than k documents, which allows some optimizations that reduce bandwidth consumption and speed up sorts. Finally, with a limit clause the server can make better use of the 32 MB maximum available when sorting in RAM (i.e. when the sort order cannot be obtained from an index).
Now, the long story: find() returns a cursor. By default, the cursor will transfer the results to the client in batches. From the documentation:
For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes.
Using limit(), the cursor will not need to retrieve more documents than necessary, thus reducing bandwidth consumption and latency.
Please notice that, given your use case, you will probably use a sort() operation as well. From the same documentation as above:
For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
And the sort() documentation page explains further:
If MongoDB cannot obtain the sort order via an index scan, then MongoDB uses a top-k sort algorithm. This algorithm buffers the first k results (or last, depending on the sort order) seen so far by the underlying index or collection access. If at any point the memory footprint of these k results exceeds 32 megabytes, the query will fail [1].
[1] That 32 MB limitation is not specific to sorts that use a limit() clause. Any sort whose order cannot be obtained from an index is subject to the same limitation. However, with a plain sort the server needs to hold all documents in memory to sort them. With a limited sort, it only has to keep k documents in memory at the same time.
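The memory argument can be seen in a small top-k sketch in plain JavaScript. It is a generic illustration of the idea of keeping only the best k results seen so far, not the server's implementation; topK and the sample documents are made up for this example.

```javascript
// Keep only the k smallest documents (according to `compare`) while
// streaming over the input once. The buffer never holds more than k items.
function topK(docs, k, compare) {
  const buffer = []; // always sorted, best first
  for (const doc of docs) {
    // Find where this document would go in the sorted buffer.
    let pos = buffer.length;
    while (pos > 0 && compare(doc, buffer[pos - 1]) < 0) pos--;
    if (pos >= k) continue; // not among the best k seen so far
    if (buffer.length === k) buffer.pop(); // drop the current worst
    buffer.splice(pos, 0, doc);
  }
  return buffer;
}

// Invented sample data: sort by name ascending and keep the top 3.
const people = ["d", "a", "c", "b", "e"].map((name) => ({ name }));
const byName = (x, y) => (x.name < y.name ? -1 : x.name > y.name ? 1 : 0);
const top3 = topK(people, 3, byName).map((p) => p.name);
console.log(top3); // [ 'a', 'b', 'c' ]
```

Every input document is still visited once, so the scan is not avoided; only the amount of data held for sorting shrinks from the whole result set to k documents.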
If you need the results in order, then of course the DB first has to sort them by the given criteria and then return the top 10 records. By using limit() you are just saving network bandwidth. For example, here I am sorting by name and then taking the top 10 records; it has to scan the whole data set and then pick the top 10. (Notice the COLLSCAN stage, which stands for collection scan, as I don't have an index for this example. The point is that it does a full scan of all the records, sorts them, and then picks the top ones.)
> db.t1.find().sort({name:1}).limit(10).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.t1",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"name" : 1
},
"limitAmount" : 10,
"inputStage" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "Sachin-Mac.local",
"port" : 27017,
"version" : "3.0.2",
"gitVersion" : "6201872043ecbbc0a4cc169b5482dcf385fc464f"
},
"ok" : 1
}

Index strategy for queries with dynamic match criteria

I have a collection which is going to hold machine data as well as mobile data. The data is captured per channel and kept at a single level, with no embedded objects. The structure is as follows:
{
"Id": ObjectId("544e4b0ae4b039d388a2ae3a"),
"DeviceTypeId":"DeviceType1",
"DeviceTypeParentId":"Parent1",
"DeviceId":"D1",
"ChannelName": "Login",
"Timestamp": ISODate("2013-07-23T19:44:09Z"),
"Country": "India",
"Region": "Maharashtra",
"City": "Nasik",
"Latitude": 13.22,
"Longitude": 56.32,
// and 10 - 15 more fields
}
Most of the queries are aggregation queries, used for an analytics dashboard and real-time analysis. The $match stage is as follows:
{$match:{"DeviceTypeId":{"$in":["DeviceType1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
or
{$match:{"DeviceTypeParentId":{"$in":["Parent1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
and many of my DAL layer find queries and findOne queries are mostly on criteria DeviceType or DeviceTypeParentId.
The collection is huge and it's growing. I have used compound indexes to support these queries; the indexes are as follows:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceType" : 1,
"Timestamp" : 1
},
"name" : "DeviceType_1_Timestamp_1",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceTypeParentId" : 1,
"Timestamp" : 1
},
"name" : "DeviceTypeParentId_1_Timestamp_1",
"ns" : "DB.channel_data"
}
]
Now we are going to add support for match criteria on DeviceId. Following the same strategy as I did for DeviceType and DeviceTypeParentId does not seem good, as I feel that with my current approach I'm creating many indexes that are almost all the same and all huge.
So is there a good way to do the indexing? I have read a bit about index intersection but I'm not sure how it would help.
If I have followed a wrong approach anywhere, please point it out, as this is my first project and the first time I am using MongoDB.
Those indexes all look appropriate for your queries, including the new one you're proposing. Three separate indexes supporting your three kinds of queries are the overall best option in terms of fast queries. You could put indexes on each field and let the planner use index intersection, but it won't be as good as the compound indexes. The indexes are not the same since they support different queries.
I think the real question is, is the (apparently) large memory footprint of the indices actually a problem at this point? Do you have a lot of page faults because of paging indexes and data out of disk?

Does query order affect compound index usage

MongoDB compound indexes support queries on any prefix of the index fields, but must the field order in the query match the order in the compound index itself?
Assuming we have the following index:
{ "item": 1, "location": 1, "stock": 1 }
Does it cover this query:
{"location" : "Antarctica", "item" : "Hamster Wheel"}
Yes, that query can use the index. What matters is the order of the fields in the index definition.
In your example above, all the queries that filter on "item" may use the index, but queries that do not use the "item" field and filter only on "location" and/or "stock" will NOT use this index.
The sequence of the fields in the filter in the "read" query does NOT matter. MongoDB is smart enough to know that
{"location" : "Antarctica", "item" : "Hamster Wheel"}
is the same as
{"item" : "Hamster Wheel", "location" : "Antarctica"}
As others have pointed out, the best way to ensure that your query is using the index, is to run an explain on your query http://bit.ly/1oE6zo1
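The prefix rule described in this answer can be sketched in plain JavaScript. This is a simplified model of the rule as stated here, not the query planner itself, and usableIndexPrefix is a hypothetical helper: it returns the leading index fields that a query's filter fields cover, regardless of the order in which the query lists them.

```javascript
// Return the leading fields of the index that the query filters on.
// An empty result means the query cannot use the index under the prefix rule.
function usableIndexPrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields); // a set, so query field order is irrelevant
  const prefix = [];
  for (const field of indexFields) {
    if (!wanted.has(field)) break; // the prefix ends at the first missing field
    prefix.push(field);
  }
  return prefix;
}

const indexFields = ["item", "location", "stock"];
console.log(usableIndexPrefix(indexFields, ["location", "item"])); // [ 'item', 'location' ]
console.log(usableIndexPrefix(indexFields, ["item", "location"])); // [ 'item', 'location' ]
console.log(usableIndexPrefix(indexFields, ["location", "stock"])); // []
console.log(usableIndexPrefix(indexFields, ["item", "stock"])); // [ 'item' ]
```

As the answer notes, explain() remains the authoritative check for whether a particular query actually uses the index.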

Is there a way to force mongodb to store certain index in ram?

I have a collection with a relatively big index (but smaller than the available RAM). Looking at the performance of find on this collection and the amount of free RAM on my system as shown by htop, it seems that mongo is not keeping the full index in RAM. Is there a way to force mongo to keep this particular index in RAM?
Example query:
> db.barrels.find({"tags":{"$all": ["avi"]}}).explain()
{
"cursor" : "BtreeCursor tags_1",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 300393,
"millis" : 55299,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
]
]
}
}
Not all objects are tagged with the "avi" tag:
> db.barrels.find().explain()
{
"cursor" : "BasicCursor",
"nscanned" : 823299,
"nscannedObjects" : 823299,
"n" : 823299,
"millis" : 46270,
"indexBounds" : {
}
}
Without "$all":
db.barrels.find({"tags": ["avi"]}).explain()
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 0,
"millis" : 43440,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
],
[
[
"avi"
],
[
"avi"
]
]
]
}
}
This also happens when I search for two or more tags (it scans every item as if there were no index):
> db.barrels.find({"tags":{"$all": ["avi","mp3"]}}).explain()
{
"cursor" : "BtreeCursor tags_1",
"nscanned" : 300393,
"nscannedObjects" : 300393,
"n" : 6427,
"millis" : 53774,
"indexBounds" : {
"tags" : [
[
"avi",
"avi"
]
]
}
}
No. MongoDB allows the system to manage what is stored in RAM.
With that said, you should be able to keep the index in RAM by running queries against the indexes (check out query hinting) periodically to keep them from getting stale.
Useful References:
Checking Server Memory Usage
Indexing Advice and FAQ
Additionally, Kristina Chodorow provides this excellent answer regarding the relationship between MongoDB Indexes and RAM
UPDATE:
After the update providing the .explain() output, I see the following:
The query is hitting the index.
nscanned is the number of items (docs or index entries) examined.
nscannedObjects is the number of docs scanned
n is the number of docs that match the specified criteria
your dataset is 300393 entries, which is the total number of items in the index, and the matching results.
I may be reading this wrong, but what I'm reading is that all of the items in your collection are valid results. Without knowing your data, it would seem that every item contains the tag "avi". The other thing that this means is that this index is almost useless; indexes provide the most value when they work to narrow the resultant field as much as possible.
From MongoDB's "Indexing Advice and FAQ" page:
Understanding explain's output. There are three main fields to look for when examining the explain command's output:
cursor: the value for cursor can be either BasicCursor or BtreeCursor. The second of these indicates that the given query is using an index.
nscanned: the number of documents scanned.
n: the number of documents returned by the query. You want the value of n to be close to the value of nscanned. What you want to avoid is doing a collection scan, that is, where every document in the collection is accessed. This is the case when nscanned is equal to the number of documents in the collection.
millis: the number of milliseconds required to complete the query. This value is useful for comparing indexing strategies, indexed vs. non-indexed queries, etc.
Is there a way to force mongo to store this particular index in the ram?
Sure, you can walk the index with an index-only query. That will force MongoDB to load every block of the index. But it has to be "index-only", otherwise you will also load all of the associated documents.
The only benefit this will provide is to make some potential future queries faster if those parts of the index are required.
However, if there are parts of the index that are not being accessed by the queries already running, why change this?