MongoDB Find performance: single compound index VS two single field indexes

I'm looking for advice about which indexing strategy to use in MongoDB 3.4.
Let's suppose we have a people collection of documents with the following shape:
{
_id: 10,
name: "Bob",
age: 32,
profession: "Hacker"
}
Let's imagine that a web api to query the collection is exposed and that the only possible filters are by name or by age.
A sample call to the api will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25
Such a call will be translated in the following query: db.people.find({name: "Bob", age: 25}).
To better clarify our scenario, consider that:
the field name was already in our documents and we already have an index on that field
we are going to add the new field age due to some new features of our application
the database is only accessible via the web api mentioned above and the most important requirement is to expose a super fast web api
all the calls to the web api will apply a filter on both the fields name and age (put another way, all the calls to the web api will have the same pattern, which is the one shown above)
That said, we have to decide which of the following indexing strategies offers the best performance:
One compound index: {name: 1, age: 1}
Two single-field indexes: {name: 1} and {age: 1}
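For concreteness, here is a minimal sketch of the two candidates in the mongo shell (collection and field names taken from the example above):
// Option 1: one compound index covering both filter fields
db.people.createIndex({ name: 1, age: 1 })
// Option 2: two single-field indexes (the planner may intersect them)
db.people.createIndex({ name: 1 })
db.people.createIndex({ age: 1 })
// The query issued by the web api in either case
db.people.find({ name: "Bob", age: 25 })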
According to some simple tests, it seems that the single compound index is much more performant than the two single-field indexes.
By executing a single query via the mongo shell, the explain() method suggests that with the single compound index you can query the database nearly ten times faster than with the two single-field indexes.
This difference seems to be less dramatic in a more realistic scenario, where instead of executing a single query via the mongo shell, multiple calls are made to two different urls of a nodejs web application. Both urls execute a query against the database and return the fetched data as a json array: one uses a collection with the single compound index and the other uses a collection with the two single-field indexes (both collections containing exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.
Based on these test results, we are considering using the single compound index approach.
Does anyone have experience with this topic? Are we missing any important consideration (maybe some disadvantage of big compound indexes)?

Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (as in name and age in your example), in order to find the resulting documents, MongoDB will either:
do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): this is extremely I/O-intensive and hence slow.
use one index that holds one of the fields (use the index tree to locate the relevant subset of documents, followed by a scan of them): depending on your data distribution/index selectivity this can be very fast or barely provide any benefit (imagine an index on age in a dataset of millions of people aged between 30 and 40 --> every lookup would still yield an endless number of documents).
use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower than #4 (as you mentioned).
use a compound index (two subsequent key lookups immediately lead to the required documents): this will be the fastest option of all, given that it requires the fewest and cheapest operations to get to the right documents. In order to ensure the greatest level of reuse (not performance, which won't be affected by this) you should in general put the most selective field first, so in your case probably name rather than age, given that a lot of people will share the same age (low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how to best define a compound index, taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
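As a quick way to check which of the four strategies above the planner actually picked, you can inspect explain() in the shell; a minimal sketch (IXSCAN, COLLSCAN, AND_SORTED and AND_HASH are the standard plan stage names):
var plan = db.people.find({ name: "Bob", age: 25 }).explain("executionStats")
// winningPlan stages: IXSCAN = index scan, COLLSCAN = full collection scan,
// AND_SORTED / AND_HASH = index intersection
printjson(plan.queryPlanner.winningPlan)
// how much work was done: index keys and documents examined
print(plan.executionStats.totalKeysExamined, plan.executionStats.totalDocsExamined)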
Other aspects to consider: index updates come at a certain price. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.
And last but not least (!) the well over-used bottom line advice: Profile the hell out of your system using real data and perhaps even realistic load scenarios. And also keep measuring as your data/system changes over time.
Additional reads:
https://docs.mongodb.com/manual/core/query-optimization/index.html
https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in
Index intersection vs. compound index?
mongodb compund index vs. index intersect
How does the order of compound indexes matter in MongoDB performance-wise?
In MongoDB, when using a large query, should I create a compound index or single indexes to improve response time?

Related

mongodb - Multiple Compound Indexes involving a common field

We have a collection with millions of data. This data is being rendered in the UI for stats purpose and hence time to render is of key importance.
The queries to render the data involve the below fields:
field_a and field_t
field_b and field_t
field_c and field_t
As we are querying millions of data, we want to use Compound Index to speed up the queries.
To do so, we can simply add 3 different compound indexes as below:
db.mycollection.createIndex( { "field_a": 1, "field_t": 1 } )
db.mycollection.createIndex( { "field_b": 1, "field_t": 1 } )
db.mycollection.createIndex( { "field_c": 1, "field_t": 1 } )
The ESR (Equality, Sort, Range) rule is respected while creating the indexes, as field_a, field_b and field_c are equality checks and field_t is a range check.
Please note that field_t is common in all the 3 indexes.
Instead of creating 3 different indexes, is there a better approach to this?
Does mongo provide a more efficient way to handle this scenario where same field is being used in multiple compound indexes?
Better or more efficient in what regard?
Having the three indexes that you mentioned is the most efficient approach in terms of query performance. They will allow the database to process only the data that is relevant for each query and nothing else. Any other approach would reduce read efficiency (and speed) which may not be a good tradeoff.
Most databases, MongoDB included, typically use a single index when executing a query. This is mostly a consequence of how indexes work. Indexes typically use a B-tree-like data structure, which is an ordered set of information. When the ESR rule is followed (equality conditions placed before range conditions), all of the information for a specific query is contained within a single bounded subtree of the index, which can be traversed directly. The database loses that ability when the index is not structured this way (for example, when range keys come first).
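To illustrate the bounded-subtree point, here is a hedged sketch of what explain() output should show for one of the indexes above, with hypothetical values for field_a and field_t:
db.mycollection.find({ field_a: "x", field_t: { $gte: 10, $lt: 20 } })
               .explain().queryPlanner.winningPlan
// In the IXSCAN stage, indexBounds should show field_a pinned to ["x", "x"]
// and field_t bounded to [10, 20) -- a single contiguous range of the index.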
Other potential approaches using single field indexes would be things like:
Index intersection - where you create (in this case) 4 single-field indexes and have the database use 2 of them for each query. MongoDB typically does not choose this approach very often, as it results in scanning larger portions of the index compared to the compound index approach above.
Using 1 single-field index for each query - the database would end up retrieving documents just to filter on the other field, which could be quite inefficient depending on the selectivity of that other field.
While these may reduce the overall size of the collective indexes, they increase the cost (and decrease the efficiency) of executing the queries. Depending on what you are optimizing for, the approach you've outlined would be considered a best practice in terms of query efficiency.

MongoDB $all optimization of tag-based query

A non-distributed database has many posts; posts have zero or more user-defined tags. Most posts have the most_posts_have_this tag, while few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow; it seems to be iterating through the posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Short answer is no, this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you query the tags field, imagine that mongo queries each tag separately and then computes the intersection.
If you run explain() you will be able to see that after the index scan phase Mongo executes a document fetch phase. This phase should in theory be redundant for a pure index scan, which shows that this is not a pure index scan. So basically Mongo fetches ALL documents that have either of the tags, and only then applies the "$all" logic in the filtering phase.
So what can you do?
if you have prior knowledge of which tag is sparser, you could query that one first and only then filter on the larger tag. I'm assuming this is not really the case, but it's worth considering if possible. If your tags are somewhat static, maybe you can even precalculate this.
Otherwise you will have to consider a restructuring that allows better index usage for this use case; I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you index tags2 in a way that indexes each key of the object separately (e.g. a wildcard index on tags2), this will let mongo skip the "fetch" phase, as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming, to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these 2 are the more reasonable ones to actually consider implementing, depending on performance requirements.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first (see the sketch below). I'm not sure if this would actually be faster (it depends on how Mongo optimizes things), but it could be worth a try.
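A minimal sketch of that aggregation idea, assuming few_posts_have_this is known to be the rarer tag (note that the pipeline optimizer may coalesce adjacent $match stages, which could defeat the intended ordering):
db.posts.aggregate([
  { $match: { tags: { $in: ["few_posts_have_this"] } } },   // rare tag first
  { $match: { tags: { $in: ["most_posts_have_this"] } } }   // then the common one
])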
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which one it is using with cursor.explain(), and you can force it to use your tags index with hint() (see the sketch after this list). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags come first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
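A short sketch of the hint() workflow from point 2 above:
db.posts.getIndexes()   // confirm an index like { tags: 1 } is listed
db.posts.find({ tags: { $all: ["most_posts_have_this", "few_posts_have_this"] } })
        .hint({ tags: 1 })              // force the tags index
        .explain("executionStats")
// Compare totalKeysExamined / totalDocsExamined with and without the hint.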
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

MongoDB - Compound Secondary Index vs Concatenated _id Index

I am designing my database with MongoDB with future scalability in mind. My main concern right now is how to represent the indexes, since, as I have read, this is a crucial factor when scaling huge collections, in terms of RAM consumption and sharding efficiency.
For simplicity, I have two different collections: a users collection which stores the user's username, email and some metadata, and a devices collection which contains a device name, some metadata, and a reference to its owner. One user can have millions of devices (so it is not feasible to store them all in a single user document).
The devices collection should support queries by the whole device identifier (username, device_name), or by the username alone.
In this case I see several different approaches for defining the indexes:
Use a secondary compound index with username and device_name (in this order)
Use a primary index, with an _id containing a string of the form username#device_name
Use an object in the _id field with both values {owner:username, device:device_name}
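A sketch of the three alternatives in the mongo shell (collection names and values are hypothetical):
// 1) default ObjectId _id plus a secondary compound index
db.devices1.createIndex({ username: 1, device_name: 1 })
db.devices1.insertOne({ username: "bob", device_name: "sensor-1" })
// 2) concatenated string as the primary _id
db.devices2.insertOne({ _id: "bob#sensor-1" })
db.devices2.find({ _id: /^bob#/ })   // query by user via a prefix regex on _id
// 3) object _id, plus the extra compound index it needs
db.devices3.insertOne({ _id: { owner: "bob", device: "sensor-1" } })
db.devices3.createIndex({ "_id.owner": 1, "_id.device": 1 })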
To test these indexes, I have run some load against the server. I created three different collections following these three alternatives and filled each with 5M documents. Some data:
The automatically generated _id created by mongo goes unused, as all my queries require username/device, so this approach takes some extra space for indexing. The index size is 524MB. It is efficient when querying both by user and by user/device.
As I am replacing the _id with my own string, the index takes less space, in this case 352MB. I am still able to query efficiently by user (with a prefix regex like /^username#/, explain() reports almost the same results as in 1), and by the exact username/device.
The _id index cannot be changed to a compound index, so a secondary compound index on {_id.owner, _id.device} is required. This results in a huge index size of 1059MB! Queries perform well, as in the previous cases.
So I can discard alternative 3, as it is clearly less efficient. Between alternatives 1 and 2, I prefer 1 as the cleaner approach, but it keeps an _id field I will never use. So at this moment the winning approach seems to be number 2, as it allows me to query efficiently by username or by username/device, and it also takes less index space.
Is there a good reason not to use number 2 and to stick with number 1 instead, for example when selecting the sharding key? Is there something I am missing? I am new to MongoDB and do not want to have problems when scaling my schema.

Skipping the first term of a compound index by using hint()

Suppose I have a Mongo collection with fields a and i. I've populated this collection with {a: 'a', i: index} where index increases iteratively from 0 to 1000.
I know this is very, very wrong, but can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use a compound index if one of your query terms matches the first term of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting part about the way MongoDB performs queries is that it may actually run multiple candidate plans in parallel to determine which one is best. It may have chosen not to use the index due to other experimenting you've done from the shell, or due to when you added the data and whether it was in memory, etc. (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not using it (although you shouldn't put much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is on i alone. Compound indexes only support searches based on a prefix of the indexed fields (meaning the fields must be used in the order in which they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
FYI: Use the verbose option to see a report of all plans that were considered for the find().
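A sketch reproducing the experiment (collection name hypothetical, legacy shell API to match the era of the question):
for (var n = 0; n < 1000; n++) { db.coll.insert({ a: 'a', i: n }) }
db.coll.ensureIndex({ a: 1, i: 1 })
db.coll.find({ i: { $gt: 500 } }).explain()                       // no condition on prefix field a -> full collection scan
db.coll.find({ i: { $gt: 500 } }).hint({ a: 1, i: 1 }).explain()  // the hint forces a scan over the whole index instead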

Multiple indexes with different definitions in mongodb

The question is a very simple one: can you have more than one index in a collection? I suppose you can, but every time I search for multiple indexes I get explanations of compound indexes, and that is not what I'm looking for.
All I want to do is make sure that I can have two simple separate indexes.
(I'm using PHP, but I'll use mongo shell syntax here, which I understand maps directly to the PHP driver.)
db.posts.ensureIndex({ my_id1: 1 }, {unique: true, background: true});
db.posts.ensureIndex({ my_id2: 1 }, {background: true});
I'll only query by one of the fields at a time.
Compound indexes are not what I'm looking for because:
one index is unique and the other is not.
I think it's not going to be the fastest option. (Open the link to understand why I think it's going to be slower: link)
I just want to make sure that the indexes will work.
You sure can have indexes defined the way you have them. From the MongoDB documentation:
How many indexes? Indexes make retrieval by a key, including ordered sequential retrieval, very fast. Updates by key are faster too as MongoDB can find the document to update very quickly. However, keep in mind that each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.
I also recommend you look at how Mongo decides which index to use when it runs a query that filters on both fields, as sketched below.
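A hedged sketch of checking that choice (field values hypothetical):
db.posts.find({ my_id1: 5, my_id2: 7 }).explain()
// Typically one index wins (often the unique, more selective { my_id1: 1 })
// and the other condition is applied as a filter on the fetched documents.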
Also take a look at their Indexing Advice and FAQ page. It will explain things like only one index per query, selectivity, etc.
p.s. This slideshare deck from 10gen suggests there's a limit of 40 indexes per collection.