Suppose I have a Mongo collection with fields a and i. I've populated this collection with {a: 'a', i: index}, where index increases iteratively from 0 to 1000, and created a compound index with ensureIndex({a: 1, i: 1}).
I know this is very, very wrong, but I can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use a compound index if one of your query terms matches the first term of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting part about the way MongoDB performs queries is that it may actually run multiple query plans in parallel to determine which is best. It may have chosen not to use the index due to other experimenting you've done from the shell, or even because of when you added the data and whether it was in memory, etc. (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not (although you shouldn't take much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is for i. Compound indexes only support searches based on any prefix of the indexed fields (meaning they must be in the order they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
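A minimal sketch of the prefix rule against the question's setup (plan selection can vary, so verify with explain()):

db.collection.ensureIndex({a: 1, i: 1})
db.collection.find({a: 'a'})                   // prefix {a}: index usable
db.collection.find({a: 'a', i: {$gt: 500}})    // prefix {a, i}: index usable
db.collection.find({i: {$gt: 500}})            // no prefix match: collection scan unless hinted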
FYI: Use the verbose option to see a report of all plans that were considered for the find().
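For example (the explain(true) form is the older shell's verbose spelling; newer shells use a verbosity string):

db.collection.find({i: {$gt: 500}}).explain(true)                  // older shells
db.collection.find({i: {$gt: 500}}).explain('allPlansExecution')   // newer shells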
Related
A non-distributed database has many posts; posts have zero or more user-defined tags; most posts have the most_posts_have_this tag, and few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
The short answer is no. This is due to how Mongo builds an index on an array field:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you query the tags field, imagine Mongo querying each tag separately and then taking the intersection.
If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.
So what can you do?
If you have prior knowledge of which tag is sparser, you could query that one first and only then filter based on the larger tag (see the sketch below). I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, maybe you can even precalculate this.
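A minimal sketch of that idea, assuming few_posts_have_this is known to be the rare tag (whether the planner actually bounds the index scan with the rarer value is not guaranteed, so check explain()):

db.posts.find({
    $and: [
        {tags: 'few_posts_have_this'},     // the hopefully index-bounding equality
        {tags: 'most_posts_have_this'}     // filters the narrowed set
    ]
})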
Otherwise you will have to consider a restructuring that allows better index usage for this use case, though I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you build an index over the keys of tags2 (e.g. a wildcard index, {'tags2.$**': 1}), each key of the object is indexed separately. This can let Mongo skip the "fetch" phase, as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming, to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on performance requirements.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
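A sketch of that aggregation (matching the rarer tag first; note the optimizer may coalesce adjacent $match stages, so verify with explain before relying on it):

db.posts.aggregate([
    {$match: {tags: {$in: ['few_posts_have_this']}}},    // rarer tag first
    {$match: {tags: {$in: ['most_posts_have_this']}}}    // filter the subset
])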
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
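For example (a sketch assuming a plain {tags: 1} index):

db.posts.getIndexes()    // confirm the tags index appears, e.g. {tags: 1}
db.posts.find(
    {tags: {$all: ['most_posts_have_this', 'few_posts_have_this']}}
).hint({tags: 1}).explain()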
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
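A minimal sketch of such a projection, returning only the post IDs:

db.posts.find(
    {tags: {$all: ['most_posts_have_this', 'few_posts_have_this']}},
    {_id: 1}    // project only the ID; fetch full documents later as needed
)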
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.
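For reference, a covered query is answered entirely from the index with no document fetch. A sketch of what checking for one looks like (as noted above, a multikey index on an array field generally cannot cover a query on that field):

db.posts.find(
    {tags: 'few_posts_have_this'},
    {_id: 0, tags: 1}    // project only indexed fields
).explain('executionStats')
// totalDocsExamined: 0 would indicate the query was covered.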
I have a collection which has just two documents in it, both are used to keep track of a certain count.
I know this will never have more than 2 documents, but when the counter value is increased it uses findAndModify, and explain shows a COLLSCAN.
I believe it is okay to have a COLLSCAN here, as having an index over the search key won't give any performance benefit. Any thoughts?
Indexes are not always good. The main things to understand about how they work are:
Indexes use memory in exchange for better performance. Every time you want to use an index, it needs to be loaded into MongoDB's RAM (if it's not there yet).
When the Mongo engine gets a query, it needs to check which index to use (if there are any) and, for each index, whether it can use it (i.e. whether it contains the relevant query parameters, which are the union of the find, projection, and sort fields). Mongo then decides whether to use the best index found, do a collection scan, or both.
Indexes require maintenance - every insert/update/delete operation also requires updating the index.
There is a lot of overhead in using an index, so the benefit should be several times greater than that of a simple collection scan. You can continue reading here.
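Applied to the question's two-document counter collection: a COLLSCAN examines at most two documents, so it is harmless, and an index on the search key would only add write overhead. A sketch with made-up collection and field names:

db.counters.findAndModify({
    query: {name: 'page_views'},     // hypothetical counter document
    update: {$inc: {count: 1}},
    new: true                        // return the updated document
})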
I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking whether the user exists or not. I do this by querying db.collection.find({user_id: "testuser-123"}).
If no record returns, I save the user along with some other not-so-important data like first name, last name and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alpha-numeric field. I also tried setting language:none. But it turned out that my query returned in ~12ms, instead of the 6ms it took without indexing.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (does it only work for numeric values?).
Can anyone recommend me the best type of index for this case or quickest query to check if the user exists? Or maybe is MongoDB not the best match for this?
Some random thoughts first:
A text index is used to help full text search. Given your description this is not what is needed here, as, if I understand it well, you need to use an exact match of the whole field.
Without any index, MongoDB will use a linear search. Using big O notation, this is an O(n) operation. With an (ordered) index, the search is performed in O(log(n)). That means that an index will dramatically speed up queries when you have many documents, but you will not necessarily see any improvement if you have a small number of documents. In that case, O(n) can even outperform O(log(n)). Some database management systems don't even bother using the index if the optimizer estimates that it will not provide enough benefit. I don't know if MongoDB does that, though.
Given your use case, I think the proper index is a unique index. This is an ordered index that would prevent insertion of two documents with the same user_id.
In your application, do not test before insert. In a real application, this could lead to a race condition when you have concurrent inserts. If you use a unique index, just try to insert, and be prepared to gracefully handle an error caused by a duplicate key.
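A minimal sketch of that pattern (the duplicate-key error code 11000 is the usual one, but check your driver's error shape):

db.users.createIndex({user_id: 1}, {unique: true})
try {
    db.users.insertOne({user_id: 'testuser-123', first_name: 'Test'})
} catch (e) {
    if (e.code === 11000) {
        // duplicate key: the user already exists
    } else {
        throw e
    }
}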
I understand that using the output of .explain() on a MongoDB query, you can look at the difference between n and nscanned to determine if a full collection scan has been performed, or if an index has been used. The docs state
You want n and nscanned to be close in value as possible.
Kyle Banker's excellent book MongoDB in Action says something very similar:
Generally speaking, you want the values of n and nscanned to be as close together as possible. When doing a collection scan, this is almost never the case.
Clearly neither of these statements is definitive about comparing n and nscanned. What proportion of difference generally implies a full collection scan - 10%, 20%, 30%+? Are there any other ways to check whether a full collection scan has been done?
The answers above are NOT completely correct.
A collection scan will also be performed where an index is used for sorting but cannot assist the match criteria. In such a case, all the docs are scanned (in index order) to find docs which match the find criteria. Another possibility is that there may be a partial collection scan, where the index is able to narrow the subset of docs according to one or more find criteria but still needs to scan this subset of docs to find matches for the full find criteria.
In these situations, explain will show an index being used and not BasicCursor. So whilst the presence of BasicCursor in explain is indicative of a collection scan being performed, the absence of it doesn't mean a collection scan was not performed.
Using --notablescan also won't help where an index is used for sorting, because queries only raise an exception when no index is used at all; it doesn't check whether the index was used for the match or the sort.
The only foolproof method of determining whether a collection scan was performed is to compare the index keys with the match criteria from the query. If the index selected by the query optimiser (and shown in explain) is not capable of answering the query match criteria (i.e. it covers different fields), then a collection scan is needed.
What proportion of difference generally implies a full collection scan - 10%, 20%, 30%+?
This is impossible to say, but if the gap really matters then you could be seeing a performance degradation of up to 200% for an average find; so yes, you WILL notice it. It is much like any other database on this front.
Are there any other ways to check if a full collection scan has been done?
You could start MongoDB with a flag that tells it to never do a full table scan, in which case it will throw an exception when it attempts to: http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--notablescan
However, the best way is just to use explain here; you will know when a query does not use an index and is forced to scan the entire collection, whether from disk or memory.
The definitive answer is in the first line of explain() output.
If it says cursor type is "BasicCursor" then it was a simple collection scan.
Otherwise it will say what type of index it used and the name of the index, e.g. "BtreeCursor id".
See the docs here: http://docs.mongodb.org/manual/reference/explain/#explain-output-fields-core for the same explanation.
Strictly, it seems that a full table scan has only been done when the cursor is a BasicCursor.
If there is a BtreeCursor, then a full table scan may still effectively have been done to find records, with that btree index only having been used for sorting. Looking only at the output of explain, can you really be sure it was a full table scan without counting records and checking which indexes exist?
What would be clear, in the context of the question, is that the query is not efficient and that a better index is needed, or should be hinted at.
You can check the stage in the explain output (from the MongoDB docs):
Stages are descriptive of the operation; e.g.
- COLLSCAN for a collection scan
- IXSCAN for scanning index keys
- FETCH for retrieving documents
- SHARD_MERGE for merging results from shards
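A sketch of pulling the stage out of a modern explain document (collection and field names are illustrative):

var plan = db.posts.find({tags: 'few_posts_have_this'}).explain().queryPlanner.winningPlan
printjson(plan)
// e.g. {stage: 'FETCH', inputStage: {stage: 'IXSCAN', indexName: 'tags_1', ...}}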
The question is a very simple one, can you have more than one index in a collection. I suppose you can, but every time I search for multiple indexes I get explanations on compound indexes and that is not what I'm looking for.
All I want to do is make sure that I can have two simple separate indexes.
(I'm using PHP, but I'll use mongo shell formatting here, which I understand just as well.)
db.posts.ensureIndex({ my_id1: 1 }, {unique: true, background: true});
db.posts.ensureIndex({ my_id2: 1 }, {background: true});
I'll only search for one index at a time.
Compound indexes are not what I'm looking for because:
one index is unique and the other is not.
I think it's not going to be the fastest option. (Open the link to understand why I think it's going to be slower: link)
I just want to make sure that the indexes will work.
You sure can have indexes defined the way you have them. From the MongoDB documentation:
How many indexes? Indexes make retrieval by a key, including ordered sequential retrieval, very fast. Updates by key are faster too as MongoDB can find the document to update very quickly. However, keep in mind that each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.
I also recommend you look at how Mongo will decide what index to use when it comes to running a query that goes by both fields.
Also take a look at their Indexing Advice and FAQ page. It will explain things like only one index per query, selectivity, etc.
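A quick sketch of verifying which of the two indexes the planner picks when a query touches both fields (values are illustrative):

db.posts.find({my_id1: 5, my_id2: 7}).explain()
// The winning plan will use at most one of the two indexes;
// hint({my_id2: 1}) forces the other one if needed.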
p.s. This slideshare deck from 10gen suggests there's a limit of 40 indexes per collection.