MongoDB index search returns the entire collection when using a string contains operation - mongodb

Why would MongoDB (4.2.6) return every row from an index (collation locale: en_US, strength: 1) when searching for a string contained in a document field? Example query:
db.eClearFaces.aggregate()
.match({
"Name": /Test/s
})
.collation({ locale: "en_US", "strength": 1 })
The index it is using is:
Name is simply a string field on the document. The resulting query plan shows that every single record in the collection is returned:
You can see that the IXSCAN stage returned 56k+ docs (where I expected it to return only 6). That caused the next stage to fetch all 56k docs, of which it returned 6 (the correct count).
I am confused about why. The collation is configured the same for both the index and the query, and it's obviously hitting the index. I don't understand why it's returning all those extra rows to the next stage.
Index output from profiler:
Did I miss a MongoDB index or query fundamental?

The solution ended up being in the MongoDB docs:
The $regex implementation is not collation-aware and is unable to
utilize case-insensitive indexes.
https://www.mongodb.com/docs/manual/reference/operator/query/regex/
So supplying the collation ended up hurting performance. The solution was to use an index without a collation. The query still performed an index scan as expected, but the scan passed far fewer entries on to the fetch stage.
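A minimal sketch of the fix described above, written for the mongo shell. The collection and field names come from the question; the index name `Name_simple` and the wrapper function are assumptions made up for this example. It creates an index with the default (simple) collation and runs the regex with no collation on the query, so the explain output can be compared against the collated version.

```javascript
// Sketch only: pass in the shell's `db` object, e.g. compareRegexPlan(db).
// "Name_simple" is a hypothetical index name chosen for this example.
function compareRegexPlan(db) {
  // Index with the default (simple) binary collation.
  db.eClearFaces.createIndex({ Name: 1 }, { name: "Name_simple" });

  // Same kind of regex as in the question, with no collation on the query.
  // hint() pins the planner to the new index so the stats are comparable.
  return db.eClearFaces
    .find({ Name: /Test/ })
    .hint("Name_simple")
    .explain("executionStats");
}
```

In the returned document, comparing `executionStats.totalKeysExamined` with `executionStats.totalDocsExamined` should show how many index entries were passed on to the fetch stage.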

Related

Mongodb searching on array / indexing

I'm using the airbnb sample set and it has a field that looks like:
"amenities": ["TV", "Cable TV", "Wifi"....
So I'm trying to do a case-INsensitive, wildcard search (on one or more values passed in).
Only thing I've found that works is:
{ amenities: { $in: [ /wi/ ] }}
Is that the best way?
So I ran it in Compass on the dataset as imported (5600 docs). Explain says it took ~20ms on my machine and warned there was no index. I then created an index on the amenities field and the same search jumped up to ~100ms. I just created the index through the Compass UI, so I'm not sure why it's taking 5x as long with an index. Is there a better way to do this?
The way to run that query is:
{ amenities: /wi/i }
//better but not always useful
{ amenities: /wi/i }, { amenities:1, _id:0 }
A regex match already traverses the array, and for case-insensitive matching the i flag must be set in the regex options.
With a multikey index, the second query won't be a covered query. Otherwise, it would be blazing fast.
I've tested a similar search with and without an index, though. Execution time was reduced 10x (1500ms to 150ms, in a huge collection), measured with Mongo Hacker.
As you report, executionTimeMillis is not that different. But still smaller.
The reason you don't see a huge decrease in time is that the index stores each array entry separately. When it finds a match, it goes back to the collection to fetch the whole array field instead of answering from the index.
Indexes are probably not very useful for this kind of array search.
When querying with an unanchored regex, the query executor will have to scan every index key to see if there is a match.
You might find a collated index to be helpful.
Create an index with the appropriate collation (strengths 1 and 2 are case-insensitive), like:
db.collection.createIndex({amenities:1},{collation:{locale:"en",strength:1}})
Then query using the same collation:
db.collection.find({amenities:"wifi"}).collation({locale:"en",strength:1})
The search will be case insensitive, and it can efficiently use the index.
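The earlier point that an unanchored regex has to scan every index key can be illustrated with a toy model in plain JavaScript. This is not how MongoDB is implemented; it treats an index as a sorted array of keys (the sample keys are made up) and counts how many keys each kind of search has to examine.

```javascript
// Toy model of an index: a sorted array of keys (hypothetical sample values).
const keys = ["Cable TV", "Kitchen", "TV", "Wifi", "Wifi extender", "Window guards"].sort();

// Prefix search (like /^Wif/): binary-search to the first candidate key,
// then walk forward until the prefix stops matching.
function prefixScan(sortedKeys, prefix) {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] < prefix) lo = mid + 1;
    else hi = mid;
  }
  const matches = [];
  let examined = 0;
  for (let i = lo; i < sortedKeys.length; i++) {
    examined++;
    if (!sortedKeys[i].startsWith(prefix)) break;
    matches.push(sortedKeys[i]);
  }
  return { matches, examined };
}

// Unanchored search (like /wi/i): the match can be anywhere in the key,
// so the sort order does not help and every key is tested.
function unanchoredScan(sortedKeys, regex) {
  const matches = sortedKeys.filter((k) => regex.test(k));
  return { matches, examined: sortedKeys.length };
}
```

With these six keys, `prefixScan(keys, "Wif")` only touches the keys around the `Wif` boundary, while `unanchoredScan(keys, /wi/i)` reports all six keys as examined. In MongoDB itself, the bounded scan applies to case-sensitive prefix expressions such as `/^Wif/`.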

Why does mongodb not use index scan but collection scan with find()?

I am using MongoDB 3.2.4.
When I execute db.mytable.find().explain(), the winning plan is COLLSCAN.
But when I execute db.mytable.find().hint({_id:1}).explain(), the winning plan is IXSCAN.
So here comes a question: since _id is the default index of a table, why doesn't MongoDB use this index for the query?
An index can be used when there is a filter or a sort operation, that is, when the fields in the index are used in the filter predicate and/or the sort. In your case, the find method has neither a filter nor a sort, so no index is used, and you can see that in the query plan as a collection scan. That is as expected. But when you provide a hint to the find method, the query optimizer tries to use the index, and in your case it did (you see it in the query plan as an IXSCAN). In either case, with or without the hint, the find has to scan all the documents or all the keys in the index.
The _id field has a default unique index, yes, but unless you use the _id field in the query filter predicate or in a sort, the query will not use it (unless you explicitly ask for the index with a hint). You can verify this with the queries db.mytable.find( { _id: 123 } ) or db.mytable.find( { } ).sort( { _id: -1 } ); the query plan will show an index scan even though you do not specify a hint.
The main purpose of the indexes is to make your queries run fast; it is about query performance. It has to be a query with filter predicate and/or a sort operation to use an index (and the fields used in the filter or sort must be indexed for performance). With the find method, in your case, without any of the two you are just accessing all the documents as they are in the collection and the index is of no use (and the query optimizer shows that in the plan).
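To compare the plans described above, a small helper can list the stages of a winning plan. This is a sketch that assumes the classic explain layout, where `queryPlanner.winningPlan` is a tree linked through `inputStage`; the helper name and the two sample explain documents are hand-written for illustration and are heavily abbreviated.

```javascript
// Collect the stage names of a winning plan, outermost stage first.
// Assumes single-child plans linked by `inputStage`; plans that branch
// through an `inputStages` array are not handled by this sketch.
function planStages(explainOutput) {
  const stages = [];
  let node = explainOutput.queryPlanner.winningPlan;
  while (node) {
    stages.push(node.stage);
    node = node.inputStage;
  }
  return stages;
}

// Abbreviated, hand-written explain shapes (illustrative only).
const noFilter = { queryPlanner: { winningPlan: { stage: "COLLSCAN" } } };
const hinted = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "_id_" },
    },
  },
};
```

In the shell, the same helper could be applied to real output, for example `planStages(db.mytable.find().explain())` versus `planStages(db.mytable.find().sort({ _id: -1 }).explain())`.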

CosmosDB MongoDB 3.6 fails sort() query with compounded index

Newbie MongoDB & CosmosDB user here. I've read the answer to the question "How does MongoDB treat find().sort() queries with respect to single and compound indexes?" and the official MongoDB docs, and I believe my index creation mirrors that answer, so I am leaning towards this being a CosmosDB issue. But according to their documentation, CosmosDB 3.6 supports compound indexes as well, so I am at a loss right now.
I am able to run sort() queries like db.Videos.find().sort({"PublishedOn": 1}) from the mongo command line on a collection with an index created as db.Videos.createIndex({"PublishedOn": 1}) or db.Videos.createIndex({"PublishedOn": -1}).
And when I add a 'where' clause to the find like this db.Videos.find({"IsPinned": false}).sort({"PublishedOn": 1}) the above index still works.
However, I now have document lookups which I want to avoid, so I dropped the above single-field index and created a compound index like this: db.Videos.createIndex({"IsPinned": 1, "PublishedOn": 1}) or db.Videos.createIndex({"PublishedOn": 1, "IsPinned": 1}). But now the query always fails with the error: The index path corresponding to the specified order-by item is excluded.
Is this a limitation of CosmosDB or is my 'ordering' in the index bad?
The issue with CosmosDB is that it expects all WHERE (filter) fields to appear in the ORDER BY (sort) clause as well, in exactly the same order; otherwise it won't use the index.
Creating an index as db.Videos.createIndex({"IsPinned": 1, "PublishedOn": 1}) and then updating the query to be db.Videos.find({"IsPinned": false}).sort({"IsPinned": 1, "PublishedOn": 1}) works like a charm.
I inferred this from reading the CosmosDB documentation on indexing policies (https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy) as the MongoDB documentation suddenly stops after the index creation (https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb-indexing) section.

Fundamental misunderstanding of MongoDB indices

So, I read the following definition of indexes from the MongoDB docs.
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
Indexes are special data structures that store a small portion of the
collection’s data set in an easy to traverse form. The index stores
the value of a specific field or set of fields, ordered by the value
of the field. The ordering of the index entries supports efficient
equality matches and range-based query operations. In addition,
MongoDB can return sorted results by using the ordering in the index.
I have a sample database with a collection called pets. Pets have the following structure.
{
"_id": ObjectId("123abc123abc"),
"name": "My pet's name"
}
I created an index on the name field using the following code.
db.pets.createIndex({"name":1})
What I expect is that the documents in the collection, pets, will be indexed in ascending order based on the name field during queries. The result of this index can potentially reduce the overall query time, especially if a query is strategically structured with available indices in mind. Under that assumption, the following query should return all pets sorted by name in ascending order, but it doesn't.
db.pets.find({},{"_id":0})
Instead, it returns the pets in the order that they were inserted. My conclusion is that I lack a fundamental understanding of how indices work. Can someone please help me to understand?
Yes, this is a misunderstanding of how indexes work.
Indexes don't change the output of a query, only the way the query is processed by the database engine. So db.pets.find({},{"_id":0}) will return the documents in natural order irrespective of whether there is an index or not.
Indexes are used only when your query makes use of them. Thus,
db.pets.find({name : "My pet's name"},{"_id":0}) and db.pets.find({}, {_id : 0}).sort({name : 1}) will use the {name : 1} index.
You should run explain on your queries to check whether indexes are being used or not.
You may want to refer to the documentation on how indexes work.
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/
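The distinction in this answer can be pictured with a toy model in plain JavaScript. It is only an analogy, not MongoDB's implementation: the collection keeps its documents in natural order, and the index is a separate sorted structure that is consulted only when the query asks for it. The pet names are made-up sample data.

```javascript
// Toy model: the collection keeps insertion (natural) order.
const pets = [{ name: "Rex" }, { name: "Bella" }, { name: "Milo" }];

// The index is a separate list of [key, position] entries sorted by key.
const nameIndex = pets
  .map((doc, pos) => [doc.name, pos])
  .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));

// Like find({}): no filter or sort, so the index is never consulted.
function findAll(collection) {
  return collection.map((doc) => doc.name);
}

// Like find({}).sort({ name: 1 }): walk the index in key order
// and fetch each document by its position.
function findSortedByName(collection, index) {
  return index.map(([, pos]) => collection[pos].name);
}
```

Creating `nameIndex` leaves the result of `findAll(pets)` unchanged; only `findSortedByName(pets, nameIndex)` returns the names in ascending order.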

how to structure a compound index in mongodb

I need some advice in creating and ordering indexes in mongo.
I have a post collection with 5 properties:
Posts
status
start date
end date
lowerCaseTitle
sortOrder
Almost all the posts will have the same status of 1 and only a handful will have a rejected status. All my queries will filter on status, start and end dates, and sort on sortOrder. I also will have one query that does a regex search on the title.
Should I set up a compound key on {status:1, start:1, end:1, sort:1}? Does it matter which order I put the fields in the compound index - should I put status first in the compound index since it's the most broad? Is it better to do a compound index rather than a single index on each property? Does mongo only use a single index on any given query?
Are there any hints for indexes on lowerCaseTitle if I'm doing a regex query on that?
sample queries are:
db.posts.find({status: {$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
db.posts.find( {lowerCaseTitle: /japan/, status:{$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
That's a lot of questions in one post ;) Let me go through them in a practical order :
Every query can use at most one index (with the exception of top level $or clauses and such). This includes any sorting.
Because of the above, you will definitely need a compound index for your problem rather than separate per-field indexes.
Low-cardinality fields (that is, fields with very few unique values across your dataset) should usually not be in the index, since their selectivity is very limited.
The order of the fields in your compound index matters, and so does the relative direction of each field in your compound index (e.g. "{name:1, age:-1}"). There's a lot of documentation about compound indexes and index field directions on mongodb.org, so I won't repeat all of it here.
Sorts will only use the index if the sort field is in the index and is the field in the index directly after the last field that was used to select the resultset. In most cases this would be the last field of the index.
So, you should not include status in your index at all: once the index walk has eliminated the vast majority of documents based on the higher-cardinality fields, it will have at most 2-3 documents left in most cases, which a status index hardly optimizes (especially since you mentioned those 2-3 documents are very likely to have the same status anyway).
Now, the last note that's relevant in your case is that when you use range queries (and you do), the index won't be used for sorting anyway. You can check this by looking at the "scanAndOrder" value of your explain() once you test your query. If that value exists and is true, it means the result set is sorted in memory (scan and order) rather than by using the index directly. This cannot be avoided in your specific case.
So, your index should therefore be :
db.posts.createIndex({start:1, end:1})
and your query (order modified for clarity only, query optimizer will run your original query through the same execution path but I prefer putting indexed fields first and in order) :
db.posts.find({start: {$lt: today}, end: {$gt: today}, status: {$gte:0}}).sort({sortOrder:1})