Lucene "contains" search on subset of index - lucene.net

I have an index with around 5 million documents that I am trying to do a "contains" search on. I know how to accomplish this, and I have explained the performance cost to the customer, but that is what they want. As expected, doing a "contains" search on the entire index is very slow, but sometimes I only want to search a very small subset of the index (say 100 documents or so). I've done this by adding a Filter to the search, which should limit the results correctly. However, I find that this search and the entire-index search perform almost exactly the same. Is there something I'm missing here? It feels like this search is also searching the entire index.

Adding a filter to the search will not limit the scope of the index.
You need to be more clear about what you need from your search, but I don't believe what you want is possible.
Is the subset of documents always the same? If so, maybe you can get clever with multiple indices (e.g. search the smaller index and, if there aren't enough hits, then search the larger index).

You can try SingleCharTokenAnalyzer.

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts; posts have zero or more user-defined tags. Most posts have the most_posts_have_this tag, and few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}}, the query is slow; it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
The short answer is no; this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you query the tags field, imagine that Mongo queries each tag separately and then does an intersection.
If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.
So what can you do?
If you have prior knowledge of which tag is sparser, you could query that one first and only then filter based on the larger tag (see the sketch below). I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, maybe you can even precalculate this.
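A minimal sketch of that two-step approach in the shell (the posts collection name is an assumption):
// First pass: an index scan on the sparse tag returns few documents
var candidates = db.posts.find({ tags: 'few_posts_have_this' }).toArray();
// Second pass: filter the small candidate set on the common tag client-side
var result = candidates.filter(function (p) {
  return p.tags.indexOf('most_posts_have_this') !== -1;
});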
Otherwise you will have to consider restructuring your data to allow better index usage for this use case, though I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you build a wildcard index on tags2, each key of the object will be indexed separately. This will let Mongo skip the "fetch" phase, as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming, to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on your performance requirements.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure whether this would actually be faster (it depends on how Mongo optimizes things), but it could be worth a try.
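A minimal sketch of that aggregation (the posts collection name is an assumption; note that Mongo's optimizer may coalesce adjacent $match stages, which is part of why the speedup is uncertain):
db.posts.aggregate([
  // Match the rarer tag first so the index scan yields a small set
  { $match: { tags: { $in: ['few_posts_have_this'] } } },
  // Then narrow that small set to posts that also have the common tag
  { $match: { tags: { $in: ['most_posts_have_this'] } } }
])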
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which one it is using with cursor.explain(), and you can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes (see the sketch after this list).
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query the full text for a smaller subset of the returned posts (also shown below). This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
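A minimal sketch of the hint and projection suggestions above (the posts collection name and a { tags: 1 } index are assumptions):
// Confirm the tags index exists
db.posts.getIndexes()
// Force the query through the tags index and verify the plan
db.posts.find(
  { tags: { $all: ['most_posts_have_this', 'few_posts_have_this'] } }
).hint({ tags: 1 }).explain()
// Project only the _id to reduce the intermediate data Mongo handles
db.posts.find(
  { tags: { $all: ['most_posts_have_this', 'few_posts_have_this'] } },
  { _id: 1 }
)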
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

MongoDB/Mongoose: what's the best way to count a lot of documents with a filter?

I understand that estimatedDocumentCount() uses metadata to count, which makes it faster when there are a lot of documents. But the drawback is that you can't add a filter to it like you can with countDocuments(). If there are still a lot of documents but you want to use a filter, what's the best way to do that, if there is one?
Well, you got it right.
countDocuments(...) is how you count documents with a filter.
If you're facing issues with speed, I'd advise you to add an index on the fields you're planning to filter with. This way it's an index scan and the result will be almost immediate.
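A minimal sketch (the orders collection and status field are hypothetical):
// Index the field used in the count filter so the count is an index scan
db.orders.createIndex({ status: 1 })
// The filtered count can then be answered from the index alone
db.orders.countDocuments({ status: 'shipped' })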
It would follow that you are filtering based on the contents of the document, and that you would not be able to do so only with metadata.
You could add an index, or insert your data into an exclusive collection using normalized data models.
The documentation also suggests that you could create a view and count based on the metadata.

Any way to make Mongo filtering dynamic records faster?

Scenario
I have a Mongo collection Items that has dynamic item objects in it. Currently I have over 3 million records. I'm using C# with MongoSharp, but I don't think that has anything to do with my problem.
Here is an example Item (it has a lot more fields than just 3):
{
_id : "1234567890",
Code : 888596937,
RefNumber : "GHTZKL",
...
}
AFAIK there is no point in using TextSearch, since the values are not really words, just codes, so it won't give me anything beneficial. I also cannot index them all, since then I would have to index every single field.
Problem
Right now, when I filter data it takes about 1-3 seconds (on an SSD). Is there any way I can make it filter my items faster, or is this as fast as it gets?
You don't mention which field you want to search on, but it sounds like you want to search on any arbitrary attribute. This is a common design and borders on an antipattern for MongoDB. The only way to avoid the collection scan you're getting now is to index the fields you want to search on, but indexing every field when you don't know what the fields will be ahead of time isn't possible. The solution is to name only the common fields (and index them), then group the other fields into name/value pairs in an array in the document. You can then index that array to get your fast searches.
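A minimal sketch of that name/value-pair (attribute) pattern in the shell (the items collection name and attrs field are assumptions):
// Dynamic fields become name/value pairs in a single indexed array
db.items.insertOne({
  Code: 888596937,
  attrs: [
    { k: 'RefNumber', v: 'GHTZKL' }
  ]
})
// One compound index covers every dynamic attribute
db.items.createIndex({ 'attrs.k': 1, 'attrs.v': 1 })
// $elemMatch keeps the name and value paired in the lookup
db.items.find({ attrs: { $elemMatch: { k: 'RefNumber', v: 'GHTZKL' } } })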
A word of caution on NVP arrays: if your array gets very large (hundreds of attributes), your index size will blow up spectacularly. It's best to keep the array size fairly small.
For more information on this design pattern, see Asya's great writeup.

Fast Substring Search in Mongo

We have a Mongo database with about 400,000 entries, each of which has a relatively short (< 20 characters) title. We want to be able to do fast substring searches on these titles (fast enough to use the results in things like autocomplete bars). We are also only searching for prefixes (does the title start with the substring?). What can we do?
If you only do prefix searches, then indexing that field should be enough. Rooted regex queries use the index and should be fast.
Sergio is correct, but to be more specific: an index on that field plus a left-rooted prefix regex without the i (case-insensitivity) flag will make efficient use of the index. This is noted in the docs, in fact:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions
Don't forget to use .explain() if you want to benchmark the queries too.
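A minimal sketch (the entries collection and title field are assumptions):
// An ordinary index on the title field
db.entries.createIndex({ title: 1 })
// A left-rooted, case-sensitive regex can walk the index like a prefix scan
db.entries.find({ title: /^substr/ })
// Verify with explain() that an IXSCAN is used rather than a COLLSCAN
db.entries.find({ title: /^substr/ }).explain()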

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is that read-heavy and write-heavy collections call for completely opposite approaches to indexing. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note that if you sort by an un-indexed field, MongoDB will "yell at you" if you're trying to sort too much data. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By everything I mean all fields you query on, sort by, and retrieve. This will ensure you get maximum performance (if the index fits in RAM).
This situation is not always possible, so you'll have to make choices.
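For the user/date/status example above, that single index would make the query covered. A minimal sketch (the events collection name is an assumption):
// One index holds the filter field, the sort field, and everything retrieved
db.events.createIndex({ user: 1, date: 1, status: 1 })
// Excluding _id lets the query be answered entirely from the index
db.events.find(
  { user: 'alice' },
  { _id: 0, user: 1, date: 1, status: 1 }
).sort({ date: 1 })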
Here are three tips for keeping the index size to a minimum:
Does each of your queries have a lot of results or only a few? => A few: you do not have to index all the fields you retrieve (only the query and sort fields, because few results mean few disk accesses).
Are your query results often the same (i.e. your working set is small)? => Don't index the fields you retrieve, because the results are cached by MongoDB.
Is one of your query fields more selective than another? => Index the more selective field only.