Extremely Slow MongoDb C# Driver 2.0 RegEx Query

I have the following query - it takes about 20-40 seconds to complete (similar queries without RegEx on the same collection take milliseconds at most):
var filter = Builders<BsonDocument>.Filter.Regex("DescriptionLC",
    new BsonRegularExpression(descriptionStringToFindFromCallHere, "i"));
var mongoStuff = GetMongoCollection<BsonDocument>(MongoConstants.StuffCollection);
var stuff = await mongoStuff
    .Find(filter)
    .Limit(50)
    .Project(x => Mapper.Map<BsonDocument, StuffViewModel>(x))
    .ToListAsync();
I saw an answer here that seems to imply that this query would be faster using the following format (copied verbatim):
var names = namesCollection.AsQueryable().Where(name =>
    name.FirstName.ToLower().Contains("hamster"));
However, the project is using MongoDB .NET Driver 2.0, which doesn't support LINQ. So my question comes down to:
a). Would using LINQ be noticeably faster, or about the same? I can update to a version that supports it, but I would rather not.
b). Is there anything I can do to speed this up? I am already querying a field that holds only lower-cased text.
------------END ORIGINAL OF POST------------
Edit: Reducing the number of "stuff" documents returned, by changing .Limit(50) to, say, .Limit(5), reduces the call time linearly: 40 seconds drops to 4 with the latter. I have experimented with different numbers and it seems to be a direct correlation. That's strange to me, but I don't really understand how this works.
Edit 2: It seems that the only solution might be to use "starts with" instead of "contains" regular expressions. Apparently the latter doesn't use indexes efficiently, according to the docs ("Index Use" section).
Edit 3: In the end, I did three things (the field was already indexed):
1). Reduced the number of results returned - this helped dramatically; there is a linear correlation between the number of items returned and the amount of time the call takes.
2). Changed the search to lower-case only - this helped only slightly.
3). Changed the regular expression to search "starts with" rather than "contains" - again, this barely helped. The change for that was:
//Take the stringToSearch and make it into a "starts with" RegEx
var startingWithSearchRegEx = "^" + stringToSearch;
Then pass that into the new BsonRegularExpression instead of just the search string.
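In mongo-shell terms, the combination that lets MongoDB walk the index looks roughly like this (a sketch; the "stuff" collection name is a placeholder, the ascending index stands in for the existing index on the field, and stringToSearch is the variable from the snippet above):
// A plain ascending index on the pre-lowercased field (one-time setup).
db.stuff.createIndex({ DescriptionLC: 1 })
// Anchored AND case-sensitive: dropping the "i" option matters, because a
// case-insensitive regex cannot use the index efficiently even when anchored.
db.stuff.find({ DescriptionLC: { $regex: "^" + stringToSearch.toLowerCase() } }).limit(50)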
Still looking for any feedback!!

Running a regex over hundreds of thousands of documents is not recommended, as it essentially does a collection scan: no index is used at all.
This is the main reason why your query is so slow. It has nothing to do with the .NET driver.
If you have a lot of text, or search for text patterns often, I'd suggest creating a text index on the field of interest and doing a full-text search. Please see the docs for $text.
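In mongo-shell terms, that looks roughly like this (a sketch; the "stuff" collection name is a placeholder, and the field name is carried over from the question):
// One-time setup: a text index on the description field.
db.stuff.createIndex({ DescriptionLC: "text" })
// $text walks the text index instead of scanning every document; note that it
// matches on stemmed words, not arbitrary substrings.
db.stuff.find({ $text: { $search: "some words" } }).limit(50)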

Related

Swift: What is the best way to quickly search through a huge database to find relevant result?

I'm trying to implement a search algorithm that can search through hundreds of thousands of products and display the most relevant searches.
My current process is:
1. Get the user's input and filter out prepositions and punctuation to arrive at keywords.
2. Break the keywords into an array.
3. For each of the keywords, find all the products that contain the keyword in the product description and add them to a RawProductDictionary.
4. Calculate the Levenshtein distance between the keywords and each product description.
5. Create an array of products based on the Levenshtein distance.
This question builds on top of this one:
Swift: How can the dictionary values be arranged based on each item's Levenshtein Distance number
This is my Levenshtein distance function:
func levenshteinDist(test: String, key: String) -> Int {
    let empty = Array<Int>(repeating: 0, count: key.count)
    var last = [Int](0...key.count)
    for (i, testLetter) in test.enumerated() {
        var cur = [i + 1] + empty
        for (j, keyLetter) in key.enumerated() {
            cur[j + 1] = testLetter == keyLetter ? last[j] : min(last[j], last[j + 1], cur[j]) + 1
        }
        last = cur
    }
    return last.last!
}
This is the function that implements step 5:
func getProductData() {
    Global.displayProductArry = []
    var pIndexVsLevNum = [String: Int]()
    for product0 in Global.RawSearchDict {
        let generatedString = product0.value.name.uppercased()
        let productIndex = product0.key
        let relevanceNum = levenshteinDist(test: generatedString, key: self.userWordSearch)
        pIndexVsLevNum[productIndex] = relevanceNum
    }
    print(pIndexVsLevNum)
    Global.displayProductArry = []
    for (k, v) in (Array(pIndexVsLevNum).sorted { $0.1 < $1.1 }) {
        print("\(k):\(v)")
        Global.displayProductArry.append(Global.RawSearchDict[k]!)
    }
}
The code works, but the products are not that relevant to the user's input.
The Levenshtein distance number is not always indicative of relevance; products with shorter descriptions are usually disadvantaged and missed.
What is the best way to quickly search through hundreds of thousands of products in Swift?
I believe you are looking for Full-Text Search.
You could use existing tools for that, rather than creating your own information retrieval process.
It looks like SQLite can give you that.
See: https://medium.com/flawless-app-stories/how-to-use-full-text-search-on-ios-7cc4553df0e0
According to Wikipedia: "Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other."
You should be using Levenshtein distance to compare individual words with each other, not entire product descriptions with a single word. The reason to compare individual words is to determine whether the user has made a typo and actually meant to type something else. Hence the first part of your problem is to clean up the user's query:
First, check for perfect matches against your keyword database.
For words which do not match perfectly, run Levenshtein to create a list of the most closely matching words.
Let's step back for a moment and look at the big picture:
Simply using Levenshtein distance by itself, comparing a single query word against the entire product description, is not the best way to determine the most relevant product, since a product description will normally be much larger than a user's query and will describe a variety of features. Let us assume that the words are correctly spelled, and forget spell-checking for a moment so we can focus on relevancy.
You will have to use a combination of techniques to determine which is the most relevant document to display:
First, create a tf-idf database to determine how important each word is in a product description. Words like "and", "is", "the", etc. are very common and usually will not help you determine which document is most relevant to a user's query.
The longer a product description is, the more often any given word is likely to occur. This is why we need to compute inverse document frequency: to determine how rare a word is across the entire database of documents.
By creating a tf-idf database, you can rank the most important words in a product description, as well as determine how common a word is across all documents. This will help you assign weights to the value of each word.
A high weight in tf-idf is reached by a high term frequency in a given document, and a low document frequency of the term in the whole collection of documents.
Hence, for each word in a query, you must compute a relevancy score for every document in your product-description database. This should ideally be done in advance, so that you can retrieve results quickly. There are multiple ways to compute TF-IDF, so select one option and compute it for every unique word in your documents.
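For concreteness, one common variant (there are several) scores a term t in a description d as:
tf-idf(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is how often t occurs in d, N is the total number of descriptions, and df(t) is the number of descriptions containing t. The log term is what down-weights words that appear everywhere.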
Now, how will you use TF-IDF to produce relevant results?
Here is an example:
Query: "Chocolate Butter Pancakes"
You should have already computed the TF and IDF for each of the three words in the query. A simplistic formula for computing relevance is:
Simplistic Product Description Score: TF-IDF(Chocolate) + TF-IDF(Butter) + TF-IDF(Pancakes)
Compute the Product Description score for every single product description (for the words in the query), and sort the results from highest score to lowest to get the most relevant results.
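As a purely hypothetical illustration with made-up numbers: if description A scores 0.12 for "chocolate", 0.40 for "butter" and 0.00 for "pancakes" (total 0.52), while description B scores 0.05 + 0.10 + 0.45 = 0.60, then B ranks first.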
The above example is a very simple explanation of how to compute relevancy; the question you asked is actually a huge topic. To improve relevancy, you would have to do several additional things:
Stemming, lemmatization and other text-normalization techniques prior to computing the TF-IDF of your product descriptions.
Likewise, you may need to do the same for your search queries.
As you can imagine, the above algorithm to provide sorted relevant results would perform poorly if you have a large database of product descriptions. To improve performance, you may have to do a number of things:
Cache the results of previous queries. If new products are not added / removed, and the product descriptions do not change often, then it becomes much easier.
If descriptions change, or products are added or removed, you need to compute TF-IDF for the entire database again to get the most relevant results. You will also need to throw away your previous cache and cache the new results instead. This means you would have to periodically recompute TF-IDF for your entire database, depending on how often it is updated.
As you can see, even this simple example is already getting complicated to implement, and we haven't even started talking about more advanced techniques in natural language processing, even things as simple as handling synonyms in a document.
Hence this question is simply too broad for anyone to provide an answer on Stack Overflow.
Rather than implement a solution yourself, I would recommend searching for a ready-made solution and incorporating it into your project instead. Search is a common feature nowadays, and there are many solutions available for different platforms. Perhaps you could offload search to a web service, so you are not limited by having to use Swift, and can then just use a ready-made solution like Solr, Lucene, or Elasticsearch.

Autocomplete and text search memory issues in apostrophe-cms: need ideas

I'm having trouble using the text search and the autocomplete, because I have a piece type with 87k+ documents, some of them big (~3.4 MB of text).
I already:
Removed every field from the text index except title, searchBoost and seoDescription; these are the only fields copied to highSearchText, and the field lowSearchText is always set to an empty string.
Modified the standard text index, putting the fields type, published and trash at the beginning of it. I also modified the queries to have equality conditions on these fields. The result returned by the command db.aposDocs.stats() shows:
type_1_published_1_trash_1_highSearchText_text_lowSearchText_text_title_text_searchBoost_text: 12201984 (~11 MB, fits nicely in memory)
Verified that this index is being used, both in the 'toDistinct' query as well as in the final 'toArray' query.
What I think is the biggest problem
The documents have many repeated words in the title, so if the user types a word present in 5k document titles, the server suffers.
Idea I'm testing
The MongoDB docs say that, to improve performance, the entire collection must fit in RAM (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet).
So, I created a separate collection named "search" with just the fields highSearchText (string, indexed as text) and highSearchWords (array, also indexed), which results in a total size of ~19 MB.
By doing the same operations as the standard Apostrophe autocomplete on this collection, I achieved much faster, but similar, results.
I had to write events to automatically update the search collection when a piece changes, but it seems to work so far.
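In mongo-shell terms, the side collection's indexes would look roughly like this (a sketch based on the description above):
// Text index for the full-text part, plain index for the word array.
db.search.createIndex({ highSearchText: "text" })
db.search.createIndex({ highSearchWords: 1 })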
Issues
I'm testing this search collection with the autocomplete. For the simple text search, I'm just limiting the sorted response to 50 results. Maybe I'll have to use the search collection there as well, because the search could still break.
Is there some easier approach I'm missing? Please, any ideas are welcome.

How do I make Algolia Search guarantee that it will return a number of results

I need Algolia to always return me 5 results from a full-text search, even if the query text itself bears little or no relevance to the actual returned results. Before someone suggests it, I have already tried setting the removeWordsIfNoResults option to all of its possible modes, and this still doesn't guarantee that I get my 5 results.
The purpose of this is to create a 'relevant entities' sidebar where the name of the current entity is used to search for other entities.
Any suggestions?
Using the removeWordsIfNoResults=allOptional query parameter is indeed a good way to go: because all query words are required to match an object by default, falling back to "optional" is a good way to still retrieve results if one of the query words (or the combination of words) doesn't match anything.
index.search(query, { removeWordsIfNoResults: 'allOptional' });
Another solution is to always consider all query words as optional (not only as a fallback), to make sure the query foo bar baz is interpreted as OPT(foo) AND OPT(bar) AND OPT(baz) <=> foo OR bar OR baz. The difference is that this query will retrieve more results than the previous one, because a single matching word will be enough to retrieve the object.
index.search(query, { optionalWords: query });
That being said, there is no way to force the engine to retrieve "at least" 5 results. What I would recommend is a small piece of frontend logic:
- do the query with removeWordsIfNoResults or optionalWords
- if the engine returns fewer than 5 results, do another query
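A sketch of that fallback (hypothetical; assumes the JavaScript client with an initialized index as in the snippets above, and ignores duplicates between the two result sets):
index.search(query, { removeWordsIfNoResults: 'allOptional', hitsPerPage: 5 })
    .then(function (content) {
        if (content.hits.length >= 5) {
            return content.hits;
        }
        // An empty query matches every record, ranked by the index's custom
        // ranking, so it can pad the list up to 5 items.
        return index.search('', { hitsPerPage: 5 - content.hits.length })
            .then(function (fallback) {
                return content.hits.concat(fallback.hits);
            });
    });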

Is it possible to make lucene.net ignore case of the field for queries?

I have documents indexed with the field "GuidId" and with "guidid". How can I make Lucene.Net ignore the case, so that the following query searches regardless of the case?
TermQuery termQuery = new TermQuery(new Term("GuidId", guidId.ToString()));
I don't want to write another query for the documents with the field "guidid", i.e. lowercase.
Ideally, don't have field names with funky cases. If you are defining field names dynamically or some such, then you should lowercase them before adding them to the index. That done, it should be easy enough to keep the query fields' names lowercase as well, and you're in good shape.
If, for whatever reason, you are stuck with this case-sensitive data, you'll be stuck expanding your queries to search all the known permutations of the field name to get all your results. Something like:
Query finalQuery = new DisjunctionMaxQuery(0);
finalQuery.Add(new TermQuery(new Term("GuidId", guidId.ToString())));
finalQuery.Add(new TermQuery(new Term("guidid", guidId.ToString())));
DisjunctionMaxQuery would probably be a good choice here, since it only returns the maximum-scoring hit among its query collection, rather than possibly compounding scores across multiple hits.
You can also use MultiFieldQueryParser to similar effect. I don't believe it uses DisjunctionMax, but it doesn't sound like that would be a big deal in this case.

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" using a MongoDB full-text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full-text index, so that I can use them as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (e.g. "running" should match "run"). This is different from the prefix match (e.g. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
    name: 'mysearch',
    prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
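The Map/Reduce step mentioned above might look roughly like this in the mongo shell (a sketch; the articles collection and text field are placeholders):
db.articles.mapReduce(
    function () {
        // Emit each whitespace-separated word of the searchable field.
        (this.text || '').toLowerCase().split(/\s+/).forEach(function (word) {
            if (word.length > 2) { emit(word, 1); }
        });
    },
    function (key, values) {
        // Total usage count per word, for ranking "popular" keywords.
        return Array.sum(values);
    },
    { out: 'keywords' }
);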
A simple workaround I am using right now is to break the text into individual characters and store them as a text-indexed array.
Then, when you do the $search query, you simply break the query up into characters again.
Please note that this only works for short strings (say, length smaller than 32); otherwise the index-building process will take really long, and performance will drop significantly when inserting new records.
You cannot query for all the words in the index, but you can of course query the original document's fields. Also, the words in the search index are not always the full words; they are stemmed, so you probably wouldn't find "blueberry" in the index, but just "blueberri".
I don't know if this might be useful to new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can do the search with $regex by creating the proper index. E.g.:
db.collection.find({ query: { $regex: /querywords/ } }).sort({ 'criteria': -1 }).limit(limit)
You would need an index as follows:
db.collection.ensureIndex({ "query": 1, "criteria": -1 })
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, consider Elasticsearch. It is a JSON-document-driven database, structurally similar to MongoDB. It has an "edge ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches, and you can also match partial words.
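For reference, a minimal sketch of such an autocomplete analyzer in Elasticsearch index settings (the products index name, the autocomplete/autocomplete_tokenizer names, and the gram sizes are placeholder choices):
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
A field indexed with this analyzer matches prefixes of each word; typically you would pair it with a plain search analyzer so the queries themselves are not ngrammed.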