Lucene.net Fuzzy Phrase Search - lucene.net

I have tried this myself for a considerable period and looked everywhere around the net - but have been unable to find ANY examples of Fuzzy Phrase searching via Lucene.NET 2.9.2. ( C# )
Is something able to advise how to do this in detail and/or provide some example code - I would seriously seriously appreciate any help as I am totally stuck ?

I assume that you have Lucene running and created a search index with some fields in it. So let's assume further that:
var fields = ... // a string[] of the field names you wish to search in
var version = Version.LUCENE_29; // your Lucene version
var queryString = "some string to search for";
Once you have all of these you can go ahead and define a search query on multiple fields like this:
var analyzer = LuceneIndexProvider.CreateAnalyzer();
var query = new MultiFieldQueryParser(version, fields, analyzer).Parse(queryString);
Maybe you already got that far and are only missing the fuzzy part. I simply add a tilde ~ to every word in the queryString to tell Lucene to do a fuzzy search for all words in the queryString:
if (fuzzy && !string.IsNullOrEmpty(queryString)) {
// first escape the queryString so that e.g. ~ will be escaped
queryString = QueryParser.Escape(queryString);
// now split, add ~ and join the queryString back together
queryString = string.Join("~ ",
queryString.Split(' ', StringSplitOptions.RemoveEmptyEntries)) + "~";
// now queryString will be "some~ string~ to~ search~ for~"
}
The key point here is that Lucene uses fuzzy search only for terms that end with a ~. That and some more helpful info was found on
http://scatteredcode.wordpress.com/2011/05/26/performing-a-fuzzy-search-with-multiple-terms-through-multiple-lucene-net-document-fields/.

Related

How to check if list of string contains (substring) value in EpiFind?

I have indexed an object which has a list of strings like ["ZZA-KL-2A", "ZZA-KL-ZZB"]. I want to search and get all items which starts with a certain 3 letter code. So I want to check each item in the list and check something like 'StartsWith'.
I can see from documentation that we have something like Match, MatchContained but nothing for start with for the list of string items.
Please note that this question is not related to ordinary string comparison in C# or LINQ before flagging the question.
Just use a filter
var searchQuery = client.Search<MyContent>()
.Filter(x => x.OrderNumber.StartsWith("Find"));
https://world.episerver.com/documentation/developer-guides/search-navigation/NET-Client-API/searching/Filtering/
You may use Prefix or PrefixCaseInsensitive:
Matching by beginning of a string (startsWith)
The Prefix method lets you match by the beginning of a string. The
following search matches blog posts titled Find and Find rocks! but
not find, Finding or Hello Find.
var searchQuery = client.Search<BlogPost>().Filter(x =>
x.Title.Prefix("Find"));
Use the PrefixCaseInsensitive method to match by the beginning of a
string in a case-insensitive way. The following search matches blog
posts titled Find, Find rocks! and Find but not Finding or Hello Find.
var searchQuery = client.Search<BlogPost>().Filter(x =>
x.Title.PrefixCaseInsensitive("Find"));
Source: https://docs.developers.optimizely.com/digital-experience-platform/v1.1.0-search-and-navigation/docs/strings#matching-by-beginning-of-a-string-startswith

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)

How to search across all the fields?

In Lucene, we can use TermQuery to search a text with a field. I am wondering how to search a keyword across a bunch of fields or all the searchable fields?
Another approach, which doesn't require to index anything more than what you already have, nor to combine different queries, is using the MultiFieldQueryParser.
You can provide a list of fields where you want to search on and your query, that's all.
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
Version.LUCENE_41,
new String[]{"title", "content", "description"},
new StandardAnalyzer(Version.LUCENE_41));
Query query = queryParser.parse("here goes your query");
This is how I would do it with the original lucene library written in Java. I'm not sure whether the MultiFieldQueryParser is available in lucene.net too.
Two approaches
1) Index-time approach: Use a catch-all field. This is nothing but appending all the text from all the fields (total text from your input doc) and place that resulting huge text in a single field. You've to add an additional field while indexing to act as a catch-all field.
2) Search-time approach: Use a BooleanQuery to combine multiple queries, for example TermQuery instances. Those multiple queries can be formed to cover all the target fields.
Example check at the end of the article.
Use approach 2 if you know the target-field list at runtime. Otherwise, you've got to use the 1st approach.
Another easy approach to search across all fields using "MultifieldQueryParser" is use IndexReader.FieldOption.ALL in your query.
Here is example in c#.
Directory directory = FSDirectory.Open(new DirectoryInfo(HostingEnvironment.MapPath(VirtualIndexPath)));
//get analyzer
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
//get index reader and searcher
IndexReader indexReader__1 = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader__1);
//add all possible fileds in multifieldqueryparser using indexreader getFieldNames method
var queryParser = new MultiFieldQueryParser(Version.LUCENE_29, indexReader__1.GetFieldNames(IndexReader.FieldOption.ALL).ToArray(), analyzer);
var query = queryParser.Parse(Criteria);
TopDocs resultDocs = null;
//perform search
resultDocs = indexSearch.Search(query, indexReader__1.MaxDoc());
var hits = resultDocs.scoreDocs;
click here to check out my pervious answer to same quesiton in vb.net

handling wildcard search with Special character in Lucene.net search

how do i perform wildcard search a word in lucene that contain special character. for example i have a word like "91-95483534" if i search like "91*" it works and if i search like "91-95483534" also works fine. but my senario is that to search "91-9548*". if i perform like this "91-9548*". i got no output. am i missing anything. my actual code is given below:
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(new string[] {"column1","column2"}, new StandardAnalyzer());
queryParser.SetAllowLeadingWildcard(true);
Query query = queryParser.Parse(QueryParser.Escape(strKeyWord) + "*");
As you used StandardAnalyzer, that indexed your word as 91 and 95483534 if you used INDEX_ANALYZED when you index....
if you want to search as 91-9548* , use INDEX_NOT_ANALYZED when you index that specified field which have "91-95483534" as terms
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/core/org/apache/lucene/document/Field.Index.html

Make Lucene index a value and store another

I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. e.g. Consider the value:
this_example-has some/weird (chars) 100%
I want it stored right like that (so that I can retrieve exactly that for showing in the results list), but I want lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed chars by a white-space before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have realy creepy names. I think this would improve search assertiveness.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, #"[\W_]+", " ");
You mention that you need another Analyzer for some searching functionality. Dont forget the PerFieldAnalyzerWrapper which will allow you to use different analyzers within the same document.
public static void Main() {
var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());
IndexWriter writer = null; // TODO: Retrieve these.
Document document = null;
writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the work of the analyzer. And I'd start by using a tool like luke to see what the standard analyzer does with your term before getting into what to use -- it tends to do a good job stripping noise characters and words.