Lucene.NET stemming problem - lucene.net

I'm running into a problem using the SnowballAnalyzer in Lucene.NET. It works great for some words, but for others it returns no results at all, and I'm not sure how to dig further to find out what is happening. I am testing the search on the USDA Food Description file, which can be found here (http://www.ars.usda.gov/SP2UserFiles/Place/12354500/Data/SR23/asc/FOOD_DES.txt). I'm using the English stemming algorithm. I get the following results when searching for "eggs":
Bagels, egg
Bread, egg
Egg, whole, raw, fresh
Egg, white, raw, fresh
Egg, yolk, raw, fresh
Egg, yolk, raw, frozen
Egg, whole, cooked, fried
...
Those results are great. However, I get no results at all when searching for "apple". When I use the StandardAnalyzer and search for "apple", I get the following results:
Croissants, apple
Strudel, apple,
Babyfood, juice, apple
Babyfood, apple-banana juice
...
Not the best results, but at least it's showing something. Anyone know why the stemming analyzer would be filtering in such a way that I would not get any results?
Edit: Here is my prototype code that I'm working with.
static string[] Search(string searchTerm)
{
    //Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English");
    Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
    Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
    Lucene.Net.Search.Query query = parser.Parse(searchTerm);
    Lucene.Net.Search.Searcher searcher = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo("./index/")), true);
    var topDocs = searcher.Search(query, null, 10);
    List<string> results = new List<string>();
    foreach(var scoreDoc in topDocs.scoreDocs)
    {
        results.Add(searcher.Doc(scoreDoc.doc).Get("raw"));
    }
    return results.ToArray();
}

Are you sure you used Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English") to write your index? You have to use the same analyzer to write and query the index.
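
To see why a mismatch would produce exactly the pattern in the question, here is a minimal sketch in plain C# with no Lucene dependency. It assumes the index was written without stemming and queried with stemming, which the question does not confirm. `PlainTokens` and `ToyStem` are hypothetical stand-ins, not the real StandardAnalyzer or Snowball algorithm; the two toy rules only reproduce the stems the English stemmer gives for these two words ("eggs" becomes "egg", "apple" becomes "appl").

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnalyzerMismatchDemo
{
    // Stand-in for a non-stemming analyzer: split on separators and lowercase.
    public static IEnumerable<string> PlainTokens(string text)
    {
        return text.Split(new[] { ' ', ',', '-' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(t => t.ToLowerInvariant());
    }

    // Toy stemmer (NOT the Snowball algorithm): drop a trailing "s", then a trailing "e".
    public static string ToyStem(string token)
    {
        if (token.EndsWith("s", StringComparison.Ordinal)) token = token.Substring(0, token.Length - 1);
        if (token.EndsWith("e", StringComparison.Ordinal)) token = token.Substring(0, token.Length - 1);
        return token;
    }

    // Stand-in for a stemming analyzer: same tokenization, then stem each token.
    public static IEnumerable<string> StemmedTokens(string text)
    {
        return PlainTokens(text).Select(t => ToyStem(t));
    }

    // The set of terms an index would hold after analyzing every document.
    public static HashSet<string> BuildIndex(IEnumerable<string> docs, Func<string, IEnumerable<string>> analyze)
    {
        return new HashSet<string>(docs.SelectMany(analyze));
    }

    // A query matches only if each analyzed query term is literally present in the index.
    public static bool Matches(HashSet<string> index, string query, Func<string, IEnumerable<string>> analyze)
    {
        return analyze(query).All(t => index.Contains(t));
    }

    public static void Main()
    {
        var docs = new[] { "Bagels, egg", "Croissants, apple" };
        var plainIndex = BuildIndex(docs, PlainTokens);
        var stemmedIndex = BuildIndex(docs, StemmedTokens);

        // Stemmed query against an unstemmed index:
        Console.WriteLine(Matches(plainIndex, "eggs", StemmedTokens));    // True: "egg" is in the index
        Console.WriteLine(Matches(plainIndex, "apple", StemmedTokens));   // False: "appl" is not
        // Same analysis on both sides:
        Console.WriteLine(Matches(stemmedIndex, "apple", StemmedTokens)); // True
    }
}
```

If this is what is happening, rebuilding the index with the same analyzer that the query parser uses should make "apple" match again.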

Related

Reject method of collection does not work

Route::get('/product', function () {
    $product = Product::all();
    $filtered_product = $product->reject(function ($product) {
        $specific_product = $product->where("price", '=', "10.00")->get();
        foreach ($specific_product as $sp) {
            return $sp->price;
        }
    });
    dd($filtered_product);
});
I want to exclude some records which match the condition above. I know I can do it in a simpler way, but I have a weird habit of doing things in a more complex way so I can be proud of myself. Sounds crazy, right? Anyway, why does the code above return an empty array? Please don't tell me to read the documentation. I am here because I have read it a thousand times and I still don't get it. Thanks.
I expect a result which does not include the records that have price 10.00.

Lucene Duplicated results / spaces in text search

I’m currently using Lucene 2.9.4.1 and everything works just fine if I search for something that exists just once in the same line.
For instance, if Lucene finds the string that I’m looking for more than once in the same line, I get duplicated (or more) results.
I’m actually using the following BooleanQuery:
booleanQuery.Add(new TermQuery(new Term(propertyInfo.Name, textSearch)), BooleanClause.Occur.SHOULD);
The second issue is about searching for text with spaces, like “hello world”: it never works.
Can anyone advise me or help me with these two malfunctioning features, please?
Thank you so much in advance,
Best regards,
Well, I just found the answer that solved both of my issues =)
I was using this:
BooleanQuery booleanQuery = new BooleanQuery();
PropertyInfo[] propertyInfos = typeof(T).GetProperties();
foreach (PropertyInfo propertyInfo in propertyInfos)
{
    booleanQuery.Add(new TermQuery(new Term(propertyInfo.Name, textSearch)), BooleanClause.Occur.SHOULD);
}
And now I use this:
var booleanQuery = new BooleanQuery();
textSearch = QueryParser.Escape(textSearch.Trim().ToLower());
string[] properties = typeof(T).GetProperties().Select(x => x.Name).ToArray();
Analyzer analyzer = new StandardAnalyzer(global::Lucene.Net.Util.Version.LUCENE_29);
MultiFieldQueryParser titleParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, properties, analyzer);
Query titleQuery = titleParser.Parse(textSearch);
booleanQuery.Add(titleQuery, BooleanClause.Occur.SHOULD);
It seems that Analyzer and MultiFieldQueryParser are the solution to my problems: no more duplicated results, I can search for text with spaces and … the performance has improved significantly (faster results) =)

Lucene.Net - weird behaviour in different servers

I was writing a search for one of our sites: (SITE A)
BooleanQuery booleanQuery = new BooleanQuery();
foreach (var field in fields)
{
    QueryParser qp = new QueryParser(field, new StandardAnalyzer());
    Query query = qp.Parse(search.ToLower() + "*");
    if (field.Contains("Title")) { query.SetBoost((float)1.8); }
    booleanQuery.Add(query, BooleanClause.Occur.SHOULD);
}
// CODE DIFFERENCE IS HERE
Query query2 = new TermQuery(new Term("StateProperties.IsActive", "True"));
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
// END CODE DIFFERENCE
Lucene.Net.Search.TopScoreDocCollector collector = Lucene.Net.Search.TopScoreDocCollector.create(21, true);
searcher.Search(booleanQuery, collector);
hits = collector.TopDocs().scoreDocs;
This was working as expected.
Since we own a few sites, and they use the same skeleton,
I uploaded the search to another site (SITE B),
but the search stopped returning results.
After playing around a bit with the code, I managed to make it work like so (showing only the rewritten lines of code):
QueryParser qp2 = new QueryParser("StateProperties.IsActive", new StandardAnalyzer());
Query query2 = qp2.Parse("True");
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
Does anyone know why this is happening?
I have checked the Lucene DLL version, and it's the same version on both sites (2.9.2.2).
Is the code I have written for SITE A wrong? Is the SITE B code wrong?
Is this my fault at all? Can the production server influence something like this?
Don't they have individual indexes on disk? If they have been indexed differently, they would also return different results. One thing that comes to mind is that case sensitivity may matter, because a TermQuery will look for an EXACT match, whereas the parser will try to tokenize/filter the search term according to the analyzer (and probably search for "true" instead of "True").
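
A minimal sketch of that difference in plain C#, with no Lucene dependency. It assumes SITE A's index holds the flag verbatim as "True" and SITE B's holds it lowercased as "true"; the question does not confirm either, and `ExactLookup` and `AnalyzedLookup` are hypothetical stand-ins for what TermQuery and QueryParser with a lowercasing analyzer do to the term text.

```csharp
using System;
using System.Collections.Generic;

public static class ExactVersusAnalyzedDemo
{
    // In the spirit of TermQuery: the term text is looked up verbatim.
    public static bool ExactLookup(ICollection<string> indexedTerms, string term)
    {
        return indexedTerms.Contains(term);
    }

    // In the spirit of QueryParser with a lowercasing analyzer:
    // the term text is lowercased before the lookup.
    public static bool AnalyzedLookup(ICollection<string> indexedTerms, string term)
    {
        return indexedTerms.Contains(term.ToLowerInvariant());
    }

    public static void Main()
    {
        // Assumed index contents for the StateProperties.IsActive field.
        var siteA = new HashSet<string>(StringComparer.Ordinal) { "True" }; // stored verbatim
        var siteB = new HashSet<string>(StringComparer.Ordinal) { "true" }; // stored lowercased

        Console.WriteLine(ExactLookup(siteA, "True"));    // True
        Console.WriteLine(ExactLookup(siteB, "True"));    // False: the index only has "true"
        Console.WriteLine(AnalyzedLookup(siteB, "True")); // True: the query becomes "true"
    }
}
```

Under these assumptions neither version of the code is wrong; each one just has to agree with how the field was written into that site's index.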

Sorting Topscorecollector Results in Lucene.net?

I am doing a search operation using Lucene, taking my results from a TopScoreDocCollector, but I found that I am unable to sort those results, which seems quite odd. Can we sort the TopScoreDocCollector results?
My code looks like this
TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
indexSearch.Search(andSearchQuery, filter, collector);
ScoreDoc[] hits = collector.TopDocs().scoreDocs;
for (int i = 0; i < hits.Length; i++)
{
    int docId = hits[i].doc;
    float score = hits[i].score;
    Lucene.Net.Documents.Document doc = indexSearch.Doc(docId);
    document.Add(doc);
}
Can anybody help me?
Also, one more question:
we can sort the search results like this:
Hits hits = IndexSearch.search(searchQuery, filter, sort);
But it says that Hits becomes obsolete in Lucene 3.0, so I opted for TopScoreDocCollector. Now I am very confused.
If there is any alternative to Hits, please let me know.
TopScoreDocCollector will return results sorted by score. To get results sorted on a field you will need to use a method overload that returns TopFieldDocs.
For example: IndexSearcher.Search(query, filter, nResults, sort)
If you don't want to limit the number of results, use a very large value for the nResults parameter. If I remember correctly, passing Int32.MaxValue will make Lucene throw an exception when initializing its PriorityQueue, but Int32.MaxValue - 1 is fine.

What is the better way to do the below program(c#3.0)

Consider the below program
private static bool CheckFactorPresent(List<FactorReturn> factorReturnCol)
{
    bool IsPresent = true;
    StringBuilder sb = new StringBuilder();
    //Get the exposure names from Exposure list.
    //Since this will remain same , so it has been done outside the loop
    List<string> lstExposureName = (from item in Exposures
                                    select item.ExposureName).ToList<string>();
    foreach (FactorReturn fr in factorReturnCol)
    {
        //Build the factor names from the ReturnCollection dictionary
        List<string> lstFactorNames = fr.ReturnCollection.Keys.ToList<string>();
        //Check if all the Factor Names are present in ExposureName list
        List<string> result = lstFactorNames.Except(lstExposureName).ToList();
        if (result.Count() > 0)
        {
            result.ForEach(i =>
            {
                IsPresent = false;
                sb.AppendLine("Factor" + i + "is not present for week no: " + fr.WeekNo.ToString());
            });
        }
    }
    return IsPresent;
}
Basically I am checking whether all the factor names [lstFactorNames] are present in
the exposure names [lstExposureName] list by using lstFactorNames.Except(lstExposureName).
Then, if Count() > 0, I write the error
messages to the StringBuilder (sb).
I am sure that someone can write a better implementation than the one presented,
and I look forward to learning something new from it.
I am using c#3.0 and dotnet framework 3.5
Thanks
Save for some naming convention issues, I'd say that looks fine (from what I can figure out without seeing the rest of the code or the purpose of the effort). The naming conventions, though, need some work. A sporadic mix of ntnHungarian, PascalCase, camelCase, and abbrv is a little disorienting. Try naming your local variables in camelCase exclusively and things will look a lot better. Best of luck to you; things are looking good so far!
- EDIT -
Also, you can clean up the iteration at the end by just running a simple foreach:
...
foreach (var except in result)
{
    isPresent = false;
    builder.AppendFormat("Factor{0} is not present for week no: {1}\r\n", except, fr.WeekNo);
}
...
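
Going one step further, the whole check can be written as a single LINQ query. This is only a sketch under assumptions: `Exposure` and `FactorReturn` below are minimal stand-ins for the asker's types (the dictionary's value type is a guess), and `FindMissingFactors` is a hypothetical name. It returns the messages instead of a bool because the original builds `sb` but never hands it to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins for the asker's types; only the members the check touches.
public class Exposure
{
    public string ExposureName { get; set; }
}

public class FactorReturn
{
    public int WeekNo { get; set; }
    public Dictionary<string, double> ReturnCollection { get; set; } // value type is assumed
}

public static class FactorChecker
{
    // Returns one message per factor name that has no matching exposure name.
    // An empty list means every factor is present.
    public static List<string> FindMissingFactors(
        IEnumerable<FactorReturn> factorReturns, IEnumerable<Exposure> exposures)
    {
        // Built once, outside the query, so each lookup is a hash probe.
        var exposureNames = new HashSet<string>(exposures.Select(e => e.ExposureName));

        return (from fr in factorReturns
                from name in fr.ReturnCollection.Keys
                where !exposureNames.Contains(name)
                select string.Format("Factor {0} is not present for week no: {1}", name, fr.WeekNo))
               .ToList();
    }

    public static void Main()
    {
        var exposures = new List<Exposure>
        {
            new Exposure { ExposureName = "Growth" },
            new Exposure { ExposureName = "Value" }
        };
        var factorReturns = new List<FactorReturn>
        {
            new FactorReturn { WeekNo = 1, ReturnCollection = new Dictionary<string, double> { { "Growth", 0.1 }, { "Momentum", 0.2 } } },
            new FactorReturn { WeekNo = 2, ReturnCollection = new Dictionary<string, double> { { "Value", 0.3 } } }
        };

        // Prints: Factor Momentum is not present for week no: 1
        foreach (var message in FindMissingFactors(factorReturns, exposures))
        {
            Console.WriteLine(message);
        }
    }
}
```

The caller can still get the original bool with `FindMissingFactors(...).Count == 0`, and keeps the messages for logging.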