Lucene.Net - weird behaviour in different servers - lucene.net

I was writing a search for one of our sites: (SITE A)
BooleanQuery booleanQuery = new BooleanQuery();
foreach (var field in fields)
{
QueryParser qp = new QueryParser(field, new StandardAnalyzer());
Query query = qp.Parse(search.ToLower() + "*");
if (field.Contains("Title")) { query.SetBoost((float)1.8); }
booleanQuery.Add(query, BooleanClause.Occur.SHOULD);
}
// CODE DIFFERENCE IS HERE
Query query2 = new TermQuery(new Term("StateProperties.IsActive", "True"));
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
// END CODE DIFFERENCE
Lucene.Net.Search.TopScoreDocCollector collector = Lucene.Net.Search.TopScoreDocCollector.create(21, true);
searcher.Search(booleanQuery, collector);
hits = collector.TopDocs().scoreDocs;
this was working as expected.
since we own a few sites, and they use the same skeleton,
i uploaded the search to another site ( SITE B )
but the search stopped returning results.
after playing a round a bit with the code, i managed to make it work like so: (showing only the rewriten lines of code)
QueryParser qp2 = new QueryParser("StateProperties.IsActive", new StandardAnalyzer());
Query query2 = qp2.Parse("True");
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
anyone knows why this is happening ?
i have checked the Lucene dll version, and its the same version in both sites (2.9.2.2)
is the code i have written in SITE A wrong ? is SITE B code wrong ?
is this my fault at all ? can production server infulance something like this ?

Doesn't they have individual indexes on disk? If they have been indexed differently, they would also return different results. One thing that comes to mind is if there is some sort of case sensitivity that matters, becayse a TermQuery will look for an EXACT match, where as the parser will try to tokenize/filter the search term according to the analyzer (and probably search for "true" instead of "True".

Related

Applying the same Analyzer to Queries and Fields

I am trying to build a basic search for my API backend. Users pass arbitrary queries and the backend is supposed to return results (obviously). I would prefer a solution that works with a local index as well as Elasticsearch.
On my entity I defined an analyzer like this:
#AnalyzerDef(name = "ngram",
tokenizer = #TokenizerDef(factory = StandardTokenizerFactory.class ),
filters = {
#TokenFilterDef(factory = StandardFilterFactory.class),
#TokenFilterDef(factory = LowerCaseFilterFactory.class),
#TokenFilterDef(factory = StopFilterFactory.class),
#TokenFilterDef(factory = NGramFilterFactory.class,
params = {
#Parameter(name = "minGramSize", value = "2"),
#Parameter(name = "maxGramSize", value = "3") } )
}
)
For the query, I tried the following:
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(this.entityManager);
Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("ngram");
QueryParser queryParser = new MultiFieldQueryParser(ALL_FIELDS, analyzer);
queryParser.setDefaultOperator(QueryParser.AND_OPERATOR);
org.apache.lucene.search.Query query = queryParser.parse(queryString);
javax.persistence.Query persistenceQuery =
fullTextEntityManager.createFullTextQuery(query, MyEntity.class);
List<MyEntity> result = persistenceQuery.getResultList();
As far as I understand, I need to provide an analyzer for the Query so that the search query are "ngram-tokenized" and a match can be found. Before, I used SimpleAnalyzer and as a result, the search only matched full words which - I think - backs my theory (Sorry, I am still learning this).
The above code gives me a NullPointerException:
java.lang.NullPointerException: null
at org.hibernate.search.engine.impl.ImmutableSearchFactory.getAnalyzer(ImmutableSearchFactory.java:370) ~[hibernate-search-engine-5.11.1.Final.jar:5.11.1.Final]
at org.hibernate.search.engine.impl.MutableSearchFactory.getAnalyzer(MutableSearchFactory.java:203) ~[hibernate-search-engine-5.11.1.Final.jar:5.11.1.Final]
at org.hibernate.search.impl.SearchFactoryImpl.getAnalyzer(SearchFactoryImpl.java:50) ~[hibernate-search-orm-5.11.1.Final.jar:5.11.1.Final]
in the line
Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("ngram");
You cannot retrieve the analyzer from Hibernate Search when using the Elasticsearch integration, because in that case there is no analyzer locally: the analyzer only exists remotely, in the Elasticsearch cluster.
If you only need a subset of the query syntax, try out the "simple query string" query: it's a query that can be built using the DSL (so it will work the same with Lucene and Elasticsearch) and that provides the most common features (boolean queries, fuzziness, phrases, ...). For example:
Query luceneQuery = queryBuilder.simpleQueryString()
.onFields("name", "history", "description")
.matching("war + (peace | harmony)")
.createQuery();
The syntax is a bit different, but only because it's aiming at end users and tries to be simpler.
EDIT: If simple query strings are not an option, you can create an analyzer manually: this should work even when using the Elasticsearch integration.
org.apache.lucene.analysis.custom.CustomAnalyzer#builder() should be a good starting point. There are several examples in the javadoc of that class.
Make sure you only create the analyzer once and store it somewhere, e.g. in a static constant: creating an analyzer may be costly.

Lucene Duplicated results / spaces in text search

I’m actually using Lucene 2.9.4.1 and everything works just fine if I search for something that exists just once in the same line.
Per instance, if Lucene find the same string that I’m looking for in the same line, I have duplicated (or more) results.
I’m actually using the following BooleanQuery:
booleanQuery.Add(new TermQuery(new Term(propertyInfo.Name, textSearch)), BooleanClause.Occur.SHOULD);
The second issue is about searching by something with spaces like “hello world”: never works.
Can anyone advise me or help me with these two malfunctioning features, please?
Thank you so much in advance,
Best regards,
Well, I just found the answer that solved both of my issues =)
I was using this:
BooleanQuery booleanQuery = new BooleanQuery();
PropertyInfo[] propertyInfos = typeof(T).GetProperties();
foreach (PropertyInfo propertyInfo in propertyInfos)
{
booleanQuery.Add(new TermQuery(new Term(propertyInfo.Name, textSearch)), BooleanClause.Occur.SHOULD);
}
And now I use this:
var booleanQuery = new BooleanQuery();
textSearch = QueryParser.Escape(textSearch.Trim().ToLower());
string[] properties = typeof(T).GetProperties().Select(x => x.Name).ToArray();
Analyzer analyzer = new StandardAnalyzer(global::Lucene.Net.Util.Version.LUCENE_29);
MultiFieldQueryParser titleParser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, properties, analyzer);
Query titleQuery = titleParser.Parse(textSearch);
booleanQuery.Add(titleQuery, BooleanClause.Occur.SHOULD);
It seems that Analyzer and MultiFieldQueryParser are the solution for my problems: no more duplicated results, I can search by something with spaces and … the performance as significantly raised up (faster results) =)

How to make Appstats show both small and read operations?

I'm profiling my application locally (using the Dev server) to get more information about how GAE works. My tests are comparing the common full Entity query and the Projection Query. In my tests both queries do the same query, but the Projection is specified with 2 properties. The test kind has 100 properties, all with the same value for each Entity, with a total of 10 Entities. An image with the Datastore viewer and the Appstats generated data is shown bellow. In the Appstats image, Request 4 is a memcache flush, Request 3 is the test database creation (it was already created, so no costs here), Request 2 is the full Entity query and Request 1 is the projection query.
I'm surprised that both queries resulted in the same amount of reads. My guess is that small and read operations and being reported the same by Appstats. If this is the case, I want to separate them in the reports. That's the queries related functions:
// Full Entity Query
public ReturnCodes doQuery() {
DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
for(int i = 0; i < numIters; ++i) {
Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
FilterOperator.NOT_EQUAL, i);
Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
PreparedQuery prepQuery = dataStore.prepare(query);
Iterable<Entity> results = prepQuery.asIterable();
for(Entity result : results) {
log.info(result.toString());
}
}
return ReturnCodes.SUCCESS;
}
// Projection Query
public ReturnCodes doQuery() {
DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
for(int i = 0; i < numIters; ++i) {
String projectionPropName = DBCreation.PROPERTY_NAME_PREFIX + i;
Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
FilterOperator.NOT_EQUAL, i);
Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 0, Integer.class));
query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 1, Integer.class));
PreparedQuery prepQuery = dataStore.prepare(query);
Iterable<Entity> results = prepQuery.asIterable();
for(Entity result : results) {
log.info(result.toString());
}
}
return ReturnCodes.SUCCESS;
}
Any ideas?
EDIT: To get a better overview of the problem I have created another test, which do the same query but uses the keys only query instead. For this case, Appstats is correctly showing DATASTORE_SMALL operations in the report. I'm still pretty confused about the behavior of the projection query which should also be reporting DATASTORE_SMALL operations. Please help!
[I wrote the go port of appstats, so this is based on my experience and recollection.]
My guess is this is a bug in appstats, which is a relatively unmaintained program. Projection queries are new, so appstats may not be aware of them, and treats them as normal read queries.
For some background, calculating costs is difficult. For write ops, the cost are returned with the results, as they must be, since the app has no way of knowing what changed (which is where the write costs happen). For reads and small ops, however, there is a formula to calculate the cost. Each appstats implementation (python, java, go) must implement this calculation, including reflection or whatever is needed over the request object to determine what's going on. The APIs for doing this are not entirely obvious, and there's lots of little things, so it's easy to get it wrong, and annoying to get it right.

Facing Issue in zend_search_lucene

I am using Zend Lucene Search:
......
$results = $test->fetchAll();
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
foreach ($results as $result) {
$doc = new Zend_Search_Lucene_Document();
// add Fields
$doc->addField(
Zend_Search_Lucene_Field::Text('testid', $result->id));
$doc->addField(
Zend_Search_Lucene_Field::Keyword('testemail', strtolower(($result->email))));
$doc->addField(
Zend_Search_Lucene_Field::Text('testconfirmdate', $result->confirmdate));
$doc->addField(
Zend_Search_Lucene_Field::Text('testcreateddate', $result->createddate));
// Add document to the index
$index->addDocument($doc);
}
// Optimize index.
$index->optimize();
// Search by query
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
if(strlen($Data['name']) > 2){
//$query = Zend_Search_Lucene_Search_QueryParser::parse($Data['name'].'*');
$pattern = new Zend_Search_Lucene_Index_Term($Data['name'].'*');
$query = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);
$this->view->hits = $index->find(strtolower($query));
}
else{
$query = $Data['name'];
$this->view->hits = $index->find($query);
}
............
Works fine here:
It works when I give complete word, first 3 character, case insensitive words
My issues are:
When I search for email, i got error like "Wildcard search is supported only for non-multiple word terms "
When I search for number/date like "1234" or 09/06/2011, I got error like "At least 3 non-wildcard characters are required at the beginning of pattern"
I want to search date, email, number here.
In file zend/search/Lucene/search/search/query/wildcard a parameter is set,
private static $_minPrefixLength = 3;
chnage it and it may work..!
Based on NaanuManu's suggestion, I did a little more digging to figure this out - I posted my answer on a related question here, but repeating for convenience:
Taken directly from the Zend Reference documentation, you can use:
Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() to
query the minimum required prefix length and
use Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength() to
set it.
So my suggestion would be either of two things:
Set the prefixMinLength to 0 using Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0)
Validate all search queries using javascript or otherwise to ensure there is a minimum of Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() before any wildcards used (I recommend querying that instead of assuming the default of "3" so the validation is flexible)

Lucene.NET stemming problem

I'm running into a problem using the SnowBallAnalyzer in Lucene.NET. It works great for some words, but others it doesn't find any results on at all, and I'm not sure how to dig into this further to find out what is happening. I am testing the search on the USDA Food Description file which can be found here (http://www.ars.usda.gov/SP2UserFiles/Place/12354500/Data/SR23/asc/FOOD_DES.txt). I'm using the English stemming algorithm. I get the following results when searching for "eggs":
Bagels, egg
Bread, egg
Egg, whole, raw, fresh
Egg, white, raw, fresh
Egg, yolk, raw, fresh
Egg, yolk, raw, frozen
Egg, whole, cooked, fried
...
Those results are great. However I get no results at all when searching for "apple". When I use the StandardAnalyzer, and search for "apple" I get the following results.
Croissants, apple
Strudel, apple,
Babyfood, juice, apple
Babyfood, apple-banana juice
...
Not the best results, but at least it's showing something. Anyone know why the stemming analyzer would be filtering in such a way that I would not get any results?
Edit: Here is my prototype code that I'm working with.
static string[] Search(string searchTerm)
{
//Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English");
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
Lucene.Net.Search.Query query = parser.Parse(searchTerm);
Lucene.Net.Search.Searcher searcher = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo("./index/")), true);
var topDocs = searcher.Search(query, null, 10);
List<string> results = new List<string>();
foreach(var scoreDoc in topDocs.scoreDocs)
{
results.Add(searcher.Doc(scoreDoc.doc).Get("raw"));
}
return results.ToArray();
}
Are you sure you used Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English") to write your index ? You have to use the same analyzer to write and query the index.