Autocomplete for addresses using Lucene .NET - autocomplete

I have not been able to get this nailed down. I have tried several different analyzers and they all get me close, but not exactly what I want. Using SOLR is not an option at the moment.
An examples of what I would like:
Input: 200
Matches: 200 E Dragon Dr.
200 W Paragon Rd.
200 Lick Skillet Dr.
Input: 200 E
Matches: 200 E Dragon Dr.
200 E Toll Rd.
Input: 200 E D
Matches: 200 E Dragon Dr.
If I use the simple analyzer then it will not match on the number. The whitespace analyzer gets the desired effect with just the number, but once I add the E it does not return as I expect. What would be the best analyzer or am I using the wrong queries?
Thanks,
EDIT:
I have taken the below answer and did a ton of googling and I am getting close just using the query parser and the whitespaceanalyzer. I am just letting the query parser determine the best query and it seems to work.

try using a keyword analyzer and a query parser to search an address field in Lucene. I am using a MultiFieldQueryParser, but you could use a regular query parser too:
public StartsWithQuery Prefix(string prefix, string[] fields, Dictionary<string,string> filterFields = null )
{
if(!string.IsNullOrEmpty(prefix))
{
var parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, new KeywordAnalyzer());
var boolQuery = new BooleanQuery();
boolQuery.Add(parser.Parse(prefix + "*"), BooleanClause.Occur.MUST);
if (filterFields != null)
{
foreach (var field in filterFields)
{
boolQuery.Add(new TermQuery(new Term(field.Key, field.Value)), BooleanClause.Occur.MUST);
}
}
}
return this;
}

Related

jayData complex filter evaluation

I am new to jayData and am trying to filter on an entity set. The filter needs to perform an complex evaluation beyond what I saw in the samples.
Here is a working sample of what I am trying to accomplish (the listView line isn't and is just there to show what I plan to do with the data):
function () {
var weekday = moment().isoWeekday()-1;
console.log(weekday);
var de = leagueDB.DailyEvents.toArray(function (events) {
console.log(events);
var filtered = [];
for (var e = 0; e < events.length;e++) {
console.log(events[e]);
console.log(events[e].RecurrenceRule);
var rule = RRule.fromString(events[e].RecurrenceRule);
var ruleOptions = rule.options.byweekday;
var isDay = ruleOptions.indexOf(weekday);
console.log(ruleOptions, isDay);
if(isDay =! -1)
{
filtered.push(events[e]);
}
}
$("#listView").kendoListView({dataSource:filtered});
});
Basically it is just evaluating a recurring rule string to see if the current day meets that criteria, if so add that event to the list for viewing.
But it blows up when I try to do this:
eventListLocal:leagueDB.DailyEvents.filter(function(e){
console.log("The Weekday is:"+viewModel.weekday);
console.log(e);
console.log("The recurrence rule is:"+e.RecurrenceRule);
var rruleOptions = viewModel.rruleOptions(e.RecurrenceRule);
if (rruleOptions !== -1) {
return true;
}
}).asKendoDataSource()
The error that is generating is:
Exception: Unable to resolve type:undefined
The thing is it seems to be occurring on "e" and the console logs like the event is not being passed in. However, I am not seeing a list either. In short I am really confused as to what is going on.
Any help would be appreciated.
Thanks,
You can't write filter expressions such as this.
When you write .filter(...), jaydata will parse your expression and then it will generate filter for underlying provider, for example where for webSql and $filter for oDataProvider.
Both JayData expression parser and the data provider itself should understand your filter.
Your filter is not suitable for this approach, because most of your codes are not familiar for jaydata expression parser and the underlying data provider, for example your console.log etc.
You can simplify your filter, or you should load all your data into an array, and then you can use filter method of array itself, there, you can write any filter you like, and your filter will work. Of course this has performance issue in some scenarios when your data set is large.
Read more on http://jaydata.org/tutorials/entityexpressions-the-heart-of-jaydata

Finding bad words from large list of email addressess using PHP -Mongodb

I have large list of email addressses from a file. It comes around 1 million email ids. I have list of bad words like spam,junk etc, it consist of 20,000+ bad words.
I need to validate email ids. If bad words is present any where in email id it will be marked as invalid.
For example;
testspam#gmail.com - invalid
newuser#desspam.com - invalid
I would like to know which will be fastest comparison method as array looping will take time.
I tried following methods
//$keyword_list- array of bad words;
//$check_key- the email id which need to validate
$arrays = array_chunk($keyword_list, 2000);
for($i=0;$i<count($arrays);$i++)
{
if (preg_match('/'.implode('|', $arrays[$i]).'/', $check_key, $matches)){
return 1;
}
}
The above method is taking more time when comparing 1 million data.
Next we tried with the following code and this also takes more time
//$contain = bad words separated by '|'
// $str - the email id which need to validate
if(stripos($contain,"|") !== false)
{
$s = preg_split('/[|]+/i',$contain);
$len = sizeof($s);
for($i=0;$i < $len;$i++)
{
if(stripos($str,$s[$i]) !== false)
{
return(true);
}
}
}
if(stripos($str,$contain) !== false)
{
return(true);
}
return(false);
Finally I had tried Mongodb Text Search. It works fast with the following issues
If 'Hell' is the word in my bad list and my email id is like
head#e-hellinglysussex.sch.uk, then the Mongodb Text Search wont matches it.
Here is the code I used;
$ret = $db->command( array("text" =>$section, "search" => $keyword_string, "limit"=>$cnt_finalnonmatch));
where $section = Collection name,
$keyword_string = Comparing keywords string separated by space, for eg "Hell Spam Junk" etc,
$cnt_finalnonmatch = total number of comparing email ids
Please help me to solve this issue.
I am not entirely sure, but I suspect that the problem is that 'Hell' is not equal to 'hell' when you search for text since mongodb is case sensitive.
The solution should be to force all the strings and word to be lowercase (or uppercase)
We have used Mongodb 'like' to solve this issue;
$keywords = $key['keyword']; // Keywords need to compare
$regexObj = new MongoRegex("/".$keywords."/i"); // MongoRegex function declration
$where = array($section => $regexObj); // $section is the collection name
$resultset = $info->find($where);

Lucene.Net - weird behaviour in different servers

I was writing a search for one of our sites: (SITE A)
BooleanQuery booleanQuery = new BooleanQuery();
foreach (var field in fields)
{
QueryParser qp = new QueryParser(field, new StandardAnalyzer());
Query query = qp.Parse(search.ToLower() + "*");
if (field.Contains("Title")) { query.SetBoost((float)1.8); }
booleanQuery.Add(query, BooleanClause.Occur.SHOULD);
}
// CODE DIFFERENCE IS HERE
Query query2 = new TermQuery(new Term("StateProperties.IsActive", "True"));
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
// END CODE DIFFERENCE
Lucene.Net.Search.TopScoreDocCollector collector = Lucene.Net.Search.TopScoreDocCollector.create(21, true);
searcher.Search(booleanQuery, collector);
hits = collector.TopDocs().scoreDocs;
this was working as expected.
since we own a few sites, and they use the same skeleton,
i uploaded the search to another site ( SITE B )
but the search stopped returning results.
after playing a round a bit with the code, i managed to make it work like so: (showing only the rewriten lines of code)
QueryParser qp2 = new QueryParser("StateProperties.IsActive", new StandardAnalyzer());
Query query2 = qp2.Parse("True");
booleanQuery.Add(query2, BooleanClause.Occur.MUST);
anyone knows why this is happening ?
i have checked the Lucene dll version, and its the same version in both sites (2.9.2.2)
is the code i have written in SITE A wrong ? is SITE B code wrong ?
is this my fault at all ? can production server infulance something like this ?
Doesn't they have individual indexes on disk? If they have been indexed differently, they would also return different results. One thing that comes to mind is if there is some sort of case sensitivity that matters, becayse a TermQuery will look for an EXACT match, where as the parser will try to tokenize/filter the search term according to the analyzer (and probably search for "true" instead of "True".

Facing Issue in zend_search_lucene

I am using Zend Lucene Search:
......
$results = $test->fetchAll();
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
foreach ($results as $result) {
$doc = new Zend_Search_Lucene_Document();
// add Fields
$doc->addField(
Zend_Search_Lucene_Field::Text('testid', $result->id));
$doc->addField(
Zend_Search_Lucene_Field::Keyword('testemail', strtolower(($result->email))));
$doc->addField(
Zend_Search_Lucene_Field::Text('testconfirmdate', $result->confirmdate));
$doc->addField(
Zend_Search_Lucene_Field::Text('testcreateddate', $result->createddate));
// Add document to the index
$index->addDocument($doc);
}
// Optimize index.
$index->optimize();
// Search by query
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
if(strlen($Data['name']) > 2){
//$query = Zend_Search_Lucene_Search_QueryParser::parse($Data['name'].'*');
$pattern = new Zend_Search_Lucene_Index_Term($Data['name'].'*');
$query = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);
$this->view->hits = $index->find(strtolower($query));
}
else{
$query = $Data['name'];
$this->view->hits = $index->find($query);
}
............
Works fine here:
It works when I give complete word, first 3 character, case insensitive words
My issues are:
When I search for email, i got error like "Wildcard search is supported only for non-multiple word terms "
When I search for number/date like "1234" or 09/06/2011, I got error like "At least 3 non-wildcard characters are required at the beginning of pattern"
I want to search date, email, number here.
In file zend/search/Lucene/search/search/query/wildcard a parameter is set,
private static $_minPrefixLength = 3;
chnage it and it may work..!
Based on NaanuManu's suggestion, I did a little more digging to figure this out - I posted my answer on a related question here, but repeating for convenience:
Taken directly from the Zend Reference documentation, you can use:
Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() to
query the minimum required prefix length and
use Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength() to
set it.
So my suggestion would be either of two things:
Set the prefixMinLength to 0 using Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0)
Validate all search queries using javascript or otherwise to ensure there is a minimum of Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() before any wildcards used (I recommend querying that instead of assuming the default of "3" so the validation is flexible)

Lucene.NET stemming problem

I'm running into a problem using the SnowBallAnalyzer in Lucene.NET. It works great for some words, but others it doesn't find any results on at all, and I'm not sure how to dig into this further to find out what is happening. I am testing the search on the USDA Food Description file which can be found here (http://www.ars.usda.gov/SP2UserFiles/Place/12354500/Data/SR23/asc/FOOD_DES.txt). I'm using the English stemming algorithm. I get the following results when searching for "eggs":
Bagels, egg
Bread, egg
Egg, whole, raw, fresh
Egg, white, raw, fresh
Egg, yolk, raw, fresh
Egg, yolk, raw, frozen
Egg, whole, cooked, fried
...
Those results are great. However I get no results at all when searching for "apple". When I use the StandardAnalyzer, and search for "apple" I get the following results.
Croissants, apple
Strudel, apple,
Babyfood, juice, apple
Babyfood, apple-banana juice
...
Not the best results, but at least it's showing something. Anyone know why the stemming analyzer would be filtering in such a way that I would not get any results?
Edit: Here is my prototype code that I'm working with.
static string[] Search(string searchTerm)
{
//Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English");
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
Lucene.Net.Search.Query query = parser.Parse(searchTerm);
Lucene.Net.Search.Searcher searcher = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo("./index/")), true);
var topDocs = searcher.Search(query, null, 10);
List<string> results = new List<string>();
foreach(var scoreDoc in topDocs.scoreDocs)
{
results.Add(searcher.Doc(scoreDoc.doc).Get("raw"));
}
return results.ToArray();
}
Are you sure you used Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English") to write your index ? You have to use the same analyzer to write and query the index.