Supporting typeahead autocomplete with ElasticSearch

Is there a standard way to implement character-by-character typeahead autocomplete using ElasticSearch for small fields (e.g. place names)?
(At the time of writing, there are a number of discussions available via search, but nothing that seems definitive. I also see there is talk of upcoming support for autocomplete/suggest in Apache Lucene 4.)
Thanks for any thoughts.

You can use an edge n-gram based analyzer; see http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer.html
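For example, here is a rough sketch of index settings using an edge n-gram token filter. The analyzer and filter names are made up, older Elasticsearch versions spell the type edgeNGram, and the edge_ngram tokenizer can be used instead of the filter:
"settings" : {
    "analysis" : {
        "filter" : {
            "autocomplete_filter" : {
                "type" : "edge_ngram",
                "min_gram" : 1,
                "max_gram" : 20
            }
        },
        "analyzer" : {
            "autocomplete" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["lowercase", "autocomplete_filter"]
            }
        }
    }
}
You would typically apply the autocomplete analyzer at index time only and keep a plain analyzer at search time, so the query text itself is not n-grammed.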
Or use the suggest plugin: https://github.com/spinscale/elasticsearch-suggest-plugin
HTH

As David wrote, you can use n-grams or the suggest plugin. With Lucene 4 it will be possible to have better auto-suggestions out of the box, without the need to maintain a separate index.
For now you can also just make a terms facet on your field and use a regex pattern to keep only the entries that start with the relevant prefix:
"facets" : {
"tag" : {
"terms" : {
"field" : "field_name",
"regex" : "prefix.*"
}
}
}
The regex is just an example; it can be improved, and you can also make it case-insensitive using the proper regex flag. Also, beware that making a facet on a field that contains many unique terms is not a great idea unless you have enough memory for it.

Use the built-in completion suggester that's available since version 0.90.3:
http://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html
It's blazingly fast and was developed for exactly that use case.
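As a minimal sketch (index, type and field names are made up), map a field with type completion:
"mappings" : {
    "place" : {
        "properties" : {
            "name_suggest" : { "type" : "completion" }
        }
    }
}
and then query it through the _suggest endpoint (this is the request body posted to /places/_suggest):
{
    "place_suggest" : {
        "text" : "new y",
        "completion" : { "field" : "name_suggest" }
    }
}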

Related

Which locator is more efficient in Protractor: by.css, by.xpath, or by.id?

Which one is better to use in terms of performance: by.css, by.xpath, or by.id?
I have a really lengthy XPath:
by.xpath('//*[@id="logindiv"]/div[3]/div/div[1]/div/nav/div/div[1]/form/div/div/button')
which can be used with other selectors like by.css or by.id.
But it is not very clear which one is better.
Protractor uses selenium-webdriver underneath for element lookup/interaction etc., so this is not a Protractor-specific question, but rather a selenium-webdriver one.
CSS selectors perform far better than XPath, and this is well documented in the Selenium community. Here are some reasons:
XPath engines are different in each browser, which makes them inconsistent.
Last time I checked, IE does not have a native XPath engine, so selenium-webdriver injects its own XPath engine for compatibility of its API. We therefore lose the advantage of the native browser features that selenium-webdriver inherently promotes.
XPath expressions tend to become complex, like your example, and are therefore hard to read and maintain in my opinion.
However, there are some situations where you need to use XPath, for example searching for a parent element or searching for an element by its text (I wouldn't recommend the latter).
You can read a blog post from Simon Stewart (the creator of WebDriver) here. He also recommends CSS over XPath.
So I would recommend using id, name, etc. for faster lookups. If those aren't available, use CSS, and finally fall back to XPath if nothing else suits your situation.
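For example, the long XPath above can usually be replaced with something shorter. The id and CSS values below are made up; use whatever attributes your markup actually has:
// Hypothetical alternatives to the long XPath above
element(by.id('loginButton'));                             // fastest, if the button has an id
element(by.css('#logindiv form button[type="submit"]'));   // usually fast and readable
element(by.xpath('//*[@id="logindiv"]//form//button'));    // last resort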

Text Preprocessing in Spark-Scala

I want to apply a preprocessing phase to a large amount of text data in Spark/Scala, such as lemmatization, stop-word removal (using TF-IDF) and POS tagging. Is there any way to implement these in Spark/Scala?
For example, here is one sample of my data:
The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.
after preprocessing:
perfect fit iPod photo great sound great price use everywhere very useful
and the words have POS tags, e.g. (iPod,NN) (photo,NN)
There is a POS tagging library (sista, from the University of Arizona); is it applicable in Spark?
Anything is possible. The question is what YOUR preferred way of doing this would be.
For example, do you have a stop word dictionary that works for you (it could simply be a Set), or would you want to run TF-IDF to pick the stop words automatically (note that this would require some supervision, such as picking the threshold above which a word is considered a stop word)? You can provide the dictionary yourself, and Spark's MLlib already comes with TF-IDF.
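As a minimal sketch, dictionary-based stop-word removal plus MLlib's TF-IDF might look like this (the stop-word set and the docs RDD are placeholders):
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// Hypothetical stop-word set; in practice load it from a file or broadcast it.
val stopWords = Set("the", "for", "my", "i", "it", "is", "a")

// docs: RDD[String] of raw reviews (name assumed for illustration).
val tokens: RDD[Seq[String]] =
  docs.map(_.toLowerCase.split("\\W+").filterNot(stopWords.contains).toSeq)

// TF-IDF with MLlib; deriving stop words automatically from document
// frequencies would be an extra step on top of this.
val hashingTF = new HashingTF()
val tf = hashingTF.transform(tokens)
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)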
The POS tags step is tricky. Most NLP libraries on the JVM (e.g. Stanford CoreNLP) don't implement java.io.Serializable, but you can perform the map step using them, e.g.
myRdd.map(functionToEmitPOSTags)
On the other hand, don't emit an RDD that contains non-serializable classes from that NLP library, since steps such as collect(), saveAsNewAPIHadoopFile, etc. will fail. Also, to reduce headaches with serialization, use Kryo instead of the default Java serialization. There are numerous posts about this issue if you google around, but see here and here.
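A common pattern is to create the non-serializable tagger once per partition and emit only plain strings and tuples. In the sketch below, createTagger and tagTokens are placeholders for whatever NLP library you pick, not a real API; the Kryo setting is the standard Spark configuration key:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

// Use Kryo instead of the default Java serialization.
val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val tagged: RDD[Seq[(String, String)]] = myRdd.mapPartitions { docs =>
  val tagger = createTagger()              // built on the executor, never shipped over the wire
  docs.map(doc => tagTokens(tagger, doc))  // emit only serializable (token, POS) pairs
}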
Once you figure out the serialization issues, you need to figure out which NLP library to use to generate the POS tags. There are plenty of those, e.g. Stanford CoreNLP, LingPipe and Mallet for Java, Epic for Scala, etc. Note that you can of course use the Java NLP libraries with Scala, including with wrappers such as the University of Arizona's Sista wrapper around Stanford CoreNLP, etc.
Also, why didn't your example lower-case the processed text? That's pretty much the first thing I would do. If you have special cases such as iPod, you could apply the lower-casing except in those cases. In general, though, I would lower-case everything. If you're removing punctuation, you should probably first split the text into sentences (split on the period using regex, etc.). If you're removing punctuation in general, that can of course be done using regex.
How deeply do you want to stem? For example, the Porter stemmer (there are implementations in every NLP library) stems so deeply that "universe" and "university" become the same resulting stem. Do you really want that? There are less aggressive stemmers out there, depending on your use case. Also, why use stemming if you can use lemmatization, i.e. splitting the word into the grammatical prefix, root and suffix (e.g. walked = walk (root) + ed (suffix)). The roots would then give you better results than stems in most cases. Most NLP libraries that I mentioned above do that.
Also, what's your distinction between a stop word and a non-useful word? For example, you removed the pronoun in the subject form "I" and the possessive form "my," but not the object form "me." I recommend picking up an NLP textbook like "Speech and Language Processing" by Jurafsky and Martin (for the ambitious), or just reading one of the engineering-centered books about NLP tools such as LingPipe for Java, NLTK for Python, etc., to get a good overview of the terminology, the steps in an NLP pipeline, and so on.
There is no built-in NLP capability in Apache Spark. You would have to implement it for yourself, perhaps based on a non-distributed NLP library, as described in marekinfo's excellent answer.
I would suggest you take a look at Spark's ML pipeline. You may not get everything out of the box yet, but you can build your own capabilities and use the pipeline as a framework.
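A minimal sketch of such a pipeline, doing tokenization plus stop-word removal (the column names and the reviews DataFrame are made up; POS tagging or lemmatization would still need a custom Transformer wrapping an external NLP library):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val pipeline  = new Pipeline().setStages(Array(tokenizer, remover))
val cleaned   = pipeline.fit(reviews).transform(reviews)   // reviews: DataFrame with a "text" column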

Is there a port for KStem for .NET?

I'm about to launch into a Lucene.NET implementation and I am concerned about using the PorterStemFilter. Reading here, and reading source code, it appears to be far, far too aggressive for my needs.
I need something simpler that doesn't look for roots but just removes "er", "ed", "s", etc. suffixes. From what I've read, KStem would do the trick.
I can't for the life of me find a .NET version of KStem. I can't even find source code for the Java version to hand-roll a port.
Could someone point me in the right direction?
Looks like it is easy enough to handcraft a reduced PorterStemmer by simply removing steps I don't want. Anyone have success with that?
You could use the HunspellStemmer, part of contrib. It can use freely available hunspell dictionaries to provide proper stemming.

Lucene.Net features

I am new to Lucene.Net.
Which is the best Analyzer to use in Lucene.Net?
Also, I want to know how to use the stop words and word stemming features.
I'm also new to Lucene.Net, but I do know that the SimpleAnalyzer does not handle stop words at all; it simply lowercases and indexes all tokens/words (the StopAnalyzer is the one that drops stop words).
Here's a link to some Lucene info, by the way, the .NET version is an almost perfect, byte-for-byte rewrite of the Java version, so the Java documentation should work fine in most cases: http://darksleep.com/lucene/. There's a section in there about the three analyzers, Simple, Stop, and Standard.
I'm not sure how Lucene.Net handles word stemming, but this link, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2, demonstrates how to create your own Analyzer in Java, and uses a PorterStemFilter to do word-stemming.
...[T]he Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English
I hope that is helpful.
The best analyzer I have found is the StandardAnalyzer, in which you can also specify stop words.
For example:
string indexFileLocation = @"C:\Index";
string stopWordsLocation = @"C:\Stopwords.txt";
var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
Analyzer analyzer = new StandardAnalyzer(
    Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));
It depends on your requirements. If your requirements are ultra simple - e.g. case-insensitive, non-stemming searches - then StandardAnalyzer is a good choice. If you look into the Analyzer class and get familiar with Filters, particularly TokenFilter, you can exert an enormous amount of control over your index by rolling your own analyzer.
Stemmers are tricky, and it's important to have a deep understanding of what type of stemming you really need. I've used the Snowball stemmers. For example, the words "policy" and "police" have the same root in the English Snowball stemmer, and getting hits on documents with "policy" when the search term is "police" isn't so hot. I've implemented strategies to support stemmed and non-stemmed search so that can be avoided, but it's important to understand the impact.
Beware of temptations like stop words. If you need to search for the phrase "to be or not to be" and the standard stop words are enabled, your search will fail to find documents with that phrase.

Can I identify matched terms when searching with sphinx?

I am using Sphinx to do full text search on a MySQL database through Thinking Sphinx.
I would like to highlight the matched terms in the results I show to the user.
Sphinx is smart enough that searching for 'botulism' will match "i like to inject botulinum into my eyes"
How can I get it to tell me that 'botulinum' matches 'botulism'?
First, I'm heavily using Sphinx for one of my projects, but I'm not using Thinking Sphinx since the config file we use is quite complex; I'm using a customized act_as_sphinx plugin.
To answer your question from a pure Sphinx point of view:
There is a BuildExcerpts API in Sphinx to get excerpts of your content with the matching terms highlighted; see http://www.sphinxsearch.com/docs/current.html#api-func-buildexcerpts.
Thinking Sphinx should provide this functionality.
To match 'botulism' against 'botulinum' you should build Sphinx with a stemmer enabled; the Porter algorithm may answer your question. See http://www.sphinxsearch.com/docs/current.html#conf-morphology
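A minimal sphinx.conf sketch of that morphology setting (the index name, source and path are made up):
index places
{
    source     = places_source
    path       = /var/data/places
    morphology = stem_en    # enable the built-in English (Porter) stemmer
}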
Hope this helps. I highly encourage you to look at the Sphinx documentation to make full use of this very efficient indexer.
Manfred