Lucene.Net features

I am new to Lucene.Net.
Which is the best Analyzer to use in Lucene.Net?
Also, I want to know how to use the stop word and word stemming features.

I'm also new to Lucene.Net, but I do know that the SimpleAnalyzer does not do any stop-word filtering; it just lower-cases and indexes all tokens/words.
Here's a link to some Lucene info: http://darksleep.com/lucene/. By the way, the .NET version is an almost line-by-line port of the Java version, so the Java documentation applies in most cases. There's a section in that article about the three analyzers: Simple, Stop, and Standard.
I'm not sure how Lucene.Net handles word stemming, but this link, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2, demonstrates how to create your own Analyzer in Java, and uses a PorterStemFilter to do word-stemming.
...[T]he Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English
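To make that concrete, here is a rough sketch of such a custom stemming analyzer, written in Scala against the Java Lucene 2.9/3.0 API (which Lucene.Net mirrors closely); constructors vary a little between versions, so treat it as an outline rather than drop-in code:
// Sketch only: tokenize, lower-case, drop stop words, then apply the Porter stemmer.
import java.io.Reader
import org.apache.lucene.analysis.{Analyzer, LowerCaseFilter, PorterStemFilter, StopAnalyzer, StopFilter, TokenStream}
import org.apache.lucene.analysis.standard.StandardTokenizer
import org.apache.lucene.util.Version

class StemmingAnalyzer extends Analyzer {
  override def tokenStream(fieldName: String, reader: Reader): TokenStream = {
    var stream: TokenStream = new StandardTokenizer(Version.LUCENE_29, reader)
    stream = new LowerCaseFilter(stream)
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
    new PorterStemFilter(stream)   // stems each remaining token
  }
}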
I hope that is helpful.

The best analyzer I have found is the StandardAnalyzer, to which you can also supply a list of stop words.
For example:
string indexFileLocation = @"C:\Index";
string stopWordsLocation = @"C:\Stopwords.txt";
// Open the index directory and build a StandardAnalyzer that reads
// its stop word list from the file above.
var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
Analyzer analyzer = new StandardAnalyzer(
    Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));

It depends on your requirements. If your requirements are ultra simple - e.g. case-insensitive, non-stemming searches - then StandardAnalyzer is a good choice. If you look into the Analyzer class and get familiar with Filters, particularly TokenFilter, you can exert an enormous amount of control over your index by rolling your own analyzer.
Stemmers are tricky, and it's important to have a deep understanding of what type of stemming you really need. I've used the Snowball stemmers. For example, "policy" and "police" reduce to the same root with the English Snowball stemmer, and getting hits on documents containing "policy" when the search term is "police" isn't so hot. I've implemented strategies to support both stemmed and non-stemmed search so that this can be avoided, but it's important to understand the impact.
Beware of temptations like stop words. If you need to search for the phrase "to be or not to be" and the standard stop words are enabled, your search will fail to find documents with that phrase.

Related

Text Preprocessing in Spark-Scala

I want to apply a preprocessing phase to a large amount of text data in Spark-Scala, such as lemmatization, stop word removal (using TF-IDF), and POS tagging. Is there any way to implement these in Spark-Scala?
For example, here is one sample of my data:
The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.
after preprocessing:
perfect fit iPod photo great sound great price use everywhere very useful
and the tokens have POS tags, e.g. (iPod,NN) (photo,NN)
There is a POS tagging library (Sista, from the University of Arizona); is it applicable in Spark?
Anything is possible. The question is what YOUR preferred way of doing this would be.
For example, do you have a stop word dictionary that works for you (it could simply be a Set), or do you want to run TF-IDF to pick the stop words automatically? Note that the latter requires some supervision, such as picking the threshold above which a word is considered a stop word. You can provide the dictionary yourself, and Spark's MLlib already comes with TF-IDF.
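For the TF-IDF route, here is a minimal sketch using MLlib's HashingTF/IDF; the RDD `tokenized` (an RDD[Seq[String]] of tokenized documents) and the tiny stop-word set are assumptions for illustration:
// Sketch only: filter with a hand-picked dictionary, then compute TF-IDF weights;
// terms with very low IDF across the corpus are candidate stop words.
import org.apache.spark.mllib.feature.{HashingTF, IDF}

val stopWords = Set("the", "for", "my", "i", "it", "is", "a")   // your own dictionary
val filtered  = tokenized.map(_.filterNot(w => stopWords.contains(w.toLowerCase)))

val hashingTF = new HashingTF()
val tf = hashingTF.transform(filtered)      // term-frequency vector per document
tf.cache()
val idf   = new IDF().fit(tf)               // document frequencies over the corpus
val tfidf = idf.transform(tf)               // TF-IDF vectors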
The POS tags step is tricky. Most NLP libraries on the JVM (e.g. Stanford CoreNLP) don't implement java.io.Serializable, but you can perform the map step using them, e.g.
myRdd.map(functionToEmitPOSTags)
On the other hand, don't emit an RDD that contains non-serializable classes from that NLP library, since steps such as collect(), saveAsNewAPIHadoopFile, etc. will fail. Also to reduce headaches with serialization, use Kryo instead of the default Java serialization. There are numerous posts about this issue if you google around, but see here and here.
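As a concrete way to do that, a minimal sketch of enabling Kryo when you build your own SparkConf; the app name and the commented-out class registration are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("text-preprocessing")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optionally: .registerKryoClasses(Array(classOf[YourTokenClass]))  // hypothetical class
val sc = new SparkContext(conf)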
Once you figure out the serialization issues, you need to figure out which NLP library to use to generate the POS tags. There are plenty of those, e.g. Stanford CoreNLP, LingPipe and Mallet for Java, Epic for Scala, etc. Note that you can of course use the Java NLP libraries with Scala, including with wrappers such as the University of Arizona's Sista wrapper around Stanford CoreNLP, etc.
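As a hedged sketch of how the CoreNLP route might look, assuming `myRdd` is an RDD[String] of documents and that you build the (non-serializable) pipeline once per partition on the executors rather than shipping it from the driver:
// Sketch: emit (lemma, POS) pairs per token; only plain Scala tuples leave the partition.
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

val tagged = myRdd.mapPartitions { docs =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)          // created on the executor
  docs.map { text =>
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala.map { token =>
      (token.get(classOf[CoreAnnotations.LemmaAnnotation]),
       token.get(classOf[CoreAnnotations.PartOfSpeechAnnotation]))
    }.toList
  }
}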
Also, why didn't your example lower-case the processed text? That's pretty much the first thing I would do. If you have special cases such as iPod, you could apply the lower-casing everywhere except in those cases; in general, though, I would lower-case everything. If you're removing punctuation, you should probably first split the text into sentences (splitting on periods with a regex, etc.); the punctuation itself can then also be removed with a regex.
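A minimal sketch of those normalization steps, again assuming `myRdd` is an RDD[String]; the regexes are only examples and the naive sentence splitter will mishandle abbreviations:
// Sketch: case folding, naive sentence split on ./!/?, punctuation stripping.
val normalized = myRdd.map(_.toLowerCase)
val sentences  = normalized.flatMap(_.split("""(?<=[.!?])\s+"""))    // naive splitter
val cleaned    = sentences.map(_.replaceAll("""[\p{Punct}]""", " ").trim)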
How deeply do you want to stem? For example, the Porter stemmer (there are implementations in every NLP library) stems so deeply that "universe" and "university" end up with the same stem. Do you really want that? There are less aggressive stemmers out there, depending on your use case. Also, why use stemming if you can use lemmatization, i.e. splitting the word into the grammatical prefix, root and suffix (e.g. walked = walk (root) + ed (suffix))? The roots would then give you better results than stems in most cases. Most of the NLP libraries I mentioned above do that.
Also, what's your distinction between a stop word and a non-useful word? For example, you removed the pronoun in the subject form "I" and the possessive form "my," but not the object form "me." I recommend picking up an NLP textbook like "Speech and Language Processing" by Jurafsky and Martin (for the ambitious), or just reading one of the engineering-centered books about NLP tools such as LingPipe for Java, NLTK for Python, etc., to get a good overview of the terminology, the steps in an NLP pipeline, etc.
There is no built-in NLP capability in Apache Spark. You would have to implement it for yourself, perhaps based on a non-distributed NLP library, as described in marekinfo's excellent answer.
I would suggest you take a look at Spark's ml pipeline. You may not get everything out of the box yet, but you can build your own capabilities and use the pipeline as a framework.
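For example, a minimal spark.ml pipeline sketch using the built-in RegexTokenizer and StopWordsRemover stages; the `reviews` DataFrame and its "text" column are assumptions for illustration:
// Sketch: tokenization + stop-word removal as pipeline stages on a DataFrame.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("text").setOutputCol("tokens").setPattern("\\W+")
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")    // default English stop-word list
val pipeline  = new Pipeline().setStages(Array(tokenizer, remover))
val model     = pipeline.fit(reviews)                // reviews: DataFrame with a "text" column
val processed = model.transform(reviews)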

Query Expansion in Perl

Is there any existing implementation for query expansion in Perl?
By query expansion I mean that when the user enters a query against our database, it will expand the search based on related terms.
In principle we have an XML file (e.g. MeSH) which we want to refer to for query expansion.
Bio::DB::MeSH - Term retrieval from a Web MeSH database
my $mesh = Bio::DB::MeSH->new();
my $term = $mesh->get_exact_term('Butter');
print $term->description;
You got a serviceable answer already, but there is a far deeper and more robust alternative for more serious usage: UMLS::Similarity and UMLS::Interface. The catch is that these are a bit of a bear to install, require MySQL, take up quite a bit of disk space, and require you to have the MeSH data locally and to make sure your usage complies with the couple dozen licenses covering the dictionaries/sources.
I don't mean to disparage Bio::DB::MeSH; it's useful and part of a bigger picture (BioPerl), but it relies on fragile heuristics and is at the mercy of the availability of, and trivial HTML changes to, the target/source site (it was broken the last time I used it, for example, though it was easy to patch locally).

Basic Profanity Filter in Objective C for iPhone

How have you like-minded individuals tackled the basic challenge of filtering profanity? Obviously one can't possibly handle every scenario, but it would be nice to have something at the most basic level as a first line of defense.
In Objective-C I've got:
NSArray *tokens = [text componentsSeparatedByString:@" "];
And then I loop through each token to see if any of the keywords (I've got about 400 in a list) are found within each token.
Realising false positives are also a problem: if a word is a perfect match, it's flagged as profanity; otherwise, if more than 3 words contain profanity without being perfect matches, the text is also flagged as profanity.
Later on I will use a webservice that tackles the problem more precisely, but I really just need something basic. So if you wrote the word penis it would go yup naughty naughty, bad word written.
Jeff has an interesting article to consider before embarking on such a piece of code: "Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?"
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
I just have a suggestion for tokenizing the string. Your approach works well if the words are all separated by spaces, but that is rarely the case in most usage scenarios, as you would normally have to deal with newlines, punctuation, etc. Try this if you are interested:
// Split on punctuation as well as whitespace/newlines, not just single spaces.
NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];
[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSArray *words = [bigString componentsSeparatedByCharactersInSet:separators];
Source: http://www.tech-recipes.com/rx/3418/cocoa-explode-break-nsstring-into-individual-words/
Well, searching in that manner is certainly not the most efficient way to search for profanity... a more efficient approach would be to construct a finite state automaton to detect the words, and run the text once through that FSA. You don't really need to split strings to find profanity, and all that splitting adds extra allocation and copying overhead that you don't need. Also, there may be common patterns in some of the blacklisted words, which you are not exploiting by searching each word individually.
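To illustrate the automaton idea, here is a simplified sketch (written in Scala for brevity rather than Objective-C, and using a plain trie scan from each position instead of a full Aho-Corasick automaton; the same structure ports directly to other languages):
// Sketch only: build a trie of blacklisted words, then scan the lower-cased text
// without splitting it into tokens first.
final class TrieNode {
  val children = scala.collection.mutable.Map[Char, TrieNode]()
  var terminal = false                         // true if a blacklisted word ends here
}

def buildTrie(words: Seq[String]): TrieNode = {
  val root = new TrieNode
  for (w <- words) {
    var node = root
    for (c <- w.toLowerCase) node = node.children.getOrElseUpdate(c, new TrieNode)
    node.terminal = true
  }
  root
}

def containsBlacklisted(text: String, root: TrieNode): Boolean = {
  val s = text.toLowerCase
  for (start <- s.indices) {
    var node = root
    var i = start
    while (i < s.length && node.children.contains(s(i))) {
      node = node.children(s(i))
      if (node.terminal) return true           // matched a blacklisted word
      i += 1
    }
  }
  false
}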
That said, I think 400 words is quite a lot. Who, exactly, is your audience? What if a user has a medical question? Should such questions actually be disallowed? I can only think of a handful of words that would be considered profane in any context, so you might want to rethink the filtering.
A couple of things:
FSA won't necessarily work depending on how intelligent you want the filter to be
Regex are generally extremely slow depending on how many you want to run
400 words is somewhat low, depending on your needs and languages
There are a number of extremely tricky cases to be careful of when filtering, particularly embedding of words such as "ASSume"
My company, Inversoft, builds a commercial filtering solution and it is quite intelligent. It doesn't use regex or FSA, but has a custom built fast-linear processing technology that makes it extremely fast and accurate (4,000+ messages per second). It also has over 600 English words in a number of categories including Slang, Racial Slurs, Drug, Gang, Religious, etc.
If you are looking for an intelligent context-aware solution with support, you should check out Clean Speak from Inversoft. Hooking it up to Obj-C should be simple using the XML WebService.

How was the Google Books' Popular passages feature developed?

I'm curious if anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their Popular Passages feature. However, if you know of any other application that can do the same, please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... by Georgios N. Yannopoulos, you can see something like:
Popular passages
... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86
Appears in 15 books from 1968-2003
This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86
Appears in 8 books from 1968-2000
It must be an intensive pattern matching process. I can only think of n-gram models, text corpora, and automatic plagiarism detection. But n-gram models are probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, popular passages can span a great many words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response which programming languages are best suited for this stuff: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself.)
PS: can someone add the tag automatic-plagiarism-detection, because I can't.
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help to detect plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.
If you know which books are citing or referencing other books, you don't need to look at all possible books, only at the books that cite each other. If it is a scientific reference, line and page numbers are often included with the quote or can be found in the bibliography at the end of the book, so maybe Google parses only this information?
Google Scholar certainly has citation information from paper to paper, and maybe from book to book too.
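For what it's worth, here is a toy sketch of the n-gram ("shingling") idea mentioned in the question: collect overlapping word n-grams per book and keep the ones shared by many books. This only illustrates the pattern matching step, not how Google necessarily does it; `books` and the thresholds are made up for the example:
// Sketch: find word 5-grams ("shingles") that appear in at least minBooks books.
def shingles(text: String, n: Int = 5): Set[Seq[String]] =
  text.toLowerCase.split("""\W+""").filter(_.nonEmpty).sliding(n).map(_.toSeq).toSet

def popularShingles(books: Map[String, String], minBooks: Int = 8): Map[Seq[String], Int] = {
  val counts = scala.collection.mutable.Map[Seq[String], Int]().withDefaultValue(0)
  for ((_, text) <- books; sh <- shingles(text)) counts(sh) += 1   // count books per shingle
  counts.filter(_._2 >= minBooks).toMap
}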

Can I identify matched terms when searching with sphinx?

I am using sphinx to do full text search on a mysql database through thinking sphinx.
I would like to highlight the matched terms in the results I show to the user.
Sphinx is smart enough that searching for 'botulism' will match "i like to inject botulinum into my eyes"
How can I get it to tell me that 'botulinum' matches 'botulism'?
First, I'm using Sphinx heavily for one of my projects, but I'm not using Thinking Sphinx since the config file we use is quite complex; I'm using a customized act_as_sphinx plugin.
To answer your question from a pure Sphinx point of view:
there is a BuildExcerpts API in Sphinx to get excerpts of content with the matches highlighted; see http://www.sphinxsearch.com/docs/current.html#api-func-buildexcerpts.
Thinking Sphinx should provide this functionality
To match botulism against botulinum you should build Sphinx with a stemmer; the Porter algorithm, enabled via the morphology setting, may answer your question: see http://www.sphinxsearch.com/docs/current.html#conf-morphology
Hope this helps, and I highly encourage you to look at the Sphinx documentation to get the most out of this very efficient indexer.
Manfred