I am using Sphinx to do full-text search on a MySQL database through Thinking Sphinx.
I would like to highlight the matched terms in the results I show to the user.
Sphinx is smart enough that searching for 'botulism' will match "i like to inject botulinum into my eyes".
How can I get it to tell me that 'botulinum' matches 'botulism'?
First, I use Sphinx heavily in one of my projects, but not through Thinking Sphinx: the config file we use is quite complex, so I use a customized act_as_sphinx plugin instead.
To answer your question from a pure Sphinx point of view:
There is a BuildExcerpts API call in Sphinx that returns an excerpt of the content with the matching terms highlighted; see http://www.sphinxsearch.com/docs/current.html#api-func-buildexcerpts.
Thinking Sphinx should provide this functionality.
To match 'botulism' as 'botulinum' you should compile Sphinx with a stemmer; the Porter algorithm may be what you need. See http://www.sphinxsearch.com/docs/current.html#conf-morphology
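To make the idea of stem-based highlighting concrete, here is a minimal Python sketch. It is not Sphinx's BuildExcerpts and not the Porter algorithm: `toy_stem`, `highlight`, and the suffix list are hypothetical names invented for illustration, and the toy stemmer only strips a few common English suffixes.

```python
import re

# Hypothetical toy stemmer: strips a few common English suffixes.
# It is NOT the Porter algorithm and NOT what Sphinx does internally.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Lower-case the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, query, before="<b>", after="</b>"):
    """Wrap every word of `text` whose stem equals the stem of a query word."""
    query_stems = {toy_stem(w) for w in re.findall(r"\w+", query)}

    def mark(match):
        word = match.group(0)
        if toy_stem(word) in query_stems:
            return before + word + after
        return word

    return re.sub(r"\w+", mark, text)
```

With this sketch, `highlight("He injected it twice", "injecting")` marks "injected", because both words reduce to the same toy stem. A real stemmer would decide differently which words count as matches.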
Hope this helps and I highly encourage you to look at the sphinx documentation to fully use this very efficient indexer
Manfred
I started learning NetLogo today, working from an existing simulation that uses ask-concurrent. When I looked up what ask-concurrent does in the documentation, I found the following: "NOTE: The following information is included only for backwards compatibility. We don't recommend using the ask-concurrent primitive at all in new models."
However, it doesn't say anything about alternatives to this function. What should I use instead of ask-concurrent?
Use regular ask. If you find that changes the meaning of the code in a way that you consider undesirable, then in order to advise you on what to do about it, we will need to know the details of your code.
http://ccl.northwestern.edu/netlogo/docs/programming.html#ask-concurrent contains several examples of pieces of code that use ask-concurrent and how they can be written using regular ask instead.
I'm about to launch into a Lucene.NET implementation and I am concerned about using the PorterStemFilter. Reading here, and reading source code, it appears to be far, far too aggressive for my needs.
I need something simpler that doesn't look for roots but just removes suffixes such as "er", "ed", "s", etc. From what I've read, KStem would do the trick.
I can't for the life of me find a .NET version of KStem. I can't even find source code for the Java version to handroll a port.
Could someone point me in the right direction?
Looks like it is easy enough to handcraft a reduced PorterStemmer by simply removing steps I don't want. Anyone have success with that?
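For reference, the kind of light suffix stripping described above can be sketched in a few lines. This is a Python illustration only, neither KStem nor a reduced PorterStemFilter; `light_stem` and its `min_stem` parameter are hypothetical names, and the rule set is just the three suffixes mentioned in the question.

```python
# Hypothetical light stemmer: strips only "er", "ed", and "s".
# This is neither KStem nor a reduced PorterStemFilter.
def light_stem(word, min_stem=3):
    """Return the lower-cased word with at most one suffix removed."""
    lower = word.lower()
    for suffix in ("er", "ed", "s"):
        if not lower.endswith(suffix):
            continue
        # Leave words like "glass" alone: a double "s" is not a plural.
        if suffix == "s" and lower.endswith("ss"):
            continue
        stem = lower[: -len(suffix)]
        # Refuse to strip if the remaining stem would be too short ("bed").
        if len(stem) >= min_stem:
            return stem
    return lower
```

Even a rule set this small needs guards (minimum stem length, the "ss" case), which is worth keeping in mind when deciding which Porter steps to drop.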
You could use the HunspellStemmer, part of contrib. It can use freely available hunspell dictionaries to provide proper stemming.
Is there a standard way to implement character-by-character typeahead autocomplete using ElasticSearch for small fields (e.g. place names)?
(At the time of writing, a number of discussions are available via search, but nothing seems definitive. I also see there is talk of the effect of autocomplete/suggest support in Apache Lucene 4.)
Thanks for thoughts.
You can use Edge NGram based analyzer, see http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer.html
Or use the suggest plugin: https://github.com/spinscale/elasticsearch-suggest-plugin
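To illustrate what edge n-grams are, here is a small Python sketch of the idea only, not of the ElasticSearch tokenizer itself. The names `edge_ngrams` and `build_index` and the `min_gram`/`max_gram` defaults are assumptions made for this example.

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the prefixes of `term` from min_gram to max_gram characters."""
    term = term.lower()
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_index(names, min_gram=1, max_gram=10):
    """Map each prefix to the sorted list of names that produce it."""
    index = {}
    for name in names:
        for gram in edge_ngrams(name, min_gram, max_gram):
            index.setdefault(gram, set()).add(name)
    return {gram: sorted(hits) for gram, hits in index.items()}
```

Because every prefix is indexed up front, a typeahead lookup becomes an exact match on whatever the user has typed so far, e.g. `build_index(["Paris", "Padua", "Rome"]).get("pa", [])`. The cost is a larger index.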
HTH
As David wrote, you can use NGrams or the suggest plugin. With Lucene 4 it will be possible to have better auto-suggestions out of the box, without the need to maintain a separate index.
For now you can also just make a terms facet on your field and use a regex pattern to keep only the entries that start with the relevant prefix:
"facets" : {
    "tag" : {
        "terms" : {
            "field" : "field_name",
            "regex" : "prefix.*"
        }
    }
}
The regex is just an example; it can be improved, and you can also make it case insensitive using the proper regex flag. Also, beware that making a facet on a field that contains many unique terms is not a great idea, unless you have enough memory for it.
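As a rough picture of what such a facet computes, here is a local Python simulation. It does not call ElasticSearch; `prefix_facet` is a hypothetical helper that counts the values matching a case-insensitive prefix pattern, which is also why many unique terms mean many counters held in memory.

```python
import re
from collections import Counter


def prefix_facet(values, prefix):
    """Count the values that start with `prefix`, ignoring case."""
    # re.escape keeps characters in the user's input from acting as regex syntax.
    pattern = re.compile(re.escape(prefix) + ".*", re.IGNORECASE)
    counts = Counter(v for v in values if pattern.fullmatch(v))
    # Most frequent entries first, as a list of (term, count) pairs.
    return counts.most_common()
```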
Use the built-in autocompletion suggester that's available since version 0.90.3:
http://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html
It's blazingly fast and was developed for exactly that use case.
Is there any existing implementation of query expansion in Perl?
By query expansion I mean that when the user enters a query in our database, the search is expanded based on related terms.
In principle we have an XML file (e.g. MeSH) that we want to refer to for query expansion.
Bio::DB::MeSH - Term retrieval from a Web MeSH database
use Bio::DB::MeSH;

my $mesh = Bio::DB::MeSH->new();
# Look up the exact term in the web MeSH database.
my $term = $mesh->get_exact_term('Butter');
print $term->description;
You got a serviceable answer already but there is a far deeper and more robust alternative for more serious usage: UMLS::Similarity and UMLS::Interface. The problem being these are a little bit of a bear to install, require MySQL, take up quite a bit of disk space, and require you to have the MeSH stuff locally and make sure your usage is in compliance with a couple dozen related licenses for the dictionaries/sources.
I do not mean to disparage Bio::DB::MeSH, it's useful and part of a bigger picture (BioPerl), but it has fragile heuristics and is at the mercy of availability and trivial HTML changes in the target/source site (it was broken the last time I used it for example though it was easy to patch locally).
I am new to Lucene.Net.
Which is the best Analyzer to use in Lucene.Net?
Also, I want to know how to use the stop word and word stemming features.
I'm also new to Lucene.Net, but I do know that the SimpleAnalyzer does no stop word handling and indexes all tokens/words.
Here's a link to some Lucene info: http://darksleep.com/lucene/. By the way, the .NET version is an almost exact port of the Java version, so the Java documentation should work fine in most cases. There's a section in there about the three analyzers: Simple, Stop, and Standard.
I'm not sure how Lucene.Net handles word stemming, but this link, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2, demonstrates how to create your own Analyzer in Java, and uses a PorterStemFilter to do word-stemming.
...[T]he Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English
I hope that is helpful.
The best analyzer I have found is the StandardAnalyzer, which also lets you specify the stop words.
For example:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Store;
string indexFileLocation = @"C:\Index";
string stopWordsLocation = @"C:\Stopwords.txt";
var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
Analyzer analyzer = new StandardAnalyzer(
    Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));
It depends on your requirements. If your requirements are ultra simple, e.g. case-insensitive, non-stemming searches, then StandardAnalyzer is a good choice. If you look into the Analyzer class and get familiar with Filters, particularly TokenFilter, you can exert an enormous amount of control over your index by rolling your own analyzer.
Stemmers are tricky, and it's important to have a deep understanding of what type of stemming you really need. I've used the Snowball stemmers. For example, the words "policy" and "police" have the same root in the English Snowball stemmer, and getting hits on documents with "policy" when the search term is "police" isn't so hot. I've implemented strategies to support both stemmed and non-stemmed search so that this can be avoided, but it's important to understand the impact.
Beware of temptations like stop words. If you need to search for the phrase "to be or not to be" and the standard stop words are enabled, your search will fail to find documents with that phrase.
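A quick way to see why that phrase disappears is to run it through a stop word filter by hand. This Python sketch only mimics the idea; `STOP_WORDS` is an illustrative subset chosen for this example, not Lucene's actual default list, and `analyze` is a hypothetical stand-in for an analyzer.

```python
# Illustrative subset of English stop words (an assumption for this example,
# not the exact list any Lucene analyzer ships with).
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def analyze(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]
```

Here `analyze("to be or not to be")` returns an empty list: every token is a stop word, so nothing is left to index or to search for.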