How to do a prefix query in Apache Lucy with Perl? - perl

I am using Apache Lucy to speed up a typeahead (autocomplete) field on a web form, querying against nearly 800k records. I have a working setup but would like to limit the responses to terms that begin with the query string. Currently the query matchers either match only whole words or, if I tokenize with /./, match the query against fragments anywhere inside a word.
While going through the documentation I found Lucy::Docs::Cookbook::CustomQueryParser.
On that page, under the heading "Extending the query language", there is a reference to PrefixQuery. This package does not ship with Lucy, so I had to do some more searching. Eventually I found the PrefixQuery.pm code sample in Lucy's git repository.
Note that this package references another non-existent package called Lucy::Search::Tally. Removing the references to Tally allowed me to get the example working, but it is far from a functional matcher: it doesn't handle multiple fields, does no scoring, etc.
Does anyone know of a way to make Lucy do prefix matching without all this mucking about?

Found a solution in the Apache docs.
http://lucy.apache.org/docs/perl/Lucy/Docs/Cookbook/CustomQuery.html

Related

How do I use the Github Search API to look for repositories, using multiple queries?

As per the official documentation: "For example, if you want to search for popular Tetris repositories written in assembly code, your query might look like this:
q=tetris+language:assembly&sort=stars&order=desc
This query searches for repositories with the word tetris in the name, the description, or the README. The results are limited to repositories where the primary language is assembly. The results are sorted by stars in descending order, so that the most popular repositories appear first in the search results."
However, I'd like to use more than one keyword to avoid false positives. For instance, I want to count repositories that use the TensorFlow framework for meta-learning, but I can't find a code example for this. Could someone please help with that?
Here is what I tried using the Thunder Client extension in VS Code, but I am not sure if it's correct:
https://api.github.com/search/repositories?q=Tensorflow, meta-learning+language:python
I also found this question, where the poster says to use "+". However, I don't see any official documentation anywhere.
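One way to read the documented example is that the "+" is simply a URL-encoded space, so several keywords can be separated by spaces before encoding. Below is a minimal sketch of building such a URL with Python's standard library. The keyword list and the build_search_url helper are illustrative assumptions, not something taken from the GitHub documentation, and the request itself is not sent here.

```python
from urllib.parse import urlencode

API_URL = "https://api.github.com/search/repositories"

def build_search_url(keywords, language, sort="stars", order="desc"):
    """Join keywords with spaces and append a language: qualifier (hypothetical helper)."""
    query = " ".join(list(keywords) + [f"language:{language}"])
    # urlencode turns spaces into "+" and ":" into "%3A".
    return API_URL + "?" + urlencode({"q": query, "sort": sort, "order": order})

url = build_search_url(["tensorflow", "meta-learning"], "python")
print(url)
```

Sending a GET request to the resulting URL should return JSON whose total_count field holds the number of matching repositories.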

What's the best way to count and score text in MongoDB?

Having a collection of {_id: 'xxx', text: 'abc'}
What is the best way to have a list of entities with the same text, considering also spelling mistakes, for example 'gogle' 'google' ordered by the number of similar entities?
MongoDB doesn't have the capability to find misspelled items. There are some third-party libraries/plugins that offer this feature by storing double metaphone key codes along with the original version of the text. Here's an example program in C# that gets the result you want. See this page for a brief explainer on how it works. If you're not coding in C#, there's this plugin for Mongoose.
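As a rough illustration of the grouping step only, here is a Python sketch that runs in application code after the text values have been read from the collection. Note that it swaps double metaphone for plain string similarity from the standard library's difflib, which is a different technique, and the 0.8 threshold and the sample data are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True when two strings are close enough to count as the same entity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def group_by_similarity(texts, threshold=0.8):
    """Greedy grouping: each text joins the first group whose representative it resembles."""
    groups = []  # each entry is [representative, count]
    for text in texts:
        for group in groups:
            if similar(group[0], text, threshold):
                group[1] += 1
                break
        else:
            groups.append([text, 1])
    # Largest groups first.
    return sorted(((rep, n) for rep, n in groups), key=lambda g: g[1], reverse=True)

# Sample values standing in for the "text" field of the documents.
texts = ["google", "gogle", "google", "yahoo", "yaho", "bing"]
ranking = group_by_similarity(texts)
print(ranking)
```

The greedy pass compares every text against every group representative, so it is only practical for modest collections; the phonetic-key approach described above avoids that by letting the database group on a stored key.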

How to search for multiple tags around one location?

I'm trying to figure out what's the best solution to find all nodes of certain types around a given GPS-Location.
Let's say I want to get all cafes, pubs, restaurant and parks around a given point X.xx,Y.yy.
[out:json];(node[amenity][leisure](around:500,52.2740711,10.5222147););out;
This returns nothing, I think because it searches for nodes that have both an amenity and a leisure tag, which is not possible.
[out:json];(node[amenity or leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity,leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity;leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity|leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity]|[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity],[leisure](around:500,52.2740711,10.5222147););out;
[out:json];(node[amenity];[leisure](around:500,52.2740711,10.5222147););out;
These attempts all result in an error (400: Bad Request).
The only working solution I found is the following one, which results in really long queries:
[out:json];(node[amenity=cafe](around:500,52.2740711,10.5222147);node[leisure=park](around:500,52.2740711,10.5222147);node[amenity=pub](around:500,52.2740711,10.5222147);node[amenity=restaurant](around:500,52.2740711,10.5222147););out;
Isn't there an easier solution without multiple "around" statements?
EDIT:
Found this, which is a little bit shorter but still needs multiple "around" statements:
[out:json];(node["leisure"~"park"](around:400,52.2784715,10.5249662);node["amenity"~"cafe|pub|restaurant"](around:400,52.2784715,10.5249662););out;
What you're probably looking for is regular expression support for keys (not only values).
Here's an example based on your query above:
[out:json];
node[~"^(amenity|leisure)$"~"."](around:500,52.2740711,10.5222147);
out;
NB: Since version 0.7.54 (released in Q1/2017) Overpass API also supports filter criteria with 'or' conditions. See this example on how to use this new (if: ) filter.
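If the query is generated by a program, the key regex can be assembled from a list of keys. The following Python sketch only builds the query text shown above; the build_around_query helper and its parameters are hypothetical names, and it assumes the keys and values are plain words that need no regex escaping.

```python
def build_around_query(keys, radius_m, lat, lon, values=None):
    """Build an Overpass QL query matching any of the given keys around a point."""
    key_re = "^(" + "|".join(keys) + ")$"
    # "." matches any value; pass values to restrict them, e.g. ["cafe", "park"].
    value_re = "^(" + "|".join(values) + ")$" if values else "."
    return (
        "[out:json];\n"
        f'node[~"{key_re}"~"{value_re}"](around:{radius_m},{lat},{lon});\n'
        "out;"
    )

query = build_around_query(["amenity", "leisure"], 500, 52.2740711, 10.5222147)
print(query)
```

Passing values=["cafe", "pub", "restaurant", "park"] narrows the result to those tag values while still using a single "around" statement.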

Lucene.Net/SpellChecker - multi-word/phrase based auto-suggest

I've implemented Lucene.NET on my site, using it to index my products, which are theatre shows, tours and attractions around London.
I want to implement a "Did you mean?" feature for when users misspell product names that takes the whole product titles into account and not just single words. For example,
If the user typed:
Lodnon Eye
I would like to auto-suggest:
London
London Eye
I assume I need to have the analyzer index the titles as if they are a single entity, so that SpellChecker can nearest-match on the phrase as well as on the individual words.
How would I do this?
There is an excellent blog series here:
Lucene.NET
Introduction to Lucene
Indexing basics
Search basics
Did you mean..
Faceted Search
Class Reference
I have also found another project called SimpleLucene, which you can use to maintain your Lucene indexes whenever you need to update or delete a document. Read about it here
I've just recently implemented a phrase autosuggest system in Lucene.NET.
Basically, the Java version of Lucene has a ShingleFilter in one of the contrib folders, which breaks a sentence down into all possible phrase combinations. Unfortunately, Lucene.NET's contrib filters aren't quite there yet, so we don't have a shingle filter.
However, a Lucene index written in Java can be read by Lucene.NET as long as the versions are the same. So what I did was the following:
I created a spell index in Lucene.NET using the SpellChecker.IndexDictionary method, as laid out in the "Did you mean.." section of Jake Scott's link. Please note that this only creates a spelling index of single words, not phrases.
I then created a Java app that uses the shingle filter to create phrases from the text I'm searching and saves them in a temporary index.
I then wrote another method in .NET to open this temporary index and add each of the phrases as a line or document to my spelling index, which already contains the single words. The trick is to make sure the documents you're adding have the same form as the rest of the spell documents, so I ripped out the methods used in the SpellChecker code in the Lucene.NET project and edited those.
Once you've done that, you can call the SpellChecker.SuggestSimilar method, pass it a misspelled phrase, and it will return a valid suggestion.
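To make the shingle idea concrete, here is a small Python sketch of the same pipeline: break each title into word shingles, keep them in one dictionary alongside the single words, and fuzzy-match a misspelled phrase against it. It is not Lucene code. difflib.get_close_matches stands in for the spell checker, and the function names, the sample titles and the 0.6 cutoff are assumptions made for the example.

```python
import difflib

def shingles(text, max_size=3):
    """Return every run of 1..max_size consecutive words in text."""
    words = text.split()
    phrases = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                phrases.append(" ".join(words[start:start + size]))
    return phrases

def build_dictionary(titles, max_size=3):
    """Map lower-cased shingles to their original spelling."""
    dictionary = {}
    for title in titles:
        for phrase in shingles(title, max_size):
            dictionary.setdefault(phrase.lower(), phrase)
    return dictionary

def suggest(query, dictionary, limit=3):
    """Return up to `limit` dictionary phrases closest to the query, best first."""
    matches = difflib.get_close_matches(query.lower(), list(dictionary), n=limit, cutoff=0.6)
    return [dictionary[m] for m in matches]

dictionary = build_dictionary(["London Eye", "Tower of London"])
suggestions = suggest("Lodnon Eye", dictionary)
print(suggestions)
```

Because single words and multi-word shingles live in the same dictionary, a misspelled phrase can be matched against whole phrases as well as individual words.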
This is probably not the best solution, and I would definitely use the answer suggested by spaceman, but here is another possible solution. Use the KeywordAnalyzer or the KeywordTokenizer on each title; this will not break the title down into separate tokens but keep it as one token. The SuggestSimilar method would then return whole titles as suggestions.

Lucene.NET - Search phrase containing "and"

Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):
"oil and gas field" (complete phrase)
"research and development" (complete phrase)
"r&d" (complete phrase)
Ideally, I'd like to use the QueryParser as the input is coming from the user.
During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, a QueryParser.Parse converts them to:
contents:"oil gas field"
contents:"research development"
Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).
If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:
contents:"r d"
Which again, isn't quite what I'm looking for.
Any advice?
If you want to search for "and", you have to index it. Write your own Analyzer or remove "and" from the list of stop words. The same applies to "r&d": write your own Analyzer that creates three words from the text: "r", "d" and "r&d".
Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high speed searching.
The upshot of this is that for an index to work effectively you have to anticipate up front what your analyzer needs to do. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally @ to 'at' etc). In the case of r&d matching research & development, you are almost certainly going to have to implement some domain-specific logic.
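The analysis rule described here can be sketched outside Lucene to see what tokens it would produce. The Python below is only an illustration of the idea (keep stop words, turn & into 'and'); the analyze and contains_phrase functions are hypothetical and are not the Lucene Analyzer API. The same analysis is applied to both the document and the query, which is what makes the phrase comparison work.

```python
import re

def analyze(text):
    """Lower-case, rewrite '&' as 'and', and keep every word (no stop-word removal)."""
    text = re.sub(r"\s*&\s*", " and ", text.lower())
    return re.findall(r"[a-z0-9]+", text)

def contains_phrase(document, phrase):
    """True when the analyzed phrase occurs as consecutive tokens in the document."""
    doc, wanted = analyze(document), analyze(phrase)
    return any(doc[i:i + len(wanted)] == wanted
               for i in range(len(doc) - len(wanted) + 1))

print(analyze("R&D"))
print(contains_phrase("Our R & D budget grew", "r&d"))
print(contains_phrase("An oil and gas field was found", "oil and gas field"))
```

With this rule "r&d" and "r and d" become the same token sequence, but "r&d" still does not match "research and development"; that mapping is the domain-specific part.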
There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.
Another option is to use the high speed of Lucene to filter your initial results down to something more manageable using an analyzer that doesn't give false negatives. You can then run some detailed filtering over the full text of those documents that it does find to match the correct phrases.
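A rough Python sketch of this two-stage approach follows. The first stage imitates a coarse index lookup that ignores stop words and so cannot miss a true match; the second stage checks the exact phrase against the full text of the candidates. The stop-word list, function names and sample documents are all assumptions for illustration, and the first stage here is a plain loop rather than a real Lucene search.

```python
import re

STOP_WORDS = {"and", "or", "the", "of"}  # hypothetical stop-word list

def tokens(text):
    return re.findall(r"[a-z0-9&]+", text.lower())

def coarse_match(doc, phrase):
    """Stage 1: every non-stop word of the phrase appears somewhere in the document."""
    wanted = {t for t in tokens(phrase) if t not in STOP_WORDS}
    return wanted <= set(tokens(doc))

def exact_match(doc, phrase):
    """Stage 2: the words of the phrase appear consecutively, stop words included."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.lower().split()) + r"\b"
    return re.search(pattern, doc.lower()) is not None

def search(docs, phrase):
    candidates = [d for d in docs if coarse_match(d, phrase)]
    return [d for d in candidates if exact_match(d, phrase)]

docs = [
    "The oil and gas field was sold.",
    "Gas prices rose; the oil field closed.",
    "A field of wheat.",
]
hits = search(docs, "oil and gas field")
print(hits)
```

The second document survives the coarse stage because it contains all three content words, and is only rejected by the exact check, which is the kind of false positive the detailed filtering is meant to remove.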
Ultimately, I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed; it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.