Lucene.NET - Search phrase containing "and"

Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):
"oil and gas field" (complete phrase)
"research and development" (complete phrase)
"r&d" (complete phrase)
Ideally, I'd like to use the QueryParser as the input is coming from the user.
During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, QueryParser.Parse converts them to:
contents:"oil gas field"
contents:"research development"
Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).
If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:
contents:"r d"
Which again, isn't quite what I'm looking for.
Any advice?

If you want to search for "and" you have to index it. Write your own Analyzer or remove "and" from the list of stop words. The same applies to "r&d": write your own Analyzer that produces three tokens from the text: "r", "d" and "r&d".
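A minimal sketch of such an analyzer (Lucene.Net 3.x-style API, so names may differ slightly in your version; no stop words are removed, and "r&d" survives as one token):

using System.IO;
using Lucene.Net.Analysis;

// Keeps every token: there is no StopFilter, so "and" gets indexed, and
// the whitespace tokenizer leaves "r&d" intact as a single term.
public class KeepAllWordsAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream stream = new WhitespaceTokenizer(reader);
        return new LowerCaseFilter(stream);
    }
}

Use the same analyzer at index time and query time. Emitting "r", "d" and "r&d" as three tokens would additionally need a custom TokenFilter that injects the extra tokens with a position increment of 0.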

Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high-speed searching.
The upshot of this is that for an index to work effectively you have to anticipate what your analyzer needs to do up front. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally # to 'at', etc.). In the case of r&d matching research & development, you are almost certainly going to have to implement some domain-specific logic.
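One simple way to get the & to 'and' transform, short of writing a full CharFilter, is to normalise the raw text before it reaches the analyzer, applying the same step to the indexed text and to the query string; a sketch:

// Hypothetical helper: call this on documents before indexing and on the
// user's query before parsing, so both sides see the same text.
public static string NormalizeAmpersands(string text)
{
    // "oil & gas" -> "oil and gas". What to do with "&" inside tokens
    // such as "r&d" is the domain-specific part mentioned above.
    return text.Replace(" & ", " and ");
}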
There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.
Another option is to use the high speed of Lucene to filter your initial results down to something more manageable, using an analyzer that doesn't give false negatives. You can then run more detailed filtering over the full text of the documents it finds to match the exact phrases.
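A rough sketch of that post-filtering step (assumes the contents field is stored, and broadQuery was built with a permissive analyzer):

using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

// Let Lucene narrow the candidates, then verify the exact phrase against
// the stored text of each hit.
public static List<Document> FilterByPhrase(
    IndexSearcher searcher, Query broadQuery, string phrase)
{
    var matches = new List<Document>();
    foreach (var hit in searcher.Search(broadQuery, 1000).ScoreDocs)
    {
        var doc = searcher.Doc(hit.Doc);
        var text = doc.Get("contents") ?? "";
        if (text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0)
            matches.Add(doc);
    }
    return matches;
}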
Ultimately I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed; it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.

Related

Elasticsearch - is there a method to match using "almost ident"

I use Facebook and Google Maps to get full geo-entity data values (country, city, street, zip...).
I store these values in my MongoDB.
I noticed that some locations differ in how they are written on Facebook and on Google; for (a made-up) example, Facebook spells 'Hawaii' with an 'e': 'Haweii'.
I use a match on all fields (country + city + street...) to search for entities at the same location, but since some are written a bit differently I will not find them.
Is there a way to make Elasticsearch search for 'Hawaii' and any other option that sounds like 'Hawaii' but is written a bit differently?
Thanks for any help!
Using the Google API one can get a location's full details.
To match words that sound similar you can use the phonetic analyzer. You can also give the fuzzy query a try to match words with spelling mistakes. Neither is foolproof though, and they may produce false positives. You'll have to experiment a little to come up with the solution that best fits your needs.
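For example, a fuzzy query over a hypothetical locations index with a city field (raw REST call from C#; "AUTO" fuzziness tolerates small edit distances, so 'haweii' can still match 'hawaii'):

using System;
using System.Net.Http;
using System.Text;

class FuzzySearchExample
{
    static void Main()
    {
        // Fuzzy is a term-level query, so the value should match the
        // indexed (lowercased) form of the token.
        const string body = @"{
          ""query"": {
            ""fuzzy"": { ""city"": { ""value"": ""haweii"", ""fuzziness"": ""AUTO"" } }
          }
        }";
        using (var http = new HttpClient())
        {
            var resp = http.PostAsync(
                "http://localhost:9200/locations/_search",
                new StringContent(body, Encoding.UTF8, "application/json")).Result;
            Console.WriteLine(resp.Content.ReadAsStringAsync().Result);
        }
    }
}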
If you have a known set of differences between Facebook and Google Maps, you could look at using synonyms at either index time or query time to accommodate differences in the APIs; there are merits to taking either approach.
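A sketch of the index-time variant, mapping a known spelling difference with a synonym token filter (index and analyzer names are made up):

// Index settings to send when creating the index; any field mapped to the
// "geo_text" analyzer will see "haweii" rewritten to "hawaii" at index time.
const string IndexSettings = @"{
  ""settings"": {
    ""analysis"": {
      ""filter"": {
        ""geo_synonyms"": { ""type"": ""synonym"", ""synonyms"": [ ""haweii => hawaii"" ] }
      },
      ""analyzer"": {
        ""geo_text"": { ""tokenizer"": ""standard"",
                        ""filter"": [ ""lowercase"", ""geo_synonyms"" ] }
      }
    }
  }
}";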

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.
The problem is simple: How to extract every WikiData entity from a given text?
An entity is defined as any noun, proper or common: names of people, organizations and locations, as well as things like chairs, potatoes, etc.
So far I've tried the following:
Tokenize the text with OpenNLP and use its pre-trained models to extract people, locations, organizations and regular nouns.
Apply Porter Stemming where applicable.
Match all extracted nouns with the wmflabs-API to retrieve a potential WikiData ID.
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
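The caching idea is roughly this (a sketch in C# for illustration, although the pipeline itself is Spark/Scala; it uses Wikidata's standard wbsearchentities endpoint rather than the wmflabs API):

using System;
using System.Collections.Generic;
using System.Net.Http;

// Memoised entity lookup: each distinct noun hits the network only once.
class WikidataLookup
{
    static readonly HttpClient Http = new HttpClient();
    static readonly Dictionary<string, string> Cache = new Dictionary<string, string>();

    public static string Search(string noun)
    {
        string json;
        if (Cache.TryGetValue(noun, out json)) return json;
        var url = "https://www.wikidata.org/w/api.php" +
                  "?action=wbsearchentities&language=en&format=json" +
                  "&search=" + Uri.EscapeDataString(noun);
        json = Http.GetStringAsync(url).Result;
        Cache[noun] = json;
        return json;
    }
}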
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
Some suggestions:
consider Stanford NER in comparison to OpenNLP to see how it compares on your corpus
I wonder at the value of stemming for most entity names
I suspect you might be losing information by dividing the task into discrete stages
although Wikidata is new, the task isn't, so you might look at papers for Freebase|DBpedia|Wikipedia entity recognition|disambiguation
In particular, DBpedia Spotlight is one system designed for exactly this task.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

Does Algolia have a search with recommendation?

I was wondering if the Algolia service provides some kind of recommendation mechanism when doing a search.
I could not find anything in the API documentation related to providing the client with more refined and intelligent search alternatives based on the index data.
The scenario I am trying to describe is the following (this example is a bit over-the-top):
Given a user is searching for "red car" then the system provides more specific search alternatives and possibly related items that exist in the database (e.g. ferrari, red driving gloves, fast and furious soundtrack :-) )
Update
For future reference, I ended up doing a basic recommendation system using the text search capabilities of Algolia.
Summarising, when a car is saved, its attributes (color, speed, engine, etc.) are used to create synonym indexes; for example, for the engine 'ferrari' in the Engine index:
{
synonyms: ['red', 'ferrari', 'fast'],
value: 'ferrari'
}
Finally, each index must declare the synonyms attribute as searchable and value as the attribute returned from a search.
Algolia does not provide that kind of "intelligence" out of the box.
Something you can do to approximate what you're looking for is using synonyms and a combination of other parameters:
Define synonym groups such as "car,Ferrari,driving gloves", "red,dark red,tangerine,orange", ...
when sending a search query, set optionalWords to the list of words contained in that query. This will make each word of your query optional.
Also set removeStopWords to true so that words like "the", "a" (...) are ignored, to improve relevance.
With a well defined list of synonyms, this will make your original query interpreted as many other possibilities and thus increase the variety of possible results.
Be aware though that it could also impact the relevance of your results, as users might for instance not want to look for gloves when they search for a car!
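To illustrate the query side, a raw REST call with both parameters set (application id, API key and index name are placeholders):

using System;
using System.Net.Http;
using System.Text;

class AlgoliaQueryExample
{
    static void Main()
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("X-Algolia-Application-Id", "YOUR_APP_ID");
            http.DefaultRequestHeaders.Add("X-Algolia-API-Key", "YOUR_SEARCH_KEY");
            // Every query word is optional and stop words are ignored, so
            // synonym groups get a chance to widen the result set.
            const string body =
                "{ \"params\": \"query=red car&optionalWords=red,car&removeStopWords=true\" }";
            var resp = http.PostAsync(
                "https://YOUR_APP_ID-dsn.algolia.net/1/indexes/cars/query",
                new StringContent(body, Encoding.UTF8, "application/json")).Result;
            Console.WriteLine(resp.Content.ReadAsStringAsync().Result);
        }
    }
}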

full text search algorithm for mongodb

I have a blog and I want to search titles with MongoDB, not with Solr or Elasticsearch. For example, I have these titles:
wolkswagen
wolkswagen polo
wolkswagen passat
The wolkswagen title holds the history of wolkswagen; under polo and passat I have those cars' definitions. I tokenized the titles by space. When I type "wolkswagen", polo and passat come out on top, but wolkswagen should be on top. What algorithm would bring wolkswagen to the top?
thank you :)
Ok well you have two options here:
You can use the new FTS feature in 2.4: http://architects.dzone.com/articles/mongodb-full-text-search . I should mention that FTS is experimental and very badly documented, so it might not suit you. It sorts by relevance by default, so the pattern of results you are looking for is applied automatically.
You can do client-side processing (not advised for large sets) whereby you pull the results out and manually test the relevance of each result against each word in the search text. As an algorithm for that, maybe something like:
iterate every word separated by a space
assign a value between 0 and 1 for how complete the match is; if it matches a complete word, assign 1
Add these up and attach the total score to each result.
Use client-side sorting to order results by that score.
I am afraid that without knowing your programming language, that is about the best I can do.
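For illustration, here is that scoring idea as a sketch in C#, with one tweak: the total is normalised by the title's word count so that the bare "wolkswagen" title outranks "wolkswagen polo" (assumes the candidate titles have already been fetched from MongoDB):

using System;
using System.Linq;

class TitleScorer
{
    // Each search word contributes 1.0 for an exact word match, or a
    // partial score (matched length / word length) for a prefix match.
    public static double Score(string title, string search)
    {
        var titleWords = title.ToLowerInvariant().Split(' ');
        double score = 0;
        foreach (var word in search.ToLowerInvariant().Split(' '))
        {
            if (titleWords.Contains(word))
                score += 1.0;
            else
            {
                var partial = titleWords.FirstOrDefault(t => t.StartsWith(word));
                if (partial != null)
                    score += (double)word.Length / partial.Length;
            }
        }
        // Normalise so titles made mostly of matched words rank highest:
        // "wolkswagen" scores 1.0, "wolkswagen polo" scores 0.5.
        return score / titleWords.Length;
    }
}

Sorting the fetched titles by Score(title, query) in descending order then gives the ordering you want.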

phrase synonym analyser Lucene.net?

I am using the Synonym Analyser, but this only adds synonyms for single words. Is there a similar analyser for phrases or does anyone know any other way of adding phrase synonyms? For example, "The Big Apple" should return a hit for "New York".
Thanks.
You can obviously build your own analyzer...I built a synonym analyzer that took single words and matched multiple words...custom development.
Instead of doing that, I would recommend dynamically injecting synonyms during query building or parsing. For example, you could have a person search for "The Big Apple"...
1) check the phrase "The Big Apple" for synonym phrases
2) If synonym phrases exist, build a boolean query with 2 PhraseQueries "The Big Apple" and "New York".
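A rough sketch of that injection (Lucene.Net 3.x-style; the synonym table and field name are made up, and it assumes the index and query sides tokenize phrases the same way):

using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public class SynonymQueryBuilder
{
    // Made-up phrase-synonym table; in practice load it per tenant/config.
    static readonly Dictionary<string, string[]> Synonyms =
        new Dictionary<string, string[]>
        {
            { "the big apple", new[] { "new york" } }
        };

    public static Query Build(string phrase)
    {
        var query = new BooleanQuery();
        // Note: Occur is BooleanClause.Occur in older Lucene.Net versions.
        query.Add(ToPhraseQuery(phrase), Occur.SHOULD);
        string[] alternatives;
        if (Synonyms.TryGetValue(phrase.ToLowerInvariant(), out alternatives))
            foreach (var alt in alternatives)
                query.Add(ToPhraseQuery(alt), Occur.SHOULD);
        return query;
    }

    static PhraseQuery ToPhraseQuery(string phrase)
    {
        var pq = new PhraseQuery();
        foreach (var term in phrase.ToLowerInvariant().Split(' '))
            pq.Add(new Term("contents", term));
        return pq;
    }
}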
Another (more performant) way is to use a MultiPhraseQuery instead of PhraseQueries combined in a BooleanQuery. Which fits best would depend on how complex your boolean queries get...I have found both work pretty fast in my case.
The downside of this approach is that searching will be a bit slower. The benefit is that it is completely dynamic and doesn't require index rebuilds when you configure or change synonyms. It is also perfect if you have a multi-tenant solution where each client can have different synonyms.