Phrase synonym analyser in Lucene.NET?

I am using the Synonym Analyser, but this only adds synonyms for single words. Is there a similar analyser for phrases or does anyone know any other way of adding phrase synonyms? For example, "The Big Apple" should return a hit for "New York".
Thanks.

You can obviously build your own analyzer; I built a synonym analyzer that took single words and matched them to multiple words, but that was custom development.
Instead of doing that, I would recommend dynamically injecting synonyms during query building or parsing. For example, suppose a user searches for "The Big Apple":
1) Check the phrase "The Big Apple" for synonym phrases.
2) If synonym phrases exist, build a BooleanQuery with two PhraseQueries, "The Big Apple" and "New York".
Another, more performant, way is to use a MultiPhraseQuery instead of PhraseQueries combined in a BooleanQuery. Whether it is worth it depends on how complex your boolean queries get; I have found both work pretty fast in my case.
The downside of this approach is that searching will be a bit slower. The benefit is that it is completely dynamic and doesn't require index rebuilds when you configure or change synonyms. It is also perfect for a multi-tenant solution where each client can have different synonyms.
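The two query-building steps above can be sketched language-agnostically (Python here rather than C#; the synonym table and function name are made-up illustrations, not a Lucene API):

```python
# Sketch of query-time phrase-synonym injection.
# The synonym table and its contents are illustrative assumptions.
PHRASE_SYNONYMS = {
    "the big apple": ["new york"],
    "new york": ["the big apple"],
}

def expand_phrase_query(phrase):
    """Return the list of phrases to OR together as PhraseQueries."""
    key = phrase.lower()
    # Step 1: check the phrase for synonym phrases.
    synonyms = PHRASE_SYNONYMS.get(key, [])
    # Step 2: original phrase plus any synonyms -> one boolean OR query.
    return [key] + synonyms

print(expand_phrase_query("The Big Apple"))
```

In Lucene.NET you would then wrap each returned phrase in a PhraseQuery and add them all to a BooleanQuery with SHOULD occurrence.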


Use a CHECK length constraint on a text field instead of varchar(n)?

SQL, and PostgreSQL 9+ in particular, has many ways to do the same thing... But in many circumstances (see the Notes section for a rationale) we need to "cut diversity" and opt for a standard way.
There is a tendency to adopt the text data type instead of varchar, making it "the standard way to express strings" in PostgreSQL (!) and avoiding time lost in project discussions and casts between similar formats...
But how can we use text while preserving a size-limit constraint?
I use CHECK (char_length(field) < N) and have had no problem changing the limit in a live environment, so it is perhaps the best way... Is it?
Some variations: in general, what is the best choice?
1. In CREATE TABLE:
1.1. CHECK after the data type, just like a default-value definition. Is this the best practice?
1.2. CHECK after all column definitions. Usual for multi-column declarations like CHECK (char_length(col1) < N1 AND char_length(col2) < N2).
1.2.1. Some people also like to express all individual CHECKs afterwards, so as not to "pollute" the column declarations.
2. Use in a trigger: is there any advantage?
3. Other ways: is any other relevant?
So: 1.1, 1.2, 2 or 3, which is the best practice?
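For concreteness, options 1.1 and 1.2 look like this in a CREATE TABLE (table and column names are illustrative only):

```sql
-- 1.1: column constraints, inline after the data type
CREATE TABLE person_a (
  name text CHECK (char_length(name) < 100),
  note text CHECK (char_length(note) < 500)
);

-- 1.2: a table constraint, after all column definitions
CREATE TABLE person_b (
  name text,
  note text,
  CHECK (char_length(name) < 100 AND char_length(note) < 500)
);
```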
CONTEXT NOTES
In projects and teams with KISS or convention-over-configuration demands, we need "good practice" recommendations... I was looking for them in the context of CREATE TABLE ... text/varchar and project maintenance. There is no unbiased "good practices" recommendation on the horizon: Stack Overflow voting is the only reasonable record of this kind of recommendation.
Convention scope
(edit) For individual use, of course, as #ConsiderMe commented, "no matter what you choose, as long as you stick with it throughout the entire time there will be no problem with it".
This question, on the other hand, is about "SQL community" or "PostgreSQL community" best-practice conventions.
I like to keep the code as short as possible, so I'd go with length(string) in the CHECK constraint. I don't see a particular use for char_length in this case; it just takes up more "code space".
Internally, they both map to textlen anyway.
You should be careful with characters that take more than one byte. In that case I would use octet_length. As an example, consider the character ą, which returns 1 when asked for length and 2 when asked for octet_length. Migrations between database systems with different length enforcement have been a pain because of this.
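The length/octet_length distinction is simply characters versus encoded bytes; as an illustrative analogue outside PostgreSQL, the same difference shows up in Python:

```python
# 'ą' is one character but two bytes in UTF-8, mirroring
# char_length('ą') = 1 vs octet_length('ą') = 2 in PostgreSQL.
s = "ą"
char_count = len(s)                    # counts characters
byte_count = len(s.encode("utf-8"))    # counts bytes
print(char_count, byte_count)          # 1 2
```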
I believe a good source for "best practices" is to follow the documentation.
It says that a CHECK constraint written inline with a column is a column constraint, bound to that particular column.
It also mentions the table constraint, which is written separately from any column definition and can enforce data correctness across several columns.
Basically in projects I'm involved I follow this rule for readability and maintenance purposes.
I wouldn't even consider creating a trigger for such things. To me, triggers are designed for much more complex tasks; I don't see a reason to enforce simple data-correctness rules in them.
I can't think of any other solution that would be as basic as the standard ones while still doing its simple job as well as those mentioned above.
The Depesz article on which this reasoning was based is outdated. The only argument against varchar(N) was that changing N required a table rewrite, and as of Postgres 9.2, this is no longer the case.
This gives varchar(N) a clear advantage, as increasing N is basically instantaneous, while changing the CHECK constraint on a text field will involve re-checking the entire table.
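Illustratively (names made up), the two maintenance operations compare like this:

```sql
-- Increasing a varchar limit: metadata-only since PostgreSQL 9.2
ALTER TABLE t ALTER COLUMN name TYPE varchar(200);

-- Changing a CHECK on a text column: the new constraint re-checks
-- every existing row when it is added
ALTER TABLE t DROP CONSTRAINT name_len;
ALTER TABLE t ADD CONSTRAINT name_len CHECK (char_length(name) < 200);
-- (ADD CONSTRAINT ... NOT VALID can postpone that scan.)
```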

Does Algolia have a search with recommendation?

I was wondering if the Algolia service provides some kind of recommendations mechanism when doing a search.
I could not find in the API documentation anything related with providing the client with more refined and intelligent search alternatives based on the index data.
The scenario I am trying to describe is the following (this example is a bit over-the-top):
Given a user is searching for "red car" then the system provides more specific search alternatives and possibly related items that exist in the database (e.g. ferrari, red driving gloves, fast and furious soundtrack :-) )
Update
For future reference, I ended up doing a basic recommendation system using the text search capabilities of Algolia.
Summarising: when a car is saved, its attributes (color, speed, engine, etc.) are used to create synonym indexes; for example, for the engine ferrari in the Engine index:
{
synonyms: ['red', 'ferrari', 'fast'],
value: 'ferrari'
}
Finally, each index must declare the synonyms attribute as searchable and value as the attribute returned as the result of a search.
Algolia does not provide that kind of "intelligence" out of the box.
Something you can do to approximate what you're looking for is using synonyms and a combination of other parameters:
Define synonyms groups such as "car,Ferrari,driving gloves", "red,dark red,tangerine,orange", ...
When sending a search query, set optionalWords to the list of words contained in that query. This will make each word of your query optional.
Also set removeStopWords to true so that words like "the", "a" (...) are ignored, to improve relevance.
With a well defined list of synonyms, this will make your original query interpreted as many other possibilities and thus increase the variety of possible results.
Be aware though that it could also impact the relevance of your results, as users might for instance not want to look for gloves when they search for a car!
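The mechanism can be sketched outside Algolia (illustrative Python; the synonym groups, scoring, and names here are made-up stand-ins, not Algolia's API or ranking formula):

```python
# Sketch of synonym groups + optional words: expand the query through
# the groups, then let every expanded word optionally contribute to
# the score. Groups are illustrative assumptions.
SYNONYM_GROUPS = [
    {"car", "ferrari", "driving gloves"},
    {"red", "dark red", "tangerine", "orange"},
]

def expand_query(query):
    words = set(query.lower().split())
    expanded = set(words)
    for group in SYNONYM_GROUPS:
        if words & group:          # any query word in the group
            expanded |= group      # pulls in the whole group
    return expanded

def score(document, expanded):
    # Each optional word found in the document adds to the score.
    doc = document.lower()
    return sum(1 for w in expanded if w in doc)

terms = expand_query("red car")
docs = ["Red Ferrari 488", "Dark red driving gloves", "Blue bicycle"]
ranked = sorted(docs, key=lambda d: score(d, terms), reverse=True)
print(ranked)  # note the gloves outrank the car here, which is
               # exactly the relevance caveat mentioned above
```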

Lucene.Net/SpellChecker - multi-word/phrase based auto-suggest

I've implemented Lucene.NET on my site, using it to index my products, which are theatre shows, tours and attractions around London.
I want to implement a "Did you mean?" feature for when users misspell product names that takes the whole product titles into account and not just single words. For example,
If the user typed:
Lodnon Eye
I would like to auto-suggest:
London
London Eye
I assume I need to have the analyzer index the titles as if they were a single entity, so that SpellChecker can nearest-match on the phrase as well as the individual words.
How would I do this?
There is an excellent blog series here:
Lucene.NET
Introduction to Lucene
Indexing basics
Search basics
Did you mean..
Faceted Search
Class Reference
I have also found another project called SimpleLucene, which you can use to maintain your Lucene indexes whenever you need to update or delete a document. Read about it here.
I've just recently implemented a phrase autosuggest system in Lucene.NET.
Basically, the Java version of Lucene has a ShingleFilter in one of the contrib folders, which breaks down a sentence into all possible phrase combinations. Unfortunately Lucene.NET's contrib filters aren't quite there yet, so we don't have a ShingleFilter.
However, a Lucene index written in Java can be read by Lucene.NET as long as the versions are the same, so what I did was the following:
I created a spell index in Lucene.NET using the SpellChecker.IndexDictionary method, as laid out in the "Did you mean" section of Jake Scott's link. Note that this only creates a spelling index of single words, not phrases.
I then created a Java app that uses the ShingleFilter to create phrases from the text I'm searching and saves them in a temporary index.
I then wrote another method in .NET to open this temporary index and add each of the phrases as a document to my spelling index that already contains the single words. The trick is to make sure the documents you add have the same form as the rest of the spell documents, so I ripped out the methods used in the SpellChecker code in the Lucene.NET project and edited those.
Once you've done that, you can call the SpellChecker.SuggestSimilar method, pass it a misspelled phrase, and it will return a valid suggestion.
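The ShingleFilter step is the only genuinely missing piece in Lucene.NET, and its core behaviour, producing every contiguous word n-gram of a sentence, is easy to sketch (illustrative code, not the Lucene class itself; the size limits are assumptions):

```python
def shingles(text, min_size=1, max_size=3):
    """Produce every contiguous word n-gram of `text`, the way a
    ShingleFilter feeds phrases into a spell index."""
    words = text.lower().split()
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(words) - size + 1):
            out.append(" ".join(words[i:i + size]))
    return out

print(shingles("the london eye"))
# Each produced phrase would be added as a document to the spell index.
```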
This is probably not the best solution, and I would definitely use the answer suggested by spaceman, but here is another possibility: use the KeywordAnalyzer or the KeywordTokenizer on each title. This will not break the title down into separate tokens but keep it as one token, so the SuggestSimilar method would return whole titles as suggestions.

Lucene.NET - Search phrase containing "and"

Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):
"oil and gas field" (complete phrase)
"research and development" (complete phrase)
"r&d" (complete phrase)
Ideally, I'd like to use the QueryParser as the input is coming from the user.
During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, a QueryParser.Parse converts them to:
contents:"oil gas field"
contents:"research development"
Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).
If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:
contents:"r d"
Which again, isn't quite what I'm looking for.
Any advice?
If you want to search for "and", you have to index it. Write your own Analyzer, or remove "and" from the list of stop words. The same applies to "r&d": write your own Analyzer that creates three tokens from the text: "r", "d", and "r&d".
Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high speed searching.
The upshot of this is that for an index to work effectively you have to anticipate what your analyzer needs to do up front. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally # to 'at' etc). In the case of r&d matching research & development you are almost certainly going to have to implement some domain specific logic.
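That analyzer behaviour, keeping stop words and expanding "&" terms, amounts to a token-expansion rule at indexing time. A minimal sketch of the tokenisation logic (illustrative Python, not a Lucene Analyzer subclass):

```python
import re

def analyze(text):
    """Tokenize without dropping stop words, and expand '&' terms so
    'r&d' indexes as 'r', 'd' and 'r&d', as suggested above."""
    tokens = []
    for raw in re.findall(r"[\w&]+", text.lower()):
        if "&" in raw:
            tokens.extend(p for p in raw.split("&") if p)  # 'r', 'd'
            tokens.append(raw)                             # 'r&d'
        else:
            tokens.append(raw)   # stop words like 'and' are kept
    return tokens

print(analyze("research and development"))
print(analyze("r&d"))
```

With tokens like these in the index, the quoted phrase queries above would match verbatim.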
There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.
Another option is to use the high speed of Lucene to filter your initial results down to something more manageable using an analyzer that doesn't give false negatives. You can then run some detailed filtering over the full text of those documents that it does find to match the correct phrases.
Ultimately I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed, it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.

Core Data Query slow

What's the secret to pulling up items that match characters typed into the search bar that react instantaneously? For instance, if I type in a letter "W" in the search bar, all phrases that contain a letter "W" in any character position within the phrase are returned immediately.
So if a database of 20,000 phrases contains 500 phrases with the letter "W", they would appear as soon as the user typed the first character. Then, as additional characters are typed, the list automatically gets shorter.
I can send queries up to a SQL server from the iPhone and get this type of response; however, no matter what we try, even taking the suggestions of other users, we still can't get good response times when storing the database locally on the iPhone.
I know that this performance is available, because there are many other apps out there that display results as soon as you start typing.
Please note that this isn't the same as indexing all words in every phrase, as this only will bring up matches where the word starts with the character typed in. In this case, we're looking for characters within words.
I think asynchronous results filtering is the answer. Instead of updating the search results every time the user types a new character, put the db query on a background thread when the first character is typed. If a new character is typed before the query is finished, cancel the old query and start a new one. Finally, you will get to the point where the user stops typing long enough for the query to return. That way, the query itself never blocks the user's typing.
I believe the UISearchDisplayController class offers this type of asynchronous search, though whether you want to use that class or just adopt the asynchronous design pattern from it is up to you.
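The cancel-and-requery idea can be sketched with a generation counter (illustrative Python rather than Objective-C; the class and method names are made up, not UISearchDisplayController's API):

```python
import threading

class SearchController:
    """Each keystroke bumps a generation counter; a query thread only
    reports its results if it is still the newest query, so stale
    queries are effectively cancelled."""

    def __init__(self, query_fn, on_results):
        self._gen = 0
        self._lock = threading.Lock()
        self._query_fn = query_fn      # the (slow) database query
        self._on_results = on_results  # UI update callback

    def text_changed(self, text):
        with self._lock:
            self._gen += 1
            gen = self._gen
        t = threading.Thread(target=self._run, args=(text, gen))
        t.start()
        return t

    def _run(self, text, gen):
        results = self._query_fn(text)
        with self._lock:
            if gen == self._gen:       # discard results of stale queries
                self._on_results(results)
```

The user's typing never blocks because the query always runs off the main thread, and only the latest query's results ever reach the UI.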
If you're willing to get away from the database for this, you could use a generalized suffix tree with all the terms in your phrases. You can build a suffix tree in linear time and, I believe, use it to find all occurrences of a substring very quickly. The web has lots of pages about suffix trees and suffix arrays. Wikipedia is probably a good place to start.
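A rough sketch of that idea using a sorted suffix list (a suffix-array stand-in: simpler and slower to build than a true suffix tree, but it shows why substring lookup is fast; the phrases are illustrative):

```python
import bisect

def build_suffix_index(phrases):
    """Index every suffix of every phrase, so any substring can be
    found by binary search rather than scanning all phrases."""
    index = []
    for pid, phrase in enumerate(phrases):
        p = phrase.lower()
        for i in range(len(p)):
            index.append((p[i:], pid))
    index.sort()
    return index

def find_substring(index, query):
    """Return the ids of phrases containing `query` as a substring."""
    q = query.lower()
    lo = bisect.bisect_left(index, (q,))
    matches = set()
    while lo < len(index) and index[lo][0].startswith(q):
        matches.add(index[lo][1])
        lo += 1
    return matches

phrases = ["West End show", "London Eye", "Tower of London"]
idx = build_suffix_index(phrases)
print(find_substring(idx, "ondon"))  # ids of phrases containing 'ondon'
```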
I have a fun scheme for you. You can build an index of the characters that exist in each phrase via a 32-bit integer. Flip the bits [0-25] to represent the characters (case-insensitive) a-z that exist in the phrase. Build a second bitmap of the query string. Now you can do comparisons via bitwise operations (& and |) to determine matches. This is very fast and believe it or not SQLite actually supports bitwise operations in queries - so you can even use this scheme to go straight to the database. I have working code that does this built into one of our iPhone applications - Alphagram.
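A sketch of that bitmap scheme (Python for illustration; the phrase list is made up):

```python
def char_mask(s):
    """Fold a string into a 32-bit mask: bits 0..25 set for each of
    a..z present (case-insensitive), as described above."""
    mask = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

phrases = ["Walla Walla", "Boston", "New York"]
masks = [char_mask(p) for p in phrases]

query = "w"
qmask = char_mask(query)
# A phrase can only match if it contains every queried character:
candidates = [p for p, m in zip(phrases, masks) if m & qmask == qmask]
print(candidates)
```

Note this is a prefilter: the mask says which characters are present, not whether the typed characters appear as a contiguous substring, so an exact substring check can follow on the much smaller candidate set.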