Sphinx: query for ranked results by first AND and then combinations of OR

I thought this was a rather common case, but I just can't figure it out or find any info about it.
Say I have the following texts:
Dogs hate cats
My cat eat mice but hate dogs
Mice hate cats but don't care about dogs
Giraffes don't care about any cat
Dogs are brave in most cases
I can't figure out a query that returns the texts in the following order:
1) First, all texts which contain all three of 'dog', 'cat' and 'mice',
2) Then all texts which contain any pair of ('dog', 'cat'), ('dog', 'mice'), ('cat', 'mice'), in no particular order,
3) Then all texts which contain any one of 'dog', 'cat' or 'mice'.
So the result of such a query for the given texts should be something like this (preferably shorter texts first, but not necessarily):
My cat eat mice but hate dogs
Mice hate cats but don't care about dogs
Dogs hate cats
Giraffes don't care about any cat
Dogs are brave in most cases
Could anybody help me please?

Well, in general there are two parts to it: matching and ranking.
For matching you just want documents that contain at least one of the words (i.e. it will accept a document with just one of them). The quorum operator is probably the easiest way of doing that, but a few other methods would work too.
... MATCH(' "dog cat mice"/1 ')
Then you want to make the ones with the most words (all 3) show first, which is about ranking:
http://sphinxsearch.com/docs/current.html#ranking-overview
In general you might well find the WordCount ranker is OK for your situation:
... OPTION ranker=wordcount
But read the above section on ranking; there is much more elaborate ranking that could be done.
With ranking, you need to get into the mindset of thinking how Sphinx computes a score for each result and then just orders the results in descending weight order (as opposed to 'this, then that, then that').
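Putting the two parts together, a minimal SphinxQL sketch might look like this (the index name myindex is a placeholder, not from the question):
SELECT id, WEIGHT() AS w
FROM myindex
WHERE MATCH('"dog cat mice"/1')
ORDER BY w DESC
OPTION ranker=wordcount;
With the wordcount ranker the weight is simply the number of keyword occurrences, so for texts like these the three-word matches sort first, then the pairs, then the single-word matches.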
Edit to add: you mention 'dog' being a query term, but have documents containing 'dogs', so do make sure to look at morphology and stemming to account for that.
http://sphinxsearch.com/docs/current.html#conf-morphology
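For example, enabling English stemming is one line in the index definition in sphinx.conf (a sketch; myindex is again a placeholder):
index myindex
{
    # stem 'dogs' and 'dog' to the same root at both index and query time
    morphology = stem_en
    ...
}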

Related

Better Postgres trigram ranking

I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.
My app is actually very similar to the one in Optimizing a postgres similarity query (pg_trgm + gin index), and the answer there is pretty good.
My problem is that the relevance ranking isn't very good. There are two issues:
I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.
The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason for this is that the similarity score penalizes for parts of the string that do not match the search terms.
I think I'd get a much better result if I could get a score that simply returns the number of matched trigrams in a record, rather than the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elastic) work -- they rank by the weighted number of hits and do not penalize long documents.
Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could do an arbitrary ranking of results, but if the ORDER BY clause does not match the index, then performance will be poor.
I know that this is an old question but this might be useful for others.
If the text you want to search falls within the ASCII table (characters in the range [a-zA-Z0-9] and some other symbols), then you probably want to use the Full Text Search feature (read the official document on Full Text Search),
because not only does it give you the ability to sort by relevancy, but also the ability to customize things like stemming (using the Snowball algorithm), which maps words like connection, connections, connective, connected, and connecting to connect (read more about Snowball stemming). This makes your application perform better on search.
But if your requirement is to search text outside the range of the ASCII table, like Unicode, which is common if you try to support Asian languages like Japanese, Thai, or Korean, then using pg_trgm is perfectly fine.
To do a search that is not biased toward shorter text, as mentioned in the question, you can use word_similarity() instead of similarity().
As per the official documentation:
word_similarity( text, text )
Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.
So for example:
SELECT
    word_similarity('white cat', 'white dog and black cat') AS "similarity 1",
    word_similarity('white cat', 'I have a white dog and a black cat') AS "similarity 2",
    word_similarity('white cat', 'I have a lovely white dog and a cute big black cat in a house') AS "similarity 3";

 similarity 1 | similarity 2 | similarity 3
--------------+--------------+--------------
          0.6 |          0.6 |          0.6
(1 row)
As shown above, they all have equal scores.
And when you want to use it in a query:
SELECT col, word_similarity('some query', col) FROM my_table WHERE 'some query' <% col;
According to the documentation:
text <% text → boolean
Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by the pg_trgm.word_similarity_threshold parameter.
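To keep the <% filter fast on a large table, you can pair it with a trigram index. A minimal sketch, reusing the hypothetical my_table and col from above:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON my_table USING gin (col gin_trgm_ops);

SELECT col, word_similarity('some query', col) AS sml
FROM my_table
WHERE 'some query' <% col   -- the trigram index can serve this filter
ORDER BY sml DESC
LIMIT 10;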
For something more complicated, like calculating hit scores, relevance weights/boosts, and faster response times on larger datasets, you should use Elastic instead, but keep in mind that an Elastic instance needs at least 2GB of RAM or more, so you need dedicated EC2 instance(s) for that purpose. For a small-to-medium app, pg_trgm works just fine while saving you server cost.
Hope you find this helpful.

Postgres...how to improve ilike results (quality not speed)

I have a list of chemicals in my database and I provide our users with the ability to do a live search via our website. I use SQLAlchemy and the query I use looks something like this:
Compound.query.filter(Compound.name.ilike(f'%{name}%')).limit(50).all()
When someone searches for toluene, for example, they don't get the result they're looking for because there are many chemicals that have the word toluene in them, such as:
2, 4 Dinitrotoluene
2-Chloroethyl-p-toluenesulfonate
4-Bromotoluene
6-Amino-m-toluenesulfonic acid
a,2,4-trichlorotoluene
a,o-Dichlorotoluene
a-Bromtoluene
etc...
I realize I could increase my limit, but I feel like 50 is more than enough. Or I could change the ilike(f'%{name}%') to something like ilike(f'{name}%'), but our business requirements don't want this. What I'd rather do is improve the way Postgres returns results so that toluene is at the top of the search results.
Any ideas on how to improve Postgres' ilike capability?
Thanks in advance.
One option is to rank the results better; Postgres text search allows you to rank the results.
A cheap and dirty version of preferential ranking is to do multiple queries for name = ?, ilike(f'{name}%'), and ilike(f'%{name}%') using a union, as sketched below. That way the ilike(f'{name}%') results come first.
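A sketch of that union approach (the compound table and name column are stand-ins for the actual schema):
SELECT name, 1 AS rank FROM compound WHERE name ILIKE 'toluene'
UNION ALL
SELECT name, 2 FROM compound WHERE name ILIKE 'toluene%' AND name NOT ILIKE 'toluene'
UNION ALL
SELECT name, 3 FROM compound WHERE name ILIKE '%toluene%' AND name NOT ILIKE 'toluene%'
ORDER BY rank, name
LIMIT 50;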
And rather than a hard limit, offer pagination. SQLAlchemy has paginate to help.
ILIKE yields a boolean. It doesn't specify what order to return the results, just whether to return them at all (you can order by a boolean, but if you only return trues there is nothing left to order by). So by the time you are done improving it, it would no longer be ILIKE at all but something else completely.
You might be looking for something like <-> from pg_trgm, which provides a distance score which can be sorted on. Although really, you could just order the result based on the length of the compound name, and return the shortest 50 that contain the target.
something like ilike(f'{name}%') but our business requirements don't want this
Isn't your business requirement to get better results?
But at least in my database, this could just return a bunch of names in inverted format, like 'toluene, 2,4-dinitro', so the results might not be much better, unless you avoid storing such inverted names. Sorting by either <-> or by length (sketched below) would overcome that problem, but they would also penalize 'toluene, ACS reagent grade 99.99% by HPLC', should you have names like that.
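For illustration, sketches of both orderings (compound and name are again stand-in identifiers; the first query needs the pg_trgm extension):
-- Order by trigram distance to the search term
SELECT name FROM compound
WHERE name ILIKE '%toluene%'
ORDER BY name <-> 'toluene'
LIMIT 50;
-- Or simply prefer the shortest matching names
SELECT name FROM compound
WHERE name ILIKE '%toluene%'
ORDER BY length(name)
LIMIT 50;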

Use Postgresql full text search to fuzzy match all search terms

I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) @@ to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine; however, in some cases it doesn't, and I believe that's because the text is interpreted as English and some words like of, off, in, etc. are ignored.
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
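You can check the stemming directly; with a stock 'english' configuration the output looks like:
select to_tsvector('english','office');
 to_tsvector
-------------
 'offic':1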
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you will lose that. pg_trgm also provides the <-> operator, which is even fuzzier and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
In your case, to_tsquery() needs to indicate that all words are required; you can use to_tsquery('english', 'off & bmt') and indicate a particular dictionary containing the word 'off', as listed in link 4 below.
Some tips for using tsvector:
1. Create a field on your table that contains all the fields with terms that you want to search; this field should be of type tsvector.
2. Your search should use tsquery, as you mentioned in your question. In the search you can apply some good tricks, as follows (see the sketch after this list):
2.a. Create a rank with ts_rank(), indicating the search priority; this indicates the priority and how closely the tsquery matches the original terms.
2.b. If you have specific words (as in my case, searching chemical terms), you can create a dictionary with the commonly used words; these words can be used to extract radicals or parts to compare for similarity.
2.c. About performance: tsquery works very well with GIN and GiST indexes. I have used full text search on a table with 200k+ records, and searches return in under 0.4 seconds.
If you need fuzzier matching of words, you can also use fuzzy matching. Together with tsquery I used the levenshtein_less_equal() search with a distance of 3. That function matches words that differ from the search term by 3 letters or fewer; for unique words it is a good way to search.
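A minimal sketch of tips 1 and 2.a (the projects table and search_vec column are illustrative names, and 'simple' avoids the stop-word issue discussed above):
ALTER TABLE projects ADD COLUMN search_vec tsvector;
UPDATE projects SET search_vec = to_tsvector('simple', name);
CREATE INDEX ON projects USING gin (search_vec);

SELECT name, ts_rank(search_vec, query) AS rank
FROM projects, to_tsquery('simple', 'off:* & bmt:*') AS query
WHERE search_vec @@ query
ORDER BY rank DESC;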
1. tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
2. Text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
3. Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
4. Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

Using full text search in PostgreSQL, how can I make certain words worth less to match?

I am trying to use Postgres full-text search to search an index of company names. There are lots of duplicates, typos, etc. When matching company names, things like LLC and Inc are not quite stop-words (as in, I want them to count for something) but they are not nearly as important as most other words. Is there a way to query such that some words count more than other words when matching?
(I'm doing this all through Django, but if I can figure out the SQL to use I can probably get the rest of the way there...)
You can use the 3-argument form of "setweight" to de-weight specific lexemes. You would do this in the tsvector, not in the tsquery.
select setweight(setweight(to_tsvector('The DBA LLC'),'A'),'D','{llc}');
setweight
-------------------
'dba':2A 'llc':3D
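ts_rank() then uses those labels when scoring; by default the weights are {0.1, 0.2, 0.4, 1.0} for {D, C, B, A}, so a D-weighted lexeme like 'llc' contributes a tenth as much as an A-weighted one. A sketch of a full query (companies and name are stand-in identifiers):
SELECT name,
       ts_rank(setweight(setweight(to_tsvector(name), 'A'), 'D', '{llc,inc}'),
               to_tsquery('dba & llc')) AS rank
FROM companies
ORDER BY rank DESC;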

Scalable way to search for (similar) strings in a database

Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for the best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in the database, in another column, so the ones with higher weights should have a higher chance of appearing in the results, if possible.
What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and Postgresql, but could also use e.g. Node.js, if needed.
*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.
Example input sentence:
I don't like cangaroos.
Example suggestions from the database:
Cangaroos are not my favorite.
Cangaroos are evil.
I once had a cangaroo. Never again.
These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance of appearing in the results. It is probably much harder to detect 'don't like', so that part is completely optional for me.
P.S. Could PostgreSQL's full text search do something like this?
Thank you.
PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the 'pg_trgm' (trigram similarity) and 'btree_gist' extensions, by executing (once) in your database:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;
I assume you have one table that looks like this one:
CREATE TABLE sentences
(
    sentence_id integer PRIMARY KEY,
    sentence    text
) ;
INSERT INTO sentences (sentence_id, sentence)
VALUES
(1, 'Cangaroos are not my favorite.'),
(2, 'A vegetable sentence.'),
(3, 'Cangaroos are evil.'),
(4, 'Again, some plants in my garden.'),
(5, 'I once had a cangaroo. Never again.') ;
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
To find the answers you're looking for, you execute:
-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;
-- And now, select the sentences 'similar' to the input one
SELECT
    similarity(sentence, 'I don''t like cangaroos') AS similarity,
    sentence_id,
    sentence
FROM
    sentences
WHERE
    /* That's how you choose your sentences:
       % means 'similar to', in the trigram sense */
    sentence % 'I don''t like cangaroos'
ORDER BY
    similarity DESC ;
The result that you get is:
 similarity | sentence_id |               sentence
------------+-------------+--------------------------------------
     0.3125 |           3 | Cangaroos are evil.
     0.2325 |           1 | Cangaroos are not my favorite.
     0.2173 |           5 | I once had a cangaroo. Never again.
Hope this gives you what you want...
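If you also want the popularity column mentioned in the question to influence the ordering, one sketch is to blend it into the sort key (popularity is the hypothetical weight column; log() dampens its effect):
SELECT sentence,
       similarity(sentence, 'I don''t like cangaroos') AS sml
FROM sentences
WHERE sentence % 'I don''t like cangaroos'
ORDER BY similarity(sentence, 'I don''t like cangaroos') * (1 + log(popularity + 1)) DESC
LIMIT 10;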