Partial vs complete match ranking in PostgreSQL text search

I am trying to implement incremental search using PostgreSQL. The problem I am running into is result ranking. I would like complete matches to be ranked higher than partial matches and I don't really know how to do that. For example in this query (to show how things are ranked as the user types the query):
select
  ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jon:*')),
  ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jonath:*')),
  ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jonathan:*'));
or the other way around (to show how different documents rank the same query)
select
  ts_rank_cd(to_tsvector('hello jon'), to_tsquery('jon:*')),
  ts_rank_cd(to_tsvector('hello jonah'), to_tsquery('jon:*')),
  ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jon:*'));
all rankings return 0.1. How would I go about making more complete results rank higher?

I would try using an operator from pg_trgm to break ties among equal ts_rank_cd scores. The "<->>>" operator (introduced in PostgreSQL 11) would probably be my first choice:
select
  'hello jon' <->>> 'jon:*',
  'hello jonah' <->>> 'jon:*',
  'hello jonathan' <->>> 'jon:*';
 ?column? |  ?column?  | ?column?
----------+------------+----------
        0 | 0.57142854 |      0.7
Note that this returns distance, not similarity, so lower is better.
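Putting the two together, you could order by full-text rank first and fall back to the trigram distance only to break ties. A minimal sketch, assuming a hypothetical businesses(name) table and the user having typed 'jon':

SELECT name
FROM   businesses
WHERE  to_tsvector('simple', name) @@ to_tsquery('simple', 'jon:*')
ORDER  BY ts_rank_cd(to_tsvector('simple', name), to_tsquery('simple', 'jon:*')) DESC,
          name <->>> 'jon'  -- strict word distance; lower is better, so ASC breaks ties
LIMIT  10;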

Related

How to optimize fuzzy string matching pairs of phrases (intersection names) in PostgreSQL

We have a table of intersection names like 'Main St / Broadway Ave' and we are trying to match potentially messy user input (of the form (street1, street2)) to these names. There's no guarantee the input would be in the same order as the street names.
We split the intersection names into a long-format table in order to optimize doing two fuzzy distance comparisons, e.g.
+--------+----------------+
| int_id | street         |
+--------+----------------+
|      1 | 'Broadway Ave' |
|      1 | 'Main St'      |
+--------+----------------+
And put a GiST trigram index on the street column.
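For reference, that index might be created like this (a sketch; the index name is made up):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index, so <->, <% and %> can use index scans (including KNN ordering)
CREATE INDEX intersections_street_trgm_idx
    ON intersections USING gist (street gist_trgm_ops);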
So then the query finds all int_ids that are close to one or the other street input, and then does a GROUP BY to find the one with the closest combined distances (the query is below). This works pretty well, but we still need it to work faster. Is there something in PostgreSQL's full text search library that could do the trick?
Query example used in a function, with related explain: https://explain.depesz.com/s/J9lj
SELECT intersections.int_id,
       SUM(LEAST(intersections.street <-> street1,
                 intersections.street <-> street2)) AS total_distance
FROM   intersections
WHERE  street1 <% intersections.street
   OR  street2 <% intersections.street
GROUP  BY intersections.int_id
HAVING COUNT(DISTINCT TRIM(intersections.street)) > 1
ORDER  BY AVG(LEAST(intersections.street <-> street1,
                    intersections.street <-> street2));
I know you are looking for a PostgreSQL solution, but this would be much easier (and faster) if you clone your data into Elasticsearch and do the searches there. Elasticsearch also gives you way more flexibility than a relational database could.

SphinxQL Variables Deprecated, Alternate Query?

I had what I thought was a fairly straightforward SphinxQL query, but it turns out @ variables are deprecated (see the example below):
SELECT *, @weight AS m FROM test1 WHERE MATCH('tennis') ORDER BY m DESC LIMIT 0,1000 OPTION ranker=bm25, max_matches=3000, field_weights=(title=10, content=5);
I feel like there must be a way to sort the results by strength of match. What is the replacement?
On another note, what if I want to include a devaluation if certain other words appear? For example, let's say I wanted to devalue results that had the word "apparel" in them. Could that be executed in the same query?
Thanks!
Well, results are 'by default' in weight descending order, so just do...
SELECT * FROM test1 WHERE MATCH('tennis') LIMIT 0,1000 OPTION ...
But otherwise it's just that the @ variables are replaced by 'functions', mainly because it's more 'SQL-like'. So @weight is WEIGHT():
SELECT * FROM test1 WHERE MATCH('tennis') ORDER BY WEIGHT() DESC ...
or
SELECT *,WEIGHT() AS m FROM test1 WHERE MATCH('tennis') ORDER BY m DESC ...
For reference, @group is instead GROUPBY(), @count is COUNT(*), @distinct is COUNT(DISTINCT ...), @geodist is GEODIST(...), and @expr doesn't really have an equivalent; either just use the expression directly, or use your own custom named alias.
As for the second question: kinda tricky, there isn't really a 'negative' weighter. There is a keyword boost operator, but as far as I know you can't use it to specifically devalue.
The only way I can think it might work is if the negative match was against a specific field; then you could build a complex ranking expression. Basically, to apply a negative weight, you would need a specific field for the ranking expression, so it can be used to select that column:
... MATCH('@!(negative) tennis @negative apparel')
... OPTION ranker=expr('SUM(word_count*IF(user_weight=99,-1,1))'), field_weights=(negative=99)
That's a very basic demo expression for illustrative purposes; a real one would probably be a lot more complex. It's just showing the use of 99 as a placeholder for the 'negative' multiplication.
You would need to create the new negative field, which could just be a duplicate of the other field(s), as sketched below.
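A sketch of what the source might then look like (the table and the title/content columns are made-up placeholders, not from the question):

source products {
    # duplicate the ordinary text into a dedicated 'negative' field, so the
    # ranking expression can single it out via field_weights
    sql_query = SELECT id, title, content, content AS negative FROM products
}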

Limit results on OR condition in Sphinx

I am trying to limit results by somehow grouping them.
This query attempt should make things clear:
#namee ("Cameras") limit 5| #namee ("Mobiles") limit 5| #namee ("Washing Machine") limit 5| #namee ("Graphic Cards") limit 5
where namee is the column
Basically I am trying to limit results based upon specific criteria.
Is this possible? Any alternative way of doing what I want to do?
I am on sphinx 2.2.9
There is no Sphinx syntax to do this directly.
The easiest would be just to run 4 separate queries directly and 'UNION' them in the application itself. Performance isn't going to be terrible.
... If you REALLY want to do it in Sphinx, you can exploit a couple of tricks to get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!), each with the same data but with the field called something different (they duplicate each other!). You would also need an attribute on each one (more on why later):
source str1 {
    sql_query     = SELECT id, namee AS field1, 1 AS idx FROM ...
    sql_attr_uint = idx
}
source str2 {
    sql_query     = SELECT id, namee AS field2, 2 AS idx FROM ...
    sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
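For example, the distributed index might be declared like this (a sketch; idx_str1..idx_str4 are assumed to be the four plain indexes built over the sources above):

index dist_index {
    type  = distributed
    local = idx_str1
    local = idx_str2
    local = idx_str3
    local = idx_str4
}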
Then you can run a single query to get all results kinda magically unioned...
MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
(The @@relaxed is important, as the fields are different; the matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute, the attribute identifies which term matched.
Sphinx has a nice GROUP N BY, where you only get a certain number of results from each attribute, so you could do... (putting all that together)
SELECT *, WEIGHT() AS weight
FROM dist_index
WHERE MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
simples eh?
(Note it only works if you want 4 from each index; if you want different limits it's much more complicated!)

Why is this PostgreSQL full text search query returning ts_rank of 0?

Before I invest in using Solr, Lucene, or Sphinx, I wanted to try to implement a search capability on my system using PostgreSQL full text search.
I have a national list of businesses in a table that I want to search. I created a tsvector that combines the business name and city so that I can do a search like "outback atlanta".
I am also implementing an auto-completion function by using the wildcard capability of the search: appending ":*" to the search pattern and inserting " & " between keywords, so the search pattern "outback atl" turns into "outback & atl:*" before getting converted into a query using to_tsquery().
Here's the problem that I am running into currently.
if the search pattern is entered as "ou", many "Outback Steakhouse" records are returned.
if the search pattern is entered as "out", no results are returned.
if the search pattern is entered as "outb", many "Outback Steakhouse" records are returned.
doing a little debugging, I came up with this:
select ts_rank(to_tsvector('Outback Steakhouse'), to_tsquery('ou:*')) as "ou",
       ts_rank(to_tsvector('Outback Steakhouse'), to_tsquery('out:*')) as "out",
       ts_rank(to_tsvector('Outback Steakhouse'), to_tsquery('outb:*')) as "outb";
which results in this:
    ou     | out |   outb
-----------+-----+-----------
 0.0607927 |   0 | 0.0607927
What am I doing wrong?
Is this a limitation of pg full text search?
Is there something that I can do with my dictionary or configuration to get around this anomaly?
UPDATE:
I think that "out" may be a stop word.
When I run this debug query, I don't get any lexemes for "out":
SELECT * FROM ts_debug('english','out back outback');

   alias   |   description   |  token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+---------+----------------+--------------+-----------
 asciiword | Word, all ASCII | out     | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | back    | {english_stem} | english_stem | {back}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | outback | {english_stem} | english_stem | {outback}
So now I ask: how do I modify the stop word list to remove a word?
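For reference, the stop words appear to come from a file under $SHAREDIR/tsearch_data/, so presumably a custom dictionary pointing at an edited copy would work; a sketch (the dictionary, configuration, and stop file names here are made up):

-- assumes english.stop was copied to english_custom.stop with 'out' removed
CREATE TEXT SEARCH DICTIONARY english_stem_custom (
    TEMPLATE  = snowball,
    Language  = english,
    StopWords = english_custom
);

CREATE TEXT SEARCH CONFIGURATION english_custom (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION english_custom
    ALTER MAPPING FOR asciiword, word
    WITH english_stem_custom;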
UPDATE:
here is the query that I am currently using:
select id, name, address, city, state, likes
from view_business_favorite_count
where textsearchable_index_col @@ to_tsquery('simple', $1)
ORDER BY ts_rank(textsearchable_index_col, to_tsquery('simple', $1)) DESC
When I execute the query (I'm using StrongLoop LoopBack + Express + Node), I pass the pattern in to replace the $1 param. The pattern (as stated above) will look something like "keyword:*" or "keyword1 & keyword2 & ... & keywordN:*".
thanks
The problem here is that you are searching against business names, and as @Daniel correctly pointed out, the 'english' dictionary will not help you find a "fuzzy" match for NON-dictionary words like "Outback Steakhouse" etc.
'simple' dictionary
The 'simple' dictionary on its own will not help you either; in your case business names will work only for exact matches, as all words are unstemmed.
'simple' dictionary + pg_trgm
But if you use the 'simple' dictionary together with the pg_trgm module, it will be exactly what you need. In particular:
- for to_tsvector('simple','<business name>') you don't need to worry about the stop words "hack", and you will get all the lexemes unstemmed;
- using similarity() from pg_trgm you will get the highest "rank" for the best match.
look at this:
WITH pg_trgm_test(business_name,search_pattern) AS ( VALUES
('Outback Steakhouse','ou'),
('Outback Steakhouse','out'),
('Outback Steakhouse','outb')
)
SELECT business_name,search_pattern,similarity(business_name,search_pattern)
FROM pg_trgm_test;
result:
   business_name    | search_pattern | similarity
--------------------+----------------+------------
 Outback Steakhouse | ou             |        0.1
 Outback Steakhouse | out            |       0.15
 Outback Steakhouse | outb           |        0.2
(3 rows)
Ordering by similarity DESC you will be able to get what you need.
UPDATE
For your situation there are 2 possible options.
Option #1.
Just create a trgm index for the name column in the view_business_favorite_count table; the index definition may be the following:
CREATE INDEX name_trgm_idx ON view_business_favorite_count USING gin (name gin_trgm_ops);
The query will look something like this:
SELECT
id,
name,
address,
city,
state,
likes,
similarity(name,$1) AS trgm_rank -- similarity score
FROM
view_business_favorite_count
WHERE
name % $1 -- trgm search
ORDER BY trgm_rank DESC;
Option #2.
With full text search, you need to:
1. create a separate table, for example unnested_business_names, with 2 columns: the 1st column will keep all lexemes from the to_tsvector('simple', name) function, and the 2nd column will have vbfc_id (an FK for id from the view_business_favorite_count table);
2. add a trgm index on the column which contains the lexemes;
3. add a trigger for unnested_business_names which will update, insert, or delete values whenever view_business_favorite_count changes, to keep all words up to date.
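A minimal sketch of that setup (assuming PostgreSQL 9.6+ for tsvector_to_array(); all names besides view_business_favorite_count are made up, and the trigger from step 3 is omitted):

CREATE TABLE unnested_business_names (
    lexeme  text   NOT NULL,  -- one unstemmed lexeme from the business name
    vbfc_id bigint NOT NULL   -- FK to view_business_favorite_count.id
);

INSERT INTO unnested_business_names (lexeme, vbfc_id)
SELECT lex, v.id
FROM   view_business_favorite_count v,
       unnest(tsvector_to_array(to_tsvector('simple', v.name))) AS lex;

CREATE INDEX unnested_lexeme_trgm_idx
    ON unnested_business_names USING gin (lexeme gin_trgm_ops);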

PostgreSQL: Find sentences closest to a given sentence

I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.
I know that I can use the @@ operator with a to_tsquery, but tsquery accepts specific words as queries.
One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.
However, I also feel that some kind of cosine similarity thing is what I need, but I don't know how to get that out of PostgreSQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.
Full Text Search (FTS)
You could use plainto_tsquery() to (per documentation) ...
produce tsquery ignoring punctuation
SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.');

              plainto_tsquery
---------------------------------------------
 'sentenc' & 'irrelev' & 'word' & 'punctuat'
Use it like:
SELECT *
FROM   tbl
WHERE  to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');
But that is still rather strict and only provides very limited tolerance for similarity.
Trigram similarity
Might be better suited to search for similarity, even overcome typos to some degree.
Install the additional module pg_trgm, create a GiST index and use the similarity operator % in a nearest neighbour search:
Basically, with a trigram GiST index on sentence:
-- SELECT set_limit(0.3); -- adjust tolerance if needed
SELECT *
FROM tbl
WHERE sentence % 'My new sentence'
ORDER BY sentence <-> 'My new sentence'
LIMIT 10;
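For completeness, the setup assumed by that query might look like this (a sketch; the table tbl and the index name are placeholders):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index: supports both the % filter and the <-> distance ordering (KNN)
CREATE INDEX tbl_sentence_trgm_idx ON tbl USING gist (sentence gist_trgm_ops);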
More:
Finding similar strings with PostgreSQL quickly
Finding similar posts with PostgreSQL
Slow fulltext search for terms with high occurence
Combine both
You can even combine FTS and trigram similarity:
PostgreSQL FTS and Trigram-similarity Query Optimization
It's a pretty late answer, but I'm adding it in case anyone encounters this. If you add ":*" to the end of the words, it will bring up similar ones.
Sample:
JS autocomplete -> CodeIgniter:
$barcode = $this->input->get("term") . ":*";
Query:
$query = 'select * from tablename where xx @@ ? LIMIT 15';
$barcodequery = $this->db->query($query, array(explode(" ", $barcode)))->result_array();