Postgres full text search and spelling mistakes (aka fuzzy full text search) - postgresql

I have a scenario, where I have data for informal communications that I need to be able to search. Therefore I want full text search, but I also to make sense of spelling mistakes. Question is how do I take spelling mistakes into account in order to be able to do fuzzy full text search??
This is very briefly discussed in Postgres Full Text Search is Good Enough where the article discusses misspelling.
So I have built a table of "documents", created indexes etc.
CREATE TABLE data (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
text TEXT NOT NULL);
I can create an additional column of type tsvector and index accordingly...
alter table data
add column search_index tsvector
generated always as (to_tsvector('english', coalesce(text, '')))
STORED;
create index search_index_idx on data using gin (search_index);
I have for example, some text where the data says "baloon", but someone may search "balloon", so I insert two rows (one deliberately misspelled)...
insert into data (text) values ('baloon');
insert into data (text) values ('balloon');
select * from data;
id | text | search_index
----+---------+--------------
1 | baloon | 'baloon':1
2 | balloon | 'balloon':1
... and perform full text searches against the data...
select * from data where search_index ## plainto_tsquery('balloon');
id | text | search_index
----+---------+--------------
2 | balloon | 'balloon':1
(1 row)
But I don't get back results for the misspelled version "baloon"... So using the suggestion in the linked article I've built a lookup table of all the words in my lexicon as follows...
"you may obtain good results by appending the similar lexeme to your tsquery"
CREATE TABLE data_words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', text) FROM data');
CREATE INDEX data_words_idx ON data_words USING GIN (word gin_trgm_ops);
... and I can search for similar words which may have been misspelled
select word, similarity(word, 'balloon') as similarity from data_words where similarity(word, 'balloon') > 0.4 order by similarity(word, 'balloon');
word | similarity
---------+------------
baloon | 0.6666667
balloon | 1
... but how do I actually include misspelled words in my query?
Isn't this what the article above means?
select plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
plainto_tsquery
----------------------------------
'balloon' & 'baloon' & 'balloon'
(1 row)
... plugged into an actual search, and I get no rows!
select * from data where text ## plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
select * from data where search_index ## phraseto_tsquery('baloon balloon'); -- no rows returned
I'm not sure where I'm going wrong here - can any shed any light? I feel like I'm super close to getting this going...?

SELECT to_tsquery('balloon |' ||
string_agg(word, ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;

For anyone looking at this thread, the accepted answer by #laurenz-albe needed a slight modification for me:
It required single quotes around the argument values passed to the string_agg function, which can be done using the format function along with the %L placeholder.
This updated code worked for me:
SELECT to_tsquery('balloon |' ||
string_agg(format('%L', word), ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;

Related

Postgresql - Similarity with trigram (pg_trgm)

I am currently implementing a search functionality on my app. I have an user table which contains an username and full_name fields. I want to search the users with the best similarity (from username or full_name). I searched a lot on stackoverflow and I found out a very performatic implementation:
https://stackoverflow.com/a/44856792/5979369
I used this code and I created this search query:
SELECT username, email, full_name
, similarity(username , 'mar') AS s_username
, similarity(full_name , 'mar') AS s_full_name
, row_number() OVER () AS rank -- greatest similarity first
FROM user
WHERE (username || ' ' || full_name) % 'mar' -- !!
ORDER BY (username || ' ' || full_name) <-> 'mar' -- !!
LIMIT 20;
I have an user which username is mariazirita but when I use this query searching by mar it doesn't return nothing. If I search for maria instead it already returns the user.
What can I do to improve this query to also return the user when I search for mar or ma?
Thank you
The problem here is the % operator. It will return TRUE only if the similarity exceeds the pg_trgm.similarity_threshold parameter, which defaults to 0.3.
SELECT similarity('mariazirita', 'mar');
similarity
════════════
0.23076923
(1 row)
SELECT similarity('mariazirita', 'maria');
similarity
════════════
0.3846154
(1 row)
So you can either lower the threshold or remove the condition with % from the query.

T-sql: Highlight invoice numbers if they occur in a payment description field

I have two sql-server tables: bills and payments. I am trying to create a VIEW to highlight the bill numbers if they occur in the payment description field. For example:
TABLE bll
|bllID | bllNum |
| -------- | -------- |
| 1 | qwerty123|
| 2 | qwerty345|
| 3 | 1234 |
TABLE payments
|paymentID | description |
| -------- | ---------------------------------- |
| 1 | payment of qwerty123 and qwerty345 |
I want to highlight both the 'qwerty123' and 'qwerty345' strings by adding html code to it. The code I have is this:
SELECT REPLACE(payments.description,
COALESCE((SELECT TOP 1 bll.bllNum
FROM bll
WHERE COALESCE(bll.bllNum, '') <> '' AND
PATINDEX('%' + bll.bllNum + '%', payments.description) > 0), ''),
'<font color=red>' +
COALESCE((SELECT TOP 1 bll.bllNum
FROM bll
WHERE COALESCE(bll.bllNum, '') <> '' AND
PATINDEX('%' + bll.bllNum + '%', payments.description) > 0), '') +
'</font>')
FROM payments
This works but only for the first occurrence of a bill number. If the description field has more than one bill number, the consecutive bill numbers are not highlighted. So in my example 'qwerty123' gets highlighted, but not 'qwerty345'
I need to highlight all occurrences. How can I accomplish this?
With the caveat that this is not a task best done in the database, one possible approach you could try is to use string_split to break your description into words and then join this to your Bills, doing your string manipulation on matching rows.
Note, according to the documentation, string_split is not 100% guaranteed to retain its correct ordering but always has in my usage. It could always be substituted for an alternative function to work on the same principle.
select string_agg (w.word,' ') [Description]
from (
select
case when exists (select * from bill b where b.billnum=s.value)
then Concat('<font colour="red">',s.value,'</font>') else s.value end word
from payments p
cross apply String_Split(description,' ')s
)w
Example DB Fiddle
Okay, I understand, I can put code in the front-end application by looping through the bill numbers and replacing them as they are found in the description. Just thought/ hoped there was a simple solution to this using t-sql. But I understand the difficulty.

Postgres reverse LIKE lookup indexing and performance

We have a musicians table containing records with multiple string fields, say:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Sting", "", "Bass"
"Ringo", "Starr", "Drums"
"Paul", "McCartney", "Bass"
I want to pass postgres a long string, say:
"It is known that Jimi liked to set light to his guitar and smash up
all the drums while on stage."
and i want to get returned the fields that have any matches - preferably in order of the most matches first:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Ringo", "Starr", "Drums"
because i need the search to be case insensitive, i'm constructing a query like this...
select * from musicians where lowercase_string like '%'||firstname||'%' or lowercase_string like '%'||lastname||'%' or lowercase_string like '%'||instrument||'%'
and then looping through (in ruby in my case) to capture the result with the most matches.
this is however very slow in the sql stage (1 minute+).
i've tried adding lower-case GIN index using pg_trgm as suggested here - but it's not helping - presumably because the like query is back to front?
Thanks!
With my testing, it seems that no trigram index could help your query at all. And no other index type could possibly speed up an (I)LIKE / FTS based search.
I should mention that all of the queries below use the trigram indexes, when they are queried "reversed": when the table contains the document (which is indexed), and your parameter is the query. The (I)LIKE variant variant f.ex. 2-3 times faster with it.
These the queries I've tested:
select *
from musicians
where :input_string ilike '%' || firstname || '%'
or :input_string ilike '%' || lastname || '%'
or :input_string ilike '%' || instrument || '%'
At first, FTS seemed a great idea, but my testing shows that even without ranking, it is 60-100 times slower than the (I)LIKE variant. (So even, when you don't have to post-process results with these methods, these are not worth it).
select *
from musicians
where to_tsvector(:input_string) ## (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
However, ORDER BY rank doesn't slow down that much further: it is 70-120 times slower than the (I)LIKE variant.
select *
from musicians
where to_tsvector(:input_string) ## (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
order by ts_rank(to_tsvector(:input_string), plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
Then, for a last effort, I tried the (fairly new) "word similarity" operators of the trigram module: <% and %> (available from PostgreSQL 9.6).
select *
from musicians
where :input_string %> firstname
or :input_string %> lastname
or :input_string %> instrument
select *
from musicians
where firstname <% :input_string
or lastname <% :input_string
or instrument <% :input_string
These were somewhat faster then FTS: around 50-70 times slower than the (I)LIKE variant.
(Partially working) rextester: it is run against PostgreSQL 9.5, so the 9.6 operators obviously won't run here.
Update: IF full word match is enough for you, you can actually reverse your query, to be able to use indexes. You'll need to "parse" your query (aka. "long string") though:
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(ls, '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where firstname ilike word
or lastname ilike word
or instrument ilike word
group by musicians.id
Note: I parsed the query for every complete word. You can have some other logic there, or it can even be parsed in client side.
The default, btree index shines here, as it is much faster than the trigram index with (I)LIKE (we won't need them anyway, as we are looking for complete word match here):
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where lower(firstname) = word
or lower(lastname) = word
or lower(instrument) = word
group by musicians.id
http://rextester.com/PSABJ6745
You could even get the match count with something like
sum((lower(firstname) = word)::int
+ (lower(lastname) = word)::int
+ (lower(instrument) = word)::int)
The ilike option with match ordering:
with long_string (ls) as (values
('It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage.')
)
select musicians.*, matches
from
musicians
cross join
long_string
cross join lateral
(select
(ls ilike format ('%%%s%%', first_name) and first_name != '')::int +
(ls ilike format ('%%%s%%', last_name) and last_name != '')::int +
(ls ilike format ('%%%s%%', instrument) and instrument != '')::int
as matches
) m
where matches > 0
order by matches desc
;
first_name | last_name | instrument | matches
------------+-----------+------------+---------
Jimi | Hendrix | Guitar | 2
Phil | Collins | Drums | 1
Ringo | Starr | Drums | 1

Postgres Full Text Search on multiple columns

I am using Postgres 9.3 w/ Laravel 5 and I have set up the following migration:
DB::statement("ALTER TABLE users ADD COLUMN searchtext TSVECTOR");
DB::statement("UPDATE users SET searchtext = to_tsvector('english', first_name || ' ' || last_name || ' ' || email)");
DB::statement("CREATE INDEX searchtext_gin ON users USING GIN(searchtext)");
DB::statement("CREATE TRIGGER ts_searchtext BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('searchtext', 'pg_catalog.english', 'first_name', 'last_name', 'email')");
If I have an entry with the first name "Christopher", and I run the following query I get no results
return User::whereRaw("searchtext ## to_tsquery('Chris')")->get();
If I search for "Christopher" I get the record. What do I need to do to be able to search with a partial match?
The english dictionary doesn't stem nicknames.
regress=> SELECT to_tsvector('english', 'Christopher'), to_tsquery('english', 'Chris');
to_tsvector | to_tsquery
---------------+------------
'christoph':1 | 'chris'
(1 row)
You'll need to overlay a dictionary that maps nicknames too, so christopher can be stemmed to chris.

Inserting values into multiple columns by splitting a string in PostgreSQL

I have the following heap of text:
"BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,
URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,
16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,
1.0,Version,1.0,".
What I'd like to do is extract data from this in the following manner:
BundleSize:155648
DynamicSize:204800
Identifier:com.URLConnectionSample
Name:URLConnectionSample
ShortVersion:1.0
Version:1.0
BundleSize:155648
DynamicSize:16384
Identifier:com.IdentifierForVendor3
Name:IdentifierForVendor3
ShortVersion:1.0
Version:1.0
All tips and suggestions are welcome.
It isn't quite clear what do you need to do with this data. If you really need to process it entirely in the database (looks like the task for your favorite scripting language instead), one option is to use hstore.
Converting records one by one is easy:
Assuming
%s =
BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0
SELECT * FROM each(hstore(string_to_array(%s, ',')));
Output:
key | value
--------------+-------------------------
Name | URLConnectionSample
Version | 1.0
BundleSize | 155648
Identifier | com.URLConnectionSample
DynamicSize | 204800
ShortVersion | 1.0
If you have table with columns exactly matching field names (note the quotes, populate_record is case-sensitive to key names):
CREATE TABLE data (
"BundleSize" integer, "DynamicSize" integer, "Identifier" text,
"Name" text, "ShortVersion" text, "Version" text);
You can insert hstore records into it like this:
INSERT INTO data SELECT * FROM
populate_record(NULL::data, hstore(string_to_array(%s, ',')));
Things get more complicated if you have comma-separated values for more than one record.
%s = BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,1.0,Version,1.0,
You need to break up an array into chunks of number_of_fields * 2 = 12 elements first.
SELECT hstore(row) FROM (
SELECT array_agg(str) AS row FROM (
SELECT str, row_number() OVER () AS i FROM
unnest(string_to_array(%s, ',')) AS str
) AS str_sub
GROUP BY (i - 1) / 12) AS row_sub
WHERE array_length(row, 1) = 12;
Output:
"Name"=>"URLConnectionSample", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.URLConnectionSample", "DynamicSize"=>"204800", "ShortVersion"=>"1.0"
"Name"=>"IdentifierForVendor3", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.IdentifierForVendor3", "DynamicSize"=>"16384", "ShortVersion"=>"1.0"
And inserting this into the aforementioned table:
INSERT INTO data SELECT (populate_record(NULL::data, hstore(row))).* FROM ...
the rest of the query is the same.