Strange result searching with to_tsquery under Postgresql - postgresql

I got a strange result searching for an expression like pro-physik.de with tsquery.
If I ask for pro-physik:* by tsquery I want to get all entries starting with pro-physik. Unfortunately those entries with pro-physik.de are missing.
Here are 2 examples to demonstrate the problem:
Query 1:
select
to_tsvector('simple', 'pro-physik.de') ##
to_tsquery('simple', 'pro-physik:*') = true
Result 1: false (should be true)
Query 2:
select
to_tsvector('simple', 'pro-physik.de') ##
to_tsquery('simple', 'pro-p:*') = true
Result 2: true
Has anybody an idea how I could solve this problem?

The core of the problem is that the parser will parse pro-physik.de as a hostname:
SELECT alias, token FROM ts_debug('simple', 'pro-physik.de');
alias | token
-------+---------------
host | pro-physik.de
(1 row)
Compare this:
SELECT alias, token FROM ts_debug('simple', 'pro-physik-de');
alias | token
-----------------+---------------
asciihword | pro-physik-de
hword_asciipart | pro
blank | -
hword_asciipart | physik
blank | -
hword_asciipart | de
(6 rows)
Now pro-physik and pro-p are not hostnames, so you get
SELECT to_tsquery('simple', 'pro-physik:*');
to_tsquery
---------------------------------------
'pro-physik':* & 'pro':* & 'physik':*
(1 row)
SELECT to_tsquery('simple', 'pro-p:*');
to_tsquery
-----------------------------
'pro-p':* & 'pro':* & 'p':*
(1 row)
The first tsquery will not match because physik is not a prefix of pro-physik.de, and the second will match because pro-p, pre and p all three are prefixes.
As a workaround, use full text search like this:
select
to_tsvector('simple', replace('pro-physik.de', '.', ' ')) ##
to_tsquery('simple', replace('pro-physik:*', '.', ' '))

Related

Postgres full text search and spelling mistakes (aka fuzzy full text search)

I have a scenario, where I have data for informal communications that I need to be able to search. Therefore I want full text search, but I also to make sense of spelling mistakes. Question is how do I take spelling mistakes into account in order to be able to do fuzzy full text search??
This is very briefly discussed in Postgres Full Text Search is Good Enough where the article discusses misspelling.
So I have built a table of "documents", created indexes etc.
CREATE TABLE data (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
text TEXT NOT NULL);
I can create an additional column of type tsvector and index accordingly...
alter table data
add column search_index tsvector
generated always as (to_tsvector('english', coalesce(text, '')))
STORED;
create index search_index_idx on data using gin (search_index);
I have for example, some text where the data says "baloon", but someone may search "balloon", so I insert two rows (one deliberately misspelled)...
insert into data (text) values ('baloon');
insert into data (text) values ('balloon');
select * from data;
id | text | search_index
----+---------+--------------
1 | baloon | 'baloon':1
2 | balloon | 'balloon':1
... and perform full text searches against the data...
select * from data where search_index ## plainto_tsquery('balloon');
id | text | search_index
----+---------+--------------
2 | balloon | 'balloon':1
(1 row)
But I don't get back results for the misspelled version "baloon"... So using the suggestion in the linked article I've built a lookup table of all the words in my lexicon as follows...
"you may obtain good results by appending the similar lexeme to your tsquery"
CREATE TABLE data_words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', text) FROM data');
CREATE INDEX data_words_idx ON data_words USING GIN (word gin_trgm_ops);
... and I can search for similar words which may have been misspelled
select word, similarity(word, 'balloon') as similarity from data_words where similarity(word, 'balloon') > 0.4 order by similarity(word, 'balloon');
word | similarity
---------+------------
baloon | 0.6666667
balloon | 1
... but how do I actually include misspelled words in my query?
Isn't this what the article above means?
select plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
plainto_tsquery
----------------------------------
'balloon' & 'baloon' & 'balloon'
(1 row)
... plugged into an actual search, and I get no rows!
select * from data where text ## plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
select * from data where search_index ## phraseto_tsquery('baloon balloon'); -- no rows returned
I'm not sure where I'm going wrong here - can any shed any light? I feel like I'm super close to getting this going...?
SELECT to_tsquery('balloon |' ||
string_agg(word, ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;
For anyone looking at this thread, the accepted answer by #laurenz-albe needed a slight modification for me:
It required single quotes around the argument values passed to the string_agg function, which can be done using the format function along with the %L placeholder.
This updated code worked for me:
SELECT to_tsquery('balloon |' ||
string_agg(format('%L', word), ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;

problems with full-text search in postgres

I have the next table, and data:
/* script for people table, with field tsvector and gin */
CREATE TABLE public.people (
id INTEGER,
name VARCHAR(30),
lastname VARCHAR(30),
complete TSVECTOR
)
WITH (oids = false);
CREATE INDEX idx_complete ON public.people
USING gin (complete);
/* data for people table */
INSERT INTO public.people ("id", "name", "lastname", "complete")
VALUES
(1, 'MICHAEL', 'BRYANT BRYANT', '''bryant'':2,3 ''michael'':1'),
(2, 'HENRY STEVEN', 'BUSH TIESSEN', '''bush'':3 ''henri'':1 ''steven'':2 ''tiessen'':4'),
(3, 'WILLINGTON STEVEN', 'STEPHENS FLINN', '''flinn'':4 ''stephen'':3 ''steven'':2 ''willington'':1'),
(4, 'BRET', 'MARTINEZ AROCH', '''aroch'':3 ''bret'':1 ''martinez'':2'),
(5, 'TERENCE BERT', 'CAVALIERE ENRON', '''bert'':2 ''cavalier'':3 ''terenc'':1');
I need retrieve the names and lastnames, according the tsvector field. Actually I have the query:
SELECT * FROM people WHERE complete ## to_tsquery('WILLINGTON & FLINN');
And the result is right (the third record). BUT if I try with
SELECT * FROM people WHERE complete ## to_tsquery('STEVEN & FLINN');
/* the same record! */
I don't have results. Why? What can I do?
You should use the same language to search your table as the values in your field 'complete' where inserted.
Check the result of that query compared english and german:
select * ,
to_tsvector('english', concat_ws(' ', name, lastname )) as english,
to_tsvector('german', concat_ws(' ', name, lastname )) as german
from public.people
so that should work for you :
SELECT * FROM people WHERE complete ## to_tsquery('english','STEVEN & FLINN');
You are probably using a text search configuration where either STEVEN or FLINN are modified by stemming.
I can reproduce this here:
test=> SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.german
(1 row)
test=> SELECT complete FROM public.people WHERE id = 3;
complete
-------------------------------------------------
'flinn':4 'stephen':3 'steven':2 'willington':1
(1 row)
test=> SELECT * FROM ts_debug('STEVEN & FLINN');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+---------------+-------------+---------
asciiword | Word, all ASCII | STEVEN | {german_stem} | german_stem | {stev}
blank | Space symbols | | {} | |
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | FLINN | {german_stem} | german_stem | {flinn}
(4 rows)
test=> SELECT * FROM public.people
WHERE complete ## to_tsquery('STEVEN & FLINN');
id | name | lastname | complete
----+------+----------+----------
(0 rows)
So you see, the German Snowball dictionary stems STEVEN to stev.
Since complete contains the unstemmed version steven, no match is found.
You should use the same text search configuration when you populate complete and in the query.

PostgreSQL full text search yielding weird results

I have a schema like this (simplified):
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name NOT NULL
);
CREATE INDEX users_idx
ON users
USING GIN (to_tsvector('finnish', name));
But I'm getting completely invalid results with my queries:
# select name from users where to_tsvector('finnish', name) ## to_tsquery('lemmin');
name
------
(0 rows)
# select name from users where to_tsvector('finnish', name) ## to_tsquery('lemmink');
name
--------------------
Riitta ja Lemminki
Riitta ja Lemminki
(2 rows)
# select name from users where name ilike 'lemmink%';
name
----------------------
Lemminkäinen Matilda
Lemminkäinen Matias
Lemminkäinen Kyösti
Lemminkäinen Tuomas
(4 rows)
Another example:
# select name from users where to_tsvector('finnish', name) ## to_tsquery('partu');
name
----------
Partuuna
(1 row)
# select name from users where to_tsvector('finnish', name) ## to_tsquery('partur');
name
------------------------
Parturi-Kampaamo Raija
Parturi-Kampaamo Siema
(2 rows)
I was expecting to get the bottom two results on both queries...
Using the following version:
psql (9.4.6, server 9.5.2)
WARNING: psql major version 9.4, server major version 9.5.
Some psql features might not work.
I don't speak Finnish, but it seems expected result. FTS looks for lexemes, not for parts of words, Eg, do is not a lexemme for dog, but dog is for dogs:
t=# select to_tsvector('english', 'Dogs eats bone') ## to_tsquery('do');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
?column?
----------
f
(1 row)
t=# select to_tsvector('english', 'Dogs eats bone') ## to_tsquery('dog');
?column?
----------
t
(1 row)
So I believe in Parturi last i is optional ending - right?..
Update:
from https://en.wiktionary.org/wiki/parturi :
partur[i], partur[eita] => lexeme will be partur

Postgres Full Text Search on multiple columns

I am using Postgres 9.3 w/ Laravel 5 and I have set up the following migration:
DB::statement("ALTER TABLE users ADD COLUMN searchtext TSVECTOR");
DB::statement("UPDATE users SET searchtext = to_tsvector('english', first_name || ' ' || last_name || ' ' || email)");
DB::statement("CREATE INDEX searchtext_gin ON users USING GIN(searchtext)");
DB::statement("CREATE TRIGGER ts_searchtext BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('searchtext', 'pg_catalog.english', 'first_name', 'last_name', 'email')");
If I have an entry with the first name "Christopher", and I run the following query I get no results
return User::whereRaw("searchtext ## to_tsquery('Chris')")->get();
If I search for "Christopher" I get the record. What do I need to do to be able to search with a partial match?
The english dictionary doesn't stem nicknames.
regress=> SELECT to_tsvector('english', 'Christopher'), to_tsquery('english', 'Chris');
to_tsvector | to_tsquery
---------------+------------
'christoph':1 | 'chris'
(1 row)
You'll need to overlay a dictionary that maps nicknames too, so christopher can be stemmed to chris.

postgresql + textsearch + german umlauts + UTF8

I'm really at my wits end, with this Problem, and I really hope someone could help me. I am using a Postgresql 9.3. My Database contains mostly german texts but not only, so it's encoded in utf-8. I want to establish a fulltextsearch wich supports german language, nothing special so far.
But the search is behaving really strange,, and I can't find out what I am doing wrong.
So, given the following table given as example
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Tabelle »public.test«
Spalte | Typ | Attribute
--------+------+-----------
a | text |
sintext=# \d
Liste der Relationen
Schema | Name | Typ | Eigentümer
--------+---------------------+---------+------------
(...)
public | test | Tabelle | paf
Now, lets have a look at some textsearch examples:
select * from test where to_tsvector('german', a) ## plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) ## plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Whereas Tische is Plural of Tisch (table) and Bäume is plural of Baum (tree). So, Obviously Umlauts does not work while textsearch perfoms well.
But what really confuses me is, that a) non-german special characters are matching
select * from test where to_tsvector('german', a) ## plainto_tsquery('Café');
a
------
Café
and b) if I don't use the german dictionary, there is no Problem with umlauts (but of course no real textsearch as well)
select * from test where to_tsvector(a) ## plainto_tsquery('Bäume');
a
-------------
viele Bäume
So, if I use the german dictionary for Text-Search, just the german special characters do not work? Seriously? What the hell is wrong here? I Really can't figure it out, please help!
You're explicitly using the German dictionary for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default dictionary isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)