Full text search configuration on postgresql - postgresql

I'm facing an issue concerning the text search configuration on postgresql.
I have a table users wich contains a column name. The name of users maybe a french, english, spanish or any other language.
So I need to use the Full Text Search of postgresql. The default text serach configuration I'm using now is the simple configuration but is not efficient to make the search and get the suitable results.
I'm trying to combine different text search configuration like this:
(to_tsvector('english', document) || to_tsvector('french', document) || to_tsvector('spanish', document) || to_tsvector('russian', document)) ##
(to_tsquery('english', query) || to_tsquery('french', query) || to_tsquery('spanish', query) || to_tsquery('russian', query))
But this query didn't give suitable results, if we test for example:
select (to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith') || to_tsvector('spanish', 'adam and smith') || to_tsvector('russian', 'adam and smith'))
tsvector: 'adam':1,4,7,10 'and':5,8 'smith':3,6,9,12
Using the origin language of the word:
select (to_tsvector('english', 'adam and smith'))
tsvector: 'adam':1 'smith':3
The first thing to mention that the stopwords were not token into consideration when we combine different configuration with || operator.
Is there any solution to combine different text search configuration and use the suitable language when a user search a text?

Maybe you think that || is an “or” operator, but it concatenates text search vectors.
Take a look at what happens in your expression.
Running \dF+ french in psql will show you that for asciiwords, a French Snowball stemmer is used. That removes stop words and reduces the words to their stem. Similar for English and Russian.
You can use ts_debug to see this in operation:
test=> SELECT * FROM ts_debug('english', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | adam | {english_stem} | english_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {english_stem} | english_stem | {smith}
(5 rows)
test=> SELECT * FROM ts_debug('french', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------+-------------+---------
asciiword | Word, all ASCII | adam | {french_stem} | french_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {french_stem} | french_stem | {and}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {french_stem} | french_stem | {smith}
(5 rows)
Now if you concatenate these four tsvectors, you end up with adam in position 1, 4, 7 and 10.
There is no good way to use full text search for different languages at once.
But if it is really personal names you are searching, I would do the following:
Create a text search configuration with a simple dictionary for asciiwords, and either use an empty stopword file for the dictionary or one that contains stopwords that are acceptable in all languages.
Personal names normally should not be stemmed, so you avoid that problem. And if you miss a stopword, that's no big deal. It only makes the resulting tsvector (and index) larger, but with personal names there should not be too many stopwords anyway.

Related

Create full text search configuration with two dictionaries

I want to perform a full text search on a postgresql column using the english_stem dictionary and the simple dictionary. I can do something like this:
ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
But this checks that the word is in both dictionaries. Is there a way to alter this configuration so the word can be matched with one dictionary OR the other?
Edit:
The reason I think they are not being checked in order is because when searching for a partial word that should be found in the simple dictionary, nothing is returned.
select * from ts_debug('english', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+----------
asciiword | Word, all ASCII | gutter | {english_stem} | english_stem | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {english_stem} | english_stem | {clean}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {english_stem} | english_stem | {servic}
select * from ts_debug('simple', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | gutter | {simple} | simple | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {simple} | simple | {cleaning}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {simple} | simple | {services}
select name from categories where (to_tsvector('english_simple_conf', name) ## (to_tsquery('english_simple_conf', 'cleani:*')));
name
------
(0 rows)
But searching for a partial in the english dictionary returns as expected.
select name from categories where (to_tsvector('english_simple_conf', name) ## (to_tsquery('english_simple_conf', 'clea:*')));
name
--------------------------
Gutter Cleaning Services
But this checks that the word is in both dictionaries.
That's not correct. As noted in the docs (see the description for the dictionary_name parameter), it checks them in order; it only checks the 2nd dictionary if it did not get a token from the first. You can verify this with ts_debug().
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | cars | {simple} | simple | {cars}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {simple} | simple | {boats}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+-----------------------+--------------+------------
asciiword | Word, all ASCII | cars | {english_stem,simple} | english_stem | {car}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {english_stem,simple} | english_stem | {boat}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
The reason for the difference in the last two queries is that english_stem stems 'Cleaning' to 'clean', so searching for 'cleani*' will not match. Try adding the to_tsvector and to_tsquery expressions as a column and removing them from the WHERE; you'll see that "Gutter Cleaning Services" is stemmed to 'clean':2 'gutter':1 'servic':3.
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleani:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'cleani':* | Gutter Cleaning Services
(1 row)
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleaning:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'clean':* | Gutter Cleaning Services
(1 row)
If you change the ts_query to instead search for cleaning:*, that will get stemmed as well and again match. But, english_stem cannot figure out that 'cleani' is meant to stem to 'clean' unless it also sees the 'ng'. So, that falls through to simple, which performs no stemming, and you end up with the mismatch - still a trailing i in the tsquery, but not in the tsvector.
Stemming isn't meant to work on arbitrary prefixes of words, only on whole ones; for prefix matching, you'd use a traditional left-anchored LIKE.

Does SQL have a way to group rows without squashing the group into a single row?

I want to do a single query that outputs an array of arrays of table rows. Think along the lines of <table><rowgroup><tr><tr><tr><rowgroup><tr><tr>. Is SQL capable of this? (specifically, as implemented in MariaDB, though migration to AWS RDS might occur one day)
The GROUP BY statement alone does not do this, it creates one row per group.
Here's an example of what I'm thinking of…
SELECT * FROM memes;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| puppies | gif |
| cats | jpeg |
| doggos | mp4 |
| horses | gif |
| chickens | gif |
| ducks | jpeg |
+------------+----------+
SELECT * FROM memes GROUP BY file_ext WITHOUT COLLAPSING GROUPS;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| cats | jpeg |
| ducks | jpeg |
+------------+----------+
| puppies | gif |
| horses | gif |
| chickens | gif |
+------------+----------+
| doggos | mp4 |
+------------+----------+
I've been using MySQL for ~20 years and have not come across this functionality before but maybe I've just been looking in the wrong place ¯\_(ツ)_/¯
I haven't seen an array rendering such as the one you want, but you can simulate it with multiple GROUP BY / GROUP_CONCAT() clauses.
For example:
select concat('[', group_concat(g), ']') as a
from (
select concat('[', group_concat(file_name), ']') as g
from memes
group by file_ext
) x
Result:
a
---------------------------------------------------------
[[puppies,horses,chickens],[kittens,cats,ducks],[doggos]]
See running example at DB Fiddle.
You can tweak the delimiters such as ,, [, and ].
SELECT ... ORDER BY file_ext will come close to your second output.
Using GROUP BY ... WITH ROLLUP would let you do subtotals under each group, which is not what you wanted either, but it would give you extra lines where you want the breaks.

Postgresql: Combining similarity with tsvector

I got a database table containing more than 50 million records
which i need to full text search as fast as possible.
On a smaller table i just had a index on the text column and i use the similarity function to get similar results. I was also able to sort by the result of similarity().
Now, after my table is a lot bigger, i switched to tsvector. I created a column for the tsvector result and a trigger which updates the column before insert or update. After that i can search ultra fast (<100ms).
The problem is that i would like to use a combination of both tsvector and similarity.
Example
My table contains the following data.
| MyColumn |
------------
| Apple |
| Orange |
| ... |
But if i search for "App" i don't get "Apple" back.
Any ideas on how to get a fast "like/similar" search with a "score/similarity" score ?
https://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
Also, * can be attached to a lexeme to specify prefix matching:
smth like that?.:
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) ## to_tsquery('app:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | f
(4 rows)
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) ## to_tsquery('ap:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | t
(4 rows)

Operator ~<~ in Postgres

(Originally part of this question, but it was bit irrelevant, so I decided to make it a question of its own.)
I cannot find what the operator ~<~ is. The Postgres manual only mentions ~ and similar operators here, but no sign of ~<~.
When fiddling in the psql console, I found out that these commands give the same results:
SELECT * FROM test ORDER BY name USING ~<~;
SELECT * FROM test ORDER BY name COLLATE "C";
And these gives the reverse ordering:
SELECT * FROM test ORDER BY name USING ~>~;
SELECT * FROM test ORDER BY name COLLATE "C" DESC;
Also some info on the tilde operators:
\do ~*~
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Description
------------+------+---------------+----------------+-------------+-------------------------
pg_catalog | ~<=~ | character | character | boolean | less than or equal
pg_catalog | ~<=~ | text | text | boolean | less than or equal
pg_catalog | ~<~ | character | character | boolean | less than
pg_catalog | ~<~ | text | text | boolean | less than
pg_catalog | ~>=~ | character | character | boolean | greater than or equal
pg_catalog | ~>=~ | text | text | boolean | greater than or equal
pg_catalog | ~>~ | character | character | boolean | greater than
pg_catalog | ~>~ | text | text | boolean | greater than
pg_catalog | ~~ | bytea | bytea | boolean | matches LIKE expression
pg_catalog | ~~ | character | text | boolean | matches LIKE expression
pg_catalog | ~~ | name | text | boolean | matches LIKE expression
pg_catalog | ~~ | text | text | boolean | matches LIKE expression
(12 rows)
The 3rd and 4th rows is the operator I'm looking for, but the description is a bit insufficient for me.
~>=~, ~<=~, ~>~ and ~<~ are text pattern (or varchar, basically the same) operators, the counterparts of their respective siblings >=, <=, >and <. They sort character data strictly by their byte values, ignoring rules of any collation setting (as opposed to their siblings). This makes them faster, but also invalid for most languages / countries.
The "C" locale is effectively the same as no locale, meaning no collation rules. That explains why ORDER BY name USING ~<~ and ORDER BY name COLLATE "C" end up doing the same. The latter syntax variant should be preferred: more standard, less error-prone.
Detailed explanation in the last chapter of this related answer on dba.SE:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Note that ~~ / ~~* are Postgres operators for LIKE / ILIKE and barely related to the above. Similarly, !~~ / !~~* for NOT LIKE / NOT IILKE. (Use standard LIKE notation instead of these "internal" operators.)
Related:
~~ Operator In Postgres
Symfony2 Doctrine - ILIKE clause for PostgreSQL?
Find rows where text array contains value similar to input

Escaping special characters in to_tsquery

How do you espace special characters in string passed to to_tsquery? For instance, this kind of query:
select to_tsquery('AT&T');
Produces:
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
to_tsquery
------------
(1 row)
Edit: I also noticed that there is the same issue in to_tsvector.
A simple solution is to create the tsquery as follows:
select $$'AT&T'$$::tsquery;
You can make more complex queries:
select $$'AT&T' & Phone | '|Bang!'$$::tsquery;
See the text search docs for more.
I found this comment very useful that uses the plainto_tsquery('AT&T) function https://stackoverflow.com/a/16020565/350195
If you want 'AT&T' to be treated as a search word, you're going to need some customised components, because the default parser splits it as two words:
steve#steve#[local] =# select * from ts_parse('default', 'AT&T');
tokid | token
-------+-------
1 | AT
12 | &
1 | T
(3 rows)
steve#steve#[local] =# select * from ts_debug('simple', 'AT&T');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | AT | {simple} | simple | {at}
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | T | {simple} | simple | {t}
(3 rows)
As you can see from the documentation for CREATE TEXT PARSER this is not very trivial, as the parser appears to need to be a C function.
You might find this post of someone getting "underscore_word" to be recognised as a single token useful: http://postgresql.1045698.n5.nabble.com/Configuring-Text-Search-parser-td2846645.html