Escaping special characters in to_tsquery - postgresql

How do you espace special characters in string passed to to_tsquery? For instance, this kind of query:
select to_tsquery('AT&T');
Produces:
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
to_tsquery
------------
(1 row)
Edit: I also noticed that there is the same issue in to_tsvector.

A simple solution is to create the tsquery as follows:
select $$'AT&T'$$::tsquery;
You can make more complex queries:
select $$'AT&T' & Phone | '|Bang!'$$::tsquery;
See the text search docs for more.

I found this comment very useful that uses the plainto_tsquery('AT&T) function https://stackoverflow.com/a/16020565/350195

If you want 'AT&T' to be treated as a search word, you're going to need some customised components, because the default parser splits it as two words:
steve#steve#[local] =# select * from ts_parse('default', 'AT&T');
tokid | token
-------+-------
1 | AT
12 | &
1 | T
(3 rows)
steve#steve#[local] =# select * from ts_debug('simple', 'AT&T');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | AT | {simple} | simple | {at}
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | T | {simple} | simple | {t}
(3 rows)
As you can see from the documentation for CREATE TEXT PARSER this is not very trivial, as the parser appears to need to be a C function.
You might find this post of someone getting "underscore_word" to be recognised as a single token useful: http://postgresql.1045698.n5.nabble.com/Configuring-Text-Search-parser-td2846645.html

Related

Create full text search configuration with two dictionaries

I want to perform a full text search on a postgresql column using the english_stem dictionary and the simple dictionary. I can do something like this:
ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
But this checks that the word is in both dictionaries. Is there a way to alter this configuration so the word can be matched with one dictionary OR the other?
Edit:
The reason I think they are not being checked in order is because when searching for a partial word that should be found in the simple dictionary, nothing is returned.
select * from ts_debug('english', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+----------
asciiword | Word, all ASCII | gutter | {english_stem} | english_stem | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {english_stem} | english_stem | {clean}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {english_stem} | english_stem | {servic}
select * from ts_debug('simple', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | gutter | {simple} | simple | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {simple} | simple | {cleaning}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {simple} | simple | {services}
select name from categories where (to_tsvector('english_simple_conf', name) ## (to_tsquery('english_simple_conf', 'cleani:*')));
name
------
(0 rows)
But searching for a partial in the english dictionary returns as expected.
select name from categories where (to_tsvector('english_simple_conf', name) ## (to_tsquery('english_simple_conf', 'clea:*')));
name
--------------------------
Gutter Cleaning Services
But this checks that the word is in both dictionaries.
That's not correct. As noted in the docs (see the description for the dictionary_name parameter), it checks them in order; it only checks the 2nd dictionary if it did not get a token from the first. You can verify this with ts_debug().
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | cars | {simple} | simple | {cars}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {simple} | simple | {boats}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+-----------------------+--------------+------------
asciiword | Word, all ASCII | cars | {english_stem,simple} | english_stem | {car}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {english_stem,simple} | english_stem | {boat}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
The reason for the difference in the last two queries is that english_stem stems 'Cleaning' to 'clean', so searching for 'cleani*' will not match. Try adding the to_tsvector and to_tsquery expressions as a column and removing them from the WHERE; you'll see that "Gutter Cleaning Services" is stemmed to 'clean':2 'gutter':1 'servic':3.
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleani:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'cleani':* | Gutter Cleaning Services
(1 row)
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleaning:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'clean':* | Gutter Cleaning Services
(1 row)
If you change the ts_query to instead search for cleaning:*, that will get stemmed as well and again match. But, english_stem cannot figure out that 'cleani' is meant to stem to 'clean' unless it also sees the 'ng'. So, that falls through to simple, which performs no stemming, and you end up with the mismatch - still a trailing i in the tsquery, but not in the tsvector.
Stemming isn't meant to work on arbitrary prefixes of words, only on whole ones; for prefix matching, you'd use a traditional left-anchored LIKE.

when converting text/string to date in postgres, random date is generated

I have a text column indicating date i.e. 20170101
UPDATE table_name
SET work_date = to_date(workdate, 'YYYYMMDD');
I used this command to convert it as date. However, I got a odd result. I read though other existing posts but not sure what's wrong here.
+----------+---------------+
| workdate | work_date |
+----------+---------------+
| 20170211 | 2207-05-09 |
| 20170930 | 2209-04-27 |
| 20170507 | 2208-02-29 |
| 20170318 | 2207-08-24 |
+----------+---------------+
I think you must be mistaken about the data you are supplying to to_date.
For example, input to these functions is not restricted by normal ranges, thus to_date('20096040','YYYYMMDD') returns 2014-01-17 rather than causing an error.
Source: https://www.postgresql.org/docs/9.6/static/functions-formatting.html

Full text search configuration on postgresql

I'm facing an issue concerning the text search configuration on postgresql.
I have a table users wich contains a column name. The name of users maybe a french, english, spanish or any other language.
So I need to use the Full Text Search of postgresql. The default text serach configuration I'm using now is the simple configuration but is not efficient to make the search and get the suitable results.
I'm trying to combine different text search configuration like this:
(to_tsvector('english', document) || to_tsvector('french', document) || to_tsvector('spanish', document) || to_tsvector('russian', document)) ##
(to_tsquery('english', query) || to_tsquery('french', query) || to_tsquery('spanish', query) || to_tsquery('russian', query))
But this query didn't give suitable results, if we test for example:
select (to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith') || to_tsvector('spanish', 'adam and smith') || to_tsvector('russian', 'adam and smith'))
tsvector: 'adam':1,4,7,10 'and':5,8 'smith':3,6,9,12
Using the origin language of the word:
select (to_tsvector('english', 'adam and smith'))
tsvector: 'adam':1 'smith':3
The first thing to mention that the stopwords were not token into consideration when we combine different configuration with || operator.
Is there any solution to combine different text search configuration and use the suitable language when a user search a text?
Maybe you think that || is an “or” operator, but it concatenates text search vectors.
Take a look at what happens in your expression.
Running \dF+ french in psql will show you that for asciiwords, a French Snowball stemmer is used. That removes stop words and reduces the words to their stem. Similar for English and Russian.
You can use ts_debug to see this in operation:
test=> SELECT * FROM ts_debug('english', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | adam | {english_stem} | english_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {english_stem} | english_stem | {smith}
(5 rows)
test=> SELECT * FROM ts_debug('french', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------+-------------+---------
asciiword | Word, all ASCII | adam | {french_stem} | french_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {french_stem} | french_stem | {and}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {french_stem} | french_stem | {smith}
(5 rows)
Now if you concatenate these four tsvectors, you end up with adam in position 1, 4, 7 and 10.
There is no good way to use full text search for different languages at once.
But if it is really personal names you are searching, I would do the following:
Create a text search configuration with a simple dictionary for asciiwords, and either use an empty stopword file for the dictionary or one that contains stopwords that are acceptable in all languages.
Personal names normally should not be stemmed, so you avoid that problem. And if you miss a stopword, that's no big deal. It only makes the resulting tsvector (and index) larger, but with personal names there should not be too many stopwords anyway.

Postgresql: Combining similarity with tsvector

I got a database table containing more than 50 million records
which i need to full text search as fast as possible.
On a smaller table i just had a index on the text column and i use the similarity function to get similar results. I was also able to sort by the result of similarity().
Now, after my table is a lot bigger, i switched to tsvector. I created a column for the tsvector result and a trigger which updates the column before insert or update. After that i can search ultra fast (<100ms).
The problem is that i would like to use a combination of both tsvector and similarity.
Example
My table contains the following data.
| MyColumn |
------------
| Apple |
| Orange |
| ... |
But if i search for "App" i don't get "Apple" back.
Any ideas on how to get a fast "like/similar" search with a "score/similarity" score ?
https://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
Also, * can be attached to a lexeme to specify prefix matching:
smth like that?.:
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) ## to_tsquery('app:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | f
(4 rows)
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) ## to_tsquery('ap:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | t
(4 rows)

Operator ~<~ in Postgres

(Originally part of this question, but it was bit irrelevant, so I decided to make it a question of its own.)
I cannot find what the operator ~<~ is. The Postgres manual only mentions ~ and similar operators here, but no sign of ~<~.
When fiddling in the psql console, I found out that these commands give the same results:
SELECT * FROM test ORDER BY name USING ~<~;
SELECT * FROM test ORDER BY name COLLATE "C";
And these gives the reverse ordering:
SELECT * FROM test ORDER BY name USING ~>~;
SELECT * FROM test ORDER BY name COLLATE "C" DESC;
Also some info on the tilde operators:
\do ~*~
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Description
------------+------+---------------+----------------+-------------+-------------------------
pg_catalog | ~<=~ | character | character | boolean | less than or equal
pg_catalog | ~<=~ | text | text | boolean | less than or equal
pg_catalog | ~<~ | character | character | boolean | less than
pg_catalog | ~<~ | text | text | boolean | less than
pg_catalog | ~>=~ | character | character | boolean | greater than or equal
pg_catalog | ~>=~ | text | text | boolean | greater than or equal
pg_catalog | ~>~ | character | character | boolean | greater than
pg_catalog | ~>~ | text | text | boolean | greater than
pg_catalog | ~~ | bytea | bytea | boolean | matches LIKE expression
pg_catalog | ~~ | character | text | boolean | matches LIKE expression
pg_catalog | ~~ | name | text | boolean | matches LIKE expression
pg_catalog | ~~ | text | text | boolean | matches LIKE expression
(12 rows)
The 3rd and 4th rows is the operator I'm looking for, but the description is a bit insufficient for me.
~>=~, ~<=~, ~>~ and ~<~ are text pattern (or varchar, basically the same) operators, the counterparts of their respective siblings >=, <=, >and <. They sort character data strictly by their byte values, ignoring rules of any collation setting (as opposed to their siblings). This makes them faster, but also invalid for most languages / countries.
The "C" locale is effectively the same as no locale, meaning no collation rules. That explains why ORDER BY name USING ~<~ and ORDER BY name COLLATE "C" end up doing the same. The latter syntax variant should be preferred: more standard, less error-prone.
Detailed explanation in the last chapter of this related answer on dba.SE:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Note that ~~ / ~~* are Postgres operators for LIKE / ILIKE and barely related to the above. Similarly, !~~ / !~~* for NOT LIKE / NOT IILKE. (Use standard LIKE notation instead of these "internal" operators.)
Related:
~~ Operator In Postgres
Symfony2 Doctrine - ILIKE clause for PostgreSQL?
Find rows where text array contains value similar to input