How to configure PostgreSQL tokenization for full text search?

This works as expected:
# select to_tsvector('SICK FOTOCEL VS#VE180-P132') @@ 'p132'::tsquery;
?column?
----------
t
However, when the '#' is replaced by a '/', I get:
# select to_tsvector('SICK FOTOCEL VS/VE180-P132') @@ 'p132'::tsquery;
?column?
----------
f
This is because VS/VE180-P132 is classified as a file token, which is not correct for our use case. How do I change this behaviour? For instance, can I drop the token types email, url and file?

You cannot change this behaviour unless you want to write a new parser in C.
But you can use the workaround of replacing certain characters in all strings before you use full text search on them:
SELECT to_tsvector(regexp_replace('SICK FOTOCEL VS/VE180-P132', '[/.]', ' ', 'g'))
@@ to_tsquery(regexp_replace('p132', '[/.]', ' ', 'g'));
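To check how the default parser classifies each piece of the string before and after the replacement, ts_debug can be run on both variants. This is a minimal sketch; the 'simple' configuration is an assumption chosen only to keep stemming out of the picture, and the exact rows returned depend on your PostgreSQL version.

```sql
-- Show the token type (alias) the default parser assigns to each token.
SELECT alias, token, lexemes
FROM ts_debug('simple', 'SICK FOTOCEL VS/VE180-P132');

-- Same string after replacing every '/' and '.' with a space.
SELECT alias, token, lexemes
FROM ts_debug('simple',
              regexp_replace('SICK FOTOCEL VS/VE180-P132', '[/.]', ' ', 'g'));
```

Comparing the alias column of the two results shows whether the file token type is still produced.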

Related

Postgres Escape Single and Double Quotes in Text Field

I may have an odd request. I'm not finding any help via Google.
I am using the DbVisualizer Pro 10.0.15 gui tool connected to a PostgreSQL db.
I need to create a csv file from a database table. I select the records I need in a query, then export the results to a .csv file. I can do that easily.
select note from notes;
highlight all results records >> right-click >> select export >> choose csv
Some of the records have both single and/or double-quotes in the content.
The person receiving this file needs to upload the csv file into another system. They say that these single and double quotes in the content will not work with their upload, so I've been asked to escape them. The quotes should stay in the content but be preceded by a backslash, e.g. it is John's ball would show in the csv file as: it is John\'s ball. The same goes for double quotes.
I could probably do this with a search-and-replace function in a text editor after creating the csv file, but I'd like to think this can be done via sql.
I've tried playing with the regexp_replace() function.
select regexp_replace(note, '"', '\"') as notes from notes works on the dbl-quotes, but I'm not having any luck on the single quotes.
Help? Is there a way to do this?
You can escape double quotes by doing:
postgres=# SELECT REGEXP_REPLACE('this "is" a string', '"', '\"', 'g');
regexp_replace
----------------------
this \"is\" a string
(1 row)
For single quotes the approach is similar, but inside a SQL string literal a single quote is escaped by doubling it. So instead of writing \', you write ''. The query is:
postgres=# SELECT REGEXP_REPLACE('this ''is'' a string', '''', '\''', 'g');
regexp_replace
----------------------
this \'is\' a string
(1 row)
Note the 'g' flag at the end: it makes regexp_replace replace all occurrences, not just the first one found.
You can also replace both single and double quotes in a single statement, although they are replaced with the same string (\" in this case).
postgres=# SELECT REGEXP_REPLACE('this "is" a ''normal'' string', '["'']', '\"', 'g');
regexp_replace
---------------------------------
this \"is\" a \"normal\" string
(1 row)
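If each quote should keep its own character rather than all becoming \", the replacement can refer back to the matched text. This is a sketch, assuming standard_conforming_strings is on (the default in current PostgreSQL releases), so that the SQL parser passes the backslashes through unchanged.

```sql
-- In the replacement string, \\ is a literal backslash and \& is the
-- whole matched text, so every quote is prefixed with one backslash.
SELECT regexp_replace('this "is" a ''normal'' string', '["'']', '\\\&', 'g');
-- this \"is\" a \'normal\' string
```

Applied to the table from the question, the same expression would be regexp_replace(note, '["'']', '\\\&', 'g').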

PostgreSQL regexp.replace all unwanted chars

I have registration codes in my PostgreSQL table which are written messy, like MU-321-AB, MU/321/AB, MU 321-AB and so forth...
I would need to clear all of this to get MU321AB.
For this I use the following expression:
SELECT DISTINCT regexp_replace(ccode, '([^A-Za-z0-9])', ''), ...
This expression works as expected in .NET, but in PostgreSQL it removes only the first occurrence of an unwanted character.
How should I modify the regular expression so that it replaces all unwanted characters in the string and leaves a clean code with only letters and numbers?
Use the global flag, but without any capture groups:
SELECT DISTINCT regexp_replace(ccode, '[^A-Za-z0-9]', '', 'g'), ...
Note that .NET and PostgreSQL differ only in their defaults here: .NET's Regex.Replace replaces every match, whereas PostgreSQL's regexp_replace replaces only the first match unless the 'g' flag is given. Also, since you do not want anything extracted from the string, only some characters replaced, the capture group () is unnecessary.
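As a quick check, the sample codes from the question can be run through the expression directly. The inline VALUES list and the column name code are stand-ins for the real table, which is not shown in the question.

```sql
-- Strip everything that is not a letter or digit from each sample code.
SELECT code,
       regexp_replace(code, '[^A-Za-z0-9]', '', 'g') AS cleaned
FROM (VALUES ('MU-321-AB'), ('MU/321/AB'), ('MU 321-AB')) AS t(code);
-- every row yields cleaned = 'MU321AB'
```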

postgres text search - numhword matching and order

The postgres documentation here describes the numhword token type as "Hyphenated word, letters and digits". The example given for this is postgres-beta1, and this matches nicely. However, something such as postgres-9-beta1 does not match, and I can't seem to find a default parser that will work with this. SQL below.
Is my best choice something that parses just on spaces? Is there such a default parser? (It seems test_parser doesn't ship with 9.5 anymore...)
I want to tokenize hyphenated alphanumerics. Am I stuck with regular expressions for the time being, or is there a straightforward way to create a custom parser (without dropping down into C)?
CREATE TEXT SEARCH DICTIONARY simple_nostem_no_stop (TEMPLATE = pg_catalog.simple);
CREATE TEXT SEARCH CONFIGURATION test_id_search ( COPY = pg_catalog.simple );
alter text search configuration test_id_search
drop mapping for asciihword, asciiword, email, file, float, host, hword, hword_asciipart, hword_numpart, hword_part, int, numhword, numword, sfloat, uint, url, url_path, version, word ;
ALTER TEXT SEARCH CONFIGURATION test_id_search
ALTER MAPPING FOR numhword WITH simple_nostem_no_stop;
\dF+ test_id_search
Text search configuration "public.test_id_search"
Parser: "pg_catalog.default"
Token | Dictionaries
---------+-----------------------
numhword | simple_nostem_no_stop
/* This works as i hoped, per the docs: */
test_db=# select to_tsvector('test_id_search', ' postgresql-beta1 ') ;
to_tsvector
----------------------
'postgresql-beta1':1
(1 row)
/* This doesn't seem to work? */
test_db=# select to_tsvector('test_id_search', ' postgresql-9-beta1 ') ;
to_tsvector
-------------
(1 row)
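One way to investigate the empty result is to ask the default parser which token types it actually produces for the second string. This is a diagnostic sketch only; it does not change the parser's behaviour, and the output is not shown here because it can vary by version.

```sql
-- Pair each token from the default parser with the name of its token type.
SELECT tt.alias, p.token
FROM ts_parse('default', 'postgresql-9-beta1') AS p
JOIN ts_token_type('default') AS tt ON tt.tokid = p.tokid;
```

Any alias in the result other than numhword has no mapping in test_id_search after the drop mapping statement above, so tokens of that type are discarded by to_tsvector.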

RTRIM string ending with an incremental number

I am trying to narrow down the following string to just the username. The number at the end is always different. I can LTRIM just fine, but when I try to use RTRIM I am having difficulty removing everything to the right of the username.
C:\documents and settings\[USERNAME]\my documents\reports\204452.pdf
Will RTRIM work in this instance? If not, a point in the right direction would be appreciated.
Thanks.
If the username is always the third level of the full path, you can use a regular expression:
regexp_substr(<file path>, '[^\\]+', 1, 3)
For example:
select regexp_substr('C:\documents and settings\[USERNAME]\my documents\reports\204452.pdf', '[^\\]+', 1, 3)
from dual;
or using a subquery just to make it more readable:
select regexp_substr(file_path, '[^\\]+', 1, 3)
from (
select 'C:\documents and settings\[USERNAME]\my documents\reports\204452.pdf'
as file_path
from dual
);
REGEXP_SUBSTR(FILE_PATH,'[^\\]+',1,3)
-------------------------------------
[USERNAME]
Note that the backslash has to be escaped in the pattern.

PostgreSQL prevent non-matching tsqueries from matching tsvector

Given the following query:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('cats ate');
This query will return true as a result. Now, what if I don't want "cats" to also match the word "cat", is there any way I can prevent this?
Also, is there any way I can make sure that the tsquery matches the entire string in that particular order (e.g. the "cats ate" is counted as a single token rather than two). At the moment the following query will also match:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('ate cats');
cat matches cats because of English stemming; english is probably your default text search configuration. Check the result of show default_text_search_config to be sure.
It can be avoided by using the simple configuration. Try the function calls with explicit text configurations:
select to_tsvector('simple', 'fat cat ate rat') @@ plainto_tsquery('simple', 'cats ate');
Or change the default for the current session with:
set default_text_search_config='simple';
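The second part of the question, requiring the words to appear in the given order, is not addressed by the configuration change. A possible approach, assuming PostgreSQL 9.6 or later where phraseto_tsquery is available, is to build a phrase query so that the words must be adjacent and in order:

```sql
-- 'cat' immediately followed by 'ate': matches.
select to_tsvector('simple', 'fat cat ate rat') @@ phraseto_tsquery('simple', 'cat ate');
-- t

-- Same words in the opposite order: no match.
select to_tsvector('simple', 'fat cat ate rat') @@ phraseto_tsquery('simple', 'ate cat');
-- f
```

With the simple configuration there is no stemming, so 'cats ate' would not match either.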