Full Text search with multiple synonyms in PostgresSQL - postgresql

I am implementing Full Text Search with PostgreSQL. I am using following type query to search in document column.
FROM schema.table t0
WHERE t0.document ## websearch_to_tsquery('error')
I am working on to use FTS Dictionaries to search for similar words. I come across C:\Program Files\PostgreSQL\14\share\tsearch_data folder where I have defined word and its synonyms in xsyn_sample.rules file. File content is as mentioned below.
# Sample rules file for eXtended Synonym (xsyn) dictionary
# format is as follows:
#
# word synonym1 synonym2 ...
#
error fault issue mistake malfunctioning
I want to use this dictionary but don't know how to use it. When I search for 'error', I wants to display result for 'error', 'fault', 'issues', 'mistakes' etc which are having similar meanings. Kindly share if you have ever come across this implementation. Few things I am asking for
Is this xsyn_sample.rules is sufficient for this? If not then what other techniques can be used for this type of search?
How to configure postgreSQL 14 in my local system to use this dictionary instead of 'simple' or 'english'. I know how to use both of these dictionary with select plainto_tsquery('english','errors'); and select plainto_tsquery('simple','errors'); queries. Similarly I want to use my custom dictionary.
Is there any better source for dictionaries use in postgres in compare to https://www.postgresql.org/docs/current/textsearch-dictionaries.html ?

Don't edit the example rules file, create your own file mysyn.rules and add the synonyms there. Then create a dictionary that uses the file:
CREATE TEXT SEARCH DICTIONARY mysyn (TEMPLATE = xsyn_template, RULES = mysyn);
Then copy the English text search configuration and add your dictionary:
CREATE TEXT SEARCH CONFIGURATION myconf (COPY = english);
ALTER TEXT SEARCH CONFIGURATION myconf
ALTER MAPPING FOR word, asciiword WITH mysyn, english_stem;

Related

make tsvector tokenize by space only

I need to create a tsvector that does not split its content by hyphens but ideally only by whitespace.
select to_tsvector('simple','7073-03-001-01 7072-05-003-06')
creates
'-001':3 '-003':7 '-01':4 '-03':2 '-05':6 '-06':8 '7072':5 '7073':1
where I rather want
'7072-05-003-06':2 '7073-03-001-01':1
is this possible somehow?
There is a simple example of a parser called test_parser which seems to do what you want. It was last in the documents in 9.4, after that it was moved to only be documented in the source tree. These test extensions aren't always installed, so you might need to take special steps (depending on how you installed PostgreSQL and what your OS is and whether you are really using an EOL version) to get it.
create extension test_parser ;
create text search configuration test ( parser = testparser);
ALTER TEXT SEARCH CONFIGURATION test ADD MAPPING FOR word WITH simple;
SELECT * FROM to_tsvector('test', '7073-03-001-01 7072-05-003-06');
to_tsvector
---------------------------------------
'7072-05-003-06':2 '7073-03-001-01':1

Add new language to postgresql full text search

Is there any way to add new languages to postgresq full text search?
Where can I read or start from ?
You can look at this a link from PostgreSQL documentation, where CREATE DICTIONARY commands are listed. There are several types of dictionaries that can be used and added, and commands for adding them deffer.
For example, if you wish to add Ispell dictionary, you would do it like this:
CREATE TEXT SEARCH DICTIONARY my_lang_ispell (
TEMPLATE = ispell,
DictFile = path_to_my_lang_dict_file,
AffFile = path_to_my_lang_affixes_file,
StopWords = path_to_my_lang_astop_words_file
);
DictFile and AffFile are files you need to google somewhere, depending on the language you want to add. StopWords file keeps words that should be ignored, I guess you can also find that file on the Internet.

PostgreSQL CREATE TEXT SEARCH DICTIONARY reports stop-word on thesaurus file row that actually do not contain the reported stop-word

I try to create thesaurus from the thesaurus file moby.ths using the query
CREATE TEXT SEARCH DICTIONARY thesaurus_moby (
TEMPLATE = thesaurus,
DictFile = moby,
Dictionary = english_stem
);
Postgres reports the following error:
[F0000] ERROR: thesaurus sample word "over" is a stop word (rule 63944)
When I open moby.ths then the line 63944 does not include "over".
Is the rule number not the same as the line number in .ths file?
How can I find the row in .ths file that corresponds to rule number?
UPDATE:
There are no "over" in .ths file sample side.
Does Postgres do some preprocessing by dropping rows from original .ths file and creates its rule set?
If so, how can I see the rule set that Postgres is working on and where the rule 63944 is probably present?

Find my Postgres text search dictionaries

I created a thesaurus for full text search a few months back. I just recently added some entries, and (I think) I update it like this:
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH [my_thesaurus], english_stem;
However, I don't actually don't remember what my thesaurus was called. How can I figure this out?
You may find it in the output of:
SELECT dictname FROM pg_catalog.pg_ts_dict;
If you use psql client, you can use the following command.
\dFd[+] PATTERN
lists text search dictionaries
Basically, you can use \dFd+ to list all dictionaries along with their initialization options.

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to return hits in queries like
select * from doc
where to_tsvector('english',body) ## to_tsvector('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replaceing them myself) that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.