Find my Postgres text search dictionaries - postgresql

I created a thesaurus for full text search a few months back. I recently added some entries, and (I think) I updated it like this:
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH [my_thesaurus], english_stem;
However, I don't actually remember what my thesaurus was called. How can I figure this out?

You may find it in the output of:
SELECT dictname FROM pg_catalog.pg_ts_dict;

If you use the psql client, you can use the following command:
\dFd[+] PATTERN
lists text search dictionaries
Basically, you can use \dFd+ to list all dictionaries along with their initialization options.
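If the list is long, it can help to narrow it down by template, or to look at which dictionaries a configuration actually maps. A minimal sketch using only the system catalogs (the configuration name 'english' is taken from the question; adjust it if your ALTER was run against a different configuration):

```sql
-- Dictionaries built on the thesaurus template, with their init options
SELECT d.dictname, d.dictinitoption
FROM pg_catalog.pg_ts_dict AS d
JOIN pg_catalog.pg_ts_template AS t ON t.oid = d.dicttemplate
WHERE t.tmplname = 'thesaurus';

-- Dictionaries mapped to each token type of the 'english' configuration
SELECT c.cfgname, tt.alias AS token_type, m.mapseqno, d.dictname
FROM pg_catalog.pg_ts_config AS c
JOIN pg_catalog.pg_ts_config_map AS m ON m.mapcfg = c.oid
JOIN pg_catalog.pg_ts_dict AS d ON d.oid = m.mapdict
JOIN LATERAL ts_token_type(c.cfgparser) AS tt ON tt.tokid = m.maptokentype
WHERE c.cfgname = 'english'
ORDER BY tt.alias, m.mapseqno;
```

In psql, \dF+ english shows the same mapping.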

Related

Full Text search with multiple synonyms in PostgreSQL

I am implementing full text search with PostgreSQL. I am using the following type of query to search the document column.
FROM schema.table t0
WHERE t0.document @@ websearch_to_tsquery('error')
I am trying to use FTS dictionaries to search for similar words. I came across the C:\Program Files\PostgreSQL\14\share\tsearch_data folder, where I have defined a word and its synonyms in the xsyn_sample.rules file. The file content is shown below.
# Sample rules file for eXtended Synonym (xsyn) dictionary
# format is as follows:
#
# word synonym1 synonym2 ...
#
error fault issue mistake malfunctioning
I want to use this dictionary but don't know how. When I search for 'error', I want to get results for 'error', 'fault', 'issue', 'mistake', etc., which have similar meanings. Please share if you have come across such an implementation. A few things I am asking:
Is this xsyn_sample.rules file sufficient for this? If not, what other techniques can be used for this type of search?
How do I configure PostgreSQL 14 on my local system to use this dictionary instead of 'simple' or 'english'? I know how to use both of those with select plainto_tsquery('english','errors'); and select plainto_tsquery('simple','errors'); and I want to use my custom dictionary in the same way.
Is there any better source on using dictionaries in Postgres than https://www.postgresql.org/docs/current/textsearch-dictionaries.html ?
Don't edit the example rules file; create your own file mysyn.rules and add the synonyms there. Then create a dictionary that uses the file:
CREATE TEXT SEARCH DICTIONARY mysyn (TEMPLATE = xsyn_template, RULES = mysyn);
Then copy the English text search configuration and add your dictionary:
CREATE TEXT SEARCH CONFIGURATION myconf (COPY = english);
ALTER TEXT SEARCH CONFIGURATION myconf
ALTER MAPPING FOR word, asciiword WITH mysyn, english_stem;
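Two details the steps above take for granted, as far as I can tell: the xsyn_template template comes from the dict_xsyn contrib extension, so that has to be installed in the database before the CREATE TEXT SEARCH DICTIONARY statement, and mysyn.rules has to sit in the tsearch_data directory. A quick way to check the result, assuming the names mysyn and myconf from above:

```sql
-- Run once per database, before creating the mysyn dictionary
CREATE EXTENSION IF NOT EXISTS dict_xsyn;

-- After the dictionary and configuration exist:
-- what the dictionary alone returns for a word listed in mysyn.rules
SELECT ts_lexize('mysyn', 'error');

-- how a query is built with the new configuration
SELECT plainto_tsquery('myconf', 'error');

-- optionally make myconf the default for this session
SET default_text_search_config = 'myconf';
```

If ts_lexize returns NULL for a word you listed, the dictionary is most likely not reading the rules file you expect.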

make tsvector tokenize by space only

I need to create a tsvector that does not split its content by hyphens but ideally only by whitespace.
select to_tsvector('simple','7073-03-001-01 7072-05-003-06')
creates
'-001':3 '-003':7 '-01':4 '-03':2 '-05':6 '-06':8 '7072':5 '7073':1
where I rather want
'7072-05-003-06':2 '7073-03-001-01':1
Is this possible somehow?
There is a simple example parser called test_parser which seems to do what you want. It was last included in the documentation for 9.4; after that it was moved and is now documented only in the source tree. These test extensions aren't always installed, so you might need to take special steps (depending on how you installed PostgreSQL, what your OS is, and whether you are really using an EOL version) to get it.
create extension test_parser ;
create text search configuration test ( parser = testparser);
ALTER TEXT SEARCH CONFIGURATION test ADD MAPPING FOR word WITH simple;
SELECT * FROM to_tsvector('test', '7073-03-001-01 7072-05-003-06');
to_tsvector
---------------------------------------
'7072-05-003-06':2 '7073-03-001-01':1
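If installing test_parser is not an option and positions are not needed, a different technique is to skip the parser entirely and build the tsvector from an array. This is a sketch, not the approach described above: array_to_tsvector (available from PostgreSQL 9.6, as far as I know) takes the array elements as lexemes verbatim, with no normalization and no position information.

```sql
-- Split on runs of whitespace and use each piece as a lexeme as-is
SELECT array_to_tsvector(
           regexp_split_to_array(trim('7073-03-001-01 7072-05-003-06'), '\s+')
       );
```

Because the resulting lexemes carry no positions, phrase search and position-based ranking will not work with such a vector.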

Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.
I encountered this strange, or maybe intended, behavior.
When indexing or searching a column that contains file names with extensions (e.g. myimage.jpg), the system treats each one as a URL and does not tokenize it properly.
I referred to the documentation and saw via ts_debug that the file name is treated as the host of a URL.
Could someone tell me how to make PostgreSQL FTS treat all input as normal words?
Also, as a second request, how can one do contains, startswith, and endswith searches with it?
Update
I have now tried create text search configuration..., copying from pg_catalog.english and removing host, url, and url_path, and then specified that configuration in the ts_debug call. But still no go: myimage.jpg is still identified as host.
Version
I use version 9.4
tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, urls, hosts, etc).
So after trying to figure this out myself, the issue is that you don't seem to be able to easily customise the parser. From my understanding, the parser runs first and generates tokens. Those tokens are then matched to dictionaries.
By removing host, url, url_path from the configuration all you are doing is making it so that these tokens don't get looked up in a dictionary, resulting in no lexeme from these tokens. Which essentially means that they don't exist in terms of search. Which is not what you want...
Ideally what you need to do is customise the parser to not generate those tokens in the first place, or to also generate overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as individual components). This doesn't seem to be possible at the moment without writing a custom parser.
The only solution to this would be to pre-parse the text to remove the full stop. Note that if you rely on other types of tokens like version (e.g. 8.3.0) or email (e.g. name@domain.com) this will break those. So you may need to be a bit clever on how you remove characters.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question: are you talking about partial word matches? You get this a little bit with the stemming when using a config like english, so running becomes run, which will match if you search for run or running. If you're talking about fuzzy matching it gets a little more complicated. I suggest reading this article http://rachbelaid.com/postgres-full-text-search-is-good-enough/
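For the startswith part specifically, tsquery supports prefix matching with the :* suffix. A small sketch, using the simple configuration and a file name whose dot has already been replaced by a space as described above:

```sql
-- 'myim:*' matches any lexeme that starts with "myim"
SELECT to_tsvector('simple', 'myimage jpg')
       @@ to_tsquery('simple', 'myim:*') AS starts_with_match;
```

As far as I know, full text search has no equivalent for contains or endswith; for those, LIKE/ILIKE (optionally backed by a pg_trgm index) is the usual route.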

Add new language to postgresql full text search

Is there any way to add new languages to PostgreSQL full text search?
Where can I read about it or start from?
You can look at this link from the PostgreSQL documentation, where the CREATE TEXT SEARCH DICTIONARY commands are listed. There are several types of dictionaries that can be used and added, and the commands for adding them differ.
For example, if you wish to add Ispell dictionary, you would do it like this:
CREATE TEXT SEARCH DICTIONARY my_lang_ispell (
TEMPLATE = ispell,
DictFile = path_to_my_lang_dict_file,
AffFile = path_to_my_lang_affixes_file,
StopWords = path_to_my_lang_stop_words_file
);
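DictFile and AffFile are files you need to find online, depending on the language you want to add. The StopWords file lists words that should be ignored; you can probably find that file on the Internet as well. Note that each value is the base name of a file in the tsearch_data directory (with the extensions .dict, .affix and .stop respectively), not a full path.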
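The dictionary by itself does nothing until a text search configuration maps token types to it. A minimal sketch, assuming the my_lang_ispell dictionary from above and a hypothetical configuration name my_lang; the simple dictionary is placed last as a fallback because an Ispell dictionary returns NULL for words it does not recognize:

```sql
-- Start from the built-in simple configuration (name my_lang is illustrative)
CREATE TEXT SEARCH CONFIGURATION my_lang (COPY = pg_catalog.simple);

-- Route word-like tokens through the Ispell dictionary first, then simple
ALTER TEXT SEARCH CONFIGURATION my_lang
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH my_lang_ispell, simple;

-- Try it out
SELECT to_tsvector('my_lang', 'some text in the new language');
```

If a Snowball stemmer exists for the language, it may be a better fallback than simple.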

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to match in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of running regexp_replace on them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
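If writing a parser is too much, the preprocessing route mentioned in the question is the fallback. A sketch using translate(), which does a plain character-for-character substitution and so avoids the regular expression engine (whether that is fast enough for your data is something you would have to measure):

```sql
-- Replace slashes with spaces before the parser sees the text
SELECT to_tsvector('english', translate('radio/tested', '/', ' '))
       @@ to_tsquery('english', 'radio') AS matches;

-- Compare how the parser classifies the raw and the preprocessed text
SELECT alias, token, lexemes
FROM ts_debug('english', 'radio/tested');

SELECT alias, token, lexemes
FROM ts_debug('english', translate('radio/tested', '/', ' '));
```

For an indexed column, the same translate(body, '/', ' ') expression would have to appear both in the index definition and in the query.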