Create full text search configuration with two dictionaries - postgresql

I want to perform a full text search on a PostgreSQL column using the english_stem dictionary and the simple dictionary. I can do something like this:
ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
But this checks that the word is in both dictionaries. Is there a way to alter this configuration so the word can be matched with one dictionary OR the other?
Edit:
The reason I think they are not being checked in order is because when searching for a partial word that should be found in the simple dictionary, nothing is returned.
select * from ts_debug('english', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+----------
asciiword | Word, all ASCII | gutter | {english_stem} | english_stem | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {english_stem} | english_stem | {clean}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {english_stem} | english_stem | {servic}
select * from ts_debug('simple', 'gutter cleaning services');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | gutter | {simple} | simple | {gutter}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | cleaning | {simple} | simple | {cleaning}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | services | {simple} | simple | {services}
select name from categories where (to_tsvector('english_simple_conf', name) @@ (to_tsquery('english_simple_conf', 'cleani:*')));
name
------
(0 rows)
But searching for a partial word that matches the stemmed form returns results as expected.
select name from categories where (to_tsvector('english_simple_conf', name) @@ (to_tsquery('english_simple_conf', 'clea:*')));
name
--------------------------
Gutter Cleaning Services

But this checks that the word is in both dictionaries.
That's not correct. As noted in the docs (see the description for the dictionary_name parameter), it checks them in order; it only consults the second dictionary if the first did not recognize the token. You can verify this with ts_debug().
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | cars | {simple} | simple | {cars}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {simple} | simple | {boats}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
testdb=# ALTER TEXT SEARCH CONFIGURATION english_simple_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_stem, simple;
ALTER TEXT SEARCH CONFIGURATION
testdb=# select * from ts_debug('public.english_simple_conf', 'cars boats n0taword');
alias | description | token | dictionaries | dictionary | lexemes
-----------+--------------------------+----------+-----------------------+--------------+------------
asciiword | Word, all ASCII | cars | {english_stem,simple} | english_stem | {car}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | boats | {english_stem,simple} | english_stem | {boat}
blank | Space symbols | | {} | |
numword | Word, letters and digits | n0taword | {simple} | simple | {n0taword}
(5 rows)
The reason for the difference in the last two queries is that english_stem stems 'Cleaning' to 'clean', so searching for 'cleani:*' will not match. Try adding the to_tsvector and to_tsquery expressions as columns and removing them from the WHERE; you'll see that "Gutter Cleaning Services" is stemmed to 'clean':2 'gutter':1 'servic':3.
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleani:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'cleani':* | Gutter Cleaning Services
(1 row)
testdb=# select to_tsvector('english_simple_conf', name), to_tsquery('english_simple_conf', 'cleaning:*'), name from categories;
to_tsvector | to_tsquery | name
---------------------------------+------------+--------------------------
'clean':2 'gutter':1 'servic':3 | 'clean':* | Gutter Cleaning Services
(1 row)
If you change the tsquery to instead search for cleaning:*, that will get stemmed as well and again match. But english_stem cannot figure out that 'cleani' is meant to stem to 'clean' unless it also sees the 'ng'. So that falls through to simple, which performs no stemming, and you end up with the mismatch: still a trailing i in the tsquery, but not in the tsvector.
Stemming isn't meant to work on arbitrary prefixes of words, only on whole ones; for prefix matching, you'd use a traditional left-anchored LIKE.
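For example, a minimal sketch against the categories table from the question (the leading % is needed here because 'cleaning' sits in the middle of the string; a left-anchored pattern like 'cleani%' only matches at the start of the value, though with a leading wildcard a plain btree index won't help):
select name from categories where name ilike '%cleani%';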

Related

Any way to find and delete almost similar records with SQL?

I have a table in Postgres DB, that has a lot of almost identical rows. For example:
1. 00Zicky_-_San_Pedro_Danilo_Vigorito_Remix
2. 00Zicky_-_San_Pedro__Danilo_Vigorito_Remix__
3. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx__
4. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx_
5. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
6. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
I thought about writing a little golang script to remove duplicates, but maybe SQL can do it?
Table definition:
\d+ songs
Table "public.songs"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
song_id | integer | | not null | nextval('songs_song_id_seq'::regclass) | plain | | |
song_name | character varying(250) | | not null | | extended | | |
fingerprinted | smallint | | | 0 | plain | | |
file_sha1 | bytea | | | | extended | | |
total_hashes | integer | | not null | 0 | plain | | |
date_created | timestamp without time zone | | not null | now() | plain | | |
date_modified | timestamp without time zone | | not null | now() | plain | | |
Indexes:
"pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap
I tried several methods to find duplicates, but those methods search only for exact matches.
There is no operator which is essentially A almost = B. (Well, there is full text search, but that seems a little excessive here.) If the only difference is the number of - and _ characters, then just get rid of them and compare the resulting strings. If those are equal, one row is a duplicate. You can use the replace() function to remove them. So something like:
delete
from songs s2
where exists ( select null
               from songs s1
               where s1.song_id < s2.song_id
                 and replace(replace(s1.song_name, '_',''),'-','') =
                     replace(replace(s2.song_name, '_',''),'-','')
             );
If your table is large this will not be fast, but a functional index may help:
create index song_name_idx on songs
(replace(replace(song_name, '_',''),'-',''));
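To preview what the delete will treat as duplicates before running it, you can select the normalized form alongside each row (a read-only sketch using the same nested replace()):
select song_id, song_name,
       replace(replace(song_name, '_',''),'-','') as normalized
from songs
order by normalized, song_id;
Rows sharing a normalized value form a duplicate group; the delete keeps the lowest song_id in each group.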

PostgreSQL full text search cannot find "andy"

I have this PostgreSQL query:
SELECT user_id, display_name, avatar_url
FROM user_directory_search
WHERE
user_id like '#and%';
I get these results:
user_id | display_name | avatar_url
----------------------------------------+--------------+------------
#andy.huang:synapse.siliconmotion.com | |
#andy.zhao:synapse.siliconmotion.com | Andy.zhao |
#andy.yao:synapse.siliconmotion.com | |
#andy.zou:synapse.siliconmotion.com | |
#andy.xie:synapse.siliconmotion.com | |
#andy.chang:synapse.siliconmotion.com | andy.chang |
#andy.chuang:synapse.siliconmotion.com | andy.chuang |
#andy.hsiao:synapse.siliconmotion.com | |
(8 rows)
But when I use the command:
SELECT user_id, display_name, avatar_url
FROM user_directory_search
WHERE
vector @@ to_tsquery('english', '(andy:* | andy)');
I got nothing:
user_id | display_name | avatar_url
---------+--------------+------------
(0 rows)
Does anyone know the reason?
The problem is that the full text parser parses these strings as host names:
SELECT alias, description, token, lexemes
FROM ts_debug('english', '#andy.huang:synapse.siliconmotion.com')
WHERE alias <> 'blank';
alias | description | token | lexemes
-------+-------------+---------------------------+-----------------------------
host | Host | andy.huang | {andy.huang}
host | Host | synapse.siliconmotion.com | {synapse.siliconmotion.com}
(2 rows)
You could replace the offending periods with spaces during indexing:
SELECT alias, description, token, lexemes
FROM ts_debug('english',
translate('#andy.huang:synapse.siliconmotion.com', '.', ' '))
WHERE alias <> 'blank';
alias | description | token | lexemes
-----------+-----------------+---------------+--------------
asciiword | Word, all ASCII | andy | {andi}
asciiword | Word, all ASCII | huang | {huang}
asciiword | Word, all ASCII | synapse | {synaps}
asciiword | Word, all ASCII | siliconmotion | {siliconmot}
asciiword | Word, all ASCII | com | {com}
(5 rows)
But I would use the simple full text search configuration if I were you. Or do you want stemming (compare "token" and "lexemes" above)?
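Putting both ideas together, a minimal sketch (assuming the stored vector is rebuilt the same way at indexing time; translate() here maps both '.' and ':' to spaces):
SELECT user_id, display_name, avatar_url
FROM user_directory_search
WHERE to_tsvector('simple', translate(user_id, '.:', '  '))
      @@ to_tsquery('simple', 'andy:*');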

Full text search configuration on postgresql

I'm facing an issue with the text search configuration on PostgreSQL.
I have a table users which contains a column name. The names of users may be in French, English, Spanish, or any other language.
So I need to use the Full Text Search of PostgreSQL. The default text search configuration I'm using now is simple, but it is not sufficient to get suitable search results.
I'm trying to combine different text search configuration like this:
(to_tsvector('english', document) || to_tsvector('french', document) || to_tsvector('spanish', document) || to_tsvector('russian', document)) @@
(to_tsquery('english', query) || to_tsquery('french', query) || to_tsquery('spanish', query) || to_tsquery('russian', query))
But this query didn't give suitable results. If we test, for example:
select (to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith') || to_tsvector('spanish', 'adam and smith') || to_tsvector('russian', 'adam and smith'))
tsvector: 'adam':1,4,7,10 'and':5,8 'smith':3,6,9,12
Using the original language of the words:
select (to_tsvector('english', 'adam and smith'))
tsvector: 'adam':1 'smith':3
The first thing to mention is that stopwords are not taken into consideration when we combine different configurations with the || operator.
Is there any solution to combine different text search configurations and use the suitable language when a user searches for text?
Maybe you think that || is an “or” operator, but it concatenates text search vectors.
Take a look at what happens in your expression.
Running \dF+ french in psql will show you that for asciiwords, a French Snowball stemmer is used. That removes stop words and reduces the words to their stem. Similar for English and Russian.
You can use ts_debug to see this in operation:
test=> SELECT * FROM ts_debug('english', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | adam | {english_stem} | english_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {english_stem} | english_stem | {smith}
(5 rows)
test=> SELECT * FROM ts_debug('french', 'adam and smith');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------+-------------+---------
asciiword | Word, all ASCII | adam | {french_stem} | french_stem | {adam}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | and | {french_stem} | french_stem | {and}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | smith | {french_stem} | french_stem | {smith}
(5 rows)
Now if you concatenate these four tsvectors, you end up with adam in position 1, 4, 7 and 10.
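You can watch the position shifting directly (a minimal example; english drops the stopword, french keeps "and"):
test=> SELECT to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith');
            ?column?
---------------------------------
 'adam':1,4 'and':5 'smith':3,6
(1 row)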
There is no good way to use full text search for different languages at once.
But if it is really personal names you are searching, I would do the following:
Create a text search configuration with a simple dictionary for asciiwords, and either use an empty stopword file for the dictionary or one that contains stopwords that are acceptable in all languages.
Personal names normally should not be stemmed, so you avoid that problem. And if you miss a stopword, that's no big deal. It only makes the resulting tsvector (and index) larger, but with personal names there should not be too many stopwords anyway.
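A minimal sketch of such a configuration (names_simple and names_conf are placeholder names; the optional STOPWORDS file would live in $SHAREDIR/tsearch_data):
CREATE TEXT SEARCH DICTIONARY names_simple (
    TEMPLATE = simple
    -- add "STOPWORDS = names" here to read tsearch_data/names.stop
);
CREATE TEXT SEARCH CONFIGURATION names_conf (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION names_conf
    ALTER MAPPING FOR asciiword, word, hword, hword_part
    WITH names_simple;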

Error in Insert query: syntax error at or near ","

My insert query is:
insert into app_library_reports
(app_id,adp_id,reportname,description,searchstr,command,templatename,usereporttemplate,reporttype,sentbothfiles,useprevioustime,usescheduler,cronstr,option,displaysettings,isanalyticsreport,report_columns,chart_config)
values
(25,18,"Report_Barracuda_SpamDomain_summary","Report On Domains Sending Spam Emails","tl_tag:Barracuda_spam AND action:2","BarracudaSpam/Report_Barracuda_SpamDomain_summary.py",,,,,,,,,,,,);
Schema for the table 'app_library_reports' is:
Table "public.app_library_reports"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+---------+------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('app_library_reports_id_seq'::regclass) | plain | |
app_id | integer | | plain | |
adp_id | integer | | plain | |
reportname | text | | extended | |
description | text | | extended | |
searchstr | text | | extended | |
command | text | | extended | |
templatename | text | | extended | |
usereporttemplate | boolean | | plain | |
reporttype | text | | extended | |
sentbothfiles | text | | extended | |
useprevioustime | text | | extended | |
usescheduler | text | | extended | |
cronstr | text | | extended | |
option | text | | extended | |
displaysettings | text | | extended | |
isanalyticsreport | boolean | | plain | |
report_columns | json | | extended | |
chart_config | json | | extended | |
Indexes:
"app_library_reports_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"app_library_reports_adp_id_fkey" FOREIGN KEY (adp_id) REFERENCES app_library_adapter(id)
"app_library_reports_app_id_fkey" FOREIGN KEY (app_id) REFERENCES app_library_definition(id)
When I execute the insert query it gives the error: ERROR: syntax error at or near ","
Please help me find this error. Thank you.
I'm fairly certain your immediate error is coming from the run of empty values (i.e. ,,,,,,,) at the end of the INSERT. If you don't want to specify a value for a particular column, you can pass NULL. But in your case, since you only specify values for the first 6 columns, another way is to list just those 6 column names in the INSERT. Note also that PostgreSQL string literals take single quotes; double quotes denote identifiers, so the double-quoted values in your query would also fail:
INSERT INTO app_library_reports
(app_id, adp_id, reportname, description, searchstr, command)
VALUES
(25, 18, 'Report_Barracuda_SpamDomain_summary',
'Report On Domains Sending Spam Emails', 'tl_tag:Barracuda_spam AND action:2',
'BarracudaSpam/Report_Barracuda_SpamDomain_summary.py')
This insert would only work if the columns not specified accept NULL. If some of the other columns are not nullable, then you would have to pass in values for them.
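If you do need one of the later columns, pass NULL explicitly for the gaps instead of leaving empty slots, e.g. (a sketch with an arbitrary subset of columns):
INSERT INTO app_library_reports
(app_id, adp_id, reportname, templatename, usereporttemplate)
VALUES
(25, 18, 'Report_Barracuda_SpamDomain_summary', NULL, false);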

Escaping special characters in to_tsquery

How do you escape special characters in a string passed to to_tsquery? For instance, this kind of query:
select to_tsquery('AT&T');
Produces:
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
to_tsquery
------------
(1 row)
Edit: I also noticed that there is the same issue in to_tsvector.
A simple solution is to create the tsquery as follows:
select $$'AT&T'$$::tsquery;
You can make more complex queries:
select $$'AT&T' & Phone | '|Bang!'$$::tsquery;
See the text search docs for more.
I found this answer, which uses the plainto_tsquery('AT&T') function, very useful: https://stackoverflow.com/a/16020565/350195
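Note that plainto_tsquery does not preserve the punctuation either; it just avoids the error by treating & as a word separator, e.g. with the stock simple configuration:
select plainto_tsquery('simple', 'AT&T');
 plainto_tsquery
-----------------
 'at' & 't'
(1 row)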
If you want 'AT&T' to be treated as a search word, you're going to need some customised components, because the default parser splits it as two words:
steve@steve@[local] =# select * from ts_parse('default', 'AT&T');
tokid | token
-------+-------
1 | AT
12 | &
1 | T
(3 rows)
steve@steve@[local] =# select * from ts_debug('simple', 'AT&T');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | AT | {simple} | simple | {at}
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | T | {simple} | simple | {t}
(3 rows)
As you can see from the documentation for CREATE TEXT SEARCH PARSER, this is not trivial, as the parser needs to be a C function.
You might find this post useful, where someone gets "underscore_word" recognised as a single token: http://postgresql.1045698.n5.nabble.com/Configuring-Text-Search-parser-td2846645.html