I'm evaluating PostgreSQL to see if it is a viable alternative to Elasticsearch, at least to begin with (migrating later is fine). I've been reading that PG's full-text search capability is now 'good enough'. I'm running version 11.
Why doesn't this detect a match? I thought stemming would have easily detected different forms of the word "big":
SELECT to_tsvector('english', 'bigger') @@ to_tsquery('english', 'big')
Am I using the wrong configuration?
You can also reuse the scripts english.sh and english.sql from https://dba.stackexchange.com/questions/57058/how-do-i-use-an-ispell-dictionary-with-postgres-text-search.
I have modified the generated dictionaries as follows:
In english.affix I added the IG > GER rule:
flag *R:
E > R # As in skate > skater
[^AEIOU]Y > -Y,IER # As in multiply > multiplier
[AEIOU]Y > ER # As in convey > conveyer
[^EY] > ER # As in build > builder
IG > GER # For big > bigger
In english.dict I changed
big/PY
to
big/PYR
After running english.sql for the current database (you need to modify database name in the script):
postgres=# select ts_debug('english bigger');
ts_debug
----------------------------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",english,"{english_ispell,english_stem}",english_ispell,{english})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",bigger,"{english_ispell,english_stem}",english_ispell,"{bigger,big}")
(3 rows)
postgres=# SELECT to_tsvector('english bigger') @@ to_tsquery('english', 'big');
?column?
----------
t
(1 row)
Looks like I need to install an ispell dictionary as the English dictionary doesn't do this by default.
https://www.postgresql.org/docs/current/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
Also see this answer: https://stackoverflow.com/a/61213187/148390
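For completeness, the wiring itself is only a couple of statements. This is a sketch following the pattern from the docs, assuming you have placed english.dict, english.affix and english.stop in $SHAREDIR/tsearch_data (the configuration name english_ext is made up):

```sql
-- Hedged sketch: register the Ispell files as a dictionary, then put it
-- in front of the stemmer in a copy of the built-in configuration.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,    -- english.dict
    AffFile   = english,    -- english.affix
    StopWords = english     -- english.stop
);

CREATE TEXT SEARCH CONFIGURATION english_ext (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_ext
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_ispell, english_stem;
```

With english_ispell listed first, words it recognizes (like the modified big/bigger) are handled there, and english_stem only sees what Ispell doesn't know.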
I am working on an Oracle to PostgreSQL migration. Some .sql files that generate reports are executed on a Linux server. Below is the code that has to be migrated; I'm looking for an alternative in PostgreSQL.
Ex: In Oracle: test.sql
--------------------
col cdmate format 9999999 heading "Code Material"
col cdsorm format A11 heading "Code sort"
select t.cdmate, t.cdsorm from t_sormat t
Code Material Code sort
-------------- ---------------------
4832764 Sort-able
You can use column aliases and the FORMAT function to get what you need:
SELECT
FORMAT('%-10s', cdmate) AS "Code Material",
FORMAT('%-11s', cdsorm) AS "Code Sort"
FROM t_sormat;
Code Material | Code Sort
---------------+-------------
4832764 | Sort-able
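If fixed-width text output is all you need, rpad/lpad are a simpler alternative to FORMAT (a sketch against the same table; widths match the Oracle column definitions):

```sql
-- rpad left-justifies and pads to the given width, like Oracle's
-- "format A11" / "format 9999999" column settings
SELECT rpad(cdmate::text, 10) AS "Code Material",
       rpad(cdsorm, 11)       AS "Code Sort"
FROM t_sormat;
```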
This works as expected:
# select to_tsvector('SICK FOTOCEL VS#VE180-P132') @@ 'p132'::tsquery;
?column?
----------
t
However, when the '#' is replaced by a '/' I get
# select to_tsvector('SICK FOTOCEL VS/VE180-P132') @@ 'p132'::tsquery;
?column?
----------
f
This is because VS/VE180-P132 is classified as a file token, which is not correct in our use case. How do I change this behaviour? For instance, by dropping the token types email, url and file?
You cannot change this behaviour unless you want to write a new parser in C.
But you can use the workaround of replacing certain characters in all strings before you use full text search on them:
SELECT to_tsvector(regexp_replace('SICK FOTOCEL VS/VE180-P132', '[/.]', ' ', 'g'))
    @@ to_tsquery(regexp_replace('p132', '[/.]', ' ', 'g'));
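If you search these strings regularly, one option is to wrap the replacement in an IMMUTABLE function so it can back an expression index. This is a sketch only; the function, table and column names (fts_norm, products, description) are made up:

```sql
-- Wrapping the normalization makes it usable in a GIN expression index,
-- so queries don't have to re-normalize every row.
CREATE FUNCTION fts_norm(t text) RETURNS tsvector AS $$
    SELECT to_tsvector('english', regexp_replace(t, '[/.]', ' ', 'g'))
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX products_fts_idx ON products USING gin (fts_norm(description));

SELECT * FROM products WHERE fts_norm(description) @@ to_tsquery('p132');
```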
I created a PostgreSQL full text search configuration using 'german'. How can I configure it so that when I search for "Bezirk", lines containing "Bez." also match (and vice versa)?
#pozs is right. You need to use a synonym dictionary.
1 - In the directory $SHAREDIR/tsearch_data create the file german.syn with the following contents:
Bez Bezirk
2 - Execute the query:
CREATE TEXT SEARCH DICTIONARY german_syn (
template = synonym,
synonyms = german);
CREATE TEXT SEARCH CONFIGURATION german_syn(COPY='simple');
ALTER TEXT SEARCH CONFIGURATION german_syn
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_syn, german_stem;
Now you can test it. Execute queries:
test=# SELECT to_tsvector('german_syn', 'Bezirk') @@ to_tsquery('german_syn', 'Bezirk & Bez');
?column?
----------
t
(1 row)
test=# SELECT to_tsvector('german_syn', 'Bez Bez.') @@ to_tsquery('german_syn', 'Bezirk');
?column?
----------
t
(1 row)
Additional links:
PostgreSQL: A Full Text Search engine (expired)
Try using a wildcard in your search.
For example:
tableName.column LIKE 'Bez%'
The % matches any sequence of characters (including none) after 'Bez'.
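Note that a left-anchored LIKE can use a B-tree index, but in a non-C locale only with the pattern operator class. A sketch (table and column names assumed):

```sql
-- text_pattern_ops lets LIKE 'Bez%' use a B-tree index regardless of locale
CREATE INDEX bezirke_name_idx ON bezirke (name text_pattern_ops);

SELECT * FROM bezirke WHERE name LIKE 'Bez%';
```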
The description is too vague to tell exactly what you are trying to achieve, but it looks like you need simple pattern matching, since you are searching for abbreviations (so there is no need for stemming as in full text search). I would go with pg_trgm for this purpose:
WITH t(word) AS ( VALUES
('Bez'),
('Bezi'),
('Bezir')
)
SELECT word, similarity(word, 'Bezirk') AS similarity
FROM t
WHERE word % 'Bezirk'
ORDER BY similarity DESC;
Result:
word | similarity
-------+------------
Bezir | 0.625
Bezi | 0.5
Bez | 0.375
(3 rows)
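Keep in mind that pg_trgm is a contrib extension and has to be enabled once per database; with a trigram index, the % operator also stays fast on real tables (index sketch, table and column names assumed):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- optional: a trigram GIN index so the % operator doesn't need a seq scan
CREATE INDEX words_trgm_idx ON words USING gin (word gin_trgm_ops);
```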
What I'm trying to do is to raise out of range error in case of dates outside of the supported range like what typecasting does.
I'm using PostgreSQL-9.1.6 on CentOS. The issue is below...
postgres=# select to_date('20130229','yyyymmdd');
to_date
------------
2013-03-01
(1 row)
But the output I want to see is:
postgres=# select '20130229'::date;
ERROR: date/time field value out of range: "20130229"
Surfing the web I found an informative page, so I added an IS_VALID_JULIAN check to the body of to_date, inserting the four lines marked + below into formatting.c:
Datum
to_date(PG_FUNCTION_ARGS)
{
text *date_txt = PG_GETARG_TEXT_P(0);
text *fmt = PG_GETARG_TEXT_P(1);
DateADT result;
struct pg_tm tm;
fsec_t fsec;
do_to_timestamp(date_txt, fmt, &tm, &fsec);
+ if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+ errmsg("date out of range: \"%s\"",text_to_cstring(date_txt))));
result = date2j(tm.tm_year, tm.tm_mon, tm.tm_mday) - POSTGRES_EPOCH_JDATE;
PG_RETURN_DATEADT(result);
}
Then I rebuilt PostgreSQL:
pg_ctl -m fast stop # 1. stopping pgsql
vi src/backend/utils/adt/formatting.c # 2. using the version above
rm -rf /usr/local/pgsql/* # 3. getting rid of all bin files
./configure --prefix=/usr/local/pgsql \
    --enable-nls --with-perl --with-libxml \
    --with-pam --with-openssl
make && make install # 4. rebuilding source
pg_ctl start # 5. starting the engine
My bin directory info is below.
[/home/postgres]echo $PATH
/usr/lib64/qt-3.3/bin:
/usr/local/bin:
/bin:
/usr/bin:
/usr/local/sbin:
/usr/sbin:
/sbin:
/home/postgres/bin:
/usr/bin:
/usr/local/pgsql/bin:
/usr/local/pgpool/bin:
/usr/local/pgtop/bin/pg_top:
[/home/postgres]which pg_ctl
/usr/local/pgsql/bin/pg_ctl
[/home/postgres]which postgres
/usr/local/pgsql/bin/postgres
[/usr/local/bin]which psql
/usr/local/pgsql/bin/psql
But upon checking to_date again, the result remained the same.
postgres=# select to_date('20130229','yyyymmdd');
to_date
------------
2013-03-01
(1 row)
Is there anything I missed?
You can write your own to_date() function, but you have to call it with its schema-qualified name. (I used the schema "public", but there's nothing special about that.)
create or replace function public.to_date(any_date text, format_string text)
returns date as
$$
select to_date((any_date::date)::text, format_string);
$$
language sql;
Using the bare function name executes the native to_date() function.
select to_date('20130229', 'yyyymmdd');
2013-03-01
Using the schema-qualified name executes the user-defined function.
select public.to_date('20130229', 'yyyymmdd');
ERROR: date/time field value out of range: "20130229"
SQL state: 22008
I know that's not quite what you're looking for. But . . .
It's simpler than rebuilding PostgreSQL from source.
Fixing up your existing SQL and PL/pgSQL source code is a simple search-and-replace with a stream editor. I'm pretty sure that can't go wrong, as long as you really want every use of the native to_date() to be public.to_date().
The native to_date() function will still work as designed. Extensions and other code might rely on its somewhat peculiar behavior. Think hard and long before you change the behavior of native functions.
New SQL and PLPGSQL would need to be reviewed, though. I wouldn't expect developers to remember to write public.to_date() every time. If you use version control, you might be able to write a precommit hook to make sure only public.to_date() is used.
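Another option worth knowing: pg_catalog is implicitly searched *before* the schemas in search_path only when it is not listed explicitly. Naming it after public makes the bare function name resolve to the wrapper, so developers don't have to remember the schema qualification:

```sql
-- With pg_catalog listed last, the bare name now finds public.to_date first
SET search_path = public, pg_catalog;

SELECT to_date('20130229', 'yyyymmdd');
-- ERROR:  date/time field value out of range: "20130229"
```

This can be set per role or per database (ALTER ROLE ... SET search_path), but audit carefully: it changes name resolution for everything, not just to_date().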
The native to_date() function has behavior I don't see documented. Not only can you call it with February 29, you can call it with February 345, or February 9999.
select to_date('201302345', 'yyyymmdd');
2014-01-11
select to_date('2013029999', 'yyyymmdd');
2040-06-17
Given the following query:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('cats ate');
This query will return true as a result. Now, what if I don't want "cats" to also match the word "cat", is there any way I can prevent this?
Also, is there any way I can make sure that the tsquery matches the entire string in that particular order (e.g. the "cats ate" is counted as a single token rather than two). At the moment the following query will also match:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('ate cats');
cat matching cats is due to English stemming, english probably being your default text search configuration. Check the result of SHOW default_text_search_config to be sure.
It can be avoided by using the simple configuration. Try the function calls with explicit text configurations:
select to_tsvector('simple', 'fat cat ate rat') @@ plainto_tsquery('simple', 'cats ate');
Or change it with:
set default_text_search_config='simple';
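For the second question (matching "cats ate" as one ordered unit), the phrase operator <-> (available since PostgreSQL 9.6) requires the lexemes to be adjacent and in that order:

```sql
-- phraseto_tsquery builds 'cat <-> ate': adjacency and order are enforced
SELECT to_tsvector('simple', 'fat cat ate rat')
       @@ phraseto_tsquery('simple', 'cat ate');   -- true: 'cat' directly before 'ate'

SELECT to_tsvector('simple', 'fat cat ate rat')
       @@ phraseto_tsquery('simple', 'ate cat');   -- false: wrong order
```

With the 'simple' configuration, 'cats ate' would also stop matching, since 'cats' is no longer stemmed to 'cat'.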