postgresql + textsearch + german umlauts + UTF8 - postgresql

I'm really at my wits end, with this Problem, and I really hope someone could help me. I am using a Postgresql 9.3. My Database contains mostly german texts but not only, so it's encoded in utf-8. I want to establish a fulltextsearch wich supports german language, nothing special so far.
But the search is behaving really strange,, and I can't find out what I am doing wrong.
So, given the following table given as example
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Tabelle »public.test«
Spalte | Typ | Attribute
--------+------+-----------
a | text |
sintext=# \d
Liste der Relationen
Schema | Name | Typ | Eigentümer
--------+---------------------+---------+------------
(...)
public | test | Tabelle | paf
Now, lets have a look at some textsearch examples:
select * from test where to_tsvector('german', a) ## plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) ## plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Whereas Tische is Plural of Tisch (table) and Bäume is plural of Baum (tree). So, Obviously Umlauts does not work while textsearch perfoms well.
But what really confuses me is, that a) non-german special characters are matching
select * from test where to_tsvector('german', a) ## plainto_tsquery('Café');
a
------
Café
and b) if I don't use the german dictionary, there is no Problem with umlauts (but of course no real textsearch as well)
select * from test where to_tsvector(a) ## plainto_tsquery('Bäume');
a
-------------
viele Bäume
So, if I use the german dictionary for Text-Search, just the german special characters do not work? Seriously? What the hell is wrong here? I Really can't figure it out, please help!

You're explicitly using the German dictionary for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default dictionary isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)

Related

problems with full-text search in postgres

I have the next table, and data:
/* script for people table, with field tsvector and gin */
CREATE TABLE public.people (
id INTEGER,
name VARCHAR(30),
lastname VARCHAR(30),
complete TSVECTOR
)
WITH (oids = false);
CREATE INDEX idx_complete ON public.people
USING gin (complete);
/* data for people table */
INSERT INTO public.people ("id", "name", "lastname", "complete")
VALUES
(1, 'MICHAEL', 'BRYANT BRYANT', '''bryant'':2,3 ''michael'':1'),
(2, 'HENRY STEVEN', 'BUSH TIESSEN', '''bush'':3 ''henri'':1 ''steven'':2 ''tiessen'':4'),
(3, 'WILLINGTON STEVEN', 'STEPHENS FLINN', '''flinn'':4 ''stephen'':3 ''steven'':2 ''willington'':1'),
(4, 'BRET', 'MARTINEZ AROCH', '''aroch'':3 ''bret'':1 ''martinez'':2'),
(5, 'TERENCE BERT', 'CAVALIERE ENRON', '''bert'':2 ''cavalier'':3 ''terenc'':1');
I need retrieve the names and lastnames, according the tsvector field. Actually I have the query:
SELECT * FROM people WHERE complete ## to_tsquery('WILLINGTON & FLINN');
And the result is right (the third record). BUT if I try with
SELECT * FROM people WHERE complete ## to_tsquery('STEVEN & FLINN');
/* the same record! */
I don't have results. Why? What can I do?
You should use the same language to search your table as the values in your field 'complete' where inserted.
Check the result of that query compared english and german:
select * ,
to_tsvector('english', concat_ws(' ', name, lastname )) as english,
to_tsvector('german', concat_ws(' ', name, lastname )) as german
from public.people
so that should work for you :
SELECT * FROM people WHERE complete ## to_tsquery('english','STEVEN & FLINN');
You are probably using a text search configuration where either STEVEN or FLINN are modified by stemming.
I can reproduce this here:
test=> SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.german
(1 row)
test=> SELECT complete FROM public.people WHERE id = 3;
complete
-------------------------------------------------
'flinn':4 'stephen':3 'steven':2 'willington':1
(1 row)
test=> SELECT * FROM ts_debug('STEVEN & FLINN');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+---------------+-------------+---------
asciiword | Word, all ASCII | STEVEN | {german_stem} | german_stem | {stev}
blank | Space symbols | | {} | |
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | FLINN | {german_stem} | german_stem | {flinn}
(4 rows)
test=> SELECT * FROM public.people
WHERE complete ## to_tsquery('STEVEN & FLINN');
id | name | lastname | complete
----+------+----------+----------
(0 rows)
So you see, the German Snowball dictionary stems STEVEN to stev.
Since complete contains the unstemmed version steven, no match is found.
You should use the same text search configuration when you populate complete and in the query.

String Include some other strings

How do I check in postgres that a varchar contains 'aaa' or 'bbb'?
I tried myVarchar IN ('aaa', 'bbb') but, obviously, it's true when myvarchar is exactly equal to 'aaa' or 'bbb'.
for multiple similarity check the best fit in terms of speed and laconic syntax would be
SIMILAR TO '%(aaa|bbb|ccc)%'
you can use ANY & LIKE operators together.
SELECT * FROM "myTable" WHERE "myColumn" LIKE ANY( ARRAY[ '%aaa%', '%bbb%' ] );
Assuming this is your table:
CREATE TABLE t
(
myVarchar varchar
) ;
INSERT INTO t (myVarchar)
VALUES
('something aaa else'),
('also some bbb'),
('maybe ccc') ;
-- (some random data, this query is PostgreSQL specific)
INSERT INTO t (myVarchar)
SELECT
random()::varchar
FROM
generate_series(1, 10000) ;
SQL Standard approach:
You can do (in all SQL standard databases):
SELECT
*
FROM
t
WHERE
myVarchar LIKE '%aaa%' or myVarchar LIKE '%bbb%' ;
and you'll get:
| myvarchar |
| :----------------- |
| something aaa else |
| also some bbb |
PostgreSQL specific approaches
Specifically for PostgreSQL, you can use a (single) regex with multiple values to look for:
SELECT
*
FROM
t
WHERE
myVarchar ~ 'aaa|bbb' ;
| myvarchar |
| :----------------- |
| something aaa else |
| also some bbb |
dbfiddle here
If you need quick finds, you can use trigram indexes, like this:
CREATE EXTENSION pg_trgm; -- Only needed if extension not already installed
CREATE INDEX myVarchar_like_idx
ON t
USING GIST (myVarchar gist_trgm_ops);
... the query using LIKE will be much faster.

PostgreSQL full text search yielding weird results

I have a schema like this (simplified):
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name NOT NULL
);
CREATE INDEX users_idx
ON users
USING GIN (to_tsvector('finnish', name));
But I'm getting completely invalid results with my queries:
# select name from users where to_tsvector('finnish', name) ## to_tsquery('lemmin');
name
------
(0 rows)
# select name from users where to_tsvector('finnish', name) ## to_tsquery('lemmink');
name
--------------------
Riitta ja Lemminki
Riitta ja Lemminki
(2 rows)
# select name from users where name ilike 'lemmink%';
name
----------------------
Lemminkäinen Matilda
Lemminkäinen Matias
Lemminkäinen Kyösti
Lemminkäinen Tuomas
(4 rows)
Another example:
# select name from users where to_tsvector('finnish', name) ## to_tsquery('partu');
name
----------
Partuuna
(1 row)
# select name from users where to_tsvector('finnish', name) ## to_tsquery('partur');
name
------------------------
Parturi-Kampaamo Raija
Parturi-Kampaamo Siema
(2 rows)
I was expecting to get the bottom two results on both queries...
Using the following version:
psql (9.4.6, server 9.5.2)
WARNING: psql major version 9.4, server major version 9.5.
Some psql features might not work.
I don't speak Finnish, but it seems expected result. FTS looks for lexemes, not for parts of words, Eg, do is not a lexemme for dog, but dog is for dogs:
t=# select to_tsvector('english', 'Dogs eats bone') ## to_tsquery('do');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
?column?
----------
f
(1 row)
t=# select to_tsvector('english', 'Dogs eats bone') ## to_tsquery('dog');
?column?
----------
t
(1 row)
So I believe in Parturi last i is optional ending - right?..
Update:
from https://en.wiktionary.org/wiki/parturi :
partur[i], partur[eita] => lexeme will be partur

Strange result searching with to_tsquery under Postgresql

I got a strange result searching for an expression like pro-physik.de with tsquery.
If I ask for pro-physik:* by tsquery I want to get all entries starting with pro-physik. Unfortunately those entries with pro-physik.de are missing.
Here are 2 examples to demonstrate the problem:
Query 1:
select
to_tsvector('simple', 'pro-physik.de') ##
to_tsquery('simple', 'pro-physik:*') = true
Result 1: false (should be true)
Query 2:
select
to_tsvector('simple', 'pro-physik.de') ##
to_tsquery('simple', 'pro-p:*') = true
Result 2: true
Has anybody an idea how I could solve this problem?
The core of the problem is that the parser will parse pro-physik.de as a hostname:
SELECT alias, token FROM ts_debug('simple', 'pro-physik.de');
alias | token
-------+---------------
host | pro-physik.de
(1 row)
Compare this:
SELECT alias, token FROM ts_debug('simple', 'pro-physik-de');
alias | token
-----------------+---------------
asciihword | pro-physik-de
hword_asciipart | pro
blank | -
hword_asciipart | physik
blank | -
hword_asciipart | de
(6 rows)
Now pro-physik and pro-p are not hostnames, so you get
SELECT to_tsquery('simple', 'pro-physik:*');
to_tsquery
---------------------------------------
'pro-physik':* & 'pro':* & 'physik':*
(1 row)
SELECT to_tsquery('simple', 'pro-p:*');
to_tsquery
-----------------------------
'pro-p':* & 'pro':* & 'p':*
(1 row)
The first tsquery will not match because physik is not a prefix of pro-physik.de, and the second will match because pro-p, pre and p all three are prefixes.
As a workaround, use full text search like this:
select
to_tsvector('simple', replace('pro-physik.de', '.', ' ')) ##
to_tsquery('simple', replace('pro-physik:*', '.', ' '))

PostgreSQL LIKE operator doesn't match hyphen

I've a problem with LIKE operator in PostgreSQL. It does not match patterns that contain the - character. I've tried to escape these characters using the ESCAPE option, but it still does not work.
Query:
select * from footable where trascrizione like '% [---]is Abraam %';
Sample data (contents of column trascrizione):
[---]is Abraam [e]t Ise[---] / ((crux quadrata))
How can I solve this problem?
That pattern would not match because there is no space before [---]is Ambraam. There is a space in your pattern, between the % and [ characters, and it's requiring that space to be in your data. Try LIKE '%[---]is Abraam %'.
Please show the code. I have no problem:
SELECT * FROM a WHERE a LIKE '%-' AND b LIKE '%-%' AND c LIKE '-%';
a | b | c | d | e
------+---+------+---+---
foo- | - | -bar | |
(1 Zeile)