PostgreSQL full text search yielding weird results

I have a schema like this (simplified):
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE INDEX users_idx
ON users
USING GIN (to_tsvector('finnish', name));
But I'm getting completely invalid results with my queries:
# select name from users where to_tsvector('finnish', name) @@ to_tsquery('lemmin');
name
------
(0 rows)
# select name from users where to_tsvector('finnish', name) @@ to_tsquery('lemmink');
name
--------------------
Riitta ja Lemminki
Riitta ja Lemminki
(2 rows)
# select name from users where name ilike 'lemmink%';
name
----------------------
Lemminkäinen Matilda
Lemminkäinen Matias
Lemminkäinen Kyösti
Lemminkäinen Tuomas
(4 rows)
Another example:
# select name from users where to_tsvector('finnish', name) @@ to_tsquery('partu');
name
----------
Partuuna
(1 row)
# select name from users where to_tsvector('finnish', name) @@ to_tsquery('partur');
name
------------------------
Parturi-Kampaamo Raija
Parturi-Kampaamo Siema
(2 rows)
I was expecting to get the bottom two results on both queries...
Using the following version:
psql (9.4.6, server 9.5.2)
WARNING: psql major version 9.4, server major version 9.5.
Some psql features might not work.

I don't speak Finnish, but this looks like the expected result. FTS matches lexemes, not arbitrary parts of words. E.g., do is not a lexeme for dog, but dog is the lexeme for dogs:
t=# select to_tsvector('english', 'Dogs eats bone') @@ to_tsquery('do');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
?column?
----------
f
(1 row)
t=# select to_tsvector('english', 'Dogs eats bone') @@ to_tsquery('dog');
?column?
----------
t
(1 row)
So I believe the final i in Parturi is an optional ending that the stemmer strips, right?
Update:
from https://en.wiktionary.org/wiki/parturi :
partur[i], partur[eita] => lexeme will be partur

Related

Postgres FTS Priority Field

I am using Postgres FTS to search a field in a table. For some reason, the following happens:
store=# select name from service where to_tsvector(name) @@ to_tsquery('find:*') = true;
name
--------------
Finding Nora
(1 row)
store=# select name from service where to_tsvector(name) @@ to_tsquery('findi:*') = true;
name
------
(0 rows)
Why does the query findi:* return no results?
In my PG 12.2 with default text search configuration I have:
# select to_tsvector('Finding Nora');
to_tsvector
-------------------
'find':1 'nora':2
(1 row)
# select to_tsquery('findi:*');
to_tsquery
------------
'findi':*
(1 row)
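As I understand it, the document text Finding is stemmed to the lexeme find, but the query term findi is left unchanged by the stemmer. The prefix findi does not match the lexeme find, so the query finds no match.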

How to use LIKE for an Inet column?

I have a column in my database called mgmt which is an inet type
I would like to do this:
select * from source
where mgmt LIKE inet('10.208.6%')
but this gives the error "ERROR: invalid input syntax for type inet:"
Any idea how I can use the LIKE clause for an inet type?
First idea:
Cast to text and then use LIKE:
WHERE mgmt::text LIKE '10.208.6%'
Second idea:
Try this
WHERE mgmt BETWEEN inet('10.208.60.0') AND inet('10.208.69.255')
(or, to be more precise, because your wildcard covers both 6 and the 60s)
WHERE (mgmt BETWEEN inet('10.208.60.0') AND inet('10.208.69.255'))
OR (mgmt BETWEEN inet('10.208.6.0') AND inet('10.208.6.255'))
Third idea:
If you just want to check the 6 range, use the << operator to check the IP range:
WHERE mgmt << inet('10.208.6/24')
I would use the << operator to do inet operations:
postgres=# SELECT '192.168.1.1'::inet << '192.168.1.0/24'::inet;
?column?
----------
t
(1 row)
postgres=# SELECT '192.168.10.1'::inet << '192.168.1.0/24'::inet;
?column?
----------
f
(1 row)
But if you really must use ILIKE for some reason, you can cast to text:
postgres=# SELECT text('192.168.10.1'::inet) ilike '192.168.1%';
?column?
----------
t
(1 row)
More information can be found in the documentation
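The difference between subnet containment and text prefix matching can be sketched outside the database. This is only an illustration using Python's standard ipaddress module, not PostgreSQL itself; the helper names in_subnet and text_prefix are made up for this example, and they are rough analogues of the << operator and of LIKE on the text form.

```python
import ipaddress


def in_subnet(address: str, subnet: str) -> bool:
    """Rough analogue of `address::inet << subnet::inet` for a host address."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


def text_prefix(address: str, prefix: str) -> bool:
    """Rough analogue of `address::text LIKE prefix || '%'`."""
    return address.startswith(prefix)


# Containment looks at the network bits, so 192.168.10.1 is outside 192.168.1.0/24.
print(in_subnet("192.168.1.1", "192.168.1.0/24"))
print(in_subnet("192.168.10.1", "192.168.1.0/24"))

# A text prefix only compares characters, so 192.168.10.1 "matches" 192.168.1%.
print(text_prefix("192.168.10.1", "192.168.1"))

# The same effect for the pattern 10.208.6%: it covers 10.208.6.x and 10.208.60-69.x.
print(text_prefix("10.208.65.7", "10.208.6"))
print(in_subnet("10.208.65.7", "10.208.6.0/24"))
```

This is why the text cast can return more rows than intended, while the containment check only returns addresses inside the given network.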

Importing bytea data into PostgreSQL by using COPY FROM stdin

I generated a (UTF-8) file with an external program for importing into PostgreSQL 9.6.1. The problem is the bytea field (PWHASH).
Snippet from this file (using TAB as delimiter)
COPY USERS (ID,CODE,PWHASH,EMAIL) FROM stdin;
7 test1 E'\\\\x657B954D27B4AC56FA997D24A5FF2563' test@amce.org
\.
When importing with
psql mydb myrole -f test.sql
Everything goes well.
However, if I query the result, the byte array is not 16 bytes, but 37 bytes:
select passwordhash,length(passwordhash) from users;
passwordhash | length
------------------------------------------------------------------------------+--------
\x45275c78363537423935344432374234414335364641393937443234413546463235363327 | 37
What is the correct syntax for this?
The format of the input file is wrong. It should be like this:
7 test1 \\x657B954D27B4AC56FA997D24A5FF2563 test@amce.org
You will have to prepare the data, I believe. Something like this:
t=# insert into u select 'x657B954D27B4AC56FA997D24A5FF2563';
INSERT 0 1
Time: 5990.809 ms
t=# select b from u;
b
----------------------------------------------------------------------
\x783635374239353444323742344143353646413939374432344135464632353633
(1 row)
Time: 0.234 ms
t=# insert into u select decode('657B954D27B4AC56FA997D24A5FF2563','hex');
INSERT 0 1
Time: 62.767 ms
t=# select b from u;
b
----------------------------------------------------------------------
\x783635374239353444323742344143353646413939374432344135464632353633
\x657b954d27b4ac56fa997d24a5ff2563
(2 rows)
Time: 0.208 ms
So in your case you can:
create table t as select ID,CODE,PWHASH::text,EMAIL from users where false;
COPY t (ID,CODE,PWHASH,EMAIL) FROM stdin;
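-- pull out the 32 hex digits, whatever quoting surrounds them, then decode
insert into users select ID,CODE,decode(substring(PWHASH from '[0-9A-Fa-f]{32}'),'hex'),EMAIL from t;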
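To see where the 37 bytes come from, it may help to decode the stored value from the question. The sketch below is plain Python, not part of the import; it only reinterprets the hex dump shown in the query output and assumes that value was copied correctly.

```python
# Hex dump of the stored bytea value, copied from the query output in the question.
stored = bytes.fromhex(
    "45275c78363537423935344432374234414335364641393937443234413546463235363327"
)

# The stored bytes are the ASCII characters of the literal itself:
# the E prefix, the quotes, a backslash and an x were all stored as data.
text = stored.decode("ascii")
print(text)
print(len(stored))

# What was intended: the 16 bytes that the hex digits describe.
intended = bytes.fromhex("657B954D27B4AC56FA997D24A5FF2563")
print(len(intended))

# Recovering the intended bytes from the stored text:
# take everything after the x and drop the trailing quote.
hex_part = text[text.index("x") + 1 : -1]
print(bytes.fromhex(hex_part) == intended)
```

In other words, COPY does not interpret the E'...' string-literal syntax. It treats those characters as part of the field value, which is why the corrected input line contains only the escaped \x prefix and the hex digits.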

Strange result searching with to_tsquery under Postgresql

I got a strange result searching for an expression like pro-physik.de with tsquery.
If I ask for pro-physik:* by tsquery I want to get all entries starting with pro-physik. Unfortunately those entries with pro-physik.de are missing.
Here are 2 examples to demonstrate the problem:
Query 1:
select
to_tsvector('simple', 'pro-physik.de') @@
to_tsquery('simple', 'pro-physik:*') = true
Result 1: false (should be true)
Query 2:
select
to_tsvector('simple', 'pro-physik.de') @@
to_tsquery('simple', 'pro-p:*') = true
Result 2: true
Has anybody an idea how I could solve this problem?
The core of the problem is that the parser will parse pro-physik.de as a hostname:
SELECT alias, token FROM ts_debug('simple', 'pro-physik.de');
alias | token
-------+---------------
host | pro-physik.de
(1 row)
Compare this:
SELECT alias, token FROM ts_debug('simple', 'pro-physik-de');
alias | token
-----------------+---------------
asciihword | pro-physik-de
hword_asciipart | pro
blank | -
hword_asciipart | physik
blank | -
hword_asciipart | de
(6 rows)
Now pro-physik and pro-p are not hostnames, so you get
SELECT to_tsquery('simple', 'pro-physik:*');
to_tsquery
---------------------------------------
'pro-physik':* & 'pro':* & 'physik':*
(1 row)
SELECT to_tsquery('simple', 'pro-p:*');
to_tsquery
-----------------------------
'pro-p':* & 'pro':* & 'p':*
(1 row)
The first tsquery will not match because physik is not a prefix of pro-physik.de, and the second will match because pro-p, pro and p are all prefixes of it.
As a workaround, use full text search like this:
select
to_tsvector('simple', replace('pro-physik.de', '.', ' ')) @@
to_tsquery('simple', replace('pro-physik:*', '.', ' '))
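The matching rule described above can be modelled in a few lines. This is a toy model in Python, not how PostgreSQL is implemented: it ignores positions, weights and dictionaries, and the function name matches is invented for the example. The lexeme set for the workaround is an assumption based on how the parser splits hyphenated words in the ts_debug output above.

```python
def matches(lexemes, prefix_terms):
    """Toy model of an AND of prefix terms ('a':* & 'b':*) against a tsvector.

    Every term must be a prefix of at least one lexeme in the vector.
    """
    return all(
        any(lexeme.startswith(term) for lexeme in lexemes)
        for term in prefix_terms
    )


# 'pro-physik.de' is parsed as a single host token, so the vector has one lexeme.
host_vector = {"pro-physik.de"}

# 'pro-physik':* & 'pro':* & 'physik':*  -> physik is not a prefix of the host token.
print(matches(host_vector, ["pro-physik", "pro", "physik"]))

# 'pro-p':* & 'pro':* & 'p':*  -> all three are prefixes of the host token.
print(matches(host_vector, ["pro-p", "pro", "p"]))

# After replacing '.' with ' ', the text is no longer a hostname. Assumed lexemes:
split_vector = {"pro-physik", "pro", "physik", "de"}
print(matches(split_vector, ["pro-physik", "pro", "physik"]))
```

Under this model the first query fails only because of the physik term, which is what the workaround addresses by keeping the parser from treating the text as a hostname.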

postgresql + textsearch + german umlauts + UTF8

I'm really at my wits' end with this problem, and I really hope someone can help me. I am using PostgreSQL 9.3. My database contains mostly German texts, but not only, so it's encoded in UTF-8. I want to set up a full-text search which supports the German language, nothing special so far.
But the search is behaving really strangely, and I can't find out what I am doing wrong.
So, given the following example table:
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Tabelle »public.test«
Spalte | Typ | Attribute
--------+------+-----------
a | text |
sintext=# \d
Liste der Relationen
Schema | Name | Typ | Eigentümer
--------+---------------------+---------+------------
(...)
public | test | Tabelle | paf
Now, let's have a look at some text search examples:
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Tische is the plural of Tisch (table) and Bäume is the plural of Baum (tree). So, obviously umlauts do not work, while the text search itself performs well.
But what really confuses me is, that a) non-german special characters are matching
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Café');
a
------
Café
and b) if I don't use the German dictionary, there is no problem with umlauts (but of course no real text search either):
select * from test where to_tsvector(a) @@ plainto_tsquery('Bäume');
a
-------------
viele Bäume
So, if I use the German dictionary for text search, just the German special characters do not work? Seriously? What the hell is wrong here? I really can't figure it out, please help!
You're explicitly using the german text search configuration for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default configuration isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)