problems with full-text search in postgres - postgresql

I have the next table, and data:
/* script for people table, with field tsvector and gin */
CREATE TABLE public.people (
id INTEGER,
name VARCHAR(30),
lastname VARCHAR(30),
complete TSVECTOR
)
WITH (oids = false);
CREATE INDEX idx_complete ON public.people
USING gin (complete);
/* data for people table */
INSERT INTO public.people ("id", "name", "lastname", "complete")
VALUES
(1, 'MICHAEL', 'BRYANT BRYANT', '''bryant'':2,3 ''michael'':1'),
(2, 'HENRY STEVEN', 'BUSH TIESSEN', '''bush'':3 ''henri'':1 ''steven'':2 ''tiessen'':4'),
(3, 'WILLINGTON STEVEN', 'STEPHENS FLINN', '''flinn'':4 ''stephen'':3 ''steven'':2 ''willington'':1'),
(4, 'BRET', 'MARTINEZ AROCH', '''aroch'':3 ''bret'':1 ''martinez'':2'),
(5, 'TERENCE BERT', 'CAVALIERE ENRON', '''bert'':2 ''cavalier'':3 ''terenc'':1');
I need retrieve the names and lastnames, according the tsvector field. Actually I have the query:
SELECT * FROM people WHERE complete ## to_tsquery('WILLINGTON & FLINN');
And the result is right (the third record). BUT if I try with
SELECT * FROM people WHERE complete ## to_tsquery('STEVEN & FLINN');
/* the same record! */
I don't have results. Why? What can I do?

You should use the same language to search your table as the values in your field 'complete' where inserted.
Check the result of that query compared english and german:
select * ,
to_tsvector('english', concat_ws(' ', name, lastname )) as english,
to_tsvector('german', concat_ws(' ', name, lastname )) as german
from public.people
so that should work for you :
SELECT * FROM people WHERE complete ## to_tsquery('english','STEVEN & FLINN');

You are probably using a text search configuration where either STEVEN or FLINN are modified by stemming.
I can reproduce this here:
test=> SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.german
(1 row)
test=> SELECT complete FROM public.people WHERE id = 3;
complete
-------------------------------------------------
'flinn':4 'stephen':3 'steven':2 'willington':1
(1 row)
test=> SELECT * FROM ts_debug('STEVEN & FLINN');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+---------------+-------------+---------
asciiword | Word, all ASCII | STEVEN | {german_stem} | german_stem | {stev}
blank | Space symbols | | {} | |
blank | Space symbols | & | {} | |
asciiword | Word, all ASCII | FLINN | {german_stem} | german_stem | {flinn}
(4 rows)
test=> SELECT * FROM public.people
WHERE complete ## to_tsquery('STEVEN & FLINN');
id | name | lastname | complete
----+------+----------+----------
(0 rows)
So you see, the German Snowball dictionary stems STEVEN to stev.
Since complete contains the unstemmed version steven, no match is found.
You should use the same text search configuration when you populate complete and in the query.

Related

What is the seq_name column in the ag_label table?

I'm working on a new feature that involves labels for Apache AGE and I'm looking for tools with which I can work with.
In psql interface, when you input the command SELECT * FROM ag_catalog.ag_label; the following output is shown:
name | graph | id | kind | relation | seq_name
------------------+--------+----+------+-----------------------+-------------------------
_ag_label_vertex | 495486 | 1 | v | test._ag_label_vertex | _ag_label_vertex_id_seq
_ag_label_edge | 495486 | 2 | e | test._ag_label_edge | _ag_label_edge_id_seq
vtx_label | 495486 | 3 | v | test.vtx_label | vtx_label_id_seq
elabel | 495486 | 4 | e | test.elabel | elabel_id_seq
I came across this and wasn't able to figure out what kind of data I can retrieve from it, what is it used for or how can it help me.
Can you explain the seq_name column?
Seq_name refers to sequences. Sequences are single-row tables that can be thought of as 'number generators' that start at some minimum integer value and then increment as they are 'consumed'.
A sequence that is associated with a column can be used to assign values to it. For example, 'mytable_seq_id' associated with column 'id' in a particular table 'mytable' might start at 1. Then as you add more entries to mytable, the 'id' column begins to increment to 2,3 and so on.
Postgres docs on creating sequences:
https://www.postgresql.org/docs/current/sql-createsequence.html
As for AGE, here's a comment taken directly out of the 'graph_commands.c' source file. It describes how sequences are used to generate labels ids.
static Oid create_schema_for_graph(const Name graph_name)
{
char *graph_name_str = NameStr(*graph_name);
CreateSchemaStmt *schema_stmt;
CreateSeqStmt *seq_stmt;
TypeName *integer;
DefElem *data_type;
DefElem *maxvalue;
DefElem *cycle;
Oid nsp_id;
/*
* This is the same with running the following SQL statement.
*
* CREATE SCHEMA `graph_name`
* CREATE SEQUENCE `LABEL_ID_SEQ_NAME`
* AS integer
* MAXVALUE `LABEL_ID_MAX`
* CYCLE
*
* The sequence will be used to assign a unique id to a label in the graph.
*
* schemaname doesn't have to be graph_name but the same name is used so
* that users can find the backed schema for a graph only by its name.
*
* ProcessUtilityContext of this command is PROCESS_UTILITY_SUBCOMMAND
* so the event trigger will not be fired.
*/
Note that sequences are used in other functions in the AGE internals as well, and the above function is just one example.

PostgreSQL Stored procedure performance

I use PostgreSQL and I have a table 'PERSON' in schema 'public' that looks like this:
+----+-------------+-------+----------------------------+
| id | internal_id | name | created |
+----+-------------+-------+----------------------------+
| 1 | P0001-XX00 | Bob | 2021-05-24 22:10:01.93025 |
+----+-------------+-------+----------------------------+
| 2 | P0001-CX00 | Tom | 2021-06-27 22:10:01.93025 |
+----+-------------+-------+----------------------------+
| 3 | P0002-XX00 | Anna | 2021-05-24 22:10:01.93025 |
+----+-------------+-------+----------------------------+
id -> bigint; internal_id -> character varying; name -> character varying; created -> timestamp without timezone
I need to write procedure that delete those records that are older than fixed timestamp, for example: now(). But as soon as such an old record has been found, I need to check if there are other records in the table which are not old yet and with the same first 5 characters in internal_id as the found old record. If there are such records, then I should not delete the old record.
So I wrote the following procedure with plpgsql and it seems to work:
BEGIN
DELETE FROM public."PERSON" AS t1
WHERE t1.created < now()
AND NOT EXISTS
(
SELECT
FROM public."PERSON" AS t2
WHERE left(t2.internal_id, 5) = left(t1.internal_id, 5)
AND t2.created >= now()
);
COMMIT;
END;
Questions:
Could it have been made more correct or prettier or cleaner? Perhaps, instead of using the left() function, it was necessary to use LIKE, or, in principle, to do it somehow differently?
Do you think this procedure has normal performance or it can be improved?
Thank you in advance!
That should work just fine.
For good performance, create an index:
CREATE INDEX ON public."PERSON" (left(internal_id, 5), created);

String Include some other strings

How do I check in postgres that a varchar contains 'aaa' or 'bbb'?
I tried myVarchar IN ('aaa', 'bbb') but, obviously, it's true when myvarchar is exactly equal to 'aaa' or 'bbb'.
for multiple similarity check the best fit in terms of speed and laconic syntax would be
SIMILAR TO '%(aaa|bbb|ccc)%'
you can use ANY & LIKE operators together.
SELECT * FROM "myTable" WHERE "myColumn" LIKE ANY( ARRAY[ '%aaa%', '%bbb%' ] );
Assuming this is your table:
CREATE TABLE t
(
myVarchar varchar
) ;
INSERT INTO t (myVarchar)
VALUES
('something aaa else'),
('also some bbb'),
('maybe ccc') ;
-- (some random data, this query is PostgreSQL specific)
INSERT INTO t (myVarchar)
SELECT
random()::varchar
FROM
generate_series(1, 10000) ;
SQL Standard approach:
You can do (in all SQL standard databases):
SELECT
*
FROM
t
WHERE
myVarchar LIKE '%aaa%' or myVarchar LIKE '%bbb%' ;
and you'll get:
| myvarchar |
| :----------------- |
| something aaa else |
| also some bbb |
PostgreSQL specific approaches
Specifically for PostgreSQL, you can use a (single) regex with multiple values to look for:
SELECT
*
FROM
t
WHERE
myVarchar ~ 'aaa|bbb' ;
| myvarchar |
| :----------------- |
| something aaa else |
| also some bbb |
dbfiddle here
If you need quick finds, you can use trigram indexes, like this:
CREATE EXTENSION pg_trgm; -- Only needed if extension not already installed
CREATE INDEX myVarchar_like_idx
ON t
USING GIST (myVarchar gist_trgm_ops);
... the query using LIKE will be much faster.

postgresql + textsearch + german umlauts + UTF8

I'm really at my wits end, with this Problem, and I really hope someone could help me. I am using a Postgresql 9.3. My Database contains mostly german texts but not only, so it's encoded in utf-8. I want to establish a fulltextsearch wich supports german language, nothing special so far.
But the search is behaving really strange,, and I can't find out what I am doing wrong.
So, given the following table given as example
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Tabelle »public.test«
Spalte | Typ | Attribute
--------+------+-----------
a | text |
sintext=# \d
Liste der Relationen
Schema | Name | Typ | Eigentümer
--------+---------------------+---------+------------
(...)
public | test | Tabelle | paf
Now, lets have a look at some textsearch examples:
select * from test where to_tsvector('german', a) ## plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) ## plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) ## plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Whereas Tische is Plural of Tisch (table) and Bäume is plural of Baum (tree). So, Obviously Umlauts does not work while textsearch perfoms well.
But what really confuses me is, that a) non-german special characters are matching
select * from test where to_tsvector('german', a) ## plainto_tsquery('Café');
a
------
Café
and b) if I don't use the german dictionary, there is no Problem with umlauts (but of course no real textsearch as well)
select * from test where to_tsvector(a) ## plainto_tsquery('Bäume');
a
-------------
viele Bäume
So, if I use the german dictionary for Text-Search, just the german special characters do not work? Seriously? What the hell is wrong here? I Really can't figure it out, please help!
You're explicitly using the German dictionary for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default dictionary isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') ## plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)

Filter an ID Column against a range of values

I have the following SQL:
SELECT ',' + LTRIM(RTRIM(CAST(vessel_is_id as CHAR(2)))) + ',' AS 'Id'
FROM Vessels
WHERE ',' + LTRIM(RTRIM(CAST(vessel_is_id as varCHAR(2)))) + ',' IN (',1,2,3,4,5,6,')
Basically, I want to filter the vessel_is_id against a variable list of integer values (which is passed in as a varchar into the stored proc). Now, the above SQL does not work. I do have rows in the table with a `vessel__is_id' of 1, but they are not returned.
Can someone suggest a better approach to this for me? Or, if the above is OK
EDIT:
Sample data
| vessel_is_id |
| ------------ |
| 1 |
| 2 |
| 5 |
| 3 |
| 1 |
| 1 |
So I want to returned all of the above where vessel_is_id is in a variable filter i.e. '1,3' - which should return 4 records.
Cheers.
Jas.
IF OBJECT_ID(N'dbo.fn_ArrayToTable',N'FN') IS NOT NULL
DROP FUNCTION [dbo].[fn_ArrayToTable]
GO
CREATE FUNCTION [dbo].fn_ArrayToTable (#array VARCHAR(MAX))
-- =============================================
-- Author: Dan Andrews
-- Create date: 04/11/11
-- Description: String to Tabled-Valued Function
--
-- =============================================
RETURNS #output TABLE (data VARCHAR(256))
AS
BEGIN
DECLARE #pointer INT
SET #pointer = CHARINDEX(',', #array)
WHILE #pointer != 0
BEGIN
INSERT INTO #output
SELECT RTRIM(LTRIM(LEFT(#array,#pointer-1)))
SELECT #array = RIGHT(#array, LEN(#array)-#pointer),
#pointer = CHARINDEX(',', #array)
END
RETURN
END
Which you may apply like:
SELECT * FROM dbo.fn_ArrayToTable('2,3,4,5,2,2')
and in your case:
SELECT LTRIM(RTRIM(CAST(vessel_is_id AS CHAR(2)))) AS 'Id'
FROM Vessels
WHERE LTRIM(RTRIM(CAST(vessel_is_id AS VARCHAR(2)))) IN (SELECT data FROM dbo.fn_ArrayToTable('1,2,3,4,5,6')
Since Sql server doesn't have an Array you may want to consider passing in a set of values as an XML type. You can then turn the XML type into a relation and join on it. Drawing on the time-tested pubs database for example. Of course you're client may or may not have an easy time generating the XML for the parameter value, but this approach is safe from sql-injection which most "comma seperated" value approaches are not.
declare #stateSelector xml
set #stateSelector = '<values>
<value>or</value>
<value>ut</value>
<value>tn</value>
</values>'
select * from authors
where state in ( select c.value('.', 'varchar(2)') from #stateSelector.nodes('//value') as t(c))