PostgreSQL full text search abbreviations - postgresql

I created a PostgreSQL full text search using 'german'. How can I configure it so that a search for "Bezirk" also matches lines containing "Bez." (and vice versa)?

@pozs is right. You need to use a synonym dictionary.
1 - In the directory $SHAREDIR/tsearch_data create the file german.syn with the following contents:
Bez Bezirk
2 - Execute the query:
CREATE TEXT SEARCH DICTIONARY german_syn (
template = synonym,
synonyms = german);
CREATE TEXT SEARCH CONFIGURATION german_syn(COPY='simple');
ALTER TEXT SEARCH CONFIGURATION german_syn
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_syn, german_stem;
Now you can test it. Execute queries:
test=# SELECT to_tsvector('german_syn', 'Bezirk') @@ to_tsquery('german_syn', 'Bezirk & Bez');
?column?
----------
t
(1 row)
test=# SELECT to_tsvector('german_syn', 'Bez Bez.') @@ to_tsquery('german_syn', 'Bezirk');
?column?
----------
t
(1 row)
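To make the lookup order easier to picture, here is a minimal Python sketch of a dictionary chain. It is a hypothetical emulation, not PostgreSQL's implementation: the names load_synonyms and normalize are made up for illustration, and plain lowercasing stands in for german_stem, which would also stem the word. It only shows that the first dictionary that recognizes a token decides the lexeme, which is why "Bez" and "Bezirk" end up as the same lexeme.

```python
# Hypothetical emulation of a text search dictionary chain (not PostgreSQL code).
SYN_FILE = "Bez Bezirk\n"  # same content as the german.syn file above


def load_synonyms(text):
    """Parse 'word synonym' lines into a lowercase lookup table."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            table[parts[0].lower()] = parts[1].lower()
    return table


def normalize(token, synonyms):
    """Return the lexeme from the first dictionary that recognizes the token."""
    key = token.lower()
    if key in synonyms:
        return synonyms[key]  # synonym dictionary recognized the token
    return key  # stand-in for german_stem (assumption: no real stemming here)


synonyms = load_synonyms(SYN_FILE)
print(normalize("Bez", synonyms))     # bezirk
print(normalize("Bezirk", synonyms))  # bezirk
```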
Additional links:
PostgreSQL: A Full Text Search engine (expired)

Try using a wildcard in your search.
For example:
tableName.column LIKE 'Bez%'
The % wildcard matches any sequence of zero or more characters after Bez, so this finds every value that starts with "Bez".

The description is too vague to tell exactly what you are trying to achieve, but it looks like you need a simple pattern-matching search, since you are looking for abbreviations (so there is no need for stemming as in Full Text Search). I would go with pg_trgm for this purpose:
WITH t(word) AS ( VALUES
('Bez'),
('Bezi'),
('Bezir')
)
SELECT word, similarity(word, 'Bezirk') AS similarity
FROM t
WHERE word % 'Bezirk'
ORDER BY similarity DESC;
Result:
word | similarity
-------+------------
Bezir | 0.625
Bezi | 0.5
Bez | 0.375
(3 rows)
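The scores above can be reproduced by hand. The following Python sketch assumes the scheme described in the pg_trgm documentation for a single word: lowercase it, pad it with two leading spaces and one trailing space, collect the set of three-character substrings, and divide the number of shared trigrams by the number of distinct trigrams in both sets. The function names are illustrative only, and multi-word strings are not handled.

```python
def trigrams(word):
    """Set of trigrams for one word, padded as pg_trgm is documented to do."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


for word in ("Bezir", "Bezi", "Bez"):
    print(word, similarity(word, "Bezirk"))
# Bezir 0.625
# Bezi 0.5
# Bez 0.375
```

All three values are above 0.3, the documented default similarity threshold for the % operator, which is why every row passes the WHERE clause.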

Related

How to remove multiple characters between 2 special characters in a column in SSIS expression

I want to remove the multiple characters starting from '#' till the ';' in derived column expression in SSIS.
For example,
my input column values are,
and want the output as,
Note: Length after '#' is not fixed.
Already tried in SQL but want to do it via SSIS derived column expression.
First of all: Please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code as I do it here for you.
And just to mention this: If you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching, so we would have to use a recursive / procedural / looping approach. But, if performance is not so important, you might use an XML hack.
--DDL and INSERT
DECLARE @tbl TABLE(ID INT IDENTITY,YourString VARCHAR(1000));
INSERT INTO @tbl VALUES('Here is one without')
,('One#some comment;in here')
,('Two comments#some comment;in here#here is the second;and some more text')
--The query
SELECT t.ID
,t.YourString
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) SeeTheIntermediateXML
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') CleanedValue
FROM @tbl t
The result
+----+-------------------------------------------------------------------------+-----------------------------------------+
| ID | YourString | CleanedValue |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 1 | Here is one without | Here is one without |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 2 | One#some comment;in here | One in here |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 3 | Two comments#some comment;in here#here is the second;and some more text | Two comments in here and some more text |
+----+-------------------------------------------------------------------------+-----------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
Reading this XML with .value() the content will be returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.
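To cross-check the expected CleanedValue column outside SQL Server, here is a small Python sketch. Note that it uses a different technique than the XML hack: a regular expression that replaces everything from '#' up to the next ';' with a single space. The function name strip_marked is made up for illustration. Unlike the XML approach, a ';' with no preceding '#' is simply left alone.

```python
import re


def strip_marked(text):
    """Replace each '#...;' run (shortest match up to the next ';') with a space."""
    return re.sub(r"#[^;]*;", " ", text)


samples = [
    "Here is one without",
    "One#some comment;in here",
    "Two comments#some comment;in here#here is the second;and some more text",
]
for sample in samples:
    print(strip_marked(sample))
# Here is one without
# One in here
# Two comments in here and some more text
```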

Why Doesn't This Full Text Search Match in PostgreSQL?

I'm evaluating PostgreSQL to see if it is a viable alternative for ElasticSearch to begin with (migrating later is fine). I've been reading that PG full text capability is now 'good enough'. I'm running version 11.
Why doesn't this detect a match? I thought stemming would have easily detected different forms of the word "big":
SELECT to_tsvector('english', 'bigger') @@ to_tsquery('english', 'big')
Am I using the wrong configuration?
You can also reuse the scripts english.sh and english.sql from https://dba.stackexchange.com/questions/57058/how-do-i-use-an-ispell-dictionary-with-postgres-text-search.
I have modified in the generated dictionaries:
in english.affix I have added the IG > GER rule:
flag *R:
E > R # As in skate > skater
[^AEIOU]Y > -Y,IER # As in multiply > multiplier
[AEIOU]Y > ER # As in convey > conveyer
[^EY] > ER # As in build > builder
IG > GER # For big > bigger
in english.dict I have modified
big/PY
to
big/PYR
After running english.sql for the current database (you need to modify database name in the script):
postgres=# select ts_debug('english bigger');
select ts_debug('english bigger');
ts_debug
----------------------------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",english,"{english_ispell,english_stem}",english_ispell,{english})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",bigger,"{english_ispell,english_stem}",english_ispell,"{bigger,big}")
(3 rows)
postgres=# SELECT to_tsvector('english bigger') @@ to_tsquery('english', 'big');
SELECT to_tsvector('english bigger') @@ to_tsquery('english', 'big');
?column?
----------
t
(1 row)
Looks like I need to install an ispell dictionary as the English dictionary doesn't do this by default.
https://www.postgresql.org/docs/current/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
Also see this answer: https://stackoverflow.com/a/61213187/148390

Postgres Full-Text Search with Hyphen and Numerals

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.
SELECT to_tsvector('english', 'abc-xyz');
returns
'abc':2 'abc-xyz':1 'xyz':3
However,
SELECT to_tsvector('english', 'abc-001');
returns
'-001':2 'abc':1
Why not something like this?
'abc':2 'abc-001':1 '001':3
And what should I do to be able to search by the numeric portion alone, without the hyphen?
Seems the text search parser identifies the hyphen followed by digits to be the sign of a signed integer. Debug with ts_debug():
SELECT * FROM ts_debug('english', 'abc-001');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | abc | {simple} | simple | {abc}
int | Signed integer | -001 | {simple} | simple | {-001}
Other text search configurations (like 'simple' instead of 'english') won't help as the parser itself is "at fault" here (debatable).
A simple way around it (other than modifying the parser, which I never tried) would be to pre-process strings and replace hyphens with an em dash (—) or just blanks, to make sure those are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)
SELECT to_tsvector('english', translate('abc-001', '-', '—'))
@@ to_tsquery ('english', '001'); -- true now
This can be circumvented with the absval option of the dict_int module, available since PostgreSQL 13. See the official documentation.
But in case you're stuck with an earlier PG version, here's the generalized version of a "number or negative number" workaround in a query.
select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
'''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;
This results in:
'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )
It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery cast to text.
Note that it handles prefix matching numeric lexemes as well.
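The rewrite can be tried without a database. The Python sketch below applies the same pattern and replacement with the re module to the text form of the tsquery; the function name add_negative_alternatives is made up for illustration. The spacing differs slightly from the result shown above because PostgreSQL re-normalizes the text when it is cast back to tsquery, which this sketch does not do.

```python
import re

# A quote, then digits/dots, the closing quote, and an optional ':*' prefix marker.
NUMERIC_LEXEME = re.compile(r"'([.\d]+'(:\*)?)")


def add_negative_alternatives(tsquery_text):
    """Turn each numeric lexeme into a '(number | -number)' subquery."""
    return NUMERIC_LEXEME.sub(r"('\1 | '-\1)", tsquery_text)


print(add_negative_alternatives("'test' & '1':* & '2'"))
# 'test' & ('1':* | '-1':*) & ('2' | '-2')
```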

Can I disable dictionary in postgres ts_vector / ts_query full text search?

I need to do a text search on machine language. If I use any of the available text search dictionaries, the ts_vectors get messed up.
For example, move becomes mov, and my search fails.
Any idea how to index non-lingual words?
Thanks!
Have you tried the simple dictionary with an empty stop word file?
Create an empty stop word file $(pg_config --sharedir)/tsearch_data/empty.stop and run:
CREATE TEXT SEARCH DICTIONARY machine (
TEMPLATE = pg_catalog.simple,
STOPWORDS = empty
);
CREATE TEXT SEARCH CONFIGURATION machine (
PARSER = default
);
ALTER TEXT SEARCH CONFIGURATION machine
ADD MAPPING FOR asciiword, word, numword, asciihword, hword,
numhword, hword_asciipart, hword_part,
hword_numpart, email, protocol, url, host,
url_path, file, sfloat, float, int, uint,
version, tag, entity, blank
WITH machine;
Then you can get:
test=> SELECT * FROM ts_debug('machine', 'move');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | move | {machine} | machine | {move}
(1 row)
If you want this configuration by default (so you don't have to specify 'machine' all the time), change the parameter default_text_search_config appropriately.

postgres text search - numhword matching and order

The postgres documentation here explains the numhword token type as one that matches "Hyphenated word, letters and digits". The example they give for this is postgres-beta1, and this matches nicely. However, something such as postgres-9-beta1 does not match, and I can't seem to find a default parser that will work with this. SQL below.
Is my best choice something that parsed just on spaces? Is there such a default parser? (It seems test_parser doesn't ship with 9.5 anymore...)
I want to tokenize hyphenated alphanumerics. Am I stuck with regular expressions for the time being, or is there a straightforward way to create a custom parser (without dropping down into C)?
CREATE TEXT SEARCH DICTIONARY simple_nostem_no_stop (TEMPLATE = pg_catalog.simple);
CREATE TEXT SEARCH CONFIGURATION test_id_search ( COPY = pg_catalog.simple );
alter text search configuration test_id_search
drop mapping for asciihword, asciiword, email, file, float, host, hword, hword_asciipart, hword_numpart, hword_part, int, numhword, numword, sfloat, uint, url, url_path, version, word ;
ALTER TEXT SEARCH CONFIGURATION test_id_search
ALTER MAPPING FOR numhword WITH simple_nostem_no_stop;
\dF+ test_id_search
Text search configuration "public.test_id_search"
Parser: "pg_catalog.default"
Token | Dictionaries
---------+-----------------------
numhword | simple_nostem_no_stop
/* This works as i hoped, per the docs: */
test_db=# select to_tsvector('test_id_search', ' postgresql-beta1 ') ;
to_tsvector
----------------------
'postgresql-beta1':1
(1 row)
/* This doesn't seem to work? */
test_db=# select to_tsvector('test_id_search', ' postgresql-9-beta1 ') ;
to_tsvector
-------------
(1 row)