Simplifying PostgreSQL Full Text Search tsvector and tsquery with aliases

Simplifying PostgreSQL Full Text Search tsvector and tsquery with aliases - postgresql

I am trying to simplify this query as it is going to be dynmaically generated by PHP and I would like to reduce the processing overhead (the real query will be much longer but the structure will be the same!).
SELECT title, type_name, ts_rank_cd(ARRAY[0.1,0.2,0.4,1.0],
setweight(to_tsvector(coalesce(title,'')), 'A')
||
setweight(to_tsvector(coalesce(type_name,'')), 'B')
,
to_tsquery('search & query'))
FROM TestView WHERE
setweight(to_tsvector(coalesce(title,'')), 'D')
||
setweight(to_tsvector(coalesce(type_name,'')), 'B')
##
to_tsquery('search & query');
I am looking to try to reduce the need to specify the tsquery and tsvector twice by defining something like an alias so that it does not have to be specified twice. Something like this (which fails, I am not sure if it is even close to correct!)
SELECT title, type_name, ts_rank_cd(ARRAY[0.1,0.2,0.4,1.0],
searchvector
,
searchquery
FROM TestView WHERE
setweight(to_tsvector(coalesce(title,'')), 'D')
||
setweight(to_tsvector(coalesce(type_name,'')), 'B') AS searchvector
##
to_tsquery('search & query') AS searchquery;
Is this possible or am I just stuck with generating it all twice.
For context 'TestView' is a view generated from a number of tables.
Any help much appreciated!

SELECT title,
type_name,
ts_rank_cd(ARRAY[0.1,0.2,0.4,1.0],weight,query)
FROM (
SELECT title,
type_name,
setweight(to_tsvector(coalesce(title,'')), 'A')
||setweight(to_tsvector(coalesce(type_name,'')), 'B') as weight,
to_tsquery('search & query') as query
FROM TestView
) t
WHERE weight ## query

Related

Syntaxerror, when creating an MATERIALIZED VIEW of tsvectors for fulltextsearch in postgresql

I am trying to implement a full text search, taking spelling mistakes into account.
Therefor, I try to create a MATERIALIZED VIEW of tsvector of all relevant columns.
CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat(
'SELECT to_tsvector('simple', cve.descriptions) ||
to_tsvector('simple', cpeMatch.criteria) ||
to_tsvector('simple', array_to_string(reference.tags, ' '))
FROM cve
JOIN cpeMatch ON cpeMatch.cve_id = cve.id
JOIN reference ON reference.cve_id = cve.id
GROUP BY cve.id');
But when I run this code, I get:
SQL-Fehler [42601]: FEHLER: Syntaxfehler bei »simple«
Position: 92
Saying there is a syntax error at 'simple'.
I have no idea how to resolve this issue.
Just to make clear, I installed pg_trgm but didn't make any configs ore changes.

You need to quote simple but you are already in a quoted string. The easiest is to change the string delimiter:
CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat(
$$SELECT to_tsvector('simple', cve.descriptions) ||
to_tsvector('simple', cpeMatch.criteria) ||
to_tsvector('simple', array_to_string(reference.tags, ' '))
FROM cve
JOIN cpeMatch ON cpeMatch.cve_id = cve.id
JOIN reference ON reference.cve_id = cve.id
GROUP BY cve.id$$);

Postgres full text search and spelling mistakes (aka fuzzy full text search)

I have a scenario, where I have data for informal communications that I need to be able to search. Therefore I want full text search, but I also to make sense of spelling mistakes. Question is how do I take spelling mistakes into account in order to be able to do fuzzy full text search??
This is very briefly discussed in Postgres Full Text Search is Good Enough where the article discusses misspelling.
So I have built a table of "documents", created indexes etc.
CREATE TABLE data (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
text TEXT NOT NULL);
I can create an additional column of type tsvector and index accordingly...
alter table data
add column search_index tsvector
generated always as (to_tsvector('english', coalesce(text, '')))
STORED;
create index search_index_idx on data using gin (search_index);
I have for example, some text where the data says "baloon", but someone may search "balloon", so I insert two rows (one deliberately misspelled)...
insert into data (text) values ('baloon');
insert into data (text) values ('balloon');
select * from data;
id | text | search_index
----+---------+--------------
1 | baloon | 'baloon':1
2 | balloon | 'balloon':1
... and perform full text searches against the data...
select * from data where search_index ## plainto_tsquery('balloon');
id | text | search_index
----+---------+--------------
2 | balloon | 'balloon':1
(1 row)
But I don't get back results for the misspelled version "baloon"... So using the suggestion in the linked article I've built a lookup table of all the words in my lexicon as follows...
"you may obtain good results by appending the similar lexeme to your tsquery"
CREATE TABLE data_words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', text) FROM data');
CREATE INDEX data_words_idx ON data_words USING GIN (word gin_trgm_ops);
... and I can search for similar words which may have been misspelled
select word, similarity(word, 'balloon') as similarity from data_words where similarity(word, 'balloon') > 0.4 order by similarity(word, 'balloon');
word | similarity
---------+------------
baloon | 0.6666667
balloon | 1
... but how do I actually include misspelled words in my query?
Isn't this what the article above means?
select plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
plainto_tsquery
----------------------------------
'balloon' & 'baloon' & 'balloon'
(1 row)
... plugged into an actual search, and I get no rows!
select * from data where text ## plainto_tsquery('balloon' || ' ' || (select string_agg(word, ' ') from data_words where similarity(word, 'balloon') > 0.4));
select * from data where search_index ## phraseto_tsquery('baloon balloon'); -- no rows returned
I'm not sure where I'm going wrong here - can any shed any light? I feel like I'm super close to getting this going...?

SELECT to_tsquery('balloon |' ||
string_agg(word, ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;

For anyone looking at this thread, the accepted answer by #laurenz-albe needed a slight modification for me:
It required single quotes around the argument values passed to the string_agg function, which can be done using the format function along with the %L placeholder.
This updated code worked for me:
SELECT to_tsquery('balloon |' ||
string_agg(format('%L', word), ' | ')
)
FROM data_words
WHERE similarity(word, 'balloon') > 0.4;

Full-Text Search producing no results

I have the following View:
CREATE VIEW public.profiles_search AS
SELECT
profiles.id,
profiles.bio,
profiles.title,
(
setweight(to_tsvector(profiles.search_language::regconfig, profiles.title::text), 'B'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, profiles.bio), 'A'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, profiles.category::text), 'B'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, array_to_string(profiles.tags, ',', '*')), 'C'::"char")
) AS document
FROM profiles
GROUP BY profiles.id;
However, if profiles.tags is empty then document is empty, even if the rest of the fields (title, bio, and category) contain data.
Is there some way to make the make that field optional such that it having empty data doesn't result in an an empty document?

This seems to be the common string concatenation issue - concatenating a NULL value makes the whole result NULL.
Here it is suggested you should always provide a default value for any input with coalesce():
UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
setweight(to_tsvector(coalesce(body,'')), 'D');
If you do not want to provide default values for complex datatypes (like coalesce(profiles.tags, ARRAY[]::text[]) as suggested by #approxiblue), I suspect you could simply do:
CREATE VIEW public.profiles_search AS
SELECT
profiles.id,
profiles.bio,
profiles.title,
(
setweight(to_tsvector(profiles.search_language::regconfig, profiles.title::text), 'B'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, profiles.bio), 'A'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, profiles.category::text), 'B'::"char") ||
setweight(to_tsvector(profiles.search_language::regconfig, coalesce(array_to_string(profiles.tags, ',', '*'), '')), 'C'::"char")
) AS document
FROM profiles
GROUP BY profiles.id;

understanding complex SP in DB2

I need to make changes to an SP which has a bunch of complex XML functions and what not
Declare ResultCsr2 Cursor For
WITH
MDI_BOM_COMP(PROD_ID,SITE_ID, xml ) AS (
SELECT TC401F.T41PID,TC401F.T41SID,
XMLSERIALIZE(
XMLAGG(
XMLELEMENT( NAME "MDI_BOM_COMP",
XMLFOREST(
trim(TC401F.T41CTY) AS COMPONENT_TYPE,
TC401F.T41LNO AS COMP_NUM,
trim(TC401F.T41CTO) AS CTRY_OF_ORIGIN,
trim(TC401F.T41DSC) AS DESCRIPTION,
TC401F.T41EFR AS EFFECTIVE_FROM,
TC401F.T41EFT AS EFFECTIVE_TO,
trim(TC401F.T41MID) AS MANUFACTURER_ID,
trim(TC401F.T41MOC) AS MANUFACTURER_ORG_CODE,
trim(TC401F.T41CNO) AS PROD_ID,
trim(TC401F.T41POC) AS PROD_ORG_CODE,
TC401F.T41QPR AS QTY_PER,
trim(TC401F.T41SBI) AS SUB_BOM_ID,
trim(TC401F.T41SBO) AS SUB_BOM_ORG_CODE, --ADB01
trim(TC401F.T41VID) AS SUPPLIER_ID,
trim(TC401F.T41SOC) AS SUPPLIER_ORG_CODE,
TC401F.T41UCT AS UNIT_COST
)
)
) AS CLOB(1M)
)
FROM TC401F TC401F
GROUP BY T41PID,T41SID
)
SELECT
RowNum, '<BOM_INBOUND>' ||
XMLSERIALIZE (
XMLELEMENT(NAME "INTEGRATION_MESSAGE_CONTROL",
XMLFOREST(
'FULL_UPDATE' as ACTION,
'POLARIS' as COMPANY_CODE,
TRIM(TC400F.T40OCD) as ORG_CODE,
'5' as PRIORITY,
'INBOUND_ENTITY_INTEGRATION' as MESSAGE_TYPE,
'POLARIS_INTEGRATION' as USERID,
'TA' as RECEIVER,
HEX(Generate_Unique()) as SOURCE_SYSTEM_TOKEN
),
XMLELEMENT(NAME "BUS_KEY",
XMLFOREST(
TRIM(TC400F.T40BID) as BOM_ID,
TRIM(TC400F.T40OCD) as ORG_CODE
)
)
) AS VARCHAR(1000)
)
|| '<MDI_BOM>' ||
XMLSERIALIZE (
XMLFOREST(
TRIM(TC400F.T40ATP) AS ASSEMBLY_TYPE,
TRIM(TC400F.T40BID) AS BOM_ID,
TRIM(TC400F.T40CCD) AS CURRENCY_CODE,
TC400F.T40DPC AS DIRECT_PROCESSING_COST,
TC400F.T40EFD AS EFFECTIVE_FROM,
TC400F.T40EFT AS EFFECTIVE_TO,
TRIM(TC400F.T40MID) AS MANUFACTURER_ID,
TRIM(TC400F.T40MOC) AS MANUFACTURER_ORG_CODE,
TRIM(TC400F.T40OCD) AS ORG_CODE,
TRIM(TC400F.T40PRF) AS PROD_FAMILY,
TRIM(TC400F.T40PID) AS PROD_ID,
TRIM(TC400F.T40POC) AS PROD_ORG_CODE,
TRIM(TC400F.T40ISA) AS IS_ACTIVE,
TRIM(TC400F.T40VID) AS SUPPLIER_ID,
TRIM(TC400F.T40SOC) AS SUPPLIER_ORG_CODE,
TRIM(TC400F.T40PSF) AS PROD_SUB_FAMILY,
CASE TRIM(TC400F.T40PML)
WHEN '' THEN TRIM(TC400F.T40PML)
ELSE TRIM(TC400F.T40PML) || '~' || TRIM(TC403F.T43MDD)
END AS PROD_MODEL
) AS VARCHAR(3000)
)
|| IFNULL(MBC.xml, '') ||
XMLSERIALIZE (
XMLFOREST(
XMLFOREST(
TRIM(TC400F.T40CCD) AS CURRENCY_CODE,
TC400F.T40PRI AS PRICE,
TRIM(TC400F.T40PTY) AS PRICE_TYPE
) AS MDI_BOM_PRICE,
XMLFOREST(
TRIM(TC400F.T40CCD) AS CURRENCY_CODE,
TRIM(TC400F.T40PRI) AS PRICE,
'TRANSACTION_VALUE' AS PRICE_TYPE
) AS MDI_BOM_PRICE,
XMLFOREST(
TRIM(TC400F.T40INA) AS INCLUDE_IN_AVERAGING
) AS MDI_BOM_IMPL_BOM_PROD_FAMILY_AUTOMOBILES
) AS VARCHAR(3000)
)
|| '</MDI_BOM>' ||
'</BOM_INBOUND>' XML
FROM (
SELECT
ROW_NUMBER() OVER (
ORDER BY T40STS
,T40SID
,T40BID
) AS RowNum
,t.*
FROM TC400F t
) TC400F
LEFT OUTER JOIN MDI_BOM_COMP MBC
ON TC400F.T40SID = MBC.SITE_ID
AND TC400F.T40PID = MBC.PROD_ID
LEFT OUTER JOIN TC403F TC403F
ON TC400F.T40PML <> ''
AND TC400F.T40PML = TC403F.T43MDL
WHERE TC400F.T40STS = '10'
AND TC400F.RowNUM BETWEEN
(P_STARTROW + (P_PAGENOS - 1) * P_NBROFRCDS)
AND (P_STARTROW + (P_PAGENOS - 1) * P_NBROFRCDS +
P_NBROFRCDS - 1);
Given above is a cursor declaration in the SP code which I am struggling to understand. The very first WITH itself seems to be mysterious. I have used it along with temporary table names but this is the first time, Im seeing something of this sort which seems to be an SP or UDF? Can someone please guide me on how to understand and make sense out of all this?
Adding further to the question, the actual requirement here is to arrange the data in the XML such a way that that those records which have TC401F.T41SBI field populated should appear in the beginning of the XML output..
This field is being selected as below in the code:
trim(TC401F.T41SBI) AS SUB_BOM_ID. If this field is non-blank, this should appear first in the XML and any records with this field value Blank should appear only after. What would be the best approach to do this? Using ORDER BY in any way does not really seem to help as the XML is actually created through some functions and ordering by does not affect how the items are arranged within the XML. One approach I could think of was using a where clause where TC401F.T41SBI <> '' first then append those records where TC401F.T41SBI = ''

Best I can do is help with the CTE.
WITH
MDI_BOM_COMP(PROD_ID,SITE_ID, xml ) AS (
SELECT TC401F.T41PID,TC401F.T41SID,
This just generates a table named MDI_BOM_COMP with three columns named PROD_ID, SITE_ID, and XML. The table will have one record for each PROD_ID, SITE_ID, and the contents of XML will be an XML snippet with all the components for that product and site.
Now the XML part can be a bit confusing, but if we break it down into it's scalar and aggregate components, we can make it a bit more understandable.
First ignore the grouping. so the CTE retrieves each row in TC401F. XMLELEMENT and XMLFORREST are scalar functions. XMLELEMENT creates a single XML element The tag is the first parameter, and the content of the element is the second in the above example. XMLFORREST is like a bunch of XMLELEMENTs concatenated together.
XMLSERIALIZE(
XMLAGG(
XMLELEMENT( NAME "MDI_BOM_COMP",
XMLFOREST(
trim(TC401F.T41CTY) AS COMPONENT_TYPE,
TC401F.T41LNO AS COMP_NUM,
trim(TC401F.T41CTO) AS CTRY_OF_ORIGIN,
trim(TC401F.T41DSC) AS DESCRIPTION,
TC401F.T41EFR AS EFFECTIVE_FROM,
TC401F.T41EFT AS EFFECTIVE_TO,
trim(TC401F.T41MID) AS MANUFACTURER_ID,
trim(TC401F.T41MOC) AS MANUFACTURER_ORG_CODE,
trim(TC401F.T41CNO) AS PROD_ID,
trim(TC401F.T41POC) AS PROD_ORG_CODE,
TC401F.T41QPR AS QTY_PER,
trim(TC401F.T41SBI) AS SUB_BOM_ID,
trim(TC401F.T41SBO) AS SUB_BOM_ORG_CODE, --ADB01
trim(TC401F.T41VID) AS SUPPLIER_ID,
trim(TC401F.T41SOC) AS SUPPLIER_ORG_CODE,
TC401F.T41UCT AS UNIT_COST
)
)
) AS CLOB(1M)
So in the example, for each row in the table, XMLFORREST creates a list of XML elements, one for each of COMPONENT_TYPE, COMP_NUM, CTRY_OF_ORIGIN, etc. These elements form the content of another XML element MDI_BOM_COMP which is created by XMLELEMENT.
Now for each row in the table we have selected PROD_ID, SITE_ID, and created some XML. Next we group by PROD_ID, and SITE_ID. The aggregation function XMLAGG will collect all the XML for each PROD_ID and SITE_ID, and concatenate it together.
Finally XMLSERIALIZE will convert the internal XML representation to the string format we all know and love ;)

I think I found the answer for my requirement. I had to add an order by field name after XMLELEMENT function

Postgres reverse LIKE lookup indexing and performance

We have a musicians table containing records with multiple string fields, say:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Sting", "", "Bass"
"Ringo", "Starr", "Drums"
"Paul", "McCartney", "Bass"
I want to pass postgres a long string, say:
"It is known that Jimi liked to set light to his guitar and smash up
all the drums while on stage."
and i want to get returned the fields that have any matches - preferably in order of the most matches first:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Ringo", "Starr", "Drums"
because i need the search to be case insensitive, i'm constructing a query like this...
select * from musicians where lowercase_string like '%'||firstname||'%' or lowercase_string like '%'||lastname||'%' or lowercase_string like '%'||instrument||'%'
and then looping through (in ruby in my case) to capture the result with the most matches.
this is however very slow in the sql stage (1 minute+).
i've tried adding lower-case GIN index using pg_trgm as suggested here - but it's not helping - presumably because the like query is back to front?
Thanks!

With my testing, it seems that no trigram index could help your query at all. And no other index type could possibly speed up an (I)LIKE / FTS based search.
I should mention that all of the queries below use the trigram indexes, when they are queried "reversed": when the table contains the document (which is indexed), and your parameter is the query. The (I)LIKE variant variant f.ex. 2-3 times faster with it.
These the queries I've tested:
select *
from musicians
where :input_string ilike '%' || firstname || '%'
or :input_string ilike '%' || lastname || '%'
or :input_string ilike '%' || instrument || '%'
At first, FTS seemed a great idea, but my testing shows that even without ranking, it is 60-100 times slower than the (I)LIKE variant. (So even, when you don't have to post-process results with these methods, these are not worth it).
select *
from musicians
where to_tsvector(:input_string) ## (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
However, ORDER BY rank doesn't slow down that much further: it is 70-120 times slower than the (I)LIKE variant.
select *
from musicians
where to_tsvector(:input_string) ## (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
order by ts_rank(to_tsvector(:input_string), plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
Then, for a last effort, I tried the (fairly new) "word similarity" operators of the trigram module: <% and %> (available from PostgreSQL 9.6).
select *
from musicians
where :input_string %> firstname
or :input_string %> lastname
or :input_string %> instrument
select *
from musicians
where firstname <% :input_string
or lastname <% :input_string
or instrument <% :input_string
These were somewhat faster then FTS: around 50-70 times slower than the (I)LIKE variant.
(Partially working) rextester: it is run against PostgreSQL 9.5, so the 9.6 operators obviously won't run here.
Update: IF full word match is enough for you, you can actually reverse your query, to be able to use indexes. You'll need to "parse" your query (aka. "long string") though:
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(ls, '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where firstname ilike word
or lastname ilike word
or instrument ilike word
group by musicians.id
Note: I parsed the query for every complete word. You can have some other logic there, or it can even be parsed in client side.
The default, btree index shines here, as it is much faster than the trigram index with (I)LIKE (we won't need them anyway, as we are looking for complete word match here):
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where lower(firstname) = word
or lower(lastname) = word
or lower(instrument) = word
group by musicians.id
http://rextester.com/PSABJ6745
You could even get the match count with something like
sum((lower(firstname) = word)::int
+ (lower(lastname) = word)::int
+ (lower(instrument) = word)::int)

The ilike option with match ordering:
with long_string (ls) as (values
('It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage.')
)
select musicians.*, matches
from
musicians
cross join
long_string
cross join lateral
(select
(ls ilike format ('%%%s%%', first_name) and first_name != '')::int +
(ls ilike format ('%%%s%%', last_name) and last_name != '')::int +
(ls ilike format ('%%%s%%', instrument) and instrument != '')::int
as matches
) m
where matches > 0
order by matches desc
;
first_name | last_name | instrument | matches
------------+-----------+------------+---------
Jimi | Hendrix | Guitar | 2
Phil | Collins | Drums | 1
Ringo | Starr | Drums | 1

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Simplifying PostgreSQL Full Text Search tsvector and tsquery with aliases - postgresql

SELECT title, type_name, ts_rank_cd(ARRAY[0.1,0.2,0.4,1.0],weight,query) FROM ( SELECT title, type_name, setweight(to_tsvector(coalesce(title,'')), 'A') ||setweight(to_tsvector(coalesce(type_name,'')), 'B') as weight, to_tsquery('search & query') as query FROM TestView ) t WHERE weight ## query

Related

Syntaxerror, when creating an MATERIALIZED VIEW of tsvectors for fulltextsearch in postgresql

Postgres full text search and spelling mistakes (aka fuzzy full text search)

Full-Text Search producing no results

understanding complex SP in DB2

Postgres reverse LIKE lookup indexing and performance

Categories

Resources