PostgreSQL exact match using full text ts_query

I want to match all exact matches, including:
- Exact match
- Plurals
- Misspellings
Table data:
natural loofah (YES - Exact match)
natural loofahs (YES - Exact match with plural)
natural lofah (YES - Exact match with misspelling)
Loofah natural (NO)
all natural loofah (NO - it's not an exact match)
I tried this, but it's not working:
SELECT
query
FROM reports
WHERE to_tsvector('english', query) @@ websearch_to_tsquery('english', 'natural loofah')
GROUP BY query

No, this won't work. You have misunderstood what full-text search does: it searches for whole words (optionally prefixes), ignoring inflection.
Full-text search does not search for similarity.
Maybe you'd be better off with a trigram index:
CREATE EXTENSION pg_trgm;
CREATE INDEX ON reports USING gin (query gin_trgm_ops);
SELECT query
FROM reports
WHERE query % 'natural loofah';
Here % is the similarity operator.
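To get a feel for a sensible threshold, you could inspect the similarity scores directly; here is a rough sketch against the question's reports.query column (similarity() and set_limit() ship with pg_trgm). Note that trigram similarity ignores word order, so 'Loofah natural' may also score highly:
-- Inspect raw scores first; pick a threshold that keeps the plural/misspelled
-- variants but drops the unrelated rows.
SELECT query, similarity(query, 'natural loofah') AS sml
FROM   reports
ORDER  BY sml DESC;

-- The % operator uses the current similarity threshold (default 0.3).
SELECT set_limit(0.5);   -- or, in newer versions: SET pg_trgm.similarity_threshold = 0.5;

SELECT query
FROM   reports
WHERE  query % 'natural loofah';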

As an alternative to trigram search, Postgres also has a text search configuration called 'simple' that leaves stopwords in and does not stem the search terms. This lets you do exact phrase matches against your query document; just be aware of what you lose by building a search index with 'simple'. (A sketch applying this to the question's data follows the examples below.)
# select * from websearch_to_tsquery('english', '"eats, shoots, and leaves"');
websearch_to_tsquery
------------------------------
'eat' <-> 'shoot' <2> 'leav'
(1 row)
# select * from websearch_to_tsquery('simple', '"eats, shoots, and leaves"');
websearch_to_tsquery
--------------------------------------------
'eats' <-> 'shoots' <-> 'and' <-> 'leaves'
(1 row)
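Applied to the question's reports.query column, that could look roughly like this (a sketch; note that 'simple' keeps exact word forms, so it will not catch plurals or misspellings by itself):
SELECT query
FROM   reports
WHERE  to_tsvector('simple', query) @@ phraseto_tsquery('simple', 'natural loofah');
-- matches 'natural loofah' and also 'all natural loofah' (the phrase occurs inside it),
-- but not 'Loofah natural'; 'natural loofahs' and 'natural lofah' are not matched.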

Related

Postgresql: how to set a weight for tsquery

How do I set a weight for a tsquery? I need to set a weight for a tsquery obtained from plainto_tsquery.
Is it possible? Something like setweight(plainto_tsquery(''), 'A'), but setweight works only for tsvector.
I have this problem too. My use case is large documents, many sections, and I wish to provide an option for "search heading text only". (Headings have weight A and are scattered throughout the document; other sections have weight B, C or D depending upon where they occur.)
Here are two solutions that should help.
Solution 1: setweight function for tsquery
The function converts the tsquery to text, applies a regular expression to set the weights, then converts back to tsquery.
CREATE FUNCTION setweight(query tsquery, weights text) RETURNS tsquery AS $$
  SELECT regexp_replace(
            query::text,
            '(?<=[^ !])'':?(\*?)A?B?C?D?', ''':\1'||weights,
            'g'
         )::tsquery;
$$ LANGUAGE SQL IMMUTABLE;
Example:
select setweight( plainto_tsquery('fat cats and rats'), 'A' );
-- 'fat':A & 'cat':A & 'rat':A
select setweight( phraseto_tsquery('fat cats and rats'), 'A' );
-- 'fat':A <-> 'cat':A <2> 'rat':A
select setweight( to_tsquery('fat & (cat:A & rat) & !dog:*CD'), 'BC' );
-- 'fat':BC & 'cat':BC & 'rat':BC & !'dog':*BC
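As a usage sketch: assuming a document tsvector column built with weighted sections (here called fulltext, as in Solution 2 below), the function lets the "search heading text only" case be expressed directly, because a weight-restricted tsquery term only matches positions that carry that weight:
-- Only hits in weight-A (heading) text will match.
SELECT *
FROM   your_table
WHERE  fulltext @@ setweight(plainto_tsquery('english', 'fat cats'), 'A');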
Solution 2: Functional index based on filtered tsvector
First, create additional indexes on the fulltext column you'll be searching on, e.g.:
CREATE INDEX fulltext_idx
  ON your_table USING gin
  (fulltext);
CREATE INDEX fulltext_idx_A
  ON your_table USING gin
  (ts_filter(fulltext, '{a}'));
CREATE INDEX fulltext_idx_AB
  ON your_table USING gin
  (ts_filter(fulltext, '{a,b}'));
For whatever combination of weights you need.
Then, when searching, use the filtered expression. e.g.:
SELECT *
FROM your_table
WHERE ts_filter(fulltext, '{a}') @@ plainto_tsquery('your query')
The search should take place on the indexed expression.
Discussion
Solution 1 gives you the function you're looking for, but the problem with weighted queries is that although Postgres will use the index to find candidate matches, it still needs to pull back each document to check the weights.
In my case, when searching by titles only, Solution 2 appears to give better performance. The text within titles (weight A) uses a much smaller vocabulary than the whole document, so fulltext_idx_A is considerably smaller than fulltext_idx and the results don't need to be rechecked after matching.
For your own case, performance will depend entirely on your document structure and the nature of your queries, so test with EXPLAIN ANALYZE to select the better solution. Given the age of the question, mind you, I assume you've solved this one already :-)
Note: ts_filter() and phraseto_tsquery() are from Postgres 9.6.
Here is a good article about Postgres full text search:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
You can also set a weight when building the tsvector:
setweight(to_tsvector(coalesce($columnName, '')), '$weight')
where the column name is something like users.name (table.column) and the weight is the one you want, e.g. A, B or C.
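For completeness, a minimal sketch of how such a weighted vector is typically assembled and then ranked (the title/body column names are assumptions, not from the question):
-- Headings get weight A, the rest weight B; coalesce() guards against NULLs.
UPDATE your_table
SET    fulltext = setweight(to_tsvector('english', coalesce(title, '')), 'A')
                || setweight(to_tsvector('english', coalesce(body,  '')), 'B');

-- Weighted ranking: ts_rank() gives weight-A hits a larger contribution by default.
SELECT *, ts_rank(fulltext, q) AS rank
FROM   your_table, plainto_tsquery('english', 'search terms') AS q
WHERE  fulltext @@ q
ORDER  BY rank DESC;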

Are there plans to add 'OR' to attribute searches in Sphinx?

A little background is in order for this question, since on the surface it is too generic:
Recently I ran into an issue where I had to move attribute values into the full-text part of my SphinxQL query because the attributes needed to be part of an 'OR' condition.
In other words I was doing:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3)
When I tried to add an 'OR' to the attributes such as:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
it failed because Sphinx 2.* does not support OR in the attribute query.
I was also unable to simply put the name and customer IDs into the full-text query:
Select * from idx_test where MATCH('Terms ((@name_id 1|2|3)|(@customer_id 4|5|6))')
because (as far as I can tell) you can't push integer attributes into the full-text search.
My solution was to index the ID fields a second time, appended with _text:
Select name_id, name_id as name_id_text
and then add that to the field list:
sql_attr_uint = name_id
sql_field_string = name_id_text
sql_attr_uint = customer_id
sql_field_string = customer_id_text
So now I can do my OR query as full-text:
Select * from idx_test where MATCH('Terms ((@name_id_text 1|2|3)|(@customer_id_text 4|5|6))')
However, recently I found an article that discussed the trade-off between attribute and full-text searches. The upshot is that "it could reduce performance of queries that otherwise match few records", which is precisely what my name_id/customer_id query does. In an ideal world, then, I'd be able to go back to:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
if only Sphinx allowed OR between attributes, since as far as I can tell, once I have a query that filters down to a relatively small number of results, I'd have a much faster query using attributes rather than full-text.
So my two-part question therefore is:
Am I in fact correct that this is the case (a query that would reduce the # of results significantly is better served doing attributes then full-text)?
If so are there plans to add OR to the attribute part of the SphinxQL query?
If so, when?
An OR filter has been added in the Sphinx fork (from the 2.3 branch), Manticore; see https://github.com/manticoresoftware/manticore/commit/76b04de04feb8a4db60d7309bf1e57114052e298
For now it only works between attributes; OR between MATCH and attributes is not supported yet.
While yes, OR is not supported directly in WHERE, you can still run the query. Your
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
example can be written as
Select *, IN(name_id,1,2,3) + IN(customer_id,4,5,6) as filter
from idx_test where MATCH('Terms') and filter > 0
It is a bit more cumbersome, but should work. You still get the full benefit of the full-text inverted index, so performance actually shouldn't be bad: the filter is only evaluated against docs matching the terms.
(This may look crazy if you're coming from, say, a MySQL background, but remember SphinxQL isn't MySQL :)
You don't get short-circuiting (i.e. the customer_id filter will still be run even if name_id already matches), so perhaps
Select *, IF(IN(name_id,1,2,3) OR IN(customer_id,4,5,6),1,0) as filter
from idx_test where MATCH('Terms') and filter =1
is even better; the IF function has an OR operator! (Sphinx could potentially short-circuit it, but I don't know if it does.)
(But also yes, if the 'filter' is highly selective (matching few rows), then including it in the full-text query can be good, as it discards the rows earlier in processing. The problem with non-selective filters is that they match lots of rows, so there is a long doclist to process during text-query processing.)

postgresql tsvector partial text match

I'm trying to create a PostgreSQL query to find a partial text inside a tsvector column.
I have a tsvector value like this "'89' 'TT7' 'test123'" and I need to find any rows that contains "%es%".
How can I do that?
I tried
select * from use_docs_conteudo
WHERE textodados @@ to_tsquery('es')
It looks like you want to use fast ILIKE queries for wildcard matching. pg_trgm is the right tool for this. You can use POSIX regex rules for defining your query.
WITH data(t) AS ( VALUES
('test123! TT7 89'::TEXT),
('test123, TT7 89'::TEXT),
('test#test123.domain TT7 89'::TEXT)
)
SELECT count(*) FROM data WHERE t ~* 'es' AND t ~* '\mtest123\M';
Result:
count
-------
3
(1 row)
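On a real table you would back this with a trigram index over the raw text (not the tsvector column); a sketch, assuming the original document text lives in a plain-text column here called texto (the column and index names are assumptions):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index; supports LIKE/ILIKE and (since 9.3) simple regexes.
CREATE INDEX use_docs_conteudo_texto_trgm_idx
    ON use_docs_conteudo USING gin (texto gin_trgm_ops);

SELECT * FROM use_docs_conteudo WHERE texto ILIKE '%es%';
-- Caveat: very short patterns like '%es%' contain no complete trigram,
-- so the index helps little there; it shines for patterns of 3+ characters.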
Links to existing answers:
Postgresql full text search part of words
PostgreSQL: Full Text Search - How to search partial words?

Searching individual words in a string

I know about full-text search, but that only matches your query against individual words. I want to select strings that contain a word that starts with words in my query. For example, if I search:
appl
the following should match:
a really nice application
apples are cool
appliances
since all those strings contains words that start with appl. In addition, it would be nice if I could select the number of words that match, and sort based on that.
How can I implement this in PostgreSQL?
Prefix matching with Full Text Search
FTS supports prefix matching. Your query works like this:
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
Note the appended :* in the tsquery. This can use an index.
See:
Get partial match from GIN indexed TSVECTOR column
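The index this query can use is an expression index over the exact same to_tsvector() call; a minimal sketch, using the tbl/string names from the answer (the index name is made up):
-- The indexed expression must match the one used in the WHERE clause.
CREATE INDEX tbl_string_simple_tsv_idx
    ON tbl USING gin (to_tsvector('simple', string));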
Alternative with regular expressions
SELECT * FROM tbl
WHERE string ~ '\mappl';
Quoting the manual here:
\m .. matches only at the beginning of a word
To order by the count of matches, you could use regexp_matches()
SELECT tbl_id, count(*) AS matches
FROM (
   SELECT tbl_id, regexp_matches(string, '\mappl', 'g')
   FROM   tbl
   WHERE  string ~ '\mappl'
   ) sub
GROUP BY tbl_id
ORDER BY matches DESC;
Or regexp_split_to_table():
SELECT tbl_id, string, count(*) - 1 AS matches
FROM (
   SELECT tbl_id, string, regexp_split_to_table(string, '\mappl')
   FROM   tbl
   WHERE  string ~ '\mappl'
   ) sub
GROUP BY 1, 2
ORDER BY 3 DESC, 2, 1;
Postgres 9.3 or later has index support for simple regular expressions with a trigram GIN or GiST index. The release notes for Postgres 9.3:
Add support for indexing of regular-expression searches in pg_trgm
(Alexander Korotkov)
See:
PostgreSQL LIKE query performance variations
Depesz wrote a blog about index support for regular expressions.
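A sketch of the trigram index that lets the ~ '\mappl' queries above use an index (same tbl/string names as above; the index name is made up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Postgres 9.3+ can use this GIN trigram index for simple regular expressions.
CREATE INDEX tbl_string_trgm_idx ON tbl USING gin (string gin_trgm_ops);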
SELECT * FROM some_table WHERE some_field LIKE 'appl%' OR some_field LIKE '% appl%';
As for counting the number of words that match, I believe that would be too expensive to do dynamically in Postgres (though maybe someone else knows better). One way you could do it is by writing a function that counts occurrences in a string, and then adding ORDER BY myFunction('appl', some_field). Again though, this method is VERY expensive (i.e. slow) and not recommended.
For things like that, you should probably use a separate/complementary full-text search engine like Sphinx Search, which is specialized for that sort of thing.
An alternative is to have another table that contains keywords and the number of occurrences of those keywords in each string. This means you need to store each phrase you have (e.g. 'really really nice application') and also store the keywords in another table (i.e. really, 2; nice, 1; application, 1) and link that keyword table to your full-phrase table. This means you have to break strings up into keywords as they are entered into your database and store them in two places. This is a typical space vs. speed trade-off.
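A rough sketch of that keyword-table idea, assuming a hypothetical phrases(id, string) table; all names here are made up, and regexp_split_to_table() does the tokenising:
CREATE TABLE phrase_keywords (
    phrase_id   int  NOT NULL,
    keyword     text NOT NULL,
    occurrences int  NOT NULL,
    PRIMARY KEY (phrase_id, keyword)
);

-- Populate by splitting each phrase into lower-cased words and counting them.
INSERT INTO phrase_keywords (phrase_id, keyword, occurrences)
SELECT p.id, w.word, count(*)
FROM   phrases p,
       LATERAL regexp_split_to_table(lower(p.string), '\s+') AS w(word)
WHERE  w.word <> ''
GROUP  BY p.id, w.word;

-- Rank phrases by how many of their words start with 'appl'.
SELECT phrase_id, sum(occurrences) AS matches
FROM   phrase_keywords
WHERE  keyword LIKE 'appl%'
GROUP  BY phrase_id
ORDER  BY matches DESC;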

How do you do phrase-based full text search in postgres that takes advantage of the full-text index?

Let's say you have a postgres 8.3 table as follows:
CREATE TABLE t1 (body text, body_vector tsvector);
I want to be able to search it for phrases using the full-text index (GiST, GIN or both on the tsvector column). The best workaround I've been able to find is to first do the full-text search on both words (boolean AND) and then do a LIKE comparison on the body for the phrase. Of course, this fails to capture any stemming or spell-checking that Postgres' full-text search does for you. For example, if I'm searching for the phrase 'w1 w2', I'd use:
SELECT * FROM t1 WHERE body_vector @@ 'w1 & w2'::tsquery AND body LIKE '%w1 w2%';
Is there a way to do this where you don't have to resort to searching on the text column?
If you want exact phrase matching, that's the way to do it. You can also try WHERE body_vector @@ plainto_tsquery('w1 w2') and then order by ranking (the point being that hits where the words are right next to each other should end up on top).
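A sketch of that ranking approach on the question's t1 table; ts_rank_cd() (cover density) rewards matches where the words appear close together:
-- Requires that body_vector was built with to_tsvector() so positions are retained.
SELECT *, ts_rank_cd(body_vector, q) AS rank
FROM   t1, plainto_tsquery('w1 w2') AS q
WHERE  body_vector @@ q
ORDER  BY rank DESC;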
Update: PostgreSQL 9.6 text search supports phrases
select
    *
from (values
    ('i heart new york'),
    ('i hate york new')
) docs(body)
where
    to_tsvector(body) @@ phraseto_tsquery('new york')
(1 row retrieved)
or by distance between words:
-- a distance of exactly 2 "hops" between "quick" and "fox"
select
    *
from (values
    ('the quick brown fox'),
    ('quick brown cute fox')
) docs(body)
where
    to_tsvector(body) @@ to_tsquery('quick <2> fox')
(1 row retrieved)
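Applied back to the t1 table from the question, the phrase operators work directly against the stored tsvector, so a GIN index on body_vector is used and no LIKE on the text column is needed (a sketch, assuming body_vector was built with the same text search configuration):
CREATE INDEX t1_body_vector_idx ON t1 USING gin (body_vector);

SELECT *
FROM   t1
WHERE  body_vector @@ phraseto_tsquery('english', 'w1 w2');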