PostgreSQL ILIKE versus TSEARCH

I have a query with a number of text fields, something like this:
SELECT * FROM some_table
WHERE field1 ILIKE '%thing%'
OR field2 ILIKE '%thing'
OR field3 ILIKE '%thing';
The columns are pretty much all varchar(50) or thereabouts. Now I understand to improve performance I should index the fields upon which the search operates. Should I be considering replacing ILIKE with TSEARCH completely?

A full text search setup is not identical to a "contains"-style LIKE query. It stems words and so on, so you can match "cars" against "car".
If you really want a fast ILIKE then no standard database index or FTS will help. Fortunately, the pg_trgm module can do that.
http://www.postgresql.org/docs/9.1/static/pgtrgm.html
http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/
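A minimal sketch of the pg_trgm approach, assuming the table and column names from the question (some_table, field1 to field3) and PostgreSQL 9.1 or later. The index names are made up for illustration:

```sql
-- pg_trgm ships with PostgreSQL's contrib modules.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- One trigram GIN index per searched column (index names are hypothetical).
CREATE INDEX some_table_field1_trgm_idx ON some_table USING gin (field1 gin_trgm_ops);
CREATE INDEX some_table_field2_trgm_idx ON some_table USING gin (field2 gin_trgm_ops);
CREATE INDEX some_table_field3_trgm_idx ON some_table USING gin (field3 gin_trgm_ops);

-- The predicate from the question stays as it is; EXPLAIN shows whether
-- the planner actually picks the trigram indexes.
EXPLAIN
SELECT * FROM some_table
WHERE field1 ILIKE '%thing%'
   OR field2 ILIKE '%thing'
   OR field3 ILIKE '%thing';
```

Very short search terms give the index few or no trigrams to work with, so the planner may still prefer a sequential scan for them.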

One thing that is very important: NO B-TREE INDEX will ever improve this kind of search:
where field ilike '%SOMETHING%'
What I am saying is that if you do a:
create index idx_name on some_table(field);
The only access path that improves is where field like 'something%' (a search for values starting with some literal). So you get no benefit from adding a regular index to the field column in this case.
If you need to improve your search response time, definitely consider using FULL TEXT SEARCH.
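To illustrate the prefix-only case described above, here is a small sketch. It assumes the varchar column from the question. The second index name is hypothetical, and the pattern operator class is an addition that matters when the database does not use the C locale:

```sql
-- Plain B-tree index from the text above; it supports LIKE 'something%'
-- only when the database uses the C locale.
CREATE INDEX idx_name ON some_table (field);

-- In other locales a pattern operator class makes prefix LIKE indexable
-- (varchar_pattern_ops for varchar columns, text_pattern_ops for text).
CREATE INDEX idx_name_pattern ON some_table (field varchar_pattern_ops);

-- Left-anchored pattern: a B-tree index can be used.
EXPLAIN SELECT * FROM some_table WHERE field LIKE 'something%';

-- Leading wildcard: no B-tree index helps, expect a sequential scan.
EXPLAIN SELECT * FROM some_table WHERE field ILIKE '%SOMETHING%';
```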

Adding a bit to what the others have said.
First, you can't really use a B-tree index to search for a value in the middle of a string. Such indexes are generally tree searches, and without a fixed start of the string there is nothing to descend the tree by, so PostgreSQL will default to a seq scan. Indexes will only be used if the pattern fixes the first part of the string. So:
SELECT * FROM invoice
WHERE invoice_number like 'INV-2012-435%'
may use an index but like '%44354456%' cannot.
In general in LedgerSMB we use both, depending on what kind of search we are doing. You might see a search like:
select * from parts
WHERE partnumber ilike ? || '%'
and plainto_tsquery(get_default_language(), ?) @@ description;
So these are very different. Use each one where it makes the most sense.
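For the full text half of such a query to be fast, the tsvector needs its own index. A sketch, assuming the parts table above, a fixed 'english' configuration in place of get_default_language(), and made-up search values and index name:

```sql
-- Expression index for full text search on parts.description.
-- The index expression and the query must use the same configuration
-- ('english' is an assumption here) for the index to be usable.
CREATE INDEX parts_description_fts_idx
    ON parts USING gin (to_tsvector('english', description));

-- Prefix match on the part number, word match on the description.
SELECT * FROM parts
WHERE partnumber ILIKE 'AB-12' || '%'
  AND to_tsvector('english', description) @@ plainto_tsquery('english', 'steel bolts');
```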

Related

What is the correct way to create a case-insensitive trigram-index in postgres?

...and is it something I should do anyway?
From my brief testing, making a trigram index and searching using
where name like '%query%'
is faster than
where name ilike '%query%'
So it seems like I should do it, but I'm surprised I've not been able to find out how.
(My test data is fairly homogenous - 1.5M rows made up of 16 entries repeated. I can imagine this might mess with the results.)
This is how I expected it to work (note the lower(name)):
create extension pg_trgm;
create table users(name text);
insert into users values('Barry');
create index "idx" on users using gin (lower(name) gin_trgm_ops);
select count(*) from users where (name like '%bar%');
but this returns 0.
Either of
select count(*) from users where (name like '%Bar%');
or
select count(*) from users where (name ilike '%bar%');
work, which makes me believe the trigrams in the index are not lower()'d. Am I misunderstanding how this works under the hood? Is it not possible to call lower there?
I note that this
select show_trgm('Barry')
returns lowercase trigrams:
{" b"," ba",arr,bar,rry,"ry "}
So I'm perplexed.
The trigrams are definitely lower case.
The conundrum clears up when you consider how trigram indexes are used: they act as a filter that eliminates the majority of non-matches but allows false positives (their case insensitivity is one reason among others). That's why there always has to be a recheck to eliminate those false positives, and that is why you always get a bitmap index scan.
The ILIKE query may be slower because it has more results, or because case insensitive comparisons require more effort.
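A sketch of two ways to get an index-assisted, case-insensitive match, reusing the users table from the question (the index names are hypothetical). An index never changes what a query returns: name like '%bar%' finds nothing for 'Barry' because LIKE itself is case sensitive, and an expression index on lower(name) is only considered when the query also filters on lower(name).

```sql
-- Option 1: index the raw column and let ILIKE handle case.
CREATE INDEX users_name_trgm_idx ON users USING gin (name gin_trgm_ops);
SELECT count(*) FROM users WHERE name ILIKE '%bar%';

-- Option 2: index lower(name) and repeat that exact expression in the query.
CREATE INDEX users_lower_name_trgm_idx ON users USING gin (lower(name) gin_trgm_ops);
SELECT count(*) FROM users WHERE lower(name) LIKE '%bar%';
```

Both queries count the 'Barry' row. Which one is faster on real data is something to measure rather than assume.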

Postgres LIKE filter indexing & performance with %wildcard%

I have a scenario where I have to do a full text search on the database, based on whichever parameters are provided.
So let's say if first_name is provided, I only run this query:
where(unaccent(first_name) ilike %{params[:first_name]}%
If last_name is provided, only this one:
where(unaccent(last_name) ilike %{params[:last_name]}%
And if both are provided, then I run:
where(unaccent(first_name) ilike %{params[:first_name]}% AND
where(unaccent(last_name) ilike %{params[:last_name]}%
Note there are 3 more fields like this.
I know I can add GIN indexes for each of those fields and the queries would be pretty fast, but that would definitely increase storage space and slow down other operations, so I'm not a big fan of it.
Is there any advice on how to make this faster and reasonably optimized without slowing down other parts?

Matching performance with pattern from table column

I have a query which looks like:
SELECT *
FROM my_table
WHERE 'some_string' LIKE my_table.some_column || '%%'
How can I index some_column to improve this query performance?
Or is there a better way to filter this?
This predicate effectively searches for all prefixes for a given string:
WHERE 'some_string' LIKE my_table.some_column || '%'
Maybe % is a special character in your client, which needs to be escaped with another % to pass a literal %, else '%%' would be just noise and can be replaced with '%'.
The most efficient solution should be a recursive CTE (or similar) that matches to every prefix exactly, starting with some_column = left('some_string', 1), up to some_column = left('some_string', length('some_string')) (= 'some_string').
You only need a plain btree index on the column for this. Depending on details of your implementation, partial expression indexes might improve performance ...
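A compact sketch of that idea, using generate_series() in place of a recursive CTE to enumerate the prefixes (table and column names as in the question, PostgreSQL 9.1 or later for left()):

```sql
-- Enumerate every prefix of the search string and match each one exactly,
-- so a plain B-tree index on some_column can serve each lookup.
SELECT t.*
FROM generate_series(1, length('some_string')) AS n
JOIN my_table AS t ON t.some_column = left('some_string', n);
```

Unlike the LIKE version, this treats % and _ stored in some_column as ordinary characters and does not match rows where some_column is an empty string, so it is only equivalent if the data contains neither.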
Related:
Reverse pattern matching: find the longest prefix
Algorithm for finding the longest prefix
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
I believe you intend to write the following query:
SELECT *
FROM my_table
WHERE my_table.some_column LIKE 'some_string%';
In other words, you want to find records where some column begins with some_string followed by anything, possibly nothing at all.
As far as I know, a regular B-tree index on some_column will be effective, to a point, in your query. The reason is that Postgres can traverse the tree looking for the prefix some_string. Once it has found that entry, beyond that the index might not help. But an index on some_column should give you some performance benefit here.
A condition where an index would not help would be the following:
WHERE my_table.some_column LIKE '%some_string';
In this case, the index is rendered mostly useless, because we have no idea with what prefix the column value should begin.
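If suffix searches like that are needed, one common workaround (not something proposed above, so treat it as an assumption) is to index the reversed value, which turns the suffix match into a prefix match. A sketch with a hypothetical index name, assuming PostgreSQL 9.1 or later for reverse():

```sql
-- Expression index on the reversed value; text_pattern_ops keeps
-- prefix LIKE indexable regardless of the database locale.
CREATE INDEX my_table_some_column_rev_idx
    ON my_table (reverse(some_column) text_pattern_ops);

-- A suffix match on some_column becomes a prefix match on reverse(some_column).
SELECT *
FROM my_table
WHERE reverse(some_column) LIKE reverse('some_string') || '%';
```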

Are there plans to add 'OR' to attribute searches in Sphinx?

A little background is in order for this question, since on the surface it is too generic:
Recently I ran into an issue where I had to move the attribute values I was pushing into my sphinxql query as full-text because the attribute needed to be part of an 'OR' query.
In other words I was doing:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3)
When I tried to add an 'OR' to the attributes such as:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
it failed because Sphinx 2.* does not support OR in the attribute query.
I was also unable to simply put the name and customer IDs in to the query:
Select * from idx_test where MATCH('Terms ((@(name_id) 1|2|3)|(@(customer_id) 4|5|6))')
Because (as far as I can tell) you can't push integer fields into the full_text search.
My solution was to index the id fields a second time appended by _text:
Select name_id, name_id as name_id_text
and then add that to the field list:
sql_attr_uint = name_id
sql_field_string = name_id_text
sql_attr_uint = customer_id
sql_field_string = customer_id_text
So now I can do my OR query as full_text:
Select * from idx_test where MATCH('Terms ((@(name_id_text) 1|2|3)|(@(customer_id_text) 4|5|6))')
However recently I found an article that discussed the tradeoff between attribute and full-text searches. The upshot is that "it could reduce performance of queries that otherwise match few records". Which is precisely what my name_id/city_id query does. In an ideal world then I'd be able to go back to:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
If only Sphinx allowed OR between attributes: as far as I can tell, once a query filters down to a relatively low number of results, it would be much faster using attributes than full-text.
So my two-part question therefore is:
Am I in fact correct that this is the case (a query that would reduce the number of results significantly is better served by attributes than by full-text)?
If so are there plans to add OR to the attribute part of the SphinxQL query?
If so, when?
OR filter has been added in the Sphinx fork (from 2.3 branch) - Manticore, see https://github.com/manticoresoftware/manticore/commit/76b04de04feb8a4db60d7309bf1e57114052e298
For now it's only between attributes, OR between MATCH and attributes is not supported yet.
While OR is indeed not supported directly in WHERE, you can still run the query. Your
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
example can be written as
Select *, IN(name_id,1,2,3) + IN(customer_id,4,5,6) as filter
from idx_test where MATCH('Terms') and filter > 0
It is a bit more cumbersome, but should work. You still get the full benefit of the full-text inverted index, so performance actually shouldn't be bad. The filter is only executed against docs matching the terms.
(This may look crazy if you come from, say, a MySQL background, but remember SphinxQL isn't MySQL :)
You don't get short-circuiting (i.e. the customer_id filter will still be run even if name_id matches), so perhaps
Select *, IF(IN(name_id,1,2,3) OR IN(customer_id,4,5,6),1,0) as filter
from idx_test where MATCH('Terms') and filter =1
is even better, since the IF function has an OR operator! (Sphinx could potentially short-circuit it, but I don't know if it does.)
(But also yes, if the filter is highly selective (matching few rows), then including it in the full-text query can be good, as it discards rows earlier in processing. The problem with non-selective filters is that they have lots of matching rows, so there is a long doclist to process during text-query processing.)

Searching individual words in a string

I know about full-text search, but that only matches your query against individual words. I want to select strings that contain a word that starts with words in my query. For example, if I search:
appl
the following should match:
a really nice application
apples are cool
appliances
since all those strings contain words that start with appl. In addition, it would be nice if I could select the number of words that match, and sort based on that.
How can I implement this in PostgreSQL?
Prefix matching with Full Text Search
FTS supports prefix matching. Your query works like this:
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
Note the appended :* in the tsquery. This can use an index.
See:
Get partial match from GIN indexed TSVECTOR column
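A sketch of an index that the prefix query above could use, assuming the tbl table and string column from that query (the index name is hypothetical):

```sql
-- Expression index; it must repeat the exact to_tsvector() call,
-- including the 'simple' configuration, that the query uses.
CREATE INDEX tbl_string_fts_idx ON tbl USING gin (to_tsvector('simple', string));

-- Several prefixes can be combined in one tsquery.
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:* & cool:*');
```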
Alternative with regular expressions
SELECT * FROM tbl
WHERE string ~ '\mappl';
Quoting the manual here:
\m .. matches only at the beginning of a word
To order by the count of matches, you could use regexp_matches()
SELECT tbl_id, count(*) AS matches
FROM (
SELECT tbl_id, regexp_matches(string, '\mappl', 'g')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY tbl_id
ORDER BY matches DESC;
Or regexp_split_to_table():
SELECT tbl_id, string, count(*) - 1 AS matches
FROM (
SELECT tbl_id, string, regexp_split_to_table(string, '\mappl')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY 1, 2
ORDER BY 3 DESC, 2, 1;
Postgres 9.3 or later has index support for simple regular expressions with a trigram GIN or GiST index. The release notes for Postgres 9.3:
Add support for indexing of regular-expression searches in pg_trgm
(Alexander Korotkov)
See:
PostgreSQL LIKE query performance variations
Depesz wrote a blog about index support for regular expressions.
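As a rough sketch of that index support, assuming PostgreSQL 9.3 or later with pg_trgm available and the tbl and string names used above (the index name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index on the searched column.
CREATE INDEX tbl_string_trgm_idx ON tbl USING gin (string gin_trgm_ops);

-- Same word-start regular expression as above; inspect the plan to see
-- whether the planner chooses the trigram index for this pattern.
EXPLAIN
SELECT * FROM tbl
WHERE string ~ '\mappl';
```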
SELECT * FROM some_table WHERE some_field LIKE 'appl%' OR some_field LIKE '% appl%';
As for counting the number of words that match, I believe that would be too expensive to do dynamically in postgres (though maybe someone else knows better). One way you could do it is by writing a function that counts occurrences in a string, and then add ORDER BY myFunction('appl', some_field). Again though, this method is VERY expensive (i.e. slow) and not recommended.
For things like that, you should probably use a separate/complementary full-text search engine like Sphinx Search (google it), which is specialized for that sort of thing.
An alternative is to have another table that contains keywords and the number of occurrences of those keywords in each string. This means you need to store each phrase you have (e.g. really really nice application) and also store the keywords in another table (i.e. really, 2, nice, 1, application, 1) and link that keyword table to your full-phrase table. This means that you would have to break up strings into keywords as they are entered into your database and store them in two places. This is a typical space vs speed trade-off.
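A minimal sketch of such a keyword table. All table, column and index names here are hypothetical, and words are split on whitespace only, so punctuation stays attached to them:

```sql
-- One row per phrase, one row per (phrase, keyword) pair.
CREATE TABLE phrase (
    phrase_id serial PRIMARY KEY,
    body      text NOT NULL
);

CREATE TABLE phrase_keyword (
    phrase_id   integer NOT NULL REFERENCES phrase,
    keyword     text    NOT NULL,
    occurrences integer NOT NULL,
    PRIMARY KEY (phrase_id, keyword)
);

-- text_pattern_ops keeps keyword LIKE 'appl%' indexable in any locale.
CREATE INDEX phrase_keyword_prefix_idx ON phrase_keyword (keyword text_pattern_ops);

-- Fill the keyword table by splitting each phrase into lowercase words.
INSERT INTO phrase_keyword (phrase_id, keyword, occurrences)
SELECT phrase_id, word, count(*)
FROM (
    SELECT phrase_id, regexp_split_to_table(lower(body), '\s+') AS word
    FROM phrase
) AS w
WHERE word <> ''
GROUP BY phrase_id, word;

-- Rank phrases by how many of their words start with the search term.
SELECT p.phrase_id, p.body, sum(k.occurrences) AS matches
FROM phrase_keyword AS k
JOIN phrase AS p USING (phrase_id)
WHERE k.keyword LIKE 'appl%'
GROUP BY p.phrase_id, p.body
ORDER BY matches DESC;
```

In a real setup the keyword rows would be maintained as phrases are inserted or updated, for example from application code or a trigger, rather than rebuilt in one statement.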