PostgreSQL - full-text ordering of a filtered set

I have a PostgreSQL table with something like 200k tuples, so not that much.
What I am trying to do is filter out some rows and then order the rest using full-text matching:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND regexp_replace(d.description, '(...)', '') !~* '...'
AND regexp_replace(d.description, '...', '') !~* '...'
AND d.id != ?
ORDER BY ts_rank_cd(to_tsvector('english', name), plainto_tsquery('english', 'my search words')) DESC LIMIT 5 OFFSET 0;
There is a GIN index on the description field.
Now this query works well only when there are fewer than 4000 or so records in the category. When it's more like 5k or 6k, the query gets extremely slow.
I tried different variations of this query. What I noticed is that when I remove either the WHERE clause or the ORDER BY clause, I get a big speedup (but of course then I get irrelevant results).
What can I do to speed up this combination? Is there any way to optimize it, or should I look for a solution outside PostgreSQL?
Additional question:
I'm experimenting further, and this, for example, is the simplest query that I think runs too slowly. Can I tell from EXPLAIN ANALYZE when it uses the GiST index and when it doesn't?
SELECT d.*, d.description <-> 'banana' as dist FROM descriptions as d ORDER BY dist DESC LIMIT 5
"Limit (cost=16046.88..16046.89 rows=5 width=2425) (actual time=998.811..998.813 rows=5 loops=1)"
" -> Sort (cost=16046.88..16561.90 rows=206010 width=2425) (actual time=998.810..998.810 rows=5 loops=1)"
" Sort Key: (((description)::text <-> 'banana'::text))"
" Sort Method: top-N heapsort Memory: 27kB"
" -> Seq Scan on products d (cost=0.00..12625.12 rows=206010 width=2425) (actual time=0.033..901.260 rows=206010 loops=1)"
"Total runtime: 998.866 ms"`
Answered (kgrittn): The DESC keyword is not correct for KNN-GiST, and it's actually not wanted here. Removing it fixes the problem and gives the right results.

The output of EXPLAIN ANALYZE for your query would be helpful, but I would guess that those regexp_replace lines are your problem. The Postgres planner simply cannot know how many rows will match these two conditions, so it plans the query based on a flawed guess.
I'd recommend creating a function like this:
create function good_description(text) returns boolean as $$
    select
        regexp_replace($1, '(...)', '') !~* '...'
        and
        regexp_replace($1, '...', '') !~* '...'
$$ language sql immutable strict;
Then create a partial index on an expression using this function:
create index descriptions_good_description_idx
    on descriptions (good_description(description))
    where description != '';
Then query in a way that allows Postgres to use this index:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND good_description(d.description)
AND d.id != ?
ORDER BY ts_rank_cd(
to_tsvector('english', name),
plainto_tsquery('english', 'my search words')
) DESC
LIMIT 5 OFFSET 0;

For this type of application, we have been moving from the tsearch feature to the trigram feature; when you want to pick a small number of best matches, it is much faster. People here often prefer the semantics of the trigram similarity matching over the text-search ranking, anyway.
http://www.postgresql.org/docs/current/interactive/pgtrgm.html
"Borrowing" the later query from the edited question, formatting it, and including the index creation statement, to make the answer self-contained without a raft of comments:
CREATE INDEX descriptions_description_trgm
ON descriptions
USING gist (description gist_trgm_ops);
SELECT d.*, d.description <-> 'banana' as dist
FROM descriptions as d
ORDER BY dist LIMIT 5;
This should return rows from the GiST index in "distance" sequence until it hits the LIMIT.
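To verify, you can run EXPLAIN ANALYZE on the query above; with the GiST index in place you would expect an index scan driving the LIMIT rather than the Sort + Seq Scan from the question (a sketch of the expected plan shape, not actual output):
EXPLAIN ANALYZE
SELECT d.*, d.description <-> 'banana' AS dist
FROM descriptions AS d
ORDER BY dist
LIMIT 5;
-- Expected (assumed) plan shape:
--   Limit
--     -> Index Scan using descriptions_description_trgm on descriptions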

Related

Can Postgres use multiple indexes in a single query?

Assume that I have a query like the one below:
select
sum(impressions) as imp, sum(taps) as taps
from report
where org_id = 1 and report_date between '2019-01-01' and '2019-10-10'
group by country, text;
In MySQL, there is no support for using multiple indexes in a single query. Can I use multiple indexes for a single query in PostgreSQL?
Like:
For where condition: index(org_id, report_date);
For group by: index(country, text);
Explain:
"GroupAggregate (cost=8.18..8.21 rows=1 width=604)"
" Group Key: country, text"
" -> Sort (cost=8.18..8.18 rows=1 width=556)"
" Sort Key: country, text"
" -> Index Scan using idx_org_date on report (cost=0.14..8.17 rows=1 width=556)"
" Index Cond: ((org_id = 1) AND (date >= '2019-01-01'::date) AND (date <= '2019-02-02'::date))"
Yes and no. It can in general, but it can't use one index to get selectivity, and another to obtain the ordering needed for an efficient GROUP BY, on the same relation.
For example, if you had separate indexes on "org_id" and "report_date", it would be able to combine them using a BitmapAnd. But that would be less efficient than having your current two-column index, so this fact probably isn't of use to you.
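For illustration, a minimal sketch of that alternative (the index names are assumed):
CREATE INDEX idx_report_org ON report (org_id);
CREATE INDEX idx_report_date ON report (report_date);
-- The planner could then combine the two indexes, roughly:
--   BitmapAnd
--     -> Bitmap Index Scan on idx_report_org
--     -> Bitmap Index Scan on idx_report_date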
You might be better off with a HashAgg. You could try increasing work_mem in order to get one. But if there truly is only one row, it won't really matter.
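If you want to experiment with that, a sketch (the work_mem value is an assumption to tune for your data):
SET work_mem = '256MB';
EXPLAIN
SELECT sum(impressions) AS imp, sum(taps) AS taps
FROM report
WHERE org_id = 1 AND report_date BETWEEN '2019-01-01' AND '2019-10-10'
GROUP BY country, text;
-- Look for a HashAggregate node in place of GroupAggregate + Sort.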

Postgres similarity function not appropriately using trigram index

I have a simple person table with a last_name column, to which I've added a GiST index with
CREATE INDEX last_name_idx ON person USING gist (last_name gist_trgm_ops);
According to the docs at https://www.postgresql.org/docs/10/pgtrgm.html, the <-> operator should utilize this index. However, when I actually try to use this distance operator in the following query:
explain verbose select * from person where last_name <-> 'foobar' < 0.5;
I get this back:
Seq Scan on public.person (cost=0.00..290.82 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)
And it doesn't look like the index is being used. However, if I use the % operator with this command:
explain verbose select * from person where last_name % 'foobar';
It seems to use the index:
Bitmap Heap Scan on public.person (cost=4.25..41.51 rows=13 width=233)
Output: person_id, first_name, last_name
Recheck Cond: (person.last_name % 'foobar'::text)
-> Bitmap Index Scan on last_name_idx (cost=0.00..4.25 rows=13 width=0)
Index Cond: (person.last_name % 'foobar'::text)
I also noticed that if I move the operator to the select portion of the query, the index gets ignored again:
explain verbose select last_name % 'foobar' from person;
Seq Scan on public.person (cost=0.00..257.19 rows=13455 width=1)
Output: (last_name % 'foobar'::text)
Am I missing something obvious about how the similarity function uses the trigram index?
I am using Postgres 10.5 on OSX.
EDIT 1
As per Laurenz's suggestion, I tried setting enable_seqscan = off but unfortunately, the query with the <-> operator still seems to ignore the index.
show enable_seqscan;
enable_seqscan
----------------
off
explain verbose select * from person where last_name <-> 'foobar' < 0.5;
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.person (cost=10000000000.00..10000000290.83 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)
This behavior is normal for all kinds of indexes.
The first query is not in a form that can use the index. For that, a condition would have to be of the form
<indexed expression> <operator supported by the index> <quasi-constant>
where the last expression remains constant for the duration of the index scan and the operator returns a boolean value. Your expression last_name <-> 'foobar' < 0.5 is not of that form.
The <-> operator has to be used in an ORDER BY clause to be able to use the index.
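A minimal sketch of that form, which lets the planner walk the GiST index in distance order (the LIMIT value is an assumption):
SELECT *
FROM person
ORDER BY last_name <-> 'foobar'
LIMIT 10;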
The third query doesn't use the index because the query affects all rows of the table. An index does not speed up the evaluation of an expression, it is only useful to quickly identify a subset of the table (or to get rows in a certain sort order).

Filtering rows in WHERE clause with a textual set of ranges

I have a table with tens of millions of rows. Various complex filtering queries produce row sets to support an application. These row sets are of arbitrary size, from a single row up to and including the full table. For domain-specific reasons, however, they always maintain high levels of contiguity along a particular key.
I need to pass these row sets bidirectionally between the database and the application, and it would be nice to compress this somehow. Many of you are probably familiar with UNIX cut, which takes a field specification like cut -f 2-6,7,9-21 and returns the corresponding columns. I am currently using a slightly limited version of the cut field specification (e.g. no 17-) to represent the row sets. So, for instance, 24-923817,2827711-8471362,99188271 indicates a unique set of 6567447 rows while occupying 34 bytes.
I have already written the following procedure to convert these to SQL WHERE filters using the BETWEEN syntax:
CREATE OR REPLACE FUNCTION cut_string_to_sql_filter( TEXT, TEXT ) RETURNS TEXT AS $$
    SELECT
        CASE $1
            WHEN '' THEN 'FALSE'
            ELSE
                (SELECT '(' || STRING_AGG(
                            REGEXP_REPLACE(
                                REGEXP_REPLACE( str, '(\d+)-(\d+)', QUOTE_IDENT( $2 ) || ' BETWEEN \1 AND \2' ),
                                '^(\d+)$', QUOTE_IDENT( $2 ) || '=\1' ),
                            ' OR ' ) || ')' AS sql
                 FROM REGEXP_SPLIT_TO_TABLE( $1, ',' ) AS t(str))
        END;
$$ LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE;
The first parameter is the row set specification and the second parameter is the key field name for the table. For the example above, SELECT cut_string_to_sql_filter( '24-923817,2827711-8471362,99188271', 'some_key' ) returns:
(some_key BETWEEN 24 AND 923817 OR some_key BETWEEN 2827711 AND 8471362 OR some_key=99188271)
The problem with this is that currently any query that makes use of such row set specifications must use dynamic SQL, because I cannot think of a way to use custom operators or any other syntactic features to embed this effect in a plain SQL query.
I have also written a set-returning function for the row specifications:
CREATE OR REPLACE FUNCTION cut_string_to_set( TEXT ) RETURNS SETOF INTEGER AS $$
DECLARE
    _i TEXT;
    _pos INTEGER;
    _start INTEGER;
    _end INTEGER;
BEGIN
    IF $1 <> '' THEN
        -- Walk each comma-separated element of the specification.
        FOR _i IN SELECT REGEXP_SPLIT_TO_TABLE( $1, ',' ) LOOP
            _pos := POSITION( '-' IN _i );
            IF _pos > 0 THEN
                -- Range element: emit every integer from start through end.
                _start := SUBSTRING( _i FROM 1 FOR _pos - 1 )::INTEGER;
                _end := SUBSTRING( _i FROM _pos + 1 )::INTEGER;
                FOR _j IN _start.._end LOOP
                    RETURN NEXT _j;
                END LOOP;
            ELSE
                -- Single-value element.
                RETURN NEXT _i::INTEGER;
            END IF;
        END LOOP;
    END IF;
END
$$ LANGUAGE PLPGSQL IMMUTABLE STRICT PARALLEL SAFE;
This works in plain SQL with WHERE some_key IN (SELECT cut_string_to_set(...)). It is, of course, comparatively inefficient in unpacking what is best expressed to the planner as a set of ranges, produces nightmarish and verbose query plans, and may or may not prevent the planner from using an index when it otherwise could and should.
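For example, a sketch using the table and key names that appear in the EDITs below:
SELECT *
FROM test
WHERE test_ref IN ( SELECT cut_string_to_set( '24-27,29-50,999-1990' ) );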
Can anybody offer any solutions to the above conundrum for packaging this, potentially as its own type, potentially with custom operators, to allow syntactically sane index-based filtering on a column without dynamic SQL in the broader involved query? Is this simply impossible?
Feel free to offer suggestions for improvements to the procedures if you see any opportunities. And thanks!
EDIT 1
The great answer below suggests using an array of range types. Unfortunately, the query planner does not seem willing to use indexes with such a query. The planner output below is from a run on a small test table.
Gather (cost=1000.00..34587.33 rows=38326 width=45) (actual time=0.395..112.334 rows=1018 loops=1)
Workers Planned: 6
Workers Launched: 6
-> Parallel Seq Scan on test (cost=0.00..29754.73 rows=6388 width=45) (actual time=91.525..107.354 rows=145 loops=7)
Filter: (test_ref <# ANY ('{"[24,28)","[29,51)","[999,1991)"}'::int4range[]))
Rows Removed by Filter: 366695
Planning time: 0.214 ms
Execution time: 116.779 ms
The CPU cost (notice 6 workers in parallel for over 100 ms on the small test table) is too high. I cannot see how any additional index could help here.
To contrast, here is the planner output using the BETWEEN filters.
Bitmap Heap Scan on test (cost=22.37..1860.39 rows=1031 width=45) (actual time=0.134..0.430 rows=1018 loops=1)
Recheck Cond: (((test_ref >= 24) AND (test_ref <= 27)) OR ((test_ref >= 29) AND (test_ref <= 50)) OR ((test_ref >= 999) AND (test_ref <= 1990)))
Heap Blocks: exact=10
-> BitmapOr (cost=22.37..22.37 rows=1031 width=0) (actual time=0.126..0.126 rows=0 loops=1)
-> Bitmap Index Scan on test_test_ref_index (cost=0.00..2.46 rows=3 width=0) (actual time=0.010..0.010 rows=4 loops=1)
Index Cond: ((test_ref >= 24) AND (test_ref <= 27))
-> Bitmap Index Scan on test_test_ref_index (cost=0.00..2.64 rows=21 width=0) (actual time=0.004..0.004 rows=22 loops=1)
Index Cond: ((test_ref >= 29) AND (test_ref <= 50))
-> Bitmap Index Scan on test_test_ref_index (cost=0.00..16.50 rows=1007 width=0) (actual time=0.111..0.111 rows=992 loops=1)
Index Cond: ((test_ref >= 999) AND (test_ref <= 1990))
Planning time: 0.389 ms
Execution time: 0.660 ms
END EDIT 1
EDIT 2
The answer below suggests using a range index. The problem, as far as I understand it, is that I do not have a range type to index. All right, so maybe the key column can be converted to a range for the operation, so that I can apply a GiST index to it and the planner will use that.
CREATE INDEX test_test_ref_gist_index ON test USING GIST (test_ref);
ERROR: data type integer has no default operator class for access method "gist"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
No surprise here. So let's convert the key column to a range and index that.
CREATE INDEX test_test_ref_gist_index ON test USING GIST (INT4RANGE( test_ref, test_ref ));
Whew, a 110 MB index. That's hefty. But does it work?
Gather (cost=1000.00..34587.33 rows=38326 width=45) (actual time=0.419..111.009 rows=1018 loops=1)
Workers Planned: 6
Workers Launched: 6
-> Parallel Seq Scan on test_mv (cost=0.00..29754.73 rows=6388 width=45) (actual time=90.229..105.866 rows=145 loops=7)
Filter: (test_ref <# ANY ('{"[24,28)","[29,51)","[999,1991)"}'::int4range[]))
Rows Removed by Filter: 366695
Planning time: 0.237 ms
Execution time: 114.795 ms
Nope. I'm not too surprised. I would expect this index to work for "contains" rather than "contained by" operations. I have no experience here though.
END EDIT 2
Pass an array of ranges:
select *
from t
where
k <# any (array[
'[24,923817]','[2827711,8471362]','[99188271,99188271]'
]::int4range[])
Check indexing for range types: https://www.postgresql.org/docs/current/static/rangetypes.html#RANGETYPES-INDEXING
If a suitable range index is not possible, do a join to the materialized ranges:
select *
from
t
inner join
(
select generate_series(lower(a),upper(a) - 1) as k
from unnest(array[
'[24,27]','[29,50]','[999,1990]'
]::int4range[]) a(a)
) s using (k)
It is possible to avoid materializing all the range values. Compare to the lower and upper bounds of each range instead:
select *
from
t
cross join
(
select lower(a) as l, upper(a) - 1 as u
from unnest(array[
'[24,27]','[29,50]','[999,1990]'
]::int4range[]) a(a)
) s
where k between l and u
Simply impossible. Operators don't do that. They call functions. If they called a function here, that function would have to use dynamic SQL.
To avoid dynamic SQL, you'd have to hack apart the PostgreSQL lexer. PostgreSQL is a SQL database. Your syntax is not SQL. You can do one of two things:
Use SQL.
Compile SQL.
I prefer the first option where possible. If I need to make a DSL though, I don't do it in PostgreSQL. I do it in the app.

Postgres: Sorting by an immutable function index doesn't use index

I have a simple table.
CREATE TABLE posts
(
id uuid NOT NULL,
date_created timestamp,
vote_up_count integer,
vote_down_count integer,
CONSTRAINT post_pkey PRIMARY KEY(id)
);
I have an IMMUTABLE function that does simple (but could be complex) arithmetic.
CREATE OR REPLACE FUNCTION score(
ups integer,
downs integer)
RETURNS integer AS
$BODY$
select $1 - $2
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
ALTER FUNCTION score(integer, integer)
OWNER TO postgres;
I create an index on the posts table that uses my function.
CREATE INDEX posts_score_index ON posts(score(vote_up_count, vote_down_count), date_created);
When I EXPLAIN the following query, it doesn't seem to be using the index.
SELECT * FROM posts ORDER BY score(vote_up_count, vote_down_count), date_created
Sort (cost=1.02..1.03 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_of_com (...)"
Sort Key: ((posts.vote_up_count - posts.vote_down_count)), posts.date_created
-> Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_ (...)
How do I get my ORDER BY to use an index from an IMMUTABLE function that could have some very complex arithmetic?
EDIT: Based on Егор Рогов's suggestions, I changed the query a bit to see if I can get it to use an index. Still no luck.
set enable_seqscan=off;
EXPLAIN VERBOSE select date_created from posts ORDER BY (hot(vote_up_count, vote_down_count, date_created),date_created);
Here is the output.
Sort (cost=10000000001.06..10000000001.06 rows=1 width=16)
Output: date_created, (ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision) / 4 (...)
Sort Key: (ROW(round((((log((GREATEST(abs((posts.vote_up_count - posts.vote_down_count)), 1))::double precision) * sign(((posts.vote_up_count - posts.vote_down_count))::double precision)) + ((date_part('epoch'::text, posts.date_created) - 1134028003::dou (...)
-> Seq Scan on public.posts (cost=10000000000.00..10000000001.05 rows=1 width=16)
Output: date_created, ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision (...)
EDIT 2: It seems that the index was not being used because of the second ORDER BY expression with date_created.
I can see a couple of points that discourage the planner from using the index.
1.
Look at this line in the explain output:
Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
It says that the planner believes there is only one row in the table. In this case it makes no sense to use an index scan; a sequential scan is faster.
Try adding more rows to the table, run ANALYZE, and try again. You can also test by temporarily disabling sequential scans with set enable_seqscan=off;.
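For instance, a quick sketch of such a test (the row count is arbitrary, and gen_random_uuid() needs the pgcrypto extension before PostgreSQL 13):
INSERT INTO posts (id, vote_up_count, vote_down_count, date_created)
SELECT gen_random_uuid(),
       (random() * 100)::int,
       (random() * 100)::int,
       now() - random() * interval '365 days'
FROM generate_series(1, 100000);
ANALYZE posts;             -- refresh the planner's statistics
SET enable_seqscan = off;  -- then re-run EXPLAIN on the query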
2.
You use the function to sort the results, so the planner may decide to use the index to get tuple ids in the correct order. But then it needs to fetch each tuple from the table to get the values of all the columns (because of select *).
You can make the index more attractive to the planner by adding all the necessary columns to it, which makes it possible to avoid the table fetch entirely. This is called an index-only scan.
CREATE INDEX posts_score_index ON posts(
score(vote_up_count, vote_down_count),
date_created,
id, -- do you actually need it in result set?
vote_up_count, -- do you actually need it in result set?
vote_down_count -- do you actually need it in result set?
);
And make sure you run VACUUM after inserting/updating/deleting rows to keep the visibility map up to date.
The downside is the increased index size, of course.

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 sec. That's a huge difference for the same result, and I don't understand why.
If it helps, the schema is as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non-count query so much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) query can often be produced without an actual data scan: just from an index or table statistics. Redshift stores the minimum and maximum value for each block, which it uses to answer exists/not-exists questions like the one described here. If the requested value falls inside a block's min/max boundaries, the scan is performed only on the data of the filtering columns. If the requested value falls below or above the block boundaries, the answer comes much faster, on the basis of the stored statistics. With a "select *" query, Redshift actually scans the data of all columns, as requested by the "*", while filtering only on the columns in the WHERE clause.
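A hedged workaround that follows from this: ask only for the columns you need, so Redshift scans fewer column blocks (the column choice here is illustrative):
SELECT id, project_id
FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;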