Can Postgres use multiple indexes in a single query? - postgresql

Assume that I have a query like the one below:
select
sum(impressions) as imp, sum(taps) as taps
from report
where org_id = 1 and report_date between '2019-01-01' and '2019-10-10'
group by country, text;
In MySQL, there is no support for using multiple indexes in a single query. Can I use multiple indexes for a single query in PostgreSQL?
Like:
For where condition: index(org_id, report_date);
For group by: index(country, text);
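In Postgres DDL, those would look something like the following sketch (idx_org_date matches the index name visible in the plan below; idx_country_text is a hypothetical name):
-- covers the WHERE clause
create index idx_org_date on report (org_id, report_date);
-- intended for the GROUP BY columns
create index idx_country_text on report (country, text);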
Explain:
"GroupAggregate (cost=8.18..8.21 rows=1 width=604)"
" Group Key: country, text"
" -> Sort (cost=8.18..8.18 rows=1 width=556)"
" Sort Key: country, text"
" -> Index Scan using idx_org_date on report (cost=0.14..8.17 rows=1 width=556)"
" Index Cond: ((org_id = 1) AND (date >= '2019-01-01'::date) AND (date <= '2019-02-02'::date))"

Yes and no. In general it can, but it cannot use one index for selectivity and another to obtain the ordering needed for an efficient GROUP BY on the same relation.
For example, if you had separate indexes on "org_id" and "report_date", it would be able to combine them using a BitmapAnd. But that would be less efficient than having your current two-column index, so this fact probably isn't of use to you.
You might be better off with a HashAgg. You could try increasing work_mem in order to get one. But if there truly is only one row, it won't really matter.
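To illustrate both points of this answer, a minimal sketch (the single-column index names are hypothetical, and 64MB is an arbitrary example value):
-- separate single-column indexes that Postgres could combine with a BitmapAnd,
-- though less efficiently than the existing two-column index
create index idx_report_org on report (org_id);
create index idx_report_date on report (report_date);
-- a larger work_mem makes the planner more likely to choose a HashAggregate
-- instead of Sort + GroupAggregate
set work_mem = '64MB';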

Related

Index created for PostgreSQL jsonb column not utilized

I have created an index for a field in a jsonb column as:
create index on Employee using gin ((properties -> 'hobbies'))
Query generated is:
CREATE INDEX employee_expr_idx ON public.employee USING gin (((properties -> 'hobbies'::text)))
My search query has the following structure:
SELECT * FROM Employee e
WHERE e.properties @> '{"hobbies": ["trekking"]}'
AND e.department = 'Finance'
Running the EXPLAIN command for this query gives:
Seq Scan on employee e (cost=0.00..4452.94 rows=6 width=1183)
Filter: ((properties @> '{"hobbies": ["trekking"]}'::jsonb) AND (department = 'Finance'::text))
Going by this, I am not sure if the index is being used for the search.
Is this entire setup OK?
The expression you use in the WHERE clause must match the expression in the index exactly. Your index uses the expression ((properties -> 'hobbies'::text)), but your query only uses e.properties on the left-hand side.
To make use of that index, your WHERE clause needs to use the same expression as was used in the index:
SELECT *
FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance'
However: your execution plan shows that the table employee is really tiny (rows=6). With a table as small as that, a Seq Scan is always going to be the fastest way to retrieve data, no matter what kind of indexes you define.
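As an aside, and not part of the original answer: a GIN index on the whole column would support the original query shape directly, because the default jsonb_ops operator class indexes the @> containment operator (the index name here is hypothetical):
-- supports WHERE properties @> '{"hobbies": ["trekking"]}'
CREATE INDEX employee_properties_idx ON public.employee USING gin (properties);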

Postgres SELECT query not using index

I've created an index on a fairly large table of data (100 million + rows)
create index on web_responses_rebuild_2019_09 (response_time)
When I run an EXPLAIN on a simple query of the data, it does not use the index; it does a full table scan:
explain select *
from web_responses_rebuild_2019_09
where response_time between 1567398218
and 1567398220;
Results:
QUERY PLAN
"Seq Scan on web_responses_rebuild_2019_09 (cost=10000000000.00..10044668761.32 rows=300 width=734)"
" Filter: ((response_time >= 1567398218) AND (response_time <= 1567398220))"
...and we have a similarly named table and index that correctly does use the index.

Postgres: Sorting by an immutable function index doesn't use index

I have a simple table.
CREATE TABLE posts
(
id uuid NOT NULL,
vote_up_count integer,
vote_down_count integer,
CONSTRAINT post_pkey PRIMARY KEY(id)
);
I have an IMMUTABLE function that does simple (but could be complex) arithmetic.
CREATE OR REPLACE FUNCTION score(
ups integer,
downs integer)
RETURNS integer AS
$BODY$
select $1 - $2
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
ALTER FUNCTION score(integer, integer)
OWNER TO postgres;
I create an index on the posts table that uses my function.
CREATE INDEX posts_score_index ON posts(score(vote_up_count, vote_down_count), date_created);
When I EXPLAIN the following query, it doesn't seem to be using the index.
SELECT * FROM posts ORDER BY score(vote_up_count, vote_down_count), date_created
Sort (cost=1.02..1.03 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_of_com (...)"
Sort Key: ((posts.vote_up_count - posts.vote_down_count)), posts.date_created
-> Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_ (...)
How do I get my ORDER BY to use an index from an IMMUTABLE function that could have some very complex arithmetic?
EDIT: Based on @Егор-Рогов's suggestions, I changed the query a bit to see if I could get it to use an index. Still no luck.
set enable_seqscan=off;
EXPLAIN VERBOSE select date_created from posts ORDER BY (hot(vote_up_count, vote_down_count, date_created),date_created);
Here is the output.
Sort (cost=10000000001.06..10000000001.06 rows=1 width=16)
Output: date_created, (ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision) / 4 (...)
Sort Key: (ROW(round((((log((GREATEST(abs((posts.vote_up_count - posts.vote_down_count)), 1))::double precision) * sign(((posts.vote_up_count - posts.vote_down_count))::double precision)) + ((date_part('epoch'::text, posts.date_created) - 1134028003::dou (...)
-> Seq Scan on public.posts (cost=10000000000.00..10000000001.05 rows=1 width=16)
Output: date_created, ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision (...)
EDIT2: It seems that I was not using the index because of the second ORDER BY expression with date_created.
I can see a couple of points that discourage the planner from using the index.
1.
Look at this line in the explain output:
Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
It says that the planner believes there is only one row in the table. In this case it makes no sense to use an index scan, because a sequential scan is faster.
Try adding more rows to the table, run ANALYZE, and try again. You can also test it by temporarily disabling sequential scans with set enable_seqscan=off;.
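For example, a minimal sketch of that test with made-up data (gen_random_uuid() is built in from Postgres 13 on; older versions need the pgcrypto extension):
-- populate the table with enough rows to make an index scan worthwhile
insert into posts (id, vote_up_count, vote_down_count)
select gen_random_uuid(), (random() * 100)::int, (random() * 100)::int
from generate_series(1, 100000);
analyze posts;             -- refresh planner statistics
set enable_seqscan = off;  -- for testing only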
2.
You use the function to sort the results. So the planner may decide to use the index in order to get tuple ids in the correct order. But then it needs to fetch each tuple from the table to get the values of all the other columns (because of select *).
You can make the index more attractive to the planner by adding all necessary columns to it, which makes it possible to avoid the table scan altogether. This is called an index-only scan.
CREATE INDEX posts_score_index ON posts(
score(vote_up_count, vote_down_count),
date_created,
id, -- do you actually need it in result set?
vote_up_count, -- do you actually need it in result set?
vote_down_count -- do you actually need it in result set?
);
And make sure you run vacuum after inserting/updating/deleting rows to update the visibility map.
The downside is the increased index size, of course.
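To check whether you actually get an index-only scan, a sketch (exact plan output varies):
vacuum analyze posts;  -- updates the visibility map and statistics
explain select id, vote_up_count, vote_down_count
from posts
order by score(vote_up_count, vote_down_count), date_created;
-- look for a plan node like: Index Only Scan using posts_score_index on posts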

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75sec. That's a huge difference for the same result, and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non-count query so much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) query can often be produced without scanning the actual data, from an index or from table statistics alone. Redshift stores the minimum and maximum value for each block, which it can use to answer exists/not-exists questions like the one described here. When the requested value lies outside a block's min/max boundaries, the answer comes much faster, from the stored statistics alone. When the requested value lies inside the boundaries, a scan is performed, but only over the data of the filtering columns. With SELECT *, Redshift actually scans the data of all columns, as the query asks ("*"), while filtering only on the columns in the WHERE clause.
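One practical consequence, as a sketch that is not part of the original answer: on a columnar store, selecting only the columns you need keeps the scan narrow:
-- reads only the project_id and id column blocks instead of all ~50 columns
SELECT project_id, id
FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;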

Postgresql - full-text ordering of sorted out set

I have a PostgreSQL table with something like 200k tuples, so not that much.
What I try to do is filter out some rows and then order them using full-text matching:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND regexp_replace(d.description, '(...)', '') !~* '...'
AND regexp_replace(d.description, '...', '') !~* '...'
AND d.id != ?
ORDER BY ts_rank_cd(to_tsvector('english', name), plainto_tsquery('english', 'my search words')) DESC LIMIT 5 OFFSET 0;
There is a GIN index on the description field.
Now this query works well only when there are fewer than 4,000 or so records in the category. When it's more like 5k or 6k, the query gets extremely slow.
I was trying different variations of this query. What I noticed is that when I remove either the WHERE clause or the ORDER BY clause, I get a big speedup. (Of course, then I get irrelevant results.)
What can I do to speed up this combination? Is there any way of optimizing it, or should I look for a solution outside PostgreSQL?
Additional question:
I'm experimenting further, and for example this is the simplest query that I think runs too slow. Can I tell from EXPLAIN ANALYZE when it uses the GiST index and when it doesn't?
SELECT d.*, d.description <-> 'banana' as dist FROM descriptions as d ORDER BY dist DESC LIMIT 5
"Limit (cost=16046.88..16046.89 rows=5 width=2425) (actual time=998.811..998.813 rows=5 loops=1)"
" -> Sort (cost=16046.88..16561.90 rows=206010 width=2425) (actual time=998.810..998.810 rows=5 loops=1)"
" Sort Key: (((description)::text <-> 'banana'::text))"
" Sort Method: top-N heapsort Memory: 27kB"
" -> Seq Scan on products d (cost=0.00..12625.12 rows=206010 width=2425) (actual time=0.033..901.260 rows=206010 loops=1)"
"Total runtime: 998.866 ms"`
Answered (kgrittn): the DESC keyword is not correct for KNN-GiST, and it's actually not wanted here. Removing it fixes the problem and gives the right results.
The output of EXPLAIN ANALYZE for your query would be helpful. But I guess that these regexp_replace lines are your problem. The Postgres planner just cannot know how many rows will match these two conditions, so it is guessing and planning the query based on that flawed guess.
I'd recommend creating a function like this:
create function good_description(text) returns boolean as $$
select
regexp_replace($1, '(...)', '') !~* '...'
and
regexp_replace($1, '...', '') !~* '...'
$$ language sql immutable strict;
And then create a partial index on an expression using this function:
create index descriptions_good_description_idx
on descriptions (good_description(description))
where description != '';
And then querying in a way that allows Postgres to use this index:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND good_description(d.description)
AND d.id != ?
ORDER BY ts_rank_cd(
to_tsvector('english', name),
plainto_tsquery('english', 'my search words')
) DESC
LIMIT 5 OFFSET 0;
For this type of application, we have been moving from the tsearch feature to the trigram feature; when you want to pick a small number of best matches, it is much faster. People here often prefer the semantics of the trigram similarity matching over the text-search ranking, anyway.
http://www.postgresql.org/docs/current/interactive/pgtrgm.html
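Note that the gist_trgm_ops operator class used below comes from the pg_trgm extension, so it has to be installed once per database:
CREATE EXTENSION IF NOT EXISTS pg_trgm;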
"Borrowing" the later query from the edited question, formatting it, and including the index creation statement, to make the answer self-contained without a raft of comments:
CREATE INDEX descriptions_description_trgm
ON descriptions
USING gist (description gist_trgm_ops);
SELECT d.*, d.description <-> 'banana' as dist
FROM descriptions as d
ORDER BY dist LIMIT 5;
This should return rows from the GiST index in "distance" sequence until it hits the LIMIT.