Indexing with joins - postgresql

I'm new to the concept of indexes, but I'm trying to figure out how it works.
I have the following query that I want to improve the performance of.
explain analyze select to_char(rental_date, 'month') as month, count(*) count
from rental
join instrument on rental.instrument_id = instrument.instrument_id
where extract(year from rental_date) = 2020
group by month, extract(month from rental_date)
order by extract(month from rental_date) asc
;
Execution plan
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=75.10..75.55 rows=15 width=48) (actual time=14.204..14.821 rows=12 loops=1)
Group Key: (date_part('month'::text, (rental.rental_date)::timestamp without time zone)), (to_char((rental.rental_date)::timestamp with time zone, 'month'::text))
-> Sort (cost=75.10..75.14 rows=15 width=40) (actual time=14.121..14.298 rows=1540 loops=1)
Sort Key: (date_part('month'::text, (rental.rental_date)::timestamp without time zone)), (to_char((rental.rental_date)::timestamp with time zone, 'month'::text))
Sort Method: quicksort Memory: 169kB
-> Hash Join (cost=1.20..74.81 rows=15 width=40) (actual time=7.912..13.166 rows=1540 loops=1)
Hash Cond: (rental.instrument_id = instrument.instrument_id)
-> Seq Scan on rental (cost=0.00..73.39 rows=15 width=8) (actual time=0.061..2.027 rows=1540 loops=1)
Filter: (date_part('year'::text, (rental_date)::timestamp without time zone) = '2020'::double precision)
Rows Removed by Filter: 1511
-> Hash (cost=1.09..1.09 rows=9 width=4) (actual time=0.046..0.047 rows=9 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on instrument (cost=0.00..1.09 rows=9 width=4) (actual time=0.012..0.016 rows=9 loops=1)
Planning Time: 3.908 ms
Execution Time: 15.072 ms
(15 rows)
My idea was to have indexes on instrument_id and on rental_date, because instrument_id is a foreign key and rental_date is in the where clause.
create index isx_rental ON rental(instrument_id);
create index isx_date ON rental(rental_date);
But this doesn't affect the runtime at all.
Why doesn't this help me with the performance?

The condition extract(year from rental_date) = 2020 can't use an index on rental_date as the index doesn't store the result of the expression, only the original column value.
You need to at least change that condition to:
where rental_date >= date '2020-01-01'
and rental_date < date '2021-01-01'
in order for the index to be considered at all. Whether that improves the performance is hard to say without seeing the execution plan.
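If you did want to keep filtering on the extracted year, an expression index that stores that result is an alternative (a sketch; the index name is illustrative, and the range rewrite above is usually the better option):
create index idx_rental_year on rental ((extract(year from rental_date)));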

To begin with, extract(year from rental_date) = 2020 cannot use an index, because the function needs to be applied to each row before the comparison can be done. Whenever possible, don't extract a portion of a date; use a range instead. In this case you could use rental_date >= '2020-01-01' AND rental_date < '2021-01-01'.
Then it's a little odd that you group by the alias month, which is the month name, and then again by extract(month from rental_date), which is the same month as a number. The two map one-to-one, but I don't know whether the optimizer realizes that, so it's better to group by only one of them. It might also help to group by the same expression you order by, so GROUP BY extract(month from rental_date) seems more promising.
In short, you can try to rewrite the query to (the month name is wrapped in min() so it can still be selected while grouping only by the numeric month):
SELECT min(to_char(rental_date, 'month')) AS month,
       count(*) AS count
FROM rental
INNER JOIN instrument
    ON rental.instrument_id = instrument.instrument_id
WHERE rental_date >= '2020-01-01'
  AND rental_date < '2021-01-01'
GROUP BY extract(month from rental_date)
ORDER BY extract(month from rental_date) ASC;
For instrument, there's only one column involved, instrument_id, and that might be a primary key, i.e. it is already indexed. If not, index it; it might help the JOIN.
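Such an index could look like this (a sketch; the index name is illustrative, and with only 9 rows in instrument the planner may still prefer a sequential scan):
CREATE INDEX idx_instrument_id ON instrument (instrument_id);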
The next thing is that usually only one index is used per table in a SELECT (unless the table appears more than once in the FROM clause, which isn't the case here). So your two separate indexes on rental are one too many; you need a composite index.
Now for rental you could try an index on (rental_date, instrument_id) to support the WHERE and the JOIN. To also support the GROUP BY and ORDER BY you can try to add extract(month from rental_date) to the index as well, making it an index on (rental_date, instrument_id, extract(month from rental_date)).
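A sketch of those suggestions (the index names are illustrative, and the expression column needs its own set of parentheses in CREATE INDEX):
CREATE INDEX idx_rental_date_instrument ON rental (rental_date, instrument_id);
-- or, to also cover the GROUP BY / ORDER BY expression:
CREATE INDEX idx_rental_date_instrument_month ON rental (rental_date, instrument_id, (extract(month from rental_date)));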

Related

Fast PostgreSQL query with GROUP BY, COUNT(DISTINCT) and SUM on differrent columns

I am trying to query a table with approximately 1.5 million records. I have indexes and it performs well.
However, for one of the columns I want a COUNT of the distinct values (there are many duplicates), and with DISTINCT the query is about 10x slower than without.
This is the query:
SELECT
created_at,
SUM(amount) as total,
COUNT(DISTINCT partner_id) as count_partners
FROM
consumption
WHERE
is_official = true
AND
(is_processed = true OR is_deferred = true)
GROUP BY created_at
This takes 2.5s
If I make it:
COUNT(partner_id) as count_partners
It takes 230ms. But this is not what I want.
I want a unique set of partners per grouping (date) as well as a sum of the amounts they have consumed in that period.
I don't understand why this is so much slower. PostgreSQL seems to be very quick at creating an array of all the duplicates, so why does simply adding DISTINCT trash its performance?
Query Plan:
GroupAggregate (cost=85780.70..91461.63 rows=12 width=24) (actual time=1019.428..2641.434 rows=13 loops=1)
Output: created_at, sum(amount), count(DISTINCT partner_id)
Group Key: p.created_at
Buffers: shared hit=16487
-> Sort (cost=85780.70..87200.90 rows=568081 width=16) (actual time=865.599..945.674 rows=568318 loops=1)
Output: created_at, amount, partner_id
Sort Key: p.created_at
Sort Method: quicksort Memory: 62799kB
Buffers: shared hit=16487
-> Seq Scan on public.consumption p (cost=0.00..31484.26 rows=568081 width=16) (actual time=0.020..272.126 rows=568318 loops=1)
Output: created_at, amount, partner_id
Filter: (p.is_official AND (p.is_deferred OR p.is_processed))
Rows Removed by Filter: 931408
Buffers: shared hit=16487
Planning Time: 0.191 ms
Execution Time: 2647.629 ms
Indexes:
CREATE INDEX IF NOT EXISTS i_pid ON consumption (partner_id);
CREATE INDEX IF NOT EXISTS i_processed ON consumption (is_processed);
CREATE INDEX IF NOT EXISTS i_official ON consumption (is_official);
CREATE INDEX IF NOT EXISTS i_deferred ON consumption (is_deferred);
CREATE INDEX IF NOT EXISTS i_created ON consumption (created_at);
The following query should be able to benefit from the indexes.
SELECT
created_at,
SUM(amount) AS total,
COUNT(DISTINCT partner_id) AS count_partners
FROM
(SELECT
created_at,
sum(amount) as amount,
partner_id
FROM consumption
WHERE is_official = true
AND (is_processed = true OR is_deferred = true)
GROUP BY
created_at,
partner_id
) AS c
GROUP BY created_at;
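If that still spends most of its time sorting, a composite partial index matching the filter and the inner GROUP BY might help further (a sketch; the index name is illustrative and the INCLUDE clause needs PostgreSQL 11 or newer):
CREATE INDEX i_official_created_partner ON consumption (created_at, partner_id) INCLUDE (amount)
WHERE is_official AND (is_processed OR is_deferred);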

Database design for time series

Approximately every 10 min I insert ~50 records with the same timestamp.
That means ~300 records per hour, ~7,200 records per day, or ~2,592,000 records per year.
User wants to retrieve all records for the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
('2019-01-02 10:00', 5),
('2019-01-02 10:00', 12),
('2019-01-02 10:00', 7),
...;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition the A table by timestamp to have 1 partition per year, but the approximate match above still will be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and auto-incremented PK,
2nd table: to keep data and the foreign key on 1st table PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantage of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
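Putting that together with the query from Design #1, it might look like this (a sketch; asked_time stands for the requested timestamp, as in the question):
SELECT * FROM a WHERE t =
(SELECT t FROM a
 WHERE date_trunc('hour', t + INTERVAL '30 minutes')
     = date_trunc('hour', asked_time + INTERVAL '30 minutes')
 ORDER BY greatest(t - asked_time, asked_time - t)
 LIMIT 1)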
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see, it requires only very few I/O operations, and that is pretty much independent of the table size.
The above can be used in an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
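That variant might look like this (a sketch, assuming at most 100 rows on either side share the closest timestamps):
(
select *
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 100
)
union all
(
select *
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 100
)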
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
Then the query gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
Then you can create a GiST index on the timestamp:
create index on a using gist (t);
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);

Indexing strategy for different combinations of WHERE clauses incl. text patterns

Continuation of another question here:
How to get date_part query to hit index?
The following query hits a compound index I created on the datelocal, views, impressions, gender, and agegroup fields:
SELECT date_part('hour', datelocal) AS hour
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
GROUP BY 1
ORDER BY 1;
However, I'd like to be able to also filter this query down based on additional clauses in the WHERE, for example:
SELECT date_part('hour', datelocal) AS hour
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
AND network LIKE '%'
GROUP BY 1
ORDER BY 1;
This second query is MUCH slower than the first, although it should be operating on far fewer records, in addition to the fact that it doesn't hit my index.
Table schema:
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);
Analyze output:
Finalize GroupAggregate (cost=1005368.37..1005385.70 rows=3151 width=24) (actual time=70615.636..70615.649 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1005368.37..1005369.94 rows=3151 width=24) (actual time=70615.631..70615.634 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1005005.62..1005331.75 rows=3151 width=24) (actual time=70615.456..70641.208 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1004005.62..1004016.65 rows=3151 width=24) (actual time=70613.132..70613.152 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Seq Scan on reportimpression (cost=0.00..996952.63 rows=2821195 width=17) (actual time=0.803..69876.914 rows=2429159 loops=2)
Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-03-01 00:00:00'::timestamp without time zone) AND (network ~~ '%'::text))
Rows Removed by Filter: 6701736
Planning time: 0.195 ms
Execution time: 70641.349 ms
Do I need to create additional indexes, tweak my SELECT, or something else entirely?
Your added predicate uses the LIKE operator:
AND network LIKE '%'
The actual query plan depends on what you pass instead of '%'.
But, generally, plain btree indexes are useless for this. You'll need a trigram index or use the text search infrastructure or similar, depending on what patterns you might be looking for.
See:
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
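For substring-style patterns, that typically means a trigram index (a sketch; the index name is illustrative and the pg_trgm extension has to be installed first):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX reportimpression_network_trgm_idx ON reportimpression USING gin (network gin_trgm_ops);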
You might even combine multiple indexing strategies. Example:
PostgreSQL: Find sentences closest to a given sentence
If that's supposed to be:
AND network = '<input_string>'
then, by all means, actually use the = operator, not LIKE. Reasons in ascending order of importance:
shorter
less confusing
makes the job for the Postgres planner simpler (very slightly cheaper)
correct
If you pass a string with special characters inadvertently, you might get incorrect results. See:
Escape function for regular expression or LIKE patterns
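A minimal sketch of such an escape step (the function name is hypothetical; it assumes the default backslash escape character for LIKE):
CREATE OR REPLACE FUNCTION escape_like(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE AS
$$SELECT replace(replace(replace($1, '\', '\\'), '%', '\%'), '_', '\_')$$;
-- e.g. to search for the literal input as a substring:
-- WHERE network LIKE '%' || escape_like(<input_string>) || '%'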

Forcing Postgresql to use Merge Append

Say I have the following tables and indices:
create table inbound_messages(id int, user_id int, received_at timestamp);
create table outbound_messages(id int, user_id int, sent_at timestamp);
create index on inbound_messages(user_id, received_at);
create index on outbound_messages(user_id, sent_at);
Now I want to pull out the last 20 messages for a user, either inbound or outbound, in a specific time range. I can do the following, and from the explain it looks like PG walks back both indexes in 'parallel' so it minimises the number of rows it needs to scan.
explain select * from (select id, user_id, received_at as time from inbound_messages union all select id, user_id, sent_at as time from outbound_messages) x where user_id = 5 and time between '2018-01-01' and '2020-01-01' order by user_id,time desc limit 20;
Limit (cost=0.32..16.37 rows=2 width=16)
-> Merge Append (cost=0.32..16.37 rows=2 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (received_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (received_at <= '2020-01-01 00:00:00'::timestamp without time zone))
-> Index Scan Backward using outbound_messages_user_id_sent_at_idx on outbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (sent_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (sent_at <= '2020-01-01 00:00:00'::timestamp without time zone))
Without this plan, it could do something crazy like finding all the matching rows in memory and then sorting them; if there were millions of matching rows, that could take a long time. But because it walks the indexes in the same order we want the results in, this is a fast operation. It looks like the 'Merge Append' operation is done lazily and doesn't actually materialize all the matching rows.
Now we can see Postgres supports this operation for two distinct tables; however, is it possible to force Postgres to use this optimisation for a single table?
Let's say I wanted the last 20 inbound messages for user_id = 6 or user_id = 7.
explain select * from inbound_messages where user_id in (6,7) order by received_at desc limit 20;
Then we get a query plan that does a bitmap heap scan and then an in-memory sort. So if millions of messages match, it will look at millions of rows, even though theoretically it could use the same merge trick to look at only a few rows.
Limit (cost=15.04..15.09 rows=18 width=16)
-> Sort (cost=15.04..15.09 rows=18 width=16)
Sort Key: received_at DESC
-> Bitmap Heap Scan on inbound_messages (cost=4.44..14.67 rows=18 width=16)
Recheck Cond: (user_id = ANY ('{6,7}'::integer[]))
-> Bitmap Index Scan on inbound_messages_user_id_received_at_idx (cost=0.00..4.44 rows=18 width=0)
Index Cond: (user_id = ANY ('{6,7}'::integer[]))
We could think of just adding an index on (received_at); then it would do the same backward scan. However, if we have a large number of users, we'd miss out on a potentially large speedup because we'd be scanning lots of index entries that don't match the query.
The following approach should work as a way of forcing Postgres to use the "merge append" plan when you are interested in the most recent messages for two users from the same table.
[Note: I tested this on YugabyteDB (which is based on Postgres), so I expect the same to apply to Postgres as well.]
explain select * from (
(select * from inbound_messages where user_id = 6 order by received_at DESC)
union all
(select * from inbound_messages where user_id = 7 order by received_at DESC)
) AS result order by received_at DESC limit 20;
which produces:
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.01..3.88 rows=20 width=16)
-> Merge Append (cost=0.01..38.71 rows=200 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 6)
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages inbound_messages_1 (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 7)

How to auto-extract or index day from timestamp in Postgres?

We have a Postgres table with a timestamp field created_at. On a regular basis, we need to find all the records with the day field of created_at being a certain number.
We can run a query like
select * from table where extract(day from created_at) = 3;
I suspect this isn't efficient, i.e. it's doing a full-table scan. If so, can I create an index somehow to make the above efficient?
If it's not possible, we can create a separate column called created_at_day and create an index on it.
So we can simply run the query like
select * from table where created_at_day = 3;
Let's say created_at can be updated. Whenever this happens, created_at_day should be updated, too.
Does Postgres provide any support to automatically keep created_at_day in sync with created_at? If so, how?
Of course this can be done in the application logic. So whenever created_at is created or updated, we update the created_at_day column. But just wondering if there's an easier, automated way to do this.
Thanks
You can create an index on extract(day from created_at).
To see the difference:
Create a table
knayak=# create table t as select i ,now()::timestamp + interval '1 days' * i as created_at from generate_series(1,10000) as i;
SELECT 10000
Create normal index on created_at
knayak=# create index ind_created_at on t(created_at);
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..205.00 rows=50 width=12) (actual time=1.049..6.020 rows=328 loops=1)
Filter: (date_part('day'::text, created_at) = '3'::double precision)
Rows Removed by Filter: 9672
Planning time: 0.392 ms
Execution time: 6.070 ms
(5 rows)
Create index with extract
knayak=# drop index ind_created_at;
DROP INDEX
knayak=# create index ind_created_at on t( extract(day from created_at) );
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=4.67..61.66 rows=50 width=12) (actual time=0.110..0.260 rows=328 loops=1)
Recheck Cond: (date_part('day'::text, created_at) = '3'::double precision)
Heap Blocks: exact=54
-> Bitmap Index Scan on ind_created_at (cost=0.00..4.66 rows=50 width=0) (actual time=0.093..0.093 rows=328 loops=1)
Index Cond: (date_part('day'::text, created_at) = '3'::double precision)
Planning time: 0.316 ms
Execution time: 0.314 ms
(7 rows)
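As for the other part of the question, automatically keeping a separate created_at_day column in sync: a stored generated column can do that without triggers (a sketch, assuming PostgreSQL 12 or newer; with the expression index above you don't need the extra column at all):
alter table t add column created_at_day int generated always as (extract(day from created_at)::int) stored;
create index ind_created_at_day on t (created_at_day);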