Postgres does not use composite index - postgresql

I have a table.
CREATE TABLE orders(
id int NOT NULL,
created_at timestamp DEFAULT CURRENT_TIMESTAMP,
order_type_id int
);
And 2 indexes:
CREATE INDEX ix_orders_created_at ON orders (created_at);
CREATE INDEX ix_orders_order_type_id_created_at ON orders (order_type_id, created_at desc);
Sometimes I want to get all orders sorted by created_at DESC and sometimes I need to get orders of specific type only, also sorted by created_at. For the second case the query is:
SELECT * FROM orders
WHERE order_type_id=(SELECT id FROM order_types WHERE name='order_type_name')
ORDER by created_at DESC
LIMIT 50;
I expect Postgres to use the second, composite index for such a query. But no, Postgres uses the simple index on created_at, "scans" all table records by date and filters for the required type.
QUERY PLAN
----------------------------------------------------------------------------------------------
Index Scan using ix_orders_created_at on orders (cost=1.68..1663504.64 rows=5747582 width=1471)
Filter: (order_type_id = $0)
InitPlan 1 (returns $0)
-> Seq Scan on order_types (cost=0.00..1.11 rows=1 width=4)
Filter: (name = 'order_type_name'::text)
(5 rows)
For frequent order types this is acceptable, but for rare or not-yet-existing order types it leads to long, heavy queries that scan a lot of records. How do I force Postgres to use the composite index instead? Running "ANALYZE orders" doesn't help.

Related

Create a unique index using gist with a column of type daterange and a text column

Imagine we have this table:
CREATE TABLE reservations
(
reservation_id INT GENERATED ALWAYS AS IDENTITY,
room_id INT NOT NULL,
date_period DATERANGE,
EXCLUDE USING gist (room_id WITH =, date_period WITH &&),
PRIMARY KEY (reservation_id),
FOREIGN KEY (room_id) REFERENCES rooms(room_id) ON DELETE CASCADE
);
The EXCLUDE USING gist constraint prevents overlapping date_period values for the same room.
What we want is to create a composite unique index on the room_id and the date_period, so I could hit this index on my queries.
SELECT reservation_id
FROM reservations
WHERE room_id = 1 AND date_period = '[2022-09-01, 2022-09-07)';
The thing is, I am not sure whether I have already created the index with my exclusion constraint. If so, can we create a unique composite index alongside our overlapping-date constraint?
If you use EXPLAIN on your query, you will see that the index can be used:
EXPLAIN
SELECT reservation_id
FROM reservations
WHERE room_id = 1 AND date_period = '[2022-09-01, 2022-09-07)';
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════
Index Scan using reservations_room_id_date_period_excl on reservations (cost=0.14..8.16 rows=1 width=4)
Index Cond: ((room_id = 1) AND (date_period = '[2022-09-01,2022-09-07)'::daterange))
(2 rows)
If the table is small, you may have to set enable_seqscan = off to keep PostgreSQL from using a sequential scan.
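To make the constraint's semantics concrete, here is a minimal Python sketch of the overlap test that `&&` applies to two date ranges. It assumes non-empty, bounded, half-open `[lower, upper)` ranges, which is how PostgreSQL canonicalizes daterange values; the function name and sample ranges are made up for illustration. The constraint rejects a new row only if its room_id equals an existing row's AND this test is true.

```python
from datetime import date

def overlaps(a, b):
    """True if half-open ranges [a[0], a[1]) and [b[0], b[1]) share at least one day.

    Assumes both ranges are non-empty and bounded (illustrative simplification).
    """
    return a[0] < b[1] and b[0] < a[1]

existing = (date(2022, 9, 1), date(2022, 9, 7))       # '[2022-09-01, 2022-09-07)'
back_to_back = (date(2022, 9, 7), date(2022, 9, 10))  # starts the day the other ends
clashing = (date(2022, 9, 6), date(2022, 9, 8))       # shares 2022-09-06

print(overlaps(existing, back_to_back))  # False
print(overlaps(existing, clashing))      # True
```

Because the upper bound is exclusive, a reservation that starts on the day another one ends does not conflict.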

Multi column order by kills query performance even when the time range does not contain any records

I have a fairly small table of 26 million records.
CREATE TABLE t1
(
cam varchar(100) NOT NULL,
updatedat timestamp,
objid varchar(40) NOT NULL,
image varchar(100) NOT NULL,
reader varchar(60) NOT NULL,
imgcap timestamp NOT NULL
);
ALTER TABLE t1
ADD CONSTRAINT t1_pk
PRIMARY KEY (reader, cam, image, objid, imgcap);
I have a simple query to iterate the records between a time range.
SELECT * FROM t1
WHERE updatedat >= '2021-12-09 20:30:00' and updatedat <= '2021-12-09 20:32:01'
ORDER BY reader ASC , imgcap ASC, objid ASC, cam ASC, image ASC
LIMIT 10000
OFFSET 0;
I added an index to support the query with the comparison as the left most field and the remaining elements to support the sort.
CREATE INDEX t1_idtmp ON t1 USING btree (updatedat , reader , imgcap , objid, cam, image);
However, the query takes more than 10 seconds to complete. It takes the same time even if there are no rows in the range.
-> Incremental Sort (cost=8.28..3809579.24 rows=706729 width=223) (actual time=11034.114..11065.710 rows=10000 loops=1)
Sort Key: reader, imgcap, objid, cam, image
Presorted Key: reader, imgcap
Full-sort Groups: 62 Sort Method: quicksort Average Memory: 42kB Peak Memory: 42kB
Pre-sorted Groups: 62 Sort Methods: top-N heapsort, quicksort Average Memory: 58kB Peak Memory: 58kB
-> Index Scan using t1_idxevtim on t1 (cost=0.56..3784154.75 rows=706729 width=223) (actual time=11033.613..11036.823 rows=10129 loops=1)
Filter: ((updatedat >= '2021-12-09 20:30:00'::timestamp without time zone) AND (updatedat <= '2021-12-09 20:32:01'::timestamp without time zone))
Rows Removed by Filter: 25415461
Planning Time: 0.137 ms
Execution Time: 11066.791 ms
There are a couple more indexes on the table to support other use cases.
CREATE INDEX t1_idxua ON t1 USING btree (updatedat);
CREATE INDEX t1_idxevtim ON t1 USING btree (reader, imgcap);
I think PostgreSQL wants to avoid an expensive sort and expects the presorted key to be faster, but why does PostgreSQL not use the t1_idtmp index as both search & sort can be satisfied with it?
why does Postgresql not use the t1_idtmp index as both search & sort can be satisfied with it?
Because the sort can't be satisfied by it. A btree index on (updatedat, reader, imgcap, objid, cam, image) can only produce data ordered by reader, imgcap, objid, cam, image within ties of updatedat. So if your condition was for a specific value of updatedat, that would work. But since it is for a range of updatedat, that won't work, as the rows are not all tied with each other.
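The point about ties can be illustrated with a small Python model. This is only a sketch: the rows are invented, and a sorted list of (updatedat, reader) tuples stands in for a btree on those two columns.

```python
# Hypothetical rows as (updatedat, reader); values are made up for illustration.
rows = [
    ("20:30:00", "r2"),
    ("20:30:00", "r1"),
    ("20:31:00", "r1"),
    ("20:31:00", "r3"),
    ("20:32:00", "r2"),
]

# A btree on (updatedat, reader) keeps entries in this order:
# updatedat first, reader only breaks ties.
index_order = sorted(rows)

def readers_in_index_order(lo, hi):
    """Readers in the order an index range scan on updatedat would return them."""
    return [reader for updatedat, reader in index_order if lo <= updatedat <= hi]

single_value = readers_in_index_order("20:31:00", "20:31:00")
whole_range = readers_in_index_order("20:30:00", "20:32:00")

# Second item printed: is the output already sorted by reader?
print(single_value, single_value == sorted(single_value))
print(whole_range, whole_range == sorted(whole_range))
```

For a single updatedat value the readers come out already sorted, so no extra sort is needed. For a range they come out as r1, r2, r1, r3, r2, which is why an ORDER BY on reader still needs a sort step.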

UPDATE query for 180k rows in 10M row table unexpectedly slow

I have a table that is getting too big and I want to reduce its size
with an UPDATE query. Some of the data in this table is redundant, and
I should be able to reclaim a lot of space by setting the redundant
"cells" to NULL. However, my UPDATE queries are taking excessive
amounts of time to complete.
Table details
-- table1 10M rows (estimated)
-- 45 columns
-- Table size 2200 MB
-- Toast Table size 17 GB
-- Indexes Size 1500 MB
-- **columns in query**
-- id integer primary key
-- testid integer foreign key
-- band integer
-- date timestamptz indexed
-- data1 real[]
-- data2 real[]
-- data3 real[]
This was my first attempt at an update query. I broke it up into some
temporary tables just to get the id's to update. Further, to reduce the
query, I selected a date range for June 2020
CREATE TEMP TABLE A as
SELECT testid
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3;
CREATE TEMP TABLE B as -- this table has 180k rows
SELECT id
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND testid in (SELECT testid FROM A)
AND band > 1;
UPDATE table1
SET data1 = Null, data2 = Null, data3 = Null
WHERE id in (SELECT id FROM B);
Queries for creating TEMP tables execute in under 1 sec. I ran the UPDATE query for an hour(!) before I finally killed it. Only 180k
rows needed to be updated. It doesn't seem like it should take that much
time to update that many rows. Temp table B identifies exactly which
rows to update.
Here is the EXPLAIN from the above UPDATE query. One of the odd features of this explain is that it shows 4.88M rows, but there are only 180k rows to update.
Update on table1 (cost=3212.43..4829.11 rows=4881014 width=309)
-> Nested Loop (cost=3212.43..4829.11 rows=4881014 width=309)
-> HashAggregate (cost=3212.00..3214.00 rows=200 width=10)
-> Seq Scan on b (cost=0.00..2730.20 rows=192720 width=10)
-> Index Scan using table1_pkey on table1 (cost=0.43..8.07 rows=1 width=303)
Index Cond: (id = b.id)
Another way to run this query is in one shot:
WITH t as (
SELECT id from table1
WHERE testid in (
SELECT testid
from table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3
)
)
UPDATE table1 a
SET data1 = Null, data2 = Null, data3 = Null
FROM t
WHERE a.id = t.id
I only ran this one for about 10 minutes before I killed it. It feels like I should be able to run this query in much less time if I just knew the tricks. The EXPLAIN for this query is below. It shows 195k rows, which is closer to what I expect, but the cost is much higher: roughly 1.3M to 1.7M.
Update on testlog a (cost=1337986.60..1740312.98 rows=195364 width=331)
CTE t
-> Hash Join (cost=8834.60..435297.00 rows=195364 width=4)
Hash Cond: (testlog.testid = testlog_1.testid)
-> Seq Scan on testlog (cost=0.00..389801.27 rows=9762027 width=8)
-> Hash (cost=8832.62..8832.62 rows=158 width=4)
-> HashAggregate (cost=8831.04..8832.62 rows=158 width=4)
-> Index Scan using amptest_testlog_date_idx on testlog testlog_1 (cost=0.43..8820.18 rows=4346 width=4)
Index Cond: ((date >= '2020-06-01 00:00:00-07'::timestamp with time zone) AND (date <= '2020-07-01 00:00:00-07'::timestamp with time zone))
Filter: (band = 3)
-> Hash Join (cost=902689.61..1305015.99 rows=195364 width=331)
Hash Cond: (t.id = a.id)
-> CTE Scan on t (cost=0.00..3907.28 rows=195364 width=32)
-> Hash (cost=389801.27..389801.27 rows=9762027 width=303)
-> Seq Scan on testlog a (cost=0.00..389801.27 rows=9762027 width=303)
Edit: one of the suggestions in the accepted answer was to drop any indexes before the update and then add them back later. This is what I went with, with a twist: I needed another table to hold indexed data from the dropped indexes to make the A and B queries faster:
CREATE TABLE tempid AS
SELECT id, testid, band, date
FROM table1
I made indexes on this table for id, testid, and date. Then I replaced table1 in the A and B queries with tempid. It still went slower than I would have liked, but it did get the job done.
You might have another table with a foreign key that references one or more of the columns you are setting to NULL, and that other table might not have an index on the referencing column.
Each time you set the row value to NULL the database has to check the foreign table - maybe it has a row that references the value you are removing.
If this is the case you should be able to speed it up by adding an index on this remote table.
For example if you have a table like this:
create table table2 (
id serial primary key,
band integer references table1(data1)
)
Then you can create an index: create index table2_band_nnull_idx on table2(band) where band is not null;
But you suggested that all columns you are setting to NULL have array type. This means that it is unlikely that they are referenced. Still it is worth checking.
Another possibility is that you have a trigger on the table that works slowly.
Another possibility is that you have a lot of indexes on the table. Each index has to be updated for each row you update and it can use only a single processor core.
Sometimes it is faster to drop all indexes, do the bulk update and then recreate them. Creating indexes can use multiple cores - one core per index.
Another possibility is that your query is waiting for some other query to finish and release its locks. You should check with:
select now()-query_start, * from pg_stat_activity where state<>'idle' order by 1;

Strange PostgreSQL index usage while using LIMIT..OFFSET

PostgreSQL 9.6.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
Table and indices:
create table if not exists orders
(
id bigserial not null constraint orders_pkey primary key,
partner_id integer,
order_id varchar,
date_created date,
state_code integer,
state_date timestamp,
recipient varchar,
phone varchar
);
create index if not exists orders_partner_id_index on orders (partner_id);
create index if not exists orders_order_id_index on orders (order_id);
create index if not exists orders_partner_id_date_created_index on orders (partner_id, date_created);
The task is to create paging/sorting/filtering data.
The query for the first page:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 0;
The query plan:
QUERY PLAN
"Limit (cost=19495.48..38990.41 rows=10 width=91)"
" -> Index Scan using orders_order_id_index on orders (cost=0.56..41186925.66 rows=21127 width=91)"
" Filter: ((date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date) AND (partner_id = 1))"
Index orders_partner_id_date_created_index is not used, so the cost is extremely high!
But starting from some offset values (the exact value differs from time to time, looks like it depends on total row count) the index starts to be used:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 40;
Plan:
QUERY PLAN
"Limit (cost=81449.76..81449.79 rows=10 width=91)"
" -> Sort (cost=81449.66..81502.48 rows=21127 width=91)"
" Sort Key: order_id"
" -> Bitmap Heap Scan on orders (cost=4241.93..80747.84 rows=21127 width=91)"
" Recheck Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
" -> Bitmap Index Scan on orders_partner_id_date_created_index (cost=0.00..4236.65 rows=21127 width=0)"
" Index Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
What's happening? Is there a way to force the server to use the index?
General answer:
Postgres stores some information (statistics) about your tables.
Before executing the query, the planner prepares an execution plan based on that information.
In your case, the planner thinks that for certain offset values this sub-optimal plan will be better. Note that your desired plan requires sorting all selected rows by order_id, while this "worse" plan does not. I'd guess that Postgres bets that there will be quite a lot of matching rows spread across the orders, so it just tests one order after another, starting from the lowest.
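The offset dependence can be reproduced with a rough model of how the planner prices a Limit over an ordered scan: it assumes matching rows are spread evenly, so fetching the first OFFSET + LIMIT rows costs a proportional share of the full scan. The sketch below is an approximation, not PostgreSQL's actual code; the function name is made up, the numbers are copied from the two plans in the question, and treating the sort plan's cost as fixed at the value shown for offset 40 is a simplification.

```python
def limit_total_cost(startup, total, est_rows, offset, limit):
    """Linear interpolation: cost of fetching the first offset+limit rows of a child plan."""
    fraction = min(1.0, (offset + limit) / est_rows)
    return startup + (total - startup) * fraction

# Index Scan using orders_order_id_index: cost=0.56..41186925.66 rows=21127
INDEX_SCAN = dict(startup=0.56, total=41186925.66, est_rows=21127)
# Total cost of the Limit over Sort over Bitmap Heap Scan plan (assumed constant here).
SORT_PLAN_TOTAL = 81449.79

for offset in (0, 10, 20, 30, 40):
    cost = limit_total_cost(offset=offset, limit=10, **INDEX_SCAN)
    winner = "index scan on order_id" if cost < SORT_PLAN_TOTAL else "bitmap scan + sort"
    print(f"offset={offset:>2}  ordered-scan cost={cost:>10.2f}  cheaper: {winner}")
```

Under this model the ordered index scan looks cheaper up to offset 30 and the bitmap scan plus sort wins from offset 40, which matches the switch described in the question. The model also gives about 38990.41 for OFFSET 10 LIMIT 10, the total on the Limit line of the first plan, so that plan may have been captured with a non-zero offset.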
I can think of two solutions:
A) provide more data to the planner, by running
ANALYZE orders;
(https://www.postgresql.org/docs/9.6/sql-analyze.html)
or by changing the gathered statistics
ALTER TABLE orders ALTER COLUMN ... SET STATISTICS ...;
(https://www.postgresql.org/docs/9.6/planner-stats.html)
B) Rewrite query in a way that hints desired index usage, like this:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
)
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Or maybe even better:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
),
all_data AS (
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
)
SELECT *
FROM all_data
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Disclaimer - I can't explain why the first query should be interpreted differently by the Postgres planner, I just think it could be. On the other hand, the second query separates offsets/limits from joins, and I'd be very surprised if Postgres still did it the "bad" (according to your benchmarks) way.

How to speed up PostgreSQL aggregate select with sub queries and case statements

Background: I have a table containing financial transaction records. The table has several tens of millions of rows for tens of thousands of users. I need to fetch the sum of the transactions for showing balances and other aspects of the site.
My current query can get extremely slow and often times out. I have tried optimizing the query but can't seem to get it to run efficiently.
Environment: My application is running on Heroku using a Postgres Standard-2 plan (8GB ram, 400 max connections, 256GB allowed storage). My max connections at any given time is about 20 and my current DB size is 35GB. According to statistics, this query runs on average about 1,000ms and is used very frequently which has a big impact on site performance.
For the database, the index cache hit rate is 99% and the table cache hit rate is 97%. Autovacuum runs about every other day based on the current thresholds.
Here's my current transactions table setup:
CREATE TABLE transactions (
id bigint DEFAULT nextval('transactions_id_seq'::regclass) NOT NULL,
user_id integer NOT NULL,
date timestamp without time zone NOT NULL,
amount numeric(15,2) NOT NULL,
transaction_type integer DEFAULT 0 NOT NULL,
account_id integer DEFAULT 0,
reconciled integer DEFAULT 0,
parent integer DEFAULT 0,
ccparent integer DEFAULT 0,
created_at timestamp without time zone DEFAULT now() NOT NULL
);
CREATE INDEX transactions_user_id_key ON transactions USING btree (user_id);
CREATE INDEX transactions_user_date_idx ON transactions (user_id, date);
CREATE INDEX transactions_user_ccparent_idx ON transactions (user_id, ccparent) WHERE ccparent >0;
And here's my current query:
SELECT account_id,
sum(deposit) - sum(withdrawal) AS balance,
sum(r_deposit)-sum(r_withdrawal) AS r_balance,
sum(deposit) AS o_deposit,
sum(withdrawal) AS o_withdrawal,
sum(r_deposit) AS r_deposit,
sum(r_withdrawal) AS r_withdrawal
FROM
(SELECT t.account_id,
CASE
WHEN transaction_type > 0 THEN sum(amount)
ELSE 0
END AS deposit,
CASE
WHEN transaction_type = 0 THEN sum(amount)
ELSE 0
END AS withdrawal,
CASE
WHEN transaction_type > 0 AND reconciled=0 THEN sum(amount)
ELSE 0
END AS r_deposit,
CASE
WHEN transaction_type = 0 AND reconciled=0 THEN sum(amount)
ELSE 0
END AS r_withdrawal
FROM transactions AS t
WHERE user_id = $1 AND parent=0 AND ccparent=0
GROUP BY transaction_type, account_id, reconciled ) AS t0
GROUP BY account_id;
The query has several parts. I have to get the following for each account the user has:
1) the overall account balance
2) the balance for all reconciled transactions
3) separately, the sum of all deposits, withdrawals, reconciled deposits and reconciled withdrawals.
Here's one query plan when I run explain analyze on the query:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=13179.85..13180.14 rows=36 width=132) (actual time=1326.200..1326.204 rows=6 loops=1)
Group Key: t.account_id
-> HashAggregate (cost=13179.29..13179.58 rows=36 width=18) (actual time=1326.163..1326.171 rows=16 loops=1)
Group Key: t.transaction_type, t.account_id, t.reconciled
-> Bitmap Heap Scan on transactions t (cost=73.96..13132.07 rows=13491 width=18) (actual time=17.410..1317.863 rows=12310 loops=1)
Recheck Cond: (user_id = 1)
Filter: ((parent = 0) AND (ccparent = 0))
Rows Removed by Filter: 2
Heap Blocks: exact=6291
-> Bitmap Index Scan on transactions_user_id_key (cost=0.00..73.29 rows=13601 width=0) (actual time=15.901..15.901 rows=12343 loops=1)
Index Cond: (user_id = 1)
Planning time: 0.895 ms
Execution time: 1326.424 ms
Does anyone have any suggestions on how to speed up this query? Like I said, it's the most run query in my application and is also one of the most demanding on the DB. If I could optimize this, it would have tremendous benefits to the app in general.
Check whether it picks up an index on transactions (user_id, parent, ccparent, transaction_type, account_id, reconciled).
CREATE INDEX transactions_u_p_ccp_tt_a_r_idx
ON transactions
(user_id,
parent,
ccparent,
transaction_type,
account_id,
reconciled);
Maybe you can even include amount in the index.
CREATE INDEX transactions_u_p_ccp_tt_a_r_a_idx
ON transactions
(user_id,
parent,
ccparent,
transaction_type,
account_id,
reconciled,
amount);