Postgres: Optimisation for query "WHERE id IN (...)"

I have a table (2M+ records) which keeps track of a ledger.
Some entries add points, while others subtract points (there are only these two kinds of entries). Each subtracting entry references the adding entry it subtracts from via referenceentryid; adding entries always have NULL in referenceentryid.
The table has a dead column, which is set to true by a worker when an addition is depleted or expired, or when a subtraction points at a "dead" addition. Since the table has a partial index on dead = false, SELECTs on live rows are pretty fast.
My problem is the performance of the worker that sets dead to true.
The flow would be:
1. Get an entry for each addition indicating the amount added, the amount subtracted, and whether it has expired.
2. Filter out entries which are neither expired nor fully consumed (addition still exceeds subtraction).
3. Update dead = true on every row whose id or referenceentryid is in the filtered set of entries.
WITH entries AS
(
    SELECT
        additions.id AS id,
        SUM(subtractions.amount) AS subtraction,
        additions.amount AS addition,
        additions.expirydate <= now() AS expired
    FROM
        loyalty_ledger AS subtractions
    INNER JOIN
        loyalty_ledger AS additions
    ON
        additions.id = subtractions.referenceentryid
    WHERE
        subtractions.dead = FALSE
        AND subtractions.referenceentryid IS NOT NULL
    GROUP BY
        subtractions.referenceentryid, additions.id
), dead_entries AS (
    SELECT
        id
    FROM
        entries
    WHERE
        subtraction >= addition OR expired = TRUE
)
-- THE SLOW BIT:
SELECT
    *
FROM
    loyalty_ledger AS ledger
WHERE
    ledger.dead = FALSE AND
    (ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
In the query above, the inner part runs pretty fast (a few seconds), while the last part runs forever.
I have the following table and indexes:
CREATE TABLE IF NOT EXISTS loyalty_ledger (
    id SERIAL PRIMARY KEY,
    programid bigint NOT NULL,
    FOREIGN KEY (programid) REFERENCES loyalty_programs(id) ON DELETE CASCADE,
    referenceentryid bigint,
    FOREIGN KEY (referenceentryid) REFERENCES loyalty_ledger(id) ON DELETE CASCADE,
    customerprofileid bigint NOT NULL,
    FOREIGN KEY (customerprofileid) REFERENCES customer_profiles(id) ON DELETE CASCADE,
    amount int NOT NULL,
    expirydate TIMESTAMPTZ,
    dead boolean DEFAULT false,
    expired boolean DEFAULT false
);
CREATE INDEX loyalty_ledger_referenceentryid_idx ON loyalty_ledger (referenceentryid) WHERE dead = false;
CREATE INDEX loyalty_ledger_customer_program_idx ON loyalty_ledger (customerprofileid, programid) WHERE dead = false;
I'm trying to optimise the last part of the query.
EXPLAIN gives me the following:
"Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger ledger (cost=103412.24..4976040812.22 rows=986583 width=67)"
" Filter: ((SubPlan 3) OR (SubPlan 4))"
" CTE entries"
" -> GroupAggregate (cost=1.47..97737.83 rows=252177 width=25)"
" Group Key: subtractions.referenceentryid, additions.id"
" -> Merge Join (cost=1.47..91390.72 rows=341928 width=28)"
" Merge Cond: (subtractions.referenceentryid = additions.id)"
" -> Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger subtractions (cost=0.43..22392.56 rows=341928 width=12)"
" Index Cond: (referenceentryid IS NOT NULL)"
" -> Index Scan using loyalty_ledger_pkey on loyalty_ledger additions (cost=0.43..80251.72 rows=1683086 width=16)"
" CTE dead_entries"
" -> CTE Scan on entries (cost=0.00..5673.98 rows=168118 width=4)"
" Filter: ((subtraction >= addition) OR expired)"
" SubPlan 3"
" -> CTE Scan on dead_entries (cost=0.00..3362.36 rows=168118 width=4)"
" SubPlan 4"
" -> CTE Scan on dead_entries dead_entries_1 (cost=0.00..3362.36 rows=168118 width=4)"
Seems like the last part of my query is very inefficient. Any ideas on how to speed it up?

For large datasets, I have found semi-joins to perform much better than IN lists:
FROM
    loyalty_ledger AS ledger
WHERE
    ledger.dead = FALSE AND (
        EXISTS (
            SELECT null
            FROM dead_entries d
            WHERE d.id = ledger.id
        ) OR
        EXISTS (
            SELECT null
            FROM dead_entries d
            WHERE d.id = ledger.referenceentryid
        )
    )
I honestly don't know, but each of these would also be worth a try: less code and more intuitive, though there is no guarantee they will perform better:
ledger.dead = FALSE AND
EXISTS (
    SELECT null
    FROM dead_entries d
    WHERE d.id = ledger.id OR d.id = ledger.referenceentryid
)
or
ledger.dead = FALSE AND
EXISTS (
    SELECT null
    FROM dead_entries d
    WHERE d.id IN (ledger.id, ledger.referenceentryid)
)

What helped me in the end was to move the id filtering into the second WITH step, replacing the IN subqueries with a join. Building entries from additions with a LEFT JOIN also picks up additions that have no subtractions at all, which the original INNER JOIN silently skipped:
WITH entries AS
(
    SELECT
        additions.id AS id,
        additions.amount - coalesce(SUM(subtractions.amount), 0) AS balance,
        additions.expirydate <= now() AS passed_expiration
    FROM
        loyalty_ledger AS additions
    LEFT JOIN
        loyalty_ledger AS subtractions
    ON
        subtractions.dead = FALSE AND
        additions.id = subtractions.referenceentryid
    WHERE
        additions.dead = FALSE AND additions.referenceentryid IS NULL
    GROUP BY
        subtractions.referenceentryid, additions.id
), dead_rows AS (
    SELECT
        l.id AS id,
        -- only additions that still have usable points can expire
        l.referenceentryid IS NULL AND e.balance > 0 AND e.passed_expiration AS expired
    FROM
        loyalty_ledger AS l
    INNER JOIN
        entries AS e
    ON
        (l.id = e.id OR l.referenceentryid = e.id)
    WHERE
        l.dead = FALSE AND
        (e.balance <= 0 OR e.passed_expiration)
    ORDER BY e.balance DESC
)
UPDATE
    loyalty_ledger AS l
SET
    (dead, expired) = (TRUE, d.expired)
FROM
    dead_rows AS d
WHERE
    l.id = d.id AND
    l.dead = FALSE;

I also believe
-- THE SLOW BIT:
SELECT
    *
FROM
    loyalty_ledger AS ledger
WHERE
    ledger.dead = FALSE AND
    (ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
can be rewritten into a JOIN and a UNION ALL, which will most likely generate another execution plan and might be faster.
It's hard to verify without the other table structures, though.
SELECT
    *
FROM
    loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
    ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
    *
FROM
    loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
    ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
And because CTEs in PostgreSQL are materialized and not indexed (before PostgreSQL 12 they always are; from 12 onward a CTE referenced more than once is still materialized by default), you are most likely better off dropping the dead_entries CTE and repeating its definition outside:
SELECT
    *
FROM
    loyalty_ledger AS ledger
INNER JOIN (SELECT
                id
            FROM
                entries
            WHERE
                subtraction >= addition OR expired = TRUE) AS dead_entries
    ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
    *
FROM
    loyalty_ledger AS ledger
INNER JOIN (SELECT
                id
            FROM
                entries
            WHERE
                subtraction >= addition OR expired = TRUE) AS dead_entries
    ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
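On PostgreSQL 12 and later there is also a middle ground: keep the CTE but mark it NOT MATERIALIZED, which lets the planner merge it into the parent query instead of materializing it. A minimal sketch, assuming PG 12+ (note that a CTE referenced twice gets computed twice when inlined):
WITH entries AS (
    -- same aggregation as in the original query
    SELECT
        additions.id AS id,
        SUM(subtractions.amount) AS subtraction,
        additions.amount AS addition,
        additions.expirydate <= now() AS expired
    FROM loyalty_ledger AS subtractions
    INNER JOIN loyalty_ledger AS additions
        ON additions.id = subtractions.referenceentryid
    WHERE subtractions.dead = FALSE
      AND subtractions.referenceentryid IS NOT NULL
    GROUP BY subtractions.referenceentryid, additions.id
), dead_entries AS NOT MATERIALIZED ( -- PG 12+: allow inlining into the parent query
    SELECT id FROM entries
    WHERE subtraction >= addition OR expired = TRUE
)
SELECT *
FROM loyalty_ledger AS ledger
WHERE ledger.dead = FALSE
  AND (ledger.id IN (SELECT id FROM dead_entries)
       OR ledger.referenceentryid IN (SELECT id FROM dead_entries));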

Related

Postgres Query Optimization without adding an extra index

I was trying to optimize this query differently, but before that: can we make any slight change to this query to reduce the time without adding any index?
Postgres version: 13.5
Query:
SELECT
    orders.id as order_id,
    orders.*, -- just get the columns that you really need, never use * in production
    u1.name as user_name,
    u2.name as driver_name,
    u3.name as payment_by_name,
    referrals.name as ref_name,
    array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
    array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
    array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
    array_to_string(array_agg(orders_payments.tips), ',') as tips,
    array_to_string(array_agg(locations.name), ',') as location_name,
    (select SUM(order_items.tax) as tax
     from order_items
     where order_items.order_id = orders.id and order_items.deleted = 'f'
    ) as tax,
    (select SUM(orders_surcharges.surcharge_tax) as surcharge_tax
     from orders_surcharges
     where orders_surcharges.order_id = orders.id
    )
FROM "orders"
LEFT JOIN users as u1 ON u1.id = orders.user_id
LEFT JOIN users as u2 ON u2.id = orders.driver_id
LEFT JOIN users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN referrals ON referrals.id = orders.referral_id
INNER JOIN locations ON locations.id = orders.location_id
LEFT JOIN orders_payments ON orders_payments.order_id = orders.id
WHERE
    (orders.company_id = '626')
    AND (orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00')
    AND orders.order_status_id NOT IN (10, 5, 50)
GROUP BY
    orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
    created_at ASC
LIMIT 300 OFFSET 0
Current Index:
"orders_pkey" PRIMARY KEY, btree (id)
"idx_orders_company_and_location" btree (company_id, location_id)
"idx_orders_created_at" btree (created_at)
"idx_orders_customer_id" btree (customer_id)
"idx_orders_location_id" btree (location_id)
"idx_orders_order_status_id" btree (order_status_id)
Execution Plan
Most of the time seems to go into the parallel heap scan.
You're looking for 300 orders and trying to get some additional information about those records. I would see if I could first get these 300 records, instead of fetching all the data and then limiting it to 300. Something like this:
WITH orders_300 AS (
    SELECT orders.*, -- just get the columns that you really need, never use * in production
           locations.name AS location_name
    FROM orders
    INNER JOIN locations ON locations.id = orders.location_id
    WHERE orders.company_id = '626'
      AND orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00'
      AND orders.order_status_id NOT IN (10, 5, 50)
    ORDER BY created_at ASC
    LIMIT 300
    OFFSET 0
)
SELECT
    orders.id as order_id,
    orders.*, -- just get the columns that you really need, never use * in production
    u1.name as user_name,
    u2.name as driver_name,
    u3.name as payment_by_name,
    referrals.name as ref_name,
    array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
    array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
    array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
    array_to_string(array_agg(orders_payments.tips), ',') as tips,
    array_to_string(array_agg(orders.location_name), ',') as location_name,
    (SELECT SUM(order_items.tax) as tax
     FROM order_items
     WHERE order_items.order_id = orders.id
       AND order_items.deleted = 'f'
    ) as tax,
    (SELECT SUM(orders_surcharges.surcharge_tax) as surcharge_tax
     FROM orders_surcharges
     WHERE orders_surcharges.order_id = orders.id
    )
FROM "orders_300" AS orders
LEFT JOIN users as u1 ON u1.id = orders.user_id
LEFT JOIN users as u2 ON u2.id = orders.driver_id
LEFT JOIN users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN referrals ON referrals.id = orders.referral_id
LEFT JOIN orders_payments ON orders_payments.order_id = orders.id
GROUP BY
    -- note: orders_300 is a CTE, so PostgreSQL cannot infer functional
    -- dependency from orders.id alone; you may need to list every
    -- non-aggregated column you keep in the select list here
    orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
    created_at;
This will at least have a huge impact on the slowest part of your query: all those index scans on orders_payments. Every single scan is fast, but the query is doing 165,000 of them... Limit that to just 300 and it will be much faster.
Another issue is that none of your indexes covers the entire WHERE condition on the table "orders". But if you can't create a new index, you're out of luck.
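For reference only, since the question rules out adding an index: one that covers the whole WHERE clause could look something like the following (hypothetical name; the equality column first, then the range column, while order_status_id can only act as an in-index filter because of the NOT IN):
CREATE INDEX idx_orders_company_created_status
    ON orders (company_id, created_at, order_status_id);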

Indexes for optimising SQL Joins in Postgres

Given the below query
SELECT * FROM A
INNER JOIN B ON (A.b_id = B.id)
WHERE (A.user_id = 'XXX' AND B.provider_id = 'XXX' AND A.type = 'PENDING')
ORDER BY A.created_at DESC LIMIT 1;
The variable values in the query are A.user_id and B.provider_id, the type is always queried on 'PENDING'.
I am planning to add a compound + partial index on A
A(user_id, created_at) where type = 'PENDING'
Also the number of records in A >> B.
A.user_id, B.provider_id, and A.b_id are all foreign keys. Is there any way I can optimize the query?
Given that you are doing an inner join, I would first express the query as follows, with the join in the opposite direction:
SELECT *
FROM B
INNER JOIN A ON A.b_id = B.id
WHERE A.user_id = 'XXX' AND A.type = 'PENDING' AND
B.provider_id = 'XXX'
ORDER BY
A.created_at DESC
LIMIT 1;
Then I would add the following index to the A table:
CREATE INDEX idx_a ON A (user_id, type, created_at, b_id);
This four-column index should cover the join from B to A, as well as the WHERE clause, and also the ORDER BY sort at the end of the query. Note that we could probably also have left the join order as you originally wrote it, and this index could still be used.
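Since the question states that type is always queried as 'PENDING', a partial index is a plausible alternative to the four-column one; a sketch with a hypothetical name (the type column moves into the predicate, so it no longer needs to be a key column):
CREATE INDEX idx_a_pending
    ON A (user_id, created_at, b_id)
    WHERE type = 'PENDING';
This keeps the index smaller and still supports the WHERE clause, the ORDER BY ... LIMIT 1 (via a backward scan on created_at), and the join column b_id.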

Cancelled amount and a corresponding entry - Postgres

I have the payment table:
There could be erroneous entries when a payment was made by mistake (see row 5) which then gets cancelled out (see row 6). I cannot figure out a query that cancels not only the negative amounts but also their corresponding pairs. Here is the desired outcome:
There are also cases where several wrong payments were made, and then I need to cancel out all payments which, summed up, give the cancelled amount.
The desired outcome:
I found Remove Rows That Sum Zero For A Given Key, Selecting positive aggregate value and ignoring negative in Postgres SQL and https://www.sqlservercentral.com/forums/topic/select-all-negative-values-that-have-a-positive-value but they are not exactly what I need.
I don't mind cases like case 2; at the very least I need a reliable way to exclude pairs like 5;-5.
You can try this for deleting the rows from the table:
WITH RECURSIVE cancel_list (id, total_cancel, sum_cancel, index_to_cancel) AS (
    SELECT p.id, abs(p.amount), 0, array[p.index]
    FROM payment_table AS p
    WHERE p.amount < 0
      AND p.id = id_to_check_and_cancel -- this condition can be dropped in order to go through the full payment table
    UNION ALL
    SELECT DISTINCT ON (l.id) l.id, l.total_cancel, l.sum_cancel + p.amount, l.index_to_cancel || p.index
    FROM cancel_list AS l
    INNER JOIN payment_table AS p
        ON p.id = l.id
    WHERE l.sum_cancel + p.amount <= l.total_cancel
      AND NOT l.index_to_cancel @> array[p.index] -- this condition avoids loops
)
DELETE FROM payment_table AS p
USING (SELECT DISTINCT ON (c.id) c.id, unnest(c.index_to_cancel) AS index_to_cancel
       FROM cancel_list AS c
       ORDER BY c.id, array_length(c.index_to_cancel, 1) DESC
      ) AS c
WHERE p.index = c.index_to_cancel;
You can try this for just querying the table without the cancelled rows:
WITH RECURSIVE cancel_list (id, total_cancel, sum_cancel, index_to_cancel) AS (
    SELECT p.id, abs(p.amount), 0, array[p.index]
    FROM payment_table AS p
    WHERE p.amount < 0
      AND p.id = id_to_check_and_cancel -- this condition can be dropped in order to go through the full payment table
    UNION ALL
    SELECT DISTINCT ON (l.id) l.id, l.total_cancel, l.sum_cancel + p.amount, l.index_to_cancel || p.index
    FROM cancel_list AS l
    INNER JOIN payment_table AS p
        ON p.id = l.id
    WHERE l.sum_cancel + p.amount <= l.total_cancel
      AND NOT l.index_to_cancel @> array[p.index] -- this condition avoids loops
)
SELECT *
FROM payment_table AS p
LEFT JOIN (SELECT DISTINCT ON (c.id) c.id, c.index_to_cancel
           FROM cancel_list AS c
           ORDER BY c.id, array_length(c.index_to_cancel, 1) DESC
          ) AS c
    ON c.index_to_cancel @> array[p.index]
WHERE c.index_to_cancel IS NULL;

PostgreSQL - Slow Count

I need to write a one-time query. It will be run once, and the data will be moved to another system (AWS Personalize). It does not need to be heavily optimized, but it at least needs to be sped up enough that migrating the data is possible at all.
Coming from MySQL, I thought this would not be a problem, but after a lot of reading it seems COUNT is handled differently in PostgreSQL. With all that said, this is the query, reduced in size. There are several other joins (removed from this example), but they are not an issue, at least judging by the query plan.
explain
SELECT DISTINCT ON (p.id)
    'plan_progress' AS EVENT_TYPE,
    '-1' AS EVENT_VALUE,
    extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
INNER JOIN schedules sch ON p.id = sch.plan_id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
  AND (select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE') = 1
The issue is here:
select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE'
The id field in the schedules table is a uuid.
I have tried lots of things, but they all end up the same or worse.
I have read somewhere that it is possible to use row estimates in cases like this, but I honestly have no idea how to do that here.
This is the query plan:
Unique (cost=0.99..25152516038.36 rows=100054 width=88)
-> Nested Loop (cost=0.99..25152515788.22 rows=100054 width=88)
-> Index Only Scan using idx_schedules_plan_id_done_date on schedules sch (cost=0.56..25152152785.84 rows=107641 width=16)
Filter: ((SubPlan 1) = 1)
SubPlan 1
-> Aggregate (cost=1168.28..1168.29 rows=1 width=8)
-> Bitmap Heap Scan on schedules s (cost=14.78..1168.13 rows=58 width=16)
Recheck Cond: (plan_id = sch.plan_id)
Filter: ((status)::text = 'DONE'::text)
-> Bitmap Index Scan on idx_schedules_plan_id_done_date (cost=0.00..14.77 rows=294 width=0)
Index Cond: (plan_id = sch.plan_id)
-> Index Scan using plans_pkey on plans p (cost=0.42..3.37 rows=1 width=24)
Index Cond: (id = sch.plan_id)
Filter: ((continuous IS NOT TRUE) AND ((status)::text = 'ENDED'::text))
- You are not selecting any column from the schedules table, so it can be omitted from the main query and put into an EXISTS() term.
- DISTINCT is probably not needed, assuming id is a PK.
- Maybe you don't need the COUNT() to be exactly one, but just > 0:
SELECT DISTINCT ON (p.id)
    'plan_progress' AS EVENT_TYPE,
    '-1' AS EVENT_VALUE,
    extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
  AND EXISTS (
        SELECT *
        FROM schedules sch
        WHERE p.id = sch.plan_id
  )
  AND EXISTS (
        SELECT *
        FROM schedules s
        WHERE s.plan_id = p.id
          AND s.status = 'DONE' -- <<-- Must there be EXACTLY ONE schedules record?
  );
Now you can see that the first EXISTS() is actually not needed: if the second one yields true, the first must yield true, too.
SELECT -- DISTINCT ON (p.id)
    'plan_progress' AS EVENT_TYPE,
    '-1' AS EVENT_VALUE,
    extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
  AND EXISTS (
        SELECT *
        FROM schedules s
        WHERE s.plan_id = p.id
          AND s.status = 'DONE'
  );
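If the count really does have to be exactly one (the question raised in the comment above), a grouped derived table is a sketch that still avoids running a correlated COUNT() once per row:
SELECT
    'plan_progress' AS EVENT_TYPE,
    '-1' AS EVENT_VALUE,
    extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
INNER JOIN (
    -- plan_ids with exactly one DONE schedule, computed once
    SELECT plan_id
    FROM schedules
    WHERE status = 'DONE'
    GROUP BY plan_id
    HAVING COUNT(*) = 1
) done ON done.plan_id = p.id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE;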

Strange PostgreSQL index usage while using LIMIT..OFFSET

PostgreSQL 9.6.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
Table and indices:
create table if not exists orders
(
    id bigserial not null constraint orders_pkey primary key,
    partner_id integer,
    order_id varchar,
    date_created date,
    state_code integer,
    state_date timestamp,
    recipient varchar,
    phone varchar
);
create index if not exists orders_partner_id_index on orders (partner_id);
create index if not exists orders_order_id_index on orders (order_id);
create index if not exists orders_partner_id_date_created_index on orders (partner_id, date_created);
The task is to implement paging/sorting/filtering of the data.
The query for the first page:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 0;
The query plan:
QUERY PLAN
"Limit (cost=19495.48..38990.41 rows=10 width=91)"
" -> Index Scan using orders_order_id_index on orders (cost=0.56..**41186925.66** rows=21127 width=91)"
" Filter: ((date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date) AND (partner_id = 1))"
Index orders_partner_id_date_created_index is not used, so the cost is extremely high!
But starting from some offset value (the exact value differs from time to time and looks like it depends on the total row count), the index starts to be used:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 40;
Plan:
QUERY PLAN
"Limit (cost=81449.76..81449.79 rows=10 width=91)"
" -> Sort (cost=81449.66..81502.48 rows=21127 width=91)"
" Sort Key: order_id"
" -> Bitmap Heap Scan on orders (cost=4241.93..80747.84 rows=21127 width=91)"
" Recheck Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
" -> Bitmap Index Scan on orders_partner_id_date_created_index (cost=0.00..4236.65 rows=21127 width=0)"
" Index Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
What's happening? Is there a way to force the server to use the index?
General answer:
- Postgres stores statistics about your tables.
- Before executing a query, the planner prepares an execution plan based on those statistics.
- In your case, the planner thinks that for a certain offset value this sub-optimal plan will be better. Note that your desired plan requires sorting all selected rows by order_id, while this "worse" plan does not. I'd guess that Postgres bets there will be quite a lot of matching rows and simply walks orders_order_id_index in order, testing one row after another, starting from the lowest order_id.
I can think of two solutions:
A) Provide more data to the planner, by running
ANALYZE orders;
(https://www.postgresql.org/docs/9.6/sql-analyze.html)
or by changing the gathered statistics, e.g.
ALTER TABLE orders ALTER COLUMN partner_id SET STATISTICS 1000;
(https://www.postgresql.org/docs/9.6/planner-stats.html)
B) Rewrite the query in a way that hints the desired index usage, like this:
WITH partner_date (partner_id, date_created) AS (
    SELECT 1,
           generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
)
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
  ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Or maybe even better:
WITH partner_date (partner_id, date_created) AS (
    SELECT 1,
           generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
),
all_data AS (
    SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
    FROM orders o
    JOIN partner_date pd
      ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
)
SELECT *
FROM all_data
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Disclaimer: I can't explain why the first query should be interpreted differently by the Postgres planner, I just think it could be. The second query, on the other hand, separates the offset/limit from the joins, and I'd be very surprised if Postgres still did it the "bad" (according to your benchmarks) way.
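A third option on this version is the classic OFFSET 0 fence: putting the filter into a subquery with OFFSET 0 stops the planner from pulling it up into the outer query, so it tends to filter first and sort the small result afterwards. A sketch only; verify the resulting plan with EXPLAIN:
SELECT order_id, date_created, recipient, phone, state_code, state_date
FROM (
    SELECT order_id, date_created, recipient, phone, state_code, state_date
    FROM orders
    WHERE partner_id = 1
      AND date_created BETWEEN '2019-04-01' AND '2019-04-30'
    OFFSET 0 -- optimization fence: keeps the subquery from being flattened
) sub
ORDER BY order_id ASC
LIMIT 10 OFFSET 0;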