PostgreSQL Nested Loop Join Performance

I have two tables, exchange_rate (100 thousand rows) and paid_date_t (9 million rows), with the structure below.
"exchange_rate"
Column | Type | Collation | Nullable | Default
-----------------------------+--------------------------+-----------+----------+---------
valid_from | timestamp with time zone | | |
valid_until | timestamp with time zone | | |
currency | text | | |
Indexes:
"exchange_rate_unique_valid_from_currency_key" UNIQUE, btree (valid_from, currency)
"exchange_rate_valid_from_gist_idx" gist (valid_from)
"exchange_rate_valid_from_until_currency_gist_idx" gist (valid_from, valid_until, currency)
"exchange_rate_valid_from_until_gist_idx" gist (valid_from, valid_until)
"exchange_rate_valid_until_gist_idx" gist (valid_until)
"paid_date_t"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+---------
currency | character varying(3) | | |
paid_date | timestamp without time zone | | |
Indexes:
"paid_date_t_paid_date_idx" btree (paid_date)
I am running the select query below, joining these tables on multiple join keys:
SELECT
paid_date
FROM exchange_rate erd
JOIN paid_date_t sspd
ON sspd.paid_date >= erd.valid_from AND sspd.paid_date < erd.valid_until
AND erd.currency = sspd.currency
WHERE sspd.currency != 'USD'
However, the query is inefficient and takes hours to execute. The query plan below shows that it is using a nested loop join.
Nested Loop (cost=0.28..44498192.71 rows=701389198 width=40)
-> Seq Scan on paid_date_t sspd (cost=0.00..183612.84 rows=2557615 width=24)
Filter: ((currency)::text <> 'USD'::text)
-> Index Scan using exchange_rate_valid_from_until_currency_gist_idx on exchange_rate erd (cost=0.28..16.53 rows=80 width=36)
Index Cond: (currency = (sspd.currency)::text)
Filter: ((sspd.paid_date >= valid_from) AND (sspd.paid_date < valid_until))
I have tried different indexing methods but got the same result. I know that the <= and >= operators do not support merge or hash joins.
Any ideas are appreciated.

You should create a smaller table with just a sample of the rows from paid_date_t in it. It is hard to optimize a query if it takes a very long time each time you try to test it.
Your btree index has the column tested for equality as the 2nd column, which is certainly less efficient. The better btree index for this query (as it is currently written) would be something like (currency, valid_from, valid_until).
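As a sketch (the index name is arbitrary):
create index exchange_rate_currency_valid_idx on exchange_rate (currency, valid_from, valid_until);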
For a gist index, you really want it to be on the time range, not on the separate end points of the range. You could either convert the table to hold a range type, or build a functional index to convert them on the fly (and then rewrite the query to use the same expression). This is complicated by the fact that your tables have different types due to the different handling of time zones. The index would look like:
create index on exchange_rate using gist (tstzrange(valid_from,valid_until), currency);
and then the ON condition would look like:
ON sspd.paid_date::timestamptz <@ tstzrange(erd.valid_from, erd.valid_until)
AND erd.currency = sspd.currency
It might be faster to have the order of the columns in the gist index be reversed from what I show, you should try it both ways on your own data and see.
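Putting the pieces together, the rewritten query would look roughly like this (a sketch; btree_gist is assumed so the text column can be part of a GiST index, which your existing multi-column GiST indexes already require):
create extension if not exists btree_gist;
SELECT paid_date
FROM exchange_rate erd
JOIN paid_date_t sspd
ON sspd.paid_date::timestamptz <@ tstzrange(erd.valid_from, erd.valid_until)
AND erd.currency = sspd.currency
WHERE sspd.currency != 'USD';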

PostgreSQL - how to improve this query/index

PostgreSQL 11.2
Settings:
shared_buffers = 1024MB
effective_cache_size = 2048MB
maintenance_work_mem = 320MB
checkpoint_completion_target = 0.5
wal_buffers = 3932kB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 64MB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
I've got a table with about 40M rows.
The query I'm running on it (more fields exist in the query; it's the WHERE clauses that count):
select id,name from my_table where
action_performed = true AND
should_still_perform_action = false AND
action_performed_at <= '2021-09-05 00:00:00.000'
LIMIT 100;
The date is just something I picked for this example.
The point is to get items that need to be processed. The client retrieving this data would then use the metadata to find a file and upload it to a cloud provider. This can take some time.
The timestamp condition is really only there to say "only process entries older than today" or the given timestamp, in general. The order in which they are returned is of no practical importance, since the goal is to perform processing on any entry that has not yet been processed. The LIMIT was introduced to stop the application doing so from hanging, because of the network activity.
Table definition (redacted):
Table "public.my_table"
action_performed_at | timestamp without time zone | | | now()
should_still_perform_action | boolean | | not null | true
action_performed | boolean | | not null | false
Indexes:
"index001" btree (action_performed_at, should_still_perform_action, action_performed) WHERE should_still_perform_action = false AND action_performed = true
"index002" btree (action_performed, should_still_perform_action, action_performed_at DESC) WHERE should_still_perform_action = false AND action_performed = true
These are all indexes added over time; all worked at the start but are no longer being used now.
Re-indexing also does not seem to work, only dropping and re-creating them works for a while.
While the table holds 40M rows, the number of rows matching these conditions is roughly 100K.
The query plan looks like this:
QUERY PLAN
------------------------------------------------------------------------------------
Limit (cost=0.00..707.80 rows=100 width=3595) (actual time=18520.627..100644.933 rows=100 loops=1)
Buffers: shared hit=0 read=1392361 dirtied=26 written=26
-> Seq Scan on my_table (cost=0.00..4164264.45 rows=5883377 width=3595) (actual time=18520.624..100644.073 rows=100 loops=1)
Filter: (action_performed AND (NOT should_still_perform_action) AND (action_performed_at <= '2021-09-05 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 19846606
Buffers: shared hit=0 read=1392361 dirtied=26 written=26
Planning Time: 63.667 ms
Execution Time: 100645.548 ms
(10 rows)
Using the query found here: https://github.com/ioguix/pgsql-bloat-estimation/blob/master/btree/btree_bloat-superuser.sql
This is the result:
current_database | schemaname | tblname | idxname | real_size | extra_size | extra_pct | fillfactor | bloat_size | bloat_pct | is_na
------------------+------------+------------------+---------------------------------------+-------------+-------------+------------------+------------+-------------+------------------+-------
mine | public | my_table | index001 | 343244800 | 341598208 | 99.5202863961814 | 90 | 341426176 | 99.4701670644391 | f
mine | public | my_table | index002 | 3290316800 | 2338521088 | 71.0728245985311 | 90 | 2231902208 | 67.832441180132 | f
And I'm looking for a way to do this better. Sure, I could drop and recreate the indexes every time I see them slow down, but that's not exactly a good way of doing things.
Changing LIMIT to FETCH changes nothing.
I'm wondering if I can improve this without changing SELECT to FETCH, which I've never used before and I'm not even sure the client can handle.
What should I do here?
EDIT:
After an analyze:
QUERY PLAN
------------------------------------------------------------------------------------
Limit (cost=0.00..690.14 rows=1000 width=3591) (actual time=0.044..5840.228 rows=1000 loops=1)
Buffers: shared hit=3 read=81426 dirtied=18 written=18
-> Seq Scan on my_table (cost=0.00..4163978.60 rows=6033500 width=3591) (actual time=0.034..5839.599 rows=100 loops=1)
Filter: (action_performed AND (NOT should_still_perform_action) AND (action_performed_at <= '2021-09-05 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 953640
Buffers: shared hit=3 read=81426 dirtied=18 written=18
Planning Time: 63.667 ms
Execution Time: 100645.548 ms
(10 rows)
The estimate (cost=0.00..4164264.45 rows=5883377 width=3595) shows the planner expects over 5M records to match the criteria, which is significantly different from the roughly 100K you mention.
In cases like this, `ANALYZE public.my_table;` usually helps. It refreshes the statistics for the table data.
Your main problem seems to be index bloat, which is caused by more DELETEs or UPDATEs than autovacuum can clean up. Solve that problem, and you should be fine.
Tune autovacuum to be as fast as possible:
ALTER TABLE my_table SET (autovacuum_vacuum_cost_delay = 0);
Also, give autovacuum enough memory by setting maintenance_work_mem to a high value up to 1GB.
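For example (a sketch; the value is illustrative and changing it needs superuser rights):
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();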
Then rebuild the indexes so that they are no longer bloated. If pgstattuple tells you that the table is bloated too, VACUUM (FULL) that table instead.
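On PostgreSQL 11 you can rebuild without blocking writes by building a replacement concurrently and swapping it in (a sketch; the definition is copied from the question and the _new suffix is illustrative):
CREATE INDEX CONCURRENTLY index002_new ON my_table (action_performed, should_still_perform_action, action_performed_at DESC) WHERE should_still_perform_action = false AND action_performed = true;
DROP INDEX CONCURRENTLY index002;
ALTER INDEX index002_new RENAME TO index002;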
Make sure that you don't have long running database transactions most of the time.
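Something like this will show open transactions and their age (a sketch against pg_stat_activity):
SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;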
By the way, this is the perfect index for this query:
CREATE INDEX ON my_table (action_performed_at)
WHERE action_performed AND NOT should_still_perform_action;
According to the plan, the entire time is spent scanning the table. A sequential scan is happening even though an index is available on the required columns. It seems the number of rows returned by this query is quite high, not only 100K rows.
If you check the condition below in the plan, around 20M rows are removed by the three filters used in the WHERE clause of the query:
Rows Removed by Filter: 19846606
Please check why the index isn't being picked: review the cardinality of the columns, how many rows the query actually returns, and when this table was last analyzed.
Is autovacuum enabled in this database? When did autovacuum last run for this table?
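A quick way to check, as a sketch against the standard statistics views:
SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';
SHOW autovacuum;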
Because of the statistics that are collected for the index, the distribution histogram presents detailed values only for the first column of the index key.
If this first column has only 2 values, the accuracy is poor and the optimizer will create a bad execution plan.
To get around this, you should place the column action_performed_at as the first column of the index key.
Another point is that you do not need to store a column in an index when it can only have a single value. When you create an index with a WHERE clause that relies on MyColumn = A_Single_Value, you can leave that column out of the index key.
Finally, you can use the INCLUDE clause, which MS SQL Server introduced 16 years ago and which has since arrived in PostgreSQL, to add columns that do not participate in any seek but are needed by the SELECT. The query can then be answered from the index alone, without the two-phase access of a seek on the index followed by a scan of the table.
So I would try an index like this one:
CREATE INDEX SQLpro__B6B13FC3_6F90_4EEC_BA61_CA6C96C7958A__20210914
ON my_table (action_performed_at)
INCLUDE (id, name)
WHERE action_performed = true AND
should_still_perform_action = false;

Strange sorting behavior with bigint column via GiST index in PostgreSQL

I'm working on implementing a fast text search in PostgreSQL 12.6 (Ubuntu 20.04.2 VBox) with custom sorting, and I'm using pg_trgm along with GiST (btree_gist) index for sorted output. The idea is to return top 5 matching artists that have the highest number of plays. The index is created like this:
create index artist_name_gist_idx on artist
using gist ("name" gist_trgm_ops, total_play_count) where active = true;
"name" here is of type varchar(255) and total_play_count is bigint, no nulls allowed.
When I query the table like this:
select id, name, total_play_count
from artist
where name ilike '%the do%' and active = true
order by total_play_count <-> 40312
limit 5;
I get the correct result:
id | name | total_play_count
--------+-------------------------+------------------
1757 | The Doors | 1863
733226 | Damsel in the Dollhouse | 1095
9758 | The Doobie Brothers | 1036
822805 | The Doubleclicks | 580
7236 | Slaughter and the Dogs | 258
I would get the same result if I replace total_play_count <-> 40312 with a simple total_play_count desc, but then I get an additional sort operation that I want to avoid. The number 40312 here is the current maximum value of this column, and the table itself contains 1612297 rows in total.
However, since total_play_count is of type bigint, I wanted to make this query more general (and faster) and use the maximum value for bigint, so I don't have to query for the max value every time. But when I update the ORDER BY clause with total_play_count <-> 9223372036854775807, I get the following result:
id | name | total_play_count
---------+-------------------------+------------------
1757 | The Doors | 1863
822805 | The Doubleclicks | 580
9758 | The Doobie Brothers | 1036
733226 | Damsel in the Dollhouse | 1095
1380763 | Bruce Bawss The Don | 10
The ordering here is broken, and it's even worse when I try the same approach on another table that has a lot more rows. There are no negative or overly large values, so overflow shouldn't be possible. Results of explain are almost identical:
Limit (cost=0.41..6.13 rows=5 width=34)
-> Index Scan using artist_name_gist_idx on artist (cost=0.41..184.44 rows=161 width=34)
Index Cond: ((name)::text ~~* '%the do%'::text)
Order By: (total_play_count <-> '9223372036854775807'::bigint)
What could be the issue here? Is this a bug with btree_gist, or am I missing something? I could settle for querying for the max value, but it worries me that there is a threshold that might be reached eventually and break the search, which would be a shame since I'm quite happy with the performance.
Update:
I've tried using the regular integer type instead of bigint, and then querying with its upper bound, total_play_count <-> 2147483647. It seems there is no such problem with it. Perhaps using bigint in the first place was somewhat optimistic, but I'll keep the question open in case anyone has an answer or a workaround.
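For reference, the fallback of querying for the max value first would look roughly like this (a sketch, using the value that currently works):
select max(total_play_count) from artist;  -- currently 40312
select id, name, total_play_count
from artist
where name ilike '%the do%' and active = true
order by total_play_count <-> 40312
limit 5;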

postgresql won't use index for character LIKE comparison during a nested loop

I've two tables:
alpha
string1 | other_attr|
------------|-----------|
y-foo-one | ...
y-foo-two |
y-foo-three |
y-baz-one |
y-bat-four |
y-baz-two |
beta
string2
---
foo
baz
bat
I would like to perform a fuzzy left join to get e.g.
string2 | string1 | other_attr |
--------|-------------|------------|
foo | y-foo-one | ... |
foo | y-foo-two |
foo | y-foo-three |
baz | y-baz-one |
baz | y-baz-two |
bat | y-bat-four |
I have a btree index on alpha(string1), so that the following query
SELECT * FROM alpha WHERE string1 LIKE 'y%';
uses a speedy index scan with the following condition:
Index Cond: (((string1)::text >= 'y'::text) AND ((string1)::text < 'z'::text))
However, when I try to write the nested loop query
SELECT * FROM beta LEFT JOIN alpha ON (alpha.string1 LIKE 'y-' || beta.string2 || '%')
PostgreSQL appears to refuse the index scan, instead forcing a seq scan as if it does not know that it can limit the search through all that text data to strings beginning with y-:
Join Filter: ((alpha.string1)::text ~~ (('y-'::text || (beta.string2)::text) || '%'::text))
This is rather a pain, as alpha has over a billion rows. Is there a better way to write the string comparison filter so it's clear to PostgreSQL that we know how alpha.string1 should start, and that the wildcard only comes in at the end?
You can get it to use the index for the 'y-' part, but not for the rest.
SELECT * FROM beta LEFT JOIN alpha ON (alpha.string1 LIKE 'y-%' and alpha.string1 LIKE 'y-' || beta.string2 || '%')
To do better than that, I think you would need to loop over beta, issuing a query against alpha for each row. I would probably do that on the client side, but you should also be able to wrap that into a table-returning function in a procedural language, with dynamic queries. Or as Laurenz says, use a trigram index.
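A rough sketch of that table-returning-function approach (the function name and the plain text return types are assumptions, and unlike the LEFT JOIN it drops beta rows with no match):
CREATE FUNCTION match_alpha() RETURNS TABLE (string2 text, string1 text)
LANGUAGE plpgsql AS $$
DECLARE
    b record;
BEGIN
    FOR b IN SELECT beta.string2 FROM beta LOOP
        -- each dynamic statement carries a constant LIKE pattern, so the
        -- btree index on alpha(string1) can be used for the prefix
        RETURN QUERY EXECUTE format(
            'SELECT %L::text, a.string1::text FROM alpha a WHERE a.string1 LIKE %L',
            b.string2, 'y-' || b.string2 || '%');
    END LOOP;
END;
$$;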
The only option is to create a trigram index:
CREATE EXTENSION pg_trgm;
CREATE INDEX ON alpha USING gin (string1 gin_trgm_ops);
I admit that this is unfortunate, since the B-tree index would be smaller and more efficient.
The reason is that the string concatenation function is a black box to the optimizer, and it does not know that the resulting string will not start with a wildcard. Trigram indexes can also handle patterns that start with a wildcard.
(alpha.string1 LIKE 'y-' || beta.string2 || '%')
Here the LIKE parameter is not a constant; rather, it is the result of an expression evaluation. To use an index, the planner would need to know that the '%' is always at the end of the string, but I don't think it does that kind of analysis on anything other than constants.
Since this works:
Index Cond: (((string1)::text >= 'y'::text) AND ((string1)::text < 'z'::text))
Why not just use it? If, say, string2 contains 'foo', you just have to come up with an expression or function that calculates the next text string in alphabetical order, like 'fop'. Then LIKE 'y-foo%' becomes >= 'y-foo' AND < 'y-fop'.
Of course, if the string ends with a 'z' it will be a bit more involved. But if it's just ASCII text containing only alphanumeric characters, then...
\d foo
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+---------
id | integer | | not null |
value | text | | not null |
Indexes:
"foo_pkey" PRIMARY KEY, btree (id)
"foo_value" btree (value text_pattern_ops)
select value, left(value,char_length(value)-1)||chr(ascii(right(value,1))+1) from foo limit 10;
value | ?column?
-------+----------
y-1 | y-2
y-2 | y-3
y-3 | y-4
y-4 | y-5
y-5 | y-6
y-6 | y-7
y-7 | y-8
y-8 | y-9
y-9 | y-:
y-10 | y-11
...seems to work...
select * from foo join list on ((foo.value~>=~('y-'||list.l)) and (foo.value~<=~('y-'||left(list.l,char_length(list.l)-1)||chr(ascii(right(list.l,1))+1))));
Nested Loop (cost=0.32..36968.41 rows=1122222 width=17) (actual time=0.052..0.770 rows=193 loops=1)
-> Seq Scan on list (cost=0.00..2.01 rows=101 width=6) (actual time=0.021..0.039 rows=101 loops=1)
-> Index Scan using foo_value on foo (cost=0.32..254.89 rows=11111 width=11) (actual time=0.004..0.005 rows=2 loops=101)
Index Cond: ((value ~>=~ ('y-'::text || list.l)) AND (value ~<=~ (('y-'::text || "left"(list.l, (char_length(list.l) - 1))) || chr((ascii("right"(list.l, 1)) + 1)))))
Planning Time: 0.166 ms
Execution Time: 0.827 ms
...and we get the desired nested loop range index scan.
But really, the problem here is that there are several fields inside your column that represent different data and really want to be independent columns, so you can actually access them and search for them in the way you want.
So you could just create a bunch of GENERATED columns, for example using split_part() to split your a-b-c string into a, b, c; then create an index on (a,b,c) and you can simply do:
WHERE a='y' AND b='foo'
This has the added bonus that b='foo' will not also select b='fooz', which would be selected by LIKE. Your question implies this might be what you actually want. If that's not the case, well, you'll still have to use LIKE, and then that solution won't work; you have to build a range by hand as described above.
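A sketch of that generated-column idea (GENERATED columns need PostgreSQL 12 or later; the new column names are illustrative, and adding STORED columns rewrites the table):
ALTER TABLE alpha
    ADD COLUMN part1 text GENERATED ALWAYS AS (split_part(string1, '-', 1)) STORED,
    ADD COLUMN part2 text GENERATED ALWAYS AS (split_part(string1, '-', 2)) STORED;
CREATE INDEX ON alpha (part1, part2);
SELECT * FROM beta LEFT JOIN alpha ON alpha.part1 = 'y' AND alpha.part2 = beta.string2;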

Postgres best way to delete duplicates in large table with no primary key

I have a table that logs scan events wherein I store the first and last event. Each night at midnight all the scan events from the previous day are added to the table, duplicates are dropped, and a query is run to delete anything other than the scan events with the minimum and maximum timestamps.
One of the problems is that the data provider recycles scan IDs every 45 days, so this table does not have a primary key. Here is an example of what the table looks like in its final state:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
But before the cleanup queries are run it can look like this:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 19:30:32|Received |12345 |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
There is sometimes data overlap from the vendor, and there's nothing we can really do about that. I currently run the following queries to delete duplicates:
DELETE FROM scans T1
USING scans T2
WHERE EXTRACT(DAY FROM current_timestamp-T1.scandatetime) < 2
AND T1.ctid < T2.ctid
AND T1.scaneventID = T2.scaneventID
AND T1.scandatetime = T2.scandatetime
;
And to retain only the min/max timestamps:
delete from scans
where EXTRACT(DAY FROM current_timestamp-scandatetime) < 2 and
scandatetime <> (select min(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID) and
scandatetime <> (select max(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID)
;
However, the table is quite large (hundreds of millions of scans over multiple years), so these run quite slowly. How can I speed this up?

Order of columns in compound indexes

I am using a compound index on a table with more than 13 million records.
The index order is (center_code, created_on, status). Both center_code and status are varchar(100) NOT NULL, and created_on is timestamp without time zone.
I read somewhere that the order of columns matters in a compound index: we have to check the number of unique values and put the column with the highest number of unique values first in the compound index.
The center_code can have 4000 distinct values.
The status can have 5 distinct values.
The min value of created_on is 2017-12-12 02:00:49.465317+00.
The question is: what would the number of unique values for created_on be?
Should I put it first in the compound index?
Does indexing on a date column work on a per-date, per-hour, or per-second basis?
The problem is:
A simple SELECT query that uses just this compound index and nothing else is taking more than 500 ms.
Indexes on table:
Indexes:
"pa_key" PRIMARY KEY, btree (id)
"pa_uniq" UNIQUE CONSTRAINT, btree (wbill)
"pa_center_code_created_on_status_idx_new" btree (center_code, created_on, status)
The query is:
EXPLAIN ANALYSE
SELECT "pa"."wbill"
FROM "pa"
WHERE ("pa"."center_code" = 'IND110030AAC'
AND "pa"."status" IN ('Scheduled')
AND "pa"."created_on" >= '2018-10-10T00:00:00+05:30'::timestamptz);
Query Plan:
Index Scan using pa_center_code_created_on_status_idx_new on pa (cost=0.69..3769.18 rows=38 width=13) (actual time=5.592..15.526 rows=78 loops=1)
Index Cond: (((center_code)::text = 'IND110030AAC'::text) AND (created_on >= '2018-10-09 18:30:00+00'::timestamp with time zone) AND ((status)::text = 'Scheduled'::text))
Planning time: 1.156 ms
Execution time: 519.367 ms
Any help would be highly appreciated.
The index scan condition reads
(((center_code)::text = 'IND110030AAC'::text) AND
(created_on >= '2018-10-09 18:30:00+00'::timestamp with time zone) AND
((status)::text = 'Scheduled'::text))
but the index scan itself is only over (center_code, created_on), while the condition on status is applied as a filter.
Unfortunately this is not visible from the execution plan, but it follows from the following rule:
An index scan will only use conditions if the rows satisfying the conditions are next to each other in the index.
Let's consider this example (in index order):
center_code | created_on | status
--------------+---------------------+-----------
IND110030AAC | 2018-10-09 00:00:00 | Scheduled
IND110030AAC | 2018-10-09 00:00:00 | Xtra
IND110030AAC | 2018-10-10 00:00:00 | New
IND110030AAC | 2018-10-10 00:00:00 | Scheduled
IND110030AAC | 2018-10-11 00:00:00 | New
IND110030AAC | 2018-10-11 00:00:00 | Scheduled
You will see that the query needs the 4th and 6th row.
PostgreSQL cannot scan the index with all three conditions, because the required rows are not next to each other. It will have to scan only with the first two conditions, because all rows satisfying those are right next to each other.
Your rule for multi-column indexes is wrong. The columns at the left of the index have to be the ones where = is used as comparison operator in the conditions.
The perfect index would be one on (center_code, status, created_on).
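As a sketch (the index name is illustrative):
CREATE INDEX pa_center_code_status_created_on_idx ON pa (center_code, status, created_on);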
One tip I have learned from experience: when you create a compound index, the columns with an equality condition (=) should come first, and the columns with other conditions like (>, <, >=, <=, IN) should follow after.