PostgreSQL+table partitioning: inefficient max() and min() - postgresql

I have a huge partitioned table stored at a PostgreSQL table. Each child table has an index and a check constraint on its id, e.g. (irrelevant deatils removed for clarity):
Master table: points
Column | Type | Modifiers
---------------+-----------------------------+------------------------
id | bigint |
creation_time | timestamp without time zone |
the_geom | geometry |
Sub-table points_01
Column | Type | Modifiers
---------------+-----------------------------+-------------------------
id | bigint |
creation_time | timestamp without time zone |
the_geom | geometry |
Indexes:
"points_01_pkey" PRIMARY KEY, btree (id)
"points_01_creation_time_idx" btree (creation_time)
"points_01_the_geom_idx" gist (the_geom) CLUSTER
Check constraints:
"enforce_srid_the_geom" CHECK (srid(the_geom) = 4326)
"id_gps_points_2010_08_22__14_47_04_check"
CHECK (id >= 1000000::bigint AND id <= 2000000::bigint)
Now,
SELECT max(id) FROM points_01
is instant, but:
SELECT max(id) FROM points
which is a master table for points_01 .. points_60 and should take very little time using check constraints, takes more than an hour because the query planner does not utilize the check constraints.
According to the PostgreSQL wiki (last section of this page), this is a known issue that would be fixed in the next versions.
Is there a good hack that will make the query planner utilize the check constraints and indices of sub-tables for max() and min() queries?
Thanks,
Adam

I don't know if it will work, but you could try this:
For that session, you could disable all access strategies but indexed ones:
db=> set enable_seqscan = off;
db=> set enable_tidscan = off;
db=> -- your query goes here
This way, only bitmapscan and indexscan would be enabled. PostgreSQL will have no choice but to use indexes to access data on the table.
After running your query, remember to reenable seqscan and tidscan by doing:
db=> set enable_seqscan = on;
db=> set enable_tidscan = on;
Otherwise, those access strategies will be disabled for the session from that point on.

Short answer: No. At this point in time, there's no way to make the Postgres planner understand that some aggregate functions can check the constraints on child partitions first. Its fairly easy to prove for specific case of min and max, but for aggregates in general, its a tough case.
You can always write it as a UNION of several partitions when it just has to be done...

I don't know much about postgres but you could could try this query (My query syntax may be not right due to lack of experience with postgres query's):
SELECT id FROM points a WHERE id > ALL (SELECT id FROM x WHERE x.id != a.id)
I'm curious if this works.

Related

Sometime Postgresql Query takes to much time to load Data

I am using PostgreSQL 13.4 and has intermediate level experience with PostgreSQL.
I have a table which stores reading from a device for each customer. For each customer we receive around ~3,000 data in a day and we store them in to our usage_data table as below
We have the proper indexing in place.
Column | Data Type | Index name | Idx Access Type
-------------+-----------------------------+---------------------------+---------------------------
id | bigint | |
customer_id | bigint | idx_customer_id | btree
date | timestamp | idx_date | btree
usage | bigint | idx_usage | btree
amount | bigint | idx_amount | btree
Also I have common index on 2 columns which is below.
CREATE INDEX idx_cust_date
ON public.usage_data USING btree
(customer_id ASC NULLS LAST, date ASC NULLS LAST)
;
Problem
Few weird incidents I have observed.
When I tried to get data for 06/02/2022 for a customer it took almost 20 seconds. It's simple query as below
SELECT * FROM usage_data WHERE customer_id =1 AND date = '2022-02-06'
The execution plan
When I execute same query for 15 days then I receive result in 32 milliseconds.
SELECT * FROM usage_data WHERE customer_id =1 AND date > '2022-05-15' AND date <= '2022-05-30'
The execution plan
Tried Solution
I thought it might be the issue of indexing as for this particular date I am facing this issue.
Hence I dropped all the indexing from the table and recreate it.
But the problem didn't resolve.
Solution Worked (Not Recommended)
To solve this I tried another way. Created a new database and restored the old database
I executed the same query in new database and this time it takes 10 milliseconds.
The execution plan
I don't think this is proper solution for production server
Any idea why for any specific data we are facing this issue?
Please let me know if any additional information is required.
Please guide. Thanks

PostgreSQL 13 - Performance Improvement to delete large table data

I am using PostgreSQL 13 and has intermediate level experience with PostgreSQL.
I have a table named tbl_employee. it stores employee details for number of customers.
Below is my table structure, followed by datatype and index access method
Column | Data Type | Index name | Idx Access Type
-------------+-----------------------------+---------------------------+---------------------------
id | bigint | |
name | character varying | |
customer_id | bigint | idx_customer_id | btree
is_active | boolean | idx_is_active | btree
is_delete | boolean | idx_is_delete | btree
I want to delete employees for specific customer by customer_id.
In table I have total 18,00,000+ records.
When I execute below query for customer_id 1001 it returns 85,000.
SELECT COUNT(*) FROM tbl_employee WHERE customer_id=1001;
When I perform delete operation using below query for this customer then it takes 2 hours, 45 minutes to delete the records.
DELETE FROM tbl_employee WHERE customer_id=1001
Problem
My concern is that this query should take less than 1 min to delete the records. Is this normal to take such long time or is there any way we can optimise and reduce the execution time?
Below is Explain output of delete query
The values of seq_page_cost = 1 and random_page_cost = 4.
Below are no.of pages occupied by the table "tbl_employee" from pg_class.
Please guide. Thanks
During :
DELETE FROM tbl_employee WHERE customer_id=1001
Is there any other operation accessing this table? If only this SQL accessing this table, I don't think it will take so much time.
In RDBMS systems each SQL statement is also a transaction, unless it's wrapped in BEGIN; and COMMIT; to make multi-statement transactions.
It's possible your multirow DELETE statement is generating a very large transaction that's forcing PostgreSQL to thrash -- to spill its transaction logs from RAM to disk.
You can try repeating this statement until you've deleted all the rows you need to delete:
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
Doing it this way will keep your transactions smaller, and may avoid the thrashing.
SQL: DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
will not work then.
To make the batch delete smaller, you can try this:
DELETE FROM tbl_employee WHERE ctid IN (SELECT ctid FROM tbl_employee where customer_id=1001 limit 1000)
Until there is nothing to delete.
Here the "ctid" is an internal column of Postgresql Tables. It can locate the rows.

Can Postgres table partitioning by hash boost query performance?

I have an OLTP table on a Postgres 14.2 database that looks something like this:
Column | Type | Nullable |
----------------+-----------------------------+-----------
id | character varying(32) | not null |
user_id | character varying(255) | not null |
slug | character varying(255) | not null |
created_at | timestamp without time zone | not null |
updated_at | timestamp without time zone | not null |
Indexes:
"playground_pkey" PRIMARY KEY, btree (id)
"playground_user_id_idx" btree (user_id)
The database host has 8GB of RAM and 2 CPUs.
I have roughly 500M records in the table which adds up to about 80GB in size.
The table gets about 10K INSERT/h, 30K SELECT/h, and 5K DELETE/h.
The main query run against the table is:
SELECT * FROM playground WHERE user_id = '12345678' and slug = 'some-slug' limit 1;
Users have anywhere between 1 record to a few hundred records.
Thanks to the index on the user_id I generally get decent performance (double-digit milliseconds), but 5%-10% of the queries will take a few hundred milliseconds to maybe a second or two at worst.
My question is this: would partitioning the table by hash(user_id) help me boost lookup performance by taking advantage of partition pruning?
No, that wouldn't improve the speed of the query at all, since there is an index on that attribute. If anything, the increased planning time will slow down the query.
If you want to speed up that query as much as possible, create an index that supports both conditions:
CREATE INDEX ON playground (user_id, slug);
If slug is large, it may be preferable to index a hash:
CREATE INDEX ON playground (user_id, hashtext(slug));
and query like this:
SELECT *
FROM playground
WHERE user_id = '12345678'
AND slug = 'some-slug'
AND hashtext(slug) = hashtext('some-slug');
LIMIT 1;
Of course, partitioning could be a good idea for other reasons, for example to speed up autovacuum or CREATE INDEX.

optimize a postgres query that updates a big table [duplicate]

I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------|-----------------------------|----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update table1 by the other table:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before I execute this SQL command, I already created the index for this two table:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I use EXPLAIN command to see the query plan, but it didn't use the index I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem within it?
How can I shorten the execution time, or tune my database/table?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a big number of rows, you might consider to run VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and as a consequence speed up sequential scans.

Oracle: TABLE ACCESS FULL with Primary key?

There is a table:
CREATE TABLE temp
(
IDR decimal(9) NOT NULL,
IDS decimal(9) NOT NULL,
DT date NOT NULL,
VAL decimal(10) NOT NULL,
AFFID decimal(9),
CONSTRAINT PKtemp PRIMARY KEY (IDR,IDS,DT)
)
;
Let's see the plan for select star query:
SQL>explain plan for select * from temp;
Explained.
SQL> select plan_table_output from table(dbms_xplan.display('plan_table',null,'serial'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
---------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 61 | 2 (0)|
| 1 | TABLE ACCESS FULL| TEMP | 1 | 61 | 2 (0)|
---------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
11 rows selected.
SQL server 2008 shows in the same situation Clustered index scan. What is the reason?
select * with no where clause -- means read every row in the table, fetch every column.
What do you gain by using an index? You have to go to the index, get a rowid, translate the rowid into a table offset, read the file.
What happens when you do a full table scan? You go the th first rowid in the table, then read on through the table to the end.
Which one of these is faster given the table you have above? Full table scan. Why? because it skips having to to go the index, retreive values, then going back to the other to where the table lives and fetching.
To answer this more simply without mumbo-jumbo, the reason is:
Clustered Index = Table
That's by definition in SQL Server. If this is not clear, look up the definition.
To be absolutely clear once again, since most people seem to miss this, the Clustered Index IS the table itself. It therefore follows that "Clustered Index Scan" is another way of saying "Table Scan". Or what Oracle calls "TABLE ACCESS FULL"