PostgreSQL 13 - Performance improvement for deleting large table data

I am using PostgreSQL 13 and have intermediate-level experience with PostgreSQL.
I have a table named tbl_employee. It stores employee details for a number of customers.
Below is my table structure, with each column's data type, index name, and index access method:
Column      | Data Type         | Index name      | Idx Access Type
------------+-------------------+-----------------+----------------
id          | bigint            |                 |
name        | character varying |                 |
customer_id | bigint            | idx_customer_id | btree
is_active   | boolean           | idx_is_active   | btree
is_delete   | boolean           | idx_is_delete   | btree
I want to delete the employees of a specific customer by customer_id.
The table has 1,800,000+ records in total.
When I execute the query below for customer_id 1001, it returns a count of 85,000:
SELECT COUNT(*) FROM tbl_employee WHERE customer_id=1001;
When I perform the delete operation for this customer using the query below, it takes 2 hours 45 minutes to delete the records:
DELETE FROM tbl_employee WHERE customer_id=1001
Problem
My concern is that this query should take less than a minute to delete the records. Is it normal for it to take this long, or is there a way to optimise it and reduce the execution time?
Below is the EXPLAIN output of the DELETE query.
The values are seq_page_cost = 1 and random_page_cost = 4.
Below is the number of pages occupied by the table tbl_employee, according to pg_class.
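Those figures come from pg_class; a query along these lines reproduces them (a sketch only, since the actual numbers are not reproduced above):
-- Page and row estimates for the table, as recorded in pg_class.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'tbl_employee';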
Please guide. Thanks

While
DELETE FROM tbl_employee WHERE customer_id=1001
is running, is there any other operation accessing this table? If this SQL is the only thing accessing the table, I wouldn't expect it to take that much time.
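One way to check is to look, from a separate session, at which backends hold locks on the table while the DELETE runs. A sketch of such a check, joining pg_stat_activity with pg_locks:
-- List other sessions that currently hold a lock on tbl_employee
-- (run this from a separate connection while the DELETE is in progress).
SELECT a.pid, a.state, a.wait_event_type, a.wait_event, a.query
FROM pg_stat_activity AS a
JOIN pg_locks AS l ON l.pid = a.pid
WHERE l.relation = 'tbl_employee'::regclass
  AND a.pid <> pg_backend_pid();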

In an RDBMS, each SQL statement is also its own transaction, unless it's wrapped in BEGIN; and COMMIT; to form a multi-statement transaction.
It's possible your multi-row DELETE statement is generating a very large transaction that's forcing PostgreSQL to thrash -- to spill its transaction logs from RAM to disk.
You can try repeating this statement until you've deleted all the rows you need to delete:
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
Doing it this way will keep your transactions smaller, and may avoid the thrashing.

PostgreSQL's DELETE does not support LIMIT, so
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
will not work.
To keep each batch of deletes small, you can try this instead:
DELETE FROM tbl_employee WHERE ctid IN (SELECT ctid FROM tbl_employee WHERE customer_id=1001 LIMIT 1000);
Repeat it until there is nothing left to delete.
Here ctid is a system column of PostgreSQL tables that gives a row's physical location, so it can be used to locate the rows.


Sometimes a PostgreSQL query takes too much time to load data

I am using PostgreSQL 13.4 and have intermediate-level experience with PostgreSQL.
I have a table which stores readings from a device for each customer. For each customer we receive around 3,000 readings per day, which we store in our usage_data table as below.
We have the proper indexing in place.
Column      | Data Type | Index name      | Idx Access Type
------------+-----------+-----------------+----------------
id          | bigint    |                 |
customer_id | bigint    | idx_customer_id | btree
date        | timestamp | idx_date        | btree
usage       | bigint    | idx_usage       | btree
amount      | bigint    | idx_amount      | btree
I also have a composite index on two columns, shown below.
CREATE INDEX idx_cust_date
ON public.usage_data USING btree
(customer_id ASC NULLS LAST, date ASC NULLS LAST)
;
Problem
I have observed a few weird incidents.
When I tried to get the data for 2022-02-06 for a customer, it took almost 20 seconds. It's a simple query, as below:
SELECT * FROM usage_data WHERE customer_id =1 AND date = '2022-02-06'
The execution plan
When I execute a similar query for a 15-day range, I receive the result in 32 milliseconds:
SELECT * FROM usage_data WHERE customer_id =1 AND date > '2022-05-15' AND date <= '2022-05-30'
The execution plan
Tried Solution
I thought it might be an indexing issue, since I am facing this only for this particular date.
Hence I dropped all the indexes on the table and recreated them, roughly as sketched below.
But the problem wasn't resolved.
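A sketch of that attempt, using the index names from the table above (REINDEX is shown as an equivalent in-place alternative):
-- Drop and recreate the indexes described in the question (a sketch).
DROP INDEX IF EXISTS idx_customer_id, idx_date, idx_usage, idx_amount, idx_cust_date;

CREATE INDEX idx_customer_id ON public.usage_data USING btree (customer_id);
CREATE INDEX idx_date ON public.usage_data USING btree (date);
CREATE INDEX idx_usage ON public.usage_data USING btree (usage);
CREATE INDEX idx_amount ON public.usage_data USING btree (amount);
CREATE INDEX idx_cust_date ON public.usage_data USING btree
    (customer_id ASC NULLS LAST, date ASC NULLS LAST);

-- Or rebuild everything in place without dropping anything:
REINDEX TABLE public.usage_data;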
Solution That Worked (Not Recommended)
To solve this I tried another way: I created a new database and restored the old database into it.
I executed the same query in the new database, and this time it took 10 milliseconds.
The execution plan
I don't think this is a proper solution for a production server.
Any idea why we are facing this issue for this specific data?
Please let me know if any additional information is required.
Please guide. Thanks

Postgres: How to efficiently bucketize random event ids below (hour,config_id,sensor_id)

I have a large table "measurement" with 4 columns:
measurement-service=> \d measurement
Table "public.measurement"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
hour | timestamp without time zone | | not null |
config_id | bigint | | not null |
sensor_id | bigint | | not null |
event_id | uuid | | not null |
Partition key: RANGE (hour)
Indexes:
"hour_config_id_sensor_id_event_id_key" UNIQUE CONSTRAINT, btree (hour, config_id, sensor_id, event_id)
Number of partitions: 137 (Use \d+ to list them.)
An example of a partition name: "measurement_y2019m12d04"
I then insert a lot of events as CSV via COPY into a temporary table, and from there I copy the data directly into the partition using ON CONFLICT DO NOTHING.
Example:
CREATE TEMPORARY TABLE tmp_measurement_y2019m12d04T02_12345 (
hour timestamp without time zone,
config_id bigint,
sensor_id bigint,
event_id uuid
) ON COMMIT DROP;
[...]
COPY tmp_measurement_y2019m12d04T02_12345 FROM STDIN DELIMITER ',' CSV HEADER;
INSERT INTO measurement_y2019m12d04 (SELECT * FROM tmp_measurement_y2019m12d04T02_12345) ON CONFLICT DO NOTHING;
I think I help Postgres by sending CSVs that contain data for the same hour only. Also, within that hour, I remove all duplicates from the CSV, so the CSV contains only unique rows.
But I send many batches for different hours, in no particular order. It can be an hour of today, yesterday, last week, etc.
My approach has worked all right so far, but I think I have reached a limit now. The insertion speed has become very slow. While the CPU is idle, I have 25% I/O wait. The storage subsystem is a RAID of several TB, using disks that are not SSDs.
maintenance_work_mem = 32GB
max_wal_size = 1GB
fsync = off
max_worker_processes = 256
wal_buffers = -1
shared_buffers = 64GB
temp_buffers = 4GB
effective_io_concurrency = 1000
effective_cache_size = 128GB
Each daily partition is around 20 GB and contains no more than 500 million rows. And by maintaining the unique index per partition, I effectively duplicate the data once more.
The lookup speed, on the other hand, is quick.
I think the limit is the maintenance of the btree, given the rather random UUIDs within each (hour, config_id, sensor_id). I constantly change it, and it has to be written out and re-read.
I am wondering if there is another approach. Basically I want uniqueness for (hour, config_id, sensor_id, event_id) and then a quick lookup per (hour, config_id, sensor_id).
I am considering removing the unique index and only having an index over (hour, config_id, sensor_id), and then enforcing the uniqueness on the reader side. But it may slow down reading, as the event_id can no longer be delivered from the index when I look up via (hour, config_id, sensor_id); the actual row has to be accessed to get the event_id.
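A minimal sketch of that alternative, assuming the constraint name from the \d output above and a PostgreSQL version that supports indexes on partitioned tables; the index name and the literal values in the query are made up for illustration:
-- Sketch: replace the unique constraint with a plain lookup index and
-- deduplicate on the reader side instead.
ALTER TABLE measurement DROP CONSTRAINT hour_config_id_sensor_id_event_id_key;
CREATE INDEX measurement_hour_config_sensor_idx
    ON measurement (hour, config_id, sensor_id);

-- Readers then enforce uniqueness themselves, e.g.:
SELECT DISTINCT event_id
FROM measurement
WHERE hour = '2019-12-04 02:00:00'
  AND config_id = 42
  AND sensor_id = 7;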
Or I could provide uniqueness via a hash index.
Any other ideas are welcome!
Thank you.
When you do the insert, you should specify an ORDER BY which matches the index of the table being inserted into:
INSERT INTO measurement_y2019m12d04
SELECT * FROM tmp_measurement_y2019m12d04T02_12345
ORDER BY hour, config_id, sensor_id, event_id;
Only if this fails to give enough improvement would I consider any of the other options you list.
Hash indexes don't provide uniqueness. You can simulate it with an exclusion constraint, but I think they are less efficient. Exclusion constraints do support DO NOTHING, but do not support DO UPDATE. So as long as your use case does not evolve to want DO UPDATE, you would be good on that front, but I still doubt it would actually solve the problem. If your bottleneck is I/O from updating the index, hash would only make it worse, as it is designed to scatter your data all over the place rather than focus it in a small cacheable area.
You also mention parallel processing. For inserting into the temp table, that might be fine. But I wouldn't do the INSERT...SELECT in parallel. If I/O is your bottleneck, that would probably just make it worse. Of course, if I/O is no longer the bottleneck after my ORDER BY suggestion, then ignore this part.

optimize a postgres query that updates a big table [duplicate]

I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------|-----------------------------|----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update the first table from the other table:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before executing this SQL command, I had already created the indexes for these two tables:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I used the EXPLAIN command to see the query plan, but it didn't use the indexes I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem here?
How can I shorten the execution time, or tune my database/table?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed it up; a condensed sketch follows these suggestions:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a big number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and, as a consequence, speed up sequential scans.
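Pulling those suggestions together, a condensed sketch with placeholder values; the question mixes the names tx_input1/tx_output1 and tx_input1_new/tx_output1_new, so adjust to your actual table names:
-- Sketch: drop the secondary index, tune the session, update, then rebuild.
DROP INDEX tx_input1_new_h;

-- work_mem can be raised per session; the value is a placeholder.
SET work_mem = '2GB';

-- max_wal_size cannot be set per session; raise it via ALTER SYSTEM and reload.
ALTER SYSTEM SET max_wal_size = '20GB';
SELECT pg_reload_conf();

-- Make sure the statistics are fresh on both tables.
ANALYZE tx_input1;
ANALYZE tx_output1;

UPDATE tx_input1 AS i
SET input_addr = o.output_addr,
    input_val  = o.output_val
FROM tx_output1 AS o
WHERE i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

RESET work_mem;
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);

-- Optional, during a maintenance window:
VACUUM (FULL) tx_input1;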

PostgreSQL+table partitioning: inefficient max() and min()

I have a huge partitioned table in a PostgreSQL database. Each child table has an index and a check constraint on its id, e.g. (irrelevant details removed for clarity):
Master table: points
    Column     |            Type             | Modifiers
---------------+-----------------------------+------------------------
 id            | bigint                      |
 creation_time | timestamp without time zone |
 the_geom      | geometry                    |
Sub-table points_01
    Column     |            Type             | Modifiers
---------------+-----------------------------+-------------------------
 id            | bigint                      |
 creation_time | timestamp without time zone |
 the_geom      | geometry                    |
Indexes:
    "points_01_pkey" PRIMARY KEY, btree (id)
    "points_01_creation_time_idx" btree (creation_time)
    "points_01_the_geom_idx" gist (the_geom) CLUSTER
Check constraints:
    "enforce_srid_the_geom" CHECK (srid(the_geom) = 4326)
    "id_gps_points_2010_08_22__14_47_04_check"
        CHECK (id >= 1000000::bigint AND id <= 2000000::bigint)
Now,
SELECT max(id) FROM points_01
is instant, but:
SELECT max(id) FROM points
which is a master table for points_01 .. points_60 and should take very little time using check constraints, takes more than an hour because the query planner does not utilize the check constraints.
According to the PostgreSQL wiki (last section of this page), this is a known issue that would be fixed in the next versions.
Is there a good hack that will make the query planner utilize the check constraints and indices of sub-tables for max() and min() queries?
Thanks,
Adam
I don't know if it will work, but you could try this:
For that session, you could disable all access strategies but indexed ones:
db=> set enable_seqscan = off;
db=> set enable_tidscan = off;
db=> -- your query goes here
This way, only bitmapscan and indexscan would be enabled. PostgreSQL will have no choice but to use indexes to access data on the table.
After running your query, remember to reenable seqscan and tidscan by doing:
db=> set enable_seqscan = on;
db=> set enable_tidscan = on;
Otherwise, those access strategies will be disabled for the session from that point on.
Short answer: no. At this point in time, there's no way to make the Postgres planner understand that some aggregate functions can check the constraints on child partitions first. It's fairly easy to prove for the specific case of min and max, but for aggregates in general, it's a tough case.
You can always write it as a UNION of several partitions when it just has to be done...
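For max(), that could look something like the sketch below, with each sub-table free to use its own index on id (only three of the sixty partitions are written out here):
-- Sketch: take the max of the per-partition maxima.
SELECT max(m) FROM (
    SELECT max(id) AS m FROM points_01
    UNION ALL
    SELECT max(id) AS m FROM points_02
    UNION ALL
    SELECT max(id) AS m FROM points_60
) AS per_partition;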
I don't know much about Postgres, but you could try this query (my syntax may not be right due to lack of experience with Postgres queries):
SELECT id FROM points a WHERE a.id > ALL (SELECT b.id FROM points b WHERE b.id != a.id)
I'm curious if this works.

Oracle: TABLE ACCESS FULL with Primary key?

There is a table:
CREATE TABLE temp
(
    IDR decimal(9) NOT NULL,
    IDS decimal(9) NOT NULL,
    DT date NOT NULL,
    VAL decimal(10) NOT NULL,
    AFFID decimal(9),
    CONSTRAINT PKtemp PRIMARY KEY (IDR,IDS,DT)
);
Let's look at the plan for a SELECT * query:
SQL>explain plan for select * from temp;
Explained.
SQL> select plan_table_output from table(dbms_xplan.display('plan_table',null,'serial'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
---------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    61 |     2   (0)|
|   1 |  TABLE ACCESS FULL| TEMP |     1 |    61 |     2   (0)|
---------------------------------------------------------------
Note
-----
   - 'PLAN_TABLE' is old version

11 rows selected.
SQL Server 2008 shows a Clustered Index Scan in the same situation. What is the reason?
SELECT * with no WHERE clause means: read every row in the table, fetch every column.
What do you gain by using an index? You have to go to the index, get a rowid, translate the rowid into a table offset, and read from the file.
What happens when you do a full table scan? You go to the first rowid in the table, then read on through the table to the end.
Which of these is faster given the table above? The full table scan. Why? Because it skips having to go to the index, retrieve values, and then go back to where the table lives and fetch the rows.
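For contrast, a query that does filter on the leading primary-key columns would typically use the index. A sketch (the literal values are made up, and the exact plan depends on statistics):
-- Sketch: with a selective WHERE on the primary key's leading columns,
-- the optimizer can use the PKtemp index instead of a full scan.
EXPLAIN PLAN FOR
SELECT * FROM temp WHERE IDR = 1 AND IDS = 2;

SELECT plan_table_output
FROM table(dbms_xplan.display('plan_table',null,'serial'));
-- Expect something like an INDEX RANGE SCAN on PKTEMP followed by
-- TABLE ACCESS BY INDEX ROWID on TEMP.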
To answer this more simply without mumbo-jumbo, the reason is:
Clustered Index = Table
That's by definition in SQL Server. If this is not clear, look up the definition.
To be absolutely clear once again, since most people seem to miss this: the Clustered Index IS the table itself. It therefore follows that "Clustered Index Scan" is another way of saying "Table Scan", or what Oracle calls "TABLE ACCESS FULL".