Deleting rows in Postgres table using ctid - postgresql

We have a table with nearly 2 billion events recorded. As per our data model, each event is uniquely identified with 4 columns combined primary key. Excluding the primary key, there are 5 B-tree indexes each on single different columns. So totally 6 B-tree indexes.
The events recorded span for years and now we need to remove the data older than 1 year.
We have a time column with long values recorded for each event. And we use the following query,
delete from events where ctid = any ( array (select ctid from events where time < 1517423400000 limit 10000) )
Does the indices gets updated?
During testing, it didn't.
After insertion,
total_table_size - 27893760
table_size - 7659520
index_size - 20209664
After deletion,
total_table_size - 20226048
table_size - 0
index_size - 20209664

Reindex can be done
Command: REINDEX
Description: rebuild indexes
Syntax:
REINDEX { INDEX | TABLE | DATABASE | SYSTEM } name [ FORCE ]

Considering #a_horse_with_no_name method is the good solution.
What we had:
Postgres version 9.4.
1 table with 2 billion rows with 21 columns (all bigint) and 5 columns combined primary key and 5 individual column indices with date spanning 2 years.
It looks similar to time-series data with a time column containing UNIX timestamp except that its analytics project, so time is not at an ordered increase. The table was insert and select only (most select queries use aggregate functions).
What we need: Our data span is 6 months and need to remove the old data.
What we did (with less knowledge on Postgres internals):
Delete rows at 10000 batch rate.
At inital, the delete was so fast taking ms, as the bloat increased each batch delete increased to nearly 10s. Then autovacuum got triggered and it ran for almost 3 months. The insert rate was high and each batch delete has increased the WAL size too. Poor stats in the table made the current queries so slow that they ran for minutes and hours.
So we decided to go for Partitioning. Using Table Inheritance in 9.4, we implemented it.
Note: Postgres has Declarative Partitioning from version 10, which handles most manual work needed in partitioning using Table Inheritance.
Please go through the official docs as they have clear explanation.
Simplified and how we implemented it:
Create parent table
Create child table inheriting it with check constraints. (We had monthly partitions and created using schedular)
Indexes are need to be created separately for each child table
To drop old data, just drop the table, so vacuum is not needed and will be instant.
Make sure to have the postgres property constraint_exclusion to partition.
VACUUM ANALYZE the old partition after started inserting in the new partition. (In our case, it helped the query planner to use Index-Only scan instead of Seq. scan)
Using Triggers as mentioned in the docs may make the inserts slower, so we deviated from it, as we partitioned based on time column, we calculated the table name at application level based on time value before every insert and it didn't affect the insert rate for us.
Also read other caveats mentioned there.

Related

Temp table updates is slower then normal table in postgresql

I have a situation where updates on my temp table is slow. Below is the scenario
Created temp table in session for every session,first time temp table created and then going forward doing insert,update and delete operations this operations until session ends only.
First i'm inserting the rows and based on rows i'm updateing other columns. but this updates is slow compared to norma table. i checked the performance by replacing temp table whereas normal table taking around 50 to 60s but temp table is taking nearly 5 mins.
I tried analyze on temp table, then i got the improved performance. when im using analyze the updates are completed in with 50 seconds.
I tried Types also, but no luck.
Record count in temp table is 480
Can anyone help to imprrove the performance on temp table with out analyze OR any alternative for bulk collect and bulk insert in user defined types
All the above ooperations i'm doing in postgresql.
The lack of information in your question forces me to guess, but if all other things are equal, the difference is probably that you don't have accurate statistics on the temporary table. For normal tables (which are visible to the public), autovacuum takes care of that automatically, but for temporary tables, you have to call ANALYZE explicitly to gather table statistics.

What is the best approach for upserting large number of rows into a single table?

Im working on a product that involves large number of upsert operations into a single table.
We are dealing with a time-based data and using timescaledb hypertables with 7 days chunk interval size. we have concurrent tasks that upserts data into a single table, and in extreme cases its possible that we will have 40 concurrent tasks, each one upserting around 250k rows, all to the same table.
Initially we decided to go with the approach of deleting all the old rows and then inserting the updated ones with a COPY FROM statement, but when we got to test the system on large scale these COPYs took long time to finish, eventually resulting in the db's CPU usage to reach 100%, and become unresponsive.
We also noticed that the index size of the table increased radically and filled up the disk usage to 100%, and SELECT statements took extremely long time to execute (over 10 minutes). We concluded that the reason for that was large amount of delete statements that caused index fragmentation, and decided to go with another approach.
following the answers on this post, we decided to copy all the data to a temporary table, and then upsert all the data to the actual table using an "extended insert" statement -
INSERT INTO table SELECT * FROM temp_table ON CONFLICT DO UPDATE...;
our tests show that it helped with the index fragmentation issues, but still large upsert operations of ~250K take over 4 minutes to execute, and during this upsert process SELECT statements take too long to finish which is unacceptable for us.
I'm wondering whats the best approach to create this upsert operation with as low impact to the performance of SELECTs as possible. The only thing that comes in mind right now is to split the insert into smaller chunks -
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 0 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 50000 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 100000 ON CONFLICT DO UPDATE ...;
...
but if we batch the inserts, is there any advantage of first copying all the data into a temporary table? will it perform better then a simple multi-row insert statement?
and how do i decide whats the best chunk size to use when splitting up the upsert? is
using a temporary table and upserting the rows directly from it allows for a bigger chunk sizes?
Is there any better approach to achieve this? any suggestion would be appreciated
There are a handful of things that you can do to speed up data loading:
Have no index or foreign key on the table while you load data (check constraints are fine). I am not suggesting that you drop all your constraints, but you could for example use partitioning, load the data into a new table, then create indexes and constraints and attach the table as a new partition.
Load the data with COPY. If you cannot use COPY, use a prepared statement for the INSERT to save on parsing time.
Load many rows in a single transaction.
Set max_wal_size high so that you get no more checkpoints than necessary.
Get fast local disks.

Bulk update Postgres table

I have a table with around 200 million records and I have added 2 new columns to it. Now the 2 columns need values from a different table. Nearly 80% of the rows will be updated.
I tried update but it takes more than 2 hours to complete.
The main table has a composite primary key of 4 columns. I have dropped it and dropped an index that is present on a column before updating. Now the update takes little over than 1 hour.
Is there any other way to speed up this update process (like batch processing).
Edit: I used the other table(from where values will be matched for update) in from clause of the update statement.
Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints.
After the update, the table will be bloated to about twice its original size.
That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.
Do you need whole update in single transaction? I had quite similar problem, with table that was under heavy load, and column required not null constraint. Do deal with it - I did some steps:
Add columns without constraints like not null, but with defaults. That way it went really fast.
Update columns in steps like 1000 entries per transaction. In my case load of the DB rise, so I had to put small delay.
Update columns to have not null constraints.
That way you don't block table for long time, but that is not an answer to your question.
First to validate where you are - I would check iostats to see if that is not the limit... To speed up, I would consider:
higher free space map - to be sure DB is aware of entries that can be removed, but note that if pages are packed to the limit it would not bring much...
maybe foreign keys referring to the table can be also removed? To stop locking the table,
removing all indices since they are slowing down, and create them afterwords - that looks like slicing problem but other way, but is an option, so counts...
There is a 2 type of solution to your problem.
1) This approach work if your main table doesn't update or inserted during this process
First create the same table schema without composite primary key and index with a different name.
Then insert the data in the new table with join table data.
Apply all constraints and indexes on the new table after insert.
Drop the old table and rename the new table with the old table name.
2) Or you can use a trigger to update that two-column on insert or update event. (This will make insert update operation slightly slow)

Performance of truncate and insert vs update

I have a table with more than 1 million records and table is growing everyday.I need to update two columns of that table everyday. What is the best way either to truncate the table and insert or update row wise?
example :-
today
userid activitycount
1 18
tomorrow
userid activitycount
1 19
Make sure that the fillfactor of the table is less than 50 and that the updated columns are not indexed.
Then the updates will become HOT updates that don't need to modify any index, and autovacuum will make sure that tomorrow's update will find enough free space.
The disadvantage is the bloat you have with this method, but you don't need to create new tables and rename them, which may be problematic with concurrent transactions.
Is faster to truncate table and copy it again. On Postgres docs you can learn how to do to populate tables with big datasets:
This section contains some suggestions on how to make this process as efficient as possible.
Use Copy: Use COPY to load all the rows in one command, instead of using a series of INSERT commands.
Remove Indexes: if you need indexes, just create indexes when data is already inserted.
Remove Foreign Key Constraints: Create constraints when data is already inserted.
Tuning Postgres installation: maintenance_work_mem, max_wal_size, Disable WAL Archival and Streaming Replication, ...

Partitioning Oracle tables used for logging

I have an application that records activity in a table (Oracle 10g). The logging records should be kept for at least 30 days. I expect about 20 million rows to be added to this table every month.
The DBA suggested that the table be split in partitions containing one week of data. The weekly maintenance script would then delete the oldest partition (leaving only 4 weeks of data in the table).
What would be the best way of partitioning this logging table?
Partitioning a table isn't hard - it appears that you will be removing the data on a weekly basis, so the partition clauses will look like
PARTITION "P2009_45" VALUES LESS THAN
(TO_DATE(' 2009-11-02 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
PARTITION "P2009_46" VALUES LESS THAN
(TO_DATE(' 2009-11-09 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
... etc
where your partitioning column is your date column of interest in the table.
Additional comments:
If you can upgrade to 11g you can
take advantage of interval
partitioning, which is similar to
this range partitioning, but Oracle
will manage creating new partitions
for you.
If you're going to routinely drop off
partitions, I would advise making all
indexes on the table
locally-partitioned to avoid the
rebuilds that would be necessary with
global partitions after partition
operations.
If you have a good idea of the number
of log entries per month, and it
stays relatively constant, you might
consider using a sequence (as a primary key) that is
capped at this number and then
recycles back to 0. Then your
logging statements must become "MERGE
INTO... " statements that either
create a new row or overwrite the row
if it exists. This only guarantees
that you'll retain the number of rows
allowed by the sequence max value and
NOT a certain time interval, but this
might be an alternative to
partitioning (which as DvE points
out, is an extra-expense option)
The most likely partitioning scheme would be to range-partition your data on the creation date. Each week you would create a new partition and drop the oldest one. The impact will depend on how this table is used / indexed.
Since it is a logging table perhaps it is not indexed, in that case dropping a partition will have little impact: referencing objects won't be invalidated, the drop will be just require a partition lock (and the oldest partition shouldn't be inserted at that time).
If the table is indexed, you will have to decide if your indexes will be global or partitionned. Global indexes will have to be rebuilt when you drop a partition (which takes time, although 20M rows is still manageable). You can use the UPDATE GLOBAL INDEXES clause to keep the indexes valid after the partition drop.
Local indexes will be partitionned like the table and may be less efficient than global indexes (index range scans will have to scan each local index instead of a common index if you do not query by date). These indexes won't have to be updated after a partition drop.
20 million rows every month and you only have to keep 30 days of data? (That's about a months worth).
Even with 12 months worth of data it wouldn't be hard to query this table (as one big table) with the correct index.
Inserting is no problemen either with 1 row in the logging table or 20 million.
Partitioning in Oracle is also a feature that needs to be paid for if I'm correct, so it's costly too (if you don't have a license already).