Implications of using ADD COLUMN on large dataset - amazon-redshift

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar DB? Or does it depend on whether the column is nullable / has a default value?

I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a consequence of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.

I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
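For illustration, a minimal sketch of the statement; the table and column names here are hypothetical:

-- hypothetical table: events, ~500M rows
ALTER TABLE events
ADD COLUMN is_processed integer DEFAULT 0;

-- dropping it later is similarly cheap
ALTER TABLE events
DROP COLUMN is_processed;

The new column always lands after the existing ones; there is no equivalent of an "AFTER column" clause.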

Personally, I have found that rebuilding the table works best.
I do it in the following way:
Create a new table N_OLD_TABLE.
Define the data types / compression encodings in the new table.
Insert data into N_OLD_TABLE (old_columns) SELECT (old_columns) FROM OLD_TABLE.
Rename OLD_TABLE to OLD_TABLE_BKP.
Rename N_OLD_TABLE to OLD_TABLE.
This is a much faster process. It doesn't block any table and you always have a backup of the old table in case anything goes wrong.
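A sketch of that rebuild, assuming a simple OLD_TABLE with an id and a value column; adjust the DIST/SORT keys and encodings to your own schema:

-- new table with the extra column and the encodings you want
CREATE TABLE n_old_table (
    id       bigint      ENCODE az64,
    value    varchar(64) ENCODE lzo,
    new_flag integer     DEFAULT 0 ENCODE az64
)
DISTKEY (id)
SORTKEY (id);

-- copy the existing columns across
INSERT INTO n_old_table (id, value)
SELECT id, value FROM old_table;

-- swap the tables, keeping the original as a backup
ALTER TABLE old_table   RENAME TO old_table_bkp;
ALTER TABLE n_old_table RENAME TO old_table;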

Related

Bulk update Postgres table

I have a table with around 200 million records and I have added 2 new columns to it. Now the 2 columns need values from a different table. Nearly 80% of the rows will be updated.
I tried a plain UPDATE but it takes more than 2 hours to complete.
The main table has a composite primary key of 4 columns. I dropped it, and also dropped an index on one of the columns, before updating. Now the update takes a little over 1 hour.
Is there any other way to speed up this update process (like batch processing)?
Edit: I used the other table (from which the values for the update are matched) in the FROM clause of the UPDATE statement.
Not really. Make sure that max_wal_size is high enough that you don't get too many checkpoints.
After the update, the table will be bloated to about twice its original size.
That bloat can be avoided if you update in batches and VACUUM in between, but that will not make processing faster.
Do you need the whole update in a single transaction? I had quite a similar problem, with a table that was under heavy load and a column that needed a NOT NULL constraint. To deal with it I took these steps:
Add the columns without constraints like NOT NULL, but with defaults. That way it went really fast.
Update the columns in batches, e.g. 1000 rows per transaction (sketched below). In my case the load on the DB rose, so I had to add a small delay between batches.
Alter the columns to add the NOT NULL constraints.
That way you don't block the table for a long time, though this is not a direct answer to your question.
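A minimal sketch of that batched approach, assuming hypothetical tables main_table(id, ...) and src_table(id, val_a, val_b) with a numeric key to window on:

-- 1. add the columns with defaults but without NOT NULL
ALTER TABLE main_table
    ADD COLUMN col_a integer DEFAULT 0,
    ADD COLUMN col_b integer DEFAULT 0;

-- 2. fill them in batches, one transaction per id window;
--    drive the window (1..1000, 1001..2000, ...) from a script
--    and sleep briefly between batches if the load gets too high
UPDATE main_table m
SET    col_a = s.val_a,
       col_b = s.val_b
FROM   src_table s
WHERE  s.id = m.id
AND    m.id BETWEEN 1 AND 1000;

-- 3. once every row is populated, add the constraints
ALTER TABLE main_table
    ALTER COLUMN col_a SET NOT NULL,
    ALTER COLUMN col_b SET NOT NULL;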
First, to see where you stand, I would check iostat to make sure I/O is not already the limit. To speed things up, I would consider:
a larger free space map, to be sure the DB is aware of entries that can be removed; note that if pages are packed to the limit it will not bring much;
removing foreign keys that refer to the table, if possible, to stop them locking it;
removing all indexes, since they slow the update down, and recreating them afterwards (see the sketch below); that attacks the problem from the other side, but it is an option, so it counts.
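A sketch of the drop-and-recreate part; the constraint and index names here are hypothetical:

-- drop objects that slow the bulk update down
ALTER TABLE child_table DROP CONSTRAINT child_table_main_id_fkey;
DROP INDEX idx_main_table_some_col;

-- ... run the big UPDATE here ...

-- recreate them once the update is done
CREATE INDEX idx_main_table_some_col ON main_table (some_col);
ALTER TABLE child_table
    ADD CONSTRAINT child_table_main_id_fkey
    FOREIGN KEY (main_id) REFERENCES main_table (id);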
There are two kinds of solution to your problem.
1) This approach works if your main table is not updated or inserted into during the process:
First, create the same table schema, without the composite primary key and index, under a different name.
Then insert the data into the new table, joining in the other table's data.
Apply all constraints and indexes to the new table after the insert.
Drop the old table and rename the new table to the old table's name. (A sketch follows below.)
2) Alternatively, you can use a trigger to fill the two columns on insert and update events. (This will make insert/update operations slightly slower.)
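A sketch of approach 1 with hypothetical names (main_table has id plus its original columns; other_table supplies val_a and val_b):

-- 1. same schema, new name, no PK or indexes yet
CREATE TABLE main_table_new (LIKE main_table INCLUDING DEFAULTS);

-- 2. load it with the joined data in one pass
INSERT INTO main_table_new (id, old_col, col_a, col_b)
SELECT m.id, m.old_col, o.val_a, o.val_b
FROM   main_table m
LEFT JOIN other_table o ON o.id = m.id;

-- 3. add constraints and indexes after the load
ALTER TABLE main_table_new ADD PRIMARY KEY (id, old_col);  -- your 4-column key here
CREATE INDEX idx_main_table_new_col_a ON main_table_new (col_a);

-- 4. swap the tables
DROP TABLE main_table;
ALTER TABLE main_table_new RENAME TO main_table;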

Postgres Upsert vs Truncate and Insert

I have a stream of data that I can replay at any time to reload data into a Postgres table. Let's say I have millions of rows in my table and I add a new column. Now I can replay that stream of data to map a key in the data to the column name that I have just added.
The two options I have are:
1) Truncate and then Insert
2) Upsert
Which would be a better option in terms of performance?
The way PostgreSQL does multiversioning, every update creates a new row version. The old row version will have to be reclaimed later.
This means extra work and tables with a lot of empty space in them.
On the other hand, TRUNCATE just throws away the old table, which is very fast.
You can gain extra performance by using COPY instead of INSERT to load bigger amounts of data.
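A sketch of the truncate-and-reload path, assuming a hypothetical table my_table and a CSV export of the replayed stream (from psql you could use \copy instead to load from the client side):

BEGIN;
-- throw away the old contents cheaply
TRUNCATE my_table;

-- bulk-load the replayed stream; COPY is much faster than row-by-row INSERTs
COPY my_table (id, payload, new_column)
FROM '/path/to/replayed_stream.csv'
WITH (FORMAT csv, HEADER true);
COMMIT;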

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation).
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from #ketan):
Create your main table with appropriate (for joins) DIST key, and COMPOUND or simple SORT KEY on timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with daily data, VACUUM SORT.
Copy sorted temp table into main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM on the main table. You may still need VACUUM SORT after that.
After that, query your main table normally, probably with a range filter on the timestamp. Redshift is optimised for these scenarios, and 99% of the time you don't need to optimise table scans yourself - even on tables with billions of rows, scans take milliseconds to a few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight in the performance of scans, use STL_QUERY system table to find your query ID, and then use STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
Your example is actually the main use case for ALTER TABLE APPEND.
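A sketch of that daily flow; the table, bucket, and role names are hypothetical, and a regular (permanent) staging table is used here:

-- 1. staging table with the same DIST/SORT keys as the main table
CREATE TABLE events_staging (LIKE events);

-- 2. load the day's data, then sort it
COPY events_staging
FROM 's3://my-bucket/events/2016_06_14/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;

VACUUM SORT ONLY events_staging;

-- 3. move the sorted blocks into the main table and clean up
ALTER TABLE events APPEND FROM events_staging;
DROP TABLE events_staging;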
I am assuming that you are creating a new table every day.
What you can do is:
Create a view on top of event_table_* tables. Query your data using this view.
Whenever you create or drop a table, update the view.
If you want, you can avoid #2: instead of creating a new table every day, create empty tables for the next 1-2 years. That way there is no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as sort-key. So, whenever your table is queried with some date, all disk blocks that don't have that date will be skipped. That'll be as efficient as having time-series tables.
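A sketch of that single-table design with hypothetical names; the date sort key lets Redshift skip blocks outside the queried range:

CREATE TABLE events (
    event_id    bigint,
    event_date  date,
    payload     varchar(max)
)
DISTKEY (event_id)
SORTKEY (event_date);

-- a date-restricted query only touches the relevant blocks
SELECT *
FROM events
WHERE event_date = trunc(getdate());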

ADD COLUMN with DEFAULT value to a huge table

I have a PostgreSQL DB and a table with almost a billion rows.
When I try to add a new column with a default value:
ALTER TABLE big_table
ADD COLUMN some_flag integer NOT NULL DEFAULT 0;
The transaction goes on for 30+ minutes, and the DB log starts to emit warnings.
Any way to optimize the query?
Besides doing it in batches (which will still take a while):
You could dump the table as COPY statements and write a script to edit the contents of the COPY statements to insert another column (COPY can be CSV IIRC).
Then you just reload your altered COPY dump and it should in theory be faster than the ALTER because COPY will not log transactions.
The other option is to turn off fsync while you run the command... just remember to turn it back on.
You can also do both of the above in batches.
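If you do go the fsync route, a minimal sketch of toggling it (superuser required; fsync can be changed with a config reload, no restart needed):

-- turn fsync off for the duration of the bulk operation; risky, since a crash
-- in this window can corrupt the cluster, so only do it if you can rebuild from a dump
ALTER SYSTEM SET fsync = off;
SELECT pg_reload_conf();

-- ... run the reload / ALTER here ...

-- turn it back on and reload again
ALTER SYSTEM RESET fsync;
SELECT pg_reload_conf();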
Starting from PostgreSQL 11 this behaviour will change.
Waiting for PostgreSQL 11 – Fast ALTER TABLE ADD COLUMN with a non-NULL default:
So, for the longest time, when you did:
alter table x add column z text;
it was virtually instantaneous. Get a lock on table, add information about new column to system catalogs, and it's done.
But when you tried:
alter table x add column z text default 'some value';
then it took a long time. How long depended on the size of the table.
This was because PostgreSQL was actually rewriting the whole table, adding the column to each row, and filling it with the default value.
"What happens if you want to set the column to NOT NULL also? Are we back to the slow version in that case or does this handle that as well?"
Not null doesn't change anything; it is a constraint for new rows. So adding a column with "not null default 'xxx'" will be fast.
I'd consider creating the column without the default and manually updating the rows in batches with intermittent commits to apply the default.
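For pre-11 versions, a sketch of that workaround; the batch size is arbitrary and an id primary key is assumed:

-- 1. adding the column without a default is a catalog-only change (instant)
ALTER TABLE big_table ADD COLUMN some_flag integer;

-- 2. backfill in batches, committing between them; repeat until 0 rows updated
UPDATE big_table
SET    some_flag = 0
WHERE  id IN (
    SELECT id FROM big_table
    WHERE  some_flag IS NULL
    LIMIT  50000
);

-- 3. only then attach the default and the NOT NULL constraint
ALTER TABLE big_table
    ALTER COLUMN some_flag SET DEFAULT 0,
    ALTER COLUMN some_flag SET NOT NULL;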

efficiently trimming postgresql tables

I have about 10 tables with over 2 million records and one with 30 million. I would like to efficiently remove older data from each of these tables.
My general algorithm is:
create a temp table for each large table and populate it with newer data
truncate the original tables
copy tmp data back to original tables using: "insert into originaltable (select * from tmp_table)"
However, the last step of copying the data back is taking longer than I'd like. I thought about deleting the original tables and making the temp tables "permanent", but I lose constraint/foreign key info.
If I delete from the tables directly, it takes much longer. Given that I need to preserve all foreign keys and constraints, are there any faster ways of removing the older data?
Thanks.
The fastest process is likely to be exactly as you've outlined:
Copy new data into a temporary table
Drop indexes and foreign keys
Drop the old table
Copy the temporary table back to the old table name
Rebuild indexes and foreign keys.
The Postgres manual has some suggestions on performance, too, that may or may not apply. Frankly, however, it is significantly quicker to drop a table than to drop millions of rows (since each delete is performed tuple by tuple), and it is significantly quicker to insert millions of rows into a table with no constraints or indexes (as each constraint must be checked and each index must be updated for every record inserted; by removing all constraints and indexes, you limit this to a single build of each index and a single verification of each constraint).
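A sketch of that sequence for one table; the names, the cutoff, and the constraint definitions are all hypothetical:

BEGIN;

-- 1. keep only the newer rows in a fresh table
CREATE TABLE big_table_new AS
SELECT * FROM big_table
WHERE  created_at >= now() - interval '90 days';

-- 2. drop the old table (this also drops its indexes and dependent foreign keys)
DROP TABLE big_table CASCADE;

-- 3. take over the old name
ALTER TABLE big_table_new RENAME TO big_table;

-- 4. rebuild keys, indexes, and foreign keys
ALTER TABLE big_table ADD PRIMARY KEY (id);
CREATE INDEX idx_big_table_created_at ON big_table (created_at);
ALTER TABLE child_table
    ADD CONSTRAINT child_table_big_id_fkey
    FOREIGN KEY (big_table_id) REFERENCES big_table (id);

COMMIT;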
The "standard" solution for these problems typically involves partitioning your tables on the appropriate key, such that when you need to delete old data, you can simply drop a whole partition -- certainly the fastest deletion that you will ever get.
However, partitioning in PostgreSQL isn't as easy as in some other databases -- you need to relocate data manually using triggers, and there are caveats (e.g. no global primary keys).
See the PostgreSQL manual on Partitioning
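To illustrate why dropping a partition is so cheap, a minimal sketch using the inheritance-style partitioning the manual describes; the names and date ranges are hypothetical, and the routing trigger is omitted:

-- parent table holds no data of its own
CREATE TABLE measurements (
    id        bigserial,
    logged_at date NOT NULL,
    value     numeric
);

-- one child table per month, constrained to its range
CREATE TABLE measurements_2016_05 (
    CHECK (logged_at >= DATE '2016-05-01' AND logged_at < DATE '2016-06-01')
) INHERITS (measurements);

CREATE TABLE measurements_2016_06 (
    CHECK (logged_at >= DATE '2016-06-01' AND logged_at < DATE '2016-07-01')
) INHERITS (measurements);

-- a trigger on the parent routes INSERTs to the right child (not shown here)

-- removing old data is then just dropping the partition, no row-by-row DELETE
DROP TABLE measurements_2016_05;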