How to add a sort key to an existing table in AWS Redshift - amazon-redshift

In AWS Redshift, I want to add a sort key to a table that is already created. Is there any command which can add a column and use it as sort key?

UPDATE:
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies the user experience of maintaining the optimal sort order in Redshift to achieve high performance as query patterns evolve, without interrupting access to the tables.
source: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-supports-changing-table-sort-keys-dynamically/
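As a quick illustration, with this capability the sort key can now be changed with a single statement (a minimal sketch, assuming a hypothetical table my_table with an existing column my_col):
-- Hypothetical example: change the sort key of an existing table in place
ALTER TABLE my_table ALTER SORTKEY (my_col);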
At the moment I think it's not possible (hopefully that will change in the future). In the past, when I ran into this kind of situation, I created a new table and copied the data from the old one into it.
from http://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html:
ADD [ COLUMN ] column_name
Adds a column with the specified name to the table. You can add only one column in each ALTER TABLE statement.
You cannot add a column that is the distribution key (DISTKEY) or a sort key (SORTKEY) of the table.
You cannot use an ALTER TABLE ADD COLUMN command to modify the following table and column attributes:
UNIQUE
PRIMARY KEY
REFERENCES (foreign key)
IDENTITY
The maximum column name length is 127 characters; longer names are truncated to 127 characters. The maximum number of columns you can define in a single table is 1,600.
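To make the restriction concrete, a minimal sketch (hypothetical table and column names) of what is and isn't allowed:
-- Allowed: add one ordinary column per ALTER TABLE statement
ALTER TABLE test_table ADD COLUMN created_date date;
-- Not allowed: declaring the new column as the table's DISTKEY or SORTKEY
-- in the same ADD COLUMN statement; that requires recreating the table
-- (or, with the newer capability above, using ALTER SORTKEY on an existing column).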

As Yaniv Kessler mentioned, it's not possible to add or change the distkey and sortkey after creating a table; you have to recreate the table and copy all data into the new one.
You can use the following SQL format to recreate a table with a new design.
ALTER TABLE test_table RENAME TO old_test_table;
CREATE TABLE new_test_table([new table columns]);
INSERT INTO new_test_table (SELECT * FROM old_test_table);
ALTER TABLE new_test_table RENAME TO test_table;
DROP TABLE old_test_table;
In my experience, this SQL is used not only for changing the distkey and sortkey, but also for setting the encoding (compression) type.
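For instance, the CREATE TABLE step above could spell out the new distkey, sortkey, and encodings explicitly (a rough sketch with hypothetical column names and encodings; check ANALYZE COMPRESSION output before choosing them):
ALTER TABLE test_table RENAME TO old_test_table;
CREATE TABLE new_test_table(
    id bigint encode raw distkey,
    created_at timestamp encode delta sortkey,
    payload varchar(256) encode lzo
);
INSERT INTO new_test_table (SELECT * FROM old_test_table);
ALTER TABLE new_test_table RENAME TO test_table;
DROP TABLE old_test_table;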

To add to Yaniv's answer, the ideal way to do this is probably using the CREATE TABLE AS command. You can specify the distkey and sortkey explicitly. For example:
CREATE TABLE test_table_with_dist
distkey(field)
sortkey(sortfield)
AS
select * from test_table
Additional examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_examples.html
EDIT
I've noticed that this method doesn't preserve encoding. Redshift only applies encoding automatically during a COPY statement. If this is a persistent table, you should redefine the table and specify the encoding explicitly.
create table test_table_with_dist(
field1 varchar encode raw distkey,
field2 timestamp encode delta sortkey);
insert into test_table_with_dist select * from test_table;
You can figure out which encoding to use by running analyze compression test_table;

AWS now allows you to add both sortkeys and distkeys without having to recreate tables:
To add a sortkey (or alter an existing one):
ALTER TABLE data.engagements_bot_free_raw
ALTER SORTKEY (id)
To alter a distkey or add a distkey:
ALTER TABLE data.engagements_bot_free_raw
ALTER DISTKEY id
Interestingly, the parentheses are mandatory for SORTKEY, but not for DISTKEY.
You still cannot change the encoding of a table in place - that still requires one of the solutions where you recreate the table.

I followed this approach to add the sort columns to my table table_transactions. It's more or less the same approach, only with fewer commands.
alter table table_transactions rename to table_transactions_backup;
create table table_transactions compound sortkey(key1, key2, key3, key4) as select * from table_transactions_backup;
drop table table_transactions_backup;

Catching this question a bit late.
I find that using 1=1 is the best way to create and replicate data into another table in Redshift, e.g.:
CREATE TABLE NEWTABLE AS SELECT * FROM OLDTABLE WHERE 1=1;
Then you can drop OLDTABLE after verifying that the data has been copied.
(If you replace 1=1 with 1=2, it copies only the structure, which is useful for creating staging tables.)
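As a quick illustration of the 1=2 variant (hypothetical staging-table name), this creates an empty copy of the structure only:
-- 1=2 is never true, so only the column definitions are copied, no rows
CREATE TABLE STAGINGTABLE AS SELECT * FROM OLDTABLE WHERE 1=2;
Note that, as the earlier answers point out, CREATE TABLE AS does not carry over the original distkey, sortkey, or encodings unless you specify them explicitly.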

It is now possible to alter a sort key:
Amazon Redshift now supports changing table sort keys dynamically
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies the user experience of maintaining the optimal sort order in Redshift to achieve high performance as query patterns evolve, without interrupting access to the tables.
When creating Redshift tables, customers can optionally specify one or more table columns as sort keys. The sort keys are used to maintain the sort order of Redshift tables and allow the query engine to achieve high performance by reducing the amount of data read from disk and saving on storage through better compression. Previously, Redshift customers who wanted to change the sort keys after the initial table creation needed to re-create the table with new sort key definitions.
With the new ALTER SORT KEY command, users can dynamically change the Redshift table sort keys as needed. Redshift takes care of adjusting the data layout behind the scenes, and the table remains available for users to query. Users can modify sort keys for a given table as many times as needed, and they can alter sort keys for multiple tables simultaneously.
For more information about ALTER SORTKEY, please refer to the documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
As for the documentation itself:
ALTER DISTKEY column_name or ALTER DISTSTYLE KEY DISTKEY column_name
A clause that changes the column used as the distribution key of a table. Consider the following:
VACUUM and ALTER DISTKEY cannot run concurrently on the same table.
If VACUUM is already running, then ALTER DISTKEY returns an error.
If ALTER DISTKEY is running, then background vacuum doesn't start on a table.
If ALTER DISTKEY is running, then foreground vacuum returns an error.
You can only run one ALTER DISTKEY command on a table at a time.
The ALTER DISTKEY command is not supported for tables with interleaved sort keys.
When specifying DISTSTYLE KEY, the data is distributed by the values in the DISTKEY column. For more information about DISTSTYLE, see CREATE TABLE.
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
A clause that changes or adds the sort key used for a table. Consider the following:
You can define a maximum of 400 columns for a sort key per table.
You can only alter a compound sort key. You can't alter an interleaved sort key.
When data is loaded into a table, the data is loaded in the order of the sort key. When you alter the sort key, Amazon Redshift reorders the data. For more information about SORTKEY, see CREATE TABLE.

According to the updated documentation it is now possible to change a sort key type with:
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
For reference (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html):
"You can alter an interleaved sort key to a compound sort key or no sort key. However, you can't alter a compound sort key to an interleaved sort key."

ALTER TABLE table_name ALTER SORTKEY (sortKey1, sortKey2 ...etc)

Related

How do you manage UPSERTs on PostgreSQL partitioned tables for unique constraints on columns outside the partitioning strategy?

This question is for a database using PostgreSQL 12.3; we are using declarative partitioning and ON CONFLICT against the partitioned table is possible.
We had a single table representing application event data from client activity. Therefore, each row has fields client_id int4 and a dttm timestamp field. There is also an event_id text field and a project_id int4 field which together formed the basis of a composite primary key. (While rare, it was possible for two event records to have the same event_id but different project_id values for the same client_id.)
The table became non-performant, and we saw that queries most often targeted a single client in a specific timeframe. So we shifted the data into a partitioned table: first by LIST (client_id) and then each partition is further partitioned by RANGE(dttm).
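For reference, the layout described above looks roughly like this (a sketch with simplified columns and hypothetical partition names and bounds):
CREATE TABLE events (
    client_id  int4 NOT NULL,
    dttm       timestamp NOT NULL,
    event_id   text NOT NULL,
    project_id int4 NOT NULL
    -- ...other event fields
) PARTITION BY LIST (client_id);

CREATE TABLE events_client_1 PARTITION OF events
    FOR VALUES IN (1) PARTITION BY RANGE (dttm);

CREATE TABLE events_client_1_2020 PARTITION OF events_client_1
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');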
We are running into problems shifting our upsert strategy to work with this new table. We used to perform a query of INSERT INTO table SELECT * FROM staging_table ON CONFLICT (event_id, project_id) DO UPDATE ...
But since the columns that determine uniqueness (event_id and project_id) are not part of the partitioning strategy (dttm and client_id), I can't do the same thing with the partitioned table. I thought I could get around this by building UNIQUE indexes on each partition on (project_id, event_id) but the ON CONFLICT is still not firing because there is no such unique index on the parent table (there can't be, since it doesn't contain all partitioning columns). So now a single upsert query appears impossible.
I've found two solutions so far but both require additional changes to the upsert script that seem like they'd be less performant:
I can still do an INSERT INTO table_partition_subpartition ... ON CONFLICT (event_id, project_id) DO UPDATE ... but that requires explicitly determining the name of the partition for each row instead of just INSERT INTO table ... once for the entire dataset.
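A sketch of that first workaround against the layout above (hypothetical partition and staging-table names; the SET list is abbreviated):
-- Requires a unique index on each leaf partition, e.g.:
-- CREATE UNIQUE INDEX ON events_client_1_2020 (event_id, project_id);
INSERT INTO events_client_1_2020
SELECT *
FROM staging_table
WHERE client_id = 1
  AND dttm >= '2020-01-01' AND dttm < '2021-01-01'
ON CONFLICT (event_id, project_id)
DO UPDATE SET dttm = EXCLUDED.dttm;  -- ...plus whatever other columns need updating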
I could implement the "old way" UPSERT procedure: https://www.postgresql.org/docs/9.4/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE but this again requires looping through all rows.
Is there anything else I could do to retain the cleanliness of a single, one-and-done INSERT INTO table SELECT * FROM staging_table ON CONFLICT () DO UPDATE ... while still keeping the partitioning strategy as-is?
Edit: if it matters, concurrency is not an issue here; there's just one machine executing the UPSERT into the main table from the staging table on a schedule.

Postgres ALTER TABLE... unexpected performance when applying multiple changes

I am attempting to speed up a bulk load. The bulk load is performed into a table with all primary keys, indexes, and foreign keys dropped. After data finishes loading we go and apply the necessary constraints back to the database. As a simple test I have the following setup:
CREATE TABLE users
(
id int primary key
);
CREATE TABLE events
(
id serial,
user_id1 int,
user_id2 int,
unique_int1 int,
unique_int2 int,
unique_int3 int
);
INSERT INTO
users (id)
SELECT generate_Series(1,100000000);
INSERT INTO
events (user_id1,user_id2,unique_int1,unique_int2,unique_int3)
SELECT
generate_Series(1,100000000),
generate_Series(1,100000000),
generate_Series(1,100000000),
generate_Series(1,100000000),
generate_Series(1,100000000);
I then wish to apply the following constraints via 5 individual alter table statements:
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_pkey PRIMARY KEY (id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_user_id1_fkey FOREIGN KEY (user_id1) REFERENCES public.users(id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_user_id2_fkey FOREIGN KEY (user_id2) REFERENCES public.users(id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int1_me_key UNIQUE (unique_int1);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int2_me_key UNIQUE (unique_int2);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int3_me_key UNIQUE (unique_int3);
Each of the above statements takes approximately 90 seconds to run, for a total of 450 seconds. I expected that when I combined the above statements into a single ALTER TABLE statement the total time would be reduced, but in fact it increased.
alter table only public.events
ADD CONSTRAINT events_pkey PRIMARY KEY (id),
ADD CONSTRAINT events_user_id1_fkey FOREIGN KEY (user_id1) REFERENCES public.users(id),
ADD CONSTRAINT events_user_id2_fkey FOREIGN KEY (user_id2) REFERENCES public.users(id),
ADD CONSTRAINT events_unique_int1_me_key UNIQUE (unique_int1),
ADD CONSTRAINT events_unique_int2_me_key UNIQUE (unique_int2),
ADD CONSTRAINT events_unique_int3_me_key UNIQUE (unique_int3);
This takes 520 seconds, whereas I expected it to take less than 450 seconds. My reasoning comes from the Postgres documentation for the ALTER TABLE statement, where the Notes section reads:
The main reason for providing the option to specify multiple changes in a single ALTER TABLE is that multiple table scans or rewrites can thereby be combined into a single pass over the table.
Can anyone explain the above measurements or have any suggestions?
This case is not going to be helped much, as each of the commands needs to do its own pass to verify the conditions of the FK or UNIQUE constraints. Also, from the docs:
When multiple subcommands are given, the lock acquired will be the strictest one required by any subcommand.
So by combining the commands you are working under the strictest lock for the entire command.
The section you quoted is amplified further on:
For example, it is possible to add several columns and/or alter the type of several columns in a single command. This is particularly useful with large tables, since only one pass over the table need be made.
The conclusion is combining commands is not necessarily a time saver.
Unifying the table read is going to be particularly beneficial if there is no other IO to be done, like column type change or check constraint. But here you need to actually build the indexes or look up the foreign keys, and that is what will dominate your run time. So it is no surprise that you don't see a big gain.
As for the small loss in performance, it could be due to worse cache usage, less efficient parallel query execution, or an interrupted stride length for readahead or for bulk writing to your IO devices.

Delete duplicates from a huge table in Postgresql

I have an unusual problem: I need to delete duplicate records from a table in PostgreSQL. Since the table has duplicate records, it has no primary key or unique index. The table contains about 20 million records. When I try the query below, it takes too long:
DELETE FROM temp a USING temp b WHERE a.recordid = b.recordid AND a.ctid < b.ctid;
So what would be a better approach to handle such a huge table with no index on it?
Any help is appreciated.
If you have enough free space, you can copy the table without duplicates, then drop the old table and rename the new one, like this:
INSERT INTO new_table
SELECT DISTINCT ON (column) *
FROM old_table
ORDER BY column ASC;
Use COPY TO to dump the table.
Then use Unix sort -u to de-duplicate it.
Drop or truncate the table in Postgres, use COPY FROM to read it back in.
Add a primary key column.
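A rough sketch of that approach, assuming server-side file access and a hypothetical file path (the de-duplication itself happens outside Postgres with sort -u):
-- 1. Dump the table to a flat file (needs appropriate server file-access privileges)
COPY temp TO '/tmp/temp_dump.txt';
-- 2. Outside Postgres: sort -u /tmp/temp_dump.txt > /tmp/temp_dedup.txt
-- 3. Empty the table and reload the de-duplicated file
TRUNCATE temp;
COPY temp FROM '/tmp/temp_dedup.txt';
-- 4. Add a surrogate primary key so future de-duplication is easier
ALTER TABLE temp ADD COLUMN id bigserial PRIMARY KEY;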

Redshift - extracting constraints

How do I get exported keys (database metadata)? Even though Redshift does not support foreign keys and primary keys, I am able to see them in the system tables.
The problem is that in the system table, the multiple columns of a foreign key exist as an array in one column (even though Redshift doesn't support arrays). Is it possible to extract them in one query?
Use the table_constraints table:
SELECT * FROM information_schema.table_constraints;
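For example, to list just the foreign-key constraints (a minimal sketch; how fully Redshift populates these views may vary):
SELECT constraint_name, table_schema, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';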

T-SQL create table with primary keys

Hello, I want to create a new table based on another one and create primary keys as well.
Currently this is how I'm doing it. Table B has no primary keys defined, but I would like to create them in table A. Is there a way to do that using this SELECT TOP 0 statement, or do I need to run an ALTER TABLE after I create tableA?
Thanks
select TOP 0 *
INTO [tableA]
FROM [tableB]
SELECT INTO does not support copying any of the indexes, constraints, triggers or even computed columns and other table properties, aside from the IDENTITY property (as long as you don't apply an expression to the IDENTITY column).
So, you will have to add the constraints after the table has been created and populated.
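For example (a sketch reusing the tableA/tableB names from the question, and assuming tableB has a NOT NULL id column to serve as the key):
-- Copy the structure only
SELECT TOP 0 *
INTO [tableA]
FROM [tableB];
-- Then add the primary key afterwards
ALTER TABLE [tableA]
ADD CONSTRAINT PK_tableA_id PRIMARY KEY (id);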
The short answer is NO. SELECT INTO will always create a HEAP table and, according to Books Online:
Indexes, constraints, and triggers defined in the source table are not
transferred to the new table, nor can they be specified in the
SELECT...INTO statement. If these objects are required, you must
create them after executing the SELECT...INTO statement.
So, after executing SELECT INTO you need to execute an ALTER TABLE or CREATE UNIQUE INDEX in order to add a primary key.
Also, if dbo.TableB does not already have an IDENTITY column (or if it does and you want to leave it out for some reason), and you need to create an artificial primary key column (rather than use an existing column in dbo.TableB to serve as the new primary key), you could use the IDENTITY function to create a candidate key column. But you still have to add the constraint to TableA after the fact to make it a primary key, since just the IDENTITY function/property alone does not make it so.
-- This statement will create a HEAP table
SELECT Col1, Col2, IDENTITY(INT,1,1) Col3
INTO dbo.MyTable
FROM dbo.AnotherTable;
-- This statement will create a clustered PK
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable_Col3 PRIMARY KEY (Col3);