Redshift - extracting constraints - amazon-redshift

How can I get exported keys (database metadata)? Even though Redshift does not enforce foreign keys and primary keys, I am able to see them in the system tables.
The problem is that in the system table the columns of a multi-column foreign key are stored as an array in a single column (though Redshift doesn't support arrays). Is it possible to extract them in one query?

Use the information_schema.table_constraints view:
SELECT * FROM information_schema.table_constraints;
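If the column lists are needed as well, one possible sketch is to read pg_constraint directly and let pg_get_constraintdef render the key columns as text instead of unpacking the array. This assumes pg_get_constraintdef is available on your cluster (as far as I know, the admin views in AWS's amazon-redshift-utils rely on it); the column aliases are only illustrative.

```sql
-- List foreign keys with their full definition, e.g.
-- "FOREIGN KEY (a, b) REFERENCES other_table(a, b)".
SELECT n.nspname                     AS schema_name,
       c.relname                     AS table_name,
       con.conname                   AS constraint_name,
       pg_get_constraintdef(con.oid) AS definition
FROM pg_constraint con
JOIN pg_class c     ON c.oid = con.conrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE con.contype = 'f'   -- 'f' = foreign key, 'p' = primary key
ORDER BY schema_name, table_name, constraint_name;
```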

Related

Postgres - Find only by UUID

I've got a PostgreSQL DB with multiple schemas and tables in those schemas. Every row in every table has a primary UUID like "Ref_Key" => "41bf3b1e-91f0-491c-a6bd-c48a17e7c252"
Is it possible to find a row only by its UUID, without specifying the schema and table?
No, that is not possible. You can only query tables that explicitly appear in the FROM clause of a SELECT statement.
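A workaround can be sketched with dynamic SQL: loop over every table that has a "Ref_Key" column and probe each one in turn. This is only a sketch, and it assumes the column is of type uuid in every table; the variable names are made up for illustration.

```sql
DO $$
DECLARE
    t   record;
    hit boolean;
BEGIN
    -- Every table (or view) that exposes a uuid column named "Ref_Key".
    FOR t IN
        SELECT table_schema, table_name
        FROM information_schema.columns
        WHERE column_name = 'Ref_Key'
          AND data_type = 'uuid'
    LOOP
        -- %I quotes the identifiers safely; the UUID is passed as a parameter.
        EXECUTE format(
            'SELECT EXISTS (SELECT 1 FROM %I.%I WHERE "Ref_Key" = $1)',
            t.table_schema, t.table_name)
        INTO hit
        USING '41bf3b1e-91f0-491c-a6bd-c48a17e7c252'::uuid;

        IF hit THEN
            RAISE NOTICE 'found in %.%', t.table_schema, t.table_name;
        END IF;
    END LOOP;
END
$$;
```

Each table is still queried separately, so this is one statement per table under the hood rather than a single lookup.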

How to write another query in IN function when partitioning

I have 2 local docker postgresql-10.7 servers set up. On my hot instance, I have a huge table that I wanted to partition by date (I achieved that). The data from the partitioned table (Let's call it PART_TABLE) is stored on the other server, only PART_TABLE_2019 is stored on HOT instance. And here comes the problem. I don't know how to partition 2 other tables that have foreign keys from PART_TABLE, based on FK. PART_TABLE and TABLE2_PART are both stored on HOT instance.
I was thinking something like this:
create table TABLE2_PART_2019 partition of TABLE2_PART for values in (select uuid from PART_TABLE_2019);
But the query doesn't work and I don't know if this is a good idea (performance wise and logically).
Let me just mention that I can solve this with either function or script etc. but I would like to do this without scripting.
From doc at https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
"While primary keys are supported on partitioned tables, foreign keys
referencing partitioned tables are not supported. (Foreign key
references from a partitioned table to some other table are
supported.)"
With PostgreSQL v10, foreign keys cannot be defined on the partitioned table itself; you can only create them on each individual partition.
You could upgrade to PostgreSQL v11, which allows foreign keys to be defined on partitioned tables. Foreign keys that reference a partitioned table need v12 or later.
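A minimal sketch of the per-partition approach on v10, assuming TABLE2_PART_2019 already exists as a partition and has a column (here called part_uuid, a hypothetical name) that holds the UUID from PART_TABLE_2019; the constraint names are also made up:

```sql
-- The referenced column needs a primary key or unique constraint
-- on the partition itself (skip this if it already has one).
ALTER TABLE part_table_2019
    ADD CONSTRAINT part_table_2019_pkey PRIMARY KEY (uuid);

-- Foreign key from one partition to the matching partition.
ALTER TABLE table2_part_2019
    ADD CONSTRAINT table2_part_2019_part_uuid_fkey
    FOREIGN KEY (part_uuid) REFERENCES part_table_2019 (uuid);
```

This has to be repeated for every pair of partitions, and it only works when both partitions live on the same server.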
Can you explain what a HOT instance is and why it would make this difficult?

Is it possible to merge two Postgres databases

We have two copies of a simple application that is based on SQLite. The application has 10 tables with a variety of relations between the tables. We would like to merge the databases to a single Postgres database with the same schema. We can use Talend to facilitate this, however the issue is that there would be duplicate keys (as both the source databases are independent). Is there a systematic method by which we can insert data into Postgres with the original key plus an offset resulting from loading the first database?
Step 1. Restore the first database.
Step 2. Change foreign keys of all tables by adding the option on update cascade.
For example, if the column table_b.a_id refers to the column table_a.id:
alter table table_b
drop constraint table_b_a_id_fkey,
add constraint table_b_a_id_fkey
foreign key (a_id) references table_a(id)
on update cascade;
Step 3. Update primary keys of the tables by adding the desired offset, e.g.:
update table_a
set id = 10000+ id;
Step 4. Restore the second database.
If you are able to edit the script with the database schema (or do the transfer manually with your own script), you can merge steps 1 and 2 by editing the script before the restore (adding the option on update cascade to the foreign keys in the table declarations).
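For the merged variant, the table declarations in the schema script could look roughly like this. The table and column names follow the example above; the column types and the extra id column on table_b are assumptions.

```sql
CREATE TABLE table_a (
    id integer PRIMARY KEY
);

CREATE TABLE table_b (
    id   integer PRIMARY KEY,
    -- Declared with ON UPDATE CASCADE up front, so no ALTER TABLE is needed later.
    a_id integer REFERENCES table_a (id) ON UPDATE CASCADE
);

-- After loading the first database: shift the keys.
-- table_b.a_id follows automatically through the cascade.
UPDATE table_a
SET id = 10000 + id;
```

The offset has to be larger than the highest key in the second database, otherwise the shifted rows can still collide with the rows loaded in step 4.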

Redshift: Is using a foreign key necessary to take advantage of distribution keys?

In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?
The reason I'm asking is that I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimension tables would have to be maintained.
Consider the following example where I have two tables that are frequently joined:
CREATE TABLE motorcycles
(
id INT,
hexcolor CHAR(6)
);
CREATE TABLE helmets
(
id INT,
hexcolor CHAR(6)
);
Now suppose in my application, we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY and use DISTKEY (hexcolor), right? However, you can't really say that the hexcolor column from the motorcycles table is a foreign key to the helmets table or vice-versa. I could create a dimension table that just had a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to this dimension table, but it would be a pain to have to maintain this dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).
So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?
As long as the columns have the same data type, you should expect Redshift to distribute the motorcycles and helmets tables in the same fashion.
There is no justification for a foreign key in your case. The query planner will be able to take advantage of the fact that the tables are distributed by the same key.
But it's always good to read the execution plan and make sure that it says DS_DIST_NONE - which means that no data redistribution was needed.
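A sketch of the example with both tables distributed on hexcolor, followed by the plan check. The table and column names come from the question; the SELECT list and aliases are only illustrative.

```sql
CREATE TABLE motorcycles
(
    id INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

CREATE TABLE helmets
(
    id INT,
    hexcolor CHAR(6)
)
DISTSTYLE KEY
DISTKEY (hexcolor);

-- Inspect the join step of the plan for its distribution attribute.
EXPLAIN
SELECT m.id AS motorcycle_id, h.id AS helmet_id
FROM motorcycles m
JOIN helmets h ON m.hexcolor = h.hexcolor;
```

If the join step shows DS_DIST_NONE, the rows were already co-located; attributes such as DS_BCAST_INNER or DS_DIST_BOTH would indicate that data had to be moved.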

How to add a sort key to an existing table in AWS Redshift

In AWS Redshift, I want to add a sort key to a table that is already created. Is there any command which can add a column and use it as sort key?
UPDATE:
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
source: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-supports-changing-table-sort-keys-dynamically/
At the moment I think it's not possible (hopefully that will change in the future). In the past, when I ran into this kind of situation, I created a new table and copied the data from the old one into it.
from http://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html:
ADD [ COLUMN ] column_name
Adds a column with the specified name to the table. You can add only one column in each ALTER TABLE statement.
You cannot add a column that is the distribution key (DISTKEY) or a sort key (SORTKEY) of the table.
You cannot use an ALTER TABLE ADD COLUMN command to modify the following table and column attributes:
UNIQUE
PRIMARY KEY
REFERENCES (foreign key)
IDENTITY
The maximum column name length is 127 characters; longer names are truncated to 127 characters. The maximum number of columns you can define in a single table is 1,600.
As Yaniv Kessler mentioned, it's not possible to add or change distkey and sort key after creating a table, and you have to recreate a table and copy all data to the new table.
You can use the following SQL format to recreate a table with a new design.
ALTER TABLE test_table RENAME TO old_test_table;
CREATE TABLE new_test_table([new table columns]);
INSERT INTO new_test_table (SELECT * FROM old_test_table);
ALTER TABLE new_test_table RENAME TO test_table;
DROP TABLE old_test_table;
In my experience, this approach is used not only for changing the distkey and sortkey, but also for setting the encoding (compression) type.
To add to Yaniv's answer, the ideal way to do this is probably using the CREATE TABLE AS command. You can specify the distkey and sortkey explicitly. For example:
CREATE TABLE test_table_with_dist
distkey(field)
sortkey(sortfield)
AS
select * from test_table
Additional examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_examples.html
EDIT
I've noticed that this method doesn't preserve encoding. Redshift only automatically encodes during a copy statement. If this is a persistent table you should redefine the table and specify the encoding.
create table test_table_with_dist(
field1 varchar encode raw distkey,
field2 timestamp encode delta sortkey);
insert into test_table_with_dist select * from test_table;
You can figure out which encoding to use by running analyze compression test_table;
AWS now allows you to add both sortkeys and distkeys without having to recreate tables:
To add a sortkey (or alter a sortkey):
ALTER TABLE data.engagements_bot_free_raw
ALTER SORTKEY (id)
To alter a distkey or add a distkey:
ALTER TABLE data.engagements_bot_free_raw
ALTER DISTKEY id
Interestingly, the parentheses are mandatory on SORTKEY, but not on DISTKEY.
You still cannot change the encoding of a table in place; that still requires one of the solutions where you recreate the table.
I followed this approach to add sort columns to my table table_transactions. It is more or less the same approach, only with fewer commands.
alter table table_transactions rename to table_transactions_backup;
create table table_transactions compound sortkey(key1, key2, key3, key4) as select * from table_transactions_backup;
drop table table_transactions_backup;
Catching this question a bit late.
I find that using WHERE 1=1 is the best way to create a table and replicate data into it in Redshift, e.g.:
CREATE TABLE NEWTABLE AS SELECT * FROM OLDTABLE WHERE 1=1;
Then you can drop OLDTABLE after verifying that the data has been copied.
(If you replace 1=1 with 1=2, it copies only the structure, which is good for creating staging tables.)
It is now possible to alter a sort key:
Amazon Redshift now supports changing table sort keys dynamically
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
Customers when creating Redshift tables can optionally specify one or more table columns as sort keys. The sort keys are used to maintain the sort order of the Redshift tables and allows the query engine to achieve high performance by reducing the amount of data to read from disk and to save on storage with better compression. Currently Redshift customers who desire to change the sort keys after the initial table creation will need to re-create the table with new sort key definitions.
With the new ALTER SORT KEY command, users can dynamically change the Redshift table sort keys as needed. Redshift will take care of adjusting data layout behind the scenes and table remains available for users to query. Users can modify sort keys for a given table as many times as needed and they can alter sort keys for multiple tables simultaneously.
For more information ALTER SORT KEY, please refer to the documentation.
As for the documentation itself:
ALTER DISTKEY column_name or ALTER DISTSTYLE KEY DISTKEY column_name A
clause that changes the column used as the distribution key of a
table. Consider the following:
VACUUM and ALTER DISTKEY cannot run concurrently on the same table.
If VACUUM is already running, then ALTER DISTKEY returns an error.
If ALTER DISTKEY is running, then background vacuum doesn't start on a table.
If ALTER DISTKEY is running, then foreground vacuum returns an error.
You can only run one ALTER DISTKEY command on a table at a time.
The ALTER DISTKEY command is not supported for tables with interleaved sort keys.
When specifying DISTSTYLE KEY, the data is distributed by the values in the DISTKEY column. For more information about DISTSTYLE, see CREATE TABLE.
ALTER [COMPOUND] SORTKEY ( column_name [,...] ) A clause that changes
or adds the sort key used for a table. Consider the following:
You can define a maximum of 400 columns for a sort key per table.
You can only alter a compound sort key. You can't alter an interleaved sort key.
When data is loaded into a table, the data is loaded in the order of the sort key. When you alter the sort key, Amazon Redshift reorders the data. For more information about SORTKEY, see CREATE TABLE.
According to the updated documentation it is now possible to change a sort key type with:
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
For reference (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html):
"You can alter an interleaved sort key to a compound sort key or no sort key. However, you can't alter a compound sort key to an interleaved sort key."
ALTER TABLE table_name ALTER SORTKEY (sortKey1, sortKey2 ...etc)
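To confirm that the change took effect, one option is to look the table up in the SVV_TABLE_INFO system view afterwards. A sketch, with table_name, sortKey1 and sortKey2 as placeholders:

```sql
ALTER TABLE table_name ALTER SORTKEY (sortKey1, sortKey2);

-- sortkey1 is the first sort key column, sortkey_num the number of sort key columns.
SELECT "schema", "table", diststyle, sortkey1, sortkey_num
FROM svv_table_info
WHERE "table" = 'table_name';
```

As far as I know, SVV_TABLE_INFO does not list empty tables, so a freshly created table with no rows will not show up here.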