AWS Redshift accepting duplicates, although Primary Key is declared - amazon-redshift

I'm new to Redshift. I need help understanding a behavior of Redshift that I came across. I used the following query to create a new table:
CREATE TABLE customer
(
cust_id INTEGER NOT NULL UNIQUE,
email VARCHAR(30),
name CHAR(30),
PRIMARY KEY (cust_id)
);
The table is created successfully, but the problem appears with data insertion using the following query:
INSERT INTO customer VALUES (1, 'john.doe@email.com', 'John Doe');
The table accepts duplicates even though the primary key is defined. Can someone help me understand this behavior?
I'm also going through redshift documentation to understand the cause behind this.

Redshift, like most clustered warehouse databases, doesn't enforce uniqueness constraints, because checking uniqueness across a cluster is prohibitively expensive. For example, Snowflake works the same way.
Your ETL process needs to enforce uniqueness if you need the values to be unique in Redshift.
From - https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
Uniqueness, primary key, and foreign key constraints are informational
only; they are not enforced by Amazon Redshift. Nonetheless, primary
keys and foreign keys are used as planning hints and they should be
declared if your ETL process or some other process in your application
enforces their integrity.
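One common way to enforce uniqueness in the ETL layer is to load into a staging table and insert only rows whose keys are not already present. A minimal sketch, assuming a staging table named customer_staging (the staging table name is illustrative):

```sql
BEGIN;

-- Insert only rows whose cust_id does not already exist in the target.
INSERT INTO customer
SELECT s.cust_id, s.email, s.name
FROM customer_staging s
LEFT JOIN customer c ON c.cust_id = s.cust_id
WHERE c.cust_id IS NULL;

TRUNCATE customer_staging;

COMMIT;
```

If the staging data itself may contain duplicates, deduplicate it first with a ROW_NUMBER() window over cust_id, keeping only the first row per key.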

Related

Out of shared memory when deleting rows with lots of incoming foreign keys

I develop a multi-tenancy application where we have a single master schema to keep track of tenants, along with 99 app databases to distribute load. Each of 33 tables within each app database also has a tenant column pointing to the master schema. This means there are 3,267 foreign keys pointing to the master schema's tenant id, and roughly 6000 triggers associated with the tenant table.
Recently, I added a table and started getting this error in the teardown portion of our test suite where we delete the test tenant:
psycopg2.errors.OutOfMemory: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "test2"."item" x WHERE $1 OPERATOR(pg_catalog.=) "tenant" FOR KEY SHARE OF x"
For query
SET CONSTRAINTS ALL IMMEDIATE
Raising max_locks_per_transaction as suggested solves the problem, as does deleting some of the app schemas. The obvious solution here would be to reduce the number of redundant schemas or delete the foreign key constraints so we don't have to hold so many locks, but I'm curious if there's something else going on here.
I had imagined that only the rows to be deleted (associated with the test schema) would be locked, and so only the test schema would be locked. And anyway, by this point there is no data left pointing to the tenant table, so the locking is pretty much redundant in practice.
Update:
For more context, I'm not doing anything really fancy here. Below is a simplified example of what my schema and query look like:
CREATE SCHEMA master;
CREATE table master.tenant (id uuid NOT NULL PRIMARY KEY);
CREATE SCHEMA app_00;
CREATE table app_00.account (id uuid NOT NULL PRIMARY KEY, tenant uuid NOT NULL);
ALTER TABLE app_00.account ADD CONSTRAINT fk_tenant FOREIGN KEY (tenant) REFERENCES master.tenant(id) DEFERRABLE;
CREATE table app_00.item (id uuid NOT NULL PRIMARY KEY, tenant uuid NOT NULL);
ALTER TABLE app_00.item ADD CONSTRAINT fk_tenant FOREIGN KEY (tenant) REFERENCES master.tenant(id) DEFERRABLE;
In reality I'm creating 33 tables in each of the schemas app_00 through app_99. Now, assuming the database is populated with data, the query that fails with the above error is:
DELETE FROM master.tenant WHERE id = 'some uuid';
You don't tell us much about the setup, but partitioning or inheritance is probably involved. These features often require that a statement recurse into table partitions or inheritance children, either during query planning or during execution. At any rate, your SQL statements have to touch many tables.
Whenever PostgreSQL touches a table, it places a lock on it to avoid conflicting concurrent executions. If many tables are involved, the lock table, which initially has max_connections * max_locks_per_transaction entries, can be exhausted.
The solution is simply to increase max_locks_per_transaction. Don't worry, there is no negative consequence to raising that parameter; only a little more shared memory is allocated during server startup.
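For reference, the parameter can be raised persistently with ALTER SYSTEM; a server restart is required because the lock table lives in shared memory. The value 256 below is only an example, not a recommendation:

```sql
-- Raise the per-transaction lock slots (the default is 64).
-- Takes effect only after a server restart.
ALTER SYSTEM SET max_locks_per_transaction = 256;

-- While a large DELETE is running, you can watch lock usage grow:
SELECT count(*) FROM pg_locks;
```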

TimescaleDB/PostgreSQL: how to use unique constraint when creating hypertables?

I am trying to create a table in PostgreSQL to contain lots of data, and for that reason I want to use TimescaleDB's hypertables, as in the example below.
CREATE TABLE "datapoints" (
"tstz" timestamptz NOT NULL,
"id" bigserial UNIQUE NOT NULL,
"entity_id" bigint NOT NULL,
"value" real NOT NULL,
PRIMARY KEY ("id", "tstz", "entity_id")
);
SELECT create_hypertable('datapoints','tstz');
However, this throws the error shown below. As far as I have figured out, the error arises because a unique constraint that doesn't include the partitioning column isn't allowed on hypertables, but I really need the uniqueness. Does anyone have an idea how to solve or work around this?
ERROR: cannot create a unique index without the column "tstz" (used in partitioning)
SQL state: TS103
There is no way to avoid that.
TimescaleDB uses PostgreSQL partitioning, and it is not possible to have a primary key or unique constraint on a partitioned table that does not contain the partitioning key.
The reason behind that is that an index on a partitioned table consists of individual indexes on the partitions (these are the partitions of the partitioned index). Now the only way to guarantee uniqueness for such a partitioned index is to have the uniqueness implicit in the definition, which is only the case if the partitioning key is part of the index.
So you either have to sacrifice the uniqueness constraint on id (which is effectively guaranteed anyway if the values come from a sequence) or do without partitioning.
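If uniqueness of id per timestamp is enough for your use case, you can drop the standalone UNIQUE on id and let the sequence provide practical global uniqueness; a sketch based on the table above:

```sql
CREATE TABLE "datapoints" (
    "tstz"      timestamptz NOT NULL,
    "id"        bigserial   NOT NULL,
    "entity_id" bigint      NOT NULL,
    "value"     real        NOT NULL,
    -- the unique constraint now includes the partitioning column,
    -- so TimescaleDB can enforce it per chunk
    PRIMARY KEY ("id", "tstz")
);

SELECT create_hypertable('datapoints', 'tstz');
```

This only guarantees that (id, tstz) pairs are unique; global uniqueness of id then rests on the bigserial sequence, as noted above.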

How to write another query in IN function when partitioning

I have 2 local Docker postgresql-10.7 servers set up. On my HOT instance, I have a huge table that I wanted to partition by date (I achieved that). The data from the partitioned table (let's call it PART_TABLE) is stored on the other server; only PART_TABLE_2019 is stored on the HOT instance. And here comes the problem: I don't know how to partition 2 other tables that have foreign keys to PART_TABLE, based on that FK. PART_TABLE and TABLE2_PART are both stored on the HOT instance.
I was thinking something like this:
create table TABLE2_PART_2019 partition of TABLE2_PART for values in (select uuid from PART_TABLE_2019);
But the query doesn't work, and I don't know if this is a good idea (performance-wise or logically).
Let me just mention that I can solve this with either function or script etc. but I would like to do this without scripting.
From doc at https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
"While primary keys are supported on partitioned tables, foreign keys
referencing partitioned tables are not supported. (Foreign key
references from a partitioned table to some other table are
supported.)"
With PostgreSQL v10, foreign keys can only be defined on the individual partitions, so you would have to create the foreign key on each partition separately.
Alternatively, you could upgrade to PostgreSQL v11, which allows foreign keys to be defined on partitioned tables.
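Note that on v11 this means a foreign key declared on the partitioned (referencing) side; the referenced table must still be a regular table, since referencing a partitioned table only became possible in v12. A sketch with illustrative names:

```sql
-- PostgreSQL 11+: a foreign key may be declared on the partitioned
-- table itself and is inherited by every partition.
CREATE TABLE plain_parent (uuid uuid PRIMARY KEY);

CREATE TABLE child_part (
    id        bigint NOT NULL,
    year      int    NOT NULL,
    parent_id uuid   NOT NULL REFERENCES plain_parent (uuid)
) PARTITION BY LIST (year);

CREATE TABLE child_part_2019 PARTITION OF child_part FOR VALUES IN (2019);
```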
Can you explain what a HOT instance is and why it would make this difficult?

Redshift Constraints (Primary Key and Foreign Key Constraints)

I am new to Redshift. When pushing data into Redshift, I created the primary key as VIN (Vehicle Identification Number). Even when pushing the same key twice, I don't get any constraint exception; instead the duplicate data is saved as another record.
The same thing happens with a foreign key constraint. Am I missing any configuration for enabling constraints in the database?
From the AWS documentation:
Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
Do not define primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.
If I read this information correctly, the workaround is to check in your application layer that each VIN to be inserted is unique.
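A common way to apply that check in Redshift itself is a transactional delete-then-insert from a staging table, so a re-pushed VIN replaces rather than duplicates the existing row. A sketch with illustrative table names:

```sql
BEGIN;

-- Remove any existing rows whose VIN is about to be re-inserted...
DELETE FROM vehicles
USING vehicles_staging s
WHERE vehicles.vin = s.vin;

-- ...then insert the fresh copies from staging.
INSERT INTO vehicles
SELECT * FROM vehicles_staging;

COMMIT;
```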

Which index is used to answer aggregates when we have several indexes?

I have a table which is partitioned on a daily basis; each partition has a primary key and several other indexes on columns which are not null. If I get the query plan for the following:
SELECT COUNT(*) FROM parent_table;
I can see that different indexes are used; sometimes the primary key index is chosen and sometimes others. How is Postgres able to decide which index to use? Note that my table is not clustered and has never been clustered. Also, the primary key is serial.
What are the catalog/statistics tables which are used to make this decision?
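One way to investigate is to compare the plan against the sizes the planner sees: for a covering count, Postgres tends to prefer the cheapest (smallest) usable index, based on statistics stored in pg_class. A sketch (for a partitioned table, run the size query against an individual partition):

```sql
-- Show which index each partition scan actually uses:
EXPLAIN SELECT COUNT(*) FROM parent_table;

-- Compare the page and tuple counts the planner sees per index:
SELECT relname, relpages, reltuples
FROM pg_class
WHERE oid IN (SELECT indexrelid FROM pg_index
              WHERE indrelid = 'parent_table'::regclass);
```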