redshift copy using amazon pipeline fails for missing primary key - amazon-redshift

I have a set of files on S3 that I am trying to load into Redshift.
I am using AWS Data Pipeline to do it. The wizard accepted the cluster, database and file format info, but I get an error saying that a primary key is needed on the table to keep existing fields (KEEP_EXISTING).
My table schema is:
create table public.Bens_Analytics_IP_To_FileName(
Day date not null encode delta32k,
IP varchar(30) not null encode text255,
FileName varchar(300) not null encode text32k,
Count integer not null)
distkey(Day)
sortkey(Day,IP);
So then I added a composite primary key to the table to see if that would work, but I get the same error.
create table public.Bens_Analytics_IP_To_FileName(
Day date not null encode delta32k,
IP varchar(30) not null encode text255,
FileName varchar(300) not null encode text32k,
Count integer not null,
primary key(Day,IP,FileName))
distkey(Day)
sortkey(Day,IP);
So I decided to add an identity column as the last column and made it the primary key, but then the COPY operation wants a value for that identity column in the input files, which does not make much sense.
Ideally I want it to work without a primary key, or with a composite primary key.
Any ideas?
Thanks

The documentation is not in great shape. They have added a 'mergeKey' concept that can be any arbitrary key (announcement, docs). With it, you should not have to define a primary key on the table.
But you still need to supply a key so that the new incoming data can be joined against the existing data in the Redshift table.
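To illustrate why a key is needed at all, here is a minimal sketch of a "keep existing" merge done by hand in Redshift SQL. It is not necessarily what Data Pipeline runs internally; the staging table name `stage` is hypothetical, and the join assumes that (Day, IP, FileName) identifies a row, as in the composite key from the question.

```sql
-- Hypothetical staging table with the same columns as the target
create temp table stage (like public.Bens_Analytics_IP_To_FileName);

-- ... COPY the S3 files into stage here ...

begin transaction;

-- Discard staged rows whose key already exists in the target,
-- so the existing rows are kept as they are
delete from stage
using public.Bens_Analytics_IP_To_FileName t
where stage.Day = t.Day
  and stage.IP = t.IP
  and stage.FileName = t.FileName;

-- Append only the rows that are genuinely new
insert into public.Bens_Analytics_IP_To_FileName
select * from stage;

end transaction;

drop table stage;
```

Without some set of columns to join on, there is no way to tell a new row from one that is already loaded, which is why a key has to be supplied even if the table itself declares none.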

In Edit Pipeline, under Parameters, there is a field named myPrimaryKeys (optional). Enter your PK there instead of adding it to your table definition.

Related

Why does id increase by two instead of one on insert

I have been trying to understand this for hours and still cannot work out why it is happening.
I have created two tables, with an ALTER to add the foreign key:
CREATE TABLE stores (
id SERIAL PRIMARY KEY,
store_name TEXT
-- add more fields if needed
);
CREATE TABLE products (
id SERIAL,
store_id INTEGER NOT NULL,
title TEXT,
image TEXT,
url TEXT UNIQUE,
added_date timestamp without time zone NOT NULL DEFAULT NOW(),
PRIMARY KEY(id, store_id)
);
ALTER TABLE products
ADD CONSTRAINT "FK_products_stores" FOREIGN KEY ("store_id")
REFERENCES stores (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE RESTRICT;
and every time I insert a row into products by doing
INSERT
INTO
public.products(store_id, title, image, url)
VALUES((SELECT id FROM stores WHERE store_name = 'footish'),
'Teva Flatform Universal Pride',
'https://www.footish.se/sneakers/teva-flatform-universal-pride-t1116376',
'https://www.footish.se/pub_images/large/teva-flatform-universal-pride-t1116376-p77148.jpg?timestamp=1623417840')
I can see that the id column increases by two on every insert instead of by one, and I would like to know the reason behind that.
I have not been able to figure out why and it would be nice to know! :)
There could be 3 reasons:
You tried to insert data but it failed. A sequence advances even when the insert fails and the transaction is rolled back. A used number is never put back.
You are using a global sequence and inserted data into another table in the meantime. A sequence shared by several tables advances whenever a row is added to any of them.
The sequence is configured with a step size / allocation size of 2. The increment can be configured to any value.
Overall it is not important. What matters is that the value increases automatically and that, even after an error or a delete, an ID that was already handed out is never reused.
If you want a concrete answer, you need to provide the information about the sequence. You can look it up with a SQL CLI or in a tool such as DBeaver.
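The sequence details mentioned above can be looked up with a couple of queries. This is a sketch for PostgreSQL 10 or later (where the `pg_sequences` view exists), and it assumes the sequence has the default name `products_id_seq` that SERIAL would generate for `products.id`.

```sql
-- Name of the sequence behind products.id (created implicitly by SERIAL)
SELECT pg_get_serial_sequence('public.products', 'id');

-- Step size, cache size and the last value handed out
SELECT sequencename, increment_by, cache_size, last_value
FROM pg_sequences
WHERE schemaname = 'public'
  AND sequencename = 'products_id_seq';
```

If `increment_by` is 1, a failed or skipped insert is the more likely cause in this schema: a statement that violates the UNIQUE constraint on url, or an `INSERT ... ON CONFLICT DO NOTHING` that hits a conflict, still consumes a sequence value.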

Unique Identifier in multiple schemas

As the title suggests, I want to have a unique ID as a primary key, but across multiple schemas. I know about UUID but it's just too costly.
Is there any way to do this with a serial?
You can create a global sequence and use that in your table instead of the automatic sequence that a serial column creates.
create schema global;
create schema s1;
create schema s2;
create sequence global.unique_id;
create table s1.t1
(
id integer default nextval('global.unique_id') primary key
);
create table s2.t1
(
id integer default nextval('global.unique_id') primary key
);
The difference from a serial column is that the sequence unique_id doesn't "know" it is used by the id columns. A "serial sequence" is automatically dropped if the corresponding column (or table) is dropped, which is not what you want with a global sequence.
There is one drawback, however: you can't prevent duplicate values across those two tables from being inserted manually. If you want to make sure the sequence is always used to insert values, you can create a trigger that always fetches a value from the sequence.
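The trigger mentioned above could look roughly like this. It is a sketch only: the function name `global.force_unique_id` and the trigger names are made up for illustration, and `EXECUTE FUNCTION` requires PostgreSQL 11 or later (older versions use `EXECUTE PROCEDURE`).

```sql
-- Overwrite whatever id the client supplied with the next sequence value
create function global.force_unique_id() returns trigger
language plpgsql as $$
begin
    new.id := nextval('global.unique_id');
    return new;
end;
$$;

-- Attach the same function to every table that shares the sequence
create trigger t1_force_unique_id
    before insert on s1.t1
    for each row execute function global.force_unique_id();

create trigger t1_force_unique_id
    before insert on s2.t1
    for each row execute function global.force_unique_id();
```

With this in place a manual `insert into s1.t1 (id) values (42)` silently gets a fresh sequence value instead of 42. Updates of id are not covered; a similar `before update` trigger would be needed to block those.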

Building a primary key with Json columns

I am trying to set a unique constraint across several columns, some of which are JSON data types. Since there's no way to make a JSON column a primary key, I thought maybe I could hash the desired columns and build a primary key on the hash. For example:
CREATE TABLE petshop(
name text,
fav_food jsonb,
md5sum uuid);
I can do the following:
SELECT md5(name||fav_food::text) FROM petshop;
But I want that to be computed by default and/or by a trigger that inserts the md5 sum into the column md5sum, and then build a pkey on that column.
But really, I just want to know whether the JSON object is unique, without restricting the keys in the JSON. So if anyone has a better idea, that would help!
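One possible approach, sketched here under the assumption that uniqueness (rather than a primary key as such) is the goal: a unique index on the hash expression from the question, which avoids both the extra column and the trigger. The index name is hypothetical.

```sql
-- Enforce uniqueness of (name, fav_food) through the md5 of their text form
CREATE UNIQUE INDEX petshop_name_fav_food_md5_uq
    ON petshop (md5(name || fav_food::text));
```

Because jsonb normalizes key order and whitespace, two documents that differ only in those respects produce the same text and therefore the same hash. Two caveats: if name or fav_food is NULL the whole expression is NULL and the row is not checked for uniqueness, and plain concatenation without a separator can in principle make different (name, fav_food) pairs produce the same string.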

Generate column value automatically from other columns values and be used as PRIMARY KEY

I have a table with columns named "source" and "id". This table is populated from open data databases.
"id" can't be UNIQUE, since my data comes from other databases with their own id systems. There is a real risk of having the same id for completely different data.
I want to create another column which combines source and id into a single value.
"openDataA" + 123456789 -> "openDataA123456789"
"openDataB" + 123456789 -> "openDataB123456789"
I have seen examples that use || and functions to concatenate values. This is good, but I want to make this third column my PRIMARY KEY, to avoid duplicates and to have a truly unique id that I can query without much computation and use in foreign key constraints from other tables.
I think Composite Types are what I'm looking for, but instead of setting the value manually each time, I want it to be derived automatically when I set only "source" and "id".
I'm fairly new to PostgreSQL, so any help is welcome.
Thank you.
You could just have a composite key in your table:
CREATE TABLE mytable (
source VARCHAR(10),
id VARCHAR(10),
PRIMARY KEY (source, id)
);
If you really want a joined column, you could create a view to display it:
CREATE VIEW myview AS
SELECT *, source || id AS primary_key
FROM mytable;
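The composite key also covers the foreign key requirement from the question: another table can reference both columns together. A minimal sketch, where the table `mytable_notes` and its `note` column are hypothetical:

```sql
-- Child table pointing at mytable through the composite key
CREATE TABLE mytable_notes (
    source VARCHAR(10),
    id     VARCHAR(10),
    note   TEXT,
    FOREIGN KEY (source, id) REFERENCES mytable (source, id)
);
```

The referencing columns must match the key columns in number and type, so no concatenated column is needed for the constraint to work.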

How to update the data type of a column without deleting the values in PostgreSQL?

I made a mistake when creating my table. The primary key was incorrect. I deleted the constraint and now I don't have a primary key in my table, only the field with the data. Now I want to make this field an auto_increment primary key again without losing my data. How can I do this?
I tried this:
ALTER TABLE name_table ADD COLUMN name_column serial primary key;
But this creates a new column instead of reusing the one that holds my data, which I don't want.
Try this:
ALTER TABLE table_name ADD CONSTRAINT some_name primary key (name_column);
My suggestion:
First back up your database as SQL, CSV, XML, Excel, or anything else that can be restored.
Then alter the table structure or column data type, from the UI or the command line.
If the data in the table is lost afterwards, restore only the data from the backup (not the table structure).
After that you have the changed column data type and your data as well. I hope it works.
Hi guys, I tried several approaches and found this one; maybe somebody else can use it later:
Create a sequence. A sequence is how PostgreSQL generates auto-increment values. An auto-increment field is usually also the primary key; that is not a rule, but it is the case most of the time.
A sequence is created like this:
CREATE SEQUENCE exemplo_id_seq
INCREMENT 1 -- each nextval call advances the value by 1
MINVALUE 1
NO MAXVALUE -- no explicit upper bound; the type default applies
START 1 -- counting starts at 1
CACHE 1;
After this, assign the sequence to the affected field as its default using NEXTVAL, like this:
ALTER TABLE table_name ALTER COLUMN id SET DEFAULT NEXTVAL('exemplo_id_seq'::regclass);
This works well and keeps the existing data.
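One thing the steps above leave open is that a sequence starting at 1 will hand out ids that the existing rows may already use. A sketch of the full sequence of statements, assuming the column is an integer named `id` on `table_name` and using a hypothetical sequence name:

```sql
-- Tie the sequence to the column so it is dropped together with it
CREATE SEQUENCE table_name_id_seq OWNED BY table_name.id;

-- Start after the highest id already present (or at 1 for an empty table)
SELECT setval('table_name_id_seq',
              COALESCE((SELECT max(id) FROM table_name), 0) + 1,
              false);

-- New rows get their id from the sequence
ALTER TABLE table_name
    ALTER COLUMN id SET DEFAULT nextval('table_name_id_seq');

-- Restore the primary key on the existing column
ALTER TABLE table_name ADD PRIMARY KEY (id);
```

The third argument `false` to `setval` means the next `nextval` call returns exactly the value given, rather than the one after it. The last statement fails if the existing ids contain duplicates or NULLs, which is worth knowing before relying on the column as a key.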