How can I ensure synchronous DDL operations on a table that is being replaced? - amazon-redshift

I have multiple processes which are continually refreshing data in Redshift. They start a transaction, create a new table, COPY all the data from S3 into the new table, then drop the old table and rename the new table to the old table.
pseudocode:
start transaction;
create table foo_temp;
copy into foo_temp from S3;
drop table foo;
rename table foo_temp to foo;
commit;
I have several dozen tables that I update in this way. This works well but I would like to have multiple processes performing these table updates for redundancy purposes and to ensure that data is fairly fresh (different processes can update the data for different tables concurrently).
It works fine unless one process attempts to refresh a table that another process is working on. In that case the second process gets blocked by the first until it commits, and when it commits the second process gets the error:
ERROR: table 12345 dropped by concurrent transaction
Is there a simple way for me to guarantee that only one of my processes is refreshing a table so that the second process doesn't get into this situation?
I considered creating a special lock table for each of my real tables. The process would LOCK the special lock table before working on the companion real table. I think that will work but I would like to avoid creating a special lock table for each of my tables.

You need to protect readers from seeing the drop. Do this by renaming the tables inside the transaction and dropping the old table only after the commit:
begin transaction
rename main table to old_main_table
rename tmp table to main table
commit
drop table old_main_table
Conn #1               Conn #2
--------------------  ----------------------------------------------------
                      > create table bar (id int,id2 int,id3 int);
                      CREATE TABLE
> begin;
BEGIN
                      > begin;
                      BEGIN
                      > alter table bar rename to bar2;
                      ALTER TABLE
> select * from bar;
                      > create table bar (id int,id2 int,id3 int,id4 int);
                      CREATE TABLE
                      > commit; drop table bar2;
                      COMMIT
 id | id2 | id3
----+-----+-----
(0 rows)
> commit;
COMMIT
                      DROP TABLE
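The rename-then-drop sequence above can be tried out locally. The following is only a minimal sketch: it uses an in-memory SQLite database as a stand-in (not Redshift), so it shows the order of statements but not Redshift's locking behaviour. The helper name `swap_tables` and the `_old` suffix are hypothetical; the table names `foo` and `foo_temp` come from the question.

```python
import sqlite3


def swap_tables(conn, main, staging):
    """Rename-swap `staging` into `main`, then drop the old copy after commit."""
    old = f"{main}_old"  # hypothetical naming convention for the retired table
    conn.execute("BEGIN")
    try:
        conn.execute(f'ALTER TABLE "{main}" RENAME TO "{old}"')
        conn.execute(f'ALTER TABLE "{staging}" RENAME TO "{main}"')
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # The drop happens outside the swap transaction, as in the answer above.
    conn.execute(f'DROP TABLE "{old}"')


# isolation_level=None keeps the sqlite3 module from opening transactions
# implicitly, so the explicit BEGIN/COMMIT above are the only boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE foo (id INTEGER)")
conn.execute("INSERT INTO foo VALUES (1)")
conn.execute("CREATE TABLE foo_temp (id INTEGER)")
conn.executemany("INSERT INTO foo_temp VALUES (?)", [(2,), (3,)])

swap_tables(conn, "foo", "foo_temp")

rows = conn.execute("SELECT id FROM foo ORDER BY id").fetchall()
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

After the call, `foo` holds the staged rows and neither `foo_temp` nor `foo_old` remains.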

Related

How to remove columns for real in postgresql?

I have a large system whose table schema is updated quite often. I noticed that after repeatedly dropping and re-creating columns, the error "tables can have at most 1600 columns" appears, even though information_schema.columns shows only a few columns.
I've tried VACUUM FULL ANALYZE, but it still doesn't work. Is there any way to avoid this limitation?
DO $$
declare tbname varchar(1024);
BEGIN
FOR i IN 1..1599 LOOP
tbname := 'alter table vacuum_test add column test' || CAST(i AS varchar(8)) ||' int';
EXECUTE tbname;
END LOOP;
END $$;
alter table vacuum_test drop column test1;
VACUUM FULL ANALYZE vacuum_test;
alter table vacuum_test add column test1 int;
result:
alter table vacuum_test add column test1 int
> ERROR: tables can have at most 1600 columns
> Time: 0.054s
Unfortunately, VACUUM FULL does not remove dropped columns from the table (i.e. entries that have attisdropped = true in pg_attribute). I would have expected it to, but apparently this does not happen.
The only way to get rid of the hidden columns is to create a brand new table and copy the data to the new table.
Something along these lines:
create table new_table (like old_table including all);
insert into new_table
select *
from old_table;
Then drop the old table and rename the new one to the old name. Constraints and indexes will get different names, so you might want to rename them as well.
You will have to re-create all foreign keys (incoming and outgoing) manually, as they are not included when using CREATE TABLE (LIKE ...).
Another option is to use pg_repack, which does this in the background without holding an exclusive lock on the table for the duration of the rewrite.
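To check how many dropped-column entries a table still carries before deciding to rebuild it, pg_attribute can be queried directly. This is a small sketch only: the helper name `dropped_column_query` is hypothetical, and the `%s` placeholder assumes a driver with psycopg-style parameter binding.

```python
def dropped_column_query(table_name):
    """Return (sql, params) counting dropped-column entries kept for a table."""
    sql = (
        "SELECT count(*) FROM pg_attribute "
        "WHERE attrelid = %s::regclass "  # resolve the table name to its OID
        "AND attnum > 0 "                 # skip system columns
        "AND attisdropped"                # only columns that were dropped
    )
    return sql, (table_name,)


sql, params = dropped_column_query("vacuum_test")
```

The pair would be passed to the driver as `cursor.execute(sql, params)`, assuming such a cursor is available.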

Postgres create table asynchronously

I'm struggling with Postgres and an async task queue.
I'm trying to create a new table: whichever worker reaches this point first creates the table,
using the statement
create table if not exists
but a 'table already exists' exception is raised.
This is really weird, because with a single worker (i.e. trying to create the table twice synchronously), the second attempt only writes a notice (not an exception).
This is not an answer, just a note on how the problem can be reproduced, so bear with me.
Open two terminals, say tty1 and tty2, open psql in each.
tty1:
nd#postgres=# begin;
BEGIN
*nd#postgres=# create table if not exists foo();
CREATE TABLE
*nd#postgres=#
tty2:
nd#postgres=# begin;
BEGIN
*nd#postgres=# create table if not exists foo();
(waiting for lock)
tty1:
*nd#postgres=# commit;
COMMIT
tty2:
ERROR: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(foo, 16386) already exists.
!nd#postgres=#
I'm not sure PostgreSQL should be any smarter in such cases. IMO something is wrong with the application logic...
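One possible way to adjust the application logic (not something proposed in the thread) is to make the workers take turns with a transaction-level advisory lock before issuing the CREATE. The sketch below only builds the statements. The helper name `create_table_once` and the CRC32-based lock key are assumptions for illustration; `pg_advisory_xact_lock` is a standard PostgreSQL function.

```python
import zlib


def create_table_once(table_name, columns_sql):
    """Statements that serialise concurrent CREATE TABLE IF NOT EXISTS calls."""
    # Hypothetical key scheme: CRC32 of the name is an unsigned 32-bit value,
    # which fits in the bigint argument of pg_advisory_xact_lock.
    lock_key = zlib.crc32(table_name.encode("utf-8"))
    return [
        "BEGIN",
        # Blocks until no other transaction holds the same key; the lock is
        # released automatically at COMMIT or ROLLBACK.
        f"SELECT pg_advisory_xact_lock({lock_key})",
        f"CREATE TABLE IF NOT EXISTS {table_name} ({columns_sql})",
        "COMMIT",
    ]


statements = create_table_once("foo", "id int")
```

A worker that waited for the lock would run its CREATE only after the first worker has committed, so it should take the "already exists" notice path. Two different names that happen to share a CRC32 value would merely wait on each other unnecessarily.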

Best practices for performing a table swap in Redshift

We're in the process of running a handful of hourly scripts on our Redshift cluster which build summary tables for data consumers. After assembling a staging table, the script then runs a transaction which deletes the existing table and replaces it with the staging table, as such:
BEGIN;
DROP TABLE IF EXISTS public.data_facts;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
COMMIT;
The problem with this operation is that long-running analysis queries will place an AccessShareLock on public.data_facts, preventing it from being dropped and thrashing our ETL cycle. I'm thinking a better solution would be one which renames the existing table, as such:
ALTER TABLE public.data_facts RENAME TO data_facts_old;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
DROP TABLE public.data_facts_old;
However, this approach presupposes that 1) public.data_facts exists, and 2) public.data_facts_old does not exist.
Do you know if there's a way to conduct this operation safely in SQL, without relying on application logic? (eg. something like ALTER TABLE IF EXISTS).
I haven't tried it but looking at the documentation of CREATE VIEW it seems that this can be done with late-binding views.
The main idea would be a view public.data_facts that users interact with. Behind the scenes, you can load new data and then swap the view to “point” to the new table.
Bootstrap
-- load data into public.data_facts_v0
CREATE VIEW public.data_facts AS
SELECT * from public.data_facts_v0 WITH NO SCHEMA BINDING;
Update
-- load data into public.data_facts_v1
CREATE OR REPLACE VIEW public.data_facts AS
SELECT * from public.data_facts_v1 WITH NO SCHEMA BINDING;
DROP TABLE public.data_facts_v0;
The WITH NO SCHEMA BINDING means the view will be late-binding. “A late-binding view doesn't check the underlying database objects, such as tables and other views, until the view is queried.” This means the update can even introduce a table with renamed columns or a completely new structure.
Notes:
It might be a good idea to wrap the swap operations into a transaction to make sure we don't drop the previous table if the VIEW swap failed.
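The transaction-wrapped swap mentioned in the note could be scripted roughly as follows. This is a sketch that only assembles the statements from the Update step above. The helper name `view_swap_statements` is hypothetical, and it has not been run against a Redshift cluster.

```python
def view_swap_statements(view, new_table, old_table):
    """Statements that repoint a late-binding view and retire the old table."""
    return [
        "BEGIN",
        f"CREATE OR REPLACE VIEW {view} AS "
        f"SELECT * FROM {new_table} WITH NO SCHEMA BINDING",
        # Inside the same transaction, so the old table is only dropped
        # if the view swap succeeded.
        f"DROP TABLE {old_table}",
        "COMMIT",
    ]


statements = view_swap_statements(
    "public.data_facts", "public.data_facts_v1", "public.data_facts_v0")
```

Each hourly run would pass the freshly loaded table as `new_table` and the previous one as `old_table`.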
You can add a new column, load_time timestamp encode runlength default getdate(), to your target table, and make your ETL do this:
INSERT INTO public.data_facts
SELECT * FROM public.data_facts_staging;
DELETE FROM public.data_facts
WHERE load_time<(select max(load_time) from public.data_facts);
DROP TABLE public.data_facts_staging;
Note: public.data_facts_staging should have exactly the same structure as public.data_facts, except that the last column of public.data_facts is load_time, so that on insert it will be populated with the current timestamp.
The only implications are that this requires extra disk space for the moment between inserting the new rows and deleting the old rows, and that load_time always has to be the last column. You also have to vacuum the table every time you do this.
Another good thing about this is that if your ETL fails and the staging table is empty, or there is no staging table, you won't lose your data. In the pure SQL scenario of swapping tables with DDL, you're not protected from dropping the target table when the staging table is missing. In the suggested scenario, if no new rows are inserted, the delete statement deletes nothing (there are no rows with a load time less than the max load time), so the worst case is just having the old version of the data.
P.S. There is a command that, instead of insert ... select ..., just changes the pointer from the staging table to the target table (alter table ... append from ...), but I guess it requires the same type of lock as alter table, so I don't suggest it.

PostgreSQL TEMP table alternating between exist and not exist

I'm using PostgreSQL 9.6.2, with Toad client on Mac. Auto-commit is set to ON.
I first created a simple temp table like this:
CREATE TEMP TABLE demo_pairs
AS
WITH t (name, value) AS (VALUES ('a', 'b'), ('c', 'd'))
SELECT * FROM t;
Then something weird happens when I ran:
SELECT * FROM demo_pairs;
Every time I run the select (without re-running the create), it alternates between successfully selecting the values and error with table does not exist!
Can anyone help me understand what's going on?
https://www.postgresql.org/docs/current/static/sql-createtable.html
TEMPORARY or TEMP
If specified, the table is created as a temporary table. Temporary
tables are automatically dropped at the end of a session, or
optionally at the end of the current transaction (see ON COMMIT
below). Existing permanent tables with the same name are not visible
to the current session while the temporary table exists, unless they
are referenced with schema-qualified names. Any indexes created on a
temporary table are automatically temporary as well.
If you use a session pooler that can close the session for you, or the session is closed some other way (e.g. by a network problem), the temp table will be dropped.
You can also create the table so that it is dropped at the end of the transaction:
ON COMMIT
The behavior of temporary tables at the end of a transaction block can
be controlled using ON COMMIT. The three options are:
PRESERVE ROWS
No special action is taken at the ends of transactions. This is the
default behavior.
DELETE ROWS
All rows in the temporary table will be deleted at the end of each
transaction block. Essentially, an automatic TRUNCATE is done at each
commit.
DROP
The temporary table will be dropped at the end of the current transaction block.
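One way the alternating behaviour could arise is a client that rotates between two sessions, only one of which owns the temp table. That is an assumption about the client, not something confirmed in the thread. The sketch below reproduces the effect with SQLite as a stand-in (temp tables are private to a connection there too); the file path and the variable names are hypothetical.

```python
import os
import sqlite3
import tempfile
from itertools import cycle

# Two sessions on the same database file, standing in for a two-connection pool.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
session_a = sqlite3.connect(path)
session_b = sqlite3.connect(path)

# Only session_a creates the temp table, so only session_a can see it.
session_a.execute(
    "CREATE TEMP TABLE demo_pairs AS SELECT 'a' AS name, 'b' AS value")

results = []
pool = cycle([session_a, session_b])  # round-robin, like a naive pooler
for _ in range(4):
    session = next(pool)
    try:
        session.execute("SELECT * FROM demo_pairs").fetchall()
        results.append("ok")
    except sqlite3.OperationalError:
        results.append("no such table")
```

Here `results` alternates between success and failure, matching the symptom in the question.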

Flip flopping data tables in Postgres

I have a table of several million records which I am running a query against and inserting the results into another table which clients will query. This process takes about 20 seconds.
How can I run this query, building this new table without impacting any of the clients that might be running queries against the target table?
For instance. I'm running
BEGIN;
DROP TABLE target_table;
SELECT blah, blahX, blahY
INTO target_table
FROM source_table
GROUP BY blahX, blahY;
COMMIT;
Which is then blocking queries to:
SELECT SUM(blah)
FROM target_table
WHERE blahX > x
In the days of working with some SQL Server DBA's I recall them creating temporary tables, and then flipping these in over the current table. Is this doable/practical in Postgres?
What you want here is to minimize the lock time, which is not going to work if your transaction includes a query that takes a while.
In this case, I assume you're in fact refreshing 'target_table', which contains the positions of the "blah" objects, when you run your script. Is that correct?
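-- Slow part: build the result set without touching target_table
BEGIN;
CREATE TEMP TABLE temptable AS
SELECT blah, blahX, blahY
FROM source_table
GROUP BY blahX, blahY;
COMMIT;
-- Fast part: hold the lock on target_table only while copying the rows
BEGIN;
TRUNCATE TABLE target_table;
INSERT INTO target_table(blah,blahX,blahY)
SELECT blah,blahX,blahY FROM temptable;
DROP TABLE temptable;
COMMIT;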
As mentioned in the comments, it will be faster to drop the indexes before truncating and create them anew just after loading the data, to avoid the unneeded index changes.
For the full details of what is and is not possible with PostgreSQL in that regard:
http://postgresql.1045698.n5.nabble.com/ALTER-TABLE-REPLACE-WITH-td3305036i40.html
There's ALTER TABLE ... RENAME TO ...:
ALTER TABLE name
RENAME TO new_name
Perhaps you could select into an intermediate table and then drop target_table and rename the intermediate table to target_table.
I have no idea how this would interact with any queries that may be running against target_table when you try to do the rename.
You can create a table, drop a table, and rename a table in every version of SQL I've ever used.
BEGIN;
SELECT blah, blahX, blahY
INTO new_table
FROM source_table
GROUP BY blahX, blahY;
DROP TABLE target_table;
ALTER TABLE new_table RENAME TO target_table;
COMMIT;
I'm not sure off the top of my head whether you could use a temporary table for this in PostgreSQL. PostgreSQL creates temp tables in a special schema; you don't get to pick the schema. But you might be able to create it as a temporary table, drop the existing table, and move it with SET SCHEMA.
At some point, any of these will require a table lock. (Duh.) You might be able to speed things up a lot by putting the swappable table on an SSD.