Why is table-swapping in Postgres so verbose? - postgresql

I'd like to backfill a column of a large (20M rows), frequently-read but rarely-written table. From various articles and questions on SO, it seems like the best way to do this is to create a table with identical structure, load in the backfilled data, and live-swap (since renaming is pretty quick). Sounds good!
But when I actually write the script to do this, it is mind-blowingly long. Here's a taste:
BEGIN;
CREATE TABLE foo_new (LIKE foo);
-- I don't use INCLUDING ALL, because that produces Indexes/Constraints with different names
-- This is the only part of the script that is specific to my case.
-- Everything else is standard for any table swap
INSERT INTO foo_new (id, first_name, last_name, email, full_name)
SELECT id, first_name, last_name, email, first_name || last_name FROM foo;
CREATE SEQUENCE foo_new_id_seq
START 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
SELECT setval('foo_new_id_seq', COALESCE((SELECT MAX(id)+1 FROM foo_new), 1), false);
ALTER SEQUENCE foo_new_id_seq OWNED BY foo_new.id;
ALTER TABLE ONLY foo_new ALTER COLUMN id SET DEFAULT nextval('foo_new_id_seq'::regclass);
ALTER TABLE foo_new
ADD CONSTRAINT foo_new_pkey
PRIMARY KEY (id);
COMMIT;
-- Indexes are made concurrently, otherwise they would block reads for
-- a long time. Concurrent index creation cannot occur within a transaction.
CREATE INDEX CONCURRENTLY foo_new_on_first_name ON foo_new USING btree (first_name);
CREATE INDEX CONCURRENTLY foo_new_on_last_name ON foo_new USING btree (last_name);
CREATE INDEX CONCURRENTLY foo_new_on_email ON foo_new USING btree (email);
-- One more line for each index
BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
ALTER SEQUENCE foo_id_seq RENAME TO foo_old_id_seq;
ALTER SEQUENCE foo_new_id_seq RENAME TO foo_id_seq;
ALTER TABLE foo_old RENAME CONSTRAINT foo_pkey TO foo_old_pkey;
ALTER TABLE foo RENAME CONSTRAINT foo_new_pkey TO foo_pkey;
ALTER INDEX foo_on_first_name RENAME TO foo_old_on_first_name;
ALTER INDEX foo_on_last_name RENAME TO foo_old_on_last_name;
ALTER INDEX foo_on_email RENAME TO foo_old_on_email;
-- One more line for each index
ALTER INDEX foo_new_on_first_name RENAME TO foo_on_first_name;
ALTER INDEX foo_new_on_last_name RENAME TO foo_on_last_name;
ALTER INDEX foo_new_on_email RENAME TO foo_on_email;
-- One more line for each index
COMMIT;
-- TODO: drop old table (CASCADE)
And this doesn't even include foreign keys or other constraints! Since the only part of this that is specific to my case is the INSERT INTO bit, I'm surprised that there's no built-in Postgres function to do this sort of swapping. Is this operation less common than I make it out to be? Am I underestimating the variety of ways this can be accomplished? Is my desire to keep naming consistent an atypical one?

It's probably not all that common. Most tables aren't big enough to warrant it, and most applications can tolerate some amount of downtime here and there.
More importantly, different applications can afford to cut corners in different ways depending on their workload. The database server can't; it needs to handle (or to very deliberately not handle) every possible obscure edge-case, which is likely a lot harder than you might expect. Ultimately, writing tailored solutions for different use cases probably makes more sense.
Anyway, if you're just trying to implement a calculated field as first_name || last_name, there are better ways of doing it:
ALTER TABLE foo RENAME TO foo_base;
CREATE VIEW foo AS
SELECT
id,
first_name,
last_name,
email,
(first_name || last_name) AS full_name
FROM foo_base;
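Reads then work exactly as before, with full_name computed on the fly - for example (email value hypothetical):
SELECT id, full_name FROM foo WHERE email = 'john@example.com';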
Assuming that your real case is more complicated, all of this effort may still be unnecessary. I believe that the copy-and-rename approach is largely based on the assumption that you need to lock the table against concurrent modifications for the duration of this process, so the goal is to get it done as quickly as possible. If all concurrent operations are read-only - which appears to be the case, since you're not locking the table - then you're probably better off with a simple UPDATE: it won't block SELECTs, and it avoids foreign key re-checks and TOAST table rewrites, even if it does take a bit longer.
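A minimal sketch of that in-place backfill, assuming the full_name column already exists and is currently NULL:
-- Concurrent SELECTs continue unblocked; in PostgreSQL, readers never block on writers.
UPDATE foo
SET full_name = first_name || last_name
WHERE full_name IS NULL;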
If this approach really is justified, I think there are a few opportunities for improvement:
You don't need to recreate/reset the sequence; you can just link the existing sequence to the new table.
CREATE INDEX CONCURRENTLY seems unnecessary, as nobody else should be trying to access foo_new yet. In fact, if the whole script were in one transaction, it wouldn't even be externally visible at this point.
Table names only need to be unique within a schema. If you temporarily create a schema for the new table, you should be able to replace all of those RENAMEs with a single ALTER TABLE foo SET SCHEMA public.
Even if you don't expect concurrent writes, it wouldn't hurt to LOCK foo IN SHARE MODE anyway...
EDIT:
The sequence reassignment is a little more involved than I expected, as it seems that sequences need to stay in the same schema as their parent table. But here is (what appears to be) a working example:
BEGIN;
LOCK public.foo IN SHARE MODE;
CREATE SCHEMA tmp;
CREATE TABLE tmp.foo (LIKE public.foo);
INSERT INTO tmp.foo (id, first_name, last_name, email, full_name)
SELECT id, first_name, last_name, email, (first_name || last_name) FROM public.foo;
ALTER TABLE tmp.foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
CREATE INDEX foo_on_first_name ON tmp.foo (first_name);
CREATE INDEX foo_on_last_name ON tmp.foo (last_name);
CREATE INDEX foo_on_email ON tmp.foo (email);
ALTER TABLE tmp.foo ALTER COLUMN id SET DEFAULT nextval('public.foo_id_seq');
ALTER SEQUENCE public.foo_id_seq OWNED BY NONE;
DROP TABLE public.foo;
ALTER TABLE tmp.foo SET SCHEMA public;
ALTER SEQUENCE public.foo_id_seq OWNED BY public.foo.id;
DROP SCHEMA tmp;
COMMIT;
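As a quick sanity check after the swap, the sequence should still be attached to the new table's id column:
SELECT pg_get_serial_sequence('public.foo', 'id');
-- should return: public.foo_id_seq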

Related

Replacing two columns (first name, last name) with an auto-increment id

I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M row table to test if there are indeed meaningful disk space savings with this change. I feel like this task is quite doable, but I cannot find a concrete way to do so yet. Any suggestions?
That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table exchanging full names for IDs than to alter the schema of the existing table and update all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)
CREATE TABLE people (
people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM tbl
ORDER BY 1, 2; -- optional
ALTER TABLE tbl RENAME TO tbl_old; -- free up original table name
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
-- ORDER BY ??
ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);
-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;
COMMIT;
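To verify the disk-space savings (the asker's stated goal), a comparison like this can be run just before the final DROP TABLE tbl_old:
SELECT pg_size_pretty(pg_total_relation_size('tbl')) AS new_size
, pg_size_pretty(pg_total_relation_size('tbl_old')) AS old_size;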
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
Optimize GROUP BY query to retrieve latest row per user
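One of those faster ways - sketched here, assuming a btree index on (first_name, last_name) - is to emulate a loose index scan with a recursive CTE, jumping from each distinct pair to the next instead of scanning all rows:
WITH RECURSIVE names AS (
   (SELECT first_name, last_name
    FROM tbl
    ORDER BY first_name, last_name
    LIMIT 1)
   UNION ALL
   SELECT n.first_name, n.last_name
   FROM names
   CROSS JOIN LATERAL (
      SELECT first_name, last_name
      FROM tbl
      WHERE (first_name, last_name) > (names.first_name, names.last_name)
      ORDER BY first_name, last_name
      LIMIT 1
      ) n
   )
SELECT * FROM names;  -- ~100 index probes instead of reading 1M rows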

How to safely reindex primary key on postgres?

We have a huge table that contains bloat on the primary key index. We constantly archive old records on that table.
We reindex other columns by recreating the index concurrently and dropping the old one. This is to avoid interfering with production traffic.
But this is not possible for a primary key since there are foreign keys depending on it. At least based on what we have tried.
What's the right way to reindex the primary key safely without blocking DML statements on the table?
REINDEX CONCURRENTLY seems to work as well. I tried it on my database and didn't get any error.
REINDEX INDEX CONCURRENTLY <indexname>;
I think it possibly does something similar to what #jlandercy has described in his answer. While the reindex was running I saw an index with the suffix _ccnew, and the existing one was intact as well. I guess that index was eventually renamed to the original index name after the older one was dropped, and in the end I see a unique primary index on my table.
I am using postgres v12.7.
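For the curious, that transient index can be watched from another session while the REINDEX runs; something like:
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indexrelid::regclass::text LIKE '%ccnew%';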
You can use pg_repack for this.
pg_repack is a PostgreSQL extension which lets you remove bloat from tables and indexes, and optionally restore the physical order of clustered indexes.
It doesn't hold exclusive locks during the whole process. It still takes some locks, but only for short periods of time. You can check the details here: https://reorg.github.io/pg_repack/
To perform repack on indexes, you can try:
pg_repack -t table_name --only-indexes
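pg_repack connects like any other client utility, so a full invocation looks something like this (database and table names hypothetical):
pg_repack -d mydb -t public.big_table --only-indexes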
TL;DR
Just reindex it like any other index, using its index name:
REINDEX INDEX <indexname>;
MCVE
Let's create a table with a Primary Key constraint which is also an Index:
CREATE TABLE test(
Id BIGSERIAL PRIMARY KEY
);
Looking at the catalogue we see the constraint name:
SELECT conname FROM pg_constraint WHERE conname LIKE 'test%';
-- "test_pkey"
Having the name of the index, we can reindex it:
REINDEX INDEX test_pkey;
You can also fix the Constraint Name at creation time:
CREATE TABLE test(
Id BIGSERIAL NOT NULL
);
ALTER TABLE test ADD CONSTRAINT myconstraint PRIMARY KEY(Id);
If you must address concurrency, then use the method a_horse_with_no_name suggested: create a unique index concurrently:
-- Ensure Uniqueness while recreating the Primary Key:
CREATE UNIQUE INDEX CONCURRENTLY tempindex ON test USING btree(Id);
-- Drop PK:
ALTER TABLE test DROP CONSTRAINT myconstraint;
-- Recreate PK:
ALTER TABLE test ADD CONSTRAINT myconstraint PRIMARY KEY(Id);
-- Drop redundant Index:
DROP INDEX tempindex;
To check Index existence:
SELECT * FROM pg_index WHERE indexrelid::regclass = 'tempindex'::regclass;
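Note that the ADD CONSTRAINT step above rebuilds the index under an exclusive lock. On PostgreSQL 9.1 or later, a variant (sketched below) promotes the concurrently built index directly, so the table is only locked for the brief catalog swap:
CREATE UNIQUE INDEX CONCURRENTLY tempindex ON test USING btree(Id);
BEGIN;
ALTER TABLE test DROP CONSTRAINT myconstraint;
-- USING INDEX adopts tempindex as the PK's index and renames it to myconstraint:
ALTER TABLE test ADD CONSTRAINT myconstraint PRIMARY KEY USING INDEX tempindex;
COMMIT;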

How to add a sort key to an existing table in AWS Redshift

In AWS Redshift, I want to add a sort key to a table that is already created. Is there any command which can add a column and use it as sort key?
UPDATE:
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
source: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-supports-changing-table-sort-keys-dynamically/
At the moment I think it's not possible (hopefully that will change in the future). In the past, when I ran into this kind of situation, I created a new table and copied the data from the old one into it.
from http://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html:
ADD [ COLUMN ] column_name
Adds a column with the specified name to the table. You can add only one column in each ALTER TABLE statement.
You cannot add a column that is the distribution key (DISTKEY) or a sort key (SORTKEY) of the table.
You cannot use an ALTER TABLE ADD COLUMN command to modify the following table and column attributes:
UNIQUE
PRIMARY KEY
REFERENCES (foreign key)
IDENTITY
The maximum column name length is 127 characters; longer names are truncated to 127 characters. The maximum number of columns you can define in a single table is 1,600.
As Yaniv Kessler mentioned, it's not possible to add or change distkey and sort key after creating a table, and you have to recreate a table and copy all data to the new table.
You can use the following SQL format to recreate a table with a new design.
ALTER TABLE test_table RENAME TO old_test_table;
CREATE TABLE new_test_table([new table columns]);
INSERT INTO new_test_table (SELECT * FROM old_test_table);
ALTER TABLE new_test_table RENAME TO test_table;
DROP TABLE old_test_table;
In my experience, this SQL is used not only for changing the distkey and sortkey, but also for setting the encoding (compression) type.
To add to Yaniv's answer, the ideal way to do this is probably using the CREATE TABLE AS command. You can specify the distkey and sortkey explicitly. I.e.
CREATE TABLE test_table_with_dist
distkey(field)
sortkey(sortfield)
AS
select * from test_table
Additional examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_examples.html
EDIT
I've noticed that this method doesn't preserve encoding. Redshift only automatically encodes during a copy statement. If this is a persistent table you should redefine the table and specify the encoding.
create table test_table_with_dist(
field1 varchar encode raw distkey,
field2 timestamp encode delta sortkey);
insert into test_table_with_dist select * from test_table;
You can figure out which encoding to use by running analyze compression test_table;
AWS now allows you to add both sortkeys and distkeys without having to recreate tables:
To add a sortkey (or alter a sortkey):
ALTER TABLE data.engagements_bot_free_raw
ALTER SORTKEY (id)
To alter a distkey or add a distkey:
ALTER TABLE data.engagements_bot_free_raw
ALTER DISTKEY id
Interestingly, the parentheses are mandatory on SORTKEY, but not on DISTKEY.
You still cannot change the encoding of a table in place - that still requires the solutions where you must recreate tables.
I followed this approach for adding the sort columns to my table table_transactions. It's more or less the same approach, only with fewer commands.
alter table table_transactions rename to table_transactions_backup;
create table table_transactions compound sortkey(key1, key2, key3, key4) as select * from table_transactions_backup;
drop table table_transactions_backup;
Catching this query a bit late.
I find that using 1=1 is the best way to create and replicate data into another table in Redshift.
e.g.:
CREATE TABLE NEWTABLE AS SELECT * FROM OLDTABLE WHERE 1=1;
then you can drop the OLDTABLE after verifying that the data has been copied
(if you replace 1=1 with 1=2, it copies only the structure - which is good for creating staging tables)
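For instance, a hypothetical staging table that copies only the column definitions:
CREATE TABLE STAGINGTABLE AS SELECT * FROM OLDTABLE WHERE 1=2;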
It is now possible to alter a sort key:
Amazon Redshift now supports changing table sort keys dynamically
Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies user experience in maintaining the optimal sort order in Redshift to achieve high performance as their query patterns evolve and do it without interrupting the access to the tables.
Customers when creating Redshift tables can optionally specify one or more table columns as sort keys. The sort keys are used to maintain the sort order of the Redshift tables and allows the query engine to achieve high performance by reducing the amount of data to read from disk and to save on storage with better compression. Currently Redshift customers who desire to change the sort keys after the initial table creation will need to re-create the table with new sort key definitions.
With the new ALTER SORT KEY command, users can dynamically change the Redshift table sort keys as needed. Redshift will take care of adjusting data layout behind the scenes and table remains available for users to query. Users can modify sort keys for a given table as many times as needed and they can alter sort keys for multiple tables simultaneously.
For more information about ALTER SORT KEY, please refer to the documentation.
As for the documentation itself:
ALTER DISTKEY column_name or ALTER DISTSTYLE KEY DISTKEY column_name
A clause that changes the column used as the distribution key of a table. Consider the following:
VACUUM and ALTER DISTKEY cannot run concurrently on the same table.
If VACUUM is already running, then ALTER DISTKEY returns an error.
If ALTER DISTKEY is running, then background vacuum doesn't start on a table.
If ALTER DISTKEY is running, then foreground vacuum returns an error.
You can only run one ALTER DISTKEY command on a table at a time.
The ALTER DISTKEY command is not supported for tables with interleaved sort keys.
When specifying DISTSTYLE KEY, the data is distributed by the values in the DISTKEY column. For more information about DISTSTYLE, see CREATE TABLE.
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
A clause that changes or adds the sort key used for a table. Consider the following:
You can define a maximum of 400 columns for a sort key per table.
You can only alter a compound sort key. You can't alter an interleaved sort key.
When data is loaded into a table, the data is loaded in the order of the sort key. When you alter the sort key, Amazon Redshift reorders the data. For more information about SORTKEY, see CREATE TABLE.
According to the updated documentation it is now possible to change a sort key type with:
ALTER [COMPOUND] SORTKEY ( column_name [,...] )
For reference (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html):
"You can alter an interleaved sort key to a compound sort key or no sort key. However, you can't alter a compound sort key to an interleaved sort key."
ALTER TABLE table_name ALTER SORTKEY (sortKey1, sortKey2 ...etc)

PostgreSQL delete all content

Hello, I want to delete all data in my PostgreSQL tables, but not the tables themselves.
How could I do this?
Use the TRUNCATE TABLE command.
The content of one or more tables in a PostgreSQL database can be deleted in several ways.
Deleting table content using SQL:
Deleting content of one table:
TRUNCATE table_name;
DELETE FROM table_name;
Deleting content of all named tables:
TRUNCATE table_a, table_b, …, table_z;
Deleting content of named tables and tables that reference them (I will explain it in more detail later in this answer):
TRUNCATE table_a, table_b CASCADE;
Deleting table content using pgAdmin:
Deleting content of one table:
Right click on the table -> Truncate
Deleting content of a table and tables that reference it:
Right click on the table -> Truncate Cascaded
Difference between delete and truncate:
From the documentation:
DELETE deletes rows that satisfy the WHERE clause from the specified
table. If the WHERE clause is absent, the effect is to delete all rows
in the table.
http://www.postgresql.org/docs/9.3/static/sql-delete.html
TRUNCATE is a PostgreSQL extension that provides a faster mechanism to
remove all rows from a table. TRUNCATE quickly removes all rows from a
set of tables. It has the same effect as an unqualified DELETE on each
table, but since it does not actually scan the tables it is faster.
Furthermore, it reclaims disk space immediately, rather than requiring
a subsequent VACUUM operation. This is most useful on large tables.
http://www.postgresql.org/docs/9.1/static/sql-truncate.html
Working with tables that are referenced by other tables:
When a database has more than one table, the tables probably have relationships.
As an example there are three tables:
create table customers (
customer_id int not null,
name varchar(20),
surname varchar(30),
constraint pk_customer primary key (customer_id)
);
create table orders (
order_id int not null,
number int not null,
customer_id int not null,
constraint pk_order primary key (order_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
create table loyalty_cards (
card_id int not null,
card_number varchar(10) not null,
customer_id int not null,
constraint pk_card primary key (card_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
And some prepared data for these tables:
insert into customers values (1, 'John', 'Smith');
insert into orders values
(10, 1000, 1),
(11, 1009, 1),
(12, 1010, 1);
insert into loyalty_cards values (100, 'A123456789', 1);
Table orders references table customers, and table loyalty_cards references table customers. When you try to TRUNCATE / DELETE FROM a table that is referenced by other tables (i.e. the other tables have a foreign key constraint to the named table), you get an error. To delete the content of all three tables, you have to name all of them (the order is not important):
TRUNCATE customers, loyalty_cards, orders;
or just the referenced table, with the CASCADE keyword (you can name more than one table):
TRUNCATE customers CASCADE;
The same applies for pgAdmin. Right click on customers table and choose Truncate Cascaded.
For small tables DELETE is often faster and needs less aggressive locking (important for concurrent load):
DELETE FROM tbl;
With no WHERE condition.
For medium or bigger tables, go with TRUNCATE, like #Greg posted:
TRUNCATE tbl;
Hard to pin down the line between "small" and "big", as that depends on many variables. You'll have to test in your installation.
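Since both commands are transactional in PostgreSQL, one way to test (a sketch; e.g. in psql with \timing enabled) is to run each inside a rolled-back transaction on a copy of your data:
BEGIN;
DELETE FROM tbl;  -- note the runtime
ROLLBACK;
BEGIN;
TRUNCATE tbl;     -- compare
ROLLBACK;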
I found a very easy and fast way for everyone who might use a tool like DBeaver:
You just need to select all the tables that you want to truncate (SHIFT+click or CTRL+click), then right-click.
If you have foreign keys, also select the CASCADE option on the Settings panel. Start, and that's all it takes!

What is the difference between these two T-SQL statements?

In an SSIS package at work there are some SQL tasks that create staging tables for holding import data. All the statements take this form:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.tbNewTable') AND type in (N'U'))
BEGIN
TRUNCATE TABLE dbo.tbNewTable
END
ELSE
BEGIN
CREATE TABLE dbo.tbNewTable (
ColumnA VARCHAR(10) NULL,
ColumnB VARCHAR(10) NULL,
ColumnC INT NULL
) ON PRIMARY
END
In Itzik Ben-Gan's T-SQL Fundamentals I see a different form of statement for creating a table:
IF OBJECT_ID('dbo.tbNewTable', 'U') IS NOT NULL
BEGIN
DROP TABLE dbo.tbNewTable
END
CREATE TABLE dbo.tbNewTable (
ColumnA VARCHAR(10) NULL,
ColumnB VARCHAR(10) NULL,
ColumnC INT NULL
) ON PRIMARY
Each of these appears to do the same thing. After execution, there will be an empty table called tbNewTable in the dbo schema.
Are there any practical or theoretical differences between the two? What implications might they have?
The first one assumes that if the table exists, it has the same columns as those it would create. The second one does not make that assumption. So if a table with that name happened to exist and had a different set of columns, the two would have very different results.
The first will not actually DROP the table -- it merely TRUNCATEs all the data in said table. Hence the CREATE is guarded.
Thus the form with the DROP will allow the subsequent CREATE to change the schema (when the new table is created) even if tbNewTable previously existed.
Because the DROP/CREATE alters the database schema, it may not be allowed in all cases. For instance, a view created with SCHEMABINDING will prevent the table from being dropped. (This also holds true for more general FK relationships, should any exist.)
...when SCHEMABINDING is specified, the base table or tables cannot be modified in a way that would affect the view definition.
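For instance, a hypothetical schema-bound view over the staging table would make the DROP TABLE branch fail:
CREATE VIEW dbo.vNewTable WITH SCHEMABINDING AS
SELECT ColumnA, ColumnB, ColumnC
FROM dbo.tbNewTable;
-- DROP TABLE dbo.tbNewTable now raises an error until the view is dropped.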
The TRUNCATE should be marginally faster in one of those constant "don't care" ways: there should be no performance consideration given to one over the other.
There are also permission differences. TRUNCATE only requires the ALTER permission.
The minimum permission required is ALTER on table_name. TRUNCATE TABLE permissions default to the table owner...
Happy coding.
These are very different.
The first does an equality check on the sys.objects system table and looks to see if there is a matching table name. If so, it truncates the table. Basically removing all rows but maintaining the table structure itself - i.e. the actual table is never dropped.
In the second, the check to make sure that the table exists is implicitly done using the OBJECT_ID() function. If it exists, the table is dropped completely - rows and structure.
If you have a primary and foreign key constraint on the table, you'll certainly have issues dropping it completely... and if you have other tables that are linked to the table you are trying to 'truncate' you'll have issues there too, unless you have cascade deletion turned on.
I tend to dislike either construction in an SSIS package. I create the tables in a deployment script and I want the package to fail if one of the tables I use is missing later on because then something drastically wrong has happened and I want to investigate what before I try putting data anywhere.