Replacing two columns (first name, last name) with an auto-increment id - postgresql

I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M-row table to test whether this change actually yields meaningful disk space savings. I feel like this task is quite doable but cannot find a concrete way to do it yet. Any suggestions?

That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table exchanging full names for IDs than to alter the schema of the existing table and update all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)
CREATE TABLE people (
people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM tbl
ORDER BY 1, 2; -- optional
ALTER TABLE tbl RENAME TO tbl_old; -- free up original table name
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
-- ORDER BY ...? -- optionally pick a physical row order for the new table
ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);
-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;
COMMIT;
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
Optimize GROUP BY query to retrieve latest row per user
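For illustration, here is a sketch of that faster approach (an emulated "index skip scan"), assuming a btree index on (first_name, last_name) exists on tbl; the recursive CTE jumps from one distinct name pair to the next instead of scanning all 1M rows:
WITH RECURSIVE names AS (
   (SELECT first_name, last_name
    FROM tbl
    ORDER BY first_name, last_name
    LIMIT 1)
   UNION ALL
   SELECT n.first_name, n.last_name
   FROM names d
   CROSS JOIN LATERAL (
      SELECT t.first_name, t.last_name
      FROM tbl t
      WHERE (t.first_name, t.last_name) > (d.first_name, d.last_name)
      ORDER BY t.first_name, t.last_name
      LIMIT 1
   ) n
)
INSERT INTO people (first_name, last_name)
SELECT first_name, last_name FROM names;
With only ~100 distinct pairs this performs roughly 100 tiny index lookups instead of one full scan.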

Related

Delete all records that violate new unique constraint

I have a table that has the following fields
-------------------------
| id | user_id | doc_id |
-------------------------
I want to create a new unique constraint to make sure that there are no repeat user_id and doc_id records, i.e. a user can only be linked to a doc one time. That is simple enough.
ALTER TABLE mytable
ADD CONSTRAINT uniquectm_const UNIQUE (user_id, doc_id);
The issue is I have records that currently violate that constraint. I was wondering if there is an easy way to query for those records, or to tell Postgres to just delete anything that violates the constraint.
Identifying records that violate your new key:
SELECT *
FROM (
    SELECT id, user_id, doc_id,
           COUNT(*) OVER (PARTITION BY user_id, doc_id) AS unique_check
    FROM mytable
) AS dupes
WHERE unique_check > 1;
Then you can figure out from those duplicates which records should be deleted, and perform the delete.
To my knowledge there is no other way to perform this since any automated "Delete any duplicates" command would leave the database engine to decide which of the two-or-more duplicate records to get rid of.
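For example, once you have decided on a rule such as "keep the row with the smallest id per (user_id, doc_id) pair", a self-join delete is one way to apply it (a sketch; adjust the tie-breaker to your own rule):
DELETE FROM mytable a
USING mytable b
WHERE a.user_id = b.user_id
  AND a.doc_id = b.doc_id
  AND a.id > b.id;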
If the entire record is a duplicate (all columns match) then you could just create a new table with your new unique constraint and do an INSERT INTO newtable SELECT DISTINCT * FROM oldtable, but I'm betting that isn't the case.

Why is table-swapping in Postgres so verbose?

I'd like to backfill a column of a large (20M rows), frequently-read but rarely-written table. From various articles and questions on SO, it seems like the best way to do this is to create a table with identical structure, load in the backfilled data, and live-swap (since renaming is pretty quick). Sounds good!
But when I actually write the script to do this, it is mind-blowingly long. Here's a taste:
BEGIN;
CREATE TABLE foo_new (LIKE foo);
-- I don't use INCLUDING ALL, because that produces Indexes/Constraints with different names
-- This is the only part of the script that is specific to my case.
-- Everything else is standard for any table swap
INSERT INTO foo_new (id, first_name, last_name, email, full_name)
SELECT id, first_name, last_name, email, first_name || last_name FROM foo;
CREATE SEQUENCE foo_new_id_seq
START 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
SELECT setval('foo_new_id_seq', COALESCE((SELECT MAX(id)+1 FROM foo_new), 1), false);
ALTER SEQUENCE foo_new_id_seq OWNED BY foo_new.id;
ALTER TABLE ONLY foo_new ALTER COLUMN id SET DEFAULT nextval('foo_new_id_seq'::regclass);
ALTER TABLE foo_new
ADD CONSTRAINT foo_new_pkey
PRIMARY KEY (id);
COMMIT;
-- Indexes are made concurrently, otherwise they would block reads for
-- a long time. Concurrent index creation cannot occur within a transaction.
CREATE INDEX CONCURRENTLY foo_new_on_first_name ON foo_new USING btree (first_name);
CREATE INDEX CONCURRENTLY foo_new_on_last_name ON foo_new USING btree (last_name);
CREATE INDEX CONCURRENTLY foo_new_on_email ON foo_new USING btree (email);
-- One more line for each index
BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
ALTER SEQUENCE foo_id_seq RENAME TO foo_old_id_seq;
ALTER SEQUENCE foo_new_id_seq RENAME TO foo_id_seq;
ALTER TABLE foo_old RENAME CONSTRAINT foo_pkey TO foo_old_pkey;
ALTER TABLE foo RENAME CONSTRAINT foo_new_pkey TO foo_pkey;
ALTER INDEX foo_on_first_name RENAME TO foo_old_on_first_name;
ALTER INDEX foo_on_last_name RENAME TO foo_old_on_last_name;
ALTER INDEX foo_on_email RENAME TO foo_old_on_email;
-- One more line for each index
ALTER INDEX foo_new_on_first_name RENAME TO foo_on_first_name;
ALTER INDEX foo_new_on_last_name RENAME TO foo_on_last_name;
ALTER INDEX foo_new_on_email RENAME TO foo_on_email;
-- One more line for each index
COMMIT;
-- TODO: drop old table (CASCADE)
And this doesn't even include foreign keys, or other constraints! Since the only part of this that is specific to my case is the INSERT INTO bit, I'm surprised that there's no built-in Postgres function to do this sort of swapping. Is this operation less common than I make it out to be? Am I underestimating the variety of ways this can be accomplished? Is my desire to keep naming consistent an atypical one?
It's probably not all that common. Most tables aren't big enough to warrant it, and most applications can tolerate some amount of downtime here and there.
More importantly, different applications can afford to cut corners in different ways depending on their workload. The database server can't; it needs to handle (or to very deliberately not handle) every possible obscure edge-case, which is likely a lot harder than you might expect. Ultimately, writing tailored solutions for different use cases probably makes more sense.
Anyway, if you're just trying to implement a calculated field as first_name || last_name, there are better ways of doing it:
ALTER TABLE foo RENAME TO foo_base;
CREATE VIEW foo AS
SELECT
id,
first_name,
last_name,
email,
(first_name || last_name) AS full_name
FROM foo_base;
Assuming that your real case is more complicated, all of this effort may still be unnecessary. I believe that the copy-and-rename approach is largely based on the assumption that you need to lock the table against concurrent modifications for the duration of this process, and so the goal is to get it done as quickly as possible. If all concurrent operations are read-only - which appears to be the case, since you're not locking the table - then you're probably better off with a simple UPDATE (which won't block SELECTs), even if it does take a bit longer (though it does have the advantage of avoiding foreign key re-checks and TOAST table rewrites).
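A minimal sketch of that in-place route, assuming the full_name column already exists on foo (add it first with a plain ALTER TABLE ... ADD COLUMN if it doesn't):
UPDATE foo
SET full_name = first_name || last_name;
If bloat from one huge UPDATE is a concern, the same statement can be run in batches over id ranges.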
If this approach really is justified, I think there are a few opportunities for improvement:
You don't need to recreate/reset the sequence; you can just link the existing sequence to the new table.
CREATE INDEX CONCURRENTLY seems unnecessary, as nobody else should be trying to access foo_new yet. In fact, if the whole script were in one transaction, it wouldn't even be externally visible at this point.
Table names only need to be unique within a schema. If you temporarily create a schema for the new table, you should be able to replace all of those RENAMEs with a single ALTER TABLE foo SET SCHEMA public.
Even if you don't expect concurrent writes, it wouldn't hurt to LOCK foo IN SHARE MODE anyway...
EDIT:
The sequence reassignment is a little more involved than I expected, as it seems that they need to stay in the same schema as their parent table. But here is (what appears to be) a working example:
BEGIN;
LOCK public.foo IN SHARE MODE;
CREATE SCHEMA tmp;
CREATE TABLE tmp.foo (LIKE public.foo);
INSERT INTO tmp.foo (id, first_name, last_name, email, full_name)
SELECT id, first_name, last_name, email, (first_name || last_name) FROM public.foo;
ALTER TABLE tmp.foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
CREATE INDEX foo_on_first_name ON tmp.foo (first_name);
CREATE INDEX foo_on_last_name ON tmp.foo (last_name);
CREATE INDEX foo_on_email ON tmp.foo (email);
ALTER TABLE tmp.foo ALTER COLUMN id SET DEFAULT nextval('public.foo_id_seq');
ALTER SEQUENCE public.foo_id_seq OWNED BY NONE;
DROP TABLE public.foo;
ALTER TABLE tmp.foo SET SCHEMA public;
ALTER SEQUENCE public.foo_id_seq OWNED BY public.foo.id;
DROP SCHEMA tmp;
COMMIT;

PostgreSQL self referential table - how to store parent ID in script?

I have the following table:
DROP SEQUENCE IF EXISTS CATEGORY_SEQ CASCADE;
CREATE SEQUENCE CATEGORY_SEQ START 1;
DROP TABLE IF EXISTS CATEGORY CASCADE;
CREATE TABLE CATEGORY (
ID BIGINT NOT NULL DEFAULT nextval('CATEGORY_SEQ'),
NAME CHARACTER VARYING(255) NOT NULL,
PARENT_ID BIGINT
);
ALTER TABLE CATEGORY
ADD CONSTRAINT CATEGORY_PK PRIMARY KEY (ID);
ALTER TABLE CATEGORY
ADD CONSTRAINT CATEGORY_SELF_FK FOREIGN KEY (PARENT_ID) REFERENCES CATEGORY (ID);
Now I need to insert the data. So I start with the parent:
INSERT INTO CATEGORY (NAME) VALUES ('PARENT_1');
And now I need the ID of the just inserted parent to add children to it:
INSERT INTO CATEGORY (NAME, PARENT_ID) VALUES ('CHILDREN_1_1', <what_goes_here>);
INSERT INTO CATEGORY (NAME, PARENT_ID) VALUES ('CHILDREN_1_2', <what_goes_here>);
How can I get and store the ID of the parent to later use it in the subsequent inserts?
You can use a data-modifying CTE with a RETURNING clause:
with parent_cat (parent_id) as (
INSERT INTO CATEGORY (NAME) VALUES ('PARENT_1')
returning id
)
INSERT INTO CATEGORY (NAME, PARENT_ID)
VALUES
('CHILDREN_1_1', (select parent_id from parent_cat) ),
('CHILDREN_1_2', (select parent_id from parent_cat) );
The answer is to use RETURNING along with WITH:
WITH inserted AS (
INSERT INTO CATEGORY (NAME) VALUES ('PARENT_1')
RETURNING id
) INSERT INTO CATEGORY (NAME, PARENT_ID) VALUES
('CHILD_1_1', (SELECT inserted.id FROM inserted)),
('CHILD_2_1', (SELECT inserted.id FROM inserted));
(tl;dr: go to option 3, INSERT with RETURNING)
Recall that in PostgreSQL there is no "id" concept for tables, just sequences (which are typically but not necessarily used as default values for surrogate primary keys, with the SERIAL pseudo-type).
If you are interested in getting the id of a newly inserted row, there are several ways:
Option 1: CURRVAL(<sequence name>).
For example:
INSERT INTO persons (lastname,firstname) VALUES ('Smith', 'John');
SELECT currval('persons_id_seq');
The name of the sequence must be known, and it's really arbitrary; in this example we assume that the table persons has an id column created with the SERIAL pseudo-type. To avoid relying on this, and to keep things cleaner, you can instead use pg_get_serial_sequence:
INSERT INTO persons (lastname,firstname) VALUES ('Smith', 'John');
SELECT currval(pg_get_serial_sequence('persons','id'));
Caveat: currval() only works after an INSERT (which has executed nextval()), in the same session.
Option 2: LASTVAL();
This is similar to the previous option, except that you don't need to specify the sequence name: it looks at the most recently modified sequence (always within your session, same caveat as above).
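For example (a sketch mirroring the snippet above):
INSERT INTO persons (lastname, firstname) VALUES ('Smith', 'John');
SELECT lastval();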
Both CURRVAL and LASTVAL are totally concurrency-safe. The behaviour of sequences in PG is designed so that different sessions will not interfere, so there is no risk of race conditions (if another session inserts another row between my INSERT and my SELECT, I still get my correct value).
However, they do have a subtle potential problem. If the database has some TRIGGER (or RULE) that, on insertion into the persons table, makes some extra insertions in other tables... then LASTVAL will probably give us the wrong value. The problem can even happen with CURRVAL, if the extra insertions are done into the same persons table (this is much less usual, but the risk still exists).
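To make that caveat concrete, here is a hypothetical setup where LASTVAL goes wrong: an audit table with its own serial column and a trigger that writes to it on every insert into persons (the table, function, and trigger names are made up for illustration):
CREATE TABLE persons_audit (audit_id serial PRIMARY KEY, lastname text);

CREATE FUNCTION log_person() RETURNS trigger AS $$
BEGIN
    INSERT INTO persons_audit (lastname) VALUES (NEW.lastname);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER persons_log AFTER INSERT ON persons
    FOR EACH ROW EXECUTE FUNCTION log_person();

INSERT INTO persons (lastname, firstname) VALUES ('Smith', 'John');
SELECT lastval();  -- returns the audit table's id, not the persons id
Here currval('persons_id_seq') would still be correct, because the extra insertion goes into a different table.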
Option 3: INSERT with RETURNING
INSERT INTO persons (lastname,firstname) VALUES ('Smith', 'John') RETURNING id;
This is the cleanest, most efficient and safest way to get the id. It doesn't have any of the risks of the previous options.
Drawbacks? Almost none: you might need to modify the way you call your INSERT statement (in the worst case, perhaps your API or DB layer does not expect an INSERT to return a value); it's not standard SQL (who cares); and it's been available since PostgreSQL 8.2 (Dec 2006...).
Conclusion: If you can, go for option 3. Otherwise, prefer option 1.
Note: all these methods are useless if you intend to get the last globally inserted id (not necessarily in your session). For this, you must resort to select max(id) from table (of course, this will not read uncommitted inserts from other transactions).

Add a serial column based on a sorted column

I have a table that has one column with unordered values. I want to order this column descending and add a column to record its order. My SQL code is:
select *
into newtable
from oldtable
order by column_name desc;
alter table newtable add column id serial;
Would this implement my goal? I know that rows in PostgreSQL have no fixed order. So I am not sure about this.
Rather than (ab)using a SERIAL via ALTER TABLE, generate it at insert-time.
CREATE TABLE newtable (id serial unique not null, LIKE oldtable INCLUDING ALL);
INSERT INTO newtable
SELECT nextval('newtable_id_seq'), *
FROM oldtable
ORDER BY column_name desc;
This avoids a table rewrite, and unlike your prior approach, is guaranteed to produce the correct ordering.
(If you want it to be the PK, and the prior table had no PK, change unique not null to primary key. If the prior table had a PK you'll need to use a LIKE variant that excludes constraints).
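For the second case, a sketch (assuming oldtable already has its own primary key): a plain LIKE copies the column definitions but not primary key or unique constraints, so the new id column can take over as the key:
CREATE TABLE newtable (id serial PRIMARY KEY, LIKE oldtable);
INSERT INTO newtable
SELECT nextval('newtable_id_seq'), *
FROM oldtable
ORDER BY column_name desc;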
You can first create a new table, sorted based on the column you want to use:
CREATE TABLE newtable AS
SELECT * FROM oldtable
ORDER BY column_name desc;
Afterwards, since the rows were inserted from the largest to the smallest value, you can add a serial column to number them in that order:
ALTER TABLE newtable ADD COLUMN id serial unique;

PostgreSQL delete all content

Hello, I want to delete all the data in my PostgreSQL tables, but not the tables themselves.
How could I do this?
Use the TRUNCATE TABLE command.
The content of one or more tables in a PostgreSQL database can be deleted in several ways.
Deleting table content using SQL:
Deleting content of one table:
TRUNCATE table_name;
DELETE FROM table_name;
Deleting content of all named tables:
TRUNCATE table_a, table_b, …, table_z;
Deleting content of named tables and of tables that reference them (I will explain this in more detail later in this answer):
TRUNCATE table_a, table_b CASCADE;
Deleting table content using pgAdmin:
Deleting content of one table:
Right click on the table -> Truncate
Deleting content of a table and of the tables that reference it:
Right click on the table -> Truncate Cascaded
Difference between DELETE and TRUNCATE:
From the documentation:
DELETE deletes rows that satisfy the WHERE clause from the specified
table. If the WHERE clause is absent, the effect is to delete all rows
in the table.
http://www.postgresql.org/docs/9.3/static/sql-delete.html
TRUNCATE is a PostgreSQL extension that provides a faster mechanism to
remove all rows from a table. TRUNCATE quickly removes all rows from a
set of tables. It has the same effect as an unqualified DELETE on each
table, but since it does not actually scan the tables it is faster.
Furthermore, it reclaims disk space immediately, rather than requiring
a subsequent VACUUM operation. This is most useful on large tables.
http://www.postgresql.org/docs/9.1/static/sql-truncate.html
Working with a table that is referenced by other tables:
When you have a database with more than one table, the tables probably have relationships.
As an example, there are three tables:
create table customers (
customer_id int not null,
name varchar(20),
surname varchar(30),
constraint pk_customer primary key (customer_id)
);
create table orders (
order_id int not null,
number int not null,
customer_id int not null,
constraint pk_order primary key (order_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
create table loyalty_cards (
card_id int not null,
card_number varchar(10) not null,
customer_id int not null,
constraint pk_card primary key (card_id),
constraint fk_customer foreign key (customer_id) references customers(customer_id)
);
And some prepared data for these tables:
insert into customers values (1, 'John', 'Smith');
insert into orders values
(10, 1000, 1),
(11, 1009, 1),
(12, 1010, 1);
insert into loyalty_cards values (100, 'A123456789', 1);
Table orders references table customers, and table loyalty_cards references table customers. When you try to TRUNCATE / DELETE FROM a table that is referenced by other tables (i.e. the other tables have a foreign key constraint to the named table), you get an error. To delete content from all three tables you have to name all of them (the order is not important):
TRUNCATE customers, loyalty_cards, orders;
or just the referenced table, with the CASCADE keyword (you can name more than one table):
TRUNCATE customers CASCADE;
The same applies for pgAdmin. Right click on customers table and choose Truncate Cascaded.
For small tables DELETE is often faster and needs less aggressive locking (important for concurrent load):
DELETE FROM tbl;
With no WHERE condition.
For medium or bigger tables, go with TRUNCATE, as @Greg posted:
TRUNCATE tbl;
Hard to pin down the line between "small" and "big", as that depends on many variables. You'll have to test in your installation.
I found a very easy and fast way for everyone who might use a tool like DBeaver:
You just need to select all the tables that you want to truncate (SHIFT + click or CTRL + click), then right-click.
And if you have foreign keys, also select the CASCADE option on the Settings panel. Start, and that's all it takes!