Delete duplicate rows in a large PostgreSQL database table

I have a PostgreSQL database about 100 GB in size. One of the tables has about half a billion rows. To speed up data entry, some of the data was inserted repeatedly, to be pruned later. One of the columns can be used to identify rows as unique.
I found this Stack Overflow question, which suggested a solution for MySQL:
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)
Is there anything similar for PostgreSQL?
I tried deleting with GROUP BY and with row_number(); in both cases my computer ran out of memory after a few hours.
This is what I get when I try to estimate the number of rows in the table:
SELECT reltuples FROM pg_class WHERE relname = 'orders';
  reltuples
-------------
 4.38543e+08
(1 row)

Two solutions immediately come to mind:
1). Create a new table as SELECT * FROM the source table, with a WHERE clause (or DISTINCT ON) to keep only the unique rows. Add indexes to match the source table, then rename both tables in a transaction. Whether this will work for you depends on several factors, including the amount of free disk space, whether the table is in constant use, whether interruptions to access are permissible, etc. Creating a new table has the benefit of tightly packing your data and indexes, and the table will be smaller than the original because the non-unique rows are omitted. A sketch of this option follows the example below.
2). Create a partial unique index over your columns and add a WHERE clause to filter out the non-uniques.
For example:
test=# create table t ( col1 int, col2 int, is_unique boolean);
CREATE TABLE
test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
INSERT 0 3
test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
CREATE INDEX
test=# \d t
      Table "public.t"
  Column   |  Type   | Modifiers
-----------+---------+-----------
 col1      | integer |
 col2      | integer |
 is_unique | boolean |
Indexes:
    "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE

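For completeness, a minimal sketch of option 1, assuming a single column unique_key identifies logical rows (names are placeholders; keep the old table until you have verified the result):
BEGIN;
CREATE TABLE orders_new AS
SELECT DISTINCT ON (unique_key) *
FROM orders
ORDER BY unique_key;        -- keeps one arbitrary row per unique_key
-- recreate the original table's indexes and constraints here
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
-- DROP TABLE orders_old;   -- once satisfied with the result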
Related

Replacing two columns (first name, last name) with an auto-increment id

I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M-row table to test whether there are indeed meaningful disk space savings from this change. The task feels quite doable, but I cannot find a concrete way to do it yet. Any suggestions?
That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table exchanging full names for IDs than to alter the schema of the existing table and update all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)
CREATE TABLE people (
people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM tbl
ORDER BY 1, 2; -- optional
ALTER TABLE tbl RENAME TO tbl_old; -- free up the original table name
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
-- ORDER BY ??
ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);
-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;
COMMIT;
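To verify the copy before the final DROP TABLE, a quick sanity check (run inside the transaction, before the DROP) could compare row counts; a mismatch means some rows had NULL names and were dropped by the inner join:
SELECT (SELECT count(*) FROM tbl) AS new_count
     , (SELECT count(*) FROM tbl_old) AS old_count;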
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways, such as emulating an index skip scan with a recursive CTE (sketched below). See:
Optimize GROUP BY query to retrieve latest row per user
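A rough sketch of that idea, emulating an index skip scan with a recursive CTE; it assumes a multicolumn index on (first_name, last_name) and reuses the names from above:
WITH RECURSIVE cte AS (
   (SELECT first_name, last_name FROM tbl ORDER BY 1, 2 LIMIT 1)
   UNION ALL
   SELECT n.first_name, n.last_name
   FROM cte c
   CROSS JOIN LATERAL (
      SELECT first_name, last_name
      FROM tbl
      WHERE (first_name, last_name) > (c.first_name, c.last_name)
      ORDER BY 1, 2
      LIMIT 1
      ) n
   )
SELECT * FROM cte;
Each iteration fetches the next bigger name pair straight from the index instead of reading all rows.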

Creating a 2-way relationship in PostgreSQL table

I have 3 tables representing the UUIDs, name, location, and info of a house, room, and drawers (this is an example, as my work is sensitive).
So, for example, one house will have many rooms (one-to-many), and the rooms will contain many drawers (many-to-many).
The idea is that an associations table will be created, where each UUID in one table is associated with the corresponding UUID in the other table.
For example, if I query the house represented by ID 1, it will return the following:
SELECT * FROM house WHERE id_1 = '1';
| ID_1 | ID_2 |
| ---- | ---- |
| 1    | 201  |
| 1    | 254  |
| 1    | 268  |
So far, I have created a temporary version of the associations table, shaped the way I need it to be represented in the real table. However, I now need a function to automatically fill in the IDs properly for all rows, from the temporary associations table into the real associations table. For example:
INSERT INTO associations (id_1, id_2) VALUES
('1','201'),
('201','1')
I need it to be directionless, so that when I query id_1 I also get its linked id_2 in the result.
Let's say your query to get a one-way relationship looks like this:
SELECT room_uuid AS left_uuid, house_the_room_is_in_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
All you need to get the reverse relationship is to put the list in the other order; the rest of the query doesn't need to change, however complex it is:
SELECT house_the_room_is_in_uuid AS left_uuid, room_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
Both of those will be valid as queries to insert into a table with two UUID columns:
CREATE TABLE my_lookup_table (left_uuid UUID, right_uuid UUID);
INSERT INTO my_lookup_table (left_uuid, right_uuid)
SELECT ... -- either of the above
To combine them, either insert each into the same table in turn, or use a UNION to create one result set with both sets of rows:
SELECT room_uuid AS left_uuid, house_the_room_is_in_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
UNION
SELECT house_the_room_is_in_uuid AS left_uuid, room_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
All that's required for a union is that the queries have the same number and type of columns. The names (if relevant at all) come from the first query, but I find it more readable if you include the aliases on both.
Since the result of that UNION is itself just a two-column result set, it can be used with the same INSERT statement as before. That would allow you to insert into the table even if it had a self-referencing foreign key constraint as discussed here:
ALTER TABLE my_lookup_table ADD CONSTRAINT
my_lookup_table_combinations_must_be_unique
UNIQUE (left_uuid, right_uuid);
ALTER TABLE my_lookup_table ADD CONSTRAINT
my_lookup_table_must_have_rows_both_ways_around
FOREIGN KEY (right_uuid, left_uuid)
REFERENCES my_lookup_table (left_uuid, right_uuid);
If you tried to insert just one set of rows, this would fail, but with the UNION, by the end of the statement/transaction, each row is in the table both ways around, so the constraint is met.
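As a usage note, once both directions are stored, a directionless lookup only ever needs to filter on one column, for example:
SELECT right_uuid
FROM my_lookup_table
WHERE left_uuid = '...'; -- any UUID of interest; linked rows appear regardless of direction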

Add a serial column based on a sorted column

I have a table that has one column with unordered values. I want to order this column descending and add a column to record its order. My SQL code is:
select *
into newtable
from oldtable
order by column_name desc;
alter table newtable add column id serial;
Would this implement my goal? I know that rows in PostgreSQL have no fixed order. So I am not sure about this.
Rather than (ab)using a SERIAL via ALTER TABLE, generate it at insert-time.
CREATE TABLE newtable (id serial unique not null, LIKE oldtable INCLUDING ALL);
INSERT INTO newtable
SELECT nextval('newtable_id_seq'), o.*
FROM (
   SELECT * FROM oldtable
   ORDER BY column_name DESC
   ) o;
This avoids a table rewrite and, unlike adding the column after the fact, assigns the ids in an order you control. Note that the ORDER BY sits in a subquery: if it were at the same query level as nextval(), PostgreSQL may evaluate the SELECT list before sorting, so the ids would not be guaranteed to follow the sort order.
(If you want it to be the PK, and the prior table had no PK, change unique not null to primary key. If the prior table had a PK you'll need to use a LIKE variant that excludes constraints).
You can first create a new table, sorted based on the column you want to use:
CREATE TABLE newtable AS
SELECT * FROM oldtable
ORDER BY column_name desc;
Afterwards, since you want to order from the largest to the smallest, you can add the new column to your table:
ALTER TABLE newtable ADD COLUMN id serial unique;
(As cautioned above, this relies on the rewrite numbering rows in the table's physical order, which works in practice but is not formally guaranteed.)

How does LIMIT interact with DELETE by primary key in Postgres? (Fix corrupt unique index)

I've been handed a database that's stuck in a weird state. At some indeterminate time in the past, I ended up in a situation where I had duplicate rows in the same table with the same primary key:
=> \d my_table
              Table "public.my_table"
       Column       |          Type           | Modifiers
--------------------+-------------------------+-----------
 id                 | bigint                  | not null
 some_data          | bigint                  |
 a_string           | character varying(1024) | not null
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
=> SELECT id, count(*) FROM my_table GROUP BY id HAVING count(*) > 1 ORDER BY id;
-- 50-some results, non-consecutive ids.
I have no idea how the database got into this state, but I want to be able to safely get out of it. If, for each duplicated primary key, I execute a query of the form:
DELETE FROM my_table WHERE id = "a_duplicated_row" LIMIT 1;
Is it only going to delete one row from the table, or is it going to delete both rows with the given primary key?
Alas, PostgreSQL does not yet implement LIMIT for DELETE or UPDATE. If the rows are indistinguishable in every other way, you will need to carefully use the hidden ctid column to break ties, like discussed here. Or just create the table by selecting distinct tuples from the existing table, and renaming.
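For example, a minimal sketch of the ctid approach (test on a copy or a backup first):
DELETE FROM my_table a
USING my_table b
WHERE a.id = b.id       -- same duplicated primary key value
AND a.ctid < b.ctid;    -- keeps only the physically last copy of each id
Afterwards, REINDEX INDEX my_table_pkey; so the index reflects the now-unique data.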

How can I enforce a constraint only if a column is not null in PostgreSQL?

I would like a solution to enforce a constraint only if a column is not null. I can't seem to find a way to do this in the documentation.
create table mytable(
table_identifier_a INTEGER,
table_identifier_b INTEGER,
table_value1,...)
Due to the nature of the data, I will have identifier b and a value when the row is first created. After we receive additional data, I will be able to populate identifier a. At this point I would like to enforce a unique key on (identifier_a, value1), but only if identifier_a exists.
Hopefully that makes sense. Anyone have any ideas?
Ummm. Unique constraints don't prevent multiple NULL values; PostgreSQL treats NULLs as distinct for uniqueness purposes.
CREATE TABLE mytable (
table_identifier_a INTEGER NULL,
table_identifier_b INTEGER NOT NULL,
table_value1 INTEGER NOT NULL,
UNIQUE(table_identifier_a, table_identifier_b)
);
Note that we can insert multiple NULLs into it, even when identifier_b matches:
test=# INSERT INTO mytable values(NULL, 1, 2);
INSERT 0 1
test=# INSERT INTO mytable values(NULL, 1, 2);
INSERT 0 1
test=# select * from mytable;
 table_identifier_a | table_identifier_b | table_value1
--------------------+--------------------+--------------
                    |                  1 |            2
                    |                  1 |            2
(2 rows)
But we can't create duplicate (a,b) pairs:
test=# update mytable set table_identifier_a = 3;
ERROR: duplicate key value violates unique constraint "mytable_table_identifier_a_key"
Of course, you do have an issue: your table has no primary key. You probably have a data model problem. But you didn't provide enough details to fix that.
If it is feasible to complete the entire operation within one transaction, it is possible to change the time at which Postgres evaluates the constraint, i.e.:
BEGIN;
SET CONSTRAINTS <...> DEFERRED;
<SOME INSERT/UPDATE/DELETE>
COMMIT;
In this case, the constraint is evaluated at commit time. See:
Postgres 7.4 Doc - Set constraints or Postgres 8.3 Doc
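Note that only constraints declared DEFERRABLE can be deferred this way, and unique constraints only became deferrable in PostgreSQL 9.0. A hypothetical declaration (the constraint name is illustrative):
ALTER TABLE mytable
ADD CONSTRAINT mytable_a_value_uni
UNIQUE (table_identifier_a, table_value1)
DEFERRABLE INITIALLY IMMEDIATE;
After that, SET CONSTRAINTS mytable_a_value_uni DEFERRED works inside a transaction.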
Actually, I would probably break this out into two tables. You're modeling two different kinds of things: the first is the initial version, which is only partial, and the second is the whole thing. Once the information needed to bring the first kind of thing up to the second arrives, move the row from one table to the other.
You could handle this using a trigger instead of a constraint.
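For example, a rough sketch of such a trigger (names are hypothetical; it only covers INSERT, and unlike a real constraint the check is subject to race conditions under concurrent writes):
CREATE FUNCTION enforce_a_value_unique() RETURNS trigger AS $$
BEGIN
    IF NEW.table_identifier_a IS NOT NULL AND EXISTS (
        SELECT 1 FROM mytable
        WHERE table_identifier_a = NEW.table_identifier_a
        AND table_value1 = NEW.table_value1) THEN
        RAISE EXCEPTION '(table_identifier_a, table_value1) must be unique when set';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_a_value_unique
BEFORE INSERT ON mytable
FOR EACH ROW EXECUTE PROCEDURE enforce_a_value_unique();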
If I were you, I'd split the table into two tables, and possibly create a view which combines them as needed.