Delete all records that violate new unique constraint - postgresql

I have a table that has the following fields
----------------------------------
| id | user_id | doc_id |
----------------------------------
I want to create a new unique constraint to make sure that there are no repeated (user_id, doc_id) records, i.e. a user can only be linked to a doc one time. That is simple enough.
ALTER TABLE mytable
ADD CONSTRAINT uniquectm_const UNIQUE (user_id, doc_id);
The issue is that I have records that currently violate that constraint. I was wondering if there is an easy way to query for those records, or to tell Postgres to just delete anything that violates the constraint.

Identifying records that violate your new key:
SELECT *
FROM (
    SELECT id, user_id, doc_id
         , COUNT(*) OVER (PARTITION BY user_id, doc_id) AS unique_check
    FROM mytable
) AS dup  -- PostgreSQL requires an alias on the subquery
WHERE unique_check > 1;
Then you can figure out from those duplicates which ones should be deleted, and perform the delete.
To my knowledge there is no other way to perform this, since any automated "delete any duplicates" command would leave the database engine to decide which of the two-or-more duplicate records to get rid of.
If the entire record is a duplicate (all columns match) then you could just create a new table with your new unique constraint and do an INSERT INTO newtable SELECT DISTINCT * FROM oldtable, but I'm betting that isn't the case.
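If, for example, your rule is simply to keep the row with the lowest id per (user_id, doc_id) pair, a minimal sketch of the delete (my assumption, not from the answer above) would be:
DELETE FROM mytable a
USING mytable b
WHERE a.user_id = b.user_id
  AND a.doc_id  = b.doc_id
  AND a.id > b.id;  -- keeps only the row with the smallest id in each group
After that, the ALTER TABLE ... ADD CONSTRAINT from the question should succeed.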


Replacing two columns (first name, last name) with an auto-increment id

I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M-row table to test whether there are indeed meaningful disk space savings with this change. I feel like this task is quite doable but cannot find a concrete way to do it yet. Any suggestions?
That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table exchanging full names for IDs than to alter the schema of the existing table and update all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)

CREATE TABLE people (
    people_id  integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
  , first_name text NOT NULL
  , last_name  text NOT NULL
  , CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);

INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM   tbl
ORDER  BY 1, 2; -- optional

ALTER TABLE tbl RENAME TO tbl_old; -- free up org. table name

CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM   tbl_old t
JOIN   people p USING (first_name, last_name);
-- ORDER BY ??

ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);

-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;

COMMIT;
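To check whether the normalization actually saves disk space (the question's stated goal), you can compare total relation sizes before running the final DROP TABLE tbl_old above; a quick sketch, assuming the table names from the example:
SELECT pg_size_pretty(pg_total_relation_size('tbl_old')) AS old_table,
       pg_size_pretty(pg_total_relation_size('tbl')
                    + pg_total_relation_size('people')) AS new_tables;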
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
Optimize GROUP BY query to retrieve latest row per user

Import a CSV with foreign keys

Let's say I have 2 tables: Students and Groups.
The Group table has 2 columns: id, GroupName
The Student table has 3 columns: id, StudentName and GroupID
The GroupID is a foreign key to a Group field.
I need to import the Students table from a CSV, but in my CSV instead of the Group id appears the name of the group. How can I import it with pgAdmin without modifying the csv?
Based on Laurenz's answer, use the following scripts:
Create a temp table to insert from CSV file:
CREATE TEMP TABLE std_temp (id int, student_name char(25), group_name char(25));
Then, import the CSV file:
COPY std_temp FROM '/home/username/Documents/std.csv' CSV HEADER;
Now, create std and grp tables for students and groups:
CREATE TABLE grp (id int, name char(25));
CREATE TABLE std (id int, name char(20), grp_id int);
Now it is the grp table's turn to be populated, based on the distinct values of group name. Note how row_number() is used to provide the value for id:
INSERT INTO grp (id, name) select row_number() OVER (), * from (select distinct group_name from std_temp) as foo;
And the final step, select data based on the join then insert it into the std table:
insert into std (id, name, grp_id) select std_temp.id, std_temp.student_name,grp.id from std_temp inner join grp on std_temp.group_name = grp.name;
At the end, retrieve the data from the final std table:
select * from std;
Your easiest option is to import the file into a temporary table that is defined like the CSV file. Then you can join that table with the "groups" table and use INSERT INTO ... SELECT ... to populate the "students" table.
There is of course also the option to define a view on a join of the two tables and define an INSTEAD OF INSERT trigger on the view that inserts values into the underlying tables as appropriate. Then you could load the data directly to the view.
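A minimal sketch of that second option (the view, trigger, column names, and file path below are illustrative assumptions, not from the question):
CREATE VIEW student_import (id, student_name, group_name) AS
SELECT s.id, s.student_name, g.group_name
FROM   students s
JOIN   groups g ON g.id = s.group_id;

CREATE FUNCTION student_import_ins() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- resolve the group name to its id and insert into the base table
    -- (rows with an unknown group name are silently skipped in this sketch)
    INSERT INTO students (id, student_name, group_id)
    SELECT NEW.id, NEW.student_name, g.id
    FROM   groups g
    WHERE  g.group_name = NEW.group_name;
    RETURN NEW;
END;
$$;

CREATE TRIGGER student_import_ins
INSTEAD OF INSERT ON student_import
FOR EACH ROW EXECUTE FUNCTION student_import_ins();  -- EXECUTE PROCEDURE on Postgres 10 and older

-- the CSV can then be loaded straight into the view
COPY student_import FROM '/path/to/students.csv' CSV HEADER;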
The suggestion by @LaurenzAlbe is the obvious approach (IMHO never load a spreadsheet directly to your tables, they are untrustworthy beasts). But I believe your implementation after loading the staging table is flawed.
First, using row_number() virtually ensures you get duplicated ids for the same group name. The ids will always increment from 1 by 1 up to the number of group names, no matter how many groups were loaded previously, and you cannot ensure the identical sequence on subsequent spreadsheets. What happens when you get a group that did not previously exist?
Further, there is no validation that the group name does not already exist. Result: duplicate group names and/or multiple ids for the same name.
Second, using the id from the spreadsheet as the id for the student (std) table is full of error possibilities. How do you ensure that number is unique across spreadsheets? Even if it is unique within a single spreadsheet, how do you ensure another spreadsheet does not use the same numbers as a previous one? Or, assuming multiple users create the spreadsheets, that one user's numbers do not overlap another user's, even if all users are very conscious of the numbers they use? Result: duplicate id numbers.
A much better approach is to put a unique key on the group table's name column, then insert any group names from the stage table into the group table while trapping duplicate-name errors (using ON CONFLICT). Then load the student table directly from the stage table, selecting the group id from the group table by the (now unique) group name.
create table csv_load_temp (junk_num integer, student_name text, group_name text);

create table groups (
    grp_id  integer generated always as identity
  , name    text
  , grp_key text generated always as ( lower(name) ) stored
  , constraint grp_pk
        primary key (grp_id)
  , constraint grp_bk
        unique (grp_key)
);

create table students (
    std_id integer generated always as identity
  , name   text
  , grp_id integer
  , constraint std_pk
        primary key (std_id)
  , constraint std2grp_fk
        foreign key (grp_id)
        references groups(grp_id)
);

-- Function to load Groups and Students
create or replace function establish_students()
    returns void
    language sql
as $$
    insert into groups (name)
    select distinct group_name
      from csv_load_temp
        on conflict (grp_key) do nothing;

    insert into students (name, grp_id)
    select student_name, grp_id
      from csv_load_temp t
      join groups grp
        on (grp.name = t.group_name);
$$;
The groups table requires Postgres v12. For prior versions, remove the grp_key column and put the unique constraint directly on the name column. What to do about capitalization is up to your business logic.
See the fiddle for a full example. Obviously the two inserts in the establish_students function can be run standalone and independently; in that case the function itself is not necessary.
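For completeness, a possible load sequence using the staging table above (the file path is illustrative):
COPY csv_load_temp FROM '/path/to/students.csv' CSV HEADER;
SELECT establish_students();
-- spot-check the result
SELECT s.name, g.name AS group_name
FROM   students s
JOIN   groups g USING (grp_id);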

PostgreSQL - Select from one table based on another

I have a table of tweets (@OneToMany) and another table of analyzedtweets (@ManyToOne), with 'n' analyzedtweets (one per analyst) for each entry in the tweet table. Essentially, I can have any number of analysts (represented in a table), and each one can analyze a tweet just once. To make it a bit more complex, the entries in the tweet table are grouped by process, which is represented by yet another table.
My question is, how would I query the analyzedtweet table for the tweet_id in the last entry given a specific process_id and analyst_id and then use that to find the next tweet in the tweet table also given the same process_id and analyst_id? Basically, I want to give the analyst the next tweet that he/she has not yet analyzed within that specific process (run).
Here are my tables:
CREATE TABLE tweet (
    id         SERIAL PRIMARY KEY,
    process_id INTEGER NOT NULL REFERENCES process(id) ON DELETE CASCADE ON UPDATE CASCADE,
    ...
);

CREATE TABLE analyzedtweet (
    id         SERIAL PRIMARY KEY,
    tweet_id   INTEGER NOT NULL REFERENCES tweet(id) ON DELETE CASCADE ON UPDATE CASCADE,
    analyst_id INTEGER NOT NULL REFERENCES analyst(id) ON DELETE CASCADE ON UPDATE CASCADE,
    process_id INTEGER NOT NULL REFERENCES process(id) ON DELETE CASCADE ON UPDATE CASCADE,
    ...
);

CREATE TABLE process (
    id SERIAL PRIMARY KEY,
    ...
);

CREATE TABLE analyst (
    id SERIAL PRIMARY KEY,
    ...
);
The only way I know how to do this is in 2 steps:
Given a specific process_id (processId) and analyst_id (analystId) run the following query to give me the last tweet_id analyzed by that analyst in that process.
SELECT tweet_id from analyzedtweet WHERE analyzedtweet.analyst_id = analystId AND analyzedtweet.process_id = processId ORDER BY analyzedtweet.tweet_id DESC LIMIT 1
Take the result of the above query (referred to as latestTweetId) and run the following query:
SELECT * from tweet WHERE tweet.id > latestTweetId AND tweet.process_id = processId ORDER BY tweet.id DESC LIMIT 1
I'm sure there is a much better way to do this with JOIN, I just can't figure out how.
Finally, I am using Hibernate and would like to get the POJO back.
If you are fetching the latest tweet for a given process_id and analyzedtweet_id, use this query:
List<Tweet> t = session.createQuery(
        "select t from tweet t " +
        "join t.process p, t.analyzedtweet a " +
        "where p.id = ? and a.id = ? order by t.id desc")
    .setParameter(1, process_id)
    .setParameter(2, analyzedtweet_id)
    .setMaxResults(1)
    .getResultList();
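In plain SQL, the asker's two steps can also be collapsed into a single statement; a sketch, with :processId and :analystId as illustrative parameter placeholders:
SELECT t.*
FROM   tweet t
WHERE  t.process_id = :processId
  AND  t.id > (SELECT COALESCE(MAX(a.tweet_id), 0)
               FROM   analyzedtweet a
               WHERE  a.analyst_id = :analystId
               AND    a.process_id = :processId)
ORDER  BY t.id
LIMIT  1;  -- the next tweet in this process that the analyst has not yet analyzed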

Add a serial column based on a sorted column

I have a table that has one column with unordered values. I want to order this column descending and add a column to record its order. My SQL code is:
select *
into newtable
from oldtable
order by column_name desc;
alter table newtable add column id serial;
Would this implement my goal? I know that rows in PostgreSQL have no fixed order. So I am not sure about this.
Rather than (ab)using a SERIAL via ALTER TABLE, generate it at insert-time.
CREATE TABLE newtable (id serial unique not null, LIKE oldtable INCLUDING ALL);
INSERT INTO newtable
SELECT nextval('newtable_id_seq'), *
FROM oldtable
ORDER BY column_name desc;
This avoids a table rewrite, and unlike your prior approach, is guaranteed to produce the correct ordering.
(If you want it to be the PK, and the prior table had no PK, change unique not null to primary key. If the prior table had a PK you'll need to use a LIKE variant that excludes constraints).
You can first create a new table, sorted based on the column you want to use:
CREATE TABLE newtable AS
SELECT * FROM oldtable
ORDER BY column_name desc;
Afterwards, since you want to order from the largest to the smallest, you can add a new column to your table:
ALTER TABLE newtable ADD COLUMN id serial unique;
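If you would rather make the numbering explicit instead of relying on the order in which the id default is evaluated, a window-function sketch (not from either answer above):
CREATE TABLE newtable AS
SELECT row_number() OVER (ORDER BY column_name DESC) AS id,
       o.*
FROM   oldtable o;

ALTER TABLE newtable ADD PRIMARY KEY (id);  -- optional: promote id to a real key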

How does LIMIT interact with DELETE by primary key in Postgres? (Fix corrupt unique index)

I've been handed a database that's stuck in a weird state. At some indeterminate time in the past, I ended up in a situation where I had duplicate rows in the same table with the same primary key:
=> \d my_table
Table "public.my_table"
Column | Type | Modifiers
--------------------+-------------------------+-----------
id | bigint | not null
some_data | bigint |
a_string | character varying(1024) | not null
Indexes:
"my_table_pkey" PRIMARY KEY, btree (id)
=> SELECT id, count(*) FROM my_table GROUP BY id HAVING count(*) > 1 ORDER BY id;
#50-some results, non-consecutive rows.
I have no idea how the database got into this state, but I want to be able to safely get out of it. For each duplicated primary key, if I execute a query of the form:
DELETE FROM my_table WHERE id = "a_duplicated_row" LIMIT 1;
Is it only going to delete one row from the table, or is it going to delete both rows with the given primary key?
Alas, PostgreSQL does not yet implement LIMIT for DELETE or UPDATE. If the rows are indistinguishable in every other way, you will need to carefully use the hidden ctid column to break ties, as discussed here. Or just create the table by selecting distinct tuples from the existing table, and renaming.
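A minimal sketch of the ctid approach, keeping the physically first copy of each duplicated id (with a corrupt index, a sequential scan may be needed so that all duplicates are actually visited):
DELETE FROM my_table a
USING  my_table b
WHERE  a.id = b.id
  AND  a.ctid > b.ctid;  -- keeps only the row with the lowest ctid per id

REINDEX INDEX my_table_pkey;  -- rebuild the previously violated unique index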