Creating a 2-way relationship in a PostgreSQL table

I have 3 tables representing the UUIDs, Name, Location, and Info of a house, its rooms and their drawers (this is an example, as my work is sensitive).
So, for example, one house will have many rooms (one-to-many) and the many rooms will contain many drawers (many-to-many).
The idea is that an associations table will be created where each UUID of the rows in the table will be associated with the corresponding UUID of the other table.
For example, if I query the house which is represented by ID_1 = '1', it will return the following:
SELECT * FROM house WHERE ID_1 = '1';
| ID_1 | ID_2 |
| ---- | ---- |
| 1    | 201  |
| 1    | 254  |
| 1    | 268  |
So far, I have created a temporary version of the associations table showing how I need the data to be represented in the real table. However, now I need a function to automatically fill in the IDs properly for all rows from the temporary associations table into the real associations table. For example:
INSERT INTO associations (id_1, id_2) VALUES
('1','201'),
('201','1');
I need it to be directionless, so that when I query id_1 I'm also getting its linked id_2 in the result.

Let's say your query to get a one-way relationship looks like this:
SELECT room_uuid AS left_uuid, house_the_room_is_in_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
All you need to get the reverse relationship is to swap the columns in the select list; the rest of the query doesn't need to change, however complex it is:
SELECT house_the_room_is_in_uuid AS left_uuid, room_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
Both of those will be valid as queries to insert into a table with two UUID columns:
CREATE TABLE my_lookup_table (left_uuid UUID, right_uuid UUID);
INSERT INTO my_lookup_table (left_uuid, right_uuid)
SELECT ... -- either of the above
To combine them, either insert each into the same table in turn, or use a UNION to create one result set with both sets of rows:
SELECT room_uuid AS left_uuid, house_the_room_is_in_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
UNION
SELECT house_the_room_is_in_uuid AS left_uuid, room_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
All that's required for a union is that the queries have the same number and type of columns. The names (if relevant at all) come from the first query, but I find it more readable if you include the aliases on both.
Since the result of that UNION is itself just a two-column result set, it can be used with the same INSERT statement as before. That would allow you to insert into the table even if it had a self-referencing foreign key constraint as discussed here:
ALTER TABLE my_lookup_table ADD CONSTRAINT
my_lookup_table_combinations_must_be_unique
UNIQUE (left_uuid, right_uuid);
ALTER TABLE my_lookup_table ADD CONSTRAINT
my_lookup_table_must_have_rows_both_ways_around
FOREIGN KEY (right_uuid, left_uuid)
REFERENCES my_lookup_table (left_uuid, right_uuid);
If you tried to insert just one set of rows, this would fail, but with the UNION, by the end of the statement/transaction, each row is in the table both ways around, so the constraint is met.
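Putting those pieces together, a minimal sketch of the combined statement (reusing the column names from the examples above) might look like this:
-- Insert each (room, house) pair both ways around in a single statement,
-- so that by the end of it the symmetric foreign key constraint is satisfied.
INSERT INTO my_lookup_table (left_uuid, right_uuid)
SELECT room_uuid AS left_uuid, house_the_room_is_in_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active
UNION
SELECT house_the_room_is_in_uuid AS left_uuid, room_uuid AS right_uuid
FROM rooms
WHERE house_the_room_is_in_uuid IS NOT NULL
AND is_active;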

Related

Delete all records that violate new unique constraint

I have a table that has the following fields
----------------------------------
| id | user_id | doc_id |
----------------------------------
I want to create a new unique constraint to make sure that there are no repeat user_id and doc_id records. Aka a user can only be linked to a doc one time. That is simple enough.
ALTER TABLE mytable
ADD CONSTRAINT uniquectm_const UNIQUE (user_id, doc_id);
The issue is I have records that currently violate that constraint. I was wondering if there is an easy way to query for those records or to tell postgres just delete anything that violates the constraint.
Identifying records that violate your new key:
SELECT *
FROM
(
SELECT id, user_id, doc_id
, COUNT(*) OVER (PARTITION BY user_id, doc_id) AS unique_check
FROM mytable
) AS dupes -- PostgreSQL requires an alias on the derived table
WHERE unique_check > 1;
Then you can figure out from those duplicates, which should be deleted and perform the delete.
To my knowledge there is no other way to perform this since any automated "Delete any duplicates" command would leave the database engine to decide which of the two-or-more duplicate records to get rid of.
If the entire record is a duplicate (all columns match) then you could just create a new table with your new unique constraint and do an INSERT INTO newtable SELECT DISTINCT * FROM oldtable, but I'm betting that isn't the case.
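If keeping, say, the row with the lowest id in each (user_id, doc_id) group is acceptable, a sketch of the delete (assuming id is unique) could look like this:
DELETE FROM mytable a
USING mytable b
WHERE a.user_id = b.user_id
AND a.doc_id = b.doc_id
AND a.id > b.id;
After that, the ALTER TABLE ... ADD CONSTRAINT from above should succeed.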

Import a CSV with foreign keys

Let's say I have 2 tables: Students and Groups.
The Group table has 2 columns: id, GroupName
The Student table has 3 columns: id, StudentName and GroupID
The GroupID is a foreign key to a Group field.
I need to import the Students table from a CSV, but in my CSV instead of the Group id appears the name of the group. How can I import it with pgAdmin without modifying the csv?
Based on Laurenz's answer, use the following scripts:
Create a temp table to insert from CSV file:
CREATE TEMP TABLE std_temp (id int, student_name char(25), group_name char(25));
Then, import the CSV file:
COPY std_temp FROM '/home/username/Documents/std.csv' CSV HEADER;
Now, create std and grp tables for students and groups:
CREATE TABLE grp (id int, name char(25));
CREATE TABLE std (id int, name char(20), grp_id int);
Now it's the grp table's turn to be populated, based on the distinct values of group name. Note how row_number() is used to provide a value for id:
INSERT INTO grp (id, name) select row_number() OVER (), * from (select distinct group_name from std_temp) as foo;
And the final step, select data based on the join then insert it into the std table:
insert into std (id, name, grp_id) select std_temp.id, std_temp.student_name,grp.id from std_temp inner join grp on std_temp.group_name = grp.name;
At the end, retrieve the data from the final std table:
select * from std;
Your easiest option is to import the file into a temporary table that is defined like the CSV file. Then you can join that table with the "groups" table and use INSERT INTO ... SELECT ... to populate the "students" table.
There is of course also the option to define a view on a join of the two tables and define an INSTEAD OF INSERT trigger on the view that inserts values into the underlying tables as appropriate. Then you could load the data directly to the view.
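As a minimal sketch of that second option, assuming the simple std and grp tables from the answer above (the view, function, and trigger names are just illustrative):
-- A view shaped like the CSV file: id, student name, group name.
CREATE VIEW std_import AS
SELECT s.id, s.name AS student_name, g.name AS group_name
FROM std s
JOIN grp g ON g.id = s.grp_id;
-- Trigger function: resolve the group name to an id, creating the group if needed.
CREATE FUNCTION std_import_insert() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM grp WHERE name = NEW.group_name) THEN
        INSERT INTO grp (id, name)
        SELECT coalesce(max(id), 0) + 1, NEW.group_name FROM grp;
    END IF;
    INSERT INTO std (id, name, grp_id)
    SELECT NEW.id, NEW.student_name, g.id
    FROM grp g
    WHERE g.name = NEW.group_name;
    RETURN NEW;
END;
$$;
CREATE TRIGGER std_import_insert_trg
INSTEAD OF INSERT ON std_import
FOR EACH ROW EXECUTE FUNCTION std_import_insert();
-- Recent PostgreSQL versions can then COPY the CSV straight into the view:
COPY std_import FROM '/home/username/Documents/std.csv' CSV HEADER;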
The suggestion by @LaurenzAlbe is the obvious approach (IMHO never load a spreadsheet directly to your tables, they are untrustworthy beasts). But I believe your implementation after loading the staging table is flawed.
First, using row_number() virtually ensures you get duplicated ids for the same group name.
The ids will always increment from 1 up to the number of group names, no matter how many groups were previously loaded, and you cannot guarantee the identical sequence for a subsequent spreadsheet. What happens when you have a group that does not previously exist?
Further, there is no validation that the group name does not already exist. Result: duplicate group names and/or multiple ids for the same name.
Second, attempting to use the id from the spreadsheet as the id for the student (std) table is full of error possibilities. How do you ensure that number is unique across spreadsheets?
Even if it is unique in a single spreadsheet, how do you ensure another spreadsheet does not use the same numbers as a previous one? Or, assuming multiple users create the spreadsheets, that one user's numbers do not overlap another's, even if all users are very conscious of the numbers they use? Result: duplicate id numbers.
A much better approach would be to put a unique key on the group table's name column, then insert any group names from the stage table into the group table, trapping any duplicate-name errors (using ON CONFLICT). Then load the student table directly from the stage table, selecting the group id from the group table by the (now unique) group name.
create table csv_load_temp( junk_num integer, student_name text, group_name text);
create table groups( grp_id integer generated always as identity
, name text
, grp_key text generated always as ( lower(name) ) stored
, constraint grp_pk
primary key (grp_id)
, constraint grp_bk
unique (grp_key)
);
create table students (std_id integer generated always as identity
, name text
, grp_id integer
, constraint std_pk
primary key (std_id)
, constraint std2grp_fk
foreign key (grp_id)
references groups(grp_id)
);
-- Function to load Groups and Students
create or replace function establish_students()
returns void
language sql
as $$
insert into groups (name)
select distinct group_name
from csv_load_temp
on conflict (grp_key) do nothing;
insert into students (name, grp_id)
select student_name, grp_id
from csv_load_temp t
join groups grp
on (grp.name = t.group_name);
$$;
The groups table requires Postgres v12. For prior versions remove the grp_key column and put the unique constraint directly on the name column. What to do about capitalization is up to your business logic.
See the fiddle for a full example. Obviously the 2 inserts in the establish_students function can be run standalone and independently. In that case the function itself is not necessary.
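As a usage sketch, loading a file into the staging table and running the function might look like this (the path is just a placeholder, and the CSV is assumed to have the same three columns as above):
COPY csv_load_temp FROM '/path/to/students.csv' CSV HEADER;
SELECT establish_students();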

Fine on SQLite, broken in Postgresql: column must appear in the GROUP BY clause or be used in an aggregate function

I have a query which works fine on SQLite, but when I run it on the same data in Postgresql I get:
column "role.id" must appear in the GROUP BY clause or be used in an aggregate function
I have three tables, for people, exhibitions, and a table that links the two: "One person in one exhibition performing a particular role" (such as "Artist" or "Curator"):
CREATE TABLE "person" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "exhibition" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "role" (`id` integer NOT NULL PRIMARY KEY AUTOINCREMENT,
`name` varchar(30) NOT NULL,
`exhibition_id` integer NOT NULL,
`person_id` integer NOT NULL,
FOREIGN KEY(`exhibition_id`) REFERENCES `exhibition`(`id`),
FOREIGN KEY(`person_id`) REFERENCES `person`(`id`));
I want to display the people involved in an exhibition ordered by how many things they've done. So, I get the IDs of the people in an exhibition (1,2,3,4) and then do this:
SELECT
*,
COUNT(person.id) AS role_count
FROM person
INNER JOIN role
ON person.id = role.person_id
WHERE person.id IN ( 1, 2, 3, 4 )
GROUP BY person.id
ORDER BY role_count DESC
That orders the people by role_count, which is the number of roles they've had across all exhibitions.
It works fine on SQLite, but not in Postgresql. I've tried putting role.id into the GROUP BY (instead of, and as well as, person.id) but that changes the results.
You know when you struggle for ages, post an SO question, and then immediately stumble on the answer?
From this answer I realised that I couldn't select role.id (which the SELECT * is implicitly doing) as it wasn't in the GROUP BY.
I couldn't add it to the GROUP BY (because that changes the results) so the solution was to not select it.
So I changed the SELECT part to:
SELECT
person.*,
COUNT(person.id) AS role_count
FROM person
...
Now role.id is not being selected. And that works.
If I needed any other fields from the role table, like name, I'd have to treat them the same way, for example by wrapping them in an aggregate (selecting role.name on its own would trigger the same error, since it isn't in the GROUP BY either):
SELECT
person.*,
MAX(role.name) AS role_name,
COUNT(person.id) AS role_count
FROM person
...
Just like the error says, Standard SQL doesn't let you SELECT anything other than one of the GROUP BY columns or a call to an aggregate function. (For a logical reason: How would the RDBMS know which role.id to select when there are multiple rows to select from within a group?) PostgreSQL actually enforces this rule; SQLite ignores it and just returns data from an arbitrary row in the group.
As you discovered, omitting role.id from the SELECT fixes your error. But if you do want SQLite's behavior of selecting the ID from an arbitrary row, you can simply wrap it in an aggregate function, e.g., SELECT MAX(role.id) instead of just SELECT role.id.
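If you also want to see which roles each person had, one possible sketch (not from the original answers) is to aggregate the role names per person:
SELECT
person.*,
COUNT(role.id) AS role_count,
array_agg(role.name) AS role_names
FROM person
INNER JOIN role
ON person.id = role.person_id
WHERE person.id IN ( 1, 2, 3, 4 )
GROUP BY person.id
ORDER BY role_count DESC
This works because grouping by person's primary key makes person.* allowed, while every column taken from role is wrapped in an aggregate.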

How does LIMIT interact with DELETE by primary key in Postgres? (Fix corrupt unique index)

I've been handed a database that's stuck in a weird state. At some indeterminate time in the past, I ended up in a situation where I had duplicate rows in the same table with the same primary key:
=> \d my_table
           Table "public.my_table"
       Column       |          Type           | Modifiers
--------------------+-------------------------+-----------
 id                 | bigint                  | not null
 some_data          | bigint                  |
 a_string           | character varying(1024) | not null
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
=> SELECT id, count(*) FROM my_table GROUP BY id HAVING count(*) > 1 ORDER BY id;
#50-some results, non-consecutive rows.
I have no idea how the database got into this state, but I want to be able to safely get out of it. If, for each duplicated primary key, I execute a query of the form:
DELETE FROM my_table WHERE id = "a_duplicated_row" LIMIT 1;
Is it only going to delete one row from the table, or is it going to delete both rows with the given primary key?
Alas, PostgreSQL does not yet implement LIMIT for DELETE or UPDATE. If the rows are indistinguishable in every other way, you will need to carefully use the hidden ctid column to break ties, as discussed here. Or just create the table by selecting distinct tuples from the existing table, and renaming.
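A sketch of the ctid approach, keeping one physical row per duplicated id (assuming the duplicate rows really are identical in every other column):
DELETE FROM my_table a
USING my_table b
WHERE a.id = b.id
AND a.ctid > b.ctid;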

delete duplicate rows in large postgresql database table

I have a PostgreSQL database that is 100 GB in size. One of the tables has about half a billion entries. For quick data entry, some of the data was repeated and left to be pruned later. One of the columns can be used to identify the rows as unique.
I found this stackoverflow question which suggested a solution for mysql:
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)
Is there anything similar for postgresql?
I tried deleting with GROUP BY and with row_number(); my computer runs out of memory after a few hours in both cases.
This is what I get when I try to estimate the number of rows in the table:
SELECT reltuples FROM pg_class WHERE relname = 'orders';
reltuples
-------------
4.38543e+08
(1 row)
Two solutions immediately come to mind:
1). Create a new table as SELECT * FROM the source table, with a WHERE clause to determine the unique rows (a sketch appears after the example below). Add the indexes to match the source table, then rename them both in a transaction. Whether or not this will work for you depends on several factors, including the amount of free disk space, whether the table is in constant use and interruptions to access are permissible, etc. Creating a new table has the benefit of tightly packing your data and indexes, and the table will be smaller than the original because the non-unique rows are omitted.
2). Create a partial unique index over your columns, with a WHERE clause to filter out the non-uniques.
For example:
test=# create table t ( col1 int, col2 int, is_unique boolean);
CREATE TABLE
test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
INSERT 0 3
test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
CREATE INDEX
test=# \d t
         Table "public.t"
  Column   |  Type   | Modifiers
-----------+---------+-----------
 col1      | integer |
 col2      | integer |
 is_unique | boolean |
Indexes:
    "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE