I have a table that needs id to be unique, the id column in my table has values like sourcename_number
id
noaa_1
noaa_2
noaa_3
noaa_3
noaa_3
meteo_1
meteo_2
meteo_2
I wanted to change it as
noaa_1
noaa_2
noaa_3
noaa_4
noaa_5
meteo_1
meteo_2
meteo_3
so that there are no duplication in the column
You need a regex to match the -
you need a row_number() to enumerate within the string classes
You need an identity column to order by (this could be the ctid system column)
you need a self-join subquery to refer to the row_number()
\i tmp.sql
CREATE TABLE ugly_duck
( id_name text
);
INSERT INTO ugly_duck(id_name) VALUES
( 'noaa_1' ) , ( 'noaa_2' ) , ( 'noaa_3' ) , ( 'noaa_3' ) , ( 'noaa_3' )
, ( 'meteo_1' ) , ( 'meteo_2' ) , ( 'meteo_2' )
;
SELECT * FROM ugly_duck;
-- create a surrogate unique key
ALTER TABLE ugly_duck ADD COLUMN seq serial UNIQUE;
UPDATE ugly_duck dst
SET id_name = src.prefix || src.rn::integer
FROM ( SELECT seq
, substring( id_name FROM '.*_') AS prefix
, row_number() OVER (PARTITION BY substring( id_name FROM '.*_') ORDER BY seq) rn
FROM ugly_duck
) src
WHERE src.seq=dst.seq
;
SELECT * FROM ugly_duck;
-- remove the surrogate key
ALTER TABLE ugly_duck DROP COLUMN seq;
-- create the natural key
ALTER TABLE ugly_duck ADD PRIMARY KEY (id_name);
SELECT * FROM ugly_duck;
\d+ ugly_duck
Output:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 8
id_name
---------
noaa_1
noaa_2
noaa_3
noaa_3
noaa_3
meteo_1
meteo_2
meteo_2
(8 rows)
ALTER TABLE
UPDATE 8
id_name | seq
---------+-----
meteo_1 | 6
meteo_2 | 7
meteo_3 | 8
noaa_1 | 1
noaa_2 | 2
noaa_3 | 3
noaa_4 | 4
noaa_5 | 5
(8 rows)
ALTER TABLE
ALTER TABLE
id_name
---------
meteo_1
meteo_2
meteo_3
noaa_1
noaa_2
noaa_3
noaa_4
noaa_5
(8 rows)
Table "tmp.ugly_duck"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
id_name | text | | not null | | extended | |
Indexes:
"ugly_duck_pkey" PRIMARY KEY, btree (id_name)
Related
I want to insert into table1 multiple rows from table2. The problem is that I have a field of same name in table2 and table1 and I don't want to insert data if there's already a record with same value in this field. Now I have something like this:
insert into table1 (id, sameField, constantField, superFied)
select gen_random_uuid(), "sameField", 'constant', "anotherField"
from table2;
And I assume I need to do something like this:
insert into table1 (id, sameField, constantField, superFied)
select gen_random_uuid(), "sameField", 'constant', "anotherField"
from table2
where not exists ... ?
What I need to write instead of ? if I want this logic: check if there's already same value in sameField in table1 when selecting sameField from table2? DBMS is Postgres.
You can use a sub-query to see whether the record exists. You will need to define the column(s) which should be unique.
create table table2(
id varchar(100),
sameField varchar(25),
constant varchar(25),
superField varchar(25)
);
insert into table2 values
(gen_random_uuid(),'same1','constant1','super1'),
(gen_random_uuid(),'same2','constant2','super2')
✓
2 rows affected
create table table1(
id varchar(100),
sameField varchar(25),
constant varchar(25),
superField varchar(25)
);
insert into table1 values
(gen_random_uuid(),'same1','constant1','super1');
✓
1 rows affected
insert into table1 (id, sameField, constant, superField)
select uuid_in(md5(random()::text || clock_timestamp()::text)::cstring),
t2.sameField, 'constant', t2.superField
from table2 t2
where sameField not in (select sameField from table1)
1 rows affected
select * from table1;
select * from table2;
id | samefield | constant | superfield
:----------------------------------- | :-------- | :-------- | :---------
4cf10b1c-7a3f-4323-9a16-cce681fcd6d8 | same1 | constant1 | super1
d8cf27a0-3f55-da50-c274-c4a76c697b84 | same2 | constant | super2
id | samefield | constant | superfield
:----------------------------------- | :-------- | :-------- | :---------
c8a83804-9f0b-4d97-8049-51c2c8c54665 | same1 | constant1 | super1
3a9cf8b5-8488-4278-a06a-fd75fa74e206 | same2 | constant2 | super2
db<>fiddle here
Consider following scenario in PostgreSQL (any version from 10+):
CREATE TABLE users(
id serial primary key,
name text not null unique,
last_seen timestamp
);
INSERT INTO users(name, last_seen)
VALUES ('Alice', '2019-05-01'),
('Bob', '2019-04-29'),
('Dorian', '2019-05-11');
CREATE TABLE inactive_users(
user_id int primary key references users(id),
last_seen timestamp not null);
INSERT INTO inactive_users(user_id, last_seen)
SELECT id as user_id, last_seen FROM users
WHERE users.last_seen < '2019-05-04'
ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
Now let's say that I want to insert the same values (execute last statement) multiple times, every now and then. In practice from the database point of view, on conflicting values 90% of the time last_seen column will be updated to the same value it already had. The values of the rows stay the same, so there's no reason to do I/O writes, right? But is this really the case, or will postgres perform corresponding updates even though the actual value didn't change?
In my case the destination table has dozens of millions of rows, but only few hundreds/thousands of them will be really changing on each of the insert calls.
Any UPDATE to a row will actually create a new row (marking the old row deleted/dirty), regardless of the before/after values:
[root#497ba0eaf137 /]# psql
psql (12.1)
Type "help" for help.
postgres=# create table foo (id int, name text);
CREATE TABLE
postgres=# insert into foo values (1,'a');
INSERT 0 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,1) | 1 | a
(1 row)
postgres=# update foo set name = 'a' where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,2) | 1 | a
(1 row)
postgres=# update foo set id = 1 where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,3) | 1 | a
(1 row)
postgres=# select * from pg_stat_user_tables where relname = 'foo';
-[ RECORD 1 ]-------+-------
relid | 16384
schemaname | public
relname | foo
seq_scan | 5
seq_tup_read | 5
idx_scan |
idx_tup_fetch |
n_tup_ins | 1
n_tup_upd | 2
n_tup_del | 0
n_tup_hot_upd | 2
n_live_tup | 1
n_dead_tup | 2
<...>
And according to your example:
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,1) | 1 | 2019-05-01 00:00:00
(0,2) | 2 | 2019-04-29 00:00:00
(2 rows)
postgres=# INSERT INTO inactive_users(user_id, last_seen)
postgres-# SELECT id as user_id, last_seen FROM users
postgres-# WHERE users.last_seen < '2019-05-04'
postgres-# ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
INSERT 0 2
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,3) | 1 | 2019-05-01 00:00:00
(0,4) | 2 | 2019-04-29 00:00:00
(2 rows)
Postgres does not do any data validation against the column values -- if you are looking to prevent unnecessary write activity, you will need to surgically craft your WHERE clauses.
Disclosure: I work for EnterpriseDB (EDB)
I have a large table (6+ million rows) that I'd like to add an auto-incrementing integer column sid, where sid is set on existing rows based on an ORDER BY inserted_at ASC. In other words, the oldest record based on inserted_at would be set to 1 and the latest record would be the total record count. Any tips on how I might approach this?
Add a sid column and UPDATE SET ... FROM ... WHERE:
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) as rownum
FROM test) t
WHERE test.id = t.id
Note that this relies on there being a primary key, id.
(If your table did not already have a primary key, you would have to make one first.)
For example,
-- create test table
DROP TABLE IF EXISTS test;
CREATE TABLE test (
id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
, foo text
, inserted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO test (foo, inserted_at) VALUES
('XYZ', '2019-02-14 00:00:00-00')
, ('DEF', '2010-02-14 00:00:00-00')
, ('ABC', '2000-02-14 00:00:00-00');
-- +----+-----+------------------------+
-- | id | foo | inserted_at |
-- +----+-----+------------------------+
-- | 1 | XYZ | 2019-02-13 19:00:00-05 |
-- | 2 | DEF | 2010-02-13 19:00:00-05 |
-- | 3 | ABC | 2000-02-13 19:00:00-05 |
-- +----+-----+------------------------+
ALTER TABLE test ADD COLUMN sid INT;
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) as rownum
FROM test) t
WHERE test.id = t.id
yields
+----+-----+------------------------+-----+
| id | foo | inserted_at | sid |
+----+-----+------------------------+-----+
| 3 | ABC | 2000-02-13 19:00:00-05 | 1 |
| 2 | DEF | 2010-02-13 19:00:00-05 | 2 |
| 1 | XYZ | 2019-02-13 19:00:00-05 | 3 |
+----+-----+------------------------+-----+
Finally, make sid SERIAL (or, better, an IDENTITY column):
ALTER TABLE test ALTER COLUMN sid SET NOT NULL;
-- IDENTITY fixes certain issue which may arise with SERIAL
ALTER TABLE test ALTER COLUMN sid ADD GENERATED BY DEFAULT AS IDENTITY;
-- ALTER TABLE test ALTER COLUMN sid SERIAL;
I'm having some trouble working out the PostgreSQL documentation for recursive queries, and wonder if anyone might be able to offer a suggestion for the following.
Here's the data:
Table "public.subjects"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+--------------------------------------
id | bigint | | not null | nextval('subjects_id_seq'::regclass)
name | character varying | | |
Table "public.subject_associations"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------------------
id | bigint | | not null | nextval('subject_associations_id_seq'::regclass)
parent_id | integer | | |
child_id | integer | | |
Here, a "subject" may have many parents and many children. Of course, at the top level a subject has no parents and at the bottom no children. For example:
parent_id | child_id
------------+------------
2 | 3
1 | 4
1 | 3
4 | 8
4 | 5
5 | 6
6 | 7
What I'm looking for is starting with a child_id to get all the ancestors, and with a parent_id, all the descendants. Therefore:
parent_id 1 -> children 3, 4, 5, 6, 7, 8
parent_id 2 -> children 3
child_id 3 -> parents 1, 2
child_id 4 -> parents 1
child_id 7 -> parents 6, 5, 4, 1
Though there seem to be a lot of examples of similar things about I'm having trouble making sense of them, so any suggestions I can try out would be welcome.
To get all children for subject 1, you can use
WITH RECURSIVE c AS (
SELECT 1 AS id
UNION ALL
SELECT sa.child_id
FROM subject_associations AS sa
JOIN c ON c.id = sa. parent_id
)
SELECT id FROM c;
CREATE OR REPLACE FUNCTION func_finddescendants(start_id integer)
RETURNS SETOF subject_associations
AS $$
DECLARE
BEGIN
RETURN QUERY
WITH RECURSIVE t
AS
(
SELECT *
FROM subject_associations sa
WHERE sa.id = start_id
UNION ALL
SELECT next.*
FROM t prev
JOIN subject_associations next ON (next.parentid = prev.id)
)
SELECT * FROM t;
END;
$$ LANGUAGE PLPGSQL;
Try this
--- Table
-- DROP SEQUENCE public.data_id_seq;
CREATE SEQUENCE "data_id_seq"
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START 1
CACHE 1;
ALTER TABLE public.data_id_seq
OWNER TO postgres;
CREATE TABLE public.data
(
id integer NOT NULL DEFAULT nextval('data_id_seq'::regclass),
name character varying(50) NOT NULL,
label character varying(50) NOT NULL,
parent_id integer NOT NULL,
CONSTRAINT data_pkey PRIMARY KEY (id),
CONSTRAINT data_name_parent_id_unique UNIQUE (name, parent_id)
)
WITH (
OIDS=FALSE
);
INSERT INTO public.data(id, name, label, parent_id) VALUES (1,'animal','Animal',0);
INSERT INTO public.data(id, name, label, parent_id) VALUES (5,'birds','Birds',1);
INSERT INTO public.data(id, name, label, parent_id) VALUES (6,'fish','Fish',1);
INSERT INTO public.data(id, name, label, parent_id) VALUES (7,'parrot','Parrot',5);
INSERT INTO public.data(id, name, label, parent_id) VALUES (8,'barb','Barb',6);
--- Function
CREATE OR REPLACE FUNCTION public.get_all_children_of_parent(use_parent integer) RETURNS integer[] AS
$BODY$
DECLARE
process_parents INT4[] := ARRAY[ use_parent ];
children INT4[] := '{}';
new_children INT4[];
BEGIN
WHILE ( array_upper( process_parents, 1 ) IS NOT NULL ) LOOP
new_children := ARRAY( SELECT id FROM data WHERE parent_id = ANY( process_parents ) AND id <> ALL( children ) );
children := children || new_children;
process_parents := new_children;
END LOOP;
RETURN children;
END;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100;
ALTER FUNCTION public.get_all_children_of_parent(integer) OWNER TO postgres
--- Test
SELECT * FROM data WHERE id = any(get_all_children_of_parent(1))
SELECT * FROM data WHERE id = any(get_all_children_of_parent(5))
SELECT * FROM data WHERE id = any(get_all_children_of_parent(6))
When doing an UPDATE query, we got the following error message:
ERROR: duplicate key value violates unique constraint "tableA_pkey"
DETAIL: Key (id)=(47470) already exists.
However, our UPDATE query does not affect the primary key. Here is a simplified version:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE
a.end_at BETWEEN now() AND now() - interval '1 day';
We ensured the primary key sequence was already synced:
\d tableA_id_seq
Which produces:
Column | Type | Value
---------------+---------+--------------------------
sequence_name | name | tableA_id_seq
last_value | bigint | 50364
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | t
Looking for maximum table index:
select max(id) from tableA;
We got a lower value:
max
-------
50363
(1 row)
Have you any idea on why such a behavior? If we exclude the problematic id, it works.
Another strange point is that replacing the previous UPDATE by:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE a.id = 47470;
It works well. Are we missing something?
EDIT: triggers
I have no user-defined triggers on this table:
SELECT t.tgname, c.relname
FROM pg_trigger t
JOIN pg_class c ON t.tgrelid = c.oid
WHERE
c.relname = 'tableA'
AND
t.tgisinternal = false
;
Which returns no row.
Note: I am using psql (PostgreSQL) 9.3.4 version.
Not really sure what was the cause. However, deleting the two (non vital) records corresponding to already existing ids (?) solved the issue.