PostgreSQL add SERIAL column to existing table with values based on ORDER BY

I have a large table (6+ million rows) that I'd like to add an auto-incrementing integer column sid, where sid is set on existing rows based on an ORDER BY inserted_at ASC. In other words, the oldest record based on inserted_at would be set to 1 and the latest record would be the total record count. Any tips on how I might approach this?

Add a sid column and UPDATE SET ... FROM ... WHERE:
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) AS rownum
      FROM test) t
WHERE test.id = t.id;
Note that this relies on there being a primary key, id.
(If your table did not already have a primary key, you would have to make one first.)
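For instance, a minimal sketch of adding a surrogate key first (a hypothetical table named test, as in the example below, and PostgreSQL 10+ for identity columns):
-- hypothetical: give the table a surrogate primary key if it has none
ALTER TABLE test ADD COLUMN id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY;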
For example,
-- create test table
DROP TABLE IF EXISTS test;
CREATE TABLE test (
id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
, foo text
, inserted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO test (foo, inserted_at) VALUES
('XYZ', '2019-02-14 00:00:00-00')
, ('DEF', '2010-02-14 00:00:00-00')
, ('ABC', '2000-02-14 00:00:00-00');
-- +----+-----+------------------------+
-- | id | foo | inserted_at |
-- +----+-----+------------------------+
-- | 1 | XYZ | 2019-02-13 19:00:00-05 |
-- | 2 | DEF | 2010-02-13 19:00:00-05 |
-- | 3 | ABC | 2000-02-13 19:00:00-05 |
-- +----+-----+------------------------+
ALTER TABLE test ADD COLUMN sid INT;
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) AS rownum
      FROM test) t
WHERE test.id = t.id;
yields
+----+-----+------------------------+-----+
| id | foo | inserted_at | sid |
+----+-----+------------------------+-----+
| 3 | ABC | 2000-02-13 19:00:00-05 | 1 |
| 2 | DEF | 2010-02-13 19:00:00-05 | 2 |
| 1 | XYZ | 2019-02-13 19:00:00-05 | 3 |
+----+-----+------------------------+-----+
Finally, make sid SERIAL (or, better, an IDENTITY column):
ALTER TABLE test ALTER COLUMN sid SET NOT NULL;
-- IDENTITY avoids certain issues which may arise with SERIAL
ALTER TABLE test ALTER COLUMN sid ADD GENERATED BY DEFAULT AS IDENTITY;
-- (there is no "ALTER COLUMN sid SERIAL"; SERIAL is only shorthand available when a column is created)
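One follow-up step not shown above: the identity's backing sequence still starts at 1, so the next generated sid would collide with the values just assigned. A sketch of how to advance it, assuming the table and column names used here:
-- advance the identity sequence past the highest sid assigned by the UPDATE
SELECT setval(pg_get_serial_sequence('test', 'sid'), (SELECT max(sid) FROM test));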

Related

Optimizing Postgres Count For Aggregated Select

I have a query that is intended to retrieve the counts of each grouped product like so
SELECT
    product_name,
    product_color,
    (array_agg("product_distributor"))[1] AS "product_distributor",
    (array_agg("product_release"))[1] AS "product_release",
    COUNT(*) AS "count"
FROM
    product
WHERE
    product.id IN (
        SELECT id
        FROM product
        WHERE (product_name ILIKE '%red%'
               OR product_color ILIKE '%red%')
          AND product_type = 1)
GROUP BY
    product_name, product_color
LIMIT 1000
OFFSET 0
This query is run on the following table
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
product_type | integer | | not null |
id | integer | | not null |
product_name | citext | | not null |
product_color | character varying(255) | | |
product_distributor | integer | | |
product_release | timestamp with time zone | | |
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_distributer_index" btree (product_distributor)
"product_product_type_name_color" UNIQUE, btree (product_type, name, color)
"product_product_type_index" btree (product_type)
"product_name_color_index" btree (name, color)
Foreign-key constraints:
"product_product_type_fkey" FOREIGN KEY (product_type) REFERENCES product_type(id) ON UPDATE CASCADE ON DELETE CASCADE
"product_product_distributor_id" FOREIGN KEY (product_distributor) REFERENCES product_distributor(id)
How can I improve the performance of this query, specifically the COUNT(*) portion, which improves the query time when removed but is required?
You may try using an INNER JOIN in place of the WHERE ... IN clause:
WITH selected_products AS (
    SELECT id
    FROM product
    WHERE (product_name ILIKE '%red%' OR product_color ILIKE '%red%')
      AND product_type = 1
)
SELECT product_name,
       product_color,
       (ARRAY_AGG("product_distributor"))[1] AS "product_distributor",
       (ARRAY_AGG("product_release"))[1] AS "product_release",
       COUNT(*) AS "count"
FROM product p
INNER JOIN selected_products sp
        ON p.id = sp.id
GROUP BY product_name,
         product_color
LIMIT 1000
OFFSET 0;
Then create an index on the "product.id" field as follows:
CREATE INDEX product_ids_idx ON product USING HASH (id);
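Either way, it is worth confirming with EXPLAIN that the planner actually uses the join and the new index. A minimal sketch, assuming the rewritten query above (the array_agg columns are omitted here for brevity):
-- compare plans before and after the rewrite / new index
EXPLAIN (ANALYZE, BUFFERS)
WITH selected_products AS (
    SELECT id
    FROM product
    WHERE (product_name ILIKE '%red%' OR product_color ILIKE '%red%')
      AND product_type = 1
)
SELECT product_name, product_color, COUNT(*) AS "count"
FROM product p
INNER JOIN selected_products sp ON p.id = sp.id
GROUP BY product_name, product_color;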

Liquibase insert select where not exists

I want to insert into table1 multiple rows from table2. The problem is that table1 and table2 share a field of the same name, and I don't want to insert data if there is already a record with the same value in that field. Now I have something like this:
insert into table1 (id, sameField, constantField, superFied)
select gen_random_uuid(), "sameField", 'constant', "anotherField"
from table2;
And I assume I need to do something like this:
insert into table1 (id, sameField, constantField, superFied)
select gen_random_uuid(), "sameField", 'constant', "anotherField"
from table2
where not exists ... ?
What do I need to write instead of the ? if I want this logic: when selecting sameField from table2, check whether the same value already exists in sameField in table1? The DBMS is Postgres.
You can use a sub-query to see whether the record exists. You will need to define the column(s) which should be unique.
create table table2(
id varchar(100),
sameField varchar(25),
constant varchar(25),
superField varchar(25)
);
insert into table2 values
(gen_random_uuid(),'same1','constant1','super1'),
(gen_random_uuid(),'same2','constant2','super2')
2 rows affected
create table table1(
id varchar(100),
sameField varchar(25),
constant varchar(25),
superField varchar(25)
);
insert into table1 values
(gen_random_uuid(),'same1','constant1','super1');
1 rows affected
insert into table1 (id, sameField, constant, superField)
select uuid_in(md5(random()::text || clock_timestamp()::text)::cstring),
t2.sameField, 'constant', t2.superField
from table2 t2
where sameField not in (select sameField from table1)
1 rows affected
select * from table1;
select * from table2;
id | samefield | constant | superfield
:----------------------------------- | :-------- | :-------- | :---------
4cf10b1c-7a3f-4323-9a16-cce681fcd6d8 | same1 | constant1 | super1
d8cf27a0-3f55-da50-c274-c4a76c697b84 | same2 | constant | super2
id | samefield | constant | superfield
:----------------------------------- | :-------- | :-------- | :---------
c8a83804-9f0b-4d97-8049-51c2c8c54665 | same1 | constant1 | super1
3a9cf8b5-8488-4278-a06a-fd75fa74e206 | same2 | constant2 | super2
db<>fiddle here

Postgres: does updating a column value to the same value mark the page as dirty?

Consider the following scenario in PostgreSQL (any version from 10+):
CREATE TABLE users(
id serial primary key,
name text not null unique,
last_seen timestamp
);
INSERT INTO users(name, last_seen)
VALUES ('Alice', '2019-05-01'),
('Bob', '2019-04-29'),
('Dorian', '2019-05-11');
CREATE TABLE inactive_users(
user_id int primary key references users(id),
last_seen timestamp not null);
INSERT INTO inactive_users(user_id, last_seen)
SELECT id as user_id, last_seen FROM users
WHERE users.last_seen < '2019-05-04'
ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
Now let's say that I want to insert the same values (execute the last statement) multiple times, every now and then. In practice, from the database's point of view, 90% of the time the conflicting rows' last_seen column will be updated to the same value it already had. The values of the rows stay the same, so there's no reason to do I/O writes, right? But is this really the case, or will Postgres perform the corresponding updates even though the actual values didn't change?
In my case the destination table has dozens of millions of rows, but only a few hundred or thousand of them will really change on each of the insert calls.
Any UPDATE to a row will actually create a new row (marking the old row deleted/dirty), regardless of the before/after values:
[root@497ba0eaf137 /]# psql
psql (12.1)
Type "help" for help.
postgres=# create table foo (id int, name text);
CREATE TABLE
postgres=# insert into foo values (1,'a');
INSERT 0 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,1) | 1 | a
(1 row)
postgres=# update foo set name = 'a' where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,2) | 1 | a
(1 row)
postgres=# update foo set id = 1 where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,3) | 1 | a
(1 row)
postgres=# select * from pg_stat_user_tables where relname = 'foo';
-[ RECORD 1 ]-------+-------
relid | 16384
schemaname | public
relname | foo
seq_scan | 5
seq_tup_read | 5
idx_scan |
idx_tup_fetch |
n_tup_ins | 1
n_tup_upd | 2
n_tup_del | 0
n_tup_hot_upd | 2
n_live_tup | 1
n_dead_tup | 2
<...>
And according to your example:
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,1) | 1 | 2019-05-01 00:00:00
(0,2) | 2 | 2019-04-29 00:00:00
(2 rows)
postgres=# INSERT INTO inactive_users(user_id, last_seen)
postgres-# SELECT id as user_id, last_seen FROM users
postgres-# WHERE users.last_seen < '2019-05-04'
postgres-# ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
INSERT 0 2
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,3) | 1 | 2019-05-01 00:00:00
(0,4) | 2 | 2019-04-29 00:00:00
(2 rows)
Postgres does not do any data validation against the column values -- if you are looking to prevent unnecessary write activity, you will need to surgically craft your WHERE clauses.
Disclosure: I work for EnterpriseDB (EDB)
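For the upsert in the question, one such WHERE clause is a condition on the DO UPDATE branch, so conflicting rows whose last_seen is unchanged are skipped and no new row version is written. A sketch, assuming the schema above:
INSERT INTO inactive_users(user_id, last_seen)
SELECT id AS user_id, last_seen FROM users
WHERE users.last_seen < '2019-05-04'
ON CONFLICT (user_id) DO UPDATE
SET last_seen = excluded.last_seen
-- only touch rows where the value actually differs
WHERE inactive_users.last_seen IS DISTINCT FROM excluded.last_seen;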

PostgreSQL: duplicate key value violates unique constraint on UPDATE command

When doing an UPDATE query, we got the following error message:
ERROR: duplicate key value violates unique constraint "tableA_pkey"
DETAIL: Key (id)=(47470) already exists.
However, our UPDATE query does not affect the primary key. Here is a simplified version:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE
a.end_at BETWEEN now() AND now() - interval '1 day';
We ensured the primary key sequence was already synced:
\d tableA_id_seq
Which produces:
Column | Type | Value
---------------+---------+--------------------------
sequence_name | name | tableA_id_seq
last_value | bigint | 50364
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | t
Looking up the maximum id in the table:
select max(id) from tableA;
We got a lower value:
max
-------
50363
(1 row)
Do you have any idea why this happens? If we exclude the problematic id, it works.
Another strange point: replacing the previous UPDATE with the following:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = b.id
GROUP BY b.name
),
','
)
)
WHERE a.id = 47470;
It works well. Are we missing something?
EDIT: triggers
I have no user-defined triggers on this table:
SELECT t.tgname, c.relname
FROM pg_trigger t
JOIN pg_class c ON t.tgrelid = c.oid
WHERE
c.relname = 'tableA'
AND
t.tgisinternal = false
;
Which returns no rows.
Note: I am using psql (PostgreSQL) 9.3.4 version.
Not really sure what the cause was. However, deleting the two (non-vital) records corresponding to the already existing ids solved the issue.

reference to a sequence column (postgresql)

I encountered a problem when creating a foreign key that references a sequence; see the code example below.
But on creating the tables I receive the following error:
"Detail: Key columns "product" and "id" are of incompatible types: integer and ownseq"
I've already tried different datatypes for the product column (like smallint, bigint) but none of them is accepted.
CREATE SEQUENCE ownseq INCREMENT BY 1 MINVALUE 100 MAXVALUE 99999;
CREATE TABLE products (
id ownseq PRIMARY KEY,
...);
CREATE TABLE basket (
basket_id SERIAL PRIMARY KEY,
product INTEGER REFERENCES products(id));
CREATE SEQUENCE ownseq INCREMENT BY 1 MINVALUE 100 MAXVALUE 99999;
CREATE TABLE products (
id integer PRIMARY KEY default nextval('ownseq'),
...
);
alter sequence ownseq owned by products.id;
The key change is that id is defined as an integer, rather than as ownseq. This is what would happen if you used the SERIAL pseudo-type to create the sequence.
Try
CREATE TABLE products (
id INTEGER DEFAULT nextval(('ownseq'::text)::regclass) NOT NULL PRIMARY KEY,
...);
or don't create the sequence ownseq and let postgres do it for you:
CREATE TABLE products (
id SERIAL NOT NULL PRIMARY KEY
...);
In the above case the name of the sequence Postgres creates will be products_id_seq.
Hope this helps.
PostgreSQL is powerful and you have just been bitten by an advanced feature.
Your DDL is quite valid but not at all what you think it is.
A sequence can be thought of as an extra-transactional simple table used for generating next values for some columns.
What you meant to do
You meant to have the id field defined thus, as per the other answer:
id integer PRIMARY KEY default nextval('ownseq'),
What you did
What you did was actually define a nested data structure for your table. Suppose I create a test sequence:
CREATE SEQUENCE testseq;
Then suppose I \d testseq on Pg 9.1, I get:
Sequence "public.testseq"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | testseq
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | f
This is the definition of the composite type associated with the sequence.
Now suppose I:
create table seqtest (test testseq, id serial);
I can insert into it:
INSERT INTO seqtest (id, test) values (default, '("testseq",3,4,1,133445,1,1,0,f,f)');
I can then select from it:
select * from seqtest;
test | id
----------------------------------+----
(testseq,3,4,1,133445,1,1,0,f,f) | 2
Moreover I can expand test:
SELECT (test).* from seqtest;
 sequence_name | last_value | start_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+-------------+--------------+-----------+-----------+-------------+---------+-----------+-----------
               |            |             |              |           |           |             |         |           |
 testseq       |          3 |           4 |            1 |    133445 |         1 |           1 |       0 | f         | f
(2 rows)
This sort of thing is actually very powerful in PostgreSQL but full of unexpected corners (for example not null and check constraints don't work as expected with nested data types). I don't generally recommend nested data types, but it is worth knowing that PostgreSQL can do this and will be happy to accept SQL commands to do it without warning.