Postgres: does updating a column to the same value mark the page as dirty? - postgresql

Consider the following scenario in PostgreSQL (any version from 10+):
CREATE TABLE users(
id serial primary key,
name text not null unique,
last_seen timestamp
);
INSERT INTO users(name, last_seen)
VALUES ('Alice', '2019-05-01'),
('Bob', '2019-04-29'),
('Dorian', '2019-05-11');
CREATE TABLE inactive_users(
user_id int primary key references users(id),
last_seen timestamp not null);
INSERT INTO inactive_users(user_id, last_seen)
SELECT id as user_id, last_seen FROM users
WHERE users.last_seen < '2019-05-04'
ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
Now let's say that I want to run that last statement repeatedly, every now and then. In practice, from the database's point of view, 90% of the time the conflicting rows will have last_seen updated to the same value it already had. The row values stay the same, so there should be no reason to do I/O writes, right? Or is that not the case, and Postgres will perform the updates even though the actual values didn't change?
In my case the destination table has tens of millions of rows, but only a few hundred or thousand of them will actually change on each insert call.

Any UPDATE of a row actually creates a new row version (marking the old version dead), regardless of the before/after values:
[root#497ba0eaf137 /]# psql
psql (12.1)
Type "help" for help.
postgres=# create table foo (id int, name text);
CREATE TABLE
postgres=# insert into foo values (1,'a');
INSERT 0 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,1) | 1 | a
(1 row)
postgres=# update foo set name = 'a' where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,2) | 1 | a
(1 row)
postgres=# update foo set id = 1 where id = 1;
UPDATE 1
postgres=# select ctid,* from foo;
ctid | id | name
-------+----+------
(0,3) | 1 | a
(1 row)
postgres=# select * from pg_stat_user_tables where relname = 'foo';
-[ RECORD 1 ]-------+-------
relid | 16384
schemaname | public
relname | foo
seq_scan | 5
seq_tup_read | 5
idx_scan |
idx_tup_fetch |
n_tup_ins | 1
n_tup_upd | 2
n_tup_del | 0
n_tup_hot_upd | 2
n_live_tup | 1
n_dead_tup | 2
<...>
And according to your example:
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,1) | 1 | 2019-05-01 00:00:00
(0,2) | 2 | 2019-04-29 00:00:00
(2 rows)
postgres=# INSERT INTO inactive_users(user_id, last_seen)
postgres-# SELECT id as user_id, last_seen FROM users
postgres-# WHERE users.last_seen < '2019-05-04'
postgres-# ON CONFLICT (user_id) DO UPDATE SET last_seen = excluded.last_seen;
INSERT 0 2
postgres=# select ctid,* FROM inactive_users ;
ctid | user_id | last_seen
-------+---------+---------------------
(0,3) | 1 | 2019-05-01 00:00:00
(0,4) | 2 | 2019-04-29 00:00:00
(2 rows)
Postgres does not compare the old and new column values for you; if you are looking to prevent unnecessary write activity, you will need to surgically craft your WHERE clauses.
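For example, one way to skip the no-op writes in the question's upsert is to add a WHERE clause to the DO UPDATE action, so the update only fires when the value actually differs (a sketch using the question's tables):
INSERT INTO inactive_users(user_id, last_seen)
SELECT id AS user_id, last_seen FROM users
WHERE users.last_seen < '2019-05-04'
ON CONFLICT (user_id) DO UPDATE
SET last_seen = excluded.last_seen
-- rows whose last_seen is unchanged are skipped, so no new row version is written for them
WHERE inactive_users.last_seen IS DISTINCT FROM excluded.last_seen;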
Disclosure: I work for EnterpriseDB (EDB)

Related

Replacement for materialized view on PostgreSQL

I have a table with three columns: creationTime, number, id. It is populated every 15 seconds or so. I have been using a materialized view to track duplicates, like so:
SELECT number, id, count(*) AS dups_count
FROM my_table
GROUP BY number, id HAVING count(*) > 1;
The table contains thousands of records covering the last 1.5 years. Refreshing this materialized view currently takes about 2 minutes. I would like a better solution. PostgreSQL has no fast-refresh (incremental) materialized views.
At first, I thought a trigger on the table that refreshes the materialized view could be a solution, but with records coming in every 15 seconds and a refresh taking over 2 minutes, that would not work. In any case, I don't like the idea of recalculating the same data over and over again.
Is there a better solution to it?
A trigger that increments the duplicate count might be a solution:
create table duplicates
(
number int,
id int,
dups_count int,
primary key (number, id)
);
The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.
Then create a trigger that updates that table each time a row is inserted into the base table:
create function increment_dupes()
returns trigger
as
$$
begin
insert into duplicates (number, id, dups_count)
values (new.number, new.id, 1)
on conflict (number,id)
do update
set dups_count = duplicates.dups_count + 1;
return new;
end
$$
language plpgsql;
create trigger update_dups_count
after insert on my_table
for each row
execute function increment_dupes();
Each time you insert into my_table, either a new row will be created in duplicates or the existing dups_count will be incremented. If you delete or update rows in my_table you will also need triggers for that; note, however, that keeping the count correct for UPDATEs and DELETEs is not entirely safe under concurrent operations, whereas the INSERT ... ON CONFLICT path is.
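For completeness, a minimal sketch of a companion DELETE trigger (the function and trigger names are illustrative, and as noted above the decrement is not concurrency-safe):
create function decrement_dupes()
returns trigger
as
$$
begin
-- decrement the counter for the (number, id) pair of the deleted row
update duplicates
set dups_count = dups_count - 1
where number = old.number
and id = old.id;
return old;
end
$$
language plpgsql;
create trigger decrement_dups_count
after delete on my_table
for each row
execute function decrement_dupes();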
A trigger does have some performance overhead, so you will have to test if the overhead is too big for your requirements.
Whenever there is scope for growth, the best way to scale is to find a way to repeat the process on incremental data only.
To explain this, call the table from the question 'tab':
tab (number, id, creationtime), with an index on the creationtime column.
The key to the incremental method is having a monotonically increasing value; here 'creationtime' serves that purpose.
(a) Create another table, tab_duplicate, with an additional column 'last_compute_timestamp':
tab_duplicate (number, id, duplicate_count, last_compute_timestamp)
(b) Create an index on the column 'last_compute_timestamp'.
(c) Run an insert that finds the duplicate records and writes them into tab_duplicate along with the last_compute_timestamp.
(d) For repeated execution, either:
1. Install the pg_cron extension (if it is not already there) and schedule the insert:
https://github.com/citusdata/pg_cron
https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/
or
2. Use a shell or Python script executed against the database from the OS crontab.
Because last_compute_timestamp is recorded on every iteration and reused on the next, the process stays incremental and is always fast.
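As a rough sketch of option 1, the capture SQL from Step 4 of the demonstration below could be wrapped in a function and scheduled with pg_cron (the function name and the 5-minute interval are only illustrative):
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- wrap the incremental capture (Step 4 below) so the job definition stays short
CREATE OR REPLACE FUNCTION capture_duplicates() RETURNS void
LANGUAGE sql AS
$$
INSERT INTO tab_duplicate
SELECT a.id, a.number, a.duplicate_count, b.last_compute_timestamp
FROM (SELECT id, number, count(*) AS duplicate_count
      FROM tab,
           (SELECT max(last_compute_timestamp) AS lct FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct
      GROUP BY id, number) a,
     (SELECT max(creationtime) AS last_compute_timestamp
      FROM tab,
           (SELECT max(last_compute_timestamp) AS lct FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct) b;
$$;
-- run the capture every 5 minutes
SELECT cron.schedule('capture_duplicates', '*/5 * * * *', 'SELECT capture_duplicates()');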
DEMONSTRATION:
Step 1: Production table
create table tab
(
id int,
number int,
creationtime timestamp
);
create index tab_id on tab(creationtime);
Step 2: Duplicate capture table, with a one-time priming record (this can be removed after the first execution)
create table tab_duplicate
(
id int,
number int,
duplicate_count int,
last_compute_timestamp timestamp);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);
Step 3: Some duplicate entry into the production table
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(1,10,current_timestamp);
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(3,30,current_timestamp);
insert into tab values(3,30,current_timestamp);
select pg_sleep(1);
insert into tab values(4,40,current_timestamp);
Verify records:
postgres=# select * from tab;
id | number | creationtime
----+--------+----------------------------
1 | 10 | 2022-01-23 19:00:37.238865
1 | 10 | 2022-01-23 19:00:38.248574
1 | 10 | 2022-01-23 19:00:38.252622
2 | 20 | 2022-01-23 19:00:39.259584
2 | 20 | 2022-01-23 19:00:40.26655
3 | 30 | 2022-01-23 19:00:41.274673
3 | 30 | 2022-01-23 19:00:41.279298
4 | 40 | 2022-01-23 19:00:52.697257
(8 rows)
Step 4: Duplicates captured and verified.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 4
postgres=#
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
(5 rows)
Step 5: Some more duplicates into the production table
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
Step 6: The same duplicate-capture SQL, executed again, captures only the incremental records in the production table.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 2
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
5 | 50 | 3 | 2022-01-23 19:02:37.884417
6 | 60 | 2 | 2022-01-23 19:02:37.884417
(7 rows)
This duplicate capture will always be fast for two reasons:
It works only on the incremental data from whatever interval you schedule it at.
Scanning the table for the maximum timestamp uses a single-column index (an index-only scan).
From execution plan:
-> Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2 (cost=0.15..77.76 rows=1692 width=8)
CAVEAT: If duplicates accumulate in tab_duplicate over a longer period of time, you can dedupe tab_duplicate periodically, say at the end of the day. That will still be fast, because tab_duplicate is a small aggregated table that is offline to your application, whereas tab is your production table with a huge amount of accumulated data.
Also, a trigger on the production table is a viable solution, but it adds overhead to production transactions because the trigger fires on every insert.
Two approaches come to mind:
Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up too much space unless you have a ton of these duplicates.
Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a partial unique index on my_table over the (number, id) columns where the new column is null. Then you can change your insert to include an ON CONFLICT clause, so that when a duplicate is being inserted you set the timestamp to now. When you want to search for duplicates, you can then just query using the new column.
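A rough sketch of that second approach (the column and index names are illustrative; in this reading, a conflicting insert stamps the already-present row rather than adding a second copy):
ALTER TABLE my_table ADD COLUMN duplicated_at timestamp;
-- only rows not yet marked as duplicated take part in the uniqueness check
CREATE UNIQUE INDEX my_table_number_id_uniq
ON my_table (number, id)
WHERE duplicated_at IS NULL;
-- on a duplicate, mark the existing row instead of inserting another copy
INSERT INTO my_table (creationtime, number, id)
VALUES (now(), 10, 1)
ON CONFLICT (number, id) WHERE duplicated_at IS NULL
DO UPDATE SET duplicated_at = now();
-- duplicates can then be found directly
SELECT number, id, duplicated_at
FROM my_table
WHERE duplicated_at IS NOT NULL;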

How do I list all identity columns in a table

What query should I use to list all GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY columns in given table in PostgreSQL database?
I would also like to see whether the column is GENERATED ALWAYS or GENERATED BY DEFAULT.
You can get the list of all identity (and generated) columns by looking at the pg_attribute catalog, under the attidentity and attgenerated columns:
postgres=# create table abc (
id int GENERATED ALWAYS AS IDENTITY,
height_cm numeric,
height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED);
postgres=# select attname, attidentity, attgenerated
from pg_attribute
where attnum > 0
and attrelid = (select oid from pg_class where relname = 'abc');
attname | attidentity | attgenerated
-----------+-------------+--------------
id | a |
height_cm | |
height_in | | s
(3 rows)
Identity columns are identified by attidentity: 'a' means GENERATED ALWAYS and 'd' means GENERATED BY DEFAULT (attgenerated = 's' marks a stored generated column such as height_in). More information is in the PostgreSQL documentation for pg_attribute.
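If you prefer a view that spells out the ALWAYS / BY DEFAULT distinction directly, a sketch using information_schema against the example table above:
select column_name, is_identity, identity_generation
from information_schema.columns
where table_name = 'abc'
and is_identity = 'YES';
-- identity_generation reads 'ALWAYS' or 'BY DEFAULT'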

How can I automatically fix my Postgres sequence?

I want to update a sequence in Postgres, which I can do semi-manually, like so:
SELECT MAX(id) as highest_id FROM users;
ALTER SEQUENCE users_id_seq RESTART WITH 11071;
In this case I have to take the result of the first query, which turns out to be 11070, and insert it into the next query, incremented by 1. I'd rather have a single query that does all of this in one fell swoop.
The "two fell swoops" approach would be like so, if it worked, but this fails:
ALTER SEQUENCE users_id_seq RESTART WITH (SELECT MAX(id) as highest_id FROM users);
ALTER SEQUENCE users_id_seq INCREMENT BY 1;
Even better would be if I could use + 1 in the first ALTER SEQUENCE statement and skip the second one.
Is there any way to fix this so it works? (Either as two steps or one, but without manual intervention by me.)
You can easily do this with:
SELECT setval('users_id_seq',(SELECT max(id) FROM users));
This sets the sequence to the current value, so that when you call nextval(), you'll get the next one:
edb=# create table foo (id serial primary key, name text);
CREATE TABLE
edb=# insert into foo values (generate_series(1,10000),'johndoe');
INSERT 0 10000
edb=# select * from foo_id_seq ;
last_value | log_cnt | is_called
------------+---------+-----------
1 | 0 | f
(1 row)
edb=# select setval('foo_id_seq',(SELECT max(id) FROM foo));
setval
--------
10000
(1 row)
edb=# select * from foo_id_seq ;
last_value | log_cnt | is_called
------------+---------+-----------
10000 | 0 | t
(1 row)
edb=# insert into foo values (default,'bob');
INSERT 0 1
edb=# select * from foo order by id desc limit 1;
id | name
-------+------
10001 | bob
(1 row)
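If the table might be empty, max(id) returns NULL; a variant that handles that case uses coalesce and the three-argument form of setval (the third argument false means the next nextval() returns the given value itself):
SELECT setval('users_id_seq',
              COALESCE((SELECT max(id) FROM users), 0) + 1,
              false);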

PostgreSQL add SERIAL column to existing table with values based on ORDER BY

I have a large table (6+ million rows) to which I'd like to add an auto-incrementing integer column sid, where sid is assigned to existing rows based on ORDER BY inserted_at ASC. In other words, the oldest record based on inserted_at would get 1 and the latest record would get the total record count. Any tips on how I might approach this?
Add a sid column and UPDATE SET ... FROM ... WHERE:
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) as rownum
FROM test) t
WHERE test.id = t.id
Note that this relies on there being a primary key, id.
(If your table did not already have a primary key, you would have to make one first.)
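A minimal way to add one, if needed (a sketch; the identity-column form assumes PostgreSQL 10+):
-- assign an arbitrary surrogate key just so rows can be matched in the UPDATE ... FROM
ALTER TABLE test ADD COLUMN id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY;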
For example,
-- create test table
DROP TABLE IF EXISTS test;
CREATE TABLE test (
id int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
, foo text
, inserted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO test (foo, inserted_at) VALUES
('XYZ', '2019-02-14 00:00:00-00')
, ('DEF', '2010-02-14 00:00:00-00')
, ('ABC', '2000-02-14 00:00:00-00');
-- +----+-----+------------------------+
-- | id | foo | inserted_at |
-- +----+-----+------------------------+
-- | 1 | XYZ | 2019-02-13 19:00:00-05 |
-- | 2 | DEF | 2010-02-13 19:00:00-05 |
-- | 3 | ABC | 2000-02-13 19:00:00-05 |
-- +----+-----+------------------------+
ALTER TABLE test ADD COLUMN sid INT;
UPDATE test
SET sid = t.rownum
FROM (SELECT id, row_number() OVER (ORDER BY inserted_at ASC) as rownum
FROM test) t
WHERE test.id = t.id
yields
+----+-----+------------------------+-----+
| id | foo | inserted_at | sid |
+----+-----+------------------------+-----+
| 3 | ABC | 2000-02-13 19:00:00-05 | 1 |
| 2 | DEF | 2010-02-13 19:00:00-05 | 2 |
| 1 | XYZ | 2019-02-13 19:00:00-05 | 3 |
+----+-----+------------------------+-----+
Finally, make sid SERIAL (or, better, an IDENTITY column):
ALTER TABLE test ALTER COLUMN sid SET NOT NULL;
-- IDENTITY avoids certain issues that can arise with SERIAL
ALTER TABLE test ALTER COLUMN sid ADD GENERATED BY DEFAULT AS IDENTITY;
-- ALTER TABLE test ALTER COLUMN sid SERIAL;
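One follow-up worth noting: the identity sequence created by ADD GENERATED starts at 1 by default, which would collide with the sid values just assigned, so bump it past the current maximum (a sketch; pg_get_serial_sequence also works for identity columns):
SELECT setval(pg_get_serial_sequence('test', 'sid'),
              (SELECT max(sid) FROM test));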

How do we get all columns which are the part of sortkey in Redshift

I need to get all columns that are part of the sort key in Redshift.
I tried getting the information using "select * from svv_table_info", but it only shows information about one column. Can you let me know how to get all columns that are part of the sort key for a table?
Thanks,
Sanjeev
Thanks all for your help. I had already tried the "pg_table_def" table to get sortkey and distkey information, but I could only see the pg_catalog and public schemas. I went through the Amazon developer guide and found that we need to add the schema to the search path using the commands below:
show search_path;
set search_path to '$user', 'public', 'NewSchema';
After adding the "NewSchema" in search path I can see sortkey and distkey information for this schema in pg_table_def
Thanks,
Sanjeev
Sanjeev,
A table called pg_table_def has information about the columns.
In the example below, I created a simple table with four columns and used two of these columns as my sort key.
As you can see in my query results, the sortkey field shows a number other than 0 if the column is part of a sort key.
dev=# drop table tb1;
DROP TABLE
dev=# create table tb1 (col1 integer, col2 integer, col3 integer, col4 integer) distkey(col1) sortkey(col2, col4);
CREATE TABLE
dev=# select * from pg_table_def where tablename = 'tb1';
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+-----------+--------+---------+----------+---------+---------+---------
public | tb1 | col1 | integer | none | t | 0 | f
public | tb1 | col2 | integer | none | f | 1 | f
public | tb1 | col3 | integer | none | f | 0 | f
public | tb1 | col4 | integer | none | f | 2 | f
(4 rows)
What about:
select "column", type, encoding, distkey, sortkey, "notnull"
from pg_table_def
where tablename = 'YOURTABLE'
and sortkey <> 0;