I have a table with three columns: creationTime, number, id. It gets a new row every 15 seconds or so. I have been using a materialized view to track duplicates, like so:
SELECT number, id, count(*) AS dups_count
FROM my_table
GROUP BY number, id HAVING count(*) > 1;
The table contains thousands of records covering the last 1.5 years. Refreshing this materialized view now takes about 2 minutes. I would like a better solution, and PostgreSQL has no fast-refresh (incremental) materialized views.
At first I thought a trigger on the table that refreshes the materialized view could be a solution, but with records coming in every 15 seconds and the view taking over 2 minutes to recalculate, that is not a good idea. In any case, I don't like the idea of recalculating the same data over and over again.
Is there a better solution to it?
A trigger that increments the duplicate count might be a solution:
create table duplicates
(
number int,
id int,
dups_count int,
primary key (number, id)
);
The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.
Then create a trigger that updates that table each time a row is inserted into the base table:
create function increment_dupes()
returns trigger
as
$$
begin
insert into duplicates (number, id, dups_count)
values (new.number, new.id, 1)
on conflict (number,id)
do update
set dups_count = duplicates.dups_count + 1;
return new;
end
$$
language plpgsql;
create trigger update_dups_count
after insert on my_table
for each row
execute function increment_dupes();
Each time you insert into my_table, either a new row is created in duplicates or the existing dups_count is incremented. If you delete or update rows in my_table, you will also need triggers for that; note, however, that maintaining the count for UPDATEs or DELETEs is not entirely safe under concurrent operations, whereas the INSERT ... ON CONFLICT path is.
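If you need DELETE handling as well, a matching trigger could look like the sketch below (keeping in mind, as noted above, that decrementing the count is not as safe under concurrency as the INSERT path):
create function decrement_dupes()
returns trigger
as
$$
begin
    -- Decrement the counter for the deleted row's key.
    update duplicates
       set dups_count = dups_count - 1
     where number = old.number
       and id = old.id;
    return old;
end
$$
language plpgsql;

create trigger update_dups_count_on_delete
after delete on my_table
for each row
execute function decrement_dupes();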
A trigger does have some performance overhead, so you will have to test if the overhead is too big for your requirements.
Whenever there is scope for growth, the best way to scale is to find a way to repeat the process on incremental data only.
To explain this, call the table mentioned in the question Tab:
Tab
number | id | creationtime
Index on the creationtime column.
The key to applying the incremental method is a monotonically increasing value; here, creationtime serves that purpose.
(a) Create another table, Tab_duplicate, with an additional column last_compute_timestamp. Say:
Tab_duplicate
number | id | duplicate_count | last_compute_timestamp
(b) Create an index on column 'last_compute_timestamp'.
(c) Run an INSERT that finds the duplicate records and writes them into Tab_duplicate along with the last_compute_timestamp.
(d) For repeated execution, either:
1. Install the pg_cron extension (if it is not already there) and use it to automate execution of the INSERT (a sketch follows below):
https://github.com/citusdata/pg_cron
https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/
or
2. Use a shell script or Python script to execute it on the DB through the OS crontab.
Because last_compute_timestamp is recorded in every iteration and reused in the next, the work is always incremental and stays fast.
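As a rough illustration of option 1, the incremental capture INSERT shown in the demonstration below could be scheduled with pg_cron; the job name 'capture_duplicates' and the 5-minute schedule are placeholders only:
SELECT cron.schedule(
  'capture_duplicates',   -- placeholder job name
  '*/5 * * * *',          -- placeholder schedule: every 5 minutes
$$
INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM (SELECT id, number, count(*) AS duplicate_count
      FROM tab,
           (SELECT max(last_compute_timestamp) AS lct FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct
      GROUP BY id, number) a,
     (SELECT max(creationtime) AS last_compute_timestamp
      FROM tab,
           (SELECT max(last_compute_timestamp) AS lct FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct) b
$$);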
DEMONSTRATION:
Step 1: Production table
create table tab
(
id int,
number int,
creationtime timestamp
);
create index tab_id on tab(creationtime);
Step 2: Duplicate capture table, with a one-time priming record (this can be removed after the first execution)
create table tab_duplicate
(
id int,
number int,
duplicate_count int,
last_compute_timestamp timestamp);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);
Step 3: Some duplicate entry into the production table
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(1,10,current_timestamp);
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(3,30,current_timestamp);
insert into tab values(3,30,current_timestamp);
select pg_sleep(1);
insert into tab values(4,40,current_timestamp);
Verify records:
postgres=# select * from tab;
id | number | creationtime
----+--------+----------------------------
1 | 10 | 2022-01-23 19:00:37.238865
1 | 10 | 2022-01-23 19:00:38.248574
1 | 10 | 2022-01-23 19:00:38.252622
2 | 20 | 2022-01-23 19:00:39.259584
2 | 20 | 2022-01-23 19:00:40.26655
3 | 30 | 2022-01-23 19:00:41.274673
3 | 30 | 2022-01-23 19:00:41.279298
4 | 40 | 2022-01-23 19:00:52.697257
(8 rows)
Step 4: Duplicates captured and verified.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 4
postgres=#
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
(5 rows)
Step 5: Some more duplicates into the production table
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
Step 6: Running the same duplicate-capture SQL captures only the incremental records in the production table.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 2
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
5 | 50 | 3 | 2022-01-23 19:02:37.884417
6 | 60 | 2 | 2022-01-23 19:02:37.884417
(7 rows)
This duplicate capture will always be fast, for two reasons:
It works only on the incremental data from the last scheduled interval.
Scanning the table for the maximum timestamp uses a single-column index (an index-only scan).
From execution plan:
-> Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2 (cost=0.15..77.76 rows=1692 width=8)
CAVEAT: If rows for the same (id, number) accumulate in tab_duplicate over a longer period of time, you can dedupe tab_duplicate periodically, say at the end of the day. That will be fast anyway, because tab_duplicate is a small aggregated table and is offline to your application, whereas tab is your production table with the huge accumulated data.
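A sketch of such a periodic dedupe, assuming occurrence counts for the same (id, number) can simply be summed across capture runs (table and column names as in the demonstration above):
BEGIN;
CREATE TEMP TABLE tab_duplicate_rollup AS
SELECT id,
       number,
       sum(duplicate_count)::int   AS duplicate_count,
       max(last_compute_timestamp) AS last_compute_timestamp
FROM tab_duplicate
GROUP BY id, number;
TRUNCATE tab_duplicate;
INSERT INTO tab_duplicate SELECT * FROM tab_duplicate_rollup;
DROP TABLE tab_duplicate_rollup;
COMMIT;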
Also, a trigger on the production table is a viable solution, but it adds overhead to production transactions, since the trigger executes for every insert.
Two approaches come to mind:
Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up too much space unless you have a ton of these duplicates.
Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a unique constraint to my_table over the (number, id) columns where the new column is null. Then, you can change your insert to include an ON CONFLICT clause, so that if a duplicate is being inserted you set its timestamp to now. When you want to search for duplicates, you can then just query using the new column.
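A rough sketch of one reading of that second idea, where the conflict stamps the row that is already present; duplicate_at is a hypothetical column name, and PostgreSQL expresses this kind of conditional uniqueness as a partial unique index rather than a constraint:
ALTER TABLE my_table ADD COLUMN duplicate_at timestamptz;  -- hypothetical column

CREATE UNIQUE INDEX my_table_number_id_uniq
    ON my_table (number, id)
    WHERE duplicate_at IS NULL;

-- On a duplicate, stamp the row that is already there instead of adding another one.
INSERT INTO my_table (creationtime, number, id)
VALUES (now(), 123, 456)                      -- illustrative values
ON CONFLICT (number, id) WHERE duplicate_at IS NULL
DO UPDATE SET duplicate_at = now();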
I got this query,
SELECT s.pos
FROM (SELECT t.guild_id, t.user_id,
             ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
FROM users t) s
WHERE (s.guild_id, s.user_id) = ($2, $3)
that gets a user's "rank" in a guild, but I want to filter the results by entries that are in an array of t.user_id values (like {'1', '64', '83'}) and have this affect the resulting pos value. I found FILTER and WITHIN GROUP, but I'm not sure how to fit one of those into this query. How would I do that?
Here's the full table if that helps at all:
Table "public.users"
Column | Type | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
guild_id | character varying(20) | | not null |
user_id | character varying(20) | | not null |
reputation | real | | not null | 0
Indexes:
"users_pkey" PRIMARY KEY, btree (guild_id, user_id)
Why not select on those first?
WITH UsersWeCareAbout AS (
SELECT * FROM users u WHERE u.user_id = ANY(subgroup_array)
), RepUsers AS (
SELECT t.guild_id, t.user_id, ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
FROM UsersWeCareAbout t
) SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3)
(untested if only because I didn't really have enough context to test with)
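If it helps, the same query with the user-id array supplied as a bind parameter (assuming it is passed as $1 alongside the existing $2 and $3, and that user_id is varchar as in the table definition above):
WITH UsersWeCareAbout AS (
  SELECT * FROM users u WHERE u.user_id = ANY($1::varchar[])
), RepUsers AS (
  SELECT t.guild_id, t.user_id, ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
  FROM UsersWeCareAbout t
)
SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3)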
I have a BEFORE trigger which should fill the record's root ID, which, of course, points to the rootmost entry. I.e.:
id | parent_id | root_id
-------------------------
a | null | a
a.1 | a | a
a.1.1 | a.1 | a
b | null | b
If an entry's parent_id is null, root_id points to the record itself.
The question is: inside a BEFORE INSERT trigger, if parent_id is null, can I or should I fetch the next sequence value myself and fill both id and root_id, in order to avoid having to fill root_id in an AFTER trigger?
According to your own definition:
if entry's parent_id is null, it would point to record itself
then you have to do:
if new.parent_id is null then
new.root_id = new.id ;
else
WITH RECURSIVE p (parent_id, level) AS
(
-- Base case: start from the new row's parent, since in a BEFORE INSERT
-- trigger the new row itself is not in the table yet
SELECT
new.parent_id, 0 as level
UNION ALL
SELECT
t.parent_id, level + 1
FROM
t JOIN p ON t.id = p.parent_id
WHERE
t.parent_id IS NOT NULL
)
SELECT
parent_id
INTO
new.root_id
FROM
p
ORDER BY
level DESC
LIMIT
1 ;
end if ;
RETURN new ;
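For completeness, a compact variant of the whole trigger, under the assumption that the table is named t, that it has a root_id column, and that id is drawn from a hypothetical sequence t_id_seq (covering the "fetch the next sequence value" part of the question). Since every ancestor row already has root_id filled by this same trigger, the parent's root_id can simply be copied instead of walking the chain:
create function fill_root_id()
returns trigger
as
$$
begin
    -- Assumed sequence name; only needed if id is not already set by a column default.
    if new.id is null then
        new.id := nextval('t_id_seq');
    end if;

    if new.parent_id is null then
        new.root_id := new.id;
    else
        -- The parent row already has its root_id filled by this trigger,
        -- so it can be copied directly.
        select root_id into new.root_id from t where id = new.parent_id;
    end if;

    return new;
end
$$
language plpgsql;

create trigger t_fill_root_id
before insert on t
for each row
execute function fill_root_id();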
I want to change 20 rows with one column into 1 row with 20 columns, to insert it later into a second database:
Name
----------
- Frank
- Dora
- ...
- Michael
to
Name1 | Name2 | ... | Name20
Frank | Dora | ... | Michael
I tried
SELECT *
FROM (SELECT TOP 20 firstname AS NAME
FROM database) AS d
PIVOT (Min(NAME)
FOR NAME IN (name1,
name2,
name3,
name4,
name5,
name6,
name7,
name8,
name9,
name10,
name11,
name12,
name13,
name14,
name15,
name16,
name18,
name19,
name20) ) AS f
But all names are NULL.
You were close... But your inner select must carry the new column name. Try it like this:
DECLARE @tbl TABLE(Name VARCHAR(100));
INSERT INTO @tbl VALUES('Frank'),('Dora'),('Michael');
SELECT p.*
FROM
(
SELECT 'Name' + CAST(ROW_NUMBER() OVER(ORDER BY Name) AS VARCHAR(150)) AS ColumnName
,Name
FROM @tbl
) AS tbl
PIVOT
(
MIN(Name) FOR ColumnName IN(Name1,Name2,Name3)
) AS p
When doing an UPDATE query, we got the following error message:
ERROR: duplicate key value violates unique constraint "tableA_pkey"
DETAIL: Key (id)=(47470) already exists.
However, our UPDATE query does not affect the primary key. Here is a simplified version:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = a.id
GROUP BY b.name
),
','
)
)
WHERE
a.end_at BETWEEN now() - interval '1 day' AND now();
We ensured the primary key sequence was already synced:
\d tableA_id_seq
Which produces:
Column | Type | Value
---------------+---------+--------------------------
sequence_name | name | tableA_id_seq
last_value | bigint | 50364
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 0
is_cycled | boolean | f
is_called | boolean | t
Looking for the maximum id in the table:
select max(id) from tableA;
We got a lower value:
max
-------
50363
(1 row)
Do you have any idea why this behaves this way? If we exclude the problematic id, it works.
Another strange point is that replacing the previous UPDATE by:
UPDATE tableA AS a
SET
items = (
SELECT array_to_string(
array(
SELECT b.value
FROM tableB b
WHERE b.a_id = a.id
GROUP BY b.name
),
','
)
)
WHERE a.id = 47470;
It works well. Are we missing something?
EDIT: triggers
I have no user-defined triggers on this table:
SELECT t.tgname, c.relname
FROM pg_trigger t
JOIN pg_class c ON t.tgrelid = c.oid
WHERE
c.relname = 'tableA'
AND
t.tgisinternal = false
;
Which returns no row.
Note: I am using PostgreSQL 9.3.4.
Not really sure what the cause was. However, deleting the two (non-vital) records corresponding to the already existing ids solved the issue.
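For reference, a sketch of how such pre-existing duplicate ids could be located (forcing a sequential scan so that a possibly inconsistent index is not used to answer the query); corruption of the unique index is only a guess at the root cause, not something confirmed above:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SET enable_indexonlyscan = off;

SELECT id, count(*)
FROM tableA
GROUP BY id
HAVING count(*) > 1;

RESET enable_indexscan;
RESET enable_bitmapscan;
RESET enable_indexonlyscan;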