pgbench: selecting UUIDs for custom scripts - PostgreSQL

I am trying to benchmark a custom dataset with pgbench.
All the records I want to select have UUIDs as primary keys.
Unfortunately, all the sample snippets select random records with the random() function, presumably assuming sequential integer PKs.
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
I wonder if there is a way to select random UUIDs from the tables into a variable, in a way that is not counted towards the measured latency.

You can use gen_random_uuid():
SELECT gen_random_uuid();
Since version 13 it is integrated into core PostgreSQL; for older versions you have to install the pgcrypto extension first:
CREATE EXTENSION pgcrypto;

You can just order by random(). Suppose you want to pick a sample of 5 rows from a table; let's call it ruuids. Then just:
select *
from ruuids
order by random()
limit 5;
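If you want to drive that from a pgbench custom script, one possibility is to fetch one random id into a script variable with \gset and reuse it. This is only a sketch: it assumes PostgreSQL 11 or later (so that pgbench understands \gset) and a table ruuids with a uuid primary key column id. Note that ORDER BY random() scans the whole table, and the lookup still runs inside the script, so its cost is included in the measured latency.
-- custom pgbench script (sketch)
SELECT id AS target_id
FROM ruuids
ORDER BY random()
LIMIT 1 \gset
BEGIN;
SELECT * FROM ruuids WHERE id = :target_id;
END;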

Related

Replacement for materialized view on PostgreSQL

I have a table with three columns: creationTime, number, id. It gets populated every 15 seconds or so. I have been using a materialized view to track duplicates, like so:
SELECT number, id, count(*) AS dups_count
FROM my_table
GROUP BY number, id HAVING count(*) > 1;
The table contains thousands of records spanning the last 1.5 years. Refreshing this materialized view currently takes about 2 minutes. I would like a better solution. There are no fast-refresh materialized views in PostgreSQL.
At first I thought a trigger on the table that refreshes the materialized view could be a solution, but with records coming in every 15 seconds and a refresh taking over 2 minutes, that would not be a good idea. In any case, I don't like the idea of recalculating the same data over and over again.
Is there a better solution to it?
Is there a better solution to it?
A trigger that increments a duplicate count might be a solution:
create table duplicates
(
number int,
id int,
dups_count int,
primary key (number, id)
);
The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.
Then create a trigger that updates that table each time a row is inserted into the base table:
create function increment_dupes()
returns trigger
as
$$
begin
    insert into duplicates (number, id, dups_count)
    values (new.number, new.id, 1)
    on conflict (number, id)
    do update
       set dups_count = duplicates.dups_count + 1;
    return new;
end
$$
language plpgsql;
create trigger update_dups_count
after insert on my_table
for each row
execute function increment_dupes();
Each time you insert into my_table, either a new row will be created in duplicates, or the existing dups_count will be incremented. If you delete or update rows in my_table, you will also need triggers for that. However, updating the count for UPDATEs or DELETEs is not entirely safe under concurrent operations, whereas the INSERT ... ON CONFLICT is.
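For illustration, a DELETE trigger might look like the sketch below; it simply decrements the counter and shares the concurrency caveat just mentioned (names follow the tables above).
create function decrement_dupes()
returns trigger
as
$$
begin
    -- not safe under concurrent deletes/updates of the same (number, id)
    update duplicates
       set dups_count = dups_count - 1
     where number = old.number
       and id = old.id;
    return old;
end
$$
language plpgsql;
create trigger decrement_dups_count
after delete on my_table
for each row
execute function decrement_dupes();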
A trigger does have some performance overhead, so you will have to test if the overhead is too big for your requirements.
Whenever there is scope for growth, the best way to scale is to find a way to repeat the process on incremental data only.
To explain this, let's call the table from the question 'Tab':
Tab (number, id, creationtime)
with an index on the creationtime column.
The key to applying the incremental method is to have a monotonically increasing value; here 'creationtime' serves that purpose.
(a) Create another table Tab_duplicate with an additional column 'last_compute_timestamp', say:
Tab_duplicate (number, id, duplicate_count, last_compute_timestamp)
(b) Create an index on the column 'last_compute_timestamp'.
(c) Run an INSERT that finds the duplicate records and writes them into Tab_duplicate along with the last_compute_timestamp.
(d) For repeated execution, either:
1. Install the pg_cron extension (if it is not there already) and automate the execution of the insert with it (a sketch follows after this list):
https://github.com/citusdata/pg_cron
https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/
or
2. Use a shell or Python script to execute it against the database from the OS crontab.
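For option 1, a minimal pg_cron sketch could look like the following (the five-minute schedule is arbitrary, and the scheduled command is the incremental capture INSERT from step 4 of the demonstration below):
SELECT cron.schedule('*/5 * * * *', $$
INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM (SELECT id,
             number,
             Count(*) duplicate_count
      FROM tab,
           (SELECT Max(last_compute_timestamp) lct
            FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct
      GROUP BY id,
               number) a,
     (SELECT Max(creationtime) last_compute_timestamp
      FROM tab,
           (SELECT Max(last_compute_timestamp) lct
            FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct) b
$$);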
Because last_compute_timestamp is recorded in every iteration and reused in the next one, the process is incremental and always fast.
DEMONSTRATION:
Step 1: Production table
create table tab
(
id int,
number int,
creationtime timestamp
);
create index tab_id on tab(creationtime);
Step 2: Duplicate capture table, with a one-time priming record (this can be removed after the first execution)
create table tab_duplicate
(
id int,
number int,
duplicate_count int,
last_compute_timestamp timestamp);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);
Step 3: Insert some duplicate entries into the production table
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(1,10,current_timestamp);
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(3,30,current_timestamp);
insert into tab values(3,30,current_timestamp);
select pg_sleep(1);
insert into tab values(4,40,current_timestamp);
Verify records:
postgres=# select * from tab;
id | number | creationtime
----+--------+----------------------------
1 | 10 | 2022-01-23 19:00:37.238865
1 | 10 | 2022-01-23 19:00:38.248574
1 | 10 | 2022-01-23 19:00:38.252622
2 | 20 | 2022-01-23 19:00:39.259584
2 | 20 | 2022-01-23 19:00:40.26655
3 | 30 | 2022-01-23 19:00:41.274673
3 | 30 | 2022-01-23 19:00:41.279298
4 | 40 | 2022-01-23 19:00:52.697257
(8 rows)
Step 4: Duplicates captured and verified.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 4
postgres=#
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
(5 rows)
Step 5: Insert some more duplicates into the production table
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
Step 6: The same duplicate capture SQL, executed again, will capture ONLY the incremental records in the production table.
INSERT INTO tab_duplicate
SELECT a.id,
a.number,
a.duplicate_count,
b.last_compute_timestamp
FROM (SELECT id,
number,
Count(*) duplicate_count
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct
GROUP BY id,
number) a,
(SELECT Max(creationtime) last_compute_timestamp
FROM tab,
(SELECT Max(last_compute_timestamp) lct
FROM tab_duplicate) max_date
WHERE creationtime > max_date.lct) b;
Execute:
postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-# a.number,
postgres-# a.duplicate_count,
postgres-# b.last_compute_timestamp
postgres-# FROM (SELECT id,
postgres(# number,
postgres(# Count(*) duplicate_count
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct
postgres(# GROUP BY id,
postgres(# number) a,
postgres-# (SELECT Max(creationtime) last_compute_timestamp
postgres(# FROM tab,
postgres(# (SELECT Max(last_compute_timestamp) lct
postgres(# FROM tab_duplicate) max_date
postgres(# WHERE creationtime > max_date.lct) b;
INSERT 0 2
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
5 | 50 | 3 | 2022-01-23 19:02:37.884417
6 | 60 | 2 | 2022-01-23 19:02:37.884417
(7 rows)
This duplicate capture will always be fast, for two reasons:
It works only on the incremental data from whatever interval you schedule it at.
Finding the maximum timestamp is done via a single-column index (an index-only scan).
From the execution plan:
-> Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2 (cost=0.15..77.76 rows=1692 width=8)
CAVEAT: If duplicates accumulate in tab_duplicate over a longer period of time, you can dedupe tab_duplicate at some periodic interval, say at the end of the day. That will be fast anyway, because tab_duplicate is a small aggregated table that is offline to your application, whereas tab is the production table with the large accumulated data.
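For example, such a periodic dedupe of tab_duplicate can be done in a single statement (a sketch; run it while the capture job is not active, so the two do not interleave):
WITH old_rows AS (
    DELETE FROM tab_duplicate
    RETURNING id, number, duplicate_count, last_compute_timestamp
)
INSERT INTO tab_duplicate (id, number, duplicate_count, last_compute_timestamp)
SELECT id,
       number,
       SUM(duplicate_count),
       MAX(last_compute_timestamp)
FROM old_rows
GROUP BY id, number;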
Also, a trigger on the production table is a viable solution, but it adds overhead to transactions on the production table, since the trigger executes at a cost on every insert.
Two approaches come to mind:
Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up too much space unless you have a ton of these duplicates.
Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a partial unique index on my_table over the (number, id) columns where the new column is null. Then you can change your insert to include an ON CONFLICT clause, so that if a duplicate is being inserted, you set its timestamp to now. When you want to search for duplicates, you can then just query on the new column. A sketch follows below.
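A sketch of the second approach (the column name duplicated_at and the index name are made up for illustration, and the sample values are arbitrary):
ALTER TABLE my_table ADD COLUMN duplicated_at timestamp;  -- NULL for the first occurrence
CREATE UNIQUE INDEX my_table_first_occurrence
    ON my_table (number, id)
    WHERE duplicated_at IS NULL;
-- an insert that collides with an existing row stamps that row instead
INSERT INTO my_table (creationtime, number, id)
VALUES (now(), 42, 7)
ON CONFLICT (number, id) WHERE duplicated_at IS NULL
DO UPDATE SET duplicated_at = now();
-- duplicates are then simply
SELECT number, id, duplicated_at FROM my_table WHERE duplicated_at IS NOT NULL;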

Question about PostgreSQL UPDATE with subquery in concurrent scenarios

It seems PostgreSQL has some issue in transactions when an UPDATE uses a subquery.
There is one record in the table whose lock_id column is null.
TESTCASE 1
1. Execute update table set lock_id = 1 where id in (select id from table where lock_id is null order by creation_date limit 100) but do not commit;
2. Execute update table set lock_id = 2 where id in (select id from table where lock_id is null order by creation_date limit 100) but do not commit;
3. Commit step 1;
4. Commit step 2;
5. Query the column lock_id; the result is 2.
TESTCASE 2
1. Execute update table set lock_id = 1 where lock_id is null but do not commit;
2. Execute update table set lock_id = 2 where lock_id is null but do not commit;
3. Commit step 1;
4. Commit step 2;
5. Query the column lock_id; the result is 1.
It seems that when there is a subquery in the UPDATE's condition, the second update overwrites the first one.
In my case I have a concurrent scenario: two machines each try to execute an update like TESTCASE 1 to lock 100 records and then query those records by lock_id for subsequent processing, but sometimes both machines end up handling the same record.
How can I avoid this behaviour?
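To make the race concrete, TESTCASE 1 corresponds to the two concurrent sessions sketched below (tbl, lock_id and creation_date stand in for the real names, since 'table' itself is a reserved word):
-- Session A
BEGIN;
UPDATE tbl SET lock_id = 1
WHERE id IN (SELECT id FROM tbl
             WHERE lock_id IS NULL
             ORDER BY creation_date LIMIT 100);
-- (not committed yet)

-- Session B: blocks, because session A holds row locks on the same rows
BEGIN;
UPDATE tbl SET lock_id = 2
WHERE id IN (SELECT id FROM tbl
             WHERE lock_id IS NULL
             ORDER BY creation_date LIMIT 100);

-- Session A: COMMIT;  -- session B's update now proceeds
-- Session B: COMMIT;  -- as observed above, the rows end up with lock_id = 2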

Sum and average total columns in PostgreSQL

I'm using this query to find duplicate dates, but I'm not sure how to sum the datapoints for each duplicate date, average them, and remove the duplicate rows.
DB Schema
date_time
datapoint_1
datapoint_2
SQL Query
SELECT date_time, COUNT(date_time)
FROM MYTABLE
GROUP BY date_time
HAVING COUNT(date_time) > 1
ORDER BY COUNT(date_time)
I would create a new table to replace the old one. That is easier and might even perform better:
CREATE TABLE mytable2 (LIKE mytable);
INSERT INTO mytable2 (date_time, datapoint_1, datapoint_2)
SELECT m.date_time, avg(m.datapoint_1), avg(m.datapoint_2)
FROM mytable AS m
GROUP BY m.date_time;
Then you can drop mytable and rename mytable2 to replace it.
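The swap itself can be done in a single transaction (a minimal sketch; make sure nothing is writing to mytable while it runs):
BEGIN;
DROP TABLE mytable;
ALTER TABLE mytable2 RENAME TO mytable;
COMMIT;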
To prevent new rows from creating duplicates, you could change the way you insert data:
-- to keep track of counts
ALTER TABLE mytable ADD numval integer DEFAULT 1;
-- to prevent duplicates
ALTER TABLE mytable ADD UNIQUE (date_time);
-- to insert new rows
INSERT INTO mytable (date_time, datapoint_1, datapoint_2)
VALUES ('2021-06-30', 42.0, -34.9)
ON CONFLICT (date_time)
DO UPDATE SET numval = mytable.numval + 1,
datapoint_1 = mytable.datapoint_1 + excluded.datapoint_1,
datapoint_2 = mytable.datapoint_2 + excluded.datapoint_2;
-- to select the averages
SELECT date_time,
datapoint_1 / numval AS datapoint_1,
datapoint_2 / numval AS datapoint_2
FROM mytable;
When you use GROUP BY, you can also use aggregate functions to reduce multiple rows to a single one (COUNT, which you already used, is one such function). In your case the query would be:
SELECT date_time, avg(datapoint_1), avg(datapoint_2)
FROM MYTABLE
GROUP BY date_time
For every distinct date_time you will get a single row with the average of datapoint_1 and datapoint_2.

ROWNUM in PostgreSQL

Is there any way to simulate ROWNUM in PostgreSQL?
PostgreSQL 8.4 or later:
SELECT
row_number() OVER (ORDER BY col1) AS i,
e.col1,
e.col2,
...
FROM ...
PostgreSQL has LIMIT.
Oracle's code:
select *
from
tbl
where rownum <= 1000;
The same in PostgreSQL:
select *
from
tbl
limit 1000
I have just tested, in Postgres 9.1, a solution that is close to Oracle's ROWNUM:
select row_number() over() as id, t.*
from information_schema.tables t;
If you just want a number to come back, try this.
create temp sequence temp_seq;
SELECT inline_v1.ROWNUM,inline_v1.c1
FROM
(
select nextval('temp_seq') as ROWNUM, c1
from sometable
)inline_v1;
You can add an ORDER BY to the inline_v1 SQL so your ROWNUM has some sequential meaning for your data.
select nextval('temp_seq') as ROWNUM, c1
from sometable
ORDER BY c1 desc;
Might not be the fastest, but it's an option if you really do need them.
If you have a unique key, you may use COUNT(*) OVER ( ORDER BY unique_key ) as ROWNUM
SELECT t.*, count(*) OVER (ORDER BY k ) ROWNUM
FROM yourtable t;
| k | n | rownum |
|---|-------|--------|
| a | TEST1 | 1 |
| b | TEST2 | 2 |
| c | TEST2 | 3 |
| d | TEST4 | 4 |
I think it's possible to mimic Oracle rownum using temporary sequences.
create or replace function rownum_seq() returns text as $$
select concat('seq_rownum_',replace(uuid_generate_v4()::text,'-','_'));
$$ language sql immutable;
create or replace function rownum(r record, v_seq_name text default rownum_seq()) returns bigint as $$
declare
begin
return nextval(v_seq_name);
exception when undefined_table then
execute concat('create temporary sequence ',v_seq_name,' minvalue 1 increment by 1');
return nextval(v_seq_name);
end;
$$ language plpgsql volatile;
Demo:
select ccy_code,rownum(a.*) from (select ccy_code from currency order by ccy_code desc) a where rownum(a.*)<10;
Gives:
ZWD 1
ZMK 2
ZBH 3
ZAR 4
YUN 5
YER 6
XXX 7
XPT 8
XPF 9
Explanations:
Function rownum_seq() is immutable and is therefore called only once by PG in a query, so we get the same unique sequence name (even if the function is referenced a thousand times in the same query).
Function rownum() is volatile and is called each time by PG (even in a WHERE clause).
Without the r record parameter (which is unused), the function rownum() could be evaluated too early. That's the tricky point. Imagine the following rownum() function:
create or replace function rownum(v_seq_name text default rownum_seq()) returns bigint as $$
declare
begin
return nextval(v_seq_name);
exception when undefined_table then
execute concat('create temporary sequence ',v_seq_name,' minvalue 1 increment by 1');
return nextval(v_seq_name);
end;
$$ language plpgsql volatile;
explain select ccy_code,rownum() from (select ccy_code from currency order by ccy_code desc) a where rownum()<10
Sort (cost=56.41..56.57 rows=65 width=4)
Sort Key: currency.ccy_code DESC
-> Seq Scan on currency (cost=0.00..54.45 rows=65 width=4)
Filter: (rownum('649aec1a-d512-4af0-87d8-23e8d8a9d982'::text) < 10)
PG applies the filter before the sort. Damn!
With the first (unused) parameter, we force PG to sort before filtering:
explain select * from (select ccy_code from currency order by ccy_code desc) a where rownum(a.*)<10;
Subquery Scan on a (cost=12.42..64.36 rows=65 width=4)
Filter: (rownum(a.*, 'seq_rownum_43b5c67f_dd64_4191_b29c_372061c848d6'::text) < 10)
-> Sort (cost=12.42..12.91 rows=196 width=4)
Sort Key: currency.ccy_code DESC
-> Seq Scan on currency (cost=0.00..4.96 rows=196 width=4)
Pros:
works as an expression or in a where clause
easy to use: just pass the first record.* you have in the from
Cons:
a temporary sequence is created for each rownum() encountered, but it is removed when the session ends.
performance (open question: row_number() over () versus nextval)
Postgresql does not have an equivalent of Oracle's ROWNUM.
In many cases you can achieve the same result by using LIMIT and OFFSET in your query.
Use the LIMIT clause with an OFFSET of the desired row number minus 1. So if you want the 8th row, use:
LIMIT 1 OFFSET 7
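As a complete statement (tbl and col are placeholders; an ORDER BY is needed for the row position to be deterministic):
SELECT *
FROM tbl
ORDER BY col
LIMIT 1 OFFSET 7;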

How can I overcome rounding errors during division with T-SQL?

I have a Total value that I need to distribute among several rows in a SQL table:
DECLARE @total numeric(38,5);
DECLARE @count int;
SET @total = 123.10000;
SET @count = (SELECT COUNT(*) FROM mytable WHERE condition = @val);
-- let's say @count is now 3
UPDATE mytable SET my_part = @total/@count WHERE condition = @val;
-- each record now has 41.03333
SELECT SUM(my_part) FROM mytable WHERE condition = @val;
-- the sum is 123.09999, not my original 123.10000
Obviously, the original total wasn't evenly divisible by 3 so the SUM won't match the original value. And no matter what I use for scale, there will be possible divisions like this one that can't line back up.
What I would like is that one of the UPDATEd rows would contain 41.03334, and the other two would have 41.03333. I don't care which ones round up and which round down. But I care that the values can be re-summed to get the original total. Is this possible? Are there known algorithms for doing this kind of thing?
Put the remainder into a secret account that slowly accumulates fractional pennies... then wait a few years...
Actually, if you have SQL Server 2005+, you can use the TOP clause in the UPDATE to limit the updated rows. So maybe:
DECLARE @EPSILON numeric(38,5);
DECLARE @T1 numeric(38,5);
DECLARE @T2 numeric(38,5);
SET @T1 = 1;
SET @T2 = 3;
SET @T1 = @T1/@T2;
SET @T2 = 3 * @T1;
SET @EPSILON = 1 - @T2;
DECLARE @total numeric(38,5);
DECLARE @count int;
DECLARE @REMAINDER numeric(38,5);
DECLARE @PARTIAL numeric(38,5);
DECLARE @RESUM numeric(38,5);
DECLARE @LIMITN int;
SET @total = 123.10000;
SELECT @count = COUNT(*) FROM mytable WHERE condition = @val;
SET @PARTIAL = @total / @count;
SET @RESUM = @PARTIAL * @count;
SET @REMAINDER = @total - @RESUM;
IF @REMAINDER < 0 SET @EPSILON = -@EPSILON;
SET @LIMITN = @REMAINDER / @EPSILON;
UPDATE mytable SET my_part = @PARTIAL WHERE condition = @val;
UPDATE TOP (@LIMITN) mytable SET my_part = my_part + @EPSILON WHERE condition = @val;
SELECT SUM(my_part) FROM mytable WHERE condition = @val;
You could use fractions to avoid rounding problems. At least multiplication and division of several rows would be easy. SUM() would not be quite so easy, if you need the exact value.
Below is a hack, but it works for this situation:
DECLARE @total numeric(38,5);
DECLARE @count int;
declare @mytable table
(
my_part numeric(38,6) -- note the scale is +1
)
insert into @mytable values (0)
insert into @mytable values (0)
insert into @mytable values (0)
SET @total = 123.10000
SELECT @count = COUNT(*) FROM @mytable
-- let's say @count is now 3
UPDATE @mytable SET my_part = @total/@count;
-- each record now has 41.03333
-- note the cast
SELECT cast(SUM(my_part) as numeric(38,5)) FROM @mytable;
Try this:
DECLARE @total numeric(38,5);
DECLARE @count int;
SET @total = 123.10000
SELECT @count = COUNT(*)
FROM mytable
WHERE condition = @val;
-- let's say @count is now 3
UPDATE mytable SET
my_part = @total/@count
WHERE condition = @val;
Update mytable SET
my_part = my_part +
@total - (Select Sum(My_Part)
From mytable
Where condition = @val)
Where PK = (Select Max(PK) From mytable
Where condition = @val)
-- each record except the one with the highest PK now has 41.03333
-- the one with the highest PK has 41.03334 (or whatever)
SELECT SUM(my_part)
FROM mytable
where condition = @val;
-- the sum should be the original 123.10000