Weighted Random Selection - postgresql

Please. I have two tables with the most common first and last names. Each table has basically two fields:
Tables
CREATE TABLE "common_first_name" (
"first_name" text PRIMARY KEY, --The text representing the name
"ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.
"inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
"updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);
CREATE TABLE "common_last_name" (
"last_name" text PRIMARY KEY, --The text representing the name
"ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.
"inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
"updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);
P.S: The TOP 1 name occurs only ~ 1.8% of the time. The tables have 1000 rows each.
Function (Pseudo, not READY)
CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records INT)
RETURNS VOID
AS $$
DECLARE
SUM_OF_WEIGHTS CONSTANT INT := 100;
BEGIN
FOR i IN 1..coalesce(p_number_of_records, 0) LOOP
--Get the random first and last name but taking in consideration their probability (RATIO)round(random()*SUM_OF_WEIGHTS);
--create_person (random_first_name || ' ' || random_last_name);
END LOOP;
END
$$
LANGUAGE plpgsql VOLATILE;
P.S.: The sum of all ratios for each name (per table) sums up to 100%.
I want to run a function N times and get a name and a surname to create sample data... both tables have 1000 rows each.
The sample size can be anywhere from 1000 full names to 1000000 names, so if there is a "fast" way of doing this random weighted function, even better.
Any suggestion of how to do it in PL/PGSQL?
I am using PG 13.3 on SUPABASE.IO.
Thanks

Given the small input dataset, it's straightforward to do this in pure SQL. Use CTEs to build lower & upper bound columns for each row in each of the common_FOO_name tables, then use generate_series() to generate sets of random numbers. Join everything together, and use the random value between the bounds as the WHERE clause.
with first_names_weighted as (
select first_name,
sum(ratio) over (order by first_name) - ratio as lower_bound,
sum(ratio) over (order by first_name) as upper_bound
from common_first_name
),
last_names_weighted as (
select last_name,
sum(ratio) over (order by last_name) - ratio as lower_bound,
sum(ratio) over (order by last_name) as upper_bound
from common_last_name
),
randoms as (
select random() * (select sum(ratio) from common_first_name) as f_random,
random() * (select sum(ratio) from common_last_name) as l_random
from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound;
Change the value passed to generate_series() to control how many names to generate. If it's important that it be a function, you can just use a LANGAUGE SQL function definition to parameterize that number:
https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3

Related

Trigger taking time to insert data in postgres (column count 300)

I have created a trigger, it is taking more time while inserting multiple records.
Insetting 1 or 2 records is working. But if the records are more than 1000 then not fast, still running query from 2 hours.
I have created only 15 columns in below table. My actual table has 300 columns.
Is any other way to insert multiple records on the trigger table.?
Table
create table patients (
id serial,
name character varying (50),
daily varchar (8),
month varchar (6),
quarter varchar (6),
registration_date timestamp,
age integer,
address text,
country text,
city text,
phone_number integer,
Education text,
Occupation text,
Marital_Status text,"E-mail" text
);
trigger function
CREATE OR REPLACE FUNCTION update_data_after_insert_data_into_patients()
RETURNS trigger AS
$$BEGIN
update patients t1
set quarter=t2.quarter
from (SELECT (extract(year from registration_date)::text || 'Q' || extract(quarter from registration_date)::text) as quarter,registration_date
from patients) t2 where t1.registration_date =t2.registration_date;
update patients t1
set month=t2.month
from (select (extract(year from registration_date)::text || '' || to_char(registration_date,'MM')) as month,registration_date
from patients) t2 where t1.registration_date =t2.registration_date;
update patients t1
set daily=t2.daily
from (select extract(year from registration_date) || '' ||to_char(registration_date,'MM') || '' || to_char(registration_date,'DD') as daily,registration_date
from patients) t2 where t1.registration_date =t2.registration_date;
RETURN new;
END;
$$ LANGUAGE plpgsql;
Trigger definition
create TRIGGER trigger_update_data_after_insert_patients
AFTER insert ON patients
FOR EACH ROW
EXECUTE PROCEDURE update_data_after_insert_data_into_patients();
insert multiple records into patients table
INSERT INTO public.patients
("name", daily, "month", quarter, registration_date, age, address, country, city, phone_number, education, occupation, marital_status, "E-mail")
VALUES('Adam', '20221215', '202212', '2022Q4', '2022-08-17 19:01:10-08', 24, '', '', '', 1245578, '', '', '', '');
select statement
select * from patients;
You are updating all rows in the table with the same registration date as the one provided in the insert three times - just to calculate those generated columns.
You can do this more efficiently by assigning the generated values to the NEW record in a BEFORE trigger.
CREATE OR REPLACE FUNCTION update_data_after_insert_data_into_patients()
RETURNS trigger AS
$$
BEGIN
new.quarter := to_char(new.registration_date, 'yyyy"Q"q');
new.month := to_char(new.registration_date, 'yyyy mm');
new.daily := to_char(new.registration_date, 'yyyymmdd');
RETURN new;
END;
$$
LANGUAGE plpgsql;
create TRIGGER trigger_update_data_after_insert_patients
BEFORE insert ON patients
FOR EACH ROW
EXECUTE PROCEDURE update_data_after_insert_data_into_patients();
However I don't see the need to store these calculated values when you can easily format the registration_date when retrieving the data. I would get rid of those columns and the trigger and create a VIEW that does the formatting.

Postgres partitioned table query scans all partitions instead of one

I have a table
CREATE TABLE IF NOT EXISTS prices
(shop_id integer not null,
good_id varchar(24) not null,
eff_date timestamp with time zone not null,
price_wholesale numeric(20,2) not null default 0 constraint chk_price_ws check (price_wholesale >= 0),
price_retail numeric(20,2) not null default 0 constraint chk_price_rtl check (price_retail >= 0),
constraint pk_prices primary key (shop_id, good_id, eff_date)
)partition by list (shop_id);
CREATE TABLE IF NOT EXISTS prices_1 partition of prices for values in (1);
CREATE TABLE IF NOT EXISTS prices_3 partition of prices for values in (2);
CREATE TABLE IF NOT EXISTS prices_4 partition of prices for values in (3);
CREATE TABLE IF NOT EXISTS prices_4 partition of prices for values in (4);
...
CREATE TABLE IF NOT EXISTS prices_6 partition of prices for values in (100);
I'd like to delete outdated prices. The table is huge , so I try to delete small portions of records.
If I use loop and the variable v_shop_id then after 6 times Postgres starts scanning all partitions. I simplified the code, the real code has inner loop by shop_id.
If I use loop without the variable (I explicitly specify the value) Postgres doesn't scan all partitions
here code with the variable
do $$
declare
v_shop_id integer;
v_date_time timestamp with time zone := now();
begin
v_shop_id := 8;
for step in 1..10 loop
delete from prices p
using (select pd.good_id, max(pd.eff_date) as mxef_dt
from prices pd
where pd.eff_date < v_date_time - interval '30 days'
and pd.shop_id = v_shop_id
group by ppd.good_id
having count(1)>1
limit 40000) pfd
where p.eff_date <= pfd.mxef_dt
and p.shop_id = v_shop_id
and p.good_id = pfd.good_id;
end loop;
end;$$LANGUAGE plpgsql
How can I force Postrges to scan one desired partition only?

PL/pgSQL function to randomly select an id

Goal:
pre-populate a table with a list of sequential id, from e.g. 1 to 1,000,000. The table has an additional column that is nillable. NULL values are marked as unassigned and non-NULL values are marked as assigned
have function i can call that asks for x number of randomly chosen ids from the table which have not been assigned.
This is for something quite specific and while I understand there are different ways of doing this, I'd like to know if there's a solution to the flaw in this particular implementation.
I have something that partially works, but wondering where the flaw in the function is.
Here's the table:
CREATE SEQUENCE accounts_seq MINVALUE 700000000001 NO MAXVALUE;
CREATE TABLE accounts (
id BIGINT PRIMARY KEY default nextval('accounts_seq'),
client VARCHAR(25), UNIQUE(id, client)
);
This function gen_account_ids is just a one-time setup to pre-populate the table with a fixed number of rows, all marked as unassigned.
/*
This function will insert new rows into the accounts table with ids being
generated by a sequence, and client being NULL. A NULL client indicates
the account has not yet been assigned.
*/
CREATE OR REPLACE FUNCTION gen_account_ids(bigint)
RETURNS INT AS $gen_account_ids$
DECLARE
-- count is the number of new accounts you want generated
count alias for $1;
-- rowcount is returned as the number of rows inserted
rowcount int;
BEGIN
INSERT INTO accounts(client) SELECT NULL FROM generate_series(1, count);
GET DIAGNOSTICS rowcount = ROW_COUNT;
RETURN rowcount;
END;
$gen_account_ids$ LANGUAGE plpgsql;
So, I use this to pre-populate the table with, say 1000 records:
SELECT gen_account_ids(1000);
The next function assign is meant to randomly select an unassigned id (unassigned means client column is null), and update it with a client value so it becomes assigned. It returns the number of rows affected.
It works sometimes, but I do believe there are collisions occurring -- which is why I tried for DISTINCT, but it often returns fewer than the desired number of rows. For example, if I select assign(100, 'foo'); it might return 95 rows instead of the desired 100.
How can I modify this to make it always return the exact desired rows?
/*
This will assign ids to a client randomly
#param int is the number of account numbers to generate
#param varchar(10) is a string descriptor for the client
#returns the number of rows affected -- should be the same as the input int
Call it like this: `SELECT * FROM assign(100, 'FOO')`
*/
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
RETURNS INT AS $$
DECLARE
total ALIAS FOR $1;
clientname ALIAS FOR $2;
rowcount int;
BEGIN
UPDATE accounts SET client = clientname WHERE id IN (
SELECT DISTINCT trunc(random() * (
(SELECT max(id) FROM accounts WHERE client IS NULL) -
(SELECT min(id) FROM accounts WHERE client IS NULL)) +
(SELECT min(id) FROM accounts WHERE client IS NULL)) FROM generate_series(1, total));
GET DIAGNOSTICS rowcount = ROW_COUNT;
RETURN rowcount;
END;
$$ LANGUAGE plpgsql;
This is loosely based on this where you can do something like SELECT trunc(random() * (100 - 1) + 1) FROM generate_series(1,5); which will select 5 random numbers between 1 and 100.
My goal is to do something similar where I select a random id between the min and max unassigned rows, and mark it for update.
This isn't the best answer b/c it does involve full table scans, but in my situation, I'm not concerned about the performance, and it works. This is based off #CraigRinger's reference to the blog post getting random tuples
I'd be generally interested in hearing about other (perhaps better) solutions -- and am specifically curious about why the original solution falls short, and what #klin also devised.
So, here's my brute force random order solution:
-- generate a million unassigned rows with null client column
insert into accounts(client) select null from generate_series(1, 1000000);
-- assign 1000 random rows to client 'foo'
update accounts set client = 'foo' where id in
(select id from accounts where client is null order by random() limit 1000);
Because ids of random subset of rows are not consecutive, select a random row_number() instead of random id.
with nulls as ( -- base query
select id
from accounts
where client is null
),
randoms as ( -- calculate random int in range 1..count(nulls.*)
select trunc(random()* (count(*) - 1) + 1)::int random_value
from nulls
),
row_numbers as ( -- add row numbers to nulls
select id, row_number() over (order by id) rn
from nulls
)
select id
from row_numbers, randoms
where rn = random_value; -- random row number
A function is not necessary here, but you can easily place the query in a function body if needed.
This query updates 5 random rows with null client.
update accounts
set client = 'new value' -- <-- clientname
where id in (
with nulls as ( -- base query
select id
from accounts
where client is null
),
randoms as ( -- calculate random int in range 1..count(nulls.*)
select i, trunc(random()* (count(*) - 1) + 1)::int random_value
from nulls
cross join generate_series(1, 5) i -- <-- total
group by 1
),
row_numbers as ( -- add row numbers to nulls in order by id
select id, row_number() over (order by id) rn
from nulls
)
select id
from row_numbers, randoms
where rn = random_value -- random row number
)
However, there is no certainty that the query will update exactly 5 rows, because
select trunc(random()* (max_value - 1) + 1)::int
from generate_series(1, n)
is not a correct way to generate n different random values. The probability of repetitions increases with the quotient n / max_value.

Get columns that differ between 2 rows

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I think it is possible to compare column by column x 60, but I search for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First with get a key/value representation of the rows with the original row number, then we pair the rows based on their original row number and
filter out those with the same "value" column
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
where r1.value <> r2.value;
Sample result:
key
-----
id
t3
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table, and loops over them. A difference is when the count of the distinct items is more than one.
Also, the output is:
The count of the number of differences
Messages for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;

Does postgres postgis ST_makeline have a max number of points it can create a line from?

My database has a table with tons of geometry(PointZ,4326) I am doing a lot of my processing on the database side and I've noticed that when I use the ST_MakeLine I seem to be hitting a cap on the number of points it will make a line from. My table and function/query is below.
It works as long as the number of track_points returned from the sub query is less than 97. I know this because the insert puts data in the table for all columns when there are 96 points or fewer. For all records where there are 97 or more points all it inserts is the track_id, start_time and end_time.
I'm wondering if this is a bug in the ST_makeLine function of postgis or is it a setting in postgres that I need to modify.
CREATE TABLE track_line_strings(
track_id bigint NOT NULL,
linestring geometry(LINESTRINGZ,4326),
start_time bigint NOT NULL,
end_time bigint NOT NULL,
CONSTRAINT track_line_strings_pk PRIMARY KEY (track_id)
);
CREATE OR REPLACE FUNCTION create_track_line_string() RETURNS trigger
LANGUAGE plpgsql
AS $$
DECLARE
TRACKITEMID bigint := new.track_item_id;
TRACKID bigint := track_id from track_item ti where ti.id = TRACKITEMID;
STARTTIME bigint := MIN(ti.item_time) from track_item ti where ti.track_id = TRACKID;
ENDTIME bigint := MAX(ti.item_time) from track_item ti where ti.track_id = TRACKID;
BEGIN
IF EXISTS (SELECT track_id from track_line_strings where track_id = TRACKID)
THEN
UPDATE track_line_strings
SET start_time = STARTTIME, end_time = ENDTIME, linestring = (
SELECT ST_Makeline(e.trackPosition) FROM
(
Select track_id, tp.track_position AS trackPosition
FROM track_point tp JOIN track_item ti ON tp.track_item_id = ti.id
where ti.track_id = TRACKID ORDER BY ti.item_time ASC
) E )
WHERE track_id = TRACKID;
ELSE
INSERT INTO track_line_strings(track_id, linestring, start_time, end_time)
SELECT TRACKID, ST_Makeline(e.trackPosition), STARTTIME, ENDTIME FROM
(
Select track_id, tp.track_position AS trackPosition
FROM track_point tp JOIN track_item ti ON tp.track_item_id = ti.id
where ti.track_id = TRACKID ORDER BY ti.item_time ASC
)E;
END IF;
RETURN new;
END;
$$;
The database limits are pretty high, 1 GB data worth of geometry data in a field. It depends on what kind of point geometry, but it will be on the order of tens of millions of point geometries that can be used to construct a LineString.
You will see a proper error message with something about "exceeded size" if it is a limitation.
Apparent empty or missing data with pgAdminIII is a common question, but not related to database limitations:
http://postgis.net/2013/10/05/tip_pgAdmin_shows_no_data
http://postgis.net/docs/manual-dev/PostGIS_FAQ.html#pgadmin_shows_no_data_in_geom
There doesnt appear to be a limit. I was viewing results in pgAdminIII and there must be a limit on the number of characters the data output can handle for each column. I only realized this by copy pasting the results into a text file to see that it did infact return a value for the lines that have more than 96 points.