PL/pgSQL function to randomly select an id

Goal:
1. Pre-populate a table with a list of sequential ids, e.g. from 1 to 1,000,000. The table has an additional nullable column: a NULL value marks the id as unassigned, and a non-NULL value marks it as assigned.
2. Have a function I can call that returns x randomly chosen ids from the table which have not yet been assigned.
This is for something quite specific, and while I understand there are different ways of doing this, I'd like to know if there's a solution to the flaw in this particular implementation.
I have something that partially works, but I'm wondering where the flaw in the function is.
Here's the table:
CREATE SEQUENCE accounts_seq MINVALUE 700000000001 NO MAXVALUE;

CREATE TABLE accounts (
    id BIGINT PRIMARY KEY DEFAULT nextval('accounts_seq'),
    client VARCHAR(25),
    UNIQUE (id, client)
);
This function gen_account_ids is just a one-time setup to pre-populate the table with a fixed number of rows, all marked as unassigned.
/*
This function will insert new rows into the accounts table with ids being
generated by a sequence, and client being NULL. A NULL client indicates
the account has not yet been assigned.
*/
CREATE OR REPLACE FUNCTION gen_account_ids(bigint)
RETURNS INT AS $gen_account_ids$
DECLARE
    -- count is the number of new accounts you want generated
    count ALIAS FOR $1;
    -- rowcount is returned as the number of rows inserted
    rowcount int;
BEGIN
    INSERT INTO accounts(client) SELECT NULL FROM generate_series(1, count);
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$gen_account_ids$ LANGUAGE plpgsql;
So, I use this to pre-populate the table with, say, 1000 records:
SELECT gen_account_ids(1000);
The next function, assign, is meant to randomly select an unassigned id (unassigned meaning the client column is NULL) and update it with a client value so it becomes assigned. It returns the number of rows affected.
It works sometimes, but I do believe collisions are occurring -- which is why I tried DISTINCT -- and it often returns fewer than the desired number of rows. For example, SELECT assign(100, 'foo'); might return 95 rows instead of the desired 100.
How can I modify this so it always returns exactly the desired number of rows?
/*
This will assign ids to a client randomly
@param int is the number of account numbers to generate
@param varchar(10) is a string descriptor for the client
@returns the number of rows affected -- should be the same as the input int
Call it like this: `SELECT * FROM assign(100, 'FOO')`
*/
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
RETURNS INT AS $$
DECLARE
    total ALIAS FOR $1;
    clientname ALIAS FOR $2;
    rowcount int;
BEGIN
    UPDATE accounts SET client = clientname WHERE id IN (
        SELECT DISTINCT trunc(random() * (
                   (SELECT max(id) FROM accounts WHERE client IS NULL) -
                   (SELECT min(id) FROM accounts WHERE client IS NULL)) +
                   (SELECT min(id) FROM accounts WHERE client IS NULL))
        FROM generate_series(1, total));
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$$ LANGUAGE plpgsql;
This is loosely based on this, where you can do something like SELECT trunc(random() * (100 - 1) + 1) FROM generate_series(1, 5);, which will select 5 random numbers between 1 and 100.
My goal is to do something similar: select a random id between the min and max unassigned rows, and mark it for update.
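In fact the flaw is visible directly in that expression: trunc(random() * (max - min) + min) draws numbers with replacement, so duplicates get collapsed by DISTINCT, and a drawn number can also land on an id that is already assigned (or, if there are gaps, on no row at all), which the IN clause then silently skips. A quick illustrative query (not from the original post) that shows the duplicate draws:

-- draw 100 random integers from a range of 1000 and count the distinct ones
SELECT count(*) AS drawn, count(DISTINCT v) AS distinct_values
FROM (SELECT trunc(random() * 1000)::int AS v
      FROM generate_series(1, 100)) s;

This typically reports roughly 95 distinct values out of 100 drawn, matching the shortfall described above.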

This isn't the best answer because it does involve full table scans, but in my situation I'm not concerned about performance, and it works. This is based on @CraigRinger's reference to the blog post getting random tuples.
I'd be generally interested in hearing about other (perhaps better) solutions -- and am specifically curious about why the original solution falls short, and about what @klin also devised.
So, here's my brute force random order solution:
-- generate a million unassigned rows with null client column
insert into accounts(client) select null from generate_series(1, 1000000);
-- assign 1000 random rows to client 'foo'
update accounts set client = 'foo' where id in
(select id from accounts where client is null order by random() limit 1000);
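If you want this in the same shape as the original assign function, here is a minimal sketch (same logic as the update above; parameter names are mine):

CREATE OR REPLACE FUNCTION assign(total INT, clientname VARCHAR(10))
RETURNS INT AS $$
DECLARE
    rowcount int;
BEGIN
    UPDATE accounts SET client = clientname
    WHERE id IN (
        -- pick only from unassigned rows, so every chosen id is updatable
        SELECT id FROM accounts
        WHERE client IS NULL
        ORDER BY random()
        LIMIT total
    );
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$$ LANGUAGE plpgsql;

Because the LIMIT is applied after filtering to unassigned rows, this returns exactly total rows whenever at least that many unassigned rows exist.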

Because the ids of a random subset of rows are not consecutive, select a random row_number() instead of a random id.
with nulls as ( -- base query
    select id
    from accounts
    where client is null
),
randoms as ( -- calculate random int in range 1..count(nulls.*)
    select trunc(random() * (count(*) - 1) + 1)::int random_value
    from nulls
),
row_numbers as ( -- add row numbers to nulls
    select id, row_number() over (order by id) rn
    from nulls
)
select id
from row_numbers, randoms
where rn = random_value; -- random row number
A function is not necessary here, but you can easily place the query in a function body if needed.
This query updates 5 random rows with null client.
update accounts
set client = 'new value' -- <-- clientname
where id in (
    with nulls as ( -- base query
        select id
        from accounts
        where client is null
    ),
    randoms as ( -- calculate random int in range 1..count(nulls.*)
        select i, trunc(random() * (count(*) - 1) + 1)::int random_value
        from nulls
        cross join generate_series(1, 5) i -- <-- total
        group by 1
    ),
    row_numbers as ( -- add row numbers to nulls in order by id
        select id, row_number() over (order by id) rn
        from nulls
    )
    select id
    from row_numbers, randoms
    where rn = random_value -- random row number
)
However, there is no certainty that the query will update exactly 5 rows, because
select trunc(random()* (max_value - 1) + 1)::int
from generate_series(1, n)
is not a correct way to generate n distinct random values. The probability of repetitions increases with the quotient n / max_value. (In expectation, n uniform draws from max_value values yield only max_value * (1 - (1 - 1/max_value)^n) distinct values; with n = 100 and max_value = 1000 that is about 95.2, matching the roughly 95 rows the original assign returned instead of 100.)

Related

How to insert into after the last row in a table?

I have this table below named roombooking:
I wrote this code that inserts a new row into roombooking (don't mind the details, just the hotelbookingID):
CREATE OR REPLACE FUNCTION my_function(startdate date, enddate date, idForHotel integer)
RETURNS void AS
$$
BEGIN
    INSERT INTO roombooking("hotelbookingID", "roomID", "bookedforpersonID",
                            checkin, checkout, rate)
    SELECT rb."hotelbookingID", r."idRoom", p."idPerson",
           startdate - integer '20', startdate - integer '10', rr.rate
    FROM (SELECT "hotelbookingID" FROM roombooking
          WHERE "hotelbookingID" =
                (SELECT "hotelbookingID"
                 FROM roombooking
                 ORDER BY "hotelbookingID" DESC
                 LIMIT 1) + 1) rb,
         (SELECT "idRoom" FROM room
          WHERE "idHotel" = idForHotel) r,
         (SELECT "idPerson" FROM person
          ORDER BY random()
          LIMIT 1) p,
         (SELECT rate FROM roomrate
          WHERE "idHotel" = idForHotel) rr;
END;
$$
LANGUAGE plpgsql;
The problem here is that I want to insert after the last row, based on the last hotelbookingID (they are in ascending order).
My function works, but I guess it can't find the last row in order to perform the insertion after it. (I think the problem can be spotted here:
SELECT "hotelbookingID" FROM roombooking
WHERE "hotelbookingID"=
(select "hotelbookingID"
from roombooking
order by "hotelbookingID" desc
limit 1)+1)
Any help would be valuable. Thank you.
Any approach that uses a subquery to find the maximum existing id is doomed to suffer from race conditions: if two such INSERTs are running concurrently, they will end up with the same number.
Use an identity column:
ALTER TABLE roombooking
ALTER id ADD GENERATED ALWAYS AS IDENTITY (START 100000);
where 100000 is a value greater than the maximum id in the table.
Then all you have to do is not insert anything into id, and the column will be populated automatically.
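A minimal usage sketch (illustrative values only; column names are taken from the question's function, and the generated column is simply left out):

-- the identity column is not mentioned at all and is filled in automatically
INSERT INTO roombooking("roomID", "bookedforpersonID", checkin, checkout, rate)
VALUES (1, 2, current_date, current_date + 10, 100.00);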
That WHERE condition makes no sense. There is no row in the roombooking table whose id is 1 + the largest id in the roombooking table.
You simply want to add 1 to the inserted value:
INSERT INTO roombooking("hotelbookingID", …)
SELECT rb."hotelbookingID" + 1, …
-- ^^^^
FROM (
SELECT "hotelbookingID"
FROM roombooking
ORDER BY "hotelbookingID" DESC
LIMIT 1
) rb,
…
That said, I would recommend simply using a sequence instead (if you don't care about occasional gaps). If you really need continuous numbering, I wouldn't use ORDER BY + LIMIT though. Just use an aggregate, and consider the case where the table is still empty:
INSERT INTO roombooking("hotelbookingID", …)
VALUES ( COALESCE((SELECT max("hotelbookingID") FROM roombooking), 0) + 1, …);

CTE based insert of multiple rows into "one-per-group" table violates unique index

I have a table where only one row per group can be true.
This is enforced by a partial unique index (which can't be deferred).
CREATE TABLE test
(
    id SERIAL PRIMARY KEY,
    my_group INTEGER,
    last BOOLEAN DEFAULT TRUE
);

CREATE UNIQUE INDEX "test.last" ON test (my_group) WHERE last;

INSERT INTO test (my_group)
VALUES (1), (2);
I'm trying to insert a new row into this table that shall replace the "last" element of the corresponding group. I also want to accomplish this in a single statement.
With some CTE trickery I'm able to do this: link to Fiddle
-- the statement is structured this way to closely resemble my actual usecase
WITH
new_data AS (
    VALUES (1)
),
uncheck_old_last AS (
    UPDATE test
    SET last = FALSE
    WHERE last AND my_group IN (SELECT * FROM new_data)
    RETURNING TRUE
)
INSERT INTO test (my_group)
SELECT *
FROM new_data
WHERE COALESCE((SELECT * FROM uncheck_old_last LIMIT 1), true);
So far so good; the insert happens with no conflicts.
I don't quite understand why this works, as my understanding is that all CTEs read the same initial DB state and can't see the changes made by other CTEs.
The problem is now that I get a unique violation when I try to do the same with multiple rows at once: Link to Fiddle
-- the statement is structured this way to closely resemble my actual usecase
WITH
new_data AS (
    VALUES (1), (2) -- <- difference to above query
),
uncheck_old_last AS (
    UPDATE test
    SET last = FALSE
    WHERE last AND my_group IN (SELECT * FROM new_data)
    RETURNING TRUE
)
INSERT INTO test (my_group)
SELECT *
FROM new_data
WHERE COALESCE((SELECT * FROM uncheck_old_last LIMIT 1), true);
-- Schema Error: error: duplicate key value violates unique constraint "test.last"
Is there any way to insert multiple rows with one statement? And can someone explain why the first query works and the second doesn't?
This was caused by PostgreSQL simplifying my always-true clause:
WHERE COALESCE((SELECT * FROM uncheck_old_last LIMIT 1), true)
was supposed to create a dependency between the main query and the CTE, to enforce execution order from the main query's point of view.
It broke with more than one entry because the LIMIT 1 allowed PostgreSQL to ignore the second row, as only one row was required for evaluation.
I fixed it by comparing COUNT(*) > -1 instead:
COALESCE((SELECT COUNT(*) FROM uncheck_old_last) > -1, true)
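Putting the author's pieces together, the fixed multi-row statement would look like this (same CTEs as above; only the final WHERE changed):

WITH
new_data AS (
    VALUES (1), (2)
),
uncheck_old_last AS (
    UPDATE test
    SET last = FALSE
    WHERE last AND my_group IN (SELECT * FROM new_data)
    RETURNING TRUE
)
INSERT INTO test (my_group)
SELECT *
FROM new_data
-- COUNT(*) has to read the whole CTE result, so the planner can no longer
-- short-circuit it, and all old "last" rows are unchecked before the insert
WHERE COALESCE((SELECT COUNT(*) FROM uncheck_old_last) > -1, true);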

Get columns that differ between 2 rows

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that are potentially the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I think it is possible to compare column by column, 60 times over, but I'm searching for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows along with their original row number, then we pair the rows based on their original row number and filter out those with the same "value" column.
WITH rowcols AS (
    select rn, key, value
    from (select row_number() over () as rn,
                 row_to_json(table1.*) as r
          from table1) AS s
    cross join lateral json_each_text(s.r)
)
select r1.key
from rowcols r1
join rowcols r2 on (r1.rn = r2.rn - 1 and r1.key = r2.key)
where r1.value <> r2.value;
Sample result:
key
-----
id
t3
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
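(Remember from the introduction above that hstore must be installed once per database before these queries work: CREATE EXTENSION hstore;)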
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table and loops over them. A difference is detected when the count of distinct values for a column across the two rows is more than one.
Also, the output is:
the count of the number of differences (the return value), and
a message for each column where there is a difference.
It might be more useful to return a rowset of the columns with the differences (a sketch of that variant follows the function below). Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text, p_idcolumn text, p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
    l_diffcount INTEGER;
    l_column text;
    l_dupcount integer;
    column_cursor CURSOR FOR
        select column_name
        from information_schema.columns
        where table_name = p_tablename
          and table_schema = p_schema
          and column_name <> p_idcolumn;
BEGIN
    -- need error checking here, to ensure the table and schema exist and the columns exist
    -- Should also check that the record ids exist.
    -- Should also check that the column type of the id field is integer
    -- Set the number of differences to zero.
    l_diffcount := 0;
    -- use a cursor to iterate over the columns found in information_schema.columns
    -- open the cursor
    OPEN column_cursor;
    LOOP
        FETCH column_cursor INTO l_column;
        EXIT WHEN NOT FOUND;
        -- build a query to see if there is a difference between the columns. If there is, raise a notice
        EXECUTE 'select count(distinct ' || quote_ident(l_column) || ') from '
                || quote_ident(p_schema) || '.' || quote_ident(p_tablename)
                || ' where ' || quote_ident(p_idcolumn) || ' in (' || p_firstid || ',' || p_secondid || ')'
        INTO l_dupcount;
        IF l_dupcount > 1 THEN
            -- increment the counter
            l_diffcount := l_diffcount + 1;
            RAISE NOTICE '% has % differences', l_column, l_dupcount; -- for "real" you might want to return a rowset and could do something here
        END IF;
    END LOOP;
    -- close the cursor
    CLOSE column_cursor;
    RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;
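If you'd rather get the rowset mentioned above than NOTICE messages, here is a hedged sketch of that variant (showdifference_cols is a made-up name; note that count(distinct ...) ignores NULLs, so a NULL-versus-value difference is missed by both versions):

CREATE OR REPLACE FUNCTION showdifference_cols(p_schema text, p_tablename text, p_idcolumn text, p_firstid integer, p_secondid integer)
RETURNS SETOF text AS
$BODY$
DECLARE
    l_column text;
    l_dupcount integer;
BEGIN
    FOR l_column IN
        SELECT column_name
        FROM information_schema.columns
        WHERE table_name = p_tablename
          AND table_schema = p_schema
          AND column_name <> p_idcolumn
    LOOP
        -- count the distinct values of this column across the two rows
        EXECUTE 'select count(distinct ' || quote_ident(l_column) || ') from '
                || quote_ident(p_schema) || '.' || quote_ident(p_tablename)
                || ' where ' || quote_ident(p_idcolumn) || ' in (' || p_firstid || ',' || p_secondid || ')'
        INTO l_dupcount;
        IF l_dupcount > 1 THEN
            RETURN NEXT l_column; -- emit the column name instead of raising a notice
        END IF;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;

Usage: SELECT * FROM showdifference_cols('public', 'company', 'co_id', 22, 33);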

SQL Server : Update order, Why does it work and can I trust it?

I am using SQL Server 2012.
There is a "magic query" I don't understand why it's working: using a variable, I am updating a table and letting it use the previous values it has already calculated.
It sets rolMul to be a rolling multiplication of the items so far.
Can I trust this method?
Why does it work in the first place?
If I can't trust it, what alternatives can I use?
-- Create data to work on
select * into #Temp from (
    select 1 as id, null as rolMul) A
insert into #temp select 2 as id, null as rolMul
insert into #temp select 3 as id, null as rolMul
insert into #temp select 4 as id, null as rolMul
insert into #temp select 5 as id, null as rolMul

------ Here is the magic I don't understand why it's working -----
declare @rolMul int = 1
update #temp set @rolMul = "rolMul" = @rolMul * id from #temp

select * from #temp
-- you can see it did what I wanted: multiply all the previous values

drop table #temp
What bothers me is:
Why does it work? Can I trust it to work?
What about the order? What if the table was not ordered:
select * into #Temp from (
    select 3 as id, null as rolMul) A
insert into #temp select 1 as id, null as rolMul
insert into #temp select 5 as id, null as rolMul
insert into #temp select 2 as id, null as rolMul
insert into #temp select 4 as id, null as rolMul

declare @rolMul int = 1
update #temp set @rolMul = "rolMul" = @rolMul * id from #temp
select * from #temp order by id
drop table #Temp
go
If I can't trust it what alternatives can I use?
As of SQL Server 2012, you can use an efficient rolling sum of logarithms.
WITH tempcte AS (
SELECT
id,
rolmul,
EXP(SUM(LOG(id)) OVER (ORDER BY id)) AS setval
FROM #Temp
)
UPDATE tempcte
SET rolmul = setval;
SQL Server 2012 added ORDER BY support in the OVER clause of aggregate functions such as SUM, which is what makes an efficient running total possible. Ole Michelsen shows with a brief example how this solves the running total problem.
The product law of logarithms says that the log of the product of two numbers is equal to the sum of the logs of each number.
This identity allows us to use the fast running sum to calculate products at similar speed: take the log before the sum, then take the exponential of the result, and you have your answer.
SQL Server gives you LOG and EXP to calculate the natural logarithm (base e) and its exponential. It doesn't matter what base you use as long as you are consistent (note that LOG requires strictly positive inputs, so this only works when every id is greater than zero).
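The identity is easy to check in isolation (an illustrative one-liner, not from the original answer):

SELECT EXP(LOG(2.0) + LOG(3.0)) AS product; -- 6 (up to floating-point rounding), same as 2.0 * 3.0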
The updatable common table expression is necessary because window expressions can't appear in the SET clause of an update statement.
The query is reliably correct for small numbers of rows, but will overflow very quickly. Try 64 rows of 2 and you'll bust the bigint!
In theory this should produce the correct result as long as the ids are unique. In practice, I think your set of ids will always be small :-)

In SQL Server 2000, how to delete the specified rows in a table that does not have a primary key?

Let's say we have a table with some data in it.
IF OBJECT_ID('dbo.table1') IS NOT NULL
BEGIN
    DROP TABLE dbo.table1;
END

CREATE TABLE table1 (data INT);

---------------------------------------------------------------------
-- Generating testing data
---------------------------------------------------------------------
INSERT INTO dbo.table1(data)
SELECT 100
UNION ALL SELECT 200
UNION ALL SELECT NULL
UNION ALL SELECT 400
UNION ALL SELECT 400
UNION ALL SELECT 500
UNION ALL SELECT NULL;
How to delete the 2nd, 5th, 6th records in the table? The order is defined by the following query.
SELECT data
FROM dbo.table1
ORDER BY data DESC;
Note, this is in SQL Server 2000 environment.
Thanks.
In short, you need something in the table to indicate sequence. The "2nd row" is a non-sequitur when there is nothing that enforces sequence. However, a possible solution might be (toy example => toy solution):
If object_id('tempdb..#NumberedData') Is Not Null
    Drop Table #NumberedData

Create Table #NumberedData
(
    Id int not null identity(1,1) primary key clustered
    , data int null
)

Insert #NumberedData( data )
SELECT 100
UNION ALL SELECT 200
UNION ALL SELECT NULL
UNION ALL SELECT 400
UNION ALL SELECT 400
UNION ALL SELECT 500
UNION ALL SELECT NULL

Begin Tran

Delete table1

Insert table1( data )
Select data
From #NumberedData
Where Id Not In (2,5,6)

If @@Error = 0
    Commit Tran
Else
    Rollback Tran
Obviously, this type of solution is not guaranteed to work exactly as you want but the concept is the best you will get. In essence, you stuff your rows into a table with an identity column and use that to identify the rows to remove. Removing the rows entails emptying the original table and re-populating with only the rows you want. Without a unique key of some kind, there just is no clean way of handling this problem.
As you are probably aware, you can do this in later versions using ROW_NUMBER very straightforwardly (use ORDER BY data DESC if you want to match the ordering defined in the question):
delete t from
(select ROW_NUMBER() over (order by data) r from table1) t
where r in (2,5,6)
Even without that it is possible to use the undocumented %%LOCKRES%% function to differentiate between 2 identical rows
SELECT data, %%LOCKRES%%
FROM dbo.table1
I don't think that's available in SQL Server 2000 though.
In SQL, sets don't have order, but cursors do, so you could use something like the below. NB: I was expecting to be able to use DELETE ... WHERE CURRENT OF, but that relies on a PK, so the code to delete a row is not as simple as I was hoping for.
In the event that the data to be deleted is a duplicate, there is no guarantee that it will delete the same row as CURRENT OF would have. However, in this eventuality the ordering of the tied rows is arbitrary anyway, so whichever row is deleted could equally well have been given that row number in the cursor ordering.
DECLARE @RowsToDelete TABLE
(
    rowidx INT PRIMARY KEY
)

INSERT INTO @RowsToDelete SELECT 2 UNION SELECT 5 UNION SELECT 6

DECLARE @PrevRowIdx int
DECLARE @CurrentRowIdx int
DECLARE @Offset int
SET @CurrentRowIdx = 1
DECLARE @data int

DECLARE ordered_cursor SCROLL CURSOR FOR
    SELECT data
    FROM dbo.table1
    ORDER BY data

OPEN ordered_cursor
FETCH NEXT FROM ordered_cursor INTO @data

WHILE EXISTS (SELECT * FROM @RowsToDelete)
BEGIN
    SET @PrevRowIdx = @CurrentRowIdx
    SET @CurrentRowIdx = (SELECT TOP 1 rowidx FROM @RowsToDelete ORDER BY rowidx)
    SET @Offset = @CurrentRowIdx - @PrevRowIdx
    DELETE FROM @RowsToDelete WHERE rowidx = @CurrentRowIdx
    FETCH RELATIVE @Offset FROM ordered_cursor INTO @data
    /* Can't use DELETE ... WHERE CURRENT OF as here that requires a PK */
    SET ROWCOUNT 1
    -- NULL-safe match: delete the fetched value, or a NULL row if NULL was fetched
    DELETE FROM dbo.table1 WHERE (data = @data OR (data IS NULL AND @data IS NULL))
    SET ROWCOUNT 0
END

CLOSE ordered_cursor
DEALLOCATE ordered_cursor
To perform any action on a set of rows (such as deleting them), you need to know what identifies those rows.
So you have to come up with criteria that identify the rows you want to delete.
Providing a toy example, like the one above, is not particularly useful.
You plan ahead, and if you anticipate this situation, you add a surrogate key column or some such.
In general, you make sure you don't create tables without PKs.
It's like asking: "Say I don't look both ways before crossing the road and I step in front of a bus."