postgresql: How to grab an existing id from a not subsequent ids of a table - postgresql

Postgresql version 9.4
I have a table with an integer column, which has a number of integers with some gaps, like the sample below; I'm trying to get an existing id from the column at random with the following query, but it returns NULL occasionally:
CREATE TABLE
IF NOT EXISTS test_tbl(
id INTEGER);
INSERT INTO test_tbl
VALUES (10),
(13),
(14),
(16),
(18),
(20);
-------------------------------
SELECT * FROM test_tbl;
-------------------------------
SELECT COALESCE(tmp.id, 20) AS classification_id
FROM (
SELECT tt.id,
row_number() over(
ORDER BY tt.id) AS row_num
FROM test_tbl tt
) tmp
WHERE tmp.row_num =floor(random() * 10);
Please let me know where I'm doing wrong.

but it returns NULL occasionally
and I must add to this that it sometimes returns more than 1 rows, right?
in your sample data there are 6 rows, so the column row_num will have a value from 1 to 6.
This:
floor(random() * 10)
creates a random number from 0 up to 0.9999...
You should use:
floor(random() * 6 + 1)::int
to get a random integer from 1 to 6.
But this would not solve the problem, because the WHERE clause is executed once for each row, so there is a case that row_num will never match the created random number, so it will return nothing, or it will match more than once so it will return more than 1 rows.
See the demo.
The proper (although sometimes not the most efficient) way to get a random row is:
SELECT id FROM test_tbl ORDER BY random() LIMIT 1
Also check other links from SO, like:
quick random row selection in Postgres

You could select one row and order by random(), this way you are ensured to hit an existing row
select id
from test_tbl
order by random()
LIMIT 1;

Related

PostgreSQL function results in inconsistent number of rows when used in WHERE clause

I have a function "experiments.get_random_experiment_group(experiment)" that when called in isolation behaves normally, but when called in a WHERE clause results in zero or more rows being returned.
For example, if I call this function like so:
SELECT
experiments.get_random_experiment_group('1a8cd547-8de6-4602-9103-3cea12f57365');
it will always return 1 of the 2 groups ids associated with that experiment.
But if I do something like this...:
SELECT
id,
experiment_id,
proportion
FROM
experiments."group"
WHERE
id = experiments.get_random_experiment_group('1a8cd547-8de6-4602-9103-3cea12f57365');
I will sometimes get 1 row, 2 rows, or no rows at all.
Why is it behaving as expected in the first case, but not in the second? Any suggestions as to what's happening?
Here is the function:
CREATE OR REPLACE FUNCTION experiments.get_random_experiment_group (experiment uuid)
RETURNS uuid
LANGUAGE plpgsql
AS
$function$
DECLARE
result uuid;
BEGIN
WITH
CTE
AS
(
SELECT
random() * (SELECT SUM(g.proportion) FROM experiments."group" g WHERE g.experiment_id = experiment) AS r
)
SELECT
id
INTO
result
FROM
(
SELECT
id,
SUM(proportion) OVER (ORDER BY id) AS s,
r
FROM
(
SELECT
*
FROM
experiments."group" g
WHERE
g.experiment_id = experiment
) a
CROSS JOIN
CTE
) q
WHERE
s >= r
ORDER BY
id
LIMIT
1;
return result;
end;
$function$;
The purpose of this function is to randomly select a group id for a given experiment id. The probability of selecting a particular group is given by the “proportion” assigned to that group.
And here are all of the groups associated with the experiment in my example:
id
experiment_id
proportion
95cc9e77-3dff-48d9-91cb-3e17c0c24d5a
1a8cd547-8de6-4602-9103-3cea12f57365
0.5
47469555-88fd-4cec-925f-9cf15b91bca0
1a8cd547-8de6-4602-9103-3cea12f57365
0.5
I am using postgres version 13.4
The function was getting called for each row in the "group" table, and thus returning rows when and only when the group id of that row happened to correspond to the randomly generated id returned by the function call.
Thanks to the commenters for pointing this out.

find rows not following by the same values in 3 columns

I have a table named raw_data with the following data
as You can see id 1 and 2 share the same values in field desa, kecamatan and kabupaten, also id 3,4,5.
So basically I want to select all rows that is not followed by the same previous values. expected result would be:
I know it's easy to do this in any programming languages such as PHP, but I need this in postgresql. is this doable? Thanks in Advance.
Assuming higher id denotes latest row, if a row with same all three columns is present not together and you don't want to filter it out as it doesn't have same values as previous row (order by id or created_date), then you can make use of analytic lag() function:
select *
from (
select
t.*,
case
when desa = lag(desa) over (order by id)
and kecamatan = lag(kecamatan) over (order by id)
and kabupaten = lag(kabupaten) over (order by id)
then 0 else 1
end flag
from your_table t
) t where flag = 1;

PostgreSQL Removing duplicates

I am working on postgres query to remove duplicates from a table. The following table is dynamically generated and I want to write a select query which will remove the record if the first row has duplicate values.
The table looks something like this
Ist col 2nd col
4 62
6 34
5 26
5 12
I want to write a select query which remove either row 3 or 4.
There is no need for an intermediate table:
delete from df1
where ctid not in (select min(ctid)
from df1
group by first_column);
If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
If you just want to get unique values for one column, you can use:
select distinct on (first_column) *
from the_table
order by first_column;
Or simply
select first_column, min(second_column)
from the_table
group by first_column;
select count(first) as cnt, first, second
from df1
group by first
having(count(first) = 1)
if you want to keep one of the rows (sorry, I initially missed it if you wanted that):
select first, min(second)
from df1
group by first
Where the table's name is df1 and the columns are named first and second.
You can actually leave off the count(first) as cnt if you want.
At the risk of stating the obvious, once you know how to select the data you want (or don't want) the delete the records any of a dozen ways is simple.
If you want to replace the table or make a new table you can just use create table as for the deletion:
create table tmp as
select count(first) as cnt, first, second
from df1
group by first
having(count(first) = 1);
drop table df1;
create table df1 as select * from tmp;
or using DELETE FROM:
DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);
You could also use select into, etc, etc.
if you want to SELECT unique rows:
SELECT * FROM ztable u
WHERE NOT EXISTS ( -- There is no other record
SELECT * FROM ztable x
WHERE x.id = u.id -- with the same id
AND x.ctid < u.ctid -- , but with a different(lower) "internal" rowid
); -- so u.* must be unique
if you want to SELECT the other rows, which were suppressed in the previous query:
SELECT * FROM ztable nu
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = nu.id -- with the same id
AND x.ctid < nu.ctid -- , but with a different(lower) "internal" rowid
);
if you want to DELETE records, making the table unique (but keeping one record per id):
DELETE FROM ztable d
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = d.id -- with the same id
AND x.ctid < d.ctid -- , but with a different(lower) "internal" rowid
);
So basically I did this
create temp t1 as
select first, min (second) as second
from df1
group by first
select * from df1
inner join t1 on t1.first = df1.first and t1.second = df1.second
Its a satisfactory answer. Thanks for your help #Hack-R

Postgresql - get closest datetime row relative to given datetime value

I have a postgres table with a unique datetime field.
I would like to use/create a function that takes as argument a datetime value and returns the row id having the closest datetime relative (but not equal) to the passed datetime value. A second argument could specify before or after the passed value.
Ideally, some combination of native datetime functions could handle this requirement. Otherwise it'll have to be a custom function.
Question: What are methods for querying relative datetime over a collection of rows?
select id, passed_ts - ts_column difference
from t
where
passed_ts > ts_column and positive_interval
or
passed_ts < ts_column and not positive_interval
order by abs(extract(epoch from passed_ts - ts_column))
limit 1
passed_ts is the timestamp parameter and positive_interval is a boolean parameter. If true only rows where the timestamp column is lower then the passed timestamp. If false the inverse.
use simply -.
Assuming you have a table with attributes Key, Attr and T (timestamp with or without timezone):
you can search with
select min(T - TimeValue) from Table where (T - TimeValue) > 0;
this will give you the main difference. You can combine this value with a join to the same table to get the tuple you are interested in:
select * from (select *, T - TimeValue as diff from Table) as T1 NATURAL JOIN
( select min(T - TimeValue) as diff from Table where (T - TimeValue) > 0) as T2;
that should do it
--dmg
You want the first row of a select statement producing all the rows below (or above) the given datetime in descending (or ascending) order.
Pseudo code for the function body:
SELECT id
FROM table
WHERE IF(#above, datecol < #param, datecol > #param)
ORDER BY IF (#above. datecol ASC, datecol DESC)
LIMIT 1
However, this does not work: one cannot condition the ordering direction.
The second idea is to do both queries, and select afterwards:
SELECT *
FROM (
(
SELECT 'below' AS dir, id
FROM table
WHERE datecol < #param
ORDER BY datecol DESC
LIMIT 1
) UNION (
SELECT 'above' AS dir, id
FROM table
WHERE datecol > #param
ORDER BY datecol ASC
LIMIT 1)
) AS t
WHERE dir = #dir
That should be pretty fast with an index on the datetime column.
-- test rig
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE lutser
( dt timestamp NOT NULL PRIMARY KEY
);
-- populate it
INSERT INTO lutser(dt)
SELECT gs
FROM generate_series('2013-04-30', '2013-05-01', '1 min'::interval) gs
;
DELETE FROM lutser WHERE random() < 0.9;
--
-- The query:
WITH xyz AS (
SELECT dt AS hh
, LAG (dt) OVER (ORDER by dt ) AS ll
FROM lutser
)
SELECT *
FROM xyz bb
WHERE '2013-04-30 12:00' BETWEEN bb.ll AND bb.hh
;
Result:
NOTICE: drop cascades to table tmp.lutser
DROP SCHEMA
CREATE SCHEMA
SET
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "lutser_pkey" for table "lutser"
CREATE TABLE
INSERT 0 1441
DELETE 1288
hh | ll
---------------------+---------------------
2013-04-30 12:02:00 | 2013-04-30 11:50:00
(1 row)
Wrapping it into a function is left as an excercise for the reader
UPDATE: here is a second one with the sandwiched-not-exists-trick (TM):
SELECT lo.dt AS ll
FROM lutser lo
JOIN lutser hi ON hi.dt > lo.dt
AND NOT EXISTS (
SELECT * FROM lutser nx
WHERE nx.dt < hi.dt
AND nx.dt > lo.dt
)
WHERE '2013-04-30 12:00' BETWEEN lo.dt AND hi.dt
;
You have to join the table to itself with the where condition looking for the smallest nonzero (negative or positive) interval between the base table row's datetime and the joined table row's datetime. It would be good to have an index on that datetime column.
P.S. You could also look for the max() of the previous or the min() of the subsequent.
Try something like:
SELECT *
FROM your_table
WHERE (dt_time > argument_time and search_above = 'true')
OR (dt_time < argument_time and search_above = 'false')
ORDER BY CASE WHEN search_above = 'true'
THEN dt_time - argument_time
ELSE argument_time - dt_time
END
LIMIT 1;

How do I replace a SSN with a 9 digit random number in SQL Server 2008R2?

To satisfy security requirements, I need to find a way to replace SSN's with unique, random 9 digit numbers, before providing said database to a developer. The SSN is in a column in a table of a database. There may be 10's of thousands of rows in said table. The number does not need hyphens. I am a beginner with SQL and programming in general.
I have been unable to find a solution for my specific needs. Nothing seems quite right. But if you know of a thread that I have missed, please let me know.
Thanks for any help!
Here is one way.
I'm assuming that you already have a backup of the real data as this update is not reversible.
Below I've assumed your table name is Person with your ssn column named SSN.
UPDATE Person SET
SSN = CAST(LEFT(CAST(ABS(CAST(CAST(NEWID() as BINARY(10)) as int)) as varchar(max)) + '00000000',9) as int)
If they do not have to be random, you could just replace them with ascending numeric values. Failing that, you’d have to generate a random number. As you may have discovered, the RAND function will only generate a single value per query statement (select, update, etc.); the work-around to that is the newid() function, which would generate a GUID for each row produced by a query (run SELECT newid() from MyTable to see how this works). Wrap this in a checksum() to generate an integer; modulus that by 1,000,00,000 to get a value within the SSN range (0 to 999,999,999); and, assuming you’re storing it as a char(9) prefix it with leading zeros.
Next trick is ensuring it’s unique for all values in your table. This gets tricky, and I’d do it by setting up a temp table with the values, populating it, then copying them over. Lessee now…
DECLARE #DummySSN as table
(
PrimaryKey int not null
,NewSSN char(9) not null
)
-- Load initial values
INSERT #DummySSN
select
UserId
,right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
from Users
-- Check for dups
select NewSSN from #DummySSN group by NewSSN having count(*) > 1
-- Loop until values are unique
IF exists (SELECT 1 from #DummySSN group by NewSSN having count(*) > 1)
UPDATE #DummySSN
set NewSSN = right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
where NewSSN in (select NewSSN from #DummySSN group by NewSSN having count(*) > 1)
-- Check for dups
select NewSSN from #DummySSN group by NewSSN having count(*) > 1
This works for a small table I have, and it should work for a large one. I don’t see this turning into an infinite loop, but even so you might want to add a check to exit the loop after say 10 iterations,
I've run a couple million tests in this and it seems to generate random (URN) 9 digit numbers (no leading zeros).
I cannot think of a more efficient way to do this.
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
The test used;
;WITH Fn(N) AS
(
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
UNION ALL
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
FROM Fn
)
,Tester AS
(
SELECT TOP 5000000 *
FROM Fn
)
SELECT LEN(MIN(N))
,LEN(MAX(N))
,MIN(N)
,MAX(N)
FROM Tester
OPTION (MAXRECURSION 0)
Not so fast, but easiest... I added some dot's...
DECLARE #tr NVARCHAR(40)
SET #tr = CAST(ROUND((888*RAND()+111),0) AS CHAR(3)) + '.' +
CAST(ROUND((8888*RAND()+1111),0) AS CHAR(4)) + '.' + CAST(ROUND((8888*RAND()+1111),0) AS
CHAR(4)) + '.' + CAST(ROUND((88*RAND()+11),0) AS CHAR(2))
PRINT #tr
If the requirement is to obfuscate a database then this will return the same unique value for each distinct SSN in any table preserving referential integrity in the output without having to do a lookup and translate.
SELECT CAST(RAND(SSN)*999999999 AS INT)