PostgreSQL - Create randomized percentages?

For testing purposes I need to create a JSONB object with n random percentages, so that they sum up to 1.0.
Obviously, the very simple
SELECT jsonb_object(array_agg(ARRAY['prefix_' || n::TEXT, round(random()::NUMERIC, 2)::TEXT])) AS e
FROM generate_series(1,8) AS n
;
will produce numbers that won't sum up to 1.0.
I expect the solution to be much more complicated, and I can't seem to wrap my head around this.
How to do this in PostgreSQL 10?

You just have to adjust the values.
You could base your generator on this query:
WITH r AS (
SELECT random()
FROM generate_series(1, 8)
)
SELECT random/sum(random) OVER ()
FROM r;
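Building on that query, a minimal sketch of the full JSONB generator could look like the following. The `prefix_` keys come from the question; the aliases `x` and `pct` and the four-decimal rounding are assumptions made for illustration.

```sql
-- Draw 8 uniform values, divide each by their total so they sum to 1,
-- then aggregate the normalized values into one JSONB object.
WITH r AS (
    SELECT n, random() AS x
    FROM generate_series(1, 8) AS n
), p AS (
    SELECT n, x / sum(x) OVER () AS pct
    FROM r
)
SELECT jsonb_object_agg('prefix_' || n, round(pct::numeric, 4)) AS e
FROM p;
```

Because of the rounding, the stored values may differ from 1.0 in the last digit. Drop the `round()` call if the exact sum matters more than tidy numbers.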

Related

Update PostgreSQL columns with random numbers according to a uniform or normal law

Context
I have a PostgreSQL (v.10.12) table which contains several hundreds of thousands of rows and many columns.
Short: I'd like to initialize random values in some columns according to either a uniform distribution or a normal distribution.
Uniform distribution
I have 3 empty columns which I'd like to initialize with uniform random numbers (i.e. according to a uniform distribution along each column).
For that, I'm using the PostgreSQL random() function, but it is not clearly explained in the documentation whether the generated numbers are picked from a uniform or a normal distribution:
Source: https://www.postgresql.org/docs/12/functions-math.html
So for now I'm assuming (I hope correctly) that it is a uniform distribution.
Normal distribution
And I have 3 other empty columns, which I'd like to initialize with normal random numbers (i.e. according to a normal distribution along each column):
Results
For a uniform distribution
I did this (I actually figured it out while writing this post):
UPDATE schema.table
SET col1 = (1 * random() + 1)::float4,
col2 = (1 * random() + 1)::float4,
col3 = (1 * random() + 1)::float4
Which is a bit slow but seems to work, because here is an example of generated data:
And the data histogram for one column is almost uniform, so I guess it's OK:
But it would be great to set an entire column's values in one shot instead of row by row. (Or maybe row-by-row updating actually performs better in SQL; I don't know, I'm reasoning from a linear algebra background.)
For a normal distribution
I'm stuck with the following query, inspired by the previous one but using the normal_rand function from the tablefunc extension:
UPDATE schema.table
SET col1 = (1 * normal_rand(1,50.0,20.0) + 2)::float4,
col2 = (1 * normal_rand(1,50.0,20.0) + 2)::float4,
col3 = (1 * normal_rand(1,50.0,20.0) + 2)::float4
but here, I am facing the following error: set-returning functions are not allowed in UPDATE.
So, I guessed I have to use a SELECT sub-query then, hence I also tried this, but without much success:
UPDATE schema.table
SET col1 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub,
col2 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 10.0) + 2)::float4 as col) AS sub,
col3 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 5.0) + 2)::float4 as col) AS sub
Where I get a syntax error at or near col2
But if I play with one single column:
UPDATE schema.table
SET col1 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub;
It almost works, the query is successful, but I have the exact same number in each row of the column, which obviously doesn't make it a normal distribution!
So my dream would be to update a whole column in one shot, using something such as:
UPDATE schema.table
SET col1 = sub.col
FROM (
SELECT (
1 * normal_rand(
SELECT count(*) FROM schema.table,
50.0,
20.0
) + 2
):: float4 AS col
) AS sub;
But here again, I get a syntax error at the position of the 2nd SELECT: syntax error at or near "select"
Question
How can I set one or more entire column(s) with random numbers according to a normal distribution?
In a pinch, you can try the following solution:
((random() + random() + random() + random() + random() + random() + random() + random() + random() + random() + random() + random()) - 6)
The idiom above uses twelve random() calls to generate a random number that approximates a standard normal distribution (mean of 0 and standard deviation 1), and takes advantage of the central limit theorem. It's an approximation in part because the random number generated this way won't be less than -6 or greater than 6, whereas the normal distribution can theoretically take on any real number; however numbers less than -6 or greater than 6 occur so rarely (about 1 in 500 million) that it may be negligible in your case.
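To apply the idiom to the columns from the question, one option is to wrap it in a small helper that rescales the result to a given mean and standard deviation. This is only a sketch: the function name `approx_normal`, its parameters `mu` and `sigma`, and the table name `my_schema.my_table` are hypothetical placeholders, not part of the original post.

```sql
-- Hypothetical helper: approximates a normal(mu, sigma) draw by summing
-- twelve uniform random() values (mean 6, variance 1) and rescaling.
CREATE FUNCTION approx_normal(mu float8, sigma float8)
RETURNS float8
LANGUAGE sql
VOLATILE  -- each call has to draw fresh random numbers
AS $$
    SELECT mu + sigma * (random() + random() + random() + random()
                       + random() + random() + random() + random()
                       + random() + random() + random() + random() - 6.0)
$$;

-- my_schema.my_table stands in for the table in the question.
-- The function is not set-returning, so it is allowed in UPDATE,
-- and being volatile it is evaluated again for every row.
UPDATE my_schema.my_table
SET col1 = approx_normal(50.0, 20.0)::float4,
    col2 = approx_normal(50.0, 10.0)::float4,
    col3 = approx_normal(50.0, 5.0)::float4;
```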

Can PostgreSQL LAG() function refer to itself?

I've just discovered the LAG() function in PostgreSQL and I've been experimenting to see what it can achieve. I thought that I might calculate a factorial with it, so I wrote
SELECT i, i * lag(factorial, 1, 1) OVER (ORDER BY i, 1) as factorial FROM generate_series(1, 10) as i;
But the online IDE complains: 42703 column "factorial" does not exist.
Is there any way I can access the result of previous LAG call?
You can't refer to the column recursively in its definition.
However, you can express the factorial calculation as:
SELECT i, EXP(SUM(LN(i)) OVER w)::int factorial
FROM generate_series(1, 10) i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- outputs:
 i  | factorial
----+-----------
  1 |         1
  2 |         2
  3 |         6
  4 |        24
  5 |       120
  6 |       720
  7 |      5040
  8 |     40320
  9 |    362880
 10 |   3628800
(10 rows)
PostgreSQL also supports an advanced SQL feature called recursive queries, which can be used to express the factorial table recursively:
WITH RECURSIVE series AS (
SELECT i FROM generate_series(1, 10) i
)
, rec AS (
SELECT i, 1 factorial FROM series WHERE i = 1
UNION ALL
SELECT series.i, series.i * rec.factorial
FROM series
JOIN rec ON series.i = rec.i + 1
)
SELECT *
FROM rec;
what EXP(SUM(LN(i)) OVER w) does:
This exploits the mathematical identities that:
[1]: log(a * b * c) = log (a) + log (b) + log (c)
[2]: exp (log a) = a
[combining 1&2]: exp(log a + log b + log c) = a * b * c
SQL does not have an aggregate multiply operation, so to perform one we first take the log of each value, then use the sum aggregate function to get the log of the values' product. This we invert with the final exponentiation step.
This works as long as the values being multiplied are positive, as log is undefined for 0 and negative numbers. If you have zeros or negative numbers, the trick is to handle them separately: if any value is 0, the whole product is 0; otherwise take the logs of the absolute values, and the result is positive if the number of negative values is even and negative if it is odd. Alternatively, you could work in the complex plane and use the identity Log(-r) = ln(r) + iπ for r > 0.
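The zero and sign handling described above can be sketched as a single aggregate query. The inline sample values and the names `t` and `x` are made up for illustration.

```sql
-- Product of a column that may contain zeros and negative numbers.
SELECT CASE
         -- any zero makes the whole product zero
         WHEN bool_or(x = 0) THEN 0
         ELSE
           -- sign: +1 for an even count of negatives, -1 for an odd count
           (CASE WHEN count(*) FILTER (WHERE x < 0) % 2 = 0 THEN 1 ELSE -1 END)
           -- magnitude: exp of the summed logs of the absolute values;
           -- NULLIF keeps ln() from ever seeing a zero
           * exp(sum(ln(abs(NULLIF(x, 0)))))
       END AS product
FROM (VALUES (-2.0::float8), (3.0), (-4.0), (0.5)) AS t(x);
```

With these sample values the result is approximately 12, up to floating-point rounding.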
what ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW does
This declares an expanding window frame that includes all preceding rows, and the current row.
e.g.
when i equals 1 the values in this window frame are {1}
when i equals 2 the values in this window frame are {1,2}
when i equals 3 the values in this window frame are {1,2,3}
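One way to see the frame contents directly is to aggregate them into an array with the same window definition. This is a small illustrative query; the alias `frame_values` is an assumption.

```sql
-- array_agg over the expanding frame shows exactly which rows
-- each window function call sees.
SELECT i, array_agg(i) OVER w AS frame_values
FROM generate_series(1, 3) AS i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- i = 1 -> {1}
-- i = 2 -> {1,2}
-- i = 3 -> {1,2,3}
```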
what is a recursive query
A recursive query lets you express recursive logic using SQL. Recursive queries are often used to traverse parent-child relationships in relational data (think manager-report, or a product classification hierarchy), but they can generally be used to query any tree-like structure.
Here is a SO answer I wrote a while back that illustrates and explains some of the capabilities of recursive queries.
There are also a tonne of useful tutorials on recursive queries. It is a very powerful SQL language feature and solves a class of problems that are very difficult to handle without recursion.
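As a rough illustration of the manager-report case, the sketch below walks a small hierarchy from the top down. The `employees` data, its column names, and the `depth` column are invented for this example.

```sql
WITH RECURSIVE employees(id, name, manager_id) AS (
    -- hypothetical sample data: Ann is the root, Bob and Dee report
    -- to Ann, Cid reports to Bob
    VALUES (1, 'Ann', NULL::int), (2, 'Bob', 1), (3, 'Cid', 2), (4, 'Dee', 1)
), chain AS (
    -- anchor: start from the employee without a manager
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
  UNION ALL
    -- recursive step: attach everyone who reports to a row found so far
    SELECT e.id, e.name, e.manager_id, c.depth + 1
    FROM employees e
    JOIN chain c ON e.manager_id = c.id
)
SELECT * FROM chain ORDER BY depth, id;
```

The recursion stops on its own once a step finds no new reports.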
Hope this gives you more insight into what the code does. Happy learning!

Query table by a value in the second dimension of a two dimensional array column

WHAT I HAVE
I have a table with the following definition:
CREATE TABLE "Highlights"
(
id uuid,
chunks numeric[][]
)
WHAT I NEED TO DO
I need to query the data in the table using the following predicate:
... WHERE id = 'some uuid' and chunks[????????][1] > 10 and chunks[????????][3] < 20
What should I put instead of [????????] in order to scan all items in the first dimension of the array?
Notes
I'm not entirely sure that chunks[][1] is even close to what I need.
All I need is to test whether a row's chunks column contains a two-dimensional array in which any tuple has some specific values.
Maybe there's a better alternative, but this might do: you just go over the first dimension of each array and test your condition:
select *
from "Highlights" as h
where
exists (
select
from generate_series(1, array_length(h.chunks, 1)) as tt(i)
where
-- your condition goes here
h.chunks[tt.i][1] > 10 and h.chunks[tt.i][3] < 20
)
db<>fiddle demo
Update: as @arie-r pointed out, it'd be better to use the generate_subscripts function:
select *
from "Highlights" as h
where
exists (
select *
from generate_subscripts(h.chunks, 1) as tt(i)
where
h.chunks[tt.i][3] = 6
)
db<>fiddle demo
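To try the generate_subscripts approach without creating a table, here is a self-contained sketch with inline sample data. The two rows and their integer ids are made up; the real table uses uuid ids.

```sql
WITH highlights(id, chunks) AS (
    VALUES (1, ARRAY[[11, 0, 15], [1, 2, 3]]::numeric[]),
           (2, ARRAY[[5, 0, 15], [12, 0, 25]]::numeric[])
)
SELECT h.id
FROM highlights AS h
WHERE EXISTS (
    SELECT 1
    -- one row per subscript of the first dimension
    FROM generate_subscripts(h.chunks, 1) AS tt(i)
    -- both conditions must hold for the same tuple
    WHERE h.chunks[tt.i][1] > 10 AND h.chunks[tt.i][3] < 20
);
-- returns only id 1: its tuple {11,0,15} passes both tests, while in
-- row 2 no single tuple satisfies both conditions
```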

PostgreSQL doesn't update/copy boolean from subselect?

Here's a short code sample that behaves unexpectedly for me in PostgreSQL v9.5:
create table test (a boolean, b boolean);
insert into test (a) values ('true'),('true'),('true'),('true'),('true');
update test set b = rand.val from (select random() > 0.5 as val from test) as rand;
I expect column b to take on random true/false values, but for some reason it's always false. If I run the subselect of:
select random() > 0.5 as val from test;
It returns random true/false combinations as I want. But for some reason once I try to update the actual table it fails. I've tried several casting combinations but it doesn't seem to help.
What am I missing here?
How about:
update test set b = (random() > 0.5);
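The original statement has no join condition between test and rand, so every target row is paired with all rows of the subquery. PostgreSQL then uses only one of the joined rows per target row, which in practice gives every row the same value. Putting random() directly in SET avoids the join, and because random() is volatile it is evaluated once per updated row. A quick way to check, using a temporary table and a larger row count chosen for illustration:

```sql
CREATE TEMP TABLE test (a boolean, b boolean);
INSERT INTO test (a) SELECT true FROM generate_series(1, 1000);

-- random() is re-evaluated for each row being updated
UPDATE test SET b = (random() > 0.5);

-- both true and false should show up, in roughly equal numbers
SELECT b, count(*) FROM test GROUP BY b;
```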

How to get min/max of two integers in Postgres/SQL?

How do I find the maximum (or minimum) of two integers in Postgres/SQL? One of the integers is not a column value.
I will give an example scenario:
I would like to subtract an integer from a column (in all rows), but the result should not be less than zero. So, to begin with, I have:
UPDATE my_table
SET my_column = my_column - 10;
But this can make some of the values negative. What I would like (in pseudo code) is:
UPDATE my_table
SET my_column = MAXIMUM(my_column - 10, 0);
Have a look at GREATEST and LEAST.
UPDATE my_table
SET my_column = GREATEST(my_column - 10, 0);
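A small standalone query shows how both functions behave as a floor and a cap. The sample values and the cap of 100 are arbitrary choices for illustration.

```sql
SELECT v,
       GREATEST(v - 10, 0) AS floored_at_zero,  -- never below 0
       LEAST(v, 100)       AS capped_at_100     -- never above 100
FROM (VALUES (3), (10), (25), (250)) AS t(v);
--   3 ->   0,   3
--  10 ->   0,  10
--  25 ->  15,  25
-- 250 -> 240, 100
```

Note that in PostgreSQL both functions ignore NULL arguments and return NULL only when every argument is NULL.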
You want the inline sql case:
set my_column = case when my_column - 10 > 0 then my_column - 10 else 0 end
max() is an aggregate function: it returns the maximum of a column across the rows of a result set, not the larger of two values.
Edit: oops, didn't know about greatest and least in postgres. Use that instead.