Best way to maintain an ordered list in PostgreSQL? - postgresql

Say I have a table called list, where there are items like these (the ids are random uuids):
id   rank   text
---  ----   -----
x    0      Hello
x    1      World
x    2      Foo
x    3      Bar
x    4      Baz
I want to maintain the invariant that the rank column always runs from 0 to n-1 (n being the number of rows): if a client asks to insert an item with rank = 3, then the pg server should push the current 3 and 4 to 4 and 5, respectively:
id   rank   text
---  ----   -----
x    0      Hello
x    1      World
x    2      Foo
x    3      New Item!
x    4      Bar
x    5      Baz
My current strategy is to have a dedicated insertion function add_item(item) that scans through the table, filters out the items with rank equal to or greater than that of the item being inserted, and increments those ranks by one. However, I suspect this approach will run into all sorts of problems, race conditions among them.
Is there a more standard practice or more robust approach?
Note: The rank column is completely independent of the rest of the columns, and insertion is not the only operation I need to support. Think of it as the back end of a sortable to-do list, where the user can add/delete/reorder items on the fly.

Doing verbatim what you suggest might be difficult or outright impossible, but I can suggest a workaround. Maintain a new column ts which stores the time a record is inserted. Then insert the current time along with the rest of the record, i.e.
id   rank   text        ts
---  ----   ---------   -------------------
x    0      Hello       2017-12-01 12:34:23
x    1      World       2017-12-03 04:20:01
x    2      Foo         ...
x    3      New Item!   2017-12-12 11:26:32
x    3      Bar         2017-12-10 14:05:43
x    4      Baz         ...
Now we can easily generate the ordering you want via a query:
SELECT id, rank, text,
       ROW_NUMBER() OVER (ORDER BY rank, ts DESC) - 1 AS new_rank
FROM yourTable;
This would generate 0 to 5 ranks in the above sample table. The basic idea is to just use the already existing rank column, but to let the timestamp break the tie in ordering should the same rank appear more than once.
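For instance, the tie above would come from an insert like this (a sketch following the sample table's columns):

-- Reuse the existing rank 3; the ts column breaks the tie at read time.
INSERT INTO yourTable (id, rank, text, ts)
VALUES ('x', 3, 'New Item!', now());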

You can wrap it up in a function if you think it's worth it:
t=# with u as (
      update r set rank = rank + 1 where rank >= 3
    )
    insert into r values ('x', 3, 'New val!');
INSERT 0 1
the result:
t=# select * from r;
id | rank | text
----+------+----------
x | 0 | Hello
x | 1 | World
x | 2 | Foo
x | 3 | New val!
x | 4 | Bar
x | 5 | Baz
(6 rows)
Also worth mentioning: you might hit a concurrency ("race condition") problem on highly loaded systems; the code above is just a sample.
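A minimal sketch of such a wrapper, reusing the table r from the sample above (the function name and signature are my own choices):

CREATE FUNCTION add_item(p_id text, p_rank int, p_text text)
RETURNS void
LANGUAGE sql AS $$
    -- Shift everything at or above the target rank, then insert.
    WITH u AS (
        UPDATE r SET rank = rank + 1 WHERE rank >= p_rank
    )
    INSERT INTO r VALUES (p_id, p_rank, p_text);
$$;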

You can have a “computed rank”, which is a double precision, and a “displayed rank”, which is an integer computed using the row_number window function on output.
When a row is inserted that should rank between two rows, compute the new rank as the arithmetic mean of the two ranks.
The advantage is that you don't have to update existing rows.
The down side is that you have to calculate the displayed ranks before you can insert a new row so that you know where to insert it.
This solution (like all the others) is subject to race conditions.
To deal with these, you can either use table locks or serializable transactions.
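A minimal sketch of the fractional-rank idea (table and column names assumed; gen_random_uuid() needs the pgcrypto extension before Postgres 13):

-- Insert between the rows whose computed ranks are 2.0 and 3.0,
-- without touching any existing row:
INSERT INTO list (id, computed_rank, text)
VALUES (gen_random_uuid(), (2.0 + 3.0) / 2, 'New Item!');

-- Derive the displayed 0-based rank on output:
SELECT id, text,
       ROW_NUMBER() OVER (ORDER BY computed_rank) - 1 AS displayed_rank
FROM list;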

The only way to prevent a race condition would be to lock the table
https://www.postgresql.org/docs/current/sql-lock.html
Of course this would slow you down if there are lots of updates and inserts.
If you can somehow limit the scope of your updates, you can do a SELECT ... FOR UPDATE on that scope. For example, if the records have a parent_id, you can do a SELECT ... FOR UPDATE on the parent record first; any other insert that does the same SELECT ... FOR UPDATE will have to wait until your transaction is done.
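A sketch of that pattern (the lists parent table and the parent_id column are assumptions):

BEGIN;
-- Serialize writers on this one list; concurrent inserters block here.
SELECT 1 FROM lists WHERE id = 'x' FOR UPDATE;
UPDATE list SET rank = rank + 1 WHERE parent_id = 'x' AND rank >= 3;
INSERT INTO list (parent_id, rank, text) VALUES ('x', 3, 'New Item!');
COMMIT;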
https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
Read the section on advisory locks to see if you can use those in your application. They are not enforced by the system, so you'll need to be careful about how your application uses them.
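For example, a sketch using a transaction-scoped advisory lock (deriving the lock key with hashtext is my own choice, not from the docs):

BEGIN;
-- Advisory lock keyed on the list identifier; held until commit/rollback.
SELECT pg_advisory_xact_lock(hashtext('list:x'));
UPDATE list SET rank = rank + 1 WHERE rank >= 3;
INSERT INTO list (id, rank, text) VALUES ('x', 3, 'New Item!');
COMMIT;  -- the advisory lock is released automatically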

Related

Can PostgreSQL LAG() function refer to itself?

I've just discovered the LAG() function in PostgreSQL and I've been experimenting to see what it can achieve. I thought I might calculate a factorial with it, so I wrote
SELECT i, i * lag(factorial, 1, 1) OVER (ORDER BY i, 1) as factorial FROM generate_series(1, 10) as i;
But the online IDE complains: 42703: column "factorial" does not exist.
Is there any way I can access the result of previous LAG call?
You can't refer to the column recursively in its definition.
However, you can express the factorial calculation as:
SELECT i, EXP(SUM(LN(i)) OVER w)::int factorial
FROM generate_series(1, 10) i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- outputs:
i | factorial
----+-----------
1 | 1
2 | 2
3 | 6
4 | 24
5 | 120
6 | 720
7 | 5040
8 | 40320
9 | 362880
10 | 3628800
(10 rows)
PostgreSQL does support an advanced SQL feature called the recursive query, which can also be used to express the factorial table recursively:
WITH RECURSIVE series AS (
    SELECT i FROM generate_series(1, 10) i
),
rec AS (
    SELECT i, 1 AS factorial FROM series WHERE i = 1
    UNION ALL
    SELECT series.i, series.i * rec.factorial
    FROM series
    JOIN rec ON series.i = rec.i + 1
)
SELECT *
FROM rec;
what EXP(SUM(LN(i)) OVER w) does:
This exploits the mathematical identities that:
[1]: log(a * b * c) = log (a) + log (b) + log (c)
[2]: exp (log a) = a
[combining 1&2]: exp(log a + log b + log c) = a * b * c
SQL does not have an aggregate multiply operation, so to perform one, we first take the log of each value; the SUM aggregate function then gives us the log of the values' product, which we invert with the final exponentiation step.
This works as long as the values being multiplied are positive, since log is undefined for 0 and for negative numbers. If you have zeros or negative numbers, the trick is: if any value is 0, the whole product is 0; otherwise take the logs of the absolute values, and if the count of negative values is even the result is positive, else it is negative. Alternatively, you could move to the complex plane and use the identity Log(z) = ln|z| + iπ for negative reals.
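A sketch of that zero/sign handling as a plain aggregate (the val column and the sample values are mine):

SELECT CASE
           WHEN bool_or(val = 0) THEN 0
           ELSE EXP(SUM(LN(ABS(val))))
                * CASE WHEN COUNT(*) FILTER (WHERE val < 0) % 2 = 0
                       THEN 1 ELSE -1 END
       END AS product
FROM (VALUES (2.0), (-3.0), (4.0)) AS t(val);
-- product ≈ -24 (rounding aside)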
what ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW does
This declares an expanding window frame that includes all preceding rows, and the current row.
e.g.
when i equals 1 the values in this window frame are {1}
when i equals 2 the values in this window frame are {1,2}
when i equals 3 the values in this window frame are {1,2,3}
what is a recursive query
A recursive query lets you express recursive logic using SQL. Recursive queries are often used to generate parent-child relationships from relational data (think manager-report, or a product classification hierarchy), but they can generally be used to query any tree-like structure, as in the sketch below.
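For instance, a minimal manager-report walk might look like this (the employees table and its columns are assumptions):

WITH RECURSIVE chain AS (
    -- Anchor: top-level managers.
    SELECT id, name, manager_id
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Step: attach the direct reports of rows already in the result.
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN chain c ON e.manager_id = c.id
)
SELECT * FROM chain;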
Here is a SO answer I wrote a while back that illustrates and explains some of the capabilities of recursive queries.
There are also a tonne of useful tutorials on recursive queries. It is a very powerful SQL-language feature and solves a type of problem that is very difficult to handle without recursion.
Hope this gives you more insight into what the code does. Happy learning!

Can (aggregate) functions be used to define a column?

Assume a table like this one:
a | b | total
--|---|------
1 | 2 | 3
4 | 7 | 11
…
CREATE TEMPORARY TABLE summedup (
    a double precision DEFAULT 0
  , b double precision DEFAULT 0
  --, total double precision
);
INSERT INTO summedup (a, b) VALUES (1, 2);
INSERT INTO summedup (a, b) VALUES (4, 7);
SELECT a, b, a + b AS total FROM summedup;
It's easy to sum up the first two columns on SELECT.
But does Postgres (9.6) also support the ability to define total as the sum of the other two columns? If so:
What is the syntax?
What is this type of operation called? (Aggregates typically sum up cells over multiple rows, not columns.)
What you are looking for is typically called a "computed column".
Postgres 9.6 does not support that (Postgres 12 - to be released in Q4 2019 - will).
But for such a simple sum, I wouldn't bother storing redundant information.
If you don't want to repeat the expression, create a view.
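For reference, the Postgres 12 syntax will look roughly like this (a sketch of a generated column; not available on 9.6):

CREATE TEMPORARY TABLE summedup (
    a double precision DEFAULT 0
  , b double precision DEFAULT 0
  , total double precision GENERATED ALWAYS AS (a + b) STORED
);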
I think what you want is a View.
CREATE VIEW table_with_sum AS
SELECT id, a, b, a + b as total FROM summedup;
then you can query the view for the sum.
SELECT total FROM table_with_sum where id=5;
The view does not store the sum for each row; the total column is computed every time you query the view. If your goal is to make your query more efficient, this will not help.
There is another way: add the column to the table and create triggers for insert and update that set the total column every time a row is modified, as sketched below.
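A minimal sketch of that trigger approach on 9.6 (the set_total function and trigger names are my own; it assumes the total column has been added to summedup):

CREATE FUNCTION set_total() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.total := NEW.a + NEW.b;  -- recompute on every insert/update
    RETURN NEW;
END;
$$;

CREATE TRIGGER summedup_total
BEFORE INSERT OR UPDATE ON summedup
FOR EACH ROW EXECUTE PROCEDURE set_total();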

Postgres Function: how to return the first full set of data that occurs after specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 thru 80 every minute as part of our production process.
I am generating a new report (a Postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. 1 minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at/after this time. However, I see that, depending on network activity and industrial computer priorities, the table is not religiously updated at exactly 00:01:00 AM, or any other minute for that matter. Updates can occur milliseconds or even seconds later, and take 500 ms to 800 ms to write to the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, but the sequence data is not super critical anyway).
My thinking is that it would be more reliable to look for the first complete set of data any time from 00:01:00 AM onwards. So effectively, I have a table that looks a bit like this:
(The sample table was posted as an image. It shows one row per bin per minute, with the binTime, binMinute, and binBinNo columns used in the code below.)
Basically, the table above is typical, but the 1st minute is not guaranteed, and for that matter I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how do I return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of row count in the function, ensuring there are 80 rows for a given minute, but this seems less intuitive. I would like to know for sure that for each row of a given minute, bin 1 is represented, bin 2, bin 3...
Ultimately a call to this function will supply a min/max date/time and that period of time will be checked for the first available minute with a full set of bins data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used window functions a few times now, but I'm still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from klin:
StartTime = DATE_TRUNC('minute', tme1);
EndTime = DATE_TRUNC('day', tme1) + '23 hours'::interval;

SELECT "BinSequence".*
FROM "BinSequence"
JOIN (
    SELECT "binMinute" AS binminute, count("binMinute")
    FROM "BinSequence"
    WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
    GROUP BY 1
    HAVING COUNT(DISTINCT "binBinNo") = 80 -- verifies that each and every bin is represented in the returned data
) theseTuplesOnly
ON theseTuplesOnly.binminute = "BinSequence"."binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
ORDER BY 1
LIMIT 80
Use the aggregate function count(*) grouping data by minutes (date_trunc('minute', datestamp) gives full minutes from datestamp), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1;
minute | count
----------+-------
00:01:00 | 4
00:02:00 | 1
00:03:00 | 3
(3 rows)
If you are not sure that all bins are unique in consecutive minutes, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts over aggregated minutes together with all columns of the table in a single simple select. If you want that, you should join a derived table, use the IN operator, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
having count(bin) = 4
) s
on date_trunc('minute', datestamp) = minute
order by 1;
datestamp | bin | param | count
-----------+-----+-------+-------
00:01:10 | 1 | a | 4
00:01:20 | 2 | b | 4
00:01:30 | 3 | c | 4
00:01:40 | 4 | d | 4
(4 rows)
Note also how HAVING is used to filter the results in the above query.

Min value with GROUP BY in Power BI Desktop

id   datetime             new_column           datetime_rankx
1    12.01.2015 18:10:10  12.01.2015 18:10:10  1
2    03.12.2014 14:44:57  03.12.2014 14:44:57  1
2    21.11.2015 11:11:11  03.12.2014 14:44:57  2
3    01.01.2011 12:12:12  01.01.2011 12:12:12  1
3    02.02.2012 13:13:13  01.01.2011 12:12:12  2
3    03.03.2013 14:14:14  01.01.2011 12:12:12  3
I want to make a new column that holds the minimum datetime value for each group of rows sharing an id.
How can I do this in Power BI Desktop using a DAX query?
Use this expression:
NewColumn =
CALCULATE(
    MIN(Table[datetime]),
    FILTER(Table, Table[id] = EARLIER(Table[id]))
)
In Power BI, using a table with your data, this produces the new_column values shown in the sample above.
UPDATE: Explanation and EARLIER function usage.
Basically, the EARLIER function gives you access to the values of a different row context.
When you use the CALCULATE function, it creates a row context over the whole table; theoretically, it iterates over every table row. The same happens when you use the FILTER function: it iterates over the whole table and evaluates every row against the filter condition.
So far we have two row contexts: the one created by CALCULATE and the one created by FILTER. Note that FILTER uses EARLIER to get access to CALCULATE's row context. That said, in our case, for every row in the outer (CALCULATE's) row context, FILTER returns the set of rows that correspond to the current id of the outer context.
If you have a programming background, this may make sense as a nested loop.
I hope this Python pseudocode illustrates the main idea:
outer_context = ['row1', 'row2', 'row3', 'row4']
inner_context = ['row1', 'row2', 'row3', 'row4']
for outer_row in outer_context:
    for inner_row in inner_context:
        if inner_row == outer_row:  # this line is what FILTER and EARLIER do
            # calculate the min datetime using the filtered rows
            ...
UPDATE 2: Adding a ranking column.
To get the desired rank you can use this expression:
RankColumn =
RANKX(
    CALCULATETABLE(Table, ALLEXCEPT(Table, Table[id])),
    Table[datetime],
    Table[datetime],
    1
)
This yields the datetime_rankx column shown in the sample table above.
Let me know if this helps.

How to create a custom windowing function for PostgreSQL? (Running Average Example)

I would really like to better understand what is involved in creating a UDF that operates over windows in PostgreSQL. I did some searching about how to create UDFs in general, but haven't found an example of how to do one that operates over a window.
To that end, I am hoping that someone would be willing to share code for how to write a UDF (in C, PL/pgSQL, or any of the procedural languages supported by PostgreSQL) that calculates the running average of numbers in a window. I realize there are ways to do this by applying the standard average aggregate function with the windowing syntax (the ROWS BETWEEN syntax, I believe); I am asking because I think it makes a good, simple example. Also, if there were a windowing version of the average function, the database could keep a running sum and observation count instead of summing up almost identical sets of rows at each iteration.
You have to look at the PostgreSQL source code: postgresql/src/backend/utils/adt/windowfuncs.c and postgresql/src/backend/executor/nodeWindowAgg.c.
There is no good documentation :( -- a fully functional window function can be implemented only in C or PL/v8; there is no API for other languages.
http://www.pgcon.org/2009/schedule/track/Version%208.4/128.en.html is a presentation from the author of the implementation in PostgreSQL.
I found only one non-core implementation - http://api.pgxn.org/src/kmeans/kmeans-1.1.0/
http://pgxn.org/dist/plv8/1.3.0/doc/plv8.html
According to the documentation, "Other window functions can be added by the user. Also, any built-in or user-defined normal aggregate function can be used as a window function." (section 4.2.8). That worked for me for computing stock split adjustments:
CREATE OR REPLACE FUNCTION prod(float8, float8) RETURNS float8
AS 'SELECT $1 * $2;'
LANGUAGE SQL IMMUTABLE STRICT;
CREATE AGGREGATE prods ( float8 ) (
SFUNC = prod,
STYPE = float8,
INITCOND = 1.0
);
create or replace view demo.price_adjusted as
select id, vd,
       prods(sdiv) OVER (PARTITION BY id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as adjf,
       rawprice * prods(sdiv) OVER (PARTITION BY id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as price
from demo.prices_raw
left outer join demo.adjustments using (id, vd);
Here are the schemas of the two tables:
CREATE TABLE demo.prices_raw (
id VARCHAR(30),
vd DATE,
rawprice float8 );
CREATE TABLE demo.adjustments (
id VARCHAR(30),
vd DATE,
sdiv float);
Starting with table
payments
+-------------+--------+-------+
| customer_id | amount | item  |
+-------------+--------+-------+
|           5 |     10 | book  |
|           5 |     71 | mouse |
|           7 |     13 | cover |
|           7 |     22 | cable |
|           7 |     19 | book  |
+-------------+--------+-------+
SELECT customer_id,
       AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,
       item
FROM payments;
we get
+-------------+------------+-------+
| customer_id | avg_amount | item  |
+-------------+------------+-------+
|           5 |       40.5 | book  |
|           5 |       40.5 | mouse |
|           7 |         18 | cover |
|           7 |         18 | cable |
|           7 |         18 | book  |
+-------------+------------+-------+
AVG being an aggregate function, it can act as a window function. However not all window functions are aggregate functions. The aggregate functions are the non-sophisticated window functions.
In the query above, let's not use the built-in AVG function but our own implementation instead. It does the same thing, just implemented by the user. The query above becomes:
SELECT customer_id,
       my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
       item
FROM payments;
The only difference from the former query is that AVG has been replaced with my_avg. We now need to implement our custom function.
On how to compute the average
Sum up all the elements, then divide by the number of elements. For customer_id of 7, that would be (13 + 22 + 19) / 3 = 18.
We can divide it into:
a step-by-step accumulation -- the sum.
a final operation -- division.
On how the aggregate function gets to the result
The average is computed in steps. Only the last value is necessary.
Start with an initial value of 0.
Feed 13. Compute the intermediate/accumulated sum, which is 13.
Feed 22. Compute the accumulated sum, which needs the previous sum plus this element: 13 + 22 = 35
Feed 19. Compute the accumulated sum, which needs the previous sum plus this element: 35 + 19 = 54. This is the total that needs to be divided by the number of element (3).
The result of step 3 is fed to another function that knows how to divide the accumulated sum by the number of elements.
What happened here is that the state started with the initial value of 0 and was changed with every step, then passed to the next step.
State travels between steps for as long as there is data. When all the data is consumed, the state goes to a final function (the terminal operation). We want the state to contain all the information needed by the accumulator as well as by the terminal operation.
In the specific case of computing the average, the terminal operation needs to know how many elements the accumulator worked with because it needs to divide by that. For that reason, the state needs to include both the accumulated sum and the number of elements.
We need a tuple that will contain both. The predefined POINT PostgreSQL type to the rescue: POINT(5, 89) here means 5 accumulated elements whose sum is 89. The initial state will be POINT(0,0).
The accumulator is implemented in what's called a state function. The terminal operation is implemented in what's called a final function.
When defining a custom aggregate function we need to specify:
the aggregate function name and return type
the initial state
the type of the state that the infrastructure will pass between steps and to the final function
a state function -- knows how to perform the accumulation steps
a final function -- knows how to perform the terminal operation. Not always needed (e.g. in a custom implementation of SUM the final value of the accumulated sum is the result.)
Here's the definition for the custom aggregate function.
CREATE AGGREGATE my_avg (NUMERIC) (  -- NUMERIC is what the function returns
    initcond = '(0,0)',       -- the initial state, of type POINT
    stype = POINT,            -- the type of the state passed between steps
    sfunc = my_acc,           -- computes a new state from the existing state (type POINT) and the step's element (type NUMERIC)
    finalfunc = my_final_func -- takes the final state (type POINT) and returns the aggregate's result (NUMERIC)
);
The only thing left is to define two functions my_acc and my_final_func.
CREATE FUNCTION my_acc (state POINT, elem_for_step NUMERIC) -- performs the accumulated sum
RETURNS POINT
LANGUAGE SQL
AS $$
    -- state[0] is the number of elements, state[1] is the accumulated sum
    SELECT POINT(state[0] + 1, state[1] + elem_for_step);
$$;

CREATE FUNCTION my_final_func (POINT) -- performs the division and returns the final value
RETURNS NUMERIC
LANGUAGE SQL
AS $$
    -- $1[1] is the sum, $1[0] is the number of elements
    SELECT ($1[1] / $1[0])::NUMERIC;
$$;
Now that the functions are available, the CREATE AGGREGATE statement defined above will run successfully, and the query based on my_avg instead of the built-in AVG can be run:
SELECT customer_id,
       my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
       item
FROM payments;
The results are identical with what you get when using the built-in AVG.
The PostgreSQL documentation suggests that users are limited to implementing user-defined aggregate functions:
In addition to these [pre-defined window] functions, any built-in or user-defined general-purpose or statistical aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function;
What I suspect ordered-set or hypothetical-set aggregates mean:
the value returned is identical for all rows (e.g. AVG and SUM; in contrast, RANK returns different values for the rows in a group depending on more sophisticated criteria)
it makes no sense to ORDER BY when PARTITIONing, because the values are the same for all rows anyway; in contrast, we do want to ORDER BY when using RANK()
Query:
SELECT customer_id, item, rank() OVER (PARTITION BY customer_id ORDER BY amount desc) FROM payments;
Geometric mean
The following is a user-defined aggregate function for which I found no built-in equivalent; it may be useful to some.
The state function computes the average of the natural logarithms of the terms.
The final function raises constant e to whatever the accumulator provides.
CREATE OR REPLACE FUNCTION sum_of_log(state POINT, curr_val NUMERIC)
RETURNS POINT
LANGUAGE SQL
AS $$
    SELECT POINT(state[0] + 1,
                 (state[1] * state[0] + LN(curr_val)) / (state[0] + 1));
$$;

CREATE OR REPLACE FUNCTION e_to_avg_of_log(POINT)
RETURNS NUMERIC
LANGUAGE SQL
AS $$
    SELECT EXP($1[1])::NUMERIC;
$$;
CREATE AGGREGATE geo_mean (NUMERIC)
(
    stype = POINT,
    initcond = '(0,0)', -- represents a POINT value: (count, average of logs)
    sfunc = sum_of_log,
    finalfunc = e_to_avg_of_log
);
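A quick usage sketch, treating geo_mean as a window function over the payments table from earlier in this thread:

SELECT customer_id,
       item,
       geo_mean(amount) OVER (PARTITION BY customer_id) AS geo_avg
FROM payments;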
PL/R provides such functionality. See here for some examples. That said, I'm not sure that it (currently) meets your requirement of "keep[ing] a running sum and observation count and [not] sum[ming] up almost identical sets of rows at each iteration" (see here).