How to create a flag that increments by 1 based on conditions - postgresql

How can I create a flag by looking at values of consecutive variables?
For example, in the table (image) below:
For row #1, flag takes the value 1.
From row #2 onwards, it checks:
if variable1 = lag(variable2)
and variable2 = lag(variable1) then flag = lag(flag), else flag increments by 1.
In this case the condition doesn't match, so the flag takes the value 2.
For row #3:
Since it matches the above condition, the flag stays at 2.
For row #4: even though it matches the above condition, the flag changes to 3 because the previous two rows (row #2 and row #3) have already been matched with each other.
And so on.
The final flag will look like:

Bear in mind that your input data needs a defined sort order to implement a "moving flag" with 2-row-based aggregation. For the sake of this answer, I've added a row_number() function to generate the order in which your sample data is given.
Test data
create table flagtest( var1 text, var2 text);
insert into flagtest(var1,var2) values
('T','Z'),('B','A'),('A','B'),('B','A'),('A','B'),('A','B'),
('A','B'),('B','A'),('C','D'),('E','F'),('F','E'),('M','N');
Code
-- fourth part
select var1, var2, sum(change_flag_2_based) over (order by ordcol) as flag
from ( -- third part
    select *,
        case when
            lag(change_flag) over (order by ordcol) = 0
            and lag(change_flag, 2) over (order by ordcol) = 1
        then 1 else change_flag
        end as change_flag_2_based
    from ( -- second part
        select
            var1, var2, ordcol,
            case when
                var1 = lag(var2) over (order by ordcol) and
                var2 = lag(var1) over (order by ordcol)
            then 0 else 1
            end as change_flag
        from ( -- first part
            select var1, var2, row_number() over () as ordcol
            from flagtest
        ) ordered_data
    ) prep_aggr_all
) prep_aggr_two_rows_based;
How does it work?
The first part provides a column for ordering the input data in the later window functions. In practice this should be a column you already have in your table; the example uses the row_number() window function to generate such a numerical order. Note that row_number() over () has no ORDER BY of its own, so it only reflects the order in which the rows happen to be read.
The second part marks each row with an indicator, 1 or 0, saying whether the flag should change in that row. The strategy is the cross-equality check between the two variables of the current and the previous row. This is not a 2-row pair aggregation (yet).
The third part looks at the change indicators of the two previous rows: if the row one behind doesn't change the flag and the row two behind does, that pair is already complete, so the current row is marked as flag-changing (2-row-based flag).
The fourth part is a running sum over those indicators, which produces the final flags.
Output
var1 | var2 | flag
------+------+------
T | Z | 1
B | A | 2
A | B | 2
B | A | 3
A | B | 3
A | B | 4
A | B | 5
B | A | 5
C | D | 6
E | F | 7
F | E | 7
M | N | 8
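As a cross-check, the rule from the question can also be written procedurally. The following is a minimal Python sketch, not part of the original answer; the function name pair_flags and the pair_open variable are made up for illustration. It reproduces the flags shown above for the test data.

```python
def pair_flags(rows):
    """Assign flags to a list of (var1, var2) rows using the rule from the question."""
    flags = []
    flag = 0
    pair_open = False  # True while the previous row still waits for its partner
    prev = None
    for var1, var2 in rows:
        # Cross-equality with the previous row: var1 = lag(var2) and var2 = lag(var1)
        mirrored = prev is not None and var1 == prev[1] and var2 == prev[0]
        if mirrored and pair_open:
            pair_open = False  # completes the pair, flag stays the same
        else:
            flag += 1          # starts a new group
            pair_open = True
        flags.append(flag)
        prev = (var1, var2)
    return flags


test_rows = [("T", "Z"), ("B", "A"), ("A", "B"), ("B", "A"), ("A", "B"), ("A", "B"),
             ("A", "B"), ("B", "A"), ("C", "D"), ("E", "F"), ("F", "E"), ("M", "N")]
print(pair_flags(test_rows))
```

A reference like this is handy for testing edge cases. For instance, with six alternating rows after the first one (T/Z, B/A, A/B, B/A, A/B, B/A) the sketch gives 1, 2, 2, 3, 3, 4, whereas tracing the query by hand suggests it would give 3 for the last row, because the third part looks at change_flag rather than at the already adjusted indicator. It is worth verifying this against your own data.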

Related

PostgreSQL how to generate a partition row_number() with certain numbers overridden

I have an unusual problem I'm trying to solve with SQL where I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, while not breaking the sequence (unless the override causes a number to be used greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows I don't need to override the value, then unioning them together and somehow using coalesce to get the desired dynamically generated sequence value, or maybe there's some way I can utilise recursive.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24
step-by-step demo:db<>fiddle
SELECT
*,
COALESCE( -- 3
first_value(override_as_number) OVER w -- 2
, 1
)
+ row_number() OVER w - 1 -- 4, 5
FROM (
SELECT
*,
SUM( -- 1
CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
as grouped
FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
1. Create a new subpartition within your partitions: this cumulative sum creates a unique group id for every group of records that starts with a non-NULL override_as_number followed by NULL records. So, for instance, your (AAA, d) to (AAA, f) belong to the same subpartition/group.
2. first_value() gives the first value of such a subpartition.
3. The COALESCE ensures a non-NULL result from the first_value() function if your partition starts with a NULL record.
4. row_number() - 1 creates a row count within a subpartition, starting with 0.
5. Adding the first_value() of a subpartition to the row count creates your result: beginning with the one non-NULL record of a subpartition (adding the 0 row count), the first following NULL record results in the value +1, and so forth.
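The same idea can be sketched outside SQL to make the steps easier to follow. This is an illustrative Python version, not the answer's code; the function name dynamic_numbers and the tuple layout are assumptions. Each override restarts the counter (the subpartition), and a partition that starts without an override begins at 1 (the COALESCE).

```python
from itertools import groupby


def dynamic_numbers(rows):
    """rows: list of (grouped_by, secondary_order_by, override_as_number or None)."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        start, offset = 1, 0  # default start of 1 plays the role of COALESCE(..., 1)
        for grouped_by, order_key, override in partition:
            if override is not None:
                # A non-NULL override opens a new subpartition
                start, offset = override, 0
            result.append((grouped_by, order_key, start + offset))
            offset += 1  # corresponds to row_number() - 1 for the next row
    return result


sample = [("AAA", "a", 1), ("AAA", "b", None), ("AAA", "c", 3), ("AAA", "d", 3),
          ("AAA", "e", None), ("AAA", "f", None), ("AAA", "g", 999),
          ("XYZ", "a", None), ("ZZZ", "a", None), ("ZZZ", "b", None)]
for row in dynamic_numbers(sample):
    print(row)
```

For the sample rows this yields 1, 2, 3, 3, 4, 5, 999 for AAA, 1 for XYZ and 1, 2 for ZZZ.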
The query below gives the exact result for the sample data, but you should verify it against all the combinations you expect.
select c.*,COALESCE(c.override_as_number,c.act) as final FROM
(
select b.*, dense_rank() over(partition by grouped_by order by grouped_by, actual) as act from
(
select a.*,COALESCE(override_as_number,row_num) as actual FROM
(
select grouped_by , secondary_order_by ,
dense_rank() over ( partition by grouped_by order by grouped_by, secondary_order_by ) as row_num
,override_as_number,desired_dynamic_number from fiddle
) a
) b
) c ;
column "final" is the result
grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
AAA | a | 1 | 1 | 1 | 1 | 1 | 1
AAA | b | 2 | | 2 | 2 | 2 | 2
AAA | c | 3 | 3 | 3 | 3 | 3 | 3
AAA | d | 4 | 3 | 3 | 3 | 3 | 3
AAA | e | 5 | | 4 | 5 | 4 | 4
AAA | f | 6 | | 5 | 6 | 5 | 5
AAA | g | 7 | 999 | 999 | 999 | 6 | 999
XYZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | b | 2 | | 2 | 2 | 2 | 2
(10 rows)
Hope this helps!
The real world problem I was trying to solve did not have a nicely ordered secondary_order_by column, instead it would be something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve, a colleague solved this problem using a cartesian join, and I'm posting that solution below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in from table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
select row_number() over (order by seq4()) as gen_list
from table(generator(rowcount => 1000))
),
base as (
select *,
IFF(override_as_number IS NULL, row_number() OVER(PARTITION BY grouped_by, override_as_number order by random),override_as_number) as rownum
from "SANDPIT"."TEST"."SAMPLEDATA" order by grouped_by,override_as_number,random
) --select * from base order by grouped_by,random;
,
cart_product as (
select *
from tally_table cross join (Select distinct grouped_by from base ) as distinct_grouped_by
) --select * from cart_product;
,
filter_product as (
select *,
row_number() OVER(partition by cart_product.grouped_by order by cart_product.grouped_by,gen_list) as seq_order
from cart_product
where CONCAT(grouped_by,'~',gen_list) NOT IN (select concat(grouped_by,'~',override_as_number) from base where override_as_number is not null)
) --select * from try2 order by 2,3 ;
select base.grouped_by,
base.random,
base.override_as_number,
base.answer, -- This is hard coded as test data
IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base inner join filter_product on base.rownum = filter_product.seq_order and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
random;
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.

Postgres: Why DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s)?

Why can't I retrieve the first distinct row with any other expression in the ORDER BY? Why must the leftmost expression be the same expression I used in DISTINCT ON?
Well, the ORDER BY is needed to keep those rows together that share the same value for the "distinct columns". The database processes them sequentially discarding all subsequent rows from the same set. If the rows weren't sorted, this wouldn't be easily possible.
Assume this set of rows:
c1 | c2
---+----
1 | 100
2 | 10
1 | 200
2 | 15
If you want the c1 to be unique and pick the highest c2 you would need to use
select distinct on (c1) *
from the_table
order by c1, c2 desc;
The order by itself will generate the following result:
c1 | c2
---+----
1 | 200
1 | 100
2 | 15
2 | 10
By processing that result row by row, the database can now efficiently discard every row but the first for each c1 value, simply by checking whether that value changes from one row to the next. If the result weren't sorted, this check would become far more complicated.
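The "sort, then keep the first row of each run" idea can be illustrated with a few lines of Python. This is only an analogy, not how PostgreSQL is implemented internally; the function name distinct_on_first is made up. itertools.groupby, like the row-by-row scan described above, only detects a change of key between neighbouring rows.

```python
from itertools import groupby
from operator import itemgetter

rows = [(1, 100), (2, 10), (1, 200), (2, 15)]  # (c1, c2)


def distinct_on_first(rows):
    # ORDER BY c1, c2 DESC
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # Keep only the first row of each run of equal c1 values
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


print(distinct_on_first(rows))
# Without sorting, equal c1 values are not adjacent, so every row starts a new run
print(len(list(groupby(rows, key=itemgetter(0)))))
```

On the unsorted input the scan sees four runs instead of two, which is why the leftmost ORDER BY expressions have to match the DISTINCT ON expressions.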

Retain only 3 highest positive and negative records in a table

I am new to databases and postgres as such.
I have a table called names which has 2 columns name and value which gets updated every x seconds with new name value pairs. My requirement is to retain only 3 positive and 3 negative values at any point of time and delete the rest of the rows during each table update.
I use the following query to delete the old rows and retain the 3 positive and 3 negative values ordered by value.
delete from names
using (select *,
row_number() over (partition by value > 0, value < 0 order by value desc) as rn
from names ) w
where w.rn >=3
I am skeptical to use a conditional like value > 0 in a partition statement. Is this approach correct?
For example,
A table like this prior to delete :
name | value
--------------
test | 10
test1 | 11
test1 | 12
test1 | 13
test4 | -1
test4 | -2
My table after delete should look like :
name | value
--------------
test1 | 13
test1 | 12
test1 | 11
test4 | -1
test4 | -2
demo:db<>fiddle
This generally works as expected: value > 0 splits the values into all numbers > 0 and all numbers <= 0, and ORDER BY value orders each of these two groups as expected.
So, the only thing, I would change:
row_number() over (partition by value >= 0 order by value desc)
remove: , value < 0 (there is no need to split the groups further: you don't have any negative numbers in your positive group and vice versa.)
change: value > 0 to value >= 0, so that 0 lands in the non-negative group, where it ranks last and is therefore ignored for as long as possible
For deleting: If you want to keep the top 3 values of each direction:
you should change w.rn >= 3 into w.rn > 3 (it keeps the 3rd element as well)
you need to connect the subquery with the table records. In real cases you should use id columns for that. In your example you could take the value column: where n.value = w.value AND w.rn > 3
So, finally:
delete from names n
using (select *,
row_number() over (partition by value >= 0 order by value desc) as rn
from names ) w
where n.value = w.value AND w.rn > 3
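To see which rows the partitioned row_number() keeps, here is a small Python sketch of the same selection. It is an illustration under assumptions, not a replacement for the DELETE; the function name rows_to_keep is hypothetical. It splits on value >= 0, sorts each group descending, and keeps the first three, which corresponds to rn <= 3.

```python
def rows_to_keep(rows, keep=3):
    """rows: list of (name, value); returns the rows that survive the delete."""
    kept = []
    for is_non_negative in (True, False):  # the two partitions of value >= 0
        part = [r for r in rows if (r[1] >= 0) == is_non_negative]
        part.sort(key=lambda r: r[1], reverse=True)  # ORDER BY value DESC
        kept.extend(part[:keep])                     # rn <= keep
    return kept


names = [("test", 10), ("test1", 11), ("test1", 12), ("test1", 13),
         ("test4", -1), ("test4", -2)]
print(rows_to_keep(names))
```

For the example table this keeps 13, 12, 11, -1 and -2 and drops only the row with value 10.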
If it's not a hard requirement to delete the other rows, you could instead select only the rows you're interested in:
WITH largest AS (
SELECT name, value
FROM names
ORDER BY value DESC
LIMIT 3),
smallest AS (
SELECT name, value
FROM names
ORDER BY value ASC
LIMIT 3)
SELECT * FROM largest
UNION
SELECT * FROM smallest
ORDER BY value DESC

Postgres aggregate sum conditional on row comparison

So, I have data that looks something like this
User_Object | filesize | created_date | deleted_date
row 1 | 40 | May 10 | Aug 20
row 2 | 10 | June 3 | Null
row 3 | 20 | Nov 8 | Null
I'm building statistics to record user data usage, to graph it based on time-based datapoints. However, I'm having difficulty developing a query that, for each row, takes the sum of all rows before it, but only of the rows that existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:
SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
However, I want to alter this somehow so that there's a conditional for the window function to only get the sum of any row created before this User Object when that row doesn't have a deleted date also before this User Object.
This incorrect syntax illustrates what I want to do:
SELECT User_Object.id, User_Object.created,
SUM(CASE WHEN NOT window_function_row.deleted
OR window_function_row.deleted > User_Object.created
THEN filesize ELSE 0)
OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
When this function runs on the data that I have, it should output something like
id | created | sum_data_used|
1 | May 10 | 40
2 | June 3 | 50
3 | Nov 8 | 30
Something along these lines may work for you:
SELECT a.user_id
,MIN(a.created_date) AS created_date
,SUM(b.filesize) AS sum_data_used
FROM user_object a
JOIN user_object b ON (b.user_id <= a.user_id
AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
GROUP BY a.user_id
ORDER BY a.user_id
For each row, self-join, match id lower or equal, and with date overlap. It will be expensive because each row needs to look through the entire table to calculate the file size result. There is no cumulative operation taking place here, but I'm not sure there is a way around that.
Example table definition:
create table user_object(user_id int, filesize int, created_date date, deleted_date date);
Data:
1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>
Result:
1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30
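The join condition can be checked by hand with a short Python sketch that applies the same two predicates to the example data. This is an illustration only; the function name usage_at_creation and the tuple layout (user_id, filesize, created_date, deleted_date) are assumptions.

```python
from datetime import date

objects = [
    (1, 40, date(2016, 5, 10), date(2016, 8, 29)),
    (2, 10, date(2016, 6, 3), None),
    (3, 20, date(2016, 11, 8), None),
]


def usage_at_creation(objects):
    out = []
    for a_id, _, a_created, _ in objects:
        total = sum(
            size
            for b_id, size, _, b_deleted in objects
            # b.user_id <= a.user_id
            # AND COALESCE(b.deleted_date, a.created_date) >= a.created_date
            if b_id <= a_id and (b_deleted is None or b_deleted >= a_created)
        )
        out.append((a_id, a_created, total))
    return out


for row in usage_at_creation(objects):
    print(row)
```

The nested loop makes the cost visible: every row is compared with every other row, which is the quadratic behaviour mentioned above.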

Calculate length of a series of line segments

I have a table like the following:
X | Y | Z | node
----------------
1 | 2 | 3 | 100
2 | 2 | 3 |
2 | 2 | 4 |
2 | 2 | 5 | 200
3 | 2 | 5 |
4 | 2 | 5 |
5 | 2 | 5 | 300
X, Y, Z are 3D space coordinates of some points; a curve passes through all the corresponding points from the first row to the last row. I need to calculate the curve length between two adjacent points whose "node" column isn't null.
It would be great if I could directly insert the result into another table that has three columns: "first_node", "second_node", "curve_length".
I don't need to interpolate extra points into the curve, I just need to accumulate the lengths of all the straight lines. For example, in order to calculate the curve length between node 100 and 200, I need to sum the lengths of 3 straight lines: (1,2,3)<->(2,2,3), (2,2,3)<->(2,2,4), (2,2,4)<->(2,2,5)
EDIT
The table has an ID column, which is in increasing order from the first row to the last row.
To get a previous value in SQL, use the lag window function, e.g.
SELECT
x,
lag(x) OVER (ORDER BY id) as prev_x, ...
FROM ...
ORDER BY id;
That lets you get the previous and next points in 3-D space for a given segment. From there you can trivially calculate the line segment length using regular geometric maths.
You'll now have the lengths of each segment (sqlfiddle query). You can use this as input into other queries, using SELECT ... FROM (SELECT ...) subqueries or a CTE (WITH ....) term.
It turns out to be pretty awkward to go from the node segment lengths to node-to-node lengths. You need to create a table that spans the null entries, using a recursive CTE or with a window function.
I landed up with this monstrosity:
SELECT
array_agg(from_id) AS seg_ids,
-- 'max' is used here like 'coalesce' for an aggregate,
-- since non-null is greater than null
max(from_node) AS from_node,
max(to_node) AS to_node,
sum(seg_length) AS seg_length
FROM (
-- lengths of all sub-segments with the null last segment
-- removed and a partition counter added
SELECT
*,
-- A running counter that increments when the
-- node ID changes. Allows us to group by series
-- of nodes in the outer query.
sum(CASE WHEN from_node IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY from_id) AS partition_id
FROM
(
-- lengths of all sub-segments
SELECT
id AS from_id,
lead(id, 1) OVER (ORDER BY id) AS to_id,
-- length of sub-segment
sqrt(
(x - lead(x, 1) OVER (ORDER BY id)) ^ 2 +
(y - lead(y, 1) OVER (ORDER BY id)) ^ 2 +
(z - lead(z, 1) OVER (ORDER BY id)) ^ 2
) AS seg_length,
node AS from_node,
lead(node, 1) OVER (ORDER BY id) AS to_node
FROM
Table1
) sub
-- filter out the last row
WHERE to_id IS NOT NULL
) seglengths
-- Group into series of sub-segments between two nodes
GROUP BY partition_id;
Credit to How do I efficiently select the previous non-null value? for the partition trick.
Result:
seg_ids | from_node | to_node | seg_length
---------+-----------+---------+------------
{1,2,3} | 100 | 200 | 3
{4,5,6} | 200 | 300 | 3
(2 rows)
To insert directly into another table, use INSERT INTO ... SELECT ....
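For comparison, the node-to-node accumulation is straightforward to express procedurally. The following Python sketch is not part of the answer; the function name node_to_node_lengths is hypothetical, and it assumes the rows are already in id order. It can serve as a reference when checking the SQL result.

```python
from math import dist  # Euclidean distance, available since Python 3.8

# (x, y, z, node) in id order; node is None for intermediate points
points = [(1, 2, 3, 100), (2, 2, 3, None), (2, 2, 4, None), (2, 2, 5, 200),
          (3, 2, 5, None), (4, 2, 5, None), (5, 2, 5, 300)]


def node_to_node_lengths(points):
    results = []
    start_node = None
    length = 0.0
    prev_xyz = None
    for x, y, z, node in points:
        xyz = (x, y, z)
        if prev_xyz is not None and start_node is not None:
            length += dist(prev_xyz, xyz)  # length of the current sub-segment
        if node is not None:
            if start_node is not None:
                results.append((start_node, node, length))
            start_node, length = node, 0.0  # a node closes one curve and opens the next
        prev_xyz = xyz
    return results


print(node_to_node_lengths(points))
```

Each tuple is (first_node, second_node, curve_length), matching the columns of the target table described in the question.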