Get the next (or previous) non-null value in multiple partitioned columns - postgresql

Sample data below.
I want to clean up data by taking the next non-null value for the same id, ordered by row (actually a timestamp).
I can't use lag(), because in some cases there are consecutive nulls.
I can't use coalesce(a.col_a, (select min(b.col_a) from table b where a.id = b.id)) because it returns an "outdated" value (e.g. NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where there is no next non-null value, like col_b rows 9/10, to just fill in the last known value.)
The only thing I can think of is to do
table_x as (select id, row, col_x from table where col_x is not null)
for each column, and then join, taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns, and that feels cumbersome and inefficient.
Appreciate any help!
row  id  col_a  col_a_desired  col_b  col_b_desired
---  --  -----  -------------  -----  -------------
0    1   -      NYC            red    red
1    1   NYC    NYC            red    red
2    1   SF     SF             -      blue
3    1   -      SF             -      blue
4    1   SF     SF             blue   blue
5    2   PAR    PAR            red    red
6    2   LON    LON            -      blue
7    2   LON    LON            -      blue
8    2   -      LON            blue   blue
9    2   LON    LON            -      blue
10   2   -      LON            -      blue

Can you try this query? (Note: it uses BigQuery syntax, IFNULL and FIRST_VALUE ... IGNORE NULLS, per the references below. PostgreSQL does not support IGNORE NULLS in window functions, so a PostgreSQL port is sketched after the references.)
WITH samp AS (
  SELECT 0 row_id, 1 id, NULL col_a, 'red' col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
SELECT
  row_id,
  id,
  IFNULL(FIRST_VALUE(col_a IGNORE NULLS)
           OVER (PARTITION BY id ORDER BY row_id
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
         FIRST_VALUE(col_a IGNORE NULLS)
           OVER (PARTITION BY id ORDER BY row_id DESC
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  IFNULL(FIRST_VALUE(col_b IGNORE NULLS)
           OVER (PARTITION BY id ORDER BY row_id
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
         FIRST_VALUE(col_b IGNORE NULLS)
           OVER (PARTITION BY id ORDER BY row_id DESC
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
FROM samp
ORDER BY id, row_id
Output: the desired col_a_desired / col_b_desired values from the question.
References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
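Since the question is tagged postgresql, which has no IGNORE NULLS, here is a hedged sketch of the same "next non-null, else previous non-null" logic using LATERAL subqueries (shown for col_a only, on a reduced sample):

WITH samp (row_id, id, col_a) AS (
  VALUES (0, 1, NULL), (1, 1, 'NYC'), (2, 1, 'SF'),
         (3, 1, NULL), (4, 1, 'SF')
)
SELECT s.row_id, s.id,
       COALESCE(nxt.col_a, prv.col_a) AS col_a  -- next non-null, else previous
FROM samp s
LEFT JOIN LATERAL (  -- first non-null value at or after this row
    SELECT n.col_a FROM samp n
    WHERE n.id = s.id AND n.row_id >= s.row_id AND n.col_a IS NOT NULL
    ORDER BY n.row_id
    LIMIT 1
) nxt ON true
LEFT JOIN LATERAL (  -- last non-null value before this row
    SELECT p.col_a FROM samp p
    WHERE p.id = s.id AND p.row_id < s.row_id AND p.col_a IS NOT NULL
    ORDER BY p.row_id DESC
    LIMIT 1
) prv ON true
ORDER BY s.id, s.row_id;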

I want to clean up data based on the next non-null value.
So if you reverse the order, that's the last non-null value.
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).
The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
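A minimal plpgsql sketch of that cursor loop, assuming a hypothetical table my_table(id int, row_ts timestamp, col_a text, col_b text) where row_ts is the ordering timestamp (the names are placeholders, not from the question):

DO $$
DECLARE
    cur CURSOR FOR
        SELECT id, col_a, col_b
        FROM my_table
        ORDER BY id, row_ts DESC   -- reverse order: each "next" value is seen first
        FOR UPDATE;
    rec        record;
    last_id    int;
    last_col_a text;
    last_col_b text;
BEGIN
    FOR rec IN cur LOOP
        IF rec.id IS DISTINCT FROM last_id THEN
            -- new chunk: forget the remembered values
            last_id    := rec.id;
            last_col_a := NULL;
            last_col_b := NULL;
        END IF;
        IF rec.col_a IS NULL OR rec.col_b IS NULL THEN
            UPDATE my_table
               SET col_a = COALESCE(col_a, last_col_a),
                   col_b = COALESCE(col_b, last_col_b)
             WHERE CURRENT OF cur;
        END IF;
        last_col_a := COALESCE(rec.col_a, last_col_a);
        last_col_b := COALESCE(rec.col_b, last_col_b);
    END LOOP;
END $$;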

Related

Can lead() return the next row only when a condition is met?

Recently my company upgraded from SQL Server 2008 to 2016, so I want to take advantage of some "new" features, one of which is lead().
I understand the basic usage, but I want to know if I can return the next row only when a condition is met. My original query looked like the following, where x.next_id is null if the next row isn't more than 12 days past the current row.
SELECT
    a.id,
    a.date_a,
    x.next_id
FROM
    table a
    OUTER APPLY
    (SELECT TOP 1
         next_id = i.intIndex
     FROM
         table i
     WHERE
         i.date_a > DATEADD(DAY, 12, a.date_a)
     ORDER BY
         date_a, id ASC) x
ORDER BY
    date_a, id ASC
Data might look like the following, where the third column is added by the query:
id       date_a      next_id
----------------------------
1798678  2014-12-01  NULL
1798689  2013-01-05  1798688
1798688  2014-03-31  NULL
1798696  2013-04-03  1798694
1798694  2013-08-12  1798691
1798691  2014-09-30  NULL
1798698  2013-05-14  1798697
1798697  2013-08-29  NULL
Assuming this data set (your result table; minus the result column):
CREATE TABLE some_table(id INT PRIMARY KEY,date_a DATE);
INSERT INTO some_table(id,date_a)
VALUES (1798678,'2014-12-01'),
(1798689,'2013-01-05'),
(1798688,'2014-03-31'),
(1798696,'2013-04-03'),
(1798694,'2013-08-12'),
(1798691,'2014-09-30'),
(1798698,'2013-05-14'),
(1798697,'2013-08-29');
This query returns the same result set as what the query you have returns:
SELECT
    id,
    date_a,
    next_id =
        CASE WHEN LEAD(date_a) OVER (ORDER BY date_a, id) > DATEADD(DAY, 12, date_a)
             THEN LEAD(id) OVER (ORDER BY date_a, id)
             ELSE NULL
        END
FROM
    some_table
ORDER BY
    date_a, id;
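As an aside, since the rest of this page is PostgreSQL: the same CASE-plus-LEAD pattern ports directly there, with an interval in place of DATEADD; a sketch:

SELECT id, date_a,
       CASE WHEN LEAD(date_a) OVER w > date_a + INTERVAL '12 days'
            THEN LEAD(id) OVER w
       END AS next_id
FROM some_table
WINDOW w AS (ORDER BY date_a, id)
ORDER BY date_a, id;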

Update null values in a column based on non null values percentage of the column

I need to update the null values of a column in a table, for each category, based on the percentage distribution of the non-null values. There are only two non-null values in the column, CV and NCV, making up roughly 38% and 63% of the non-null rows respectively. The number of rows with null values is 7, and I need to randomly populate them according to the same shares: 38% (CV) of 7 ≈ 3 rows, 63% (NCV) of 7 ≈ 4 rows.
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
  select
    (select count(*)::numeric from the_table where type = 'cv')
      / (select count(*) from the_table where type is not null) as cv_pct,
    (select count(*)::numeric from the_table where type = 'ncv')
      / (select count(*) from the_table where type is not null) as ncv_pct,
    (select count(*) from the_table where type is null) as null_count
), calc as (
  select d.ctid,
         p.cv_pct,
         p.ncv_pct,
         row_number() over () as rn,
         case
           when row_number() over () <= round(null_count * p.cv_pct) then 'cv'
           else 'ncv'
         end as new_type
  from the_table d
    cross join pcts p
  where type is null
)
update the_table t
   set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness).
The second CTE then calculates, for each row with a NULL type, which new type should be used. This is done by comparing the "current" row number with the expected number of 'cv' rows, null_count * cv_pct, in the CASE expression.
This is then used to update the target table. I have used the ctid as an alternative for a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column.
I wouldn't be surprised though, if there was a shorter, more efficient way to do it, but for now I can't think of a better alternative.
Online example
If you are on PG11 or later, you can use the groups frame to do this in what should be close to a single pass (except reordering for output when sorted by tid) with window functions:
select tid, category, id, type,
case
when type is not null then type
when round(
(count(*) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding))::numeric /
coalesce(
nullif(
count(*) over (partition by category
order by type nulls last
groups 2 preceding
exclude group), 0), 1
) *
count(*) over (partition by category
order by type nulls last
groups current row)
) >= row_number() over (partition by category, type
order by tid)
then
first_value(type) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding)
else
first_value(type) over (partition by category
order by type nulls last
groups 1 preceding
exclude group)
end as extended_type
from cv_ncv
order by tid;
Working fiddle here.

PostgreSQL Crosstab issues / "Return and SQL tuple descriptions are incompatible"

Good afternoon, I am using PostgreSQL version 9.2 and I'm trying to use the crosstab function to transpose two columns on a table so that I can later join the result to a different SELECT query.
I have installed the tablefunc extension.
However I keep getting this "Return and SQL tuple descriptions are incompatible" error, which seems to be caused by typecasts.
I don't need them to be a specific type.
My original SELECT query is this
SELECT inventoryid, ttype, tamount
FROM inventorytesting
Which gives me the following result:
inventoryid       ttype  tamount
2451530088940460  7      0.2
2451530088940460  2      0.5
2451530088940460  8      0.1
2451530088940460  1      15.7
8751530077940461  7      0.7
8751530077940461  2      0.2
8751530077940461  8      1.1
8751530077940461  1      19.2
and my goal is to get it like:
inventoryid       7    2    8    1
8751530077940461  0.7  0.2  1.1  19.2
2451530088940460  0.2  0.5  0.1  15.7
The 'ttype' field has 49 different fixed values, such as "7", "2", "8" and "1".
The 'tamount' field varies depending on the 'inventoryid' field, but there will always be 49 rows per inventoryid, even if the value is zero. It will never be null.
I have tried a few variations that i could find in the internet which sum up to this:
SELECT *
FROM crosstab (
$$SELECT inventoryid, ttype, tamount
FROM inventorytesting
WHERE inventoryid = '2451530088940460'
ORDER BY inventoryid, ttype$$
)
AS ct("inventoryid" text,"ttype" smallint,"tamount" numeric)
The fieldtypes on the inventorytesting table are
select column_name, data_type from information_schema.columns
where table_name = 'inventorytesting'
Results:
column_name  data_type
-----------  ---------
id           bigint
ttype        smallint
tamount      numeric
tunit        text
tlessthan    smallint
plantid      text
sessiontime  bigint
deleted      smallint
inventoryid  text
docdata      text
docname      text
labid        bigint
Any pointers would be great.
demo:db<>fiddle
The resulting table definition has to contain the table structure you are expecting - the pivoted one - and not the structure of the given one:
SELECT *
FROM crosstab(
$$SELECT inventoryid, ttype, tamount
FROM inventorytesting
WHERE inventoryid = '2451530088940460'
ORDER BY inventoryid, ttype$$
)
AS ct("inventoryid" text,"type1" numeric,"type2" numeric,"type7" numeric,"type8" numeric)
Additionally, there is no need to use the crosstab function. You can achieve the pivot by simply using standard CASE expressions:
SELECT
inventoryid,
SUM(CASE WHEN ttype = 1 THEN tamount END) AS type1,
SUM(CASE WHEN ttype = 2 THEN tamount END) AS type2,
SUM(CASE WHEN ttype = 7 THEN tamount END) AS type7,
SUM(CASE WHEN ttype = 8 THEN tamount END) AS type8
FROM
inventorytesting
GROUP BY 1
If you were on 9.4 or higher you could use the Postgres specific FILTER clause:
SELECT
inventoryid,
SUM(tamount) FILTER (WHERE ttype = 1) AS type1,
SUM(tamount) FILTER (WHERE ttype = 2) AS type2,
SUM(tamount) FILTER (WHERE ttype = 7) AS type7,
SUM(tamount) FILTER (WHERE ttype = 8) AS type8
FROM
inventorytesting
GROUP BY 1
demo:db<>fiddle
With the crosstab, you define the actual result table (basically the result of the pivot). The input query defines three columns, which are then processed as:
the grouping column, which results in the actual rows
the pivot columns
value for the pivot column
In your case, the crosstab therefore needs to be defined as:
ct(
"inventoryid" text,
"tamount_1" numeric,
"tamount_2" numeric,
"tamount_3" numeric,
...
)
The column header will then correlate to a certain value of column ttype in the order as defined by the inner query's ORDER BY.
The thing with crosstab is that if values for ttype are missing (e.g. some value is returned for 4 but not for 3), the resulting columns shift: they would hold the values for 1, 2, 4, ... with 3 simply skipped rather than left NULL. Here, you'd have to make sure (if you need consistent output) that your inner query returns at least a NULL row for every ttype (e.g. via a LEFT JOIN); one way to do that is sketched below.
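For instance, assuming the 49 ttype values are simply 1 through 49 (an assumption; the question doesn't list them), the inner query could be padded like this sketch:

SELECT i.inventoryid, t.ttype, it.tamount
FROM (SELECT DISTINCT inventoryid FROM inventorytesting) i
CROSS JOIN generate_series(1, 49) AS t(ttype)  -- one row per possible ttype
LEFT JOIN inventorytesting it
       ON it.inventoryid = i.inventoryid
      AND it.ttype = t.ttype                   -- tamount stays NULL where missing
ORDER BY i.inventoryid, t.ttype;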

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on that card in the last 90 days.
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query into sql window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But result set does not match, how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id | the_max
--------+--------
      1 |      30
As far as I know, the PostgreSQL versions current at the time of the question (9.x) don't support a bounded RANGE frame with an interval offset, so range between '90 days' preceding won't work there (PostgreSQL 11 later added such frames; see the sketch after the next query). They do support bounded ROWS frames such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following so that the window function operates on one row per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
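For completeness, PostgreSQL 11 and later do accept interval offsets in RANGE frames, so on a modern version the Vertica answer above ports almost verbatim; a sketch:

WITH olap_output AS (
  SELECT
    card_id,
    tran_dt,
    MAX(amount) OVER (
      PARTITION BY card_id
      ORDER BY tran_dt
      RANGE BETWEEN INTERVAL '90 days' PRECEDING AND CURRENT ROW
    ) AS the_max
  FROM test
)
SELECT card_id, the_max
FROM olap_output
WHERE tran_dt = DATE '2017-07-06';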
For what you need (based on your question description), I would stick to using group by.

PostgreSQL, Custom aggregate

Is there a way to get something like a custom aggregate when MAX and SUM are not enough to get the result?
Here is my table:
DROP TABLE IF EXISTS temp1;
CREATE TABLE temp1(mydate text, code int, price decimal);
INSERT INTO temp1 (mydate, code, price) VALUES
('01.01.2014 14:32:11', 1, 9.75),
( '', 1, 9.99),
( '', 2, 40.13),
('01.01.2014 09:12:04', 2, 40.59),
( '', 3, 18.10),
('01.01.2014 04:13:59', 3, 18.20),
( '', 4, 10.59),
('01.01.2014 15:44:32', 4, 10.48),
( '', 5, 8.19),
( '', 5, 8.24),
( '', 6, 11.11),
('04.01.2014 10:22:35', 6, 11.09),
('01.01.2014 11:48:15', 6, 11.07),
('01.01.2014 22:18:33', 7, 22.58),
('03.01.2014 13:15:40', 7, 21.99),
( '', 7, 22.60);
Here is query for getting result:
SELECT code,
ROUND(AVG(price), 2),
MAX(price)
FROM temp1
GROUP BY code
ORDER BY code;
In short:
For every grouped code I have to get the LAST price by date (the date is written as text): the price of the row with the latest date, or 0 if no date is written.
The LAST column below holds the wanted result; the AVG and MAX results are shown for illustration:
CODE  LAST   AVG    MAX
-----------------------
1     9.75   9.87   9.99
2     40.59  40.36  40.59
3     18.20  18.15  18.20
4     10.48  10.54  10.59
5     0.00   8.22   8.24
6     11.09  11.09  11.11
7     21.99  22.39  22.60
How would I get the wanted result?
What would that query look like?
EDITED
I simply had to try IMSoP's advice and use the custom aggregate functions first/last.
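For reference, last() is not built in; a definition along the lines of the PostgreSQL wiki's first/last aggregates looks like this (a sketch; the STRICT transition function is what makes NULL inputs get skipped):

-- transition function: just keep the most recent non-null input
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $2';

CREATE AGGREGATE last (anyelement) (
  SFUNC = last_agg,
  STYPE = anyelement
);

With that aggregate in place: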
SELECT code,
       CASE WHEN MAX(mydate) <> '' THEN
            (SELECT last(price ORDER BY TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')))
       ELSE
            0
       END AS "LAST",
       ROUND(AVG(price), 2) AS "AVG",
       MAX(price) AS "MAX"
FROM temp1
GROUP BY code
ORDER BY code;
With this simple query I get the same results as with Mike's more complex query.
What's more, it copes better with duplicate (identical) entries in the mydate column, and it is faster.
Is this possible? It looks similar to 'SELECT * FROM magic()' :)
You said in comments that one code can have two rows with the same date. So this is sane data.
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
There's no deterministic way to tell which of those rows is the "last" one, even if you sort by code and "date". (In the relational model--a model based on mathematical sets--the order of columns is irrelevant, and the order of rows is irrelevant.) The query optimizer is free to return rows in the way it thinks best, so this query
select *
from temp1
order by mydate, code
might return this on one run,
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
and this on another.
01.01.2014 1 3.50
01.01.2014 1 99.34
01.01.2014 1 17.25
Unless you store some value that makes the meaning of last obvious, what you're trying to do isn't possible. When people need to make last obvious, they usually use a timestamp.
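For example, a minimal sketch of that fix (the column name is hypothetical):

ALTER TABLE temp1
  ADD COLUMN inserted_at timestamptz NOT NULL DEFAULT now();

-- "last" per code is now well defined:
SELECT DISTINCT ON (code) code, price
FROM temp1
ORDER BY code, inserted_at DESC;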
After your changes, this query seems to return what you're looking for.
with distinct_codes as (
  select distinct code
  from temp1
),
corrected_table as (
  select
    case when mydate <> '' then TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')
         else null
    end as mydate,
    code,
    price
  from temp1
),
max_dates as (
  select code, max(mydate) as max_date
  from corrected_table
  group by code
)
select c1.mydate, d1.code, coalesce(c1.price, 0)
from corrected_table c1
inner join max_dates m1
        on m1.code = c1.code
       and m1.max_date = c1.mydate
right join distinct_codes d1
        on d1.code = c1.code
order by code;