PostgreSQL: FIRST_VALUE as aggregate function

In Postgres, we want to use a window function as an aggregate function.
We have a table where every row consists of two timestamps and a value. We first extend the table by adding a column with the difference between the timestamps; only a few distinct differences are possible. Then we group the data by timestamp1 and timediff. Each group can contain more than one row, and from each group we need to pick a single value: the one with the smallest timestamp2.
SELECT
    timestamp1,
    timediff,
    FIRST_VALUE(value) OVER (ORDER BY timestamp2) AS value
FROM (
    SELECT
        timestamp1,
        timestamp2,
        value,
        timestamp2 - timestamp1 AS timediff
    FROM forecast_table
    WHERE device = 'TEST'
) sq
GROUP BY timestamp1, timediff
ORDER BY timestamp1
Error: column "sq.value" must appear in the GROUP BY clause or be used in an aggregate function

You can work around this by aggregating into an array, then picking the first array element:
SELECT
    timestamp1,
    timediff,
    (array_agg(value ORDER BY timestamp2))[1] AS value
FROM (
    SELECT
        timestamp1,
        timestamp2,
        value,
        timestamp2 - timestamp1 AS timediff
    FROM forecast_table
    WHERE device = 'TEST'
) sq
GROUP BY timestamp1, timediff
ORDER BY timestamp1

Or you can use DISTINCT ON with a custom ORDER BY:
SELECT DISTINCT ON (timestamp1, timediff)
timestamp1, timestamp2, value,
timestamp2 - timestamp1 AS timediff
FROM forecast_table WHERE device = 'TEST'
ORDER BY timestamp1, timediff, timestamp2;

There is no need for GROUP BY if you are not actually doing any aggregation.
You can get what you want if you put PARTITION BY timestamp1, timestamp2 - timestamp1 inside the OVER() clause of FIRST_VALUE():
SELECT DISTINCT timestamp1,
FIRST_VALUE(value) OVER (PARTITION BY timestamp1, timestamp2 - timestamp1 ORDER BY timestamp2) AS value,
timestamp2 - timestamp1 AS timediff
FROM forecast_table
WHERE device = 'TEST'
ORDER BY timestamp1, timediff;

Related

PostgreSQL group by change in criteria and timestamp bins

I have a table with 4 columns: name, criteria, timestamp1 and timestamp2. I also have the query below that groups by change in the criteria for each name when sorted by timestamp1:
Select name, criteria, min(timestamp1), max(timestamp1), min(timestamp2), max(timestamp2), grp
from (
    SELECT name, timestamp1, criteria, timestamp2,
           row_number() OVER (PARTITION BY name ORDER BY timestamp1)
           - row_number() OVER (PARTITION BY name, criteria ORDER BY timestamp1) grp
    from my_table -- "table" is a reserved word; my_table stands in for the real name
) foo
group by name, criteria, grp
Unfortunately, the data has overlapping timestamp1 and name values (due to timestamp1 errors) that can only be separated by timestamp2, which is when the record was added to the DB.
What I want to do is order by name and timestamp1, and create a new grp any time the criteria changes or timestamp2 differs from the first timestamp2 in the group by more than two weeks.
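A sketch of one way to approach this (not from the original thread; it assumes the my_table name used above): flag a row as the start of a new group whenever the criteria changes or timestamp2 jumps by more than two weeks, then turn the flags into group numbers with a running sum. Note that this compares each row with the previous row rather than with the first row of its group, which is a simpler approximation of the stated rule:
select name, criteria, min(timestamp1), max(timestamp1), min(timestamp2), max(timestamp2), grp
from (
    select *,
           -- a running sum of the flags yields one group number per island
           sum(is_new_grp) over (partition by name order by timestamp1, timestamp2) as grp
    from (
        select *,
               case
                   -- new group when criteria changes (also true on the first row per name) ...
                   when lag(criteria) over w is distinct from criteria
                   -- ... or when timestamp2 jumps by more than two weeks
                     or timestamp2 - lag(timestamp2) over w > interval '2 weeks'
                   then 1
                   else 0
               end as is_new_grp
        from my_table
        window w as (partition by name order by timestamp1, timestamp2)
    ) flagged
) grouped
group by name, criteria, grp;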

How to get the timestamp associated with a percentile(x) value using the TimescaleDB time_bucket

I need to find the percentile(50) value and its timestamp using the TimescaleDB time_bucket. Finding P50 is easy, but I don't know how to get the timestamp.
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
I think we can do what you're looking for by selecting the rows where int_val equals the median value in a LATERAL subquery. (percentile_disc ensures that there is a value exactly equal to the median; there may be more than one such row, and you could deal with ties in different ways depending on what you want.) Building on a previous answer and making it work a bit better, I think it would look something like this:
WITH p50 AS (
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
) SELECT p50.*, rmed.*
FROM p50, LATERAL (SELECT * FROM timeseries.raw r
-- copy over the same where clause from above so we're dealing with the same subset of data
WHERE timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
-- add a where clause on the median value
AND r.int_val = p50.medianVal
-- now add a where clause to account for the time bucket
AND r.timestamp_utc >= p50.interval_size
AND r.timestamp_utc < p50.interval_size + '120 sec'::interval
-- Can add an order by something desc limit 1 if you want to avoid ties
) rmed;
Note that this will do a second scan of the table. It should be reasonably efficient, especially if you have an index on that column, but it will cause another scan; I don't know of a great way of doing it without one.
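For instance, a composite index along these lines (a hypothetical suggestion, matching the columns in the WHERE clauses above) would let the LATERAL lookup use an index scan rather than a sequential scan:
-- hypothetical index; the equality columns come first, the range column last
CREATE INDEX ON timeseries.raw (tag_id, int_val, timestamp_utc);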

Update null values in a column based on non null values percentage of the column

I need to update the null values of a column in a table, for each category, based on the percentage split of the non-null values. (The sample table and the percentage breakdown were shown as images.)
There are only two types of values in the column.
The number of rows with null values is 7, and I need to randomly populate them according to the percentage share of the non-null values: 38% (CV) of 7 = 3, and 63% (NCV) of 7 = 4.
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
    select
        (select count(*)::numeric from the_table where type = 'cv')
            / (select count(*) from the_table where type is not null) as cv_pct,
        (select count(*)::numeric from the_table where type = 'ncv')
            / (select count(*) from the_table where type is not null) as ncv_pct,
        (select count(*) from the_table where type is null) as null_count
), calc as (
    select d.ctid,
           p.cv_pct,
           p.ncv_pct,
           row_number() over () as rn,
           case
               when row_number() over () <= round(null_count * p.cv_pct) then 'cv'
               else 'ncv'
           end as new_type
    from the_table d
    cross join pcts p
    where type is null
)
update the_table t
set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness).
The second then calculates, for each NULL row, which new type should be used. This is done by comparing the "current" row number with the expected number of rows for the type (the CASE expression).
This is then used to update the target table. I have used the ctid as an alternative to a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column.
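For example, if the table had a primary key column id (a hypothetical name), the calc CTE would select d.id instead of d.ctid, and the final statement would become:
-- hypothetical variant joining on a primary key "id" instead of ctid
update the_table t
set type = c.new_type
from calc c
where t.id = c.id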
I wouldn't be surprised, though, if there were a shorter, more efficient way to do it, but for now I can't think of a better alternative.
Online example
If you are on PG 11 or later, you can use the GROUPS frame mode to do this in what should be close to a single pass (except for reordering the output when sorting by tid) with window functions:
select tid, category, id, type,
case
when type is not null then type
when round(
(count(*) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding))::numeric /
coalesce(
nullif(
count(*) over (partition by category
order by type nulls last
groups 2 preceding
exclude group), 0), 1
) *
count(*) over (partition by category
order by type nulls last
groups current row)
) >= row_number() over (partition by category, type
order by tid)
then
first_value(type) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding)
else
first_value(type) over (partition by category
order by type nulls last
groups 1 preceding
exclude group)
end as extended_type
from cv_ncv
order by tid;
Working fiddle here.

How to select corresponding record alongside aggregate function with having clause

Let's say I have an orders table with customer_id, order_total, and order_date columns. I'd like to build a report that shows all customers who haven't placed an order in the last 30 days, with a column for the total amount their last order was.
This gets all of the customers who should be on the report:
select customer,
       max(order_date),
       (select order_total
        from orders o2
        where o2.customer = orders.customer
        order by order_date desc
        limit 1)
from orders
group by 1
having max(order_date) < NOW() - '30 days'::interval
Is there a better way to do this that doesn't require a subquery but instead uses a window function or other more efficient method in order to access the total amount from the most recent order? The techniques from How to select id with max date group by category in PostgreSQL? are related, but the extra having restriction seems to stop me from using something like DISTINCT ON.
demo:db<>fiddle
Solution with the row_number window function (https://www.postgresql.org/docs/current/static/tutorial-window.html):
SELECT
customer, order_date, order_total
FROM (
SELECT
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total,
row_number() OVER w as row_count
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
) s
WHERE row_count = 1 AND order_date < CURRENT_DATE - 30
Solution with DISTINCT ON (https://www.postgresql.org/docs/9.5/static/sql-select.html#SQL-DISTINCT):
SELECT
customer, order_date, order_total
FROM (
SELECT DISTINCT ON (customer)
*,
first_value(order_date) OVER w as last_order,
first_value(order_total) OVER w as last_total
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
ORDER BY customer, order_date DESC
) s
WHERE order_date < CURRENT_DATE - 30
Explanation:
In both solutions I am working with the first_value window function. The window is partitioned by customer. The rows within each customer's partition are ordered descending by date, which puts the latest row first (last_value does not work as expected every time). So it is possible to get the order_date and order_total of each customer's most recent order.
The difference between the two solutions is the filtering. I show both versions because sometimes one of them is significantly faster.
The window-function style creates a row count within each partition, so every partition's first row can be filtered out later. This is done by adding a row_number window function. The benefit of this solution shows when you want the first two or three rows per group: you simply change the filter from WHERE row_count = 1 to WHERE row_count <= 2.
But if you want only one single row per group, you just need to ensure that the expected row is ordered first within its group. Then the DISTINCT ON clause discards all following rows: DISTINCT ON (customer) keeps the first (ordered) row per customer group.
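For example, a variant of the first solution that keeps each customer's two most recent orders, where only the outer filter changes (note the <= comparison):
SELECT
customer, order_date, order_total
FROM (
SELECT
*,
row_number() OVER w as row_count
FROM orders
WINDOW w AS (PARTITION BY customer ORDER BY order_date DESC)
) s
WHERE row_count <= 2 AND order_date < CURRENT_DATE - 30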
Try joining the table on itself:
select o1.customer, max(o1.order_date)
from orders o1
join orders o2 on o1.id = o2.id
group by o1.customer
having max(o1.order_date) < NOW() - '30 days'::interval
A subquery in the SELECT list is a bad idea, because the DB will execute it for each row.
If you use Postgres you can also try using a CTE:
https://www.postgresql.org/docs/9.6/static/queries-with.html
WITH t as (
select distinct on (customer) customer, order_total
from orders
order by customer, order_date desc
) select o1.customer, max(o1.order_date), t.order_total
from orders o1
join t on t.customer = o1.customer
group by o1.customer, t.order_total
having max(o1.order_date) < NOW() - '30 days'::interval

How to concatenate timestamps in different rows in PostgreSQL?

I'm looking for a way to concatenate timestamps from two different rows. For example, I have this table:
I want it to be grouped by weekday, concatenating the min(start_hour) with the max(stop_hour), to get something like this:
and I'm using this query to retrieve the first image's result.
The query below should give you what you are looking for, given the information supplied. I made some assumptions: that the '00:00:00' in the start and stop hours is not a valid time and can be ignored. If those values should be considered valid, then Friday's output would be one entry of '00:00:00' - '11:30:00'.
I created two CTEs, one for the start hours and the other for the stop hours, keeping only the values that are not '00:00:00'. I added a row number to the CTEs so I can match up the day and row_number to pair each start with its stop.
SELECT day
,array_to_string(array_agg(t.shift), ',') shifts
FROM (
WITH cte_start AS (
SELECT row_number() OVER (PARTITION BY day ORDER BY start_hour) -- ordered so the Nth start pairs with the Nth stop
,day
,start_hour
FROM test22
WHERE start_hour <> '00:00:00'::time
)
,cte_stop AS (
SELECT row_number() OVER (PARTITION BY day ORDER BY stop_hour) -- ordered to match the start-hour CTE
,day
,stop_hour
FROM test22
WHERE stop_hour <> '00:00:00'::time
)
SELECT cte_start.day
,cte_start.start_hour::varchar || ' - ' || cte_stop.stop_hour::varchar AS shift
FROM cte_start
LEFT OUTER JOIN cte_stop ON cte_start.day = cte_stop.day
AND cte_start.row_number = cte_stop.row_number
) T
GROUP BY T.day
-HTH