Postgres: given a set of numbers like "1,2,3,6,7,8,11,12,15,18,19,20", obtain the maximum of each group of consecutive numbers

The consecutive numbers are grouped by the query below, but I don't know how to obtain the maximum of each group of consecutive numbers:
with trans as (
select c1,
case when lag(c1) over (order by c1) = c1 - 1 then 0 else 1 end as new
from table1
), groups as (
select c1, sum(new) over (order by c1) as grpnum
from trans
), ranges as (
select grpnum, min(c1) as low, max(c1) as high
from groups
group by grpnum
), texts as (
select grpnum,
case
when low = high then low::text
else low::text||'-'||high::text
end as txt
from ranges
)
select string_agg(txt, ',' order by grpnum) as number
from texts;
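Note that the ranges CTE above already computes max(c1) as high, so the maximum of each group falls out of the same building blocks; a minimal sketch reusing the trans and groups CTEs:
with trans as (
select c1,
case when lag(c1) over (order by c1) = c1 - 1 then 0 else 1 end as new
from table1
), groups as (
select c1, sum(new) over (order by c1) as grpnum
from trans
)
select grpnum, max(c1) as grp_max -- maximum of each run of consecutive numbers
from groups
group by grpnum
order by grpnum;
For the sample set this yields 3, 8, 12, 15 and 20, matching the R output below.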

In R, we can create the groups with diff and cumsum, and then use tapply to get the max of the vector for each group:
grp <- cumsum(c(TRUE, diff(v1) > 1))
tapply(v1, grp, FUN = max)
# 1 2 3 4 5
# 3 8 12 15 20
data
v1 <- c(1, 2, 3, 6, 7, 8, 11, 12, 15, 18, 19, 20)

Related

forward rolling sum with different stopping points by row

First, some sample data so the business problem can be explained -
select
ItemID = 276,
Quantity,
Bucket,
DaysInMonth = day(eomonth(Bucket)),
DailyQuantity = cast(Quantity * 1.0 / day(eomonth(Bucket)) as decimal(4, 0)),
DaysFactor
into #data
from
(
values
('1/1/2021', 95, 5500),
('2/1/2021', 75, 6000),
('3/1/2021', 80, 5000),
('4/1/2021', 82, 5300),
('5/1/2021', 90, 5200),
('6/1/2021', 80, 6500),
('7/1/2021', 85, 6100),
('8/1/2021', 90, 5100),
('9/1/2021', null, 5800),
('10/1/2021', null, 5900)
) d (Bucket, DaysFactor, Quantity);
select * from #data;
Now, the business problem -
The first row has a DaysFactor of 95.
The forward rolling sum for this row is calculated as
(31 x 177) + (28 x 214) + (31 x 161) + (5 x 177) = 17,355
That is...
the daily quantity for all 31 days of the 1/1/2021 bucket plus
the daily quantity for all 28 days of the 2/1/2021 bucket plus
the daily quantity for all 31 days of the 3/1/2021 bucket plus
the daily quantity for 5 days of the 4/1/2021 bucket.
This results in 95 days of forward looking quantity.
95 days = 31 + 28 + 31 + 5
For the second row, with a DaysFactor of 75, it would start with daily quantity for the 28 days in the 2/1/2021 bucket and go out until a total of 75 days' worth of quantity were summed, like so:
(28 x 214) + (31 x 161) + (16 x 177) = 13,815
75 days = 28 + 31 + 16
One approach to this is building a calendar of daily demand and then summing quantity over the specified days. However, I'm stuck on how to do the summing. Here is the code that builds the calendar with daily quantities:
with
dates as
(
select
FirstDay = min(cast(Bucket as date)),
LastDay = eomonth(max(cast(Bucket as date)))
from #data
),
tally as (
select top (select datediff(d, FirstDay, LastDay) + 1 from dates) --restrict to number of rows equal to number of days between first and last days
n = row_number() over(order by (select null)) - 1
from sys.messages
),
calendar as (
select
Bucket = dateadd(d, t.n, d.FirstDay)
from tally t
cross join dates d
)
select
c.Bucket,
d.DailyQuantity
from #data d
inner join calendar c
on year(d.Bucket) = year(c.Bucket)
and month(d.Bucket) = month(c.Bucket);
Here's a screenshot of a subset of rows from this query:
I was hoping to use T-SQL's LEAD() to do this but don't see a way to put the DaysFactor into the ROWS clause within OVER(). Is there a way to do that? If not, is there a set based approach to calculating the rolling forward sum?
Expected result set:
Figured it out using an approach different from LEAD(). This column was added to #data:
BucketEnd = cast(dateadd(d, DaysFactor - 1, Bucket) as date)
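One way to add it (a sketch; the column could equally be computed in the select ... into that builds #data):
alter table #data add BucketEnd date;
-- rows with a null DaysFactor get a null BucketEnd and thus no rolling sum
update #data
set BucketEnd = cast(dateadd(d, DaysFactor - 1, Bucket) as date);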
Then the code that builds the calendar with daily quantities, shown in the original question, was used to populate a temp table called #calendar.
Then this query performs the calculations:
select
d.ItemID,
d.Bucket,
RollingForwardQuantitySum = sum(iif(c.Bucket between d.Bucket and d.BucketEnd, c.DailyQuantity, null))
from #data d
cross join #calendar c
group by
d.ItemID,
d.Bucket
order by
d.ItemID,
cast(d.Bucket as date);
The output from this query matches the expected result set screenshot in the original post.
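As an aside, the same result can be sketched with a range join instead of the cross join plus conditional aggregation; the left join keeps the rows whose DaysFactor (and hence BucketEnd) is null, which the iif version returns with a null sum:
select
d.ItemID,
d.Bucket,
RollingForwardQuantitySum = sum(c.DailyQuantity)
from #data d
left join #calendar c
on c.Bucket between d.Bucket and d.BucketEnd
group by
d.ItemID,
d.Bucket
order by
d.ItemID,
cast(d.Bucket as date);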

t-sql function like "filter" for sum(x) filter(condition) over (partition by

I'm trying to sum a window with a filter. I saw something similar to
sum(x) filter(condition) over (partition by...)
but it does not seem to work in T-SQL on SQL Server 2017.
Essentially, I want to sum the last 5 rows that have a condition on another column.
I've tried
sum(case when condition...) over (partition...)
and sum(cast(nullif(x))) over (partition...).
I've tried left joining the table with a where condition to filter out the condition.
All of the above add the last 5 values starting from the current row only when the current row itself meets the condition.
What I want is: from the current row, whatever its condition, add the last 5 values above it that meet the condition.
Date | Value | Condition | Result
1-1  |  10   |     1     |
1-2  |  11   |     1     |
1-3  |  12   |     1     |
1-4  |  13   |     1     |
1-5  |  14   |     0     |
1-6  |  15   |     1     |
1-7  |  16   |     0     |
1-8  |  17   |     0     | sum(15+13+12+11+10)
1-9  |  18   |     1     | sum(18+15+13+12+11)
1-10 |  19   |     1     | sum(19+18+15+13+12)
In the above example, the condition I want is 1: rows with condition 0 are ignored, but the "window" must still span 5 non-zero-condition values.
This can easily be achieved using a correlated subquery:
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
[Date] Date,
[Value] int,
Condition bit
);
INSERT INTO @T ([Date], [Value], Condition) VALUES
('2019-01-01', 10, 1),
('2019-01-02', 11, 1),
('2019-01-03', 12, 1),
('2019-01-04', 13, 1),
('2019-01-05', 14, 0),
('2019-01-06', 15, 1),
('2019-01-07', 16, 0),
('2019-01-08', 17, 0),
('2019-01-09', 18, 1),
('2019-01-10', 19, 1);
The query:
SELECT [Date], [Value], Condition,
(
SELECT SUM([Value])
FROM
(
SELECT TOP 5 [Value]
FROM @T AS t1
WHERE Condition = 1
AND t1.[Date] <= t0.[Date]
-- If you want the sum to appear only from a specific date on, uncomment the next line
--AND t0.[Date] > '2019-01-07'
ORDER BY [Date] DESC
) AS t2
HAVING COUNT(*) = 5 -- a sum only appears once at least 5 rows meet the condition
) AS Result
FROM @T AS t0
Results:
Date Value Condition Result
2019-01-01 10 1
2019-01-02 11 1
2019-01-03 12 1
2019-01-04 13 1
2019-01-05 14 0
2019-01-06 15 1 61
2019-01-07 16 0 61
2019-01-08 17 0 61
2019-01-09 18 1 69
2019-01-10 19 1 77
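The same logic can also be phrased with OUTER APPLY; this is only a restatement of the correlated subquery above, not a different algorithm:
SELECT t0.[Date], t0.[Value], t0.Condition, x.Result
FROM @T AS t0
OUTER APPLY
(
SELECT SUM([Value]) AS Result
FROM
(
SELECT TOP 5 [Value]
FROM @T AS t1
WHERE t1.Condition = 1
AND t1.[Date] <= t0.[Date]
ORDER BY t1.[Date] DESC
) AS t2
HAVING COUNT(*) = 5 -- NULL Result until 5 qualifying rows exist
) AS x;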

PostgreSQL: how do I group rows by 'nearby' timestamps

Considering the following simplified situation:
create table trans
(
id integer not null
, tm timestamp without time zone not null
, val integer not null
, cus_id integer not null
);
insert into trans
(id, tm, val, cus_id)
values
(1, '2017-12-12 16:42:00', 2, 500) --
,(2, '2017-12-12 16:42:02', 4, 501) -- <--+---------+
,(3, '2017-12-12 16:42:05', 7, 502) -- |dt=54s |
,(4, '2017-12-12 16:42:56', 3, 501) -- <--+ |dt=59s
,(5, '2017-12-12 16:43:00', 2, 503) -- |
,(6, '2017-12-12 16:43:01', 5, 501) -- <------------+
,(7, '2017-12-12 16:43:15', 6, 502) --
,(8, '2017-12-12 16:44:50', 4, 501) --
;
I want to group rows by cus_id, but only where the interval between timestamps of consecutive rows for the same cus_id is less than 1 minute.
In the example above this applies to the rows with ids 2, 4 and 6. These rows have the same cus_id (501) and intervals below 1 minute: the interval id{2,4} is 54s and the interval id{2,6} is 59s. The interval id{4,6} is also below 1 minute, but it is subsumed by the larger span id{2,6}.
I need a query that gives me the output:
cus_id | tm | val
--------+---------------------+-----
501 | 2017-12-12 16:42:02 | 12
(1 row)
The tm value would be the tm of the first row, i.e. with the lowest tm. The val would be the sum(val) of the grouped rows.
In the example 3 rows are grouped, but that could also be 2, 4, 5, ...
For simplicity, I only let the rows for cus_id 501 have nearby time stamps, but in my real table, there would be a lot more of them. It contains 20M+ rows.
Is this possible?
Naive (suboptimal) solution using a CTE
(a faster approach would avoid the CTE, replacing it with a joined subquery, or maybe even use a window function):
-- Step one: find the start of a cluster
-- (the start is everything after a 60 second silence)
WITH starters AS (
SELECT * FROM trans tr
WHERE NOT EXISTS (
SELECT * FROM trans nx
WHERE nx.cus_id = tr.cus_id
AND nx.tm < tr.tm
AND nx.tm >= tr.tm -'60sec'::interval
)
)
-- SELECT * FROM starters ; \q
-- Step two: join everything within 60sec to the starter
-- and aggregate the clusters
SELECT st.cus_id
, st.id AS id
, MAX(tr.id) AS max_id
, MIN(tr.tm) AS first_tm
, MAX(tr.tm) AS last_tm
, SUM(tr.val) AS val
FROM trans tr
JOIN starters st ON st.cus_id = tr.cus_id
AND st.tm <= tr.tm AND st.tm > tr.tm -'60sec'::interval
GROUP BY 1,2
ORDER BY 1,2
;
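A window-function sketch of the same idea, along the lines of the lag/cumsum trick from the first question above (assumption: a new cluster starts whenever the gap to the previous row of the same cus_id exceeds 60 seconds, so clusters chain pairwise and may span more than 60 seconds end to end):
WITH flagged AS (
SELECT *
, CASE WHEN tm - lag(tm) OVER (PARTITION BY cus_id ORDER BY tm)
<= '60sec'::interval
THEN 0 ELSE 1 END AS new_grp -- 1 marks the start of a new cluster
FROM trans
), grouped AS (
SELECT *, SUM(new_grp) OVER (PARTITION BY cus_id ORDER BY tm) AS grp
FROM flagged
)
SELECT cus_id, MIN(tm) AS tm, SUM(val) AS val
FROM grouped
GROUP BY cus_id, grp
HAVING COUNT(*) > 1 -- keep only real clusters, per the expected output
ORDER BY cus_id;
For the sample data this returns the single row (501, 2017-12-12 16:42:02, 12).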

how to efficiently locate a value from one table among values from another table, with SQL

I have a problem in PostgreSQL which I find difficult even to describe in the title: I have two tables, each containing a range of values that are very similar but not identical. Suppose I have values like 0, 10, 20, 30, ... in one, and 1, 5, 6, 9, 10, 12, 19, 25, 26, ... in the other (these are milliseconds). For each value in the second table I want to find the values immediately lower and higher in the first one. So, for the value 12 it would give me 10 and 20. I'm doing it like this:
SELECT s.*, MAX(v1."millisec") AS low_v, MIN(v2."millisec") AS high_v
FROM "signals" AS s, "tracks" AS v1, "tracks" AS v2
WHERE v1."millisec" <= s."d_time"
AND v2."millisec" > s."d_time"
GROUP BY s."d_time", s."field2"; -- this is just an example
And it works ... but it is very slow once I process several thousand rows, even with indexes on s."d_time" and v.millisec. So I think there must be a much better way to do it, but I fail to find one. Could anyone help me?
Try:
select s.*,
(select millisec
from tracks t
where t.millisec <= s.d_time
order by t.millisec desc
limit 1
) as low_v,
(select millisec
from tracks t
where t.millisec > s.d_time
order by t.millisec asc
limit 1
) as high_v
from signals s;
Be sure you have an index on tracks.millisec. If you have just created the index, you'll need to analyze the table to take advantage of it.
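For example (the index name is illustrative):
create index tracks_millisec_idx on tracks (millisec);
analyze tracks;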
Naive (trivial) way to find the preceding and next value.
-- the data (this could have been part of the original question)
CREATE TABLE table_one (id SERIAL NOT NULL PRIMARY KEY
, msec INTEGER NOT NULL -- an index might help
);
CREATE TABLE table_two (id SERIAL NOT NULL PRIMARY KEY
, msec INTEGER NOT NULL -- an index might help
);
INSERT INTO table_one(msec) VALUES (0), ( 10), ( 20), ( 30);
INSERT INTO table_two(msec) VALUES (1), ( 5), ( 6), ( 9), ( 10), ( 12), ( 19), ( 25), ( 26);
-- The query: find lower/higher values in table one
-- , but with no values between "us" and "them".
--
SELECT this.msec AS this
, prev.msec AS prev
, next.msec AS next
FROM table_two this
LEFT JOIN table_one prev ON prev.msec < this.msec AND NOT EXISTS (SELECT 1 FROM table_one nx WHERE nx.msec < this.msec AND nx.msec > prev.msec)
LEFT JOIN table_one next ON next.msec > this.msec AND NOT EXISTS (SELECT 1 FROM table_one nx WHERE nx.msec > this.msec AND nx.msec < next.msec)
;
Result:
this | prev | next
------+------+------
1 | 0 | 10
5 | 0 | 10
6 | 0 | 10
9 | 0 | 10
10 | 0 | 20
12 | 10 | 20
19 | 10 | 20
25 | 20 | 30
26 | 20 | 30
(9 rows)
Try this:
select * from signals s,
(select millisec low_value,
lead(millisec) over (order by millisec) high_value from tracks) intervals
where s.d_time between low_value and high_value-1
For this type of problem, window functions are ideal; see http://www.postgresql.org/docs/9.1/static/tutorial-window.html

How to "fold" array values in Postgres?

I am a PostgreSQL newcomer and I'd like to know whether it is possible to calculate the following:
select T.result +
-- here I want to do the following:
-- iterate through T.arr1 and T.arr2 items and add their values
-- to the T.result using the rules:
-- if arr1 item is 1 then add 10, if arr1 item is 2 or 3 then add 20,
-- if arr2 item is 3 then add 15, if arr2 item is 4 then add 20, else add 30
from (
select 5 as result, array[1,2,3] as arr1, array[3,4,5] as arr2
) as T
So that for these arrays the query will produce: 5+10+20+20+15+20+30 = 120.
Thanks for any help.
Try something like:
SELECT SUM(CASE val
WHEN 1 THEN 10
WHEN 2 THEN 20
WHEN 3 THEN 20
END)
FROM unnest(array[1,2,3]) as val
That gets the sum for a single array.
The full query will look like:
select T.result +
(SELECT SUM(CASE val
WHEN 1 THEN 10
WHEN 2 THEN 20
WHEN 3 THEN 20
END)
FROM unnest(arr1) as val) +
(SELECT SUM(CASE val
WHEN 3 THEN 15
WHEN 4 THEN 20
ELSE 30
END)
FROM unnest(arr2) as val)
from (
select 5 as result, array[1,2,3] as arr1, array[3,4,5] as arr2
) as T
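A variation worth sketching once the rules grow: keep the item-to-points mapping in a VALUES list and join it to the unnested arrays (the alias m(item, points) is illustrative, not part of the answer above):
select T.result
+ (select coalesce(sum(m.points), 0)
from unnest(T.arr1) as val
join (values (1, 10), (2, 20), (3, 20)) as m(item, points)
on m.item = val)
+ (select coalesce(sum(coalesce(m.points, 30)), 0) -- unmatched arr2 items count as 30
from unnest(T.arr2) as val
left join (values (3, 15), (4, 20)) as m(item, points)
on m.item = val)
from (
select 5 as result, array[1,2,3] as arr1, array[3,4,5] as arr2
) as T
This again produces 120 for the sample row.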