Calculate the cumulative total between 2 columns until a non-zero value is reached in 1st column. Once non-zero value is reached, the sum restarts - tsql

I am querying a database using T-SQL in SSMS. I have a dataset that contains two unique IDs, A and B. For each of these IDs I want to sum columns OAC and Adj cumulatively until the next non-zero value is reached in column OAC. In other words, the non-zero values in column OAC stay as they are, while each row between them adds its Adj value to the running total; every non-zero OAC value acts as the stop and restart point.
The data table can be created using the script below:
drop table #T
CREATE TABLE #T(
Id varchar(10),
PeriodNum int,
row_num int,
OAC money ,
Adj money
) ON [PRIMARY]
GO
insert into #T
values
('A','201606','1','5','0'),
('A','201905','2','0','-2'),
('A','201906','3','100','0'),
('A','202008','4','0','-6'),
('A','202009','5','0','-8'),
('A','202106','6','0','-11'),
('A','202109','7','23','0'),
('B','201606','1','3','0'),
('B','201905','2','0','25'),
('B','201906','3','60','0'),
('B','202008','4','0','12'),
('B','202009','5','0','-5'),
('B','202106','6','0','6'),
('B','202109','7','6','0')
I tried the following code to calculate the desired result
select * , sum( iif(t.OAC<> 0,(t.OAC + t.Adj),0)) over (partition by t.ID order by t.row_num asc) as Calc
from #T t
This did not work; the required result is shown below in the Calc column.

A courtesy of https://blog.jooq.org/10-sql-tricks-that-you-didnt-think-were-possible/ :
Of course, that vendor is not Microsoft, so we're stuck with less elegant options.
One way to go about this is to first use a conditional aggregate on a flag to create "groupings", let's call that [sumstep], and then partition on that to have separate running totals:
;with cte as
(
select *,sum(case when OAC=0 then 0 else 1 end) over (partition by id order by row_num) as sumstep
from #t
)
select *,sum(OAC+Adj) over (partition by id,sumstep order by row_num) as Calc
from cte
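To make the grouping idea easier to follow outside SQL Server, here is a minimal Python sketch of the same logic. It is not part of the original answer: the function name running_calc and the tuple layout are assumptions made for illustration, and the rows are copied from the question's sample data.

```python
from itertools import groupby

# Sample rows from the question's #T table: (Id, row_num, OAC, Adj)
rows = [
    ("A", 1, 5, 0), ("A", 2, 0, -2), ("A", 3, 100, 0), ("A", 4, 0, -6),
    ("A", 5, 0, -8), ("A", 6, 0, -11), ("A", 7, 23, 0),
    ("B", 1, 3, 0), ("B", 2, 0, 25), ("B", 3, 60, 0), ("B", 4, 0, 12),
    ("B", 5, 0, -5), ("B", 6, 0, 6), ("B", 7, 6, 0),
]

def running_calc(rows):
    """Return (Id, row_num, sumstep, Calc) tuples (hypothetical helper).

    sumstep counts the non-zero OAC values seen so far within an Id, like the
    first window sum; Calc is the running total of OAC + Adj within each
    (Id, sumstep) group, like the second window sum.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        sumstep = 0
        calc = 0
        for id_, row_num, oac, adj in partition:
            if oac != 0:
                sumstep += 1   # a non-zero OAC opens a new group
                calc = 0       # and restarts the running total
            calc += oac + adj
            out.append((id_, row_num, sumstep, calc))
    return out

for row in running_calc(rows):
    print(row)
```

For Id A this gives Calc values 5, 3, 100, 94, 86, 75, 23: each non-zero OAC row keeps its own value and the rows after it add their Adj values.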

Merge overlapping date intervals into big intervals providing uniqueness of some id inside merged group

I have some date intervals, each characterized by a known "prop_id". My goal is to merge overlapping intervals into big intervals while keeping "prop_id" unique inside each merged group. I have some code that gets me the big intervals, but I have no idea how to enforce the uniqueness condition. Thanks in advance for any assistance.
________1 ________1
___________2
________1 |________1
_________|__2
[1,2]_________|________[1,2]
For SQLFiddle:
CREATE SEQUENCE ido_seq;
create table slots (
ido integer NOT NULL default nextval('ido_seq'),
begin_at date,
end_at date,
prop_id integer);
ALTER SEQUENCE ido_seq owned by slots.ido;
INSERT INTO slots (ido, begin_at, end_at, prop_id) VALUES
(0, '2014-10-05', '2014-10-10', 1),
(1, '2014-10-08', '2014-10-15', 2),
(2, '2014-10-13', '2014-10-20', 1),
(3, '2014-10-21', '2014-10-30', 2);
-- desired output:
-- start, end, props
-- 2014-10-05, 2014-10-12, [1,2] --! the whole group is (2014-10-05, 2014-10-20, [1,2,1]), but props should be unique
-- 2014-10-13, 2014-10-20, [1,2] --so, we obtain 2 ranges instead of 1, each one with 2 generating prop_id
-- 2014-10-21, 2014-10-30 [2]
How do we get it:
If two date intervals overlap, we merge them. The first ['2014-10-05', '2014-10-10'] and the second ['2014-10-08', '2014-10-15'] have the part ['2014-10-08', '2014-10-10'] in common, so we can merge them into ['2014-10-05', '2014-10-15']. The props of the merged interval are unique, which is OK. The next interval ['2014-10-13', '2014-10-20'] overlaps the previously calculated ['2014-10-05', '2014-10-15'], but we can't merge them without breaking the uniqueness condition. So we have to split the big interval ['2014-10-05', '2014-10-20'] into two smaller ones at the begin date of the repeating prop ('2014-10-13'). Keeping the condition, we get ['2014-10-05', '2014-10-12'] (that is, '2014-10-13' minus 1 day) and ['2014-10-13', '2014-10-20'], both covering props 1 and 2.
My attempt to get merged intervals (not keeping uniqueness condition):
SELECT min(begin_at), max(enddate), array_agg(prop_id) AS props
FROM (
SELECT *,
count(nextstart > enddate OR NULL) OVER (ORDER BY begin_at DESC, end_at DESC) AS grp
FROM (
SELECT
prop_id
, begin_at
, end_at
, end_at AS enddate
, lead(begin_at) OVER (ORDER BY begin_at, end_at) AS nextstart
FROM slots
) a
)b
GROUP BY grp
ORDER BY 1;
The right solution here is probably to use a recursive CTE to find the large intervals no matter how many smaller intervals need to be combined, and then to remove the intervals that we do not need.
with recursive intervals(idos, begin_at,end_at,prop_ids) as(
select array[ido], begin_at, end_at, array[prop_id]
from slots
union
select i.idos || s.ido
, least(s.begin_at, i.begin_at)
, greatest(s.end_at, i.end_at)
, i.prop_ids || s.prop_id
from intervals i
join slots s
on (s.begin_at, s.end_at) overlaps (i.begin_at, i.end_at)
and not (i.prop_ids && array[s.prop_id]) --check that the prop id is not already in the large interval
where s.begin_at < i.begin_at --to avoid having double intervals
)
select * from intervals i
--finally, remove the intervals that are a subinterval of an included interval
where not exists(select 1 from intervals i2 where i2.idos @> i.idos
and i2.idos <> i.idos);

postgresql: How to grab an existing id from non-consecutive ids of a table

Postgresql version 9.4
I have a table with an integer column, which has a number of integers with some gaps, like the sample below; I'm trying to get an existing id from the column at random with the following query, but it returns NULL occasionally:
CREATE TABLE
IF NOT EXISTS test_tbl(
id INTEGER);
INSERT INTO test_tbl
VALUES (10),
(13),
(14),
(16),
(18),
(20);
-------------------------------
SELECT * FROM test_tbl;
-------------------------------
SELECT COALESCE(tmp.id, 20) AS classification_id
FROM (
SELECT tt.id,
row_number() over(
ORDER BY tt.id) AS row_num
FROM test_tbl tt
) tmp
WHERE tmp.row_num =floor(random() * 10);
Please let me know what I'm doing wrong.
"but it returns NULL occasionally"
And I must add that it sometimes returns more than one row, right?
In your sample data there are 6 rows, so the column row_num will have a value from 1 to 6.
This:
floor(random() * 10)
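creates a random integer from 0 to 9 (random() itself returns a value from 0 up to, but not including, 1), so it can produce values that match no row_num at all.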
You should use:
floor(random() * 6 + 1)::int
to get a random integer from 1 to 6.
But this alone would not solve the problem: random() is volatile, so the WHERE condition is evaluated with a fresh random number for each row. It is therefore possible that no row_num matches its random number, in which case nothing is returned, or that several rows match, in which case more than one row is returned.
See the demo.
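The per-row behaviour can also be reproduced outside the database. The following Python simulation is only a sketch, not PostgreSQL itself: the helper names per_row_filter and pick_one are made up for illustration, and the seed is arbitrary. It draws a fresh random number for every row, as the WHERE clause does, and compares that with picking the row that gets the smallest random sort key.

```python
import math
import random

ids = [10, 13, 14, 16, 18, 20]  # sample ids from the question

def per_row_filter(ids, rng):
    """Mimic WHERE row_num = floor(random() * 6 + 1), drawing per row."""
    matched = []
    for row_num, id_ in enumerate(sorted(ids), start=1):
        # a new random number is drawn for every row that is tested
        if row_num == math.floor(rng.random() * len(ids) + 1):
            matched.append(id_)
    return matched

def pick_one(ids, rng):
    """Mimic ORDER BY random() LIMIT 1: always exactly one existing id."""
    return min(ids, key=lambda _: rng.random())

rng = random.Random(0)  # arbitrary seed, only to make runs repeatable
match_counts = {len(per_row_filter(ids, rng)) for _ in range(2000)}
print(sorted(match_counts))   # distinct numbers of rows returned per query
print(pick_one(ids, rng))     # a single id taken from the table
```

Over many simulated runs the per-row filter returns zero rows in some runs and several rows in others, while the second helper always returns exactly one id that exists in the table.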
The proper (although sometimes not the most efficient) way to get a random row is:
SELECT id FROM test_tbl ORDER BY random() LIMIT 1
Also check other links from SO, like:
quick random row selection in Postgres
You could order by random() and select one row; this way you are guaranteed to hit an existing row:
select id
from test_tbl
order by random()
LIMIT 1;

lead and lag on large table 1billion rows

I have a table T as follows with 1 Billion records. Currently, this table has no Primary key or Indexes.
create table T(
day_c date,
str_c varchar2(20),
comm_c varchar2(20),
src_c varchar2(20)
);
some sample data:
insert into T
select to_date('20171011','yyyymmdd') day_c,'st1' str_c,'c1' comm_c,'s1' src_c from dual
union
select to_date('20171012','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171013','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171014','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171015','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171016','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171017','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171018','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171019','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171020','yyyymmdd'),'st1','c1','s1' from dual;
The expected result is to generate the date ranges for the changes in column src_c.
I have the following code snippet which provides the desired result. However, it is slow as the cost of running lag and lead is quite high on the table.
WITH EndsMarked AS (
SELECT
day_c,str_c,comm_c,src_c,
CASE WHEN src_c= LAG(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_START,
CASE WHEN src_c= LEAD(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_END
FROM T
), GroupsNumbered AS (
SELECT
day_c,str_c,comm_c,
src_c,
IS_START,
IS_END,
COUNT(CASE WHEN IS_START = 1 THEN 1 END)
OVER (ORDER BY day_c) AS GroupNum
FROM EndsMarked
WHERE IS_START=1 OR IS_END=1
)
SELECT
str_c,comm_c,src_c,
MIN(day_c) AS GROUP_START,
MAX(day_c) AS GROUP_END
FROM GroupsNumbered
GROUP BY str_c,comm_c, src_c,GroupNum
ORDER BY groupnum;
Output:
STR_C  COMM_C  SRC_C  GROUP_START  GROUP_END
st1    c1      s1     11-OCT-17    13-OCT-17
st1    c1      s2     14-OCT-17    16-OCT-17
st1    c1      s1     17-OCT-17    20-OCT-17
Any suggestion to speed up?
Oracle database: 12c
SGA memory: 20 GB
Total CPU: 22
Explain plan:
Order by day_c only, or do you need to partition by str_c and comm_c first? It seems so - in which case I am not sure your query is correct, and Sentinel's solution will need to be adjusted accordingly.
Then:
For some reason (which escapes me), it appears that the match_recognize clause (available only since Oracle 12.1) is faster than analytic functions, even when the work done seems to be the same.
In your problem, (1) you must read 1 billion rows from disk, which can't be done faster than the hardware allows (do you REALLY need to do this on all 1 billion rows, or should you archive a large portion of your table, perhaps after performing this identification of GROUP_START and GROUP_END)? (2) you must order the data by day_c no matter what method you use, and that is time consuming.
With that said, the tabibitosan method (see Sentinel's answer) will be faster than the start-of-group method (which is close to, but simpler than what you currently have).
The match_recognize solution, which will probably be faster than any solution based on analytic functions, looks like this:
select str_c, comm_c, src_c, group_start, group_end
from t
match_recognize(
partition by str_c, comm_c
order by day_c
measures x.src_c as src_c,
first(day_c) as group_start,
last(day_c) as group_end
pattern ( x y* )
define y as src_c = x.src_c
)
-- Add ORDER BY clause here, if needed
;
Here is a quick explanation of how this works; for developers who are not familiar with match_recognize, I provided links to a few good tutorials in a Comment below this Answer.
The match_recognize clause partitions the input rows by str_c and comm_c and orders them by day_c. So far this is exactly the same work that analytic functions do.
Then in the PATTERN and DEFINE clauses I declare and define two "classes" of rows, which will be flagged as X and Y, respectively. X is any row (there are no restrictions on it in the DEFINE clause). However, Y is restricted: it must have the same src_c as the last X row preceding it.
So, in each partition, and reading from the earliest row to the latest (within the partition), I am looking for any number of matches, where a match consists of an arbitrary row (marked X), followed by as many Y rows as possible, where Y means "same src_c as the first row in this match". So, this will identify sequences of rows where the src_c did not change.
For each match that is found, the clause will output the src_c value from the X row (which is the same, really, for all the rows in that match), and the first and the last value in the day_c column for that match. That is what we need to put in the SELECT clause of the overall query.
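For readers who have not used match_recognize, the matching described above can be imitated with a plain loop. This Python sketch is only an illustration of the pattern ( x y* ) on one partition; the function name match_runs and the tuple layout are assumptions, and it says nothing about how Oracle executes the clause or how fast it is.

```python
from datetime import date

# One partition (str_c = 'st1', comm_c = 'c1') of the sample data: (day_c, src_c)
sample = [(date(2017, 10, d), s) for d, s in [
    (11, "s1"), (12, "s1"), (13, "s1"),
    (14, "s2"), (15, "s2"), (16, "s2"),
    (17, "s1"), (18, "s1"), (19, "s1"), (20, "s1"),
]]

def match_runs(rows):
    """Return (src_c, group_start, group_end) for each X Y* match."""
    rows = sorted(rows)                  # ORDER BY day_c
    result = []
    i = 0
    while i < len(rows):
        x_day, x_src = rows[i]           # X: any row may start a match
        j = i + 1
        while j < len(rows) and rows[j][1] == x_src:
            j += 1                       # Y*: rows with the same src_c as X
        result.append((x_src, x_day, rows[j - 1][0]))
        i = j                            # the next match starts after this one
    return result

for src, start, end in match_runs(sample):
    print(src, start, end)
```

On the sample rows this yields three runs: s1 from 11 to 13 October, s2 from 14 to 16 October, and s1 again from 17 to 20 October.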
You can eliminate one CTE by using the Tabibito-san (Traveler) method:
with Groups as (
select t.*
, row_number() over (order by day_c)
- row_number() over (partition by str_c
, comm_c
, src_c
order by day_c) GroupNum
from t
)
select str_c
, comm_c
, src_c
, min(day_c) GROUP_START
, max(day_c) GROUP_END
from Groups
group by str_c
, comm_c
, src_c
, GroupNum
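The reason the difference of two row numbers works as a group key may not be obvious, so here is a small Python sketch of the same arithmetic on one str_c/comm_c combination. The function name tabibitosan_groups is hypothetical and the data is the question's sample; it is an illustration, not a replacement for the query above.

```python
from collections import defaultdict
from datetime import date

# Sample rows for str_c = 'st1', comm_c = 'c1': (day_c, src_c)
sample = [(date(2017, 10, d), s) for d, s in [
    (11, "s1"), (12, "s1"), (13, "s1"),
    (14, "s2"), (15, "s2"), (16, "s2"),
    (17, "s1"), (18, "s1"), (19, "s1"), (20, "s1"),
]]

def tabibitosan_groups(rows):
    """Return (group_start, group_end, src_c) using overall_rn - per_src_rn."""
    per_src_rn = defaultdict(int)        # row_number() partitioned by src_c
    groups = {}
    for overall_rn, (day, src) in enumerate(sorted(rows), start=1):
        per_src_rn[src] += 1
        # Within an unbroken run both counters grow by 1, so the difference
        # stays constant; it changes once another src_c value intervenes.
        key = (src, overall_rn - per_src_rn[src])
        first, _ = groups.get(key, (day, day))
        groups[key] = (first, day)
    return sorted((first, last, src) for (src, _), (first, last) in groups.items())

for start, end, src in tabibitosan_groups(sample):
    print(src, start, end)
```

For the sample, the differences are 0 for the first s1 run, 3 for the s2 run and 3 for the second s1 run, which is why src_c has to be part of the GROUP BY together with GroupNum.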

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on each card in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query into sql window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But result set does not match, how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
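To check the two-step logic (compute the windowed maximum over all rows first, filter afterwards) without a Vertica instance, here is a rough Python equivalent. The function name max_over_range is an assumption for illustration; the rows are the question's test data.

```python
from datetime import date, timedelta

# Rows of the question's test table: (id, card_id, tran_dt, amount)
test = [
    (1, 1, date(2017, 7, 6), 10),
    (2, 1, date(2017, 6, 1), 20),
    (3, 1, date(2017, 5, 1), 30),
    (4, 1, date(2017, 4, 1), 40),
    (5, 2, date(2017, 7, 4), 10),
]

def max_over_range(rows, days=90):
    """For every row, the max amount of the same card within `days` before it."""
    out = []
    for _id, card, dt, _amount in rows:
        lower = dt - timedelta(days=days)
        # RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW, per card
        window = [a for _i, c, d, a in rows if c == card and lower <= d <= dt]
        out.append((card, dt, max(window)))
    return out

# Outer query: keep only the rows dated 2017-07-06
result = [(card, m) for card, dt, m in max_over_range(test) if dt == date(2017, 7, 6)]
print(result)
```

Filtering on the date before computing the window would leave only one row per card inside the window, which is why the filter belongs in the outer step.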
As far as I know, PostgreSQL window functions (in version 9, and in any release before 11) don't support a bounded RANGE offset, so range between '90 days' preceding won't work. They do support a bounded ROWS offset, such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following, so that the window function operates on one row per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.

How to reference output rows with window functions?

Suppose I have a table with quantity column.
CREATE TABLE transfers (
user_id integer,
quantity integer,
created timestamp default now()
);
I'd like to iterate through a partition using window functions, but access the output rows, not the input table rows.
To access the input table rows I could do something like this:
SELECT LAG(quantity, 1, 0)
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
I need to access the previous output row to calculate the next output row. How can I access the lag row in the output? Something like:
CREATE VIEW balance AS
SELECT LAG(balance.total, 1, 0) + quantity AS total
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
Edit
This is a minimal example to support the question of how to access the previous output row within a window partition. I don't actually want a sum.
It seems you are trying to calculate a running sum. Luckily, that's just what the sum() window function does:
WITH transfers AS(
SELECT i, random()-0.3 AS quantity FROM generate_series(1,100) as i
)
SELECT i, quantity, sum(quantity) OVER (ORDER BY i) from transfers;
I guess, looking at the question, that all you need is to calculate a cumulative sum.
To calculate a cumulative sum, use this query:
SELECT *,
SUM( CASE WHEN quantity IS NULL THEN 0 ELSE quantity END)
OVER ( PARTITION BY user_id ORDER BY created
ROWS BETWEEN unbounded preceding AND current row
) As cumulative_sum
FROM transfers
ORDER BY user_id, created
;
But if you want more complex calculations, especially ones containing conditions (decisions) that depend on the result from the previous row, then you need a recursive approach.
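As an illustration of a calculation where each output row depends on the previous output row, consider a rule the question does not state and that is assumed here purely as an example: the running balance may never drop below zero. A plain running sum cannot express this, but a fold over the rows can. The helper name balances is hypothetical.

```python
from itertools import accumulate

def balances(quantities, floor=0):
    """Running total where every step starts from the previous OUTPUT value.

    The floor rule is an assumed example of a condition that depends on the
    previous result; it is not taken from the question.
    """
    steps = accumulate(quantities, lambda prev, q: max(floor, prev + q), initial=0)
    return list(steps)[1:]               # drop the initial 0

quantities = [5, -8, 3, -1]              # one user's transfers, ordered by created
print(balances(quantities))              # floored running balance
print(list(accumulate(quantities)))      # plain running sum, for comparison
```

Here the floored balance is 5, 0, 3, 2, whereas the plain running sum is 5, -3, 0, -1. In SQL the same step-by-step dependency is what a recursive CTE expresses.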