Get min and max dates - T-SQL

I have a set of data (SQL Server 2012) that contains an item number, orig_counter, prev_counter, start_date and stop_date. What I need to do is get the min start_date and max stop_date if the item has moved from one place to another (on such a move, the prev_counter is populated with the orig_counter of the prior record; if prev_counter is 0 then there was no prior movement).
Here's what my data looks like:
item orig_counter prev_counter start_date stop_date
---------------------------------------------------------------
AB108 8194 0 2001-12-03 2001-12-10
AB108 8569 0 2002-01-04 2002-01-22
AB108 9233 0 2002-02-01 2002-02-01
AB108 12365 0 2002-07-08 2004-02-29
AB108 24602 12365 2002-07-08 2004-03-09
AB108 24855 24602 2002-07-08 2004-03-23
AB108 24945 24855 2002-07-08 2004-03-29
AB108 25042 24945 2002-07-08 2004-04-04
AB108 25106 25042 2002-07-08 2004-04-11
AB108 25226 25106 2002-07-08 2004-04-22
AB108 25569 25226 2002-07-08 2004-04-28
AB108 25724 25569 2002-07-08 2004-06-01
AB108 26749 25724 2002-07-08 2004-06-30
AB108 27187 26749 2002-07-08 2004-07-11
AB108 27336 27187 2002-07-08 2004-08-15
AB108 28272 27336 2002-07-08 2004-08-24
AB108 28329 28272 2002-07-08 2004-11-07
AB108 29831 28329 2002-07-08 2004-11-08
AB108 30003 29831 2002-07-08 2005-08-03
AB108 36618 0 2005-09-19 2005-10-19
AB108 37613 0 2005-11-07 2005-11-07
AB108 37756 0 2005-11-10 2005-11-28
AB108 38979 0 2006-01-25 2006-08-01
As you can see, the 4th row (orig_counter = 12365) starts a chain where this item moved from one place to another, ending at the row 5th from the bottom (orig_counter = 30003).
So that I can determine the length of time a person had this item, I need results like the following, showing the original counter, the ending counter (if there is one), the min start date and the max stop date.
item orig_counter end_counter start_date stop_date
---------- ------------ ------------ ---------- ----------
AB108 8194 0 2001-12-03 2001-12-10
AB108 8569 0 2002-01-04 2002-01-22
AB108 9233 0 2002-02-01 2002-02-01
AB108 12365 30003 2002-07-08 2005-08-03
AB108 36618 0 2005-09-19 2005-10-19
AB108 37613 0 2005-11-07 2005-11-07
AB108 37756 0 2005-11-10 2005-11-28
AB108 38979 0 2006-01-25 2006-08-01

This query recursively loops through the orig_counter and prev_counter chain, starting with the rows where prev_counter = 0:
-- Sample data
declare @data table(item char(5), orig_counter int, prev_counter int, start_date datetime, stop_date datetime);
insert into @data(item, orig_counter, prev_counter, start_date, stop_date) values
('AB108', 8194, 0, '2001-12-03', '2001-12-10')
, ('AB108', 8569, 0, '2002-01-04', '2002-01-22')
, ('AB108', 9233, 0, '2002-02-01', '2002-02-01')
, ('AB108', 12365, 0, '2002-07-08', '2004-02-29')
, ('AB108', 24602, 12365, '2002-07-08', '2004-03-09')
, ('AB108', 24855, 24602, '2002-07-08', '2004-03-23')
, ('AB108', 24945, 24855, '2002-07-08', '2004-03-29')
, ('AB108', 25042, 24945, '2002-07-08', '2004-04-04')
, ('AB108', 25106, 25042, '2002-07-08', '2004-04-11')
, ('AB108', 25226, 25106, '2002-07-08', '2004-04-22')
, ('AB108', 25569, 25226, '2002-07-08', '2004-04-28')
, ('AB108', 25724, 25569, '2002-07-08', '2004-06-01')
, ('AB108', 26749, 25724, '2002-07-08', '2004-06-30')
, ('AB108', 27187, 26749, '2002-07-08', '2004-07-11')
, ('AB108', 27336, 27187, '2002-07-08', '2004-08-15')
, ('AB108', 28272, 27336, '2002-07-08', '2004-08-24')
, ('AB108', 28329, 28272, '2002-07-08', '2004-11-07')
, ('AB108', 29831, 28329, '2002-07-08', '2004-11-08')
, ('AB108', 30003, 29831, '2002-07-08', '2005-08-03')
, ('AB108', 36618, 0, '2005-09-19', '2005-10-19')
, ('AB108', 37613, 0, '2005-11-07', '2005-11-07')
, ('AB108', 37756, 0, '2005-11-10', '2005-11-28')
, ('AB108', 38979, 0, '2006-01-25', '2006-08-01');
-- Recursive query: walk each movement chain from its starting row (prev_counter = 0),
-- following the orig_counter -> prev_counter links and carrying the chain tip forward
with list(n, item, orig_counter, prev_counter, start_date, stop_date) as (
    -- Anchor: chain starts; seed prev_counter with the row's own orig_counter
    Select 0, item, orig_counter, orig_counter, start_date, stop_date
    From @data Where prev_counter = 0
    Union All
    -- Recursive step: find the row whose prev_counter points at the current tip
    Select l.n+1, l.item, l.orig_counter, d.orig_counter, l.start_date, d.stop_date
    From list as l
    Inner Join @data as d on l.prev_counter = d.prev_counter and l.item = d.item
)
-- Keep only the last step of each chain (max n); single-row chains get end_counter = 0
Select l.item, l.orig_counter,
       end_counter = case when m.mx > 0 then l.prev_counter else 0 end,
       l.start_date, l.stop_date
From list l
Inner Join (Select mx = max(n), item, orig_counter From list Group By item, orig_counter) as m
    On m.item = l.item and m.orig_counter = l.orig_counter and m.mx = l.n
Order By l.item, l.orig_counter
OPTION (MAXRECURSION 0);
Output:
item | orig_counter | end_counter | start_date | stop_date
AB108 | 8194 | 0 | 2001-12-03 00:00:00.000 | 2001-12-10 00:00:00.000
AB108 | 8569 | 0 | 2002-01-04 00:00:00.000 | 2002-01-22 00:00:00.000
AB108 | 9233 | 0 | 2002-02-01 00:00:00.000 | 2002-02-01 00:00:00.000
AB108 | 12365 | 30003 | 2002-07-08 00:00:00.000 | 2005-08-03 00:00:00.000
AB108 | 36618 | 0 | 2005-09-19 00:00:00.000 | 2005-10-19 00:00:00.000
AB108 | 37613 | 0 | 2005-11-07 00:00:00.000 | 2005-11-07 00:00:00.000
AB108 | 37756 | 0 | 2005-11-10 00:00:00.000 | 2005-11-28 00:00:00.000
AB108 | 38979 | 0 | 2006-01-25 00:00:00.000 | 2006-08-01 00:00:00.000
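If recursion is undesirable, a gaps-and-islands sketch can get the same result by numbering each chain with a running count of its start rows. Note this assumes orig_counter increases along each movement chain (true of the sample data above, but not something the recursive join requires):
with grp as (
    select *,
           -- each prev_counter = 0 row opens a new chain ("island")
           sum(case when prev_counter = 0 then 1 else 0 end)
               over (partition by item order by orig_counter) as chain_id
    from @data
)
select item,
       min(orig_counter) as orig_counter,
       -- single-row chains report 0, matching the desired output
       case when count(*) > 1 then max(orig_counter) else 0 end as end_counter,
       min(start_date) as start_date,
       max(stop_date) as stop_date
from grp
group by item, chain_id
order by item, min(orig_counter);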

Related

Select previous different value PostgreSQL

I have a table:
id | date       | value
---+------------+------
1  | 2022-01-01 | 1
1  | 2022-01-02 | 1
1  | 2022-01-03 | 2
1  | 2022-01-04 | 2
1  | 2022-01-05 | 3
1  | 2022-01-06 | 3
I want to detect changes of the value column by date, returning the previous distinct value as diff:
id | date       | value | diff
---+------------+-------+-----
1  | 2022-01-01 | 1     | null
1  | 2022-01-02 | 1     | null
1  | 2022-01-03 | 2     | 1
1  | 2022-01-04 | 2     | 1
1  | 2022-01-05 | 3     | 2
1  | 2022-01-06 | 3     | 2
I tried the window function lag(), but all I got was the immediately preceding value:
id | date       | value | diff
---+------------+-------+-----
1  | 2022-01-01 | 1     | null
1  | 2022-01-02 | 1     | 1
1  | 2022-01-03 | 2     | 1
1  | 2022-01-04 | 2     | 2
1  | 2022-01-05 | 3     | 2
1  | 2022-01-06 | 3     | 3
I am pretty sure you have to do a gaps-and-islands analysis to "group" your changes.
There may be a more concise way to get the result you want, but this is how I would solve this:
with changes as ( -- mark the changes and lag values
    select id, date, value,
           coalesce((value != lag(value) over w)::int, 1) as changed_flag,
           lag(value) over w as last_value
    from a_table
    window w as (partition by id order by date)
), groupnums as ( -- number the groups, carrying the lag values forward
    select id, date, value,
           sum(changed_flag) over (partition by id order by date) as group_num,
           last_value
    from changes
) -- final query that uses group numbering to return the correct lag value
select id, date, value,
       first_value(last_value) over (partition by id, group_num
                                     order by date) as diff
from groupnums;
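For reference, a minimal setup to reproduce this, assuming the table and column names used in the query above (a_table with id, date, value):
create table a_table (id int, date date, value int);
insert into a_table (id, date, value) values
    (1, '2022-01-01', 1),
    (1, '2022-01-02', 1),
    (1, '2022-01-03', 2),
    (1, '2022-01-04', 2),
    (1, '2022-01-05', 3),
    (1, '2022-01-06', 3);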

Postgres generate_series joined onto result set to fill empty dates within a range

I have a result set that sometimes has missing dates (because no data is present within that week) and need to fill those with zeros. For simplicity I've reduced the query and table down to:
Table: generated_data
id | data | date_key
1 | 3 | 2021-12-13 03:00:00.000
2 | 1 | 2021-12-22 05:00:00.000
3 | 4 | 2021-12-24 07:00:00.000
4 | 7 | 2022-01-03 01:00:00.000
5 | 2 | 2022-01-05 02:00:00.000
Query:
Select
    sum(data) / count(data),
    DATE_TRUNC('week', date_key AT TIME ZONE 'America/New_York') as date_key
from generated_data
group by DATE_TRUNC('week', date_key AT TIME ZONE 'America/New_York')
would produce the following result set:
3 | 2021-12-13 00:00:00.000
2.5 | 2021-12-20 00:00:00.000
5.5 | 2022-01-03 00:00:00.000
but as you can see, the week of 12/27 is missing, and I'd like it returned in the result set with a zero. I've looked into using generate_series and joining it onto the simplified query above, but haven't found a good solution.
The idea would be doing something like
SELECT GENERATE_SERIES('2021-11-08T00:00:00+00:00'::date, '2022-01-17T04:59:59.999000+00:00'::date, '1 week'::interval) as date_key
but I'm not sure how to join that back to the result query so that just the missing dates are added. What would an ON clause look like for something like that?
final result set would look like
3 | 2021-12-13 00:00:00.000
2.5 | 2021-12-20 00:00:00.000
0 | 2021-12-27 00:00:00.000
5.5 | 2022-01-03 00:00:00.000
First, find the min and max of the dates and generate the series based on those. Then left join the table onto the generated data:
WITH data_range AS (
    -- bounds of the data, shifted to the reporting time zone
    SELECT
        min(date_key) AT TIME ZONE 'America/New_York' min,
        max(date_key) AT TIME ZONE 'America/New_York' max
    FROM generated_data
),
generated_range AS (
    -- one row per week between the bounds
    SELECT DATE_TRUNC(
        'week',
        GENERATE_SERIES(min, max, '1 week'::interval)
    ) AS date FROM data_range
)
SELECT
    coalesce(sum(data) / count(data), 0),
    DATE_TRUNC('week', gr.date)
FROM
    generated_range gr
    LEFT JOIN generated_data gd ON
        DATE_TRUNC('week', gd.date_key AT TIME ZONE 'America/New_York') = gr.date
GROUP BY DATE_TRUNC('week', gr.date)
ORDER BY 2
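If fixed report bounds are preferred over the data's min/max (as in the question's GENERATE_SERIES sketch), the generated_range CTE above can instead be built from literals; a minimal variant:
-- generates the week-truncated dates between two hard-coded bounds
SELECT DATE_TRUNC(
    'week',
    GENERATE_SERIES('2021-11-08'::timestamp,
                    '2022-01-17'::timestamp,
                    '1 week'::interval)
) AS date;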

Count rows within a group, but also from global result set: performance issue

I have a table with log records. Each log record is represented by a status (open or closed) and a date:
CREATE TABLE logs (
    id BIGSERIAL PRIMARY KEY,
    status VARCHAR NOT NULL,
    inserted_at DATE NOT NULL
);
I need to get a daily report with the following information:
how many log records with status = open were created,
how many log records with status = closed were created,
how many log records with status = open exist to this day, including this day.
Here's a sample report output:
day | created | closed | total_open
------------+---------+--------+------------
2017-01-01 | 2 | 0 | 2
2017-01-02 | 2 | 1 | 3
2017-01-03 | 1 | 1 | 3
2017-01-04 | 1 | 0 | 4
2017-01-05 | 1 | 0 | 5
2017-01-06 | 1 | 0 | 6
2017-01-07 | 1 | 0 | 7
2017-01-08 | 0 | 1 | 6
2017-01-09 | 0 | 0 | 6
2017-01-10 | 0 | 0 | 6
(10 rows)
I solved this in a very "dirty" way:
INSERT INTO logs (status, inserted_at) VALUES
('created', '2017-01-01'),
('created', '2017-01-01'),
('closed', '2017-01-02'),
('created', '2017-01-02'),
('created', '2017-01-02'),
('created', '2017-01-03'),
('closed', '2017-01-03'),
('created', '2017-01-04'),
('created', '2017-01-05'),
('created', '2017-01-06'),
('created', '2017-01-07'),
('closed', '2017-01-08');
SELECT days.day,
count(case when logs.inserted_at = days.day AND logs.status = 'created' then 1 end) as created,
count(case when logs.inserted_at = days.day AND logs.status = 'closed' then 1 end) as closed,
count(case when logs.inserted_at <= days.day AND logs.status = 'created' then 1 end) -
count(case when logs.inserted_at <= days.day AND logs.status = 'closed' then 1 end) as total
FROM (SELECT day::date FROM generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) day) days,
logs
GROUP BY days.day
ORDER BY days.day;
I posted the full example in a gist for brevity, and I would like to improve the solution.
Right now, EXPLAIN for my query reveals some ridiculous cost numbers that I would like to minimize (I don't have indexes yet).
What would an efficient query for the report above look like?
A possible solution is to use window functions:
select s.*, sum(created - closed) over (order by inserted_at)
from (select inserted_at,
             count(status) filter (where status = 'created') created,
             count(status) filter (where status = 'closed') closed
      from (select d::date inserted_at
            from generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) d) d
      left join logs using (inserted_at)
      group by inserted_at) s
http://rextester.com/GFRRP71067
Also, an index on (inserted_at, status) could help you a lot with this query.
Note: count(...) filter (where ...) is really just a fancy way to write count(case when ... then ... [else null] end).
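As a sketch of the suggested index (the name is illustrative), so the aggregation only has to read the two columns it needs:
CREATE INDEX logs_inserted_at_status_idx ON logs (inserted_at, status);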

How to insert row data between consecutive dates in HIVE?

Sample Data:
customer txn_date tag
A 1-Jan-17 1
A 2-Jan-17 1
A 4-Jan-17 1
A 5-Jan-17 0
B 3-Jan-17 1
B 5-Jan-17 0
Need to fill in every missing txn_date within the date range (1-Jan-17 to 5-Jan-17), just like below:
Output should be:
customer txn_date tag
A 1-Jan-17 1
A 2-Jan-17 1
A 3-Jan-17 0 (inserted)
A 4-Jan-17 1
A 5-Jan-17 0
B 1-Jan-17 0 (inserted)
B 2-Jan-17 0 (inserted)
B 3-Jan-17 1
B 4-Jan-17 0 (inserted)
B 5-Jan-17 0
One approach: generate every date in the range with posexplode, cross join it with the distinct customers, and left join the original data onto that grid:
select c.customer
,d.txn_date
,coalesce(t.tag,0) as tag
from (select date_add (from_date,i) as txn_date
from (select date '2017-01-01' as from_date
,date '2017-01-05' as to_date
) p
lateral view
posexplode(split(space(datediff(p.to_date,p.from_date)),' ')) pe as i,x
) d
cross join (select distinct
customer
from t
) c
left join t
on t.customer = c.customer
and t.txn_date = d.txn_date
;
c.customer d.txn_date tag
A 2017-01-01 1
A 2017-01-02 1
A 2017-01-03 0
A 2017-01-04 1
A 2017-01-05 0
B 2017-01-01 0
B 2017-01-02 0
B 2017-01-03 1
B 2017-01-04 0
B 2017-01-05 0
Just put the delta content, i.e. the missing data, in a file (input.txt) delimited with the same delimiter you specified when you created the table.
Then use the load data command to insert these records into the table:
load data local inpath '/tmp/input.txt' into table tablename;
Your data won't be in the order you mentioned; it will get appended at the end. You can retrieve it in order by adding order by txn_date to the select query.
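For example, a minimal sketch (tablename and columns as used above):
select customer, txn_date, tag
from tablename
order by customer, txn_date;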

Forming a T-SQL query that ranks and categorizes a date field

I have this dataset:
create table #date_example
(
date_val datetime, rownum int
)
insert #date_example values('3/1/14',1)
insert #date_example values('3/1/14',2)
insert #date_example values('3/1/14',3)
insert #date_example values('2/1/14',4)
insert #date_example values('1/3/14',5)
select --top 1 with ties
    date_val,
    ROW_NUMBER() OVER(PARTITION BY rownum ORDER BY date_val DESC) AS 'RowNum'
from #date_example
order by date_val desc
With output:
date_val RowNum
2014-03-01 00:00:00.000 1
2014-03-01 00:00:00.000 1
2014-03-01 00:00:00.000 1
2014-02-01 00:00:00.000 1
2014-01-03 00:00:00.000 1
But I want instead output:
date_val RowNum
2014-03-01 00:00:00.000 1
2014-03-01 00:00:00.000 1
2014-03-01 00:00:00.000 1
2014-02-01 00:00:00.000 2
2014-01-03 00:00:00.000 3
So I want the RowNum to be a ranking which includes ties. How can I do this?
I found the answer in another post. One note: RANK() leaves gaps after ties (here it would return 1, 1, 1, 4, 5), while DENSE_RANK() gives the consecutive numbering shown in your desired output:
select
    date_val,
    Dense_Rank() OVER(ORDER BY date_val DESC) AS 'RowNum'
from #date_example
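For comparison, a quick sketch running both ranking functions over the same data:
select
    date_val,
    Rank() over (order by date_val desc) as rank_rownum,             -- 1,1,1,4,5
    Dense_Rank() over (order by date_val desc) as dense_rank_rownum  -- 1,1,1,2,3
from #date_example
order by date_val desc;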