Can some one please help me with how to create end date from start date.
Products referred to a company for testing while the product with the company they carry out multiple tests on different dates and record the test date to establish the product condition i.e. (outcomeID).
I need to establish the StartDate which is the testDate and EndDate which is the start date of the next row. But if multiple consecutive tests resulted in the same OutcomeID I need to return only one row with the StartDate of the first test and the end date of the last test. In another word if the outcomeID did not change over a few consecutive tests.
Here is my data set
DECLARE #ProductTests TABLE
(
RequestID int not null,
ProductID int not null,
TestID int not null,
TestDate datetime null,
OutcomeID int
)
insert into #ProductTests
(RequestID ,ProductID ,TestID ,TestDate ,OutcomeID )
select 1,2,22,'2005-01-21',10
union all
select 1,2,42,'2007-03-17',10
union all
select 1,2,45,'2010-12-25',10
union all
select 1,2,325,'2011-01-14',13
union all
select 1,2,895,'2011-08-10',15
union all
select 1,2,111,'2011-12-23',15
union all
select 1,2,636,'2012-05-02',10
union all
select 1,2,554,'2012-11-08',17
--select *from #producttests
RequestID ProductID TestID TestDate OutcomeID
1 2 22 2005-01-21 10
1 2 42 2007-03-17 10
1 2 45 2010-12-25 10
1 2 325 2011-01-14 13
1 2 895 2011-08-10 15
1 2 111 2011-12-23 15
1 2 636 2012-05-02 10
1 2 554 2012-11-08 17
And this is what I need to achieve.
RequestID ProductID StartDate EndDate OutcomeID
1 2 2005-01-21 2011-01-14 10
1 2 2011-01-14 2011-08-10 13
1 2 2011-08-10 2012-05-02 15
1 2 2012-05-02 2012-11-08 10
1 2 2012-11-08 NULL 17
As you see from the dataset the first three tests (22, 42, and 45) all resulted in OutcomeID 10 so in my result I only need start date of test 22 and end date of test 45 which is the start date of test 325.As you see in test 636 outcomeID has gone back to 10 from 15 so it needs to be returned too.
--This is what I have managed to achieve at the moment using the following script
select T1.RequestID,T1.ProductID,T1.TestDate AS StartDate
,MIN(T2.TestDate) AS EndDate ,T1.OutcomeID
from #producttests T1
left join #ProductTests T2 ON T1.RequestID=T2.RequestID
and T1.ProductID=T2.ProductID and T2.TestDate>T1.TestDate
group by T1.RequestID,T1.ProductID ,T1.OutcomeID,T1.TestDate
order by T1.TestDate
Result:
RequestID ProductID StartDate EndDate OutcomeID
1 2 2005-01-21 2007-03-17 10
1 2 2007-03-17 2010-12-25 10
1 2 2010-12-25 2011-01-14 10
1 2 2011-01-14 2011-08-10 13
1 2 2011-08-10 2011-12-23 15
1 2 2011-12-23 2012-05-02 15
1 2 2012-05-02 2012-11-08 10
1 2 2012-11-08 NULL 17
nov 7 but still not answered
so here is my solution
not soo pretty but works
my hint is read about windowing , ranking and aggregate functions like row_number, rank , avg, sum etc.
those are essential when you want to write raports , and becoming quite powerfull in sql server 2012
i have also used CTE (common table expression) but it can be written as subquery or temporary table
;with cte ( ida, requestid, productid, testid, testdate, outcomeid) as
(
-- select rows where the outcome id is changing
select b.* from
(select ROW_NUMBER() over( partition by requestid, productid order by testDate) as id, * from #ProductTests)a
right outer join
(select ROW_NUMBER() over(partition by requestid, productid order by testDate) as id, * from #ProductTests) b
on a.requestID = b.requestID and a.productID = b.productID and a.id +1 = b.id
where 1=1
--or a.id = 1
and a.outcomeid <> b.outcomeid or b.outcomeid is null or a.id is null
)
select --*
a.RequestID,a.ProductID,a.TestDate AS StartDate ,MIN(b.TestDate) AS EndDate ,a.OutcomeID
from cte a left join cte b on a.requestid = b.requestid and a.productid = b.productid and a.testdate < b.testdate
group by a.RequestID,a.ProductID ,a.OutcomeID,a.TestDate
order by StartDate
Actually, there seem to be two problems in your question. One is how to group sequential (based on specific criteria) rows containing the same value. The other is the one actually spelled out in your title, i.e. how to use the next row's StartDate as the current row's EndDate.
Personally, I would solve these two problems in the order I mentioned them, so I would first address the grouping problem. One way to group the data properly in this case would be to use double ranking like this:
WITH partitioned AS (
SELECT
*,
grp = ROW_NUMBER() OVER (PARTITION BY RequestID, ProductID ORDER BY TestDate)
- ROW_NUMBER() OVER (PARTITION BY RequestID, ProductID, OutcomeID ORDER BY TestDate)
FROM #ProductTests
)
, grouped AS (
SELECT
RequestID,
ProductID,
StartDate = MIN(TestDate),
OutcomeID
FROM partitioned
GROUP BY
RequestID,
ProductID,
OutcomeID,
grp
)
SELECT *
FROM grouped
;
This should give you the following output for your data sample:
RequestID ProductID StartDate OutcomeID
--------- --------- ---------- ---------
1 2 2005-01-21 10
1 2 2011-01-14 13
1 2 2011-08-10 15
1 2 2012-05-02 10
1 2 2012-11-08 17
Obviously, one thing is still missing, and it's EndDate, and now is the right time to care about it. Use ROW_NUMBER() once again, to rank the result set of the grouped CTE, then use the rankings in the join condition when joining the result set with itself (using an outer join):
WITH partitioned AS (
SELECT
*,
grp = ROW_NUMBER() OVER (PARTITION BY RequestID, ProductID ORDER BY TestDate)
- ROW_NUMBER() OVER (PARTITION BY RequestID, ProductID, OutcomeID ORDER BY TestDate)
FROM #ProductTests
)
, grouped AS (
SELECT
RequestID,
ProductID,
StartDate = MIN(TestDate),
OutcomeID,
rnk = ROW_NUMBER() OVER (PARTITION BY RequestID, ProductID ORDER BY MIN(TestDate))
FROM partitioned
GROUP BY
RequestID,
ProductID,
OutcomeID,
grp
)
SELECT
g1.RequestID,
g1.ProductID,
g1.StartDate,
g2.StartDate AS EndDate,
g1.OutcomeID
FROM grouped g1
LEFT JOIN grouped g2
ON g1.RequestID = g2.RequestID
AND g1.ProductID = g2.ProductID
AND g1.rnk = g2.rnk - 1
;
You can try this query at SQL Fiddle to verify that it returns the output you are after.
Related
I have a table of datestamped events that I need to bundle into 7-day groups, starting with the earliest occurrence of each event_id.
The final output should return each bundle's start and end date and 'value' column of the most recent event from each bundle.
There is no predetermined start date, and the '7-day' windows are arbitrary, not 'week of the year'.
I've tried a ton of examples from other posts but none quite fit my needs or use things I'm not sure how to refactor for BigQuery
Sample Data;
Event_Id
Event_Date
Value
1
2022-01-01
010203
1
2022-01-02
040506
1
2022-01-03
070809
1
2022-01-20
101112
1
2022-01-23
131415
2
2022-01-02
161718
2
2022-01-08
192021
3
2022-02-12
212223
Expected output;
Event_Id
Start_Date
End_Date
Value
1
2022-01-01
2022-01-03
070809
1
2022-01-20
2022-01-23
131415
2
2022-01-02
2022-01-08
192021
3
2022-02-12
2022-02-12
212223
You might consider below.
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
bin = 0;
a.reduce((c, v) => {
if (c + Number(v) > 6) { bin += 1; return 0; }
else return c += Number(v);
}, 0);
return bin;
""";
WITH sample_data AS (
select 1 event_id, DATE '2022-01-01' event_date, '010203' value union all
select 1 event_id, '2022-01-02' event_date, '040506' value union all
select 1 event_id, '2022-01-03' event_date, '070809' value union all
select 1 event_id, '2022-01-20' event_date, '101112' value union all
select 1 event_id, '2022-01-23' event_date, '131415' value union all
select 2 event_id, '2022-01-02' event_date, '161718' value union all
select 2 event_id, '2022-01-08' event_date, '192021' value union all
select 3 event_id, '2022-02-12' event_date, '212223' value
),
binning AS (
SELECT *, cumsumbin(ARRAY_AGG(diff) OVER w1) bin
FROM (
SELECT *, DATE_DIFF(event_date, LAG(event_date) OVER w0, DAY) AS diff
FROM sample_data
WINDOW w0 AS (PARTITION BY event_id ORDER BY event_date)
) WINDOW w1 AS (PARTITION BY event_id ORDER BY event_date)
)
SELECT event_id,
MIN(event_date) start_date,
ARRAY_AGG(
STRUCT(event_date AS end_date, value) ORDER BY event_date DESC LIMIT 1
)[OFFSET(0)].*
FROM binning GROUP BY event_id, bin;
I have a table with messages and I need to find chats where were two or more messages in period of 10 seconds. table
id message_id time
1 1 2021.11.10 13:09:00
1 2 2021.11.10 13:09:01
1 3 2021.11.10 13:09:50
2 1 2021.11.10 15:18:00
2 2 2021.11.10 15:20:00
3 1 2021.11.12 15:00:00
3 2 2021.11.12 15:10:00
3 2 2021.11.12 15:10:10
So the result looks like
id
1
3
I can't come up with the idea how to group by a period or maybe it can be done other way?
select id
from t
group by id, ?
having count(message_id) > 1
You can join the table with itself, matching them on the chat id and your timeframe.
create table messages (chat_id integer,message_id integer,"time" timestamp);
insert into messages values
(1,1,'2021.11.10 13:09:00'),
(1,2,'2021.11.10 13:09:01'),
(1,3,'2021.11.10 13:09:50'),
(2,1,'2021.11.10 15:18:00'),
(2,2,'2021.11.10 15:20:00'),
(3,1,'2021.11.12 15:00:00'),
(3,2,'2021.11.12 15:10:00'),
(3,2,'2021.11.12 15:10:10');
select target_chat,
target_message,
count(*) "number of messages preceding by no more than 10 seconds"
from
(select t1.chat_id target_chat,
t1.message_id target_message,
t1.time,
t2.chat_id,
t2.message_id,
t2.time
from messages t1
inner join messages t2
on t1.chat_id=t2.chat_id
and t1.message_id<>t2.message_id
and (t2.time<=t1.time-'10 seconds'::interval and t2.time<=t1.time)) a
group by 1,2;
-- target_chat | target_message | number of messages preceding by no more than 10 seconds
---------------+----------------+---------------------------------------------------------
-- 1 | 3 | 2
-- 2 | 2 | 1
-- 3 | 2 | 2
--(3 rows)
From that you can select the records with your desired number of preceding messages.
this is a simple query that finds every previous value that is included in our interval
select id from test_table t where
t.time + interval '10 second' >=
(select time from test_table where id=t.id and time>t.time limit 1)
group by id;
results
id
----
1
3
To find rows within an period of time, you can tipically use a window function which avoids a self join on the table :
SELECT id, count(*) OVER (ORDER BY time RANGE BETWEEN CURRENT ROW AND '10 minutes' FOLLOWING)
FROM t
GROUP BY id
Then you can use this query as a sub-query if you only want the id with count(*) > 1 :
SELECT DISTINCT ON (l.id) l.id
FROM
( SELECT id, count(*) OVER (ORDER BY time RANGE BETWEEN CURRENT ROW AND '10 minutes' FOLLOWING) AS ct
FROM t
GROUP BY id
) AS l
WHERE l.ct > 1 ;
I am trying to build a cohort analysis for monthly retention but experiencing challenge getting the Month Number column right. The month number is supposed to return month(s) user transacted i.e 0 for registration month, 1 for the first month after registration month, 2 for the second month until the last month but currently, it returns negative month numbers in some cells.
It should be like this table:
cohort_month total_users month_number percentage
---------- ----------- -- ------------ ---------
January 100 0 40
January 341 1 90
January 115 2 90
February 103 0 73
February 100 1 40
March 90 0 90
Here is the SQL:
with cohort_items as (
select
extract(month from insert_date) as cohort_month,
msisdn as user_id
from mfscore.t_um_user_detail where extract(year from insert_date)=2020
order by 1, 2
),
user_activities as (
select
A.sender_msisdn,
extract(month from A.insert_date)-C.cohort_month as month_number
from mfscore.t_wm_transaction_logs A
left join cohort_items C ON A.sender_msisdn = C.user_id
where extract(year from A.insert_date)=2020
group by 1, 2
),
cohort_size as (
select cohort_month, count(1) as num_users
from cohort_items
group by 1
order by 1
),
B as (
select
C.cohort_month,
A.month_number,
count(1) as num_users
from user_activities A
left join cohort_items C ON A.sender_msisdn = C.user_id
group by 1, 2
)
select
B.cohort_month,
S.num_users as total_users,
B.month_number,
B.num_users * 100 / S.num_users as percentage
from B
left join cohort_size S ON B.cohort_month = S.cohort_month
where B.cohort_month IS NOT NULL
order by 1, 3
I think the RANK window function is the right solution. So the idea is to assigne a rank to months of user activities for each user, order by year and month.
Something like:
WITH activity_per_user AS (
SELECT
user_id,
event_date,
RANK() OVER (PARTITION BY user_id ORDER BY DATE_PART('year', event_date) , DATE_PART('month', event_date) ASC) AS month_number
FROM user_activities_table
)
RANK number starts from 1, so you may want to substract 1.
Then, you can group by user_id and month_number to get the number of interactions for each user per month from the subscription (adapt to your use case accordingly).
SELECT
user_id,
month_number,
COUNT(1) AS n_interactions
FROM activity_per_user
GROUP BY 1, 2
Here is the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html
I have two tables and for each ID and Level combination in table1, I need to get a count of times matching ID appears in table2 in between sequential times for levels in table1.
So for example, for ID = 1 and Level=1 in table1, two Time entries from table2 for ID=1 fall between Time of Level=1 and Level=2 in table1, so result will be 2 in the result table.
table1:
ID Level Time
1 1 6/7/13 7:03
1 2 6/9/13 7:05
1 3 6/12/13 12:02
1 4 6/17/13 5:01
2 1 6/18/13 8:38
2 3 6/20/13 9:38
2 4 6/23/13 10:38
2 5 6/28/13 1:38
table2:
ID Time
1 6/7/13 11:51
1 6/7/13 14:15
1 6/9/13 16:39
1 6/9/13 19:03
2 6/20/13 11:02
2 6/20/13 15:50
Result would be
ID Level Count
1 1 2
1 2 2
1 3 0
1 4 0
2 1 0
2 3 2
2 4 0
2 5 0
select transformed_tab1.id, transformed_tab1.level, count(tab2.id)
from
(select tab1.id, tab1.level, tm, lead(tm) over (partition by id order by tm) as next_tm
from
(
select 1 as id, 1 as level, '2013-06-07 07:03'::timestamp as tm union
select 1 as id, 2 as level, '2013-06-09 07:05 '::timestamp as tm union
select 1 as id, 3 as level, '2013-06-12 12:02'::timestamp as tm union
select 1 as id, 4 as level, '2013-06-17 05:01'::timestamp as tm union
select 2 as id, 1 as level, '2013-06-18 08:38'::timestamp as tm union
select 2 as id, 3 as level, '2013-06-20 09:38'::timestamp as tm union
select 2 as id, 4 as level, '2013-06-23 10:38'::timestamp as tm union
select 2 as id, 5 as level, '2013-06-28 01:38'::timestamp as tm) tab1
) transformed_tab1
left join
(select 1 as id, '2013-06-07 11:51'::timestamp as tm union
select 1 as id, '2013-06-07 14:15'::timestamp as tm union
select 1 as id, '2013-06-09 16:39'::timestamp as tm union
select 1 as id, '2013-06-09 19:03'::timestamp as tm union
select 2 as id, '2013-06-20 11:02'::timestamp as tm union
select 2 as id, '2013-06-20 15:50'::timestamp as tm) tab2
on transformed_tab1.id=tab2.id and tab2.tm between transformed_tab1.tm and transformed_tab1.next_tm
group by transformed_tab1.id, transformed_tab1.level
order by transformed_tab1.id, transformed_tab1.level
;
SQL Fiddle
select t1.id, level, count(t2.id)
from
(
select id, level,
tsrange(
"time",
lead("time", 1, 'infinity') over(
partition by id order by level
),
'[)'
) as time_range
from t1
) t1
left join
t2 on t1.id = t2.id and t1.time_range #> t2."time"
group by t1.id, level
order by t1.id, level
The solution starts creating a range of timestamps using the lead window function. Notice the [) parameter to the tsrange constructor. It means to include the lower and exclude the upper bound.
Then it joins the two tables with the #> range operator. It means the range includes the element.
It is necessary to left join t1 to have the zero counts.
I have multiple CTEs and I want to retrieve some information from a couple of them into next CTE.
So, I have this information from one of the CTEs:
PeriodID StarDate
1 2006-01-01
2 2007-04-25
3 2008-08-16
4 2009-12-08
5 2011-04-017
and this from other:
RecordID Date
100 2007-04-15
101 2008-05-21
102 2008-06-06
103 2008-07-01
104 2009-11-12
And I need to show in next one:
RecordID Date PeriodID
100 2007-04-15 1
101 2008-05-21 2
102 2008-06-06 2
103 2008-07-01 2
104 2009-11-12 3
I can use some case/when statement to define if date of record is in period 1,2,3,4 or 5 but it some situation I can have different numbers of periods return from the first CTE.
Is there a way to do this in the above context?
You can have multiple CTEs defined as follows, and then select from and join them as you would any other table.
with cte1 as (select * ...),
cte2 as (select * ...)
select
cte2.*,
periodid
from cte2
cross apply
(select top 1 * from cte1 where cte2.recorddate> cte1.startdate order by startdate desc) v