Scenario: Counting active users for a time series analysis.
Need: In PostgreSQL (Redshift), count the customers that have more than X unique transactions within Y days of each date, grouped by date.
How do I achieve this?
Table: orders

date        user_id  product_id  transaction_id
2022-01-01  001      003         001
2022-01-02  002      001         002
2022-03-01  003      001         003
2022-03-01  003      002         003
...         ...      ...         ...
Outcome:

date        active_customers
2022-01-01  10
2022-01-02  12
2022-01-03  9
2022-01-04  13
You may be able to use the window functions LEAD() and LAG() here, but this solution may also work for you.
WITH data AS
(
    SELECT c.date
         , o.user_id
         , COUNT(DISTINCT o.transaction_id) tcount
    FROM (SELECT DISTINCT date FROM orders) c
    JOIN orders o
      ON o.date BETWEEN c.date - '30 DAYS'::INTERVAL AND c.date -- Y days back from the given date
    GROUP BY c.date, o.user_id
), user_transaction_count AS
(
    SELECT d.date
         , COUNT(d.user_id) FILTER (WHERE d.tcount > 1) user_count -- more than X transactions (here X = 1)
    FROM data d
    GROUP BY d.date
)
SELECT u.date
     , u.user_count active_customers
FROM user_transaction_count u
ORDER BY u.date
;
Here is a DBFiddle that demos a couple options.
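Since LEAD()/LAG() came up: one way to use LAG() here is to flag a transaction whenever the same user's X-th previous transaction still falls inside the Y-day window, then count distinct flagged users per date. A minimal sketch, assuming X = 2 and Y = 30 days (both are placeholders, not values from the question); note it only returns dates on which at least one qualifying transaction happened:

WITH flagged AS
(
    SELECT o.date
         , o.user_id
         , LAG(o.date, 2) OVER (PARTITION BY o.user_id ORDER BY o.date) AS xth_prev_date -- X = 2
    FROM orders o
)
SELECT f.date
     , COUNT(DISTINCT f.user_id) AS active_customers
FROM flagged f
WHERE f.xth_prev_date >= f.date - '30 DAYS'::INTERVAL -- the 2nd-previous transaction is inside the 30-day window
GROUP BY f.date
ORDER BY f.date
;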
I have a table in which I need to renumber a column, but the new numbering has to keep the original ordering by date.
TABLE_1
id num_seq DateTimeStamp
fb4e1683-7035-4895-b2c8-d084d9b42ce3 111 08-02-2005
e40e4c3e-65e4-47b7-b13a-79e8bce2d02d 114 10-07-2017
49e261a8-a855-4844-a0ac-37b313da2222 113 01-30-2010
6c4bffb7-a056-4a20-ae1c-5a31bdf683f2 112 04-15-2006
I want to reorder num_seq starting with 1001 through 1004 and keep the numbering in order. So 111 = 1001 and 112 = 1002 and so forth.
This is what I have so far:
DECLARE @num INT
SET @num = 0
UPDATE Table_1
SET @num = num_seq = @num + 1
GO
I know that UPDATE doesn't let me use the keyword ORDER BY. Is there a way to do this in SQL 2008 R2?
Stage the new num_seq in a CTE, then leverage that in your update statement:
declare @Table_1 table (id uniqueidentifier, num_seq int, DateTimeStamp datetime);

insert into @Table_1
values
('fb4e1683-7035-4895-b2c8-d084d9b42ce3', 111, '08-02-2005'),
('e40e4c3e-65e4-47b7-b13a-79e8bce2d02d', 114, '10-07-2017'),
('49e261a8-a855-4844-a0ac-37b313da2222', 113, '01-30-2010'),
('6c4bffb7-a056-4a20-ae1c-5a31bdf683f2', 112, '04-15-2006');

;with stage as
(
    select *,
           num_seq_new = 1000 + row_number() over (order by DateTimeStamp asc)
    from @Table_1
)
update stage
set num_seq = num_seq_new;

select * from @Table_1;
Returns:
id num_seq DateTimeStamp
FB4E1683-7035-4895-B2C8-D084D9B42CE3 1001 2005-08-02 00:00:00.000
E40E4C3E-65E4-47B7-B13A-79E8BCE2D02D 1004 2017-10-07 00:00:00.000
49E261A8-A855-4844-A0AC-37B313DA2222 1003 2010-01-30 00:00:00.000
6C4BFFB7-A056-4A20-AE1C-5A31BDF683F2 1002 2006-04-15 00:00:00.000
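If Table_1 is a real table rather than the table variable used in this demo, the same updatable-CTE pattern should carry over directly; a sketch assuming the column names shown in the question:

;with stage as
(
    select num_seq,
           num_seq_new = 1000 + row_number() over (order by DateTimeStamp asc)
    from Table_1
)
update stage
set num_seq = num_seq_new;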
I need to capture multiple changes during the day, but eliminate a change when it immediately repeats the previous row.
Below is a snippet of the sample data.
Source Data:
SEQ_ID ID LastName FirstName Updated_Time
50 1010 A A 01/06/2016 10:00
51 1010 B B 01/06/2016 11:00
52 1010 C C 01/06/2016 12:00
53 1010 D D 01/06/2016 15:00
54 1010 D D 01/06/2016 17:00
55 1010 D D 01/06/2016 18:00
56 1010 B B 01/06/2016 20:00
57 1010 B B 01/06/2016 21:00
58 1010 B B 01/06/2016 22:00
59 1010 B B 01/06/2016 23:00
100 2020 X X 01/06/2016 10:00
202 3030 TTT TTT 01/06/2016 10:00
201 3030 UUU UUU 01/06/2016 11:00
203 3030 VVV VVV 01/06/2016 12:00
210 3030 UUU UUU 01/06/2016 15:00
302 4000 KQ KQ 01/06/2016 07:00
300 4000 KQ KQ 01/06/2016 08:00
301 4000 KQ KQ 01/06/2016 09:00
303 4000 KQ KQ 02/06/2016 08:00
The result should be as below:
SEQ_ID ID LastName FirstName Updated_Time
50 1010 A A 01/06/2016 10:00
51 1010 B B 01/06/2016 11:00
52 1010 C C 01/06/2016 12:00
53 1010 D D 01/06/2016 15:00
56 1010 B B 01/06/2016 20:00
100 2020 X X 01/06/2016 10:00
202 3030 TTT TTT 01/06/2016 10:00
201 3030 UUU UUU 01/06/2016 11:00
203 3030 VVV VVV 01/06/2016 12:00
210 3030 UUU UUU 01/06/2016 15:00
302 4000 KQ KQ 01/06/2016 07:00
This is the query I could come up with:
SELECT
[ID]
,[LastName]
,[FirstName]
, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ID, [Updated_Time])
- ROW_NUMBER() OVER (PARTITION BY ID,
CAST(HASHBYTES('SHA2_256', CONCAT(
ID
,[LastName]
,[FirstName]
)) AS binary(32)) ORDER BY ID ASC, [Updated_Time] ASC) [DWRecordGroupID]
FROM
xxxxxxx.xxxxxxx
order by ID , [Updated_Time] asc
Result of the Query:
ID LastName FirstName DWRecordGroupID
1010 A A 0
1010 B B 1
1010 C C 2
1010 D D 3
1010 D D 3
1010 D D 3
1010 B B 5
1010 B B 5
1010 B B 5
1010 B B 5
2020 X X 0
3030 TTT TTT 0
3030 UUU UUU 1
3030 VVV VVV 2
3030 UUU UUU 2
4000 KQ KQ 0
4000 KQ KQ 0
4000 KQ KQ 0
4000 KQ KQ 0
The idea is to eliminate duplicates based on ID and DWRecordGroupID. But somehow I am missing something in the part below, where the query gives two different rows the same group number and one of them gets eliminated arbitrarily, which is incorrect.
ID LastName FirstName DWRecordGroupID
3030 TTT TTT 0
3030 UUU UUU 1
3030 VVV VVV 2
3030 UUU UUU 2
Any help is really appreciated.
Thanks in advance.
I think you can try this (X1 is your table):
SELECT ID, LAST_NAME, FIRST_NAME, UPDATED_TIME FROM (
SELECT ID, LAST_NAME, FIRST_NAME, UPDATED_TIME
, LAG(LAST_NAME) OVER (PARTITION BY ID ORDER BY UPDATED_TIME, SEQ_ID) AS LNAME_prec
, LAG(FIRST_NAME) OVER (PARTITION BY ID ORDER BY UPDATED_TIME, SEQ_ID) AS FNAME_prec
FROM X1
) X2
WHERE LAST_NAME <> LNAME_prec OR FIRST_NAME <> FNAME_prec OR LNAME_prec IS NULL
Output:
ID LAST_NAME FIRST_NAME UPDATED_TIME
----------- ---------- ---------- -----------------------
1010 A A 2016-06-01 10:00:00.000
1010 B B 2016-06-01 11:00:00.000
1010 C C 2016-06-01 12:00:00.000
1010 D D 2016-06-01 15:00:00.000
1010 B B 2016-06-01 20:00:00.000
2020 X X 2016-06-01 10:00:00.000
3030 TTT TTT 2016-06-01 10:00:00.000
3030 UUU UUU 2016-06-01 11:00:00.000
3030 VVV VVV 2016-06-01 12:00:00.000
3030 UUU UUU 2016-06-01 15:00:00.000
4000 KQ KQ 2016-06-01 07:00:00.000
This will obviously need to be adapted to incorporate however many columns you have in your table:
;WITH cte ( rownum, seq_id, id, last_name, first_name, updated_time )
AS (SELECT row_number()
OVER (
ORDER BY id, updated_time),
seq_id,
id,
last_name,
first_name,
updated_time
FROM #tbl)
SELECT t.*
FROM cte l
INNER JOIN #tbl t ON l.seq_id = t.seq_id
LEFT OUTER JOIN cte p ON l.rownum - 1 = p.rownum
AND l.id = p.id
AND l.last_name = p.last_name
AND l.first_name = p.first_name
WHERE p.seq_id IS NULL
The real difficulty comes from the fact that, in the end, you have to compare every non-sequence field (i.e. not seq_id and not updated_time) from one row against every non-sequence field from another row.
Note: This solution naively assumes changes to a particular ID are to be treated as a single collection of changes. So if seq_id 548 that comes in on 01/23/2017 for id 1010 has the same first_name, last_name as seq_id 56, it will not be picked up. It could be adapted to work IF the seq_id column could be guaranteed to be in sequence order (but your sample data did not have that).
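If there are many such columns, one compact, NULL-safe way to do that row-to-row comparison in T-SQL is the EXISTS ... INTERSECT idiom. The following is only a sketch built on the same row-numbering idea, assuming the #tbl temp table and just the two name columns from the sample; extend the inner SELECT lists with the other non-sequence columns:

;WITH numbered AS
(
    SELECT *,
           rownum = ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_time, seq_id)
    FROM #tbl
)
SELECT cur.seq_id, cur.id, cur.last_name, cur.first_name, cur.updated_time
FROM numbered cur
LEFT OUTER JOIN numbered prev
       ON prev.id = cur.id
      AND prev.rownum = cur.rownum - 1
WHERE prev.seq_id IS NULL        -- first row for this id
   OR NOT EXISTS (               -- or at least one compared column differs (NULLs compare as equal)
        SELECT cur.last_name, cur.first_name
        INTERSECT
        SELECT prev.last_name, prev.first_name
      )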
You can use ROW_NUMBER() and keep the rows where it equals 1:
;with CTE as (
select *, RowN = row_number() over (partition by lastname order by seq_id) from #yourduplicates
) select * from cte where RowN = 1
Your Input table:
create table #yourDuplicates (Seq_ID int, id int, lastname varchar(10), firstname varchar(10), updated_time datetime)
insert into #yourDuplicates
(SEQ_ID , ID , LastName , FirstName , Updated_Time ) values
( 50 , 1010 ,'A ', 'A ', '01/06/2016 10:00')
, ( 51 , 1010 ,'B ', 'B ', '01/06/2016 11:00')
, ( 52 , 1010 ,'C ', 'C ', '01/06/2016 12:00')
, ( 53 , 1010 ,'D ', 'D ', '01/06/2016 15:00')
, ( 54 , 1010 ,'D ', 'D ', '01/06/2016 17:00')
, ( 55 , 1010 ,'D ', 'D ', '01/06/2016 18:00')
, ( 56 , 1010 ,'B ', 'B ', '01/06/2016 20:00')
, ( 57 , 1010 ,'B ', 'B ', '01/06/2016 21:00')
, ( 58 , 1010 ,'B ', 'B ', '01/06/2016 22:00')
, ( 59 , 1010 ,'B ', 'B ', '01/06/2016 23:00')
, ( 100 , 2020 ,'X ', 'X ', '01/06/2016 10:00')
, ( 202 , 3030 ,'TTT', 'TTT', '01/06/2016 10:00')
, ( 201 , 3030 ,'UUU', 'UUU', '01/06/2016 11:00')
, ( 203 , 3030 ,'VVV', 'VVV', '01/06/2016 12:00')
, ( 210 , 3030 ,'UUU', 'UUU', '01/06/2016 15:00')
, ( 302 , 4000 ,'KQ ', 'KQ ', '01/06/2016 07:00')
, ( 300 , 4000 ,'KQ ', 'KQ ', '01/06/2016 08:00')
, ( 301 , 4000 ,'KQ ', 'KQ ', '01/06/2016 09:00')
, ( 303 , 4000 ,'KQ ', 'KQ ', '02/06/2016 08:00')
I have two tables:
Table 1 = tbl_main:
item_id fastec_qty
001 102
002 200
003 300
004 400
Table 2 = tbl_dOrder:
order_id item_id amount
1001 001 30
1001 002 40
1002 001 50
1002 003 70
How can I write a query so that the result is as follows:
item_id amount difference
001 102 22
002 200 160
003 300 230
004 400 400
The difference column is the amount in table 1 minus the total amount disbursed from table 2.
SELECT a.item_id, a.fastec_qty AS amount, a.fastec_qty - COALESCE(q.amount, 0) AS difference
FROM tbl_main a
LEFT JOIN (
    SELECT item_id, SUM(amount) AS amount
    FROM tbl_dOrder
    GROUP BY item_id
) q ON q.item_id = a.item_id
This query first SUMs the amounts from tbl_dOrder grouped by item_id, then LEFT JOINs that result to tbl_main so that items with no orders (like 004) are kept, with COALESCE turning the missing total into 0 before the difference column is calculated.
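An equivalent formulation, assuming the same table and column names, joins the raw rows first and aggregates once; which one to use is mostly a matter of taste and of how the query will grow later:

SELECT a.item_id
     , a.fastec_qty AS amount
     , a.fastec_qty - COALESCE(SUM(d.amount), 0) AS difference
FROM tbl_main a
LEFT JOIN tbl_dOrder d ON d.item_id = a.item_id
GROUP BY a.item_id, a.fastec_qty
ORDER BY a.item_id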