Updating Multiple Rows based on MAX(Date) in other tables - tsql

So my scenario: I need to update two columns, ActivityDate and Activity, in a User table. The date needs to be set to the MAX(PostedDate) from one of two other tables, Notes or Tasks, while the Activity code is set based on which activity had the max date. I am not sure how to go about this.
Example.
User table -- Currently
UserID  ActivityDate             Activity
1       NULL                     NULL
2       NULL                     NULL
3       NULL                     NULL
4       NULL                     NULL
Notes table:
UserID  PostedDate               Activity
1       2015-01-01 10:15:00.000  1
2       2015-02-01 10:15:00.000  1
3       2015-03-01 10:15:00.000  1
4       2015-04-01 10:15:00.000  1
Tasks table:
UserID  PostedDate               Activity
1       2015-01-15 11:30:00.000  2
2       2015-01-15 11:30:00.000  2
3       2015-05-01 11:30:00.000  2
4       2015-02-05 11:30:00.000  2
User table -- WHAT I NEED IT TO LOOK LIKE
UserID  ActivityDate             Activity
1       2015-01-15 11:30:00.000  2
2       2015-02-01 10:15:00.000  1
3       2015-05-01 11:30:00.000  2
4       2015-04-01 10:15:00.000  1
So as you can see, I need both the MAX(PostedDate) and the matching Activity to be updated. Is there an easy way to go about something like this for 6.5 million rows?

Easy? No. What you need to do is stack the two tables on top of each other with a UNION, then find the max date for each UserID, then join back to that combined set, matching your max date to the specific record's date.
I think this will work...
SELECT
'UPDATE [User Table] SET [ActivityDate] = ''' + CONVERT(nvarchar(max), [DerivedTasksAndNotesTogetherAgain].[PostedDate]) + ''', [Activity] = ' + CONVERT(nvarchar(max), [DerivedTasksAndNotesTogetherAgain].[Activity]) + ' WHERE UserID = ' + CONVERT(nvarchar(max), [DerivedTasksAndNotesTogetherAgain].[UserID]) AS [TSQL]
FROM
(
SELECT
[UserID],
MAX([DerivedTasksAndNotesTogether].[PostedDate]) AS [MaxDate]
FROM
(
(
SELECT
--'Notes' AS [Source],
[Notes Table].[UserID],
[Notes Table].[PostedDate],
[Notes Table].[Activity]
FROM
[Notes Table]
)
UNION ALL
(
SELECT
--'Tasks' AS [Source],
[Tasks Table].[UserID],
[Tasks Table].[PostedDate],
[Tasks Table].[Activity]
FROM
[Tasks Table]
)
) DerivedTasksAndNotesTogether
GROUP BY
[UserID]
) DerivedMaxValues
LEFT OUTER JOIN
(
(
SELECT
--'Notes' AS [Source],
[Notes Table].[UserID],
[Notes Table].[PostedDate],
[Notes Table].[Activity]
FROM
[Notes Table]
)
UNION ALL
(
SELECT
--'Tasks' AS [Source],
[Tasks Table].[UserID],
[Tasks Table].[PostedDate],
[Tasks Table].[Activity]
FROM
[Tasks Table]
)
) DerivedTasksAndNotesTogetherAgain
    ON [DerivedMaxValues].[MaxDate] = [DerivedTasksAndNotesTogetherAgain].[PostedDate]
    -- match on UserID as well, otherwise users who happen to share a timestamp cross-match
    AND [DerivedMaxValues].[UserID] = [DerivedTasksAndNotesTogetherAgain].[UserID]
Now you can either copy/paste the results to execute them, or wrap this SELECT in a cursor to automate it.
While this should work as is, there is some room for improvement here. First, that big ugly UNION that gets used twice could probably be turned into a CTE just to reduce the size. Also, I'm not a big fan of dynamic T-SQL in general, and there are some wizards on these forums when it comes to UPDATE from SELECT - one of them may come along and be able to eliminate that for you.
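For what it's worth, a set-based UPDATE ... FROM along these lines should eliminate the dynamic T-SQL entirely. This is a minimal sketch, assuming the table and column names from the example; ROW_NUMBER picks the newest row per user, which replaces the join back on the max date:
;WITH Latest AS (
    SELECT UserID, PostedDate, Activity,
           ROW_NUMBER() OVER (PARTITION BY UserID
                              ORDER BY PostedDate DESC) AS rn
    FROM (
        SELECT UserID, PostedDate, Activity FROM [Notes Table]
        UNION ALL
        SELECT UserID, PostedDate, Activity FROM [Tasks Table]
    ) AS TasksAndNotes
)
UPDATE u
SET u.ActivityDate = l.PostedDate,
    u.Activity     = l.Activity
FROM [User Table] AS u
INNER JOIN Latest AS l
    ON l.UserID = u.UserID
    AND l.rn = 1;  -- rn = 1 is the row with the latest PostedDate per user
At 6.5 million rows you may still want to run it in batches to keep the transaction log manageable.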

Related

BigQuery SQL: Group rows with shared ID that occur within 7 days of each other, and return values from most recent occurrence

I have a table of datestamped events that I need to bundle into 7-day groups, starting with the earliest occurrence of each event_id.
The final output should return each bundle's start and end date, plus the 'value' column of the most recent event in each bundle.
There is no predetermined start date, and the '7-day' windows are arbitrary, not 'week of the year'.
I've tried a ton of examples from other posts, but none quite fit my needs, or they use things I'm not sure how to refactor for BigQuery.
Sample data:
Event_Id | Event_Date | Value
---------|------------|-------
1        | 2022-01-01 | 010203
1        | 2022-01-02 | 040506
1        | 2022-01-03 | 070809
1        | 2022-01-20 | 101112
1        | 2022-01-23 | 131415
2        | 2022-01-02 | 161718
2        | 2022-01-08 | 192021
3        | 2022-02-12 | 212223
Expected output:
Event_Id | Start_Date | End_Date   | Value
---------|------------|------------|-------
1        | 2022-01-01 | 2022-01-03 | 070809
1        | 2022-01-20 | 2022-01-23 | 131415
2        | 2022-01-02 | 2022-01-08 | 192021
3        | 2022-02-12 | 2022-02-12 | 212223
You might consider the approach below.
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  // Running total of day-gaps within the current bundle; whenever adding
  // the next gap would push past 6 days, open a new bin and reset the total.
  bin = 0;
  a.reduce((c, v) => {
    if (c + Number(v) > 6) { bin += 1; return 0; }
    else return c += Number(v);
  }, 0);
  return bin;
""";
WITH sample_data AS (
  SELECT 1 event_id, DATE '2022-01-01' event_date, '010203' value UNION ALL
  SELECT 1 event_id, '2022-01-02' event_date, '040506' value UNION ALL
  SELECT 1 event_id, '2022-01-03' event_date, '070809' value UNION ALL
  SELECT 1 event_id, '2022-01-20' event_date, '101112' value UNION ALL
  SELECT 1 event_id, '2022-01-23' event_date, '131415' value UNION ALL
  SELECT 2 event_id, '2022-01-02' event_date, '161718' value UNION ALL
  SELECT 2 event_id, '2022-01-08' event_date, '192021' value UNION ALL
  SELECT 3 event_id, '2022-02-12' event_date, '212223' value
),
binning AS (
  SELECT *, cumsumbin(ARRAY_AGG(diff) OVER w1) bin
  FROM (
    -- LAG is NULL on the first row of each partition; default the gap to 0
    -- so ARRAY_AGG never has to aggregate a NULL element
    SELECT *, IFNULL(DATE_DIFF(event_date, LAG(event_date) OVER w0, DAY), 0) AS diff
    FROM sample_data
    WINDOW w0 AS (PARTITION BY event_id ORDER BY event_date)
  )
  WINDOW w1 AS (PARTITION BY event_id ORDER BY event_date)
)
SELECT event_id,
       MIN(event_date) start_date,
       -- the newest row per bundle carries the end date and its value
       ARRAY_AGG(
         STRUCT(event_date AS end_date, value) ORDER BY event_date DESC LIMIT 1
       )[OFFSET(0)].*
FROM binning
GROUP BY event_id, bin;
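As a sanity check, trace event_id 1: its day gaps are 0, 1, 1, 17, 3. The running total stays at or below 6 for the first three rows, so they share bin 0; the 17-day gap pushes past 6, so the bin increments to 1 and the total resets; the 3-day gap keeps the final row in bin 1. The two bins give exactly the expected bundles, 2022-01-01..2022-01-03 and 2022-01-20..2022-01-23.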

How to Shorten Execution Time for A View

I have 3 tables: a user table, an admin table, and a cust table. Both the admin and cust tables are foreign-keyed to the user table. Basically, every user has a user record, and the type of user they are is determined by whether they have a record in the admin or the cust table.
user        admin                  cust
user_id     user_id | admin_id     user_id | cust_id
---------   ---------|----------   ---------|---------
1           1        | a           2        | dd
2           4        | b           3        | ff
3
4
Then I have a login_history table that records the user_id and login timestamp every time a user logs into the app
login_history
user_id | login_on
---------|---------------------
1 | 2022-01-01 13:22:43
1 | 2022-01-02 16:16:27
3 | 2022-01-05 21:17:52
2 | 2022-01-11 11:12:26
3 | 2022-01-12 03:34:47
I would like to create a view with one row per week of the year, starting from Jan 1st: the date of the first day of the week, a count of the unique admin users that logged in that week, and a count of the unique cust users that logged in that week. So the resulting view should contain the following 53 records, one for each week.
login_counts_view
week_start_date | admin_count | cust_count
-----------------|-------------|------------
2022-01-01 | 1 | 1
2022-01-08 | 0 | 2
2022-01-15 | 0 | 0
.
.
.
2022-12-31 | 0 | 0
Note that the first week (2022-01-01) only has 1 count for admin_count even though the admin with user_id 1 logged in twice that week.
Below is the current query I have for the view. However, the tables are pretty large and it takes over 10 seconds to retrieve all records from the view, mainly because of the date-range comparisons in the LEFT JOIN.
CREATE VIEW login_counts_view AS
SELECT
week_start_dates.week_start_date::text AS week_start_date,
count(distinct a.user_id) AS admin_count,
count(distinct c.user_id) AS cust_count
FROM (
SELECT
to_char(i::date, 'YYYY-MM-DD') AS week_start_date
FROM
generate_series(date_trunc('year', NOW()), to_char(NOW(), 'YYYY-12-31')::date, '1 week') i
) week_start_dates
LEFT JOIN login_history l ON l.login_on::date BETWEEN week_start_dates.week_start_date::date AND (week_start_dates.week_start_date::date + INTERVAL '6 day')::date
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
GROUP BY week_start_date;
Does anyone have any tips as to how to make this query execute more efficiently?
Idea
Compute the pseudo-week of each login date: partition the year into 7-day slices and number them consecutively. The pseudo-week of a given date would be the ordinal number of the slice it falls into.
Then make the joins operate on integers representing the pseudo-weeks instead of on date values and comparisons.
Implementation
A view to implement this follows:
CREATE VIEW login_counts_view_fast AS
WITH RECURSIVE Numbers(i) AS ( SELECT 0 UNION ALL SELECT i + 1 FROM Numbers WHERE i < 52 )
SELECT CAST ( date_trunc('year', NOW()) AS DATE) + 7 * n.i week_start_date
, count(distinct lw.admin_id) admin_count
, count(distinct lw.cust_id) cust_count
FROM (
SELECT i FROM Numbers
) n
LEFT JOIN (
SELECT admin_id
, cust_id
, base
, pit
, pit-base delta
, (pit-base) / (3600 * 24 * 7) week
FROM (
SELECT a.user_id admin_id
, c.user_id cust_id
, CAST ( EXTRACT ( EPOCH FROM l.login_on ) AS INTEGER ) pit
, CAST ( EXTRACT ( EPOCH FROM date_trunc('year', NOW()) ) AS INTEGER ) base
FROM login_history l
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
) le
) lw
ON lw.week = n.i
GROUP BY n.i
;
Some remarks:
The epoch values are the number of seconds elapsed since an absolute base datetime (specifically 1/1/1970 0h00).
CASTs are necessary to convert doubles to integers and timestamps to dates, as required by the signatures of the PostgreSQL date functions, and to enforce integer arithmetic.
The recursive subquery is a generator of consecutive integers. It could possibly be replaced by a generate_series call (untested), as sketched below.
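For reference, the swap could look like this (a sketch only, untested against the view above, though generate_series itself is standard PostgreSQL):
-- Generates the consecutive integers 0..52 without recursion;
-- would replace WITH RECURSIVE Numbers(i) AS (...) and the (SELECT i FROM Numbers) n subquery
SELECT i FROM generate_series(0, 52) AS n(i);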
Evaluation
The query plan indicates savings of 50-70% in execution time.

How to calculate the number of messages within 10 seconds before the previous one?

I have a table with messages, and I need to find chats where there were two or more messages within a period of 10 seconds:
id  message_id  time
1   1           2021.11.10 13:09:00
1   2           2021.11.10 13:09:01
1   3           2021.11.10 13:09:50
2   1           2021.11.10 15:18:00
2   2           2021.11.10 15:20:00
3   1           2021.11.12 15:00:00
3   2           2021.11.12 15:10:00
3   3           2021.11.12 15:10:10
So the result looks like
id
1
3
I can't come up with a way to group by a period, or maybe it can be done some other way?
select id
from t
group by id, ?
having count(message_id) > 1
You can join the table with itself, matching them on the chat id and your timeframe.
create table messages (chat_id integer,message_id integer,"time" timestamp);
insert into messages values
(1,1,'2021.11.10 13:09:00'),
(1,2,'2021.11.10 13:09:01'),
(1,3,'2021.11.10 13:09:50'),
(2,1,'2021.11.10 15:18:00'),
(2,2,'2021.11.10 15:20:00'),
(3,1,'2021.11.12 15:00:00'),
(3,2,'2021.11.12 15:10:00'),
(3,3,'2021.11.12 15:10:10');
select target_chat,
       target_message,
       count(*) "number of messages preceding by no more than 10 seconds"
from
  (select t1.chat_id target_chat,
          t1.message_id target_message,
          t1.time,
          t2.chat_id,
          t2.message_id,
          t2.time
   from messages t1
   inner join messages t2
     on t1.chat_id = t2.chat_id
    and t1.message_id <> t2.message_id
    -- t2 must fall within the 10 seconds leading up to t1
    and t2.time >= t1.time - '10 seconds'::interval
    and t2.time <= t1.time) a
group by 1,2;
-- target_chat | target_message | number of messages preceding by no more than 10 seconds
---------------+----------------+---------------------------------------------------------
--           1 |              2 |                                                       1
--           3 |              3 |                                                       1
--(2 rows)
From that you can select the records with your desired number of preceding messages.
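For instance, to reduce that to just the chat ids the question asks for, something like this works against the messages table created above (a sketch; it pairs each message with any strictly earlier message in the same chat within 10 seconds):
-- Chats having at least one message preceded by another within 10 seconds
SELECT DISTINCT t1.chat_id AS id
FROM messages t1
JOIN messages t2
  ON t2.chat_id = t1.chat_id
 AND t2.time >= t1.time - interval '10 seconds'
 AND t2.time <  t1.time
ORDER BY id;
-- id
-- ----
--  1
--  3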
This is a simple query that checks, for each message, whether the next message in the same chat falls within a 10-second interval:
select id from test_table t
where t.time + interval '10 second' >=
      (select time from test_table
       where id = t.id and time > t.time
       order by time   -- without this, limit 1 picks an arbitrary later row
       limit 1)
group by id;
Results:
id
----
1
3
To find rows within a period of time, you can typically use a window function, which avoids a self join on the table (RANGE with an interval offset requires PostgreSQL 11 or later):
SELECT id, count(*) OVER (PARTITION BY id ORDER BY time
                          RANGE BETWEEN CURRENT ROW AND '10 seconds' FOLLOWING) AS ct
FROM t
Then you can use this query as a sub-query if you only want the ids with count(*) > 1:
SELECT DISTINCT l.id
FROM
( SELECT id, count(*) OVER (PARTITION BY id ORDER BY time
                            RANGE BETWEEN CURRENT ROW AND '10 seconds' FOLLOWING) AS ct
  FROM t
) AS l
WHERE l.ct > 1 ;

Select dates missing data in a range

I have a postgres table test_table that looks like this:
date | test_hour
------------+-----------
2000-01-01 | 1
2000-01-01 | 2
2000-01-01 | 3
2000-01-02 | 1
2000-01-02 | 2
2000-01-02 | 3
2000-01-02 | 4
2000-01-03 | 1
2000-01-03 | 2
I need to select all the dates which don't have test_hour = 1, 2, and 3, so it should return
date
------------
2000-01-03
Here is what I have tried:
SELECT date FROM test_table WHERE test_hour NOT IN (SELECT generate_series(1,3));
But that only returns dates that have extra hours beyond 1, 2, 3
You can use aggregation and a conditional HAVING clause, like so:
SELECT date
FROM test_table
GROUP BY date
HAVING
    MAX(CASE WHEN test_hour = 1 THEN 1 ELSE 0 END) = 0
    OR MAX(CASE WHEN test_hour = 2 THEN 1 ELSE 0 END) = 0
    OR MAX(CASE WHEN test_hour = 3 THEN 1 ELSE 0 END) = 0
The ELSE 0 matters: without it, MAX(CASE WHEN ... THEN 1 END) is NULL for a missing hour, and a comparison against NULL is never true, so no date would ever be returned. With the sample data, only 2000-01-03 satisfies the clause: its MAX for hours 1 and 2 is 1, but the missing hour 3 leaves that MAX at 0.
Another possibility would be to join it against the series (or another subquery containing the hours) and do a [distinct] count of the hours, aggregated per date:
select date from test_table tst
inner join (select generate_series(1,3) "hour") hours on hours.hour = tst.test_hour
group by tst.date
having count(distinct tst.test_hour) < 3;
or
select date from test_table
where test_hour in (select generate_series(1,3))
group by date
having count(distinct test_hour) < 3;
[You don't need the distinct if date/hour combinations in your table are unique.]
A solution using set difference, giving you exactly the (date, hour) rows that are missing (TABLE test_table is PostgreSQL shorthand for SELECT * FROM test_table):
(SELECT DISTINCT
date, all_hour
FROM test_table
CROSS JOIN generate_series(1,3) all_hour)
EXCEPT
(TABLE test_table)
And a solution using an array aggregate and the array contains operator:
SELECT date
FROM test_table
GROUP BY date
HAVING NOT array_agg(test_hour) #> ARRAY(SELECT generate_series(1,3))

Get column of table for results having sum(a_int)=0 and order by date and group by another column

Think of a table like below, with columns unique_id, a_column, b_column, a_int, b_int, and date_created. Let's say the data is:
unique_id | a_column | b_column | a_int | b_int | date_created
----------|----------|----------|-------|-------|--------------------
1z23      | abc      | 444      | 0     | 1     | 27.12.2016 18:03:00
2c31      | abc      | 444      | 0     | 0     | 26.12.2016 13:40:00
2e22      | qwe      | 333      | 0     | 1     | 28.12.2016 15:45:00
1b11      | qwe      | 333      | 1     | 1     | 27.12.2016 19:00:00
3a33      | rte      | 333      | 0     | 1     | 15.11.2016 11:00:00
4d44      | rte      | 333      | 0     | 1     | 27.09.2016 18:00:00
6e66      | irt      | 333      | 0     | 1     | 22.12.2016 13:00:00
7q77      | aaa      | 555      | 1     | 0     | 27.12.2016 18:00:00
I want to get the unique_ids where b_int is 1 and b_column is 333, but only from a_column groups where a_int is always 0; if a group has any record with a_int = 1, then none of its records should appear in the result, even the ones with a_int = 0. Grouping by a_column, ordering by date_created, and taking the top 1 per unique a_column, the desired result is: 3a33, 6e66.
I tried lots of WITH TIES and OVER (PARTITION BY ...) samples and searched existing questions, but couldn't manage to do it. This is what I could come up with:
select unique_id
from the_table
where b_column = '333'
and b_int = 1
and a_column in (select a_column
from the_table
where b_column = '333'
and b_int = 1
group by a_column
having sum(a_int) = 0)
order by date_created desc;
This query returns the result like this: 3a33, 4d44, 6e66. But I don't want 4d44.
You were on the right track with the partitions and window functions. This solution uses ROW_NUMBER to number the rows within each a_column, so we can see where there is more than one; row number 1 is the most recent date_created. Then you select from the result set where the row counter is 1.
;WITH CTE
AS (
SELECT unique_id
, a_column
, ROW_NUMBER() OVER (
PARTITION BY a_column ORDER BY date_created DESC
) AS row_counter --This assigns a 1 to the most recent date_created and partitions by a_column
FROM #test
WHERE a_column IN (
SELECT a_column
FROM #test
WHERE b_column = '333'
AND b_int = 1
GROUP BY a_column
HAVING MAX(a_int) < 1
)
)
SELECT unique_ID
FROM cte
WHERE row_counter = 1
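As a check against the sample data: the inner subquery keeps the a_column values rte and irt (qwe is excluded because 1b11 has a_int = 1), and the ROW_NUMBER filter then returns the most recent row from each remaining group, 3a33 and 6e66, which matches the desired result.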