I am having trouble counting the number of events by date and hour recorded across multiple tables.
I have a system manufacturer's database with multiple 'events' tables, all formatted identically (same number of columns and data types in the same order), that each hold around 100,000 transaction events and look like this:
EventID EventTimestamp EventType EventSubType UnitGuid DeviceGuid
1 2022-04-16 15:14:43.000 515 0 AAAA BBBB
2 2022-04-16 15:14:44.000 520 0 AAAA CCCC
3 2022-04-16 15:14:44.000 520 0 AAAA BBBB
Because each table holds ~100,000 records, events that occur on a single day can be spread over one or more tables.
I am interested in obtaining a count of the total number of events per hour, per day, which I am able to do on a table-by-table basis with the following query:
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_70
group by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
order by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
which produces output like:
event date event hour number of events
16 15 3966
16 16 4530
16 17 4357
... ... ...
I've been able to consolidate the data for multiple days in Excel with a little manual work, but there has to be a better way in SQL...
When I try to union two tables together with the following query:
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_70
union all
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_71
group by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
order by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
I am met with:
Column 'Events_70.EventTimestamp' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 104, Level 16, State 1, Line 85
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
Msg 104, Level 16, State 1, Line 85
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
I'm also guessing that a sum of counts for the same hour and day across two tables might be required, but I've not got that far yet. One table's output looks like:
event date event hour number of events
19 21 2460
19 22 1963
**19 23 435**
And the next table's output looks like:
event date event hour number of events
**19 23 1057**
20 00 867
20 01 930
I've been searching around various forums this morning and haven't found a solution; any help would be appreciated.
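For what it's worth, one common pattern here (just a sketch, assuming the tables really do share the same schema) is to UNION ALL the raw rows inside a derived table first, and only then group, so that the GROUP BY and ORDER BY apply to the combined set and overlapping day/hour buckets from different tables are added together automatically:
select DATEPART(DAY, e.EventTimestamp) as 'event date', DATEPART(HOUR, e.EventTimestamp) as 'event hour', count(*) as 'number of events'
from (
    select EventTimestamp from Events_70
    union all
    select EventTimestamp from Events_71
    -- add further Events_NN tables here as needed
) as e
group by datepart(day, e.EventTimestamp), datepart(hour, e.EventTimestamp)
order by datepart(day, e.EventTimestamp), datepart(hour, e.EventTimestamp)
If the data ever spans more than one month, grouping on CAST(e.EventTimestamp AS date) instead of DATEPART(DAY, ...) keeps the days of different months apart.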
I have something similar to the following table, which is a randomly ordered list of thousands of transactions with a Customer_ID and an order_cost for each transaction.
Customer_ID | order_cost
1           | $503
53          | $7
4           | $80
13          | $76
6           | $270
78          | $2
8           | $45
910         | $89
10          | $3
1130        | $43
etc...      | etc...
I want to group the transactions by Customer_ID, aggregate the cost of all the orders into a "spending" column, and then create a new "decile" column that assigns a number 1-10 to each customer so that, when the "spending" for all customers in a decile is added up, each decile contains 10% of all the spending.
The resulting table would look something like the one below, where each ascending decile contains fewer customers, but the total sum of "spending" for all the records in each decile group is the same for deciles 1-10. (The actual numbers in this sample don't add up; it's just to show the concept.)
Customer_ID | spending | Decile
45          | $500     | 1
3           | $700     | 1
349         | $800     | 1
23          | $1,000   | 1
64          | $2,000   | 1
718         | $2,100   | 1
3452        | $2,300   | 1
1276        | $2,600   | 2
10          | $3,000   | 2
34          | $4,000   | 2
etc...      | etc...   | etc...
So far I have grouped by Customer_ID, aggregated order_cost into a spending column, ordered the customers ascending by spending, and then partitioned them into 5000 groups. From there I manually found the cutoff values for each .when() statement that put the right number of customers into deciles 1-10 so that each decile holds 10% of the sum of the entire spending column. It's pretty time-consuming to use trial and error to find the bucket configuration that gives each decile 10% of the spending.
I'm trying to find a way to automate this process so I don't have to find the right bucketing ratio for each decile by trial and error.
This is my code so far:
import pyspark.sql.functions as F
from pyspark.sql import window as W

deciles = (table
    # total spending per customer
    .groupBy('Customer_ID')
    .agg(F.sum('order_cost').alias('spending')).alias('a')
    # rank customers into 5000 ascending buckets by spending
    .withColumn('rank', F.ntile(5000).over(W.Window.partitionBy()
                                           .orderBy(F.asc('spending'))))
    # manually chosen cut-offs mapping the 5000 buckets onto deciles 1-10
    .withColumn('rank', F.when(F.col('rank') <= 4628, F.lit(1))
                         .when(F.col('rank') <= 4850, F.lit(2))
                         .when(F.col('rank') <= 4925, F.lit(3))
                         .when(F.col('rank') <= 4965, F.lit(4))
                         .when(F.col('rank') <= 4980, F.lit(5))
                         .when(F.col('rank') <= 4987, F.lit(6))
                         .when(F.col('rank') <= 4993, F.lit(7))
                         .when(F.col('rank') <= 4997, F.lit(8))
                         .when(F.col('rank') <= 4999, F.lit(9))
                         .when(F.col('rank') <= 5000, F.lit(10))
                         .otherwise(F.lit(0)))
)
end_table = (table.alias('a').join(deciles.alias('b'), ['Customer_ID'], 'left')
.selectExpr('a.*', 'b.rank')
)
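One way to avoid hand-picking the .when() cut-offs is to bucket on the cumulative share of total spending instead of on customer rank. A rough sketch of the idea in plain SQL (the table name transactions is hypothetical, and order_cost is assumed to be numeric):
WITH spend AS (
    SELECT Customer_ID,
           SUM(order_cost) AS spending
    FROM transactions
    GROUP BY Customer_ID
)
SELECT Customer_ID,
       spending,
       -- running total of spending, walking up from the smallest spenders,
       -- divided by the grand total gives each customer's cumulative share;
       -- CEILING(share * 10) places the customer in the matching 10% band
       CEILING(10.0 * SUM(spending) OVER (ORDER BY spending ASC ROWS UNBOUNDED PRECEDING)
                    / SUM(spending) OVER ()) AS decile
FROM spend
The same windowed running sum should be expressible in PySpark with F.sum('spending').over(Window.orderBy('spending').rowsBetween(Window.unboundedPreceding, Window.currentRow)), divided by F.sum('spending').over(Window.partitionBy()), with F.ceil() applied to ten times the ratio, which removes the trial-and-error cut-offs entirely.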
I am working on a query to return the next 7 days' worth of data every time an event happens, indicated by "where event = 1". The goal is to then group all the data by user id and perform aggregate functions on the data after the event happens; the event is encoded as binary [0, 1].
So far, I have been attempting to use nested select statements to structure the data how I would like to have it, but using the window functions is starting to restrict me. I am now thinking a self join could be more appropriate but need help in constructing such a query.
The query currently first creates daily aggregate values grouped by user and date (3rd-level nested select). Then the 2nd level sums "value_x" to obtain an aggregate value grouped by the user. The 1st-level nested select then uses the lead function, partitioned by user, to grab the next row's value, which acts as selecting the next day's value when event = 1. Lastly, the outer select uses an aggregate function to calculate the average "sum_next_day_value_after_event" grouped by user and where event = 1. Put together, where event = 1, the query returns the avg(value_x) of the next row's total value_x.
However, this doesn't follow my time rule: where event = 1, return the next 7 days' worth of data after the event happens. If there isn't 7 days' worth of data, then return whatever data is <= 7 days. Yes, I currently only have one lead with an offset of 1, but I could just add 6 more of these to grab the next 6 rows. The problem is that lead grabs the next row without regard to date, so the next row's "value_x" could actually be 15 days after the row where event = 1. Also, as can be seen below in the data table, a user may have more than one row per day.
Here is the query I have so far:
select
f.user_id,
avg(f.sum_next_day_value_after_event) as sum_next_day_values
from (
select
bld.user_id,
lead(bld.sum_daily_value_x, 1) over(partition by bld.user_id order by bld.daily) as sum_next_day_value_after_event
from (
select
l.user_id,
l.daily,
sum(l.value_x) as sum_daily_value_x
from (
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x) l
group by l.user_id, l.daily
order by l.user_id) bld) f
group by f.user_id
Below is a snippet of the data from table_1:
user_id | day_ts        | value_x | event
50      | 4/2/21 07:37  | 25      | 0
50      | 4/2/21 07:42  | 45      | 0
50      | 4/2/21 09:14  | 67      | 1
50      | 4/5/21 10:09  | 8       | 0
50      | 4/5/21 10:24  | 75      | 0
50      | 4/8/21 11:08  | 34      | 0
50      | 4/15/21 13:09 | 32      | 1
50      | 4/16/21 14:23 | 12      | 0
50      | 4/29/21 14:34 | 90      | 0
55      | 4/4/21 15:31  | 12      | 0
55      | 4/5/21 15:23  | 34      | 0
55      | 4/17/21 18:58 | 32      | 1
55      | 4/17/21 19:00 | 66      | 1
55      | 4/18/21 19:57 | 54      | 0
55      | 4/23/21 20:02 | 34      | 0
55      | 4/29/21 20:39 | 57      | 0
55      | 4/30/21 21:46 | 43      | 0
Technical details:
PostgreSQL, supported by EDB, version = 14.1
pgAdmin4, version 5.7
Thanks for the help!
"The query currently first creates daily aggregate values"
I don't see any aggregate function in your first query, so the GROUP BY clause is useless.
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x
could be simplified as
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
which in turn provides no real added value, so this first query could be removed and the second query would become:
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
The order by user_id clause can also be removed at this step.
Now, if you want to calculate the average value of sum_daily_value_x in the period of 7 days after the event (I'm referring to the avg() function in your top query), you can use avg() as a window function restricted to the period of 7 days after the event:
select f.user_id
, avg(f.sum_daily_value_x) over (order by f.daily range between current row and '7 days' following) as sum_next_day_values
from (
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
) AS f
group by f.user_id
The partition by f.user_id clause in the window function is useless because the rows have already been grouped by f.user_id before the window function is applied.
You can replace the avg() window function with any other one, for instance sum(), which would better fit the alias sum_next_day_values.
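If the "where event = 1" part also needs to be honoured, a self-join (which the question already mentions as an option) is a straightforward way to tie each event row to the data of the following 7 days. A rough sketch against the sample table_1, assuming day_ts is a timestamp and every event = 1 row should anchor its own window:
select e.user_id,
       e.day_ts as event_ts,
       sum(n.value_x) as sum_value_7d_after_event
from table_1 e
join table_1 n
  on n.user_id = e.user_id
 and n.day_ts >  e.day_ts                      -- strictly after the event row
 and n.day_ts <= e.day_ts + interval '7 days'  -- but no more than 7 days later
where e.event = 1
group by e.user_id, e.day_ts
An outer query grouping by user_id can then average these per-event sums if a single figure per user is wanted.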
I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month. This calculation is to be done for 12 months in a single query. The output should be as below.
Month Count
01/07/2019 50
01/08/2019 34
01/09/2019 23
01/10/2019 98
01/11/2019 10
01/12/2019 5
01/01/2020 32
01/02/2020 65
01/03/2020 23
01/04/2020 12
01/05/2020 64
01/06/2020 54
01/07/2020 78
I am able to get the value only for one month. I want to get it for all months in a single query.
This is my current query:
SELECT COUNT(DISTINCT TWO_MONTHS_AGO.USER_ID), TWO_MONTHS_AGO.MONTH AS INVOICE_MONTH
FROM (
SELECT USER_ID, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table a AS ID
WHERE invoice_amt > 0
AND LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 2)
GROUP BY user_id
) AS TWO_MONTHS_AGO
LEFT JOIN (
SELECT user_id, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table a AS ID
WHERE LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 1)
GROUP BY USER_ID
) AS ONE_MONTH_AGO ON TWO_MONTHS_AGO.USER_ID = ONE_MONTH_AGO.USER_ID
WHERE ONE_MONTH_AGO.USER_ID IS NULL
GROUP BY INVOICE_MONTH;
Thank you in advance.
Lona
Probably lots of different approaches, but the way I would do it is as follows:
1. Summarise data by user and month for the last 13 months (you need 12 months, plus the month before the first of those 12).
2. Compare "this" month (that has data) to "next" month and select records where there is no "next" month data.
3. Summarise this dataset by month and distinct userid.
For example, assuming a table created as follows:
create table INVOICE_DATA (
USERID varchar(4),
INVOICE_DT date,
INVOICE_AMT NUMBER(10,2)
);
the following query should give you what you want - you may need to adjust it depending on whether you are including this month, or only up to the end of last month, in your calculation, etc.:
--Summarise data by user and month
WITH MONTH_SUMMARY AS
(
SELECT USERID
,TO_CHAR(INVOICE_DT,'YYYY-MM') "INVOICE_MONTH"
,TO_CHAR(ADD_MONTHS(INVOICE_DT,1),'YYYY-MM') "NEXT_MONTH"
,SUM(INVOICE_AMT) "MONTHLY_TOTAL"
FROM INVOICE_DATA
WHERE INVOICE_DT >= TRUNC(ADD_MONTHS(current_date(),-13),'MONTH') -- Last 13 months of data
GROUP BY 1,2,3
),
--Get data for users with invoices in this month but not the next month
USER_DATA AS
(
SELECT USERID, INVOICE_MONTH, MONTHLY_TOTAL
FROM MONTH_SUMMARY MS_THIS
WHERE NOT EXISTS
(
SELECT USERID
FROM MONTH_SUMMARY MS_NEXT
WHERE
MS_THIS.USERID = MS_NEXT.USERID AND
MS_THIS.NEXT_MONTH = MS_NEXT.INVOICE_MONTH
)
AND MS_THIS.INVOICE_MONTH < TO_CHAR(current_date(),'YYYY-MM') -- Don't include this month as obviously no next month to compare to
)
SELECT INVOICE_MONTH, COUNT(DISTINCT USERID) "USER_COUNT"
FROM USER_DATA
GROUP BY INVOICE_MONTH
ORDER BY INVOICE_MONTH
;
The purpose of this question is to optimize some SQL by using set-based operations vs iterative (looping, like I'm doing below):
Some Explanation -
I have this CTE that is inserted into a temp table #dataForPeak. Each row represents a minute, together with the value retrieved for that minute.
For every row, my code uses a while loop to add 15 rows at a time (the current row + the next 14 rows). These sums are inserted into another temp table #PeakDemandIntervals, which is my workaround for then finding the max sum of these groups of 15.
My end goal is that max sum over each group of 15. My code achieves this, but it takes about 12 seconds for 26k rows. I'll be looking at much more data, so I know this is not fast enough for my use case.
My question is,
can anyone help me find a fast alternative to this loop?
It can include more tables, CTEs, nested queries, whatever. The while loop might not even be the issue, it's probably the inner code.
insert into #dataForPeak
select timestamp, value
from cte
order by timestamp;

while @@ROWCOUNT <> 0
begin
    -- grab the next remaining timestamp to anchor this interval
    declare @timestamp datetime = (select top 1 timestamp from #dataForPeak);
    -- sum the values in the window [@timestamp, @timestamp + 14 minutes)
    insert into #PeakDemandIntervals
    select @timestamp, sum(interval.value) as peak
    from (select * from #dataForPeak base
          where base.timestamp >= @timestamp
          and base.timestamp < DATEADD(minute,14,@timestamp)
         ) interval;
    -- remove the processed minute and repeat
    delete from #dataForPeak where timestamp = @timestamp;
end

select max(peak)
from #PeakDemandIntervals;
Edit
Here's an example of my goal, using groups of 3min instead of 15min.
Given the data:
Time | Value
1:50 | 2
1:51 | 4
1:52 | 6
1:53 | 8
1:54 | 6
1:55 | 4
1:56 | 2
the max sum (peak) I'm looking for is 20, because the group
1:52 | 6
1:53 | 8
1:54 | 6
has the highest sum.
Let me know if I need to clarify more than that.
Based on the example given, it seems like you are trying to get the maximum value of a rolling sum. You can calculate the 15-minute rolling sum very easily as follows:
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
Note the key here is the ROWS 14 PRECEDING clause. It effectively states that SQL Server should sum the preceding 14 rows together with the current row, which gives you your 15-minute interval.
Now you can simply take the max of the rolling sum result. The full query looks as follows:
;WITH CTE_RollingSum
AS
(
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
)
SELECT MAX([RollingSum]) AS Peak
FROM CTE_RollingSum
A fair amount of material is available detailing methods utilising dense_rank() and the like to count distinct somethings per month; however, I've been unable to find anything that counts distinct ids per month while also removing/discounting any ids that have been seen in prior month groups.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. It also uses integer for storage (both positive and negative) of the id, and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
SELECT *,
date_trunc('month', observed_time) AS month_group
FROM my_table
WHERE observed_time > '2017-01-01'),
id_months AS (
SELECT DISTINCT
month_group,
id
FROM records_months
GROUP BY month_group, id)
SELECT *
FROM id_months
However, I'm stuck on the next part i.e counting the number of new ID that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which or how.
First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
-- Null out the id if the observed_month that we're grouping by
-- is NOT the earliest month that the id was seen.
-- Then count distinct id
count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
select t.id,
date_trunc('month', t.observed_time) as observed_month,
earliest.earliest_month
from my_table t
join (
-- What's the earliest month an id was seen?
select id,
date_trunc('month', min(observed_time)) as earliest_month
from my_table
group by 1
) earliest
on t.id = earliest.id
) sub
group by 1
order by 1;
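For comparison, the window-function route mentioned in the question would look roughly like this (a sketch along the same lines, not tested on the real data): tag every row with the earliest month its id was ever seen, then count an id only in that month.
select observed_month,
       count(distinct case when observed_month = first_month then id end) as num_new_ids
from (
    select id,
           date_trunc('month', observed_time) as observed_month,
           -- earliest month in which this id appears
           min(date_trunc('month', observed_time)) over (partition by id) as first_month
    from my_table
) t
group by observed_month
order by observed_month;
Which of the two performs better on Redshift will depend on the table's distribution and sort keys, so it is worth testing both against the real data.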