Scala Spark: get sum by time bucket across time spans and key - scala

I have a question that is very similar to How to group by time interval in Spark SQL
However, my metric is time spent (duration), so my data looks like
KEY |Event_Type | duration | Time
001 |event1 | 10 | 2016-05-01 10:49:51
002 |event2 | 100 | 2016-05-01 10:50:53
001 |event3 | 20 | 2016-05-01 10:50:55
001 |event1 | 15 | 2016-05-01 10:51:50
003 |event1 | 13 | 2016-05-01 10:55:30
001 |event2 | 12 | 2016-05-01 10:57:00
001 |event3 | 11 | 2016-05-01 11:00:01
Is there a way to sum the time spent into five minute buckets, grouped by key, and know when the duration goes outside of the bound of the bucket?
For example, the first row starts at 10:49:51 and ends at 10:50:01.
Thus, the bucket for key 001 in window [2016-05-01 10:45:00.0, 2016-05-01 10:50:00.0] would get 9 seconds of duration (10:49:51 to 10:50:00), and the 10:50 to 10:55 bucket would get the remaining 1 second, plus the relevant seconds from other log lines (20 seconds from the third row, 15 from the fourth).
I want to sum the time into the correct buckets, but the solution from the other thread,
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
would overcount in the bucket where an overlapping span starts and undercount the subsequent buckets.
Note: My Time column is actually in epoch seconds, like 1636503077, but I can easily cast it to the format above if that makes the calculation easier.
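(For reference, a cast along these lines should work if the raw column holds epoch seconds; the column name epoch here is hypothetical:)

// "epoch" is a made-up name for the raw epoch-seconds column
val withTime = df.withColumn("Time", from_unixtime($"epoch").cast("timestamp"))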

In my opinion, you need to preprocess your data by splitting each duration across every minute (or every five minutes).
For example, the first row
001 |event1 | 10 | 2016-05-01 10:49:51
would be converted to
001 |event1 | 9 | 2016-05-01 10:49:51
001 |event1 | 1 | 2016-05-01 10:50:00
Then you can use the Spark window function to sum it properly:
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
This will not change the result if you only want the total duration per time bucket, but it will increase the record count.
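A minimal sketch of that preprocessing in the DataFrame API, assuming a df with the columns shown in the question (KEY, Event_Type, duration in seconds, Time as a timestamp), a SparkSession named spark, and Spark 2.4+ for sequence:

import org.apache.spark.sql.functions._
import spark.implicits._

val bucket = 300L // 5-minute buckets, in seconds

val result = df
  .withColumn("start_s", unix_timestamp($"Time"))
  .withColumn("end_s", $"start_s" + $"duration") // assumes duration >= 1
  // one row per 5-minute bucket that the event overlaps
  .withColumn("bucket_s", explode(sequence(
    floor($"start_s" / bucket) * bucket,
    floor(($"end_s" - 1) / bucket) * bucket,
    lit(bucket))))
  // seconds of this event that fall inside this particular bucket
  .withColumn("dur_in_bucket",
    least($"end_s", $"bucket_s" + bucket) - greatest($"start_s", $"bucket_s"))
  .groupBy($"KEY", from_unixtime($"bucket_s").as("window_start"))
  .agg(sum($"dur_in_bucket").as("duration"))

Checked against the example, the first row contributes 9 seconds to the 10:45:00 bucket and 1 second to the 10:50:00 bucket, matching the split above.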

Related

How to calculate the amount spent in SQL?

I have a table transaction_details:
|transaction_id|customer_id|item_id|item_number|transaction_dttm|
| - | - | - | - | - |
| 7765 | 1 | 23 | 1 | 2022-01-15 |
| 1254 | 2 | 12 | 4 | 2022-02-03 |
| 3332 | 3 | 56 | 2 | 2022-02-15 |
| 7658 | 1 | 43 | 1 | 2022-03-01 |
| 7231 | 4 | 56 | 1 | 2022-01-15 |
| 7231 | 2 | 23 | 2 | 2022-01-29 |
I need to calculate the amount spent by the client in the last month and find out the item (item_name) on which the client spent the most in the last month.
Example result:
|customer_id|amount_spent_lm|top_item_lm|
| - | ---------- | ----- |
| 1 | 700 | glasses |
| 2 | 20000 | notebook |
| 3 | 100 | cup |
When calculating, you must take into account the price current at the time of the transaction (from dict_item_prices). Customers who have not made purchases in the last month are not included in the final table. The last month is defined as the last 30 days at the time the report is created.
There is also a table dict_item_prices:
|item_id|item_name|item_price|valid_from_dt|valid_to_dt|
| - | - | - | - | - |
| 23 | phone 1 | 1000 | 2022-01-01 | 2022-12-31 |
| 12 | notebook | 5000 | 2022-01-02 | 2022-12-31 |
| 56 | cup | 50 | 2022-01-02 | 2022-12-31 |
| 43 | glasses | 700 | 2022-01-01 | 2022-12-31 |
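No answer is recorded for this thread, but a minimal PostgreSQL-style sketch of the aggregation might look like the following, assuming the schemas above and "last month" meaning the last 30 days from the current date:

-- Sketch only: join each transaction to the price valid at its date,
-- restrict to the last 30 days, then total per customer and pick the
-- item with the highest spend.
WITH last_month AS (
  SELECT t.customer_id,
         p.item_name,
         t.item_number * p.item_price AS amount
  FROM transaction_details t
  JOIN dict_item_prices p
    ON p.item_id = t.item_id
   AND t.transaction_dttm BETWEEN p.valid_from_dt AND p.valid_to_dt
  WHERE t.transaction_dttm >= current_date - INTERVAL '30 days'
),
per_item AS (
  SELECT customer_id, item_name, SUM(amount) AS item_amount
  FROM last_month
  GROUP BY customer_id, item_name
)
SELECT customer_id,
       SUM(item_amount) AS amount_spent_lm,
       (ARRAY_AGG(item_name ORDER BY item_amount DESC))[1] AS top_item_lm
FROM per_item
GROUP BY customer_id;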

Stored procedure (or better way) to add a new row to existing table every day at 22:00

I will be very grateful for your advice regarding the following issue.
Given:
PostgreSQL database
Initial (basic) query
select day, Value_1, Value_2, Value_3
from table
where day=current_date
which returns a row with the following columns
Day | Value_1(int) | Value_2(int) | Value 3 (int)
2019-11-14 | 10 | 10 | 14
I need to create a view with this starting information and add a new row every day, based on the outcome of the initial query executed at 22:00.
The expected outcome tomorrow at 22:01 will be
Day | Value_1 | Value_2 | Value_3
2019-11-14 | 10 | 10 | 14
2019-11-15 | N | M | P
Many thanks in advance for your time and support.
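No answer was posted here; one common approach is a snapshot table populated on a schedule. A minimal sketch, assuming the pg_cron extension is available (the table and job names are illustrative):

-- Illustrative names; assumes the pg_cron extension is installed.
CREATE TABLE daily_snapshot (
  day     date PRIMARY KEY,
  value_1 int,
  value_2 int,
  value_3 int
);

-- Append the result of the base query every day at 22:00.
SELECT cron.schedule(
  'append-daily-snapshot',
  '0 22 * * *',
  $$INSERT INTO daily_snapshot
      SELECT day, Value_1, Value_2, Value_3
      FROM source_table            -- placeholder for the asker's table
      WHERE day = current_date$$
);

A view over daily_snapshot then exposes the running history; plain cron plus psql would work the same way if pg_cron is not available.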

Postgres placeholders for 0 data

I have some Postgres data like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
I want to run a query that pulls this data with 0s in place for the dates that are missing, like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
This is for a very large range of dates, and I need it to fill in all the gaps. How can I accomplish this with just SQL?
Given a table junk of:
d | c
------------+----
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
Running
select fake.d, coalesce(j.c, 0) as c
from (select min(d) + generate_series(0,7,1) as d from junk) fake
left outer join junk j on fake.d=j.d;
gets us:
d | c
------------+----------
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
2015-01-06 | 0
2015-01-07 | 0
2015-01-08 | 0
You could of course adjust the start date for the series, how long it runs, and so on.
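As a variant of the same idea, the range can be derived from the data itself using the timestamp form of generate_series (a sketch against the same junk table):

-- Sketch: span the series from the earliest to the latest date present.
SELECT fake.d, COALESCE(j.c, 0) AS c
FROM (
  SELECT generate_series(min(d), max(d), interval '1 day')::date AS d
  FROM junk
) fake
LEFT JOIN junk j ON j.d = fake.d
ORDER BY fake.d;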
Where is this data going? To an outside source, or to another table or view?
There's probably a better solution, but you could create a new table (or, if the data is going to Excel, do it there) that has the entire date range you want, with another integer column of null values. Then update that table with your current dataset and replace all nulls with zeros.
It's a really roundabout way to do things, but it'll work.
I don't have enough rep to comment :(
This is also a good reference
Using COALESCE to handle NULL values in PostgreSQL

PostgreSQL subselect aggregate in larger query

I'm working with a gigantic dataset of individuals with demographic information and action tracking. I am trying to get the percentage of people who committed an action, which is simple, but also the average age of people who fit in a specific subgroup of the original SELECT. The CASE WHEN line works fine alone, and the subquery runs fine in its own query, but I cannot seem to integrate it into this query as a subquery; it gives me a syntax error on the CASE WHEN statement. Here's a slightly anonymized version of the query. Any help would be very appreciated.
SELECT
AVG(ageagg)
FROM
(
SELECT
age AS ageagg
FROM
agetable
WHERE
age>30
AND action_taken=1) AvgAge_30Action,
COUNT(
CASE
WHEN action_taken=1
AND age> 30
THEN 1
ELSE 0 NULL) / COUNT(
CASE
WHEN age>30) AS Over_30_Action
FROM
agetable
WHERE
website_type=3
If I've interpreted your intent correctly, you wish to compute the following:
1) the number of people over the age of 30 that took a specific action as a percentage of the total number of people over the age of 30
2) the average age of the people over the age of 30 that took a specific action
Assuming my interpretation is correct, this query might work for you:
SELECT
100 * over_30_action / over_30_total AS percentage_of_over_30_took_action,
average_age_of_over_30_took_action
FROM (
SELECT
SUM(CASE WHEN action_taken=1 THEN 1 ELSE 0 END) AS over_30_action,
COUNT(*) AS over_30_total,
AVG(CASE WHEN action_taken=1 THEN age ELSE NULL END)
AS average_age_of_over_30_took_action
FROM agetable
WHERE website_type=3 AND age>30
) aggregated;
I created a dummy table and populated it with the following data.
postgres=# select * from agetable order by website_type, action_taken, age;
age | action_taken | website_type
-----+--------------+--------------
33 | 1 | 1
32 | 1 | 2
28 | 1 | 3
29 | 1 | 3
32 | 1 | 3
33 | 1 | 3
34 | 1 | 3
32 | 2 | 3
32 | 3 | 3
33 | 4 | 3
34 | 5 | 3
33 | 6 | 3
34 | 7 | 3
35 | 8 | 3
(14 rows)
Of the 14 rows, 4 rows (the first four in this listing) have either the wrong website_type or have age below 30. Of the ten remaining rows, you can see that 3 of them have an action_taken of 1. So, the query should determine that 30% of folks over the age of 30 took a particular action, and the average age among that particular population should be 33 (ages 32, 33, and 34). The results of the query I posted:
percentage_of_over_30_took_action | average_age_of_over_30_took_action
-----------------------------------+------------------------------------
30 | 33.0000000000000000
(1 row)
Again, all of this is predicated upon my interpretation of your intent actually being accurate. This is of course based on a highly contrived data set, but hopefully it's enough of a functional signpost to get you on the right path.

Convert rows to columns dynamically

I have a table:
name | surname | project | dates | hours
aaa | aaaa | 1 | 12.08.2011 | 10
aaa | aaaa | 1 | 13.08.2011 | 8
aaa | aaaa | 1 | 14.08.2011 | 7
And I need a result like this:
name | surname | project | dates | hours | dates | hours | dates | hours | total
aaa | aaaa | 1 | 12.08.2011 | 10 | 13.08.2011 | 8 | 14.08.2011 | 7 | 25
SELECT name,surname,project,
MAX(DECODE(C,1,dates)) dates,
MAX(DECODE(C,1,hours)) hours,
MAX(DECODE(C,2,dates)) dates,
MAX(DECODE(C,2,hours)) hours,
MAX(DECODE(C,3,dates)) dates,
MAX(DECODE(C,3,hours)) hours,
sum(hours) as Total
FROM (SELECT name,surname,project,dates,hours
,ROW_NUMBER() OVER(PARTITION BY project ORDER BY project) C
FROM work )
GROUP BY name,surname,project
This works, but I need a dynamic SQL query because the number of rows can vary. Is that possible? Thanks.
You can generate the SQL dynamically. Look into the DBMS_SQL package. But each row needs to have the same number of columns.
Another way to do this is to return a nested table or varray of dates and hours.
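A rough sketch of the dynamic approach in Oracle PL/SQL (illustrative only; note it orders ROW_NUMBER by dates, where the original ordered by project, so the column pairs come out in date order):

DECLARE
  l_cols VARCHAR2(32767);
  l_sql  VARCHAR2(32767);
  l_cur  SYS_REFCURSOR;
  l_n    PLS_INTEGER;
BEGIN
  -- the widest group determines how many dates/hours column pairs are needed
  SELECT MAX(COUNT(*)) INTO l_n
  FROM work
  GROUP BY name, surname, project;

  FOR i IN 1 .. l_n LOOP
    l_cols := l_cols
      || ', MAX(DECODE(c,' || i || ',dates)) dates_' || i
      || ', MAX(DECODE(c,' || i || ',hours)) hours_' || i;
  END LOOP;

  l_sql := 'SELECT name, surname, project' || l_cols || ', SUM(hours) total'
        || ' FROM (SELECT w.*, ROW_NUMBER() OVER'
        || ' (PARTITION BY name, surname, project ORDER BY dates) c FROM work w)'
        || ' GROUP BY name, surname, project';

  OPEN l_cur FOR l_sql;  -- fetch from l_cur, or use DBMS_SQL to describe the columns
END;
/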