How to calculate the amount of SQL? - postgresql

I have a table transaction_details:
transaction_id
customer_id
item_id
item_number
transaction_dttm
7765
1
23
1
2022-01-15
1254
2
12
4
2022-02-03
3332
3
56
2
2022-02-15
7658
1
43
1
2022-03-01
7231
4
56
1
2022-01-15
7231
2
23
2
2022-01-29
I need to calculate the amount spent by the client in the last month and find out the item (item_name) on which the client spent the most in the last month.
Example result:
|customer_id|amount_spent_lm|top_item_lm|
| - | ---------- | ----- |
| 1 | 700 | glasses |
| 2 | 20000 | notebook |
| 3 | 100 | cup |
When calculating, it is necessary to take into account the current price at the time of the transaction (dict_item_prices). Customers who have not made purchases in the last month are not included in the final table. he last month is defined as the last 30 days at the time of the report creation.
There is also a table dict_item_prices:
item_id
item_name
item_price
valid_from_dt
valid_to_dt
23
phone 1
1000
2022-01-01
2022-12-31
12
notebook
5000
2022-01-02
2022-12-31
56
cup
50
2022-01-02
2022-12-31
43
glasses
700
2022-01-01
2022-12-31

Related

Scala Spark get sum by time bucket across team spans and key

I have a question that is very similar to How to group by time interval in Spark SQL
However, my metric is time spent (duration), so my data looks like
KEY |Event_Type | duration | Time
001 |event1 | 10 | 2016-05-01 10:49:51
002 |event2 | 100 | 2016-05-01 10:50:53
001 |event3 | 20 | 2016-05-01 10:50:55
001 |event1 | 15 | 2016-05-01 10:51:50
003 |event1 | 13 | 2016-05-01 10:55:30
001 |event2 | 12 | 2016-05-01 10:57:00
001 |event3 | 11 | 2016-05-01 11:00:01
Is there a way to sum the time spent into five minute buckets, grouped by key, and know when the duration goes outside of the bound of the bucket?
For example, the first row starts at 10:49:51 and ends at 10:50:01
Thus, the bucket for key 001 in window [2016-05-01 10:45:00.0,2016-05-01 10:50:00.0] would get 8 seconds of duration (51 seconds to 60 seconds) and the and the 10:50 to 10:55 would get 2 seconds of duration, plus the relevant seconds from other log lines (20 seconds from the third row, 15 from the 4th row).
I want to sum the time in a specific bucket, but the solution on the other thread of
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
would overcount in the buckets timestamps that overlap buckets start in, and undercount the subsequent buckets
Note: My Time column is also in Epoch timestamps like 1636503077, but I can easily cast it to the above format if that makes this calculation easier.
for my opinion, maybe you need preprocess you data by spilt you duration to every minutes (or every five minutes).
as you wish, the first row
001 |event1 | 10 | 2016-05-01 10:49:51
should be convert to
001 |event1 | 9 | 2016-05-01 10:49:51
001 |event1 | 1 | 2016-05-01 10:50:00
then you can use spark window function to sum it properly.
df.groupBy($"KEY", window($"time", "5 minutes")).sum("metric")
that will not change the result if you only want to know the duration of time bucket, but will increasing the record counts.

Unpivot data in PostgreSQL

I have a table in PostgreSQL with the below values,
empid hyderabad bangalore mumbai chennai
1 20 30 40 50
2 10 20 30 40
And my output should be like below
empid city nos
1 hyderabad 20
1 bangalore 30
1 mumbai 40
1 chennai 50
2 hyderabad 10
2 bangalore 20
2 mumbai 30
2 chennai 40
How can I do this unpivot in PostgreSQL?
You can use a lateral join:
select t.empid, x.city, x.nos
from the_table t
cross join lateral (
values
('hyderabad', t.hyderabad),
('bangalore', t.bangalore),
('mumbai', t.mumbai),
('chennai', t.chennai)
) as x(city, nos)
order by t.empid, x.city;
Or this one: simpler to read- and real plain SQL ...
WITH
input(empid,hyderabad,bangalore,mumbai,chennai) AS (
SELECT 1,20,30,40,50
UNION ALL SELECT 2,10,20,30,40
)
,
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
SELECT
empid
, CASE i
WHEN 1 THEN 'hyderabad'
WHEN 2 THEN 'bangalore'
WHEN 3 THEN 'mumbai'
WHEN 4 THEN 'chennai'
ELSE 'unknown'
END AS city
, CASE i
WHEN 1 THEN hyderabad
WHEN 2 THEN bangalore
WHEN 3 THEN mumbai
WHEN 4 THEN chennai
ELSE NULL::INT
END AS city
FROM input CROSS JOIN i
ORDER BY empid,i;
-- out empid | city | city
-- out -------+-----------+------
-- out 1 | hyderabad | 20
-- out 1 | bangalore | 30
-- out 1 | mumbai | 40
-- out 1 | chennai | 50
-- out 2 | hyderabad | 10
-- out 2 | bangalore | 20
-- out 2 | mumbai | 30
-- out 2 | chennai | 40

Run a SQL query against ten-minutes time intervals

I have a postgresql table with this schema:
id SERIAL PRIMARY KEY,
traveltime INT,
departuredate TIMESTAMPTZ,
departurehour TIMETZ
Here is a bit of data (edited):
id | traveltime | departuredate | departurehour
----+------------+------------------------+---------------
1 | 73 | 2019-12-24 00:00:03+01 | 00:00:03+01
2 | 73 | 2019-12-24 00:12:16+01 | 00:12:16+01
53 | 115 | 2019-12-24 07:53:44+01 | 07:53:44+01
54 | 116 | 2019-12-24 07:58:45+01 | 07:58:45+01
55 | 119 | 2019-12-24 08:03:46+01 | 08:03:46+01
56 | 120 | 2019-12-24 08:08:47+01 | 08:08:47+01
57 | 121 | 2019-12-24 08:13:48+01 | 08:13:48+01
58 | 121 | 2019-12-24 08:18:48+01 | 08:18:48+01
542 | 112 | 2019-12-26 07:52:41+01 | 07:52:41+01
543 | 114 | 2019-12-26 07:57:42+01 | 07:57:42+01
544 | 116 | 2019-12-26 08:02:43+01 | 08:02:43+01
545 | 116 | 2019-12-26 08:07:44+01 | 08:07:44+01
546 | 117 | 2019-12-26 08:12:45+01 | 08:12:45+01
547 | 118 | 2019-12-26 08:17:46+01 | 08:17:46+01
548 | 118 | 2019-12-26 08:22:48+01 | 08:22:48+01
1031 | 80 | 2019-12-28 07:50:33+01 | 07:50:33+01
1032 | 81 | 2019-12-28 07:55:34+01 | 07:55:34+01
1033 | 81 | 2019-12-28 08:00:35+01 | 08:00:35+01
1034 | 82 | 2019-12-28 08:05:36+01 | 08:05:36+01
1035 | 82 | 2019-12-28 08:10:37+01 | 08:10:37+01
1036 | 83 | 2019-12-28 08:15:38+01 | 08:15:38+01
1037 | 83 | 2019-12-28 08:20:39+01 | 08:20:39+01
I'd like to get the average for all the values collected for traveltime for each 10 minutes interval for several weeks.
Expected result for the data sample: for the 10-minutes interval between 8h00 and 8h10, the rows that will be included in the avg are with id 55, 56, 544, 545, 1033 and 1034
and so on.
I can get the average for a specific interval:
select avg(traveltime) from belt where departurehour >= '10:40:00+01' and departurehour < '10:50:00+01';
To avoid creating a query for each interval, I used this query to get all the 10-minutes intervals for the complete period encoded:
select i from generate_series('2019-11-23', '2020-01-18', '10 minutes'::interval) i;
What I miss is a way to apply my AVG query to each of these generated intervals. Any direction would be helpful!
It turns out that the generate_series does not actually apply as requardless of the date range. The critical part is the 144 10Min intervals per day. Unfortunatly Postgres does not provide an interval type for minuets. (Perhaps creating one would be a useful exersize). But all is not loss you can simulate the same with BETWEEN, just need to play with the ending of the range.
The following generates this simulation using a recursive CTE. Then as previously joins to your table.
set timezone to '+1'; -- necessary to keep my local offset from effecting results.
-- create table an insert data here
-- additional data added outside of date range so should not be included)
with recursive min_intervals as
(select '00:00:00'::timetz start_10Min -- start of 1st 10Min interval
, '00:09:59.999999'::timetz end_10Min -- last microsecond in 10Min interval
, 1 interval_no
union all
select start_10Min + interval '10 min'
, end_10Min + interval '10 min'
, interval_no + 1
from Min_intervals
where interval_no < 144 -- 6 10Min intervals/hr * 24 Hr/day = No of 10Min intervals in any day
) -- select * from min_intervals;
select start_10Min, end_10Min, avg(traveltime) average_travel_time
from min_intervals
join belt
on departuredate::time between start_10Min and end_10Min
where departuredate::date between date '2019-11-23' and date '2020-01-18'
group by start_10Min, end_10Min
order by start_10Min;
-- test result for 'specified' Note added rows fall within time frame 08:00 to 08:10
-- but these should be excluded so the avg for that period should be the same for both queries.
select avg(traveltime) from belt where id in (55, 56, 544, 545, 1033, 1034);
My issue with the above is the data range is essentially hard coded (yes substitution parameter are available) and manually but that is OK for psql or an IDE but not good for a production environment. If this is to be used in that environment I'd use the following function to return a virtual table of the same results.
create or replace function travel_average_per_10Min_interval(
start_date_in date
, end_date_in date
)
returns table (Start_10Min timetz
,end_10Min timetz
,avg_travel_time numeric
)
language sql
as $$
with recursive min_intervals as
(select '00:00:00'::timetz start_10Min -- start of 1st 10Min interval
, '00:09:59.999999'::timetz end_10Min -- last microsecond in 10Min interval
, 1 interval_no
union all
select start_10Min + interval '10 min'
, end_10Min + interval '10 min'
, interval_no + 1
from Min_intervals
where interval_no < 144 -- 6 10Min intervals/hr * 24 Hr/day = No of 10Min intervals in any day
) -- select * from min_intervals;
select start_10Min, end_10Min, avg(traveltime) average_travel_time
from min_intervals
join belt
on departuredate::time between start_10Min and end_10Min
where departuredate::date between start_date_in and end_date_in
group by start_10Min, end_10Min
order by start_10Min;
$$;
-- test
select * from travel_average_per_10Min_interval(date '2019-11-23', date '2020-01-18');

kdb+ equivalent of SQL's rank() and dense_rank()

Any one every have to simulate the result of SQL's rank(), dense_rank(), and row_number(), in kdb+? Here is some SQL to demonstrate the features. If anyone has a specific solution below, perhaps I could work on generalising it to support multiple partition and order by columns -- and post back on this site.
CREATE TABLE student(course VARCHAR(10), mark int, name varchar(10));
INSERT INTO student VALUES
('Maths', 60, 'Thulile'),
('Maths', 60, 'Pritha'),
('Maths', 70, 'Voitto'),
('Maths', 55, 'Chun'),
('Biology', 60, 'Bilal'),
('Biology', 70, 'Roger');
SELECT
RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY course ORDER BY mark DESC) AS row_num,
course, mark, name
FROM student ORDER BY course, mark DESC;
+------+------------+---------+---------+------+---------+
| rank | dense_rank | row_num | course | mark | name |
+------+------------+---------+---------+------+---------+
| 1 | 1 | 1 | Biology | 70 | Roger |
| 2 | 2 | 2 | Biology | 60 | Bilal |
| 1 | 1 | 1 | Maths | 70 | Voitto |
| 2 | 2 | 2 | Maths | 60 | Thulile |
| 2 | 2 | 3 | Maths | 60 | Pritha |
| 4 | 3 | 4 | Maths | 55 | Chun |
+------+------------+---------+---------+------+---------+
Here is some kdb+ to generate the equivalent student table:
student:([] course:`Maths`Maths`Maths`Maths`Biology`Biology;
mark:60 60 70 55 60 70;
name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
Thank you!
If you sort the table initially by course and mark:
student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
course mark name
--------------------
Biology 70 Roger
Biology 60 Bilal
Maths 70 Voitto
Maths 60 Thulile
Maths 60 Pritha
Maths 55 Chun
Then you can use something like the below to achieve your output:
update rank_sql:first row_num by course,mark from update dense_rank:1+where count each (where differ mark)cut mark,row_num:1+rank i by course from student
course mark name dense_rank row_num rank_sql
------------------------------------------------
Biology 70 Roger 1 1 1
Biology 60 Bilal 2 2 2
Maths 70 Voitto 1 1 1
Maths 60 Thulile 2 2 2
Maths 60 Pritha 2 3 2
Maths 55 Chun 3 4 4
This solution uses rank and the virtual index column if you would like to read up further on these.
For table ordered by target columns:
q) dense_sql:{sums differ x}
q) rank_sql:{raze #'[(1_deltas b),1;b:1+where differ x]}
q) row_sql:{1+til count x}
q) student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
q)update row_num:row_sql mark,rank_s:rank_sql mark,dense_s:dense_sql mark by course from student
I can think of this as of now:
Note: The rank function in kdb works on asc list, so I created below functions.
I would not xdesc the table, as I can just use the vector column and desc it
q)denseF
{((desc distinct x)?x)+1}
q)rankF
{((desc x)?x)+1}
q)update dense_rank:denseF mark,rank_rank:rankF mark,row_num:1+rank i by course from student
course
mark name
dense_rank
rank_rank
row_num
Maths
60 Thulile
2
2
1
Maths
60 Pritha
2
2
2
Maths
70 Voitto
1
1
3
Maths
55 Chun
3
4
4
Biology
60 Bilal
2
2
1
Biology
70 Roger
1
1
2

Create calendar year variable from daily date

I have a Stata elapsed date variable (excerpt below) for each patient patid. But I believe I need to generate a new variable if I were to make use of only the year element within that date, that is, not just change the display format.
clear
input long patid float date
1015 18766
1018 13135
1020 13325
1025 14384
1029 14514
1050 13501
1070 14523
1071 14878
1090 14701
1092 14159
end
format %td date
How do I generate a year variable for all dates within the same year? That is all days from 1st January to 31st December of the same year?
That calls for just the function year(), which together with similar stuff is prominently documented at help datetime.
clear
input long patid float date
1015 18766
1018 13135
1020 13325
1025 14384
1029 14514
1050 13501
1070 14523
1071 14878
1090 14701
1092 14159
end
format %td date
gen year = year(date)
list
+--------------------------+
| patid date year |
|--------------------------|
1. | 1015 19may2011 2011 |
2. | 1018 18dec1995 1995 |
3. | 1020 25jun1996 1996 |
4. | 1025 20may1999 1999 |
5. | 1029 27sep1999 1999 |
|--------------------------|
6. | 1050 18dec1996 1996 |
7. | 1070 06oct1999 1999 |
8. | 1071 25sep2000 2000 |
9. | 1090 01apr2000 2000 |
10. | 1092 07oct1998 1998 |
+--------------------------+