6.Hive: Given a table t with schema (date, revenue), like this
date revenue
Jan. 1 100
Jan. 2 120
Jan. 3 80
Jan. 4 150
Jan. 5 50
What does the following query do?
SELECT t1.date AS date, sum(t2.revenue) AS revenue
FROM t as t1 JOIN t as t2 ON t2.date <= t1.date GROUP BY 1 ORDER BY 1
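A minimal sketch of what the self-join computes, run against SQLite through Python's sqlite3 module rather than Hive. The ISO dates (and the year) are assumptions, chosen so that string comparison matches calendar order; the revenue values are the ones from the table above.

```python
import sqlite3

# Sample rows from the question. ISO dates and the year are assumptions,
# used so that comparing the strings matches calendar order.
rows = [
    ("2022-01-01", 100),
    ("2022-01-02", 120),
    ("2022-01-03", 80),
    ("2022-01-04", 150),
    ("2022-01-05", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Each t1 row is paired with every t2 row dated on or before it,
# so the grouped sum is a cumulative (running) revenue total.
running_total = conn.execute(
    """
    SELECT t1.date AS date, SUM(t2.revenue) AS revenue
    FROM t AS t1 JOIN t AS t2 ON t2.date <= t1.date
    GROUP BY t1.date
    ORDER BY t1.date
    """
).fetchall()

for row in running_total:
    print(row)
```

Each output row carries the revenue accumulated up to and including that date, so the query appears to be a running total written as a self-join.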
I want to see the value of the previous quarter's max (latest) month as a new column in the current quarter, using BigQuery.
When in Q1 2022, it should display Q4 December 2021 as a new column.
When in Q2 2022, it should display Q1 March 2022 (in this case 60000).
When in Q3 2022, it should display Q2 June 2022 (in this case 40000).
My data is like below
date Sales
2022-09-01 10000
2022-08-02 20000
2022-07-01 30000
2022-06-01 40000
2022-05-01 30000
2022-04-01 50000
2022-03-01 60000
2022-02-01 10000
2022-01-01 89090
Output
Your expected result table does not fit the task as stated: the previous quarter's max month.
Here are several outputs. Do you want the maximum over the previous quarter's months, or the value from three months ago? Both columns are included here.
The formula for the first month of the quarter can be changed to flag the last month instead by changing the 1 to -1. Since you want zero for all other months, multiply this flag by the other column.
Window functions do the job, but there must be one row for each month. The all_months table fills in any missing months.
with tbl as
(
Select date("2022-09-01") as dates, 10000 money
union all select date("2022-08-02"), 20000
union all select date("2022-07-01"), 30000
union all select date("2022-06-01"), 40000
union all select date("2022-05-01"), 30000
union all select date("2022-04-01"), 50000
union all select date("2022-03-01"), 60000
union all select date("2022-02-01"), 10000
union all select date("2022-01-01"), 89090
),
all_months as
(select dates,0 from (Select min(dates) A, max(dates) B from tbl), unnest(generate_date_array(A,B,interval 1 month)) dates)
select *,
if( date_trunc(dates,quarter)= date_trunc(date_sub(dates,interval 1 month),quarter),0,1) as first_month_of_quarter,
lag(money_max_this_quarter) over (order by dates) as money_max_last_quarter,
lag(money,3) over (order by dates) as money_three_months_ago,
from
(
select * ,
max(money) over (partition by date_trunc(dates,quarter ) ) as money_max_this_quarter
from
(
Select dates,sum(money) as money from tbl group by 1
union all select * from all_months
)
)
order by 1 desc
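For comparison, here is a minimal sketch of the "last month of the previous quarter" reading of the question, run against SQLite through Python's sqlite3 module rather than BigQuery. The names monthly, with_quarter, quarter_index and pos are hypothetical helpers, not part of the answer above; the data is the sample from the question.

```python
import sqlite3

# Sample data from the question.
rows = [
    ("2022-09-01", 10000), ("2022-08-02", 20000), ("2022-07-01", 30000),
    ("2022-06-01", 40000), ("2022-05-01", 30000), ("2022-04-01", 50000),
    ("2022-03-01", 60000), ("2022-02-01", 10000), ("2022-01-01", 89090),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (dates TEXT, money INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)

query = """
WITH monthly AS (
    -- One row per calendar month, keyed by the first day of the month.
    SELECT strftime('%Y-%m-01', dates) AS month_start, SUM(money) AS money
    FROM tbl
    GROUP BY month_start
),
with_quarter AS (
    SELECT month_start, money,
           -- Running quarter number, so "previous quarter" is simply minus 1.
           CAST(strftime('%Y', month_start) AS INTEGER) * 4
             + (CAST(strftime('%m', month_start) AS INTEGER) - 1) / 3 AS quarter_index,
           -- Position inside the quarter: 0, 1 or 2 (2 = last month).
           (CAST(strftime('%m', month_start) AS INTEGER) - 1) % 3 AS pos
    FROM monthly
)
SELECT cur.month_start, cur.money, prev.money AS prev_quarter_last_month
FROM with_quarter AS cur
LEFT JOIN with_quarter AS prev
       ON prev.quarter_index = cur.quarter_index - 1 AND prev.pos = 2
ORDER BY cur.month_start
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Rows in Q1 2022 get no value here because the sample data contains no December 2021 row.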
I want to compute the percentage of days on which a user was active. A query like this
select
a.id,
a.created_at,
CURRENT_DATE - a.created_at::date as days_since_registration,
NOW() as current_d
from public.accounts a where a.id = 3257
returns
id created_at days_since_registration current_d tot_active
3257 2022-04-01 22:59:00.000 1 2022-04-02 12:00:0.000 +0400 2
The person registered less than 24 hours ago, but there are two distinct dates between the registration and now. Hence, if a user was active one hour before midnight and one hour after midnight, he has two active days in less than one day (active on 200% of days).
What is the right way to count distinct dates and get 2 for a user who registered at 23:00:00 two hours ago?
WITH cte as (
SELECT 42 as userID,'2022-04-01 23:00:00' as d
union
SELECT 42,'2022-04-02 01:00:00' as d
)
SELECT
userID,
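count(distinct d::date), -- distinct active dates, not raw events
max(d)::date-min(d)::date+1 as NrOfDays,
100*count(distinct d::date)/(max(d)::date-min(d)::date+1) as PercentageOnline -- multiply first to avoid integer truncation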
FROM cte
GROUP BY userID;
output:
userid count nrofdays percentageonline
42 2 2 100
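The same idea can be sketched outside Postgres: count distinct calendar dates and divide by the number of calendar dates spanned, both ends included. This is a minimal sketch using SQLite through Python's sqlite3 module; the events table and the second user (id 7) are hypothetical additions to show a value below 100%.

```python
import sqlite3

events = [
    (42, "2022-04-01 23:00:00"),
    (42, "2022-04-02 01:00:00"),
    (7, "2022-04-01 09:00:00"),   # hypothetical second user
    (7, "2022-04-03 18:30:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, d TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

query = """
SELECT user_id,
       COUNT(DISTINCT date(d)) AS active_days,
       -- Calendar dates spanned, counting both the first and the last date.
       CAST(ROUND(julianday(date(MAX(d))) - julianday(date(MIN(d)))) AS INTEGER) + 1
           AS span_days
FROM events
GROUP BY user_id
ORDER BY user_id
"""

# user_id -> (active_days, span_days, percentage of days active)
stats = {
    uid: (active, span, 100.0 * active / span)
    for uid, active, span in conn.execute(query)
}
for uid, values in stats.items():
    print(uid, values)
```

Because both the numerator and the denominator are counted in calendar dates, the percentage cannot exceed 100.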
I am trying to build a cohort analysis for monthly retention but am having trouble getting the Month Number column right. The month number should identify the month(s) in which the user transacted, i.e. 0 for the registration month, 1 for the first month after it, 2 for the second, and so on until the last month. Currently, it returns negative month numbers in some cells.
It should be like this table:
cohort_month total_users month_number percentage
---------- ----------- -- ------------ ---------
January 100 0 40
January 341 1 90
January 115 2 90
February 103 0 73
February 100 1 40
March 90 0 90
Here is the SQL:
with cohort_items as (
select
extract(month from insert_date) as cohort_month,
msisdn as user_id
from mfscore.t_um_user_detail where extract(year from insert_date)=2020
order by 1, 2
),
user_activities as (
select
A.sender_msisdn,
extract(month from A.insert_date)-C.cohort_month as month_number
from mfscore.t_wm_transaction_logs A
left join cohort_items C ON A.sender_msisdn = C.user_id
where extract(year from A.insert_date)=2020
group by 1, 2
),
cohort_size as (
select cohort_month, count(1) as num_users
from cohort_items
group by 1
order by 1
),
B as (
select
C.cohort_month,
A.month_number,
count(1) as num_users
from user_activities A
left join cohort_items C ON A.sender_msisdn = C.user_id
group by 1, 2
)
select
B.cohort_month,
S.num_users as total_users,
B.month_number,
B.num_users * 100 / S.num_users as percentage
from B
left join cohort_size S ON B.cohort_month = S.cohort_month
where B.cohort_month IS NOT NULL
order by 1, 3
I think the RANK window function is the right solution. The idea is to assign a rank to each user's months of activity, ordered by year and month.
Something like:
WITH activity_per_user AS (
SELECT
user_id,
event_date,
RANK() OVER (PARTITION BY user_id ORDER BY DATE_PART('year', event_date) , DATE_PART('month', event_date) ASC) AS month_number
FROM user_activities_table
)
RANK numbering starts from 1, so you may want to subtract 1.
Then, you can group by user_id and month_number to get the number of interactions for each user per month from the subscription (adapt to your use case accordingly).
SELECT
user_id,
month_number,
COUNT(1) AS n_interactions
FROM activity_per_user
GROUP BY 1, 2
Here is the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html
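Note that RANK assigns tied rows the same number and then skips ahead, and it numbers only the months that have activity. An alternative is to compute calendar months elapsed since registration directly as a difference of year * 12 + month. The following is a minimal sketch under that assumption, using SQLite through Python's sqlite3 module; the users and transactions tables and their rows are hypothetical stand-ins for the tables in the question. Activity dated before registration would still come out negative and would need to be filtered separately.

```python
import sqlite3

# Hypothetical registrations and transactions.
users = [("u1", "2020-11-15"), ("u2", "2020-12-03")]
transactions = [
    ("u1", "2020-11-20"),
    ("u1", "2021-01-05"),
    ("u2", "2020-12-10"),
    ("u2", "2021-01-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, insert_date TEXT)")
conn.execute("CREATE TABLE transactions (sender TEXT, insert_date TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
conn.executemany("INSERT INTO transactions VALUES (?, ?)", transactions)

query = """
SELECT DISTINCT
       t.sender,
       -- Months elapsed: (year * 12 + month) of the transaction
       -- minus (year * 12 + month) of the registration.
       (CAST(strftime('%Y', t.insert_date) AS INTEGER) * 12
          + CAST(strftime('%m', t.insert_date) AS INTEGER))
     - (CAST(strftime('%Y', u.insert_date) AS INTEGER) * 12
          + CAST(strftime('%m', u.insert_date) AS INTEGER)) AS month_number
FROM transactions AS t
JOIN users AS u ON u.user_id = t.sender
ORDER BY t.sender, month_number
"""
month_numbers = conn.execute(query).fetchall()
for row in month_numbers:
    print(row)
```

Unlike a difference of bare month numbers, this stays non-negative across a year boundary as long as the activity is not dated before the registration.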
I am trying to find out whether any date from (select dates from public_holidays) falls between start_date and end_date.
It would look something like this:
select
id, name, start_date, end_date,
case when (public_holidays = true) then number - 1 else number end as find_real_number
from my_table
Sample Data:
ID Name Start_date End_Date Numbers
1 Mike 3/9/2020 4/9/2020 67
2 Rick 3/1/2020 3/6/2020 34
3 Simm 3/24/2020 3/28/2020 98
4 Lisa 3/27/2020 4/5/2020 103
5 Rosy 3/9/2020 4/9/2020 23
And some sample expected results:
ID Name Start_date End_Date Numbers
1 Mike 3/9/2020 4/9/2020 66
2 Rick 3/1/2020 3/6/2020 34
3 Simm 3/24/2020 3/28/2020 98
4 Lisa 3/27/2020 4/5/2020 102
5 Rosy 3/9/2020 4/9/2020 23
Because we assume the 1st of April is a public holiday, the numbers in the 1st and 4th rows are reduced by 1.
And sample public holidays view I created:
Public_holidays Dates
April fools 04/01/2020
Labour Day 05/01/2020
Random Day 07/24/2020
However, because I am building the query in Metabase, I am not allowed to create a table. All I did was create a view with 2 columns, 'Public Holidays' and 'Dates'.
Could anyone suggest how to do this? Thanks.
Try something like this:
SELECT id, name, start_date, end_date,
numbers - ( SELECT COUNT(*) FROM holidays
WHERE dates BETWEEN t.start_date AND t.end_date ) AS numbers
FROM my_table AS t
This assumes that your holidays are in a table/view named holidays. It counts the holidays between the start and end dates and subtracts that count from numbers in my_table.
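A minimal sketch of this correlated-subquery approach, run against SQLite through Python's sqlite3 module with the sample data from the question. Storing the dates as ISO strings is an assumption, made so that BETWEEN compares them in calendar order.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings (assumption).
my_table = [
    (1, "Mike", "2020-03-09", "2020-04-09", 67),
    (2, "Rick", "2020-03-01", "2020-03-06", 34),
    (3, "Simm", "2020-03-24", "2020-03-28", 98),
    (4, "Lisa", "2020-03-27", "2020-04-05", 103),
    (5, "Rosy", "2020-03-09", "2020-04-09", 23),
]
holidays = [
    ("April fools", "2020-04-01"),
    ("Labour Day", "2020-05-01"),
    ("Random Day", "2020-07-24"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER, name TEXT, start_date TEXT, "
    "end_date TEXT, numbers INTEGER)"
)
conn.execute("CREATE TABLE holidays (public_holidays TEXT, dates TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", my_table)
conn.executemany("INSERT INTO holidays VALUES (?, ?)", holidays)

query = """
SELECT t.id, t.name,
       -- Subtract the number of holidays inside this row's date range.
       t.numbers - (SELECT COUNT(*) FROM holidays AS h
                    WHERE h.dates BETWEEN t.start_date AND t.end_date) AS numbers
FROM my_table AS t
ORDER BY t.id
"""
adjusted = conn.execute(query).fetchall()
for row in adjusted:
    print(row)
```

With this sample data the sketch also reduces Rosy's value to 22, because her range contains April 1 as well; the expected table in the question keeps it at 23.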
I think you want to check whether a public holiday falls between the start and end dates,
so you should compare the dates like below:
select
id, name, start_date, end_date,
case when (CAST(start_date as date) < CAST(public_holiday_date as date)
and CAST(public_holiday_date as date) < CAST(end_date as date))
then number - 1 else number end as find_real_number
from my_table
or
select
id, name, start_date, end_date,
case when CAST(public_holiday_date as date)
between CAST(start_date as date) and CAST(end_date as date)
then number - 1 else number end as find_real_number
from my_table
I'm trying to query some transactional data to establish the CurrentProductionHours value for each Report at the end of each month.
Providing there has been a transaction for each report in each month, that's pretty straight-forward... I can use something along the lines of the code below to partition transactions by month and then pick out the rows where TransactionByMonth = 1 (effectively, the last transaction for each report each month).
SELECT
ReportId,
TransactionId,
CurrentProductionHours,
ROW_NUMBER() OVER (PARTITION BY [ReportId], [CalendarYear], [MonthOfYear]
ORDER BY TransactionTimestamp desc
) AS TransactionByMonth
FROM
tblSource
The problem that I have is that there will not necessarily be a transaction for every report every month... When that's the case, I need to carry forward the last known CurrentProductionHours value to the month which has no transaction as this indicates that there has been no change. Potentially, this value may need to be carried forward multiple times.
Source Data:
ReportId TransactionTimestamp CurrentProductionHours
1 2014-01-05 13:37:00 14.50
1 2014-01-20 09:15:00 15.00
1 2014-01-21 10:20:00 10.00
2 2014-01-22 09:43:00 22.00
1 2014-02-02 08:50:00 12.00
Target Results:
ReportId Month Year ProductionHours
1 1 2014 10.00
2 1 2014 22.00
1 2 2014 12.00
2 2 2014 22.00
I should also mention that I have a date table available, which can be referenced if required.
** UPDATE 05/03/2014 **
I now have a query which is generating results as shown in the example below, but I'm left with islands of data (where a transaction existed in that month) and gaps in between... My question is still similar but in some ways a little more generic: what is the best way to fill gaps between data islands if you have the dataset below as a starting point?
ReportId Month Year ProductionHours
1 1 2014 10.00
1 2 2014 12.00
1 3 2014 NULL
2 1 2014 22.00
2 2 2014 NULL
2 3 2014 NULL
Any advice about how to tackle this would be greatly appreciated!
Try this:
;with a as
(
select dateadd(m, datediff(m, 0, min(TransactionTimestamp))+1,0) minTransactionTimestamp,
max(TransactionTimestamp) maxTransactionTimestamp from tblSource
), b as
(
select minTransactionTimestamp TT, maxTransactionTimestamp
from a
union all
select dateadd(m, 1, TT), maxTransactionTimestamp
from b
where tt < maxTransactionTimestamp
), c as
(
select distinct t.ReportId, b.TT from tblSource t
cross apply b
)
select c.ReportId,
month(dateadd(m, -1, c.TT)) Month,
year(dateadd(m, -1, c.TT)) Year,
x.CurrentProductionHours
from c
cross apply
(select top 1 CurrentProductionHours from tblSource
where TransactionTimestamp < c.TT
and ReportId = c.ReportId
order by TransactionTimestamp desc) x
A similar approach, but using a Cartesian product in the first step to obtain all the combinations of report ids/months.
A second step adds to that Cartesian product the maximum timestamp from the source table where the month is less than or equal to the month in the current row.
Finally, it joins the source table to the previous result by report id/timestamp to obtain the latest source table row for every report id/month.
;
WITH allcombinations -- Cartesian (reportid X yearmonth)
AS ( SELECT reportid ,
yearmonth
FROM ( SELECT DISTINCT
reportid
FROM tblSource
) a
JOIN ( SELECT DISTINCT
DATEPART(yy, transactionTimestamp)
* 100 + DATEPART(MM,
transactionTimestamp) yearmonth
FROM tblSource
) b ON 1 = 1
),
maxdates --add correlated max timestamp where the month is less or equal to the month in current record
AS ( SELECT a.* ,
( SELECT MAX(transactionTimestamp)
FROM tblSource t
WHERE t.reportid = a.reportid
AND DATEPART(yy, t.transactionTimestamp)
* 100 + DATEPART(MM,
t.transactionTimestamp) <= a.yearmonth
) maxtstamp
FROM allcombinations a
)
-- join previous data to the source table by reportid and timestamp
SELECT distinct m.reportid ,
m.yearmonth ,
t.CurrentProductionHours
FROM maxdates m
JOIN tblSource t ON t.transactionTimestamp = m.maxtstamp and t.reportid=m.reportid
ORDER BY m.reportid ,
m.yearmonth
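The carry-forward logic can be sketched compactly with a correlated subquery that picks the latest transaction up to each month. This is a minimal sketch using SQLite through Python's sqlite3 module with the source data from the question; the CTE names months and reports and the ym column are hypothetical. Like the query above, it derives the list of months from the transactions themselves, so a month in which no report had any transaction would need the date table instead.

```python
import sqlite3

# Source data from the question.
source = [
    (1, "2014-01-05 13:37:00", 14.5),
    (1, "2014-01-20 09:15:00", 15.0),
    (1, "2014-01-21 10:20:00", 10.0),
    (2, "2014-01-22 09:43:00", 22.0),
    (1, "2014-02-02 08:50:00", 12.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblSource (ReportId INTEGER, TransactionTimestamp TEXT, "
    "CurrentProductionHours REAL)"
)
conn.executemany("INSERT INTO tblSource VALUES (?, ?, ?)", source)

query = """
WITH months AS (
    SELECT DISTINCT strftime('%Y-%m', TransactionTimestamp) AS ym FROM tblSource
),
reports AS (
    SELECT DISTINCT ReportId FROM tblSource
)
SELECT r.ReportId, m.ym,
       -- Latest known value at or before the end of month m.ym.
       (SELECT s.CurrentProductionHours
        FROM tblSource AS s
        WHERE s.ReportId = r.ReportId
          AND strftime('%Y-%m', s.TransactionTimestamp) <= m.ym
        ORDER BY s.TransactionTimestamp DESC
        LIMIT 1) AS ProductionHours
FROM reports AS r
CROSS JOIN months AS m
ORDER BY m.ym, r.ReportId
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Report 2 has no transaction in February, so its January value of 22.00 is carried forward, matching the target results in the question.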