I've got date/duration data, e.g.:
16-Apr (Mon) 30mins
16-Apr (Mon) 90mins
17 Apr (Tue) 60 mins
19-Apr (thu) 15 mins
19-Apr (thu) 20 mins
19-Apr (thu) 20 mins
21-April (Sat ) 120 mins
23-April (Mon) 60 mins
24-Apr (Tue)15 mins
I want to produce an average duration per weekday.
A pivot table with date can give me a sum of the durations for each date.
A pivot table on the (derived) day of the week gives me an average over the number of entries for that weekday. In the data above, for example, Monday has 3 entries, but I want the average taken over its 2 dates.
What I want is Mondays = x mins, Tues = y mins, Wed = z mins, etc.
Can this be done in one step with a pivot table or array formula?
Suppose you have your dates in column E and durations in column F.
This formula will give you a summary by day of week:
=query({arrayformula(weekday(E2:E7)), arrayformula(n(F2:F7))},
"select Col1, sum(Col2) group by Col1")
To get an average, use avg(Col2) instead of sum(Col2). You still have to format the second result column as Duration, and find a way to convert the days of week from numeric to text, but that can be left as an exercise to the reader :-)
You could do this to add labelling:
=ArrayFormula(query({hlookup(weekday(A:A),{1,2,3,4,5,6,7;"Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"},2,false),B:B},"select Col1,avg(Col2) group by Col1 label Col1 'Day'"))
The day numbers in column C are just for checking.
If any of the dates are missing, you need to ignore them; otherwise they will all be treated as Saturdays:
=ArrayFormula(query({A:A,hlookup(weekday(A:A),{1,2,3,4,5,6,7;"Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"},2,false),B:B},"select Col2,avg(Col3) where Col1 is not null group by Col2 label Col2 'Day'"))
EDIT
OK, this is an answer for the actual requirement: the sum of the durations for each weekday, divided by the number of unique dates that fall on that weekday.
=ArrayFormula(query({query({C2:C,weekday(C2:C),choose(weekday(C2:C),"Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),E2:E},"select Col3,sum(Col4) where Col1 is not null group by Col2,Col3"),
query({unique(C2:C),weekday(unique(C2:C))},"select count(Col2) where Col1 is not null group by Col2")},"Select Col1,Col2/Col3 label Col1 'Day',Col2/Col3 'Average'"))
Notes
(1) The number of groups based on unique dates is exactly the same as the number of groups based on dates with duplicates.
(2) To get the day names in the correct order (Sunday, Monday, Tuesday...) I grouped on weekday number (1-7) then weekday name (Col2,Col3 in the first inner query). This doesn't create any more groups than just using Col3, but has the effect of putting the days in weekday number order instead of weekday name (alphabetical) order while still allowing you to put Col3 (weekday name) in the select list.
(3) I have included @ttarchala's recommendation of using Choose rather than Hlookup.
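As a cross-check of the arithmetic, here is a minimal Python sketch (not part of the spreadsheet answer) that computes the same quantity: total minutes per weekday divided by the number of distinct dates on that weekday. The year 2018 is an assumption, chosen only because it makes 16-Apr fall on a Monday as in the question; the name average_per_weekday is made up for illustration.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Sample rows from the question; the year 2018 is an assumption.
rows = [
    (date(2018, 4, 16), 30), (date(2018, 4, 16), 90),
    (date(2018, 4, 17), 60),
    (date(2018, 4, 19), 15), (date(2018, 4, 19), 20), (date(2018, 4, 19), 20),
    (date(2018, 4, 21), 120),
    (date(2018, 4, 23), 60),
    (date(2018, 4, 24), 15),
]

def average_per_weekday(rows):
    """Sum of minutes per weekday divided by the count of distinct dates."""
    totals = defaultdict(int)
    dates = defaultdict(set)
    for d, minutes in rows:
        name = DAY_NAMES[d.weekday()]  # weekday(): Monday=0 .. Sunday=6
        totals[name] += minutes
        dates[name].add(d)
    return {name: totals[name] / len(dates[name]) for name in totals}

averages = average_per_weekday(rows)
for name in DAY_NAMES:
    if name in averages:
        print(name, averages[name])
```

With the sample data, Monday comes out as 90 (180 minutes over 2 dates) rather than the 60 that a plain per-entry average would give.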
Related
I have a table which features 4 columns of dates. I need to calculate in column 5 the date which is the closest future date to today and display it within the same row, e.g. is the 20th anniv closer than the 85th birthday, or is the 10th anniv closer than the 85th birthday? NB the 85th birthday will always be the maximum date. Column 6 then needs to display the appropriate column heading.
Really appreciate any help that anyone can offer.
Column names / sample values
strt date - 01/01/2010
85th birthday - 11/12/2047
10th anniv - 01/01/2020
20th anniv - 01/01/2030
next date - 01/01/2030
anniv_type - 20th anniv
The following query uses CROSS APPLY to unpivot the 4 date columns and then picks the one closest to the current date. TOP 1 returns only one row, i.e. the nearest date:
select *
from yourtable t
cross apply
(
select top 1 *
from
(
values ([strt date], 'strt date'),
([85th birthday], '85th birthday'),
([10th anniv], '10th anniv'),
([20th anniv], '20th anniv')
) d ([next date], [anniv_type])
where [next date] > getdate()
order by datediff(day, getdate(), [next date])
) n
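The selection logic of the query above (keep only future dates, then take the nearest one together with its column name) can be sketched in Python. This is only an illustration, not the T-SQL itself: the function name next_date, the fixed value of today, and reading 11/12/2047 as 11 December are all assumptions.

```python
from datetime import date

def next_date(row, today):
    """Return (date, column name) of the closest date after today, or None."""
    future = [(d, name) for name, d in row.items() if d > today]
    # min() compares the dates first, so the nearest future date wins.
    return min(future) if future else None

# Sample row from the question (dates assumed to be in dd/mm/yyyy order).
row = {
    "strt date": date(2010, 1, 1),
    "85th birthday": date(2047, 12, 11),
    "10th anniv": date(2020, 1, 1),
    "20th anniv": date(2030, 1, 1),
}

today = date(2024, 6, 1)  # hypothetical "current date" for the example
result = next_date(row, today)
print(result)
```

Passing today explicitly instead of reading the system clock keeps the example reproducible.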
I have two tables and I am trying to find data gaps in them where the dates do not overlap.
Item Table:
id unique start_date end_date data
1 a 2019-01-01 2019-01-31 X
2 a 2019-02-01 2019-02-28 Y
3 b 2019-01-01 2019-06-30 Y
Plan Table:
id item_unique start_date end_date
1 a 2019-01-01 2019-01-10
2 a 2019-01-15 'infinity'
I am trying to find a way to produce the following
Missing:
item_unique from to
a 2019-01-11 2019-01-14
b 2019-01-01 2019-06-30
step-by-step demo:db<>fiddle
WITH excepts AS (
SELECT
item,
generate_series(start_date, end_date, interval '1 day') gs
FROM items
EXCEPT
SELECT
item,
generate_series(start_date, CASE WHEN end_date = 'infinity' THEN ( SELECT MAX(end_date) as max_date FROM items) ELSE end_date END, interval '1 day')
FROM plan
)
SELECT
item,
MIN(gs::date) AS start_date,
MAX(gs::date) AS end_date
FROM (
SELECT
*,
SUM(same_day) OVER (PARTITION BY item ORDER BY gs)
FROM (
SELECT
item,
gs,
COALESCE((gs - LAG(gs) OVER (PARTITION BY item ORDER BY gs) >= interval '2 days')::int, 0) as same_day
FROM excepts
) s
) s
GROUP BY item, sum
ORDER BY 1,2
Finding the missing days is quite simple. This is done within the WITH clause:
Generate all days of each item's date range and subtract from them the expanded list of days from the second table. All dates that do not occur in the second table are kept. The infinity end is a little bit tricky, so I replaced the infinity occurrence with the max date of the first table. This avoids expanding an infinite list of dates.
The more interesting part is to reaggregate this list again, which is the part outside the WITH clause:
The lag() window function takes the previous date in the list. If the gap to that previous date is at least two days, the expression yields 1 (a new interval starts); otherwise it yields 0. (A time-change issue occurred here, which is why I am not asking for a one-day difference but for a 2-day difference: between 2019-03-31 and 2019-04-01 there are only 23 hours because of daylight saving time.)
These 0 and 1 values are summed cumulatively. Whenever there is a gap greater than one day, a new interval begins (the days in between are covered).
This results in a groupable column which can be used to aggregate and find the min and max date of each interval.
I tried something with date ranges, which seems to be a better way, especially for avoiding the expansion of long date lists, but I didn't come up with a proper solution. Maybe someone else can?
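The two stages described above (subtract the covered days, then regroup consecutive days into intervals) can be sketched in Python on the sample data. This is a rough illustration rather than a translation of the SQL: the function names are made up, and date.max is used as a stand-in for 'infinity'.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def days(start, end):
    """Yield every date from start to end inclusive."""
    d = start
    while d <= end:
        yield d
        d += ONE_DAY

def missing_ranges(items, plans):
    """Return (key, from, to) for every run of item days not covered by a plan."""
    cap = max(end for _, _, end in items)  # replaces the open-ended plan end
    needed, covered = {}, {}
    for key, start, end in items:
        needed.setdefault(key, set()).update(days(start, end))
    for key, start, end in plans:
        covered.setdefault(key, set()).update(days(start, min(end, cap)))
    result = []
    for key in sorted(needed):
        for d in sorted(needed[key] - covered.get(key, set())):
            # Extend the current run if this day directly follows it,
            # otherwise start a new run.
            if result and result[-1][0] == key and result[-1][2] + ONE_DAY == d:
                result[-1] = (key, result[-1][1], d)
            else:
                result.append((key, d, d))
    return result

items = [
    ("a", date(2019, 1, 1), date(2019, 1, 31)),
    ("a", date(2019, 2, 1), date(2019, 2, 28)),
    ("b", date(2019, 1, 1), date(2019, 6, 30)),
]
plans = [
    ("a", date(2019, 1, 1), date(2019, 1, 10)),
    ("a", date(2019, 1, 15), date.max),  # stand-in for 'infinity'
]

missing = missing_ranges(items, plans)
for row in missing:
    print(row)
```

Because the sketch works on plain dates rather than timestamps, the daylight saving issue mentioned above does not arise here.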
I have a field named late_in that contains data like 2017-05-29 08:36:44, where the limit for entry time is 08:30:00 every day.
What I want is to get the year, the month, and how many times he was late in that month, even if he was never late in that month.
I want the result to look something like this:
year month late
-------------------
2017 1 6
2017 2 0
2017 3 14
and so on until the end of the year.
You are looking for conditional aggregation:
select extract(year from late_in) as year,
extract(month from late_in ) as month,
count(*) filter (where late_in::time > time '08:30:00') as late
from the_table
group by extract(year from late_in),
extract(month from late_in );
This assumes that late_in is defined as timestamp.
The expression late_in::time returns only the time part of the value, and the filter() clause for the aggregation results in only those rows being counted where the condition is true, i.e. where the time part is after 08:30.
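To illustrate what conditional aggregation does, here is a small Python sketch: every row contributes to its (year, month) group, but only rows past the limit add to the count, so a month whose rows are all on time still appears with 0. Apart from the first timestamp, which is taken from the question, the sample values and the name late_per_month are invented for the example.

```python
from collections import Counter
from datetime import datetime, time

LIMIT = time(8, 30)  # entry-time limit from the question

def late_per_month(timestamps):
    """Count rows per (year, month) whose time part is after LIMIT."""
    counts = Counter()
    for ts in timestamps:
        # Adding a bool adds 0 or 1; the group is created either way.
        counts[(ts.year, ts.month)] += ts.time() > LIMIT
    return dict(sorted(counts.items()))

timestamps = [
    datetime(2017, 5, 29, 8, 36, 44),  # late (value from the question)
    datetime(2017, 5, 30, 8, 10, 0),   # on time (hypothetical)
    datetime(2017, 6, 1, 8, 29, 59),   # on time (hypothetical)
    datetime(2017, 6, 2, 8, 30, 0),    # exactly at the limit, not late
]

result = late_per_month(timestamps)
print(result)
```

As with the SQL, a month that has no rows at all produces no group.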
To find the number of days between two dates we can use something like this:
SELECT date_part('day',age('2017-01-31','2017-01-01')) as total_days;
In the above query we got 30 as output instead of 31. Why is that?
And I also want to find the number of days except Sundays. Expected output for the interval ('2017-01-01', '2017-01-31'):
Total Days = 31
Total Days except Sundays = 26
You need to define "between two dates" more closely. Lower and upper bound included or excluded? A common definition would be to include the lower and exclude the upper bound of an interval. Plus, define the result as 0 when lower and upper bound are identical. This definition happens to coincide with date subtraction exactly.
SELECT date '2017-01-31' - date '2017-01-01' AS days_between
This exact definition is important for excluding Sundays. For the given definition an interval from Sun - Sun (1 week later) does not include the upper bound, so there is only 1 Sunday to subtract.
interval in days | sundays
0 | 0
1-6 | 0 or 1
7 | 1
8-13 | 1 or 2
14 | 2
...
An interval of 7 days always includes exactly one Sunday.
We can get the minimum result with a plain integer division (days / 7), which truncates the result.
The extra Sunday for the remainder of 1 - 6 days depends on the first day of the interval. If it's a Sunday, bingo; if it's a Monday, too bad. Etc. We can derive a simple formula from this:
SELECT days, sundays, days - sundays AS days_without_sundays
FROM (
SELECT z - a AS days
, ((z - a) + EXTRACT(isodow FROM a)::int - 1 ) / 7 AS sundays
FROM (SELECT date '2017-01-02' AS a -- your interval here
, date '2017-01-30' AS z) tbl
) sub;
Works for any given interval.
Note: isodow, not dow for EXTRACT().
To include the upper bound, just replace z - a with (z - a) + 1. (Would work without parentheses, due to operator precedence, but better be clear.)
Performance characteristic is O(1) (constant) as opposed to a conditional aggregate over a generated set with O(N).
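The closed-form count can be checked against a day-by-day enumeration. The following Python sketch mirrors the formula above (Python's isoweekday() numbers Monday=1 to Sunday=7, like isodow); the function names and the inclusive flag are made up for illustration, and it assumes z is not before a.

```python
from datetime import date, timedelta

def count_sundays(a, z, inclusive=False):
    """Return (days, sundays) for the interval from a to z.

    By default z is excluded, which matches plain date subtraction.
    """
    days = (z - a).days + (1 if inclusive else 0)
    # Same arithmetic as the SQL: (days + isodow(a) - 1) / 7, truncated.
    sundays = (days + a.isoweekday() - 1) // 7
    return days, sundays

def count_sundays_by_enumeration(a, days):
    """Walk every day of the interval; O(N) reference for comparison."""
    return sum((a + timedelta(days=i)).isoweekday() == 7 for i in range(days))

days, sundays = count_sundays(date(2017, 1, 1), date(2017, 1, 31), inclusive=True)
print(days, sundays, days - sundays)  # 31 5 26
```

With the upper bound included, this reproduces the 31 total days and 26 days without Sundays expected in the question.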
Related:
How do I determine the last day of the previous month using PostgreSQL?
Calculate working hours between 2 dates in PostgreSQL
You could try using generate_series to generate all the dates between the given dates and then count the required days.
SELECT
count(case when extract(dow from generate_series) <> 0 then 1 end) n
from generate_series('2017-01-01'::date,'2017-01-31'::date, '1 day');
I'm trying to produce a fully refreshed set of numbers each week, pulling from a table in Hive. Right now I am using this method:
SELECT
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-01-30") then userid end) as week_1,
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-02-06") then userid end) as week_2
FROM Data;
I'm trying to get something more along the lines of:
SELECT
Month(timestamp), Week(timestamp), COUNT (DISTINCT userid)
FROM Data
Group By Month, Week
But my week runs Sunday to Saturday. Is there a smarter way to be doing this that works in HIVE?
Solution found:
You can simply create your own formula instead of going with the pre-defined function for "week of the year". Advantage: you will be able to take any set of 7 days as a week.
In your case, since you want the week to run from Sunday to Saturday, we just need the date of the first Sunday in the year.
E.g. in 2016, the first Sunday is '2016-01-03', i.e. the 3rd of Jan '16 (assuming the timestamp column is in the format 'yyyy-mm-dd').
SELECT
count(distinct UserId), floor(datediff(timestamp,'2016-01-03') / 7) + 1 as week_of_the_year
FROM table.data
where timestamp>='2016-01-03'
group by floor(datediff(timestamp,'2016-01-03') / 7) + 1;
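The week-number arithmetic described above (days since the first Sunday, integer-divided by 7, plus 1) can be sketched in Python to see which dates land in which week. The event rows and function names here are invented for the example; only the first Sunday of 2016 comes from the answer.

```python
from collections import defaultdict
from datetime import date

FIRST_SUNDAY = date(2016, 1, 3)  # first Sunday of 2016, as stated above

def week_of_year(d):
    """Week number for Sunday-to-Saturday weeks, counting from FIRST_SUNDAY."""
    return (d - FIRST_SUNDAY).days // 7 + 1

def distinct_users_per_week(events):
    """Map week number -> count of distinct user ids, skipping earlier dates."""
    users = defaultdict(set)
    for d, user_id in events:
        if d >= FIRST_SUNDAY:
            users[week_of_year(d)].add(user_id)
    return {week: len(ids) for week, ids in sorted(users.items())}

events = [
    (date(2016, 1, 3), "u1"),   # Sunday, week 1
    (date(2016, 1, 9), "u1"),   # Saturday, still week 1
    (date(2016, 1, 9), "u2"),
    (date(2016, 1, 10), "u1"),  # next Sunday, week 2
    (date(2016, 1, 1), "u3"),   # before the first Sunday, skipped
]

result = distinct_users_per_week(events)
print(result)
```

Changing FIRST_SUNDAY to any other anchor date shifts the 7-day window, which is the flexibility the answer points out.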
I see that you need the data to be grouped by week. You can just do this:
SELECT weekofyear(to_date(timestamp)), COUNT (DISTINCT userid) FROM Data Group By weekofyear(to_date(timestamp))