Counting by Week in Hive - date

I'm trying to produce a fully refreshed set of numbers each week, pulling from a table in hive. Right now I using this method:
SELECT
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-01-30") then userid end) as week_1,
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-02-06") then userid end) as week_2
FROM Data;
I'm trying to get something more along the lines of:
SELECT
Month(timestamp), Week(timestamp), COUNT (DISTINCT userid)
FROM Data
Group By Month, Week
But my week runs Sunday to Saturday. Is there a smarter way to be doing this that works in HIVE?
Solution found:
You can simply create your own formula instead of going with pre-defined function for "week of the year" Advantage: you will be able to take any set of 7 days for a week.
In your case since you want the week should start from Sunday-Saturday we will just need the first date of sunday in a year
eg- In 2016, First Sunday is on '2016-01-03' which is 3rd of Jan'16 --assumption considering the timestamp column in the format 'yyyy-mm-dd'
SELECT
count(distinct UserId), lower(datediff(timestamp,'2016-01-03') / 7) + 1 as week_of_the_year
FROM table.data
where timestamp>='2016-01-03'
group by lower(datediff(timestamp,'2016-01-03') / 7) + 1;

I see that you need the data to be grouped by week. you can just do this :
SELECT weekofyear(to_date(timestamp)), COUNT (DISTINCT userid) FROM Data Group By weekofyear(to_date(timestamp))

Related

Using 'over' function results in column "table.id" must appear in the GROUP BY clause or be used in an aggregate function

I'm currently writing an application which shows the growth of the total number of events in my table over time, I currently have the following query to do this:
query = session.query(
count(Event.id).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
This results in the following output:
Count
Year
Month
100
2021
1
50
2021
2
75
2021
3
While this is okay on it's own, I want it to display the total number over time, so not just the number of events that month, so the desired outpout should be:
Count
Year
Month
100
2021
1
150
2021
2
225
2021
3
I read on various places I should use a window function using SqlAlchemy's over function, however I can't seem to wrap my head around it and every time I try using it I get the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) column "event.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT count(event.id) OVER (PARTITION BY event.date ORDER...
^
[SQL: SELECT count(event.id) OVER (PARTITION BY event.date ORDER BY EXTRACT(year FROM event.date), EXTRACT(month FROM event.date)) AS count, EXTRACT(year FROM event.date) AS year, EXTRACT(month FROM event.date) AS month
FROM event
WHERE event.date IS NOT NULL GROUP BY year, month]
This is the query I used:
session.query(
count(Event.id).over(
order_by=(
extract('year', Event.date),
extract('month', Event.date)
),
partition_by=Event.date
).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
Could someone show me what I'm doing wrong? I've been searching for hours but can't figure out how to get the desired output as adding event.id in the group by would stop my rows from getting grouped by month and year
The final query I ended up using:
query = session.query(
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month'),
func.sum(func.count(Event.id)).over(order_by=(
extract('year', Event.date),
extract('month', Event.date)
)).label('count'),
).filter(
Event.date.isnot(None)
).group_by('year', 'month')
I'm not 100% sure what you want, but I'm assuming you want the number of events up to that month for each month. You're going to first need to calculate the # of events per month and also sum them with the postgresql window function.
You can do that with in a single select statement:
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, SUM(COUNT(events.id)) OVER(ORDER BY extract(year FROM events.date), extract(month FROM events.date)) AS total_so_far
FROM events
GROUP BY 1,2
but it might be easier to think about if you split it into two:
SELECT year, month, SUM(events_count) OVER(ORDER BY year, month)
FROM (
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, COUNT(events.id) AS events_count
FROM events
GROUP BY 1,2
)
but not sure how to do that in SqlAlchemy

How to Display Last rolling 12 months in correct month order

I have an SSRS report that shows customer sales for the year and I have been asked to change it to the last 13 rolling months. I have changed my where clause to be:
WHERE (#First12Months.FirstSaleDate BETWEEN DATEADD(MM,-13,#ReportDate) AND (#ReportDate))
( #ReportDate is the last day of the month that needs to be displayed on the right of the matrix.)
This where clause pulls the correct data but it is still displaying in my monthsort order and I need to change this to the last 12 months so that the newest month is on the right and the oldest month is on the left. I cannot work out how to do the sort.
My old sort is MonthSort which gives each month a number where April is 1 through to March = 12:
CASE WHEN Month(#First12Months.FirstSaleDate)<=3 THEN MONTH(#First12Months.FirstSaleDate)+9 ELSE MONTH(#First12Months.FirstSaleDate)-3 END AS MonthSort
but of course this is now incorrect as I need the month from #ReportDate to be number 13 and each month before that chronologically to be 1 number less.
I found this post which is the only one that seems to come close to what I need but unfortunately I simply don't understand what it is saying.
Dynamic table/output each month for report
How do I tell the MonthSort column which number to allocate to the months to get the correct sort order for a rolling 13 months?
As your data is in rows and your SSRS displays it in columns you can do the following:
Add a sorting column to your sql query that uses an analytical function in order to give the (dense) rank of the month. That rank can then be used as a sorting criteria in SSRS.
Assuming your month column is called month, your query could look like this:
select t.*, dense_rank() over (order by month) rnk from t
That order could also be done descending like this:
select t.*, dense_rank() over (order by month desc) rnk from t
Let's have an example:
with t as (
select 2134 sales, cast('20190101' as date) month union all
select 3456 sales, cast('20190201' as date) month union all
select 234 sales, cast('20190301' as date) month union all
select 4567 sales, cast('20190401' as date) month union all
select 5678 sales, cast('20190501' as date) month union all
select 234 sales, cast('20190601' as date) month union all
select 756 sales, cast('20190701' as date) month union all
select 9 sales, cast('20190801' as date) month union all
select 24356134 sales, cast('20190901' as date) month union all
select 2456134 sales, cast('20191001' as date) month union all
select 234 sales, cast('20191101' as date) month union all
select 675 sales, cast('20191201' as date) month union all
select 86 sales, cast('20200101' as date) month union all
select 786 sales, cast('20200201' as date) month union all
select 715 sales, cast('20200301' as date) month union all
select 156 sales, cast('20200401' as date) month union all
select 123 sales, cast('20200501' as date) month union all
select 687 sales, cast('20200601' as date) month union all
select 45 sales, cast('20200701' as date) month
)
, t1 as (
select sales, month from t where t.month > dateadd(MONTH, -12, getdate())
)
select t1.*, DENSE_RANK() over (order by datefromparts(year(month), month([month]), 1)) rnk from t1
will return
I guess, your first query retrieving the data is correct. Changing the columns' order would have to be done in the SSRS report.
For sorting tablix (table elements in SSRS) have a look here
Posting my own answer:
I have managed to work out a formula that calculates a position number for each column to replace my MonthSort column:
select case when MONTH(saledate) BETWEEN MONTH(dateadd(mm,-11,#ReportDate)) AND 12 THEN((MONTH(saledate)+1)-MONTH(#ReportDate)+11) ELSE ((month(saledate)+1)+month(#ReportDate)+11) end as position,
from table
WHERE (saledate BETWEEN DATEADD(MM,-11,#ReportDate) AND (#ReportDate))
This doesn't quite give me what I wanted as I wanted 12 months but couldn't work out how to differentiate between the same month this year and the same month last year e.g. if the report date is 30/6/2020 then with a 12 month parameter it gives 2 June months (1 for each year 2019 and 2020) but places both of them in position 1 when June 2019 should be in position 1 and June 2020 should be in position 13. Works well with 11 months. If anyone can help with getting it to 12 months I would be grateful
I am posting another answer of my own because it is a different way of achieving what I needed as well as the simplest - I was making this issue far more complicated than it needed to be.
In my second effort I added an extra column called SaleYear using: YEAR(SaleDate). I already had a column for MONTH(SaleDate) that I was using in a case when to achieve the Apr to March sort.
I restricted the data within the SQL query to the last 13 months using a where clause:
WHERE (SaleDate BETWEEN DATEADD(MM,-13,#ReportDate) AND DATEADD(minute, - 1, #ReportDate + 1))
And in the SSRS report I added 'Year' as a parent column group to the 'Month' column group. In the column group I added sorting by Year and then by month.
Because I had already restricted the data in the sql query to the last 13 months I have the last 13 rolling months in the correct order.
This is the cleanest and most simplistic answer.

DB2: Bi-monthly query for a DB2 report

I am currently writing a Crystal Report that has a DB2 query as its backend. I have finished the query but am stuck on the date portion of it. I am going to be running it twice a month - once on the 16th, and once on the 1st of the next month. Here's how it should work:
If I run it on the 16th of the month, it will give me results from the 1st of that same month to the 15th of that month.
If I run it on the 1st of the next month, it will give me results from the 16th of the previous month to the last day of the previous month.
This comes down a basic bi-monthly report. I've found plenty of hints to do this in T-SQL, but no efficient ways on how to accomplish this in DB2. I'm having a hard time wrapping my head around the logic to get this to consistently work, taking into account differences in month lengths and such.
There are 2 expressions for start and end date of an interval depending on the report date passed, which you may use in your where clause.
The logic is as follows:
1) If the report date is the 1-st day of a month, then:
DATE_START is 16-th of the previous month
DATE_END is the last day of the previous month
2) Otherwise:
DATE_START is 1-st of the current month
DATE_END is 15-th of the current month
SELECT
REPORT_DATE
, CASE DAY(REPORT_DATE) WHEN 1 THEN REPORT_DATE - 1 MONTH + 15 ELSE REPORT_DATE - DAY(REPORT_DATE) + 1 END AS DATE_START
, CASE DAY(REPORT_DATE) WHEN 1 THEN REPORT_DATE - 1 ELSE REPORT_DATE - DAY(REPORT_DATE) + 15 END AS DATE_END
FROM
(
VALUES
DATE('2020-02-01')
, DATE('2020-02-05')
, DATE('2020-02-16')
) T (REPORT_DATE);
The result is:
|REPORT_DATE|DATE_START|DATE_END |
|-----------|----------|----------|
|2020-02-01 |2020-01-16|2020-01-31|
|2020-02-05 |2020-02-01|2020-02-15|
|2020-02-16 |2020-02-01|2020-02-15|
In Db2 (for Unix, Linux and Windows) it could be a WHERE Condition like
WHERE
(CASE WHEN date_part('days', CURRENT date) > 15 THEN yourdatecolum >= this_month(CURRENT date) AND yourdatecolum < this_month(CURRENT date) + 15 days
ELSE yourdatecolum > this_month(CURRENT date) - 1 month + 15 DAYS AND yourdatecolum < this_month(CURRENT date)
END)
Check out the THIS_MONTH function - there are multiple ways to do it. Also DAYS_TO_END_OF_MONTH might be helpful

PostgreSQL count how many late every month in a year

I have this field named late_in that contains data like this 2017-05-29 08:36:44 where the limit for entry time is 08:30:00 every day.
What I want to do is to get the year, month and how many times he late in that month even if it zero late in the month.
I want the result look something like this:
year month late
-------------------
2017 1 6
2017 2 0
2017 3 14
and continue until the end of year.
You are looking for conditional aggregation:
select extract(year from late_in) as year,
extract(month from late_in ) as month,
count(*) filter (where late_in::time > time '08:30:00') as late
from the_table
group by extract(year from late_in),
extract(month from late_in );
This assumes that late_in is defined as timestamp.
The expression late_in::time returns only the time part of the value and the filter() clause for the aggregation will result in only those rows being counted where the condition is true, i.e. where the time part is after 08:30

Selecting the current week of data and reseting from week to week

I want to just select data for the current week. So...
If the current date is a Monday just select Monday
If the current date is a Tuesday select Monday and Tuesday's data
If the current date is Wednesday select Monday, Tuesday and Wednesday
...and so on. I want it to reset on Sunday and I believe it's some kind of "where" clause just don't know what. As you can see below I'm just counting the number of pieces into the oven and want it to accumulate as the week goes on and then reset on Sunday.
select
count(*) as PiecesIntoOven
from ovenfeederfloat
where...??
Thanks for the help.
If you're looking to do this in Sql Server, see below. Essentially this converts the current date to its numeric (0-6) value, then finds the 0th date for that week and uses it to set the lower bound of the where clause.
select sum(numberofpieces)
from Test
where dateofwork <= getdate()
and dateofwork >= (DATEADD(DAY, DATEPART(WEEKDAY,getdate()) * -1, getdate()) + 1)
Note that the '0' value is impacted by DATEFIRST. https://stackoverflow.com/a/1113891/4824030
I'm not certain how to do this in Oracle. Something like the below should work, but it's being finicky in sqlfiddle.
select sum(numberofpieces)
from Test
where dateofwork <= current_timestamp
and dateofwork >= (((to_char(level+trunc(current_timestamp,'D'),'Day') * -1) + current_timestamp) + 1)