BigQuery - DATE_TRUNC error - date

I am trying to get monthly aggregated data from a legacy table, which means the date columns are stored as strings:
amount date_create
100 2018-01-05
200 2018-02-03
300 2018-01-22
However, the command
Select DATE_TRUNC(DATE date_create, MONTH) as month,
sum(amount) as amount_m
from table
group by 1
Returns the following error:
Error: Syntax error: Expected ")" but got identifier "date_create"
Why does this query not run and what can be done to avoid the issue?
Thanks

It looks like you meant to cast date_create instead of using the DATE keyword (which is how you construct a literal value) there. Try this instead:
Select DATE_TRUNC(DATE(date_create), MONTH) as month,
sum(amount) as amount_m
from table
GROUP BY 1

I figured it out:
date_trunc(cast(date_create as date), MONTH) as Month

Another option for BigQuery Standard SQL is to use the PARSE_DATE function:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 100 amount, '2018-01-05' date_create UNION ALL
SELECT 200, '2018-02-03' UNION ALL
SELECT 300, '2018-01-22'
)
SELECT
DATE_TRUNC(PARSE_DATE('%Y-%m-%d', date_create), MONTH) AS month,
SUM(amount) AS amount_m
FROM `project.dataset.table`
GROUP BY 1
with the result:
Row month amount_m
1 2018-01-01 400
2 2018-02-01 200
In practice, I prefer PARSE_DATE over CAST, because the former documents the expected data format.
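To illustrate the same idea outside BigQuery, here is a minimal Python sketch, assuming the sample rows from the question. The helper names `month_start` and `monthly_totals` are hypothetical. It parses each string with an explicit format (the counterpart of PARSE_DATE), truncates the date to the first of its month (the counterpart of DATE_TRUNC with MONTH), and sums the amounts per month.

```python
from collections import defaultdict
from datetime import date, datetime

# Sample rows from the question: (amount, date_create stored as a string).
rows = [
    (100, "2018-01-05"),
    (200, "2018-02-03"),
    (300, "2018-01-22"),
]


def month_start(date_string: str, fmt: str = "%Y-%m-%d") -> date:
    """Parse a date string with an explicit format and truncate it to its month."""
    parsed = datetime.strptime(date_string, fmt).date()
    return parsed.replace(day=1)


def monthly_totals(rows):
    """Sum the amounts per month, keyed by the first day of each month."""
    totals = defaultdict(int)
    for amount, date_create in rows:
        totals[month_start(date_create)] += amount
    return dict(sorted(totals.items()))


result = monthly_totals(rows)
for month, amount_m in result.items():
    print(month, amount_m)
```

Because the format is explicit, a string in a different layout (for example `05/01/2018`) raises a `ValueError` instead of being silently misread.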

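Quoting date_create, as in the query below, does not fix this: a quoted name is a string literal rather than a reference to the column, so DATE_TRUNC still fails: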
Select DATE_TRUNC('date_create', MONTH) as month,
sum(amount) as amount_m
from table
group by 1

Related

How to sum for previous n number of days for a number of dates in PostgreSQL

I have a list of dates each with a value in Postgresql.
For each date I want to sum the value for this date and the previous 4 days.
I also want to sum the values for the start of that month to the present date. So for example:
For 07/02/2021 sum all values from 07/02/2021 to 01/02/2021
For 06/02/2021 sum all values from 06/02/2021 to 01/02/2021
For 31/01/2021 sum all values from 31/01/2021 to 01/01/2021
The output should look like this, and will be created as two separate tables:
[Image: Output]
Any help would be appreciated.
Thanks
Sample data and structure: dbfiddle
For first part of query:
select date,
value,
sum(value) over (
order by to_date(date, 'DD/MM/YYYY')
rows between 4 preceding and current row) as five_day_period
from your_table_name
order by to_date(date, 'DD/MM/YYYY') desc;
For second part of query:
select date,
value,
sum(value)
over (
partition by regexp_replace(date, '[0-9]{2}/(.+)', '\1')
order by to_date(date, 'DD/MM/YYYY')
rows between unbounded preceding and current row) as month_to_date
from your_table_name
order by to_date(date, 'DD/MM/YYYY') desc;
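Note that `rows between 4 preceding and current row` counts rows, not calendar days, so it matches "this date and the previous 4 days" only when there is exactly one row per day with no gaps. The following Python sketch computes both sums by calendar day. The sample data is hypothetical (the dbfiddle contents are not reproduced here), it assumes one row per date, and the helper names are made up for illustration.

```python
from datetime import date, datetime, timedelta

# Hypothetical sample data in the question's DD/MM/YYYY text format.
rows = [
    ("31/01/2021", 5),
    ("01/02/2021", 10),
    ("02/02/2021", 20),
    ("03/02/2021", 30),
    ("06/02/2021", 40),
    ("07/02/2021", 50),
]


def parse(text: str) -> date:
    """Convert a DD/MM/YYYY string to a date, like to_date(date, 'DD/MM/YYYY')."""
    return datetime.strptime(text, "%d/%m/%Y").date()


# Assumes one row per date; duplicate dates would overwrite each other here.
values = {parse(text): value for text, value in rows}


def five_day_sum(day: date) -> int:
    """Sum the values for `day` and the 4 calendar days before it."""
    start = day - timedelta(days=4)
    return sum(v for d, v in values.items() if start <= d <= day)


def month_to_date(day: date) -> int:
    """Sum the values from the first of the month up to and including `day`."""
    start = day.replace(day=1)
    return sum(v for d, v in values.items() if start <= d <= day)


for day in sorted(values, reverse=True):
    print(day.strftime("%d/%m/%Y"), five_day_sum(day), month_to_date(day))
```

In this sample, 04/02/2021 and 05/02/2021 are missing, so the five-day sum for 07/02/2021 covers only three rows, whereas a row-based window would reach back to 01/02/2021.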

Using 'over' function results in column "table.id" must appear in the GROUP BY clause or be used in an aggregate function

I'm currently writing an application which shows the growth of the total number of events in my table over time. I currently have the following query to do this:
query = session.query(
count(Event.id).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
This results in the following output:
Count Year Month
100 2021 1
50 2021 2
75 2021 3
While this is okay on its own, I want it to display the running total over time, not just the number of events in each month, so the desired output should be:
Count Year Month
100 2021 1
150 2021 2
225 2021 3
I read in various places that I should use a window function via SQLAlchemy's over function. However, I can't seem to wrap my head around it, and every time I try using it I get the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) column "event.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT count(event.id) OVER (PARTITION BY event.date ORDER...
^
[SQL: SELECT count(event.id) OVER (PARTITION BY event.date ORDER BY EXTRACT(year FROM event.date), EXTRACT(month FROM event.date)) AS count, EXTRACT(year FROM event.date) AS year, EXTRACT(month FROM event.date) AS month
FROM event
WHERE event.date IS NOT NULL GROUP BY year, month]
This is the query I used:
session.query(
count(Event.id).over(
order_by=(
extract('year', Event.date),
extract('month', Event.date)
),
partition_by=Event.date
).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
Could someone show me what I'm doing wrong? I've been searching for hours but can't figure out how to get the desired output as adding event.id in the group by would stop my rows from getting grouped by month and year
The final query I ended up using:
query = session.query(
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month'),
func.sum(func.count(Event.id)).over(order_by=(
extract('year', Event.date),
extract('month', Event.date)
)).label('count'),
).filter(
Event.date.isnot(None)
).group_by('year', 'month')
I'm not 100% sure what you want, but I'm assuming you want the number of events up to that month for each month. You first need to calculate the number of events per month and then sum those counts cumulatively with a PostgreSQL window function.
You can do that in a single SELECT statement:
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, SUM(COUNT(events.id)) OVER(ORDER BY extract(year FROM events.date), extract(month FROM events.date)) AS total_so_far
FROM events
GROUP BY 1,2
but it might be easier to think about if you split it into two:
SELECT year, month, SUM(events_count) OVER(ORDER BY year, month)
FROM (
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, COUNT(events.id) AS events_count
FROM events
GROUP BY 1,2
) AS monthly_counts
I'm not sure how to do that in SQLAlchemy, though.
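The two-step form can be tried out with a small, self-contained Python sketch. It uses the standard library's sqlite3 module as a stand-in for PostgreSQL, which is an assumption made only so the example runs without a server: `strftime` replaces `extract`, and window functions need SQLite 3.25 or newer. The table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, date TEXT)")

# Hypothetical data: 3 events in January, 2 in February, 1 in March 2021.
conn.executemany(
    "INSERT INTO events (date) VALUES (?)",
    [
        ("2021-01-03",), ("2021-01-15",), ("2021-01-30",),
        ("2021-02-01",), ("2021-02-20",),
        ("2021-03-05",),
    ],
)

# Inner query: events per month. Outer query: running total over the months.
query = """
SELECT year, month,
       SUM(events_count) OVER (ORDER BY year, month) AS total_so_far
FROM (
    SELECT CAST(strftime('%Y', date) AS INTEGER) AS year,
           CAST(strftime('%m', date) AS INTEGER) AS month,
           COUNT(id) AS events_count
    FROM events
    WHERE date IS NOT NULL
    GROUP BY year, month
) AS monthly_counts
ORDER BY year, month
"""

running_totals = conn.execute(query).fetchall()
for year, month, total_so_far in running_totals:
    print(year, month, total_so_far)
```

The window function runs over the already grouped rows, which is why it does not trigger the GROUP BY error from the question: the per-month counts exist before the running sum is taken.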

Postgres find where dates are NOT overlapping between two tables

I have two tables and I am trying to find data gaps in them where the dates do not overlap.
Item Table:
id unique start_date end_date data
1 a 2019-01-01 2019-01-31 X
2 a 2019-02-01 2019-02-28 Y
3 b 2019-01-01 2019-06-30 Y
Plan Table:
id item_unique start_date end_date
1 a 2019-01-01 2019-01-10
2 a 2019-01-15 'infinity'
I am trying to find a way to produce the following
Missing:
item_unique from to
a 2019-01-11 2019-01-14
b 2019-01-01 2019-06-30
step-by-step demo:db<>fiddle
WITH excepts AS (
SELECT
item,
generate_series(start_date, end_date, interval '1 day') gs
FROM items
EXCEPT
SELECT
item,
generate_series(start_date, CASE WHEN end_date = 'infinity' THEN ( SELECT MAX(end_date) as max_date FROM items) ELSE end_date END, interval '1 day')
FROM plan
)
SELECT
item,
MIN(gs::date) AS start_date,
MAX(gs::date) AS end_date
FROM (
SELECT
*,
SUM(same_day) OVER (PARTITION BY item ORDER BY gs)
FROM (
SELECT
item,
gs,
COALESCE((gs - LAG(gs) OVER (PARTITION BY item ORDER BY gs) >= interval '2 days')::int, 0) as same_day
FROM excepts
) s
) s
GROUP BY item, sum
ORDER BY 1,2
Finding the missing days is quite simple. This is done within the WITH clause:
Generate all days of each date range in the first table and subtract the expanded list of days from the second table. All dates that do not occur in the second table are kept. The infinity end is a little bit tricky, so I replaced each infinity occurrence with the max end date of the first table. This avoids expanding an infinite list of dates.
The more interesting part is to reaggregate this list again, which is the part outside the WITH clause:
The lag() window function fetches the previous date in the list. If the current date does not directly follow the previous one, the comparison yields 1, otherwise 0. (A time-change issue occurred here, which is why I check for a difference of at least 2 days instead of a one-day difference: because of daylight saving time, there are only 23 hours between 2019-03-31 and 2019-04-01.)
These 0 and 1 values are aggregated cumulatively. Each gap greater than one day starts a new interval (the days in between are covered by the plan).
This results in a groupable column which can be used to aggregate and find the max and min date of each interval.
I tried an approach with date ranges, which seems to be a better way, especially because it avoids expanding long date lists, but I didn't come up with a proper solution. Maybe someone else can?
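The two phases described above (subtract the covered days, then regroup consecutive days into intervals) can be sketched in Python, assuming the sample tables from the question. `None` stands in for 'infinity' and, as in the query, is capped at the latest item end date. The function names are hypothetical. Working with plain dates instead of timestamps also sidesteps the daylight saving issue.

```python
from datetime import date, timedelta

# Sample data from the question: (item key, start_date, end_date).
items = [
    ("a", date(2019, 1, 1), date(2019, 1, 31)),
    ("a", date(2019, 2, 1), date(2019, 2, 28)),
    ("b", date(2019, 1, 1), date(2019, 6, 30)),
]
# None stands in for 'infinity'.
plans = [
    ("a", date(2019, 1, 1), date(2019, 1, 10)),
    ("a", date(2019, 1, 15), None),
]


def days(start: date, end: date) -> set:
    """Expand an inclusive date range into a set of days (like generate_series)."""
    return {start + timedelta(days=n) for n in range((end - start).days + 1)}


def missing_ranges(items, plans):
    """Return (key, first_day, last_day) for each run of item days without a plan."""
    max_end = max(end for _, _, end in items)
    needed, covered = {}, {}
    for key, start, end in items:
        needed.setdefault(key, set()).update(days(start, end))
    for key, start, end in plans:
        covered.setdefault(key, set()).update(days(start, end or max_end))

    result = []
    for key in sorted(needed):
        for day in sorted(needed[key] - covered.get(key, set())):
            last = result[-1] if result else None
            if last and last[0] == key and day - last[2] == timedelta(days=1):
                last[2] = day  # consecutive day: extend the current interval
            else:
                result.append([key, day, day])  # gap: start a new interval
    return [tuple(r) for r in result]


gaps = missing_ranges(items, plans)
for key, first_day, last_day in gaps:
    print(key, first_day, last_day)
```

Like the SQL version, this expands every range day by day, so it is only practical for moderately long ranges.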

Count and records from yesterday and add datecolumn next to it with yesterday's date in Bigquery, standardSQL

I've been able to get a SQL query running that grabs the count of all records from the day before.
SELECT count(*)
FROM mytable
WHERE date(ingest_time) >= (DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND
date(ingest_time) < (CURRENT_DATE());
Adding to the SQL above in BigQuery, how do I generate a date column next to the count that shows yesterday's date, to indicate that these records are from yesterday?
Something like this:
1) 3000390 | 2019-11-13
Instead of SELECT count(*) use SELECT count(*), DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Getting maximum sequential streak with events

I’m having trouble getting my head around this.
I’m looking for a single query, if possible, running PostgreSQL 9.6.6 under pgAdmin3 v1.22.1
I have a table with a date and a row for each event on the date:
Date Events
2018-12-10 1
2018-12-10 1
2018-12-10 0
2018-12-09 1
2018-12-08 0
2018-12-07 1
2018-12-06 1
2018-12-06 1
2018-12-06 1
2018-12-05 1
2018-12-04 1
2018-12-03 0
I’m looking for the longest sequence of dates without a break. In this case, 2018-12-08 and 2018-12-03 are the only dates with no events. There are two dates with events between 2018-12-08 and today, and four between 2018-12-08 and 2018-12-03, so I would like the answer of 4.
I know I can group them together with something like:
Select Date, count(Date) from Table group by Date order by Date Desc
To get just the most recent sequence, I’ve got something like this. The subquery returns the most recent date with no events, and the outer query counts the dates after that date:
select count(distinct date) from Table
where date>
( select date from Table
group by date
having count (case when Events is not null then 1 else null end) = 0
order by date desc
fetch first row only)
But now I need the longest streak, not just the most recent streak.
Thank you!
Your instinct is a good one in looking at the rows with zero events and working off them. We can use a subquery with a window function to get the "gaps" between zero event days, and then in a query outside it take the record we want, like so:
select *
from (
select date as day_after_streak
, lag(date) over(order by date asc) as previous_zero_date
, date - lag(date) over(order by date asc) as difference
, date_part('days', date - lag(date) over(order by date asc) ) - 1 as streak_in_days
from dates
group by date
having sum(events) = 0 ) t
where t.streak_in_days is not null
order by t.streak_in_days desc
limit 1
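As a cross-check of the expected answer, here is a minimal Python sketch, assuming the sample table from the question. The function name `longest_streak` is hypothetical. It sums the events per date and walks the dates in order, counting consecutive days that have at least one event. As far as I can tell, the query above only measures gaps between two zero-event dates, so a streak that runs up to the most recent date is not considered, while this sketch handles that case and also treats a missing date as a break.

```python
from datetime import date, timedelta

# Sample data from the question: one row per event record, (date, events).
rows = [
    (date(2018, 12, 10), 1), (date(2018, 12, 10), 1), (date(2018, 12, 10), 0),
    (date(2018, 12, 9), 1),
    (date(2018, 12, 8), 0),
    (date(2018, 12, 7), 1),
    (date(2018, 12, 6), 1), (date(2018, 12, 6), 1), (date(2018, 12, 6), 1),
    (date(2018, 12, 5), 1),
    (date(2018, 12, 4), 1),
    (date(2018, 12, 3), 0),
]


def longest_streak(rows) -> int:
    """Return the longest run of consecutive calendar days with at least one event."""
    totals = {}
    for day, events in rows:
        totals[day] = totals.get(day, 0) + events

    best = current = 0
    previous_event_day = None
    for day in sorted(totals):
        if totals[day] <= 0:
            current = 0  # a zero-event day breaks the streak
            previous_event_day = None
            continue
        if previous_event_day is not None and day - previous_event_day == timedelta(days=1):
            current += 1  # the streak continues
        else:
            current = 1  # first event day after a break or a missing date
        previous_event_day = day
        best = max(best, current)
    return best


print(longest_streak(rows))
```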