I've actually already solved the problem, but I'm trying to understand why the problem occurred because as far as I can see it has no reason to happen.
I have a rather large query that I run to prepare a table with some often used combinations. Generally, it only contains 2 years of data. Occasionally I will reconstruct it. While doing this I tweaked the query to add more information, but suddenly the result no longer matched up to the old query. Comparing the old to the new I noticed several missing orders. Amazingly, even after removing the tweaked parts the results still didn't match up.
I ultimately tracked the problem down to my WHERE clause, which was different from how I did it last time.
The type of the orderdate column I go over has type (datetime, null)
One of the orders that was omitted had this as date:
2018-12-23 20:58:52.383
An order that was included had this as date:
2019-01-28 15:20:49.107
It looks exactly the same to me.
The entire query is the same, except for the WHERE clause. My original where was:
WHERE DATEPART(yyyy,tbOrder.[OrderDate]) >= DATEPART(yyyy,GETDATE()-2)
My new where is now:
WHERE tborder.[OrderDate] >= DATEADD(yy, DATEDIFF(yy, 0, GETDATE())-2, 0)
Any help in understanding why the original where clause drops some lines would be greatly appreciated.
Because you are doing two different things. First predicate,
WHERE DATEPART(yyyy,tbOrder.[OrderDate]) >= DATEPART(yyyy,GETDATE()-2)
Take all order dates that are bigger than the year for the day before yesterday or two days before. Notice that, -2 is inside the brackets.
Second predicate,
WHERE tborder.[OrderDate] >= DATEADD( yy, DATEDIFF( yy, 0, GETDATE() ) - 2, 0)
Take all order dates bigger than two years before, i.e., datediff(yy,startdate,enddate) will return the year result of the difference between today and the initial value for date datatype, which is 1900-01-01. Then, add this, -2, to 1900-01-01. The second expression is the form of:
1900 + ( 201X - 1898 )
I simplified 1900 - 2 = 1898.
The two expressions return completely different things, so it shouldn't be a surprise the results are different. The first one returns the current year as a number (or the year of the day before yesterday to be precise). The second one returns January 1st two years ago.
You can put both expressions in a SELECT query to see what they return :
select DATEPART(yyyy,GETDATE()-2), DATEADD(yy, DATEDIFF(yy, 0, GETDATE()) - 2, 0)
The result is :
2019 2017-01-01 00:00:00.000
Both expressions are more complex than they need to be. The first condition will also harm performance because DATEPART(yyyy,tbOrder.[OrderDate]) prevents the server from using any indexes that cover OrderDate.
The question doesn't explain what you actually want to return. If you wanted to return all rows in the current year you can use :
Where
OrderDate >=DATEFROMPARTS( YEAR(GETDATE()) ,1,1) and
OrderDate < DATEFROMPARTS( YEAR(GETDATE()) + 1,1,1)
The same can be used to find rows two years in the past :
Where
OrderDate >= DATEFROMPARTS( YEAR(GETDATE()) -2 ,1,1) and
OrderDate < DATEFROMPARTS(YEAR(GETDATE()) - 1,1,1)
All rows since January 1st two years ago :
Where OrderDate >= DATEFROMPARTS( YEAR(GETDATE()) -2 ,1,1)
All those queries can take advantage of indexes that cover OrderDate.
Date range queries become a lot easier if you use a Calendar table. A Calendar table is a table that contains eg 50 or 100 years' worth of dates with extra columns for month, month day, week number, day of week, quarter, semester, month and day names, holidays, business reprorint periods, formatted short, long dates etc.
This makes yearly, monthly or weekly queries as easy as joining with the Calendar table and filtering based on the month or period you want.
In this case, retrieving rows two yeas in the past would look like :
From Orders inner Join Calendar on OrderDate=Calendar.Date
Where Calendar.Year=YEAR(GETDATE())-2
That may not looks so impressive but what about Q2 two years ago?
From Orders inner Join Calendar on OrderDate=Calendar.Date
Where Calendar.Year=YEAR(GETDATE())-2 and Quarter=2
Two years ago, same quarter
From Orders inner Join Calendar on OrderDate=Calendar.Date
Where Calendar.Year=YEAR(GETDATE())-2 and Quarter=DATEPART(q,GETDATE())
Retrieving totals for the current quarter for the last two years :
SELECT Year,Quarter,SUM(Total) QuarterTotal
From Orders inner Join Calendar on OrderDate=Calendar.Date
Where Calendar.Year > YEAR(GETDATE())-2 and Quarter=DATEPART(q,GETDATE())
GROUP BY Calendar.Year
Related
Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers" all who have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, every day the number of values to store grows exponentially;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
I'm used to do the following syntax when analysing weekly data:
select week(creation_date)::date as week,
count(*) as n
from table_1
where creation_date > current_date - 30
group by 1
However, by doing this I will get just part of the first week.
Is there any smart way to alway get a whole week in the beginning?
Like get the first day of the week I would get half of.
First off you need to define what you mean by "week". This is more difficult than it appears. While humans have an intuitive since of a week, computers are just not that smart. There are 2 common conventions: the ISO-8601 Standard and, for lack of a better term, Traditional. ISO-8601 defines a week as always beginning on Monday and always containing 7 days. Traditional weeks begin on Sunday (usually) but may have weeks with less than 7 days. This results from having the 1st week of the year beginning on 1-Jan regardless of day of week. Thus the 1st and/or last weeks may have less than 7 days. ISO-8601 throws it own curve into the mix: the 1st week of the year begins on the week containing 4-Jan. Thus the last days of Dec may be in week 1 of the next year and the first days Jan may be in week 52/53 of the prior year.
All the below assume the ISO-8061.
Secondly there is no week function in Postgres. In you need extract function. So for this particular case:
select extract(week from creation_date)::integer as week, ...
Finally, your predicate (current_date - 30) ensures you will unusually not begin on the 1st of the week. To get the correct date take that result back 1 week, then go forward to the next Monday.
with days_to_monday (day_adj) as
( values ('{7,6,5,4,3,2,1}'::int[]) )
select current_date - 30
, current_date - 30 - 7 + day_adj[extract (isodow from current_date - 30 )]
from table_1 cross join days_to_monday;
The CTE establishes an array which for a given day of the week contains the number of days need to the next Monday. That main query extracts the day of week of current date and uses that to index the array. The corresponding value is added to get the proper date.
Putting that together with your original query to arrive at:
with next_week (monday) as
( values (current_date - 30 - 7
+ ('{7,6,5,4,3,2,1}'::int[])[extract (isodow from current_date - 30 )])
)
select extract(week from creation_date) as week,
count(*) as n
from table_1
where creation_date >= (select monday from next_week)
group by 1
order by 1;
For full example see fiddle.
I have a date column which I am trying to query to return only the largest date per month.
What I currently have, albeit very simple, returns 99% of what I am looking for. For example, If I list the column in ascending order the first entry is 2016-10-17 and ranges up to 2017-10-06.
A point to note is that the last day of every month may not be present in the data, so I'm really just looking to pull back whatever is the "largest" date present for any existing month.
The query I'm running at the moment looks like
SELECT MAX(date_col)
FROM schema_name.table_name
WHERE <condition1>
AND <condition2>
GROUP BY EXTRACT (MONTH FROM date_col)
ORDER BY max;
This does actually return most of what I'm looking for - what I'm actually getting back is
"2016-11-30"
"2016-12-30"
"2017-01-31"
"2017-02-28"
"2017-03-31"
"2017-04-28"
"2017-05-31"
"2017-06-30"
"2017-07-31"
"2017-08-31"
"2017-09-29"
"2017-10-06"
which are indeed the maximal values present for every month in the column. However, the result set doesn't seem to include the maximum date value from October 2016 (The first months worth of data in the column). There are multiple values in the column for that month, ranging up to 2016-10-31.
If anyone could point out why the max value for this month isn't being returned, I'd much appreciate it.
You are grouping by month (1 to 12) rather than by month and year. Since 2017-10-06 is greater than any day in October 2016, that's what you get for the "October" group.
You should
GROUP BY date_trunc('month', date_col)
I am using SQL Server 2008 R2. Here is the query I have that returns monthly sales totals by zip code, per store.
select
left(a.Zip, 5) as ZipCode,
s.Store,
datename(month,s.MovementDate) as TheMonth,
datepart(year,s.MovementDate) as TheYear,
datepart(mm,s.MovementDate) as MonthNum,
sum(s.Dollars) as Sales,
count(*) as [TxnCount],
count(distinct s.AccountNumber) as NumOfAccounts
from
dbo.DailySales s
inner join
dbo.Accounts a on a.AccountNumber = s.AccountNumber
where
s.SaleType = 3
and s.MovementDate > '1/1/2016'
and isnull(a.Zip, '') <> ''
group by
left(a.Zip, 5),
s.Store,
datename(month, s.MovementDate),
datepart(year, s.MovementDate),
datepart(mm, s.MovementDate)
Now I'd like to add columns that compare sales, TxnCount, and NumOfAccounts to the same month the previous year for each zip code and store. I also would like each zip code/store combo to have a record for every month in the range; so zeros if null.
I do have a calendar table that I tried to use to get all months, but I ran into problems because of my "where" statements.
I know that both of these issues (comparing to previous year and including all dates in a date range) have been asked and answered before, and I've gotten them to work before myself, but this particular one has me running in circles. Any help would be appreciated.
I hope this is clear enough.
Thanks,
Tim
Treat the Query you have above as a data source. Run it as a CTE for the period you want to report, plus the period - 12 months (to get the historic data). (SalesPerMonth)
Then do a query that gets all the months you need from your calendar table as another CTE. This is the reporting months, not the previous year. (MonthsToReport)
Get a list of every valid zip code / Store combo - probably a select distinct from the SalesPerMonth CTE this would give you only combos that have at least one sale in the period (or historical period - you probably also want ones that sold last year, but not this year). Another CTE - StoreZip
Finally, your main query cross joins the StoreZip results with the MonthsToReport - this gives you the one row per StoreZip/Month combos you are looking for. Left join twice to the SalesPerMonth data, once for the month, once for the 1 year previous data. Use ISNULL to change any null records (no data) to zero.
Instead of CTEs, you could also do it as separate queries, storing the results in Temp tables instead. This may work better for large amounts of data.
I've been working on this problem for a while now. Basically I have a simple set of data with UserId, and TimeStamp. I want to know how many distinct UserId's appear each week, the catch is my week is measured in Sunday-Saturday, NOT Monday - Sunday, which is what Weekofyear() uses.
Right now I'm hardcoding each week and running the query:
SELECT
count(distinct UserId)
FROM data.table
where from_unixtime((CAST(timestamp as BIGINT)))
between TO_DATE("2016-06-05") AND TO_DATE("2016-06-12")
I'm trying to find a way to shift the timestamp back a day to trick weekofyear into thinking my Sunday is actually a Monday, but have not been successful. My latest futile attempt looked like:
SELECT
count(distinct UserId), weekofyear(date_sub(from_unixtime(CAST(timestamp as BIGINT)),1))
FROM table.data
where from_unixtime((CAST(timestamp as BIGINT)))
between TO_DATE("2016-06-01") AND TO_DATE("2016-06-30")
group by weekofyear(date_sub(from_unixtime(CAST(timestamp as BIGINT)),1))
This results in the same numbers as if I didn't subtract a day. I not sure why this isn't working. I feel like there should be a way to manage this. Right now if I wanted to pull all the data by week WHERE X is true, I'd have to manually do each week, that won't be sustainable. Any suggestions on how to work smarter?
Thank you.
Simple Solution
You can simply create your own formula instead of going with pre-defined function for "week of the year"
Advantage: you will be able to take any set of 7 days for a week.
In your case since you want the week should start from Sunday-Saturday we will just need the first date of sunday in a year
eg- In 2016, First Sunday is on '2016-01-03' which is 3rd of Jan'16
--assumption considering the timestamp column in the format 'yyyy-mm-dd'
SELECT
count(distinct UserId), lower(datediff(timestamp,'2016-01-03') / 7) + 1 as week_of_the_year
FROM table.data
where timestamp>='2016-01-03'
group by lower(datediff(timestamp,'2016-01-03') / 7) + 1;