Average day of month range for bill date - tsql

Our billing system has a table of locations. These locations have a bill generated every month. The bill date can vary by a few days each month (e.g. billed on the 6th one month, then the 8th the next), but it always stays around the same time of the month. I need to come up with a "blackout" range that is essentially any day of the month on which a location has been billed over the last X months. It needs to handle locations whose billing day bounces between the end of one month and the beginning of the next.
For this, the only relevant tables are a Location table and a Bill table. The Location table has a list of locations, with LocationID being the PK; the other fields are irrelevant for this, I believe. The Bill table has a document number as the PK and LocationID as an FK, along with many other fields such as doc amount, due date, etc. It also has a Billing Date, which is the date I'd like to calculate my 'blackout' dates from.
Basically we're going to be changing every meter, and don't want to change them on a day where they might be billed.
Example Data:
Locations:
111111, Field1, Field2, etc.
222222, Field1, Field2, etc.
Bills (DocNum, LocationID, BillingDate):
BILL0001, 111111, 1/6/2018
BILL0002, 111111, 2/8/2018
BILL0003, 111111, 3/5/2018
BILL0004, 111111, 4/6/2018
BILL0005, 111111, 5/11/2018
BILL0006, 111111, 6/10/2018
BILL0007, 111111, 7/9/2018
BILL0008, 222222, 1/30/2018
BILL0009, 222222, 3/1/2018
BILL0010, 222222, 4/2/2018
BILL0011, 222222, 5/3/2018
BILL0012, 222222, 6/1/2018
BILL0013, 222222, 7/1/2018
BILL0014, 222222, 7/28/2018
Example output:
Location: 111111 BlackOut: 6th - 11th
Location: 222222 BlackOut: 28th - 3rd
How can I go about doing this?

As I understand it you want to work out the "average" day when the billing occurs and then construct a date range based around that?
The most accurate way I can think of doing this is to calculate a distance function and try to find the day of the month which minimises this distance function.
An example function would be:
distance() = Min((Day(BillingDate) - TrialDayNumber) % 30, (TrialDayNumber - Day(BillingDate)) % 30)
This will produce a number between 0 and 15 telling you how far away a billing date is from an arbitrary day of the month TrialDayNumber.
You want to try to select a day which, on average, has the lowest possible distance value as this will be the day of the month which is closest to your billing days. This kind of optimisation function is going to be much easier to calculate if you can perform the calculation outside of SQL (e.g. if you export the data to Excel or SSRS).
I believe you can calculate it purely using SQL but it is pretty tricky. I got as far as generating a table with a distance result for every DayOfTheMonth / Location ID combination but got stuck when I tried to narrow this down to the best distance per location.
WITH DaysOfTheMonth AS (
    SELECT 1 AS DayNumber
    UNION ALL
    SELECT DayNumber + 1 AS DayNumber
    FROM DaysOfTheMonth
    WHERE DayNumber < 31
)
SELECT Bills.LocationID, DaysOfTheMonth.DayNumber,
       AVG((DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber) % 30) AS Distance
FROM Bills
CROSS JOIN DaysOfTheMonth
GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
A big downside of this approach is that it is computationally heavy.
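To get past the point where this stalls, one option is to compute the wrap-around distance described earlier (the +30 terms keep T-SQL's % from going negative) and then rank the candidate days per location, keeping the one with the smallest average distance. A rough, untested sketch assuming the same Bills table as above:
WITH DaysOfTheMonth AS (
    SELECT 1 AS DayNumber
    UNION ALL
    SELECT DayNumber + 1 FROM DaysOfTheMonth WHERE DayNumber < 31
),
LocationDistances AS (
    SELECT Bills.LocationID,
           DaysOfTheMonth.DayNumber,
           -- average wrap-around distance between the billing day and this candidate day
           AVG(CASE WHEN (DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber + 30) % 30
                       < (DaysOfTheMonth.DayNumber - DAY(Bills.BillingDate) + 30) % 30
                    THEN (DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber + 30) % 30
                    ELSE (DaysOfTheMonth.DayNumber - DAY(Bills.BillingDate) + 30) % 30
               END * 1.0) AS Distance
    FROM Bills
    CROSS JOIN DaysOfTheMonth
    GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
)
SELECT LocationID, DayNumber AS BestDay, Distance
FROM (
    SELECT LocationID, DayNumber, Distance,
           ROW_NUMBER() OVER (PARTITION BY LocationID ORDER BY Distance) AS rn
    FROM LocationDistances
) ranked
WHERE rn = 1;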

Return at most 100 equally spaced rows from query

I have a table with time series data that looks like this:
id                                   | pressure | date                | location
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e | 1.54     | 1999-01-08 04:05:06 | "London"
c47c89f1-bdf8-420c-9237-32d70f9119f6 | 1.67     | 1999-01-08 04:05:06 | "Paris"
54bbd56b-3269-4417-8422-3f1b7285e165 | 1.77     | 1999-01-08 04:05:06 | "Berlin"
...                                  | ...      | ...                 | ...
Now I would like to create a query for a given location between 2 dates that returns at most 100 results. If there are 100 or fewer, simply return all the rows. If there are more than 100, I'd like them equally spaced by date.
(i.e. if I select from 1 January 1970 to 1 January 2020 there are 18,993 days, so I'd like it to return every 189th day so that it gives back exactly 100 records, cutting off the remaining 93 days.)
So it returns rows like this:
id                                   | pressure | date                | location
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e | 1.54     | 1970-01-01 04:05:06 | London
8dc7c77b-6958-4cc7-914a-4b9c1f661200 | 1.1      | 1970-07-09 04:05:06 | London
4e3d4c3b-a7e3-48bf-b6a3-5327cc606c82 | 1.23     | 1971-01-14 04:05:06 | London
...                                  | ...      | ...                 | ...
Here's how far I got:
SELECT
    pressure,
    date
FROM
    location
WHERE
    location = $1
    AND date > $2 AND date < $3
ORDER BY
    date
The problem is, I see no way of achieving this without first loading all the rows, counting them and then sampling them out, which would be bad for performance (it would mean loading 18,993 rows to return 100). So is there a way to efficiently load just those hundred rows as you go?
I don't have a complete answer, but some ideas to start one:
SELECT MAX(date), MIN(date) FROM location gets you the first and last dates
something like SELECT MIN(date)::timestamp as beginning, (MAX(date)::timestamp - MIN(date)::timestamp)/COUNT(*) as avgdistance FROM location (no guarantee on syntax, can't check on a live DB right now) would get you the start point and the average distance between points.
The method to get every nth row is SELECT * FROM (SELECT *, row_number() over(ORDER BY date) rn FROM tab) foo WHERE foo.rn % {n} = 0; (with {n} replaced by your desired step; the ORDER BY inside over() keeps the numbering deterministic).
You can probably replace that WHERE with something else than a row_number() check and that will get you somewhere.
Feel free to delete this answer if a real answer comes along, until then maybe this'll get you started.
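Building on those ideas, here is a rough sketch using the table and parameters from the question (the table and the filter column are both called location there): number the matching rows by date, count them in the same pass, and keep only every step-th row so that at most roughly 100 come back. It still has to scan the matching date range, but only ~100 rows leave the database:
SELECT id, pressure, date, location
FROM (
    SELECT *,
           row_number() OVER (ORDER BY date) AS rn,    -- position within the date range
           count(*)     OVER ()              AS total  -- total matching rows
    FROM location
    WHERE location = $1
      AND date > $2 AND date < $3
) numbered
WHERE (rn - 1) % GREATEST(1, CEIL(total / 100.0)::int) = 0
ORDER BY date;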

I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", all of whom have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store grows with every new day (each new day adds roughly another 1,500 date ranges);
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
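A rough sketch of that idea against the tables from the question (daily_factors_cum is a helper table name introduced here, not something that already exists): precompute the running total and a day counter once, and each customer's average then becomes two lookups and a subtraction instead of a range scan. The running sums do need to be rebuilt or extended whenever new daily factors arrive.
-- one-off precomputation of running sums over the daily factors
CREATE TABLE daily_factors_cum AS
SELECT factor_date,
       SUM(factor_value) OVER (ORDER BY factor_date) AS running_total,
       ROW_NUMBER()      OVER (ORDER BY factor_date) AS day_number
FROM daily_factors;

CREATE INDEX ON daily_factors_cum (factor_date);

-- per-customer average: running total at date_to minus running total just before date_from,
-- divided by the number of days in the range
SELECT cd.customer_id,
       (hi.running_total - COALESCE(lo.running_total, 0))
         / (hi.day_number - COALESCE(lo.day_number, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factors_cum hi ON hi.factor_date = cd.date_to
LEFT JOIN daily_factors_cum lo ON lo.factor_date = cd.date_from - 1;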

First date with sales greater than 100 in TABLEAU

I have a very Basic flat file with Sales by date and product names. I need to create a field for First sales day where sales are greater than 100 units.
I tried {FIXED [Style Code]: MIN([Prod Cal Activity Date])} but that just gives me the first day in the data where the Style Code exists.
I also tried IF ([Net Sales Units]>200) THEN {FIXED [Style Code]: MIN([Prod Cal Activity Date])} END but that also gives me the first day in the data where the Style Code exists.
DATA EXISTS PRIOR TO SALES DATE
You can use the following calculation:
MIN(IF([Net Sales Units]>100) THEN [Prod Cal Activity Date] ELSE #2100-01-01# END)
The IF([Net Sales Units]>100) THEN [Prod Cal Activity Date] ELSE #2100-01-01# END part of the calculation converts the date into a very high value (year 2100 in the example) for all the cases where the sales were 100 units or fewer. Once this is done, you can simply take a minimum of the calculated date to get the desired result. If you need this by style code, then you can add a FIXED function at the beginning.
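For reference, the per-style-code version would look roughly like this (a sketch that just wraps the calculation above in the question's FIXED expression):
{FIXED [Style Code]: MIN(IF ([Net Sales Units] > 100) THEN [Prod Cal Activity Date] ELSE #2100-01-01# END)}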
A few ways to simplify further if you like. They don't change the meaning.
You don't need parenthesis around boolean expressions as you would in C.
You can eliminate the ELSE clause altogether. The IF expression will default to null in cases where the condition is false. Aggregation functions like MIN(), MAX(), SUM() etc. silently ignore nulls, so you don't have to come up with some default future date.
So MIN(IF [Net Sales Units] > 100 THEN [Prod Cal Activity Date] END) is exactly equivalent, just a few fewer characters to read.
The next possible twist has a bit of analytic value beyond just saving keystrokes.
You don't need to hard code the choice of aggregation function into the calculation. You could instead name your calculated field something like High Sales Activity Date defined as just
if [Net Sales Units] > 100 then [Prod Cal Activity Date] end
This field just holds the date for records with high sales, and is null for records with low sales. But by leaving the aggregation function out of the calculation, you have more flexibility to use it in different ways. For example, you could
Calculate the earliest (i.e. Min) high sales date as requested originally
Calculate the latest high sales date using Max
Filter to only dates with high sales by filtering special non-null values
Calculate the number of high sales dates using COUNTD
Simple little filtering calculations like this can be very useful - so called because of the embedded if statement effectively filters out values that don't match the condition. There are still null values for the other records, but since aggregation functions ignore nulls, you can think of them as effectively filtered out by the calculation.

Calculated Field to Count While Between Dates

I am creating a Tableau visualization for floor stock in our plant. We have a column for incoming date, quantity, and outgoing date. I am trying to create a visualization that sums the quantity but only while between the 2 columns.
So for example, if we have 9 parts in stock that arrived on 9/1 and is scheduled to ship out on 9/14, I would like this visualization to include these 9 parts in the sum only while it is in our stock between those 2 dates. Here is an example of some of the data I am working with.
Incoming Date | Quantity | Outgoing Date
4/20/2018     | 006      | 5/30/2018
4/20/2018     | 017      | 5/30/2018
4/20/2018     | 008      | 5/30/2018
6/29/2018     | 161      | 9/7/2018
Create a new calculation:
if [ArrivalDate] >= #2018-09-01# and [ArrivalDate] < #2018-09-15#
and [Shipdate] < #2018-09-15#
then [MEASUREofStock] else 0 end
Here is a solution using UNIONs written before Tableau added support for Unions (so it required custom SQL)
Volume of an Incident Queue at a Point in Time
For several years now, Tableau has supported Union directly, so now it is possible to get the same effect without writing custom SQL, but the concept is the same.
The main thing to understand is that you need a data row per event (per arrival or per departure) and a single date column, not two. That will let you calculate the net change in quantity per day, and you can then use a running total if you want to see the absolute quantity at the close of each day.
There is no simple way to display the total quantity between the two dates without changing the input table structure. If you want to show all dates and the "eligible" quantity on each day, you should:
Create a calendar table that has all dates from 1990-01-01 to 2029-12-31. (You can limit the dates displayed in the dashboard later by applying a date filter, but here you want to be safe and include all dates that may exist in your stock table.) A quick way to build such a date table is sketched at the end of this answer.
Left join the date table to the stock table and calculate the eligible quantity on each day:
SELECT
a.date,
SUM(CASE WHEN b.quantity IS NULL THEN 0 ELSE b.quantity END) AS quantity
FROM date a
LEFT JOIN
stock b on a.date BETWEEN b.Incoming_Date AND b.Outgoing_Date
GROUP BY a.date
Import the output table to Tableau, and simply add dates and quantity to the chart.
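As for step 1, if the database doesn't already have a calendar table, one quick way to build it is shown below; this sketch uses PostgreSQL's generate_series, and other platforms have equivalents (a recursive CTE or a numbers table):
-- builds a one-column table named date with one row per calendar day
CREATE TABLE date AS
SELECT gs::date AS date
FROM generate_series(DATE '1990-01-01', DATE '2029-12-31', INTERVAL '1 day') AS gs;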

Count Until A Specific Value?

Say you've got a table ordered by the date that captures the speed of vehicles with a device in them. And imagine you get 30 updates per day for the speed. It's not always 30 per vehicle. The data will have the vehicle, the timestamp, and the speed.
What I want to do is be able to count how many days have passed since the vehicle last went over 10 mph in order to find inactive vehicles. Is something like that possible in postgresql?
Or is there a way to get back the row number, with the table sorted by date, of the last row where the speed goes past 10, and then select the date at that row number and subtract it from the current date?
SELECT DISTINCT ON (vessel) vessel, now() - date
FROM your_table
WHERE speed > 10
ORDER BY vessel, date DESC
This will tell you, for every vehicle, how long ago its speed field was last over 10.
Or equivalently, with a plain GROUP BY:
SELECT vessel, now() - max(date)
FROM your_table
WHERE speed > 10
GROUP BY vessel;
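If you want the result specifically as a whole number of days rather than an interval, one option (assuming date is a timestamp column) is to subtract dates, since subtracting one date from another in PostgreSQL yields an integer day count:
SELECT vessel, current_date - max(date)::date AS days_since_over_10
FROM your_table
WHERE speed > 10
GROUP BY vessel;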