I have a db table with a bunch of records in a snapshot fashion, i.e. daily captures of product unit availability for many years:
product     units   category   expire_date   report_date
---------------------------------------------------------
pineapple   10      common     12/25/2021    12/01/2021
pineapple   8       common     12/25/2021    12/02/2021
pineapple   8       deluxe     12/28/2021    12/02/2021
grapes      45      deluxe     11/30/2022    12/01/2021
...
pineapple   21      common     12/12/2022    12/01/2022
...
What I'm trying to get from that data is something like this "lagged" version, partitioning by product and category:
product     units   category   report_date   prev_year_units_atreportdate
---------------------------------------------------------------------------
pineapple   10      common     12/01/2021    NULL
pineapple   21      common     12/01/2022    10
pineapple   16      common     12/01/2023    21
...
It's important to know that from time to time the cron snapshot task fails and no records are stored for some days, which leads to a different number of records per product.
I've been using LAG() to no avail, since partitioning by product, category only gets me the previous day/month.
Can anyone help me with this?
I think I would use a subselect rather than a window function.
select *,
    (
        select units
        from t t2
        where t2.report_date = t1.report_date - interval '1 year'
          and t2.product = t1.product
          and t2.category = t1.category
    ) as lagged_units
from t as t1
I'm not sure what you want to happen on leap year, though, or the year after one.
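If the missing snapshot days are the real problem (the row exactly one year back may simply not exist), a LATERAL variant of the same subselect could fall back to the nearest earlier snapshot. This is only a sketch against the table t used above; the 7-day tolerance window is my own assumption, not something from the question:

select
    t1.*,
    py.units as prev_year_units_atreportdate
from t as t1
left join lateral (
    select prev.units
    from t prev
    where prev.product  = t1.product
      and prev.category = t1.category
      -- latest snapshot at or before the same date one year earlier;
      -- the 7-day tolerance window is an arbitrary assumption
      and prev.report_date <= t1.report_date - interval '1 year'
      and prev.report_date >  t1.report_date - interval '1 year' - interval '7 days'
    order by prev.report_date desc
    limit 1
) py on true;

Because it's a LEFT JOIN ... ON TRUE, rows with no snapshot in the window are kept and simply get a NULL, which is what produces the NULL in the first year of your expected output.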
Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", each of whom has a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28 seconds with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero-length date ranges and that my maths is right). Also my real-world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. And the number of values to store grows quadratically with each new day;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
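Just to illustrate the idea against the example tables above (the totals table name and the off-by-one handling at the start of a range are my assumptions, not a tested implementation):

-- Running totals, computed once (table name is assumed for illustration)
CREATE TABLE daily_factor_totals AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    COUNT(*)          OVER (ORDER BY factor_date) AS running_count
FROM daily_factors;

-- The average over a range is the difference of two running sums
-- divided by the difference of the two counts
SELECT
    cd.customer_id,
    (hi.running_sum   - COALESCE(lo.running_sum, 0))
  / (hi.running_count - COALESCE(lo.running_count, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factor_totals hi ON hi.factor_date = cd.date_to
LEFT JOIN daily_factor_totals lo ON lo.factor_date = cd.date_from - 1;

With an index on daily_factor_totals (factor_date), each customer then costs two lookups instead of a scan over every day in the range.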
I have a table containing courses run by teachers, and I want to get the number of taught days, split by year and by teacher status.
The table contains the following fields:
id   teacher_id   course_name   course_date   course_duration   teacher_status
--------------------------------------------------------------------------------
1    Teacher_01   Course_AA     2012-02-01    2                 volunteer
2    Teacher_02   Course_BB     2012-02-01    7                 employee
3    Teacher_03   Course_BB     2013-02-01    7                 contractor
4    Teacher_01   Course_AA     2014-02-01    2                 paid volunteer
5    Teacher_04   Course_AA     2014-06-01    2                 paid volunteer
Teachers may run a course under various statuses: volunteer, paid volunteer, contractor, employee, etc. The status of a given teacher can change through time. The duration of a course is expressed in days.
I can already gather the sum of taught days, split by status. This is done by:
SELECT
teacher_status,
sum(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
teacher_status
;
But the data is not normalized and different families of statuses have been mixed. So I want to gather the same info (number of taught days) split:
by 3 statuses: volunteer, paid volunteer, and all other statuses,
and by year.
What is expected is:
Year   Teacher_status   Taught_days
-------------------------------------
2012   volunteer        2
2012   employee         7
2013   contractor       7
2014   paid volunteer   4
I've tried various combinations of aggregate functions and GROUP BY / HAVING / ROLLUP clauses, but without success. How should I achieve this?
You'll want to select a complex expression and then GROUP BY that, not just by a raw column value. You could either repeat the expression or, in Postgres, also refer to the column alias:
SELECT
EXTRACT(year FROM course_date) as year,
(CASE teacher_status
WHEN 'volunteer' THEN 'volunteer'
WHEN 'paid volunteer' THEN 'paid'
ELSE 'other'
END) AS status,
SUM(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
year,
status;
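For what it's worth, if you ever need this on a database that does not accept output aliases in GROUP BY, the fallback mentioned above is simply to repeat the expressions (same query, just more verbose):

SELECT
    EXTRACT(year FROM course_date) AS year,
    (CASE teacher_status
        WHEN 'volunteer'      THEN 'volunteer'
        WHEN 'paid volunteer' THEN 'paid'
        ELSE 'other'
     END) AS status,
    SUM(course_duration) AS "Taught days"
FROM
    my_table
GROUP BY
    EXTRACT(year FROM course_date),
    (CASE teacher_status
        WHEN 'volunteer'      THEN 'volunteer'
        WHEN 'paid volunteer' THEN 'paid'
        ELSE 'other'
     END);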
To get your example result, I would use this query:
SELECT extract (year from course_date),
teacher_status,
sum(course_duration) AS "Taught days"
FROM
my_table
GROUP BY
extract (year from course_date),
teacher_status;
I have a table of vehicles at service locations, with columns such as DAY, LICENSE, BOROUGH, etc. I'd like to add a cross table showing the number of vehicles that have been serviced for 3 weeks or more. I'm not sure what custom expression to use.
Sample data: (provided as a screenshot)
I hope your sample data doesn't contain a bunch of legitimate license plates. It's not the most compromising data, but I would recommend blacking them out or replacing them with test data if that isn't already the case.
Anyway, you're looking for the DateDiff() function. For example:
If(DateDiff('day', [Date], Date(DateTimeNow())) >= 21, "21 days or more", "less than 21 days")
Our billing system has a table of locations. These locations have a bill generated every month. These bills can vary by a few days each month (e.g. billed on the 6th one month, then the 8th the next). They always stay around the same time of the month, though. I need to come up with a "blackout" range that is essentially any day of the month on which that location has been billed over the last X months. It needs to appropriately handle locations that may bounce between the end/beginning of months.
For this, the only relevant tables are a Location table and a Bill table. The Location table has a list of locations, with LocationID being the PK; the other fields are irrelevant for this, I believe. The Bill table has a document number as the PK and LocationID as a FK, along with many other fields such as doc amount, due date, etc. It also has a Billing Date, which is the date I'd like to calculate my "blackout" dates from.
Basically we're going to be changing every meter, and don't want to change them on a day where they might be billed.
Example Data:
Locations:
111111, Field1, Field2, etc.
222222, Field1, Field2, etc.
Bills (DocNum, LocationID, BillingDate):
BILL0001, 111111, 1/6/2018
BILL0002, 111111, 2/8/2018
BILL0003, 111111, 3/5/2018
BILL0004, 111111, 4/6/2018
BILL0005, 111111, 5/11/2018
BILL0006, 111111, 6/10/2018
BILL0007, 111111, 7/9/2018
BILL0008, 222222, 1/30/2018
BILL0009, 222222, 3/1/2018
BILL0010, 222222, 4/2/2018
BILL0011, 222222, 5/3/2018
BILL0012, 222222, 6/1/2018
BILL0013, 222222, 7/1/2018
BILL0014, 222222, 7/28/2018
Example output:
Location: 111111 BlackOut: 6th - 11th
Location: 222222 BlackOut: 28th - 3rd
How can I go about doing this?
As I understand it, you want to work out the "average" day when the billing occurs and then construct a date range around that?
The most accurate way I can think of doing this is to calculate a distance function and try to find the day of the month which minimises this distance function.
An example function would be:
distance() = Min((Day(BillingDate) - TrialDayNumber) % 30, (TrialDayNumber - Day(BillingDate)) % 30)
This will produce a number between 0 and 15 telling you how far away a billing date is from an arbitrary day of the month TrialDayNumber.
You want to try to select a day which, on average, has the lowest possible distance value as this will be the day of the month which is closest to your billing days. This kind of optimisation function is going to be much easier to calculate if you can perform the calculation outside of SQL (e.g. if you export the data to Excel or SSRS).
I believe you can calculate it purely using SQL but it is pretty tricky. I got as far as generating a table with a distance result for every DayOfTheMonth / Location ID combination but got stuck when I tried to narrow this down to the best distance per location.
WITH DaysOfTheMonth as(
SELECT 1 as DayNumber
UNION ALL
SELECT DayNumber + 1 as DayNumber
FROM DaysOfTheMonth
WHERE DayNumber < 31
)
SELECT Bills.LocationID, DaysOfTheMonth.DayNumber,
Avg((Day(Bills.BillingDate)-DaysOfTheMonth.DayNumber)%30) as Distance
FROM Bills, DaysOfTheMonth
GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
A big downside of this approach is that it is computation heavy.
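For what it's worth, narrowing that intermediate result down to the best day per location could be done by ranking the per-day distances. This is only a sketch in the same dialect as the snippet above, reusing its names and its simplified distance expression (so it inherits the same caveats):

WITH DaysOfTheMonth AS (
    SELECT 1 AS DayNumber
    UNION ALL
    SELECT DayNumber + 1
    FROM DaysOfTheMonth
    WHERE DayNumber < 31
),
Distances AS (
    -- average distance of each candidate day from the location's billing days
    SELECT
        Bills.LocationID,
        DaysOfTheMonth.DayNumber,
        Avg((Day(Bills.BillingDate) - DaysOfTheMonth.DayNumber) % 30) AS Distance
    FROM Bills
    CROSS JOIN DaysOfTheMonth
    GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
),
Ranked AS (
    SELECT
        LocationID,
        DayNumber,
        Distance,
        ROW_NUMBER() OVER (PARTITION BY LocationID ORDER BY Distance) AS rn
    FROM Distances
)
-- keep only the day with the smallest average distance per location
SELECT LocationID, DayNumber AS BestDay, Distance
FROM Ranked
WHERE rn = 1;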
I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time   country   browser   num_visits
======================================
0      USA       Chrome    12
0      USA       IE        7
5      France    IE        5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned in queries, I am thinking of having multiple tables with the above schema, split by resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day I will go to the hour resolution table, which is smaller than the 5 sec resolution table (although I could have used that one too - it's just more rows to scan).
Now what if the hour resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. AFAIU I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium-sized tables, plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table and many child tables for the hourly resolution (and, perhaps, many more for the minute and second resolutions). Then you only have to select from the master table, and let the constraints on each child table decide which child holds which rows.
Of course, you have to add a trigger function to route inserts into the appropriate child tables.
It's complexity on insert versus complexity on display.
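If it helps, here is a minimal sketch of that pattern in PostgreSQL for just the hourly aggregates; every name is a placeholder and the month-based split is my own assumption (newer Postgres versions could use declarative partitioning instead of inheritance plus a trigger):

-- Master table for the hourly aggregates (hypothetical names throughout)
CREATE TABLE visits_1hour (
    bucket_time timestamptz NOT NULL,
    country     text,
    browser     text,
    num_visits  bigint
);

-- One child per month; the CHECK constraint lets constraint exclusion
-- skip children that cannot match the queried time range
CREATE TABLE visits_1hour_2021_01 (
    CHECK (bucket_time >= '2021-01-01' AND bucket_time < '2021-02-01')
) INHERITS (visits_1hour);

CREATE TABLE visits_1hour_2021_02 (
    CHECK (bucket_time >= '2021-02-01' AND bucket_time < '2021-03-01')
) INHERITS (visits_1hour);

-- Trigger on the master routes each insert to the right child
CREATE OR REPLACE FUNCTION visits_1hour_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.bucket_time < '2021-02-01' THEN
        INSERT INTO visits_1hour_2021_01 VALUES (NEW.*);
    ELSE
        INSERT INTO visits_1hour_2021_02 VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- the row is stored in a child, not in the master
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_1hour_route
BEFORE INSERT ON visits_1hour
FOR EACH ROW EXECUTE PROCEDURE visits_1hour_insert();

With constraint_exclusion left at its default ('partition'), a query against visits_1hour that filters on bucket_time only scans the children whose CHECK constraints can match that range.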