Calculate data with overlapping dates in SQL - PostgreSQL

I'm using PostgreSQL 13 (I can update to 14 if that helps) and I'd like to return some rows based on data I've got.
The data I've got is a bit complex and comes from a few different tables but I don't think it matters here.
Currently I was able to create a query that returns data that looks like this:
Product ID   Start        End          AvailableAmount
----------   ----------   ----------   ---------------
1            null         null         2
1            2022-07-20   2022-07-22   1
1            2022-07-24   2022-07-27   1
2            null         null         1
3            null         null         5
Where Start is the start of a time period, End is the end of that period, and AvailableAmount is the amount of product available in that time period. AvailableAmount is calculated based on some other data.
I've tried summing up the AvailableAmount column, but that does not return valid data: for the time period from 2022-07-20 to 2022-07-24 the AvailableAmount should be 1, but the sum gives 2.
I think I'd need to somehow separate these dates and list the amount per day, not per time period but I don't know how.
Basically, going day by day, AvailableAmount for product with ID 1 should be:
2022-07-20: 1
2022-07-21: 1
2022-07-22: 1
2022-07-23: 2
2022-07-24: 1
2022-07-25: 1
2022-07-26: 1
2022-07-27: 1
2022-07-28: 2
...
So if I were to query for the product for the time period 2022-07-20 to 2022-07-25, I should be able to request 1 unit of the product. Currently my implementation makes that impossible because it sums up the amounts, so if my request spans two different time periods the available amount comes out lower than it should be.
I've tried using a gaps-and-islands approach, but I don't think it'd work here. I've also read about the multirange type introduced in v14, but I haven't tested it yet; I'm still working on that. I've also tried using generate_series, but that did not help me.
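To illustrate the per-day idea, this is roughly the direction I was exploring with generate_series. It's only a sketch, not working code from my system: the table and column names (availability, product_id, start, "end", availableamount) are placeholders for the intermediate result above, and I'm assuming the row with NULL dates holds the default amount while a dated row overrides it for the days it covers (overlapping dated rows would take the minimum):
-- Sketch: expand the requested range to days and pick the per-day amount.
SELECT
    d.day::date AS day,
    COALESCE(
        MIN(a.availableamount) FILTER (WHERE a.start IS NOT NULL),  -- a period row covering this day
        MIN(a.availableamount) FILTER (WHERE a.start IS NULL)       -- otherwise the default row
    ) AS amount_per_day
FROM generate_series(date '2022-07-20', date '2022-07-25', interval '1 day') AS d(day)
LEFT JOIN availability a
       ON a.product_id = 1
      AND (a.start IS NULL OR d.day::date BETWEEN a.start AND a."end")
GROUP BY d.day::date
ORDER BY day;
The amount I could actually hand out for 2022-07-20 to 2022-07-25 would then be the minimum of amount_per_day over that range (1 in this example).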
I don't know if that is enough information but I can provide more if needed.
Thanks!

Related

I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
  factor_date date,
  factor_value numeric(3,1));

CREATE TABLE customer_date_ranges (
  customer_id int,
  date_from date,
  date_to date);

INSERT INTO daily_factors
SELECT
  t.factor_date,
  (random() * 10 + 30)::numeric(3,1)
FROM
  generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);

WITH customer_id AS (
  SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
  SELECT
    customer_id,
    (timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
  FROM
    customer_id)
INSERT INTO customer_date_ranges
SELECT
  d.customer_id,
  d.date_from,
  (d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
  date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", all of whom have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
  cd.customer_id,
  AVG(df.factor_value) AS average_value
FROM
  customer_date_ranges cd
  INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
  cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index and around 28 seconds with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store keeps growing with every new day;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
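As a rough sketch against the example tables above (daily_factors_cumulative is just an assumed helper table; you would rebuild or append to it as new days arrive):
-- Running sum and day count per date, computed once up front.
CREATE TABLE daily_factors_cumulative AS
SELECT
  factor_date,
  SUM(factor_value) OVER (ORDER BY factor_date) AS running_value,
  ROW_NUMBER() OVER (ORDER BY factor_date) AS day_number
FROM daily_factors;

-- Average over a range = difference of running sums / difference of day counts.
SELECT
  cd.customer_id,
  (f_to.running_value - COALESCE(f_before.running_value, 0))
    / (f_to.day_number - COALESCE(f_before.day_number, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factors_cumulative f_to
  ON f_to.factor_date = cd.date_to
LEFT JOIN daily_factors_cumulative f_before
  ON f_before.factor_date = cd.date_from - 1;
Each customer then needs only two lookups on factor_date (the end of the range and the day before the start) instead of a scan of every day in between.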

TABLEAU Calculating a Running DISTINCT COUNT on usernames for last 3 months

Issue:
I need to show RUNNING DISTINCT users per 3-month interval^^ (see the goal table as a reference). However, COUNTD does not help, even as a table calculation, and neither do the WINDOW_COUNT or WINDOW_SUM functions.
^^RUNNING DISTINCT users means the DISTINCT users in a period of time (Jan - Mar, Feb – Apr, etc.). The COUNTD option only counts DISTINCT users in a single window; this process should go over a rolling 3-month window to find the DISTINCT users.
Original Table
Date Username
1/1/2016 A
1/1/2016 B
1/2/2016 C
2/1/2016 A
2/1/2016 B
2/2/2016 B
3/1/2016 B
3/1/2016 C
3/2/2016 D
4/1/2016 A
4/1/2016 C
4/2/2016 D
4/3/2016 F
5/1/2016 D
5/2/2016 F
6/1/2016 D
6/2/2016 F
6/3/2016 G
6/4/2016 H
Goal Table
Tried Methods:
Step-by-step:
I tried to break the problem into steps, but due to the columnar nature of Tableau, I cannot successfully run COUNT or SUM (or any aggregate command) on the LAST STEP of the solution.
STEP 0 Raw Data
This table shows the structure of the data, as it is in the original table.
STEP 1 COUNT usernames by MONTH
The table shows the count of users by month. You will notice that because user B had 2 entries, he is counted twice. In the next step we use DISTINCT COUNT to fix this issue.
STEP 2 DISTINCT COUNT by MONTH
Now we can see who was present in a given month; the next step is to see the running DISTINCT COUNT by MONTH over 3 months.
STEP 3 RUNNING DISTINCT COUNT for 3 months
Now we can see the SUM of the DISTINCT COUNT of usernames over a running 3 months. If you change the MONTH INTERVAL from 3 to 1, you get the STEP 2 table.
LAST STEP Issue Step
GOAL: I need the GRAND TOTAL to be the SUM of the MONTH column.
Request:
I want to calculate the SUM of '1' by MONTH. However, I am using a WINDOW function and aggregating the data, which gave me an error.
WHAT I NEED
Jan   Feb   March   April   May   Jun
3     3     4       5       5     6
WHAT I GOT
Jan   Feb   March   April   May   Jun
1     1     1       1       1     1
My Output after tried methods: Attached twbx file. DISTINCT_count_running_v1
HELP taken:
https://community.tableau.com/thread/119179 ; Tried this method but stuck at last step
https://community.tableau.com/thread/122852 ; Used some parts of this solution
The way I approached the problem was to identify the minimum login date for each user and then use that date to count the distinct number of users. For example, I have data in this format. I created a calculated field called Min User Login Date as { FIXED [User]:MIN([Date])} and then did a CNTD(USER) on Min User Login Date to get the unique user count by date. If you want a running total, you can then apply a quick table calculation (Running Total) to the CNTD(USER) field.
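If it helps to see the same first-login idea outside of Tableau, a rough SQL equivalent would look like this (logins(username, login_date) is a hypothetical table standing in for the data above):
-- Each user's first appearance, then a cumulative count of first-time users per month.
WITH first_login AS (
  SELECT username, MIN(login_date) AS min_login_date
  FROM logins
  GROUP BY username
)
SELECT
  date_trunc('month', min_login_date) AS month,
  COUNT(*) AS new_users,
  SUM(COUNT(*)) OVER (ORDER BY date_trunc('month', min_login_date)) AS running_distinct_users
FROM first_login
GROUP BY date_trunc('month', min_login_date)
ORDER BY month;
Note that, like the Running Total calculation described above, this gives a cumulative distinct count rather than a strict 3-month window.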
You need to put Month(date) and count(username) in the columns; then you will get the result you expect.
See screen below

Grouping by date difference/range

How would I write a statement that produces a specific GROUP BY based on the monthly date range/difference? Example:
org_group | date       | second_group_by
A         | 30.10.2013 | 1
A         | 29.11.2013 | 1
A         | 31.12.2013 | 1
A         | 30.01.2015 | 2
A         | 27.02.2015 | 2
A         | 31.03.2015 | 2
A         | 30.04.2015 | 2
As long as there isn't a monthly date_diff > 1, the rows should be in the same second_group_by. I hope it's clear enough to understand: the column second_group_by should be generated by the query; it doesn't exist in the table.
Date diff between which rows, though?
If you just want to separate years (or months or weeks) use
GROUP BY DATEPART(....)
That's Sybase or SQL Server syntax, but other SQL dialects will have an equivalent.
If you have specific data ranges, get them into a table with start and end date-time and a monotonically increasing integer, join to that with a BETWEEN and GROUP BY the integer.
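As a rough sketch of that idea (date_ranges and mytable are illustrative names, not from the question):
-- Illustrative only: one row per reporting period, with an increasing integer id.
CREATE TABLE date_ranges (
  range_id    int,
  range_start date,
  range_end   date);

-- Join each row to its period and group by the period's integer.
SELECT
  t.org_group,
  r.range_id AS second_group_by,
  COUNT(*) AS rows_in_group
FROM mytable t
INNER JOIN date_ranges r
  ON t.date BETWEEN r.range_start AND r.range_end
GROUP BY
  t.org_group,
  r.range_id;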

Web analytics schema with postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time   country   browser   num_visits
======================================
0      USA       Chrome    12
0      USA       IE        7
5      France    IE        5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned by queries, I am thinking of having multiple tables with the above schema based on their resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day, I will go to the hour-resolution table, which is smaller than the 5-second resolution table (although I could have used that one too; it just means more rows to scan).
Now what if the hour-resolution table has data for hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could run multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. AFAIU I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium-sized tables plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data for hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could run multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master resolution table and many hourly-resolution child tables (and, perhaps, minute- and second-resolution child tables as well).
Thus you only have to select from the master table, and let the constraints on the child tables decide which is which.
Of course you have to add a trigger function to route inserts into the appropriate child tables.
Complexities in insert versus complexities in display.
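A minimal sketch of that layout, assuming PostgreSQL table inheritance (all names here are illustrative, not from the question):
-- Master table: queries go against this one.
CREATE TABLE visits_agg (
  bucket_start timestamptz,
  resolution   interval,
  country      text,
  browser      text,
  num_visits   int);

-- Child tables with CHECK constraints, so constraint exclusion can skip children that can't match.
CREATE TABLE visits_agg_5s (CHECK (resolution = interval '5 seconds')) INHERITS (visits_agg);
CREATE TABLE visits_agg_1h (CHECK (resolution = interval '1 hour')) INHERITS (visits_agg);

-- Trigger function that routes inserts on the master into the matching child.
CREATE OR REPLACE FUNCTION visits_agg_insert() RETURNS trigger AS $$
BEGIN
  IF NEW.resolution = interval '5 seconds' THEN
    INSERT INTO visits_agg_5s VALUES (NEW.*);
  ELSIF NEW.resolution = interval '1 hour' THEN
    INSERT INTO visits_agg_1h VALUES (NEW.*);
  END IF;
  RETURN NULL;  -- the row is stored in a child table, not in the master itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_agg_route
BEFORE INSERT ON visits_agg
FOR EACH ROW EXECUTE PROCEDURE visits_agg_insert();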

Generating Running Sum of Ratings in SQL

I have a rating table. It boils down to:
rating_value   created
+2             april 3rd
-5             april 20th
So, every time someone gets rated, I track that rating event in the database.
I want to generate a rating history/time graph where the rating is the sum of all ratings up to that point in time on a graph.
I.e. a person's rating on April 5th might be: select sum(rating_value) from ratings where created <= april 5th
The only problem with this approach is I have to run this day by day across the interval I'm interested in. Is there some trick to generating a running total using this sort of data?
Otherwise, I'm thinking the best approach is to create a denormalized "rating history" table alongside the individual ratings.
If you have PostgreSQL 8.4, you can use a window aggregate function to calculate a running sum:
steve#steve#[local] =# select rating_value, created,
  sum(rating_value) over (order by created)
  from rating;
 rating_value |  created   | sum
--------------+------------+-----
            2 | 2010-04-03 |   2
           -5 | 2010-04-20 |  -3
(2 rows)
See http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
Try adding a GROUP BY statement. That gives you the rating value for each day (e.g. in an array). As you output the rating values over time, you can just add the previous array elements together.
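In Postgres, that per-day grouping plus the "add the previous elements" step can also be done in a single statement, roughly like this (a sketch, assuming the ratings table from the question):
-- Daily totals first, then a cumulative sum over the days.
SELECT
  created::date AS rating_date,
  SUM(rating_value) AS daily_total,
  SUM(SUM(rating_value)) OVER (ORDER BY created::date) AS running_total
FROM ratings
GROUP BY created::date
ORDER BY rating_date;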