I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
    factor_date date,
    factor_value numeric(3,1));

CREATE TABLE customer_date_ranges (
    customer_id int,
    date_from date,
    date_to date);

INSERT INTO
    daily_factors
SELECT
    t.factor_date,
    (random() * 10 + 30)::numeric(3,1)
FROM
    generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);

WITH customer_id AS (
    SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
    SELECT
        customer_id,
        (timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
    FROM
        customer_id)
INSERT INTO
    customer_date_ranges
SELECT
    d.customer_id,
    d.date_from,
    (d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
    date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", all of whom have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
    cd.customer_id,
    AVG(df.factor_value) AS average_value
FROM
    customer_date_ranges cd
    INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
    cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero-length date ranges and that my maths is right). My real-world system also has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. And the number of possible ranges grows roughly quadratically with the number of days (about n²/2 for n days), so every new day adds more values to store than the last;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?

The classic way to deal with this is to store running sums of factor_value, not (or in addition to) the individual values. Then you just look up the running sum at the two end points (actually at the end, and at the day before the start) and take the difference. And of course divide by the count of days to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
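For illustration, here is a minimal sketch of that idea in PostgreSQL, using the tables from the question; the helper table daily_cumulative and its column names are just made up for this example:
CREATE TABLE daily_cumulative AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum
FROM
    daily_factors;

CREATE UNIQUE INDEX ON daily_cumulative (factor_date);

-- Average per customer: running sum at date_to minus running sum at the day
-- before date_from, divided by the number of days in the range.
SELECT
    cd.customer_id,
    (hi.running_sum - COALESCE(lo.running_sum, 0))
        / (cd.date_to - cd.date_from + 1) AS average_value
FROM
    customer_date_ranges cd
    INNER JOIN daily_cumulative hi ON hi.factor_date = cd.date_to
    LEFT JOIN daily_cumulative lo ON lo.factor_date = cd.date_from - 1;
Both joins are now simple equality lookups on an indexed date, so each customer costs two index probes instead of a scan over every day in its range. The running-sum table does need to be refreshed (or appended to) when new daily factors arrive.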

Related

Weekly hour allocation problem in Rails and Postgresql

If I have a list of tasks with certain date ranges, and each task is broken into weekly hour chunks of work (i.e. 30 hours from 2018-12-31 to 2019-01-06 ... etc., starting from Monday).
The kind of operations I would like to do are
Display all the weekly hours of all the tasks for a list of users
Sum the weekly hours for a user for all his tasks for the week
When the duration of the task is modified, create/destroy the weekly hour chunks.
Would it be more efficient to store these weekly records as
start date/end date/hours,
year/week number/hours
Storing start/end dates probably gives more flexibility to the table, as it could potentially store hours that are not aligned to whole weeks.
Storing the week number means that, given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date, and populating the weeks in between (without converting to date ranges). It also makes validation easier when updating the hours for a week, as the week number just has to be 1-53.
Wondering if anyone has tried out either option and can give any pointers on their preferred option.
I would probably go for a daterange column.
That gives you the flexibility to have differently sized chunks and allows you to define an exclusion constraint to prevent overlapping ranges.
Finding the row for a given week is still quite simple using the "contains" operator #>, e.g. where the_column #> to_date('2019-24', 'iyyy-iw') finds the row(s) that contain week number 24 in 2019.
The expression to_date('2019-24', 'iyyy-iw') returns the first day (Monday) of the specified week.
Finding all rows that fall between two weeks can also be done, however constructing the corresponding date range looks a bit ugly. You can either construct an inclusive range with the first and last day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-24', 'iyyy-iw') + 6, '[]')
Or you can create a range with an exclusive upper range with the next week's first day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-25', 'iyyy-iw'), '[)')
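Putting those pieces together, a minimal sketch might look like this (table and column names are invented for illustration):
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets the exclusion constraint combine = and &&

CREATE TABLE task_hours (
    task_id  int,
    the_week daterange,
    hours    numeric,
    EXCLUDE USING gist (task_id WITH =, the_week WITH &&)  -- no overlapping ranges per task
);

-- The row(s) containing ISO week 24 of 2019:
SELECT *
FROM task_hours
WHERE the_week @> to_date('2019-24', 'iyyy-iw');

-- All rows touching weeks 24 and 25 of 2019 (exclusive upper bound):
SELECT *
FROM task_hours
WHERE the_week && daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-26', 'iyyy-iw'), '[)');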
While ranges can be indexed quite efficiently, the required GiST indexes are a bit more expensive to maintain than a B-Tree index on two integer columns.
Another downside of using ranges (if you don't really need the flexibility) is that they take up more space than two integer columns (14 bytes instead of 8, or even 4 with two smallint columns). So if the size of the table is of any concern, then your current solution with the year/week columns is more efficient.
"Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date"
If your input is a start and end date to begin with (rather than a "week number"), then I would definitely go for a daterange column. If that start and end date cover more than one week, then you store only one row, rather than multiple rows.

Count Until A Specific Value?

Say you've got a table, ordered by date, that captures the speed of vehicles with a device in them. Imagine you get roughly 30 updates per day for the speed; it's not always exactly 30 per vehicle. The data has the vehicle, the timestamp, and the speed.
What I want to do is be able to count how many days have passed since the vehicle last went over 10 mph in order to find inactive vehicles. Is something like that possible in postgresql?
Or is there a way to get back the row number (with the table sorted) where the speed last goes past 10, and then select the date at that row number and subtract it from the current date?
SELECT DISTINCT ON (vessel) vessel, now() - date
FROM your_table
WHERE speed > 10
ORDER BY vessel, date DESC
This will tell you, for every vehicle, how long ago its speed field was last over 10.
SELECT vessel, now() - max(date)
FROM your_table
WHERE speed > 10
GROUP BY vessel;

Year over year monthly sales

I am using SQL Server 2008 R2. Here is the query I have that returns monthly sales totals by zip code, per store.
select
    left(a.Zip, 5) as ZipCode,
    s.Store,
    datename(month, s.MovementDate) as TheMonth,
    datepart(year, s.MovementDate) as TheYear,
    datepart(mm, s.MovementDate) as MonthNum,
    sum(s.Dollars) as Sales,
    count(*) as [TxnCount],
    count(distinct s.AccountNumber) as NumOfAccounts
from
    dbo.DailySales s
inner join
    dbo.Accounts a on a.AccountNumber = s.AccountNumber
where
    s.SaleType = 3
    and s.MovementDate > '1/1/2016'
    and isnull(a.Zip, '') <> ''
group by
    left(a.Zip, 5),
    s.Store,
    datename(month, s.MovementDate),
    datepart(year, s.MovementDate),
    datepart(mm, s.MovementDate)
Now I'd like to add columns that compare sales, TxnCount, and NumOfAccounts to the same month the previous year for each zip code and store. I also would like each zip code/store combo to have a record for every month in the range; so zeros if null.
I do have a calendar table that I tried to use to get all months, but I ran into problems because of my "where" statements.
I know that both of these issues (comparing to previous year and including all dates in a date range) have been asked and answered before, and I've gotten them to work before myself, but this particular one has me running in circles. Any help would be appreciated.
I hope this is clear enough.
Thanks,
Tim
Treat the query you have above as a data source. Run it as a CTE for the period you want to report, plus the same period 12 months earlier (to get the historic data). Call it SalesPerMonth.
Then do a query that gets all the months you need from your calendar table as another CTE. These are the reporting months only, not the previous year. Call it MonthsToReport.
Get a list of every valid zip code/store combo, probably a SELECT DISTINCT from the SalesPerMonth CTE; this gives you only combos that have at least one sale in the period (or the historical period - you probably also want ones that sold last year but not this year). That's another CTE, StoreZip.
Finally, your main query cross joins the StoreZip results with MonthsToReport - this gives you the one row per store/zip/month combo you are looking for. Left join twice to the SalesPerMonth data, once for the reporting month and once for the same month one year earlier. Use ISNULL to change any NULL values (no data) to zero. A rough skeleton is sketched below.
Instead of CTEs, you could also do it as separate queries, storing the results in Temp tables instead. This may work better for large amounts of data.
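For illustration, here is a rough skeleton of that approach, trimmed to the essentials (the CTE names come from the description above; the calendar table's name and columns are guesses that will need adjusting to your schema):
with SalesPerMonth as (
    select
        left(a.Zip, 5) as ZipCode,
        s.Store,
        datepart(year, s.MovementDate) as TheYear,
        datepart(mm, s.MovementDate) as MonthNum,
        sum(s.Dollars) as Sales,
        count(*) as TxnCount,
        count(distinct s.AccountNumber) as NumOfAccounts
    from dbo.DailySales s
    inner join dbo.Accounts a on a.AccountNumber = s.AccountNumber
    where s.SaleType = 3
        and s.MovementDate > '1/1/2015'   -- one extra year back for the prior-year comparison
        and isnull(a.Zip, '') <> ''
    group by left(a.Zip, 5), s.Store,
        datepart(year, s.MovementDate), datepart(mm, s.MovementDate)
),
MonthsToReport as (
    select distinct
        datepart(year, c.CalendarDate) as TheYear,
        datepart(mm, c.CalendarDate) as MonthNum
    from dbo.Calendar c                   -- assumed calendar table and date column
    where c.CalendarDate >= '1/1/2016'
),
StoreZip as (
    select distinct ZipCode, Store from SalesPerMonth
)
select
    sz.ZipCode,
    sz.Store,
    m.TheYear,
    m.MonthNum,
    isnull(cur.Sales, 0)          as Sales,
    isnull(prev.Sales, 0)         as SalesLastYear,
    isnull(cur.TxnCount, 0)       as TxnCount,
    isnull(prev.TxnCount, 0)      as TxnCountLastYear,
    isnull(cur.NumOfAccounts, 0)  as NumOfAccounts,
    isnull(prev.NumOfAccounts, 0) as NumOfAccountsLastYear
from StoreZip sz
cross join MonthsToReport m
left join SalesPerMonth cur
    on cur.ZipCode = sz.ZipCode and cur.Store = sz.Store
    and cur.TheYear = m.TheYear and cur.MonthNum = m.MonthNum
left join SalesPerMonth prev
    on prev.ZipCode = sz.ZipCode and prev.Store = sz.Store
    and prev.TheYear = m.TheYear - 1 and prev.MonthNum = m.MonthNum;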

kdb/q building NBBO from TAQ data

I have a table with bid/ask for every stock/venue. Something like:
taq:`time xasc ([] time:10:00:00+(100?1000);bid:30+(100?20)%30;ask:30.8+(100?20)%30;stock:100?`STOCK1`STOCK2;exchange:100?`NYSE`NASDAQ)
How can I get the best/bid offer from all exchanges as of a time (in one minute buckets) for every stock?
My initial thought is to build a table that has a row for every minute/exchange/stock and do an asof join on the taq data. However, that sounds like a brute-force solution - since this is a solved problem, I figured I'd ask if there is a better way.
select max bid, min ask by stock,1+minute from 0!select by 1 xbar time.minute,stock,exchange from taq
This will give you the max bid, min ask across exchanges at the 1-minute interval in column minute.
The only tricky thing is the select by 1 xbar time.minute. When you do select by with no aggregation, it just returns the last row in each group. So really what this means is select last time, last bid, last ask ... by 1 xbar time.minute etc.
So after we get the last values by minute and exchange, we just get the min/max across exchanges for that minute.

Web analytics schema with postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time  country  browser  num_visits
==================================
0     USA      Chrome   12
0     USA      IE       7
5     France   IE       5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned in queries, I am thinking of having multiple tables with the above schema, based on their resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day I will go to the hour-resolution table, which is smaller than the 5-second-resolution table (although I could have used that one too - it's just more rows to scan).
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries to the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. As far as I understand, I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium tables plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries to the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table and many hourly-resolution child tables (and, perhaps, many minute- and second-resolution child tables).
That way you only ever select from the master table, and let the constraint on each child table decide which tables actually get scanned.
Of course you have to add a trigger function to route inserts into the appropriate child tables.
It trades complexity at insert time for simplicity at query time.
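A minimal sketch of that idea (all names here are invented for illustration; on newer PostgreSQL versions declarative partitioning would replace the trigger):
CREATE TABLE visits_agg (
    bucket_start timestamptz,
    resolution   interval,
    country      text,
    browser      text,
    num_visits   int
);

CREATE TABLE visits_agg_5sec (
    CHECK (resolution = interval '5 seconds')
) INHERITS (visits_agg);

CREATE TABLE visits_agg_1hour (
    CHECK (resolution = interval '1 hour')
) INHERITS (visits_agg);

-- Route inserts on the master table to the right child table.
CREATE OR REPLACE FUNCTION visits_agg_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.resolution = interval '5 seconds' THEN
        INSERT INTO visits_agg_5sec VALUES (NEW.*);
    ELSIF NEW.resolution = interval '1 hour' THEN
        INSERT INTO visits_agg_1hour VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'unsupported resolution %', NEW.resolution;
    END IF;
    RETURN NULL;  -- the row has already been stored in a child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_agg_route
BEFORE INSERT ON visits_agg
FOR EACH ROW EXECUTE PROCEDURE visits_agg_insert();

-- With constraint_exclusion enabled, a query that filters on resolution
-- only scans the matching child table:
SET constraint_exclusion = partition;
SELECT * FROM visits_agg WHERE resolution = interval '1 hour';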