Return at most 100 equally spaced rows from a query - PostgreSQL

I have a table with time series data that looks like this:
id                                   | pressure | date                | location
-------------------------------------|----------|---------------------|---------
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e | 1.54     | 1999-01-08 04:05:06 | "London"
c47c89f1-bdf8-420c-9237-32d70f9119f6 | 1.67     | 1999-01-08 04:05:06 | "Paris"
54bbd56b-3269-4417-8422-3f1b7285e165 | 1.77     | 1999-01-08 04:05:06 | "Berlin"
...                                  | ...      | ...                 | ...
Now I would like to create a query for a given location between two dates that returns at most 100 results. If there are 100 rows or fewer, simply return all of them. If there are more than 100, I'd like them to be equally spaced by date (i.e. if I select from 1 January 1970 to 1 January 2022 there are 18,993 days, so I'd like it to take every 189th day so that it returns exactly 100 records, cutting off the remaining 93 days).
So it returns rows like this:
id                                   | pressure | date                | location
-------------------------------------|----------|---------------------|---------
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e | 1.54     | 1970-01-01 04:05:06 | London
8dc7c77b-6958-4cc7-914a-4b9c1f661200 | 1.1      | 1970-07-09 04:05:06 | London
4e3d4c3b-a7e3-48bf-b6a3-5327cc606c82 | 1.23     | 1971-01-14 04:05:06 | London
...                                  | ...      | ...                 | ...
Here's how far I got:
SELECT
    pressure,
    date
FROM
    location
WHERE
    location = $1
    AND date > $2 AND date < $3
ORDER BY
    date
The problem is, I see no way of achieving this without first loading all the rows, counting them, and then sampling them out, which would be bad for performance (it would mean loading 18,993 rows just to return 100). So is there a way to efficiently load just those hundred rows as you go?

I don't have a complete answer, but some ideas to start one:
SELECT MAX(date), MIN(date) FROM location gets you the first and last dates
something like SELECT MIN(date)::timestamp as beginning, (MAX(date)::timestamp - MIN(date)::timestamp)/COUNT(*) as avgdistance FROM location (no guarantee on syntax, can't check on a live DB right now) would get you the start point and the average distance between points.
The method to get every nth row is SELECT * FROM (SELECT *, row_number() OVER (ORDER BY date) rn FROM tab) foo WHERE foo.rn % {n} = 0; (with {n} replaced by your desired step).
You can probably replace that WHERE with something other than a plain row_number() check and that will get you somewhere.
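Putting those ideas together, here is a rough, untested sketch (table and column names as in the question) that counts the matching rows in the same pass and keeps roughly every nth one:

WITH numbered AS (
    SELECT
        pressure,
        date,
        row_number() OVER (ORDER BY date) AS rn,
        count(*) OVER () AS total          -- number of rows matching the filter
    FROM location
    WHERE location = $1
      AND date > $2 AND date < $3
)
SELECT pressure, date
FROM numbered
WHERE (rn - 1) % greatest((total + 99) / 100, 1) = 0   -- step = ceil(total/100)
ORDER BY date
LIMIT 100;

When there are 100 rows or fewer the step is 1 and everything comes back; otherwise the step is roughly total/100, so at most 100 roughly evenly spaced rows are returned. Note that this still scans every row in the range on the server; it only avoids sending them all to the client.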
Feel free to delete this answer if a real answer comes along; until then, maybe this'll get you started.

Related

I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
    factor_date date,
    factor_value numeric(3,1));

CREATE TABLE customer_date_ranges (
    customer_id int,
    date_from date,
    date_to date);

INSERT INTO
    daily_factors
SELECT
    t.factor_date,
    (random() * 10 + 30)::numeric(3,1)
FROM
    generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);

WITH customer_id AS (
    SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
    SELECT
        customer_id,
        (timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
    FROM
        customer_id)
INSERT INTO
    customer_date_ranges
SELECT
    d.customer_id,
    d.date_from,
    (d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
    date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers" all who have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
    cd.customer_id,
    AVG(df.factor_value) AS average_value
FROM
    customer_date_ranges cd
    INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
    cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real-world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store grows quadratically as each new day is added;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
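A minimal sketch of that idea in PostgreSQL, assuming the tables from the question and that every end-point date exists in daily_factors (daily_factor_sums is a made-up helper table name):

CREATE TABLE daily_factor_sums AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    ROW_NUMBER()      OVER (ORDER BY factor_date) AS running_count
FROM daily_factors;

-- Average per customer = (sum at date_to - sum one day before date_from)
--                      / (count at date_to - count one day before date_from)
SELECT
    cd.customer_id,
    (s_to.running_sum - COALESCE(s_from.running_sum, 0))
      / (s_to.running_count - COALESCE(s_from.running_count, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factor_sums s_to
  ON s_to.factor_date = cd.date_to
LEFT JOIN daily_factor_sums s_from
  ON s_from.factor_date = cd.date_from - 1;   -- one day before the start of the range

With an index on daily_factor_sums (factor_date), each customer should need only two index lookups instead of a range scan, and the precomputed table grows by just one row per day.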

Average day of month range for bill date

Our billing system has a table of locations. These locations have a bill generated every month. These bills can vary by a few days each month (e.g. billed on the 6th one month, then the 8th the next), but they always stay around the same time of the month. I need to come up with a "blackout" range that is essentially any day of the month on which that location has been billed over the last X months. It needs to handle locations whose bill dates bounce between the end and the beginning of a month.
For this the only relevant tables are a Location table and a Bill table. The Location table has a list of locations, with LocationID being the PK; the other fields are irrelevant here, I believe. The Bill table has a document number as the PK and LocationID as an FK, along with many other fields such as doc amount, due date, etc. It also has a Billing Date, which is the date I'd like to calculate my 'Blackout' dates from.
Basically we're going to be changing every meter, and don't want to change them on a day where they might be billed.
Example Data:
Locations:
111111, Field1, Field2, etc.
222222, Field1, Field2, etc.
Bills (DocNum, LocationID, BillingDate):
BILL0001, 111111, 1/6/2018
BILL0002, 111111, 2/8/2018
BILL0003, 111111, 3/5/2018
BILL0004, 111111, 4/6/2018
BILL0005, 111111, 5/11/2018
BILL0006, 111111, 6/10/2018
BILL0007, 111111, 7/9/2018
BILL0008, 222222, 1/30/2018
BILL0009, 222222, 3/1/2018
BILL0010, 222222, 4/2/2018
BILL0011, 222222, 5/3/2018
BILL0012, 222222, 6/1/2018
BILL0013, 222222, 7/1/2018
BILL0014, 222222, 7/28/2018
Example output:
Location: 111111 BlackOut: 6th - 11th
Location: 222222 BlackOut: 28th - 3rd
How can I go about doing this?
As I understand it you want to work out the "average" day when the billing occurs and then construct a date range based around that?
The most accurate way I can think of doing this is to calculate a distance function and try to find the day of the month which minimises this distance function.
An example function would be:
distance() = Min((Day(BillingDate) - TrialDayNumber) % 30, (TrialDayNumber - Day(BillingDate)) % 30)
This will produce a number between 0 and 15 telling you how far away a billing date is from an arbitrary day of the month TrialDayNumber.
You want to try to select a day which, on average, has the lowest possible distance value as this will be the day of the month which is closest to your billing days. This kind of optimisation function is going to be much easier to calculate if you can perform the calculation outside of SQL (e.g. if you export the data to Excel or SSRS).
I believe you can calculate it purely using SQL but it is pretty tricky. I got as far as generating a table with a distance result for every DayOfTheMonth / Location ID combination but got stuck when I tried to narrow this down to the best distance per location.
WITH DaysOfTheMonth AS (
    SELECT 1 AS DayNumber
    UNION ALL
    SELECT DayNumber + 1 AS DayNumber
    FROM DaysOfTheMonth
    WHERE DayNumber < 31
)
SELECT
    Bills.LocationID,
    DaysOfTheMonth.DayNumber,
    AVG((DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber) % 30) AS Distance
FROM Bills, DaysOfTheMonth
GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
A big downside of this approach is that it is computation heavy.
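One possible way to finish the narrowing-down step (untested, and assuming a dialect with window functions; DaysOfTheMonth follows the query above, while Distances and Ranked are names invented for this sketch, and the CASE implements the min-of-both-directions distance from the formula earlier) is to rank the candidate days per location and keep the one with the smallest average distance:

WITH DaysOfTheMonth AS (
    SELECT 1 AS DayNumber
    UNION ALL
    SELECT DayNumber + 1
    FROM DaysOfTheMonth
    WHERE DayNumber < 31
),
Distances AS (
    SELECT
        Bills.LocationID,
        DaysOfTheMonth.DayNumber,
        -- wrap-around distance on a 30-day cycle, averaged over the bills
        AVG(CASE
                WHEN ABS(DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber) <= 15
                THEN ABS(DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber)
                ELSE 30 - ABS(DAY(Bills.BillingDate) - DaysOfTheMonth.DayNumber)
            END * 1.0) AS Distance
    FROM Bills, DaysOfTheMonth
    GROUP BY Bills.LocationID, DaysOfTheMonth.DayNumber
),
Ranked AS (
    SELECT
        LocationID,
        DayNumber,
        Distance,
        ROW_NUMBER() OVER (PARTITION BY LocationID ORDER BY Distance) AS rn
    FROM Distances
)
SELECT LocationID, DayNumber AS ClosestBillingDay, Distance
FROM Ranked
WHERE rn = 1;

The chosen day is only the centre of the blackout; widening it to the earliest/latest observed offsets around that day (to get ranges like 6th-11th or 28th-3rd) would still need a second pass over the bills.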

How to find the days having a drawdown greater than X bips?

What would be the most idiomatic way to find the days with a drawdown greater than X bips? I again worked my way through some queries, but they become boilerplate ... maybe there is a simpler, more elegant alternative:
q)meta quotes
c   | t f a
----| -----
date| z
sym | s
year| j
bid | f
ask | f
mid | f
then I do:
bips:50;
`jump_in_bips xdesc distinct select date,jump_in_bips from (update date:max[date],jump_in_bips:(max[mid]-min[mid])%1e-4 by `date$date from quotes where sym=accypair) where jump_in_bips>bips;
but this will give me the days on which there has been a jump of that many bips, and not only the drawdowns.
I can of course put this result above in a temporary table and do several follow up selects like:
select ... where mid=min(mid),date=X
select ... where mid=max(mid),date=X
to check that the max(mid) was before the min(mid) ... is there a simpler, more idiomatic way?
I think maxs is the key function here, which allows you to maintain a running historical maximum, and you can compare your current value to that maximum. If you have some table quote which contains some series of mids (mids) and timestamps (date), the following query should return the days where you saw a drawdown greater than a certain value:
key select by `date$date from quote
where bips<({(maxs[x]-x)%1e-4};mid) fby `date$date
The lambda {(maxs[x]-x)%1e-4} computes, at each point, the drop from the historical maximum in bips; the where clause checks whether that exceeds bips, and fby lets you apply the calculation group-wise by date. Grouping with a by on date and taking the key will then return the days when this occurred.
If you want to preserve the information for the max drawdown you can use an update instead:
select max draw by date from
(update draw:(maxs[mid]-mid)%1e-4 by date from @[quote;`date;`date$])
where bips<draw
The date is updated separately with a direct modification to quote, to avoid repeated casting.
The difference between the max and min mids for a given date may be either an increase or a drawdown, depending on whether the max mid precedes the min. Also, since a sym column exists, I assume you may have different symbols in the table and want to get drawdowns for all of them.
For example, if there are 3 quotes for a given day and sym, 1.3000 1.2960 1.3010, then the difference between the 2nd and 3rd is 50 pips, but that is an increase.
The following query can be used to get the dates and symbols with a drawdown higher than the given threshold:
select from
    (select drawdown:{max maxs[x]-x}mid
    by date:`date$date,sym from quotes)
where drawdown>bips*1e-4
{max maxs[x]-x} gives the maximum drawdown for a given date by subtracting each mid from the maximum of the preceding mids.

How to optimize a batch pivotization?

I have a datetime list (which for some reason I call column date) containing over 1k datetimes.
adates:2017.10.20T00:02:35.650 2017.10.20T01:57:13.454 ...
For each of these dates I need to select the data from some table, then pivotize it by a column t (i.e. expiry), add the corresponding date datetime as a column to the pivotized table, and stitch together the pivotizations for all the dates. Note that I need to be able to identify which pivotization corresponds to which date, and that's why I do it one by one:
fPivot:{[adate;accypair]
t1:select from volatilitysurface_smile where date=adate,ccypair=accypair;
mycols:`atm`s10c`s10p`s25c`s25p;
t2:`t xkey 0!exec mycols#(stype!mid) by t:t from t1;
t3:`t xkey select distinct t,tenor,xi,volofvol,delta_type,spread from t1;
result:ej[`t;t2;t3];
:result}
I then call this function for every datetime in adates as follows:
raze {[accypair;adate] `date xcols update date:adate from fPivot[adate;accypair] }[`EURCHF] @/: adates;
This takes about 90s. I wonder if there is a better way, e.g. doing one big pivotization rather than running one pivotization per date and then stitching it all together. The big issue I see is that I have no apparent way to include the date attribute as part of the pivotization, and the date cannot be lost, otherwise I can't reconcile the results.
If you haven't been to the wiki page on pivoting then it may be a good start. There is a section on a general pivoting function that makes some claims to being somewhat efficient:
One user reports:
This is able to pivot a whole day of real quote data, about 25 million quotes over about 4000 syms and an average of 5 levels per sym, in a little over four minutes.
As for general comments, I would say that the ej is unnecessary, as it is a more general version of ij that allows you to specify the key column. As both t2 and t3 have the same keying, I would instead use:
t2 ij t3
which may give you a very minor performance boost.
OK, I solved the issue by creating a batch version of the pivotization that keeps the date (datetime) field in the group-by needed to pivot, i.e. changing by t:t from ... to by date:date,t:t from .... It went from 90s down to 150 milliseconds.
fBatchPivot:{[adates;accypair]
t1:select from volatilitysurface_smile where date in adates,ccypair=accypair;
mycols:`atm`s10c`s10p`s25c`s25p;
t2:`date`t xkey 0!exec mycols#(stype!mid) by date:date,t:t from t1;
t3:`date`t xkey select distinct date,t,tenor,xi,volofvol,delta_type,spread from t1;
result:0!(`date`t xasc t2 ij t3);
:result}

kdb/q building NBBO from TAQ data

I have a table with bid/ask for every stock/venue. Something like:
taq:`time xasc ([] time:10:00:00+(100?1000);bid:30+(100?20)%30;ask:30.8+(100?20)%30;stock:100?`STOCK1`STOCK2;exchange:100?`NYSE`NASDAQ)
How can I get the best bid/offer from all exchanges as of a time (in one-minute buckets) for every stock?
My initial thought is to build a table that has a row for every minute/exchange/stock and do an asof join on the taq data. However, it sounds to me like this is a brute-force solution; since this is a solved problem, I figured I'd ask if there is a better way.
select max bid, min ask by stock,1+minute from 0!select by 1 xbar time.minute,stock,exchange from taq
This will give you the max bid, min ask across exchanges at the 1-minute interval in column minute.
The only tricky thing is the select by 1 xbar time.minute. When you do select by with no aggregation, it will just return the last row of each group. So really what this means is select last time, last bid, last ask ... by 1 xbar time.minute etc.
So after we get the last values by minute and exchange, we just get the min/max across exchanges for that minute.