How to optimize a batch pivotization? - kdb

I have a datetime list (which for some reason I call date, as though it were a column) containing over 1k datetimes.
adates:2017.10.20T00:02:35.650 2017.10.20T01:57:13.454 ...
For each of these dates I need to select the data from some table, pivot it by a column t (i.e. expiry), add the corresponding datetime as a column to the pivoted table, and stitch together the pivots for all the dates. Note that I need to be able to identify which pivot corresponds to which date, which is why I do it one by one:
fPivot:{[adate;accypair]
 / rows for this single datetime and currency pair
 t1:select from volatilitysurface_smile where date=adate,ccypair=accypair;
 / smile columns to pivot into
 mycols:`atm`s10c`s10p`s25c`s25p;
 / pivot: one column per smile type (stype), keyed by expiry t
 t2:`t xkey 0!exec mycols#(stype!mid) by t:t from t1;
 / per-expiry static columns
 t3:`t xkey select distinct t,tenor,xi,volofvol,delta_type,spread from t1;
 result:ej[`t;t2;t3];
 :result}
I then call this function for every datetime in adates as follows:
raze {[accypair;adate] `date xcols update date:adate from fPivot[adate;accypair] }[`EURCHF] @/: adates;
This takes about 90s. I wonder if there is a better way, e.g. doing one big pivot rather than running one pivot per date and then stitching it all together. The big issue I see is that I have no apparent way to include the date as part of the pivot, and the date cannot be lost, otherwise I can't reconcile the results.

If you haven't been to the wiki page on pivoting then it may be a good start. There is a section on a general pivoting function that makes some claims to being somewhat efficient:
One user reports:
This is able to pivot a whole day of real quote data, about 25 million
quotes over about 4000 syms and an average of 5 levels per sym, in a
little over four minutes.
As for general comments, I would say that the ej is unnecessary, as it is a more general version of ij that allows you to specify the key column. Since both t2 and t3 have the same keying, I would instead use:
t2 ij t3
This may give you a very minor performance boost.

OK, I solved the issue by creating a batch version of the pivot that keeps the date (datetime) field in the group-by used to pivot, i.e. changing by t:t from ... to by date:date,t:t from .... It went from 90s down to 150 milliseconds.
fBatchPivot:{[adates;accypair]
 / rows for all requested datetimes in one select
 t1:select from volatilitysurface_smile where date in adates,ccypair=accypair;
 mycols:`atm`s10c`s10p`s25c`s25p;
 / pivot keyed by (date;t) so each datetime keeps its own slice
 t2:`date`t xkey 0!exec mycols#(stype!mid) by date:date,t:t from t1;
 / per-(date;expiry) static columns
 t3:`date`t xkey select distinct date,t,tenor,xi,volofvol,delta_type,spread from t1;
 result:0!(`date`t xasc t2 ij t3);
 :result}
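For reference, the call then collapses to a single invocation over the whole datetime list, replacing the raze-over-each above (same adates and currency pair as before):
result:fBatchPivot[adates;`EURCHF]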

Related

I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", all of whom have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real-world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store grows rapidly every day;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
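A minimal sketch of that idea in PostgreSQL, using a hypothetical helper table daily_factor_sums (the table and column names here are illustrative, not from the question):

-- Hypothetical running-sum table: cumulative sum and count of factor values
-- up to and including each day.
CREATE TABLE daily_factor_sums AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    COUNT(*)          OVER (ORDER BY factor_date) AS running_count
FROM daily_factors;

-- Average over an inclusive range = (sum at date_to minus sum just before
-- date_from) divided by the number of days in the range.
SELECT
    cd.customer_id,
    (s_to.running_sum   - COALESCE(s_from.running_sum, 0))
  / (s_to.running_count - COALESCE(s_from.running_count, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factor_sums s_to
  ON s_to.factor_date = cd.date_to
LEFT JOIN daily_factor_sums s_from
  ON s_from.factor_date = cd.date_from - 1;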

most performant way to get asof price given a list of timestamps

I have a list of timestamps spanning multiple dates (no sym, just timestamps). There can be 1000-2000 of them at times.
What's the most performant way to hit an hdb and get the closest price available for each timestamp?
select from hdbtable where date = x -> can be over 60mm rows.
Doing this for each date and then an aj on top performs very poorly.
Any suggestions are welcome.
The most performant way to aj, assuming the HDB follows the standard convention of date partitioning with the `p# attribute on sym, is:
aj[`sym`time;select sym,time,other from myTable where …;select sym,time,price from prices where date=x]
There should be no additional filters/where-clause on the prices table other than date.
You're saying you have no syms, just timestamps, but what does that mean? Do you want the price of all syms at that timestamp, or the last price of any sym at that timestamp? The former is easy: you can just join your timestamps to your distinct sym list and use that as the "left" table in the aj. The latter will not be as easy, as the HDB data likely isn't fully sorted on time; it's likely sorted by sym and then time. In that case you might have to again join your timestamps to your distinct sym list, aj for the price for all syms, and from that result take the one with the max time.
So I guess it depends on a few factors. More info might help.
EDIT: suggestion based on further discussion:
targetTimes:update targetTime:time from ([]time:"n"$09:43:19 10:27:58 13:12:11 15:34:03);
res:aj0[`sym`time;(select distinct sym from trade where date=2021.01.22)cross targetTimes;select sym,time,price from trade where date=2021.01.22];
select from res where not null price,time=(max;time)fby targetTime
sym  time                 targetTime           price
----------------------------------------------------
AQMS 0D09:43:18.999937967 0D09:43:19.000000000 4.5
ARNA 0D10:27:57.999842638 0D10:27:58.000000000 76.49
GE   0D15:34:02.999979520 0D15:34:03.000000000 11.17
HAL  0D13:12:10.997972224 0D13:12:11.000000000 18.81
This gives the price of whichever sym is closest to your targetTime. Then you would peach this over multiple dates:
{targetTimes: ...;res:aj0[...];select from res ...}peach mydates;
Note that what's making this complicated is your requirement that it be the price of any sym that's closest to your sym-less targetTimes. This seems strange - usually you would want the price of sym(s) as of a particular time, not the price of anything closest to a particular time.
You can use multithreading to optimize your query, with each thread being assigned a date to process, essentially utilising more than just one core:
{select from hdbtable where date = x} peach listofdates
More info on multithreading and on peach can be found in the kdb+ documentation.
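One practical caveat, not from the answer above: peach only runs the per-date queries in parallel if the process was started with secondary threads (the -s command-line flag in kdb+), e.g.:
/ start the process with e.g. 4 secondary threads: q yourscript.q -s 4
/ peach then distributes the dates across those threads
{select from hdbtable where date=x} peach listofdates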

Optimizing a WHERE clause with a dateadd function

For the business that I am working in I would like to get information on our customers. The base information I have on these customers is as follows:
Activation_Date, stored in a Loans table; datatype is datetime.
ActivityDate, stored in a CustomerLoanDailyActivity_Information table; datatype is date. (This is a daily loans table, for those interested: it is part of a datamart and stores, for each day that a customer has been active with our company, how much they have paid into their loan. So if a customer has an Activation_Date of 15-03-2017, there are ActivityDates in the CustomerLoanDailyActivity_Information table from 15-03-2017 up until now, and each ActivityDate has a record in another column, Sum_Paid_To_Date, of how much has been paid up to that ActivityDate.)
What I would like to know is how much each customer has paid 1, or 2, or 3, etc. months after their Activation_Date. So the query would look something like the following (slightly pseudo-code; the more important part is the WHERE clause).
SELECT
cldai.Sum_Paid_To_Date,
cldai.ActivityDate,
cldai.Customer_Account_Number
FROM
CustomerLoanDailyActivity_Information cldai
INNER JOIN
Loans l ON l.Customer_Account_Number = cldai.Customer_Account_Number
WHERE
(cldai.ActivityDate = CAST(l.Activation_Date AS date)
OR
cldai.ActivityDate = DATEADD(month, 1, CAST(l.Activation_Date AS date))
OR
cldai.ActivityDate = DATEADD(month, 2, CAST(l.Activation_Date AS date))
OR
cldai.ActivityDate = DATEADD(month, 3, CAST(l.Activation_Date AS date))
)
ORDER BY
l.Customer_Account_Number, cldai.ActivityDate ASC
So the problem is that this query is really, really slow (because of the WHERE clause and because the cldai table is big, ~6 GB) and it exits before any data is retrieved. Here are a couple of problems I have heard about, and possible solutions, that haven't worked so far.
The CAST function supposedly makes the query really slow because its result is compared with the ActivityDate column, which is indexed. I used CONVERT before, but that was also really slow. I feel like I need the convert/cast though, because ActivityDate is of type date and Activation_Date is of type datetime, so the time part of Activation_Date could prevent matches with ActivityDate (e.g. the Activation_Date for a given customer is 15-03-2017 09:00:00, so it would never match an ActivityDate of 15-03-2017 if that were converted to the datetime 15-03-2017 00:00:00, because the time parts differ).
"DateTime" evaluations have been suggested as a solution, but I have no clue how to apply them correctly.
I can't look at the execution plan because the DBA has blocked me from seeing that.
Any ideas on how to make this query perform more quickly? Any help would be greatly appreciated.
So a massive speedup was obtained by using a LEFT JOIN instead of an INNER JOIN and by not ordering the data on the server but on the client side. This reduced the query time from about an hour and 10 minutes to about 1 minute. It seems unbelievable but it's what happened.
If you are guaranteed to have a record for each day, you could use the row_number() function to number the rows within each group of customer loan repayment records, and then retrieve rows 1, 31, 61 and 91. This would avoid any date manipulation.
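A rough sketch of that row_number() approach, assuming (as the question describes) one activity row per customer per day starting at the activation date, with table and column names taken from the question:

WITH numbered AS (
    SELECT
        cldai.Customer_Account_Number,
        cldai.ActivityDate,
        cldai.Sum_Paid_To_Date,
        -- number each customer's activity rows in date order;
        -- row 1 is the activation date, row 31 is ~1 month later, etc.
        ROW_NUMBER() OVER (
            PARTITION BY cldai.Customer_Account_Number
            ORDER BY cldai.ActivityDate
        ) AS rn
    FROM CustomerLoanDailyActivity_Information cldai
)
SELECT Customer_Account_Number, ActivityDate, Sum_Paid_To_Date
FROM numbered
WHERE rn IN (1, 31, 61, 91);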
How about splitting this up into two steps? Step one: build a table with the four target dates for each customer. Step two: join this to your main CustomerLoanDailyActivity_Information table on date and customer account number. The second step would have a much simpler join, just an = between ActivityDate and the date entry in the table you have built.
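A hedged sketch of that two-step idea, written as T-SQL since the question uses DATEADD (the temp table name and the month offsets 0-3 are illustrative; the other names follow the question):

-- Step 1: one row per customer per target date (activation date plus 0/1/2/3 months)
SELECT
    l.Customer_Account_Number,
    DATEADD(month, m.n, CAST(l.Activation_Date AS date)) AS TargetDate
INTO #CustomerTargetDates
FROM Loans l
CROSS JOIN (VALUES (0), (1), (2), (3)) AS m(n);

-- Step 2: plain equality join against the big activity table
SELECT
    cldai.Customer_Account_Number,
    cldai.ActivityDate,
    cldai.Sum_Paid_To_Date
FROM #CustomerTargetDates t
INNER JOIN CustomerLoanDailyActivity_Information cldai
    ON  cldai.Customer_Account_Number = t.Customer_Account_Number
    AND cldai.ActivityDate = t.TargetDate;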

KDB Converting Subselect to Q Query

We have a Q query running on tick data which consolidates to OHLC on 1-minute bars.
select subsel:(
exec last datetime.date+1 xbar datetime.minute.z.Z
from `base
where instrument=`GBPUSD,
datetime=datetime.date+1 xbar datetime.minute.z.Z),
max(datetime),
min(datetime),
Open:first price,
High:max price,
Low:min price,
Close:last price,
Volume:count(i)
by DT:($)datetime.date+1 xbar datetime.minute.z.Z
from `base
where instrument=`GBPUSD,
datetime>=2017.07.03T10:20:00.00,
datetime<2017.07.03T10:20:59.999
The problem is that the xbar'd date is synthetic on both the main table and the 'subselect'; the exec's "datetime=" needs to reference the main table, and I cannot find an alias approach to use. I considered an ej, but as both sides are synthetic I could not find the right construct for that either.
There are several issues with your query before we even get to the subselect. First, datetime.minute.z.Z is invalid syntax; you probably don't need the .z.Z suffix there. Second, 1 xbar is redundant: 1 xbar x is x for integer x, and datetime.minute is integer. You can just do datetime.date+datetime.minute to get datetimes rounded to minutes. (Note that if you use timestamps, as you should, rounding would simply be 0D00:01 xbar timestamp; for datetimes you would have to precompute a minute as U:reciprocal 24*60 and use that with xbar, i.e. U xbar datetime.) Third, you cast the xbar'd timestamps to strings in the by clause; if you really want them as strings, do it in a separate update after the aggregation. Finally, there are some minor issues such as redundant parentheses and ($), which in q should be spelled as string.
Now, back to the subselect. I think once you resolve the issues highlighted above, you will find that you don't need the subquery at all. The result will already have the xbar'd timestamps in the key column. If you want the result as a regular table, just use 0!.
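A minimal sketch of what the query might look like with those points addressed, staying with the datetime column from the question (maxDT and minDT are illustrative names for the question's unnamed max/min columns; switch to timestamps and 0D00:01 xbar if you can):

/ round each datetime to its minute in the by clause; no subselect needed
bars:select maxDT:max datetime, minDT:min datetime,
    Open:first price, High:max price, Low:min price, Close:last price,
    Volume:count i
  by DT:datetime.date+datetime.minute
  from `base
  where instrument=`GBPUSD,
    datetime>=2017.07.03T10:20:00.000,
    datetime<2017.07.03T10:20:59.999;
/ unkey, and only convert DT to strings afterwards if really required
update DT:string DT from 0!bars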

Getting the biggest change in data in postgres table

We're collecting lots of sensor data and logging them to a postgres DB.
Basic schema - cut down:
id | BIGINT PK
sensor-id| INT FK
location-id | INT FK
sensor-value | NUMERIC(0,2)
last-updated | TIMESTAMP_WITH_TIMEZONE
I'm trying to get the biggest change in sensor data over the last day. By that I mean: out of all the sensors, which sensor ids (say 4, 5, 6, 7) changed the most compared to the previous day. Before that, I'm trying to get a SQL query to figure out the delta between the previous reading and the latest reading.
I thought maybe the lead and lag functions would help, but my query doesn't quite give me the result I was after:
SELECT
srd.last_updated,
spi.title,
lead(srd.value) OVER (ORDER BY srd.sensor_id DESC) as prev,
lag(srd.value) OVER (ORDER BY srd.sensor_id DESC) as next
FROM
sensor_rt_data srd
join sensor_prod_info spi on srd.sensor_id = spi.id
where srd.last_updated >= NOW() - '1 day'::INTERVAL -- current_date - 1
ORDER BY
srd.last_updated DESC
Simple dataset (making this up now because I can't log in to the DB right now):
id|sensor,location,value,updated
1|1,1,24,'2017-04-28 19:30'
2|1,1,22,'2017-04-27 19:30'
3|2,1,35,'2017-04-28 19:30'
4|2,1,33,'2017-04-28 08:30'
5|2,1,31,'2017-04-27 19:30'
6|1,1,25,'2017-04-26 19:30'
Forgetting the join (that's for the user-friendly sensor tag name field that staff need, and the location), how do I work out which sensor has reported the biggest change in temperature over a time series when the readings are grouped by sensor-id?
I'd be expecting:
updated,sensor,prev,next
'2017-04-28 19:30',1,24,22
'2017-04-28 19:30',2,33,31
(then from that, I can subtract and order to work out the top 10 sensors that have changed the most)
I noticed that Postgres 9.6 has some other functions too, but I want to try to get Lead/Lag working first.
Window functions aren't the best fit for this kind of task. Try this:
select sensor, max(value)-min(value) as value_change
from sensordata
where updated>=?-'1 day'::interval
group by sensor
order by value_change desc
limit 1;
There's not much use for indexes besides one on updated for this kind of query. It would probably be possible to use a specially crafted index if you were looking for the largest change within a calendar day instead of the last 24 hours.