most performant way to get asof price given a list of timestamps - kdb

I have a list of timestamps spanning multiple dates ( no sym, just timestamps). These can be 1000/2000 at times, spanning multiple dates.
What's the most performant way to hit an hdb and get the closest price available for each timestamp?
select from hdbtable where date = x -> can be over 60mm rows.
To do this for each date and then an aj on top is very poor.
Any suggestions are welcome

The most performant way to aj, assuming the HDB follows the standard conventions of date-partitioned with `p# attribute on sym, is
aj[`sym`time;select sym,time,other from myTable where …;select sym,time,price from prices where date=x]
There should be no additional filters/where-clause on the prices table other than date.
You're saying you have no syms just timestamps but what does that mean? Does that mean you want the price of all syms at that timestamp or you want the last price of any sym at that timestamp? The former is easy as you can just join your timestamps to your distinct sym list and use that as the "left" table in the aj. The latter will not be as easy as the HDB data likely isn't fully sorted on time, it's likely sorted by sym and then time. In that case you might have to again join your timestamps to your distinct sym list and aj for the price for all syms and from that result take the one with the max time.
So I guess it depends on a few factors. More info might help.
EDIT: suggestion based on further discussion:
targetTimes:update targetTime:time from ([]time:"n"$09:43:19 10:27:58 13:12:11 15:34:03);
res:aj0[`sym`time;(select distinct sym from trade where date=2021.01.22)cross targetTimes;select sym,time,price from trade where date=2021.01.22];
select from res where not null price,time=(max;time)fby targetTime
sym time targetTime price
----------------------------------------------------
AQMS 0D09:43:18.999937967 0D09:43:19.000000000 4.5
ARNA 0D10:27:57.999842638 0D10:27:58.000000000 76.49
GE 0D15:34:02.999979520 0D15:34:03.000000000 11.17
HAL 0D13:12:10.997972224 0D13:12:11.000000000 18.81
This gives the price of whichever sym is closest to your targetTime. Then you would peach this over multiple dates:
{targetTimes: ...;res:aj0[...];select from res ...}peach mydates;
Note that what's making this complicated is your requirement that it be the price of any sym that's closest to your sym-less targetTimes. This seems strange - usually you would want the price of sym(s) as of a particular time, not the price of anything closest to a particular time.

You can use multithreading to optimize your query, with each thread being assigned a date to process, essentially utilising more than just one core:
{select from hdbtable where date = x} peach listofdates
More info on multithreading can be found here, and more info on peach can be found here

Related

How to find difference of items in same column by group?

The dataset is university rankings and I have a column 'world rank' and 'year'. I want to create a new field called 'rank difference' to see the difference in rank of universities from 2018 to 2011. Eg:
Name Year World Rank
Harvard 2011 4
Harvard 2018 5
For the above, rank difference would be -1. The data set contains a lot of universities and I am not sure how to perform LOD or any other solution for this.
You can use “if” inside a calculation to return a value in certain cases, and to evaluate to null otherwise. One phrase for this is a filtered calculation. Since aggregagation calculations like min(), max(), avg() etc silently ignore null values, you can then use filtered calculations in aggregate calculations.
So assuming your data source has one row per university per year, and that you put [University] on some shelf as a dimension, then the following calculation will get the result you’re looking for. Since you only have a row per university per year, you could just as easily use max(), sum() or avg() instead of min()
min(if [Year] = 2018 then [World Rank] end) - min(if [Year] = 2011 then [World Rank] end)
To extend, you could use a user supplied parameter instead of hard coded start and end years, or an LOD calculation to find the earliest and latest records for each university in your data set.

kdb q - count subtable between 2 dates

I have a table
t:`date xasc ([]date:100?2018.01.01+til 100;price:100?til 100;acc:100?`a`b)
and would like to have a new column in t which contains the counts of entries in t where date is in the daterange of the previous 14 days and the account is the same as in acc. For example, if there is a row
date price acc prevdate prevdate1W countprev14
2018.01.10 37 a 2018.01.09 2018.01.03 ?
then countprev14 should contain the number of observations between 2018.01.03 and 2018.01.09 where acc=a
The way I am currently doing it can probably be improved:
f:{[dates;ac;t]count select from t where date>=(dates 0),date<=(dates 1),acc=ac}[;;t]
(f')[(exec date-7 from t),'(exec date-1 from t);exec acc from t]
Thanks for the help
Another method is using a window join (wj1):
https://code.kx.com/q/ref/joins/#wj-wj1-window-join
dates:exec date from t;
d:(dates-7;dates-1);
wj1[d;`acc`date;t;(`acc`date xasc t;(count;`i))]
I think you're looking for something like this:
update count14:{c-0^(c:sums 1&x)y bin y-14}[i;date] by acc from t
this uses sums to get the running counts, bin to find the running count from 14 days prior, and then indexes back into the list of running counts to get the counts from that date.
The difference between the counts then and now are those from the latest 14 days.
Note the lambda here allows us to store the result from the sums easily and avoid unnecessary recomputation.

How to find the days having a drawdown greater than X bips?

What would be the most idiomatic way to find the days with a drawdown greater than X bips? I again worked my way through some queries but they become boilerplate ... maybe there is a simpler more elegant alternative:
q)meta quotes
c | t f a
----| -----
date| z
sym | s
year| j
bid | f
ask | f
mid | f
then I do:
bips:50;
`jump_in_bips xdesc distinct select date,jump_in_bips from (update date:max[date],jump_in_bips:(max[mid]-min[mid])%1e-4 by `date$date from quotes where sym=accypair) where jump_in_bips>bips;
but this will give me the days for which there has been a jump in that number of bips and not only the drawdowns.
I can of course put this result above in a temporary table and do several follow up selects like:
select ... where mid=min(mid),date=X
select ... where mid=max(mid),date=X
to check that the max(mid) was before the min(mid) ... is there a simpler, more idiomatic way?
I think maxs is the key function here, which allows you to maintain a running historical maximum, and you can compare your current value to that maximum. If you have some table quote which contains some series of mids (mids) and timestamps (date), the following query should return the days where you saw a drawdown greater than a certain value:
key select by `date$date from quote
where bips<({(maxs[x]-x)%1e-4};mid) fby `date$date
The lambda {(maxs[x]-x)%1e-4} is doing the comparison at each point to the historical maximum and checking if it's greater than bips, and fby lets you apply the where clause group-wise by date. Grouping with a by on date and taking the key will then return the days when this occurred.
If you want to preserve the information for the max drawdown you can use an update instead:
select max draw by date from
(update draw:(maxs[mid]-mid)%1e-4 by date from #[quote;`date;`date$])
where bips<draw
The date is updated separately with a direct modification to quote, to avoid repeated casting.
Difference between max and min mids for given date may be both increase and drawdown. Depending on if max mid precedes min. Also, as far a sym columns exists, I assume you may have different symbols in the table and want to get drawdowns for all of them.
For example if there are 3 quotes for given day and sym: 1.3000 1.2960 1.3010, than the difference between 2nd and 3rd is 50 pips, but this is increase.
The next query can be used to get dates and symbols with drawdown higher than given threshold
select from
(select drawdown: {max maxs[x]-x}mid
by date, sym from quotes)
where drawdown>bips*1e-4
{max maxs[x]-x} gives maximum drawdown for given date by subtracting each mid for maximum of preceding mids.

How to optimize a batch pivotization?

I have a datetime list (which for some reason I call it column date) containing over 1k datetime.
adates:2017.10.20T00:02:35.650 2017.10.20T01:57:13.454 ...
For each of these dates I need to select the data from some table, then pivotize by a column t i.e. expiry, add the corresponding date datetime as column to the pivotized table and stitch together the pivotization for all the dates. Note that I should be able to identify which pivotization corresponds to a date and that's why I do it one by one:
fPivot:{[adate;accypair]
t1:select from volatilitysurface_smile where date=adate,ccypair=accypair;
mycols:`atm`s10c`s10p`s25c`s25p;
t2:`t xkey 0!exec mycols#(stype!mid) by t:t from t1;
t3:`t xkey select distinct t,tenor,xi,volofvol,delta_type,spread from t1;
result:ej[`t;t2;t3];
:result}
I then call this function for every datetime adates as follows:
raze {[accypair;adate] `date xcols update date:adate from fPivot[adate;accypair] }[`EURCHF] #/: adates;
this takes about 90s. I wonder if there is a better way e.g. do a big pivotization rather than running one pivotization per date and then stitching it all together. The big issue I see is that I have no apparent way to include the date attribute as part of the pivotization and the date can not be lost otherwise I can't reconciliate the results.
If you havent been to the wiki page on pivoting then it may be a good start. There is a section on a general pivoting function that makes some claims to being somewhat efficient:
One user reports:
This is able to pivot a whole day of real quote data, about 25 million
quotes over about 4000 syms and an average of 5 levels per sym, in a
little over four minutes.
As for general comments, I would say that the ej is unnecessary as it is a more general version of ij, allowing you to specify the key column. As both t2 and t3 have the same keying I would instead use:
t2 ij t3
Which may give you a very minor performance boost.
OK I solved the issue by creating a batch version of the pivotization that keeps the date (datetime) table field when doing the group by bit needed to pivot i.e. by t:t from ... to by date:date,t:t from .... It went from 90s down to 150 milliseconds.
fBatchPivot:{[adates;accypair]
t1:select from volatilitysurface_smile where date in adates,ccypair=accypair;
mycols:`atm`s10c`s10p`s25c`s25p;
t2:`date`t xkey 0!exec mycols#(stype!mid) by date:date,t:t from t1;
t3:`date`t xkey select distinct date,t,tenor,xi,volofvol,delta_type,spread from t1;
result:0!(`date`t xasc t2 ij t3);
:result}

Filter data by different time intervals

I need to filter my query with different time intervals like that:
...
where
date >= '2011-07-01' and date <='2011-09-30'
and date >='2012-07-01' and date >='2012-09-30'
I suppose such code is not good, because these dates conflicts with each other. But how to filter only these two intervals, skipping everything else? Is it even possible? Because if I query like this, I don't get any results.I tried to use BETWEEN, but it does same thing.
I bypassed this by extracting quarters from years and calculating only third quarter. But then other quarters sum is showed as zero and I can't ignore these rows that have sum column with zero value. I tried to filter where price > 0 (column where sum goes), but it says that column do not exist. So I put it whole FROM under '('')' brackets to make it calculate sum before where clause, but it still does give me error that such column do not exist.
Also if you need to see query I have now, I can post it, just tell me if it is needed.
I want to do this, because I need to compare third quarter of two different years (maybe I should use another approach).
You're not going to get any results because you can't have a date that's both within 7/1/2011 through 9/30/11 and after 7/1/2012 and after 9/30/12.
You can have a date that is either between 7/1/20122 and 9/30/2011 or between 7/1/2012 and 9/30/2012.
SELECT col1 FROM table1
WHERE date BETWEEN '7/1/2011' AND '9/30/2011' OR date BETWEEN '7/1/2012' AND '9/30/2012';