95% Confidence Interval for a weighted column in KDB+/Q

I have a table like the following where each row corresponds to an execution:
table:([]name:`account1`account1`account1`account2`account2`account1`account1`account1`account1`account2;
Pnl:13.7,13.2,74.1,57.8,29.9,15.9,7.8,-50.4,2.3,-16.2;
markouts:.01,.002,-.003,-.02,.004,.001,-.008,.04,.011,.09;
notional:1370,6600,-24700,-2890,7475,15900,-975,-1260,210,-180)
I'd like to create a 95% confidence interval of Pnl for `account1. The problem is, Pnl is the product of markouts and notional values, so it's weighted and the mean wouldn't be a simple mean. I'm pretty sure the standard deviation calculation would also be a bit different than normal.
Is there a way to still do this in KDB? I'm not really sure how to go about this. Any advice is greatly appreciated!

Statistics isn't my strong point, but most of the standard calculation can be done with built-in keywords:
q)select { avg[x] + -1 1* 1.960*sdev[x]%sqrt count x } Pnl by name from table
name | Pnl
--------| ------------------
account1| -15.90856 37.76571
account2| -18.45611 66.12278
https://code.kx.com/q/ref/avg/#avg
https://code.kx.com/q/ref/sqrt/
https://code.kx.com/q/ref/dev/#sdev
As shown in the KX reference, sdev is calculated as follows, which you could use as a base for your own version:
{sqrt var[x]*count[x]%-1+count x}
There is also wavg if you want to do weighted average:
https://code.kx.com/q/ref/avg/#wavg
Edit: Assuming this works by substituting in the weighted calculations, here's a weighted sdev, wsdev, that I've thrown together:
table:update weight:2 6 3 5 2 4 5 6 7 3 from table;
wsdev:{[w;v] sqrt (sum ( (v-wavg[w;v]) xexp 2) *w)%-1+sum w }
// substituting avg and sdev above
w95CI:{[w;v] wavg[w;v] + -1 1* 1.960*wsdev[w;v]%sqrt count v };
select w95CI[weight;Pnl] by name from table
name | Pnl
--------| ------------------
account1| -19.70731 28.47701
account2| -8.201463 68.24146
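The weight column above is made up for illustration. As a minimal sketch of how the same functions might be applied to the columns in the question, the example below weights markouts by absolute notional. Whether abs notional is the right weight is an assumption, as is keeping the answer's `-1+sum w` denominator (which treats weights like frequency counts) and its `sqrt count v` scaling.

```q
/ table as given in the question
table:([]name:`account1`account1`account1`account2`account2`account1`account1`account1`account1`account2;
  Pnl:13.7 13.2 74.1 57.8 29.9 15.9 7.8 -50.4 2.3 -16.2;
  markouts:.01 .002 -.003 -.02 .004 .001 -.008 .04 .011 .09;
  notional:1370 6600 -24700 -2890 7475 15900 -975 -1260 210 -180)

/ weighted standard deviation and 95% interval, as defined in the answer
wsdev:{[w;v] sqrt (sum w*(v-wavg[w;v]) xexp 2)%-1+sum w}
w95CI:{[w;v] wavg[w;v] + -1 1*1.960*wsdev[w;v]%sqrt count v}

/ assumption: weight each markout by the absolute notional of its execution
select ci:w95CI[abs notional;markouts] by name from table
```

The result is an interval for the notional-weighted mean markout per account rather than for Pnl itself, so it should be read as a starting point for the weighting discussion, not a final answer.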

Related

Rounding timestamp to the nearest 30 seconds

My table is as follows:
t: ([]dt: 2021.10.25T09:30:28 2021.10.25T09:30:32;price:9.99 10.00)
I wish to round the timestamp to the nearest 30sec mark.
I tried using xbar like so:
update roundedDt: 30 xbar dt.second from t
However it seems to have floored the results.
The desired result should be 09:30:30 for both rows.
How can one round to the nearest 30 second mark?
Jonathon's answer is the most flexible, since it lets you modify the rounding for more than just seconds. An alternative simple solution for seconds only is to offset by 15:
q)update roundedDt:30 xbar 15+dt.second from t
dt price roundedDt
---------------------------------------
2021.10.25T09:30:28.000 9.99 09:30:30
2021.10.25T09:30:32.000 10 09:30:30
Edit: If you want the full datetime rounded, I would convert it to a timestamp, which is easier to work with, and adjust the offset and xbar to match (in nanoseconds):
q)update roundedDt:30000000000 xbar 15000000000 + `timestamp$dt from t
dt price roundedDt
-----------------------------------------------------------
2021.10.25T09:30:28.000 9.99 2021.10.25D09:30:30.000000000
2021.10.25T09:30:32.000 10 2021.10.25D09:30:30.000000000
2020.10.25T23:59:59.000 9.99 2020.10.26D00:00:00.000000000
2020.10.26T00:00:01.000 10 2020.10.26D00:00:00.000000000
You can try something like this:
update roundedDt:?[(`ss$dt)within(0;14);`time$(`int$`time$dt)-1000*`ss$dt;
?[(`ss$dt)within(15;44);`time$30000+(`int$`time$dt)-1000*`ss$dt;`time$60000+(`int$`time$dt)-1000*`ss$dt]] from t
You could use a modified version of xbar that rounds to nearest int instead of flooring:
q)xbar2:{type[y]$x*"j"$y%x:$[16h=abs type x;"j"$x;x]}
q)update roundedDt:xbar2[30;dt.second] from t
dt price roundedDt
---------------------------------------
2021.10.25T09:30:28.000 9.99 09:30:30
2021.10.25T09:30:32.000 10 09:30:30
Note that because this function is defined in root namespace you must use bracket notation (xbar2[30;dt.second]). If you wish to use infix notation (30 xbar2 dt.second), you'll need to define the function in .q namespace i.e. .q.xbar2:{type[y]$x*"j"$y%x:$[16h=abs type x;"j"$x;x]}.
xbar2 is based on the original xbar. Where xbar uses div, which floors the result, xbar2 uses %, which produces a float; this is then cast to a long, which rounds to the nearest integer.
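The difference between flooring with div and rounding with a cast can be seen on plain integers. This is a small illustrative session using the two second values from the question (28 and 32):

```q
q)28 32 div 30        / integer division floors
0 1
q)"j"$28 32%30        / float division, then cast to long rounds to nearest
1 1
q)30*28 32 div 30     / what xbar effectively does
0 30
q)30*"j"$28 32%30     / what xbar2 effectively does
30 30
```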
What about this solution:
/ x is your timestamp
/ y is the timebucket (in seconds)
.time.round:{
:"z"$+[`date$x;`time$1e3*y*`int$%[`time$x;y*1e3]];
};
As an example, if you want to round to the nearest 30 seconds, use it as follows:
ts1:2020.10.30T10:32:35
.time.round[ts1;30]
In your case, simply type:
t[`round_time]:{.time.round[x;30]} each t[`dt]
As a side note, some of the proposed solutions would round timestamps like 2020.10.25T23:59:59 and 2020.10.26T00:00:01 to 24:00:00 and 00:00:00 respectively, which is presumably not what we want.

Spark Window Functions That depend on itself

Say I have a column of sorted timestamps in a DataFrame. I want to write a function that adds a column to this DataFrame that cuts the timestamps into sequential time slices according to the following rules:
start at the first row and keep iterating down to the end
for each row, if you've walked n number of rows in the current group OR you have walked more than time interval t in the current group, make a cut
return a new column with the group assignment for each row, which should be an increasing integer
In English: each group should be no more than n rows, and should not span more than t time
For example: (Using integers for timestamps to simplify)
INPUT
time
---------
1
2
3
5
10
100
2000
2001
2002
2003
OUTPUT (after slice function with n = 3 and t = 5)
time | group
----------|------
1 | 1
2 | 1
3 | 1
5 | 2 // cut because there were no cuts in the last 3 rows
10 | 2
100 | 3 // cut because 100 - 5 > 5
2000 | 4 // cut because 2000 - 100 > 5
2001 | 4
2002 | 4
2003 | 5 // cut because there were no cuts in the last 3 rows
I have a feeling this can be done with window functions in Spark. After all, window functions were created to help developers compute moving averages. You'd basically calculate an aggregate (in this case an average) of a column (stock price) per window of n rows.
The same should be achievable here. For each row, if the last n rows contain no cut, or the timespan between the last cut and the current timestamp is greater than t, then cut = true, otherwise cut = false. But what I can't seem to figure out is how to make the window function aware of itself. That would be like the moving average of a particular row being aware of the previous moving average.
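To make the rule concrete, here is a minimal sequential sketch of the slicing logic, written in q with scan rather than as a Spark window function (it is not a Spark solution). The names step and slice and the layout of the running state are hypothetical, and the time limit is measured from the first row of the current group, as in the "100 - 5 > 5" comment in the example.

```q
/ state s is (group id; rows in current group; start time of current group)
/ cut when the group already has n rows, or x is more than t past the group start
step:{[n;t;s;x] $[(s[1]>=n) or t<x-s 2; (1+s 0;1;x); (s 0;1+s 1;s 2)]}

/ seed the row count with n so that the first row always opens group 1
slice:{[n;t;times] first each step[n;t]\[(0;n;first times);times]}

slice[3;5;1 2 3 5 10 100 2000 2001 2002 2003]
/ 1 1 1 2 2 3 4 4 4 5
```

Each row's group depends on the state left by the previous row, which is why the computation is expressed as a running fold instead of a fixed-size window aggregate.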

Joining multiple times in kdb

I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?
See here for common errors. Your loop error is because you're setting views with value, not globals. Inside a function value evaluates as if it's outside the function so you don't need the ::.
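The distinction between a view and a global assignment can be illustrated with a short session. The names a, v and g are hypothetical and unrelated to the tables in the question:

```q
q)a:1
q)v::a+1       / at the top level, :: defines a view that depends on a
q)v
2
q)a:10
q)v            / the view is recalculated because a changed
11
q){g::5;}[]    / inside a function, :: assigns to the global g
q)g
5
```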
That said there's lots of room for improvement, here's a few pointers.
You don't need value at all in your case. For example, the first line can be reduced to the following (I'm assuming mdDates is some kind of function that works out the date from an integer, and DateInd is some kind of global):
{[nday]
temp0:update date:mdDates[nday;DateInd] from orders;
....
} each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very noddily) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4
Seems like you wanted to have p0 to p5 columns with prices corresponding to date+0 to date+5 dates.
Using adverb over to iterate over 0 to 5 days :
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date sym qty p0 p1 p2 p3 p4
---------------------------------------------------------------
2018.01.01 A 0 10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G 3 13.1917 8.515314 19.018 19.18736 6.64622
2018.01.03 A 2 6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A 7 18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G 0 19.18736 6.64622 11.16619 2.437314 4.698096

how to handle type F in a table in Q/KDB

I started learning q/KDB a while ago, so forgive me in advance for a trivial question, but I am facing the following problem and don't know how to solve it.
I have a table named "res" showing the side, the sum of orders and the average price of some symbols
sym side | sum_order avg_price
----------| -------------------
ALPHA B | 95109 9849.73
ALPHA S | 91662 9849.964
BETA B | 47 9851.638
BETA S | 60 9853.383
with these types
c | t f a
---------| -----
sym | s p
side | s
sum_order| f
avg_price| f
I would like to calculate the closed and open positions, the average points made by the closed position, and the average price of the open position.
I have used the following query, which I believe is pretty bizarre (I am sure there is a more professional way to do it), but it works as expected
position_summary:select
close_position:?[prev[sum_order]>sum_order;sum_order;prev[sum_order]],
average_price:avg_price-prev[avg_price],
open_pos:prev[sum_order]-sum_order,
open_wavgprice:?[sum_order>next[sum_order];avg_price;next[avg_price]][0]
by sym from res
giving me the following table
sym | close_position average_price open_pos open_wavgprice
----------| ----------------------------------------------------
ALPHA | 91662 0.2342456 3447 9849.73
BETA | 47 1.745035 -13 9853.38
and types are
c | t f a
--------------| -----
sym | s s
close_position| F
average_price | F
open_pos | F
open_wavgprice| f
Now my problem starts here. Imagine I join the position_summary table with another table, appending another column "current_price" of type f.
What I want to do is determine the points of the open positions.
I have tried this way:
select
?[open_pos>0;current_price-open_wavgprice;open_wavgprice-current_price]
from position_summary
but I got a 'type error,
surely because open_pos is type F while open_wavgprice and current_price are type f. I have searched the internet but did not find much about type F.
First: how can I handle this? I have tried casting and using raze, but with no effect, and moreover I am not sure they are right for this particular case.
Second: is there a better way to use "if-then" logic when querying tables (for example, in plain English: if this row of this column meets a condition, then take the previous/next value of another column, or the second or third previous/next value)?
Thank you for your help
Let me rephrase your question using a slightly simpler table:
q)show res:([sym:`A`A`B`B;side:`B`S`B`S]size:95 91 47 60;price:49.7 49.9 51.6 53.3)
sym side| size price
--------| ----------
A B | 95 49.7
A S | 91 49.9
B B | 47 51.6
B S | 60 53.3
You are trying to find the closing position for each symbol using a query like this:
q)show summary:select close:?[prev[size]>size;size;prev[size]] by sym from res
sym| close
---| -----
A | 91
B | 47
The result seems to have one number in each row of the "close" column, but in fact it has two. You may notice an extra space before each number in the display above or you can display the first row
q)first 0!summary
sym | `A
close| 0N 91
and see that the first row in the "close" column is 0N 91. Since the missing values such as 0N are displayed as a space, it was hard to see them in the earlier display.
It is not hard to understand how you've got these two values. Since you select by sym, each column gets grouped by symbol and for the symbol A, you have
q)show size:95 91
95 91
and
q)prev size
0N 95
that leads to
q)?[prev[size]>size;size;prev[size]]
0N 91
(Recall that 0N is smaller than any other integer.)
As a side note, ?[a>b;b;a] is element-wise minimum and can be written as a & b in q, so your conditional expression could be written as
q)size & prev size
0N 91
Now we can see why ? gave you the type error
q)close:exec close from summary
q)close
91
47
While the display is deceiving, "close" above is a list of two vectors:
q)first close
0N 91
and
q)last close
0N 47
The vector conditional does not support that:
q)?[close>0;10;20]
'type
[0] ?[close>0;10;20]
^
One can probably cure that by using each:
q)?[;10;20]each close>0
20 10
20 10
But I don't think this is what you want. Your problem started when you computed the summary table. I would expect the closing position to be the sum of "B" orders minus the sum of "S" orders that can be computed as
q)select close:sum ?[side=`B;size;neg size] by sym from res
sym| close
---| -----
A | 4
B | -13
Now you should be able to fix the rest of the columns in your summary query. Just make sure that you use an aggregation function such as sum in the expression for every column.
Type F means the "cell" in the column contains a vector of floats rather than an atom. So your column is actually a vector of vectors rather than a flat vector.
In your case you have a vector of size 1 in each cell, so in your case you could just do:
select first each close_position, first each average_price.....
which will give you a type f.
I'm not 100% on what you were trying to do in the first query, and I don't have a q terminal to hand to check but you could put this into your query:
select close_position:?[prev[sum_order]>sum_order;last sum_order; last prev[sum_order].....
i.e. get the last sum_order in the list.
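As a small illustration of how a type F column arises and how an aggregation avoids it, consider the sketch below. The table t and the column d are hypothetical stand-ins for the tables in the question:

```q
q)t:([]sym:`A`A`B`B;size:95 91 47 60f)
q)s:select d:size-prev size by sym from t      / no aggregation: one float vector per sym
q)type exec d from s                           / 0h, a list of vectors (meta shows F)
0h
q)s2:select d:last size-prev size by sym from t / last reduces each group to an atom
q)type exec d from s2                          / 9h, a flat float vector (meta shows f)
9h
q)exec d from s2
-4 13f
```

In this sketch the first element of each group is null because prev has nothing before it, so last (rather than first) picks out the meaningful value.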

kdb+/q: Apply iterative procedure with updated variable to a column

Consider the following procedure f:{[x] ..} with starting value a:0:
Do something with x and a. The output is saved as the new version of a, and the output is returned by the function
For the next input x, redo the procedure but now with the new a.
For a single value x, this procedure is easily constructed. For example:
a:0;
f:{[x] a::a+x; :a} / A simple example (actual function more complicated)
However, how do I make such a function such that it also works when applied on a table column?
I am clueless how to incorporate this step for 'intermediate saving of a variable' in a function that can be applied on a column at once. Is there a special technique for this? E.g. when I use a table column in the example above, it will simply calculate a+x with a:0 for all rows, opposed to also updating a at each iteration.
No need to use global vars for this - can use scan instead - see here.
Example --
Generate a table -
q)t:0N!([] time:5?.z.p; sym:5?`3; price:5?100f; size:5?10000)
time sym price size
-----------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496
2010.10.30D10:15:14.157553088 obp 31.59526 1728
2005.11.01D21:15:54.022395584 dhc 34.10485 5486
2005.03.06D21:05:07.403334368 mho 86.17972 2318
Example with a simple accumulator - note, the function has access to the other args if needed (see next example):
q)update someCol:{[a;x;y;z] (a+1)}\[0;time;price;size] from t
time sym price size someCol
-------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 1
2007.05.21D04:26:13.021438816 llm 7.347808 496 2
2010.10.30D10:15:14.157553088 obp 31.59526 1728 3
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 4
2005.03.06D21:05:07.403334368 mho 86.17972 2318 5
Say you wanted to get cumulative size:
q)update cuSize:{[a;x;y;z] (a+z)}\[0;time;price;size] from t
time sym price size cuSize
------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496 4490
2010.10.30D10:15:14.157553088 obp 31.59526 1728 6218
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 11704
2005.03.06D21:05:07.403334368 mho 86.17972 2318 14022
If you wanted more than one var passed through the scan, can pack more values into the first var, by giving it a more complex structure:
q)update cuPriceAndSize:{[a;x;y;z] (a[0]+y;a[1]+z)}\[0 0;time;price;size] from t
time sym price size cuPriceAndSize
--------------------------------------------------------------
2002.04.04D18:06:07.889113280 cmj 29.07093 3994 29.07093 3994
2007.05.21D04:26:13.021438816 llm 7.347808 496 36.41874 4490
2010.10.30D10:15:14.157553088 obp 31.59526 1728 68.014 6218
2005.11.01D21:15:54.022395584 dhc 34.10485 5486 102.1188 11704
2005.03.06D21:05:07.403334368 mho 86.17972 2318 188.2986 14022
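For the simple a+x function in the question, the same scan pattern works on a plain list, with the starting value a passed as the left argument. The list v and the cap of 10 in the last line are hypothetical, added only to show a rule that the built-in sums cannot express:

```q
q)v:3 5 2 7
q)0 {x+y}\ v        / running total, starting from a=0
3 8 10 17
q)sums v            / built-in equivalent for plain addition
3 8 10 17
q)0 {10&x+y}\ v     / same accumulator, but capped at 10 on every step
3 8 10 10
```

In each lambda, x is the value carried over from the previous step and y is the current item, so no global variable is needed.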
@MdSalih's solution is correct. I am just explaining what could have gone wrong with the global variable in your case, and how to fix that.
q) t:([]id: 1 2)
q)a:1
I think you might have been using it like this:
q) select k:{x:x+a;a::a+1;:x} id from t
output:
k
--
2
3
The value of a is now 2, which means the function executed only once. The reason is that we passed the full id column to the function as a list, and + is atomic, so it operates on the whole list at once. In the following example, 2 is added to every item in the list:
q) 2 + (1;3;5)
The correct way to use it is with each, so that the function runs once per row (output shown with a reset to 1 beforehand):
q)select k:{x:x+a;a::a+1;:x} each id from t
output:
k
--
2
4