How does wavg work in depth using a vectorised approach? - kdb

The objective of the snippet below is to evaluate weighted mid for n levels of an order book. The code snippet is from the book Machine Learning and Big Data with kdb+/q (2020 Wiley).
n:10;
quote: ([] sym: n?`A`B; time: asc n?0t; bid1: n?10f; bidSize1: n?100 );
update bid2: 0 | bid1 - .1 * n ? 10, bidSize2: n?100, ask1: bid1 + .2 * n ? 10, askSize1: n?100 from `quote;
update ask2: ask1 + .1 * n ? 10, askSize2: n?100 from `quote;
select sym,time, wmid: ( bidSize1; bidSize2; askSize1; askSize2 ) wavg (bid1; bid2; ask1; ask2) from quote
I would like to understand a generic rule for how the wavg method works in-depth for lists of vectors. Could you please help me? Appreciate your help.

The docs for wavg are here: https://code.kx.com/q/ref/avg/#wavg
From these docs we can see that calling wavg is equivalent to calling the function {(sum x*y)%sum x}
Using your example:
q)res1:select sym,time, wmid: ( bidSize1; bidSize2; askSize1; askSize2 ) wavg (bid1; bid2; ask1; ask2) from quote;
q)res2:select sym,time, wmid:{(sum x*y)%sum x} [( bidSize1; bidSize2; askSize1; askSize2 );(bid1; bid2; ask1; ask2)] from quote;
q)res1 ~ res2
1b
So in your example we multiply bidSize1*bid1, bidSize2*bid2, etc., sum these products, then divide by the sums of the sizes, e.g. (bidSize1[0] + bidSize2[0] + askSize1[0] + askSize2[0]; bidSize1[1] + bidSize2[1] + askSize1[1] + askSize2[1]; etc...). Because sum applied to a list of vectors adds them item by item, the result is one weighted average per row of the table.
I'm not sure if there is a more general way of describing this, but the above may help you understand the nuts and bolts of what's going on.
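The rule above can be sketched outside q as well. The following is a minimal Python illustration of the same arithmetic, not how kdb+ implements wavg; the function name wavg_columns and the sample numbers are made up for the example, and it assumes every row has a non-zero total weight.

```python
def wavg_columns(weights, values):
    """Row-wise weighted average, mirroring {(sum x*y)%sum x} on lists of vectors.

    weights and values are lists of equal-length vectors, one vector per
    book level (hypothetical helper, for illustration only).
    """
    n_rows = len(values[0])
    result = []
    for row in range(n_rows):
        # sum x*y: multiply matching weight/value items, then add across levels
        numerator = sum(w[row] * v[row] for w, v in zip(weights, values))
        # sum x: total weight across levels for this row
        denominator = sum(w[row] for w in weights)
        result.append(numerator / denominator)
    return result


# Made-up two-row book: bidSize1, bidSize2, askSize1, askSize2
sizes = [[10, 20], [30, 40], [10, 20], [50, 20]]
# Made-up prices: bid1, bid2, ask1, ask2
prices = [[5.0, 6.0], [4.0, 5.0], [6.0, 7.0], [7.0, 8.0]]

wmid = wavg_columns(sizes, prices)
print(wmid)  # one weighted mid per row
```

Each output item uses only the values from its own row, which is why the q expression returns a vector as long as the table.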

TypeError("can't convert expression to float")

The code I wrote might look foolish because it integrates a derivative, but it is the foundation for other code I'm writing for acoustical analysis. That analysis involves integrating products of different derivative functions. For this purpose I'm using SciPy for integration and SymPy for differentiation, but it raises TypeError("can't convert expression to float"). Below is the code I wrote. I'm hoping for a solution.
import sympy
from sympy import *
from scipy.integrate import quad
var('r')
def diff(r):
    r=symbols('x')
    Z = 64.25 * r ** 5 - 175.71 *r ** 4 + 170.6 *r ** 3 - 71.103 *r ** 2 + 3 * r
    E=sympy.diff(Z,r)
    print(E)
    return E
R=quad(diff,0,1)[0]
print(R)
I have to say that I'm a bit confused by your statement "integration of a derivative function" since the fundamental theorem of calculus would suggest that this is just a waste of CPU cycles. I'll presume that you know what you're doing though and that you just want to be able to compute some definite integrals numerically...
The SymPy expression that you want to integrate is this:
In [33]: from sympy import *
In [34]: r = symbols("x") # Why are you calling this x?
In [35]: Z = 64.25 * r ** 5 - 175.71 * r ** 4 + 170.6 * r ** 3 - 71.103 * r ** 2 + 3 * r
In [36]: E = diff(Z, r)
In [37]: E
Out[37]: 321.25*x**4 - 702.84*x**3 + 511.8*x**2 - 142.206*x + 3
There are two basic ways to do this with SymPy:
In [38]: integrate(E, (r, 0, 1)) # symbolic integration
Out[38]: -8.96299999999999
In [39]: Integral(E, (r, 0, 1)).evalf() # numeric integration
Out[39]: -8.96300000000002
Note that had you used exact rational numbers you would see a more accurate result in either case:
In [40]: nsimplify(E)
Out[40]: 1285*x**4/4 - 17571*x**3/25 + 2559*x**2/5 - 71103*x/500 + 3
In [41]: integrate(nsimplify(E), (r, 0, 1))
Out[41]: -8963/1000
In [42]: Integral(nsimplify(E), (r, 0, 1)).evalf()
Out[42]: -8.96300000000000
The approaches above are very accurate and work nicely for this particular integral, which is easy to compute both symbolically and numerically. They are both slower, though, than something like scipy's quad function, which works with machine-precision floating point. quad needs an ordinary Python function that returns a float for a numeric argument; your diff ignores its argument and returns a SymPy expression, which is where the TypeError comes from. To use scipy's quad function you need to lambdify your expression into an ordinary Python function:
In [44]: from scipy.integrate import quad
In [45]: f = lambdify(r, E, "numpy")
In [46]: f(0)
Out[46]: 3.0
In [47]: f(1)
Out[47]: -8.99600000000001
In [48]: quad(f, 0, 1)[0]
Out[48]: -8.963000000000001
What lambdify does is just to generate an efficient Python function for you. You can see the code that it uses like this:
In [51]: import inspect
In [52]: print(inspect.getsource(f))
def _lambdifygenerated(x):
    return 321.25*x**4 - 702.84*x**3 + 511.8*x**2 - 142.206*x + 3
The quad routine calls this function with plain floating-point values of x, so each evaluation is cheap. If you have high-order polynomials then sympy's horner function can be used to optimise the expression:
In [53]: horner(E)
Out[53]: x⋅(x⋅(x⋅(321.25⋅x - 702.84) + 511.8) - 142.206) + 3.0
In [54]: f2 = lambdify(r, horner(E), "numpy")
In [56]: print(inspect.getsource(f2))
def _lambdifygenerated(x):
    return x*(x*(x*(321.25*x - 702.84) + 511.8) - 142.206) + 3.0
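To see what the Horner form buys you, here is a small pure-Python sketch of Horner's rule applied to the coefficients of E. The names horner_eval, direct_eval and E_COEFFS are made up for this illustration and are not part of SymPy; the sketch only shows that the nested form computes the same polynomial with one multiplication per coefficient.

```python
def horner_eval(coeffs, x):
    """Evaluate a polynomial with Horner's rule; coeffs are highest degree first."""
    acc = 0.0
    for c in coeffs:
        # One multiply and one add per coefficient, no explicit powers of x
        acc = acc * x + c
    return acc


def direct_eval(x):
    """Same polynomial written with explicit powers, as in the first lambdify output."""
    return 321.25*x**4 - 702.84*x**3 + 511.8*x**2 - 142.206*x + 3


# Coefficients of E = 321.25*x**4 - 702.84*x**3 + 511.8*x**2 - 142.206*x + 3
E_COEFFS = [321.25, -702.84, 511.8, -142.206, 3.0]

for x in (0.0, 0.25, 0.5, 1.0):
    # Both forms agree up to floating-point rounding
    print(x, horner_eval(E_COEFFS, x), direct_eval(x))
```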
https://docs.sympy.org/latest/tutorial/calculus.html#integrals
https://docs.sympy.org/latest/modules/utilities/lambdify.html#sympy.utilities.lambdify.lambdify
https://docs.sympy.org/latest/modules/polys/reference.html#sympy.polys.polyfuncs.horner

Maxima. How to prevent degree calculations

Is it possible to prevent Maxima from evaluating numeric powers inside an expression? Perhaps by pre-processing the expression or adding tellsimp rules? Or some other way?
For example, I would like
distrib (10 ^ 10 * (x + 1));
which produces:
10000000000 x + 10000000000
to return this instead:
10 ^ 10 * x + 10 ^ 10
And similarly
factor (10 ^ 10 * x + 10 ^ 10);
returned:
10 ^ 10 * (x + 1);
In other words, can numbers stay represented as powers permanently, the way the output of
factor(200);
2^3*5^2
shows them?
Interesting question, although I don't see a good solution. Here's something I tried as an experiment, which is to display integers in factored form. I am working with Maxima 5.44.0 + SBCL.
(%i1) :lisp (defun integer-formatter (x) ($factor x))
INTEGER-FORMATTER
(%i1) :lisp (setf (get 'integer 'formatter) 'integer-formatter)
INTEGER-FORMATTER
(%i1) (x + 1000)^3;
(%o1) (x + 2^3 5^3)^3
(%i2) 10^10*(x + 1);
(%o2) (2^(2 5) 5^(2 5)) (x + 1)
This is only a modification of the display; the internal representation is just a single integer.
(%i3) :lisp $%
((MTIMES SIMP) 10000000000 ((MPLUS SIMP) 1 $X))
That seems kind of clumsy, since e.g. 2^(2*5)*5^(2*5) isn't really more comprehensible than 10000000000.
A separate question is whether the arithmetic on 10^10 could be suppressed, so it actually stays as 10^10 and isn't represented internally as 10000000000. I'm pretty sure that would be difficult. Unfortunately Maxima is not too good with retracting identities which are applied, particularly with the built-in identities which are applied to perform arithmetic and other operations.
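The display-only idea (keep a plain integer internally, print it in factored form) can be sketched in Python. This is only an illustration of the concept and has nothing to do with Maxima's internals; factorize and format_factored are hypothetical helper names, and trial division is used purely for brevity.

```python
def factorize(n):
    """Return the prime factorisation of an integer n > 1 as (prime, exponent) pairs."""
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            exponent = 0
            while n % p == 0:
                n //= p
                exponent += 1
            factors.append((p, exponent))
        p += 1
    if n > 1:
        # Whatever is left is a prime factor larger than sqrt of the original n
        factors.append((n, 1))
    return factors


def format_factored(n):
    """Format n > 1 as a product of prime powers; the stored value stays a plain int."""
    return "*".join(f"{p}^{e}" if e > 1 else str(p) for p, e in factorize(n))


value = 10 ** 10                # stored as the single integer 10000000000
print(value)                    # internal representation
print(format_factored(value))   # display form only
print(format_factored(200))
```

As with the Lisp formatter, the original 10^10 is not recoverable from the stored value; only the prime factorisation is.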

Functions with pre-millennium dates in q

I've built a function in q to count how many Sundays fall on the 1st of the month between two dates:
\W 1
f3:{[sd;ed] count distinct `week$(sd + til 1 + ed - sd) where (`dd$distinct `week$sd + til 1 + ed - sd)=01}
How can I edit it to work with pre-2000 dates? Can I put a modulus around the negative dates? Or will that render my function incorrect?
You can also try this:
q) f:{sum 1=mod[`date$a[1] + til 1+(-). a:(0;1<`dd$x)+`month$(y;x);7]}
q) f[2018.01.01;2018.12.31] / 2
q) f[1998.01.02;1999.12.31] / 4
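The idea behind this answer, as I read it, is to build the first day of every month in the range and test its day count modulo 7. q dates count days from 2000.01.01, which was a Saturday, so a remainder of 1 means Sunday, and mod with a positive divisor stays non-negative for pre-2000 (negative) day counts. Below is a Python sketch of the same logic; first_of_months, count_sunday_firsts and Q_EPOCH are names invented for the illustration.

```python
from datetime import date

Q_EPOCH = date(2000, 1, 1)  # q date 0, a Saturday


def first_of_months(start, end):
    """Yield the 1st of every month that falls within [start, end]."""
    year, month = start.year, start.month
    if start.day > 1:
        # The 1st of the start month lies before the range, so skip to the next month
        month += 1
        if month > 12:
            year, month = year + 1, 1
    while date(year, month, 1) <= end:
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def count_sunday_firsts(start, end):
    """Count the 1sts of the month in [start, end] that are Sundays."""
    # Day offsets are negative before 2000; Python's % with a positive divisor
    # still returns 0..6, so a remainder of 1 always means Sunday.
    return sum(1 for d in first_of_months(start, end)
               if (d - Q_EPOCH).days % 7 == 1)


print(count_sunday_firsts(date(2018, 1, 1), date(2018, 12, 31)))
print(count_sunday_firsts(date(1998, 1, 2), date(1999, 12, 31)))
```

The two calls use the same date ranges as the q examples above.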

q/KDB - nprev function to get all the previous n elements

I am struggling to write an nprev function in KDB; the xprev function returns the nth previous element, but I need all of the previous n elements relative to the current element.
q)t:([] i:1+til 26; s:.Q.a)
q)update xp:xprev[3;]s,p:prev s from t
Any help is greatly appreciated.
You can achieve the desired result by applying prev repeatedly and flipping the result
q)n:3
q)select flip 1_prev\[n;s] from t
s
-----
"   "
"a  "
"ba "
"cba"
"dcb"
"edc"
..
If n is much smaller than the row count, this will be faster than some of the more straightforward solutions.
The xprev function basically looks like this :
xprev1:{y til[count y]-x} //readable xprev
We can tweak it to get all n elements
nprev:{y til[count y]-\:1+til x}
using nprev in the query
q)update np: nprev[3;s] , xp1:xprev1[3;s] , xp: xprev[3;s], p:prev[s] from t
i s np    xp1 xp p
------------------
1 a "   "
2 b "a  "        a
3 c "ba "        b
4 d "cba" a   a  c
5 e "dcb" b   b  d
6 f "edc" c   c  e
k equivalent of nprev
k)nprev:{$[0h>@y;'`rank;y(!#y)-\:1+!x]}
and similarly nnext would look like
k)nnext:{$[0h>@y;'`rank;y(!#y)+\:1+!x]}
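The indexing trick in nprev (for each position i, look up positions i-1, i-2, ..., i-n and let out-of-range lookups yield a blank) can be written out in Python as a sketch. The fill argument stands in for q's null character; the function below is an illustration written for character data, not a translation of the q internals.

```python
def nprev(n, items, fill=" "):
    """For each position i, return the n items before it (i-1, i-2, ..., i-n)."""
    out = []
    for i in range(len(items)):
        # Positions before the start of the list produce the fill value,
        # just as an out-of-range index in q returns a null
        window = [items[i - k] if i - k >= 0 else fill for k in range(1, n + 1)]
        out.append("".join(window))
    return out


for row in nprev(3, "abcdef"):
    print(repr(row))
```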

q - apply function on table rowwise

Given a table and a function
t:([] c1:1 2 3; c2:`a`b`c; c3:13:00 13:01 13:02)
f:{[int;sym;date]
  symf:{$[x=`a;1;x=`b;2;3]};
  datef:{$[x=13:00;1;x=13:01;2;3]};
  r:int + symf[sym] + datef[date];
  r
  };
I noticed that when f is applied to columns of t, the entire columns are passed into f. If the function body can operate on them atomically, the output has the same length as the inputs and a new column is produced. In our example, however, this won't work:
update newcol:f[c1;c2;c3] from t / 'type error
because the inner functions symf and datef cannot be applied to the entire columns c2 and c3, respectively.
If I don't want to change the function f at all, how can I apply it row by row and collect the values into a new column in t?
What's the most q-style way to do this?
EDIT
If not changing f is really inconvenient, one could work around it like so:
f:{[arglist]
  int:arglist 0;
  sym:arglist 1;
  date:arglist 2;
  symf:{$[x=`a;1;x=`b;2;3]};
  datef:{$[x=13:00;1;x=13:01;2;3]};
  r:int + symf[sym] + datef[date];
  r
  };
f each (t`c1),'(t`c2),'(t`c3)
Still I would be interested how to get the same result when working with the original version of f
Thanks!
You can use each-both for this e.g.
q)update newcol:f'[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
However you will likely get better performance by modifying f to be "vectorised" e.g.
q)f2
{[int;sym;date]
  symf:3^(`a`b!1 2)sym;
  datef:3^(13:00 13:01!1 2)date;
  r:int + symf + datef;
  r
  }
q)update newcol:f2[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
q)\ts:1000 update newcol:f2[c1;c2;c3] from t
4 1664
q)\ts:1000 update newcol:f'[c1;c2;c3] from t
8 1680
In general in KDB, if you can avoid using any form of each and stick to vector operations, you'll get much better performance.
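The two shapes of the solution can be mirrored in Python as a rough sketch: a scalar function applied once per row (the analogue of f'[c1;c2;c3]) and a lookup-with-default applied to whole columns (the analogue of f2, where 3^ fills missing keys with 3). Plain Python has no vector primitives, so this only mirrors the logic and says nothing about the performance difference; the minute values are represented as strings and all names are assumptions made for the example.

```python
def f(int_val, sym, minute):
    """Scalar version: handles one row at a time, like the original f."""
    symf = 1 if sym == "a" else 2 if sym == "b" else 3
    datef = 1 if minute == "13:00" else 2 if minute == "13:01" else 3
    return int_val + symf + datef


# Lookup tables play the role of (`a`b!1 2) and (13:00 13:01!1 2)
SYM_MAP = {"a": 1, "b": 2}
MINUTE_MAP = {"13:00": 1, "13:01": 2}


def f2(ints, syms, minutes):
    """Column version: dictionary lookup with default 3, like 3^ in f2."""
    return [i + SYM_MAP.get(s, 3) + MINUTE_MAP.get(m, 3)
            for i, s, m in zip(ints, syms, minutes)]


c1 = [1, 2, 3]
c2 = ["a", "b", "c"]
c3 = ["13:00", "13:01", "13:02"]

row_wise = [f(i, s, m) for i, s, m in zip(c1, c2, c3)]  # analogue of f'[c1;c2;c3]
column_wise = f2(c1, c2, c3)                            # analogue of f2[c1;c2;c3]
print(row_wise, column_wise)
```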