KDB: How to serialize a table for a union join within kdb-tick architecture?

I'm trying to modify the kdb-tick architecture to support a union join on incoming data and the local rdb table.
I have modified the upd function in the tick.q file to the following:
ups:{[t;x]ts"d"$a:.z.P;
if[not -16=type first first x;a:"n"$a;x:$[0>type first x;a,x;(enlist(count first x)#a),x]];
f:key flip value t;pub[t;$[0>type first x;enlist f!x;flip f!x]];if[l;l enlist (`ups;t;x);i+:1];};
With ups:uj subsequently set in the subscriber files.
My question relates to how one might serialize a table row before publishing it within the .u.ups[] function.
I.e. given a table:
second  | amount price
--------| -------------
02:46:01| 54     9953.5
02:46:02| 54     9953.5
02:46:03| 54     9953.5
02:46:04| 150    9953.5
02:46:05| 150    9954.5
How should one serialize the first row (02:46:01 | 54 9953.5) so that it can be sent via the .u.ups function to subscribers, where uj will be run between the row and the local table on each subscriber?
Thanks in advance for your advice.

Some of this might help:
You can't set ups:uj in the subscribers because the table name is being passed as a symbol so the subscriber will effectively try to do
uj[`tab1;tab2]
which won't work because uj doesn't accept table names (symbols) as input. You would have to instead set ups to
ups:{x set value[x] uj y}
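As a minimal sketch of what that subscriber-side ups then does (the table and column names here are made up for illustration):
q)tab1:([]second:02:46:01 02:46:02;amount:54 54;price:9953.5 9953.5)
q)ups:{x set value[x] uj y}
q)ups[`tab1;([]second:enlist 02:46:03;amount:enlist 150;vol:enlist 10)]
`tab1
q)cols tab1    / the new vol column has been unioned in; missing cells are filled with nulls
`second`amount`price`vol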
A standard tickerplant is not designed to handle a variable/changing schema, for good reason: it's generally not a good idea to have a schema that changes intraday. However, your situation might warrant it, so in that case you'd need to modify your .u.ups function to something like
\d .u
ups:{[t;x]ts"d"$a:.z.P;
x:`time xcols update time:"n"$a from x;
pub[t;$[98h=type x;x;1=count last x;enlist x;flip x]];if[l;l enlist (`ups;t;x);i+:1];};
\d .
and your feeder process would have to send kdb tables or kdb dictionaries to the .u.ups function. Since a feedhandler process is usually not a kdb process, it may or may not be possible to send tables/dictionaries to the tickerplant as normally the feedhandler would send lists (without column metadata). In your case you need to somehow supply the column metadata to the tickerplant on each update (or maybe you're doing that already?), as otherwise it won't know which columns are which.
In other words your feeder process could send any of the following:
(`.u.ups;`tab;([]col1:`a`b`c;col2:1 2 3))
(`.u.ups;`tab;`col1`col2!(`a;1))
(`.u.ups;`tab;`col1`col2!(`a`b;1 2))
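So, for the first row of the table in the question, the feeder (if it can speak kdb IPC) might send something like the following; the handle h, the port and the table name are assumptions here:
q)h:hopen `::5010                                              / handle to the tickerplant (port is an assumption)
q)neg[h](`.u.ups;`trade;`second`amount`price!(02:46:01;54;9953.5))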

I'm going to assume this is related to your previous few questions about disparate schemas. I'd like to suggest an alternative solution, which is only truly viable if you are using kdb version 3.6 (which introduced anymap). If you can narrow your schemas down to a minimal list of common columns, all other columns can be placed as dictionaries into a general column.
q)tab:([]sym:`$();col1:`float$();colGeneral:(::))
q)`tab upsert (`AAPL;3.454;(`colX`colY`colZ!(1;2.3;"abc")))
`tab
q)`tab upsert (`MSFT;3.0;(`colX`colY!(2;100.0)))
`tab
q)`tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
`tab
q)tab
sym col1 colGeneral
----------------------------------------
AAPL 3.454 `colX`colY`colZ!(1;2.3;"abc")
MSFT 3 `colX`colY!(2;100f)
AMZN 100 (,`colX)!,10
q)select colGeneral from tab
colGeneral
-----------------------------
`colX`colY`colZ!(1;2.3;"abc")
`colX`colY!(2;100f)
(,`colX)!,10
q)select sym, colGeneral #\: `colX from tab
sym x
-------
AAPL 1
MSFT 2
AMZN 10
q)select sym, colGeneral #\: `colY from tab
sym x
---------
AAPL 2.3
MSFT 100f
AMZN 0N
With 3.6 you can save this to disk in any splayed format (splayed, partitioned, segmented) and still easily query the data. The storage of such a table will likely be sub-optimal due to the poor compression characteristics of the general column (assuming you wish to compress data), but it will be perfectly functional.
Integrating uj into the standard ingestion procedure with each update will be computationally expensive. Using the general-column-and-dictionary method will massively improve your ingestion speed. Below I've given a demonstration using the example given in a previous answer to a related question of yours
q)table:()
q)row1:enlist `x`y`colX!(`AMZN;100.0;10)
q)table:table uj row1
q)\ts:100000 table:table uj row1
13828 6292352
q)\ts:100000 `tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
117 12746880
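If you later need the fully widened view, you can still pay the uj cost once at query time rather than on every update; the expansion idiom below is my own sketch (not from the original answer), using the tab defined above:
q)wide:(delete colGeneral from tab),'(uj/)enlist each tab`colGeneral
q)cols wide
`sym`col1`colX`colY`colZ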

Related

sort data in hdb by using dbmaint.q in kdb

I am trying to sort on 1 or 2 columns in an hdb in kdb but failed. This is the code I have:
fncol[dbdir;`trade;`sym;xasc];
and got a length error when I called it, but I don't get a length error if I use this code:
fncol[dbdir;`trade;`sym;asc];
However, this only sorts the sym column itself; I want the data in the other columns to change according to the sym column as well.
In addition, I would like to apply the parted attribute to the sym column. I also tried to sort this way:
fncol[dbdir;`trade;`sym`ptime;xasc];
which also failed.
You should always be careful with dbmaint.q if you are unsure what it is going to do. I gather from the fact that asc worked after xasc that you are using a test hdb each time.
fncol should be used with unary functions, i.e. functions taking 1 argument. Its use case is modifying individual columns. What you are trying to do is modify the entire table, since you want to sort the entire table relative to the sym column. Using .Q.dpft for each date is what you want, as outlined by Cathal in your follow-up question: using .Q.dpft function to resave table
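For reference, a minimal sketch of that approach, assuming the HDB root is the current directory (`:.) and taking the 2014.04.21 partition as an example; you would repeat this per date, ideally from a fresh process since it rebinds the trade global:
trade:delete date from select from trade where date=2014.04.21  / pull one date into memory
`sym xasc `trade                                                / sort the whole table by sym
.Q.dpft[`:.;2014.04.21;`sym;`trade]                             / resave splayed with `p# applied to sym
system"l ."                                                     / remap the HDB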
When you run fncol[dbdir;`trade;`sym;xasc]; you are saving down a projection in place of the sym column in each date:
fncol[`:.;`trades;`sym;xasc];
select from trades where date = 2014.04.21
'length
[0] select from trades where date = 2014.04.21
q)get `:2014.04.21/trades/sym
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}[`p#`sym$`A..
// This is the k definition of xasc with the sym column as the first parameter.
q)xasc
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}
// Had you needed to fix your hdb: I managed to undo this using value and indexing into the sym column data.
fncol[`:.;`trades;`sym;{(value x)[1]}];
q)select from trades where date = 2014.04.21
date sym time src price size
------------------------------------------------------------
2014.04.21 AAPL 2014.04.21D08:00:12.155000000 N 25.31 2450
2014.04.21 AAPL 2014.04.21D08:00:42.186000000 N 25.32 289
2014.04.21 AAPL 2014.04.21D08:00:51.764000000 O 25.34 3167
asc will not break the hdb as it takes just 1 argument and saves down ONLY the sym column in ascending order, not the whole table.
Is there any indication of what date is failing with a length error? It could be something wrong with one of the partitions.
Perhaps try to load one of the dates into memory and sort it manually, i.e.
`sym xasc select from trade where date=last date
that might indicate if there's a specific partition causing issues.
FYI, if you're interested in applying the `p# attribute you should try setattrcol in dbmaint.q. I think the data will need to be sorted first though.
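For example, assuming the standard dbmaint.q signature setattrcol[dbdir;table;col;attr] and that each date partition is already sorted by sym:
q)\l dbmaint.q
q)setattrcol[`:.;`trade;`sym;`p]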

Attributes internal working in aj for performance benefits in kdb

Considering the trade table 't' and quotes table 'q' in memory:
q)t:([] sym:`GOOG`AMZN`GOOG`AMZN; time:10:01 10:02 10:02 10:03; px:10 20 11 19)
q)q:([] sym:`GOOG`AMZN`AMZN`GOOG`AMZN; time:10:01 10:01 10:02 10:02 10:03; vol:100 200 210 110 220)
To get the performance benefit, I applied the grouped attribute to the 'sym' column of the q table and sorted the 'time' column within each sym, calling the result q1.
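Here q1 is the quotes table with those attributes applied, created along these lines:
q)q1:update `g#sym from `sym`time xasc q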
Using this, I can clearly see the performance benefits from it:
q)\t:1000000 aj[`sym`time;t;q]
9573
q)\t:1000000 aj[`sym`time;t;q1]
8761
q)\t:100000 aj[`sym`time;t;q]
968
q)\t:100000 aj[`sym`time;t;q1]
893
And in large tables the performance is far better.
Now, I'm trying to understand how it works internally when we apply the grouped attribute to the sym column and sort time within sym.
My understanding is that internally the aj should work in the way described below; can someone confirm the correct internal working?
* Since the grouped attribute is applied on sym, it creates a hash table for table q1; then, since we are sorting on time, the internal q1 table might look like:
GOOG|(10:01;10:02)|(100;110)
AMZN|(10:01;10:02;10:03)|(200;210;220)
So in the case of q1, if the interpreter has to join (AMZN;10:02) from table t, it will find it directly in q1's hash table in less time, but for joining the same value (AMZN;10:02) of table 't' against table 'q' the interpreter will have to search linearly through table 'q', hence taking more time.
I believe you're on the right track, though we can't know for sure as we don't have access to the kdb source code to see precisely what it does.
If you look at the definition of aj you'll see that it's based on bin:
q)aj
k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
specifically,
(`sym`time#q)bin `sym`time#t
and the bin documentation provides some more details on how bin behaves: https://code.kx.com/q/ref/bin/
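For intuition, bin is a binary search that returns the index of the last element less than or equal to the lookup value (the left argument must be sorted):
q)0 2 4 6 8 bin 5
2
q)0 2 4 6 8 bin -1 0 4 9
-1 0 2 4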
I believe in the two-column case it will first match on the sym column and then use bin on the time column. As you said, the grouped attribute on sym speeds up the sym-matching part, and the sorting on time ensures that bin returns the correct results. Note that for on-disk queries it's optimal to put `p# on sym rather than `g#, as the parted attribute is optimal for matching/retrieving by sym from disk.

backtesting in kdb; updating/passing table as we parse each row of table

I have a trade table that contains historical execution records with timestamp, ric, side, price, quantity (where the rics are all equities). Additionally, I have aj'ed the futures price snapshot table onto each execution time, so the trade table contains: timestamp, ric, side, price, quantity, futures_price.
I am trying to create an intra-day backtesting system where:
as each execution record is processed (via { BACKTESTING_LOGIC_HERE } each trade), different sets of logic are used to decide hedging timing.
Is it possible for me to create a hedge table which records the timestamp for executing futures, execution price, trade_qty and cumulative_qty, without writing to disk? Basically, I want to see if it is possible to dynamically update the hedge table (as each execution record is processed) and pass the hedge table along.
I was looking at over or scan, but I wasn't sure if that was the right approach. Can you provide some insight on this matter?
Thank you!
Yes, it sounds like over is what you need to use in this case.
This allows you to pass in an initial state, update it and pass it back in for the next iteration.
For example:
q){.[x;(y`sym;`cnt);+;1]}/[([sym:()] cnt:`long$());trades]
sym | cnt
----| ---
ORCL| 114
YHOO| 110
AAPL| 105
IBM | 124
NOK | 120
CSCO| 112
MSFT| 95
DELL| 109
GOOG| 111
In this simple example, a trade table is iterated over (second arg to over), and the initial state is the simple keyed table ([sym:()] cnt:`long$())
On each iteration, we simply add 1 to the count for the relevant sym. In real usage, you would perform your backtesting here and return the updated hedge table from the lambda function; this updated table is passed back into the function on the next iteration (e.g. in this example, each time the cnt for a sym is increased, the next iteration receives the table with that cnt increased).
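As a rough sketch of the same pattern applied to your hedging use case (the sample trade table, the hedge-table column names and the trivial "record every execution with a running quantity" rule below are my own assumptions; your real hedging decision logic would go inside the lambda):
trade:([]timestamp:.z.p+til 3;ric:3#`ABC;side:`B`B`S;
  price:10 10.1 10.2;quantity:60 70 50;futures_price:99.5 99.6 99.7)
hedge0:([]timestamp:`timestamp$();exec_price:`float$();trade_qty:`long$();cum_qty:`long$())
/ one step: take the hedge table built so far and one execution row (a dictionary),
/ decide what to do and return the updated hedge table
step:{[hedge;row]
  hedge upsert `timestamp`exec_price`trade_qty`cum_qty!
    (row`timestamp;row`futures_price;row`quantity;row[`quantity]+0^last exec cum_qty from hedge)}
hedge:step/[hedge0;trade]    / over threads the hedge table through every execution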

What is the right way to iterate through a kdb partitioned table in a client application?

I want to process all the rows of a kdb table in an R program (I use qserver.R). One way to do this is to initialize a memory handler and then iterate through the rows one at a time, as explained here:
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
t[0]
t[1]
t[2]
...
I want to limit the number of client/server calls from R so the loop runs as fast as possible.
How can I fetch multiple rows for each call?
NOTE: my answer below assumes that mytable is the partitioned database, but that you now have t in memory.
Another option is using cut (with "chunks" of 1,000,000 rows, as per your earlier post):
(`int$1e6) cut t
Now you have a list of table "chunks" of your desired size which you can use accordingly.
I frequently use this for certain functions (particularly in combination with peach).
A pattern I've found useful is:
f: a function that does something useful on a chunk
fa: a function that reaggregates the chunk results into the final result
r: fa raze f peach (`int$size) cut t    / size = your chunk size
If your t is really large (both vertically and horizontally) you might want to avoid cut directly on the table for memory reasons; instead you can cut a list of indices for the table into chunks of the appropriate size, feed those index chunks to your f, and have f index into t to grab what it needs.
Below is a quick comparison of both approaches (note that f here is pointless; it's just to demonstrate cutting t versus cutting indices):
q)t:flip (`$"c",/:string til 100)!{(`int$1e7)?100} each til 100
q)\ts a:raze {select c1,c99 from x}each 1000 cut t
3827 4108103072j
q)\ts b:raze {select c1,c99 from t[x]}each 1000 cut til count t
3057 217623200j
q)4108103072j%217623200j
18.87714
q)a~b
1b
From your previous questions I assume this is a one-person system, so what benefit are you getting from kdb? Why not work fully in R and just use flat memory-mapped files directly there, avoiding unneeded complexity and overhead? If all you want to do is stream the data through R in order, that should be simple.
Rather than "ts>12:30:00,ts<15:00:00", use "ts within (12:30:00;15:00:00)"; it's quicker.
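For example, the filter from the question becomes (note that within is inclusive at both ends, whereas > and < are strict):
t: select from mytable where ts within (12:30:00;15:00:00), price, msg="A"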
The larger the chunks you process, the more efficient it is likely to be. 100 rows seems quite small.
Regards,
Ryan Hamilton
Sorted out, this returns 100 rows each time:
\l /data/mydb
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
select [0 100] from t
select [100 100] from t
select [200 100] from t
..

T-SQL query to compute a new column from an existing column

I have a table as shown below
Dealid  comment        amount  swaplink
A11     Nothing        1000
B11     this is swap1  2000
b22     this is swap2  3000
b33     this is swap1  4000
b44     this is swap2  5000
Swaplink is a column computed from the comment column. There are 4 steps to follow:
1. check whether "swap" occurs in the comment column
2. check the number after "swap"
3. find the matching swapN that is not in the same row, and repeat this for all rows
4. put the matching Dealids into swaplink
As far as I understand you, you cannot create a persisted or indexable computed column on the table which relates to other rows; instead you can use a User Defined Function to encapsulate your logic, but you should understand that it will be a performance killer.
If you don't need the column in the table, but just a column in the query, you may still use your UDF or write your own subqueries.
To make your life a little bit easier, try to separate values like swap2 into 2 columns holding the values swap AND 2.