Attributes' internal working in aj for performance benefits in kdb

Considering the trade table 't' and quotes table 'q' in memory:
q)t:([] sym:`GOOG`AMZN`GOOG`AMZN; time:10:01 10:02 10:02 10:03; px:10 20 11 19)
q)q:([] sym:`GOOG`AMZN`AMZN`GOOG`AMZN; time:10:01 10:01 10:02 10:02 10:03; vol:100 200 210 110 220)
To get the performance benefit, I apply the grouped attribute to the 'sym' column of the q table and sort the 'time' column within sym, producing a table q1.
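For reference, a minimal sketch of how such a q1 might be built (my assumption, based on the description above):
q)q1:`sym`time xasc q          / sort time within sym
q)q1:update `g#sym from q1     / apply the grouped attribute to sym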
Using this, I can clearly see the performance benefit:
q)\t:1000000 aj[`sym`time;t;q]
9573
q)\t:1000000 aj[`sym`time;t;q1]
8761
q)\t:100000 aj[`sym`time;t;q]
968
q)\t:100000 aj[`sym`time;t;q1]
893
And in large tables the performance is far better.
Now, I'm trying to understand how it works internally when we are applying grouped attribute to sym column and sort time within sym.
My understanding is that internally the aj works as below; can someone please confirm or correct the internal working?
* Since the grouped attribute is applied on sym, it creates a hash table for table q1; then, since time is sorted within sym, the internal representation of q1 might look like:
GOOG|(10:01;10:02)|(100;110)
AMZN|(10:01;10:02;10:03)|(200;210;220)
So in this case, if the interpreter has to join (AMZN;10:02) of table t, it will find it directly in q1's hashtable in less time; but to join the same value (AMZN;10:02) of table t against table q, the interpreter will have to search linearly through table q, hence taking more time.

I believe you're on the right track, though we can't know for sure as we don't have access to the kdb source code to see precisely what it does.
If you look at the definition of aj you'll see that it's based on bin:
q)aj
k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
specifically,
(`sym`time#q)bin `sym`time#t
and the bin documentation provides some more details on how bin behaves: https://code.kx.com/q/ref/bin/
I believe in the two-column case it will first match on the sym column and then use bin on the second (time) column. Like you said, the grouped attribute on sym speeds up the sym-matching part, and the sorting on time ensures that bin returns the correct results. Note that for on-disk queries it's optimal to put `p# on sym rather than `g#, as the parted attribute is optimal for matching/retrieving by sym from disk.
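As a rough illustration of that bin step on the small tables above (assuming q1 is q sorted by time within sym, as sketched earlier), the key-column lookup resolves each trade row to a row index of the quote table:
q)/ prior key columns must match exactly; bin applies to the last (time) column
q)(`sym`time#q1) bin `sym`time#t
3 1 4 2
q)/ i.e. each row of t maps to the latest q1 row with the same sym and time<=the trade time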

Related

Sort data in hdb by using dbmaint.q in kdb

I am trying to sort 1 or 2 columns in an hdb in kdb but failed. This is the code I have:
fncol[dbdir;`trade;`sym;xasc];
and got a length error when I called it. But I don't have a length error if I use this code:
fncol[dbdir;`trade;`sym;asc];
However, this only sorts the sym column itself. I want the data in the other columns to change according to the sym column as well.
In addition, I would like to apply the parted attribute to the sym column. I also tried to sort this way:
fncol[dbdir;`trade;`sym`ptime;xasc];
which also failed.
You should always be careful with dbmaint.q if you are unsure what it is going to do. I gather from the fact that asc worked after xasc that you are using a test hdb each time.
fncol should be used with unary functions, i.e. 1 argument. Its use case is modifying individual columns. What you are trying to do is modify the entire table, since you want to sort the whole table relative to the sym column. Using .Q.dpft for each date is what you want, as outlined by Cathal in your follow-up question: using .Q.dpft function to resave table.
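For illustration, a minimal sketch of a single .Q.dpft call (the directory `:newhdb, the date and the global in-memory table trade holding that date's rows are all illustrative assumptions):
q)/ enumerates syms, sorts the table on sym, applies `p#sym and splays it
q)/ under :newhdb/2014.04.21/trade/
q).Q.dpft[`:newhdb;2014.04.21;`sym;`trade]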
When you run fncol[dbdir;`trade;`sym;xasc]; you are saving down a projection in place of the sym column in each date:
fncol[`:.;`trades;`sym;xasc];
select from trades where date = 2014.04.21
'length
[0] select from trades where date = 2014.04.21
q)get `:2014.04.21/trades/sym
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}[`p#`sym$`A..
// This is the k definition of xasc with the sym column as the first parameter.
q)xasc
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}
// Had you needed to fix your hdb, I managed to undo this using value and indexing to the sym col data.
fncol[`:.;`trades;`sym;{(value x)[1]}];
q)select from trades where date = 2014.04.21
date sym time src price size
------------------------------------------------------------
2014.04.21 AAPL 2014.04.21D08:00:12.155000000 N 25.31 2450
2014.04.21 AAPL 2014.04.21D08:00:42.186000000 N 25.32 289
2014.04.21 AAPL 2014.04.21D08:00:51.764000000 O 25.34 3167
asc will not break the hdb as it just takes 1 argument and saves down ONLY the sym column in ascending order, not the whole table.
Is there any indication of what date is failing with a length error? It could be something wrong with one of the partitions.
Perhaps if you try to load one of the dates into memory and sort it manually IE
`sym xasc select from trade where date=last date
that might indicate if there's a specific partition causing issues.
FYI, if you're interested in applying the `p# attribute you should try setattrcol in dbmaint.q. I think the data will need to be sorted first though.
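For example, a minimal sketch of the setattrcol route (assuming dbmaint.q is loaded and the sym column within each partition is already sorted, since setattrcol does not sort for you):
q)\l dbmaint.q
q)setattrcol[`:.;`trade;`sym;`p]   / set `p# on sym in every date partition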

KDB: How to serialize a table for a union join within kdb-tick architecture?

I'm trying to modify the kdb-tick architecture to support a union join on incoming data and the local rdb table.
I have modified the upd function in the tick.q file to the following:
ups:{[t;x]ts"d"$a:.z.P;
if[not -16=type first first x;a:"n"$a;x:$[0>type first x;a,x;(enlist(count first x)#a),x]];
f:key flip value t;pub[t;$[0>type first x;enlist f!x;flip f!x]];if[l;l enlist (`ups;t;x);i+:1];};
With ups:uj subsequently set in the subscriber files.
My question relates to how one might serialize a table row before publishing it within the .u.ups[] function.
I.e. given a table:
second | amount price
-----------|----------------
02:46:01 | 54 9953.5
02:46:02 | 54 9953.5
02:46:03 | 54 9953.5
02:46:04 | 150 9953.5
02:46:05 | 150 9954.5
How should one serialize the first row 02:46:01 | 54 9953.5 such that it can be sent via the .u.ups function to subscribers, whereby uj will be run between the row and the local table on the subscribers?
Thanks in advance for your advice.
Some of this might help:
You can't set ups:uj in the subscribers because the table name is being passed as a symbol so the subscriber will effectively try to do
uj[`tab1;tab2]
which won't work because uj doesn't accept table names (symbols) as input. You would have to instead set ups to
ups:{x set value[x] uj y}
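For example, a minimal illustration of that handler in isolation (tab1 and the column names are illustrative):
q)ups:{x set value[x] uj y}
q)tab1:([]a:1 2)
q)ups[`tab1;([]a:3 4;b:10 20)]   / tab1 now has columns a and b, with 4 rows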
A standard tickerplant is not designed to handle a variable/changing schema, for good reason: it's generally not a good idea to have a schema that changes intraday. However, your situation might warrant it, so in that case you'd need to modify your .u.ups function to something like:
\d .u
ups:{[t;x]ts"d"$a:.z.P;
x:`time xcols update time:"n"$a from x;
pub[t;$[98h=type x;x;1=count last x;enlist x;flip x]];if[l;l enlist (`ups;t;x);i+:1];};
\d .
and your feeder process would have to send kdb tables or kdb dictionaries to the .u.ups function. Since a feedhandler process is usually not a kdb process, it may or may not be possible to send tables/dictionaries to the tickerplant as normally the feedhandler would send lists (without column metadata). In your case you need to somehow supply the column metadata to the tickerplant on each update (or maybe you're doing that already?), as otherwise it won't know which columns are which.
In other words your feeder process could send either of the following:
(`.u.upd;`tab;([]col1:`a`b`c;col2:1 2 3))
(`.u.upd;`tab;`col1`col2!(`a;1))
(`.u.upd;`tab;`col1`col2!(`a`b;1 2))
I'm going to assume this is related to your previous few questions about disparate schemas. I'd like to suggest an alternative solution, which is only truly viable if you are using kdb version 3.6, which uses anymap. If you can narrow your schemas down to a minimal list of common columns, all other columns can be placed as dictionaries into a general column.
q)tab:([]sym:`$();col1:`float$();colGeneral:(::))
q)`tab upsert (`AAPL;3.454;(`colX`colY`colZ!(1;2.3;"abc")))
`tab
q)`tab upsert (`MSFT;3.0;(`colX`colY!(2;100.0)))
`tab
q)`tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
`tab
q)tab
sym col1 colGeneral
----------------------------------------
AAPL 3.454 `colX`colY`colZ!(1;2.3;"abc")
MSFT 3 `colX`colY!(2;100f)
AMZN 100 (,`colX)!,10
q)select colGeneral from tab
colGeneral
-----------------------------
`colX`colY`colZ!(1;2.3;"abc")
`colX`colY!(2;100f)
(,`colX)!,10
q)select sym, colGeneral #\: `colX from tab
sym x
-------
AAPL 1
MSFT 2
AMZN 10
q)select sym, colGeneral #\: `colY from tab
sym x
---------
AAPL 2.3
MSFT 100f
AMZN 0N
With 3.6 you can be saving this to disk in any splayed format (splayed, partitioned, segmented) and still easily query the data. The storage of such a table will likely be sub-optimal due to poor compression characteristics of the general column (assuming you wish to compress data), but it will be perfectly functional.
Integrating uj into the standard ingestion procedure with each update will be computationally expensive. Using the general column and dictionary method will massively improve your ingestion speed. Below I've given a demonstration using the example given in a previous answer to a related question of yours:
q)table:()
q)row1:enlist `x`y`colX!(`AMZN;100.0;10)
q)table:table uj row1
q)\ts:100000 table:table uj row1
13828 6292352
q)\ts:100000 `tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
117 12746880

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element), e.g. each element looks like:
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
i.e. given I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version
I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column, whether it is the column itself that gets entirely pulled in (in the non-nested case) or the nested-index column that gets entirely pulled in.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32-bit version (Windows) you can create this many floats and it only uses ~1.07GB of memory:
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small amount of rows in the difference between one day and the next can be very significant if you're near the boundary of allocatable powers of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are index 3, index 5, index 10, etc.
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
Then I know that I need the first 10 floats from the col1# file and need to split them at indices 3 and 5. Then I can read the col1# file partially and split it correctly:
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to map a date partition in and then select within that, chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
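For example, a minimal sketch of that cast before saving (column and table names are illustrative; `real is kdb's 32-bit float type):
q)tab:update `real$'col1 from tab   / cast each nested float vector to 32-bit reals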
EDIT
So after the comments, I guess the best way is to go through each partition a number of rows at a time with .Q.ind.
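A minimal sketch of that per-partition, chunked approach (the chunk size of 1000000 and the use of .Q.cn for per-partition row counts are my assumptions; table/column names follow the question):
q)cnts:.Q.cn table                 / row counts per date partition
q)i:date?2015.02.07                / position of the problem date
q)off:sum i#cnts                   / rows in all earlier partitions
q)raze {select column[;0] from .Q.ind[table;x]} each off+1000000 cut til cnts i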
Just to give my 2 cents on this: I had a similar error, but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented, as the instance had been running for almost a year.
Bouncing the instance solved the issue and released a lot of virtual memory.

Query distinct values from historical database

If I run this query on a large historical database without specifying a date, will KDB be smart enough to retrieve status values from an index and not bring the database down?
select distinct status from trades
The only way kdb can possibly tell all the distinct status values is by reading from every partition. Yes, this will take a lot of memory, but unless you yourself want to maintain a cache of all distinct status values, there is nothing else you can do. As previously mentioned, an attribute will speed the query up, but the query time will still only scale with the number of partitions.
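As a rough sketch, a per-partition aggregation keeps only one date's worth of status values in memory at a time (though, as noted, it still reads every partition):
q)distinct raze {exec distinct status from trades where date=x} each date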
To retrieve using an index, kdb provides the `g# attribute. distinct alone can take more time depending on the size of your table (it will be a linear search without the `g# attribute).
Check this: http://code.kx.com/q4m3/8_Tables/#88-attributes
Let's look at a simple example:
q) a: 10000000#1 2 3 5
q) b:`g#a
q) \ts distinct a
68 134217888
q) \ts distinct b
0 288
The difference shows that the `g# attribute makes a big difference in the time and space taken during searching. This is because the `g# attribute creates and maintains an index on the vector.

What is the right way to iterate through a kdb partitioned table in an client application?

I want to process all the rows of a kdb table in an R program (I use qserver.R). One way to do this is to initialize a memory handler and then iterate through all the rows one at a time, as explained here:
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
t[0]
t[1]
t[2]
...
I want to limit the number of client/server calls in R to loop as fast as possible.
How can I fetch multiple rows for each call?
NOTE: my answer below assumes that mytable is the partitioned database, but that you now have t in memory.
Another option is using cut (with "chunks" of 1,000,000 as per your earlier post):
(`int$1e6) cut t
Now you have a list of table "chunks" of your desired size which you can use accordingly.
I frequently use this for certain functions (particularly in combination with peach).
A pattern I've found useful is:
f:{...}                               / function that does something useful on chunks
fa:{...}                              / function that reaggregates up to the final results
r:fa raze f peach (`int$size) cut t   / size = your chosen chunk size
If your t is really large (both vertically and horizontally) you might want to avoid cut directly on the table for memory reasons; instead, cut a list of indices for the table into appropriately sized chunks, then feed those indices to your f and have it index into t and grab what you want.
Below is a quick comparison of both approaches (note that f here is pointless, but it proves the point of cutting t versus cutting indices):
q)t:flip (`$"c",/:string til 100)!{(`int$1e7)?100} each til 100
q)\ts a:raze {select c1,c99 from x}each 1000 cut t
3827 4108103072j
q)\ts b:raze {select c1,c99 from t[x]}each 1000 cut til count t
3057 217623200j
q)4108103072j%217623200j
18.87714
q)a~b
1b
From your previous questions I assume this is a one-person system, so what benefit are you getting from kdb? Why not work fully in R and just use flat memory-mapped files directly there, avoiding unneeded complexity and overhead? If all you want to do is stream the data through R in order, that should be simple.
Rather than "ts>12:30:00,ts<15:00:00", use "ts within (12:30:00;15:00:00)"; it's quicker.
The larger the chunks you process in, the more efficient it is likely to be; 100 seems quite small.
Regards,
Ryan Hamilton
Sorted out, this returns 100 rows each time:
\l /data/mydb
t: select from mytable where ts>12:30:00,ts<15:00:00,price,msg="A"
select [0 100] from t
select [100 100] from t
select [200 100] from t
..