How does serializing a foreign-keyed table work internally in kdb?

I have a keyed table (the referenced table) linked via a foreign key to the referencing table, and I serialize both tables using the set operator.
q)kt:([sym:`GOOG`AMZN`FB]; px:20 30 40);
q)`:/Users/uts/db/kt set kt
q)t:([] sym:`kt$5?`GOOG`AMZN`FB; vol:5?10000)
q)`:/Users/uts/db/t set t
Then I remove these tables from the memory
q)delete kt,t from `.
Now I deserialize the table t in memory:
t:get `:/Users/uts/db/t
If I do meta t after this, it fails because it expects the foreign-key table kt to exist.
If I print t, as expected it shows index values in column sym of table t.
So the question arises: since kdb stores the meta of each table (i.e. c, t, f, a) and its corresponding values on disk, how does the serialization of table t work internally?
In which binary form are these values stored in the file t?
-rw-r--r-- 1 uts staff 100 Apr 13 23:09 t

tl;dr A foreign key is stored as a vector of 4-byte indices into the key column of the referenced table, plus the name of the table the foreign key refers to.
As far as I know KX have never documented their file formats, and yet I think some useful information relevant to your question can be deduced right from a q console session.
Let me modify your example a bit to make things simpler.
q)show kt:([sym:`GOOG`AMZN`FB]; px:20 30 40)
sym | px
----| --
GOOG| 20
AMZN| 30
FB | 40
q)show t:([] sym:`kt$`GOOG`GOOG`AMZN`FB`FB)
sym
----
GOOG
GOOG
AMZN
FB
FB
I left only one column - sym - in t because vol is not relevant to the question. Let's save t without any data first:
q)`:/tmp/t set 0#t
`:/tmp/t
q)hcount `:/tmp/t
30
Now we know that it takes 30 bytes to represent t when it's empty. Let's see if there's a pattern when we start adding rows to t:
q){`:/tmp/t set x#t;`cnt`size!(x;hcount[`:/tmp/t] - 30)} each til[11], 100 1000 1000000
cnt     size
---------------
0       0
1       4
2       8
3       12
4       16
5       20
6       24
7       28
8       32
9       36
10      40
100     400
1000    4000
1000000 4000000
We can see that adding one row increases the size of t by four bytes. What can these 4 bytes be? Could they be a representation of the symbol itself? No: if they were, renaming a sym value in kt would affect the size of t on disk, but it doesn't:
q)update sym:`$50#.Q.a from `kt where sym=`GOOG
`kt
q)1#t
sym
--------------------------------------------------
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
q)`:/tmp/t set 1#t
`:/tmp/t
q)hcount `:/tmp/t
34
Still 34 bytes. I think it should be obvious by now that the 4 bytes are an index, but an index of what? Is it an index into a column which must be called sym exactly? Apparently not:
q)kt:`foo xcol kt
q)t
sym
--------------------------------------------------
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
AMZN
FB
FB
There's no column called sym in kt any longer but t hasn't changed at all! We can go even further and change the type of foo (ex sym) in kt:
q)update foo:-1 -2 -3.0 from `kt
`kt
q)t
sym
---
-1
-1
-2
-3
-3
Not only did it change t, it changed its meta too:
q)meta t
c | t f a
---| ------
sym| f kt
q)/ ^------- used to be s
I hope it's clear now that kdb stores a 4-byte index into the key column of the referenced table plus the name of that table (but not the name of the key column!). If the referenced table is missing, kdb can't reconstruct the original data and displays the bare indices. If a referencing table needs to be sent over the wire, the indices are replaced with actual values so that the receiving side can see the real data.
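This is easy to verify from a fresh session: casting an enumerated (foreign-keyed) list to int exposes the underlying indices, while value resolves them against the referenced table:
q)kt:([sym:`GOOG`AMZN`FB]; px:20 30 40)
q)fk:`kt$`GOOG`GOOG`AMZN`FB`FB
q)`int$fk      / the 4-byte indices into kt's key column
0 0 1 2 2i
q)value fk     / resolving them requires kt in memory
`GOOG`GOOG`AMZN`FB`FB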

KDB: How to serialize a table for a union join within kdb-tick architecture?

I'm trying to modify the kdb-tick architecture to support a union join on incoming data and the local rdb table.
I have modified the upd function in the tick.q file to the following:
ups:{[t;x] ts"d"$a:.z.P;
  if[not -16=type first first x;a:"n"$a;x:$[0>type first x;a,x;(enlist(count first x)#a),x]];
  f:key flip value t;
  pub[t;$[0>type first x;enlist f!x;flip f!x]];
  if[l;l enlist (`ups;t;x);i+:1];};
With ups:uj subsequently set in the subscriber files.
My question relates to how one might serialize a table row before publishing it within the .u.ups[] function.
I.e. given a table:
second  | amount price
--------| -------------
02:46:01| 54     9953.5
02:46:02| 54     9953.5
02:46:03| 54     9953.5
02:46:04| 150    9953.5
02:46:05| 150    9954.5
How should one serialize the first row (02:46:01 | 54 9953.5) such that it can be sent via the .u.ups function to subscribers, whereby uj will be run between the row and the local table on each subscriber?
Thanks in advance for your advice.
Some of this might help:
You can't set ups:uj in the subscribers because the table name is being passed as a symbol, so the subscriber will effectively try to do
uj[`tab1;tab2]
which won't work because uj doesn't accept table names (symbols) as input. You would have to instead set ups to
ups:{x set value[x] uj y}
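A quick sanity check of that definition using a toy table (names purely illustrative):
q)tab1:([]a:`x`y;b:10 20)
q)ups:{x set value[x] uj y}
q)ups[`tab1;([]a:enlist`z;c:enlist 1.5)]
`tab1
q)tab1
a b  c
--------
x 10
y 20
z    1.5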
A standard tickerplant is not designed to handle a variable/changing schema, and for good reason: it's generally not a good idea to have a schema that changes intraday. However, your situation might warrant it, in which case you'd need to modify your .u.ups function to something like
\d .u
ups:{[t;x] ts"d"$a:.z.P;
  x:`time xcols update time:"n"$a from x;
  pub[t;$[98h=type x;x;1=count last x;enlist x;flip x]];
  if[l;l enlist (`ups;t;x);i+:1];};
\d .
and your feeder process would have to send kdb tables or kdb dictionaries to the .u.ups function. Since a feedhandler process is usually not a kdb process, it may or may not be possible to send tables/dictionaries to the tickerplant; normally a feedhandler sends plain lists (without column metadata). In your case you need to somehow supply the column metadata to the tickerplant on each update (or maybe you're doing that already?), as otherwise it won't know which columns are which.
In other words your feeder process could send either of the following:
(`.u.upd;`tab;([]col1:`a`b`c;col2:1 2 3))
(`.u.upd;`tab;`col1`col2!(`a;1))
(`.u.upd;`tab;`col1`col2!(`a`b;1 2))
I'm going to assume this is related to your previous few questions about disparate schemas. I'd like to suggest an alternative solution, which is only truly viable if you are using kdb version 3.6, which uses anymap. If you can narrow your schemas down to a minimal list of common columns, all other columns can be placed as dictionaries into a general column.
q)tab:([]sym:`$();col1:`float$();colGeneral:(::))
q)`tab upsert (`AAPL;3.454;(`colX`colY`colZ!(1;2.3;"abc")))
`tab
q)`tab upsert (`MSFT;3.0;(`colX`colY!(2;100.0)))
`tab
q)`tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
`tab
q)tab
sym col1 colGeneral
----------------------------------------
AAPL 3.454 `colX`colY`colZ!(1;2.3;"abc")
MSFT 3 `colX`colY!(2;100f)
AMZN 100 (,`colX)!,10
q)select colGeneral from tab
colGeneral
-----------------------------
`colX`colY`colZ!(1;2.3;"abc")
`colX`colY!(2;100f)
(,`colX)!,10
q)select sym, colGeneral #\: `colX from tab
sym x
-------
AAPL 1
MSFT 2
AMZN 10
q)select sym, colGeneral #\: `colY from tab
sym x
---------
AAPL 2.3
MSFT 100f
AMZN 0N
With 3.6 you can be saving this to disk in any splayed format (splayed, partitioned, segmented) and still easily query the data. The storage of such a table will likely be sub-optimal due to poor compression characteristics of the general column (assuming you wish to compress data), but it will be perfectly functional.
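For example, a minimal sketch of splaying such a table (paths hypothetical; requires kdb+ 3.6 or later so the general column is stored via anymap):
q)`:/tmp/db/tab/ set .Q.en[`:/tmp/db] tab   / splay; syms enumerated, colGeneral stored as anymap
`:/tmp/db/tab/
q)select sym,colGeneral from get `:/tmp/db/tab
sym  colGeneral
----------------------------------
AAPL `colX`colY`colZ!(1;2.3;"abc")
MSFT `colX`colY!(2;100f)
AMZN (,`colX)!,10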
Integrating uj into the standard ingestion procedure on each update will be computationally expensive. Using a general column and the dictionary method will massively improve your ingestion speed. Below is a demonstration using the example given in a previous answer to a related question of yours:
q)table:()
q)row1:enlist `x`y`colX!(`AMZN;100.0;10)
q)table:table uj row1
q)\ts:100000 table:table uj row1
13828 6292352
q)\ts:100000 `tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
117 12746880

Updating cells in a partitioned table using .Q.ind[] in q

I have a partitioned table and can read it using a get command as such:
get `:hdb/2018.01.01/trade
and will give me:
sym size exchange
-----------------
0   100  2
1   200  2
1   300  2
I'd like to modify cell values, e.g. change size from 200 and 300 to 1000, given an index or a list of rows. So I am using
.Q.ind[`:hdb/2018.01.01/trade; 1 2j]
to get the rows and then change the cells. But I am getting a 'rank error when running .Q.ind[].
The error you're getting is because the first input parameter to .Q.ind must be the mapped table itself, not a symbol representing the table's name/location.
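For example, assuming the HDB root has been loaded (e.g. q hdb) so that trade is a mapped table, the call would look like this (output shown for the single-date HDB above):
q).Q.ind[trade;1 2j]   / pass the mapped table itself, not a file path
date       sym size exchange
----------------------------
2018.01.01 1   200  2
2018.01.01 1   300  2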
I'm not sure if .Q.ind is going to help you here though, it's more useful for data retrieval than data (re)write.
A couple of approaches you could take:
Pull in the whole date slice (select from table where date=X), modify it in memory and then write it back down using `:hdb/2018.01.01/trade/ set delete date from modifiedTable (a fuller sketch follows this list). This assumes you're not modifying any enumerated/symbol columns. You'd have to be careful to maintain the same schema, the same compression, etc.
Use the dbmaint package to handle the changes: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md
If you're careful enough you could pull in only the column itself, modify it and write it back down: p set @[get p:`:hdb/2018.01.01/trade/col1;1 2;:;1000]
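A fuller sketch of the first approach (paths hypothetical; assumes the HDB is loaded so trade is mapped, and that no symbol/enumerated columns are touched):
t:select from trade where date=2018.01.01;      / pull the date slice into memory
t:update size:1000 from t where i in 1 2;       / modify the rows
`:hdb/2018.01.01/trade/ set delete date from t; / write the partition back down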
You could also use an amend operation to update the values.
@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
This will edit your table on disk.
q)get`:hdb/2018.01.01/trade
sym size exchange
-----------------
0   100  2
1   200  2
1   300  2
q)@[`:hdb/2018.01.01/trade;`size;@[;1 2;:;1000]]
`:hdb/2018.01.01/trade
q)get `:hdb/2018.01.01/trade/
sym size exchange
-----------------
0   100  2
1   1000 2
1   1000 2

kdb q - efficiently count rows of tables in flat files

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string holding the path to the directory where all your flat tables are stored, then you can create a dictionary of row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns its row count without assigning the table to a variable in memory.
The best solution, if you control the process that saves these files down, is to add an extra step of saving the row count as meta information somewhere separate at the same time as saving the raw data (see the sketch below). This would allow you to quickly access the table size from that file instead of reading the full table in each time.
However, note that to avoid pulling the tables into memory at all you would instead have to use read1 and look at the headers of the binary data. As you said, it would be better to save as a splayed table and read in just one column.
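Here is a sketch of that row-count metadata idea (helper name and file layout hypothetical):
/ write a flat table and upsert its row count into a separate `counts file
saveWithCount:{[dir;name;t]
  (` sv dir,name) set t;
  cnts:$[()~key f:` sv dir,`counts;()!();get f];
  f set cnts,(enlist name)!enlist count t;}
q)saveWithCount[`:/tmp/flat;`t;([]a:1 2 3)]
q)get `:/tmp/flat/counts
t| 3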
UPDATE: I would not recommend doing this and strongly suggest doing the above, but for curiosity, after looking into using read1, here's what a hacky solution might look like:
f:{
  b:read1(y;0;x);                                           / read a chunk from the front of the file
  if[not 0x62630b~b[2 4 5];'`$"not a table"];               / check flip(0x62)/dict(0x63)/symbol(0x0b) markers
  cc:first first((),"i";(),4)1:b 7+til 4;                   / column count, little-endian 4-byte int
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];  / chunk too small? double it and retry
  c:`$"\000" vs "c"$b[11+til ce];                           / split the null-terminated column names
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];             / row count from the first column's header
  :`columns`rows!(c;n);
  }[2000]
The q binary file format isn’t documented anywhere, the only way to figure it out is to save different things and see how the bytes change. It’s also subject to changes between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
- reads a chunk from the front of the file
- validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
- reads the count of symbols in the list of columns as a little-endian 4-byte int
- scans through the null-terminated strings for that many zero bytes; if the columns are so numerous or verbose that we don't have them all in this chunk, it doubles the size of the chunk and tries again
- turns the strings into symbols
- using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns
- then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the row count is saved as part of the binary file when you save down a flat table, taking up 4 bytes after the initial object type and the column headings, which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b000500000061006200630061610062620000000500000009000300000
bytes   | example                  | meaning
--------|--------------------------|---------------------------------------------------------------
0 - 5   | 0xff016200630b           | object is a flat table
7 - 11  | 0x05000000               | number of columns (5)
12 - 22 | 0x6100620063006161006262 | ascii values of the column names ("a","b","c","aa","bb"), each name followed by a one-byte null separator
23 - 30 | 0x0000050000000900       | 8 bytes that can be skipped
31 - 34 | 0x03000000               | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first. Doing this in decimal for the number 100 would give 001: the 1s first, then the 10s, and finally the 100s (most significant) on the right. In the binary file, each group of 2 hex digits is one byte.
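For example, the row count 3 sits at offset 31 as 0x03000000; reversing the bytes and combining them gives the integer back:
q)0x0 sv reverse 0x03000000   / 4 little-endian bytes -> int
3i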
You can use 1: to read in the contents of a binary file, with the additional arguments in the list specifying the offset (where to start reading from) and how many bytes to read. In our case we want to start at byte 31 and read 4 bytes, specifying that the output should be integers and that the input should be cut into separate 4-byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file, it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read the number of elements in one of the columns from the 4 bytes at offset 8 (bytes 8-11), assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following:
counttables:{count each get each hsym `$basepath,/:system "ls ",basepath}
This improves the speed of the count by reading the directory listing only once and skipping the construction of the filename table. You are correct though that if the tables were saved splayed you would only have to read in one column, making it much more efficient.
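For completeness, a hypothetical sketch of the splayed-table version (reads the .d file for the column names, then loads and counts a single column; assumes a simple, non-nested first column):
cntSplayed:{[dir] c:first get ` sv dir,`.d; count get ` sv dir,c}
q)cntSplayed `:basepath/t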
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

TSE_to_STS application

I have 12 categorical sequences in TSE format. In the help page of this function, tmax is specified as 12 based on the example sequence data used. How would I change this value if the maximum time length is 292 for one sequence and smaller than 292 for the other sequences? Assume one of the sequences ends at time 25: using tmax=292, any state after 25 will repeat the same state until 292, which is wrong I believe. I would like to stop the sequence at time 25 and fill everything to the right with voids.
TSE_to_STS is a function provided by the TraMineRextras package. It converts time-stamped event sequences into state sequences. The resulting state sequences are in STS form, i.e., organized in a table with each sequence in a different row and the states in successive columns. tmax is used to determine the number of columns of this table; therefore, it should be set to the maximal state sequence length.
To end a sequence at time 25, for example, you need to insert an end-of-sequence event at time 25. TSE_to_STS cannot guess when the sequence ends.
============ example
Below I illustrate how to proceed using the actcal.tse data that ships with TraMineR. I consider the data for ids 2 and 4 and assume id 2 was observed up to the 8th month and id 4 up to the 10th month.
data(actcal.tse)
## Consider the data for id 2 and 4 and
## insert "endobs" event to indicate end of observation
subset <- rbind(actcal.tse[2:4,], data.frame(id=2,time=8,event="endobs"),
actcal.tse[7:9,], data.frame(id=4,time=10,event="endobs"))
subset
##    id time       event
## 2   2    0  NoActivity
## 3   2    4       Start
## 4   2    4    FullTime
## 1   2    8      endobs
## 7   4    0 LowPartTime
## 8   4    9    Increase
## 9   4    9    PartTime
## 11  4   10      endobs
## Define list of events of interest
events <- c("PartTime", "NoActivity", "FullTime", "LowPartTime", "endobs")
## Dropping all previous events
stm <- seqe2stm(events, dropList=list(PartTime=events[-1], NoActivity=events[-2],
FullTime=events[-3], LowPartTime=events[-4], endobs=events[-5]))
mysts <- TSE_to_STS(subset, id=1, timestamp=2, event=3,
stm=stm, tmin=1, tmax=12, firstState="None")
## replacing "endobs" with NAs
mysts[mysts=="endobs"] <- NA
seq <- seqdef(mysts)
seqiplot(seq)
We see the different length of the two resulting state sequences in the plot.

kdb ticker plant: where to find documentation on .u.upd?

I am aware of this resource, but it does not spell out what parameters .u.upd takes or how to check whether it worked.
This statement executes without error, although it does not seem to do anything:
.u.upd[`t;(`$"abc";1;2;3)]
If I define the table beforehand, e.g.
t:([] name:"aaa";a:1;b:2;c:3)
then the above .u.upd still runs without error, and does not change t.
.u.upd has the same function signature as insert (see http://code.kx.com/q/ref/qsql/#insert) in prefix form. In the simplest case, .u.upd may be defined as insert.
so:
.u.upd[`table;<records>]
For example:
q).u.upd:insert
q)show tbl:([] a:`x`y;b:10 20)
a b
----
x 10
y 20
q).u.upd[`tbl;(`z;30)]
,2
q)show tbl
a b
----
x 10
y 20
z 30
q).u.upd[`tbl;(`a`b`c;1 2 3)]
3 4 5
q)show tbl
a b
----
x 10
y 20
z 30
a 1
b 2
c 3
Documentation including the event sequence, connection diagram etc. for tickerplants can be found here:
http://www.timestored.com/kdb-guides/kdb-tick-data-store
.u.upd[tableName; tableData] accepts two arguments, for inserting data into a named table. This function will normally be called from a feedhandler. It takes the tableData, prepends a time column if one is not already present, inserts the data into the in-memory table, appends it to the log file and finally increments the log file counter.
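For reference, the vanilla definition in kx's tick.q looks roughly like this (paraphrased from memory; exact code and the daily log-roll logic vary by version):
upd:{[t;x]
  if[not -16=type first first x;                          / no timespan column supplied?
    a:"n"$.z.P;                                           / current time as timespan
    x:$[0>type first x;a,x;(enlist(count first x)#a),x]]; / prepend time to the data
  t insert x;                                             / insert into the in-memory table
  if[l;l enlist(`upd;t;x);i+:1];}                         / append to log file, bump counter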