kdb/q -- how to number the rows by certain groupings - kdb

I have tables with date and sym columns, and each date might have multiple syms. I want to number the occurrences of each symbol within each date.
For example:
date sym
-------------------
2019.06.04 ABC
2019.06.04 DEF
2019.06.04 ABC
2019.06.05 DEF
2019.06.05 ABC
will give me
date sym c
-------------------
2019.06.04 ABC 1
2019.06.04 DEF 1
2019.06.04 ABC 2 / here ABC appears for the second time on this date.
2019.06.05 DEF 1
2019.06.05 ABC 1

This may be a little cleaner: here the c column is just a running sum over the rows within each combination of date and sym.
q)t:([]date:2019.06.04+0 0 0 1 1;sym:`ABC`DEF`ABC`DEF`ABC)
q)update c:sums i=i by date,sym from t
date sym c
----------------
2019.06.04 ABC 1
2019.06.04 DEF 1
2019.06.04 ABC 2
2019.06.05 DEF 1
2019.06.05 ABC 1
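To see why this works, the expression can be unpacked step by step. The following is a small sketch that reuses the table t from above; the 1+til count i variant is an alternative formulation and not part of the original answer.

```q
/ same table as above
t:([]date:2019.06.04+0 0 0 1 1;sym:`ABC`DEF`ABC`DEF`ABC)

/ i is the virtual row-index column; comparing it with itself gives one 1b per row
exec i=i from t

/ within each date,sym group, a running sum of those flags numbers the rows 1,2,3,...
update c:sums i=i by date,sym from t

/ alternative (assumption: an explicit 1..n counter per group is acceptable)
update c:1+til count i by date,sym from t
```

The two forms should number the rows identically, although the integer type of c may differ between them.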

To count the occurrences of syms by date across all of the tables in an HDB, we can run a count by date for each of the partitioned tables listed in .Q.pt and then fold the results together with pj (plus join), since the resulting tables all share the same key columns. Because pj, like lj, only keeps the keys of its left argument, we also need to append any keys that exist only in the right-hand table so that no rows are dropped, as each date might be missing different syms.
q)cntTabs:{2!0!update c:count each sym,sym:first each sym from select sym by date from x} each .Q.pt
q){t:pj[x;y];t,k!y k:key[y] except key[t]}/[cntTabs]

Related

Storing output of a iterating function into unique tables KDB+/Q

I have a query function of the format:
function:{[x] select Y from table1 where date=x}
I want to iterate this function across all days stored in a separate table, like so
function each (select distinct date from table2)
and store every iteration into its own, unique table. For example, if the first iteration was for 2021.11.10, all 'Y' values from table1 on that date are stored in a table named 'a'; for the next iteration, for date 2021.11.12, they go to a table named 'b', or something like that. How can I do this?
Assume the following tables:
q)t1:([]date:.z.d + 0 0 0 1 1 1;a:1 2 3 4 5 6)
q)t2:([]date:.z.d+0 1)
then you could do something like:
q)function:{[x;y] x set select from t1 where date=y}
q)function'[`a`b;exec distinct date from t2]
`a`b
q)show a
date a
------------
2021.11.10 1
2021.11.10 2
2021.11.10 3
q)show b
date a
------------
2021.11.11 4
2021.11.11 5
2021.11.11 6
If you wanted to make it a bit more dynamic then you could also do something like:
q)function:{[x] (`$"tab",string[x] except ".") set select from t1 where date=x}
q)function each exec distinct date from t2
`tab20211110`tab20211111
q)show tab20211110
date a
------------
2021.11.10 1
2021.11.10 2
2021.11.10 3
q)show tab20211111
date a
------------
2021.11.11 4
2021.11.11 5
2021.11.11 6
The above creates a new table named for each date (we remove the "." from the resulting names, as it could be confused with q's . operator).
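If dynamically named globals are not strictly required, one alternative is to keep the per-date results in a dictionary keyed by date. This is only a sketch under that assumption; the names dts and res are hypothetical, and fixed dates are used in place of .z.d so that the example is reproducible.

```q
t1:([]date:2021.11.10+0 0 0 1 1 1;a:1 2 3 4 5 6)
t2:([]date:2021.11.10+0 1)

/ distinct dates to iterate over
dts:exec distinct date from t2

/ one table per date, stored as the values of a dictionary keyed by date
res:dts!{select from t1 where date=x} each dts

/ look up the table for a single date
res 2021.11.10
```

This avoids creating global variables and keeps the results easy to iterate over, at the cost of not having a separate named table per date.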

Pivot table with multiple value columns in KDB+

I would like to transform the following two row table generated by:
tb: ([] time: 2010.01.01 2010.01.01; side:`Buy`Sell; price:100 101; size:30 50)
time side price size
--------------------------------
2010.01.01 Buy 100 30
2010.01.01 Sell 101 50
To the table below with single row:
tb1: ([] enlist time: 2010.01.01; enlist price_buy:100; enlist price_sell:101; enlist size_buy:30; enlist size_sell:50)
time price_buy price_sell size_buy size_sell
-----------------------------------------------------
2010.01.01 100 101 30 50
What is the most efficient way to achieve this?
(select price_buy:price, size_buy:size by time from tb where side = `Buy) lj select price_sell:price, size_sell:size by time from tb where side = `Sell
time | price_buy size_buy price_sell size_sell
----------| ---------------------------------------
2010.01.01| 100 30 101 50
If you wanted to avoid 2 select statements:
raze each select `price_buy`price_sell!(side!price)@/:`Buy`Sell, `size_buy`size_sell!(side!size)@/:`Buy`Sell by time from tb
As an additional note, having a date column labeled time can be misleading. Typical financial tables in kdb have columns in the order date, time, sym, etc.
Edit: Functional form for dynamic column generation:
{x[0] lj x[1]}[{?[`tb;enlist (=;`side;enlist `$x);(enlist `time)!enlist `time;(`$("price",x;"size",x))!(`price;`size)]} each ("Sell";"Buy")]
time | priceSell sizeSell priceBuy sizeBuy
----------| -----------------------------------
2010.01.01| 101 50 100 30
The general pivot function on the Kx website can do this, see https://code.kx.com/q/kb/pivoting-tables/
q)piv[tb;(),`time;(),`side;`price`size;{[v;P]`$raze each string raze P[;0],'/:v,/:\:P[;1]};{x,z}]
time | Buyprice Sellprice Buysize Sellsize
----------| -----------------------------------
2010.01.01| 100 101 30 50
I have a pivot function on GitHub, but it doesn't support multiple value columns:
.math.st.pivot: {[t;rc;cf;ff]
P: asc distinct t cf;
Pcol: `$string[P] cross "_",/:string key ff;
t: ?[t;();rc!rc;key[ff]!{({[x;y;z] z each y#group x}[;;z];x;y)}[cf]'[key ff;value ff]];
t: ![t;();0b; Pcol! raze {((';#);x;$[-11h=type y;enlist;::] y)}'[key ff]'[P] ];
![t;();0b;key ff]
};
But you can left join to achieve expected result:
.math.st.pivot[tb;enlist`time;`side;enlist[`price]!enlist first]
lj .math.st.pivot[tb;enlist`time;`side;enlist[`size]!enlist first]
Looks like adding support for multiple columns is a good idea.

How can I delete a column by index from a kdb table?

For example how would you delete the first column from the following table:
q)t: ([] a: (2018.09.25; 2018.09.25; 2018.09.25); b: `ABC`XYZ`BAC ; c: (10 20 30))
q)t
a b c
-----------------
2018.09.25 ABC 10
2018.09.25 XYZ 20
2018.09.25 BAC 30
The expected result:
b c
---------
ABC 10
XYZ 20
BAC 30
It is possible to use delete a from t but I would like to be able to delete without knowing the exact column name beforehand.
You could use a functional delete:
q){[t;index]![t;();0b;enlist cols[t]index]}[t;0]
b c
------
ABC 10
XYZ 20
BAC 30
https://code.kx.com/q/ref/funsql/#delete
Use parse in order to see what the q-sql statement looks like in functional form:
q)parse"delete a from t"
!
`t
()
0b
,,`a
You could use
{(_/[cols x;desc y])#x}[t;0 2]
This takes the columns of your table and the indices you want to drop, and uses drop (_) with over (/) to remove those columns one at a time, highest index first. If you wanted to remove only one index, you'd have to enlist, like so:
{(_/[cols x;desc y])#x}[t;enlist 0]
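The descending sort matters because each drop shifts the positions of the columns after it; dropping index 0 first would mean index 2 no longer refers to the original third column. A short sketch of what the over is doing, using a hypothetical list c in place of cols t:

```q
c:`a`b`c

/ list _ index removes the item at that index
c _ 0

/ over (/) applies the drops one after another: index 2 first, then index 0
_/[c;desc 0 2]
```

The remaining column names are then used with take (#) to select those columns from the table.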
If your table is not keyed then you can do simple deletion from dictionary:
q) f:{[t;ind] enlist[cols[t] ind]_t}
q) f[t;0]
b c
------
ABC 10
XYZ 20
BAC 30
Using flip and drop :
q)flip 1_flip 0!t
b c
------
ABC 10
XYZ 20
BAC 30

kdb how to calculate rolling count

Assume I have a table of events, with Timestamp and Type.
t1, 'b'
t2, 'x'
t3, 's'
t4, 'b'
How can I get a rolling count such that it would give me a list of all timestamps and the cumulative number of events up to that ts, sort of like a count version of sums?
For example, for 'b' I'd like a table
't1', 1
't2', 1
't3', 1
't4', 2
Here is one way to do it using sums, although there may be a more clever way:
//table definition
tab:([]a:`t1`t2`t3`t4;b:"bxsb")
//rolling sum of 1 by column b
update sums count[i]#1 by b from tab
Results in:
a b x
------
t1 b 1
t2 x 1
t3 s 1
t4 b 2
If you wanted to replace column b with the count, you would simply put b: in front of the sums.
One way:
q)t:([]p:asc 4?.z.p+til 1000;t:`b`x`s`b)
q)asc `p xcols ungroup select p,til count i by t from t
p t x
---------------------------------
2017.05.16D09:42:48.259062090 b 0
2017.05.16D09:42:48.259062585 x 0
2017.05.16D09:42:48.259062683 s 0
2017.05.16D09:42:48.259062858 b 1
P.S. Note that I have started the sequence at 0, as if to say "I've had 0 events prior to this row", instead of beginning at 1 as per your example. This fits your requirement of "number of events up to that ts". If you need it to start at 1, just add 1: '1+til count i'. Also ensure your time column is sorted, so that the sequence makes sense.
With table t as below:
q)show t: ([]ts:.z.t - desc "u"$(til 4);symb:`b`x`z`b)
ts symb
-----------------
09:46:56.384 b
09:47:56.384 x
09:48:56.384 z
09:49:56.384 b
using a vector conditional:
q)select ts, cum_count:sums ?[symb=`b;1;0] from t
ts cum_count
----------------------
09:46:56.384 1
09:47:56.384 1
09:48:56.384 1
09:49:56.384 2
The same, but with a function taking symb as a parameter:
q){select ts, cum_count:sums ?[symb=x;1;0] from t}[`b]
ts cum_count
----------------------
09:46:56.384 1
09:47:56.384 1
09:48:56.384 1
09:49:56.384 2
In fact you don't need a vector conditional because you can just sum the booleans directly:
q){select ts, cum_count:sums symb=x from t}[`b]
ts cum_count
----------------------
09:46:56.384 1
09:47:56.384 1
09:48:56.384 1
09:49:56.384 2
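The reason this works is that a comparison returns a boolean vector, and sums treats 1b as 1 and 0b as 0. A minimal illustration, using a hypothetical standalone list symb rather than the table column:

```q
symb:`b`x`z`b

/ one flag per row: 1b where the symbol matches
symb=`b

/ the running total of the flags is the rolling count of `b events
sums symb=`b
```

The last expression should produce the same sequence as the cum_count column above.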
This also works:
update x:1+til count i by b from tab

How to apply a function to an entire column?

I have the following table from a JDBC connection in Q.
q)r
some_int this created_at updated_at ..
-----------------------------------------------------------------------------..
1231231 "ASD" 2016.02.11D14:16:29.743260000 2016.02.11D14:16:29...
13312 "TSM" 2016.02.11D14:16:29.743260000 2016.02.11D14:16:29...
I would like to apply the following function to the first column.
deviation:{a:avg x; sqrt avg (x*x)-a*a}
This works for arrays.
q)l
1 2 3 4
q)deviation l
1.118034
How can I apply deviation on a column in a table? It seems my approach does not work:
q)select deviation(some_id) from r
'rank
UPDATE:
I cannot explain the following:
q)select avg(some_int) from r
some_int
---------
1005341
q)select min(some_int) from r
some_int
---------
812361
q)select max(some_int) from r
some_int
---------
1184014
q)select sum(some_int) from r
some_int
---------
You need to enlist the result if it is an atom since table columns must be lists, not atoms. Normally kdb can do this for you but often not when you're performing your own custom aggregations. For example, even if you define a function sum2 to be an exact copy of sum:
q)sum2:sum
kdb can only recognise sum as an aggregation and will enlist automatically, but not for sum2
q)select sum col1 from ([]col1:1 2 3 4)
col1
----
10
q)select sum2 col1 from ([]col1:1 2 3 4)
'rank
So you need to enlist in the second case:
q)select enlist sum2 col1 from ([]col1:1 2 3 4)
col1
----
10
UPDATE:
To answer your second question - it looks like your sum of numbers has spilled over the boundary for an integer. You'd need to convert them to long and then sum
q)select sum col1 from ([]col1:2147483645 1i)
col1
----------
2147483646
Above is the largest finite int value. Adding one more gives 2147483647, which q treats as infinity (0W) for an int
q)select sum col1 from ([]col1:2147483645 1 1i)
col1
----
0W
Adding anything more than that shows a blank (null)
q)select sum col1 from ([]col1:2147483645 1 1 1i)
col1
----
Solution is to cast to long before summing (or make them long in the first place)
q)select sum `long$col1 from ([]col1:2147483645 1 1 1i)
col1
----------
2147483648
You get a rank error because the function does not return a list. Since the function returns a single number, presumably you just want the single number as the answer? In that case you can simply index into the table (or use exec) to get the column vector and apply the function to it:
deviation t`some_id
Otherwise, if you want to retain a table as the answer, enlist the result:
select enlist deviation some_id from t
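As an aside, q also has a built-in dev, which computes the population standard deviation. A brief sketch comparing the options, assuming a small stand-in table r with a some_int column (the data here are made up for illustration):

```q
deviation:{a:avg x; sqrt avg (x*x)-a*a}
r:([]some_int:1 2 3 4)

/ custom function: enlist the atom so that it can form a one-row column
select enlist deviation some_int from r

/ exec returns the bare atom instead of a table
exec deviation some_int from r

/ built-in dev computes the same quantity and is handled by select directly
select dev some_int from r
```

Using the built-in avoids both the custom function and the enlist, provided the population (rather than sample) standard deviation is what is wanted.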