kdb+: group by and sum over multiple columns

Consider the following data:
table:
time         colA colB colC
---------------------------
11:30:04.194 31   250  a
11:30:04.441 31   280  a
11:30:14.761 31.6 100  a
11:30:21.324 34   100  a
11:30:38.991 32   100  b
11:31:20.968 32   100  b
11:31:56.922 32.2 1000 b
11:31:57.035 32.6 5000 c
11:32:05.810 33   100  c
11:32:05.810 33   100  a
11:32:14.461 32   300  b
Now, how can I average colA and sum colB over consecutive rows sharing the same colC, without losing the time order?
So the output would be:
time         avgA  sumB colC
----------------------------
11:30:04.194 31.9  730  a
11:30:38.991 32.07 1200 b
11:31:57.035 32.8  5100 c
11:32:05.810 33    100  a
11:32:14.461 32    300  b
What I have so far:
select by time from (select first time, avg colA, sum colB by colC, time from table)
But the output is not grouped by colC. What should the query look like?

How about this?
get select first time, avg colA, sum colB, first colC by sums colC<>prev colC from table

A slightly different way to achieve this, using differ:
value select first time, avg colA, sum colB , first colC by g:(sums differ colC) from table
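To see why the grouping key works, here is a minimal sketch on the colC values above: differ flags the rows where colC changes from the previous row, and sums turns those flags into a running group id, so each consecutive run of the same symbol gets its own key. The colC<>prev colC comparison in the first answer builds the same flags by hand.
q)colC:`a`a`a`a`b`b`b`c`c`a`b
q)differ colC        / 1b wherever the value changes from the previous item
10001001011b
q)sums differ colC   / running group id per consecutive run
1 1 1 1 2 2 2 3 3 4 5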

Related

Add condition to where clause in q/kdb+

Table Tab:
minThreshold maxThreshold point
-------------------------------
1000         10000        10
wClause,:enlist((';~:;<);`qty;Tab[`minThreshold])
I am trying to incorporate the maxThreshold column into the where clause as well, so that:
qty >= minThreshold
qty <= maxThreshold
Something like:
wClause,:enlist((';~:;<);`qty;Tab[`minThreshold]);Tab[`maxThreshold])
q)Tab:([] minThreshold:500 1000;maxThreshold:700 2000;point:5 10)
q)Tab
minThreshold maxThreshold point
-------------------------------
500          700          5
1000         2000         10
q)select from Tab where minThreshold>=900,maxThreshold<=2500
minThreshold maxThreshold point
-------------------------------
1000         2000         10
q)parse"select from Tab where minThreshold>=900,maxThreshold<=2500"
?
`Tab
,(((';~:;<);`minThreshold;900);((';~:;>);`maxThreshold;2500))
0b
()
q)?[Tab;((>=;`minThreshold;900);(<=;`maxThreshold;2500));0b;()]
minThreshold maxThreshold point
-------------------------------
1000         2000         10
See the whitepaper for more information on functional selects:
https://code.kx.com/q/wp/parse-trees/
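Following the parse trees above, the asker's original where-clause list can be extended with a second condition for the upper bound. A minimal sketch, assuming `qty is a column of the table being queried and that the thresholds are read from a single row of Tab:
/ (';~:;<) is the parse-tree form of >=, and (';~:;>) the form of <=
wClause:enlist((';~:;<);`qty;first Tab`minThreshold)   / qty>=minThreshold
wClause,:enlist((';~:;>);`qty;first Tab`maxThreshold)  / qty<=maxThreshold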
Is your problem:
(1) you have a Where phrase that works for functional qSQL and you want to extend it?
(2) you want to select rows of a table where the value of a quantity falls within an upper and lower bound?
If (2) you can use Join Each to get the bounds for each row, and within to test the quantity.
q)show t:([]lwr:1000 900 150;upr:10000 25000 500;qty:10 1000 450)
lwr  upr   qty
--------------
1000 10000 10
900  25000 1000
150  500   450
q)select from t where qty within' lwr{x,y}'upr
lwr upr   qty
--------------
900 25000 1000
150 500   450
Above we use {x,y} because in qSQL queries comma does not denote Join.
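A variant not in the original answer: within also accepts a pair of conforming lists, so the Join Each can be dropped entirely:
q)select from t where qty within (lwr;upr)
lwr upr   qty
--------------
900 25000 1000
150 500   450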

Pivot table with multiple value columns in KDB+

I would like to transform the following two-row table, generated by:
tb: ([] time: 2010.01.01 2010.01.01; side:`Buy`Sell; price:100 101; size:30 50)
time       side price size
--------------------------
2010.01.01 Buy  100   30
2010.01.01 Sell 101   50
To the single-row table below:
tb1:([] time:enlist 2010.01.01; price_buy:enlist 100; price_sell:enlist 101; size_buy:enlist 30; size_sell:enlist 50)
time       price_buy price_sell size_buy size_sell
--------------------------------------------------
2010.01.01 100       101        30       50
What is the most efficient way to achieve this?
(select price_buy:price, size_buy:size by time from tb where side = `Buy) lj select price_sell:price, size_sell:size by time from tb where side = `Sell
time      | price_buy size_buy price_sell size_sell
----------| ---------------------------------------
2010.01.01| 100       30       101        50
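If the flat, unkeyed layout of tb1 is wanted rather than this keyed result, 0! removes the key (a small addition, not part of the original answer):
q)0!(select price_buy:price, size_buy:size by time from tb where side=`Buy) lj select price_sell:price, size_sell:size by time from tb where side=`Sell
time       price_buy size_buy price_sell size_sell
--------------------------------------------------
2010.01.01 100       30       101        50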
If you wanted to avoid 2 select statements:
raze each select `price_buy`price_sell!(side!price)#/:`Buy`Sell, `size_buy`size_sell!(side!size)#/:`Buy`Sell by time from tb
As an additional note, having a date column labeled time can be misleading; typical financial tables in kdb+ use columns such as date, time and sym.
Edit: Functional form for dynamic column generation:
{x[0] lj x[1]}[{?[`tb;enlist (=;`side;enlist `$x);(enlist `time)!enlist `time;(`$("price",x;"size",x))!(`price;`size)]} each ("Sell";"Buy")]
time      | priceSell sizeSell priceBuy sizeBuy
----------| -----------------------------------
2010.01.01| 101       50       100      30
The general pivot function on the Kx website can do this, see https://code.kx.com/q/kb/pivoting-tables/
q)piv[tb;(),`time;(),`side;`price`size;{[v;P]`$raze each string raze P[;0],'/:v,/:\:P[;1]};{x,z}]
time      | Buyprice Sellprice Buysize Sellsize
----------| -----------------------------------
2010.01.01| 100      101       30      50
I have a pivot function on GitHub, but it doesn't support multiple value columns:
.math.st.pivot:{[t;rc;cf;ff]                    / t: table; rc: row-key cols; cf: column to pivot on; ff: dict of value column -> aggregation function
  P:asc distinct t cf;                          / distinct pivot values
  Pcol:`$string[P] cross "_",/:string key ff;   / generated column names, e.g. `Buy_price
  t:?[t;();rc!rc;key[ff]!{({[x;y;z] z each y#group x}[;;z];x;y)}[cf]'[key ff;value ff]];
  t:![t;();0b;Pcol!raze {((';#);x;$[-11h=type y;enlist;::] y)}'[key ff]'[P]];
  ![t;();0b;key ff]                             / drop the intermediate per-field columns
 };
But you can left join two single-column pivots to achieve the expected result:
.math.st.pivot[tb;enlist`time;`side;enlist[`price]!enlist first]
lj .math.st.pivot[tb;enlist`time;`side;enlist[`size]!enlist first]
Looks like adding support for multiple columns is a good idea.

Replace infinity with nulls throughout entire table KDB

Example table:
table:([]col1:20 40 30 0w;col2:4?4;col3: 100 200 0w 300)
My solution:
{.[table;(where 0w=table[x];x);:;0n]}'[exec c from meta table where t="f"]
There must be a way I am not seeing, I'm sure. This just returns a list of tables, one per changed column, which I don't want; I just want the original table returned with the infinities replaced by nulls.
Thanks in advance!
It would be good to flesh out your question a bit more. Are you always expecting it to be float columns? Will the table have many columns? Will there be string/sym columns mixed in that might complicate things?
If your table has a small number of columns you could just do an update
q)show t
col1 col2 col3
--------------
20   1    100
40   2    200
30   2    0w
0w   1    300
q)inftonull:{(x where x=0w):0n;x}
q)update inftonull col1, inftonull col3 from t
col1 col2 col3
--------------
20   1    100
40   2    200
30   2
     1    300
If you think the column names might change or have a very large number of columns you could try a functional update (where you can pass the column names as parameters)
q){![t;();0b;x!inftonull,/:x,:()]}`col1`col3
col1 col2 col3
--------------
20   1    100
40   2    200
30   2
     1    300
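(In the functional form above, x,:() promotes an atom argument to a list, so a single column can be passed as `col1 rather than enlist `col1.)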
If your table is composed of only numeric data, something like
q)flip{(x where x=.Q.t[type x]$0w):x 0N;x}each flip t
col1 col2 col3
--------------
20   1    100
40   2    200
30   2
     1    300
might work; it uses .Q.t to build the matching typed infinity for each column, accounting for the fact that the numeric columns have different types.
If your data is going to contain string/sym columns, the last example won't work.
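Pulling the pieces above together (a sketch combining the answer's inftonull and functional update with the meta lookup from the question): find the float columns once, then null their infinities in a single pass, so one table comes back rather than a list of tables.
q)floatcols:exec c from meta table where t="f"
q)![table;();0b;floatcols!inftonull,/:floatcols]
Alternatively, the asker's own amend works if it is iterated with over (/), threading the table through as the seed, instead of with each:
q){.[x;(where 0w=x y;y);:;0n]}/[table;floatcols]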

Filter rows based on two fields, where one of them contains a selection criterion

Given the following table
group | weight | category_id | category_name_plus
------+--------+-------------+-------------------
1     | 10     | 100         | Ab
1     | 20     | 101         | Bcd
1     | 30     | 100         | Efghij
2     | 10     | 101         | Bcd
2     | 20     | 101         | Cdef
2     | 30     | 100         | Defgh
2     | 40     | 100         | Ab
3     | 10     | 102         | Fghijkl
3     | 20     | 101         | Ab
The "weight" is unique for each group and is also an indicator for the order of records inside the group.
What I want is to retrieve one record per group filtered by category_id, but only the record having the highest "weight" inside its "group".
Example for filtering by category_id = 100:
group | weight | category_id | category_name_plus
------+--------+-------------+-------------------
1     | 30     | 100         | Efghij
2     | 40     | 100         | Ab
Example for filtering by category_id = 101:
group | weight | category_id | category_name_plus
------+--------+-------------+-------------------
1     | 20     | 101         | Bcd
2     | 20     | 101         | Cdef
3     | 20     | 101         | Ab
How can I select just these rows?
I tried fiddling with UNIQUE, MAX(category_id) etc. but I'm still unable to get the correct results. The main problem for me is to get the category_name_plus value here.
I am working with PostgreSQL 9.4 (beta 3), because I also need various other niceties like "WITH ORDINALITY" etc.
The rank window function should do the trick. Note that the category filter must be applied inside the subquery, before ranking; otherwise a group whose highest-weight row has a different category would drop out entirely:
SELECT "group", weight, category_id, category_name_plus
FROM (SELECT "group", weight, category_id, category_name_plus,
             RANK() OVER (PARTITION BY "group"
                          ORDER BY weight DESC) AS rk
      FROM my_table
      WHERE category_id = 101) t
WHERE rk = 1
Note:
"group" is a reserved word in SQL, so it has to be surrounded by quotes in order to be used as a column name. It would probably be better, though, to replace it with a non-reserved word, such as "group_id".
Try something like:
SELECT DISTINCT ON ("group") *
FROM your_table
WHERE category_id = 101
ORDER BY "group", weight DESC

Insert rownumber repeatedly in records in t-sql

I want to insert a row number into the records that counts repeatedly over a fixed range, in this case 1 to 3. Example output:
RowNumber ID Name
1         20 a
2         21 b
3         22 c
1         23 d
2         24 e
3         25 f
1         26 g
2         27 h
3         28 i
1         29 j
2         30 k
I would rather use row_number() over (partition by ... order by ...), but my real records do not contain columns that could partition the rows into the 1-3 numbering.
I already tried looping over each record to insert a row count of 1-3, but the loop hurts the performance of the query. The query will be used for an RDL report, which is why the performance must be as good as possible.
Any suggestions are welcome. Thanks!
Have you tried taking row_number() modulo 3?
SELECT
    ((ROW_NUMBER() OVER (ORDER BY ID) - 1) % 3) + 1 AS RowNumber,
    ID,
    Name
FROM your_table