Fby q/kdb aggregation function - kdb

I want to apply an fby using count distinct, i.e.
select from t where 1=(count distinct; column) fby another_column
How can I do this?

To add to what's already up there, you could do this in fewer characters by applying distinct with #:
q)n:100
// Create a table where all entries for sym=`a are size 10
q)show t:update size:10 from ([]sym:n#`a`b`c`d;size:n?200) where sym = `a
sym size
--------
a 10
b 28
c 51
d 64
a 10
b 43
...
// Use count distinct# to select from the table as per your requirements
q)select from t where 1=(count distinct#;size) fby sym
sym size
--------
a 10
a 10
a 10
a 10
a 10
0N! is a great operator for checking the operation of these queries 'in-flight'.
Using it, we can see that count distinct fails on its own because it tries to count the function distinct, which returns 1:
q)select from t where 1=(0N!count distinct; size) fby sym
1
'type
[0] select from t where 1=(0N!count distinct; size) fby sym
However, with dyadic # we can create a handy projection:
q)select from t where 1=(0N!(count distinct#); size) fby sym
##[?:]
sym size
--------
a 10
a 10
a 10
a 10
...
I've had to wrap the projection in brackets here to prevent 0N! getting sucked into the count distinct # projection. In k-speak this effectively translates to 'count the result of the distinct operator applied to whatever the second argument to # is'. Quite handy for code-golfing.

It'd be easier to know for sure if you provided a sample table and the desired output, but one of the following will likely help.
The simplest solution would be to use an anonymous function:
select from t where 1=({count distinct x};c1) fby c2
Alternatively you can use this syntax which I first saw used in Nick Psaris' new book Fun Q:
select from t where 1=(count distinct ::;c1) fby c2
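To see why these forms work, it can help to look at what fby returns before it is compared with 1. The following is a minimal sketch on a made-up table (the table t and its values are assumptions, since no sample was given in the question); the column names c1 and c2 follow the queries above.

```q
/ hypothetical sample: groups `a and `c each hold a single distinct c1 value
t:([] c2:`a`b`a`b`c; c1:10 28 10 43 7)

/ fby returns one value per row: the aggregate computed over that row's group
perRow:({count distinct x};t`c1) fby t`c2      / 1 2 1 2 1

/ keep only the rows whose group has exactly one distinct c1
res:select from t where 1=({count distinct x};c1) fby c2
```

Here res keeps the two `a rows and the single `c row, and drops both `b rows because that group has two distinct values.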

Related

Select distinct for all columns from keyed table

It seems we cannot get distinct values from a keyed table in the same way as for an unkeyed one:
t:([a:1 2]b:3 4)
?[t;();0b;()] // keyed table
?[0!t;();1b;()] // unkeyed table
?[t;();1b;()] // err 'type
Why do we have this error here?
I suspect it's the same reason you can't run distinct on a dictionary - it's ambiguous. Do you intend to apply distinct to the keys or the values? I think kdb doesn't pick a side so it makes you do it yourself.
q)t:([]a:1 1 1 2 2;b:10 12 10 14 14)
q)select distinct from t
a b
----
1 10
1 12
2 14
q)select distinct from 1!t
'type
q)distinct `a`b`c!(1;"ab";enlist 1b)
'type
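If picking a side yourself is acceptable, one possible workaround is to unkey the table before asking for distinct rows, and to apply distinct to the key or the value of a dictionary explicitly. This is only a sketch of that idea; the names kt, u and d are introduced here for illustration.

```q
t:([]a:1 1 1 2 2;b:10 12 10 14 14)
kt:1!t                              / keyed version of t

/ unkey with 0! first, then take distinct rows
u:select distinct from 0!kt

/ for a dictionary, state explicitly which side you mean
d:`a`b`c!(1;"ab";enlist 1b)
dk:distinct key d                   / distinct over the keys
dv:distinct value d                 / distinct over the values
```

Whether to re-key the result afterwards (for example with 1!u) depends on whether the key column is still unique after de-duplication.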

Functional update - multivariable function with dynamic columns

Any help with the following would be much appreciated!
I have two tables: table1 is a summary table whilst table2 is a list of all data points. I want to be able to summarise the information in table2 for each row in table1.
table1:flip `grp`constraint!(`a`b`c`d; 10 10 20 20);
table2:flip `grp`cat`constraint`val!(`a`a`a`a`a`b`b`b;`cl1`cl1`cl1`cl2`cl2`cl2`cl2`cl1; 10 10 10 10 10 10 20 10; 1 2 3 4 5 6 7 8);
function:{[grpL;constraintL;catL] first exec total: sum val from table2 where constraint=constraintL, grp=grpL,cat=catL};
update cl1:function'[grp;constraint;`cl1], cl2:function'[grp;constraint;`cl2] from table1;
The fourth line of this code achieves what I want for the two categories, cl1 and cl2.
In table1 I want to name a new column with the name of the category (cl1, cl2, etc.) and I want the values in that column to be the output from running the function over that column.
However, I have hundreds of different categories, so don't want to have to list them out manually as in the fourth line. How would I pass in a list of categories, e.g. below?
`cl1`cl2`cl3
Sticking to your approach, you would just have to make your update statement functional and then iterate over the columns like so:
{![`table1;();0b;(1#x)!enlist ((';function);`grp;`constraint;1#x)]} each `cl1`cl2
This assumes you can amend table1 in place. If you must retain the original table1, you can pass it by value instead, though this will consume more memory:
{![x;();0b;(1#y)!enlist ((';function);`grp;`constraint;1#y)]}/[table1;`cl1`cl2]
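A hedged sketch of how the functional form can be derived and checked: parse shows the tree q builds for the hand-written query, which is what the ![...] call reproduces. The definitions are copied from the question; addCat and res are hypothetical names introduced here for the by-value version.

```q
/ definitions copied from the question
table1:flip `grp`constraint!(`a`b`c`d; 10 10 20 20)
table2:flip `grp`cat`constraint`val!(`a`a`a`a`a`b`b`b;`cl1`cl1`cl1`cl2`cl2`cl2`cl2`cl1; 10 10 10 10 10 10 20 10; 1 2 3 4 5 6 7 8)
function:{[grpL;constraintL;catL] first exec total: sum val from table2 where constraint=constraintL, grp=grpL,cat=catL}

/ inspect the parse tree of the hand-written update for one category
parse "update cl1:function'[grp;constraint;`cl1] from table1"

/ addCat (hypothetical name): add one category column c to table tab, by value
addCat:{[tab;c] ![tab;();0b;(1#c)!enlist ((';function);`grp;`constraint;1#c)]}

/ fold over the category list; table1 itself is left unchanged
res:addCat/[table1;`cl1`cl2]
```

With the sample data, res gains the columns cl1 (6 8 0 0) and cl2 (9 6 0 0), while table1 still has only grp and constraint.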
Another approach would be to aggregate, pivot and join though it's not necessarily a better solution as you get nulls rather than zeros
a:select sum val by cat,grp,constraint from table2
p:exec (exec distinct cat from a)#cat!val by grp,constraint from a
table1 lj p
There are several different methods you can look into.
The easiest method would be a functional update - http://code.kx.com/wiki/JB:QforMortals2/queries_q_sql#Functional_update
The approach below, though, should prove more useful, quicker and neater:
Your problem can be split into 2 parts. For the first part, you are looking to create a sum of each category by grp and constraint within table2. As for the second part, you are looking to join these results (the lookups) onto the corresponding records from table1.
You can create the necessary groups using by
q)exec val,cat by grp,constraint from table2
grp constraint| val       cat
--------------| ------------------------------
a   10        | 1 2 3 4 5 `cl1`cl1`cl1`cl2`cl2
b   10        | 6 8       `cl2`cl1
b   20        | ,7        ,`cl2
Note though, this will only create nested lists of the columns in your select query
Next is to sum each of the cat groups
q)exec sum each val group cat by grp,constraint from table2
grp constraint|
--------------| ------------
a   10        | `cl1`cl2!6 9
b   10        | `cl2`cl1!6 8
b   20        | (,`cl2)!,7
Then, to create a column for each cat, you can use a pivot-like syntax - http://code.kx.com/wiki/Pivot
q)cats:asc exec distinct cat from table2
q)exec cats#sum each val group cat by grp,constraint from table2
grp constraint| cl1 cl2
--------------| -------
a   10        | 6   9
b   10        | 8   6
b   20        |     7
Now you can use this lookup table and index into each row from table1
q)(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
The rows for c and d have no match in the lookup, so they come back as nulls. To fill the nulls with zeros, use the caret symbol - http://code.kx.com/wiki/Reference/Caret
q)0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
0 0
0 0
And now you can join on each row from table1 to your results using join-each
q)table1,'0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
grp constraint cl1 cl2
----------------------
a 10 6 9
b 10 8 6
c 20 0 0
d 20 0 0
HTH, Sean
This approach is the easiest way to pass in a list of categories
{table1^flip x!function'[table1`grp;table1`constraint;]each x}`cl1`cl2

How to apply a function to an entire column?

I have the following table from a JDBC connection in Q.
q)r
some_int this created_at updated_at ..
-----------------------------------------------------------------------------..
1231231 "ASD" 2016.02.11D14:16:29.743260000 2016.02.11D14:16:29...
13312 "TSM" 2016.02.11D14:16:29.743260000 2016.02.11D14:16:29...
I would like to apply the following function to the first column.
deviation:{a:avg x; sqrt avg (x*x)-a*a}
This works for arrays.
q)l
1 2 3 4
q)deviation l
1.118034
How can I apply deviation on a column in a table? It seems my approach does not work:
q)select deviation(some_int) from r
'rank
UPDATE:
I cannot explain the following:
q)select avg(some_int) from r
some_int
---------
1005341
q)select min(some_int) from r
some_int
---------
812361
q)select max(some_int) from r
some_int
---------
1184014
q)select sum(some_int) from r
some_int
---------
You need to enlist the result if it is an atom since table columns must be lists, not atoms. Normally kdb can do this for you but often not when you're performing your own custom aggregations. For example, even if you define a function sum2 to be an exact copy of sum:
q)sum2:sum
kdb recognises only the built-in sum as an aggregation and enlists its result automatically; it does not do so for sum2:
q)select sum col1 from ([]col1:1 2 3 4)
col1
----
10
q)select sum2 col1 from ([]col1:1 2 3 4)
'rank
So you need to enlist in the second case:
q)select enlist sum2 col1 from ([]col1:1 2 3 4)
col1
----
10
UPDATE:
To answer your second question - it looks like your sum of numbers has spilled over the boundary for an integer. You'd need to convert them to long and then sum
q)select sum col1 from ([]col1:2147483645 1i)
col1
----------
2147483646
Above is the largest finite int. Adding one more gives 2147483647, which is infinity for an int:
q)select sum col1 from ([]col1:2147483645 1 1i)
col1
----
0W
Adding one more again overflows to the int null, which displays as a blank:
q)select sum col1 from ([]col1:2147483645 1 1 1i)
col1
----
The solution is to cast to long before summing (or make them long in the first place):
q)select sum `long$col1 from ([]col1:2147483645 1 1 1i)
col1
----------
2147483648
You get a 'rank error because the function does not return a list. Since the function returns a single number, presumably you just want the single-number answer? In that case you can simply index into the table (or use exec) to get the column vector and apply the function to it:
deviation r`some_int
Otherwise, if you want to retain a table as the answer, enlist the result:
select enlist deviation some_int from r
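The options above can be compared side by side in a small sketch. The table r below is a hypothetical stand-in for the JDBC result (only the some_int column, with made-up values); deviation is the function from the question. The built-in dev is also shown, since q already recognises it as an aggregate and so needs no enlist.

```q
/ hypothetical stand-in for the JDBC table, using the list from the question
r:([] some_int:1 2 3 4)
deviation:{a:avg x; sqrt avg (x*x)-a*a}

/ exec returns the bare value: an atom, here 1.118034
a1:exec deviation some_int from r

/ select needs a list per column, so enlist the atom to get a one-row table
t1:select enlist deviation some_int from r

/ the built-in dev is a recognised aggregate, so no enlist is required
t2:select dev some_int from r
```

Both t1 and t2 are one-row tables holding the same value as a1.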

kdb: dynamically denormalize a table (convert key values to column names)

I have a table like this:
q)t:([sym:(`EURUSD`EURUSD`AUDUSD`AUDUSD);server:(`S01`S02`S01`S02)];volume:(20;10;30;50))
q)t
sym server| volume
-------------| ------
EURUSD S01 | 20
EURUSD S02 | 10
AUDUSD S01 | 30
AUDUSD S02 | 50
I need to de-normalize it to display the data nicely. The resulting table should look like this:
sym | S01 S02
------| -------
EURUSD| 20 10
AUDUSD| 30 50
How do I dynamically convert the original table using distinct values from server column as column names for the new table?
Thanks!
Basically, you want a pivot table. The following page has a very good solution for your problem:
http://code.kx.com/q/cookbook/pivoting-tables/
Here are the commands to get the required table:
q) P:asc exec distinct server from t
q) exec P#(server!volume) by sym:sym from t
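For reference, here is a sketch of those two commands run against the table from the question, with the result they should produce. The name piv is introduced here for illustration. Note that by sym sorts the keys, so AUDUSD comes before EURUSD rather than in the order shown in the question.

```q
t:([sym:`EURUSD`EURUSD`AUDUSD`AUDUSD; server:`S01`S02`S01`S02]; volume:20 10 30 50)

/ the distinct server values become the column names
P:asc exec distinct server from t

/ for each sym, build a server!volume dictionary and take the P columns from it
piv:exec P#(server!volume) by sym:sym from t

/ piv is keyed on sym:
/ sym   | S01 S02
/ ------| -------
/ AUDUSD| 30  50
/ EURUSD| 20  10
```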
One tricky thing about pivoting a table is that the keys of the dictionary must be of type symbol; otherwise it won't generate the pivot table structure.
For example, in the following table we have a column dt of type date.
t:([sym:(`EURUSD`EURUSD`AUDUSD`AUDUSD);dt:(0 1 0 1+.z.d)];volume:(20;10;30;50))
Now if we want to pivot it with the dates as columns, it will generate a structure like:
q)P:asc exec distinct dt from t
q)exec P#(dt!volume) by sym:sym from t
(`s#flip (enlist `sym)!enlist `s#`AUDUSD`EURUSD)!((`s#2018.06.22 2018.06.23)!30j, 50j;(`s#2018.06.22 2018.06.23)!20j, 10j)
To get the dates as the columns, the dt column has to be cast to symbol:
q)show P:asc exec distinct `$string dt from t
`s#`2018.06.22`2018.06.23
q)exec P#((`$string dt)!volume) by sym:sym from t
sym | 2018.06.22 2018.06.23
------| ---------------------
AUDUSD| 30 50
EURUSD| 20 10

kdb+: function with two arguments from columns

I have a function that does something with a date and a function that takes two arguments to perform a calculation. For now let's assume that they look as follows:
d:{[x] :x.hh}
f:{[x;y] :x+y}
Now I want to use function f in a query as follows:
select f each (columnOne,d[columnTwo]) from myTable
Hence, I first want to convert one column to the corresponding numbers using function d. Then, using both columnOne and the output of d[columnTwo], I want to calculate the outcome of f.
Clearly, the approach above does not work, as it fails with a 'rank error.
I've also tried select f ./: (columnOne,'d[columnTwo]) from myTable, which also doesn't work.
How do I do this? Note that I need to input columnOne and columnTwo into f such that the corresponding rows still match. E.g. input row 1 of columnOne and row 1 of columnTwo simultaneously into f.
On your attempt with select f ./: (columnOne,'d[columnTwo]) from myTable: you're very close with that code. The issue is the d function, in particular the x.hh within it. The .hh notation doesn't work in this context; you need `hh$x instead, so d becomes:
d:{[x] :`hh$x}
So making only this change to the above code, we get:
q)d:{[x] :`hh$x}
q)f:{[x;y] :x+y}
q)myTable:([] columnOne:10?5; columnTwo:10?.z.t);
q)update res:f ./: (columnOne,'d[columnTwo]) from myTable
columnOne columnTwo res
--------------------------
1 21:10:45.900 22
0 20:23:25.800 20
2 19:03:52.074 21
4 00:29:38.945 4
1 04:30:47.898 5
2 04:07:38.923 6
0 06:22:45.093 6
1 19:06:46.591 20
1 10:07:47.382 11
2 00:45:40.134 2
(I've changed select to update so you can see other columns in result table)
Other syntax to achieve the same:
q)update res:f'[columnOne;d columnTwo] from myTable
columnOne columnTwo res
--------------------------
1 21:10:45.900 22
0 20:23:25.800 20
2 19:03:52.074 21
4 00:29:38.945 4
1 04:30:47.898 5
2 04:07:38.923 6
0 06:22:45.093 6
1 19:06:46.591 20
1 10:07:47.382 11
2 00:45:40.134 2
The only other noteworthy point: in the above example, function d is vectorised (it works with a vector argument). If this weren't the case, you'd need to change d[columnTwo] to d each columnTwo (or d'[columnTwo]).
This would then result in one of the following queries:
select res:f'[columnOne;d'[columnTwo]] from myTable
select res:f ./: (columnOne,'d each columnTwo) from myTable
select res:f ./: (columnOne,'d'[columnTwo]) from myTable
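To make the row-by-row pairing easy to verify, here is a small sketch with fixed values instead of random ones. The three-row myTable is made up for illustration; d and f are the corrected functions from above. The last query assumes, as is true for this particular f, that the function is itself vectorised, in which case no iterator is needed at all.

```q
d:{[x] :`hh$x}
f:{[x;y] :x+y}

/ hypothetical fixed data so the result can be checked by hand
myTable:([] columnOne:1 0 2; columnTwo:21:10:45.900 20:23:25.800 00:45:40.134)

/ pair row i of columnOne with row i of d[columnTwo]
r1:select res:f'[columnOne;d columnTwo] from myTable

/ same pairing via join-each and dot-apply
r2:select res:f ./: (columnOne,'d[columnTwo]) from myTable

/ f is plain addition, which is vectorised, so it can also take the whole columns
r3:select res:f[columnOne;d columnTwo] from myTable
```

All three give res as 22 20 2, that is 1+21, 0+20 and 2+0.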