kdb/q question: How do I interpret this groupby in my functional selection? - group-by

I am new to kdb/q and am trying to figure out what this particular query means. The code is using functional select, which I am not overly comfortable with.
?[output;();b;a];
where output is some table which has columns size time symbol
the groupby filter dictionary b is defined as follows
key | value
---------------
ts | ("+";00:05:00v;("k){x*y div x:$[16h=abs[#x];"j"$x;x]}";00:05:00v;("%:";`time)))
sym | ("k){x'y}";"{`$(,/)("/" vs string x)}";`symbol)
For the sake of completeness, dictionary a is defined as
volume ("sum";`size)
In effect, the functional select seems to be bucketing the data into 5 minute buckets and doing some parsing in symbol. What baffles me is how to read the groupby dictionary. Especially the k)" part and the entire thing being in quotes. Can someone help me go through this or point me to resources that can help me understand? Any input will be appreciated.

The aggregation part of the function form takes a dictionary, the key being the output key column names and the values being parse tree functions.
A parse tree is an expression that is not immediately evaluated. The first argument as a function and subsequent elements are its arguments. The inner-most brackets are evaluated first and then it moves up the heirarchy, evaluating each one in turn. More detailed information can be found here and in the whitepaper linked on that page
You can use the function parse with a string argument to get the parse tree of a function. For example, the parse tree for 1+2+3 is (+;1;(+;2;3)):
q)parse "1+2+3"
+
1
(+;2;3)
The inner-most bracket (+;2;3) is evaluated first resulting in 5, before the result is propogated up to the outmost parse tree function (+;1;5) giving 6
The groupby part of the clause will evaluate one or more parse tree functions and then will collect together records with the same output from the grouping function.
Making the function a bit clearer to read:
(+;00:05:00v;({x*y div x:$[16h=abs[#x];"j"$x;x]}";00:05:00v;(%:;`time)))
Looking at the inner most bracket (%:;`time), it returns the result of %: applied on the time column. We can see that %: is k for the function ltime
q)ltime
%:
Moving up a level, the next function evaluated is the lambda function {x*y div x:$[16h=abs[#x];"j"$x;x]} with arguments 00:05:00v and the result of our previous evaluated function. The lambda rounds it down the the nearest 5 minute interval
({x*y div x:$[16h=abs[#x];"j"$x;x]};00:05:00v;(%:;`time))
Moving up once more to the whole expression it is equivalent to 00:05:00v + {x*y div x:$[16h=abs[#x];"j"$x;x]};00:05:00v;(%:;`time)), with 00:05:00 being added onto each result from the previous evaluation.
So essentially it first returns the local time of the timestamp, then
For the symbol aggregation
("k""{x'y}";{`$(,/)("/" vs string x)};`symbol)
The inner function {`$(,/)("/" vs string x)} strings a symbol, splits it at "/" character and then joins it back together, effectively removing the slash
The "k" is a function that evaluates the string using the k interpreter.
"k""{x'y}"" returns a function which itself takes a function x and argument y and modifies the function to use the each-both adverb '. This makes it so that the function x is applied on each symbol individually as opposed to the column as a whole.
This could be implemented in q instead of k like so:
({x#'y};{`$(,/)("/" vs string x)};`symbol)
The function {x#'y} takes the function argument {`$(,/)("/" vs string x)} and the symbol column as before, but we have to use # with the each-both adverb in q to apply the function on the arguments.
The aggregation function will then be applied to each group. In your case the function is a simple parse tree, which will return the sum of the size columns in each group, with the output column called volume
a:enlist[`volume]!enlist (sum;`size)

Related

kdb: differences between value and eval

From KX: https://code.kx.com/q/ref/value/ says, when x is a list, value[x] will be result of evaluating list as a parse tree.
Q1. In code below, I understand (A) is a parse tree, given below definition. However, why does (B) also work? Is ("+";3;4) a valid parse tree?
q)value(+;3;4) / A
7
q)value("+";3;4) / B
7
q)eval(+;3;4) / C
7
q)eval("+";3;4) / D
'length
[0] eval("+";3;4)
Any other parse tree takes a form of a list, of which the first item
is a function and the remaining items are its arguments. Any of these
items can be parse trees. https://code.kx.com/q/basics/parsetrees/
Q2. In below code, value failed to return the result of what I think is a valid parse tree, but eval works fine, recursively evaluating the tree. Does this mean the topmost description is wrong?
q)value(+;3;(+;4;5))
'type
[0] value(+;3;(+;4;5))
^
q)eval(+;3;(+;4;5))
12
Q3. In general then, how do we choose whether to use value or eval?
put simply the difference between eval and value is that eval is specifically designed to evaluate parse trees, whereas value works on parse trees among other operations it does. For example value can be used to see the non-keyed values of dictionaries, or value strings, such as:
q)value"3+4"
7
Putting this string instead into the eval, we simply get the string back:
q)eval"3+4"
"3+4"
1 Following this, the first part of your question isn't too bad to answer. The format ("+";3;4) is not technically the parsed form of 3+4, we can see this through:
q)parse"3+4"
+
3
4
The good thing about value in this case is that it is valuing the string "+" into a the operator + and then valuing executing the parse tree. eval cannot understand the string "+" as this it outside the scope of the function. Which is why A, B and C work but not D.
2 In part two, your parse tree is indeed correct and once again we can see this with the parse function:
q)parse"3+(4+5)"
+
3
(+;4;5)
eval can always be used if your parse tree represents a valid statement to get the result you want. value will not work on all parse tree's only "simple" ones. So the nested list statement you have here cannot be evaluated by value.
3 In general eval is probably the best function of choice for evaluating your parse trees if you know them to be the correct parse tree format, as it can properly evaluate your statements, even if they are nested.

Expressions/operator precedence in Amend At and in function parameters

I always thought that in q and in k all expressions divided ; evaluated left-to-right and operator precedence inside is right-to-left.
But then I tried to apply this principle to Ament At operator parameters. Confusingly it seems working in the opposite direction:
$ q KDB+ 3.6 2019.04.02 Copyright (C) 1993-2019 Kx Systems
q)#[10 20 30;g;:;100+g:1]
10 101 30
The same precedence works inside a function parameters too:
q){x+y}[q;10+q:100]
210
So why does it happend - why does it first calculate the last one parameter and only then first? Is it a feature we should avoid?
Upd: evaluation vs parsing.
There could be another cases: https://code.kx.com/q/ref/apply/#when-e-is-not-a-function
q)#[2+;"42";{)}]
')
[0] #[2+;"42";{)}]
q)#[string;42;a:100] / expression not a function
"42"
q)a // but a was assigned anyway
100
q)#[string;42;{b::99}] / expression is a function
"42"
q)b // not evaluated
'b
[0] b
^
The semicolon is a multi-purpose separator in q. It can separate statements (e.g. a:10; b:20), in which case statements are evaluated from left-to-right similar to many other languages. But when it separates elements of a list it creates a list expression which (an expression) is evaluated from right-to-left as any other q expression would be.
Like in this example:
q)(q;10+q:100)
110 100
One of many overloads of the dot operator (.) evaluates its left operand on a list of values in its right operand:
q){x+y} . (q;10+q:100)
210
In order to do so a list expression itself needs to be evaluated first, and it will be, from right-to-left as any other list expression.
However, the latter is just another way of getting the result of
{x+y}[q;10+q:100]
which therefore should produce the same value. And it does. By evaluating function arguments from right-to-left, of course!
A side note. Please don't be confused by the conditional evaluation statement $[a;b;c]. Even though it looks like an expression it is in fact a statement which evaluates a first and only then either b or c. In other words a, b and c are not arguments of some function $ in this case.

KDB Apply where phrase only if column exists

I'm looking for a way to write functional select in KDB such that the where phrases is only apply if the column exists (on order to avoid error). If the column doesn't exist, it defaults to true.
I tried this but it didn't work
enlist(|;enlist(in;`colname;key flip table);enlist(in;`colname;filteredValues[`colname]));
I tried to write a simple boolean expression and use parse to get my functional form
(table[`colname] in values)|(not `colname in key flip table)
But kdb doesn't have short circuit so the left-hand expression is still evaluated despite the right-hand expression evaluating to true. This caused a weird output boolean$() which is a list of booleans all evaluating to false 0b
Any help is appreciated. Thanks!
EDIT 1: I have to join a series of condition with parameter specified in the dictionary filters
cond,:(,/) {[l;k] enlist(in;k;enlist l[k])}[filters]'[a:(key filters)]
Then I pass this cond on and it gets executed on a few different selects on different tables. How can I make sure that whatever conditional expression I put in place of enlist(in;k;enlist l[k] will only get evaluated as the select statement gets executed.
You can use the if-else conditional $ here to do what you want
For example:
q)$[`bid in cols`quotes;enlist (>;`bid;35);()]
> `bid 35
q)$[`bad in cols`quotes;enlist (>;`bad;35);()]
Note that in the second example, the return is an empty list, as this column isn't in quotes table
So you can put this into the functional select like so:
?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
and the where clause will be applied the the column is present, otherwise no where clause will be applied:
q)count ?[`quotes;$[`bid in cols`quotes;enlist (>;`bid;35);()];0b;()]
541 //where clause applied, table filtered
q)count ?[`quotes;$[`bad in cols`quotes;enlist (>;`bad;35);()];0b;()]
1000 //where clause not applied, full table returned
Hope this helps
Jonathon
AquaQ Analytics
EDIT: If I'm understanding your updated question correctly, you might be able to do something a like the following. Firstly, let's define an example "filters" dictionary:
q)filters:`a`b`c!(1 2 3;"abc";`d`e`f)
q)filters
a| 1 2 3
b| a b c
c| d e f
So here we are assuming a few different columns of different types, for illustration purposes. You can build up your list of where clauses like so:
q)(in),'flip (key filters;value filters)
in `a 1 2 3
in `b "abc"
in `c `d`e`f
(this is equivalent to the code you had to generate cond, but it's a little neater & more efficient - you also have the values enlisted, which isn't necessary)
You could then use a vector conditional to generate your list of where clauses to apply to a given table e.g.
q)t:([] a:1 2 3 4 5 6;b:"adcghf")
q)?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
(in;`a;,1 2 3)
(in;`b;,"abc")
()
As you can see, in this example the table "t" has columns a and b, but not c. So using the vector conditional, you get the where clauses for a and b but not c.
Finally to actually apply this list of output where clauses to the table, you can make use of an over to apply each in turn:
q)l:?[key[filters] in cols[t];(in),'flip (key filters;value filters);count[filters]#()]
q){?[x;$[y~();y;enlist y];0b;()]}/[t;l]
a b
---
1 a
3 c
One thing to note here is that in the where clause of the functional select we need to check if y is an empty list - this is so we can enlist it if it is not an empty list
Hope this helps

SPSS Macro: compute by variable name

I don't think SPSS macros can return values, so instead of assigning a value like VIXL3 = !getLastAvail target=VIX level=3 I figured I need to do something like this:
/* computes last available entry of target at given level */
define !compLastAvail(name !Tokens(1) /target !Tokens(1) /level !Tokens(1))
compute tmpid= $casenum.
dataset copy tmpset1.
select if not miss(!target).
compute !name= lag(!target, !level).
match files /file= * /file= tmpset1 /by tmpid.
exec.
delete variables tmpid.
dataset close tmpset1.
!enddefine.
/* compute last values */
!compLastAvail name="VIXCL3" target=VIXC level=3.
The compute !name = ...is where the problem is.
How should this be done properly? The above returns:
>Error # 4285 in column 9. Text: VIXCL3
>Incorrect variable name: either the name is more than 64 characters, or it is
>not defined by a previous command.
>Execution of this command stops.
When you pass tokens to the macro, they get interpreted literally. So when you specify
!compLastAvail name="VIXCL3"
It gets passed to the corresponding compute statement as "VIXCL3", instead of just a variable name without quotation marks (e.g. VIXCL3).
Two other general pieces of advice;
If you do the command set mprint on before you execute your macro, you will see how your tokens are passed to the macro. In this instance, if you had taken that step, you would have seen that the offending compute statement and error message.
Sometimes you do what to use quotation marks in tokens, and when that is the case the string commands !QUOTE and !UNQUOTE come in handy.

Perl autoincrement of string not working as before

I have some code where I am converting some data elements in a flat file. I save the old:new values to a hash which is written to a file at the end of processing. On subsequence execution, I reload into a hash so I can reuse previously converted values on additional data files. I also save the last conversion value so if I encounter an unconverted value, I can assign it a new converted value and add it to the hash.
I had used this code before (back in Feb) on six files with no issues. I have a variable that is set to ZCKL0 (last character is a zero) which is retrieved from a file holding the last used value. I apply the increment operator
...
$data{$olddata} = ++$dataseed;
...
and the resultant value in $dataseed is 1 instead of ZCKL1. The original starting seed value was ZAAA0.
What am I missing here?
Do you use the $dataseed variable in a numeric context in your code?
From perlop:
If you increment a variable that is
numeric, or that has ever been used in
a numeric context, you get a normal
increment. If, however, the variable
has been used in only string contexts
since it was set, and has a value that
is not the empty string and matches
the pattern /^[a-zA-Z][0-9]\z/ , the
increment is done as a string,
preserving each character within its
range.
As prevously mentioned, ++ on strings is "magic" in that it operates differently based on the content of the string and the context in which the string is used.
To illustrate the problem and assuming:
my $s='ZCL0';
then
print ++$s;
will print:
ZCL1
while
$s+=0; print ++$s;
prints
1
NB: In other popular programming languages, the ++ is legal for numeric values only.
Using non-intuitive, "magic" features of Perl is discouraged as they lead to confusing and possibly unsupportable code.
You can write this almost as succinctly without relying on the magic ++ behavior:
s/(\d+)$/ $1 + 1 /e
The e flag makes it an expression substitution.