How do I parameterise the column list in KDB?

I have a number of repetitive queries:
select lR, e10, e25, vr from z
Is there a way I can do something like:
features: `lR`e10`e25`vr
select features from z

You could use # like so:
`lR`e10`e25`vr#z
NB: The left argument here must be a list, so to select a single column use the following:
enlist[`vr]#z
Example:
q)t:([]a:`a`b`c;b:til 3;c:0b);
q)`a`b#t
a b
---
a 0
b 1
c 2

Another approach is to use a functional form (which you can build using parse):
q)0N!parse"select lR, e10, e25, vr from z";
(?;`z;();0b;`lR`e10`e25`vr!`lR`e10`e25`vr)
q)features:`lR`e10`e25`vr
q)?[z;();0b;features!features]
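The functional form can be wrapped in a small helper so the column list is passed as a parameter. This is only a sketch: the table z below is made-up sample data (only the column names come from the question), and sel is a hypothetical name, not a built-in.

```q
/ z is a hypothetical sample table; only the column names come from the question
z:([] lR:1 2 3f; e10:4 5 6f; e25:7 8 9f; vr:0.1 0.2 0.3; other:`x`y`z)
features:`lR`e10`e25`vr

/ sel is a hypothetical helper: (),c turns a single symbol into a 1-item list,
/ and c!c is the same column dictionary that parse produced above
sel:{[t;c] c:(),c; ?[t;();0b;c!c]}

r:sel[z;features]     / same columns as: select lR, e10, e25, vr from z
s:sel[z;`vr]          / single column, no enlist needed at the call site
```

Because the helper normalises c with (),c, callers don't have to remember to enlist a single column name.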

If you use # for this then be aware it will fail on a keyed table.
One possible way of modifying it to work on any table would be something like:
f:{[t;c] if[not .Q.qt[t]; '"Input is not a table"]; c,:(); $[99h = type[t];c#/:t;c#t]}
This checks that the input is, in fact, a table, makes sure the columns are a list, and then performs the required # operation.
q)t
a| b c d
-| ------
a| 4 7 10
b| 5 8 11
c| 6 9 12
q)f[t;`b]
a| b
-| -
a| 4
b| 5
c| 6
q)f[0!t;`b]
b
-
4
5
6
q)f[flip 0!t;`b]
'Input is not a table
[0] f[flip 0!t;`b]
^

Related

How can I group my data frame based on conditions on a column

I have a data frame like this:
Date   Version  Value  Name
Jan 1  123.1    3      A
Jan 2  123.23   5      A
Jan 1  223.1    6      B
Jan 2  623.23   7      B
I want to group the rows whose 'Version' has the same prefix (everything from the first character up to the '.'). For 'Value', take the value from the row with the longest Version string. For the 'Name' column, use the name from any of the rows with the same prefix.
Version Prefix  Value  Name
123             5      A
223             6      B
623             7      B
That is, versions 123.1 and 123.23 have the same prefix '123', so both rows become one row in the result, and 'Value' equals 5 because the row with Version 123.23 (the row with the longest Version) has 5 as its Value.
from pyspark.sql import Window
from pyspark.sql.functions import col, max, size, split

(df.withColumn('Version Prefix', split('Version', r'\.')[0])  # Create the prefix column
   .withColumn('size', size(split(split('Version', r'\.')[1], '(?!$)')))  # Number of characters in the suffix
   .withColumn('max', max('size').over(Window.partitionBy('Version Prefix', 'Name')))  # Longest suffix per group
   .where(col('size') == col('max'))  # Keep only the rows with the longest suffix
   .drop('Date', 'size', 'max', 'Version')  # Drop unwanted columns
).show()
+-----+----+--------------+
|Value|Name|Version Prefix|
+-----+----+--------------+
|    5|   A|           123|
|    6|   B|           223|
|    7|   B|           623|
+-----+----+--------------+

Reshape [cols;table]

How do I get columns from a table? If they don't exist it's ok to get them as null columns.
Trying reshape (#):
q)d:`a`b!1 2
q)enlist d
a b
---
1 2
q)`a`c#d
a| 1
c|
q)`a`c#enlist d
'c
[0] `a`c#enlist d
^
Why does the reshape (#) operator not work on a table? It could easily act on each row (which is a dict) and combine the results. So I'm forced to write:
q)`a`c#/:enlist d
a c
---
1
Is it the shortest way?
Any key you try to take (#) which is not present in a dictionary will be assigned a null value of the same type as the first value in the dictionary. Similar behaviour is not available for tables.
q)`a`c#`a`b!(1 2;())
a| 1 2
c| `long$()
q)`b`c#`a`b!(();1 2)
b| 1 2
c| ()
As you mentioned, each-right (/:) acts on each row of the table, i.e. on each dictionary. Instead of using an iterator to split the table into dictionaries, we can act on the dictionary itself. This returns the same output and is slightly faster.
q)d:`a`b!1 2
q)enlist`a`c#d
a c
---
1
q)(`a`c#/:enlist d)~enlist`a`c#d
1b
q)\ts:1000000 enlist`a`c#d
395 864
q)\ts:1000000 `a`c#/:enlist d
796 880
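The enlist form above only applies when you start from a single dictionary. For a table that already has several rows, the each-right form still does the job. A minimal sketch, with t being a hypothetical table made up for illustration:

```q
/ t is a hypothetical multi-row table used only for illustration
t:([] a:1 2 3; b:4 5 6)

/ each-right applies the dictionary take to every row;
/ the missing column c comes back filled with nulls
r:`a`c#/:t
```

Here r should have the columns a and c, with a carried over from t and every entry of c null.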

Restrict table columns, preserving keys

I've found in "Q Tips" a technique to preserve keys in a table. This is useful for restricting the columns of the right table in lj, for example, without re-applying a key. Using each:
q)show t:(`c1`c2!1 2;`c1`c2!3 4)!(`c3`c4`c5!30 40 50;`c3`c4`c5!31 41 51)
c1 c2| c3 c4 c5
-----| --------
1 2 | 30 40 50
3 4 | 31 41 51
q)`c3`c4#/:t
c1 c2| c3 c4
-----| -----
1 2 | 30 40
3 4 | 31 41
I’m trying to understand why it preserves the key part of the table t:
q){-3!x}/:t
'/:
[0] {-3!x}/:t
^
But in this case q doesn’t show how it treats each row of the keyed table.
So why does the syntax #/:t work this way for a keyed table? Is it mentioned anywhere in the code.kx.com docs?
Upd1: I've found a case with # and keyed table on code.kx.com, but it is about selecting rows, not columns.
If you view the keyed table as a dictionary (which it is) then it's no different to:
q)2*/:`a`b!1 2
a| 2
b| 4
or
q){x+1} each `a`b!1 2
a| 2
b| 3
The keys are retained when applying a function to each element of a dictionary. In your example the function being applied is to use take on a dictionary, e.g:
q)`c3`c4#first t
c3| 30
c4| 40
Doing that for each row returns a list of dictionaries, which is itself a table.
Also, your other attempt would work if written with apply (@) as the dyadic function:
{-3!x}@/:t
so it's not unique to take (#).
{-3!x}/:t
Each-right needs a function of two arguments, so this won't work.
Since the table is keyed, it is treated as a dictionary. Each-right iterates over the dictionary values and therefore leaves the keys of the main dictionary (the key columns) untouched. To see what is happening, it might help to look at the result of using each:
q){-3!x} each t
c1 c2|
-----| --------------------
1 2 | "`c3`c4`c5!30 40 50"
3 4 | "`c3`c4`c5!31 41 51"
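To make the "keyed table is a dictionary" point concrete, here is a small sketch that splits the keyed table into its key and value parts, takes the columns from the value part only, and puts the key back. The table is the one from the question, written with the usual keyed-table syntax; the names k, v, r1 and r2 are just for illustration.

```q
/ same data as t in the question, built with keyed-table syntax
t:([c1:1 3; c2:2 4] c3:30 31; c4:40 41; c5:50 51)

k:key t                / the key columns, an ordinary table
v:value t              / the value columns, an ordinary table

r1:`c3`c4#/:t          / take applied to each value row, keys kept
r2:k!`c3`c4#v          / take on the value table, then re-key by hand
```

Under this reading r1 and r2 should match, which is why each-right on a keyed table appears to preserve the key.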

Different results of flip on select and on index-accessed table in kdb+

In a q session I've made a keyed table t:
q)/KDB+ 3.6 2018.05.17
q)f:flip (`a`b)!(1 2 3;4 5 6)
q)k:flip (enlist `k)!(enlist 101 102 103)
q)t:k!f;t
k | a b
---| ---
101| 1 4
102| 2 5
103| 3 6
Then I tried a query, and it gives nice results:
q)select a,b from t where k=101
a b
---
1 4
q)flip select a,b from t where k=101
a| 1
b| 4
q)flip flip select a,b from t where k=101
a b
---
1 4
But without select-syntax this gives an error:
q)t[101]
a| 1
b| 4
q)flip t[101]
'rank
[0] flip t[101]
^
Why can't I just make a simple flip on the same result as from select of the same data types?
q)type flip select a,b from t where k=101
99h
q)type t[101]
99h
Because the values of the dictionary t[101] are atoms, not lists, flip on a dictionary of atoms fails.
Joining each value to an empty list first turns the atoms into one-item lists, after which flip will work.
q)(),/:t[101]
a| 1
b| 4
That is not necessarily something you want to do. For a given dictionary output, the solution you probably want is enlist:
q)enlist t[101]
a b
---
1 4
An alternative approach would be to lookup using a table rather than a lookup using an atom:
q)t[([]k:(),101)]
a b
---
1 4
That would be the equivalent of select a,b from t where k=101
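The three routes to a one-row table can be compared side by side. This is a sketch using the table from the question, written with keyed-table syntax; the names d, r1, r2 and r3 are only for illustration.

```q
/ same data as t in the question
t:([k:101 102 103] a:1 2 3; b:4 5 6)

d:t[101]                 / dictionary of atoms, so flip d signals 'rank
r1:flip enlist each d    / turn each atom into a 1-item list, then flip
r2:enlist d              / a 1-item list of dictionaries is a table
r3:t[([] k:(),101)]      / look up with a table of keys instead of an atom
```

All three should give the same one-row table as select a,b from t where k=101.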

How to get the name of the group with maximum value of parameter? [duplicate]

I have a DataFrame df like this one:
df =
name group influence
A 1 2
B 1 3
C 1 0
A 2 5
D 2 1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group max_name max_influence
1 B 3
2 A 5
I know how to get the max value, but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
There is a good alternative to groupBy with structs: window functions, which are sometimes considerably faster.
For your example I would try the following:
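import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
import spark.implicits._ // for the 'column symbol syntax; assumes a SparkSession named spark
val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show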
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
|   A|    2|        5|            5|
|   B|    1|        3|            3|
+----+-----+---------+-------------+
Now all you need to do is drop the unneeded columns and rename the remaining ones.
Hope it helps.