Get additional column using functional select - kdb

How do I add an extra column of type string using functional select (?)?
I tried this:
t:([]c1:`a`b`c;c2:1 2 3)
?[t;();0b;`c1`c2`c3!(`c1;`c2;10)] / ok
?[t;();0b;`c1`c2`c3!(`c1;`c2;enlist(`abc))] / ok
?[t;();0b;`c1`c2`c3!(`c1;`c2;"10")] / 'length
?[t;();0b;`c1`c2`c3!(`c1;`c2;enlist("10"))] / 'length
but the last two give a 'length error.

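Your first case works because an atom automatically expands to the required length. A string such as "10" is not an atom but a character list of length 2, which does not match the 3 rows of the table, hence the 'length error. For a compound column you need to generate a list of the correct length explicitly, as follows: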
q)select c1,c2,c3:`abc,c4:10,c5:count[i]#enlist"abc" from t
c1 c2 c3 c4 c5
------------------
a 1 abc 10 "abc"
b 2 abc 10 "abc"
c 3 abc 10 "abc"
// in functional form
q)?[t;();0b;`c1`c2`c3!(`c1;`c2;(#;(count;`i);(enlist;"abc")))]
c1 c2 c3
-----------
a 1 "abc"
b 2 "abc"
c 3 "abc"
Jason


How to build chain of segments in scope of pyspark dataframe

I have a huge pyspark dataframe with segments and their subsegments, like this:
SegmentId SubSegmentStart SubSegmentEnd
1 a1 a2
1 a2 a3
2 b1 b2
3 c1 c2
3 c3 c4
3 c2 c3
I need to group records by SegmentId and add new column index to build chain of subsegments using start and end points. I need to do it for each Segment.
So I need to get the following dataframe:
SegmentId SubSegmentStart SubSegmentEnd Index
1 a1 a2 0
1 a2 a3 1
2 b1 b2 0
3 c1 c2 0
3 c3 c4 2
3 c2 c3 1
How can I do this in PySpark?
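The per-segment ordering asked for above can be sketched in plain Python. This is only an illustration of the chaining logic, not a PySpark solution: chain_indices is a hypothetical helper, and it assumes each segment forms one unbranched chain with no cycles. In PySpark, logic like this could be run once per SegmentId, for example inside a grouped pandas function.

```python
def chain_indices(subsegments):
    """Return the chain position of each (start, end) pair, in input order.

    Assumes the pairs of one segment form a single unbranched chain.
    """
    # Map every start point to the end point it leads to.
    next_by_start = {start: end for start, end in subsegments}
    ends = set(next_by_start.values())

    # The head of the chain is the only start that is never an end.
    heads = [start for start in next_by_start if start not in ends]
    if len(heads) != 1:
        raise ValueError("expected exactly one chain head per segment")

    # Walk the chain from the head, numbering each subsegment.
    position = {}
    current, index = heads[0], 0
    while current in next_by_start:
        position[current] = index
        current = next_by_start[current]
        index += 1

    if len(position) != len(subsegments):
        raise ValueError("subsegments do not form a single chain")
    return [position[start] for start, _ in subsegments]


# Segment 3 from the question, in its original (unordered) row order.
segment_3 = [("c1", "c2"), ("c3", "c4"), ("c2", "c3")]
print(chain_indices(segment_3))  # [0, 2, 1]
```

The result lines up with the Index column wanted for segment 3: c1-c2 is 0, c3-c4 is 2 and c2-c3 is 1.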

Creating nested columns in kdb table

I'd like to create a nested list for one of my table's columns, but I'm unsure of the syntax to use. If for instance I had the following table...
q)t:([]submitter:`A`B`C; code:3?100; status:110b)
q)t
submitter code status
---------------------
A 2 1
B 39 1
C 64 0
I want to do something similar to the below. However, this adds an additional column x to the table and places the value there instead of creating a compound list in the code column:
q)update code,:77 from t where status<>1b
submitter code status x
------------------------
A 2 1
B 39 1
C 64 0 77
If it were a dictionary with a single value I would do the following...
q)d:`sumbitter`code`status!(`A;1?100;1)
q)d
sumbitter| `A
code | ,88
status | 1
q)d[`code],:99
q)d
sumbitter| `A
code | 88 99
status | 1
How do I perform the same operation on a table with multiple rows?
My desired output would look like...
q)t
submitter code status
----------------------
A 2 1
B 39 1
C 64 77 0
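The following also works and does not require changing the column type in advance. (77;())status indexes the two-item list with the boolean status column, giving () where status is 1b and 77 where it is 0b; join-each (,') then appends that to each code. The code values in the output differ from the question's because they come from a different 3?100 draw.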
q)update code:(code,'(77;())status) from t
submitter code status
---------------------
A ,12 1
B ,10 1
C 1 77 0
You can't change the column type of your code column on-the-fly like you intend to do.
Instead, you first have to update the type of the column code to a list of long instead of long:
q)meta t
c | t f a
---------| -----
submitter| s
code | j
status | b
Update the type:
t: update enlist each code from t
Now the type of code is "J", which is indeed a list of long:
q)meta t
c | t f a
---------| -----
submitter| s
code | J
status | b
And then you can append an element to the code like this:
t:update code:{x,77} each code from t where status<>1b
q)t
submitter code status
----------------------
A ,2 1
B ,39 1
C 64 77 0

Kdb upsert with conditional syntax?

Is there a way I can upsert in kdb where the following occurs:
If key is not present, insert values
If key is present, check if current value is greater
A) If so, perform no action
B) If not, update values
Something like:
job upsert ([title: job1] time: enlist 1 where time > 1)
Since you're using a keyed table, and you want to change values only if they're greater and add in new keys and values, you can try avoiding upsert entirely:
t:([job:`a`b`c] val: 4 4 4) /current table
nt:([job:`a`c`d]val: 6 1 5) /new values to check
t|nt
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
This will automatically add keys that aren't there, and update the current value to the new value if the new value is larger.
Please find a solution and explanation below. I'll edit if I come up with a better way. I hope I have interpreted the question correctly.
q)t1
name | age height
-------| ----------
michael| 26 173
john | 57 156
sam | 23 134
jimmy | 83 183
conor | 32 145
jim | 64 167
q)t2
name age height
---------------
john 98 220
mary 24 230
jim 50 240
q)t1 upsert t2 where{$[all null n:x[y`name];1b;y[`age]>n[`age]]}[t1;]each t2
name | age height
-------| ----------
michael| 26 173
john | 98 220
sam | 23 134
jimmy | 83 183
conor | 32 145
jim | 64 167
mary | 24 230
q)
Explanation:
The function takes two arguments: x is the keyed table t1 and y is one record from t2 (as a dictionary). It first extracts the name from the t2 record (y`name), indexes into the keyed table with it, and stores the result in the local variable n. If the name exists in t1, the matching record is returned from x as a dictionary and all null n is false; otherwise a record of nulls is returned and all null n is true. When the name is not found in t1, the function returns 1b. Otherwise it compares the two ages, n[`age] from t1 and y[`age] from t2, and returns 1b if y[`age] is greater and 0b if it is not.
The result of this function is a list of booleans, one for each record in t2. 1b is returned in two cases:
(1) The name from t2 has no match in t1.
(2) The name from t2 has a match in t1 and its age is greater than the corresponding age in t1.
0b is returned when the name has a match in t1 and the age in t2 is not greater than the corresponding age in t1.
In our example the result of the function is 110b and after we apply where to this, the result is the indexes where the list value is true i.e. where 110b --> 0 1. We use this list to index into t2 which returns the first 2 records from t2(these are either new records or records where the age is greater than what is referenced in t1), then we simply upsert this into t1.
I hope this helps and hope some better solutions come along.
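For readers less familiar with q, the rule described above can be restated as a minimal Python sketch. The dictionaries stand in for the keyed table t1 and the table t2, and conditional_upsert is a hypothetical name used only for illustration, not part of q or any library.

```python
def conditional_upsert(current, incoming, field="age"):
    """Return a copy of `current` updated with rows from `incoming`.

    A row is taken when its key is new, or when its `field` value is
    greater than the value already stored under that key.
    """
    result = dict(current)
    for key, row in incoming.items():
        existing = result.get(key)
        if existing is None or row[field] > existing[field]:
            result[key] = row
    return result


t1 = {
    "michael": {"age": 26, "height": 173},
    "john": {"age": 57, "height": 156},
    "sam": {"age": 23, "height": 134},
    "jimmy": {"age": 83, "height": 183},
    "conor": {"age": 32, "height": 145},
    "jim": {"age": 64, "height": 167},
}
t2 = {
    "john": {"age": 98, "height": 220},  # existing key, greater age: replaced
    "mary": {"age": 24, "height": 230},  # new key: inserted
    "jim": {"age": 50, "height": 240},   # existing key, smaller age: ignored
}

merged = conditional_upsert(t1, t2)
print(merged["john"], merged["mary"], merged["jim"])
```

As in the q result, john is replaced, mary is added and jim keeps his original record.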
For a table, a key, and a value: upsert the tuple if the key is new or the value exceeds the existing value.
q)t:([job:`a`b`c] val: 4 4 4) /current table
q)t[`a]|:6 /old key, higher value
q)t
job| val
---| ---
a | 6
b | 4
c | 4
q)t[`c]|:1 /old key, lower value
q)t
job| val
---| ---
a | 6
b | 4
c | 4
q)t[`d]|:5 /new key
q)t
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
Remarks
A keyed table with a single data column could perhaps be a dictionary.
Amending through an operator works also with a new key.
Upserting a table (or dictionary) of new records is more efficient and simpler than updating a single tuple.
q)nt:([job:`a`c`d]val: 6 1 5) /new values to check
q)t|nt /maximum of two tables
job| val
---| ---
a | 6
b | 4
c | 4
d | 5
or just
q)t[([]job:`a`c`d)]|:([]val:6 1 5)
Simple-looking primitives such as maximum (|) repay careful study.

Dataframe GroupBy aggregate on column that contains pattern

I have a dataframe with columns c1 and c2. I want to group by c1 and, for each group, pick a c2 value that contains a pattern. If no c2 value in the group contains the pattern, any row from the group may be returned.
example df :
c1 c2
1 ai_za
1 ah_px
1 ag_po
1 af_io
1 ae_aa
1 ad_iq
1 ac_on
1 ab_eh
1 aa_bs
2 aa_ab
2 aa_ac
If the pattern needed in c2 is '_io', the expected result is:
c1 c2
1 af_io
2 aa_ab
1 af_io is returned because it contains the '_io' pattern.
2 aa_ab is returned arbitrarily because no row in group 2 contains '_io'.
How can I do this using the Spark DataFrame/Dataset API?
If it doesn't matter which row to pick if there is no match, you can try:
df.groupByKey(_.getAs[Int]("c1")).
reduceGroups((x, y) => if(x.getAs[String]("c2").matches(".*_io")) x else y).
toDF("key", "value").
select("value.c1", "value.c2").show
+---+-----+
| c1| c2|
+---+-----+
| 1|af_io|
| 2|aa_ac|
+---+-----+
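Note: when rows are reduced in their original order, this picks the first row that matches the pattern, or the last row in the group if there is no match. The reducing function is not commutative, so in a distributed run the row chosen may vary. Also, matches(".*_io") only accepts values that end in _io; use ".*_io.*" or contains("_io") to match the pattern anywhere in the value.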
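The behaviour of that reducer can be checked outside Spark with a small Python sketch. prefer_match is a hypothetical stand-in for the lambda passed to reduceGroups, and it uses a plain substring test rather than a regex; functools.reduce folds strictly left to right, which Spark does not promise.

```python
from functools import reduce


def prefer_match(x, y, pattern="_io"):
    """Keep the accumulated value once it contains the pattern, else move on."""
    return x if pattern in x else y


group_1 = ["ai_za", "ah_px", "ag_po", "af_io", "ae_aa",
           "ad_iq", "ac_on", "ab_eh", "aa_bs"]
group_2 = ["aa_ab", "aa_ac"]

print(reduce(prefer_match, group_1))            # af_io (first match)
print(reduce(prefer_match, group_2))            # aa_ac (no match: last row)
print(reduce(prefer_match, reversed(group_2)))  # aa_ab (order changes the fallback)
```

The last line shows why the fallback row depends on the order in which rows reach the reducer.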

KDB: select first n rows from each group

How can I extract the first n rows from each group? For example: for table
bb: ([]sym:(4#`a),(5#`b);val: til 9)
sym val
-------------
a 0
a 1
a 2
a 3
b 4
b 5
b 6
b 7
b 8
How can I select the first 2 rows of each group by sym?
Thanks
You can use fby:
q)select from bb where ({x in 2#x};i) fby sym
sym val
-------
a 0
a 1
b 4
b 5
You can try this:
q)select from bb where i in raze exec 2#i by sym from bb
sym val
-------
a 0
a 1
b 4
b 5
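Both queries keep a row when its row index is among the first two indices of its sym group. The same idea as a minimal Python sketch, for illustration only: first_n_per_group is a hypothetical helper and the list of dicts stands in for the table bb.

```python
def first_n_per_group(rows, key, n):
    """Keep the first n rows seen for each distinct value of `key`."""
    seen = {}  # number of rows already seen per group
    kept = []
    for row in rows:
        count = seen.get(row[key], 0)
        if count < n:
            kept.append(row)
        seen[row[key]] = count + 1
    return kept


# Same shape as bb: four rows for sym a, five for sym b, val 0..8.
bb = [{"sym": s, "val": v} for v, s in enumerate("aaaabbbbb")]

for row in first_n_per_group(bb, "sym", 2):
    print(row["sym"], row["val"])
```

This prints the rows a 0, a 1, b 4 and b 5, the same rows the q queries return.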