How do I make a dictionary of zeroes in KDB?

Say I've got a (long) list of keys:
`a`b`c
How do I make a dictionary where every value is zero?

If your list is saved as a variable, this can be achieved with the count keyword. For example:
q)c:`a`b`c
q)c!(count c)#0
a| 0
b| 0
c| 0
This returns your dictionary of zeroes.

If you have a blank (or existing) dictionary you can assign 0 to all of your keys:
q)c:`a`b`c
q)d:()!()
q)d[c]:0
q)d
a| 0
b| 0
c| 0

Just to add one more option, you can also use each-left (\:) with assign (:) to do this:
q)(`a`b`c!()):\:0
a| 0
b| 0
c| 0

If 0 is just a default value, you could also use fill (^) when indexing into the dictionary:
q)d:`a`b`c!til 3
q)d[`d]
0N
q)0^d[`d]
0
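If you need this in several places, the count-based approach can be wrapped into a one-line helper (the name zeroDict is purely illustrative):
q)zeroDict:{x!count[x]#0}
q)zeroDict`a`b`c
a| 0
b| 0
c| 0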

Related

How to generate columns based on the unique values of a particular column in PySpark?

I have a dataframe as below
+----------+------------+---------------------+
|CustomerNo|        size|total_items_purchased|
+----------+------------+---------------------+
|  208261.0|           A|                    2|
|  208263.0|           C|                    1|
|  208261.0|           E|                    1|
|  208262.0|           B|                    2|
|  208264.0|           D|                    3|
+----------+------------+---------------------+
I have another DataFrame, df, that consists of CustomerNo values only. I have to create a column for each unique size and update it with the total_items_purchased in df.
My df table should look like:
CustomerNo  size_A  size_B  size_C  size_D  size_E
208261.0    1       0       0       0       1
208262.0    0       2       0       0       0
208263.0    0       0       1       0       0
208264.0    0       0       0       3       0
Can anyone tell me how to do this?
You can use the pivot function to rearrange the table:
from pyspark.sql import functions as F
df = (df.groupBy('CustomerNo')
        .pivot('size')
        .agg(F.first('total_items_purchased'))
        .na.fill(0))
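If you also want the size_ prefix shown in the expected output, the pivoted columns can be renamed afterwards. A rough sketch, assuming df now holds the pivoted result from the snippet above (customers in the comment stands for the CustomerNo-only DataFrame mentioned in the question):
# rename every pivoted column except the key, e.g. A -> size_A
df = df.select(
    'CustomerNo',
    *[F.col(c).alias('size_' + c) for c in df.columns if c != 'CustomerNo']
)
# optionally left-join onto the CustomerNo-only DataFrame, e.g.
# customers.join(df, on='CustomerNo', how='left').na.fill(0)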

How do I parameterise the column list in KDB?

I have a number of repetitive queries:
select lR, e10, e25, vr from z
Is there a way I can do something like:
features: `lR`e10`e25`vr
select features from z
You could use # like so:
`lR`e10`e25`vr#z
NB: The left argument here must be a list, so to select a single column use the following:
enlist[`vr]#z
Example:
q)t:([]a:`a`b`c;b:til 3;c:0b);
q)`a`b#t
a b
---
a 0
b 1
c 2
Another approach is to use a functional form (which you can build using parse):
q)0N!parse"select lR, e10, e25, vr from z";
(?;`z;();0b;`lR`e10`e25`vr!`lR`e10`e25`vr)
q)features:`lR`e10`e25`vr
q)?[z;();0b;features!features]
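If this pattern comes up often, the functional select can be wrapped in a small helper (the name selectCols is mine, not from the original answer); re-using the example table t from above:
q)selectCols:{[t;c] c,:(); ?[t;();0b;c!c]}
q)selectCols[t;`a`b]
a b
---
a 0
b 1
c 2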
If you use # for this then be aware it will fail on a keyed table.
One possible way of modifying it to work on any table would be something like:
f:{[t;c]
  if[not .Q.qt[t]; '"Input is not a table"];   / signal an error for non-table input
  c,:();                                       / ensure the columns are supplied as a list
  $[99h=type[t]; c#/:t; c#t]                   / keyed table: take from each row, keeping the key; simple table: plain take
 }
So the function checks that the input is, in fact, a table, makes sure the columns are a list, and then performs the required # operation.
q)t
a| b c d
-| ------
a| 4 7 10
b| 5 8 11
c| 6 9 12
q)f[t;`b]
a| b
-| -
a| 4
b| 5
c| 6
q)f[0!t;`b]
b
-
4
5
6
q)f[flip 0!t;`b]
'Input is not a table
[0] f[flip 0!t;`b]
^

How to count occurrences in the last 30 days & transpose a column's row values into new columns in PySpark

I am trying to get the count of occurrences of each status value for every 'name', 'id' & 'branch' combination in the last 30 days, using PySpark.
For simplicity, let's assume the current day is 19/07/2021.
Input dataframe
id name branch status eventDate
1 a main failed 18/07/2021
1 a main error 15/07/2021
2 b main failed 16/07/2021
3 c main snooze 12/07/2021
4 d main failed 18/01/2021
2 b main failed 18/07/2021
expected output
id name branch failed error snooze
1 a main 1 1 0
2 b main 2 0 0
3 c main 0 0 1
4 d main 0 0 0
I tried the following code.
from pyspark.sql import functions as F
df = df.withColumn("eventAgeinDays", F.datediff(F.current_timestamp(), F.col("eventDate")))
df = (df.groupBy('id', 'branch', 'name', 'status')
        .agg(F.sum(F.when(F.col("eventAgeinDays") <= 30, 1).otherwise(0))
             .alias("Last30dayFailure")))
df = df.groupBy('id', 'branch', 'name', 'status').pivot('status').agg(F.collect_list('Last30dayFailure'))
The code kind of gives me the output, but I get arrays in the output since I am using F.collect_list()
my partially correct output
id name branch failed error snooze
1 a main [1] [1] []
2 b main [2] [] []
3 c main [] [] [1]
4 d main [] [] []
Could you please suggest a more elegant way of creating my expected output? Or let me know how to fix my code?
Instead of using collect_list, which creates a list, use first as the aggregation method (the reason first works here is that you already aggregated grouped by id, branch, name and status, so there is at most one value for each unique combination):
(df.groupBy('id', 'branch', 'name')
   .pivot('status')
   .agg(F.first('Last30dayFailure'))
   .fillna(0)
   .show())
+---+------+----+-----+------+------+
| id|branch|name|error|failed|snooze|
+---+------+----+-----+------+------+
| 1| main| a| 1| 1| 0|
| 4| main| d| 0| 0| 0|
| 3| main| c| 0| 0| 1|
| 2| main| b| 0| 2| 0|
+---+------+----+-----+------+------+
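As a side note, the intermediate aggregation can also be folded directly into the pivot. A rough sketch, assuming df is the original input DataFrame and eventDate is a string in the dd/MM/yyyy format from the question:
from pyspark.sql import functions as F

result = (df.withColumn('eventAgeinDays',
                        F.datediff(F.current_date(), F.to_date('eventDate', 'dd/MM/yyyy')))
            .groupBy('id', 'branch', 'name')
            .pivot('status')
            .agg(F.sum(F.when(F.col('eventAgeinDays') <= 30, 1).otherwise(0)))
            .fillna(0))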

Accumulator gives a different result than applying the function directly

Trying to combine two result sets, I've run into different behaviour when joining two keyed tables:
q)show t:([a:1 1 2]b:011b)
a| b
-| -
1| 0
1| 1
2| 1
q)t,t
a| b
-| -
1| 1
1| 1
2| 1
q)(,/)(t;t)
a| b
-| -
1| 1
2| 1
Why does the accumulator ,/ remove duplicated keys, and why does its result differ from a direct table join (,)?
I suspect that join over (aka ,/ aka raze) has special handling under the covers that isn't exposed to the end user.
The interpreter recognises the ,/ and behaves differently depending on the inputs. This special handling likely applies to both dictionaries and keyed tables:
q)raze(`a`a`b!1 2 3;`a`b!9 9)
a| 9
b| 9
q)
q)(`a`a`b!1 2 3),`a`b!9 9
a| 9
a| 2
b| 9
q)
q)({x,y}/)(`a`a`b!1 2 3;`a`b!9 9)
a| 9
a| 2
b| 9
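The same workaround applies to the keyed tables from the question: hiding the join inside a lambda prevents the raze substitution, so the result matches the plain , join:
q)({x,y}/)(t;t)
a| b
-| -
1| 1
1| 1
2| 1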

Reshape [cols;table]

How do I get columns from a table? If they don't exist, it's OK to get them back as null columns.
Trying reshape (#):
q)d:`a`b!1 2
q)enlist d
a b
---
1 2
q)`a`c#d
a| 1
c|
q)`a`c#enlist d
'c
[0] `a`c#enlist d
^
Why does the reshape (#) operator not work on a table? It could easily act on each row (which is a dict) and combine the results. So I'm forced to write:
q)`a`c#/:enlist d
a c
---
1
Is it the shortest way?
Any key you try to take (#) which is not present in a dictionary will be assigned a null value of the same type as the first value in the dictionary. Similar behaviour is not available for tables.
q)`a`c#`a`b!(1 2;())
a| 1 2
c| `long$()
q)`b`c#`a`b!(();1 2)
b| 1 2
c| ()
As you mentioned, each-right (/:) will act on each row of the table, i.e. each dictionary. Instead of using an iterator to split the table into dictionaries, we can act on the dictionary itself. This returns the same output and is slightly faster.
q)d:`a`b!1 2
q)enlist`a`c#d
a c
---
1
q)(`a`c#/:enlist d)~enlist`a`c#d
1b
q)\ts:1000000 enlist`a`c#d
395 864
q)\ts:1000000 `a`c#/:enlist d
796 880
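For tables with more than one row, where the enlist trick no longer applies, the each-right form is still the simplest general approach; a small sketch (the name takeCols is mine, not from the original answer):
q)takeCols:{[c;t] c,:(); c#/:t}   / missing columns come back as nulls
q)takeCols[`a`c;([]a:1 2;b:3 4)]
a c
---
1
2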