Accumulator gives a different result than applying the function directly - kdb

While trying to combine two result sets, I came across different behavior when joining two keyed tables:
q)show t:([a:1 1 2]b:011b)
a| b
-| -
1| 0
1| 1
2| 1
q)t,t
a| b
-| -
1| 1
1| 1
2| 1
q)(,/)(t;t)
a| b
-| -
1| 1
2| 1
Why does the accumulator ,/ remove duplicated keys, and why does its result differ from a direct table join with ,?

I suspect that join over (aka ,/ aka raze) has special handling under the covers that isn't exposed to the end user: the interpreter recognises ,/ and behaves differently depending on the inputs. The same appears to apply to dictionaries and keyed tables:
q)raze(`a`a`b!1 2 3;`a`b!9 9)
a| 9
b| 9
q)
q)(`a`a`b!1 2 3),`a`b!9 9
a| 9
a| 2
b| 9
q)
q)({x,y}/)(`a`a`b!1 2 3;`a`b!9 9)
a| 9
a| 2
b| 9

Related

Pyspark pivot table and create new columns based on values of another column

I am trying to create new versions of existing columns based on the values of another column, e.g. from the input I create new columns for 'var1', 'var2', 'var3' for each value the 'split' column can take.
Input:
time student  split var1 var2 var3
t1   Student1 A     1    3    7
t1   Student1 B     2    5    6
t1   Student1 C     3    1    9
t2   Student1 A     5    3    7
t2   Student1 B     9    6    3
t2   Student1 C     3    5    3
t1   Student2 A     1    2    8
t1   Student2 C     7    4    0
Output:
time student  splitA_var1 splitA_var2 splitA_var3 splitB_var1 splitB_var2 splitB_var3 splitC_var1 splitC_var2 splitC_var3
t1   Student1 1           3           7           2           5           6           3           1           9
t2   Student1 5           3           7           9           6           3           3           5           3
t1   Student2 1           2           8           null        null        null        7           4           0
This is a straightforward pivot with multiple aggregations (within agg()); see the example below.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col'). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')). \
    show()
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |time| student|splitA_var1|splitA_var2|splitA_var3|splitB_var1|splitB_var2|splitB_var3|splitC_var1|splitC_var2|splitC_var3|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# | t1|Student2| 1| 2| 8| null| null| null| 7| 4| 0|
# | t2|Student1| 5| 3| 7| 9| 6| 3| 3| 5| 3|
# | t1|Student1| 1| 3| 7| 2| 5| 6| 3| 1| 9|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Spark names the new columns using the following convention: <pivot column value>_<aggregation alias>.
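For reference, the sample data_sdf used above can be built from the question's input like this (a minimal sketch; the SparkSession setup is an assumption, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data taken from the question's input table
data_sdf = spark.createDataFrame(
    [('t1', 'Student1', 'A', 1, 3, 7),
     ('t1', 'Student1', 'B', 2, 5, 6),
     ('t1', 'Student1', 'C', 3, 1, 9),
     ('t2', 'Student1', 'A', 5, 3, 7),
     ('t2', 'Student1', 'B', 9, 6, 3),
     ('t2', 'Student1', 'C', 3, 5, 3),
     ('t1', 'Student2', 'A', 1, 2, 8),
     ('t1', 'Student2', 'C', 7, 4, 0)],
    ['time', 'student', 'split', 'var1', 'var2', 'var3'])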

How to count the last 30 day occurrence & transpose a column's row value to new columns in pyspark

I am trying to get the count of occurrences of the status column for each 'name', 'id' and 'branch' combination over the last 30 days using PySpark.
For simplicity, let's assume the current day is 19/07/2021.
Input dataframe
id name branch status eventDate
1 a main failed 18/07/2021
1 a main error 15/07/2021
2 b main failed 16/07/2021
3 c main snooze 12/07/2021
4 d main failed 18/01/2021
2 b main failed 18/07/2021
expected output
id name branch failed error snooze
1 a main 1 1 0
2 b main 2 0 0
3 c main 0 0 1
4 d main 0 0 0
I tried the following code.
from pyspark.sql import functions as F

df = df.withColumn("eventAgeinDays", F.datediff(F.current_timestamp(), F.col("eventDate")))
df = df.groupBy('id', 'branch', 'name', 'status')\
    .agg(
        F.sum(
            F.when(F.col("eventAgeinDays") <= 30, 1).otherwise(0)
        ).alias("Last30dayFailure")
    )
df = df.groupBy('id', 'branch', 'name', 'status').pivot('status').agg(F.collect_list('Last30dayFailure'))
The code kind of gives me the output, but I get arrays in the output since I am using F.collect_list()
my partially correct output
id name branch failed error snooze
1 a main [1] [1] []
2 b main [2] [] []
3 c main [] [] [1]
4 d main [] [] []
Could you please suggest a more elegant way of creating my expected output? Or let me know how to fix my code?
Instead of collect_list, which produces a list, use first as the aggregation method. (first is safe here because the data is already aggregated by id, branch, name and status, so there is at most one value for each unique combination.)
(df.groupBy('id', 'branch', 'name')
   .pivot('status')
   .agg(F.first('Last30dayFailure'))
   .fillna(0)
   .show())
+---+------+----+-----+------+------+
| id|branch|name|error|failed|snooze|
+---+------+----+-----+------+------+
| 1| main| a| 1| 1| 0|
| 4| main| d| 0| 0| 0|
| 3| main| c| 0| 0| 1|
| 2| main| b| 0| 2| 0|
+---+------+----+-----+------+------+
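Putting the date handling and the pivot together, a minimal end-to-end sketch (parsing eventDate from dd/MM/yyyy and pinning the reference date to 19/07/2021 are assumptions made so the example matches the expected output; the question's code uses current_timestamp() instead):

from pyspark.sql import functions as F

ref_date = F.to_date(F.lit('19/07/2021'), 'dd/MM/yyyy')  # assumed "today"

result = (df
          .withColumn('eventAgeinDays',
                      F.datediff(ref_date, F.to_date('eventDate', 'dd/MM/yyyy')))
          .groupBy('id', 'branch', 'name', 'status')
          .agg(F.sum(F.when(F.col('eventAgeinDays') <= 30, 1).otherwise(0))
                .alias('Last30dayFailure'))
          .groupBy('id', 'branch', 'name')
          .pivot('status')
          .agg(F.first('Last30dayFailure'))
          .fillna(0))

result.show()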

Remove rows from Spark DataFrame that ONLY satisfy two conditions

I am using Scala and Spark. I want to remove from a DataFrame the rows that satisfy ALL of the conditions I specify, while keeping rows that satisfy only one of the conditions.
For example: let's say I have this DataFrame
+-------+----+
|country|date|
+-------+----+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
and I want to filter out rows with country A and dates 1 and 2, so the expected output would be:
+-------+----+
|country|date|
+-------+----+
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
As you can see, I am still keeping country B with dates 1 and 2.
I tried to use filter in the following way
df.filter("country != 'A' and date not in (1,2)")
But this filters out all rows with dates 1 and 2, which is not what I want.
Thanks.
Your current condition is
df.filter("country != 'A' and date not in (1,2)")
which can be read as "accept any country other than A, and accept any date that is not 1 or 2". The two conditions are applied independently.
What you want is:
df.filter("not (country = 'A' and date in (1,2))")
i.e. "Find the rows with country A and date of 1 or 2, and reject them"
or equivalently:
df.filter("country != 'A' or date not in (1,2)")
i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"
See De Morgan's laws:
not (A or B) = not A and not B
not (A and B) = not A or not B
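The same predicate can also be written with the Column API instead of a SQL string; a sketch in PySpark (the question uses Scala, where the equivalent operators are !, && and isin):

from pyspark.sql import functions as F

# reject a row only when it matches BOTH conditions
df.filter(~((F.col('country') == 'A') & F.col('date').isin(1, 2))).show()

# equivalently, by De Morgan's laws
df.filter((F.col('country') != 'A') | ~F.col('date').isin(1, 2)).show()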

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of each group, the count of non-null Number values, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean, since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.
As far as I know, there's no simple way to compute the mode: you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straightforward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base")
  .join(
    numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
      $"base.Color" === $"max.Color" and
      $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
    count("Color") as "Size",         // counting a key column -> includes nulls
    count("Number") as "CountNumber", // does not include nulls
    mean("Number") as "MeanNumber"
  ).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine, this might be slow, as it has four groupBys and two joins, all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.
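As an aside, the mode itself can also be computed with a window function instead of the count/max self-join; a sketch in PySpark rather than the Scala above (ties are broken arbitrarily, and the column names are assumed to match the question):

from pyspark.sql import functions as F, Window

# count occurrences of each non-null Number per (Category, Color)
counts = (df.filter(F.col('Number').isNotNull())
            .groupBy('Category', 'Color', 'Number')
            .count())

# rank the counts within each group and keep the most frequent value
w = Window.partitionBy('Category', 'Color').orderBy(F.col('count').desc())
mode_numbers = (counts
                .withColumn('rn', F.row_number().over(w))
                .filter(F.col('rn') == 1)
                .select('Category', 'Color',
                        F.col('Number').alias('ModeNumber'),
                        F.col('count').alias('ModeCountNumber')))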

How to filter duplicate records having multiple keys in Spark DataFrame?

I have two dataframes. I want to delete some records in Data Frame-A based on some common column values in Data Frame-B.
For Example:
Data Frame-A:
A B C D
1 2 3 4
3 4 5 7
4 7 9 6
2 5 7 9
Data Frame-B:
A B C D
1 2 3 7
2 5 7 4
2 9 8 7
Keys: A,B,C columns
Desired Output:
A B C D
3 4 5 7
4 7 9 6
Is there any solution for this?
You are looking for a left anti-join:
df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 3| 4| 5| 7|
| 4| 7| 9| 6|
+---+---+---+---+
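For readers using PySpark, the equivalent is (a sketch, assuming the same df_a and df_b):

# keep only the rows of df_a whose (A, B, C) key has no match in df_b
df_a.join(df_b, on=['A', 'B', 'C'], how='left_anti').show()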