Add another column after groupBy and agg - scala

I have a DataFrame that looks like this:
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 3| 3| 600|
| 2| 3| 702|
| 1| 2| 120|
| 2| 5| 200|
| 2| 2| 500|
| 3| 1| 100|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
I want to output the max count for each vocabId together with the docId it belongs to. I did this:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))
and got this:
+-------+----------+
|vocabId| count |
+-------+----------+
| 1| 1000|
| 3| 1200|
| 5| 2000|
| 4| 122|
| 2| 500|
+-------+----------+
How do I add the docId column at the front?
It should look something like this (the order is not important):
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 2| 2| 500|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+

You can do a self-join with docwords on vocabId and count, something like below. If several docIds tie for the maximum count of a vocabId, each of them appears in the result.
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count")).join(docwords,Seq("vocabId","count"))
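To make the join-back idea concrete, here is a minimal sketch in plain Python (not the Spark API) that applies the same two steps to the sample rows from the question. The variable names `rows`, `max_count` and `result` are illustrative only.

```python
# Plain-Python illustration (not Spark) of "max per group, then join back".
rows = [  # (docId, vocabId, count), copied from the question
    (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
    (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000),
]

# Step 1: the equivalent of groupBy("vocabId").agg(max("count")).
max_count = {}
for _doc, vocab, count in rows:
    max_count[vocab] = max(count, max_count.get(vocab, count))

# Step 2: join back on (vocabId, count) to recover the docId.
# Ties on the maximum would produce one output row per tied docId.
result = [(d, v, c) for d, v, c in rows if max_count[v] == c]

print(sorted(result, key=lambda r: r[1]))
```

The join keeps exactly the original rows whose count equals the per-vocabId maximum, which is why docId comes back without being part of the aggregation.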

Pyspark keep state within tasks

This is related to this question: Pyspark dataframe column value dependent on value from another row but this one gets even more complicated.
I have a dataframe:
columns = ['id','seq','manufacturer']
data = [("1",1,"Factory"), ("1",2,"Sub-Factory-1"), ("1",3,"Order"),("1",4,"Sub-Factory-1"),("2",1,"Factory"), ("2",2,"Sub-Factory-1"), ("2",5,"Sub-Factory-1"),("3",1, "Sub-Factory-1"),("3",2,"Order"), ("3",4, "Sub-Factory-1"), ("4", 1,"Factory"), ("4",3, "Sub-Factory-1"),("4",4, "Sub-Factory-1"),("5",1,"Sub-Factory-1"), ("5",2, "Sub-Factory-1"), ("5", 6,"Order"), ("6",2,"Factory"), ("6",3, "Order"), ("6",4,"Sub-Factory-1"), ("6", 6,"Sub-Factory-1"), ("6",7,"Order"), ("7",1,"Sub-Factory-1"), ("7",2,"Factory" ), ("7", 3,"Order"), ("7", 4,"Sub-Factory-1"),("7",5,"Factory"), ("7",8, "Sub-Factory-1"),("7",10,"Sub-Factory-1")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.orderBy('id','seq').show(40)
+---+---+-------------+
| id|seq| manufacturer|
+---+---+-------------+
| 1| 1| Factory|
| 1| 2|Sub-Factory-1|
| 1| 3| Order|
| 1| 4|Sub-Factory-1|
| 2| 1| Factory|
| 2| 2|Sub-Factory-1|
| 2| 5|Sub-Factory-1|
| 3| 1|Sub-Factory-1|
| 3| 2| Order|
| 3| 4|Sub-Factory-1|
| 4| 1| Factory|
| 4| 3|Sub-Factory-1|
| 4| 4|Sub-Factory-1|
| 5| 1|Sub-Factory-1|
| 5| 2|Sub-Factory-1|
| 5| 6| Order|
| 6| 2| Factory|
| 6| 3| Order|
| 6| 4|Sub-Factory-1|
| 6| 6|Sub-Factory-1|
| 6| 7| Order|
| 7| 1|Sub-Factory-1|
| 7| 2| Factory|
| 7| 3| Order|
| 7| 4|Sub-Factory-1|
| 7| 5| Factory|
| 7| 8|Sub-Factory-1|
| 7| 10|Sub-Factory-1|
+---+---+-------------+
What I want to do is assign hierarchical values to another column (not saying it's the best idea) that I can use with the logic from "Pyspark dataframe column value dependent on value from another row". So within each id group, in seq order, I want only the first Sub-Factory to be attributed to a Factory, and only if there is a Factory above that Sub-Factory within the same id.
So end result should look like:
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
| 6| 7| Order| 0|
| 7| 1|Sub-Factory-1| 0|
| 7| 2| Factory| 1|
| 7| 3| Order| 0|
| 7| 4|Sub-Factory-1| 1|
| 7| 5| Factory| 1|
| 7| 8|Sub-Factory-1| 1|
| 7| 10|Sub-Factory-1| 0|
+---+---+-------------+-------+
The dataset is large, so I can't use something like df.collect() and then loop over the data, because it runs out of memory. My first idea was to use an accumulator like:
acc = sc.accumulator(0)
def myFunc(manufacturer):
    if manufacturer == 'Factory':
        acc.value = 1
        return 1
    elif manufacturer == 'Sub-Factory-1' and acc.value == 1:
        acc.value = 0
        return 1
    else:
        return 0
myFuncUDF = F.udf(myFunc, IntegerType())
df = df.withColumn('test', myFuncUDF(col('manufacturer')))
But that is a bad idea, since an accumulator's value cannot be read inside tasks.
A Window function also solves it if I want to attribute all Sub-Factories below a Factory within the same id, but now only the first Sub-Factory should be attributed. Any ideas?
from pyspark.sql.window import Window
from pyspark.sql.functions import *
df_mod = df.filter(df.manufacturer == 'Sub-Factory-1')
W = Window.partitionBy("id").orderBy("seq")
df_mod = df_mod.withColumn("rank",rank().over(W))
df_mod = df_mod.filter(col('rank') == 1)
df_mod2 = df.filter(col('manufacturer') == 'Factory')\
    .select('id', 'seq', col('manufacturer').alias('Factory_chk_2'))
df_f = df\
    .join(df_mod, ['id', 'seq'], 'left')\
    .select('id', 'seq', df.manufacturer, 'rank')\
    .join(df_mod2, 'id', 'left')\
    .select('id', df.seq, df.manufacturer, 'rank', 'Factory_chk_2')\
    .withColumn('Factory_chk', when(df.manufacturer=='Factory', 1))\
    .withColumn('Factory_chk_2', when(col('Factory_chk_2')=='Factory', 1))\
    .withColumn('checker',when(col('Factory_chk_2')=='1', coalesce(col('rank'),col('Factory_chk'))).otherwise(lit(0)))\
    .select('id', 'seq', 'manufacturer', 'checker')\
    .na.fill(value=0)\
    .orderBy('id', 'seq')
df_f.show()
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
+---+---+-------------+-------+
only showing top 20 rows
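The rule described in the question (a Factory opens a slot, the next Sub-Factory-1 in the same id consumes it) can be sketched as a small stateful scan over one id group. This is plain Python rather than Spark, and the names `add_checker` and `factory_open` are hypothetical. In Spark the same function could presumably be applied to each id group separately, for example through a grouped pandas function, so that no state has to be shared across tasks; that wiring is an assumption and is not shown here.

```python
from itertools import groupby
from operator import itemgetter


def add_checker(rows):
    """Take (id, seq, manufacturer) rows and append a checker flag to each."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by id, then seq
    for _id, group in groupby(ordered, key=itemgetter(0)):
        factory_open = False  # state is local to one id group
        for rid, seq, manufacturer in group:
            if manufacturer == "Factory":
                factory_open = True
                checker = 1
            elif manufacturer == "Sub-Factory-1" and factory_open:
                factory_open = False  # only the first Sub-Factory is attributed
                checker = 1
            else:
                checker = 0
            out.append((rid, seq, manufacturer, checker))
    return out


# Rows for id 7 from the question, the case with two Factories.
sample = [
    ("7", 1, "Sub-Factory-1"), ("7", 2, "Factory"), ("7", 3, "Order"),
    ("7", 4, "Sub-Factory-1"), ("7", 5, "Factory"),
    ("7", 8, "Sub-Factory-1"), ("7", 10, "Sub-Factory-1"),
]
print([row[3] for row in add_checker(sample)])  # [0, 1, 0, 1, 1, 1, 0]
```

Because the flag is reset for every id, the scan needs only the rows of a single group in memory at a time, never the whole dataset.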

Pyspark combine different rows based on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports.
Any help would be appreciated.
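You need to group by the Sport column and then aggregate the count column with the sum() function. Without an alias the result column is named sum(count), so alias it to count to match the desired output.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))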
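As a quick sanity check of what the grouped sum computes, here is a plain-Python sketch (not Spark) over the first six rows of the question's table. The names `rows` and `totals` are illustrative.

```python
from collections import defaultdict

# (Sport, Total_medals, count), first rows of the table in the question.
rows = [
    ("Alpine Skiing", 3, 4), ("Alpine Skiing", 2, 18),
    ("Alpine Skiing", 4, 1), ("Alpine Skiing", 1, 38),
    ("Archery", 2, 12), ("Archery", 1, 72),
]

# Equivalent of groupby('Sport') followed by sum('count').
totals = defaultdict(int)
for sport, _medals, count in rows:
    totals[sport] += count

print(dict(totals))  # {'Alpine Skiing': 61, 'Archery': 84}
```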

Perform merge/insert on two spark dataframes with different schemas?

I have two Spark DataFrames, DF and DF1, with different schemas.
DF:-
val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")
+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
| 1| acv| 34| a| 1|
| 2| fbg| 56| b| 3|
| 3| rty| 78| c| 5|
+---+----+---+------+----+
DF1:-
val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
| 1| gbj| 67| a| 5|
| 2| gbj| 67| a| 7|
| 2| jku| 88| b| 8|
| 4| jku| 88| b| 7|
| 5| uuu| 12| c| 9|
+---+----+---+------+----+
I want to merge DF1 into DF based on the values of id and DBName. If an id and DBName pair already exists in DF, the record should be updated; if it doesn't exist, the new record should be added. So the resulting DataFrame should look like this:
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|NULL|9 |
| 2| jku| 88| b|NULL|8 |
| 4| jku| 88| b|NULL|7 |
| 1| gbj| 67| a|NULL|5 |
| 3| rty| 78| c|5 |NULL|
| 2| gbj| 67| a|NULL|7 |
+---+----+---+------+----+----+
So far I have tried:
val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)
Error:-
org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;
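You're selecting columns that do not exist: test is only in DF and col1 is only in DF1, so $"b.test" cannot be resolved. There is also a typo in the coalesce, which uses $"b.$c" for both arguments. You can follow the example below to fix your issue: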
val updatedDF = DF.as("a").join(
    DF1.as("b"),
    $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
    "outer"
).select(
    DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
    :+ col(DF.columns.last)
    :+ col(DF1.columns.last)
    :_*
)
updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|null| 9|
| 2| jku| 88| b| 3| 8|
| 4| jku| 88| b|null| 7|
| 1| gbj| 67| a| 1| 5|
| 3| rty| 78| c| 5|null|
| 2| gbj| 67| a|null| 7|
+---+----+---+------+----+----+
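The row-level effect of the outer join plus coalesce can be traced with a plain-Python sketch (not Spark). It assumes each (id, DBName) pair is unique within a frame, which holds for the sample data; the helper `index_by_key` and the variable names are hypothetical.

```python
key_cols = ("id", "DBName")
shared_cols = ("id", "name", "age", "DBName")

df = [
    dict(id="1", name="acv", age="34", DBName="a", test="1"),
    dict(id="2", name="fbg", age="56", DBName="b", test="3"),
    dict(id="3", name="rty", age="78", DBName="c", test="5"),
]
df1 = [
    dict(id="1", name="gbj", age="67", DBName="a", col1="5"),
    dict(id="2", name="gbj", age="67", DBName="a", col1="7"),
    dict(id="2", name="jku", age="88", DBName="b", col1="8"),
    dict(id="4", name="jku", age="88", DBName="b", col1="7"),
    dict(id="5", name="uuu", age="12", DBName="c", col1="9"),
]


def index_by_key(rows):
    # Assumes keys are unique per frame; duplicates would overwrite each other.
    return {tuple(r[k] for k in key_cols): r for r in rows}


a, b = index_by_key(df), index_by_key(df1)
merged = {}
for key in a.keys() | b.keys():  # full outer join on (id, DBName)
    left, right = a.get(key, {}), b.get(key, {})
    # coalesce(b.c, a.c): prefer the value from DF1, fall back to DF.
    row = {c: right.get(c, left.get(c)) for c in shared_cols}
    row["test"] = left.get("test")   # exists only in DF
    row["col1"] = right.get("col1")  # exists only in DF1
    merged[key] = row

for key in sorted(merged):
    print(merged[key])
```

Rows present on only one side keep None in the column that belongs to the other frame, which corresponds to the null entries in the output above.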

Percentile over a specific column

I have the DataFrame below.
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
created by
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
Here M is the base column, and Id is the rank of each row within M, based on Amount.
I am trying to compute the percentile per group of M, but for each row only over the last three values of Amount (the current row and the two before it).
I am using the code below to find the percentile for a group, but how can I target the last three values?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+----------------------------------+
This seems a little bit tricky to me as I am still learning Spark. Looking forward to answers from enthusiasts here.
Adding orderBy("Amount") and rowsBetween(-2,0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount;
rowsBetween restricts the calculation of the percentile to the current row and the two rows before it.
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile", PercentileApprox.percentile_approx(col("Amount"), lit(.5))
    .over(w))
  .orderBy("M", "Amount") // not really required, just to make the output more readable
  .show()
prints
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
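To see which rows each frame covers, the following plain-Python sketch (not Spark) slides a window of the current row plus the two preceding rows over one group. The selection rule in `approx_median`, taking the element at index ceil(0.5 * n) - 1 of the sorted window, is an assumption chosen because it reproduces the printed output for these tiny windows; it is not a description of how Spark's approximate percentile is implemented.

```python
import math


def approx_median(window):
    # Assumed rule for small windows: element at index ceil(0.5 * n) - 1.
    ordered = sorted(window)
    return ordered[math.ceil(0.5 * len(ordered)) - 1]


def rolling_percentile(amounts, preceding=2):
    # Mirrors orderBy("Amount").rowsBetween(-preceding, 0) within one group.
    ordered = sorted(amounts)
    return [
        approx_median(ordered[max(0, i - preceding): i + 1])
        for i in range(len(ordered))
    ]


# Group M = 1 from the question.
print(rolling_percentile([5, 10, 15, 20, 25, 30]))  # [5, 5, 10, 15, 20, 25]
```

The first two rows of each group have fewer than three values in their frame, which is why they both report the smallest Amount.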

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column that holds, for each row, the number of rows in the DataFrame whose "Speed" is within ±2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
[4],[5],[4],[5],[6],
[5],[6],[1],[3],[8],
[2],[5],[6],[10],[12]],
['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as F
# Order the window by Speed and look at the value range [Speed, Speed+2];
# use rangeBetween(-2, 2) instead to cover Speed-2 to Speed+2
w = Window.orderBy('Speed').rangeBetween(0,2)
# Count the rows whose Speed falls inside that range
results = results.withColumn('count+2',F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-----+
|Speed|count|
+-----+-----+
| 1| 6|
| 1| 6|
| 2| 7|
| 2| 7|
| 3| 10|
| 3| 10|
| 4| 11|
| 4| 11|
| 4| 11|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 6| 4|
| 6| 4|
| 6| 4|
| 8| 2|
| 10| 2|
| 12| 1|
+-----+-----+
Note: the window function also counts the current row itself. You could correct this by subtracting 1 from the count:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')
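The difference between the forward range [Speed, Speed+2] used above and the symmetric range from Speed-2 to Speed+2 that the question asks about can be checked with a plain-Python sketch (not Spark). The helper `count_in_range` is hypothetical; it counts values in a closed interval around each speed using binary search on the sorted list.

```python
from bisect import bisect_left, bisect_right

speeds = [1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 5, 6, 1, 3, 8, 2, 5, 6, 10, 12]
ordered = sorted(speeds)


def count_in_range(value, lower, upper):
    # Number of speeds s with value + lower <= s <= value + upper.
    return bisect_right(ordered, value + upper) - bisect_left(ordered, value + lower)


# Same frame as rangeBetween(0, 2); the current row is included.
forward = {s: count_in_range(s, 0, 2) for s in set(speeds)}

# Frame of rangeBetween(-2, 2), minus 1 to exclude the current row itself.
symmetric = {s: count_in_range(s, -2, 2) - 1 for s in set(speeds)}

print(forward[1], forward[4], forward[12])  # 6 11 1
print(symmetric[5])                          # 12
```

The forward counts agree with the table printed above, while the symmetric counts give the number of other rows within two units on either side.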