This is related to this question: Pyspark dataframe column value dependent on value from another row, but this one is even more complicated.
I have a dataframe:
columns = ['id','seq','manufacturer']
data = [
    ("1", 1, "Factory"), ("1", 2, "Sub-Factory-1"), ("1", 3, "Order"), ("1", 4, "Sub-Factory-1"),
    ("2", 1, "Factory"), ("2", 2, "Sub-Factory-1"), ("2", 5, "Sub-Factory-1"),
    ("3", 1, "Sub-Factory-1"), ("3", 2, "Order"), ("3", 4, "Sub-Factory-1"),
    ("4", 1, "Factory"), ("4", 3, "Sub-Factory-1"), ("4", 4, "Sub-Factory-1"),
    ("5", 1, "Sub-Factory-1"), ("5", 2, "Sub-Factory-1"), ("5", 6, "Order"),
    ("6", 2, "Factory"), ("6", 3, "Order"), ("6", 4, "Sub-Factory-1"), ("6", 6, "Sub-Factory-1"), ("6", 7, "Order"),
    ("7", 1, "Sub-Factory-1"), ("7", 2, "Factory"), ("7", 3, "Order"), ("7", 4, "Sub-Factory-1"),
    ("7", 5, "Factory"), ("7", 8, "Sub-Factory-1"), ("7", 10, "Sub-Factory-1"),
]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.orderBy('id','seq').show(40)
+---+---+-------------+
| id|seq| manufacturer|
+---+---+-------------+
| 1| 1| Factory|
| 1| 2|Sub-Factory-1|
| 1| 3| Order|
| 1| 4|Sub-Factory-1|
| 2| 1| Factory|
| 2| 2|Sub-Factory-1|
| 2| 5|Sub-Factory-1|
| 3| 1|Sub-Factory-1|
| 3| 2| Order|
| 3| 4|Sub-Factory-1|
| 4| 1| Factory|
| 4| 3|Sub-Factory-1|
| 4| 4|Sub-Factory-1|
| 5| 1|Sub-Factory-1|
| 5| 2|Sub-Factory-1|
| 5| 6| Order|
| 6| 2| Factory|
| 6| 3| Order|
| 6| 4|Sub-Factory-1|
| 6| 6|Sub-Factory-1|
| 6| 7| Order|
| 7| 1|Sub-Factory-1|
| 7| 2| Factory|
| 7| 3| Order|
| 7| 4|Sub-Factory-1|
| 7| 5| Factory|
| 7| 8|Sub-Factory-1|
| 7| 10|Sub-Factory-1|
+---+---+-------------+
What I want to do is assign hierarchical values to another column (not saying it's the best idea) that I can use with the logic from Pyspark dataframe column value dependent on value from another row. So within each id group, in seq order, I want only the first Sub-Factory to be attributed to a Factory, and only if there is a Factory above that Sub-Factory within the same id and seq order.
So the end result should look like:
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
| 6| 7| Order| 0|
| 7| 1|Sub-Factory-1| 0|
| 7| 2| Factory| 1|
| 7| 3| Order| 0|
| 7| 4|Sub-Factory-1| 1|
| 7| 5| Factory| 1|
| 7| 8|Sub-Factory-1| 1|
| 7| 10|Sub-Factory-1| 0|
+---+---+-------------+-------+
The dataset is large, so I can't use something like df.collect() and then loop over the data, because that blows up memory. My first idea was to use an accumulator:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

acc = sc.accumulator(0)

def myFunc(manufacturer):
    if manufacturer == 'Factory':
        acc.value = 1
        return 1
    elif manufacturer == 'Sub-Factory-1' and acc.value == 1:
        acc.value = 0
        return 1
    else:
        return 0

myFuncUDF = F.udf(myFunc, IntegerType())
df = df.withColumn('test', myFuncUDF(col('manufacturer')))
But that's a bad idea, since an accumulator's value cannot be accessed from within tasks.
A Window function also solves it if I want to attribute all Sub-Factories that follow a Factory within the same id, but here only the first Sub-Factory should get attributed. Any ideas?
from pyspark.sql.window import Window
from pyspark.sql.functions import *
df_mod = df.filter(df.manufacturer == 'Sub-Factory-1')
W = Window.partitionBy("id").orderBy("seq")
df_mod = df_mod.withColumn("rank", rank().over(W))
df_mod = df_mod.filter(col('rank') == 1)

df_mod2 = df.filter(col('manufacturer') == 'Factory')\
    .select('id', 'seq', col('manufacturer').alias('Factory_chk_2'))

df_f = df\
    .join(df_mod, ['id', 'seq'], 'left')\
    .select('id', 'seq', df.manufacturer, 'rank')\
    .join(df_mod2, 'id', 'left')\
    .select('id', df.seq, df.manufacturer, 'rank', 'Factory_chk_2')\
    .withColumn('Factory_chk', when(df.manufacturer == 'Factory', 1))\
    .withColumn('Factory_chk_2', when(col('Factory_chk_2') == 'Factory', 1))\
    .withColumn('checker', when(col('Factory_chk_2') == '1', coalesce(col('rank'), col('Factory_chk'))).otherwise(lit(0)))\
    .select('id', 'seq', 'manufacturer', 'checker')\
    .na.fill(value=0)\
    .orderBy('id', 'seq')
df_f.show()
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
+---+---+-------------+-------+
only showing top 20 rows
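One way to get the checker column above (a minimal sketch, assuming manufacturer only ever takes the three values shown in the sample): ignore the Order rows, and within each id, ordered by seq, flag a Sub-Factory-1 only when the previous non-Order row is a Factory; Factory rows are always flagged, and the flag is then joined back so everything else defaults to 0.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('seq')

# Keep only Factory / Sub-Factory-1 rows; the previous such row decides the flag
non_order = df.filter(F.col('manufacturer') != 'Order') \
    .withColumn('prev', F.lag('manufacturer').over(w)) \
    .withColumn(
        'checker',
        F.when(F.col('manufacturer') == 'Factory', 1)
         .when((F.col('manufacturer') == 'Sub-Factory-1') & (F.col('prev') == 'Factory'), 1)
         .otherwise(0)
    )

# Join the flag back onto the full dataframe; Order rows and unmatched
# Sub-Factory-1 rows end up with checker = 0
result = df.join(non_order.select('id', 'seq', 'checker'), ['id', 'seq'], 'left') \
    .na.fill({'checker': 0}) \
    .orderBy('id', 'seq')
result.show(40)
On the sample data this should reproduce the expected checker column, and it avoids collecting anything to the driver.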
I have a dataframe:
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the count values from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4 + 18 + 1 + 38 = 61. I would like to do this for all sports.
Any help would be appreciated.
You need to group by the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))
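If the summed column should keep the name count in the result, the aggregate can also be aliased; a small optional variation on the same group-by:
import pyspark.sql.functions as F

# Same aggregation, but the result column is renamed back to "count"
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))
grouped_df.show()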
I have two Spark DataFrames, DF and DF1, with different schemas.
DF:
val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")
+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
| 1| acv| 34| a| 1|
| 2| fbg| 56| b| 3|
| 3| rty| 78| c| 5|
+---+----+---+------+----+
DF1:
val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
| 1| gbj| 67| a| 5|
| 2| gbj| 67| a| 7|
| 2| jku| 88| b| 8|
| 4| jku| 88| b| 7|
| 5| uuu| 12| c| 9|
+---+----+---+------+----+
I want to merge DF1 with DF based on the values of id and DBName. If the id and DBName already exist in DF, the record should be updated; if they don't, a new record should be added. The resulting data frame should look like this:
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
|  5| uuu| 12|     c|NULL|   9|
|  2| jku| 88|     b|NULL|   8|
|  4| jku| 88|     b|NULL|   7|
|  1| gbj| 67|     a|NULL|   5|
|  3| rty| 78|     c|   5|NULL|
|  2| gbj| 67|     a|NULL|   7|
+---+----+---+------+----+----+
So far I have tried:
val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;
You're selecting non-existent columns, and there is also a typo in the coalesce. You can follow the example below to fix your issue:
val updatedDF = DF.as("a").join(
  DF1.as("b"),
  $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
  "outer"
).select(
  // for the shared columns (all of DF's columns except the last, "test"),
  // prefer DF1's value and fall back to DF's
  DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
    // then keep DF's last column ("test") and DF1's last column ("col1") as they are
    :+ col(DF.columns.last)
    :+ col(DF1.columns.last)
    : _*
)
updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|null| 9|
| 2| jku| 88| b| 3| 8|
| 4| jku| 88| b|null| 7|
| 1| gbj| 67| a| 1| 5|
| 3| rty| 78| c| 5|null|
| 2| gbj| 67| a|null| 7|
+---+----+---+------+----+----+
I have the below dataframe.
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
created by:
val df = Seq(
  (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6),
  (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6),
  (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6)
).toDF("M","Amount","Id")
Here M is the base column and Id ranks the rows within each M based on Amount.
I am trying to compute the percentile within each M group, but over only the last three values of Amount at each row.
I am using the below code to find the percentile for a group, but how can I target only the last three values?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+-----------------------------------+
This seems a little bit tricky to me as I am still learning Spark. Expecting answers from enthusiasts here.
Adding orderBy("Amount") and rowsBetween(-2, 0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount
rowsBetween(-2, 0) takes only the current row and the two rows before it into account when calculating the percentile
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2, 0)
df.withColumn("percentile",
    PercentileApprox.percentile_approx(col("Amount"), lit(.5)).over(w))
  .orderBy("M", "Amount") // not really required, just to make the output more readable
  .show()
prints
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
I have a DataFrame with a column "Speed". Can I efficiently add a column containing, for each row, the number of rows in the DataFrame whose "Speed" is within +/2 of that row's "Speed"?
results = spark.createDataFrame(
    [[1], [2], [3], [4], [5],
     [4], [5], [4], [5], [6],
     [5], [6], [1], [3], [8],
     [2], [5], [6], [10], [12]],
    ['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order the window by Speed and look at the range [0, +2]
w = Window.orderBy('Speed').rangeBetween(0, 2)
# Add a column counting the rows whose Speed lies in [Speed, Speed + 2]
results = results.withColumn('count+2', F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
|    1|      6|
|    1|      6|
|    2|      7|
|    2|      7|
|    3|     10|
|    3|     10|
|    4|     11|
|    4|     11|
|    4|     11|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    6|      4|
|    6|      4|
|    6|      4|
|    8|      2|
|   10|      2|
|   12|      1|
+-----+-------+
Note: the window function counts the studied row itself. You could correct for this by subtracting 1 from the count:
results = results.withColumn('count+2', F.count('Speed').over(w) - 1).orderBy('Speed')
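If "within +/2" is meant as a band on both sides of the current row (i.e. within plus or minus 2, which is an assumption about the intent), the same pattern works with a wider range; count_pm2 below is just an illustrative column name:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumption: count rows whose Speed falls in [Speed - 2, Speed + 2], excluding the row itself
w2 = Window.orderBy('Speed').rangeBetween(-2, 2)
results = results.withColumn('count_pm2', F.count('Speed').over(w2) - 1).orderBy('Speed')
results.show()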