Using the PySpark 3 DataFrame#transform method with arguments - pyspark

This question talks about how to chain custom PySpark 2 transformations.
The DataFrame#transform method was added to the PySpark 3 API.
This code snippet shows a custom transformation that doesn't take arguments and is working as expected and another custom transformation that takes arguments and is not working.
from pyspark.sql.functions import col, lit
df = spark.createDataFrame([(1, 1.0), (2, 2.)], ["int", "float"])
def with_funny(word):
def inner(df):
return df.withColumn("funny", lit(word))
return inner
def cast_all_to_int(input_df):
return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
df.transform(with_funny("bumfuzzle")).transform(cast_all_to_int).show()
Here's what's outputted:
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
| 1| 1| null|
| 2| 2| null|
+---+-----+-----+
How should the with_funny() method be defined to output a value for the PySpark 3 API?

If I understood, your first transform method will add a new column with a literal that is passed as an argument and the last transform casts all the columns to int type, correct?
casting a string to int will return a null value, your final output is correct:
from pyspark.sql.functions import col, lit
df = spark.createDataFrame([(1, 1.0), (2, 2.)], ["int", "float"])
def with_funny(word):
def inner(df):
return df.withColumn("funny", lit(word))
return inner
def cast_all_to_int(input_df):
return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
#first transform
df1 = df.transform(with_funny("bumfuzzle"))
df1.show()
#second transform
df2 = df1.transform(cast_all_to_int)
df2.show()
#all together
df_final = df.transform(with_funny("bumfuzzle")).transform(cast_all_to_int)
df_final.show()
Output:
+---+-----+---------+
|int|float| funny|
+---+-----+---------+
| 1| 1.0|bumfuzzle|
| 2| 2.0|bumfuzzle|
+---+-----+---------+
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
| 1| 1| null|
| 2| 2| null|
+---+-----+-----+
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
| 1| 1| null|
| 2| 2| null|
+---+-----+-----+
Maybe what you want is switch the order of your transformations like this:
df_final = df.transform(cast_all_to_int).transform(with_funny("bumfuzzle"))
df_final.show()
Output:
+---+-----+---------+
|int|float| funny|
+---+-----+---------+
| 1| 1|bumfuzzle|
| 2| 2|bumfuzzle|
+---+-----+---------+

It has been solved in pyspark 3.3.0
def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":
"""Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations.
.. versionadded:: 3.0.0
Parameters
----------
func : function
a function that takes and returns a :class:`DataFrame`.
*args
Positional arguments to pass to func.
.. versionadded:: 3.3.0
**kwargs
Keyword arguments to pass to func.
.. versionadded:: 3.3.0
Examples
--------
>>> from pyspark.sql.functions import col
>>> df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
>>> def cast_all_to_int(input_df):
... return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
>>> def sort_columns_asc(input_df):
... return input_df.select(*sorted(input_df.columns))
>>> df.transform(cast_all_to_int).transform(sort_columns_asc).show()
+-----+---+
|float|int|
+-----+---+
| 1| 1|
| 2| 2|
+-----+---+
>>> def add_n(input_df, n):
... return input_df.select([(col(col_name) + n).alias(col_name)
... for col_name in input_df.columns])
>>> df.transform(add_n, 1).transform(add_n, n=10).show()
+---+-----+
|int|float|
+---+-----+
| 12| 12.0|
| 13| 13.0|
+---+-----+
"""
result = func(self, *args, **kwargs)
assert isinstance(
result, DataFrame
), "Func returned an instance of type [%s], " "should have been DataFrame." % type(result)
return result

Related

Converting row into column using spark scala

I want to covert row into column using spark dataframe.
My table is like this
Eno,Name
1,A
1,B
1,C
2,D
2,E
I want to convert it into
Eno,n1,n2,n3
1,A,B,C
2,D,E,Null
I used this below code :-
val r = spark.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load("C:\\Users\\axy\\Desktop\\abc2.csv")
val n =Seq("n1","n2","n3"
r
.groupBy("Eno")
.pivot("Name",n).agg(expr("coalesce(first(Name),3)").cast("double")).show()
But I am getting result as-->
+---+----+----+----+
|Eno| n1| n2| n3|
+---+----+----+----+
| 1|null|null|null|
| 2|null|null|null|
+---+----+----+----+
Can anyone help to get the desire result.
val m= map(lit("A"), lit("n1"), lit("B"),lit("n2"), lit("C"), lit("n3"), lit("D"), lit("n1"), lit("E"), lit("n2"))
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
df.withColumn("new", m($"Name")).groupBy("Eno").pivot("new").agg(first("Name"))
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+
import org.apache.spark.sql.functions._
import spark.implicits._
val df= Seq((1,"A"),(1,"B"),(1,"C"),(2,"D"),(2,"E")).toDF("Eno","Name")
val getName=udf {(names: Seq[String],i : Int) => if (names.size>i) names(i) else null}
val tdf=df.groupBy($"Eno").agg(collect_list($"name").as("names"))
val ndf=(0 to 2).foldLeft(tdf){(ndf,i) => ndf.withColumn(s"n${i}",getName($"names",lit(i))) }.
drop("names")
ndf.show()
+---+---+---+----+
|Eno| n0| n1| n2|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+

Spark Dataframe - Method to take row as input & dataframe has output

I need to write a method that iterates all the rows from DF2 and generate a Dataframe based on some conditions.
Here is the inputs DF1 & DF2 :
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
List("2016-10-31","1000000","1000","0","1"),
List("2016-12-01","100000","950","1","1"),
List("2017-01-01","50000","50","2","1"),
List("2017-03-01","50000","100","3","1"),
List("2017-03-30","80000","300","4","1")
)
.map(row =>(row(0), row(1),row(2),row(3),row(4))).toDF(df1Columns:_*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
List("2017-02-01","0","400")
).map(row =>(row(0), row(1),row(2))).toDF(df2Columns:_*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date values from each row of DF2.
For example, first row of df2.Eftv_date=Feb 01 2017, so need to filter df1 having records Eftv_date less than or equal to Feb 01 2017.So this will generate 3 records as below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using map function.
def transformRows(row: Row ) = {
val dateEffective = row.getAs[String]("Eftv_Date")
val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
df1 = df1LayerMet
df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note : We can implement this using join , But I need to implement a custom scala method to do this , since there were a lot of transformations involved. For simplicity I have mentioned only one condition.
Seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+

How to use DataFrame Window expressions and withColumn and not to change partition?

For some reason I have to convert RDD to DataFrame, then do something with DataFrame.
My interface is RDD,so I have to convert DataFrame to RDD, and when I use df.withcolumn, the partition change to 1, so I have to repartition and sortBy RDD.
Is there any cleaner solution ?
This is my code :
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition = rdd.getNumPartitions
println(partition + "rdd")
val df=rdd.toDF()
val rdd2=df.rdd
val result = rdd.toDF("col1")
.withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
.withColumn("rownum", row_number().over(Window.orderBy($"col1")))
.withColumn("avg", $"csum"/$"rownum").rdd
println(result.getNumPartitions + "rdd2")
Let's make this as simple as possible, we will generate the same data into 4 partitions
scala> val df = spark.range(1,9,1,4).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
+---+
scala> df.rdd.getNumPartitions
res13: Int = 4
We don't need 3 window functions to prove this, so let's do it with one :
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
So what's happening here is that we didn't just add a column but we computed a window of cumulative sum over the data and since you haven't provided an partition column, the window function will move all the data to a single partition and you even get a warning from spark :
scala> df2.rdd.getNumPartitions
17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
res14: Int = 1
scala> df2.show
17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+----+
| id|csum|
+---+----+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
| 6| 21|
| 7| 28|
| 8| 36|
+---+----+
So let's add now a column to partition on. We will create a new DataFrame just for the sake of demonstration :
scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
It has indeed the same number of partitions that we defined explicitly on df :
scala> df3.rdd.getNumPartitions
res18: Int = 4
Let's perform our window operation using the column x to partition :
scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
scala> df4.show
+---+---+----+
| id| x|csum|
+---+---+----+
| 5| b| 5|
| 6| b| 11|
| 7| b| 18|
| 8| b| 26|
| 1| a| 1|
| 2| a| 3|
| 3| a| 6|
| 4| a| 10|
+---+---+----+
The window function will repartition our data using the default number of partitions set in spark configuration.
scala> df4.rdd.getNumPartitions
res20: Int = 200
I was just reading about controlling the number of partitions when using groupBy aggregation, from https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html, it seems the same trick works with Window, in my code I'm defining a window like
windowSpec = Window \
.partitionBy('colA', 'colB') \
.orderBy('timeCol') \
.rowsBetween(1, 1)
and then doing
next_event = F.lead('timeCol', 1).over(windowSpec)
and creating a dataframe via
df2 = df.withColumn('next_event', next_event)
and indeed, it has 200 partitions. But, if I do
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
it has 10!

Combining RDD's with some values missing

Hi I have two RDD's I want to combine into 1.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
So the keys in the second RDD are more in no. but overlap with RDD1. I need to combine the RDD's so that only those keys of 2nd RDD append to RDD1 which are not there in RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it in this way:
RDD's are really slow, so read your data or convert your data in dataframes.
Use spark dropDuplicates() on both the dataframes like df.dropDuplicates(['Key1', 'Key2']) to get distinct values on keys in both of your dataframe and then
simply union them like df1.union(df2).
Benefit is you are doing it in spark way and hence you have all the parallelism and speed.

I want to convert all my existing UDTFs in Hive to Scala functions and use it from Spark SQL

Can any one give me an example UDTF (eg; explode) written in scala which returns multiple row and use it as UDF in SparkSQL?
Table: table1
+------+----------+----------+
|userId|someString| varA|
+------+----------+----------+
| 1| example1| [0, 2, 5]|
| 2| example2|[1, 20, 5]|
+------+----------+----------+
I'd like to create the following Scala code:
def exampleUDTF(var: Seq[Int]) = <Return Type???> {
// code to explode varA field ???
}
sqlContext.udf.register("exampleUDTF",exampleUDTF _)
sqlContext.sql("FROM table1 SELECT userId, someString, exampleUDTF(varA)").collect().foreach(println)
Expected output:
+------+----------+----+
|userId|someString|varA|
+------+----------+----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+----+
You can't do this with a UDF. A UDF can only add a single column to a DataFrame. There is, however, a function called DataFrame.explode, which you can use instead. To do it with your example, you would do this:
import org.apache.spark.sql._
val df = Seq(
(1,"example1", Array(0,2,5)),
(2,"example2", Array(1,20,5))
).toDF("userId", "someString", "varA")
val explodedDf = df.explode($"varA"){
case Row(arr: Seq[Int]) => arr.toArray.map(a => Tuple1(a))
}.drop($"varA").withColumnRenamed("_1", "varA")
+------+----------+-----+
|userId|someString| varA|
+------+----------+-----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+-----+
Note that explode takes a function as an argument. So even though you can't create a UDF to do what you want, you can create a function to pass to explode to do what you want. Like this:
def exploder(row: Row) : Array[Tuple1[Int]] = {
row match { case Row(arr) => arr.toArray.map(v => Tuple1(v)) }
}
df.explode($"varA")(exploder)
That's about the best you are going to get in terms of recreating a UDTF.
Hive Table:
name id
["Subhajit Sen","Binoy Mondal","Shantanu Dutta"] 15
["Gobinathan SP","Harsh Gupta","Rahul Anand"] 16
Creating a scala function :
def toUpper(name: Seq[String]) = (name.map(a => a.toUpperCase)).toSeq
Registering function as UDF :
sqlContext.udf.register("toUpper",toUpper _)
Calling the UDF using sqlContext and storing output as DataFrame object :
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name")
Exploding the DataFrame :
df.explode(df("Name")){case org.apache.spark.sql.Row(arr: Seq[String]) => arr.toSeq.map(v => Tuple1(v))}.drop(df("Name")).withColumnRenamed("_1","Name").show
Result:
+--------------+
| Name|
+--------------+
| SUBHAJIT SEN|
| BINOY MONDAL|
|SHANTANU DUTTA|
| GOBINATHAN SP|
| HARSH GUPTA|
| RAHUL ANAND|
+--------------+