Convert from IndexedSeq[DataFrame] to DataFrame? - scala

Newbie question:
I am trying to add columns to an existing DataFrame. I am working with Spark 1.4.1.
import sqlContext.implicits._
case class Test(rule: Int)
val test = sc.parallelize((1 to 2).map(i => Test(i-i))).toDF
test.registerTempTable("test")
test.show
+----+
|rule|
+----+
| 0|
| 0|
+----+
Then I add columns. Adding a single column works fine:
import org.apache.spark.sql.functions.lit
val t1 = test.withColumn("1", lit(0))
t1.show
+----+-+
|rule|1|
+----+-+
| 0|0|
| 0|0|
+----+-+
The problem appears when I try to add several columns:
val t1 = (1 to 5).map(i => test.withColumn(i.toString, lit(i)))
t1.show()
error: value show is not a member of scala.collection.immutable.IndexedSeq[org.apache.spark.sql.DataFrame]

You need a fold rather than a map here: each withColumn call must build on the previous result. So instead of using map, use foldLeft with the test DataFrame as the initial value:
val t1 = (1 to 5).foldLeft(test){ case(df, i) => df.withColumn(i.toString, lit(i))}
t1.show
+----+---+---+---+---+---+
|rule| 1| 2| 3| 4| 5|
+----+---+---+---+---+---+
| 0| 1| 2| 3| 4| 5|
| 0| 1| 2| 3| 4| 5|
+----+---+---+---+---+---+
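For completeness, the same result can be obtained in a single select by building all the literal columns up front; a minimal sketch, assuming the same test DataFrame and imports as above:
import org.apache.spark.sql.functions.{col, lit}
// Build the five literal columns first, then add them all in one select.
val newCols = (1 to 5).map(i => lit(i).as(i.toString))
val t2 = test.select(col("*") +: newCols : _*)
t2.show()
Either way, the key point is to keep working on a single DataFrame instead of mapping to an IndexedSeq[DataFrame].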

Related

Selecting specific rows from different dataframes within a map scope

Hello, I am new to Spark and Scala, and I have three similar dataframes, as follows:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
|    Chad|      1|      0|      5|
|Paraguay|      4|      6|      3|
|  Russia|      0|      0|      1|
+--------+-------+-------+-------+
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1 but I also need to select the same row (using the Country as key) from the other two dataframes because I need the selected rows as input arguments for the function I want to apply.
I thought of using
df1.map { r =>
  val selectedRowDf2 = ??? // select the row from df2 using r's "Country" value
  val selectedRowDf3 = ??? // select the row from df3 using r's "Country" value
  functionToApply(r, selectedRowDf2, selectedRowDf3)
}
I tried this with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach is to prefix each dataframe's columns with a key to uniquely identify them, and then merge all the dataframes into a single dataframe on the country column. The desired operation can then be performed on each row of the merged dataframe.
// Prefix every column name with the given key, e.g. "1/22/20" -> "key1_1/22/20".
def appendColWithKey(df: DataFrame, key: String) = {
  var newdf = df
  df.schema.foreach(s => {
    newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
  })
  newdf
}
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
.drop("key2_country")
.drop("key3_country")
.withColumnRenamed("key1_country", "country")
finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
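With the merged dataframe in place, the per-row function from the question can read all three countries' values from a single Row. A rough sketch, where functionToApply and its summing logic are placeholders for the real computation, a SparkSession named spark is assumed, and the date columns are assumed to be integer-typed:
import spark.implicits._   // assumed: a SparkSession named `spark`
// Placeholder for the real per-row computation.
def functionToApply(v1: Seq[Int], v2: Seq[Int], v3: Seq[Int]): Int =
  v1.sum + v2.sum + v3.sum
val dateCols = Seq("1/22/20", "1/23/20", "1/24/20")
val resultDf = finaldf.map { row =>
  // Read the three per-country value sets from the prefixed columns.
  val v1 = dateCols.map(c => row.getAs[Int](s"key1_$c"))
  val v2 = dateCols.map(c => row.getAs[Int](s"key2_$c"))
  val v3 = dateCols.map(c => row.getAs[Int](s"key3_$c"))
  (row.getAs[String]("country"), functionToApply(v1, v2, v3))
}.toDF("country", "result")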

Rank per row over multiple columns in Spark Dataframe

I am using Spark with Scala to transform a DataFrame, and I would like to compute a new set of variables that give, for each row, the rank of each variable across the columns.
Example -
Input DF-
+---+---+---+
|c_0|c_1|c_2|
+---+---+---+
| 11| 11| 35|
| 22| 12| 66|
| 44| 22| 12|
+---+---+---+
Expected DF-
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 11| 35| 2| 3| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
This has already been answered using R (Rank per row over multiple columns in R),
but I need to do the same in Spark SQL using Scala. Thanks for the help!
Edit (4/1): I encountered a scenario where, if the values are the same, the ranks should still be different. I have edited the first row to replicate the situation.
If I understand correctly, you want to have the rank of each column, within each row.
Let's first define the data, and the columns to "rank".
import org.apache.spark.sql.functions.{array, col, sort_array, udf}
import spark.implicits._   // for toDF, assuming a SparkSession named `spark`
val df = Seq((11, 21, 35), (22, 12, 66), (44, 22, 12))
  .toDF("c_0", "c_1", "c_2")
val cols = df.columns
Then we define a UDF that finds the index of an element in an array.
val pos = udf((a : Seq[Int], elt : Int) => a.indexOf(elt)+1)
We finally create a sorted array (in descending order) and use the UDF to find the rank of each column.
val ranks = cols.map(c => pos(col("array"), col(c)).as(c + "_rank"))
df.withColumn("array", sort_array(array(cols.map(col) : _*), false))
  .select((cols.map(col) ++ ranks) : _*)
  .show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 21| 35| 3| 2| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
EDIT:
As of Spark 2.4, the pos UDF defined above can be replaced by the built-in function array_position(column: Column, value: Any), which works exactly the same way (the first index is 1). This avoids the UDF, which can be slightly less efficient.
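For illustration, a sketch of the Spark 2.4+ variant, with array_position standing in for the pos UDF (same df and cols as above):
import org.apache.spark.sql.functions.{array, array_position, col, sort_array}
// Same idea as above: rank = 1-based position in the descending-sorted array.
val ranks24 = cols.map(c => array_position(col("array"), col(c)).as(c + "_rank"))
df.withColumn("array", sort_array(array(cols.map(col) : _*), asc = false))
  .select(cols.map(col) ++ ranks24 : _*)
  .show()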
EDIT2:
The code above will generate duplicate ranks in case you have duplicate values. If you want to avoid that, you can create the array, zip it to remember which column is which, sort it and zip it again to get the final rank. It would look like this:
val colMap = df.columns.zipWithIndex.map(_.swap).toMap
val zip = udf((s: Seq[Int]) => s
  .zipWithIndex
  .sortBy(-_._1)
  .map(_._2)
  .zipWithIndex
  .toMap
  .mapValues(_ + 1))
val ranks = (0 until cols.size)
  .map(i => 'zip.getItem(i) as colMap(i) + "_rank")
val result = df
  .withColumn("zip", zip(array(cols.map(col) : _*)))
  .select(cols.map(col) ++ ranks :_*)
One way to go about this would be to use windows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}
val df = Seq((11, 21, 35), (22, 12, 66), (44, 22, 12))
  .toDF("c_0", "c_1", "c_2")
(0 to 2)
  .map("c_" + _)
  .foldLeft(df)((d, column) =>
    d.withColumn(column + "_rank", rank() over Window.orderBy(desc(column))))
  .show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 22| 12| 66| 2| 3| 1|
| 11| 21| 35| 3| 2| 2|
| 44| 22| 12| 1| 1| 3|
+---+---+---+--------+--------+--------+
But this is not a good idea. All the data will end up in one partition which will cause an OOM error if all the data does not fit inside one executor.
Another way requires sorting the dataframe three times, but at least it scales to any amount of data.
Let's define a function that zips a dataframe with consecutive indices (this exists for RDDs but not for dataframes):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}
def zipWithIndex(df: DataFrame, name: String): DataFrame = {
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ (i + 1)) }
  val newSchema = df.schema.add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
And let's use it on the same dataframe df:
(0 to 2)
  .map("c_" + _)
  .foldLeft(df)((d, column) =>
    zipWithIndex(d.orderBy(desc(column)), column + "_rank"))
  .show
which provides the exact same result as above.
You could probably create a window function. Do note that this is susceptible to OOM if you have too much data, but I just wanted to introduce the concept of window functions here.
inputDF.createOrReplaceTempView("my_df")
val expectedDF = spark.sql("""
  select
      c_0
    , c_1
    , c_2
    , rank() over (order by c_0 desc) as c_0_rank
    , rank() over (order by c_1 desc) as c_1_rank
    , rank() over (order by c_2 desc) as c_2_rank
  from my_df""")
expectedDF.show()
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 44| 22| 12|       1|       1|       3|
| 11| 21| 35|       3|       2|       2|
| 22| 12| 66|       2|       3|       1|
+---+---+---+--------+--------+--------+

Spark withColumn working for modifying column but not adding a new one

Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+----+----+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+----+----+-----------+----+
|   2|   5| 1440370637| 128|
|   2|   5| 2114144780|1352|
|   2|   8|  199559784|3233|
|   2|   5| 1522258372| 895|
|   2|   9|  918480276| 882|
+----+----+-----------+----+
And now:
+----+----+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+----+----+-----------+-----+
|   2|   5| 1440370637| 1280|
|   2|   5| 2114144780|13520|
|   2|   8|  199559784|32330|
|   2|   5| 1522258372| 8950|
|   2|   9|  918480276| 8820|
+----+----+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying it by 10.
However, the second withColumn fails; it just adds the current date to every row as a new lastRanOn column.
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should not work either.
withColumn does not update the DataFrame; it creates a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only after reassigning the new DataFrame reference will you see the new column added.
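The same fix also works without a var, by capturing each returned DataFrame in a new val; a minimal sketch:
// Each withColumn returns a new DataFrame; capture every intermediate result.
val ranked  = myDf.withColumn("rank", myDf("rank") * 10)
val stamped = ranked.withColumn("lastRanOn", current_date())
stamped.show(5)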

Spark Dataframe - Method to take row as input & dataframe as output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
  List("2016-10-31", "1000000", "1000", "0", "1"),
  List("2016-12-01", "100000", "950", "1", "1"),
  List("2017-01-01", "50000", "50", "2", "1"),
  List("2017-03-01", "50000", "100", "3", "1"),
  List("2017-03-30", "80000", "300", "4", "1")
).map(row => (row(0), row(1), row(2), row(3), row(4))).toDF(df1Columns: _*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
  List("2017-02-01", "0", "400")
).map(row => (row(0), row(1), row(2))).toDF(df2Columns: _*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date value from each row of DF2.
For example, the first row of df2 has Eftv_Date = 2017-02-01, so I need to filter df1 to the records whose Eftv_Date is less than or equal to 2017-02-01. This will produce the 3 records below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using the map function.
def transformRows(row: Row) = {
  val dateEffective = row.getAs[String]("Eftv_Date")
  val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
  df1 = df1LayerMet
  df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: We can implement this using a join, but I need to implement a custom Scala method to do it, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
Seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
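If the join really has to be avoided, one possible sketch is to collect the (presumably small) df2 dates onto the driver and call an ordinary Scala method per row; transformRow is a hypothetical name here, and this assumes df2 comfortably fits in driver memory:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Hypothetical helper: one filtered DataFrame per effective date.
// The string comparison works here because the dates are ISO-formatted.
def transformRow(dateEffective: String): DataFrame =
  df1.where(col("Eftv_Date") <= dateEffective)
// Collect df2's dates on the driver, then apply the method to each one.
val perRowResults: Array[DataFrame] =
  df2.select("Eftv_Date").collect().map(r => transformRow(r.getAs[String]("Eftv_Date")))
perRowResults.foreach(_.show())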

I want to convert all my existing UDTFs in Hive to Scala functions and use it from Spark SQL

Can anyone give me an example UDTF (e.g. explode) written in Scala that returns multiple rows, and show how to use it as a UDF in Spark SQL?
Table: table1
+------+----------+----------+
|userId|someString| varA|
+------+----------+----------+
| 1| example1| [0, 2, 5]|
| 2| example2|[1, 20, 5]|
+------+----------+----------+
I'd like to create the following Scala code:
def exampleUDTF(input: Seq[Int]) = <Return Type???> {
  // code to explode the varA field ???
}
sqlContext.udf.register("exampleUDTF", exampleUDTF _)
sqlContext.sql("FROM table1 SELECT userId, someString, exampleUDTF(varA)").collect().foreach(println)
Expected output:
+------+----------+----+
|userId|someString|varA|
+------+----------+----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+----+
You can't do this with a UDF. A UDF can only add a single column to a DataFrame. There is, however, a function called DataFrame.explode, which you can use instead. To do it with your example, you would do this:
import org.apache.spark.sql._
val df = Seq(
  (1, "example1", Array(0, 2, 5)),
  (2, "example2", Array(1, 20, 5))
).toDF("userId", "someString", "varA")
val explodedDf = df.explode($"varA") {
  case Row(arr: Seq[Int]) => arr.toArray.map(a => Tuple1(a))
}.drop($"varA").withColumnRenamed("_1", "varA")
explodedDf.show
+------+----------+-----+
|userId|someString| varA|
+------+----------+-----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+-----+
Note that explode takes a function as an argument. So even though you can't create a UDF to do what you want, you can create a function to pass to explode to do what you want. Like this:
def exploder(row: Row): Seq[Tuple1[Int]] = {
  row match { case Row(arr: Seq[Int]) => arr.map(v => Tuple1(v)) }
}
df.explode($"varA")(exploder)
That's about the best you are going to get in terms of recreating a UDTF.
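As a side note, in Spark 2.0 and later the built-in explode function from org.apache.spark.sql.functions covers this particular case directly, without DataFrame.explode or a UDTF; a minimal sketch using the same df as above:
import org.apache.spark.sql.functions.explode
// explode generates one output row per array element.
df.select($"userId", $"someString", explode($"varA").as("varA")).show()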
Hive Table:
name id
["Subhajit Sen","Binoy Mondal","Shantanu Dutta"] 15
["Gobinathan SP","Harsh Gupta","Rahul Anand"] 16
Creating a Scala function:
def toUpper(name: Seq[String]) = name.map(a => a.toUpperCase)
Registering the function as a UDF:
sqlContext.udf.register("toUpper", toUpper _)
Calling the UDF from sqlContext and storing the output as a DataFrame:
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name")
Exploding the DataFrame:
df.explode(df("Name")) { case org.apache.spark.sql.Row(arr: Seq[String]) => arr.map(v => Tuple1(v)) }
  .drop(df("Name")).withColumnRenamed("_1", "Name").show
Result:
+--------------+
| Name|
+--------------+
| SUBHAJIT SEN|
| BINOY MONDAL|
|SHANTANU DUTTA|
| GOBINATHAN SP|
| HARSH GUPTA|
| RAHUL ANAND|
+--------------+