How can I join a list of Spark dataframes together in Scala?

I have a Seq of Spark dataframes (i.e. Seq[org.apache.spark.sql.DataFrame]) that could contain one or many elements.
A list of columns is common to all of those dataframes, and each dataframe also has some additional columns. I would like to join all of the dataframes together using the common columns in the join condition (remember, the number of dataframes is unknown).
How can I join all these dataframes together? I guess I could foreach over them, but that doesn't seem very elegant. Can anyone come up with a more functional way of doing it? Edit: a recursive function would be better than a foreach; I'm working on that now and will post it here when done.
Here is some code that creates a list of n dataframes (n=3 in this case), each of which contains columns id & Product:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val conf = new SparkConf().setMaster("local[*]")
val spark = SparkSession.builder().appName("Feature Generator tests").config(conf).config("spark.sql.warehouse.dir", "/tmp/hive").enableHiveSupport().getOrCreate()
val df = spark.range(0, 1000).toDF().withColumn("Product", concat(lit("product"), col("id")))
val dataFrames = Seq(1,2,3).map(s => df.withColumn("_" + s.toString, lit(s)))
To clarify, dataFrames.head.columns returns Array[String] = Array(id, Product, _1).
How might I join those n dataframes together on columns id & Product so that the returned dataframe has columns Array[String] = Array(id, Product, _1, _2, _3)?

dataFrames is a List, so you can use its reduce method to join all the data frames in it. Note that reduce throws on an empty collection, so this assumes at least one element:
dataFrames.reduce(_.join(_, Seq("id", "Product"))).show
//+---+---------+---+---+---+
//| id| Product| _1| _2| _3|
//+---+---------+---+---+---+
//| 0| product0| 1| 2| 3|
//| 1| product1| 1| 2| 3|
//| 2| product2| 1| 2| 3|
//| 3| product3| 1| 2| 3|
//| 4| product4| 1| 2| 3|
//| ... more rows
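If the Seq could ever be empty, a defensive variant of the reduce approach might use reduceOption instead. This is only a sketch: JoinAll and joinAll are hypothetical names, not part of Spark, and the join is the default inner join on the shared key columns.

```scala
import org.apache.spark.sql.DataFrame

object JoinAll {
  // Hypothetical helper (not a Spark API): joins all DataFrames on the shared key columns.
  // Returns None for an empty Seq instead of throwing, as a plain reduce would.
  def joinAll(dfs: Seq[DataFrame], keys: Seq[String]): Option[DataFrame] =
    dfs.reduceOption((left, right) => left.join(right, keys))
}
```

With the dataFrames value from the question, JoinAll.joinAll(dataFrames, Seq("id", "Product")).foreach(_.show()) would print the joined result, or nothing if the Seq were empty.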

Related

Selecting specific rows from different dataframes within a map scope

Hello, I am new to Spark and Scala, and I have three similar dataframes like the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
| Chad| 1| 0| 5|
+--------+-------+-------+-------+
|Paraguay| 4| 6| 3|
+--------+-------+-------+-------+
| Russia| 0| 0| 1|
+--------+-------+-------+-------+
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1 but I also need to select the same row (using the Country as key) from the other two dataframes because I need the selected rows as input arguments for the function I want to apply.
I thought of using
df1.map{ r =>
val selectedRowDf2 = selectRow using r at column "Country" ...
val selectedRowDf3 = selectRow using r at column "Country" ...
r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3)
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach is to prefix the column names of each dataframe with a key that uniquely identifies them, and then merge all the dataframes into a single dataframe on the country column. The desired operation can then be performed on each row of the merged dataframe.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Prefix every column name with `key` so the columns stay distinguishable after the join.
def appendColWithKey(df: DataFrame, key: String): DataFrame =
  df.columns.foldLeft(df) { (acc, name) =>
    acc.withColumnRenamed(name, s"$key$name")
  }
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
.drop("key2_country")
.drop("key3_country")
.withColumnRenamed("key1_country", "country")
finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
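Once the rows are merged, the per-row operation can often be written as a column expression rather than a map over rows. The sketch below is only an illustration: the question does not say what functionToApply does, so a plain sum stands in for it, and MergedRowOps and withRowSum are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object MergedRowOps {
  // Hypothetical helper: adds `outCol` holding the sum of `inputCols` for each merged row.
  // The sum is a stand-in for the real per-row function; `inputCols` must be non-empty.
  def withRowSum(merged: DataFrame, inputCols: Seq[String], outCol: String): DataFrame =
    merged.withColumn(outCol, inputCols.map(col).reduce(_ + _))
}
```

For the merged dataframe above, the input columns would be the three prefixed versions of one date column, one from each source dataframe.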

Remove all records which are duplicate in spark dataframe

I have a Spark dataframe with multiple columns. I want to find and remove the rows that have duplicated values in one column (the other columns can differ).
I tried dropDuplicates(col_name), but it only drops the extra copies and still keeps one record per value. What I need is to remove every row whose value in that column occurs more than once.
I am using Spark 1.6 and Scala 2.10.
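I would use window functions for this. Let's say you want to remove rows with a duplicated id:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
// The $"..." syntax needs sqlContext.implicits._ (spark.implicits._ on Spark 2+) in scope.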
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()
This can be done by grouping by the column (or columns) to check for duplicates, then aggregating and filtering the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done with a join. It will be slower, but if there are many columns you avoid writing first($"num").as("num") for each one you want to keep.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
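The join variant generalizes to several key columns. A minimal sketch, assuming none of the key columns is itself named count; UniqueRows and keepUnique are hypothetical names. Rows with a null key are dropped by the inner join, which may or may not be what you want.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object UniqueRows {
  // Hypothetical helper: keeps only the rows whose combination of `keys` occurs exactly once.
  def keepUnique(df: DataFrame, keys: Seq[String]): DataFrame = {
    val keyCols = keys.map(col)
    // Key combinations that appear exactly once in df.
    val uniqueKeys = df.groupBy(keyCols: _*)
      .count()
      .filter(col("count") === 1)
      .select(keyCols: _*)
    // Joining back restores all the other columns without aggregating them.
    df.join(uniqueKeys, keys, "inner")
  }
}
```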
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
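import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}
object DataFrameExt {
  implicit class DataFrameMethods(df: DataFrame) {
    // Keeps only the rows whose values in `cols` occur exactly once.
    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }
  }
}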
You might want to leverage the spark-daria library to keep this logic out of your codebase.

Spark Dataframe - Method to take row as input & dataframe has output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs, DF1 and DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
List("2016-10-31","1000000","1000","0","1"),
List("2016-12-01","100000","950","1","1"),
List("2017-01-01","50000","50","2","1"),
List("2017-03-01","50000","100","3","1"),
List("2017-03-30","80000","300","4","1")
)
.map(row =>(row(0), row(1),row(2),row(3),row(4))).toDF(df1Columns:_*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
List("2017-02-01","0","400")
).map(row =>(row(0), row(1),row(2))).toDF(df2Columns:_*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date value of each row of DF2.
For example, the first row of df2 has Eftv_Date = 2017-02-01, so I need to filter df1 down to the records whose Eftv_Date is less than or equal to 2017-02-01. This yields the 3 records below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using map function.
def transformRows(row: Row ) = {
val dateEffective = row.getAs[String]("Eftv_Date")
val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
df1 = df1LayerMet
df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: this could be implemented with a join, but I need a custom Scala method because many more transformations are involved. For simplicity I have shown only one condition.
It seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
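The encoder error in the question arises because df2.map would have to return a Dataset of DataFrames, and there is no Encoder for DataFrame; a DataFrame also cannot be used inside a function that runs on the executors. If a custom per-row method really is required, one option is to collect the rows of df2 to the driver first. This is a sketch that assumes df2 is small enough to collect and that the dates are yyyy-MM-dd strings, so string comparison matches date order; FilterPerRow and filterPerRow are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object FilterPerRow {
  // Hypothetical helper: for each row of the (small) df2, returns df1 filtered by that row's date.
  // The loop runs on the driver; each resulting DataFrame is still evaluated lazily by Spark.
  def filterPerRow(df1: DataFrame, df2: DataFrame): Seq[DataFrame] =
    df2.collect().toSeq.map { row =>
      val dateEffective = row.getAs[String]("Eftv_Date")
      df1.where(col("Eftv_Date") <= dateEffective)
    }
}
```

Further transformations can be chained inside the block in place of the single where condition.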

rearrange order of spark columns

I have a Spark dataframe with many columns. Using Spark and Scala, I would like to select the columns in a specified order, but I don't want to hardcode the desired order. In pseudo-code, I'd like to do something like:
val colNames = df.columns
val newOrder = colNames(colNames.length) ++ colNames(0:colNames.length-1)
df.select(newOrder)
How can I do this? Thanks!
You can do something like this:
val df = Seq((1,2,3)).toDF("A","B","C")
df.select(df.columns.last, df.columns.dropRight(1): _*).show
+---+---+---+
| C| A| B|
+---+---+---+
| 3| 1| 2|
+---+---+---+
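The same idea can be wrapped in a small reusable function that builds the new order from df.columns. A minimal sketch, assuming the dataframe has at least one column; ColumnOrder and lastColumnFirst are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ColumnOrder {
  // Hypothetical helper: moves the last column to the front and keeps the rest in order.
  def lastColumnFirst(df: DataFrame): DataFrame = {
    val names = df.columns.toSeq
    val newOrder = names.last +: names.init
    df.select(newOrder.map(col): _*)
  }
}
```

Any other ordering works the same way: compute a Seq of column names and pass it to select.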

Scala: Any better way to join two DataFrames by the relationship from the third one

I have two DataFrames and want to outer join them, but the join mapping is in a third DataFrame.
My current approach, shown below, works, but I hope there is a more efficient way because I have more than 1,000,000 rows.
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
scala> ta.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
scala> tb.show
+---+---+
| C| D|
+---+---+
| 2| 1|
+---+---+
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
scala> tc.show
+---+---+---+
| D| E| F|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
scala> val tmp = ta.join(tb, Seq("C"), "left_outer")
tmp: org.apache.spark.sql.DataFrame = [C: int, A: int, B: int, D: int]
scala> tmp.show
+---+---+---+----+
| C| A| B| D|
+---+---+---+----+
| 1| 1| 1|null|
| 2| 2| 2| 1|
+---+---+---+----+
scala> tmp.join(tc, Seq("D"), "outer").show
+----+----+----+----+----+----+
| D| C| A| B| E| F|
+----+----+----+----+----+----+
|null| 1| 1| 1|null|null|
| 1| 2| 2| 2| 1| 1|
| 2|null|null|null| 2| 2|
+----+----+----+----+----+----+
As Umberto noted, a good reference on how to improve performance of your joins is Holden Karau and Rachel Warren's High Performance Spark > Chapter 4. Joins (SQL & Core).
From the standpoint of your code, running it as you noted or the SQL equivalent (as noted below) should result in about the same performance.
// Create initial tables
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
// Register temporary views
ta.createOrReplaceTempView("ta")
tb.createOrReplaceTempView("tb")
tc.createOrReplaceTempView("tc")
// SQL query (triple quotes let the string span multiple lines)
spark.sql("""
  select tc.D, ta.A, ta.B, ta.C, tc.E, tc.F
  from ta
  left outer join tb
    on tb.C = ta.C
  full outer join tc
    on tc.D = tb.D
""")
The reason is that the Spark SQL Catalyst Optimizer takes the DataFrame query and builds an optimized logical plan. A number of physical plans are then developed, and the Spark SQL engine's cost optimizer chooses the best physical plan and generates the code that produces the RDDs.
That said, the key concern when you work with many rows that use a lot of memory is partitioning. For example, if you can ensure that the mapping DataFrame (tb) has the same or a similar partitioning scheme as the other DataFrames (ta, tc), you can get a co-located join (this is Figure 4-3 in High Performance Spark > Chapter 4. Joins).
If your three DataFrames (ta, tb, tc) are all partitioned differently, their keys will not have a one-to-one match between partitions. This results in a shuffle join (Figure 4-2 in High Performance Spark > Chapter 4. Joins), which can be more costly.
Basically, for your query the concern is less about the query itself and more about the partitioning schemes of your DataFrames. But before experimenting too much with partitioning, run your queries to see whether the default Spark SQL / DataFrame execution already handles the partitioning well enough by itself.
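As one way to experiment, the two joins can be written with an explicit broadcast hint and explicit repartitioning on the join key, and the resulting plan inspected with explain(). This is only a sketch under assumptions the question does not confirm: that tb is small enough to broadcast, and that repartitioning on D is worth the extra step. MappingJoin and joinViaMapping are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col}

object MappingJoin {
  // Hypothetical helper mirroring the two joins from the question.
  def joinViaMapping(ta: DataFrame, tb: DataFrame, tc: DataFrame): DataFrame = {
    // Assumption: tb is small, so hint Spark to broadcast it instead of shuffling ta.
    val tmp = ta.join(broadcast(tb), Seq("C"), "left_outer")
    // Partition both sides on the join key D so matching keys share a partitioning scheme.
    tmp.repartition(col("D")).join(tc.repartition(col("D")), Seq("D"), "outer")
  }
}
```

Calling MappingJoin.joinViaMapping(ta, tb, tc).explain() prints the physical plan, where the chosen join strategies and any Exchange (shuffle) steps are visible. Spark normally inserts the required shuffle on its own, so explicit repartitioning tends to pay off mainly when the repartitioned DataFrame is cached and reused across several joins.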