Create JOIN condition on variable number of columns in Scala

Suppose I have two data frames that I would like to join on certain columns. The list of join columns can differ depending on the data frames being joined, but we can always count on the fact that the two data frames passed to joinDfs will both contain the column names listed in joinCols.
I am trying to figure out how to build the joinCondition given the assumptions above. Currently it produces (((a.colName1 = b.colName1) AND (a.colName2 = b.colName2)) AND (a.colName3 = b.colName3)), which is not quite what I expect from the INNER JOIN in the example below.
Thank you in advance for helping me, a newbie to Scala and Spark, figure out how to form a proper joinCondition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.types._
def joinDfs(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame = {
  val firstDfAlias: String = "a"
  val secondDfAlias: String = "b"
  // This is what I am trying to figure out and need help with
  val joinCondition = joinCols
    .map(c => col(s"$firstDfAlias.$c") === col(s"$secondDfAlias.$c"))
    .reduce((x, y) => x && y)
  secondDf.as(secondDfAlias).join(
    firstDf.as(firstDfAlias),
    joinCondition,
    "inner"
  ).select(joinCols.map(c => col(s"$firstDfAlias.$c")): _*)
}
// This is an example of data frames that I'm trying to join
// But these data frames can change in terms of number of columns in each
// and data types, etc. The only thing we know for sure is that these
// data frames will contain some or all columns with the same name and
// we will use them to join the two data frames.
val firstDfSchema = StructType(List(
  StructField(name = "colName1", dataType = LongType, nullable = true),
  StructField(name = "colName2", dataType = StringType, nullable = true),
  StructField(name = "colName3", dataType = LongType, nullable = true),
  StructField(name = "market_id", dataType = LongType, nullable = true)
))
val firstDfData = Seq(
  Row(123L, "123", 123L, 123L),
  Row(234L, "234", 234L, 234L),
  Row(345L, "345", 345L, 345L),
  Row(456L, "456", 456L, 456L)
)
val firstDf = spark.createDataFrame(spark.sparkContext.parallelize(firstDfData), firstDfSchema)
val secondDfSchema = StructType(List(
  StructField(name = "colName1", dataType = LongType, nullable = true),
  StructField(name = "colName2", dataType = StringType, nullable = true),
  StructField(name = "colName3", dataType = LongType, nullable = true),
  StructField(name = "num_orders", dataType = LongType, nullable = true)
))
val secondDfData = Seq(
  Row(123L, "123", 123L, 1L),
  Row(234L, "234", 234L, 2L),
  Row(567L, "567", 567L, 3L)
)
val secondDf = spark.createDataFrame(spark.sparkContext.parallelize(secondDfData), secondDfSchema)
// Suppose we are going to join the two data frames above on the following columns
val joinCols: Array[String] = Array("colName1", "colName2", "colName3")
val finalDf = joinDfs(firstDf, secondDf, joinCols)
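For comparison, here is a minimal sketch (reusing the same firstDf, secondDf and joinCols; the helper name joinDfsOnNames is just illustrative) of the join overload that takes the column names directly. Spark builds the equality condition itself and keeps a single copy of each join column in the result:
// join(right, usingColumns, joinType) generates colName1 = colName1 AND ... internally
def joinDfsOnNames(firstDf: DataFrame, secondDf: DataFrame, joinCols: Array[String]): DataFrame =
  firstDf.join(secondDf, joinCols.toSeq, "inner")
val finalDfAlt = joinDfsOnNames(firstDf, secondDf, joinCols)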

Related

Spark Scala join with duplicate columns when using df.* on the final result

When I join two DataFrames using Seq(col), I still get multiple columns when using df.*.
For example:
val schema = StructType(Array(
  StructField("language", StringType, true),
  StructField("users", StringType, true)
))
val rowData = Seq(
  Row("Java", "20000"),
  Row("Python", "100000"),
  Row("Scala", "3000"))
val dfFromData3 = spark.createDataFrame(spark.sparkContext.parallelize(rowData), schema)
val schema1 = StructType(Array(
  StructField("language", StringType, true),
  StructField("price", StringType, true)
))
val rowData1 = Seq(
  Row("Java", "20"),
  Row("Python", "10"))
val dfFromData4 = spark.createDataFrame(spark.sparkContext.parallelize(rowData1), schema1)
val combined = dfFromData3.join(dfFromData4, Seq("language"), "left")
display(combined) - has only one "language" column,
but
display(combined.as("df").select("df.*")) - has two "language" columns.
Can someone please explain what is happening here?
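A quick way to see the difference for yourself (just an inspection sketch; the comments restate the observation above rather than explain the internals) is to compare the column lists of the two projections:
// the joined frame itself exposes a single, merged "language" column
println(combined.columns.mkString(", "))
// re-selecting through the alias with "df.*" brings the duplicate back, as observed above
println(combined.as("df").select("df.*").columns.mkString(", "))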

Converting RDD into Dataframe

I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
Below is the expected schema of the DataFrame:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Can anyone please help me?
I have tried the same by defining a schema class and mapping it against the RDD, but I am getting the error
"ArrayIndexOutOfBoundsException: 3"
If you treat your columns as Strings, you can create the DataFrame as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd: RDD[Row] = ???
val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] for the compiler to let you use the createDataFrame method. For the missing fields you can declare the columns as nullable in the DataFrame schema.
In your example you are using the RDD method spark.sparkContext.textFile(). This method returns an RDD[String], which means that each element of your RDD is a line. But you need an RDD[Row], so you need to split each string by commas, like this:
val list =
  List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "875643,5485254,JHDSFJD543514KJKJ4",
    "545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
    "264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
    "732543,8765984,UJHSG4240323545144", "564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // foldLeft with prepend (::) builds the fields in reverse order,
  // which is why the columns appear reversed in the output below
  val array = Row.fromSeq(arr.foldLeft(List[Any]())((a, b) => b :: a))
  if (array.length == 4)
    array
  else
    // pad lines that have only three fields with an empty string
    Row.fromSeq(array.toSeq :+ "")
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
df.show()
+------------------+------------------+------------+---------+
| userId| EntityId|WebSessionId|ProductId|
+------------------+------------------+------------+---------+
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|JHDSFJD543514KJKJ4| 5485254| 875643| |
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR54545DSKJD541054|DIKFH6545614561456| 5615615| 545456|
|PR5142545564542515|MNXZCBMNABC5645SAD| 3254564| 264264|
|UJHSG4240323545144| 8765984| 732543| |
|KJDXSGFJFS2545DSAS| 6276832| 564574| |
+------------------+------------------+------------+---------+
With the rows RDD you will be able to create the DataFrame.
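If the original left-to-right column order is what you want instead, a minimal variant of the mapping (reusing FilterReadClicks and the schema above; orderedRows and orderedDf are just illustrative names) pads short lines rather than folding in reverse:
val orderedRows: RDD[Row] = FilterReadClicks.map(_.split(",")).map { arr =>
  // keep the fields in their original order and pad missing trailing fields with ""
  Row.fromSeq(arr.toSeq.padTo(4, ""))
}
val orderedDf = spark.createDataFrame(orderedRows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
orderedDf.show()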

Spark error when using except on a dataframe with MapType

I am seeing the error "Cannot have map type columns in DataFrame which calls set operations" when using Spark's MapType.
Below is sample code I wrote to reproduce it. I understand this is happening because MapType columns are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
  StructField("a", MapType(StringType, StringType, true)),
  StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
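One possible workaround, sketched under the assumption that filteredDF really is just df filtered on b as above, is to negate the predicate instead of calling except; another common approach is to serialize the map column, e.g. with to_json (which handles map columns in newer Spark versions), before the set operation:
import org.apache.spark.sql.functions.to_json

// Workaround 1: express the difference as the negated filter
// (note: except also de-duplicates rows, while a plain filter does not)
val otherDF1 = df.filter(!($"b" === "apple") || $"b".isNull)

// Workaround 2: make the map column comparable by serializing it to JSON first
val dfJson = df.withColumn("a", to_json($"a"))
val otherDF2 = dfJson.except(dfJson.filter($"b" === "apple"))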

Converting a Spark's DataFrame column to List[String] in Scala

I am working on the MovieLens data set. In one of the CSV files, the data is structured as:
movieId movieTitle genres
and genres is a list of |-separated values; the field is nullable.
I am trying to get a unique list of all the genres so that I can rearrange the data as follows:
movieId movieTitle genre1 genre2 ... genreN
and a row whose genres value is genre1|genre2 will look like:
1 Title1 1 1 0 ... 0
So far, I have been able to read the csv file using the following code:
val conf = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context = new SparkContext(conf)
val sparkSession = SparkSession.builder()
  .appName(App.name)
  .config("header", "true")
  .config(conf = conf)
  .getOrCreate()
val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)
If I try something like:
movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()
Then I get the following exception:
java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1
Then, in addition, I tried providing the schema explicitly using the following code:
val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))
and tried:
val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)
and then I got the same exception.
Is there any way in which I can get the set of genres as a List or a Set so that I can further massage the data into the desired format? Any help will be appreciated.
Here is how I got the set of genres:
val genreList: Array[String] = for (row <- movieFrame.select("genres").collect) yield row.getString(0)
val genres: Array[String] = for {
  g <- genreList
  genre <- g.split("\\|")
} yield genre
val genreSet: Set[String] = genres.toSet
This worked to give an Array[Array[String]]:
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect()
To get an Array[String]:
val genres = genreLst.flatten
or
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect().flatten
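For reference, a sketch of the same extraction with the DataFrame functions split and explode (reusing movieFrame from above; genreSetDf and genreSet2 are just illustrative names), applying distinct after exploding so each genre appears once:
import org.apache.spark.sql.functions.{col, explode, split}

// one row per genre, then de-duplicate and collect into a Scala Set
val genreSetDf = movieFrame
  .select(explode(split(col("genres"), "\\|")).as("genre"))
  .distinct()
val genreSet2: Set[String] = genreSetDf.collect().map(_.getString(0)).toSet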

how to convert VertexRDD to DataFrame

I have a VertexRDD[DenseVector[Double]] and I want to convert it to a dataframe. I don't understand how to map the values from the DenseVector to new columns in a data frame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD (where the breeze.linalg.DenseVector holds all the values) into an RDD[Row], so that I can finally create a data frame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = sqlContext.createDataFrame(myRDD, schema)
But I get:
scala.MatchError: 20502 (of class java.lang.Long)
Any hint is more than welcome.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}
val rows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id +: v.toArray)
}
val schema = StructType(
  StructField("id", LongType, false) +:
    (1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))
val df = sqlContext.createDataFrame(rows, schema)
Notes:
Declared types have to match the actual types; you cannot declare a string and pass a long or a double.
The structure of the row has to match the declared structure; in your case you're trying to create a row with a Long and a Vector[Double] while declaring 8 columns.
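As a quick sanity check (using the df built above), printing the schema and a few rows should confirm that the eight declared columns line up with the data:
df.printSchema() // id: long, prop1 ... prop7: double
df.show(5)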