I'm coming from a Python background and I'm trying to convert a function over into Scala.
In this dummy example, I have an unknown number of DataFrames that I need to union together.
%python
list_of_dfs = [
    spark.createDataFrame(
        [('A', 'C'),
         ('B', 'E')],
        ['dummy1', 'dummy2']),
    spark.createDataFrame(
        [('F', 'G'),
         ('H', 'I')],
        ['dummy1', 'dummy2'])]

for i, df in enumerate(list_of_dfs):
    if i == 0:
        union_df = df
    else:
        union_df = union_df.unionAll(df)

union_df.display()
This works just how I want it to. The "union_df = union_df.unionAll(df)" line is specifically what I'm having trouble reproducing in Scala.
%scala
// ... outer loop creates each iteration's dataframe
if (i == 0) {
  val union_df = df
} else {
  val union_df = union_df.union(df)
}
I get the error "error: recursive value union_df needs type". I'm having trouble translating the documentation into a solution, because the type is a DataFrame. Obviously I need to actually learn something about Scala, but this is the bridge I'm trying to cross right now. Appreciate any help.
You don't need to manually manage a loop to go through the collection in Scala. Since you're trying to go from many values to one, we can use the reduce method:
val dfs: Iterable[DataFrame] = ???
val union_df = dfs.reduce(_ union _)
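For example, here is a minimal self-contained sketch of the same approach using the two DataFrames from the question (it assumes a notebook-style SparkSession named spark, as in the question):

import org.apache.spark.sql.DataFrame
import spark.implicits._

// Build the same two DataFrames as in the Python example
val dfs: Seq[DataFrame] = Seq(
  Seq(("A", "C"), ("B", "E")).toDF("dummy1", "dummy2"),
  Seq(("F", "G"), ("H", "I")).toDF("dummy1", "dummy2")
)

// Pairwise union of every DataFrame in the collection
val union_df = dfs.reduce(_ union _)
union_df.show()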
I'll accept Jarrod Baker's answer since I'm sure it's more appropriate.
But what ended up working for me was declaring union_df as a var initialized to an empty DataFrame and then doing the unions:
%scala
// ... outer loop creates each iteration's dataframe
var union_df = spark.emptyDataFrame
if (i == 0) {
  union_df = df
} else {
  union_df = union_df.union(df)
}
In the Scala code you have val union_df = union_df.union(df) -> you are defining a new value and referring to it inside its own definition, which is why the compiler complains about a recursive value.
It should be something like this:
// union_df must be declared as a var before the loop for this to compile
if (i == 0) {
  union_df = df
} else {
  union_df = union_df.union(df)
}
The previous answer is better, though: use reduce or foldLeft/foldRight instead.
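If the collection of DataFrames might be empty, a foldLeft variant can be handy; here is a sketch, assuming all DataFrames share the same schema and dfs is the collection from the previous answer:

// Seed foldLeft with the first DataFrame so we never union
// against an empty DataFrame with a mismatched schema.
val union_df: Option[DataFrame] = dfs.headOption.map { first =>
  dfs.tail.foldLeft(first)(_ union _)
}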
Related
In the following code, I expected the compiler to identify that output gets defined either in the if branch or in the else branch.
val df1 = spark.createDataFrame(Seq(
  (1, 10),
  (2, 20)
)).toDF("A", "B")

val df2 = spark.emptyDataFrame

if (df2.isEmpty) {
  val output = df1
} else {
  val output = df2
}

println(output.show)
However, it gives me the error "error: not found: value output". If I do the exact same implementation in Python, it works fine and I get the expected output. In order to make this work in Spark using Scala, I have defined output as a mutable variable and update it inside the if-else:
var output = spark.emptyDataFrame
if (df2.isEmpty) {
  output = df1
} else {
  output = df2
}
println(output.show)
Why doesn't the first implementation work and is there a way to get the expected outcome without using a mutable variable?
I suspect you come from a Python background, where this kind of behavior is allowed.
In Scala this is not possible as is, because the if / else structure creates new blocks, and what is defined in a block is visible only inside that block.
You may fix this by using a mutable variable...
var output: DataFrame = _
if (df2.isEmpty) {
  output = df1
} else {
  output = df2
}
However, this is very Java-like and goes against the immutability principle.
In Scala, a block is an expression and, as such, it can return a value.
Thus, this is the more idiomatic way to solve the problem in Scala.
val output = if(df2.isEmpty) df1 else df2
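The same expression-oriented style scales to more branches too. A quick sketch (df3 here is hypothetical, just to show the shape):

// Hypothetical third DataFrame df3; the whole if/else if/else is one expression
val output: DataFrame =
  if (!df1.isEmpty) df1
  else if (!df2.isEmpty) df2
  else df3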
My Spark application is as follows:
1) execute a large query with Spark SQL into the dataframe "dataDF"
2) for each partition involved in "dataDF":
2.1) get the associated "filtered" dataframe, in order to have only that partition's data
2.2) do specific work with that "filtered" dataframe and write the output
The code is as follows:
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")

for {
  row <- dataDF.dropDuplicates("partition").collect
} yield {
  val partition_str: String = row.getAs[String](0)
  val filtered = dataDF.filter($"partition".equalTo(lit(partition_str)))

  // ... on each partition, do work depending on the partition, and write result on HDFS
  // Example:
  if (partition_str == "category_A") {
    // do group by, do pivot, do mean, ...
    val x = filtered
      .groupBy("column1", "column2")
      ...
    // write final DF
    x.write.parquet("some/path")
  } else if (partition_str == "category_B") {
    // select specific field and apply calculation on it
    val y = filtered.select(...)
    // write final DF
    y.write.parquet("some/path")
  } else if (...) {
    // other kind of calculation
    // write results
  } else {
    // other kind of calculation
    // write results
  }
}
Such an algorithm works successfully. The Spark SQL query is fully distributed. However, the particular work done on each resulting partition is done sequentially, and the result is inefficient, especially because each write related to a partition is done sequentially.
In such a case, what are the ways to replace the "for yield" with something parallel/async?
Thanks
You could use foreachPartition if writing to data stores outside the Hadoop scope, with specific logic needed for that particular environment.
Otherwise map, etc.
.par parallel collections (Scala) - but use these with caution: fine for reading files and pre-processing them, otherwise possibly risky.
Threads.
You need to check what you are doing and whether the operations can be referenced and used within a foreachPartition block, etc. You need to try things out, as some aspects can only be written for the driver and are then distributed to the executors via Spark to the workers. But you cannot, for example, call spark.sql from a worker, as in the first block below.
Likewise, df.write or df.read cannot be used in the below either. What you can do is write individual execute/mutate statements to, say, ORACLE or mySQL.
Hope this helps.
rdd.foreachPartition(iter => {
  while (iter.hasNext) {
    val item = iter.next()
    // do something
    // NOT allowed on executors: spark.sql cannot be called from inside foreachPartition
    spark.sql("INSERT INTO tableX VALUES(2,7, 'CORN', 100, item)")
    // do some other stuff
  }
})
or
RDD.foreachPartition(records => {
  val JDBCDriver = "com.mysql.jdbc.Driver" ...
  ...
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(ConnectionURL, jdbcUsername, jdbcPassword)
  ...
  val mutateStatement = connection.createStatement()
  val queryStatement = connection.createStatement()
  ...
  records.foreach(record => {
    val val1 = record._1
    val val2 = record._2
    ...
    mutateStatement.execute(s"insert into sample (k,v) values(${val1}, ${nIterVal})")
  })
})
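As an illustration of the .par option mentioned above, here is a rough sketch of how the driver-side loop from the question could be run concurrently (assuming Scala 2.12, where .par needs no extra module, the question's spark.implicits._ import for the $ syntax, and a placeholder output path):

// Each iteration still launches distributed Spark jobs; the parallel
// collection only lets several of those jobs be submitted at the same time.
val partitionKeys = dataDF.dropDuplicates("partition").collect()

partitionKeys.par.foreach { row =>
  val partition_str = row.getAs[String](0)
  val filtered = dataDF.filter($"partition" === partition_str)
  // branch on partition_str and do the per-category work, as in the question
  filtered.write.mode("overwrite").parquet(s"some/path/partition=$partition_str")
}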
The following code is causing a java.lang.NullPointerException.
val sqlContext = new SQLContext(sc)

val dataFramePerson = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(CustomSchema1)
  .load("c:\\temp\\test.csv")

val dataFrameAddress = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(CustomSchema2)
  .load("c:\\temp\\test2.csv")

val personData = dataFramePerson.map(data => {
  val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"))
  var address: Address = null
  if (addressData != null) {
    val addressRow = addressData.first
    address = addressRow.asInstanceOf[Address]
  }
  Person(data.getAs("Name"), data.getAs("Phone"), address)
})
I narrowed it down to the following line, which is causing the exception.
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
Can someone point out what the issue is?
Your code has a big structural flaw: you can only refer to DataFrames in code that executes on the driver, not in code that runs on the executors. Your code references another DataFrame from within a map, which is executed on the executors. See this link: Can I use Spark DataFrame inside regular Spark map operation?
val personData = dataFramePerson.map(data => {   // WITHIN A MAP
  val addressData = dataFrameAddress.filter(i => // <--- REFERRING TO OTHER DATAFRAME WITHIN A MAP
    i.getAs("ID") == data.getAs("ID"))
  var address: Address = null
  if (addressData != null) {
What you want to do instead is a left outer join, then do further processing.
dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")
Note also that when using getAs you want to specify the type, like getAs[String]("ID").
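Putting those two points together, a rough sketch of the join-based approach (the column names Name and Phone come from the question; which address columns to keep is an assumption):

// Left outer join keeps every person, with the address columns null when absent
val joined = dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")

// Select what is needed downstream; getAs-style access now carries an explicit type
joined.select("ID", "Name", "Phone" /*, address columns as needed */).show()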
The only thing that can be said is that either dataFrameAddress, or i, or data is null. Use your favorite debugging technique to find out which one actually is, e.g. a debugger, print statements, or logs.
Note that if you see the filter call in the stack trace of your NullPointerException, it would mean that only i or data could be null. On the other hand, if you don't see the filter call, it would rather mean that dataFrameAddress is null.
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path)
val studentdetailsRDD = sc.textFile(file path)
val studentB = sc.broadcast(studentdetailsRDD.collect)

// Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

// Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with RDD API, looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; Using an if without an else means that for some inputs there's no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD
  .map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joins and for writing to files. A join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq(
  (101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")

val df2 = Seq(
  (101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("id", "city", "name")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created, containing a single file which holds the final result.
I am running a query through my HiveContext
Query:
val hiveQuery = s"SELECT post_domain, post_country, post_geo_city, post_geo_region
FROM $database.$table
WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"
val hiveQueryObj:DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area, so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
  // expected Map
  val ex = expected.get(column).get
  val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
  assert(actual.equals(ex))
})
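One way to cut this down to a single collect is a sketch like the following; it assumes the query returns exactly one row, as the .get(0) calls above imply, and that expected is a Map[String, String]:

// Collect the single matching row once instead of once per column
val row = hiveQueryObj.select(arr.head, arr.tail: _*).collect().head

arr.foreach { column =>
  val actual = row.getAs[String](column)
  assert(expected(column) == actual)
}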