Functional Programming in Spark/Scala

I am learning more about Scala and Spark but have become stuck on how to structure a function that takes two tables as input. My goal is to condense my code and make more use of functions. Specifically, I am unsure how to structure a function over two tables that I intend to join. My code without a function looks like this:
val spark = SparkSession
.builder()
.master("local[*]")
.appName("XX1")
.getOrCreate()
val df1 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/YYY/Documents/YYY.csv")
// df1: org.apache.spark.sql.DataFrame = [customerID: int, StoreID: int, FirstName: string, Surname: string, dateofbirth: int]
val df2 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/XXX/Documents/XXX.csv")
df1.printSchema()
df1.createOrReplaceTempView("customerinfo")
df2.createOrReplaceTempView("customerorders")
def innerjoinA(df1: DataFrame, df2:Dataframe): Array[String]={
val innerjoindf= df1.join(df2,"customerId")
}
innerjoin().show()
}
My question is: how do I properly define the function innerjoinA (and why), and how do I call it later in the program? More broadly, what else in this example could be written as a function?

You could do something like this.
Create one function that builds the SparkSession and another that reads a CSV file. If other programs call these functions as well, put them in a separate file.
The join alone does not need a function. However, wrapping it in one with a meaningful name can make the business flow easier to follow.
import org.apache.spark.sql.{DataFrame, SparkSession}
def getSparkSession(): SparkSession = {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("XX1")
.getOrCreate()
spark
}
def readCSV(filePath: String): DataFrame = {
val df = getSparkSession().sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load(filePath)
df
}
def getCustomerDetails(customer: DataFrame, details: DataFrame) : DataFrame = {
customer.join(details,"customerId")
}
val xxxDF = readCSV("C:/Users/XXX/Documents/XXX.csv")
val yyyDF = readCSV("C:/Users/XXX/Documents/YYY.csv")
getCustomerDetails(xxxDF, yyyDF).show()

The basic premise of grouping complex transformations and joins into methods is sound. Only you know whether a dedicated innerJoin method makes sense in your use case.
I usually define them as extension methods so that I can chain them one after another.
// Declare this as an object to import its members, or as a trait to mix it in
object DataFrameExtensions {
implicit class JoinDataFrameExtensions(df:DataFrame){
def innerJoin(df2:DataFrame):DataFrame = df.join(df2, Seq("ColumnName"))
}
}
And then later on in the code import/mixin the methods I want and call them on the DataFrame.
originalDataFrame.innerJoin(toBeJoinedDataFrame).show()
I prefer extension methods but you can also just declare a method DataFrame => DataFrame and use it in the .transform method already defined on the Dataset API.
def innerJoin(df2:DataFrame)(df1:DataFrame):DataFrame = df1.join(df2, Seq("ColumnName"))
val join = innerJoin(toBeJoinedDataFrame) _
originalDataFrame.transform(join).show()
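The transform style pays off when several steps are chained. The following is a minimal sketch, assuming a local SparkSession and two small in-memory DataFrames in place of the CSV files. The names withOrders and onlyCustomer and the sample columns are hypothetical and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("transform-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the two CSV files
val customers = Seq((1, "Ann"), (2, "Bob")).toDF("customerId", "firstName")
val orders = Seq((1, 9.5), (1, 3.0), (3, 7.25)).toDF("customerId", "amount")

// Each step takes its configuration first and the DataFrame last,
// so a partially applied step has the shape DataFrame => DataFrame.
def withOrders(orders: DataFrame)(customers: DataFrame): DataFrame =
  customers.join(orders, Seq("customerId"))

def onlyCustomer(id: Int)(df: DataFrame): DataFrame =
  df.filter($"customerId" === id)

// Steps read top to bottom in the order they are applied
val result = customers
  .transform(withOrders(orders))
  .transform(onlyCustomer(1))

result.show()
```

Putting the DataFrame in the last parameter list is what lets each step be passed straight to transform without a wrapper lambda.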

Related

How to create a Dataset from a csv which doesn't have a header and has more than 150 columns using scala spark

I've a csv which I need to read as Dataset. The csv is having 140 columns and it doesn't have a header.
I created a schema with StructType(Seq(StructField(...), StructField(...), ...)), and the code to read the file is as follows:
object dataParser {
def getData(inputPath: String, delimeter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
val parsedData: Dataset[MyCaseClass] = spark.read
.option("header", "false")
.option("delimeter", "delimeter")
.option("inferSchema", "true")
.schema(mySchema)
.load(inputPath)
.as[MyCaseClass]
parsedData
}
}
And the case class I created is like:-
case class MycaseClass(
mycaseClass1: MyCaseClass1,
mycaseClass2: MyCaseClass2,
mycaseClass3: MyCaseClass3,
mycaseClass4: MyCaseClass4,
mycaseClass5: MyCaseClass5,
mycaseClass6: MyCaseClass6,
mycaseClass7: MyCaseClass7,
)
MyCaseClass1(
first 20 columns of csv: it's datatypes
)
MyCaseClass2(
next 20 columns of csv: it's datatypes
)
and so on.
But when I'm trying to compile it, it gives me an error as below:-
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .as[myCaseClass]
I'm calling this from my Scala App as :-
object MyTestApp{
def main(args: Array[String]): Unit ={
implicit val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
run(args)
}
def run(args: Array[String])(implicit spark: SparkSession): Unit = {
val inputPath = args.get("inputData")
val delimeter = Constants.delimeter
val myData = Dataparser.getData(inputPath, delimeter)
}
}
I'm also not sure about the approach, as I'm new to Datasets.
I saw multiple answers about this issue, but they mostly dealt with a small number of columns that fit in a single case class, and with a header, which makes things a little simpler.
Any help would be really appreciated.
Thanks to all the viewers. I found the issue, and I am posting the answer here so that others who come across it can resolve it too.
I needed to import spark.implicits._ inside getData:
object dataParser {
def getData(inputPath: String, delimeter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
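// Brings the implicit Encoder[MyCaseClass] into scope for .as[MyCaseClass] below
import spark.implicits._
val parsedData: Dataset[MyCaseClass] = spark.read
.format("csv") // without an explicit format, load() falls back to the default source (Parquet)
.option("header", "false")
.option("delimiter", delimeter) // the option key is "delimiter"; pass the parameter, not a string literal
.option("inferSchema", "true")
.schema(mySchema)
.load(inputPath)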
.as[MyCaseClass]
parsedData
}
}

I don't know how to do the same using parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val tempDF=spark.read.csv("data.csv")
tempDF.coalesce(1).write.parquet("Parquet")
val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised Parquet file and then loaded it. Now I want to apply all the transformations to the Parquet file, just as I did on the CSV file below, and then save the result as a Parquet file.
val header = rdd.first
val rdd1 = rdd.filter(_ != header)
val resultRDD = rdd1.map { r =>
val Array(country, values) = r.split(",")
country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))
import spark.sqlContext.implicits._
val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
dataSet.coalesce(1).write.option("header","true").csv("output")
}
case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements of the arrays for each country. I have done this with the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"col" syntax outside spark-shell
val df = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("path", "/path/to/input/data.csv")
.load()
val df1 = df.select(
$"Country",
(split($"Values", ";"))(0).alias("c1"),
(split($"Values", ";"))(1).alias("c2"),
(split($"Values", ";"))(2).alias("c3"),
(split($"Values", ";"))(3).alias("c4"),
(split($"Values", ";"))(4).alias("c5")
)
.groupBy($"Country")
.agg(
sum($"c1" cast "int").alias("s1"),
sum($"c2" cast "int").alias("s2"),
sum($"c3" cast "int").alias("s3"),
sum($"c4" cast "int").alias("s4"),
sum($"c5" cast "int").alias("s5")
)
.select(
$"Country",
concat(
$"s1", lit(";"),
$"s2", lit(";"),
$"s3", lit(";"),
$"s4", lit(";"),
$"s5"
).alias("Values")
)
df1.repartition(1)
.write
.format("csv")
.option("delimiter",",")
.option("header", "true")
.option("path", "/path/to/output")
.save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to Parquet, ORC, or anything else you wish.
I repartitioned df1 into a single partition only so that you get a single output file. Whether to repartition depends on your use case.
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as Parquet, as you have in your question.
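The per-country summation itself does not depend on the file format, so it can be pulled out into a small pure function and reused on the RDD obtained from the Parquet file. Below is a minimal sketch in plain Scala. The name sumValues and the sample rows are hypothetical, and the local groupBy only stands in for reduceByKey so that the logic can be checked without a cluster.

```scala
// Adds two ";"-separated lists of integers position by position.
// zip stops at the shorter list, so extra trailing values are dropped.
def sumValues(a: String, b: String): String =
  a.split(";").zip(b.split(";"))
    .map { case (x, y) => x.trim.toInt + y.trim.toInt }
    .mkString(";")

// Hypothetical sample rows in the (country, values) shape used in the question
val rows = Seq(
  "India" -> "1;2;3",
  "China" -> "4;5;6",
  "India" -> "10;20;30"
)

// Local stand-in for rdd.reduceByKey(sumValues): group by key, then fold the values
val totals: Map[String, String] =
  rows.groupBy(_._1).map { case (country, pairs) =>
    country -> pairs.map(_._2).reduce((x, y) => sumValues(x, y))
  }

totals.foreach { case (country, values) => println(s"$country,$values") }
```

On the real data the same function could be handed to reduceByKey after mapping each Row of parquetFileDF.rdd to a (country, values) pair.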

Unable to filter CSV columns stored in dataframe in spark 2.2.0

I am reading a CSV file from my local machine using Spark and Scala and storing it in a DataFrame (called df). I need to select only a few columns from df, give them new alias names, and save the result to a new DataFrame, newDf. I tried to do this but I get the error below.
main" org.apache.spark.sql.AnalysisException: cannot resolve '`history_temp.time`' given input columns: [history_temp.time, history_temp.poc]
Below is the code written to read the csv file from my local machine.
import org.apache.spark.sql.SparkSession
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
df.show(5) // Till this code is working fine
val newDf = df.select("history_temp.time","history_temp.poc")
Below are the variants I tried, none of which work.
// val newDf = df.select($"history_temp.time",$"history_temp.poc")
// val newDf = df.select("history_temp.time","history_temp.poc")
// val newDf = df.select( df("history_temp.time").as("TIME"))
// val newDf = df.select(df.col("history_temp.time"))
// df.select(df.col("*")) // This is working
newDf.show(10)
}
}
From the looks of it, your column name format is the issue here. I am guessing the columns are just regular StringType, but when a name contains a dot, as in history_temp.time, Spark reads it as a reference to a nested field (time inside history_temp), which is not the case here. I would rename all of the columns, replacing "." with "_"; then the same kind of select should work. You can use foldLeft to replace every "." with "_", as below.
val replacedDF = df.columns.foldLeft(df) { (newdf, colname) =>
newdf.withColumnRenamed(colname, colname.replace(".", "_"))
}
With that done, you can select from replacedDF as below:
val newDf = replacedDF.select("history_temp_time", "history_temp_poc")
Let me know how it works out for you.
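An alternative that avoids renaming is to quote each column name in backticks, which tells Spark to treat the dot as part of the name. The sketch below shows both options side by side. It assumes a local SparkSession and a small in-memory DataFrame standing in for ct_temp.csv, whose contents are not shown in the question. The aliases TIME and POC and the sample values are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dotted-columns-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV: two columns whose names contain a dot
val df = Seq((1L, "a"), (2L, "b")).toDF("history_temp.time", "history_temp.poc")

// Option 1: rename every column, replacing "." with "_", then select as usual
val replacedDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replace(".", "_"))
}
val renamed = replacedDF.select("history_temp_time", "history_temp_poc")

// Option 2: keep the original names and quote them with backticks
val newDf = df.select(
  col("`history_temp.time`").as("TIME"),
  col("`history_temp.poc`").as("POC")
)

newDf.show()
```

Renaming once up front is usually more convenient when many later steps touch these columns, while backticks are enough for a one-off select.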

Task not serializable when iterating through dataframe, scala

Below is my code and when I try to iterate through each row:
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", true) // Use first line of all files as header
.option("delimiter", TILDE)
.option("inferSchema", "true") // Automatically infer data types
.load(fileName._2)
val accGrpCountsIds: DataFrame = df.groupBy("accgrpid").count()
LOGGER.info(s"DataFrame Count - ${accGrpCountsIds.count()}")
accGrpCountsIds.show(3)
//switch based on file names and update the model.
accGrpCountsIds.foreach(accGrpRow => {
val accGrpId = accGrpRow.getLong(0)
val rowCount = accGrpRow.getInt(1)
})
When I try to iterate through the DataFrame above using foreach, I get a Task not serializable error. How can I do this?
Do you reference any other types inside your foreach that you didn't share, or is that all you do and it still doesn't work?
accGrpCountsIds.foreach(accGrpRow => {
val accGrpId = accGrpRow.getLong(0)
val rowCount = accGrpRow.getInt(1)
})
Also, you may find this useful:
Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
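One common cause, and the one the linked question covers, is a closure that refers to a field or method of an enclosing class. The closure then captures the whole instance, and Spark cannot ship it to the executors unless that class is Serializable. The sketch below reproduces the effect without Spark, using plain Java serialization, which is what Spark uses to serialize closures. Model, threshold and isSerializable are hypothetical names, and whether this is the cause in the question cannot be confirmed from the code shown.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Returns true when Java serialization accepts the value
def isSerializable(value: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(value)
    true
  } catch {
    case _: NotSerializableException => false
  }

// Hypothetical driver-side class that is NOT Serializable
class Model {
  val threshold = 10

  // Reads the field directly, so the closure captures `this` (the whole Model)
  def badPredicate: Long => Boolean =
    count => count > threshold

  // Copies the field into a local val first, so only an Int is captured
  def goodPredicate: Long => Boolean = {
    val localThreshold = threshold
    count => count > localThreshold
  }
}

val model = new Model
println(s"badPredicate serializable:  ${isSerializable(model.badPredicate)}")
println(s"goodPredicate serializable: ${isSerializable(model.goodPredicate)}")
```

The same local-copy pattern applies inside a foreach: copy whatever the closure needs into local vals before the call, or move the logic into an object.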

spark dataframe write to file using scala

I am trying to read a file and add two extra columns: 1. a sequence number and 2. the file name.
When I run the Spark job in Scala IDE, the output is generated correctly, but when I run it from PuTTY in local or cluster mode, the job gets stuck at stage 2 (save at File_Process). There is no progress even after waiting for an hour. I am testing on 1 GB of data.
Below is the code I am using:
object File_Process
{
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("yarn")
.appName("File_Process")
.getOrCreate()
def main(arg:Array[String])
{
val FileDF = spark.read
.csv("/data/sourcefile/")
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
val query = dataframefinal.write
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.save("/data/text_file/")
spark.stop()
}
If I remove the logic that adds the sequence number, the code works fine.
The code for creating the sequence number is:
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
Thanks in advance.