Passing case class into function arguments - scala

Sorry for asking a simple question. I want to pass a case class as a function argument and use it further inside the function. So far I have tried this with TypeTag and ClassTag, but for some reason I am unable to use it properly, or maybe I am not looking in the right place.
My use case is something similar to this:
case class infoData(colA:Int,colB:String)
case class someOtherData(col1:String,col2:String,col3:Int)
def readCsv[T:???](path:String,passedCaseClass:???): Dataset[???] = {
sqlContext
.read
.option("header", "true")
.csv(path)
.as[passedCaseClass]
}
It will be called something like this:
val infoDf = readCsv("/src/main/info.csv",infoData)
val otherDf = readCsv("/src/main/someOtherData.csv",someOtherData)

There are two things you should pay attention to:
class names should be in CamelCase, so InfoData;
once you have bound a type to a Dataset, it is no longer a DataFrame. DataFrame is just an alias for a Dataset of the general-purpose Row type.
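In fact, DataFrame is literally a type alias in Spark's sources, so giving the Dataset a case class type only narrows the element type. A small sketch (assuming a SparkSession named spark and the corrected InfoData class shown below):
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._ // brings the Encoder for the case class into scope
// in Spark itself: type DataFrame = Dataset[Row]
val df: DataFrame = spark.read.option("header", "true").csv("/src/main/info.csv")
val ds: Dataset[InfoData] = df.as[InfoData] // once typed, it is a Dataset[InfoData], not a DataFrame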
What you need to ensure is that your provided class has an implicit instance of the corresponding Encoder in the current scope.
case class InfoData(colA: Int, colB: String)
Encoder instances for primitive types (Int, String, etc) and case classes can be obtained by importing spark.implicits._
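For example, a quick way to check that the import really does put an encoder in scope (just a sketch, assuming the SparkSession is called spark):
import spark.implicits._
// the compiler can now summon an encoder for any case class whose fields are supported types
val infoEncoder = implicitly[org.apache.spark.sql.Encoder[InfoData]]
With an encoder available implicitly, the function only needs to ask for it: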
def readCsv[T](path: String)(implicit encoder: Encoder[T]): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Or, you can use a context bound:
def readCsv[T: Encoder](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Now, you can use it as follows:
val spark = ...
import spark.implicits._
def readCsv[T: Encoder](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
val infoDS = readCsv[InfoData]("/src/main/info.csv")
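And, assuming the second case class from the question is renamed to SomeOtherData in the same way, the other file reads identically:
// assumes: case class SomeOtherData(col1: String, col2: String, col3: Int)
val otherDS = readCsv[SomeOtherData]("/src/main/someOtherData.csv")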

First change your function definition to:
object t0 {
def readCsv[T] (path: String)(implicit spark: SparkSession, encoder: Encoder[T]): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
}
You don't need to perform any kind of reflection to create a generic readCsv function. The key here is that Spark needs the encoder at compile time, so you can declare it as an implicit parameter and the compiler will supply it.
Because Spark SQL provides default encoders for product types (your case classes), it is easy to call your function like this:
case class infoData(colA: Int, colB: String)
case class someOtherData(col1: String, col2: String, col3: Int)
object test {
import t0._
implicit val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
readCsv[infoData]("/tmp")
}
Hope it helps

Related

Problem creating dataset in Spark and Scala

I ran into a problem using a Spark Dataset! I keep getting an exception about encoders when I want to use a case class. The code is a simple one, shown below:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
I get this exception during compile:
Unable to find encoder for type OrderDataType. An implicit Encoder[OrderDataType] is needed to store OrderDataType instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already added this: import spark.implicits._ but it doesn't solve the problem!
According to the Spark and Scala documentation, the encoding should be done implicitly by Scala!
What is wrong with this code, and what should I do to fix it?
Define your case class outside of the main method, then inside main read the csv file and convert it to a Dataset.
Example:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
def main(args: Array[String]): Unit = {
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
}
//or
def main(args: Array[String]): Unit = {
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[(String,String,String)]
}
The other way is: you can put everything inside object Orders extends App (which is smart enough to identify the case class, since it is defined outside of any def main).
mydata/Orders.csv
orderId,customerId,orderDate
1,2,21/08/1977
1,2,21/08/1978
Example code :
package examples
import org.apache.log4j.Level
import org.apache.spark.sql._
object Orders extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
case class OrderDataType(orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds1 = spark.read.option("header", "true").csv("mydata/Orders.csv").as[OrderDataType]
ds1.show
}
Result :
+-------+----------+----------+
|orderId|customerId| orderDate|
+-------+----------+----------+
| 1| 2|21/08/1977|
| 1| 2|21/08/1978|
+-------+----------+----------+
Why does the case class go outside of def main?
It seems this is by design of Encoder, per its @implicitNotFound annotation, shown below.
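For reference, the Encoder trait in Spark's source carries an @implicitNotFound annotation along these lines (the message is the one quoted in the compile error above; exact wording and members may differ between Spark versions):
import scala.annotation.implicitNotFound
@implicitNotFound("Unable to find encoder for type ${T}. An implicit Encoder[${T}] is needed to store " +
  "${T} instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are " +
  "supported by importing spark.implicits._ Support for serializing other types will be added in future releases.")
trait Encoder[T] extends Serializable {
  // schema and class-tag members elided
}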

How to create a Dataset from a csv which doesn't have a header and has more than 150 columns using scala spark

I have a csv which I need to read as a Dataset. The csv has 140 columns and it doesn't have a header.
I created a schema with StructType(Seq(StructField(...), StructField(...), ...)) and the code to read it is as follows:
object dataParser {
def getData(inputPath: String, delimeter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
val parsedData: Dataset[MyCaseClass] = spark.read
.option("header", "false")
.option("delimeter", "delimeter")
.option("inferSchema", "true")
.schema(mySchema)
.load(inputPath)
.as[MyCaseClass]
parsedData
}
}
And the case class I created is like:-
case class MyCaseClass(
mycaseClass1: MyCaseClass1,
mycaseClass2: MyCaseClass2,
mycaseClass3: MyCaseClass3,
mycaseClass4: MyCaseClass4,
mycaseClass5: MyCaseClass5,
mycaseClass6: MyCaseClass6,
mycaseClass7: MyCaseClass7,
)
MyCaseClass1(
first 20 columns of csv: their datatypes
)
MyCaseClass2(
next 20 columns of csv: their datatypes
)
and so on.
But when I'm trying to compile it, it gives me an error as below:-
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .as[myCaseClass]
I'm calling this from my Scala App as :-
object MyTestApp{
def main(args: Array[String]): Unit ={
implicit val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
run(args)
}
def run(args: Array[String])(implicit spark: SparkSession): Unit = {
val inputPath = args.get("inputData")
val delimeter = Constants.delimeter
val myData = Dataparser.getData(inputPath, delimeter)
}
}
I'm also not very sure about the approach, as I'm new to Datasets.
I saw multiple answers around this issue, but they were mainly for a very small number of columns which can be contained within a single case class, and with a header, which makes this a little simpler.
Any help would be really appreciated.
Thanks to all the viewers. I actually found the issue. Posting the answer here so that others who come across such an issue will be able to resolve it.
I needed to import spark.implicits._ inside the method:
object dataParser {
def getData(inputPath: String, delimeter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
import spark.implicits._ // <-- this was the missing import
val parsedData: Dataset[MyCaseClass] = spark.read
.option("header", "false")
.option("delimeter", "delimeter")
.option("inferSchema", "true")
.schema(mySchema)
.load(inputPath)
.as[MyCaseClass]
parsedData
}
}

Functional Programming in Spark/Scala

I am learning more about Scala and Spark but have become stuck on how to structure a function when I am using two tables as input. My goal is to condense my code and make more use of functions. I am stuck on how to structure a function that takes the two tables I intend to join. My code without a function looks like:
val spark = SparkSession
.builder()
.master("local[*]")
.appName("XX1")
.getOrCreate()
val df1 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/YYY/Documents/YYY.csv")
// df1: org.apache.spark.sql.DataFrame = [customerID: int, StoreID: int, FirstName: string, Surname: string, dateofbirth: int]
val df2 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/XXX/Documents/XXX.csv")
df1.printSchema()
df1.createOrReplaceTempView("customerinfo")
df2.createOrReplaceTempView("customerorders")
def innerjoinA(df1: DataFrame, df2:Dataframe): Array[String]={
val innerjoindf= df1.join(df2,"customerId")
}
innerjoin().show()
}
My question is: how do I properly define the function innerjoinA (and why?), and how exactly do I call it later in the program? And more broadly, what else could I turn into a function in this example?
You could do something like this.
Create a function to create the SparkSession, and another to read a CSV. You can put these into a different file if they are called by other programs as well.
For the join itself there is no need to create a function; however, you could create one to make the business flow clear and give it a proper name.
import org.apache.spark.sql.{DataFrame, SparkSession}
def getSparkSession(): SparkSession = {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("XX1")
.getOrCreate()
spark
}
def readCSV(filePath: String): DataFrame = {
val df = getSparkSession().sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load(filePath)
df
}
def getCustomerDetails(customer: DataFrame, details: DataFrame) : DataFrame = {
customer.join(details,"customerId")
}
val xxxDF = readCSV("C:/Users/XXX/Documents/XXX.csv")
val yyyDF = readCSV("C:/Users/XXX/Documents/YYY.csv")
getCustomerDetails(xxxDF, yyyDF).show()
The basic premise of grouping complex transformations and joins into methods is sound. Only you know whether a special innerJoin method makes sense in your use case.
I usually define them as extension methods so I can chain them one after another.
object DataFrameExtensions { // could also be a trait if you prefer to mix it in
implicit class JoinDataFrameExtensions(df:DataFrame){
def innerJoin(df2:DataFrame):DataFrame = df.join(df2, Seq("ColumnName"))
}
}
And then later on in the code import/mixin the methods I want and call them on the DataFrame.
originalDataFrame.innerJoin(toBeJoinedDataFrame).show()
I prefer extension methods but you can also just declare a method DataFrame => DataFrame and use it in the .transform method already defined on the Dataset API.
def innerJoin(df2:DataFrame)(df1:DataFrame):DataFrame = df1.join(df2, Seq("ColumnName"))
val join = innerJoin(tobeJoinedDataFrame) _
originalDataFrame.transform(join).show()

How to add a new method to DataFrame type?

Imagine I have this Scala function that operates upon a Spark dataframe:
class MyClass {
def makeColumnNull(df: DataFrame, columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
I call it like so:
val df = spark.range(0,10).toDF()
val df2 = MyClass.makeColumnNull(df, "id")
That works fine however it doesn't work in the same fluent manner as Spark's API. What I'd like to is rewrite my function in a way that enables me to do this:
val df2 = df.makeColumnNull("id")
Can anyone help?
Implicit classes are the way to go; I've used them to extend several Spark classes. So you need this:
package com.mycompany.utils.spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
object DataFrameExtensions {
implicit class DataFrameWrapper(df: DataFrame) {
def makeColumnNull(columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
}
Then you have to import com.mycompany.utils.spark.DataFrameExtensions._ and you will be able to invoke makeColumnNull() against any DataFrame object.
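A quick usage sketch (reusing the example from the question, and assuming a SparkSession named spark is in scope):
import com.mycompany.utils.spark.DataFrameExtensions._
val df = spark.range(0, 10).toDF()
val df2 = df.makeColumnNull("id") // the extension method is now available on any DataFrame
df2.show()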

Apache Spark - Generic method for loading csv data to dataset

I would like to write a generic method with three input parameters:
filePath - String
schema - ?
case class
So, my idea is to write a method like this:
def load_sms_ds(filePath: String, schemaInfo: ?, cc: ?) = {
val ds = spark.read
.format("csv")
.option("header", "true")
.schema(?)
.option("delimiter",",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
.load(filePath)
.as[?]
ds
}
and to return a Dataset depending on the input parameters. I am not sure, though, what types the parameters schemaInfo and cc should be.
First of all, I would recommend reading the Spark SQL programming guide. It contains some things that I think will generally help you as you learn Spark.
Let's run through the process of reading in a csv file using a case class to define the schema.
First add the various imports needed for this example:
import java.io.{File, PrintWriter} // for reading / writing the example data
import org.apache.spark.sql.types.StructType // to define the schema
import org.apache.spark.sql.catalyst.ScalaReflection // used to generate the schema from a case class
import scala.reflect.runtime.universe.TypeTag // used to provide type information of the case class at runtime
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.Encoder // Used by spark to generate the schema
Define a case class (the different field types available can be found in the Spark SQL data types documentation):
case class Example(
stringField : String,
intField : Int,
doubleField : Double
)
Add the method for extracting a schema (StructType) given a case class type as a parameter:
// T : TypeTag means that an implicit value of type TypeTag[T] must be available at the method call site. Scala will automatically generate this for you.
def schemaOf[T: TypeTag]: StructType = {
ScalaReflection
.schemaFor[T] // this method requires a TypeTag for T
.dataType
.asInstanceOf[StructType] // cast it to a StructType, what spark requires as its Schema
}
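As a quick sanity check, calling it on the Example class above should produce a StructType mirroring the case class fields, something like this (exact nullability flags may vary):
schemaOf[Example].printTreeString()
// root
//  |-- stringField: string (nullable = true)
//  |-- intField: integer (nullable = false)
//  |-- doubleField: double (nullable = false)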
Define a method to read in a csv file from a path, with the schema defined using a case class:
// The implicit Encoder is needed by the `.as` method in order to create the Dataset[T]. The TypeTag is required by the schemaOf[T] call.
def readCSV[T : Encoder : TypeTag](
filePath: String
)(implicit spark : SparkSession) : Dataset[T]= {
spark.read
.option("header", "true")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
.schema(schemaOf[T])
.csv(filePath) // spark provides this more explicit call to read from a csv file; by default it uses comma as the separator, but this can be changed
.as[T]
}
Create a sparkSession:
implicit val spark = SparkSession.builder().master("local").getOrCreate()
Write some sample data to a temp file:
val data =
s"""|stringField,intField,doubleField
|hello,1,1.0
|world,2,2.0
|""".stripMargin
val file = File.createTempFile("test",".csv")
val pw = new PrintWriter(file)
pw.write(data)
pw.close()
An example of calling this method:
import spark.implicits._ // so that an implicit Encoder gets pulled in for the case class
val df = readCSV[Example](file.getPath)
df.show()
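Given the two sample rows written above, the output should look roughly like:
+-----------+--------+-----------+
|stringField|intField|doubleField|
+-----------+--------+-----------+
|      hello|       1|        1.0|
|      world|       2|        2.0|
+-----------+--------+-----------+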