Explicit cast reading .csv with case class Spark 2.1.0 - scala

I have the following case class:
case class OrderDetails(OrderID: String, ProductID: String, UnitPrice: Double,
                        Qty: Int, Discount: Double)
I am trying to read this CSV: https://github.com/xsankar/fdps-v3/blob/master/data/NW-Order-Details.csv
This is my code:
val spark = SparkSession.builder.master(sparkMaster).appName(sparkAppName).getOrCreate()
import spark.implicits._
val orderDetails = spark.read.option("header","true").csv( inputFiles + "NW-Order-Details.csv").as[OrderDetails]
And the error is:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast `UnitPrice` from string to double as it may truncate
The type path of the target object is:
- field (class: "scala.Double", name: "UnitPrice")
- root class: "es.own3dh2so4.OrderDetails"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Why can't it be transformed if all the fields are double values? What am I not understanding?
Spark version 2.1.0, Scala version 2.11.7

You just need to explicitly cast your field to a Double:
import org.apache.spark.sql.types.DoubleType

val orderDetails = spark.read
  .option("header", "true")
  .csv(inputFiles + "NW-Order-Details.csv")
  .withColumn("unitPrice", 'UnitPrice.cast(DoubleType))
  .as[OrderDetails]
On a side note, by Scala (and Java) convention, your case class constructor parameters should be lower camel case:
case class OrderDetails(orderID: String,
                        productID: String,
                        unitPrice: Double,
                        qty: Int,
                        discount: Double)

If we want to change the data type of multiple columns, chaining withColumn calls quickly gets ugly.
The better way is to derive the schema from the case class and apply it while reading the data.
Get the case class schema using Encoders, as shown below:
val caseClassSchema = Encoders.product[CaseClass].schema
Apply this schema while reading the data:
val data = spark.read.schema(caseClassSchema)
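For instance, a minimal sketch applying this to the OrderDetails class and CSV from the first question (assuming every row actually conforms to the declared types):

import org.apache.spark.sql.Encoders

// Derive the StructType directly from the case class
val orderSchema = Encoders.product[OrderDetails].schema

val orderDetails = spark.read
  .option("header", "true")
  .schema(orderSchema) // columns are parsed straight into String/Double/Int
  .csv(inputFiles + "NW-Order-Details.csv")
  .as[OrderDetails]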

Related

Scala - template method pattern in a trait

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I convert enrichedMetrics to a Dataset with enrichedMetrics.as[Metric], I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor, which doesn't seem like the best solution to me.
The solution I see right now is to pass spark: SparkSession as a parameter to the template method and then import spark.implicits._ inside it.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {
  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
    enrichedMetrics.as[Metric] // <-- requires spark.implicits
  }

  // Abstract method
  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
    /* Default implementation */
  }
}
You're missing the Encoder for your type Metric, which Spark cannot find implicitly. For common types like String, Int, etc., Spark has implicit encoders.
Also, you cannot do a simple .as on a DataFrame if the columns in the source type and destination type aren't the same. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to:
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note: as pointed out in one of the comments, if your parseMetrics method always returns a Dataset[Metric], you can add the implicit encoder to the body of the trait.
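A minimal sketch of that variant, assuming Metric is a top-level case class whose fields line up with the columns of the enriched DataFrame (java.sql.Date is assumed for the Date parameters):

import java.sql.Date
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

trait MetricsProcessor {
  // One shared encoder in the trait body, so .as[Metric] needs no SparkSession or spark.implicits._
  implicit val metricEncoder: Encoder[Metric] = Encoders.product[Metric]

  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] =
    enrichMetrics(queryMetrics(startDate, endDate), metricId).as[Metric]

  def queryMetrics(startDate: Date, endDate: Date): DataFrame
  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame // default implementation from the question omitted here
}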
Hope this helped.

Spark error: Exception in thread "main" java.lang.UnsupportedOperationException

I am writing a Scala/Spark program to find the maximum salary of the employees. The employee data is available in a CSV file, and the salary column has a comma as a thousands separator and a $ prefix, e.g. $74,628.00.
To handle the comma and dollar sign, I have written a parser function in Scala which splits each line on "," and then maps each column to individual variables to be assigned to a case class.
My parser program looks like below. In it, to eliminate the comma and dollar sign, I use the replace function to replace them with an empty string, and then finally typecast to Int.
def ParseEmployee(line: String): Classes.Employee = {
  val fields = line.split(",")
  val Name = fields(0)
  val JOBTITLE = fields(2)
  val DEPARTMENT = fields(3)
  val temp = fields(4)
  temp.replace(",", "") // To eliminate the ,
  temp.replace("$", "") // To remove the $
  val EMPLOYEEANNUALSALARY = temp.toInt // Typecast the string to Int
  Classes.Employee(Name, JOBTITLE, DEPARTMENT, EMPLOYEEANNUALSALARY)
}
My case class looks like below:
case class Employee(Name: String,
                    JOBTITLE: String,
                    DEPARTMENT: String,
                    EMPLOYEEANNUALSALARY: Number)
My Spark DataFrame SQL query looks like below:
val empMaxSalaryValue = sc.sqlContext.sql("Select Max(EMPLOYEEANNUALSALARY) From EMP")
empMaxSalaryValue.show
When I run this program, I get the exception below:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Number
- field (class: "java.lang.Number", name: "EMPLOYEEANNUALSALARY")
- root class: "Classes.Employee"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:282)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:272)
at CalculateMaximumSalary$.main(CalculateMaximumSalary.scala:27)
at CalculateMaximumSalary.main(CalculateMaximumSalary.scala)
Any idea why I am getting this error? What mistake am I making here, and why can't it typecast to Number?
Is there a better approach to this problem of getting the maximum salary of an employee?
Spark SQL provides only a limited number of Encoders, which target concrete classes. Abstract classes like Number are not supported (they can only be used with the generic binary Encoders, such as kryo).
Since you convert to Int anyway, just redefine the class:
case class Employee(
  Name: String,
  JOBTITLE: String,
  DEPARTMENT: String,
  EMPLOYEEANNUALSALARY: Int
)
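As an illustration, a small end-to-end sketch with the Int-typed class (the rows below are toy stand-ins for the parsed file; the $ and , are assumed to have been stripped already):

import org.apache.spark.sql.SparkSession

case class Employee(Name: String, JOBTITLE: String, DEPARTMENT: String, EMPLOYEEANNUALSALARY: Int)

val spark = SparkSession.builder.master("local[*]").appName("max-salary").getOrCreate()
import spark.implicits._

// Toy rows standing in for the parsed employee file
val emp = Seq(
  Employee("Alice", "Engineer", "IT", 74628),
  Employee("Bob", "Analyst", "Finance", 65000)
).toDS()

emp.createOrReplaceTempView("EMP")
spark.sql("SELECT MAX(EMPLOYEEANNUALSALARY) FROM EMP").show()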

Scala.js convert Dynamic to scala type

Is there a way to convert a js.Dynamic type to a desired Scala type like String, Int, Double or BigDecimal?
Looking at the source code, there does not appear to be a way in the companion object to do these things.
I got it. You must use asInstanceOf[T].
For example, if you have some object called data with an id of type Int and a name of type String:
val myPromise = $.ajax(url)
myPromise.done((data: js.Dynamic, textStatus: String, jqXHr: JQueryXHR) => {
  val id = data.id.asInstanceOf[Int]
  val name = data.name.asInstanceOf[String]
})
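If one of the fields needs to end up as a BigDecimal (which has no direct JavaScript equivalent), a sketch for inside the same callback, assuming a hypothetical numeric field data.total:

// data.total is a hypothetical numeric field; JS numbers surface as Double,
// and the BigDecimal is built on the Scala side
val total = BigDecimal(data.total.asInstanceOf[Double])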

Schema for type Any is not supported

I'm trying to create a Spark UDF to extract a Map of (key, value) pairs from a user-defined case class.
The Scala function seems to work fine, but when I try to convert it to a UDF in Spark 2.0, I run into the "Schema for type Any is not supported" error.
case class myType(c1: String, c2: Int)

def getCaseClassParams(cc: Product): Map[String, Any] = {
  cc.getClass
    .getDeclaredFields // all field names
    .map(_.getName)
    .zip(cc.productIterator.to) // zipped with all values
    .toMap
}
But when I try to instantiate a function value as a UDF, it results in the following error:
val ccUDF = udf{(cc: Product, i: String) => getCaseClassParams(cc).get(i)}
java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:668)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2841)
The error message says it all: you have an Any in the map. The Spark SQL and Dataset API does not support Any in a schema. It has to be one of the supported types: basic types such as String, Integer, etc., a sequence of supported types, or a map of supported types.
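One way around it, sketched below: have the helper return a type Spark can encode, e.g. by rendering every value to a String so the UDF's result is a map of strings. The three-argument UDF signature here is an illustrative rework, not the original one, since a struct column cannot be passed into a UDF as a Product anyway:

import org.apache.spark.sql.functions.udf

// Map every field name to the String form of its value: Map[String, String] is a supported schema
def getCaseClassParamsAsStrings(cc: Product): Map[String, String] =
  cc.getClass.getDeclaredFields.map(_.getName)
    .zip(cc.productIterator.map(_.toString).toSeq)
    .toMap

// Rebuild the case class from its columns inside the UDF instead of passing a struct in
val ccUDF = udf { (c1: String, c2: Int, key: String) =>
  getCaseClassParamsAsStrings(myType(c1, c2)).get(key)
}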

Spark Dataset and java.sql.Date

Let's say I have a Spark Dataset like this:
scala> import java.sql.Date
scala> case class Event(id: Int, date: Date, name: String)
scala> val ds = Seq(Event(1, Date.valueOf("2016-08-01"), "ev1"), Event(2, Date.valueOf("2018-08-02"), "ev2")).toDS
I want to create a new Dataset with only the name and date fields. As far as I can see, I can either use ds.select() with TypedColumn or I can use ds.select() with Column and then convert the DataFrame to Dataset.
However, I can't get the former option working with the Date type. For example:
scala> ds.select($"name".as[String], $"date".as[Date])
<console>:31: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
ds.select($"name".as[String], $"date".as[Date])
^
The latter option works:
scala> ds.select($"name", $"date").as[(String, Date)]
res2: org.apache.spark.sql.Dataset[(String, java.sql.Date)] = [name: string, date: date]
Is there a way to select Date fields from Dataset without going to DataFrame and back?
Been bashing my head against problems like these for the whole day. I think you can solve your problem with one line:
implicit val e: Encoder[(String, Date)] = org.apache.spark.sql.Encoders.kryo[(String,Date)]
At least that has been working for me.
EDIT
In these cases, the problem is that for most Dataset operations, Spark 2 requires an Encoder that stores schema information (presumably for optimizations). The schema information takes the form of an implicit parameter (and a bunch of Dataset operations have this sort of implicit parameter).
In this case, the OP found the correct encoder for java.sql.Date, so the following works:
implicit val e = org.apache.spark.sql.Encoders.DATE
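For completeness, a sketch of how the typed select then looks, assuming spark.implicits._ is in scope for the String encoder and ds is the Dataset from the question:

import java.sql.Date
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

implicit val dateEncoder: Encoder[Date] = Encoders.DATE

// $"date".as[Date] now finds the implicit encoder, so no DataFrame round trip is needed
val nameAndDate: Dataset[(String, Date)] = ds.select($"name".as[String], $"date".as[Date])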