Transform a DataFrame to a Dataset using a case class (Spark, Scala)

I wrote the following code, which aims to transform a DataFrame to a Dataset using a case class:
def toDs[T](df: DataFrame): Dataset[T] = {
df.as[T]
}
Then I defined: case class DATA(name: String, age: Double, location: String)
I am getting:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] df.as[T]
Any idea how to fix this?

You can read the data into a Dataset[MyCaseClass] in either of the following two ways.
Say you have the following class: case class MyCaseClass
1) First way: import the SparkSession implicits into scope and use the as operator to convert your DataFrame to a Dataset[MyCaseClass]:
case class MyCaseClass
val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
import spark.implicits._
val ds: Dataset[MyCaseClass]= spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
2) Second way: create your own encoder in another object and import it into your current code:
package com.funky.package
import org.apache.spark.sql.{Encoder, Encoders}
case class MyCaseClass
object MyCustomEncoders{
implicit val mycaseClass:Encoder[MyCaseClass] = Encoders.product[MyCaseClass]
}
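In the file containing the main method, import the implicit value defined above. The trailing ._ matters: importing only the object does not bring its implicit members into scope.
import com.funky.package.MyCustomEncoders._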
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
val ds: Dataset[MyCaseClass]= spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
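For the generic helper in the question, the compiler cannot pick an encoder inside toDs because T is still abstract there. One common way around this is a context bound, which defers the Encoder[T] lookup to the call site where T is concrete. The following is only a minimal sketch under that assumption; the names Data and ToDsExample, the local master setting, and the sample row are illustrative and not from the original post.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}

// Top-level case class (hypothetical stand-in for the question's DATA class).
case class Data(name: String, age: Double, location: String)

object ToDsExample {
  // The context bound [T: Encoder] asks the caller to supply an implicit Encoder[T].
  def toDs[T: Encoder](df: DataFrame): Dataset[T] = df.as[T]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("toDs-sketch").getOrCreate()
    import spark.implicits._ // provides Encoder[Data] at the call site below

    // Illustrative data only; column names must match the case class fields.
    val df = Seq(("Ann", 31.0, "Oslo")).toDF("name", "age", "location")
    val ds: Dataset[Data] = toDs[Data](df)
    ds.show()

    spark.stop()
  }
}
```

With this signature, import spark.implicits._ (or a custom implicit Encoder) is needed where toDs is called, not where it is defined.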

Related

toDF is not a member of Seq, getting this error in Databricks notebook

I am trying to create an empty DataFrame based on a case class in a Databricks notebook. If I do it inside an object or class, I get an error; if I take out the object definition, it runs successfully. Is this a bug in the Databricks notebook, or do I need to import something?
Please suggest.
import org.apache.spark.sql.SparkSession
object Emp {
lazy val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
def emptyDf {
case class Employee(Name: String, Age: Integer, Address: String)
var empDf = Seq.empty[Employee].toDF()
}
}
--Error Message
command-1113195242149456:9: error: value toDF is not a member of Seq[Employee]
var empDf = Seq.empty[Employee].toDF()
I am able to fix the problem by just moving the case class outside the def.
import org.apache.spark.sql.SparkSession
object Emp {
lazy val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
case class Employee(Name: String, Age: Integer, Address: String)
def emptyDf {
var empDf = Seq.empty[Employee].toDF()
}
}
Note: Add the case classes outside of the object definition; otherwise it will throw an error when you call methods of the object.
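As a variation on the fix above, an empty DataFrame can also be built with SparkSession.emptyDataset once the case class lives at the top level. This is a hedged sketch rather than the poster's code: the object name EmptyEmp and the choice to pass the session in as a parameter are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Declared at the top level so that Spark can derive an encoder for it.
case class Employee(Name: String, Age: Integer, Address: String)

object EmptyEmp {
  // Returns an empty DataFrame whose schema is taken from the Employee fields.
  def emptyDf(spark: SparkSession): DataFrame = {
    import spark.implicits._ // supplies the implicit Encoder[Employee]
    spark.emptyDataset[Employee].toDF()
  }
}
```

Calling EmptyEmp.emptyDf(spark).printSchema() should list the three columns in declaration order.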

Problem creating dataset in Spark and Scala

I ran into a problem using Spark Datasets.
I keep getting an exception about encoders when I use a case class.
The code is simple:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
I get this exception during compile:
Unable to find encoder for type OrderDataType. An implicit Encoder[OrderDataType] is needed to store OrderDataType instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already added import spark.implicits._, but it doesn't solve the problem.
According to the Spark and Scala documentation, the encoding should be handled implicitly by Scala.
What is wrong with this code, and what should I do to fix it?
Define your case class outside of the main method; then, inside main, read the CSV file and convert it to a Dataset.
Example:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
def main(args: Array[String]): Unit = {
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
}
//or
def main(args: Array[String]): Unit = {
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[(String,String,String)]
}
Another way is to put everything inside object Orders extends App. The case class is then a member of the object rather than local to a def main, so the encoder can be found.
mydata/Orders.csv
orderId,customerId,orderDate
1,2,21/08/1977
1,2,21/08/1978
Example code :
package examples
import org.apache.log4j.Level
import org.apache.spark.sql._
object Orders extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
case class OrderDataType(orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds1 = spark.read.option("header", "true").csv("mydata/Orders.csv").as[OrderDataType]
ds1.show
}
Result:
+-------+----------+----------+
|orderId|customerId| orderDate|
+-------+----------+----------+
|      1|         2|21/08/1977|
|      1|         2|21/08/1978|
+-------+----------+----------+
Why must the case class be outside def main?
This seems to be by design of Encoder, judging from its @implicitNotFound annotation.

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from import spark.implicits._ through toDF()) because I have verified it in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to import the implicits through a val for them to work. Since your SparkSession is a var, it is not a stable identifier, so import spark.implicits._ will not compile.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can write functions for your tests, so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library sets everything up for you and exposes sc as the SparkContext. Below is a simple example:
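import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite
// WordCount is the class under test; its definition is not shown here.
class TestSharedSparkContext extends FunSuite with SharedSparkContext {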
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, everything is prepared for you, with sc as the SparkContext.
Another approach is to create a TestWrapper and reuse it across multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
Then use this TestWrapper for all your ScalaTest tests, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed.
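To make the wrapper approach concrete, here is one way a test could mix it in. It is only a sketch: the suite name PersonFrameTest and the sample rows are made up for illustration, and it assumes a ScalaTest version where org.scalatest.FunSuite is available, as in the question. Because sparkSession is a lazy val, it is a stable identifier and its implicits can be imported directly.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Top-level case class so that toDF() can find an encoder for it.
case class Person(name: String, age: Int)

trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example").getOrCreate()
}

// Hypothetical suite name; the data is illustrative only.
class PersonFrameTest extends FunSuite with TestSparkWrapper with BeforeAndAfterAll {

  test("toDF keeps every row") {
    // A lazy val is a stable identifier, so this import compiles.
    import sparkSession.implicits._

    val df = Seq(Person("Michael", 29), Person("Andy", 30)).toDF()

    assert(df.count() == 2)
    assert(df.collect().map(_.getString(0)).toSet == Set("Michael", "Andy"))
  }

  // Stop the shared session once all tests in this suite have run.
  override def afterAll(): Unit = sparkSession.stop()
}
```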
Hope this helps!

Converting error with RDD operation in Scala

I am new to Scala and ran into an error while practicing.
I tried to convert an RDD into a DataFrame; my code follows.
package com.sclee.examples
import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType};
object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("examples").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Long)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val df = personRDD.map({
case Row(val1: String, val2: Long) => Person(val1,val2)
}).toDS()
// val ds = personRDD.toDS()
}
}
I followed the instructions in the Spark documentation and also consulted some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Long)
object SparkTest {
def main(args: Array[String]): Unit = {
// use the SparkSession of Spark 2
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
// this is your RDD; just a sample of how to create one
val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A",10),Person("B",20)))
// the SparkSession has a method to convert an RDD to a Dataset
val ds = spark.createDataset(personRDD)
println(ds.count())
}
}
I made the following changes:
use SparkSession instead of SparkContext and SQLContext
move the Person class out of the App object (I'm not sure why I had to do this)
use createDataset for the conversion
However, I guess it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method.
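Reading straight into a Dataset, as suggested in the last point, might look roughly like the sketch below. It assumes Spark 2.2 or later, where DataFrameReader.json accepts a Dataset[String]. The in-memory JSON lines stand in for a real input path, and the names ReadAsDataset and readPeople are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Top-level case class so that Spark can derive an encoder for it.
case class Person(name: String, age: Long)

object ReadAsDataset {
  // Parses JSON lines into a typed Dataset; columns are matched to fields by name.
  def readPeople(spark: SparkSession, jsonLines: Seq[String]): Dataset[Person] = {
    import spark.implicits._
    // A real job would more likely call spark.read.json on a file path;
    // in-memory lines keep this sketch self-contained.
    spark.read.json(jsonLines.toDS()).as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("read-as-dataset").getOrCreate()
    val people = readPeople(spark, Seq("""{"name":"A","age":10}""", """{"name":"B","age":20}"""))
    people.show()
    spark.stop()
  }
}
```

The JSON reader infers whole numbers as long, which is why age is declared as Long here.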

Why is the error "Unable to find encoder for type stored in a Dataset" when encoding JSON using case classes?

I've written a Spark job:
object SimpleApp {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val ctx = new org.apache.spark.sql.SQLContext(sc)
import ctx.implicits._
case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
case class Person2(name: String, age: Long, city: String)
val persons = ctx.read.json("/tmp/persons.json").as[Person]
persons.printSchema()
}
}
When I run the main function in the IDE, two errors occur:
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
Error:(15, 67) not enough arguments for method as: (implicit evidence$1: org.apache.spark.sql.Encoder[Person])org.apache.spark.sql.Dataset[Person].
Unspecified value parameter evidence$1.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
But in the Spark shell I can run this job without any error. What is the problem?
The error message says that the Encoder is not able to take the Person case class.
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
Move the declaration of the case class outside the scope of SimpleApp.
You get the same error if you import both sqlContext.implicits._ and spark.implicits._ in SimpleApp (the order doesn't matter).
Removing one or the other is the solution:
val spark = SparkSession
.builder()
.getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._ //sqlContext OR spark implicits
//import spark.implicits._ //sqlContext OR spark implicits
case class Person(age: Long, city: String)
val persons = spark.read.json("/tmp/persons.json").as[Person]
Tested with Spark 2.1.0
The funny thing is if you add the same object implicits twice you will not have problems.
@Milad Khajavi
Define the Person case classes outside object SimpleApp.
Also, add import sqlContext.implicits._ inside the main() function.