Problem creating dataset in Spark and Scala

I ran into a problem using a Spark Dataset!
I keep getting an exception about encoders when I try to use a case class.
The code is a simple one, shown below:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
I get this exception at compile time:
Unable to find encoder for type OrderDataType. An implicit Encoder[OrderDataType] is needed to store OrderDataType instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already added import spark.implicits._, but it doesn't solve the problem!
According to the Spark and Scala documentation, the encoding should be derived implicitly by Scala.
What is wrong with this code, and what should I do to fix it?

Define your case class outside of the main method, then read the CSV file inside main and convert it to a Dataset.
Example:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
def main(args: Array[String]): Unit = {
  val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
}

// or, read it as a Dataset of tuples instead of a case class:

def main(args: Array[String]): Unit = {
  val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[(String, String, String)]
}
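For completeness, here is a minimal self-contained sketch of the first variant. The object name OrdersApp and the local[*] master are assumptions made for illustration; the case class and CSV path come from the question.

import org.apache.spark.sql.SparkSession

// Case class at the top level, outside the object that holds main,
// so spark.implicits._ can derive its Encoder.
case class OrderDataType(orderId: String, customerId: String, orderDate: String)

object OrdersApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("OrdersApp")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .csv("data\\orders.csv")
      .as[OrderDataType]

    ds.show()
    spark.stop()
  }
}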

The other way is to put everything inside object Orders extends App, which is smart enough to pick up a case class defined outside of a def main.
mydata/Orders.csv
orderId,customerId,orderDate
1,2,21/08/1977
1,2,21/08/1978
Example code:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql._

object Orders extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  case class OrderDataType(orderId: String, customerId: String, orderDate: String)

  import spark.implicits._

  val ds1 = spark.read.option("header", "true").csv("mydata/Orders.csv").as[OrderDataType]
  ds1.show
}
Result:
+-------+----------+----------+
|orderId|customerId| orderDate|
+-------+----------+----------+
| 1| 2|21/08/1977|
| 1| 2|21/08/1978|
+-------+----------+----------+
Why does the case class have to be outside of def main?
This seems to be by design of Encoder, as documented by its @implicitNotFound annotation, shown below.
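For reference, here is a rough, abridged sketch of what that annotation looks like on Spark's Encoder trait; the exact wording may vary slightly between Spark versions, but it is the source of the error message quoted in the question.

// Abridged sketch of org.apache.spark.sql.Encoder; the @implicitNotFound message below
// is what the compiler prints when no implicit Encoder[T] can be found.
import scala.annotation.implicitNotFound
import scala.reflect.ClassTag
import org.apache.spark.sql.types.StructType

@implicitNotFound("Unable to find encoder for type ${T}. An implicit Encoder[${T}] is needed to " +
  "store ${T} instances in a Dataset. Primitive types (Int, String, etc) and Product types " +
  "(case classes) are supported by importing spark.implicits._ Support for serializing other " +
  "types will be added in future releases.")
trait Encoder[T] extends Serializable {
  def schema: StructType
  def clsTag: ClassTag[T]
}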

Related

Error in returning spark rdd from a function called inside the map function

I have a collection of rowkeys (plants, as shown below) from an HBase table, and I want to write a fetchData function that returns the RDD data for the rowkeys in the collection. The goal is to take the union of the RDDs returned by fetchData for each element of the plants collection. I have given the relevant part of the code below. My issue is that the code gives a compilation error related to the return type of fetchData:
println("PartB: "+ hBaseRDD.getNumPartitions)
error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]
I am using Scala 2.11.8, Spark 2.2.0, and Maven for compilation.
import it.nerdammer.spark.hbase._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
  import spark.implicits._

  type Record = (String, Option[String], Option[String])

  def fetchData(plant: String): RDD[Record] = {
    val start_index = plant
    val end_index = plant + "z"
    // The command below works fine if I run it in the main function, but to get multiple rows
    // from HBase I am using it in a separate function.
    spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
  }

  def main(args: Array[String]) {
    // The elements of the collection below are prefixes of the relevant rowkeys in the HBase table ("test_table")
    val plants = Vector("a8", "cu", "aw", "fx")
    val hBaseRDD = plants.map(pp => fetchData(pp))
    println("Part: " + hBaseRDD.getNumPartitions)
    /*
    rest of the code
    */
  }
}
Here is a working version of the code. The problem here is that I am using a for loop and have to request the data for each rowkey (plant) from HBase inside the loop, instead of fetching all the data first and then executing the rest of the code.
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
    import spark.implicits._

    type Record = (String, Option[String], Option[String])

    val plants = Vector("a8", "cu", "aw", "fx")
    for (plant <- plants) {
      val start_index = plant
      val end_index = plant + "z"
      val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
      println("Part: " + hBaseRDD.getNumPartitions)
      /*
      rest of the code
      */
    }
  }
}
After more attempts, this is where I am stuck now. How can I cast the result to the required type?
scala> def fetchData(plant: String) = {
| val start_index = plant
| val end_index = plant + "~"
| val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
| x1
| }
Defining the function in the REPL and running it:
scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
Thanks in Advance!
The type of hBaseRDD is Vector[_], not RDD[_], so you can't call getNumPartitions on it. If I understand correctly, you want to union the fetched RDDs. You can do that with plants.map( pp => fetchData(pp)).reduceOption(_ union _) (I recommend reduceOption because it won't fail on an empty list, but you can use reduce if you are confident the list is not empty).
Also, the declared return type of fetchData is RDD[U], but I didn't find any definition of U. This is probably why the compiler infers Vector[Nothing] instead of Vector[RDD[Record]]. To avoid subsequent errors, you should also change RDD[U] to RDD[Record].
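Putting both fixes together, here is a sketch, assuming a SparkSession named spark and the it.nerdammer.spark.hbase._ implicits from the question are in scope:

import it.nerdammer.spark.hbase._
import org.apache.spark.rdd.RDD

type Record = (String, Option[String], Option[String])

// Declaring the return type as RDD[Record] lets the connector's implicit conversion
// turn its HBaseReaderBuilder into an actual RDD.
def fetchData(plant: String): RDD[Record] = {
  val startIndex = plant
  val endIndex = plant + "~"
  spark.sparkContext.hbaseTable[Record]("test_table")
    .select("iacp", "temp")
    .inColumnFamily("pp")
    .withStartRow(startIndex)
    .withStopRow(endIndex)
}

val plants = Vector("a8", "cu", "aw", "fx")

// reduceOption yields None on an empty collection instead of throwing.
val hBaseRDD: Option[RDD[Record]] = plants.map(fetchData).reduceOption(_ union _)
hBaseRDD.foreach(rdd => println("Partitions: " + rdd.getNumPartitions))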

Transform a dataframe to a dataset using case class spark scala

I wrote the following code, which aims to transform a DataFrame into a Dataset using a case class:
def toDs[T](df: DataFrame): Dataset[T] = {
  df.as[T]
}
and then the case class: case class DATA(name: String, age: Double, location: String)
I am getting:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] df.as[T]
Any idea how to fix this?
You can read the data into a Dataset[MyCaseClass] in the following two ways:
Say you have the following case class: case class MyCaseClass
1) First way: import the SparkSession implicits into scope and use the as operator to convert your DataFrame to a Dataset[MyCaseClass]:
case class MyCaseClass() // placeholder; add your own fields
val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
import spark.implicits._
val ds: Dataset[MyCaseClass] = spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
2) Second way: you can create your own encoder in another object and import it into your current code:
package com.funky.package

import org.apache.spark.sql.{Encoder, Encoders}

case class MyCaseClass() // placeholder; add your own fields

object MyCustomEncoders {
  implicit val mycaseClass: Encoder[MyCaseClass] = Encoders.product[MyCaseClass]
}
In the file containing the main method, import the implicit value defined above (note the trailing ._, which brings the implicit encoder into scope):
import com.funky.package.MyCustomEncoders._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset

val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
val ds: Dataset[MyCaseClass] = spark.read.format("FORMAT_HERE").load().as[MyCaseClass]
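If you want to keep a generic toDs helper like the one in the question, a sketch is to require the encoder from the caller via a context bound, so that the implicit Encoder[T] is resolved at the call site where the concrete type is known:

import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

// The Encoder[T] context bound pushes encoder resolution to the call site,
// where spark.implicits._ (or a custom encoder) is in scope.
def toDs[T: Encoder](df: DataFrame): Dataset[T] = df.as[T]

// Usage sketch, with DATA being the case class from the question:
// import spark.implicits._
// val ds: Dataset[DATA] = toDs[DATA](someDf)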

How to add a new method to DataFrame type?

Imagine I have this Scala function that operates upon a Spark dataframe:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object MyClass {
  def makeColumnNull(df: DataFrame, columnToMakeNull: String): DataFrame = {
    val colType = df.select(columnToMakeNull).schema.head.dataType
    df.withColumn(columnToMakeNull, lit(null).cast(colType))
  }
}
I call it like so:
val df = spark.range(0,10).toDF()
val df2 = MyClass.makeColumnNull(df, "id")
That works fine; however, it doesn't read in the same fluent manner as Spark's own API. What I'd like to do is rewrite my function in a way that enables me to do this:
val df2 = df.makeColumnNull("id")
Can anyone help?
Implicit classes are the way to go; I've used them to extend several Spark classes. So you need this:
package com.mycompany.utils.spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
object DataFrameExtensions {
  implicit class DataFrameWrapper(df: DataFrame) {
    def makeColumnNull(columnToMakeNull: String): DataFrame = {
      val colType = df.select(columnToMakeNull).schema.head.dataType
      df.withColumn(columnToMakeNull, lit(null).cast(colType))
    }
  }
}
Then you have to import com.mycompany.utils.spark.DataFrameExtensions._, and you will be able to invoke makeColumnNull() on any DataFrame object.
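A quick usage sketch, assuming a SparkSession named spark is already available:

import com.mycompany.utils.spark.DataFrameExtensions._

val df = spark.range(0, 10).toDF()

// The implicit DataFrameWrapper class wraps df, so makeColumnNull can be called
// fluently, just like a built-in DataFrame method.
val df2 = df.makeColumnNull("id")
df2.show()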

Dataframe.rdd.count throws No applicable constructor/method found for actual parameters when using case class with List

I am using Spark 2.1.0 and have also checked this scenario on 2.1.1.
This is very weird to me, and it seems like a serialization problem.
I am using a case class that contains a Scala List, and I am hitting the following error:
Executor task launch worker for task 1] [CodeGenerator.logError]: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 111, Column 64: No applicable constructor/method found for actual parameters "scala.collection.Seq"; candidates are: "com.minute.playground.WithList(scala.collection.immutable.List)"
/* 001 */
I created a short example that reproduces the problem:
import org.apache.spark.sql.{Dataset, SparkSession}

case class Simple(str: String)
case class WithList(list: List[String])

object RddCountFailOnCaseClassWithList {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WhyLikeThis").master("local[2]").enableHiveSupport.getOrCreate
    import spark.implicits._

    val df1: Dataset[Simple] = spark.sqlContext.createDataFrame[Simple](List(new Simple("1"))).as[Simple]
    df1.show()
    println(df1.rdd.count())

    val df2: Dataset[WithList] = spark.sqlContext.createDataFrame[WithList](List(new WithList(List("1")))).as[WithList]
    df2.show()

    // This will fail !!! But why ?
    println(df2.rdd.count())
  }
}
What am I doing wrong?
Am I not supposed to use a List inside my model?
Looks like a bug :)
You did nothing wrong. It looks like the List is converted to a Seq somewhere and the code generator can't find the required constructor.
Your code works with Spark 2.2.0 though, so it has been fixed there.
If you want to stay on Spark 2.1.0, you can change your implementation to:
import org.apache.spark.sql.{Dataset, SparkSession}

case class Simple(str: String)
case class WithSeq(list: Seq[String])

object RddCountFailOnCaseClassWithList {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WhyLikeThis").master("local[2]").enableHiveSupport.getOrCreate
    import spark.implicits._

    val df1: Dataset[Simple] = {
      spark.sqlContext.createDataFrame[Simple](List(Simple("1"))).as[Simple]
    }
    df1.show()
    println(df1.rdd.count())

    val df2: Dataset[WithSeq] = spark.sqlContext.createDataFrame[WithSeq](List(WithSeq(List("1")))).as[WithSeq]
    df2.show()

    // hooray, it works! :)
    println(df2.rdd.count())
  }
}
Or simply declare the field as a Seq:
case class WithList(list: Seq[String])
List is scala.collection.immutable.List, whereas Seq is scala.collection.Seq.
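In other words, a field declared as Seq still accepts a List at construction time, since scala.collection.immutable.List is a subtype of scala.collection.Seq. A tiny sketch (the class name WithSeqField is made up to avoid clashing with the code above):

// List is-a Seq, so both of these compile.
case class WithSeqField(list: Seq[String])

val fromList = WithSeqField(List("1", "2"))
val fromSeq  = WithSeqField(Seq("3"))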

Why is the error "Unable to find encoder for type stored in a Dataset" when encoding JSON using case classes?

I've written a Spark job:
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val ctx = new org.apache.spark.sql.SQLContext(sc)
    import ctx.implicits._

    case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
    case class Person2(name: String, age: Long, city: String)

    val persons = ctx.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
  }
}
When I run the main function in the IDE, two errors occur:
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
Error:(15, 67) not enough arguments for method as: (implicit evidence$1: org.apache.spark.sql.Encoder[Person])org.apache.spark.sql.Dataset[Person].
Unspecified value parameter evidence$1.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
but in the Spark shell I can run this job without any error. What is the problem?
The error message says that the Encoder is not able to handle the Person case class.
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
Move the declaration of the case class outside the scope of SimpleApp.
You get the same error if you import both sqlContext.implicits._ and spark.implicits._ in SimpleApp (the order doesn't matter).
Removing one or the other is the solution:
val spark = SparkSession
  .builder()
  .getOrCreate()
val sqlContext = spark.sqlContext

import sqlContext.implicits._ // sqlContext OR spark implicits
//import spark.implicits._   // sqlContext OR spark implicits

case class Person(age: Long, city: String)

val persons = spark.read.json("/tmp/persons.json").as[Person]
Tested with Spark 2.1.0
The funny thing is that if you import the same object's implicits twice, you will not have any problems.
@Milad Khajavi
Define the Person case classes outside object SimpleApp.
Also, add import sqlContext.implicits._ inside the main() function.
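A minimal sketch of that layout, reusing the code from the question (SQLContext-style, as posted):

import org.apache.spark.{SparkConf, SparkContext}

// Case classes at the top level, outside SimpleApp, so an Encoder can be derived for them.
case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
case class Person2(name: String, age: Long, city: String)

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // now an implicit Encoder[Person] is available

    val persons = sqlContext.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
  }
}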