Spark Scala Encoder Case class [duplicate]

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
import org.apache.spark.sql.SparkSession

case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    val dataset = sparkSession.createDataset(dataList)
  }
}

Spark Datasets require Encoders for the data type that is about to be stored. For common types (atomics, product types) there are a number of predefined encoders available, but you have to import them first from SparkSession.implicits to make it work:
val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
Alternatively, you can directly provide an explicit
import org.apache.spark.sql.{Encoder, Encoders}
val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])
or implicit
implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)
Encoder for the stored type.
Note that the Encoders object also provides a number of predefined Encoders for atomic types, and Encoders for complex types can be derived with ExpressionEncoder.
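For instance, a minimal sketch of deriving one explicitly (Wrapper is a made-up type just for illustration; ExpressionEncoder[T]() derives an encoder via reflection for Product types):

  import org.apache.spark.sql.Encoder
  import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

  // Hypothetical nested product type, defined at the top level, just for illustration.
  case class Wrapper(inner: SimpleTuple, tag: String)

  // Derive an Encoder through Catalyst reflection and make it implicit,
  // so createDataset / as can pick it up.
  implicit val wrapperEnc: Encoder[Wrapper] = ExpressionEncoder[Wrapper]()

  val wrapped = sparkSession.createDataset(Seq(Wrapper(SimpleTuple(5, "abc"), "x")))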
Further reading:
For custom objects which are not covered by built-in encoders, see How to store custom objects in Dataset?
For Row objects you have to provide the Encoder explicitly, as shown in Encoder error while trying to map dataframe row to updated row (a short sketch follows this list).
Also note that the case class must be defined outside of the Main object; see https://stackoverflow.com/a/34715827/3535853
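A minimal sketch of the Row case (the schema and DataFrame here are made up for illustration; in Spark 2.x, RowEncoder builds an Encoder[Row] from a StructType):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import org.apache.spark.sql.types._

  // Hypothetical schema and DataFrame, just for illustration.
  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("desc", StringType)
  ))
  val df = sparkSession.createDataFrame(
    java.util.Arrays.asList(Row(1, "a"), Row(2, "b")), schema)

  // sparkSession.implicits._ does not provide an Encoder[Row],
  // so one is derived from the schema and made implicit for the map below.
  implicit val rowEnc = RowEncoder(schema)

  val updated = df.map(r => Row(r.getInt(0) + 1, r.getString(1)))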

For other users (yours is correct), note that it's also important that the case class is defined outside of the object scope. So:
Fails:
object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    val dataset = sparkSession.createDataset(dataList)
  }
}
Adding the implicits still fails with the same error:
object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
    val dataset = sparkSession.createDataset(dataList)
  }
}
Works:
case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
    val dataset = sparkSession.createDataset(dataList)
  }
}
Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.
(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).

To clarify with an answer to my own question: if the goal is to define a simple literal Spark DataFrame, rather than use Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

val simpleSchema = StructType(
  StructField("a", StringType) ::
  StructField("b", IntegerType) ::
  StructField("c", IntegerType) ::
  StructField("d", IntegerType) ::
  StructField("e", IntegerType) :: Nil)

val data = List(
  Row("001", 1, 0, 3, 4),
  Row("001", 3, 4, 1, 7),
  Row("001", null, 0, 6, 4),
  Row("003", 1, 4, 5, 7),
  Row("003", 5, 4, null, 2),
  Row("003", 4, null, 9, 2),
  Row("003", 2, 3, 0, 1)
)

val df = spark.createDataFrame(data.asJava, simpleSchema)
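For a quick sanity check (using the df and spark values from above):

  df.printSchema()   // columns a..e; the StructFields are nullable by default, which is why the null literals above are accepted
  df.show()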

Related

Error: Unable to find encoder for type org.apache.spark.sql.Dataset[(String, Long)]

The following test for Dataset comparison is failing with the error:
Error:(55, 38) Unable to find encoder for type org.apache.spark.sql.Dataset[(String, Long)]. An implicit Encoder[org.apache.spark.sql.Dataset[(String, Long)]] is needed to store org.apache.spark.sql.Dataset[(String, Long)] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
).toDF("lower(word)", "count").as[Dataset[(String, Long)]]
Error:(55, 38) not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[org.apache.spark.sql.Dataset[(String, Long)]])org.apache.spark.sql.Dataset[org.apache.spark.sql.Dataset[(String, Long)]].
Unspecified value parameter evidence$2.
).toDF("lower(word)", "count").as[Dataset[(String, Long)]]
Test
As you can see, I tried creating a Kryo Encoder for (String, Long):
class WordCountDSAppTestSpec extends FlatSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  "countWords" should "return count of each word" in {
    val wordsDF = Seq(
      ("one", "one"),
      ("two", "two"),
      ("three Three", "three"),
      ("three Three", "Three"),
      ("", "")
    ).toDF("line", "word").as[LineAndWord]

    implicit val tupleEncoder = org.apache.spark.sql.Encoders.kryo[(String, Long)]

    val expectedDF = Seq(
      ("one", 1L),
      ("two", 1L),
      ("three", 2L)
    ).toDF("lower(word)", "count").as[Dataset[(String, Long)]]

    val actualDF = WordCountDSApp.countWords(wordsDF)

    assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false)
  }
}
Spark App under test
import com.aravind.oss.Logging
import com.aravind.oss.eg.wordcount.spark.WordCountUtil.{WhitespaceRegex, getClusterCfg, getPaths, getSparkSession}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{explode, split}
object WordCountDSApp extends App with Logging {
  logInfo("WordCount with Dataset API and multiple Case classes")

  val paths = getPaths(args)
  val cluster = getClusterCfg(args)

  if (paths.size > 1) {
    logInfo("More than one file to process")
  }
  logInfo("Path(s): " + paths)
  logInfo("Cluster: " + cluster)

  val spark = getSparkSession("WordCountDSApp", cluster)
  import spark.implicits._

  /*
   The case class Line SHOULD match the number of columns in the input file.
   */
  val linesDs: Dataset[Line] = spark.read
    .textFile(paths: _*)
    .toDF("line")
    .as[Line]
  logInfo("Dataset before splitting line")
  linesDs.show(false)

  /*
   toWords adds an additional column (word) to the output, so we need a
   new case class LineAndWord that contains two properties to represent the two columns.
   The names of the properties should match the names of the columns as well.
   */
  val wordDs: Dataset[LineAndWord] = toWords(linesDs)
  logInfo("Dataset after splitting the line into words")
  wordDs.show(false)

  val wordCount = countWords(wordDs)
  wordCount
    .orderBy($"count(1)".desc)
    .show(false)

  def toWords(linesDs: Dataset[Line]): Dataset[LineAndWord] = {
    import linesDs.sparkSession.implicits._
    linesDs
      .select($"line",
        explode(split($"line", WhitespaceRegex)).as("word"))
      .as[LineAndWord]
  }

  def countWords(wordsDs: Dataset[LineAndWord]): Dataset[(String, Long)] = {
    import wordsDs.sparkSession.implicits._
    val result = wordsDs
      .filter(_.word != null)
      .filter(!_.word.isEmpty)
      .groupByKey(_.word.toLowerCase)
      .count()

    result
  }

  case class Line(line: String)
  case class LineAndWord(line: String, word: String)
}
You should call .as[Something], not .as[Dataset[Something]]. Here is a working version:
"countWords" should "return count of each word" in {
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._
implicit def tuple2[A1, A2](implicit e1: Encoder[A1],
e2: Encoder[A2]): Encoder[(A1, A2)] =
Encoders.tuple[A1, A2](e1, e2)
val expectedDF = Seq(("one", 1L), ("two", 1L), ("three", 2L))
.toDF("value", "count(1)")
.as[(String, Long)]
val wordsDF1 = Seq(
("one", "one"),
("two", "two"),
("three Three", "three"),
("three Three", "Three"),
("", "")
).toDF("line", "word").as[LineAndWord]
val actualDF = WordCountDSApp.countWords(wordsDF1)
actualDF.show()
expectedDF.show()
assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false)
}
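Equivalently, a small sketch (assuming the same import spark.implicits._ as in the test) that passes the tuple encoder explicitly instead of declaring the implicit def, using the built-in Encoders.tuple, Encoders.STRING and Encoders.scalaLong:

  import org.apache.spark.sql.Encoders

  // Same expected data, but the Encoder[(String, Long)] is supplied by hand.
  val expectedDS = Seq(("one", 1L), ("two", 1L), ("three", 2L))
    .toDF("value", "count(1)")
    .as[(String, Long)](Encoders.tuple(Encoders.STRING, Encoders.scalaLong))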

Cannot resolve symbol mapValue

I am getting this error; where am I going wrong? Please help, as I am new to Spark. How do I use mapValues on an RDD?
package com.udemyexamples

import org.apache.spark.sql.SparkSession

object AverageFriendByAge {

  def parseFile(line: String): Unit = {
    val field = line.split(",")
    val age = field(2).toInt
    val friend = field(3).toInt
    (age, friend)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("AverageFriendAge")
      .getOrCreate()

    val sc = spark.sparkContext
      .textFile("C:\\SparkScala\\SparkScalaStudy\\fakefriend.csv")
    val rdd = sc.map(parseFile)
    val y = rdd.mapValues(x => (x, 1))
  }
}
You first need an instance of SparkSession; your code should look like:
val spark = SparkSession
  .builder()
  .appName("dataFrameExample")
  .master("local")
  .getOrCreate()

import spark.implicits._
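As for the mapValues part of the question: mapValues is only available on RDDs of key/value pairs (via PairRDDFunctions), so parseFile has to return the (age, friend) pair rather than Unit. A minimal sketch, assuming the same CSV layout as in the question:

  import org.apache.spark.sql.SparkSession

  // Return the pair instead of Unit so the resulting RDD is a pair RDD.
  def parseFile(line: String): (Int, Int) = {
    val field = line.split(",")
    (field(2).toInt, field(3).toInt)   // (age, numberOfFriends)
  }

  val spark = SparkSession.builder()
    .master("local")
    .appName("AverageFriendAge")
    .getOrCreate()

  val rdd = spark.sparkContext
    .textFile("C:\\SparkScala\\SparkScalaStudy\\fakefriend.csv")
    .map(parseFile)                    // RDD[(Int, Int)] -- a pair RDD

  val y = rdd.mapValues(x => (x, 1))   // RDD[(Int, (Int, Int))], ready for reduceByKey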

Confused with scala type infer and type invariance

I have the following simple code:
case class Person(name: String, age: Int)
val sc: SparkContext = ...
val rdd: RDD[Product] = sc.parallelize(List(Person("a", 1), Person("b", 2))) //line 2
val rdd1 = sc.parallelize(List(Person("a", 1), Person("b", 2)))
val rdd2: RDD[Product] = rdd1 //compiling error
RDD[T] is invariant, so RDD[Person] is not a subtype of RDD[Product], and therefore there is a compile error on the last line.
But I don't understand line 2:
sc.parallelize(List(Person("a", 1), Person("b", 2)))
It is of type RDD[Person]; why can it be assigned to RDD[Product]?
Because a very important part of type inference in Scala is the expected type. The rules are spread out over https://www.scala-lang.org/files/archive/spec/2.11/06-expressions.html, but to explain this specific case:
In line 2,
sc.parallelize(List(Person("a", 1), Person("b", 2))) is typed with expected type RDD[Product], so
List(Person("a", 1), Person("b", 2)) is typed with expected type List[Product], so
Person("a", 1) and Person("b", 2) are typed with expected type Product and this succeeds because Person is a subtype of Product.
The compiler inserts type parameters
sc.parallelize[Product](List[Product](Person("a", 1), Person("b", 2)))
Note that RDD[Person] never even appears in this process. That is, the claim that
sc.parallelize(List(Person("a", 1), Person("b", 2)))
"is of type RDD[Person]"
isn't correct; it can be of this type, but in line 2 it isn't.
Taken From Here: https://www.scala-lang.org/api/2.12.x/scala/Product.html
Base trait for all products, which in the standard library include at
least scala.Product1 through scala.Product22 and therefore also their
subclasses scala.Tuple1 through scala.Tuple22. In addition, all case
classes implement Product with synthetically generated methods.
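A tiny illustration of that, in plain Scala without Spark:

  case class Person(name: String, age: Int)

  val p: Product = Person("a", 1)   // compiles: every case class implements Product
  p.productArity                    // 2
  p.productElement(0)               // "a"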
I was able to run your code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkSample {

  object SparkSessionConf {
    val LOCAL_MASTER = "local[*]"
  }

  def initializeSpark(master: String, appName: String): SparkSession = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    SparkSession.builder
      .master(master)
      .appName(appName)
      .getOrCreate()
  }

  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sparkSession = initializeSpark(SparkSessionConf.LOCAL_MASTER, "SparkTry")
    val rdd: RDD[Product] = sparkSession.sparkContext.parallelize(List(Person("ss", 10), Person("ss", 20)))
    val rdd1: RDD[Product] = sparkSession.sparkContext.parallelize(List(Person("ss", 10), Person("ss", 20)))
  }
}
Here is a detailed description of invariance, with an example, explaining why you cannot treat an RDD[Person] as an RDD[Product] even though it seems natural that you should be able to do so.
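To see the invariance issue in isolation, here is a sketch with a plain invariant container (Box is made up for illustration; RDD behaves the same way because its type parameter is invariant):

  class Box[A](val value: A)   // A is invariant, just like RDD's type parameter

  case class Person(name: String, age: Int)

  val people: Box[Person] = new Box(Person("a", 1))
  // val products: Box[Product] = people                          // does not compile: Box[Person] is not a Box[Product]
  val products: Box[Product] = new Box[Product](Person("a", 1))   // fine: A is fixed to Product up front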

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?


How to match Dataframe column names to Scala case class attributes?

The column names in this example from spark-sql come from the case class Person.
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However, in many cases the parameter names may change. This would cause columns not to be found if the file has not been updated to reflect the change.
How can I specify an appropriate mapping?
I am thinking something like:
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))

val ps: Seq[Person] = ???
val personRDD = sc.parallelize(ps)
// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)
Basically, all the mapping you need to do can be achieved with DataFrame.select(...). (Here I assume that no type conversions need to be done.)
Given the forward and backward mappings as Maps, the essential part is
val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
// personsDF is your original DataFrame
val mappedDF = personsDF.select(mapping: _*)
where mapping is an array of Columns with aliases.
Example code
object Example {

  import org.apache.spark.rdd.RDD
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.sql.{DataFrame, SQLContext}

  case class Person(name: String, age: Int)

  object Mapping {
    val from = Map("name" -> "a", "age" -> "b")
    val to = Map("a" -> "name", "b" -> "age")
  }

  def main(args: Array[String]): Unit = {
    // init
    val conf = new SparkConf()
      .setAppName("Example.")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // create persons
    val persons = Seq(Person("bob", 35), Person("alice", 27))
    val personsRDD = sc.parallelize(persons, 4)
    val personsDF = personsRDD.toDF

    writeParquet(personsDF, "persons.parquet", sc, sqlContext)
    val otherPersonDF = readParquet("persons.parquet", sc, sqlContext)
  }

  def writeParquet(personsDF: DataFrame, path: String, sc: SparkContext, sqlContext: SQLContext): Unit = {
    import Mapping.from
    val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
    val mappedDF = personsDF.select(mapping: _*)
    mappedDF.write.parquet(path) // parquet with columns "a" and "b"
  }

  def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext): DataFrame = {
    import Mapping.to
    val df = sqlContext.read.parquet(path) // this df has columns "a" and "b"
    val mapping = to.map { (x: (String, String)) => df(x._1).as(x._2) }.toArray
    df.select(mapping: _*)
  }
}
Remark
If you need to convert a DataFrame back to an RDD[Person], then:
val rdd: RDD[Row] = personsDF.rdd
val personsRDD: RDD[Person] = rdd.map { r: Row =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}
Alternatives
Also have a look at How to convert spark SchemaRDD into RDD of my case class?
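If you are on Spark 2.x with a SparkSession (an assumption; the snippets above use the older SQLContext API), a sketch of the same renaming can be done with withColumnRenamed followed by .as[Person], which picks up the derived product encoder:

  import org.apache.spark.sql.Dataset
  import spark.implicits._

  case class Person(name: String, age: Int)

  // Hypothetical file written with columns "a" and "b", as in the example above.
  val df = spark.read.parquet("persons.parquet")

  val persons: Dataset[Person] = df
    .withColumnRenamed("a", "name")
    .withColumnRenamed("b", "age")
    .as[Person]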