How to match DataFrame column names to Scala case class attributes?

The column names in this example from spark-sql come from the case class Person.
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However, in many cases the case class parameter names may later be changed. The columns would then no longer be found if the Parquet file has not been rewritten to reflect the change.
How can I specify an appropriate mapping?
I am thinking something like:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))

val ps: Seq[Person] = ???
// createDataFrame with an explicit schema expects an RDD[Row], so map the case class first
val personRDD = sc.parallelize(ps).map(p => Row(p.name, p.age))

// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)

Basically, all the mapping you need to do can be achieved with DataFrame.select(...). (Here I assume that no type conversions need to be done.)
Given the forward- and backward-mapping as maps, the essential part is
// personsDF is your original DataFrame
val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
val mappedDF = personsDF.select(mapping: _*)
where mapping is an array of Columns with aliases.
Example code
object Example {

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, SQLContext}
  import org.apache.spark.{SparkConf, SparkContext}

  case class Person(name: String, age: Int)

  object Mapping {
    val from = Map("name" -> "a", "age" -> "b")
    val to = Map("a" -> "name", "b" -> "age")
  }

  def main(args: Array[String]): Unit = {
    // init
    val conf = new SparkConf()
      .setAppName("Example")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // create persons
    val persons = Seq(Person("bob", 35), Person("alice", 27))
    val personsRDD = sc.parallelize(persons, 4)
    val personsDF = personsRDD.toDF

    writeParquet(personsDF, "persons.parquet", sc, sqlContext)
    val otherPersonDF = readParquet("persons.parquet", sc, sqlContext)
  }

  def writeParquet(personsDF: DataFrame, path: String, sc: SparkContext, sqlContext: SQLContext): Unit = {
    import Mapping.from
    val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
    val mappedDF = personsDF.select(mapping: _*)
    mappedDF.write.parquet(path) // parquet with columns "a" and "b"
  }

  def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext): DataFrame = {
    import Mapping.to
    val df = sqlContext.read.parquet(path) // this df has columns "a" and "b"
    val mapping = to.map { (x: (String, String)) => df(x._1).as(x._2) }.toArray
    df.select(mapping: _*) // columns renamed back to "name" and "age"
  }
}
Remark
If you need to convert a DataFrame back to an RDD[Person], then
val rdd: RDD[Row] = personsDF.rdd
val personsRDD: RDD[Person] = rdd.map { r: Row =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}
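On Spark 1.6+ you can also skip the manual getAs calls and use the typed Dataset API instead; a minimal sketch, assuming import sqlContext.implicits._ (or spark.implicits._) is in scope and the DataFrame already carries the columns "name" and "age":
// as[Person] matches columns to case class fields by name
val personsDS = personsDF.as[Person]
val personsFromDS: RDD[Person] = personsDS.rdd // back to an RDD if you really need one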
Alternatives
Also have a look at How to convert spark SchemaRDD into RDD of my case class?

Related

Select case class based on String in Scala

How can I select a case class based on a String value?
My code is
val spark = SparkSession.builder()...
val rddOfJsonStrings: RDD[String] = // some json strings as RDD
val classSelector: String = ??? // could be "Foo" or "Bar", or any other String value
case class Foo(foo: String)
case class Bar(bar: String)
if (classSelector == "Foo") {
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Foo]
} else if (classSelector == "Bar") {
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Bar]
} else {
throw ClassUnknownException //custom Exception
}
The variable classSelector is a simple String that should be used to point to the case class of the same name.
Imagine I don't only have Foo and Bar as case classes but more than those two. How can I call the df.as[] statement based on the String (if that is possible at all)?
Or is there a completely different approach available in Scala?
Check the code below:
classSelector match {
case c if Foo.getClass.getSimpleName.replace("$","").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Foo]
case c if Bar.getClass.getSimpleName.replace("$","").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Bar]
case _ => throw ClassUnknownException //custom Exception
}
How is it possible to call the df.as[] statement based on the String (if possible at all)?
It isn't (or based on any runtime value). You may note that all answers still need to:
have a separate branch for Foo and Bar (and one more branch for each class you'll want to add);
repeat the class name twice in the branch.
You can avoid the second:
import scala.reflect.{classTag, ClassTag}
val df: DataFrame = spark.read.json(rddOfJsonStrings)
// local function defined where df and classSelector are visible
def dfAsOption[T: Encoder: ClassTag] =
  Option.when(classSelector == classTag[T].runtimeClass.getSimpleName)(df.as[T]) // Option.when needs Scala 2.13
dfAsOption[Foo].orElse(dfAsOption[Bar]).getOrElse(throw ClassUnknownException)
But for the first point you'd need a macro, if it's possible at all. I would guess it isn't.
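What you can do without a macro is to pull the per-class wiring into a single registry, so the branching lives in exactly one place (each class still needs one entry there). A sketch, assuming spark.implicits._ is in scope and Foo/Bar are top-level case classes so their encoders can be derived; the registry name is my own:
import org.apache.spark.sql.{DataFrame, Dataset}
// one entry per supported case class; build once, reuse at every call site
val readers: Map[String, DataFrame => Dataset[_]] = Map(
  "Foo" -> (df => df.as[Foo]),
  "Bar" -> (df => df.as[Bar])
)
val ds: Dataset[_] = readers
  .getOrElse(classSelector, throw ClassUnknownException)
  .apply(spark.read.json(rddOfJsonStrings))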
Define a generic method and invoke it:
import scala.reflect.{classTag, ClassTag}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
getDs[Foo](spark, rddOfJsonStrings)
getDs[Bar](spark, rddOfJsonStrings)
def getDs[T: ClassTag](spark: SparkSession, rddOfJsonStrings: RDD[String]): Dataset[T] = {
  // Encoders.bean needs the runtime class; note it targets bean-style classes
  spark.read.json(rddOfJsonStrings).as[T](Encoders.bean(classTag[T].runtimeClass.asInstanceOf[Class[T]]))
}
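Note that Encoders.bean targets Java-bean-style classes; for Scala case classes a TypeTag-based variant with Encoders.product is usually the better fit. A hedged sketch (getCaseClassDs is my own name, not from the original answer):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import scala.reflect.runtime.universe.TypeTag
// derive the encoder from the case class itself via Encoders.product
def getCaseClassDs[T <: Product: TypeTag](spark: SparkSession, json: RDD[String]): Dataset[T] =
  spark.read.json(json).as[T](Encoders.product[T])
// usage, assuming Foo and Bar are top-level case classes:
// val foos: Dataset[Foo] = getCaseClassDs[Foo](spark, rddOfJsonStrings)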
Alternative:
Highlights:
Use the simpleName of the case class, not of the companion object.
If classSelector is null, the solution won't fail.
case class Foo(foo: String)
case class Bar(bar: String)
Test case:
val rddOfJsonStrings: RDD[String] = spark.sparkContext.parallelize(Seq("""{"foo":1}"""))
val classSelector: String = "Foo" // could be "Foo" or "Bar", or any other String value
val ds = classSelector match {
  case foo if classOf[Foo].getSimpleName == foo =>
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[Foo]
  case bar if classOf[Bar].getSimpleName == bar =>
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[Bar]
  case _ => throw new UnsupportedOperationException
}
ds.show(false)
/**
* +---+
* |foo|
* +---+
* |1 |
* +---+
*/
You can use the reflective toolbox:
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark SQL basic example")
    .getOrCreate()

  import spark.implicits._

  val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"foo":"aaa"}"""))
  // val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"bar":"bbb"}"""))

  val classSelector: String = "Foo"
  // val classSelector: String = "Bar"

  case class Foo(foo: String)
  case class Bar(bar: String)

  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()

  val res = toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.DataFrame
    import Main._
    import spark.implicits._
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[$classSelector]
  """)).asInstanceOf[Dataset[_]]

  println(res) // [foo: string]
}
Notice that statically you will have a Dataset[_], not Dataset[Foo] or Dataset[Bar].
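So if you later need the static type back, you are again deciding based on the selector; a small sketch of what that could look like:
// recover a typed Dataset from the Dataset[_] the toolbox returned
val typedFoo: Option[Dataset[Foo]] =
  if (classSelector == "Foo") Some(res.asInstanceOf[Dataset[Foo]]) else None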


Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that is loaded fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer
object Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
Now I want to put it in a class so that I can pass parameters. I use the same code, just with a class instead of an object.
import org.apache.spark.ml.feature.StringIndexer
class Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
This returns the following errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this; it seems to work fine.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
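The root cause is that sc, toDF and the other implicits are pre-wired only in spark-shell; inside your own class you have to create (or receive) the SparkSession and import its implicits yourself. A minimal sketch of the class-based version, assuming Spark 2.x (passing the session and path as constructor parameters is my addition):
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

class Collaborative(val spark: SparkSession, path: String) {
  import spark.implicits._

  def trainModel(): DataFrame = {
    val data = spark.read.textFile(path)
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
    userIndexer.fit(df).transform(df) // DataFrame with the extra userIndex column
  }
}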

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
import org.apache.spark.sql.SparkSession
case class SimpleTuple(id: Int, desc: String)
object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    val dataset = sparkSession.createDataset(dataList)
  }
}
Spark Datasets require Encoders for the data type which is about to be stored. For common types (atomics, product types) there are a number of predefined encoders available, but you have to import them first from SparkSession.implicits to make it work:
val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
Alternatively, you can directly provide an explicit Encoder for the stored type:
import org.apache.spark.sql.{Encoder, Encoders}
val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])
or an implicit one:
implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)
Note that Encoders also provides a number of predefined Encoders for atomic types, and Encoders for complex types can be derived with ExpressionEncoder.
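For illustration, a minimal sketch of deriving such an encoder explicitly with ExpressionEncoder (assuming Spark 2.x and the SimpleTuple case class from above):
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
// derive the encoder without importing sparkSession.implicits._
implicit val simpleTupleEncoder: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()
val dataset = sparkSession.createDataset(dataList) // picks up the implicit encoder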
Further reading:
For custom objects which are not covered by built-in encoders see How to store custom objects in Dataset?
For Row objects you have to provide Encoder explicitly as shown in Encoder error while trying to map dataframe row to updated row
Also note that the case class must be defined outside of the Main object, see https://stackoverflow.com/a/34715827/3535853
For other readers (your definition is correct): note that it's also important that the case class is defined outside of the object scope. So:
Fails:
object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    val dataset = sparkSession.createDataset(dataList)
  }
}
Add the implicits, still fails with the same error:
object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
    val dataset = sparkSession.createDataset(dataList)
  }
}
Works:
case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
    val dataset = sparkSession.createDataset(dataList)
  }
}
Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.
(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).
I'd clarify, with an answer to my own question, that if the goal is to define a simple literal Spark DataFrame, rather than use Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
val simpleSchema = StructType(
StructField("a", StringType) ::
StructField("b", IntegerType) ::
StructField("c", IntegerType) ::
StructField("d", IntegerType) ::
StructField("e", IntegerType) :: Nil)
val data = List(
Row("001", 1, 0, 3, 4),
Row("001", 3, 4, 1, 7),
Row("001", null, 0, 6, 4),
Row("003", 1, 4, 5, 7),
Row("003", 5, 4, null, 2),
Row("003", 4, null, 9, 2),
Row("003", 2, 3, 0, 1)
)
val df = spark.createDataFrame(data.asJava, simpleSchema)
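As a follow-up sketch (not part of the original answer): if you later want this DataFrame as a typed Dataset, the nullable integer columns map naturally onto Option[Int] fields, assuming spark.implicits._ is in scope; SimpleRow is a hypothetical case class I introduce here:
case class SimpleRow(a: String, b: Option[Int], c: Option[Int], d: Option[Int], e: Option[Int])
import spark.implicits._
val ds = df.as[SimpleRow]
ds.filter(_.b.isEmpty).show() // rows where column "b" was null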

Type mismatch error in Scala

I am trying to return an RDD[(String, String, String)] and I am not able to do that using flatMap. I tried returning (tweetId, tweetBody, gender), but it gives me a type mismatch error. Can you guide me on how I can return an RDD[(String, String, String)] from flatMap?
override def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
val idColumnName = config.getConfigString("column_name").getOrElse("id")
val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
val genderColumnName = config.getConfigString("column_name").getOrElse("gender")
// convert each input element to a JsonValue
val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))
val hashtagsRDD: RDD[(String,String, String)] = jsonRDD.mapPartitions(r => {
// register jackson mapper (this needs to be instantiated per partition
// since it is not serializable)
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
r.flatMap(tweet => tweet match {
case _ :: tweet =>
val rootNode = mapper.readTree(tweet)
val tweetId = rootNode.path("id").asText.split(":")(2)
val tweetBody = rootNode.path("body").asText
val tweetVector = new HashingTF().transform(tweetBody.split(" "))
val result =genderModel.predict(tweetVector)
val gender = if(result == 1.0){"Male"}else{"Female"}
(tweetId, tweetBody, gender)
// Array(1).map(x => (tweetId, tweetBody, gender))
})
})
val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1,x._2,x._3))
val schema = StructType(Array(StructField(idColumnName,StringType, true),StructField(bodyColumnName, StringType, true),StructField(genderColumnName,StringType, true)))
sqlContext.createDataFrame(rowRDD, schema)
}
}
Try to use map instead of flatMap.
flatMap is used when the result type of the parameter function is a collection or an RDD, i.e. when every element of the current collection is mapped to zero or more elements.
map is used when every element of the current collection is mapped to exactly one element.
map with A => B swaps the type A for the type B in functorial types, i.e. transforms an RDD[A] into an RDD[B].
flatMap can be read as map then flatten in monadic types. E.g. if you have an RDD[A] and the parameter function is of type A => RDD[B], a plain map would give an RDD[RDD[B]], which flatten simplifies to just RDD[B].
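A tiny illustration of the difference (a sketch, assuming an existing SparkContext sc):
val xs = sc.parallelize(Seq("a b", "c"))
val mapped = xs.map(_.split(" "))     // RDD[Array[String]] -- exactly one element per input
val flat = xs.flatMap(_.split(" "))   // RDD[String]        -- zero or more elements per input
// mapped.collect() => Array(Array("a", "b"), Array("c"))
// flat.collect()   => Array("a", "b", "c")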
Here is an example of successfully compiled code.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
class UserTransformConfig {
  def getConfigString(name: String): Option[String] = ???
}

class PhaseLogger

object byteUtils {
  def bytesToUTF8String(r: Array[Byte]): String = ???
}

class HashingTF {
  def transform(strs: Array[String]): Array[Double] = ???
}

object genderModel {
  def predict(v: Array[Double]): Double = ???
}

def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
  val idColumnName = config.getConfigString("column_name").getOrElse("id")
  val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
  val genderColumnName = config.getConfigString("column_name").getOrElse("gender")

  // convert each input element to a JsonValue
  val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))

  val hashtagsRDD: RDD[(String, String, String)] = jsonRDD.mapPartitions(r => {
    // register jackson mapper (this needs to be instantiated per partition
    // since it is not serializable)
    val mapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)
    r.map { tweet =>
      val rootNode = mapper.readTree(tweet)
      val tweetId = rootNode.path("id").asText.split(":")(2)
      val tweetBody = rootNode.path("body").asText
      val tweetVector = new HashingTF().transform(tweetBody.split(" "))
      val result = genderModel.predict(tweetVector)
      val gender = if (result == 1.0) {"Male"} else {"Female"}
      (tweetId, tweetBody, gender)
    }
  })

  val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1, x._2, x._3))
  val schema = StructType(Array(
    StructField(idColumnName, StringType, true),
    StructField(bodyColumnName, StringType, true),
    StructField(genderColumnName, StringType, true)))
  sqlContext.createDataFrame(rowRDD, schema)
}
Please note how much I had to fill in from my imagination because you did not supply a minimal example. In general, questions like this are not worth answering.