Encoder error while trying to map dataframe row to updated row - scala

When I'm trying to do the same thing in my code as mentioned below:
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
})
I have taken the above reference from here:
Scala: How can I replace value in Dataframs using scala
But I am getting an encoder error:
Unable to find encoder for type stored in a Dataset. Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._ Support for serializing other types will
be added in future releases.
Note: I am using spark 2.0!

There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:
in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]
To be honest, it didn't make much sense in 1.x either. Independent of the version, you can simply use the DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
case rec => rec
}
or at least return an object which will have an implicit encoder:
df.map {
case Row(year: Int, make: String, model: String) =>
(year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally, if for some completely crazy reason you really want to map over Dataset[Row], you have to provide the required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
StructField("year", IntegerType),
StructField("make", StringType),
StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
Row(year, "S", model)
case row => row
} (encoder)

For the scenario where the dataframe schema is known in advance, the answer given by #zero323 is the solution,
but for a scenario with a dynamic schema, or when passing multiple dataframes to a generic function,
the following code has worked for us while migrating from 1.6.1 to 2.2.0:
import org.apache.spark.sql.Row
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
val data = df.rdd.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
})
This code executes on both versions of Spark.
Disadvantage: the optimizations provided
by Spark on the DataFrame/Dataset API won't be applied.
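If a DataFrame is needed further downstream, the RDD[Row] can be turned back into one by reusing the original schema. A minimal sketch (not part of the original answer, assuming the spark session and the df from above):
// data is an RDD[Row]; reapply the original schema to get a DataFrame back
val transformedDF = spark.createDataFrame(data, df.schema)
transformedDF.show()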

Just to add a few other important points to better understand the other answers (especially the final point of #zero323's answer about mapping over Dataset[Row]):
First of all, Dataframe.map gives you a Dataset (more specifically, Dataset[T], rather than Dataset[Row])!
And Dataset[T] always requires an encoder, that's what this sentence "Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]" means.
There are indeed lots of encoders predefined already by Spark (which can be imported by doing import spark.implicits._), but still the list would not be able to cover many domain specific types that developers may create, in which case you need to create encoders yourself.
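For instance (a hedged illustration, not from the original answer), for a hypothetical class that is neither a primitive nor a Product, you could fall back on a Kryo-based encoder:
import org.apache.spark.sql.{Encoder, Encoders}
// Hypothetical domain class: not a primitive and not a case class,
// so spark.implicits._ cannot derive an encoder for it.
class LegacyRecord(val id: Int, val payload: String) extends Serializable
// Encoders.kryo builds a binary encoder for arbitrary classes.
implicit val legacyRecordEncoder: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]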
In the specific example on this page, df.map returns a Dataset of Row, and, hang on a minute, the Row type is not within the list of types that have encoders predefined by Spark, hence you have to create one on your own.
And I admit that creating an encoder for the Row type is a bit different from the approach described above: you have to use RowEncoder, which takes a StructType parameter describing the type of a row, like what #zero323 provides above:
// this describes the internal type of a row
val schema = StructType(Seq(StructField("year", IntegerType), StructField("make", StringType), StructField("model", StringType)))
// and this completes the creation of encoder
// for the type `Row` with internal schema described above
val encoder = RowEncoder(schema)

In my case, with Spark version 2.4.4, I had to import the implicits. This is a general answer:
val spark2 = spark
import spark2.implicits._
val data = df.rdd.map(row => my_func(row))
where my_func did some operation.
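my_func is not shown in the original answer; purely as a hypothetical illustration, it could be any Row => Row transformation, for example:
import org.apache.spark.sql.Row
// Hypothetical my_func: upper-case the second column of each row.
def my_func(row: Row): Row = {
  val make = row.getAs[String](1).toUpperCase
  Row(row(0), make, row(2))
}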

Related

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to use this function (and many more) with Dataset[CleanRow] and not DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into a DataFrame using the .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and DataFrame, since a DataFrame is just a Dataset[Row]. Plus, I think that from Spark 2.x the APIs for DF and DS have been unified, so I was thinking that I could pass either of them interchangeably, but that's not the case.
If changing the signature is possible:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF()).show
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]
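If the signature cannot be changed, a minimal sketch (hypothetical names; assuming the CleanRow case class and the appendDF method from the question, plus the imports above) is to convert at the call site and recover the typed view afterwards:
// ds2: Dataset[CleanRow]; appendDF takes and returns a DataFrame,
// so convert on the way in and reinterpret the result on the way out.
val combined: Dataset[CleanRow] = appendDF(ds2.toDF()).as[CleanRow]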

calling a scala method passing each row of a dataframe as input

I have a dataframe with two columns in it, created by importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
Created a dataframe importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say, like this ::
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
Option(contentName + VersionDetails)
}
Now what I need to do is take each row of the dataframe and call the method getFormattedName, passing the two columns of each row as arguments.
I tried like this, and many other ways, but it did not work out ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema and it can be represented by a dataframe.
DataFrames have support for reading CSV input.
import org.apache.spark.sql.types._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on a dataset, you can create a UDF (user-defined function).
def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try
val a = b.map(row => getFormattedContentName(row.getAs[String](0), row.getAs[String](1)))
Remember that the rows of a dataframe are their own type, not a tuple or something, and you need to use the correct methodology for referring to their elements.
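A minimal end-to-end sketch of that approach (assuming the dataframe b and the getFormattedName function from the question; spark.implicits._ supplies the encoder for the String result):
import spark.implicits._
val formatted = b.map { row =>
  getFormattedName(row.getAs[String]("contentName"), row.getAs[String]("VersionDetails"))
    .getOrElse("")  // unwrap the Option so the result is a plain String
}
formatted.show()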

Spark Structured Streaming MemoryStream + Row + Encoders issue

I am trying to run some tests on my local machine with spark structured streaming.
In batch mode, here is the Row that I am dealing with:
val recordSchema = StructType(List(StructField("Record", MapType(StringType, StringType), false)))
val rows = List(
Row(
Map("ID" -> "1",
"STRUCTUREID" -> "MFCD00869853",
"MOLFILE" -> "The MOL Data",
"MOLWEIGHT" -> "803.482",
"FORMULA" -> "C44H69NO12",
"NAME" -> "Tacrolimus",
"HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
"SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
"METABOLISM" -> "The metabolism 500"
)),
Row(
Map("ID" -> "2",
"STRUCTUREID" -> "MFCD00869854",
"MOLFILE" -> "The MOL Data",
"MOLWEIGHT" -> "603.482",
"FORMULA" -> "",
"NAME" -> "Tacrolimus2",
"HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
"SMILES" -> """[H][C#]12O[C#](O)([C#H](C)C[C##H]1OC)""",
"METABOLISM" -> "The metabolism 500"
))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), recordSchema)
Working with that in batch mode works like a charm, no issue.
Now I'm trying to move to streaming mode using MemoryStream for testing. I added the following:
implicit val ctx = spark.sqlContext
val intsInput = MemoryStream[Row]
But the compiler complains as follows:
No implicits found for parameter evidence$1: Encoder[Row]
Hence my question: what should I do here to get that working?
Also, I saw that if I add the following import, the error goes away:
import spark.implicits._
Actually, I now get the following warning instead of an error:
Ambiguous implicits for parameter evidence$1: Encoder[Row]
I do not understand the encoder mechanism well and would appreciate it if someone could explain to me how to avoid using those implicits. The reason is that I read the following in a book when it comes to the creation of a DataFrame from Rows.
Recommended approach:
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
myDf.show()
And then the author goes on with this:
In Scala, we can also take advantage of Spark’s implicits in the
console (and if you import them in your JAR code) by running toDF on a
Seq type. This does not play well with null types, so it’s not
necessarily recommended for production use cases.
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")
If someone could take the time to explain what is happening in my scenario when I use the implicits, whether it is safe to do so, or whether there is a way to do it more explicitly without importing the implicits, I would appreciate it.
Finally, if someone could point me to good documentation on Encoders and Spark type mapping, that would be great.
EDIT1
I finally got it to work with
implicit val ctx = spark.sqlContext
import spark.implicits._
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
Although my problem here is that I am not confident about what I am doing. It seems to me that in some situations I need to create a Dataset to be able to convert it into a DataFrame (Dataset[Row]) with the toDF conversion. I understood that working with a DS is type-safe but slower than with a DF. So why this intermediary with Dataset? This is not the first time I have seen that in Spark Structured Streaming. Again, if someone could help me with those questions, that would be great.
I encourage you to use Scala's case classes for data modeling.
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
Now you can have a List of Product in memory:
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
The structured streaming API makes it easy to reason about stream processing by using the widely known Dataset[T] abstraction. Roughly speaking, you just have to worry about three things:
Source: a source can generate an input data stream which we can represent as a Dataset[Input]. Every new data item Input that arrives is going to be appended into this unbounded dataset. You can manipulate the data as you wish (e.g. Dataset[Input] => Dataset[Output]).
StreamingQueries and Sink: a query generates a result table that's updated from the Source every trigger interval. Changes are written into external storage called a Sink.
Output modes: there are different modes on which you can write data into the Sink: complete mode, append mode, and update mode.
Let's assume that you want to know the products that contain a molecular weight bigger than 200 units.
As you said, using the batch API is fairly simple and straightforward:
// Create a static dataset using the in-memory data
val staticData: Dataset[Product] = spark.createDataset(inMemoryRecords)
// Processing...
val result: Dataset[Product] = staticData.filter(_.weight > 200)
// Print results!
result.show()
When using the Streaming API you just need to define a source and a sink as an extra step. In this example, we can use the MemoryStream and the console sink to print out the results.
// Create a streaming dataset using the in-memory data (memory source)
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val streamingData: Dataset[Product] = productSource.toDS()
// Processing...
val result: Dataset[Product] = streamingData.filter(_.weight > 200)
// Print results by using the console sink.
val query: StreamingQuery = result.writeStream.format("console").start()
// Stop streaming
query.awaitTermination(timeoutMs=5000)
query.stop()
Note that the staticData and the streamingData have the exact type signature (i.e., Dataset[Product]). This allows us to apply the same processing steps regardless of using the Batch or Streaming API. You can also think of implementing a generic method def processing[In, Out](inputData: Dataset[In]): Dataset[Out] = ??? to avoid repeating yourself in both approaches.
Complete code example:
object ExMemoryStream extends App {
// Boilerplate code...
import org.apache.spark.sql.{Dataset, SparkSession, SQLContext}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamingQuery
val spark: SparkSession = SparkSession.builder
.appName("ExMemoryStreaming")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext
// Define your data models
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
// Create some in-memory instances
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
// Defining processing step
def processing(inputData: Dataset[Product]): Dataset[Product] =
inputData.filter(_.weight > 200)
// STATIC DATASET
val datasetStatic: Dataset[Product] = spark.createDataset(inMemoryRecords)
println("This is the static dataset:")
processing(datasetStatic).show()
// STREAMING DATASET
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val datasetStreaming: Dataset[Product] = productSource.toDS()
println("This is the streaming dataset:")
val query: StreamingQuery = processing(datasetStreaming).writeStream.format("console").start()
query.awaitTermination(timeoutMs=5000)
// Stop query and close Spark
query.stop()
spark.close()
}

Encoder[Row] in Scala Spark

I'm trying to perform a simple map on a Dataset[Row] (DataFrame) in Spark 2.0.0. Something as simple as this
val df: Dataset[Row] = ...
df.map { r: Row => r }
But the compiler is complaining that I'm not providing the implicit Encoder[Row] argument to the map function:
not enough arguments for method map: (implicit evidence$7:
Encoder[Row]).
Everything works fine if I convert to an RDD first ds.rdd.map { r: Row => r } but shouldn't there be an easy way to get an Encoder[Row] like there is for tuple types Encoders.product[(Int, Double)]?
[Note that my Row is dynamically sized in such a way that it can't easily be converted into a strongly-typed Dataset.]
An Encoder needs to know how to pack the elements inside the Row. So you could write your own Encoder[Row] by using the row's schema (row.schema), which determines the elements of your Row at runtime, together with the corresponding decoders.
Or if you know more about the data that goes into Row, you could use https://github.com/adelbertc/frameless/
Sorry to be a "bit" late. Hopefully this helps someone who is hitting the problem right now. The easiest way to define an encoder is deriving the structure from an existing DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
val myEncoder = RowEncoder(df.schema)
Such an approach can be useful when you need to alter existing fields of your original DataFrame.
If you're dealing with a completely new structure, you need an explicit definition relying on StructType and StructField (as suggested in #Reactormonk's slightly cryptic response).
Example defining the same encoder:
val myEncoder2 = RowEncoder(StructType(
Seq(StructField("id", IntegerType),
StructField("name", StringType)
)))
Please remember that org.apache.spark.sql._, org.apache.spark.sql.types._ and org.apache.spark.sql.catalyst.encoders.RowEncoder have to be imported.
In your specific case where the mapped function does not change the schema, you can pass in the encoder of the DataFrame itself:
df.map(r => r)(df.encoder)

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no property is detected, which leads to an empty DataFrame with no columns.
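For illustration only (a hypothetical sketch, not from the original answer), a Java-bean style class is what that overload can introspect:
// Hypothetical Java-bean equivalent with a getter/setter pair,
// which the createDataFrame(rdd, beanClass) overload can detect.
class DogBean extends Serializable {
  private var name: String = _
  def getName: String = name
  def setName(value: String): Unit = { name = value }
}
val beanRDD = sc.parallelize(Seq(
  { val d = new DogBean(); d.setName("Rex"); d },
  { val d = new DogBean(); d.setName("Fido"); d }
))
val beanDF = sqlContext.createDataFrame(beanRDD, classOf[DogBean])
beanDF.show()  // now the "name" column is detected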
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode. It'll give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then use createDataFrame, like:
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() won't work either.