Encoder[Row] in Scala Spark - scala

I'm trying to perform a simple map on a Dataset[Row] (DataFrame) in Spark 2.0.0. Something as simple as this
val df: DataSet[Row] = ...
df.map { r: Row => r }
But the compiler is complaining that I'm not providing the implicit Encoder[Row] argument to the map function:
not enough arguments for method map: (implicit evidence$7:
Encoder[Row]).
Everything works fine if I convert to an RDD first ds.rdd.map { r: Row => r } but shouldn't there be an easy way to get an Encoder[Row] like there is for tuple types Encoders.product[(Int, Double)]?
[Note that my Row is dynamically sized in such a way that it can't easily be converted into a strongly-typed Dataset.]

An Encoder needs to know how to pack the elements inside the Row. So you could write your own Encoder[Row] by using row.structType which determines the elements of your Row at runtime and uses the corresponding decoders.
Or if you know more about the data that goes into Row, you could use https://github.com/adelbertc/frameless/

SSry to be a "bit" late. Hopefully this helps to someone who is hitting the problem right now. Easiest way to define encoder is deriving the structure from existing DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c").toDF("id", "name")
val myEncoder = RowEndocer(df.schema)
Such approach could be useful when you need altering existing fields from your original DataFrame.
If you're dealing with completely new structure, explicit definition relying on StructType and StructField (as suggested in #Reactormonk 's little cryptic response).
Example defining the same encoder:
val myEncoder2 = RowEncoder(StructType(
Seq(StructField("id", IntegerType),
StructField("name", StringType)
)))
Please remember org.apache.spark.sql._, org.apache.spark.sql.types._ and org.apache.spark.sql.catalyst.encoders.RowEncoder libraries have to be imported.

In your specific case where the mapped function does not change the schema, you can pass in the encoder of the DataFrame itself:
df.map(r => r)(df.encoder)

Related

calling a scala method passing each row of a dataframe as input

I have a dataframe which has two columns in it, has been created importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
Created a dataframe importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined lets say like this ::
def getFormattedName(contentName : String, VersionDetails:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
Now what I need to do is I have to take each row of the dataframe and call the method getFormattedName passing the 2 arguments of the dataframe's each row.
I tried like this and many others but did not work out ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema and it can be represented by a dataframe.
Dataframe has support for reading the csv input.
import org.apache.spark.sql.types._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on dataset, you can create a UDF(User Defined Function).
def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try
val a = b.map(row => getFormattedContentName(row(0),row(1)))
Remember that the rows of a dataframe are their own type, not a tuple or something, and you need to use the correct methodology for referring to their elements.

Add a list as a column to dataframe in scala [duplicate]

I have List[Double], how to convert it to org.apache.spark.sql.Column. I am trying to insert it as a column using .withColumn() to existing DataFrame.
It cannot be done directly. Column is not a data structure but a representation of a specific SQL expression. It is not bound to a specific data. You'll have to transform your data first. One way to approach this is to parallelize and join by index:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, DoubleType}
val df = Seq(("a", 2), ("b", 1), ("c", 0)).toDF("x", "y")
val aList = List(1.0, -1.0, 0.0)
val rows = df.rdd.zipWithIndex.map(_.swap)
.join(sc.parallelize(aList).zipWithIndex.map(_.swap))
.values
.map { case (row: Row, x: Double) => Row.fromSeq(row.toSeq :+ x) }
sqlContext.createDataFrame(rows, df.schema.add("z", DoubleType, false))
Another similar approach is to index and use and UDF to handle the rest:
import scala.util.Try
val indexedDf = sqlContext.createDataFrame(
df.rdd.zipWithIndex.map {
case (row: Row, i: Long) => Row.fromSeq(row.toSeq :+ i)
},
df.schema.add("idx_", "long")
)
def addValue(vs: Vector[Double]) = udf((i: Long) => Try(vs(i.toInt)).toOption)
indexedDf.withColumn("z", addValue(aList.toVector)($"idx_"))
Unfortunately both solutions will suffer from the issues. First of all passing local data through driver introduces a serious bottleneck in your program. Typically data should accessed directly from the executors. Another problem are growing RDD lineages if you want to perform this operation iteratively.
While the second issue can be addressed by checkpointing the first one makes this idea useless in general. I would strongly suggest that you either build completely structure first, and read it on Spark, or rebuild you pipeline in a way that can leverage Spark architecture. For example if data comes from an external source perform reads directly for each chunk of data using map / mapPartitions.

Encoder error while trying to map dataframe row to updated row

When I m trying to do the same thing in my code as mentioned below
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
})
I have taken the above reference from here:
Scala: How can I replace value in Dataframs using scala
But I am getting encoder error as
Unable to find encoder for type stored in a Dataset. Primitive types
(Int, S tring, etc) and Product types (case classes) are supported by
importing spark.im plicits._ Support for serializing other types will
be added in future releases.
Note: I am using spark 2.0!
There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:
in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]
To be honest it didn't make much sense in 1.x either. Independent of version you can simply use DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
case rec => rec
}
or at least return an object which will have implicit encoder:
df.map {
case Row(year: Int, make: String, model: String) =>
(year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally if for some completely crazy reason you really want to map over Dataset[Row] you have to provide required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
StructField("year", IntegerType),
StructField("make", StringType),
StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
Row(year, "S", model)
case row => row
} (encoder)
For scenario where dataframe schema is known in advance answer given by #zero323 is the solution
but for scenario with dynamic schema / or passing multiple dataframe to a generic function:
Following code has worked for us while migrating from 1.6.1 from 2.2.0
import org.apache.spark.sql.Row
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
val data = df.rdd.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
})
this code executes on both the versions of spark.
disadvantage : optimization provided
by spark on dataframe/datasets api wont be applied.
Just to add a few other important-to-know points in order to well understand the other answers (especially the final point of #zero323's answer about map over Dataset[Row]):
First of all, Dataframe.map gives you a Dataset (more specifically, Dataset[T], rather than Dataset[Row])!
And Dataset[T] always requires an encoder, that's what this sentence "Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]" means.
There are indeed lots of encoders predefined already by Spark (which can be imported by doing import spark.implicits._), but still the list would not be able to cover many domain specific types that developers may create, in which case you need to create encoders yourself.
In the specific example on this page, df.map returns a Row type for Dataset, and hang on a minute, Row type is not within the list of types that have encoders predefined by Spark, hence you are going to create one on your own.
And I admit that creating an encoder for Row type is a bit different than the approach described in the above link, and you have to use RowEncoder which takes StructType as param describing type of a row, like what #zero323 provides above:
// this describes the internal type of a row
val schema = StructType(Seq(StructField("year", IntegerType), StructField("make", StringType), StructField("model", StringType)))
// and this completes the creation of encoder
// for the type `Row` with internal schema described above
val encoder = RowEncoder(schema)
In my case of spark 2.4.4 version, I had to import implicits. This is a general answer
val spark2 = spark
import spark2.implicits._
val data = df.rdd.map(row => my_func(row))
where my_func did some operation.

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
Second parameter is part of Java API and expects you class follows java beans convention (getters/setters). Your case class doesn't follow this convention, so no property is detected, that leads to empty DataFrame with no columns.
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
Case Class Approach won't Work in cluster mode. It'll give ClassNotFoundException to the case class you defined.
Convert it a RDD[Row] and define the schema of your RDD with StructField and then createDataFrame like
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() wont work either

toDF() not handling RDD

I have an RDD of Rows called RowRDD. I am simply trying to convert into DataFrame. From the examples I have seen on the internet from various places, I am seeing that I shoudl be trying RowRDD.toDF() I am getting the error :
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
It doesn't work because Row is not a Product type and createDataFrame with as single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it is should be obvious. Row is just just a container of Any and as such it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question then schema is easy to generate:
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RD
val rowRdd: RDD[Row] = ???
val schema = StructType(
(1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val df = sqlContext.createDataFrame(rowRdd, schema)