How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function, which unions two DataFrames that have different columns:
// `df` is assumed to come from the enclosing class or implicit wrapper this method is defined on (not shown in this snippet)
def appendDF(df2: DataFrame): DataFrame = {
  val cols1 = df.columns.toSeq
  val cols2 = df2.columns.toSeq
  def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
    targetCols.map {
      case x if sourceCols.contains(x) => col(x)
      case y => lit(null).as(y)
    }
  }
  // both DataFrames need to pass through `expr` to guarantee the same column order, as required for a correct union
  df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to use this function (and many more) with Dataset[CleanRow] rather than DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into a DataFrame using the .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and DataFrame, since a DataFrame is just a Dataset[Row]. Also, I think that from Spark 2.x the APIs for DataFrames and Datasets were unified, so I was expecting to be able to pass either one interchangeably, but that's not the case.

If changing the signature is possible:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = { d }
// You can pass a DataFrame (which is just a Dataset[Row]):
f(Seq(0, 1).toDF())
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You can also pass a Dataset:
f(spark.createDataset(Seq(0, 1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]
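Building on that, here is a minimal sketch of how the original helper could be reused while keeping the static type. It assumes CleanRow is a case class (so spark.implicits._ provides its encoder) and that appendDF is exposed as an extension method on DataFrame, e.g. via an implicit class; the wrapper name is illustrative:
import org.apache.spark.sql.Dataset
import spark.implicits._
// drop to the untyped API, reuse the existing helper, then re-attach the static type
def appendDS(ds1: Dataset[CleanRow], ds2: Dataset[CleanRow]): Dataset[CleanRow] =
  ds1.toDF().appendDF(ds2.toDF()).as[CleanRow]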

Related

calling a scala method passing each row of a dataframe as input

I have a dataframe with two columns in it, created by importing a .txt file.
Sample file content:
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a dataframe by importing the above file:
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say like this:
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
Now what I need to do is take each row of the dataframe and call getFormattedName, passing the two column values of that row as arguments.
I tried this (and many other variants), but it did not work:
val a = b.map((m, n) => getFormattedName(m, n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema, so it can be represented as a dataframe.
DataFrames have built-in support for reading CSV input:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on the columns, you can create a UDF (user-defined function):
def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
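If the result should be kept alongside the existing columns, a small variation using withColumn (the new column name here is just illustrative):
df.withColumn("formattedName", get_formatted_name($"contentName", $"titleVersionDesc"))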
Try
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that the rows of a dataframe have their own Row type, not a tuple or something similar, so you need to use the typed getters (getString, getInt, ...) to refer to their elements.
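A minimal end-to-end sketch of that approach (assuming b is the dataframe built above and spark.implicits._ is in scope to provide the String encoder for the result):
import spark.implicits._
// Row fields come back as Any, so use the typed getters before calling the method;
// getOrElse unwraps the Option so the result is a plain Dataset[String]
val formatted = b.map(row => getFormattedName(row.getString(0), row.getString(1)).getOrElse(""))
formatted.show()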

Rewrite scala code to be more functional

I am trying to teach myself Scala whilst at the same time trying to write code that is idiomatic of a functional language, i.e. write better, more elegant, functional code.
I have the following code that works OK:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame) {
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
val dataFrames = Seq(df.featuresGroup1(groupBy, asAt),df.featuresGroup2(groupBy, asAt))
The last line bothers me though. The two functions (featuresGroup1, featuresGroup2) both have the same signature:
scala> :type df.featuresGroup1(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
scala> :type df.featuresGroup2(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
and take the same vals as parameters, so I assume I can write that line in a more functional way (perhaps using .map somehow), writing the parameter list just once and passing it to both functions. I can't figure out the syntax, though. I thought maybe I could construct a list of those functions, but that doesn't work:
scala> Seq(featuresGroup1, featuresGroup2)
<console>:23: error: not found: value featuresGroup1
Seq(featuresGroup1, featuresGroup2)
^
<console>:23: error: not found: value featuresGroup2
Seq(featuresGroup1, featuresGroup2)
^
Can anyone help?
I thought maybe I could construct a list of those functions but that doesn't work:
Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
df.featuresGroup1 _ should work as well.
df.featuresGroup1 by itself would work if you had an expected type, e.g.
val dataframes: Seq[(Seq[String], LocalDate) => DataFrame] =
Seq(df.featuresGroup1, df.featuresGroup2)
but in this specific case providing the expected type is more verbose than using lambdas.
I thought maybe I could construct a list of those functions but that doesn't work
You need to explicitly perform eta expansion to turn methods into functions (they are not the same in Scala), by using an underscore operator:
val funcs = Seq(df.featuresGroup1 _, df.featuresGroup2 _)
or by using placeholders:
val funcs = Seq(df.featuresGroup1(_, _), df.featuresGroup2(_, _))
And you are absolutely right about using the map operator:
val dataFrames = funcs.map(f => f(groupBy, asAt))
I strongly recommend against using implicits of types like String or Seq: if they are used in multiple places, they lead to subtle bugs that are not immediately obvious from the code, and the code becomes prone to breaking when it's moved around.
If you want to use implicits, wrap them in a custom type:
case class DfGrouping(groupBy: Seq[String]) extends AnyVal
implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))
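A small sketch of how such a wrapper might then be consumed, picking the grouping up implicitly instead of as a bare Seq[String] (the method below is illustrative, not part of the question's code):
import org.apache.spark.sql.functions.col
def countByGrouping(df: DataFrame)(implicit g: DfGrouping): DataFrame =
  df.groupBy(g.groupBy.map(col): _*).count()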
Why not just create a function in DataFrameExtensions to do it?
def getDataframeGroups(groupBy: Seq[String], asAt: LocalDate) = Seq(featuresGroup1(groupBy, asAt), featuresGroup2(groupBy, asAt))
I think you could create a list of functions as below:
val funcs: List[DataFrame => (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] =
  List(_.featuresGroup1, _.featuresGroup2)
funcs.map(x => x(df)(groupBy, asAt))
It seems you have a list of functions which convert a DataFrame to another DataFrame. If that is the pattern, you could go a little bit further with Endo in Scalaz.
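For reference, a minimal Scalaz-free sketch of that composition idea (fixing the common parameters first, so each step becomes a plain DataFrame => DataFrame):
// each step is now an endomorphism on DataFrame
val steps: Seq[DataFrame => DataFrame] = Seq(
  _.featuresGroup1(groupBy, asAt),
  _.featuresGroup2(groupBy, asAt)
)
// collect the individual results, as in the question...
val dataFrames = steps.map(_(df))
// ...or chain them into a single pipeline, which is what Endo formalizes
val pipeline: DataFrame => DataFrame = steps.reduce(_ andThen _)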
I like this answer best, courtesy of Alexey Romanov.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame) {
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from MySQL:
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to an RDD with
val sourceRdd = source.rdd
but it is an RDD[Row] and I need an RDD[String]
to do transformations like
sourceRdd.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc.
Thank you
You can use the Row.mkString(sep: String): String method in a map call, like this:
val sourceRdd = source.rdd.map(_.mkString(","))
You can change the "," separator to whatever you want.
Hope this helps.
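Following on from that, a small sketch of the transformation mentioned in the question (assuming the first comma-separated field is an integer key):
val sourceRdd = source.rdd.map(_.mkString(","))
// key each record by its first field; the resulting pair RDD is ready for subtractByKey and friends
val keyed = sourceRdd.map(rec => (rec.split(",")(0).toInt, rec))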
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: in Spark 1.6 use sqlContext instead of spark; spark is a SparkSession, the new entry point to SQL functionality introduced in Spark 2.0, and it should be used instead of SQLContext in Spark 2.x.
You can also create your own case classes.
You can also map the rows directly; here source is of type DataFrame and we use a partial function in the map call (the trailing split is only needed if you want the individual fields afterwards, and it yields an RDD[Array[String]]):
val sourceRdd = source.rdd.map { case x: Row => x(0).asInstanceOf[String] }.map(s => s.split(","))
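A minimal sketch of the case-class route mentioned above (assuming the source table really has two string columns; the class and field names are illustrative):
import spark.implicits._
case class SourceRow(key: String, value: String)   // hypothetical shape of the table
val typedRdd = source.as[SourceRow].rdd                      // RDD[SourceRow]
val stringRdd = typedRdd.map(r => s"${r.key},${r.value}")    // back to RDD[String] if needed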

Encoder error while trying to map dataframe row to updated row

When I'm trying to do the same thing in my code, as mentioned below:
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
})
I have taken the above reference from here:
Scala: How can I replace value in Dataframs using scala
But I am getting an encoder error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Note: I am using spark 2.0!
There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:
in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]
To be honest, it didn't make much sense in 1.x either. Independent of the version, you can simply use the DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
  case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
  case rec => rec
}
or at least return an object which will have implicit encoder:
df.map {
  case Row(year: Int, make: String, model: String) =>
    (year, if (make.toLowerCase == "tesla") "S" else make, model)
}
Finally if for some completely crazy reason you really want to map over Dataset[Row] you have to provide required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
  case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
    Row(year, "S", model)
  case row => row
}(encoder)
For the scenario where the dataframe schema is known in advance, the answer given by zero323 is the solution. But for a scenario with a dynamic schema, or when passing multiple dataframes to a generic function, the following code worked for us while migrating from 1.6.1 to 2.2.0:
import org.apache.spark.sql.Row
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
val data = df.rdd.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
})
This code executes on both versions of Spark.
Disadvantage: the optimizations Spark provides on the DataFrame/Dataset API won't be applied.
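If a DataFrame is needed again afterwards, here is a small follow-up sketch (reusing the original schema, since the mapped rows keep the same columns):
// data is the RDD[Row] produced above; df.schema still describes its layout
val resultDf = spark.createDataFrame(data, df.schema)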
Just to add a few other important points in order to better understand the other answers (especially the final point of zero323's answer about map over Dataset[Row]):
First of all, DataFrame.map gives you a Dataset (more specifically, Dataset[T], rather than Dataset[Row])!
And Dataset[T] always requires an encoder; that's what the sentence "Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]" means.
There are indeed lots of encoders predefined by Spark (which can be imported with import spark.implicits._), but the list still does not cover many domain-specific types that developers may create, in which case you need to create the encoders yourself.
In the specific example on this page, df.map returns a Dataset of Row, and Row is not among the types that have encoders predefined by Spark, hence you have to create one on your own.
And I admit that creating an encoder for the Row type is a bit different from the approach described in the link above: you have to use RowEncoder, which takes a StructType parameter describing the type of a row, like what zero323 provides above:
// this describes the internal type of a row
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)))
// and this completes the creation of the encoder
// for the type `Row` with the internal schema described above
val encoder = RowEncoder(schema)
In my case, on Spark 2.4.4, I had to import the implicits. This is a more general answer:
val spark2 = spark
import spark2.implicits._
val data = df.rdd.map(row => my_func(row))
where my_func does some operation.

udf spark column names

I need to specify a sequence of columns. If I pass two strings, it works fine
val cols = array("predicted1", "predicted2")
but if I pass a sequence or an array, I get an error:
val cols = array(Seq("predicted1", "predicted2"))
Could you please help me? Many thanks!
You have at least two options here:
Using a Seq[String]:
val columns: Seq[String] = Seq("predicted1", "predicted2")
array(columns.head, columns.tail: _*)
Using a Seq[ColumnName]:
val columns: Seq[ColumnName] = Seq($"predicted1", $"predicted2")
array(columns: _*)
The function signature is def array(colName: String, colNames: String*): Column, which means that it takes one string and then zero or more further strings. If you want to use a sequence, do it like this:
array("predicted1", Seq("predicted2"):_*)
From what I can see in the code, there are a couple of overloaded versions of this function, but neither one takes a Seq directly. So converting it into varargs as described should be the way to go.
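A third option, as a small sketch (assuming the names arrive as a Seq[String]), is to map them through functions.col and splat the result:
import org.apache.spark.sql.functions.{array, col}
val columnNames = Seq("predicted1", "predicted2")
val combined = array(columnNames.map(col): _*)   // array(predicted1, predicted2) as a single Column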
You can also use Spark's other overload, def array(cols: Column*): Column, when the columns are defined without the $ column-name notation, i.e. when you want a Seq[ColumnName] specifically but have to build it from strings. Here is how to do that:
import org.apache.spark.sql.ColumnName
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val some_states: Seq[String] = Seq("state_AK","state_AL","state_AR","state_AZ")
val some_state_cols: Seq[ColumnName] = some_states.map(s => symbolToColumn(scala.Symbol(s)))
val some_array = array(some_state_cols: _*)
...using Spark's symbolToColumn implicit method, or with the ColumnName constructor directly:
val other_state_cols: Seq[ColumnName] = some_states.map(s => new ColumnName(s))
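Either way, the resulting Seq[ColumnName] is splatted into array exactly as before; a minimal sketch:
val another_array = array(other_state_cols: _*)   // equivalent to some_array above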