Create SparkSQL UDF with non serializable objects - scala

I'm trying to write a UDF that I would like to use on Hive tables in an sqlContext. Is there any way to include objects from other libraries that are not serializable? Here's a minimal example of what does not work:
def myUDF(s: String) = {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
}
I register the function as a UDF in the Spark shell:
val encoding = sqlContext.udf.register("encoder", myUDF)
If I try to run it on a table "test"
sqlContext.sql("SELECT encoder(colname) from test").show()
I get the error
org.apache.spark.SparkException: Task not serializable
object not serializable (class: sun.misc.BASE64Encoder, value: sun.misc.BASE64Encoder@4a7f9a94)
Is there a workaround for this? I tried embedding myUDF in an object and in a class but that didn't work either.

You can try defining the UDF as:
def encoder = udf((s: String) => {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
})
And call the udf function as
dataframe.withColumn("encoded", encoder(col("id"))).show
Updated
As @santon pointed out, a new BASE64Encoder is instantiated for every row in the DataFrame, which might lead to performance issues. One solution is to create the BASE64Encoder once in a static (singleton) object and call it from within the UDF.
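A minimal sketch of that idea, assuming a hypothetical holder object named Base64Holder and swapping sun.misc.BASE64Encoder for the JDK's java.util.Base64 (the sun.misc class is not available on newer JDKs). In compiled code a Scala object is initialised once per JVM, so the encoder would be created once per executor rather than once per row, and it is never captured in the serialized closure:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical holder (not from the original answer): the encoder lives in a
// singleton object, so it is looked up on each executor instead of being
// serialized together with the closure.
object Base64Holder {
  lazy val encoder: Base64.Encoder = Base64.getEncoder
}

// Plain Scala function that only references the singleton.
val encodeBase64: String => String = s =>
  if (s == null) null
  else Base64Holder.encoder.encodeToString(s.getBytes(StandardCharsets.UTF_8))

// With Spark on the classpath, this function could then be wrapped as a UDF:
//   val encoder = org.apache.spark.sql.functions.udf(encodeBase64)
```

The null check is an assumption about the data: Spark passes null for missing string values, and calling getBytes on null would otherwise throw.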

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
The method's return type appears to be correct, but at runtime it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performative way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset.
You can see the definition of def columns here.
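As a rough illustration of the above, here is a small sketch assuming a local SparkSession and two made-up Year rows. It shows that columns and schema.fieldNames give the same names, and that those names can be passed to select as varargs:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Year(day: Int, month: Int, Year: Int)

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("columns-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for the real Dataset[Year].
val yearsDS: Dataset[Year] = Seq(Year(1, 1, 2022), Year(31, 1, 2022)).toDS()

// Column names, derived from the Dataset's schema.
val desiredColumns: Array[String] = yearsDS.columns
val fromSchema: Array[String] = yearsDS.schema.fieldNames

// The same names can drive a select on any DataFrame that has those columns.
val selected = yearsDS.toDF().select(desiredColumns.head, desiredColumns.tail: _*)
```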
Maybe you have a DataFrame, not a Dataset.
Try using "as" to convert the DataFrame to a Dataset, like this:
val year = Year(1, 1, 1)
val years = Array(year, year).toList
import spark.implicits._
val df = spark.sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(df.collect().toList)

How can I convert BinaryType to Array[Byte] when calling Scala UDF in Spark?

I've written the following UDF in Scala:
import java.io.{ByteArrayOutputStream, ByteArrayInputStream}
import java.util.zip.{GZIPInputStream}
def Decompress(compressed: Array[Byte]): String = {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
  val output = scala.io.Source.fromInputStream(inputStream).mkString
  output
}
val decompressUdf = (compressed: Array[Byte]) => {
  Decompress(compressed)
}
spark.udf.register("Decompress", decompressUdf)
I'm then attempting to call the UDF with the following:
val sessionsRawDF =
  sessionRawDF
    .withColumn("WebsiteSession", decompressUdf(sessionRawDF("body")))
    .select(
      current_timestamp().alias("ingesttime"),
      current_timestamp().cast("date").alias("p_ingestdate"),
      col("partition"),
      col("enqueuedTime"),
      col("WebsiteSession").alias("Json")
    )
When I run this, I get the following error:
command-130062350733681:9: error: type mismatch;
found: org.apache.spark.sql.Column
required: Array[Byte]
decompressUdf(col("WebsiteSession")).alias("Json")
I was under the impression Spark would implicitly get the value and go from the spark type to Array[Byte] in this case.
Would someone please help me understand what's going on? I've been fighting this for a while and am not sure what else to try.
You need to convert the Scala function to a Spark UDF first, before you can register it as a UDF. For example,
val decompressUdf = udf(Decompress _)
spark.udf.register("Decompress", decompressUdf)
In fact, there is no need to register the UDF if you're just using it in the DataFrame API. You can simply run the first line and use decompressUdf. Registering is only needed if you want to use the UDF in SQL.
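Putting this together, here is a self-contained sketch, assuming a local SparkSession and a hypothetical gzip helper that only exists to produce test data. The column names body and Json follow the question; everything else is illustrative:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("gzip-udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper used only to build a gzip-compressed sample value.
def gzip(text: String): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new GZIPOutputStream(bytes)
  out.write(text.getBytes(StandardCharsets.UTF_8))
  out.close()
  bytes.toByteArray
}

def decompress(compressed: Array[Byte]): String = {
  val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()
}

// Wrapping with udf(...) yields a function from Column to Column;
// Spark hands the BinaryType value to the function as Array[Byte].
val decompressUdf = udf(decompress _)

val df = Seq(gzip("""{"a":1}""")).toDF("body")
val result = df
  .withColumn("Json", decompressUdf(col("body")))
  .select("Json")
  .as[String]
  .head()
```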

Apache Spark - Generic method for loading csv data to dataset

I would like to write a generic method with three input parameters:
filePath - String
schema - ?
case class
So, my idea is to write a method like this:
def load_sms_ds(filePath: String, schemaInfo: ?, cc: ?) = {
  val ds = spark.read
    .format("csv")
    .option("header", "true")
    .schema(?)
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
    .load(filePath)
    .as[?]
  ds
}
and to return a Dataset depending on the input parameters. I am not sure, though, what types the parameters schemaInfo and cc should be.
First of all, I would recommend reading the Spark SQL programming guide. It contains some things that I think will generally help you as you learn Spark.
Let's run through the process of reading in a CSV file using a case class to define the schema.
First add the various imports needed for this example:
import java.io.{File, PrintWriter} // for reading / writing the example data
import org.apache.spark.sql.types.StructType // to define the schema
import org.apache.spark.sql.catalyst.ScalaReflection // used to generate the schema from a case class
import scala.reflect.runtime.universe.TypeTag // used to provide type information of the case class at runtime
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.Encoder // used by Spark to generate the schema
Define a case class, the different types available can be found here:
case class Example(
stringField : String,
intField : Int,
doubleField : Double
)
Add the method for extracting a schema (StructType) given a case class type as a parameter:
// T : TypeTag means that an implicit value of type TypeTag[T] must be available at the method call site. Scala will automatically generate this for you. See here for further details.
def schemaOf[T: TypeTag]: StructType = {
  ScalaReflection
    .schemaFor[T] // this method requires a TypeTag for T
    .dataType
    .asInstanceOf[StructType] // cast it to a StructType, which is what Spark requires as its schema
}
Define a method to read in a CSV file from a path with the schema defined using a case class:
// The implicit Encoder is needed by the `.as` method in order to create the Dataset[T]. The TypeTag is required by the schemaOf[T] call.
def readCSV[T : Encoder : TypeTag](
  filePath: String
)(implicit spark : SparkSession) : Dataset[T] = {
  spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
    .schema(schemaOf[T])
    .csv(filePath) // Spark provides this more explicit call for reading a CSV file; by default it uses a comma as the separator, but this can be changed
    .as[T]
}
Create a SparkSession:
implicit val spark = SparkSession.builder().master("local").getOrCreate()
Write some sample data to a temp file:
val data =
s"""|stringField,intField,doubleField
|hello,1,1.0
|world,2,2.0
|""".stripMargin
val file = File.createTempFile("test",".csv")
val pw = new PrintWriter(file)
pw.write(data)
pw.close()
An example of calling this method:
import spark.implicits._ // so that an implicit Encoder gets pulled in for the case class
val df = readCSV[Example](file.getPath)
df.show()
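As an alternative sketch (not the approach used above), the schema can also be taken from the implicit Encoder itself through its schema method, which avoids the ScalaReflection call and the TypeTag bound. The case class Sms, the method name readCsvAs and the sample data are hypothetical:

```scala
import java.io.{File, PrintWriter}
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Hypothetical case class used only for this illustration.
case class Sms(sender: String, length: Int)

// The Encoder[T] already knows the schema of T, so it can supply it directly.
def readCsvAs[T: Encoder](filePath: String)(implicit spark: SparkSession): Dataset[T] =
  spark.read
    .option("header", "true")
    .schema(implicitly[Encoder[T]].schema)
    .csv(filePath)
    .as[T]

// Assumption: a throwaway local session, only for this illustration.
implicit val spark: SparkSession =
  SparkSession.builder().master("local[1]").appName("csv-sketch").getOrCreate()
import spark.implicits._

// Write two made-up rows to a temp file and read them back.
val smsFile = File.createTempFile("sms", ".csv")
val writer = new PrintWriter(smsFile)
writer.write("sender,length\nalice,12\nbob,7\n")
writer.close()

val rows = readCsvAs[Sms](smsFile.getPath).collect().toList
```

One design note: with only an Encoder bound, the same method also works for types whose encoder does not come from a case class, such as one created with Encoders.bean.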

Removing newlines in a DataFrame field with udf function gives TypeTag Error

val trim: String => String = _.trim.replaceAll("[\\r\\n]", "")
def main(args: Array[String]) {
  val spark = ... ...
  import spark.implicits._
  val trimUDF = udf[String,String](trim)
  val df = spark.read.json(df_path) ...
  val fixed_dblogs_df = df.withColumn("qp_new", trimUDF('qp)) ...
}
When I run this code I get a compile time error:
No TypeTag available for String
The error is reported where I define the udf function. I have no idea why this is happening. I have used udf functions before, but this one produces this error. I am using Spark 2.1.1.
The purpose of the code is to remove all the newlines in one of my columns, which is of StringType.
Is there some reason you're using a UDF instead of the regexp_replace builtin?
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace('qp, "[\\r\\n]", "")) ...
UDFs are opaque to Spark's optimizer, so they break Spark's plan optimization.
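A small sketch of the builtin in action, assuming a local SparkSession and a made-up one-row DataFrame. The column names qp and qp_new follow the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("regexp-sketch").getOrCreate()
import spark.implicits._

// Made-up sample value containing both CRLF and LF line breaks.
val df = Seq("line one\r\nline two\n").toDF("qp")

// Remove every carriage return and line feed with the builtin function.
val cleaned = df.withColumn("qp_new", regexp_replace(col("qp"), "[\\r\\n]", ""))
val result = cleaned.select("qp_new").as[String].head()
```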

How to convert RDD of custom Java class objects to a DataFrame with toDF()?

I am trying to convert a Spark RDD to a Spark SQL dataframe with toDF(). I have used this function successfully many times, but in this case I'm getting a compiler error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[com.example.protobuf.SensorData]
Here is my code below:
// SensorData is an auto-generated class
import com.example.protobuf.SensorData
def loadSensorDataToRdd(): RDD[SensorData] = ???
object MyApplication {
  def main(argv: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("My application")
    conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val sensorDataRdd = loadSensorDataToRdd()
    val sensorDataDf = sensorDataRdd.toDF() // <-- CAUSES COMPILER ERROR
  }
}
I am guessing that the problem is with the SensorData class, which is a Java class that was auto-generated from a Protocol Buffer. What can I do in order to convert the RDD to a dataframe?
The reason for the compilation error is that there's no Encoder in scope to convert an RDD of com.example.protobuf.SensorData to a Dataset of com.example.protobuf.SensorData.
Encoders (ExpressionEncoders, to be exact) are used to convert InternalRow objects into JVM objects according to the schema (usually a case class or a Java bean).
You may be able to create an Encoder for the custom Java class using the bean method of the org.apache.spark.sql.Encoders object.
Creates an encoder for Java Bean of type T.
Something like the following:
import org.apache.spark.sql.Encoders
implicit val SensorDataEncoder = Encoders.bean(classOf[com.example.protobuf.SensorData])
If SensorData uses unsupported types, you'll have to map the RDD[SensorData] to an RDD of some simpler type(s), e.g. a tuple of the fields, and only then expect toDF to work.
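The tuple fallback could look roughly like the sketch below. SensorReading is a hypothetical stand-in for the generated SensorData class, and its two fields are invented for illustration; a real protobuf class would expose getters instead:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the auto-generated class: a plain class, not a
// case class, so Spark has no implicit Encoder for it.
class SensorReading(val sensorId: String, val value: Double) extends Serializable

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("todf-sketch").getOrCreate()
import spark.implicits._

val readings = spark.sparkContext.parallelize(
  Seq(new SensorReading("a", 1.5), new SensorReading("b", 2.0))
)

// Map each object to a tuple of simple fields; tuples of supported types
// have implicit encoders, so toDF becomes available.
val sensorDf = readings.map(r => (r.sensorId, r.value)).toDF("sensorId", "value")
```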