PySpark Pipeline Custom Transformer

I'm having some trouble understanding the creation of custom transformers for Pyspark pipelines.
I am writing a custom transformer that will take the dataframe column Company and remove stray commas:
from pyspark.ml import Transformer
from pyspark.sql.functions import regexp_replace

class DFCommaDropper(Transformer):
    def __init__(self, *args, **kwargs):
        self.name = 'CommaDropper'

    def transform(self, df):
        # strip stray commas from the Company column
        return df.withColumn('Company', regexp_replace('Company', ',', ''))
The above code is obviously wrong. I'm unsure what/how to initialize this and then how to use the initialized class instance in the transform function.
Thanks in advance for your help.

Related

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) {
    .......
    val releaseDate = tableDF
      .where(tableDF("item") <=> "releaseDate")
      .select("value")
      .map(r => r.getString(0))
      .collect
      .toList
      .head
  }
}
This is all good, except that Spark 2.3 suggests
No implicits found for parameter evidence$6: Encoder[String]
between map and collect:
map(r => r.getString(0))(...).collect
I understand I can add
import spark.implicits._
before the call, but that requires a SparkSession instance, which is pretty annoying when there is no SparkSession instance in the method. As a Spark newbie, how can I nicely resolve the implicit Encoder parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import the implicits, which will work for all case classes. This is the easiest way to add the encoding. Alternatively, an explicit encoder can be added using the Encoders class.
val spark = SparkSession.builder
  .appName("name")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._
The other way is to get the SparkSession from the DataFrame itself via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
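For completeness, here is a hedged sketch of the Encoders alternative mentioned above, reusing tableDF from the question: the Encoder[String] is passed to map explicitly instead of being picked up from spark.implicits._.
import org.apache.spark.sql.Encoders

// Supply the Encoder[String] explicitly in map's second parameter list
val releaseDate = tableDF
  .where(tableDF("item") <=> "releaseDate")
  .select("value")
  .map(r => r.getString(0))(Encoders.STRING)
  .collect
  .toList
  .head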

Spark UDF - Return type DataFrame

I have created a UDF which adds a Flag column to a DataFrame and returns the new DataFrame.
def find_mismatch = udf((df: DataFrame) => {
  df.withColumn("Flag", when(df("T_RTR_NUM").isNull && df("P_RTR_NUM").isNull,
    "Present in Flex but missing Trn and Platform"))
})
I am able to create the UDF, but when I pass a DataFrame into it, it errors out.
It works as a normal function, but as a Spark UDF it fails.
Also, please help me understand what difference it makes if I use a normal function instead of a Spark UDF.
Please help. I have attached a screenshot of the code.
You can't pass a DataFrame to a UDF: a DataFrame is handled by the Spark context, i.e. at the driver, and you can't pass it along to a UDF, which runs on the different executors (each holding only a fraction of a dataframe).
Specifically about the problem you're trying to solve: as mentioned by @Manoj, you don't actually need a UDF to get the result you need.
You can do this without a UDF, like below:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when

def findMismatch(df: Dataset[Row]): Dataset[Row] = {
  val transDF = df.withColumn("Flag", when(df("T_RTR_NUM").isNull && df("P_RTR_NUM").isNull,
    "Present in Flex but missing Trn and Platform"))
  transDF
}

val transDF = findMismatch(df)
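For contrast, if a UDF really were wanted here, it would have to operate on individual column values rather than on the DataFrame itself. A minimal sketch, assuming T_RTR_NUM and P_RTR_NUM are string columns in df:
import org.apache.spark.sql.functions.{col, udf}

// A UDF runs on the executors and only ever sees one row's values at a time
val flagRow = udf((tRtr: String, pRtr: String) =>
  if (tRtr == null && pRtr == null) "Present in Flex but missing Trn and Platform" else null)

val flagged = df.withColumn("Flag", flagRow(col("T_RTR_NUM"), col("P_RTR_NUM")))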

Create a map to call the POJO for each row of Spark Dataframe

I built an H2O model in R and saved the POJO code. I want to score parquet files in hdfs using the POJO but I'm not sure how to go about it. I plan on reading the parquet files into spark (scala/SparkR/PySpark) and scoring them on there. Below is the excerpt I found on H2O's documentation page.
"How do I run a POJO on a Spark Cluster?
The POJO provides just the math logic to do predictions, so you won’t find any Spark (or even H2O) specific code there. If you want to use the POJO to make predictions on a dataset in Spark, create a map to call the POJO for each row and save the result to a new column, row-by-row"
Does anyone have some example code of how I can do this? I'd greatly appreciate any assistance. I code primarily in R and SparkR, and I'm not sure how I can "map" the POJO to each line.
Thanks in advance.
I just posted a solution that actually uses DataFrame/Dataset. The post used a Star Wars dataset to build a model in R and then scored the MOJO on the test set in Spark. I'll paste the only relevant part here:
Scoring with Spark (and Scala)
You can use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be put under the lib folder of the root directory of your Spark application so it can be added as a dependency during compilation. The following code assumes you're running spark-shell. In order to use h2o-genmodel.jar, you need to append the jar file when launching spark-shell by providing the --jars flag. For example:
/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar
Now in the Spark shell, import the dependencies
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel
Using DataFrame
val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"

// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)

// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)

// Score
val dfScore = dfStarWars.map { x =>
  val r = new RowData
  r.put("height", x.getAs[String](1))
  r.put("mass", x.getAs[String](2))
  val score = easyModel.predictBinomial(r).classProbabilities
  (x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")
The variable score is an array of two scores, one for level 0 and one for level 1; score(1) is the score for level 1, which is "human". By default the map function returns a Dataset with the generic column names "_1", "_2", etc.; you can rename the columns by calling toDF.
Using Dataset
To use the Dataset API we just need to create two case classes, one for the input data, and one for the output.
case class StarWars(
  name: String,
  height: String,
  mass: String,
  is_human: String
)

case class Score(
  name: String,
  isHumanScore: Double
)

// Dataset
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map { x =>
  val r = new RowData
  r.put("height", x.height)
  r.put("mass", x.mass)
  val score = easyModel.predictBinomial(r).classProbabilities
  Score(x.name, score(1))
}
With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the column values have to be Strings, so you might need to manually cast them if the case class declares other types.
If you want to perform scoring with a POJO or MOJO in Spark, you should use RowData (provided within h2o-genmodel.jar) as the row-by-row input data for the easyPredict method that generates the scores.
Your solution will be to read the parquet file from HDFS and then, for each row, build a RowData object by filling in each entry and pass it to your POJO scoring function. Remember that POJO and MOJO both use exactly the same scoring function; the only difference is how the POJO class is used versus how the MOJO resources zip package is used. As MOJOs are backward compatible and work with any newer h2o-genmodel.jar, it is best to use a MOJO instead of a POJO.
Following is the full Scala code you can use on Spark to load a MOJO model and then do the scoring:
import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel
// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/gbm_model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)
// Get Mojo Details
var features = mojo.getNames.toBuffer
// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")
// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities
Here is an example of reading parquet files in Spark and then saving them as CSV. You can use the same code to read the parquet from HDFS and then pass each row as a RowData to the example above.
Here is a detailed example of using a MOJO model in Spark to perform scoring with RowData.
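As a rough sketch of that flow (the HDFS paths and column names here are placeholders, and it assumes spark-shell so the implicits needed by map are already in scope), reading parquet from HDFS and scoring each row through RowData with the easyModel built above might look like:
// Read parquet from HDFS and score each row via RowData (paths/columns are placeholders)
val parquetDF = spark.read.parquet("hdfs:///path/to/input.parquet")
val scored = parquetDF.map { row =>
  val r = new RowData
  r.put("AGE", row.getAs[Any]("AGE").toString)
  r.put("RACE", row.getAs[Any]("RACE").toString)
  easyModel.predictBinomial(r).classProbabilities(1)
}.toDF("score")
// Save the scores as CSV, mirroring the parquet-to-CSV example mentioned above
scored.write.option("header", "true").csv("hdfs:///path/to/output_csv")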

Create SparkSQL UDF with non-serializable objects

I'm trying to write a UDF that I would like to use on Hive tables in an sqlContext. Is it in any way possible to include objects from other libraries that are not serializable? Here's a minimal example of what does not work:
def myUDF(s: String) = {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
}
I register the function in the Spark shell as a UDF:
val encoding = sqlContext.udf.register("encoder", myUDF)
If I try to run it on a table "test"
sqlContext.sql("SELECT encoder(colname) from test").show()
I get the error
org.apache.spark.SparkException: Task not serializable
object not serializable (class: sun.misc.BASE64Encoder, value: sun.misc.BASE64Encoder@4a7f9a94)
Is there a workaround for this? I tried embedding myUDF in an object and in a class but that didn't work either.
You can try defining the udf function as:
def encoder = udf((s: String) => {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
})
And call the udf function as
dataframe.withColumn("encoded", encoder(col("id"))).show
Updated
As @santon has pointed out, a BASE64Encoder is instantiated for each row in the dataframe, which might lead to performance issues. The solution is to create a single static BASE64Encoder instance and call it from within the udf function.
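A minimal sketch of that suggestion, assuming compiled application code rather than a shell paste (the REPL wraps definitions in outer classes, which can change what gets serialized); Base64Helper is a hypothetical name:
import org.apache.spark.sql.functions.udf

// Keep the encoder in an object so each executor initializes its own copy
// instead of the closure trying to serialize the driver's instance.
// sun.misc.BASE64Encoder is kept only to mirror the question; java.util.Base64
// would be the modern choice.
object Base64Helper {
  lazy val coder = new sun.misc.BASE64Encoder
}

def encoder = udf((s: String) => Base64Helper.coder.encode(s.getBytes("UTF-8")))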

How can I call a UDF in a UDF?

Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate using this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM; I want to eventually replace this with a Scala macro, but for now it uses Janino and SimpleCompiler to compile the code and reload the class.
Sometimes these model strings contain variables and functions. The variables are easy to handle since they can be string-replaced into the generated code, and the functions are mostly easy too, because if their names map to an existing static function then that function is simply executed when the model is evaluated. For instance, an exponent maps to Math.pow in scala.math.
So my issue is specifically implementing lag and lead functions for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while this function would be called inside of a UDF, so I am having trouble referencing that data back from the top.
So I have this code
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}

object Functions {
  val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
  val ctx: SparkContext = new SparkContext(conf)
  val hctx: HiveContext = new HiveContext(ctx)

  import hctx.implicits._

  def lag(x: Double, window: Int): Double = {
    x
  }

  def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
    slag(c, window).over(windowSpec).as(c.toString() + "_lag")
  }

  def main(args: Array[String]): Unit = {
    val funcUdf = udf((f: Column) => lag(f))
    val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
    implicit val spec: WindowSpec = Window.orderBy($"value")
    data.select(funcUdf($"value")).show()
  }
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way around this, or do I have to compute the lag windows ahead of time and pass them all around?
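For what it's worth, here is a hedged sketch of the "compute the lag windows ahead of time" option the question itself raises, reusing data from the code above; evalModel is a hypothetical stand-in for the dynamically generated model evaluation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lit, udf, lag => slag}

// Precompute the lagged column with the DataFrame API...
val spec = Window.orderBy(col("value"))
val withLag = data.withColumn("value_lag",
  coalesce(slag(col("value"), 1).over(spec), lit(0)))

// ...so the UDF only ever sees plain per-row values
val evalModel = udf((current: Int, lagged: Int) => current - lagged)
withLag.select(evalModel(col("value"), col("value_lag"))).show()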