How to write a custom Transformer in MLlib? - scala

I want to write a custom Transformer for a Pipeline in Spark 2.0 in Scala. So far it is not really clear to me what the copy or transformSchema methods should return. Is it correct that they return null, as in https://github.com/SupunS/play-ground/blob/master/test.spark.client_2/src/main/java/CustomTransformer.java for copy?
As Transformer extends PipelineStage, I conclude that fit calls the transformSchema method. Do I understand correctly that transformSchema is similar to sk-learn's fit?
As my Transformer should join the dataset with a (very small) second dataset, I want to store that one in the serialized pipeline as well. How should I store it in the transformer so that it works properly with the pipeline's serialization mechanism?
What would a simple transformer look like that computes the mean for a single column, fills the NaN values, and persists this value?
@SerialVersionUID(serialVersionUID) // TODO store ibanList in copy + persist
class Preprocessor2(someValue: Dataset[SomeOtherValues]) extends Transformer {

  def transform(df: Dataset[MyClass]): DataFrame = {
    ???
  }

  override def copy(extra: ParamMap): Transformer = {
    ???
  }

  override def transformSchema(schema: StructType): StructType = {
    schema
  }
}

transformSchema should return the schema which is expected after applying the Transformer. Example:
If the transformer adds a column of IntegerType, and the output column name is foo:
import org.apache.spark.sql.types._
override def transformSchema(schema: StructType): StructType = {
schema.add(StructField("foo", IntegerType))
}
So if the schema is not changed for the dataset, because only a value is filled in for mean imputation, should I return the original case class as the schema?
That is not possible in Spark SQL (and MLlib, too), since a Dataset is immutable once created. You can only add or "replace" (which is an add followed by a drop) columns.
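As for the other part of the question, what copy should return: it should not return null, since the Pipeline machinery calls it. A minimal sketch, assuming the usual MLlib pattern: defaultCopy(extra) is enough when the class has a uid-only constructor; for a transformer that carries a constructor argument such as someValue, create a new instance yourself and let copyValues transfer the params:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap

override def copy(extra: ParamMap): Transformer = {
  // re-create the instance with its constructor argument,
  // then copy over the current param values plus the extra overrides
  val copied = new Preprocessor2(someValue)
  copyValues(copied, extra)
}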

First of all, I'm not sure you want a Transformer per se (or UnaryTransformer as @LostInOverflow suggested in the answer), as you said:
What would a simple transformer look like that computes the mean for a single column, fills the NaN values, and persists this value?
For me, it's as if you wanted to apply an aggregate function (aka aggregation) and "join" it with all the columns to produce the final value or NaN.
It looks like you want a groupBy to compute the mean, followed by a join, which could also be done as a window aggregation.
Anyway, I'd start with a UnaryTransformer which would solve the first issue in your question:
So far it is not really clear to me what the copy or transformSchema methods should return. Is it correct that they return null?
See the complete project spark-mllib-custom-transformer at GitHub, in which I implemented a UnaryTransformer that upper-cases a string column; it looks as follows:
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class UpperTransformer(override val uid: String)
  extends UnaryTransformer[String, String, UpperTransformer] {

  def this() = this(Identifiable.randomUID("upp"))

  override protected def createTransformFunc: String => String = {
    _.toUpperCase
  }

  override protected def outputDataType: DataType = StringType
}
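A quick usage sketch (the column and value names are made up for illustration). UnaryTransformer already supplies working copy and transformSchema implementations, so neither has to return null:

import spark.implicits._

val df = Seq("hello", "world").toDF("text")

new UpperTransformer()
  .setInputCol("text")
  .setOutputCol("upper_text")
  .transform(df)
  .show() // adds an upper_text column containing HELLO and WORLD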

Related

Update spark schema by converting to dataset

I would like to update the schema of a Spark DataFrame by first converting it to a dataset which contains fewer columns. Background: I would like to remove some deeply nested fields from a schema.
I tried the following, but the schema does not change:
import org.apache.spark.sql.functions._
val initial_df = spark.range(10).withColumn("foo", lit("foo!")).withColumn("bar", lit("bar!"))
case class myCaseClass(bar: String)
val reduced_ds = initial_df.as[myCaseClass]
The schema still includes the other fields:
reduced_ds.schema // StructType(StructField(id,LongType,false),StructField(foo,StringType,false),StructField(bar,StringType,false))
Is there a way to update the schema that way?
It also confuses me that when I collect the dataset it only returns the fields defined in the case class:
reduced_ds.limit(1).collect() // Array(myCaseClass(bar!))
Add a fake map operation to force the projection using the predefined identity function:
import org.apache.spark.sql.functions._
val initial_df = spark.range(10).withColumn("foo", lit("foo!")).withColumn("bar", lit("bar!"))
case class myCaseClass(bar: String)
val reduced_ds = initial_df.as[myCaseClass].map(identity)
This yields
reduced_ds.schema // StructType(StructField(bar,StringType,true))
in the doc: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#as%5BU%5D(implicitevidence$2:org.apache.spark.sql.Encoder%5BU%5D):org.apache.spark.sql.Dataset%5BU%5D
it says:
Note that as[] only changes the view of the data that is passed into
typed operations, such as map(), and does not eagerly project away any
columns that are not present in the specified class.
To achieve what you want to do, you need to project first:
initial_df.select(<the columns in myCaseClass>).as[myCaseClass]
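For this concrete example, where myCaseClass only has bar, that would be (a small sketch):

val reduced_ds = initial_df.select("bar").as[myCaseClass]
reduced_ds.schema // now contains only the bar field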
It is normal: when you collect reduced_ds it returns records of type myCaseClass, and myCaseClass has only one attribute, named bar. That does not conflict with the fact that the dataset's schema is something else.

Avoid specifying schema twice (Spark/scala)

I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column.
Also, my strong preference is to do it in a generic way so I do not have to list all the columns of a row and do df.as[my_record] or case Row(...) => as shown here. Instead, I want to access row columns by their names and just add the result column(s) to the source row.
The approach below works just fine, but I'd like to avoid specifying the schema twice: the first time so that I can access columns by name while iterating, and the second time to process the output.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

val q = """
select 2 part, 1 id
union all select 2 part, 4 id
union all select 2 part, 3 id
union all select 2 part, 2 id
"""
val df = spark.sql(q)

def f_row(iter: Iterator[Row]): Iterator[Row] = {
  if (iter.hasNext) {
    def complex_logic(p: Int): Integer = if (p == 3) null else p * 10

    val head = iter.next
    val schema = StructType(head.schema.fields :+ StructField("result", IntegerType))
    val r =
      new GenericRowWithSchema((head.toSeq :+ complex_logic(head.getAs("id"))).toArray, schema)

    iter.scanLeft(r)((r1, r2) =>
      new GenericRowWithSchema((r2.toSeq :+ complex_logic(r2.getAs("id"))).toArray, schema)
    )
  } else iter
}

val schema = StructType(df.schema.fields :+ StructField("result", IntegerType))
val encoder = RowEncoder(schema)

df.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row)(encoder).show
What information is lost after applying mapPartitions such that the output cannot be processed without an explicit encoder? How can I avoid specifying it?
What information is lost after applying mapPartitions such that the output cannot be processed without an explicit encoder?
The information is hardly lost - it wasn't there from the beginning - subclasses of Row or InternalRow are basically untyped, variable-shape containers, which don't provide any useful type information that could be used to derive an Encoder.
The schema in GenericRowWithSchema is inconsequential, as it describes the content in terms of metadata, not types.
How can I avoid specifying it?
Sorry, you're out of luck. If you want to use dynamically typed constructs (a bag of Any) in a statically typed language, you have to pay the price, which here is providing an Encoder.
OK, I have checked some of my Spark code, and using .mapPartitions with the Dataset API does not require me to explicitly build/pass an encoder.
You need something like:
case class Before(part: Int, id: Int)
case class After(part: Int, id: Int, newCol: String)

import spark.implicits._

// Note: column names/types must match the case class constructor parameters.
val beforeDS = <however you obtain your input DF>.as[Before]

def f_row(it: Iterator[Before]): Iterator[After] = ???

beforeDS.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row).show
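For illustration only, a hypothetical f_row that mirrors the original complex_logic (nulling out the result when id == 3) could look like this:

def f_row(it: Iterator[Before]): Iterator[After] =
  it.map { b =>
    // same rule as complex_logic in the question, expressed on the typed record
    After(b.part, b.id, if (b.id == 3) null else (b.id * 10).toString)
  }

The rest of the chain above stays the same; the Encoder[After] is derived from spark.implicits._, which is why no explicit encoder is needed.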
I found the explanation below sufficient; maybe it will be useful for others.
mapPartitions requires an Encoder because otherwise it cannot construct a Dataset from an iterator of Rows. Even though each row has a schema, that schema cannot be derived (used) by the constructor of Dataset[U].
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
On the other hand, without calling mapPartitions, Spark can use the schema derived from the initial query because the structure (metadata) of the original columns is not changed.
I described alternatives in this answer: https://stackoverflow.com/a/53177628/7869491.

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame which is derived from the value of the same row of an existing column.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member and providing an alternate constructor to this subclass that does the init normally performed by the udf function, but so far I've been unable to get it to do the kind of type coercion that udf does: some deep type inference wants objects of type org.apache.spark.sql.Column when my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
  extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code:
class IntBox {
  var valu = 0
  def increment {
    valu = valu + 1
  }
  def get: Int = {
    return valu
  }
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      return in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1; so the lazy object is being reinitialized per-record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even being lazy, it will still reinitialize if it's defined in the scope of the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors)
Capture the object's lazy member in the closure of the UDF lambda
Something like this:
object ExpensiveReference {
  lazy val ref = ...
}

val persistentUDF = udf((input: String) => {
  /* transform code that references ExpensiveReference.ref */
})
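Hypothetical usage (the column names here are assumptions, not from the original code):

val result = df.withColumn("derived", persistentUDF($"input"))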
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) is going to instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF or a UDAF or any other transformation. The point is that once a computation happens on a Spark executor, the "very-expensive-to-construct reference object" is constructed only once, not once per record.
It could be in a UDF (just to make it clearer).
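For example, a minimal sketch (column names are assumptions; whether you call transform on the object or inline the lambda, as the accepted workaround above does, depends on what serializes cleanly in your setup):

import org.apache.spark.sql.functions.udf

val transformUDF = udf((s: String) => ExpensiveReference.transform(s))
df.withColumn("derived", transformUDF($"input"))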

How do I create a Row with a defined schema in Spark-Scala?

I'd like to create a Row with a schema from a case class to test one of my map functions. The most straightforward way I can think of doing this is:
import org.apache.spark.sql.Row

case class MyCaseClass(foo: String, bar: Option[String])

def buildRowWithSchema(record: MyCaseClass): Row = {
  sparkSession.createDataFrame(Seq(record)).collect.head
}
However, this seemed like a lot of overhead to just get a single Row, so I looked into how I could directly create a Row with a schema. This led me to:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.{Encoders, Row}

def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => {
    field.setAccessible(true)
    field.get(record)
  })
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
Unfortunately, the Row that the second version returns is different from the first Row. Option fields in the first version are reduced to their primitive values, while they are still Options in the second version. Also, the second version is quite unwieldy.
Is there a better way to do this?
The second version returns the Option itself for the bar case class field, thus you are not getting the primitive value as in the first version. You can use the following code to get primitive values:
def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => {
    field.setAccessible(true)
    val returnValue = field.get(record)
    if (returnValue.isInstanceOf[Option[String]]) {
      // unwrap the Option; None becomes null, matching what createDataFrame produces
      returnValue.asInstanceOf[Option[String]].orNull
    } else {
      returnValue
    }
  })
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
But meanwhile, I would suggest you use a DataFrame or Dataset.
A DataFrame and a Dataset are themselves collections of Rows with a schema.
So when you have a case class defined, you just need to encode your input data into the case class.
For example:
Let's say you have input data like
val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null))
If you have a text file you can read it with sparkContext.textFile and split it according to your needs.
Now when you have converted your data to an RDD, converting it to a DataFrame or Dataset is two lines of code:
import sqlContext.implicits._
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF
.toDS would generate a Dataset.
Thus you have a collection of Rows with a schema.
For validation you can do the following:
println(dataFrame.schema) //for checking if there is schema
println(dataFrame.take(1).getClass.getName) //for checking if it is a collection of Rows
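Coming back to the original question, a single Row collected from such a DataFrame already carries the schema (it is a GenericRowWithSchema under the hood), so, as a sketch:

val rowWithSchema: Row = dataFrame.head()
rowWithSchema.schema // the StructType derived from MyCaseClass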
Hope you have the right answer.

UDAF in Spark with multiple input columns

I am trying to develop a user-defined aggregate function that computes a linear regression on a series of numbers. I have successfully done a UDAF that calculates confidence intervals of means (with a lot of trial and error and SO!).
Here's what actually runs for me already:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, LongType, DataType, ArrayType}

case class RegressionData(intercept: Double, slope: Double)

class Regression {
  import org.apache.commons.math3.stat.regression.SimpleRegression

  def roundAt(p: Int)(n: Double): Double = { val s = math pow (10, p); (math round n * s) / s }

  def getRegression(data: List[Long]): RegressionData = {
    val regression: SimpleRegression = new SimpleRegression()
    data.view.zipWithIndex.foreach { d =>
      regression.addData(d._2.toDouble, d._1.toDouble)
    }
    RegressionData(roundAt(3)(regression.getIntercept()), roundAt(3)(regression.getSlope()))
  }
}

class UDAFRegression extends UserDefinedAggregateFunction {
  import java.util.ArrayList

  def deterministic = true

  def inputSchema: StructType =
    new StructType().add("units", LongType)

  def bufferSchema: StructType =
    new StructType().add("buff", ArrayType(LongType))

  def dataType: DataType =
    new StructType()
      .add("intercept", DoubleType)
      .add("slope", DoubleType)

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, new ArrayList[Long]())
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val longList: ArrayList[Long] = new ArrayList[Long](buffer.getList(0))
    longList.add(input.getLong(0))
    buffer.update(0, longList)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val longList: ArrayList[Long] = new ArrayList[Long](buffer1.getList(0))
    longList.addAll(buffer2.getList(0))
    buffer1.update(0, longList)
  }

  def evaluate(buffer: Row) = {
    import scala.collection.JavaConverters._
    val list = buffer.getList(0).asScala.toList
    val regression = new Regression
    regression.getRegression(list)
  }
}
However, the datasets do not come in order, which is obviously very important here. Hence instead of regression($"longValue") I need a second param: regression($"longValue", $"created_day"). created_day is a sql.types.DateType.
I am pretty confused by DataTypes, StructTypes and what-not, and due to the lack of examples on the web, I got stuck with my trial-and-error attempts here.
What would my bufferSchema look like?
Are those StructTypes overhead in my case? Wouldn't a (mutable) Map just do? Is MapType actually immutable, and isn't that rather pointless for a buffer type?
What would my inputSchema look like?
Does this have to match the type I retrieve in update(), in my case via input.getLong(0)?
Is there a standard way to reset the buffer in initialize()?
I have seen buffer.update(0, 0.0) (when it contains Doubles, obviously), buffer(0) = new WhatEver() and I think even buffer = Nil. Does any of these make a difference?
How to update data?
The example above seems overcomplicated. I was expecting to be able to do something like buffer += input.getLong(0) -> input.getDate(1).
Can I expect to access the input this way?
How to merge data?
Can I just leave the function block empty like
def merge(…) = {}?
The challenge of sorting that buffer in evaluate() is something I should be able to figure out, although I am still interested in the most elegant ways you would do this (in a fraction of the time).
Bonus question: What role does dataType play?
I return a case class, not the StructType defined in dataType, which does not seem to be an issue. Or is it only working because it happens to match my case class?
Maybe this will clear things up.
The UDAF APIs work on DataFrame Columns. Everything you are doing has to get serialized just like all the other Columns in the DataFrame. As you note, only the immutable MapType is supported, because that is the only thing you can put in a Column. With immutable collections, you just create a new collection that contains the old collection plus a value:
var map = Map[Long,Long]()
map = map + (0L -> 1234L)
map = map + (1L -> 4567L)
Yes, just like working with any DataFrame, your types have to match. Doing buffer.getInt(0) when there's really a LongType there is going to be a problem.
There's no standard way to reset the buffer other than whatever makes sense for your data type / use case. Maybe zero is actually last month's balance; maybe zero is a running average from another dataset; maybe zero is a null or an empty string, or maybe zero is really zero.
merge is an optimization that only happens in certain circumstances, if I remember correctly -- a way to sub-total that the SQL optimizer may use if the circumstances warrant it. I just use the same function I use for update.
A case class will automatically get converted to the appropriate schema, so for the bonus question the answer is yes, it's because the schemas match. Change the dataType so it does not match and you will get an error.
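To make that concrete for the two-column case, here is a hypothetical, untested sketch of what inputSchema, bufferSchema and a sort-then-evaluate could look like with an immutable MapType buffer keyed by the date. It assumes one value per created_day, reuses the Regression class from the question, and converts the date to an epoch-day Long so the map can be sorted chronologically:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class UDAFOrderedRegression extends UserDefinedAggregateFunction {
  def deterministic = true

  // two input columns: the value and the day that defines the ordering
  def inputSchema: StructType =
    new StructType().add("units", LongType).add("created_day", DateType)

  // immutable map buffer: epoch day -> value
  def bufferSchema: StructType =
    new StructType().add("buff", MapType(LongType, LongType))

  def dataType: DataType =
    new StructType().add("intercept", DoubleType).add("slope", DoubleType)

  def initialize(buffer: MutableAggregationBuffer) =
    buffer.update(0, Map.empty[Long, Long])

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    // types must match inputSchema: getLong for units, getDate for created_day
    val day = input.getDate(1).toLocalDate.toEpochDay
    buffer.update(0, buffer.getMap[Long, Long](0).toMap + (day -> input.getLong(0)))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) =
    buffer1.update(0, buffer1.getMap[Long, Long](0).toMap ++ buffer2.getMap[Long, Long](0))

  def evaluate(buffer: Row) = {
    // sort by day, then feed the values to the regression in chronological order
    val ordered = buffer.getMap[Long, Long](0).toList.sortBy(_._1).map(_._2)
    new Regression().getRegression(ordered)
  }
}

merge still has to combine the two partial maps even if it is only called in some plans, and dataType works here for the same reason as in the original: the returned RegressionData case class matches the declared struct.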