Scala: how to modify the default metric for cross validation

I found the code below on this site:
https://spark.apache.org/docs/2.3.1/ml-tuning.html
// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
  .setParallelism(2)  // Evaluate up to 2 parameter settings in parallel
As stated there, the default metric for BinaryClassificationEvaluator is areaUnderROC ("AUC").
How can I change this default metric to the F1-score?
I tried:
// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator.setMetricName("f1"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
  .setParallelism(2)  // Evaluate up to 2 parameter settings in parallel
But I got some errors.
I searched many sites but could not find the solution.

setMetricName only accepts "areaUnderPR" or "areaUnderROC". You will need to write your own Evaluator; something like this:
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.shared.{HasLabelCol, HasPredictionCol}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{Dataset, functions => F}
class FScoreEvaluator(override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol {

  def this() = this(Identifiable.randomUID("FScoreEvaluator"))

  def evaluate(dataset: Dataset[_]): Double = {
    // Labels and predictions are assumed to be 0/1 values
    val truePositive = F.sum(((F.col(getLabelCol) === 1) && (F.col(getPredictionCol) === 1)).cast(IntegerType))
    val predictedPositive = F.sum((F.col(getPredictionCol) === 1).cast(IntegerType))
    val actualPositive = F.sum((F.col(getLabelCol) === 1).cast(IntegerType))

    val precision = truePositive / predictedPositive
    val recall = truePositive / actualPositive
    val fScore = F.lit(2) * (precision * recall) / (precision + recall)

    dataset.select(fScore).collect()(0)(0).asInstanceOf[Double]
  }

  override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
}
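
With that in place, the custom evaluator can be dropped straight into the CrossValidator from the question. A minimal sketch (assuming the default "label" and "prediction" column names provided by HasLabelCol/HasPredictionCol, and the pipeline and paramGrid from the question):

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new FScoreEvaluator()) // Evaluator's isLargerBetter defaults to true, which is correct for F1
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)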

Based on the answer of @gmds. Make sure your Spark version is >= 2.3.
You can also follow the implementation of RegressionEvaluator in Spark to implement other custom evaluators.
I also added isLargerBetter so that the instantiated evaluator can be used in model selection (e.g. CV).
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.shared.{HasLabelCol, HasPredictionCol, HasWeightCol}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{Dataset, functions => F}
class WRmseEvaluator(override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol with HasWeightCol {

  def this() = this(Identifiable.randomUID("wrmseEval"))

  def setPredictionCol(value: String): this.type = set(predictionCol, value)
  def setLabelCol(value: String): this.type = set(labelCol, value)
  def setWeightCol(value: String): this.type = set(weightCol, value)

  def evaluate(dataset: Dataset[_]): Double = {
    // Weighted RMSE: sqrt(sum(w * residual^2) / sum(w))
    dataset
      .withColumn("residual", F.col(getLabelCol) - F.col(getPredictionCol))
      .select(
        F.sqrt(F.sum(F.col(getWeightCol) * F.pow(F.col("residual"), 2)) / F.sum(getWeightCol))
      )
      .collect()(0)(0).asInstanceOf[Double]
  }

  override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)

  // Lower RMSE is better, so tell model selection to minimize
  override def isLargerBetter: Boolean = false
}
The following is how to use it.
val wrmseEvaluator = new WRmseEvaluator()
  .setLabelCol(labelColName)
  .setPredictionCol(predColName)
  .setWeightCol(weightColName)

Assert RDD is not sorted

I have a method called split that accepts an RDD[T] and a splitSize and returns an Array[RDD[T]].
Now, one of the test cases I am writing for it should verify that this function also randomly shuffles the RDD.
So I create a sorted RDD and then check the results:
it should "randomize shuffle" in {
val inputRDD = sc.parallelize((0 until 16))
val result = RDDUtils.split(inputRDD, 2)
result.foreach(rdd => {
rdd.collect.foreach(println)
})
// Asset result is not sorted
}
If the results are:
0
1
2
3
..
15
Then it's not working as expected.
A good result can be something like:
11
3
9
14
...
1
6
How can I assert that the output Array[RDD[T]] is not sorted?
You could try something like this:
val resultOrder = result.sortBy(....)
assert(!resultOrder.sameElements(result))
or
val resultOrder = result.sortBy(....)
assert(resultOrder.toList != result.toList)
It's important to note that the key is knowing how to sort the Array. For an Integer data type it is easy, but for a complex data type you may need an implicit Ordering for your type, e.g.:
implicit val ordering: Ordering[T] =
  Ordering.fromLessThan[T]((sa: T, sb: T) => sa < sb)

// OR

implicit val ordering: Ordering[MyClass] =
  Ordering.fromLessThan[MyClass]((sa: MyClass, sb: MyClass) => sa.field1 < sb.field1)
The exact code would depend on your data type.
As a full example of this:
package tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortArrayRDD {

  val spark = SparkSession
    .builder()
    .appName("SortArrayRDD")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "SortArrayRDD")      // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {
    try {
      Logger.getRootLogger.setLevel(Level.ERROR)

      val arrRDD: Array[RDD[Int]] = Array(
        sc.parallelize(List(2,3)), sc.parallelize(List(10,11)), sc.parallelize(List(6,7)), sc.parallelize(List(8,9)),
        sc.parallelize(List(4,5)), sc.parallelize(List(0,1)), sc.parallelize(List(12,13)), sc.parallelize(List(14,15)))

      val aux = arrRDD

      // Order the RDDs by the sum of their elements
      implicit val ordering: Ordering[RDD[Int]] = Ordering.fromLessThan[RDD[Int]]((sa: RDD[Int], sb: RDD[Int]) => sa.sum() < sb.sum())

      aux.sorted.foreach(rdd => println(rdd.collect().mkString(",")))

      val resultOrder = aux.sorted

      assert(!resultOrder.sameElements(arrRDD))
      println("It's unordered")
    } finally {
      sc.stop()
    }
  }
}
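A lighter-weight variant (a sketch, not from the answer above): collect each RDD, flatten the pieces, and compare the flattened sequence against its sorted self:

val flattened: Array[Int] = result.flatMap(_.collect())
assert(!(flattened sameElements flattened.sorted))

Keep in mind that a random shuffle can occasionally produce a sorted order by chance, so assertions like this are inherently flaky for very small inputs.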

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one (say, the n-th) column on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features, put everything into a Pipeline, and finally fit my trainData and get predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features

val featureEncoder = new OneHotEncoderEstimator()
  .setInputCols(Array(sIndexer1.getOutputCol), ...)
  .setOutputCols(Array("encoded_feature1", ... ))

// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below

val vectorAssembler = new VectorAssembler()
  .setInputCols(featureEncoder.getOutputCols + ???)
  .setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ..., featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is that my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to be applied to trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works just as expected separately, e.g. on trainData. Basically, oheEncodingTopN applies one-hot encoding on the input column, but only for the top N frequent values (e.g. N = 50), and puts all the remaining infrequent values in a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column

def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
  df.createOrReplaceTempView("data")
  val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")

  val pivotTopNDF = topNDF.
    groupBy(colName).
    pivot(colName).
    count().
    withColumn("default", lit(1))

  val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)

  val oheEncodedDF = joinedTopNDF.
    na.fill(0, joinedTopNDF.columns).
    withColumn("default", flip(col("default")))

  oheEncodedDF
}
I think the cleanest way would be to create your own class that extends Spark ML's Transformer, so that you can play with it as you would with any other transformer (like OneHotEncoder). Your class would look like this:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Dataset, Column}

class OHEncodingTopN(n: Int, override val uid: String) extends Transformer {

  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  def this(n: Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))

  def copy(extra: ParamMap): OHEncodingTopN = {
    defaultCopy(extra)
  }

  override def transformSchema(schema: StructType): StructType = {
    // Check that the input type is what you want if needed
    // val idx = schema.fieldIndex($(inputCol))
    // val field = schema.fields(idx)
    // if (field.dataType != StringType) {
    //   throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
    // }
    // Add the return field
    schema.add(StructField($(outputCol), IntegerType, false))
  }

  def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

  def transform(df: Dataset[_]): DataFrame = {
    df.createOrReplaceTempView("data")
    val colName = $(inputCol)
    val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")

    val pivotTopNDF = topNDF.
      groupBy(colName).
      pivot(colName).
      count().
      withColumn("default", lit(1))

    val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)

    val oheEncodedDF = joinedTopNDF.
      na.fill(0, joinedTopNDF.columns).
      withColumn("default", flip(col("default")))

    oheEncodedDF
  }
}
Now on an OHEncodingTopN object you should be able to call .getOutputCol to perform what you want. Good luck.
EDIT: the method that I just copy-pasted into the transform method should be slightly modified in order to output a column of type Vector having the name given in setOutputCol.
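To wire it into the original pipeline, something like the following sketch could work (the column and stage names are the hypothetical ones from the question, and it assumes transform has been adjusted to emit a Vector column as noted in the EDIT above):

val oheTopN = new OHEncodingTopN(50)
  .setInputCol("my_featureN")
  .setOutputCol("encoded_featureN")

val vectorAssembler = new VectorAssembler()
  .setInputCols(featureEncoder.getOutputCols :+ "encoded_featureN")
  .setOutputCol("assembled_features")

val pipeline = new Pipeline().setStages(Array(sIndexer1, featureEncoder, oheTopN, vectorAssembler, myClassifier))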

flink sink to parquet file with AvroParquetWriter is not writing data to file

I am trying to write a Parquet file as a sink using AvroParquetWriter. The file is created but has zero length (no data is written). Am I doing something wrong? I couldn't figure out what the problem is.
import io.eels.component.parquet.ParquetWriterConfig
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import scala.io.Source
import org.apache.flink.streaming.api.scala._

object Tester extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  def now = System.currentTimeMillis()
  val path = new Path(s"/tmp/test-$now.parquet")

  val schemaString = Source.fromURL(getClass.getResource("/request_schema.avsc")).mkString
  val schema: Schema = new Schema.Parser().parse(schemaString)
  val compressionCodecName = CompressionCodecName.SNAPPY
  val config = ParquetWriterConfig()

  val genericRecord: GenericRecord = new GenericData.Record(schema)
  genericRecord.put("name", "test_b")
  genericRecord.put("code", "NoError")
  genericRecord.put("ts", 100L)

  val stream = env.fromElements(genericRecord)

  val writer: ParquetWriter[GenericRecord] = AvroParquetWriter.builder[GenericRecord](path)
    .withSchema(schema)
    .withCompressionCodec(compressionCodecName)
    .withPageSize(config.pageSize)
    .withRowGroupSize(config.blockSize)
    .withDictionaryEncoding(config.enableDictionary)
    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
    .withValidation(config.validating)
    .build()

  writer.write(genericRecord)

  stream.addSink { r =>
    writer.write(r)
  }

  env.execute()
}
The problem is that you don't close the ParquetWriter. This is necessary to flush pending elements to disk. You could solve the problem by defining your own RichSinkFunction where you close the ParquetWriter in the close method:
// Imports needed in addition to those from the question (Flink 1.x API)
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class ParquetWriterSink(val path: String, val schema: String, val compressionCodecName: CompressionCodecName, val config: ParquetWriterConfig) extends RichSinkFunction[GenericRecord] {
  var parquetWriter: ParquetWriter[GenericRecord] = null

  override def open(parameters: Configuration): Unit = {
    // The writer is created once per parallel sink instance, on the task manager
    parquetWriter = AvroParquetWriter.builder[GenericRecord](new Path(path))
      .withSchema(new Schema.Parser().parse(schema))
      .withCompressionCodec(compressionCodecName)
      .withPageSize(config.pageSize)
      .withRowGroupSize(config.blockSize)
      .withDictionaryEncoding(config.enableDictionary)
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .withValidation(config.validating)
      .build()
  }

  override def close(): Unit = {
    // Closing flushes buffered rows and writes the Parquet footer
    parquetWriter.close()
  }

  override def invoke(value: GenericRecord, context: SinkFunction.Context[_]): Unit = {
    parquetWriter.write(value)
  }
}
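With this sink the job no longer shares a single writer between the driver and the sink tasks. A minimal usage sketch (reusing now, schemaString, compressionCodecName, and config as defined in the question):

stream.addSink(new ParquetWriterSink(s"/tmp/test-$now.parquet", schemaString, compressionCodecName, config))
env.execute()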

What are the ways to convert a String into runnable code?

I could not find how to convert a String into runnable code, for instance:
val i = "new String('Yo')"
// conversion
println(i)
should print
Yo
after the conversion.
I found the following example in another post:
import scala.tools.nsc.interpreter.ILoop
import java.io.StringReader
import java.io.StringWriter
import java.io.PrintWriter
import java.io.BufferedReader
import scala.tools.nsc.Settings

object FuncRunner extends App {
  val line = "sin(2 * Pi * 400 * t)"
  val lines = """import scala.math._
                |var t = 1""".stripMargin
  val in = new StringReader(lines + "\n" + line + "\nval f = (t: Int) => " + line)
  val out = new StringWriter
  val settings = new Settings
  val looper = new ILoop(new BufferedReader(in), new PrintWriter(out))
  val res = looper process settings
  Console println s"[$res] $out"
}
link: How to convert a string from a text input into a function in a Scala
But it seems like scala.tools is not available anymore, and I'm a newbie in Scala, so I could not figure out how to replace it.
And maybe there are just other ways to do it now.
Thanks!
You can simply execute code contained inside a String using quasiquotes (an experimental module).
import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// To compile and run code we will use the ToolBox API.
val toolbox = currentMirror.mkToolBox()

// Write your code starting with q and put it inside double quotes.
// NOTE: you will have to use triple quotes if your code contains any double quotes.
val code1 = q"""new String("hello")"""

// Compile and run your code.
val result1 = toolbox.compile(code1)()

// Another example
val code2 = q"""
  case class A(name: String, age: Int) {
    def f = (name, age)
  }
  val a = new A("Your Name", 22)
  a.f
"""
val result2 = toolbox.compile(code2)()
Output in REPL:
// Exiting paste mode, now interpreting.
import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
toolbox: scala.tools.reflect.ToolBox[reflect.runtime.universe.type] = scala.tools.reflect.ToolBoxFactory$ToolBoxImpl@69b34f89
code1: reflect.runtime.universe.Tree = new String("hello")
result1: Any = hello
code2: reflect.runtime.universe.Tree =
{
case class A extends scala.Product with scala.Serializable {
<caseaccessor> <paramaccessor> val name: String = _;
<caseaccessor> <paramaccessor> val age: Int = _;
def <init>(name: String, age: Int) = {
super.<init>();
()
};
def f = scala.Tuple2(name, age)
};
val a = new A("Your Name", 22);
a.f
}
result2: Any = (Your Name,22)
scala>
To learn more about quasiquotes:
http://docs.scala-lang.org/overviews/quasiquotes/setup.html
I found a simple solution using the ToolBox tool:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

val cm = universe.runtimeMirror(getClass.getClassLoader)
val tb = cm.mkToolBox()
val str = tb.eval(tb.parse("new String(\"Yo\")"))
println(str)
This is printing:
Yo
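Note that ToolBox lives in the compiler rather than the standard library, so scala-compiler must be on the classpath. A sketch of the dependency (assuming an sbt build):

libraryDependencies += "org.scala-lang" % "scala-compiler" % scalaVersion.value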
The Scala compiler (and the "interpreter loop") are available here, for example:
https://github.com/scala/scala/blob/v2.11.7/src/repl/scala/tools/nsc/interpreter/ILoop.scala
A compiled jar that contains that class ships with your Scala distribution under lib/scala-compiler.jar.
One way could be to take/parse that string of code and then write it out yourself to some file as Scala code, so that it gets compiled and executed by the Scala compiler when you call it.
An example: this is just like what Scala itself does with an App object, taking this code and converting it into a class with a main method:
object Xyz extends App {
  println("Hello World!")
}

Serializing to disk and deserializing Scala objects using Pickling

Given a stream of homogeneously typed objects, how would I go about serializing them to binary, writing them to disk, reading them back from disk, and then deserializing them using Scala Pickling?
For example:
object PicklingIteratorExample extends App {
  import scala.pickling.Defaults._
  import scala.pickling.binary._
  import scala.pickling.static._

  case class Person(name: String, age: Int)

  val personsIt = Iterator.from(0).take(10).map(i => Person(i.toString, i))
  val pklsIt = personsIt.map(_.pickle)
  ??? // Write to disk
  val readIt: Iterator[Person] = ??? // Read from disk and unpickle
}
I found a way to do so for standard files:
object PickleIOExample extends App {
  import java.io.{File, FileInputStream, FileOutputStream}
  import scala.pickling.Defaults._
  import scala.pickling.binary._
  import scala.pickling.static._

  case class Person(name: String, age: Int)

  val tempPath = File.createTempFile("pickling", ".gz").getAbsolutePath
  val outputStream = new FileOutputStream(tempPath)
  val inputStream = new FileInputStream(tempPath)

  val persons = for {
    i <- 1 to 100
  } yield Person(i.toString, i)

  val output = new StreamOutput(outputStream)
  persons.foreach(_.pickleTo(output))
  outputStream.close()

  val personsIt = new Iterator[Person] {
    val streamPickle = BinaryPickle(inputStream)
    override def hasNext: Boolean = inputStream.available > 0
    override def next(): Person = streamPickle.unpickle[Person]
  }

  println(personsIt.mkString(", "))
  inputStream.close()
}
But I am still unable to find a solution that works with gzipped files, since I do not know how to detect the EOF. The following throws an EOFException because GZIPInputStream's available method does not indicate EOF:
object PickleIOExample extends App {
  import java.io.{File, FileInputStream, FileOutputStream}
  import java.util.zip.{GZIPInputStream, GZIPOutputStream}
  import scala.pickling.Defaults._
  import scala.pickling.binary._
  import scala.pickling.static._

  case class Person(name: String, age: Int)

  val tempPath = File.createTempFile("pickling", ".gz").getAbsolutePath
  val outputStream = new GZIPOutputStream(new FileOutputStream(tempPath))
  val inputStream = new GZIPInputStream(new FileInputStream(tempPath))

  val persons = for {
    i <- 1 to 100
  } yield Person(i.toString, i)

  val output = new StreamOutput(outputStream)
  persons.foreach(_.pickleTo(output))
  outputStream.close()

  val personsIt = new Iterator[Person] {
    val streamPickle = BinaryPickle(inputStream)
    override def hasNext: Boolean = inputStream.available > 0
    override def next(): Person = streamPickle.unpickle[Person]
  }

  println(personsIt.mkString(", "))
  inputStream.close()
}
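One generic workaround (a sketch, not from the original post): since GZIPInputStream cannot report how many uncompressed bytes remain, the iterator can unpickle one element ahead and treat an EOFException as the end of the stream:

val personsIt = new Iterator[Person] {
  private val streamPickle = BinaryPickle(inputStream)
  // Always hold the next element; None signals end of stream
  private var lookahead: Option[Person] = fetch()

  private def fetch(): Option[Person] =
    try Some(streamPickle.unpickle[Person])
    catch { case _: java.io.EOFException => None }

  override def hasNext: Boolean = lookahead.isDefined

  override def next(): Person = {
    val current = lookahead.getOrElse(throw new NoSuchElementException("empty iterator"))
    lookahead = fetch()
    current
  }
}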