Registering UDFs dynamically using scala reflect not working - scala

I am registering my UDFs dynamically using scala reflect as shown below, and this code works fine. However, when I try to list Spark functions using spark.catalog, I don't see my UDF. Can you please help me understand what I am missing here:
spark.catalog.listFunctions().foreach { fun =>
  if (fun.name == "ModelIdToModelYear") {
    println(fun.name)
  }
}
def registerUDF(spark: SparkSession): Unit = {
  val runtimeMirror = scala.reflect.runtime.universe.runtimeMirror(getClass.getClassLoader)
  val moduleSymbol = runtimeMirror.moduleSymbol(Class.forName("com.local.practice.udf.UdfModelIdToModelYear"))
  val targetMethod = moduleSymbol.typeSignature.members.filter {
    x => x.isMethod && x.name.toString == "ModelIdToModelYear"
  }.head.asMethod
  val function = runtimeMirror.reflect(runtimeMirror.reflectModule(moduleSymbol).instance).reflectMethod(targetMethod)
  function(spark.udf)
}
Below is my UDF definition:
package com.local.practice.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
//noinspection ScalaStyle
object UdfModelIdToModelYear {
  val ModelIdToModelYear: UserDefinedFunction = udf((model_id: String) => {
    val numPattern = "(\\d{2})_.+".r
    numPattern.findFirstIn(model_id).getOrElse("0").toInt
  })
}
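For reference, a minimal hedged sketch, independent of the reflection code above: spark.catalog.listFunctions() only reports functions that have been registered by name in the session's function registry, so a UserDefinedFunction created with udf(...) will not show up there until something like the following runs:
// Registering the UDF under an explicit name makes it visible to
// spark.catalog.listFunctions() as a (temporary) function.
spark.udf.register("ModelIdToModelYear", UdfModelIdToModelYear.ModelIdToModelYear)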

Related

How to extract value from Scala cats IO

I need to get the Array[Byte] value from ioArray, which is an IO[Array[Byte]] (IO is from the cats-effect library).
object MyTransactionInputApp extends App {
  val ioArray: IO[Array[Byte]] = generateKryoBinary()
  val i: Array[Byte] = ioArray.unsafeRunSync()
  println(i)

  def generateKryoBinaryIO(transaction: Transaction): IO[Array[Byte]] = {
    KryoSerializer
      .forAsync[IO](kryoRegistrar)
      .use { implicit kryo =>
        transaction.toBinary.liftTo[IO]
      }
  }

  def generateKryoBinary(): IO[Array[Byte]] = {
    val transaction = new Transaction(Hash(""), "", "", "", "", "")
    val ioArray = generateKryoBinaryIO(transaction)
    ioArray
  }
}
I tried the below, but it's not working:
val i: Array[Byte] = for {
  array <- ioArray
} yield array
If you have just started working with cats-effect, I recommend reading about cats.effect.IOApp, which runs your IO for you.
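A minimal sketch of what that could look like, assuming cats-effect 3 (generateKryoBinary is stubbed out here; in the real app it would be the Kryo code from the question):
import cats.effect.{IO, IOApp}

object MyTransactionInputApp extends IOApp.Simple {
  // Hypothetical stub standing in for the Kryo-based method from the question.
  def generateKryoBinary(): IO[Array[Byte]] = IO.pure(Array.emptyByteArray)

  // IOApp runs this IO itself, so no unsafeRunSync() is needed anywhere.
  def run: IO[Unit] =
    generateKryoBinary().flatMap(bytes => IO.println(bytes.length))
}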
Note that a for-comprehension over an IO still yields an IO[Array[Byte]], not an Array[Byte], which is why your last attempt does not compile. Otherwise, simple solutions would be:
run it explicitly and get the result:
import cats.effect.unsafe.implicits.global
ioArray.unsafeRunSync()
or maybe work with Future:
import cats.effect.unsafe.implicits.global
ioArray.unsafeToFuture()
Could you give us more context about your application?

How to use Array in JCommander in Scala

I want to use JCommander to parse args.
I wrote some code:
import com.beust.jcommander.{JCommander, Parameter}
import scala.collection.mutable.ArrayBuffer
object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories = new ArrayBuffer[String]
}

object Main {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
  }
}
However it fails with
com.beust.jcommander.ParameterException: Could not invoke null
Reason: Can not set static scala.collection.mutable.ArrayBuffer field InterestRulesConfig$.categories to java.lang.String
What am I doing wrong?
JCommander uses knowledge about Java types to map values to parameters. But Java doesn't have a scala.collection.mutable.ArrayBuffer type; it has java.util.List. If you want to use JCommander, you have to stick to Java's built-in types.
If you want to use Scala's types, use one of Scala's libraries that handle this in a more idiomatic manner: scopt or decline (a scopt sketch follows the working example below).
Working example
import java.util
import com.beust.jcommander.{JCommander, Parameter}
import scala.jdk.CollectionConverters._

object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories: java.util.List[Integer] = new util.ArrayList[Integer]()
}

object Hello {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
    println(cfg.categories.getClass())
    val a = cfg.categories.asScala
    for (x <- a) {
      println(x.toInt)
      println(x.toInt.getClass())
    }
  }
}
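And, for comparison, a hedged sketch of the scopt alternative mentioned above (assuming scopt 4.x; CliConfig, ScoptMain, and the program name are made up for the example):
import scopt.OParser

case class CliConfig(categories: Seq[String] = Seq.empty)

object ScoptMain {
  def main(args: Array[String]): Unit = {
    val builder = OParser.builder[CliConfig]
    val parser = {
      import builder._
      OParser.sequence(
        programName("app"),
        // --categories a,b,c is parsed straight into a Scala Seq[String]
        opt[Seq[String]]("categories")
          .required()
          .action((xs, c) => c.copy(categories = xs))
      )
    }
    OParser.parse(parser, args, CliConfig()) match {
      case Some(cfg) => println(cfg.categories)
      case None      => () // bad arguments; scopt has already printed the error
    }
  }
}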

Scala: map Dataset[Row] to Dataset[Row]

I am trying to use Scala to transform a dataset with arrays into a dataset with a label and vectors, before putting it into a machine learning algorithm.
So far, I have succeeded in adding a double label, but I am stuck on the vectors part. Below is the code that creates the vectors:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._
def toVectors(withLabelDs: Dataset[Row]) = {
  val allLabel = withLabelDs.count()
  var countLabel = 0
  val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
    println("schema line {}", line.schema)
    //StructType(
    //  StructField(label,DoubleType,false),
    //  StructField(code,ArrayType(IntegerType,true),true),
    //  StructField(score,ArrayType(IntegerType,true),true))
    val label = line.getDouble(0)
    val indicesList = line.getList(1)
    val indicesSize = indicesList.size
    val indices = new Array[Int](indicesSize)
    val valuesList = line.getList(2)
    val values = new Array[Double](indicesSize)
    var i = 0
    while (i < indicesSize) {
      indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
      values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
      i += 1
    }
    var r: Row = null
    try {
      r = Row(label, Vectors.sparse(195, indices, values))
      countLabel += 1
    } catch {
      case e: IllegalArgumentException =>
        println("something went wrong with label {} / indices {} / values {}", label, indices, values)
        println("", e)
    }
    println("Still {} labels to process", allLabel - countLabel)
    r
  })
  newDataset
}
With this code, I got this error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code
def toVectors(withLabelDs: Dataset[Row]) = {
  ...
  }, Encoders.bean(Row.getClass))
  newDataset
}
But I got this error:
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work, i.e., get a Dataset[Row] back that contains Vectors?
Two things:
.map is of type (T => U)(implicit Encoder[U]) => Dataset[U], but it looks like you are calling it as if it were (T => U, Encoder[U]) => Dataset[U], which is slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work since Row is not a bean. Some quick googling turned up RowEncoder which looks like it should work but I couldn't find much documentation about it.
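For illustration, a hedged sketch of what that could look like, assuming a Spark version where org.apache.spark.sql.catalyst.encoders.RowEncoder builds an Encoder[Row] from a StructType (outSchema and the stubbed row-building below are illustrative, not the question's full logic):
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}

def toVectors(withLabelDs: Dataset[Row]): Dataset[Row] = {
  // Schema of the rows produced by the map: a double label plus a vector column.
  val outSchema = StructType(Seq(
    StructField("label", DoubleType, nullable = false),
    StructField("features", VectorType, nullable = true)
  ))
  // .map(f)(encoder): the explicit Encoder[Row] goes in the second argument list.
  withLabelDs.map { (line: Row) =>
    val label = line.getDouble(0)
    // ... build indices and values exactly as in the question ...
    Row(label, Vectors.sparse(195, Array.empty[Int], Array.empty[Double]))
  }(RowEncoder(outSchema))
}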
The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is to import <Spark Session object>.implicits._; spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset.
At the top of your method, you can add the import:
def toVectors(withLabelDs: Dataset[Row]) = {
  val sparkSession = withLabelDs.sparkSession
  import sparkSession.implicits._
  // rest of method code

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or with Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update my myDF based on the values of some records in another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I analyze hiveContext in the "test" function, I can access its SparkContext, and I can load my DF without any problem.
Nevertheless, if I analyze my hiveContext object just before getting the NullPointerException, its sparkContext is null, and I suppose this is because sparkContext is not Serializable (and as I am in a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what's wrong with my code, or how I should alter it to get my investmentDF without any NullPointerException.
Thanks!
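For reference, a minimal hedged sketch of the usual pattern (testWithJoin is a made-up name; it assumes it lives inside the same Test class, so hiveContext and $ are in scope, and a Spark version where join(right, usingColumns, joinType) is available): the context only exists on the driver, so the per-row lookup is expressed as a join in the plan instead of a hiveContext.read call inside map.
// Hedged sketch: both tables are read on the driver and combined with a join,
// so no SQLContext/HiveContext has to be serialized into the executors.
def testWithJoin(): Unit = {
  val customersDF = hiveContext.read.table("myDB.Customers")
  val investmentDF = hiveContext.read.table("myDB.Investment").select($"cod_a", $"cod_p", $"cod_t", $"nom_t")
  val enrichedDF = customersDF.join(investmentDF, Seq("cod_a", "cod_p"), "left_outer")
  enrichedDF.take(1)
}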

compiler crash when I use macros and playframework

I wrote a macro that parses JSON into a matching case class.
def parse(jsonTree: JsValue): BaseType = macro parserImpl

def parserImpl(c: blackbox.Context)(jsonTree: c.Tree) = {
  import c.universe._
  val q"$json" = jsonTree
  val cases = List("X", "Y").map { caseClassName =>
    val caseClass = c.parse(caseClassName)
    val reader = c.parse(s"JSONHelp.${caseClassName}_reads")
    val y = cq"""$caseClassName => (($json \ "config").validate[$caseClass]($reader)).get"""
    println(showCode(y))
    y
  }.toList
  val r =
    q"""
      import play.api.libs.json._
      import JSONHelp._
      println($json)
      ($json \ "type").as[String] match { case ..$cases }
    """
  println(showCode(r))
  r
}
The following is the code it generates (printed by the last println):
{
  import play.api.libs.json._;
  import JSONHelp._;
  println(NodeParser.this.json);
  NodeParser.this.json.\("type").as[String] match {
    case "X" => NodeParser.this.json.\("config").validate[X](JSONHelp.X_reads).get
    case "Y" => NodeParser.this.json.\("config").validate[Y](JSONHelp.Y_reads).get
  }
}
The compilation of the subproject containing the macro definition works fine. But when I compile the project that uses the macro (using sbt 0.13.11 and Scala 2.11.8), I get the following error:
java.lang.NullPointerException
  at play.routes.compiler.RoutesCompiler$GeneratedSource$.unapply(RoutesCompiler.scala:37)
  at play.sbt.routes.RoutesCompiler$$anonfun$11$$anonfun$apply$2.isDefinedAt(RoutesCompiler.scala:180)
  at play.sbt.routes.RoutesCompiler$$anonfun$11$$anonfun$apply$2.isDefinedAt(RoutesCompiler.scala:179)
  at scala.Option.collect(Option.scala:250)
  at play.sbt.routes.RoutesCompiler$$anonfun$11.apply(RoutesCompiler.scala:179)
  at play.sbt.routes.RoutesCompiler$$anonfun$11.apply(RoutesCompiler.scala:178)
I'm not a Play user, but it seems to want tree positions that have a source file:
val routesPositionMapper: Position => Option[Position] = position => {
  position.sourceFile collect {
    case GeneratedSource(generatedSource) => {
It's typical to use atPos(pos)(tree). You might assume the incoming tree.pos for synthetic trees.
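A hedged sketch of that idea applied to the macro above: the position of the incoming tree is copied onto the synthetic result with internal.setPos (used here as a public stand-in for the compiler's atPos); this is an assumption about what the routes compiler needs, not a verified fix.
import scala.reflect.macros.blackbox

// Hedged sketch: give the emitted tree the caller's position so it carries a source file.
def parserImpl(c: blackbox.Context)(jsonTree: c.Tree) = {
  import c.universe._
  // `result` stands in for the q"..." tree built in the question.
  val result = q"""println($jsonTree)"""
  internal.setPos(result, jsonTree.pos)
}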