Registering UDFs dynamically using scala reflect not working - scala

I am registering my UDFs dynamically using scala reflect as shown below, and this code works fine. However, when I try to list Spark functions using spark.catalog, I don't see my UDF. Can you please help me understand what I am missing here:
spark.catalog.listFunctions().foreach { fun =>
  if (fun.name == "ModelIdToModelYear") {
    println(fun.name)
  }
}
def registerUDF(spark: SparkSession): Unit = {
  val runtimeMirror = scala.reflect.runtime.universe.runtimeMirror(getClass.getClassLoader)
  val moduleSymbol = runtimeMirror.moduleSymbol(Class.forName("com.local.practice.udf.UdfModelIdToModelYear"))
  val targetMethod = moduleSymbol.typeSignature.members.filter {
    x => x.isMethod && x.name.toString == "ModelIdToModelYear"
  }.head.asMethod
  val function = runtimeMirror.reflect(runtimeMirror.reflectModule(moduleSymbol).instance).reflectMethod(targetMethod)
  function(spark.udf)
}
Below is my UDF definition:
package com.local.practice.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
//noinspection ScalaStyle
object UdfModelIdToModelYear {
  val ModelIdToModelYear: UserDefinedFunction = udf((model_id: String) => {
    val numPattern = "(\\d{2})_.+".r
    numPattern.findFirstIn(model_id).getOrElse("0").toInt
  })
}
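For reference, a minimal hedged sketch, independent of the reflection code above: spark.catalog.listFunctions() only reports functions that have been registered by name in the session's function registry, so a UserDefinedFunction created with udf(...) will not show up there until something like the following runs:
// Registering the UDF under an explicit name makes it visible to
// spark.catalog.listFunctions() as a (temporary) function.
spark.udf.register("ModelIdToModelYear", UdfModelIdToModelYear.ModelIdToModelYear)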

Related

How to extract value from Scala cats IO

I need to get the Array[Byte] value from ioArray, which is an IO[Array[Byte]] (IO is from the cats-effect library).
object MyTransactionInputApp extends App {
  val ioArray: IO[Array[Byte]] = generateKryoBinary()
  val i: Array[Byte] = ioArray.unsafeRunSync()
  println(i)

  def generateKryoBinaryIO(transaction: Transaction): IO[Array[Byte]] = {
    KryoSerializer
      .forAsync[IO](kryoRegistrar)
      .use { implicit kryo =>
        transaction.toBinary.liftTo[IO]
      }
  }

  def generateKryoBinary(): IO[Array[Byte]] = {
    val transaction = new Transaction(Hash(""), "", "", "", "", "")
    val ioArray = generateKryoBinaryIO(transaction)
    ioArray
  }
}
I tried the below, but it's not working:
val i: Array[Byte] = for {
  array <- ioArray
} yield array
If you have just started working with cats-effect, I recommend reading about cats.effect.IOApp, which runs your IO for you.
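A minimal sketch of what that could look like, assuming cats-effect 3 (generateKryoBinary is stubbed out here; in the real app it would be the Kryo code from the question):
import cats.effect.{IO, IOApp}

object MyTransactionInputApp extends IOApp.Simple {
  // Hypothetical stub standing in for the Kryo-based method from the question.
  def generateKryoBinary(): IO[Array[Byte]] = IO.pure(Array.emptyByteArray)

  // IOApp runs this IO itself, so no unsafeRunSync() is needed anywhere.
  def run: IO[Unit] =
    generateKryoBinary().flatMap(bytes => IO.println(bytes.length))
}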
Note that a for-comprehension over an IO still yields an IO[Array[Byte]], not an Array[Byte], which is why your last attempt does not compile. Otherwise, simple solutions would be:
run it explicitly and get the result:
import cats.effect.unsafe.implicits.global
ioArray.unsafeRunSync()
or maybe work with Future:
import cats.effect.unsafe.implicits.global
ioArray.unsafeToFuture()
Could you give us more context about your application?

How to use Array in JCommander in Scala

I want to use JCommander to parse args.
I wrote some code:
import com.beust.jcommander.{JCommander, Parameter}
import scala.collection.mutable.ArrayBuffer
object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories = new ArrayBuffer[String]
}

object Main {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
  }
}
However it fails with
com.beust.jcommander.ParameterException: Could not invoke null
Reason: Can not set static scala.collection.mutable.ArrayBuffer field InterestRulesConfig$.categories to java.lang.String
What am I doing wrong?
JCommander uses knowledge about Java types to map values to parameters. But Java doesn't have a scala.collection.mutable.ArrayBuffer type; it has java.util.List. If you want to use JCommander, you have to stick to Java's built-in types.
If you want to use Scala's types, use one of Scala's libraries that handle this in a more idiomatic manner: scopt or decline (a scopt sketch follows the working example below).
Working example
import java.util
import com.beust.jcommander.{JCommander, Parameter}
import scala.jdk.CollectionConverters._

object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories: java.util.List[Integer] = new util.ArrayList[Integer]()
}

object Hello {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
    println(cfg.categories.getClass())
    val a = cfg.categories.asScala
    for (x <- a) {
      println(x.toInt)
      println(x.toInt.getClass())
    }
  }
}
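And, for comparison, a hedged sketch of the scopt alternative mentioned above (assuming scopt 4.x; CliConfig, ScoptMain, and the program name are made up for the example):
import scopt.OParser

case class CliConfig(categories: Seq[String] = Seq.empty)

object ScoptMain {
  def main(args: Array[String]): Unit = {
    val builder = OParser.builder[CliConfig]
    val parser = {
      import builder._
      OParser.sequence(
        programName("app"),
        // --categories a,b,c is parsed straight into a Scala Seq[String]
        opt[Seq[String]]("categories")
          .required()
          .action((xs, c) => c.copy(categories = xs))
      )
    }
    OParser.parse(parser, args, CliConfig()) match {
      case Some(cfg) => println(cfg.categories)
      case None      => () // bad arguments; scopt has already printed the error
    }
  }
}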

Scala: map Dataset[Row] to Dataset[Row]

I am trying to use Scala to transform a dataset with arrays into a dataset with a label and vectors, before putting it into a machine learning algorithm.
So far, I have succeeded in adding a double label, but I am stuck on the vectors part. Below is the code that creates the vectors:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._
def toVectors(withLabelDs: Dataset[Row]) = {
  val allLabel = withLabelDs.count()
  var countLabel = 0
  val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
    println("schema line {}", line.schema)
    //StructType(
    //  StructField(label,DoubleType,false),
    //  StructField(code,ArrayType(IntegerType,true),true),
    //  StructField(score,ArrayType(IntegerType,true),true))
    val label = line.getDouble(0)
    val indicesList = line.getList(1)
    val indicesSize = indicesList.size
    val indices = new Array[Int](indicesSize)
    val valuesList = line.getList(2)
    val values = new Array[Double](indicesSize)
    var i = 0
    while (i < indicesSize) {
      indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
      values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
      i += 1
    }
    var r: Row = null
    try {
      r = Row(label, Vectors.sparse(195, indices, values))
      countLabel += 1
    } catch {
      case e: IllegalArgumentException =>
        println("something went wrong with label {} / indices {} / values {}", label, indices, values)
        println("", e)
    }
    println("Still {} labels to process", allLabel - countLabel)
    r
  })
  newDataset
}
With this code, I got this error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code
def toVectors(withLabelDs: Dataset[Row]) = {
  ...
  }, Encoders.bean(Row.getClass))
  newDataset
}
But I got this error:
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work, i.e., get a Dataset[Row] back that contains Vectors?
Two things:
.map is of type (T => U)(implicit Encoder[U]) => Dataset[U], but it looks like you are calling it as if it were (T => U, Encoder[U]) => Dataset[U], which is slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work since Row is not a bean. Some quick googling turned up RowEncoder which looks like it should work but I couldn't find much documentation about it.
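For illustration, a hedged sketch of what that could look like, assuming a Spark version where org.apache.spark.sql.catalyst.encoders.RowEncoder builds an Encoder[Row] from a StructType (outSchema and the stubbed row-building below are illustrative, not the question's full logic):
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}

def toVectors(withLabelDs: Dataset[Row]): Dataset[Row] = {
  // Schema of the rows produced by the map: a double label plus a vector column.
  val outSchema = StructType(Seq(
    StructField("label", DoubleType, nullable = false),
    StructField("features", VectorType, nullable = true)
  ))
  // .map(f)(encoder): the explicit Encoder[Row] goes in the second argument list.
  withLabelDs.map { (line: Row) =>
    val label = line.getDouble(0)
    // ... build indices and values exactly as in the question ...
    Row(label, Vectors.sparse(195, Array.empty[Int], Array.empty[Double]))
  }(RowEncoder(outSchema))
}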
The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is to import <Spark Session object>.implicits._; spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset.
At the top of your method, you can add the import:
def toVectors(withLabelDs: Dataset[Row]) = {
  val sparkSession = withLabelDs.sparkSession
  import sparkSession.implicits._
  // rest of method code

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or with Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update my myDF based on the values of some records in another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I analyze hiveContext in the "test" function, I can access its SparkContext, and I can load my DF without any problem.
Nevertheless, if I analyze my hiveContext object just before getting the NullPointerException, its sparkContext is null, and I suppose this is because sparkContext is not Serializable (and as I am in a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what's wrong with my code, or how I should alter it to get my investmentDF without any NullPointerException.
Thanks!
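For reference, a minimal hedged sketch of the usual pattern (testWithJoin is a made-up name; it assumes it lives inside the same Test class, so hiveContext and $ are in scope, and a Spark version where join(right, usingColumns, joinType) is available): the context only exists on the driver, so the per-row lookup is expressed as a join in the plan instead of a hiveContext.read call inside map.
// Hedged sketch: both tables are read on the driver and combined with a join,
// so no SQLContext/HiveContext has to be serialized into the executors.
def testWithJoin(): Unit = {
  val customersDF = hiveContext.read.table("myDB.Customers")
  val investmentDF = hiveContext.read.table("myDB.Investment").select($"cod_a", $"cod_p", $"cod_t", $"nom_t")
  val enrichedDF = customersDF.join(investmentDF, Seq("cod_a", "cod_p"), "left_outer")
  enrichedDF.take(1)
}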

compiler crash when I use macros and playframework

I wrote a macro that parses JSON into a matching case class.
def parse(jsonTree: JsValue): BaseType = macro parserImpl

def parserImpl(c: blackbox.Context)(jsonTree: c.Tree) = {
  import c.universe._
  val q"$json" = jsonTree
  val cases = List("X", "Y").map { caseClassName =>
    val caseClass = c.parse(caseClassName)
    val reader = c.parse(s"JSONHelp.${caseClassName}_reads")
    val y = cq"""$caseClassName => (($json \ "config").validate[$caseClass]($reader)).get"""
    println(showCode(y))
    y
  }.toList
  val r =
    q"""
      import play.api.libs.json._
      import JSONHelp._
      println($json)
      ($json \ "type").as[String] match { case ..$cases }
    """
  println(showCode(r))
  r
}
The following is the code it generates (printed by the last println):
{
  import play.api.libs.json._;
  import JSONHelp._;
  println(NodeParser.this.json);
  NodeParser.this.json.\("type").as[String] match {
    case "X" => NodeParser.this.json.\("config").validate[X](JSONHelp.X_reads).get
    case "Y" => NodeParser.this.json.\("config").validate[Y](JSONHelp.Y_reads).get
  }
}
The compilation of the subproject containing the macro definition works fine. But when I compile the project that uses the macro (using sbt 0.13.11 and Scala 2.11.8), I get the following error:
java.lang.NullPointerException
  at play.routes.compiler.RoutesCompiler$GeneratedSource$.unapply(RoutesCompiler.scala:37)
  at play.sbt.routes.RoutesCompiler$$anonfun$11$$anonfun$apply$2.isDefinedAt(RoutesCompiler.scala:180)
  at play.sbt.routes.RoutesCompiler$$anonfun$11$$anonfun$apply$2.isDefinedAt(RoutesCompiler.scala:179)
  at scala.Option.collect(Option.scala:250)
  at play.sbt.routes.RoutesCompiler$$anonfun$11.apply(RoutesCompiler.scala:179)
  at play.sbt.routes.RoutesCompiler$$anonfun$11.apply(RoutesCompiler.scala:178)
I'm not a Play user, but it seems to want tree positions that have a source file:
val routesPositionMapper: Position => Option[Position] = position => {
  position.sourceFile collect {
    case GeneratedSource(generatedSource) => {
It's typical to use atPos(pos)(tree). You might assume the incoming tree.pos for synthetic trees.
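A hedged sketch of that idea applied to the macro above: the position of the incoming tree is copied onto the synthetic result with internal.setPos (used here as a public stand-in for the compiler's atPos); this is an assumption about what the routes compiler needs, not a verified fix.
import scala.reflect.macros.blackbox

// Hedged sketch: give the emitted tree the caller's position so it carries a source file.
def parserImpl(c: blackbox.Context)(jsonTree: c.Tree) = {
  import c.universe._
  // `result` stands in for the q"..." tree built in the question.
  val result = q"""println($jsonTree)"""
  internal.setPos(result, jsonTree.pos)
}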