udf No TypeTag available for type string - scala

I don't understand a behavior of Spark.
I create a UDF which returns an Integer, like below:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Show {
  def main(args: Array[String]): Unit = {
    val (sc, sqlContext) = iniSparkConf("test")
    val testInt_udf = sqlContext.udf.register("testInt_udf", testInt _)
  }

  def iniSparkConf(appName: String): (SparkContext, SQLContext) = {
    val conf = new SparkConf().setAppName(appName) //.setExecutorEnv("spark.ui.port", "4046")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    (sc, sqlContext)
  }

  def testInt(): Int = {
    return 2
  }
}
It works perfectly, but if I change the return type of the method from Int to String:
val testString_udf = sqlContext.udf.register("testString_udf", testString _)
def testString(): String = {
  return "myString"
}
I get the following error
Error:(34, 43) No TypeTag available for String
val testString_udf = sqlContext.udf.register("testString_udf", testString _)
Error:(34, 43) not enough arguments for method register: (implicit evidence$1: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction.
Unspecified value parameter evidence$1.
val testString_udf = sqlContext.udf.register("testString_udf", testString _)
Here are my embedded jars:
datanucleus-api-jdo-3.2.6
datanucleus-core-3.2.10
datanucleus-rdbms-3.2.9
spark-1.6.1-yarn-shuffle
spark-assembly-1.6.1-hadoop2.6.0
spark-examples-1.6.1-hadoop2.6.0
I am a little bit lost... Do you have any idea?

Since I can't reproduce the issue by copy-pasting just your example code into a new file, I bet that in your real code String is actually shadowed by something else. To verify this theory, you can try changing your signature to
def testString(): scala.Predef.String = {
  return "myString"
}
or
def testString(): java.lang.String = {
  return "myString"
}
If this compiles, search for "String" to see where you shadowed the standard type. If you use IntelliJ IDEA, you can use "Ctrl+B" (Go To Declaration) to find it. The most obvious candidate is that you used String as the name of a generic type parameter, but there are other possibilities.
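For illustration, here is one hypothetical way such shadowing can arise (the class name and type parameter below are made up, not taken from your code):
import scala.reflect.runtime.universe._

// The type parameter is (unfortunately) named String, so inside this class
// it shadows scala.Predef.String and no TypeTag can be found for it.
class Example[String] {
  // def tag = typeTag[String]           // fails: No TypeTag available for String
  def tag = typeTag[java.lang.String]    // compiles: the qualified name bypasses the shadowing
}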

Related

How to extract value from Scala cats IO

I need to get an Array[Byte] value from ioArray, which is an IO[Array[Byte]] (IO is from the cats library).
object MyTransactionInputApp extends App {

  val ioArray: IO[Array[Byte]] = generateKryoBinary()
  val i: Array[Byte] = ioArray.unsafeRunSync()
  println(i)

  def generateKryoBinaryIO(transaction: Transaction): IO[Array[Byte]] = {
    KryoSerializer
      .forAsync[IO](kryoRegistrar)
      .use { implicit kryo =>
        transaction.toBinary.liftTo[IO]
      }
  }

  def generateKryoBinary(): IO[Array[Byte]] = {
    val transaction = new Transaction(Hash(""), "", "", "", "", "")
    val ioArray = generateKryoBinaryIO(transaction)
    return ioArray
  }
}
I tried the below, but it is not working:
val i: Array[Byte] = for {
  array <- ioArray
} yield array
If you have just started working with cats-effect, I recommend reading about cats.effect.IOApp, which runs your IO for you.
Otherwise, simple solutions would be:
run it explicitly and get the result:
import cats.effect.unsafe.implicits.global
ioArray.unsafeRunSync()
or maybe work with Future:
import cats.effect.unsafe.implicits.global
ioArray.unsafeToFuture()
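For reference, here is a minimal IOApp-based sketch (assuming cats-effect 3; the Kryo logic from the question is elided):
import cats.effect.{ExitCode, IO, IOApp}

object MyTransactionInputApp extends IOApp {

  def run(args: List[String]): IO[ExitCode] =
    for {
      bytes <- generateKryoBinary()      // the IOApp runtime executes this IO
      _     <- IO.println(bytes.length)  // no unsafeRunSync needed
    } yield ExitCode.Success

  // body elided; same method as in the question
  def generateKryoBinary(): IO[Array[Byte]] = ???
}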
Could you give us more context about your application?

spark - method Error: The argument types of an anonymous function must be fully known

I know there have been quite a few questions on this, but I've created a simple example that I thought should work, yet it still does not, and I'm not sure I understand why.
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStream")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  val kafkaDStream = KafkaUtils.createStream(streamingContext, "hubble1:2181", "aaa", Map("video" -> 3))
  val wordDStream = kafkaDStream.flatMap(t => t._2.split(" "))
  val mapDStream = wordDStream.map((_, 1))

  val wordToSumDStream = mapDStream.updateStateByKey {
    case (seq, buffer) => {
      val sum = buffer.getOrElse(0) + seq.sum
      Option(sum)
    }
  }

  wordToSumDStream.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}
Error:(41, 41) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: (Seq[Int], Option[?]) => Option[?]
val result = flat.updateStateByKey{
Can someone explain why the mapDStream.updateStateByKey method statement does not compile?
Put your logic inside a function, like below.
def update(seq: Seq[Int], buffer: Option[Int]) = Some(buffer.getOrElse(0) + seq.sum)

val wordToSumDStream = mapDStream.updateStateByKey(update)
Check Example
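The pattern-matching closure fails because the state type only appears in the update function itself, so the compiler cannot infer the Option's type parameter from case (seq, buffer) alone (hence the Option[?] in the error). Alternatively, annotating the parameter types inline also gives the compiler enough information (a sketch based on the code above):
val wordToSumDStream = mapDStream.updateStateByKey(
  (seq: Seq[Int], buffer: Option[Int]) => Some(buffer.getOrElse(0) + seq.sum)
)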

How to use Array in JCommander in Scala

I want to use JCommander to parse args.
I wrote some code:
import com.beust.jcommander.{JCommander, Parameter}
import scala.collection.mutable.ArrayBuffer

object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories = new ArrayBuffer[String]
}

object Main {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
  }
}
However, it fails with:
com.beust.jcommander.ParameterException: Could not invoke null
Reason: Can not set static scala.collection.mutable.ArrayBuffer field InterestRulesConfig$.categories to java.lang.String
What am I doing wrong?
JCommander uses knowledge about Java types to map values to parameters. But Java doesn't have a type scala.collection.mutable.ArrayBuffer; it has java.util.List. If you want to use JCommander, you have to stick to Java's built-in types.
If you want to use Scala's types, use one of the Scala libraries that handle them in a more idiomatic manner: scopt or decline.
Working example
import java.util
import com.beust.jcommander.{JCommander, Parameter}
import scala.jdk.CollectionConverters._

object Config {
  @Parameter(names = Array("--categories"), required = true)
  var categories: java.util.List[Integer] = new util.ArrayList[Integer]()
}

object Hello {
  def main(args: Array[String]): Unit = {
    val cfg = Config
    JCommander
      .newBuilder()
      .addObject(cfg)
      .build()
      .parse(args.toArray: _*)
    println(cfg.categories)
    println(cfg.categories.getClass())
    val a = cfg.categories.asScala
    for (x <- a) {
      println(x.toInt)
      println(x.toInt.getClass())
    }
  }
}
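If you prefer to keep Scala collections, a rough scopt sketch might look like this (assuming scopt 4.x is on the classpath; the names below are illustrative, not taken from the question):
import scopt.OParser

case class CliConfig(categories: Seq[String] = Seq.empty)

object Main {
  def main(args: Array[String]): Unit = {
    val builder = OParser.builder[CliConfig]
    val parser = {
      import builder._
      OParser.sequence(
        programName("app"),
        opt[Seq[String]]("categories")
          .required()
          .action((xs, c) => c.copy(categories = xs))
      )
    }
    // OParser.parse returns None and prints usage if the arguments are invalid
    OParser.parse(parser, args, CliConfig()) match {
      case Some(cfg) => println(cfg.categories)
      case None      => // invalid arguments, usage already printed
    }
  }
}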

json4s, how to deserialize json with FullTypeHints w/o explicitly setting TypeHints

I do specify FullTypeHints during serialization:
def serialize(definition: Definition): String = {
  val hints = definition.tasks.map(_.getClass).groupBy(_.getName).values.map(_.head).toList
  implicit val formats = Serialization.formats(FullTypeHints(hints))
  writePretty(definition)
}
It produces json with type hints, great!
{
  "name": "My definition",
  "tasks": [
    {
      "jsonClass": "com.soft.RootTask",
      "name": "Root"
    }
  ]
}
Deserialization doesn't work; it ignores the "jsonClass" field with the type hint:
def deserialize(jsonString: String): Definition = {
  implicit val formats = DefaultFormats.withTypeHintFieldName("jsonClass")
  read[Definition](jsonString)
}
Why should I repeat the type hints via Serialization.formats(FullTypeHints(hints)) for deserialization if the hints are already in the json string?
Can json4s infer them from the json?
The deserialiser is not ignoring the type hint field name; it just does not have anything to map it to. This is where the hints come in: you have to declare your hints list once again and pass it to the DefaultFormats object, either by using the withHints method or by overriding the value when creating a new instance of DefaultFormats. Here's an example using the latter approach.
val hints = definition.tasks.map(_.getClass).groupBy(_.getName).values.map(_.head).toList

implicit val formats: Formats = new DefaultFormats {
  outer =>
  override val typeHintFieldName: String = "jsonClass"
  override val typeHints: TypeHints = FullTypeHints(hints)
}
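With these formats in scope, the read call from the question can then resolve the hinted classes (a sketch that reuses the Definition type from the question and takes the hints as a parameter):
def deserialize(jsonString: String, hints: List[Class[_]]): Definition = {
  implicit val formats: Formats = new DefaultFormats {
    override val typeHintFieldName: String = "jsonClass"
    override val typeHints: TypeHints = FullTypeHints(hints)
  }
  read[Definition](jsonString)
}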
I did it this way since I have a contract:
the type hint field name is known in advance
the type hint field contains a fully qualified class name, and it's always a case class
def deserialize(jsonString: String): Definition = {
  import org.json4s._
  import org.json4s.native.JsonMethods._
  import org.json4s.JsonDSL._

  val json = parse(jsonString)
  val classNames: List[String] = (json \\ $$definitionTypes$$ \\ classOf[JString])
  val hints: List[Class[_]] = classNames.map(clz => Try(Class.forName(clz)).getOrElse(throw new RuntimeException(s"Can't get class for $clz")))
  implicit val formats = Serialization.formats(FullTypeHints(hints)).withTypeHintFieldName($$definitionTypes$$)
  read[Definition](jsonString)
}

using Typeclasses on SparkTypes

I am trying to use a Scala type class with Spark types; here is a small code snippet I wrote.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

trait ThresholdMethods[A] {
  def applyThreshold(): Boolean
}

object ThresholdMethodsInstances {
  def thresholdMatcher[A](v: A)(implicit threshold: ThresholdMethods[A]): Boolean =
    threshold.applyThreshold()

  implicit val exactMatchThresholdStringType = new ThresholdMethods[StringType] {
    def applyThreshold(): Boolean = {
      print("string")
      true
    }
  }

  implicit val exactMatchThresholdNumericType = new ThresholdMethods[IntegerType] {
    def applyThreshold(): Boolean = {
      print("numeric")
      true
    }
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ZenDataValidationLib").config("spark.some.config.option", "some-value")
      .master("local").getOrCreate()
    import spark.sqlContext.implicits._

    val df1 = Seq(
      ("2016-04-02", "14", "NULL", 9874, 880, "xyz"), ("2016-04-30", "14", "FR", 9875, 13, "xyz"), ("2017-06-10", "15", "PQR", 9867, 57721, "xyz")
    ).toDF("WEEK", "DIM1", "DIM2", "T1", "T2", "T3")

    import ThresholdMethodsInstances._
    println(df1.schema("T1").dataType)
    ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
  }
}
When I run this in my local IntelliJ, the following error is thrown:
Error:(46, 51) could not find implicit value for parameter threshold: com.amazon.zen.datavalidation.activity.com.amazon.zen.datavalidation.activity.ThresholdMethods[org.apache.spark.sql.types.DataType]
ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
Error:(46, 51) not enough arguments for method thresholdMatcher: (implicit threshold: com.amazon.zen.datavalidation.activity.com.amazon.zen.datavalidation.activity.ThresholdMethods[org.apache.spark.sql.types.DataType])Boolean.
Unspecified value parameter threshold.
ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
I also tried the same thing using String and Int, and it worked perfectly fine. Can someone help me do this with Spark types?
Please note that StructField.dataType returns a DataType, not a specific subclass, and your code defines implicit ThresholdMethods instances only for IntegerType and StringType.
Because implicit resolution happens at compile time, there is not enough information for the compiler to determine the right instance, i.e. one whose type would match.
This would work, although you obviously don't want that:
ThresholdMethodsInstances.thresholdMatcher(
  df1.schema("T1").dataType.asInstanceOf[IntegerType]
)
but in practice you'll need something more explicit, like pattern matching, rather than an implicit argument.
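For example, a pattern-matching dispatch along those lines might look like this (a sketch that assumes the instances defined above are in scope):
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}
import ThresholdMethodsInstances._

def applyThresholdFor(dt: DataType): Boolean = dt match {
  case st: StringType  => thresholdMatcher(st)  // resolves ThresholdMethods[StringType]
  case it: IntegerType => thresholdMatcher(it)  // resolves ThresholdMethods[IntegerType]
  case other           => sys.error(s"No ThresholdMethods instance for $other")
}

// usage: applyThresholdFor(df1.schema("T1").dataType)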
You can also consider switching to a strongly typed Dataset, which would be a better choice here.