How to properly use groupBy on Spark RDD composed of case class instances? - scala

I am trying to do a groupBy on an RDD whose elements are instances of a simple case class, and I am getting a weird error that I don't know how to work around. The following code reproduces the problem in the Spark shell (Spark 0.9.0, Scala 2.10.3, Java 1.7.0):
case class EmployeeRec( name : String, position : String, salary : Double ) extends Serializable;
// I suspect extends Serializable is not needed for case classes, but just in case...
val data = sc.parallelize( Vector( EmployeeRec("Ana", "Analist", 200 ),
EmployeeRec("Maria", "Manager", 250.0 ),
EmployeeRec("Paul", "Director", 300.0 ) ) )
val groupFun = ( emp : EmployeeRec ) => emp.position
val dataByPos = data.groupBy( groupFun )
The resulting error from the last statement is:
val dataByPos = data.groupBy( groupFun )
<console>:21: error: type mismatch;
found : EmployeeRec => String
required: EmployeeRec => ?
val dataByPos = data.groupBy( groupFun )
So I tried:
val dataByPos = data.groupBy[String]( groupFun )
The error is a bit more scary now:
val dataByPos = data.groupBy[String]( groupFun )
<console>:18: error: overloaded method value groupBy with alternatives:
(f: EmployeeRec => String,p: org.apache.spark.Partitioner)(implicit evidence$8: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String,numPartitions: Int)(implicit evidence$7: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String)(implicit evidence$6: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])]
cannot be applied to (EmployeeRec => String)
val dataByPos = data.groupBy[String]( groupFun )
I tried to be more specific about which overload of the groupBy method I want to apply by adding the extra argument numPartitions = 10 (of course my real dataset is much bigger than just 3 records):
val dataByPos = data.groupBy[String]( groupFun, 10 )
I get the exact same error as before.
Any ideas? I suspect the issue might be related to the implicit evidence argument... Unfortunately this is one of the areas of Scala that I do not understand well.
Note 1: The analog of this code using tuples instead of the case class EmployeeRec works without any problem. However, I was hoping to be able to use case classes instead of tuples for nicer, more maintainable code that doesn't require me to remember or handle fields by position instead of by name (in reality I have many more than 3 fields per employee).
Note 2: It seems that this issue (when using case class EmployeeRec) might be fixed in Spark 1.x, since all of the versions of the code above compile correctly with the Eclipse Scala plugin when using spark-core_2.10-1.0.0-cdh5.1.0.jar.
However, I am not sure how, or whether, I will be able to run that version of Spark on the cluster I have access to, and I was hoping to better understand the problem so as to come up with a workaround for Spark 0.9.0.
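For reference, the tuple-based analog mentioned in Note 1 does work in the same shell; here is a minimal sketch of it:
// Tuple analog of the case-class version: groupBy infers the key type here without complaint.
val tupleData = sc.parallelize( Vector( ("Ana", "Analist", 200.0),
                                        ("Maria", "Manager", 250.0),
                                        ("Paul", "Director", 300.0) ) )
val tupleByPos = tupleData.groupBy( (emp: (String, String, Double)) => emp._2 )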

Related

Return Row with schema defined at runtime in Spark UDF

I've dulled my sword on this one; some help would be greatly appreciated!
Background
I am building an ETL pipeline that takes gNMI Protobuf update messages off of a Kafka queue and eventually breaks them out into a bunch of Delta tables based on the prefix and parameters of the paths to values (e.g. on the Databricks runtime).
Without going into the gory details, each prefix corresponds roughly to a schema for a table, with the caveat that the paths can change (usually new subtrees) upstream, so the schema is not fixed. This is similar to a nested JSON structure.
I first break out the updates by prefix, so all of the updates have roughly the same schema. I defined some transformations so that when the schema does not match exactly, I can coerce them into a common schema.
I'm running into trouble when I try to create a struct column with the common schema.
Attempt 1
I first tried just returning an Array[Any] from my udf, and providing a schema in the UDF definition (I know this is deprecated):
import org.apache.spark.sql.{functions => F, Row, types => T}
def mapToRow(deserialized: Map[String, ParsedValueV2]): Array[Any] = {
  def getValue(key: String): Any = {
    deserialized.get(key) match {
      case Some(value) => value.asType(columns(key))
      case None => None
    }
  }
  columns.keys.toArray.map(getValue).toArray
}
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def mapToStructUdf = F.udf(mapToRow _, account.sparkSchemas(prefix))
This snippet creates an Array object with the typed values that I need. Unfortunately when I try to use the UDF, I get this error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line8760b7c10da04d2489451bb90ca42c6535.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ParsedValueV2
I'm not sure what's not matching, but I did notice that the types of the values are Java types, not Scala ones, so perhaps that is related?
Attempt 2
Maybe I can use the Typed UDF interface after all? Can I create a case class at runtime for each schema, and then use that as the return value from my udf?
I've tried to get this to work using various stuff I found like this:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val test = tb.eval(tb.parse("object Test; Test"))
but I can't even get an instance of test, and can't figure out how to use it as the return value of a UDF. I presume I need to use a generic type somehow, but my scala-fu is too weak to figure this one out.
Finally, the question
Can someone help me figure out which approach to take, and how to proceed with that approach?
Thanks in advance for your help!!!
Update - is this a Spark bug?
I've distilled the problem down to this code:
import org.apache.spark.sql.{functions => F, Row, types => T, SparkSession}
// thanks @Dmytro Mitin
val spark = SparkSession.builder
  .master("local")
  .appName("Spark app")
  .getOrCreate()
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String", 42L)
// def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String")
def simpleUdf = F.udf(
  simpleFn(_),
  dataType = T.StructType(
    List(
      T.StructField("a_string", T.StringType),
      T.StructField("another_string", T.StringType),
      T.StructField("an_int", T.IntegerType),
    )
  )
)
Seq(("bar", "foo"))
.toDF("column", "input")
.withColumn(
"array_data",
simpleUdf($"input")
)
.show(truncate=false)
which results in this error message
IllegalArgumentException: The value (List(Another String, 42)) of the type (scala.collection.immutable.$colon$colon) cannot be converted to the string type
Hmm... that's odd. Where does that list come from, missing the first element of the row?
The two-valued version (e.g. "hello world", "Another String") has the same problem, but if I only have one value in my struct, then it's happy:
// def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String")
def simpleFn(foo: Any): Seq[Any] = List("hello world")
def simpleUdf = F.udf(
  simpleFn(_),
  dataType = T.StructType(
    List(
      T.StructField("a_string", T.StringType),
      // T.StructField("another_string", T.StringType),
      // T.StructField("an_int", T.IntegerType),
    )
  )
)
and my query gives me
+------+-----+-------------+
|column|input|array_data |
+------+-----+-------------+
|bar |foo |{hello world}|
+------+-----+-------------+
It looks like it's giving me the first element of my sequence as the first field of the struct, the rest of it as the second field of the struct, and then the third field is null (seen in other cases), which causes an exception.
This looks like a bug to me. Anyone else have any experience with UDFs with schemas built on the fly like this?
Spark 3.3.1, Scala 2.12, DBR 12.0
Reflection struggles
A stupid way to accomplish what I want to do would be to take the schemas I've inferred, generate a bunch of Scala code that implements case classes that I can use as return types from my UDFs, then compile the code, package up a JAR, load it into my Databricks runtime, and then use the case classes as return values from the UDFs.
This seems like a very convoluted way to do things. It would be great if I could just generate the case classes, and then do something like
def myUdf[CaseClass](input: SomeInputType): CaseClass =
CaseClass(input.giveMeResults: _*)
The problem is that I can't figure out how to get the type I've created using eval into the current "context" (I don't know the right word here).
This code:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val test = tb.eval(tb.parse("object Test; Test"))
gives me this:
...
test: Any = __wrapper$1$bb89c0cde37c48929fa9d8cdabeeb0f8.__wrapper$1$bb89c0cde37c48929fa9d8cdabeeb0f8$Test$1$#492531c0
test is, I think, an instance of Test, but the type system in the REPL doesn't know about any type named Test, so I can't use test.asInstanceOf[Test] or something like that
I know this is a frequently asked question, but I can't seem to find an answer anywhere about how to actually accomplish what I described above.
Regarding "Reflection struggles". It's not clear for me whether: 1) you already have def myUdf[T] = ... from somewhere and you're trying just to call it for generated case class: myUdf[GeneratedClass] or 2) you're trying to define def myUdf[T] = ... based on the generated class.
In the former case you should use:
tb.define to generate an object (or case class); it returns a class symbol (or module symbol) that you can use further (e.g. in a type position)
tb.eval to call the method (myUdf)
object Main extends App {
  def myUdf[T](): Unit = println("myUdf")

  import scala.reflect.runtime.universe
  import universe.Quasiquote
  import scala.tools.reflect.ToolBox

  val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
  val testSymbol = tb.define(q"object Test")
  val test = tb.eval(q"$testSymbol")
  tb.eval(q"Main.myUdf[$testSymbol]()") // myUdf
}
In this example I changed the signature (and body) of myUdf; you should use your actual ones.
In the latter case you can define myUdf at runtime too:
object Main extends App {
  import scala.reflect.runtime.universe
  import universe.Quasiquote
  import scala.tools.reflect.ToolBox

  val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
  val testSymbol = tb.define(q"object Test")
  val test = tb.eval(q"$testSymbol")
  val xSymbol = tb.define(
    q"""
      object X {
        def myUdf[T](): Unit = println("myUdf")
      }
    """
  )
  tb.eval(q"$xSymbol.myUdf[$testSymbol]()") // myUdf
}
You should try to write myUdf for the ordinary case, and then we can translate it for the runtime-generated one.
so I can't use test.asInstanceOf[Test] or something like that
Yeah, the type Test doesn't exist at compile time, so you can't use it like that. It exists at runtime, so you should use it inside quasiquotes q"..." (or tb.parse("...")):
object Main extends App {
  import scala.reflect.runtime.universe
  import universe.Quasiquote
  import scala.tools.reflect.ToolBox

  val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
  val testSymbol = tb.define(q"object Test")
  val test = tb.eval(q"$testSymbol")
  tb.eval(q"Main.test.asInstanceOf[${testSymbol.asModule.moduleClass.asClass.toType}]") // no exception, so test is an instance of Test
  tb.eval(q"Main.test.asInstanceOf[$testSymbol.type]") // no exception, so test is an instance of Test
  println(
    tb.eval(q"Main.test.getClass").asInstanceOf[Class[_]]
  ) // class __wrapper$1$0bbb246b633b472e8df54efc3e9ff9d9.Test$
  println(
    tb.eval(q"scala.reflect.runtime.universe.typeOf[$testSymbol.type]").asInstanceOf[universe.Type]
  ) // __wrapper$1$0bbb246b633b472e8df54efc3e9ff9d9.Test.type
}
Regarding the ClassCastException and IllegalArgumentException: I noticed that the exception disappears if you change the UDF return type:
def simpleUdf = F.udf(
  simpleFn(_),
  dataType = T.StructType(
    List(
      T.StructField("a_string", T.StringType),
      T.StructField("tail1", T.StructType(
        List(
          T.StructField("another_string", T.StringType),
          T.StructField("tail2", T.StructType(
            List(
              T.StructField("an_int", T.IntegerType),
            )
          )),
        )
      )),
    )
  )
)
//+------+-----+-------------------------------------+
//|column|input|array_data |
//+------+-----+-------------------------------------+
//|bar |foo |{hello world, {Another String, {42}}}|
//+------+-----+-------------------------------------+
I guess this makes sense because a List is :: (aka $colon$colon) of its head and tail, then the tail is :: of its head and tail etc.
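To make that concrete, here is a small illustration (not from the original answer) of how the returned List decomposes into head/tail pairs that line up with the nested schema above:
val xs: List[Any] = List("hello world", "Another String", 42L)
// xs.head           == "hello world"                -> a_string
// xs.tail           == List("Another String", 42L)  -> the tail1 struct
// xs.tail.head      == "Another String"             -> another_string
// xs.tail.tail      == List(42L)                    -> the tail2 struct
// xs.tail.tail.head == 42L                          -> an_int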
@Dmytro Mitin gets the majority of the credit for this answer. Thanks a ton for your help!
The solution I came to uses approach 1), the untyped API. The key is to do two things:
Return a Row (i.e. untyped) from the function that gets wrapped into the UDF
Create the UDF using the untyped API
Here is the toy example:
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def simpleFn(foo: Any): Row = Row("a_string", "hello world", 42L)
def simpleUdf = F.udf(
  simpleFn(_),
  dataType = T.StructType(
    List(
      T.StructField("a_string", T.StringType),
      T.StructField("another_string", T.StringType),
      T.StructField("an_int", T.LongType),
    )
  )
)
Now I can use it like this:
Seq(("bar", "foo"))
.toDF("column", "input")
.withColumn(
"struct_data",
simpleUdf($"input")
)
.withColumn(
"field_data",
$"struct_data.a_string"
)
.show(truncate=false)
Output:
+------+-----+---------------------------+----------+
|column|input|struct_data |field_data|
+------+-----+---------------------------+----------+
|bar |foo |{a_string, hello world, 42}|a_string |
+------+-----+---------------------------+----------+

Scala typemismatch in spark map

I wrote a method to load data from MongoDB (version 3.4) into Spark (using mongo-spark-connector version 2.2.1); my Spark version is 2.2.0 and my Scala version is 2.11.8. I want to pass in a function called resultHandler that accepts an org.bson.Document and returns T, to ETL the raw MongoDB data.
def loadFromMongodb[T1: ClassTag](
  mongoUri: String,
  spark: SparkSession,
  pipeline: Seq[Document]
)(
  resultHandler: Document => T1
): RDD[T1] = {
  spark
    .sparkContext
    .loadFromMongoDB(ReadConfig(Map("uri" -> mongoUri)))
    .withPipeline(pipeline) // filter push down
    .map(doc => resultHandler(doc))
}
When I compile it, I get a type mismatch error here:
Error:(50, 33) type mismatch;
found : T1
required: org.bson.Document
.map(doc => resultHandler(doc))
I don't know why.
Please help me, Thanks!
It looks like you are too smart for the mongo-spark-connector or at least there is a conflict between you being smart and mongo-spark-connector being smart with its implicit DefaultsTo magic.
I didn't dig up the whole story, but it looks like the fact that you provided your T1 with implicit ClassTag[T1] evidence is enough to cause the line
.loadFromMongoDB(ReadConfig(Map("uri" -> mongoUri)))
to be interpreted as
.loadFromMongoDB[T1](ReadConfig(Map("uri" -> mongoUri)))
which is obviously not what you want. The simplest way to fix your error is to explicitly type it with Document as in
def loadFromMongodb[T1: ClassTag](
  mongoUri: String,
  spark: SparkSession,
  pipeline: Seq[Document]
)(
  resultHandler: Document => T1
): RDD[T1] = {
  spark
    .sparkContext
    // specify [Document] explicitly
    .loadFromMongoDB[Document](ReadConfig(Map("uri" -> mongoUri)))
    .withPipeline(pipeline) // filter push down
    .map(doc => resultHandler(doc))
}
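For illustration, a hypothetical call site (the URI, pipeline, case class, and field names below are made up):
// Sketch only: assumes a SparkSession named spark and a collection of click events.
case class ClickEvent(user: String, page: String)

val clicks: RDD[ClickEvent] = loadFromMongodb(
  mongoUri = "mongodb://localhost:27017/mydb.events", // hypothetical database.collection
  spark = spark,
  pipeline = Seq(Document.parse("""{ "$match": { "kind": "click" } }"""))
) { doc =>
  ClickEvent(doc.getString("user"), doc.getString("page"))
}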

Why does Scala compiler fail with “Cannot resolve reference reduceByKeyAndWindow with such signature”?

I've got the following line in my Spark Streaming application that compiles fine:
val kafkaDirectStream: InputDStream[ConsumerRecord[String,String]] = KafkaUtils.createDirectStream(...)
kafkaDirectStream.map(_ => ("mockkey", 1)).reduceByKeyAndWindow(_+_, Seconds(30))
When I use the variant of reduceByKeyAndWindow with two Durations as follows:
.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
I face the below compiler error:
Cannot resolve reference reduceByKeyAndWindow with such signature
Why?
After kafkaDirectStream.map(_ => ("mockkey", 1)), you'll have DStream[(String, Int)] (which you can read about in the official documentation at org.apache.spark.streaming.dstream.DStream).
It appears that implicit scope does not give enough knowledge about types and hence the error:
missing parameter type for expanded function ((x$3, x$4) => x$3.$plus(x$4))
Unfortunately, I can't really explain the root cause of the compilation error, but a solution is to define a method or function with the types specified explicitly and use it instead of the bare underscores (i.e. _ + _).
val add: (Int, Int) => Int = _ + _
// or def add(x: Int, y: Int) = x + y
mapped.reduceByKeyAndWindow(add, Seconds(30), Seconds(10))
That will pass the Scala compiler.
(I wish I knew whether there's a better solution that somehow helps the Scala type inferencer.)
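For completeness, here is a minimal self-contained sketch (using queueStream instead of Kafka, with made-up names; not the original setup) showing the explicitly typed function compiling where the bare underscores do not:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Stand-in source so the sketch runs without Kafka.
val queue = mutable.Queue(ssc.sparkContext.parallelize(Seq("a", "b", "a")))
val words = ssc.queueStream(queue)

val add: (Int, Int) => Int = _ + _
val counts = words.map(w => (w, 1)).reduceByKeyAndWindow(add, Seconds(30), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTerminationOrTimeout(60000)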

Compilation Encoder error on spark 2.0

I am trying to move from Spark 1.6 to 2.0, and I get this error during compilation on 2.0 only:
def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] = {
  val columns: Array[String] = colNames.split(',')
  val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
  subGroupCount
}
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
Regards
The method DataFrame.map has changed between the versions:
In Spark 1.6, it operates on the underlying RDD[Row] and returns an RDD:
def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]
In Spark 2.0, DataFrame is just an alias for Dataset[Row], and therefore it returns a Dataset:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
As you can see, the latter expects an implicit Encoder argument, which is missing in your case.
Why is the Encoder missing?
First, all default encoders will be in scope once you import spark.implicits._. However, since the mapping's result type is Any (x => x.get(0) returns Any), you won't have an Encoder for it.
How to fix this?
If there's a common type (say, String, for the sake of example) for all the columns you're interested in, you can use getAs[String](0) to make the mapping function's return type specific. Once the above-mentioned import is added, such types (primitives, Products) will have a matching Encoder in scope.
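A minimal sketch of that first option, assuming (for illustration) that every selected column holds strings:
import spark.implicits._ // assumes a SparkSession named spark; brings the default Encoders into scope

def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[String]] = {
  val columns = colNames.split(',')
  columns.map(c => df.select(c).distinct.map(_.getAs[String](0)).collect.toSeq)
}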
If you don't have a known type that is common to all the relevant columns, and you want to retain the same behavior, you can get the DataFrame's RDD using .rdd and use that RDD's map operation, which will be identical to the pre-2.0 behavior:
columns.map(c => df.select(c).distinct.rdd.map(x => x.get(0)).collect.toSeq)

flatMap Compile Error found: TraversableOnce[String] required: TraversableOnce[String]

EDIT #2: This might be memory related. Logs are showing out-of-heap errors.
Yes, definitely memory related: docker logs reports all the out-of-heap spewage from the JVM, but the Jupyter web notebook does not pass that on to the user. Instead the user gets kernel failures and occasional weird behavior, like code not compiling correctly.
Spark 1.6, particularly docker run -d .... jupyter/all-spark-notebook
I would like to count accounts in a file of ~1 million transactions.
This is simple enough and could be done without Spark, but I've hit an odd error trying it with Spark and Scala.
The input data is of type RDD[etherTrans], where etherTrans is a custom type wrapping a single transaction: a timestamp, the from and to accounts, and the value transacted in ether.
class etherTrans(ts_in: Long, afrom_in: String, ato_in: String, ether_in: Float)
  extends Serializable {
  var ts: Long = ts_in
  var afrom: String = afrom_in
  var ato: String = ato_in
  var ether: Float = ether_in
  override def toString(): String = ts.toString + "," + afrom + "," + ato + "," + ether.toString
}
data:RDD[etherTrans] looks ok:
data.take(10).foreach(println)
etherTrans(1438918233,0xa1e4380a3b1f749673e270229993ee55f35663b4,0x5df9b87991262f6ba471f09758cde1c0fc1de734,3.1337E-14)
etherTrans(1438918613,0xbd08e0cddec097db7901ea819a3d1fd9de8951a2,0x5c12a8e43faf884521c2454f39560e6c265a68c8,19.9)
etherTrans(1438918630,0x63ac545c991243fa18aec41d4f6f598e555015dc,0xc93f2250589a6563f5359051c1ea25746549f0d8,599.9895)
etherTrans(1438918983,0x037dd056e7fdbd641db5b6bea2a8780a83fae180,0x7e7ec15a5944e978257ddae0008c2f2ece0a6090,100.0)
etherTrans(1438919175,0x3f2f381491797cc5c0d48296c14fd0cd00cdfa2d,0x4bd5f0ee173c81d42765154865ee69361b6ad189,803.9895)
etherTrans(1438919394,0xa1e4380a3b1f749673e270229993ee55f35663b4,0xc9d4035f4a9226d50f79b73aafb5d874a1b6537e,3.1337E-14)
etherTrans(1438919451,0xc8ebccc5f5689fa8659d83713341e5ad19349448,0xc8ebccc5f5689fa8659d83713341e5ad19349448,0.0)
etherTrans(1438919461,0xa1e4380a3b1f749673e270229993ee55f35663b4,0x5df9b87991262f6ba471f09758cde1c0fc1de734,3.1337E-14)
etherTrans(1438919491,0xf0cf0af5bd7d8a3a1cad12a30b097265d49f255d,0xb608771949021d2f2f1c9c5afb980ad8bcda3985,100.0)
etherTrans(1438919571,0x1c68a66138783a63c98cc675a9ec77af4598d35e,0xc8ebccc5f5689fa8659d83713341e5ad19349448,50.0)
This next function compiles OK and is written this way because earlier attempts complained of a type mismatch between Array[String] or List[String] and TraversableOnce[?]:
def arrow(e:etherTrans):TraversableOnce[String] = Array(e.afrom,e.ato)
But then using this function with flatMap to get an RDD[String] of all accounts fails.
val accts:RDD[String] = data.flatMap(arrow)
Name: Compile Error
Message: :38: error: type mismatch;
found : etherTrans(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC) => TraversableOnce[String]
required: etherTrans(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC) => TraversableOnce[String]
val accts:RDD[String] = data.flatMap(arrow)
^
StackTrace:
Make sure you scroll right to see it complain that TraversableOnce[String]
doesn't match TraversableOnce[String]
This must be a fairly common problem, as a more blatant type mismatch comes up in "Generate List of Pairs", and, while there isn't enough context, it is suggested in "I have a Scala List, how can I get a TraversableOnce?".
What's going on here?
EDIT: The issue reported above doesn't appear, and the code works fine, in an older spark-shell (Spark 1.3.1 running standalone in a Docker container). The errors are generated when running in the Spark 1.6 Scala Jupyter environment with the jupyter/all-spark-notebook Docker container.
Also, @zero323 says that this toy example:
val rdd = sc.parallelize(Seq((1L, "foo", "bar", 1))).map{ case (ts, fr, to, et) => new etherTrans(ts, fr, to, et)}
rdd.flatMap(arrow).collect
worked for him in a terminal spark-shell with Spark 1.6.0 / Scala 2.10.5, and that Scala 2.11.7 with Spark 1.5.2 works as well.
I think you should switch to case classes, and it should work fine. Using "regular" classes might cause weird issues when serializing them, and it looks like all you need are value objects, so case classes are a better fit for your use case.
An example:
case class EtherTrans(ts: Long, afrom: String, ato: String, ether: Float)

val source = sc.parallelize(Array(
  (1L, "from1", "to1", 1.234F),
  (2L, "from2", "to2", 3.456F)
))
val data = source.map { l => EtherTrans(l._1, l._2, l._3, l._4) }
def arrow(e: EtherTrans) = Array(e.afrom, e.ato)
data.map(arrow).take(5)
/*
res3: Array[Array[String]] = Array(Array(from1, to1), Array(from2, to2))
*/
If you need to, you can just create some method / object to generate your case classes.
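For example, such a generator could be as small as this sketch (hypothetical helper name), which is equivalent to the map above:
// Hypothetical helper turning the raw tuples into the case class.
def etherTransFromTuple(t: (Long, String, String, Float)): EtherTrans =
  EtherTrans(t._1, t._2, t._3, t._4)

val data = source.map(etherTransFromTuple)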
If you don't really need the "toString" method for your logic, but just for "presentation", keep it out of the case class: you can always add it with a map operation before storing or showing it.
Also, if you are on Spark 1.6.0 or higher, you could try using the Dataset API instead, which would look more or less like this:
val data = sqlContext.read.text("your_file").as[EtherTrans]
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
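Finally, tying this back to the original goal of an RDD[String] of all the accounts, the flatMap from the question works with the case-class version; a sketch using the sample data above:
import org.apache.spark.rdd.RDD

val accts: RDD[String] = data.flatMap(e => arrow(e))
accts.take(4)
// expected: Array(from1, to1, from2, to2)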