Spark cannot serialize object inside a function?

I am using Spark 1.5.
I run the following code and get a java.lang.ExceptionInInitializerError:
def work() = {
  val rdd1: RDD[String] = read_rdd()
  val map: Map[String, Boolean] = read_a_map()
  object brx extends Serializable { val value = map }
  def filter(rdd: RDD[String]) = {
    rdd filter brx.value.apply // throws the exception
  }
  def filter2(rdd: RDD[String]) = {
    rdd filter map // works fine
  }
  val x = filter(rdd1)  // throws the exception
  val y = filter2(rdd1) // works fine
}
Why do I get java.lang.ExceptionInInitializerError?
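For what it's worth, a minimal sketch of a common workaround, assuming the goal is simply to avoid closing over the method-local brx object: copy the value into a plain local val and use that inside the closure, just as filter2 already does with map.
def filter(rdd: RDD[String]) = {
  val localMap = map        // plain local value, serialized directly with the closure
  rdd filter localMap.apply // no reference to the nested brx object
}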

Spark: object not serializable

I have a batch job which I am trying to convert to structured streaming. I am getting the following error:
20/03/31 15:09:23 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.NotSerializableException: com.apple.ireporter.analytics.compute.AggregateKey
Serialization stack:
- object not serializable (class: com.apple.ireporter.analytics.compute.AggregateKey, value: d_)
... where "d_" is the last row in the dataset
This is the relevant code snippet
df.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  import spark.implicits._
  val javaRdd = batchDF.toJavaRDD
  val dataframeToRowColFunction = new RowToColumn(table)
  println("Back to Main class")
  val combinedRdd = javaRdd.flatMapToPair(dataframeToRowColFunction.FlatMapData2)
    .combineByKey(aggrCreateComb.createCombiner, aggrMerge.aggrMerge, aggrMergeCombiner.aggrMergeCombiner)
  // spark.createDataFrame(combinedRdd).show(1) // I commented this out
  // combinedRdd.collect() // I added this as a test
}
This is the FlatMapData2 function:
val FlatMapData2: PairFlatMapFunction[Row, AggregateKey, AggregateValue] =
  new PairFlatMapFunction[Row, AggregateKey, AggregateValue]() {
    // val FlatMapData: PairFlatMapFunction[Row, String, AggregateValue] = new PairFlatMapFunction[Row, String, AggregateValue]() {
    override def call(x: Row) = {
      val tuples = new util.ArrayList[Tuple2[AggregateKey, AggregateValue]]
      val decomposedEvents = decomposer.decomposeDistributed(x)
      decomposedEvents.foreach { y =>
        tuples.add(Tuple2(y._1, y._2))
      }
      tuples.iterator()
    }
  }
Here is the AggregateKey class:
class AggregateKey(var partitionkeys: Map[Int,Any],var clusteringkeys : Map[Int,Any]) extends Comparable [AggregateKey]{
...
}
I am new to this, and any help would be appreciated. Please let me know if anything else needs to be added.
I was able to solve this problem by making AggregateKey extend java.io.Serializable:
class AggregateKey(var partitionkeys: Map[Int, Any], var clusteringkeys: Map[Int, Any]) extends java.io.Serializable {
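As a side note, since AggregateKey is used as the key of a pair RDD, combineByKey groups records via equals and hashCode; a hedged sketch of a minimal key class along those lines, assuming only the two fields shown above:
class AggregateKey(var partitionkeys: Map[Int, Any], var clusteringkeys: Map[Int, Any])
    extends java.io.Serializable {
  // keys that are equal must hash equally, otherwise combineByKey will not
  // merge their values into the same combiner
  override def equals(other: Any): Boolean = other match {
    case that: AggregateKey =>
      partitionkeys == that.partitionkeys && clusteringkeys == that.clusteringkeys
    case _ => false
  }
  override def hashCode(): Int = (partitionkeys, clusteringkeys).hashCode()
}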

type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]

Why does the following code have a compilation error at the return statement?
def getData(queries: Array[String]): Dataset[Row] = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
  return res
}
Error:
type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
Scala version 2.11.11
Spark version 2.0.0
EDIT:
Actual case
def getDataFrames(queries: Array[String]) = {
  val jdbcResult = queries.map(query => {
    val tablename = extractTableName(query)
    if (tablename.contains("1")) {
      spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl1, query, props)
    } else {
      spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl2, query, props)
    }
  })
}
Here I want to return the combined output of the iteration as an Array[Dataset[Row]] or Array[DataFrame] (but DataFrame is not available in 2.0.0 as a dependency). Does the above code do the magic, or how can I do it?
You can return a list of DataFrames as List[DataFrame]:
def getData(queries: Array[String]): List[DataFrame] = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props)
  // create multiple dataframes from your queries
  val df1 = ???
  val df2 = ???
  val list = List(df1, df2)
  // you can build the list dynamically from your list of queries
  list
}
registerTempTable returns Unit, so you should remove the registerTempTable call and return the DataFrame, or a list of DataFrames.
UPDATE:
Here is how you can return a list of DataFrames from a list of queries:
def getDataFrames(queries: Array[String]): Array[DataFrame] = {
  val jdbcResult = queries.map(query => {
    val tablename = extractTableName(query)
    val dataframe = if (tablename.contains("1")) {
      spark.sqlContext.read.format("jdbc").jdbc("", query, prop)
    } else {
      spark.sqlContext.read.format("jdbc").jdbc("", query, prop)
    }
    dataframe
  })
  jdbcResult
}
I hope this helps!
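For completeness, a hedged usage sketch with hypothetical query strings; when a subquery is passed as the JDBC table parameter it needs an alias:
val dataframes: Array[DataFrame] = getDataFrames(Array(
  "(SELECT * FROM table1) t1",  // hypothetical queries
  "(SELECT * FROM table2) t2"
))
dataframes.foreach(_.show(5))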
It's clear from the error message that there is a type mismatch in your function.
The registerTempTable() API creates an in-memory table scoped to the current session, and it stays accessible as long as the SparkSession is active.
Check the return type of the registerTempTable() API: it is Unit.
Change your code to the following to remove the error message:
def getData(queries: Array[String]): Unit = {
  val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
}
An even better way would be to write the code as follows:
val tempName: String = "Name_Of_Temp_View"
spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView(tempName)
Use createOrReplaceTempView(), as registerTempTable() has been deprecated since Spark 2.0.0.
An alternate solution, as per your requirement:
def getData(queries: Array[String], spark: SparkSession): Array[DataFrame] = {
  spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView("Name_Of_Temp_Table")
  val result: Array[DataFrame] = queries.map(query => spark.sql(query))
  result
}
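And a hedged usage sketch for this temp-view variant, assuming the queries are plain Spark SQL against the registered view (the query strings are hypothetical):
val queries = Array(
  "SELECT * FROM Name_Of_Temp_Table WHERE some_column IS NOT NULL",
  "SELECT COUNT(*) AS cnt FROM Name_Of_Temp_Table"
)
val results: Array[DataFrame] = getData(queries, spark)
results.foreach(_.show())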

Handle Spark state isTimingOut

I'm using the new mapWithState function in Spark Streaming (1.6) with a timing-out state. I want to take the timed-out state and add it to another RDD in order to use it for calculations further down the road:
val aggedlogs = sc.emptyRDD[MyLog]

val mappingFunc = (key: String, newlog: Option[MyLog], state: State[MyLog]) => {
  val _newLog = newlog.getOrElse(null)
  if (state.exists() && _newLog != null) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog)
    state.update(combinedLog)
  } else if (_newLog != null) {
    state.update(_newLog)
  }
  if (state.isTimingOut()) {
    val stateLog = state.get()
    aggedlogs.union(sc.parallelize(List(stateLog), 1))
  }
  val stateLog = state.get()
  (key, stateLog)
}
val stateDstream = reducedlogs.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(10)))
But when I try to add it to an RDD inside the StateSpec function, I get an error that the function is not serializable. Any thoughts on how I can get past this?
EDIT:
After drilling deeper I found that my approach was wrong. Before trying this solution I tried to get the timed-out logs from stateSnapshots(), but they were not there anymore. Changing the mapping function to:
def mappingFunc(key: String, newlog: Option[MyLog], state: State[MyLog]): Option[(String, MyLog)] = {
  val _newLog = newlog.getOrElse(null)
  if (state.exists() && _newLog != null) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog)
    state.update(combinedLog)
    Some(key, combinedLog)
  } else if (_newLog != null) {
    state.update(_newLog)
    Some(key, _newLog)
  }
  if (state.isTimingOut()) {
    val stateLog = state.get()
    stateLog.timinigOut = true
    System.out.println("timinigOut : " + key)
    Some(key, stateLog)
  }
  val stateLog = state.get()
  Some(key, stateLog)
}
I managed to filter the MapWithStateDStream for the logs that are timing out in each batch:
val stateDstream = reducedlogs.mapWithState(
  StateSpec.function(mappingFunc _).timeout(Seconds(60)))
val timingoutlogs = stateDstream.filter(filtertimingout)
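The filtertimingout predicate is not shown in the question; a minimal sketch of what it could look like, assuming the mapping function above returns Option[(String, MyLog)] and that MyLog carries the timinigOut flag set on timeout:
def filtertimingout(mapped: Option[(String, MyLog)]): Boolean =
  mapped.exists(_._2.timinigOut) // keep only entries whose state was flagged as timing out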

How to list out all the files in the public directory in a Play Framework 2.X Scala application?

Here is my controller
class Proguard extends Controller {
  val proguardFolder = "/public/proguards/"
  val proguardFolderFix = "/public/proguards"
  val proguardSuffix = "proguard-"
  val proguardExtension = ".pro"
  val title = "# Created by https://www.proguard.io/api/%s\n\n%s"

  def proguard(libraryName: String) = Action {
    val libraries = libraryName.split(',')
    val availableLibs = listInDir(proguardFolderFix)
    val result = availableLibs.filter(libraries.contains).map(readFile).mkString
    Ok(title.format(libraryName, result))
  }

  def list() = Action {
    Ok(Json.toJson(listInDir(proguardFolder)))
  }

  private def listInDir(filePath: String): List[String] = {
    getListOfFiles(Play.getFile(filePath)).map(_.getName.replace(proguardExtension, "").replace(proguardSuffix, ""))
  }

  def getListOfFiles(dir: File): List[File] = {
    dir.listFiles.toList
  }

  def readFile(string: String): String = {
    val source = scala.io.Source.fromFile(Play.getFile(s"$proguardFolder$proguardSuffix$string$proguardExtension"))
    val lines = try source.mkString finally source.close()
    lines
  }
}
It worked totally fine in debug mode, but in production on Heroku dir.listFiles is giving me an NPE.
I've tried different approaches, but it looks like the only solution is to move my files to S3 or a database.
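The root cause is that in a packaged production build the public folder lives inside a jar, so Play.getFile resolves to a path that does not exist on disk and listFiles returns null. A hedged sketch of reading a single known file from the classpath instead, assuming the assets are packaged under public/proguards/ (listing the directory would additionally require scanning the jar entries or shipping an index file); the file name used below is hypothetical:
def readFromClasspath(fileName: String): Option[String] = {
  // assets under /public end up on the classpath under "public/..."
  val stream = Option(getClass.getClassLoader.getResourceAsStream(s"public/proguards/$fileName"))
  stream.map { in =>
    val source = scala.io.Source.fromInputStream(in)
    try source.mkString finally source.close()
  }
}
// hypothetical usage
readFromClasspath("proguard-okhttp.pro").foreach(println)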

Creating serializable objects from Scala source code at runtime

To embed Scala as a "scripting language", I need to be able to compile text fragments to simple objects, such as a Function0[Unit], that can be serialised to and deserialised from disk, and that can be loaded into the current runtime and executed.
How would I go about this?
Say for example, my text fragment is (purely hypothetical):
Document.current.elements.headOption.foreach(_.open())
This might be wrapped into the following complete text:
package myapp.userscripts
import myapp.DSL._
object UserFunction1234 extends Function0[Unit] {
def apply(): Unit = {
Document.current.elements.headOption.foreach(_.open())
}
}
What comes next? Should I use IMain to compile this code? I don't want to use the normal interpreter mode, because the compilation should be "context-free" and not accumulate requests.
What I need to get hold of from the compilation is, I guess, the binary class file? In that case, serialisation is straightforward (a byte array). How would I then load that class into the runtime and invoke the apply method?
What happens if the code compiles to multiple auxiliary classes? The example above contains a closure _.open(). How do I make sure I "package" all those auxiliary things into one object to serialize and class-load?
Note: Given that Scala 2.11 is imminent and the compiler API has probably changed, I am happy to receive hints on how to approach this problem in Scala 2.11.
Here is one idea: use a regular Scala compiler instance. Unfortunately it seems to require the use of hard disk files both for input and output. So we use temporary files for that. The output will be zipped up in a JAR which will be stored as a byte array (that would go into the hypothetical serialization process). We need a special class loader to retrieve the class again from the extracted JAR.
The following assumes Scala 2.10.3 with the scala-compiler library on the class path:
import scala.tools.nsc
import java.io._
import scala.annotation.tailrec
Wrapping the user-provided code in a function class with a synthetic name that is incremented for each new fragment:
val packageName = "myapp"

var userCount = 0

def mkFunName(): String = {
  val c = userCount
  userCount += 1
  s"Fun$c"
}

def wrapSource(source: String): (String, String) = {
  val fun = mkFunName()
  val code = s"""package $packageName
                |
                |class $fun extends Function0[Unit] {
                |  def apply(): Unit = {
                |    $source
                |  }
                |}
                |""".stripMargin
  (fun, code)
}
A function to compile a source fragment and return the byte array of the resulting jar:
/** Compiles a source code consisting of a body which is wrapped in a `Function0`
  * apply method, and returns the function's class name (without package) and the
  * raw jar file produced in the compilation.
  */
def compile(source: String): (String, Array[Byte]) = {
  val set = new nsc.Settings
  val d = File.createTempFile("temp", ".out")
  d.delete(); d.mkdir()
  set.d.value = d.getPath
  set.usejavacp.value = true
  val compiler = new nsc.Global(set)
  val f = File.createTempFile("temp", ".scala")
  val out = new BufferedOutputStream(new FileOutputStream(f))
  val (fun, code) = wrapSource(source)
  out.write(code.getBytes("UTF-8"))
  out.flush(); out.close()
  val run = new compiler.Run()
  run.compile(List(f.getPath))
  f.delete()
  val bytes = packJar(d)
  deleteDir(d)
  (fun, bytes)
}
def deleteDir(base: File): Unit = {
  base.listFiles().foreach { f =>
    if (f.isFile) f.delete()
    else deleteDir(f)
  }
  base.delete()
}
Note: Doesn't handle compiler errors yet!
The packJar method uses the compiler output directory and produces an in-memory jar file from it:
// cf. http://stackoverflow.com/questions/1281229
def packJar(base: File): Array[Byte] = {
  import java.util.jar._
  val mf = new Manifest
  mf.getMainAttributes.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  val bs = new java.io.ByteArrayOutputStream
  val out = new JarOutputStream(bs, mf)
  def add(prefix: String, f: File): Unit = {
    val name0 = prefix + f.getName
    val name = if (f.isDirectory) name0 + "/" else name0
    val entry = new JarEntry(name)
    entry.setTime(f.lastModified())
    out.putNextEntry(entry)
    if (f.isFile) {
      val in = new BufferedInputStream(new FileInputStream(f))
      try {
        val buf = new Array[Byte](1024)
        @tailrec def loop(): Unit = {
          val count = in.read(buf)
          if (count >= 0) {
            out.write(buf, 0, count)
            loop()
          }
        }
        loop()
      } finally {
        in.close()
      }
    }
    out.closeEntry()
    if (f.isDirectory) f.listFiles.foreach(add(name, _))
  }
  base.listFiles().foreach(add("", _))
  out.close()
  bs.toByteArray
}
A utility function that takes the byte array found in deserialization and creates a map from class names to class byte code:
def unpackJar(bytes: Array[Byte]): Map[String, Array[Byte]] = {
  import java.util.jar._
  import scala.annotation.tailrec
  val in = new JarInputStream(new ByteArrayInputStream(bytes))
  val b = Map.newBuilder[String, Array[Byte]]
  @tailrec def loop(): Unit = {
    val entry = in.getNextJarEntry
    if (entry != null) {
      if (!entry.isDirectory) {
        val name = entry.getName
        // cf. http://stackoverflow.com/questions/8909743
        val bs = new ByteArrayOutputStream
        var i = 0
        while (i >= 0) {
          i = in.read()
          if (i >= 0) bs.write(i)
        }
        val bytes = bs.toByteArray
        b += mkClassName(name) -> bytes
      }
      loop()
    }
  }
  loop()
  in.close()
  b.result()
}
def mkClassName(path: String): String = {
  require(path.endsWith(".class"))
  path.substring(0, path.length - 6).replace("/", ".")
}
A suitable class loader:
class MemoryClassLoader(map: Map[String, Array[Byte]]) extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    map.get(name).map { bytes =>
      println(s"defineClass($name, ...)")
      defineClass(name, bytes, 0, bytes.length)
    }.getOrElse(super.findClass(name)) // throws exception
}
And a test case which contains additional classes (closures):
val exampleSource =
  """val xs = List("hello", "world")
    |println(xs.map(_.capitalize).mkString(" "))
    |""".stripMargin

def test(fun: String, cl: ClassLoader): Unit = {
  val clName = s"$packageName.$fun"
  println(s"Resolving class '$clName'...")
  val clazz = Class.forName(clName, true, cl)
  println("Instantiating...")
  val x = clazz.newInstance().asInstanceOf[() => Unit]
  println("Invoking 'apply':")
  x()
}

locally {
  println("Compiling...")
  val (fun, bytes) = compile(exampleSource)
  val map = unpackJar(bytes)
  println("Classes found:")
  map.keys.foreach(k => println(s"  '$k'"))
  val cl = new MemoryClassLoader(map)
  test(fun, cl) // should call `defineClass`
  test(fun, cl) // should find the cached class
}
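To close the loop with the serialisation requirement in the question, a minimal sketch of persisting a compiled fragment and loading it back later, reusing compile, unpackJar, MemoryClassLoader and test from above; the fragment file names are hypothetical:
import java.nio.file.{Files, Paths}

// store the synthetic function name and the jar bytes side by side
def save(path: String, fun: String, bytes: Array[Byte]): Unit = {
  Files.write(Paths.get(path + ".name"), fun.getBytes("UTF-8"))
  Files.write(Paths.get(path + ".jar"), bytes)
}

// read them back and run the fragment through the in-memory class loader
def load(path: String): Unit = {
  val fun = new String(Files.readAllBytes(Paths.get(path + ".name")), "UTF-8")
  val bytes = Files.readAllBytes(Paths.get(path + ".jar"))
  val cl = new MemoryClassLoader(unpackJar(bytes))
  test(fun, cl)
}

// hypothetical usage:
// val (fun, bytes) = compile(exampleSource)
// save("fragment", fun, bytes)
// load("fragment")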