Kryo Serialization not registering even after registering the class in conf - scala

I made a Person class and registered it, but at runtime it still says the class is not registered. Why is this happening?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException: Class is not registered: KyroExample$Person[]
Note: To register this class use: kryo.register(KyroExample$Person[].class);
Here is the sample code:
val conf = new SparkConf().setAppName("kyroExample").setMaster("local")
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Person],classOf[String])) //registered the class
conf.set("spark.kryo.registrationRequired", "true")
val sparkContext = new SparkContext(conf)
case class Person(name:String, age:Int) //this is the class
val personList: immutable.Seq[Person] = (1 to 100000).map(value => Person(value + "", value))
val rdd: RDD[Person] = sparkContext.parallelize(personList)
val evenAge: RDD[Person] = rdd.filter(_.age % 2 == 0)
evenAge.persist(StorageLevel.MEMORY_ONLY_SER)
evenAge.count()
evenAge.persist(StorageLevel.MEMORY_ONLY_SER)
evenAge.count()
Thread.sleep(200000)

It worked after registering both classes:
.registerKryoClasses(
Array(classOf[Person],classOf[Array[Person]])
)
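For reference, a minimal sketch of the whole configuration with both registrations in place (same names as the question's code; note that with spark.kryo.registrationRequired set to true, any other class that reaches Kryo would also need to be registered):
val conf = new SparkConf()
  .setAppName("kyroExample")
  .setMaster("local")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "true")
// Register both the element class and its array class; Spark serializes Person[]
// when shipping the parallelized data, which is what the error message asks for
conf.registerKryoClasses(Array(classOf[Person], classOf[Array[Person]]))
val sparkContext = new SparkContext(conf)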

Related

Null pointer exception with GenericWriteAheadSink and FileCheckpointCommitter in Flink 1.8

I'm implementing a custom NSQ sink for Flink. I have it working as a subclass of RichSinkFunction, but I'd like to get the write-ahead log implementation working for extra data integrity.
Using O'Reilly's WriteAheadSinkExample available here, I attempted to implement my own:
package com.wistia.analytics
import java.net.{InetSocketAddress, SocketAddress}
import com.github.mitallast.nsq._
import org.apache.flink.api.scala.createTypeInformation
import java.lang.Iterable
import java.nio.file.{Files, Paths}
import java.util.UUID
import org.apache.commons.lang3.StringUtils
import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.streaming.runtime.operators.{CheckpointCommitter, GenericWriteAheadSink}
import scala.collection.mutable
class WALNsqSink(val topic: String) extends GenericWriteAheadSink[String](
// CheckpointCommitter that commits checkpoints to the local filesystem
new FileCheckpointCommitter(System.getProperty("java.io.tmpdir")),
// Serializer for records
createTypeInformation[String]
.createSerializer(new ExecutionConfig),
// Random JobID used by the CheckpointCommitter
UUID.randomUUID.toString) {
var client: NSQClient = _
var producer: NSQProducer = _
override def open(): Unit = {
val lookup = new NSQLookup {
def nodes(): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1",4150))
def lookup(topic: String): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1",4150))
}
client = NSQClient(lookup)
producer = client.producer()
}
def sendValues(readings: Iterable[String], checkpointId: Long, timestamp: Long): Boolean = {
  // Buffer the checkpointed records; note that "arr :+ reading" returns a new
  // collection and discards it, so an ArrayBuffer with "+=" is used here instead
  val arr = mutable.ArrayBuffer[String]()
  readings.forEach { reading =>
    arr += reading
  }
  producer.mpubStr(topic = topic, data = arr)
  true
}
}
I'm reusing the FileCheckpointCommitter from that example at the bottom of the class, but I get a null pointer exception inside GenericWriteAheadSink:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:638)
at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:123)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
at com.wistia.analytics.NsqProcessor$.main(NsqProcessor.scala:24)
at com.wistia.analytics.NsqProcessor.main(NsqProcessor.scala)
Caused by: java.lang.NullPointerException
at org.apache.flink.streaming.runtime.operators.GenericWriteAheadSink.processElement(GenericWriteAheadSink.java:277)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Nonzero exit code returned from runner: 1
(Compile / run) Nonzero exit code returned from runner: 1
Total time: 45 s, completed Feb 10, 2020 6:41:06 PM
I have no idea where to go from here. Any help appreciated
The issue here is almost certainly that you never call the open() method of the superclass, which leaves some of its internal variables uninitialized.
This should be solved by calling super.open() inside your open() method.
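A minimal sketch of the fix, keeping the question's open() body and only adding the superclass call:
override def open(): Unit = {
  // Let GenericWriteAheadSink initialize its internal state (checkpoint committer,
  // serializer, pending-checkpoint bookkeeping) before our own setup
  super.open()
  val lookup = new NSQLookup {
    def nodes(): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1", 4150))
    def lookup(topic: String): List[SocketAddress] = List(new InetSocketAddress("127.0.0.1", 4150))
  }
  client = NSQClient(lookup)
  producer = client.producer()
}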

how to Connect to NEO4J in Spark worker nodes?

I need to get a small subgraph in a Spark map function. I have tried AnormCypher and neo4j-spark-connector, but neither works. AnormCypher leads to a java.io.IOException (I build the connection in a mapPartitions function and test against a localhost server), and neo4j-spark-connector causes the Task not serializable exception below.
Is there a good way to get a subgraph (or connect to a graph database like Neo4j) from Spark worker nodes?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
at ....
My code snippet using neo4j-spark-connector 2.0.0-m2:
val neo = Neo4j(sc) // this runs on the driver
// this is called from a map function
def someFunctionToBeMapped(p: List[Long]) = {
  val metaGraph = neo.cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace) " +
    "return a.id, r.distance, b.id").loadRowRdd
    .map(row => ((row(0).asInstanceOf[Long], row(2).asInstanceOf[Long]), row(1).asInstanceOf[Double]))
    .collect().toList
}
The AnormCypher code is:
def partitionMap(partition: Iterator[List[Long]]) = {
import org.anormcypher._
import play.api.libs.ws._
// Provide an instance of WSClient
val wsclient = ning.NingWSClient()
// Setup the Rest Client
// Need to add the Neo4jConnection type annotation so that the default
// Neo4jConnection -> Neo4jTransaction conversion is in the implicit scope
implicit val connection: Neo4jConnection = Neo4jREST("127.0.0.1", 7474, "neo4j", "000000")(wsclient)
//
// Provide an ExecutionContext
implicit val ec = scala.concurrent.ExecutionContext.global
val res = partition.filter { placeList =>
  val startPlace = Cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace) " +
    "return p")().flatMap(row => row.data)
  startPlace.nonEmpty // assumed predicate: the original snippet did not return a Boolean from the filter
}
wsclient.close()
res
}
I have used Spark standalone mode and was able to connect to the Neo4j database.
Versions used:
spark 2.1.0
neo4j-spark-connector 2.1.0-m2
My code:
val sparkConf = new SparkConf().setAppName("Neo$j").setMaster("local")
val sc = new SparkContext(sparkConf)
println("***Getting Started ****")
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n) RETURN id(n) as id").loadDataFrame
println(rdd.count)
Spark submit:
spark-submit --class package.classname --jars pathofneo4jsparkconnectoryJAR --conf spark.neo4j.bolt.password=***** targetJarFile.jar
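For completeness, a sketch of the same setup with the Bolt connection settings placed in SparkConf instead of on the spark-submit command line. spark.neo4j.bolt.password is the key used in the spark-submit above; the url/user keys and the bolt://localhost URL and credentials are assumed placeholders following the connector's spark.neo4j.bolt.* pattern:
import org.apache.spark.{SparkConf, SparkContext}
import org.neo4j.spark.Neo4j

val sparkConf = new SparkConf()
  .setAppName("Neo4j")
  .setMaster("local")
  // Bolt connection settings picked up by neo4j-spark-connector
  .set("spark.neo4j.bolt.url", "bolt://localhost:7687")
  .set("spark.neo4j.bolt.user", "neo4j")
  .set("spark.neo4j.bolt.password", "*****")

val sc = new SparkContext(sparkConf)
val neo = Neo4j(sc)
val df = neo.cypher("MATCH (n) RETURN id(n) as id").loadDataFrame
println(df.count)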

Code working in Spark-Shell not in eclipse

I have a small piece of Scala code that works properly in the Spark shell but not in Eclipse with the Scala plugin. I can access HDFS from the plugin; I tried writing another file and that worked.
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object FirstSpark {
def main(args: Array[String])={
val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
val sparkcontext = new SparkContext(conf)
val textFile =sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
val m = new Methods()
val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x))
q.saveAsTextFile("hdfs://pranay:8020/output") }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction
class Methods {
def isHeader(s:String):Boolean={
s.contains("id_1")
}
def parse(line:String) ={
val pieces = line.split(',')
val id1=pieces(0).toInt
val id2=pieces(1).toInt
val matches=pieces(11).toBoolean
val mapArray=pieces.slice(2, 11).map(toDouble)
MatchData(id1,id2,mapArray,matches)
}
def toDouble(s: String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
}
case class MatchData(id1: Int, id2: Int,
scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?
Try changing class Methods { .. } to object Methods { .. }.
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map functions it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can dispatch the work of executing them to all of the executors (this is the Task referred to). However, to do this, it needs to serialize m, since this object is referenced inside the function (it is in the closure of the two anonymous methods) - but it cannot do this since Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and is already Serializable).
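A sketch of what that looks like applied to the code above (only the declaration and the call sites change; nothing else in FirstSpark needs to move):
object Methods {
  // A top-level object is referenced statically from closures, so Spark no longer
  // needs to serialize an instance of it when shipping the filter/map functions
  def isHeader(s: String): Boolean = s.contains("id_1")

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble
}

// in FirstSpark.main, the "val m = new Methods()" line goes away:
val q = textFile.filter(x => !Methods.isHeader(x)).map(x => Methods.parse(x))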

Task not serializable while using custom dataframe class in Spark Scala

I am facing a strange issue with Scala/Spark (1.5) and Zeppelin:
If I run the following Scala/Spark code, it will run properly:
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map{a =>
val aa = testList(0)
None}
However, after declaring a custom dataframe type as proposed here
//DATAFRAME EXTENSION
import org.apache.spark.sql.DataFrame
object ExtraDataFrameOperations {
implicit class DFWithExtraOperations(df : DataFrame) {
//drop several columns
def drop(colToDrop:Seq[String]):DataFrame = {
var df_temp = df
colToDrop.foreach{ case (f: String) =>
df_temp = df_temp.drop(f)//can be improved with Spark 2.0
}
df_temp
}
}
}
and using it, for example, like the following:
//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"
val delimiter = ","
val colToIgnore = Seq("c_9", "c_10")
val inputICFfolder = "hdfs:///group/project/TestSpark/"
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Automatically infer data types? => no cause we need to merge all df, with potential null values => keep string only
.option("delimiter", delimiter)
.option("charset", "UTF-8")
.load(inputICFfolder + filename)
.drop(colToIgnore) // call the customized drop
This runs successfully.
Now if I run the following code again (same as above)
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map{a =>
val aa = testList(0)
None}
I get the error message:
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at
parallelize at :32 testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032) at
org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
...
Caused by: java.io.NotSerializableException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack: - object not serializable (class:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$,
value:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$@6c7e70e)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module, type: class
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@4c6d0802) - field (class:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
...
I don't understand:
Why does this error occur when no operation is performed on a dataframe?
Why is "ExtraDataFrameOperations" not serializable when it was used successfully before?
UPDATE:
Trying with
@inline val testList = List[String]("a", "b")
does not help.
Just add 'extends Serializable'.
This works for me:
/**
* A wrapper around ProducerRecord RDD that allows to save RDD to Kafka.
*
* KafkaProducer is shared within all threads in one executor.
* Error handling strategy - remember "last" seen exception and rethrow it to allow task fail.
*/
implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {
class ExceptionRegisteringCallback extends Callback {
private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)
override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
Option(exception) match {
case a @ Some(_) => lastRegisteredException.set(a) // (re)-register exception if send failed
case _ => // do nothing if encountered successful send
}
}
def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
}
/**
* Save to Kafka reusing KafkaProducer from singleton holder.
* Returns back control only once all records were actually sent to Kafka, in case of error rethrows "last" seen
* exception in the same thread to allow Spark task to fail
*/
def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
ds.foreachPartition { records =>
val callback = new ExceptionRegisteringCallback
val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)
records.foreach(record => producer.send(record, callback))
producer.flush()
callback.rethrowException()
}
}
}
It looks like Spark tries to serialize the whole scope around testList.
Try to inline the data with @inline val testList = List[String]("a", "b"), or use a different object to store the functions/data that you pass to the executors.
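For a concrete version of the first answer's "extends Serializable" advice applied to the question's extension, a sketch: extending the enclosing object addresses the ExtraDataFrameOperations$ named in the NotSerializableException, while extending the wrapper class mirrors the answer's own Kafka example.
import org.apache.spark.sql.DataFrame

object ExtraDataFrameOperations extends Serializable {
  // Marking the module and the implicit wrapper Serializable lets Spark serialize
  // them if they are dragged into a closure's captured scope (as in the REPL/Zeppelin case above)
  implicit class DFWithExtraOperations(df: DataFrame) extends Serializable {
    // drop several columns
    def drop(colToDrop: Seq[String]): DataFrame = {
      var dfTemp = df
      colToDrop.foreach { f =>
        dfTemp = dfTemp.drop(f) // can be improved with Spark 2.0
      }
      dfTemp
    }
  }
}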

Decoupling non-serializable object to avoid Serialization error in Spark

The following class contains the main function which tries to read from Elasticsearch and prints the documents returned:
object TopicApp extends Serializable {
def run() {
val start = System.currentTimeMillis()
val sparkConf = new Configuration()
sparkConf.set("spark.executor.memory","1g")
sparkConf.set("spark.kryoserializer.buffer","256")
val es = new EsContext(sparkConf)
val esConf = new Configuration()
esConf.set("es.nodes","localhost")
esConf.set("es.port","9200")
esConf.set("es.resource", "temp_index/some_doc")
esConf.set("es.query", "?q=*:*")
esConf.set("es.fields", "_score,_id")
val documents = es.documents(esConf)
documents.foreach(println)
val end = System.currentTimeMillis()
println("Total time: " + (end-start) + " ms")
es.shutdown()
}
def main(args: Array[String]) {
run()
}
}
The following class converts the returned documents to JSON using org.json4s:
class EsContext(sparkConf:HadoopConfig) extends SparkBase {
private val sc = createSCLocal("ElasticContext", sparkConf)
def documentsAsJson(esConf:HadoopConfig):RDD[String] = {
implicit val formats = DefaultFormats
val source = sc.newAPIHadoopRDD(
esConf,
classOf[EsInputFormat[Text, MapWritable]],
classOf[Text],
classOf[MapWritable]
)
val docs = source.map(
hit => {
val doc = Map("ident" -> hit._1.toString) ++ mwToMap(hit._2)
write(doc)
}
)
docs
}
def shutdown() = sc.stop()
// mwToMap() converts MapWritable to Map
}
The following trait creates the local SparkContext for the application:
trait SparkBase extends Serializable {
protected def createSCLocal(name:String, config:HadoopConfig):SparkContext = {
val iterator = config.iterator()
for (prop <- iterator) {
val k = prop.getKey
val v = prop.getValue
if (k.startsWith("spark."))
System.setProperty(k, v)
}
val runtime = Runtime.getRuntime
runtime.gc()
val conf = new SparkConf()
conf.setMaster("local[2]")
conf.setAppName(name)
conf.set("spark.serializer", classOf[KryoSerializer].getName)
conf.set("spark.ui.port", "0")
new SparkContext(conf)
}
}
When I run TopicApp I get the following errors:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
at TopicApp.EsContext.documents(EsContext.scala:51)
at TopicApp.TopicApp$.run(TopicApp.scala:28)
at TopicApp.TopicApp$.main(TopicApp.scala:39)
at TopicApp.TopicApp.main(TopicApp.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#14f70e7d)
- field (class: TopicApp.EsContext, name: sc, type: class org.apache.spark.SparkContext)
- object (class TopicApp.EsContext, TopicApp.EsContext#2cf77cdc)
- field (class: TopicApp.EsContext$$anonfun$documents$1, name: $outer, type: class TopicApp.EsContext)
- object (class TopicApp.EsContext$$anonfun$documents$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 13 more
Going through other posts that cover similar issues, most of them recommend making the classes Serializable or separating the non-serializable objects from the classes.
From the error I got, I inferred that the SparkContext, i.e. sc, is non-serializable, since SparkContext is not a serializable class.
How should I decouple SparkContext, so that the applications runs correctly?
I can't run your program to be sure, but the general rule is not to create anonymous functions that refer to members of unserializable classes if they have to be executed on the RDD's data. In your case:
EsContext has a val of type SparkContext, which is (intentionally) not serializable
In the anonymous function passed to RDD.map in EsContext.documentsAsJson, you call another function of this EsContext instance (mwToMap) which forces Spark to serialize that instance, along with the SparkContext it holds
One possible solution would be removing mwToMap from the EsContext class (possibly into a companion object of EsContext - objects need not be serializable as they are static). If there are other methods of the same nature (write?) they'll have to be moved too. This would look something like:
import EsContext._
class EsContext(sparkConf:HadoopConfig) extends SparkBase {
private val sc = createSCLocal("ElasticContext", sparkConf)
def documentsAsJson(esConf: HadoopConfig): RDD[String] = { /* unchanged */ }
def documents(esConf: HadoopConfig): RDD[EsDocument] = { /* unchanged */ }
def shutdown() = sc.stop()
}
object EsContext {
private def mwToMap(mw: MapWritable): Map[String, String] = { ... }
}
If moving these methods out isn't possible (i.e. if they require some of EsContext's members) - then consider separating the class that does the actual mapping from this context (which seems to be some kind of wrapper around the SparkContext - if that's what it is, that's all that it should be).
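To make that concrete, a sketch of documentsAsJson once mwToMap lives in the companion object (imports as in the question; the implicit json4s formats are created inside the closure so that nothing from the EsContext instance is captured):
import EsContext._

class EsContext(sparkConf: HadoopConfig) extends SparkBase {
  private val sc = createSCLocal("ElasticContext", sparkConf)

  def documentsAsJson(esConf: HadoopConfig): RDD[String] = {
    val source = sc.newAPIHadoopRDD(
      esConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable]
    )
    // The closure only references the companion object's helpers, which are resolved
    // statically, so Spark no longer tries to serialize this EsContext instance
    // (and with it the non-serializable SparkContext it holds).
    source.map { hit =>
      implicit val formats = DefaultFormats
      write(Map("ident" -> hit._1.toString) ++ mwToMap(hit._2))
    }
  }

  def shutdown() = sc.stop()
}

object EsContext {
  private def mwToMap(mw: MapWritable): Map[String, String] = { /* as in the question */ ??? }
}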