Scala : Write to a file inside foreachRDD - scala

I'm using Spark streaming to process data coming from Kafka. And I would like to write the result in a file (on local). When I print on console everything works fine and I get my results but when I try to write that to a file I get an error.
I use PrintWriter to do that, but I get this error :
Exception in thread "main" java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
java.io.PrintWriter
Serialization stack:
- object not serializable (class: java.io.PrintWriter, value: java.io.PrintWriter#20f6f88c)
- field (class: streaming.followProduction$$anonfun$main$1, name: qualityWriter$1, type: class java.io.PrintWriter)
- object (class streaming.followProduction$$anonfun$main$1, <function1>)
- field (class: streaming.followProduction$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class streaming.followProduction$$anonfun$main$1)
- object (class streaming.followProduction$$anonfun$main$1$$anonfun$apply$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.kafka010.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData,
I guess I can't use the writer like this inside the ForeachRDD !
Here is my code :
object followProduction extends Serializable {
def main(args: Array[String]) = {
val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
qualityWriter.append("dateTime , quality , status \n")
val sparkConf = new SparkConf().setMaster("spark://address:7077").setAppName("followProcess").set("spark.streaming.concurrentJobs", "4")
val sc = new StreamingContext(sparkConf, Seconds(10))
sc.checkpoint("checkpoint")
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "address:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> s"${UUID.randomUUID().toString}",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("A", "C")
topics.foreach(t => {
val stream = KafkaUtils.createDirectStream[String, String](
sc,
PreferConsistent,
Subscribe[String, String](Array(t), kafkaParams)
)
stream.foreachRDD(rdd => {
rdd.collect().foreach(i => {
val record = i.value()
val newCsvRecord = process(t, record)
println(newCsvRecord)
qualityWriter.append(newCsvRecord)
})
})
})
qualityWriter.close()
sc.start()
sc.awaitTermination()
}
var componentQuantity: componentQuantity = new componentQuantity("", 0.0, 0.0, 0.0)
var diskQuality: diskQuality = new diskQuality("", 0.0)
def process(topic: String, record: String): String = topic match {
case "A" => componentQuantity.checkQuantity(record)
case "C" => diskQuality.followQuality(record)
}
}
I have this class I am calling :
case class diskQuality(datetime: String, quality: Double) extends Serializable {
def followQuality(record: String): String = {
val dateFormat: SimpleDateFormat = new SimpleDateFormat("dd-mm-yyyy hh:mm:ss")
var recQuality = msgParse(record).quality
var date: Date = dateFormat.parse(msgParse(record).datetime)
var recDateTime = new SimpleDateFormat("dd-mm-yyyy hh:mm:ss").format(date)
// some operations here
return recDateTime + " , " + recQuality
}
def msgParse(value: String): diskQuality = {
import org.json4s._
import org.json4s.native.JsonMethods._
implicit val formats = DefaultFormats
val res = parse(value).extract[diskQuality]
return res
}
}
How can I achieve this ? I'm new to both Spark and Scala so maybe I'm not doing things right.
Thank you for your time
EDIT :
I've changed My code and I don't get this error anymore. But at the same time, I have only the first line in my file and the records are not appended. The writer (handleWriter) inside is actually not working.
Here is my code :
stream.foreachRDD(rdd => {
val qualityWriter = new PrintWriter(file)
qualityWriter.write("dateTime , quality , status \n")
qualityWriter.close()
rdd.collect().foreach(i =>
{
val record = i.value()
val newCsvRecord = process(topic , record)
val handleWriter = new PrintWriter(file)
handleWriter.append(newCsvRecord)
handleWriter.close()
println(newCsvRecord)
})
})
Where did I miss ? Maybe I'm doing this wrong ...

The PrintWriter is a local resource, bound to a single machine and cannot be serialized.
To remove this object from the Java serialization plan, we can declare it #transient. That means that a serialization form of the followProduction object will not attempt to serialize this field.
In the code of the question, it should be declared as:
#transient val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
Then it becomes possible to use it within the foreachRDD closure.
But, this process does not solve issues that have to do with the proper handling of the file. The qualityWriter.close() will be executed on the first pass of the streaming job and the file descriptor will be closed for writing during the execution of the job. To properly use local resources, such as a File, I would follow Yuval suggestion to recreate the PrintWriter within the foreachRDD closure. The missing piece is declaring the new PrintWritter in append mode. The modified code within the foreachRDD will look like this (making some additional code changes):
// Initialization phase
val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
qualityWriter.println("dateTime , quality , status")
qualityWriter.close()
....
dstream.foreachRDD{ rdd =>
val data = rdd.map(e => e.value())
.collect() // get the data locally
.map(i=> process(topic , i)) // create csv records
val allRecords = data.mkString("\n") // why do I/O if we can do in-mem?
val handleWriter = new PrintWriter(file, append=true)
handleWriter.append(allRecords)
handleWriter.close()
}
Few notes about the code in the question:
"spark.streaming.concurrentJobs", "4"
This will create an issue with multiple threads writing to the same local file. It's probably also being misused in this context.
sc.checkpoint("checkpoint")
There seems to be no need for checkpointing on this job.

The simplest thing to do would be to create the instance of PrintWriter inside foreachRDD, which means it wouldn't be captured by the function closure:
stream.foreachRDD(rdd => {
val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
qualityWriter.append("dateTime , quality , status \n")
rdd.collect().foreach(i => {
val record = i.value()
val newCsvRecord = process(t, record)
qualityWriter.append(newCsvRecord)
})
})
})

Related

Materialize mapWithState stateSnapShots to database for later resume of spark streaming app

I have a Spark scala streaming app that sessionizes user generated events coming from Kafka, using mapWithState. I want to mature the setup by enabling to pauze and resume the app in the case of maintenance. I’m already writing kafka offset information to a database, so when restarting the app I can pick up at the last offset processed. But I also want to keep the state information.
So my goal is to;
materialize session information after a key identifying the user times out.
materialize a .stateSnapshot() when I gracefully shutdown the application, so I can use that data when restarting the app by feeding it as a parameter to StateSpec.
1 is working, with 2 I have issues.
For the sake of completeness, I also describe 1 because I’m always interested in a better solution for it:
1) materializing session info after key time out
Inside my update function for mapWithState, I have:
if (state.isTimingOut()) {
// if key is timing out.
val output = (key, stateFilterable(isTimingOut = true
, start = state.get().start
, end = state.get().end
, duration = state.get().duration
))
That isTimingOut boolean I then later on use as:
streamParsed
.filter(a => a._2.isTimingOut)
.foreachRDD(rdd =>
rdd
.map(stuff => Model(key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration)
.saveToCassandra(keyspaceName, tableName)
)
2) materialize a .stateSnapshot() with graceful shutdown
Materializing snapshot info doesn’t work. What is tried:
// define a class Listener
class Listener(ssc: StreamingContext, state: DStream[(String, stateFilterable)]) extends Runnable {
def run {
if( ssc == null )
System.out.println("The spark context is null")
else
System.out.println("The spark context is fine!!!")
var input = "continue"
while( !input.equals("D")) {
input = readLine("Press D to kill: ")
System.out.println(input + " " + input.equals("D"))
}
System.out.println("Accessing snapshot and saving:")
state.foreachRDD(rdd =>
rdd
.map(stuff => Model(key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration)
.saveToCassandra("some_keyspace", "some_table")
)
System.out.println("Stopping context!")
ssc.stop(true, true)
System.out.println("We have stopped!")
}
}
// Inside the app object:
val state = streamParsed.stateSnapshots()
var listener = new Thread(new Listener(ssc, state))
listener.start()
So the full code becomes:
package main.scala.cassandra_sessionizing
import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.streaming.dstream.{DStream, MapWithStateDStream}
import scala.collection.immutable.Set
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, LongType, ArrayType, IntegerType}
import _root_.kafka.serializer.StringDecoder
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
case class userAction(datetimestamp: Double
, action_name: String
, user_key: String
, page_id: Integer
)
case class actionTuple(pages: scala.collection.mutable.Set[Int]
, start: Double
, end: Double)
case class stateFilterable(isTimingOut: Boolean
, start: Double
, end: Double
, duration: Int
, pages: Set[Int]
, events: Int
)
case class Model(user_key: String
, start: Double
, duration: Int
, pages: Set[Int]
, events: Int
)
class Listener(ssc: StreamingContext, state: DStream[(String, stateFilterable)]) extends Runnable {
def run {
var input = "continue"
while( !input.equals("D")) {
input = readLine("Press D to kill: ")
System.out.println(input + " " + input.equals("D"))
}
// Accessing snapshot and saving:
state.foreachRDD(rdd =>
rdd
.map(stuff => Model(user_key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration,
pages = stuff._2.pages,
events = stuff._2.events))
.saveToCassandra("keyspace1", "snapshotstuff")
)
// Stopping context
ssc.stop(true, true)
}
}
object cassandra_sessionizing {
// where we'll store the stuff in Cassandra
val tableName = "sessionized_stuff"
val keyspaceName = "keyspace1"
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("cassandra-sessionizing")
.set("spark.cassandra.connection.host", "10.10.10.10")
.set("spark.cassandra.auth.username", "keyspace1")
.set("spark.cassandra.auth.password", "blabla")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// setup the cassandra connector and recreate the table we'll use for storing the user session data.
val cc = CassandraConnector(conf)
cc.withSessionDo { session =>
session.execute(s"""DROP TABLE IF EXISTS $keyspaceName.$tableName;""")
session.execute(
s"""CREATE TABLE IF NOT EXISTS $keyspaceName.$tableName (
user_key TEXT
, start DOUBLE
, duration INT
, pages SET<INT>
, events INT
, PRIMARY KEY(user_key, start)) WITH CLUSTERING ORDER BY (start DESC)
;""")
}
// setup the streaming context and make sure we can checkpoint, given we're using mapWithState.
val ssc = new StreamingContext(sc, Seconds(60))
ssc.checkpoint("hdfs:///user/keyspace1/streaming_stuff/")
// Defining the stream connection to Kafka.
val kafkaStream = {
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
Map("metadata.broker.list" -> "kafka1.prod.stuff.com:9092,kafka2.prod.stuff.com:9092"), Set("theTopic"))
}
// this schema definition is needed so the json string coming from Kafka can be parsed into a dataframe using spark read.json.
// if an event does not conform to this structure, it will result in all null values, which are filtered out later.
val struct = StructType(
StructField("datetimestamp", DoubleType, nullable = true) ::
StructField("sub_key", StructType(
StructField("user_key", StringType, nullable = true) ::
StructField("page_id", IntegerType, nullable = true) ::
StructField("name", StringType, nullable = true) :: Nil), nullable = true) ::
)
/*
this is the function needed to keep track of an user key's session.
3 options:
1) key already exists, and new values are coming in to be added to the state.
2) key is new, so initialize the state with the incoming value
3) key is timing out, so mark it with a boolean that can be used by filtering later on. Given the boolean, the data can be materialized to cassandra.
*/
def trackStateFunc(batchTime: Time
, key: String
, value: Option[actionTuple]
, state: State[stateFilterable])
: Option[(String, stateFilterable)] = {
// 1 : if key already exists and we have a new value for it
if (state.exists() && value.orNull != null) {
var current_set = state.getOption().get.pages
var current_start = state.getOption().get.start
var current_end = state.getOption().get.end
if (value.get.pages != null) {
current_set ++= value.get.pages
}
current_start = Array(current_start, value.get.start).min // the starting epoch is used to initialize the state, but maybe some earlier events are processed a bit later.
current_end = Array(current_end, value.get.end).max // always update the end time of the session with new events coming in.
val new_event_counter = state.getOption().get.events + 1
val new_output = stateFilterable(isTimingOut = false
, start = current_start
, end = current_end
, duration = (current_end - current_start).toInt
, pages = current_set
, events = new_event_counter)
val output = (key, new_output)
state.update(new_output)
return Some(output)
}
// 2: if key does not exist and we have a new value for it
else if (value.orNull != null) {
var new_set: Set[Int] = Set()
val current_value = value.get.pages
if (current_value != null) {
new_set ++= current_value
}
val event_counter = 1
val current_start = value.get.start
val current_end = value.get.end
val new_output = stateFilterable(isTimingOut = false
, start = current_start
, end = current_end
, duration = (current_end - current_start).toInt
, pages = new_set
, events = event_counter)
val output = (key, new_output)
state.update(new_output)
return Some(output)
}
// 3: if key is timing out
if (state.isTimingOut()) {
val output = (key, stateFilterable(isTimingOut = true
, start = state.get().start
, end = state.get().end
, duration = state.get().duration
, pages = state.get().pages
, events = state.get().events
))
return Some(output)
}
// this part of the function should never be reached.
throw new Error(s"Entered dead end with $key $value")
}
// defining the state specification used later on as a step in the stream pipeline.
val stateSpec = StateSpec.function(trackStateFunc _)
.numPartitions(16)
.timeout(Seconds(4000))
// RDD 1
val streamParsedRaw = kafkaStream
.map { case (k, v: String) => v } // key is empty, so get the value containing the json string.
.transform { rdd =>
val df = sqlContext.read.schema(struct).json(rdd) // apply schema defined above and parse the json into a dataframe,
.selectExpr("datetimestamp"
, "action.name AS action_name"
, "action.user_key"
, "action.page_id"
)
df.as[userAction].rdd // transform dataframe into spark Dataset so we easily cast to the case class userAction.
}
val initialCount = actionTuple(pages = collection.mutable.Set(), start = 0.0, end = 0.0)
val addToCounts = (left: actionTuple, ua: userAction) => {
val current_start = ua.datetimestamp
val current_end = ua.datetimestamp
if (ua.page_id != null) left.pages += ua.page_id
actionTuple(left.pages, current_start, current_end)
}
val sumPartitionCounts = (p1: actionTuple, p2: actionTuple) => {
val current_start = Array(p1.start, p2.start).min
val current_end = Array(p1.end, p2.end).max
actionTuple(p1.pages ++= p2.pages, current_start, current_end)
}
// RDD 2: add the mapWithState part.
val streamParsed = streamParsedRaw
.map(s => (s.user_key, s)) // create key value tuple so we can apply the mapWithState to the user_key.
.transform(rdd => rdd.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)) // reduce to one row per user key for each batch.
.mapWithState(stateSpec)
// RDD 3: if the app is shutdown, this rdd should be materialized.
val state = streamParsed.stateSnapshots()
state.print(2)
// RDD 4: Crucial: loop up sessions timing out, extract the fields that we want to keep and materialize in Cassandra.
streamParsed
.filter(a => a._2.isTimingOut)
.foreachRDD(rdd =>
rdd
.map(stuff => Model(user_key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration,
pages = stuff._2.pages,
events = stuff._2.events))
.saveToCassandra(keyspaceName, tableName)
)
// add a listener hook that we can use to gracefully shutdown the app and materialize the RDD containing the state snapshots.
var listener = new Thread(new Listener(ssc, state))
listener.start()
ssc.start()
ssc.awaitTermination()
}
}
But when running this (so launching the app, waiting several minutes for some state information to build up, and then entering key 'D', I get the below. So I can't do anything 'new' with a dstream after quitting the ssc. I hoped to move from a DStream RDD to a regular RDD, quit the streaming context, and wrap up by saving the normal RDD. But don't know how. Hope someone can help!
Exception in thread "Thread-52" java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after sta$
ting a context is not supported
at org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
at org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:64)
at org.apache.spark.streaming.dstream.ForEachDStream.<init>(ForEachDStream.scala:34)
at org.apache.spark.streaming.dstream.DStream.org$apache$spark$streaming$dstream$DStream$$foreachRDD(DStream.scala:687)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply$mcV$sp(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:659)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:659)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)
at org.apache.spark.streaming.dstream.DStream.foreachRDD(DStream.scala:659)
at main.scala.feaUS.Listener.run(feaUS.scala:119)
at java.lang.Thread.run(Thread.java:745)
There are 2 main changes to the code which should make it work
1> Use the checkpointed directory to start the spark streaming context.
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => createContext(checkpointDirectory));
where createContext method has the logic to create and define new streams and stores the check pointed date in checkpointDirectory.
2> The sql context needs to be constructed in a slightly different way.
val streamParsedRaw = kafkaStream
.map { case (k, v: String) => v } // key is empty, so get the value containing the json string.
.map(s => s.replaceAll("""(\"hotel_id\")\:\"([0-9]+)\"""", "\"hotel_id\":$2")) // some events contain the hotel_id in quotes, making it a string. remove these quotes.
.transform { rdd =>
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val df = sqlContext.read.schema(struct).json(rdd) // apply schema defined above and parse the json into a dataframe,
.selectExpr("__created_epoch__ AS created_epoch" // the parsed json dataframe needs a bit of type cleaning and name changing
I feel your pain! While checkpointing is useful, it does not actually work if the code changes, and we change the code frequently!
What we are doing is to save the state, as json, every cycle, to hbase. So, if snapshotStream is your stream with the state info, we simply save it, as json, to hbase each window. While expensive, it is the only way we can guarantee the state is available upon restart even if the code changes.
Upon startup we load it, deserialize it, and pass it to the stateSpec as the initial rdd.

Task not serializable while using custom dataframe class in Spark Scala

I am facing a strange issue with Scala/Spark (1.5) and Zeppelin:
If I run the following Scala/Spark code, it will run properly:
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map{a =>
val aa = testList(0)
None}
However after declaring a custom dataframe type as proposed here
//DATAFRAME EXTENSION
import org.apache.spark.sql.DataFrame
object ExtraDataFrameOperations {
implicit class DFWithExtraOperations(df : DataFrame) {
//drop several columns
def drop(colToDrop:Seq[String]):DataFrame = {
var df_temp = df
colToDrop.foreach{ case (f: String) =>
df_temp = df_temp.drop(f)//can be improved with Spark 2.0
}
df_temp
}
}
}
and using it for example like following:
//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"
val delimiter = ","
val colToIgnore = Seq("c_9", "c_10")
val inputICFfolder = "hdfs:///group/project/TestSpark/"
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Automatically infer data types? => no cause we need to merge all df, with potential null values => keep string only
.option("delimiter", delimiter)
.option("charset", "UTF-8")
.load(inputICFfolder + filename)
.drop(colToIgnore)//call the customize dataframe
This run successfully.
Now if I run again the following code (same as above)
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map{a =>
val aa = testList(0)
None}
I get the error message:
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at
parallelize at :32 testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032) at
org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
...
Caused by: java.io.NotSerializableException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack: - object not serializable (class:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$,
value:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$#6c7e70e)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module, type: class
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC#4c6d0802) - field (class:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
...
I don't understand:
Why this error occured while no operation on dataframe is performed?
Why "ExtraDataFrameOperations" is not serializable while it was successfully used before??
UPDATE:
Trying with
#inline val testList = List[String]("a", "b")
does not help.
Just add 'extends Serializable'
This work for me
/**
* A wrapper around ProducerRecord RDD that allows to save RDD to Kafka.
*
* KafkaProducer is shared within all threads in one executor.
* Error handling strategy - remember "last" seen exception and rethrow it to allow task fail.
*/
implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {
class ExceptionRegisteringCallback extends Callback {
private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)
override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
Option(exception) match {
case a # Some(_) => lastRegisteredException.set(a) // (re)-register exception if send failed
case _ => // do nothing if encountered successful send
}
}
def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
}
/**
* Save to Kafka reusing KafkaProducer from singleton holder.
* Returns back control only once all records were actually sent to Kafka, in case of error rethrows "last" seen
* exception in the same thread to allow Spark task to fail
*/
def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
ds.foreachPartition { records =>
val callback = new ExceptionRegisteringCallback
val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)
records.foreach(record => producer.send(record, callback))
producer.flush()
callback.rethrowException()
}
}
}'
It looks like spark tries to serialize all the scope around testList.
Try to inline data #inline val testList = List[String]("a", "b") or use different object where you store function/data which you pass to drivers.

Using Spark Context in map of Spark Streaming Context to retrieve documents after Kafka Event

I'm new to Spark.
What I'm trying to do is retrieving all related documents from a Couchbase View with a given Id from Spark Kafka Streaming.
When I try to get this documents form the Spark Context, I always have the error Task not serializable.
From there, I do understand that I can't use nesting RDD neither multiple Spark Context in the same JVM, but want to find a work around.
Here is my current approach:
package xxx.xxx.xxx
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.client.java.view.ViewQuery
import com.couchbase.spark._
import org.apache.spark.broadcast.Broadcast
import _root_.kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ProducerRecord, KafkaProducer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Streaming {
// Method to create a Json document from Key and Value
def CreateJsonDocument(s: (String, String)): JsonDocument = {
//println("- Parsing document")
//println(s._1)
//println(s._2)
val return_doc = JsonDocument.create(s._1, JsonObject.fromJson(s._2))
(return_doc)
//(return_doc.content().getString("click"), return_doc)
}
def main(args: Array[String]): Unit = {
// get arguments as key value
val arguments = args.grouped(2).collect { case Array(k,v) => k.replaceAll("--", "") -> v }.toMap
println("----------------------------")
println("Arguments passed to class")
println("----------------------------")
println("- Arguments")
println(arguments)
println("----------------------------")
// If the length of the passed arguments is less than 4
if (arguments.get("brokers") == null || arguments.get("topics") == null) {
// Provide system error
System.err.println("Usage: --brokers <broker1:9092> --topics <topic1,topic2,topic3>")
}
// Create the Spark configuration with app name
val conf = new SparkConf().setAppName("Streaming")
// Create the Spark context
val sc = new SparkContext(conf)
// Create the Spark Streaming Context
val ssc = new StreamingContext(sc, Seconds(2))
// Setup the broker list
val kafkaParams = Map("metadata.broker.list" -> arguments.getOrElse("brokers", ""))
// Setup the topic list
val topics = arguments.getOrElse("topics", "").split(",").toSet
// Get the message stream from kafka
val docs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
docs
// Separate the key and the content
.map({ case (key, value) => (key, value) })
// Parse the content to transform in JSON Document
.map(s => CreateJsonDocument(s))
// Call the view to all related Review Application Documents
//.map(messagedDoc => RetrieveAllReviewApplicationDocs(messagedDoc, sc))
.map(doc => {
sc.couchbaseView(ViewQuery.from("my-design-document", "stats").key(messagedDoc.content.getString("id"))).collect()
})
.foreachRDD(
rdd => {
//Create a report of my documents and store it in Couchbase
rdd.foreach( println )
}
)
// Start the streaming context
ssc.start()
// Wait for termination and catch error if there is a problem in the process
ssc.awaitTermination()
}
}
Found the solution by using the Couchbase Client instead of the Couchbase Spark Context.
I don't know if it is the best way to go in a performance side, but I can retrieve the docs I need for computation.
package xxx.xxx.xxx
import com.couchbase.client.java.{Bucket, Cluster, CouchbaseCluster}
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.client.java.view.{ViewResult, ViewQuery}
import _root_.kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ProducerRecord, KafkaProducer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Streaming {
// Method to create a Json document from Key and Value
def CreateJsonDocument(s: (String, String)): JsonDocument = {
//println("- Parsing document")
//println(s._1)
//println(s._2)
val return_doc = JsonDocument.create(s._1, JsonObject.fromJson(s._2))
(return_doc)
//(return_doc.content().getString("click"), return_doc)
}
// Method to retrieve related documents
def RetrieveDocs (doc: JsonDocument, arguments: Map[String, String]): ViewResult = {
val cbHosts = arguments.getOrElse("couchbase-hosts", "")
val cbBucket = arguments.getOrElse("couchbase-bucket", "")
val cbPassword = arguments.getOrElse("couchbase-password", "")
val cluster: Cluster = CouchbaseCluster.create(cbHosts)
val bucket: Bucket = cluster.openBucket(cbBucket, cbPassword)
val docs : ViewResult = bucket.query(ViewQuery.from("my-design-document", "my-view").key(doc.content().getString("id")))
cluster.disconnect()
println(docs)
(docs)
}
def main(args: Array[String]): Unit = {
// get arguments as key value
val arguments = args.grouped(2).collect { case Array(k,v) => k.replaceAll("--", "") -> v }.toMap
println("----------------------------")
println("Arguments passed to class")
println("----------------------------")
println("- Arguments")
println(arguments)
println("----------------------------")
// If the length of the passed arguments is less than 4
if (arguments.get("brokers") == null || arguments.get("topics") == null) {
// Provide system error
System.err.println("Usage: --brokers <broker1:9092> --topics <topic1,topic2,topic3>")
}
// Create the Spark configuration with app name
val conf = new SparkConf().setAppName("Streaming")
// Create the Spark context
val sc = new SparkContext(conf)
// Create the Spark Streaming Context
val ssc = new StreamingContext(sc, Seconds(2))
// Setup the broker list
val kafkaParams = Map("metadata.broker.list" -> arguments.getOrElse("brokers", ""))
// Setup the topic list
val topics = arguments.getOrElse("topics", "").split(",").toSet
// Get the message stream from kafka
val docs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
// Get broadcast arguments
val argsBC = sc.broadcast(arguments)
docs
// Separate the key and the content
.map({ case (key, value) => (key, value) })
// Parse the content to transform in JSON Document
.map(s => CreateJsonDocument(s))
// Call the view to all related Review Application Documents
.map(doc => RetrieveDocs(doc, argsBC))
.foreachRDD(
rdd => {
//Create a report of my documents and store it in Couchbase
rdd.foreach( println )
}
)
// Start the streaming context
ssc.start()
// Wait for termination and catch error if there is a problem in the process
ssc.awaitTermination()
}
}

i want to store each rdd into database in twitter streaming using apache spark but got error of task not serialize in scala

I write a code in which twitter streaming take a rdd of tweet class and store each rdd in database but it got error task not serialize I paste the code.
sparkstreaming.scala
case class Tweet(id: Long, source: String, content: String, retweet: Boolean, authName: String, username: String, url: String, authId: Long, language: String)
trait SparkStreaming extends Connector {
def startStream(appName: String, master: String): StreamingContext = {
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
val sparkConf: SparkConf = new SparkConf().setAppName(appName).setMaster(master).set(" spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// .set("spark.kryo.registrator", "HelloKryoRegistrator")
// sparkConf.registerKryoClasses(Array(classOf[DBCrud]))
val sc: SparkContext = new SparkContext(sparkConf)
val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
ssc
}
}
object SparkStreaming extends SparkStreaming
I use this streaming context in plat controller to store tweets in database but it throws exception. I'm using mongodb to store it.
def streamstart = Action {
val stream = SparkStreaming
val a = stream.startStream("ss", "local[2]")
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = DBCrud
val twitterauth = new TwitterClient().tweetCredantials()
val tweetDstream = TwitterUtils.createStream(a, Option(twitterauth.getAuthorization))
val tweets = tweetDstream.filter { x => x.getUser.getLang == "en" }.map { x => Tweet(x.getId, x.getSource, x.getText, x.isRetweet(), x.getUser.getName, x.getUser.getScreenName, x.getUser.getURL, x.getUser.getId, x.getUser.getLang) }
// tweets.foreachRDD { x => x.foreach { x => dbcrud.insert(x) } }
tweets.saveAsTextFiles("/home/knoldus/sentiment project/spark services/tweets/tweets")
// val s=new BirdTweet()
// s.hastag(a.sparkContext)
a.start()
Ok("start streaming")
}
When make a single of streaming which take tweets and use forEachRDD to store each tweet then it works but if I use it from outside it doesn't work.
Please help me.
Try to create connection with MongoDB inside foreachRDD block, as mentioned in Spark Documentation
tweets.foreachRDD { x =>
x.foreach { x =>
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
dbcrud.insert(x)
}
}

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose key are column families and the values are maps whose keys are columns and whose content are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add #transient annotation to sc variable:
final class HBaseSC(#transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within extract function, since it won't be available on workers.
If it's necessary to access Spark context from within distributed computation, rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}