ssc.sparkContext.parallelize in foreachPartition raises Task not serializable exception - scala

In foreachPartition I want to turn a list of Strings into an RDD, so I use ssc.sparkContext.parallelize, where ssc is a StreamingContext. The code looks like this:
val ssc = new StreamingContext(sparkConf, Seconds(5))
...
words.foreachRDD(rdd => rdd.foreachPartition { partitionOfRecords =>
  var sourcelist = List[String]()
  partitionOfRecords.foreach(x => { sourcelist = sourcelist.+:("source" + x) })
  if (sourcelist.length > 0) {
    sourcelist.foreach { x => println(x) }
  } else {
    println("aa---------------------none")
  }
  // turn sourcelist into an RDD and convert it to a DStream
  val ssd = ssc.sparkContext.parallelize(sourcelist)
  val resultInputStream = ssc.queueStream(scala.collection.mutable.Queue(ssd))
  val results = resultInputStream.map(x => x)
  results.print()
})
But the code raises SparkException: Task not serializable. I really don't know how to deal with this. I'd appreciate your help!
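For context on why this fails: the closure passed to foreachPartition is shipped to the executors, and it captures ssc, which is not serializable; parallelize and queueStream can in any case only be called on the driver, not inside a partition. A minimal sketch of the usual workaround, assuming the goal here is just to prefix and print the records, is to work on the partition iterator directly and keep the StreamingContext out of the closure:

words.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // runs on the executors; no StreamingContext or SparkContext is referenced here
    val sourcelist = partitionOfRecords.map("source" + _).toList
    if (sourcelist.nonEmpty) sourcelist.foreach(println)
    else println("aa---------------------none")
  }
}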

Related

org.apache.spark.SparkException: This RDD lacks a SparkContext error

Complete error is:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, see SPARK-13758.
But I don't think I used a nested RDD transformation in my code.
How can I solve it?
My Scala code:
stream.foreachRDD { rdd =>
  val nRDD = rdd.map(item => item.value())
  val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  val top = oldRDD.sortBy(item => {
    val arr = item.split(' ')
    arr(0)
  }, ascending = false).take(200)
  val topRDD = sc.makeRDD(top)
  val unionRDD = topRDD.union(nRDD)
  val validRDD = unionRDD.map(item => {
      val arr = item.split(' ')
      ((arr(1), arr(2)), arr(3).toDouble)
    })
    .reduceByKey((f, s) => {
      if (f > s) f else s
    })
    .distinct()
  val ratings = validRDD.map(item => {
    Rating(item._1._2.toInt, item._1._1.toInt, item._2)
  })
  val rank = 10
  val numIterations = 5
  val model = ALS.train(ratings, rank, numIterations, 0.01)
  nRDD.map(item => {
      val arr = item.split(' ')
      arr(2)
    }).toDS()
    .distinct()
    .foreach(item => {
      println("als recommending for user " + item)
      val recommendRes = model.recommendProducts(item.toInt, 10)
      for (elem <- recommendRes) {
        println(elem)
      }
    })
  nRDD.saveAsTextFile("hdfs://localhost:9011/recData/miniApp/mall")
}
The error is telling you that you're missing a SparkContext. I'm guessing that the program fails on this line:
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
The documentation provides an example of creating a SparkContext to use in this situation.
From the docs:
val stream: DStream[String] = ...
stream.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  // Do things...
}
Although you're using RDDs instead of DataFrames, the same principles should apply.
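Applied to the RDD-based code above, a minimal sketch of the same idea (hypothetical; it only changes where the SparkContext comes from) would be to take the context from the incoming RDD instead of capturing an outer sc:

stream.foreachRDD { rdd =>
  // use the SparkContext attached to this micro-batch's RDD rather than a captured outer variable
  val sc = rdd.sparkContext
  val nRDD = rdd.map(item => item.value())
  val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  // ... the rest of the processing as in the original snippet
}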

How to speed up log parsing Spark job?

My current architecture is: AWS ELB writes logs to S3, which sends a message to SQS for further processing by Spark Streaming. It works, but my problem is that it's taking quite a while. I'm new to Spark and Scala, so I just want to make sure I'm not doing something stupid.
val conf = new SparkConf()
  .setAppName("SparrowOrc")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(1))

val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

// Got 10 messages at a time
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => {
    sc.textFile(f)
  })
  sc.union(files)
})

val jsonRows = rawLogs.mapPartitions(partition => {
  // Parsing raw log to json
  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  partition.map(line => {
    try {
      txfm.parseLine(line)
    } catch {
      case e: Throwable => { log.info(line); "" }
    }
  }).filter(line => line != "{}")
})

val sqlSession = SparkSession
  .builder()
  .getOrCreate()

// Write to S3
jsonRows.foreachRDD(r => {
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "bucket" + parsedDate
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r)
  jsonDf.write.mode("append").format("orc").option("compression", "zlib").save(outputPath)
})

streamContext.start()
streamContext.awaitTermination()
}
Here's the DAG, and it seems like everything is merged in the union transformation.
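One observation, offered as a hedged suggestion rather than a tested fix: the transform collects the keys to the driver and builds one RDD per file before unioning them, which produces many tiny RDDs. Since sc.textFile accepts a comma-separated list of paths, the same step could be collapsed into a single read, roughly:

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  if (fileKeys.isEmpty) sc.emptyRDD[String]
  else sc.textFile(fileKeys.mkString(",")) // one read over all paths instead of a union of per-file RDDs
})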

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich the data stream with profile information from an HBase table. I was looking to use a broadcast variable. The whole code is enclosed here.
The output of the HBase data is as follows.
In the Driver node:
HBaseReaderBuilder(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
In the Worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
As you can see, it has lost the SparkContext. When I issue the statement val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3)) I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {
  def main(args: Array[String]): Unit = {
    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"

    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)
    val ssc = new StreamingContext(mainSparkContext, Seconds(30))
    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))

    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)
    val broadcastedprof = mainSparkContext.broadcast(docRdd)

    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
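For context, a broadcast variable only ships the serialized value you hand it; an RDD, or a builder that wraps a SparkContext such as HBaseReaderBuilder, cannot be rebuilt on the workers, which is why the SparkContext field shows up as null there. A minimal sketch of the usual pattern, assuming the profile table is small enough to collect to the driver, is to broadcast a plain Map instead:

// collect the HBase rows on the driver and broadcast a plain collection
val profiles: Map[String, (Option[String], Option[String])] =
  docRdd.map { case (key, gender, age) => key -> (gender, age) }.collect().toMap
val broadcastedprof = mainSparkContext.broadcast(profiles)

eventsStream.foreachRDD(dstream => {
  dstream.foreachPartition(records => {
    val profileMap = broadcastedprof.value // a plain Map, safe to use on the workers
    records.foreach(record => {
      val rows = record._2.split("\r\n")
      // enrich each row via profileMap.get(...)
    })
  })
})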

Spark RDD write to HBase

I am able to read the messages from Kafka using the below code:
val ssc = new StreamingContext(sc, Seconds(50))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap)
But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success.
lines.foreachRDD(rdd => {
  rdd.foreach(record => {
    val i = +1
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(i))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
  })
})
Well, you are not actually executing the Put; you are merely creating a Put request and adding data to it. What you are missing is an
hTable.put(thePut);
Adding another answer!
You can use foreachPartition to establish the connection once per partition at the executor level, which is more efficient than connecting for each row (a costly operation).
lines.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    // one connection per partition instead of one per record
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    iter.foreach(record => {
      val i = +1 // note: this is just the literal 1 (unary plus), not an increment, so every record writes to the same row key
      val thePut = new Put(Bytes.toBytes(i))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
      // missing part in your code
      hTable.put(thePut)
    })
  })
})
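If a newer HBase client is available (an assumption on my part; the snippets above use the older HTable API), the same per-partition pattern with explicit connection handling might look roughly like this:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

lines.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    // one connection and table handle per partition, closed when the partition is done
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("test"))
    try {
      iter.zipWithIndex.foreach { case (record, i) =>
        val thePut = new Put(Bytes.toBytes(i)) // in real code, derive the row key from the record instead
        thePut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record._2)) // store the Kafka message value
        table.put(thePut)
      }
    } finally {
      table.close()
      connection.close()
    }
  })
})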

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka messages and put them into HBase. Though the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0), line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}

TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like they're currently being buffered but never executed or executed at some random time.
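Applying that suggestion, blah() might look roughly like this (a sketch; it just adds the flush and a close):

object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    try {
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
      hTable.put(thePut)
      hTable.flushCommits() // execute the buffered Put now instead of at some arbitrary later point
    } finally {
      hTable.close()
    }
  }
}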