Spark Streaming - Issue with Passing parameters - scala

Please take a look at the following spark streaming code written in scala:
object HBase {
var hbaseTable = ""
val hConf = new HBaseConfiguration()
hConf.set("hbase.zookeeper.quorum", "zookeeperhost")
def init(input: (String)) {
hbaseTable = input
}
def display() {
print(hbaseTable)
}
def insertHbase(row: (String)) {
val hTable = new HTable(hConf,hbaseTable)
}
}
object mainHbase {
def main(args : Array[String]) {
if (args.length < 5) {
System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
HBase.init(hbaseTable)
HBase.display()
val sparkConf = new SparkConf().setAppName("mainHbase")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("checkpoint")
val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
lines.print()
ssc.start()
}
}
I am trying to initialize the parameter hbaseTable in the object HBase by calling HBase.init method. It was setting the parameter properly. I confirmed that by calling the HBase.display method in the next line.
However when HBase.insertHbase method in the foreachRDD is called, its throwing error that hbaseTable is not set.
Update with exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:179)
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:149)
org.apache.hadoop.hbase.TableName.<init>(TableName.java:303)
org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339)
org.apache.hadoop.hbase.TableName.valueOf(TableName.java:426)
org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:156)
Can you please let me know how to make this code work.

"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized with 'only once' semantics in the JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing this object with the given value in the VM of the driver.
But when we do: rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition for the given RDD. At that point, the object HBase is initialized on each VM for each executor. As we can see, no initialization has happened on this object at that point.
There're two options:
We can add some checking "isInitialized" to the HBase object and add the -now conditional- call to initialize on each call to foreach.
Another option would be to use
rdd.foreachPartitition{partition =>
HBase.initialize(...)
partition.foreach(elem => HBase.insert(elem))
}
This construction will amortize any initialization by the amount of element in each partition. It's also possible to combine it with an initialization check to prevent unnecessary bootstrap work.

Related

Task not serializable - foreach function spark

I have a function getS3Object to get a json object stored in S3
def getS3Object (s3ObjectName) : Unit = {
val bucketName = "xyz"
val object_to_write = s3client.getObject(bucketName, s3ObjectName)
val file = new File(filename)
fileWriter = new FileWriter(file)
bw = new BufferedWriter(fileWriter)
bw.write(object_to_write)
bw.close()
fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the below logic I get an error saying "task is not serializable"
Method 1:- df.foreach(x => getS3Object(x.getString(0))
I tried converting the df to rdd but still get the same error
Method 2:- df.rdd.foreach(x => getS3Object(x.getString(0))
However it works with collect()
Method 3:- df.collect.foreach(x => getS3Object(x.getString(0))
I do not wish to use the collect() method as all the elements of the dataframe are collected to the driver and potentially result in OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?
The problem for your s3Client can be solved as following. But you have to remember that these functions run on executor nodes (other machines), so your whole val file = new File(filename) thing is probably not going to work here.
You can put your files on some distibuted file system like HDFS or S3.
object S3ClientWrapper extends Serializable {
// s3Client must be created here.
val s3Client = {
val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
AmazonS3ClientBuilder.standard()
.withCredentials(new AWSStaticCredentialsProvider(awsCreds))
.build()
}
}
def getS3Object (s3ObjectName) : Unit = {
val bucketName = "xyz"
val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
// now you have to solve your file problem
}

What is Spark execution order with function calls in scala?

I have a spark program as follows:
object A {
var id_set: Set[String] = _
def init(argv: Array[String]) = {
val args = new AArgs(argv)
id_set = args.ids.split(",").toSet
}
def main(argv: Array[String]) {
init(argv)
val conf = new SparkConf().setAppName("some.name")
val rdd1 = getRDD(paras)
val rdd2 = getRDD(paras)
//......
}
def getRDD(paras) = {
//function details
getRDDDtails(paras)
}
def getRDDDtails(paras) = {
//val id_given = id_set
id_set.foreach(println) //worked normal, not empty
someRDD.filter{ x =>
val someSet = x.getOrElse(...)
//id_set.foreach(println) ------wrong, id_set just empty set
(some_set & id_set).size > 0
}
}
class AArgs(args: Array[String]) extends Serializable {
//parse args
}
I have a global variable id_set. At first, it is just an empty set. In main function, I call init which sets id_set to a non-empty set from args. After that, I call getRDD function which calls getRDDDtails. In getRDDDtails, I filter a rdd based on contents in id_set. However, the result semms to be empty. I tried to print is_set in executor, and it is just an empty line. So, the problem seems to be is_set is not well initilized(in init function). However, when I try to print is_set in driver(in head lines of function getRDDDtails), it worked normal, not empty.
So, I have tried to add val id_given = id_set in function getRDDDtails, and use id_given later. This seems to fix the problem. But I'm totally confused why should this happen? What is the execution order of Spark programs? Why does my solution work?

How to wait SparkContext to finish all process?

How do I see if SparkContext has contents executing and when everything finish I stop it? Because currently I am waiting 30 seconds before to call SparkContext.stop, otherwise my app throws an error.
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkContext
object RatingsCounter extends App {
// set the log level to print only errors
Logger.getLogger("org").setLevel(Level.ERROR)
// create a SparkContext using every core of the local machine, named RatingsCounter
val sc = new SparkContext("local[*]", "RatingsCounter")
// load up each line of the ratings data into an RDD (Resilient Distributed Dataset)
val lines = sc.textFile("src/main/resource/u.data", 0)
// convert each line to s string, split it out by tabs and extract the third field.
// The file format is userID, movieID, rating, timestamp
val ratings = lines.map(x => x.toString().split("\t")(2))
// count up how many times each value occurs
val results = ratings.countByValue()
// sort the resulting map of (rating, count) tuples
val sortedResults = results.toSeq.sortBy(_._1)
// print each result on its own line.
sortedResults.foreach { case (key, value) => println("movie ID: " + key + " - rating times: " + value) }
Thread.sleep(30000)
sc.stop()
}
Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
And since you are extending App, you are getting an unexpected behavior.
You can read more about it in the official documentation about Self Contained Applications.
App uses DelayedInit and can cause initialization issues. With the main method you know what's going on. Excerpt from reddit.
object HelloWorld extends App {
var a = 1
a + 1
override def main(args: Array[String]) {
println(a) // guess what's the value of a ?
}
}

[Spark Streaming]How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model will be used to predict sth based on this new message. But as time goes by, the model can be changed for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this
def loadingModel(#transient sc:SparkContext)={
val model=LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
model
}
var error=0.0
var size=0.0
implicit def bool2int(b:Boolean) = if (b) 1 else 0
def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double,Double)] = {
val model=loadingModel(sc)
val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
val prediction = model.predict(pairs.features)
val wrong= prediction != pairs.label
error = state.getOption().getOrElse(Array(0.0,0.0))(0) + 1.0*(wrong:Int)
size=state.getOption().getOrElse(Array(0.0,0.0))(1) + 1.0
val output = (key, error,size)
state.update(Array(error,size))
Some(output)
}
val stateSpec = StateSpec.function(updateState _)
.numPartitions(1)
setupLogging()
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, there would be an exception like this
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within DStream function, spark seem to serialize the context object (because model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert DStream to RDD, collect the result and then run model prediction/scoring in the driver.
Used netcat utility to simulate streaming, tried the following code to convert DStream to RDD, it works. See if it helps.
val ssc = new StreamingContext(sc,Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)) )
val logisModel = LogisticRegressionModel.load(sc, /path/LR_Model")
linedstream.foreachRDD( rdd => {
for(item <- rdd.collect().toArray) {
val predictedVal = logisModel.predict(item)
println(predictedVal + "|" + item);
}
})
Understand collect is not scalable here, but if you think that your streaming messages are less in number for any interval, this is probably an option. This is what I see it possible in Spark 1.4.0, the higher versions probably have a fix for this. See this if its useful,
Save ML model for future usage

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka message and put into Hbase. Though the scala script is running with out any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "test")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
hTable.put(thePut)
}
}
object TheMain extends Serializable{
def run() {
val ssc = new StreamingContext(sc, Seconds(1))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap).map(_._2)
val words = lines.map(line => line.split(",")).map(line => (line(0),line(1)))
val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc.start()
}
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like they're currently being buffered but never executed or executed at some random time.