Change foreachRDD - Scala

I am having a problem. We are using Kafka and Spark, and we are using foreachRDD like this:
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    processMessage(message)
  }
  println(newRDD.count())
}
But we are passing in the processMessage(message) method, and that method calls a class that creates the SparkContext. I have been reading that an error is thrown if the SparkContext is created inside foreachRDD.
I have changed it like this:
messages.map { msg =>
  val newRDD3 = processMessage(msg)
  newRDD3
}
but I am not sure whether this does the same thing as the foreachRDD version.
Could you please help me with this?
Any help would be really appreciated.
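For what it's worth, map on a DStream is only a transformation; nothing actually runs until an output operation such as foreachRDD or print is applied. A minimal sketch of combining the two, assuming processMessage is an ordinary serializable function that does not touch the SparkContext:
val processed = messages.map(record => processMessage(record))
processed.foreachRDD { rdd =>
  // count() is an action, so this is what actually triggers the map above
  println("Processed records in this batch: " + rdd.count())
}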

Use a SparkSession instead:
SparkConf conf = new SparkConf()
    .setAppName("appName")
    .setMaster("local");
SparkSession sparkSession = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
return sparkSession;
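The same idea in Scala, as a minimal sketch (getOrCreate reuses an existing session rather than creating a second context, so it should be called once on the driver, outside foreachRDD):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("appName")
  .setMaster("local")
val sparkSession = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()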

I create the StreamingContext, then declare the topics and the kafkaParams.
Finally, I create the messages. Please see the code below:
def main(args: Array[String]) {
  val date_today = new SimpleDateFormat("yyyy_MM_dd");
  val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH");
  val PATH_SEPERATOR = "/";

  val conf = ConfigFactory.load("spfin.conf")
  println("kafka.duration --- " + conf.getString("kafka.duration").toLong)

  val mlFeatures: MLFeatures = new MLFeatures()

  // Create context with custom second batch interval
  //val sparkConf = new SparkConf().setAppName("SpFinML")
  //val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
  val ssc = new StreamingContext(mlFeatures.sc, Seconds(conf.getString("kafka.duration").toLong))

  // Create direct kafka stream with brokers and topics
  val topicsSet = conf.getString("kafka.requesttopic").split(",").toSet
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> conf.getString("kafka.brokers"),
    "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
    "group.id" -> conf.getString("kafka.consumergroups2"),
    "auto.offset.reset" -> conf.getString("kafka.autoOffset"),
    "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "security.protocol" -> "SASL_PLAINTEXT")

  /* this code is to get messages from request topic */
  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  messages.foreachRDD { rdd =>
    val newRDD = rdd.map { message =>
      processMessage(message, "Inside processMessage")
    }
    println("Inside map")
    // println(newRDD.count())
  }
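Note that with the count() commented out there is no action on newRDD, so the map (and therefore processMessage) never actually runs; only "Inside map" is printed on the driver. A minimal sketch of forcing the evaluation, assuming the side effects inside processMessage are really wanted:
messages.foreachRDD { rdd =>
  val newRDD = rdd.map(message => processMessage(message, "Inside processMessage"))
  // count() is an action, so it forces the map (and processMessage) to run on the executors
  println("Messages in this batch: " + newRDD.count())
}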

The processMessage method is this (I think I might need to change this method, right?):
def processMessage(message: ConsumerRecord[String, String], msg: String): ConsumerRecord[String, String] = {
  println(msg)
  println("Message processed is " + message.value())
  val req_message = message.value()
  //val tableName = conf.getString("hbase.tableName")
  //println("Hbase table name : " + tableName)
  //val decisionTree_res = "PredictionModelOutput "
  //val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message)
  //println(decisionTree_res)
  //
  //kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)
kafkaProducer(conf.getString("kafka.responsetopic"), """[{"payorId":53723,"therapyType":"RMIV","ndcNumber":"'66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"'9427535101","serviceDate":"20161102","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":22957.55,"daysOrUnits":140,"algorithmType":"LR ,RF ,NB ,VoteResult","label":0.0,"prediction":"0.0 ,0.0 ,0.0 ,0.0","finalPrediction":"Approved","rejectOutcome":"Y","neighborCounter":0,"probability":"0.9307022947278968 - 0.06929770527210313 ,0.9879908798891663 - 0.012009120110833629 ,1.0 - 0.0 ,","patientGender":"M","invoiceId":0,"therapyClass":"REMODULIN","patientAge":52,"npiId":0,"prescriptionId":0,"refillNo":0,"requestId":"419568891","requestDateTime":"20171909213055","responseId":"419568891","responseDateTime":"201801103503"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":1,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":2,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":3,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":4,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160621","responseDate":"20160818","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":19677.9,"daysOrUnits":120,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":5,"invoiceId":45226407,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":0,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate
":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":6,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":7,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":8,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":9,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":10,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":4879911,"procedureCode":"J3285","authNbr":"9818182501","serviceDate":"20160901","responseDate":"20161027","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":11,"invoiceId":46419758,"patientAge":0,"npiId":0,"prescriptionId":1095509,"refillNo":7,"authExpiry":"20170626","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":4056274,"procedureCode":"J3285","authNbr":"8914616801","serviceDate":"20160727","responseDate":"20161019","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":12,"invoiceId":45820271,"patientAge":0,"npiId":0,"prescriptionId":1055447,"refillNo":10,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":3476365,"procedureCode":"J3285","authNbr":"9262852501","servic
eDate":"20160809","responseDate":"20161013","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":16398.25,"daysOrUnits":100,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":13,"invoiceId":46027459,"patientAge":0,"npiId":0,"prescriptionId":1169479,"refillNo":2,"authExpiry":"20161231","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":1064449,"procedureCode":"J3285","authNbr":"9327540001","serviceDate":"20160825","responseDate":"20161013","serviceBranchId":35,"serviceDuration":14,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":14,"invoiceId":46303200,"patientAge":0,"npiId":0,"prescriptionId":714169,"refillNo":0,"authExpiry":"20170112","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9205248,"procedureCode":"J3285","authNbr":"0000000","serviceDate":"20160823","responseDate":"20161013","serviceBranchId":65,"serviceDuration":20,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":15,"invoiceId":46257476,"patientAge":0,"npiId":0,"prescriptionId":1206606,"refillNo":0,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"}]""")
  //saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
  message
}

In case someone is interested in the solution:
For messages (an InputDStream) I use foreachRDD, which gives me each batch as an RDD; then I use map to extract the JSON from each ConsumerRecord. Next, I collect the RDD to an Array[String] and pass it to the processMessage method.
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    val req_message = message.value()
    message.value()
  }
  println("Request messages: " + newRDD.count())
  var resultrows = newRDD.collect() //.collectAsList()
  processMessage(resultrows, mlFeatures: MLFeatures)
}
Inside the processMessage method there is a for loop to process all the strings. We also insert the request message and the response message into an HBase table.
def processMessage(message: Array[String], mlFeatures: MLFeatures) = {
  for (j <- 0 until message.size) {
    val req_message = message(j) //.get(j).toString()
    val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message, mlFeatures)
    println("Message processed is " + req_message)
    kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)
    var startTime = new Date().getTime();
    saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
    var endTime = new Date().getTime();
    println("Kafka Consumer savetoHBase took : " + (endTime - startTime) / 1000 + " seconds")
  }
}
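A possible refinement (only a sketch, not verified against this code): collect() brings every message back to the driver, so if processing a single message did not itself need the SparkContext, the same work could be pushed to the executors with foreachPartition, similar to the HBase answer further down. processSingleMessage below is a hypothetical per-message variant of processMessage.
messages.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // runs on the executors; create any per-partition resources (producers, connections) here
    records.foreach { record =>
      processSingleMessage(record.value()) // hypothetical helper; must not use the SparkContext
    }
  }
}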

Related

Unable to store consumed Kafka offsets into an HBase table while committing through Spark DStreams

I am trying to save the Kafka consumer offset in an HBase table, with a success flag, after it has been processed through the business logic. This whole process is part of a Spark DStream, and I am using the piece of code below for it:
val hbaseTable = "table"
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "topic",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))
val topic = Array("topicName")
val ssc = new StreamingContext(sc, Seconds(40))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topic, kafkaParams))

stream.foreachRDD((rdd, batchTime) => {
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach(offset => println(offset.topic, offset.partition,
    offset.fromOffset, offset.untilOffset))

  rdd.map(value => (value.value())).saveAsTextFile("path")
  println("Saved Data into file")

  var commits: OffsetCommitCallback = null
  rdd.foreachPartition(message => {
    val hbaseConf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(hbaseConf)
    val table = conn.getTable(TableName.valueOf(hbaseTable))
    commits = new OffsetCommitCallback() {
      def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                     exception: Exception) {
        message.foreach(value => {
          val key = value.key()
          val offset = value.offset()
          println(s"offset is: $offset")
          val partitionId = TaskContext.get.partitionId()
          println(s"partitionID is: $partitionId")
          val rowKey = key
          val put = new Put(rowKey.getBytes)
          if (exception != null) {
            println("Got Error Message:" + exception.getMessage)
            put.addColumn("o".getBytes, "flag".toString.getBytes(), "Error".toString.getBytes())
            put.addColumn("o".getBytes, "error_message".toString.getBytes(), exception.getMessage.toString.getBytes())
            println("Got Error Message:" + exception.getMessage)
            table.put(put)
          } else {
            put.addColumn("o".getBytes, "flag".toString.getBytes(), "Success".toString.getBytes())
            table.put(put)
            println(offsets.values())
          }
        })
        println("Inserted into HBase")
      }
    }
    table.close()
    conn.close()
  })
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, commits)
})
ssc.start()
This code executes successfully. However, it neither saves data into HBase nor produces logs at the executor level (which I am printing while iterating over each partition of the RDD). I am not sure what exactly I am missing here. Any help would be highly appreciated.
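A pattern that is often suggested for this kind of setup (only a sketch, reusing the names from the code above and not verified against this cluster, with the error-flag branch omitted for brevity): do the HBase puts directly inside foreachPartition, where the records are actually available on the executors, and commit the offsets separately with commitAsync, rather than nesting the puts inside the OffsetCommitCallback, which runs on the driver after the commit.
stream.foreachRDD { (rdd, batchTime) =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { partition =>
    val hbaseConf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(hbaseConf)
    val table = conn.getTable(TableName.valueOf(hbaseTable))
    partition.foreach { record =>
      // write one row per record; runs on the executor
      val put = new Put(record.key().getBytes)
      put.addColumn("o".getBytes, "flag".getBytes, "Success".getBytes)
      table.put(put)
    }
    table.close()
    conn.close()
  }
  // commit offsets from the driver once the batch's writes are done
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}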

Spark Streaming Scala performance drastically slow

I have the following code:
case class event(imei: String, date: String, gpsdt: String, dt: String, id: String)
case class historyevent(imei: String, date: String, gpsdt: String)

object kafkatesting {
  def main(args: Array[String]) {
    val clients = new RedisClientPool("192.168.0.40", 6379)
    val conf = new SparkConf()
      .setAppName("KafkaReceiver")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.cassandra.connection.keep_alive_ms", "20000")
      .set("spark.executor.memory", "3g")
      .set("spark.driver.memory", "4g")
      .set("spark.submit.deployMode", "cluster")
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "3")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "100")
      .set("spark.streaming.kafka.maxRatePerPartition", "7")
    val sc = SparkContext.getOrCreate(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)
    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> "192.168.0.113:9092",
      "group.id" -> "test-group-aditya",
      "auto.offset.reset" -> "largest")
    val topics = Set("random")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    kafkaStream.foreachRDD { rdd =>
      val updatedRDD = rdd.map { a =>
        implicit val formats = DefaultFormats
        val jValue = parse(a._2)
        val fleetrecord = jValue.extract[historyevent]
        val hash = fleetrecord.imei + fleetrecord.date + fleetrecord.gpsdt
        val md5Hash = DigestUtils.md5Hex(hash).toUpperCase()
        val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
        event(fleetrecord.imei, fleetrecord.date, fleetrecord.gpsdt, now, md5Hash)
      }
        .collect()
      updatedRDD.foreach { f =>
        clients.withClient { client =>
          val value = f.imei + " , " + f.gpsdt
          val zscore = Calendar.getInstance().getTimeInMillis
          val key = new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime())
          val dt = new SimpleDateFormat("HH:mm:ss").format(Calendar.getInstance().getTime())
          val q1 = "00:00:00"
          val q2 = "06:00:00"
          val q3 = "12:00:00"
          val q4 = "18:00:00"
          val quater = if (dt > q1 && dt < q2) {
            System.out.println(dt + " lies in quarter 1");
            " -> 1"
          } else if (dt > q2 && dt < q3) {
            System.out.println(dt + " lies in quarter 2");
            " -> 2"
          } else if (dt > q3 && dt < q4) {
            System.out.println(dt + " lies in quarter 3");
            " -> 3"
          } else {
            System.out.println(dt + " lies in quarter 4");
            " -> 4"
          }
          client.zadd(key + quater, zscore, value)
          println(f.toString())
        }
      }
      val collection = sc.parallelize(updatedRDD)
      collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "dt", "id"))
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
I'm using this code to insert data from Kafka into Cassandra and Redis, but I am facing the following issues:
1) The application creates a long queue of active batches while the previous batch is still being processed. I want the next batch to start only once the previous batch has finished executing.
2) I have a four-node cluster that processes each batch, but it takes around 30-40 seconds to execute 700 records.
Is my code optimized, or do I need to work on it for better performance?
Yes, you can do all of your work inside mapPartitions. There are APIs from DataStax that allow you to save the DStream directly. Here is how you can do it for C*.
val partitionedDstream = kafkaStream.repartition(5) // change this value as per your data and Spark cluster

// Now, instead of iterating over each RDD, work on each partition.
val eventsStream: DStream[event] = partitionedDstream.mapPartitions(x => {
  val lst = scala.collection.mutable.ListBuffer[event]()
  while (x.hasNext) {
    val a = x.next()
    implicit val formats = DefaultFormats
    val jValue = parse(a._2)
    val fleetrecord = jValue.extract[historyevent]
    val hash = fleetrecord.imei + fleetrecord.date + fleetrecord.gpsdt
    val md5Hash = DigestUtils.md5Hex(hash).toUpperCase()
    val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
    lst += event(fleetrecord.imei, fleetrecord.date, fleetrecord.gpsdt, now, md5Hash)
  }
  lst.toList.iterator
})

eventsStream.cache() // because you are using the same DStream for C* and Redis

// Instead of collecting each RDD, save the whole DStream at once.
import com.datastax.spark.connector.streaming._
eventsStream.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "dt", "id"))
Also, Cassandra accepts a timestamp as a Long value, so you can change that part of your code as below:
val now = System.currentTimeMillis()
// also change your case class to take `Long` instead of `String`
case class event(imei: String, date: String, gpsdt: String, dt: Long, id: String)
Similarly, you can make the change for Redis as well; a sketch follows below.
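For Redis, a sketch along the same lines (assuming the same scala-redis RedisClientPool as in the question; the pool is created per partition on the executors instead of collecting to the driver, and the quarter-bucket suffix from the question is omitted for brevity):
eventsStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // create the pool on the executor; a driver-side pool is not serializable
    val pool = new RedisClientPool("192.168.0.40", 6379)
    partition.foreach { f =>
      pool.withClient { client =>
        val value = f.imei + " , " + f.gpsdt
        val zscore = Calendar.getInstance().getTimeInMillis
        val key = new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime())
        client.zadd(key, zscore, value)
      }
    }
  }
}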

How to speed up log parsing Spark job?

My architecture right now is: AWS ELB writes logs to S3, and S3 sends a message to SQS for further processing by Spark Streaming. It's working, but my problem is that it's taking a while. I'm new to Spark and Scala, so I just want to make sure that I'm not doing something stupid.
val conf = new SparkConf()
  .setAppName("SparrowOrc")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(1))

val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

// Got 10 messages at a time
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => {
    sc.textFile(f)
  })
  sc.union(files)
})

val jsonRows = rawLogs.mapPartitions(partition => {
  // Parsing raw log to json
  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  partition.map(line => {
    try {
      txfm.parseLine(line)
    } catch {
      case e: Throwable => { log.info(line); "" }
    }
  }).filter(line => line != "{}")
})

val sqlSession = SparkSession
  .builder()
  .getOrCreate()

// Write to S3
jsonRows.foreachRDD(r => {
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "bucket" + parsedDate
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r)
  jsonDf.write.mode("append").format("orc").option("compression", "zlib").save(outputPath)
})

streamContext.start()
streamContext.awaitTermination()
}
Here's the DAG, and it seems like everything is merged in the union transformation.

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich the data stream with profile information from an HBase table. I was looking to use a broadcast variable. The whole code is enclosed here.
The output of the HBase data is as follows.
In the driver node:
HBaseReaderBuilder(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
In the worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
As you can see, it has lost the SparkContext. When I issue the statement
val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {
  def main(args: Array[String]): Unit = {
    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"

    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)
    val ssc = new StreamingContext(mainSparkContext, Seconds(30))
    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))

    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)
    val broadcastedprof = mainSparkContext.broadcast(docRdd)

    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
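One way this is commonly handled (a sketch only, not verified against this code): an RDD, or a builder holding the SparkContext, cannot be used inside executor code, so the profile data is collected on the driver into a plain Map and that Map is broadcast instead, assuming the profile table is small enough to fit in driver memory.
// collect the small profile table on the driver, then broadcast plain data
val profiles: Map[String, (Option[String], Option[String])] =
  docRdd.map { case (id, gender, age) => (id, (gender, age)) }.collect().toMap
val broadcastedProfiles = mainSparkContext.broadcast(profiles)

eventsStream.foreachRDD(dstream => {
  dstream.foreachPartition(records => {
    val lookup = broadcastedProfiles.value // a plain Map, usable on the worker
    records.foreach(record => {
      val rows = record._2.split("\r\n")
      // enrich each row with lookup(...) here
    })
  })
})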

Spark RDD write to HBase

I am able to read the messages from Kafka using the below code:
val ssc = new StreamingContext(sc, Seconds(50))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap)
But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success.
lines.foreachRDD(rdd => {
  rdd.foreach(record => {
    val i = +1
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(i))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
  })
})
Well, you are not actually executing the Put; you are merely creating a Put request and adding data to it. What you are missing is an
hTable.put(thePut);
Adding another answer:
You can use foreachPartition to establish the connection at the executor level, which is more efficient than creating it for each row, a costly operation.
lines.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    iter.foreach(record => {
      val i = +1
      val thePut = new Put(Bytes.toBytes(i))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
      // missing part in your code
      hTable.put(thePut);
    })
  })
})
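One small addition worth considering (my own note, not part of the original answer): close the per-partition HTable once the iterator has been processed, so connections are not leaked across batches. Note also that val i = +1 always evaluates to 1, so every Put targets the same row key; in the sketch below the Kafka message value is used as the key, which is only an assumption about what was intended.
lines.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    try {
      iter.foreach(record => {
        val value = record._2 // the Kafka message value
        val thePut = new Put(Bytes.toBytes(value)) // assumed row key; adjust to your real key
        thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(value))
        hTable.put(thePut)
      })
    } finally {
      hTable.close() // release the per-partition table when the partition is done
    }
  })
})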