Spark Streaming Scala performance drastically slow - apache-kafka
I have the following code:
case class event(imei: String, date: String, gpsdt: String, dt: String, id: String)
case class historyevent(imei: String, date: String, gpsdt: String)

object kafkatesting {
  def main(args: Array[String]) {
    val clients = new RedisClientPool("192.168.0.40", 6379)
    val conf = new SparkConf()
      .setAppName("KafkaReceiver")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.cassandra.connection.keep_alive_ms", "20000")
      .set("spark.executor.memory", "3g")
      .set("spark.driver.memory", "4g")
      .set("spark.submit.deployMode", "cluster")
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "3")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "100")
      .set("spark.streaming.kafka.maxRatePerPartition", "7")

    val sc = SparkContext.getOrCreate(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)

    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> "192.168.0.113:9092",
      "group.id" -> "test-group-aditya",
      "auto.offset.reset" -> "largest")
    val topics = Set("random")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    kafkaStream.foreachRDD { rdd =>
      val updatedRDD = rdd.map(a => {
        implicit val formats = DefaultFormats
        val jValue = parse(a._2)
        val fleetrecord = jValue.extract[historyevent]
        val hash = fleetrecord.imei + fleetrecord.date + fleetrecord.gpsdt
        val md5Hash = DigestUtils.md5Hex(hash).toUpperCase()
        val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
        event(fleetrecord.imei, fleetrecord.date, fleetrecord.gpsdt, now, md5Hash)
      }).collect()

      updatedRDD.foreach(f => {
        clients.withClient { client =>
          val value = f.imei + " , " + f.gpsdt
          val zscore = Calendar.getInstance().getTimeInMillis
          val key = new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime())
          val dt = new SimpleDateFormat("HH:mm:ss").format(Calendar.getInstance().getTime())
          val q1 = "00:00:00"
          val q2 = "06:00:00"
          val q3 = "12:00:00"
          val q4 = "18:00:00"
          val quater = if (dt > q1 && dt < q2) {
            System.out.println(dt + " lies in quarter 1")
            " -> 1"
          } else if (dt > q2 && dt < q3) {
            System.out.println(dt + " lies in quarter 2")
            " -> 2"
          } else if (dt > q3 && dt < q4) {
            System.out.println(dt + " lies in quarter 3")
            " -> 3"
          } else {
            System.out.println(dt + " lies in quarter 4")
            " -> 4"
          }
          client.zadd(key + quater, zscore, value)
          println(f.toString())
        }
      })

      val collection = sc.parallelize(updatedRDD)
      collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "dt", "id"))
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
I'm using this code to insert data from Kafka into Cassandra and Redis, but I'm facing the following issues:
1) The application creates a long queue of active batches while the previous batch is still being processed. I want the next batch to start only once the previous batch has finished executing.
2) I have a four-node cluster processing each batch, but it takes around 30-40 seconds to execute 700 records.
Is my code optimized, or do I need to work on it for better performance?
Yes, you can do all of this inside mapPartitions. There are APIs from DataStax that allow you to save the DStream directly. Here is how you can do it for C*:
val partitionedDstream = kafkaStream.repartition(5) // change this value as per your data and Spark cluster

// Now, instead of iterating over each RDD, work on each partition.
val eventsStream: DStream[event] = partitionedDstream.mapPartitions(x => {
  val lst = scala.collection.mutable.ListBuffer[event]()
  while (x.hasNext) {
    val a = x.next()
    implicit val formats = DefaultFormats
    val jValue = parse(a._2)
    val fleetrecord = jValue.extract[historyevent]
    val hash = fleetrecord.imei + fleetrecord.date + fleetrecord.gpsdt
    val md5Hash = DigestUtils.md5Hex(hash).toUpperCase()
    val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
    lst += event(fleetrecord.imei, fleetrecord.date, fleetrecord.gpsdt, now, md5Hash)
  }
  lst.toList.iterator
})

eventsStream.cache() // because you are using the same DStream for C* and Redis

// Instead of collecting each RDD, save the whole DStream at once
import com.datastax.spark.connector.streaming._
eventsStream.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "dt", "id"))
Also, Cassandra accepts a timestamp as a Long value, so you can change that part of your code as below:
val now = System.currentTimeMillis()
//also change your case class to take `Long` instead of `String`
case class event(imei: String, date: String, gpsdt: String, dt: Long, id: String)
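With that change, the line inside mapPartitions that builds each event would become, for example:
lst += event(fleetrecord.imei, fleetrecord.date, fleetrecord.gpsdt, System.currentTimeMillis(), md5Hash)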
Similarly, you can make the change for Redis as well.
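For Redis, a minimal sketch (assuming the same scala-redis RedisClientPool used in the question, and creating the pool on the executors instead of serializing it from the driver) could write each partition directly, without collecting to the driver:
eventsStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // The pool is created here, on the executor, because connection pools are not serializable
    val pool = new RedisClientPool("192.168.0.40", 6379)
    val dayKey = new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime())
    partition.foreach { f =>
      pool.withClient { client =>
        // Same zadd as in the question; append the " -> quarter" suffix to the key here if you still need it
        client.zadd(dayKey, Calendar.getInstance().getTimeInMillis, f.imei + " , " + f.gpsdt)
      }
    }
  }
}
In practice you would keep the pool in a lazily initialized singleton object so it is reused across batches rather than rebuilt for every partition.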
Related
Change forEachRDD
I am having a problem; we are using Kafka and Spark. We are using foreachRDD like this:
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message => processMessage(message) }
  println(newRDD.count())
}
but we are passing in the processMessage(message) method, and this method calls a class that creates the SparkContext. I have been reading that it will throw an error if you create the SparkContext inside foreachRDD. I have changed it to this:
messages.map { case (msg) =>
  val newRDD3 = processMessage(msg)
  (newRDD3)
}
but I am not sure whether this does the same thing as the foreachRDD. Could you please help me with this? Any help would be really appreciated.
Use SparkSession:
SparkConf conf = new SparkConf()
    .setAppName("appName")
    .setMaster("local");
SparkSession sparkSession = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
return sparkSession;
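For reference, since the rest of the question is in Scala, the Scala equivalent of that snippet (assuming Spark 2.x, where SparkSession is available) would be roughly:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build (or reuse) a single SparkSession instead of constructing a SparkContext inside foreachRDD
val conf = new SparkConf()
  .setAppName("appName")
  .setMaster("local")
val sparkSession = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()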
I created the StreamingContext, then declared the topics and the KafkaParams. Finally, I created the messages. Please see the code below:
def main(args: Array[String]) {
  val date_today = new SimpleDateFormat("yyyy_MM_dd");
  val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH");
  val PATH_SEPERATOR = "/";
  val conf = ConfigFactory.load("spfin.conf")
  println("kafka.duration --- " + conf.getString("kafka.duration").toLong)
  val mlFeatures: MLFeatures = new MLFeatures()
  // Create context with custom second batch interval
  //val sparkConf = new SparkConf().setAppName("SpFinML")
  //val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
  val ssc = new StreamingContext(mlFeatures.sc, Seconds(conf.getString("kafka.duration").toLong))
  // Create direct kafka stream with brokers and topics
  val topicsSet = conf.getString("kafka.requesttopic").split(",").toSet
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> conf.getString("kafka.brokers"),
    "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
    "group.id" -> conf.getString("kafka.consumergroups2"),
    "auto.offset.reset" -> conf.getString("kafka.autoOffset"),
    "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "security.protocol" -> "SASL_PLAINTEXT")
  /* this code is to get messages from request topic */
  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
  messages.foreachRDD { rdd =>
    val newRDD = rdd.map { message => processMessage(message, "Inside processMessage") }
    println("Inside map")
    // println(newRDD.count())
  }
The processMessage method is this: I think I might need to change this method, right? def processMessage(message: ConsumerRecord[String, String],msg: String): ConsumerRecord[String, String] = { println(msg) println("Message processed is " + message.value()) val req_message = message.value() //val tableName = conf.getString("hbase.tableName") //println("Hbase table name : " + tableName) //val decisionTree_res = "PredictionModelOutput " // val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message) // println(decisionTree_res) // // kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res) kafkaProducer(conf.getString("kafka.responsetopic"), """[{"payorId":53723,"therapyType":"RMIV","ndcNumber":"'66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"'9427535101","serviceDate":"20161102","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":22957.55,"daysOrUnits":140,"algorithmType":"LR ,RF ,NB ,VoteResult","label":0.0,"prediction":"0.0 ,0.0 ,0.0 ,0.0","finalPrediction":"Approved","rejectOutcome":"Y","neighborCounter":0,"probability":"0.9307022947278968 - 0.06929770527210313 ,0.9879908798891663 - 0.012009120110833629 ,1.0 - 0.0 ,","patientGender":"M","invoiceId":0,"therapyClass":"REMODULIN","patientAge":52,"npiId":0,"prescriptionId":0,"refillNo":0,"requestId":"419568891","requestDateTime":"20171909213055","responseId":"419568891","responseDateTime":"201801103503"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":1,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":2,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":3,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":4,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"663020
10201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160621","responseDate":"20160818","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":19677.9,"daysOrUnits":120,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":5,"invoiceId":45226407,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":0,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":6,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":7,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":8,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":9,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":10,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":4879911,"procedureCode":"J3285","authNbr":"9818182501","serviceDate":"20160901","responseDate":"20161027","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":11,"invoiceId":46419758,"patientAge":0,"npiId":0,"prescriptionId":1095509,"refillNo":7,"authExpiry":"20170626","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"th
erapyType":"RMIV","ndcNumber":"66302010201","patientId":4056274,"procedureCode":"J3285","authNbr":"8914616801","serviceDate":"20160727","responseDate":"20161019","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":12,"invoiceId":45820271,"patientAge":0,"npiId":0,"prescriptionId":1055447,"refillNo":10,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":3476365,"procedureCode":"J3285","authNbr":"9262852501","serviceDate":"20160809","responseDate":"20161013","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":16398.25,"daysOrUnits":100,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":13,"invoiceId":46027459,"patientAge":0,"npiId":0,"prescriptionId":1169479,"refillNo":2,"authExpiry":"20161231","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":1064449,"procedureCode":"J3285","authNbr":"9327540001","serviceDate":"20160825","responseDate":"20161013","serviceBranchId":35,"serviceDuration":14,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":14,"invoiceId":46303200,"patientAge":0,"npiId":0,"prescriptionId":714169,"refillNo":0,"authExpiry":"20170112","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9205248,"procedureCode":"J3285","authNbr":"0000000","serviceDate":"20160823","responseDate":"20161013","serviceBranchId":65,"serviceDuration":20,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":15,"invoiceId":46257476,"patientAge":0,"npiId":0,"prescriptionId":1206606,"refillNo":0,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"}]""") //saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res) message }
In case someone is interested in the solution: for messages (an InputDStream) I use foreachRDD, which converts it to an RDD; next I use map to get the JSON from each ConsumerRecord. Then I convert the RDD to an Array[String] and pass it to the processMessage method.
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    val req_message = message.value()
    (message.value())
  }
  println("Request messages: " + newRDD.count())
  var resultrows = newRDD.collect() //.collectAsList()
  processMessage(resultrows, mlFeatures: MLFeatures)
}
Inside the processMessage method, there is a for loop to process all the strings. We also insert the request message and response message into an HBase table.
def processMessage(message: Array[String], mlFeatures: MLFeatures) = {
  for (j <- 0 until message.size) {
    val req_message = message(j) //.get(j).toString()
    val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message, mlFeatures)
    println("Message processed is " + req_message)
    kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)
    var startTime = new Date().getTime();
    saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
    var endTime = new Date().getTime();
    println("Kafka Consumer savetoHBase took : " + (endTime - startTime) / 1000 + " seconds")
  }
}
recursive value x$5 needs type
I am getting an error at this line:
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) = Utils.parseCommandLineWithTwitterCredentials(args)
recursive value x$7 needs type
recursive value x$1 needs type
What does this error mean? Please guide me on how to resolve it.
object Collect {
  private var numTweetsCollected = 0L
  private var partNum = 0
  private var gson = new Gson()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        "<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
      System.exit(1)
    }
    val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
      Utils.parseCommandLineWithTwitterCredentials(args)
    val outputDir = new File(outputDirectory.toString)
    if (outputDir.exists()) {
      System.err.println("ERROR - %s already exists: delete or specify another directory".format(outputDirectory))
      System.exit(1)
    }
    outputDir.mkdirs()
    println("Initializing Streaming Spark Context...")
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(intervalSecs))
    val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
      .map(gson.toJson(_))
    tweetStream.foreachRDD((rdd, time) => {
      val count = rdd.count()
      if (count > 0) {
        val outputRDD = rdd.repartition(partitionsEachInterval)
        outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
        numTweetsCollected += count
        if (numTweetsCollected > numTweetsToCollect) {
          System.exit(0)
        }
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
Try removing the Utils.IntParam(...) from your pattern-matched values. Extract the values, then parse them separately.
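A minimal sketch of that change (hedged, since the exact return type of Utils.parseCommandLineWithTwitterCredentials isn't shown in the question) is to pattern match on plain values and convert them afterwards:
val Array(outputDirectory, numTweetsArg, intervalArg, partitionsArg) =
  Utils.parseCommandLineWithTwitterCredentials(args)
// Parse the numeric arguments separately, outside the pattern match
val numTweetsToCollect = numTweetsArg.toString.toLong
val intervalSecs = intervalArg.toString.toLong
val partitionsEachInterval = partitionsArg.toString.toInt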
How to speed up log parsing Spark job?
My architecture right now is: AWS ELB writes logs to S3, and S3 sends a message to SQS for further processing by Spark Streaming. It works, but my problem is that it's taking a fair bit of time. I'm new to Spark and Scala, so I just want to make sure that I'm not doing something stupid.
val conf = new SparkConf()
  .setAppName("SparrowOrc")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(1))
val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5)) // Got 10 messages at a time
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})
val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => {
    sc.textFile(f)
  })
  sc.union(files)
})
val jsonRows = rawLogs.mapPartitions(partition => {
  // Parsing raw log to json
  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  partition.map(line => {
    try {
      txfm.parseLine(line)
    } catch {
      case e: Throwable => { log.info(line); "" }
    }
  }).filter(line => line != "{}")
})
val sqlSession = SparkSession
  .builder()
  .getOrCreate()
// Write to S3
jsonRows.foreachRDD(r => {
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "bucket" + parsedDate
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r)
  jsonDf.write.mode("append").format("orc").option("compression", "zlib").save(outputPath)
})
streamContext.start()
streamContext.awaitTermination()
}
Here's the DAG; it seems like everything is merged in the union transformation.
Spark Streaming - Broadcast variable - Case Class
My requirement is to enrich data stream data with profile information from an HBase table. I was looking to use a broadcast variable. I've enclosed the whole code here. The output of the HBase data is as follows.
In the Driver node:
HBaseReaderBuilder (org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
In the Worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List()))
As you can see, it has lost the SparkContext. When I issue the statement
val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {
  def main(args: Array[String]): Unit = {
    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"
    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)
    val ssc = new StreamingContext(mainSparkContext, Seconds(30))
    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))
    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)
    val broadcastedprof = mainSparkContext.broadcast(docRdd)
    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
Convert each record in an RDD to an Array[Map] using Scala and Spark
My RDD is \n-separated records that look like:
Single RDD
k1=v1,k2=v2,k3=v3
k1=v1,k2=v2,k3=v3
k1=v1,k2=v2,k3=v3
I want to convert it into an Array[Map[k,v]], where each element in the Array is a different Map[k,v] corresponding to one record. The Array will contain N such maps depending on the records in a single RDD. I am new to both Scala and Spark. Any help with the conversion is appreciated.
object SparkApp extends Logging with App {
  override def main(args: Array[String]): Unit = {
    val myConfigFile = new File("../sparkconsumer/conf/spark.conf")
    val fileConfig = ConfigFactory.parseFile(myConfigFile).getConfig(GlobalConstants.CONFIG_ROOT_ELEMENT)
    val propConf = ConfigFactory.load(fileConfig)
    val topicsSet = propConf.getString(GlobalConstants.KAFKA_WHITE_LIST_TOPIC).split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> propConf.getString(GlobalConstants.KAFKA_BROKERS))
    //logger.info(message = "Hello World , You are entering Spark!!!")
    val conf = new SparkConf().setMaster("local[2]").setAppName(propConf.getString(GlobalConstants.JOB_NAME))
    conf.set("HADOOP_HOME", "/usr/local/hadoop")
    conf.set("hadoop.home.dir", "/usr/local/hadoop")
    //Lookup
    // logger.info("Window of 5 Seconds Enabled")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint")
    val apiFile = ssc.sparkContext.textFile(propConf.getString(GlobalConstants.API_FILE))
    val arrayApi = ssc.sparkContext.broadcast(apiFile.distinct().collect())
    val nonApiFile = ssc.sparkContext.textFile(propConf.getString(GlobalConstants.NON_API_FILE))
    val arrayNonApi = ssc.sparkContext.broadcast(nonApiFile.distinct().collect())
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
    writeTOHDFS2(messages)
    ssc.start()
    ssc.awaitTermination()
  }

  def writeTOHDFS2(messages: DStream[(String, String)]): Unit = {
    val records = messages.window(Seconds(10), Seconds(10))
    val k = records.transform(rdd => rdd.map(r => r._2)).filter(x => filterNullImpressions(x))
    k.foreachRDD { singleRdd =>
      if (singleRdd.count() > 0) {
        val maps = singleRdd.map(line => line.split("\n").flatMap(x => x.split(",")).flatMap(x => x.split("=")).foreach(x => new mutable.HashMap().put(x(0), x(1)))
        val r = scala.util.Random
        val sdf = new SimpleDateFormat("yyyy/MM/dd/HH/mm")
        maps.saveAsTextFile("hdfs://localhost:8001/user/hadoop/spark/" + sdf.format(new Date()) + r.nextInt)
      }
    }
  }
}
Here's some code that should be pretty self-explanatory.
val lines = "k1=v1,k2=v2,k3=v3\nk1=v1,k2=v2\nk1=v1,k2=v2,k3=v3,k4=v4"
val maps = lines.split("\n")
  .map(line => line.split(",")
    .map(kvPairString => kvPairString.split("="))
    .map(kvPairArray => (kvPairArray(0), kvPairArray(1))))
  .map(_.toMap)
// maps is of type Array[Map[String, String]]
println(maps.mkString("\n"))
// prints:
// Map(k1 -> v1, k2 -> v2, k3 -> v3)
// Map(k1 -> v1, k2 -> v2)
// Map(k1 -> v1, k2 -> v2, k3 -> v3, k4 -> v4)
Word of advice - SO is not a "write code for me" platform. I understand that it's pretty hard to just dive into Scala and Spark, but next time please try to solve the problem yourself and post what you have tried so far and which problems you ran into.
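Applied to the question's stream, a minimal sketch (assuming k is the DStream[String] of comma-separated records built in writeTOHDFS2) might look like:
k.foreachRDD { singleRdd =>
  // Build one Map per record on the executors, then collect to the driver as Array[Map[String, String]]
  val maps: Array[Map[String, String]] = singleRdd
    .map(line => line.split(",")
      .map(kvPairString => kvPairString.split("="))
      .map(kvPairArray => (kvPairArray(0), kvPairArray(1)))
      .toMap)
    .collect()
  maps.foreach(println)
}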