How to use flatMapValues on Kafka - Scala

I am getting an error when using flatMapValues in Scala with the Kafka Streams library. Here is my code:
val builder: KStreamBuilder = new KStreamBuilder()
val textLines: KStream[String, String] = builder.stream("streams-plaintext-input")

import collection.JavaConverters.asJavaIterableConverter

val wordCounts: KTable[String, JLong] = textLines
  .flatMapValues(textLine => textLine.toLowerCase.split("\\W+").toIterable.asJava)
  .groupBy((_, word) => word)
  .count("word-counts")
and I am getting the error "missing parameter type" for textLine inside flatMapValues. If I replace it with flatMapValues((textLine: String) => textLine.toLowerCase.split("\\W+").toIterable.asJava), it still does not work.
Does anyone have an idea?
Thanks, Felipe

Working with Scala 2.12.4, I solved it like this:
val props = new Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)

val stringSerde: Serde[String] = Serdes.String()
val longSerde: Serde[Long] = Serdes.Long()

val builder = new StreamsBuilder()
val textLines: KStream[String, String] = builder.stream("streams-plaintext-input")

val topology: Topology = builder.build()
println(topology.describe())

import collection.JavaConverters.asJavaIterableConverter // needed for .asJava below

val wordCounts: KTable[String, Long] = textLines
  .flatMapValues { textLine =>
    println(textLine)
    println(topology.describe())
    textLine.toLowerCase.split("\\W+").toIterable.asJava
  }
  .groupBy((_, word) => word)
  // count() is a stateful computation; it adds a state store to the topology
  .count("word-counts")

wordCounts.to(stringSerde, longSerde, "streams-wordcount-output")

val streams = new KafkaStreams(topology, props)
streams.start()
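If the explicit lambda still fails to compile, an alternative (a sketch only, assuming the Kafka Streams 1.x Java API used above) is to pass an explicit ValueMapper, which sidesteps the overload ambiguity behind the "missing parameter type" error:

import java.lang.{Iterable => JIterable}
import org.apache.kafka.streams.kstream.ValueMapper
import scala.collection.JavaConverters.asJavaIterableConverter

// A ValueMapper with explicit types; nothing is left for inference at the call site.
val splitWords: ValueMapper[String, JIterable[String]] =
  new ValueMapper[String, JIterable[String]] {
    override def apply(textLine: String): JIterable[String] =
      textLine.toLowerCase.split("\\W+").toIterable.asJava
  }

val words = textLines.flatMapValues(splitWords)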

Related

How to use zio-kafka with Google protobuf when you need to read data from a topic and get it as a Java proto class?

I need to get data from a Kafka topic as a ZIO stream; the data there is in Google protobuf format, and I also need to check the schema.
I use the following sample protobuf file, which generates the proto.Data Java class for me:
syntax = "proto3";
package proto;
import "google/protobuf/timestamp.proto";
option java_multiple_files = true;
option java_outer_classname = "Protos";
message Data {
string id = 1;
google.protobuf.Timestamp receiveTimestamp = 2;
}
If I use the following properties, I am able to get data as a KStream[proto.Data] (so using the Kafka Streams API) for the proto.Data message class:
val props: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, s"kstream-application-${java.util.UUID.randomUUID().toString}")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  p.put("security.protocol", "SSL")
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde")
  p.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081")
  p.put("enable.auto.commit", "false")
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p.put("specific.protobuf.value.type", classOf[proto.Data])
  p
}
And here is an example of code using the KStream (I am able to print only the records whose id equals "1"):
val builder: StreamsBuilder = new StreamsBuilder

val risks: KStream[String, proto.Data] =
  builder
    .stream[String, proto.Data](topic)
    .filter((_, value) => value.getId == "1")

val sysout = Printed
  .toSysOut[String, proto.Data]
  .withLabel("protoStream")
risks.print(sysout)

val streams: KafkaStreams = new KafkaStreams(builder.build(), props)
streams.start()

sys.ShutdownHookThread {
  streams.close(Duration.ofSeconds(10))
}
Now, if I use zio-kafka and the same properties, I am able to print out the whole stream:
val props: Map[String, AnyRef] = Map(
  StreamsConfig.APPLICATION_ID_CONFIG -> s"kstream-application-${java.util.UUID.randomUUID().toString}",
  StreamsConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
  "security.protocol" -> "SSL",
  StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG -> Serdes.String.getClass.getName,
  StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG -> "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde",
  AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "http://localhost:8081",
  "enable.auto.commit" -> "false",
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
  "specific.protobuf.value.type" -> classOf[proto.Data]
)
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()

override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] =
  myStream.exitCode
But if I try to filter only the records whose id equals "1":
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .filter(_.value.getId == "1")
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()
I get an error like:
Fiber failed.
An unchecked error was produced.
java.lang.ClassCastException: com.google.protobuf.DynamicMessage cannot be cast to proto.Data
I was wondering whether anybody has used zio-kafka together with Google protobuf and successfully deserialized records to the Java proto class when reading data from the topic?
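One detail worth checking (an assumption based on the code above, not a confirmed fix): Serde.fromKafkaSerde(..., props, true) configures the underlying serde with isKey = true, so the value-side setting specific.protobuf.value.type may never reach the deserializer, which would leave it falling back to com.google.protobuf.DynamicMessage. A minimal sketch that configures the serde for the value side instead:

// Sketch only: configure KafkaProtobufSerde as a *value* serde so that
// "specific.protobuf.value.type" is applied and records deserialize to
// proto.Data rather than com.google.protobuf.DynamicMessage.
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(
                  new KafkaProtobufSerde[proto.Data](),
                  props,
                  false // isKey = false: this serde is used for record values
                )
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .filter(_.value.getId == "1")
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()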

Flink 1.12 serialize Avro Generic Record to Kafka failed with com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException

I have a DataStream[GenericRecord]:
val consumer = new FlinkKafkaConsumer[String]("input_csv_topic", new SimpleStringSchema(), properties)
val stream = senv
  .addSource(consumer)
  .map(line => {
    val arr = line.split(",")
    val schemaUrl = "" // avro schema link, standard .avsc file format
    val schemaStr = scala.io.Source.fromURL(schemaUrl).mkString.stripLineEnd
    import org.codehaus.jettison.json.{JSONArray, JSONObject}
    val schemaFields: JSONArray = new JSONObject(schemaStr).optJSONArray("fields")
    val genericDevice: GenericRecord = new GenericData.Record(new Schema.Parser().parse(schemaStr))
    for (i <- 0 until arr.length) {
      val fieldObj: JSONObject = schemaFields.optJSONObject(i)
      val columnName = fieldObj.optString("name")
      val columnType = fieldObj.optString("type")
      if (columnType.contains("string")) {
        genericDevice.put(columnName, arr(i))
      } else if (columnType.contains("int")) {
        genericDevice.put(columnName, toInt(arr(i)).getOrElse(0).asInstanceOf[Number].intValue)
      } else if (columnType.contains("long")) {
        genericDevice.put(columnName, toLong(arr(i)).getOrElse(0).asInstanceOf[Number].longValue)
      }
    }
    genericDevice
  })

val kafkaSink = new FlinkKafkaProducer[GenericRecord](
  "output_avro_topic",
  new MyKafkaAvroSerializationSchema[GenericRecord](classOf[GenericRecord], "output_avro_topic", "this is the key", schemaStr),
  properties,
  FlinkKafkaProducer.Semantic.AT_LEAST_ONCE)

stream.addSink(kafkaSink)
Here is the MyKafkaAvroSerializationSchema implementation:
class MyKafkaAvroSerializationSchema[T](avroType: Class[T], topic: String, key: String, schemaStr: String) extends KafkaSerializationSchema[T] {

  lazy val schema: Schema = new Schema.Parser().parse(schemaStr)

  override def serialize(element: T, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
    val cl = Thread.currentThread().getContextClassLoader()
    val genericData = new GenericData(cl)
    val writer = new GenericDatumWriter[T](schema, genericData)
    // val writer = new ReflectDatumWriter[T](schema)
    // val writer = new SpecificDatumWriter[T](schema)
    val out = new ByteArrayOutputStream()
    val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
    writer.write(element, encoder)
    encoder.flush()
    out.close()
    new ProducerRecord[Array[Byte], Array[Byte]](topic, key.getBytes, out.toByteArray)
  }
}
Here is the stack trace:
com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
reserved (org.apache.avro.Schema$Field)
fieldMap (org.apache.avro.Schema$RecordSchema)
schema (org.apache.avro.generic.GenericData$Record)
How can I use Flink to serialize an Avro GenericRecord to Kafka? I have tested different writers, but I still get com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException. Thanks for your input.
You can simply add the flink-avro module to your project and use the already provided AvroSerializationSchema, which can be used both for SpecificRecord and GenericRecord after providing the schema.
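A minimal sketch of that approach, assuming the org.apache.flink:flink-avro dependency is on the classpath and schemaStr holds the .avsc contents from the question:

// Sketch: serialize GenericRecord values with the schema-aware
// AvroSerializationSchema from flink-avro instead of a custom schema class.
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.formats.avro.AvroSerializationSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer

val schema: Schema = new Schema.Parser().parse(schemaStr)

// forGeneric builds a SerializationSchema[GenericRecord] from the parsed
// schema; forSpecific(classOf[...]) does the same for generated classes.
val kafkaSink = new FlinkKafkaProducer[GenericRecord](
  "output_avro_topic",
  AvroSerializationSchema.forGeneric(schema),
  properties)

stream.addSink(kafkaSink)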

Change forEachRDD

I am having a problem; we are using Kafka and Spark.
We are using foreachRDD like this:
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    processMessage(message)
  }
  println(newRDD.count())
}
but we are calling the processMessage(message) method, and this method calls a class that creates the SparkContext. From what I have been reading, it will throw an error if you create the SparkContext inside the foreachRDD.
I have changed it like this:
messages.map { msg =>
  val newRDD3 = processMessage(msg)
  newRDD3
}
but I am not sure whether this does the same as the foreachRDD.
Could you please help me with this?
Any help would be really appreciated.
Use SparkSession:
SparkConf conf = new SparkConf()
    .setAppName("appName")
    .setMaster("local");
SparkSession sparkSession = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();
return sparkSession;
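For reference, a Scala sketch of the same idea (names are illustrative; the point is to create one shared SparkSession on the driver rather than a new SparkContext inside processMessage):

// Sketch: one shared SparkSession/SparkContext for the whole application.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("appName")
  .setMaster("local")

// getOrCreate() returns the existing session if one is already running,
// so repeated calls never construct a second SparkContext.
val sparkSession: SparkSession = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()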
I created the StreamingContext, then declared the topic and the kafkaParams.
Finally, I created the messages. Please see the code below:
def main(args: Array[String]) {
  val date_today = new SimpleDateFormat("yyyy_MM_dd")
  val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH")
  val PATH_SEPERATOR = "/"

  val conf = ConfigFactory.load("spfin.conf")
  println("kafka.duration --- " + conf.getString("kafka.duration").toLong)

  val mlFeatures: MLFeatures = new MLFeatures()

  // Create context with custom second batch interval
  //val sparkConf = new SparkConf().setAppName("SpFinML")
  //val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
  val ssc = new StreamingContext(mlFeatures.sc, Seconds(conf.getString("kafka.duration").toLong))

  // Create direct kafka stream with brokers and topics
  val topicsSet = conf.getString("kafka.requesttopic").split(",").toSet
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> conf.getString("kafka.brokers"),
    "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
    "group.id" -> conf.getString("kafka.consumergroups2"),
    "auto.offset.reset" -> conf.getString("kafka.autoOffset"),
    "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "security.protocol" -> "SASL_PLAINTEXT")

  /* this code is to get messages from the request topic */
  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  messages.foreachRDD { rdd =>
    val newRDD = rdd.map { message =>
      processMessage(message, "Inside processMessage")
    }
    println("Inside map")
    // println(newRDD.count())
  }
The processMessage method is this (I think I might need to change this method, right?):
def processMessage(message: ConsumerRecord[String, String], msg: String): ConsumerRecord[String, String] = {
  println(msg)
  println("Message processed is " + message.value())
  val req_message = message.value()
  //val tableName = conf.getString("hbase.tableName")
  //println("Hbase table name : " + tableName)
  //val decisionTree_res = "PredictionModelOutput "
  // val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message)
  // println(decisionTree_res)
  //
  // kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)
kafkaProducer(conf.getString("kafka.responsetopic"), """[{"payorId":53723,"therapyType":"RMIV","ndcNumber":"'66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"'9427535101","serviceDate":"20161102","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":22957.55,"daysOrUnits":140,"algorithmType":"LR ,RF ,NB ,VoteResult","label":0.0,"prediction":"0.0 ,0.0 ,0.0 ,0.0","finalPrediction":"Approved","rejectOutcome":"Y","neighborCounter":0,"probability":"0.9307022947278968 - 0.06929770527210313 ,0.9879908798891663 - 0.012009120110833629 ,1.0 - 0.0 ,","patientGender":"M","invoiceId":0,"therapyClass":"REMODULIN","patientAge":52,"npiId":0,"prescriptionId":0,"refillNo":0,"requestId":"419568891","requestDateTime":"20171909213055","responseId":"419568891","responseDateTime":"201801103503"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":1,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":2,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":3,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":4,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160621","responseDate":"20160818","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":19677.9,"daysOrUnits":120,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":5,"invoiceId":45226407,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":0,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"A"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate
":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":6,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":7,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":8,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170218","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9427535101","serviceDate":"20160829","responseDate":"20161020","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":9,"invoiceId":46347660,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":2,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"9450829801","serviceDate":"20160714","responseDate":"20160908","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":26237.2,"daysOrUnits":160,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":10,"invoiceId":45631877,"patientAge":0,"npiId":0,"prescriptionId":1174215,"refillNo":1,"authExpiry":"20170211","exceedFlag":"N","appealFlag":"N","authNNType":"P"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":4879911,"procedureCode":"J3285","authNbr":"9818182501","serviceDate":"20160901","responseDate":"20161027","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":11,"invoiceId":46419758,"patientAge":0,"npiId":0,"prescriptionId":1095509,"refillNo":7,"authExpiry":"20170626","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":4056274,"procedureCode":"J3285","authNbr":"8914616801","serviceDate":"20160727","responseDate":"20161019","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":13118.6,"daysOrUnits":80,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":12,"invoiceId":45820271,"patientAge":0,"npiId":0,"prescriptionId":1055447,"refillNo":10,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":3476365,"procedureCode":"J3285","authNbr":"9262852501","servic
eDate":"20160809","responseDate":"20161013","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":16398.25,"daysOrUnits":100,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":13,"invoiceId":46027459,"patientAge":0,"npiId":0,"prescriptionId":1169479,"refillNo":2,"authExpiry":"20161231","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":1064449,"procedureCode":"J3285","authNbr":"9327540001","serviceDate":"20160825","responseDate":"20161013","serviceBranchId":35,"serviceDuration":14,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":14,"invoiceId":46303200,"patientAge":0,"npiId":0,"prescriptionId":714169,"refillNo":0,"authExpiry":"20170112","exceedFlag":"N","appealFlag":"N","authNNType":"O"},{"payorId":53723,"therapyType":"RMIV","ndcNumber":"66302010201","patientId":9205248,"procedureCode":"J3285","authNbr":"0000000","serviceDate":"20160823","responseDate":"20161013","serviceBranchId":65,"serviceDuration":20,"placeOfService":12,"charges":6559.3,"daysOrUnits":40,"algorithmType":"NN","label":0.0,"rejectOutcome":"Y","neighborCounter":15,"invoiceId":46257476,"patientAge":0,"npiId":0,"prescriptionId":1206606,"refillNo":0,"authExpiry":"-","exceedFlag":"N","appealFlag":"N","authNNType":"O"}]""")
  //saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
  message
}
In case someone is interested in the solution:
For messages (an InputDStream) I use foreachRDD, which gives me an RDD, and then use map to get the JSON from each ConsumerRecord. Next, I convert the RDD to an Array[String] and pass it to the processMessage method.
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    message.value()
  }
  println("Request messages: " + newRDD.count())
  val resultrows = newRDD.collect()
  processMessage(resultrows, mlFeatures)
}
Inside the processMessage method there is a for loop to process all the strings. We also insert the request message and the response message into an HBase table.
def processMessage(message: Array[String], mlFeatures: MLFeatures) = {
  for (j <- 0 until message.size) {
    val req_message = message(j)
    val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message, mlFeatures)
    println("Message processed is " + req_message)
    kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)
    val startTime = new Date().getTime()
    saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
    val endTime = new Date().getTime()
    println("Kafka Consumer savetoHBase took : " + (endTime - startTime) / 1000 + " seconds")
  }
}

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich the data stream with profile information from an HBase table. I was looking to use a broadcast variable. The whole code is enclosed here.
The output of the HBase data is as follows.
In the driver node:
HBaseReaderBuilder(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
In the worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
As you can see, it has lost the SparkContext. When I issue the statement val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3)), I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {

  def main(args: Array[String]): Unit = {

    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"

    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)

    val ssc = new StreamingContext(mainSparkContext, Seconds(30))
    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))

    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)
    val broadcastedprof = mainSparkContext.broadcast(docRdd)

    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
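A note on a likely cause (my reading, not a confirmed answer): an RDD is only a driver-side description of a computation, so broadcasting the HBaseReaderBuilder-backed RDD ships a builder whose SparkContext field is null on the workers, and calling map on it there raises the NullPointerException. A common workaround is to collect the profiles on the driver and broadcast the plain data instead; a sketch under that assumption, reusing the Profile case class from the commented-out line:

// Sketch: broadcast materialized data, not the RDD itself.
val profiles: Map[String, Profile] =
  docRdd
    .map(r => r._1 -> Profile(r._1, r._2, r._3)) // key by the HBase row key
    .collect()                                   // materialize on the driver
    .toMap

val broadcastedprof = mainSparkContext.broadcast(profiles)

eventsStream.foreachRDD { dstream =>
  dstream.foreachPartition { records =>
    val profileLookup = broadcastedprof.value // a plain Map, safe on workers
    records.foreach { record =>
      val rows = record._2.split("\r\n")
      // enrich each row via profileLookup.get(...)
    }
  }
}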

Spark RDD write to HBase

I am able to read the messages from Kafka using the below code:
val ssc = new StreamingContext(sc, Seconds(50))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap)
But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success.
lines.foreachRDD(rdd => {
  rdd.foreach(record => {
    val i = +1
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(i))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
  })
})
Well, you are not actually executing the Put; you are merely creating a Put request and adding data to it. What you are missing is an
hTable.put(thePut);
Adding another answer!
You can use foreachPartition to establish the connection once per partition at the executor level, which is more efficient than connecting for each row, a costly operation.
lines.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    // one configuration and table handle per partition, not per record
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    iter.foreach(record => {
      val i = +1
      val thePut = new Put(Bytes.toBytes(i))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
      // the missing part in your code: actually execute the Put
      hTable.put(thePut)
    })
    hTable.close() // release the table handle once the partition is done
  })
})
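For newer HBase client versions (1.0+), the HTable constructor shown above is deprecated; here is a sketch of the same per-partition pattern with the Connection API. Note that val i = +1 in the original is just the literal 1, so every record overwrites the same row key; zipWithIndex below at least varies the key within a partition.

// Sketch only: same foreachPartition pattern with the HBase 1.x+ client API.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("test"))
    try {
      iter.zipWithIndex.foreach { case ((_, value), i) =>
        val thePut = new Put(Bytes.toBytes(i)) // key varies within the partition only
        thePut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(value))
        table.put(thePut)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}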