ClassCastException in kafka-streams join - scala

I have 2 kafka streams that I want to merge. Each of them is typed, and contains the right data:
private val inputOrderStream: KStream[String, OrderCreationRequest] =
  orderCreationRequestBuilder
    .stream[OrderCreationRequestKey, OrderCreationRequest]("order-creation-request-topic")
    .map[String, OrderCreationRequest]((key: OrderCreationRequestKey, orderCreationRequest: OrderCreationRequest) ⇒ new KeyValue(key.id, orderCreationRequest))

private val inputPaymentStream: KStream[String, OrderPayment] =
  orderPaymentBuilder
    .stream[OrderPaymentKey, OrderPayment]("payment-topic")
    .map[String, OrderPayment]((key: OrderPaymentKey, orderPayment: OrderPayment) ⇒ new KeyValue(key.id, orderPayment))
When I try to join them by key, I get a very confusing java.lang.ClassCastException:
java.lang.ClassCastException: com.ordercreation.OrderPayment cannot be cast to com.ordercreation.OrderCreationRequest
at com.ordercreation.streams.Streams$$anon$1.apply(Streams.scala:48)
at org.apache.kafka.streams.kstream.internals.AbstractStream$1.apply(AbstractStream.java:71)
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:82)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:189)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndPunctuate(StreamThread.java:679)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:557)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
Here's the join code:
private val outputStream: KStream[String, String] =
  inputPaymentStream
    .join[OrderCreationRequest, String](
      inputOrderStream,
      new ValueJoiner[OrderPayment, OrderCreationRequest, String] {
        override def apply(op: OrderPayment, ocr: OrderCreationRequest): String =
          s"Payed ${op.amountInCents} via ${op.method} on behalf of ${ocr.customerNumber}."
      },
      window,
      Serdes.String(),
      null,
      null)
Investigating a bit more (using AnyRef instead of OrderCreationRequest in the join method), I can see that the same OrderPayment value is passed to both arguments of ValueJoiner.apply. Why is this happening? Am I missing anything?
Note: if I print the contents of the two streams I get the expected data, so I'm sure the topics don't contain the same data.
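Note 2: the two trailing nulls in the join above are the value serdes for the two sides (this stream's values and the other stream's values); passing null makes Kafka Streams fall back to the default serde from the StreamsConfig. With explicit serdes the call would look like this (orderPaymentSerde and orderCreationRequestSerde are placeholders, not names from the project):
inputPaymentStream
  .join[OrderCreationRequest, String](
    inputOrderStream,
    valueJoiner,               // the ValueJoiner from the snippet above
    window,
    Serdes.String(),           // key serde
    orderPaymentSerde,         // value serde for this stream (placeholder)
    orderCreationRequestSerde) // value serde for the other stream (placeholder)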
Thank you!

Related

Is there a way to name the inner mapValues() topic created as part of the count() operator in kafka-streams?

I'm trying to name all processors of a simple word count Kafka Streams application, but I can't figure out how to name the inner topic created by the mapValues() call inside the count() method. This is the application code:
def createTopology(builder: StreamsBuilder, config: Config): Topology = {
  val consumed = Consumed
    .as(inputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)
  val produced = Produced
    .as(outputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.longSerde)
  val flatMapValuesProc = Named.as("flatMapValues")
  val groupByProc = Grouped
    .as("groupBy")
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)
  val textLines: KStream[String, String] = builder.stream[String, String](inputTopic)(consumed)
  val wordCounts: KTable[String, Long] = textLines
    .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"), named = flatMapValuesProc)
    .groupBy((_, word) => word)(groupByProc)
    .count(Named.as("count"))(Materialized.as(storeName))
  wordCounts
    .toStream(Named.as("toStream"))
    .to(outputTopic)(produced)
  builder.build()
}
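To see which node names are still auto-generated (including the inner mapValues node created by count()), the built topology can be printed; Topology#describe is part of the public API:
// Prints all sub-topologies, their processors and internal topics,
// making the auto-generated name of the inner mapValues node visible.
val topology = createTopology(builder, config)
println(topology.describe())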
Looking at the count() method, it seems it's not possible to name this operation from the code. Is there another way to name this inner topic?
def count(named: Named)(implicit materialized: Materialized[K, Long, ByteArrayKeyValueStore]): KTable[K, Long] = {
  ...
  new KTable(
    javaCountTable.mapValues[Long](
      ((l: java.lang.Long) => Long2long(l)).asValueMapper,
      Materialized.`with`[K, Long, ByteArrayKeyValueStore](tableImpl.keySerde(), Serdes.longSerde)
    )
  )
}

Converting a String to an Apache Avro GenericRecord

Can someone give advice on how to convert a String into a GenericRecord? I would like to convert my records to Avro and put them into a Kafka topic.
I used to have the producer typed with String (I've since replaced it with GenericRecord).
def producerMethod(socket: Seq[Socket], prot: String): KafkaProducer[String, GenericRecord] =
  producer(params(socket, prot))

def producerMethod(params: Map[String, String]): KafkaProducer[String, GenericRecord] =
  new KafkaProducer[String, GenericRecord](params.asInstanceOf[Map[String, Object]].asJava)
I also have an apply method, which contains an iterator. It iterates over strings and collects the values into a tuple. In the end I send the data to a topic. The results variable is a String, but I added asInstanceOf:
result.send(new ProducerRecord[String, GenericRecord](topic, key, results.asInstanceOf[GenericRecord]))
Here is my iterator:
iterator.foreach { tuple =>
  val (keyinfo, info, Worked(outinfo, _)) = tuple
  val (key, results) = keyResultToString(keyinfo, info, outinfo)
  result.send(new ProducerRecord[String, GenericRecord](topic, key, results.asInstanceOf[GenericRecord]))
}
Everything works when I change GenericRecord back to String, but not when I try to convert to a GenericRecord.
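A String can't simply be cast to a GenericRecord: asInstanceOf doesn't convert anything, and String doesn't implement the GenericRecord interface, so the cast fails at runtime. A record has to be built against an Avro Schema instead. A minimal sketch, assuming a hypothetical single-field schema (the schema and field name are placeholders for the real record layout):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical schema with one string field; replace with the real one.
val schemaJson =
  """{"type":"record","name":"Result","fields":[{"name":"value","type":"string"}]}"""
val schema: Schema = new Schema.Parser().parse(schemaJson)

// Build a GenericRecord carrying the string payload instead of casting it.
def toGenericRecord(results: String): GenericRecord = {
  val record = new GenericData.Record(schema)
  record.put("value", results)
  record
}

// Then send the built record rather than the cast:
// result.send(new ProducerRecord[String, GenericRecord](topic, key, toGenericRecord(results)))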

Connecting a list of case classes to a Kafka producer?

I have the below case class:
case class Alpakka(id: Int, name: String, animal_type: String)
I am trying to connect a list of these case classes to a producer in Kafka using the following code:
def connectEntriesToProducer(seq: Seq[Alpakka]) = {
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")
  seq.map(alpakka => new ProducerRecord[String, String]("alpakkas", alpakka.asJson.noSpaces))
    .runWith(Producer.plainSink(producerSettings))
}
I am using circe to convert the case class to JSON. However, I keep getting a compiler error saying this:
Error:(87, 34) type mismatch;
 found   : akka.stream.scaladsl.Sink[org.apache.kafka.clients.producer.ProducerRecord[String,String],scala.concurrent.Future[akka.Done]]
 required: org.apache.kafka.clients.producer.ProducerRecord[String,String] => ?
    .runWith(Producer.plainSink(producerSettings))
I'm not sure what's going on!
You are trying to build a Graph from a Seq instead of a Source.
Your method connectEntriesToProducer should look like
def connectEntriesToProducer(seq: Source[Alpakka, _]) = {
Note: Source instead of Seq.
Alternatively, you can build a Source from a Seq, but you'll have to use immutable.Seq, since Source.apply only takes an immutable iterable.
def connectEntriesToProducer(seq: scala.collection.immutable.Seq[Alpakka]) = {
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")
  Source(seq)
    .map(alpakka => new ProducerRecord[String, String]("alpakkas", alpakka.asJson.noSpaces))
    .runWith(Producer.plainSink(producerSettings))
}
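A hypothetical call site for this second variant, assuming an implicit ActorSystem (providing the materializer) is in scope and circe's derivation imports are available; the sample values are made up:
// Sample entries; runWith returns a Future[Done] that completes
// once all records have been sent to the "alpakkas" topic.
val entries = scala.collection.immutable.Seq(
  Alpakka(1, "Kaa", "snake"),
  Alpakka(2, "Baloo", "bear"))
connectEntriesToProducer(entries)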

[Spark Streaming] How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model is used to predict something based on it. But as time goes by, the model can change for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this:
def loadingModel(@transient sc: SparkContext) = {
  val model = LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
  model
}

var error = 0.0
var size = 0.0
implicit def bool2int(b: Boolean) = if (b) 1 else 0

def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double, Double)] = {
  val model = loadingModel(sc)
  val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
  val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
  val prediction = model.predict(pairs.features)
  val wrong = prediction != pairs.label
  error = state.getOption().getOrElse(Array(0.0, 0.0))(0) + 1.0 * (wrong: Int)
  size = state.getOption().getOrElse(Array(0.0, 0.0))(1) + 1.0
  val output = (key, error, size)
  state.update(Array(error, size))
  Some(output)
}

val stateSpec = StateSpec.function(updateState _)
  .numPartitions(1)

setupLogging()
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, I get an exception like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within a DStream function, Spark seems to serialize the context object (because the model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert the DStream to RDDs, collect the results, and then run the model prediction/scoring in the driver.
I used the netcat utility to simulate streaming and tried the following code to convert the DStream to RDDs; it works. See if it helps.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)))
val logisModel = LogisticRegressionModel.load(sc, "/path/LR_Model")
linedstream.foreachRDD(rdd => {
  for (item <- rdd.collect().toArray) {
    val predictedVal = logisModel.predict(item)
    println(predictedVal + "|" + item)
  }
})
I understand collect is not scalable here, but if you think your streaming messages are few in number for any interval, this is probably an option. This is what I found possible in Spark 1.4.0; higher versions probably have a fix for this. See this if it's useful:
Save ML model for future usage

Scala - Subscribing to tweets from a Kafka Topic and ingesting into HBase

I have to consume tweets from a Kafka topic and ingest them into HBase. The following is the code that I wrote, but it is not working properly:
the main code never calls the "convert" method, and hence no records are ingested into the HBase table. Can someone help me, please?
tweetskafkaStream.foreachRDD(rdd => {
  println("Inside For Each RDD")
  rdd.foreachPartition(record => {
    println("Inside For Each Partition")
    val data = record.map(r => (r._1, r._2)).map(convert)
  })
})
def convert(t: (String, String)) = {
  println("in convert")
  //println("first param value ", t._1)
  //println("second param value ", t._2)
  val hConf = HBaseConfiguration.create()
  hConf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName)
  hConf.set("hbase.zookeeper.quorum", "192.168.XXX.XXX:2181")
  hConf.set("hbase.master", "192.168.XXX.XXX:16000")
  hConf.set("hbase.rootdir", "hdfs://192.168.XXX.XXX:9000/hbase")
  val today = Calendar.getInstance.getTime
  val printformat = new SimpleDateFormat("yyyyMMddHHmmss")
  val id = printformat.format(today)
  val p = new Put(Bytes.toBytes(id))
  p.add(Bytes.toBytes("data"), Bytes.toBytes("tweet_text"), (t._2).getBytes())
  val mytable = new HTable(hConf, hbaseTableName)
  mytable.put(p)
  (id, p) // return the row key and the Put
}
I don't want to use t._1 as the key; hence I construct the current datetime as the key in my convert method.
Thanks
Bala
Instead of foreachPartition, I changed it to foreach. This worked well.
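That fix makes sense: inside foreachPartition, record is an Iterator, and Iterator.map is lazy, so the chained map(convert) never ran because nothing consumed the resulting iterator. For what it's worth, a sketch that keeps foreachPartition but forces evaluation (reusing the convert method above):
tweetskafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Iterator.map is lazy: nothing executes until the iterator is consumed,
    // so use foreach to make convert run for every record in the partition.
    records.foreach(r => convert((r._1, r._2)))
  }
}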