Is there a way to name the inner mapValues() topic created as part of the count() operator in kafka-streams? - apache-kafka

I'm, trying to name all processors of a simple word count kafka streams application, however, can't figure out how to name the inner topic created due to the mapValues() call inside the count() method, that's created as result of calling to stream(). This is the application code, followed by the topology description (showing only the second sub-topology):
def createTopology(builder: StreamsBuilder, config: Config): Topology = {
val consumed = Consumed
.as(inputTopic)
.withKeySerde(Serdes.stringSerde)
.withValueSerde(Serdes.stringSerde)
val produced = Produced
.as(outputTopic)
.withKeySerde(Serdes.stringSerde)
.withValueSerde(Serdes.longSerde)
val flatMapValuesProc = Named.as("flatMapValues")
val groupByProc = Grouped
.as("groupBy")
.withKeySerde(Serdes.stringSerde)
.withValueSerde(Serdes.stringSerde)
val textLines: KStream[String, String] = builder.stream[String, String](inputTopic)(consumed)
val wordCounts: KTable[String, Long] = textLines
.flatMapValues(textLine => textLine.toLowerCase.split("\\W+"), named = flatMapValuesProc)
.groupBy((_, word) => word)(grouped)
.count(Named.as("count"))(Materialized.as(storeName))
wordCounts
.toStream(Named.as("toStream"))
.to(outputTopic)(produced)
builder.build()
}
Looking at the count() method, it seems like it's not possible to name this operation, from the code. Is there another way to name this inner topic?
def count(named: Named)(implicit materialized: Materialized[K, Long, ByteArrayKeyValueStore]): KTable[K, Long] = {
...
new KTable(
javaCountTable.mapValues[Long](
((l: java.lang.Long) => Long2long(l)).asValueMapper,
Materialized.`with`[K, Long, ByteArrayKeyValueStore](tableImpl.keySerde(), Serdes.longSerde)
)
)
}

Related

Scala spark kafka code - functional approach

I've the following code in scala. I am using spark sql to pull data from hadoop, perform some group by on the result, serialize it and then write that message to Kafka.
I've written the code - but i want to write it in functional way. Should i create a new class with function 'getCategories' to get the categories from Hadoop? I am not sure how to approach this.
Here is the code
class ExtractProcessor {
def process(): Unit = {
implicit val formats = DefaultFormats
val spark = SparkSession.builder().appName("test app").getOrCreate()
try {
val df = spark.sql("SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
"FROM CATEGORY_HIERARCHY " +
"ORDER BY CAT_CODE, SUBCAT_CODE ")
val result = df.collect().groupBy(row => (row(2), row(3)))
val categories = result.map(cat =>
category(cat._1._1.toString(), cat._1._2.toString(),
cat._2.map(subcat =>
subcategory(subcat(0).toString(), subcat(1).toString())).toList))
val jsonMessage = write(categories)
val kafkaKey = java.security.MessageDigest.getInstance("SHA-1").digest(jsonMessage.getBytes("UTF-8")).map("%02x".format(_)).mkString.toString()
val key = write(kafkaKey)
Logger.log.info(s"Json Message: ${jsonMessage}")
Logger.log.info(s"Kafka Key: ${key}")
KafkaUtil.apply.send(key, jsonMessage, "testTopic")
}
And here is the Kafka Code
class KafkaUtil {
def send(key: String, message: String, topicName: String): Unit = {
val properties = new Properties()
properties.put("bootstrap.servers", "localhost:9092")
properties.put("client.id", "test publisher")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](properties)
try {
val record = new ProducerRecord[String, String](topicName, key, message)
producer.send(record)
}
finally {
producer.close()
Logger.log.info("Kafka producer closed...")
}
}
}
object KafkaUtil {
def apply: KafkaUtil = {
new KafkaUtil
}
}
Also, for writing unit tests what should i be testing in the functional approach. In OOP we unit test the business logic, but in my scala code there is hardly any business logic.
Any help is appreciated.
Thanks in advance,
Suyog
You code consists of
1) Loading the data into spark df
2) Crunching the data
3) Creating a json message
4) Sending json message to kafka
Unit tests are good for testing pure functions.
You can extract step 2) into a method with signature like
def getCategories(df: DataFrame): Seq[Category] and cover it by a test.
In the test data frame will be generated from just a plain hard-coded in-memory sequence.
Step 3) can be also covered by a unit test if you feel it error-prone
Steps 1) and 4) are to be covered by an end-to-end test
By the way
val result = df.collect().groupBy(row => (row(2), row(3))) is inefficient. Better to replace it by val result = df.groupBy(row => (row(2), row(3))).collect
Also there is no need to initialize a KafkaProducer individually for each single message.

ClassCastException in kafka-streams join

I have 2 kafka streams that I want to merge. Each of them is typed, and contains the right data:
private val inputOrderStream: KStream[String, OrderCreationRequest] =
orderCreationRequestBuilder
.stream[OrderCreationRequestKey, OrderCreationRequest]("order-creation-request-topic")
.map[String, OrderCreationRequest]((key: OrderCreationRequestKey, orderCreationRequest: OrderCreationRequest) ⇒ new KeyValue(key.id, orderCreationRequest))
private val inputPaymentStream: KStream[String, OrderPayment] =
orderPaymentBuilder
.stream[OrderPaymentKey, OrderPayment]("payment-topic")
.map[String, OrderPayment]((key: OrderPaymentKey, orderPayment: OrderPayment) ⇒ new KeyValue(key.id, orderPayment))
When I try to join then by bey, I get a very confusing java.lang.ClassCastException:
java.lang.ClassCastException: com.ordercreation.OrderPayment cannot be cast to com.ordercreation.OrderCreationRequest
at com.ordercreation.streams.Streams$$anon$1.apply(Streams.scala:48)
at org.apache.kafka.streams.kstream.internals.AbstractStream$1.apply(AbstractStream.java:71)
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:82)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:47)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:82)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:189)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndPunctuate(StreamThread.java:679)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:557)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
Here's the join code:
private val outputStream: KStream[String, String] =
inputPaymentStream
.join[OrderCreationRequest, String](
inputOrderStream,
new ValueJoiner[OrderPayment, OrderCreationRequest, String] {
override def apply(op: OrderPayment, ocr: OrderCreationRequest): String = s"Payed ${op.amountInCents} via ${op.method} on behalf of ${ocr.customerNumber}."
},
window,
Serdes.String(),
null,
null)
Investigating a bit more (using AnyRef instead of OrderCreationRequest in the join method) I can see that the same value of OrderPayment is given to both arguments of ValueJoiner.apply. Why is this happening? Am I missing anything?
Note: If I print the contents of the 2 streams I get the expected data so I'm sure the topics don't contains the same data.
Thank you!

How can I apply the same Pattern on two different Kafka Streams in Flink?

I have this Flink program below:
object WindowedWordCount {
val configFactory = ConfigFactory.load()
def main(args: Array[String]) = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val kafkaStream1 = env.addSource(new FlinkKafkaConsumer010[String](topic1, new SimpleStringSchema(), props))
.assignTimestampsAndWatermarks(new TimestampExtractor)
val kafkaStream2 = env.addSource(new FlinkKafkaConsumer010[String](topic2, new SimpleStringSchema(), props))
.assignTimestampsAndWatermarks(new TimestampExtractor)
val partitionedStream1 = kafkaStream1.keyBy(jsonString => {
extractUserId(jsonString)
})
val partitionedStream2 = kafkaStream2.keyBy(jsonString => {
extractUserId(jsonString)
})
//Is there a way to match the userId from partitionedStream1 and partitionedStream2 in this same pattern?
val patternForMatchingUserId = Pattern.begin[String]("start")
.where(stream1.getUserId() == stream2.getUserId()) //I want to do something like this
//Is there a way to pass in partitionedStream1 and partitionedStream2 to this CEP.pattern function?
val patternStream = CEP.pattern(partitionedStream1, patternForMatchingUserId)
env.execute()
}
}
In the flink program above, I have two streams named partitionedStream1 and partitionedStream2 which is keyedBy the userID.
I want to somehow compare the data from both streams in the patternForMatchingUserId pattern (similar to how I showed above). Is there a way to pass in two streams to the CEP.Pattern function?
Something like this:
val patternStream = CEP.pattern(partitionedStream1, partitionedStream2, patternForMatchingUserId)
There is no way you can pass two streams to CEP, but you can pass a combined stream.
If both streams have the same type/schema. You can just union them. I believe this solutions matches your case.
partitionedStream1.union(partitionedStream2).keyBy(...)
If they have different schema. You can transform them into one stream using some custom logic inside e.g. coFlatMap.

[Spark Streaming]How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model will be used to predict sth based on this new message. But as time goes by, the model can be changed for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this
def loadingModel(#transient sc:SparkContext)={
val model=LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
model
}
var error=0.0
var size=0.0
implicit def bool2int(b:Boolean) = if (b) 1 else 0
def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double,Double)] = {
val model=loadingModel(sc)
val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
val prediction = model.predict(pairs.features)
val wrong= prediction != pairs.label
error = state.getOption().getOrElse(Array(0.0,0.0))(0) + 1.0*(wrong:Int)
size=state.getOption().getOrElse(Array(0.0,0.0))(1) + 1.0
val output = (key, error,size)
state.update(Array(error,size))
Some(output)
}
val stateSpec = StateSpec.function(updateState _)
.numPartitions(1)
setupLogging()
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, there would be an exception like this
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within DStream function, spark seem to serialize the context object (because model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert DStream to RDD, collect the result and then run model prediction/scoring in the driver.
Used netcat utility to simulate streaming, tried the following code to convert DStream to RDD, it works. See if it helps.
val ssc = new StreamingContext(sc,Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)) )
val logisModel = LogisticRegressionModel.load(sc, /path/LR_Model")
linedstream.foreachRDD( rdd => {
for(item <- rdd.collect().toArray) {
val predictedVal = logisModel.predict(item)
println(predictedVal + "|" + item);
}
})
Understand collect is not scalable here, but if you think that your streaming messages are less in number for any interval, this is probably an option. This is what I see it possible in Spark 1.4.0, the higher versions probably have a fix for this. See this if its useful,
Save ML model for future usage

Scala - Tweets subscribing - Kafka Topic and Ingest into HBase

I have to consume tweets from a Kafka Topic and ingest the same into HBase. The following is the code that i wrote but this is not working properly.
The main code is not calling "convert" method and hence no records are ingested into HBase table. Can someone help me please.
tweetskafkaStream.foreachRDD(rdd => {
println("Inside For Each RDD" )
rdd.foreachPartition( record => {
println("Inside For Each Partition" )
val data = record.map(r => (r._1, r._2)).map(convert)
})
})
def convert(t: (String, String)) = {
println("in convert")
//println("first param value ", t._1)
//println("second param value ", t._2)
val hConf = HBaseConfiguration.create()
hConf.set(TableOutputFormat.OUTPUT_TABLE,hbaseTableName)
hConf.set("hbase.zookeeper.quorum", "192.168.XXX.XXX:2181")
hConf.set("hbase.master", "192.168.XXX.XXX:16000")
hConf.set("hbase.rootdir","hdfs://192.168.XXX.XXX:9000/hbase")
val today = Calendar.getInstance.getTime
val printformat = new SimpleDateFormat("yyyyMMddHHmmss")
val id = printformat.format(today)
val p = new Put(Bytes.toBytes(id))
p.add(Bytes.toBytes("data"), Bytes.toBytes("tweet_text"),(t._2).getBytes())
(id, p)
val mytable = new HTable(hConf,hbaseTableName)
mytable.put(p)
}
I don't want to use the current datetime as the key (t._1) and hence constructing that in my convert method.
Thanks
Bala
Instead of foreachPartition, I changed it to foreach. This worked well.