How to iterate over key values of a Kafka Streams Table - scala

I'm new to Kafka Streams and I'm trying to iterate over the items in a Kafka Streams table via its KeyValueStore.
The idea is to create a KTable (I've also tried with a GlobalKTable) backed by a KeyValueStore.
A separate thread is then in charge of reading the content of the KeyValueStore in order to iterate over the last value of each key.
val streamProperties: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getStringList("kafka.bootstrap.servers").toList.mkString(","))
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray.getClass.getName)
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p
}

val builder: StreamsBuilder = new StreamsBuilder()

import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore

val globalTable = builder.table("test",
  Materialized
    .as[String, Array[Byte], KeyValueStore[org.apache.kafka.common.utils.Bytes, Array[Byte]]]("TestStore")
    .withCachingDisabled()
    .withKeySerde(Serdes.String())
    .withValueSerde(Serdes.ByteArray())
)

val streams: KafkaStreams = new KafkaStreams(builder.build(), streamProperties)
streams.start()

val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
  def run() = {
    println("\n\n\n tick \n\n\n")
    try {
      val keyValueStore = streams.store(globalTable.queryableStoreName(), QueryableStoreTypes.keyValueStore())
      keyValueStore.all().toIterator.map { k =>
        print(k.key)
      }
    } catch {
      case _ => println("error")
    }
  }
}
val f = ex.scheduleAtFixedRate(task, 1, 10, TimeUnit.SECONDS)
In that thread the keyValueStore stays empty, even when I produce records to the topic "test".
Is there something I missed or didn't understand?

One thing missing is the state directory location config:
p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")
Without it Kafka Streams does not throw an exception, but stateful features such as (global) KTables can silently fail to work.
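For completeness, here is a minimal sketch of how the querying side can look once the state directory is set. It is an illustration only: it reuses the "TestStore" name from the question, waits for the Kafka Streams instance to reach RUNNING before querying (interactive queries fail while the instance is still rebalancing), closes the store iterator, and consumes it with foreach. Note that Iterator.map alone is lazy, so the original snippet would not print anything even once the store fills up.

import scala.collection.JavaConverters._
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.state.QueryableStoreTypes

// Sketch only: assumes the KafkaStreams instance and the "TestStore" store from the question.
def printAllKeys(streams: KafkaStreams): Unit = {
  if (streams.state() == KafkaStreams.State.RUNNING) {
    val store = streams.store("TestStore", QueryableStoreTypes.keyValueStore[String, Array[Byte]]())
    val iterator = store.all()
    try {
      iterator.asScala.foreach(kv => println(kv.key)) // foreach forces the iterator
    } finally {
      iterator.close() // KeyValueIterator holds resources and must be closed
    }
  } else {
    println(s"store not queryable yet, streams state is ${streams.state()}")
  }
}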

Related

How to use zio kafka with google protobuf when you need to read data from topic and get it as Java proto class?

I need to get data from a Kafka topic as a ZIO stream. The data there is in Google protobuf format, and I also need to check the schema.
I use the following sample protobuf file, which generates the proto.Data Java class for me:
syntax = "proto3";
package proto;
import "google/protobuf/timestamp.proto";
option java_multiple_files = true;
option java_outer_classname = "Protos";
message Data {
string id = 1;
google.protobuf.Timestamp receiveTimestamp = 2;
}
If I use the following properties, I am able to get data as a KStream[proto.Data] (so using the Kafka Streams API) for the proto.Data message class:
val props: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, s"kstream-application-${java.util.UUID.randomUUID().toString}")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  p.put("security.protocol", "SSL")
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde")
  p.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081")
  p.put("enable.auto.commit", "false")
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p.put("specific.protobuf.value.type", classOf[proto.Data])
  p
}
And here is the example code using the KStream (I am able to print only the records whose id equals "1"):
val builder: StreamsBuilder = new StreamsBuilder

val risks: KStream[String, proto.Data] =
  builder
    .stream[String, proto.Data](topic)
    .filter((_, value) => value.getId == "1")

val sysout = Printed
  .toSysOut[String, proto.Data]
  .withLabel("protoStream")

risks.print(sysout)

val streams: KafkaStreams = new KafkaStreams(builder.build(), props)
streams.start()

sys.ShutdownHookThread {
  streams.close(Duration.ofSeconds(10))
}
Now, if I use zio-kafka and the same properties, somehow I am able to print out the whole stream:
val props: Map[String, AnyRef] = Map(
  StreamsConfig.APPLICATION_ID_CONFIG -> s"kstream-application-${java.util.UUID.randomUUID().toString}",
  StreamsConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
  "security.protocol" -> "SSL",
  StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG -> Serdes.String.getClass.getName,
  StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG -> "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde",
  AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "http://localhost:8081",
  "enable.auto.commit" -> "false",
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
  "specific.protobuf.value.type" -> classOf[proto.Data]
)

val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
    .plainStream(Serde.string, serdeProto)
    .provideSomeLayer(consumer ++ Console.live)
    .tap(r => console.putStrLn(s"stream: $r"))
    .runDrain
} yield ()

override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
  myStream.exitCode
}
But if I try to filter only the records whose id equals "1":
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
    .plainStream(Serde.string, serdeProto)
    .provideSomeLayer(consumer ++ Console.live)
    .filter(_.value.getId == "1")
    .tap(r => console.putStrLn(s"stream: $r"))
    .runDrain
} yield ()
I get an error like:
Fiber failed.
An unchecked error was produced.
java.lang.ClassCastException: com.google.protobuf.DynamicMessage cannot be cast to proto.Data
I was wondering if anybody has used zio-kafka together with Google protobuf and successfully deserialized records to the generated Java proto class when reading data from the topic?
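No answer is recorded here, but one detail that may be worth double-checking (an assumption on my part, not something confirmed in the thread): Serde.fromKafkaSerde is called with isKey = true, while the Confluent protobuf deserializer only honours specific.protobuf.value.type when it is configured as a value deserializer and otherwise falls back to DynamicMessage, which would match the ClassCastException above. A minimal sketch of configuring it as a value serde, reusing props, stream and consumer from the question:

val myStream = for {
  // false = configure as a VALUE serde, so "specific.protobuf.value.type" -> classOf[proto.Data] is applied
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, false)
  _ <- stream
    .plainStream(Serde.string, serdeProto)
    .provideSomeLayer(consumer ++ Console.live)
    .filter(_.value.getId == "1")
    .tap(r => console.putStrLn(s"stream: $r"))
    .runDrain
} yield ()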

Kafka Ensure At Least Once

First project with Kafka; I'm trying to prove that an event will get processed at least once. So far, I'm not seeing evidence that processing is retried.
The structure of the dummy app is simple: subscribe, process, publish, commit; if there is an exception, abort the transaction and hope it gets retried. I am logging every message.
I expect to see (1) "process messageX" (2) "error for messageX" (3) "process messageX". Instead, I see processing continue beyond messageX, i.e. it does not get re-processed.
What I see is: (1) "process messageX" (2) "error for messageX" (3) "process someOtherMessage".
Using Kafka 2.7.0, Scala 2.12.
What am I missing? The relevant parts of the dummy app are shown below.
I also tried removing the producer from the code (and all references to it).
UPDATE 1: I managed to get records re-processed by using the offsets with consumer.seek(), i.e. sending the consumer back to the start of the batch of records. I'm not sure why simply NOT reaching consumer.commitSync() (because of an exception) does not do this already.
import com.myco.somepackage.{MyEvent, KafkaConfigTxn}
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.{KafkaException, TopicPartition}
import org.slf4j.LoggerFactory

import java.util
import scala.collection.JavaConverters._
import scala.util.control.NonFatal

// Prove that a message can be re-processed if there is an exception
object TopicDrainApp {

  private val logger = LoggerFactory.getLogger(this.getClass)
  private val subTopic = "input.topic"
  private val pubTopic = "output.topic"

  val producer = new KafkaProducer[String, String](KafkaConfigTxn.producerProps)
  producer.initTransactions()
  val consumer = new KafkaConsumer[String, String](KafkaConfigTxn.consumerProps)

  private var lastEventMillis = System.currentTimeMillis
  private val pollIntervalMillis = 1000
  private val pollDuration = java.time.Duration.ofMillis(pollIntervalMillis)

  def main(args: Array[String]): Unit = {
    subscribe(subTopic)
  }

  def subscribe(subTopic: String): Unit = {
    consumer.subscribe(util.Arrays.asList(subTopic))
    while (System.currentTimeMillis - lastEventMillis < 5000L) {
      try {
        val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
        records.asScala.foreach { record =>
          try {
            lastEventMillis = System.currentTimeMillis
            val event = MyEvent.deserialize(record.value())
            logger.info("ReceivedMyEvent:" + record.value())
            producer.beginTransaction()
            simulateProcessing(event) // [not shown] throw exception to test re-processing
            producer.flush()
            val offsetsToCommit = getOffsetsToCommit(records)
            //consumer.commitSync() // tried this; does not work
            //producer.sendOffsetsToTransaction(offsetsToCommit, "group1") // tried this; does not work
            producer.commitTransaction()
          } catch {
            case e: KafkaException =>
              logger.error(s"rollback ${record.value()}", e)
              producer.abortTransaction()
          }
        }
      } catch {
        case NonFatal(e) => logger.error(e.getMessage, e)
      }
    }
  }

  private def getOffsetsToCommit(records: ConsumerRecords[String, String]): util.Map[TopicPartition, OffsetAndMetadata] = {
    records.partitions().asScala.map { partition =>
      val partitionedRecords = records.records(partition)
      val offset = partitionedRecords.get(partitionedRecords.size - 1).offset
      (partition, new OffsetAndMetadata(offset + 1))
    }.toMap.asJava
  }
}
object KafkaConfigTxn {

  // Only relevant properties are shown

  def commonProperties: Properties = {
    val props = new Properties()
    props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "...")
    props.put(CommonClientConfigs.GROUP_ID_CONFIG, "...")
    props
  }

  def producerProps: Properties = {
    val props = new Properties()
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // "enable.idempotence"
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "...") // "transactional.id"
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    props.put(ProducerConfig.RETRIES_CONFIG, "3")
    commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
    props
  }

  def consumerProps: Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed") // "isolation.level"
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
    props
  }
}
According to the reference I gave you, you need to use sendOffsetsToTransaction in the process; but again, your consumer won't see the messages of an aborted transaction, because you are reading only committed transactions.
Transactions were introduced in order to allow exactly-once processing between Kafka and Kafka; that said, Kafka has supported the at-least-once and at-most-once delivery semantics from day one.
To get at-least-once behavior, you disable auto commit and commit only when processing has finished successfully. That way, if you hit an exception before the commit, the next time you call poll() you will read the records again from the last committed offset.
To get at-most-once behavior, you commit before processing starts. That way, if an exception happens, the next time you call poll() you get new messages (but you lose the ones that failed).
Exactly-once is the hardest to achieve in plain Java (not talking about the Spring framework, which makes everything easier): it involves saving offsets to an external DB (usually where your processing is done) and reading them back from there on startup/rebalance.
For a transaction usage example in Java you might read this excellent guide by Baeldung:
https://www.baeldung.com/kafka-exactly-once
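As a concrete illustration of the plain-consumer at-least-once pattern described above (auto commit disabled, commit only after successful processing), here is a minimal sketch; the topic, group id, bootstrap servers and the process function are placeholders, not values taken from the question:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.collection.JavaConverters._

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-demo")
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false") // commit manually, only after success
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")

def process(value: String): Unit = println(value) // placeholder processing step

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("input.topic"))

while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  try {
    records.asScala.foreach(r => process(r.value()))
    consumer.commitSync() // commit only after the whole batch was processed successfully
  } catch {
    case e: Exception =>
      // Nothing is committed, so the batch is read again after a restart or rebalance
      // (or immediately, if you seek back explicitly as in UPDATE 1 of the question).
      println(s"processing failed, offsets not committed: ${e.getMessage}")
  }
}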
I figured out the correct combination of method calls (subscribe, beginTransaction, process, commitTransaction / abortTransaction, etc.) for a demo app. The core of the code is:
def readProcessWrite(subTopic: String, pubTopic: String): Int = {
  var lastEventMillis = System.currentTimeMillis
  val consumer = createConsumer(subTopic)
  val producer = createProducer()
  val groupMetadata = consumer.groupMetadata()
  var numRecords = 0

  while (System.currentTimeMillis - lastEventMillis < 10000L) {
    try {
      val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
      val offsetsToCommit = getOffsetsToCommit(records)
      // println(s">>> PollRecords: ${records.count()}")
      records.asScala.foreach { record =>
        val currentOffset = record.offset()
        try {
          numRecords += 1
          lastEventMillis = System.currentTimeMillis
          println(s">>> Topic: $subTopic, ReceivedEvent: offset=${record.offset()}, key=${record.key()}, value=${record.value()}")
          producer.beginTransaction()
          val eventOut = simulateProcessing(record.value()) // may throw
          publish(producer, pubTopic, eventOut)
          producer.sendOffsetsToTransaction(offsetsToCommit, groupMetadata)
          consumer.commitSync()
          producer.commitTransaction()
        } catch {
          case e: KafkaException =>
            println(s"---------- rollback ${record.value()}", e)
            producer.abortTransaction()
            offsetsToCommit.forEach { case (topicPartition, _) =>
              consumer.seek(topicPartition, currentOffset)
            }
        }
      }
    } catch {
      case NonFatal(e) => logger.error(e.getMessage, e)
    }
  }

  consumer.close()
  producer.close()
  numRecords
}

// Consumer created with props.put("max.poll.records", "1")
I was able to prove that this will process each event exactly once, even when simulateProcessing() throws an exception. To be precise: when processing works fine, each event is processed exactly once. If there is an exception, the event is re-processed until success. In my case, there is no real reason for the exceptions, so re-processing will always end in success.
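The createConsumer and createProducer helpers are not shown in the answer. As a sketch of what createConsumer might look like (my assumption, combining the consumerProps from the question with the max.poll.records note above; the bootstrap servers value is a placeholder):

import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

def createConsumer(topic: String): KafkaConsumer[String, String] = {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "group1")
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
  props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1") // one record per poll, as noted above
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Arrays.asList(topic))
  consumer
}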

TopologyException on left join

I'm trying to do a simple stream.leftJoin(table) but get the following exception at runtime:
TopologyException: Invalid topology: StateStore null is not added yet
This is roughly what my code looks like; I commented out the implementation details to keep it short:
val streamsConfiguration: Properties = {
  val p = new Properties()
  // api config
  p.put(StreamsConfig.APPLICATION_ID_CONFIG /**/)
  p.put(StreamsConfig.CLIENT_ID_CONFIG /**/)
  // kafka broker
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // local state store
  p.put(StreamsConfig.STATE_DIR_CONFIG, "./streams-state")
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  // serdes
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, classOf[StringSerde])
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, classOf[StringSerde])
  p
}
val builder = new StreamsBuilderS()

val rawInfoTable: KTableS[String, String] = builder.table("station_info")
val infoTable: KTableS[String, StationInfo] = rawInfoTable.mapValues { jsonString =>
  /** */
}.filter(/** */)
 .mapValues(/** */)

val rawStatusStream: KStreamS[String, String] = builder.stream("station_status")
val statusStream: KStreamS[String, StationStatus] = rawStatusStream.flatMapValues { jsonString =>
  /** */
}

val outputStream: KStreamS[String, String] = statusStream
  .leftJoin(infoTable, calculateStats)
  .filter((_, availability) => {
    /** */
  })
  .map((stationId: String, availability) => {
    /** */
  })

outputStream.to("low_availability")
val streams = new KafkaStreams(builder.build(), streamsConfiguration)
streams.cleanUp()
streams.start()
I even tried to manually add a StateStore via:
val store = Stores.inMemoryKeyValueStore("my-store")
val storeBuilder = Stores.keyValueStoreBuilder(store, new StringSerde(), new StringSerde())
val builder = new StreamsBuilderS()
builder.addStateStore(storeBuilder)
But it doesn't seem to change anything. I'm using the kafka streams wrapper from lightbend: "com.lightbend" %% "kafka-streams-scala" % "0.2.1"
None of the examples I checked seem to care about adding a state store, so I'm somewhat confused. Can somebody point me in the right direction? Does this have something to do with STATE_DIR_CONFIG? Or with the Kafka cluster I'm running locally?

Kafka producer hangs on send

The logic is that a streaming job, getting data from a custom source, has to write both to Kafka and to HDFS.
I wrote a (very) basic Kafka producer to do this; however, the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String, val sslCertificatePath: String, val sslCertificatePassword: String) {

  val kafkaProps: Properties = new Properties()
  kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
  kafkaProps.put("acks", "1")
  kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put("ssl.truststore.location", sslCertificatePath)
  kafkaProps.put("ssl.truststore.password", sslCertificatePassword)

  val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)

  def sendKafkaMessage(message: Message): Unit = {
    message.data.foreach(list => {
      val producerRecord: ProducerRecord[Long, Array[String]] = new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
      kafkaProducer.send(producerRecord)
    })
  }
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
  val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
    val parser: Parser = new Parser
    val kafkaProducer: KafkaProducer = new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)
    val newPartition = partition.map(message => {
      Logger.getLogger("importer").error("Writing Message to Kafka...")
      kafkaProducer.sendKafkaMessage(message)
      Logger.getLogger("importer").error("Finished writing Message to Kafka")
      message.data.map(singleMessage => parser.parseMessage(message.timeStamp.getTime, singleMessage))
    })
    newPartition.flatten
  })

  val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
  Logger.getLogger("importer").info("Entries-count: " + df.count())
  val row = Try(df.first)
  row match {
    case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
    case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
  }
})
From the logs I can tell that each executor reaches the "sending to Kafka" point, but gets no further. All executors hang there and no exception is thrown.
The Message class is a very simple case class with 2 fields, a timestamp and an array of strings.
This turned out to be due to the acks setting in Kafka.
Acks was set to 1 and sends went ahead a lot faster.
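The answer is terse, so as a debugging aid (my own suggestion, not part of the original answer): KafkaProducer.send is asynchronous and returns a Future, so a hang at that point usually means the producer is blocked fetching metadata or waiting for buffer space rather than failing outright. A sketch of making failures visible with a callback, plus a bounded wait as an alternative (the timeout value is arbitrary):

import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

def sendWithVisibility(producer: KafkaProducer[String, String], topic: String, key: String, value: String): Unit = {
  val record = new ProducerRecord[String, String](topic, key, value)

  // Asynchronous send with a callback, so failures are logged instead of silently swallowed.
  producer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) println(s"send failed: ${exception.getMessage}")
      else println(s"sent to ${metadata.topic()}-${metadata.partition()}@${metadata.offset()}")
  })

  // Alternatively, block with a bound so an indefinite hang becomes a visible TimeoutException:
  // producer.send(record).get(30, TimeUnit.SECONDS)
}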

stopping spark streaming after reading first batch of data

I am using Spark Streaming to consume Kafka messages. I want to get some messages as a sample from Kafka instead of reading all of them. So I want to read a batch of messages, return them to the caller, and stop Spark Streaming. Currently I am passing the batchInterval time to the awaitTermination method of the Spark streaming context. I don't know how to return the processed data to the caller from Spark Streaming. Here is the code that I am using currently:
def getsample(params: scala.collection.immutable.Map[String, String]): Unit = {
  if (params.contains("zookeeperQourum"))
    zkQuorum = params.get("zookeeperQourum").get
  if (params.contains("userGroup"))
    group = params.get("userGroup").get
  if (params.contains("topics"))
    topics = params.get("topics").get
  if (params.contains("numberOfThreads"))
    numThreads = params.get("numberOfThreads").get
  if (params.contains("sink"))
    sink = params.get("sink").get
  if (params.contains("batchInterval"))
    interval = params.get("batchInterval").get.toInt

  val sparkConf = new SparkConf().setAppName("KafkaConsumer").setMaster("spark://cloud2-server:7077")
  val ssc = new StreamingContext(sparkConf, Seconds(interval))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

  var consumerConfig = scala.collection.immutable.Map.empty[String, String]
  consumerConfig += ("auto.offset.reset" -> "smallest")
  consumerConfig += ("zookeeper.connect" -> zkQuorum)
  consumerConfig += ("group.id" -> group)

  var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
  val streams = data.window(Seconds(interval), Seconds(interval)).map(x => new String(x))

  streams.foreach(rdd => rdd.foreachPartition(itr => {
    while (itr.hasNext && size >= 0) {
      var msg = itr.next
      println(msg)
      sample.append(msg)
      sample.append("\n")
      size -= 1
    }
  }))

  ssc.start()
  ssc.awaitTermination(5000)
  ssc.stop(true)
}
So, instead of saving messages in a StringBuilder called "sample", I want to return them to the caller.
You can implement a StreamingListener and then, inside its onBatchCompleted method, call ssc.stop():
private class MyJobListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) = synchronized {
    ssc.stop(true)
  }
}
This is how you attach the listener to your StreamingContext:
val listen = new MyJobListener(ssc)
ssc.addStreamingListener(listen)
ssc.start()
ssc.awaitTermination()
We can get sample messages using the following piece of code:
var sampleMessages = streams.repartition(1).mapPartitions(x => x.take(10))
and if we want to stop after the first batch, then we should implement our own StreamingListener interface and stop streaming in the onBatchCompleted method.
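Putting the two answers together, here is a sketch of returning the sampled messages to the caller (my own combination of the pieces above, not code from the answers; stopping the context from a separate thread is a conservative choice so the listener callback itself is not blocked by the shutdown):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Returns up to maxMessages messages sampled from the first batch of the given DStream.
def sampleFirstBatch(ssc: StreamingContext, messages: DStream[String], maxMessages: Int): Seq[String] = {
  val sample = ArrayBuffer.empty[String] // driver-side buffer

  messages.foreachRDD { rdd =>
    // foreachRDD runs on the driver; take() pulls at most the requested number of elements back.
    if (sample.size < maxMessages) sample ++= rdd.take(maxMessages - sample.size)
  }

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
      new Thread(new Runnable {
        override def run(): Unit = ssc.stop(true) // stop after the first completed batch
      }).start()
    }
  })

  ssc.start()
  ssc.awaitTermination() // returns once the listener has stopped the context
  sample.toList
}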