How to use zio-kafka with Google Protobuf when you need to read data from a topic and get it as a Java proto class? (Scala)

I need to get data from a Kafka topic as a ZIO stream; the data there is in Google Protobuf format, and I also need to check the schema.
I use the following sample protobuf file, which generates the proto.Data Java class for me:
syntax = "proto3";
package proto;
import "google/protobuf/timestamp.proto";
option java_multiple_files = true;
option java_outer_classname = "Protos";
message Data {
  string id = 1;
  google.protobuf.Timestamp receiveTimestamp = 2;
}
If I use the following properties, I am able to get the data as a KStream[String, proto.Data] (so using the Kafka Streams API) for the proto.Data message class:
val props: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, s"kstream-application-${java.util.UUID.randomUUID().toString}")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  p.put("security.protocol", "SSL")
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde")
  p.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081")
  p.put("enable.auto.commit", "false")
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p.put("specific.protobuf.value.type", classOf[proto.Data])
  p
}
And here is an example of code using the KStream (I am able to print only the records whose id equals "1"):
val builder: StreamsBuilder = new StreamsBuilder
val risks: KStream[String, proto.Data] =
  builder
    .stream[String, proto.Data](topic)
    .filter((_, value) => value.getId == "1")
val sysout = Printed
  .toSysOut[String, proto.Data]
  .withLabel("protoStream")
risks.print(sysout)
val streams: KafkaStreams = new KafkaStreams(builder.build(), props)
streams.start()
sys.ShutdownHookThread {
  streams.close(Duration.ofSeconds(10))
}
Now, if I use zio-kafka with the same properties, somehow I am able to print out the whole stream:
val props: Map[String, AnyRef] = Map(
  StreamsConfig.APPLICATION_ID_CONFIG -> s"kstream-application-${java.util.UUID.randomUUID().toString}",
  StreamsConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
  "security.protocol" -> "SSL",
  StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG -> Serdes.String.getClass.getName,
  StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG -> "io.confluent.kafka.streams.serdes.protobuf.KafkaProtobufSerde",
  AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "http://localhost:8081",
  "enable.auto.commit" -> "false",
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
  "specific.protobuf.value.type" -> classOf[proto.Data]
)
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()

override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
  myStream.exitCode
}
But if I try to filter only the records whose id equals "1":
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, true)
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .filter(_.value.getId == "1")
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()
I get an error like:
Fiber failed.
An unchecked error was produced.
java.lang.ClassCastException: com.google.protobuf.DynamicMessage cannot be cast to proto.Data
I was wondering whether anybody has used zio-kafka together with Google Protobuf and successfully deserialized records into the generated Java proto class when reading data from the topic?
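A side note on the ClassCastException, which is only my guess rather than a confirmed fix: the serde above decodes record values, yet it is built with Serde.fromKafkaSerde(..., props, true), i.e. configured as a key serde, so the Confluent deserializer may never pick up "specific.protobuf.value.type" and fall back to DynamicMessage. A minimal sketch under that assumption (reusing the same props, stream and consumer definitions as above):
// Hypothetical variant of the snippet above: configure the Confluent serde as a
// VALUE serde (isKey = false) so that "specific.protobuf.value.type" is applied
// and records are deserialized to proto.Data instead of DynamicMessage.
val myStream = for {
  serdeProto <- Serde.fromKafkaSerde(new KafkaProtobufSerde[proto.Data](), props, false) // false = value serde
  _ <- stream
         .plainStream(Serde.string, serdeProto)
         .provideSomeLayer(consumer ++ Console.live)
         .filter(_.value.getId == "1")
         .tap(r => console.putStrLn(s"stream: $r"))
         .runDrain
} yield ()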

Related

Unable to store consumed Kafka offsets into HBase table while committing through Spark DStreams

I am trying to save the Kafka consumer offsets in an HBase table, with a success flag, after they are processed through the business logic. This whole process is part of a Spark DStream, and I am using the below piece of code for that:
val hbaseTable = "table"
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "server",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "topic",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))
val topic = Array("topicName")
val ssc = new StreamingContext(sc, Seconds(40))
val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topic, kafkaParams))
stream.foreachRDD((rdd, batchTime) => {
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach(offset => println(offset.topic, offset.partition,
    offset.fromOffset, offset.untilOffset))
  rdd.map(value => (value.value())).saveAsTextFile("path")
  println("Saved Data into file")
  var commits: OffsetCommitCallback = null
  rdd.foreachPartition(message => {
    val hbaseConf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(hbaseConf)
    val table = conn.getTable(TableName.valueOf(hbaseTable))
    commits = new OffsetCommitCallback() {
      def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                     exception: Exception) {
        message.foreach(value => {
          val key = value.key()
          val offset = value.offset()
          println(s"offset is: $offset")
          val partitionId = TaskContext.get.partitionId()
          println(s"partitionID is: $partitionId")
          val rowKey = key
          val put = new Put(rowKey.getBytes)
          if (exception != null) {
            println("Got Error Message:" + exception.getMessage)
            put.addColumn("o".getBytes, "flag".toString.getBytes(), "Error".toString.getBytes())
            put.addColumn("o".getBytes, "error_message".toString.getBytes(), exception.getMessage.toString.getBytes())
            println("Got Error Message:" + exception.getMessage)
            table.put(put)
          } else {
            put.addColumn("o".getBytes, "flag".toString.getBytes(), "Success".toString.getBytes())
            table.put(put)
            println(offsets.values())
          }
        })
        println("Inserted into HBase")
      }
    }
    table.close()
    conn.close()
  })
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, commits)
})
ssc.start()
This code executes successfully. However, it neither saves data into HBase nor produces logs at the executor level (which I am printing while iterating over each partition of the RDD). Not sure what exactly I am missing here. Any help would be highly appreciated.

How to iterate over key values of a Kafka Streams Table

I'm new to Kafka Streams and I tried to iterate over the items in a Kafka Streams table via the keyValueStore:
The idea is to create a KTable (I've also tried with a GlobalKTable) with a KeyValueStore.
Then a separate thread is in charge of reading the content of the KeyValueStore in order to iterate over the last value of each key.
val streamProperties: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getStringList("kafka.bootstrap.servers").toList.mkString(","))
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray.getClass.getName)
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p
}
val builder: StreamsBuilder = new StreamsBuilder()
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore
val globalTable = builder.table("test",
  Materialized
    .as[String, Array[Byte], KeyValueStore[org.apache.kafka.common.utils.Bytes, Array[Byte]]]("TestStore")
    .withCachingDisabled()
    .withKeySerde(Serdes.String())
    .withValueSerde(Serdes.ByteArray())
)
val streams: KafkaStreams = new KafkaStreams(builder.build(), streamProperties)
streams.start()
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
  def run() = {
    println("\n\n\n tick \n\n\n")
    try {
      val keyValueStore = streams.store(globalTable.queryableStoreName(), QueryableStoreTypes.keyValueStore())
      keyValueStore.all().toIterator.map { k =>
        print(k.key)
      }
    } catch {
      case _ => println("error")
    }
  }
}
val f = ex.scheduleAtFixedRate(task, 1, 10, TimeUnit.SECONDS)
}
}
In the thread, the keyValueStore stays empty even when I produce items on the topic "test".
Is there something I missed or didn't understand?
One thing missing is the state directory location config:
p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")
Without it, Kafka Streams would not throw an exception, but stateful things like global KTables would silently not work.

TopologyException on left join

I'm trying to do a simple stream.leftJoin(table) but get the following exception at runtime:
TopologyException: Invalid topology: StateStore null is not added yet
This is roughly what my code looks like; I commented out the implementation details to keep it short:
val streamsConfiguration: Properties = {
  val p = new Properties()
  // api config
  p.put(StreamsConfig.APPLICATION_ID_CONFIG /**/)
  p.put(StreamsConfig.CLIENT_ID_CONFIG /**/)
  // kafka broker
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // local state store
  p.put(StreamsConfig.STATE_DIR_CONFIG, "./streams-state")
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  // serdes
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, classOf[StringSerde])
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, classOf[StringSerde])
  p
}
val builder = new StreamsBuilderS()
val rawInfoTable: KTableS[String, String] = builder.table("station_info")
val infoTable: KTableS[String, StationInfo] = rawInfoTable.mapValues { jsonString =>
  /** */
}.filter(/** */)
  .mapValues((/** */)
val rawStatusStream: KStreamS[String, String] = builder.stream("station_status")
val statusStream: KStreamS[String, StationStatus] = rawStatusStream.flatMapValues { jsonString =>
  /** */
}
val outputStream: KStreamS[String, String] = statusStream
  .leftJoin(infoTable, calculateStats)
  .filter((_, availability) => {
    /** */
  })
  .map((stationId: String, availability) => {
    /** */
  })
outputStream.to("low_availability")
val streams = new KafkaStreams(builder.build(), streamsConfiguration)
streams.cleanUp()
streams.start()
I even tried to manually add a StateStore via:
val store = Stores.inMemoryKeyValueStore("my-store")
val storeBuilder = Stores.keyValueStoreBuilder(store, new StringSerde(), new StringSerde())
val builder = new StreamsBuilderS()
builder.addStateStore(storeBuilder)
But it doesn't seem to change anything. I'm using the Kafka Streams wrapper from Lightbend: "com.lightbend" %% "kafka-streams-scala" % "0.2.1"
All the examples I checked don't seem to care about adding a state store, so I'm somewhat confused. Can somebody point me in the right direction? Does this have something to do with the STATE_DIR_CONFIG? Or with the Kafka cluster I'm running locally?

How to use flatMapValues on Kafka

I am getting an error when using flatMapValues in Scala with the Kafka library. Here is my code:
val builder: KStreamBuilder = new KStreamBuilder()
val textLines: KStream[String, String] = builder.stream("streams-plaintext-input")
import collection.JavaConverters.asJavaIterableConverter
val wordCounts: KTable[String, JLong] = textLines
  .flatMapValues(textLine => textLine.toLowerCase.split("\\W+").toIterable.asJava)
  .groupBy((_, word) => word)
  .count("word-counts")
and I am getting the error "missing parameter type" for textLine inside flatMapValues. If I replace it with flatMapValues((textLine: String) => textLine.toLowerCase.split("\\W+").toIterable.asJava) it still does not work.
Does anyone have an idea?
Thanks, Felipe
Working with Scala 2.12.4, I solved it like this:
val props = new Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
val stringSerde: Serde[String] = Serdes.String()
val longSerde: Serde[Long] = Serdes.Long()
val builder = new StreamsBuilder()
val textLines: KStream[String, String] = builder.stream("streams-plaintext-input")
val topology: Topology = builder.build()
println(topology.describe())
val wordCounts: KTable[String, Long] = textLines
  .flatMapValues { textLine =>
    println(textLine)
    println(topology.describe())
    textLine.toLowerCase.split("\\W+").toIterable.asJava
  }
  .groupBy((_, word) => word)
  // this is a stateful computation config to the topology
  .count("word-counts")
wordCounts.to(stringSerde, longSerde, "streams-wordcount-output")
val streams = new KafkaStreams(topology, props)
streams.start()

How to Test Kafka Consumer

I have a Kafka consumer (built in Scala) which extracts the latest records from Kafka. The consumer looks like this:
val consumerProperties = new Properties()
consumerProperties.put("bootstrap.servers", "localhost:9092")
consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("group.id", "something")
consumerProperties.put("auto.offset.reset", "latest")
val consumer = new KafkaConsumer[String, String](consumerProperties)
consumer.subscribe(java.util.Collections.singletonList("topic"))
Now, I want to write an integration test for it. Is there any way, or any best practice, for testing Kafka consumers?
You need to start ZooKeeper and Kafka programmatically for integration tests.
1.1 Start ZooKeeper (ZooKeeperServer)
def startZooKeeper(zooKeeperPort: Int, zkLogsDir: Directory): ServerCnxnFactory = {
  val tickTime = 2000
  val zkServer = new ZooKeeperServer(zkLogsDir.toFile.jfile, zkLogsDir.toFile.jfile, tickTime)
  val factory = ServerCnxnFactory.createFactory
  factory.configure(new InetSocketAddress("0.0.0.0", zooKeeperPort), 1024)
  factory.startup(zkServer)
  factory
}
1.2 Start Kafka (KafkaServer)
case class StreamConfig(streamTcpPort: Int = 9092,
                        streamStateTcpPort: Int = 2181,
                        stream: String,
                        numOfPartition: Int = 1,
                        nodes: Map[String, String] = Map.empty)

def startKafkaBroker(config: StreamConfig,
                     kafkaLogDir: Directory): KafkaServer = {
  val syncServiceAddress = s"localhost:${config.streamStateTcpPort}"
  val properties: Properties = new Properties
  properties.setProperty("zookeeper.connect", syncServiceAddress)
  properties.setProperty("broker.id", "0")
  properties.setProperty("host.name", "localhost")
  properties.setProperty("advertised.host.name", "localhost")
  properties.setProperty("port", config.streamTcpPort.toString)
  properties.setProperty("auto.create.topics.enable", "true")
  properties.setProperty("log.dir", kafkaLogDir.toAbsolute.path)
  properties.setProperty("log.flush.interval.messages", 1.toString)
  properties.setProperty("log.cleaner.dedupe.buffer.size", "1048577")
  config.nodes.foreach {
    case (key, value) => properties.setProperty(key, value)
  }
  val broker = new KafkaServer(new KafkaConfig(properties))
  broker.startup()
  println(s"KafkaStream Broker started at ${properties.get("host.name")}:${properties.get("port")} at ${kafkaLogDir.toFile}")
  broker
}
2. Emit some events to the stream using a KafkaProducer (a sketch follows these steps).
3. Then consume them with your consumer to test and verify that it is working.
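For step 2, a minimal sketch of emitting a test event with a plain KafkaProducer; the broker address localhost:9092 and the topic name "topic" are assumptions taken from the snippets above, so adjust them to your setup:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Emit one String record to the embedded broker started above.
val producerProperties = new Properties()
producerProperties.put("bootstrap.servers", "localhost:9092")
producerProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProperties)
// send() is asynchronous; get() blocks until the broker acknowledges the record
producer.send(new ProducerRecord[String, String]("topic", "some-key", """{"some" : "event"}""")).get()
producer.close()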
You can use scalatest-eventstream, which has a startBroker method that will start ZooKeeper and Kafka for you.
It also has destroyBroker, which will clean up your Kafka after the tests.
E.g.:
class MyStreamConsumerSpecs extends FunSpec with BeforeAndAfterAll with Matchers {
  implicit val config =
    StreamConfig(streamTcpPort = 9092, streamStateTcpPort = 2181, stream = "test-topic", numOfPartition = 1)
  val kafkaStream = new KafkaEmbeddedStream

  override protected def beforeAll(): Unit = {
    kafkaStream.startBroker
  }

  override protected def afterAll(): Unit = {
    kafkaStream.destroyBroker
  }

  describe("Kafka Embedded stream") {
    it("does consume some events") {
      // uses application.properties
      // emitter.broker.endpoint=localhost:9092
      // emitter.event.key.serializer=org.apache.kafka.common.serialization.StringSerializer
      // emitter.event.value.serializer=org.apache.kafka.common.serialization.StringSerializer
      kafkaStream.appendEvent("test-topic", """{"MyEvent" : { "myKey" : "myValue"}}""")
      val consumerProperties = new Properties()
      consumerProperties.put("bootstrap.servers", "localhost:9092")
      consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("group.id", "something")
      consumerProperties.put("auto.offset.reset", "earliest")
      val myConsumer = new KafkaConsumer[String, String](consumerProperties)
      myConsumer.subscribe(java.util.Collections.singletonList("test-topic"))
      val events = myConsumer.poll(2000)
      events.count() shouldBe 1
      events.iterator().next().value() shouldBe """{"MyEvent" : { "myKey" : "myValue"}}"""
      println("=================" + events.count())
    }
  }
}