Kafka Streams NPE in MeteredKeyValueStore - apache-kafka

I'm trying to run a very basic stream using the Processor API in Scala.
class KafkaProcessor extends Processor[String, GenericRecord] {
private var kvStore: KeyValueStore[String, GenericRecord] = _
override def init(processorContext: ProcessorContext): Unit = {
this.kvStore = Stores
.keyValueStoreBuilder(
Stores.persistentKeyValueStore("random-mame"),
Serdes.String,
new GenericAvroSerde
)
}
override def process(
key: String,
value: GenericRecord
): Unit = {
val currentState = Option(kvStore.get(key)) // NPE
...
}
}
According to the error logs, an NPE is thrown internally:
Exception in thread "test-4294024b-1390-4c2f-ba8e-e520cca728ff-StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:134)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:64)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:35)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:115)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:146)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:93)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:351)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:104)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:413)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:862)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
It is related to the getTime call inside MeteredKeyValueStore. I'm not sure how this happens or how I can prevent it.

If you want to use a store, you need to declare the store outside of the processor (i.e., add the store to the StreamsBuilder) and connect the store (via StreamsBuilder) to the processor.
Within the processor you use the ProcessorContext to get a handle on the store.
See the docs for more details: https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html

Related

Kafka Streams: action on n-th event

I'm trying to find the best way to perform an action on every n-th event in Kafka Streams.
My case: I have an input stream with some Events. I have to filter them by eventType == login and, on each n-th login (let's say, every fifth) for the same accountId, send this Event to the output stream.
After some investigation and several attempts, I have the version of the code below (I'm using Kotlin).
data class Event(
val payload: Any = {},
val accountId: String,
val eventType: String = ""
)
// intermediate class to keep the key and value of the original event
data class LoginEvent(
val eventKey: String,
val eventValue: Event
)
fun process() {
val userLoginsStoreBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("logins"),
Serdes.String(),
Serdes.Integer()
)
val streamsBuilder = StreamsBuilder().addStateStore(userLoginsStoreBuilder)
val inputStream = streamsBuilder.stream<String, String>(inputTopic)
inputStream.map { key, event ->
KeyValue(key, json.readValue<Event>(event))
}.filter { _, event -> event.eventType == "login" }
.map { key, event -> KeyValue(event.accountId, LoginEvent(key, event)) }
.transform(
UserLoginsTransformer("logins", 5),
"logins"
)
.filter { _, value -> value }
.map { key, _ -> KeyValue(key.eventKey, json.writeValueAsString(key.eventValue)) }
.to("fifth_login", Produced.with(Serdes.String(), Serdes.String()))
...
}
class UserLoginsTransformer(private val storeName: String, private val loginsThreshold: Int = 5) :
TransformerSupplier<String, LoginEvent, KeyValue<LoginEvent, Boolean>> {
override fun get(): Transformer<String, LoginEvent, KeyValue< LoginEvent, Boolean>> {
return object : Transformer<String, LoginEvent, KeyValue< LoginEvent, Boolean>> {
private lateinit var store: KeyValueStore<String, Int>
@Suppress("UNCHECKED_CAST")
override fun init(context: ProcessorContext) {
store = context.getStateStore(storeName) as KeyValueStore<String, Int>
}
override fun transform(key: String, value: LoginEvent): KeyValue< LoginEvent, Boolean> {
val counter = (store.get(key) ?: 0) + 1
return if (counter == loginsThreshold) {
store.delete(key)
KeyValue(value, true)
} else {
store.put(key, counter)
KeyValue(value, false)
}
}
override fun close() {
}
}
}
}
My biggest concern is that the transform function is not thread-safe in my case. I've checked the implementation of the KV store that is used in my case, and it is a RocksDB store (non-transactional), so the value may be updated between reading and comparison and the wrong event will be sent to the output.
My other ideas:
Use materialized views as a store without a transformer, but I'm stuck with the implementation.
Create a custom persistent KV store that uses TransactionalRocksDB (not sure if it is worth it).
Create a custom persistent KV store that uses a ConcurrentHashMap inside (it may lead to high memory consumption given the many users we are expecting).
One more note: I'm using Spring Cloud Stream so maybe this framework has a built-in solution for my case but I didn't find it.
I would appreciate any suggestions. Thanks in advance.
There is no reason to be concerned. If you run with multiple threads, each thread will have its own RocksDB instance that stores one shard of the overall data (note that the overall state is sharded based on input topic partitions, and a single shard is never processed by different threads). Hence, your code will work correctly. The only thing you need to ensure is that the data is partitioned by accountId, so that login events of a single account go to the same shard.
If your input data is already partitioned by accountId when written into your input topic, you don't need to do anything. If not, and you can control the upstream application, it might be simplest to use a custom partitioner in the upstream application's producer to get the partitioning you need. If you can't change the upstream application, you would need to repartition the data after you have set the accountId as the new key, i.e., by doing a through() before you call transform().
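To make the sharding argument concrete, here is a minimal plain-Java sketch with no Kafka dependency. Everything in it is an assumption made for illustration: ShardedLoginCounter is a hypothetical class, each HashMap stands in for the local store of one shard, and routing by hashCode stands in for Kafka's partitioning (the default producer partitioner hashes the serialized key with murmur2 instead). It only shows that, once all events of an account land in the same shard, the read-increment-write sequence in transform() never competes with another shard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Kafka Streams API: each shard owns a private counter map,
// the way each stream task owns its own local store.
class ShardedLoginCounter {
    private final List<Map<String, Integer>> shards = new ArrayList<>();
    private final int threshold;

    ShardedLoginCounter(int numShards, int threshold) {
        this.threshold = threshold;
        for (int i = 0; i < numShards; i++) {
            shards.add(new HashMap<>());
        }
    }

    // Stand-in for partitioning by key: the same accountId always maps to the same shard.
    int shardFor(String accountId) {
        return Math.floorMod(accountId.hashCode(), shards.size());
    }

    // Mirrors the transform() logic above: true on every threshold-th login of an account.
    boolean onLogin(String accountId) {
        Map<String, Integer> store = shards.get(shardFor(accountId));
        int counter = store.getOrDefault(accountId, 0) + 1;
        if (counter == threshold) {
            store.remove(accountId);
            return true;
        }
        store.put(accountId, counter);
        return false;
    }

    public static void main(String[] args) {
        ShardedLoginCounter counter = new ShardedLoginCounter(4, 5);
        for (int i = 1; i <= 10; i++) {
            // emit=true is printed for login 5 and login 10 only
            System.out.println("login " + i + " -> emit=" + counter.onLogin("account-1"));
        }
    }
}
```

If events for one account could reach two different shards, each shard would keep its own partial count and the threshold would fire late or never, which is why the partitioning by accountId matters.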

How to dispose of database connections when Flink restarts

I use dbcp2.BasicDataSource as a database connection pool. The database is queried in a map function to get additional information about sensors. I found that when the Flink job restarts due to exceptions, the old DB connections remain active on the server side.
Flink version: 1.7
The BasicDataSource construction code is here:
object DbHelper extends Lazing with Logging {
private lazy val connectionPool: BasicDataSource = createDataSource()
private def createDataSource(): BasicDataSource = {
val conn_str = props.getProperty("db.url")
val conn_user = props.getProperty("db.user")
val conn_pwd = props.getProperty("db.pwd")
val initialSize = props.getProperty("db.initial.size", "3").toInt
val bds = new BasicDataSource
bds.setDriverClassName("org.postgresql.Driver")
bds.setUrl(conn_str)
bds.setUsername(conn_user)
bds.setPassword(conn_pwd)
bds.setInitialSize(initialSize)
bds
}
}
Change your map function to a RichMapFunction. Override the close() method of the RichMapFunction and put the code to close your database connection there. You should likely be putting the code to open the connection in the open() method as well.

How to use a persisted StateStore between two Kafka Streams

I'm having some trouble trying to achieve the following via Kafka Streams:
At the startup of the app, the (compacted) topic alpha gets loaded into a key-value StateStore map.
A Kafka Stream consumes from another topic, uses (.get) the map above and finally produces a new record into topic alpha.
The result is that the in-memory map should stay aligned with the underlying topic, even if the streamer gets restarted.
My approach is the following:
val builder = new StreamsBuilderS()
val store = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("store"), kSerde, vSerde
)
builder.addStateStore(store)
val loaderStreamer = new LoaderStreamer(store).startStream()
[...] // I wait a few seconds until the loading is complete and the stream is running
val map = instance.store("store", QueryableStoreTypes.keyValueStore[K, V]()) // !!!!!!!! ERROR HERE !!!!!!!!
builder
.stream("another-topic")(Consumed.`with`(kSerde, vSerde))
.doMyAggregationsAndgetFromTheMapAbove
.transform(() => new StoreTransformer[K, V]("store"), "store")
.to("alpha")(Produced.`with`(kSerde, vSerde))
LoaderStreamer(store):
[...]
val builder = new StreamsBuilderS()
builder.addStateStore(store)
builder
.table("alpha")(Consumed.`with`(kSerde, vSerde))
builder.build
[...]
StoreTransformer:
[...]
override def init(context: ProcessorContext): Unit = {
this.context = context
this.store =
context.getStateStore(store).asInstanceOf[KeyValueStore[K, V]]
}
override def transform(key: K, value: V): (K, V) = {
store.put(key, value)
(key, value)
}
[...]
...but what I get is:
Caused by: org.apache.kafka.streams.errors.InvalidStateStoreException:
The state store, store, may have migrated to another instance.
while trying to get the store handler.
Any idea on how to achieve this?
Thank you!
You can't share a state store between two Kafka Streams applications.
According to the documentation (https://docs.confluent.io/current/streams/faq.html#interactive-queries), there are two possible reasons for the above exception:
The local KafkaStreams instance is not yet ready and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready, but the particular state store was just migrated to another instance behind the scenes.
The easiest way to deal with it is to wait until the state store becomes queryable:
public static <T> T waitUntilStoreIsQueryable(final String storeName,
final QueryableStoreType<T> queryableStoreType,
final KafkaStreams streams) throws InterruptedException {
while (true) {
try {
return streams.store(storeName, queryableStoreType);
} catch (InvalidStateStoreException ignored) {
// store not yet ready for querying
Thread.sleep(100);
}
}
}
The whole example can be found on Confluent's GitHub.
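The loop above retries forever, which can hang the caller if the store never becomes available. A bounded variant might look like the following plain-Java sketch. It is an illustration under stated assumptions, not the Confluent example: BoundedRetry and retryUntil are hypothetical names, the Supplier stands in for the streams.store(storeName, queryableStoreType) call, and IllegalStateException stands in for InvalidStateStoreException so that the block compiles without Kafka on the classpath.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries a lookup until it succeeds or a deadline passes.
class BoundedRetry {
    static <T> T retryUntil(Supplier<T> lookup, long timeoutMs, long pauseMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                return lookup.get();
            } catch (IllegalStateException notReady) {
                // Stand-in for InvalidStateStoreException: the store is not queryable yet.
                if (System.currentTimeMillis() >= deadline) {
                    throw notReady; // give up instead of looping forever
                }
                try {
                    Thread.sleep(pauseMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated lookup that fails twice before returning a handle.
        String handle = retryUntil(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("store not ready");
            }
            return "store-handle";
        }, 1000, 10);
        System.out.println(handle + " after " + attempts[0] + " attempts");
        // prints "store-handle after 3 attempts"
    }
}
```

In a real application the catch clause would name InvalidStateStoreException and the supplier would wrap the actual store lookup; the timeout and pause values here are arbitrary.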

Why am I occasionally getting an InvalidStateStoreException PARTITIONS_REVOKED, not RUNNING when retrieving a store to query it?

I am accessing a state store to query it and have had to wrap the store() statement with a try/catch block to retry it because sometimes I am getting this exception:
org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get state store customers-store because the stream thread is PARTITIONS_REVOKED, not RUNNING
at org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:49)
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:57)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1053)
at com.codependent.kafkastreams.customer.service.CustomerService.getCustomer(CustomerService.kt:75)
at com.codependent.kafkastreams.customer.service.CustomerServiceKt.main(CustomerService.kt:108)
This is the code used to retrieve the store (the full code is on a github repo):
fun getCustomer(id: String): Customer? {
var keyValueStore: ReadOnlyKeyValueStore<String, Customer>? = null
while(keyValueStore == null) {
try {
keyValueStore = streams.store(CUSTOMERS_STORE, QueryableStoreTypes.keyValueStore<String, Customer>())
} catch (ex: InvalidStateStoreException) {
ex.printStackTrace()
}
}
val customer = keyValueStore.get(id)
return customer
}
And this is the main program:
fun main(args: Array<String>) {
val customerService = CustomerService("main", "localhost:9092")
customerService.initializeStreams()
customerService.createCustomer(Customer("53", "Joey"))
val customer = customerService.getCustomer("53")
println(customer)
customerService.stopStreams()
}
The exception happens randomly when I run the program several times, each time after the previous execution has finished. Note: I don't change anything on the running Kafka cluster and use its default config.
At the time you are accessing the store, the Kafka Streams application is going through a rebalance, and state stores aren't accessible at that time. You want to make sure you only query the stores when the application state is RUNNING and not REBALANCING.
What you could do is check the state of the application before attempting to read from the store like this:
if(streams.state() == State.RUNNING) {
keyValueStore = streams.store(...);
val customer = keyValueStore.get(id);
return customer;
}
There is also a KafkaStreams.setStateListener method you can use to register a KafkaStreams.StateListener implementation. The StateListener.onChange method is called each time the application changes its state.
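One way to use such a listener is to open a gate the first time the application reports RUNNING and have query code wait on that gate. The plain-Java sketch below models only that idea. RunningGate and AppState are hypothetical names: AppState stands in for KafkaStreams.State, and onChange has the same (newState, oldState) shape as StateListener.onChange. In a real application the listener would be registered with KafkaStreams.setStateListener before start() is called.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical gate that opens once the application has reached RUNNING.
class RunningGate {
    // Stand-in for KafkaStreams.State; only the values needed for this sketch.
    enum AppState { CREATED, REBALANCING, RUNNING }

    private final CountDownLatch running = new CountDownLatch(1);

    // Same parameter order as StateListener.onChange(newState, oldState).
    void onChange(AppState newState, AppState oldState) {
        if (newState == AppState.RUNNING) {
            running.countDown();
        }
    }

    // Returns true if RUNNING was observed within the timeout, false otherwise.
    boolean awaitRunning(long timeoutMs) {
        try {
            return running.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        RunningGate gate = new RunningGate();
        System.out.println(gate.awaitRunning(10)); // prints false, nothing has happened yet
        gate.onChange(AppState.REBALANCING, AppState.CREATED);
        gate.onChange(AppState.RUNNING, AppState.REBALANCING);
        System.out.println(gate.awaitRunning(10)); // prints true, RUNNING was observed
    }
}
```

A one-shot latch only covers startup. A later rebalance can still make the store unavailable between the state check and the store() call, so catching InvalidStateStoreException around the lookup remains useful.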

NullPointerException in Flink custom SourceFunction

I wanted to create a SourceFunction which reads a http stream.
I used ScalaJ which does what I want (it splits the incoming text by \n-s).
Obviously the code works outside Flink, but I get a NullPointerException every time I start it as a Flink job (sometimes immediately, sometimes 1-2 seconds later, after it has transmitted 1-2 elements). It looks as if the Http object has some problems.
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.io.Source.fromInputStream
import scalaj.http._
class HttpSource(url: String) extends SourceFunction[String] {
@volatile var isRunning = true
override def cancel(): Unit = isRunning = false
override def run(ctx: SourceFunction.SourceContext[String]): Unit =
httpStream(ctx.collect)
private def httpStream(f: String => Unit) = {
val request = Http(url)
request
.execute { inputStream =>
fromInputStream(inputStream)
.getLines()
.takeWhile(_ => isRunning)
.foreach(f)
}
}
}
Here's the exception I usually get:
(Sometimes it's a bit different, for example I tried to make the request value transient, then it's already null when it tries to refer to request)
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
at scala.io.BufferedSource.reader(BufferedSource.scala:24)
at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:25)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader$lzycompute(BufferedSource.scala:35)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader(BufferedSource.scala:33)
at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:62)
at scala.io.BufferedSource$BufferedLineIterator.<init>(BufferedSource.scala:67)
at scala.io.BufferedSource.getLines(BufferedSource.scala:86)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:21)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:19)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:388)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:380)
at scala.Option.getOrElse(Option.scala:121)
at scalaj.http.HttpRequest.toResponse(Http.scala:380)
at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:360)
at scalaj.http.HttpRequest.exec(Http.scala:335)
at scalaj.http.HttpRequest.execute(Http.scala:323)
at flinkextension.HttpSource.httpStream(HttpSource.scala:19)
at flinkextension.HttpSource.run(HttpSource.scala:14)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
Everything else seems to work fine when I don't use a streaming HTTP request: reading a file through the same InputStream type, a plain while loop over strings, or even single HTTP requests that aren't streaming.
I feel like I'm missing some theoretical background. Maybe Flink does something in the background that destroys the Http object or the InputStream, but I didn't find anything in the documentation.
UPDATE #1:
If I put a null check into the lambda, the job usually exits immediately, sometimes processes a few elements, and sometimes times out after hanging for a minute. Here's this version of the httpStream function:
private def httpStream(f: String => Unit) = {
val request = Http(url)
request
.execute { inputStream =>
if (inputStream == null) println("null inputstream")
else {
println("not null inputstream")
fromInputStream(inputStream)
.getLines()
.takeWhile(_ => isRunning)
.foreach(f)
}
}
}
UPDATE #2:
The code actually works in distributed mode and with StreamExecutionEnvironment.createLocalEnvironment()
I only experience the issue if I use start-local.sh and submit the jar to it.