How to use a persisted StateStore between two Kafka Streams - apache-kafka

I'm having trouble achieving the following with Kafka Streams:
1. At application startup, the (compacted) topic alpha is loaded into a key-value StateStore (the map).
2. A Kafka Streams topology consumes from another topic, reads from the map above (via .get), and finally produces a new record into topic alpha.
The result is that the in-memory map should stay aligned with the underlying topic, even if the streamer gets restarted.
My approach is the following:
val builder = new StreamsBuilderS()
val store = Stores.keyValueStoreBuilder(
  Stores.persistentKeyValueStore("store"), kSerde, vSerde
)
builder.addStateStore(store)
val loaderStreamer = new LoaderStreamer(store).startStream()
[...] // I wait a few seconds until the loading is complete and the stream is running
val map = instance.store("store", QueryableStoreTypes.keyValueStore[K, V]()) // !!!!!!!! ERROR HERE !!!!!!!!
builder
  .stream("another-topic")(Consumed.`with`(kSerde, vSerde))
  .doMyAggregationsAndgetFromTheMapAbove
  .transform(() => new StoreTransformer[K, V]("store"), "store")
  .to("alpha")(Produced.`with`(kSerde, vSerde))
LoaderStreamer(store):
[...]
val builder = new StreamsBuilderS()
builder.addStateStore(store)
builder
.table("alpha")(Consumed.`with`(kSerde, vSerde))
builder.build
[...]
StoreTransformer:
[...]
override def init(context: ProcessorContext): Unit = {
this.context = context
this.store =
context.getStateStore(store).asInstanceOf[KeyValueStore[K, V]]
}
override def transform(key: K, value: V): (K, V) = {
store.put(key, value)
(key, value)
}
[...]
...but what I get is:
Caused by: org.apache.kafka.streams.errors.InvalidStateStoreException:
The state store, store, may have migrated to another instance.
while trying to get the store handler.
Any idea on how to achieve this?
Thank you!

You can't share a state store between two Kafka Streams applications.
According to the documentation (https://docs.confluent.io/current/streams/faq.html#interactive-queries), there are two possible reasons for the exception above:
The local KafkaStreams instance is not yet ready and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready, but the particular state store was just migrated to another instance behind the scenes.
The easiest way to deal with it is to wait until the state store becomes queryable:
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
The whole example can be found in Confluent's GitHub repository.
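The loop above retries forever, which can hang the caller if the store never becomes queryable (for example, when the store name does not match anything in that KafkaStreams instance's topology). A bounded variant might look like the sketch below. To keep it self-contained it uses only the JDK: `BoundedRetry`, `retry`, and the timeout values are hypothetical, and the `Callable` stands in for `streams.store(storeName, queryableStoreType)`. In a real application you would catch `InvalidStateStoreException` rather than `RuntimeException`.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

final class BoundedRetry {

    /**
     * Calls {@code attempt} until it returns without throwing a RuntimeException,
     * sleeping {@code backoff} between tries, and gives up once {@code timeout} has elapsed.
     */
    static <T> T retry(Callable<T> attempt, Duration timeout, Duration backoff) throws Exception {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return attempt.call();
            } catch (RuntimeException notReadyYet) {
                // Assumption: with Kafka Streams this would be InvalidStateStoreException.
                if (System.nanoTime() - deadline >= 0) {
                    TimeoutException timedOut =
                            new TimeoutException("store not queryable within " + timeout);
                    timedOut.initCause(notReadyYet);
                    throw timedOut;
                }
                Thread.sleep(backoff.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for streams.store(...): fails twice, then succeeds.
        String store = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("not ready");
            }
            return "store-handle";
        }, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(store + " after " + calls[0] + " attempts");
    }
}
```

Surfacing a timeout instead of spinning makes a misconfigured store name show up as an error rather than as an application that never finishes starting.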

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink.
I want to enrich the incoming click stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
The IP location data is in a CSV file; a simplified slice is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I did some research and found a few potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream, using some IP matching logic.
1. Result: It worked well with a few sample IP location rows but not with the whole CSV. The JVM heap reached 3.5 GB, and because it is broadcast state, there is no way to keep it on disk (in RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: The enrichment data is too big to hold in the JVM heap, so it's impossible to load it into ValueState. Also, de/serializing everything through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, using MapState.
3. Result: Trying to load the CSV file into MapState in the open() method gave me an error saying that you cannot put into MapState in open(), because there is no keyed context there, as in this question: Flink keyed stream key is null
4. Solution: Because MapState needs a keyed context (to put into RocksDB), I tried to load the whole CSV file into the local RocksDB instance (disk) in the process function, after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
var ipMapState: MapState[String, String] = _
var csvFinishedFlag: ValueState[Boolean] = _
override def processElement(event: Event,
ctx: KeyedProcessFunction[Long, Event, Event]#Context,
out: Collector[Event]): Unit = {
val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
ipMapState = getRuntimeContext.getMapState(ipDescriptor)
csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
if (!csvFinishedFlag.value()) {
val csv = new CSVParser(defaultCSVFormat)
val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
for (row <- fileSource.getLines()) {
val Some(List(start, end, country)) = csv.parseLine(row)
ipMapState.put(start, country)
}
fileSource.close()
csvFinishedFlag.update(true)
}
out.collect {
if (ipMapState.contains(event.userIp)) {
val details = ipMapState.get(event.userIp)
event.copy(data =
event.data.copy(
ipLocation = Some(details) // details is already the country string stored in ipMapState
))
} else {
event
}
}
}
}
4. Result: It's too hacky and it stalls event processing because of the blocking file read.
Could you tell me what I can do in this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
    @Override
    public int partition(final Long sensorMeasurement, final int numPartitions) {
        return Math.toIntExact(sensorMeasurement % numPartitions);
    }
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {

    private Map<Long, SensorReferenceData> referenceData;

    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
    }

    @Override
    public void flatMap(
            final SensorMeasurement sensorMeasurement,
            final Collector<EnrichedMeasurements> collector) throws Exception {
        SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
        collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
    }

    private Map<Long, SensorReferenceData> loadReferenceData(
            final int partition,
            final int numPartitions) {
        SensorReferenceDataClient client = new SensorReferenceDataClient();
        return client.getSensorReferenceDataForPartition(partition, numPartitions);
    }
}
Note that the enrichment is not being done on a keyed stream, so you cannot use keyed state or timers in the enrichment function.
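One detail from the question carries over to whichever approach is used: the CSV rows are IP ranges (start_ip, end_ip), so an exact-key lookup on userIp only hits when the address happens to equal a range start. Whatever slice of the data each instance preloads, a sorted map keyed by the numeric range start supports range lookups. The sketch below is plain Java with hypothetical names (`IpRangeIndex`, `add`, `lookup`); it assumes IPv4 addresses and non-overlapping ranges, and uses the two sample rows from the question.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class IpRangeIndex {

    /** One CSV row: the inclusive numeric end of the range and its country. */
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Keyed by the numeric start of each range, so floorEntry finds the only candidate range.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    /** Converts dotted IPv4 text such as "1.1.1.1" into its unsigned 32-bit value. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    void add(String startIp, String endIp, String country) {
        byStart.put(toLong(startIp), new Range(toLong(endIp), country));
    }

    /** Returns the country whose range contains the address, if any. */
    Optional<String> lookup(String ip) {
        long address = toLong(ip);
        Map.Entry<Long, Range> candidate = byStart.floorEntry(address);
        if (candidate == null || address > candidate.getValue().end) {
            return Optional.empty();
        }
        return Optional.of(candidate.getValue().country);
    }

    public static void main(String[] args) {
        IpRangeIndex index = new IpRangeIndex();
        index.add("1.1.1.1", "100.100.100.100", "United States of America");
        index.add("100.100.100.101", "200.200.200.200", "China");
        System.out.println(index.lookup("8.8.8.8"));
        System.out.println(index.lookup("150.0.0.1"));
        System.out.println(index.lookup("0.0.0.1"));
    }
}
```

Each lookup is a single `floorEntry` call, so its cost grows logarithmically with the number of ranges held by the instance.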

How to access a KStreams Materialized State Store from another Stream Processor

I need to be able to remove a record from a KTable from a separate stream processor. Today I'm using aggregate() and passing a materialized state store. In a separate processor that reads from a "termination" topic, I'd like to query that materialized state store, either in a .transform() or in a different .aggregate(), and 'remove' that key/value. Every time I try to access the materialized state from a separate stream processor, it tells me the store isn't added to the topology; when I then add it and run again, it tells me it has already been registered and errors out.
builder.stream("topic1").map().groupByKey().aggregate(() -> null,
(aggKey, newValue, aggValue) -> {
//add to the Ktable
return newValue;
},
stateStoreMaterialized);
and in a separate stream I want to delete a key from that stateStoreMaterialized
builder.stream("topic2")
.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
stateStoreDeleteTransformer will query the key and delete it.
//in ctor
KeyValueBytesStoreSupplier stateStoreSupplier = Stores.persistentKeyValueStore("store1");
stateStoreMaterialized = Materialized.<String, MyObj>as(stateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(mySerDe);
I don't have a terminal flag on my topic1 stream object value that can trigger a deletion. It has to come from another stream/topic.
When I try to use the same Materialized store on two separate stream processors, I get:
Invalid topology: Topic STATE_STORE-repartition has already been registered by another source.
at org.springframework.kafka.config.StreamsBuilderFactoryBean.start(StreamsBuilderFactoryBean.java:268)
Edit:
This is the 1st error I receive.
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORMVALUES-0000000012 has no access to StateStore store1 as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:104)
at org.apache.kafka.streams.processor.internals.ForwardingDisabledProcessorContext.getStateStore(ForwardingDisabledProcessorContext.java:85)
So then I do this:
stateStoreSupplier = Stores.persistentKeyValueStore(STATE_STORE_NAME);
storeStoreBuilder = Stores.keyValueStoreBuilder(stateStoreSupplier, Serdes.String(), jsonSerDe);
stateStoreMaterialized = Materialized.as(stateStoreSupplier);
Then I get this error:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore 'state-store' is already added.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:520)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:512)
Here's the code that fixed my issue. As it turns out, order matters when building the streams. I had to set up the materialized store first and then, in subsequent lines of code, set up the transformer.
/**
* Create the streams using the KStreams DSL - a method to configure the stream and add any state stores.
*/
@Bean
public KafkaStreamsConfig setup() {
final JsonSerDe<Bus> ltaSerde = new JsonSerDe<>(Bus.class);
final StudentSerde<Student> StudentSerde = new StudentSerde<>();
//start lta stream
KStream<String, Bus> ltaStream = builder
.stream(ltaInputTopic, Consumed.with(Serdes.String(), ltaSerde));
final KStream<String, Student> statusStream = this.builder
.stream(this.locoStatusInputTopic,
Consumed.with(Serdes.String(),
StudentSerde));
//create lta store
KeyValueBytesStoreSupplier ltaStateStoreSupplier = Stores.persistentKeyValueStore(LTA_STATE_STORE_NAME);
final Materialized<String, Bus, KeyValueStore<Bytes, byte[]>> ltaStateStoreMaterialized =
Materialized.
<String, Bus>as(ltaStateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(ltaSerde);
KTable<String, Bus> ltaStateProcessor = ltaStream
//map and convert lta stream into Loco / LTA key value pairs
.groupByKey(Grouped.with(Serdes.String(), ltaSerde))
.aggregate(
//The 'aggregate' and 'reduce' functions ignore messages with null values FYI.
// so if the value after the groupbykey produces a null value, it won't be removed from the state store.
//which is why it's very important to send a message with some terminal flag indicating this value should be removed from the store.
() -> null, /* initializer */
(aggKey, newValue, aggValue) -> {
if (null != newValue.getAssociationEndTime()) { //if there is an endTime associated to this train/loco then remove it from the ktable
logger.trace("removing LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return null; //Returning null removes the record from the state store as well as its changelog topic. re: https://objectpartners.com/2019/07/31/slimming-down-your-kafka-streams-data/
}
logger.trace("adding LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return newValue;
}, /* adder */
ltaStateStoreMaterialized
);
// don't need builder.addStateStore(keyValueStoreStoreBuilder); and CANT use it
// because the ltaStateStoreMaterialized will already be added to the topology in the KTable aggregate method above.
// The below transformer can use the state store because it's already added (apparently) by the aggregate method.
// Add the KTable processors first, then if there are any transformers that need to use the store, add them after the KTable aggregate method.
statusStream.map((k, v) -> new KeyValue<>(v.getLocoId(), v))
.transform(locoStatusTransformerSupplier, ltaStateStoreSupplier.name())
.to("testing.outputtopic", Produced.with(Serdes.String(), StudentSerde));
return this; //can return anything except for void.
}
Do stateStoreMaterialized and stateStoreSupplier.name() have the same name?
You have an error in your topology:
KStream.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
You have to supply a new instance of StateStoreDeleteTransformer per ProcessorContext in the TransformerSupplier, like this:
KStream.transform(StateStoreDeleteTransformer::new, stateStoreSupplier.name())
or
KStream.transform(() -> StateStoreDeleteTransformerSupplier.get(), stateStoreSupplier.name()) // StateStoreDeleteTransformerSupplier returns a new instance of StateStoreDeleteTransformer
In stateStoreDeleteTransformer, how do you intend to use stateStoreMaterialized inside the transformer directly?
I have a similar use case and I am using a KeyValueStore<String, MyObj>:
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<String, MyObj>) context.getStateStore("store1");
}

Spark Structured Streaming: Running Kafka consumer on separate worker thread

So I have a Spark application that needs to read two streams from two kafka clusters (Kafka A and B) using structured streaming, and do some joins and filtering on the two streams. So is it possible to have a Spark job that reads stream from A, and also run a thread (Called consumer) on each worker that reads Kafka B and put data into a map. So later when we are filtering we can do something like stream.filter(row => consumer.idNotInMap(row.id))?
I have some questions regarding this approach:
If this approach works, will it cause any problems when the application is run on a cluster?
Will all consumer instances on each worker receive the same data in cluster mode? Or can we even let each consumer listen only on the Kafka partition for that worker node (which is probably controlled by Spark)?
How will the consumer instance get serialized and passed to the workers?
Currently they are initialized on the driver node, but is there some way to initialize it once for each worker node?
I feel like in my case I should use stream joining instead. I've already tried that and it didn't work, which is why I am taking this approach. It didn't work because the stream from Kafka A is append-only and stream B needs to have a state that can be updated, which makes it update-only. Joining streams in append and update mode is not supported in Spark.
Here is some pseudo-code:
// SparkJob.scala
val consumer = Consumer()
val getMetadata = udf(id => consumer.get(id))
val enrichedDataSet = stream.withColumn("metadata", getMetadata(stream("id")))
// Consumer.java
class Consumer implements Serializable {
private final ConcurrentHashMap<Integer, String> metadata;
public Consumer() {
metadata = new ConcurrentHashMap<>();
// read stream
listen();
}
// process kafka data inside this loop
private void listen() {
Thread t = new Thread(() -> {
KafkaConsumer consumer = ...;
while (consumer.hasNext()) {
var message = consumer.next();
// update metadata or put in new metadata
metadata.put(message.id, message);
}
});
t.start();
}
public String get(Integer key) {
return metadata.get(key);
}
}

Kafka Streams NPE in MeteredKeyValueStore

I'm trying to run a very basic stream using the Processor API in Scala.
class KafkaProcessor extends Processor[String, GenericRecord] {
private var kvStore: KeyValueStore[String, GenericRecord] = _
override def init(processorContext: ProcessorContext): Unit = {
this.kvStore = Stores
.keyValueStoreBuilder(
Stores.persistentKeyValueStore("random-mame"),
Serdes.String,
new GenericAvroSerde
)
}
override def process(
key: String,
value: GenericRecord
): Unit = {
val currentState = Option(kvStore.get(key)) // NPE
...
}
}
Judging from the error logs, an internal NPE is thrown:
Exception in thread "test-4294024b-1390-4c2f-ba8e-e520cca728ff-StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:134)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:64)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:35)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:115)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:146)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:93)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:351)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:104)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:413)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:862)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
It is related to the getTime inside of the MeteredKeyValueStore. I'm not sure how this happens or how I can prevent it.
If you want to use a store, you need to declare the store outside of the processor (i.e., add the store to the StreamsBuilder) and connect the store (via the StreamsBuilder) to the processor.
Within the processor, you use the ProcessorContext to get a handle on the store.
See the docs for more details: https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html

Why am I occasionally getting a InvalidStateStoreException PARTITIONS_REVOKED, not RUNNING when retrieving a store to query it?

I am accessing a state store to query it and have had to wrap the store() statement with a try/catch block to retry it because sometimes I am getting this exception:
org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get state store customers-store because the stream thread is PARTITIONS_REVOKED, not RUNNING
at org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:49)
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:57)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1053)
at com.codependent.kafkastreams.customer.service.CustomerService.getCustomer(CustomerService.kt:75)
at com.codependent.kafkastreams.customer.service.CustomerServiceKt.main(CustomerService.kt:108)
This is the code used to retrieve the store (the full code is on a github repo):
fun getCustomer(id: String): Customer? {
var keyValueStore: ReadOnlyKeyValueStore<String, Customer>? = null
while(keyValueStore == null) {
try {
keyValueStore = streams.store(CUSTOMERS_STORE, QueryableStoreTypes.keyValueStore<String, Customer>())
} catch (ex: InvalidStateStoreException) {
ex.printStackTrace()
}
}
val customer = keyValueStore.get(id)
return customer
}
And this is the main program:
fun main(args: Array<String>) {
val customerService = CustomerService("main", "localhost:9092")
customerService.initializeStreams()
customerService.createCustomer(Customer("53", "Joey"))
val customer = customerService.getCustomer("53")
println(customer)
customerService.stopStreams()
}
The exception happens randomly when I run the program several times in a row, after the previous executions have finished. Note: I don't do anything to the running Kafka cluster and use its default config.
At the time you are accessing the store, the Kafka Streams application is going through a rebalance, and state stores aren't accessible at that time. You want to make sure you only query the stores when the application state is RUNNING and not REBALANCING.
What you could do is check the state of the application before attempting to read from the store like this:
if(streams.state() == State.RUNNING) {
keyValueStore = streams.store(...);
val customer = keyValueStore.get(id);
return customer;
}
There is also a KafkaStreams.setStateListener method you can use to register a KafkaStreams.StateListener implementation. The StateListener.onChange method is called each time the application changes its state.
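As a rough illustration of the listener approach, the sketch below keeps the latest state in a small gate object that query code can wait on. It is JDK-only so it runs on its own: `ReadinessGate`, `onChange`, and `awaitRunning` are hypothetical names, and states are passed as plain strings. In a real application the gate would presumably be wired up with something like `streams.setStateListener((newState, oldState) -> gate.onChange(newState.name()))` before calling `streams.start()`.

```java
import java.util.concurrent.TimeUnit;

final class ReadinessGate {

    private String state = "CREATED";

    /** Intended to be called from the state listener on every transition. */
    synchronized void onChange(String newState) {
        state = newState;
        notifyAll();
    }

    /** Blocks until the state is RUNNING or the timeout elapses; returns true only if RUNNING. */
    synchronized boolean awaitRunning(long timeoutMillis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!"RUNNING".equals(state)) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Releases the monitor while waiting, so onChange can run and wake us up.
            TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        ReadinessGate gate = new ReadinessGate();
        // Simulates the transitions a listener would report during startup.
        Thread transitions = new Thread(() -> {
            gate.onChange("REBALANCING");
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            gate.onChange("RUNNING");
        });
        transitions.start();
        System.out.println("ready: " + gate.awaitRunning(2000));
        transitions.join();
    }
}
```

The state can still change between the check and the actual query, so it makes sense to keep the try/catch around store() even with a gate like this.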