KStream.processValues() - getting a null state store from FixedKeyProcessor - apache-kafka

I have the following topology, which uses the processValues() method to combine the Streams DSL with the Processor API. I'm adding a state store here.
KStream<String, SecurityCommand> securityCommands =
    builder.stream(
        "security-command",
        Consumed.with(Serdes.String(), JsonSerdes.securityCommand()));
StoreBuilder<KeyValueStore<String, UserAccountSnapshot>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("user-account-snapshot"),
        Serdes.String(),
        JsonSerdes.userAccountSnapshot());
builder.addStateStore(storeBuilder);
securityCommands
    .processValues(() -> new SecurityCommandProcessor(), Named.as("security-command-processor"), "user-account-snapshot")
    .processValues(() -> new UserAccountSnapshotUpdater(), Named.as("user-snapshot-updater"), "user-account-snapshot")
    .to("security-event", Produced.with(
        Serdes.String(),
        JsonSerdes.userAccountEvent()));
The SecurityCommandProcessor code follows:
class SecurityCommandProcessor implements FixedKeyProcessor<String, SecurityCommand, UserAccountEvent> {
    private KeyValueStore<String, UserAccountSnapshot> kvStore;
    private FixedKeyProcessorContext context;

    @Override
    public void init(FixedKeyProcessorContext context) {
        this.kvStore = (KeyValueStore<String, UserAccountSnapshot>) context.getStateStore("user-account-snapshot");
        this.context = context;
    }
    ...
}
The problem is that context.getStateStore("user-account-snapshot") returns null.
I tried nearly the same code with the deprecated transformValues(), and there I am able to get the state store. The problem only occurs with processValues(). Am I doing anything wrong?

The issue is that you're using a lambda for the FixedKeyProcessorSupplier. FixedKeyProcessorSupplier extends the ConnectedStoreProvider interface, whose stores() method returns null when it's not overridden, and a lambda cannot override it. When the processor needs access to a state store, you'll need to override stores().
So you'll need to provide a concrete instance of the processor supplier.
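To illustrate why a lambda falls short here, the following is a minimal stand-alone sketch that does not use Kafka at all. StoreProvider and ProcSupplier are hypothetical, simplified stand-ins for ConnectedStoreProvider and FixedKeyProcessorSupplier, and store names stand in for StoreBuilder instances. It only models the default-method behaviour described above.

```java
import java.util.Set;
import java.util.function.Supplier;

public class SupplierStoresSketch {

    // Hypothetical stand-in for ConnectedStoreProvider: stores() has a default
    // implementation that returns null (store names replace StoreBuilders here).
    public interface StoreProvider {
        default Set<String> stores() {
            return null;
        }
    }

    // Hypothetical stand-in for FixedKeyProcessorSupplier: get() is the only
    // abstract method, so a lambda can implement it but cannot override stores().
    public interface ProcSupplier<P> extends StoreProvider, Supplier<P> {
    }

    // A concrete supplier class can override both get() and stores().
    public static final class SnapshotSupplier implements ProcSupplier<String> {
        @Override
        public String get() {
            return "security-command-processor";
        }

        @Override
        public Set<String> stores() {
            return Set.of("user-account-snapshot");
        }
    }

    public static void main(String[] args) {
        ProcSupplier<String> fromLambda = () -> "security-command-processor";
        ProcSupplier<String> concrete = new SnapshotSupplier();

        // The lambda inherits the default, so it reports no stores at all.
        System.out.println("lambda stores:   " + fromLambda.stores());
        // The concrete class reports the store it needs.
        System.out.println("concrete stores: " + concrete.stores());
    }
}
```

In the real API, stores() returns a Set of StoreBuilder instances rather than names, so a concrete supplier would return the storeBuilder from the question there.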
Let me know how it goes.
HTH,
Bill

Related

Kafka state-store sharing across sub-topologies

I am trying to create a custom joining consumer to join multiple events.
I have created a topology which has four sub-topologies (subtopology-0, subtopology-1, subtopology-2, subtopology-3), not in the exact order described by topology.describe().
I have created a state store in each of three sub-topologies (subtopology-0, subtopology-1, subtopology-2) and am trying to attach all of these state stores to another processor using .connectProcessorAndStateStores("PROCESS2", "COUNTS"), as per the Kafka developer guide https://kafka.apache.org/0110/documentation/streams/developer-guide
Here is a code snippet showing how I am creating and attaching processors to the topology.
class StreamCustomizer implements KafkaStreamsInfrastructureCustomizer {
    public void someMethod(StreamsBuilder builder) {
        Topology topology = builder.build();
        topology.addProcessor("Processor1", new Processor() {...}, "state-store-1")
            .addStateStore(store1,..);
        topology.addProcessor("Processor2", new Processor() {...}, "state-store-1")
            .addStateStore(store1,..);
        topology.addProcessor("Processor3", new Processor() {...}, "state-store-1")
            .addStateStore(store1,..);
        topology.addProcessor("Processor4", new Processor4() {...}, "Processor1", "Processor2", "Processor3")
            .connectProcessorAndStateStores("Processor4", "state-store-1", "state-store-2", "state-store-3");
    }
}
This is how the processor is defined for all the sub-topologies described above:
new Processor() {
    private ProcessorContext context;
    private KeyValueStore<K, V> store;

    init(ProcessorContext context) {
        this.context = context;
        store = context.getStateStore("store-name");
    }
}
This is how processor 4 is written, with all the state stores retrieved from the context in the init method.
new Processor4() {
private KeyValueStore<K, V> store1;
private KeyValueStore<K, V> store2;
private KeyValueStore<K, V> store3;
}
I am observing strange behaviour: with the above code, store1, store2, and store3 are all re-initialized, and none of the keys stored in their respective sub-topologies (1, 2, 3) are preserved. However, the same code works, i.e., all state stores preserve the key-values stored in their respective sub-topologies, when the state stores are declared at class level.
class StreamCustomizer implements KafkaStreamsInfrastructureCustomizer {
private KeyValueStore <K, V> store1;
private KeyValueStore <K, V> store2;
private KeyValueStore <K, V> store3;
}
and then, in the processor implementation, just initialize the state store in the init method.
new Processor() {
    private ProcessorContext context;

    init(ProcessorContext context) {
        this.context = context;
        store1 = context.getStateStore("store-name-1");
    }
}
Can someone please help me find the reason, or tell me whether there is anything wrong in this topology? Also, I have read that a state store can be shared within the same sub-topology.
Hard to say (the code snippets are not really clear). However, if you share state, you effectively merge sub-topologies. Thus, if you do it correctly, you end up with a single sub-topology containing all your processors.
As long as you see 4 sub-topologies, the state stores are not shared yet, i.e., not connected correctly.
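The merging effect can be pictured as grouping connected nodes. The following is a minimal stand-alone sketch of that idea, not Kafka Streams code: SubTopologySketch and its methods are hypothetical, and the model ignores parent links between processors and only tracks processor-to-store connections.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SubTopologySketch {

    // Union-find parent pointers over node names (processors and stores alike).
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of the group that contains the node.
    private String find(String node) {
        parent.putIfAbsent(node, node);
        String root = node;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        return root;
    }

    /** Connects two nodes, merging the groups they belong to. */
    public void connect(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        parent.put(rootA, rootB);
    }

    /** Number of disjoint groups, i.e. modelled sub-topologies. */
    public int groupCount() {
        Set<String> roots = new HashSet<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            roots.add(find(node));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        SubTopologySketch topology = new SubTopologySketch();

        // Three processors, each connected only to its own store.
        topology.connect("Processor1", "state-store-1");
        topology.connect("Processor2", "state-store-2");
        topology.connect("Processor3", "state-store-3");
        System.out.println("groups before sharing: " + topology.groupCount());

        // A fourth processor connected to all three stores merges the groups.
        topology.connect("Processor4", "state-store-1");
        topology.connect("Processor4", "state-store-2");
        topology.connect("Processor4", "state-store-3");
        System.out.println("groups after sharing:  " + topology.groupCount());
    }
}
```

In this model, a processor that is really connected to all three stores leaves a single group. By analogy, if topology.describe() still lists four sub-topologies, the connections were not made.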

How should I define Flink's Schema to read Protocol Buffer data from Pulsar

I am using Pulsar-Flink to read data from Pulsar in Flink. I am having difficulty when the data's format is Protocol Buffer.
The example on the project's GitHub front page uses SimpleStringSchema. However, that does not seem to work with Protocol Buffers. Does anyone know how to deal with this data format? How should I define the schema?
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.setProperty("topic", "test-source-topic");
FlinkPulsarSource<String> source = new FlinkPulsarSource<>(serviceUrl, adminUrl, new SimpleStringSchema(), props);
DataStream<String> stream = see.addSource(source);
// chain operations on dataStream of String and sink the output
// end method chaining
see.execute();
FYI, I am writing Scala code, so an explanation in Scala (rather than Java) would be really helpful. That said, any kind of advice is welcome, including Java.
You should implement your own DeserializationSchema. Let's assume that you have a protobuf message Address and have generated the respective Java class. Then the schema should look like the following:
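import java.io.IOException;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

public class ProtoDeserializer implements DeserializationSchema<Address> {
    @Override
    public TypeInformation<Address> getProducedType() {
        return TypeInformation.of(Address.class);
    }

    // Parse the raw message bytes with the generated protobuf class.
    @Override
    public Address deserialize(byte[] message) throws IOException {
        return Address.parseFrom(message);
    }

    // The stream is unbounded, so no element marks its end.
    @Override
    public boolean isEndOfStream(Address nextElement) {
        return false;
    }
}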

Kafka Streams (Scala): Invalid topology: StateStore is not added yet

I have a topology where I have a stream A.
From that stream A, I create a WindowedStore S.
A --> [S]
Then I want the objects in A to be transformed depending on data in S, and I also want these transformed objects to reach the WindowStore logic (via transformValues).
For that, I create a Transformer, producing a stream A', and make the windowing aware of it (i.e., S is now built from A', not from A).
A -> A' --> [S]
^__read__|
But I cannot do that, because when I create the Topology, an exception is thrown:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore storeName is not added yet.
Is there a way to work this around? Is this a limitation?
Code example:
// A
val sessionElementsStream: KStream[K, SessionElement] = ...
// A'
val sessionElementsTransformed : KStream[K, SessionElementTransformed] = {
// Here we use the sessionStoreName - but it is not added yet to the Topology
sessionElementsStream.
transformValues(sessionElementTransformerSupplier, sessionStoreName)
}
val sessionElementsWindowedStream: SessionWindowedKStream[K, SessionElementTransformed] = {
sessionElementsTransformed.
groupByKey(sessionElementTransformedGroupedBy).
windowedBy(sessionWindows)
}
val sessionStore : KTable[Windowed[K], List[WindowedSession]] =
sessionElementsWindowedStream.aggregate(
initializer = List.empty[WindowedSession])(
aggregator = anAggregator, merger = aMerger)(materialized = getMaterializedMUPKSessionStore(sessionStoreName))
The original problem is that, depending on the values of previous sessions, I would like to change the sessions that come after them. But if I do this in a transformer after the sessioning, the transformed sessions can be changed and sent downstream, but they won't reflect their new state in S, so further requests to the store will see the old values.
Kafka Streams 2.1, Scala 2.12.4.
Co-partitioned topics.
UPDATE
There is a way to do this within the DSL, using an extra topic:
Send A' to this topic
Create builder.stream from this topic and build store from it.
Define Store before you define the transformation (so the transformation step can use the Store, because it is already defined before).
However, it sounds cumbersome to have to use an extra topic here. Is there no other, simpler way to solve it?
But I cannot do that, because when I create the Topology, an exception is thrown:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore storeName is not added yet.
It looks like you simply forgot to literally "add" the state store to your processing topology, and then attach ("make available") the state store to your Transformer.
Here's a code snippet that demonstrates this (sorry, in Java).
Adding the state store to your topology:
final StreamsBuilder builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, Long>> myStateStore =
    Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("my-state-store-name"),
            Serdes.String(),
            Serdes.Long())
        .withCachingEnabled();
builder.addStateStore(myStateStore);
Attaching the state store to your Transformer:
final KStream<String, Double> stream = builder.stream("your-input-topic", Consumed.with(Serdes.String(), Serdes.Double()));
final KStream<String, Long> transformedStream =
    stream.transform(new MyTransformer(myStateStore.name()), myStateStore.name());
And of course your Transformer must integrate the state store, with code like the following (this Transformer reads <String, Double> and writes <String, Long>).
class MyTransformer implements TransformerSupplier<String, Double, KeyValue<String, Long>> {
    private final String myStateStoreName;

    MyTransformer(final String myStateStoreName) {
        this.myStateStoreName = myStateStoreName;
    }

    @Override
    public Transformer<String, Double, KeyValue<String, Long>> get() {
        return new Transformer<String, Double, KeyValue<String, Long>>() {
            private KeyValueStore<String, Long> myStateStore;
            private ProcessorContext context;

            @Override
            public void init(final ProcessorContext context) {
                this.context = context;
                // Look up the store that was attached to this transformer by name.
                myStateStore = (KeyValueStore<String, Long>) context.getStateStore(myStateStoreName);
            }
            // ...
        };
    }
}

getting statestore data from called function in kafka streams

In Kafka Streams' Processor API, can I pass the processor context from init() to another function, as follows, and then get the state store back from the context in process()?
public void init(ProcessorContext context) {
this.context = context;
String resourceName = "config.properties";
ClassLoader loader = Thread.currentThread().getContextClassLoader();
Properties props = new Properties();
try(InputStream resourceStream = loader.getResourceAsStream(resourceName)) {
props.load(resourceStream);
}
catch(IOException e){
e.printStackTrace();
}
dataSplitter.timerMessageSource(props, context);//can I pass context like this?
this.context.schedule(1000);
// retrieve the key-value store named "patient"
kvStore = (KeyValueStore<String, PatientDataSummary>) this.context.getStateStore("patient");
//want to get the value of statestore filled by the called function timerMessageSource(), as the data to be put in statestore is getting generated in timerMessageSource()
//is there any way I can get that by using context or so
}
The usage of ProcessorContext is somewhat limited, and you cannot call every method it provides at arbitrary times. Thus, it depends on how you use it. In general, you can pass it around as you wish (it will always be the same object throughout the lifetime of the processor).
If I understand your question correctly, you register a punctuation and use your dataSplitter within the punctuation callback and want to modify the store. That is absolutely possible -- you can either put the store into a class member similar to what you do with the context or use the context object to get the store within the punctuate callback.
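The two options can be pictured with a minimal stand-alone sketch that does not use Kafka. Context, DataSplitter, and PatientProcessor are hypothetical stand-ins for ProcessorContext, the dataSplitter helper, and the processor from the question, and a plain map stands in for the key-value store. The sketch only shows that a helper given the context and a processor holding the store as a member see the same store.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedStoreSketch {

    // Hypothetical stand-in for ProcessorContext: hands out pre-registered stores by name.
    public static final class Context {
        private final Map<String, Map<String, String>> stores = new HashMap<>();

        public Context(String... storeNames) {
            for (String name : storeNames) {
                stores.put(name, new HashMap<>());
            }
        }

        // Returns null for a store that was never registered.
        public Map<String, String> getStateStore(String name) {
            return stores.get(name);
        }
    }

    // Hypothetical helper: option 2, look the store up through the context it was given.
    public static final class DataSplitter {
        public void fill(Context context) {
            context.getStateStore("patient").put("patient-1", "summary");
        }
    }

    public static final class PatientProcessor {
        private final DataSplitter dataSplitter = new DataSplitter();
        private Context context;
        private Map<String, String> kvStore;

        public void init(Context context) {
            this.context = context;
            // Option 1: keep the store in a class member.
            this.kvStore = context.getStateStore("patient");
        }

        // Stands in for the punctuation callback.
        public void punctuate() {
            dataSplitter.fill(context);
        }

        public String lookup(String key) {
            return kvStore.get(key);
        }
    }

    public static void main(String[] args) {
        PatientProcessor processor = new PatientProcessor();
        processor.init(new Context("patient"));
        processor.punctuate();
        // The member store sees what the helper wrote through the context.
        System.out.println(processor.lookup("patient-1"));
    }
}
```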

Kafka Streams dynamic routing (ProducerInterceptor might be a solution?)

I'm working with Apache Kafka and I've been experimenting with the Kafka Streams functionality.
What I'm trying to achieve is very simple, at least in words and it can be achieved easily with the regular plain Consumer/Producer approach:
Read from a dynamic list of topics
Do some processing on the message
Push the message to another topic whose name is computed based on the message content
Initially I thought I could create a custom Sink or inject some kind of endpoint resolver in order to programmatically define the topic name for each single message, although ultimately couldn't find any way to do that.
So I dug into the code and found the ProducerInterceptor class that is (quoting from the JavaDoc):
A plugin interface that allows you to intercept (and possibly mutate)
the records received by the producer before they are published to the
Kafka cluster.
And its onSend method:
This is called from KafkaProducer.send(ProducerRecord) and
KafkaProducer.send(ProducerRecord, Callback) methods, before key and
value get serialized and partition is assigned (if partition is not
specified in ProducerRecord).
It seemed like the perfect solution for me as I can effectively return a new ProducerRecord with the topic name I want.
However, there is apparently a bug (I've opened an issue on their JIRA: KAFKA-4691): the method is called after the key and value have already been serialized.
Bummer, as I don't think doing an additional deserialization at this point is acceptable.
My question to more experienced and knowledgeable users: what would be an efficient and elegant way of achieving this? Any input, ideas, or suggestions are welcome.
Thanks in advance for your help/comments/suggestions/ideas.
Below are some code snippets of what I've tried:
public static void main(String[] args) throws Exception {
    StreamsConfig streamingConfig = new StreamsConfig(getProperties());
    StringDeserializer stringDeserializer = new StringDeserializer();
    StringSerializer stringSerializer = new StringSerializer();
    MyObjectSerializer myObjectSerializer = new MyObjectSerializer();
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    topologyBuilder.addSource("SOURCE", stringDeserializer, myObjectSerializer, Pattern.compile("input-.*"))
        .addProcessor("PROCESS", MyCustomProcessor::new, "SOURCE");
    System.out.println("Starting PurchaseProcessor Example");
    KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
    streaming.start();
    System.out.println("Now started PurchaseProcessor Example");
}

private static Properties getProperties() {
    Properties props = new Properties();
    .....
    .....
    props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), "com.test.kafka.streams.OutputTopicRouterInterceptor");
    return props;
}
OutputTopicRouterInterceptor onSend implementation:
@Override
public ProducerRecord<String, MyObject> onSend(ProducerRecord<String, MyObject> record) {
    MyObject obj = record.value();
    String topic = computeTopicName(obj);
    ProducerRecord<String, MyObject> newRecord = new ProducerRecord<String, MyObject>(topic, record.partition(), record.timestamp(), record.key(), obj);
    return newRecord;
}