How to access a KStreams Materialized State Store from another Stream Processor


I need to be able to remove a record from a KTable from a separate stream processor. Today I'm using aggregate() and passing a materialized state store. In a separate processor that reads from a "termination" topic, I'd like to query that materialized state store, either in a .transform() or in a different .aggregate(), and remove that key/value. Every time I try to access the materialized state store from a separate stream processor, I get an error saying the store isn't added to the topology. When I add it and run again, I get an error saying it has already been registered.
builder.stream("topic1").map().groupByKey().aggregate(() -> null,
(aggKey, newValue, aggValue) -> {
//add to the Ktable
return newValue;
},
stateStoreMaterialized);
and in a separate stream I want to delete a key from that stateStoreMaterialized
builder.stream("topic2")
.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
stateStoreDeleteTransformer will query the key and delete it.
//in ctor
KeyValueBytesStoreSupplier stateStoreSupplier = Stores.persistentKeyValueStore("store1");
stateStoreMaterialized = Materialized.<String, MyObj>as(stateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(mySerDe);
I don't have a terminal flag on my topic1 stream object value that can trigger a deletion. It has to come from another stream/topic.
When I try to use the same Materialized store on two separate stream processors, I get:
Invalid topology: Topic STATE_STORE-repartition has already been registered by another source.
at org.springframework.kafka.config.StreamsBuilderFactoryBean.start(StreamsBuilderFactoryBean.java:268)
Edit:
This is the 1st error I receive.
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORMVALUES-0000000012 has no access to StateStore store1 as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:104)
at org.apache.kafka.streams.processor.internals.ForwardingDisabledProcessorContext.getStateStore(ForwardingDisabledProcessorContext.java:85)
So then I do this:
stateStoreSupplier = Stores.persistentKeyValueStore(STATE_STORE_NAME);
storeStoreBuilder = Stores.keyValueStoreBuilder(stateStoreSupplier, Serdes.String(), jsonSerDe);
stateStoreMaterialized = Materialized.as(stateStoreSupplier);
Then I get this error:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore 'state-store' is already added.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:520)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:512)
Here's the code that fixed my issue. As it turns out, order matters when building the streams. I had to set up the materialized store first and then, in subsequent lines of code, set up the transformer.
/**
* Create the streams using the KStreams DSL - a method to configure the stream and add any state stores.
*/
@Bean
public KafkaStreamsConfig setup() {
final JsonSerDe<Bus> ltaSerde = new JsonSerDe<>(Bus.class);
final StudentSerde<Student> StudentSerde = new StudentSerde<>();
//start lta stream
KStream<String, Bus> ltaStream = builder
.stream(ltaInputTopic, Consumed.with(Serdes.String(), ltaSerde));
final KStream<String, Student> statusStream = this.builder
.stream(this.locoStatusInputTopic,
Consumed.with(Serdes.String(),
StudentSerde));
//create lta store
KeyValueBytesStoreSupplier ltaStateStoreSupplier = Stores.persistentKeyValueStore(LTA_STATE_STORE_NAME);
final Materialized<String, Bus, KeyValueStore<Bytes, byte[]>> ltaStateStoreMaterialized =
Materialized.
<String, Bus>as(ltaStateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(ltaSerde);
KTable<String, Bus> ltaStateProcessor = ltaStream
//map and convert lta stream into Loco / LTA key value pairs
.groupByKey(Grouped.with(Serdes.String(), ltaSerde))
.aggregate(
//The 'aggregate' and 'reduce' functions ignore messages with null values FYI.
// so if the value after the groupbykey produces a null value, it won't be removed from the state store.
//which is why it's very important to send a message with some terminal flag indicating this value should be removed from the store.
() -> null, /* initializer */
(aggKey, newValue, aggValue) -> {
if (null != newValue.getAssociationEndTime()) { //if there is an endTime associated to this train/loco then remove it from the ktable
logger.trace("removing LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return null; //Returning null removes the record from the state store as well as its changelog topic. re: https://objectpartners.com/2019/07/31/slimming-down-your-kafka-streams-data/
}
logger.trace("adding LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return newValue;
}, /* adder */
ltaStateStoreMaterialized
);
// don't need builder.addStateStore(keyValueStoreStoreBuilder); and CANT use it
// because the ltaStateStoreMaterialized will already be added to the topology in the KTable aggregate method above.
// The below transformer can use the state store because it's already added (apparently) by the aggregate method.
// Add the KTable processors first, then if there are any transformers that need to use the store, add them after the KTable aggregate method.
statusStream.map((k, v) -> new KeyValue<>(v.getLocoId(), v))
.transform(locoStatusTransformerSupplier, ltaStateStoreSupplier.name())
.to("testing.outputtopic", Produced.with(Serdes.String(), StudentSerde));
return this; //can return anything except for void.
}

Do stateStoreMaterialized and stateStoreSupplier.name() have the same name?
You have an error in your topology:
KStream.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
You have to supply a new instance of StateStoreDeleteTransformer per ProcessorContext in the TransformerSupplier, like this:
KStream.transform(StateStoreDeleteTransformer::new, stateStoreSupplier.name())
or
KStream.transform(() -> StateStoreDeleteTransformerSupplier.get(), stateStoreSupplier.name()) // StateStoreDeleteTransformerSupplier returns a new instance of StateStoreDeleteTransformer
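The advice about returning a new instance from the supplier can be illustrated with a plain-JDK analogy. This is only a sketch: `DemoTransformer` and `SupplierDemo` are hypothetical names, `java.util.function.Supplier` stands in for the Kafka Streams `TransformerSupplier`, and a string stands in for the per-task context. It shows why a supplier that hands out one shared object lets two tasks overwrite each other's state.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a stateful transformer: it remembers what init() was given.
final class DemoTransformer {
    private String contextName; // per-task state set by init()

    void init(String contextName) {
        this.contextName = contextName;
    }

    String contextName() {
        return contextName;
    }
}

final class SupplierDemo {
    // Problematic: every get() returns the same object, so a later init() overwrites an earlier one.
    static Supplier<DemoTransformer> shared() {
        DemoTransformer single = new DemoTransformer();
        return () -> single;
    }

    // Preferred: every get() returns a new object, so each caller owns its own state.
    static Supplier<DemoTransformer> fresh() {
        return DemoTransformer::new;
    }
}
```

With `shared()`, two calls to `get()` yield the same object, so the second `init` replaces the first task's context. With `fresh()`, each task keeps its own.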
In stateStoreDeleteTransformer, how do you intend to use stateStoreMaterialized inside the transformer directly?
I have a similar use case and I am using a KeyValueStore<String, MyObj>:
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<String, MyObj>) context.getStateStore("store1");
}

Related

Stale ktable records when joining kstream with ktable created by kstream aggregation

I'm trying to implement the event sourcing pattern with kafka streams in the following way.
I'm in a Security service and handle two use cases:
Register User, handling RegisterUserCommand should produce UserRegisteredEvent.
Change User Name, handling ChangeUserNameCommand should produce UserNameChangedEvent.
I have two topics:
Command Topic, 'security-command'. Every command is keyed and the key is the user's email. For example:
foo@bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo@bar.com"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex1"}}
Event Topic, 'security-event'. Every record is keyed by the user's email:
foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
foo@bar.com:{"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
Kafka Streams version 2.8.0
Kafka version 2.8
The implementation idea can be expressed in the following topology:
commandStream = builder.stream("security-command");
eventStream = builder.stream("security-event",
Consumed.with(
...,
new ZeroTimestampExtractor()
/*always returns 0 to get the latest version of snapshot*/));
// build the snapshot to get the current state of the user.
userSnapshots = eventStream.groupByKey()
.aggregate(() -> new UserSnapshot(),
(key /*email*/, event, currentSnapshot) -> currentSnapshot.apply(event));
// join commands with latest snapshot at the time of the join
commandWithSnapshotStream =
commandStream.leftJoin(
userSnapshots,
(command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
joinParams
);
// handle the command given the current snapshot
resultingEventStream = commandWithSnapshotStream.flatMap((key /*email*/, commandWithSnapshot) -> {
var newEvents = commandHandler(commandWithSnapshot.command(), commandWithSnapshot.snapshot());
return Arrays.stream(newEvents )
.map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
.toList();
});
// append events to events topic
resultingEventStream.to("security-event");
For this topology, I'm using EOS exactly_once_beta.
A more explicit version of this topology:
KStream<String, Command<DomainEvent[]>> commandStream =
builder.stream(
commandTopic,
Consumed.with(Serdes.String(), new SecurityCommandSerde()));
KStream<String, DomainEvent> eventStream =
builder.stream(
eventTopic,
Consumed.with(
Serdes.String(),
new DomainEventSerde(),
new LatestRecordTimestampExtractor() /*always returns 0 to get the latest snapshot of the snapshot.*/));
// build the snapshots ktable by aggregating all the current events for a given user.
KTable<String, UserSnapshot> userSnapshots =
eventStream.groupByKey()
.aggregate(
() -> new UserSnapshot(),
(email, event, currentSnapshot) -> currentSnapshot.apply(event),
Materialized.with(
Serdes.String(),
new UserSnapshotSerde()));
// join command stream and snapshot table to get the stream of pairs <Command, UserSnapshot>
Joined<String, Command<DomainEvent[]>, UserSnapshot> commandWithSnapshotJoinParams =
Joined.with(
Serdes.String(),
new SecurityCommandSerde(),
new UserSnapshotSerde()
);
KStream<String, CommandWithUserSnapshot> commandWithSnapshotStream =
commandStream.leftJoin(
userSnapshots,
(command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
commandWithSnapshotJoinParams
);
var resultingEventStream = commandWithSnapshotStream.flatMap((key /*email*/, commandWithSnapshot) -> {
var command = commandWithSnapshot.command();
if (command instanceof RegisterUserCommand registerUserCommand) {
var handler = new RegisterUserCommandHandler();
var events = handler.handle(registerUserCommand);
// multiple events might be produced when a command is handled.
return Arrays.stream(events)
.map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
.toList();
}
if (command instanceof ChangeUserNameCommand changeUserNameCommand) {
var handler = new ChangeUserNameCommandHandler();
var events = handler.handle(changeUserNameCommand, commandWithSnapshot.userSnapshot());
return Arrays.stream(events)
.map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
.toList();
}
throw new IllegalArgumentException("...");
});
resultingEventStream.to(eventTopic, Produced.with(Serdes.String(), new DomainEventSerde()));
Problems I'm getting:
Launching the stream app on a command topic with existing records:
foo@bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo@bar.com"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex1"}}
Outcome:
1. The stream application fails when processing the ChangeUserNameCommand, because the snapshot is null.
2. The events topic has a record for the successful registration, but nothing for the name change:
/*OK*/foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
Thoughts:
When processing the ChangeUserNameCommand, the snapshot is missing in the aggregated KTable, userSnapshots. Restarting the application successfully produces the following record:
foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
Tried increasing max.task.idle.ms to 4 seconds; it had no effect.
Launching the stream app and producing a set of ChangeUserNameCommand commands at a time (fast).
Producing:
// Produce to command topic
foo@bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo@bar.com"}}
// event topic outcome
/*OK*/ foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
// Produce at once to command topic
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex1"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex2"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex3"}}
// event topic outcome
/*OK*/foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
/*NOK*/foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex2","version":1}}
/*NOK*/foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex3","version":1}}
Thoughts:
'ChangeUserNameCommand' commands are joined with a stale version of the snapshot (note the version attribute).
The expected outcome would be:
foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex2","version":2}}
foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex3","version":3}}
Tried increasing max.task.idle.ms to 4 seconds and setting cache.max.bytes.buffering to 0; neither had any effect.
What am I missing in building such a topology? I expect every command to be processed against the latest version of the snapshot. If I produce the commands with a few seconds' delay between them, everything works as expected.
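The difference between the observed and the expected versions can be reproduced with a small standalone model. It is a sketch only: `SnapshotDemo` and `handleRenames` are hypothetical names, and no Kafka Streams API is involved. It assumes each rename event carries the snapshot version plus one, and the flag models whether the snapshot absorbs the previous event before the next command is joined.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotDemo {
    // Returns the version stamped on the event produced for each rename command.
    // refreshBetweenCommands = true models a join that sees the previous event first;
    // false models every command being joined with the same stale snapshot.
    static List<Integer> handleRenames(int startVersion, int commands, boolean refreshBetweenCommands) {
        List<Integer> producedVersions = new ArrayList<>();
        int snapshotVersion = startVersion;
        for (int i = 0; i < commands; i++) {
            int eventVersion = snapshotVersion + 1; // event version is derived from the joined snapshot
            producedVersions.add(eventVersion);
            if (refreshBetweenCommands) {
                snapshotVersion = eventVersion; // snapshot has absorbed the new event
            }
        }
        return producedVersions;
    }
}
```

Starting from version 0 with three commands, the refreshed variant yields versions 1, 2, 3, while the stale variant yields 1, 1, 1, which matches the NOK records in the question.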

I think you missed the changelog recovery part for tables. Read this to understand what happens during changelog recovery:
For tables, it is more complex because they must maintain additional
information—their state—to allow for stateful processing such as joins
and aggregations like COUNT() or SUM(). To achieve this while also
ensuring high processing performance, tables (through their state
stores) are materialized on local disk within a Kafka Streams
application instance or a ksqlDB server. But machines and containers
can be lost, along with any locally stored data. How can we make
tables fault tolerant, too?
The answer is that any data stored in a table is also stored remotely
in Kafka. Every table has its own change stream for this purpose—a
built-in change data capture (CDC) setup, we could say. So if we have
a table of account balances by customer, every time an account balance
is updated, a corresponding change event will be recorded into the
change stream of that table.
Also keep in mind that restarting a Kafka Streams application should not reprocess previously processed events. For that, you need to commit the offset of each message after processing it.

Found the root cause. Not sure if it is by design or a bug, but a stream task waits only once per processing cycle for data in other partitions.
So if 2 records from the command topic are read first, the stream task waits max.task.idle.ms while processing the first command record, which allows the poll() phase to happen. After that record is processed, the stream task does not allow polling while processing the second one, so it never picks up the events generated by the first command.
In Kafka 2.8, the code responsible for this behavior is in StreamTask.java. isProcessable() is invoked at the beginning of the processing phase. If it returns false, the polling phase is repeated.
public boolean isProcessable(final long wallClockTime) {
if (state() == State.CLOSED) {
return false;
}
if (hasPendingTxCommit) {
return false;
}
if (partitionGroup.allPartitionsBuffered()) {
idleStartTimeMs = RecordQueue.UNKNOWN;
return true;
} else if (partitionGroup.numBuffered() > 0) {
if (idleStartTimeMs == RecordQueue.UNKNOWN) {
idleStartTimeMs = wallClockTime;
}
if (wallClockTime - idleStartTimeMs >= maxTaskIdleMs) {
return true;
// idleStartTimeMs is not reset to default, RecordQueue.UNKNOWN, value,
// therefore the next time when the check for all buffered partitions is done, `true` is returned, meaning that the task is ready to be processed.
} else {
return false;
}
} else {
// there's no data in any of the topics; we should reset the enforced
// processing timer
idleStartTimeMs = RecordQueue.UNKNOWN;
return false;
}
}
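The effect of the timer not being reset can be seen in a stripped-down model of the method above. This is a sketch, not the Kafka class: `IdleGate` is a hypothetical name, the `UNKNOWN` sentinel value of -1 is an assumption, and the buffered-partition counts are passed in directly instead of coming from a partition group.

```java
// Simplified, standalone model of the idle-timer branches quoted above.
final class IdleGate {
    static final long UNKNOWN = -1L; // assumed stand-in for RecordQueue.UNKNOWN

    private final long maxTaskIdleMs;
    private long idleStartTimeMs = UNKNOWN;

    IdleGate(long maxTaskIdleMs) {
        this.maxTaskIdleMs = maxTaskIdleMs;
    }

    // buffered: partitions that currently hold records; total: all input partitions (>= 1).
    boolean isProcessable(int buffered, int total, long wallClockTime) {
        if (buffered == total) {
            idleStartTimeMs = UNKNOWN; // all partitions have data: reset the timer
            return true;
        }
        if (buffered > 0) {
            if (idleStartTimeMs == UNKNOWN) {
                idleStartTimeMs = wallClockTime; // start waiting for the empty partitions
            }
            // Once the timeout has elapsed the timer is NOT reset here,
            // so later calls keep returning true without another wait.
            return wallClockTime - idleStartTimeMs >= maxTaskIdleMs;
        }
        idleStartTimeMs = UNKNOWN; // nothing buffered at all: reset the timer
        return false;
    }
}
```

With a 4000 ms limit and one of two partitions buffered, the gate returns false at time 0, true at time 4000, and true again at time 4001, which is the "waits only once" behavior described above.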

Getting 409 from the schema registry while saving records to a local state store where multiple state stores are associated with a single processor

Long story short: I am in the middle of implementing a processor topology: the processor is to store the received records into corresponding local state stores and do event-based processing as a record arrives. And the related code looks like this:
@Override
public void configureBuilder(StreamsBuilder builder) {
final Map<String, String> serdeConfig =
Collections.singletonMap("schema.registry.url", processorConfig.getSchemaRegistryUrl());
final Serde<GenericRecord> valueSerde = new GenericAvroSerde();
valueSerde.configure(serdeConfig, false); // `true` for record keys
final Serde<EventKey> keySerde = new SpecificAvroSerde();
keySerde.configure(serdeConfig, true); // `true` for record keys
Map<String, String> stateStoreConfigMap = new HashMap<>();
//stateStoreConfigMap.put(KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY, RecordNameStrategy.class.getName());
StoreBuilder<KeyValueStore<EventKey, GenericRecord>> aggSequenceStateStoreBuilder =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(processStateStore), keySerde, valueSerde)
.withLoggingEnabled(stateStoreConfigMap)
.withCachingEnabled();
final Serde<EnrichedSmcHeatData> enrichedSmcHeatDataSerde = new SpecificAvroSerde<>();
enrichedSmcHeatDataSerde.configure(serdeConfig, false); // `true` for record keys
StoreBuilder<KeyValueStore<EventKey, EnrichedSmcHeatData>> enrichedSmcHeatStateStoreBuilder =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("enriched-smc-heat-state-store"), keySerde, enrichedSmcHeatDataSerde)
.withLoggingEnabled(stateStoreConfigMap)
.withCachingEnabled();
Topology topology = builder.build();
topology
.addSource(
PROCESS_EVENTS_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getInputCcmProcessEvents())
.addSource(
SCHEDULED_SEQUENCES_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getScheduledCastSequences())
.addSource(
SMC_HEAT_EVENTS_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getInputSmcHeatEvents())
.addProcessor(
PROCESS_STATE_AGGREGATOR,
() -> new ProcessStateProcessor(processStateStore, processorConfig),
PROCESS_EVENTS_SOURCE,
SCHEDULED_SEQUENCES_SOURCE,
SMC_HEAT_EVENTS_SOURCE)
.addStateStore(aggSequenceStateStoreBuilder, PROCESS_STATE_AGGREGATOR)
.addStateStore(enrichedSmcHeatStateStoreBuilder, PROCESS_STATE_AGGREGATOR);
Updates to the store created by aggSequenceStateStoreBuilder are saved without problems. However, when updates arrive for the second store, the following error is thrown:
Caused by:
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Schema being registered is incompatible with an earlier schema for
subject
"ccm-process-events-processor-ccm-process-state-store-changelog-value";
error code: 409
My use case: the state processor accepts inbound records from multiple source topics and does event handling (including storing the modified values in the corresponding stores) when a record arrives from any of the source topics.
It appears that only one schema can be registered with the schema registry for the same processor. Is that by design, am I missing something, or what alternative options do I have?
Thanks in advance!
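To read the subject name in the error, it may help to see how it is assembled. The sketch below uses plain string handling and the hypothetical class `SubjectNames`. It relies on two conventions: Kafka Streams names a store's changelog topic `<application.id>-<store name>-changelog`, and the Confluent serializers' default subject naming is `<topic>-value` for values. The split of the subject into application id and store name is an assumption, since neither value appears in the question.

```java
final class SubjectNames {
    // Changelog topic name convention: <application.id>-<store name>-changelog
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Default topic-based subject naming: <topic>-key or <topic>-value
    static String defaultSubject(String topic, boolean isKey) {
        return topic + (isKey ? "-key" : "-value");
    }
}
```

Under these assumptions, an application id of "ccm-process-events-processor" and a store named "ccm-process-state-store" produce exactly the subject shown in the exception, so every value written to that store's changelog is checked against a single subject.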

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink.
I want to enrich the incoming click stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I did some research and found a few potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream using some IP matching logic.
1. Result: It worked well for a few sample IP locations but not with the whole CSV file. The JVM heap reached 3.5 GB, and there is no way to put broadcast state on disk (with RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: The enrichment data is too big to store in the JVM heap, so it is impossible to load it into ValueState. Also, de/serializing through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state with MapState.
3. Result: Trying to load the CSV file into MapState in the open() method gave me an error saying that you cannot put into MapState in open(), because there is no keyed context in open(), as in this question: Flink keyed stream key is null
4. Solution: Because MapState needs a keyed context (to put data into RocksDB), I tried to load the whole CSV file into the local RocksDB instance (disk) in the process function after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
var ipMapState: MapState[String, String] = _
var csvFinishedFlag: ValueState[Boolean] = _
override def processElement(event: Event,
ctx: KeyedProcessFunction[Long, Event, Event]#Context,
out: Collector[Event]): Unit = {
val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
ipMapState = getRuntimeContext.getMapState(ipDescriptor)
csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
if (!csvFinishedFlag.value()) {
val csv = new CSVParser(defaultCSVFormat)
val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
for (row <- fileSource.getLines()) {
val Some(List(start, end, country)) = csv.parseLine(row)
ipMapState.put(start, country)
}
fileSource.close()
csvFinishedFlag.update(true)
}
out.collect {
if (ipMapState.contains(event.userIp)) {
val details = ipMapState.get(event.userIp)
event.copy(data =
event.data.copy(
ipLocation = Some(details.country)
))
} else {
event
}
}
}
}
4. Result: It's too hacky, and the blocking file read prevents event processing.
Could you tell me what I can do in this situation?
Thanks
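Separately from the memory problem, the "IP matching logic" for rows shaped like start_ip,end_ip,country can be sketched as a range lookup. This is a plain-JDK illustration with the hypothetical class `IpRangeIndex`. It assumes IPv4 addresses and non-overlapping ranges, and it does not address where the data is stored.

```java
import java.util.Map;
import java.util.TreeMap;

final class IpRangeIndex {
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Ranges keyed by their numeric start address (assumed not to overlap).
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    // Converts dotted IPv4 text such as "1.1.1.1" to its 32-bit value held in a long.
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    void add(String startIp, String endIp, String country) {
        byStart.put(toLong(startIp), new Range(toLong(endIp), country));
    }

    // Returns the country whose range contains ip, or null when no range matches.
    String lookup(String ip) {
        long key = toLong(ip);
        Map.Entry<Long, Range> entry = byStart.floorEntry(key); // closest start <= key
        return (entry != null && key <= entry.getValue().end) ? entry.getValue().country : null;
    }
}
```

Keying by start address and checking the end bound means an address inside a range matches, whereas a map keyed by the start IP string only matches addresses equal to a range start.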

What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
@Override
public int partition(final Long sensorMeasurement, final int numPartitions) {
return Math.toIntExact(sensorMeasurement % numPartitions);
}
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
private Map<Long, SensorReferenceData> referenceData;
@Override
public void open(final Configuration parameters) throws Exception {
super.open(parameters);
referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
}
@Override
public void flatMap(
final SensorMeasurement sensorMeasurement,
final Collector<EnrichedMeasurements> collector) throws Exception {
SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
}
private Map<Long, SensorReferenceData> loadReferenceData(
final int partition,
final int numPartitions) {
SensorReferenceDataClient client = new SensorReferenceDataClient();
return client.getSensorReferenceDataForPartition(partition, numPartitions);
}
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.
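The excerpt does not show how a subtask decides which slice of the reference data is its own. One way to picture it, as a sketch with the hypothetical class `PartitionSlices` and a plain in-memory map standing in for the reference data client, is to apply the same modulo rule as the partitioner when filtering:

```java
import java.util.Map;
import java.util.TreeMap;

final class PartitionSlices {
    // Matches sensorId % numPartitions for non-negative ids;
    // floorMod also keeps negative ids inside [0, numPartitions).
    static int partitionFor(long sensorId, int numPartitions) {
        return (int) Math.floorMod(sensorId, (long) numPartitions);
    }

    // Keeps only the entries that the partitioner would route to the given subtask.
    static Map<Long, String> sliceFor(Map<Long, String> all, int subtaskIndex, int numPartitions) {
        Map<Long, String> slice = new TreeMap<>();
        for (Map.Entry<Long, String> entry : all.entrySet()) {
            if (partitionFor(entry.getKey(), numPartitions) == subtaskIndex) {
                slice.put(entry.getKey(), entry.getValue());
            }
        }
        return slice;
    }
}
```

The important property is that the loading rule and the partitioning rule agree, so every event reaches the subtask that holds its reference entry.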

Kafka Stream producing custom list of messages based on certain conditions

We have the following stream processing requirement.
Source Stream ->
transform(condition check - If (true) then generate MULTIPLE ADDITIONAL messages else just transform the incoming message) ->
output kafka topic
Example:
If condition is true for message B(D,E,F are the additional messages produced)
A,B,C -> A,D,E,F,C -> Sink Kafka Topic
If condition is false
A,B,C -> A,B,C -> Sink Kafka Topic
Is there a way we can achieve this in Kafka streams?

You can use the flatMap() or flatMapValues() methods. These methods take one record and produce zero, one, or more records.
flatMap() can modify the key, the value, and their data types, while flatMapValues() retains the original key and changes only the value and the value's data type.
Here is example pseudocode, assuming the new messages "D", "E", and "F" get new keys:
KStream<String, String> inputStream = builder.stream("inputTopic");
KStream<String, String> outStream = inputStream.flatMap(
(key, value) -> {
List<KeyValue<String, String>> result = new LinkedList<>();
// If the message value is "B". Otherwise, place your own condition based on the data
if ("B".equals(value)) {
result.add(KeyValue.pair("<new key for message D>", "D"));
result.add(KeyValue.pair("<new key for message E>", "E"));
result.add(KeyValue.pair("<new key for message F>", "F"));
} else {
result.add(KeyValue.pair(key, value));
}
return result;
});
outStream.to("sinkTopic");
You can read more about this :
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-transformations-stateless
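The one-to-many expansion from the question (A,B,C becoming A,D,E,F,C) can also be tried without a Kafka cluster. The sketch below uses `java.util.stream.Stream.flatMap` as a stand-in for the Kafka Streams operator, with the hypothetical class `FlatMapDemo`. It only models values, not keys.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FlatMapDemo {
    // Replaces "B" with the three additional messages and passes everything else through unchanged.
    static List<String> expand(List<String> input) {
        return input.stream()
                .flatMap(value -> "B".equals(value)
                        ? Stream.of("D", "E", "F")
                        : Stream.of(value))
                .collect(Collectors.toList());
    }
}
```

Order is preserved, so the additional messages appear in the position that "B" occupied.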

How to Stream to a Global Kafka Table

I have a Kafka Streams application that needs to join an incoming stream against a global table, then after some processing, write out the result of an aggregate back to that table:
KeyValueBytesStoreSupplier supplier = Stores.persistentKeyValueStore(
storeName
);
Materialized<String, String, KeyValueStore<Bytes, byte[]>> m = Materialized.as(
supplier
);
GlobalKTable<String, String> table = builder.globalTable(
topic, m.withKeySerde(
Serdes.String()
).withValueSerde(
Serdes.String()
)
);
stream.leftJoin(
table
...
).groupByKey().aggregate(
...
).toStream().through(
topic, Produced.with(Serdes.String(), Serdes.String())
);
However, when I try to stream into the KTable changelog, I get the following error: Invalid topology: Topic 'topic' has already been registered by another source.
If I try to aggregate to the store itself, I get the following error: InvalidStateStoreException: Store 'store' is currently closed.
How can I both join against the table and write back to its changelog?
If this isn't possible, a solution that involves filtering incoming logs against the store would also work.

Calling through() is a shortcut for
stream.to("topic");
KStream stream2 = builder.stream("topic");
Because you already use builder.stream("topic"), you get "Invalid topology: Topic 'topic' has already been registered by another source." because each topic can only be consumed once. If you want to feed the data of a stream/topic into different parts of the topology, you need to reuse the original KStream you created for this topic:
KStream stream = builder.stream("topic");
// this won't work
KStream stream2 = stream.through("topic");
// rewrite to
stream.to("topic");
KStream stream2 = stream; // or just omit `stream2` and reuse `stream`
Not sure what you mean by "If I try to aggregate to the store itself".
