Kafka Stream producing custom list of messages based on certain conditions - apache-kafka

We have the following stream processing requirement.
Source Stream ->
transform(condition check - If (true) then generate MULTIPLE ADDITIONAL messages else just transform the incoming message) ->
output kafka topic
Example:
If condition is true for message B(D,E,F are the additional messages produced)
A,B,C -> A,D,E,F,C -> Sink Kafka Topic
If condition is false
A,B,C -> A,B,C -> Sink Kafka Topic
Is there a way we can achieve this in Kafka streams?

You can use flatMap() or flatMapValues() methods. These methods take one record and produce zero, one or more records.
flatMap() can modify the key, values and their datatypes while flatMapValues() retains the original keys and change the value and value data type.
Here is an example pseudocode considering the new messages "C","D","E" will have a new key.
KStream<byte[], String> inputStream = builder.stream("inputTopic");
KStream<byte[], String> outStream = inputStream.flatMap(
(key,value)->{
List<KeyValue<byte[], String>> result = new LinkedList<>();
// If message value is "B". Otherwise place your condition based on data
if(value.equalsTo("B")){
result.add(KeyValue.pair("<new key for message C>","C"));
result.add(KeyValue.pair("<new key for message D>","D"));
result.add(KeyValue.pair("<new key for message E>","E"));
}else{
result.add(KeyValue.pair(key,value));
}
return result;
});
outStream.to("sinkTopic");
You can read more about this :
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-transformations-stateless

Related

merge records in a kafka stream

Is it possible to merge records in kafka and publish the output to different stream ?
For example , there is a stream of events produced to a kafka topic like below
{txnId:1,startTime:0900},{txnId:1,endTime:0905},{txnId:2,endTime:0912},{txnId:3,endTime:0930},{txnId:2,startTime:0912},{txnId:3,startTime:0925}......
I want to merge these events by txnId and create the merged output like below
{txnId:1,startTime:0900,endTime:0905},{txnId:2,startTime:0910,endTime:0912},{txnId:3,startTime:0925,endTime:0930}
Please note that order is not maintained in the incoming events.So if endTime is received for a txn Id before start time event , then we need to wait till the start time event is received for that txnId before initiating the merge
I went through the word count example that comes along with Kafka Streams example but its not clear how to wait for events and then merge while doing the transformation.
Any thoughts is highly appreciated.
You could try solving this by splitting the start and end events into 2 separate streams with txnId as the key and then joining both the streams.
KStream<String, String> eventSource = new StreamBuilder().stream("INPUT-TOPIC");
KStream<String, JsonNode>[] splitEvents =
eventSource.map((key, eventString) -> {
JsonNode event = new ObjectMapper().readTree(eventString);
String txnId = event.path("txnId").asText();
return KeyValue.pair(txnId, event);
})
.branch((key, event) -> event.findValue("startTime") != null,
(key, event) -> event.findValue("endTime") != null);
KStream<String, JsonNode> startEvents = splitEvents[0];
KStream<String, JsonNode> endEvents = splitEvents[1];
A join between 2 streams as shown will produce a join result when there is an event in either side of the join. So the order of both events won't matter (you will have to ensure that you set an appropriate window period for the join).
Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());
KStream<String, String> completeEvents = startEvents.join(endEvents,
(startEvent, endEvent) -> {
// Add logic to merge startEvent and endEvent as seen fit
ObjectNode completeEvent = JsonNodeFactory.instance.objectNode();
completeEvent.put("startTime", startEvent.path("startTime).asText());
completeEvent.put("endTime", endEvent.path("endTime").asText());
return completeEvent.toString();
},
JoinWindows.of(Duration.ofMinutes(15)),
Joined.with(
Serdes.String(), // key
jsonSerde, // left object
jsonSerde // right object
)
);

Get the last records of KStream

I'm very new to Kafka Stream API.
I have a KStream like this:
KStream<Long,String> joinStream = builder.stream(("output"));
The KStream with records value look like this:
The stream will be updated every 1s.
I need to build a Rest API that will be calculated based on the value profit and spotPrice.
But I've struggled to get the value of the last record.
I am assuming that you mean the max value of the stream when you say the last value as the values are continuously arriving. Then you can use the reduce transformation to always update the output stream with the max value.
final StreamsBuilder builder = new StreamsBuilder();
KStream<Long, String> stream = builder.stream("INPUT_TOPIC", Consumed.with(Serdes.Long(), Serdes.String()));
stream
.mapValues(value -> Long.valueOf(value))
.groupByKey()
.reduce(new Reducer<Long>() {
#Override
public Long apply(Long currentMax, Long v) {
return (currentMax > v) ? currentMax : v;
}
})
.toStream().to("OUTPUT_TOPIC");
return builder.build();
And in case that you want to retrive it in a rest api i suggest to take a look at Spring cloud + Kafka streams (https://cloud.spring.io/spring-cloud-stream-binder-kafka/spring-cloud-stream-binder-kafka.html) that you can exchange messages to spring web.

How to access a KStreams Materialized State Store from another Stream Processor

I need to be able to remove a record from a Ktable from a separate Stream Processor. Today I'm using aggregate() and passing a materialized state store. In a separate processor that reads from a "termination" topic, I'd like to query that materialized state store either in a .transform() or a different .aggregate() and 'remove' that key/value. Every time I try to access the materialized state from a separate stream processor, it keeps telling me either the store isn't added to the topology, so then I add it and run it again, then it tells me it's already be registered and errors out.
builder.stream("topic1").map().groupByKey().aggregate(() -> null,
(aggKey, newValue, aggValue) -> {
//add to the Ktable
return newValue;
},
stateStoreMaterialized);
and in a separate stream I want to delete a key from that stateStoreMaterialized
builder.stream("topic2")
.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
stateStoreDeleteTransformer will query the key and delete it.
//in ctor
KeyValueBytesStoreSupplier stateStoreSupplier = Stores.persistentKeyValueStore("store1");
stateStoreMaterialized = Materialized.<String, MyObj>as(stateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(mySerDe);
I don't have a terminal flag on my topic1 stream object value that can trigger a deletion. It has to come from another stream/topic.
When I try to use the same Materialized Store on two separate stream processors I get..
Invalid topology: Topic STATE_STORE-repartition has already been registered by another source.
at org.springframework.kafka.config.StreamsBuilderFactoryBean.start(StreamsBuilderFactoryBean.java:268)
Edit:
This is the 1st error I receive.
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORMVALUES-0000000012 has no access to StateStore store1 as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:104)
at org.apache.kafka.streams.processor.internals.ForwardingDisabledProcessorContext.getStateStore(ForwardingDisabledProcessorContext.java:85)
So then I do this:
stateStoreSupplier = Stores.persistentKeyValueStore(STATE_STORE_NAME);
storeStoreBuilder = Stores.keyValueStoreBuilder(stateStoreSupplier, Serdes.String(), jsonSerDe);
stateStoreMaterialized = Materialized.as(stateStoreSupplier);
Then I get this error:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore 'state-store' is already added.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:520)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:512)
Here's the code that fixed my issue. As it turns out, order matters when building the streams. Had to set the materialized store first and then in subsequent lines of code, setup the transformer.
/**
* Create the streams using the KStreams DSL - a method to configure the stream and add any state stores.
*/
#Bean
public KafkaStreamsConfig setup() {
final JsonSerDe<Bus> ltaSerde = new JsonSerDe<>(Bus.class);
final StudentSerde<Student> StudentSerde = new StudentSerde<>();
//start lta stream
KStream<String, Bus> ltaStream = builder
.stream(ltaInputTopic, Consumed.with(Serdes.String(), ltaSerde));
final KStream<String, Student> statusStream = this.builder
.stream(this.locoStatusInputTopic,
Consumed.with(Serdes.String(),
StudentSerde));
//create lta store
KeyValueBytesStoreSupplier ltaStateStoreSupplier = Stores.persistentKeyValueStore(LTA_STATE_STORE_NAME);
final Materialized<String, Bus, KeyValueStore<Bytes, byte[]>> ltaStateStoreMaterialized =
Materialized.
<String, Bus>as(ltaStateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(ltaSerde);
KTable<String, Bus> ltaStateProcessor = ltaStream
//map and convert lta stream into Loco / LTA key value pairs
.groupByKey(Grouped.with(Serdes.String(), ltaSerde))
.aggregate(
//The 'aggregate' and 'reduce' functions ignore messages with null values FYI.
// so if the value after the groupbykey produces a null value, it won't be removed from the state store.
//which is why it's very important to send a message with some terminal flag indicating this value should be removed from the store.
() -> null, /* initializer */
(aggKey, newValue, aggValue) -> {
if (null != newValue.getAssociationEndTime()) { //if there is an endTime associated to this train/loco then remove it from the ktable
logger.trace("removing LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return null; //Returning null removes the record from the state store as well as its changelog topic. re: https://objectpartners.com/2019/07/31/slimming-down-your-kafka-streams-data/
}
logger.trace("adding LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return newValue;
}, /* adder */
ltaStateStoreMaterialized
);
// don't need builder.addStateStore(keyValueStoreStoreBuilder); and CANT use it
// because the ltaStateStoreMaterialized will already be added to the topology in the KTable aggregate method above.
// The below transformer can use the state store because it's already added (apparently) by the aggregate method.
// Add the KTable processors first, then if there are any transformers that need to use the store, add them after the KTable aggregate method.
statusStream.map((k, v) -> new KeyValue<>(v.getLocoId(), v))
.transform(locoStatusTransformerSupplier, ltaStateStoreSupplier.name())
.to("testing.outputtopic", Produced.with(Serdes.String(), StudentSerde));
return this; //can return anything except for void.
}
is stateStoreMaterialized and stateStoreSupplier.name() has the same name?
Use have a error in your topology
KStream.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
You have to supply new instant of StateStoreDeleteTransformer per ProcessContext in TransformerSupplier, like this:
KStream.transform(StateStoreDeleteTransformer::new, stateStoreSupplier.name())
or
KStream.transform(() -> StateStoreDeleteTransformerSupplier.get(), stateStoreSupplier.name())//StateStoreDeleteTransformerSupplier return new instant of StateStoreDeleteTransformer
in stateStoreDeleteTransformer how do you intent on using stateStoreMaterialized inside transformer directly?
I have the similar use case and I using a KeyValueStore<String, MyObj>
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<String, MyObj>) context.getStateStore("store1");
}

Kafka stream : class cast exception during left join

I am new to kafka. I am trying to leftJoin a kafka stream (named as inputStream) to kafka-table(named as detailTable) where the kafka-stream is built as:
//The consumer to consume the input topic
Consumed<String, NotifyRecipient> inputNotificationEventConsumed = Consumed
.with(Constants.CONSUMER_KEY_SERDE, Constants.CONSUMER_VALUE_SERDE);
//Now create the stream that is directly reading from the topic
KStream<NotifyKey, NotifyVal> initialInputStream =
streamsBuilder.stream(properties.getInputTopic(), inputNotificationEventConsumed);
//Now re-key the above stream for the purpose of left join
KStream<String, NotifyVal> inputStream = initialInputStream
.map((notifyKey,notifyVal) ->
KeyValue.pair(notifyVal.getId(),notifyVal)
);
And the kafka-table is created this way:
//The consumer for the table
Consumed<String, Detail> notifyDetailConsumed =
Consumed.with(Serdes.String(), Constants.DET_CONSUMER_VALUE_SERDE);
//Now consume from the topic into ktable
KTable<String, Detail> detailTable = streamsBuilder
.table(properties.getDetailTopic(), notifyDetailConsumed);
Now I am trying to join the inputStream to the detailTable as:
//Now join
KStream<String,Pair<Long, SendCmd>> joinedStream = inputStream
.leftJoin(detailTable, valJoiner)
.filter((key,value)->value!=null);
I am getting an error from which it seems that during the join, the key and value of the inputStream were tried to cast into the default key-serde and default value-serde and getting a class cast exception.
Not sure how to fix this and need help there.
Let me know if I should provide more info.
Because you use a map(), key and value type might have changes and thus you need to specify the correct Serdes via Joined.with(...) as third parameter to .leftJoin().

How to Stream to a Global Kafka Table

I have a Kafka Streams application that needs to join an incoming stream against a global table, then after some processing, write out the result of an aggregate back to that table:
KeyValueBytesStoreSupplier supplier = Stores.persistentKeyValueStore(
storeName
);
Materialized<String, String, KeyValueStore<Bytes, byte[]>> m = Materialized.as(
supplier
);
GlobalKTable<String, String> table = builder.globalTable(
topic, m.withKeySerde(
Serdes.String()
).withValueSerde(
Serdes.String()
)
);
stream.leftJoin(
table
...
).groupByKey().aggregate(
...
).toStream().through(
topic, Produced.with(Serdes.String(), Serdes.String())
);
However, when I try to stream into the KTable changelog, I get the following error: Invalid topology: Topic 'topic' has already been registered by another source.
If I try to aggregate to the store itself, I get the following error: InvalidStateStoreException: Store 'store' is currently closed.
How can both join against the table and write back to its changelog?
If this isn't possible, a solution that involves filtering incoming logs against the store would also work.
Calling through() is a shortcut for
stream.to("topic");
KStream stream2 = builder.stream("topic");
Because you use builder.stream("topic") already, you get Invalid topology: Topic 'topic' has already been registered by another source. because each topic can only be consumed once. If you want to feed the data of a stream/topic into different part, you need to reuse the original KStream you created for this topic:
KStream stream = builder.stream("topic");
// this won't work
KStream stream2 = stream.through("topic");
// rewrite to
stream.to("topic");
KStream stream2 = stream; // or just omit `stream2` and reuse `stream`
Not sure what you mean by
If I try to aggregate to the store itself