Kafka Stream fixed window not grouping by key

I have a single Kafka stream. How can I accumulate messages for a specific time window, irrespective of the key?
My use case is to write out a file every 10 minutes from the stream, without considering the key.

You'll need to use a Transformer with a state store, and schedule a punctuation call that goes through the store every 10 minutes and emits the records. The transform() method should return null, as you are collecting the records in the state store, so you'll also need a filter after the transformer to ignore any null records.
Here's a quick example of something I think is close to what you are asking for. Let me know how it goes.
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.kstream.TransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

class WindowedTransformerExample {

    public static void main(String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        final String stateStoreName = "stateStore";
        final StoreBuilder<KeyValueStore<String, String>> keyValueStoreBuilder =
                Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(stateStoreName),
                        Serdes.String(),
                        Serdes.String());
        builder.addStateStore(keyValueStoreBuilder);

        builder.<String, String>stream("topic")
                .transform(new WindowedTransformer(stateStoreName), stateStoreName)
                .filter((k, v) -> k != null && v != null)
                // Here's where you do something with the records emitted every 10 minutes
                .foreach((k, v) -> System.out.println(k + " : " + v));
    }

    static final class WindowedTransformer implements TransformerSupplier<String, String, KeyValue<String, String>> {

        private final String storeName;

        public WindowedTransformer(final String storeName) {
            this.storeName = storeName;
        }

        @Override
        public Transformer<String, String, KeyValue<String, String>> get() {
            return new Transformer<String, String, KeyValue<String, String>>() {

                private KeyValueStore<String, String> keyValueStore;
                private ProcessorContext processorContext;

                @SuppressWarnings("unchecked")
                @Override
                public void init(final ProcessorContext context) {
                    processorContext = context;
                    keyValueStore = (KeyValueStore<String, String>) context.getStateStore(storeName);

                    // Could change this to PunctuationType.STREAM_TIME if needed
                    context.schedule(Duration.ofMinutes(10), PunctuationType.WALL_CLOCK_TIME, (ts) -> {
                        // Emit everything collected in the store since the last punctuation
                        try (final KeyValueIterator<String, String> iterator = keyValueStore.all()) {
                            while (iterator.hasNext()) {
                                final KeyValue<String, String> keyValue = iterator.next();
                                processorContext.forward(keyValue.key, keyValue.value);
                            }
                        }
                    });
                }

                @Override
                public KeyValue<String, String> transform(String key, String value) {
                    // Collect records in the state store; nothing is emitted downstream here
                    if (key != null) {
                        keyValueStore.put(key, value);
                    }
                    return null;
                }

                @Override
                public void close() {
                }
            };
        }
    }
}
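For completeness, here's a minimal sketch (not part of the original answer; the application id, bootstrap server, and default serdes are placeholders) of how the builder above could be turned into a running application:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-transformer-example"); // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");            // placeholder
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
// Close the application cleanly on shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));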

Related

Kafka streams: how to produce to a topic while aggregating?

I currently have some code that builds a KTable using aggregate:
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
);
Once a given number of messages have been received and aggregated for a single key, I would like to push the latest aggregation state to another topic and then delete the key in the table.
I can obviously use a plain Kafka producer and have something like:
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        if (count > threshold) {
            producer.send(new ProducerRecord<String, String>("output-topic",
                    key, aggregate));
            return null;
        }
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
);
But I'm looking for a more "streams-like" approach.
Any hints?
I think the best solution here is to just convert the aggregation back to a stream, then filter the values you want before sending them to a topic.
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
)
.toStream()
.filter((key, value) -> value.count > threshold)
.to("output-topic");
Edit:
I just realized you want to do this before it is serialized.
I think the only way to do that is to use a transformer or processor instead of aggregate.
There you get access to a StateStore instead of a KTable, and it also gives you access to context.forward(), which lets you forward a message downstream any way you want.
Some pseudo-code to show how it could be done using transform:
@Override
public Transformer<String, String, KeyValue<String, String>> get() {
    return new Transformer<String, String, KeyValue<String, String>>() {

        private KeyValueStore<String, String> stateStore;
        private ProcessorContext context;

        @SuppressWarnings("unchecked")
        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
            stateStore = (KeyValueStore<String, String>) context.getStateStore(STATE_STORE_NAME);
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            String prevAggregation = stateStore.get(key);
            // use prevAggregation and value to calculate newAggregation here:
            // ...
            if (newAggregation.length() > threshold) {
                context.forward(key, newAggregation);
                stateStore.delete(key);
            } else {
                stateStore.put(key, newAggregation);
            }
            // returning null means transform() itself forwards nothing;
            // records only go downstream via context.forward()
            return null;
        }

        @Override
        public void close() {
            // Note: The store should NOT be closed manually here via `stateStore.close()`!
            // The Kafka Streams API will automatically close stores when necessary.
        }
    };
}
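For reference, a sketch of how the state store referenced above could be registered and the transformer wired into the topology (the store name, serdes, topic names, and the MyTransformerSupplier class are assumptions, not from the original answer):

// Assumed store name; it must match the STATE_STORE_NAME used inside the transformer
final String STATE_STORE_NAME = "aggregation-store";

StoreBuilder<KeyValueStore<String, String>> storeBuilder =
        Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(STATE_STORE_NAME),
                Serdes.String(), Serdes.String());

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(storeBuilder);

builder.<String, String>stream("input-topic")                      // placeholder topic
        .transform(new MyTransformerSupplier(), STATE_STORE_NAME)  // hypothetical supplier returning the Transformer above
        .to("output-topic");                                       // placeholder topic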

KeyValueStore.get() returns inconsistent results

stateStore.get() returns inconsistent results when used from transform() on a KStream. It returns null even though the corresponding key-value pair has been put() into the store.
Can someone explain this behavior of KeyValueStore<>?
@Component
public class StreamProcessor {

    @StreamListener
    public void process(@Input(KStreamBindings.INPUT_STREAM) KStream<String, JsonNode> inputStream) {
        KStream<String, JsonNode> joinedEvents = inputStream
            .selectKey((key, value) -> computeKey(value))
            .transform(
                () -> new SelfJoinTransformer((v1, v2) -> join(v1, v2), "join_store"),
                "join_store"
            );

        joinedEvents
            .foreach((key, value) -> System.out.format("%s,joined=%b\n", key, value.has("right")));
    }

    private JsonNode join(JsonNode left, JsonNode right) {
        ((ObjectNode) left).set("right", right);
        return left;
    }
}
public class SelfJoinTransformer implements Transformer<String, JsonNode, KeyValue<String, JsonNode>> {

    private KeyValueStore<String, JsonNode> stateStore;
    private ValueJoiner<JsonNode, JsonNode, JsonNode> valueJoiner;
    private String storeName;

    public SelfJoinTransformer(ValueJoiner<JsonNode, JsonNode, JsonNode> valueJoiner, String storeName) {
        this.storeName = storeName;
        this.valueJoiner = valueJoiner;
    }

    @SuppressWarnings("unchecked")
    @Override
    public void init(ProcessorContext context) {
        this.stateStore = (KeyValueStore<String, JsonNode>) context.getStateStore(storeName);
    }

    @Override
    public KeyValue<String, JsonNode> transform(String key, JsonNode value) {
        JsonNode oldValue = stateStore.get(key);
        if (oldValue != null) { // this condition rarely holds true
            stateStore.delete(key);
            System.out.format("%s,joined\n", key);
            return KeyValue.pair(key, valueJoiner.apply(oldValue, value));
        }
        stateStore.put(key, value);
        return null;
    }

    @Override
    public void close() {
    }
}
The reason it seems messages are disappearing (assuming a punctuator isn't removing them) is that KStream::selectKey(...) changes the key but does not trigger repartitioning,
so you may be looking for the key in the wrong partition.
Look at the following scenario:
Msg1: k1, v1 (partition 0)
Msg2: k2, v2 (partition 1)
Assume the messages end up in different partitions (because of their keys).
After selectKey: k1 -> k, k2 -> k
Msg1: k, v1
Msg2: k, v2
selectKey is a stateless operation, so the messages are not sent downstream (to a topic) and no repartitioning happens.
For the first message, the value is put into the store under key k (partition 0).
When the second message arrives, there is no entry for key k, because it is processed on a different partition (partition 1), which has its own local state store.
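A sketch of one possible fix (assuming Kafka Streams 2.6+, where KStream#repartition() is available; on older versions the same effect can be achieved by writing to and re-reading from an intermediate topic): force a repartition after selectKey, so that all records with the same new key land in the same partition, and therefore in the same local state store, before the transformer runs:

KStream<String, JsonNode> joinedEvents = inputStream
        .selectKey((key, value) -> computeKey(value))
        // shuffle the data so records sharing the new key are co-partitioned
        .repartition()
        .transform(
            () -> new SelfJoinTransformer((v1, v2) -> join(v1, v2), "join_store"),
            "join_store"
        );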

Kafka custom state store is not getting updated

I am trying to build a custom state store which stores, for each key, a map of values.
Stream & Store configuration
final Serde<HashMap<String, ?>> userSessionsSerde = Serdes.serdeFrom(new HashMapSerializer(), new HashMapDeserializer());

StoreBuilder sessionStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(storeName),
        Serdes.String(),
        userSessionsSerde);
builder.addStateStore(sessionStoreBuilder);

builder.stream("connection-events", Consumed.with(Serdes.String(), wsSerde))
        .transform(wsEventTransformerSupplier, storeName)
        .to("status-changes", Produced.with(Serdes.String(), Serdes.String()));

KafkaStreams streams = new KafkaStreams(builder.build(), properties);
streams.start();
Transformer
public class WSEventProcessor implements Transformer<String, ConnectionEvent, KeyValue<String, String>> {

    private String storeName = "user-sessions";
    private ProcessorContext context;
    private KeyValueStore<String, Map<String, ConnectionEvent>> stateStore;

    final Serde<HashMap<String, ?>> userSessionsSerde = Serdes.serdeFrom(new HashMapSerializer(), new HashMapDeserializer());

    @SuppressWarnings("unchecked")
    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore<String, Map<String, ConnectionEvent>>) context.getStateStore(storeName);
    }

    @Override
    public void close() {
    }

    @Override
    public KeyValue<String, String> transform(String key, ConnectionEvent value) {
        boolean sendUpdate = false;
        // Send null if there are no updates to be sent to downstream processors
        if (value.getState() == WebSocketConnection.CONNECTED) {
            if (stateStore.get(key) == null) {
                stateStore.put(key, new HashMap<>());
                sendUpdate = true;
            }
            stateStore.get(key).put(value.getSessionId(), value);
            return sendUpdate ? KeyValue.pair(key, "Online") : null;
        } else {
            stateStore.get(key).remove(value.getSessionId());
            int size = stateStore.get(key).size();
            return stateStore.get(key).isEmpty() ? KeyValue.pair(key, "Offline") : null;
        }
    }
}
The state store always has a zero-size map for each key, irrespective of connected and disconnected events. Am I doing something wrong?
The value object that you stored with stateStore.put(key, value) and the one returned by stateStore.get(key) are different objects (the value is serialized and then deserialized).
Your issue is caused by modifying the object returned from the state store:
stateStore.get(key).put(value.getSessionId(), value) and stateStore.get(key).remove(value.getSessionId()). When you update the object returned by stateStore.get(key), the change is not persisted to the state store; it only modifies that in-memory object.
So, to fix your issue, calculate the required value first (in your case the HashMap), and only after that call stateStore.put(key, calculatedValue). If you need to remove a key-value pair from the state store, use stateStore.put(key, null). Your transform method should look approximately like this:
public KeyValue<String, String> transform(String key, ConnectionEvent value) {
    Map<String, ConnectionEvent> valueFromStateStore = stateStore.get(key);
    // Start from the stored map, or a new (mutable) one if the key is absent
    Map<String, ConnectionEvent> valueToUpdate = ofNullable(valueFromStateStore).orElseGet(HashMap::new);
    KeyValue<String, String> resultKeyValue = null;

    // Send null if there are no updates to be sent to downstream processors
    if (value.getState() == WebSocketConnection.CONNECTED) {
        if (valueToUpdate.isEmpty()) {
            resultKeyValue = KeyValue.pair(key, "Online");
        }
        valueToUpdate.put(value.getSessionId(), value);
    } else {
        valueToUpdate.remove(value.getSessionId());
        if (valueToUpdate.isEmpty()) {
            resultKeyValue = KeyValue.pair(key, "Offline");
        }
    }

    // Persist the recalculated map back into the state store
    stateStore.put(key, valueToUpdate);
    return resultKeyValue;
}

How can we club all the values for the same key into a list and return a Kafka stream with key and value as String

I have data coming in on a Kafka topic as (key: id, {id: 1, body: ...}),
meaning the key for the message is the same as the id. However, there can be multiple messages with the same id but a different body,
so I am getting a KStream<String, String>.
Now I want to get all the messages having the same id (key), club all the values into a list, and return a
KStream<String, List<String>>.
Any suggestions?
// Create a stream with a state store
StreamsBuilder builder = new StreamsBuilder();
StoreBuilder<KeyValueStore<String, List<String>>> logTracerStateStore = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(LOG_TRACE_STATE_STORE), Serdes.String(),
        new ListSerde<String>(Serdes.String()));

// add this to the stream builder
builder.addStateStore(logTracerStateStore);
KStream<String, String> kafkaStream = builder.stream(TOPIC);
splitProcessor(kafkaStream);

logger.info("creating stream for topic {} ..", TOPIC);
final Topology topology = builder.build();
return new KafkaStreams(topology, streamConfiguration(bootstrapServers));
// Stream List Serde
public class ListSerde<T> implements Serde<List<T>> {

    private final Serde<List<T>> inner;

    public ListSerde(final Serde<T> avroSerde) {
        inner = Serdes.serdeFrom(new ListSerializer<>(avroSerde.serializer()),
                new ListDeserializer<>(avroSerde.deserializer()));
    }

    @Override
    public Serializer<List<T>> serializer() {
        return inner.serializer();
    }

    @Override
    public Deserializer<List<T>> deserializer() {
        return inner.deserializer();
    }

    @Override
    public void configure(final Map<String, ?> configs, final boolean isKey) {
        inner.serializer().configure(configs, isKey);
        inner.deserializer().configure(configs, isKey);
    }

    @Override
    public void close() {
        inner.serializer().close();
        inner.deserializer().close();
    }
}
// Serializer & deserializer
public class ListSerializer<T> implements Serializer<List<T>> {

    // private final Comparator<T> comparator;
    private final Serializer<T> valueSerializer;

    public ListSerializer(final Serializer<T> valueSerializer) {
        // this.comparator = comparator;
        this.valueSerializer = valueSerializer;
    }

    @Override
    public void configure(final Map<String, ?> configs, final boolean isKey) {
        // do nothing
    }

    @Override
    public byte[] serialize(final String topic, final List<T> list) {
        final int size = list.size();
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final DataOutputStream out = new DataOutputStream(baos);
        final Iterator<T> iterator = list.iterator();
        try {
            out.writeInt(size);
            while (iterator.hasNext()) {
                final byte[] bytes = valueSerializer.serialize(topic, iterator.next());
                out.writeInt(bytes.length);
                out.write(bytes);
            }
            out.close();
        } catch (final IOException e) {
            throw new RuntimeException("unable to serialize List", e);
        }
        return baos.toByteArray();
    }

    @Override
    public void close() {
    }
}
//------------
public class ListDeserializer<T> implements Deserializer<List<T>> {

    // private final Comparator<T> comparator;
    private final Deserializer<T> valueDeserializer;

    public ListDeserializer(final Deserializer<T> valueDeserializer) {
        // this.comparator = comparator;
        this.valueDeserializer = valueDeserializer;
    }

    @Override
    public void configure(final Map<String, ?> configs, final boolean isKey) {
        // do nothing
    }

    @Override
    public List<T> deserialize(final String topic, final byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return null;
        }
        final List<T> list = new ArrayList<>();
        final DataInputStream dataInputStream = new DataInputStream(new ByteArrayInputStream(bytes));
        try {
            final int records = dataInputStream.readInt();
            for (int i = 0; i < records; i++) {
                final byte[] valueBytes = new byte[dataInputStream.readInt()];
                // readFully guarantees the whole record is read
                dataInputStream.readFully(valueBytes);
                list.add(valueDeserializer.deserialize(topic, valueBytes));
            }
        } catch (final IOException e) {
            throw new RuntimeException("Unable to deserialize List", e);
        } finally {
            try {
                dataInputStream.close();
            } catch (Exception e2) {
                // ignore failures on close
            }
        }
        return list;
    }

    @Override
    public void close() {
    }
}
// Now create the stream processor
public class LogTraceStreamStateProcessor implements Processor<String, String> {

    private static final Logger logger = Logger.getLogger(LogTraceStreamStateProcessor.class);

    IStore stateStore;

    /**
     * Initialize the processor.
     */
    @Override
    public void init(ProcessorContext context) {
        logger.info("initializing processor and looking for monitoring store");
        stateStore = MonitoringStateStoreFactory.getInstance().getStore();
        logger.debug("found the monitoring store - {} ", stateStore);
        stateStore.initLogTraceStoreProcess(context);
        logger.debug("initializing monitoring store.");
    }

    @Override
    public void process(String key, String value) {
        logger.debug("Storing the value for logtrace storage - {} ", value);
        stateStore.storeLogTrace(value);
        logger.debug("finished storing the value for logtrace storage - {} ", value);
    }

    @Override
    public void close() {
    }
}
// Access the key-value state store like below
KeyValueStore<String, List<String>> stateStore = (KeyValueStore<String, List<String>>) traceStreamContext.getStateStore(EXEID_REQ_REL_STORE);

// Now add a list under a new key for a new message; if the key already exists, add the new message to its list
public void storeTraceData(String traceData) {
    try {
        TraceEvent tracer = new TraceEvent();
        logger.debug("Received the trace value - {}", traceData);
        tracer = mapper.readValue(traceData, TraceEvent.class);
        logger.debug("trace unmarshalling has been completed successfully !!!");
        String key = tracer.getExecutionId();
        List<String> listEvents = stateStore.get(key);
        if (listEvents != null && !listEvents.isEmpty()) {
            logger.debug("event is already in store so storing in the list for execution id - {}", key);
            listEvents.add(requestId); // requestId comes from the surrounding class (not shown here)
            stateStore.put(key, listEvents);
        } else {
            logger.debug(
                    "event is not present in the store so creating a new list and adding into store for execution id - {}",
                    key);
            List<String> list = new ArrayList<>();
            list.add(requestId);
            stateStore.put(key, list);
        }
    } catch (Throwable e) {
        logger.error("exception while processing the trace event .. ", e);
    } finally {
        try {
            traceStreamContext.commit();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
}

// Now this is how you can access the messages from the state store
public ReadOnlyKeyValueStore<String, List<String>> tracerStore() {
    return waitUntilStoreIsQueryable(KEY_NAME);
}
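As a side note (not part of the original answer): if a custom processor isn't strictly required, the same per-key list can be sketched with the DSL using groupByKey and aggregate, reusing the ListSerde defined above (assumes Kafka Streams 2.1+ for Grouped; builder and TOPIC are the same as in the first snippet of this answer):

KTable<String, List<String>> grouped = builder.<String, String>stream(TOPIC)
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .aggregate(
                ArrayList::new,                                           // start with an empty list per key
                (key, value, list) -> { list.add(value); return list; },  // append each new value
                Materialized.with(Serdes.String(), new ListSerde<String>(Serdes.String())));

KStream<String, List<String>> result = grouped.toStream();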

How to evaluate consuming time in a Kafka Streams application

I have a Kafka Streams 1.0.0 application with the two classes below, FilterByPolicyStreamsApp and FilterByPolicyTransformerSupplier. In my application, I read the events, perform some conditional checks, and forward them to another topic in the same Kafka cluster. I am able to get the producing time with the eventsForwardTimeInMs variable in the FilterByPolicyTransformerSupplier class, but I am unable to get the consuming time (with and without (de)serialization). How can I get this time? Please help me.
FilterByPolicyStreamsApp.java:
public class FilterByPolicyStreamsApp implements CommandLineRunner {

    String policyKafkaTopicName = "policy";
    String policyFilterDataKafkaTopicName = "policy.filter.data";
    String bootstrapServers = "11.1.1.1:9092";
    String sampleEventsKafkaTopicName = "sample-.*";
    String applicationId = "filter-by-policy-app";
    String policyFilteredEventsKafkaTopicName = "policy.filter.events";

    public static void main(String[] args) {
        SpringApplication.run(FilterByPolicyStreamsApp.class, args);
    }

    @Override
    public void run(String... arg0) {
        String policyGlobalTableName = policyKafkaTopicName + ".table";
        String policyFilterDataGlobalTable = policyFilterDataKafkaTopicName + ".table";

        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class);

        KStreamBuilder builder = new KStreamBuilder();
        builder.globalTable(Serdes.String(), new JsonSerde<>(List.class), policyKafkaTopicName,
                policyGlobalTableName);
        builder.globalTable(Serdes.String(), new JsonSerde<>(PolicyFilterData.class), policyFilterDataKafkaTopicName,
                policyFilterDataGlobalTable);

        KStream<String, SampleEvent> events = builder.stream(Serdes.String(),
                new JsonSerde<>(SampleEvent.class), Pattern.compile(sampleEventsKafkaTopicName));
        events = events.transform(new FilterByPolicyTransformerSupplier(policyGlobalTableName,
                policyFilterDataGlobalTable));
        events.to(Serdes.String(), new JsonSerde<>(SampleEvent.class), policyFilteredEventsKafkaTopicName);

        KafkaStreams streams = new KafkaStreams(builder, config);
        streams.start();
        streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                logger.error(e.getMessage(), e); // logger is assumed to be defined elsewhere in the class
            }
        });
    }
}
FilterByPolicyTransformerSupplier.java:
public class FilterByPolicyTransformerSupplier
        implements TransformerSupplier<String, SampleEvent, KeyValue<String, SampleEvent>> {

    private String policyGlobalTableName;
    private String policyFilterDataGlobalTable;

    public FilterByPolicyTransformerSupplier(String policyGlobalTableName,
            String policyFilterDataGlobalTable) {
        this.policyGlobalTableName = policyGlobalTableName;
        this.policyFilterDataGlobalTable = policyFilterDataGlobalTable;
    }

    @Override
    public Transformer<String, SampleEvent, KeyValue<String, SampleEvent>> get() {
        return new Transformer<String, SampleEvent, KeyValue<String, SampleEvent>>() {

            private KeyValueStore<String, List<String>> policyStore;
            private KeyValueStore<String, PolicyFilterData> policyMetadataStore;
            private ProcessorContext context;

            @Override
            public void close() {
            }

            @Override
            public void init(ProcessorContext context) {
                this.context = context;
                // Call punctuate every 1 second
                this.context.schedule(1000);
                policyStore = (KeyValueStore<String, List<String>>) this.context
                        .getStateStore(policyGlobalTableName);
                policyMetadataStore = (KeyValueStore<String, PolicyFilterData>) this.context
                        .getStateStore(policyFilterDataGlobalTable);
            }

            @Override
            public KeyValue<String, SampleEvent> punctuate(long arg0) {
                return null;
            }

            @Override
            public KeyValue<String, SampleEvent> transform(String key, SampleEvent event) {
                long eventsForwardTimeInMs = 0;
                long forwardedEventCount = 0;
                List<String> policyIds = policyStore.get(event.getCustomerCode().toLowerCase());
                if (policyIds != null) {
                    for (String policyId : policyIds) {
                        /*
                        PolicyFilterData policyFilterMetadata = policyMetadataStore.get(policyId);
                        Do some condition checks on the event. If it satisfies them, forward the event.
                        if (policyFilterMetadata == null) {
                            continue;
                        }
                        */
                        // Using context.forward as one event can map to multiple policies
                        long startForwardTime = System.currentTimeMillis();
                        context.forward(policyId, event);
                        forwardedEventCount++;
                        eventsForwardTimeInMs += System.currentTimeMillis() - startForwardTime;
                    }
                }
                return null;
            }
        };
    }
}
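Not from the original thread, but one way to observe the consuming side is through the metrics that Kafka Streams and its embedded consumer already expose. The sketch below (assuming streams is the running KafkaStreams instance created in FilterByPolicyStreamsApp) simply dumps every latency-related metric, which includes consumer fetch latency and stream processing latency:

// Requires: java.util.Map, org.apache.kafka.common.Metric, org.apache.kafka.common.MetricName
for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
    MetricName name = entry.getKey();
    if (name.name().contains("latency")) {
        System.out.println(name.group() + " / " + name.name() + " = " + entry.getValue().metricValue());
    }
}

Note that some of the finer-grained latency metrics are only recorded when metrics.recording.level is set to DEBUG in the streams configuration.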