KeyValueStore.get() returns inconsistent results - apache-kafka

stateStore.get() returns inconsistent results when used from transform() on a KStream. It returns null even though the corresponding key-value pair has been put() into the store.
Can someone explain this behavior of KeyValueStore<>?
@Component
public class StreamProcessor {

    @StreamListener
    public void process(@Input(KStreamBindings.INPUT_STREAM) KStream<String, JsonNode> inputStream) {
        KStream<String, JsonNode> joinedEvents = inputStream
            .selectKey((key, value) -> computeKey(value))
            .transform(
                () -> new SelfJoinTransformer((v1, v2) -> join(v1, v2), "join_store"),
                "join_store"
            );

        joinedEvents
            .foreach((key, value) -> System.out.format("%s,joined=%b\n", key, value.has("right")));
    }

    private JsonNode join(JsonNode left, JsonNode right) {
        ((ObjectNode) left).set("right", right);
        return left;
    }
}
public class SelfJoinTransformer implements Transformer<String, JsonNode, KeyValue<String, JsonNode>> {

    private KeyValueStore<String, JsonNode> stateStore;
    private ValueJoiner<JsonNode, JsonNode, JsonNode> valueJoiner;
    private String storeName;

    public SelfJoinTransformer(ValueJoiner<JsonNode, JsonNode, JsonNode> valueJoiner, String storeName) {
        this.storeName = storeName;
        this.valueJoiner = valueJoiner;
    }

    @Override
    public void init(ProcessorContext context) {
        this.stateStore = (KeyValueStore<String, JsonNode>) context.getStateStore(storeName);
    }

    @Override
    public KeyValue<String, JsonNode> transform(String key, JsonNode value) {
        JsonNode oldValue = stateStore.get(key);
        if (oldValue != null) { // this condition rarely holds true
            stateStore.delete(key);
            System.out.format("%s,joined\n", key);
            return KeyValue.pair(key, valueJoiner.apply(oldValue, value));
        }
        stateStore.put(key, value);
        return null;
    }

    @Override
    public void close() {
        // nothing to clean up; the store is managed by Kafka Streams
    }
}

The reason the messages seem to disappear (assuming a punctuator doesn't remove them) is that KStream::selectKey(...) changes the key but does not trigger repartitioning, so you may be looking up the key in the wrong partition.
Consider the following scenario:
Msg1: k1, v1 (partition 0)
Msg2: k2, v2 (partition 1)
Assume the messages land in different partitions (because of their keys).
After selectKey: k1 -> k, k2 -> k
Msg1: k, v1
Msg2: k, v2
The selectKey operation is stateless, so the records are not sent through a downstream topic and no repartitioning happens.
For the first message, the value is put into the store under key k (on partition 0).
When the second message arrives, there is no value for key k, because its task reads the store of a different partition (partition 1).
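One way to address this (not shown in the original answer) is to force a repartition after changing the key and before the transform, so that records with the same new key end up in the same partition and therefore the same store. A minimal sketch, assuming Kafka Streams 2.6+ (which added KStream#repartition; older versions can use KStream#through with a pre-created topic) and assuming a jsonNodeSerde exists in your project:

KStream<String, JsonNode> joinedEvents = inputStream
    .selectKey((key, value) -> computeKey(value))
    // shuffle the records so equal keys land in the same partition (and state store)
    .repartition(Repartitioned.with(Serdes.String(), jsonNodeSerde))
    .transform(
        () -> new SelfJoinTransformer((v1, v2) -> join(v1, v2), "join_store"),
        "join_store"
    );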

Related

Kafka streams: how to produce to a topic while aggregating?

I currently have some code that builds a KTable using aggregate:
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
);
Once a given number of messages have been received and aggregated for a single key, I would like to push the latest aggregation state to another topic and then delete the key in the table.
I can obviously use a plain Kafka producer and have something like:
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        if (count > threshold) {
            producer.send(new ProducerRecord<String, String>("output-topic",
                key, aggregate));
            return null;
        }
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
);
But I'm looking for a more "Stream" approach.
Any hint?
I think the best solution here is to just throw the aggregation back to a stream, then filter the values you want before sending it to a topic.
inputTopic.groupByKey().aggregate(
    Aggregator::new,
    (key, value, aggregate) -> {
        someProcessingDoneHere;
        return aggregate;
    },
    Materialized.with(Serdes.String(), Serdes.String())
)
.toStream()
.filter((key, value) -> value.count > threshold)
.to("output-topic");
Edit:
I just realized you want to do this before it is serialized.
I think the only way to do this is to use a transformer or a processor instead of aggregate.
There you get access to a StateStore instead of a KTable, and you also get context.forward(), which lets you forward a message downstream any way you want.
Some pseudo-code to show how it could be done using transform:
@Override
public Transformer<String, String, KeyValue<String, String>> get() {
    return new Transformer<String, String, KeyValue<String, String>>() {

        private KeyValueStore<String, String> stateStore;
        private ProcessorContext context;

        @SuppressWarnings("unchecked")
        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
            stateStore = (KeyValueStore<String, String>) context.getStateStore(STATE_STORE_NAME);
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            String prevAggregation = stateStore.get(key);
            // use prevAggregation and value to calculate newAggregation here:
            // ...
            if (newAggregation.length() > threshold) {
                context.forward(key, newAggregation);
                stateStore.delete(key);
            } else {
                stateStore.put(key, newAggregation);
            }
            return null; // returning null means nothing is forwarded from transform() itself
        }

        @Override
        public void close() {
            // Note: the store should NOT be closed manually here via stateStore.close()!
            // The Kafka Streams API will automatically close stores when necessary.
        }
    };
}
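For completeness, here is a rough sketch of how such a transformer could be wired into the topology. The store name, topic names, and the supplier class wrapping the get() shown above are placeholders of my own, not from the original answer:

StreamsBuilder builder = new StreamsBuilder();

// register the state store that the transformer looks up in init()
StoreBuilder<KeyValueStore<String, String>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(STATE_STORE_NAME),
        Serdes.String(),
        Serdes.String());
builder.addStateStore(storeBuilder);

builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
    // attach the transformer and grant it access to the store by name
    .transform(new AggregatingTransformerSupplier(), STATE_STORE_NAME)
    .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));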

Write to multiple Kafka topics in Apache Beam?

I am executing a simple word count program where I use one Kafka topic as the input source and then apply a ParDo to it to calculate the word count. Now I need help writing the words to different topics based on their frequency. Let's say all words with an even frequency go to topic 1 and the rest go to topic 2.
Can anyone help me with an example?
This can be done with KafkaIO's writeRecords() method, which writes a PCollection of ProducerRecord<K, V> elements; the destination topic is chosen per element with new ProducerRecord<>("topic_name", key, value).
Below is the code:
static class ExtractWordsFn extends DoFn<String, String> {

    private final Counter emptyLines = Metrics.counter(ExtractWordsFn.class, "emptyLines");
    private final Distribution lineLenDist =
        Metrics.distribution(ExtractWordsFn.class, "lineLenDistro");

    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> receiver) {
        lineLenDist.update(element.length());
        if (element.trim().isEmpty()) {
            emptyLines.inc();
        }
        String[] words = element.split(ExampleUtils.TOKENIZER_PATTERN, -1);
        for (String word : words) {
            if (!word.isEmpty()) {
                receiver.output(word);
            }
        }
    }
}
public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, ProducerRecord<String, String>> {

    @Override
    public ProducerRecord<String, String> apply(KV<String, Long> input) {
        // route words with an even count to the "test" topic, the rest to "copy"
        if (input.getValue() % 2 == 0) {
            return new ProducerRecord<>("test", input.getKey(), input.getKey() + " " + input.getValue().toString());
        } else {
            return new ProducerRecord<>("copy", input.getKey(), input.getKey() + " " + input.getValue().toString());
        }
    }
}
public static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
        PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());
        return wordCounts;
    }
}
p.apply("ReadLines", KafkaIO.<Long, String>read()
.withBootstrapServers("localhost:9092")
.withTopic("copy")// use withTopics(List<String>) to read from multiple topics.
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))
.updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
.withLogAppendTime()
.withReadCommitted()
.commitOffsetsInFinalize()
.withProcessingTime()
.withoutMetadata()
)
.apply(Values.create())
.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
.apply(new CountWords())
.apply(MapElements.via(new FormatAsTextFn())) //PCollection<ProducerRecord<string,string>>
.setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
.apply("WriteCounts", (KafkaIO.<String, String>writeRecords()
.withBootstrapServers("localhost:9092")
//.withTopic("test")
.withKeySerializer(StringSerializer.class)
.withValueSerializer(StringSerializer.class)
))
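For reference, the snippet above assumes a Pipeline object p already exists; a minimal sketch (not from the original answer) of creating and running it, with placeholder options you would adjust for your runner:

// create the pipeline the chain above is applied to, then run it
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);

// ... apply the ReadLines / CountWords / WriteCounts chain shown above ...

p.run().waitUntilFinish();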

Kafka custom state store is not getting updated

I am trying to build a custom state store which maps each key to a map of values.
Stream & Store configuration
final Serde<HashMap<String, ?>> userSessionsSerde = Serdes.serdeFrom(new HashMapSerializer(), new HashMapDeserializer());

StoreBuilder sessionStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore(storeName),
    Serdes.String(),
    userSessionsSerde);
builder.addStateStore(sessionStoreBuilder);

builder.stream("connection-events", Consumed.with(Serdes.String(), wsSerde))
    .transform(wsEventTransformerSupplier, storeName)
    .to("status-changes", Produced.with(Serdes.String(), Serdes.String()));

KafkaStreams streams = new KafkaStreams(builder.build(), properties);
streams.start();
Transformer
public class WSEventProcessor implements Transformer<String, ConnectionEvent, KeyValue<String, String>> {

    private String storeName = "user-sessions";
    private ProcessorContext context;
    private KeyValueStore<String, Map<String, ConnectionEvent>> stateStore;
    final Serde<HashMap<String, ?>> userSessionsSerde = Serdes.serdeFrom(new HashMapSerializer(), new HashMapDeserializer());

    @SuppressWarnings("unchecked")
    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore<String, Map<String, ConnectionEvent>>) context.getStateStore(storeName);
    }

    @Override
    public void close() {
    }

    @Override
    public KeyValue<String, String> transform(String key, ConnectionEvent value) {
        boolean sendUpdate = false;
        // Send null if there are no updates to be sent to downstream processors
        if (value.getState() == WebSocketConnection.CONNECTED) {
            if (stateStore.get(key) == null) {
                stateStore.put(key, new HashMap<>());
                sendUpdate = true;
            }
            stateStore.get(key).put(value.getSessionId(), value);
            return sendUpdate ? KeyValue.pair(key, "Online") : null;
        } else {
            stateStore.get(key).remove(value.getSessionId());
            int size = stateStore.get(key).size();
            return stateStore.get(key).isEmpty() ? KeyValue.pair(key, "Offline") : null;
        }
    }
}
The state store always has a zero-size map for each key, irrespective of connected and disconnected events. Am I doing something wrong?
The value object that you stored with stateStore.put(key, value) and the object returned by stateStore.get(key) are different objects (the value is serialized and then deserialized). Your issue is caused by modifying the object returned from the state store: stateStore.get(key).put(value.getSessionId(), value) and stateStore.get(key).remove(value.getSessionId()) only change that deserialized copy; the change is never persisted back to the state store.
To fix the issue, calculate the required value (in your case the HashMap) and only then call stateStore.put(key, calculatedValue). If you need to remove a key-value pair from the state store, use stateStore.put(key, null). Your transform method should look approximately like this:
public KeyValue<String, String> transform(String key, ConnectionEvent value) {
    Map<String, ConnectionEvent> valueFromStateStore = stateStore.get(key);
    // ofNullable is java.util.Optional.ofNullable (statically imported);
    // fall back to a fresh, mutable HashMap so the put/remove calls below are allowed
    Map<String, ConnectionEvent> valueToUpdate = ofNullable(valueFromStateStore).orElseGet(HashMap::new);
    KeyValue<String, String> resultKeyValue = null;
    // Send null if there are no updates to be sent to downstream processors
    if (value.getState() == WebSocketConnection.CONNECTED) {
        if (valueToUpdate.isEmpty()) {
            resultKeyValue = KeyValue.pair(key, "Online");
        }
        valueToUpdate.put(value.getSessionId(), value);
    } else {
        valueToUpdate.remove(value.getSessionId());
        if (valueToUpdate.isEmpty()) {
            resultKeyValue = KeyValue.pair(key, "Offline");
        }
    }
    stateStore.put(key, valueToUpdate);
    return resultKeyValue;
}

Kafka Stream fixed window not grouping by key

I have a single Kafka stream. How can I accumulate messages for a specific time window, irrespective of the key?
My use case is to write a file every 10 minutes from the stream, not considering the key.
You'll need to use a Transformer with a state store and schedule a punctuation call to go through the store every 10 minutes and emit the records. The transformer should return null as you are collecting the records in the state store, so you'll also need a filter after the transformer to ignore any null records.
Here's a quick example of something I think is close to what you are asking for. Let me know how it goes.
class WindowedTransformerExample {

    public static void main(String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        final String stateStoreName = "stateStore";
        final StoreBuilder<KeyValueStore<String, String>> keyValueStoreBuilder =
            Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(stateStoreName),
                Serdes.String(),
                Serdes.String());
        builder.addStateStore(keyValueStoreBuilder);

        builder.<String, String>stream("topic").transform(new WindowedTransformer(stateStoreName), stateStoreName)
            .filter((k, v) -> k != null && v != null)
            // Here's where you do something with records emitted after 10 minutes
            .foreach((k, v) -> System.out.println());
    }

    static final class WindowedTransformer implements TransformerSupplier<String, String, KeyValue<String, String>> {

        private final String storeName;

        public WindowedTransformer(final String storeName) {
            this.storeName = storeName;
        }

        @Override
        public Transformer<String, String, KeyValue<String, String>> get() {
            return new Transformer<String, String, KeyValue<String, String>>() {

                private KeyValueStore<String, String> keyValueStore;
                private ProcessorContext processorContext;

                @Override
                public void init(final ProcessorContext context) {
                    processorContext = context;
                    keyValueStore = (KeyValueStore<String, String>) context.getStateStore(storeName);
                    // could change this to PunctuationType.STREAM_TIME if needed
                    context.schedule(Duration.ofMinutes(10), PunctuationType.WALL_CLOCK_TIME, (ts) -> {
                        try (final KeyValueIterator<String, String> iterator = keyValueStore.all()) {
                            while (iterator.hasNext()) {
                                final KeyValue<String, String> keyValue = iterator.next();
                                processorContext.forward(keyValue.key, keyValue.value);
                            }
                        }
                    });
                }

                @Override
                public KeyValue<String, String> transform(String key, String value) {
                    if (key != null) {
                        keyValueStore.put(key, value);
                    }
                    return null;
                }

                @Override
                public void close() {
                }
            };
        }
    }
}
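The example builds the topology but does not start it; to actually run it you would create and start a KafkaStreams instance at the end of main, roughly like this (the application id and bootstrap servers below are assumptions, not part of the original answer):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-transformer-example"); // assumed app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");            // assumed broker

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// close the application cleanly on shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));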

How to achieve the SQL equivalent of select a, count(distinct(b)) from x group by a with Kafka Streams

My idea is:
First groupByKey, so that the (ip, device) pair is unique, then map [ip, device] so that ip is the key and device is the value.
Then groupByKey again; I think the count value is then the number of devices corresponding to each ip.
The Kafka records are:
key value(ip,deviceId)
1 127.0.0.1,aa-bb-cc
2 127.0.0.1,aa-bb-cc
3 127.0.0.1,aa-bb-cc
... (more records, but all values are 127.0.0.1,aa-bb-cc)
I want to get the number of deviceIds owned by each ip within a hopping time window.
Code:
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> records = builder.stream(topic);

KStream<String, String> formattedRecords = records.map(new KeyValueMapper<String, String, KeyValue<String, String>>() {
    @Override
    public KeyValue<String, String> apply(String key, String value) {
        return new KeyValue<>(value, key);
    }
});

formattedRecords.groupByKey().windowedBy(TimeWindows.of(1000 * 60).advanceBy(1000 * 6).until(1000 * 60)).count().toStream(new KeyValueMapper<Windowed<String>, Long, String>() {
    @Override
    public String apply(Windowed<String> key, Long value) {
        return key.toString();
    }
}).map(new KeyValueMapper<String, Long, KeyValue<String, String>>() {
    @Override
    public KeyValue<String, String> apply(String key, Long value) {
        String[] keys = key.split(",");
        return new KeyValue<>(keys[0], keys[1]);
    }
}).groupByKey().windowedBy(TimeWindows.of(1000 * 60).advanceBy(1000 * 6).until(1000 * 60)).count().toStream(new KeyValueMapper<Windowed<String>, Long, String>() {
    @Override
    public String apply(Windowed<String> key, Long value) {
        return key.toString();
    }
}).map(new KeyValueMapper<String, Long, KeyValue<String, String>>() {
    @Override
    public KeyValue<String, String> apply(String key, Long value) {
        return new KeyValue<>(key, "" + value);
    }
}).to("topic");
The expected result for each time window is:
key value
127.0.0.1#1543495068000/1543495188000 1
127.0.0.1#1543495074000/1543495194000 1
127.0.0.1#1543495080000/1543495200000 1
But my running result is:
127.0.0.1#1543495068000/1543495188000 3
127.0.0.1#1543495074000/1543495194000 4
127.0.0.1#1543495080000/1543495200000 1
Why is that?
I am looking forward to someone helping me.
There are two windowed aggregations in your code, and that may be the cause of the issue. I'd propose this flow:
records.map((key, value) -> {
        String[] data = value.split(",");
        return KeyValue.pair(data[0], data[1]);
    })
    .groupByKey() // by IP
    .windowedBy(TimeWindows.of(1000 * 60).advanceBy(1000 * 6).until(1000 * 60))
    .reduce((device1, device2) -> device1 + "|" + device2)
    .toStream() // stream of lists of devices per IP in window
    .mapValues(devices -> new HashSet<>(Arrays.asList(devices.split("\\|")))) // set of distinct devices ("|" must be escaped in the regex)
    .mapValues(set -> String.valueOf(set.size()))
The resulting KStream is a windowed stream of (IP, count(distinct(devices))) (both strings), so you can forward it to another topic, as sketched below. The approach assumes there is one character that never occurs in device names (here |); if there is no such character, you would need to change the serialization method.
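A possible continuation (not from the original answer; the output topic name is a placeholder) that flattens the windowed key into the ip#start/end form shown in the question and writes the counts out:

// append to the chain above, after the last mapValues
.map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key() + "#" + windowedKey.window().start() + "/" + windowedKey.window().end(),
        count))
.to("distinct-device-counts", Produced.with(Serdes.String(), Serdes.String()));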