Kafka streams join: How to wait a duration of time before emitting records? - apache-kafka

We currently have two Kafka topics with records coming in continuously. We want to join the two streams on a key within a 5-minute window, but with my current code I see records being emitted immediately, without "waiting" to see whether a matching record arrives in the other stream. My current implementation:
KStream<String, String> streamA =
builder.stream(topicA, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("Stream A incoming record key " + key + " value " + value));
KStream<String, String> streamB =
builder.stream(topicB, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("Stream B incoming record key " + key + " value " + value));
ValueJoiner<String, String, String> recordJoiner =
(recordA, recordB) -> {
if(recordA != null) {
return recordA;
} else {
return recordB;
}
};
KStream<String, String> combinedStream =
streamA.join(
streamB,
recordJoiner,
JoinWindows
.of(Duration.ofMinutes(5)),
StreamJoined.with(
Serdes.String(),
Serdes.String(),
Serdes.String()))
.peek((key, value) -> System.out.println("Stream-Stream Join record key " + key + " value " + value));
combinedStream.to("test-topic",
Produced.with(
Serdes.String(),
Serdes.String()));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsConfiguration);
kafkaStreams.start();
Although I have the JoinWindows.of(Duration.ofMinutes(5)), I see some records being emitted immediately. How do I ensure they are not?
Additionally, is this the most efficient way of joining two Kafka streams, or would it be better to write our own consumer implementation that reads from the two topics?

Related

Why are all producer messages sent to one partition?

I have a topic first_topic with 3 partitions.
Here is my code. I send 55 messages to a consumer (which is running in cmd), and the code below logs the partition to which each message was sent. Every time I launch the code, all the messages go to one partition only (which is picked at random); it may be partition 0, 1 or 2.
Why doesn't round-robin work here? I do not specify a key, so I expect it to apply.
Logger logger = LoggerFactory.getLogger(Producer.class);
Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
for (int i = 0; i < 55; i++) {
producer.send(new ProducerRecord<>("first_topic", "Hello world partitionCheck " + i), (recordMetadata, e) -> {
// executes every time a record is sent or Exception is thrown
if (e == null) {
// record was successfully sent
logger.info("metaData: " +
"topic " + recordMetadata.topic() +
" Offset " + recordMetadata.offset() +
" TimeStamp " + recordMetadata.timestamp() +
" Partition " + recordMetadata.partition());
} else {
logger.error(e.toString());
}
});
}
producer.flush();
producer.close();
In theory, if you do not specify a custom partitioner, the producer uses the default partitioner according to the rules below:
If a partition is specified in the record, publish to that partition.
If no partition is specified but a key is present, choose a partition based on a hash of the key.
If neither a partition nor a key is present, choose a partition in a round-robin fashion.
Can you confirm how you are checking this? You said: "Every time I launch the code all the messages go to one partition only (that is picked randomly); it may be partition 0, 1 or 2."
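If the goal is simply to see records spread across the three partitions, the second rule above gives a reliable way to get that: give each record a key, and the default partitioner hashes the key to pick the partition. A minimal sketch, reusing the producer configuration from the question (the key format here is made up):
for (int i = 0; i < 55; i++) {
    // Hypothetical per-record key; any value that varies per record gets hashed
    // by the default partitioner and spreads the records across partitions.
    String key = "id_" + i;
    producer.send(new ProducerRecord<>("first_topic", key, "Hello world partitionCheck " + i),
            (recordMetadata, e) -> {
                if (e == null) {
                    logger.info("partition " + recordMetadata.partition() + " offset " + recordMetadata.offset());
                } else {
                    logger.error(e.toString());
                }
            });
}
producer.flush();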

Intermittent incorrect count from TimeWindowKStream in kafka streams

My intention with this topology is to window incoming messages, count them, and then send the count to another topic.
When I test this with a single key and one or more values on the input topic, I get inconsistent results. Sometimes the count is correct. Sometimes I send in a single message, see it at the first peek, and instead of a count of 1 I get some other value at the second peek and in the output topic. When I send multiple messages, the count is usually right, but sometimes off. I'm careful to send the messages inside the time window, so I don't think they're getting split into two windows.
Is there a flaw in my topology?
public static final String INPUT_TOPIC = "test-topic";
public static final String OUTPUT_TOPIC = "test-output-topic";
public static void buildTopo(StreamsBuilder builder) {
WindowBytesStoreSupplier store = Stores.persistentTimestampedWindowStore(
"my-state-store",
Duration.ofDays(1),
Duration.ofMinutes(1),
false);
Materialized<String, Long, WindowStore<Bytes, byte[]>> materialized = Materialized
.<String, Long>as(store)
.withKeySerde(Serdes.String());
Suppressed<Windowed> suppression = Suppressed
.untilWindowCloses(Suppressed.BufferConfig.unbounded());
TimeWindows window = TimeWindows
.of(Duration.ofMinutes(1))
.grace(Duration.ofSeconds(0));
// windowedKey has a string plus the kafka time window
builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("****key = " + key + " value= " + value))
.groupByKey()
.windowedBy(window)
.count(materialized)
.suppress(suppression)
.toStream()
.peek((key, value) -> System.out.println("key = " + key + " value= " + value))
.map((key, value) -> new KeyValue<>(key.key(), value))
.to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.Long()));
}
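One way to pin down what the topology emits, independent of broker timing, is to drive it with TopologyTestDriver and control record timestamps explicitly. A minimal sketch, assuming the buildTopo method and the INPUT_TOPIC/OUTPUT_TOPIC constants from the question are accessible (the application id, bootstrap address, and timestamps are placeholders):
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

public class WindowCountTopologyTest {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        buildTopo(builder); // the buildTopo method from the question, assumed accessible here

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "window-count-test"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted by the test driver
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
            TestInputTopic<String, String> input =
                    driver.createInputTopic(INPUT_TOPIC, new StringSerializer(), new StringSerializer());
            TestOutputTopic<String, Long> output =
                    driver.createOutputTopic(OUTPUT_TOPIC, new StringDeserializer(), new LongDeserializer());

            Instant start = Instant.parse("2024-01-01T00:00:00Z");
            // Two records for the same key inside one one-minute window.
            input.pipeInput("a", "v1", start);
            input.pipeInput("a", "v2", start.plusSeconds(10));
            // A later record advances stream time past the window close (grace is 0),
            // which is what lets the suppression buffer emit the final count for "a".
            input.pipeInput("b", "v3", start.plus(Duration.ofMinutes(2)));

            output.readKeyValuesToList().forEach(kv -> System.out.println(kv.key + " -> " + kv.value));
        }
    }
}
Note that with untilWindowCloses suppression and a grace of 0, nothing reaches the output topic until stream time, driven by newer record timestamps, passes the end of the window.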

Merge records in a Kafka stream

Is it possible to merge records in Kafka and publish the output to a different stream?
For example, there is a stream of events produced to a Kafka topic like below:
{txnId:1,startTime:0900},{txnId:1,endTime:0905},{txnId:2,endTime:0912},{txnId:3,endTime:0930},{txnId:2,startTime:0912},{txnId:3,startTime:0925}......
I want to merge these events by txnId and create the merged output like below
{txnId:1,startTime:0900,endTime:0905},{txnId:2,startTime:0910,endTime:0912},{txnId:3,startTime:0925,endTime:0930}
Please note that order is not maintained in the incoming events. So if the endTime event is received for a txnId before the startTime event, we need to wait until the startTime event is received for that txnId before initiating the merge.
I went through the word-count example that ships with Kafka Streams, but it's not clear how to wait for events and then merge them while doing the transformation.
Any thoughts are highly appreciated.
You could try solving this by splitting the start and end events into two separate streams keyed by txnId and then joining the two streams.
KStream<String, String> eventSource = new StreamsBuilder().stream("INPUT-TOPIC");
KStream<String, JsonNode>[] splitEvents =
eventSource.map((key, eventString) -> {
try {
// readTree throws a checked exception, so wrap it for use inside the lambda
JsonNode event = new ObjectMapper().readTree(eventString);
String txnId = event.path("txnId").asText();
return KeyValue.pair(txnId, event);
} catch (Exception e) {
throw new RuntimeException("Invalid event JSON: " + eventString, e);
}
})
.branch((key, event) -> event.findValue("startTime") != null,
(key, event) -> event.findValue("endTime") != null);
KStream<String, JsonNode> startEvents = splitEvents[0];
KStream<String, JsonNode> endEvents = splitEvents[1];
A join between the two streams as shown will produce a join result once matching events have arrived on both sides of the join within the window, so the order in which the start and end events arrive won't matter (you will have to ensure that you set an appropriate window period for the join).
Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());
KStream<String, String> completeEvents = startEvents.join(endEvents,
(startEvent, endEvent) -> {
// Add logic to merge startEvent and endEvent as seen fit
ObjectNode completeEvent = JsonNodeFactory.instance.objectNode();
completeEvent.put("startTime", startEvent.path("startTime).asText());
completeEvent.put("endTime", endEvent.path("endTime").asText());
return completeEvent.toString();
},
JoinWindows.of(Duration.ofMinutes(15)),
Joined.with(
Serdes.String(), // key
jsonSerde, // left object
jsonSerde // right object
)
);
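To publish the merged output to a different stream, as asked, the joined completeEvents stream can then be written out; the output topic name below is a placeholder:
// Placeholder output topic; the values are the merged JSON strings built above.
completeEvents.to("merged-events", Produced.with(Serdes.String(), Serdes.String()));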

Not able to query local key-value store (Kafka Streams)

I am working on a use case where I need to query a KTable (using the local key-value store approach). My sample data present in the topic:
A,Blue
A,Blue
A,Yellow
A,Red
A,Yellow
A,Yellow
B,Blue
C,Red
C,Red
B,Blue
Based on the input, I want to generate the following output and store it in a topic:
A Blue:2,Yellow:3,Red:1
B Blue:2
C Red:2
Approach:
1) First, I performed a count operation by reading the topic data into a KStream.
//set the properties for interactive queries
props.put(StreamsConfig.APPLICATION_SERVER_CONFIG,"localhost:9092" );
props.put(StreamsConfig.STATE_DIR_CONFIG, "D:\\Kafka_data\\Local_store");
//read the user input from Kafka topic: data
final KStream<String,String> userDataSource = builder.stream("data");
final KGroupedStream<String,String> inputData = userDataSource.
map((key, value) -> new KeyValue<>(value.split(",")[0].toString() + "_"+ value.split(",")[1].toString() , value.split(",")[1].toString()) )
.selectKey((s, s2) -> s)
.groupByKey(Grouped.with(Serdes.String(),Serdes.String()));
final KTable<String,Long> inputAggregationResult = inputData.count();
Result of the above code:
A_Blue 1
A_Yellow 1
A_Red 1
A_Yellow 2
A_Yellow 3
B_Blue 1
C_Red 1
C_Red 2
B_Blue 2
2) Then I store the result in a topic:
inputAggregationResult.toStream().to("input-data-aggregation", Produced.with(Serdes.String(), Serdes.Long()));
3) Now I read the data from the topic (input-data-aggregation) as a KTable so that I can query it.
final StreamsBuilder builder = new StreamsBuilder();
KTable<String, Object> ktableInformation = builder.table("input-data-aggregation", Materialized.<String, Object, KeyValueStore<Bytes, byte[]>>as("CountsValueStore"));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();
streams.start();
ReadOnlyKeyValueStore<String, Object> keyValueStore;
Map<String,Object> information = new LinkedHashMap<String,Object>();
while (true) {
try {
// Get the key-value store CountsKeyValueStore
keyValueStore =
streams.store(ktableInformation.queryableStoreName(), QueryableStoreTypes.keyValueStore());
//read the value
KeyValueIterator<String, Object> range = keyValueStore.range("all", "streams");
while (range.hasNext()) {
KeyValue<String, Object> next = range.next();
information.put(next.key,next.value);
System.out.println("count for " + next.key + ": " + next.value);
}
// close the iterator to release resources
range.close();
} catch (InvalidStateStoreException ignored) {
ignored.printStackTrace();
}
}
4) When I try to query the data, it returns an empty result (no output is printed).
Can someone guide me on whether I missed any step in querying local key-value stores, or suggest another way to achieve the target output? I have verified that Kafka is writing the local key-value store data on my local instance, but reading (querying) the data gives an empty result.
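A minimal sketch of an alternative for step 3, under two assumptions worth double-checking: the aggregation topic holds String keys and Long counts, so the value serde is declared explicitly, and the store is iterated with all() rather than a lexicographic range such as range("all", "streams"), whose bounds would not bracket keys like "A_Blue":
// Sketch only: explicit serdes for the aggregation topic (String keys, Long counts).
KTable<String, Long> ktableInformation = builder.table(
        "input-data-aggregation",
        Consumed.with(Serdes.String(), Serdes.Long()),
        Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("CountsValueStore"));
// Build and start KafkaStreams as in the question, then query the store.
ReadOnlyKeyValueStore<String, Long> keyValueStore =
        streams.store(ktableInformation.queryableStoreName(), QueryableStoreTypes.keyValueStore());
// Iterate every entry instead of a bounded range, and close the iterator when done.
try (KeyValueIterator<String, Long> all = keyValueStore.all()) {
    while (all.hasNext()) {
        KeyValue<String, Long> next = all.next();
        System.out.println("count for " + next.key + ": " + next.value);
    }
}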

How to read data using key in Kafka Consumer API?

I'm constructing messages using the code below...
Producer<String, String> producer = new kafka.javaapi.producer.Producer<String, String>(producerConfig);
KeyedMessage<String, String> keyedMsg = new KeyedMessage<String, String>(topic, "device-420", "{message:'hello world'}");
producer.send(keyedMsg);
And consuming using the following code block...
//Key = topic name, Value = No. of threads for topic
Map<String, Integer> topicCount = new HashMap<String, Integer>();
topicCount.put(topic, 1);
//ConsumerConnector creates the message stream for each topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = consumerConnector.createMessageStreams(topicCount);
// Get Kafka stream for topic
List<KafkaStream<byte[], byte[]>> kStreamList = consumerStreams.get(topic);
// Iterate stream using ConsumerIterator
for (final KafkaStream<byte[], byte[]> kStreams : kStreamList) {
ConsumerIterator<byte[], byte[]> consumerIte = kStreams.iterator();
while (consumerIte.hasNext()) {
MessageAndMetadata<byte[], byte[]> msg = consumerIte.next();
System.out.println(topic.toUpperCase() + ">"
+ " Partition:" + msg.partition()
+ " | Key:"+ new String(msg.key())
+ " | Offset:" + msg.offset()
+ " | Message:"+ new String(msg.message()));
}
}
Everything is working fine because I'm reading data topic-wise. So I want to know: is there any way to consume data by message key, i.e. device-420 in this example?
Short answer: no.
The smallest granularity in Kafka is a partition. You can write a client that reads only from a single partition. However, a partition can contain multiple keys and you need to consume all the keys contained in this partition.
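A minimal sketch of that approach with the current consumer API (the broker address, topic name, and partition number are placeholders; the key comes from the question): assign the single partition the key was written to, then filter client-side for the key you care about.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SingleKeyReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign only the partition the key hashes to (0 here is a placeholder; the
            // producer callback or console tools can tell you which partition it actually is).
            TopicPartition partition = new TopicPartition("my-topic", 0); // placeholder topic
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The partition still contains other keys, so filter client-side.
                    if ("device-420".equals(record.key())) {
                        System.out.println(record.offset() + ": " + record.value());
                    }
                }
            }
        }
    }
}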