How to understand Kafka Streams? - apache-kafka

I am following the Kafka Streams documentation and I am getting confused about some concepts, which I want to clarify here.
https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html
Reading about flatMap in the documentation, I see that it takes one record and produces zero, one, or more records, and that you can modify the record keys and values too. It also marks the data for re-partitioning.
Questions:
1) What does re-partitioning mean? Will it re-partition data into a new topic, where I am going to write the transformed results, or will it re-partition data in the same topic from which I started streaming?
2) If the old topic's data is getting re-partitioned, does that mean the transformed results are written to that topic too?
For example:
KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
    // Here, we generate two output records for each input record.
    // We also change the key and value types.
    // Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
    (key, value) -> {
        List<KeyValue<String, Integer>> result = new LinkedList<>();
        result.add(KeyValue.pair(value.toUpperCase(), 1000));
        result.add(KeyValue.pair(value.toLowerCase(), 9000));
        return result;
    }
);
In this example, it takes one record and generates two records. Does this mean that the topic from which I started streaming will now have 3 records: one with key 345L and the two new ones? If I write the transformed result to a new topic or a store, what would the state of the old and new topics be then? Would both contain all 3 records? I am a novice.

This is a transformed result. So, when you read from a topic, you don't change the source topic. However, when you write to another topic, your output sink topic will have the 2 transformed records.
When the documentation says flatMap marks the data for repartitioning, it marks the result for repartitioning, and when you write to the sink topic, it will have to repartition. It does not repartition the source topic. Think about why:
if you are continuously reading from the source topic, would it continuously repartition the source topic? That is not a practical option.
I hope this clarifies your question.
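For illustration, writing the transformed stream to a sink topic might look like this (a minimal sketch; the topic name "transformed-output" is an assumption, and the types follow the question's example):

transformed.to("transformed-output", Produced.with(Serdes.String(), Serdes.Integer()));

The source topic still holds only the original (345L, "Hello") record; it is "transformed-output" that receives the two new records.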

Re-partitioning in Kafka Streams means that the records are sent to an intermediate topic before a processor, and the processor then reads the records from that intermediate topic. By sending the records through an intermediate topic, the records are re-partitioned.
This is needed, for example, with join processors. A join processor in Kafka Streams requires that all records with the same key are processed by the same task to ensure correctness. This would not be guaranteed if an upstream processor modified the keys of the records, as flatMap() does in your example. Besides joins, aggregations also require that all records with the same key are processed by the same task. Re-partitioning does not write anything to the input or output topics of your Streams application, and you usually do not need to care about the intermediate topics.
However, what you can do is avoid re-partitioning where possible by using the *Values() operators, such as flatMapValues(), when you do not change the keys of the records. For example, if you use flatMap() and do not change the keys, the records will nevertheless be re-partitioned even though it is not needed; Kafka Streams cannot know that you did not touch the keys unless you use flatMapValues().
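To make the *Values() point concrete, here is a sketch of the same "two records per input" idea done with flatMapValues() (types follow the question's example); the Long keys are kept, so nothing is marked for re-partitioning:

KStream<Long, String> valuesOnly = stream.flatMapValues(
    // two output values per input value, key left untouched
    value -> Arrays.asList(value.toUpperCase(), value.toLowerCase())
);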

Related

Get output record partition within Kafka Streams

I have a KStream which branches and writes output records into different topics based on some internal logic. Is there any way I can know the partition of the output message from inside the KStream? The output topics have a different number of partitions than the input ones.
When using the high-level DSL, you don't have access to the record metadata (which holds, among other things, the specific partition a record came from). So you won't be able to achieve this goal with the KStream implementation alone.
You could use the low-level Processor API if you wanted, which allows access to that metadata. Otherwise, you can add an implementation of ConsumerInterceptor and map the partition value onto the message before it reaches your consumer logic.
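A hedged sketch of the ConsumerInterceptor idea (the class name and type parameters are placeholders, not taken from the question): the downstream consumer can see which partition of the output topic each record landed on and log or enrich it before the application code processes the record.

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PartitionLoggingInterceptor implements ConsumerInterceptor<String, String> {

    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // record.partition() is the partition of the output topic this record was written to
            System.out.printf("topic=%s partition=%d key=%s%n",
                    record.topic(), record.partition(), record.key());
        }
        return records;
    }

    @Override public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }
    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}

The interceptor is registered on the consumer via the interceptor.classes consumer property.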

Reading an already partitioned topic in Kafka Streams DSL

Repartitioning a high-volume topic in Kafka Streams could be very expensive. One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in the Streams app.
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Let me clarify my question. Suppose I have a simple aggregation like this (details omitted for brevity):
builder
    .stream("messages")
    .groupBy((key, msg) -> msg.field)
    .count();
Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartitioning topic, this time partitioned by msg.field as the key.
One simple way to render this round-trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about the messages topic's partitioning, and I've found no way to tell it how the topic is partitioned without causing a real repartition.
Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.
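For example, on the producer side I could key the records by msg.field myself; a rough sketch (the Message type, the producer instance, and the serializer configuration are placeholders):

// key every record by msg.field, so the "messages" topic is already partitioned the way the aggregation needs
ProducerRecord<String, Message> record = new ProducerRecord<>("messages", msg.field, msg);
producer.send(record);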
What I'm looking for is basically something like this:
builder
    .stream("messages")
    .assumeGroupedBy((key, msg) -> msg.field)
    .count();
where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is somewhat fragile and would break on a partitioning key mismatch, but it solves one of the problems of processing really large volumes of data.
Update after question was updated: If your data is already partitioned as needed, and you simply want to aggregate the data without incurring a repartitioning operation (both are true for your use case), then all you need is to use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed as per the existing message key. In your example, groupByKey() would work if key == msg.field.
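A sketch of that variant (assuming the producers already use msg.field as the record key when writing to "messages"):

builder
    .stream("messages")    // records are already keyed by msg.field
    .groupByKey()          // no repartition topic is created
    .count();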
Original answer below:
Repartitioning a high-volume topic in Kafka Streams could be very expensive.
Yes, that's right: it could be very expensive (e.g., when high volume means millions of events per second).
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Kafka Streams does not repartition the data unless you instruct it to, e.g., with the KStream#groupBy() method. Hence there is no need to tell it "not to repartition" as you put it in your question.
One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in Streams app.
Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?

How Kafka streams handle distributed data

I have tried to go through various tutorials but am not clear on two aspects of Kafka Streams.
Let's take the word-count example mentioned in:
https://docs.confluent.io/current/streams/quickstart.html
// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-plaintext-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream("streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words. The text lines are the message
    // values, i.e. we can ignore whatever data is in the message keys and thus invoke
    // `flatMapValues` instead of the more generic `flatMap`.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // We use `groupBy` to ensure the words are available as message keys
    .groupBy((key, value) -> value)
    // Count the occurrences of each word (message key).
    .count();

// Convert the `KTable<String, Long>` into a `KStream<String, Long>` and write to the output topic.
wordCounts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));
Couple of questions here:
1.) Since there are no keys in the original stream, two occurrences of the same word can end up at two different nodes, as they might fall into different partitions, and hence the true count would have to be the aggregation from both of them. That does not seem to be done here? Do different nodes serving the same topic's partitions coordinate here to aggregate the count?
2.) As a new stream is generated by each operation (e.g. flatMapValues, groupBy, etc.), are the partitions recalculated for the messages in these substreams so that they end up on different nodes?
Will appreciate any help here!
1.) Since there are no keys in the original stream, two occurrences of the same word can end up at two different nodes, as they might fall into different partitions, and hence the true count would have to be the aggregation from both of them. That does not seem to be done here?
It is done here. This is the relevant code:
// We use `groupBy` to ensure the words are available as message keys
.groupBy((key, value) -> value)
Here, "words" become the new message key, which means that words are re-partitioned so that each word is put into one partition only.
Do different nodes serving the same topic's partitions coordinate here to aggregate the count?
No, they don't. A partition is processed by one node only (more precisely: by one stream task only, see below).
2.) As a new stream is generated by each operation (e.g. flatMapValues, groupBy, etc.), are the partitions recalculated for the messages in these substreams so that they end up on different nodes?
Not sure I understand your question, notably the "recalculated" comment. Operations (like aggregations) are always performed per partition, and Kafka Streams maps partitions to stream tasks (slightly simplified: a partition is always processed by one and only one stream task). Stream tasks are executed by the various instances of your Kafka Streams application, which typically run on different containers/VMs/machines. If need be, data will be re-partitioned (see question #1 and the answer above) for an operation to produce the expected result, which is perhaps what you mean by "recalculated".
I'd suggest reading Kafka's documentation, such as https://kafka.apache.org/documentation/streams/architecture#streams_architecture_tasks.
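To make the "instances and tasks" part concrete, here is a hedged sketch (the property names are real Kafka Streams settings, the values are placeholders): starting this same program on two machines with the same application.id splits the input partitions, and therefore the stream tasks, between the two instances.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");     // shared by all instances of the app
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);               // each instance runs its tasks on these threads

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();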

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another one with a different name. The topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions as the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
    .stream<Any, Any>(inputTopic)
    .to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition as the input record?
I checked the Processor API, which allows access to the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (which is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
    .to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(),
        (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions)));
In the above code only ByteArray serdes are used, so no special serialization or deserialization happens.
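Note that calculatePartition() above is just a placeholder; one way to fill it in (purely hypothetical, since your custom partitioner's logic is not shown) is to delegate to the same logic your producers use:

// Hypothetical helper: reuse the producers' partitioning logic so records keep
// landing on the same partition number. MyCustomPartitionerLogic is a placeholder.
private static Integer calculatePartition(String topic, byte[] key, byte[] value, int numPartitions) {
    return MyCustomPartitionerLogic.partitionFor(topic, key, value, numPartitions);
}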
Firstly, messages are distributed among partitions based on the key. Messages with the same key always go to the same partition.
So if your messages have keys, then you don't need to worry about it at all. As long as you have the same number of partitions as your original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, then you should consider using the original topic instead. Kafka has the notion of consumer groups.
For example, if you have a topic 'transactions', then you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor', and so on. Each consumer group would read the same topic, filter out the events that are meaningful to it, and process them.
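A sketch of one such consumer as a Streams app (the topic names and the predicate are placeholders; the application.id acts as the consumer group):

builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
    // keep only the events this processor cares about
    .filter((key, txn) -> txn.startsWith("CREDIT_CARD"))
    .foreach((key, txn) -> handleCreditCardEvent(txn));   // hypothetical handler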
You could also create 3 topics and achieve the same result, though that is not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.

How to store only the latest key values in a Kafka topic

I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
I thought a KTable's whole purpose was to store the latest value for a given key rather than storing the whole stream of events. However, I can't seem to get this to work. Running the code below produces the key store, but that store (maintopiclatest) has a stream of events in it (not just the latest values). So if I send the same 1000 records to the topic twice, rather than seeing 1000 records, I see 2000 records.
var serializer = new KafkaSpecificRecordSerializer();
var deserializer = new KafkaSpecificRecordDeserializer();

var stream = kStreamBuilder.stream("maintopic",
    Consumed.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));

var table = stream
    .groupByKey()
    .reduce((aggV, newV) -> newV, Materialized.as("maintopiclatest"));
The other problem is that if I want to store the KTable in a new topic, I'm not sure how to do that. In order to do that, it seems that I have to turn it back into a stream so that I can call ".to" on it. But then that stream has the whole sequence of events in it, not just the latest values.
This is not how a KTable works.
A KTable itself has an internal state store and stores exactly one record per key. However, a KTable is constantly updated and is subject to the so-called stream-table duality. Each update to the KTable is sent downstream as a changelog record: https://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables. Thus, each input record results in an output record.
Because it's stream processing, there is no final "latest value per key".
I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
At which point in time do you want a KTable to emit an update? There is no answer to this question because the input stream is conceptually infinite.
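That said, if the end goal is a topic that eventually keeps only the newest value per key, one practical approach (just a sketch, reusing the serdes from the question's code; the output topic name is an assumption) is to write the KTable's changelog to a log-compacted topic:

// Every update still produces a record, but if "maintopiclatest-output" is created
// with cleanup.policy=compact, Kafka eventually retains only the latest record per key.
table.toStream()
    .to("maintopiclatest-output",
        Produced.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));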