Kafka Streams: pipe one topic into another - apache-kafka

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions of the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
.stream<Any, Any>(inputTopic)
.to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition of the input record?
I checked the Processor API that allows to access the partition of the input record through a ProcessorContext but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.

Produced (that is one of the KStream::to arguments) has StreamPartitioner as one of its member.
You could try following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In above code only ByteArray Serdes are used so any special serialization or deserialization happens.

Firstly, messages are distributed among partitions based on Key. A message with similar key would always go in the same partition.
So if your messages have keys then you don't need to worry about it at all. As long as you have similar number of partitions as your original topic; it would be taken care of.
Secondly, if you are copying data to another topic as it is then you should consider using the original topic instead. Kafka has notion of consumer-groups.
For example, you have a topic 'transactions' then you can have consumer-groups i.e. 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. Consumer-groups would read the same topic and filter out events that are meaningful to them and process them.
You can also create 3 topics and achieve the same result. Though, it's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.

Related

Get output record partition within Kafka Streams

I have a KStream which branches and writes output records into different topics based on some internal logic. Is there any way I can know the partition of the output message from inside the KStream? The output topics have different number of partitions from the input ones.
When using the high-level DSL, you don't have access to the record metadata (which holds a key/value pair on specific partition the record came from). So you won't be able to achieve the goal using KStream implementation.
You could use the low-level processor API if you wanted, which would allow access to the metadata. Otherwise, you can add an implementation of ConsumerInterceptor, and map the partition value to the message before it goes to the consumer.

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And as the group.id, which is based on the (for all instances identical) application.id, that means that every instance sees only parts of a topic.
This all makes perfect sense of course, and we make use of that with the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to get those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
one control message per key (so every active partition would be getting at least one control message)
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every partition receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However this sounds complex and dirty enough to wonder if there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation,
Adds a global StateStore to the topology. The StateStore sources its
data from all partitions of the provided input topic. There will be
exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
String topic,
Consumed consumed,
ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier which has a get() that returns a Processor that will be executed for every new message. The Processor contains the process() method that will be executed every time there is a new message to the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent and can be backed by a changelog topic, so that even if the streams instance local data (state) is deleted, the store can be built using the changelog topic.

How to understand Kafka streams?

I am following Kafka streams documentation and I am getting confused in some concepts which I want to clarify here.
https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html
On reading flatMap mentioned in documentation, that it takes one record and produces zero, one, or more records. You can modify the record keys and values too. it also marks the data for re-partitioning.
Questions:
1) What does it mean by re-partitioning, will it re-partition data for a new topic, where I am going to write transformed results or will it re-partition data in the same topic, where from I started streaming?
2) If in case old topic data is getting re-partitioned, does that mean transformed results are getting written to that topic too?
For example:
KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
// Here, we generate two output records for each input record.
// We also change the key and value types.
// Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
(key, value) -> {
List<KeyValue<String, Integer>> result = new LinkedList<>();
result.add(KeyValue.pair(value.toUpperCase(), 1000));
result.add(KeyValue.pair(value.toLowerCase(), 9000));
return result;
}
);
In this example, it is taking one record and generating two records, does this mean that my topic from which I have started streaming, will have 3 records now, one with key 345L and two with HELLO. If I put transformed result to a new topic or a store, what would be state of old and new topic then. Would both the tables will contain all 3 records. I am novice.
This is a transformed result. So, when you read from a topic, you don't change the source topic. However, when you write to another topic, your output sink topic will have 2 values.
When it says it marks for repartitioning, it will mark the result for repartitioning and when you write to sink topic, it will have to repartition. It doesn't repartition the source topic. Think about why?
If you're continuously reading from source topic, will it continuously repartition the source topic? So, that's not practical option.
I hope this clarifies your question.
Re-partitioning in Kafka Steams means that the records are send to an intermediate topic before a processor and then the processor reads the records from the intermediate topic. By sending the records to an intermediate topic the records are re-partitioned.
This is needed, for example with join processors. A join processor in Kafka Streams requires that all keys of one partition are processed by the same task to ensure correctness. This would not be true, if an upstream processor modified the keys of the records as in your example the flatMap(). Besides joins also aggregations require that all keys of one partition are processed by the same task. Re-partitioning does not write anything to the input or output topic of your streams application and you should usually not need to care about intermediate topics.
However, what you can do is avoid re-partitionings where possible by using *Values() operators like flatMapValues() if you do not change the key of the records. For example, if you use flatMap() and you do not change the keys of the record, the records will be nevertheless re-partitioned although it would not be needed. Kafka Streams cannot know that you did not touch the key if you do not use flatMapValues().

How to check which partition is a key assign to in kafka?

I am trying to debug a issue for which I am trying to prove that each distinct key only goes to 1 partition if the cluster is not rebalancing.
So I was wondering for a given topic, is there a way to determine which partition a key is send to?
As explained here or also in the source code
You need the byte[] keyBytes assuming it isn't null, then using org.apache.kafka.common.utils.Utils, you can run the following.
Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
For strings or JSON, it's UTF8 encoded, and the Utils class has helper functions to get that.
For Avro, such as Confluent serialized values, it's a bit more complicated (a magic byte, then a schema ID, then the data). See Wire format
In Kafka Streams API, You should have a ProcessorContext available in your Processor#init , which you can store a reference to and then access in your Processor#process method, such as ctx.recordMetadata.get().partition() (recordMetadata returns an Optional)
only goes to 1 partition
This isn't a guarantee. Hashes can collide.
It makes more sense to say that a given key isn't in more than one partition.
if the cluster is not rebalancing
Rebalancing will still preserve a partition value.
when you send message,
Partitions are determined by the following classes
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
If you want change logics, implement org.apache.kafka.clients.producer.Partitioner interface and,
set ProduceConfig's 'partitioner.class'
reference docuement :
https://kafka.apache.org/documentation/#producerconfigs

What's the purpose of Kafka's key/value pair-based messaging?

All of the examples of Kafka | producers show the ProducerRecord's key/value pair as not only being the same type (all examples show <String,String>), but the same value. For example:
producer.send(new ProducerRecord<String, String>("someTopic", Integer.toString(i), Integer.toString(i)));
But in the Kafka docs, I can't seem to find where the key/value concept (and its underlying purpose/utility) is explained. In traditional messaging (ActiveMQ, RabbitMQ, etc.) I've always fired a message at a particular topic/queue/exchange. But Kafka is the first broker that seems to require key/value pairs instead of just a regulare 'ole string message.
So I ask: What is the purpose/usefulness of requiring producers to send KV pairs?
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows to scale-out the system.
Keys are used to determine the partition within a log to which a message get's appended to. While the value is the actual payload of the message. The examples are actually not very "good" with this regard; usually you would have a complex type as value (like a tuple-type or a JSON or similar) and you would extract one field as key.
See: http://kafka.apache.org/intro#intro_topics and http://kafka.apache.org/intro#intro_producers
In general the key and/or value can be null, too. If the key is null a random partition will the selected. If the value is null it can have special "delete" semantics in case you enable log-compaction instead of log-retention policy for a topic (http://kafka.apache.org/documentation#compaction).
Late addition... Specifying the key so that all messages on the same key go to the same partition is very important for proper ordering of message processing if you will have multiple consumers in a consumer group on a topic.
Without a key, two messages on the same key could go to different partitions and be processed by different consumers in the group out of order.
Another interesting use case
We could use the key attribute in Kafka topics for sending user_ids and then can plug in a consumer to fetch streaming events (events stored in value attributes). This could allow you to process any max-history of user event sequences for creating features in your machine learning models.
I still have to find out if this is possible or not. Will keep updating my answer with further details.