Kafka Streams: one record to multiple records - apache-kafka

Given: I have two topics in Kafka, let's say topic A and topic B. The Kafka Streams application reads a record from topic A, processes it, and produces multiple records (let's say recordA and recordB) corresponding to the consumed record. The question is: how can I achieve this using Kafka Streams?
KStream<String, List<Message>>[] producerStreams = recordStream.mapValues(new ValueMapper<Message, List<Message>>() {
    @Override
    public List<Message> apply(final Message message) {
        return consumerRecordHandler.process(message);
    }
}).*someFunction*()
Here, the record read is a Message; after processing, it returns a list of Message. How can I divide this list into two producer streams? Any help will be appreciated.

I am not sure if I understand the question correctly, and I also don't understand the answer from @Abhishek :(
If you have an input stream and you want to get zero, one, or more output records per input record, you would apply flatMap() or flatMapValues() (depending on whether you want to modify the key or not).
You are also asking about "How can I divide this list to two producer streams?" If you mean to split one stream into multiple, you can use branch().
For more details, I refer to the docs:
https://docs.confluent.io/platform/current/streams/developer-guide/dsl-api.html#stateless-transformations
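As a minimal sketch (assuming a Message type and a consumerRecordHandler with a process() method as in the question, plus a hypothetical isTypeA() predicate and illustrative topic names), the list can be flattened with flatMapValues() and then split with branch():
KStream<String, Message> expanded =
    recordStream.flatMapValues(message -> consumerRecordHandler.process(message));
// branch() evaluates the predicates in order; each record goes to the first branch that matches
KStream<String, Message>[] branches = expanded.branch(
    (key, message) -> message.isTypeA(), // hypothetical predicate for the first output
    (key, message) -> true);             // everything else
branches[0].to("output-topic-1");
branches[1].to("output-topic-2");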

What's your key type? I am guessing it's not String. After executing the mapValues you'll have this: KStream<K, List<Message>>. If K is not String, then someFunction() can be a map() which converts K into String (if it already is String, you already have the result) and leaves the List<Message> (the value) untouched, since that's your intended end result.
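A hedged sketch of that last step, assuming the key type happens to be Long (any key with a usable toString() works the same way):
KStream<Long, List<Message>> mapped = ...; // result of the mapValues() above
KStream<String, List<Message>> result =
    mapped.map((key, value) -> KeyValue.pair(key.toString(), value));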

Related

In Kafka Streams Application the second output stream is not written anymore

I am currently implementing a Kafka Streams application where I am reading one topic and doing some processing. During processing I split the stream into two: one is written to one topic (Avro schema), the other one is a counting aggregation (word count) writing key/value pairs (String/Long) into a different topic. The code worked fine before, but recently the second stream is not written anymore.
In this code example:
KStream<String, ProcessedSentence> sentenceKStream = stream
    .map((k, v) -> {
        [...]
    });

// configure serializers for publishing to topic
final Serde<ProcessedSentence> valueProcessedSentence = new SpecificAvroSerde<>();
valueProcessedSentence.configure(serdeConfig, false);
stringSerde.configure(serdeConfig, true);

// write to Specific Avro Record
sentenceKStream
    .to(EnvReader.KAFKA_SENTENCES_TOPIC_NAME_OUT,
        Produced.with(
            Serdes.String(),
            valueProcessedSentence));
the stream of sentences (sentenceKStream) is written correctly but the problem arises with the word count grouping:
KStream<String, Long> wordCountKStream =
    sentenceKStream.flatMap((key, processedSentence) -> {
        List<KeyValue<String, Long>> result = new LinkedList<>();
        Map<CharSequence, Long> words = processedSentence.getWords();
        for (CharSequence word : words.keySet()) {
            result.add(KeyValue.pair(word.toString(), words.get(word)));
        }
        return result;
    })
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
    .reduce(Long::sum)
    .toStream();

// write the word counts (String/Long) to the second output topic
wordCountKStream
    .to(EnvReader.KAFKA_WORDS_COUNT_TOPIC_NAME_OUT,
        Produced.with(
            Serdes.String(),
            Serdes.Long()));
I really don't get why the wordCountKStream is not written anymore.
Maybe somebody could provide some help? I'd be happy to provide any further details!
Many Thanks
Update: I found out that the data is missing in both new output topics. Actually, everything is written correctly, but a couple of minutes after writing, the data is deleted from both topics (0 bytes left).
It had nothing to do with the implementation itself. I just deleted all topic offsets using
kafka-consumer-groups.sh --bootstrap-server [broker:port] --delete-offsets --group [group_name] --topic [topic_name]
which solved the problem. There had just been a problem with the stored offsets, which conflicted with multiple restarts of the streams application during the debugging process.
For those who want to list the groups in order to find the stored topic positions call
kafka-consumer-groups.sh --bootstrap-server node1:9092 --list
Update: Unfortunately, deleting the group offsets did not work properly either. The actual problem was that the timestamp taken for the new entries in the output topics was the one from the original (consumed) topic, which did not change at all. Therefore, the new entries were carrying timestamps older than the default retention time.
As the consumed topic had a retention.ms of -1 (keep data forever) and the new topics had the default of, I think, 6 days, the entries in the consumed topic are still there, but the ones in the produced topics were always deleted because they were older than 6 days.
The final solution (for me) was to change retention.ms to -1 for the output topics as well, which means "keep forever". This is probably not the best solution for a production environment.
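For reference, the retention of an output topic can be changed with the kafka-configs tool, roughly like this (broker and topic name are placeholders; very old clusters may need --zookeeper instead of --bootstrap-server):
kafka-configs.sh --bootstrap-server node1:9092 --alter --entity-type topics --entity-name [output_topic_name] --add-config retention.ms=-1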
Hint: for Kafka Streams applications it is recommended to use the Application Reset Tool instead of manually resetting/deleting the offsets as shown above.
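A typical invocation looks roughly like this (the application id must match the application.id of the streams app; the exact flag spellings can differ slightly between Kafka versions):
kafka-streams-application-reset.sh --bootstrap-servers node1:9092 --application-id [application_id] --input-topics [input_topic_name]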

How to understand Kafka streams?

I am following Kafka streams documentation and I am getting confused in some concepts which I want to clarify here.
https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html
The documentation says that flatMap takes one record and produces zero, one, or more records, and that you can modify the record keys and values too. It also says that it marks the data for re-partitioning.
Questions:
1) What does re-partitioning mean? Will it re-partition data for a new topic, where I am going to write the transformed results, or will it re-partition data in the same topic from which I started streaming?
2) If the old topic data is getting re-partitioned, does that mean the transformed results are written to that topic too?
For example:
KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
    // Here, we generate two output records for each input record.
    // We also change the key and value types.
    // Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
    (key, value) -> {
        List<KeyValue<String, Integer>> result = new LinkedList<>();
        result.add(KeyValue.pair(value.toUpperCase(), 1000));
        result.add(KeyValue.pair(value.toLowerCase(), 9000));
        return result;
    }
);
In this example, one record is taken and two records are generated. Does this mean that the topic from which I started streaming will now have 3 records: one with key 345L and two with the new keys? If I put the transformed result into a new topic or a store, what would the state of the old and the new topic be? Would both contain all 3 records? I am a novice.
This is a transformed result. So, when you read from a topic, you don't change the source topic. However, when you write to another topic, your output sink topic will have 2 values.
When it says it marks the result for repartitioning, it means that when you write to the sink topic, the data will have to be repartitioned. It doesn't repartition the source topic. Think about why: if you're continuously reading from the source topic, would it continuously repartition the source topic? That's not a practical option.
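For example (a hedged sketch, with an illustrative topic name), writing the transformed stream from the snippet above to a sink topic:
transformed.to("transformed-output", Produced.with(Serdes.String(), Serdes.Integer()));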
I hope this clarifies your question.
Re-partitioning in Kafka Streams means that the records are sent to an intermediate topic before a processor, and the processor then reads the records from the intermediate topic. By sending the records to an intermediate topic, the records are re-partitioned.
This is needed, for example with join processors. A join processor in Kafka Streams requires that all keys of one partition are processed by the same task to ensure correctness. This would not be true, if an upstream processor modified the keys of the records as in your example the flatMap(). Besides joins also aggregations require that all keys of one partition are processed by the same task. Re-partitioning does not write anything to the input or output topic of your streams application and you should usually not need to care about intermediate topics.
However, what you can do is avoid re-partitioning where possible by using the *Values() operators like flatMapValues() if you do not change the key of the records. For example, if you use flatMap() and you do not change the keys of the records, the records will nevertheless be re-partitioned although it would not be needed. Kafka Streams cannot know that you did not touch the key if you do not use flatMapValues().
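A small sketch of the difference (the topic name and the word-splitting logic are just for illustration):
// flatMapValues(): only the value is transformed and the key is untouched,
// so the result is NOT marked for re-partitioning.
KStream<String, String> words = builder.<String, String>stream("sentences")
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")));

// flatMap(): the key could have been changed, so the result IS marked for
// re-partitioning, even though this particular lambda keeps the key as it is.
KStream<String, String> wordsMarked = builder.<String, String>stream("sentences")
    .flatMap((key, value) -> {
        List<KeyValue<String, String>> result = new LinkedList<>();
        for (String word : value.toLowerCase().split("\\W+")) {
            result.add(KeyValue.pair(key, word));
        }
        return result;
    });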

How Kafka streams handle distributed data

I have tried to go through various tutorials but am not clear on two aspects of Kafka streams.
Lets take the word count example mentioned in:
https://docs.confluent.io/current/streams/quickstart.html
// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-plaintext-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream("streams-plaintext-input", Consumed.with(stringSerde, stringSerde));

KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words. The text lines are the message
    // values, i.e. we can ignore whatever data is in the message keys and thus invoke
    // `flatMapValues` instead of the more generic `flatMap`.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // We use `groupBy` to ensure the words are available as message keys
    .groupBy((key, value) -> value)
    // Count the occurrences of each word (message key).
    .count();

// Convert the `KTable<String, Long>` into a `KStream<String, Long>` and write to the output topic.
wordCounts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));
Couple of questions here:
1.) Since there are no keys in the original stream, two words can land up at two different nodes as they might fall in different partition and hence true count would be the aggregation from both of them. It does not seem to be done here ? Do different nodes serving same topic's partition coordinate here to aggregate the count ?
2.) As the new stream is generated by each operation (e.g. flatMapValues, groupBy etc) are the partitions recalculated for messages in these substreams so that they land up on different nodes ?
Will appreciate any help here!
1.) Since there are no keys in the original stream, two words can land up at two different nodes as they might fall in different partition and hence true count would be the aggregation from both of them. It does not seem to be done here ?
It is done here. This is the relevant code:
// We use `groupBy` to ensure the words are available as message keys
.groupBy((key, value) -> value)
Here, "words" become the new message key, which means that words are re-partitioned so that each word is put into one partition only.
Do different nodes serving same topic's partition coordinate here to aggregate the count ?
No, they don't. A partition is processed by one node only (more precisely: by one stream task only, see below).
2.) As the new stream is generated by each operation (e.g. flatMapValues, groupBy etc) are the partitions recalculated for messages in these substreams so that they land up on different nodes ?
Not sure I understand your question, notably the "recalculated" part. Operations (like aggregations) are always performed per partition, and Kafka Streams maps partitions to stream tasks (slightly simplified: a partition is always processed by one and only one stream task). Stream tasks are executed by the various instances of your Kafka Streams application, which typically run on different containers/VMs/machines. If need be, data will be re-partitioned (see question #1 and the answer above) for an operation to produce the expected result -- perhaps that's what you mean by "recalculated".
I'd suggest reading Kafka's documentation, such as https://kafka.apache.org/documentation/streams/architecture#streams_architecture_tasks.

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions of the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
.stream<Any, Any>(inputTopic)
.to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition of the input record?
I checked the Processor API, which allows access to the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (which is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In the above code only ByteArray Serdes are used, so no special serialization or deserialization happens.
Firstly, messages are distributed among partitions based on the key. A message with a given key will always go to the same partition.
So if your messages have keys, then you don't need to worry about it at all. As long as you have the same number of partitions as your original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, then you should consider using the original topic instead. Kafka has the notion of consumer groups.
For example, if you have a topic 'transactions', you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. These consumer groups would all read the same topic, filter out the events that are meaningful to them, and process them.
You could also create 3 topics and achieve the same result, though that's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.
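As a hedged illustration of the consumer-group idea (topic and group names are just examples):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
// each processor uses its own group.id, so every group receives all records of the topic
props.put(ConsumerConfig.GROUP_ID_CONFIG, "credit-card-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("transactions"));
// poll(), filter out the events relevant to this processor, and handle them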

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream on different Kafka Topics.
Basically, I have a network with a leader linked to many children. For each child, the Leader needs to write the stream read in a child-specific Kafka Topic, so that the child can read it.
When the child is started, it registers itself in the Kafka topic read from the Leader.
The problem is that I don't know a priori how many children I have.
For example, I read 1 from the Kafka Topic, I want to write the stream in just one Kafka Topic named Topic1.
I read 1-2, I want to write on two Kafka Topics (Topic1 and Topic2).
I don't know if it is possible because, in order to write to a topic, I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires knowing the number of sinks a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting pairs (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        // "topics" holds the currently known target topic names
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
    "default-topic",
    new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public String getTargetTopic(Tuple2<String, MyValue> element) {
            return element.f0;
        }
        ...
    },
    kafkaProperties)
);
KeyedSerializationSchema is now deprecated. Instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding the serialize method.
@Override
public ProducerRecord<byte[], byte[]> serialize(
        String inputString, @Nullable Long aLong) {
    // customTopicName and key are assumed to be fields of the enclosing schema class
    return new ProducerRecord<>(customTopicName,
        key.getBytes(StandardCharsets.UTF_8), inputString.getBytes(StandardCharsets.UTF_8));
}
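To tie this back to the Tuple2 routing above, a rough sketch under the assumption that the stream carries Tuple2<String, MyValue> elements (MyValue and its byte conversion are placeholders):
stream.addSink(new FlinkKafkaProducer<>(
    "default-topic",
    new KafkaSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public ProducerRecord<byte[], byte[]> serialize(
                Tuple2<String, MyValue> element, @Nullable Long timestamp) {
            // route to the topic stored in the first field, like getTargetTopic() did above
            return new ProducerRecord<>(element.f0,
                element.f1.toString().getBytes(StandardCharsets.UTF_8));
        }
    },
    kafkaProperties,
    FlinkKafkaProducer.Semantic.AT_LEAST_ONCE));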