How to write data to multiple partitions of a topic with Kafka Streams API - apache-kafka

I have a Kafka Streams application as mentioned at How to evaluate consuming time in kafka stream application.
With this application, I am able to write the data to one partition of a topic. How can I write the data to multiple partitions of a topic? Please help me.

If you use Kafka Streams and write data into a topic via #to(String topicName), data will automatically be written to "all" partitions. (I.e., each message is written to a single partition, but different messages can be written to different partitions.) The partition is picked via hashing based on the message key. If the key is null, a random partition is used.
If all your output data has the same key, it will all go to a single partition.
You can also customize the partitioning by using #to(StreamPartitioner, String) (old API) or #to(String, Produced) (new API, v1.0+), as in the sketch below.
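As a minimal sketch of the newer variant (topic names, String serdes, and the hash-based rule are my own assumptions, not from the original answer), a custom StreamPartitioner can be passed via Produced:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.StreamPartitioner;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream =
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String())); // hypothetical topic

// Spread records across partitions based on the key; returning null falls back
// to the default (hash-based) partitioning.
StreamPartitioner<String, String> byKeyHash =
        (topic, key, value, numPartitions) ->
                key == null ? null : Math.floorMod(key.hashCode(), numPartitions);

stream.to("output-topic", Produced.with(Serdes.String(), Serdes.String(), byKeyHash));

With no partitioner supplied, the default behaviour described above applies: hash of the key, or a random partition for null keys.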

Related

How to sync data for a particular user, when reading from kafka?

I have a streaming service using Kafka, where I receive data from multiple users. I want to process the data such that each user's data is processed in a synchronous (ordered) manner, whereas different users' data can be processed asynchronously. Is there any standard pattern available for such scenarios?
You can achieve this by using the userId as the key while publishing messages to Kafka, as in the sketch below.
Keys ensure that messages published to Kafka with a particular key are ordered, by pushing them all into a single partition.
And since each partition is assigned to at most one consumer within a group (a partition is never shared among consumers of the same group), that consumer will consume the data from the partition in the sequence it was pushed.
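For illustration, a minimal producer sketch (broker address, topic name, and userId value are hypothetical) where the userId is used as the record key so that all of a user's messages land on the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    String userId = "user-42"; // hypothetical user id
    // Same key => same partition => per-user ordering is preserved.
    producer.send(new ProducerRecord<>("user-events", userId, "payload for " + userId));
}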

How to request data from producer at beginning position that does not exist in Kafka?

I have a database with time series data and this data is sent to Kafka.
Many consumers build aggregations and reporting based on this data.
My Kafka cluster stores data with a TTL of 1 day.
But how can I build a new report and run a new consumer from the 0th position, when that data no longer exists in Kafka but does exist in the source storage?
For example - is there some callback for the producer if I request an offset that does not exist in Kafka?
If it is not possible, please advise other architectural solutions. I want to use the same codebase to aggregate this data.
For example - some callback for the producer if I request an offset that does not exist in Kafka?
If the data does not exist in Kafka, you cannot consume it, much less do any aggregation on top of it.
Moreover, there is no concept of a consumer requesting a producer. Producers send data to Kafka broker(s) and consumers consume from those broker(s). There is no direct interaction between a producer and a consumer as such.
Since you say that the data still exists in the source DB, you can fetch your data from there and produce it to Kafka again.
When you produce that data again, it will arrive as new messages, which will eventually be consumed by the consumers as usual.
In case you would like to differentiate between initial consumption and re-consumption, you can produce these messages to a new topic and have your consumers consume from that.
The other way is to increase your TTL (I suppose you mean retention in Kafka when you say TTL), and then you can seek back to a timestamp in the consumers using the offsetsForTimes(Map<TopicPartition,Long> timestampToSearch) and seek(TopicPartition topicPartition, long offset) methods, roughly as in the sketch below.
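As a rough sketch of that retention-plus-seek option (the consumer properties, topic name, and the 12-hour lookback are assumptions for illustration):

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps); // usual consumer properties
consumer.subscribe(Collections.singletonList("time-series"));
consumer.poll(Duration.ofSeconds(1)); // may need a few polls until partitions are assigned

long startTimestamp = Instant.now().minus(Duration.ofHours(12)).toEpochMilli();
Map<TopicPartition, Long> query = consumer.assignment().stream()
        .collect(Collectors.toMap(Function.identity(), tp -> startTimestamp));

// Find the earliest offset at or after the timestamp and rewind each partition to it.
Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
offsets.forEach((tp, oat) -> {
    if (oat != null) {
        consumer.seek(tp, oat.offset());
    }
});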

How to determine topic has been read completely by Kafka Stream application from very first offset to last offset from Java application

I need some help with Kafka Streams. I have started a Kafka Streams application which is streaming one topic from the very first offset. The topic is very large, so I want to implement a mechanism in my application, using Kafka Streams, so that I can get notified when the topic has been read completely up to the very last offset.
I have read the Kafka Streams 2.8.0 API and found the method allLocalStorePartitionLags, which returns a map of store names to another map of partitions containing the lag information for each partition. This method returns lag information for all store partitions (active or standby) local to this Streams instance. It is quite useful in the case where I have one node running the stream application.
But in my case the system is distributed: there are 3 application nodes and 10 topic partitions, which means each node has at least 3 partitions of the topic to read from.
I need help here. How can I implement this functionality, where I get notified when the topic has been read completely from partition 0 to partition 9? Please note that I don't have the option to use a database here as of now.
Other approaches to achieve the goal are also welcome. Thank you.
I was able to get the lag information from the AdminClient API. The code below retrieves the end offsets and the current offsets for each partition of the topics read by the given stream application, i.e. applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);
// Current (committed) offsets of the application's consumer group.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
// All topic partitions the application reads from.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// End (latest) offsets for each of those partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
        .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
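Building on that, a short sketch (assuming the futures complete without error) that turns these two maps into a per-partition lag and a "topic fully read" flag:

// End (latest) offsets, resolved from the future above.
Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = listOffsetsResult.all().get();
// Lag per partition = end offset - committed offset; zero everywhere means the
// application has caught up with the current end of the topic.
boolean caughtUp = topicPartitions.stream().allMatch(tp ->
        endOffsets.get(tp).offset() - topicPartitionOffsetAndMetadataMap.get(tp).offset() <= 0);
System.out.println("Topic read completely: " + caughtUp);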

MQTT topics and kafka topics mapping

I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker (Mosquitto) messages into my Kafka.
Since every vehicle sends its data to its own topic in the MQTT broker within a single organisation, I would like to push all this data into Kafka. Now I know it is not advisable to create so many topics in Kafka (more than a million). I would also rather not save all the vehicles' data in one Kafka topic, as I would later like to put all this data in S3, differentiated by vehicle id.
How can I achieve this without creating so many topics in Kafka? One way is for the Kafka consumer to segregate the events and put them in S3, but I believe there would be a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, and Kafka Connect's Single Message Transform RegexRouter to modify the topic name to which messages are written, plus another SMT to modify the message key. That way you get all the messages in one topic, partitioned based on the vehicle id. That's probably the best way to store it; a rough configuration sketch follows below.
From there, you can use the data however you want. When it comes to streaming it to S3, you can use the Kafka Connect S3 sink and, as cricket_007 mentioned, partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket, you could use stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic and populate others.
See here for an example of the MQTT connector.
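As a rough illustration only (the connector class and the mqtt.* property names are assumptions based on the Confluent MQTT source connector; adjust them to whichever MQTT connector you actually use), a Connect configuration along these lines collapses every per-vehicle MQTT topic into a single Kafka topic with RegexRouter:

name=mqtt-vehicle-source
# Assumed connector class; substitute the MQTT source connector available to you.
connector.class=io.confluent.connect.mqtt.MqttSourceConnector
mqtt.server.uri=tcp://mosquitto:1883
# Subscribe to every vehicle's topic, e.g. vehicles/<vehicle-id>/telemetry
mqtt.topics=vehicles/#
# RegexRouter SMT (ships with Apache Kafka): rewrite whatever topic the connector
# targets into one common Kafka topic.
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=.*
transforms.route.replacement=vehicle-telemetry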

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another one with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions as the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
    .stream<Any, Any>(inputTopic)
    .to(outputTopic)
This works, except for the partitioning (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition as the input record?
I checked the Processor API, which allows accessing the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (which is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In the above code only ByteArray Serdes are used, so no special serialization or deserialization happens.
Firstly, messages are distributed among partitions based on the key. Messages with the same key always go to the same partition.
So if your messages have keys, you don't need to worry about it at all. As long as you have the same number of partitions as your original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, then you should consider using the original topic instead. Kafka has the notion of consumer groups; a short sketch follows below.
For example, if you have a topic 'transactions', then you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor' and so on. These consumer groups would read the same topic, filter out the events that are meaningful to them, and process them.
You can also create 3 topics and achieve the same result, though it's not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.
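For illustration, a minimal consumer sketch of the consumer-group approach (the topic, group name, and filtering condition are hypothetical), where each processor runs with its own group.id, reads the full 'transactions' topic, and keeps only the events it cares about:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "credit-card-processor"); // each processor uses its own group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("transactions"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            if (record.value().contains("CREDIT_CARD")) { // keep only the events this group cares about
                // process the credit-card transaction
            }
        }
    }
}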