Kafka Consumer to read from multiple topics

I am very new to Kafka. I am creating two topics and publishing to these two topics from two producers. I have one consumer which consumes messages from both topics, because I want to process them according to priority.
I am getting a stream from each topic, but as soon as I start iterating over the ConsumerIterator of either stream, it blocks there. As the documentation says, it will block until it gets a new message.
Is anyone aware of how to read from two topics and two streams with a single Kafka consumer?
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(KafkaConstants.HIGH_TEST_TOPIC, new Integer(1));
topicCountMap.put(KafkaConstants.LOW_TEST_TOPIC, new Integer(1));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumerConnector.createMessageStreams(topicCountMap);
KafkaStream<byte[], byte[]> highPriorityStream = consumerMap.get(KafkaConstants.HIGH_TEST_TOPIC).get(0);
ConsumerIterator<byte[], byte[]> highPriorityIterator = highPriorityStream.iterator();
while (highPriorityStream.nonEmpty() && highPriorityIterator.hasNext())
{
    byte[] bytes = highPriorityIterator.next().message();
    Object obj = null;
    CLoudDataObject thunderDataObject = null;
    try
    {
        obj = SerializationUtils.deserialize(bytes);
        if (obj instanceof CLoudDataObject)
        {
            thunderDataObject = (CLoudDataObject) obj;
            System.out.println(thunderDataObject);
            // TODO Got the Thunder object here, now write code to send it to Thunder service.
        }
    }
    catch (Exception e)
    {
        // Exception is currently swallowed.
    }
}

If you don't want to process lower-priority messages before high-priority ones, how about setting the consumer.timeout.ms property and catching ConsumerTimeoutException to detect that the high-priority flow has reached the last available message? By default it is set to -1, which blocks until a new message arrives. (http://kafka.apache.org/07/configuration.html)
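As a rough sketch of that idea against the old high-level consumer API used in the question (the handler name is a placeholder, and the streams are built via createMessageStreams as above):

Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");   // adjust to your environment
props.put("group.id", "priority-consumer");
props.put("consumer.timeout.ms", "1000");           // hasNext() now throws instead of blocking forever
ConsumerConnector consumerConnector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

// ... build highPriorityStream / lowPriorityStream via createMessageStreams as in the question ...

ConsumerIterator<byte[], byte[]> highIt = highPriorityStream.iterator();
try {
    while (highIt.hasNext()) {
        handleHighPriority(highIt.next().message());   // hypothetical handler
    }
} catch (ConsumerTimeoutException e) {
    // No high-priority message arrived within consumer.timeout.ms,
    // so it is safe to move on to the low-priority stream for now.
}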
Below is a way to process multiple flows concurrently with different priorities.
Kafka requires multi-threaded programming here. In your case, the streams of the two topics need to be processed by separate threads, one per flow. Because each thread runs independently to process its messages, one blocking flow (thread) will not affect the others.
Java's thread-pool implementation can help with building such a multi-threaded application. You can find an example implementation here:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
Regarding execution priority, you can call Thread.currentThread().setPriority() to give each thread an appropriate priority based on the Kafka topic it serves.
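A minimal sketch of that thread-per-stream idea, assuming the streams come from createMessageStreams as in the question (handler names are placeholders):

ExecutorService pool = Executors.newFixedThreadPool(2);

pool.submit(() -> {
    Thread.currentThread().setPriority(Thread.MAX_PRIORITY);   // thread serving the high-priority topic
    ConsumerIterator<byte[], byte[]> it = highPriorityStream.iterator();
    while (it.hasNext()) {
        handleHighPriority(it.next().message());   // hypothetical handler
    }
});

pool.submit(() -> {
    Thread.currentThread().setPriority(Thread.MIN_PRIORITY);   // thread serving the low-priority topic
    ConsumerIterator<byte[], byte[]> it = lowPriorityStream.iterator();
    while (it.hasNext()) {
        handleLowPriority(it.next().message());    // hypothetical handler
    }
});

Note that Java thread priorities are only hints to the scheduler, so they soften rather than guarantee the ordering between the two flows.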

Related

INCOMPLETE_SOURCE_TOPIC_METADATA error in KafkaStreams

I am writing a Kafka Streams application and have set the maximum number of stream threads to one. I have three source topics with 6, 8, and 8 partitions respectively. I am currently running this stream topology with 4 instances, so there are 4 running stream threads.
I am getting INCOMPLETE_SOURCE_TOPIC_METADATA for one of my Kafka topics. I found the code below on GitHub that throws this error and am trying to understand it:
final Map<String, InternalTopicConfig> repartitionTopicMetadata = new HashMap<>();
for (final InternalTopologyBuilder.TopicsInfo topicsInfo : topicGroups.values()) {
    for (final String topic : topicsInfo.sourceTopics) {
        if (!topicsInfo.repartitionSourceTopics.keySet().contains(topic) &&
            !metadata.topics().contains(topic)) {
            log.error("Missing source topic {} during assignment. Returning error {}.",
                      topic, AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.name());
            return new GroupAssignment(
                errorAssignment(clientMetadataMap, topic,
                                AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.code())
            );
        }
    }
    for (final InternalTopicConfig topic : topicsInfo.repartitionSourceTopics.values()) {
        repartitionTopicMetadata.put(topic.name(), topic);
    }
}
My questions:
Does this error come from a partition mismatch on the Kafka topics, or because the TopicsInfo is not available at that time (e.g. the Kafka group lost its access to the Kafka topic)?
What does the topicsInfo.repartitionSourceTopics call mean?
The error means that Kafka Streams could not fetch all required metadata for the topics it uses and thus cannot proceed. This can happen if a topic does not exist (i.e., it has not been created yet or is misspelled), or if the corresponding metadata has not been broadcast to all brokers yet (in that case the error should be transient). Note that metadata is updated asynchronously, so there may be some delay until all brokers learn about a new topic.
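As an aside, if you want to rule out the "topic does not exist / is misspelled" case before starting the topology, one option (outside the Streams API itself) is to check with the standard AdminClient; the bootstrap address and topic names below are placeholders:

static void verifySourceTopics() throws Exception {
    Properties adminProps = new Properties();
    adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (AdminClient admin = AdminClient.create(adminProps)) {
        Set<String> existing = admin.listTopics().names().get();
        for (String topic : Arrays.asList("source-topic-1", "source-topic-2", "source-topic-3")) {
            if (!existing.contains(topic)) {
                throw new IllegalStateException("Source topic missing or not yet visible: " + topic);
            }
        }
    }
}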

Kafka exactly_once processing - do you need your streams app to produce to kafka topic as well?

I have a Kafka Streams app consuming from a Kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your Streams app to write to a Kafka topic?
How can you achieve exactly_once if your Streams app wants to process each message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
The above is the explanation of "exactly-once" semantics.
It is not always necessary to publish the output to a topic in a KStream application.
When you use KStream applications, you have to define an application ID for each one, which uses a consumer in the backend. In the application you have to configure a few parameters, such as processing.guarantee (set to exactly_once) and enable.idempotence.
Here are the details:
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
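For reference, a minimal sketch of that configuration (the application ID and broker address are placeholders) might look like:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// exactly_once also enables producer idempotence internally,
// and it applies even if the topology never writes to an output topic
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);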
I am not disputing the exactly-once stream pattern, because that is the beauty of Kafka Streams; however, it is possible to use Kafka Streams without producing to other topics.
The exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time: you consume a message, process it, publish the result to another topic, and commit. The commit is handled by Streams automatically, one message at a time.
Kafka Streams achieves this by setting the following parameters, which cannot be overridden:
isolation.level (read_committed): consumers only ever read committed data
enable.idempotence (true): the producer always has idempotence enabled
max.in.flight.requests.per.connection (5): the producer allows at most five in-flight requests per connection
If there is an error in the consumer or the producer, Kafka Streams always retries a configured number of attempts.
Kafka Streams does not make guarantees about your own processing logic; you still need to handle that yourself. For example, if there is a DB operation and the DB connection fails, Kafka is not aware of it, so you need to handle it on your own.
As per the pattern definition, yes, you need a consumer, a processor, and a producer topic, but in general nothing stops you from not writing to another topic. You can still consume exactly one record at a time with the default commit interval (DEFAULT_COMMIT_INTERVAL_MS), and again you need to handle failures in your own transactional logic yourself.
Here is a sample:
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();
KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());

KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    logger.info("Completed VQM stream");
    streams.close();
    latch.countDown();   // release the main thread so it can exit
}));

logger.info("Streaming start...");
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}

class ProcessInternal implements Processor<String, String> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void close() {
        // Any code for clean-up would go here.
    }

    @Override
    public void process(String key, String value) {
        // Your transactional process business logic.
    }
}

Kafka Consumer API jumping offsets

I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic. I have created one consumer per thread.
It fetches the messages and gives them to a handler.
Then it commits the offsets, using "at-least-once" Kafka consumer semantics.
In parallel, I have another consumer running with a different group-id. In that consumer, I'm simply incrementing a message counter and committing the offset. There is no message loss in that consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";

        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);

        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }

        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(simpleMessage);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with the usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To use specific partitions you need to use assign(). Here is the relevant extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned
partitions. Topic subscriptions are not incremental. This list will
replace the current assignment (if there is one). It is not possible
to combine topic subscription with group management with manual
partition assignment through assign(Collection). If the given list of
topics is empty, it is treated the same as unsubscribe(). This is a
short-hand for subscribe(Collection, ConsumerRebalanceListener), which
uses a noop listener. If you need the ability to seek to particular
offsets, you should prefer subscribe(Collection,
ConsumerRebalanceListener), since group rebalances will cause
partition offsets to be reset. You should also provide your own
listener if you are doing your own offset management since the
listener gives you an opportunity to commit offsets before a rebalance
finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface
does not allow for incremental assignment and will replace the
previous assignment (if there is one). If the given list of topic
partitions is empty, it is treated the same as unsubscribe(). Manual
topic assignment through this method does not use the consumer's group
management functionality. As such, there will be no rebalance
operation triggered when group membership or cluster and topic
metadata change. Note that it is not possible to use both manual
partition assignment with assign(Collection) and group assignment with
subscribe(Collection, ConsumerRebalanceListener).
You probably shouldn't do what you're doing. You should use subscribe(), use multiple partitions per topic and multiple consumers in the group for high availability, and let the consumer handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way; it's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
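As a rough sketch only (topic, group, and handler names are placeholders, not your actual setup), the conventional approach looks something like this: subscribe to the topics, let the group manage partition assignment, and commit only after the records returned by a poll have been processed.

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-0");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Arrays.asList("topic-0", "topic-1", "topic-2", "topic-3"));
    while (true) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, byte[]> record : records) {
            handle(record);              // hypothetical handler
        }
        if (!records.isEmpty()) {
            consumer.commitSync();       // commit after the whole batch has been handled
        }
    }
}

Run one such consumer per thread or process, all with the same group.id, and the group coordinator spreads the partitions across them.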
If you're looking for more performance or to isolate records more carefully to avoid "head of line blocking" consider something like Parallel Consumer (PC).
It also tracks per record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Does the consumer.createMessageStreams(map) method read sequentially or in some sort of batch?

I am very new to Kafka. We are writing a consumer in our current application, which consumes from a topic and does some processing of the consumed data. I want to understand what happens internally when I write the piece of code below.
It is working as expected, consuming and processing data, but I am curious to know how the data is read from the topic.
Does the createMessageStreams method read data sequentially from a topic, or does it read a particular number of messages in a batch and process them?
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(map);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
First of all, note that the ConsumerConnector and kafka.consumer.KafkaStream classes are deprecated as of Kafka 0.11.0. If you are using an older version, you should plan to upgrade to a newer one, at least 1.0.
Will createMessageStreams method reads data sequentially from a topic
or it reads in a particular number of batch and process them ?
createMessageStreams returns a map of (topic, list of KafkaStream) pairs. Each stream supports an iterator over the message/metadata pairs for a topic. Data is read sequentially only within a partition. If you have more partitions than stream threads, one thread can read from multiple partitions, but the ordering is only guaranteed within each partition.
for (final KafkaStream<byte[], byte[]> stream : streamList)
{
    ConsumerIterator<byte[], byte[]> it = stream.iterator();
    while (it.hasNext())
    {
        String message = new String(it.next().message());
        System.out.println(message);
    }
}
The equivalent functionality from 0.11 onwards is the poll() method. You can set max.poll.records and max.poll.interval.ms to control the number of records per poll request and the interval duration respectively.
You can find the new consumer here:
https://kafka.apache.org/20/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
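As an illustrative sketch with the new consumer (broker address, group, and topic are placeholders): poll() returns at most max.poll.records records per call, and ordering is still only guaranteed within a partition.

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");   // upper bound on records per poll()

KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));
while (true) {
    ConsumerRecords<byte[], byte[]> batch = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<byte[], byte[]> record : batch) {
        System.out.println(new String(record.value()));
    }
}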

kafka new api 0.10 doesn't provide a list of stream and consumer objects per topic

Previously I was using the 0.8 API. When you pass it a list of topics, it returns a map of streams (one entry per topic). This allows me to spawn a separate thread per topic and assign that topic's stream to it. Because each topic carries a lot of data, spawning a separate thread helps with multitasking.
//0.8 code sample
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
consumer.createMessageStreams(topicCountMap);
I want to upgrade to 0.10. I checked the KafkaStreams and KafkaConsumer classes. The KafkaConsumer object takes config properties and provides a subscribe method that takes a list of topics, but its return type is void. I cannot find a way to get a handle on each topic.
KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(topicsList);
consumer.poll(100);   // timeout in ms (0.10 API)
KafkaStreams on the other hand seems to have the same problem.
KStreamBuilder builder = new KStreamBuilder();
String[] topics = new String[] {"topic1", "topic2"};
KStream<String, String> source = builder.stream(stringSerde, stringSerde, topics);
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
There is a source.foreach() method available, but it is a stream over all topics. Anyone have any ideas?
First, using a multi-threaded consumer is tricky, so the pattern you employed in 0.8 is hopefully well designed :)
Best practice is to use a single-threaded consumer, so there is "no need" to separate different topics if a single consumer subscribes to a list of topics at once. Nevertheless, while consuming a record, the record object tells you which topic it came from (it carries this metadata). Thus, you could theoretically dispatch each record to a different thread for the actual processing according to its topic (even if this is not recommended!).
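To illustrate that point (and only as an illustration, not a recommendation), a sketch of dispatching by topic with a 0.10 consumer could look like this; the executors and the handle() method are made up for the example:

Map<String, ExecutorService> workers = new HashMap<>();
workers.put("topic1", Executors.newSingleThreadExecutor());
workers.put("topic2", Executors.newSingleThreadExecutor());

KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("topic1", "topic2"));
while (true) {
    ConsumerRecords<byte[], byte[]> records = consumer.poll(100);   // 0.10-era signature
    for (ConsumerRecord<byte[], byte[]> record : records) {
        // record.topic() tells you which topic the record came from
        workers.get(record.topic()).submit(() -> handle(record));   // hypothetical handler
    }
}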
Kafka scales out via partitions; thus, if a single-threaded consumer is not able to handle the load, you should start multiple consumers (as a consumer group) to scale out your processing capacity.
A more general question: if you want to process data per topic, why not use multiple consumers, each subscribing to a single topic?
Last but not least, in Apache Kafka 0.10+ the Kafka Streams API is a newly introduced stream processing library -- it must not be confused with the 0.8 KafkaStream class (hint: there is no "s"). The two are completely unrelated to each other.