INCOMPLETE_SOURCE_TOPIC_METADATA error in KafkaStreams - apache-kafka

I'm writing a KafkaStreams application and have set the maximum number of stream threads to one. I have three source topics with 6, 8, and 8 partitions respectively. The topology currently runs with 4 instances, so there are 4 running stream threads.
I'm getting INCOMPLETE_SOURCE_TOPIC_METADATA for one of my Kafka topics. I found the code below on GitHub that throws this error and I'm trying to understand it:
final Map<String, InternalTopicConfig> repartitionTopicMetadata = new HashMap<>();
for (final InternalTopologyBuilder.TopicsInfo topicsInfo : topicGroups.values()) {
    for (final String topic : topicsInfo.sourceTopics) {
        if (!topicsInfo.repartitionSourceTopics.keySet().contains(topic) &&
            !metadata.topics().contains(topic)) {
            log.error("Missing source topic {} during assignment. Returning error {}.",
                      topic, AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.name());
            return new GroupAssignment(
                errorAssignment(clientMetadataMap, topic,
                    AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.code())
            );
        }
    }
    for (final InternalTopicConfig topic : topicsInfo.repartitionSourceTopics.values()) {
        repartitionTopicMetadata.put(topic.name(), topic);
    }
}
My questions:
Does this error occur because of a partition mismatch on the Kafka topics, or because TopicsInfo is not available at the time (e.g., the Kafka group lost its access to the topic)?
What does the topicsInfo.repartitionSourceTopics call mean?

The error means that Kafka Streams could not fetch all required metadata for the topics it uses and thus cannot proceed. This can happen if a topic does not exist (i.e., it has not been created yet or is misspelled), or if the corresponding metadata has not been broadcast to all brokers yet (in that case, the error should be transient). Note that metadata is updated asynchronously, so there may be some delay until all brokers learn about a new topic.
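As a quick sanity check (not part of the original answer), here is a minimal sketch that uses the AdminClient to compare the topology's source topics against what the broker knows about; the bootstrap address and topic names are placeholders:
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.Set;

public class SourceTopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // the source topics used by the topology (placeholder names)
        List<String> sourceTopics = Arrays.asList("source-a", "source-b", "source-c");

        try (AdminClient admin = AdminClient.create(props)) {
            // listTopics() returns the topic names the cluster currently knows about
            Set<String> existing = admin.listTopics().names().get();
            for (String topic : sourceTopics) {
                if (!existing.contains(topic)) {
                    System.err.println("Source topic missing (would trigger "
                            + "INCOMPLETE_SOURCE_TOPIC_METADATA): " + topic);
                }
            }
        }
    }
}
If a topic shows up as missing here, create it (or fix the spelling) before starting the Streams application.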

Related

How to send Kafka message from one topic to another topic?

Suppose my producer is writing messages to Topic A. Once a message is in Topic A, I want to copy the same message to Topic B. Is this possible in Kafka?
If I understand correctly, you just want stream.to("topic-b"), although that seems strange without doing something to the data.
Note:
The specified topic should be manually created before it is used
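For reference, a minimal Java sketch of that idea (not from the original answer); the application id, bootstrap address, and topic names are placeholders, and both topics must already exist:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class TopicForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-forwarder");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // read every record from topic-a and write it unchanged to topic-b
        builder.<String, String>stream("topic-a").to("topic-b");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}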
I am not clear on what use case you are trying to achieve by simply copying data from one topic to another. If both topics are in the same Kafka cluster, it is never a good idea to have two topics with the same messages/content.
I believe the gap here is the concept of the consumer group in Kafka. You probably have two things to do with the messages consumed from the topic, and you are wondering whether, once the first application consumes a message, it will still be available for the second application to consume. Kafka lets you solve this common use case with consumer groups.
Let's compare Kafka with other message queues, and you will see that you do not need to copy the same data/messages between two topics.
In other message queues, such as SQS (Simple Queue Service), once a message is consumed by one consumer it is no longer available to other consumers. It is the consumer's responsibility to safely delete the message once it has been processed. This guarantees that the same message is not processed by two consumers, which would lead to inconsistency.
In Kafka, however, it is perfectly fine to have multiple sets of consumers reading from the same topic. A set of consumers forms what is commonly called a consumer group. Within a group, one of the consumers processes each message, depending on the partition of the topic the message is read from.
The catch is that multiple consumer groups can consume from the same Kafka topic, and each group processes the messages in its own way. There is no interference between consumers of two different consumer groups.
To fulfill your use case, I believe you need two consumer groups that each process the messages the way they want (see the sketch below). You do not have to copy the data between two topics.
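As an illustration (not from the original answer), a minimal sketch of two consumers in different groups reading the same topic independently; the bootstrap address, topic name, and group ids are placeholders:
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupDemo {
    // Each call creates a consumer in its own group; every group independently
    // receives all records published to the topic, so no copying is needed.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("topic-a")); // placeholder topic
        return consumer;
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> appOne = consumerFor("application-one");
        KafkaConsumer<String, String> appTwo = consumerFor("application-two");
        ConsumerRecords<String, String> batchOne = appOne.poll(Duration.ofSeconds(1));
        ConsumerRecords<String, String> batchTwo = appTwo.poll(Duration.ofSeconds(1));
        System.out.printf("group one: %d records, group two: %d records%n",
                batchOne.count(), batchTwo.count());
        appOne.close();
        appTwo.close();
    }
}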
Hope this helps.
There are two immediate options to forward the contents of one topic to another:
1. use the streams feature of Kafka to create a forwarding link between the two topics, or
2. create a consumer/producer pair and use those to receive messages and then forward them on.
I have a short piece of code that shows both (in Scala):
def topologyPlan(): StreamsBuilder = {
  val builder = new StreamsBuilder
  // forward every record from topic2 to topic3
  val inputTopic: KStream[String, String] = builder.stream[String, String]("topic2")
  inputTopic.to("topic3")
  builder
}

def run() = {
  // streams-based forwarding: topic2 -> topic3
  val kafkaStreams = createStreams(topologyPlan())
  kafkaStreams.start()

  // consumer/producer-based forwarding: topic1 -> topic2
  val kafkaConsumer = createConsumer()
  val kafkaProducer = createProducer()
  kafkaConsumer.subscribe(List("topic1").asJava)
  while (true) {
    val records = kafkaConsumer.poll(Duration.ofSeconds(5)).asScala
    for (data <- records.iterator) {
      kafkaProducer.send(new ProducerRecord[String, String]("topic2", data.value()))
    }
  }
}
Looking at the run method, the first two lines set up a streams object that uses topologyPlan() to listen for messages on 'topic2' and forward them to 'topic3'.
The remaining lines show how a consumer can listen to 'topic1' and use a producer to send the messages onward to 'topic2'.
The final point of the example is that Kafka is flexible enough to let you mix options depending on what you need, so the code above takes messages from 'topic1' and sends them to 'topic3' via 'topic2'.
If you want to see the code that sets up consumer, producer and streams, see the full class here.

Kafka Consumer API jumping offsets

I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic.
I have created one consumer per thread.
It fetches the messages and gives them to a handler to process.
Then it commits the offsets, following "at-least-once" Kafka consumer semantics.
In parallel, I have another consumer running with a different group-id. In that consumer, I'm simply incrementing a message counter and committing the offset. There is no message loss in that consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";
        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }
        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(record);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with your usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To use specific partitions you need to use assign(). Here is the relevant extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned partitions. Topic subscriptions are not incremental. This list will replace the current assignment (if there is one). It is not possible to combine topic subscription with group management with manual partition assignment through assign(Collection). If the given list of topics is empty, it is treated the same as unsubscribe(). This is a short-hand for subscribe(Collection, ConsumerRebalanceListener), which uses a noop listener. If you need the ability to seek to particular offsets, you should prefer subscribe(Collection, ConsumerRebalanceListener), since group rebalances will cause partition offsets to be reset. You should also provide your own listener if you are doing your own offset management since the listener gives you an opportunity to commit offsets before a rebalance finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface does not allow for incremental assignment and will replace the previous assignment (if there is one). If the given list of topic partitions is empty, it is treated the same as unsubscribe(). Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change. Note that it is not possible to use both manual partition assignment with assign(Collection) and group assignment with subscribe(Collection, ConsumerRebalanceListener).
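For contrast with the subscribe() call in the question, here is a minimal sketch of manual assignment via assign(); the bootstrap address, topic names, and partition numbers are placeholders:
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Arrays;
import java.util.Properties;

public class ManualAssignExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        // No group.id is needed for assign(); offsets must then be managed manually.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            // Take ownership of specific partitions; no group rebalancing is involved.
            consumer.assign(Arrays.asList(
                    new TopicPartition("topic-0", 0),   // placeholder topic/partition
                    new TopicPartition("topic-1", 0)));
            consumer.seekToBeginning(consumer.assignment());
        }
    }
}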
You probably shouldn't do what you're doing. You should use subscribe(), with multiple partitions per topic and multiple consumers in the group for high availability, and let the consumer handle the offsets for you.
You don't describe why you're processing your topics in this custom way. It's advanced and leads to issues.
The timestamps on your instances should not have to be synchronized to do normal topic processing.
If you're looking for more performance, or want to isolate records more carefully to avoid "head of line blocking", consider something like Parallel Consumer (PC).
It also tracks per-record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).

Is it possible to reset offsets to a topic for a kafka consumer group in a kafka connector?

My kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that creating a new KafkaConsumer, calling subscribe(...) on the topic, and then resetting the offsets in the rebalance listener is the recommended way. For example,
public class MyTask extends SinkTask {
    @Override
    public void put(Collection<SinkRecord> records) {
        records.forEach(record -> {
            if (record.key().toString().equals("REPROCESS")) {
                reprocessTopicRecords(record);
            } else {
                // do something else
            }
        });
    }

    private void reprocessTopicRecords(SinkRecord record) {
        KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
            new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
        reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
            new ConsumerRebalanceListener() {
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // do offset reset here
                }
            }
        );
    }
}
However, the above strategy does not work in my case because:
1. It depends on a group rebalance taking place (which does not always happen).
2. The 'partitions' passed to the onPartitionsAssigned method are dynamically assigned partitions, meaning they are only a subset of the full set of partitions whose offsets need to be reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign(), but this is not compatible with the distributed consumer model (consumer groups) used in the SinkConnector/SinkTask implementation.
I am aware that the kafka command line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.
I was able to reset offsets for a Kafka Connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach first described in the original post and is purely REST API based.
1. Temporarily delete the Kafka connector (this deletes the connector's consumers).
2. Create a consumer instance for the same consumer group ("connect-").
3. Have the instance subscribe to the topic you want to reset.
4. Do a dummy poll ('subscribe' is evaluated lazily).
5. Reset the consumer group's offsets for the specified topic.
6. Do a dummy poll ('seek' is evaluated lazily), then commit the current offset state (in the proxy) for the consumer.
7. Re-create the Kafka connector (with the same connector name); after rebalancing, consumers will join the group and read the last committed offsets (starting from 0).
8. Delete the temporary consumer instance.
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the kafka connector code through native Java APIs, you're out of luck :-(
You're looking for the seek method. Either to an offset
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or seekToBeginning
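For example (a sketch that rewinds whatever partitions this consumer currently has assigned):
// Rewind every partition currently assigned to this consumer to the earliest offset
consumer.seekToBeginning(consumer.assignment());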
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you set up the consumer with a separate group id, you'd essentially be consuming records from the source topic twice: once by Connect, and once by your own consumer instance.
Unless you explicitly seek Connect's own consumer instance as well (which is not exposed), you'd get into a weird state. For example, your task only executes on new records to the topic while your own consumer is looking at an old offset, or you keep receiving newer events while still processing old ones.
Also, you might eventually get a reprocess event at the very beginning of the topic due to retention policies expiring old records, causing your consumer to make no progress at all and to constantly rebalance its group by seeking to the beginning.
We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
    String groupId,
    Map<TopicPartition, OffsetAndMetadata> offsets
)
We were lucky because we had the luxury of stopping the Kafka Connect instance for a while, so there was no competing consumer group.
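For illustration, a sketch of that AdminClient call; the bootstrap address, group id ("connect-my-sink-connector"), topic name, and partition count are placeholders, and the API requires a recent client version:
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ResetOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // Reset partitions 0-7 of "my-topic-1" to offset 0 for the connector's group.
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (int partition = 0; partition < 8; partition++) {
            offsets.put(new TopicPartition("my-topic-1", partition), new OffsetAndMetadata(0L));
        }

        try (AdminClient admin = AdminClient.create(props)) {
            // Fails if the group still has active members, so stop the connector first.
            admin.alterConsumerGroupOffsets("connect-my-sink-connector", offsets).all().get();
        }
    }
}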

kafka consumer java with multiple topics

We have one consumer group and three topics, and all three topics have different schemas. I created a single consumer and, in a for loop, subscribe to one topic at a time, poll it, process the data, and commit offsets manually.
I am seeing random consumer lag: although a topic has data, my consumer sometimes fetches no records from it and only fetches intermittently. When I work with a single topic instead of looping through three topics it works, but I am unable to reproduce the problem.
I need help debugging the issue and reproducing it.
Rather than looping over three topics in a single method, you could create a skeleton thread like the one below that consumes from any one topic. See examples here.
I can't say whether this will "fix" the problem, but consuming from topics with different schemas in one application is usually not a scalable pattern. It's not really clear what you're trying to do, though.
class ConsumerThread extends Thread {
    private final KafkaConsumer<String, String> consumer;
    private final AtomicBoolean stopped = new AtomicBoolean();

    ConsumerThread(Properties props, String topic) {
        this.consumer = new KafkaConsumer<>(props);
        this.consumer.subscribe(Collections.singletonList(topic));
    }

    @Override
    public void run() {
        while (!this.stopped.get()) {
            ConsumerRecords<String, String> records = this.consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // Process record
            }
        }
        this.consumer.close();
    }

    public void shutdown() {
        this.stopped.set(true);
    }
}
Not meant to be production-grade
Then run three consumers independently.
new ConsumerThread("t1").start();
new ConsumerThread("t2").start();
new ConsumerThread("t3").start();
Note: KafkaConsumer is not thread-safe.

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream to different Kafka topics.
Basically, I have a network with a leader linked to many children. For each child, the leader needs to write the stream it reads to a child-specific Kafka topic, so that the child can read it.
When a child is started, it registers itself in the Kafka topic read by the leader.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if this is possible because, in order to write to a topic, I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires the number of sinks to be known a priori.
But then, is there no way to obtain such behavior?
If I understood your problem correctly, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting pairs (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(
    new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
        @Override
        public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
            // emit one copy of the record per target topic
            for (String topic : topics) {
                out.collect(Tuple2.of(topic, value));
            }
        }
    });
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
    "default-topic",
    new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public String getTargetTopic(Tuple2<String, MyValue> element) {
            return element.f0;
        }
        ...
    },
    kafkaProperties));
KeyedSerializationSchema is now deprecated. Instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding its serialize method:
@Override
public ProducerRecord<byte[], byte[]> serialize(
        String inputString, @Nullable Long timestamp) {
    // customTopicName and key are fields of the enclosing schema class
    return new ProducerRecord<>(customTopicName,
        key.getBytes(StandardCharsets.UTF_8),
        inputString.getBytes(StandardCharsets.UTF_8));
}