kafka Java API Consumer and producer Offset value comparison? - apache-kafka

I have a requirement to match Kafka producer offset value to consumer offset by using Java API?
I am new to KAFKA,Could anyone suggest how to proceed with this ?

Depending on your exact use case there are a couple of ways that you could go about this, but all of them will require an external system.
First of, Confluent offers the Confluent Control Center as part of their commercial offering, this would probably be the easiest way to go about this, if you are willing to spend the money.
If that is not for you, then you'd need to implement some sort of system to keep track of what you are producing and what you are consuming. For example you could simply use a database, take topic, partition and offset as primary key and have columns for produced_at and consumed_at.
Every time your producer writes a message to the cluster you have it update the produced_at column (look at ProducerInterceptor). Same on the consumer side, you could implement an interceptor that confirms having read the message, or confirm from the consumer itself, once it has successfully been processed.
Or if you don't need every message confirmed you could just implement regular checkpointing every 10k messages or something similar and trust that the consumer read everything up to the last offset it confirmed.
There's also the possibility of injecting checkpoint messages into the stream at regular intervalls and when the consumer sees one of these it triggers an action - again, you have to trust the consumer that it got everything in between the checkpoints.
As I said initially, it all depends on your exact use case, if you give us more detail I'm sure we can come up with something that works for you.
Update:
If you want to retrieve the offset after sending a message to Kafka you need to check the Future that the producer returns on send, this will contain the offset.
// Send message and store the future
Future<RecordMetadata> messageFuture = producer.send(new ProducerRecord<String, byte[]>(topic, serialize(currentMessage)));
producer.flush();
// as flush blocks until all operations have been completed (regardless of success or failure) we can be sure
// that our future is available at this point
try {
RecordMetadata metaData = messageFuture.get();
System.out.println("Sent message with offset: " + metaData.offset());
} catch (Exception e) {
// do some error handling
}

You can expose the offset of the producer and the consumer via Java Management Beans. There by you can do the comparison in realtime using the JConsole provided with the JDK.
Read about Gauge on how to expose the offset position of the producer and the consumer.

Related

What happens to the consumer offset when an error occurs within a custom class in a KStream topology?

I'm aware that you can define stream-processing Kafka application in the form of a topology that implicitly understands which record has gone through successfully, and therefore can correctly commit the consumer offset so that when the microservice has to be restarted, it will continue reading the input toppic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this.
KStream<String, Prediction> stream = new StreamsBuilder()
.stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
// talk to web service
.map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
.flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
.to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assuming that the external service can issue zero, one or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you would might have a problem if you are using map. combining remote calls in a DSL operator is not recommended. You might want to look into using the Processor API docs. With ProcessorContext you can forward or commit which could give you flexibility you need.

Update state in a Kafka stream chain without using Kafka Streams in an EOS way

I am currently working on the deployment of a distributed stream process chain using Kafka but not Kafka stream library. I've created a kind of node which can be executed and take as input a topic, process the obtained data and send it to an output topic. The node is a simple consumer/producer couple which is associated to a unique upstream partition. The producer is idempotent, the processing is done in a transaction context such as :
producer.initTransaction();
try
{
producer.beginTransaction();
//process
producer.commitTransaction();
}
catch (KafkaException e)
{
producer.abortTransaction();
}
I also used the producer.sendoffsetstotransaction method to ensure an atomic commit for the consumer.
I would like to use a key-value store for keeping the state of my nodes (i was thinking about MapDB which looks simple to use).
But I wonder if I update my state inside the transaction with a map.put(key, value) for example, will the transaction ensure that the state will be updated exactly-once ?
Thank you very much
Kafka only promises exactly once for its components - i.e. When I produce X to output-topic, I will also commit X to input-topic. Either both succeeds or both fails - i.e. Atomic.
So whatever you do between consuming and producing is totally on you to ensure the exactly-once. UNLESS, you use the state-store provided by Kafka itself. That is available to you if you use Kafka-streams.
If you cannot switch to kafka streams, it is still possible to ensure exactly once yourself if you track kafka's offsets in mapDB and add sufficient checks.
For eg, assuming you are trying to do deduplication here,
This is just one way of doing things - assuming that whatever you put in mapDB is committed right away. Even if not, you can always consult the "source of truth" - which are the topics here - and reconstruct the lost data.

Kafka Streams with single partition to pause on error

I have a single Kafka broker with single partition. The requirement was to do following:
Read from this partition
Transform message by invoking a REST API
Publish the transformed message to another REST API
Push the response message to another topic
I am using Kafka Streams for achieving this using the following code
StreamsBuilder builder = new StreamsBuilder();`
KStream<Object, Object> consumerStream = builder.stream(kafkaConfiguration.getConsumerTopic());
consumerStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerStream.to(kafkaConfiguration.getProducerTopic(), Produced.with(lStringKeySerde, lAvroValueSerde));
return builder.build();
FOllowing is my configuration:
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, String.join(",", bootstrapServers));
if (schemaRegistry != null && schemaRegistry.length > 0) {
streamsConfig.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, String.join(",", schemaRegistry));
}
streamsConfig.put(this.keySerializerKeyName, keyStringSerializerClassName);
streamsConfig.put(this.valueSerialzerKeyName, valueAVROSerializerClassName);
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
streamsConfig.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, FailOnInvalidTimestamp.class);
streamsConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
streamsConfig.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);
streamsConfig.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DeserializationExceptionHandler.class);
streamsConfig.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, ProductionExceptionHandler.class);
streamsConfig.put(StreamsConfig.TOPOLOGY_OPTIMIZATION,StreamsConfig.OPTIMIZE);
streamsConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, compressionMode);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
I was looking for a mechanism to do the following in my KeyValueMapper:
If any of the REST API is down then I catch the exception
I would like the same offset to be kept on looping until the system is back up OR pause the consumption till the system is back up
I've checked the following links but they do not seem to help.
How to run kafka streams effectively with single app instance and single topic partitions?
Following link talks about KafkaTransactionManager but that would not work I guess the way KStream is initialized above
Kafka transaction failed but commits offset anyway
Any help / pointers in this direction would be much appreciated.
What you want to do is not really supported. Pausing the consumer is not possible in Kafka Streams.
You can "halt" processing only, if you loop withing your KeyValueMapper, however, for this case, the consumer may drop out of the consumer group. For your case, with a single input topic partition and can only have a single thread in a single KafkaStreams instance anyway, hence, it would not affect any other member of the group (as there are none). However, the problem will be that committing the offset would fail after the thread dropped out of the group. Hence, after the thread rejoin the group it would fetch an older offset and reprocess some data (ie, you get duplicate data processing). To avoid dropping out of the consumer group, you could set max.poll.interval.ms config to a high value (maybe even Integer.MAX_VALUE) though -- given that you have a single member in the consumer group, setting a high value should be ok.
Another alternative might be te use a transform() with a state store. If you cannot make the REST calls, you put the data into the store and retry later. This way the consumer would not drop out of the group. However, reading new data would never stop, and you would need to buffer all data in the store until the REST API can be called again. You should be able to slow down reading new data (to reduce the amount of data you need to buffer) by "sleeping" in your Transformer -- you just need to ensure that you don't violate max.poll.interval.ms config (default is 30 seconds).

How to make restart-able producer?

Latest version of kafka support exactly-once-semantics (EoS). To support this notion, extra details are added to each message. This means that at your consumer; if you print offsets of messages they won't be necessarily sequential. This makes harder to poll a topic to read the last committed message.
In my case, consumer printed something like this
Offset-0 0
Offset-2 1
Offset-4 2
Problem: In order to write restart-able proudcer; I poll the topic and read the content of last message. In this case; last message would be offset#5 which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at : Getting the last message sent to a kafka topic. The only problem is that instead of using consumer.seek(partition, last_offset=1); I would use consumer.seek(partition, last_offset-2). This can immediately resolve my issue, but it's not an ideal solution.
What would be the most reliable and best solution to get last committed message for a consumer written in Java? OR
Is it possible to use local state-store for a partition? OR
What is the most recommended way to store last message to withstand producer-failure? OR
Are kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI- I am not looking for quick fix
In my case, multiple producers push data to one big topic. Therefore, reading entire topic would be nightmare.
The solution that I found is to maintain another topic i.e. "P1_Track" where producer can store metadata. Within a transaction a producer will send data to one big topic and P1_Track.
When I restart a producer, it will read P1_Track and figure out where to start from.
Thinking about storing last committed message in a database and using it when producer process restarts.

Having a Kafka Consumer read a single message at a time

We have Kafka setup to be able to process messages in parallel by several servers. But every message must only be processed exactly once (and by only one server). We have this up and running and it’s working fine.
Now, the problem for us is that the Kafka Consumers reads messages in batches for maximal efficiency. This leads to a problem if/when processing fails, the server shuts down or whatever, because then we loose data that was about to be processed.
Is there a way to get the Consumer to only read on message at a time to let Kafka keep the unprocessed messages? Something like; Consumer pulls one message -> process -> commit offset when done, repeat. Is this feasible using Kafka? Any thoughts/ideas?
Thanks!
You might try setting max.poll.records to 1.
You mention having exactly one processing, but then you're worried about losing data. I'm assuming you're just worried about the edge case when one of your server fails? And you lose data?
I don't think there's a way to accomplish one message at a time. Looking through the consumer configurations, there only seems to be a option for setting the max bytes a consumer can fetch from Kafka, not number of messages.
fetch.message.max.bytes
But if you're worried about losing data completely, if you never commit the offset Kafka will not mark is as being committed and it won't be lost.
Reading through the Kafka documentation about delivery semantics,
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
So to achieve exactly-once processing is not something that Kafka enables by default. It requires you to implement storing the offset whenever you write the output of your processing to storage.
But this can be handled more simply and generally by simply letting
the consumer store its offset in the same place as its output...As an example of this,
our Hadoop ETL that populates data in HDFS stores its offsets in HDFS
with the data it reads so that it is guaranteed that either data and
offsets are both updated or neither is.
I hope that helps.
It depends on what client you are going to use. For C++ and python, it is possible to consume ONE message each time.
For python, I used https://github.com/mumrah/kafka-python. The following code can consume one message each time:
message = self.__consumer.get_message(block=False, timeout=self.IterTimeout, get_partition_info=True )
__consumer is the object of SimpleConsumer.
See my question and answer here:How to stop Python Kafka Consumer in program?
For C++, I am using https://github.com/edenhill/librdkafka. The following code can consume one message each time.
214 while( m_bRunning )
215 {
216 // Start to read messages from the local queue.
217 RdKafka::Message *msg = m_consumer->consume(m_topic, m_partition, 1000);
218 msg_consume(msg);
219 delete msg;
220 m_consumer->poll(0);
221 }
m_consumer is the pointer to C++ Consumer object (C++ API).
Hope this help.