What happens to the consumer offset when an error occurs within a custom class in a KStream topology? - apache-kafka

I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly understands which records have gone through successfully, and can therefore correctly commit the consumer offset, so that when the microservice has to be restarted it will continue reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this.
KStream<String, Prediction> stream = new StreamsBuilder()
.stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
// talk to web service
.map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
.flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
.to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?

I think you might have a problem if you are using map; combining remote calls with a DSL operator is not recommended. You might want to look into the Processor API docs. With ProcessorContext you can forward or commit, which could give you the flexibility you need.
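To make the offset concern concrete, here is a minimal, library-free sketch of the bookkeeping involved. It is not Kafka Streams code: the class BatchingOffsetSketch, its methods, and the batchCall function (a stand-in for the remote service) are all hypothetical names. It only advances the "safe to commit" offset after a whole batch has been submitted successfully.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/**
 * Library-free model (not Kafka Streams API): buffers inputs, submits them in
 * batches, and only advances the committable offset after a batch succeeded.
 */
final class BatchingOffsetSketch<I, O> {
    private final int batchSize;
    private final Function<List<I>, List<O>> batchCall; // hypothetical stand-in for the remote call
    private final List<I> buffer = new ArrayList<>();
    private long lastBufferedOffset = -1L;
    private long committableOffset = -1L; // -1 means nothing is safe to commit yet

    BatchingOffsetSketch(int batchSize, Function<List<I>, List<O>> batchCall) {
        this.batchSize = batchSize;
        this.batchCall = batchCall;
    }

    /** Buffers one record; returns the batch results once the batch is full. */
    List<O> process(long offset, I value) {
        buffer.add(value);
        lastBufferedOffset = offset;
        return buffer.size() >= batchSize ? flush() : Collections.<O>emptyList();
    }

    /** Submits the buffered records. If batchCall throws, nothing is marked committable. */
    List<O> flush() {
        if (buffer.isEmpty()) {
            return Collections.emptyList();
        }
        List<O> results = batchCall.apply(new ArrayList<>(buffer));
        buffer.clear();
        // Kafka convention: the committed offset is the next offset to read.
        committableOffset = lastBufferedOffset + 1;
        return results;
    }

    long committableOffset() {
        return committableOffset;
    }
}
```

In a real Processor, records that are only buffered in memory count as processed once process() returns, so a crash before the flush could lose them. Keeping the buffer in a state store is the usual way to close that gap.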

Related

Does KStream filter consume every message?

I have used Kafka in the past, but never the streams API. I am tasked with building a scalable service that accepts websocket connections and routes outbound messages from a central topic to the correct session based on user id.
This looks ridiculously simple using KStream<String, Object>. From one online tutorial:
builder.stream(inputTopic, Consumed.with(Serdes.String(), publicationSerde))
.filter((name, publication) -> "George R. R. Martin".equals(publication.getName()))
.to(outputTopic, Produced.with(Serdes.String(), publicationSerde));
But does the filter command consume every message from the topic and perform a filter in application space? Or does KStream<K, V> filter(Predicate<? super K,? super V> predicate) contain hooks into the inner workings of Kafka that allow it only to receive messages matching the correct key?
The wording of the KStream<K,V> javadoc seems to suggest the former: "consumed message by message."
If the only purpose of the filter is to consume every message of a topic and throw away those that are not relevant, I could do that by hand.
You are correct: messages need to be deserialized, then inspected against a predicate (in application space).
"throw away those that are not relevant, I could do that by hand"
Sure, you could, but Kafka Streams has useful methods for defining session windows. Plus, you wouldn't need to define a consumer and producer instance to forward to new topics.

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets messages of an input topic for the partitions that have been assigned to that instance. And as the group.id is based on the application.id (which is identical for all instances), every instance sees only part of a topic.
This all makes perfect sense of course, and we make use of that with the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. But as all instances need to get those messages, we would have to send either
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid), or
one control message per key (so every active partition would be getting at least one control message).
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every instance receives all messages from the control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to using Kafka Streams, which would allow us to set the group id as we like. However this sounds complex and dirty enough to wonder if there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation,
Adds a global StateStore to the topology. The StateStore sources its
data from all partitions of the provided input topic. There will be
exactly one instance of this StateStore per Kafka Streams instance.
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
String topic,
Consumed consumed,
ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier which has a get() that returns a Processor that will be executed for every new message. The Processor contains the process() method that will be executed every time there is a new message to the topic.
The global store is per stream instance, so you get all the topic data in every stream instance.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent. It does not get a separate changelog topic; its source topic effectively serves as the changelog, so even if the streams instance's local data (state) is deleted, the store can be rebuilt from that topic.

Kafka Streams with single partition to pause on error

I have a single Kafka broker with a single partition. The requirement is to do the following:
Read from this partition
Transform message by invoking a REST API
Publish the transformed message to another REST API
Push the response message to another topic
I am using Kafka Streams to achieve this, with the following code:
StreamsBuilder builder = new StreamsBuilder();
KStream<Object, Object> consumerStream = builder.stream(kafkaConfiguration.getConsumerTopic());
consumerStream = consumerStream.map(getKeyValueMapper(keyValueMapperClassName));
consumerStream.to(kafkaConfiguration.getProducerTopic(), Produced.with(lStringKeySerde, lAvroValueSerde));
return builder.build();
The following is my configuration:
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, String.join(",", bootstrapServers));
if (schemaRegistry != null && schemaRegistry.length > 0) {
streamsConfig.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, String.join(",", schemaRegistry));
}
streamsConfig.put(this.keySerializerKeyName, keyStringSerializerClassName);
streamsConfig.put(this.valueSerialzerKeyName, valueAVROSerializerClassName);
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
streamsConfig.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, FailOnInvalidTimestamp.class);
streamsConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
streamsConfig.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);
streamsConfig.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DeserializationExceptionHandler.class);
streamsConfig.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, ProductionExceptionHandler.class);
streamsConfig.put(StreamsConfig.TOPOLOGY_OPTIMIZATION,StreamsConfig.OPTIMIZE);
streamsConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, compressionMode);
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
I was looking for a mechanism to do the following in my KeyValueMapper:
If any of the REST APIs is down, I catch the exception
I would like to keep retrying the same offset until the system is back up, OR pause consumption until the system is back up
I've checked the following links but they do not seem to help.
How to run kafka streams effectively with single app instance and single topic partitions?
The following link talks about KafkaTransactionManager, but I guess that would not work with the way the KStream is initialized above:
Kafka transaction failed but commits offset anyway
Any help / pointers in this direction would be much appreciated.
What you want to do is not really supported. Pausing the consumer is not possible in Kafka Streams.
You can only "halt" processing if you loop within your KeyValueMapper; in that case, however, the consumer may drop out of the consumer group. In your case, with a single input topic partition, you can only have a single thread in a single KafkaStreams instance anyway, so this would not affect any other member of the group (as there are none). However, the problem is that committing the offset would fail after the thread has dropped out of the group. Hence, after the thread rejoins the group, it would fetch an older offset and reprocess some data (i.e., you get duplicate data processing). To avoid dropping out of the consumer group, you could set the max.poll.interval.ms config to a high value (maybe even Integer.MAX_VALUE). Given that you have a single member in the consumer group, setting a high value should be ok.
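As a rough sketch of what "looping" inside the mapper could look like, the helper below keeps retrying a call until it succeeds, sleeping with a capped exponential backoff in between. It uses only the JDK. The class name RetryUntilAvailable, the method callWithRetry, and the backoff parameters are illustrative assumptions, not part of Kafka Streams.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: blocks the calling thread until the action succeeds. */
final class RetryUntilAvailable {

    /**
     * Calls the action until it returns without throwing. Between attempts it
     * sleeps, doubling the pause each time up to maxBackoffMillis.
     */
    static <T> T callWithRetry(Callable<T> action, long initialBackoffMillis, long maxBackoffMillis) {
        long backoff = initialBackoffMillis;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Do not swallow interruption: restore the flag and give up.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during call", e);
                }
                pause(backoff);
                backoff = Math.min(backoff * 2, maxBackoffMillis);
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

Inside a KeyValueMapper, the REST call would be wrapped in callWithRetry. While it blocks, no new records are polled, which is why the max.poll.interval.ms caveat above matters.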
Another alternative might be to use a transform() with a state store. If you cannot make the REST calls, you put the data into the store and retry later. This way the consumer would not drop out of the group. However, reading new data would never stop, and you would need to buffer all data in the store until the REST API can be called again. You should be able to slow down reading new data (to reduce the amount of data you need to buffer) by "sleeping" in your Transformer; you just need to ensure that you don't violate the max.poll.interval.ms config (the consumer default is 5 minutes).

Kafka + Streams as Event Store in a CQRS application - Command Model consistency

I've been reading a few articles about using Kafka and Kafka Streams (with state store) as Event Store implementation.
https://www.confluent.io/blog/event-sourcing-using-apache-kafka/
https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
The implementation idea is the following:
1. Store entity changes (events) in a Kafka topic
2. Use Kafka Streams with a state store (by default uses RocksDB) to update and cache the entity snapshot
3. Whenever a new Command is being executed, get the entity from the store, execute the operation on it, and continue with step #1
The issue with this workflow is that the State Store is being updated asynchronously (step 2) and when a new command is being processed the retrieved entity snapshot might be stale (as it was not updated with events from previous commands).
Is my understanding correct? Is there a simple way to handle such case with kafka?
"Is my understanding correct?"
As far as I have been able to tell, yes -- which means that it is an unsatisfactory event store for many event-sourced domain models.
In short, there's no support for "first writer wins" when adding events to a topic, which means that Kafka doesn't help you ensure that the topic satisfies its invariants.
There have been proposals/tickets to address this, but I haven't found evidence of progress.
https://issues.apache.org/jira/browse/KAFKA-2260
https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish
Yes, there is a simple way.
Use a key for the Kafka message. Messages with the same key always* go to the same partition.
One consumer can read from one or many partitions, but one partition cannot be read by two consumers of the same group simultaneously.
The maximum number of working consumers is always <= the number of partitions of a topic. You can create more consumers, but the extra consumers will be idle backup nodes.
Something like this example:
Assumptions:
There is a Kafka topic abc with partitions p0, p1.
There is a consumer C1 consuming from p0, and a consumer C2 consuming from p1. The consumers work asynchronously.
km(key,command) - kafka message.
#Procedure creating message
km(key1,add) -> p0
km(key2,add) -> p1
km(key1,edit) -> p0
km(key3,add) -> p1
km(key3,edit) -> p1
#consumer C1 will read messages km(key1,add), km(key1,edit) and their order will be preserved
#consumer C2 will read messages km(key2,add), km(key3,add), km(key3,edit)
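The routing in the example above can be modelled in a few lines of plain Java. This is only a sketch: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas this model substitutes String.hashCode(), so the actual partition numbers will differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of key-based partitioning; not Kafka's real partitioner. */
final class KeyPartitioningSketch {

    /** Same key always yields the same partition for a fixed partition count. */
    static int partitionFor(String key, int numPartitions) {
        // Assumption: String.hashCode() stands in for Kafka's murmur2 hash.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Appends each {key, command} message to its partition in arrival order. */
    static List<List<String>> route(List<String[]> messages, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] message : messages) {
            String key = message[0];
            String command = message[1];
            partitions.get(partitionFor(key, numPartitions)).add(key + ":" + command);
        }
        return partitions;
    }
}
```

Because every message for key1 lands in the same list, and each list is append-only, the add for key1 always precedes its edit. That is the per-key ordering guarantee the answer relies on.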
If you write commands to Kafka then materialize a view in KStreams the materialized view will be updated asynchronously. This helps you separate writes from reads so the read path can scale.
If you want consistent read-write semantics over your commands/events you might be better writing to a database. Events can either be extracted from the database into Kafka using a CDC connector (write-through) or you can write to the database and then to Kafka in a transaction (write-aside).
Another option is to implement long polling on the read (so if you write trade1.version2 then want to read it again the read will block until trade1.version2 is available). This isn't suitable for all use cases but it can be useful.
Example here: https://github.com/confluentinc/kafka-streams-examples/blob/4eb3aa4cc9481562749984760de159b68c922a8f/src/main/java/io/confluent/examples/streams/microservices/OrdersService.java#L165
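The long-polling idea can be sketched without any Kafka dependency. The following is not the code from the linked example; it is a minimal JDK-only model with hypothetical names (VersionedViewSketch, update, awaitVersion), in which a reader blocks until the asynchronously updated view has reached the version it just wrote.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a materialized view that readers can wait on. */
final class VersionedViewSketch {
    private final Map<String, Long> versions = new HashMap<>();

    /** Called by the (asynchronous) view updater when a new version arrives. */
    synchronized void update(String key, long version) {
        versions.merge(key, version, Long::max);
        notifyAll(); // wake up any readers waiting for this version
    }

    /**
     * Blocks until the key has reached at least minVersion, or the timeout
     * expires. Returns true if the version became visible in time.
     */
    synchronized boolean awaitVersion(String key, long minVersion, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (versions.getOrDefault(key, -1L) < minVersion) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                wait(remainingMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A caller that has just written trade1 at version 2 would call awaitVersion("trade1", 2, timeout) before reading. The timeout keeps a lost or delayed update from blocking the read forever.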
The Command Pattern that you want to implement is already part of the Akka framework. I don't know whether you have experience with the framework, but I strongly advise you to look there before you implement your own solution.
Also, given the number of events we receive in today's IT, I advise integrating it with a State Machine.
If you would like to see how we can put it all together, please check my blog :)

kafka Java API Consumer and producer Offset value comparison?

I have a requirement to match the Kafka producer offset value to the consumer offset using the Java API.
I am new to Kafka. Could anyone suggest how to proceed with this?
Depending on your exact use case there are a couple of ways that you could go about this, but all of them will require an external system.
First off, Confluent offers the Confluent Control Center as part of their commercial offering. This would probably be the easiest way to go about this, if you are willing to spend the money.
If that is not for you, then you'd need to implement some sort of system to keep track of what you are producing and what you are consuming. For example you could simply use a database, take topic, partition and offset as primary key and have columns for produced_at and consumed_at.
Every time your producer writes a message to the cluster you have it update the produced_at column (look at ProducerInterceptor). Same on the consumer side, you could implement an interceptor that confirms having read the message, or confirm from the consumer itself, once it has successfully been processed.
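As a rough illustration of that tracking table, here is an in-memory stand-in for the database, keyed by (topic, partition, offset) with produced_at and consumed_at timestamps. The class OffsetLedgerSketch and its methods are hypothetical. In practice, markProduced and markConsumed would be called from interceptors or from the client code, and the rows would live in a real database.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of the tracking table described above. */
final class OffsetLedgerSketch {
    private static final class Row {
        Instant producedAt; // corresponds to the produced_at column
        Instant consumedAt; // corresponds to the consumed_at column
    }

    private final Map<List<Object>, Row> rows = new LinkedHashMap<>();

    /** Composite primary key: topic, partition, offset. */
    static List<Object> key(String topic, int partition, long offset) {
        return Arrays.<Object>asList(topic, partition, offset);
    }

    void markProduced(String topic, int partition, long offset) {
        row(topic, partition, offset).producedAt = Instant.now();
    }

    void markConsumed(String topic, int partition, long offset) {
        row(topic, partition, offset).consumedAt = Instant.now();
    }

    /** Keys of records that were produced but not yet confirmed as consumed. */
    List<List<Object>> producedButNotConsumed() {
        List<List<Object>> missing = new ArrayList<>();
        for (Map.Entry<List<Object>, Row> e : rows.entrySet()) {
            if (e.getValue().producedAt != null && e.getValue().consumedAt == null) {
                missing.add(e.getKey());
            }
        }
        return missing;
    }

    private Row row(String topic, int partition, long offset) {
        return rows.computeIfAbsent(key(topic, partition, offset), k -> new Row());
    }
}
```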
Or if you don't need every message confirmed you could just implement regular checkpointing every 10k messages or something similar and trust that the consumer read everything up to the last offset it confirmed.
There's also the possibility of injecting checkpoint messages into the stream at regular intervals; when the consumer sees one of these, it triggers an action. Again, you have to trust that the consumer got everything in between the checkpoints.
As I said initially, it all depends on your exact use case, if you give us more detail I'm sure we can come up with something that works for you.
Update:
If you want to retrieve the offset after sending a message to Kafka, you need to check the Future that the producer returns on send; the RecordMetadata it resolves to contains the offset.
// Send message and store the future
Future<RecordMetadata> messageFuture = producer.send(new ProducerRecord<String, byte[]>(topic, serialize(currentMessage)));
producer.flush();
// as flush blocks until all operations have been completed (regardless of success or failure) we can be sure
// that our future is available at this point
try {
RecordMetadata metaData = messageFuture.get();
System.out.println("Sent message with offset: " + metaData.offset());
} catch (Exception e) {
// do some error handling
}
You can expose the offsets of the producer and the consumer via JMX MBeans. You can then do the comparison in real time using the JConsole tool provided with the JDK.
Read about Gauge on how to expose the offset position of the producer and the consumer.
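For illustration, a gauge like this can be exposed with plain JMX from the JDK, without any metrics library. The sketch below is an assumption about how one might wire it up: the class names, the ObjectName string, and the set/read helpers are all hypothetical, and the application code is expected to call set() whenever it learns a new offset.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Exposes a single long value (e.g. the last seen offset) as a JMX attribute. */
final class OffsetJmxSketch {

    /** Standard MBean interface: must be public and named <ImplClass>MBean. */
    public interface OffsetGaugeMBean {
        long getOffset();
    }

    public static final class OffsetGauge implements OffsetGaugeMBean {
        private final AtomicLong offset = new AtomicLong(-1L);

        /** Called by application code after a send or after processing a record. */
        public void set(long value) {
            offset.set(value);
        }

        @Override
        public long getOffset() {
            return offset.get();
        }
    }

    /** Registers a new gauge under the given name on the platform MBean server. */
    static OffsetGauge register(String objectName) {
        try {
            OffsetGauge gauge = new OffsetGauge();
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(gauge, new ObjectName(objectName));
            return gauge;
        } catch (Exception e) {
            throw new IllegalStateException("could not register " + objectName, e);
        }
    }

    /** Reads the value back through JMX, the same way JConsole would. */
    static long read(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return (Long) server.getAttribute(new ObjectName(objectName), "Offset");
        } catch (Exception e) {
            throw new IllegalStateException("could not read " + objectName, e);
        }
    }
}
```

Registering one gauge for the producer and one for the consumer, for example under names like example.kafka:type=Offset,name=producer, makes both values visible side by side in JConsole.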