I am using kafka processor api (not DSL)
public class StreamProcessor implements Processor<String, String>
{
public ProcessorContext context;
public void init(ProcessorContext context)
{
this.context = context;
context.commit()
//statestore initialized with key,value
}
public void process(String key, String val)
{
try
{
String[] topicList = stateStore.get(key).split("|");
for(String topic: topicList)
{
context.forward(key,val,To.child(consumerTopic));
} // forward same message to list of topics ( 1..n topics) , rollback if write to some topics failed ?
}
}
}
Scenario : we are reading data from a source topic and stream
processor writes data to multiple sink topics (topicList above) .
Question: How to implement rollback mechanism using kafka streams
processor api when one or more of the topics in the topicList above
fails to receive the message ? .
What I understand is processor api has rollback mechanism for each
record it failed to send, or can roll back for an an entire batch of
messages which failed be achieved as well? as process method in
processor interface is called per record rather than per batch hence I
would surmise it can only be done per record.Is this correct assumption ?, if not please suggest
how to achieve per record and per batch rollbacks for failed topics using processor api.
You would need to implement it yourself. For example, you could use two stores: main-store, and "buffer" store and first only update the buffer store, call context.forward() second to make sure all write are in the output topic, and afterward merge the "buffer" store into the main store.
If you need to roll back, you drop the content from the buffer store.
Related
I use spring cloud stream with kafka. I have a topic X, with partition Y and consumer group Z. Spring boot starter parent 2.7.2, spring kafka version 2.8.8:
#StreamListener("input-channel-name")
public void processMessage(final DomainObject domainObject) {
// some processing
}
It works fine.
I would like to have an endpoint in the app, that allows me to re-read/re-process (seek right?) all the messages in X.Y (again). But not after rebalancing (ConsumerSeekAware#onPartitionsAssigned) or after app restart (KafkaConsumerProperties#resetOffsets) but on demand like this:
#RestController
#Slf4j
#RequiredArgsConstructor
public class SeekController {
#GetMapping
public void seekToBeginningForDomainObject() {
/**
* seekToBeginning for X, Y, input-channel-name
*/
}
}
I just can't achieve that. Is it even possible ?. I understand that I have to do that on the consumer level, probably the one that is created after #StreamListener("input-channel-name") subscription, right ? but I've no clue how to obtain that consumer. How can I execute seek on demand to make kafka send the messages to the consumer again ? I just want to reset the offset for X.Y.Z to 0 to just make the app, load and process all the messages again.
https://docs.spring.io/spring-cloud-stream/docs/current/reference/html/spring-cloud-stream-binder-kafka.html#rebalance-listener
KafkaBindingRebalanceListener.onPartitionsAssigned() provides a boolean to indicate whether this is an initial assignment Vs. a rebalance assignment.
Spring cloud stream does not currently support arbitrary seeks at runtime, even though the underlying KafkaMessageDrivenChannelAdapter does support getting access to a ConsumerSeekCallback (which allows arbitrary seeks between polls). It would need an enhancement to the binder to allow access to this code.
It is possible, though, to consume idle container events in an event listener; the event contains the consumer, so you could do arbitrary seeks under those conditions.
There is topic Users with partitions.
Each partitions have messages about user data.
How to avoid duplications, for example dont allow inserting of the same user's name?
If I got this right I should create seperate topic Usernames and append all requested usernames.
Then before adding a new user in topic Users I ensure that there are not dublications in topic Usernames, right?
Accordingly using streams
I assume you are talking about a scenario where you are trying to publish events to Kafka topic from a micro-service.
Also, assuming you want to publish users profile --> username as key, user profile as value.
There are 2 issues of deduplication here :-
1.) you might get different usernames to your service at different times and publishing to topic.
2.) Duplicate message processing - During Broker failure(ack not received) or kafka client failures, the same message can be re-processed as kafka client does not hace ack info.
This can be taken care by enabling idempotency on kafka producers and atomic transactions.(Refer to Exactly Once processing)
I believe your question is about 1.) where your service receives duplicate messages.
Solution 1:-
If you are using micro-service, you can have an inmemory cache/DB of usernames and publish to kafka if duplicate is not found.
Solution 2:- (Handle on Kafka itself using streams)
input topic - users
Build an Kafka Stream client with stateStore(keyValueStore) and transformer to implement your dedupe logic.
So, your kafka stream client consumes the events from users topic and transforms in UserDedupeTransformer(where you have dedupe logic) and then produces to the output topic(as per ur requirement)
StoreBuilder<KeyValueStore<String, String>> storeBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("UserDedupeStoreName"),
Serdes.String(),
Serdes.String())
.withCachingEnabled();
builder.addStateStore(storeBuilder)
.stream("users-topic", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new UsersDedupeTransformer(), "usersDedupeStoreName")
.to("destination-topic");
In UserDedupeTransformer - Configured userDedupeStore and override the transform method -
public void init(ProcessorContext context) {
this.context = context;
dedupeStore = (KeyValueStore<String, String>) context.getStateStore("userDedupeStoreName");
}
public KeyValue<String, String> transform(String key, String v) {
if (null != key && null != dedupeStore.get(key))
return KeyValue.pair(key, value);
else
return null;
This dedupe store can be configured as In-Memory and also can be persisted using RocksDB.
I have a kafka streams app consuming from kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your streams app to write to a kafka topic?
How can you achieve exactly_once if your streams app wants to process the message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
Above is the "Exactly once" semantics explanation.
It is not necessary to publish the output to a topic always in KStream application.
When you are using KStream applications, you have to define an applicationID with each which uses a consumer in the backend. In the application, you have to configure few
parameters like processing.guarantee to exactly_once and enable.idempotence
Here are the details :
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
I am not conflicting on exactly-once stream pattern because that's the beauty of Kafka Stream however its possible to use Kafka Stream without producing to other topics.
Exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time. This means you consume one message at a time get the process and published to another topic and commit. So commit will be handle by Stream automatically one message a time.
Kafka Stream achieve these be setting below parameters which can not be overwritten
isolation.level: (read_committed) - Consumers will always read committed data only
enable.idempotence: (true) - Producer will always have idempotency enabled
max.in.flight.requests.per.connection" (5) - Producer will always have one in-flight request per connection
In case any error in the consumer or producer Kafka stream always retries a specific configured number of attempts.
KafkaStream doesn't guarantee inside processing logic we still need to handle e.g. there is a requirement for DB operation and if DB connection got failed in that case Kafka doesn't aware so you need to handle by your own.
As per pattern definition yes we need consumer, process, and producer topic but in general, it's not stopping you if you don't output to another topic. Still, you can consume exactly one item at a time with default time interval commit(DEFAULT_COMMIT_INTERVAL_MS) and again you need to handle your logic transaction failure by yourself
I am putting some sample examples.
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();
KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());
KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
logger.info("Completed VQM stream");
streams.close();
}));
logger.info("Streaming start...");
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
class ProcessInternal implements Processor<String, String> {
private ProcessorContext context;
#Override
public void init(ProcessorContext context) {
this.context = context;
}
#Override
public void close() {
// Any code for clean up would go here.
}
#Override
public void process(String key, String value) {
///Your transactional process business logic
}
}
Background:
We had previously used hibernate search, Lucene and jboss hornetq queue for indexing.
Our Application is the producer and sends the metadata(unique data information to identify a record in the Database) to the hornetq.
Consumer receives this metadata and query against the database to fetch the complete record details(including child objects).
This is much more database centric approach.
Now we want to eliminate the database centric approach for indexing. We have decided to use kafka rather hornetq.
There is no issue when user creates the data.
We see there is a potential problem when the user edits the data(Say a parent entity with two child objects). When the data is pulled from the database for user display,
we push the same data to kafka topic1. When user modify's the data(say parenet level data) and submits. We get only the parent level data(don't get the child objects data), we push the changed data to topic2. Now we have to merge the message present in topic1(child objects) with the corresponding message in topic2(parent level data)
Note: We have to take this route as you know there is no update in Indexing rather it is delete and then insert.
Questions:
If i go with the above approach, how can I map the specific
message present in topic1 with the specific message in topic2. Is
there a way to provide the same message ids in topic1 and topic2?
Is there any way to resolve this issue if i use the single topic?
Is there any better design/approach to resolve the above issue?
Thanks in advance.
If i go with the above approach, how can I map the specific message present in topic1 with the specific message in topic2. Is there a way to provide the same message ids in topic1 and topic2 ?
To map or join the specific messages between topics in the same Kafka cluster maybe Kafka Stream and KSQL is a good direction to do. Can you find the reference here.
There are many ways to make an object unique and I suggest using parent entity id when you send messages to topic1 and topic2. Sample Java code as following:
ProducerRecord<String, ParentEntity> record = new ProducerRecord<>(topic1,
ParentEntity.getId(), ParentEntity);
ListenableFuture<SendResult<String, ParentEntity>> future =
kafkaTemplate.send(record);
future.addCallback(new ListenableFutureCallback<SendResult<String,
ParentEntity>>() {
#Override
public void onSuccess(SendResult<String, ParentEntity> result) {}
#Override
public void onFailure(Throwable ex) {
//print out error log
}
});
ProducerRecord<String, ChildEntity> record = new ProducerRecord<>(topic2,
ChildEntity.getParentEntityId(), ChildEntity);
ListenableFuture<SendResult<String, ChildEntity>> future =
kafkaTemplate.send(record);
future.addCallback(new ListenableFutureCallback<SendResult<String,
ChildEntity>>() {
#Override
public void onSuccess(SendResult<String, ChildEntity> result) {}
#Override
public void onFailure(Throwable ex) {
//print out error log
}
});
Is there any way to resolve this issue if i use the single topic ?
You can create a new table (said A) in database to store the full message to be sent for indexing. Every time user creates or updates data the message also to be inserted/updated to the table A. Finally your Kafka client pulls message objects from the table A and produce to an unique topic in Kafka cluster.
Is there any better design/approach to resolve the above issue ?
Can you try Kafka Stream and KSQL as I mentioned above.
I'm using a kafka spring consumer that is under group management.
I have the following code in my consumer class
public class ConsumerHandler implements Receiver<String, Message>, ConsumerSeekAware {
#Value("${topic}")
protected String topic;
public ConsumerHandler(){}
#KafkaListener(topics = "${topic}")
public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment) {
for (ConsumerRecord<String, Message> message : messages) {
Message msg = message.value();
this.handleMessage(any, message);
}
acknowledgment.acknowledge();
return null;
}
#Override
public void registerSeekCallback(ConsumerSeekCallback callback) {
}
#Override
public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
for (Entry<TopicPartition, Long> pair : assignments.entrySet()) {
TopicPartition tp = pair.getKey();
callback.seekToEnd(tp.topic(),tp.partition());
}
}
#Override
public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {}
}
This code works great while my consumer is running. However, sometimes the amount of messages being processed is too much and the messages stack up. I've implemented concurrency on my consumers and still sometimes there's delay in the messages over time.
So as a workaround, before I figure out why the delay is happening, I'm trying to keep my consumer up to the latest messages.
I'm having to restart my app to get partition assigned invoked so that my consumer seeks to end and starts processing the latest messages.
Is there a way to seek to end without having to bounce my application?
Thanks.
As explained in the JavaDocs and the reference manual, you can save off the ConsumerSeekCallback passed into registerSeekCallback in a ThreadLocal<ConsumerSeekCallback>.
Then, you can perform arbitrary seek operations whenever you want; however, since the consumer is not thread-safe, you must perform the seeks within your #KafkaListener so they run on the consumer thread - hence the need to store the callback in a ThreadLocal.
In version 2.0 and later, you can add the consumer as a parameter to the #KafkaListener method and perform the seeks directly thereon.
public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment,
Consumer<?, ?> consumer) {
The current version is 2.1.6.
I have never found or seen a fixed solution for this kind of problem. The way I do is to boost performance as high as possible base on the amount of messages to be processed and Kafka parameters.
Let's say if you have a shopping online app then you can control the upper bound of the number of transactions per day, said N. So you should make the app work well in the scenario where 1.5*N or 2*N transactions will need to sync to Kafka cluster. You keep this state until a day your shopping app reaches a new level and you will need to upgrade your Kafka system again. For shopping online app there are a special high number of transactions in promotion or mega sales days so what you prepare for your system is for these days.