Kafka transaction management at producer end - apache-kafka

I am looking for how Kafka behaves when the producer is running in a transaction.
I have Oracle database insert operations running in the same transaction, and their changes are rolled back if the transaction is rolled back.
How does the Kafka producer behave in case of a transaction rollback?
Will the message be rolled back, or does Kafka not support rollback?
I know JMS messages are committed to the queue only when the transaction is committed. I am looking for a similar solution, if it is supported.
Note: the producer code is written using Spring Boot.

You are trying to update two systems:
update a record in your Oracle database
send an event to Apache Kafka
This is a challenge because you would like the two updates to be atomic: either everything gets executed or nothing. Otherwise you will end up with inconsistencies between your database and Kafka.
You might send a Kafka message even though the database transaction was rolled back.
Or, the other way around (if you are sending the message just after the commit), you might commit the database transaction and crash (for some reason) just before sending the Kafka event.
One of the simplest solutions is to use the outbox pattern:
Let's say you want to update an order table and send an orderEvent to Kafka.
Instead of sending the event to Kafka in the same transaction, you save it to a database table (the outbox) using the same transaction as the order update, as sketched below.
A separate process then reads the outbox table and makes sure each event is sent to Kafka (with at-least-once semantics).
Your consumer needs to be idempotent.
In this post, I explain in more detail how to implement this solution:
https://mirakl.tech/sending-kafka-message-in-a-transactional-way-34d6d19bb7b2
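Roughly, the transactional half of the pattern could look like this with Spring; the table names, columns, and payload below are placeholders, not something prescribed by the pattern:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    private final JdbcTemplate jdbcTemplate;

    public OrderService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional
    public void updateOrder(String orderId, String newStatus) {
        // Both statements share one database transaction: if it rolls back,
        // neither the order update nor the pending event survives.
        jdbcTemplate.update("UPDATE orders SET status = ? WHERE id = ?", newStatus, orderId);
        jdbcTemplate.update(
                "INSERT INTO outbox (event_type, aggregate_id, payload) VALUES (?, ?, ?)",
                "orderEvent", orderId,
                "{\"orderId\":\"" + orderId + "\",\"status\":\"" + newStatus + "\"}");
        // A separate relay process polls the outbox table, publishes the rows to Kafka
        // with at-least-once delivery, and marks them as sent.
    }
}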

Related

Kafka transactions how to emulate an error to rollback transaction

I want to test Kafka transactional behavior in depth. I need to test what happens at the transaction boundary, if that is possible.
For example, I want the commit of the transaction to fail so that I can test what happens during a transaction rollback.
Is this possible?
For example, when talking about database transactions, one easy way to test a failed transaction is to try to persist a record with a field longer than the maximum allowed length.
But with Kafka I don't know how to emulate something similar to trigger a Kafka transaction rollback.
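One simple way to exercise the abort path is to throw from inside the transactional callback. A minimal sketch with Spring Kafka's KafkaTemplate, assuming the producer factory is transactional; the topic, key, and payload are placeholders:

import org.springframework.kafka.core.KafkaTemplate;

public class ForceAbortExample {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public ForceAbortExample(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void sendAndForceAbort() {
        // Requires a transaction-capable producer factory,
        // e.g. spring.kafka.producer.transaction-id-prefix set in the application properties.
        kafkaTemplate.executeInTransaction(operations -> {
            operations.send("orders", "key-1", "order-created");
            // Throwing here makes the template abort the transaction, so a consumer
            // with isolation.level=read_committed never sees the record above.
            throw new RuntimeException("simulated failure to force an abort");
        });
    }
}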

Kafka & JPA transaction management

I am new to Kafka and trying to do transaction management across a Kafka transaction and a DB transaction. I have already read many articles on this topic, but so far I have been able to test only one scenario successfully.
@Transactional
public void updateData(InputData data) {
    repository.save(data);              // DB write participates in the transaction
    kafkaTemplate.send(data.id, data);  // Kafka send is synchronized with the same transaction
}
In this case, if the Kafka transaction fails, the DB transaction will be rolled back. This works fine.
But is it possible to do the Kafka transaction first and then the DB transaction, so that if the DB transaction fails, the Kafka transaction is aborted and the message posted to the Kafka topic stays uncommitted?
I tested such a scenario, but it didn't work: the message posted to the topic was not left in an uncommitted state. Hence I want to check whether this scenario is possible.
I solved this problem by using nested @Transactional annotations: I put @Transactional("kafkaTxManager") on the method in which I start the Kafka transaction and @Transactional("chainkafkaTxmanager") on the method where I start the DB transaction.
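A sketch of that nested arrangement, assuming Spring Kafka with a KafkaTransactionManager bean and a JPA transaction manager; the bean names, topic, and repository below are illustrative, and the DB work lives in a second bean so that its @Transactional proxy is actually applied:

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
class KafkaFirstService {

    private final KafkaTemplate<String, InputData> kafkaTemplate;
    private final DbService dbService;

    KafkaFirstService(KafkaTemplate<String, InputData> kafkaTemplate, DbService dbService) {
        this.kafkaTemplate = kafkaTemplate;
        this.dbService = dbService;
    }

    // Outer transaction bound to the Kafka transaction manager (bean name is illustrative).
    @Transactional("kafkaTransactionManager")
    public void updateData(InputData data) {
        kafkaTemplate.send("input-topic", data.id, data); // stays uncommitted until the Kafka tx commits
        dbService.save(data); // a failure here propagates and aborts the Kafka transaction
    }
}

@Service
class DbService {

    private final InputDataRepository repository; // hypothetical JPA repository

    DbService(InputDataRepository repository) {
        this.repository = repository;
    }

    // Inner transaction bound to the JPA/DB transaction manager (bean name is illustrative).
    @Transactional("transactionManager")
    public void save(InputData data) {
        repository.save(data);
    }
}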

What are Kafka transactions?

What do transactions mean in Kafka? Of course, I know ordinary SQL transactions: a transaction is a sequence of operations performed (using one or more SQL statements) on a database as a single logical unit of work.
So does it mean it's possible to send something to Kafka, and if something goes wrong it will be rolled back (the messages erased from the partitions)?
And is it possible to write to different topics in a transaction?
The transactional producer allows you to send data to multiple partitions and guarantees that all of these writes are either committed or discarded. This is done by grouping multiple calls to send() (and optionally sendOffsetsToTransaction()) into a transaction.
Once a transaction is started, you can call commitTransaction() or abortTransaction() to complete it.
Consumers configured with isolation.level=read_committed will not receive messages from aborted transactions.
For more details, check the Message Delivery Semantics section in the docs
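For illustration, a minimal transactional producer with the plain Java client might look like this; the broker address, topics, keys, and the transactional.id are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("transactional.id", "demo-tx-producer");  // required to use transactions

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();      // registers the transactional.id, fences older producers
            producer.beginTransaction();
            try {
                // Writes to different topics/partitions all belong to the same transaction.
                producer.send(new ProducerRecord<>("orders", "key-1", "order-created"));
                producer.send(new ProducerRecord<>("audit", "key-1", "order-created-audit"));
                producer.commitTransaction();  // both records become visible atomically
            } catch (KafkaException e) {
                producer.abortTransaction();   // read_committed consumers never see the two records
                throw e;
            }
        }
    }
}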

How to implement kafka manual offset commits for database transactions

I am using Kafka with a .NET service that takes messages from Kafka; once I take a message from Kafka, I send it to the database and then perform some operations in the database.
My code works as follows:
The .NET service executes a method that returns a message from Kafka. At this point Kafka notes that the message has been consumed by my code.
After the message is returned, I perform the database operations, so if something is wrong in the DB, the message will not get processed. But when I retry the process from my service, it will not receive the same message, because the message is already marked as consumed in Kafka.
So I want to make sure the consumer does not commit when my service returns the Kafka message, but marks the message as consumed only after the DB operation has been performed and everything is OK.
Can anyone suggest how I can implement manual offset commits for this case?
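The pattern is the same in any Kafka client: disable auto-commit and commit the offsets only after the database work succeeds. A sketch with the Java consumer, where the broker, group, topic, and database call are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "db-writer");                // placeholder group
        props.put("enable.auto.commit", "false");          // we decide when the offset is committed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    saveToDatabase(record.value()); // hypothetical DB call; throws on failure
                }
                // Only reached when every record in the batch was written to the DB;
                // if an exception is thrown, the offsets stay uncommitted and the
                // messages are redelivered after a restart or rebalance.
                consumer.commitSync();
            }
        }
    }

    private static void saveToDatabase(String value) {
        // placeholder for the real database operation
    }
}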

Using Kafka Connect HOWTO "commit offsets" as soon as a "put" is completed in SinkTask

I am using Kafka Connect to get messages from a Kafka broker (v0.10.2) and then sync them to a downstream service.
Currently, I have code in SinkTask#put that processes the SinkRecord and then persists it to the downstream service.
A couple of key requirements:
We need to make sure the messages are persisted to the downstream service AT LEAST once.
If the downstream service throws an error or says it didn't process the message, then we need to make sure the messages are re-read.
So we thought we could rely on SinkTask#flush to effectively back out of committing offsets for that particular poll/cycle of received messages, by throwing an exception or something else that would tell Connect not to commit the offsets but retry on the next poll.
But as we found out, flush is actually time-based and more or less independent of the polls; it commits the offsets when a certain time threshold is reached.
In 0.10.2 SinkTask#preCommit was introduced, so we thought we could use it for our purposes. But nowhere in the documentation is it mentioned that there is a 1:1 relationship between SinkTask#put and SinkTask#preCommit.
Essentially, we want to commit offsets as soon as a single put succeeds, and likewise not commit the offsets if that particular put failed.
How can we accomplish this, if not via SinkTask#preCommit?
Getting data into and out of Kafka correctly can be challenging, and Kafka Connect makes this easier since it uses best practices and hides many of the complexities. For sink connectors, Kafka Connect reads messages from a topic, sends them to your connector, and then periodically commits the largest offsets for the various topic partitions that have been read and processed.
Note that "sending them to your connector" corresponds to the put(Collection<SinkRecord>) method, and this may be called many times before Kafka Connect commits the offsets. You can control how frequently Kafka Connect commits offsets, but Kafka Connect ensures that it will only commit an offset for a message when that message was successfully processed by the connector.
When the connector is operating nominally, everything is great and your connector sees each message once, even when the offsets are committed periodically. However, should the connector fail, then when it restarts the connector will start at the last committed offset. That might mean your connector sees some of the same messages that it processed just before the crash. This usually is not a problem if you carefully write your connector to have at least once semantics.
Why does Kafka Connect commit offsets periodically rather than with every record? Because it saves a lot of work and doesn't really matter when things are going nominally. It's only when things go wrong that the offset lag matters. And even then, if you're having Kafka Connect handle offsets your connector needs to be ready to handle messages at least once. Exactly once is possible, but your connector has to do more work (see below).
Writing Records
You have a lot of flexibility in writing a connector, and that's good because a lot will depend on the capabilities of the external system to which it's writing. Let's look at different ways of implementing put and flush.
If the system supports transactions or can handle a batch of updates, your connector's put(Collection<SinkRecord>) could write all of the records in that collection using a single transaction / batch, retrying as many times as needed until the transaction / batch completes, before finally throwing an error. In this case, put does all the work and will either succeed or fail. If it succeeds, then Kafka Connect knows all of the records were handled properly and can thus (at some point) commit the offsets. If your put call fails, then Kafka Connect doesn't know whether any of the records were processed, so it doesn't update its offsets and it stops your connector. Your connector's flush(...) would need to do nothing, since Kafka Connect is handling all the offsets.
If the system doesn't support transactions and instead you can only submit items one at a time, you might have your connector's put(Collection<SinkRecord>) attempt to write out each record individually, blocking until it succeeds and retrying each as needed before throwing an error. Again, put does all the work, and the flush method might not need to do anything.
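A sketch of the first variant, where put writes the whole collection as a single batch and flush has nothing left to do; the downstream client and its batch call are hypothetical:

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.RetriableException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BatchingSinkTask extends SinkTask {

    /** Hypothetical client for the downstream service. */
    interface DownstreamClient {
        void writeBatch(Collection<SinkRecord> records) throws Exception;
    }

    private DownstreamClient client;

    @Override
    public void start(Map<String, String> props) {
        // Creating the hypothetical client from the task configuration is omitted here.
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        if (records.isEmpty()) {
            return;
        }
        try {
            client.writeBatch(records); // single transactional/batch write of the whole collection
        } catch (Exception e) {
            // Tell Connect the batch was not processed: these offsets are not committed
            // and the records are delivered again on a later put.
            throw new RetriableException("Batch write failed, records will be retried", e);
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Nothing to do: put() has already persisted everything it accepted,
        // so the offsets Kafka Connect tracks are safe to commit.
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }
}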
So far, my examples do all the work in put. You always have the option of having put simply buffer the records and instead do all the work of writing to the external service in flush or preCommit. One reason you might do this is so that your writes are time-based, just like flush and preCommit. If you don't want your writes to be time-based, you probably don't want to do the writes in flush or preCommit.
To Record Offsets or Not To Record
As mentioned above, by default Kafka Connect will periodically record the offsets so that upon restart the connector can begin where it last left off.
However, sometimes it is desirable for a connector to record the offsets in the external system, especially when that can be done atomically. When such a connector starts up, it can look in the external system to find out the offset that was last written, and can then tell Kafka Connect where it wants to start reading. With this approach your connector may be able to do exactly once processing of messages.
When sink connectors do this, they actually don't need Kafka Connect to commit any offsets at all. The flush method is simply an opportunity for your connector to know which offsets Kafka Connect is committing for you, and since it doesn't return anything it can't modify those offsets or tell Kafka Connect which offsets the connector is handling.
This is where the preCommit method comes in. It really is a replacement for flush (it actually takes the same parameters as flush), except that it is expected to return the offsets that Kafka Connect should commit. By default, preCommit just calls flush and then returns the same offsets that were passed to preCommit, which means Kafka Connect should commit all the offsets it passed to the connector via preCommit. But if your preCommit returns an empty set of offsets, then Kafka Connect will record no offsets at all.
So, if your connector is going to handle all offsets in the external system and doesn't need Kafka Connect to record anything, then you should override the preCommit method instead of flush, and return an empty set of offsets.
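A sketch of that last case, where the connector keeps its offsets in the external system and preCommit returns an empty map so Kafka Connect records nothing; the actual external writes are only indicated in comments:

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class ExternalOffsetSinkTask extends SinkTask {

    @Override
    public void put(Collection<SinkRecord> records) {
        // Hypothetical: write each record plus its offset to the external system
        // atomically, so the external system itself remembers how far we got.
    }

    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // An empty map tells Kafka Connect not to commit any offsets;
        // on restart the task reads its position back from the external system
        // and can rewind via context.offset(...) if needed.
        return Collections.emptyMap();
    }

    @Override
    public void start(Map<String, String> props) { }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }
}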