Kafka Streams: Any guarantees on ordering of saves to state stores when using at_least_once? - apache-kafka

We have a Kafka Streams Java topology built with the Processor API.
In the topology, we have a single processor, that saves to multiple state stores.
As we use at_least_once, we would expect to see some inconsistencies between the state stores - e.g. an incoming record results in writes to both state store A and B, but a crash between the saves results in only the save to store A getting written to the Kafka change log topic.
1. Are we guaranteed that the order in which we save will also be the order in which the writes to the state stores happen? E.g. if we first save to store A and then to store B, we can of course have a situation where the writes to both change logs succeeded, and a situation where only the write to change log A was completed - but can we also end up in a situation where only the write to change log B was completed?
2. What situations will result in replays? A crash of course - but what about rebalances, a new broker partition leader, or when we get an "Offset commit failed" error (The request timed out)?
3. A while ago, we tried using exactly_once, which resulted in a lot of error messages that didn't make sense to us. Would exactly_once give us atomic writes across multiple state stores?

Ad 3. According to the original design document on exactly-once support in Kafka Streams, I think that with exactly_once you get atomic writes across multiple state stores:
When stream.commit() is called, the following steps are executed in order:
1. Flush local state stores (KTable caches) to make sure all changelog records are sent downstream.
2. Call producer.sendOffsetsToTransaction(offsets) to commit the current recorded consumer’s positions within the transaction. Note that although the consumer of the thread can be shared among multiple tasks, and hence multiple producers, a task’s assigned partitions are always exclusive, and hence it is safe to just commit the offsets of this task’s assigned partitions.
3. Call producer.commitTransaction() to commit the current transaction. As a result the task state represented as the above triplet is committed atomically.
4. Call producer.beginTransaction() again to start the next transaction.
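The commit sequence above can be illustrated without a broker. What follows is only a toy in-memory model, not Kafka's implementation: the class ToyTransaction, its methods, and the topic names in the usage note are all hypothetical. It shows why changelog writes for several stores, together with the offsets, become visible in one step or not at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Kafka code): writes are buffered per topic and
// only become visible to "read committed" readers when commit() is called.
final class ToyTransaction {
    private final Map<String, List<String>> committed = new LinkedHashMap<>();
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    // Buffer a write to a topic (e.g. a changelog topic or the offsets topic).
    void send(String topic, String value) {
        pending.computeIfAbsent(topic, t -> new ArrayList<String>()).add(value);
    }

    // Make every buffered write visible in one step.
    void commit() {
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            committed.computeIfAbsent(e.getKey(), t -> new ArrayList<String>())
                     .addAll(e.getValue());
        }
        pending.clear();
    }

    // Discard every buffered write, as after an abort or a crash before commit.
    void abort() {
        pending.clear();
    }

    // What a read-committed consumer of the topic would see.
    List<String> visible(String topic) {
        List<String> seen = committed.get(topic);
        return seen == null ? new ArrayList<String>() : new ArrayList<String>(seen);
    }
}
```

In this model, sending to "store-A-changelog" and "store-B-changelog" and then calling abort() leaves both topics empty for readers, whereas commit() exposes both writes together. There is no intermediate state in which only one changelog has the record.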

Related

Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics?

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the time being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, this seems to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
State stores use their own changelog topics, and kafka-streams state stores take on responsibility for loading from them. If your state stores are uninitialised, your kafka-streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consuming from the changelog topic for the state store (the changelog topic name is <application.id>-<state-store-name>-changelog)
Read each record and update a local rocksDB instance accordingly
Do not emit anything, since this is an application-service, not your actual topology
Read your consumer group's offsets using EARLIEST or LATEST according to how you configured the topology. Note that this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event that the offsets are lost, or you create a new group, it's a balance between potentially skipping records (LATEST) and handling reprocessing of old records and deduplication (EARLIEST).
Long story short: state restoration is different from processing, and is handled by Kafka Streams itself.
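To make the difference between restoration and processing concrete, here is a small in-memory sketch. It is not Kafka Streams code: ToyRestore and its fields are hypothetical names, and the "changelog" is just a list of key/value entries. The point it illustrates is that restoring only rebuilds local state, while processing both updates state and emits downstream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical names): restoring a store replays its changelog into
// local state and never forwards anything to the output topic.
final class ToyRestore {
    final Map<String, Long> store = new HashMap<>();
    final List<String> outputTopic = new ArrayList<>();

    // Restoration: apply each changelog entry (key, latest value) to the local store.
    void restore(List<Map.Entry<String, Long>> changelog) {
        for (Map.Entry<String, Long> entry : changelog) {
            store.put(entry.getKey(), entry.getValue());
        }
    }

    // Normal processing: update the count for the key and emit a record downstream.
    void process(String key) {
        long count = store.merge(key, 1L, Long::sum);
        outputTopic.add(key + "=" + count);
    }
}
```

After restore(...) the output list is still empty, and the next process(...) call continues counting from the restored value rather than from zero.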
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).
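For reference, a minimal configuration sketch in Java. The property keys are the standard Kafka names; the application id and broker address are placeholders, and the class name OffsetResetConfig is made up for this example. As described above, the auto.offset.reset value only matters while the application has no committed offsets, i.e. on the first run or after a reset.

```java
import java.util.Properties;

// Configuration sketch; values marked "placeholder" are assumptions.
final class OffsetResetConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "my-stateful-app");   // placeholder name
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        // Consulted only when the application has no committed offsets yet.
        props.put("auto.offset.reset", "earliest");
        return props;
    }
}
```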

When to use Kafka transactional API?

I was trying to understand Kafka's transactional API. This link defines atomic read-process-write cycle as follows:
First, let’s consider what an atomic read-process-write cycle means. In a nutshell, it means that if an application consumes a message A at offset X of some topic-partition tp0, and writes message B to topic-partition tp1 after doing some processing on message A such that B = F(A), then the read-process-write cycle is atomic only if messages A and B are considered successfully consumed and published together, or not at all.
It further says the following:
Using vanilla Kafka producers and consumers configured for at-least-once delivery semantics, a stream processing application could lose exactly once processing semantics in the following ways:
1. The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
2. We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
3. Finally, in distributed environments, applications will crash or (worse!) temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
We designed transaction APIs in Kafka to solve the second and third problems. Transactions enable exactly-once processing in read-process-write cycles by making these cycles atomic and by facilitating zombie fencing.
Doubts:
Points 2 and 3 above describe when message duplication can occur which are dealt with using transactional API. Does transactional API also help to avoid message loss in any scenario?
Most online (for example, here and here) examples of Kafka transactional API involve:
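while (true) {
    // Key/value types are assumed to be String; adjust to your serdes.
    ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
    producer.beginTransaction();
    for (ConsumerRecord<String, String> record : records) {
        // producerRecord(...) is a helper that maps an input record to an output record
        producer.send(producerRecord("outputTopic", record));
    }
    // Commit the consumed offsets as part of the same transaction
    producer.sendOffsetsToTransaction(currentOffsets(consumer), group);
    producer.commitTransaction();
}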
This is basically a read-process-write loop. So is the transactional API useful only in a read-process-write loop?
This article gives example of transactional API in non read-process-write scenario:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction();
}
It says:
This allows a producer to send a batch of messages to multiple partitions such that either all messages in the batch are eventually visible to any consumer or none are ever visible to consumers.
Is this example correct, and does it show another way to use the transactional API, different from the read-process-write loop? (Note that it also does not commit offsets to the transaction.)
In my application, I simply consume messages from kafka, do processing and log them to the database. That is my whole pipeline.
a. So, I guess this is not read-process-write cycle. Is Kafka transactional API of any use to my scenario?
b. Also I need to ensure that each message is processed exactly once. I guess setting idempotent=true in producer will suffice and I dont need transactional API, right?
c. I may run multiple instances of pipeline, but I am not writing processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to kafka). So, I guess transactional API wont help me to avoid duplicate processing scenario, right? (I might have to persist both offset along with processing output to the database in the same database transaction and read the offset during producer restart to avoid duplicate processing.)
a. So, I guess this is not read-process-write cycle. Is Kafka transactional API of any use to my scenario?
It is a read-process-write, except you are writing to a database instead of Kafka. Kafka has its own transaction manager and thus writing inside a transaction with idempotency would enable exactly once processing, assuming you can resume the state of your consumer-write processor correctly. You cannot do that with a DB because the DB's transaction manager doesn't sync with Kafka's. What you can do instead is make sure that even if kafka transactions are not atomic with respect to your database, they are still eventually consistent.
Let's assume your consumer reads, writes to the DB and then acks. If the DB fails you don't ack and you can resume normally based on the offset. If the ack fails you will process twice and save to the DB twice. If you can make this operation idempotent, then you are safe. This means that your processor must be pure and the DB has to dedupe: processing the same message twice should always lead to the same result on the DB.
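As a rough illustration of "pure processor plus deduplicating DB", here is a sketch in which the database is stood in for by a map keyed by message id. Everything in it is hypothetical (the class IdempotentSink, the idea that each message carries a unique id, the trivial processing function); a real implementation would use an upsert or a unique constraint in the actual database.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model (hypothetical): the "database" is a map keyed by message id, so
// writing the same message twice leaves exactly one row with the same value.
final class IdempotentSink {
    private final Map<String, String> rows = new HashMap<>();

    // Pure processing step: the same input always gives the same output.
    static String process(String payload) {
        return payload.trim().toUpperCase(Locale.ROOT);
    }

    // Upsert by message id; a redelivered message rewrites an identical row.
    void handle(String messageId, String payload) {
        rows.put(messageId, process(payload));
    }

    int rowCount() {
        return rows.size();
    }

    String row(String messageId) {
        return rows.get(messageId);
    }
}
```

If the ack fails and the message is delivered again, handle(...) runs twice with the same arguments and the stored state is unchanged, which is the property the paragraph above asks for.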
b. Also I need to ensure that each message is processed exactly once. I guess setting idempotent=true in producer will suffice and I dont need transactional API, right?
Assuming that you respect the requirements from point a, exactly once processing with persistence on a different store also requires that between your initial write and the duplicate no other change has happened to the objects that you are saving. Imagine having a value written as X, then some other actor changes it to Y, then the message is reprocessed and changes it back to X. This can be avoided for example, by making your database table be a log, similar to a kafka topic.
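The X, Y, X scenario can be sketched with a table modelled as an append-only log. This is an assumption-laden toy: AppendOnlyTable is a hypothetical name, and skipping already-seen message ids is one possible way to deduplicate, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical): the table is an append-only log of values; a
// replayed message id is skipped, so it cannot undo a later change.
final class AppendOnlyTable {
    private final List<String> values = new ArrayList<>();
    private final Set<String> seenIds = new HashSet<>();

    // Returns true if the entry was appended, false if the id was already applied.
    boolean append(String messageId, String value) {
        if (!seenIds.add(messageId)) {
            return false;
        }
        values.add(value);
        return true;
    }

    // The current value is the last appended entry, or null if the log is empty.
    String current() {
        return values.isEmpty() ? null : values.get(values.size() - 1);
    }
}
```

Appending X (message m1), then Y (message m2), then replaying m1 leaves the current value at Y, because the replay is recognised and ignored instead of overwriting the newer state.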
c. I may run multiple instances of pipeline, but I am not writing processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to kafka). So, I guess transactional API wont help me to avoid duplicate processing scenario, right? (I might have to persist both offset along with processing output to the database in the same database transaction and read the offset during producer restart to avoid duplicate processing.)
It is the producer which writes to the topic you consume from that may create zombie messages. That producer needs to play nice with Kafka so that zombies are ignored. The transactional API together with your consumer will make sure that this producer writes atomically and your consumer reads committed messages, albeit not atomically. If you want exactly-once, idempotency is enough. If the messages are supposed to be written atomically, you need transactions too. Either way, your read-write/consume-produce processor needs to be pure and you have to dedupe. Your DB is also part of this processor, since the DB is the one that actually persists.
I've looked for a bit on the internet, maybe this link helps you: processing guarantees
The links you posted: exactly once semantics and transactions in kafka are great.

Is stream processing atomic / transactional when using at least once delivery in Kafka Streams?

Let's suppose a simple case like this:
ORDER_TOPIC ----> KSTREAM ----> VALIDATED_ORDER_TOPIC
                     |
         ROCKSDB LOCAL STATE STORE
The KStream deduplicates the messages from ORDER_TOPIC using a transform operation with a transformer that stores the messages in a persistent local state store by their key/id. This way if the same order arrives twice it will be ignored.
Now a new order arrives, it's not duplicated so it's stored in the local store but before sending it to the VALIDATED_ORDER_TOPIC the application crashes.
I'm wondering what the transactional guarantees are inside a KStream: has the record been stored and committed to the local state store, or rolled back?
Could you point at some documentation regarding transactional guarantees for Kafka Streams with at-least-once semantics?
If you run with at-least-once semantics, there are no transactional guarantees. In this case, if you first add the ID to the store, but you crash before the record is written to the output topic, you may lose this record when it is reprocessed from the input topic (the ID is already in the store, so the retry is dropped as a duplicate).
If you want to de-duplicate, you need to enable processing.guarantee=exactly_once. In this case, if you crash, the store will be "rolled back" into a consistent state. I.e., after a crash, it will contain the ID only if the write to the output topic was successful.
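The two outcomes described above can be played through in a small simulation. It is a toy, not Kafka Streams: DedupSimulation and its flags are hypothetical, the "crash" is just an early return, and the at-least-once branch models the unlucky case in which the store write survived the crash.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical): a deduplicating processor whose store write and
// output write are either independent (at-least-once) or rolled back together.
final class DedupSimulation {
    final Set<String> seenIds = new HashSet<>();
    final List<String> validatedOrders = new ArrayList<>();

    // Processes one order. If crashBeforeOutput is true, the simulated crash
    // happens after the store write and before the output write.
    void process(String orderId, boolean crashBeforeOutput, boolean transactional) {
        if (seenIds.contains(orderId)) {
            return; // treated as a duplicate and dropped
        }
        seenIds.add(orderId);
        if (crashBeforeOutput) {
            if (transactional) {
                seenIds.remove(orderId); // store is rolled back with the failed attempt
            }
            return;
        }
        validatedOrders.add(orderId);
    }
}
```

With transactional set to false, a crash followed by a retry of the same order leaves validatedOrders empty: the order is lost. With transactional set to true, the retry finds a clean store and the order is emitted exactly once.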

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using kafka where I want to ensure that duplicate records are not inserted into topic created.
I found that iteration is possible in consumer. Is there any way by which we can do this in producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
1. Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
2. The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
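The last alternative in the FAQ quote, storing the offset with the loaded data and deduplicating on topic/partition/offset, might look roughly like this. The sketch is an in-memory stand-in for a real storage system, and OffsetKeyedLoader is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical): each loaded row is keyed by topic/partition/offset,
// so re-reading a record after a restart does not create a second row.
final class OffsetKeyedLoader {
    private final Map<String, String> rows = new HashMap<>();

    // Returns true if the record was stored, false if that position was already loaded.
    boolean load(String topic, int partition, long offset, String data) {
        String key = topic + "/" + partition + "/" + offset;
        return rows.putIfAbsent(key, data) == null;
    }

    int rowCount() {
        return rows.size();
    }
}
```

If the consumer restarts from an older checkpoint and delivers offset 42 of the same partition again, the second load(...) call returns false and the row count stays the same.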

Concurrent writes for event sourcing on top of Kafka

I've been considering using Apache Kafka as the event store in an event sourcing configuration. The published events will be associated to specific resources, delivered to a topic associated to the resource type and sharded into partitions by resource id. So for instance a creation of a resource of type Folder and id 1 would produce a FolderCreate event that would be delivered to the "folders" topic in a partition given by sharding the id 1 across the total number of partitions in the topic. However, I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other, such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events as explained here, but Kafka does not support such a feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?
I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
1. Serialise writes per partition, using a single process per partition (or partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use acks=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
2. Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication by including an ID in each event, say, or reinitialize the producer by re-reading (recent messages in) the stream to see whether the write actually worked or not.
3. Writing multiple events atomically - just publish a composite event containing a list of events.
4. Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, 4 is a KTable, which your event producer can use to work out whether an event is valid for the current resource state before sending it.
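The optimistic check done by a single writer per partition could be sketched as follows. This is an in-memory illustration under stated assumptions: SingleWriter is a hypothetical class, versions start at 0 and increase by one per event, and the list stands in for the Kafka partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical): one writer owns a partition and tracks the current
// version of each resource, rejecting commands based on a stale version.
final class SingleWriter {
    private final Map<String, Integer> versions = new HashMap<>();
    private final List<String> partitionLog = new ArrayList<>();

    // Appends the event only if the caller saw the latest version of the resource.
    boolean append(String resourceId, int expectedVersion, String event) {
        int current = versions.getOrDefault(resourceId, 0);
        if (current != expectedVersion) {
            return false; // optimistic check failed, nothing is written
        }
        partitionLog.add(resourceId + ":" + event);
        versions.put(resourceId, current + 1);
        return true;
    }

    // Copy of the events written so far, in order.
    List<String> log() {
        return new ArrayList<>(partitionLog);
    }
}
```

If FolderDestroy and FolderUpdate are both issued against version 1 of the same folder, whichever reaches the writer first is appended and the other is rejected, so the invalid sequence [FolderDestroy, FolderUpdate] never reaches the log.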