Kafka log compaction pointers - apache-kafka

Reading about log compaction on a topic, I was wondering if there is any way for a consumer to get hold of any of the positions/offsets of the following?
end of the head
start of the tail
compaction cleaner point
Basically the point at which the compacted and non-compacted parts of the log meet?
I've read that there is a cleaner-offset-checkpoint file that sits on the broker at /var/lib/kafka/data/cleaner-offset-checkpoint but is the info in this file available to a consumer?
My use case is a consumer that will consume compacted keys one way and non-compacted keys another way.
thanks for any advice.
UPDATE:
Thinking, for example, of a topic holding various customer events like here https://www.confluent.io/blog/put-several-event-types-kafka-topic/: new customer, customer updates name, customer updates address, etc. Log compaction, I believe, will leave one event per customer in the tail but still many events per customer in the head (assuming compaction is slower than message production..?). A new consumer of this topic would have to treat all compacted messages as CREATEs, but then treat each non-compacted message as its more fine-grained event? In any case, I was wondering if a consumer could tell how far along a topic compaction has got, at any given time?

It's not possible with the consumer API, no.
If you want to check that checkpoint file on disk, you could use SSH (with a library like JSch, for example) to access a broker and read the file. If it has offset data, you could then use seek methods, but keep in mind that the Log Cleaner thread may be actively running when you seek to or consume that data.
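For illustration, here is a minimal consumer sketch that seeks to an offset obtained out-of-band (e.g. read from that checkpoint file over SSH). The broker address, topic, partition, and offset value are placeholders, and the cleaner may well have moved on by the time the consumer runs:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekToCleanerPoint {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "compaction-aware-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Offset read out-of-band from cleaner-offset-checkpoint (hypothetical value);
        // it may already be stale by the time this code runs.
        long cleanerOffset = 42_000L;

        TopicPartition tp = new TopicPartition("customers", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Everything before this offset has (probably) been compacted;
            // everything after it is still the "head" of the log.
            consumer.seek(tp, cleanerOffset);
            ConsumerRecords<String, String> records = consumer.poll(java.time.Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s @ %d%n", r.key(), r.offset()));
        }
    }
}
```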
A new consumer of this topic would have to treat all compacted messages as CREATEs, but then treat each non-compacted message as its more fine-grained event?
I don't think this is a valid use case. For a stream of customer updates, you'd just update a new customer model in a table via a streaming reduce function. If any consumer restarts, it'll always have to read from the beginning of the topic to rebuild its local state and then continue reading any updates to those stored values, so it doesn't make sense to skip past them all, or to have two separate consumers.
I also don't necessarily think you need different models. Some UUID would be unique, and every event can contain the full model of a "customer". Most fields can remain optional/nullable until they are provided by a new message with all those fields set (or not), and this defines a batch update, since you can set/update/remove multiple attributes at once. If you need more granularity, that's also possible to define at the producer level by storing and looping over your attributes and producing individual "customer" objects with each new attribute.
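As a rough illustration of the streaming-reduce idea above, here is a hedged Kafka Streams sketch that folds a stream of full "customer" models (keyed by customer id) into a table, keeping the most recent model per key. The broker address, topic name, and store name are made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class CustomerTableApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-table-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Each record is keyed by customer id and carries the full customer model
        // (as a JSON string here); the reduce simply keeps the most recent one.
        KTable<String, String> customers = builder
                .stream("customer-events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .reduce((previous, latest) -> latest,
                        Materialized.as("customers-store"));   // queryable state store

        new KafkaStreams(builder.build(), props).start();
    }
}
```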

Related

Kafka client and aggregated events

In event-driven design we strive to identify the events we are interested in. Using Kafka we can easily subscribe (with a new group.id) to a topic and start consuming events. If the retention policy is the default one, we could also consume messages up to one week old if we specify auto.offset.reset=earliest. Right? But what if we want to start from the very beginning? I guess that a KTable should be used, but I'm not sure what will happen when a new client subscribes to a stateful stream. Could you tell me whether it's true that the new subscriber will receive all aggregated messages?
You can't consume data that has been deleted.
That's why KTables are built on top of compacted topics, which will store the latest value for each key, and have infinite retention.
If you want to read the "current state" of the table, to get all aggregated messages, then you can use Interactive Queries.
not sure what will happen when a new client subscribes to a stateful stream
It needs to read the entire compacted topic, starting from the beginning (earliest available offset, not necessarily the first ever produced message), since it cannot easily find where in the topic each unique key may start.
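To make the "current state" query concrete, here is a minimal sketch of Interactive Queries against a running Kafka Streams instance. The store name "customers-store" and the String key/value types are assumptions for the example, not anything defined in the question:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CurrentStateQuery {
    // Query the current aggregated state instead of re-reading the topic yourself.
    static void printAll(KafkaStreams streams) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "customers-store",                     // hypothetical store name
                        QueryableStoreTypes.keyValueStore()));

        try (KeyValueIterator<String, String> it = store.all()) {
            it.forEachRemaining(kv -> System.out.println(kv.key + " -> " + kv.value));
        }
    }
}
```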

Schema registry incompatible changes

In all the documentation it's clearly described how to handle compatible changes with Schema Registry using compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation, where the producer is producing the same message in both schema versions.
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message again (in the new format).
consumers are not allowed to process the same message again (in the new format).
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist unique ids (UUIDs) as part of each record, across all schema versions, and then query some source of truth for what has been processed already, or not. When not processed, insert these ids into that system after the records have been processed.
This may require placing a stream processing application that filters already-processed records out of a topic before the sink connector consumes it.
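A minimal sketch of such a filtering step, assuming records are keyed by their UUID and using an in-memory set as a stand-in for a real source of truth (a database, etc.); the topic names are placeholders:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class DedupFilter {
    // Stand-in for a query against your real source of truth.
    private static final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    static boolean alreadyProcessed(String uuid) {
        return !processedIds.add(uuid);   // add() returns false if the id was already present
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Only records with unseen ids reach the topic the sink connector reads from.
        builder.stream("events-raw", Consumed.with(Serdes.String(), Serdes.String()))
               .filterNot((uuid, value) -> alreadyProcessed(uuid))
               .to("events-deduplicated", Produced.with(Serdes.String(), Serdes.String()));

        // The topology is then started with KafkaStreams as usual (config omitted for brevity).
    }
}
```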
I figure what you are looking for is kind of an equivalent to a topic offset, but spanning multiple topics. Technically this is not provided by Kafka, and with good reason, I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state regarding which messages have been processed when switching to another topic, filtering out messages that were already processed from the other topic. You could use your own sequence numbering or timestamps to keep track of progress across topics. Using a sequence makes it easier to keep track of progress, as only one value needs to be stored on the consumer end, whereas UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages will have to be skipped, and depending on the amount, this might cause a delay that you need to be willing to accept.
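As a rough sketch of such a sequence-based functional offset, assuming each record value carries a monotonically increasing sequence number; the parsing helper and class structure are purely illustrative:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class FunctionalOffsetFilter {
    // Highest sequence number processed on the old topic, loaded from the consumer's own state.
    private long lastProcessedSequence;

    public FunctionalOffsetFilter(long lastProcessedSequence) {
        this.lastProcessedSequence = lastProcessedSequence;
    }

    // Returns true if a record from the new topic still needs processing.
    public boolean shouldProcess(ConsumerRecord<String, String> record) {
        long sequence = extractSequence(record.value());   // hypothetical: parse the sequence field
        if (sequence <= lastProcessedSequence) {
            return false;                                   // already handled via the old topic/schema
        }
        lastProcessedSequence = sequence;
        return true;
    }

    private long extractSequence(String value) {
        // Placeholder: in practice, deserialize the record and read its sequence attribute.
        return Long.parseLong(value);
    }
}
```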

Apache Kafka - KStream and KTable hard disk space requirements

I am trying to better understand what happens at the level of resources when you create a KStream and a KTable. Below, I will mention some conclusions that I have come to, as I understand them (feel free to correct me).
Firstly, every topic has a number of partitions, and all the messages in those partitions are stored on the hard disk(s) in continuous order.
A KStream does not need to store the messages that are read from a topic in another location, because the offset is sufficient to retrieve those messages from the topic it is connected to.
(Is this correct?)
The question regards the KTable. As I understand it, a KTable, in contrast with a KStream, keeps updating the value for each key as new messages with that key arrive. In order to do that, you either have to store the messages that arrive from the topic externally in a static table, or read the whole message queue each time a new message arrives. The latter does not seem very efficient regarding time performance. Is the first approach I presented correct?
read the whole message queue each time a new message arrives.
All messages are only read at the fresh start of the application. Once the app reads up to the latest offset, it's just updating the table like any other consumer.
How disk usage is determined ultimately depends on the state store you've configured for the application, along with its own settings: for example, in-memory vs RocksDB vs an external state store interface that you've written on your own.
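As a rough illustration of how that choice shows up in code, here is a hedged sketch contrasting the default RocksDB-backed store (persisted on local disk under state.dir) with a purely in-memory store; the topic and store names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class StoreChoice {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Default: RocksDB-backed store, persisted on local disk under state.dir.
        builder.table("customers",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customers-rocksdb")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        // Alternative: purely in-memory store, no local disk usage (state is rebuilt
        // from the changelog topic on restart instead).
        builder.table("orders",
                Materialized.<String, String>as(Stores.inMemoryKeyValueStore("orders-in-memory"))
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));
    }
}
```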

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic I created.
I found that iteration is possible in the consumer. Is there any way by which we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition, and every time you get a network error check the last message in that partition to see if your last write succeeded.
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon.
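Since that FAQ answer was written, the first improvement it mentions, producer idempotence, has been added to Kafka and can be switched on with a producer config. Below is a minimal sketch combining it with the UUID-key approach described above; the broker address and topic name are placeholders:

```java
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NoDuplicateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Broker-side deduplication of retried sends (the "producer idempotence" the FAQ refers to).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A UUID key also allows consumers to deduplicate, per the second approach above.
            String id = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("my-topic", id, "payload"));
        }
    }
}
```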

Concurrent writes for event sourcing on top of Kafka

I've been considering using Apache Kafka as the event store in an event sourcing configuration. The published events will be associated with specific resources, delivered to a topic associated with the resource type, and sharded into partitions by resource id. So, for instance, the creation of a resource of type Folder with id 1 would produce a FolderCreate event that would be delivered to the "folders" topic, in a partition given by sharding the id 1 across the total number of partitions in the topic. However, I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other, such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events as explained here, but Kafka does not support such a feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?
I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
1. Serialise writes per partition, using a single process per partition (or partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use acks=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
2. Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication, by including an ID in each event, say, or by reinitializing the producer by re-reading (recent messages in) the stream to see whether the write actually worked or not.
3. Writing multiple events atomically: just publish a composite event containing a list of events.
4. Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve point 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, point 4 is a KTable, which you can have your event producer use to work out whether an event is valid for the current resource state before sending it.
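Below is a minimal, hedged sketch of the optimistic check from point 1: the writer process keeps the current version of each resource in memory and refuses to send an event whose expected version doesn't match. The topic name, version map, and method names are illustrative, not a standard API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ResourceEventWriter {
    private final Producer<String, String> producer;              // configured with acks=all
    private final Map<String, Long> currentVersions = new ConcurrentHashMap<>();

    public ResourceEventWriter(Producer<String, String> producer) {
        this.producer = producer;
    }

    // Single writer per partition: only one instance handles a given resource id.
    public synchronized void append(String resourceId, long expectedVersion, String event)
            throws ExecutionException, InterruptedException {
        long current = currentVersions.getOrDefault(resourceId, 0L);
        if (current != expectedVersion) {
            throw new IllegalStateException(
                    "Concurrent modification: resource " + resourceId
                            + " is at version " + current + ", expected " + expectedVersion);
        }
        // Send serially and wait for acknowledgement before reporting success to the command sender.
        producer.send(new ProducerRecord<>("folders", resourceId, event)).get();
        currentVersions.put(resourceId, current + 1);
    }
}
```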