Best practice for processing data with dependencies in Kafka?

We are developing an app that takes data from different sources. Once the data is available, we process it, put it together, and then move it to a different topic.
In our case we have 3 topics, and each topic brings data that is related to data from another topic. The related entities may or may not arrive at the same time (or within a short period of time). This is where the problem comes from, because we need to join these 3 entities into one before we move the result to the output topic.
Our idea was to create a separate topic that contains all the data that has not been processed yet, and to have a separate thread that checks that topic at fixed intervals and checks whether the dependencies of each entity are available. If they are available, we delete the entity from this separate topic; if not, we keep the entity there until it gets resolved.
So my question is: is it reasonable to do it this way, or are there other good practices or strategies that Kafka provides for this kind of scenario?

Kafka messages can be deleted after some time according to the retention policy, so you need to store the pending messages somewhere.
I can see the option below, though every problem has many approaches and solutions:
1. Process all messages and forward the ones that cannot be processed yet to another topic, say A.
2. Use the Kafka Streams Processor API to consume messages from topic A and store them in a state store.
3. Schedule a punctuate() call at a fixed time interval.
4. Iterate over all messages stored in the state store.
5. Check each message's dependencies. If they are available, delete the message from the state store and process it, or publish it back to the original topic to be processed again.
You can refer to the link below:
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html
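
The buffer-and-sweep idea in the steps above can be sketched without any Kafka dependency. This is only a minimal illustration, not the Processor API itself: `PendingJoinStore`, the part names `a`, `b`, `c`, and the `max_age` timeout are all hypothetical, and a plain dict stands in for the state store. In a real Kafka Streams processor, `add` would correspond to the per-record processing callback and `punctuate` to the scheduled punctuation.

```python
from dataclasses import dataclass, field

# Hypothetical names for the three kinds of input; the question does not name its topics.
REQUIRED_PARTS = ("a", "b", "c")


@dataclass
class PendingJoinStore:
    """In-memory stand-in for a state store keyed by entity id (an assumption)."""
    max_age: float                                   # seconds before an incomplete entity expires
    parts: dict = field(default_factory=dict)        # key -> {source: value}
    first_seen: dict = field(default_factory=dict)   # key -> arrival time of the first part

    def add(self, key, source, value, now):
        """Buffer one part of an entity until its dependencies arrive."""
        self.parts.setdefault(key, {})[source] = value
        self.first_seen.setdefault(key, now)

    def punctuate(self, now):
        """Periodic sweep: return complete entities and keys that waited too long."""
        joined, expired = [], []
        for key in list(self.parts):
            got = self.parts[key]
            if all(p in got for p in REQUIRED_PARTS):
                joined.append((key, dict(got)))
            elif now - self.first_seen[key] > self.max_age:
                expired.append(key)
            else:
                continue  # still waiting, keep it in the store
            del self.parts[key]
            del self.first_seen[key]
        return joined, expired


store = PendingJoinStore(max_age=60.0)
store.add("order-1", "a", {"x": 1}, now=0.0)
store.add("order-1", "b", {"y": 2}, now=1.0)
store.add("order-2", "a", {"x": 9}, now=1.0)
first = store.punctuate(now=10.0)    # nothing is complete yet
store.add("order-1", "c", {"z": 3}, now=11.0)
second = store.punctuate(now=12.0)   # order-1 has all three parts
third = store.punctuate(now=100.0)   # order-2 never completed
```

Expiring stale entries is an addition of this sketch; without some timeout, entities whose dependencies never arrive would stay in the store forever.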

Related

Schema registry incompatible changes

The documentation clearly describes how to handle compatible changes in Schema Registry using compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation (see image), where the producer produces the same message in both schema versions:
Image
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message (in the new format).

"consumers are not allowed to process the same message (in the new format)"
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist unique ids (UUIDs) as part of each record, across all schema versions, and then query some source of truth for what has already been processed. When a record has not been processed yet, process it and then insert its id into that system.
This may require placing a stream processing application in front of the sink connector that filters already-processed records out of a topic before the connector consumes it.
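
As a rough sketch of that id-based filtering, assuming every record carries an `id` field that is identical across schema versions: `ProcessedIds` is a hypothetical in-memory stand-in for the source of truth, which in practice would be a database or another durable store.

```python
class ProcessedIds:
    """Hypothetical stand-in for the source of truth of processed record ids."""

    def __init__(self):
        self._seen = set()

    def contains(self, record_id):
        return record_id in self._seen

    def mark(self, record_id):
        self._seen.add(record_id)


def process_once(records, processed, handler):
    """Run handler on each record whose id has not been processed yet."""
    handled = []
    for record in records:
        record_id = record["id"]
        if processed.contains(record_id):
            continue  # already handled, possibly from the other schema version
        handler(record)
        processed.mark(record_id)  # mark only after the handler succeeded
        handled.append(record_id)
    return handled


processed = ProcessedIds()
sink = []
# The same logical messages, produced to an old-format and a new-format topic.
v1 = [{"id": "u1", "schema": 1}, {"id": "u2", "schema": 1}]
v2 = [{"id": "u2", "schema": 2}, {"id": "u3", "schema": 2}]
from_v1 = process_once(v1, processed, sink.append)
from_v2 = process_once(v2, processed, sink.append)
```

Marking the id after the handler runs means a crash between the two steps can still cause one reprocessing; closing that gap needs the handler and the id store to commit together.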

I figure what you are looking for is a kind of equivalent to a topic offset, but spanning multiple topics. Technically this is not provided by Kafka, and with good reason, I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state about which messages have been processed, so that when switching to another topic they can filter out messages that were already processed from the other topic. You could use your own sequence numbering or timestamps to keep track of progress across topics. A sequence makes tracking progress easier, as only one value needs to be stored at the consumer end. Using UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages have to be skipped, and depending on the amount this might cause a delay that you need to be willing to accept.
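
A minimal sketch of such a functional offset, assuming the producer writes the same monotonically increasing `seq` value into both topics (the field name and the `SequenceTracker` class are hypothetical):

```python
class SequenceTracker:
    """Keeps the single value a consumer must store: the highest sequence seen."""

    def __init__(self):
        self.last_seq = -1

    def should_process(self, seq):
        if seq <= self.last_seq:
            return False  # already processed via the other topic
        self.last_seq = seq
        return True


tracker = SequenceTracker()

# The consumer handled the first three messages from the old topic...
old_topic = [{"seq": n, "format": "old"} for n in range(5)]
from_old = [m["seq"] for m in old_topic[:3] if tracker.should_process(m["seq"])]

# ...then switched to the new topic, which carries the same messages from the start.
new_topic = [{"seq": n, "format": "new"} for n in range(7)]
from_new = [m["seq"] for m in new_topic if tracker.should_process(m["seq"])]
```

The consumer still has to read and discard every message below its stored sequence after the switch, which is the delay mentioned above.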

Kafka log compaction pointers

Reading about log compaction on a topic, I was wondering if there is any way for a consumer to get hold of any of the positions/offsets of the following?
end of the head
start of the tail
compaction cleaner point
Basically the point at which the compacted and non-compacted parts of the log meet?
I've read that there is a cleaner-offset-checkpoint file that sits on the broker at /var/lib/kafka/data/cleaner-offset-checkpoint but is the info in this file available to a consumer?
My use case is a consumer that will consume compacted keys one way and non-compacted keys another way.
thanks for any advice.
UPDATE:
Thinking, for example, of a topic holding various customer events like here https://www.confluent.io/blog/put-several-event-types-kafka-topic/; new customer, customer updates name, customer updates address, etc. Log compaction, I believe, will leave one event per customer in the tail but still many events per customer in the head (assuming compaction is slower than message production..?) A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted messages as their more fine-grained events? In any case, I was wondering if a consumer could tell how far along a topic compaction has got at any given time?

No, it's not possible with the consumer API.
If you want to check that checkpoint file on disk, you could use Jssh, for example, to access a broker and read the file. If it has offset data, you could then use the seek methods, but keep in mind that the Log Cleaner thread may be actively running when you seek to or consume that data.
A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted message as their more fine grained event?
I don't think this is a valid use case. For a stream of customer updates, you'd just update a customer model in a table via a streaming reduce function. If any consumer restarts, it will always have to read from the beginning of the topic to rebuild its local state and then continue reading any updates to those stored values, so it doesn't make sense to skip past them all or to have two separate consumers.
I also don't necessarily think you need different models. Some UUID would be unique, and every event can contain the full model of a "customer". Most fields can remain optional/nullable until they are provided by a new message with all those fields set (or not), and this defines a batch update, since you can set/update/remove multiple attributes at once. If you need more granularity, that's also possible to define at the producer level by storing and looping over your attributes and producing individual "customer" objects with each new attribute.
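
The reduce-into-a-table idea can be illustrated roughly as follows. The field names and helper functions are hypothetical, and this sketch makes one simplifying assumption: a `None` field means "not provided in this event", so removing an attribute would need an explicit marker that is not modelled here.

```python
def merge_customer(current, update):
    """Reduce step: overlay the fields that the update actually provides."""
    merged = dict(current or {})
    for field_name, value in update.items():
        if value is not None:  # assumption: None means "unchanged"
            merged[field_name] = value
    return merged


def build_table(events):
    """Replay the topic from the beginning to rebuild the per-customer state."""
    table = {}
    for key, update in events:
        table[key] = merge_customer(table.get(key), update)
    return table


events = [
    ("c1", {"name": "Ann", "address": None}),         # new customer
    ("c1", {"name": None, "address": "1 Main St"}),   # address update
    ("c2", {"name": "Bob", "address": None}),
    ("c1", {"name": "Anna", "address": None}),        # name update
]
table = build_table(events)
```

Note that under compaction only the latest record per key survives, so a replay like this is only lossless if each event carries the full model, as the answer suggests, rather than a partial update.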

Sharing state between KStream applications in same consumer group with globalStateStore

The problem I am trying to solve is about sharing state between multiple applications in the same consumer group. They consume data from the same topic but from different partitions.
So I have an inputTopic with, say, 3 partitions. I will be running 3 KStream microservices (e.g. MS1, MS2, MS3), one for each partition. Each microservice will process the data and write the result to an output topic.
Problem: most of the time a microservice can operate independently within its partition, but there are cases where a microservice needs to pull the previous state of an attribute before it is able to process a record, and this state might previously have been processed and stored by another microservice.
For example, suppose I have data about a person walking along 3 sections of a road, where each section represents a partition. If this person walks from section 1 to section 2, his state is no longer published by the section 1 publisher; it is now published by the section 2 publisher. I have a microservice processing his data per section, so when I see records arriving in section 2, I need to check the person's previous state to know whether he just started walking in my section or came from another section before I can continue processing his data.
Proposed solution: I have been reading about globalStateStore, and it seems like it might solve my problem. So I will write down my thinking here and some questions, just wondering if you can see any problems in my approach:
Have each microservice read the input topic from its assigned partition.
Have a GlobalStateStore to store the state so that all 3 microservices can read it.
Since you cannot write directly into the globalStateStore, I might have to create an intermediate topic to store the state (e.g. <BobLocation,Long>; <BobMood,String>). The global state store will be created from this topic ("global-topic") - Is this correct?
Then every time my microservice gets a message, it will always read the globalStateStore to update its state and then process the record - Do I read it as a GlobalKTable?
Then write the updated state to the "global-topic".
Is there any implication for the restart process? As I am reading the state from the global state store all the time, is there a problem when one app dies and another one takes over?
Thank you so much guys!

Concurrent writes for event sourcing on top of Kafka

I've been considering using Apache Kafka as the event store in an event sourcing configuration. The published events will be associated with specific resources, delivered to a topic associated with the resource type, and sharded into partitions by resource id. So, for instance, the creation of a resource of type Folder and id 1 would produce a FolderCreate event that would be delivered to the "folders" topic, in a partition given by sharding the id 1 across the total number of partitions in the topic. However, I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other, such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events as explained here, but Kafka does not support such a feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?

I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
1. Serialise writes per partition, using a single process per partition (or set of partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use acks=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
2. Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication by including an ID in each event, say, or reinitialize the producer by re-reading (recent messages in) the stream to see whether the write actually worked or not.
3. Writing multiple events atomically - just publish a composite event containing a list of events.
4. Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, 4 is a KTable, which your event producer can use to work out whether an event is valid for the current resource state before sending it.
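
The single-writer optimistic check and the id-based deduplication described in the notes above can be sketched like this. It is only an illustration under stated assumptions: `SingleWriter` is hypothetical, a plain list stands in for one partition, and a real writer would update its version map only after the broker acknowledged the write.

```python
class SingleWriter:
    """Hypothetical sole writer for one partition; tracks a version per resource."""

    def __init__(self):
        self.log = []                # stand-in for the partition's log
        self.versions = {}           # resource id -> number of accepted events
        self.seen_event_ids = set()  # for deduplicating retried writes

    def append(self, event_id, resource_id, kind, expected_version):
        if event_id in self.seen_event_ids:
            return "duplicate"       # a retry of a write that already succeeded
        current = self.versions.get(resource_id, 0)
        if expected_version != current:
            return "conflict"        # the command was based on stale state
        self.log.append((resource_id, kind, current + 1))
        self.versions[resource_id] = current + 1
        self.seen_event_ids.add(event_id)
        return "ok"


writer = SingleWriter()
created = writer.append("e1", 1, "FolderCreate", expected_version=0)
# Two concurrent commands, both issued after reading version 1 of the folder:
destroyed = writer.append("e2", 1, "FolderDestroy", expected_version=1)
updated = writer.append("e3", 1, "FolderUpdate", expected_version=1)
# The client retries the destroy because it never saw the first reply:
retried = writer.append("e2", 1, "FolderDestroy", expected_version=1)
```

Because the stale FolderUpdate is rejected before it is written, the invalid sequence [FolderDestroy, FolderUpdate] from the question never reaches the log.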

Looking for alternative to dynamically creating Kafka topics

I have a service that fetches a snapshot of some information about entities in our system and holds on to that for later processing. Currently in the later processing stages we fetch the information using http.
I want to use Kafka to store this information by dynamically creating topics so that the snapshots aren't mixed up with each other. When the service fetches the entities it creates a unique topic and then each entity we fetch gets pushed to that topic. The later processing stages would be passed the topic as a parameter and can then read all the info at their own leisure.
The benefits of this would be:
Restarting the later stages processing can be made to just restart at the offset it has processed so far.
No need to worry about batching of requests (or stream processing the incoming http response) for the entities if there are a lot of them, since we simply read one at a time.
Multiple consumer groups can easily be added later for other processing purposes.
However, Kafka/Zookeeper has some limits on the total number of topics/partitions it can support. As such I would need to delete them either after the processing is done or based on some arbitrary time passing. In addition since (some) of the processors would have to know when all the information has been read I would need to include some sort of "End of Stream" message on the topic.
Two general questions:
Is it bad to dynamically create and delete Kafka topics like this?
Is it bad to include an "End of Stream" type of message?
Main question:
Is there an alternative to the above approach using static topics/partitions that doesn't entail having to hold onto the entities in memory until the processing should occur?

It seems that a single "compacted" topic can be an alternative.
