Kafka Streaming Application with Not Null Check - apache-kafka

I have a streaming application which is subscribed to two topics and publishes to one topic. One subscribed topic comes from a data source beyond my control and gives me null values where there shouldn't be any.
So I was thinking of implementing a null check in this streaming application, but for that I need to know the latest published message, because at the moment the streaming app is more or less stateless.
So I would add a state store to the streaming app where I can query the latest message.
Is this a legitimate approach? Are there other approaches to this problem beyond adding "state" to the streaming app?

If you want to handle the possible null value within the streams app and keep track of the latest published message, then yes, adding a state store is the appropriate thing to do.
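A minimal sketch of that approach with the Kafka Streams DSL (store and topic names are made up for illustration): the last good value per key is kept in a state store, and an incoming null is replaced with it.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// State store holding the last non-null value published per key ("last-published" is a made-up name)
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("last-published"),
        Serdes.String(), Serdes.String()));

KStream<String, String> source = builder.stream("unreliable-input-topic");

KStream<String, String> repaired = source.transformValues(
        () -> new ValueTransformerWithKey<String, String, String>() {
            private KeyValueStore<String, String> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("last-published");
            }

            @Override
            public String transform(String key, String value) {
                if (value == null) {
                    return store.get(key);   // fall back to the last good value for this key
                }
                store.put(key, value);       // remember the latest good value
                return value;
            }

            @Override
            public void close() { }
        },
        "last-published");

repaired.to("output-topic");
```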

Related

Restrict Kafka consumers based on event headers (metadata)

The book "Building Event-Driven Microservices" gives good practice to use a metatags (event headers) for placing restrictions on Kafka consumers. One of which is the following:
Deprecation:
A way to indicate that a stream is out of date. Marking an event stream as deprecated
allows existing systems to continue using it while
new microservices are blocked from requesting a subscription... the
owner of the deprived stream of events can be notified when there are
no more registered users of the deprecated stream, at which point it
can be safely deleted.
Can you please point me to how this can be implemented (Java/Spring centric)? Is it possible for Kafka ACL to make restrictions based on event headers?
Thank you in advance!
Is it possible for Kafka ACL to make restrictions based on event headers?
No, but you can filter messages out after receiving them. ACLs prevent access to a topic's partitions as a whole, not to particular records.
the owner of the deprecated stream of events can be notified when there are no more registered users of the deprecated stream
You need to remember that Kafka is not a pure messaging solution and does not have a concept of "registered" consumers: any consumer can read a message at any time, as long as it has not been removed by the cluster.
You'd need to implement your own "notification" pipeline to signal that no instances are interested in the original topic anymore (possibly even with Kafka again).
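As a sketch of that consumer-side filtering with the plain Java client (the "deprecation" header name and the topic are assumptions; in Spring Kafka the same check could live in a RecordFilterStrategy):

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

// 'props' is your usual consumer configuration; "deprecation" is a hypothetical header name
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("orders"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            Header deprecated = record.headers().lastHeader("deprecation");
            if (deprecated != null) {
                continue;            // skip (or alert on) records from a deprecated stream
            }
            process(record);         // process(...) stands in for your normal handling
        }
    }
}
```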

How to ensure exactly once processing in Kafka for older version 0.10.0

I have a scenario where I need to call an API of the notification service, which will send an email to the user.
Based on the message that appears on the topic, I need to consume it and call the API.
But if the consumer fails in between, then, due to the at-least-once nature of the consumer configuration, it is hard to avoid duplicate processing.
Sorry for the newbie question. I tried to find blogs on this, but most of them use a newer version of Kafka, while our infrastructure still uses the older version.
Wondering how we can achieve this behavior?
Links that I referred to earlier:
If exactly-once semantics are impossible, what theoretical constraint is Kafka relaxing?
I also saw their documentation where they mention storing the processing result and the offset in the same storage. But my scenario is an API call: imagine the consumer calls the API and dies before committing the offset; the next poll will then get that message and call the API again. In this way the email will be sent twice.
Any help is appreciated.
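For reference, a common mitigation is to keep your own durable record of which messages have already been handled and commit offsets manually only afterwards. A rough sketch under those assumptions (sendEmail, alreadySent and markSent are hypothetical helpers backed by durable storage, and enable.auto.commit=false is assumed); this only narrows the duplicate window, and without an idempotent notification API a second email is still possible:

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch only: dedupe keyed on topic/partition/offset, stored in your own durable store.
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("notifications"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000); // 0.10.0 poll(long) API
        for (ConsumerRecord<String, String> record : records) {
            String dedupeId = record.topic() + "-" + record.partition() + "-" + record.offset();
            if (!alreadySent(dedupeId)) {    // skip records we have already handled
                sendEmail(record.value());   // the non-idempotent side effect
                markSent(dedupeId);          // record it durably before committing the offset
            }
        }
        consumer.commitSync();               // commit only after the batch is handled
    }
}
```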

Is it possible to track the consumers in Kafka and the timestamp in which a consumer consumed a message?

Is it possible to track who the consumers are in Kafka? Perhaps some way for a consumer to 'register' itself as a consumer to a Kafka topic?
If #1 is possible, then is it also possible for Kafka to track the time when a consumer consumed a message?
It is possible to implement these features in the application itself, but I wonder if Kafka already provides some way to do this. I can't find any documentation on these features so perhaps this is not possible in Kafka, but it would be great to get confirmation. Thank you!
You can track consumer groups, but I do not think you can track individual consumers within a group very easily. For a group, Kafka gives you the lag, and from that offset difference you would have to read the records yourself to actually get the time.
There is no other such "registration process".
What you could do is develop an "interceptor" that is able to track messages and times throughout the system. That is how Confluent Control Center is able to graphically display if/when consumers get messages.
However, that requires additional configuration on all consumers; more specifically, the interceptor must be on the classpath.
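As a sketch of what such an interceptor could look like (the class name and where it reports to are made up; the real hook is the ConsumerInterceptor interface, enabled via the interceptor.classes consumer property):

```java
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumeTimeInterceptor implements ConsumerInterceptor<String, String> {

    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        long now = System.currentTimeMillis();
        for (ConsumerRecord<String, String> record : records) {
            // Report topic/partition/offset plus the wall-clock consume time to your own tracking system
            System.out.printf("consumed %s-%d@%d at %d%n",
                    record.topic(), record.partition(), record.offset(), now);
        }
        return records;   // pass the records through unchanged
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

// Enable it on every consumer, e.g.:
// props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG, ConsumeTimeInterceptor.class.getName());
```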

Does it make sense to use Apache Kafka for this Scenario?

There are several applications which have to be integrated and which have to exchange issues. One of them will receive the issue, do something, and later change the status of this issue. The other applications involved with this issue should then get the new information. This continues until the issue reaches the final status Closed. The problem is that the issues have to be mapped, because these applications do not all support the same data format.
I'm not sure whether to send the whole Issue always or just the new Status as an Event.
How does Kafka Support Data Transformation?
What if my Issue has an attachment?(>5MB)
Thanks for your advice
Yes, it does make sense.
Kafka can do transformations through both the Kafka Streams API, and KSQL which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar depending on your chosen architecture) and reference a pointer to it in your Kafka message.
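For example, a minimal Kafka Streams sketch of such a mapping (topic names and the toTargetFormat helper are placeholders for your own issue formats):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// Read issues in the source system's format (topic names are placeholders)
KStream<String, String> issues = builder.stream("issues-system-a");

// Re-shape each payload into the format the target system expects;
// toTargetFormat(...) is a placeholder for your own mapping logic
issues.mapValues(value -> toTargetFormat(value))
      .to("issues-system-b");
```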
I'm not sure whether to send the whole Issue always or just the new Status as an Event.
You can do this either way. If you send the whole issue and then publish all subsequent updates to the same issue as Kafka messages with a common message key (perhaps a unique issue ID number), then you can configure your Kafka topic as a compacted topic, and the brokers will automatically delete older copies of the data to save disk space.
If you choose to only send deltas (changes), then you need to be careful to have a retention period that's long enough that the initial complete record will never expire while the issue is still open and receiving updates. The default retention period is 7 days.
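A minimal sketch of creating such a compacted topic with the Java AdminClient (topic name, partition and replica counts are assumptions):

```java
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

try (AdminClient admin = AdminClient.create(
        Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {

    // Compacted topic keyed by issue ID: the broker keeps at least the latest value per key
    NewTopic issues = new NewTopic("issues", 3, (short) 3)
            .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

    admin.createTopics(Collections.singleton(issues)).all().get();
}
```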
How does Kafka Support Data Transformation?
Yes. In Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
What if my Issue has an attachment?(>5MB)
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB then it's usually better to follow a claim check pattern: store them external to Kafka and just publish a reference link back to the externally stored data, so the consumer can retrieve the attachment out of band from Kafka.
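A rough producer-side sketch of that claim check pattern (uploadAttachment is a placeholder for your object-store client, and the header name is an assumption):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

// Upload the large attachment out of band (S3, GCS, a file share, ...) and keep only a reference.
// uploadAttachment(...) is a hypothetical helper returning e.g. "s3://issue-attachments/ISSUE-42/spec.pdf"
String attachmentUri = uploadAttachment(issueId, attachmentBytes);

ProducerRecord<String, String> record = new ProducerRecord<>("issues", issueId, issueJson);
record.headers().add("attachment-uri", attachmentUri.getBytes(StandardCharsets.UTF_8));

producer.send(record); // 'producer' is an already-configured KafkaProducer<String, String>
```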

Event vs Topic Apache Kafka

Sorry, this may be a basic question; I am trying to understand the difference between an event and a topic in Apache Kafka.
My understanding is that both are the same and that in a streaming context a topic is called an event. Correct me if I am wrong.
What is called "event" in the streaming context (if we speak about Kafka Streams API) is a "message" in the normal Kafka usage. The topic is the place where you store messages (or events, in streaming context).
Event (the data) is something you would store in a Topic e.g. userA updated his profile - this is an event and you can send this across (in any format e.g. a JSON payload) to a Kafka topic. Both are not same - no matter what the context
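To make that concrete, a minimal producer sketch that sends such an event to a topic (topic name and payload are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // The event "userA updated his profile", as a JSON payload, stored in the "user-events" topic
    String event = "{\"userId\":\"userA\",\"type\":\"PROFILE_UPDATED\"}";
    producer.send(new ProducerRecord<>("user-events", "userA", event));
}
```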
A Brief Introduction to Events and Topics
I just want to give my knowledge on this.
Events
An event normally means that something happened: changes in an object's state are events. If my room temperature is 35°C and it changes to 37°C, then a temperature-change event has happened. There are a lot of events happening in the world.
Topic
As I said, there are a lot of events happening in the world, and we need to organise/categorise them. This is where the topic comes into play. A topic is simply used to group events by their nature. To illustrate: in our system customers can order products, and all the events related to this action can be categorised into an **order** topic. There is no issue with duplicating events between topics, so, for example, from this order topic we can filter order events for a specific geolocation into separate topics, like **Srilanka-order**.
Topics can be loosely compared to tables in a database (though not exactly), and events to the records of the database.