How to track who published a message in Kafka?

Wondering if there is a way to force a broker to include information (perhaps in a header) about the publisher that pushed a Record.
The publisher can do this itself, but it can just as easily omit it.

"Force" a broker? No, brokers only receive bytes of a record. That information would generally be done at the producer, but even then you cannot force usage of a certain serializer or message format (for the later, Confluent Server offers broker-side schema validation, but then you'd run into the issue that clients can provide fake information).
With that note, CloudEvents defines a "spec" for metadata that is recommended for each message in event driven systems.
In my opinion, best you can do is force authorization + authentication policies to track client/credentials used to access a set of topics.
OpenPolicyAgent or Apache Ranger have Kafka plugins that can assist with this
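For illustration, a producer that chooses to identify itself could attach that information as a record header. A minimal sketch, assuming a hypothetical `publisher` header and topic name; nothing here can be enforced by the broker, and a client is free to omit or fake the value:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdentifyingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-1", "{\"amount\": 42}");
            // Self-reported identity; the broker cannot verify or enforce this header.
            record.headers().add("publisher", "billing-service".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```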

Related

Kafka record headers won’t be forwarded to changelog topics

I need the headers in every topic for object mapping purposes.
Currently, when I add a header to a record in a processor, the header is present in the target output topic, but not in the changelog topics used by state stores.
How can the headers be forwarded to each and every derived internal topic?
Unfortunately, there is no way to propagate headers to the changelog topics. The changelog is considered to be an internal topic and as such is not intended for consumption by the Streams user, so the Streams developers did not add the functionality you're asking about.
If I might ask, what is the use-case? There might be a different work-around.
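For reference, adding a header inside a processor, as the question describes, might look like the sketch below (the header name and value are made up, and this assumes the newer Processor API). The header reaches the downstream output topic, but not the changelog topics written by state stores:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Hypothetical processor that tags every record with an "object-type" header.
public class HeaderTaggingProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        // The header is forwarded to the output topic, but not to changelog topics.
        record.headers().add("object-type",
            "com.example.MyEvent".getBytes(StandardCharsets.UTF_8));
        context.forward(record);
    }
}
```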

Restrict Kafka consumers based on event headers (metadata)

The book "Building Event-Driven Microservices" gives good practice to use a metatags (event headers) for placing restrictions on Kafka consumers. One of which is the following:
Deprecation:
A way to indicate that a stream is out of date. Marking an event stream as deprecated allows existing systems to continue using it while new microservices are blocked from requesting a subscription... the owner of the deprecated stream of events can be notified when there are no more registered users of the deprecated stream, at which point it can be safely deleted.
Can you please point me to how this can be implemented (Java/Spring-centric)? Is it possible for Kafka ACLs to make restrictions based on event headers?
Thank you in advance!
Is it possible for Kafka ACLs to make restrictions based on event headers?
No, but you can filter records out after receiving them. ACLs control access to the topic as a whole, not to particular records.
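A minimal sketch of that client-side filtering, assuming a hypothetical `deprecated` header set by the stream owner; the broker still delivers the record, so the check is purely advisory:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class DeprecationAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("some-stream"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    Header deprecated = record.headers().lastHeader("deprecated");
                    if (deprecated != null) {
                        // The broker delivered the record anyway; skipping it is a client-side decision.
                        continue;
                    }
                    // process(record) ...
                }
            }
        }
    }
}
```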
the owner of the deprecated stream of events can be notified when there are no more registered users of the deprecated stream
Keep in mind that Kafka is not a pure messaging solution: it has no concept of "registered" consumers, and any client can read a message at any time, as long as the message has not been removed by the cluster.
You'd need to implement your own "notification" pipeline to detect that there are no longer any instances interested in the original topic (possibly using Kafka again).

How to freeze a new version of the AVRO schema of a topic so that non-compliant messages are rejected?

With the Java client producer, the message can be fine-tuned to comply with the schema format before publishing to topics.
With the Kafka REST proxy, how can messages be rejected if they cannot be deserialized with the topic's Kafka Avro schema version? This is to prevent junk from being added by clients that do not comply with the schema. I see that the schema version is upgraded automatically when a new schema is published to a topic. How can the messages published to the topic be restricted? The junk could come from bugs in any of the producer clients.
I am searching the documentation and I am new to Kafka. I know the consumer can be smarter with offsets, but I want to keep junk out of the topics. Thanks.
There's no way to prevent this with Apache Kafka itself without, for example, putting a reverse proxy in front of the Registry that denies schema update requests. And you'd have to deny access to anyone bypassing that proxy.
If I understood your question correctly, you need to reject all events that are not compatible with your standard schema. If that's correct, it would be helpful to try the Avro Schema Registry compatibility checks.
How to freeze a new version of the AVRO schema of a topic so that non-compliant messages are rejected?
If you want to freeze a version of the schema, use the FULL/FULL_TRANSITIVE compatibility type with all required fields. Then you cannot delete any field, because the registry won't allow you to delete non-optional fields, and you can only add optional fields.
That way you freeze the removal of existing fields; optional fields can still be added, and this shouldn't impact other, non-relevant clients.
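As a sketch, the compatibility level can be set per subject with the Confluent Schema Registry Java client (the subject name and registry URL below are assumptions, and the exact method may differ by client version); the same change can also be made through the registry's REST API:

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class FreezeCompatibility {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://localhost:8081", 100);
        // Require every new schema version to be forward- and backward-compatible
        // with all previous versions registered under this subject.
        client.updateCompatibility("my-topic-value", "FULL_TRANSITIVE");
    }
}
```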
Hope it helps!

Kafka use case to send data to external system

Studying Kafka in the documentation, I found the following sentence:
Queuing is the standard messaging type that most people think of: messages are produced by one part of an application and consumed by another part of that same application. Other applications aren't interested in these messages, because they're for coordinating the actions or state of a single system. This type of message is used for sending out emails, distributing data sets that are computed by another online application, or coordinating with a backend component.
This implies that Kafka topics aren't suitable for streaming data to external applications. However, in our application we use Kafka for exactly that purpose. We have consumers which read messages from Kafka topics and try to send them to an external system. With this approach we have a number of problems:
We need a separate topic for each external application (assume the number of external applications is > 300, which doesn't scale well).
Delivery to an external system can fail when the external application is unavailable or for some other reason. It is not correct to keep retrying the same message and never commit the offset. On the other hand, there is no nicely organized log where I can see all failed messages and try to resend them.
What are other best-practice approaches for streaming data to an external application? Or is Kafka a good choice for this purpose?
Just sharing a piece of experience. We use Kafka extensively for integrating external applications in the enterprise landscape.
We use a topic-per-event-type pattern. The current number of topics is about 500. Governance is difficult, but we have our own utility tooling, so it is feasible.
Where possible we extend the external application to integrate with Kafka, so the consumers become part of the external application, and when the application is not available they simply don't poll the data.
If extending the external system is not possible, we use connectors, which are mostly implemented by us internally. We distinguish two types of errors: recoverable and non-recoverable. If the error is non-recoverable (for example, the message is corrupted or not valid), we log the error and commit the offset. If the error is recoverable (for example, the database the message should be written to is unavailable), we do not commit the offset, suspend the consumers for some period of time, and try again after that period. In your case it probably makes sense to have more topics with different behavior (logging errors, rerouting failed messages to different topics, and so on).
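A rough sketch of that recoverable/non-recoverable split with the plain Java consumer; the sink call, exception type, topic name, and timings are all placeholders for whatever the connector actually does:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ExternalSystemForwarder {

    // Hypothetical error type and sink call for the external application.
    static class RecoverableSinkException extends RuntimeException { }
    static void sendToExternalSystem(String payload) { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "external-forwarder");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events-for-external-app"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    boolean handled = false;
                    while (!handled) {
                        try {
                            sendToExternalSystem(record.value());
                            handled = true;
                        } catch (RecoverableSinkException e) {
                            // Recoverable (e.g. target unavailable): wait and retry the same record.
                            // In a real consumer you would pause() the partitions and keep polling
                            // so the group coordinator does not evict this instance.
                            Thread.sleep(Duration.ofSeconds(30).toMillis());
                        } catch (RuntimeException e) {
                            // Non-recoverable (e.g. corrupt or invalid message): log it and skip.
                            handled = true;
                        }
                    }
                }
                consumer.commitSync(); // commit only after the whole batch has been handled
            }
        }
    }
}
```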

Does it make sense to use Apache Kafka for this Scenario?

There are several applications which have to be integrated and have to exchange Issues. One of them will get the Issue, do something, and later change the Status of this Issue. The other applications that could be involved with this Issue should then get the new information. This continues until the Issue reaches the final Status, Closed. The problem is that the Issues have to be mapped, because these applications do not all support the same data format.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
How does Kafka Support Data Transformation?
What if my Issue has an attachment (>5 MB)?
Thanks for your advice.
Yes, it does make sense.
Kafka can do transformations through both the Kafka Streams API and KSQL, which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar depending on your chosen architecture) and put a pointer to it in your Kafka message.
I'm not sure whether to always send the whole Issue or just the new Status as an event.
You can do this either way. If you send the whole Issue and then publish all subsequent updates to the same Issue as Kafka messages that carry a common Kafka message key (perhaps a unique issue ID number), then you can configure your Kafka topic as a compacted topic, and the brokers will automatically delete older copies of the data to save disk space.
If you choose to only send deltas (changes), then you need to be careful to have a retention period that is long enough that the initial complete record never expires while the issue is still open and publishing updates. The default retention period is 7 days.
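A sketch of the compacted-topic variant (the topic name, partition count, and payload are made up): create the topic with cleanup.policy=compact and key every update by the issue ID, so compaction eventually keeps only the latest state per Issue:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.config.TopicConfig;

public class IssueTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Compacted topic: the broker keeps at least the latest record for each key.
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic issues = new NewTopic("issues", 3, (short) 1)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(issues)).all().get();
        }

        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every update for the same Issue uses the same key, so compaction
            // eventually retains only the latest full state of that Issue.
            producer.send(new ProducerRecord<>("issues", "ISSUE-1234",
                "{\"status\": \"IN_PROGRESS\", \"assignee\": \"alice\"}"));
        }
    }
}
```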
How does Kafka Support Data Transformation?
Yes. In Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
What if my Issue has an attachment (>5 MB)?
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB, it's usually better to follow the claim-check pattern: store them externally to Kafka and just publish a reference link to the externally stored data, so the consumer can retrieve the attachment out of band from Kafka.
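A sketch of that claim-check approach (the object-store upload and the URLs are placeholders; S3 or any store with stable links would do): upload the attachment first, then publish only a small reference inside the Kafka message:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AttachmentClaimCheck {

    // Placeholder for whatever object store the architecture provides (S3, blob storage, ...).
    static String uploadToObjectStore(byte[] bytes) {
        String objectKey = "attachments/" + UUID.randomUUID();
        // ... upload 'bytes' under 'objectKey' ...
        return "https://files.example.com/" + objectKey;
    }

    public static void main(String[] args) throws Exception {
        byte[] attachment = Files.readAllBytes(Path.of("screenshot.png"));
        String url = uploadToObjectStore(attachment);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The Kafka message stays small: the Issue update plus a pointer to the attachment.
            String value = "{\"issueId\": \"ISSUE-1234\", \"status\": \"OPEN\", \"attachmentUrl\": \"" + url + "\"}";
            producer.send(new ProducerRecord<>("issues", "ISSUE-1234", value));
        }
    }
}
```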