Restrict Kafka consumers based on event headers (metadata) - apache-kafka

The book "Building Event-Driven Microservices" recommends using metadata tags (event headers) to place restrictions on Kafka consumers. One of them is the following:
Deprecation:
A way to indicate that a stream is out of date. Marking an event stream as deprecated allows existing systems to continue using it while new microservices are blocked from requesting a subscription... the owner of the deprecated stream of events can be notified when there are no more registered users of the deprecated stream, at which point it can be safely deleted.
Can you please point me to how this can be implemented (Java/Spring centric)? Is it possible for Kafka ACLs to place restrictions based on event headers?
Thank you in advance!

Is it possible for Kafka ACL to make restrictions based on event headers?
No, but you can filter records out after receiving them. ACLs restrict access to a topic as a whole, not to particular records.
the owner of the deprived stream of events can be notified when there are no more registered users of the deprecated stream
Keep in mind that Kafka is not a pure messaging solution, and it has no concept of "registered" consumers: any consumer can read a message at any time, as long as the message has not been removed by the cluster.
You'd need to implement your own "notification" pipeline to signal that no instances are still interested in the original topic (possibly built with Kafka again).
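Since the broker cannot enforce this, the check has to live on the consumer side. Below is a minimal, framework-agnostic sketch of the predicate a Spring Kafka `RecordFilterStrategy` (or a plain consumer loop) could apply before handing a record to your listener. The header name `deprecated` and its string encoding are assumptions for illustration — nothing in Kafka or the book prescribes them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

class DeprecationFilter {
    // Hypothetical header name; nothing in Kafka mandates it.
    static final String DEPRECATION_HEADER = "deprecated";

    /**
     * Returns true if the record's headers mark the stream as deprecated,
     * i.e. a "new" consumer should skip (or refuse to process) the record.
     * Headers are modeled here as a simple name -> bytes map.
     */
    static boolean shouldSkip(Map<String, byte[]> headers) {
        byte[] value = headers.get(DEPRECATION_HEADER);
        if (value == null) {
            return false; // no deprecation marker -> process normally
        }
        return "true".equalsIgnoreCase(new String(value, StandardCharsets.UTF_8));
    }
}
```

In Spring Kafka, the same predicate could back a `RecordFilterStrategy` registered on the listener container factory, so filtered records never reach the `@KafkaListener` method.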

Related

kafka consumer how to force refresh metadata in order to discover new topics

I'm using a regex pattern to subscribe to a group of topics, which might be created dynamically. However, it can take quite a while before the consumer discovers newly created topics.
I can set the topic.metadata.refresh.interval.ms property to change the polling interval, but I'm concerned that short intervals might add overhead. So I think a notification approach would be better, i.e., when a new topic is created, the creator notifies the consumer service.
I'm looking for an API that forces the consumer to refresh its topic metadata. I didn't find one after looking through the Kafka Consumer APIs... any ideas?
The only way to do this via the API would be to .close() the consumer and re-subscribe it upon receiving such a "notification event".
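The notification-driven approach boils down to a small piece of logic: given the topics the consumer already knows about and a fresh topic list (delivered by the notification, or fetched separately), work out which newly created topics match the subscription pattern and therefore justify a close-and-resubscribe. A sketch of just that decision, with the class and method names invented for the example:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class TopicWatcher {
    /**
     * Returns the topics in `current` that match `subscription`
     * but were not present in `known` — i.e. the topics that would
     * justify closing and re-subscribing the consumer.
     */
    static Set<String> newlyMatching(Set<String> known,
                                     Set<String> current,
                                     Pattern subscription) {
        Set<String> fresh = new HashSet<>();
        for (String topic : current) {
            if (!known.contains(topic) && subscription.matcher(topic).matches()) {
                fresh.add(topic);
            }
        }
        return fresh;
    }
}
```

If `newlyMatching` returns a non-empty set, the consumer would `.close()` and subscribe again with the same pattern so the new topics are picked up immediately.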

Event broadcasting in Kafka?

Is there a way to have an event delivered to every subscriber of a topic, regardless of consumer group? Think of a "refresh your local cache" kind of scenario.
As far as Kafka is concerned, you cannot subscribe to a topic without a consumer group.
Out of the box, this isn't a pattern a Kafka consumer supports; there is no way to make all consumers in a group read all messages from all partitions. There cannot be more consumer clients in a group than partitions (which makes "fan-out" hard), and each message goes to only one consumer before its offset is committed and the whole consumer group seeks those offsets forward to consume later events.
You'd need a layer above the consumer to decouple yourself from the consumer-group limitations.
For example, with Interactive Queries, you'd consume a topic and build a StateStore from the data that comes in, effectively building a distributed cache. With that, you can layer in an RPC layer (mentioned in those docs) that allows external applications over a protocol of your choice (e.g. HTTP) to later query and poll that data. From an application that is polling the data, you then would have the option of forwarding "notification events" via any method of your choice.
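As an illustration of that layering, here is a stripped-down, in-memory stand-in for what the consumer + StateStore + RPC combination does conceptually: the consumer loop upserts every polled record into a local store, and an RPC handler (not shown) answers queries against it. All names here are invented for the sketch; a real implementation would query a Kafka Streams state store via Interactive Queries rather than a plain map.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class LocalCache {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    /** Called from the consumer loop for every record polled. */
    void upsert(String key, String value) {
        store.put(key, value);
    }

    /** What the RPC layer would expose to external polling clients. */
    Optional<String> query(String key) {
        return Optional.ofNullable(store.get(key));
    }
}
```

Each external application polls this query endpoint independently, which is what decouples "everyone sees the update" from the consumer-group semantics of the underlying topic.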
For a framework that already exposes most of this out of the box, check out Azkarra Streams (I have no affiliation).
Or you can use alternative solutions such as Kafka Connect and write the data to "message board" services like Slack or Telegram instead, where many people explicitly subscribe to those channel notifications.

How to track who published a message in Kafka?

Wondering if there is a way to force a broker to include information (perhaps in a header) about the publisher that pushed a Record.
The publisher can do this, but it can easily avoid it too.
"Force" a broker? No, brokers only receive the bytes of a record. That information would generally be added at the producer, but even then you cannot force the use of a certain serializer or message format (for the latter, Confluent Server offers broker-side schema validation, but then you'd run into the issue that clients can provide fake information).
On that note, CloudEvents defines a spec for metadata that is recommended for each message in event-driven systems.
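For reference, the CloudEvents Kafka protocol binding carries its metadata in record headers prefixed with `ce_`. A minimal sketch of assembling the four required attributes as a Kafka-style header map — the `source` and `type` values shown in the comments are placeholders, and the helper itself is invented for the example:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class CloudEventHeaders {
    /**
     * Builds the required CloudEvents attributes (specversion, id,
     * source, type) as header name -> bytes entries, using the ce_
     * prefix defined by the CloudEvents Kafka protocol binding.
     */
    static Map<String, byte[]> required(String source, String type) {
        Map<String, byte[]> headers = new LinkedHashMap<>();
        headers.put("ce_specversion", utf8("1.0"));
        headers.put("ce_id", utf8(UUID.randomUUID().toString()));
        headers.put("ce_source", utf8(source)); // e.g. "/orders-service"
        headers.put("ce_type", utf8(type));     // e.g. "com.example.order.created"
        return headers;
    }

    private static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Note this only standardizes the metadata format; as the answer says, nothing stops a misbehaving producer from filling these headers with fake information.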
In my opinion, the best you can do is enforce authentication + authorization policies to track the client/credentials used to access a set of topics.
Open Policy Agent and Apache Ranger have Kafka plugins that can assist with this.

Kafka Streaming Application with Not Null Check

I have a streaming application that is subscribed to two topics and publishes to one topic. One subscribed topic comes from a data source beyond my control and gives me null values where there shouldn't be any.
So I was thinking of implementing a null check in this streaming application, but I need to know the latest published message, because at the moment the streaming app is essentially stateless.
So I would add a state store to the streaming app where I can query the latest message.
Is this a legitimate approach? Are there other approaches to this problem beyond adding "state" to the streaming app?
If you want to handle the possible null values within the Streams app and keep track of the latest published message, then yes, adding a state store is the appropriate thing to do.
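The store logic itself is small. Here is a plain-Java sketch of the "remember the latest non-null value per key" rule — in Kafka Streams this would live in a `KeyValueStore` updated from a processor rather than a `HashMap`, and the class and method names below are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

class LatestNonNullStore {
    private final Map<String, String> latest = new HashMap<>();

    /**
     * If the incoming value is null, substitute the last good value
     * seen for that key (if any); otherwise remember the value and
     * pass it through unchanged.
     */
    String process(String key, String value) {
        if (value == null) {
            return latest.get(key); // may still be null if key never seen
        }
        latest.put(key, value);
        return value;
    }
}
```

One design note: in Kafka (and Kafka Streams) a null value conventionally means a tombstone/delete, so silently replacing nulls with the previous value is a deliberate choice that should be documented for the topic's consumers.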

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All the features that seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem, given Kafka's architecture. The rate of messages probably won't exceed 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?
Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id on both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.