Kafka security - secure producers

I have been reading about Kafka security and it looks like ACLs are the way to go to have a secure Kafka setup. But my questions are:
1) Consumer level: Is there a way to have more granular control over what part of the data can be read on our topics? For example: can we somehow make sure some part of our Kafka message, say an ssn field, is not visible to certain users?
2) Producer level: let's assume we have a topic for voters, and we have different producers with write ACLs on that topic. How do we control what they write? Say a producer from the state of Missouri records that a voter voted in Illinois: he has write ACLs on the voters topic, but in reality he should not be able to write that record. Is having a per-state topic the only solution?

Both of your questions are sort of pointing to per-field entitlements/encryption/tokenization. This is not supported by default in core Kafka, but can be implemented at the producers/consumers if you have the knowledge and infrastructure.
For question 1 you'd tokenize/encrypt any fields and only clients with the correct keys could decrypt the payload.
For question 2 you'd encrypt the entire record payload, possibly using Kafka's headers to determine the key pair required to decrypt the data.
In both cases you'll need some key management system to ensure the producers and consumers have access to the keys they require, and you'll need to write your own encryption/decryption logic into the clients. I'd implement this in the serializer/deserializer to keep it simple.
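For illustration, a minimal sketch of such a serializer might look like the following. The Voter type, field names, config key, and the AES handling are all made up for the example; in practice the key would come from your key management system rather than client config, and you'd use a proper wire format instead of the toy CSV payload.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

/** Hypothetical record type used only for this sketch. */
class Voter {
    public String name;
    public String ssn;   // sensitive field to tokenize/encrypt
    public String state;
}

/**
 * Sketch of a serializer that encrypts only the sensitive field before the
 * record leaves the producer. Only clients holding the key can recover the
 * SSN; everyone else still sees the remaining fields.
 */
public class VoterSerializer implements Serializer<Voter> {

    private SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Assumption: key material is injected via a made-up client config
        // property for the sketch; a real deployment would fetch it from a KMS.
        byte[] rawKey = Base64.getDecoder()
                .decode((String) configs.get("voter.serializer.aes.key"));
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    private String encrypt(String plaintext) throws Exception {
        byte[] iv = new byte[12];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        // Prepend the IV so the matching deserializer can decrypt.
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(out);
    }

    @Override
    public byte[] serialize(String topic, Voter voter) {
        try {
            // Simplistic CSV payload purely for the sketch; only ssn is encrypted.
            String payload = voter.name + "," + encrypt(voter.ssn) + "," + voter.state;
            return payload.getBytes(StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new SerializationException("Failed to encrypt sensitive field", e);
        }
    }
}
```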
If you don't want to do all this, then separating access to data via ACLs applied to topics is the best approach.

Kafka Streams - disable internal topic creation

I work in an organization where we must use the shared Kafka cluster.
Due to internal company policy, the account we use for authentication has only the read/write permissions assigned.
We are not able to request the topic-create permission.
To create the topic we need to follow the onboarding procedure and know the topic name upfront.
As we know, Kafka Streams creates internal topics to persist the stream's state.
Is there a way to disable the fault tolerance and keep the stream state in memory or persist it in the file system?
Thank you in advance.
This entirely depends on how you write the topology. For example, stateless DSL operators such as map/filter/forEach don't create any internal topics, as in the sketch below.
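A minimal example of such a purely stateless topology (the topic names here are placeholders):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

// Minimal sketch of a stateless topology; "orders" and "valid-orders" are
// placeholder topic names. No aggregation or join is used, so Kafka Streams
// creates no repartition or changelog topics for it - read/write ACLs on the
// source and sink topics are enough. Default serdes are assumed to be set in
// the application's StreamsConfig.
public class StatelessTopologyExample {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("orders")
               .filter((key, value) -> value != null && !value.isEmpty())
               .mapValues(value -> value.toUpperCase())
               .to("valid-orders");
        return builder.build();
    }
}
```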
If you actually need to do aggregation and build state stores, then you really shouldn't disable topics. Yes, state stores are kept either in memory or as RocksDB on disk, but they're still backed by changelog topics so they can actually be distributed, or rebuilt in case of failure.
If you want to prevent them, I think you'll need an authorizer class defined on the broker that can restrict topic creation based, at least, on client-side application.id and client.id regex patterns; there's nothing you can do in the client config.

Custom compaction for Kafka topic on the broker side?

Assume some Kafka cluster with some topic named MyTopic. According to the business logic I am implementing, adjacent records are considered equal whenever some subset of the value's properties (rather than the key's) are equal. Thus, built-in compaction, driven by key equality, doesn't work for my scenario. I could implement pseudo-compaction on the consumer side, but that is not an option either, due to performance. The whole idea is to maintain the right compaction on the broker side. In addition to that, such compaction has to be applied only within some special consumer group; all other groups have to get the entire log of records as they are now.
According to my knowledge there is no way to implement such compaction. Am I wrong?
You cannot have custom log compaction. The cleanup policy is either delete or compact based on keys: https://kafka.apache.org/documentation/#compaction
However, if your case only concerns some special consumer groups, you could create a stream that reads your topic, derives a new key (a hash based on the value subset), and writes to another topic that has the compact cleanup policy applied to it.
This obviously means the data is nearly duplicated across two topics, which might not suit your case.
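A rough sketch of that stream (topic names, value format, and the hashed field subset are placeholders); the target topic would be created with cleanup.policy=compact on the broker:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

// Sketch: re-key MyTopic by a hash of the value properties that define
// "equality" for the special consumer group, then write to a compacted topic.
// "MyTopic-compacted" is a placeholder name and must be created with
// cleanup.policy=compact; extractBusinessKey stands in for the real logic.
public class RekeyForCompaction {

    static String extractBusinessKey(String value) {
        // Placeholder: hash whichever subset of value fields defines equality.
        return Integer.toHexString(value.hashCode());
    }

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("MyTopic")
               .selectKey((oldKey, value) -> extractBusinessKey(value))
               .to("MyTopic-compacted");
        return builder.build();
    }
}
```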
This question has already been answered correctly, i.e. it's not currently possible. But it's worth noting that KIP-280 has been approved and will add new compaction policies. It is currently targeted for Kafka 2.5.
It looks like your goal would be achieved with the new header policy.

Kafka topic filtering vs. ephemeral topics for microservice request/reply pattern

I'm trying to implement a request/reply pattern with Kafka. I am working with named services and unnamed clients that send messages to those services, and clients may expect a reply. Many (10s-100s) of clients may interact with a single service, or consumer group of services.
Strategy one: filtering messages
The first thought was to have two topics per service - the "HelloWorld" service would consume the "HelloWorld" topic, and produce replies back to the "HelloWorld-Reply" topic. Clients would consume that reply topic and filter on unique message IDs to know what replies are relevant to them.
The drawback there is that it seems to create unnecessary work for clients, which have to filter out a potentially large number of irrelevant messages when many clients are interacting with one service.
Strategy two: ephemeral topics
The second idea was to create a unique ID per client, and send that ID along with messages. Clients would consume their own unique topic "[ClientID]" and services would send to that topic when they have a reply. Clients would thus not have to filter irrelevant messages.
The drawback there is clients may have a short lifespan, e.g. they may be single use scripts, and they would have to create their topic beforehand and delete it afterward. There might have to be some extra process to purge unused client topics if a client dies during processing.
Which of these seems like a better idea?
We are using Kafka in production as a handler for both event-based and request/response messages. Our approach to implementing request/response is your first strategy, because with per-client topics, when the number of clients grows you end up creating many topics, some of which are completely useless. Another reason for choosing the first strategy was our topic naming guideline that each service should belong to only one topic, for tracking. Kafka is not really made for request/response messages, but I recommend the first strategy because of:
fewer topics
better service tracking
better topic naming
But you have to be careful with your consumer groups; misconfiguring them may cause data loss.
A better approach is to use the first strategy with many partitions in one topic (per service), where each client sends and receives its messages with a unique key. Kafka guarantees that all messages with the same key will go to a specific partition. This approach doesn't need filtering of irrelevant messages and is perhaps a combination of your two strategies.
Update:
As @ValBonn said, in the suggested approach you always have to make sure that the number of partitions >= the number of clients.
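A rough sketch of that keyed variant (topic name, client ID, bootstrap server, and serializers here are placeholders): the service keys every reply with the requesting client's unique ID, so all replies for one client land on the same partition of the shared reply topic.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: the service keys each reply by the requesting client's unique ID,
// so Kafka's default partitioner routes all replies for that client to the
// same partition of the shared reply topic.
public class ReplyProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String clientId = "client-42";                   // taken from the request
            producer.send(new ProducerRecord<>("HelloWorld-Reply", clientId, "reply payload"));
        }
    }
}
```

On the consuming side, the client can either read the whole reply topic and cheaply skip records whose key isn't its own ID, or, if it knows the partitioning scheme, read only the partition its key maps to.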

Eliminate redundancy between ktable internal topic and user topic

I'm writing an app for a book library and I have 2 microservices: memberService, and bookService. The memberService creates a ktable of members (built by aggregating change messages on another topic) for its own internal use. The bookService also needs read access to that ktable. Currently I share the data by having the memberService call memberTable.toStream().to("memberTableTopic")
and I have the bookService subscribe to the memberTableTopic.
Based on my understanding of how ktables work, the data in memberTableTopic will be identical to the backing internal topic used by the ktable. Is there a good way to eliminate this redundancy? Should my bookService subscribe to the internal topic?
Yes, the data will look the same in both topics: the internal topic and the user topic.
Conceptually, internal topics are used internally by the Kafka Streams application. That implies that when the application ID is reset, these internal topics will be deleted and you lose the data. User topics, on the other hand, exist externally to an application and can be used by any application at any time.
It depends on how you want to use the data. If you want to reduce the redundancy, you can set a short retention on your internal topics.
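For illustration, the sharing pattern looks roughly like this (topic names, store contents, and the aggregation are placeholders): the memberService publishes its table to a user topic, and the bookService rebuilds a KTable from that topic instead of reading the internal changelog.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KTable;

public class MemberTableSharing {

    // memberService side: aggregate member change messages into a KTable and
    // expose it through a user topic rather than the internal changelog.
    public static Topology memberServiceTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> memberTable =
                builder.<String, String>stream("member-changes")   // placeholder source topic
                       .groupByKey()
                       .reduce((oldValue, newValue) -> newValue);  // placeholder aggregation
        memberTable.toStream().to("memberTableTopic");
        return builder.build();
    }

    // bookService side: rebuild a read-only copy of the table from the user topic.
    public static Topology bookServiceTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> members = builder.table("memberTableTopic");
        // ... join or look up against `members` as needed.
        return builder.build();
    }
}
```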

What is the performance impact to Kafka consumers for re-keying a topic?

Use case: We have a compacted topic which will use a UUID to track the lifecycle of an order.
Some of our services need to know how many orders any given account has at one time, to provide alerts and metrics (using Kafka Streams KTables). But since we are keying by UUID, the records will be produced across the topic partitions and not co-located.
My team's proposal is to use a custom partitioner class so that we can keep the UUID as the record key, but use the account ID value in the hashing algorithm when producing the records.
But our DevOps/support team is concerned that tooling which doesn't know about the custom partitioner class (such as MirrorMaker) could mess up which partition records end up in.
For the services that need this data co-located, we could use the Kafka Streams re-keying logic, which will re-key the data into an internal repartition topic for that service, but from my understanding this comes with a performance impact. Is there any clear way of knowing what kind of impact it would be? Would it mean our processing time for each record doubles?
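For reference, the custom partitioner we are proposing would look roughly like the sketch below (extracting the account ID from the value is simplified here; a real implementation would parse the actual value format):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

/**
 * Sketch of a partitioner that keeps the order UUID as the record key but
 * chooses the partition from the account ID, so all orders for one account
 * land on the same partition.
 */
public class AccountIdPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String accountId = extractAccountId(value);   // placeholder extraction
        return Utils.toPositive(
                Utils.murmur2(accountId.getBytes(StandardCharsets.UTF_8))) % numPartitions;
    }

    private String extractAccountId(Object value) {
        // Placeholder: pull the account ID out of whatever the value type is.
        return String.valueOf(value);
    }

    @Override
    public void close() {}
}
```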