Kafka Streams - disable internal topic creation - apache-kafka

I work in an organization where we must use a shared Kafka cluster.
Due to internal company policy, the account we use for authentication has only read/write permissions assigned.
We are not able to request the topic-create permission.
To create a topic we need to follow an onboarding procedure and know the topic name upfront.
As we know, Kafka Streams creates internal topics to persist the stream's state.
Is there a way to disable the fault tolerance and keep the stream state in memory or persist in the file system?
Thank you in advance.

This depends entirely on how you write the topology. For example, stateless DSL operators such as map/filter/forEach don't create any internal topics.
If you actually need to do aggregation and build state stores, then you really shouldn't disable the internal topics. Yes, state stores are kept either in memory or in RocksDB on disk, but they are still backed by internal changelog topics so that they can be distributed and rebuilt in case of failure.
If you want to prevent their creation, I think you'll need an authorizer class defined on the broker that can restrict topic creation based, at least, on regex patterns over the client-side application.id and client.id; there is nothing you can do in the client config alone.
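For illustration, here is a minimal sketch (topic and store names are hypothetical) contrasting a stateless topology, which creates no internal topics, with a stateful one, which does:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;

public class TopologySketch {

    // Stateless operators only: no internal (repartition/changelog) topics are created.
    public static Topology statelessTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")        // hypothetical topic name
               .filter((key, value) -> value != null)
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");                          // hypothetical topic name
        return builder.build();
    }

    // Aggregation materializes a state store; by default Kafka Streams backs it with an
    // internal changelog topic named <application.id>-counts-store-changelog.
    public static Topology statefulTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .groupByKey()
               .count(Materialized.as("counts-store"));
        return builder.build();
    }
}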

Related

Use transactional API and exactly-once with regular Producers and Consumers

The Confluent documents I was able to find all focus on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.
However, the transaction APIs were introduced at the level of the "regular" Producer/Consumer, and that is the level all the explanations and diagrams are expressed in terms of.
I was wondering whether it's OK to use those APIs directly, without Kafka Streams.
I do understand the Kafka processing boundaries and the guarantees, and I'm OK with relaxing them. I don't need a 100% exactly-once guarantee; a duplicate once in a while is acceptable, for example when I read from or write to external systems.
The problem I'm facing is that I need to build an ETL pipeline for a big-data project, and we are getting a lot of duplicates when the apps are restarted/relocated to different hosts automatically by Kubernetes.
In general, it's not a problem to have some duplicates, since it's a pipeline for analytics where duplicates are acceptable, but if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once for the Kafka side at least (i.e. make sure that re-processing doesn't happen during reassignments, shut-downs, or scaling activities)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics are achievable with regular producers and consumers as well; Kafka Streams itself is built on top of these clients.
We can use an idempotent producer to achieve this.
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence covers the internal retries done by the Kafka clients, but it doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source, we need to ensure that the source doesn't hand us the same message twice. For example, if it is a database, use a WAL (write-ahead log), maintain the last read offset for that WAL, and restart from that point. Debezium, for example, does that; you may check whether it supports your data source.
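For reference, here is a minimal sketch of the consume-transform-produce pattern using only the plain clients (bootstrap server, group id, topic names and the transactional.id are hypothetical): the consumer's offsets are committed inside the producer's transaction, so a batch is either fully written and marked as consumed, or neither.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalCopy {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-group");               // hypothetical
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // offsets are committed via the transaction
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");   // skip records from aborted transactions
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "etl-copier-1");    // hypothetical, must be stable per instance
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.initTransactions();
            consumer.subscribe(List.of("source-topic"));                              // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        producer.send(new ProducerRecord<>("sink-topic", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Commit the consumed offsets as part of the same transaction, so the output
                    // records and the consumer progress are committed (or aborted) together.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (RuntimeException e) {
                    producer.abortTransaction();
                }
            }
        }
    }
}

Note that this only covers the Kafka-to-Kafka part of the pipeline; duplicates introduced while reading from or writing to external systems still have to be handled at those boundaries, as described above.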

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics by themselves, used with the basic Consumer/Producer APIs, are not suitable as a key-value database. They are, however, widely used as a backing store to persist KV database/cache data, e.g. in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it.
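As an illustration of that replay idea, here is a minimal sketch (bootstrap server and topic name are hypothetical) that reads a compacted topic from the beginning and rebuilds an in-memory map:

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CacheWarmup {
    public static Map<String, String> loadSnapshot(String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // no group management needed for a replay
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Map<String, String> cache = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            // Read until the end offsets captured above; newer writes keep flowing into the topic.
            while (partitions.stream().anyMatch(tp -> consumer.position(tp) < end.get(tp))) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (rec.value() == null) {
                        cache.remove(rec.key());      // tombstone: the key was deleted
                    } else {
                        cache.put(rec.key(), rec.value());
                    }
                }
            }
        }
        return cache;
    }
}

Compaction keeps at least the latest record per key, so after the replay the map holds the current value for every live key.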
In the Kafka world you have the Kafka Streams API, which allows you to expose the state of your application, e.g. for your KV use case the latest state of an order, by means of queryable state stores. A state store is an abstraction over a KV database; state stores are implemented with a fast embedded KV database called RocksDB and, in case of disaster, they are fully recoverable because their full data is persisted in a Kafka topic, so they are resilient enough to be the source of data for your use case.
Imagine a Kafka Streams application architecture with several instances, each hosting a shard of the overall state (illustrated in the article referenced below).
To be able to query these Kafka Streams state stores, you need to bundle an HTTP server and REST API into your Kafka Streams application so it can serve lookups against its local or remote state stores (Kafka shards the data of a topic across multiple partitions to enable parallel processing and high availability, and Kafka Streams does the same with its state). Because the Kafka Streams API exposes metadata telling you on which instance a given key resides, you can query any instance and, if the key exists, a response can be returned regardless of which instance actually hosts the key.
With this approach, you kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
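A minimal sketch of the store side of this (topic, store, key and application names are hypothetical; the HTTP layer is omitted):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrdersStateStore {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Materialize a (compacted) input topic as a queryable key-value store.
        builder.table("orders", Materialized.as("orders-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-query-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Once this instance is RUNNING, its local shard of the store can be queried like a KV database.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("order-42"));   // hypothetical key

        // For keys hosted on another instance, streams.queryMetadataForKey(...) tells you which
        // host to forward the request to; that is what the bundled HTTP/REST layer is for.
    }
}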
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent, which provides an even higher-level abstraction on top of Kafka Streams, using a simple SQL dialect to achieve the same sort of use case with pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

How to get "client.id" from Kafka topic?

The situation is the following.
Some of my sinks, which are connected to Kafka, are very sensitive to load.
They are databases, which do not like to be overloaded.
I would like to dynamically set quota values for some topics depending on the overall load on those sinks. I am feeding data into the databases using Kafka Connect and a self-made streaming app based on KStreams.
I know I cannot set a quota on a topic, only on a client.id.
Still, in the end I would prefer to have control over concrete topic(s).
Especially later on, I would like to have a tool (perhaps self-programmed) to close the feedback loop from sink load to Kafka quotas.
The matter is even more complicated when using Streams, as the client.id is extended with a postfix like
StreamThread-1-consumer-f1835e80-e8ae-428a-a40e-2a44aab0e9ae
I have admin access to the topics, so I can "sniff" all messages.
The question is:
How can I get the client.id of a message in a certain topic without asking the developers what they have implemented, or whether they have changed something related to client.id?
Thanks in advance!
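For background on the quota side of this: quotas can be attached to a client.id programmatically via the AdminClient; a minimal sketch (bootstrap server, client.id and rate value are hypothetical):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class ClientQuotaSetter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Quota entities are keyed by client.id (or user); topics themselves cannot carry quotas.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "my-sink-connector"));   // hypothetical client.id
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                    entity,
                    List.of(new ClientQuotaAlteration.Op("consumer_byte_rate", 1_048_576.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}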

Kafka security - secure producers

I have been reading about Kafka security, and it looks like ACLs are the way to go to have a secure Kafka setup. But my questions are:
1) Consumer level: Is there a way to have more granular control over what part of the data can be read from our topics? For example, can we somehow make sure that some part of a Kafka message, say an SSN, is not visible to certain users?
2) Producer level: let's assume we have a topic for voters, and we have different producers writing to that topic with write ACLs on it. How do we prevent, say, a producer from the state of Missouri from recording that a voter voted in Illinois? It has write ACLs on the voters topic, whereas in reality it should not be able to write such a record. Is having a per-state topic the only solution?
Both of your questions are sort of pointing to per-field entitlements/encryption/tokenization. This is not supported by default in core Kafka, but can be implemented at the producers/consumers if you have the knowledge and infrastructure.
For question 1 you'd tokenize/encrypt any fields and only clients with the correct keys could decrypt the payload.
For question 2 you'd encrypt the entire record payload, possibly using Kafka's headers to determine the key pair required to decrypt the data.
In both cases you'll need some key management system to ensure the producers and consumers have access to the keys they require and you'll need to write your own encryption/decryption processes into the clients. I'd implement this in the serializer/deserializer to make it simple.
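As a rough illustration of that serializer approach (the algorithm, key handling and class name are simplified and hypothetical; a real setup would use a KMS, per-field tokenization, a per-record IV, and an authenticated mode such as AES/GCM):

import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.kafka.common.serialization.Serializer;

// Toy example: encrypts the whole record value with a symmetric key before it reaches the broker,
// so only consumers holding the key can read the payload.
public class EncryptingStringSerializer implements Serializer<String> {

    private final SecretKeySpec key;

    public EncryptingStringSerializer(byte[] rawKey) {
        this.key = new SecretKeySpec(rawKey, "AES");   // key distribution handled elsewhere (e.g. a KMS)
    }

    @Override
    public byte[] serialize(String topic, String data) {
        if (data == null) {
            return null;
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding"); // simplified; prefer AES/GCM with an IV
            cipher.init(Cipher.ENCRYPT_MODE, key);
            return cipher.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException("encryption failed", e);
        }
    }
}

An instance of this can be passed directly to the KafkaProducer constructor alongside the key serializer, with a matching decrypting deserializer on the consumer side.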
If you don't want to do all this then separating access to data via ACLs applied to topics is the best approach.

Eliminate redundancy between ktable internal topic and user topic

I'm writing an app for a book library and I have two microservices: memberService and bookService. The memberService creates a KTable of members (built by aggregating change messages on another topic) for its own internal use. The bookService also needs read access to that KTable. Currently I share the data by having the memberService call memberTable.toStream().to("memberTableTopic")
and I have the bookService subscribe to the memberTableTopic.
Based on my understanding of how KTables work, the data in memberTableTopic will be identical to the backing internal topic used by the KTable. Is there a good way to eliminate this redundancy? Should my bookService subscribe to the internal topic?
Yes, the data will look the same in both topics, the internal topic and the user topic.
Conceptually, internal topics are used internally by the Kafka Streams application. That implies that when the application ID is reset, these internal topics are deleted and you lose the data. User topics, on the other hand, exist externally to an application and can be used by any application at any time.
How you handle this depends on how you need to use the data. If you want to reduce the redundancy, you can set a short retention on your internal topics.
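For reference, a minimal sketch of the sharing pattern described in the question (topic and store names are hypothetical):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class MemberTopology {
    public static void build(StreamsBuilder builder) {
        // memberService side: aggregate change events into a table of members.
        KTable<String, String> memberTable = builder
                .<String, String>stream("member-changes")                 // hypothetical change topic
                .groupByKey()
                .reduce((oldValue, newValue) -> newValue,
                        Materialized.as("member-store"));                 // backed by an internal changelog topic

        // Publish the table contents to a regular user topic so other services
        // (e.g. bookService) can consume it without touching the internal topic.
        memberTable.toStream().to("memberTableTopic");
    }
}

The changelog topic behind member-store stays private to the memberService, while memberTableTopic is the stable, public contract that bookService consumes.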