How to get "client.id" from Kafka topic? - apache-kafka

The situation is the following.
Some of my sinks, which are connected to Kafka, are very sensitive to load.
They are DBs, which do not like to be overloaded.
I would like to dynamically set quota values for some topics depending on the overall load on those sinks. I am feeding data into the DBs using Kafka Connect and a self-made streaming app based on KStreams.
I know I cannot set a quota on a topic, only on a client.id.
Still, in the end I would prefer to have control over specific topic(s).
In particular, I would later like to have a tool (perhaps self-programmed) to close the feedback loop from sink load to Kafka quotas.
The matter gets even more complicated with streams, as the client.id is extended with a postfix like
StreamThread-1-consumer-f1835e80-e8ae-428a-a40e-2a44aab0e9ae
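A minimal sketch of how such a postfix could be stripped to recover the base client.id, assuming the suffix always has the layout `-StreamThread-<n>-<role>[-<uuid>]` seen in the example above. The role names in the pattern and the `orders-app` prefix are assumptions for illustration, not something confirmed for every Kafka Streams version.

```python
import re

# Assumed suffix layout, inferred from the example above:
#   <client.id>-StreamThread-<n>-<role>[-<uuid>]
_STREAMS_SUFFIX = re.compile(
    r"-StreamThread-\d+-(?:consumer|restore-consumer|producer)"
    r"(?:-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})?$"
)


def base_client_id(client_id: str) -> str:
    """Return client_id with a trailing Streams thread suffix removed, if any."""
    return _STREAMS_SUFFIX.sub("", client_id)


# "orders-app" is a hypothetical base client.id.
example = "orders-app-StreamThread-1-consumer-f1835e80-e8ae-428a-a40e-2a44aab0e9ae"
print(base_client_id(example))
```

Client ids that do not end with this suffix are returned unchanged, so the helper can be applied to Connect and plain consumer ids as well.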
I have admin access to the topics, so I can "sniff" all messages.
The question is:
How can I get the client.id behind the messages in a certain topic without asking the developers what they implemented or whether they changed anything related to client.id?
Thanks in advance!

Related

Kafka Streams - disable internal topic creation

I work in an organization where we must use the shared Kafka cluster.
Due to internal company policy, the account we use for authentication has only the read/write permissions assigned.
We are not able to request the topic-create permission.
To create the topic we need to follow the onboarding procedure and know the topic name upfront.
As we know, Kafka Streams creates internal topics to persist the stream's state.
Is there a way to disable the fault tolerance and keep the stream state in memory or persist in the file system?
Thank you in advance.
This depends entirely on how you write the topology. For example, stateless DSL operators such as map, filter, and forEach don't create any internal topics.
If you actually need to do aggregation and build state stores, then you really shouldn't disable topics. Yes, state stores are kept either in memory or as RocksDB on disk, but they are still backed by topics so that the state can be distributed, or rebuilt in case of failure.
If you want to prevent them, I think you'll need an authorizer class defined on the broker that can restrict topic creation based, at least, on client side application.id and client.id regex patterns, but there's nothing you can do at the client config.

Streaming audio streams through MQ (scalability)

My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are Docker containers run in k8s). The relation is many-to-many: any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, and each data stream consists of 100 messages of 160b per second (from different producers).
Current solution:
In our current solution, each producer has a cache of task: (IP:PORT) pairs for the consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer, well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ; give your best estimate)?
Edit: Would 1000 streams, which equals 100 000 messages per second, be feasible? (throughput for 1000 streams is 16 Mb/s)
Edit 2: Fixed packet size to 160b (typo)
Unless you need disk persistence, do not even look in the message broker direction. You are just adding one problem to another. Direct network code is the proper way to solve audio broadcast. Now, if your code is messy and you want a simplified programming model, a good alternative to sockets is the ZeroMQ library. It will give you all the message broker functionality you care about, a) discrete messaging instead of streams and b) client discoverability, without going overboard with another software layer.
When it comes to "feasible": 100,000 messages per second at 160 kB per message is a lot of data; it comes to roughly 16 GB/s even without any messaging protocol on top of it. In general, Kafka shines at throughput of small messages, as it batches messages on many layers. Sustained Kafka performance is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large, and you need to both write and read messages at the same time, so I don't see it happening without a large cluster installation: your problem is actual data throughput, not the number of messages.
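As a back-of-the-envelope check of the data rates under discussion, here is a small sketch. It assumes that message sizes are in bytes and uses decimal units (1 MB = 10^6 bytes); it computes the raw payload rate for both the 160-byte and the 160 kB message size, ignoring any protocol overhead.

```python
def throughput_bytes_per_s(streams, msgs_per_stream_per_s, msg_size_bytes):
    """Raw payload rate in bytes per second, without protocol overhead."""
    return streams * msgs_per_stream_per_s * msg_size_bytes


# 1000 streams x 100 messages/s = 100,000 messages/s in both cases.
small = throughput_bytes_per_s(1000, 100, 160)        # 160-byte messages
large = throughput_bytes_per_s(1000, 100, 160_000)    # 160 kB messages

print(f"{small / 1e6:.0f} MB/s")   # 16 MB/s
print(f"{large / 1e9:.0f} GB/s")   # 16 GB/s
```

The message count is identical in both cases; only the payload size moves the problem from "many small messages" to "raw data throughput".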
Because you are data-limited, other classic MQ software such as ActiveMQ, IBM MQ, etc. is actually able to cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot match Kafka's message throughput when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration), you can expect decent performance in MB/s from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk. In contrast, Kafka will always persist to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" just brings us full circle to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need, ZeroMQ is made for this.
There is also a messaging protocol called MQTT, which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From Kafka's perspective, each stream in your problem can map to one topic, so there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messy here too as the number of topics grows.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (like you use IP:Port combination) and then have multiple consumers each subscribing to a specific set of partition(s) as determined by the key. Partitions are the point of scalability in Kafka.
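A minimal sketch of the keying idea, assuming a hypothetical topic with 8 partitions and IP:Port strings as keys. `zlib.crc32` is only a stand-in for the hash used by a real Kafka partitioner, so the partition numbers printed here will not match what a Kafka client would choose; the point is only that the same key always maps to the same partition.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count for the shared topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a stream key to a partition deterministically.

    crc32 is a stand-in hash for illustration, not Kafka's own partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical stream keys in the IP:Port form used by the current solution.
stream_keys = ["10.0.0.5:9000", "10.0.0.6:9000", "10.0.0.7:9001"]
for key in stream_keys:
    print(key, "->", partition_for(key))
```

Because every message of a stream carries the same key, all of its messages land in one partition and keep their relative order there.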
Con: Though you can increase the no. of partitions, you cannot decrease them.
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message is not discarded once it has been received and processed; it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group do a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys; in other words, a set of streams or, as per your current solution, a set of one or more IP:Port pairs.

Is it ok to use Apache Kafka "infinite retention policy" as a base for an Event sourced system with CQRS?

I'm currently evaluating options for designing/implementing Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be, "Can we use the Apache Kafka store as event store for CQRS"?, or more importantly would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, those are:
- Recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic (for example, all commands related to an order's history, which are necessary to reconstruct the entity's instance). This seems to require scanning all of the topic's events and filtering only those matching some entity instance identifier, which is a no-go. [This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record; that is, it is just not possible (without relying on some hacky trick)]
- Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems a common practice to just put a DB with some locking approach (usually optimistic locking) in front, before asynchronously exporting the events to the Kafka queue (I can live with this though; the first problem is much more crucial to me).
- The partition problem: The Kafka documentation mentions that the "order guarantee" exists only within a topic's partition. At the same time, it also says that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an "event store" in an event-sourced system needs the order guarantee, so this means I'm forced to use only 1 partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open, it really comes down to this: Have you used Kafka as your main event store in an event-sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only 1 partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that; it is more about how inconvenient Kafka's capabilities are for rebuilding an entity's state, either by modelling topics as entity instances (where the explosion in the number of topics is undesirable), or by modelling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
your understanding is mostly correct:
kafka has no search. definitely not by key. there's a seek to timestamp, but it's imperfect and not good for what you're trying to do.
kafka actually supports a limited form of transactions (see exactly once) these days, although if you interact with any other system outside of kafka they will be of no use.
the unit of anything in kafka (event ordering, availability, replication) is a partition. there are no guarantees across partitions of the same topic.
all these don't stop applications from using kafka as the source of truth for their state, so long as:
your problem can be "sharded" into topic partitions so you don't care about order of events across partitions
you're willing to "replay" an entire partition as bootstrap if/when you lose your local state.
you use log compacted topics to try and keep a bound on their size (because you will need to replay them to bootstrap, see above point)
both samza and (IIUC) kafka-streams back their state stores with log-compacted kafka topics. internally, kafka stores offset and consumer group management data in a log compacted topic, with brokers holding a "materialized view" in memory - when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.
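The replay and compaction points above can be sketched with a plain in-memory list standing in for one topic partition. This is an illustration under simplifying assumptions, not how the broker implements compaction: records are `(key, value)` pairs, a value of `None` plays the role of a tombstone, and the state being rebuilt is simply "latest value per key". The `order-1` / `order-2` keys are hypothetical.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of survivors."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec[0]] == i]


def replay(log):
    """Rebuild a key -> latest value view by reading the log from the start."""
    state = {}
    for key, value in log:
        if value is None:          # tombstone: the key was deleted
            state.pop(key, None)
        else:
            state[key] = value
    return state


# Hypothetical partition contents, oldest record first.
log = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", None),
    ("order-1", "shipped"),
]

# Replaying the compacted log yields the same view as replaying everything.
assert replay(compact(log)) == replay(log)
print(replay(log))   # {'order-1': 'shipped'}
```

Compaction bounds the replay cost by the number of live keys rather than the number of events, which is also why it only fits state that can be expressed as "latest value per key".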
I have been on several projects that use Kafka as long-term storage. Kafka has no problem with it, especially with the latest versions, which introduced something called tiered storage that makes it possible, in a cloud environment, to move older data to slower/cheaper storage.
And you should not worry that much about transactions. Today there are other concepts for dealing with this, such as Event Sourcing and Bounded Context. Yes, you have to think differently when you are designing your applications; how to do so is explained in this video.
But you are right: your options for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but this will be a key/value database, so you can only ask questions about your data by primary key.
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog about how you can use Akka Projection with Elasticsearch, which you can find here and here.

Kafka: multiple consumers in the same group

Let's say I have a Kafka cluster with several topics spread over several partitions. I also have a cluster of applications that act as clients for Kafka. Each application in that cluster has a client that is subscribed to the same set of topics, which is identical across the whole cluster. All of these clients also share the same Kafka group ID.
Now, speaking of commit mode: I really do not want to specify offsets manually, but I do not want to use autocommit either, because I need to do some handling after I receive my data from Kafka.
With this solution, I expect the "same data received by different consumers" problem to occur, because I do not specify an offset before reading (consuming), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of multiple reads? Several options come to mind:
1) Exclusive (sequential) Kafka access. Until one consumer committed read, no other consumers access Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that, given that a read might fail (and the offset would not be committed); we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend something to achieve behavior I need.
Within a consumer group, every partition is consumed by only one client; another client with the same group ID won't get access to that partition, so concurrent reads won't occur...
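A small sketch of that property, using a simplified round-robin assignment as a stand-in for the assignors Kafka actually uses. The topic name, partition count, and consumer names are hypothetical. Whatever the strategy, the invariant is the same: inside one group, each partition has exactly one owner.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer of the group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical topic "events" with 6 partitions and a group of 4 clients.
partitions = [("events", n) for n in range(6)]
group = ["app-1", "app-2", "app-3", "app-4"]

assignment = assign_round_robin(partitions, group)
for consumer, owned in assignment.items():
    print(consumer, owned)
```

With more consumers than partitions, the extra consumers simply receive nothing; they do not share a partition with another member of the group.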

Correlating in Kafka and dynamic topics

I am building a correlated system using Kafka. Suppose there's a service A that performs data processing, and there are thousands of clients B that submit jobs to it. Bs are short-lived: they appear on the network, push the data to A, and then two important things happen:
B will immediately receive a status from A;
B will then either drop out completely, stay online to receive further status updates, or sporadically pop back on to check the status.
(this is not dissimilar to grid computing or MPI).
Both points should be achieved using a well-known concept of correlationId: B possesses a unique id (UUID in my case), which it sends to A in headers, which, in turn, uses it as Reply-To topic to send status updates to. Which means it has to create topics on the fly, they can't be predetermined.
I have auto.create.topics.enable switched on, and it does create topics dynamically, but existing consumers are not aware of them and need to be restarted [to fetch topic metadata, I suppose, if I understood the docs right]. I also checked the consumer's metadata.max.age.ms setting, but it doesn't seem to help, even if I set it to a very low value.
As far as I've read, this is as yet unanswered, e.g.: kafka filtering/Dynamic topic creation, kafka consumer to dynamically detect topics added, Can a Kafka producer create topics and partitions?, or answered unsatisfactorily.
As there are hundreds of As and thousands of Bs, I can't possibly use shared topics or anything like that, lest I overload my network. I can use Kafka's AdminTools, or whatever it's called, to pre-create topics, but I find it somewhat silly (even though I saw real-life examples of people using it to talk to Zookeeper and the Kafka infrastructure itself).
So the question is: is there a way to dynamically create Kafka topics in a way that makes both the consumer and the producer aware of them without being restarted or anything? And, in the worst case, will AdminTools really help, and on which side must I use it, A or B?
Kafka 0.11, Java 8
UPDATE
Creating topics with AdminClient doesn't help for whatever reason; consumers still throw LEADER_NOT_AVAILABLE when I try to subscribe.
OK, so I'll answer my own question.
1) Creating topics with AdminClient works only if performed before the corresponding consumers are created.
2) I changed the topology I have, taking into account 1) and introducing the exchange of correlation ids in message headers (same as in JMS). I also had to implement certain topology management methodologies, grouping Bs into containers.
3) It should be noted that, as many people have said, this only works when Bs are in single-consumer groups and listen to topics with 1 partition.
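One possible reading of the correlation-id exchange, sketched with in-memory queues standing in for a fixed request topic and a fixed single-partition reply topic. This is not the author's implementation: the function names, the `correlationId` header name, and the `ACCEPTED` status are all assumptions made for illustration.

```python
import uuid
from collections import deque

requests = deque()   # stand-in for a pre-created request topic
replies = deque()    # stand-in for a pre-created, single-partition reply topic


def submit_job(payload):
    """Client B: send a job and remember the correlation id it was sent with."""
    correlation_id = str(uuid.uuid4())
    requests.append({"headers": {"correlationId": correlation_id}, "value": payload})
    return correlation_id


def service_a_poll():
    """Service A: answer every request, copying its headers onto the reply."""
    while requests:
        msg = requests.popleft()
        replies.append({"headers": dict(msg["headers"]), "value": "ACCEPTED"})


def client_b_poll(my_correlation_id):
    """Client B: keep only the replies that carry its own correlation id."""
    return [m["value"] for m in replies
            if m["headers"]["correlationId"] == my_correlation_id]


first = submit_job("job-1")
second = submit_job("job-2")
service_a_poll()
print(client_b_poll(first))   # ['ACCEPTED']
```

The reply destination is fixed up front, so nothing has to be created on the fly; the price is that each B filters out replies that belong to other clients.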
To get some idea of the work I'm doing, you might have a look at the middleware framework I've been working on: https://github.com/ikonkere/magic.
Creating an unbounded number of topics is not recommended. I'd advise redesigning your topology/system.
I've thought of making dynamic topics myself, but then realized that eventually Zookeeper will fail, as it will run out of memory due to stale topics (imagine how many topics could have been created a year from now). Maybe this could work if you make sure you have some upper bound on the number of topics ever created. Overall, an administrative headache.
If you look up using Kafka with request response you will find others also say it is awkward to do so (Does Kafka support request response messaging).