What defines the scope of a Kafka topic

I'm looking to try out Kafka in an existing system, to replace an older message protocol. Currently we have hundreds of message types used to communicate among ~40 applications. Some are sent asynchronously at high rates and some are sent in response to user requests or events.
Looking at Kafka, I see that it breaks things out into topics, partitions, etc., but I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, giving hundreds of topics, or do I cluster related message types together? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I also have a case where there will be up to 10 copies of a single application (a display), each receiving a very large amount of data (in the form of a lightweight video stream of sorts) and sending out user commands from its particular node. Would Kafka be a suitable form of communication for this? Assume at most 10 instances, and that some of them may not want to receive the video stream at all times.
A third and final question: I read a bit about the replayability of messages. Is this only within a single topic, or can replay span several different topics?

Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your data. Note, however, that the default max message size is just 1 MB, so "streaming video/images/media" is arguably the wrong use case for Kafka alone. A protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally (by adding consumers), not automatically in response to load. Consumers poll data at the rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, each consumer can independently seek back to an earlier offset.

Related

Recommended message length in a Kafka topic

I have a list of roughly 400,000 IDs that I need to send to Kafka. I don't know whether the best option is to split the list into n messages of x entries each, or to send it as one message and adjust the size limits as described in this post:
How can I send large messages with Kafka (over 15MB)?
This is a very generic question and it depends on how you want to process it.
If your consumer is capable of processing each ID entry quickly, then you can put a lot of them into a single message.
OTOH, if the processing is slow, it's better to publish more, smaller messages (across multiple partitions). That way, if you use consumer groups, a consumer is less likely to spend too long between polls and be dropped from the group (group membership loss events etc.).
Not to forget, there's also a limit on message size (as you've linked), with a default of around 1 MB.
In other words - you might need to perf-test on your own side, as it's hard to make a decision with only so little data.
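As a starting point for such a test, the "split into n messages" option could be sketched as below. This is only an illustration: the comma-separated encoding, the class name IdBatcher and the 900,000-byte budget (a margin below the roughly 1 MB default) are assumptions, not anything Kafka prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class IdBatcher {
    /**
     * Packs ids into comma-separated payloads so that the UTF-8 size of
     * each payload stays at or below maxBytes.
     */
    static List<String> batch(List<String> ids, int maxBytes) {
        List<String> payloads = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentBytes = 0;
        for (String id : ids) {
            int idBytes = id.getBytes(StandardCharsets.UTF_8).length;
            // A non-empty batch needs one extra byte for the ',' separator.
            int needed = current.isEmpty() ? idBytes : currentBytes + 1 + idBytes;
            if (needed > maxBytes) {
                if (idBytes > maxBytes) {
                    throw new IllegalArgumentException("id exceeds maxBytes: " + id);
                }
                // Close the current payload and start a new one with this id.
                payloads.add(String.join(",", current));
                current.clear();
                needed = idBytes;
            }
            current.add(id);
            currentBytes = needed;
        }
        if (!current.isEmpty()) {
            payloads.add(String.join(",", current));
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 400_000; i++) {
            ids.add(Integer.toString(i));
        }
        // 900_000 is an assumed safety margin under the ~1 MB default limit.
        List<String> payloads = batch(ids, 900_000);
        System.out.println(payloads.size() + " message(s) to send");
    }
}
```

Each returned string would then be sent as one record value. Lowering maxBytes gives more, smaller messages, which is the knob to vary when measuring how the consumer behaves.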

Streaming audio streams through MQ (scalability)

My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are Docker containers run in k8s). The relation is many to many - any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, and each data stream consists of 100 messages of 160b per second (from different producers).
Current solution:
In our current solution, each producer has a cache of a task: (IP: PORT) pair values for consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data and the consumer consumes it? How many streams would be feasible for the MQ to handle (I know it would differ - suggest your best)?
Edit: Would 1000 streams, which equal 100 000 messages per second, be feasible? (throughput for 1000 streams is 16 Mb/s)
Edit 2: Fixed packet size to 160b (typo)
Unless you need disk persistence, do not even look in the message broker direction. You are just adding one problem to another. Direct network code is the proper way to solve audio broadcast. Now, if your code is messy and you want a simplified programming model, a good alternative to raw sockets is the ZeroMQ library. This will give you all the message broker functionality you care about: a) discrete messaging instead of streams, b) client discoverability; without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second at 160 kB per message is a lot of data; it comes to about 16 GB/s even without any messaging protocol on top of it. In general, Kafka shines at high throughput of small messages, as it batches messages at many layers. The sustained performance of Kafka is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large and you need to both write and read them at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput and not the number of messages.
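For reference, the raw payload rate is simply streams × messages per stream per second × bytes per message. The sketch below works this out for both message sizes that appear in this thread; reading "160b" as 160 bytes and "160kb" as 160,000 bytes is an assumption, and the class name is made up for the example.

```java
final class ThroughputEstimate {
    /** Raw payload bytes per second, ignoring any protocol overhead. */
    static long bytesPerSecond(long streams, long messagesPerStreamPerSecond, long bytesPerMessage) {
        return streams * messagesPerStreamPerSecond * bytesPerMessage;
    }

    public static void main(String[] args) {
        // 1000 streams x 100 messages/s = 100 000 messages/s in both cases.
        long smallPackets = bytesPerSecond(1_000, 100, 160);      // assumed 160-byte packets
        long largePackets = bytesPerSecond(1_000, 100, 160_000);  // assumed 160 kB packets

        System.out.println(smallPackets / 1_000_000 + " MB/s");      // prints "16 MB/s"
        System.out.println(largePackets / 1_000_000_000 + " GB/s");  // prints "16 GB/s"
    }
}
```

The three orders of magnitude between the two results are why the packet size matters so much more than the message count when judging feasibility.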
Because you are data limited, even other classic MQ software like ActiveMQ, IBM MQ, etc. is actually able to cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot match Kafka's message throughput when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration), you can expect decent performance in MB/s from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting a disk. In contrast, Kafka will always persist to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" just brings us full circle to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker-in-the-middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need - ZeroMQ is made for this.
There is also a messaging protocol called MQTT, which may be of interest if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From Kafka's perspective, each stream in your problem can map to one topic in Kafka, and therefore there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messy here too as the number of topics grows.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (like you use IP:Port combination) and then have multiple consumers each subscribing to a specific set of partition(s) as determined by the key. Partitions are the point of scalability in Kafka.
Con: Though you can increase the no. of partitions, you cannot decrease them.
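To make the key-to-partition idea concrete, here is a minimal sketch in plain Java. It is not Kafka's own partitioner: Kafka's default hashes the serialized key bytes with murmur2, whereas this stand-in uses String.hashCode() purely for illustration. The class name and the sample IP:Port keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KeyedPartitioning {
    /** Stand-in for a partitioner: the same key always yields the same partition. */
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Groups stream keys by the partition they would land in. */
    static Map<Integer, List<String>> groupByPartition(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Hypothetical stream keys in the IP:Port style mentioned above.
        List<String> streams = Arrays.asList(
            "10.0.0.1:5000", "10.0.0.1:5001", "10.0.0.2:5000", "10.0.0.3:5000");
        groupByPartition(streams, 3).forEach((partition, keys) ->
            System.out.println("partition " + partition + " <- " + keys));
    }
}
```

Because the mapping depends on the partition count, adding partitions later generally moves some keys to different partitions, which is one more reason to choose the count with care.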
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message does not disappear once it has been received and processed; it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group do a common task and will each be assigned a set of partitions determined by the partition assignor. Each consumer will then get a set of keys, in other words a set of streams or, as per your current solution, a set of one or more IP:Port pairs.

Is there a way to prioritize messages in Apache Kafka 2.0?

EDIT
In case anyone else is in this particular situation, I got something akin to what I was looking for after tweaking the consumer configurations. I created a producer that sent the priority messages to three separate topics (for high/med/low priorities), and then I created 3 separate consumers to consume from each. Then I polled the higher priority topics frequently, and didn't poll the lower priorities unless the high was empty:
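// Create the consumers once, outside the loop, so they are reused across polls.
final KafkaConsumer<String, String> highPriConsumer = createConsumer(TOPIC1);
final KafkaConsumer<String, String> medPriConsumer = createConsumer(TOPIC2);
while (true) {
    // Always check the high-priority topic first.
    final ConsumerRecords<String, String> consumerRecordsHigh = highPriConsumer.poll(100);
    if (!consumerRecordsHigh.isEmpty()) {
        // process high pri records
    } else {
        // Only look at medium priority when nothing high-priority was returned.
        final ConsumerRecords<String, String> consumerRecordsMed = medPriConsumer.poll(100);
        if (!consumerRecordsMed.isEmpty()) {
            // process med pri records
        }
        // (the low-priority consumer follows the same pattern)
    }
}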
The poll timeout (the argument to the .poll() method) determines how long to wait if there are no records to poll. I set this to a very short time for each topic, but you can set it lower for the lower priorities to make sure the consumer is not spending valuable cycles waiting while high pri messages are there.
The max.poll.records config determines the maximum number of records returned by a single poll. This could be set higher for the higher priorities as well.
The max.poll.interval.ms config determines the maximum allowed time between polls, i.e. how long processing max.poll.records messages may take before the consumer is considered failed and its partitions are rebalanced. Clarification here.
Also, I believe pausing/resuming an entire consumer/topic can be implemented like this:
kafkaConsumer.pause(kafkaConsumer.assignment());
if (kafkaConsumer.paused().containsAll(kafkaConsumer.assignment())) {
    kafkaConsumer.resume(kafkaConsumer.assignment());
}
I'm not sure if this is the best way, but I couldn't find a good example elsewhere
I agree with senseiwu below that this is not really the correct use for Kafka. This is single-threaded processing, with each topic having a dedicated consumer, but I will work on improving this process from here.
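The selection rule in the loop above ("only look at a lower priority when every higher one is empty") can be tried out without a broker. The sketch below is a stand-in that uses in-memory queues in place of the three topics; the class and method names are made up for the example and are not part of any Kafka API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class StrictPriorityDrain {
    /**
     * Returns the next message from the first non-empty queue,
     * or null when every queue is empty. Queues are ordered high to low.
     */
    static String pollByPriority(List<Queue<String>> queuesHighToLow) {
        for (Queue<String> queue : queuesHighToLow) {
            String next = queue.poll(); // null if this queue is empty
            if (next != null) {
                return next;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Queue<String> high = new ArrayDeque<>(Arrays.asList("h1"));
        Queue<String> med = new ArrayDeque<>(Arrays.asList("m1", "m2"));
        Queue<String> low = new ArrayDeque<>(Arrays.asList("l1"));
        List<Queue<String>> all = Arrays.asList(high, med, low);

        System.out.println(pollByPriority(all)); // h1
        System.out.println(pollByPriority(all)); // m1
        high.add("h2");                          // a high-priority message arrives late
        System.out.println(pollByPriority(all)); // h2, ahead of the waiting m2
        System.out.println(pollByPriority(all)); // m2
        System.out.println(pollByPriority(all)); // l1
    }
}
```

The same run also shows the main drawback of strict priorities: as long as high-priority messages keep arriving, the lower queues are never served.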
Background
We are trying to improve our application and are hoping to use Apache Kafka for messaging between decoupled components. Our system is frequently low-bandwidth (although there are cases where bandwidth can be high for a time), and we have small, high-priority messages that must be processed while larger files wait or are processed slowly to consume less bandwidth. We would like to have topics with different priorities.
I am new to Kafka, but have tried looking into both the Processor API and Kafka Streams with no success, although certain posts on forums seem to be saying this is doable.
Processor API
When I tried the Processor API, I tried to determine whether the High Priority KafkaConsumer was currently processing anything by checking if poll() returned empty, and then hoped to poll() with the Med Priority Consumer, but the poll on the second topic returned empty. There also didn't seem to be an easy way to get all the TopicPartitions of a topic in order to call kafkaConsumer.pause(partitions).
Kafka Streams
When I tried KafkaStreams, I set up a stream to consume from each of my "priority" topics, but there was no way to check if the KStream or KafkaStreams instance connected to the higher-priority topic was currently idle or processing.
I based my code on this file
Other
I also tried the code here: priority-kafka-client, but it didn't work as expected, as running the downloaded test file had mixed priorities.
I found this thread, where one of the developers says (addressing adding priorities for topics): "...a user could implement this behavior with pause and resume". But I was unable to find out how he meant this could work.
I found this StackOverflow article, but they seem to be using a very old version, and I was unclear on how their mapping function was supposed to work.
Conclusion
I would be very grateful if someone would tell me if they think this is something worth pursuing. If this isn't how Apache Kafka is supposed to work, because it disrupts the benefit gained from the automatic topic/partition handling, that's fine, and I will look elsewhere. However, there were so many instances where people seemed to have success with it, that I wanted to try. Thank you.
This sounds like a design issue in your application. Kafka was originally designed as a commit log, where each message is written to the broker with an offset and various consumers consume the messages in the order in which they were committed, with very low latency and high throughput. Given that partitions, not topics, are the fundamental unit of work distribution in Kafka, topic-level priorities would be difficult to achieve natively.
I'd recommend adapting your design to use architectural components other than Kafka, instead of trying to cut your feet to fit the shoes. One thing you could already do is let your producer upload the file to a proper file storage and send the link, together with metadata, via Kafka. Then, depending on the bandwidth status, your consumer could decide from the metadata of the large file whether it is sensible to download it or not. This way you are more likely to end up with a robust design than by using Kafka the wrong way.
If you do want to stick to Kafka only, one solution would be to send large files to a fixed set of hardcoded partitions and have consumers consume from those partitions only when bandwidth is good.

Designing Kafka Topics - Many Topics vs One Big Topic

Considering a stream of different events, would the recommended way be
one big topic containing all events
multiple topics for different types of events
Which option would be better?
I understand that when messages are not in the same partition of a topic there is no ordering guarantee, but are there any other factors to consider when making this decision?
A topic is a logical abstraction and should contain messages of the same type. Let's say you monitor a website and capture click stream events, and on the other hand you have a database that publishes its changes into a changelog topic. You should have two different topics because click stream events are not related to your database changelog.
This has multiple advantages:
your data will have different formats and you will need different (de)serializers to write and read the data (using a single topic you would need a hybrid serializer and you would not get type safety when reading data)
you will have different consumer applications: one application might be interested in click stream events only, while a second application is only interested in the database changelog and a third application is interested in both. If you have multiple topics, applications one and two only subscribe to the topics they are interested in -- if you have a single topic, applications one and two need to read everything and filter out the stuff they are not interested in, increasing broker, network, and client load
As @Matthias J. Sax said above, there is no silver bullet here. But we have to take several aspects into account.
The deciding condition: ordered delivery
If your application needs guaranteed ordering, you need to work with only one topic, and use the same key for those messages whose relative order must be preserved (Kafka only guarantees order within a partition, and messages with the same key land in the same partition).
If ordering is not mandatory, the game starts...
Is the schema the same for all messages?
Would consumers be interested in the same types of events?
What is going to happen on the consumer side? Are we reducing or increasing complexity in terms of implementation, maintainability, error handling...?
Is horizontal scalability important for us? More topics often means more partitions available, which means more capacity for horizontal scaling. It also allows more precise scaling on the broker side, because we can choose how many partitions to add per event type, and on the consumer side, how many consumers to run per event type.
Does it make sense to parallelise consumption per message type?
...
Technically speaking, if we allow consumers to choose precisely which types of events they consume, we potentially reduce the network bandwidth spent sending unwanted messages from the broker to the consumer, as well as the number of deserialisations (CPU used, which over time means more free resources, lower energy cost...).
It is also worth remembering that splitting different types of messages into different topics doesn't mean having to consume them with different Kafka consumers, because a single consumer can consume from several topics at the same time.
Well, there's no clear answer to this question, but my feeling is that with Kafka, given its features, if ordered delivery is not needed we should split our messages by type into different topics.

Pausing Stream Consumption

I am working on an application that processes very few records in a minute. The request rate would be around 2 calls per minute. These requests are create and update made for a set of data. The requirements were delivery guarantee, reliable delivery, ordering guarantee and preventing any loss of messages.
Our team has decided to use Kafka, and I think it does not fit the use case since Kafka is best suited to streaming data. We could have been better off with a traditional messaging model instead. Though Kafka does provide ordering per partition, the same can be achieved on a traditional messaging system if the number of messages is low and the number of data sources is also low. Would that be a fair statement?
We are using Kafka streams for processing the data and the processing requires that we do lookups to external systems. If the external systems are not available then we stop processing and automatically deliver messages to target systems when the external lookup systems are available.
At the moment, we stop processing by looping continuously in the middle of processing a record and checking whether the systems are available.
a) Is that the best way to stop stream midway while processing so that it doesn't pick up any more messages ?
b) Are data stream frameworks even designed to be stopped or paused midway so they stop consuming the stream completely for some time ?
Regarding your point 2:
a) Is that the best way to stop stream midway while processing so that it doesn't pick up any more messages ?
If, as in your case, you have a very low incoming data rate (a few records per minute), then it might be ok to pause processing an input stream when required dependency systems are not available currently.
In Kafka Streams the preferable API to implement such a behavior -- which, as you are alluding to yourself, is not really a recommended pattern -- is the Processor API.
Even so there are a couple of important questions you need to answer yourself, such as:
What is the desired/required behavior of your stream processing application if the external systems are down for extended periods of time?
Could the incoming data rate increase at some point, which could mean that you would need to abandon the pausing approach above?
But again, if pausing is what you want or need to do, then you can give it a try.
b) Are data stream frameworks even designed to be stopped or paused midway so they stop consuming the stream completely for some time ?
Some stream processing tools allow you to do that. Whether it's the best pattern to use them is a different question.
For instance, you could also consider the following alternative: You could automatically ingest the external systems' data into Kafka, too, for example via Kafka's built-in Kafka Connect framework. Then, in Kafka Streams, you could read this exported data into a KTable (think of this KTable as a continuously updated cache of the latest data from your external system), and then perform a stream-table join between your original, low-rate input stream and this KTable. Such stream-table joins are a common (and recommended) pattern to enrich an incoming data stream with side data (disclaimer: I wrote this article); for example, to enrich a stream of user click events with the latest user profile information. One of the advantages of this approach -- compared to your current setup of querying external systems combined with a pausing behavior -- is that your stream processing application would be decoupled from the availability (and scalability) of your external systems.
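The idea behind such a join can be illustrated in plain Java, without the Kafka Streams API: the "table" is a map that always holds the latest value per key, and each incoming stream event is enriched with whatever the table holds at that moment. Everything below (class name, method names, the string format) is a hypothetical analogy, not Kafka Streams code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class LookupTableJoin {
    // Latest known value per key, playing the role of the continuously updated table.
    private final Map<String, String> table = new HashMap<>();

    /** Applies an update from the external system's changelog: the latest value wins. */
    void upsert(String key, String value) {
        table.put(key, value);
    }

    /**
     * Enriches one stream event with the current table value for its key.
     * Returns empty when the table has no entry yet, similar in spirit to an
     * inner join that emits nothing for unmatched events.
     */
    Optional<String> enrich(String key, String event) {
        String sideData = table.get(key);
        return sideData == null
            ? Optional.empty()
            : Optional.of(event + " | " + sideData);
    }

    public static void main(String[] args) {
        LookupTableJoin join = new LookupTableJoin();
        System.out.println(join.enrich("user-1", "click"));  // Optional.empty
        join.upsert("user-1", "profile-v1");
        System.out.println(join.enrich("user-1", "click"));  // Optional[click | profile-v1]
        join.upsert("user-1", "profile-v2");
        System.out.println(join.enrich("user-1", "click"));  // Optional[click | profile-v2]
    }
}
```

The point of the pattern is visible even in this toy version: enrich() never calls the external system, so an outage there only means the table stops receiving updates, not that stream processing has to pause.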
That the same ordering can be achieved on a traditional messaging system is only a fair statement for traditional message brokers when there is a single consumer (i.e. an exclusive queue). As soon as the queue is shared by more than one consumer, out-of-order delivery becomes possible. This is because any one consumer might fail to process and ACK a message, resulting in the message being put back at the head of the shared queue and subsequently delivered (out of order) to another consumer. Kafka guarantees in-order parallel consumption across multiple consumers using topic partitions (which are not present in traditional message brokers): order is preserved within each partition, and each partition is read by only one consumer in a group.