Recommended message length in a Kafka topic - apache-kafka

I have a list of roughly 400,000 ids that I need to send to Kafka. I don't know whether the best option is to split them into n messages with x transactions each, or to send them in one message by adjusting the size limits as described in this post:
How can I send large messages with Kafka (over 15MB)?

This is a very generic question, and the answer depends on how you want to process the data.
If your consumer can process each id entry quickly, you can put a lot of them into a single message.
OTOH, if the processing is slow, it's better to publish more, smaller messages (across multiple partitions) so that, if you use consumer groups, a slow consumer does not time out between polls and lose its group membership.
Not to forget, there's also a limit on message size (as in the post you linked), with a default of around 1 MB.
In other words, you might need to perf-test on your own side, as it's hard to make a decision with so little data.
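A minimal sketch of the "split into n messages" option, assuming the ids are JSON-serializable and each message should stay under a byte budget comfortably below the roughly 1 MB default. The name chunk_ids and the 900,000-byte budget are illustrative choices, not Kafka settings, and the actual producer call is left out.

```python
import json

def chunk_ids(ids, max_bytes=900_000):
    """Yield JSON-encoded batches of ids, each at most max_bytes long."""
    batch, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for i in ids:
        # Encoded length of this id plus the ", " separator (slight overestimate).
        item = len(json.dumps(i).encode("utf-8")) + 2
        if batch and size + item > max_bytes:
            yield json.dumps(batch).encode("utf-8")
            batch, size = [], 2
        batch.append(i)
        size += item
    if batch:
        yield json.dumps(batch).encode("utf-8")

if __name__ == "__main__":
    payloads = list(chunk_ids(range(400_000)))
    print(len(payloads), "messages, largest is", max(map(len, payloads)), "bytes")
```

Each yielded payload would become the value of one Kafka message. Lowering max_bytes gives more, smaller messages, which is the knob to turn while perf-testing.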

Related

How to implement fair scheduling between multiple tenants writing to 1 stream

As of now I have a single Kafka topic with 10 partitions. We have 10000 clients who keep dumping uncontrolled data into the stream. The current problem is:
A tenant floods the topic with little or no notice.
The messages from other tenants then suffer: their handful of messages are queued behind the flood and take several hours to get their turn for processing.
Question:
Can I somehow read maybe 1k messages per tenant and round-robin between tenants, essentially like the fair scheduling of Hadoop YARN?
Can Apache Pulsar help me with this? If yes, is there an example you can point me to?
I went through: https://www.confluent.io/blog/prioritize-messages-in-kafka/ already; but given the volume of clients it may not be practical to have 100k partitions etc.
I'm not aware of any way to get what you want out of the box. You could probably have the consumer pause some partitions to prioritize consumption from the ones with more messages (for example, by checking the lag per partition after every few poll iterations).
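The partition selection described above could look roughly like this. It is only the decision logic, assuming you already have the lag per partition from your consumer client; partitions_to_pause and max_active are hypothetical names, and the returned set would be passed to the consumer's own pause/resume calls.

```python
def partitions_to_pause(lag_by_partition, max_active):
    """Return the partitions to pause so that only the max_active
    partitions with the highest lag keep being consumed."""
    ranked = sorted(lag_by_partition, key=lag_by_partition.get, reverse=True)
    return set(ranked[max_active:])

if __name__ == "__main__":
    lags = {0: 10, 1: 500, 2: 0, 3: 42}  # partition -> messages behind
    print(partitions_to_pause(lags, max_active=2))
```

Re-running this every few poll iterations and resuming everything that is no longer in the returned set keeps the choice up to date as lags change.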
I'm not familiar enough with Apache Pulsar to have a clear answer.
I have a similar problem: a single customer can monopolize the resources and delay execution from all other customers, just because their events arrived first.
On a different application with a low volume of messages, we just load all the events into memory, creating an in-memory queue for every customer. We then dequeue up to N events from each customer queue and enqueue them into a different queue; let's call it the re-ordered queue. The re-ordered queue has a capacity limit (say, 100*N), so no additional elements are queued until there is space. This guarantees equal treatment for all customers.
I am facing the same problem now with an application that processes billions of messages. The solution above is impossible; there is just not enough RAM to keep all the data in memory. Creating a topic for each customer also sounds like overkill, especially if you have a variable set of active customers at any given point in time. Nevertheless, Pulsar seems to handle thousands, even millions, of topics well.
So the technique above may work well for you (and for me).
Just read from thousands of topics, write a limited number of messages to another topic, and then wait for it to have "space" before enqueuing more.
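A small in-memory sketch of the per-customer round-robin described above, assuming one deque per customer. The name fair_drain is illustrative. Because it is a generator, the downstream reader pulls at its own pace, which stands in for the capacity limit of the re-ordered queue.

```python
from collections import deque

def fair_drain(queues, n):
    """Yield (customer, event) pairs, taking at most n events
    from each customer's queue per round."""
    while any(queues.values()):
        for customer, q in queues.items():
            for _ in range(min(n, len(q))):
                yield customer, q.popleft()

if __name__ == "__main__":
    queues = {
        "big": deque(f"big-{i}" for i in range(5)),  # the flooding customer
        "small": deque(["small-0"]),
    }
    for customer, event in fair_drain(queues, n=2):
        print(customer, event)
```

With n=2, the single event of the small customer is emitted after only two events of the big one instead of after all five.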

What defines the scope of a Kafka topic

I'm looking to try out using Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are asynchronous at high rates and some are based upon request from user/events.
Now looking at Kafka, it breaks things out into topics, partitions, etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, giving hundreds of topics, or do I group related message types together? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I'm also in a dilemma where there will be upwards of 10 copies of a single application (a display), all of which receive a very large amount of data (in the form of a lightweight video stream of sorts) and send out user commands on each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10 instances, and that these particular applications may not want to receive the video stream at all times.
A third and final question: I read a bit about replay-ability of messages. Is this only within a single topic, or can the replay-ability go over a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your data. Note, however, that the default max message size is just 1 MB, so "streaming video/images/media" is arguably the wrong use case for Kafka alone. A protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally, not in response to load. Consumers poll data at the rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, each of them can seek independently.

Kafka Streams merging messages

I have a data payload, which is too big for one message. Consider an avro:
record Likes {...}
record Comments {...}
record Post {
  Likes likes;
  Comments comments;
  string body;
}
Assume likes and comments are large collections; if I pass them together with the post, the message will exceed the max message size, which I suppose it would be wrong to increase to 10-20 MB.
I want to split one message into three: post body, comments and likes. However, I want the database insert to be atomic, so I want to group and merge these messages in consumer memory.
Can I do it with kafka-streams?
Can I have a stream without an output topic (as the output message would again exceed the max size)?
If you have any ideas assuming the same inputs (one large message exceeding the configured max message size), please share.
Yes, you can do it with kafka-streams, merging the messages in the datastore, and you can have a stream without an output topic. You need to make sure that the three parts go to the same partition (so that they reach the same instance of the application), which means they should share the same key.
You may also use three topics, one for each object, and then join them (again with the same key).
But generally Kafka is designed to handle a lot of small messages, and it does not work well with large messages. Maybe you should consider sending not the whole info in one message but incremental changes, i.e. only the information that was updated.

KafkaProducer send a list of messages or break list into individual messages

Is it okay to batch 100 messages into a single object and send those objects to Kafka, or should I split those 100 messages into individual messages and then put them in Kafka?
Say, for example, I have an object that contains a List. I can put 100 strings in that list and send the object to Kafka. Is it better to do it that way, or should I split the list and send the individual strings to Kafka instead?
What are some pros and cons of the above approaches?
Batching is always good for async processing, until you need to partially process the batch in case of errors.
If you are processing an order and the list of 100 contains its items, send them together, as they will be processed together. If you are sending 100 orders and will process them independently, send them one by one, as an error in one order should not block the others.
As for message sizes, Kafka has message size limits, but these are configurable.
You definitely need to improve your question.
Say you want to send a huge message that is larger than the max.message.bytes limit configured for your Kafka topic (message.max.bytes at the broker level), and let's assume you can't change it. You break it down and put it back together on the consumer side.
This would require some work around the limitations of your Kafka deployment as it stands. For example:
Should your consumer process all these 100 strings as if they were one batch? When should your consumer commit the offsets for these messages? Is your consumer's processing idempotent? Do you have one consumer or multiple consumer instances? What if the 100 strings were split across 5 partitions? Which consumer gets which subset of these 100 strings?
One approach is to create 100 messages, all with the same batch id, like so:
(batch1:message1, batch1:message2, batch1:message3)
On the consumer side, collect all the messages with the same key:
(batch1: (message1, message2, message3))
But how would you know when the batch ends? Does the sequence message1, message2, message3 matter?
So you do something like this:
(batch1:message1of3, batch1:message2of3, batch1:message3of3)
Now what if you received message1of3 and message2of3 but not message3of3? How long do you wait for it?
As you can see, at each step there are multiple ways to go about this, and you will have to make the choices that are right for your problem. Perhaps you will use timeouts; perhaps in your case batches are interleaved like this:
(batch1:message1of3, batch2:message2of5, batch1:message2of3...)
Expect to make some compromises. With Kafka, your consumer group is guaranteed to receive all messages, and while it's running, each consumer is assigned one or more partitions (meaning a single partition is never assigned to more than one consumer in the group at the same time). Kafka will also assign messages with the same key to the same partition. With these two properties in mind, you can design a system that consumes messages in batches, with some obvious trade-offs and limitations.
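As a rough sketch of the consumer-side collection discussed above, assuming every message carries a batch id, a 1-based part index and the total part count (the "message2of3" idea). BatchAssembler is a hypothetical helper, not part of any Kafka client, and it deliberately leaves out the timeout and offset-commit decisions raised in the questions.

```python
from collections import defaultdict

class BatchAssembler:
    """Collects parts per batch id and releases a batch once it is complete."""

    def __init__(self):
        self._parts = defaultdict(dict)  # batch_id -> {index: payload}

    def add(self, batch_id, index, total, payload):
        """Store one part; return the ordered batch when all parts
        have arrived, otherwise None."""
        parts = self._parts[batch_id]
        parts[index] = payload  # a redelivered part simply overwrites itself
        if len(parts) < total:
            return None
        del self._parts[batch_id]
        return [parts[i] for i in range(1, total + 1)]

if __name__ == "__main__":
    assembler = BatchAssembler()
    # Interleaved batches, as in (batch1:message1of3, batch2:message2of5, ...)
    print(assembler.add("batch1", 1, 3, "message1"))
    print(assembler.add("batch2", 2, 5, "message2"))
    print(assembler.add("batch1", 3, 3, "message3"))
    print(assembler.add("batch1", 2, 3, "message2"))
```

Only the last call returns a list, because batch1 is complete at that point; batch2 stays buffered until its remaining parts arrive or until whatever timeout policy you choose discards it.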

Apache Kafka with 100 million topics

I'm trying to replace RabbitMQ with apache-kafka, and while planning I bumped into several conceptual problems.
First, we are using RabbitMQ with a queue-per-user policy, meaning each user gets one queue. This suits our needs because each user represents some job to be done for that particular user, and if that user causes a problem, it never affects other users because the queues are separated. (A "problem" here means the following: messages in the queue are dispatched to the users via HTTP requests. If a user refuses to receive a message (server down, perhaps?), it goes back into a retry queue, so no messages are lost unless the queue goes down.)
Now, Kafka is fault tolerant and failure safe because it writes to disk.
And that is exactly why I am trying to bring Kafka into our structure.
But there are problems with my plan.
First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)
Second, if I decide to go for topics based on operation and partition by a random hash of the user id, if there is a problem with one user currently not consuming messages, will all the users in the partition have to wait? What would be the best way to structure this situation?
So, in conclusion: 1~5 million users. We do not want one user to block a large number of other users from being processed. Having a topic per user would solve this issue, but it seems there might be an issue with ZooKeeper if such a large number of topics is created. (Is this true?)
What would be the best solution for structuring this, considering scalability?
"First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)"
I would advise against modeling like this.
Google around for "kafka topic limits", and you will find the relevant considerations for this subject. I think you will find you won't want to make millions of topics.
"Second, if I decide to go for topics based on operation and partition by a random hash of the user id"
Yes, have a single topic for these messages and then route those messages based on the relevant field, like user_id or conversation_id. This field can be present as a field on the message and serves as the ProducerRecord key that is used to determine which partition in the topic this message is destined for. I would not include the operation in the topic name, but in the message itself.
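To illustrate how a record key pins a user to one partition, here is a simplified stand-in. It assumes a CRC32 hash purely for illustration; Kafka's default partitioner uses a different hash (murmur2), so the partition numbers will not match a real cluster. partition_for is a hypothetical name.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition: same key, same partition.
    CRC32 is an illustrative stand-in for Kafka's own key hashing."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

if __name__ == "__main__":
    for user_id in ["user-1", "user-2", "user-1"]:
        print(user_id, "->", partition_for(user_id, num_partitions=10))
```

The point is only that the mapping is deterministic, so all messages for one user_id stay in order within one partition, as long as the number of partitions does not change.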
"if there is a problem with one user currently not consuming messages, will all the users in the partition have to wait? What would be the best way to structure this situation?"
This depends on how the users are consuming messages. You could set up a timeout, after which the message is routed to some "failed" topic. Or send messages to users UDP-style, without acks. There are many ways to model this, and it's tough to offer advice without knowing how your consumers are forwarding messages to your clients.
Also, if you are using Kafka Streams, make note of the StreamPartitioner interface. This interface appears in KStream and KTable methods that materialize messages to a topic, and it may be useful in chat applications where you have clients idling on a specific TCP connection.