Under what circumstance will kafka overflow? - apache-kafka

I am testing a POC where data will be sent from 50+ servers over UDP to a Kafka client (using a UDP-to-Kafka bridge). Each server generates ~2000 messages/sec, which adds up to 100K+ messages/sec across all 50+ servers. My questions are:
What will happen if kafka is not able to ingest all those messages?
Will that create a backlog on my 50+ servers?
If I have a kafka cluster, how do I decide which server sends message to which broker?

1) It depends on what you mean by "not able" to ingest. If you mean it's bottlenecked by the network, then you'll likely just get timeouts when trying to produce some of your messages. If you have retries set to anything other than 0 (the default in older client versions), the producer will attempt to resend the message up to that many times. You can also customise request.timeout.ms, which defaults to 30 seconds. If you mean in terms of disk space, the broker will probably crash once the disk fills up.
2) You do this in a roundabout way. When you create a topic with N partitions, the brokers will create those partitions one by one on whichever server has the least usage at that time. This means that if you have 10 brokers and you create a 10-partition topic, each broker should receive one partition. You can specify which partition to send a message to, but you can't specify which broker that partition will be on. It is possible to find out which broker a partition is on after you have created it, so you could build a map of partitions to brokers and use that to choose which broker to send to.
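As a minimal sketch of the partition-to-broker map mentioned above: the (partition, leader broker) pairs below are made-up values for illustration, and in practice they would come from the cluster's topic metadata. The function name is hypothetical as well.

```python
from collections import defaultdict

# Hypothetical metadata for a 4-partition topic: (partition id, leader broker id).
# Real values would be read from the cluster's topic metadata.
partition_leaders = [(0, 101), (1, 102), (2, 103), (3, 101)]


def partitions_by_broker(leaders):
    """Group partition ids under the broker that currently leads them."""
    mapping = defaultdict(list)
    for partition, broker in leaders:
        mapping[broker].append(partition)
    return dict(mapping)


broker_map = partitions_by_broker(partition_leaders)
print(broker_map)
```

To route a message "to a broker", you would pick one of the partitions listed under that broker. The map goes stale whenever leadership moves, so it would need refreshing.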

Related

If the partition a Kafka producer tries to send messages to went offline, can the producer try to send to a different partition?

My Kafka cluster has 5 brokers and the replication factor is 3 for topics. At some time some partitions went offline but eventually they went back online. My questions are:
How many brokers must have been down, given that there were offline partitions? I think that with the cluster setup above I can afford to lose 2 brokers at the same time. However, if 2 brokers were down, some partitions would no longer have a quorum; will these partitions go offline in this case?
If there are offline partitions, and a Kafka producer tries to send messages to them and fails, will the producer try a different partition that may be online? The messages have no key in them.
I'm not sure I understood your question completely, but I have the impression that you are mixing up partitions and replicas. At the least, your question cannot be answered by looking at the producer in isolation. As soon as one broker is down, several things happen on the cluster.
Each TopicPartition has one partition leader, and your clients (e.g. producer and consumer) communicate only with this one leader, independent of the number of replicas.
In the case where two out of five brokers are not available, Kafka will move the partition leader as well as the replicas to a healthy broker. In that scenario you should therefore not get into trouble, although it might take some time and retries for the new leader to be selected and the new replicas to be created on the healthy broker. Leader selection can be fast because you have set the replication factor to three, so even if two brokers go down, one broker should still have the complete data (assuming all replicas were in sync). However, creating two new replicas could take some time, depending on the amount of data. For that scenario you need to look into the topic-level configuration min.insync.replicas and the KafkaProducer configuration acks (see below).
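A rough way to reason about when a partition is unavailable is sketched below. It rests on a simplifying assumption: every replica on a live broker counts as in sync. The function name and the state labels are hypothetical, chosen only for illustration.

```python
def partition_state(replicas, down_brokers, min_insync_replicas=1):
    """Classify one partition given the set of brokers that are down.

    Simplifying assumption: every replica on a live broker is in sync.
    """
    live = [b for b in replicas if b not in down_brokers]
    if not live:
        # No replica is left that could act as leader.
        return "offline"
    if len(live) < min_insync_replicas:
        # A leader exists, but too few replicas remain for acks=all writes.
        return "online, acks=all writes rejected"
    return "online"


# Replication factor 3: the partition's replicas live on brokers 1, 2 and 3.
print(partition_state([1, 2, 3], down_brokers={1, 2}))
print(partition_state([1, 2, 3], down_brokers={1, 2}, min_insync_replicas=2))
print(partition_state([1, 2, 3], down_brokers={1, 2, 3}))
```

Under that assumption, a partition with replication factor 3 only goes fully offline when all three brokers holding its replicas are unavailable. Losing two of them leaves a leader, but a producer using acks=all may still see failures if min.insync.replicas is 2.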
I think the following are the most important configurations for your KafkaProducer to handle such situation:
bootstrap.servers: If you are anticipating regular connection problems with your brokers, you should ensure that you list all five of them. Although it is sufficient to mention only one address (as that broker will then communicate with all other brokers in the cluster), it is safer to have them all listed in case one or even two brokers are not available.
acks: This defaults to 1 and defines the number of acknowledgments the producer requires the partition leader to have received before considering a request as successful. Possible values are 0, 1 and all.
retries: This value defaults to 2147483647 and will cause the client to resend any record whose send fails with a potentially transient error until delivery.timeout.ms is reached.
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
You will find more details in the documentation on the Producer configs.
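The settings above can be collected into a plain dictionary, together with a check of the constraint quoted for delivery.timeout.ms. This is only a sketch: the broker host names are placeholders, the linger.ms value is an arbitrary illustration, and the helper function is hypothetical rather than part of any Kafka client.

```python
# Plain dict keyed by the producer config names discussed above.
# Host names are placeholders; linger.ms is an arbitrary example value.
producer_config = {
    "bootstrap.servers": ",".join(f"broker{i}:9092" for i in range(1, 6)),
    "acks": "all",
    "retries": 2147483647,
    "delivery.timeout.ms": 120000,
    "request.timeout.ms": 30000,
    "linger.ms": 5,
}


def delivery_timeout_is_valid(config):
    """Check delivery.timeout.ms >= request.timeout.ms + linger.ms."""
    minimum = config["request.timeout.ms"] + config["linger.ms"]
    return config["delivery.timeout.ms"] >= minimum


print(delivery_timeout_is_valid(producer_config))
```

Listing all five brokers and using acks=all follows the advice above. Whether acks=all is the right trade-off depends on how much latency you can accept for the extra durability.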

Kafka Producer (with multiple instance) writing to same topic

I have a use case where messages come from a channel and we want to push them into a Kafka topic (multiple partitions). In our case message order is important, so we have to push the messages to the topic in the order they are received, which looks very straightforward if we have only one producer and a single partition. However, for load balancing and scalability we want to run multiple instances of the same producer, and the problem is how to maintain the order of messages.
Any thoughts or solutions would be a great help.
Even if I settle on a single partition, can it be replicated to multiple brokers for availability and fault tolerance?
"we have to push the messages to topic in the order they are received which looks very straight forward if we have only one producer and single partition"
You can have multiple partitions in the topic with one producer and still have the order maintained if you provide a key for your messages. All messages with the same key produced by a single producer are always in order.
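To illustrate why keys preserve order, here is a small sketch of key-based partition selection. It uses CRC32 purely as a stand-in hash, which is an assumption made for illustration and not the hash Kafka's partitioner actually uses. The keys, values and function name are hypothetical.

```python
import zlib


def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration only; not Kafka's real partitioner hash.
    return zlib.crc32(key) % num_partitions


# Hypothetical keyed messages, in the order a single producer sends them.
events = [
    (b"device-a", "a1"),
    (b"device-b", "b1"),
    (b"device-a", "a2"),
    (b"device-a", "a3"),
]

partitions = {}
for key, value in events:
    # Each partition is an append-only list, so arrival order is kept.
    partitions.setdefault(pick_partition(key, 6), []).append(value)

print(partitions)
```

Because a given key always hashes to the same partition, and each partition is appended to in send order, all "device-a" messages stay in their original relative order even though the topic has six partitions.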
When you say multiple producers, I assume that you are having multiple instances of your application running and that you are not creating multiple producers in the same JVM instance.
Since you said channel, I suppose that it is a network channel like Datagram channel, for example. In that case, I suppose that you are listening on some port and sending the received data to Kafka.
I do not see a point in having multiple producers in the same instance producing to the same topic, so it is better to have a single producer send all the messages. For performance, you can tune producer properties such as batch.size and linger.ms.
To achieve fault tolerance, have another instance running in HA mode (fail-over mode), so that if this instance dies the other automatically picks up.
If it is a network channel, you can run multiple instances and open the socket with the option SO_REUSEADDR in StandardSocketOptions. This way only one producer will be active at any point, and a new producer will become active once the active one dies.

Kafka Random Access to Logs

I am trying to implement a way to randomly access messages from Kafka by using KafkaConsumer.assign(partition) and KafkaConsumer.seek(partition, offset), and then polling for a single message.
Yet I can't get past 500 messages per second in this case. In comparison, if I "subscribe" to the partition I get 100,000+ msg/sec (message size about 1000 bytes).
I've tried:
Broker, Zookeeper, Consumer on the same host and on different hosts. (no replication is used)
1 and 15 partitions
default threads configuration in "server.properties" and increased to 20 (io and network)
Single consumer assigned to a different partition each time and one consumer per partition
Single thread to consume and multiple threads to consume (calling multiple different consumers)
Adding two brokers and a new topic with its partitions on both brokers
Starting multiple Kafka Consumer Processes
Changing message sizes: 5k, 50k, 100k
In all cases the minimum I get is ~200 msg/sec, and the maximum is 500 if I use 2-3 threads. Going above that makes the .poll() call take longer and longer (from 3-4 ms on a single thread to 40-50 ms with 10 threads).
My naive understanding of Kafka is that the consumer opens a connection to the broker and sends a request to retrieve a small portion of its log. All of this involves some latency, and retrieving a batch of messages will be much better, but I would expect throughput to scale with the number of receivers involved, at the expense of increased load on both the VM running the consumers and the VM running the broker. Yet both of them are idling.
So apparently there is some synchronization happening on the broker side, but I can't figure out whether it is due to my usage of Kafka or some inherent limitation of using .seek.
I would appreciate some hints on whether I should try something else, or whether this is all I can get.
Kafka is a streaming platform by design, which means a great deal has been developed to accelerate sequential access. Storing messages in batches is just one example. When you simply poll(), you use Kafka in that way and it does its best. Random access is something Kafka was not designed for.
If you want fast random access to distributed big data, you would want something else, for example a distributed DB like Cassandra or an in-memory system like Hazelcast.
You could also transform the Kafka stream into another one that lets you read the data sequentially.
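A back-of-the-envelope calculation shows why per-message seeks cap out at a few hundred messages per second. It assumes each seek-and-poll is one blocking round trip per thread; the 3-4 ms and 40-50 ms latencies are the figures reported in the question, while the batch size of 500 is an arbitrary illustration.

```python
def max_rate(latency_seconds: float, messages_per_poll: int = 1,
             threads: int = 1) -> float:
    """Upper bound on messages/sec when every poll is a blocking round trip."""
    return threads * messages_per_poll / latency_seconds


# One message per 4 ms round trip on a single thread.
print(round(max_rate(0.004)))
# Ten threads, but each poll now takes about 45 ms.
print(round(max_rate(0.045, threads=10)))
# A sequential poll returning a batch of 500 messages (assumed batch size).
print(round(max_rate(0.004, messages_per_poll=500)))
```

Under these assumptions the single-message pattern lands in the same few-hundred range the question reports, and adding threads does not help once per-poll latency grows in proportion. Batching is what moves the number by orders of magnitude.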

Implementation of queues using kafka server

I want to implement a queue mechanism using Kafka, but I could not find out whether it is possible to just peek at data in the queue created for a topic without moving forward in it.
I want to read data from the queue and, depending on different conditions, remove the existing message or add another message to the queue. Also, is it possible to use a single Kafka server from different machines?
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kafka scales with multiple partitions per topic, and it gives no ordering guarantee across partitions. So don't use Kafka if you want strict ordering across a whole multi-partition topic. Within a consumer group, if you want n consumers per topic, you need to have at least n partitions.
Consumers don't remove messages; they commit the offset of a message. The default configuration in most clients is to auto-commit the offset on read. You can re-insert messages into the topic at any time, but you cannot skip a message and expect to process it later.
You can connect as many machines as you want to a kafka server. Typically, you have multiple servers as a kafka cluster, with replication for fault tolerance.
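The offset model described above can be pictured with a toy in-memory log. This is not Kafka client code: the class and method names are hypothetical, and it models a single partition read by a single consumer.

```python
class PartitionLog:
    """Toy model of one partition: an append-only list plus a committed offset."""

    def __init__(self):
        self.records = []
        self.committed = 0  # offset of the next record to process

    def append(self, value):
        """Add a record at the end and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def peek(self):
        """Return the next unprocessed record without committing, or None."""
        if self.committed < len(self.records):
            return self.records[self.committed]
        return None

    def commit(self):
        """Move the committed offset past the current record."""
        if self.committed < len(self.records):
            self.committed += 1


log = PartitionLog()
log.append("job-1")
log.append("job-2")

first = log.peek()       # read without advancing
again = log.peek()       # same record, nothing was removed
log.commit()             # now move past it
log.append(first)        # "re-insert" by appending a new copy at the end
print(first, again, log.peek(), len(log.records))
```

In this model "peeking" is simply reading without committing, and nothing is ever deleted by the reader. Re-processing a message later means appending it again, which matches the point that you cannot skip a message and come back to it.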

Do Kafka consumers have an open connection per partition?

I know that each partition is allocated to one Kafka consumer (inside of a consumer-group), but one Kafka consumer can be consuming multiple partitions at the same time. If each has an open connection to the partition, then I can imagine tens of thousands of connections open per consumer. If this is true, that seems like something to watch out for when deciding on number of partitions, no?
I'm assuming you are asking about the official Java client. Third party clients could do something else.
The KafkaConsumer does not have a network connection per partition. As you hinted, that would not scale very well.
Instead the KafkaConsumer has a connection to each broker/node that is the leader of a partition it is consuming from. Data for partitions that have the same leader is transmitted using the same connection. It also uses an additional connection to the Coordinator for its group. So at worst it can have <# of brokers in the cluster> + 1 connections to the Kafka cluster.
Have a look at NetworkClient.java; you'll see that connections are handled per Node (broker).
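The connection count described above can be estimated from a partition-to-leader map. This is a sketch with made-up broker ids and a hypothetical function name. It counts one connection per distinct leader plus one for the group coordinator, as the answer describes.

```python
def max_consumer_connections(partition_leaders: dict) -> int:
    """One connection per distinct leader broker, plus one for the coordinator."""
    return len(set(partition_leaders.values())) + 1


# Hypothetical assignment: 12 partitions whose leaders are spread over 3 brokers.
assigned = {partition: partition % 3 for partition in range(12)}
print(max_consumer_connections(assigned))
```

The result depends on the number of brokers leading the assigned partitions, not on the number of partitions, so adding partitions does not by itself add connections per consumer.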