Kafka Producer (with multiple instance) writing to same topic - apache-kafka

I have a use case where messages are coming from a channel, which we want to push into a Kafka topic(multiple partitions) . In our case message order is important so we have to push the messages to topic in the order they are received which looks very straight forward if we have only one producer and single partition. In our case, for load balancing and scalability we want to run multiple instances for same producer but the problem is how to maintain order of messages.
Any thought or solution would be great helpful.
Even if I think to have single partition can it replicated to multiple brokers for availability and fault tolerance?

we have to push the messages to topic in the order they are received
which looks very straight forward if we have only one producer and
single partition
You can have multiple partitions in the topic with one producer and still have the order maintained if you provide key for your messages. All messages with the same key produced by a single producer are always in order.
When you say multiple producers, I assume that you are having multiple instances of your application running and that you are not creating multiple producers in the same JVM instance.
Since you said channel, I suppose that it is a network channel like Datagram channel, for example. In that case, I suppose that you are listening on some port and sending the received data to Kafka.
I do not see a point in having multiple producers in the same instance
producing to the same topic, so it is better to have a single producer
send all the messages and for performance you can tune the producer
properties like batch.size, linger.ms etc.
To achieve fault tolerance, have another instance running in HA mode (fail-over mode), so that if this instance dies the other automatically picks up.
If it is a network channel, you can run multiple instances and open
the socket with the option SO_REUSEADDR in
StandardSocketOptions and this way you only one producer will be
active at any point and new producer will become active once the
active one dies.

Related

If I use Kafka as simple message. Does it really worth

=== Assume everything from consumer point of view ===
I was reading couple of Kafka articles and I saw that the number of partitions is coupled to number of micro-service instances.... Ex: If I say 1topic 1partition for my serviceA.. Producer pushes message to topicT1, partitionP1, and from consumerSide(ServiceA1) I can read from t1,p1. If I spin new pod(ServiceA2) to have highThroughput then second instance will never receive any message because Kafka/ZooKeeper assigns id to each Consumer and partition1 is already taken by serviceA1. So serviceA2++ stays idle... To avoid such a hassle Kafka recommends to add more partition, so that number of consumers can be increased/decreased based on need.
I was also able to test through commandLine and service2 never consumed any message. If I shut service1 then service2 was able to pick new message... So if I spin more pod then FailSafe/Availability increases but throughput is same always...
Is my assumption is correct. Am I missing anything. Now I feel like any standard messaging will have the same problem...How to extend message-oriented systems itself.
Every topic has a partition, by default it comes with only one partition if you don't define the partition count value. In your case, you have a consumer group that consists of two consumers. Every consumer read the log from the partition. In your case, first consumer read the log from the first partition(we have the only partition), and for second consumer there will be no partition to the consumer the data so it become idle. Once first consumer gets down then only the second consumer starts reading the data from the first partition from the last committed offset.
Please check below blogs and videos. It explains the topic, consumer, and consumer group in kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this will give you idea about the consumer and consumer group.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) and processing it (interpreting the message). If the consumption is simple enough, being limited to no more instances consuming than there are partitions need not constrain.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
We just started replacing messaging with Kafka.
In a traditional MQ there will be a cluster and 1orMQ will be there inside.
So the MQ cluster/co-ordinator service will deliver the message to clients.
Now there can be 10 services/clients which can consume message from single MQ.
So if there are 10 messages in MQ then each service/consumer/client can read/process 1 message
Now this case is not possible in Kafka which I understood now as per design
To achieve similar functionality in Kafka I have add equal or more number of partition as client/consumer/pods.

Opening Kafka streams dynamically from a queue consumer

We have a use case where based on work item arriving on a worker queue, we would need to use the message metatdata to decide which Kafka topic to stream our data from. We would have maybe less than 100 worker nodes deployed and each worker node can have a configurable number of threads to receive messages from the queue. So if a worker has "n" threads , we would land up opening maybe kafka streams to "n" different topics. (n is usually less than 10).
Once the worker is done processing the message, we would need to close the stream also.
The worker can receive the next messsage once its acked the first message and at which point , I need to open a kafka stream for another topic.
Also every kafka stream needs to scan all the partitions(around 5-10) for the topic to filter by a certain attribute.
Can a flow like this work for Kafka streams or is this not an optimal approach?
I am not sure if I fully understand the use case, but it seem to be a "simple" copy data from topic A to topic B use case, ie, no data processing/modification. The logic to copy data from input to output topic seems complex though, and thus using Kafka Streams (ie, Kafka's stream processing library) might not be the best fit, as you need more flexibility.
However, using plain KafkaConsumers and KafkaProducers should allow you to implement what you want.

How does a Kafka Consumer behave if a Producer goes down. What happens to the data in the interval when the producer goes down

I just want to know how the Consumer is able to consume data when the producer is down. Let's say Producer keeps sending logs to the consumer at a steady rate and then the producer goes down from 8AM- 6PM. How does the consumer work in such a case and is there a way that the consumer can get the data that would have been sent during 8am - 6pm if the producer was up.
In Apache Kafka there is no relationship between how producer and consumer behaves.
Acting as a messaging system, Kafka allows to decoupling producer from a consumer providing an asynchronous communication channel.
The producer can send messages at its own pace and the consumer can read these messages in real time or later at its own pace (different from the producer one).
The messages are saved in a topic living in the Kafka cluster, and each message has a position in the topic partition (offset).
Of course, it's possible to tune when messages are deleted from the topic if the consumer isn't online for long time reading the messages.
You can set to store messages for very long time (days, weeks, months) and after that they will be deleted; or you can set to store messages based on time (so deleting the ones older than a time).
Furthermore, the consumer is also able to rewind the stream of messages in the topic, actually re-reading the messages if needed.
Finally, the consumer can also seek to a specific position in the topic partition based on offset or specifiying a time.
The Kafka doc has a nice diagram which I copied below. It shows the novelty of Kafka in a succinct way.
Without Kafka, the situation is something like this. We have multiple servers, e.g. Frontend servers, DB servers, Chat servers etc. On the other side, we have probably different metrics and monitoring tools (e.g. DB monitor, UI monitor etc.). Direct one-to-one communications between different servers and collectors might work out for smaller systems, but it breaks down pretty quickly after the system has surpassed a a certain threshold, in terms of scalability. Kafka solves this problem by decoupling the senders and receivers. Both of them talk through the Kafka brokers instead of talking to each other.
So, in your case the consumer would simply ask the broker if there's any new data on the topic it's subscribing to. As the producer is down, and assuming there is no data in the queue, broker would reply, there's nothing to be consumed.. So, the consumer would be perpetually polling in a fixed interval, in an endless loop and do nothing. Whenever the producer comes up and starts pumping out data, consumer would start receiving (and processing) it. There are more involved use cases when you might be losing data if retention period for particular topic is over, and the consumer hasn't processed the backlog. But I don't think that's a concern for you at this point of your journey.

Under what circumstance will kafka overflow?

I am testing a POC where data will be sent from 50+ servers from UDP port to kafka client (Will use UDP-kafka bridge). Each servers generates ~2000 messages/sec and that adds up to 100K+ messages/sec for all 50+ servers. My question here is
What will happen if kafka is not able to ingest all those messages?
Will that create a backlog on my 50+ servers?
If I have a kafka cluster, how do I decide which server sends message to which broker?
1) It depends what you mean by "not able" to ingest. If you mean it's bottlenecked by the network, then you'll likely just get timeouts when trying to produce some of your messages. If you have retries (documented here) set up to anything other than the default (0) then it will attempt to send the message that many times. You can also customise the request.timeout.ms, the default value of this being 30 seconds. If you mean in terms of disk space, the broker will probably crash.
2) You do this in a roundabout way. When you create a topic with N partitions, the brokers will create those partitions on whichever server has the least usage at that time, one by one. This means that, if you have 10 brokers, and you create a 10 partition topic, each broker should receive one partition. Now, you can specify which partition to send a message to, but you can't specify which broker that partition will be on. It's possible to find out which broker a partition is on after you have created it, so you could create a map of partitions to brokers and use that to choose which broker to send to.

Implementation of queues using kafka server

I want to implement a queue mechanism using kafka. But could not find anywhere that if it's possible to just peek data from the queue created for any topic without moving forward into it.
I want to read data from the queue and on the basis of different conditions want to remove the existing message or add another message into this queue. Also, is it possible to use a single kafka server from different machines.
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kakfa scales with multiple partitions per topic, and it doesn't give any ordering guarantee between partitions. So don't use kafka if you want strict ordering. Within a consumer group, if you want n consumers per topic, you need to have atleast n partitions.
Consumers don't remove messages, they commit the offset of a message. Default configuration in most clients is to auto commit offset on read. You can re-insert messages into the topic anytime. But you cannot skip a message and expect to process it later.
You can connect as many machines as you want to a kafka server. Typically, you have multiple servers as a kafka cluster, with replication for fault tolerance.