Load Balancing in an ActiveMQ Artemis Cluster

Let's say that I have 3 ActiveMQ Artemis brokers in one cluster:
Broker_01
Broker_02
Broker_03
At a given point in time I have a number of consumers on each broker:
Broker_01 has 50 consumers
Broker_02 has 10 consumers
Broker_03 has 10 consumers
Let's assume that at this point in time there are 70 messages to be sent to a queue in this cluster.
We expect the cluster to balance the load so that Broker_01 receives 50 messages, Broker_02 receives 10, and Broker_03 receives 10, but what we actually see is the 70 messages distributed randomly across all 3 brokers.
Is there any configuration I can apply to distribute the messages based on the number of consumers on each broker?
I just read the documentation. As far as I understand, ActiveMQ Artemis load balances messages round-robin if a cluster connection is configured. Our broker.xml looks like this:
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>amq-v01_connector</connector-ref>
      <min-large-message-size>524288</min-large-message-size>
      <call-timeout>120000</call-timeout>
      <retry-interval>500</retry-interval>
      <retry-interval-multiplier>1.5</retry-interval-multiplier>
      <max-retry-interval>2000</max-retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <notification-interval>800</notification-interval>
      <notification-attempts>2</notification-attempts>
      <static-connectors>
         <connector-ref>amq-v02_connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
Further, the address-setting for the queue looks like this:
<address-setting match="MyQueue">
   <address-full-policy>BLOCK</address-full-policy>
   <max-size-bytes>50Mb</max-size-bytes>
</address-setting>
Am I missing something that's needed for load balancing to work?
The next point is that, as stated in the documentation, load balancing is always done round-robin; there is no configuration option to balance the load based on the number of consumers on each node.
I assume that I need client-side connection load balancing, since we want to balance the messages arriving at the 3 brokers according to the number of consumers on each broker. As stated in the documentation, there are 5 out-of-the-box policies (Round-Robin, First Element, etc.) which we can use. Additionally, we could implement our own policy by implementing ConnectionLoadBalancingPolicy. Assuming that I would like to implement my own policy, how would I go about doing this according to the number of consumers?

There is no out-of-the-box way for the producer to know how many consumers are on each broker and then send messages to those brokers accordingly.
It is possible for you to implement your own ConnectionLoadBalancingPolicy. However, in order to determine how many consumers exist on the queue, the load-balancing policy implementation would need to know the URLs of all the brokers in the cluster as well as the name of the queue to which you're sending messages, and there's no way to supply that information. The ConnectionLoadBalancingPolicy interface is very simple.
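For reference, here is a minimal sketch of a custom policy (the class name is made up; note that select() only receives the number of configured connectors, which is exactly why consumer counts can't factor into the decision):

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.activemq.artemis.api.core.client.loadbalance.ConnectionLoadBalancingPolicy;

// Hypothetical example: a thread-safe round-robin policy. The only input
// is `max`, the number of configured connectors; there is no hook for
// queue names, broker URLs, or consumer counts.
public class MyRoundRobinPolicy implements ConnectionLoadBalancingPolicy {

   private final AtomicInteger pos = new AtomicInteger(0);

   @Override
   public int select(int max) {
      // Cycle through connector indices 0..max-1.
      return pos.getAndUpdate(i -> (i + 1) % max);
   }
}

The client would then select the policy on its connection factory, e.g. via the connectionLoadBalancingPolicyClassName URL parameter.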
I would encourage you to revisit your need for a 3-node cluster in the first place if each node is going to have so few messages on it. A single broker can handle a throughput of millions of messages in certain use cases. If each node is dealing with fewer than 50 messages then you probably don't need a cluster at all.

Related

Maximum size of a production Kafka cluster deployment

I am considering how to deploy our Kafka cluster: one big cluster with several broker groups, or several clusters. If one big cluster, I want to know how big a Kafka cluster can be. Kafka has a controller node and I don't know how many brokers it can support. Another question is about the __consumer_offsets topic: how big can it be, and can we add more partitions to it?
I've personally worked with production Kafka clusters anywhere from 3 brokers to 20 brokers. They've all worked fine; it just depends on the kind of workload you're throwing at them. With Kafka, my general recommendation is that it's better to have a smaller number of larger, more powerful brokers than a bunch of tiny servers.
For a standing cluster, each broker you add increases "crosstalk" between the nodes, since they have to move partitions around, replicate data, and keep metadata in sync. This additional network chatter can impact how much load each broker can handle. As a general rule, adding brokers will add overall capacity, but you have to shift partitions around so that the load is balanced properly across the entire cluster. Because of that, it's much better to start with 10 nodes, so that topics and partitions are spread out evenly from the beginning, than to start with 6 nodes and add 4 more later.
Regardless of the size of the cluster, there is always only one controller node at a time. If that node happens to go down, another node will take over as controller, but only one can be active at a given time, assuming the cluster is not in an unstable state.
The __consumer_offsets topic can have as many partitions as you want, but it defaults to 50. Since this is a compacted topic, assuming there is no excessive committing happening (this has happened to me twice already in production environments), the default settings should be enough for almost any scenario. You can look up the configuration settings for the consumer offsets topic by looking for broker properties that start with offsets. in the official Kafka documentation.
You can get more details at the official Kafka docs page: https://kafka.apache.org/documentation/
The size of a cluster can be determined in the following ways.
The most accurate way to model your use case is to simulate the load you expect on your own hardware. You can use the Kafka load-generation tools kafka-producer-perf-test and kafka-consumer-perf-test.
Based on the producer and consumer metrics, we can decide the number of brokers for our cluster.
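For reference, the invocations look roughly like this (topic name and bootstrap address are placeholders, and the exact flags vary slightly between Kafka versions):

kafka-producer-perf-test --topic perf-test --num-records 1000000 --record-size 500 --throughput -1 --producer-props bootstrap.servers=broker1:9092
kafka-consumer-perf-test --topic perf-test --messages 1000000 --bootstrap-server broker1:9092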
The other approach, without simulation, is based on the estimated rate at which you receive data and the required data retention period.
From that we can calculate the throughput, and based on the throughput we can decide the number of brokers in our cluster.
Example
If you have 800 messages per second, of 500 bytes each, then your throughput is 800 * 500 / (1024 * 1024) ≈ 0.4 MB/s. Now, if your topic is partitioned and you have 3 brokers up and running with 3 replicas, each broker handles 0.4 / 3 * 3 = 0.4 MB/s of write throughput: the load is divided across the 3 brokers but multiplied by the replication factor of 3.
More details about the architecture are available from Confluent.
Within a Kafka cluster, a single broker acts as the controller. If you have a cluster of 100 brokers, one of them will be the controller.
Internally, each broker tries to create an ephemeral node in ZooKeeper at /controller. The first one succeeds and becomes the controller; the other brokers get a "node already exists" exception and set a watch on the controller node. When the controller dies, the ephemeral node is removed and the watching brokers are notified so a new controller can be elected.
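To illustrate the pattern, here is a minimal sketch against the plain ZooKeeper client API (not Kafka's actual controller code; the connection string and broker ID are placeholders):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElection {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { }); // placeholder address
        try {
            // The first broker to create the ephemeral node wins the election.
            zk.create("/controller", "broker-1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);
            System.out.println("I am the controller");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is the controller: watch the node so we are notified
            // when it disappears and a new election can begin.
            zk.exists("/controller", event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.out.println("Controller gone; re-run the election");
                }
            });
        }
    }
}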
The functionality of the controller can be found here.
The __consumer_offsets topic is used to store the offsets committed by consumers. It defaults to 50 partitions, but it can be configured with more. To change it, set the offsets.topic.num.partitions property.
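For example, in each broker's server.properties (the value here is illustrative; note that this only takes effect when the __consumer_offsets topic is first auto-created, it does not resize an existing topic):

offsets.topic.num.partitions=100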

Kafka: Can consumers read messages before all replicas are in sync?

I'm designing an event-driven distributed system.
One of the event types we need to distribute needs:
1- Low latency
2- High availability
Durability of the message and consistency between replicas are not that important for this event type.
Reading the Kafka documentation, it seems that consumers need to wait until all in-sync replicas for a partition have applied a message to their log before they can read it from any replica.
Is my understanding correct? If so, is there a way around it?
If configured improperly, consumers can read data that has not yet been written to every replica.
As per the book:
Data is only available to consumers after it has been committed to Kafka, meaning it was written to all in-sync replicas.
If you have configured min.insync.replicas=1, then Kafka will not wait for the replicas to catch up before serving the data to consumers.
The recommended min.insync.replicas depends on the type of application. If you don't care about the data, it can be 1; if it's a critical piece of information, you should configure it to be greater than 1.
There are 2 things you should think about:
Is it alright if the producer's message doesn't reach Kafka? (fire-and-forget strategy with acks=0)
Is it alright if the consumer misses a message? (with min.insync.replicas=1, if a broker goes down you may lose some data)
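A minimal producer sketch for the fire-and-forget trade-off described above (broker address and topic name are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FireAndForgetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=0: fire and forget, lowest latency, no delivery guarantee.
        // acks=1: leader acknowledges; acks=all: wait for all in-sync replicas.
        props.put(ProducerConfig.ACKS_CONFIG, "0");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
        }
    }
}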

How many Partitions can we have in Kafka?

I have a requirement in my IoT project: a custom Java application called "NorthBound" (NB) can manage at most 3000 devices. Devices send data to "SouthBound" (SB, also a Java application), SB sends the data to Kafka, and NB consumes the messages from Kafka.
To manage around 100K devices, I am planning to start multiple instances (around 35) of NorthBound, but I want the same instance to always receive the messages from the same devices, e.g. Device1 sends data to NB_instance1, Device2 sends data to NB_instance2, etc.
To handle this, I am thinking of creating 35 partitions of the same topic (Device-Messages) so that each NB instance consumes one partition and a given device's data always goes to the same NB instance. Is this the right approach, or is there a better way?
How many partitions can we have in a Kafka cluster? And what is a recommended value considering 3 nodes (brokers) in a cluster?
Currently we have only 1 Kafka node. Can we continue with a single node and 35 partitions?
Say on startup I might have only 5-6K devices; then I will have only 2 partitions with 2 NB instances. Gradually, as we add more devices, we will keep adding more partitions and NB instances. Can we do that without restarting Kafka? Is it possible to create partitions dynamically?
As you can imagine, the number of partitions you can have depends on a number of factors.
Assuming you have recent hardware, since Kafka 1.1 you can have thousands of partitions per broker. Moreover, Kafka has been tested with over 100,000 partitions in a cluster.
As a rule of thumb, it's recommended to over-partition a bit in order to allow for future growth in traffic/usage. Kafka allows you to add partitions at runtime, but that will change the partitioning of keyed messages, which can be an issue depending on your use case (see the sketch below).
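To make the keyed-partitioning point concrete, a hypothetical sketch (the topic name comes from the question; an already-configured producer is assumed):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class DeviceSender {
    // Assumes a KafkaProducer configured with String serializers.
    static void send(KafkaProducer<String, String> producer, String deviceId, String payload) {
        // The default partitioner hashes the record key (murmur2(key) % numPartitions),
        // so every message from a given device lands on the same partition -- and
        // therefore the same NB instance -- until the partition count changes.
        producer.send(new ProducerRecord<>("Device-Messages", deviceId, payload));
    }
}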
Finally, it's not recommended to run a single broker for production workloads: if it were to crash or fail, you'd be exposed to an outage and possibly data loss. It's best to have at least 2 brokers with a replication factor of 2, even with only 35 partitions.

Under what circumstances will Kafka overflow?

I am testing a POC where data will be sent from 50+ servers over UDP to a Kafka client (using a UDP-to-Kafka bridge). Each server generates ~2000 messages/sec, which adds up to 100K+ messages/sec across all 50+ servers. My questions are:
What will happen if Kafka is not able to ingest all those messages?
Will that create a backlog on my 50+ servers?
If I have a Kafka cluster, how do I decide which server sends messages to which broker?
1) It depends on what you mean by "not able" to ingest. If you mean it's bottlenecked by the network, then you'll likely just get timeouts when trying to produce some of your messages. If you have retries (documented here) set to anything other than the default (0), the producer will attempt to send each message that many times. You can also customise request.timeout.ms, whose default value is 30 seconds. If you mean in terms of disk space, the broker will probably crash.
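Concretely, in the producer configuration those two settings look like this (values here are illustrative, assuming a Properties object like the one in the earlier producer sketch):

props.put(ProducerConfig.RETRIES_CONFIG, 3);                // retry each failed send up to 3 times
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000); // the default, 30 seconds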
2) You do this in a roundabout way. When you create a topic with N partitions, the brokers will create those partitions on whichever servers have the least usage at that time, one by one. This means that if you have 10 brokers and you create a 10-partition topic, each broker should receive one partition. You can specify which partition to send a message to, but you can't specify which broker that partition will end up on. It is, however, possible to find out which broker a partition is on after the topic has been created, so you could build a map of partitions to brokers and use it to choose which broker to send to.
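A sketch of that lookup using the Kafka AdminClient (broker address and topic name are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionToBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic")) // hypothetical topic
                    .all().get().get("my-topic");
            // Print which broker currently leads each partition.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d -> leader broker %d%n", p.partition(), p.leader().id()));
        }
    }
}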

How to setup Apache Kafka consumer to get data across internet?

I've configured 2 Kafka instances with basic settings on two different servers across the internet, one in the UK and one in India. My scenario is very simple: the UK side is the publisher and India is the consumer, but neither can get any data.
I've checked my firewalls; there is no port blocking or anything like that. I've also tested my scenario with Redis pub/sub and it worked, but I wasn't successful with Kafka.
How should I set up my Kafka instances to do this? Or is it even possible with Kafka?
Kafka is not recommended when you want to interact across multiple data centres. Kafka is designed to give you high throughput when you are producing and consuming in the same data centre, where network latency is minimal.
Why?
With consumers in a different data centre, latency affects all the coordination Kafka does with consumers (group rebalancing, offset commits, heartbeats). With producers in a different data centre, the latency of waiting for acks on each send is considerable, slowing down the rate at which you can produce messages.
So, in theory, you can very well have the setup if your network is reliable.
Now, if you are thinking of distributing the Kafka brokers themselves among data centres, that will be more costly: all inter-broker communication will be delayed, effectively creating replica lag, lots of network calls over the internet, broker heartbeat timeouts, etc. Again, it's theoretically feasible.
In practice, for these scenarios it is better to have a local Kafka cluster in each DC, where locally hosted applications produce/consume messages, and to use MirrorMaker to replicate messages between data centres.
Found the solution:
In the Kafka broker configuration (server.properties), add the following line:
advertised.listeners=PLAINTEXT://xxx.xxx.xxx.xxx:pppp
# x = your public IP
# p = your port
Clients connect to the address the broker advertises in its metadata, so it must be reachable over the internet.