Consuming # subscription via MQTT causes huge queue on lost connection - activemq-artemis

I am using Artemis 2.14 and Java 14.0.2 on two Ubuntu 18.04 VMs with 4 cores and 16 GB RAM. My producers send approximately 2,000 messages per second to approximately 5,500 different topics.
When I connect via the MQTT.FX client with certificate-based authorization and subscribe to #, the MQTT.FX client dies after some time, and in the web console I see a queue under # with my client ID that is never cleared by Artemis. This queue appears to grow until RAM is 100% used, and after some time my Artemis broker restarts itself.
Is this behaviour of Artemis normal? How can I tell Artemis to clean up "zombie" queues after some time?
I have already tried the following configuration parameters in various ways, but nothing works:
confirmationWindowSize=0
clientFailureCheckPeriod=30000
consumerWindowSize=0

Auto-created queues are automatically removed by default when:
consumer-count is 0
message-count is 0
This is done so that no messages are inadvertently deleted.
However, you can change the corresponding auto-delete-queues-message-count address-setting in broker.xml to -1 to skip the message-count check. You can also adjust the auto-delete-queues-delay setting to configure a delay if needed.
It's worth noting that if you create a subscription like # (which is fairly dangerous) you need to be prepared to consume the messages as quickly as they are produced to avoid accumulation of messages in the queue. If accumulation is unavoidable then you should configure the max-size-bytes and address-full-policy according to your needs so the broker doesn't get overwhelmed. See the documentation for more details on that.
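For illustration, a broker.xml address-setting along these lines would combine those auto-delete tweaks with paging (the match and the concrete limits are placeholders to adapt to your addresses and available disk/memory):

<address-settings>
   <!-- match="#" applies this setting to every address; use a narrower match if needed -->
   <address-setting match="#">
      <!-- delete auto-created queues even if they still hold messages -->
      <auto-delete-queues-message-count>-1</auto-delete-queues-message-count>
      <!-- wait 5 minutes after the queue becomes unused before deleting it -->
      <auto-delete-queues-delay>300000</auto-delete-queues-delay>
      <!-- page messages to disk once this address holds ~100 MB -->
      <max-size-bytes>104857600</max-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
   </address-setting>
</address-settings>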

Related

Messages are stuck in ActiveMQ Artemis cluster queues

We have a problem with Apache ActiveMQ Artemis cluster queues. Sometimes messages begin to pile up in particular cluster queues. It usually happens 1-4 times per day, mostly on production (it has happened only once on a test environment in the last 90 days).
These messages are not delivered to consumers on other cluster brokers until we restart the cluster connector (or the entire broker).
The problem looks related to ARTEMIS-3809.
Our setup is: 6 servers in one environment (3 pairs of master/backup servers). Operating system is Linux (Red Hat).
We have tried to:
upgrade from 2.22.0 to 2.23.1
increase minLargeMessageSize on the cluster connectors to 1024000
The messages still get stuck in the cluster queues.
Another problem: I tried to configure min-large-message-size on the cluster-connection as described in the documentation, but it caused errors at startup (broker.xml did not pass XSD validation), so the only option was to specify minLargeMessageSize in the URL parameters of the connector for each cluster broker, as shown below. I don't know whether this setting has any effect.
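For reference, the URL-parameter workaround described above looks roughly like this in broker.xml (host name and port are placeholders; whether the cluster bridge actually honours the parameter is exactly what is unclear):

<connectors>
   <!-- minLargeMessageSize passed as a URL parameter because the
        min-large-message-size element did not pass XSD validation -->
   <connector name="node1-cluster">tcp://node1.example.com:61616?minLargeMessageSize=1024000</connector>
</connectors>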
So we had to write a script that checks whether messages are stuck in the cluster queues and restarts the cluster connector.
How can we debug this situation?
When the messages are stuck, nothing suspicious is written to the log (no errors, no stack traces, etc.).
For which classes should we enable debug or trace level logging to find out what is happening with the cluster connectors?
I believe you can remedy the situation by setting this on your cluster-connection:
<producer-window-size>-1</producer-window-size>
See ARTEMIS-3805 for more details.
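For context, that element goes inside the cluster-connection definition in broker.xml. A rough sketch (the connection name, connector names and discovery mechanism here are placeholders, not your actual configuration):

<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>node0-connector</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- -1 disables flow control on the cluster bridge's internal producer -->
      <producer-window-size>-1</producer-window-size>
      <static-connectors>
         <connector-ref>node1-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>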
Generally speaking, moving messages around the cluster via the cluster-connection, while convenient, isn't terribly efficient (much less so for "large" messages). Ideally you would have a sufficient number of clients on each node to consume the messages that were originally produced there. If you don't have that many clients then you may want to re-evaluate the size of your cluster, as it may actually decrease overall message throughput rather than increase it.
If you're just using 3 HA pairs in order to establish a quorum for replication then you should investigate the recently added pluggable quorum voting, which allows integration with a 3rd-party component (e.g. ZooKeeper) for leader election, eliminating the need for a quorum of brokers.
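A very rough sketch of what that looks like in broker.xml; the element layout and the manager class name below are assumptions to verify against the documentation for your Artemis version, and the ZooKeeper connect string is a placeholder:

<ha-policy>
   <replication>
      <primary>
         <manager>
            <!-- assumed class name; check the pluggable quorum docs for your release -->
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>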

Distributed Jmeter max number of servers

I use distributed JMeter for load tests in Kubernetes. To maximize the number of concurrent threads I use several JMeter server instances.
My test plan has 2,000 users, so with 5 JMeter servers I have 10,000 concurrent users. Each user sends 1 request per second to Kafka. This runs without any problems.
But if I increase the number of server instances to 10, JMeter gets a lot of errors when sending requests and cannot sustain the required request rate.
Is there a way to use more than 5 server instances in JMeter (my cluster has 24 vCPUs and 192 GB RAM)?
The theoretical maximum number of slaves is very high: you can have as many as 2,147,483,647 slaves, which is a little bit more than 5.
So my expectation is that the problem is somewhere else, e.g. a maximum number of connections per IP is defined, or the broker is running out of resources.
We cannot give meaningful advice unless we have more details; in the meantime you can check:
jmeter.log files for master and slaves
response data
kafka logs
resource consumption on the Kafka broker(s) side; this can be done using e.g. the JMeter PerfMon Plugin

During rolling upgrade/restart, how to detect when a kafka broker is "done"?

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.
Another way would be to ask what a good "readiness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...
I would suggest exposing JMX metrics and tracking the following for cluster health:
the controller count (must be 1 across the whole cluster)
under-replicated partitions (should be zero for a healthy cluster)
unclean leader elections (if you don't disable this in server.properties, make sure there are none in the metric counts)
ISR shrinks within a reasonable time window, e.g. 10 minutes (should be none)
Also, Yelp has tooling for rolling restarts implemented in Python, which requires Jolokia JMX agents installed on the brokers; it polls the metrics to make sure some of the above conditions are true.
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
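As a sketch of that check, something along these lines with the Kafka AdminClient could be run between restarts (the bootstrap address is a placeholder):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe every topic and count partitions whose ISR is smaller
            // than the replica set, i.e. under-replicated partitions.
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();
            long urp = descriptions.values().stream()
                    .flatMap(d -> d.partitions().stream())
                    .filter(p -> p.isr().size() < p.replicas().size())
                    .count();
            System.out.println("Under-replicated partitions: " + urp);
            // Only proceed with the next broker restart once this reports zero.
        }
    }
}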
As the previous responders mentioned, there is existing code out there to automate this. I don't use Jolokia myself, but my solution (which I'm working on now) also uses JMX metrics.
Kafka-Utils by Yelp is one of the best tools for detecting when a Kafka broker is "done". Specifically, kafka_rolling_restart is the tool that gets broker details from ZooKeeper and URP (Under-Replicated Partitions) metrics from each broker. After a broker is restarted, the total URP count across the Kafka cluster is polled periodically, and when it drops to zero the next broker is restarted. The controller broker is restarted last.

Kafka recommended system configuration

I expect our inflow into Kafka to rise to around 2 TB/day over time. I'm planning to set up a Kafka cluster with 2 brokers (each running on a separate system). What is the recommended hardware configuration for handling 2 TB/day?
To use as a base you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know the number of messages you receive per second/hour, because this will determine the size of your cluster. For storage, SSDs are not strictly necessary because the system buffers the data in RAM first. Still, you need reasonably fast disks to ensure that flushing the queued data does not slow down your system.
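As a rough back-of-envelope figure: 2 TB/day is about 23 MB/s of average ingress (2 × 10^12 bytes ÷ 86,400 s) before replication; with a replication factor of 3 the cluster as a whole writes roughly three times that, and peak rates are usually well above the average, so size disks and network accordingly.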
I would also recommend using 3 Kafka brokers and 3 or 4 ZooKeeper servers.

How to find out the total number of connections to an Activemq broker?

I'm using ActiveMQ 5.3.2. My app is a distributed system that creates lots of connections to the AMQ broker. Right now, my app occasionally runs into issues such as the producer stopping producing messages, the AMQ broker becoming unresponsive, etc. I'd like to find out the total number of connections to my AMQ broker, but I couldn't find this number anywhere in JConsole, where I can find other numbers like the total number of topics, queues, etc.
Does anyone know how to find out the total number of connections to an AMQ broker?
If you want to find the total number of connections to your broker, you can look that up in JMX under:
org.apache.activemq.Connection.[Protocol]
where Protocol is something like "Openwire". There will be an MBean per connection. Beyond that, there's not a good way to get a total count.
Can you explain more about why your broker is not responding? By the sounds of it, you've simply hit Producer Flow Control.
You should also consider upgrading to ActiveMQ 5.5. The impact on your code and build should be minimal, consisting only of updated client libs for 5.5's activemq-core (and activemq-pool) dependencies.
You can use the JMX API to retrieve the MBean with Type=Broker and fetch the TotalConnectionsCount attribute on your broker.
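A minimal sketch of that lookup; the JMX service URL, port and brokerName=localhost are assumptions, and the object-name layout differs between old 5.3.x brokers and 5.8+:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerConnectionCount {
    public static void main(String[] args) throws Exception {
        // Assumed management endpoint; adjust host/port to your broker's JMX connector.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name used by ActiveMQ 5.8+; older releases such as 5.3.x
            // expose it as org.apache.activemq:BrokerName=localhost,Type=Broker.
            ObjectName broker = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost");
            Object total = mbs.getAttribute(broker, "TotalConnectionsCount");
            System.out.println("TotalConnectionsCount = " + total);
        } finally {
            connector.close();
        }
    }
}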