ActiveMQ Artemis: Disk filling indefinitely without consumers or producers - persistence

We are testing ActiveMQ Artemis 2.22.0 with clients using the core protocol. The broker is configured to apply paging. We let producers fill up the broker with messages until the max-disk-usage limit blocked all producers.
Afterwards we tried connecting consumers, which worked at first. However, the broker's disk kept filling until the broker crashed completely.
Now, even after we have manually disconnected all clients, we see that after a restart the broker keeps extending its message journal until the disk is full again. On restart we see a lot of messages saying "deleting orphaned file" and disk usage goes down. After a few seconds, however, the journal starts growing again and the story repeats.
That's probably not enough information to clearly solve our issue. Thus, here are my questions:
What are possible causes to fill disk space if neither consumers nor producers are connected?
How can we debug such a situation?
In case the journal became corrupt (really hoping that's not it), is there any way to, first, determine that and, second, restore it?

What are possible causes to fill disk space if neither consumers nor producers are connected?
You may be hitting ARTEMIS-3868, in which case I strongly recommend you move to the latest release (i.e. 2.25.0 at this point).
How can we debug such a situation?
The first thing to do would be to use the artemis data print command to print details about exactly what is in the journal. Hopefully that will shed light on what is causing the journal growth.
In case the journal became corrupt (really hoping that's not it), is there any way to, first, determine that and, second, restore it?
Particular records in the journal may be corrupted and lost, but the broker should be able to read and recover everything that is still viable.

Related

Purge all messages in ActiveMQ Artemis

We have several ActiveMQ Artemis 2.17.0 clusters setup to replicate between data centres with mirroring.
Our previous failover had been an emergency, and it's likely the state had fallen out of sync. When we next performed our scheduled failover tests, weeks-old messages were delivered to the consumers. I know that mirroring is asynchronous, so it is expected that synchronization may not be 100% complete at all times. However, these messages were well outside the time frame of normal synchronization delays. It is worth noting that we've had several events which I expect might throw mirroring off: we had hit the NFS split-brain issue as well as the past emergency failover.
As such, we are looking for a way to purge (or sync) all messages on the standby server after we know that there have been problems with the mirroring to prevent a similar scenario from happening. There are over 5,000 queues, so preferably the operation doesn't need to be run on a queue by queue basis.
Is there any way to accomplish this, either in ActiveMQ Artemis 2.17.0 or a later version?
There's no programmatic way to simply delete all the data from every queue on the broker. However, you can combine a few management operations (e.g. in a script) to get the same result. You can use the getQueueNames method to get the name of every queue and then pass those names to the destroyQueue(String) method.
However, the simplest way to clear all the data would probably be to stop the broker, clear the data directory, and then restart the broker.
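For illustration, here is a minimal sketch of that script idea using the broker's JMX management interface. The JMX service URL, the port, and the broker name ("myBroker") are placeholders you would need to adjust to your own management configuration, and depending on your setup you may want to skip the broker's internal queues rather than destroy everything:

```java
import java.util.HashMap;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.apache.activemq.artemis.api.core.management.ActiveMQServerControl;

public class DestroyAllQueues {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint and broker name -- adjust to your broker's management configuration.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, new HashMap<>())) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName brokerObjectName = new ObjectName("org.apache.activemq.artemis:broker=\"myBroker\"");
            ActiveMQServerControl control =
                    JMX.newMBeanProxy(connection, brokerObjectName, ActiveMQServerControl.class, false);

            // getQueueNames() lists every queue on the broker; destroyQueue(name) removes the
            // queue together with the messages it holds.
            for (String queue : control.getQueueNames()) {
                control.destroyQueue(queue);
            }
        }
    }
}
```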

How long a rolled-back message is kept in a Kafka topic

I came across this scenario when implementing a chained transaction manager inside our Spring Boot application, which consumes messages from JMS and then publishes them to a Kafka topic.
My testing strategy is explained here:
Unable to synchronise Kafka and MQ transactions using ChainedKafkaTransaction
In short I threw a RuntimeException on purpose after consuming messages from MQ and writing them to Kafka just to test transaction behaviour.
However, although the rollback functionality worked OK, I could see the number of uncommitted messages in the Kafka topic growing even though a rollback happened on every processing attempt. Within a few seconds I ended up with hundreds of uncommitted messages in the topic.
Naturally I asked myself: if a message is rolled back, why would it still be there taking up storage? I understand that with transaction isolation set to read_committed they will never get consumed, but the idea of a poison message being rolled back again and again while eating up storage does not sound right to me.
So my question is:
Am I missing something? Is there a configuration in place for a "time to live" or similar for a message that was rolled back? I tried to read the Kafka docs around this subject but I could not find anything. If such a setting is not in place, what would be a good practice to deal with situations like this and avoid wasting storage?
Thank you in advance for your inputs.
That's just the way Kafka works.
Publishing a record always takes a slot in the partition log. Whether or not a consumer can see that record depends on whether it is committed or not (assuming the isolation level is read_committed).
Kafka achieves its extraordinary throughput because of its simple log architecture.
Rollback is assumed to be somewhat rare.
If you are getting so many rollbacks then your application architecture is probably at fault.
You should probably shut things down for a while if you keep rolling back.
To specifically answer your question, see the log.retention.hours broker setting (or the per-topic retention.ms override).
The uncommitted records are kept for a week by default.
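To make the read_committed behaviour concrete, here is a minimal consumer configuration sketch; the bootstrap address, group id, and topic name are placeholders. Records from aborted (rolled-back) transactions stay in the partition log and are only removed by normal retention, but a consumer configured like this never sees them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Aborted (rolled-back) transactional records remain in the log until retention
        // removes them, but a read_committed consumer filters them out entirely.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("example-topic")); // placeholder topic
            consumer.poll(Duration.ofSeconds(1)); // only committed records are returned
        }
    }
}
```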

How to make full replication in kafka?

How to make full replication in kafka?
I have two servers, a leader and a follower.
How can I make sure that when the leader fails (is turned off), all messages that were sent to the follower also appear on the leader after it is turned back on?
I know one option: Kafka ships with a built-in synchronization program, bin/kafka-mirror-maker.sh. It should always be run on the leader; then messages that go to the leader will also go to the follower. When the leader is turned off, this program should be started on the follower, and all messages, as I understand it, will then go to it. After the leader is turned back on and synchronization has completed (that is, at the moment when messages start going only to the leader), this service should be started on the leader again and stopped on the follower; then the messages will always be synchronized.
If you keep these services running on both servers at the same time, the messages will be duplicated endlessly. That is, the same message will keep arriving on both the follower and the leader because of the synchronization.
But I'm not sure that this method is correct, and it requires additional resources: a service to track all of this and to run bin/kafka-mirror-maker.sh.
How can I do this correctly and without wasting resources?
Kafka itself is a distributed system. Per the docs:
Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures.
If you want to replicate between Kafka clusters (such as full datacenters, or clusters serving different purposes) then this is where something like MirrorMaker would come in.
How can I make sure that when the leader fails (is turned off), all messages that were sent to the follower also appear on the leader after it is turned back on?
This is built into the protocol, but it assumes every topic you are using has replication-factor=2.
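For illustration, here is a minimal AdminClient sketch (using the newer Kafka 2.x Java clients; the broker addresses and topic name are placeholders) that creates a topic with two replicas so that either broker can fail without the data becoming unavailable:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder addresses for your two brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");

        try (Admin admin = Admin.create(props)) {
            // One partition with two replicas: the partition leader lives on one broker and
            // a follower replica on the other, so messages survive the loss of either broker.
            NewTopic topic = new NewTopic("example-topic", 1, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```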
It sounds like you have only two brokers on the same network, so you do not need MirrorMaker; as the docs show, it is intended for replication between two separate, regional datacenters.
I would like to add: if you did want to do that, don't use kafka-mirror-maker. It is not as fault-tolerant or scalable as you might expect.
Instead, use MirrorMaker 2, which is part of the Apache Kafka Connect framework.

kafka Multi-Datacenter with high availability

I'm setting up two Kafka v0.10.1.0 clusters in different DCs and planning to use MirrorMaker to keep one as the source and the other as the target. What I'm not sure about is how to ensure high availability when my source/main cluster goes down (the entire DC hosting the source Kafka cluster goes down). Do I need to make my application switch to producing messages to the target Kafka? What will happen when the source Kafka is back? How do I bring it back in sync with the possibly lost messages?
Thanks
From reading your question, I'm afraid I don't think MirrorMaker will be a suitable tool for your needs.
Basically MirrorMaker is simply a Consumer and a Producer tied together to replicate messages from one cluster to another. It is not a tool to tie two Kafka clusters together in an active-active configuration, which sounds a lot like what you are looking for.
But to answer your questions in order:
Do I need to make my application switch to produce messages to the target Kafka?
Yes. There is currently no failover functionality; you would need to implement logic in your producers to try the target cluster after x failed sends, or after no messages were sent for y minutes, or something like that.
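A rough sketch of what that producer-side logic could look like with the plain Java client (the cluster addresses, the failure threshold, and the topic handling are all made up for illustration; a real implementation would need more care around threading, retries, and in-flight messages):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FailoverProducer {
    private static final String SOURCE = "source-dc:9092";  // hypothetical source-cluster address
    private static final String TARGET = "target-dc:9092";  // hypothetical target-cluster address
    private static final int MAX_FAILURES = 5;               // arbitrary "x failed sends" threshold

    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private KafkaProducer<String, String> producer = newProducer(SOURCE);

    private static KafkaProducer<String, String> newProducer(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }

    public synchronized void send(String topic, String value) {
        // If too many sends in a row have failed, assume the source cluster is down
        // and recreate the producer against the target cluster.
        if (consecutiveFailures.get() >= MAX_FAILURES) {
            producer.close();
            producer = newProducer(TARGET);
            consecutiveFailures.set(0);
        }
        producer.send(new ProducerRecord<>(topic, value), (metadata, exception) -> {
            if (exception == null) {
                consecutiveFailures.set(0);
            } else {
                consecutiveFailures.incrementAndGet();
            }
        });
    }
}
```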
What will happen when source kafka is back?
Pretty much nothing that you don't implement yourself :)
MirrorMaker will start replicating data from your source cluster to your target cluster again, but since your producers have now switched over to the target cluster, the source cluster is not getting any data, so it will just idle along.
Your producers will keep producing into the target cluster unless you implement a regular check for whether the source has come back online and have them switch back.
How to bring it back in sync with the possible lost messages?
When your source cluster is back online, and assuming everything I mentioned above has happened, you have effectively switched your clusters around. Depending on whether you want your source as the primary cluster that gets written to, or are happy to reverse roles when this happens, you have two options that I can come up with off the top of my head:
reverse the direction of MirrorMaker and set the consumer group offsets manually so that it picks up at the point where the source cluster died (see the sketch below)
stop producing new data for a while, recover the missing data to the source cluster, switch your producers back, and start everything up again.
Both options require you to figure out manually what data is missing on the source cluster, though; I don't think there is a way around this.
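For the first option, here is a hedged sketch of setting consumer group offsets programmatically. It uses Admin#alterConsumerGroupOffsets, which only exists in much newer Kafka clients (roughly 2.5+), not in 0.10.x, and the group name, topic, partition, and offset are all made-up values; the consumer group must be inactive (MirrorMaker stopped) when you run it:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetMirrorOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // After reversing direction, MirrorMaker consumes from the target cluster, so that
        // is where its consumer group offsets live. Address is a placeholder.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "target-dc:9092");

        try (Admin admin = Admin.create(props)) {
            // Rewind the (hypothetical) MirrorMaker group to the last offset that was
            // replicated before the source cluster died.
            Map<TopicPartition, OffsetAndMetadata> offsets = Map.of(
                    new TopicPartition("example-topic", 0), new OffsetAndMetadata(123456L));
            admin.alterConsumerGroupOffsets("mirror-maker-group", offsets).all().get();
        }
    }
}
```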
The bottom line is that this is not an easy thing to do with MirrorMaker, and it might be worth thinking again about whether you really want to switch producers over to the target cluster if the source goes down.
You could also have a look at Confluent's Replicator, which might better suit what you are looking for and is part of their commercial offering. Information is a bit sparse on that; let me know if you are interested and I can make an introduction to someone who can tell you more about it (or of course just send a mail to Confluent, which will reach the right person as well).

MSMQ: What can cause a "Insufficient resources to perform operation" error when receiving from a queue?

MSMQ: What can cause a "Insufficient resources to perform operation" error when receiving from a queue?
At the time the queue only held 2,000 messages with each message being about 5KB in size.
I had the same error message and the solution was simple.
There were a lot of messages sitting on various queues, and the storage limits had been reached. I went to:
Server Manager -> Features
Right-clicked on Message Queuing
Selected Properties
In the General tab, un-ticked the storage limits
I was informed that services using MSMQ would be re-started, and then the error went away.
From John Breakwell's blog there are a number of possibilities:
The thread pool for the remote read is exhausted (MSMQ 2.0 only).
The number of local callback threads is exceeded.
The volume of messages has exceeded what the system can handle (MSMQ 2.0 only).
Paged-pool kernel memory is exhausted.
Mismatched binaries.
The message size is too large.
The machine quota has been exceeded.
Routing problems when opening a transactional foreign queue (MSMQ 3.0 only).
Lack of disk space.
Storage problems on mobile devices.
Clustering too many MSMQ resources.
Too many open connections.
Computer name was longer than 15 characters.
Too many messages in the dead letter queue.
http://blogs.msdn.com/johnbreakwell/archive/2006/09/18/761035.aspx
I would check the version of MSMQ you are running and the number of connections (to and from) your queue that were open at the time of the error. Any of those could have caused it.
I had too many failed messages in my outgoing queue.
Check System Queues -> Dead-letter messages. I cleared this queue out and it worked fine again.
If journaling is enabled, you will be storing copies of all messages removed from the queue, so you might also be hitting the MSMQ journal limit. A short-term fix might be to purge the journal for the queue; longer term, disable journaling.
I encountered the same error. After checking the things mentioned above, it turned out that the computer name was causing the issue! It was longer than 15 characters; after I changed it to a shorter one, the issue was gone.
For me, the problem was not the machine that hosted the queue. It was the machine that was sending the message to the queue. I noticed that the "Outgoing Queues" on the source machine showed large numbers of messages, which led me to MSMQ Messages Are Stuck In The Outgoing Queue. Reinstalling MSMQ on the source machine is what fixed it for me.