How does Artemis federation priority-adjustment work? - activemq-artemis

According to the document, priority-adjustment means
"when a consumer attaches its priority is used to make the upstream consumer, but with an adjustment. By default -1, so that local consumers get load balanced first over remote, this enables this to be configurable should it be wanted/needed."
My understanding is therefore that priority-adjustment can control the direction of data flow, i.e. when priority-adjustment = -1 the broker will first transmit the data to the remote consumer instead of delivering it to local consumers.
Migrating between two clusters: consumers and publishers can be moved in any order and the messages won't be duplicated (which is the case if you do exchange federation). Instead, messages are transferred to the new cluster when your consumers are there. For such a migration with blue/green or canary deployments, moving a number of consumers on the same queue, you may want to set the priority-adjustment to 0, or even a positive value, so messages would actively flow to the federated queue.
Does this mean that when the priority adjustment is an integer greater than or equal to zero, once a remote consumer is created the broker will prefer to transmit messages to the remote consumer rather than to the local consumers? And what happens if the priority adjustment is negative when two clusters are federated?
I have also seen some samples that adjust the priority to a value other than -1 or 0, and I don't know how such a value works.
I also found some documentation about the ActiveMQ message priority setting, which does not seem to apply to Artemis.
Please tell me more details about the meaning of the parameter "priority-adjustment".

ActiveMQ Artemis retrieves messages from upstream queues in order to satisfy demand for messages from local consumers connected to a federated queue.
To retrieve messages from upstream queues it creates a remote consumer using the priority of the local consumer adjusted by the priority-adjustment value, i.e. adjustedPriority = consumer.getPriority() + priorityAdjustment.
The adjusted priority of the remote consumer works like the consumer priority of local consumers attached to the upstream queue. Messages will only go to lower priority consumers when the higher priority consumers do not have credit available to consume the message, or those higher priority consumers have declined to accept the message.
If all consumers (upstream and downstream) have the same priority, the local consumers should receive the messages first and the remote consumers after them when priority-adjustment = -1.
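A minimal sketch (plain Java arithmetic, not Artemis source code) of how the federation consumer's priority is derived, assuming a local consumer with the default priority of 0:

    public class PriorityAdjustmentSketch {
        public static void main(String[] args) {
            int localConsumerPriority = 0;   // default consumer priority
            int priorityAdjustment = -1;     // federation default for priority-adjustment

            // The remote (federation) consumer attaches to the upstream queue with this priority.
            int adjustedPriority = localConsumerPriority + priorityAdjustment;

            // -1 is lower than 0, so consumers attached directly to the upstream queue
            // are served first. With priority-adjustment = 0 both compete equally, and
            // with a positive value the federation consumer wins and messages actively
            // drain to the downstream (federated) queue.
            System.out.println("remote consumer priority = " + adjustedPriority);
        }
    }

So a negative adjustment keeps messages local as long as local consumers can keep up, 0 makes local and federation consumers equal, and a positive adjustment (as in the migration use case above) pulls messages toward the federated queue.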

Related

ActiveMQ Artemis JMS Shared Subscription

I have a single node ActiveMQ instance with two competing consumers connected to a topic. The topic subscription is shared as per the JMS 2.0 specification. A shared subscription does guarantee that only one of the subscribers (using the same subscription name) gets each message. But what I noticed is that it does not guarantee that the second message is delivered only after the first one is acknowledged. In case the first consumer takes time to acknowledge the message, the second message is delivered to the free consumer even before the acknowledgement of the first one is sent by the consumer to the broker. Is this standard behaviour? And is there a way to stop the broker from delivering the second message before the acknowledgement of the first one?
ActiveMQ Artemis supports exclusive queues. These are special queues which route all messages to only one consumer at a time.
Obviously, exclusive queues have the drawback that you cannot scale out the consumers to improve consumption, as only one consumer would technically be active.
However, I would suggest taking a look at message grouping to scale out your solution. Message groups are useful when you want all messages for a certain value of a property to be processed serially by the same consumer, without stopping the delivery of messages with a different value of that property to other consumers.
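A brief sketch of how message grouping is typically switched on from the producer side, by setting the standard JMSXGroupID property (the connection URL, queue name, and group id are placeholders; depending on your client version the JMS package may be jakarta.jms instead of javax.jms):

    import javax.jms.ConnectionFactory;
    import javax.jms.JMSContext;
    import javax.jms.JMSException;
    import javax.jms.Queue;
    import javax.jms.TextMessage;
    import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

    public class GroupedProducer {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
            try (JMSContext context = cf.createContext()) {
                Queue queue = context.createQueue("orders");
                TextMessage message = context.createTextMessage("order payload");
                // All messages carrying the same JMSXGroupID are delivered serially
                // to the same consumer; messages of other groups can still be
                // dispatched to the remaining consumers in parallel.
                message.setStringProperty("JMSXGroupID", "order-42");
                context.createProducer().send(queue, message);
            }
        }
    }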

Kafka Producer (with multiple instance) writing to same topic

I have a use case where messages are coming from a channel, which we want to push into a Kafka topic (multiple partitions). In our case message order is important, so we have to push the messages to the topic in the order they are received, which looks very straightforward if we have only one producer and a single partition. In our case, for load balancing and scalability, we want to run multiple instances of the same producer, but the problem then is how to maintain the order of messages.
Any thoughts or solutions would be greatly appreciated.
Even if I settle on a single partition, can it be replicated to multiple brokers for availability and fault tolerance?
we have to push the messages to topic in the order they are received
which looks very straight forward if we have only one producer and
single partition
You can have multiple partitions in the topic with one producer and still have the order maintained if you provide a key for your messages. All messages with the same key produced by a single producer are always in order.
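A short sketch with the plain Kafka Java client of producing keyed records, so that everything with the same key lands on the same partition and stays in order (broker address, topic name, and key are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key hash to the same partition, so they are
                // appended (and later consumed) in the order they were sent.
                producer.send(new ProducerRecord<>("events", "channel-1", "first"));
                producer.send(new ProducerRecord<>("events", "channel-1", "second"));
            }
        }
    }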
When you say multiple producers, I assume that you are having multiple instances of your application running and that you are not creating multiple producers in the same JVM instance.
Since you said channel, I suppose that it is a network channel, like a Datagram channel for example. In that case, I suppose that you are listening on some port and sending the received data to Kafka.
I do not see a point in having multiple producers in the same instance
producing to the same topic, so it is better to have a single producer
send all the messages and for performance you can tune the producer
properties like batch.size, linger.ms etc.
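For reference, an illustrative way of setting those two properties on the producer configuration (the values are arbitrary examples, not recommendations):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class ProducerTuning {
        public static void main(String[] args) {
            Properties props = new Properties();
            // batch.size: maximum number of bytes collected into one batch per partition.
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
            // linger.ms: how long to wait for a batch to fill before sending it anyway.
            props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
            System.out.println(props);
        }
    }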
To achieve fault tolerance, have another instance running in HA mode (fail-over mode), so that if this instance dies the other automatically picks up.
If it is a network channel, you can run multiple instances and open the socket with the option SO_REUSEADDR in StandardSocketOptions; this way only one producer will be active at any point, and a new producer will become active once the active one dies.
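For illustration, a minimal sketch of opening such a channel with that option (the port is a placeholder, and how the operating system arbitrates between multiple processes bound to the same address is platform dependent):

    import java.net.InetSocketAddress;
    import java.net.StandardSocketOptions;
    import java.nio.channels.DatagramChannel;

    public class ReusableChannel {
        public static void main(String[] args) throws Exception {
            DatagramChannel channel = DatagramChannel.open();
            // Allow a standby instance to bind to the same address so it can
            // take over if the active instance dies.
            channel.setOption(StandardSocketOptions.SO_REUSEADDR, true);
            channel.bind(new InetSocketAddress(9999));
            // ... read datagrams from the channel and hand them to the Kafka producer
        }
    }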

max.poll.interval.ms set to Integer.MAX_VALUE by default

Apache Kafka documentation states:
The internal Kafka Streams consumer max.poll.interval.ms default value
was changed from 300000 to Integer.MAX_VALUE
Since this value is used to detect when the processing time for a batch of records exceeds a given threshold, is there a reason for such an "unlimited" value?
Does it enable applications to become unresponsive? Or does Kafka Streams have a different way to leave the consumer group when the processing is taking too long?
Does it enable applications to become unresponsive? Or does Kafka Streams have a different way to leave the consumer group when the processing is taking too long?
Kafka Streams leverages a heartbeat functionality of the Kafka consumer client in this context, and thus decouples heartbeats ("Is this app instance still alive?") from calls to poll(). The two main parameters are session.timeout.ms (for the heartbeat thread) and max.poll.interval.ms (for the processing thread), and their difference is described in more detail at https://stackoverflow.com/a/39759329/1743580.
The heartbeating was introduced so that an application instance may be allowed to spend a lot of time processing a record without being considered "not making progress" and thus "dead". For example, your app can do a lot of crunching for a single record for a minute, while still heartbeating to Kafka: "Hey, I'm still alive, and I am making progress. But I'm simply not done with the processing yet. Stay tuned."
Of course you can change max.poll.interval.ms from its default (Integer.MAX_VALUE) to a lower setting if, for example, you actually do want your app instance to be considered "dead" if it takes longer than X seconds in-between polling records, and thus if it takes longer than X seconds to process the latest round of records. It depends on your specific use case whether or not such a configuration makes sense -- in most cases, the default setting is a safe bet.
session.timeout.ms: The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
max.poll.interval.ms: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
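As a concrete illustration with the plain Java consumer client, a sketch of setting the two timeouts explicitly (the values are only placeholders; 300000 is the old default mentioned in the question):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class BoundedPollConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Heartbeat-based liveness: broker-side failure detection window.
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10_000);
            // Processing-based liveness: maximum time allowed between poll() calls.
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // subscribe(...) and poll(...) as usual
            consumer.close();
        }
    }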

Repeatedly produced to Apache Kafka, different offsets? (Exactly once semantics)

While trying to implement exactly-once semantics, I found this in the official Kafka documentation:
Exactly-once delivery requires co-operation with the destination
storage system but Kafka provides the offset which makes implementing
this straight-forward.
Does this mean that I can use the (topic, partition, offset) tuple as a unique primary identifier to implement deduplication?
An example implementation would be to use an RDBMS and this tuple as a primary key for an insert operation within a big processing transaction where the transaction fails if the insertion is not possible anymore because of an already existing primary key.
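A sketch of what I have in mind (table and column names are made up; it assumes a composite primary key on (topic, partition, offset) so a duplicate insert fails and rolls back the surrounding transaction):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OffsetDedupSink {
        // Insert a processed message; a duplicate (topic, partition, offset) violates
        // the primary key, throws SQLException, and aborts the enclosing transaction.
        public static void insertOnce(Connection conn, String topic, int partition,
                                      long offset, String payload) throws SQLException {
            String sql = "INSERT INTO processed_messages (topic, kafka_partition, kafka_offset, payload) "
                       + "VALUES (?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, topic);
                ps.setInt(2, partition);
                ps.setLong(3, offset);
                ps.setString(4, payload);
                ps.executeUpdate();
            }
        }
    }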
I think the question is equivalent to:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
If the offset is reused when retrying, consumers obviously see multiple messages with the same offset.
Other question, maybe somehow related:
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Another possibility could be that the offset is determined only once the message reaches the partition leader (implying that, unless the broker honours something like a producer-suggested offset, there are probably no gaps or offset jumps, but duplicate messages would get different offsets and I would have to use my own unique identifier within the application's message, at the application level).
To answer my own question:
The offset is generated solely by the server (more precisely: by the leader of the corresponding partition), not by the producing client. It is then sent back to the producer in the produce response. So:
Does a producer use the same offset for a message when retrying to
send it after detecting a possible failure or does every retry attempt
get its own offset?
No. (See update below!) The producer does not determine offsets and two identical/duplicate application messages can have different offsets. So the offset cannot be used to identify messages for producer deduplication purposes and a custom UID has to be defined in the application message. (Source)
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Due to the fact that there is only a single leader for every partition which maintains the current offset, and the fact that (with the default configuration) this leadership is only transferred to an active in-sync replica in case of a failure, I assume that the latest used offset is always communicated correctly when electing a new leader for a partition and therefore there should not be any offset gaps or jumps initially. However, because of the log compaction feature, there are cases (assuming log compaction is enabled) where there can indeed be gaps in a stream of offsets when consuming already committed messages of a partition once again after compaction has kicked in. (Source)
Update (Kafka >= 0.11.0)
Starting from Kafka version 0.11.0, producers now additionally send a sequence number with their requests, which is then used by the leader to deduplicate requests by this number and the producer's ID. So with 0.11.0, the precondition on the producer side for implementing exactly once semantics is given by Kafka itself and there's no need to send another unique ID or sequence number within the application's message.
Therefore, the answer to question 1 could now also be yes, somehow.
However, note that exactly-once semantics are still only possible if the consumer never fails. Once the consumer can fail, one still has to watch out for duplicate message processing on the consumer side.
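A minimal sketch, assuming Kafka >= 0.11.0 and the plain Java producer, of switching on the broker-side deduplication described above via enable.idempotence (broker address, topic, key and value are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class IdempotentProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // The producer attaches a producer id and sequence number to each batch;
            // the partition leader uses them to drop duplicates caused by retries.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value"));
            }
        }
    }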

Making Kafka consumers consume existing messages before subscription

Having a Publisher and N Consumers, if the consumers use auto.offset.reset=latest then they miss all the messages that were published to the topic before they subscribed to it... It is a known fact that a consumer with auto.offset.reset=latest doesn't replay messages that existed in the topic before it subscribed...
So I would need to either:
Make the publisher wait until all subscribers start consuming messages and then start publishing. I don't know how to do that without leveraging ZooKeeper, for instance. Does Kafka provide a means to do that?
Another way would be having auto.offset.reset=latest consumers and making them explicitly consume all existing messages before they subscribe to a topic with existing messages...
What is the best practice for this case?
I guess the consumer must check the topic for existing messages, consume them if there are any, and then initiate auto.offset.reset=latest consumption. That sounds like the best way to me...
If a high level consumer gets started, it does the following:
look for committed offsets for its consumer group
a. if valid offsets are found, resume from there
b. if no valid offsets are found, set offsets according to auto.offset.reset
Thus, auto.offset.reset only triggers if no valid offset was committed. This behavior is intended and necessary to provide at-least-once processing guarantees in case of failure.
Thus, if you want to read a topic from its beginning, you can either use a new consumer group.id and set auto.offset.reset = earliest, or you explicitly modify the offsets on startup using seekToBeginning() before you start your poll() loop.
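A sketch of the seekToBeginning() approach with the plain Java consumer: rewind whatever partitions get assigned to their earliest offset, then poll as usual (broker address, topic, and group id are placeholders):

    import java.time.Duration;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "replaying-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rewind every newly assigned partition to its earliest offset
                    // so existing messages are consumed before new ones.
                    consumer.seekToBeginning(partitions);
                }
            });

            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                        System.out.println(record.offset() + ": " + record.value()));
            }
        }
    }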
We do option (1) using a service discovery feature provided by Eureka (any other service discovery app would do the job) plus aliasing. Basically, a publisher does not register itself (and does not start processing requests or publishing notifications) until at least one subscriber is available.