Artemis - Message Sync Between Memory & Journal

From reading the Artemis docs I understood that Artemis keeps all currently active messages in memory, can offload messages to a paging area for a given queue/topic according to the settings, and that the Artemis journals are append-only.
With respect to this:
How and when does the broker sync messages to and from the journal? (Only during restart?)
How does it identify the message to be deleted from the journal? (For example: if the journal is append-only and a consumer ACKs a persistent message, how does the broker remove that single message from the journal without keeping an index?)
Isn't it a performance hit to keep every active message in memory, and couldn't it even make the broker run out of memory? To avoid this, paging settings have to be configured for every queue/topic, otherwise the broker may fill up with messages. Please correct me if I'm wrong.
Any reference link that explains message sync and this information would be helpful. The Artemis docs do cover the append-only journal, but maybe there is a section/article explaining these storage concepts that I'm missing.

By default, a durable message is persisted to disk after the broker receives it and before the broker sends a response back to the client confirming that the message was received. In this way, if the client receives the response back from the broker, it knows for sure that the durable message it sent was received and persisted to disk.
When using the NIO journal-type in broker.xml (i.e. the default configuration), data is synced to disk using java.nio.channels.FileChannel.force(boolean).
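To make that force-to-disk step concrete, here is a minimal, self-contained sketch of appending a record and syncing it with java.nio. This is not the broker's actual journal code, and the file name is made up purely for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JournalSyncSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative file name only; not the broker's real journal layout.
        Path journal = Path.of("journal-example.dat");
        try (FileChannel channel = FileChannel.open(journal,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            // Append a record; at this point the bytes may still live only in the OS page cache.
            channel.write(ByteBuffer.wrap("add-record\n".getBytes(StandardCharsets.UTF_8)));
            // force(false) asks the OS to flush the file's content to the storage device
            // (passing true would also force file metadata). This is the sync step the
            // answer above refers to for the NIO journal type.
            channel.force(false);
        }
    }
}
```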
Since the journal is append-only during normal operation, when a message is acknowledged it is not actually deleted from the journal. The broker simply appends a delete record to the journal for that particular message. The message will then be physically removed from the journal later during "compaction". This process is controlled by the journal-compact-min-files & journal-compact-percentage parameters in broker.xml. See the documentation for more details on that.
Keeping message data in memory actually improves performance dramatically vs. evicting it from memory and then having to read it back from disk later. As you note, this can lead to memory consumption problems, which is why the broker supports paging, blocking, etc. The main thing to keep in mind is that a message broker is not a storage medium like a database. Paging is a palliative measure meant to be used as a last resort to keep the broker functioning. Ideally the broker should be configured to handle the expected load without paging (e.g. acquire more RAM, allocate more heap). In other words, message production and message consumption should be balanced. The broker is designed for messages to flow through it. It can certainly buffer messages (potentially millions depending on the configuration & hardware), but when it's forced to page the performance will drop substantially simply because disk is orders of magnitude slower than RAM.
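For reference, here is a hedged sketch of what those compaction and paging settings look like when expressed through Artemis's Java configuration API rather than broker.xml. The setter names below mirror the broker.xml element names, but they (and the method for registering address settings) can differ between Artemis versions, so treat this as an assumption to verify against your version:

```java
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.JournalType;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

public class BrokerStorageSettingsSketch {
    public static void main(String[] args) {
        ConfigurationImpl config = new ConfigurationImpl();

        // Journal settings (correspond to <journal-type>, <journal-compact-min-files>,
        // <journal-compact-percentage> in broker.xml); 10 and 30 match the documented defaults.
        config.setJournalType(JournalType.NIO);
        config.setJournalCompactMinFiles(10);
        config.setJournalCompactPercentage(30);

        // Per-address settings (correspond to an <address-setting> block in broker.xml):
        // cap in-memory usage for the address and page to disk once the limit is reached.
        AddressSettings settings = new AddressSettings();
        settings.setMaxSizeBytes(100L * 1024 * 1024);                        // <max-size-bytes>
        settings.setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE); // <address-full-policy>

        // The registration method name varies by version
        // (e.g. addAddressesSetting vs. addAddressSetting).
        config.addAddressesSetting("orders.#", settings);
    }
}
```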

Related

Does Kafka respond with an ack after the data is written to the page cache or to disk?

Many articles tell me that Kafka writes data to the page cache first, which improves write performance.
However, I have a doubt: with acks=-1 and a replication factor of 2, the data already exists in the page cache of both nodes.
If Kafka responds with the ack at this point and, immediately afterwards, both nodes experience a power outage or system crash at the same time, then neither node's data has been persisted to disk yet.
In this extreme case, can data loss still occur?
Data loss can occur in the situation outlined.
Related reading:
this other answer
Confluent blog post: "Since the log data is not flushed from the page cache to disk synchronously, Kafka relies on replication to multiple broker nodes, in order to provide durability. By default, the broker will not acknowledge the produce request until it has been replicated to other brokers."
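As a concrete illustration of that contract, here is a minimal producer sketch (broker address and topic name are placeholders). The acknowledgement in the callback means the record was accepted by the in-sync replicas, not that any of them has fsynced it:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckSemanticsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // equivalent to acks=-1
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"),
                    (RecordMetadata md, Exception e) -> {
                        if (e == null) {
                            // The ack means the in-sync replicas have appended the record to their
                            // logs (page cache); it does NOT mean every replica has fsynced to disk.
                            System.out.printf("acked: partition=%d offset=%d%n", md.partition(), md.offset());
                        } else {
                            e.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```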

Can message loss occur in Kafka even if the producer gets an acknowledgement for it?

Kafka doc says:
Kafka relies heavily on the filesystem for storing and caching messages.
A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes.
Modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache
...rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
Further this article says:
(3) a message is ‘committed’ when all in sync replicas have applied it to their log, and (4) any committed message will not be lost, as long as at least one in sync replica is alive.
So even if I configure the producer with acks=all (which causes the producer to receive the acknowledgement after all brokers commit the message) and the producer receives the acknowledgement for a certain message, does that mean there is still a possibility that the message can get lost, especially if all brokers go down and the OS never flushes the committed messages from cache to disk?
With acks=all and if the replication factor of the topic is > 1, it's still possible to lose acknowledged messages, but it's pretty unlikely.
For example, if you have 3 replicas (and all are in-sync), with acks=all, you would need to lose all 3 brokers at the same time before any of them had time to do the actual write to disk. With acks=all, the acknowledgement is sent once all in-sync replicas have received the message; you can ensure this number stays high with min.insync.replicas=2, for example.
You can reduce the possibility of this scenario even further if you use the rack awareness feature (and obviously the brokers are physically in different racks, or even better, data centers).
To summarize, using all these options, you can reduce the likelihood of losing data enough that it's unlikely to ever happen.
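To tie those settings together, here is a small sketch (topic name and bootstrap address are placeholders) that creates a 3-replica topic with min.insync.replicas=2, the combination described above for use together with acks=all on the producer side:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3; require at least 2 in-sync replicas
            // before a write with acks=all is acknowledged.
            NewTopic topic = new NewTopic("events", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
        // Producers targeting this topic would then set acks=all (acks=-1), so an
        // acknowledged write has been accepted by at least 2 of the 3 replicas.
    }
}
```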

Can a Kafka consumer read records from a Broker's page cache?

Kafka's documentation clearly states that messages/records are immediately written to the file system as they are received by the broker. With the default configuration, this means that the broker writes records to the page cache immediately and the kernel can flush them to disk later.
My question is: can a consumer read a record that is in the page cache but that has not yet been flushed to disk by the kernel?
If the answer is yes, how will the consumer keep track of the offset it reads from?
If the answer is no, then it would mean that the record has to be read back from disk into the page cache before it is sent out to the NIC via zero-copy. Correct?
Thanks,
Whenever there is a read/write operation on a file, the data goes through the page cache first. For a read, if the data is already present in the page cache, no actual disk read is issued and the data is served from the page cache. It's not that the Kafka consumer reads from the broker's page cache directly; this is done by the file system and is hidden from the actual read call. In most cases, records from Kafka are read sequentially, which allows the page cache to be used effectively.
The zero-copy optimization is used for every read from a Kafka client, copying data directly from the page cache to the NIC buffer.
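As an illustration of the zero-copy mechanism (a generic java.nio sketch, not Kafka's actual code, with placeholder file name and destination), FileChannel.transferTo lets the kernel move bytes from the page cache to a socket without copying them through user space:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws IOException {
        // Placeholder file and destination; in Kafka the file would be a log segment
        // and the target the consumer's connection.
        Path segment = Path.of("00000000000000000000.log");
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo hands the copy to the kernel (e.g. sendfile on Linux): data goes
                // from the page cache to the NIC buffer without an extra user-space copy.
                // If the pages are not cached, the kernel reads them from disk into the
                // page cache first.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```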

Reliable fire-n-forget Kafka producer implementation strategy

I'm in the middle of a first-mile problem with Kafka. Everybody deals with partitioning, etc., but how do you handle the first mile?
My system consists of many applications producing events, distributed across nodes. I need to deliver these events to a set of applications acting as consumers in a reliable/fail-safe way. The messaging system of choice is Kafka (due to its log nature), but it's not set in stone.
The events should be propagated in a decoupled, fire-and-forget manner as much as possible. This means the producers should be fully responsible for reliably delivering their messages, and the apps producing events shouldn't worry about event delivery at all.
The producer's reliability scheme has to account for:
box connection outage - during an outage producer can't access network at all; Kafka cluster is thus not reachable
box restart - both producer and event producing app restart (independently); producer should persist in-flight messages (during retrying, batching, etc.)
internal Kafka exceptions - message size was too large; serialization exception; etc.
No library I've examined so far covers these cases. Is there a suggested strategy for how to solve this?
I know there are retriable and non-retriable errors during the Producer's send(). For the retriable ones, the library usually handles everything internally. However, non-retriable ones end with an exception in the async callback...
Should I blindly replay these to infinity? For network outages it should work, but what about Kafka internal errors, say, a message that's too large? There might be a DeadLetterQueue-like mechanism + replay. However, how do I deal with the message count...
About the persistence: a lightweight DB backend should solve this, just creating a persistent queue and then removing the messages already sent/ACKed. However, I'm afraid that if it were this simple it would already have been implemented in the standard Kafka libraries a long time ago. Performance would probably go south.
Seeing things like KAFKA-3686 or KAFKA-1955 makes me a bit worried.
Thanks in advance.
We have a production system whose primary use case is reliable message delivery. I can't go into much detail, but I can share a high-level design of how we achieve this. Note that this system guarantees "at-least-once delivery" messaging semantics.
Source
First we designed a message schema, and every message sent to this system must follow it.
Then we write the message to a MySQL message table, which is sharded by date, with a field marking it as delivered or not.
We have an app constantly polling the DB for rows marked undelivered; it picks up a row, constructs the message, and sends it to the load balancer. This is a blocking call, and it updates the message row to delivered only when a 200 is returned (a sketch of this polling loop follows below).
In case of a 5xx, the app will retry the message with a sleep back-off. You can also make the retries configurable as per your need.
Each source system maintains its own polling app and DB.
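A minimal sketch of such a polling loop, assuming a hypothetical messages table (id, payload, delivered) and a hypothetical http://producer-lb/produce endpoint for the load balancer; the real system described above presumably differs in detail:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical table: messages(id BIGINT, payload TEXT, delivered TINYINT)
public class OutboxPoller {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/outbox", "user", "password")) { // placeholder DSN
            while (true) {
                try (PreparedStatement select = db.prepareStatement(
                        "SELECT id, payload FROM messages WHERE delivered = 0 LIMIT 100");
                     ResultSet rows = select.executeQuery()) {
                    while (rows.next()) {
                        long id = rows.getLong("id");
                        String payload = rows.getString("payload");
                        // Blocking call to the producer array behind the load balancer.
                        HttpResponse<String> response = http.send(
                                HttpRequest.newBuilder(URI.create("http://producer-lb/produce"))
                                        .POST(HttpRequest.BodyPublishers.ofString(payload))
                                        .build(),
                                HttpResponse.BodyHandlers.ofString());
                        if (response.statusCode() / 100 == 2) {
                            // Mark delivered only after a 2xx from the producer array.
                            try (PreparedStatement update = db.prepareStatement(
                                    "UPDATE messages SET delivered = 1 WHERE id = ?")) {
                                update.setLong(1, id);
                                update.executeUpdate();
                            }
                        }
                        // On 5xx the row stays undelivered and is retried on the next poll
                        // (a real implementation would add sleep/back-off here).
                    }
                }
                Thread.sleep(1000); // poll interval
            }
        }
    }
}
```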
Producer Array
This is basically an array of machines behind a load balancer, waiting for incoming messages and producing them to the Kafka cluster.
We maintain 3 replicas of each topic, and in the producer config we keep acks=-1, which is very important for your fire-and-forget requirement. As per the doc:
acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
As I said, producing is a blocking call, and it will return a 2xx if the message is produced successfully across all 3 replicas.
4xx, if the message doesn't meet the schema requirements.
5xx, if the Kafka broker threw some exception.
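Here is a hedged sketch of what such a blocking produce handler could look like; the class, topic handling, and the trivial schema check stand in for the real system, and the status codes map to the responses described above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerArrayHandler {
    private final KafkaProducer<String, String> producer;

    public ProducerArrayHandler() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092,kafka-3:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "-1"); // wait for all in-sync replicas
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    /** Maps the outcome of a blocking send to the status codes described above. */
    public int produce(String topic, String payload) {
        if (payload == null || payload.isEmpty()) { // stand-in for the real schema check
            return 400;                             // 4xx: message doesn't meet the schema
        }
        try {
            producer.send(new ProducerRecord<>(topic, payload)).get(); // blocking until acked
            return 200;                             // 2xx: acknowledged by all in-sync replicas
        } catch (Exception e) {
            return 503;                             // 5xx: broker threw an exception; retried upstream
        }
    }
}
```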
Consumer Array
This is a normal array of machines, running Kafka high-level consumers for the topic's consumer groups.
We are currently running this setup with few additional components for some other functional flows in production and it is basically fire-n-forget from the source point of view.
This system addresses all of your concerns.
box connection outage: unless the source polling app gets a 2xx, it will produce again and again, which may lead to duplicates.
box restart: due to the retry mechanism of the source, this shouldn't be a problem either.
internal Kafka exceptions: taken care of by the polling app, as the producer array will reply with a 5xx when unable to produce, and the message will be retried.
acks=-1 also ensures that all the in-sync replicas have a copy of the message, so a broker going down will not be an issue either.

Need help to understand Kafka storage

I am new to Kafka. From this link: http://notes.stephenholiday.com/Kafka.pdf
It is mentioned:
"Every time a producer publishes a message to a partition, the broker
simply appends the message to the last segment file. For better
performance, we flush the segment files to disk only after a
configurable number of messages have been published or a certain
amount of time has elapsed. A message is only exposed to the consumers
after it is flushed."
Now my questions are:
What is a segment file here?
When I create a topic with partitions, each partition has an index file and a .log file.
Is this .log file the segment file? If so, it is already on disk, so why does it say "For better performance, we flush the segment files to disk"? If it is flushing to disk, where on the disk is it flushing?
It seems that until it is flushed to disk, it is not available to the consumer. Then we are adding some latency to reading the message, but why?
I also want help understanding whether, when a consumer wants to read some data, it reads from disk (partition, segment file) or whether there is some cache mechanism; if so, how and when is the data put into the cache?
I am not sure whether all these questions are valid, but it would help me understand if anybody could clarify them.
You can think of writes to this segment file as going to the OS page cache first.
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal sizes. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. Segment file is flushed to disk after configurable number of messages has been published or after certain amount of time. Messages are exposed to consumer after it gets flushed.
Also, please refer to the document below:
http://kafka.apache.org/documentation/#appvsosflush
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Don't get confused when you see the word "filesystem" there; writes to the filesystem go through the OS page cache first, and the link you mentioned is really very outdated.
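To connect the segment and flush concepts to concrete settings, here is a sketch (topic name and broker address are placeholders) that creates a topic with an explicit segment size and a per-topic flush policy; segment.bytes, flush.messages and flush.ms are standard topic-level configs:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SegmentAndFlushConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("storage-demo", 1, (short) 1).configs(Map.of(
                    // Each partition's log is split into segment files of at most this size;
                    // the broker always appends to the newest ("active") segment, e.g.
                    // 00000000000000000000.log plus its .index file.
                    "segment.bytes", String.valueOf(128 * 1024 * 1024),
                    // Optional per-topic flush policy: force the page cache to disk after
                    // this many messages or this much time. By default Kafka leaves the
                    // flush to the OS and relies on replication for durability.
                    "flush.messages", "10000",
                    "flush.ms", "1000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```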