The information I found comparing Apache Kafka and ActiveMQ (and similar message queuing products) is never clear about the integrity properties of each solution (especially consistency).
With Kafka you can get the guarantee that no message is lost even in the presence of failures. Do you lose that guarantee using the "LazyPersistence" option?
By "no loss" I mean that the messages would be available to clients, even upon failure after restart - ideally, all messages arriving at the client, in the correct order.
Does ActiveMQ (either "classic" or Artemis) guarantee no loss of messages upon failure? Any configuration options that do give that guarantee? If the answer would differ for "classic" vs Artemis, that would be nice to know.
With Kafka, you can get the guarantee that no message is lost, even in the presence of failures; I guess you lose that guarantee using the "LazyPersistence" option, is that correct?
This is a large topic.
guarantee that no message is lost
This depends on a few things. First, you can configure retention: after a specified period, it is acceptable for the messages to be deleted. You may consider infinite retention, but then beware that you have enough storage for it; maybe you need compaction of the topic instead?
even in the presence of failures; I guess you lose that guarantee using the "LazyPersistence" option, is that correct?
Kafka is a distributed system, and it is common for distributed systems to rely more on distributed replication than on synchronous disk writes. Even if you write synchronously to disk, the disk may die and the data be lost. To what degree you want to use distributed replication (e.g. 3 or 6 replicas?) and synchronous or asynchronous disk writes depends on your requirements, but it also has a trade-off in throughput. E.g. AWS Aurora is a distributed database that uses 6 replicas.
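For example, here is a minimal sketch with the Kafka Java AdminClient, assuming a broker at localhost:9092 and a hypothetical "events" topic; the retention, replication factor and min.insync.replicas values are only placeholders to weigh against your own storage and throughput budget:
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed local broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "events" topic: 3 partitions, each replicated to 3 brokers.
            NewTopic topic = new NewTopic("events", 3, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "-1",                // keep messages forever (watch your storage),
                                                                 // or use cleanup.policy=compact instead
                            "min.insync.replicas", "2"));        // with acks=all on the producer, 2 replicas must
                                                                 // hold the record before the send is acknowledged
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}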
There is no reasonable or practical way to have "no loss of messages" with any solution.
Kafka's approach is to replicate the data once it gets to the server. As @Jonas mentioned, there is a total-throughput trade-off. Kafka's producers are typically asynchronous out of the box, so it is reasonable to expect that a process restart (e.g. a container restart) or a network outage will result in message loss on the producing application's side. Also, LazyPersistence can lead to observable message loss when the Kafka process or server fails.
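As a rough sketch (assuming the Java client and a broker at localhost:9092), blocking on the send future and flushing before shutdown makes producer-side loss observable instead of silent, and acks=all additionally waits for the in-sync replicas:
import java.util.Properties;
import org.apache.kafka.clients.producer.*;

public class SynchronousSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                           // wait for the in-sync replicas, not just the leader
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the returned future makes the send effectively synchronous: if the broker
            // never acknowledges, the producing application sees an exception instead of silent loss.
            RecordMetadata meta = producer.send(new ProducerRecord<>("events", "order-42")).get();
            System.out.println("acked at offset " + meta.offset());
            producer.flush();  // drain anything still buffered before shutting down
        }
    }
}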
ActiveMQ's approach is to sync data to disk using the OS system call fsync(), which is supposed to result in a write to disk. When you combine that with RAID storage, you have the most practical guarantee of data not being lost.
However, there is an alternative pattern that has nothing to do with persistence and that can achieve a higher degree of guarantee. It is used by some financial trading systems and defense applications.
It is often referred to as 'fanout', and ActiveMQ has a fanout transport included in its client. It works like this:
Producer sends message to 3 servers (they should be as isolated and separated from each other as possible).
Consumer(s) receive up to 3 messages.
First message through "wins" and the consumer app drops the other 2 messages.
With this approach, you can skip persistence altogether, since you have 3 independent routes and the odds of all 3 failing are low. (There are strategies to improve producer-side QOS in the event producer's network is offline).
Consumer has the option of processing first-message (fast) or requiring at least 2 messages to process and validate that the request is legit (secure, but higher latency).
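A minimal sketch of the producer side with the ActiveMQ "classic" JMS client follows; the three broker addresses and the "orders" topic are made up, consumer-side de-duplication is left out, and you should check the fanout transport reference for its queue vs. topic behaviour:
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FanoutProducer {
    public static void main(String[] args) throws Exception {
        // Three hypothetical brokers, as isolated from each other as possible.
        String url = "fanout:(static:(tcp://brokerA:61616,tcp://brokerB:61616,tcp://brokerC:61616))";
        ConnectionFactory factory = new ActiveMQConnectionFactory(url);
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createTopic("orders"));
            // Persistence is skipped entirely; redundancy comes from the three independent routes.
            producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
            TextMessage msg = session.createTextMessage("order-123");
            msg.setStringProperty("dedupId", "order-123");   // consumers keep the first copy and drop the rest
            producer.send(msg);
        } finally {
            connection.close();
        }
    }
}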
Quoting the ZooKeeper docs:
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
But I don't see any new problem that Zookeeper solves apart from being highly fault tolerant compared to a central database. All the guarantees that zookeeper assures can be guaranteed in a central database too.
Atomicity -> As it's a single node, all updates are atomic.
Sequential Consistency -> after an update, clients can wait for the ack before sending the next update, to maintain the sequence.
Single System Image, Reliability, Timeliness -> guaranteed as it's a single node.
So, avoiding a single point of failure is the only major advantage of using ZooKeeper. Please correct me if I'm wrong.
Zookeeper (and other consensus based systems) offers sequential consistency, strong consistency and high availability.
"apart from being highly fault tolerant" that's actually huge - the fault tolerance.
If you don't care about availability, you totally can use any other linearizable storage - even a directory with files will work.
Consensus-based systems, and systems built on top of them (e.g. ZooKeeper plus your own code), are used to implement state machine replication. All transitions are stored in a distributed log; to make it durable there are many copies. Consensus is about agreeing on the order of events in the log.
With the log available, the actual business code can consume events and advance its state machine - typical state machine transitions. Since each copy of the log has the same sequence of events, all state machines will reach the same state.
The key subtlety is timing - all logs will get the same events in the same order, but there is no guarantee when that happens; a node could be disconnected from the network, hence its log will be stale, and by extension its state machine as well.
To see the true latest value, as you would expect from a single source of truth, you have to use a linearizable read. One way of doing this is to append the read operation to the log itself and wait for it to be committed. The read does nothing to the state machine, but the fact that a reader placed something in the log and got it committed signals that the entire log up to that point has been read - there is no stale data. ("Not stale" means that all writes that happened before the read are reflected; new writes may still arrive while the read is in progress.)
All of this complexity comes from the availability requirement - a cluster with three nodes can let one node go down without affecting operations.
So, yes, you could use any linearizable storage to do the same, ignoring availability. You could do this by keeping the log of events in a table and having every client track a pointer (or id) to the last applied operation; that way every client can go and advance its own state machine.
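A toy sketch of that last idea in Java, assuming a hypothetical event_log table (id bigserial primary key, event text) in your linearizable store and a trivial key=value state machine; each client simply replays everything after its own pointer:
import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class TableLogClient {
    private final Map<String, String> stateMachine = new HashMap<>();  // the state this client derives
    private long lastApplied = 0;                                      // pointer to the last applied event

    void catchUp(Connection db) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT id, event FROM event_log WHERE id > ? ORDER BY id")) {
            ps.setLong(1, lastApplied);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    apply(rs.getString("event"));   // same events, same order => same state on every client
                    lastApplied = rs.getLong("id");
                }
            }
        }
    }

    private void apply(String event) {
        // events are assumed to look like "set key=value"
        String[] kv = event.substring("set ".length()).split("=", 2);
        stateMachine.put(kv[0], kv[1]);
    }

    public static void main(String[] args) throws SQLException {
        TableLogClient client = new TableLogClient();
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/demo", "app", "secret")) {  // made-up connection details
            client.catchUp(db);
            System.out.println(client.stateMachine);
        }
    }
}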
I was reading about message queues and found that messages can be of two types: persistent and non-persistent.
Persistent messages are stored on disk or in a database so that they survive a broker restart, while non-persistent messages are stored in memory and do not survive a broker restart.
Persistent messaging is usually slower than non-persistent delivery.
But I am unable to think of a specific use-case of non-persistent messages.
Can anyone give an example of when a programmer should use non-persistent messages?
Generally speaking, when it doesn't matter much if you lose some messages.
For example, railroad signaling... the signals send their state every few seconds. If one or two get lost, there are more coming.
Or stock price display... if the display fails to update for a bit, it's not really a big deal. Not talking about trading activity here - just display in a public area or something.
Aside from specific "business" applications where persistence might not be required, there is another important reason why non-persistent messages might be preferred over persistent ones - performance. Sending and consuming non-persistent messages is almost always much, much faster than the same operations with persistent messages. When dealing with persistent messages the broker must interact with a storage device (e.g. local HDD, local SSD, network attached storage, etc.) which will usually be orders of magnitude slower than RAM (i.e. where non-persistent messages live).
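For illustration, with a JMS broker such as ActiveMQ this is just a per-producer (or per-message) switch; a minimal sketch assuming a local broker and a hypothetical "ticker" queue:
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class NonPersistentSender {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed local broker
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("ticker"));
            // NON_PERSISTENT: the broker keeps the message in memory only, so it is lost on restart,
            // but the send avoids the disk sync and is much faster.
            producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
            producer.send(session.createTextMessage("AAPL 187.42"));
        } finally {
            connection.close();
        }
    }
}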
The Raft algorithm used by etcd and the ZAB algorithm used by ZooKeeper both use a replicated log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values. And why did those systems decide to use a replicated log?
In my example, we have the following setup:
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
1. Machine A receives a write request and stores the pending write V2
2. Machine A sends a prepare request to Machine B and Machine C
3. The followers (Machine B and Machine C) send an acknowledgement to the leader (Machine A)
4. After the leader (Machine A) receives acknowledgements from a quorum of machines, it knows V2 is now committed and sends a success response to the client
5. The leader (Machine A) sends a finalize request to the followers (Machine B and Machine C) to inform them that V2 is committed and V1 can be discarded
For this system to work, on a leader change, after acquiring the leader lease the new leader has to get the latest data version by reading from a quorum of nodes before accepting requests.
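To make the steps concrete, here is a toy sketch of that write path in Java; the Follower interface, timeout, and quorum arithmetic are made up for illustration, and leader election, retries, and failure handling are deliberately left out:
import java.util.List;
import java.util.concurrent.*;

public class VersionedLeader {
    interface Follower { boolean prepare(long version, String value); void commit(long version); }

    private final List<Follower> followers;
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private volatile long committedVersion = 1;                   // V1 is the last committed version

    VersionedLeader(List<Follower> followers) { this.followers = followers; }

    boolean write(String value) throws InterruptedException {
        long pending = committedVersion + 1;                      // step 1: store the pending write V2
        int quorum = (followers.size() + 1) / 2 + 1;              // the leader counts toward the quorum
        CountDownLatch acks = new CountDownLatch(quorum - 1);
        for (Follower f : followers) {                            // step 2: prepare requests
            pool.submit(() -> { if (f.prepare(pending, value)) acks.countDown(); });
        }
        if (!acks.await(1, TimeUnit.SECONDS)) return false;       // steps 3-4: wait for a quorum of acks
        committedVersion = pending;                               // V2 is now committed
        for (Follower f : followers) pool.submit(() -> f.commit(pending)); // step 5: finalize, V1 can be discarded
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Follower alwaysAck = new Follower() {
            public boolean prepare(long v, String val) { return true; }
            public void commit(long v) { }
        };
        VersionedLeader leader = new VersionedLeader(List.of(alwaysAck, alwaysAck));
        System.out.println("committed: " + leader.write("V2 payload"));
        leader.pool.shutdown();
    }
}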
The Raft algorithm in etcd and the ZAB algorithm in ZooKeeper both use a replicated log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (2001). He described two algorithms: Single Decree Paxos, to build a distributed linearizable write-once register, and Multi-Paxos, to make a distributed state machine on top of an append-only log (an ordered array of write-once registers).
An append-only log is a much more powerful abstraction than a write-once register, so it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change, which is vital in practice: if you can't replace failed nodes, then eventually your cluster becomes unavailable.
Although Vertical Paxos is a good paper, I found Raft's idea of cluster membership change via joint consensus much easier to understand, so I wrote a post on how to adapt Raft's approach for Single Decree Paxos.
With time, the "write-once" nature of Single Decree Paxos was also overcome, turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for many use cases. In the wild I saw that approach in the Treode database. If you're interested, I blogged about this improved SDP in the How Paxos Works post.
So now that we have an alternative to logs, it makes sense to consider it, because log-based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
size of the log is limited by the size of one node
protocols for splitting a log and migration to a new cluster are not well-known
And why did those systems decide to use a replicated log?
The log-based approach is older than the alternative, so it has had more time to gain popularity.
About your example
It's hard to evaluate, because you didn't describe how leader election happens, how conflicts between leaders are resolved, what the strategy is for handling failures, and how the membership of the cluster is changed.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified
Scenario: I have a low-volume topic (~150 msgs/sec) for which we would like to have a low propagation delay from producer to consumer.
I added a timestamp at the producer and read it at the consumer to record the propagation delay. With default configurations, a message of 20 bytes showed a propagation delay of 1960ms to 1230ms. No network delay is involved, since I tried this with 1 producer and 1 simple consumer on the same machine.
When I tried adjusting the topic flush interval to 20ms, it dropped to 1100ms - 980ms. Then I tried adjusting the consumer's "fetcher.backoff.ms" to 10ms, and it dropped to 1070ms - 860ms.
Issue: For a 20-byte message, I would like the propagation delay to be as low as possible, and ~950ms is too high.
Question: Anything I am missing out in configuration?
I do welcome comments on the minimum delay you have achieved.
Assumption: The Kafka system involves disk I/O before the consumer gets the message from the producer, so this depends on the hard disk RPM and so on.
Update:
Tried to tune the Log Flush Policy for Durability & Latency. The following is the configuration:
# The number of messages to accept before forcing a flush of data to disk
log.flush.interval=10
# The maximum amount of time a message can sit in a log before we force a flush
log.default.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be
# flushed to disk.
log.default.flush.scheduler.interval.ms=100
For the same 20-byte message, the delay was 740ms - 880ms.
The following statements are made clear in the configuration itself.
There are a few important trade-offs:
Durability: Unflushed data is at greater risk of loss in the event of a crash.
Latency: Data is not made available to consumers until it is flushed (which adds latency).
Throughput: The flush is generally the most expensive operation.
So, I believe there is no way to come down to the 150ms - 250ms mark (without a hardware upgrade).
I am not trying to dodge the question, but I think that Kafka is a poor choice for this use case. While I think Kafka is great (I have been a huge proponent of its use at my workplace), its strength is not low latency. Its strengths are high producer throughput and support for both fast and slow consumers. While it does provide durability and fault tolerance, so do more general-purpose systems like RabbitMQ. RabbitMQ also supports a variety of different clients, including node.js. Where RabbitMQ falls short when compared to Kafka is when you are dealing with extremely high volumes (say 150K msg/s). At that point, Rabbit's approach to durability starts to fall apart and Kafka really stands out. Rabbit's durability and fault tolerance are more than adequate at 20K msg/s (in my experience).
Also, to achieve such high throughput, Kafka deals with messages in batches. While the batches are small and their size is configurable, you can't make them too small without incurring a lot of overhead. Unfortunately, message batching makes low-latency very difficult. While you can tune various settings in Kafka, I wouldn't use Kafka for anything where latency needed to be consistently less than 1-2 seconds.
Also, Kafka 0.7.2 is not a good choice if you are launching a new application. All of the focus is on 0.8 now, so you will be on your own if you run into problems, and I definitely wouldn't expect any new features. For future stable releases, follow the stable Kafka release link.
Again, I think Kafka is great for some very specific, though popular, use cases. At my workplace we use both Rabbit and Kafka. While that may seem gratuitous, they really are complementary.
I know it's been over a year since this question was asked, but I've just built up a Kafka cluster for dev purposes, and we're seeing <1ms latency from producer to consumer. My cluster consists of three VM nodes running on a cloud VM service (Skytap) with SAN storage, so it's far from ideal hardware. I'm using Kafka 0.9.0.0, which is new enough that I'm confident the asker was using something older. I have no experience with older versions, so you might get this performance increase simply from an upgrade.
I'm measuring latency by running a Java producer and consumer I wrote. Both run on the same machine, on a fourth VM in the same Skytap environment (to minimize network latency). The producer records the current time (System.nanoTime()), uses that value as the payload in an Avro message, and sends (acks=1). The consumer is configured to poll continuously with a 1ms timeout. When it receives a batch of messages, it records the current time (System.nanoTime() again), then subtracts the receive time from the send time to compute latency. When it has 100 messages, it computes the average of all 100 latencies and prints to stdout. Note that it's important to run the producer and consumer on the same machine so that there is no clock sync issue with the latency computation.
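Since the original code is no longer available (see the edit below), here is a rough reconstruction of that measurement against a current Java client, simplified to a plain Long payload instead of Avro; the broker address and the "latency-test" topic are placeholders:
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import java.time.Duration;
import java.util.*;

public class LatencyProbe {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        pp.put("acks", "1");                              // leader-only ack, as in the original measurement
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.LongSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "latency-probe");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaProducer<String, Long> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList("latency-test"));   // hypothetical topic
            List<Long> latencies = new ArrayList<>();
            while (latencies.size() < 100) {
                // The payload is the send timestamp; producer and consumer share one clock on this machine.
                producer.send(new ProducerRecord<>("latency-test", "k", System.nanoTime()));
                ConsumerRecords<String, Long> records = consumer.poll(Duration.ofMillis(1));
                for (ConsumerRecord<String, Long> r : records) {
                    latencies.add(System.nanoTime() - r.value());
                }
            }
            double avgMs = latencies.stream().mapToLong(Long::longValue).average().orElse(0) / 1_000_000.0;
            System.out.printf("average latency over %d messages: %.2f ms%n", latencies.size(), avgMs);
        }
    }
}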
I've played quite a bit with the volume of messages generated by the producer. There is definitely a point where there are too many and latency starts to increase, but it's substantially higher than 150/sec. The occasional message takes as much as 20ms to deliver, but the vast majority are between 0.5ms and 1.5ms.
All of this was accomplished with Kafka 0.9's default configurations. I didn't have to do any tweaking. I used batch-size=1 for my initial tests, but I found later that it had no effect at low volume and imposed a significant limit on the peak volume before latencies started to increase.
It's important to note that when I run my producer and consumer on my local machine, the exact same setup reports message latencies in the 100ms range -- the exact same latencies reported if I simply ping my Kafka brokers.
I'll edit this message later with sample code from my producer and consumer along with other details, but I wanted to post something before I forget.
EDIT, four years later:
I just got an upvote on this, which led me to come back and re-read. Unfortunately (but actually fortunately), I no longer work for that company, and no longer have access to the code I promised I'd share. Kafka has also matured several versions since 0.9.
Another thing I've learned in the ensuing time is that Kafka latencies increase when there is not much traffic. This is due to the way the clients use batching and threading to aggregate messages. It's very fast when you have a continuous stream of messages, but any time there is a moment of "silence", the next message will have to pay the cost to get the stream moving again.
It's been some years since I was deep in Kafka tuning. Looking at the latest version (2.5 -- producer configuration docs here), I can see that they've decreased linger.ms (the amount of time a producer will wait before sending a message, in hopes of batching up more than just the one) to zero by default, meaning that the aforementioned cost to get moving again should not be a thing. As I recall, in 0.9 it did not default to zero, and there was some tradeoff to setting it to such a low value. I'd presume that the producer code has been modified to eliminate or at least minimize that tradeoff.
Modern versions of Kafka seem to have pretty minimal latency as the results from here show:
2 ms (median)
3 ms (99th percentile)
14 ms (99.9th percentile)
Kafka can achieve around millisecond latency by using synchronous messaging. With synchronous messaging, the producer does not collect messages into a batch before sending.
bin/kafka-console-producer.sh --broker-list my_broker_host:9092 --topic test --sync
The following has the same effect:
--batch-size 1
If you are using librdkafka as Kafka client library, you must also set socket.nagle.disable=True
See https://aivarsk.com/2021/11/01/low-latency-kafka-producers/ for some ideas on how to see what is taking those milliseconds.
I am new to message brokers like RabbitMQ which we can use to create tasks / message queues for a scheduling system like Celery.
Now, here is the question:
I can create a table in PostgreSQL which can be appended with new tasks and consumed by the consumer program like Celery.
Why on earth would I want to setup a whole new tech for this like RabbitMQ?
Now, I believe scaling cannot be the answer since our database like PostgreSQL can work in a distributed environment.
I googled what problems a database poses for this particular use case, and I found:
polling keeps the database busy and low performing
locking of the table -> again low performing
millions of rows of tasks -> again, polling is low performing
Now, how does RabbitMQ or any other message broker like that solves these problems?
Also, I found out that it follows the AMQP protocol. What's so great about that?
Can Redis also be used as a message broker? I find it more analogous to Memcached than RabbitMQ.
Please shed some light on this!
Rabbit's queues reside in memory and will therefore be much faster than implementing this in a database. A (good) dedicated message queue should also provide essential queuing-related features such as throttling/flow control, and the ability to choose different routing algorithms, to name a couple (Rabbit provides these and more). Depending on the size of your project, you may also want the message passing component separate from your database, so that if one component experiences heavy load, it need not hinder the other's operation.
As for the problems you mentioned:
polling keeping the database busy and low performing: Using RabbitMQ, producers can push updates to consumers, which is far more performant than polling (see the sketch after this list). Data is simply sent to the consumer when it needs to be, eliminating the need for wasteful checks.
locking of the table -> again low performing: There is no table to lock :P
millions of rows of tasks -> again polling is low performing: As mentioned above, RabbitMQ will operate faster as it resides in RAM, and it provides flow control. If needed, it can also use the disk to temporarily store messages if it runs out of RAM. After 2.0, Rabbit has significantly improved its RAM usage. Clustering options are also available.
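For illustration, here is roughly what that push model looks like with the RabbitMQ Java client (amqp-client 5.x API), assuming a local broker and a hypothetical "tasks" queue; the broker delivers messages to the callback, so the consumer never polls:
import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;

public class PushConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                       // assumed local broker
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.queueDeclare("tasks", true, false, false, null);   // durable queue
            // The broker pushes deliveries to this callback as they arrive.
            DeliverCallback onMessage = (consumerTag, delivery) -> {
                System.out.println("got: " + new String(delivery.getBody(), StandardCharsets.UTF_8));
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };
            channel.basicConsume("tasks", false, onMessage, consumerTag -> { });
            Thread.sleep(60_000);  // keep the process alive while deliveries arrive
        }
    }
}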
In regards to AMQP, I would say a really cool feature is the "exchange", and the ability for it to route to other exchanges. This gives you more flexibility and enables you to create a wide array of elaborate routing topologies, which can come in very handy when scaling. For a good example, see:
http://blog.springsource.org/2011/04/01/routing-topologies-for-performance-and-scalability-with-rabbitmq/
Finally, in regards to Redis, yes, it can be used as a message broker, and can do well. However, RabbitMQ has more message queuing features than Redis, as RabbitMQ was built from the ground up to be a full-featured, enterprise-level, dedicated message queue. Redis, on the other hand, was primarily created to be an in-memory key-value store (though it does much more than that now; it's even referred to as a swiss army knife). Still, I've read/heard of many people achieving good results with Redis for smaller-sized projects, but haven't heard much about it in larger applications.
Here is an example of Redis being used in a long-polling chat implementation: http://eflorenzano.com/blog/2011/02/16/technology-behind-convore/
PostgreSQL 9.5
PostgreSQL 9.5 incorporates SELECT ... FOR UPDATE ... SKIP LOCKED. This makes implementing working queuing systems a lot simpler and easier. You may no longer require an external queueing system since it's now simple to fetch 'n' rows that no other session has locked, and keep them locked until you commit confirmation that the work is done. It even works with two-phase transactions for when external co-ordination is required.
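As a rough sketch of that pattern from a worker written in Java (the connection string and the tasks table with id/payload/done columns are made up for illustration):
import java.sql.*;

public class QueueWorker {
    public static void main(String[] args) throws SQLException {
        // Assumed table: CREATE TABLE tasks (id bigserial PRIMARY KEY, payload text, done boolean DEFAULT false);
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/queue_demo", "worker", "secret")) {
            db.setAutoCommit(false);
            // Claim one task no other session has locked; rows locked by other workers are skipped.
            try (PreparedStatement claim = db.prepareStatement(
                    "SELECT id, payload FROM tasks WHERE NOT done " +
                    "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
                 ResultSet rs = claim.executeQuery()) {
                if (rs.next()) {
                    long id = rs.getLong("id");
                    System.out.println("processing: " + rs.getString("payload"));
                    try (PreparedStatement ack = db.prepareStatement(
                            "UPDATE tasks SET done = true WHERE id = ?")) {
                        ack.setLong(1, id);
                        ack.executeUpdate();
                    }
                }
            }
            db.commit();  // the row lock is held until here, so a crash before commit returns the task to the queue
        }
    }
}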
External queueing systems remain useful, providing canned functionality, proven performance, integration with other systems, options for horizontal scaling and federation, etc. Nonetheless, for simple cases you don't really need them anymore.
Older versions
You don't need such tools, but using one may make life easier. Doing queueing in the database looks easy, but you'll discover in practice that high performance, reliable concurrent queuing is really hard to do right in a relational database.
That's why tools like PGQ exist.
You can get rid of polling in PostgreSQL by using LISTEN and NOTIFY, but that won't solve the problem of reliably handing out entries off the top of the queue to exactly one consumer while preserving highly concurrent operation and not blocking inserts. All the simple and obvious solutions you think will solve that problem actually don't in the real world, and tend to degenerate into less efficient versions of single-worker queue fetching.
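For illustration, here is roughly what the LISTEN side looks like from Java with a recent pgjdbc driver (the connection details and the new_task channel name are made up); note this only removes the polling - you still need something like the SKIP LOCKED query above, or your own locking scheme, to hand each entry to exactly one consumer:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class QueueWakeup {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/queue_demo", "worker", "secret")) {
            try (Statement st = db.createStatement()) {
                st.execute("LISTEN new_task");          // producers run: NOTIFY new_task
            }
            PGConnection pg = db.unwrap(PGConnection.class);
            while (true) {
                PGNotification[] wakeups = pg.getNotifications(10_000);  // block up to 10s instead of busy polling
                if (wakeups != null && wakeups.length > 0) {
                    // Something was inserted; now claim work, e.g. with SKIP LOCKED on 9.5+
                    // or your own locking scheme on older versions.
                    System.out.println("woke up: " + wakeups.length + " notification(s)");
                }
            }
        }
    }
}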
If you don't need highly concurrent multi-worker queue fetches then using a single queue table in PostgreSQL is entirely reasonable.