Why is Kafka pull-based instead of push-based? I agree Kafka gives high throughput, as I have experienced it, but I don't see how Kafka's throughput would go down if it were push-based. Any ideas on how a push-based design could degrade performance?
Scalability was the major driving factor in the design of such systems (pull vs push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add a large number of consumers without affecting performance and without downtime.
Kafka can handle events coming from producers at a rate of 100k+ per second. Because Kafka consumers pull data from the topic, different consumers can consume the messages at a different pace. Kafka also supports different consumption models. You can have one consumer processing the messages in real time and another consumer processing the messages in batch mode.
The other reason could be that Kafka was designed not only for single consumers like Hadoop. Different consumers can have diverse needs and capabilities.
Pull-based systems have some deficiencies, such as wasting resources on regular polling when no data is available. Kafka alleviates this drawback with a 'long polling' mode, in which a fetch request waits until real data comes through.
Refer to the Kafka documentation which details the particular design decision: Push vs pull
Major points that were in favor of pull are:
Pull is better at dealing with diversified consumers (without a broker determining the data transfer rate for all);
Consumers can more effectively control the rate of their individual consumption;
Easier and more optimal batch processing implementation.
The drawback of a pull-based system (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' waiting mode until data arrives.
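To make the 'long poll' idea concrete, here is a minimal sketch. It is not Kafka's API: InMemoryBroker, publish, fetch and max_wait_s are names I made up for illustration. The consumer asks for data from an offset it tracks itself, and the call blocks up to a timeout instead of returning empty right away.

```python
import threading


class InMemoryBroker:
    """Toy stand-in for a broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._messages = []
        self._cond = threading.Condition()

    def publish(self, message):
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()  # wake up any consumer blocked in fetch()

    def fetch(self, offset, max_messages=10, max_wait_s=1.0):
        """Return up to max_messages starting at offset.

        If nothing is available yet, wait up to max_wait_s for data
        (the long poll) and return an empty list on timeout.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: len(self._messages) > offset, timeout=max_wait_s
            )
            return self._messages[offset:offset + max_messages]


broker = InMemoryBroker()
# A producer publishes shortly after the consumer has started waiting.
threading.Timer(0.05, broker.publish, args=("hello",)).start()
batch = broker.fetch(offset=0, max_wait_s=2.0)
```

The consumer decides when to call fetch and how many messages it wants, so the pace stays under its control, and an idle consumer costs one parked request rather than a busy loop.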
Others have provided answers based on Kafka's documentation, but product documentation should sometimes be taken with a grain of salt rather than treated as an absolute technical reference. For example:
Numerous push-based messaging systems support consumption at
different rates, usually through their session management primitives.
You establish/resume an active application layer session when you
want to consume and suspend the session (e.g. by simply not
responding for less than the keepalive window and greater than the in-flight windows...or with an explicit message) when you want to
stop/pause. MQTT and AMQP, for example, both provide this capability
(in MQTT's case, since the late 1990s). Given that no actions are
required to pause consumption (by definition), and less traffic is
required in steady state (no request), it is difficult to
see how Kafka's pull-based model is more efficient.
One critical advantage push messaging has vs. pull messaging is that
there is no request traffic to scale as the number of potentially
active topics increases. If you have a million potentially active
topics, you have to issue queries for all those topics. This
concern becomes especially relevant at scale.
The critical advantage pull messaging has vs push messaging is replayability. This factors a great deal into whether downstream systems can offer guarantees around processing (for example, a consumer might fail before processing a message and have to restart, or fail to write messages recoverably).
Another critical advantage for pull messaging vs push messaging is buffer allocation. A consuming process can explicitly request as much data as it can accommodate in a pre-allocated buffer, rather than having to allocate buffers over and over again. This gains back some of the goodput losses vs push messaging from query scaling (but not much). The impact here is measurable, however, if your message sizes vary wildly (e.g. from a few KB to a few hundred MB).
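Both points can be illustrated with a small sketch of an append-only log that consumers read by offset. This is my own toy model, not Kafka code; AppendOnlyLog, fetch and max_bytes are hypothetical names. Replay is just fetching from an older offset again, and max_bytes lets the consumer cap each fetch to what its buffer can hold.

```python
class AppendOnlyLog:
    """Toy offset-addressed log (hypothetical, for illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, payload: bytes) -> int:
        """Store a record and return its offset."""
        self._records.append(payload)
        return len(self._records) - 1

    def fetch(self, offset: int, max_bytes: int):
        """Return (records, next_offset) starting at offset.

        Records are added while their total size stays within max_bytes.
        The first record is always returned even if it is larger, so one
        oversized record cannot stall the consumer (an assumption of this
        sketch).
        """
        batch, size, pos = [], 0, offset
        while pos < len(self._records):
            record = self._records[pos]
            if batch and size + len(record) > max_bytes:
                break
            batch.append(record)
            size += len(record)
            pos += 1
        return batch, pos


log = AppendOnlyLog()
for payload in (b"aa", b"bbb", b"c"):
    log.append(payload)

first, next_offset = log.fetch(0, max_bytes=5)   # consumer-sized batch
replayed, _ = log.fetch(0, max_bytes=5)          # same data again after a crash
```

The log keeps no per-consumer state here; the consumer owns the offset, which is what makes restarting from an earlier position cheap.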
It is a fallacy to suggest that pull messaging has structural scalability advantages over push messaging. Partitioning is what is usually used to provide scale in messaging applications, regardless of the consumption model. There are push messaging systems operating well in excess of 300M msgs/sec on hard wired local clusters...125K msgs/sec doesn't even buy admission to the show. In fact, pull messaging has inferior goodput by definition and systems like Kafka usually end up with more hardware to reach the same performance level. The benefits noted above may often make it worth the cost. I am unaware of anyone using Kafka for messaging in high frequency trading, for example, where microseconds matter.
It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize the goodput. The results were never staggering and the system complexity and other factors often outweigh this kind of optimization. I believe this is Jay's point overall about practical performance over real data center networks, not to mention things like the open Internet.
Pushing is just extra work for the broker. With Kafka, the responsibility of fetching messages is on consumers. Consumers can decide at what rate they want to process the messages.
If a broker is pushing messages and some of the consumers are down, the broker will retry a certain number of times before it gives up on pushing to them. This decreases performance. Imagine the workload of pushing messages to multiple consumers.
I have started using LaunchDarkly (LD) recently, and I was exploring how LD updates its feature flags.
As mentioned here, there are two ways.
Streaming
Polling
I was wondering which implementation is better in which cases. After a little research on streaming vs polling, I found that streaming has the following advantages over polling.
Faster than polling
Receives only the latest data instead of re-fetching data that is the same as before
Avoids periodic requests
I am pretty sure all of the above advantages come at a cost. So,
Are there any downsides of using streaming over polling?
In what scenarios should polling be preferred, or the other way around?
On what factors should I decide whether to stream or poll?
Streaming
Streaming requires your application to be always alive. This might not be the case in a serverless environment. Furthermore, a streaming solution usually relies on a connection that is always open in the background. This might be costly, so feature flag providers tend to limit the number of concurrent connections you can keep open to their infrastructure. This might not be a problem if you use feature flags only in a few application instances. But you will easily reach the limit if you want to stream feature flag updates to mobile apps or a ton of microservices.
Polling
Polling sounds less fancy, but it's a reliable & robust old-school pattern that will work in almost all environments.
Webhooks
There is a third option too: webhooks. The basic idea is that you create an HTTP endpoint on your end and the feature flag service will call that endpoint whenever a feature flag value update happens. This way you get a "notification" about feature flag value changes. For example, ConfigCat supports this model. ConfigCat can notify your infrastructure by calling your webhooks and (optionally) pushing new values to your end. Webhooks have the advantage over streaming that they are cheap to maintain, so feature flag service providers don't limit them as much (for example, ConfigCat can give you unlimited webhooks).
How to decide
How to use the three options above really depends on your use case. A general rule of thumb is: use polling by default and add quasi real-time notifications (by streaming or by webhooks) to the components where it's critical to know about feature flag value updates.
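As a rough sketch of the "polling by default" option: the client below serves flag values from a local cache and refreshes the cache once the polling interval has elapsed. It is not any vendor's SDK. PollingFlagClient and fetch_flags are hypothetical names, and fetch_flags stands in for whatever HTTP call your provider offers.

```python
import time


class PollingFlagClient:
    """Cache flag values locally and refresh them at most once per interval."""

    def __init__(self, fetch_flags, interval_s=30.0, clock=time.monotonic):
        self._fetch_flags = fetch_flags  # hypothetical callable returning a dict
        self._interval_s = interval_s
        self._clock = clock              # injectable so the sketch is testable
        self._flags = {}
        self._last_poll = None

    def get(self, key, default=False):
        now = self._clock()
        due = self._last_poll is None or now - self._last_poll >= self._interval_s
        if due:
            try:
                self._flags = dict(self._fetch_flags())
            except Exception:
                pass  # keep serving the last known values if the poll fails
            self._last_poll = now
        return self._flags.get(key, default)


remote = {"new-ui": False}               # stands in for the flag service
client = PollingFlagClient(lambda: remote, interval_s=30.0)
enabled = client.get("new-ui")
```

A change made on the service side is not visible until the next poll, which is exactly the staleness window you trade for simplicity. A webhook or stream handler could reset the cache early for the components that need faster updates.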
In addition to Zoltan's answer, I found the following in LaunchDarkly's Effective Feature Management e-book (page 36):
In any networked system there are two methods to distribute information.
Polling is the method by which the endpoints (clients or servers) periodically ask for updates. Streaming, the second method, is when the central authority pushes the new values to all the endpoints as they change. Both options have pros and cons.
However, in a poll-based system, you are faced with an unattractive trade-off: either you poll infrequently and run the risk of different parts of your application having different flag states, or you poll very frequently and shoulder high costs in system load, network bandwidth, and the necessary infrastructure to support the high demands.
A streaming architecture, on the other hand, offers speed advantages and consistency guarantees. Streaming is a better fit for large-scale and distributed systems. In this design, each client maintains a long-running connection to the feature management system, which instantly sends down any changes as they occur to all clients.
Polling Pros:
Simple
Easily cached
Polling Cons:
Inefficient. All clients need to connect momentarily, regardless of whether there is a change.
Changes require roughly twice the polling interval to propagate to all clients.
Because of long polling intervals, the system could create a “split brain” situation, in which both new flag and old flag states exist at the same time.
Streaming Pros:
Efficient at scale. Each client receives messages only when necessary.
Fast Propagation. Changes can be pushed out to clients in real time.
Streaming Cons:
Requires the central service to maintain connections for every client
Assumes a reliable network
For my use case, I have decided to use polling (with a long polling interval) in places where I don't need to update the flags often and don't care about inconsistencies (split-brain),
and streaming for applications that need immediate flag updates and where consistency is important.
I am in the process of designing a system which acts as a message forwarder from one system to another. I have several options, but I would like to pick the one with the lowest resource consumption (CPU, RAM) and latency. Thus, I need your recommendations and views on this.
We assume that messages will be streaming to our system from a topic in Kafka. We need to forward all the messages from the topic to another host. There can be different strategies for this purpose.
Collect a certain number of messages, say 100 (batch processing), and send them at once within a single HTTP message.
When a message is received, the system sends it to the target host as an HTTP POST request.
Open a WebSocket between our system and the target host and send messages over it.
Behave like a Kafka producer and send messages to a topic.
Each of them might have advantages and disadvantages. I am concerned that the system may not be able to handle the high volume of incoming messages. Do you have any option other than these 4 items? Which do you think is the best option, and in terms of what?
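The first option (batching) could look roughly like the sketch below. BatchForwarder and send are hypothetical names; send stands in for whatever actually performs the HTTP POST. Besides the size limit from the question, I added an age limit (my own assumption) so a partly filled batch does not wait forever.

```python
import json
import time


class BatchForwarder:
    """Collect messages and forward them as one JSON array per batch."""

    def __init__(self, send, max_batch=100, max_age_s=0.5, clock=time.monotonic):
        self._send = send            # hypothetical callable taking the request body
        self._max_batch = max_batch
        self._max_age_s = max_age_s  # upper bound on the latency added by batching
        self._clock = clock
        self._pending = []
        self._oldest = None          # arrival time of the oldest pending message

    def add(self, message):
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(message)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the batch is full or its oldest message is too old."""
        if not self._pending:
            return
        full = len(self._pending) >= self._max_batch
        stale = self._clock() - self._oldest >= self._max_age_s
        if full or stale:
            self.flush()

    def flush(self):
        if self._pending:
            self._send(json.dumps(self._pending).encode("utf-8"))
            self._pending = []       # only cleared if send() did not raise


sent_bodies = []
forwarder = BatchForwarder(sent_bodies.append, max_batch=3)
for message in ("a", "b", "c", "d"):
    forwarder.add(message)
forwarder.flush()                    # push out the remainder on shutdown
```

In a real forwarder you would call flush_if_due() from a timer or from the consume loop, since add() alone only checks the age when a new message arrives.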
How important is your latency requirement?
HTTP is quite slow compared to a UDP-based messaging system, but maybe you don't need latency that tightly tuned.
Batching your messages will increase latency, as you may know.
But it's confusing, because the title of this page is "rest - forwarding" =).
Does it have to be REST (and so HTTP)? It seems you could just as well act like a Kafka producer, and in that case it's not REST.
The memory footprint of Kafka may be a bit high (Java lib), but not by much.
Are you working on an embedded system (and so trying to reduce the memory footprint)?
As for CPU, it depends on what we're comparing Kafka with, but I still think Kafka is quite well optimised for performance.
I think we lack information about this "another host"; could you give more details about its purpose?
Yannick
I think you are looking for Kafka Streaming in this scenario. From an efficiency point of view, though, some Hadoop stack implementation (Flume) or Spark might consume fewer resources. I'm not sure; it depends on the amount of data, network hops, disk used, and amount of memory.
If you have huge amounts of messages, those distributed solutions are the right approach, not a custom REST client.
I am very new to Kafka, and I am researching whether Kafka can be used as a real-time messaging broker rather than retaining messages and then sending them. In other words, can it just do the basic pub/sub broker job without retaining anything at all?
Is it configurable in Kafka Server configurations?
I don't think it's possible to accomplish this. One of the key differences between Kafka and other messaging systems is that Kafka relies on the underlying OS to handle storage.
Another unconventional choice that we made is to avoid explicitly
caching messages in memory at the Kafka layer. Instead, we rely on
the underlying file system page cache. Whitepaper
Kafka therefore writes messages to disk automatically and retains them by default. This is a conscious decision by the designers of Kafka, who believe it is worth the tradeoffs.
If you're asking this because you're worried that writing to disk may be slower than keeping things in memory, the whitepaper addresses that:
We have found that both the production and the
consumption have consistent performance linear to the data size,
up to many terabytes of data. Whitepaper
So the size of the data that you've retained doesn't impact how fast the system is.
I have been learning Storm and Samza in order to understand how stream processing engines work, and realized that both of them are standalone applications: in order to process an event, I need to add it to a queue that is also connected to the stream processing engine. That means I need to add the event to a queue (which is also a standalone application, let's say Kafka), and Storm will pick the event from the queue and process it in a worker process. And if I have multiple bolts, each bolt will be run by a different worker process. (This is one of the things I don't really understand: I have seen a company that uses more than 20 bolts in production, with each event transferred between bolts along a certain path.)
However, I don't really understand why I would need such complex systems. The process involves too many IO operations (my program -> queue -> storm -> bolts), which makes it much harder to control and debug.
Instead, if I'm collecting the data from web servers, why not just use the same node for event processing? The operations will already be distributed over the nodes by the load balancers which I use for the web servers. I can create executors on the same JVM instances and send the events from the web server to the executor asynchronously without involving any extra IO requests. I can also watch the executors in the web servers and make sure that the executor processed the events (at-least-once or exactly-once processing guarantee). In this way, it will be a lot easier to manage my application, and since not much IO is required, it will be faster than the other way, which involves sending the data to another node over the network (which is also not reliable) and processing it on that node.
Most probably I'm missing something here, because I know that many companies actively use Storm and many people I know recommend Storm or other stream processing engines for real-time event processing, but I just don't understand it.
My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether cpu-bound, I/O-bound or both) from the application/web servers and keep them responsive.
Consider that each application server may have to serve a large number of concurrent requests, not all of them having to do with stream processing. If the app server is already processing a significant load of events, then it could constitute a bottleneck for lighter requests, as the server resources (think cpu usage, memory, disk contention etc.) will already be tied to heavier processing requests.
If the actual load you need to face isn't that heavy, or if it can simply be handled by adding app server instances, then of course it doesn't make sense to complicate your architecture/topology, which could in fact slow the entire thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help you decide which way to go.
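For the light-load case, the in-process approach from the question can be sketched as below. This is Python rather than the JVM executors the question mentions, and process_event, submit_event and failed are placeholder names of mine. The request handler hands the event to a thread pool in the same process and records failures so they can be retried.

```python
from concurrent.futures import ThreadPoolExecutor

failed = []  # events whose processing raised; a real system would retry or persist them


def process_event(event):
    """Placeholder for the real event processing."""
    if "user" not in event:
        raise ValueError("malformed event")
    return event["user"].upper()


def submit_event(executor, event):
    """Hand the event to the pool without blocking the request handler."""
    future = executor.submit(process_event, event)

    def on_done(done_future, event=event):
        # Runs when processing finishes; record events that failed.
        if done_future.exception() is not None:
            failed.append(event)

    future.add_done_callback(on_done)
    return future


with ThreadPoolExecutor(max_workers=2) as executor:
    ok = submit_event(executor, {"user": "ann"})
    bad = submit_event(executor, {})
```

Note that the pending events live only in this process's memory, so they are lost if the server dies, and heavy processing competes with request handling for the same CPU. Those are the two things a separate queue and processing cluster buy you.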
You are right that sending data across the network adds to the total processing time.
However, these frameworks (Storm, Spark, Samza, Flink) were created to process a lot of data that potentially does not fit in memory of one computer. So, if we use more than one computer to process the data we can achieve parallelism.
As for your question about network latency: yes, this is a trade-off to consider. Developers have to be aware that they are writing programs to be deployed on a parallel framework. The way they build the application will also influence how much data is transferred over the network.
I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?
It is definitely in use at several Big Data companies already, including LinkedIn, where it was created (and later open sourced), and Tumblr. Just Tumblr by itself handles many gigabytes of messages per day. I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list, there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.
There is one critical feature I think Kafka is missing before it is ready for production:
"Flushing messages to disk if the producer can't reach any Kafka broker"
The issue was filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature would make the complete Kafka event pipeline even more robust for use cases where the producer always has to be able to send events, for example when you track pageviews or like-button clicks and you don't want to miss any events, even if all Kafka brokers are unreachable.
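Until something like that exists, you can approximate it on the producer side. The sketch below is a workaround of my own, not a Kafka feature: SpoolingProducer, send and the spool file format are all hypothetical, and send stands in for your real producer call, assumed here to raise ConnectionError when no broker is reachable.

```python
import json
import os


class SpoolingProducer:
    """Write events to a local file when sending fails, and resend them later."""

    def __init__(self, send, spool_path):
        self._send = send              # hypothetical callable; raises ConnectionError
        self._spool_path = spool_path  # one JSON document per line

    def publish(self, event):
        try:
            self._send(event)
        except ConnectionError:
            with open(self._spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(event) + "\n")
                spool.flush()
                os.fsync(spool.fileno())  # make sure the event survives a crash

    def drain(self):
        """Resend spooled events in order and return how many were sent.

        Stops at the first failure and keeps the unsent events in the spool.
        """
        if not os.path.exists(self._spool_path):
            return 0
        with open(self._spool_path, encoding="utf-8") as spool:
            events = [json.loads(line) for line in spool if line.strip()]
        sent = 0
        try:
            for event in events:
                self._send(event)
                sent += 1
        except ConnectionError:
            pass
        with open(self._spool_path, "w", encoding="utf-8") as spool:
            for event in events[sent:]:
                spool.write(json.dumps(event) + "\n")
        return sent
```

You would call drain() periodically or once a broker is reachable again. Two caveats of this sketch: events published directly while the spool is non-empty can overtake spooled ones, and a crash in the middle of drain() can resend events, so consumers should tolerate duplicates.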
I must agree with Dave. Kafka is a good tool, but it is missing some basic features. Some of them can be built manually, but then you have to ask what Kafka is actually providing. Some of the missing things are:
(As Dave said) Flushing messages to disk when the producer fails to send them
The ability for consumers to track which messages were handled (not just consumed) and which weren't, in case of a restart.
Monitoring - a way to get the current status of the entities in the system, like the current size of the queue in the producer or the read/write rate at the brokers (these can be built, but are not part of the tool).
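The second point (handled vs. merely consumed) is usually done by hand by storing the offset only after a message has been processed. A minimal sketch, assuming a list as the log and a dict as a stand-in for a durable offset store (consume, committed and handle are hypothetical names):

```python
def consume(log, committed, handle):
    """Process records from the committed offset onward.

    The offset is advanced only after handle() returns, so a message whose
    handler raised is seen again after a restart (at-least-once delivery).
    """
    offset = committed.get("offset", 0)
    while offset < len(log):
        handle(log[offset])            # may raise; the offset is then not advanced
        offset += 1
        committed["offset"] = offset   # record that this message was handled
    return offset


log = ["a", "b", "c"]
committed = {}                          # would be a file or database in practice
handled = []
consume(log, committed, handled.append)
```

If the process dies between handle() and the offset update, the message is handled twice on restart, so the handler should be idempotent.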
I have used Kafka for quite some time. Using the native Java and Python clients would be preferred.
I had to struggle a lot to find a proper node.js client. I literally rewrote my whole code many times using different clients, as they had a lot of bugs.
I finally settled on franz-kafka for node.js.
Apart from that, maintaining the consumer offsets is a bit difficult. It is missing some good features like the exchanges that exist in the AMQP-based Apache Qpid or RabbitMQ.
Since it's distributed, supports offline messages, and has really impressive performance, I prefer it too :)