I am very much new to Kafka, and i am researching if Kafka can be used as a real time messaging broker rather than retaining and sending. In other words can it just do the basic pub/sub broker job without retaining at all.
Is it configurable in Kafka Server configurations?
I don't think it's possible to accomplish this. One of the key differences between Kafka and other messaging systems is that Kafka uses the underlying OS's to handle storage.
Another unconventional choice that we made is to avoid explicitly
caching messages in memory at the Kafka layer. Instead, we rely on
the underlying file system page cache. Whitepaper
So Kafka automatically writes messages to disk, so it retains them by default. This is a conscious decision the designers of Kafka have made that they believe is worth the tradeoffs.
If you're asking this because you're worried writing to disk may be slower than keeping things in memory.
We have found that both the production and the
consumption have consistent performance linear to the data size,
up to many terabytes of data. Whitepaper
So the size of the data that you've retained doesn't impact how fast the system is.
Related
As the title states, have people used Kafka Streams' interactive queries for production-grade requests, or is the interactive querying more of a debugging feature?
I'm imagining that there may be issues around hot-keys if you tried to send too many QPS, which you can solve by caching, but I'm curious to know if there are more library-specific limitations.
It's used very frequently in production. In fact, tools like Confluent Control Center and KsqlDB heavily rely on the feature to work.
Yes, "hot keys" are a problem in any Kafka cluster, regardless of using Streams API. I'm not sure what you mean by caching, though unless you're referring to RocksDB or the in memory statestore as a cache
One main limitation is disk utilization against keyspaces with large cardinality, not "hot keys"; if the Streams app ever loses its state, it could take hours for it to rebuild. I've heard people get around this problem by writing their own state store implementations around maybe Redis or Aerospike
I am creating a service that sends lots of data to kafka-rest-proxy. I am only sending data (producing) to kafka. What I'm finding is that kafka-rest-proxy is easily overwhelmed and runs out of java heap space. I've allocated additional resources, and even horizontally scaled out the number of hosts running kafka-rest-proxy, yet I still encounter dropped connections and memory issues.
I'm not familiar with the internals of kafka-rest-proxy, but my hunch is that it's buffering the records and sending them to Kafka asynchronously. If that is the case then what mechanism does it have to control back pressure? Is there a way to configure it such that it writes records to Kafka synchronously?
Kafka REST Proxy exposes all of the functionality of the Java producers, consumer's command-line tools. Rest Proxy doesn't need any back pressure concept.
To be more specific, Kafka is capable of delivering messages over the network at an alarmingly fast rate.
You need to scale the brokers as per the rate you are producing and consuming the data.
I am in the process of designing a system which acts like a message forwarder from one system to another system. I have several options to go for but I would like to apply the best option which provides less resource consumption (cpu, ram) and latency. Thus, I need your recommendation and view on this.
We assume that messages will be streaming to our system from a topic in Kafka. We need to forward all the messages from the topic to another host. There can be different strategies for this purpose.
Collect certain number of messages let's say 100 messages (batch processing) and send them at once within a single HTTP message.
When one message is received, system will send this message as the http POST request to the target host.
Open webSocket between our system and the target host and send messages.
Behave like a Kafka producer and send messages to topic.
Each of them might have advantages and disadvantages. I have concern that system may not handle the high amount of messages coming. Do you have any option other than these 4 items? Which is the best option you think in terms of what?
How important your latency requirement is ?
HTTP is quite slow, compared to an UDP based messaging system, but maybe you don't need a so tailored latency.
Batching your messages will increase latency, as you may know.
But it's disturbing because the title of this page is "rest - forwarding" =).
Does it has to be REST ( so HTTP) ? because it seems you can as well envisage to act like a kafka producer, if so, it's not REST.
The memory footprint of Kafka may be a bit high (Java lib), but no so much.
Do you work on embedded system (willing to reduce memory footprint ?)
For CPU purposes.. it depends with what we're comparing Kafka, but I still think Kafka is quite optimised when asking for performance.
Think we lack more information about this "another host" , could you give more details about its purpose ?
Yannick
I think you are looking for Kafka Streaming in this scenario. Although from an efficiency point of view maybe some hadoop stack implementation (Flume) or Spark would be less consuming, not sure, depends on amount of data, network jumps, disk used, amount of memory.
If you have huge amounts of messages those distributed solutions should be your right approach not a custom REST client.
I read about lot of design stories where data reaches storage (both acid and non-sql) through Kafka. Not sure I understand in depth what case it solves. Why not directly?
On other hand, I've never seen other usages of Kafka. Is are other major case of?
Regards,
In short : coupling. If you write directly to your storage from your source system, you couple the two together. If you want to change one, you directly impact the other.
Kafka enables you to decouple this, and use data more effectively. Data in Kafka can be used by multiple independent consumers, so if you want to write it to multiple targets you still only extract it from the source system one.
This talk might help you understand further: "Embrace the Anarchy: Apache Kafka’s Role in Modern Data Architectures" Video / Slides
Why is Kafka pull-based instead of push-based? I agree Kafka gives high throughput as I had experienced it, but I don't see how Kafka throughput would go down if it were to pushed based. Any ideas on how push-based can degrade performance?
Scalability was the major driving factor when we design such systems (pull vs push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add large number of consumers without affecting performance and without down time.
Kafka can handle events at 100k+ per second rate coming from producers. Because Kafka consumers pull data from the topic, different consumers can consume the messages at different pace. Kafka also supports different consumption models. You can have one consumer processing the messages at real-time and another consumer processing the messages in batch mode.
The other reason could be that Kafka was designed not only for single consumers like Hadoop. Different consumers can have diverse needs and capabilities.
Pull-based systems have some deficiencies like resources wasting due to polling regularly. Kafka supports a 'long polling' waiting mode until real data comes through to alleviate this drawback.
Refer to the Kafka documentation which details the particular design decision: Push vs pull
Major points that were in favor of pull are:
Pull is better in dealing with diversified consumers (without a broker determining the data transfer rate for all);
Consumers can more effectively control the rate of their individual consumption;
Easier and more optimal batch processing implementation.
The drawback of a pull-based systems (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' waiting mode until data arrives.
Others have provided answers based on Kafka's documentation but sometimes product documentation should be taken with a grain of salt as an absolute technical reference. For example:
Numerous push-based messaging systems support consumption at
different rates, usually through their session management primitives.
You establish/resume an active application layer session when you
want to consume and suspend the session (e.g. by simply not
responding for less than the keepalive window and greater than the in-flight windows...or with an explicit message) when you want to
stop/pause. MQTT and AMQP, for example both provide this capability
(in MQTT's case, since the late 90's). Given that no actions are
required to pause consumption (by definition), and less traffic is
required under steady stable state (no request), it is difficult to
see how Kafka's pull-based model is more efficient.
One critical advantage push messaging has vs. pull messaging is that
there is no request traffic to scale as the number of potentially
active topics increases. If you have a million potentially active
topics, you have to issue queries for all those topics. This
concern becomes especially relevant at scale.
The critical advantage pull messaging has vs push messaging is replayability. This factors a great deal into whether downstream systems can offer guarantees around processing (e.g. they might fail before doing so and have to restart or e.g. fail to write messages recoverably).
Another critical advantage for pull messaging vs push messaging is buffer allocation. A consuming process can explicitly request as much data as they can accommodate in a pre-allocated buffer, rather than having to allocate buffers over and over again. This gains back some of the goodput losses vs push messaging from query scaling (but not much). The impact here is measurable, however, if your message sizes vary wildly (e.g. a few KB->a few hundred MB).
It is a fallacy to suggest that pull messaging has structural scalability advantages over push messaging. Partitioning is what is usually used to provide scale in messaging applications, regardless of the consumption model. There are push messaging systems operating well in excess of 300M msgs/sec on hard wired local clusters...125K msgs/sec doesn't even buy admission to the show. In fact, pull messaging has inferior goodput by definition and systems like Kafka usually end up with more hardware to reach the same performance level. The benefits noted above may often make it worth the cost. I am unaware of anyone using Kafka for messaging in high frequency trading, for example, where microseconds matter.
It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize the goodput. The results were never staggering and the system complexity and other factors often outweigh this kind of optimization. I believe this is Jay's point overall about practical performance over real data center networks, not to mention things like the open Internet.
Pushing is just extra work for the broker. With Kafka, the responsibility of fetching messages is on consumers. Consumers can decide at what rate they want to process the messages.
If a broker is pushing messages and if some of the consumers are down, the broker will retry certain times to push the messages till they decide not to push anymore. This decreases performance. Imagine the workload of pushing messages to multiple consumers.