I am studying the internals of Apache Kafka to understand how it works.
The Kafka brokers deal with the requests from the multiple producers and consumers.
I want to know how Kafka schedules those requests (e.g. FCFS).
- Is it First-Come-First-Served (FCFS) or Processor Sharing (PS)?
- Do producers have higher priority than consumers?
The official Kafka documentation does not explain this.
Can anyone give me an idea on this?
Thanks,
There is a TCP connection per client at the broker (the client can be a consumer, a producer, or any number of producers and/or consumers).
The way CPU resources are shared between different connections is not a property controlled by Kafka. It depends on the OS on which your broker is running. Specifically, the scheduler implementation of your OS (which decides how processes are scheduled on cores) will decide this.
If the scheduler is FCFS, this will very well be FCFS. More generally, the scheduler implementation in most OSes is some version of a multilevel feedback queue.
Thus, this has got nothing to do with Kafka.
Is there any option to throttle/rate-limit the MirrorMaker process so that the target cluster always lags behind the source cluster in a predictable way?
Does the option --num.streams help to throttle the flow?
I have to throttle and maintain the lag on the target cluster so that consumers can switch to the target cluster with ease, without missing any messages.
My consumers can handle duplicate messages, but I also don't want them to consume messages from the beginning/earliest offset.
Kafka version: 0.9
This is a follow-up question to an earlier discussion. I think of Zookeeper as a coordinator for instances of the Kafka broker, or "message bus". I understand why we might want producer/consumer clients transacting through Zookeeper -- because Zookeeper has built-in fault-tolerance as to which Kafka broker to transact with. But with the new model -- ie, 0.10.1+ -- should we always bypass Zookeeper altogether in our producer/consumer clients? Are we giving up any advantages (Eg, better fault-tolerance) by doing that? Or is Zookeeper ultimately still at work behind the scenes?
To add to the answer of Hans Jespersen, recent Kafka producer/consumer clients (0.9+) do not interact with ZooKeeper anymore.
Nowadays ZooKeeper is only used by the Kafka brokers (i.e., the server-side of Kafka). This means you can e.g. lock-down external access from clients to all ZooKeeper instances for better security.
I understand why we might want producer/consumer clients transacting through Zookeeper -- because Zookeeper has built-in fault-tolerance as to which Kafka broker to transact with.
Producer/consumer clients are not "transacting" through ZooKeeper, see above.
But with the new model -- ie, 0.10.1+ -- should we always bypass Zookeeper altogether in our producer/consumer clients?
If the motivation for your question is that you want to implement your own Kafka producer or consumer client, then the answer is: your custom client should not use ZooKeeper any longer. The official Kafka producer/consumer clients (Java/Scala) or e.g. Confluent's C/C++, Python, or Go clients for Kafka demonstrate how scalability, fault-tolerance, etc. can be achieved by leveraging Kafka functionality (rather than having to rely on a separate service such as ZooKeeper).
Are we giving up any advantages (Eg, better fault-tolerance) by doing that? Or is Zookeeper ultimately still at work behind the scenes?
No, we are not giving up any advantages here. Otherwise the Kafka project would not have changed its producer/consumer clients to stop using ZooKeeper and start using Kafka themselves for their inner workings.
ZooKeeper is only still at work behind the scenes for the Kafka brokers, see above.
Zookeeper is still at work behind the scenes but the 0.9+ clients don't need to worry about it anymore because consumer offsets are now stored in a Kafka topic rather than in zookeeper.
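As a rough illustration of the point above, the change is visible in client configuration: 0.9+ clients are pointed at the brokers through `bootstrap.servers`, whereas the old consumer was given `zookeeper.connect`. The host names and group name below are placeholders, and `talks_to_zookeeper` is a hypothetical helper for this sketch, not part of any Kafka client.

```python
# Hypothetical client configs; host names and group id are placeholders.
OLD_CONSUMER = {
    "zookeeper.connect": "zk1:2181,zk2:2181",  # old consumer: connects to ZooKeeper
    "group.id": "demo-group",
}
NEW_CONSUMER = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # 0.9+ client: brokers only
    "group.id": "demo-group",
}


def talks_to_zookeeper(client_config):
    """Return True if the client config points directly at ZooKeeper."""
    return "zookeeper.connect" in client_config


print(talks_to_zookeeper(OLD_CONSUMER))  # True
print(talks_to_zookeeper(NEW_CONSUMER))  # False
```

A check like this could be used to confirm that no client still needs network access to ZooKeeper before locking it down.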
I have a setup where I'm pushing events to Kafka and then running a Kafka Streams application on the same cluster. Is it fair to say that the only way to scale the Kafka Streams application is to scale the Kafka cluster itself by adding nodes or increasing partitions?
In that case, how do I ensure that my consumers will not bring down the cluster and that the critical pipelines are always "on"? Is there any concept of topology priority which can avoid possible downtime? I want to be able to expose the streams for anyone to build applications on without compromising the core pipelines. If the solution is to set up another Kafka cluster, does it make more sense to use Apache Storm instead for all the ad hoc queries? (I understand that a lot of consumers could still cause issues with the Kafka cluster, but at least the topology processing would be isolated.)
It is not recommended to run your Streams application on the same servers as your brokers (even if this is technically possible). Kafka's Streams API offers an application-based approach -- not a cluster-based approach -- because it's a library and not a framework.
It is not required to scale your Kafka cluster to scale your Streams application. In general, the parallelism of a Streams application is limited by the number of partitions of your app's input topics. It is recommended to over-partition your topic (the overhead for this is rather small) to guard against scaling limitations.
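To make the partition limit concrete, here is a minimal sketch. It assumes a single sub-topology whose number of tasks equals the largest partition count among its input topics; the topic names and the function `busy_threads` are made up for illustration.

```python
def busy_threads(input_partitions, instances, threads_per_instance):
    """Estimate how many stream threads can receive at least one task.

    Assumption: one sub-topology, with as many tasks as the largest
    partition count among its input topics.
    """
    tasks = max(input_partitions.values())
    total_threads = instances * threads_per_instance
    return min(tasks, total_threads)


partitions = {"orders": 8, "payments": 8}  # hypothetical input topics

print(busy_threads(partitions, instances=2, threads_per_instance=2))  # 4
print(busy_threads(partitions, instances=6, threads_per_instance=2))  # 8
```

In the second call there are 12 threads but only 8 tasks, so the extra threads would sit idle. That is the scaling limit that over-partitioning guards against.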
Thus, it is even simpler to "offer anyone to build applications" as everyone owns their application. There is no need to submit apps to a cluster. They can be executed anywhere you like (thus, each team can deploy their Streams application the same way by which they deploy any other application they have). Thus, you have many deployment options from a WAR file, over YARN/Mesos, to containers (like Kubernetes). Whatever works best for you.
Even if frameworks like Flink, Storm, or Samza offer cluster management, you can only use the tools that are integrated with those frameworks (for example, Samza requires YARN, with no other options available). Let's say you already have a Mesos setup: you can reuse it for your Kafka Streams applications, with no need for a dedicated "Kafka Streams cluster" (because there is no such thing).
An application’s processor topology is scaled by breaking it into
multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based
on the input stream partitions for the application, with each task
assigned a list of partitions from the input streams (i.e., Kafka
topics).
The assignment of partitions to tasks never changes so that each task
is a fixed unit of parallelism of the application. Tasks can then
instantiate their own processor topology based on the assigned
partitions; they also maintain a buffer for each of its assigned
partitions and process messages one-at-a-time from these record
buffers.
As a result stream tasks can be processed independently and in
parallel without manual intervention.
It is important to understand that Kafka Streams is not a resource
manager, but a library that “runs” anywhere its stream processing
application runs. Multiple instances of the application are executed
either on the same machine, or spread across multiple machines and
tasks can be distributed automatically by the library to those running
application instances.
The assignment of partitions to tasks never changes; if an application
instance fails, all its assigned tasks will be restarted on other
instances and continue to consume from the same stream partitions.
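The behaviour described in the quoted passage can be imitated with a small simulation. This is only a sketch: it uses plain round-robin instead of the library's actual assignment logic, the instance names are invented, and the `<sub-topology>_<partition>` task ids are used here purely as labels.

```python
def create_tasks(num_partitions, subtopology=0):
    """One task per input partition; the task-to-partition mapping is fixed."""
    return [f"{subtopology}_{p}" for p in range(num_partitions)]


def assign(tasks, instances):
    """Spread tasks over running instances (simplified round-robin)."""
    assignment = {name: [] for name in instances}
    for i, task in enumerate(tasks):
        assignment[instances[i % len(instances)]].append(task)
    return assignment


tasks = create_tasks(4)

before = assign(tasks, ["app-1", "app-2"])
print(before)  # {'app-1': ['0_0', '0_2'], 'app-2': ['0_1', '0_3']}

# "app-2" fails: the same four tasks are redistributed to the survivors.
after = assign(tasks, ["app-1"])
print(after)   # {'app-1': ['0_0', '0_1', '0_2', '0_3']}
```

Note that the set of tasks never changes between the two assignments; only the instance that runs each task does.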
The processing of the stream happens on the machines where the application is running.
I recommend having a look at this guide; it can help you better understand the way Kafka Streams works.
I need a simple health checker for Apache Kafka. I don't want something large and complex like Yahoo Kafka Manager; basically I want to check whether a topic is healthy and whether a consumer is healthy.
My first idea was to create a separate heartbeat topic and periodically send and read messages to/from it in order to check availability and latency.
The second idea is to read all the data from Apache ZooKeeper. I can get all brokers, partitions, topics etc. from ZK, but I don't know if ZK can provide something like failure detection info.
As I said, I need something simple that I can use in my app health checker.
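The heartbeat-topic idea could look roughly like the sketch below. It is deliberately client-agnostic: `send` and `poll` are callables you would back with whichever Kafka producer and consumer you use, and here they are replaced by an in-memory queue so the example runs on its own. The function name and parameters are assumptions made for this sketch.

```python
import time
import uuid
from collections import deque


def heartbeat_check(send, poll, timeout_s=5.0, clock=time.monotonic):
    """Send one uniquely tagged message and wait until it is read back.

    Returns the round-trip latency in seconds, or None on timeout.
    `send(payload)` publishes to the heartbeat topic; `poll()` returns the
    next payload read from it, or None if nothing is available yet.
    """
    token = uuid.uuid4().hex
    started = clock()
    send(token)
    while clock() - started < timeout_s:
        if poll() == token:
            return clock() - started
        time.sleep(0.01)  # avoid spinning when poll() returns immediately
    return None


# In-memory stand-in for the heartbeat topic.
queue = deque()
latency = heartbeat_check(queue.append, lambda: queue.popleft() if queue else None)
print("healthy" if latency is not None else "unhealthy")
```

A `None` result means the message did not come back within the timeout, which covers both an unavailable topic and excessive latency.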
Here are some existing tools you can try out if you haven't yet:
- Burrow: LinkedIn's Kafka consumer lag checker
- Exhibitor: Netflix's ZooKeeper co-process for instance monitoring, backup/recovery, cleanup and visualization
- Kafka System Tools: Kafka command line tools
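The core idea behind consumer lag checking can be sketched in a few lines: for each partition, lag is the log end offset minus the consumer group's committed offset. The offsets below are invented, and treating a missing committed offset as 0 is an assumption of this sketch rather than how any particular tool behaves.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log end offset - committed offset (never negative)."""
    lag = {}
    for partition, end in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)  # assumption: missing means 0
        lag[partition] = max(end - committed, 0)
    return lag


def is_healthy(lag, max_lag):
    """A consumer counts as healthy if no partition lags more than max_lag."""
    return all(value <= max_lag for value in lag.values())


end_offsets = {0: 1500, 1: 980}  # hypothetical log end offsets
committed = {0: 1495, 1: 400}    # hypothetical committed offsets

lag = partition_lag(end_offsets, committed)
print(lag)                           # {0: 5, 1: 580}
print(is_healthy(lag, max_lag=100))  # False
```

A single snapshot like this can be misleading, so a simple checker would probably want to look at whether lag keeps growing over several samples.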
I have a Kafka Cluster in a data center. A bunch of clients that may communicate across WANs (even the internet) will send/receive real time messages to/from the cluster.
I read from Kafka's Documentation:
...It is possible to read from or write to a remote Kafka cluster over the WAN though TCP tuning will be necessary for high-latency links.
It is generally not advisable to run a single Kafka cluster that spans multiple datacenters as this will incur very high replication latency both for Kafka writes and Zookeeper writes and neither Kafka nor Zookeeper will remain available if the network partitions.
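The "TCP tuning" mentioned in the first quoted sentence usually starts with the bandwidth-delay product: the amount of data that must be in flight to keep a high-latency link full. The sketch below is only a back-of-the-envelope calculation with made-up link numbers; the result is a starting point for socket buffer settings such as the client's `send.buffer.bytes` / `receive.buffer.bytes` and the broker's `socket.send.buffer.bytes` / `socket.receive.buffer.bytes`, and OS-level limits may also need raising.

```python
def bandwidth_delay_product_bytes(bandwidth_mbit_s, rtt_ms):
    """Bytes in flight needed to saturate a link: bandwidth * round-trip time."""
    bits_in_flight = bandwidth_mbit_s * 1_000_000 * rtt_ms // 1000
    return bits_in_flight // 8


# Hypothetical WAN link: 100 Mbit/s with an 80 ms round-trip time.
print(bandwidth_delay_product_bytes(100, 80))  # 1000000
```

In this example a socket buffer much smaller than about 1 MB would cap throughput below what the link can carry.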
From what I understand here and here:
Producing over a WAN doesn't require ZK and is okay, just mind tweaks to TCP for high latency connections. Great! Check.
The High Level consumer APIs require ZK connections.
Aren't clients reading/writing to Kafka over a WAN then subject to the same limitations described for clusters in the quote above?
The statements you have highlighted are mostly targeted at the internal communication within the Kafka/ZooKeeper cluster, where evil things will happen during network partitions, which are much more common across a WAN.
Producers are isolated and, if there are network issues, should be able to buffer and retry based on your settings.
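The "settings" in question are producer configs such as `retries`, `retry.backoff.ms` and `buffer.memory`. The values below are illustrative only, and `send_with_retries` is a toy stand-in written for this sketch to show the retry idea; the real producer handles buffering and retries internally.

```python
import time

# Producer config names are real; the values are only illustrative.
PRODUCER_SETTINGS = {
    "retries": 5,
    "retry.backoff.ms": 500,
    "buffer.memory": 64 * 1024 * 1024,
}


def send_with_retries(send, record, retries, backoff_ms, sleep=time.sleep):
    """Call send(record); on ConnectionError retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return send(record)
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(backoff_ms / 1000)


# Hypothetical flaky link: the first two attempts fail, the third succeeds.
attempts = []


def flaky_send(record):
    attempts.append(record)
    if len(attempts) < 3:
        raise ConnectionError("link down")
    return "ack"


result = send_with_retries(flaky_send, "event", retries=5, backoff_ms=0,
                           sleep=lambda seconds: None)
print(result, len(attempts))  # ack 3
```

Keep in mind that retries are one reason consumers on the other side may see duplicates.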
High level consumers are trickier since, as you note, they require a connection to ZooKeeper. When disconnects occur, there will be rebalancing and a higher chance that messages will get duplicated.
Keep in mind that the producer will need to be able to reach every Kafka broker, and the consumer will need to be able to reach all ZooKeeper nodes and Kafka brokers; a load balancer won't work.