I am looking for some suggestion on Kafka broker auto scaling up and down based on the load.
Let us say we have an e-commerce site and we are capturing certain activities or events and these events are send to Kafka. Since during the peak hours/days the site traffic will be more so having the ideal kafka cluster with fixed number of brokers always is not a good idea so we want to scale it up the number of brokers when site traffic is more and scale it down the number of brokers when traffic is less.
How does people solve this kind of issue? i am not able to find any resource in this topic. any help will be greatly appreciated.
Kafka doesn't really work that way. Adding/removing brokers from the cluster is a very hands-on process, and it creates a lot of additional load/overhead on the cluster, so you wouldn't want the cluster to be automatically scaling up or down by itself. The main reason why it creates so much additional overhead is that adding or removing brokers requires lots of data copying across the cluster, on top of the normal traffic. Basically, all the data from a dead broker needs to be copied somewhere else, to keep the same replication factor for the topic/partitions, or if it's a new broker, data needs to be shuffled into it from the other brokers, so that the load on the cluster as a whole is reduced. All this data being copied around creates lots of IO/CPU load on the cluster, and it might be enough to cause significant problems.
The best way to handle this scenario is to do performance testing and optimization with 2x or even 3x the traffic you'd expect during peak hours, and build out the cluster accordingly. This way, you'll have plenty of headroom if there are sudden spikes, and you won't have to scale-out/scale-in.
Kafka is extremely performant, even for traffic of millions of messages per second, so you will probably find that the cluster size your application/system requires is not as large/expensive as you initially thought.
Related
I'm trying to figure out what is the proper way to create a topic in Active-Passive Kafka cluster architecture.
Let's say we have two Kafka clusters (Active-Passive) in two separated availability zone.
I understand that the topic mirroring happens when data is replicated from active to passive clusters.
The question is. When I create a topic should I create it on both clusters separately?
If not, then the Mirroring maker should know to detect if there is a new topic on the Active cluster and create its replica on the passive cluster.
Will appreciate any clarification
KMM2 (at least) does monitoring of Kafka topics in the source cluster.
In detail take a look at how often KM does use the admin to find out what are the topics to replicate and how it is involved in syncing their configuration.
KMM code is pretty straightforward, so forking and making changes if needed should not be too hard (but it is very much possible you might get what you want by setting up policies/params).
I need to simply monitor if my Kafka cluster is up. Occasionally the machines running Kafka were shutdown. I want to send an email alert if the cluster is not available.
I can create a producer and consumer to send and receive dummy messages periodically. Is there a simpler way to do it?
You can use https://github.com/obsidiandynamics/kafdrop
It won't send you emails, but it much easier than send dummy messages
Actually knowing if cluster is up is not so easy at all, there is discussion with community what is the best practice to decide if kafka cluster is up and active but there is no current good way to get this information, as kafka architecture is distributed system, you might have big clusters and while one or more brokers are down , still having your cluster to give high available service, not effecting the integrity of data. Also you might have problems with one topic while on other topics it might work fine.
One suggestion I read which might give you the most certain approach is to produce "dummy" msgs to your applicative topics, and "skip" these msgs on consumption, that guarantee you that your application would work. I don't like this approach very much as it requires to "send junk to your main topics"
Other approaches are like you say "produce/consume to/from test/healthcheck topic" but it is might not give full guarantee that your application would work, this is a lot like select from dummy in other db approaches... if for them is good enough....
Another suggestion is to use AdminClient to read the metrics of cluster, if metrics are provided that usually means the cluster is healthy , also not very good guarantee...
I asked in comment which language are you using, maybe you are using something like spring which has HealthIndicator to check component status, but for your case it would be little different.
First of all, you should know that Kafka by default should be High
Available, so while building the cluster you should follow the bold
lines of best practices, you should ensure that you have replicas of
machines. This is good assumption that will make you satisfied over implementing all of this.
But, if you want to check health of a cluster, you can use admin process, you can use AdminClient, with help of some utilities; you can check list of topics, groups, etc that you have. But this not 100% guarantee for you although it is good workaround.
You can do that using as you mentioned periodic scheduler, and send email based on the findings you get. But again this is not the ideal solution, and HA cluster infrastructure should save lots of time for you if you build it correctly from the beginning.
I am creating a service that sends lots of data to kafka-rest-proxy. I am only sending data (producing) to kafka. What I'm finding is that kafka-rest-proxy is easily overwhelmed and runs out of java heap space. I've allocated additional resources, and even horizontally scaled out the number of hosts running kafka-rest-proxy, yet I still encounter dropped connections and memory issues.
I'm not familiar with the internals of kafka-rest-proxy, but my hunch is that it's buffering the records and sending them to Kafka asynchronously. If that is the case then what mechanism does it have to control back pressure? Is there a way to configure it such that it writes records to Kafka synchronously?
Kafka REST Proxy exposes all of the functionality of the Java producers, consumer's command-line tools. Rest Proxy doesn't need any back pressure concept.
To be more specific, Kafka is capable of delivering messages over the network at an alarmingly fast rate.
You need to scale the brokers as per the rate you are producing and consuming the data.
I'm setting up 2 kafka v0.10.1.0 clusters on different DCs and planning to use mirror-maker to keep one as source and the other one as target, what I'm not sure is how to ensure high availability when my source/main cluster goes down (complete DC where source kafka cluster goes down) do I need to make my application switch to produce messages to the target kafka and what will happen when source kafka is back? how to bring it back in sync with the possible lost messages?
Thanks
From reading your question I don't think, that MirrorMaker will be a suitable tool for your needs I am afraid.
Basically MirrorMaker is simply a Consumer and a Producer tied together to replicate messages from one cluster to another. It is not a tool to tie two Kafka clusters together in an active-active configuration, which sounds a lot like what you are looking for.
But to answer your questions in order:
Do I need to make my application switch to produce messages to the
target kafka?
Yes, there is currently no failover function, you would need to implement logic in your producers to try the target cluster after x amount of failed messages or no messages sent in y minutes or something like that.
What will happen when source kafka is back?
Pretty much nothing that you don't implement yourself :)
MirrorMaker will start replicating data from your source cluster to your target cluster again, but since your producers now switched over to the target cluster, the source cluster is not getting any data, so they will idle along.
Your producers will keep producing into the target cluster, unless you implemented a regular check whether the source came back online and have them switch back.
How to bring it back in sync with the possible lost messages?
When your source cluster is back online and assuming all the things I mentioned above have happened you effectively switched your clusters around, depending on whether you want your source as primary cluster that gets written to or are happy to reverse roles when this happens you have two options that I can come up with off the top of my head:
reverse the direction of mirrormaker and set the consumer group offsets manually so that it picks up at the point where the source cluster died
stop producing new data for a while, recover missing data to the source cluster, switch back your producers and start everything up again.
Both options require you to figure out, what data is missing on the source cluster manually though, I don't think there is a way around this.
Bottom line is, that this in not an easy thing to do with MirrorMaker and it might be worth having another think about whether you really want to switch producers over to the target cluster if the source goes down.
You could also have a look at Confluent's Replicator, which might better suit what you are looking for and is part of their corporate offering. Information is a bit sparse on that, let me know if you are interested in it and I can make an introduction to someone who can tell you more about it (or of course just send a mail to Confluent, that'll reach the right person as well).
I am writing a clustered application sitting on top of Kafka -- it uses Kafka exclusively for interprocess communications and coordination. I could use Zookeeper to manage my cluster -- but it would not be very difficult to use Kafka topics to manage the cluster. And the more I think about it, other than for historical reasons, it seems like Kafka could drop Zookeeper and just use a topic-based solution
For example, there could be a special topic or topics in Kafka where you publish all of the same data currently kept track of in Zookeeper. Brokers, Topics, Partitions, Leaders, etc -- seems like this is just as easily tracked via Kafka topics as via Zookeeper.
I know in Kafka 0.9.0 there's some movement away from Zookeeper, more towards this model, and remember my question is less about Kafka development or more me trying to figure out which direction to go in my application.
I'm not asking for an opinion -- what I want to know is are there any specific functions provided by Zookeeper that are going to be difficult with a Kafka/topic-based approach to coordination. But I can't think of anything.
Even heartbeat monitoring -- which was the reason I started looking at Zookeeper in the first place -- you could have a client connection topic, and clients could publish to it when they join the cluster, publish heartbeats at a given interval, and publish as they leave it.
Let us start from a space eyed view: You have two distributed
systems which store data. Zookeeper organizes it's data in nodes in some kind
of directory like structure. Kafka stores messages within topics.
From a bird eye view kafka is build for high-throughput and scalability while one of zookeepers
main design goal is consistency. Zookeeper is mean to be a a Distributed Coordination Service for
Distributed Applications while Kafka can be thought as a distributed commit log.
So the answer to your question is surprisingly: 'It depends'. For coordinating
a distributed system I would use zookeeper: Thats what it was build for. You could
do this also with kafka but there are couple of things which needs to be done
manualy which comes out of the box if you are using zookeeper.
Some examples:
Consistency: The ZK-Client can choose if he needs strong or a eventual consistency
Ephemeral nodes: Together with ZK-Watches a great thing to react on failing services
Sequential Consistency: It's not granted that you recieve the kafka-message in the order you wrote it to the broker (it's only granted that messages within a partion are ordered)
ACLs: Never used it but its at least something which is not offered out of the box by kafka
Sequence Nodes
A pretty nice overview about what you can do with zookeeper are the zookeeper-recipes: https://zookeeper.apache.org/doc/trunk/recipes.html
[EDIT]: Heartbeating an application using kafka is of course possible. But ephemeral nodes in zookeeper are in my eyes the easier option.
This is currently being worked on in scope of KIP-500.