How to make full replication in kafka? - apache-kafka

I have two servers, a leader and a follower.
How can I make sure that when the leader fails (shuts down), all messages sent to the follower in the meantime also appear on the leader once it is turned back on?
I know of one option: Kafka ships with a built-in synchronization tool, bin/kafka-mirror-maker.sh. It would always run on the leader, so that messages sent to the leader also go to the follower. When the leader shuts down, the tool would be started on the follower, and all messages, as I understand it, would then go to the follower. Once the leader is back up and has synchronized (that is, from the moment messages go only to the leader again), the tool would be started on the leader again and stopped on the follower, so that the messages always stay synchronized.
If you keep these services on both servers at the same time, the messages will be endlessly duplicated. That is, one message will constantly come to both the follower and the leader due to synchronization.
But I'm not sure that this method is correct, and it requires additional resources: a service to track all of this and to run bin/kafka-mirror-maker.sh.
How can I do it right and without wasting resources?

Kafka itself is a distributed system. Per the docs:
Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures.
If you want to replicate between Kafka clusters (such as full datacenters, or clusters serving different purposes) then this is where something like MirrorMaker would come in.

How can I make sure that when the leader fails (shuts down), all messages sent to the follower in the meantime also appear on the leader once it is turned back on?
This is built into the protocol, but it assumes that every topic you are using was created with a replication factor of 2.
It sounds like you have only two brokers on the same network, so you do not need MirrorMaker; as the docs show, it is clearly meant for replicating between different clusters, such as regional datacenters.
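To illustrate why no extra synchronization service is needed, here is a minimal toy model of one partition replicated on two brokers. It is only a sketch under simplifying assumptions: the `Broker` and `Partition` classes are hypothetical, and real Kafka tracks in-sync replicas and elects leaders through its controller rather than with the naive logic shown here.

```python
class Broker:
    """A toy broker holding one partition's log (not real Kafka code)."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.online = True


class Partition:
    """One partition replicated on two brokers (replication factor 2)."""

    def __init__(self, brokers):
        self.replicas = brokers
        self.leader = brokers[0]

    def produce(self, message):
        # If the current leader is down, fail over to an online replica.
        if not self.leader.online:
            self.leader = next(b for b in self.replicas if b.online)
        self.leader.log.append(message)
        # Online followers copy the write; offline ones fall behind.
        for broker in self.replicas:
            if broker is not self.leader and broker.online:
                broker.log.append(message)

    def restart(self, broker):
        # A returning broker rejoins as a follower and catches up by
        # copying the part of the leader's log that it is missing.
        broker.online = True
        broker.log.extend(self.leader.log[len(broker.log):])


a, b = Broker("a"), Broker("b")
partition = Partition([a, b])
partition.produce("m1")
a.online = False           # the original leader goes down
partition.produce("m2")    # b takes over and accepts the write
partition.restart(a)       # a comes back and catches up from b
print(a.log == b.log)
```

After the restart both logs contain the same messages, which is the behavior the question was trying to build by hand with MirrorMaker.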
I would add that if you did want to do that, don't use kafka-mirror-maker. It is not as fault-tolerant or scalable as you might expect.
Instead, use MirrorMaker 2, which is built on the Kafka Connect framework.

Related

Why can't we solve the consensus problem by just enforcing a new leader?

I am going through this lecture series by Martin Kleppmann.
In this video at around 1:25, he says you can manually configure the distributed nodes to choose a leader.
If that's the case, can't we just automate the process by running a separate process that checks the health of the leader and chooses a new leader after the leader fails or a network partition occurs?
Why is this problem actually so hard? Why can't we solve the consensus problem by enforcing a new leader without the nodes having to actually come to an agreement? What am I missing?
Let's say we have an active leader and a passive one. The passive one listens for the active one's heartbeat. When the heartbeat is not heard, the passive one switches to active mode and, maybe, tells everyone: "I am the leader...".
The problem is that just because the passive one hears no heartbeat, it does not mean that the true leader is down. Maybe there is a network issue between these two boxes?
Another possibility: the leader may go offline for a short period of time, long enough for the passive one to notice, but later the original leader comes back online. Now there are two leaders.
The general problem to solve here is how to build a failure detector, and that is tricky. In the last example, the old leader comes back and thinks it is still the leader, but that is no longer true.
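The network-issue case can be made concrete with a small simulation. This is a sketch, not a real failure detector: the `Node` class, the `TIMEOUT` value, and the tick-based clock are all assumptions made for illustration. The active node stays healthy the whole time; only the link carrying its heartbeats fails.

```python
TIMEOUT = 3  # hypothetical: ticks without a heartbeat before taking over


class Node:
    def __init__(self, name, is_leader):
        self.name = name
        self.is_leader = is_leader
        self.last_heartbeat_seen = 0


def run(link_up_at):
    """Simulate 10 ticks and return the names of all nodes acting as leader.

    link_up_at(t) says whether the active node's heartbeat reaches the
    passive node at tick t. The active node itself never fails.
    """
    active = Node("active", is_leader=True)
    passive = Node("passive", is_leader=False)
    for t in range(1, 11):
        if link_up_at(t):
            passive.last_heartbeat_seen = t
        # Naive failure detector: silence for TIMEOUT ticks means "dead".
        if t - passive.last_heartbeat_seen >= TIMEOUT:
            passive.is_leader = True
    return [n.name for n in (active, passive) if n.is_leader]


print(run(lambda t: True))    # healthy network: one leader
print(run(lambda t: t < 4))   # link drops at tick 4: two leaders
```

In the second run the passive node cannot distinguish a dead leader from a broken link, so both nodes end up believing they are the leader.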

kafka Multi-Datacenter with high availability

I'm setting up two Kafka v0.10.1.0 clusters in different DCs and plan to use MirrorMaker, with one cluster as the source and the other as the target. What I'm not sure about is how to ensure high availability when my source (main) cluster goes down, that is, when the complete DC hosting the source Kafka cluster goes down. Do I need to make my application switch to producing messages to the target Kafka? What will happen when the source Kafka is back? How do I bring it back in sync with the possibly lost messages?
Thanks
From reading your question, I'm afraid I don't think MirrorMaker will be a suitable tool for your needs.
Basically MirrorMaker is simply a Consumer and a Producer tied together to replicate messages from one cluster to another. It is not a tool to tie two Kafka clusters together in an active-active configuration, which sounds a lot like what you are looking for.
But to answer your questions in order:
Do I need to make my application switch to produce messages to the target kafka?
Yes. There is currently no failover function, so you would need to implement logic in your producers to try the target cluster after x failed messages, or after no messages have been sent for y minutes, or something like that.
What will happen when source kafka is back?
Pretty much nothing that you don't implement yourself :)
MirrorMaker will start replicating data from your source cluster to your target cluster again, but since your producers have now switched over to the target cluster, the source cluster is not getting any data, so MirrorMaker will just sit idle.
Your producers will keep producing into the target cluster, unless you implement a regular check for whether the source has come back online and have them switch back.
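The failover and switch-back logic described above might look roughly like the following sketch. Everything here is hypothetical: `FailoverProducer`, `send_primary`, `send_secondary`, and the health check stand in for whatever client code you actually use, and a failed send is simply reported back to the caller rather than retried.

```python
class FailoverProducer:
    """Sends to a primary cluster and falls back to a secondary one.

    send_primary and send_secondary are hypothetical callables that raise
    ConnectionError when their cluster cannot be reached.
    """

    def __init__(self, send_primary, send_secondary, max_failures=3):
        self.send_primary = send_primary
        self.send_secondary = send_secondary
        self.max_failures = max_failures
        self.failures = 0
        self.using_secondary = False

    def send(self, message):
        if self.using_secondary:
            self.send_secondary(message)
            return "secondary"
        try:
            self.send_primary(message)
        except ConnectionError:
            # Count consecutive failures; switch once the limit is hit.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.using_secondary = True
            return "failed"  # the caller decides whether to retry
        self.failures = 0
        return "primary"

    def try_switch_back(self, primary_is_healthy):
        # Meant to be called periodically by the application.
        if self.using_secondary and primary_is_healthy():
            self.using_secondary = False
            self.failures = 0


state = {"primary_up": True}
sent = {"primary": [], "secondary": []}


def send_primary(msg):
    if not state["primary_up"]:
        raise ConnectionError("primary cluster unreachable")
    sent["primary"].append(msg)


def send_secondary(msg):
    sent["secondary"].append(msg)


producer = FailoverProducer(send_primary, send_secondary, max_failures=2)
results = [producer.send("m1")]
state["primary_up"] = False
results += [producer.send("m2"), producer.send("m2"), producer.send("m2")]
state["primary_up"] = True
producer.try_switch_back(lambda: state["primary_up"])
results.append(producer.send("m3"))
print(results)
```

Note that this only moves the producers around. The message that landed on the secondary cluster is still missing from the primary, which is exactly the resynchronization problem discussed next.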
How to bring it back in sync with the possible lost messages?
When your source cluster is back online, and assuming everything I mentioned above has happened, you have effectively switched your clusters around. Depending on whether you want your source to remain the primary cluster that gets written to, or are happy to reverse roles when this happens, you have two options that I can come up with off the top of my head:
reverse the direction of MirrorMaker and set the consumer group offsets manually so that it picks up at the point where the source cluster died
stop producing new data for a while, recover missing data to the source cluster, switch back your producers and start everything up again.
Both options require you to figure out manually what data is missing on the source cluster, though; I don't think there is a way around this.
The bottom line is that this is not an easy thing to do with MirrorMaker, and it might be worth having another think about whether you really want to switch producers over to the target cluster if the source goes down.
You could also have a look at Confluent's Replicator, which might better suit what you are looking for and is part of their corporate offering. Information is a bit sparse on that, let me know if you are interested in it and I can make an introduction to someone who can tell you more about it (or of course just send a mail to Confluent, that'll reach the right person as well).

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All of these requirements seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem seeing Kafka's architecture. The rate of messages probably won't be more than 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?
Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id to both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.
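The effect of such a collision can be sketched with a toy partition assignment. Within one consumer group, each partition is handed to exactly one member, so an unrelated consumer that joins with the same group.id takes partitions away from the first client. The round-robin `assign` function and the member names below are illustrative assumptions; Kafka's real assignors are more involved.

```python
def assign(partitions, members):
    """Round-robin partitions over the members of one consumer group."""
    members = sorted(members)
    assignment = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Client A alone in the group: its two nodes cover every partition.
alone = assign(partitions, ["client-a-node1", "client-a-node2"])

# A stranger joins with the same group.id and takes a share.
shared = assign(
    partitions, ["client-a-node1", "client-a-node2", "stranger-node1"]
)

seen_by_a = sorted(
    p for member, ps in shared.items()
    if member.startswith("client-a") for p in ps
)
print(alone)
print(shared)
print(seen_by_a)
```

In the shared case client A no longer receives anything from the partition assigned to the stranger, which is how it silently loses messages.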

Kafka instead of Zookeeper for cluster management

I am writing a clustered application sitting on top of Kafka -- it uses Kafka exclusively for interprocess communications and coordination. I could use Zookeeper to manage my cluster -- but it would not be very difficult to use Kafka topics to manage the cluster. And the more I think about it, other than for historical reasons, it seems like Kafka could drop Zookeeper and just use a topic-based solution
For example, there could be a special topic or topics in Kafka where you publish all of the same data currently kept track of in Zookeeper. Brokers, Topics, Partitions, Leaders, etc -- seems like this is just as easily tracked via Kafka topics as via Zookeeper.
I know in Kafka 0.9.0 there's some movement away from Zookeeper, more towards this model, and remember my question is less about Kafka development or more me trying to figure out which direction to go in my application.
I'm not asking for an opinion -- what I want to know is are there any specific functions provided by Zookeeper that are going to be difficult with a Kafka/topic-based approach to coordination. But I can't think of anything.
Even heartbeat monitoring -- which was the reason I started looking at Zookeeper in the first place -- you could have a client connection topic, and clients could publish to it when they join the cluster, publish heartbeats at a given interval, and publish as they leave it.
Let us start with the view from space: you have two distributed systems which store data. Zookeeper organizes its data in nodes in a kind of directory-like structure. Kafka stores messages within topics.
From a bird's-eye view, Kafka is built for high throughput and scalability, while one of Zookeeper's main design goals is consistency. Zookeeper is meant to be a Distributed Coordination Service for Distributed Applications, while Kafka can be thought of as a distributed commit log.
So the answer to your question is, unsurprisingly: 'It depends'. For coordinating a distributed system I would use Zookeeper: that's what it was built for. You could also do this with Kafka, but there are a couple of things which you would need to do manually that come out of the box if you are using Zookeeper.
Some examples:
Consistency: The ZK client can choose whether it needs strong or eventual consistency
Ephemeral nodes: Together with ZK watches, a great way to react to failing services
Sequential Consistency: It's not guaranteed that you receive Kafka messages in the order you wrote them to the broker (it's only guaranteed that messages within a partition are ordered)
ACLs: I have never used them, but they are at least something which is not offered out of the box by Kafka
Sequence Nodes
A pretty nice overview about what you can do with zookeeper are the zookeeper-recipes: https://zookeeper.apache.org/doc/trunk/recipes.html
[EDIT]: Heartbeating an application using kafka is of course possible. But ephemeral nodes in zookeeper are in my eyes the easier option.
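As a rough idea of what the topic-based heartbeat would involve, the following sketch replays membership events as they might be read from a dedicated topic. The event format, the member names, and the `live_members` helper are all assumptions for illustration. Note that the timeout bookkeeping is something you write yourself here, whereas an ephemeral node disappears on its own when the session ends.

```python
def live_members(events, now, timeout):
    """Replay membership events and return the members considered alive.

    Each event is (timestamp, member, kind), where kind is one of
    "join", "heartbeat" or "leave".
    """
    last_seen = {}
    for timestamp, member, kind in events:
        if kind == "leave":
            last_seen.pop(member, None)
        else:
            last_seen[member] = timestamp
    # A member that has been silent for longer than timeout is presumed dead.
    return sorted(m for m, ts in last_seen.items() if now - ts <= timeout)


events = [
    (0, "worker-1", "join"),
    (0, "worker-2", "join"),       # crashes later without saying goodbye
    (5, "worker-1", "heartbeat"),
    (6, "worker-3", "join"),
    (8, "worker-3", "leave"),      # leaves cleanly
    (10, "worker-1", "heartbeat"),
]
print(live_members(events, now=12, timeout=5))
```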
This is currently being worked on in scope of KIP-500.

Learning Zookeeper - Help me with example

I'm trying to wrap my head around Zookeeper and what it does. To this point, my experience with Zookeeper has been through other libraries that require Zookeeper (Solr and Kafka), and so my basic understanding is the very vague "you better use Zookeeper to keep your configuration straight".
So help me think through a simple example problem. Let's say that I build my own service that does "stuff". There are two things that I want to protect:
I want to have as little downtime as possible (gotta keep doing stuff).
I can not have more than one server doing stuff because bad things would happen.
So, how would I set this up in Zookeeper? Is Zookeeper responsible for starting another stuff server if one goes down? Or do I subscribe to a Zookeeper "stuff doer status" callback? If I erroneously start up two stuff servers, how does Zookeeper help me keep bad things from happening?
Zookeeper is a distributed lock manager. These systems provide features like coordinator election (aka "master election" or "leader election") for a distributed system, as well as provide a consistent, distributed access to small amounts of critical information which is frequently used for configuration (i.e., don't treat it like a database or a general file system).
Note that Zookeeper does not manage your service, but you can use it to keep a hot standby (or several), so that if one master fails, another takes over. You would run N replicas of your servers, and one of the working instances can take over immediately if the current leader goes down or becomes unavailable for any reason.
Using master election, you can choose to have two (or more) servers, but only one of them will be able to take the master lock, so only that one will be able to take action. As soon as it goes away, it will lose its claim to the lock, and your hot standby will pick up the lock and start doing work that you need it to do. Look at Zookeeper recipes for code samples. However, properly handing off work, checkpointing, and general service resilience is still up to you to design and implement.
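The hand-over can be pictured with a small in-memory model. This is not the Zookeeper API: `LockService` is a hypothetical stand-in that only mimics the idea behind the leader election recipe, where each candidate creates an ephemeral sequential node and the one with the lowest sequence number holds the lock.

```python
class LockService:
    """In-memory stand-in for a coordination service (not Zookeeper itself)."""

    def __init__(self):
        # Candidates in arrival order, playing the role of sequence numbers.
        self.queue = []

    def join(self, server):
        self.queue.append(server)

    def session_lost(self, server):
        # Mirrors an ephemeral node vanishing when its owner's session ends.
        self.queue.remove(server)

    def master(self):
        # Only the first candidate in line holds the master lock.
        return self.queue[0] if self.queue else None


service = LockService()
service.join("stuff-1")
service.join("stuff-2")          # hot standby, does no work yet
before = service.master()
service.session_lost("stuff-1")  # the master crashes or is partitioned away
after = service.master()
print(before, after)
```

Even if two "stuff" servers are started by mistake, only the one at the head of the queue may act, which is what keeps the bad things from happening.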
That said, Zookeeper and similar systems provide a solid foundation to enable you to build robust distributed systems.
Other systems similar to Zookeeper include (alphabetically):
Chubby
doozerd
etcd
Several of these have detailed comparisons written up on their respective websites to show how they differ from the others in the list.