Push data from one Kafka server to another - apache-kafka

I need to push data from multiple Kafka producers to a separate Kafka broker. Say I have three Kafka servers. From Kafka 1 and 2, I need to push the data to Kafka 3, like below. Is this possible?

Kafka has built-in replication across brokers. For any topic in the cluster, your producer only ever writes to one broker at a time (the leader of the target partition).
If you have separate clusters, use MirrorMaker to replicate topics between them.
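For the separate-cluster case, a minimal MirrorMaker 2 properties sketch might look like the following; the cluster aliases and addresses are placeholders, and a real setup would also tune replication factors and other settings:
clusters = source, target
source.bootstrap.servers = source-host:9092
target.bootstrap.servers = target-host:9092
source->target.enabled = true
source->target.topics = .*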

There are some misunderstandings in your question.
1. There is no "Kafka server"
Kafka is a cluster, which means that all the "servers" work together as one logical server. When you send a message to a Kafka cluster, you don't know in advance which broker will accept it.
It helps to use the correct terms: what you call a "Kafka server" is a Kafka broker, i.e. one instance in a cluster. There is no standalone "Kafka server".
2. Do you need to replicate your data, or just send the same message to two Kafka clusters?
If you need to replicate your messages, meaning a single message that exists on two brokers, you need to configure your topic's replication factor (see the sketch after this list).
3. Do you need the same message in two clusters?
Use MirrorMaker.
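To make point 2 concrete, here is a minimal sketch of creating a topic with a replication factor using the Java AdminClient; the topic name, partition count, and broker address are assumptions:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // 3 partitions, each kept on 2 brokers (replication factor 2)
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}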

Related

Kafka Connect best practices for topic compaction

I am using Debezium, which makes use of Kafka Connect.
Kafka Connect exposes a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create and configure those topics on its own unless you pre-create them with your own settings.
These are the only cases I can think of where you'd adjust the compaction settings:
a connect group lingering on the topic longer than you want it to. For example, a source connector doesn't start immediately after a long downtime because it is still processing the offsets topic
your Connect cluster doesn't accurately report its state, or the tasks don't rebalance appropriately (because the status topic is in a bad state)
The __consumer_offsets (compacted) topic is what sink connectors use, and it would be configured separately for all consumers, not only Connect.
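If you do decide to pre-create a storage topic yourself, a minimal sketch with the Java AdminClient might look like this; the topic name, partition count, and replication factor are assumptions, and only cleanup.policy is set, mirroring the question:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateConnectOffsetsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // Many partitions, replicated 3x, and compacted, per the Debezium guidance above
            NewTopic offsets = new NewTopic("connect-offsets", 25, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(offsets)).all().get();
        }
    }
}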

Impact of publishing messages to one Kafka broker within a cluster

I have a Kafka cluster with three brokers and ZooKeeper instances, and I kept a replication factor of 2 for each partition.
I want to understand the impact of publishing messages to a single node in the cluster by giving one broker address. Will this broker send messages on to the other brokers if the messages belong to partitions held by those brokers?
Can someone explain how the internal sync works, or point me to resources?
giving one broker address
Even if you give only one address, the bootstrap protocol returns all brokers to the client.
The partitioner logic determines which partition, and therefore which broker, the data is sent to: in the client you target partitions, not brokers.
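As a rough illustration of both points (the topic name, key, and broker address are placeholders), a producer configured with a single address still routes each record to the leader of the record's partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OneAddressProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // One address is enough: the client fetches metadata for the whole cluster from it.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The partitioner hashes the key to pick a partition; the record is then
            // sent directly to that partition's leader, whichever broker that is.
            producer.send(new ProducerRecord<>("my-topic", "some-key", "some-value"));
        }
    }
}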

What is bootstrap-server in Kafka config?

I have just started learning Kafka, and I keep coming across the term bootstrap-server.
Which server does it represent in my kafka cluster?
It is the URL of one of the Kafka brokers, which you give to fetch the initial metadata about your Kafka cluster.
The metadata consists of the topics, their partitions, the leader brokers for those partitions, etc.
Depending on this metadata, your producer or consumer produces or consumes data.
You can have multiple bootstrap servers in your producer or consumer configuration, so that if one of the brokers is not accessible, the client falls back to another.
We know that a Kafka cluster can have hundreds or thousands of brokers (Kafka servers). But how do we tell clients (producers or consumers) which ones to connect to? Should we specify all of those brokers in the client configuration? No, that would be troublesome and the list would be very long. Instead, we take two or three brokers and treat them as bootstrap servers that a client connects to initially; those brokers then return cluster metadata that points the client to an appropriate broker.
So bootstrap.servers is a configuration we place within clients, which is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself.
A host and port pair uses : as the separator.
localhost:9092
localhost:9092,another.host:9092
So as mentioned, bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster.
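To see this discovery in action, here is a minimal sketch using the Java Admin client that connects through one bootstrap address and lists every broker in the cluster (the address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // one bootstrap address
        try (Admin admin = Admin.create(props)) {
            // The initial connection returns metadata for the full cluster membership.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.println("broker " + node.id() + " at " + node.host() + ":" + node.port());
            }
        }
    }
}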
Special Notes:
Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list does not have to contain the full set of servers (you may want more than one, though, in case a server is down).
Clients (producers or consumers) make use of all servers irrespective of which servers are specified in bootstrap.servers for bootstrapping.
Kafka broker
A Kafka cluster is made up of multiple Kafka brokers. Each Kafka broker has a unique ID (number). Kafka brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have 10, 100, or 1,000 brokers if needed.
For more information, see the official documentation.

Kafka Streams read & write to separate cluster

A similar question has been answered before, but the solution doesn't work for my use case.
We run 2 Kafka clusters, one in each of 2 separate DCs. Our overall incoming traffic is split between these 2 DCs.
I'd be running a separate Kafka Streams app in each DC to transform that data, and I want to write to a Kafka topic in a single DC.
How can I achieve that?
Ultimately we'd be indexing the Kafka topic data in Druid. It's not possible to run separate Druid clusters, since we are trying to aggregate the data.
I've read that it's not possible with a single Kafka Streams app. Is there a way I can use another Kafka Streams app to read from the DC1 Kafka cluster and write to the DC2 Kafka cluster?
As you wrote yourself, you cannot use the Kafka Streams API to read from Kafka cluster A and write to a different Kafka cluster B.
Instead, if you want to move data between Kafka clusters (whether in the same DC or across DCs), you should use a tool such as Apache Kafka's MirrorMaker or Confluent Replicator.
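For intuition, a cross-cluster copy is conceptually just a consumer on the source cluster feeding a producer on the target cluster, which is essentially what MirrorMaker automates. A rough sketch under assumed topic names and addresses, omitting the offset tracking and error handling a real tool provides:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Dc1ToDc2Bridge {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "dc1-broker:9092"); // source cluster (placeholder)
        consumerProps.put("group.id", "dc1-to-dc2-bridge");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "dc2-broker:9092"); // target cluster (placeholder)
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("transformed-topic")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // Forward each record from the DC1 cluster to the DC2 cluster unchanged.
                    producer.send(new ProducerRecord<>("transformed-topic", rec.key(), rec.value()));
                }
            }
        }
    }
}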

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable?

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable from the producer, none of the ZooKeeper nodes are reachable from the producer, but all of the brokers are healthy and reachable from each other?
For example, this would be required if I were to produce messages via an SSH tunnel. If this were for a temporary push I could possibly create the topic with replication factor 1 and have all partitions assigned to the broker in question, and reassign the partitions after the fact, but I'm hoping there is a more flexible setup.
This is all using the Java client.
Producers don't interact with ZooKeeper, so that's not an issue.
The only requirement for producers is to be able to connect to the brokers that are the leaders for the partitions they want to use.
If the broker you connect to is the leader for the partitions you want to use, then yes, you can produce to it; see the sketch below.
Otherwise it's not going to work. Creating a topic may not help either, as its partitions could be assigned to any brokers. Also, in order to create a topic, a client has to connect to the controller, which may not be the broker you can reach.
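Here is a rough sketch of that idea; it assumes the tunnel exposes the reachable broker on localhost:9092 and that you know its broker ID (both are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.PartitionInfo;

public class TunnelProducer {
    public static void main(String[] args) {
        int reachableBrokerId = 1; // hypothetical ID of the broker behind the SSH tunnel
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // the tunneled broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send only to partitions whose leader is the broker we can actually reach.
            for (PartitionInfo p : producer.partitionsFor("my-topic")) {
                if (p.leader() != null && p.leader().id() == reachableBrokerId) {
                    producer.send(new ProducerRecord<>("my-topic", p.partition(), "key", "value"));
                }
            }
        }
    }
}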
If you can only connect to 1 "thing", you may want to consider using something like a REST Proxy. Your "isolated" environment could send REST requests to the proxy which is able to connect to all brokers in the cluster.