Load data from a separate Kafka cluster into Samza?

I am trying to create a Samza job that resembles the Wikipedia example job as closely as possible. However, in the "WikipediaFeed" object I am trying to get data from a different Kafka broker than the one that is running when you start the Hello-Samza grid.
Do I have to create a thread-safe Kafka consumer inside the "WikipediaFeed" object to consume data from a different Kafka cluster, or is there another way I'm not seeing?
Edit 1:
Here is a link to their Wikipedia example.
https://github.com/apache/samza-hello-samza/tree/master/src/main
Thanks

In your example you need to change this config (https://github.com/apache/samza-hello-samza/blob/master/src/main/config/wikipedia-feed.properties):
systems.kafka.consumer.zookeeper.connect=KAFKA_CLUSTER_FRONTING:2181
systems.kafka.producer.bootstrap.servers=KAFKA_CLUSTER_FRONTING:9092
task.inputs=kafka.topic1,kafka.topic2,kafka.topic3
Change the config to point at your fronting Kafka cluster, and add your topics to task.inputs separated with ",".
Edit:
Just to be clear, you can deploy your Samza job onto cluster 1 and consume a Kafka topic from another cluster. You only need to change the config in your Samza properties.
For more information, see the Samza config documentation.
Then, if you need to send your messages to another Kafka cluster after processing, you will need to define another system in your config; see the sketch after the link below.
For more information: https://samza.apache.org/learn/documentation/0.13/api/overview.html
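As a rough sketch (not taken from the Hello Samza project), the job properties could define a second Kafka system for the output cluster; the alias kafka2 and the host names below are placeholders:
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=KAFKA_CLUSTER_FRONTING:2181
systems.kafka.producer.bootstrap.servers=KAFKA_CLUSTER_FRONTING:9092
systems.kafka2.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka2.producer.bootstrap.servers=KAFKA_CLUSTER_DEST:9092
In the task code you would then send output to a SystemStream such as new SystemStream("kafka2", "output-topic").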

Related

Using Kafka Connect in distributed mode, where are internal topics supposed to exist

As a follow-up to my previous question here, Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster I am consuming data from? For context: someone else has a Kafka cluster set up with topics for me to consume. I spin up a Kafka Connect cluster on my local machine (to test), and this local instance (we'll call it that going forward) connects to the remote Kafka cluster (we'll call it the remote cluster) by way of me entering the bootstrap servers, some callback handler classes, and a connect.jaas file.
Do these three topics need to already exist on the remote cluster? So far I have been trying to create them on my own broker in my local instance, but from continued research it seems these three internal topics may need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where would they create them exactly? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on what data you actually want the connector tasks to send/receive. Individual Connect tasks cannot override what brokers are being used (it is not possible to use a source connector to write to multiple Kafka clusters, for example).
Recent versions of Kafka Connect will automatically create those internal topics, if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors. (A sketch of the relevant worker properties is below.)
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or normal producer or consumer. It doesn't store topics itself.
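As a sketch, a distributed Connect worker config pointed at the remote cluster might look like the following; the host names, topic names, and replication factors are placeholders, and the worker will create these internal topics on whatever cluster bootstrap.servers points to, if it is allowed to:
bootstrap.servers=remote-kafka-1:9092,remote-kafka-2:9092
group.id=my-connect-cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter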

Kafka HA + Flume: how can I use the Kafka HA configuration with Flume?

Environment
Apache Kafka 2.7.0
Apache Flume 1.9.0
Problem
Currently, in our architecture, we are using Flume with a Kafka channel (no source) and an HDFS sink.
In the future, we are going to build a Kafka HA setup using Kafka MirrorMaker.
So, even if one cluster goes down, I want the agent to connect to the other cluster so that the failure causes no problems.
To do this, I think we need to subscribe to topics with a regex pattern in Flume.
Assume that cluster A and cluster B exist, and both clusters have a topic called ex. MirrorMaker copies ex in each direction, so cluster A has topics ex and b.ex, and cluster B has topics ex and a.ex.
For example, while reading ex and b.ex from cluster A, if there is a failure, the agent would switch to the opposite cluster and read ex and a.ex.
Like below.
test.channel = c1 c2
c1.channels.kafka.topics.regex = .*e (impossible in kafka channel)
...
c1.source.kafka.topics.regex = .*e (possible in kafka source)
In the case of the Flume Kafka source, there is a property to read topics matching a regex pattern, but this property does not exist for the Kafka channel.
Is there any good way?
I'd appreciate it if you could suggest a better way. Thank you.
Sure, using a regex or simply a list of both topics would be preferred, but you then end up with data split across different directories based on the topic name, leaving HDFS clients to merge the data back together.
A Kafka channel includes a producer, which is why a regex isn't possible.
By going to the opposite cluster
There's no way Flume will automatically do that unless you modify its bootstrap servers config and restart it. The same applies for any Kafka client, really... This isn't exactly what I'd call "highly available", because all clients pointing to the down cluster will experience downtime.
Instead, you should be running a Flume pipeline (or Kafka Connect) against each cluster. That being said, MirrorMaker would then only be making extra copies of your data, or allowing clients to consume data from the other cluster for their own purposes, rather than acting as a backup/fallback.
Aside: it's unclear from the question, but make sure you are using MirrorMaker 2, which would imply you are already running Kafka Connect and can therefore install the HDFS sink connector rather than needing Flume.
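For reference, a rough sketch of an agent that uses the Kafka source (which supports kafka.topics.regex) with a memory channel and an HDFS sink, instead of a Kafka channel; the agent name, broker addresses, regex, and HDFS path are placeholders:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = clusterA-broker1:9092
a1.sources.r1.kafka.topics.regex = .*ex$
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%{topic}
a1.sinks.k1.channel = c1
Note that %{topic} in the HDFS path splits the output by topic name, which is the directory-per-topic trade-off mentioned above.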

Disable mirrormaker2 offset-sync topics on source kafka cluster

We're using MirrorMaker2 to replicate some topics from one kerberized Kafka cluster to another Kafka cluster (strictly unidirectional). We don't control the source Kafka cluster, and we're given only access to describe and read the specific topics that are to be consumed.
MirrorMaker2 creates and maintains a topic (mm2-offset-syncs) in the source cluster to encode cluster-to-cluster offset mappings for each topic-partition being replicated and also creates an AdminClient in the source cluster to handle ACL/Config propagation. Because MM2 needs authorization to create and write to these topics in the source cluster, or to perform operations through AdminClient, I'm trying to understand why/if we need these mechanisms in our scenario.
My question is:
In a strictly unidirectional scenario, what is the usefulness of this source-cluster offset-sync topic to Mirrormaker?
If indeed it's superfluous, is it possible to disable it or operate mm2 without access to create/produce to this topic?
If ACL and Config propagation is disabled, is it safe to assume that the AdminClient is not used for anything else?
In the MirrorMaker code, the offset-syncs topic is created by MirrorSourceConnector when it starts and is then maintained by MirrorSourceTask. The same happens with the AdminClient in MirrorSourceConnector.
I have found no way to toggle off these features, but honestly I might be missing something in my line of thought.
There is an option introduced in Kafka 3.0 that makes MM2 create and operate on the mm2-offset-syncs topic in the target cluster instead of the source cluster.
Thanks to KIP-716: https://cwiki.apache.org/confluence/display/KAFKA/KIP-716%3A+Allow+configuring+the+location+of+the+offset-syncs+topic+with+MirrorMaker2
JIRA and pull request:
https://issues.apache.org/jira/browse/KAFKA-12379
https://github.com/apache/kafka/pull/10221
Tim Berglund noted this KIP-716 in Kafka 3.0 release: https://www.youtube.com/watch?v=7SDwWFYnhGA&t=462s
So, to make MM2 operate on the mm2-offset-syncs topic in the target cluster, you should (see the combined mm2.properties sketch after these notes):
set option src->dst.offset-syncs.topic.location = target
manually create mm2-offset-syncs.dst.internal topic in the target cluster
start MM2
src and dst are example cluster aliases; replace them with yours.
Keep in mind: if the mm2-offset-syncs.dst.internal topic is not created manually in the target cluster, then MM2 still tries to create this topic in the source cluster.
In the case of a one-directional replication process this topic is useless, because it stays empty all the time, but MM2 requires it anyway.
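Putting the steps together, a minimal mm2.properties sketch might look like this (the bootstrap servers and topic list are placeholders; src and dst are the example aliases above):
clusters = src, dst
src.bootstrap.servers = src-broker1:9092
dst.bootstrap.servers = dst-broker1:9092
src->dst.enabled = true
src->dst.topics = topic1, topic2
# KIP-716, Kafka 3.0+: keep the offset-syncs topic in the target cluster
src->dst.offset-syncs.topic.location = target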

Is a Kafka topic linked with ZooKeeper, and if ZooKeeper is changed will the topic disappear?

I was working with Kafka. I downloaded ZooKeeper, extracted it, and started it.
Then I downloaded Kafka, extracted the zipped file, and started Kafka. Everything was working well. I created a few topics and I was able to send and receive messages. After that I stopped Kafka and ZooKeeper. Then I read that Kafka itself provides a ZooKeeper, so I started the ZooKeeper that was provided with Kafka. However, the data directory for it was different. I then started Kafka from the same configuration file and the same data directory location, but after starting Kafka I could not find the topics that I had created.
I just want to know: does this mean the metadata about the topics is maintained by ZooKeeper? I searched the Kafka documentation, but I could not find anything in detail.
https://kafka.apache.org/documentation/
Check this documentation provided by Confluent. According to it, Apache Kafka® uses ZooKeeper to store persistent cluster metadata, and ZooKeeper is a critical component of a Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to brokers and topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.
So, the answer to your question is yes: the purpose of ZooKeeper is to store relevant metadata about the Kafka brokers, topics, etc.
Also, since you have just started working with Kafka and ZooKeeper, I would like to mention this: by default, Kafka stores its data in a temp location which gets deleted on system reboot, so you should change that as well (a sketch of the relevant settings follows below).
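As a sketch, the data directories can be changed in the bundled config files; the paths below are placeholders:
# config/server.properties
log.dirs=/var/lib/kafka-logs
# config/zookeeper.properties
dataDir=/var/lib/zookeeper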
The answer to your question is yes.
1) Initially you started the standalone ZooKeeper from the zip file and then stopped it, which means the topics you created, whose metadata was stored in that standalone ZooKeeper, are lost. Your persistent cluster metadata related to Kafka is now gone.
2) The second time, you started the ZooKeeper from the package that comes along with Kafka; this new ZooKeeper instance does not have any information about the topics you created previously, so you need to create them again.
3) Suppose, in case 1, you close the terminal and start the standalone ZooKeeper again: you do not need to create the topics again. But if you stopped the standalone ZooKeeper server, then the topics are lost.
In simple terms: you created two separate ZooKeeper instances, and topic metadata is not shared between them.

Is there a way to replicate kafka topics from one cluster to another using java?

I need to replicate Kafka topics from one cluster to another using a Java process. New messages arriving at the source cluster should also be replicated to the destination cluster. Is there any simple Java code to do this?
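For illustration only, a minimal consume-and-forward sketch using the plain Kafka Java clients; the broker addresses, topic name, and group id are placeholders, and this does not preserve offsets, headers, or partitioning the way MirrorMaker does:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicReplicator {
    public static void main(String[] args) {
        // Consumer pointed at the source cluster (placeholder address).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-broker:9092");
        consumerProps.put("group.id", "topic-replicator");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Producer pointed at the destination cluster (placeholder address).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "dest-broker:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                // Read new messages from the source cluster and forward them
                // to the same topic name on the destination cluster.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
                }
                producer.flush();
            }
        }
    }
}

In practice, MirrorMaker 2 (which ships with Kafka) already does this, including continuous replication of new messages, so running it as a process is usually simpler than maintaining custom code.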