Kafka - mirror from one server to another server - apache-kafka

I'm trying to mirror Kafka real-time data from one server to another server.
Found a tool called 'Mirror Maker' at Apache website.
[Apache Kafka Mirror Maker][1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330#Kafkamirroring(MirrorMaker)-Howtosetupamirror
But the description is too simple, without configuration detail of consumer.config & producer.config files.
How to deploy this tool with correct config files for mirroring data between two Kafka servers for:
Real-time topics transfer
fault-tolerant under unstable network
Any other possible way to achieve this is also very welcome.
Thanks.

Apache's Mirror Maker is a relatively simple tool, this is why all the documentation you read doesn't seem incredible informative. There isn't much information to give. All it does is consume data from cluster A and produce to cluster B. Mirror Maker does provide real-time topic transfer but as far as stability goes you may need to look to third party tools such as https://eng.uber.com/ureplicator/.

Related

Is it possible to run MirrorMaker in Kafka without using Kafka Connect?

Looking to come up with solution that would mirror or replicate one Kafka environment without needing Kafka Connect. Having a hard time coming up with any possible solutions or workarounds. Very new to Kafka, would appreciate any thoughts and/or guidance!
MirrorMaker2 is based on Kafka Connect. The original MirrorMaker is not, however it is not recommended to use this anymore as it's not very fault tolerant.
Most Kafka replication solutions are built on Kafka Connect (Confluent Replicator as another example)
Uber uReplicator mentioned in the comments is built on Apache Helix and requires a Zookeeper connection, which Kafka Connect does not, so ultimately depends on what access and infrastructure you have available
Since Kafka comes with the Connect API and MirrorMaker2 pre-installed, there should be little reason to find alternatives unless it absolutely doesn't work for your use case (which is...?)

Kafka Connect instead of Flume Ingestion

I have been looking into the concepts and application of Kafka Connect, and I have even touched one project based on it in one of my intern. Now in my working scenario, now I am considering replacing the architecture of the our real time data ingestion platform which is currently based on flume -> Kafka with Kafka Connect and Kafka.
The reason why I am considering the switch can be concluded mainly into:
But if we use flume we need to install the agent on each remote machine which generates tons of workload for further devops, especially at the place where I am working where the authority of machines is managed in a rigid way that maintaining utilities on machines belonging to other departments.
Another reason for the consideration is that the machines' os environment varies, if we install flumes on a variety of machines , some machine with different os and jdks(I have met some with IBM jdk) just cannot make flume work well which in worst case can result in zero data ingestion.
It looks with Kafka Connect we can deploy it in a centralized way with our Kafka cluster so that the develops cost can go down. Beside, we can avoid installing flumes on machines belonging to others and avoid the risk of incompatible environment to ensure the stable ingestion of data from every remote machine.
Besides, the most ingestion scenario is only to ingest real-time-written log text file on remote machines(on linux and unix file system) into Kafka topics, that is it. So I won't need advanced connectors which is not supported in apache version of Kafka.
But I am not sure if I am understanding the usage or scenario of Kafka Connect the right way. Also I am wondering if Kafka Connect should be deployed on the same machine with the data source machines or if it is ok they resides on different machines. If they can be different then why flume requires the agent to be run on the same machine with the data source? So I wish someone more experienced can give me some lights on that.
Is Kafka Connect appropriate for ingesting data to Kafka? yes
Does Kafka Connect run local to the data source? only if it has to (e.g. reading a local file with Kafka Connect spooldir plug, FilePulse plugin, etc ).
Should you rip out something that works and replace it with Kafka Connect? not unless it's fixing a problem that you have
If you're not using either yet, should you use Kafka Connect instead of Flume? Quite possibly.
Learn more about Kafka Connect here: https://dev.to/rmoff/crunchconf-2019-from-zero-to-hero-with-kafka-connect-81o
For file ingest alone there's other tools too like Filebeat too

Kafka and IIDR CDC

I am trying to build a CDC pipeline using : DB2--IBM CDC --Kafka
and I am trying to figure out the right way to setup this .
I tried below things -
1.Setup a 3 node kafka cluster on linux on prem
2.Installed IIDR CDC software on linux on prem using - setup-iidr-11.4.0.1-5085-linux-x86.bin file . The CDC instance is up and running .
The various online documentation suggest to install 'IIDR management console ' to configure the source datastore and CDC server configuration and also Kafka subscription configuration to build the pipeline .
Currently I do not have the management console installed .
Few questions on this -
1.Is there any alternative to IBM CDC management console for setting up the kafka-CDC pipeline ?
2.How can I get the IIDR management console ? and if we install it on our local windows dekstop and try to connect to CDC/Kafka which are on remote linux servers, will it work ?
3.Any other method to setup the data ingestion IIDR CDC to Kafka ?
I am fairly new to CDC/ IIDR , please help !
I own the development of the IIDR Kafka target for our CDC Replication product.
Management Console is the best way to setup the subscription initially. You can install it on a windows box.
Technically I believe you can use our scripting language called CHCCLP to setup a subscription as well. But I recommend using the GUI.
Here are links to our resources on our IIDR (CDC) Kafka Target. Search for the "Kafka" section.
"https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/IIDR%20Wiki"
An example of setting up a subscription and replicating is this video
https://ibm.box.com/s/ur8jokg6tclsx5fcav5g86a3n57mqtd5
Management console and access server can be obtained from IBM fix central.
I have installed MC/Access server on my VM and on my personal windows box to use it against my linux VMs. You will need connectivity of course.
You can definitely follow up with our Support and they'll be able to sort you out. Plus we have docs in our knowledge centre on MC starting here.... https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/overview_of_cdc.html
You'll find our Kafka target is very flexible it comes with five different formats to write data into Kafka, and you can choose to capture data in an audit format, or the Kafka compaction compatible key, null for a delete method.
Additionally you can even use the product to write several records to several different topics in several formats from a single insert operation. This is useful if some of your consumer apps want JSON and others Avro binary. Additionally you can use this to put all the data to more secure topics, and write out just some of the data to topics that more people have access to.
We even have customers who encrypt columns in flight when replicating.
Finally the product's transformations can be parallelized even if you choose to only use one producer to write out data.
Actually one more finally, we additionally provide the option to use a special consumer which produces database ACID semantics for data written into Kafka and shred across topics and partitions. It re-orders it. we call it the transactionally consistent consumer. It provides operation order, bookmarks for restarting applications, and allows parallelism in performance but ordered, exactly once, deduplicated consumption of data.
From my talk at the Kafka Summit...
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions

Monitoring UI for Apache kafka - kafka manager vs kafka monitor [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I am new to kafka. We want to monitor and manage kafka topics. We tried different open source monitoring tools like
kafka-monitor
kafka-manager
Both tools are good. But we are unable to make a decision which should be included in our deployment stack. Which one is better and why, and in which scenario?
'kafka manager' from yahoo looks the older one and 'kafka monitor' from LinkedIn is newer one
Kafka Monitor-
Lenses
Lenses (ex Landoop) enhances Kafka with User Interface, streaming SQL engine and cluster monitoring. It enables faster monitoring of Kafka data pipelines.
They provide a free all-in-one docker (Lenses Box) which can serve a single broker for up to 25M messages. Note that this is recommended for development environments.
Cloudera SMM
Streams Messaging Manager is the solution for monitoring and managing clusters running Cloudera or Hortonworks kafka. It also comes with replication capability.
Confluent
Another option is Confluent Enterprise which is a Kafka distribution for production environments. It also includes Control Centre, which is a management system for Apache Kafka that enables cluster monitoring and management from a User Interface.
Yahoo CMAK (Cluster Manager for Apache Kafka, previously known as Kafka Manager)
Kafka Manager or CMAK is a tool for monitoring Kafka offering less functionality compared to the aforementioned tools.
KafDrop
KafDrop is a UI for monitoring Apache Kafka clusters. The tool displays information such as brokers, topics, partitions, and even lets you view messages. It is a lightweight application that runs on Spring Boot and requires very little configuration.
LinkedIn Burrow
Burrow is a monitoring companion for Apache Kafka that provides consumer lag checking as a service without the need for specifying thresholds. It monitors committed offsets for all consumers and calculates the status of those consumers on demand. An HTTP endpoint is provided to request status on demand, as well as provide other Kafka cluster information. There are also configurable notifiers that can send status out via email or HTTP calls to another service.
Kafka Tool
Kafka Tool is a GUI application for managing and using Apache Kafka clusters. It provides an intuitive UI that allows one to quickly view objects within a Kafka cluster as well as the messages stored in the topics of the cluster. It contains features geared towards both developers and administrators.
If you cannot afford licenses, then go for Yahoo Kafka Manager, LinkedIn Burrow or KafDrop. Confluent's and Landoop's products are the best out there, but unfortunately, they require licensing.
For more details, you can refer to my blog post Overview of UI Monitoring tools for Apache Kafka Clusters.
If you want to pay for licensing and Kafka cluster support, then you can use Confluent Control Center
Alternatively, the free route would be to use JMX exporters from Datadog and/or Prometheus/Influxdb (with Grafana dashboards) to see overall system health checks (CPU, network, memory, etc)... Much more information than what you get only by monitoring Kafka processes with Kafka tools
At my company, we used the Yahoo product, we investigated the LinkedIn product, and several others mentioned. My company ultimately chose to use Prometheus+Grafana. Everyone loves it and I'd highly recommend it.
There are two big advantages to Prometheus+Grafana. The first is it does full featured Kafka metrics ingestion+visualization+alerting but it's not limited to Kafka. While our initial needs were just to monitor Kafka, we also wanted metrics on HTTP servers+traffic, server utilization (cpu/ram/disk), and custom application level metrics. Prometheus handles all of the above. Secondly, Prometheus + Grafana are very high quality, well designed, and easy to use. A lot of other products in this space are old and complicated to work with. Prometheus + Grafana are both excellent to work with, they are very customizable, polished, and easy to use. Grafana has a very flashy + functional JavaScript interface that lets you make exactly the customized dashboards that you want. Prometheus has a very polished metric collection engine, storage engine, query language, and alerting system. Something like Yahoo Kafka Manager has much more limited functionality in all of these categories.
If you want to try Prometheus, you need to do two things:
1) install+configure the JMX->Prometheus exporter on your Kafka brokers:
https://github.com/prometheus/jmx_exporter
2) Setup a Prometheus server to collect metrics + and setup a Grafana dashboard to display the graphs that you want.
I'd also say that this is just for monitoring+dashboards+alerting. For management functions, you still need other tools.
The kafka-monitor is (despite the name) a load generation and reporting tool. Yahoo's kafka-manager is an overall monitoring tool.

ActiveMQ Deployment model

I have gone through (not fully) ActiveMQ and tried to figure out the deployment model for my application.
I am bit confused on that.
I want to make the system High Availability and decided to use the following. Please correct me if anything is wrong or disadvantage of the model.
Deployment Modle:
Will deploy Brokers in M1 and M2 respectivley.
Use Hardware load balancer (Either F5 or Zeus) to connect either one of the broker (M1 or M2) based on the load.
Want to publish a message using Load balancer URL.
I have gone through network of brokers and we need to mintain some topology. I fell which makes the system more complicated if system grows horizontally. So it is better to have one load balancer to distribute the load.
Questions
Is this above model will send message to any one of the Broker?
Consumer Will be deployed in Tomcat (Think i need to use embeded brokers to configure either M1 or M2). Is it possible to use Load balancer URL instaed of M1 or M2?
Is it possible to have single Web Console Admin to monitor both M1 and M2.
Do we have any performance issue using Spring's feature to consume message.
Sorry to shoot out so many questions. Please help me to correct the deployment model.
I think the best way to get some load balancing with some activemq servers is having a : network of brokers and your consumers/producers (in your webapps) should use some failover
So if a producer p1 send a message on a queue on broker 1, the consumer c1 can read the message on broker 2.
[Edit] I have never tried to add some hardware balancer instead of the activemq protocole failover. It should work : just try it and tell us.
3- I do not think it is possible to have only one Web Console to monitor both of your brokers.
4- As far as I am concerned I do not have any performance issue with my Spring configuration.
There are a lot of questions there.
The first thing is to do is start simple. If your application's load is being handled with just one broker, consider setting up high availability through a master-slave setup. For this you do not need a load balancer - the ActiveMQ client library has a failover mechanism where you can define the URLs to a set of brokers that the client should attempt to connect to.
If you are looking at setting up an infrastructure where one broker will not be able to deal with the message load (you can test the maximum throughput of your broker using the performance module), I would advise you to read up on how networks of brokers work. If you do go down this path, you really need to understand ActiveMQ.
On monitoring, a web console can only show you the internals of a single broker. To get insight around what is going on around a set of brokers you will need a monitoring tool such as FuseHQ/Hyperic that is able to aggregate JMX information from a number of boxes.
Performance with Spring is not a problem as long as you configure it correctly (see the section on PooledConnectionFactory).
I see that you are a new user, so if this answers your question, please tick it.