How to start the Kafka server programmatically - apache-kafka

We are trying to start the Kafka server (ZooKeeper and broker) programmatically from our application. Is there any API or library available for this?

Yes, you can use embedded Kafka, which will run ZooKeeper and a Kafka broker for you. It is generally used for testing Kafka producers and consumers when there is no need to run the servers explicitly.
To run it, call EmbeddedKafka.start() at the beginning and EmbeddedKafka.stop() at the end.
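The EmbeddedKafka.start()/stop() calls above appear to come from the Scala embedded-kafka library. If you are writing Java, one option is the EmbeddedKafkaBroker from Spring's spring-kafka-test module, which does the same job. A minimal sketch, assuming spring-kafka-test 2.x is on the classpath (the broker count and port are illustrative):

```java
import org.springframework.kafka.test.EmbeddedKafkaBroker;

public class EmbeddedKafkaExample {

    public static void main(String[] args) throws Exception {
        // One in-process broker plus an in-process ZooKeeper;
        // port 9092 is an assumption for this sketch.
        EmbeddedKafkaBroker broker = new EmbeddedKafkaBroker(1)
                .kafkaPorts(9092);

        broker.afterPropertiesSet(); // starts ZooKeeper and the broker
        System.out.println("Brokers at: " + broker.getBrokersAsString());

        // ... run your producers/consumers against the embedded broker ...

        broker.destroy(); // shuts everything down again
    }
}
```

Keep in mind this is intended for tests; for anything long-lived you would normally run a real broker.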

Related

Is it possible to run MirrorMaker in Kafka without using Kafka Connect?

Looking to come up with a solution that would mirror or replicate one Kafka environment without needing Kafka Connect. Having a hard time coming up with any possible solutions or workarounds. Very new to Kafka; would appreciate any thoughts and/or guidance!
MirrorMaker 2 is based on Kafka Connect. The original MirrorMaker is not; however, it is no longer recommended because it is not very fault tolerant.
Most Kafka replication solutions are built on Kafka Connect (Confluent Replicator being another example).
Uber's uReplicator, mentioned in the comments, is built on Apache Helix and requires a ZooKeeper connection, which Kafka Connect does not, so it ultimately depends on what access and infrastructure you have available.
Since Kafka ships with the Connect API and MirrorMaker 2 pre-installed, there should be little reason to look for alternatives unless it absolutely doesn't work for your use case (which is...?)
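For completeness, the legacy MirrorMaker does run without Connect and ships with the Kafka distribution. A minimal invocation sketch (the two .properties file names and the whitelist pattern are placeholders; the consumer config points at the source cluster, the producer config at the target):

```bash
bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --whitelist ".*"
```

Just keep the fault-tolerance caveat above in mind before relying on it.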

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1, which bundles Kafka 2.0.0, and are trying to use Kafka Connect in distributed mode to launch a Google Cloud Pub/Sub sink connector.
We are planning on sending some metadata back into a Kafka topic and need to integrate a Kafka producer into the flush() function of the sink task's Java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the bootstrap servers list from the configuration when it is not specified in the connector properties for either the sink or the source? I need to use the same bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding the bootstrap server list as a property and parsing it in the connector's Java code. I would like to use the bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the sink task's Java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes run slower overall.
how does Kafka Connect get the bootstrap servers list from the configuration when it is not specified in the connector properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
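In other words, bootstrap.servers is a worker-level setting; a distributed worker config typically contains something like this (values illustrative):

```properties
# connect-distributed.properties -- the worker config, not the connector config
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```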
I would like to use the bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting such a feature of exposing the worker configs, though.)
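A rough sketch of that approach, assuming a hypothetical metadata.bootstrap.servers connector property and a hypothetical metadata topic name (neither is part of the Connect API; you build and close your own producer inside the task):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class PubSubSinkTaskWithMetadata extends SinkTask {

    private KafkaProducer<String, String> producer;

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // "metadata.bootstrap.servers" is a custom property we add to the
        // connector config, since the worker's own list is not exposed here.
        Properties p = new Properties();
        p.put("bootstrap.servers", props.get("metadata.bootstrap.servers"));
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(p);
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // ... deliver the records to Cloud Pub/Sub here ...
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        // Anything done here delays the offset commit, so keep it cheap.
        producer.send(new ProducerRecord<>("sink-metadata-topic",
                "flushed " + offsets.size() + " partitions"));
    }

    @Override
    public void stop() {
        if (producer != null) {
            producer.close();
        }
    }
}
```

The work in flush() runs before the offset commit completes, which is exactly the overhead trade-off discussed above.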

Ingest Streaming Data to Kafka via HTTP

I am very new to Kafka and streaming data in general. What I am trying to do is ingest data, which is to be sent via HTTP, into Kafka. My research has brought me to the Confluent REST Proxy, but I can't get it to work.
What I currently have is Kafka running with a single node and a single broker, along with Kafka Manager, in Docker containers.
Unfortunately, I can't run the full Confluent Platform with Docker since I don't have enough memory available on my machine.
In essence, my question is: how do I set up a development environment where data is ingested into Kafka over HTTP?
Any help is highly appreciated!
You don't need the "full Confluent Platform" (KSQL, Control Center, and so on included).
ZooKeeper, Kafka, the REST Proxy, and optionally the Schema Registry should only take up to 4 GB of RAM in total. If you don't even have that, then you'll need to buy more RAM.
Note that ZooKeeper and Kafka do not need to be running on the same machines as the Schema Registry or REST Proxy, so if you have multiple machines, you can save some resources that way as well.
To run one Kafka broker, ZooKeeper, and the Schema Registry, 1 GB is usually enough (in dev).
If for some reason you do not want to use the Confluent REST Proxy, you can write your own. It's quite straightforward: on each request, parse the incoming JSON, validate the data, construct your message (in Avro?) and produce it to Kafka.
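A bare-bones sketch of such a bridge in Java, using only the JDK's built-in HTTP server and the Kafka client (the port, path, and topic name are placeholders, and the JSON validation is left out):

```java
import com.sun.net.httpserver.HttpServer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class HttpToKafkaBridge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/ingest", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Real code should validate the JSON before producing.
                producer.send(new ProducerRecord<>("ingest-topic", body));
            }
            exchange.sendResponseHeaders(204, -1); // 204 No Content, empty body
            exchange.close();
        });
        server.start();
        System.out.println("POST data to http://localhost:8080/ingest");
    }
}
```

This is fine for a dev environment; the REST Proxy adds things such as batching, Avro support, and consumer endpoints that you would otherwise have to build yourself.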
In this article, you'll find some configuration to limit the Kafka and ZooKeeper heap memory: https://medium.com/@saabeilin/kafka-hands-on-part-i-development-environment-fc1b70955152
Here you can read how to produce/consume messages with Python:
https://medium.com/@saabeilin/kafka-hands-on-part-ii-producing-and-consuming-messages-in-python-44d5416f582e
Hope these help!

Automatically deploy Kafka Streams app

I'm just starting with Kafka and Kafka Streams applications. I wrote a Kafka Streams app that consumes from one topic, processes the messages, and sends them to another topic.
To the best of my knowledge, the only ways I have found to run the Kafka Streams app I coded are:
Run the Java class from the IDE.
Generate a *.jar file and run it from the command prompt.
I would like to know if there is any way to run Kafka Streams applications automatically on Kafka server startup. For example: copy the *.jar file to some folder of my Kafka installation, and have this streams app run automatically when I start my Kafka server.
Your Kafka broker (server) and your Kafka Streams application are independent of one another. You can start them however you manage processes on your server, whether that's something like init.d or systemd, or container-based solutions like Docker or Kubernetes.
In my experience, if your streams application starts well before your broker or ZooKeeper, then it may time out waiting for them to come online. So you may need to configure the streams process to restart in such a situation.
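For example, a minimal systemd unit for the packaged jar might look like this (the service name and paths are placeholders):

```ini
# /etc/systemd/system/streams-app.service
[Unit]
Description=My Kafka Streams application
After=network.target

[Service]
ExecStart=/usr/bin/java -jar /opt/streams-app/streams-app.jar
# Restart if the app exits, e.g. because the broker was not up yet
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enabling it with systemctl enable --now streams-app then starts the app on boot, independently of when the broker comes up.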

Are there any Apache Kafka consumer lag checkers?

My Kafka consumers commit their offsets to Kafka (instead of ZooKeeper), so I cannot use Kafka Manager.
Burrow is great; however, I cannot use Go in our production environment. :(
So I'm wondering: are there any Apache Kafka consumer lag checkers besides the above two? I Googled it but didn't find much useful information. Thanks in advance!
You could use Remora: https://github.com/zalando-incubator/remora. It's an application that can be deployed alongside your Kafka cluster.
Not exactly the same, but it can be used to monitor lag:
https://github.com/quantifind/KafkaOffsetMonitor
There is also the records-lag-max JMX metric, available on every Kafka consumer instance.
You can monitor it either directly from your application, by accessing the MBean server, or remotely over JMX.
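For instance, from inside the consumer's own JVM you can read it off the platform MBean server; a sketch, where the client-id value ("my-consumer") is a placeholder that must match the consumer's configured client.id:

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConsumerLagCheck {

    // Reads records-lag-max for one consumer running in this JVM.
    public static double recordsLagMax(String clientId) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
                "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=" + clientId);
        return (Double) server.getAttribute(name, "records-lag-max");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("records-lag-max: " + recordsLagMax("my-consumer"));
    }
}
```

For remote monitoring, start the consumer JVM with the usual com.sun.management.jmxremote.* flags and point a JMX client (or your metrics agent) at the same MBean.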