How to build a data pipeline with Kafka Streams - apache-kafka

Messages are not being posted to the output topic.
Is any cluster-level configuration required to use the Kafka Streams API? For reference, the Kafka version is newer than 0.10.0.0.

Related

Can Kafka publish messages to AWS Lambda?

I have to publish messages from a Kafka topic to Lambda, process them, and store them in a database using a Spring Boot application. I did some research and found this for consuming messages from Kafka:
public Function<KStream<String, String>, KStream<String, String>> process(){}
However, I'm not sure whether this can only publish the consumed messages to another Kafka topic, or whether it can also act as an event source for Lambda. I need some guidance on consuming Kafka messages and converting them into an event source.
Brokers do not push. Consumers always poll.
The code shown uses the Kafka Streams API, which primarily writes to new Kafka topics. While you could fire HTTP requests from it to invoke a Lambda, that's not recommended.
Alternatively, Kafka is already supported as an event source. You don't need to write any consumer code.
https://aws.amazon.com/about-aws/whats-new/2020/12/aws-lambda-now-supports-self-managed-apache-kafka-as-an-event-source/
This works with either MSK or a self-managed Kafka cluster.
"process them and store in a database"
Your Lambda could process the data and send it to a new Kafka topic using a producer. You can then use MSK Connect, or run your own Kafka Connect cluster elsewhere, to dump the records into a database. No Spring/Java code would be necessary.
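When Kafka is configured as a Lambda event source, the function receives batches of records grouped by topic-partition, and each record's value arrives base64-encoded. The following is a minimal sketch of decoding those values using only the Java standard library. The class name KafkaEventDecoder and the method decodeValues are hypothetical, and the sketch assumes the event has already been deserialized into nested maps and lists with a top-level "records" entry; wiring it into an actual Lambda handler is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any AWS or Kafka library.
final class KafkaEventDecoder {

    // Assumes `event` is the Kafka event payload already parsed into maps/lists:
    // { "records": { "<topic>-<partition>": [ { "value": "<base64>", ... }, ... ] } }
    @SuppressWarnings("unchecked")
    static List<String> decodeValues(Map<String, Object> event) {
        List<String> values = new ArrayList<>();
        Object records = event.get("records");
        if (!(records instanceof Map)) {
            return values; // nothing to decode
        }
        for (Object batch : ((Map<String, Object>) records).values()) {
            for (Map<String, Object> record : (List<Map<String, Object>>) batch) {
                Object value = record.get("value");
                if (value == null) {
                    continue; // skip records that carry no value
                }
                byte[] raw = Base64.getDecoder().decode((String) value);
                values.add(new String(raw, StandardCharsets.UTF_8));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Build a tiny fake event with one record whose value is "hello".
        String encoded = Base64.getEncoder()
                .encodeToString("hello".getBytes(StandardCharsets.UTF_8));
        Map<String, Object> record = Map.of("topic", "orders", "value", encoded);
        Map<String, Object> event =
                Map.of("records", Map.of("orders-0", List.of(record)));
        System.out.println(decodeValues(event));
    }
}
```

The decoded strings could then be processed and written to the database, or forwarded to another topic as described above.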

Integrating a Kafka Streams application with the Kafka JDBC sink connector

I am trying to use Kafka Streams for some computation and send the result to a topic, which is then written to a database by the JDBC sink connector. The result needs to be serialized using Avro with the Confluent Schema Registry. Is there any demo or guide that shows how to handle this scenario?
It's not clear what you mean by "integrate". Kafka Streams is independent of Kafka Connect, although both can be used from ksqlDB.
The existing Kafka Connect examples should be adequate; just point the connector at the output topic of your Streams application.
As for Kafka Streams, you'd need to use the Confluent Avro Serdes and add the Schema Registry URL to the StreamsConfig.
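As a rough sketch of that Streams-side configuration, the properties can be assembled with plain string keys. The class name StreamsAvroConfig, the application id, and the addresses are placeholders chosen for illustration. The sketch also assumes that kafka-streams and the Confluent kafka-streams-avro-serde artifact are on the classpath when the resulting Properties object is handed to a KafkaStreams instance.

```java
import java.util.Properties;

// Hypothetical helper that only assembles configuration; it does not start Kafka Streams.
final class StreamsAvroConfig {

    static Properties build(String appId, String bootstrapServers, String schemaRegistryUrl) {
        Properties props = new Properties();
        props.put("application.id", appId);
        props.put("bootstrap.servers", bootstrapServers);
        // String keys, Avro values (generic records) via the Confluent serde.
        props.put("default.key.serde",
                "org.apache.kafka.common.serialization.Serdes$StringSerde");
        props.put("default.value.serde",
                "io.confluent.kafka.streams.serdes.avro.GenericAvroSerde");
        // The default Avro serde reads this entry to locate the Schema Registry.
        props.put("schema.registry.url", schemaRegistryUrl);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values; replace with your own cluster and registry addresses.
        Properties props = build("my-streams-app", "localhost:9092", "http://localhost:8081");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

On the Connect side, the JDBC sink would then typically read the same output topic with the Avro converter pointed at the same Schema Registry.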

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning to send some metadata back into a Kafka topic and need to integrate a Kafka producer into the flush() method of the sink task's Java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the bootstrap servers list from the configuration when it is not specified in the connector properties for either the sink or the source? I need to use the same bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding the bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use the bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
"need to integrate a Kafka producer into the flush() function of the Sink task java code"
There is no producer instance exposed in the SinkTask API...
"Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding an overhead of running a Kafka producer before the flush)."
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
"how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?"
Sinks and sources are not workers. The bootstrap servers are set in the worker configuration; look at connect-distributed.properties.
"I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible"
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting a feature that exposes the worker configs, though.)
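A minimal sketch of that workaround follows, assuming the connector config carries an extra property that the task reads from the map passed to its start(Map<String, String>) method. The property name metadata.producer.bootstrap.servers and the class MetadataProducerConfig are hypothetical, and the string serializers are only an example choice.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: turns connector properties into Kafka producer properties.
final class MetadataProducerConfig {

    // Hypothetical custom key that would be added to the sink connector's config.
    static final String BOOTSTRAP_KEY = "metadata.producer.bootstrap.servers";

    static Properties fromConnectorConfig(Map<String, String> connectorProps) {
        String servers = connectorProps.get(BOOTSTRAP_KEY);
        if (servers == null || servers.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing connector property: " + BOOTSTRAP_KEY);
        }
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", servers.trim());
        // Example serializers; pick ones that match the metadata being sent.
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return producerProps;
    }

    public static void main(String[] args) {
        // Simulates the map a task would receive in start().
        Properties props = fromConnectorConfig(
                Map.of(BOOTSTRAP_KEY, "broker1:9092,broker2:9092"));
        System.out.println(props.getProperty("bootstrap.servers"));
    }
}
```

The resulting Properties could be passed to a KafkaProducer created once in the task's start() method rather than on every flush(), which keeps the extra work on the offset-commit path small.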

Apache Kafka spout is not working on Consumer Side

I am trying to integrate MongoDB and Storm-Kafka. The Kafka producer publishes data from MongoDB, but the consumer side fails to fetch it.
Kafka version: 0.10.*
Storm version: 1.2.1
Do I need to add any functionality to the consumer?

How to configure Flume with Kafka channel without source?

It complains if a source is not specified in the configuration. According to the docs:
The Kafka channel can be used for multiple scenarios:
1. With Flume source and sink - it provides a reliable and highly available channel for events
2. With Flume source and interceptor but no sink - it allows writing Flume events into a Kafka topic, for use by other apps
3. With Flume sink, but no source - it is a low-latency, fault tolerant way to send events from Kafka to Flume sinks such as HDFS, HBase or Solr
https://flume.apache.org/FlumeUserGuide.html
I'm interested in scenario 3; however, there is no example of it in the official Flume docs.
Regards
The Flume agent source can be omitted from the Flume config on newer versions of CDH (5.14 in my case). Only a warning is issued.
You can also provide a dummy name for the source, for example:
agent.sources = dummySource
agent.sinks = hdfsSink
agent.channels = kafkaChnl
and then provide configuration only for hdfsSink and kafkaChnl.