User topic management using Kafka Stream Processor API - apache-kafka

I have just started getting my hands dirty with Kafka. I have gone through this. It only covers data/topic management for the Kafka Streams DSL. Can anyone share a link for the same sort of data management for the Processor API of Kafka Streams? I am especially interested in user and internal topic management with the Processor API.
TopologyBuilder builder = new TopologyBuilder();
// add the source processor node that takes Kafka topic "source-topic" as input
builder.addSource("Source", "source-topic");
From where do I populate this source topic with input data before the stream processor starts consuming it?
In short, can we write to the Kafka "Source" topic using Streams, the same way a producer writes to a topic? Or is Streams only for parallel consumption of a topic?
I believe we should be able to, since "Kafka's Streams API is built on top of Kafka's producer and consumer clients".

Yes, you have to use a KafkaProducer to generate input for the source topics that feed the KStream.
But the intermediate topics can be populated via
KStream#to
KStream#through
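For the source topic itself, a plain producer is enough. Below is a minimal sketch that writes a few records into "source-topic" before the Streams application starts; the broker address, serializers, and record contents are assumptions to adjust for your setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SourceTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Write a handful of records into the topic the topology reads from
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("source-topic", "key-" + i, "value-" + i));
            }
            producer.flush();
        }
    }
}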

You can use JXL (the Java Excel API) to write a producer that reads an Excel file and writes its rows to a Kafka topic.
Then create a Kafka Streams application to consume that topic and produce to another topic.
In a Processor you can use context.topic() to get the topic the current record was read from,
and then branch with if statements inside process() to run the logic for that topic.
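A minimal sketch of that branching idea, assuming the (older) AbstractProcessor helper from the Processor API; the topic names are hypothetical.

import org.apache.kafka.streams.processor.AbstractProcessor;

public class MultiTopicProcessor extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        String topic = context().topic(); // topic the current record was read from
        if ("excel-topic".equals(topic)) {        // hypothetical topic fed from the Excel file
            // ... processing logic for Excel-derived records
        } else if ("other-topic".equals(topic)) { // hypothetical second source topic
            // ... processing logic for the other source
        }
        context().forward(key, value); // pass the record downstream
    }
}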

Related

Filter kafka message from topic

I have a Spring Batch job which writes messages to a Kafka topic. Now I need to test whether specific messages are present in the Kafka topic. How do I achieve this using Spring Boot?
I am anticipating using Kafka Streams; will this be useful?
Searching Kafka topics requires consuming the whole topic. Ideally, you'd dump the topic into an indexed database instead, and search that.
You can use Kafka Streams, but it doesn't return true/false and stop on single messages.
Start a regular consumer. You can use spring-kafka with @KafkaListener, and then write an if statement to check the message content.
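A minimal spring-kafka sketch of that idea; the topic name, group id, and expected payload are placeholders.

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MessageChecker {

    // Topic name and group id are hypothetical; adjust them to your setup
    @KafkaListener(topics = "batch-output-topic", groupId = "message-checker")
    public void listen(String message) {
        // Simple content check on every consumed record
        if (message.contains("expected-payload")) {
            System.out.println("Found the message we were looking for: " + message);
        }
    }
}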

Apache Kafka Lineage Information

Is there any way to obtain lineage data for Kafka jobs?
For example, we have a Job History URL for all the MapReduce jobs.
Is there anything similar for Kafka where I can get the metadata of the producer producing to a particular topic (e.g. the IP address of the producer)?
Out of the box, no, not for Producers.
You can list consumer client addresses, but not see which producers wrote into a topic.
One option would be to look into Apache Atlas for lineage information, which is one component of Hortonworks' Streams Messaging Manager for analyzing Kafka connection information.
Otherwise, you would be stuck trying to enforce that producers send this data along with the messages, for example as a record header.
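As a hedged sketch of that last option, a producer could attach its own host address as a record header that downstream consumers (or a lineage tool) can read; the header name and topic here are made up.

import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LineageAwareProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("some-topic", "key", "value"); // hypothetical topic
            // Attach the producing host as a header so it travels with the message
            record.headers().add("producer-host",
                    InetAddress.getLocalHost().getHostAddress().getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}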

Producer-consumer processing pattern for Kafka processing

I'm implementing a streaming pipeline that resembles the illustration below:
*K-topic1* ---> processor1 ---> *K-topic2* ---> processor2 ---> *K-topic3* ---> processor3 ---> *K-topic4*
The K-topic components represent Kafka topics and the processor components represent code (Python/Java).
For each processor component, the intention is to read/consume data from a topic, perform some processing/ETL on it, and persist the results to the next topic in the chain as well as to a persistent store such as S3.
I have a question regarding the design approach.
The way I see it, each processor component should encapsulate both consumer and producer functionality.
Would the best approach be to have a Processor module/class that contains both a KafkaConsumer and a KafkaProducer? To date, most examples I've seen have separate consumer and producer components which are run separately; that would entail running double the number of components,
as opposed to encapsulating producers and consumers within each Processor object.
Any suggestions/references are welcome.
This question is different from
Designing a component both producer and consumer in Kafka
as that question specifically mentions using Samza, which is not the case here.
the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain
This is exactly the strength of Kafka Streams and/or KSQL. You could use the Processor API, but from what you describe, I think you'll only need the Streams DSL API (a minimal sketch of one stage follows).
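Here is a minimal Streams DSL sketch of one stage in the chain; the topic names match the question, but the transformation, serdes, and broker address are assumptions.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ProcessorOneApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "processor1");        // one app per stage
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("K-topic1");
        // Placeholder ETL step: transform each value and forward it to the next topic
        input.mapValues(value -> value.toUpperCase())
             .to("K-topic2");

        new KafkaStreams(builder.build(), props).start();
    }
}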
persist the results to the next topic in the chain as well as persistent store such as S3.
From the above topic, you can use a Kafka Connect Sink for getting the topic data into these other external systems. There is no need to write a consumer to do this for you.

Where to run the processing code in Kafka?

I am trying to setup a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched and cleaned, and moves out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I'm missing something.
In the use case of a data pipeline, a Kafka client can serve as both a consumer and a producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either the producer or the consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data in CSV or some database (Oracle), and you want to do your ETL work and load the results into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer that consumes off these topics (this could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore; basically any sink.
This has the benefit of ingesting your data with little or no code, since Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or directly to a sink unless you want to reuse the produced data for some further processing (see the sketch after this answer).
Read from the Kafka topic, do some further processing, and write it back to a persistent store.
It all boils down to your design choice, the throughput you need from the system, and how complicated your data structure is.
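As a hedged sketch of option 2's "transform in the producer" idea, assuming the raw data is CSV lines in a local file; the file name, topic name, and cleaning step are made up.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CsvTransformingProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String line : Files.readAllLines(Paths.get("input.csv"))) { // hypothetical file
                // The "ETL" done in the producer: trim fields and drop empty lines
                String cleaned = line.trim();
                if (!cleaned.isEmpty()) {
                    producer.send(new ProducerRecord<>("clean-data", cleaned)); // hypothetical topic
                }
            }
        }
    }
}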

Create new Producer from Kafka consumer?

How do I create a new Kafka producer from an existing consumer in Java?
You can't create a KafkaProducer from a KafkaConsumer instance.
You have to explicitly create a KafkaProducer using the same connection settings as your consumer.
Considering the use case you mentioned (copying data from a topic to another), I'd recommend using Kafka Streams. There's actually an example in Kafka that does exactly that: https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/pipe/PipeDemo.java
I would recommend using the Kafka Streams library. It reads data from Kafka topics, does some processing, and writes back to other topics.
That could be a simpler approach for you.
https://kafka.apache.org/documentation/streams/
A current limitation is that the source and destination cluster must be the same with Kafka Streams.
Otherwise you would need to drop down to the Processor API (with your own producer) to target another destination cluster.
Another approach is to simply define a producer inside the consumer program. Wherever your rule matches (based on offset or any other condition), call the producer.send() method (a minimal sketch follows).
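Here is a minimal consume-then-produce sketch of that approach; the topic names, group id, and matching rule are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConsumeAndForward {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker
        consumerProps.put("group.id", "copy-group");              // hypothetical group id
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("input-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Forward only the records that match our rule
                    if (record.value().contains("interesting")) { // hypothetical rule
                        producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    }
                }
            }
        }
    }
}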