Kafka Connect topic message modification before writing to sink database - apache-kafka

I have set up Kafka Connect between my source and destination. For example, I have a table in MySQL which I want to send to MongoDB; I have set up MySQL as the source and MongoDB as the sink, and it is working fine.
My MySQL table has a column called 'download_link' which holds an S3 download link for a PDF. Right now, when Kafka Connect runs, that link goes into MongoDB as-is. What I need is: after I receive the message from the MySQL source, I want to execute some Python code that downloads the PDF file and extracts the text from it, so that what goes into MongoDB is not the link but the extracted text. Is it possible to do something like this?
Can someone provide some resources on how I can achieve this?

I want to execute a python code ...
Kafka Connect cannot do this.
Since you have tagged apache-kafka-streams, refer to this post: Does Kafka python API support stream processing?
You would run your Python stream processor after the source connector, send the transformed data to new topic(s), then point a Connect sink at those topics.
Keep in mind that Kafka messages have a maximum size (about 1 MB by default), so extracting large PDF text blobs and persisting them in the topic(s) might not be the best idea. Instead, you could have the MongoDB writer application download the PDF just before writing to the database, but as stated, you'd need to write Java to do that inside Kafka Connect. Otherwise, you're left with separate Python processes that consume from Kafka and write to Mongo.
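For the "separate Python process" route, here is a minimal sketch of that consume-transform-produce step. The topic names, the field names, and the assumption that the source connector writes plain JSON values (no schema) are placeholders to adapt to your own converter and connector setup:

# Minimal consume-transform-produce sketch (hypothetical topics and field names).
# Assumes the MySQL source connector writes plain JSON values without a schema.
import io
import json

import requests
from kafka import KafkaConsumer, KafkaProducer
from pypdf import PdfReader

consumer = KafkaConsumer(
    "mysql-documents",                      # topic written by the source connector (assumed name)
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = dict(message.value)
    # Download the PDF from the S3 link and extract its text.
    pdf_bytes = requests.get(record["download_link"], timeout=30).content
    reader = PdfReader(io.BytesIO(pdf_bytes))
    record["document_text"] = "\n".join(page.extract_text() or "" for page in reader.pages)
    del record["download_link"]
    # Re-publish the enriched record to a new topic for the sink connector.
    producer.send("mysql-documents-with-text", record)

The MongoDB sink connector would then be pointed at the new topic instead of the original one.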

Related

Streaming database (MariaDB) changes into another database table using apache-kafka and debezium connector

What I am aiming to do is to stream data changes into a new database table using apache-kafka along with the debezium-connectors. But I don't have the slightest idea of how to achieve it. I do know how to start Kafka and ZooKeeper, create topics, and subscribe to a topic, but I am unfamiliar with all the next steps. How do I achieve the data streaming and capture that data into a new database table using Change Data Capture (CDC)?
Debezium only sources data into Kafka; it won't read from Kafka or write to a new database.
You can refer to an old blog post of theirs that uses the JDBC sink Kafka connector to write to a new server:
https://debezium.io/blog/2017/09/25/streaming-to-another-database/
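The linked post walks through exactly this pattern: registering a JDBC sink connector that reads the Debezium topics and writes the rows to a second database. As a rough sketch of that shape (hosts, credentials and topic names below are placeholders), the sink side is just configuration posted to the Kafka Connect REST API, for example from Python:

# Register a JDBC sink that writes Debezium-captured rows to another database.
# All hosts, credentials and topic names here are placeholders.
import requests

sink_config = {
    "name": "jdbc-sink-customers",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "dbserver1.inventory.customers",        # Debezium topic (server.schema.table)
        "connection.url": "jdbc:mysql://target-db:3306/inventory",
        "connection.user": "user",
        "connection.password": "password",
        "auto.create": "true",                            # create the target table if missing
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        # Flatten Debezium's change-event envelope into a plain row before writing.
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    },
}

requests.post("http://localhost:8083/connectors", json=sink_config, timeout=10).raise_for_status()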

Kafka Connect vs Apache NiFi

Good afternoon. My question is pretty simple. I'm new to Apache Kafka, but I'm doing some work as part of my internship, which is why I came up with this question.
I will provide as much context as I can, and I hope someone can help me clear up my doubts.
I was asked to develop a pipeline (or workflow) using Apache NiFi first.
The pipeline consisted of the following:
I fetched data from a local MySQL database using NiFi, then the data was sent to a Kafka topic, where it was later processed to clean some raw data using the Kafka client in Java (KStream, KTable and some regular expressions) and sent again to a Kafka topic.
Once the processing was done, the new data was read again using Apache NiFi and then sent to a new MySQL table.
I provide a picture for better understanding.
General Pipeline
After that, I was asked to do the same but using Kafka Connect instead of Apache NiFi, which was even shorter, because I only had to use a source connector to read the data from the MySQL database and send it to a Kafka topic, process it with the Kafka client in Java and send it to a new Kafka topic, and finally use a sink connector to save the processed data from the new topic straight into a new table in the database.
So someone in charge asked me when I should use Apache NiFi + Kafka instead of Kafka Connect + Kafka, and honestly I have no idea.
Let's consider that the most important point here is applying data enrichment, and let's consider two scenarios: when I have data from different sources but the data is not streaming data, and when the data is streaming data as well as when it is not.
All of it needs to be processed, integrated, cleaned and finally unified to apply the data enrichment.
Given the context provided above, my questions and doubts are:
When should I use (or not use) NiFi with Kafka, and why?
When should I use (or not use) Kafka Connect with Kafka, and why?
I think I have a basic idea, and I have been reading in order to answer this for myself, but honestly I haven't come up with an acceptable answer or a clear idea of when to use each one.
So I would really appreciate your help.

Streaming database data to Kafka topic without using a connector

I have a use case where I have to push all my MySQL database data to a Kafka topic. Now, I know I can get this up and running using a Kafka connector, but I want to understand how it all works internally without using a connector. In my Spring Boot project I have already created a Kafka producer file where I set all my configuration, create a producer record, and so on.
Has anyone tried this approach before? Can anyone throw some light on this?
Create entities with Spring JPA for the tables and send the data to the topic using findAll(). Use a scheduler for fetching the data and sending it to the topic. You can add your own logic for fetching from the DB and separate logic for sending it to the Kafka topic: for example, fetch using an auto-increment column, fetch using a last-updated timestamp, or do a bulk fetch. The same logic as the JDBC connectors can be implemented.
Kafka Connect will do it in an optimized way.
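Purely as an illustration of that incrementing-id polling logic: the answer above assumes Spring JPA and a Java producer, but the idea is language-agnostic, so here it is sketched in Python with kafka-python and mysql-connector-python; the table, columns, topic and credentials are made up for the example:

# Poll MySQL for new rows by auto-increment id and publish them to a Kafka topic.
# Table, column names, credentials and the topic are placeholders.
import json
import time

import mysql.connector
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)
conn = mysql.connector.connect(
    host="localhost", user="user", password="password", database="mydb", autocommit=True
)

last_id = 0
while True:  # a scheduler (cron, @Scheduled, etc.) would drive this in a real app
    cursor = conn.cursor(dictionary=True)
    cursor.execute("SELECT * FROM orders WHERE id > %s ORDER BY id", (last_id,))
    for row in cursor:
        producer.send("mydb.orders", row)
        last_id = row["id"]
    cursor.close()
    producer.flush()
    time.sleep(5)  # poll interval, like poll.interval.ms in the JDBC source connector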

What is the gain of using kafka-connect over the traditional approach?

I have a use case where I need to send the data changes in a relational database to a Kafka topic.
I'm able to write a simple JDBC program which executes a set of queries for the changes in a certain time period and writes the data to a Kafka topic using KafkaTemplate (a wrapper provided by the Spring framework).
If I do the same using Kafka Connect, which means writing a source connector, what benefits or overheads (if any) will I get?
The first thing is that you would have "... to write a simple JDBC program ..." and take care of the logic of writing to both the database and the Kafka topic.
Kafka Connect does that for you, so your business application only has to write to the database. With Kafka Connect you also get fail-over handling, parallelism, scaling and so on, all out of the box, while you would have to take care of those yourself when, for example, you write to the database but something fails and you are not able to write to the Kafka topic.
Today you want to ingest from one database into a Kafka topic using a set of queries, and you write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest, or you want high availability. Or you want to add the ability to stream data from Kafka to another target, or to ingest data from other places as well. And you want to manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through your code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself, you've just reinvented Kafka Connect :)
I talk extensively about this in my Kafka Summit session: "From Zero to Hero with Kafka Connect" which you can find online here
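To make the "configuration instead of code" point concrete, here is a sketch of what the equivalent JDBC source connector could look like when registered through the Kafka Connect REST API; the connection details, table, columns and topic prefix are placeholders for this example:

# Register a JDBC source connector: the incremental queries become configuration.
# Connection details, table, columns and the topic prefix are placeholders.
import requests

source_config = {
    "name": "jdbc-source-orders",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/mydb",
        "connection.user": "user",
        "connection.password": "password",
        "table.whitelist": "orders",
        "mode": "timestamp+incrementing",          # pick up both inserts and updates
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "topic.prefix": "mysql-",                  # rows land on topic mysql-orders
        "poll.interval.ms": "5000",
        "tasks.max": "1",
    },
}

requests.post("http://localhost:8083/connectors", json=source_config, timeout=10).raise_for_status()

Offsets, fail-over, scaling via tasks.max and serialisation are then handled by the Connect framework rather than by your own code.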

How can I save stream results into a remote database via REST or anything else easily

I have examined the Confluent Kafka Streams WordCount and anomaly detection examples. In these examples the result is written to a topic. Instead of this, how can I save the result into a remote database via REST or anything else, easily and quickly? Is there any structure for this in the Confluent Platform?
Code example:
// Instead of writing to an output topic here, how do I send the results to a remote database?
wordCounts.toStream().to("streams-wordcount-output", Produced.with(stringSerde, longSerde));
The usual pattern here is to write the results of your stream processing to a Kafka topic, and then use Kafka Connect to stream that topic to wherever you want to persist the data. Kafka Connect is part of Apache Kafka, and there are numerous connectors, including kafka-connect-jdbc for writing data to (and from) databases.
If you write directly from your streams application to the database, you are unnecessarily tying your processing to your storage. If the database is offline or unreachable, your stream processing has to handle that. Instead, decouple the two, and Kafka Connect will handle the unreachable database and so on.