Why Kafka Connect Works? - apache-kafka

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages would a PostgreSQL Source produce? What if I replaced PostgreSQL Source with MySQL source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of a fixed message structure, but according to the documentation the SourceRecord which is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using
org.apache.kafka.connect.data objects (or primitive values). For
example, a database connector might specify the sourcePartition as
a record containing { "db": "database_name", "table": "table_name"}
and the sourceOffset as a Long containing the timestamp of the row.

In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer to the Javadoc on the Struct and Schema classes of the Connect API, as well as the Converter interface.
Those are not strict requirements, but without them the framework cannot work across different sources and sinks. This is no different from the contract between producers and consumers regarding serialization.
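To make the converter contract concrete, here is a minimal Python sketch. It is not the Connect API itself. It imitates the envelope that JsonConverter writes when schemas.enable=true (a JSON object with a "schema" part and a "payload" part) and shows a generic sink learning the record layout from the message alone. The function names source_record and sink_columns, the tiny type mapping and the sample row are all hypothetical.

```python
import json

def source_record(row):
    """Toy source: wrap a row in a schema + payload envelope.

    The layout imitates what JsonConverter emits with schemas.enable=true;
    the type mapping is deliberately minimal and only for illustration.
    """
    type_names = {int: "int64", str: "string", float: "double", bool: "boolean"}
    fields = [
        {"field": name, "type": type_names[type(value)], "optional": False}
        for name, value in row.items()
    ]
    envelope = {
        "schema": {"type": "struct", "fields": fields, "optional": False},
        "payload": row,
    }
    return json.dumps(envelope).encode("utf-8")

def sink_columns(message):
    """Toy sink: read the layout from the message, not from the source."""
    envelope = json.loads(message.decode("utf-8"))
    columns = [(f["field"], f["type"]) for f in envelope["schema"]["fields"]]
    return columns, envelope["payload"]

# A hypothetical row; it could come from PostgreSQL, MySQL or anything else.
message = source_record({"id": 42, "name": "alice"})
columns, payload = sink_columns(message)
print(columns)   # [('id', 'int64'), ('name', 'string')]
print(payload)   # {'id': 42, 'name': 'alice'}
```

The sink never needs to know which database produced the row. It only relies on the shared schema description, which is the role Schema, Struct and the configured Converter play in Connect.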

Related

Kafka Connect vs Apache Nifi

Good afternoon. My question is pretty simple: I'm new to Apache Kafka, but I'm doing some work with it as part of my internship, which is why I came up with the question.
I will provide the context as much as I can, so I hope someone can help me, I want to clear my doubts.
I was requested to develop a pipeline (or workflow) using first Apache Nifi.
This pipeline consisted of the following.
I fetched data from a local MySQL database using Nifi, then sent the data to a Kafka topic. It was then processed with a Java Kafka client (KStream, KTable and some regular expressions) to clean the raw data, and sent to another Kafka topic.
Once the processing was done, the new data was read again using Apache Nifi, and then sent to a new MySQL table.
I have provided a picture for a better understanding.
[Figure: General Pipeline]
After that, I was asked to do the same but using Kafka Connect instead of Apache Nifi, which was even shorter: I only had to use the Source connector to read the data from the MySQL database and send it to a Kafka topic, then process it with the Java Kafka client and send it to a new Kafka topic. Finally, I used the Sink connector to write the processed data from the new topic straight to a new table in the database.
So, someone in charge asked me when I should use Apache Nifi + Kafka instead of Kafka Connect + Kafka, and to be honest I have no idea.
So let's assume that the most important point here is to apply data enrichment, and let's consider two scenarios:
when I have data from different sources but none of it is streaming data, AND when some of the data is streaming data and some is not.
And all of it needs to be processed, integrated, cleaned and finally unified in order to apply data enrichment.
If I consider the context provided previously my questions and doubts are:
When should I use Nifi with Kafka, and when not? And why?
When should I use Kafka Connect with Kafka, and when not? And why?
I think I have a basic idea, and I have been reading in order to be able to answer it for myself, but to be honest, I haven't come up with an acceptable answer or a clear idea of when to use each one.
So, I would really appreciate your help.

What is the gain of using kafka-connect over traditional approach?

I have a use case where I need to send data changes from a relational database to a Kafka topic.
I'm able to write a simple JDBC program which executes a set of queries for the changes in a certain time period and writes the data to a Kafka topic using KafkaTemplate (a wrapper provided by the Spring framework).
If I do the same using Kafka Connect, that is, by writing a source connector, what benefits or overheads (if any) will I get?
The first thing is that you have "... to write a simple JDBC program ..." and take care of the logic of writing to both the database and the Kafka topic.
Kafka Connect does that for you, and your business application has to write to the database only. With Kafka Connect you get more than that: fail-over handling, parallelism, scaling and so on all come out of the box. Otherwise you have to take care of them yourself, for example when you write to the database but something fails and you are not able to write to the Kafka topic.
Today you want to ingest from a database using a set of queries from one database to a Kafka topic, and write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest or you want to have high availability. Or you want to add in the ability to stream data from Kafka to another target, to ingest data also from other places. And, manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself—you've just reinvented Kafka Connect :)
I talk extensively about this in my Kafka Summit session, "From Zero to Hero with Kafka Connect", which you can find online.
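As an illustration of the "standardised configuration pattern expressed just in JSON", the sketch below builds the kind of request body one would POST to a Connect worker's /connectors REST endpoint. It assumes the Confluent JDBC source connector is installed on the worker; the connector name, connection URL, table and column values are placeholders, and the helper jdbc_source_request is hypothetical.

```python
import json

def jdbc_source_request(name, connection_url, table, timestamp_column, topic_prefix):
    """Build the JSON body for POST /connectors on a Connect worker.

    The property keys are those of the Confluent JDBC source connector;
    every value passed in here is a placeholder for illustration.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": connection_url,
            "table.whitelist": table,
            # Poll for rows whose timestamp column has advanced since the last poll.
            "mode": "timestamp",
            "timestamp.column.name": timestamp_column,
            "topic.prefix": topic_prefix,
            "tasks.max": "1",
        },
    }

body = jdbc_source_request(
    name="orders-source",
    connection_url="jdbc:postgresql://db.example.invalid:5432/shop",
    table="orders",
    timestamp_column="updated_at",
    topic_prefix="shop-",
)
print(json.dumps(body, indent=2))
```

Switching to a second database or a different serialisation format then becomes a change to this document rather than a change to bespoke code.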

How to get data from Kafka into a store without Kafka Connect sink?

When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I needed to get data from Kafka into search indexing like ElasticSearch, or analytics like Hadoop or Spark, where there's a Kafka Connect sink available.
But my question is what is the best way to handle a store that isn't as popular, say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
1. Just have the API consume from Kafka and use the MyImaginaryDB driver to write
2. Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault-tolerance, transforms and post-processing needed before landing in MyImaginaryDB)
I have also been reading about Kafka KSQL and Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing sink connector doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale out, parallelism etc.), and leaves you just to implement the bit of getting the data to MyImaginaryDB.
As regards transformations, the standard pattern is either:
Use Single Message Transform for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing (transformation + data sink) then you're munging together responsibilities, and you're reinventing a chunk of wheel that exists already (integration with an external system in a reliable scalable way)
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect
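To show how small "the bit of getting the data to MyImaginaryDB" can be, here is a rough Python analog of the lifecycle methods a Java SinkTask implements (start, put, stop). It is only a sketch of the division of labour, not Connect code: FakeClient stands in for a hypothetical MyImaginaryDB driver, and the property name imaginarydb.url is made up.

```python
class FakeClient:
    """Stand-in for a hypothetical MyImaginaryDB driver."""

    def __init__(self):
        self.rows = []
        self.closed = False

    def write(self, key, value):
        self.rows.append((key, value))

    def close(self):
        self.closed = True


class ImaginaryDbSinkTask:
    """Python analog of a sink task: the framework calls these methods."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None

    def start(self, props):
        # Called once with the connector configuration.
        self._client = self._client_factory(props)

    def put(self, records):
        # Called repeatedly with batches already deserialised by the converter.
        for record in records:
            self._client.write(record["key"], record["value"])

    def stop(self):
        if self._client is not None:
            self._client.close()


client = FakeClient()
task = ImaginaryDbSinkTask(lambda props: client)
task.start({"imaginarydb.url": "https://example.invalid"})
task.put([{"key": "k1", "value": {"n": 1}}, {"key": "k2", "value": {"n": 2}}])
task.stop()
print(client.rows)  # [('k1', {'n': 1}), ('k2', {'n': 2})]
```

Everything outside these methods (consuming from the topic, offset tracking, restarts, scaling across workers) is what the framework would supply in a real connector.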

What should I use: Kafka Stream or Kafka consumer api or Kafka connect

I would like to know what would be best for me: Kafka stream or Kafka consumer api or Kafka connect?
I want to read data from a topic, then do some processing and write to a database. So I have written consumers, but I feel I could write a Kafka Streams application and use its stateful processor to perform any changes and write to the database, which would eliminate my consumer code so that I only have to write the db code.
The stores I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors (I don't want to write in Avro), and creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or use Kafka Connect.
And which will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set value.converter.schemas.enable=false.
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
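The escaping can be reproduced outside Kafka. The short Python sketch below is only an analogy for the two converter paths, not Connect code: treating the value as one opaque string and then writing it as JSON yields an escaped string literal, while parsing it into a structure first yields the original object.

```python
import json

raw = '{"name":"hello"}'  # the message value as it sits on the topic

# StringConverter-like path: the value is one opaque string, so a JSON
# writer downstream encodes it as a JSON string literal (with escapes).
as_string = json.dumps(raw)

# JsonConverter-like path (schemas.enable=false): the value is parsed
# into a structure first, then written back out as a JSON object.
as_object = json.dumps(json.loads(raw), separators=(",", ":"))

print(as_string)  # "{\"name\":\"hello\"}"
print(as_object)  # {"name":"hello"}
```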
Processing I want to do is basic validation and few field values transformation
Kafka Streams or the Consumer API is capable of validation. Connect is capable of Single Message Transforms (SMT).
In some use cases you need to "duplicate data" in Kafka: read your "raw" topic using a consumer, process it, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to Stack Overflow! Please take the tour https://stackoverflow.com/tour
Please make posts with a precise question, not asking for opinions - this makes the site clearer, and opinions are not answers (and are subject to every person's preferences). Asking "How to use Kafka-connect with json" - or so would fit this site.
Also, please show some research.
The lowest overhead would be a plain Kafka consumer - Kafka Streams and Kafka Connect both use the Kafka consumer, so you will always be able to get less overhead that way, but you will also lose all the benefits (fault tolerance, ease of use, support, etc.)
First, it depends on what your processing is. Aggregation? Counting? Validation? Then, you can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then, you can use kafka connect to send the data to your database. You are not forced to use avro, you can use other format for key/value, see
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

Send different instances of Kafka Connect to different Kafka topic

I have tried to send the information of a Kafka Connect instance in distributed mode with one worker to a specific topic. I have the topic name in the "archive.properties" file that I use when I launch the instance.
But when I launch five or more instances, I see the messages merged in all topics.
The "solution" I thought of was to make a map to store the relation between ID and topic, but it didn't work.
Is there a specific Kafka Connect implementation to do this?
Thanks.
First, details on how you are running connect and which connector you are using will be very helpful.
Some connectors support sending data to more than one topic. For example, the Confluent JDBC source connector will send each table to a separate topic. So this could be a limitation of the connector you are using.
Whether you need to run more than one connector also depends on the connector and your use case. With the JDBC connector, you need one connector per database and it will handle all the tables. If you run two connectors on the same database and same tables, you'll get duplicates.
In short, hopefully your connector has helpful documentation.
In the next release of Apache Kafka we are adding Single Message Transformations. One of the transformations can modify the target topic based on data in the event - so you can use the transformation to perform event routing.
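The idea of routing on event data can be sketched in a few lines of Python. This is a toy model of a per-record transform, not the Java Transformation interface that Connect uses; the function route_by_field, the record shape and the "instance" field are assumptions made for the example.

```python
def route_by_field(record, field):
    """Return a copy of the record whose topic is derived from one of its fields.

    Records without the field keep their original topic.
    """
    value = record["value"]
    suffix = value.get(field) if isinstance(value, dict) else None
    if suffix is None:
        return record
    return {**record, "topic": f"{record['topic']}-{suffix}"}


records = [
    {"topic": "archive", "value": {"instance": "a", "msg": "one"}},
    {"topic": "archive", "value": {"instance": "b", "msg": "two"}},
    {"topic": "archive", "value": {"msg": "no instance id"}},
]
routed = [route_by_field(r, "instance")["topic"] for r in records]
print(routed)  # ['archive-a', 'archive-b', 'archive']
```

Applied to every record before it is written, a transform of this kind would keep each instance's messages in its own topic without a separate lookup map.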