Good Afternoon, my question is pretty simple, I'm new in Apache Kafka but I'm doing some work as part of my internship which is why I came with the question.
I will provide the context as much as I can, so I hope someone can help me, I want to clear my doubts.
I was requested to develop a pipeline (or workflow) using first Apache Nifi.
This pipeline consisted of the following.
I fetched data from one local MySQL database using Nifi, then the data was sent to one Kafka topic which was later processed to clean some raw data using the Kafka Client with Java (KStream, KTable and some regular expressions) and sent again to one kafka topic.
Once the processing was done, the new data was read again using Apache Nifi, and then sent to a new MySQL table.
I provide a picture for a better undertanding.
General Pipeline
After it, I was requested to do the same but using Kafka Connect instead of Apache Nifi, which was even shorter because I only had to use the Source connector to read the data from the MySQL database to sent it to one kafka topic, then process it with the Kafka Client with Java and sent it to a new kafka topic. Finally use the Sink connector to save the processed data of the new topic to sent it straight to one new table in the database.
So, someone in charge asked me when I should use Apache Nifi + Kafka instead of Kafka Connect + Kafka and I have no idea being honest.
So let's consider that the most important point here is apply Data Enrichment and let's consider two scenaries:
when I have data from different source but the data is not streaming data AND when the data is streaming data as well as not.
And all of it needs to be processed, integrated, cleaned and finally unified to apply data enrichment.
If I consider the context provided previously my questions and doubts are:
when should I use or not Nifi and Kafka? and why?
When should I use or not Kafka Connect with Kafka? and why?
I think I have one basic idea, and I have been reading in order to be able to answer it for myself, but being honest, I haven't come with one acceptable answer or clearly idea of when to use each one.
So, I would really appreciate your help.
Related
This question already has answers here:
Kafka Connect- Modifying records before writing into sink
(2 answers)
Closed 1 year ago.
As I've read from Kafka: The definitive guide book, Kafka Connect can simplify the task of loading CSV files into Kafka. But because we didn't write any code for business logic implementation (like Python/Java code), what should I do if I want to get data from CSV, and add many data from different sources to generate a new message, or even generate new data from system logs to that new message, before loading it into Kafka? Is Kafka Connect still a good approach in this use case?
The source for this answer is from this Stackoverflow thread: Kafka Connect- Modifying records before writing into HDFS sink
You have several options.
Single Message Transforms, great for light-weight changes as messages pass through Connect. Configuration-file-based, and extensible using the provided API if there's not an existing transform that does what you want. See the discussion here on when SMT is suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS.
KSQL is built on the Kafka Stream's API, which is a Java library and gives you the power to transform your data as much as you'd like.
I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages would a PostgreSQL Source produce? What if I replaced PostgreSQL Source with MySQL source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of a fixed message structure, but according to the documentation the SourceRecord which is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using
org.apache.kafka.connect.data objects (or primitive values). For
example, a database connector might specify the sourcePartition as
a record containing { "db": "database_name", "table": "table_name"}
and the sourceOffset as a Long containing the timestamp of the row".
In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer the Javadoc on the Struct and Schema classes of the Connect API as well as the Converter interface
Of course, those are not strict requirements, but without them, then the framework doesn't work across different sources and sinks, but this is no different than the contract between producers and consumers regarding serialization
When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I needed Kafka to search indexing like ElasticSearch or analytics like Hadoop to Spark where there's a Kafka Connect sink available.
But my question is what is the best way to handle a store that isn't as popular say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
Just have the API consume from Kafka and use the MyImaginaryDB driver to write
Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault-tolerance, transforms and post-processing needed before landing in MyImaginaryDB)
I have also been reading about Kafka KSQL and Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing source connector, doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale out, parallelism etc etc), and leaves you just to implement the bit of getting the data to MyImaginaryDB.
As regards transformations, standard pattern is either:
Use Single Message Transform for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing (transformation + data sink) then you're munging together responsibilities, and you're reinventing a chunk of wheel that exists already (integration with an external system in a reliable scalable way)
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect
I have straightforward scenario for the ETL job: take data from Kafka topic and put it to HBase table. In the future i'm going to add the support for some logic after reading data from a topic.
I consider two scenario:
use Kafka Streams for reading data from a topic and further writing via native HBased driver each record
Use Kafka -> HBase connector
I have the next concerns about my options:
Is is a goo idea to write data each time it arrives in a Kafka Stream's window? - suggest that it'll downgrade performance
Kafka Hbase connector is supported only by third-party developer, i'm not sure about code quality of this solution and about the option to add custom aggregation logic over data from a topic.
I myself have been trying to search for ETL options for KAFKA to HBase, however, so far my research tells me that it's a not a good idea to have an external system interaction within a KAFKA streams application (check the answer here and here). KAFKA streams are super powerful and great if you have KAFKA->Transform_message->KAFKA kind of use case, and eventually you can have KAFKA connect that will take your data from KAFKA topic and write it to a sink.
Since you do not want to use the third party KAFKA connect for HBase, one option is to write something yourself using the connect API, the other option is to use the KAFKA consumer producer API and write the app using the traditional way, poll the messages, write to sink, commit the batch and move on.
I would like to know what would be best for me: Kafka stream or Kafka consumer api or Kafka connect?
I want to read data from topic then do some processing and write to database. So I have written consumers but I feel I can write Kafka stream application and use it's stateful processor to perform any changes and write it to database which can eliminate my consumer code and only have to write db code.
Databases I want to insert my records are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka connect but I have found there is no json support as of now for hdfs sink and jdbc sink connector.(I don't want to write in avro) and creating schema is also pain for complex nested messages.
Or should I write custom Kafka connect to do this.
So need you opinion on whether I should write Kafka consumer or Kafka stream or Kafka connect?
And what will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enabled=false.
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
Processing I want to do is basic validation and few field values transformation
Kafka Streams or Consumer API is capable of validation. Connect is capable of Simple Message Transforms (SMT)
Some use cases, you need to "duplicate data" onto Kafka; process your "raw" topic, read it using a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to stack overflow! Please take the tout https://stackoverflow.com/tour
Please make posts with precise question, not asking for opinions - this makes the site clearer, and opinions are not answers (and subject to every person preferences). Asking "How to use Kafka-connect with json" - or so would fit this site.
Also, please show some research.
Less overhead would be kafka consumer - kafka stream and kafka connect use kafka consumer, so you will always be able to make less overhead, but will also lose all benefits (tolerant to failures, easy of usage, support, etc)
First, it depends of what your processing is. Aggregation? Counting? Validation? Then, you can use kafka streams to do the processing and write the result to a new topic, on the format you want.
Then, you can use kafka connect to send the data to your database. You are not forced to use avro, you can use other format for key/value, see
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON