I know Vertica has vkconfig to import data from Kafka. However, it seems like Confluent also has a connector that does the same. As their web site states: "Vertica Sink Connector for Confluent Platform - You can use the Kafka Connect Vertica Sink connector to export data from Apache Kafka topics to Vertica. The Vertica Sink connector periodically polls records from Kafka and adds them to a Vertica table."
Are the two connectors aimed at the same task? If not, what are the differences?
At a high level, the difference is that the Vertica loader (vkconfig) works on a periodic schedule, while Kafka Connect is closer to real-time, depending on how you've configured it.
Of course, there are also differences around installation, support, and licensing.
You should expect more Vertica-specific features from Vertica's own importer than from Confluent's connector, which may focus only on the bare minimum needed to turn Kafka records into database writes, and it is unclear how actively Confluent is adding enhancements to that connector.
I have worked with and compared both the Vertica built-in loader and the Confluent Vertica sink connector. Vertica's built-in connector cannot handle Kafka tombstone messages (I logged an enhancement request with Vertica) and is slow with Avro. I worked with Confluent on fixing at least 15 bugs and enhancements, and they released a newer version of the Vertica sink connector that supports most of the features we needed. Licensing with Confluent is the difficult part, as they may not license their connector pack unless you buy the entire ecosystem to manage Kafka.
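For a sense of what the Confluent connector looks like operationally: like any Kafka Connect sink, it is driven entirely by configuration. The sketch below shows the general shape of such a configuration; only name, connector.class, tasks.max, and topics are standard Kafka Connect fields, while the connector class name and the vertica.* property names here are assumptions that should be checked against the Confluent Vertica Sink documentation.

    # Sketch only; verify the class and vertica.* property names against Confluent's docs.
    name=vertica-sink-example
    # Assumed connector class name:
    connector.class=io.confluent.vertica.VerticaSinkConnector
    tasks.max=1
    topics=orders
    # Vertica connection settings (property names assumed for illustration):
    vertica.database=exampledb
    vertica.host=vertica-host.example.com
    vertica.port=5433
    vertica.username=dbadmin
    vertica.password=secret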
Related
I have a requirement to read messages from a topic, enrich the messages based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched messages to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source an external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or any system other than Kafka, as necessary.
Kafka Streams can also be config driven, can use plugins (i.e. reflection), is just as scalable, and connects to Kafka in the same way. Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to an external Connect cluster or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as the S3 or JDBC sink connectors.
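To make the Kafka Streams option concrete for the enrichment use case in the question, here is a minimal sketch of such a topology. It assumes String serdes for brevity (for Avro you would plug in Confluent's SpecificAvroSerde or GenericAvroSerde instead), and enrich() is a hypothetical placeholder for the call to the external system:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class EnrichmentApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                   // enrich() stands in for the lookup against the external system
                   .mapValues(EnrichmentApp::enrich)
                   .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }

        // Hypothetical enrichment call; in practice this would query the external system.
        private static String enrich(String value) {
            return value + " (enriched)";
        }
    }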
I'm reading up on Kafka and Kafka Connect. The documentation mentions 'Kafka sources' and 'Kafka sinks' in a generic sort of way in the Kafka Connect documentation. I'm not certain whether these two terms are specific to Kafka Connect or whether they simply refer to producers and consumers.
If you need to bring data into your Kafka cluster or copy data out of it (copy data from/into Kafka), there are many tools that support that task.
You might as well write and MAINTAIN your own code with the Kafka Consumer/Producer API.
To avoid struggling to create new code for an already-solved problem, the Kafka community developed the Kafka Connect framework.
The "Kafka way" is to leverage its ecosystem tool named Kafka Connect.
Kafka Connect is a distributed framework with many connectors supported by the community or by vendors, open source or proprietary; there is a big and growing hub ("marketplace") of connectors for almost any need.
A connector is a piece of pluggable code (JAR files) that runs inside the framework. There are two types of connectors: a sink connector reads from Kafka and writes to a target, and a source connector reads from a data source and writes to Kafka.
To set up a connector, you just write a configuration file with all the required parameters; no programming skills and no code are needed. You lose some flexibility in favor of simplicity.
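As a concrete example of "no code, just a configuration file", here is the sort of properties file you would hand to a standalone Connect worker. It uses the FileStreamSink connector that ships with Apache Kafka; the topic and file names are made up for this example:

    name=local-file-sink
    connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
    tasks.max=1
    # Topic to read from and file to write to (example values)
    topics=my-topic
    file=/tmp/my-topic-sink.txt

You would start it with something like bin/connect-standalone.sh config/connect-standalone.properties my-file-sink.properties, and the framework handles the consuming, offset tracking, and writing for you.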
I need to understand when to use Kafka Connect vs. our own consumer/producer code written by a developer. We are getting Confluent Platform. Also, to achieve a fault-tolerant design, do we have to run the consumer/producer code (JAR file) on all the brokers?
Kafka Connect is typically used to connect external systems to Kafka, i.e. to produce/consume data to/from Kafka from/to external sources and sinks.
Anything that you can do with a connector can also be done with your own Producer + Consumer code.
Readily available connectors simply make it easier to connect external systems to Kafka without requiring the developer to write the low-level code.
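For comparison, the "low-level code" alternative is a hand-written consume loop like this sketch (plain Java consumer; the write to the external system is reduced to a placeholder, and in real code you would also handle retries, batching, and error handling yourself):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class HandRolledSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "hand-rolled-sink");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Here you would write to the external system yourself,
                        // taking care of retries, batching, and offset management.
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }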
Some points to remember:
If the source and the sink are the same Kafka cluster, a connector doesn't make sense.
If you are doing change data capture (CDC) from a database and pushing the changes to Kafka, you can use a database source connector.
Resource constraints: Kafka Connect runs as a separate process, so double-check the trade-off between resources and ease of development.
Writing your own connector is fine, provided no one has already written it. If you are using third-party connectors, you need to check how well they are maintained and/or whether support is available.
do we have to run the consumer/producer code (JAR file) on all the brokers?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka Connect vs. own consumer/producer
In my experience, these factors should be taken into consideration:
You're planning on deploying and monitoring Kafka Connect anyway, and have the resources available to do so. Again, Connect workers don't run on the broker machines.
You don't plan on changing the connector code very often, because you must restart the whole connector JVM, which would also be running other connectors that don't need to be restarted.
You aren't able to integrate your own producer/consumer code into your existing applications, or would simply rather have a simpler produce/consume loop.
Having structured data not tied to a particular binary format is preferred.
The connector you write yourself, or the community connector you use, is well tested and configurable for your use cases.
Connect has more limited options for fault tolerance than the raw producer/consumer APIs, which in turn come with the drawbacks of more code and possible dependence on other libraries.
Note: Confluent Platform is still the same Apache Kafka
Kafka Connect:
Kafka Connect is an open-source framework with two types of connectors: sink and source. Kafka Connect is used to fetch/put data from/to a database (or another external system) to/from Kafka, and it makes it easy to use various other systems with Kafka. It also helps in tracking changes from databases into Kafka (change data capture (CDC), as mentioned in one of the other answers). The framework maintains offsets, so it can resume reading/writing data from a particular offset to Kafka or to the external system.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
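To illustrate the offset tracking and CDC-style loading described above, here is a rough sketch of a Confluent JDBC source connector configuration. It polls a table by an incrementing column, which only approximates CDC (log-based CDC would use something like Debezium), and the property names here are recalled from the JDBC connector documentation, so verify them against the link above:

    name=jdbc-source-example
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    tasks.max=1
    # Connection and polling settings (example values)
    connection.url=jdbc:postgresql://db-host:5432/exampledb
    connection.user=example
    connection.password=secret
    mode=incrementing
    incrementing.column.name=id
    table.whitelist=orders
    topic.prefix=pg-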
The Producer/Consumer:
Producers and consumers are just the end systems that use the Kafka client APIs to produce to and consume from Kafka topics. They are used where we want to deliver data to various consumers in consumer groups. Kafka also tracks the lag and offsets for each consumer group.
No, you don't need to run any producer/consumer code while running Kafka Connect. If you want to check that there is no data loss, you can run a consumer while the source connectors are running. For sink connectors, the data already produced can be verified in your database by running the appropriate SELECT queries.
I have been working with Kafka Connect, Spark Streaming, and NiFi with Kafka for streaming data.
I am aware that, unlike the other technologies, Kafka Connect is not a separate application but a tool that is part of Kafka.
In distributed mode, all of these technologies implement parallelism through underlying tasks or threads. What makes Kafka Connect efficient when dealing with Kafka, and why is it called lightweight?
It's efficient and lightweight because it uses the built-in Kafka protocols and doesn't require an external system such as YARN. While it is arguably better/easier to deploy Connect in Mesos/Kubernetes/Docker, it is not required.
The Connect API is also maintained by the core Kafka developers rather than by people who just want a simple integration into another tool. For example, last time I checked, NiFi could not access Kafka message timestamps, and dealing with the Avro Schema Registry seems to be an afterthought in the other tools compared to using Confluent-certified connectors.
I want to have all of the changes to a CouchDB database in Kafka at application run time, as they arrive. Is there any reliable existing tool for that?
You may try the Kafka Connect tool. Also, Confluent Platform provides a long list of connectors for Kafka Connect.
I'm not a CouchDB user, but you may choose one of the applicable source connectors here or create your own Kafka CouchDB source connector.