Kafka Connect - Connector with same Kafka cluster as a source? - apache-kafka

I only found references to MirrorMaker v2.
Can I reuse org.apache.kafka.connect.mirror.MirrorSourceConnector as if it were a "plain" Connector with Kafka as a source, or is there something else, hopefully simpler, available?
I'm trying to use KafkaConnect and (a combination of) its SMTs to simulate message routing behaviour found in other message brokers.
For example, I would like to consume from a topic, extract values from the message (either headers or payload), and route the message to another topic within the same cluster depending on the data found in the message.
Thank you

within the same cluster
Then that's what Kafka Streams or ksqlDB are for. You can import and use SMT methods directly via code, although you also need to use the Connect Converter classes to get Schema/Struct types that most of the SMT's require
While you could use MirrorMaker, intercluster relocation is not its purpose

Related

Does it make sense to use kafka-connect to transform kafka messages?

We have confluents platform in our infrastructure. At core, we are using kafka broker to distribute events. Dozens of devices produce events to kafka topics (there is a kafka topic for each type of event), where events are serialized in google's protobuf. We have confluent's schema registry to keep track of the protobuf schemas.
What we need is, for several events, we need to apply some transformation and then publish the transformation output to some other kafka topic. Of course Kafka Streams is one way to accomplish that, like in this example. However, we don't want to have a java application for each transformation (which increase the complexity of the project and development/deployment effort), and it doesn't feels right to put all streams in one application (modifying one will require to stop all streams ans start again).
At this point, we thought that maybe Confluent's Kafka Connect might be better approach. We can have several workers, and we can deploy them into one kafka connect instance/or cluster. The question is;
Does it make sense to use kafka connect to get message from one kafka topic and send it to another kafka topic? Be cause all the use cases and examples aims to get data from outside (database, file etc.) to kafka, and from kafka to outside.
To clarify, Kafka Connect is not "Confluent's", it's part of Apache Kafka.
While you could use MirrorMaker2/Confluent Replicator with transforms, it honestly wouldn't be much different than extracting the transformation logic into a shared library, then bundling a deployable Kafka Streams application that accepts configuration parameters for input and output topics with the transformation in-between.
You make a good point about single-point of administration, but that's also a single point of failure... If you use Connect, changing your transform plugin will also require you to stop and restart the Connect server, if all topics are part of the same connector, then any task failure would stop some percentage of the topic transformations
Kafka Streams (or KSQL) is preferred for inter-cluster translations, anyway
You could also look at solutions like Apache Nifi for more complex event management and routing

Why Kafka Connect Works?

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages would a PostgreSQL Source produce? What if I replaced PostgreSQL Source with MySQL source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of a fixed message structure, but according to the documentation the SourceRecord which is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using
org.apache.kafka.connect.data objects (or primitive values). For
example, a database connector might specify the sourcePartition as
a record containing { "db": "database_name", "table": "table_name"}
and the sourceOffset as a Long containing the timestamp of the row".
In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer the Javadoc on the Struct and Schema classes of the Connect API as well as the Converter interface
Of course, those are not strict requirements, but without them, then the framework doesn't work across different sources and sinks, but this is no different than the contract between producers and consumers regarding serialization

Kafka topic to multiple kafka topics dispatcher (same cluster)

My use-case is as follows:
I have a kafka topic A with messages "logically" belonging to different "services", I don't handle neither the system sending the messages to A.
I want to read such messages from A and dispatch them to a per-service set of topics on the same cluster (let's call them A_1, ..., A_n), based on one column describing the service (the format is CSV-style, but it doesn't matter).
The set of services is static, I don't have to handle addition/removal at the moment.
I was hoping to use KafkaConnect to perform such task but, surprisingly, there are no Kafka source/sinks (I cannot find the tickets, but they have been rejected).
I have seen MirrorMaker2 but it looks like an overkill for my (simple) use-case.
I also know KafkaStreams but I'd rather not write and maintain code just for that.
My question is: is there a way to achieve this topic dispatching with kafka native tools without writing a kafka-consumer/producer myself?
PS: if anybody thinks that MirrorMaker2 could be a good fit I am interested too, I don't know the tool very well.
As for my knowledge, there is no straightforward way to branch incoming topic messages to a list of topics based on the incoming messages. You need to write custom code to achieve this.
Use Processor API Refer here
Pass list of topics inside the Processor method
Use logic to identify topics need to branch
Use context.forward to publish a message to other topics
context.forward(key, value, To.child("selected topic"))
Mirror Maker is for doing ... mirroring. It's useful when you want to mirror one cluster from one data center to the other with the same topics. Your use case is different.
Kafka Connect is for syncing different systems (data from Databases for example) through Kafka topics but I don't see it for this use case either.
I would use a Kafka Streams application for that.
All the other answers are right, at the time of writing I did find any "config-only" solution in the Kafka toolset.
What finally did the trick was to use Logstash, as its "kafka output plugin" supports jinja variables in topic-id parameter.
So once you have the "target topic name" available in a field (say service_name) it's as simple as this:
output {
kafka {
id => "sink"
codec => [...]
bootstrap_servers => [...]
topic_id => "%{[service_name]}"
[...]
}
}

Send different instances of Kafka Connect to different Kafka topic

I have tried to send the information of a Kafka Connnect instance in distributed mode with one worker to a specific topic, I have the topic name in the "archive.properties" file that use when I launch the instance.
But, when I send five or more instances, I see the messages merged in all topics.
The "solution" I thought was make a map to store the relation between ID and topic but it doesn't worked
Is there an specific Kafka connect implementation to do this?
Thanks.
First, details on how you are running connect and which connector you are using will be very helpful.
Some connectors support sending data to more than one topic. For example, confluent-jdbc-sink will send each table to a separate topic. So this could be a limitation of the connector you are using.
Also depending on the connector and your use case - whether you need to run more than one connector. With the JDBC connector, you need one connector per database and it will handle all the tables. If you run two connectors on the same database and same tables, you'll get duplicates.
In short hopefully your connector has helpful documentation.
In the next release of Apache Kafka we are adding Single Message Transformations. One of the transformations can modify the target topic based on data in the event - so you can use the transformation to perform event routing.

Kafka topic alias

Is it possible to create an alias of a topic name?
Or, put another way...
If a user writes to topic examplea is it possible to override that at the broker so they actually write to topic exampleb?
alternatively, if the topic was actually written as examplea, but the consumer can refer to it as exampleb.
I'm thinking it could probably be achieved using small hack at the broker where it replies to metadata requests, but I'd rather not if it can be done in some standard way.
Aliases are not natively supported in Kafka.
One workaround could be to produce to examplea and have a consumer/producer pair that consumers from examplea and produces to exampleb. The consumer/producer pair could be written with Kafka clients, as a connector in Connect, as a MirrorMaker instance (though you'll need to modify it to change the topic name), or as a Kafka Streams job. Note that the messages will appear in exampleb slightly after examplea because they're being copied after being written.
How are you writing to the Kafka topic because if it's via a REST proxy you should be able to rewrite the topic portion of the URL using NGINX or a similar reverse proxy intermediary.