Get all Kafka Source Connectors writing to a specific topic

I have the name of a Kafka Topic. I would like to know what Connectors are using this topic. Specifically, I need the Source Connector name so I can modify the source query. I only have access to the Confluent Control Center. We have hundreds of Connectors and I cannot search through them manually.
Thank you in advance!

You'd need to write a script.
Given: Connect REST endpoint
Do (a Python sketch, assuming the Connect REST endpoint is reachable at localhost:8083):

import json
import re
from urllib.request import urlopen

CONNECT = 'http://localhost:8083'
to_find = 'your_topic_name'
topic_used_by = []

connectors = json.load(urlopen(f'{CONNECT}/connectors'))
for connector_name in connectors:
    config = json.load(urlopen(f'{CONNECT}/connectors/{connector_name}/config'))
    if 'topics.regex' in config:
        # pattern-match your topic against the configured regex
        if re.fullmatch(config['topics.regex'], to_find):
            topic_used_by.append(connector_name)
    elif 'topics' in config:
        # 'topics' is a comma-separated list of topic names
        if to_find in config['topics'].split(','):
            topic_used_by.append(connector_name)

print(topic_used_by)
From that, your HTTP client should be easy to extend to update any config value, then PUT the full config back to /connectors/{connector_name}/config
Keep in mind that neither 'topics' nor 'topics.regex' is a source-connector property (they are sink-connector properties). Therefore, you'd need to modify this to use whatever connector properties you do have (ideally, filtering by the connector class name). For example, the table.whitelist property of the JDBC source, or collection in the MongoDB source, determines the topic name.
And unless you filter by class name, the search would return both sources and sinks; unless there is "Source", or similar, in the name of the connector class, this is about the best you can do.
Also, you'd need to consider that some topic names may be set by various transform configurations, which could appear under any JSON key in the config, so there'd be no straightforward way to search those beyond something like transform_config_value.contains('your_topic_name')
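Once the matching connector is found, updating its config follows the same REST pattern: the Connect REST API updates a connector by PUTting the complete config map back. A minimal sketch, again assuming localhost:8083 (the property name you change, e.g. 'query', depends entirely on your connector class):

```python
import json
from urllib.request import Request, urlopen

CONNECT = 'http://localhost:8083'

def updated_config(config: dict, key: str, value: str) -> dict:
    """Return a copy of a connector config with one property changed."""
    new_config = dict(config)
    new_config[key] = value
    return new_config

def update_connector(name: str, key: str, value: str) -> None:
    # Fetch the current config, change one value, and PUT the full config back.
    # PUT /connectors/{name}/config replaces the whole config, so we must
    # send every existing property, not just the changed one.
    config = json.load(urlopen(f'{CONNECT}/connectors/{name}/config'))
    body = json.dumps(updated_config(config, key, value)).encode()
    req = Request(f'{CONNECT}/connectors/{name}/config', data=body,
                  headers={'Content-Type': 'application/json'}, method='PUT')
    urlopen(req)
```

Note that PUT (not POST) is the update verb here; POST /connectors is only for creating new connectors.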

You can use the feature implemented through KIP-558: Track the set of actively used topics by connectors in Kafka Connect.
E.g., get the set of active topics from a connector called 'some-source':
curl -s 'http://localhost:8083/connectors/some-source/topics' | jq
{
  "some-source": {
    "topics": [
      "foo",
      "bar",
      "baz"
    ]
  }
}
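On Kafka Connect 2.5+ (where KIP-558 is available), this per-connector endpoint can drive the original topic search without parsing any connector configs. A sketch, assuming the same localhost:8083 endpoint:

```python
import json
from urllib.request import urlopen

CONNECT = 'http://localhost:8083'

def connectors_using(topic: str, active_topics: dict) -> list:
    """Filter a {connector: {"topics": [...]}} mapping down to the
    connectors whose active-topic set contains the given topic."""
    return [name for name, entry in active_topics.items()
            if topic in entry.get('topics', [])]

def find_connectors(topic: str) -> list:
    # Gather each connector's active topics via the KIP-558 endpoint,
    # then filter for the topic of interest.
    names = json.load(urlopen(f'{CONNECT}/connectors'))
    merged = {}
    for name in names:
        merged.update(json.load(urlopen(f'{CONNECT}/connectors/{name}/topics')))
    return connectors_using(topic, merged)
```

This covers sources and sinks alike, since the tracked set reflects topics a connector has actually read from or written to.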

How to fix the deserialization error when merging 2 kstreams topics using leftJoin?

I am new to Kafka. I am working on a personal project where I want to write to 2 different Avro topics and merge them using leftJoin. Once I merge them, I want to produce the same messages to a KSQL DB as well. (I haven't implemented that part yet).
I am using KafkaTemplate to produce to the 2 Avro topics and convert them into KStreams to merge them. I am also using a KafkaListener to print any messages in them, and that part is working. Here's where I am having issues, two of them actually; in either case, no messages are produced to the merged topic.
If I remove the Consumed.with() from the KStream, it throws a default key Serde error.
But if I keep it, then it throws a deserialization error.
I have even provided the default serialization and deserialization in both my application.properties and in the streamConfig inside main() but it's still not working.
Can somebody please help me with how to merge the 2 Avro topics? Is the error occurring because I am using an Avro schema? Should I use JSON instead? I want to use a schema because the value part of my message will have multiple values in it.
For eg: {Key : Value} = {company : {inventory_id, company, color, inventory}} = {Toyota : {0, RAV4, 50,000}}
Here's a link to all the files: application.properties, DefaultKeySerdeError.txt, DeserializationError.txt, FilterStreams.java, Inventory.avsc, Pricing.avsc, and MergedAvro.avsc. Let me know if y'all want me to put them below. Thank you very much for your help in advance!
https://gist.github.com/Arjun13/b76f53c9c2b4e88225ef71a18eb08e2f
Looking at the DeserializationError.txt file, it looks like the problem is you haven't provided the credentials for schema registry. Even though you have provided them in the application.properties file, they're not getting into the serdes configuration, so if you add the basic.auth.user.info configs to the serdeConfig map you should be all set.
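For reference, the Schema Registry client settings involved are the standard basic-auth configs shown below (the URL and credentials are placeholders); the key point is that they must land in the map handed to the Serde's configure(...) call, not only in Spring's application.properties:

```properties
schema.registry.url=https://my-registry.example.com
basic.auth.credentials.source=USER_INFO
basic.auth.user.info=<username>:<password>
```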

Why use a schema registry

I just started working with Kafka and I use Protocol Buffers for the message format and I just learn about schema registry.
To give some context, we are a small team with a dozen webservices; we use Kafka to communicate between them, and we store all the schemas and read/write models in a library that is later imported by each service. This way they know how to serialize/deserialize a message.
But now schema registry comes into play. Why use it? Now my infrastructure becomes more complicated, plus I need to update it every time I change a schema, and I still need to define the read/write models in each service like I do now using the library.
So from my point of view I only see cons mainly just complicating things so why should I use a schema registry?
Thanks
The schema registry ensures your messages will not deviate from a common base compatibility guarantee (the first version of the schema).
For example, you have a schema that describes an event like {"first_name": "Jane", "last_name": "Doe"}, but then later decide that names can actually have more than 2 parts, so you then move to a schema that can support {"name": "Jane P. Doe"}... You still need a way to deserialize old data with first_name and last_name fields to migrate to the new schema having only name. Therefore, consumers will need both schemas. The registry will hold that and encode the schema ID within each payload from the producer. After all, the initial events with the two name fields would know nothing about the "future" schema with only name.
You say your models are shared in libraries across services. You probably then have some regression testing and release cycle to publish these between services? The registry will allow you to centralize that logic.
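As a concrete illustration of the evolution described above (the record name Person and the empty-string default are my own additions; the default is what makes dropping the old fields backward compatible), the two registered versions of the subject might look like:

```json
{"type": "record", "name": "Person", "fields": [
  {"name": "first_name", "type": "string"},
  {"name": "last_name",  "type": "string"}
]}
```

followed by the compatible successor:

```json
{"type": "record", "name": "Person", "fields": [
  {"name": "name", "type": "string", "default": ""}
]}
```

A consumer on the new schema reading an old record gets name filled with the default, while the registry's compatibility check rejects any future version that would break this contract.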

Customize Debezium pubsub message

I am trying to use Debezium Server to stream "some" changes in a PostgreSQL table. Namely, the table being tracked has a json-type column named "payload". I would like the message streamed to Pub/Sub by Debezium to contain only the contents of the payload column. Is that possible?
I've explored the custom transformations provided by Debezium, but from what I could tell they would only allow me to enrich the published message with extra fields, not to publish only certain fields, which is what I want to do.
Edit:
The closest I got to what I wanted was to use the outbox transform but that published the following message:
{
  "schema": {
    ...
  },
  "payload": {
    "key": "value"
  }
}
Whereas what I would like the message to be is:
{"key":"value"}
I've tried adding an ExtractNewRecordState transform but still got the same results. My application.properties file looks like:
debezium.transforms=outbox,unwrap
debezium.transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
debezium.transforms.outbox.table.field.event.key=grouping_key
debezium.transforms.outbox.table.field.event.payload.id=id
debezium.transforms.outbox.route.by.field=target
debezium.transforms.outbox.table.expand.json.payload=true
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=payload
Many thanks,
Daniel

How to make the Kafka Connect BigQuery Sink Connector create one table per event type and not per topic?

I'm using confluentinc/kafka-connect-bigquery on our Kafka (Avro) events. On some topics, we have more than one event type, e.g., UserRegistered and UserDeleted are on the topic domain.user.
The subjects in our Schema Registry look as follows.
curl --silent -X GET http://avro-schema-registry.core-kafka.svc.cluster.local:8081/subjects | jq .
[...]
"domain.user-com.acme.message_schema.domain.user.UserDeleted",
"domain.user-com.acme.message_schema.domain.user.UserRegistered",
"domain.user-com.acme.message_schema.type.domain.key.DefaultKey",
[...]
My properties/connector.properties (I'm using the quickstart folder) looks as follows:
[...]
topics.regex=domain.*
sanitizeTopics=true
autoCreateTables=true
[...]
In BigQuery a table called domain_user is created. However, I would like to have two tables, e.g., domain_user_userregistered and domain_user_userdeleted or similar, because the schemas of these two event types are quite different. How can I achieve this?
I think you can use the SchemaNameToTopic Single Message Transform to do this. Setting the topic name from the schema name will propagate through to the name given to the created BigQuery table.
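If that transform is the one shipped in the community kafka-connect-transform-common package (the fully qualified class name below is my assumption; verify it against the package you actually install), the connector config would gain something along these lines:

```properties
transforms=schemaNameToTopic
transforms.schemaNameToTopic.type=com.github.jcustenborder.kafka.connect.transform.common.SchemaNameToTopic
```

With sanitizeTopics=true already set, the resulting table names would then be derived from the (sanitized) per-event-type schema names rather than the shared topic name.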

Confluent Platform: Schema Registry Subjects

I'm working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used for new Kafka 0.8.2 producer) uses topic-key|value pattern for subjects (link) whereas KafkaAvroEncoder (for old producer) uses schema.getName()-value pattern (link).
The reason why one would have 2 different subjects per topic (one for key, one for value) is pretty simple:
say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    {
      "name": "line",
      "type": "string"
    },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          {
            "name": "host",
            "type": "string"
          },
          {
            "name": "...",
            "type": "string"
          }
        ]
      }
    }
  ]
}
A common use case would be that I want to partition entries by source, and thus would like to have two subjects associated with the topic (a subject being, essentially, a versioned history of Avro schemas) - one for the key (which is SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
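The naming convention itself is mechanical; a minimal sketch of the default strategy the KafkaAvroSerializer uses (nowadays called TopicNameStrategy; the helper below is illustrative, not Confluent API):

```python
def subject_for(topic: str, is_key: bool) -> str:
    """Default subject naming: the topic name plus a -key or -value suffix."""
    suffix = 'key' if is_key else 'value'
    return f'{topic}-{suffix}'

# The SourceInfo key schema and LogEntry value schema from the example
# would be registered under these two subjects for a topic named "logs":
print(subject_for('logs', True))   # logs-key
print(subject_for('logs', False))  # logs-value
```

This is why each topic ends up with (at most) two subjects in the registry, each evolving independently under its own compatibility settings.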
Note: any further information is just my personal thoughts and maybe I just don't yet fully understand how this is supposed to work so I might be wrong.
I actually prefer how the KafkaAvroEncoder is implemented over the KafkaAvroSerializer. KafkaAvroEncoder does not in any way force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In that case KafkaAvroSerializer would try to update the topic-key and topic-value subjects and would break 99% of the time when compatibility is violated (and if you have multiple Avro schemas, they are almost always different and incompatible with each other).
On the other hand, KafkaAvroEncoder cares only about schema names, so you may safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
This inconsistency is still unclear to me and I hope Confluent guys can explain this if they see this question/answer.
Hope that helps you