How to make the Kafka Connect BigQuery Sink Connector create one table per event type and not per topic? - apache-kafka

I'm using confluentinc/kafka-connect-bigquery on our Kafka (Avro) events. On some topics, we have more than one event type, e.g., UserRegistered and UserDeleted are on the topic domain.user.
The subjects in our Schema Registry look as follows.
curl --silent -X GET http://avro-schema-registry.core-kafka.svc.cluster.local:8081/subjects | jq .
[...]
"domain.user-com.acme.message_schema.domain.user.UserDeleted",
"domain.user-com.acme.message_schema.domain.user.UserRegistered",
"domain.user-com.acme.message_schema.type.domain.key.DefaultKey",
[...]
My properties/connector.properties (I'm using the quickstart folder) looks as follows:
[...]
topics.regex=domain.*
sanitizeTopics=true
autoCreateTables=true
[...]
In BigQuery a table called domain_user is created. However, I would like to have two tables, e.g., domain_user_userregistered and domain_user_userdeleted or similar, because the schemas of these two event types are quite different. How can I achieve this?

I think you can use the SchemaNameToTopic Single Message Transform to do this. Setting the topic name to the schema name will propagate through to the name given to the created BigQuery table.
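For example (a sketch only: the SchemaNameToTopic transform is, as far as I know, part of Jeremy Custenborder's kafka-connect-transform-common plugin, and the exact class name may differ by version), connector.properties could gain something like:
# Hypothetical: rename each record's topic to its value schema's name,
# so the BigQuery sink creates one table per event type.
transforms=schemaNameToTopic
transforms.schemaNameToTopic.type=com.github.jcustenborder.kafka.connect.transform.common.SchemaNameToTopic
With sanitizeTopics=true the table name would then be derived from the fully qualified schema name (something like com_acme_message_schema_domain_user_UserRegistered), so you may want to chain an org.apache.kafka.connect.transforms.RegexRouter transform after it to shorten the topic, and hence table, names.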

Related

Customize Debezium pubsub message

I am trying to use Debezium Server to stream "some" changes in a PostgreSQL table. Namely, the table being tracked has a json-type column named "payload". I would like the message streamed to Pub/Sub by Debezium to contain only the contents of the payload column. Is that possible?
I've explored the custom transformations provided by Debezium, but from what I can tell they would only allow me to enrich the published message with extra fields, not to publish only certain fields, which is what I want to do.
Edit:
The closest I got to what I wanted was to use the outbox transform but that published the following message:
{
  "schema": {
    ...
  },
  "payload": {
    "key": "value"
  }
}
Whereas what I would like the message to be is:
{"key":"value"}
I've tried adding an ExtractNewRecordState transform but still got the same results. My application.properties file looks like:
debezium.transforms=outbox,unwrap
debezium.transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
debezium.transforms.outbox.table.field.event.key=grouping_key
debezium.transforms.outbox.table.field.event.payload.id=id
debezium.transforms.outbox.route.by.field=target
debezium.transforms.outbox.table.expand.json.payload=true
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=payload
Many thanks,
Daniel

Get all Kafka Source Connectors writing to a specific topic

I have the name of a Kafka Topic. I would like to know what Connectors are using this topic. Specifically, I need the Source Connector name so I can modify the source query. I only have access to the Confluent Control Center. We have hundreds of Connectors and I cannot search through them manually.
Thank you in advance!
You'd need to write a script.
Given: Connect REST endpoint
Do (for example, in Python using the requests library):
import re
import requests

CONNECT_URL = 'http://localhost:8083'   # your Connect REST endpoint
to_find = 'your_topic_name'
topic_used_by = []

for connector_name in requests.get(f'{CONNECT_URL}/connectors').json():
    connector_config = requests.get(f'{CONNECT_URL}/connectors/{connector_name}').json()['config']
    if 'topics.regex' in connector_config:
        # pattern match against your topic (sinks use a full-string regex match)
        if re.fullmatch(connector_config['topics.regex'], to_find):
            topic_used_by.append(connector_name)
    elif 'topics' in connector_config:
        # split these on commas
        if to_find in connector_config['topics'].split(','):
            topic_used_by.append(connector_name)

print(topic_used_by)
From that, your HTTP client should be easy to extend to update any config value, then PUT /connectors/{connector_name}/config with the new configuration.
Keep in mind that neither 'topics' nor 'topics.regex' is a source connector property (they are sink connector properties). Therefore, you'd need to modify the script to use whatever connector properties you do have (ideally, filtering by the connector class name). For example, the table.whitelist property of the JDBC Source, or collection in the MongoDB source, determines the topic name.
Also, unless you filter by class name, the script would return both sources and sinks; and unless there is "Source", or similar, in the name of the connector class, this is about the best you can do.
Finally, consider that some topic names may be set by transform configurations, which could appear under any JSON key in the config, so there's no straightforward way to search for those short of something like transform_config_value.contains('your_topic_name').
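As a rough illustration of that last point, a catch-all helper (a sketch; plain substring matching, so expect false positives) could be bolted onto the loop above:
def config_mentions_topic(connector_config: dict, topic: str) -> bool:
    # Crude fallback: does any config value (including transform settings)
    # mention the topic name at all?
    return any(topic in str(value) for value in connector_config.values())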
You can use the feature implemented through KIP-558: Track the set of actively used topics by connectors in Kafka Connect.
E.g., get the set of active topics from a connector called 'some-source':
curl -s 'http://localhost:8083/connectors/some-source/topics' | jq
{
"some-source": {
"topics": [
"foo",
"bar",
"baz",
]
}
}
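Combining that endpoint with the earlier loop, a short script (a sketch; assumes Kafka Connect 2.5 or later, where KIP-558 landed, and the requests library) can list every connector whose active topics include the one you're after:
import requests

CONNECT_URL = 'http://localhost:8083'   # assumed Connect REST endpoint
to_find = 'your_topic_name'

used_by = []
for name in requests.get(f'{CONNECT_URL}/connectors').json():
    # KIP-558: per-connector set of actively used topics
    active = requests.get(f'{CONNECT_URL}/connectors/{name}/topics').json()
    if to_find in active.get(name, {}).get('topics', []):
        used_by.append(name)
print(used_by)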

using existing kafka topic with avro in ksqldb

Suppose I have a topic, let's say 'some_topic'. Data in this topic is serialized with Avro using Schema Registry.
The schema subject name is the same as the name of the topic, 'some_topic', without the '-value' postfix.
What I want to do is create a new topic, let's say 'some_topic_new', where data will be serialized with the same schema, but some fields will be 'nullified'.
I'm trying to evaluate whether this could be done using ksqlDB and have two questions:
Is it possible to create a stream/table based on an existing topic and using the existing schema?
Maybe something like create table some_topic (*) with (schema_subject=some_topic, ...), so that the fields for the new table would be taken from the existing schema automatically.
Could creating a new schema with the '-value' postfix be avoided when creating the new stream/table?
When you create a stream in ksqlDB based on another, you can have it inherit the schema. Note that it won't share the same registered schema, but the definition will be the same.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='AVRO');
CREATE STREAM my_stream_new
WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
SELECT * FROM my_stream;
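To address the 'nullified' part of the question, the second statement can project fields explicitly instead of SELECT *, for example (a sketch with hypothetical field names; whether CAST(NULL AS ...) is accepted depends on your ksqlDB version):
CREATE STREAM my_stream_new
  WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
  SELECT id,
         CAST(NULL AS VARCHAR) AS email,  -- hypothetical field to blank out
         created_at
  FROM my_stream;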

Possible option for PrimaryKey in Table creation with KSQL?

I've started working with KSQL and am quite enjoying the experience. I'm trying to work with a Table and Stream join, and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. This is a static data set loaded into a Table, and it might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. This is a stream of data being loaded into a Stream.
Since a Table can't be created without specifying a key, and my data doesn't have a unique column to use as one, what exactly should my key be while loading data from topic-1 into the Table? Remember that my Table might get repopulated/updated once a month or so with the same data plus new rows; with a key, newly loaded data could replace the existing rows.
I tried to find something like an auto-incrementing value, like what we call a PRIMARY KEY in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or suggest a query to create a primary key if one exists? Thanks
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
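For example, with the sample record above, the pair of transforms would roughly turn
key = null, value = {"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
into
key = "020", value = {"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
createKey first sets the key to the struct {"id":"020"}, and extractInt then unwraps it to the bare id value, which the table can then be keyed on.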

How to clone field in Kafka Connect?

I'm configuring Kafka Connect to copy data from Kafka to a database.
I need to put the value from one field into two columns in the database.
My Kafka message has two fields, name and age. The target table has three columns: name, displayName, and age. I would like to clone the value of name from the Kafka message and put it in both the name and displayName columns.
Is there any Transform that can be applied to do that?
As Driss Nejjar says, this would typically be the kind of thing that a Single Message Transform would be perfect for. However, as far as I can see there is no Transform that ships with Apache Kafka that does this. You could write your own, or you could also use KSQL:
CREATE STREAM new AS SELECT name, name as displayName, age FROM source;
This would take your source topic (populated by Connect), and add the additional field displayName, and write to a new Kafka topic called new.
Disclaimer: I work for Confluent, the company behind the KSQL project.