How to update the table schema when there is a new Avro schema for Kafka data in Flink? - apache-kafka

We are consuming a Kafka topic in the Flink application using Flink Table API.
When we first submit the application, we read the latest schema from our custom registry and then create a Kafka DataStream and a Table using that Avro schema. Our serializer implementation works similarly to the Confluent schema registry: it checks the schema ID and then looks the schema up in the registry, so we can apply the correct schema at runtime.
However, I do not know how to update the table schema and re-execute the SQL without redeploying the job. Is there a way to have a background thread that checks for schema changes and, if there are any, pauses the current execution, updates the table schema and re-executes the SQL?
This would be particularly useful for the continuous delivery of schema changes to the applications. We already have a compatibility check in place.

TL;DR you don't need to change anything to get it working in most cases.
In Avro, there is the concept of reader and writer schema. The writer schema is the schema that was used to generate the Avro record, and it is encoded into the payload (in most cases as an ID).
The reader schema is used by your application to make sense of the data: if you do a particular calculation, you use a specific set of fields of an Avro record.
Now the good part: Avro transparently resolves the writer schema to the reader schema as long as the two are compatible. So as long as your schemas are fully compatible, there is always a way to translate the writer schema into your reader schema.
So if the schema of the records changes in the background while the application is running, the DeserializationSchema fetches the new writer schema and infers a new mapping to the reader schema. Your query will not notice any change.
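To make this concrete, here is a minimal sketch of that writer-to-reader resolution using the fastavro Python library (not something from the original setup; the User schema and field names are made up). A record written with an evolved writer schema is decoded against an unchanged reader schema, which is essentially what the DeserializationSchema does for you:

# Minimal sketch of Avro writer/reader schema resolution with fastavro.
# The schemas below are illustrative only.
import io
import fastavro

# Writer schema: the producer has evolved and now writes an extra field.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

# Reader schema: the running application still expects the old shape.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Encode with the new writer schema ...
buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema,
                           {"id": 1, "name": "alice", "email": "a@example.com"})

# ... and decode by resolving writer -> reader: the extra field is simply
# dropped, so the consuming query keeps working unchanged.
buf.seek(0)
record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'id': 1, 'name': 'alice'}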
This approach falls short if you actually want to enrich the schema in your application; for example, if you add a calculated field and want to return all other fields as well. Then a field newly added upstream will not be picked up, since effectively your reader schema would have to change. In that case, you either need to restart the job or work with a generic record schema.

Related

Kafka connect: How to handle database schema/table changes

I am wondering if there is a documented process for handling database schema changes. I am using the Debezium source connector for Postgres and the Confluent JDBC sink connector to replicate database changes. I need to make the following changes in the database:
Add new columns to an existing table
Modify a column's type and rename it
I am not sure what the best way to do this is. The solution I can think of is:
Stop the source connector
Wait for the sinks to consume all messages
Upgrade the databases
Start the source and sink connectors
Debezium will automatically add new fields to the record schema for new columns, so update your consumers and downstream systems first to prepare for those events. There is no need to stop the source connector.
If you change types and names, you might run into backwards-incompatible schema changes, and those operations are generally not recommended. Instead, always add new columns and "deprecate" the old ones rather than changing them; once all other systems are done reading events from those old columns, drop them.
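As a rough sketch of that additive, two-phase pattern (the table and column names below are hypothetical, and mysql-connector-python is just one way to run the DDL):

# Hypothetical two-phase migration for the "add new, deprecate old" pattern.
# Requires mysql-connector-python; table/column names are made up.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="inventory")
cur = conn.cursor()

# Phase 1 (backwards compatible): add the new column alongside the old one.
# Debezium starts emitting the new field; existing consumers keep working.
cur.execute("ALTER TABLE customers ADD COLUMN contact_email VARCHAR(255) NULL")

# Phase 2 (run much later): once every downstream system has stopped reading
# the deprecated column, drop it.
# cur.execute("ALTER TABLE customers DROP COLUMN email")

cur.close()
conn.close()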

schema evolution along with message transformations using confluent_kafka and schema-registry

I need to transform messages, along with schema evolution, into a target MySQL DB. I can't use a sink connector here because it supports schema evolution but not complex transformations.
I have a table in the source MySQL like below:
id, name, created_at
1, shoaib, 2022-01-01
2, ahmed, 2022-02-01
In the target MySQL I want to replicate that table with some transformations. The target table would look like:
id, name, created_at, isDeleted
1, shoaib, 2022-01-01, 0
2, ahmed, 2022-02-01, 0
Whenever a row gets deleted in the source, the isDeleted column in the target should be set to 1.
This is just a very simple transformation I put in as an example.
So I decided not to use the sink connector's SMTs, because they offer only very basic transformations, and went with the confluent_kafka library using Python.
I am able to transform the data as needed and load it into the target MySQL, but I am not able to make the corresponding schema changes in the target automatically using Schema Registry with the confluent_kafka library. Schema versions are getting updated in the registry, but how do I propagate those changes to the target DB if I'm not using the sink connector?
"propagate those changes to the target DB if I'm not using the sink connector"
You would need to run an ALTER TABLE statement on your own (see the sketch after this answer).
I suggest looking at the Debezium docs on how it can insert a __deleted attribute into the data from the source connector; then you should be able to use an SMT to convert the boolean into an INT (or simply keep a boolean column in the database rather than a number).
Alternatively, you can use Python to write the data with a new schema to a new Kafka topic (creating a new subject in the registry), which can then be consumed by the sink connector to write to (and evolve) the target table.
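If you stay on the pure-Python route, a rough sketch of the "run ALTER TABLE yourself" idea might look like the following. None of this comes from the answer above: the subject name, table name and Avro-to-MySQL type mapping are assumptions, and only additive changes (new fields) are handled.

# Hypothetical sketch: read the latest value schema from Schema Registry and
# add any missing columns to the target MySQL table (additive changes only).
import json
import mysql.connector
from confluent_kafka.schema_registry import SchemaRegistryClient

AVRO_TO_MYSQL = {"int": "INT", "long": "BIGINT", "string": "VARCHAR(255)",
                 "boolean": "TINYINT(1)", "float": "FLOAT", "double": "DOUBLE"}

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
latest = registry.get_latest_version("source.users-value")  # assumed subject name
fields = json.loads(latest.schema.schema_str)["fields"]

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="target_db")
cur = conn.cursor()
cur.execute("SELECT column_name FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s", ("target_db", "users"))
existing = {row[0] for row in cur.fetchall()}

for field in fields:
    if field["name"] in existing:
        continue
    ftype = field["type"]
    if isinstance(ftype, list):               # e.g. ["null", "string"]
        ftype = next(t for t in ftype if t != "null")
    if isinstance(ftype, dict):               # e.g. logical types
        ftype = ftype.get("type")
    mysql_type = AVRO_TO_MYSQL.get(ftype, "TEXT")
    cur.execute(f"ALTER TABLE users ADD COLUMN {field['name']} {mysql_type}")

cur.close()
conn.close()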

Is it right use-case of KSql

I am using Kafka Streams (KStreams) and need to de-duplicate the data. The source ingests duplicate data for many reasons, e.g. the data itself being duplicated, or re-partitioning.
Currently I am using Redis for this use case, where the data is stored roughly as below:
id#object list-of-applications-that-processed-this-id-and-this-object
As KSQL is implemented on top of RocksDB, which is also a key-value database, can I use KSQL for this use case?
At the time of successful processing, I would add an entry to KSQL. At the time of reception, I would check whether the id already exists in KSQL.
Is this a correct use case as per KSQL's design in the event-processing world?
If you want to use ksqlDB as a cache, you can create a TABLE using the topic as the data source. Note that a CREATE TABLE statement by itself only declares a schema; it does not pull any data into ksqlDB yet.
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background that reads the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT *, it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e., lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
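For illustration, such a pull query could be issued from Python over ksqlDB's REST /query endpoint roughly as follows (a sketch only: it assumes ksqlDB listens on localhost:8088 and that the cache table is keyed by a column named id):

# Sketch of a pull query against the "cache" table via ksqlDB's REST API.
# Assumes the table's key column is named "id" (adjust to your schema).
import requests

response = requests.post(
    "http://localhost:8088/query",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    json={"ksql": "SELECT * FROM cache WHERE id = 'some-id#object';",
          "streamsProperties": {}},
)
response.raise_for_status()

# The response is a JSON array: a header element followed by row elements.
# No row elements means this id has not been seen/processed yet.
rows = [item["row"] for item in response.json() if "row" in item]
print("already processed" if rows else "not seen yet")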
Future work:
We are currently working on adding "source tables" (cf. https://github.com/confluentinc/ksql/pull/7474), which will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

ksqlDB: Best way to create Tables from a Debezium source topics?

I would like to create Tables in ksqlDB from Debezium source topics, with the ultimate aim of performing a left join on these tables and efficiently outputting materialized views to a downstream database using the JDBC sink connector.
The Debezium source topics have not had any transforms applied (such as ExtractNewRecordState), so contain a 'before' and 'after' property, as described in the Debezium documentation here.
The reason for not applying the ExtractNewRecordState transform (which would presumably simplify matters) is that the source CDC topics may be used for various purposes and it does not appear possible to create multiple topics off the same source database table (since topic names are automatically determined by Debezium and depend on the database server, schema and table name as described here).
The best approach I have found so far is to:
create a stream in ksqlDB from the raw Debezium input, e.g.:
CREATE STREAM user_stream WITH (KAFKA_TOPIC='mssql.dbo.user', VALUE_FORMAT='AVRO');
create a second stream selecting the required fields from the 'after' property of the first stream, e.g.:
CREATE STREAM user_stream2 AS SELECT AFTER->user_id, AFTER->username, AFTER->email FROM user_stream EMIT CHANGES;
finally, convert the second stream to a table as described here, namely:
CREATE TABLE user_table AS
  SELECT user_id,
         LATEST_BY_OFFSET(username) AS username,
         LATEST_BY_OFFSET(email) AS email
  FROM user_stream2
  GROUP BY user_id
  EMIT CHANGES;
These steps must be repeated to generate each Table, at which point a join can be performed on the Tables to produce an output.
This seems quite long-winded, with a lot of intermediate steps, and performance also seems sluggish. Is there a better and/or more direct way to generate materialized views using ksqlDB and Debezium? Can any of the steps be cut out, and/or should I be using a different approach in step 3 (such as a windowing function)?
I'm particularly keen to ensure that the approach taken is the most efficient from a performance and resource usage perspective.

How to sink structured records directly from KSQL into a connector (e.g., InfluxDB)

I'm trying to sink data directly from KSQL into InfluxDB (or any other connector that requires schema definitions). I'm able to get things working in the simple case, but I start having trouble when the schema requires complex types (i.e., tags for InfluxDB).
Here's an example of my stream/schema:
Field | Type
-------------------------------------------------------------------
ROWKEY | VARCHAR(STRING) (primary key)
FIELD_1 | VARCHAR(STRING)
FIELD_2 | VARCHAR(STRING)
FIELD_3 | VARCHAR(STRING)
FIELD_4 | DOUBLE
TAGS | MAP<STRING, VARCHAR(STRING)>
If I manually create an AVRO schema and populate the records from a simple producer, I can get through the getting started guide here and embed the tags for InfluxDB.
However, when I move to KSQL, if I try to sink the AVRO stream directly into InfluxDB, I lose information on the complex types (tags). I notice the warning from this blog post, "Warning ksqlDB/KSQL cannot yet write data in an Avro format that is compatible with this connector"
Next, I tried converting the AVRO stream into JSON format, but now I understand that I would have to specify the schema in each record, similar to what this question is posing. I haven't been able to convert an AVRO stream into a JSON stream and embed the schema and payload at the same time.
Finally, I see the "jiggling solution" with kafkacat, but this would force me to dump records out of KSQL into kafkacat and then back into Kafka before they finally arrive in InfluxDB.
Is there a method to sink complex records directly from KSQL in either JSON or AVRO format into a connector?
I would imagine the reason ksqlDB can't yet output the Avro data in the format InfluxDB requires is that it won't output the TAGS field as an Avro map type: Avro maps require non-null keys, while the SQL MAP<STRING, STRING> type allows null keys. Hence ksqlDB serializes the map as an Avro array of key-value entries.
To get something working with Avro you'll need either:
Support for non-null types: https://github.com/confluentinc/ksql/issues/4436, or
Support for using existing Avro schema: https://github.com/confluentinc/ksql/issues/3634
Please feel free to up-vote / comment on these issues to raise their profiles.
Previously, a JSON-based solution would not have worked because, as you've pointed out, the connector requires the JSON schema embedded in the payload. However, the most recent version of Confluent Platform / Schema Registry supports JSON schemas in the Schema Registry. Hence, while I haven't tried it, upgrading to the latest CP version may mean a JSON-based solution will work. If not, it is probably worth raising a Jira/GitHub ticket to get the appropriate component upgraded so this works.
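For reference, the "schema embedded in each record" that a JSON-based approach needs (when the sink connector is configured with the JsonConverter and value.converter.schemas.enable=true) looks roughly like the envelope below. This is a hand-written illustration using the field names from the question, not something ksqlDB produces today:

# Illustration of the Kafka Connect JSON-with-schema envelope expected when
# value.converter.schemas.enable=true. Field names mirror the question's stream.
import json

record = {
    "schema": {
        "type": "struct",
        "name": "measurement",   # hypothetical struct name
        "optional": False,
        "fields": [
            {"field": "FIELD_1", "type": "string", "optional": True},
            {"field": "FIELD_4", "type": "double", "optional": True},
            {"field": "TAGS", "type": "map", "optional": True,
             "keys": {"type": "string", "optional": False},
             "values": {"type": "string", "optional": True}},
        ],
    },
    "payload": {
        "FIELD_1": "some-value",
        "FIELD_4": 42.0,
        "TAGS": {"host": "server-1", "region": "eu-west-1"},
    },
}

# This JSON string is what would be produced to the topic the sink reads.
print(json.dumps(record, indent=2))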