Using an existing Kafka topic with Avro in ksqlDB

Suppose I have a topic, say 'some_topic'. Data in this topic is serialized with Avro using Schema Registry.
The schema subject name is the same as the topic name, 'some_topic', without the '-value' suffix.
What I want to do is create a new topic, say 'some_topic_new', where data is serialized with the same schema but some fields are 'nullified'.
I'm trying to evaluate whether this can be done with ksqlDB, and I have two questions:
Is it possible to create a stream/table based on an existing topic, using the existing schema? Maybe something like create table some_topic (*) with (schema_subject=some_topic, ...), so that the fields for the new table are taken from the existing schema automatically.
Can creating a new schema with the '-value' suffix be avoided when creating a new stream/table?

When you create a stream in ksqlDB based on another stream, you can have it inherit the schema. Note that the two streams won't share the same schema, though; only the definition will be the same.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='AVRO');
CREATE STREAM my_stream_new
WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
SELECT * FROM my_stream;

Related

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 that defines a read (projection) schema v1
and another consumer C2 that defines its own read (projection) schema. The read schemas are not shared as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by the new optional field added to its schema. This is a backward-compatible change.
Messages encoded without this field will still be translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies the consumer C1's need by the new field added to its schema. The field doesn't have to be required as this is a forwards-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forwards-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by the new field added to its schema. Since it's not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema can not be used, because it is configured with FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution because each consumer must hold locally all versions of the write schema.
It's going to be a pain to sync all the consumers when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
"Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject."
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible but not backward-compatible, because data that has already been produced may not have a value for this field. But we don't apply the same changes to both the write and read schemas. That only works when the change is both forward and backward compatible (i.e., full compatibility), e.g., adding or removing optional fields. In our case, we'd have to add the new field to the read schema as optional.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.
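The backward-compatibility rule discussed above can be sketched in a few lines. The model below is a deliberate simplification, not the Schema Registry or Avro API: a schema is reduced to a map from field name to whether the field has a default, field types are ignored, and the names ReadSchemaCheck, canRead and canReadAll are hypothetical. It only illustrates why a read schema has to be checked against every version of the write schema.

```java
import java.util.List;
import java.util.Map;

// Simplified model (an assumption for illustration): a record schema is a map
// from field name to "does this field declare a default value?".
// Field types are ignored.
class ReadSchemaCheck {

    // A reader can decode data written with `writer` if every reader field
    // is either present in the writer schema or has a default to fall back on.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean writtenByProducer = writer.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (!writtenByProducer && !hasDefault) {
                return false;
            }
        }
        return true;
    }

    // Transitive check: the read schema must work with all write versions.
    static boolean canReadAll(Map<String, Boolean> reader,
                              List<Map<String, Boolean>> writerVersions) {
        for (Map<String, Boolean> writer : writerVersions) {
            if (!canRead(reader, writer)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Write schema v1 has only "id"; v2 adds a non-optional "email".
        Map<String, Boolean> writeV1 = Map.of("id", false);
        Map<String, Boolean> writeV2 = Map.of("id", false, "email", false);
        List<Map<String, Boolean>> versions = List.of(writeV1, writeV2);

        // The same new field, added to the read schema with and without a default.
        Map<String, Boolean> readOptional = Map.of("id", false, "email", true);
        Map<String, Boolean> readRequired = Map.of("id", false, "email", false);

        System.out.println(canReadAll(readOptional, versions));
        System.out.println(canReadAll(readRequired, versions));
    }
}
```

In this model the read schema that adds the field as optional passes against both write versions, while the one that adds it as required fails against v1, which matches the reasoning above.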

How to make the Kafka Connect BigQuery Sink Connector create one table per event type and not per topic?

I'm using confluentinc/kafka-connect-bigquery on our Kafka (Avro) events. On some topics, we have more than one event type, e.g., UserRegistered and UserDeleted are on the topic domain.user.
The subjects in our Schema Registry look as follows.
curl --silent -X GET http://avro-schema-registry.core-kafka.svc.cluster.local:8081/subjects | jq .
[...]
"domain.user-com.acme.message_schema.domain.user.UserDeleted",
"domain.user-com.acme.message_schema.domain.user.UserRegistered",
"domain.user-com.acme.message_schema.type.domain.key.DefaultKey",
[...]
My properties/connector.properties (I'm using the quickstart folder.) looks as follows:
[...]
topics.regex=domain.*
sanitizeTopics=true
autoCreateTables=true
[...]
In BigQuery a table called domain_user is created. However, I would like to have two tables, e.g., domain_user_userregistered and domain_user_userdeleted or similar, because the schemas of these two event types are quite different. How can I achieve this?
I think you can use the SchemaNameToTopic Single Message Transform to do this. It sets the topic name to the schema name, which then propagates through to the name given to the created BigQuery table.

Possible option for PrimaryKey in Table creation with KSQL?

I've started working with KSQL and am quite liking the experience. I'm trying to work with a Table and Stream join, and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. It is a static data set loaded into a Table and might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. It is a stream of data being loaded into a Stream.
A Table can't be created without specifying the key, but my data doesn't have a unique column to use as one. So while loading data from topic-1 into the Table, what exactly should my key be? Remember that my Table might get populated/updated once a month or so with the same data and new rows too. When new data is loaded, I can replace existing rows by key.
I tried to find out whether there's something like an incremental value, as we have with a PrimaryKey in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or suggest a query to create a PrimaryKey, if one exists? Thanks
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"

Concatenate value of two fields before publishing to kafka topic using Connect SMT

Is the ReplaceField transform used only to replace or mask the field name, or can I also change the value of the field using some expression with static values?
My need is to concatenate the values of two fields before publishing to a Kafka topic.
org.apache.kafka.connect.transforms.InsertField is used to add static values or topic metadata (topic name, partition, timestamp, offset, etc), but not concatenate, or use expressions.
org.apache.kafka.connect.transforms.ReplaceField is used to rename/filter existing fields, not add new ones.
That being said, you're going to have to create your own Transformation subclass that can merge a list of fields.
Or publish the existing "raw" data then use Kafka Streams or KSQL to create the "enriched" topic.
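The core of such a custom transformation is small. The sketch below shows only the merge step on a schemaless record value represented as a Map, so it runs without the Connect libraries; in a real SMT this logic would sit inside the apply method of a class implementing org.apache.kafka.connect.transforms.Transformation. The class name ConcatFields, the method merge, and the field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: only the field-merging logic, not a full Connect SMT.
class ConcatFields {

    // Returns a copy of `value` with `targetField` set to the source fields
    // joined by `separator`. Missing or null source fields are skipped.
    static Map<String, Object> merge(Map<String, Object> value,
                                     List<String> sourceFields,
                                     String separator,
                                     String targetField) {
        String joined = sourceFields.stream()
                .map(name -> value.get(name))
                .filter(v -> v != null)
                .map(v -> v.toString())
                .collect(Collectors.joining(separator));
        // Copy instead of mutating the input, since the original record
        // value should be left untouched.
        Map<String, Object> updated = new LinkedHashMap<>(value);
        updated.put(targetField, joined);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Object> value = new LinkedHashMap<>();
        value.put("first_name", "Ada");
        value.put("last_name", "Lovelace");
        System.out.println(merge(value, List.of("first_name", "last_name"), " ", "full_name"));
    }
}
```

Records with a schema (Struct values) would additionally need a new value schema that includes the target field, which this sketch does not cover.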

Storm- Routing bolt to get schema from the kafka spout

Storm - Conditionally consuming stream from kafka spout?
How do I get the schema of the data inside the Split Bolt when I try to declare its output using declareOutputFields()?
Fields schema = new Fields(?)
How do I get the schema of all the fields in the data inside this bolt without basically reparsing all the data and recreating it?
You need to know the schema beforehand, i.e., before you process the first tuples. The method declareOutputFields() is called during deployment, before the first call to execute().
Storm cannot handle a variable schema. If you have JSON data with unknown structure, you could declare new Fields("json") and put the whole JSON object into a single field.