How to deserialize Avro schema and then abandon schema before write to ES Sink Connector using SMT - apache-kafka

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but then after that I don't need any sort of schema at all. Each record has a dynamic amount of fields set, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This both keeps code more simple and I would assume improves performance since I don't have to recreate schemas for each record.
What I tried
I tried doing something like the applySchemaless code as shown here even when the record has a schema by returning something like this, with null for schema:
return newRecord(record, null, updatedValue);
However, in runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?

Related

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record of any unique key (userid, for example) is an insert. Afterwards any non null values for the same key are updates (generally requiring all fields to be part of the value, effectively going a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates to a common schema that you want information for, and filter event types that you want to ignore.
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 that defines a read (projection) schema v1
and another consumer C2 that defines its own read (projection) schema. The read schemas are not shared as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by the new optional field added to its schema. This is a backward-compatible change.
Messages encoded without this field will be still translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies the consumer C1's need by the new field added to its schema. The field doesn't have to be required as this is a forwards-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forwards-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by the new field added to its schema. Since it's not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema can not be used, because it is configured with FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution because each consumer must hold locally all versions of the write schema.
It's going to be a pain to sync all the consumers when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.

using kafka with schema-registry, in app with multiple topics and SubjectNameStrategy

To begin with, I have found a way how to do this, more or less. But it's really bad code. So I'm looking for suggestions how to solve this better if this approach exist.
To lay something to work with. Assume you have app, which sends avro to n topics and uses schema registry. Assume(at first) that you don't want to use avro unions, since they bring some issues along. N-1 topics are easy, 1 schema per topic. But then, you have data, you need to send in order, which means 1 topic and specified group key, but these data don't have same schema. So to do that, you need to register multiple schema per that topic in schema registry, which implies use of key.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy or similar. And here it becomes ugly.
But that setting is per schema registry instance, so you have to declare 2(or more) schema registry instances, one per each SubjectNameStrategy key/value combination. This will work.
But then, according to documentation, RecordNameStrategy is java-platform only (!), so if you would like to create service, which is not language specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
So if you cannot use RecordNameStrategy, and for some reason you still want to use avro and schema registry, IIUC you have no other choice, than to use avro unions on top level, and use defaut TopicNameStrategy, which is fine now, since you have single unioned schema. But top-level unions weren't nice to me in past, since deserializer don't know, naturally, which type would you like to deserialize from the data. So theoretically a way out of this could be using say Cloudevents standard(or something similar), setting cloudevent type attribute in respect to which type from union was used to serialize data, and then have type->deserializer map, to be able to pick correct deserializer for avro-encoded data in received cloudevents message. This will work, and not only for java.
So to wrap up, here are 2 generally described solutions to very simple problem. But to be honest, these seems extremely complicated for widely accepted solution (avro/schema-registry). I'd like to know, if there is easier way through this.
This is a common theme, particularly in CQRS-like systems in which commands may be ordered (eg create before update or delete etc). In these cases, using Kafka, it's often not desirable to publish the messages over multiple topics. You are correct that there are two solutions for sending messages with multiple schemas on the same topic: either a top-level union in the avro schema, or multiple schemas per topic.
You say you don't want to use top-level unions in the schema, so I'll address the case of multiple schemas per topic. You are correct that this excludes the use of any subject naming strategy that includes only the topic name to define the subject. So TopicNameStrategy is out.
But then, according to documentation, RecordNameStrategy is java-platform only (!), so if you would like to create service, which is not language specific (which you would most probably like to do in 2021 ...), you cannot use RecordNameStrategy.
This is worthy of some clarification.... In the confluent way of things, the 'schema-registry aware avro serializers' first register your writer schema in the registry against a subject name to obtain a schema id. They then prefix your avro bytes with that schema id before publishing to kafka. See the 'Confluent Wire Format' at https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format.
So the subject naming is a choice in the serializer library; the deserializer just resolves a schema by the id prefixing the kafka message. The confluent Java serializers make this subject naming configurable and define strategies TopicNameStrategy, RecordNameStrategy and TopicRecordNameStrategy. See https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#subject-name-strategy. The three strategies are conventions for defining 'scopes' over which schemas will be tested for compatibility in the registry (per-topic, per-record, or a combination). You've identified RecordNameStrategy fits your use case for having multiple avro schemas per topic.
However, I think your concern about non-Java support for RecordNameStrategy can be set aside. In the serializer, the subject naming is free to be implemented however the serializer developer chooses. Having worked on this stuff in Java, Python, Go and NodeJS, I've experienced some variety in how third-party serializers work in this regard. Nevertheless, working non-Java libs do exist.
If all else fails, you can write your own 'schema-registry aware serializer' that registers the schema with your chosen subject name prior to encoding the confluent wire-format for Kafka. I've had happy outcomes from other tooling by keeping to one of the well-known confluent strategies, so I can recommend mimicking them.

Possible option for PrimayKey in Table creation with KSQL?

I've started working with KSQL and quite living the experience. I'm trying to work with Table and Stream join and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my kafka topic-1. Is a static data set loaded to Table and might get updated once in a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in kafka topic-2. Is a stream of data being loaded to streams.
Since Table cant be created without specifying the Key, my data don't have a unique column to do so. So while loading data from topic-1 to Table, what exactly should my key be? Remember my Table might get populated/updated once in a month or so with same data and new once too. With new data being loaded I can replace them with the key.
I tried to find if there's something like incremental value as we call PrimaryKey in SQL, but didn't find any.
Can someone help me in correcting my approach towards the implementation or a query to create a PrimaryKey if exists. Thanks
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect you can use Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.

What is the best way to publish and consume different type of messages?

Kafka 0.8V
I want to publish /consume byte[] objects, java bean objects, serializable objects and much more..
What is the best way to define a publisher and consumer for this type scenario?
When I consume a message from the consumer iterator, I do not know what type of the message it is.
Can anybody point me a guide on how to design such scenarios?
I enforce a single schema or object type per Kafka Topic. That way when you receive messages you know exactly what you are getting.
At a minimum, you should decide whether a given topic is going to hold binary or string data, and depending on that, how it will be further encoded.
For example, you could have a topic named Schema that contains JSON-encoded objects stored as strings.
If you use JSON and a loosely-typed language like JavaScript, it could be tempting to store different objects with different schemas in the same topic. With JavaScript, you can just call JSON.parse(...), take a peek at the resulting object, and figure out what you want to do with it.
But you can't do that in a strictly-typed language like Scala. The Scala JSON parsers generally want you to parse the JSON into an already defined Scala type, usually a case class. They do not work with this model.
One solution is to keep the one schema / one topic rule, but cheat a little: wrap an object in an object. A typical example would be an Action object where you have a header that describes the action, and a payload object with a schema dependent on the action type listed in the header. Imagine this pseudo-schema:
{name: "Action", fields: [
{name: "actionType", type: "string"},
{name: "actionObject", type: "string"}
]}
This way, in even a strongly-typed language, you can do something like the following (again this is pseudo-code) :
action = JSONParser[Action].parse(msg)
switch(action.actionType) {
case "foo" => var foo = JSONParser[Foo].parse(action.actionObject)
case "bar" => var bar = JSONParser[Bar].parse(action.actionObject)
}
One of the neat things about this approach is that if you have a consumer that's waiting for only a specific action.actionType, and is just going to ignore all the others, it's pretty lightweight for it to decode just the header and put off decoding action.actionObject until when and if it is needed.
So far this has all been about string-encoded data. If you want to work with binary data, of course you can wrap it in JSON as well, or any of a number of string-based encodings like XML. But there are a number of binary-encoding systems out there, too, like Thrift and Avro. In fact, the pseudo-schema above is based on Avro. You can even do cool things in Avro like schema evolution, which amongst other things provides a very slick way to handle the above Action use case -- instead of wrapping an object in an object, you can define a schema that is a subset of other schemas and decode just the fields you want, in this case just the action.actionType field. Here is a really excellent description of schema evolution.
In a nutshell, what I recommend is:
Settle on a schema-based encoding system (be it JSON, XML, Avro,
whatever)
Enforce a one schema per topic rule