PubSub Subscription error with REPEATED Column Type - Avro Schema - publish-subscribe

I am trying to use the PubSub subscription "Write to BigQuery" feature, but am running into an issue with the "REPEATED" column type. The message I get when I update the subscription is:
Incompatible schema mode for field 'Values': field is REQUIRED in the topic schema, but REPEATED in the BigQuery table schema
My Avro Schema is:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "ItemID",
      "type": "string"
    },
    {
      "name": "UserType",
      "type": "string"
    },
    {
      "name": "Values",
      "type": [
        {
          "type": "record",
          "name": "Values",
          "fields": [
            {
              "name": "AttributeID",
              "type": "string"
            },
            {
              "name": "AttributeValue",
              "type": "string"
            }
          ]
        }
      ]
    }
  ]
}
Input JSON That "Matches" Schema:
{
  "ItemID": "Item_1234",
  "UserType": "Item",
  "Values": {
    "AttributeID": "TEST_ID_1",
    "AttributeValue": "Value_1"
  }
}
My table looks like:
ItemID           | STRING | NULLABLE
UserType         | STRING | NULLABLE
Values           | RECORD | REPEATED
  AttributeID    | STRING | NULLABLE
  AttributeValue | STRING | NULLABLE
I am able to "Test" and "Validate Schema" and both come back successful. The question is: what am I missing in the Avro schema for the Values node to make it REPEATED rather than REQUIRED, so that the subscription can be created?

The issue is that Values is not an array type in your Avro schema, meaning it expects exactly one such record per message, while it is REPEATED in your BigQuery schema, meaning it expects a list of them.

Per Kamal's comment above, this schema (with Values declared as an array of records, and the nested field names matching the BigQuery columns) works:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "ItemID",
      "type": "string"
    },
    {
      "name": "UserType",
      "type": "string"
    },
    {
      "name": "Values",
      "type": {
        "type": "array",
        "items": {
          "name": "NameDetails",
          "type": "record",
          "fields": [
            {
              "name": "AttributeID",
              "type": "string"
            },
            {
              "name": "AttributeValue",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
The payload:
{
  "ItemID": "Item_1234",
  "UserType": "Item",
  "Values": [
    {
      "AttributeID": "TEST_ID_1",
      "AttributeValue": "Value_1"
    }
  ]
}
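For a quick local check before updating the subscription, you can validate the payload against the corrected topic schema with fastavro. A minimal sketch, assuming the fastavro package is installed (pip install fastavro); the schema and payload are the ones above:

import fastavro
from fastavro.validation import validate

# The corrected topic schema from the answer above.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Avro",
    "fields": [
        {"name": "ItemID", "type": "string"},
        {"name": "UserType", "type": "string"},
        {"name": "Values", "type": {
            "type": "array",
            "items": {
                "name": "NameDetails",
                "type": "record",
                "fields": [
                    {"name": "AttributeID", "type": "string"},
                    {"name": "AttributeValue", "type": "string"},
                ],
            },
        }},
    ],
})

# Values is a list of records, matching the REPEATED RECORD column in BigQuery.
payload = {
    "ItemID": "Item_1234",
    "UserType": "Item",
    "Values": [
        {"AttributeID": "TEST_ID_1", "AttributeValue": "Value_1"},
    ],
}

# Raises fastavro.validation.ValidationError if the payload does not match.
validate(payload, schema)

This only validates the message against the Avro schema; the REQUIRED vs REPEATED comparison itself happens when Pub/Sub compares the topic schema with the BigQuery table at subscription creation time.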

Related

Add MySQL column comments as metadata to the Avro schema through the Debezium connector

Kafka Connect is used through the Confluent platform and io.debezium.connector.mysql.MySqlConnector is used as the Debezium connector.
In my case, the MySQL table includes columns with sensitive data, and these columns must be tagged as sensitive for further use.
SHOW FULL COLUMNS FROM astronauts;
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| Field   | Type         | Collation          | Null | Key | Default | Extra | Privileges                      | Comment          |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| orderid | int          | NULL               | YES  |     | NULL    |       | select,insert,update,references |                  |
| name    | varchar(100) | utf8mb4_0900_ai_ci | NO   |     | NULL    |       | select,insert,update,references | sensitive column |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
Notice MySQL comment for the name column.
Based on this table, I would like to have this Avro schema in the Schema registry:
{
"connect.name": "dbserver1.inventory.astronauts.Envelope",
"connect.version": 1,
"fields": [
{
"default": null,
"name": "before",
"type": [
"null",
{
"connect.name": "dbserver1.inventory.astronauts.Value",
"fields": [
{
"default": null,
"name": "orderid",
"type": [
"null",
"int"
]
},
{
"name": "name",
"type": {
"MY_CUSTOM_ATTRIBUTE": "sensitive column",
"type": "string"
}
}
],
"name": "Value",
"type": "record"
}
]
},
{
"default": null,
"name": "after",
"type": [
"null",
"Value"
]
},
{
"name": "source",
"type": {
"connect.name": "io.debezium.connector.mysql.Source",
"fields": [
{
"name": "version",
"type": "string"
},
{
"name": "connector",
"type": "string"
},
{
"name": "name",
"type": "string"
},
{
"name": "ts_ms",
"type": "long"
},
{
"default": "false",
"name": "snapshot",
"type": [
{
"connect.default": "false",
"connect.name": "io.debezium.data.Enum",
"connect.parameters": {
"allowed": "true,last,false,incremental"
},
"connect.version": 1,
"type": "string"
},
"null"
]
},
{
"name": "db",
"type": "string"
},
{
"default": null,
"name": "sequence",
"type": [
"null",
"string"
]
},
{
"default": null,
"name": "table",
"type": [
"null",
"string"
]
},
{
"name": "server_id",
"type": "long"
},
{
"default": null,
"name": "gtid",
"type": [
"null",
"string"
]
},
{
"name": "file",
"type": "string"
},
{
"name": "pos",
"type": "long"
},
{
"name": "row",
"type": "int"
},
{
"default": null,
"name": "thread",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "query",
"type": [
"null",
"string"
]
}
],
"name": "Source",
"namespace": "io.debezium.connector.mysql",
"type": "record"
}
},
{
"name": "op",
"type": "string"
},
{
"default": null,
"name": "ts_ms",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "transaction",
"type": [
"null",
{
"connect.name": "event.block",
"connect.version": 1,
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "total_order",
"type": "long"
},
{
"name": "data_collection_order",
"type": "long"
}
],
"name": "block",
"namespace": "event",
"type": "record"
}
]
}
],
"name": "Envelope",
"namespace": "dbserver1.inventory.astronauts",
"type": "record"
}
Notice the custom schema field named MY_CUSTOM_ATTRIBUTE.
Debezium 2.0 supports schema doc from column comments [DBZ-5489]; however, I personally think the doc field attribute is not appropriate, since:
any implementation of a schema registry or system that processes the schemas is free to drop those fields when encoding/decoding, and it's fully spec compliant
Additionally, the doc field is solely intended to provide information to a user of the schema and is not intended as a form of metadata that downstream programs can rely on
source: https://avro.apache.org/docs/1.10.2/spec.html#Schema+Resolution
Based on the Avro schema docs, custom attributes for Avro schemas are allowed and these attributes are known as metadata:
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
source: https://avro.apache.org/docs/1.10.2/spec.html#schemas
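As a quick illustration of that last sentence, here is a minimal sketch (assuming fastavro is installed and, as the spec requires, simply ignores attributes it does not know) showing that a schema carrying MY_CUSTOM_ATTRIBUTE encodes records byte-for-byte the same as the plain schema:

import io
import fastavro

plain = {
    "type": "record",
    "name": "Value",
    "fields": [{"name": "name", "type": "string"}],
}
tagged = {
    "type": "record",
    "name": "Value",
    "fields": [
        # Custom attribute attached to the type, as in the desired schema above.
        {"name": "name", "type": {"type": "string", "MY_CUSTOM_ATTRIBUTE": "sensitive column"}},
    ],
}

def encode(schema, record):
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, fastavro.parse_schema(schema), record)
    return buf.getvalue()

record = {"name": "some astronaut"}
# The custom attribute is metadata only: both schemas produce identical bytes.
assert encode(plain, record) == encode(tagged, record)

Whether a particular registry or connector preserves the attribute when it re-serializes the schema is a separate question, which is exactly the caveat quoted above for the doc field.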
I think Debezium transformations might be a solution, however, I have the following problems:
No idea how to get MySQL column comments in my custom transformation
org.apache.kafka.connect.data.SchemaBuilder does not allow adding custom attributes; as far as I know, it only supports doc and its predefined fields
Here are several native transformations for reference: https://github.com/apache/kafka/tree/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/

Confluent Kafka producer message format for nested records

I have an Avro schema registered for a Kafka topic and am trying to send data to it. The schema has nested records and I'm not sure how I correctly send data to it using confluent_kafka Python.
Example schema (ignore any typos in the schema; the real one is very large, this is just an example):
{
  "namespace": "company__name",
  "name": "our_data",
  "type": "record",
  "fields": [
    {
      "name": "datatype1",
      "type": ["null", {
        "type": "record",
        "name": "datatype1_1",
        "fields": [
          {"name": "site", "type": "string"},
          {"name": "units", "type": "string"}
        ]
      }],
      "default": null
    },
    {
      "name": "datatype2",
      "type": ["null", {
        "type": "record",
        "name": "datatype2_1",
        "fields": [
          {"name": "site", "type": "string"},
          {"name": "units", "type": "string"}
        ]
      }],
      "default": null
    }
  ]
}
I am trying to send data matching this schema using the confluent_kafka Python client. When I have done this before, the records were not nested, and I would use a typical dictionary of key: value pairs and serialize it. How can I send nested data so that it works with the schema?
What I tried so far...
message = {
    'datatype1': {
        'site': 'sitename',
        'units': 'm'
    }
}
This version does not cause any Kafka errors, but all of the columns show up as null
and...
message = {
    'datatype1': {
        'datatype1_1': {
            'site': 'sitename',
            'units': 'm'
        }
    }
}
This version produced a Kafka error related to the schema.
If you use namespaces, you don't have to worry about naming collisions and you can properly structure your optional records:
For example, both
{
  "meta": {
    "instanceID": "something"
  }
}
and
{}
are valid instances of:
{
"doc": "Survey",
"name": "Survey",
"type": "record",
"fields": [
{
"name": "meta",
"type": [
"null",
{
"name": "meta",
"type": "record",
"fields": [
{
"name": "instanceID",
"type": [
"null",
"string"
],
"namespace": "Survey.meta"
}
],
"namespace": "Survey"
}
],
"namespace": "Survey"
}
]
}
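For the producing side, here is a minimal sketch with the confluent_kafka serializers. Assumptions: confluent_kafka with the Avro extras is installed, the broker and Schema Registry URLs and the topic name are placeholders, and the schema string is a trimmed variant of the Survey schema above. The value handed to produce() is a plain nested dict; the serializer picks the matching branch of each union:

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Trimmed variant of the Survey schema above: one optional nested record.
schema_str = """
{
  "type": "record",
  "name": "Survey",
  "fields": [
    {
      "name": "meta",
      "default": null,
      "type": ["null", {
        "type": "record",
        "name": "meta_record",
        "namespace": "Survey",
        "fields": [
          {"name": "instanceID", "type": ["null", "string"], "default": null}
        ]
      }]
    }
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder
value_serializer = AvroSerializer(schema_registry, schema_str)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "value.serializer": value_serializer,
})

# Nested data is sent as a plain nested dict; no record or union names in the payload.
producer.produce(topic="survey_topic", value={"meta": {"instanceID": "something"}})
producer.flush()

If columns still come out null on the consuming side with this message shape, the problem is usually downstream of serialization rather than in the dictionary structure itself.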

Invalid Schema on Confluent Control Center

I am just trying to set up a value schema for a topic in the web interface of Confluent Control Center.
I chose the Avro format and tried the following schema:
{
"fields": [
{"name":"date",
"type":"dates",
"doc":"Date of the count"
},
{"name":"time",
"type":"timestamp-millis",
"doc":"date in ms"
},
{"name":"count",
"type":"int",
"doc":"Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
But the interface keeps on saying the input schema is invalid.
I have no idea why.
Any help is appreciated!
There are issues related to the datatypes:
"type": "dates" => "type": "string"
"type": "timestamp-millis" => "type": {"type": "long", "logicalType": "timestamp-millis"}
The updated schema will look like:
{
"fields": [
{
"name": "date",
"type": "string",
"doc": "Date of the count"
},
{
"name": "time",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
},
"doc": "date in ms"
},
{
"name": "count",
"type": "int",
"doc": "Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
Sample Payload:
{
  "date": "2020-07-10",
  "time": 1594339200000,
  "count": 1473217
}
More details on Avro datatypes can be found here:
https://docs.oracle.com/database/nosql-12.1.3.0/GettingStartedGuide/avroschemas.html
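You can also check both schemas locally before pasting them into Control Center. A small sketch with fastavro (an assumption on my part: the UI's validator words its errors differently, but it accepts and rejects the same schemas):

import fastavro

broken = {
    "type": "record",
    "name": "articleCount",
    "fields": [
        {"name": "date", "type": "dates", "doc": "Date of the count"},
        {"name": "time", "type": "timestamp-millis", "doc": "date in ms"},
        {"name": "count", "type": "int", "doc": "Number of Articles"},
    ],
}

fixed = {
    "type": "record",
    "name": "articleCount",
    "fields": [
        {"name": "date", "type": "string", "doc": "Date of the count"},
        {"name": "time", "type": {"type": "long", "logicalType": "timestamp-millis"}, "doc": "date in ms"},
        {"name": "count", "type": "int", "doc": "Number of Articles"},
    ],
}

try:
    fastavro.parse_schema(broken)
except Exception as exc:
    # "dates" and a bare "timestamp-millis" are not Avro types, so parsing fails.
    print("broken schema rejected:", exc)

fastavro.parse_schema(fixed)  # parses without error
print("fixed schema is valid Avro")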

Deserialize Avro message with schema type as object

I'm reading from a Kafka topic that contains Avro messages. I have the schema given below.
I'm unable to parse the schema and deserialize the messages.
{
"type": "record",
"name": "Request",
"namespace": "com.sk.avro.model",
"fields": [
{
"name": "empId",
"type": [
"null",
"string"
],
"default": null,
"description": "REQUIRED "
},
{
"name": "carUnit",
"type": "com.example.CarUnit",
"default": "ABC",
"description": "Car id"
}
]
}
I'm getting the error below:
The type of the "carUnit" field must be a defined name or a {"type": ...} expression
Can anyone please help out?
How about defining the unit record inline:
{
  "name": "carUnit",
  "type": {
    "type": "record",
    "name": "CarUnit",
    "namespace": "com.example",
    "fields": [
      {
        "name": "hostName",
        "type": "string"
      }
    ]
  }
}
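A quick way to confirm the fix is to parse the full Request schema with the unit record defined inline and round-trip a sample message. A sketch assuming fastavro is installed; the sample values are made up:

import io
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Request",
    "namespace": "com.sk.avro.model",
    "fields": [
        {"name": "empId", "type": ["null", "string"], "default": None},
        {"name": "carUnit", "type": {
            "type": "record",
            "name": "CarUnit",
            "namespace": "com.example",
            "fields": [{"name": "hostName", "type": "string"}],
        }},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {"empId": "E-1", "carUnit": {"hostName": "host-01"}})
buf.seek(0)
print(fastavro.schemaless_reader(buf, schema))

If CarUnit is meant to be shared across several fields, define it inline once at its first use and refer to it by its full name (com.example.CarUnit) afterwards.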

How do I COPY a nested Avro field to Redshift as a single field?

I have the following Avro schema for a record, and I'd like to issue a COPY to Redshift:
"fields": [{
"name": "id",
"type": "long"
}, {
"name": "date",
"type": {
"type": "record",
"name": "MyDateTime",
"namespace": "com.mynamespace",
"fields": [{
"name": "year",
"type": "int"
}, {
"name": "monthOfYear",
"type": "int"
}, {
"name": "dayOfMonth",
"type": "int"
}, {
"name": "hourOfDay",
"type": "int"
}, {
"name": "minuteOfHour",
"type": "int"
}, {
"name": "secondOfMinute",
"type": "int"
}, {
"name": "millisOfSecond",
"type": ["int", "null"],
"default": 0
}, {
"name": "zone",
"type": {
"type": "string",
"avro.java.string": "String"
},
"default": "America/New_York"
}],
"noregistry": []
}
}]
I want to condense the MyDateTime object during the COPY into a single column in Redshift. I saw that you can map nested JSON data to a top-level column: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json-jsonpaths, but I haven't figured out a way to concatenate the fields directly in the COPY command.
In other words, is there a way to convert the following record (originally in Avro format)
{
"id": 6,
"date": {
"year": 2010,
"monthOfYear": 10,
"dayOfMonth": 12,
"hourOfDay": 14,
"minuteOfHour": 26,
"secondOfMinute": 42,
"millisOfSecond": {
"int": 0
},
"zone": "America/New_York"
}
}
Into a row in Redshift that looks like:
id | date
---------------------------------------------
6 | 2010-10-12 14:26:42:000 America/New_York
I'd like to do this directly with COPY.
You would need to declare the Avro file(s) as a Redshift Spectrum external table and then use a query over that table to insert the data into the local Redshift table.
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
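The flattening itself is easy to express once the data is queryable. For illustration only, here is a hypothetical Python helper (not part of the COPY command) that turns one record of the shape above into the single-column value shown in the desired output; with the Spectrum approach, the equivalent concatenation would live in the SELECT that populates the local table:

# Hypothetical helper: flatten the nested MyDateTime record into the
# "YYYY-MM-DD HH:MM:SS:mmm Zone" string shown in the desired output.
def flatten_mydatetime(rec):
    d = rec["date"]
    millis = d.get("millisOfSecond") or 0
    if isinstance(millis, dict):  # Avro JSON union encoding, e.g. {"int": 0}
        millis = millis.get("int", 0)
    return "{:04d}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}:{:03d} {}".format(
        d["year"], d["monthOfYear"], d["dayOfMonth"],
        d["hourOfDay"], d["minuteOfHour"], d["secondOfMinute"],
        millis, d["zone"],
    )

record = {
    "id": 6,
    "date": {
        "year": 2010, "monthOfYear": 10, "dayOfMonth": 12,
        "hourOfDay": 14, "minuteOfHour": 26, "secondOfMinute": 42,
        "millisOfSecond": {"int": 0}, "zone": "America/New_York",
    },
}

print(record["id"], "|", flatten_mydatetime(record))
# 6 | 2010-10-12 14:26:42:000 America/New_York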