Can I use a single S3 sink connector to point at the same timestamp field name across different Avro schemas for different topics? - apache-kafka

Schema for topic t1:
{
"type": "record",
"name": "Envelope",
"namespace": "t1",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "id",
"type": {
"type": "long",
"connect.default": 0
},
"default": 0
},
{
"name": "createdAt",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
}
],
"connect.name": "t1.Value"
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
}
],
"connect.name": "t1.Envelope"
}
Schema for topic t2:
{
"type": "record",
"name": "Value",
"namespace": "t2",
"fields": [
{
"name": "id",
"type": {
"type": "long",
"connect.default": 0
},
"default": 0
},
{
"name": "createdAt",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
}
],
"connect.name": "t2.Value"
}
S3 sink connector configuration:
connector.class=io.confluent.connect.s3.S3SinkConnector
behavior.on.null.values=ignore
s3.region=us-west-2
partition.duration.ms=1000
flush.size=1
tasks.max=3
timezone=UTC
topics.regex=t1,t2
aws.secret.access.key=******
locale=US
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
value.converter.schemas.enable=false
name=s3-sink-connector
aws.access.key.id=******
errors.tolerance=all
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=s3-sink-connector-bucket
path.format=YYYY/MM/dd
timestamp.extractor=RecordField
timestamp.field=after.createdAt
With this connector configuration I get an error for topic t2: "createdAt field does not exist".
If I set timestamp.field=createdAt instead, the same error is thrown for topic t1.
How can I point at the "createdAt" field in both schemas at the same time using the same connector for both?
Is it possible to achieve this with a single S3 sink connector configuration, and if so, which properties do I need to use?
If there is another way to do this, please suggest that as well.

All topics will need the same timestamp field; there's no way to configure topic-to-field mappings.
Your t2 schema doesn't have an after field, so you need to run two separate connectors.
The field is also required to be present in all records, otherwise the partitioner won't work.
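For example, a minimal sketch of the split (the connector names are assumptions; the field paths follow the two schemas above). The remaining properties (storage, format, partitioner, converters, credentials) stay the same as in the original configuration:
# Connector 1: topic t1 (Debezium envelope, timestamp nested under "after")
name=s3-sink-connector-t1
topics=t1
timestamp.extractor=RecordField
timestamp.field=after.createdAt
# Connector 2: topic t2 (flat Value record, timestamp at the top level)
name=s3-sink-connector-t2
topics=t2
timestamp.extractor=RecordField
timestamp.field=createdAt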

Related

PySpark stream from Kafka topic in Avro format returns null DataFrame

I have a topic in Avro format and I want to read it as a stream in PySpark, but the output is null. My data looks like this:
{
"ID": 559,
"DueDate": 1676362642000,
"Number": 1,
"__deleted": "false"
}
and the schema in the schema registry is:
{
"type": "record",
"name": "Value",
"namespace": "test",
"fields": [
{
"name": "ID",
"type": "long"
},
{
"name": "DueDate",
"type": {
"type": "long",
"connect.version": 1,
"connect.name": "io.debezium.time.Timestamp"
}
},
{
"name": "Number",
"type": "long"
},
{
"name": "StartDate",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.name": "io.debezium.time.Timestamp"
}
],
"default": null
},
{
"name": "__deleted",
"type": [
"null",
"string"
],
"default": null
}
],
"connect.name": "test.Value"
}
and the schema I defined in PySpark is:
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("ID", LongType(), False),
    StructField("DueDate", LongType(), False),
    StructField("Number", LongType(), False),
    StructField("StartDate", LongType(), True),
    StructField("__deleted", StringType(), True)
])
In the result I see a null DataFrame. I expect the values in the DataFrame to match the records in the Kafka topic, but all columns are null.
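The question doesn't show the read code; below is a minimal sketch of the structured-streaming read this setup implies (the bootstrap servers, topic name, schema file path, and the use of from_avro are assumptions, not part of the original question). One thing worth noting: values written by the Confluent Avro serializer carry a 5-byte schema-registry prefix that plain from_avro does not strip, which is a common cause of all-null output.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.avro.functions import from_avro  # needs the org.apache.spark:spark-avro package

spark = SparkSession.builder.appName("avro-stream").getOrCreate()

# Avro schema exactly as registered in the Schema Registry (JSON string)
avro_schema = open("test_value.avsc").read()  # hypothetical file holding the schema above

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
      .option("subscribe", "test-topic")                    # assumption
      .option("startingOffsets", "earliest")
      .load())

# Skip the magic byte + 4-byte schema id added by the Confluent serializer
# (Spark SQL substring() is 1-based), then decode the remaining Avro bytes.
decoded = df.select(
    from_avro(expr("substring(value, 6, length(value) - 5)"), avro_schema).alias("data")
).select("data.*")

decoded.writeStream.format("console").start().awaitTermination()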

org.apache.avro.AvroTypeException: Unknown union branch EventId

I am trying to convert JSON to Avro using kafka-avro-console-producer and publish it to a Kafka topic.
I am able to do that for flat JSON/schemas, but for the schema and JSON given below I get an
"org.apache.avro.AvroTypeException: Unknown union branch EventId" error.
Any help would be appreciated.
Schema :
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [{
"name": "before",
"type": ["null", {
"type": "record",
"name": "Value",
"fields": [{
"name": "EventId",
"type": "long"
}, {
"name": "CameraId",
"type": ["null", "long"],
"default": null
}, {
"name": "SiteId",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Value"
}],
"default": null
}, {
"name": "after",
"type": ["null", "Value"],
"default": null
}, {
"name": "op",
"type": "string"
}, {
"name": "ts_ms",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Envelope"
}
And the JSON input is as below:
{
"before": null,
"after": {
"EventId": 12,
"CameraId": 10,
"SiteId": 11974
},
"op": "C",
"ts_ms": null
}
In my case I can't alter the schema; I can only alter the JSON in such a way that it works.
If you are using the Avro JSON format, the input you have is slightly off. For unions, non-null values need to be wrapped so that the branch type is listed: https://avro.apache.org/docs/current/spec.html#json_encoding
See below for an example which I think should work.
{
"before": null,
"after": {
"CoreOLTPEvents.dbo.Event.Value": {
"EventId": 12,
"CameraId": {
"long": 10
},
"SiteId": {
"long": 11974
}
}
},
"op": "C",
"ts_ms": null
}
Removing "connect.name": "CoreOLTPEvents.dbo.Event.Value" and "connect.name": "CoreOLTPEvents.dbo.Event.Envelope" as The RecordType can only contains {'namespace', 'aliases', 'fields', 'name', 'type', 'doc'} keys.
Could you try with below schema and see if you are able to produce the msg?
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "EventId",
"type": "long"
},
{
"name": "CameraId",
"type": [
"null",
"long"
],
"default": "null"
},
{
"name": "SiteId",
"type": [
"null",
"long"
],
"default": "null"
}
]
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
},
{
"name": "op",
"type": "string"
},
{
"name": "ts_ms",
"type": [
"null",
"long"
],
"default": null
}
]
}

Confluent Kafka producer message format for nested records

I have an Avro schema registered for a Kafka topic and am trying to send data to it. The schema has nested records and I'm not sure how to correctly send data to it using confluent_kafka in Python.
Example schema (ignore any typos in the schema; the real one is very large, this is just an example):
{
"namespace": "company__name",
"name": "our_data",
"type": "record",
"fields": [
{
"name": "datatype1",
"type": ["null", {
"type": "record",
"name": "datatype1_1",
"fields": [
{"name": "site", "type": "string"},
{"name": "units", "type": "string"}
]
}],
"default": null
},
{
"name": "datatype2",
"type": ["null", {
"type": "record",
"name": "datatype2_1",
"fields": [
{"name": "site", "type": "string"},
{"name": "units", "type": "string"}
]
}],
"default": null
}
]
}
I am trying to send data for this schema using the confluent_kafka Python client. When I have done this before, the records were not nested and I would use a typical dictionary of key/value pairs and serialize it. How can I send nested data that works with the schema?
What I tried so far...
message = {'datatype1':
{'site': 'sitename',
'units': 'm'
}
}
This version does not cause any Kafka errors, but all of the columns show up as null.
and...
message = {'datatype1':
{'datatype1_1':
{'site': 'sitename',
'units': 'm'
}
}
}
This version produced a Kafka error about the schema.
If you use namespaces, you don't have to worry about naming collisions and you can properly structure your optional records:
for example, both
{
"meta": {
"instanceID": "something"
}
}
And
{}
are valid instances of:
{
"doc": "Survey",
"name": "Survey",
"type": "record",
"fields": [
{
"name": "meta",
"type": [
"null",
{
"name": "meta",
"type": "record",
"fields": [
{
"name": "instanceID",
"type": [
"null",
"string"
],
"namespace": "Survey.meta"
}
],
"namespace": "Survey"
}
],
"namespace": "Survey"
}
]
}
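As a follow-up, here is a minimal sketch of producing a nested value with the confluent_kafka Python client (the broker address, topic name, registry URL, and schema file path are assumptions; with AvroSerializer, which uses fastavro underneath, nested records are passed as plain nested dicts, as in the first attempt above):
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = open("our_data.avsc").read()  # hypothetical file holding the registered schema

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumption
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumption

# Nested records are plain nested dicts; optional (union) fields can simply be omitted
message = {"datatype1": {"site": "sitename", "units": "m"}}

producer.produce(
    topic="our_data",  # assumption
    value=avro_serializer(message, SerializationContext("our_data", MessageField.VALUE)),
)
producer.flush()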

Kafka Connect - JDBC Avro connector: how to define a custom schema registry

I was following a tutorial on Kafka Connect, and I am wondering if there is a possibility to define a custom schema registry for a topic whose data comes from a MySQL table.
I can't find where to define it in my JSON connect config, and I don't want to create a new version of that schema after creating it.
My MySQL table called stations has this schema:
Field          | Type
---------------+-------------
code           | varchar(4)
date_measuring | timestamp
attributes     | varchar(256)
where attributes contains JSON data rather than a plain string (I have to use that column type because the JSON fields inside attributes are variable).
My connector is
{
"value.converter.schema.registry.url": "http://localhost:8081",
"_comment": "The Kafka topic will be made up of this prefix, plus the table name ",
"key.converter.schema.registry.url": "http://localhost:8081",
"name": "jdbc_source_mysql_stations",
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"transforms": [
"ValueToKey"
],
"transforms.ValueToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.ValueToKey.fields": [
"code",
"date_measuring"
],
"connection.url": "jdbc:mysql://localhost:3306/db_name?useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC",
"connection.user": "confluent",
"connection.password": "**************",
"table.whitelist": [
"stations"
],
"mode": "timestamp",
"timestamp.column.name": [
"date_measuring"
],
"validate.non.null": "false",
"topic.prefix": "mysql-"
}
and it creates this schema:
{
"subject": "mysql-stations-value",
"version": 1,
"id": 23,
"schema": "{\"type\":\"record\",\"name\":\"stations\",\"fields\":[{\"name\":\"code\",\"type\":\"string\"},{\"name\":\"date_measuring\",\"type\":{\"type\":\"long\",\"connect.version\":1,\"connect.name\":\"org.apache.kafka.connect.data.Timestamp\",\"logicalType\":\"timestamp-millis\"}},{\"name\":\"attributes\",\"type\":\"string\"}],\"connect.name\":\"stations\"}"
}
Where "attributes" field is of course a String.
Unlike I would apply it this other schema.
{
"fields": [
{
"name": "code",
"type": "string"
},
{
"name": "date_measuring",
"type": {
"connect.name": "org.apache.kafka.connect.data.Timestamp",
"connect.version": 1,
"logicalType": "timestamp-millis",
"type": "long"
}
},
{
"name": "attributes",
"type": {
"type": "record",
"name": "AttributesRecord",
"fields": [
{
"name": "H1",
"type": "long",
"default": 0
},
{
"name": "H2",
"type": "long",
"default": 0
},
{
"name": "H3",
"type": "long",
"default": 0
},
{
"name": "H",
"type": "long",
"default": 0
},
{
"name": "Q",
"type": "long",
"default": 0
},
{
"name": "P1",
"type": "long",
"default": 0
},
{
"name": "P2",
"type": "long",
"default": 0
},
{
"name": "P3",
"type": "long",
"default": 0
},
{
"name": "P",
"type": "long",
"default": 0
},
{
"name": "T",
"type": "long",
"default": 0
},
{
"name": "Hr",
"type": "long",
"default": 0
},
{
"name": "pH",
"type": "long",
"default": 0
},
{
"name": "RX",
"type": "long",
"default": 0
},
{
"name": "Ta",
"type": "long",
"default": 0
},
{
"name": "C",
"type": "long",
"default": 0
},
{
"name": "OD",
"type": "long",
"default": 0
},
{
"name": "TU",
"type": "long",
"default": 0
},
{
"name": "MO",
"type": "long",
"default": 0
},
{
"name": "AM",
"type": "long",
"default": 0
},
{
"name": "N03",
"type": "long",
"default": 0
},
{
"name": "P04",
"type": "long",
"default": 0
},
{
"name": "SS",
"type": "long",
"default": 0
},
{
"name": "PT",
"type": "long",
"default": 0
}
]
}
}
],
"name": "stations",
"namespace": "com.mycorp.mynamespace",
"type": "record"
}
Any suggestions, please?
In case it's not possible, I suppose I'll have to create a Kafka Streams application to populate another topic, even though I would rather avoid it.
Thanks in advance!
I don't think you're asking about using a "custom" registry (which you'd do with the two lines that say which registry you're using), but rather how you can parse the data / apply a schema after the record is pulled from the database.
You can write your own Transform, or you can use Kafka Streams; those are really the main options here. There is a SetSchemaMetadata transform, but I'm not sure it will do what you want (parse a string into an Avro record).
Or, if you must shove JSON data into a single database column, maybe you shouldn't use MySQL but rather a document database, which has more flexible data constraints.
Otherwise, you can use BLOB rather than varchar and put binary Avro data into that column, but then you'd still need a custom deserializer to read the data.

Invalid schema on Confluent Control Center

I am just trying to set up a value schema for a topic in the web interface of Confluent Control Center.
I chose the Avro format and tried the following schema:
{
"fields": [
{"name":"date",
"type":"dates",
"doc":"Date of the count"
},
{"name":"time",
"type":"timestamp-millis",
"doc":"date in ms"
},
{"name":"count",
"type":"int",
"doc":"Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
But the interface keeps on saying the input schema is invalid.
I have no idea why.
Any help is appreciated!
There are issues with the datatypes:
"type":"dates" => "type": "string"
"type":"timestamp-millis" => "type": {"type": "long", "logicalType": "timestamp-millis"}
The updated schema will look like this:
{
"fields": [
{
"name": "date",
"type": "string",
"doc": "Date of the count"
},
{
"name": "time",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
},
"doc": "date in ms"
},
{
"name": "count",
"type": "int",
"doc": "Number of Articles"
}
],
"name": "articleCount",
"type": "record"
}
Sample Payload:
{
"date": "2020-07-10",
"time": 12345678900,
"count": 1473217
}
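To double-check a schema and payload outside Control Center, here is a small sketch using the fastavro library (the library choice is an assumption and is not part of the original answer):
import io
from fastavro import parse_schema, schemaless_writer

schema = {
    "type": "record",
    "name": "articleCount",
    "fields": [
        {"name": "date", "type": "string", "doc": "Date of the count"},
        {"name": "time",
         "type": {"type": "long", "logicalType": "timestamp-millis"},
         "doc": "date in ms"},
        {"name": "count", "type": "int", "doc": "Number of Articles"},
    ],
}
payload = {"date": "2020-07-10", "time": 12345678900, "count": 1473217}

parsed = parse_schema(schema)            # raises if the schema itself is invalid
buf = io.BytesIO()
schemaless_writer(buf, parsed, payload)  # raises if the payload doesn't match the schema
print(f"payload serialized to {buf.tell()} bytes")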
More references related to Avro datatypes can be found here:
https://docs.oracle.com/database/nosql-12.1.3.0/GettingStartedGuide/avroschemas.html