Confluent Kafka producer message format for nested records - apache-kafka

I have an Avro schema registered for a Kafka topic and am trying to send data to it. The schema has nested records and I'm not sure how to correctly send data to it using the confluent_kafka Python library.
Example schema (*ignore any typos in the schema; the real one is very large, this is just an example):
{
  "namespace": "company__name",
  "name": "our_data",
  "type": "record",
  "fields": [
    {
      "name": "datatype1",
      "type": ["null", {
        "type": "record",
        "name": "datatype1_1",
        "fields": [
          {"name": "site", "type": "string"},
          {"name": "units", "type": "string"}
        ]
      }],
      "default": null
    },
    {
      "name": "datatype2",
      "type": ["null", {
        "type": "record",
        "name": "datatype2_1",
        "fields": [
          {"name": "site", "type": "string"},
          {"name": "units", "type": "string"}
        ]
      }],
      "default": null
    }
  ]
}
I am trying to send data to this topic using the confluent_kafka Python library. When I have done this before, the records were not nested and I would use a typical dictionary of key/value pairs and serialize it. How can I send nested data so that it works with this schema?
What I tried so far...
message = {
    'datatype1': {
        'site': 'sitename',
        'units': 'm'
    }
}
This version does not cause any Kafka errors, but all of the columns show up as null.
and...
message = {
    'datatype1': {
        'datatype1_1': {
            'site': 'sitename',
            'units': 'm'
        }
    }
}
This version produced a Kafka error against the schema.
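For context, a minimal confluent_kafka producer sketch for a schema like this is shown below; the topic name, broker address, Schema Registry URL, and schema file name are assumptions, not taken from the question:

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Sketch only: "our_data.avsc", "our_data_topic", and both URLs are placeholders.
schema_str = open("our_data.avsc").read()  # the schema shown above

schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

# The optional nested record is passed as a plain nested dict (the shape of the
# first attempt above); "datatype2" is omitted and should fall back to its null default.
message = {"datatype1": {"site": "sitename", "units": "m"}}

producer.produce(
    topic="our_data_topic",
    value=avro_serializer(message, SerializationContext("our_data_topic", MessageField.VALUE)),
)
producer.flush()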

If you use namespaces, you don't have to worry about naming collisions and you can properly structure your optional records:
for example, both
{
  "meta": {
    "instanceID": "something"
  }
}
And
{}
are valid instances of:
{
  "doc": "Survey",
  "name": "Survey",
  "type": "record",
  "fields": [
    {
      "name": "meta",
      "type": [
        "null",
        {
          "name": "meta",
          "type": "record",
          "fields": [
            {
              "name": "instanceID",
              "type": [
                "null",
                "string"
              ],
              "namespace": "Survey.meta"
            }
          ],
          "namespace": "Survey"
        }
      ],
      "namespace": "Survey"
    }
  ]
}
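As a quick local check (a sketch, not part of the original answer; the survey.avsc file name is an assumption), the non-empty payload round-trips cleanly with fastavro once the unions are structured this way:

import io
import json

from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Parse the Survey schema above (assumed to be saved as survey.avsc).
schema = parse_schema(json.load(open("survey.avsc")))

# Write and read back the payload with the optional nested "meta" record set.
buf = io.BytesIO()
schemaless_writer(buf, schema, {"meta": {"instanceID": "something"}})
buf.seek(0)
print(schemaless_reader(buf, schema))  # {'meta': {'instanceID': 'something'}}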

Related

Cannot invoke "Object.getClass()" because "datum" is null of string in field 'schemaVersionId' - Confluent Schema Registry

I have an Avro schema for which I am generating the Java bean using the avro-maven-plugin. I then instantiate it and send it to Kafka, also using Confluent's Schema Registry. I can consume and deserialise the Avro into a Spark DataFrame just fine. The problem I am facing is that I am forced to set the schemaVersionId at the producer level. If I don't set it, the KafkaAvroSerializer throws the error in the title. Any ideas, please?
val contractEvent: ContractEvent = new ContractEvent()
contractEvent.setSchemaVersionId("1") // This should be appended as the schema is auto-registered.
contractEvent.setIngestedAt("123")
contractEvent.setChangeType(changeTypeEnum.U)
contractEvent.setServiceName("Contract")
contractEvent.setPayload(avroContract)
{
  "type": "record",
  "namespace": "xxxxxxx",
  "name": "ContractEvent",
  "fields": [
    {"name": "ingestedAt", "type": "string"},
    {"name": "eventType",
      "type": {
        "name": "eventTypeEnum",
        "type": "enum", "symbols": ["U", "D", "B"]
      }
    },
    {"name": "serviceName", "type": "string"},
    {"name": "payload",
      "type": {
        "type": "record",
        "name": "Contract",
        "fields": [
          {"name": "identifier", "type": "string"},
          {"name": "createdBy", "type": "string"},
          {"name": "createdDate", "type": "string"}
        ]
      }
    }
  ]
}

Can I use a single s3-sink connector to point at the same timestamp field name across topics that use different Avro schemas?

schema for topic t1
{
  "type": "record",
  "name": "Envelope",
  "namespace": "t1",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "id",
              "type": {
                "type": "long",
                "connect.default": 0
              },
              "default": 0
            },
            {
              "name": "createdAt",
              "type": [
                "null",
                {
                  "type": "string",
                  "connect.version": 1,
                  "connect.name": "io.debezium.time.ZonedTimestamp"
                }
              ],
              "default": null
            }
          ],
          "connect.name": "t1.Value"
        }
      ],
      "default": null
    },
    {
      "name": "after",
      "type": [
        "null",
        "Value"
      ],
      "default": null
    }
  ],
  "connect.name": "t1.Envelope"
}
schema for topic t2
{
  "type": "record",
  "name": "Value",
  "namespace": "t2",
  "fields": [
    {
      "name": "id",
      "type": {
        "type": "long",
        "connect.default": 0
      },
      "default": 0
    },
    {
      "name": "createdAt",
      "type": [
        "null",
        {
          "type": "string",
          "connect.version": 1,
          "connect.name": "io.debezium.time.ZonedTimestamp"
        }
      ],
      "default": null
    }
  ],
  "connect.name": "t2.Value"
}
s3-sink Connector configuration
connector.class=io.confluent.connect.s3.S3SinkConnector
behavior.on.null.values=ignore
s3.region=us-west-2
partition.duration.ms=1000
flush.size=1
tasks.max=3
timezone=UTC
topics.regex=t1,t2
aws.secret.access.key=******
locale=US
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
value.converter.schemas.enable=false
name=s3-sink-connector
aws.access.key.id=******
errors.tolerance=all
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=s3-sink-connector-bucket
path.format=YYYY/MM/dd
timestamp.extractor=RecordField
timestamp.field=after.createdAt
With this connector configuration I get an error for topic t2: "createdAt field does not exist".
If I set timestamp.field=createdAt, then the same error is thrown for topic t1 instead.
How can I point at the "createdAt" field in both schemas at the same time using the same connector for both?
Is it possible to achieve this with a single s3-sink connector configuration?
If so, which properties do I have to use? If there is another way to do this, please suggest that as well.
All topics will need the same timestamp field; there's no way to configure topic-to-field mappings.
Your t2 schema doesn't have an after field, so you need to run two separate connectors.
The field is also required to be present in all records, otherwise the partitioner won't work.
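Concretely (a sketch, not from the original answer), the t2 connector would reuse the configuration above and differ only in the topic selection and the timestamp field path:

# second connector, consuming only t2; all other properties as in the config above
name=s3-sink-connector-t2
topics=t2
timestamp.field=createdAt

The original connector would then be scoped to t1 and keep timestamp.field=after.createdAt.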

Kafka & Avro producer - Schema being registered is incompatible with an earlier schema for subject

I'm running the schema-registry-confluent example locally, and I got an error when I modified the schema of the message.
This is my schema:
{
  "type": "record",
  "namespace": "io.confluent.tutorial.pojo.avro",
  "name": "OrderDetail",
  "fields": [
    {
      "name": "number",
      "type": "long",
      "doc": "The order number."
    },
    {
      "name": "date",
      "type": "long",
      "logicalType": "date",
      "doc": "The date the order was submitted."
    },
    {
      "name": "client",
      "type": {
        "type": "record",
        "name": "Client",
        "fields": [
          { "name": "code", "type": "string" }
        ]
      }
    }
  ]
}
And I tried to send this message on the producer:
{"number": 2343434, "date": 1596490462, "client": {"code": "1234"}}
But I got this error:
org.apache.kafka.common.errors.InvalidConfigurationException: Schema being registered is incompatible with an earlier schema for subject "example-topic-avro-value"; error code: 409
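Before producing, a modified schema can be checked against the latest registered version through Schema Registry's compatibility endpoint; below is a sketch (the localhost:8081 URL and the order_detail.avsc file name are assumptions):

import json
import requests

subject = "example-topic-avro-value"
candidate_schema = open("order_detail.avsc").read()  # the modified schema above

# POST the candidate schema to the compatibility check endpoint.
resp = requests.post(
    f"http://localhost:8081/compatibility/subjects/{subject}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": candidate_schema}),
)
print(resp.json())  # e.g. {"is_compatible": false} when the change is breaking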

Tags are not written in influxdb through kafka-connect-influxdb

I am trying to connect a Kafka sink to InfluxDB. It works, but it does not save tags. For example, if I send this to the Kafka topic:
{"id": 1, "product": "pencil", "quantity": 100, "price": 50, "tags": {"DEVICE": "living", "location": "home"}}
the data is saved to InfluxDB, but only the fields part.
I have been trying to debug this but have failed. The versions I am using:
kafka 2.11-2.4.0
influxdb: 1.7.7
I encountered this too when I followed the Avro tags example on this page:
https://docs.confluent.io/kafka-connect-influxdb/current/influx-db-sink-connector/index.html
The "tags" schema in the example was incorrect. The example defines tags as:
{
  "name": "tags",
  "type": {
    "name": "tags",
    "type": "record",
    "fields": [{
      "name": "DEVICE",
      "type": "string"
    }, {
      "name": "location",
      "type": "string"
    }]
  }
}
It should actually be
{
  "name": "tags",
  "type": {
    "type": "map",
    "values": "string"
  }
}
This web page provided the solution: https://rmoff.net/2020/01/23/notes-on-getting-data-into-influxdb-from-kafka-with-kafka-connect/
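With the map-typed tags field, the value produced to the topic can carry tags as an ordinary string-to-string dict, matching the JSON from the question (a sketch; the field names are taken from the example above):

# Sketch of a value matching the corrected schema: "tags" is an Avro map,
# i.e. a plain dict of string keys to string values on the Python side.
record = {
    "id": 1,
    "product": "pencil",
    "quantity": 100,
    "price": 50,
    "tags": {"DEVICE": "living", "location": "home"},
}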

Deserialize Avro message with schema type as object

I'm reading from a Kafka topic that contains Avro messages, using the schema given below.
I'm unable to parse the schema and deserialize the message.
{
  "type": "record",
  "name": "Request",
  "namespace": "com.sk.avro.model",
  "fields": [
    {
      "name": "empId",
      "type": [
        "null",
        "string"
      ],
      "default": null,
      "description": "REQUIRED"
    },
    {
      "name": "carUnit",
      "type": "com.example.CarUnit",
      "default": "ABC",
      "description": "Car id"
    }
  ]
}
I'm getting the below error:
The type of the "carUnit" field must be a defined name or a {"type": ...} expression
Can anyone please help out?
How about
{
  "name": "businessUnit",
  "type": {
    "type": "record",
    "name": "BusinessUnit",
    "fields": [{
      "name": "hostName",
      "type": "string"
    }]
  }
}
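To confirm a nested definition like this is well formed before registering it, one option (a sketch, not from the original answer; the wrapper record name is illustrative) is to run it through fastavro:

from fastavro import parse_schema

# Wrap the suggested field in a minimal record and let fastavro parse it;
# parse_schema raises if the inline nested type is malformed.
schema = {
    "type": "record",
    "name": "Request",
    "fields": [
        {
            "name": "businessUnit",
            "type": {
                "type": "record",
                "name": "BusinessUnit",
                "fields": [{"name": "hostName", "type": "string"}],
            },
        }
    ],
}

parse_schema(schema)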