Kafka: producer and consumer with different Avro files

I am working with 2 different Avro schema files:
avroConsumer:
{"namespace": "autoGenerated.avro",
"type": "record",
"name": "UserConsumer",
"fields": [
{"name": "Name", "type": "string"},
{"name": "Surname", "type":["null","string"],"default": null},
{"name": "favorite_number", "type": ["long", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
avroProducer:
{"namespace": "autoGenerated.avro",
"type": "record",
"name": "UserProducer",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
During consumption a deserialization error occurs, but I thought that defining the "default" attribute in the consumer schema would make it work correctly.
Reference: http://avro.apache.org/docs/current/spec.html#Schema+Resolution
if the reader's record schema has a field that contains a default
value, and writer's schema does not have a field with the same name,
then the reader should use the default value from its field.
Do you have any ideas? Can I define a consumer Avro schema that is different from the producer's?
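For reference, here is a hand-rolled sketch (not the actual Avro library) of the record-resolution rule quoted above: fields are matched by name, case-sensitively, and a reader-only field without a default fails resolution. With the schemas above, "Surname" is fine thanks to its default, but the reader's "Name" never matches the writer's lowercase "name":

```python
def resolve_record(writer_fields, reader_fields, datum):
    """Sketch of Avro record schema resolution (field matching only).

    Fields are matched by name, case-sensitively. A reader field absent
    from the writer must carry a "default", otherwise resolution fails.
    """
    writer_names = {f["name"] for f in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            resolved[name] = datum[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"reader field {name!r} has no writer match and no default")
    return resolved

# Field lists reduced from the question's two schemas:
producer_fields = [{"name": "name"}, {"name": "favorite_number"}, {"name": "favorite_color"}]
consumer_fields = [
    {"name": "Name"},                      # no case-matching writer field, no default -> error
    {"name": "Surname", "default": None},  # missing in writer, default fills it in
    {"name": "favorite_number"},
    {"name": "favorite_color"},
]

datum = {"name": "Ada", "favorite_number": 7, "favorite_color": "blue"}
try:
    resolve_record(producer_fields, consumer_fields, datum)
except ValueError as e:
    print(e)  # reader field 'Name' has no writer match and no default
```

So yes, the consumer schema may differ from the producer's, but every reader-only field needs a default, and field names (and record names, or aliases) must line up case-sensitively.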

Related

Cannot invoke "Object.getClass()" because "datum" is null of string in field 'schemaVersionId' - Confluent Schema Registry

I have an Avro schema for which I am generating the Java bean using the avro-maven-plugin. I then instantiate it and send it to Kafka, also using Confluent's Schema Registry. I can consume and deserialise the Avro into a Spark DataFrame just fine. The problem I am facing is that it forces me to set the schemaVersionId at the producer level. If I don't set it, the KafkaAvroSerializer throws the error in the title. Any ideas, please?
val contractEvent: ContractEvent = new ContractEvent()
contractEvent.setSchemaVersionId("1") // This should be appended as the schema is auto-registered.
contractEvent.setIngestedAt("123")
contractEvent.setChangeType(changeTypeEnum.U)
contractEvent.setServiceName("Contract")
contractEvent.setPayload(avroContract)
{
"type": "record",
"namespace": "xxxxxxx",
"name": "ContractEvent",
"fields": [
{"name": "ingestedAt", "type": "string"},
{"name": "eventType",
"type": {
"name": "eventTypeEnum",
"type": "enum", "symbols" : ["U", "D", "B"]
}
},
{"name": "serviceName", "type": "string"},
{"name": "payload",
"type": {
"type": "record",
"name": "Contract",
"fields": [
{"name": "identifier", "type": "string"},
{"name": "createdBy", "type": "string"},
{"name": "createdDate", "type": "string"}
]
}
}
]
}
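The posted schema doesn't show a schemaVersionId field, but assuming the real one declares it as a required string, one workaround (a sketch, not necessarily the intended design) is to declare it nullable with a default, so the generated bean no longer forces a value at produce time and the serializer never sees a null required field:

{"name": "schemaVersionId", "type": ["null", "string"], "default": null}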

How can I save Kafka message Key in document for MongoDB Sink?

Right now I have a MongoDB Sink and it saves the value of incoming AVRO messages correctly.
I need it to save the Kafka Message Key in the document.
I have tried org.apache.kafka.connect.transforms.HoistField$Key in order to add the key to the value that is being saved, but this did nothing. It did work when using ProvidedInKeyStrategy, but I don't want my _id to be the Kafka message Key.
My configuration:
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"connection.uri": "mongodb://mongo1",
"database": "mongodb",
"collection": "sink",
"topics": "topics.foo",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "hoistKey",
"transforms.hoistKey.type":"org.apache.kafka.connect.transforms.HoistField$Key",
"transforms.hoistKey.field":"kafkaKey"
}
Kafka message schema:
{
"type": "record",
"name": "Smoketest",
"namespace": "some_namespace",
"fields": [
{
"name": "timestamp",
"type": "int",
"logicalType": "timestamp-millis"
}
]
}
Kafka key schema:
[
{
"type": "enum",
"name": "EnvironmentType",
"namespace": "some_namespace",
"doc": "DEV",
"symbols": [
"Dev",
"Test",
"Accept",
"Sandbox",
"Prod"
]
},
{
"type": "record",
"name": "Key",
"namespace": "some_namespace",
"doc": "The standard Key type that is used as key",
"fields": [
{
"name": "conversation_id",
"doc": "The first system producing an event sets this field",
"type": "string"
},
{
"name": "broker_key",
"doc": "The key of the broker",
"type": "string"
},
{
"name": "user_id",
"doc": "User identification",
"type": [
"null",
"string"
]
},
{
"name": "application",
"doc": "The name of the application",
"type": [
"null",
"string"
]
},
{
"name": "environment",
"doc": "The type of environment",
"type": "type.EnvironmentType"
}
]
}
]
Using https://github.com/f0xdx/kafka-connect-wrap-smt I can now wrap all the data from the Kafka message into a single document to save in my MongoDB sink.

Kafka & Avro producer - Schema being registered is incompatible with an earlier schema for subject

I'm running the schema-registry-confluent example locally, and I got an error when I modified the schema of the message:
This is my schema:
{
"type": "record",
"namespace": "io.confluent.tutorial.pojo.avro",
"name": "OrderDetail",
"fields": [
{
"name": "number",
"type": "long",
"doc": "The order number."
},
{
"name": "date",
"type": "long",
"logicalType": "date",
"doc": "The date the order was submitted."
},
{
"name": "client",
"type": {
"type": "record",
"name": "Client",
"fields": [
{ "name": "code", "type": "string" }
]
}
}
]
}
And I tried to send this message on the producer:
{"number": 2343434, "date": 1596490462, "client": {"code": "1234"}}
But I got this error:
org.apache.kafka.common.errors.InvalidConfigurationException: Schema being registered is incompatible with an earlier schema for subject "example-topic-avro-value"; error code: 409
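By default the Schema Registry enforces BACKWARD compatibility: the new schema must be able to read data written with the earlier registered one, so any field added without a default (such as a new required record field like client, if it wasn't in the previously registered schema) triggers this 409. A stdlib sketch of just that one rule (illustrative, not the registry's full compatibility checker):

```python
def fields_breaking_backward_compat(old_schema, new_schema):
    # BACKWARD compatibility: the new (reader) schema must read old data,
    # so every field it adds needs a "default" the reader can fall back on.
    old_names = {f["name"] for f in old_schema["fields"]}
    return [f["name"] for f in new_schema["fields"]
            if f["name"] not in old_names and "default" not in f]

old = {"fields": [{"name": "number"}, {"name": "date"}]}
new = {"fields": [{"name": "number"}, {"name": "date"}, {"name": "client"}]}
print(fields_breaking_backward_compat(old, new))  # ['client']
```

Giving the added field a default (or relaxing the subject's compatibility setting) would let the new schema register.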

Confluent Kafka producer message format for nested records

I have an Avro schema registered in a Kafka topic and am trying to send data to it. The schema has nested records, and I'm not sure how to correctly send data to it using the confluent_kafka Python client.
Example schema (ignore any typos in the schema; the real one is very large, this is just an example):
{
"namespace": "company__name",
"name": "our_data",
"type": "record",
"fields": [
{
"name": "datatype1",
"type": ["null", {
"type": "record",
"name": "datatype1_1",
"fields": [
{"name": "site", "type": "string"},
{"name": "units", "type": "string"}
]
}],
"default": null
},
{
"name": "datatype2",
"type": ["null", {
"type": "record",
"name": "datatype2_1",
"fields": [
{"name": "site", "type": "string"},
{"name": "units", "type": "string"}
]
}],
"default": null
}
]
}
I am trying to send data to this schema using the confluent_kafka Python package. When I have done this before, the records were not nested, and I would use a typical dictionary of key/value pairs and serialize it. How can I send nested data that works with this schema?
What I tried so far...
message = {'datatype1':
{'site': 'sitename',
'units': 'm'
}
}
This version does not cause any Kafka errors, but all of the columns show up as null.
and...
message = {'datatype1':
{'datatype1_1':
{'site': 'sitename',
'units': 'm'
}
}
}
This version produced a Kafka error about the schema.
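For what it's worth, the confluent_kafka AvroSerializer is fastavro-based, and for a union like ["null", <record>] it expects the plain nested dict (the first shape above), with no extra wrapper keyed by the record's name; if everything then lands as null downstream, the outer field names usually don't match the schema. A rough stdlib sketch (illustrative only, not the serializer's actual code) of how such a union branch is chosen for a Python value:

```python
def pick_union_branch(union_schema, value):
    # Rough sketch: None matches the "null" branch; a plain dict matches
    # a record branch directly -- there is no {"recordName": {...}} wrapper.
    for branch in union_schema:
        if branch == "null" and value is None:
            return "null"
        if isinstance(branch, dict) and branch.get("type") == "record" and isinstance(value, dict):
            return branch["name"]
    raise ValueError("value matches no branch of the union")

union = ["null", {"type": "record", "name": "datatype1_1", "fields": []}]
print(pick_union_branch(union, {"site": "sitename", "units": "m"}))  # datatype1_1
print(pick_union_branch(union, None))  # null
```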
If you use namespaces, you don't have to worry about naming collisions and you can properly structure your optional records:
for example, both
{
"meta": {
"instanceID": "something"
}
}
And
{}
are valid instances of:
{
"doc": "Survey",
"name": "Survey",
"type": "record",
"fields": [
{
"name": "meta",
"type": [
"null",
{
"name": "meta",
"type": "record",
"fields": [
{
"name": "instanceID",
"type": [
"null",
"string"
],
"namespace": "Survey.meta"
}
],
"namespace": "Survey"
}
],
"namespace": "Survey"
}
]
}

bq load and I get "Too many positional args - still have [..."

I am attempting to load data from Cloud Storage into a table and am getting the error message below.
bq load --skip_leading_rows=1 --field_delimiter='\t' --source_format=CSV projectID:dataset.table gs://bucket/source.txt sku:STRING,variant_id:STRING,title:STRING,category:STRING,description:STRING,buy_url:STRING,mobile_url:STRING,itemset_url:STRING,image_url:STRING,swatch_url:STRING,availability:STRING,issellableonline:STRING,iswebexclusive:STRING,price:STRING,saleprice:STRING,quantity:STRING,coresku_inet:STRING,condition:STRING,productreviewsavg:STRING,productreviewscount:STRING,mediaset:STRING,webindexpty:INTEGER,NormalSalesIndex1:FLOAT,NormalSalesIndex2:FLOAT,NormalSalesIndex3:FLOAT,SalesScore:FLOAT,NormalInventoryIndex1:FLOAT,NormalInventoryIndex2:FLOAT,NormalInventoryIndex3:FLOAT,InventoryScore:FLOAT,finalscore:FLOAT,EDVP:STRING,dropship:STRING,brand:STRING,model_number:STRING,gtin:STRING,color:STRING,size:STRING,gender:STRING,age:STRING,oversized:STRING,ishazardous:STRING,proddept:STRING,prodsubdept:STRING,prodclass:STRING,prodsubclass:STRING,sku_attr_names:STRING,sku_attr_values:STRING,store_id:STRING,store_quantity:STRING,promo_name:STRING,product_badge:STRING,cbl_type_names1:STRING,cbl_type_value1:STRING,cbl_type_names2:STRING,cbl_type_value2:STRING,cbl_type_names3:STRING,cbl_type_value3:STRING,cbl_type_names4:STRING,cbl_type_value4:STRING,cbl_type_names5:STRING,cbl_type_value5:STRING,choice1_name_value:STRING,choice2_name_value:STRING,choice3_name_value:STRING,cbl_is_free_shipping:STRING,isnewflag:STRING,shipping_weight:STRING,masterpath:STRING,accessoriesFlag:STRING,short_copy:STRING,bullet_copy:STRING,map:STRING,display_msrp:STRING,display_price:STRING,suppress_sales_display:STRING,margin:FLOAT
I have also tried to load the schema into a json file and I get the same error message.
As this was too big for a comment, I will post it here.
I wonder what happens if you set the schema file to have this content:
[{"name": "sku", "type": "STRING"},
{"name": "variant_id", "type": "STRING"},
{"name": "title", "type": "STRING"},
{"name": "category", "type": "STRING"},
{"name": "description", "type": "STRING"},
{"name": "buy_url", "type": "STRING"},
{"name": "mobile_url", "type": "STRING"},
{"name": "itemset_url", "type": "STRING"},
{"name": "image_url", "type": "STRING"},
{"name": "swatch_url", "type": "STRING"},
{"name": "availability", "type": "STRING"},
{"name": "issellableonline", "type": "STRING"},
{"name": "iswebexclusive", "type": "STRING"},
{"name": "price", "type": "STRING"},
{"name": "saleprice", "type": "STRING"},
{"name": "quantity", "type": "STRING"},
{"name": "coresku_inet", "type": "STRING"},
{"name": "condition", "type": "STRING"},
{"name": "productreviewsavg", "type": "STRING"},
{"name": "productreviewscount", "type": "STRING"},
{"name": "mediaset", "type": "STRING"},
{"name": "webindexpty", "type": "INTEGER"},
{"name": "NormalSalesIndex1", "type": "FLOAT"},
{"name": "NormalSalesIndex2", "type": "FLOAT"},
{"name": "NormalSalesIndex3", "type": "FLOAT"},
{"name": "SalesScore", "type": "FLOAT"},
{"name": "NormalInventoryIndex1", "type": "FLOAT"},
{"name": "NormalInventoryIndex2", "type": "FLOAT"},
{"name": "NormalInventoryIndex3", "type": "FLOAT"},
{"name": "InventoryScore", "type": "FLOAT"},
{"name": "finalscore", "type": "FLOAT"},
{"name": "EDVP", "type": "STRING"},
{"name": "dropship", "type": "STRING"},
{"name": "brand", "type": "STRING"},
{"name": "model_number", "type": "STRING"},
{"name": "gtin", "type": "STRING"},
{"name": "color", "type": "STRING"},
{"name": "size", "type": "STRING"},
{"name": "gender", "type": "STRING"},
{"name": "age", "type": "STRING"},
{"name": "oversized", "type": "STRING"},
{"name": "ishazardous", "type": "STRING"},
{"name": "proddept", "type": "STRING"},
{"name": "prodsubdept", "type": "STRING"},
{"name": "prodclass", "type": "STRING"},
{"name": "prodsubclass", "type": "STRING"},
{"name": "sku_attr_names", "type": "STRING"},
{"name": "sku_attr_values", "type": "STRING"},
{"name": "store_id", "type": "STRING"},
{"name": "store_quantity", "type": "STRING"},
{"name": "promo_name", "type": "STRING"},
{"name": "product_badge", "type": "STRING"},
{"name": "cbl_type_names1", "type": "STRING"},
{"name": "cbl_type_value1", "type": "STRING"},
{"name": "cbl_type_names2", "type": "STRING"},
{"name": "cbl_type_value2", "type": "STRING"},
{"name": "cbl_type_names3", "type": "STRING"},
{"name": "cbl_type_value3", "type": "STRING"},
{"name": "cbl_type_names4", "type": "STRING"},
{"name": "cbl_type_value4", "type": "STRING"},
{"name": "cbl_type_names5", "type": "STRING"},
{"name": "cbl_type_value5", "type": "STRING"},
{"name": "choice1_name_value", "type": "STRING"},
{"name": "choice2_name_value", "type": "STRING"},
{"name": "choice3_name_value", "type": "STRING"},
{"name": "cbl_is_free_shipping", "type": "STRING"},
{"name": "isnewflag", "type": "STRING"},
{"name": "shipping_weight", "type": "STRING"},
{"name": "masterpath", "type": "STRING"},
{"name": "accessoriesFlag", "type": "STRING"},
{"name": "short_copy", "type": "STRING"},
{"name": "bullet_copy", "type": "STRING"},
{"name": "map", "type": "STRING"},
{"name": "display_msrp", "type": "STRING"},
{"name": "display_price", "type": "STRING"},
{"name": "suppress_sales_display", "type": "STRING"},
{"name": "margin", "type": "FLOAT"}]
If you save it, say, in a file "schema.json" and run the command:
bq load --skip_leading_rows=1 --field_delimiter='\t' --source_format=CSV projectID:dataset.table gs://bucket/source.txt schema.json
Do you still get the same error?
Cherba nailed it. The typo was in my batch file, which included an extra load parameter. Thanks for all your time and consideration.