I am using AWS MSK and MSK Connect. The S3 sink connector stopped working properly when I added io.confluent.connect.storage.partitioner.FieldPartitioner. Note: without the FieldPartitioner, the S3 sink worked. Other than this Stack Overflow Question Link, I was not able to find any resource.
Error
ERROR [FieldPart-sink|task-0] Value is not Struct type. (io.confluent.connect.storage.partitioner.FieldPartitioner:81)
Caused by: io.confluent.connect.storage.errors.PartitionException: Error encoding partition.
ERROR [Sink-FieldPartition|task-0] WorkerSinkTask{id=Sink-FieldPartition-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Error encoding partition. (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
MSK Connect Config
connector.class=io.confluent.connect.s3.S3SinkConnector
format.class=io.confluent.connect.s3.format.avro.AvroFormat
flush.size=1
schema.compatibility=BACKWARD
tasks.max=2
topics=MSKTutorialTopic
storage.class=io.confluent.connect.s3.storage.S3Storage
topics.dir=mskTrials
s3.bucket.name=clickstream
s3.region=us-east-1
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=name
value.converter.schemaAutoRegistrationEnabled=true
value.converter.registry.name=datalake-schema-registry
value.convertor.schemaName=MSKTutorialTopic-value
value.converter.avroRecordType=GENERIC_RECORD
value.converter.region=us-east-1
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
Data schema stored in the Glue Schema Registry
{
"namespace": "example.avro",
"type": "record",
"name": "UserData",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favourite_color",
"type": [
"string",
"null"
]
}
]
}
In order to partition by fields, your data needs actual fields. StringConverter cannot parse the data it consumes into such fields, so use an AvroConverter if you have Avro data in the topic. Also, Avro always has a schema, so remove the schemas.enable configuration.
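The question's config already contains AWS Glue Schema Registry converter properties, so here is a sketch of what the converter section could look like with the Glue Avro converter (the class name is taken from the aws-glue-schema-registry Kafka Connect integration; registry, region, and record-type values simply reuse the ones above):
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter.region=us-east-1
value.converter.registry.name=datalake-schema-registry
value.converter.avroRecordType=GENERIC_RECORD
value.converter.schemaAutoRegistrationEnabled=true
key.converter=org.apache.kafka.connect.storage.StringConverter
With the value deserialized into a Connect Struct, the FieldPartitioner can then resolve partition.field.name=name.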
Avro: 10.1
Dataflow (Apache Beam): 2.28.0
Runner: org.apache.beam.runners.dataflow.DataflowRunner
Avro schema piece:
{
"name": "client_timestamp",
"type": [
"null",
{ "type": "long", "logicalType": "local-timestamp-millis" }
],
"default": null,
"doc": "Client side timestamp of this xxx"
},
An exception when writing Avro output file:
Caused by: org.apache.avro.UnresolvedUnionException:
Not in union ["null",{"type":"long","logicalType":"local-timestamp-millis"}]:
2021-03-12T12:21:17.599
Link to a longer stacktrace
Some of the steps taken:
Replacing "logicalType":"local-timestamp-millis" with "logicalType":"timestamp-millis" causes the same error.
Writing Avro locally also works.
Removing the "null" option from the union eliminates the exception.
Try something like the following:
{
"name": "client_timestamp",
"type": ["null", "long"],
"doc": "Client side timestamp of this xxx",
"default": null,
"logicalType": "timestamp-millis"
}
This is one way to define a logicalType in an Avro schema (link).
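For illustration, a minimal Scala sketch of building a record against the ["null", "long"] field suggested above, storing the value as epoch milliseconds so it resolves to the long branch of the union; a LocalDateTime-style value like the one in the error output is what appears to trigger the UnresolvedUnionException. The schema file name and field population are hypothetical:
import java.time.Instant
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder

// Parse the schema that declares client_timestamp as ["null", "long"]
// (hypothetical file name for this sketch).
val schema = new Schema.Parser().parse(new java.io.File("event.avsc"))

val record = new GenericRecordBuilder(schema)
  // Write the timestamp as a plain epoch-millis long; the example value is
  // treated as UTC here purely for the sketch.
  .set("client_timestamp", Instant.parse("2021-03-12T12:21:17.599Z").toEpochMilli)
  .build()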
I'm trying to serialise an Avro schema and then write it to the Hortonworks schema registry, but I'm getting the following error message during the write operation.
Caused by: java.lang.RuntimeException: An exception was thrown while processing request with message: [Invalid default for field viewingMode: null not a [{"type":"record","name":"aName","namespace":"domain.assembled","fields":[{"name":"aKey","type":"string"}]},{"type":"record","name":"anotherName","namespace":"domain.assembled","fields":[{"name":"anotherKey","type":"string"},{"name":"yetAnotherKey","type":"string"}]}]]
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.handleSchemaIdVersionResponse(SchemaRegistryClient.java:678)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.doAddSchemaVersion(SchemaRegistryClient.java:664)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.lambda$addSchemaVersion$1(SchemaRegistryClient.java:591)
This is the avro schema
{
"type": "record",
"name": "aSchema",
"namespace": "domain.assembled",
"fields": [
{
"name": "viewingMode",
"type": [
{
"name": "aName",
"type": "record",
"fields": [
{"name": "aKey","type": "string"}
]
},
{
"name": "anotherName",
"type": "record",
"fields": [
{"name": "anotherKey","type": "string"},
{"name": "yetAnotherKey","type": "string"}
]
}
]
}
]
}
However, if I add "null" as the first type of the union, this succeeds. Do Avro union types require a "null"? In my case this would be an incorrect representation of the data, so I'm not keen on doing it.
If it makes any difference I'm using avro 1.9.1.
Also, apologies if the tags are incorrect, but I couldn't find a hortonworks-schema-registry tag and don't have enough rep to create a new one.
Turns out it was an issue with Hortonworks' schema registry.
This has actually already been fixed here and I've requested a new release here. Hopefully this happens soon.
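For what it's worth, plain Avro is happy with a union that has no "null" branch; a small Scala sketch (against Avro 1.9.1, the version mentioned above) that parses the schema from the question without error, which also points at the registry rather than the schema itself:
import org.apache.avro.Schema

// A union of two record types with no "null" branch parses fine with the
// plain Avro library, so the rejection comes from the registry's validation.
val schemaJson =
  """{"type":"record","name":"aSchema","namespace":"domain.assembled","fields":[
    |  {"name":"viewingMode","type":[
    |    {"type":"record","name":"aName","fields":[{"name":"aKey","type":"string"}]},
    |    {"type":"record","name":"anotherName","fields":[
    |      {"name":"anotherKey","type":"string"},
    |      {"name":"yetAnotherKey","type":"string"}]}
    |  ]}
    |]}""".stripMargin

val parsed = new Schema.Parser().parse(schemaJson)
println(parsed.getField("viewingMode").schema().getTypes.size()) // 2 branches, no null required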
I tried to partition a Spark DataFrame by the timestamp column update_database_time and write it to HDFS with a defined Avro schema. However, after calling the partitionBy method I get this exception:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.
I assume that the partitioning column disappears from the result. How can I redefine the operation so that this does not happen?
Here is the code I use:
dataDF.write
.partitionBy("update_database_time")
.format("avro")
.option(
"avroSchema",
SchemaRegistry.getSchema(
schemaRegistryConfig.url,
schemaRegistryConfig.dataSchemaSubject,
schemaRegistryConfig.dataSchemaVersion))
.save(s"${hdfsURL}${pathToSave}")
Judging by the exception you have provided, the error seems to stem from an incompatibility between the fetched AVRO schema and Spark's schema. Taking a quick look, the most worrisome parts are probably these:
1. (Possibly Catalyst doesn't know how to transform a string into enumType)
Spark schema:
StructField(random_enum,StringType,true)
AVRO schema:
{
"name": "random_enum",
"type": [
"null",
{
"type": "enum",
"name": "enumType",
"symbols": [
"VAL_1",
"VAL_2",
"VAL_3"
]
}
]
}
2. (update_database_time_tz appears only once in the DataFrame's schema, but twice in the AVRO schema)
Spark schema:
StructField(update_database_time_tz,TimestampType,true)
AVRO schema:
{
"name": "update_database_time",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
},
{
"name": "update_database_time_tz",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
}
I'd suggest to consolidate schemas first and get rid of that exception before going inside other possible partitioning problems.
EDIT: Regarding number 2, I missed the fact that there are different names in the AVRO schema, which leads to the problem of a missing update_database_time column in the DataFrame.
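One possible workaround (a sketch, not from the original answer): partitionBy moves the chosen column into the directory layout and drops it from the written records, so partitioning on a duplicated helper column keeps update_database_time in the data and in sync with the provided AVRO schema. The helper column name below is illustrative:
import org.apache.spark.sql.functions.col

dataDF
  // Duplicate the timestamp into a helper column used only for partitioning;
  // Spark drops the partition column from the written files, so the original
  // update_database_time column stays in the records and matches the schema.
  .withColumn("update_database_time_part", col("update_database_time"))
  .write
  .partitionBy("update_database_time_part")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(s"${hdfsURL}${pathToSave}")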
I generated a sample stream with the ksql-datagen utility from the following schema -
{
"type": "record",
"name": "users",
"namespace": "com.example",
"fields": [
{
"name": "registertime",
"type": {
"type":"long",
"arg.properties":{
"range":{"min":1487715775521,"max":1519273364600}
}
}
},
{
"name": "userid",
"type": {
"type":"string",
"arg.properties":{"regex":"User_[1-9][0-2]"}
}
},
{
"name": "regionid",
"type": {
"type":"string",
"arg.properties":{"regex":"Region_[1-9]"}
}
},
{
"name": "gender",
"type": {
"type":"string",
"arg.properties":{
"options":["MALE","FEMALE","OTHER"]
}
}
}
]}
While checking the registered versions, it still picks the "io.confluent.ksql.avro_schemas" namespace -
curl "http://localhost:8081/subjects/test-user-value/versions/1"
{"subject":"test-user-value","version":1,"id":4,"schema":"{"type":"record","name":"KsqlDataSourceSchema","namespace":"io.confluent.ksql.avro_schemas","fields":[{"name":"registertime","type":["null","long"],"default":null},{"name":"userid","type":["null","string"],"default":null},{"name":"regionid","type":["null","string"],"default":null},{"name":"gender","type":["null","string"],"default":null}]}"}
Got the following error while trying to consume with the Kafka Streams API -
Exception in thread "PageView-Users-Stream-Join-eg-1dc610a3-c9d9-4c1e-b5eb-910e4bc74826-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Deserialization exception handler is set to fail upon a deserialization error. If you would rather have the streaming pipeline continue after a deserialization error, please set the default.deserialization.exception.handler appropriately.
    at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:80)
    at org.apache.kafka.streams.processor.internals.RecordQueue.maybeUpdateTimestamp(RecordQueue.java:160)
    at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:101)
    at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:124)
    at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:711)
    at org.apache.kafka.streams.processor.internals.StreamThread.addRecordsToTasks(StreamThread.java:995)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:833)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 4
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class io.confluent.ksql.avro_schemas.KsqlDataSourceSchema specified in writer's schema whilst finding reader's schema for a SpecificRecord.
Answered at https://github.com/confluentinc/schema-registry/issues/980
Datagen always defines the namespace as io.confluent.ksql.avro_schemas. See confluentinc/ksql#1906
There are now other ways to generate test data into Kafka, too.
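If regenerating the data isn't an option, one workaround (a sketch, assuming Confluent's GenericAvroSerde and reusing the topic and registry URL from the question) is to consume the values as GenericRecord, so the io.confluent.ksql.avro_schemas namespace in the writer's schema never has to resolve to a generated class:
import java.util.Collections
import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Consumed

// Configure the value serde against the schema registry (isKey = false).
val valueSerde = new GenericAvroSerde()
valueSerde.configure(Collections.singletonMap("schema.registry.url", "http://localhost:8081"), false)

val builder = new StreamsBuilder()
// Read the datagen topic as GenericRecord instead of a SpecificRecord, so no
// class for KsqlDataSourceSchema is required on the consumer side.
val users = builder.stream[String, GenericRecord]("test-user", Consumed.`with`(Serdes.String(), valueSerde))
users.foreach((key, value) => println(s"$key -> ${value.get("userid")}"))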