Partition column disappears in result set dataframe Spark - scala

I tried to split Spark data frame by the timestamp column update_database_time and write it into HDFS with defined Avro schema. However, after calling the repartition method I get this exception:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.
I assume that the column for partitioning disappears in the result. How can I redefine the operation so it would not happen?
Here is the code I use:
dataDF.write
.partitionBy("update_database_time")
.format("avro")
.option(
"avroSchema",
SchemaRegistry.getSchema(
schemaRegistryConfig.url,
schemaRegistryConfig.dataSchemaSubject,
schemaRegistryConfig.dataSchemaVersion))
.save(s"${hdfsURL}${pathToSave}")

By the exception you have provided, the error seems to stem from incompatible schemas between the fetched AVRO schema and Spark's schema. Taking a quick look, the most worrisome parts are probably these ones:
(Possibly catalyst doesn't know how to transform string into enumType)
Spark schema:
StructField(random_enum,StringType,true)
AVRO schema:
{
"name": "random_enum",
"type": [
"null",
{
"type": "enum",
"name": "enumType",
"symbols": [
"VAL_1",
"VAL_2",
"VAL_3"
]
}
]
}
(update_databse_time_tz appears only once in the dataframe's schema, but twice in the AVRO schema)
Spark schema:
StructField(update_database_time_tz,TimestampType,true)
AVRO schema:
{
"name": "update_database_time",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
},
{
"name": "update_database_time_tz",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
}
I'd suggest to consolidate schemas first and get rid of that exception before going inside other possible partitioning problems.
EDIT: In regards to number 2 I've missed the face that there are different names in the AVRO schema, which leads to a problem of missing a column update_database_time in the Dataframe.

Related

MSK S3 sink not working with Field Partitioner

I am using AWS MSK and msk connect. S3 sink connector is not working properly when I added io.confluent.connect.storage.partitioner.FieldPartitioner Note:without fieldPartitioner s3sink had worked. Other than this stack overflow Question Link I was not able to find any resource
Error
ERROR [FieldPart-sink|task-0] Value is not Struct type. (io.confluent.connect.storage.partitioner.FieldPartitioner:81)
Caused by: io.confluent.connect.storage.errors.PartitionException: Error encoding partition.
ERROR [Sink-FieldPartition|task-0] WorkerSinkTask{id=Sink-FieldPartition-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Error encoding partition. (org.apache.kafka.connect.runtime.WorkerSinkTask:612)
MSK Connect Config
connector.class=io.confluent.connect.s3.S3SinkConnector
format.class=io.confluent.connect.s3.format.avro.AvroFormat
flush.size=1
schema.compatibility=BACKWARD
tasks.max=2
topics=MSKTutorialTopic
storage.class=io.confluent.connect.s3.storage.S3Storage
topics.dir=mskTrials
s3.bucket.name=clickstream
s3.region=us-east-1
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=name
value.converter.schemaAutoRegistrationEnabled=true
value.converter.registry.name=datalake-schema-registry
value.convertor.schemaName=MSKTutorialTopic-value
value.converter.avroRecordType=GENERIC_RECORD
value.converter.region=us-east-1
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
Data Schema which is stored in glue schema registry
{
"namespace": "example.avro",
"type": "record",
"name": "UserData",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favourite_color",
"type": [
"string",
"null"
]
}
]
}
In order to partition by fields, your data needs actual fields.
StringConverter cannot parse data it consumes to add said fields. Use AvroConverter if you have Avro data in the topic. Also, Avro always has a schema, so remove the schemas.enable configuration

Avro union field with "local-timestamp-millis" deserialization issue

Avro: 10.1
Dataflow (Apache Beam): 2.28.0
Runner: org.apache.beam.runners.dataflow.DataflowRunner
Avro schema piece:
{
"name": "client_timestamp",
"type": [
"null",
{ "type": "long", "logicalType": "local-timestamp-millis" }
],
"default": null,
"doc": "Client side timestamp of this xxx"
},
An exception when writing Avro output file:
Caused by: org.apache.avro.UnresolvedUnionException:
Not in union ["null",{"type":"long","logicalType":"local-timestamp-millis"}]:
2021-03-12T12:21:17.599
Link to a longer stacktrace
Some of the steps taken:
Replacing "logicalType":"local-timestamp-millis" with "logicalType":"timestamp-millis" causes the same error.
Writing Avro locally also works.
Removing "type": "null" option eliminates the exception
Try something like following:
{
"name": "client_timestamp",
"type": ["null", "long"],
"doc": "Client side timestamp of this xxx",
"default": null,
"logicalType": "timestamp-millis"
}
It is a way to define logicalType in avro schema (link).

Avro invalid default for union field

I'm trying to serialise and then write to the hortonworks schema registry an avro schema but I'm getting the following error message during the write operation.
Caused by: java.lang.RuntimeException: An exception was thrown while processing request with message: [Invalid default for field viewingMode: null not a [{"type":"record","name":"aName","namespace":"domain.assembled","fields":[{"name":"aKey","type":"string"}]},{"type":"record","name":"anotherName","namespace":"domain.assembled","fields":[{"name":"anotherKey","type":"string"},{"name":"yetAnotherKey","type":"string"}]}]]
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.handleSchemaIdVersionResponse(SchemaRegistryClient.java:678)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.doAddSchemaVersion(SchemaRegistryClient.java:664)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.lambda$addSchemaVersion$1(SchemaRegistryClient.java:591)
This is the avro schema
{
"type": "record",
"name": "aSchema",
"namespace": "domain.assembled",
"fields": [
{
"name": "viewingMode",
"type": [
{
"name": "aName",
"type": "record",
"fields": [
{"name": "aKey","type": "string"}
]
},
{
"name": "anotherName",
"type": "record",
"fields": [
{"name": "anotherKey","type": "string"},
{"name": "yetAnotherKey","type": "string"}
]
}
]
}
]
}
Whoever if I add a "null" as the first type of the union this the succeeds. Do avro union types require a "null"? In my case this would be an incorrect representation of data so I'm not keen on doing it.
If it makes any difference I'm using avro 1.9.1.
Also, apologies if the tags are incorrect but couldn't find a hortonworks-schema-registry tag and don't have enough rep to create a new one.
Turns out if was an issue with hortonwork's schema registry.
This has actually already been fixed here and I've requested a new release here. Hopefully this happens soon.

Debezium might produce invalid schema

I face an issue with Avro and Schema Registry. After Debezium created a schema and a topic, I have downloaded the schema from Schema Registry. I put it into a .asvc file and it looks like this:
{
"type": "record",
"name": "Envelope",
"namespace": "my.namespace",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "MyValue",
"fields": [
{
"name": "id",
"type": "int"
}
]
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"MyValue"
],
"default": null
}
]
}
I ran two experiments:
I tried to put it back into Schema Registry but I get this error: MyValue is not correct. When I remove "after" record, the schema seems to work well.
I used 'generate-sources' from avro-maven-plugin to generate the Java classes. When I try to consume the topic above, I see this error:
Exception in thread "b2-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. [...]: Error registering Avro schema: [...]
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with an earlier schema; error code: 409
Did anyone faced the same problem? Is it Debezium that is producing an invalid schema or is Schema Registry that is having a bug?
MyValue is not correct.
That's not an Avro type. You would have to embed the actual record within the union, just like the before value
In other words, you're not able to cross reference the record types within a schema, AFAIK
When I try to consume the topic above, I see this error:
A consumer does not register schemas, so it's not clear how you're getting that error unless maybe using Kafka Streams, which produces into intermediate topics

Avro genericdata.Record ignores data types

I have the following avro schema
{ "namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
I use the following snippet to set up a Record
val schema = new Schema.Parser().parse(new File("data/user.avsc"))
val user1 = new GenericData.Record(schema) //strangely this schema only checks for valid fields NOT types.
user1.put("name", "Fred")
user1.put("favorite_number", "Jones")
I would have thought that this would fail to validate against the schema
When I add the line
user1.put("last_name", 100)
It generates a run time error, which is what I would expect in the first case as well.
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: last_name
at org.apache.avro.generic.GenericData$Record.put(GenericData.java:125)
at csv2avro$.main(csv2avro.scala:40)
at csv2avro.main(csv2avro.scala)
What's going on here?
It won't fail when adding it into the Record, it will fail when it tries to serialize because it is at that point when it is trying to match the type. As far as I'm aware that is the only place it does type checking.