I'm trying to create an Apache Beam pipeline that reads from a Kafka topic and loads the data into BigQuery. Using Confluent's Schema Registry, I should be able to infer the schema when loading into BigQuery. However, the schema is not being inferred during the load, and the pipeline fails.
Below is the entire pipeline code.
pipeline
    .apply("Read from Kafka",
        KafkaIO
            .<byte[], GenericRecord>read()
            .withBootstrapServers("broker-url:9092")
            .withTopic("beam-in")
            .withConsumerConfigUpdates(consumerConfig)
            .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(schemaRegUrl, subj))
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .commitOffsetsInFinalize()
            .withoutMetadata()
    )
    .apply("Drop Kafka message key", Values.create())
    .apply(
        "Write data to BQ",
        BigQueryIO
            .<GenericRecord>write()
            .optimizedWrites()
            .useBeamSchema()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            .withCustomGcsTempLocation("gs://beam-tmp-load")
            .withNumFileShards(10)
            .withMethod(FILE_LOADS)
            .withTriggeringFrequency(Utils.parseDuration("10s"))
            .to(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("loaded-data")
                .setTableId("beam-load-test"))
    );
return pipeline.run();
return pipeline.run();
When running this I get the following error, which stems from the fact that I'm calling useBeamSchema() while hasSchema() returns false on the input PCollection:
Exception in thread "main" java.lang.IllegalArgumentException
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:127)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:2595)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:2579)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1726)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:542)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:493)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:368)
at KafkaToBigQuery.run(KafkaToBigQuery.java:159)
at KafkaToBigQuery.main(KafkaToBigQuery.java:64)
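For reference, one direction I'm considering (a sketch only, assuming my Beam version provides AvroUtils.schemaCoder and that I can fetch the topic's value schema from the registry up front; kafkaRead and bigQueryWrite stand for the two transforms shown above):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;

// avroSchema: the org.apache.avro.Schema of the topic's values, fetched from
// the schema registry before the pipeline is built (placeholder here).
PCollection<GenericRecord> records = pipeline
    .apply("Read from Kafka", kafkaRead)                 // the KafkaIO read from above
    .apply("Drop Kafka message key", Values.create());

// Attaching a schema-aware coder gives the PCollection a Beam schema,
// so useBeamSchema() should no longer fail the hasSchema() check.
records.setCoder(AvroUtils.schemaCoder(avroSchema));

records.apply("Write data to BQ", bigQueryWrite);        // the BigQueryIO write from above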
Related
I'm using Flink streaming to read events from a Kafka source topic and, after de-duplication, writing them to a separate Kafka topic in Avro format.
Flow
Kafka Topic (JSON format) -> Flink streaming (de-duplication) -> Scala case class objects -> Kafka Topic (Avro format)
val sink = sinkProvider.getKafkaSink(brokerURL, targetTopic, kafkaTransactionMaxTimeoutMs, kafkaTransactionTimeoutMs)
messageStream
  .map { record =>
    convertJsonToExample(record)
  }
  .sinkTo(sink)
  .name("Example Kafka Avro Sink")
  .uid("Example-Kafka-Avro-Sink")
Here are the steps I followed:
I created an Avro schema for my output:
{
  "type": "record",
  "name": "Example",
  "namespace": "ca.ix.dcn.test",
  "fields": [
    {
      "name": "x",
      "type": "string"
    },
    {
      "name": "y",
      "type": "long"
    }
  ]
}
From the Avro schema I generated a case class using avrohugger tools (version 1.2.1) for SpecificRecord.
I used Flink's AvroSerializationSchema.forSpecific because the Flink Kafka Avro sink lets you use either the specific record or the generic record constructor for serialization to Avro.
def getKafkaSink(brokers: String, targetTopic: String, transactionMaxTimeoutMs: String, transactionTimeoutMs: String) = {
  val schema = ReflectData.get.getSchema(classOf[Example])
  val sink = KafkaSink.builder()
    .setBootstrapServers(brokers)
    .setProperty("transaction.max.timeout.ms", transactionMaxTimeoutMs)
    .setProperty("transaction.timeout.ms", transactionTimeoutMs)
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
      .setTopic(targetTopic)
      .setValueSerializationSchema(AvroSerializationSchema.forSpecific[Example](classOf[Example]))
      .setPartitioner(new FlinkFixedPartitioner())
      .build()
    )
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .build()
  sink
}
Now when I run it I get the exception:
Caused by: org.apache.avro.AvroRuntimeException: java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData can not access a member of class ca.ix.dcn.test with modifiers "private final"
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:405)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:734)
I saw there is a bug open on Flink for this:
https://issues.apache.org/jira/browse/FLINK-18478
but I didn't find any workaround for it. Is there a workaround for this? Also, are there detailed examples that explain how to use the Flink streaming sink (for Avro) with AvroSerializationSchema (specific/generic)?
I'd appreciate any help on this.
In the Flink ticket that you're linking to, there's a comment made that avro-hugger is not really compatible with the Apache Avro Java library, see https://issues.apache.org/jira/browse/FLINK-18478?focusedCommentId=17164456&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17164456
The solution would be to generate Avro Java POJOs and use them in your Scala application.
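For illustration, a minimal Java sketch of such a sink, assuming Example has been regenerated by the avro-maven-plugin (so it extends SpecificRecordBase and SpecificData can access its fields); the broker, topic, and transactional id prefix are placeholders:

import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.formats.avro.AvroSerializationSchema;

// Example is the Avro-generated Java SpecificRecord class.
KafkaSink<Example> sink = KafkaSink.<Example>builder()
    .setBootstrapServers("broker:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.<Example>builder()
        .setTopic("target-topic")
        .setValueSerializationSchema(AvroSerializationSchema.forSpecific(Example.class))
        .build())
    // EXACTLY_ONCE also requires a transactional id prefix at runtime.
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("example-avro-sink")
    .build();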
Hello everyone, I'm struggling with (de-)serializing a simple Avro schema together with Schema Registry.
The setup:
2 Flink jobs written in Java (one consumer, one producer)
1 Confluent Schema Registry for schema validation
1 Kafka cluster for messaging
The target:
The producer should send a message serialized with ConfluentRegistryAvroSerializationSchema, which includes updating and validating the schema.
The consumer should then deserialize the message into an object with the received schema, using ConfluentRegistryAvroDeserializationSchema.
So far so good:
If I configure my subject on the Schema Registry to be FORWARD-compatible, the producer writes the correct Avro schema to the registry, but it ends with this error (even if I completely and permanently delete the subject first):
Failed to send data to Kafka: Schema being registered is incompatible with an earlier schema for subject "my.awesome.MyExampleEntity-value"
The schema was successfully written:
{
  "subject": "my.awesome.MyExampleEntity-value",
  "version": 1,
  "id": 100028,
  "schema": "{\"type\":\"record\",\"name\":\"input\",\"namespace\":\"my.awesome.MyExampleEntity\",\"fields\":[{\"name\":\"serialNumber\",\"type\":\"string\"},{\"name\":\"editingDate\",\"type\":\"int\",\"logicalType\":\"date\"}]}"
}
Following this, I could try to set the compatibility to NONE.
If I do so, I can produce my data to Kafka, but:
The Schema Registry has a new version of my schema, looking like this:
{
  "subject": "my.awesome.MyExampleEntity-value",
  "version": 2,
  "id": 100031,
  "schema": "\"bytes\""
}
Now I can produce data, but the consumer is not able to deserialize this schema, emitting the following error:
Caused by: org.apache.avro.AvroTypeException: Found bytes, expecting my.awesome.MyExampleEntity
...
I'm currently not sure where exactly the problem is.
Even if I completely and permanently delete the subject (including schemas), my producer should work fine from scratch, registering a whole "new" subject with its schema.
On the other hand, if I set the compatibility to "NONE", the schema exchange should work anyway, registering a schema which can be read by the consumer.
Can anybody help me out here?
According to the latest Confluent docs, NONE means schema compatibility checks are disabled.
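If you do want to set that level explicitly per subject from code, here is a hedged sketch using Confluent's Java client (the registry URL is a placeholder, the subject is the one from the question, and 10 is just the client's schema cache capacity):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class SetCompatibility {
  public static void main(String[] args) throws Exception {
    SchemaRegistryClient client = new CachedSchemaRegistryClient("http://registry:8081", 10);
    // "NONE" disables compatibility checks for this subject only.
    client.updateCompatibility("my.awesome.MyExampleEntity-value", "NONE");
  }
}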
The whole problem with serialization was the usage of the following flags in the Kafka config:
"schema.registry.url"
"key.serializer"
"key.deserializer"
"value.serializer"
"value.deserializer"
Setting these flags in Flink, even if they are logically correct, leads to undebuggable schema validation and serialization chaos.
Omit all of these flags and it works fine.
The registry URL needs to be set in ConfluentRegistryAvro(De)serializationSchema only.
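For illustration, a hedged Java sketch of that setup (MyExampleEntity is assumed to be your Avro-generated SpecificRecord class; the registry URL is a placeholder, the subject is the one from the question):

import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroSerializationSchema;

// Producer side: serializes and registers/validates the schema against the registry.
ConfluentRegistryAvroSerializationSchema<MyExampleEntity> serializer =
    ConfluentRegistryAvroSerializationSchema.forSpecific(
        MyExampleEntity.class,
        "my.awesome.MyExampleEntity-value",   // subject
        "http://registry:8081");              // registry URL

// Consumer side: looks up the writer schema by id from the registry.
ConfluentRegistryAvroDeserializationSchema<MyExampleEntity> deserializer =
    ConfluentRegistryAvroDeserializationSchema.forSpecific(
        MyExampleEntity.class,
        "http://registry:8081");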
I am new to Kafka. I have written a simple Java program to generate a message using an Avro schema, and I have generated a specific record successfully. My schema is not yet registered with my local environment; it is currently registered with some other environment.
I am using the Apache Kafka producer library to publish the message to my local Kafka topic. Can I publish the message to the local topic, or does the schema need to be registered with the local schema registry as well?
Below are the producer properties -
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://schema-registry.xxxx.service.dev:443");
The error I am getting while publishing the message:
org.apache.kafka.common.errors.SerializationException: Error registering Avro schema:
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: User is denied operation Write on Subject: xxx.avro-value; error code: 40301
The issue was that the Kafka producer by default tries to register the schema for the topic. So I added the line below, and it resolved the issue:
properties.put(KafkaAvroSerializerConfig.AUTO_REGISTER_SCHEMAS, false);
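For reference, the producer properties now look roughly like this (bootstrap servers are a placeholder; the registry URL is the one from above):

import java.util.Properties;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties properties = new Properties();
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://schema-registry.xxxx.service.dev:443");
// Only look the schema up by id; never try to register it from this producer.
properties.put(KafkaAvroSerializerConfig.AUTO_REGISTER_SCHEMAS, false);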
I am using Confluent's Kafka S3 connector to copy data from Apache Kafka to AWS S3.
The problem is that my Kafka data is in Avro format but was NOT written with Confluent Schema Registry's Avro serializer, and I cannot change the Kafka producer. So I need to deserialize the existing Avro data from Kafka and then persist it in Parquet format in AWS S3. I tried using Confluent's AvroConverter as the value converter, like this:
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost/api/v1/avro
And I am getting this error:
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic dcp-all to Avro:
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:86)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:488)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
As far as I understand, io.confluent.connect.avro.AvroConverter will only work if the data was written to Kafka using Confluent Schema Registry's Avro serializer, hence this error. So my question is: do I need to implement a generic AvroConverter in this case? And if yes, how do I extend the existing source code (https://github.com/confluentinc/kafka-connect-storage-cloud)?
Any help here will be appreciated.
You don't need to extend that repo. You just need to implement a Converter (part of Apache Kafka), shade it into a JAR, then place it on your Connect worker's CLASSPATH, like BlueApron did for Protobuf; a minimal sketch follows below.
Or see if this works: https://github.com/farmdawgnation/registryless-avro-converter
NOT using Confluent Schema Registry
Then what registry are you using? Each one that I know of has configurations to interface with the Confluent one
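For illustration, a hedged sketch of such a Converter (the class name and the plain.avro.schema.path config key are made up for this example; it assumes every record on the topic is plain Avro binary written with a single known schema):

import java.io.File;
import java.io.IOException;
import java.util.Map;

import io.confluent.connect.avro.AvroData;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class PlainAvroConverter implements Converter {

  private final AvroData avroData = new AvroData(100); // Connect <-> Avro schema translator
  private Schema avroSchema;

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    try {
      // Hypothetical config key pointing at the writer schema on disk.
      String path = (String) configs.get("plain.avro.schema.path");
      avroSchema = new Schema.Parser().parse(new File(path));
    } catch (IOException e) {
      throw new DataException("Cannot read Avro schema", e);
    }
  }

  @Override
  public SchemaAndValue toConnectData(String topic, byte[] value) {
    try {
      // Plain Avro binary: no Confluent magic byte / schema id prefix.
      Decoder decoder = DecoderFactory.get().binaryDecoder(value, null);
      GenericRecord record = new GenericDatumReader<GenericRecord>(avroSchema).read(null, decoder);
      return avroData.toConnectData(avroSchema, record);
    } catch (IOException e) {
      throw new DataException("Cannot deserialize Avro value", e);
    }
  }

  @Override
  public byte[] fromConnectData(String topic, org.apache.kafka.connect.data.Schema schema, Object value) {
    // Not needed for a sink connector like S3.
    throw new UnsupportedOperationException("fromConnectData not supported");
  }
}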
Background: I used the wrong Avro schema registry while producing to a prod topic, and as a result Kafka Connect went down because of the messages with the wrong schema id. So, as a recovery plan, we wanted to copy the messages from the prod topic to a test topic and then write the good messages to HDFS. But we are facing issues with certain offsets that have the wrong schema id while reading from the prod topic. Is there a way to ignore such offsets while writing to another topic?
Exception in thread "StreamThread-1"
org.apache.kafka.streams.errors.StreamsException: Failed to deserialize value
for record. topic=xxxx, partition=9, offset=1259032
Caused by: org.apache.kafka.common.errors.SerializationException: Error
retrieving Avro schema for id 600
Caused by:
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Schema not found io.confluent.rest.exceptions.RestNotFoundException: Schema not found
io.confluent.rest.exceptions.RestNotFoundException: Schema not found
You can change the deserialization exception handler to skip over those records, as described in the docs: https://docs.confluent.io/current/streams/faq.html#handling-corrupted-records-and-deserialization-errors-poison-pill-records
I.e., you set LogAndContinueExceptionHandler in the config via the parameter default.deserialization.exception.handler.
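A hedged sketch of the relevant Streams configuration (application id and bootstrap servers are placeholders; keep the serde settings you already use):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-prod-to-test");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");      // placeholder
// Log and skip records that cannot be deserialized (e.g. unknown schema id)
// instead of failing the whole stream thread.
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
    LogAndContinueExceptionHandler.class);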