Deserialize Kafka Thrift message with PySpark Streaming to JSON

I am using a Databricks environment and reading input from Kafka.
The messages consumed from Kafka are Thrift binary, and I am having issues deserializing them to JSON.
I am writing a UDF to do this, but I am unable to figure out how to convert Thrift -> JSON.
I have tried
import thriftpy2.protocol.json as proto

def decoder(thrift_data):
    return proto.struct_to_json(thrift_data)
which throws an error PythonException: 'AttributeError: 'bytearray' object has no attribute 'thrift_spec''
and also tried
def decoder(thrift_data):
    return serialize(thrift_data, protocol_factory=TSimpleJSONProtocolFactory())
which throws an error PythonException: 'AttributeError: 'bytearray' object has no attribute 'write''.
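Both errors suggest that the raw bytearray coming off Kafka is being passed where a Thrift struct instance is expected (struct_to_json reads thrift_spec from the object, and serialize calls its write method). The usual two-step is to deserialize the bytes into the generated Thrift class first, and only then convert that object to JSON. For illustration, here is that two-step in Java with the Apache Thrift library; MyStruct is only a placeholder for whatever class your .thrift definition generates, and a PySpark UDF would do the equivalent with the generated Python class:

// Illustrative sketch only, not from the post: Thrift binary bytes -> generated struct -> JSON.
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TSimpleJSONProtocol;

public class ThriftToJson {

    public static String decode(byte[] thriftData) throws Exception {
        // Step 1: binary bytes -> Thrift struct instance (MyStruct stands in for the
        // class generated from your .thrift file).
        MyStruct struct = new MyStruct();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(struct, thriftData);

        // Step 2: Thrift struct instance -> JSON string.
        return new TSerializer(new TSimpleJSONProtocol.Factory()).toString(struct);
    }
}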

Related

Exception in Flink Streaming to Kafka Avro Sink java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData

I'm using Flink streaming to read events from a Kafka source topic and, after de-duplication, writing them to a separate Kafka topic in Avro format.
Flow:
Kafka topic (JSON format) -> Flink streaming (de-duplication) -> Scala case class objects -> Kafka topic (Avro format)
val sink = sinkProvider.getKafkaSink(brokerURL, targetTopic, kafkaTransactionMaxTimeoutMs, kafkaTransactionTimeoutMs)
messageStream
  .map { record =>
    convertJsonToExample(record)
  }
  .sinkTo(sink)
  .name("Example Kafka Avro Sink")
  .uid("Example-Kafka-Avro-Sink")
Here are the steps I followed:
I created an Avro schema for my output:
{
  "type": "record",
  "name": "Example",
  "namespace": "ca.ix.dcn.test",
  "fields": [
    {
      "name": "x",
      "type": "string"
    },
    {
      "name": "y",
      "type": "long"
    }
  ]
}
From the Avro schema I generated a case class using avrohugger (version 1.2.1) for SpecificRecord.
I used Flink's AvroSerializationSchema.forSpecific, because the Flink Kafka Avro sink lets you use either the specific-record or the generic-record constructor for serialization to Avro.
def getKafkaSink(brokers: String, targetTopic: String, transactionMaxTimeoutMs: String, transactionTimeoutMs: String) = {
  val schema = ReflectData.get.getSchema(classOf[Example])
  val sink = KafkaSink.builder()
    .setBootstrapServers(brokers)
    .setProperty("transaction.max.timeout.ms", transactionMaxTimeoutMs)
    .setProperty("transaction.timeout.ms", transactionTimeoutMs)
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
      .setTopic(targetTopic)
      .setValueSerializationSchema(AvroSerializationSchema.forSpecific[Example](classOf[Example]))
      .setPartitioner(new FlinkFixedPartitioner())
      .build()
    )
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .build()
  sink
}
Now when I run it I get the exception:
Caused by: org.apache.avro.AvroRuntimeException: java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData can not access a member of class ca.ix.dcn.test with modifiers "private final"
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:405)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:734)
I saw there is a bug open on Flink for this:
https://issues.apache.org/jira/browse/FLINK-18478
but I didn't find any workaround there.
Is there any workaround for this? Also, are there detailed examples that explain how to use the Flink streaming sink (for Avro) with AvroSerializationSchema (specific/generic)?
Appreciate the help on this.
In the Flink ticket that you're linking to, there's a comment that avrohugger is not really compatible with the Apache Avro Java library; see https://issues.apache.org/jira/browse/FLINK-18478?focusedCommentId=17164456&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17164456
The solution would be to generate Avro Java POJOs and use them in your Scala application.
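A minimal sketch of that approach in Java (it assumes the Example class has been generated into ca.ix.dcn.test with the avro-maven-plugin or avro-tools so that it implements SpecificRecord; the factory class name is illustrative):

// Sketch, not the poster's code: the same sink built around an Avro-generated Java POJO
// instead of an avrohugger case class.
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.formats.avro.AvroSerializationSchema;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkFixedPartitioner;

import ca.ix.dcn.test.Example; // generated by avro-maven-plugin / avro-tools

public final class ExampleSinkFactory {

    public static KafkaSink<Example> getKafkaSink(String brokers,
                                                  String targetTopic,
                                                  String transactionMaxTimeoutMs,
                                                  String transactionTimeoutMs) {
        return KafkaSink.<Example>builder()
                .setBootstrapServers(brokers)
                .setProperty("transaction.max.timeout.ms", transactionMaxTimeoutMs)
                .setProperty("transaction.timeout.ms", transactionTimeoutMs)
                .setRecordSerializer(KafkaRecordSerializationSchema.<Example>builder()
                        .setTopic(targetTopic)
                        // forSpecific works because the generated POJO implements SpecificRecord
                        .setValueSerializationSchema(AvroSerializationSchema.forSpecific(Example.class))
                        .setPartitioner(new FlinkFixedPartitioner<>())
                        .build())
                // Note: recent Flink versions also require setTransactionalIdPrefix(...) for EXACTLY_ONCE.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .build();
    }
}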

how to resolve java.lang.IllegalArgumentException Unsupported Avro type

private KafkaTemplate<String, KafkaMessage> kafkaTemplate;
Message<KafkaMessage> message = MessageBuilder
        .withPayload(kafkaMessage)
        .setHeader(KafkaHeaders.TOPIC, targetTopic)
        .setHeader(KafkaHeaders.MESSAGE_KEY, "someStringValue" )
        .setHeader("X-Custom-Header", headerCreator.generateHeader(source, type)).build();
ListenableFuture<SendResult<String, KafkaMessage>> listenableFuture = kafkaTemplate.send(message);
This is my code, and the exception occurs at the send method.
The exception is: java.lang.IllegalArgumentException: Unsupported Avro type. Supported types are null, Boolean, Integer, Long, Float, Double, String, byte[] and IndexedRecord
Assuming that the Kafka topic is expecting an Avro-serialized object, you can add the avro-maven-plugin to the project POM and let Maven generate the Avro classes for you.
This plugin reads the Avro schema files and automatically (once the project is built) generates the POJO classes. If a schema contains an error or is not valid, you will be warned before executing any code.
The KafkaTemplate should then use this generated POJO instead of KafkaMessage.
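A rough sketch of what that looks like (KafkaMessageRecord is only a placeholder name for the POJO the avro-maven-plugin generates from your schema):

// Sketch: send the Avro-generated POJO, which implements IndexedRecord, instead of KafkaMessage.
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;

public class AvroProducer {

    private final KafkaTemplate<String, KafkaMessageRecord> kafkaTemplate;

    public AvroProducer(KafkaTemplate<String, KafkaMessageRecord> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void send(KafkaMessageRecord payload, String targetTopic) {
        // The payload now implements org.apache.avro.generic.IndexedRecord,
        // which is one of the types the Avro serializer accepts.
        Message<KafkaMessageRecord> message = MessageBuilder
                .withPayload(payload)
                .setHeader(KafkaHeaders.TOPIC, targetTopic)
                .setHeader(KafkaHeaders.MESSAGE_KEY, "someStringValue")
                .build();
        kafkaTemplate.send(message);
    }
}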
I recommend reading How to Use Schema Registry and Avro in Spring Boot Applications for a complete consumer and producer example, using Confluent components, for the overall project configuration (SERDEs, schema registry, etc.).

How to infer schema from Confluent Schema Registry using Apache Beam?

I'm trying to create an Apache Beam pipeline where I read from a Kafka topic and load it into BigQuery. Using Confluent's Schema Registry, I should be able to infer the schema when loading into BigQuery. However, the schema is not being inferred during the load, and it fails.
Below is the entire pipeline code.
pipeline
    .apply("Read from Kafka",
        KafkaIO
            .<byte[], GenericRecord>read()
            .withBootstrapServers("broker-url:9092")
            .withTopic("beam-in")
            .withConsumerConfigUpdates(consumerConfig)
            .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(schemaRegUrl, subj))
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .commitOffsetsInFinalize()
            .withoutMetadata()
    )
    .apply("Drop Kafka message key", Values.create())
    .apply(
        "Write data to BQ",
        BigQueryIO
            .<GenericRecord>write()
            .optimizedWrites()
            .useBeamSchema()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            .withCustomGcsTempLocation("gs://beam-tmp-load")
            .withNumFileShards(10)
            .withMethod(FILE_LOADS)
            .withTriggeringFrequency(Utils.parseDuration("10s"))
            .to(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("loaded-data")
                .setTableId("beam-load-test"))
    );
return pipeline.run();
When running this I get the following error, which comes from the fact that I'm calling useBeamSchema() and hasSchema() returns false:
Exception in thread "main" java.lang.IllegalArgumentException
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:127)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:2595)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:2579)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1726)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:542)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:493)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:368)
at KafkaToBigQuery.run(KafkaToBigQuery.java:159)
at KafkaToBigQuery.main(KafkaToBigQuery.java:64)
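The stack trace points at the missing piece: useBeamSchema() requires the input PCollection to report hasSchema() == true. A speculative sketch of one way to attach a Beam schema to the GenericRecord values via a schema-aware coder; it assumes Beam's AvroUtils.schemaCoder(org.apache.avro.Schema) overload is available in your SDK version, so verify the package and signature before relying on it:

// Speculative sketch: give the GenericRecord PCollection a Beam schema so that
// BigQueryIO's useBeamSchema() can infer the table schema.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.PCollection;

public class SchemaAttach {

    public static PCollection<GenericRecord> withBeamSchema(
            PCollection<GenericRecord> records, Schema avroSchema) {
        // A SchemaCoder derived from the Avro schema makes hasSchema() return true.
        return records.setCoder(AvroUtils.schemaCoder(avroSchema));
    }
}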

Implementing custom AvroConverter for confluent kafka-connect-s3

I am using Confluent's Kafka S3 connector to copy data from Apache Kafka to AWS S3.
The problem is that my Kafka data is in Avro format but was NOT written with Confluent Schema Registry's Avro serializer, and I cannot change the Kafka producer. So I need to deserialize the existing Avro data from Kafka and then persist it in Parquet format in AWS S3. I tried using Confluent's AvroConverter as the value converter, like this:
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost/api/v1/avro
And I am getting this error:
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic dcp-all to Avro:
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:86)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:488)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
As far as I understand, io.confluent.connect.avro.AvroConverter will only work if the data was written to Kafka using Confluent Schema Registry's Avro serializer, hence I am getting this error. So my question is: do I need to implement a generic AvroConverter in this case? And if yes, how do I extend the existing source code - https://github.com/confluentinc/kafka-connect-storage-cloud?
Any help here will be appreciated.
You don't need to extend that repo. You just need to implement a Converter (part of Apache Kafka), shade it into a JAR, then place it on your Connect worker's CLASSPATH, like BlueApron did for Protobuf. A rough sketch of such a converter is shown below.
Or see if this works - https://github.com/farmdawgnation/registryless-avro-converter
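For illustration, a minimal sketch of such a Converter (this is not code from either linked project; the "schema.path" property name and the use of Confluent's AvroData helper are assumptions):

// Sketch: a Connect Converter for "plain" Avro records written without the Schema Registry
// framing (no magic byte), using a reader schema loaded from a file.
import java.io.File;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

import io.confluent.connect.avro.AvroData;

public class PlainAvroConverter implements Converter {

    private Schema avroSchema;
    private AvroData avroData;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        try {
            // "schema.path" is an illustrative property name for the reader schema file.
            avroSchema = new Schema.Parser().parse(new File((String) configs.get("schema.path")));
            avroData = new AvroData(100); // schema cache size
        } catch (Exception e) {
            throw new DataException("Failed to load Avro schema", e);
        }
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        try {
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(avroSchema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value, null);
            GenericRecord record = reader.read(null, decoder);
            // AvroData (from Confluent's avro converter artifact) maps Avro data to Connect data.
            return avroData.toConnectData(avroSchema, record);
        } catch (Exception e) {
            throw new DataException("Failed to deserialize raw Avro for topic " + topic, e);
        }
    }

    @Override
    public byte[] fromConnectData(String topic, org.apache.kafka.connect.data.Schema schema, Object value) {
        throw new UnsupportedOperationException("This sketch only covers the sink (toConnectData) side");
    }
}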
"NOT using Confluent Schema Registry" - then what registry are you using? Each one that I know of has configurations to interface with the Confluent one.

Kafka Sink HDFS Unrecognized token

I'm trying to write JSON with Kafka HDFS Sink.
I have the following properties (connect-standalone.properties):
key.converter.schemas.enable = false
value.converter.schemas.enable = false
schemas.enable=false
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
And in my connector properties:
format.class=io.confluent.connect.hdfs.json.JsonFormat
And I got the following exception:
org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error
...
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'test': was expecting 'null', 'true', 'false' or NaN at [Source: (byte[])"test" line: 1 column: 11]
My JSON is valid.
How can I solve this, please?
I'm also trying with a simple JSON like:
{"key":"value"}
and I still get the same error.
Thanks.
According to the error, not all the messages in the topic are actually JSON objects. The latest messages might be valid, or the Kafka values might be valid (though not the keys), but the error shows that it tried to read a plain string, (byte[])"test", which is not valid JSON.
If you only want text data in HDFS, you could use the String format; however, that won't have Hive integration:
format.class=io.confluent.connect.hdfs.string.StringFormat
If you did want to use Hive with this format, you would need to define the JSON SerDe yourself.