We have a producer that writes data to a topic with an Avro schema. The Kafka HDFS Connect sink writes the data from the topic to HDFS in Parquet format and creates an external table in Hive. This works fine in both the Hive and Impala engines. So far so good. However, we cannot control the column ordering of the producer-side Avro schema, and when the order of columns in the Avro schema changes on the producer side, Hive fails with the following error:
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveArrayInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
Note: Impala still works fine even when the Avro schema column order changes.
Let's say I have a bunch of different topics, each with its own JSON schema. In Schema Registry, I indicated which schemas exist for the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the Schema Registry. So, to my knowledge, I never indicated which registered schema a Kafka connector (e.g., the JDBC sink) should use in order to deserialize a message from a certain topic?
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using Schema Registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema ID directly into the bytes of each record. Connect (or consumers using the JSON Schema deserializer) then uses the ID that's part of each record to look up the schema.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as zstd.
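To make that wire format concrete, here is a minimal sketch in plain Java of peeking at the header of a record value; the class and method names are just for illustration, and in practice the Confluent deserializers handle this for you:

import java.nio.ByteBuffer;

public class WireFormatPeek {
    // Returns the Schema Registry ID embedded in a record value that was
    // written by a registry-aware serializer (1 magic byte + 4-byte ID).
    public static int schemaId(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte magic = buf.get();          // always 0x0 in the current wire format
        if (magic != 0x0) {
            throw new IllegalArgumentException("Not a Schema Registry framed record");
        }
        return buf.getInt();             // big-endian 4-byte schema ID
    }
}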
My requirement is that when a producer is producing data without a schema, I need to register a new schema in the Schema Registry in order to consume the data through the JDBC sink connector.
I have found this, but is it possible to get any other solution?
Schema Registry is not a requirement for using the JDBC connector, but the JDBC Sink connector does require a schema in the record payload, as the linked answer says.
The source connector can read data and generate records without a schema, but this has no interaction with any external producer client.
If you have producers that generate records without any schema, then it's unclear what schema you would be registering anywhere. But you can try to use a ProducerInterceptor to intercept and inspect those records to do whatever you need to.
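If you do go the ProducerInterceptor route, a minimal sketch could look like the following; the class name and the String/byte[] record types are placeholders, and you would register it on the producer via the interceptor.classes property:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SchemaInspectingInterceptor implements ProducerInterceptor<String, byte[]> {

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        // Called before the record is sent: inspect (or transform) the payload here,
        // e.g. decide whether it carries a schema and react however you need.
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // No-op for this sketch.
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}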
Background:
I used Spring Kafka to implement an Avro-based consumer and producer. The other important components (Kafka broker, ZooKeeper, and Schema Registry) run in a Docker container. This works perfectly fine for me.
What I Want:
I want a Kafka Avro deserializer (on the consumer side) that is independent of the Schema Registry. In my case, I have an Avro schema file that will not change, so I want to get rid of the additional step of using the Schema Registry on the consumer side and instead use a local schema file.
If the Producer serializer uses the Schema Registry, then the Consumer should as well. Avro requires you to have a reader and writer schema.
If the consumer, for whatever reason, cannot access the Registry over the network, you would need to use ByteArrayDeserializer, then take the byte slice after position 5 (the 0x0 magic byte plus the 4-byte schema ID) of the byte[] for the part of the record you want to parse.
Then, using the Avro API, you can use GenericDatumReader along with your local Schema reference to get a GenericRecord instance. Note that this assumes your reader and writer schemas are exactly the same, which isn't guaranteed, since the producer could change the schema at any time.
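As a rough illustration of that approach, here is a sketch that skips the 5-byte wire-format header and decodes the rest with a local schema; the user.avsc path is hypothetical, and it assumes the reader and writer schemas really are identical:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class LocalSchemaAvroParser {
    private static final int WIRE_FORMAT_HEADER = 5; // 0x0 magic byte + 4-byte schema ID

    private final GenericDatumReader<GenericRecord> reader;

    public LocalSchemaAvroParser(File schemaFile) throws IOException {
        // Reader schema is loaded from the local file, e.g. new File("user.avsc"),
        // and is assumed to match the writer schema exactly.
        Schema localSchema = new Schema.Parser().parse(schemaFile);
        this.reader = new GenericDatumReader<>(localSchema);
    }

    public GenericRecord parse(byte[] value) throws IOException {
        // Decode the Avro payload that follows the Confluent wire-format header.
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(value, WIRE_FORMAT_HEADER, value.length - WIRE_FORMAT_HEADER, null);
        return reader.read(null, decoder);
    }
}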
Or you can create a SpecificRecord from the schema you have, and configure the KafkaAvroDeserializer to use that.
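For the SpecificRecord option, the consumer configuration would look roughly like this sketch; note that the Confluent KafkaAvroDeserializer still needs a reachable registry URL, and the group id and URLs here are made up:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SpecificRecordConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        // Return the class generated from your .avsc file rather than a GenericRecord.
        props.put("specific.avro.reader", "true");
        // The Confluent deserializer still resolves the writer schema from the registry.
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}

You would then build a KafkaConsumer<String, YourGeneratedClass> from these properties.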
I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema with it I use a ksql stream. On top of the stream I create a new topic that now has a defined schema, and at the end I send the data to a BigQuery database. My questions are: how do I monitor messages that have not passed the stream stage? Does this approach support schema evolution? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable
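As a rough way to check that lag programmatically, here is a sketch using the Kafka AdminClient; the group id shown is a made-up example of the internal name ksql gives its query's consumer group, so look up the real one first (for example with kafka-consumer-groups --list):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class KsqlLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        String groupId = "_confluent-ksql-default_query_CSAS_HASHTAGS_1"; // look up the real name

        try (Admin admin = Admin.create(props)) {
            // Offsets the ksql query has committed so far, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus committed offset.
            committed.forEach((tp, offset) ->
                    System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp).offset() - offset.offset()));
        }
    }
}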
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream
I am trying to set up a topic with an Avro schema on Confluent Platform (with Docker).
My topic is running and I have messages.
I also configured the Avro schema for the value of this specific topic:
Thus, I can't use the data from ksql, for example.
Any idea of what I am doing wrong?
EDIT 1:
So what I expect is:
From the Confluent Platform, in the topic view, I expect to see the value in a readable format (not raw Avro bytes) once the schema is in the registry.
From KSQL, I tried to create a Stream with the following command:
CREATE STREAM hashtags
WITH (KAFKA_TOPIC='mytopic',
VALUE_FORMAT='AVRO');
But when I try to visualize my created stream, no data shows up.