I wrote some Avro data to the topic “test-avro” using kafka-avro-console-producer.
Then I wrote some plain-text data to the same topic “test-avro” using kafka-console-producer.
After this, all the data in the topic got corrupted. Can anyone explain what caused this?
You simply cannot use the avro-console-consumer (or a Consumer with an Avro deserializer) anymore to read those offsets because it'll assume all data in the topic is Avro and use Confluent's KafkaAvroDeserializer.
The plain console-producer will push non-Avro-encoded UTF-8 strings and use the StringSerializer, which will not match the wire format expected by the Avro deserializer.
The only way to get past them is to know which offsets are bad, and then either wait for them to expire from the topic or reset a consumer group to begin after those messages. Alternatively, you can always use the ByteArrayDeserializer and add conditional logic for parsing your messages, so that no data is lost.
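As a rough illustration of that conditional logic, here is a minimal Python sketch that takes the raw bytes of a message (as a byte-array deserializer would hand them over) and guesses whether they follow the Confluent wire format, which is a zero magic byte followed by a 4-byte big-endian schema ID and then the Avro payload. The function name `classify` is hypothetical, and the check is only a heuristic: a plain-text message that happens to start with a NUL byte would be misclassified.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def classify(raw: bytes):
    """Guess whether raw message bytes are Confluent-framed Avro or plain text.

    Returns ("avro", schema_id, avro_payload) or ("text", None, decoded_string).
    Heuristic only: it inspects the 5-byte header and does not decode the Avro body.
    """
    if len(raw) >= 5 and raw[0] == MAGIC_BYTE:
        # Bytes 1..4 hold the schema ID as a big-endian unsigned int.
        (schema_id,) = struct.unpack(">I", raw[1:5])
        return ("avro", schema_id, raw[5:])
    # Anything else is treated as a UTF-8 string from the plain console producer.
    return ("text", None, raw.decode("utf-8", errors="replace"))
```

The Avro branch would then pass the payload and schema ID on to a real Avro decoder, while the text branch can be logged or skipped.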
tl;dr The producer and consumer must agree on the data format of the topic.
Related
I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a KSQL stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to a BQ database. My question is: how do I monitor messages that have not passed the stream stage? Do I support schema evolution this way? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar specifically with the RabbitMQ connector, but if you use the Confluent converter classes that do use schemas, then it would have one, although maybe only a string or bytes schema.
If KSQL is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by KSQL. If KSQL is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream.
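To make "lag" concrete: per partition, it is the log end offset minus the offset the consumer group has committed. The sketch below only shows that arithmetic on made-up numbers; the function name and the choice to treat a partition with no committed offset as starting at 0 are assumptions for illustration, and in practice the offsets would come from your monitoring tooling.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset.

    Assumption for this sketch: a partition with no committed offset is
    treated as if the group had committed offset 0.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)  # never report negative lag
    return lag


# Hypothetical offsets for a two-partition topic.
print(consumer_lag({0: 100, 1: 50}, {0: 90}))  # prints {0: 10, 1: 50}
```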
I am new to NiFi (and do not have much experience with Kafka), and I am trying to consume the messages that a producer is generating. To do this, I am using the ConsumeKafka processor in NiFi.
The messages are arriving (I can see them in the queue), but when I check the queue and try to view the messages, I can only see the content in hex format (for example, in the original format view I see a message that says: No viewer is registered for this content type).
The messages that the producer is sending are Avro-encoded buffers (this is the reference I have followed: https://blog.mimacom.com/apache-kafka-with-node-js/). When I check the consumer from the console, each message has this format:
02018-09-21T08:37:44.587Z #02018-09-21T08:37:44.587Z #
I have read that the processor UpdateRecord can help to change the hex code to plain text, but I can't make it happen.
How can I configure this UpdateRecord processor?
Cheers
Instead of ConsumeKafka, it is better to use the ConsumeKafkaRecord processor appropriate to the Kafka version you're using, configure its Record Reader with an AvroReader, and set the Record Writer to the writer of your choice.
Once that's done, you have to configure the AvroReader controller service with a Schema registry. You can use AvroSchemaRegistry where you would specify the schema for the Avro messages that you're receiving in Kafka.
A quick look at this tutorial would help you achieve what you want: https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
I am new to the world of Kafka and I would like to ask for some important information related to my project.
I am using an Avro file for producing and consuming messages. I want to know if I can use the same Avro file for multiple topics, for example by using a different "name" attribute in the producer and a specific "name" attribute in the consumer.
Thanks a lot.
Stefano
You can use one file to send data to multiple topics, yes, although I'm not sure why one would do that.
I would be cautious about merging multiple topics into one Avro file, because the schema must match in every topic for that file.
I would suggest that you use the Confluent Schema Registry, for example, rather than sending individual Avro events, because if you are not using some registry, then you're likely sending the Avro schema as part of every message, which will reduce the possible throughput of your topic. And then, the name under which the Avro schema is registered in the Registry will correspond to the topic name.
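A back-of-the-envelope way to see the overhead: if every message carries its schema as JSON text, you pay at least the schema's size per message, whereas the Confluent wire format adds a 5-byte header (1 magic byte plus a 4-byte schema ID). The schema below is hypothetical and the function name is made up for this sketch; a real Avro container file adds its own header on top of the schema.

```python
import json

# Hypothetical schema, used only to get a realistic size.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
schema_bytes = len(json.dumps(schema).encode("utf-8"))

REGISTRY_HEADER_BYTES = 5  # 1 magic byte + 4-byte schema ID


def per_message_overhead(embed_schema: bool) -> int:
    """Extra bytes per message, on top of the Avro-encoded record itself."""
    return schema_bytes if embed_schema else REGISTRY_HEADER_BYTES


print("schema embedded:", per_message_overhead(True), "bytes")
print("registry header:", per_message_overhead(False), "bytes")
```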
I would like to know what would be best for me: Kafka Streams, the Kafka Consumer API, or Kafka Connect?
I want to read data from a topic, do some processing, and write to a database. I have written consumers, but I feel I could write a Kafka Streams application and use its stateful processor to perform any changes and write to the database, which would eliminate my consumer code so that I only have to write the DB code.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors. (I don't want to write in Avro.) Creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or use Kafka Connect.
And which will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enabled=false for the converter (for example, value.converter.schemas.enabled=false).
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For example, {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON.
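The escaping is what you get whenever a JSON document is handled as an opaque string and then written out as a JSON value again. This Python snippet is only an analogy for that effect, not the Connect code path itself:

```python
import json

record = {"name": "hello"}

# The JSON document as plain text.
as_text = json.dumps(record, separators=(",", ":"))
print(as_text)   # prints {"name":"hello"}

# Serialising that *string* as JSON once more escapes the inner quotes.
escaped = json.dumps(as_text)
print(escaped)   # prints "{\"name\":\"hello\"}"

# Reading it back therefore needs two decoding passes.
assert json.loads(json.loads(escaped)) == record
```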
Processing I want to do is basic validation and few field values transformation
Kafka Streams and the Consumer API are both capable of validation. Connect is capable of Single Message Transforms (SMT).
In some use cases, you need to "duplicate data" onto Kafka: process your "raw" topic by reading it using a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
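The per-record step between the "raw" and "cleaned" topics can be as small as one function. Below is a minimal sketch of basic validation plus a field transformation; the function name `clean`, the required "name" field, and the lowercase/trim rule are all hypothetical, and the consumer and producer wiring around it is left out.

```python
import json


def clean(raw_value: str):
    """Return the cleaned JSON string, or None if the record fails validation."""
    try:
        record = json.loads(raw_value)
    except json.JSONDecodeError:
        return None  # not valid JSON: drop it or route it to an error topic
    if not isinstance(record, dict) or "name" not in record:
        return None  # hypothetical rule: "name" is a required field
    # Hypothetical transformation: normalise the "name" field.
    record["name"] = str(record["name"]).strip().lower()
    return json.dumps(record, separators=(",", ":"))


print(clean('{"name":" Hello "}'))  # prints {"name":"hello"}
print(clean("not json"))            # prints None
```

Records that return None would be skipped or sent elsewhere; everything else is produced to the "cleaned" topic for Connect to pick up.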
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please post a precise question rather than asking for opinions. This keeps the site clearer, and opinions are not answers (they depend on each person's preferences). Asking "How to use Kafka Connect with JSON", or something similar, would fit this site.
Also, please show some research.
The least overhead would be the Kafka consumer. Kafka Streams and Kafka Connect both use the Kafka consumer underneath, so you can always get less overhead with it, but you will also lose their benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? Then, you can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then, you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use another format for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON
I am trying to publish/consume my Java objects to/from Kafka. I use an Avro schema.
My basic program works fine. In my program I use my schema in the producer (for encoding) and in the consumer (for decoding).
If I publish different objects to different topics (e.g. 100 topics), then at the receiver I do not know what type of message I have received. I would like to get the Avro schema from the received bytes and use it for decoding.
Is my understanding correct? If so, how can I retrieve it from the received object?
You won't receive the Avro schema in the received bytes -- and you don't really want to. The whole idea with Avro is to separate the schema from the record, so that it is a much more compact format. The way I do it, I have a topic called Schema. The first thing a Kafka consumer process does is to listen to this topic from the beginning and to parse all of the schemas.
Avro schemas are just JSON string objects -- you can just store one schema per record in the Schema topic.
As to figuring out which schema goes with which topic, as I said in a previous answer, you want one schema per topic, no more. So when you parse a message from a specific topic you know exactly what schema applies, because there can be only one.
If you never re-use the schema, you can just name the schema the same as the topic. However, in practice you probably will use the same schema on multiple topics. In which case, you want to have a separate topic that maps Schemas to Topics. You could create an Avro schema like this:
{"name":"SchemaMapping", "type":"record", "fields":[
{"name":"schemaName", "type":"string"},
{"name":"topicName", "type":"string"}
]}
You would publish a single record per topic with your Avro-encoded mapping into a special topic, for example one called SchemaMapping. After consuming the Schema topic from the beginning, a consumer would listen to SchemaMapping, and after that it would know exactly which schema to apply for each topic.
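On the consumer side, the two special topics boil down to building a lookup table. Here is a minimal sketch in Python, assuming the Schema topic records arrive as JSON strings and the SchemaMapping records have already been decoded into dicts with the schemaName and topicName fields from the schema above. The function name `build_lookup` and the sample data are hypothetical.

```python
import json


def build_lookup(schema_records, mapping_records):
    """Map each topic name to its parsed Avro schema.

    schema_records: JSON strings read from the Schema topic, one schema each.
    mapping_records: decoded SchemaMapping records (dicts).
    """
    schemas = {}
    for text in schema_records:
        schema = json.loads(text)
        schemas[schema["name"]] = schema  # index schemas by their record name

    topic_to_schema = {}
    for mapping in mapping_records:
        topic_to_schema[mapping["topicName"]] = schemas[mapping["schemaName"]]
    return topic_to_schema


# Hypothetical contents of the two topics.
schema_topic = ['{"name":"User","type":"record","fields":[{"name":"id","type":"long"}]}']
mapping_topic = [
    {"schemaName": "User", "topicName": "users"},
    {"schemaName": "User", "topicName": "users-archive"},
]
lookup = build_lookup(schema_topic, mapping_topic)
print(lookup["users"]["name"])  # prints User
```

When a message arrives from a given topic, the consumer looks up the topic name in this table and decodes with the schema it finds there.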