I have an ingestion pipeline using Flume & Kafka, consuming CSV files, converting the events to JSON in a Flume interceptor and pushing them to Kafka.
When I log the message before it is sent to Kafka, it is normal, valid JSON. But when consuming the same message from Kafka, I get errors when trying to deserialize it, saying it's not valid JSON.
Indeed I have unrecognized chars at the beginning of my message:
e.g. �
I think it comes from the empty header that Flume tries to add to the event when posting to Kafka, but I can't seem to prevent this from happening.
Does anyone know how to completely remove headers from the Flume events being sent, or more precisely, how to remove those chars?
Looks like a basic character-encoding issue, e.g. if Kafka runs on Linux while the producer runs on a Windows machine. You might want to triple-check that all machines handle UTF-8 encoded messages.
this post should be your friend.
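If you want to pin down where the extra bytes come from, one option is to consume the raw value as bytes (bypassing any string deserializer) and print the first few bytes as hex: printable UTF-8 points to an encoding problem, while non-printable leading bytes suggest something is being prepended to the JSON. A minimal sketch, assuming a local broker and a hypothetical topic name:

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RawMessageInspector {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // adjust to your broker
            props.put("group.id", "debug-inspector");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic name
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    byte[] value = record.value();
                    StringBuilder hex = new StringBuilder();
                    for (int i = 0; i < Math.min(16, value.length); i++) {
                        hex.append(String.format("%02x ", value[i] & 0xff));
                    }
                    // If the first bytes are not printable ASCII/UTF-8, something is
                    // prepending data to the JSON payload rather than re-encoding it.
                    System.out.println("first bytes: " + hex);
                    System.out.println("as UTF-8   : " + new String(value, StandardCharsets.UTF_8));
                }
            }
        }
    }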
Is there a possibility that we may lose some messages if we use the Snowflake Kafka connector? For example, if the connector reads a message and commits the offset before the message is written to the variant table, then we will lose that message. Is this a scenario that can happen if we use Kafka Connect?
If you have any examples, these are welcome as well, thank you!
According to the documentation from Snowflake:
Both Kafka and the Kafka connector are fault-tolerant. Messages are neither duplicated nor silently dropped. Messages are delivered exactly once, or an error message will be generated. If an error is detected while loading a record (for example, the record was expected to be a well-formed JSON or Avro record, but wasn’t well-formed), then the record is not loaded; instead, an error message is returned.
Limitations are listed as well. Arguably, nothing is impossible, but if you don't trust Kafka I'd not use Kafka at all.
How and where you could lose messages also depends on your overall architecture, e.g. how records are written into the Kafka topics you're consuming and how you partition.
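To illustrate the concern in the question in general Kafka terms (this is not the Snowflake connector's internal code, just the usual at-least-once sink pattern): if offsets are committed only after the write to the target succeeds, a crash between the write and the commit causes re-delivery rather than loss. A rough sketch with hypothetical topic and sink names:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "sink-example");
            props.put("enable.auto.commit", "false"); // commit only after a successful write
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("source-topic")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        writeToTargetTable(record.value()); // hypothetical write into the target table
                    }
                    // Offsets are committed only after the batch has been written,
                    // so a crash in between leads to re-delivery, not loss.
                    consumer.commitSync();
                }
            }
        }

        private static void writeToTargetTable(String value) {
            // placeholder for the actual load into the target table
        }
    }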
I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file after publishing look like well-formed JSON. However, when consuming the messages on the command line I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader of course produces the error here.
As I've tried different things, the letter has changed: it started as an "h", then an "f" (when configuring a JSONRecordSetWriter farther upstream, before publishing to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try; does anyone have any ideas?
Since you have "Schema Write Strategy" set to "Confluent Schema Reference", the writer writes the Confluent schema id reference at the beginning of the message content, so what you are seeing is most likely the bytes of that reference.
If you are using the Confluent Schema Registry, then this is correct behavior, and those bytes need to be there for the consuming side to determine which schema to use.
If you are not using the Confluent Schema Registry when consuming these messages, just choose one of the other Schema Write Strategies.
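For reference, the Confluent wire format prefixes each message with a magic byte (0) followed by a 4-byte schema id, and the record bytes come after that. A small sketch of how a plain consumer could strip and inspect that prefix, assuming the value was read as a byte array:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class ConfluentHeaderCheck {
        // Strips the Confluent wire-format prefix: 1 magic byte + 4-byte schema id.
        public static void inspect(byte[] value) {
            ByteBuffer buffer = ByteBuffer.wrap(value);
            byte magic = buffer.get();
            int schemaId = buffer.getInt();
            byte[] payload = new byte[buffer.remaining()];
            buffer.get(payload);

            System.out.println("magic byte : " + magic);    // expected: 0
            System.out.println("schema id  : " + schemaId); // id registered in the schema registry
            System.out.println("payload    : " + new String(payload, StandardCharsets.UTF_8));
        }
    }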
I am new to NiFi (and don't have much experience with Kafka), and I am trying to consume the messages that the producer is generating. To do this, I am using the ConsumeKafka processor in NiFi.
The messages are arriving (I can see them in the queue), but when I try to view them, I can only see the content in hex format (in the original format view I get a message that says: No viewer is registered for this content type).
The messages the producer is sending are Avro-encoded buffers (this is the reference I have followed: https://blog.mimacom.com/apache-kafka-with-node-js/). When I check the consumer from the console, each message has this format:
02018-09-21T08:37:44.587Z #02018-09-21T08:37:44.587Z #
I have read that the processor UpdateRecord can help to change the hex code to plain text, but I can't make it happen.
How can I configure this UpdateRecord processor?
Cheers
Instead of ConsumeKafka, it is better to use the ConsumeKafkaRecord processor appropriate to the Kafka version you're using, configure its Record Reader with an AvroReader, and set its Record Writer to the writer of your choice.
Once that's done, you have to configure the AvroReader controller service with a schema registry. You can use AvroSchemaRegistry, where you would specify the schema for the Avro messages you're receiving from Kafka.
A quick look at this tutorial would help you achieve what you want: https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
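Outside of NiFi, you can also sanity-check that the bytes really are plain Avro encoded with your schema by decoding one message with the Avro Java API. A quick sketch with a hypothetical schema and a placeholder for however you fetch a sample message:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public class AvroDecodeCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema; replace with the schema your Node.js producer uses.
            String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                    + "{\"name\":\"timestamp\",\"type\":\"string\"},"
                    + "{\"name\":\"value\",\"type\":\"string\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            byte[] messageBytes = readMessageBytesSomehow(); // raw bytes pulled from the topic

            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
            GenericRecord record = reader.read(null, decoder);

            // toString() renders the record as JSON, which is roughly what the
            // AvroReader + JSON writer combination in NiFi would give you.
            System.out.println(record);
        }

        private static byte[] readMessageBytesSomehow() {
            return new byte[0]; // placeholder for fetching a sample message
        }
    }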
I'm trying to put a pipeline in place, and I just realized I don't really know why there will be an error and why there will be an error topic. There is some metadata that I will be counting on to have certain values, but other than that, is there anything that is a "typical" Kafka error? I'm not sure what the "typical" Kafka error topic is used for. This is specifically for a streams application. Thanks for any help.
One example of an error topic in a streaming environment would be one that holds messages that failed to abide by their contract. For example: if your incoming events are meant to be in a certain JSON format, your Spark application will first try to parse each event into a class that fits the event's JSON contract.
If it is in the right format, it is parsed and the app continues.
If it is in the incorrect format, the parsing fails and the JSON string is sent to the error topic.
Another use case could be to send the event to an error topic to be processed at a later time, e.g. when network issues prevent connecting to other services.
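As a rough sketch of that parse-or-divert pattern (the topic names and the "is it JSON?" contract check are made up for illustration, and this uses a plain Kafka producer rather than any particular streaming framework):

    import java.util.Properties;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ErrorTopicRouter {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void route(KafkaProducer<String, String> producer, String rawEvent) {
            try {
                // Validate the event against its expected contract (here: just "is it JSON?").
                JsonNode parsed = MAPPER.readTree(rawEvent);
                producer.send(new ProducerRecord<>("events-valid", parsed.toString())); // hypothetical topic
            } catch (Exception e) {
                // Contract violation: park the raw payload on the error topic for later inspection/replay.
                producer.send(new ProducerRecord<>("events-error", rawEvent)); // hypothetical topic
            }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                route(producer, "{\"id\": 1}");     // goes to events-valid
                route(producer, "not json at all"); // goes to events-error
            }
        }
    }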
Wanted to know if there is a better way to solve the problem that we are having. Here is the flow:
Our client code understands only protocol buffers (protobuf). On the server side, our gateway gets the protobuf and puts it onto Kafka.
Now, Avro is the recommended encoding scheme, so we wrap the protobuf inside Avro (as a byte array) and put that onto the message bus. The reason we do this is to avoid having to do a full protobuf-to-Avro conversion.
On the consumer side, it reads the Avro message, gets the protobuf out of it and works on that.
How reliable is protobuf with Kafka? Are there a lot of people using it? What exactly are the advantages/disadvantages of using Kafka with protobuf?
Is there a better way to handle our use case/scenario?
thanks
Kafka doesn't differentiate between encoding schemes, since in the end every message flows in and out of Kafka as binary.
Both protobuf and Avro are binary encoding schemes, so why would you want to wrap a protobuf inside an Avro schema when you can put the protobuf message directly into Kafka?
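For example, a producer can hand the serialized protobuf to Kafka as a plain byte array (MyEvent below stands in for whatever generated protobuf class you have):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProtobufProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            // MyEvent is a hypothetical generated protobuf class.
            MyEvent event = MyEvent.newBuilder().setId("42").build();

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                // The serialized protobuf goes straight onto the topic; no Avro wrapper needed.
                producer.send(new ProducerRecord<>("events", event.toByteArray()));
            }
            // The consumer side reverses this with MyEvent.parseFrom(record.value()).
        }
    }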