I'm trying to put a pipeline in place, and I just realized I don't really know why there would be an error or why there would be an error topic. There is some metadata that I will be counting on to have certain values, but other than that, is there anything that counts as a "typical" Kafka error? I'm not sure what a "typical" Kafka error topic is used for. This is specifically for a streams application. Thanks for any help.
One example of an error topic in a streaming environment would be one that contains messages that failed to abide by their contract. For example: if your incoming events are meant to be in a certain JSON format, your Spark application will first try to parse the event into a class that fits the event's JSON contract.
If it is in the right format, it is parsed and the app continues.
If it is in the incorrect format, the parsing fails and the json string is sent to the error topic.
Another use case could be to send the event to an error topic to be processed at a later time, e.g. when there are network issues connecting to other services.
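Since you mention a streams application specifically: here is a minimal Kafka Streams sketch of that dead-letter pattern. The topic names and the isValidJson helper are made up for illustration, and it assumes the split/branch API available from Kafka Streams 2.8 onward.

```java
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Named;

public class ErrorTopicTopology {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical validity check: does the raw string parse as JSON at all?
    static boolean isValidJson(String value) {
        try {
            MAPPER.readTree(value);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("incoming-events");

        // Split records into those that parse and those that don't.
        Map<String, KStream<String, String>> branches = events
                .split(Named.as("parse-"))
                .branch((key, value) -> isValidJson(value), Branched.as("ok"))
                .defaultBranch(Branched.as("error"));

        branches.get("parse-ok").to("parsed-events");   // good records continue downstream
        branches.get("parse-error").to("error-topic");  // malformed records go to the error topic
        return builder;
    }
}
```

Anything that falls into the default branch is written unchanged to the error topic, so it can be inspected or replayed later.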
Is there a possibility that we may lose some messages if we use the Snowflake Kafka connector? For example, if the connector reads a message and commits the offset before the message is written to the variant table, then we would lose that message. Is this a scenario that can happen if we use Kafka Connect?
If you have any examples, these are welcome as well, thank you!
According to the documentation from Snowflake:
Both Kafka and the Kafka connector are fault-tolerant. Messages are neither duplicated nor silently dropped. Messages are delivered exactly once, or an error message will be generated. If an error is detected while loading a record (for example, the record was expected to be a well-formed JSON or Avro record, but wasn’t well-formed), then the record is not loaded; instead, an error message is returned.
Limitations are listed as well. Arguably, nothing is impossible, but if you don't trust Kafka I'd not use Kafka at all.
How and where you could lose messages also depends on your overall architecture, e.g. how records are written into the Kafka topics you're consuming and how you partition.
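As a general illustration of why the commit order matters (this is not the Snowflake connector's actual code, just a plain consumer sketch): the at-least-once pattern is to commit offsets only after the downstream write has succeeded, so a crash in between leads to re-delivery rather than loss.

```java
// Sketch of the at-least-once pattern: commit offsets only after the downstream
// write has succeeded. If the process dies before commitSync(), the same records
// are re-consumed and re-written on restart instead of being lost.
// (Illustrative only; not the Snowflake connector's internal code.)
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceLoader {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "loader");
        props.put("enable.auto.commit", "false"); // no automatic commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writeToTable(record.value()); // stand-in for the actual table load
                }
                consumer.commitSync(); // commit only after the whole batch was written
            }
        }
    }

    static void writeToTable(String json) {
        System.out.println("loaded: " + json);
    }
}
```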
It is a business requirement that all the messages I consume from the Kafka topic contain an integrity seal, so that any changes introduced to the message payload can be detected.
I was thinking I could possibly do this in a Kafka Connect transform.
This would require converting the payload to the resulting JSON message format inside the transform, prior to sealing it, so that the result could be verified upon consumption of the message.
My issues currently are...
1) I am not sure how to convert the payload, while inside the transform, to the same JSON that would be output to the Kafka topic.
2) I am not sure of the best way to add the seal to the message. It would need to be placed in a consistent location (for example, first) so that it could be stripped easily and completely prior to validating the seal in the consumer.
Any thoughts, suggestions, or different approaches would be appreciated.
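One possible shape for such a transform, as a rough sketch only: the class name, the seal.secret config key, and the header name below are hypothetical, and this variant places the seal in a record header instead of embedding it in the payload, which keeps it in a predictable location and easy to strip before validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.json.JsonConverter;
import org.apache.kafka.connect.transforms.Transformation;

public class SealTransform<R extends ConnectRecord<R>> implements Transformation<R> {

    private final JsonConverter converter = new JsonConverter();
    private byte[] secret;

    @Override
    public void configure(Map<String, ?> configs) {
        // schemas.enable=false so the converter emits the plain JSON body, not the schema envelope
        converter.configure(Map.of("schemas.enable", "false"), false);
        secret = String.valueOf(configs.get("seal.secret")).getBytes(StandardCharsets.UTF_8); // hypothetical config key
    }

    @Override
    public R apply(R record) {
        try {
            // Serialize the value the same way JsonConverter would when writing it to the topic
            byte[] json = converter.fromConnectData(record.topic(), record.valueSchema(), record.value());
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            String seal = Base64.getEncoder().encodeToString(mac.doFinal(json));
            // A header keeps the seal in a predictable place without modifying the payload itself
            record.headers().addString("integrity-seal", seal);
            return record;
        } catch (Exception e) {
            throw new RuntimeException("Failed to seal record", e);
        }
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
    }
}
```

The consumer would then re-serialize the payload the same way, recompute the HMAC, and compare it with the integrity-seal header.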
I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file look like well-formed JSON after publishing. However, when consuming the messages on the command line, I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader, I of course see the error here.
As I've tried different things the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream and before being published to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try, does anyone have any ideas?
Since you have "Schema Write Strategy" set to "Confluent Schema Reference", the writer writes the Confluent schema id reference at the beginning of the message content, so what you are seeing is most likely the bytes of that reference.
If you are using the Confluent Schema Registry, then this is correct behavior, and those values need to be there for the consuming side to determine which schema to use.
If you are not using the Confluent Schema Registry when consuming these messages, just choose one of the other Schema Write Strategies.
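For reference, the standard Confluent wire format is a single magic byte (0x00) followed by a 4-byte schema id and then the payload, which is why a plain console consumer shows a stray leading character. A small sketch that peeks at those bytes (the simulated message in main is made up):

```java
// Sketch: the standard Confluent wire format is one magic byte (0x00), a 4-byte
// schema id, and then the serialized payload. Those five leading bytes are what
// show up as stray characters when the message is read as plain text.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ConfluentFramingPeek {

    static void peek(byte[] raw) {
        if (raw.length > 5 && raw[0] == 0x00) {
            int schemaId = ByteBuffer.wrap(raw, 1, 4).getInt();
            String payload = new String(raw, 5, raw.length - 5, StandardCharsets.UTF_8);
            System.out.println("schema id: " + schemaId);
            System.out.println("payload:   " + payload);
        } else {
            System.out.println("no Confluent framing: " + new String(raw, StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        // Simulated message value: magic byte 0, schema id 42, then a JSON body.
        byte[] body = "{\"field\":\"abc\"}".getBytes(StandardCharsets.UTF_8);
        byte[] value = ByteBuffer.allocate(5 + body.length)
                .put((byte) 0x00)
                .putInt(42)
                .put(body)
                .array();
        peek(value);
    }
}
```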
I have an ingestion pipeline using Flume & Kafka that consumes CSV files, converts the events to JSON in a Flume interceptor, and pushes them to Kafka.
When I log the message before it is sent to Kafka, it is normal, valid JSON. But when consuming the same message from Kafka, I get errors when trying to deserialize it, saying it is not valid JSON.
Indeed I have unrecognized chars at the beginning of my message:
e.g. �
I think it comes from the empty header that Flume tries to add to the event when posting to Kafka, but I can't seem to prevent this from happening.
Does anyone know how to completely remove headers from Flume events being sent, or more precisely, how to remove those chars?
This looks like a basic character encoding issue, e.g. if Kafka runs on Linux while the producer runs on a Windows machine. You might want to triple-check that all machines handle UTF-8 encoded messages.
This post should be your friend.
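If it helps to see exactly what that leading character is, a small debugging sketch like the one below consumes the raw bytes, prints the first few in hex, and then decodes the value as UTF-8 (the topic name is a placeholder):

```java
// Debugging sketch: consume raw bytes, print the first few in hex, then decode
// them as UTF-8. This shows whether the prefix is an encoding problem or extra
// bytes (e.g. a serialized header) prepended to the JSON body.
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawMessagePeek {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "debug-peek");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("flume-events")); // hypothetical topic name
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(5))) {
                byte[] v = rec.value();
                StringBuilder hex = new StringBuilder();
                for (int i = 0; i < Math.min(8, v.length); i++) {
                    hex.append(String.format("%02x ", v[i]));
                }
                System.out.println("first bytes: " + hex);
                System.out.println("as UTF-8:    " + new String(v, StandardCharsets.UTF_8));
            }
        }
    }
}
```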
I have been using Avro as the data format for offline processing over Kafka in my application. I have a use case where the producer uses a schema that is almost the same as what is used by the consumer, except that the producer has some changes in the documentation of some fields. Would the consumer be able to consume such events without erroring out? I'm debugging an issue where numerous events are missing in the data pipeline and trying to figure out the root cause. I noticed this difference and want to understand whether it could cause an issue.
You should probably test to confirm, but documentation changes should not impact schema resolution, as the schema gets normalized to the "Parsing Canonical Form":
https://avro.apache.org/docs/1.7.7/spec.html#Parsing+Canonical+Form+for+Schemas
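If you want to check this quickly, Avro's SchemaNormalization can show that two schemas differing only in doc attributes normalize to the same canonical form (a small sketch with made-up schemas):

```java
// Quick check (Avro 1.7+): SchemaNormalization strips "doc" attributes when
// producing the Parsing Canonical Form, so two schemas differing only in docs
// normalize to the same string.
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class DocOnlyDiffCheck {
    public static void main(String[] args) {
        Schema a = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"string\",\"doc\":\"old description\"}]}");
        Schema b = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"string\",\"doc\":\"new description\"}]}");

        String canonA = SchemaNormalization.toParsingForm(a);
        String canonB = SchemaNormalization.toParsingForm(b);
        System.out.println(canonA.equals(canonB)); // expected: true
    }
}
```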