Single letter being prepended to JSON records from NIFI JSONRecordSetWriter - apache-kafka

I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being being published to a Kafka topic using PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka and looking at the records in the flow file after being published look like well-formed JSON. Though, when consuming the messages on the command line I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader and of course see the error here.
As I've tried different things the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream and before being published to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter but not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try, does anyone have any ideas?

Since you have the "Schema Write Strategy" set to "Confluent Schema Reference" this is telling the writer to write the confluent schema id reference at the beginning of the content of the message, so likely what you are seeing is the bytes of that.
If you are using the confluent schema registry then this is correct behavior and those values need to be there for the consuming side to determine what schema to use.
If you are not using confluent schema registry when consuming these messages, just choose one of the other Schema Write Strategies.

Related

Automatically discover topics when using topics.regex configuration for Kafka Connector

When I'm using topics.regex config option for Kafka Connector Sink (in this particular case Confluent S3 Sink) everything works as expected when sink is first started (it discovers all topics and start consuming messages from them). I want to be able to also create some topics later. I can't find anywhere in documentation what is expected behaviour here, but I wanted to be able to somehow automatically start consuming new topics with name matching provided regex. Is it possible at all? If not what would be best way to implement some automation for this?
You should not need to do anything. It will find the new topic automatically if it matches the regex. But it is not immediate - it might take a few minutes (I think it is driven by the metadata refresh which is by default 5 minutes?). I never used it directly with the S3 connector you mention. But it worked fine for me with other connectors and I think there should be no differences.

Why Kafka Connect Works?

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages would a PostgreSQL Source produce? What if I replaced PostgreSQL Source with MySQL source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of a fixed message structure, but according to the documentation the SourceRecord which is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using
org.apache.kafka.connect.data objects (or primitive values). For
example, a database connector might specify the sourcePartition as
a record containing { "db": "database_name", "table": "table_name"}
and the sourceOffset as a Long containing the timestamp of the row".
In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer the Javadoc on the Struct and Schema classes of the Connect API as well as the Converter interface
Of course, those are not strict requirements, but without them, then the framework doesn't work across different sources and sinks, but this is no different than the contract between producers and consumers regarding serialization

Confluent Schema Registry/Kafka Streams: prevent schema evolution

Is there a way to configure Confluent Schema Registry and/or Kafka Streams to prevent schema evolution?
Motivation
We have multiple Kafka Streams jobs producing messages for the same topic. All of the jobs should send messages with the same schema, but due to misconfiguration of the jobs, it has happened that some of them send messages with fields missing. This has caused issues downstream and is something we want to prevent.
When this happens, we can see a schema evolution in the schema registry as expected.
Solution
We checked the documentation for Confluent Schema Registry and/or Kafka Streams, but couldn't find a way to prevent the schema evolution.
Hence, we consider to modify the Kafka Streams jobs to read the schema from Confluent Schema Registry before sending it. If the received schema matches the local schema of the messages, only then we send them.
Is this the right way to go or did we miss a better option?
Update: we found an article on medium for validating the schema against the schema registry before sending.
It depends which language and library you use and what kind of APIs do they provide. If you are publishing generic records, you can read and parse .avdl or .avsc file into the record type and build your event. Which means if event you are trying to build wouldn't be compatible with the current schema you won't be able even to build that event hence won't be able to modify existing schema. So in this case simply store with your source code a static schema. With specific record it is more or less the same, you can build your Java/C# or other language classes based on the schema, you build them then simply new them up and publish. Does it make any sense?)
PS. I worked with C# libs for Kafka maybe some other languages do not have that support or have some other better options

Confluent Schema Registry Avro Schema

Hey I would like to use the Confluent schema registry with the Avro Serializers: The documentation now basically says: do not use the same schema for multiple different topics
Can anyone explain to me why?
I reasearch the source code and it basically stores the schema in a kafka topic as follows (topicname,magicbytes,version->key) (schema->value)
Therefore I don't see the problem of using the schema multiple times expect redundancy?
I think you are referring to this comment in the documentation:
We recommend users use the new producer in org.apache.kafka.clients.producer.KafkaProducer. If you are using a version of Kafka older than 0.8.2.0, you can plug KafkaAvroEncoder into the old producer in kafka.javaapi.producer. However, there will be some limitations. You can only use KafkaAvroEncoder for serializing the value of the message and only send value of type Avro record. The Avro schema for the value will be registered under the subject recordName-value, where recordName is the name of the Avro record. Because of this, the same Avro record type shouldn’t be used in more than one topic.
First, the commenter above is correct -- this only refers to the old producer API pre-0.8.2. It's highly recommended that you use the new producer anyway as it is a much better implementation, doesn't depend on the whole core jar, and is the client which will be maintained going forward (there isn't a specific timeline yet, but the old producer will eventually be deprecated and then removed).
However, if you are using the old producer, this restriction is only required if the schema for the two subjects might evolve separately. Suppose that you did write two applications that wrote to different topics, but use the same Avro record type, let's call it record. Now both applications will register it/look it up under the subject record-value and get assigned version=1. This is all fine as long as the schema doesn't change. But lets say application A now needs to add a field. When it does so, the schema will be registered under subject record-value and get assigned version=2. This is fine for application A, but application B has either not been upgraded to handle this schema, or worse, the schema isn't even valid for application B. However, you lose the protection the schema registry normally gives you -- now some other application could publish data of that format into the topic used by application B (it looks ok because record-value has that schema registered). Now application B could see data which it doesn't know how to handle since its not a schema it supports.
So the short version is that because with the old producer the subject has to be shared if you also use the same schema, you end up coupling the two applications and the schemas they must support. You can use the same schema across topics, but we suggest not doing so since it couples your applications (and their development, the teams developing them, etc).

Avro compatibility when only documentation of some of the field changes

I have been using avro as the data format for offline processing over kafka in my application. I have a use case where the producer uses a schema which is almost same as what is used by consumer except that producer has some changes in the documentation of some fields. Would the consumer be able to consume such events without erroring out? I'm debugging an issue where numerous events are missing in the data pipeline and trying to figure out the root cause. I noticed this difference and want to understand if at all this causes an issue.
You should probably test to confirm, but documentation should not impact schema resolution as the schema should get normalized to the "canonical form":
https://avro.apache.org/docs/1.7.7/spec.html#Parsing+Canonical+Form+for+Schemas