Need Help Configuring NiFi's JsonRecordSetWriter and AvroSchemaRegistry - type-conversion

I am creating a NiFi workflow to convert CSV to JSON, and I need help configuring ConvertRecord's JsonRecordSetWriter controller service.
What is happening is that a SchemaNotFoundException is being thrown saying
Unable to find schema with name 'ccr' (the name I chose for the schema).
The schema is inferred from the header in the CSV document using "InferAvroSchema", and "UpdateAttribute" is configured to add an attribute named "schema.name" set to 'ccr' (per guidance from other how-tos).
The JsonRecordSetWriter is configured to use the Controller Service "AvroSchemaRegistry" with a property added to it named "ccr" and the value for this property is set to "${inferred.avro.schema}".
I would like the derived schema contained in the attribute "inferred.avro.schema" to be used, instead of having to supply the actual text of the Avro schema as the value of this added property. InferAvroSchema's Schema Output Destination property is set to "flowfile-attribute", meaning the inferred Avro schema is written to an attribute named "inferred.avro.schema".
I cannot specify the schema as literal text because of the project's requirements; rather, I would like to use the inferred schema so that CSV files with differing headers and data content can be processed by the same workflow.
Any help and guidance you can share would be greatly appreciated.

If you have a reader or writer with Schema Access Strategy set to "Schema Name" then it has to retrieve the schema by name from a schema registry, and the schema registry won't know anything about ${inferred.avro.schema} which is on a flow file.
You can set your writer to use Schema Access Strategy of "Schema Text" and in the schema text field put ${inferred.avro.schema} so it will dynamically get the schema text from the incoming flow file. You aren't using a schema registry at this point based on your requirements.
A different option, which may work for you: if you are on the 1.4.0 release, you could eliminate InferAvroSchema. You would use ConvertRecord with a CSVReader and set the Schema Access Strategy to "Use String Fields From Header" so the reader will infer a schema, then in your JsonRecordSetWriter set the Schema Access Strategy to "Inherit from Reader" so that it uses the same schema determined by the reader. The inherit capability doesn't exist in earlier releases, which is why this depends on 1.4.0.
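For reference, the relevant settings would look roughly like this (property labels are from memory of NiFi 1.x, e.g. the writer's inherit strategy is shown as "Inherit Record Schema" in the UI, so double-check them against your version):
Option 1 - keep InferAvroSchema:
  JsonRecordSetWriter
    Schema Access Strategy = Use 'Schema Text' Property
    Schema Text            = ${inferred.avro.schema}
Option 2 - NiFi 1.4.0+, no InferAvroSchema needed:
  CSVReader
    Schema Access Strategy = Use String Fields From Header
  JsonRecordSetWriter
    Schema Access Strategy = Inherit Record Schema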

Related

Why use a schema registry

I just started working with Kafka, I use Protocol Buffers for the message format, and I just learned about the schema registry.
To give some context, we are a small team with a dozen web services, we use Kafka to communicate between them, and we store all the schemas and read/write models in a library that is later imported by each service. This way they know how to serialize/deserialize messages.
But now the schema registry comes into play. Why use it? Now my infrastructure becomes more complicated, plus I need to update it every time I change a schema, and I still need to define the read/write models in each service like I do now using the library.
So from my point of view I only see cons, mainly just complicating things, so why should I use a schema registry?
Thanks
The schema registry ensures your messages will not deviate from a common base compatibility guarantee (the first version of the schema).
For example, you have a schema that describes an event like {"first_name": "Jane", "last_name": "Doe"}, but then later decide that names can actually have more than 2 parts, so you then move to a schema that can support {"name": "Jane P. Doe"}... You still need a way to deserialize old data with first_name and last_name fields to migrate to the new schema having only name. Therefore, consumers will need both schemas. The registry will hold that and encode the schema ID within each payload from the producer. After all, the initial events with the two name fields would know nothing about the "future" schema with only name.
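To make the schema-ID point concrete, here is a rough sketch of what a consumer could do with a Confluent-framed payload (the registry address is made up, I'm using the Avro flavour of the wire format since the rest of this thread is Avro-based, and in practice your client library's deserializer does this lookup for you):
import requests
from confluent_kafka import avro

REGISTRY = "http://localhost:8081"  # hypothetical registry address

def writer_schema_for(payload: bytes):
    # Confluent wire format: 1 magic byte, a 4-byte big-endian schema id,
    # then the serialized data. The id tells the consumer which version
    # (e.g. first_name/last_name vs. name) the producer used at write time.
    schema_id = int.from_bytes(payload[1:5], "big")
    resp = requests.get(f"{REGISTRY}/schemas/ids/{schema_id}")
    resp.raise_for_status()
    return avro.loads(resp.json()["schema"])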
You say your models are shared in libraries across services. You probably then have some regression testing and release cycle to publish these between services? The registry will allow you to centralize that logic.

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 defines a read (projection) schema v1,
and another consumer C2 defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by adding a new optional field to its schema. This is a backward-compatible change.
Messages encoded without this field will still be translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies consumer C1's need by adding the new field to its schema. The field doesn't have to be required, as this is a forward-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forwards-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by adding a new field to its read schema. Since the field is not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
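For reference, the check I'm referring to looks roughly like this (registry address and subject name are made up):
import json, requests

candidate_read_schema = {
    "type": "record", "name": "Order",
    "fields": [{"name": "id", "type": "string"},
               {"name": "note", "type": ["null", "string"], "default": None}]}

# Ask the registry whether this schema is compatible with the latest version
# stored under the subject; it answers {"is_compatible": true/false}.
resp = requests.post(
    "http://localhost:8081/compatibility/subjects/orders-value/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(candidate_read_schema)})
resp.raise_for_status()
print(resp.json())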
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema can not be used, because it is configured with FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution, because each consumer must hold all versions of the write schema locally.
It's going to be a pain to keep all the consumers in sync when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
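A rough sketch of that approach, using the standard Schema Registry REST API (registry address and subject name are made up):
import json, requests

REGISTRY = "http://localhost:8081"       # hypothetical registry address
READ_SUBJECT = "orders-c1-read"          # hypothetical per-consumer subject
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# One-time setup: the consumer's own subject enforces BACKWARD_TRANSITIVE.
requests.put(f"{REGISTRY}/config/{READ_SUBJECT}", headers=HEADERS,
             json={"compatibility": "BACKWARD_TRANSITIVE"}).raise_for_status()

def publish_read_schema(schema: dict) -> int:
    # Register the consumer's next read schema; the registry rejects it
    # (HTTP 409) if it breaks compatibility with any earlier version.
    resp = requests.post(f"{REGISTRY}/subjects/{READ_SUBJECT}/versions",
                         headers=HEADERS,
                         json={"schema": json.dumps(schema)})
    resp.raise_for_status()
    return resp.json()["id"]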
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.

How to deserialize Avro schema and then abandon schema before write to ES Sink Connector using SMT

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but after that I don't need any sort of schema at all. Each record has a dynamic set of fields, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This both keeps the code simpler and, I would assume, improves performance, since I don't have to recreate the schema for each record.
What I tried
I tried doing something like the applySchemaless code shown here, even when the record has a schema, by returning something like this, with null for the schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?

Copy json blob to ADX table

I have an ADF pipeline with a copy activity that copies a JSON blob to Kusto.
I did the following:
1. Created a JSON mapping in the Kusto table.
2. In the "Sink" section of the copy activity, set the Ingestion mapping name field to the name from #1.
3. In the mapping section of the copy activity, mapped all the fields.
When I run the copy activity, I get the following error:
"Failure happened on 'Sink' side. ErrorCode=UserErrorKustoWriteFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failure status of the first blob that failed: Mapping reference wasn't found.,Source=Microsoft.DataTransfer.Runtime.KustoConnector,'"
I looked in kusto for ingestion failures and I see this:
Mapping reference 'mapping1' of type 'mappingReference' in database '' could not be found.
Why am I seeing those errors even though I have an ingestion mapping on the table and what do I need to do to correct it?
It might be that the ingestion format specified in ADF is not JSON.
Well, after I removed the mapping name in the sink section, it works.
Looks like the docs are out of date, because they state that you can define both:
"ingestionMappingName Name of a pre-created mapping on a Kusto table. To map the columns from source to Azure Data Explorer (which applies to all supported source stores and formats, including CSV/JSON/Avro formats), you can use the copy activity column mapping (implicitly by name or explicitly as configured) and/or Azure Data Explorer mappings."

For AvroProducer to Kafka, where are the Avro schemas for "key" and "value"?

From the AvroProducer example in the confluent-kafka-python repo, it appears that the key/value schema are loaded from files. That is, from this code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
value_schema = avro.load('ValueSchema.avsc')
key_schema = avro.load('KeySchema.avsc')
value = {"name": "Value"}
key = {"name": "Key"}
avroProducer = AvroProducer({'bootstrap.servers': 'mybroker,mybroker2', 'schema.registry.url': 'http://schem_registry_host:port'}, default_key_schema=key_schema, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value=value, key=key)
it appears that the files ValueSchema.avsc and KeySchema.avsc are loaded independently of the Avro Schema Registry.
Is this right? What's the point of referencing the URL for the Avro Schema Registry, but then loading the schemas from disk for the key and value?
Please clarify.
I ran into the same issue, where it was initially unclear what the point of the local files is. As mentioned by the other answers, for the first write to an Avro topic, or an update to the topic's schema, you need the schema string - you can see this from the Kafka REST documentation here.
Once you have the schema in the registry, you can read it with REST (I used the requests Python module in this case) and use the avro.loads() method to get it. I found this useful because the produce() function requires that you have a value schema for your AvroProducer, and this code will work without that local file being present:
get_schema_req_data = requests.get("http://1.2.3.4:8081/subjects/sample_value_schema/versions/latest")
get_schema_req_data.raise_for_status()
schema_string = get_schema_req_data.json()['schema']
value_schema = avro.loads(schema_string)
avroProducer = AvroProducer({'bootstrap.servers': '1.2.3.4:9092', 'schema.registry.url': 'http://1.2.3.4:8081'}, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value={"data" : "that matches your schema" })
Hope this helps.
That is just one way to create a key and value schema in the Schema Registry in the first place. You can create them in the SR first using the SR REST API, or you can create new schemas or new versions of existing schemas in the SR by publishing them with new messages. It's entirely your choice which method you prefer.
Take a look at the code and consider that the schema from the registry is needed by a consumer rather than a producer. The MessageSerializer registers the schema in the schema registry for you :)