Kafka Connect JSON Schema does not appear to support "$ref" tags

I am using Kafka Connect with JSON Schema and am in a situation where I need to convert the JSON schema manually (to a Connect "Schema") within a Kafka Connect plugin. I can successfully retrieve the JSON schema from the Schema Registry, and conversion works for simple JSON schemas, but I am having difficulties with complex ones that contain valid "$ref" keywords referencing components within a single JSON schema definition.
I have several questions:
The JsonConverter.java does not appear to handle "$ref". Am I correct, or does it handle it in another way elsewhere?
Does the Schema Registry handle the referencing of sub-definitions? If yes, is there code that shows how the dereferencing is handled?
Should the JSON schema be resolved to a string without references (i.e., inline the references) before submitting to the Schema Registry, thereby removing the "$ref" issue?
I am looking at the Kafka Source code module JsonConverter.java below:
https://github.com/apache/kafka/blob/trunk/connect/json/src/main/java/org/apache/kafka/connect/json/JsonConverter.java#L428
An example of the complex schema (taken from the JSON Schema site) is shown below (notice the "$ref": "#/$defs/veggie" keyword, which references a later sub-definition):
{
  "$id": "https://example.com/arrays.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "A representation of a person, company, organization, or place",
  "title": "complex-schema",
  "type": "object",
  "properties": {
    "fruits": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "vegetables": {
      "type": "array",
      "items": { "$ref": "#/$defs/veggie" }
    }
  },
  "$defs": {
    "veggie": {
      "type": "object",
      "required": [ "veggieName", "veggieLike" ],
      "properties": {
        "veggieName": {
          "type": "string",
          "description": "The name of the vegetable."
        },
        "veggieLike": {
          "type": "boolean",
          "description": "Do I like this vegetable?"
        }
      }
    }
  }
}
Below is the actual response returned from the Schema Registry after the schema was successfully registered:
[
  {
    "subject": "complex-schema",
    "version": 1,
    "id": 1,
    "schemaType": "JSON",
    "schema": "{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}"
  }
]
The actual schema is embedded in the above returned string (the contents of the "schema" field) and contains the $ref references:
{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}

The JsonConverter in the Apache Kafka source code has no notion of JSON Schema, so no, $ref doesn't work, and the converter doesn't integrate with the Schema Registry either.
You seem to be looking for the io.confluent.connect.json.JsonSchemaConverter class and its logic.
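For reference, wiring that converter in is a configuration change rather than code; here is a minimal sketch of a connector config (the file sink, topic name, and registry hostname are placeholder assumptions, not from the question):

{
  "name": "complex-schema-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "topics": "complex-schema-topic",
    "file": "/tmp/complex-schema.out",
    "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

As for question 3 (inlining the references first): if you have to stay on a converter without $ref support, dereferencing the schema up front is a workable escape hatch. Below is a hedged sketch using the @apidevtools/json-schema-ref-parser npm package; since the example schema has no circular references, the dereferenced result can be serialized back to a $ref-free string:

const $RefParser = require('@apidevtools/json-schema-ref-parser');

$RefParser.dereference(schema).then((inlined) => {
  // Every internal "$ref" now points at the actual subschema, so the
  // serialized form below contains no "$ref" keywords at all.
  const refFree = JSON.stringify(inlined);
  // register `refFree` with the Schema Registry, or convert it directly
});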

Related

JDBC sink topic with multiple structs to postgres

I am trying to sink a few topics to a Postgres database. However, the topic schema defines an array at the top level, with multiple structs within it. Automapping does not work, and I cannot find any reference for how to handle this. I need all the structs because they are dependent types; the second struct references the first struct as a field.
Currently it breaks when hitting the second struct, stating that StatusChangeEvent (a struct) has no mapping to a SQL column type. This is because auto.create is used to make a table (probably called ProcessStatus); when the second entry is hit, there is of course no matching column.
[
  {
    "type": "record",
    "name": "processStatus",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "code",
        "doc": "The code of the processStatus",
        "type": "string"
      },
      {
        "name": "name",
        "doc": "The name of the processStatus",
        "type": "string"
      },
      {
        "name": "description",
        "type": "string"
      },
      {
        "name": "isCompleted",
        "type": "boolean"
      },
      {
        "name": "isSuccessfullyCompleted",
        "type": "boolean"
      }
    ]
  },
  {
    "type": "record",
    "name": "StatusChangeEvent",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "contNumber",
        "type": "string"
      },
      {
        "name": "processId",
        "type": "string"
      },
      {
        "name": "processVersion",
        "type": "int"
      },
      {
        "name": "extProcessId",
        "type": [
          "null",
          "string"
        ],
        "default": null
      },
      {
        "name": "fromStatus",
        "type": "process.status"
      },
      {
        "name": "toStatus",
        "doc": "The new status of the process",
        "type": "company.some.process.processStatus"
      },
      {
        "name": "changeDateTime",
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      {
        "name": "isPublic",
        "type": "boolean"
      }
    ]
  }
]
I am not using ksqlDB at the moment. Which connector settings are suited for this task? If there is a ksqlDB alternative it would be nice to know, but the current requirement is to use the JDBC connector.
I tried using the Flatten transform, but it does not support struct fields that have a schema, which seems kind of weird. Aren't schemas the whole selling point of Connect with Kafka?
Aren't schemas the whole selling point of Connect with Kafka?
Yes, but Postgres (or the JDBC sink in general) doesn't really support nested objects within columns. For that, you're better off with a document database, such as via the MongoDB Sink Connector.
Which connector settings are suited for this task?
None, really, other than transforms. You could write your own if flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns; however, that's more of a workaround.
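If you go the pre-defined-table route, the connector side of the workaround is mostly about stopping the sink from creating or evolving the table itself. Here is a sketch using standard JDBC Sink properties (the connector name, topic, URL, and table name are illustrative assumptions; the JSONB columns would be created by hand, and whether the struct values then land cleanly still depends on your converter setup):

{
  "name": "process-status-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "process-status",
    "connection.url": "jdbc:postgresql://postgres:5432/mydb",
    "auto.create": "false",
    "auto.evolve": "false",
    "insert.mode": "insert",
    "table.name.format": "status_change_event"
  }
}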

Default value for a record in AVRO?

I wanted to add a new field of type "record" to an Avro schema; the field cannot be null and therefore has a default value. The topic is set to compatibility type FULL_TRANSITIVE.
The schema did not change from the last version, only the last field produktType was added:
{
  "type": "record",
  "name": "Finished",
  "namespace": "com.domain.finishing",
  "doc": "Schema to indicate the end of the ongoing saga...",
  "fields": [
    {
      "name": "numberOfAThing",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "previousNumbersOfThings",
      "type": {
        "type": "array",
        "items": {
          "type": "string",
          "avro.java.string": "String"
        }
      },
      "default": []
    },
    {
      "name": "produktType",
      "type": {
        "type": "record",
        "name": "ProduktType",
        "fields": [
          {
            "name": "art",
            "type": "int",
            "default": 1
          },
          {
            "name": "code",
            "type": "int",
            "default": 10003
          }
        ]
      },
      "default": { "art": 1, "code": 10003 }
    }
  ]
}
I've checked with the schema-registry that the new version of the schema is compatible.
But when we try to read old messages that do not contain the new field using the new schema (where the defaults are), there is an EOFException and it does not seem to work.
The part that causes headaches is the newly added field "produktType". It cannot be null, so we tried adding defaults, which works for primitive-type fields ("int" and so on). The line "default": { "art": 1, "code": 10003 } seems to be accepted by the schema registry but does not seem to have an effect when we read messages from the topic that do not contain this field.
The schema registry also marks the schema as not compatible when the last "default": { "art": 1, "code": 10003 } line is missing (although "default": true also passes the compatibility check...).
The Avro specification for complex types contains an example for type "record" with default {"a": 1}, so that is where we got the idea from. But since it's not working, something is still wrong.
There are similar questions, like this one claiming records can only have null as a default, or this unanswered one.
Is this supposed to work? And if so how can defaults for these "type": "record" fields be defined? Or is it still true that records can only have null as default?
Thanks!
Update on the compatibility cases:
Schema V1 (old one without the new field): can read v1 and v2 records.
Schema V2 (new field added): cannot read v1 records, can read v2 records
The case where a consumer using schema v2 encounters records written with v1 is the surprising one, as I thought the defaults are for exactly that purpose.
Even weirder: when I don't set the new field's values at all, the v2 record still ends up containing some values (shown in a screenshot omitted here).
I have no idea where the value for code comes from; the schema uses other numbers for its defaults.
So one of them seems to work, the other does not.
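One way to check whether plain Avro schema resolution behaves as expected, outside Kafka and the registry, is to replay the v2-reads-v1 case directly. Here is a small sketch using the avsc npm library (schemas trimmed to the shape that matters; it assumes avsc follows the spec's schema-resolution rules):

const avro = require('avsc');

// Writer schema (v1): without the new record-typed field.
const v1 = avro.Type.forSchema({
  type: 'record',
  name: 'Finished',
  fields: [{ name: 'numberOfAThing', type: 'string' }]
});

// Reader schema (v2): adds produktType with a record default.
const v2 = avro.Type.forSchema({
  type: 'record',
  name: 'Finished',
  fields: [
    { name: 'numberOfAThing', type: 'string' },
    {
      name: 'produktType',
      type: {
        type: 'record',
        name: 'ProduktType',
        fields: [
          { name: 'art', type: 'int', default: 1 },
          { name: 'code', type: 'int', default: 10003 }
        ]
      },
      default: { art: 1, code: 10003 }
    }
  ]
});

// Resolve v1-encoded data against the v2 reader schema.
const resolver = v2.createResolver(v1);
const buf = v1.toBuffer({ numberOfAThing: 'abc' });
console.log(v2.fromBuffer(buf, resolver));
// If resolution works per the spec, produktType comes back as its default.

If this succeeds while the registry-based consumer fails, the problem is more likely in the deserializer/registry path than in the schema itself.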

How to validate http PATCH data with a JSON schema having required fields

I'm trying to validate a classic JSON schema (with Ajv and json-server) with required fields, but for an HTTP PATCH request.
It's okay for a POST, because the data arrives in full and nothing should be omitted.
However, the required fields of the schema cause problems when attempting to PATCH an existing resource.
Here's what I'm currently doing for a POST:
const schema = require('./movieSchema.json');
const validate = new Ajv().compile(schema);
// ...
movieValidation.post('/:id', function (req, res, next) {
  const valid = validate(req.body);
  if (!valid) {
    const [err] = validate.errors;
    let field = (err.keyword === 'required') ? err.params.missingProperty : err.dataPath;
    return res.status(400).json({
      errorMessage: `Erreur de type '${err.keyword}' sur le champs '${field}' : '${err.message}'`
    });
  }
  next();
});
... but if I do the same for a movieValidation.patch(...) and try to send only this chunk of data:
{
  "release_date": "2020-07-15",
  "categories": [
    "Action",
    "Aventure",
    "Science-Fiction",
    "Thriller"
  ]
}
... it will fail the whole validation (even though all the fields present are okay and validate against the schema).
Here's my complete moviesSchema.json:
{
  "type": "object",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "properties": {
    "title": {
      "title": "Titre",
      "type": "string",
      "description": "Titre complet du film"
    },
    "release_date": {
      "title": "Date de sortie",
      "description": "Date de sortie du film au cinéma",
      "type": "string",
      "format": "date",
      "example": "2019-06-28"
    },
    "categories": {
      "title": "Catégories",
      "description": "Catégories du film",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "description": {
      "title": "Résumé",
      "description": "Résumé du film",
      "type": "string"
    },
    "poster": {
      "title": "Affiche",
      "description": "Affiche officielle du film",
      "type": "string",
      "pattern": "^https?:\/\/"
    },
    "backdrop": {
      "title": "Fond",
      "description": "Image de fond",
      "type": "string",
      "pattern": "^https?:\/\/"
    }
  },
  "required": [
    "title",
    "release_date",
    "categories",
    "description"
  ],
  "additionalProperties": false
}
For now, I've done the trick using a different schema, which is the same as the original one but without the required clause at the end. I don't like this solution, though, as it duplicates code unnecessarily (and it's not elegant at all).
Is there any clever solution/tool to achieve this properly?
Thanks
If you're using HTTP PATCH correctly, there's another way to deal with this problem.
The body of a PATCH request is supposed to be a diff media type of some kind, not plain JSON. The diff media type defines a set of operations (add, remove, replace) to perform to transform the JSON. The diff is then applied to the original resource, resulting in the new state of the resource. A couple of JSON diff media types are JSON Patch (more powerful) and JSON Merge Patch (more natural).
If you validate the request body for a PATCH, you aren't really validating your resource, you are validating the diff format. However, if you apply the patch to your resource first, then you can validate the result with the full schema (then persist the changes or 400 depending on the result).
Remember, in REST it's resources and representations that matter, not requests and responses.
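To make the apply-then-validate approach concrete, here is a hedged sketch using JSON Merge Patch (RFC 7386), hand-rolled to stay dependency-free; getMovieById and the persistence step are hypothetical placeholders, and validate is the compiled full schema from earlier:

// Minimal JSON Merge Patch (RFC 7386): null removes a member,
// objects merge recursively, anything else replaces wholesale.
function mergePatch(target, patch) {
  if (patch === null || typeof patch !== 'object' || Array.isArray(patch)) {
    return patch;
  }
  const result = (target && typeof target === 'object' && !Array.isArray(target))
    ? { ...target }
    : {};
  for (const [key, value] of Object.entries(patch)) {
    if (value === null) {
      delete result[key]; // null means "remove this member"
    } else {
      result[key] = mergePatch(result[key], value);
    }
  }
  return result;
}

movieValidation.patch('/:id', function (req, res, next) {
  const current = getMovieById(req.params.id); // hypothetical lookup
  const patched = mergePatch(current, req.body);
  if (!validate(patched)) { // the full schema, required fields included
    const [err] = validate.errors;
    return res.status(400).json({ errorMessage: err.message });
  }
  // persist `patched` here, then continue
  next();
});

This way the single schema keeps doing its job: it always validates a complete resource, never a fragment.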
It's not uncommon to have multiple schemas, one per payload you want to validate.
In your situation, it looks like you've done exactly the right thing.
You can de-duplicate your schemas using references ($ref), splitting your property subschemas into a separate file.
You end up with a schema which contains your model, and a schema for each representation of said model, but without duplication (where possible).
If you need more guidance on how exactly you go about this, please comment and I'll update the answer with more details.
Here is an example of how you could do what you want.
You will need to create multiple schema files, and reference the right schema when you need to validate POST or PATCH requests.
I've simplified the examples to only include "title".
In one schema, you have something like...
{
  "$id": "https://example.com/object/movie",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {
    "movie": {
      "properties": {
        "title": {
          "$ref": "#/definitions/title"
        }
      },
      "additionalProperties": false
    },
    "title": {
      "title": "Titre",
      "type": "string",
      "description": "Titre complet du film"
    }
  }
}
Then you would have one for PATCH and one for POST...
{
  "$id": "https://example.com/movie/patch",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "allOf": [{
    "$ref": "/object/movie#/definitions/movie"
  }]
}
{
  "$id": "https://example.com/movie/post",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "allOf": [{
    "$ref": "/object/movie#/definitions/movie"
  }],
  "required": ["title"]
}
Change example.com to whatever domain you want to use for your ID.
You will then need to load all of your schemas into the implementation.
Once loaded in, the references will work, as they are based on URI resolution, using $id for each schema resource.
Notice the $ref values in the POST and PATCH schemas do not start with a #, meaning the target is not THIS schema resource.
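To tie this together, here is what loading and using those schemas might look like with Ajv (the file names are assumptions; the $id values must match the schemas above):

const Ajv = require('ajv');

const movieDefs = require('./movie-definitions.json'); // $id: https://example.com/object/movie
const patchSchema = require('./movie-patch.json');     // $id: https://example.com/movie/patch
const postSchema = require('./movie-post.json');       // $id: https://example.com/movie/post

// Registering all three up front lets Ajv resolve the cross-schema
// $refs by URI, using each schema's $id as its base.
const ajv = new Ajv({ schemas: [movieDefs, patchSchema, postSchema] });

const validatePost = ajv.getSchema('https://example.com/movie/post');
const validatePatch = ajv.getSchema('https://example.com/movie/patch');

validatePost({});            // false: "title" is required on POST
validatePatch({});           // true: nothing is required on PATCH
validatePatch({ title: 1 }); // false: wrong type for "title"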

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a JSON file in Azure Blob Storage, and my sink is an Azure SQL database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different JSON formats, but the copy activity always seems to ignore it.
For example, I've tried:
{
  "type": "TabularTranslator",
  "columnMappings": {
    "view.url": "url"
  }
}
and:
"columnMappings": {
  "view.url": "url"
}
and:
{
  "view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) sign in your column name.
To use column mapping, you should also specify the structure in your source and sink datasets.
For your source dataset, you need to specify your format correctly. And since your column name has a dot, you need to specify the JSON path accordingly.
You could use the ADF UI to set up a copy for a single file first, to get the related format, structure, and column mapping format, then change it to the lookup.
As I understand it, your first format should be the right one. If it is already in JSON format, then you may not need the "json" function in your expression.
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with dynamic mapping. The sink doesn't require one, as we're going to specify it in the mapping.
Within the copy activity, format the dynamic json like the following:
{
  "structure": [
    {
      "name": "Address Number"
    },
    {
      "name": "Payment ID"
    },
    {
      "name": "Document Number"
    },
    ...
    ...
  ]
}
You would then specify your dynamic mapping like this:
{
  "translator": {
    "type": "TabularTranslator",
    "mappings": [
      {
        "source": {
          "name": "Address Number",
          "type": "Int32"
        },
        "sink": {
          "name": "address_number"
        }
      },
      {
        "source": {
          "name": "Payment ID",
          "type": "Int64"
        },
        "sink": {
          "name": "payment_id"
        }
      },
      {
        "source": {
          "name": "Document Number",
          "type": "Int32"
        },
        "sink": {
          "name": "document_number"
        }
      },
      ...
      ...
    ]
  }
}
Assuming these were set in separate variables, you would want to send the source as a string, and the mapping as JSON:
source: #string(json(variables('str_dyn_structure')).structure)
mapping: #json(variables('str_dyn_translator')).translator
VladDrak - You could skip the source dynamic definition by building dynamic mapping like this:
{
  "translator": {
    "type": "TabularTranslator",
    "mappings": [
      {
        "source": {
          "type": "String",
          "ordinal": "1"
        },
        "sink": {
          "name": "dateOfActivity",
          "type": "String"
        }
      },
      {
        "source": {
          "type": "String",
          "ordinal": "2"
        },
        "sink": {
          "name": "CampaignID",
          "type": "String"
        }
      }
    ]
  }
}

Why does OpenAPI not define '$ref' as an allowed property?

By comparison, draft-07 defines:
{
  "type": ["object", "boolean"],
  "properties": {
    ...
    "$ref": {
      "type": "string",
      "format": "uri-reference"
    },
  }
  ...
}
Currently I am trying to write a validator for OpenAPI, but validation fails because the OpenAPI schema (yes, it is the schema from Google APIs) does not define $ref as an allowed property for a schema.
Is this a typo? What is the recommended way to check the $ref property?
$ref is a JSON Reference. It's not part of the schema definition; instead, it's part of the reference definition:
"reference": {
"type": "object",
"description": "A simple object to allow referencing other components in the specification, internally and externally. The Reference Object is defined by JSON Reference and follows the same structure, behavior and rules. For this specification, reference resolution is accomplished as defined by the JSON Reference specification and not by the JSON Schema specification.",
"required": [
"$ref"
],
"additionalProperties": false,
"properties": {
"$ref": {
"type": "string"
}
}
},
And then other definitions where $ref is allowed use oneOf with either a schema or a reference (example):
"schemaOrReference": {
"oneOf": [
{
"$ref": "#/definitions/schema"
},
{
"$ref": "#/definitions/reference"
}
]
},
By the way, there are currently two different draft OAS3 JSON Schemas in the official OpenAPI Specification repository. Feel free to try them instead, and provide your feedback in the corresponding discussions.
[WIP] Alternative OAS3 JSON Schema (link to schema)
OpenAPI v3 JSON Schema (link to schema)