Avro Schema: Build Avro Schema from Schema Fields - scala

I am trying to write a function that calculates a diff between two Avro schemas and generates a new schema. For example:
schema_one = {
  "type": "record",
  "name": "schema_one",
  "namespace": "test",
  "fields": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "id",
      "type": "string"
    }
  ]
}
schema_two = {
  "type": "record",
  "name": "schema_two",
  "namespace": "test",
  "fields": [
    {
      "name": "type",
      "type": "string"
    }
  ]
}
To get the fields that are in schema_one but not in schema_two:
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+
import org.apache.avro.{Schema, SchemaBuilder}

val diff: Set[Schema.Field] =
  schema_one.getFields.asScala.toSet.filterNot(schema_two.getFields.asScala.toSet)
So far, so good.
I want to build a new schema from diff and I expect it to be:
schema_three = {
  "type": "record",
  "name": "schema_three",
  "namespace": "test",
  "fields": [
    {
      "name": "id",
      "type": "string"
    }
  ]
}
I can't seem to find any method in Avro's SchemaBuilder to achieve this without having to explicitly provide named fields, i.e. build a Schema given a collection of Schema.Field objects.
For example:
SchemaBuilder.record("schema_three").namespace("test").fromFields(diff)
Is there a way to achieve this? Appreciate comments.

I was able to achieve this using the Kite SDK ("org.kitesdk" % "kite-data-core" % "1.1.0"):
import org.kitesdk.data.spi.SchemaUtil // Kite's schema utilities

val schema_namespace = schema_one.getNamespace
val schema_name = schema_one.getName

// Build a single-field record schema per diffed field, then merge them.
val schemas = diff.map { f =>
  SchemaBuilder
    .record(schema_name)
    .namespace(schema_namespace)
    .fields()
    .name(f.name())
    .`type`(f.schema())
    .noDefault()
    .endRecord()
}

val schema_three = SchemaUtil.merge(schemas.asJava)
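For reference, the same record can also be assembled without Kite by copying the diffed fields and handing them to Schema.createRecord. A minimal sketch, assuming Avro 1.9+ (where Schema.Field exposes a (name, schema, doc, defaultVal) constructor and a defaultVal() accessor); the copy is needed because a Field already attached to schema_one cannot be added to another schema:
import scala.collection.JavaConverters._
import org.apache.avro.Schema

// Copy each field: a Schema.Field that already belongs to schema_one cannot be
// reused directly, so build fresh Field instances from the diff.
val copiedFields = diff.toList
  .map(f => new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
  .asJava

// Arguments: name, doc, namespace, isError, fields
val schema_three: Schema =
  Schema.createRecord("schema_three", null, schema_one.getNamespace, false, copiedFields)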

Related

JDBC sink topic with multiple structs to postgres

I am trying to sink a few topics to a Postgres database. However, the topic schema defines an array at the top level and, within it, multiple structs. Automapping does not work and I cannot find any reference for how to handle this. I need all the structs because they are dependent types: the second struct references the first struct as a field.
Currently it breaks when hitting the second struct, stating that statusChangeEvent (struct) has no mapping to a SQL column type. This is because it uses auto.create to make a table (probably called ProcessStatus), and when it hits the second entry there is of course no matching column.
[
  {
    "type": "record",
    "name": "processStatus",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "code",
        "doc": "The code of the processStatus",
        "type": "string"
      },
      {
        "name": "name",
        "doc": "The name of the processStatus",
        "type": "string"
      },
      {
        "name": "description",
        "type": "string"
      },
      {
        "name": "isCompleted",
        "type": "boolean"
      },
      {
        "name": "isSuccessfullyCompleted",
        "type": "boolean"
      }
    ]
  },
  {
    "type": "record",
    "name": "StatusChangeEvent",
    "namespace": "company.some.process",
    "fields": [
      {
        "name": "contNumber",
        "type": "string"
      },
      {
        "name": "processId",
        "type": "string"
      },
      {
        "name": "processVersion",
        "type": "int"
      },
      {
        "name": "extProcessId",
        "type": [
          "null",
          "string"
        ],
        "default": null
      },
      {
        "name": "fromStatus",
        "type": "process.status"
      },
      {
        "name": "toStatus",
        "doc": "The new status of the process",
        "type": "company.some.process.processStatus"
      },
      {
        "name": "changeDateTime",
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      {
        "name": "isPublic",
        "type": "boolean"
      }
    ]
  }
]
I am not using ksql at the moment. Which connector settings are suited for this task? If there is a ksql alternative it would be nice to know, but the current requirement is to use the JDBC connector.
I tried using Flatten, but it does not support struct fields that have a schema, which seems kind of weird. Aren't schemas the whole selling point of Connect with Kafka? Or is it more of a constraint you have to work around?
Aren't schemas the whole selling point of Connect with Kafka?
Yes, but Postgres (or the JDBC sink in general) doesn't really support nested objects within columns. For that, you're better off with a document database, such as MongoDB via the Mongo Sink Connector.
Which connector settings are suited for this task?
None, really, other than transforms. You could write your own transform if Flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns; however, that's more of a workaround.

What column type do I need for this nested data in BigQuery?

I have a JSON schema for a Kafka stream that I am integrating with BigQuery but I can't get the data type correct at the BigQuery end. This is the schema:
"my_meta_data": {
"type": "object",
"properties": {
"property_1": {
"type": "array",
"items": {
"type": "number"
}
},
"property_2": {
"type": "array",
"items": {
"type": "number"
}
},
"property_3": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
I tried this in the JSON file defining the BigQuery table:
{
  "name": "my_meta_data",
  "type": "RECORD",
  "mode": "REPEATED",
  "fields": [
    {
      "name": "property_1",
      "type": "INT64",
      "mode": "REPEATED"
    },
    {
      "name": "property_2",
      "type": "INT64",
      "mode": "REPEATED"
    },
    {
      "name": "property_3",
      "type": "INT64",
      "mode": "REPEATED"
    }
  ]
}
I am using a hosted connector from Confluent, the Kafka provider, and the error message is:
The connector is failing because it cannot write a non-array element to an array column. Please check the schemas of the data in Kafka and the BigQuery tables the connector is writing to, and ensure that all data from Kafka that will be written to an array column in BigQuery is contained in an array.
I haven't defined an array column though, I've defined a RECORD column that contains arrays. Any ideas how I can set up the BigQuery table to capture this data? Thanks in advance.

How to use Schema registry for Kafka Connect AVRO

I have started exploring Kafka and Kafka Connect recently and did some initial setup.
Now I want to explore the Schema Registry part more.
My Schema Registry is started; what should I do now?
I have an Avro schema stored in avro_schema.avsc.
Here is the schema:
{
  "name": "FSP-AUDIT-EVENT",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "ID",
      "type": "string"
    },
    {
      "name": "VERSION",
      "type": "int"
    },
    {
      "name": "ACTION_TYPE",
      "type": "string"
    },
    {
      "name": "EVENT_TYPE",
      "type": "string"
    },
    {
      "name": "CLIENT_ID",
      "type": "string"
    },
    {
      "name": "DETAILS",
      "type": "string"
    },
    {
      "name": "OBJECT_TYPE",
      "type": "string"
    },
    {
      "name": "UTC_DATE_TIME",
      "type": "long"
    },
    {
      "name": "POINT_IN_TIME_PRECISION",
      "type": "string"
    },
    {
      "name": "TIME_ZONE",
      "type": "string"
    },
    {
      "name": "TIMELINE_PRECISION",
      "type": "string"
    },
    {
      "name": "AUDIT_EVENT_TO_UTC_DT",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "AUDIT_EVENT_TO_DATE_PITP",
      "type": "string"
    },
    {
      "name": "AUDIT_EVENT_TO_DATE_TZ",
      "type": "string"
    },
    {
      "name": "AUDIT_EVENT_TO_DATE_TP",
      "type": "string"
    },
    {
      "name": "GROUP_ID",
      "type": "string"
    },
    {
      "name": "OBJECT_DISPLAY_NAME",
      "type": "string"
    },
    {
      "name": "OBJECT_ID",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "USER_DISPLAY_NAME",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "USER_ID",
      "type": "string"
    },
    {
      "name": "PARENT_EVENT_ID",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "NOTES",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "SUMMARY",
      "type": [
        "string",
        "null"
      ]
    }
  ]
}
Is my schema valid? I converted it online from JSON.
Where should I keep this schema file? I am not sure about the location.
Please guide me through the steps to follow.
I am sending records from both a Lambda function and a JDBC source.
So basically, how can I enforce the Avro schema and test it?
Do I have to change anything in the avro-consumer properties file?
Or is this the correct way to register the schema?
./bin/kafka-avro-console-producer \
--broker-list b-3.**:9092,b-**:9092,b-**:9092 --topic AVRO-AUDIT_EVENT \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"struct\",\"fields\":[{\"type\":\"string\",\"optional\":false,\"field\":\"ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"VERSION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"ACTION_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"EVENT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"CLIENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"DETAILS\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"UTC_DATE_TIME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"POINT_IN_TIME_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIME_ZONE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIMELINE_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"GROUP_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"PARENT_EVENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"NOTES\"},{\"type\":\"string\",\"optional\":true,\"field\":\"SUMMARY\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_UTC_DT\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_PITP\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TZ\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TP\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
What do I have to do next?
When I try to see my schema, I only get the following:
curl --silent -X GET http://localhost:8081/subjects/AVRO-AUDIT-EVENT/versions/latest
This is the result:
{"subject":"AVRO-AUDIT-EVENT","version":1,"id":161,"schema":"{\"type\":\"string\",\"optional\":false}"}
Why do I not see my full registered schema?
Also, when I try to delete the schema I get the error below:
{"error_code":405,"message":"HTTP 405 Method Not Allowed"}
I am not sure if my schema is registered correctly.
Please help me. Thanks in advance.
is my schema valid
You can use the REST API of the Registry to try and submit it and see...
where should i keep this schema file location i am not sure about
It's not clear how you're sending messages...
If you actually wrote Kafka producer code, you store it within your code (as a string) or as a resource file. If using Java or Scala, you can instead use the SchemaBuilder class to create the Schema object.
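For illustration, a minimal SchemaBuilder sketch in Scala covering just the first two fields plus one optional one (the record name here is an assumption: per the Avro spec, record names may only contain letters, digits, and underscores, so the hyphenated FSP-AUDIT-EVENT from the .avsc above would normally be rejected by a validating parser):
import org.apache.avro.{Schema, SchemaBuilder}

// Sketch: build (part of) the record programmatically instead of parsing the .avsc file.
// Underscores are used because '-' is not a legal character in Avro names.
val auditSchema: Schema = SchemaBuilder
  .record("FSP_AUDIT_EVENT").namespace("com.acme.avro")
  .fields()
    .requiredString("ID")
    .requiredInt("VERSION")
    .optionalString("OBJECT_ID") // union of null and string, default null
    // ... remaining fields ...
  .endRecord()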
You need to rewrite your producer to use Avro Schema and Serializers if you've not already
If we create an Avro schema, will it work for JSON as well?
Avro is a binary format, but there is a JsonDecoder for it.
What should the URL be in our Avro schema properties file?
It needs to be the address of your Schema Registry, once you figure out how to start it (with schema-registry-start).
Do I have to change anything in the avro-consumer properties file?
You need to use the Avro deserializer.
Is this the correct way to register the schema?
./bin/kafka-avro-console-producer ...
Not quite. That's how you produce a message with a schema (and you need to use the correct schema). You also must provide --property schema.registry.url.
You use the REST API of the Registry to register and verify schemas
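To make "rewrite your producer to use Avro" concrete, here is a minimal Scala sketch (a sketch only: the broker address, registry URL, and field values are placeholders, and it assumes the kafka-clients and Confluent kafka-avro-serializer dependencies are on the classpath):
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "b-1.example:9092") // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer") // Avro + Schema Registry
props.put("schema.registry.url", "http://localhost:8081")

// Parse the schema from the .avsc file and build a record against it.
// Note: the parser validates names, so the hyphenated record name would need fixing first (see above).
val schema: Schema = new Schema.Parser().parse(new java.io.File("avro_schema.avsc"))
val record: GenericRecord = new GenericData.Record(schema)
record.put("ID", "some-id")
// ... set the remaining required fields before sending ...

val producer = new KafkaProducer[String, GenericRecord](props)
producer.send(new ProducerRecord[String, GenericRecord]("AVRO-AUDIT_EVENT", record))
producer.close()
With the serializer's defaults, the schema is registered automatically on first send under the subject AVRO-AUDIT_EVENT-value (topic name plus "-value"), which you can then inspect with GET /subjects/AVRO-AUDIT_EVENT-value/versions/latest.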

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your ingestion spec. When this flag is set to true (which is the default), it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link for using the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like, since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the string_to_array expression should do the trick!

Does Pymongo have validation rules built in?

I am trying to validate inserted documents against a schema, and was trying to find a way to do this.
There are libraries like MongoEngine that say they do the work, but is there a way to do document validation directly via pymongo?
The Python driver docs are indeed a little light on how to use db.command. Here is a complete working example:
from pymongo import MongoClient
from collections import OrderedDict
import sys

client = MongoClient()  # supply connection args as appropriate
db = client.testX

db.myColl.drop()
db.create_collection("myColl")  # Force create!

# $jsonSchema expression type is preferred. New since v3.6 (2017):
vexpr = {"$jsonSchema":
    {
        "bsonType": "object",
        "required": [ "name", "year", "major", "gpa" ],
        "properties": {
            "name": {
                "bsonType": "string",
                "description": "must be a string and is required"
            },
            "gender": {
                "bsonType": "string",
                "description": "must be a string and is not required"
            },
            "year": {
                "bsonType": "int",
                "minimum": 2017,
                "maximum": 3017,
                "exclusiveMaximum": False,
                "description": "must be an integer in [ 2017, 3017 ] and is required"
            },
            "major": {
                "enum": [ "Math", "English", "Computer Science", "History", None ],
                "description": "can only be one of the enum values and is required"
            },
            "gpa": {
                # In case you might want to allow doubles OR int, then add
                # "int" to the bsonType array below:
                "bsonType": [ "double" ],
                "minimum": 0,
                "description": "must be a double and is required"
            }
        }
    }
}
# Per the docs, args to command() require that the first key/value pair
# be the command string and its principal argument, followed by other
# arguments. There are two ways to do this. Using an OrderedDict:
cmd = OrderedDict([('collMod', 'myColl'),
                   ('validator', vexpr),
                   ('validationLevel', 'moderate')])
db.command(cmd)

# Or, use the kwargs construct:
# db.command('collMod', 'myColl', validator=vexpr, validationLevel='moderate')

try:
    db.myColl.insert_one({"x": 1})
    print("NOT good; the insert above should have failed.")
except:
    print("OK. Expected exception:", sys.exc_info())

try:
    okdoc = {"name": "buzz", "year": 2019, "major": "Math", "gpa": 3.8}
    db.myColl.insert_one(okdoc)
    print("All good.")
except:
    print("exc:", sys.exc_info())
MongoDB supports document validation at the engine level so you'll pick it up via pymongo. You declare your "schema" (rules actually) to the engine. Here's a great place to start: https://docs.mongodb.com/manual/core/document-validation/
You can make a separate JSON file for your document validation schema, like this:
{
  "collMod": "users",
  "validator": {
    "$jsonSchema": {
      "bsonType": "object",
      "required": ["email", "password", "name"],
      "properties": {
        "email": {
          "bsonType": "string",
          "description": "Email address"
        },
        "password": {
          "bsonType": "string",
          "description": "A hash representation of the password"
        },
        "name": {
          "bsonType": "object",
          "required": ["first", "last"],
          "description": "Object separating given names and surnames",
          "properties": {
            "first": {
              "bsonType": "string",
              "description": "First and second given name"
            },
            "last": {
              "bsonType": "string",
              "description": "First and second surname"
            }
          }
        }
      }
    }
  }
}
Then you can use it in a Python script, for example:
from pymongo import MongoClient
import json  # parse the JSON file as a dict
from collections import OrderedDict  # preserve the (key, value) order of insertions into the dict

client = MongoClient("your_mongo_uri")
db = client.your_db_name

with open('your_schema_file.json', 'r') as j:
    d = json.loads(j.read())

d = OrderedDict(d)
db.command(d)
OrderedDict Info
collMod Info
Schema Validation Info
I know of two options to deal with this:
1. Create or set a schema for the collection, so any insertion will be checked against it on the server side and rejected or warned about, depending on validationAction.
The following code demonstrates schema creation and testing:
from pymongo import MongoClient

mongo_client = MongoClient(host=...,
                           port=...,
                           username=...,
                           password=...,
                           authSource=...,
                           authMechanism=...,
                           connect=True)
mongo_client.server_info()
db = mongo_client.your_db

users = db.create_collection(
    name="users",
    validator={"$jsonSchema": {
        "bsonType": "object",
        "required": ["username"],
        "properties": {
            "username": {
                "bsonType": "string",
                "pattern": "[a-z0-9]{5,15}",
                "description": "user name (required), only lowercase letters "
                               "and digits allowed, from 5 to 15 characters long"
            },
            "email": {
                "bsonType": "string",
                "description": "User's email (optional)"
            },
        }
    }},
    validationAction="error",
)

# Inserting a user document that fits the schema
users.insert_one({"username": "admin", "email": "some_admin_mail"})

# The insertion below will be rejected with "pymongo.errors.WriteError: Document failed validation"
# caused by the too-short username ("root")
users.insert_one({"username": "root", "email": "some_root_mail"})
2. You can treat your Mongo documents as ordinary JSON entities and check them on the client side using standard JSON Schema validation:
from jsonschema import validate
from jsonschema.exceptions import ValidationError

db = MongoClient(...).your_db

schema = {
    "type": "object",
    "required": ["username"],
    "properties": {
        "username": {"type": "string", "pattern": "[a-z0-9]{5,15}"},
        "email": {"type": "string"},
    },
}

try:
    new_user = {"username": "admin", "email": "some_admin_mail"}
    # No exception will be raised in validation below
    validate(instance=new_user, schema=schema)
    db.users.insert_one(new_user)

    new_user = {"username": "root", "email": "some_root_mail"}
    # Exception <ValidationError: 'root' does not match '[a-z0-9]{5,15}'> will be raised
    validate(instance=new_user, schema=schema)
    db.users.insert_one(new_user)
except ValidationError:
    # Performing error