How to validate my data with a jsonSchema in Scala

I have a dataframe which looks like that
+------------------+---------+------+------+
|                id|migration|number|string|
+------------------+---------+------+------+
|[5e5db036e0403b1a.|      mig|     1|   str|
+------------------+---------+------+------+
and I have a jsonSchema:
{
  "title": "Section",
  "type": "object",
  "additionalProperties": false,
  "required": ["migration", "id"],
  "properties": {
    "migration": {
      "type": "string",
      "additionalProperties": false
    },
    "string": {
      "type": "string"
    },
    "number": {
      "type": "number",
      "min": 0
    }
  }
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
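One possible approach, independent of the answer below (an assumption on my side, not something Spark ships with), is to serialize each row with toJSON and run it through a JSON Schema validator such as the everit json-schema library. Note that the schema above requires "id" but never declares it under properties while forbidding additional properties, and uses "min" rather than the standard "minimum"; the sketch below adjusts both, and df stands for your DataFrame.

// Sketch only: validate every row of the DataFrame against the JSON Schema.
// Assumes the everit json-schema library (com.github.everit-org.json-schema) is on the classpath.
import org.everit.json.schema.ValidationException
import org.everit.json.schema.loader.SchemaLoader
import org.json.JSONObject

val schemaJson =
  """{
    |  "title": "Section",
    |  "type": "object",
    |  "additionalProperties": false,
    |  "required": ["migration", "id"],
    |  "properties": {
    |    "id":        { "type": "string" },
    |    "migration": { "type": "string" },
    |    "string":    { "type": "string" },
    |    "number":    { "type": "number", "minimum": 0 }
    |  }
    |}""".stripMargin

val schema = SchemaLoader.load(new JSONObject(schemaJson))

// df.toJSON yields one JSON string per row; validate() throws ValidationException on a bad row.
df.toJSON.collect().foreach { rowJson =>
  try schema.validate(new JSONObject(rowJson))
  catch {
    case e: ValidationException => println(s"invalid row $rowJson: ${e.getMessage}")
  }
}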

Please find the explanation in the inline code comments.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructField, StructType}

// Load your schema from a JSON string
val newSchema: StructType = DataType.fromJson(
  """{
    |  "type": "struct",
    |  "fields": [
    |    { "name": "id",        "type": "string",  "nullable": true,  "metadata": {} },
    |    { "name": "migration", "type": "string",  "nullable": true,  "metadata": {} },
    |    { "name": "number",    "type": "integer", "nullable": false, "metadata": {} },
    |    { "name": "string",    "type": "string",  "nullable": true,  "metadata": {} }
    |  ]
    |}""".stripMargin).asInstanceOf[StructType]
// println(newSchema)

val spark = Constant.getSparkSess // Create the SparkSession object

// Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.", "mig", 1, "str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validates the data against the schema
dfNew.show()

// Incorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.", 1, 1, "str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema)
dfInvalid.show() // surfaces RuntimeException: java.lang.Integer is not a valid external type for schema of string

val res = spark.sql("") // Load your SQL DataFrame here
val diffColumn: Seq[StructField] = res.schema.diff(newSchema) // compare the SQL DataFrame's schema with the JSON schema
diffColumn.foreach(field => println(field.name)) // print the differing columns
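Building on the diff idea above, here is a small reusable sketch (validateSchema is a hypothetical helper name, not a Spark API) that reports columns whose name or type does not match the expected StructType:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper: returns a human-readable list of mismatches, empty when the schemas line up.
def validateSchema(df: DataFrame, expected: StructType): Seq[String] = {
  val actual = df.schema.map(f => f.name -> f.dataType).toMap
  expected.fields.toSeq.flatMap { f =>
    actual.get(f.name) match {
      case None                       => Some(s"missing column: ${f.name}")
      case Some(t) if t != f.dataType => Some(s"type mismatch for ${f.name}: expected ${f.dataType}, got $t")
      case _                          => None
    }
  }
}

// usage: validateSchema(dfNew, newSchema).foreach(println)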

Related

Sink Connector: auto-create tables with the proper data types

I have a Debezium source connector for PostgreSQL with Avro as the value converter, and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
Access method: heap
In the schema registry:
{
  "type": "record",
  "name": "Value",
  "namespace": "test.public.tbl1",
  "fields": [
    {
      "name": "id",
      "type": {
        "type": "int",
        "connect.parameters": {
          "__debezium.source.column.type": "SERIAL",
          "__debezium.source.column.length": "10",
          "__debezium.source.column.scale": "0"
        },
        "connect.default": 0
      },
      "default": 0
    },
    {
      "name": "name",
      "type": [
        "null",
        {
          "type": "string",
          "connect.parameters": {
            "__debezium.source.column.type": "VARCHAR",
            "__debezium.source.column.length": "100",
            "__debezium.source.column.scale": "0"
          }
        }
      ],
      "default": null
    },
    {
      "name": "col4",
      "type": [
        "null",
        {
          "type": "double",
          "connect.parameters": {
            "__debezium.source.column.type": "NUMERIC",
            "__debezium.source.column.length": "0"
          }
        }
      ],
      "default": null
    },
    {
      "name": "col5",
      "type": [
        "null",
        {
          "type": "long",
          "connect.parameters": {
            "__debezium.source.column.type": "INT8",
            "__debezium.source.column.length": "19",
            "__debezium.source.column.scale": "0"
          }
        }
      ],
      "default": null
    },
    {
      "name": "col6",
      "type": [
        "null",
        {
          "type": "long",
          "connect.version": 1,
          "connect.parameters": {
            "__debezium.source.column.type": "TIMESTAMP",
            "__debezium.source.column.length": "29",
            "__debezium.source.column.scale": "6"
          },
          "connect.name": "io.debezium.time.MicroTimestamp"
        }
      ],
      "default": null
    },
    {
      "name": "col7",
      "type": [
        "null",
        {
          "type": "string",
          "connect.version": 1,
          "connect.parameters": {
            "__debezium.source.column.type": "TIMESTAMPTZ",
            "__debezium.source.column.length": "35",
            "__debezium.source.column.scale": "6"
          },
          "connect.name": "io.debezium.time.ZonedTimestamp"
        }
      ],
      "default": null
    },
    {
      "name": "col8",
      "type": [
        "null",
        {
          "type": "boolean",
          "connect.parameters": {
            "__debezium.source.column.type": "BOOL",
            "__debezium.source.column.length": "1",
            "__debezium.source.column.scale": "0"
          }
        }
      ],
      "default": null
    }
  ],
  "connect.name": "test.public.tbl1.Value"
}
But in the target PostgreSQL database the data types are completely mismatched for the ID and timestamp columns, and sometimes the decimal columns as well (that's due to this).
Target:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+------------------+-----------+----------+---------+----------+-------------+--------------+-------------
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it's not creating the target tables with the proper data types.
Sink config:
{
  "name": "t1-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "test.public.tbl1",
    "connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
    "connection.user": "postgres",
    "connection.password": "***",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "auto.create": "true",
    "insert.mode": "upsert",
    "delete.enabled": "true",
    "pk.fields": "id",
    "pk.mode": "record_key",
    "table.name.format": "tbl1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
    "internal.key.converter.schemas.enable": "true",
    "internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
    "internal.value.converter.schemas.enable": "true",
    "value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
    "value.converter.schemas.enable": "true",
    "value.converter.region": "us-east-1",
    "key.converter.region": "us-east-1",
    "key.converter.schemaAutoRegistrationEnabled": "true",
    "value.converter.schemaAutoRegistrationEnabled": "true",
    "key.converter.avroRecordType": "GENERIC_RECORD",
    "value.converter.avroRecordType": "GENERIC_RECORD",
    "key.converter.registry.name": "bhuvi-debezium",
    "value.converter.registry.name": "bhuvi-debezium",
    "value.converter.column.propagate.source.type": ".*",
    "value.converter.datatype.propagate.source.type": ".*"
  }
}

Aerospike Kafka Outbound Connector - using map data type with Kafka Avro format

I am trying to set up the Aerospike Kafka Outbound Connector (source connector) for Change Data Capture (CDC). When using the Kafka Avro format for messages (as explained in https://docs.aerospike.com/connect/kafka/from-asdb/formats/kafka-avro-serialization-format), I am getting the following error from the connector:
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.607 GMT INFO metrics-ticker - requests-total: rate(per second) mean=8.05313731102431, m1=9.48698171570335, m5=2.480116641993411, m15=0.8667674157832074
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.613 GMT INFO metrics-ticker - requests-total: duration(ms) min=1.441459, max=3101.488585, mean=15.582822432504553, stddev=149.48409869767494, median=4.713083, p75=7.851875, p95=17.496458, p98=28.421125, p99=85.418959, p999=3090.952252
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.624 GMT ERROR metrics-ticker - **java.lang.Exception - Map type not allowed, has to be record type**: count=184
My data contains a map field. The Avro schema for the data is as follows:
{
  "name": "mydata",
  "type": "record",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "name": "com.aerospike.metadata",
        "type": "record",
        "fields": [
          {
            "name": "namespace",
            "type": "string"
          },
          {
            "name": "set",
            "type": ["null", "string"],
            "default": null
          },
          {
            "name": "userKey",
            "type": ["null", "long", "double", "bytes", "string"],
            "default": null
          },
          {
            "name": "digest",
            "type": "bytes"
          },
          {
            "name": "msg",
            "type": "string"
          },
          {
            "name": "durable",
            "type": ["null", "boolean"],
            "default": null
          },
          {
            "name": "gen",
            "type": ["null", "int"],
            "default": null
          },
          {
            "name": "exp",
            "type": ["null", "int"],
            "default": null
          },
          {
            "name": "lut",
            "type": ["null", "long"],
            "default": null
          }
        ]
      }
    },
    {
      "name": "test",
      "type": "string"
    },
    {
      "name": "testmap",
      "type": {
        "type": "map",
        "values": "string",
        "default": {}
      }
    }
  ]
}
Has anyone tried and got this working?
Edit:
It looks like the Aerospike connector doesn't support the map type. I enabled verbose logging:
aerospike-kafka_connector-1 | 2023-01-06 18:41:31.098 GMT ERROR ErrorRegistry - Error stack trace
aerospike-kafka_connector-1 | java.lang.Exception: Map type not allowed, has to be record type
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:155)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:164)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertSchemaValid(KafkaAvroOutboundRecordGenerator.kt:101)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator.<init>(KafkaAvroOutboundRecordGenerator.kt:180)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:59)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.access$getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:29)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule$bindKafkaAvroParserFactory$1.get(KafkaOutboundGuiceModule.kt:48)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.getInbuiltRecordFormatter(XdrExchangeConverter.kt:422)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.access$getInbuiltRecordFormatter(XdrExchangeConverter.kt:75)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter$RouterAndInbuiltFormatter.<init>(XdrExchangeConverter.kt:285)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.processXdrRecord(XdrExchangeConverter.kt:192)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.parse(XdrExchangeConverter.kt:134)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.OutboundBridge$processAsync$1.invokeSuspend(OutboundBridge.kt:182)
aerospike-kafka_connector-1 | at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
aerospike-kafka_connector-1 | at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
aerospike-kafka_connector-1 | at java.base/java.lang.Thread.run(Thread.java:829)
However, this is not mentioned in the documentation.
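Given the error above ("Map type not allowed, has to be record type"), one workaround sketch, purely an assumption on my part and not an official Aerospike recommendation, is to model the bin as a record with known keys instead of a free-form map. For illustration, built with Avro's SchemaBuilder:

import org.apache.avro.SchemaBuilder

// Hypothetical replacement for the "testmap" field: a record with fixed, known keys
// (key1/key2 are placeholders) instead of an Avro map, which the connector rejects.
val testmapAsRecord = SchemaBuilder.record("testmap").namespace("com.aerospike")
  .fields()
  .requiredString("key1")
  .optionalString("key2")
  .endRecord()

println(testmapAsRecord.toString(true)) // pretty-printed Avro schema JSON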

java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt

I have the below code in Scala that serializes a class into a byte array:
import org.apache.avro.io.EncoderFactory
import org.apache.avro.reflect.ReflectDatumWriter
import java.io.ByteArrayOutputStream

case class RecordWithPrimitives(
  string: String,
  bool: Boolean,
  bigInt: BigInt,
  bigDecimal: BigDecimal,
)

object AvroEncodingDemoApp extends App {
  val a = new RecordWithPrimitives(string = "???", bool = false, bigInt = BigInt.long2bigInt(1), bigDecimal = BigDecimal.decimal(5))

  val parser = new org.apache.avro.Schema.Parser()
  val avroSchema = parser.parse(
    """
      |{
      |  "type": "record",
      |  "name": "RecordWithPrimitives",
      |  "fields": [{
      |    "name": "string",
      |    "type": "string"
      |  }, {
      |    "name": "bool",
      |    "type": "boolean"
      |  }, {
      |    "name": "bigInt",
      |    "type": {
      |      "type": "bytes",
      |      "logicalType": "decimal",
      |      "precision": 24,
      |      "scale": 24
      |    }
      |  }, {
      |    "name": "bigDecimal",
      |    "type": {
      |      "type": "bytes",
      |      "logicalType": "decimal",
      |      "precision": 48,
      |      "scale": 24
      |    }
      |  }]
      |}
      |""".stripMargin)

  val writer = new ReflectDatumWriter[RecordWithPrimitives](avroSchema)
  val boaStream = new ByteArrayOutputStream()
  val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, boaStream)
  writer.write(a, jsonEncoder)
  jsonEncoder.flush()
}
When I run the above program I get the below error:
Exception in thread "main" java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:79)
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:30)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:84)
at AvroEncodingDemoApp$.delayedEndpoint$AvroEncodingDemoApp$1(AvroEncodingDemoApp.scala:50)
at AvroEncodingDemoApp$delayedInit$body.apply(AvroEncodingDemoApp.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:42)
at scala.Function0.apply$mcV$sp$(Function0.scala:42)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:98)
at scala.App.$anonfun$main$1$adapted(App.scala:98)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at scala.App.main(App.scala:98)
at scala.App.main$(App.scala:96)
at AvroEncodingDemoApp$.main(AvroEncodingDemoApp.scala:12)
at AvroEncodingDemoApp.main(AvroEncodingDemoApp.scala)
Caused by: java.lang.ClassCastException: class scala.math.BigInt cannot be cast to class java.nio.ByteBuffer (scala.math.BigInt is in unnamed module of loader 'app'; java.nio.ByteBuffer is in module java.base of loader 'bootstrap')
at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:400)
at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:134)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:168)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:93)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:245)
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:117)
at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:184)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234)
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
... 14 more
How to fix this error?
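One possible direction (a sketch under assumptions, not the only fix): ReflectDatumWriter has no built-in conversion from scala.math.BigInt / scala.math.BigDecimal to the bytes decimal logical type, so the values can be pre-converted with Avro's Conversions.DecimalConversion and written as a GenericRecord. Note also that a decimal with precision 24 and scale 24 (the bigInt field above) can only represent values smaller than 1, so that field's precision likely needs to be raised to hold whole numbers. The sketch assumes `a` and `avroSchema` from the question are in scope.

import java.io.ByteArrayOutputStream
import org.apache.avro.{Conversions, LogicalTypes}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import org.apache.avro.io.EncoderFactory

val decimalConv = new Conversions.DecimalConversion()

// Convert a java.math.BigDecimal to the ByteBuffer expected by a bytes/decimal field.
def decimalBytes(value: java.math.BigDecimal, fieldName: String): java.nio.ByteBuffer = {
  val fieldSchema = avroSchema.getField(fieldName).schema()          // avroSchema from the question
  val logical     = fieldSchema.getLogicalType.asInstanceOf[LogicalTypes.Decimal]
  decimalConv.toBytes(value.setScale(logical.getScale), fieldSchema, logical)
}

val record: GenericRecord = new GenericRecordBuilder(avroSchema)
  .set("string", a.string)                                           // `a` from the question
  .set("bool", a.bool)
  .set("bigInt", decimalBytes(new java.math.BigDecimal(a.bigInt.bigInteger), "bigInt"))
  .set("bigDecimal", decimalBytes(a.bigDecimal.bigDecimal, "bigDecimal"))
  .build()

val out     = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.jsonEncoder(avroSchema, out)
new GenericDatumWriter[GenericRecord](avroSchema).write(record, encoder)
encoder.flush()
println(out.toString("UTF-8"))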

Convert nested JSON to a dataframe in Scala Spark

I want to create a dataframe from JSON for only a given key. Its value is a list of nested JSON objects. I tried flattening, but I think there could be a simpler workaround, since I only need one key of the JSON to convert into a dataframe.
I have JSON like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:
val json = Json.parse("""
{
  "Id_columns": 2,
  "metadata": [{
    "id": "1234",
    "type": "file",
    "length": 395
  }, {
    "id": "1235",
    "type": "file2",
    "length": 396
  }]
}""")

var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me this error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
I am expecting this result:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+
You need to explode your array like this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.json(
  spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)

df.select(explode($"metadata").as("metadata"))
  .select("metadata.*")
  .show(false)
Output :
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+
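For completeness, a small sketch (my reading of the compile error quoted in the question, not part of the original answer): spark.sparkContext.parallelize(jsonlist) treats the String as a collection of Chars and yields an RDD[Char], which spark.read.json cannot accept. Wrapping the string in a Seq produces an RDD[String] and keeps the asker's Play JSON extraction of the metadata key:

// Assumes `json` has been parsed with Play JSON as in the question.
val jsonlist: String = Json.stringify(json("metadata"))

// Seq(...) gives an RDD[String] instead of an RDD[Char], which spark.read.json accepts.
val rddData = spark.sparkContext.parallelize(Seq(jsonlist))

val resultDF = spark.read
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .json(rddData)

resultDF.show()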

Retrieving a particular field inside objects in JsArray

A part of my JSON response looks like this:
"resources": [{
"password": "",
"metadata": {
"updated_at": "20190806172149Z",
"guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
"created_at": "20190405013547Z",
"url": ""
},
"iam": false,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": "Default"
}
}, {
"password": "",
"metadata": {
"updated_at": "20190416192020Z",
"guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
"created_at": "20190416192020Z",
"url": ""
},
"iam": true,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": null
}
},
...
]
I am using Json helpers provided by the Play framework to parse this Json like this:
val resources: JsArray = response("resources").as[JsArray]
Now I need to extract the field email from all these objects in the resources JsArray. For this I tried writing a foreach loop like:
for (resource <- resources) {
}
But I'm getting an error, Cannot resolve symbol foreach, at the <- sign. How do I retrieve a particular field like email from each of the JSON objects inside a JsArray?
With Play JSON I always use case classes. So your example would look like:
import play.api.libs.json._
case class Resource(password: String, metadata: JsObject, iam: Boolean, email: String, authentication: JsObject)
object Resource {
  implicit val jsonFormat: Format[Resource] = Json.format[Resource]
}

val resources: Seq[Resource] = response("resources").validate[Seq[Resource]] match {
  case JsSuccess(res, _) => res
  case errors => // handle errors, e.g. throw new IllegalArgumentException(..)
}
Now you can access any field in a type-safe manner.
Of course you can replace the JsObjects with case classes in the same way - let me know if you need this in my answer.
But in your case as you only need the email there is no need:
resources.map(_.email) // returns Seq[String]
So, as #pme said, you should work with case classes; they should look something like this:
import java.util.UUID
import play.api.libs.json._

case class Resource(password: String, metadata: Metadata, iam: Boolean, email: String, authentication: Authentication)

object Resource {
  implicit val resourcesImplicit: OFormat[Resource] = Json.format[Resource]
}

case class Metadata(updatedAt: String, guid: UUID, createdAt: String, url: String)

object Metadata {
  implicit val metadataImplicit: OFormat[Metadata] = Json.format[Metadata]
}

case class Authentication(method: String, policyId: String)

object Authentication {
  implicit val authenticationImplicit: OFormat[Authentication] = Json.format[Authentication]
}
You can also use Writes and Reads instead of OFormat, or custom Writes and Reads; I used OFormat because it is less verbose.
Then, when you have your response, you validate it. You can validate it the way #pme said, or the way I do it:
val response_ = response("resources").validate[Seq[Resource]]
response_.fold(
  errors    => Future.successful(BadRequest(JsError.toJson(errors))),
  resources => resources.map(_.email) // extracting the emails from your objects
)
So here you do one thing when the JSON is invalid and another when it is valid. The behaviour is the same as what #pme did, just a bit more elegant in my opinion.
Assuming that your json looks like this:
val jsonString =
  """
    |{
    |  "resources": [
    |    {
    |      "password": "",
    |      "metadata": {
    |        "updated_at": "20190806172149Z",
    |        "guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
    |        "created_at": "20190405013547Z",
    |        "url": ""
    |      },
    |      "iam": false,
    |      "email": "<some mail id1>",
    |      "authentication": {
    |        "method": "internal",
    |        "policy_id": "Default"
    |      }
    |    },
    |    {
    |      "password": "",
    |      "metadata": {
    |        "updated_at": "20190416192020Z",
    |        "guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
    |        "created_at": "20190416192020Z",
    |        "url": ""
    |      },
    |      "iam": true,
    |      "email": "<some mail id2>",
    |      "authentication": {
    |        "method": "internal",
    |        "policy_id": null
    |      }
    |    }
    |  ]
    |}
  """.stripMargin
You can do:
(Json.parse(jsonString) \ "resources").as[JsValue] match {
  case js: JsArray => js.value.foreach(x => println((x \ "email").as[String]))
  case x           => println((x \ "email").as[String])
}
or:
(Json.parse(jsonString) \ "resources").validate[JsArray] match {
  case s: JsSuccess[JsArray] => s.get.value.foreach(x => println((x \ "email").as[String]))
  case _: JsError            => arr().value // or do something else
}
Both work for me.
The resources value is a JsArray, a type that doesn't provide .flatMap, so it cannot be used to the right of <- in a for comprehension.
val emailReads: Reads[String] = (JsPath \ "email").reads[String]
val resourcesReads: Reads[Seq[String]] = Reads.seq(emailReads)
val r: JsResult[Seq[String]] = resources.validate(resourcesReads)
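A small usage sketch to round this off (my addition, assuming the Play JSON values above are in scope): the resulting JsResult can be unpacked with getOrElse or fold.

// Take the emails when validation succeeds, or fall back to an empty Seq.
val emails: Seq[String] = r.getOrElse(Seq.empty)
emails.foreach(println)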