I have a dataframe which looks like this:
+--------------------+----------------+------+------+
|                  id|       migration|number|string|
+--------------------+----------------+------+------+
|  [5e5db036e0403b1a.|             mig|     1|   str|
+--------------------+----------------+------+------+
and I have a jsonSchema:
{
"title": "Section",
"type": "object",
"additionalProperties": false,
"required": ["migration", "id"],
"properties": {
"migration": {
"type": "string",
"additionalProperties": false
},
"string": {
"type": "string"
},
"number": {
"type": "number",
"min": 0
}
}
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please see the inline code comments for the explanation.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructField, StructType}
val newSchema : StructType = DataType.fromJson("""{
| "type": "struct",
| "fields": [
| {
| "name": "id",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "migration",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "number",
| "type": "integer",
| "nullable": false,
| "metadata": {}
| },
| {
| "name": "string",
| "type": "string",
| "nullable": true,
| "metadata": {}
| }
| ]
|}""".stripMargin).asInstanceOf[StructType] // Load you schema from JSON string
// println(newSchema)
val spark = Constant.getSparkSess // Create SparkSession object
//Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.","mig",1,"str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
dfNew.show()
//InCorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.",1,1,"str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // validating the data which will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string
dfInvalid.show()
val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare SQL dataframe with JSON schema
diffColumn.foreach(field => println(field.name)) // Print the diff columns
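For a plain schema-to-schema check (without building a new DataFrame), you can also compare the two StructTypes field by field. A minimal sketch, assuming newSchema is the StructType built above and df is the DataFrame under validation (the helper name schemaMismatches is mine):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
// Return a human-readable message for every expected field that is missing or has the wrong type
def schemaMismatches(df: DataFrame, expected: StructType): Seq[String] = {
  val actual = df.schema.map(f => f.name -> f.dataType).toMap
  expected.flatMap { field =>
    actual.get(field.name) match {
      case None => Some(s"missing column: ${field.name}")
      case Some(dt) if dt != field.dataType => Some(s"${field.name}: expected ${field.dataType}, found $dt")
      case _ => None
    }
  }
}
// Usage: collect the problems once and fail fast if there are any
// val problems = schemaMismatches(dfNew, newSchema)
// require(problems.isEmpty, problems.mkString("; "))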
Related
I have the debezium source connector for Postgresql with the value convertor as Avro and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
Access method: heap
In the schema registry:
{
"type": "record",
"name": "Value",
"namespace": "test.public.tbl1",
"fields": [
{
"name": "id",
"type": {
"type": "int",
"connect.parameters": {
"__debezium.source.column.type": "SERIAL",
"__debezium.source.column.length": "10",
"__debezium.source.column.scale": "0"
},
"connect.default": 0
},
"default": 0
},
{
"name": "name",
"type": [
"null",
{
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "VARCHAR",
"__debezium.source.column.length": "100",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col4",
"type": [
"null",
{
"type": "double",
"connect.parameters": {
"__debezium.source.column.type": "NUMERIC",
"__debezium.source.column.length": "0"
}
}
],
"default": null
},
{
"name": "col5",
"type": [
"null",
{
"type": "long",
"connect.parameters": {
"__debezium.source.column.type": "INT8",
"__debezium.source.column.length": "19",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col6",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMP",
"__debezium.source.column.length": "29",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.MicroTimestamp"
}
],
"default": null
},
{
"name": "col7",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMPTZ",
"__debezium.source.column.length": "35",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
},
{
"name": "col8",
"type": [
"null",
{
"type": "boolean",
"connect.parameters": {
"__debezium.source.column.type": "BOOL",
"__debezium.source.column.length": "1",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
}
],
"connect.name": "test.public.tbl1.Value"
}
But in the target PostgreSQL the data types are completely mismatched for the ID and timestamp columns, and sometimes the decimal columns as well (that's due to this).
Target:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+------------------+-----------+----------+---------+----------+-------------+--------------+-------------
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it's not creating the target tables with the proper data types.
Sink config:
{
"name": "t1-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "test.public.tbl1",
"connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
"connection.user": "postgres",
"connection.password": "***",
"dialect.name": "PostgreSqlDatabaseDialect",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.fields": "id",
"pk.mode": "record_key",
"table.name.format": "tbl1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.key.converter.schemas.enable": "true",
"internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.value.converter.schemas.enable": "true",
"value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.region": "us-east-1",
"key.converter.region": "us-east-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"key.converter.registry.name": "bhuvi-debezium",
"value.converter.registry.name": "bhuvi-debezium",
"value.converter.column.propagate.source.type": ".*",
"value.converter.datatype.propagate.source.type": ".*"
}
}
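One thing to note: in the registered value schema above, col6 is io.debezium.time.MicroTimestamp (a long) and col7 is io.debezium.time.ZonedTimestamp (a string), and the JDBC sink only maps Kafka Connect's built-in types, so auto.create turns them into bigint and text; the id column likely becomes text because the primary key is taken from the record key, which goes through StringConverter. A hedged sketch of the source-connector side that is commonly adjusted for the plain TIMESTAMP case (connection details are placeholders, and the exact property set depends on your Debezium version):
{
  "name": "t1-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "172.31.85.***",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "***",
    "database.dbname": "test",
    "table.include.list": "public.tbl1",
    "time.precision.mode": "connect"
  }
}
With time.precision.mode set to connect, the TIMESTAMP column is emitted as a Kafka Connect Timestamp logical type, which the PostgreSQL sink dialect can create as a proper timestamp column; TIMESTAMPTZ is always emitted as a ZonedTimestamp string, so it still needs separate handling. For the id column, switching the sink's key.converter to the same Avro converter (or using pk.mode=record_value with pk.fields=id) should keep the integer type from the value schema.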
I am trying to set up the Aerospike Kafka Outbound Connector (source connector) for Change Data Capture (CDC). When using the Kafka Avro format for messages (as explained in https://docs.aerospike.com/connect/kafka/from-asdb/formats/kafka-avro-serialization-format), I am getting the following error from the connector:
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.607 GMT INFO metrics-ticker - requests-total: rate(per second) mean=8.05313731102431, m1=9.48698171570335, m5=2.480116641993411, m15=0.8667674157832074
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.613 GMT INFO metrics-ticker - requests-total: duration(ms) min=1.441459, max=3101.488585, mean=15.582822432504553, stddev=149.48409869767494, median=4.713083, p75=7.851875, p95=17.496458, p98=28.421125, p99=85.418959, p999=3090.952252
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.624 GMT ERROR metrics-ticker - **java.lang.Exception - Map type not allowed, has to be record type**: count=184
My data contains a map field. The Avro schema for the data is as follows:
{
"name": "mydata",
"type": "record",
"fields": [
{
"name": "metadata",
"type": {
"name": "com.aerospike.metadata",
"type": "record",
"fields": [
{
"name": "namespace",
"type": "string"
},
{
"name": "set",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "userKey",
"type": [
"null",
"long",
"double",
"bytes",
"string"
],
"default": null
},
{
"name": "digest",
"type": "bytes"
},
{
"name": "msg",
"type": "string"
},
{
"name": "durable",
"type": [
"null",
"boolean"
],
"default": null
},
{
"name": "gen",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "exp",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "lut",
"type": [
"null",
"long"
],
"default": null
}
]
}
},
{
"name": "test",
"type": "string"
},
{
"name": "testmap",
"type": {
"type": "map",
"values": "string",
"default": {}
}
}
]
}
Has anyone tried and got this working?
Edit:
It looks like the Aerospike connector doesn't support the map type. I enabled verbose logging:
aerospike-kafka_connector-1 | 2023-01-06 18:41:31.098 GMT ERROR ErrorRegistry - Error stack trace
aerospike-kafka_connector-1 | java.lang.Exception: Map type not allowed, has to be record type
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:155)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:164)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertSchemaValid(KafkaAvroOutboundRecordGenerator.kt:101)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator.<init>(KafkaAvroOutboundRecordGenerator.kt:180)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:59)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.access$getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:29)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule$bindKafkaAvroParserFactory$1.get(KafkaOutboundGuiceModule.kt:48)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.getInbuiltRecordFormatter(XdrExchangeConverter.kt:422)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.access$getInbuiltRecordFormatter(XdrExchangeConverter.kt:75)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter$RouterAndInbuiltFormatter.<init>(XdrExchangeConverter.kt:285)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.processXdrRecord(XdrExchangeConverter.kt:192)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.parse(XdrExchangeConverter.kt:134)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.OutboundBridge$processAsync$1.invokeSuspend(OutboundBridge.kt:182)
aerospike-kafka_connector-1 | at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
aerospike-kafka_connector-1 | at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
aerospike-kafka_connector-1 | at java.base/java.lang.Thread.run(Thread.java:829)
However, this restriction is not mentioned in the documentation.
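Since the stack trace shows the schema validator rejecting any Avro map ("has to be record type"), one possible workaround, assuming the map keys are known up front, is to model the bin as a record instead of a map. A sketch of the testmap field rewritten that way (key1 and key2 are hypothetical key names, and whether the connector will populate such a record from a map bin is something to verify against your data):
{
  "name": "testmap",
  "type": {
    "type": "record",
    "name": "testmapRecord",
    "fields": [
      { "name": "key1", "type": ["null", "string"], "default": null },
      { "name": "key2", "type": ["null", "string"], "default": null }
    ]
  }
}
If the keys are dynamic, the remaining options are to leave that bin out of the Avro schema or to use a message format that supports maps (for example the connector's JSON format).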
I have the below code in Scala that serializes a class into a byte array:
import org.apache.avro.io.EncoderFactory
import org.apache.avro.reflect.ReflectDatumWriter
import java.io.ByteArrayOutputStream
case class RecordWithPrimitives(
string: String,
bool: Boolean,
bigInt: BigInt,
bigDecimal: BigDecimal,
)
object AvroEncodingDemoApp extends App {
val a = new RecordWithPrimitives(string = "???", bool = false, bigInt = BigInt.long2bigInt(1), bigDecimal = BigDecimal.decimal(5))
val parser = new org.apache.avro.Schema.Parser()
val avroSchema = parser.parse(
"""
|{
| "type": "record",
| "name": "RecordWithPrimitives",
| "fields": [{
| "name": "string",
| "type": "string"
| }, {
| "name": "bool",
| "type": "boolean"
| }, {
| "name": "bigInt",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 24,
| "scale": 24
| }
| }, {
| "name": "bigDecimal",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 48,
| "scale": 24
| }
| }]
|}
|""".stripMargin)
val writer = new ReflectDatumWriter[RecordWithPrimitives](avroSchema)
val boaStream = new ByteArrayOutputStream()
val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, boaStream)
writer.write(a, jsonEncoder)
jsonEncoder.flush()
}
When I run the above program I get the following error:
Exception in thread "main" java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:79)
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:30)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:84)
at AvroEncodingDemoApp$.delayedEndpoint$AvroEncodingDemoApp$1(AvroEncodingDemoApp.scala:50)
at AvroEncodingDemoApp$delayedInit$body.apply(AvroEncodingDemoApp.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:42)
at scala.Function0.apply$mcV$sp$(Function0.scala:42)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:98)
at scala.App.$anonfun$main$1$adapted(App.scala:98)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at scala.App.main(App.scala:98)
at scala.App.main$(App.scala:96)
at AvroEncodingDemoApp$.main(AvroEncodingDemoApp.scala:12)
at AvroEncodingDemoApp.main(AvroEncodingDemoApp.scala)
Caused by: java.lang.ClassCastException: class scala.math.BigInt cannot be cast to class java.nio.ByteBuffer (scala.math.BigInt is in unnamed module of loader 'app'; java.nio.ByteBuffer is in module java.base of loader 'bootstrap')
at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:400)
at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:134)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:168)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:93)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:245)
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:117)
at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:184)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234)
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
... 14 more
How to fix this error?
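The ReflectDatumWriter has no conversion registered for scala.math.BigInt or scala.math.BigDecimal, so it tries to write the raw object where the schema expects decimal bytes. One possible fix (a sketch, not the only option) is to convert the values to java.math.BigDecimal yourself and write a GenericRecord with Avro's DecimalConversion registered; note that the bigInt field's declared precision 24 with scale 24 only admits values smaller than 1, so the sketch raises that precision to 48:
import java.io.ByteArrayOutputStream
import org.apache.avro.Conversions
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

object AvroDecimalEncodingSketch extends App {
  // Same schema as above, except bigInt's precision is raised to 48 so the value 1 fits.
  val avroSchema = new org.apache.avro.Schema.Parser().parse(
    """{"type":"record","name":"RecordWithPrimitives","fields":[
      |  {"name":"string","type":"string"},
      |  {"name":"bool","type":"boolean"},
      |  {"name":"bigInt","type":{"type":"bytes","logicalType":"decimal","precision":48,"scale":24}},
      |  {"name":"bigDecimal","type":{"type":"bytes","logicalType":"decimal","precision":48,"scale":24}}
      |]}""".stripMargin)

  // Register the conversion that turns java.math.BigDecimal into the bytes the decimal logical type expects.
  val model = new GenericData()
  model.addLogicalTypeConversion(new Conversions.DecimalConversion())

  // Build the record by hand, converting the Scala types and matching the schema's scale (24).
  val record = new GenericData.Record(avroSchema)
  record.put("string", "???")
  record.put("bool", false)
  record.put("bigInt", new java.math.BigDecimal(BigInt(1).bigInteger).setScale(24))
  record.put("bigDecimal", BigDecimal(5).bigDecimal.setScale(24))

  val out = new ByteArrayOutputStream()
  val writer = new GenericDatumWriter[GenericData.Record](avroSchema, model)
  val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, out)
  writer.write(record, jsonEncoder)
  jsonEncoder.flush()
  println(out.toString("UTF-8"))
}
Alternatively, you could keep the reflection-based writer and change the case class to carry java.math.BigDecimal fields, but either way the values have to reach Avro as java.math.BigDecimal with the schema's scale.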
I want to create a dataframe out of JSON for only a given key. Its value is a list of nested JSON objects. I tried flattening, but I think there could be a simpler workaround since I only need one key of the JSON to convert into a dataframe.
I have json like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:
val json = Json.parse("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me this error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
I am expecting result:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+
You need to explode your array like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.json(
spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)
df.select(explode($"metadata").as("metadata"))
.select("metadata.*")
.show(false)
Output :
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+
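As a side note, the error in the question comes from parallelize being applied to a String, which yields an RDD[Char]. If you want to keep the Play JSON extraction of the metadata key, a sketch of handing the extracted string to Spark instead (assuming the same spark session and Play JSON on the classpath):
import play.api.libs.json.Json
import spark.implicits._
val json = Json.parse("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}""")
// Keep only the "metadata" array as a JSON string and wrap it in a Dataset[String]
val jsonlist = Json.stringify((json \ "metadata").get)
val resultDF = spark.read
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .json(Seq(jsonlist).toDS())
resultDF.show()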
A part of my JSON response looks like this:
"resources": [{
"password": "",
"metadata": {
"updated_at": "20190806172149Z",
"guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
"created_at": "20190405013547Z",
"url": ""
},
"iam": false,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": "Default"
}
}, {
"password": "",
"metadata": {
"updated_at": "20190416192020Z",
"guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
"created_at": "20190416192020Z",
"url": ""
},
"iam": true,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": null
}
},
...
]
I am using Json helpers provided by the Play framework to parse this Json like this:
val resources: JsArray = response("resources").as[JsArray]
Now I need to extract the field email from all these objects in the resources JsArray. For this I tried writing a foreach loop like:
for (resource <- resources) {
}
But I'm getting an error, Cannot resolve symbol foreach, at the <- sign. How do I retrieve a particular field like email from each of the JSON objects inside a JsArray?
With Play JSON I always use case classes. So your example would look like:
import play.api.libs.json._
case class Resource(password: String, metadata: JsObject, iam: Boolean, email: String, authentication: JsObject)
object Resource {
implicit val jsonFormat: Format[Resource] = Json.format[Resource]
}
val resources: Seq[Resource] = response("resources").validate[Seq[Resource]] match {
case JsSuccess(res, _) => res
case errors => throw new IllegalArgumentException(errors.toString) // handle errors as needed
}
Now you can access any field in a type-safe manner.
Of course you can replace the JsObjects with case classes in the same way - let me know if you need this in my answer.
But in your case, as you only need the email, there is no need:
resources.map(_.email) // returns Seq[String]
So, like #pme said, you should work with case classes; they should look something like this:
import java.util.UUID
import play.api.libs.json._
case class Resource(password: String, metadata: Metadata, iam: Boolean, email: String, authentication: Authentication)
object Resource{
implicit val resourcesImplicit: OFormat[Resource] = Json.format[Resource]
}
case class Metadata(updatedAt:String, guid:UUID, createdAt:String, url:String)
object Metadata{
implicit val metadataImplicit: OFormat[Metadata] = Json.format[Metadata]
}
case class Authentication(method: String, policyId: Option[String])
object Authentication{
implicit val authenticationImplicit: OFormat[Authentication] =
Json.format[Authentication]
}
You can also use Writes and Reads instead of OFormat, or custom Writes and Reads; I used OFormat because it is less verbose.
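One thing to watch: the JSON keys in the response are snake_case (updated_at, created_at, policy_id) while the case class fields are camelCase, so the macro-generated formats won't find them by default. A sketch of bridging that with Play JSON's naming configuration (assuming Play JSON 2.6+, with this implicit in scope where the Json.format macros above are expanded):
import play.api.libs.json._
// Map camelCase case-class fields to snake_case JSON keys (updatedAt <-> updated_at, policyId <-> policy_id)
implicit val snakeCaseConfig = JsonConfiguration(JsonNaming.SnakeCase)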
Then, when you have your responses, you validate them. You can validate them the way #pme said, or the way I do it:
val response_ = response("resources").validate[Seq[Resource]]
response_.fold(
  errors => Future.successful(BadRequest(JsError.toJson(errors))),
  resources => resources.map(_.email) // extracting emails from your objects
)
So here you do something when the JSON is invalid and another thing when it is valid. The behavior is the same as what #pme did, just a bit more elegant in my opinion.
Assuming that your json looks like this:
val jsonString =
"""
|{
| "resources": [
| {
| "password": "",
| "metadata": {
| "updated_at": "20190806172149Z",
| "guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
| "created_at": "20190405013547Z",
| "url": ""
| },
| "iam": false,
| "email": "<some mail id1>",
| "authentication": {
| "method": "internal",
| "policy_id": "Default"
| }
| },
| {
| "password": "",
| "metadata": {
| "updated_at": "20190416192020Z",
| "guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
| "created_at": "20190416192020Z",
| "url": ""
| },
| "iam": true,
| "email": "<some mail id2>",
| "authentication": {
| "method": "internal",
| "policy_id": null
| }
| }
| ]
|}
""".stripMargin
you can do:
(Json.parse(jsonString) \ "resources").as[JsValue] match{
case js: JsArray => js.value.foreach(x => println((x \ "email").as[String]))
case x => println((x \ "email").as[String])
}
or:
(Json.parse(jsonString) \ "resources").validate[JsArray] match {
case s: JsSuccess[JsArray] => s.get.value.foreach(x => println((x \ "email").as[String]))
case _: JsError => arr().value //or do something else
}
Both work for me.
resources is a JsArray, a type that doesn't provide .flatMap, so it cannot be used to the right of <- in a for comprehension.
val emailReads: Reads[String] = (JsPath \ "email").reads[String]
val resourcesReads = Reads.seqReads(emailReads)
val r: JsResult[Seq[String]] = resources.validate(resourcesReads)
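Alternatively, the underlying sequence is available as resources.value, which does support map and for comprehensions. A short sketch, assuming the resources value from the question:
// JsArray.value is an IndexedSeq[JsValue], so it can drive a for comprehension directly
val emails: Seq[String] = for (resource <- resources.value.toSeq) yield (resource \ "email").as[String]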