java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt - scala

I have the below code in Scala that serializes a class into a byte array:
import org.apache.avro.io.EncoderFactory
import org.apache.avro.reflect.ReflectDatumWriter
import java.io.ByteArrayOutputStream
case class RecordWithPrimitives(
string: String,
bool: Boolean,
bigInt: BigInt,
bigDecimal: BigDecimal,
)
object AvroEncodingDemoApp extends App {
val a = new RecordWithPrimitives(string = "???", bool = false, bigInt = BigInt.long2bigInt(1), bigDecimal = BigDecimal.decimal(5))
val parser = new org.apache.avro.Schema.Parser()
val avroSchema = parser.parse(
"""
|{
| "type": "record",
| "name": "RecordWithPrimitives",
| "fields": [{
| "name": "string",
| "type": "string"
| }, {
| "name": "bool",
| "type": "boolean"
| }, {
| "name": "bigInt",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 24,
| "scale": 24
| }
| }, {
| "name": "bigDecimal",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 48,
| "scale": 24
| }
| }]
|}
|""".stripMargin)
val writer = new ReflectDatumWriter[RecordWithPrimitives](avroSchema)
val boaStream = new ByteArrayOutputStream()
val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, boaStream)
writer.write(a, jsonEncoder)
jsonEncoder.flush()
}
When I run the above program, I get the below error:
Exception in thread "main" java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:79)
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:30)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:84)
at AvroEncodingDemoApp$.delayedEndpoint$AvroEncodingDemoApp$1(AvroEncodingDemoApp.scala:50)
at AvroEncodingDemoApp$delayedInit$body.apply(AvroEncodingDemoApp.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:42)
at scala.Function0.apply$mcV$sp$(Function0.scala:42)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:98)
at scala.App.$anonfun$main$1$adapted(App.scala:98)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at scala.App.main(App.scala:98)
at scala.App.main$(App.scala:96)
at AvroEncodingDemoApp$.main(AvroEncodingDemoApp.scala:12)
at AvroEncodingDemoApp.main(AvroEncodingDemoApp.scala)
Caused by: java.lang.ClassCastException: class scala.math.BigInt cannot be cast to class java.nio.ByteBuffer (scala.math.BigInt is in unnamed module of loader 'app'; java.nio.ByteBuffer is in module java.base of loader 'bootstrap')
at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:400)
at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:134)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:168)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:93)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:245)
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:117)
at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:184)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234)
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
... 14 more
How do I fix this error?
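One possible direction, as a sketch rather than a verified fix: the schema declares bigInt and bigDecimal as bytes with a decimal logical type, but the writer has no conversion registered for scala.math.BigInt or scala.math.BigDecimal, so the value falls through to the raw bytes writer and the ByteBuffer cast fails. Modelling the fields as java.math.BigDecimal and registering Avro's built-in Conversions.DecimalConversion on the ReflectData used by the writer is one way around that; the record and value names below are illustrative. Note also that a decimal with precision 24 and scale 24 has no integer digits at all, so the bigInt field's schema itself probably needs a larger precision or a scale of 0.
import org.apache.avro.Conversions
import org.apache.avro.reflect.{ReflectData, ReflectDatumWriter}
// Illustrative variant of the record: java.math.BigDecimal instead of the Scala wrappers,
// so the decimal logical-type conversion knows how to turn the values into bytes.
case class RecordWithJavaDecimals(
string: String,
bool: Boolean,
bigInt: java.math.BigDecimal, // scale must equal the schema's declared scale (24)
bigDecimal: java.math.BigDecimal // scale must equal the schema's declared scale (24)
)
val reflectData = new ReflectData
reflectData.addLogicalTypeConversion(new Conversions.DecimalConversion)
val decimalWriter = new ReflectDatumWriter[RecordWithJavaDecimals](avroSchema, reflectData)
val record = RecordWithJavaDecimals(
string = "???",
bool = false,
bigInt = new java.math.BigDecimal(1).setScale(24), // NB: overflows precision 24 with scale 24; the schema likely needs adjusting
bigDecimal = new java.math.BigDecimal(5).setScale(24)
)
// decimalWriter.write(record, jsonEncoder) then goes through DecimalConversion
// instead of failing the ByteBuffer cast.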

Related

Sink Connector auto create tables with proper data type

I have the Debezium source connector for PostgreSQL with the value converter as Avro, and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
Access method: heap
In the schema registry:
{
"type": "record",
"name": "Value",
"namespace": "test.public.tbl1",
"fields": [
{
"name": "id",
"type": {
"type": "int",
"connect.parameters": {
"__debezium.source.column.type": "SERIAL",
"__debezium.source.column.length": "10",
"__debezium.source.column.scale": "0"
},
"connect.default": 0
},
"default": 0
},
{
"name": "name",
"type": [
"null",
{
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "VARCHAR",
"__debezium.source.column.length": "100",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col4",
"type": [
"null",
{
"type": "double",
"connect.parameters": {
"__debezium.source.column.type": "NUMERIC",
"__debezium.source.column.length": "0"
}
}
],
"default": null
},
{
"name": "col5",
"type": [
"null",
{
"type": "long",
"connect.parameters": {
"__debezium.source.column.type": "INT8",
"__debezium.source.column.length": "19",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col6",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMP",
"__debezium.source.column.length": "29",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.MicroTimestamp"
}
],
"default": null
},
{
"name": "col7",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMPTZ",
"__debezium.source.column.length": "35",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
},
{
"name": "col8",
"type": [
"null",
{
"type": "boolean",
"connect.parameters": {
"__debezium.source.column.type": "BOOL",
"__debezium.source.column.length": "1",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
}
],
"connect.name": "test.public.tbl1.Value"
}
But in the target PostgreSQL, the data types are completely mismatched for the ID column and the timestamp columns, and sometimes the decimal columns as well (that's due to this).
Target:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+------------------+-----------+----------+---------+----------+-------------+--------------+-------------
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it's not creating the target tables with the proper data types.
Sink config:
{
"name": "t1-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "test.public.tbl1",
"connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
"connection.user": "postgres",
"connection.password": "***",
"dialect.name": "PostgreSqlDatabaseDialect",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.fields": "id",
"pk.mode": "record_key",
"table.name.format": "tbl1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.key.converter.schemas.enable": "true",
"internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.value.converter.schemas.enable": "true",
"value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.region": "us-east-1",
"key.converter.region": "us-east-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"key.converter.registry.name": "bhuvi-debezium",
"value.converter.registry.name": "bhuvi-debezium",
"value.converter.column.propagate.source.type": ".*",
"value.converter.datatype.propagate.source.type": ".*"
}
}
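A possible direction rather than a verified fix: the id column comes out as text because the key is read with StringConverter while pk.mode is record_key, so the sink only ever sees a schemaless string key; and col6 comes out as bigint because io.debezium.time.MicroTimestamp is just an Avro long as far as the JDBC sink is concerned. Two things commonly tried are setting Debezium's time.precision.mode to connect on the source connector, so timestamps are emitted as Kafka Connect Timestamp logical types the sink can map to timestamp columns, and using the Avro converter for the key as well, e.g. this hypothetical one-line change to the sink config above (the existing key.converter.region, registry.name and avroRecordType entries would then apply):
"key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",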

How to validate my data with jsonSchema scala

I have a dataframe which looks like this:
+--------------------+----------------+------+------+
| id | migration|number|string|
+--------------------+----------------+------+------+
|[5e5db036e0403b1a. |mig | 1| str |
+--------------------+----------------+------+------+
and I have a jsonSchema:
{
"title": "Section",
"type": "object",
"additionalProperties": false,
"required": ["migration", "id"],
"properties": {
"migration": {
"type": "string",
"additionalProperties": false
},
"string": {
"type": "string"
},
"number": {
"type": "number",
"min": 0
}
}
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please find inline code comments for the explanation:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructField, StructType}
val newSchema : StructType = DataType.fromJson("""{
| "type": "struct",
| "fields": [
| {
| "name": "id",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "migration",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "number",
| "type": "integer",
| "nullable": false,
| "metadata": {}
| },
| {
| "name": "string",
| "type": "string",
| "nullable": true,
| "metadata": {}
| }
| ]
|}""".stripMargin).asInstanceOf[StructType] // Load you schema from JSON string
// println(newSchema)
val spark = Constant.getSparkSess // Create SparkSession object
//Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.","mig",1,"str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
dfNew.show()
//InCorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.",1,1,"str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // validating the data which will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string
dfInvalid.show()
val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare SQL dataframe with JSON schema
diffColumn.foreach(field => println(field.name)) // Print the diff columns
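If the goal is a hard check rather than just printing the differences, here is a minimal sketch built on the same diff, assuming res holds the DataFrame to validate:
val mismatches = res.schema.diff(newSchema) // fields whose name, type or nullability differ
require(mismatches.isEmpty, s"Schema mismatch in fields: ${mismatches.map(_.name).mkString(", ")}")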

Convert nested json to dataframe in scala spark

I want to create a dataframe out of JSON for only a given key. Its value is a list, and that is a nested JSON type. I tried flattening, but I think there could be some workaround, as I only need one key of the JSON to convert into a dataframe.
I have JSON like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using Spark for only the key 'metadata'. I have written this code:
val json = Json.parse("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me this error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
I am expecting this result:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+
You need to explode your array like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.json(
spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)
df.select(explode($"metadata").as("metadata"))
.select("metadata.*")
.show(false)
Output:
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+

Retrieving a particular field inside objects in JsArray

A part of my JSON response looks like this:
"resources": [{
"password": "",
"metadata": {
"updated_at": "20190806172149Z",
"guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
"created_at": "20190405013547Z",
"url": ""
},
"iam": false,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": "Default"
}
}, {
"password": "",
"metadata": {
"updated_at": "20190416192020Z",
"guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
"created_at": "20190416192020Z",
"url": ""
},
"iam": true,
"email": "<some mail id>",
"authentication": {
"method": "internal",
"policy_id": null
}
},
...
]
I am using the JSON helpers provided by the Play Framework to parse this JSON like this:
val resources: JsArray = response("resources").as[JsArray]
Now I need to extract the field email from all these objects in the resources JsArray. For this I tried writing a foreach loop like:
for (resource <- resources) {
}
But I'm getting an error, Cannot resolve symbol foreach, at the <- sign. How do I retrieve a particular field like email from each of the JSON objects inside a JsArray?
With Play JSON I always use case classes. So your example would look like:
import play.api.libs.json._
case class Resource(password: String, metadata: JsObject, iam: Boolean, email: String, authentication: JsObject)
object Resource {
implicit val jsonFormat: Format[Resource] = Json.format[Resource]
}
val resources: Seq[Resource] = response("resources").validate[Seq[Resource]] match {
case JsSuccess(res, _) => res
case errors => throw new IllegalArgumentException(errors.toString) // handle errors, e.g. by failing fast
}
Now you can access any field in a type-safe manner.
Of course you can replace the JsObjects with case classes in the same way - let me know if you need this in my answer.
But in your case, as you only need the email, there is no need:
resources.map(_.email) // returns Seq[String]
So, like #pme said, you should work with case classes. They should look something like this:
import java.util.UUID
import play.api.libs.json._
case class Resource(password: String, metadata: Metadata, iam: Boolean, email: String, authentication: Authentication)
object Resource{
implicit val resourcesImplicit: OFormat[Resource] = Json.format[Resource]
}
case class Metadata(updated_at: String, guid: UUID, created_at: String, url: String) // field names match the JSON keys
object Metadata{
implicit val metadataImplicit: OFormat[Metadata] = Json.format[Metadata]
}
case class Authentication(method: String, policy_id: Option[String]) // policy_id can be null in the JSON
object Authentication{
implicit val authenticationImplicit: OFormat[Authentication] =
Json.format[Authentication]
}
You can also use Writes and Reads instead of OFormat, or custom Writes and Reads; I used OFormat because it is less verbose.
Then, when you have your responses, you validate them. You can validate them the way #pme said, or the way I do it:
val response_ = response("resources").validate[Seq[Resource]]
response_.fold(
errors => Future.successful(BadRequest(JsError.toJson(errors))),
resources => resources.map(_.email) // extracting emails from your objects
)
So here you do something when the JSON is invalid and another thing when it is valid. The behavior is the same as what pme did, just a bit more elegant in my opinion.
Assuming that your json looks like this:
val jsonString =
"""
|{
| "resources": [
| {
| "password": "",
| "metadata": {
| "updated_at": "20190806172149Z",
| "guid": "e1be511a-eb8e-1038-9547-0fff94eeae4b",
| "created_at": "20190405013547Z",
| "url": ""
| },
| "iam": false,
| "email": "<some mail id1>",
| "authentication": {
| "method": "internal",
| "policy_id": "Default"
| }
| },
| {
| "password": "",
| "metadata": {
| "updated_at": "20190416192020Z",
| "guid": "6b47118c-f4c8-1038-8d93-ed6d7155964a",
| "created_at": "20190416192020Z",
| "url": ""
| },
| "iam": true,
| "email": "<some mail id2>",
| "authentication": {
| "method": "internal",
| "policy_id": null
| }
| }
| ]
|}
""".stripMargin
you can do:
(Json.parse(jsonString) \ "resources").as[JsValue] match{
case js: JsArray => js.value.foreach(x => println((x \ "email").as[String]))
case x => println((x \ "email").as[String])
}
or:
(Json.parse(jsonString) \ "resources").validate[JsArray] match {
case s: JsSuccess[JsArray] => s.get.value.foreach(x => println((x \ "email").as[String]))
case _: JsError => arr().value //or do something else
}
Both work for me.
resources is a JsArray, a type that doesn't provide .flatMap, so it cannot be used at the right of <- in a for comprehension. One option is to use a Reads instead:
val emailReads: Reads[String] = (JsPath \ "email").reads[String]
val resourcesReads = Reads.seqReads(emailReads)
val r: JsResult[Seq[String]] = resources.validate(resourcesReads)
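Alternatively, if only the email fields are needed, JsArray exposes its elements through .value, which does support map and foreach. A minimal sketch:
val emails: Seq[String] = resources.value.map(js => (js \ "email").as[String]).toSeq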

Merge Spark dataframe rows based on key column in Scala

I have a streaming DataFrame with 2 columns: a key column represented as a String, and an objects column which is an array containing one object element. I want to be able to merge records or rows in the DataFrame with the same key, such that the merged records form an array of objects.
Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}] |
|abc | [{"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
Merged Dataframe
-------------------------------------------------------------------------
|key | objects |
-------------------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}, {"name":
"image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
--------------------------------------------------------------------------
One option is to convert this into a PairRDD and apply the reduceByKey function, but I'd prefer to do this with DataFrames if possible, since that would be more optimal. Is there any way to do this with DataFrames without compromising on performance?
Assuming column objects is an array of a single JSON string, here's how you can merge objects by key:
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark is in scope; needed for toDF and the $ column syntax
case class Obj(name: String, `type`: String, code: String)
val df = Seq(
("abc", Obj("file", "sample", "123")),
("abc", Obj("image", "sample", "456")),
("xyz", Obj("doc", "sample", "707"))
).
toDF("key", "object").
select($"key", array(to_json($"object")).as("objects"))
df.show(false)
// +---+-----------------------------------------------+
// |key|objects |
// +---+-----------------------------------------------+
// |abc|[{"name":"file","type":"sample","code":"123"}] |
// |abc|[{"name":"image","type":"sample","code":"456"}]|
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// +---+-----------------------------------------------+
df.groupBy($"key").agg(collect_list($"objects"(0)).as("objects")).
show(false)
// +---+---------------------------------------------------------------------------------------------+
// |key|objects |
// +---+---------------------------------------------------------------------------------------------+
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// |abc|[{"name":"file","type":"sample","code":"123"}, {"name":"image","type":"sample","code":"456"}]|
// +---+---------------------------------------------------------------------------------------------+
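If the objects arrays can ever hold more than one element, a variant that concatenates whole arrays instead of picking element 0 is flatten over collect_list (available in Spark 2.4+). A minimal sketch on the same df:
df.groupBy($"key").agg(flatten(collect_list($"objects")).as("objects")).show(false)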