Convert nested json to dataframe in scala spark - scala

I want to create the dataframe out of json for only given key. It values is a list and that is nested json type. I tried for flattening but I think there could be some workaround as I only need one key of json to convert into dataframe.
I have json like:
("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
Now I want to create a DataFrame using spark for only key 'metadata', I have written code:
val json = Json.parse("""
{
"Id_columns": 2,
"metadata": [{
"id": "1234",
"type": "file",
"length": 395
}, {
"id": "1235",
"type": "file2",
"length": 396
}]
}""")
var jsonlist = Json.stringify(json("metadata"))
val rddData = spark.sparkContext.parallelize(jsonlist)
resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
resultDF.show()
But it's giving me error:
overloaded method value json with alternatives:
cannot be applied to (org.apache.spark.rdd.RDD[Char])
[error] val resultDF = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(rddData)
^
I am expecting result:
+----+-----+--------+
| id | type| length |
+----+-----+--------+
|1234|file1| 395 |
|1235|file2| 396 |
+----+-----+--------+

You need to explode your array like this :
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.json(
spark.sparkContext.parallelize(Seq("""{"Id_columns":2,"metadata":[{"id":"1234","type":"file","length":395},{"id":"1235","type":"file2","length":396}]}"""))
)
df.select(explode($"metadata").as("metadata"))
.select("metadata.*")
.show(false)
Output :
+----+------+-----+
|id |length|type |
+----+------+-----+
|1234|395 |file |
|1235|396 |file2|
+----+------+-----+

Related

java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt

I've a below code in scala that serializes the class into byte array -
import org.apache.avro.io.EncoderFactory
import org.apache.avro.reflect.ReflectDatumWriter
import java.io.ByteArrayOutputStream
case class RecordWithPrimitives(
string: String,
bool: Boolean,
bigInt: BigInt,
bigDecimal: BigDecimal,
)
object AvroEncodingDemoApp extends App {
val a = new RecordWithPrimitives(string = "???", bool = false, bigInt = BigInt.long2bigInt(1), bigDecimal = BigDecimal.decimal(5))
val parser = new org.apache.avro.Schema.Parser()
val avroSchema = parser.parse(
"""
|{
| "type": "record",
| "name": "RecordWithPrimitives",
| "fields": [{
| "name": "string",
| "type": "string"
| }, {
| "name": "bool",
| "type": "boolean"
| }, {
| "name": "bigInt",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 24,
| "scale": 24
| }
| }, {
| "name": "bigDecimal",
| "type": {
| "type": "bytes",
| "logicalType": "decimal",
| "precision": 48,
| "scale": 24
| }
| }]
|}
|""".stripMargin)
val writer = new ReflectDatumWriter[RecordWithPrimitives](avroSchema)
val boaStream = new ByteArrayOutputStream()
val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, boaStream)
writer.write(a, jsonEncoder)
jsonEncoder.flush()
}
When I run the above program I get below error -
Exception in thread "main" java.lang.ClassCastException: value 1 (a scala.math.BigInt) cannot be cast to expected type bytes at RecordWithPrimitives.bigInt
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:79)
at org.apache.avro.path.TracingClassCastException.summarize(TracingClassCastException.java:30)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:84)
at AvroEncodingDemoApp$.delayedEndpoint$AvroEncodingDemoApp$1(AvroEncodingDemoApp.scala:50)
at AvroEncodingDemoApp$delayedInit$body.apply(AvroEncodingDemoApp.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:42)
at scala.Function0.apply$mcV$sp$(Function0.scala:42)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:98)
at scala.App.$anonfun$main$1$adapted(App.scala:98)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at scala.App.main(App.scala:98)
at scala.App.main$(App.scala:96)
at AvroEncodingDemoApp$.main(AvroEncodingDemoApp.scala:12)
at AvroEncodingDemoApp.main(AvroEncodingDemoApp.scala)
Caused by: java.lang.ClassCastException: class scala.math.BigInt cannot be cast to class java.nio.ByteBuffer (scala.math.BigInt is in unnamed module of loader 'app'; java.nio.ByteBuffer is in module java.base of loader 'bootstrap')
at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:400)
at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:134)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:168)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:93)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:245)
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:117)
at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:184)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234)
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95)
at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
... 14 more
How to fix this error?

How to get values from nested json array using spark?

I have this array
val myJson = {
"record": {
"recordId": 100,
"name": "xyz",
"version": "1.1",
"input": [
{
"format": "Database",
"type": "Oracle",
"connectionStringId": "212",
"connectionString": "ksfksfklsdflk",
"schemaName": "schema1",
"databaseName": "db1",
"tables": [
{
"table_name":"one"
}
{
"table_name":"two"
}
]
}
]
}
}
I am using this code to get this json in dataframe
val df = sparkSession.read.json(myjson)
I want values of schemaName & databaseName, how can i get them?
val schemaName = df.select("record.input.schemaName") //not working
Someone, please help me
You need to explode the array column record.input then select the fields you want :
df.select(explode(col("record.input")).as("inputs"))
.select("inputs.schemaName", "inputs.databaseName")
.show
//+----------+------------+
//|schemaName|databaseName|
//+----------+------------+
//| schema1| db1|
//+----------+------------+

How to validate my data with jsonSchema scala

I have a dataframe which looks like that
+--------------------+----------------+------+------+
| id | migration|number|string|
+--------------------+----------------+------+------+
|[5e5db036e0403b1a. |mig | 1| str |
+--------------------+----------------+------+------+
and I have a jsonSchema:
{
"title": "Section",
"type": "object",
"additionalProperties": false,
"required": ["migration", "id"],
"properties": {
"migration": {
"type": "string",
"additionalProperties": false
},
"string": {
"type": "string"
},
"number": {
"type": "number",
"min": 0
}
}
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please find inline code comments for the explanation
val newSchema : StructType = DataType.fromJson("""{
| "type": "struct",
| "fields": [
| {
| "name": "id",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "migration",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "number",
| "type": "integer",
| "nullable": false,
| "metadata": {}
| },
| {
| "name": "string",
| "type": "string",
| "nullable": true,
| "metadata": {}
| }
| ]
|}""".stripMargin).asInstanceOf[StructType] // Load you schema from JSON string
// println(newSchema)
val spark = Constant.getSparkSess // Create SparkSession object
//Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.","mig",1,"str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
dfNew.show()
//InCorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.",1,1,"str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // validating the data which will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string
dfInvalid.show()
val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare SQL dataframe with JSON schema
diffColumn.foreach(_.name) // Print the Diff columns

Dataframe columns does not keep order and columns with null values are excluded while writing to CosmosDB Collection

I tried to copy data into cosmosDB collection from a dataframe in spark.
The data is writing into cosmosDB , but with two issues.
The order of column in dataframe is not maintaining in cosmosDB.
Columns with null values are not written in cosmosDB, they are totally excluded.
Below is the data available in dataframe:
+-------+------+--------+---------+---------+-------+
| NUM_ID| TIME| SIG1| SIG2| SIG3| SIG4|
+-------+------+--------+---------+---------+-------+
|X00030 | 13000|35.79893| 139.9061| 48.32786| null|
|X00095 | 75000| null| null| null|5860505|
|X00074 | 43000| null| 8.75037| 98.9562|8014505|
Below is the code written in spark to copy the dataframe into cosmosDB.
val finalSignals = spark.sql("""SELECT * FROM db.tableName""")
val toCosmosDF = finalSignals.withColumn("NUM_ID", trim(col("NUM_ID"))).withColumn("SIG1", round(col("SIG1"),5)).select("NUM_ID","TIME","SIG1","SIG2","SIG3","SIG4")
//write DF into COSMOSDB
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.SaveMode
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
val writeConfig = Config(Map(
"Endpoint" -> "xxxxxxxx",
"Masterkey" -> "xxxxxxxxxxx",
"Database" -> "xxxxxxxxx",
"Collection" -> "xxxxxxxxx",
"preferredRegions" -> "xxxxxxxxx",
"Upsert" -> "true"
))
toCosmosDF.write.mode(SaveMode.Append).cosmosDB(writeConfig)
Below is the data written into cosmosDB.
"SIG3": 48.32786,
"SIG2": 139.9061,
"TIME": 13000,
"NUM_ID": "X00030",
"id": "xxxxxxxxxxxx2a",
"SIG1": 35.79893,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"TIME": 75000,
"NUM_ID": "X00095",
"id": "xxxxxxxxxxxx2a",
"_rid": "xxxxxxxxxxxx",
"SIG4": 5860505,
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
{
"SIG3": 98.9562,
"SIG2": 8.75037,
"TIME": 43000,
"NUM_ID": "X00074",
"id": "xxxxxxxxxxxx2a",
"SIG4": 8014505,
"_rid": "xxxxxxxxxxxx",
"_self": "xxxxxxxxxxxxxxxxxx",
"_etag": "\"xxxxxxxxxxxxxxxx\"",
"_attachments": "attachments/",
"_ts": 1571390120
}
The entry for columns with null in dataframe is missing in cosmosDB document.
The data written into cosmosDB is not having the column order which is there in dataframe.
How to resolve these two issues?

Merge Spark dataframe rows based on key column in Scala

I have a streaming Dataframe with 2 columns. A key column represented as String and an objects column which is an array containing one object element. I want to be able to merge records or rows in the Dataframe with the same key such that the merged records form an array of objects.
Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}] |
|abc | [{"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
Merged Dataframe
-------------------------------------------------------------------------
|key | objects |
-------------------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}, {"name":
"image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
--------------------------------------------------------------------------
One option to do this to convert this into a PairedRDD and apply the reduceByKey function, but I'd prefer to do this with Dataframes if possible since it'd more optimal. Is there any way to do this with Dataframes without compromising on performance?
Assuming column objects is an array of a single JSON string, here's how you can merge objects by key:
import org.apache.spark.sql.functions._
case class Obj(name: String, `type`: String, code: String)
val df = Seq(
("abc", Obj("file", "sample", "123")),
("abc", Obj("image", "sample", "456")),
("xyz", Obj("doc", "sample", "707"))
).
toDF("key", "object").
select($"key", array(to_json($"object")).as("objects"))
df.show(false)
// +---+-----------------------------------------------+
// |key|objects |
// +---+-----------------------------------------------+
// |abc|[{"name":"file","type":"sample","code":"123"}] |
// |abc|[{"name":"image","type":"sample","code":"456"}]|
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// +---+-----------------------------------------------+
df.groupBy($"key").agg(collect_list($"objects"(0)).as("objects")).
show(false)
// +---+---------------------------------------------------------------------------------------------+
// |key|objects |
// +---+---------------------------------------------------------------------------------------------+
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// |abc|[{"name":"file","type":"sample","code":"123"}, {"name":"image","type":"sample","code":"456"}]|
// +---+---------------------------------------------------------------------------------------------+