How to get values from nested json array using spark? - scala

I have this JSON:
val myJson = """{
  "record": {
    "recordId": 100,
    "name": "xyz",
    "version": "1.1",
    "input": [
      {
        "format": "Database",
        "type": "Oracle",
        "connectionStringId": "212",
        "connectionString": "ksfksfklsdflk",
        "schemaName": "schema1",
        "databaseName": "db1",
        "tables": [
          { "table_name": "one" },
          { "table_name": "two" }
        ]
      }
    ]
  }
}"""
I am using this code to load the JSON into a DataFrame (note that read.json expects a Dataset[String] rather than a plain String):
val df = sparkSession.read.json(Seq(myJson).toDS) // requires import sparkSession.implicits._
I want the values of schemaName and databaseName. How can I get them?
val schemaName = df.select("record.input.schemaName") //not working
Someone, please help me

You need to explode the array column record.input, then select the fields you want:
import org.apache.spark.sql.functions._

df.select(explode(col("record.input")).as("inputs"))
  .select("inputs.schemaName", "inputs.databaseName")
  .show
//+----------+------------+
//|schemaName|databaseName|
//+----------+------------+
//| schema1| db1|
//+----------+------------+
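If record.input is known to always hold a single element, a hedged alternative that avoids explode is to index into the array directly (getItem and getField are standard Column methods):
df.select(
  col("record.input").getItem(0).getField("schemaName").as("schemaName"),
  col("record.input").getItem(0).getField("databaseName").as("databaseName")
).show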

Related

Convert SQL output to JSON using Spark

I have a Spark SQL query (using Scala as the language) which gives output as the following table, where {name, type, category} is unique. Only type has a limited set of values (5-6 unique types).
+------+-----+--------+------+
|  name| type|category| value|
+------+-----+--------+------+
| First|type1|    cat1|value1|
| First|type1|    cat2|value2|
| First|type1|    cat3|value3|
| First|type2|    cat1|value1|
| First|type2|    cat5|value4|
|Second|type1|    cat1|value5|
|Second|type1|    cat4|value5|
+------+-----+--------+------+
I'm looking for a way to convert it into JSON with Spark such that the output is something like this, i.e. one object for every name & type combination:
[
{
"name": "First",
"type": "type1",
"result": {
"cat1": "value1",
"cat2": "value2",
"cat3": "value3"
}
},
{
"name": "First",
"type": "type2",
"result": {
"cat1": "value1",
"cat5": "value4"
}
},
{
"name": "Second",
"type": "type1",
"result": {
"cat1": "value5",
"cat4": "value5"
}
}
]
Is this possible via Spark Scala? Any pointers or references would be really helpful.
Eventually I have to write the JSON output to S3, so if this is possible during the write then that would also be okay.
You can use groupBy and collect_set, then finally map_from_entries, as below:
import org.apache.spark.sql.functions._

val result = df
  .groupBy("name", "type")
  .agg(collect_set(struct("category", "value")).as("result"))
  .withColumn("result", map_from_entries(col("result")))
Exporting this as JSON with df.write.json, however, will not give you the result you expect, because Spark writes JSON Lines (one object per line) rather than a single array. To get the expected result, you can use:
result.toJSON.collect.mkString("[", ",", "]")
Final result:
[
{
"name": "First",
"type": "type1",
"result": {
"cat3": "value3",
"cat1": "value1",
"cat2": "value2"
}
},
{
"name": "First",
"type": "type2",
"result": {
"cat1": "value1",
"cat5": "value4"
}
},
{
"name": "Second",
"type": "type1",
"result": {
"cat1": "value5",
"cat4": "value5"
}
}
]
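Note that toJSON.collect pulls everything onto the driver, which is fine here since there is at most one row per name & type combination. For the S3 part of the question, a hedged sketch (the bucket and paths are hypothetical):
// Standard write produces JSON Lines, not a bracketed array:
result.write.mode("overwrite").json("s3a://my-bucket/out/")

// To persist the single-array form built above, write the collected
// string back through a one-partition RDD:
val jsonArray = result.toJSON.collect.mkString("[", ",", "]")
spark.sparkContext.parallelize(Seq(jsonArray), 1)
  .saveAsTextFile("s3a://my-bucket/out-array/")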
Good luck!

Pyspark: Best way to set json strings in dataframe column

I need to create a couple of columns in a DataFrame in which I want to build and store JSON strings. Here is one JSON I need to store in a column; the others are similar. Can you please help with how to construct and store this JSON string in the column? The values section needs to be filled from other columns of the same DataFrame.
{
"name": "",
"headers": [
{
"name": "A",
"dataType": "number"
},
{
"name": "B",
"dataType": "string"
},
{
"name": "C",
"dataType": "string"
}
],
"values": [
[
2,
"some value",
"some value"
]
]
}
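One way to build this is to assemble the structure with struct/array and serialize it with to_json. Below is a hedged sketch in Scala (the question is PySpark, but the same functions exist in pyspark.sql.functions); the source column names a, b and c are hypothetical stand-ins:
import org.apache.spark.sql.functions._

val payload = to_json(
  struct(
    lit("").as("name"),
    // The headers section is static, so it can be built from literals.
    array(
      struct(lit("A").as("name"), lit("number").as("dataType")),
      struct(lit("B").as("name"), lit("string").as("dataType")),
      struct(lit("C").as("name"), lit("string").as("dataType"))
    ).as("headers"),
    // Spark arrays are homogeneous, so the mixed number/string row is cast
    // to strings here; the number will serialize as "2" rather than 2.
    array(array(col("a").cast("string"), col("b"), col("c"))).as("values")
  )
)

val withJson = df.withColumn("payload", payload)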

Read JSON in ADF

In Azure Data Factory, I need to be able to process a JSON response. I don't want to hardcode the array position in case it changes, so something like this is out of the question:
@activity('Place Details').output.result.components[2].name
How can I get the name 123 where types = number, given a JSON array like the one below?
"result": {
"components": [
{
"name": "ABC",
"types": [
"alphabet"
]
},
{
"name": "123",
"types": [
"number"
]
}
]
}
One example using the OPENJSON method:
DECLARE @json NVARCHAR(MAX) = '{
"result": {
"components": [
{
"name": "ABC",
"types": [
"alphabet"
]
},
{
"name": "123",
"types": [
"number"
]
}
]
}
}'
;WITH cte AS (
SELECT
JSON_VALUE( o.[value], '$.name' ) [name],
JSON_VALUE( o.[value], '$.types[0]' ) [types]
FROM OPENJSON( @json, '$.result.components' ) o
)
SELECT [name]
FROM cte
WHERE types = 'number'
I will have a look at other methods.
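Staying inside ADF pipeline activities instead, a hedged sketch using a Filter activity (the activity name FilterComponents is hypothetical):
Filter activity "FilterComponents":
  items:     @activity('Place Details').output.result.components
  condition: @contains(item().types, 'number')

Then reference the first match:
@activity('FilterComponents').output.value[0].name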

How to extract values from json string from api in scala?

I am trying to extract a specific value from each JSON object in a response from an API.
For example, the HTTP response is a JSON array of objects like this:
[
{
"trackerType": "WEB",
"id": 1,
"appId": "ap-website",
"host": {
"orgId": "ap",
"displayName": "AP Mart",
"id": "3",
"tenantId": "ap"
}
},
{
"trackerType": "WEB",
"id": 2,
"appId": "test-website",
"host": {
"orgId": "t1",
"tenantId": "trn11"
}
}
]
I want to extract, or keep, only the list of appId and tenantId values, as below:
[
{
"appId": "ap-website",
"tenantId": "ap"
},
{
"appId": "test-website",
"tenantId": "trn11"
}
]
If your HTTP response is quite big and you don't want to hold it all in memory, then consider using IO streams for parsing the body and serializing the result list.
Below is an example of how it can be done with the dijon library.
Add the dependency to your build file:
libraryDependencies += "me.vican.jorge" %% "dijon" % "0.5.0+18-46bbb74d" // Use %%% instead of %% for Scala.js
Import the following packages:
import com.github.plokhotnyuk.jsoniter_scala.core._
import dijon._
import scala.language.dynamics._
Parse the input stream, transforming it value by value in the callback and writing to the output stream:
val in = new java.io.ByteArrayInputStream(
"""
[
{
"trackerType": "WEB",
"id": 1,
"appId": "ap-website",
"host": {
"orgId": "ap",
"displayName": "AP Mart",
"id": "3",
"tenantId": "ap"
}
},
{
"trackerType": "WEB",
"id": 2,
"appId": "test-website",
"host": {
"orgId": "t1",
"tenantId": "trn11"
}
}
]
""".getBytes("UTF-8"))
val out = new java.io.BufferedOutputStream(System.out)
out.write('[')
// Scan the array element by element; returning `true` from the callback
// continues the scan, so the whole array is never materialized in memory.
scanJsonArrayFromStream[SomeJson](in) {
  var writeComma = false
  x =>
    if (writeComma) out.write(',') else writeComma = true
    // Keep only the two fields of interest.
    val json = obj("appId" -> x.appId, "tenantId" -> x.host.tenantId)
    writeToStream[SomeJson](json, out)(codec)
    true
} (codec)
out.write(']')
out.flush()
You can try it in Scastie.
When using this code in your application, replace the input and output streams with your actual source and destination.
There are other options for solving your task. Please add more context to help us select the simplest and most efficient solution.
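For instance, if the payload comfortably fits in memory, a plain JSON library is enough. A minimal sketch with circe (my assumption, not part of the original answer; add "io.circe" %% "circe-parser" to your build):
import io.circe._, io.circe.parser._

// Parse the body, then keep only appId and host.tenantId per element.
def extract(responseBody: String): Either[Error, Json] =
  parse(responseBody).map { json =>
    Json.fromValues(
      json.asArray.getOrElse(Vector.empty).map { item =>
        val c = item.hcursor
        Json.obj(
          "appId"    -> c.downField("appId").focus.getOrElse(Json.Null),
          "tenantId" -> c.downField("host").downField("tenantId").focus.getOrElse(Json.Null)
        )
      }
    )
  }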
Feel free to comment - I will be happy to help you in tuning the solution to your needs.

Scala Map -add new key and copy value from another key

Consider the following two sets of data:
JSON1 => {
"data": [
  {
    "id": "1-abc",
    "model": "Agile",
    "status": "open",
    "configuration": {
      "state": "running",
      "rootVolumeSize": "0.00000",
      "count": "2",
      "type": "large",
      "platform": "Linux"
    },
    "stateId": "123-567"
  }
]}
JSON2 => {
"data": [
  {
    "id": "1-abc",
    "model": "Agile",
    "configuration": {
      "state": "running",
      "diskSize": "0",
      "type": "small",
      "platform": "Windows"
    }
  }
]}
I need to compare JSON1 and JSON2 based on the first field, id, and if they match, merge JSON1 into JSON2 while retaining the existing values in JSON2 (only appending the fields that are not already present).
I have coded the same as below:
private def merger(json1: Seq[JSON], json2: Seq[JSON]): Seq[JSON] = {
  val abcKey = json1.groupBy(_.id).map { case (k, v) => (k, v.head) }
  val mergedRecords = for {
    xyzJson <- json2
  } yield abcKey.get(xyzJson.id) match {
    case Some(j1) => xyzJson.copy(status = j1.status, stateId = j1.stateId)
    case None     => xyzJson.copy(origin = "N/A")
  }
  mergedRecords
}
I am not able to arrive at a solution for reconciling the fields within the configuration map.
The expected result set should be like:
{
"data": [
  {
    "id": "1-abc",
    "model": "Agile",
    "status": "open",
    "configuration": {
      "state": "running",
      "diskSize": "0",
      "rootVolumeSize": "0.00000",
      "count": "2",
      "type": "small",
      "platform": "Windows"
    },
    "stateId": "123-567"
  }
]}
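Assuming configuration is modeled as a Map[String, String] on the JSON case class (my assumption, since the class definition is not shown), a minimal sketch of the reconciliation: ++ lets the right-hand operand win on duplicate keys, so JSON2's existing values are retained while JSON1-only fields are appended.
// Merge so that values already present in config2 (JSON2) are kept and
// fields only present in config1 (JSON1) are appended.
def mergeConfigs(config1: Map[String, String], config2: Map[String, String]): Map[String, String] =
  config1 ++ config2 // on key collisions the right-hand map (config2) wins

// Plugged into the merger above:
// case Some(j1) => xyzJson.copy(
//   status = j1.status,
//   stateId = j1.stateId,
//   configuration = mergeConfigs(j1.configuration, xyzJson.configuration)
// )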