Convert SQL output to JSON using Spark - scala

I have a Spark SQL query (using Scala as the language) which gives output as the following table, where {name, type, category} is unique. Only type has a limited set of values (5-6 unique types).
name     type    category   value
First    type1   cat1       value1
First    type1   cat2       value2
First    type1   cat3       value3
First    type2   cat1       value1
First    type2   cat5       value4
Second   type1   cat1       value5
Second   type1   cat4       value5
I'm looking for a way to convert it into JSON with Spark such that the output looks like the following, basically one entry for every name & type combination.
[
  {
    "name": "First",
    "type": "type1",
    "result": {
      "cat1": "value1",
      "cat2": "value2",
      "cat3": "value3"
    }
  },
  {
    "name": "First",
    "type": "type2",
    "result": {
      "cat1": "value1",
      "cat5": "value4"
    }
  },
  {
    "name": "Second",
    "type": "type1",
    "result": {
      "cat1": "value5",
      "cat4": "value5"
    }
  }
]
Is this possible via Spark Scala? Any pointers or references would be really helpful.
Eventually I have to write the JSON output to S3, so if this is possible during the write, that would also be okay.

You can groupBy and collect_set, then finally map_from_entries, as below:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}

val result = df
  .groupBy("name", "type")
  .agg(collect_set(struct("category", "value")).as("result"))
  .withColumn("result", map_from_entries(col("result")))
Exporting this as JSON directly, however, will not give you the result you expect, because Spark writes one JSON object per row rather than a single array. To get the expected result, you can use:
result.toJSON.collect.mkString("[", ",", "]")
Final result:
[
  {
    "name": "First",
    "type": "type1",
    "result": {
      "cat3": "value3",
      "cat1": "value1",
      "cat2": "value2"
    }
  },
  {
    "name": "First",
    "type": "type2",
    "result": {
      "cat1": "value1",
      "cat5": "value4"
    }
  },
  {
    "name": "Second",
    "type": "type1",
    "result": {
      "cat1": "value5",
      "cat4": "value5"
    }
  }
]
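Regarding the S3 part of the question: toJSON.collect pulls everything onto the driver, so the single-array string is only practical for modest result sizes. If the strict array format is not required, the DataFrame can be written straight to S3 as standard JSON Lines output (one object per line). A minimal sketch, with the bucket and prefix as placeholders:

// Standard Spark JSON output (one object per line), written in parallel
result.write.mode("overwrite").json("s3a://your-bucket/your-prefix/")

// Or keep the single-array string from above and upload it as one object
// with whatever S3 client you already use (e.g. the AWS SDK)
val jsonArray = result.toJSON.collect.mkString("[", ",", "]")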
Good luck!

Related

play framework json lookup inside array

I have a simple JSON:
{
  "name": "John",
  "placesVisited": [
    {
      "name": "Paris",
      "data": {
        "weather": "warm",
        "date": "31/01/22"
      }
    },
    {
      "name": "New York",
      "data": [
        {
          "weather": "warm",
          "date": "31/01/21"
        },
        {
          "weather": "cold",
          "date": "28/01/21"
        }
      ]
    }
  ]
}
As you can see, this JSON has a placesVisited field; when the name is "New York" the "data" field is a list, and when the name is "Paris" it is an object.
What I want to do is pull the placesVisited element where "name" is "New York" and then parse it into a case class I have. I can't use this case class for both objects in placesVisited because they have different types for the same field name.
So what I thought is to do something like:
(myJson \ "placesVisited"), and here I need to add something that will give me the element where name is "New York". How can I do that?
My result should be this:
{
  "name": "New York",
  "data": [
    {
      "weather": "warm",
      "date": "31/01/21"
    },
    {
      "weather": "cold",
      "date": "28/01/21"
    }
  ]
}
Something like this might work, but it's horrible haha:
(Json.parse(myjson) \ "placesVisited").as[List[JsObject]]
  .find(item => item.value.get("name").toString.contains("New York"))
  .getOrElse(throw new Exception("could not find New York element"))
  .as[NewYorkModel]
item.value.get("name").toString can be simplified slightly to (item \ "name").as[String], but otherwise there's not much to improve.
Another option is to use a case class Place(name: String, data: JsValue) and do it like this:
case class Place(name: String, data: JsValue)
implicit val placeReads: Reads[Place] = Json.reads[Place]

(Json.parse(myjson) \ "placesVisited")
  .as[List[Place]]
  .find(_.name == "New York")
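From there, if the goal is the NewYorkModel from the question, a sketch (it assumes NewYorkModel and an implicit Reads[NewYorkModel] are already defined on your side):

// Decode only the matching element's "data" field into the question's model
val newYork: Option[NewYorkModel] =
  (Json.parse(myjson) \ "placesVisited")
    .as[List[Place]]
    .find(_.name == "New York")
    .map(_.data.as[NewYorkModel])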

Pyspark: Best way to set json strings in dataframe column

I need to create a couple of columns in a DataFrame where I want to parse and store a JSON string. Here is one JSON I need to store in one column; the other JSONs are similar. Can you please help with how to transform and store this JSON string in the column? The values section needs to be filled from the values of other columns within the same DataFrame.
{
  "name": "",
  "headers": [
    {
      "name": "A",
      "dataType": "number"
    },
    {
      "name": "B",
      "dataType": "string"
    },
    {
      "name": "C",
      "dataType": "string"
    }
  ],
  "values": [
    [
      2,
      "some value",
      "some value"
    ]
  ]
}
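No answer is shown for this one, but one common approach is to build the structure with struct/array/lit and serialize it with to_json. A minimal sketch in Scala (the same functions exist in pyspark.sql.functions), assuming hypothetical source columns a, b and c for the values row; note that a Spark array column needs a single element type, so mixed-type values would have to be cast (here to string):

import org.apache.spark.sql.functions._

// Hypothetical column names a, b, c stand in for the real value columns.
val withJson = df.withColumn(
  "payload",
  to_json(
    struct(
      lit("").as("name"),
      array(
        struct(lit("A").as("name"), lit("number").as("dataType")),
        struct(lit("B").as("name"), lit("string").as("dataType")),
        struct(lit("C").as("name"), lit("string").as("dataType"))
      ).as("headers"),
      // array elements must share one type, so everything is cast to string here
      array(array(col("a").cast("string"), col("b"), col("c"))).as("values")
    )
  )
)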

Read JSON in ADF

In Azure Data Factory, I need to be able to process a JSON response. I don't want to hardcode the array position in case it changes, so something like this is out of the question:
@activity('Place Details').output.result.components[2].name
How can I get the name "123" where types = "number", given a JSON array like below?
"result": {
"components": [
{
"name": "ABC",
"types": [
"alphabet"
]
},
{
"name": "123",
"types": [
"number"
]
}
]
}
One example using the OPENJSON method:
DECLARE @json NVARCHAR(MAX) = '{
  "result": {
    "components": [
      {
        "name": "ABC",
        "types": [
          "alphabet"
        ]
      },
      {
        "name": "123",
        "types": [
          "number"
        ]
      }
    ]
  }
}'
;WITH cte AS (
    SELECT
        JSON_VALUE( o.[value], '$.name' ) [name],
        JSON_VALUE( o.[value], '$.types[0]' ) [types]
    FROM OPENJSON( @json, '$.result.components' ) o
)
SELECT [name]
FROM cte
WHERE types = 'number'
I will have a look at other methods.

How to get values from nested json array using spark?

I have this JSON:
val myJson = """{
  "record": {
    "recordId": 100,
    "name": "xyz",
    "version": "1.1",
    "input": [
      {
        "format": "Database",
        "type": "Oracle",
        "connectionStringId": "212",
        "connectionString": "ksfksfklsdflk",
        "schemaName": "schema1",
        "databaseName": "db1",
        "tables": [
          {
            "table_name": "one"
          },
          {
            "table_name": "two"
          }
        ]
      }
    ]
  }
}"""
I am using this code to get this JSON into a DataFrame:
val df = sparkSession.read.json(myJson)
I want the values of schemaName & databaseName; how can I get them?
val schemaName = df.select("record.input.schemaName") //not working
Someone, please help me.
You need to explode the array column record.input, then select the fields you want:
import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("record.input")).as("inputs"))
  .select("inputs.schemaName", "inputs.databaseName")
  .show
//+----------+------------+
//|schemaName|databaseName|
//+----------+------------+
//|   schema1|         db1|
//+----------+------------+
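A side note on the loading step: sparkSession.read.json called with a plain String treats it as a file path. To parse the JSON string itself, one option (a minimal sketch, reusing the sparkSession and myJson from the question) is to wrap it in a Dataset[String]:

import sparkSession.implicits._

// Parse the literal JSON string rather than treating it as a path
val df = sparkSession.read.json(Seq(myJson).toDS)
df.printSchema()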

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a PySpark DataFrame column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructType, but each time it returns an empty DataFrame.
I tried using json_tuple to parse, but there are no common keys to rejoin the DataFrames and the row numbers don't match up; I think it might have to do with some null fields.
The sub-field can be nullable.
Sample JSON
{
  "TIME": "datatime",
  "SID": "yjhrtr",
  "ID": {
    "Source": "Person",
    "AuthIFO": {
      "Prov": "Abc",
      "IOI": "123",
      "DETAILS": {
        "Id": "12345",
        "SId": "ABCDE"
      }
    }
  },
  "Content": {
    "User1": "AB878A",
    "UserInfo": "False",
    "D": "ghgf64G",
    "T": "yjuyjtyfrZ6",
    "Tname": "WE ARE THE WORLD",
    "ST": null,
    "TID": "BPV 1431: 1",
    "src": "test",
    "OT": "test2",
    "OA": "test3",
    "OP": "test34"
  },
  "Test": false
}
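No answer is shown for this one either, but a common way around a hand-written StructType that yields nulls or an empty result is to let Spark infer the schema from the strings themselves and then use from_json. A minimal sketch in Scala, assuming an active SparkSession named spark, a source DataFrame df, and a hypothetical column named raw holding the JSON strings:

import org.apache.spark.sql.functions._
import spark.implicits._

// Infer the schema from the data itself; a StructType that does not match the strings
// exactly (field names, nesting) is a common cause of all-null / empty results.
val inferredSchema = spark.read.json(df.select("raw").as[String]).schema

val parsed = df.withColumn("parsed", from_json(col("raw"), inferredSchema))

// Nested fields can then be selected and filtered like ordinary columns
parsed
  .select(col("parsed.ID.AuthIFO.DETAILS.Id").as("detailId"),
          col("parsed.Content.Tname").as("tname"))
  .where(col("parsed.Test") === false)
  .show(false)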