How to create nested JSON using Apache Spark with Scala

I am trying to create a nested JSON from my Spark DataFrame, which has data in the following structure.
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
Required JSON output in the format below, using Apache Spark with Scala:
[{
    "vendor_name": "Vendor 1",
    "count": 10,
    "categories": [{
        "name": "Category 1",
        "count": 4,
        "subCategories": [{
                "name": "Sub Category 1",
                "count": 1
            },
            {
                "name": "Sub Category 2",
                "count": 2
            },
            {
                "name": "Sub Category 3",
                "count": 3
            },
            {
                "name": "Sub Category 4",
                "count": 4
            }
        ]
    }]
}]

//read file into DataFrame
scala> val df = spark.read.format("csv").option("header", "true").load(<input CSV path>)
df: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 4 more fields]
scala> df.show(false)
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor_Name|count|Categories|Category_Count|Subcategory |Subcategory_Count|
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor1 |10 |Category 1|4 |Sub Category 1|1 |
|Vendor1 |10 |Category 1|4 |Sub Category 2|2 |
|Vendor1 |10 |Category 1|4 |Sub Category 3|3 |
|Vendor1 |10 |Category 1|4 |Sub Category 4|4 |
+-----------+-----+----------+--------------+--------------+-----------------+
//convert into the desired JSON format
scala> val df1 = df.groupBy("Vendor_Name","count","Categories","Category_Count").agg(collect_list(struct(col("Subcategory").alias("name"),col("Subcategory_Count").alias("count"))).alias("subCategories")).groupBy("Vendor_Name","count").agg(collect_list(struct(col("Categories").alias("name"),col("Category_Count").alias("count"),col("subCategories"))).alias("categories"))
df1: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 1 more field]
scala> df1.printSchema
root
|-- Vendor_Name: string (nullable = true)
|-- count: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- count: string (nullable = true)
| | |-- subCategories: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- count: string (nullable = true)
//write df1 out in JSON format
scala> df1.write.format("json").mode("append").save(<output Path>)

Related

JSON response contains columns and rows separately

My JSON response contains the columns and rows separately. How do I parse the following data in PySpark (mapping the columns to the rows)?
The response is as follows:
Response = '{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}'
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, i.e.,
Response = '{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}'
use json.loads() to create a dictionary from this Response string.
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve the row data (the data for the DataFrame) and the column data (used to create the schema for the DataFrame) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
#schema_str builds a schema (as a string) from the response's column metadata
schema_str = "StructType(["
#convert holds the DateTime columns: they are created as StringType first
#and then cast to TimestampType afterwards
convert = []
for c in l:
    #column name
    col_name = c[0]
    #column type
    if(c[1]['type'] == 'Numeric'):
        col_type = 'IntegerType()'
    elif(c[1]['type'] == 'Text'):
        col_type = 'StringType()'
    elif(c[1]['type'] == 'DateTime'):
        #read DateTime columns as StringType for now, convert to TimestampType later
        col_type = 'StringType()'
        convert.append(col_name) #remember the columns to be converted
    #whether the column is nullable or not
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'
schema_str = schema_str[:-1] + '])'
print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the schema string above to create the DataFrame:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description| last_updated|
+---+----+-----------+--------------------+
| 1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
| 4| bar| null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description| last_updated|
+---+----+-----------+-------------------+
| 1| foo|Lorem ipsum|2016-10-26 00:09:14|
| 4| bar| null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)
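If you prefer not to build the schema as a string and eval() it, here is a minimal alternative sketch that derives the same schema directly as StructType objects from data_dict (the type_map name is mine; the mapping simply mirrors the if/elif chain above):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#assumed mapping from the response's type names to Spark types;
#DateTime is kept as StringType here and cast to TimestampType afterwards, as above
type_map = {"Numeric": IntegerType(), "Text": StringType(), "DateTime": StringType()}

fields = []
convert = []
for column in data_dict["columns"]:
    for col_name, props in column.items():
        fields.append(StructField(col_name, type_map[props["type"]], props["nullable"]))
        if props["type"] == "DateTime":
            convert.append(col_name)

df = spark.createDataFrame(data=rows, schema=StructType(fields))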

pyspark: Is it possible to create an array with missing elements in one struct

My input DataFrame schema is like below. The difference between elements 1 and 2 in d is that 1 has attributes a, b, c, d while 2 has only a, b, c.
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- d: struct (nullable = true)
| |-- 1: struct (nullable = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: double (nullable = true)
| |-- 2: struct (nullable = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
I am trying to explode the elements of d using the code below:
df2 = inputDF.withColumn("d1",f.explode(f.array("d.*").getField("c")))
and I am getting the error pyspark.sql.utils.AnalysisException: cannot resolve 'array(d.1, d.2)' due to data type mismatch: input to function array should all be the same type, but it's [struct<a:string,b:string,c:string,d:double>, struct<a:string,b:string,c:string>];
'Project [a#832, b#833, c#834, d#835, explode(array(d#835.1, d#835.2)[c]) AS d1#843]
+- Relation[a#832,b#833,c#834,d#835] json
Is there any way to instruct the function to assume NULLs for the missing columns in the input to the array function?
You can explode an array of structs where one of the elements is missing a field, as in your case, as follows:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, ArrayType, StructField, StringType
spark = SparkSession \
    .builder \
    .appName("SparkTesting") \
    .getOrCreate()
d_schema = ArrayType(StructType([
    StructField('a', StringType(), nullable=True),
    StructField('b', StringType(), nullable=True),
    StructField('c', StringType(), nullable=True),
    StructField('d', StringType(), nullable=True),
]))
df_schema = (StructType()
    .add("a", StringType(), nullable=True)
    .add("b", StringType(), nullable=True)
    .add("c", StringType(), nullable=True)
    .add("d", d_schema, nullable=True))
item1 = {
    "a": "a1",
    "b": "b1",
    "c": "c1",
    "d": [
        {
            "a": "a1",
            "b": "b1",
            "c": "c1",
            "d": "d1"
        },
        {
            "a": "a1",
            "b": "b1",
            "c": "c1",
        }
    ],
}
df = spark.createDataFrame([item1], schema=df_schema)
df.printSchema()
df.show(truncate=False)
df2 = df.withColumn("d1", f.explode(col("d")))
df2.printSchema()
df2.show(truncate=False)
df2.select("d1.c").show()
+---+---+---+--------------------------------------+------------------+
|a |b |c |d |d1 |
+---+---+---+--------------------------------------+------------------+
|a1 |b1 |c1 |[{a1, b1, c1, d1}, {a1, b1, c1, null}]|{a1, b1, c1, d1} |
|a1 |b1 |c1 |[{a1, b1, c1, d1}, {a1, b1, c1, null}]|{a1, b1, c1, null}|
+---+---+---+--------------------------------------+------------------+
In case you are not sure whether the array field d itself will be null, it is advisable to use the explode_outer() function instead of explode().
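For illustration, a one-line variant of the explode line above (same df; explode_outer keeps the row, with d1 = null, when the d array is null or empty):
from pyspark.sql.functions import explode_outer
df2 = df.withColumn("d1", explode_outer(col("d")))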
As per the comment, to match the original schema, the code below will work:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()
d_inter_schema = (StructType([
    StructField('a', StringType(), nullable=True),
    StructField('b', StringType(), nullable=True),
    StructField('c', StringType(), nullable=True),
    StructField('d', StringType(), nullable=True),
]))
d_schema = StructType().add("1", d_inter_schema, nullable=True).add("2", d_inter_schema, nullable=True)
df_schema = (StructType()
    .add("a", StringType(), nullable=True)
    .add("b", StringType(), nullable=True)
    .add("c", StringType(), nullable=True)
    .add("d", d_schema, nullable=True))
item1 = {
    "a": "a1",
    "b": "b1",
    "c": "c1",
    "d": {
        "1": {
            "a": "a1",
            "b": "b1",
            "c": "c1",
            "d": "d1"
        },
        "2": {
            "a": "a1",
            "b": "b1",
            "c": "c1",
        }
    },
}
df = spark.createDataFrame([item1], schema=df_schema)
df.printSchema()
df.show(truncate=False)
+---+---+---+--------------------------------------+
|a |b |c |d |
+---+---+---+--------------------------------------+
|a1 |b1 |c1 |{{a1, b1, c1, d1}, {a1, b1, c1, null}}|
+---+---+---+--------------------------------------+
df.select("d.1.c", "d.2.c").show()
+---+---+
| c| c|
+---+---+
| c1| c1|
+---+---+
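If redefining the input schema is not an option, a different hedged sketch (not part of the answer above) is to pad the shorter struct with an explicit null field so that array() sees two identical struct types; this assumes the original nested d.1 / d.2 schema from the question:
import pyspark.sql.functions as f

#build d.2 as a struct with the same four fields as d.1, filling "d" with a typed null,
#then extract "c" from each element and explode, as in the original attempt
df2 = inputDF.withColumn(
    "d1",
    f.explode(
        f.array(
            f.col("d.1"),
            f.struct(
                f.col("d.2.a").alias("a"),
                f.col("d.2.b").alias("b"),
                f.col("d.2.c").alias("c"),
                f.lit(None).cast("double").alias("d"),
            ),
        ).getField("c")
    ),
)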

How to get a field from an array and concatenate the values into a string (Spark DataFrame)

I have an array column:
|-- packages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- packageId: string (nullable = true)
| | |-- triggers: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How can I get a new column with all packageId values?
Example of the column:
"packages": [
{
"packageId": "package1",
"triggers": {
"1": "2"
}
},
{
"packageId": "package2",
"triggers": {
"1": "2",
"2": "2"
}
}
]
to:
package1,package2
I am using Spark 2.4.5.
df.withColumn("packageList", explode(df.col("packages").getField("packageId")))
.groupBy(..)
.agg(concat_ws(",", collect_set("packageList")))
Its work for me
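For illustration, here is a self-contained PySpark sketch of the same approach (the sample data, the session name, and the use of a global agg instead of the elided groupBy are my own assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, concat_ws, collect_set
from pyspark.sql.types import ArrayType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.appName("packages-demo").getOrCreate()

schema = StructType([
    StructField("packages", ArrayType(StructType([
        StructField("packageId", StringType()),
        StructField("triggers", MapType(StringType(), StringType())),
    ])))
])
data = [([("package1", {"1": "2"}), ("package2", {"1": "2", "2": "2"})],)]
df = spark.createDataFrame(data, schema)

#getField("packageId") on the array of structs yields an array of packageIds,
#which explode turns into one row per packageId
result = (df
    .withColumn("packageId", explode(col("packages").getField("packageId")))
    .agg(concat_ws(",", collect_set("packageId")).alias("packageList")))
result.show(truncate=False)  #package1,package2 (collect_set does not guarantee order)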

How to read JSON string files into a DataFrame with Spark

I have a sample file which has JSON strings in it. How do I process this type of file in Spark?
Sample file
{"Id":"240","Page":"dashboard","test":"working"}
{"Amt": "0.0","deliveryfee": "Free","ProductList": "{{ProductId=1,Price=200,Quantity=1},{ProductId=2,Price=600,Quantity=1}}","sample": "data"}
Reading the file as JSON:
val data = spark.read.option("multiLine", "true").json("/data/test/test.json")
data.printSchema
root
|-- Amt: string (nullable = true)
|-- ProductList: string (nullable = true)
|-- deliveryfee: string (nullable = true)
|-- sample: string (nullable = true)
printSchema shows ProductList as a string, but it is not.
You probably want something like this:
{
    "Amt": "0.0",
    "deliveryfee": "Free",
    "ProductList": [{
        "ProductId": 1,
        "Price": 200,
        "Quantity": 1
    }, {
        "ProductId": 2,
        "Price": 600,
        "Quantity": 1
    }],
    "sample": "data"
}
Edit: the point is that, the way your JSON is written, that field is a String; you either need to change your JSON or work with that field as a String.

PySpark UDF of MapType with mixed value types

I have a JSON input like this:
{
    "1": {
        "id": 1,
        "value": 5
    },
    "2": {
        "id": 2,
        "list": {
            "10": {
                "id": 10
            },
            "11": {
                "id": 11
            },
            "20": {
                "id": 20
            }
        }
    },
    "3": {
        "id": 3,
        "key": "a"
    }
}
I need to merge the 3 columns and extract the needed values for each column, and this is the output I need:
{
    "out": {
        "1": 5,
        "2": [10, 11, 20],
        "3": "a"
    }
}
I tried to create a UDF to transform these 3 columns into 1, but I could not figure out how to define MapType() with mixed value types: IntegerType(), ArrayType(IntegerType()) and StringType(), respectively.
Thanks in advance!
You need to define the resulting type of the UDF using StructType, not MapType, like this:
from pyspark.sql.types import *
udf_result = StructType([
    StructField('1', IntegerType()),
    StructField('2', ArrayType(StringType())),
    StructField('3', StringType())
])
MapType() is used for (key, value) pair definitions, not for nested data frames. What you're looking for is StructType().
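For illustration only, a hedged sketch of how such a UDF could be wired up, assuming a DataFrame df whose top-level struct columns are named "1", "2" and "3" exactly as in the JSON above (the lambda body, the merge_udf name and the lack of null handling are my own assumptions):
from pyspark.sql.functions import udf, col

merge_udf = udf(
    lambda c1, c2, c3: (
        c1["value"],                               #-> IntegerType()
        [str(item["id"]) for item in c2["list"]],  #-> ArrayType(StringType())
        c3["key"],                                 #-> StringType()
    ),
    udf_result,
)

out_df = df.select(merge_udf(col("1"), col("2"), col("3")).alias("out"))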
You can load it directly using createDataFrame but you'd have to pass a schema, so this way is easier:
import json
data_json = {
    "1": {
        "id": 1,
        "value": 5
    },
    "2": {
        "id": 2,
        "list": {
            "10": {
                "id": 10
            },
            "11": {
                "id": 11
            },
            "20": {
                "id": 20
            }
        }
    },
    "3": {
        "id": 3,
        "key": "a"
    }
}
a=[json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- value: long (nullable = true)
|-- 2: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- list: struct (nullable = true)
| | |-- 10: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 11: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 20: struct (nullable = true)
| | | |-- id: long (nullable = true)
|-- 3: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- key: string (nullable = true)
Now let's access the nested structures. Note that column "2" is more nested than the other ones:
nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
    cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
    [df[c].id.alias(c) for c in df.columns]
)
df.printSchema()
root
|-- 1: long (nullable = true)
|-- 3: long (nullable = true)
|-- 2: array (nullable = false)
| |-- element: long (containsNull = true)
It's not exactly your final output since you want it nested in an "out" column:
import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()
root
|-- out: struct (nullable = false)
| |-- 1: long (nullable = true)
| |-- 3: long (nullable = true)
| |-- 2: array (nullable = false)
| | |-- element: long (containsNull = true)
Finally, back to JSON:
df.toJSON().first()
'{"1":1,"3":3,"2":[10,11,20]}'