I have a sample file that contains JSON strings. How do I process this type of file in Spark?
Sample file
{"Id":"240","Page":"dashboard","test":"working"}
{"Amt": "0.0","deliveryfee": "Free","ProductList": "{{ProductId=1,Price=200,Quantity=1},{ProductId=2,Price=600,Quantity=1}}","sample": "data"}
Reading the file as JSON:
val data = spark.read.option("multiLine", "true").json("/data/test/test.json")
data.printSchema
root
|-- Amt: string (nullable = true)
|-- ProductList: string (nullable = true)
|-- deliveryfee: string (nullable = true)
|-- sample: string (nullable = true)
printSchema shows ProductList as a string, but it is not.
You probably want something like this:
{
"Amt": "0.0",
"deliveryfee": "Free",
"ProductList": [{
"ProductId": 1,
"Price": 200,
"Quantity": 1
}, {
"ProductId": 2,
"Price": 600,
"Quantity": 1
}],
"sample": "data"
}
Edit: the point is that the JSON in that field is a string, so you either need to change your JSON or work with that field as a string.
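If you can change the source so that ProductList is the valid JSON array shown above, the string column can then be parsed with from_json and an explicit schema. A minimal sketch (the schema below is an assumption about the field types):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
// Assumed element schema for the reformatted ProductList array shown above
val productSchema = ArrayType(StructType(Seq(
  StructField("ProductId", IntegerType),
  StructField("Price", IntegerType),
  StructField("Quantity", IntegerType)
)))
// Parse the JSON string column into an array<struct> column
val parsed = data.withColumn("ProductList", from_json(col("ProductList"), productSchema))
parsed.printSchema
If the source cannot be changed, the field simply remains a string column and has to be handled with custom string parsing.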
My JSON response contains columns and rows separately. How do I parse the following data in PySpark (mapping the columns to the rows)?
The response is as follows:
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }],
"rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, i.e.,
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Use json.loads() to create a dictionary from this Response string.
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve row data (data for dataframe) and column data (create schema for dataframe) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
#schema_str to create a schema (as a string) using response column data
schema_str = "StructType(["
convert = []
#list of columns that should be as DateTime.
#First create dataframe as StringType column for these and then convert each column..
#in convert (list) to TimestampType.
for c in l:
    # column name
    col_name = c[0]
    # column type
    if(c[1]['type'] == 'Numeric'):
        col_type = 'IntegerType()'
    elif(c[1]['type'] == 'Text'):
        col_type = 'StringType()'
    elif(c[1]['type'] == 'DateTime'):
        # converting DateTime type to StringType, to later convert to TimestampType
        col_type = 'StringType()'
        convert.append(col_name)  # appending columns to be converted to a list
    # whether the column is nullable or not
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'
schema_str = schema_str[:-1] + '])'
print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the above schema (string) to create the dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description| last_updated|
+---+----+-----------+--------------------+
| 1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
| 4| bar| null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description| last_updated|
+---+----+-----------+-------------------+
| 1| foo|Lorem ipsum|2016-10-26 00:09:14|
| 4| bar| null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)
Input data-frame:
{
"C_1" : "A",
"C_2" : "B",
"C_3" : [
{
"ID" : "ID1",
"C3_C2" : "V1",
"C3_C3" : "V2"
},
{
"ID" : "ID2",
"C3_C2" : "V3",
"C3_C3" : "V4"
},
{
"ID" : "ID3",
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
]
}
Desired Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID3" : {
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
}
C_3 is an array of n structs with each item having a unique ID. The new data-frame is expected to convert the n structs in C_3 into separate columns and name the columns as per the value of ID.
I am new to Spark & Scala. Any thoughts on how to achieve this transformation would be very helpful.
Thanks!
You can explode the structs, then pivot by ID:
val df2 = df.selectExpr("C_1","C_2","inline(C_3)")
.groupBy("C_1","C_2")
.pivot("ID")
.agg(first(struct("C3_C2","C3_C3")))
df2.show
+---+---+--------+--------+--------+
|C_1|C_2| ID1| ID2| ID3|
+---+---+--------+--------+--------+
| A| B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+
df2.printSchema
root
|-- C_1: string (nullable = true)
|-- C_2: string (nullable = true)
|-- ID1: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID2: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID3: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
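If the nested JSON shape from the question is needed rather than a DataFrame, the pivoted result can be serialized back to JSON; a minimal sketch:
// Each pivoted row becomes one JSON string with ID1/ID2/ID3 as nested objects
df2.toJSON.show(false)
// Or, for a single pretty-printed record (as used later in this thread):
println(df2.first.prettyJson)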
[Posting my hacky solution for reference.]
#mck's answer is probably the neat way to do it, but it wasn't sufficient for my use case. My dataframe had a lot of columns, and using all of them in a group-by followed by a pivot was an expensive operation.
For my use case, the IDs in C_3 were unique, known values, so that is the assumption made in this solution.
I was able to achieve the transformation as follows:
case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) //Manually add fields as required
val custom_flat_func = udf((a: Seq[Row]) => {
  var _id1: C3_Interm_Struct = null
  var _id2: C3_Interm_Struct = null
  for (item <- a) {
    val intermData = C3_Interm_Struct(item(1).toString, item(2).toString)
    if (item(0).equals("ID1")) {
      _id1 = intermData
    }
    else if (item(0).equals("ID2")) {
      _id2 = intermData
    }
    else if () // Manual expansion
      ..
  }
  Seq(C3_Out_Struct(_id1, _id2)) // Return type has to be Seq
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3")).selectExpr("C_1", "C_2", "inline(C_3)") // Expand the Seq, which has only 1 Row
flatDf.first.prettyJson
Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
}
}
UDFs are generally slow, but this is much faster than the pivot with group-by.
[There may be more efficient solutions; I am not aware of them at the time of writing this.]
I have an array column:
|-- packages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- packageId: string (nullable = true)
| | |-- triggers: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How do I get a new column with all packageId values?
Example of the column:
"packages": [
{
"packageId": "package1",
"triggers": {
"1": "2"
}
},
{
"packageId": "package2",
"triggers": {
"1": "2",
"2": "2"
}
}
]
to
package1,package2
I am using Spark 2.4.5.
df.withColumn("packageList", explode(df.col("packages").getField("packageId")))
.groupBy(..)
.agg(concat_ws(",", collect_set("packageList")))
It works for me.
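For reference, on Spark 2.4+ the explode/group-by round trip may be avoidable when the list only has to be collapsed per row; a minimal sketch, assuming packages.packageId extracts to an array<string> on your dataframe:
import org.apache.spark.sql.functions.{array_join, col}
// packages.packageId pulls packageId out of every struct in the array, giving array<string>;
// array_join (available since Spark 2.4) concatenates it into "package1,package2"
val withList = df.withColumn("packageList", array_join(col("packages.packageId"), ","))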
I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure.
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
Required JSON output in the below format, using Apache Spark with Scala:
[{
"vendor_name": "Vendor 1",
"count": 10,
"categories": [{
"name": "Category 1",
"count": 4,
"subCategories": [{
"name": "Sub Category 1",
"count": 1
},
{
"name": "Sub Category 2",
"count": 2
},
{
"name": "Sub Category 3",
"count": 3
},
{
"name": "Sub Category 4",
"count": 4
}
]
}]
}]
//read file into DataFrame
scala> val df = spark.read.format("csv").option("header", "true").load(<input CSV path>)
df: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 4 more fields]
scala> df.show(false)
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor_Name|count|Categories|Category_Count|Subcategory |Subcategory_Count|
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor1 |10 |Category 1|4 |Sub Category 1|1 |
|Vendor1 |10 |Category 1|4 |Sub Category 2|2 |
|Vendor1 |10 |Category 1|4 |Sub Category 3|3 |
|Vendor1 |10 |Category 1|4 |Sub Category 4|4 |
+-----------+-----+----------+--------------+--------------+-----------------+
// convert into the desired JSON format
scala> val df1 = df.groupBy("Vendor_Name","count","Categories","Category_Count").agg(collect_list(struct(col("Subcategory").alias("name"),col("Subcategory_Count").alias("count"))).alias("subCategories")).groupBy("Vendor_Name","count").agg(collect_list(struct(col("Categories").alias("name"),col("Category_Count").alias("count"),col("subCategories"))).alias("categories"))
df1: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 1 more field]
scala> df1.printSchema
root
|-- Vendor_Name: string (nullable = true)
|-- count: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- count: string (nullable = true)
| | |-- subCategories: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- count: string (nullable = true)
//Write df in json format
scala> df1.write.format("json").mode("append").save(<output Path>)
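Note that count ends up as a JSON string because the CSV was read without schema inference; adding .option("inferSchema", "true") when reading (or casting the count columns) would make it numeric. To check the JSON shape before writing, something like this can be used:
// Print each vendor record as a JSON string to verify the nested structure
scala> df1.toJSON.show(false)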
I have a JSON input like this:
{
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
I need to merge the 3 columns and extract the needed values for each column, and this is the output I need:
{
"out": {
"1": 5,
"2": [10, 11, 20],
"3": "a"
}
}
I tried to create a UDF to transform these 3 columns into 1, but I could not figure out how to define MapType() with mixed value types: IntegerType(), ArrayType(IntegerType()), and StringType(), respectively.
Thanks in advance!
You need to define the resulting type of the UDF using StructType, not MapType, like this:
from pyspark.sql.types import *
udf_result = StructType([
StructField('1', IntegerType()),
StructField('2', ArrayType(StringType())),
StructField('3', StringType())
])
MapType() is used for (key, value) pair definitions, not for nested dataframes. What you're looking for is StructType().
You can load it directly using createDataFrame but you'd have to pass a schema, so this way is easier:
import json
data_json = {
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
a=[json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- value: long (nullable = true)
|-- 2: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- list: struct (nullable = true)
| | |-- 10: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 11: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 20: struct (nullable = true)
| | | |-- id: long (nullable = true)
|-- 3: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- key: string (nullable = true)
Now to access nested dataframes. Note that column "2" is more nested than the other ones:
nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
    [df[c].id.alias(c) for c in df.columns]
)
df.printSchema()
root
|-- 1: long (nullable = true)
|-- 3: long (nullable = true)
|-- 2: array (nullable = false)
| |-- element: long (containsNull = true)
It's not exactly your final output since you want it nested in an "out" column:
import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()
root
|-- out: struct (nullable = false)
| |-- 1: long (nullable = true)
| |-- 3: long (nullable = true)
| |-- 2: array (nullable = false)
| | |-- element: long (containsNull = true)
Finally back to JSON:
df.toJSON().first()
'{"1":1,"3":3,"2":[10,11,20]}'