My JSON response contains columns and rows separately. How do I parse the following data in PySpark (mapping the columns to the rows)?
The response is as follows:
Response = '{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}'
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, i.e.,
Response = '{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}'
use json.loads() to create a dictionary from this Response string:
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve the row data (the input data for the dataframe) and the column data (used to build the dataframe schema) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
# Build the schema (as a string) from the response column data.
schema_str = "StructType(["

# List of columns that arrive as DateTime. These are first created as
# StringType columns; each column in convert is later cast to TimestampType.
convert = []

for c in l:
    # column name
    col_name = c[0]
    # column type
    if c[1]['type'] == 'Numeric':
        col_type = 'IntegerType()'
    elif c[1]['type'] == 'Text':
        col_type = 'StringType()'
    elif c[1]['type'] == 'DateTime':
        # create DateTime columns as StringType for now, to cast to TimestampType later
        col_type = 'StringType()'
        convert.append(col_name)  # remember the columns that need converting
    # whether the column is nullable
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'

schema_str = schema_str[:-1] + '])'
print(schema_str)
# StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the schema string above (evaluated with eval) to create the dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description| last_updated|
+---+----+-----------+--------------------+
| 1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
| 4| bar| null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description| last_updated|
+---+----+-----------+-------------------+
| 1| foo|Lorem ipsum|2016-10-26 00:09:14|
| 4| bar| null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)
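As an alternative to building a schema string and eval-ing it, the same schema can be assembled directly from StructField objects. Here is a minimal sketch reusing the same data_dict (this is an option, not part of the original answer; DateTime columns are handled the same way as above):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Map the response's type names to Spark types; DateTime stays a string here
# and is cast with to_timestamp afterwards, exactly as above.
type_map = {'Numeric': IntegerType(), 'Text': StringType(), 'DateTime': StringType()}

fields, convert = [], []
for column in data_dict['columns']:
    for col_name, meta in column.items():
        fields.append(StructField(col_name, type_map[meta['type']], meta['nullable']))
        if meta['type'] == 'DateTime':
            convert.append(col_name)

df = spark.createDataFrame(data=data_dict['rows'], schema=StructType(fields))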
Related
I have an array column:
|-- packages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- packageId: string (nullable = true)
| | |-- triggers: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How do I get a new column with all the packageId values?
Example of the column:
"packages": [
{
"packageId": "package1",
"triggers": {
"1": "2"
}
},
{
"packageId": "package2",
"triggers": {
"1": "2",
"2": "2"
}
}
]
to
package1,package2
I used Spark 2.4.5:
df.withColumn("packageList", explode(df.col("packages").getField("packageId")))
  .groupBy(..)
  .agg(concat_ws(",", collect_set("packageList")))
It works for me.
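For reference, a rough PySpark sketch of the same idea (the grouping column some_key is a placeholder, since the original snippet elides it):
from pyspark.sql import functions as F

# Pull every packageId out of the packages array, then collapse the distinct
# values into one comma-separated string per group.
result = (
    df.withColumn("packageList", F.explode(F.col("packages.packageId")))
      .groupBy("some_key")
      .agg(F.concat_ws(",", F.collect_set("packageList")).alias("packageIds"))
)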
I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure.
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
Required JSON output in the format below, using Apache Spark with Scala.
[{
    "vendor_name": "Vendor 1",
    "count": 10,
    "categories": [{
        "name": "Category 1",
        "count": 4,
        "subCategories": [{
                "name": "Sub Category 1",
                "count": 1
            },
            {
                "name": "Sub Category 2",
                "count": 1
            },
            {
                "name": "Sub Category 3",
                "count": 1
            },
            {
                "name": "Sub Category 4",
                "count": 1
            }
        ]
    }]
}]
//read file into DataFrame
scala> val df = spark.read.format("csv").option("header", "true").load(<input CSV path>)
df: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 4 more fields]
scala> df.show(false)
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor_Name|count|Categories|Category_Count|Subcategory |Subcategory_Count|
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor1 |10 |Category 1|4 |Sub Category 1|1 |
|Vendor1 |10 |Category 1|4 |Sub Category 2|2 |
|Vendor1 |10 |Category 1|4 |Sub Category 3|3 |
|Vendor1 |10 |Category 1|4 |Sub Category 4|4 |
+-----------+-----+----------+--------------+--------------+-----------------+
//convert into the desired JSON format
scala> val df1 = df.groupBy("Vendor_Name","count","Categories","Category_Count").agg(collect_list(struct(col("Subcategory").alias("name"),col("Subcategory_Count").alias("count"))).alias("subCategories")).groupBy("Vendor_Name","count").agg(collect_list(struct(col("Categories").alias("name"),col("Category_Count").alias("count"),col("subCategories"))).alias("categories"))
df1: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 1 more field]
scala> df1.printSchema
root
|-- Vendor_Name: string (nullable = true)
|-- count: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- count: string (nullable = true)
| | |-- subCategories: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- count: string (nullable = true)
//Write df1 in JSON format
scala> df1.write.format("json").mode("append").save(<output Path>)
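For comparison only (the question asks for Scala), a rough PySpark sketch of the same two-level aggregation, assuming the same column names as the CSV above:
from pyspark.sql import functions as F

# First collect subcategories per category, then collect categories per vendor.
df1 = (
    df.groupBy("Vendor_Name", "count", "Categories", "Category_Count")
      .agg(F.collect_list(F.struct(
              F.col("Subcategory").alias("name"),
              F.col("Subcategory_Count").alias("count"))).alias("subCategories"))
      .groupBy("Vendor_Name", "count")
      .agg(F.collect_list(F.struct(
              F.col("Categories").alias("name"),
              F.col("Category_Count").alias("count"),
              F.col("subCategories"))).alias("categories"))
)
df1.write.format("json").mode("append").save("<output path>")  # placeholder path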
I have a sample file which contains JSON strings. How do I process this type of file in Spark?
Sample file:
{"Id":"240","Page":"dashboard","test":"working"}
{"Amt": "0.0","deliveryfee": "Free","ProductList": "{{ProductId=1,Price=200,Quantity=1},{ProductId=2,Price=600,Quantity=1}}","sample": "data"}
Reading the file as JSON:
val df = spark.read.option("multiLine", "true").json("/data/test/test.json")
df.printSchema
root
|-- Amt: string (nullable = true)
|-- ProductList: string (nullable = true)
|-- deliveryfee: string (nullable = true)
|-- sample: string (nullable = true)
printSchema shows ProductList as a string, but it is not.
You probably want something like this:
{
"Amt": "0.0",
"deliveryfee": "Free",
"ProductList": [{
"ProductId": 1,
"Price": 200,
"Quantity": 1
}, {
"ProductId": 2,
"Price": 600,
"Quantity": 1
}],
"sample": "data"
}
Edit: the point is that, the way it is written, that field in your JSON is a string. You either need to change your JSON, or work with that field as a string.
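If the producer can be changed to emit valid JSON in that field (as in the snippet above), here is a PySpark sketch of parsing it with from_json; the DataFrame name df and the integer field types are assumptions:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

# Assumed element schema for a fixed ProductList array; adjust the types as needed.
product_schema = ArrayType(StructType([
    StructField("ProductId", IntegerType()),
    StructField("Price", IntegerType()),
    StructField("Quantity", IntegerType()),
]))

# Parse the JSON string column into an array of structs.
df_parsed = df.withColumn("ProductList", F.from_json("ProductList", product_schema))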
I have a JSON input like this:
{
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
I need to merge the 3 columns and extract the needed values for each column, and this is the output I need:
{
"out": {
"1": 5,
"2": [10, 11, 20],
"3": "a"
}
}
I tried to create a UDF to transform these 3 columns into 1, but I could not figure out how to define MapType() with mixed value types: IntegerType(), ArrayType(IntegerType()) and StringType() respectively.
Thanks in advance!
You need to define the resulting type of the UDF using StructType, not MapType, like this:
from pyspark.sql.types import *
udf_result = StructType([
StructField('1', IntegerType()),
StructField('2', ArrayType(StringType())),
StructField('3', StringType())
])
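A minimal sketch of how such a UDF could be wired up, assuming a DataFrame df whose top-level columns are named "1", "2" and "3" as in the input JSON (the function and names are illustrative, not from the original answer):
from pyspark.sql import functions as F

# The ids under "2" come back as strings because the declared field type
# above is ArrayType(StringType()).
def combine(c1, c2, c3):
    ids = list(c2["list"].asDict().keys()) if c2 and c2["list"] else None
    return (c1["value"] if c1 else None, ids, c3["key"] if c3 else None)

combine_udf = F.udf(combine, udf_result)
out_df = df.select(combine_udf("1", "2", "3").alias("out"))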
MapType() is used for (key, value) pair definitions, not for nested data frames. What you're looking for is StructType().
You can load it directly using createDataFrame but you'd have to pass a schema, so this way is easier:
import json
data_json = {
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
a=[json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- value: long (nullable = true)
|-- 2: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- list: struct (nullable = true)
| | |-- 10: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 11: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 20: struct (nullable = true)
| | | |-- id: long (nullable = true)
|-- 3: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- key: string (nullable = true)
Now to access the nested data. Note that column "2" is more deeply nested than the others:
nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
[df[c].id.alias(c) for c in df.columns]
)
df.printSchema()
root
|-- 1: long (nullable = true)
|-- 3: long (nullable = true)
|-- 2: array (nullable = false)
| |-- element: long (containsNull = true)
It's not exactly your final output since you want it nested in an "out" column:
import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()
root
|-- out: struct (nullable = false)
| |-- 1: long (nullable = true)
| |-- 3: long (nullable = true)
| |-- 2: array (nullable = false)
| | |-- element: long (containsNull = true)
Finally back to JSON:
df.toJSON().first()
'{"1":1,"3":3,"2":[10,11,20]}'
How do I add/append a column from a different data frame? I am trying to find the percentile of placeNames which are rated 3 and above.
// sc : An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
//df.show()
val res = df.withColumn("visited", explode($"visited"))
val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
val result2 = res
.filter($"visited.rating" < 4)
.groupBy($"requestId", $"visited.placeName")
.agg(count("*").alias("top"))
result1.show()
result2.show()
val finalResult = result1.join(result2, result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer").show()
result1 has rows with the total count and result2 has the count after filtering. Now I am trying to find:
sqlContext.sql("select top/total as percentile from temp groupBy placeName")
But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without doing a join?
My schema:
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3",
"famousRest": "N/A",
"placeId": "XSdce2"
}
]
}
UK top = 1 count = 1
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "4.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 2 count = 3
Canada top = 1 count = 3
PlaceName   percentile
US          (1+1+2)/(3+1+3) * 100
Canada      (1+1)/(3+3) * 100
UK          1 * 100
Schema:
root
 |-- country: string (nullable = true)
 |-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- placeId: string (nullable = true)
| | |-- placeName: string (nullable = true)
| | |-- famousRest: string (nullable = true)
| | |-- rating: string (nullable = true)
Also, is there a way to do this without doing a join?
No.
Can someone tell me what I am doing wrong here?
Nothing. If you don't need both copies of the join columns, use this:
result1.join(result2, List("placeName","customerId"), "outer")
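For completeness, a rough sketch of the percentile step on top of that join, written in PySpark syntax for illustration (the Scala version is analogous; null handling for the outer join is left out):
from pyspark.sql import functions as F

# Join on the shared key names (no duplicate columns), then compute top/total per place.
joined = result1.join(result2, ["placeName", "customerId"], "outer")
percentile = (
    joined.groupBy("placeName")
          .agg((F.sum("top") / F.sum("total") * 100).alias("percentile"))
)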