How to add/append a column from a different data frame? I am trying to find the percentile of placeNames which are rated 3 and above.
// sc: an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sqlContext.jsonFile("temp.txt")
//df.show()
val res = df.withColumn("visited", explode($"visited"))
val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
val result2 = res
  .filter($"visited.rating" < 4)
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("top"))
result1.show()
result2.show()
val finalResult = result1.join(result2, result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer")
finalResult.show()
result1 holds the total counts and result2 holds the counts after filtering. Now I am trying to compute something like:
sqlContext.sql("select top/total as percentile from temp groupBy placeName")
But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without doing a join?
My schema:
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3",
"famousRest": "N/A",
"placeId": "XSdce2"
}
]
}
UK top = 1 count = 1
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "4.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 2 count = 3
Canada top = 1 count = 3
PlaceName  percentile
US         (1+1+2)/(3+1+3) * 100
Canada     (1+1)/(3+3) * 100
UK         1 * 100
Schema:
root
|-- country: string (nullable = true)
|-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- placeId: string (nullable = true)
| | |-- placeName: string (nullable = true)
| | |-- famousRest: string (nullable = true)
| | |-- rating: string (nullable = true)
Also is there a way to do this without doing a join?
No.
Can someone tell me what I am doing wrong here?
Nothing. If you don't need both copies of the join columns, use this:
result1.join(result2, List("placeName","customerId"), "outer")
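For the percentile itself (top/total per placeName, as in the expected output above), a minimal sketch is to aggregate the joined result. This assumes the result1/result2 column names from the question; coalesce covers places where no row passed the filter:
import org.apache.spark.sql.functions._

val joined = result1.join(result2, List("placeName", "customerId"), "outer")

val percentiles = joined
  .groupBy($"placeName")
  .agg((sum(coalesce($"top", lit(0))) / sum($"total") * 100).alias("percentile"))

percentiles.show()
This keeps the join from the answer above and just finishes the top/total calculation the question asks for.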
Related
My JSON response contains columns and rows separately. How to parse the following data in PySpark (mapping the columns with the rows)?
The response is as follows:
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }],
"rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, i.e.:
Response = '{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}'
Use json.loads() to create a dictionary from this Response string.
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve the row data (the data for the dataframe) and the column data (used to build the schema for the dataframe) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
# schema_str is used to build the schema (as a string) from the response column data
schema_str = "StructType(["
convert = []
# convert holds the columns that should end up as DateTime.
# The dataframe is first created with these as StringType columns, and each
# column in convert is later cast to TimestampType.
for c in l:
    # column name
    col_name = c[0]
    # column type
    if c[1]['type'] == 'Numeric':
        col_type = 'IntegerType()'
    elif c[1]['type'] == 'Text':
        col_type = 'StringType()'
    elif c[1]['type'] == 'DateTime':
        # read DateTime columns as StringType for now, convert to TimestampType later
        col_type = 'StringType()'
        convert.append(col_name)  # remember which columns still need converting
    # whether the column is nullable or not
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'
schema_str = schema_str[:-1] + '])'
print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the above schema string to create the dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description| last_updated|
+---+----+-----------+--------------------+
| 1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
| 4| bar| null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description| last_updated|
+---+----+-----------+-------------------+
| 1| foo|Lorem ipsum|2016-10-26 00:09:14|
| 4| bar| null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)
A DynamoDB table is exported to S3 and an AWS Glue crawler crawls the S3 data.
An AWS Glue job takes the crawled data as its source, and here's the schema after it was transformed by MergeLineItems:
def MergeLineItems(rec):
    rec["lineItems1"] = {}
    a = []
    for x in rec["lineItems"]:
        a.append(x["M"])
    rec["lineItems1"] = a
    return rec
mapped_dyF = Map.apply(frame = Transform0, f = MergeLineItems)
The schema is like this:
-- lineItems1: array
| |-- element: struct
| | |-- price: struct
| | | |-- N: string
| | |-- grade: struct
| | | |-- S: string
| | |-- expectedAmount: struct
| | | |-- N: string
| | |-- notifiedAmount: struct
| | | |-- N: string
When I run the AWS Glue job, the data that gets saved into DynamoDB looks like this:
[
{
"M":
{
"expectedAmount":
{
"M":
{
"N":
{
"S": "10"
}
}
},
"grade":
{
"M":
{
"S":
{
"S": "GradeAAA"
}
}
},
"notifiedAmount":
{
"M":
{
"N":
{
"S": "0"
}
}
},
"price":
{
"M":
{
"N":
{
"S": "2.15"
}
}
}
}
}
]
The data in the original DynamoDB table is different from this. How can I change the output into this:
[
{
"M":
{
"expectedAmount":
{
"N": "10"
},
"notifiedAmount":
{
"N": "0"
},
"grade":
{
"S": "GradeAAA"
},
"price":
{
"N": "2.15"
}
}
}
]
I got it working. Here's my answer:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydb", table_name = "data", transformation_ctx = "DataSource0")
Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("item.lineItems.L", "array", "lineItems", "array")], transformation_ctx = "Transform0")
from decimal import Decimal

def MergeLineItems(rec):
    rec["lineItems1"] = {}
    a = []
    for x in rec["lineItems"]:
        # unwrap the DynamoDB type descriptors (M/N/S) into plain values
        val = x["M"]["expectedAmount"]["N"]
        x["M"]["expectedAmount"] = Decimal(val)
        val = x["M"]["notifiedAmount"]["N"]
        x["M"]["notifiedAmount"] = Decimal(val)
        val = x["M"]["grade"]["S"]
        x["M"]["grade"] = str(val)
        val = x["M"]["price"]["N"]
        x["M"]["price"] = Decimal(val)
        a.append(x["M"])
    rec["lineItems1"] = a
    return rec
mapped_dyF = Map.apply(frame = Transform0, f = MergeLineItems)
mapped_dyF = DropFields.apply(mapped_dyF, paths=['lineItems'])
mapped_dyF = RenameField.apply(mapped_dyF, "lineItems1", "lineItems")
glueContext.write_dynamic_frame_from_options(
frame=mapped_dyF,
connection_type="dynamodb",
connection_options={
"dynamodb.region": "us-east-1",
"dynamodb.output.tableName": "mydb",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Thanks #Minah for posting this question and answer; I was looking for exactly this (mapping DynamoDB arrays exported to S3 from an AWS Glue Python ETL job) and this was the only helpful post I could find.
This version worked for me, removing the additional DropFields and RenameField steps and creating new items instead of overwriting them:
from decimal import Decimal

DataSource = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="data",
                                                           transformation_ctx="DataSource")
Transform = ApplyMapping.apply(frame=DataSource, mappings=[("item.lineItems.L", "array", "lineItems", "array")],
                               transformation_ctx="Transform")

def MergeLineItems(record):
    # rebuild each line item as a plain dict, dropping the DynamoDB type descriptors
    mappedLineItems = []
    for lineItem in record["lineItems"]:
        mappedLineItems.append({
            "expectedAmount": Decimal(lineItem["M"]["expectedAmount"]["N"]),
            "notifiedAmount": Decimal(lineItem["M"]["notifiedAmount"]["N"]),
            "grade": lineItem["M"]["grade"]["S"],
            "price": Decimal(lineItem["M"]["price"]["N"]),
        })
    record["lineItems"] = mappedLineItems
    return record

Mapped = Map.apply(frame=Transform, f=MergeLineItems, transformation_ctx="Mapped")
glueContext.write_dynamic_frame_from_options(
frame=Mapped,
connection_type="dynamodb",
connection_options={
"dynamodb.region": "us-east-1",
"dynamodb.output.tableName": "mydb",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Input data-frame:
{
"C_1" : "A",
"C_2" : "B",
"C_3" : [
{
"ID" : "ID1",
"C3_C2" : "V1",
"C3_C3" : "V2"
},
{
"ID" : "ID2",
"C3_C2" : "V3",
"C3_C3" : "V4"
},
{
"ID" : "ID3",
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
]
}
Desired Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID3" : {
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
}
C_3 is an array of n structs, with each item having a unique ID. The new data-frame is expected to turn the n structs in C_3 into separate columns, naming each column after the value of ID.
I am new to Spark & Scala. Any thoughts on how to achieve this transformation would be very helpful.
Thanks!
You can explode the structs, then pivot by ID:
val df2 = df.selectExpr("C_1","C_2","inline(C_3)")
.groupBy("C_1","C_2")
.pivot("ID")
.agg(first(struct("C3_C2","C3_C3")))
df2.show
+---+---+--------+--------+--------+
|C_1|C_2| ID1| ID2| ID3|
+---+---+--------+--------+--------+
| A| B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+
df2.printSchema
root
|-- C_1: string (nullable = true)
|-- C_2: string (nullable = true)
|-- ID1: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID2: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID3: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
[Posting my hacky solution for reference.]
#mck's answer is probably the neat way to do it, but it wasn't sufficient for my use-case. My data-frame had a lot of columns, and using all of them in a group-by followed by a pivot was an expensive operation.
For my use-case, the IDs in C_3 were unique and known values; that is the assumption made in this solution.
I was able to achieve the transformation as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) // Manually add fields as required

val custom_flat_func = udf((a: Seq[Row]) => {
  var _id1: C3_Interm_Struct = null
  var _id2: C3_Interm_Struct = null
  for (item <- a) {
    val intermData = C3_Interm_Struct(item(1).toString, item(2).toString)
    if (item(0).equals("ID1")) {
      _id1 = intermData
    }
    else if (item(0).equals("ID2")) {
      _id2 = intermData
    }
    else if() // Manual expansion
    ..
  }
  Seq(C3_Out_Struct(_id1, _id2)) // The return type has to be a Seq
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3")).selectExpr("C_1", "C_2", "inline(C_3)") // Expand the Seq, which has only one Row
flatDf.first.prettyJson
Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
}
}
UDFs are generally slow, but this is much faster than group-by plus pivot.
[There may be more efficient solutions; I am not aware of them at the time of writing this.]
I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure.
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
Required JSON output in the below format, using Apache Spark with Scala:
[{
    "vendor_name": "Vendor 1",
    "count": 10,
    "categories": [{
        "name": "Category 1",
        "count": 4,
        "subCategories": [{
                "name": "Sub Category 1",
                "count": 1
            },
            {
                "name": "Sub Category 2",
                "count": 2
            },
            {
                "name": "Sub Category 3",
                "count": 3
            },
            {
                "name": "Sub Category 4",
                "count": 4
            }
        ]
    }]
}]
//read file into DataFrame
scala> val df = spark.read.format("csv").option("header", "true").load(<input CSV path>)
df: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 4 more fields]
scala> df.show(false)
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor_Name|count|Categories|Category_Count|Subcategory |Subcategory_Count|
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor1 |10 |Category 1|4 |Sub Category 1|1 |
|Vendor1 |10 |Category 1|4 |Sub Category 2|2 |
|Vendor1 |10 |Category 1|4 |Sub Category 3|3 |
|Vendor1 |10 |Category 1|4 |Sub Category 4|4 |
+-----------+-----+----------+--------------+--------------+-----------------+
//convert into the desired JSON format
scala> val df1 = df.groupBy("Vendor_Name", "count", "Categories", "Category_Count").
     |   agg(collect_list(struct(col("Subcategory").alias("name"), col("Subcategory_Count").alias("count"))).alias("subCategories")).
     |   groupBy("Vendor_Name", "count").
     |   agg(collect_list(struct(col("Categories").alias("name"), col("Category_Count").alias("count"), col("subCategories"))).alias("categories"))
df1: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 1 more field]
scala> df1.printSchema
root
|-- Vendor_Name: string (nullable = true)
|-- count: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- count: string (nullable = true)
| | |-- subCategories: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- count: string (nullable = true)
//Write df1 out in JSON format
scala> df1.write.format("json").mode("append").save(<output Path>)
I have a sample file which contains JSON strings. How do I process this type of file in Spark?
Sample file
{"Id":"240","Page":"dashboard","test":"working"}
{"Amt": "0.0","deliveryfee": "Free","ProductList": "{{ProductId=1,Price=200,Quantity=1},{ProductId=2,Price=600,Quantity=1}}","sample": "data"}
Reading the file as JSON:
val df = spark.read.option("multiLine", "true").json("/data/test/test.json")
df.printSchema
root
|-- Amt: string (nullable = true)
|-- ProductList: string (nullable = true)
|-- deliveryfee: string (nullable = true)
|-- sample: string (nullable = true)
printSchema shows ProductList as a string, but it is not meant to be one.
You probably want something like this:
{
"Amt": "0.0",
"deliveryfee": "Free",
"ProductList": [{
"ProductId": 1,
"Price": 200,
"Quantity": 1
}, {
"ProductId": 2,
"Price": 600,
"Quantity": 1
}],
"sample": "data"
}
Edited: the point is that, the way your JSON is written, that field is a String. You need to either change your JSON, or work with that field as a String.
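If you go with the second option (working with the field as a String), here is a minimal sketch of parsing it inside Spark. It assumes the exact {{k=v,...},{k=v,...}} format shown in the sample file, and the ProductItem case class and parseProducts UDF are made-up names for illustration:
import org.apache.spark.sql.functions.{col, udf}

case class ProductItem(ProductId: Long, Price: Long, Quantity: Long)

// Parse "{{ProductId=1,Price=200,Quantity=1},{ProductId=2,...}}" into a Seq of structs.
val parseProducts = udf { (s: String) =>
  if (s == null) Seq.empty[ProductItem]
  else s.stripPrefix("{").stripSuffix("}")      // drop the outer pair of braces
    .split("\\},\\{")                           // split the individual {...} items
    .map(_.stripPrefix("{").stripSuffix("}"))
    .map { item =>
      // "ProductId=1,Price=200,Quantity=1" -> Map(ProductId -> 1, Price -> 200, Quantity -> 1)
      val kv = item.split(",").map(_.split("=", 2)).collect { case Array(k, v) => k.trim -> v.trim }.toMap
      ProductItem(kv("ProductId").toLong, kv("Price").toLong, kv("Quantity").toLong)
    }.toSeq
}

val parsed = df.withColumn("ProductList", parseProducts(col("ProductList")))
parsed.printSchema // ProductList is now array<struct<...>> instead of string
Rows without a ProductList field (like the first sample line) come through as null and end up as an empty array.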