Input data-frame:
{
"C_1" : "A",
"C_2" : "B",
"C_3" : [
{
"ID" : "ID1",
"C3_C2" : "V1",
"C3_C3" : "V2"
},
{
"ID" : "ID2",
"C3_C2" : "V3",
"C3_C3" : "V4"
},
{
"ID" : "ID3",
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
]
}
Desired Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID3" : {
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
}
C_3 is an array of n structs, each item having a unique ID. The new data-frame is expected to convert the n structs in C_3 into separate columns, named as per the value of ID.
I am new to Spark & Scala. Any thoughts on how to achieve this transformation will be very helpful.
Thanks!
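For reference, a minimal sketch for loading this sample input into a DataFrame named df, assuming a SparkSession called spark and the JSON held in a string (the answers below start from such a df):
import spark.implicits._

val inputJson =
  """{"C_1":"A","C_2":"B","C_3":[{"ID":"ID1","C3_C2":"V1","C3_C3":"V2"},{"ID":"ID2","C3_C2":"V3","C3_C3":"V4"},{"ID":"ID3","C3_C2":"V4","C3_C3":"V5"}]}"""
val df = spark.read.json(Seq(inputJson).toDS)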
You can explode the structs, then pivot by ID:
import org.apache.spark.sql.functions.{first, struct}

val df2 = df.selectExpr("C_1", "C_2", "inline(C_3)")
  .groupBy("C_1", "C_2")
  .pivot("ID")
  .agg(first(struct("C3_C2", "C3_C3")))
df2.show
+---+---+--------+--------+--------+
|C_1|C_2| ID1| ID2| ID3|
+---+---+--------+--------+--------+
| A| B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+
df2.printSchema
root
|-- C_1: string (nullable = true)
|-- C_2: string (nullable = true)
|-- ID1: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID2: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID3: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
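To confirm that df2 has the desired nested shape, you can inspect the first row as JSON (a quick check using the same Row.prettyJson call that appears in the answer further below):
df2.first.prettyJson  // prints the row as nested JSON, e.g. "ID1" : { "C3_C2" : "V1", "C3_C3" : "V2" }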
[Posting my hacky solution for reference.]
#mck's answer is probably the neater way to do it, but it wasn't sufficient for my use-case. My data-frame had a lot of columns, and using all of them in a group-by followed by a pivot was an expensive operation.
For my use-case, the IDs in C_3 were unique and known values, so that is the assumption made in this solution.
I was able to achieve the transformation as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) // Manually add fields as required

val custom_flat_func = udf((a: Seq[Row]) => {
  var _id1: C3_Interm_Struct = null
  var _id2: C3_Interm_Struct = null
  for (item <- a) {
    // item(0) = ID, item(1) = C3_C2, item(2) = C3_C3
    val intermData = C3_Interm_Struct(item(1).toString, item(2).toString)
    if (item(0).equals("ID1")) {
      _id1 = intermData
    }
    else if (item(0).equals("ID2")) {
      _id2 = intermData
    }
    else if () // Manual expansion
      ..
  }
  Seq(C3_Out_Struct(_id1, _id2)) // Return type has to be Seq
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3"))
  .selectExpr("C_1", "C_2", "inline(C_3)") // Expand the Seq, which has only 1 Row
flatDf.first.prettyJson
Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
},
"ID2" : {
"C3_C2" : "V2",
"C3_C3" : "V3"
}
}
UDFs are generally slow, but this was much faster than the group-by and pivot approach for my case.
[There may be more efficient solutions; I am not aware of them at the time of writing this.]
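For completeness, one possible non-UDF variant is sketched below. It is not from the original answers; it assumes Spark 2.4+ (for the filter and element_at SQL functions) and, like the UDF above, that the IDs are known up front:
import org.apache.spark.sql.functions.expr

val knownIds = Seq("ID1", "ID2", "ID3") // manually list the expected IDs
val flatDf2 = knownIds.foldLeft(df) { (tmp, id) =>
  tmp
    // pick the single struct in C_3 whose ID matches this id
    .withColumn("_match", expr(s"element_at(filter(C_3, x -> x.ID = '$id'), 1)"))
    // repack only the payload fields under a column named after the ID
    .withColumn(id, expr("named_struct('C3_C2', _match.C3_C2, 'C3_C3', _match.C3_C3)"))
    .drop("_match")
}.drop("C_3")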
Related
My JSON response contains columns and rows separately. How can I parse the following data in PySpark (mapping the columns to the rows)?
The response is as follows:
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }],
"rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, i.e.,
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Use json.loads() to create a dictionary from this Response string:
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve the row data (the data for the dataframe) and the column data (used to create the dataframe schema) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
#schema_str to create a schema (as a string) using response column data
schema_str = "StructType(["
convert = []
#list of columns that should be as DateTime.
#First create dataframe as StringType column for these and then convert each column..
#in convert (list) to TimestampType.
for c in l:
    #column name
    col_name = c[0]
    #column type
    if(c[1]['type'] == 'Numeric'):
        col_type = 'IntegerType()'
    elif(c[1]['type'] == 'Text'):
        col_type = 'StringType()'
    elif(c[1]['type'] == 'DateTime'):
        #converting datetime type to StringType, to later convert to TimestampType
        col_type = 'StringType()'
        convert.append(col_name) #appending columns to be converted to a list
    #if column is nullable or not
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'
schema_str = schema_str[:-1] + '])'
print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the schema string above to create the dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description| last_updated|
+---+----+-----------+--------------------+
| 1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
| 4| bar| null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description| last_updated|
+---+----+-----------+-------------------+
| 1| foo|Lorem ipsum|2016-10-26 00:09:14|
| 4| bar| null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)
I have an array column:
|-- packages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- packageId: string (nullable = true)
| | |-- triggers: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How can I get a new column with all the packageId values?
Example of the column:
"packages": [
{
"packageId": "package1",
"triggers": {
"1": "2"
}
},
{
"packageId": "package2",
"triggers": {
"1": "2",
"2": "2"
}
}
]
to
package1,package2
I am using Spark 2.4.5.
df.withColumn("packageList", explode(df.col("packages").getField("packageId")))
.groupBy(..)
.agg(concat_ws(",", collect_set("packageList")))
It works for me.
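Since Spark 2.4 also ships built-in array functions, an alternative sketch that avoids the explode and groupBy entirely (assuming the packages schema shown above) would be:
import org.apache.spark.sql.functions.{array_join, col}

// "packages.packageId" extracts the packageId field from every struct in the array,
// and array_join concatenates the resulting array<string> with commas.
val withPackageList = df.withColumn("packageList", array_join(col("packages.packageId"), ","))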
I have this error in my Scala test:
StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,false)) did not equal StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,true))
ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at (TestSuite.scala:13)
Expected :StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,true))
Actual :StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,false))
The last StructField is false when it should be true, and I do not know why. This true means that the column accepts null values.
And this is my test:
val schema1 = Array("a", "b", "c", "d", "e", "f")
val df = List(("a1", "b1", "c1", "d1", "e1", "f1"),
("a2", "b2", "c2", "d2", "e2", "f2"))
.toDF(schema1: _*)
val schema2 = Array("a", "b", "c", "d", "e", "f", "NewColumn")
val dfExpected = List(("a1", "b1", "c1", "d1", "e1", "f1", "a1_b1_c1_d1_e1_f1"),
("a2", "b2", "c2", "d2", "e2", "f2", "a2_b2_c2_d2_e2_f2")).toDF(schema2: _*)
val transformer = KeyContract("NewColumn", schema1)
val newDf = transformer(df)
newDf.columns should contain ("NewColumn")
assertDataFrameEquals(newDf, dfExpected)
And this is KeyContract:
case class KeyContract(tempColumn: String, columns: Seq[String],
unsigned: Boolean = true) extends Transformer {
override def apply(input: DataFrame): DataFrame = {
import org.apache.spark.sql.functions._
val inputModif = columns.foldLeft(input) { (tmpDf, columnName) =>
tmpDf.withColumn(columnName, when(col(columnName).isNull,
lit("")).otherwise(col(columnName)))
}
inputModif.withColumn(tempColumn, concat_ws("_", columns.map(col): _*))
}
}
Thanks in advance!!
This happens because concat_ws never returns null, so the resulting field is marked as not nullable.
If you want to use a second DataFrame as a reference, you'll have to build it with an explicit schema and Rows:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder.getOrCreate()
val dfExpected = spark.createDataFrame(spark.sparkContext.parallelize(List(
Row("a1", "b1", "c1", "d1", "e1", "f1", "a1_b1_c1_d1_e1_f1"),
Row("a2", "b2", "c2", "d2", "e2", "f2", "a2_b2_c2_d2_e2_f2")
)), StructType(schema2.map { c => StructField(c, StringType, c != "NewColumn") }))
This way the last column won't be nullable:
dfExpected.printSchema
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
|-- e: string (nullable = true)
|-- f: string (nullable = true)
|-- NewColumn: string (nullable = false)
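Another option (a sketch, not part of the original answer) is to relax the nullability of the DataFrame produced by the transformer instead, by re-applying its own schema with every field marked nullable:
import org.apache.spark.sql.types.StructType

// Copy newDf's schema with nullable = true everywhere, then rebuild the DataFrame,
// so it compares equal to the implicitly created dfExpected.
val nullableSchema = StructType(newDf.schema.map(_.copy(nullable = true)))
val newDfNullable = spark.createDataFrame(newDf.rdd, nullableSchema)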
I have a JSON input like this:
{
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
I need to merge the 3 columns and extract the needed values for each column, and this is the output I need:
{
"out": {
"1": 5,
"2": [10, 11, 20],
"3": "a"
}
}
I tried to create a UDF to transform these 3 columns into 1, but I could not figure out how to define MapType() with mixed value types - IntegerType(), ArrayType(IntegerType()) and StringType() respectively.
Thanks in advance!
You need to define the result type of the UDF using StructType, not MapType, like this:
from pyspark.sql.types import *
udf_result = StructType([
StructField('1', IntegerType()),
StructField('2', ArrayType(StringType())),
StructField('3', StringType())
])
MapType() is used for (key, value) pair definitions, not for nested dataframes. What you're looking for is StructType().
You can load it directly using createDataFrame but you'd have to pass a schema, so this way is easier:
import json
data_json = {
"1": {
"id": 1,
"value": 5
},
"2": {
"id": 2,
"list": {
"10": {
"id": 10
},
"11": {
"id": 11
},
"20": {
"id": 20
}
}
},
"3": {
"id": 3,
"key": "a"
}
}
a=[json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- value: long (nullable = true)
|-- 2: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- list: struct (nullable = true)
| | |-- 10: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 11: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 20: struct (nullable = true)
| | | |-- id: long (nullable = true)
|-- 3: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- key: string (nullable = true)
Now access the nested dataframes. Note that column "2" is more deeply nested than the others:
nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
[df[c].id.alias(c) for c in df.columns]
)
df.printSchema()
root
|-- 1: long (nullable = true)
|-- 3: long (nullable = true)
|-- 2: array (nullable = false)
| |-- element: long (containsNull = true)
It's not exactly your final output since you want it nested in an "out" column:
import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()
root
|-- out: struct (nullable = false)
| |-- 1: long (nullable = true)
| |-- 3: long (nullable = true)
| |-- 2: array (nullable = false)
| | |-- element: long (containsNull = true)
Finally back to JSON:
df.toJSON().first()
'{"1":1,"3":3,"2":[10,11,20]}'
How do I add/append a column from a different data frame? I am trying to find the percentile of placeName entries that are rated 3 and above.
// sc : An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
//df.show()
val res = df.withColumn("visited", explode($"visited"))
val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
val result2 = res
.filter($"visited.rating" < 4)
.groupBy($"requestId", $"visited.placeName")
.agg(count("*").alias("top"))
result1.show()
result2.show()
val finalResult = result1.join(result2,
  result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer")
finalResult.show()
result1 has rows with the total count and result2 has the count after filtering. Now I am trying to compute:
sqlContext.sql("select top/total as percentile from temp groupBy placeName")
But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without doing a join?
My Schema :
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3",
"famousRest": "N/A",
"placeId": "XSdce2"
}
]
}
UK top = 1 count = 1
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "4.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 2 count = 3
Canada top = 1 count = 3
PlaceName percentile
US (1+1+2)/(3+1+3) *100
Canada (1+1)/(3+3) *100
UK 1 *100
Schema:
root
|-- country: string (nullable = true)
|-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- placeId: string (nullable = true)
| | |-- placeName: string (nullable = true)
| | |-- famousRest: string (nullable = true)
| | |-- rating: string (nullable = true)
Also is there a way to do this without doing a join?
No.
Can someone tell me what I am doing wrong here?
Nothing. If you don't need both copies of the join columns, use this:
result1.join(result2, List("placeName","customerId"), "outer")
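From that joined result, the percentile per placeName described in the question can then be computed along these lines (a sketch using the column names from the question):
import org.apache.spark.sql.functions.{coalesce, lit, sum}

val joined = result1.join(result2, List("placeName", "customerId"), "outer")
val percentiles = joined
  .groupBy("placeName")
  // top is null for (placeName, customerId) pairs that had no filtered rows, so default it to 0
  .agg(sum(coalesce($"top", lit(0))).alias("top"), sum($"total").alias("total"))
  .withColumn("percentile", $"top" / $"total" * 100)
percentiles.show()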