Transform the schema of a Spark DataFrame in Scala

Input data-frame:
{
  "C_1" : "A",
  "C_2" : "B",
  "C_3" : [
    {
      "ID" : "ID1",
      "C3_C2" : "V1",
      "C3_C3" : "V2"
    },
    {
      "ID" : "ID2",
      "C3_C2" : "V3",
      "C3_C3" : "V4"
    },
    {
      "ID" : "ID3",
      "C3_C2" : "V4",
      "C3_C3" : "V5"
    },
    ..
  ]
}
Desired Output:
{
  "C_1" : "A",
  "C_2" : "B",
  "ID1" : {
    "C3_C2" : "V1",
    "C3_C3" : "V2"
  },
  "ID2" : {
    "C3_C2" : "V3",
    "C3_C3" : "V4"
  },
  "ID3" : {
    "C3_C2" : "V4",
    "C3_C3" : "V5"
  },
  ..
}
C_3 is an array of n structs, each item having a unique ID. The new data-frame should turn the n structs in C_3 into separate columns, named after the value of ID.
I am new to Spark and Scala. Any thoughts on how to achieve this transformation would be very helpful.
Thanks!

You can explode the structs, then pivot by ID:
import org.apache.spark.sql.functions.{first, struct}

val df2 = df.selectExpr("C_1", "C_2", "inline(C_3)")
  .groupBy("C_1", "C_2")
  .pivot("ID")
  .agg(first(struct("C3_C2", "C3_C3")))
df2.show
+---+---+--------+--------+--------+
|C_1|C_2|     ID1|     ID2|     ID3|
+---+---+--------+--------+--------+
|  A|  B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+
df2.printSchema
root
|-- C_1: string (nullable = true)
|-- C_2: string (nullable = true)
|-- ID1: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID2: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID3: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
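To check that this matches the desired JSON shape, the single result row can be rendered as JSON (Row.prettyJson is the same helper used in the answer below):
df2.first.prettyJson // or: df2.toJSON.first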

[Posting my hacky solution for reference.]
#mck's answer is probably the neat way to do it, but it wasn't sufficient for my use case: my data-frame had a lot of columns, and using all of them in a group-by followed by a pivot was an expensive operation.
For my use case, the IDs in C_3 were unique and known values, so that is the assumption made in this solution.
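(Aside: when the IDs are known in advance, they can also be passed to pivot explicitly, which saves the extra pass Spark makes to discover the distinct pivot values; it does not remove the wide group-by, though, which was the expensive part here. A sketch, assuming ID1 to ID3 are the known IDs:)
import org.apache.spark.sql.functions.{first, struct}

// Same as the accepted answer, but with the pivot values supplied up front
val pivoted = df.selectExpr("C_1", "C_2", "inline(C_3)")
  .groupBy("C_1", "C_2")
  .pivot("ID", Seq("ID1", "ID2", "ID3"))
  .agg(first(struct("C3_C2", "C3_C3")))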
I was able to achieve the transformation as follows:
case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) // Manually add fields as required

val custom_flat_func = udf((a: Seq[Row]) => {
  var _id1: C3_Interm_Struct = null
  var _id2: C3_Interm_Struct = null
  for (item <- a) {
    val intermData = C3_Interm_Struct(item(1).toString, item(2).toString)
    if (item(0).equals("ID1")) {
      _id1 = intermData
    }
    else if (item(0).equals("ID2")) {
      _id2 = intermData
    }
    else if () // Manual expansion
      ..
  }
  Seq(C3_Out_Struct(_id1, _id2)) // Return type has to be Seq
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3"))
  .selectExpr("C_1", "C_2", "inline(C_3)") // Expand the Seq, which has only 1 Row
flatDf.first.prettyJson
Output:
{
  "C_1" : "A",
  "C_2" : "B",
  "ID1" : {
    "C3_C2" : "V1",
    "C3_C3" : "V2"
  },
  "ID2" : {
    "C3_C2" : "V3",
    "C3_C3" : "V4"
  }
}
UDFs are generally slow, but this was much faster than group-by with pivot for my data.
[There may be more efficient solutions; I am not aware of them at the time of writing this.]
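For completeness: on Spark 2.4+ the same known-ID assumption should also allow a UDF-free version using the filter higher-order function, picking each struct out of C_3 by its ID. This is only a sketch, not benchmarked against the UDF (column and field names as in the question):
import org.apache.spark.sql.functions.expr

val ids = Seq("ID1", "ID2", "ID3") // assumed known, as above

// For each known ID, pull out the matching struct and keep only its payload fields
val noUdfDf = ids.foldLeft(df) { (tmp, id) =>
  tmp.withColumn(id, expr(
    s"named_struct('C3_C2', filter(C_3, x -> x.ID = '$id')[0].C3_C2, " +
    s"'C3_C3', filter(C_3, x -> x.ID = '$id')[0].C3_C3)"))
}.drop("C_3")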

Related

JSON response contains columns and rows separately

My JSON response contains the columns and rows separately. How do I parse the following data in PySpark (mapping the columns to the rows)?
The response is as follows:
Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }],
"rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
Please help me parse it using PySpark.
Since you have your JSON as a string stored in Response, use json.loads() to parse it into a dictionary:
import json
data_dict = json.loads(Response)
print(data_dict.keys())
# dict_keys(['columns', 'rows'])
Retrieve the row data (the data for the dataframe) and the column data (used to create the schema for the dataframe) as shown below:
rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
cols = data_dict['columns']
l = [i for column in cols for i in column.items()]
# Build the schema (as a string) from the response column data
schema_str = "StructType(["
convert = []
# List of columns that should be DateTime.
# They are first created as StringType columns in the dataframe, and each column
# in `convert` is cast to TimestampType afterwards.
for c in l:
    # column name
    col_name = c[0]
    # column type
    if c[1]['type'] == 'Numeric':
        col_type = 'IntegerType()'
    elif c[1]['type'] == 'Text':
        col_type = 'StringType()'
    elif c[1]['type'] == 'DateTime':
        # keep DateTime columns as StringType for now; they are cast to TimestampType later
        col_type = 'StringType()'
        convert.append(col_name)  # remember the columns to be converted
    # whether the column is nullable or not
    col_nullable = c[1]['nullable']
    schema_str = schema_str + f'StructField("{col_name}",{col_type},{col_nullable}),'
schema_str = schema_str[:-1] + '])'
print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
Now use the row data and the schema string above (via eval) to create the dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()
#output
+---+----+-----------+--------------------+
| id|name|description|        last_updated|
+---+----+-----------+--------------------+
|  1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
|  4| bar|       null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: string (nullable = false)
Convert the columns that need to be cast to TimestampType():
for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()
#output
+---+----+-----------+-------------------+
| id|name|description|       last_updated|
+---+----+-----------+-------------------+
|  1| foo|Lorem ipsum|2016-10-26 00:09:14|
|  4| bar|       null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+
root
|-- id: integer (nullable = false)
|-- name: string (nullable = false)
|-- description: string (nullable = true)
|-- last_updated: timestamp (nullable = true)

How to get field from array and concatenate them into string (spark dataframe)

I have an array column:
|-- packages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- packageId: string (nullable = true)
| | |-- triggers: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How can I get a new column with all the packageId values?
Example of the column:
"packages": [
{
"packageId": "package1",
"triggers": {
"1": "2"
}
},
{
"packageId": "package2",
"triggers": {
"1": "2",
"2": "2"
}
}
]
to
package1,package2
I am using Spark 2.4.5.
df.withColumn("packageList", explode(df.col("packages").getField("packageId")))
.groupBy(..)
.agg(concat_ws(",", collect_set("packageList")))
It works for me.
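Since you are on Spark 2.4.5, this should also be possible without the explode/groupBy round trip, because concat_ws can join the elements of an array<string> column directly; a minimal sketch (not tested against your exact schema):
import org.apache.spark.sql.functions.{col, concat_ws}

// packages.packageId resolves to an array<string>; concat_ws joins its elements
val withList = df.withColumn("packageList", concat_ws(",", col("packages.packageId")))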

Error in StructField(a,StringType,false). It is false and should be true

I have this error in my Scala test:
StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,false)) did not equal StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,true))
ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at (TestSuite.scala:13)
Expected :StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,true))
Actual :StructType(StructField(a,StringType,true), StructField(b,StringType,true), StructField(c,StringType,true), StructField(d,StringType,true), StructField(e,StringType,true), StructField(f,StringType,true), StructField(NewColumn,StringType,false))
The last StructField is false when it should be true, and I do not know why. This true means that the schema accepts null values.
And this is my test:
val schema1 = Array("a", "b", "c", "d", "e", "f")
val df = List(
  ("a1", "b1", "c1", "d1", "e1", "f1"),
  ("a2", "b2", "c2", "d2", "e2", "f2")
).toDF(schema1: _*)

val schema2 = Array("a", "b", "c", "d", "e", "f", "NewColumn")
val dfExpected = List(
  ("a1", "b1", "c1", "d1", "e1", "f1", "a1_b1_c1_d1_e1_f1"),
  ("a2", "b2", "c2", "d2", "e2", "f2", "a2_b2_c2_d2_e2_f2")
).toDF(schema2: _*)
val transformer = KeyContract("NewColumn", schema1)
val newDf = transformer(df)
newDf.columns should contain ("NewColumn")
assertDataFrameEquals(newDf, dfExpected)
And this is KeyContract:
case class KeyContract(tempColumn: String, columns: Seq[String],
                       unsigned: Boolean = true) extends Transformer {
  override def apply(input: DataFrame): DataFrame = {
    import org.apache.spark.sql.functions._
    val inputModif = columns.foldLeft(input) { (tmpDf, columnName) =>
      tmpDf.withColumn(columnName,
        when(col(columnName).isNull, lit("")).otherwise(col(columnName)))
    }
    inputModif.withColumn(tempColumn, concat_ws("_", columns.map(col): _*))
  }
}
Thanks in advance!!
This happens because concat_ws never returns null, so the resulting field is marked as not nullable.
If you want to use a second DataFrame as a reference, you'll have to build it from an explicit schema and Rows:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder.getOrCreate()
val dfExpected = spark.createDataFrame(spark.sparkContext.parallelize(List(
  Row("a1", "b1", "c1", "d1", "e1", "f1", "a1_b1_c1_d1_e1_f1"),
  Row("a2", "b2", "c2", "d2", "e2", "f2", "a2_b2_c2_d2_e2_f2")
)), StructType(schema2.map { c => StructField(c, StringType, c != "NewColumn") }))
This way the last column won't be nullable:
dfExpected.printSchema
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
|-- e: string (nullable = true)
|-- f: string (nullable = true)
|-- NewColumn: string (nullable = false)
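Alternatively (a sketch not taken from the original answer, assuming the SparkSession is available as spark): instead of rebuilding the expected DataFrame, the actual result can be recreated with every field marked nullable before comparing.
import org.apache.spark.sql.types.StructType

// Rebuild newDf with identical data but nullable = true on every field
val relaxedDf = spark.createDataFrame(
  newDf.rdd,
  StructType(newDf.schema.map(_.copy(nullable = true)))
)
assertDataFrameEquals(relaxedDf, dfExpected) should then see matching schemas.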

PySpark UDF of MapType with mixed value type

I have a JSON input like this:
{
  "1": {
    "id": 1,
    "value": 5
  },
  "2": {
    "id": 2,
    "list": {
      "10": {
        "id": 10
      },
      "11": {
        "id": 11
      },
      "20": {
        "id": 20
      }
    }
  },
  "3": {
    "id": 3,
    "key": "a"
  }
}
I need to merge the 3 columns and extract the needed values for each column, and this is the output I need:
{
  "out": {
    "1": 5,
    "2": [10, 11, 20],
    "3": "a"
  }
}
I tried to create a UDF to transform these 3 columns into 1, but I could not figure out how to define MapType() with mixed value types - IntegerType(), ArrayType(IntegerType()) and StringType() respectively.
Thanks in advance!
You need to define the result type of the UDF using StructType, not MapType, like this:
from pyspark.sql.types import *

udf_result = StructType([
    StructField('1', IntegerType()),
    StructField('2', ArrayType(StringType())),
    StructField('3', StringType())
])
MapType() is used for (key, value) pair definitions, not for nested dataframes. What you're looking for is StructType().
You can load it directly using createDataFrame but you'd have to pass a schema, so this way is easier:
import json
data_json = {
    "1": {
        "id": 1,
        "value": 5
    },
    "2": {
        "id": 2,
        "list": {
            "10": {
                "id": 10
            },
            "11": {
                "id": 11
            },
            "20": {
                "id": 20
            }
        }
    },
    "3": {
        "id": 3,
        "key": "a"
    }
}
a = [json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()
root
|-- 1: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- value: long (nullable = true)
|-- 2: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- list: struct (nullable = true)
| | |-- 10: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 11: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | |-- 20: struct (nullable = true)
| | | |-- id: long (nullable = true)
|-- 3: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- key: string (nullable = true)
Now to access nested dataframes. Note that column "2" is more nested than the other ones:
nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
    cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
    [df[c].id.alias(c) for c in df.columns]
)
df.printSchema()
root
|-- 1: long (nullable = true)
|-- 3: long (nullable = true)
|-- 2: array (nullable = false)
| |-- element: long (containsNull = true)
It's not exactly your final output since you want it nested in an "out" column:
import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()
root
|-- out: struct (nullable = false)
| |-- 1: long (nullable = true)
| |-- 3: long (nullable = true)
| |-- 2: array (nullable = false)
| | |-- element: long (containsNull = true)
Finally back to JSON:
df.toJSON().first()
'{"1":1,"3":3,"2":[10,11,20]}'

How to add a column from a different data frame: Spark Scala

How do I add/append a column from a different data frame? I am trying to find the percentile of placeNames which are rated 3 and above.
// sc : An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("temp.txt")
//df.show()
val res = df.withColumn("visited", explode($"visited"))
val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
val result2 = res
  .filter($"visited.rating" < 4)
  .groupBy($"customerId", $"visited.placeName")
  .agg(count("*").alias("top"))
result1.show()
result2.show()
val finalResult = result1.join(result2, result1("placeName") <=> result2("placeName") && result1("customerId") <=> result2("customerId"), "outer").show()
result1 has the total count per (customerId, placeName), and result2 has the filtered count (top). Now I am trying to find:
sqlContext.sql("select top/total as percentile from temp groupBy placeName")
But finalResult has duplicate placeName and customerId columns. Can someone tell me what I am doing wrong here? Also, is there a way to do this without doing a join?
My Schema :
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "2",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 1 count = 3
Canada top = 1 count = 3
{
"country": "Canada",
"customerId": "Canada012",
"visited": [
{
"placeName": "UK",
"rating": "3",
"famousRest": "N/A",
"placeId": "XSdce2"
},
]
}
UK top = 1 count = 1
{
"country": "France",
"customerId": "France001",
"visited": [
{
"placeName": "US",
"rating": "4.3",
"famousRest": "N/A",
"placeId": "AVBS34"
},
{
"placeName": "US",
"rating": "3.3",
"famousRest": "SeriousPie",
"placeId": "VBSs34"
},
{
"placeName": "Canada",
"rating": "4.3",
"famousRest": "TimHortons",
"placeId": "AVBv4d"
}
]
}
US top = 2 count = 3
Canada top = 1 count = 3
PlaceName   percentile
US          (1+1+2)/(3+1+3) * 100
Canada      (1+1)/(3+3) * 100
UK          1 * 100
Schema:
root
|-- country: string (nullable = true)
|-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- placeId: string (nullable = true)
| | |-- placeName: string (nullable = true)
| | |-- famousRest: string (nullable = true)
| | |-- rating: string (nullable = true)
Also, is there a way to do this without doing a join?
No.
Can someone tell me what I am doing wrong here?
Nothing. If you don't need both copies of the join columns, use this:
result1.join(result2, List("placeName", "customerId"), "outer")
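From there, the percentile the question asks for can be computed per placeName by summing over the joined result. A rough sketch, assuming the aliases total and top from the question's aggregations and treating a missing top as 0:
import org.apache.spark.sql.functions.{col, sum}

val joined = result1.join(result2, List("placeName", "customerId"), "outer")
  .na.fill(0, Seq("top"))

// sum of filtered counts over sum of total counts, per placeName
val percentile = joined.groupBy("placeName")
  .agg((sum(col("top")) / sum(col("total")) * 100).alias("percentile"))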