I have a column which is an array (WrappedArray) of structs, each containing an integer and a double value.
The schema looks like this:
|-- pricing_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
So, whenever this column's value is [[0, 0.0]] I need to change it to an empty array:
[[0, 0.0]] -> [[]].
How can I do this using a map, or using a DataFrame?
Try this-
spark>=2.4
val df = Seq(Seq((0, 0.0)), Seq((1, 2.2))).toDF("pricing_data")
df.show(false)
df.printSchema()
/**
* +------------+
* |pricing_data|
* +------------+
* |[[0, 0.0]] |
* |[[1, 2.2]] |
* +------------+
*
* root
* |-- pricing_data: array (nullable = true)
* | |-- element: struct (containsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: double (nullable = false)
*/
df.withColumn("pricing_data", expr(
"TRANSFORM(pricing_data, x -> if(x._1=0 and x._2=0.0, named_struct('_1', null, '_2', null), x))"
))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
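Note the first row above ends up as [[,]] (an array holding a struct of nulls) rather than the empty array asked for in the question. If the (0, 0.0) element should be dropped altogether, a FILTER-based variant would do it. This is only a sketch, under the same Spark >= 2.4 assumption and the same df as above:

// Sketch: drop the (0, 0.0) element entirely so the first row becomes []
df.withColumn("pricing_data", expr(
  "FILTER(pricing_data, x -> NOT (x._1 = 0 AND x._2 = 0.0))"
))
  .show(false)
// expected: first row shows [], second row stays [[1, 2.2]]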
spark<2.4
// spark<2.4
import scala.collection.mutable
import org.apache.spark.sql.Row

val dataType = df.schema("pricing_data").dataType
val replace = udf((arrayOfStruct: mutable.WrappedArray[Row]) => {
arrayOfStruct.map(row => {
val map = row.getValuesMap[Any](row.schema.fieldNames)
if(map("_1")==0 && map("_2") == 0.0) {
Row.fromTuple((null, null))
} else row
})
}, dataType)
df.withColumn("pricing_data", replace($"pricing_data"))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
u"Union can only be performed on tables with the compatible column types. map<string,int> <> structint:int,long:null at the Nth column of the second table.
Here is how the schema looks like:
Dataset 1
root
|-- name: string (nullable = true)
|-- count: struct (nullable = true)
| |-- int: integer (nullable = true)
| |-- long: null (nullable = true)
DataSet 2
root
|-- name: string (nullable = true)
|-- count: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I am unable to do the union operation on the DataFrames when using the following:
data= dataset1_df.union(dataset2_df)
How to solve this?
Updated:
I would like to change the schemas to something like:
DataSet 1
root
|-- name: string (nullable = true)
|-- count: long
DataSet2
root
|-- name: string (nullable = true)
|-- count: long
A simple solution would be typecasting one of the dataframes to match the other, as below-
val df1 = spark.sql("select 'foo' name, named_struct('int', 1, 'long', null) count")
df1.show(false)
df1.printSchema()
/**
* +----+-----+
* |name|count|
* +----+-----+
* |foo |[1,] |
* +----+-----+
*
* root
* |-- name: string (nullable = false)
* |-- count: struct (nullable = false)
* | |-- int: integer (nullable = false)
* | |-- long: null (nullable = true)
*/
val df2 = spark.sql("select 'bar' name, map('2', 3) count")
df2.show(false)
df2.printSchema()
/**
* +----+--------+
* |name|count |
* +----+--------+
* |bar |[2 -> 3]|
* +----+--------+
*
* root
* |-- name: string (nullable = false)
* |-- count: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
*/
df1.withColumn("count",
map($"count.int".cast("string"), $"count.long".cast("integer")))
.union(df2)
.show(false)
/**
* +----+--------+
* |name|count |
* +----+--------+
* |foo |[1 ->] |
* |bar |[2 -> 3]|
* +----+--------+
*/
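If, as in the update, you would rather end up with a plain long count on both sides before the union, something along the following lines could work. This is only a sketch, assuming the struct's value always sits in its int field (the long field is always null), that each map in df2 holds exactly one entry, and that spark.implicits._ and functions._ are imported as elsewhere in this answer:

// Sketch: flatten `count` to a long on both sides to match the updated target schema
val df1Flat = df1.select($"name", $"count.int".cast("long").as("count"))
val df2Flat = df2.select($"name", element_at(map_values($"count"), 1).cast("long").as("count"))
df1Flat.union(df2Flat).show(false)
// expected rows: (foo, 1) and (bar, 3)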
Below is my data. I am doing a groupBy on parcel_id, and I need to sum sqft only where
imprv_det_type_cd starts with MA.
input:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
In that case, only two rows from the above are considered.
expected output:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
code:
# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
import pyspark.sql.functions as F
from pyspark.sql.window import Window

def _transfrom_imp_detail():
    w_impr = Window.partitionBy("parcel_id")
    return(
        (spark.read.text(path_ade_imp_info)
            .select(
                F.trim(F.col("value").substr(1, 12)).alias("parcel_id"),
                F.trim(F.col("value").substr(86, 4)).cast("integer").alias("year"),
                F.trim(F.col("value").substr(94, 15)).cast("integer").alias("sqft"),
                F.trim(F.col("value").substr(41, 10)).alias("imprv_det_type_cd"),
            )
            .withColumn(
                "parcel_id",
                F.regexp_replace('parcel_id', r'^[0]*', '')
            )
            .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
            .withColumn("year_built", F.min("year").over(w_impr))
        ).drop("sqft", "year").drop_duplicates(["parcel_id"])
    )
I know the change needs to go in .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)), but I am not sure what change to make. I tried the when function but it still isn't working.
Thank you in advance.
I don't know why you say you are doing a groupBy when your code doesn't actually do one; it uses window functions instead. With a filter and an explicit groupBy:
import pyspark.sql.functions as f

df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
    .filter("imprv_det_type_cd like 'MA%'") \
    .groupBy('parcel_id', 'year') \
    .agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
    .show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
Use sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
.agg(
sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
first("imprv_det_type_cd").as("imprv_det_type_cd"),
first($"year").as("year_built")
)
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/
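The total comes out as 7472.0 because sqft is read as a string and SUM promotes it to double. Casting before summing gives the integer 7472 shown in the expected output; here is a sketch, a small variation on the same aggregation over the same df2:

// Sketch: cast sqft before summing so structure_total_sqft is 7472 rather than 7472.0
val p2 = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft".cast("int"))).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p2.show(false)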
I have columns of different map types, like below:
MapType(StringType(), StringType())
MapType(StringType(), IntegerType())
MapType(StringType(), DoubleType())
How can I concat them into one map while keeping the types intact?
You can concat MapType columns having different key and value types, but post concat Spark converts the map key/value types to the highest common type it finds.
For example-
If you consider the 3 columns having below types resp.-
col1 - MapType(StringType(), StringType())
col2 - MapType(StringType(), IntegerType())
col3 - MapType(StringType(), DoubleType())
the map_concat output will be as below-
map_concat(col1, col2, col3) - MapType(StringType(), StringType())
This is because Spark finds StringType to be the highest common type for both the key and the value.
Now,
Why can't Spark keep the original types intact for the key-value pairs?
Ans-
Spark stores a MapType value backed by two ArrayData instances:
class ArrayBasedMapData(val keyArray: ArrayData, val valueArray: ArrayData) extends MapData {
...
}
& ArrayData can't handle the heterogenous type. Hence spark can't keep its original type intact post concatenation.
Working example for reference
val df = spark.sql("select map('a', 'b') as col1, map('c', cast(1 as int)) as col2, " +
"map(1, cast(2.2 as double)) as col3")
df.printSchema()
df.show(false)
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: string
* | |-- value: double (valueContainsNull = false)
*
* +--------+--------+----------+
* |col1 |col2 |col3 |
* +--------+--------+----------+
* |[a -> b]|[c -> 1]|[d -> 2.2]|
* +--------+--------+----------+
*/
val p = df.withColumn("new_col", map_concat($"col1", $"col2", $"col3"))
p.printSchema()
p.show(false)
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: string
* | |-- value: double (valueContainsNull = false)
* |-- new_col: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
*
* +--------+--------+----------+--------------------------+
* |col1 |col2 |col3 |new_col |
* +--------+--------+----------+--------------------------+
* |[a -> b]|[c -> 1]|[d -> 2.2]|[a -> b, c -> 1, d -> 2.2]|
* +--------+--------+----------+--------------------------+
*/
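Reading a value back from the concatenated map shows the promotion concretely: the entry written as an integer now comes back typed as a string. A quick follow-on check on the same p dataframe, as a sketch:

// Sketch: the value inserted as integer 1 under key 'c' is now typed string in new_col
p.selectExpr("new_col['c']").printSchema()
// expected: the single resulting column is of type string (nullable = true)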
Update-1
Use struct to combine columns into one
val x = df.withColumn("x", struct($"col1", $"col2", $"col3"))
x.printSchema()
x.selectExpr("x.col1['a']", "x.col2['c']", "x.col3['d']").printSchema()
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: integer
* | |-- value: double (valueContainsNull = false)
* |-- x: struct (nullable = false)
* | |-- col1: map (nullable = false)
* | | |-- key: string
* | | |-- value: string (valueContainsNull = false)
* | |-- col2: map (nullable = false)
* | | |-- key: string
* | | |-- value: integer (valueContainsNull = false)
* | |-- col3: map (nullable = false)
* | | |-- key: integer
* | | |-- value: double (valueContainsNull = false)
*
* root
* |-- x.col1 AS `col1`[a]: string (nullable = true)
* |-- x.col2 AS `col2`[c]: integer (nullable = true)
* |-- x.col3 AS `col3`[CAST(d AS INT)]: double (nullable = true)
*/
I have a DataFrame with the following schema :
root
|-- user_id: string (nullable = true)
|-- user_loans_arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- loan_date: string (nullable = true)
| | |-- loan_amount: string (nullable = true)
|-- new_loan: struct (nullable = true)
| |-- loan_date : string (nullable = true)
| |-- loan_amount : string (nullable = true)
I want to use a UDF which takes user_loans_arr and new_loan as inputs and adds the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr, delete all the elements whose loan_date is older than 12 months.
Thanks in advance.
If Spark >= 2.4 then you don't need a UDF; check the example below-
Load the input data
val df = spark.sql(
"""
|select user_id, user_loans_arr, new_loan
|from values
| ('u1', array(named_struct('loan_date', '2019-01-01', 'loan_amount', 100)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100)),
| ('u2', array(named_struct('loan_date', '2020-01-01', 'loan_amount', 200)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100))
| T(user_id, user_loans_arr, new_loan)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +-------+-------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+-------------------+-----------------+
* |u1 |[[2019-01-01, 100]]|[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200]]|[2020-01-01, 100]|
* +-------+-------------------+-----------------+
*
* root
* |-- user_id: string (nullable = false)
* |-- user_loans_arr: array (nullable = false)
* | |-- element: struct (containsNull = false)
* | | |-- loan_date: string (nullable = false)
* | | |-- loan_amount: integer (nullable = false)
* |-- new_loan: struct (nullable = false)
* | |-- loan_date: string (nullable = false)
* | |-- loan_amount: integer (nullable = false)
*/
Process as per below requirement
Take user_loans_arr and new_loan as inputs and add the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr, delete all the elements whose loan_date is older than 12 months.
spark >= 2.4
df.withColumn("user_loans_arr",
expr(
"""
|FILTER(array_union(user_loans_arr, array(new_loan)),
| x -> months_between(current_date(), to_date(x.loan_date)) < 12)
""".stripMargin))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
spark < 2.4
// spark < 2.4
val outputSchema = df.schema("user_loans_arr").dataType
import java.time._
import scala.collection.mutable
import org.apache.spark.sql.Row

val add_and_filter = udf((userLoansArr: mutable.WrappedArray[Row], loan: Row) => {
(userLoansArr :+ loan).filter(row => {
val loanDate = LocalDate.parse(row.getAs[String]("loan_date"))
val period = Period.between(loanDate, LocalDate.now())
period.getYears * 12 + period.getMonths < 12
})
}, outputSchema)
df.withColumn("user_loans_arr", add_and_filter($"user_loans_arr", $"new_loan"))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
You need to pass your array and struct columns to the udf wrapped together as an array or a struct; I prefer passing them as a struct.
Inside the udf you can manipulate the elements and return an array type.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
tst_1=(tst.withColumn("arr",F.array('col1','col2'))).withColumn("str",F.struct('col3','col4'))
# udf to return array
@udf(ArrayType(StringType()))
def fn(row):
    # empty result when the second array element is greater than the struct's col4
    if row.arr[1] > row.str.col4:
        res = []
    else:
        # otherwise extend the array with the struct's field values
        res = row.arr + list(row.str.asDict().values())
    return res
# calling udf with a struct of array and struct column
tst_fin = tst_1.withColumn("res",fn(F.struct('arr','str')))
The result is
tst_fin.show()
+----+----+----+----+------+------+------------+
|col1|col2|col3|col4| arr| str| res|
+----+----+----+----+------+------+------------+
| 1| 2| 3| 4|[1, 2]|[3, 4]|[1, 2, 4, 3]|
| 3| 4| 5| 1|[3, 4]|[5, 1]| []|
| 5| 6| 7| 8|[5, 6]|[7, 8]|[5, 6, 8, 7]|
| 7| 8| 9| 2|[7, 8]|[9, 2]| []|
+----+----+----+----+------+------+------------+
This example takes everything as int. Since you have dates as strings, inside your udf you have to use Python's datetime functions for the comparison.
I have a dataframe with the following schema:
input dataframe
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge the 2020 and 2019 columns into a single array-of-structs column as well, based on the matching key value.
Desired schema:
expected output dataframe
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x_this_year: double (nullable = true)
| | |-- y_this_year: double (nullable = true)
| | |-- x_last_year: double (nullable = true)
| | |-- y_last_year: double (nullable = true)
| | |-- z_this_year: double (nullable = true)
I would like to merge on the matching key in the structs. Also note that if a key is present in only one of the 2019 or 2020 arrays, then nulls need to be used to substitute the values of the other year in the merged column.
scala> val df = Seq(
| ("ABC",
| Seq(
| ("a", 2, 4, 6),
| ("b", 3, 6, 9),
| ("c", 1, 2, 3)
| ),
| Seq(
| ("a", 4, 8),
| ("d", 3, 4)
| ))
| ).toDF("A", "B_2020", "B_2019").select(
| $"A",
| $"B_2020" cast "array<struct<key:string,x:double,y:double,z:double>>",
| $"B_2019" cast "array<struct<key:string,x:double,y:double>>"
| )
df: org.apache.spark.sql.DataFrame = [A: string, B_2020: array<struct<key:string,x:double,y:double,z:double>> ... 1 more field]
scala> df.printSchema
root
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
scala> df.show(false)
+---+------------------------------------------------------------+------------------------------+
|A |B_2020 |B_2019 |
+---+------------------------------------------------------------+------------------------------+
|ABC|[[a, 2.0, 4.0, 6.0], [b, 3.0, 6.0, 9.0], [c, 1.0, 2.0, 3.0]]|[[a, 4.0, 8.0], [d, 3.0, 4.0]]|
+---+------------------------------------------------------------+------------------------------+
scala> val df2020 = df.select($"A", explode($"B_2020") as "this_year").select($"A",
| $"this_year.key" as "key", $"this_year.x" as "x_this_year",
| $"this_year.y" as "y_this_year", $"this_year.z" as "z_this_year")
df2020: org.apache.spark.sql.DataFrame = [A: string, key: string ... 3 more fields]
scala> val df2019 = df.select($"A", explode($"B_2019") as "last_year").select($"A",
| $"last_year.key" as "key", $"last_year.x" as "x_last_year",
| $"last_year.y" as "y_last_year")
df2019: org.apache.spark.sql.DataFrame = [A: string, key: string ... 2 more fields]
scala> df2020.show(false)
+---+---+-----------+-----------+-----------+
|A |key|x_this_year|y_this_year|z_this_year|
+---+---+-----------+-----------+-----------+
|ABC|a |2.0 |4.0 |6.0 |
|ABC|b |3.0 |6.0 |9.0 |
|ABC|c |1.0 |2.0 |3.0 |
+---+---+-----------+-----------+-----------+
scala> df2019.show(false)
+---+---+-----------+-----------+
|A |key|x_last_year|y_last_year|
+---+---+-----------+-----------+
|ABC|a |4.0 |8.0 |
|ABC|d |3.0 |4.0 |
+---+---+-----------+-----------+
scala> val outputDF = df2020.join(df2019, Seq("A", "key"), "outer").select(
| $"A" as "market_name",
| struct($"key", $"x_this_year", $"y_this_year", $"x_last_year",
| $"y_last_year", $"z_this_year") as "cancellation_policy_booking")
outputDF: org.apache.spark.sql.DataFrame = [market_name: string, cancellation_policy_booking: struct<key: string, x_this_year: double ... 4 more fields>]
scala> outputDF.printSchema
root
|-- market_name: string (nullable = true)
|-- cancellation_policy_booking: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- x_this_year: double (nullable = true)
| |-- y_this_year: double (nullable = true)
| |-- x_last_year: double (nullable = true)
| |-- y_last_year: double (nullable = true)
| |-- z_this_year: double (nullable = true)
scala> outputDF.show(false)
+-----------+----------------------------+
|market_name|cancellation_policy_booking |
+-----------+----------------------------+
|ABC |[b, 3.0, 6.0,,, 9.0] |
|ABC |[a, 2.0, 4.0, 4.0, 8.0, 6.0]|
|ABC |[d,,, 3.0, 4.0,] |
|ABC |[c, 1.0, 2.0,,, 3.0] |
+-----------+----------------------------+
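The result above has one row per key rather than the single array-of-structs column per A that the question asks for. A final roll-up gets back to that shape; this is only a sketch on the same outputDF (column names A and B are illustrative, and the same imports as in the session above are assumed):

// Sketch: fold the per-key rows back into one array-of-structs column per id
val mergedDF = outputDF
  .groupBy($"market_name" as "A")
  .agg(collect_list($"cancellation_policy_booking") as "B")
mergedDF.printSchema
mergedDF.show(false)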