I have a column which is an array (WrappedArray) of structs, each containing an integer and a double value.
The schema looks like this:
|-- pricing_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
So, whenever this column's value is [[0, 0.0]] I need to change it to an empty array:
[[0, 0.0]] -> [[]].
How can I do this using a map, or using a DataFrame?
Try this-
spark>=2.4
val df = Seq(Seq((0, 0.0)), Seq((1, 2.2))).toDF("pricing_data")
df.show(false)
df.printSchema()
/**
* +------------+
* |pricing_data|
* +------------+
* |[[0, 0.0]] |
* |[[1, 2.2]] |
* +------------+
*
* root
* |-- pricing_data: array (nullable = true)
* | |-- element: struct (containsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: double (nullable = false)
*/
df.withColumn("pricing_data", expr(
"TRANSFORM(pricing_data, x -> if(x._1=0 and x._2=0.0, named_struct('_1', null, '_2', null), x))"
))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
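Note the first row above ends up as [[,]] (an array holding a struct of nulls) rather than the empty array asked for in the question. If the (0, 0.0) element should be dropped altogether, a FILTER-based variant would do it. This is only a sketch, under the same Spark >= 2.4 assumption and the same df as above:

// Sketch: drop the (0, 0.0) element entirely so the first row becomes []
df.withColumn("pricing_data", expr(
  "FILTER(pricing_data, x -> NOT (x._1 = 0 AND x._2 = 0.0))"
))
  .show(false)
// expected: first row shows [], second row stays [[1, 2.2]]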
spark<2.4
// spark<2.4
import scala.collection.mutable
import org.apache.spark.sql.Row

val dataType = df.schema("pricing_data").dataType
val replace = udf((arrayOfStruct: mutable.WrappedArray[Row]) => {
arrayOfStruct.map(row => {
val map = row.getValuesMap[Any](row.schema.fieldNames)
if(map("_1")==0 && map("_2") == 0.0) {
Row.fromTuple((null, null))
} else row
})
}, dataType)
df.withColumn("pricing_data", replace($"pricing_data"))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
u"Union can only be performed on tables with the compatible column types. map<string,int> <> structint:int,long:null at the Nth column of the second table.
Here is how the schema looks like:
Dataset 1
root
|-- name: string (nullable = true)
|-- count: struct (nullable = true)
| |-- int: integer (nullable = true)
| |-- long: null (nullable = true)
DataSet 2
root
|-- name: string (nullable = true)
|-- count: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I am unable to do the union operation on the DataFrames when using the following:
data= dataset1_df.union(dataset2_df)
How to solve this?
Updated:
I would like to change the schemas to something like:
DataSet 1
root
|-- name: string (nullable = true)
|-- count: long
DataSet2
root
|-- name: string (nullable = true)
|-- count: long
A simple solution would be typecasting one of the dataframes to match the other, as below-
val df1 = spark.sql("select 'foo' name, named_struct('int', 1, 'long', null) count")
df1.show(false)
df1.printSchema()
/**
* +----+-----+
* |name|count|
* +----+-----+
* |foo |[1,] |
* +----+-----+
*
* root
* |-- name: string (nullable = false)
* |-- count: struct (nullable = false)
* | |-- int: integer (nullable = false)
* | |-- long: null (nullable = true)
*/
val df2 = spark.sql("select 'bar' name, map('2', 3) count")
df2.show(false)
df2.printSchema()
/**
* +----+--------+
* |name|count |
* +----+--------+
* |bar |[2 -> 3]|
* +----+--------+
*
* root
* |-- name: string (nullable = false)
* |-- count: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
*/
df1.withColumn("count",
map($"count.int".cast("string"), $"count.long".cast("integer")))
.union(df2)
.show(false)
/**
* +----+--------+
* |name|count |
* +----+--------+
* |foo |[1 ->] |
* |bar |[2 -> 3]|
* +----+--------+
*/
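If, as in the update, you would rather end up with a plain long count on both sides before the union, something along the following lines could work. This is only a sketch, assuming the struct's value always sits in its int field (the long field is always null), that each map in df2 holds exactly one entry, and that spark.implicits._ and functions._ are imported as elsewhere in this answer:

// Sketch: flatten `count` to a long on both sides to match the updated target schema
val df1Flat = df1.select($"name", $"count.int".cast("long").as("count"))
val df2Flat = df2.select($"name", element_at(map_values($"count"), 1).cast("long").as("count"))
df1Flat.union(df2Flat).show(false)
// expected rows: (foo, 1) and (bar, 3)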
Below is my data. I am doing a groupBy on parcel_id, and I need to sum sqft only where
imprv_det_type_cd starts with MA.
input:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
In that case, only two rows from the above are considered.
expected output:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
code:
# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
import pyspark.sql.functions as F
from pyspark.sql.window import Window

def _transfrom_imp_detail():
    w_impr = Window.partitionBy("parcel_id")
    return(
        (spark.read.text(path_ade_imp_info)
            .select(
                F.trim(F.col("value").substr(1, 12)).alias("parcel_id"),
                F.trim(F.col("value").substr(86, 4)).cast("integer").alias("year"),
                F.trim(F.col("value").substr(94, 15)).cast("integer").alias("sqft"),
                F.trim(F.col("value").substr(41, 10)).alias("imprv_det_type_cd"),
            )
            .withColumn(
                "parcel_id",
                F.regexp_replace('parcel_id', r'^[0]*', '')
            )
            .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
            .withColumn("year_built", F.min("year").over(w_impr))
        ).drop("sqft", "year").drop_duplicates(["parcel_id"])
    )
I know the change needs to go in .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)), but I am not sure what change to make. I tried the when function but it still isn't working.
Thank you in advance.
I don't know why you say you are doing a groupBy when your code doesn't actually do one; it uses window functions instead. With a filter and an explicit groupBy:
import pyspark.sql.functions as f

df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
    .filter("imprv_det_type_cd like 'MA%'") \
    .groupBy('parcel_id', 'year') \
    .agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
    .show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
Use sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
.agg(
sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
first("imprv_det_type_cd").as("imprv_det_type_cd"),
first($"year").as("year_built")
)
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/
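The total comes out as 7472.0 because sqft is read as a string and SUM promotes it to double. Casting before summing gives the integer 7472 shown in the expected output; here is a sketch, a small variation on the same aggregation over the same df2:

// Sketch: cast sqft before summing so structure_total_sqft is 7472 rather than 7472.0
val p2 = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft".cast("int"))).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p2.show(false)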
I have columns of different map types, like below:
MapType(StringType(), StringType())
MapType(StringType(), IntegerType())
MapType(StringType(), DoubleType())
How can I concat them into one map while keeping the types intact?
You can concat MapType columns having different key and value types, but post concat Spark converts the map key/value types to the highest common type it finds.
For example-
If you consider the 3 columns having below types resp.-
col1 - MapType(StringType(), StringType())
col2 - MapType(StringType(), IntegerType())
col3 - MapType(StringType(), DoubleType())
the map_concat output will be as below-
map_concat(col1, col2, col3) - MapType(StringType(), StringType())
This is because Spark finds StringType to be the highest common type for both the key and the value.
Now,
Why can't Spark keep the original types intact for the key-value pairs?
Ans-
Spark stores a MapType value backed by two ArrayData instances:
class ArrayBasedMapData(val keyArray: ArrayData, val valueArray: ArrayData) extends MapData {
...
}
& ArrayData can't handle the heterogenous type. Hence spark can't keep its original type intact post concatenation.
Working example for reference
val df = spark.sql("select map('a', 'b') as col1, map('c', cast(1 as int)) as col2, " +
"map(1, cast(2.2 as double)) as col3")
df.printSchema()
df.show(false)
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: string
* | |-- value: double (valueContainsNull = false)
*
* +--------+--------+----------+
* |col1 |col2 |col3 |
* +--------+--------+----------+
* |[a -> b]|[c -> 1]|[d -> 2.2]|
* +--------+--------+----------+
*/
val p = df.withColumn("new_col", map_concat($"col1", $"col2", $"col3"))
p.printSchema()
p.show(false)
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: string
* | |-- value: double (valueContainsNull = false)
* |-- new_col: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
*
* +--------+--------+----------+--------------------------+
* |col1 |col2 |col3 |new_col |
* +--------+--------+----------+--------------------------+
* |[a -> b]|[c -> 1]|[d -> 2.2]|[a -> b, c -> 1, d -> 2.2]|
* +--------+--------+----------+--------------------------+
*/
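Reading a value back from the concatenated map shows the promotion concretely: the entry written as an integer now comes back typed as a string. A quick follow-on check on the same p dataframe, as a sketch:

// Sketch: the value inserted as integer 1 under key 'c' is now typed string in new_col
p.selectExpr("new_col['c']").printSchema()
// expected: the single resulting column is of type string (nullable = true)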
Update-1
Use struct to combine columns into one
val x = df.withColumn("x", struct($"col1", $"col2", $"col3"))
x.printSchema()
x.selectExpr("x.col1['a']", "x.col2['c']", "x.col3['d']").printSchema()
/**
* root
* |-- col1: map (nullable = false)
* | |-- key: string
* | |-- value: string (valueContainsNull = false)
* |-- col2: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
* |-- col3: map (nullable = false)
* | |-- key: integer
* | |-- value: double (valueContainsNull = false)
* |-- x: struct (nullable = false)
* | |-- col1: map (nullable = false)
* | | |-- key: string
* | | |-- value: string (valueContainsNull = false)
* | |-- col2: map (nullable = false)
* | | |-- key: string
* | | |-- value: integer (valueContainsNull = false)
* | |-- col3: map (nullable = false)
* | | |-- key: integer
* | | |-- value: double (valueContainsNull = false)
*
* root
* |-- x.col1 AS `col1`[a]: string (nullable = true)
* |-- x.col2 AS `col2`[c]: integer (nullable = true)
* |-- x.col3 AS `col3`[CAST(d AS INT)]: double (nullable = true)
*/
I have a DataFrame with the following schema :
root
|-- user_id: string (nullable = true)
|-- user_loans_arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- loan_date: string (nullable = true)
| | |-- loan_amount: string (nullable = true)
|-- new_loan: struct (nullable = true)
| |-- loan_date : string (nullable = true)
| |-- loan_amount : string (nullable = true)
I want to use a UDF which takes user_loans_arr and new_loan as inputs and adds the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr, delete all the elements whose loan_date is older than 12 months.
Thanks in advance.
If Spark >= 2.4 then you don't need a UDF; check the example below-
Load the input data
val df = spark.sql(
"""
|select user_id, user_loans_arr, new_loan
|from values
| ('u1', array(named_struct('loan_date', '2019-01-01', 'loan_amount', 100)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100)),
| ('u2', array(named_struct('loan_date', '2020-01-01', 'loan_amount', 200)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100))
| T(user_id, user_loans_arr, new_loan)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +-------+-------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+-------------------+-----------------+
* |u1 |[[2019-01-01, 100]]|[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200]]|[2020-01-01, 100]|
* +-------+-------------------+-----------------+
*
* root
* |-- user_id: string (nullable = false)
* |-- user_loans_arr: array (nullable = false)
* | |-- element: struct (containsNull = false)
* | | |-- loan_date: string (nullable = false)
* | | |-- loan_amount: integer (nullable = false)
* |-- new_loan: struct (nullable = false)
* | |-- loan_date: string (nullable = false)
* | |-- loan_amount: integer (nullable = false)
*/
Process as per below requirement
Take user_loans_arr and new_loan as inputs and add the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr, delete all the elements whose loan_date is older than 12 months.
spark >= 2.4
df.withColumn("user_loans_arr",
expr(
"""
|FILTER(array_union(user_loans_arr, array(new_loan)),
| x -> months_between(current_date(), to_date(x.loan_date)) < 12)
""".stripMargin))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
spark < 2.4
// spark < 2.4
val outputSchema = df.schema("user_loans_arr").dataType
import java.time._
import scala.collection.mutable
import org.apache.spark.sql.Row

val add_and_filter = udf((userLoansArr: mutable.WrappedArray[Row], loan: Row) => {
(userLoansArr :+ loan).filter(row => {
val loanDate = LocalDate.parse(row.getAs[String]("loan_date"))
val period = Period.between(loanDate, LocalDate.now())
period.getYears * 12 + period.getMonths < 12
})
}, outputSchema)
df.withColumn("user_loans_arr", add_and_filter($"user_loans_arr", $"new_loan"))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
You need to pass your array and struct columns to the udf wrapped together as an array or a struct; I prefer passing them as a struct.
Inside the udf you can manipulate the elements and return an array type.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
tst_1=(tst.withColumn("arr",F.array('col1','col2'))).withColumn("str",F.struct('col3','col4'))
# udf to return array
@udf(ArrayType(StringType()))
def fn(row):
    # empty result when the second array element is greater than the struct's col4
    if row.arr[1] > row.str.col4:
        res = []
    else:
        # otherwise extend the array with the struct's field values
        res = row.arr + list(row.str.asDict().values())
    return res
# calling udf with a struct of array and struct column
tst_fin = tst_1.withColumn("res",fn(F.struct('arr','str')))
The result is
tst_fin.show()
+----+----+----+----+------+------+------------+
|col1|col2|col3|col4| arr| str| res|
+----+----+----+----+------+------+------------+
| 1| 2| 3| 4|[1, 2]|[3, 4]|[1, 2, 4, 3]|
| 3| 4| 5| 1|[3, 4]|[5, 1]| []|
| 5| 6| 7| 8|[5, 6]|[7, 8]|[5, 6, 8, 7]|
| 7| 8| 9| 2|[7, 8]|[9, 2]| []|
+----+----+----+----+------+------+------------+
This example takes everything as int. Since you have dates as strings, inside your udf you have to use Python's datetime functions for the comparison.
I have a dataframe with the following schema:
input dataframe
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge the 2020 and 2019 columns into a single array-of-structs column as well, based on the matching key value.
Desired schema:
expected output dataframe
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x_this_year: double (nullable = true)
| | |-- y_this_year: double (nullable = true)
| | |-- x_last_year: double (nullable = true)
| | |-- y_last_year: double (nullable = true)
| | |-- z_this_year: double (nullable = true)
I would like to merge on the matching key in the structs. Also note that if a key is present in only one of the 2019 or 2020 arrays, then nulls need to be used to substitute the values of the other year in the merged column.
scala> val df = Seq(
| ("ABC",
| Seq(
| ("a", 2, 4, 6),
| ("b", 3, 6, 9),
| ("c", 1, 2, 3)
| ),
| Seq(
| ("a", 4, 8),
| ("d", 3, 4)
| ))
| ).toDF("A", "B_2020", "B_2019").select(
| $"A",
| $"B_2020" cast "array<struct<key:string,x:double,y:double,z:double>>",
| $"B_2019" cast "array<struct<key:string,x:double,y:double>>"
| )
df: org.apache.spark.sql.DataFrame = [A: string, B_2020: array<struct<key:string,x:double,y:double,z:double>> ... 1 more field]
scala> df.printSchema
root
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
scala> df.show(false)
+---+------------------------------------------------------------+------------------------------+
|A |B_2020 |B_2019 |
+---+------------------------------------------------------------+------------------------------+
|ABC|[[a, 2.0, 4.0, 6.0], [b, 3.0, 6.0, 9.0], [c, 1.0, 2.0, 3.0]]|[[a, 4.0, 8.0], [d, 3.0, 4.0]]|
+---+------------------------------------------------------------+------------------------------+
scala> val df2020 = df.select($"A", explode($"B_2020") as "this_year").select($"A",
| $"this_year.key" as "key", $"this_year.x" as "x_this_year",
| $"this_year.y" as "y_this_year", $"this_year.z" as "z_this_year")
df2020: org.apache.spark.sql.DataFrame = [A: string, key: string ... 3 more fields]
scala> val df2019 = df.select($"A", explode($"B_2019") as "last_year").select($"A",
| $"last_year.key" as "key", $"last_year.x" as "x_last_year",
| $"last_year.y" as "y_last_year")
df2019: org.apache.spark.sql.DataFrame = [A: string, key: string ... 2 more fields]
scala> df2020.show(false)
+---+---+-----------+-----------+-----------+
|A |key|x_this_year|y_this_year|z_this_year|
+---+---+-----------+-----------+-----------+
|ABC|a |2.0 |4.0 |6.0 |
|ABC|b |3.0 |6.0 |9.0 |
|ABC|c |1.0 |2.0 |3.0 |
+---+---+-----------+-----------+-----------+
scala> df2019.show(false)
+---+---+-----------+-----------+
|A |key|x_last_year|y_last_year|
+---+---+-----------+-----------+
|ABC|a |4.0 |8.0 |
|ABC|d |3.0 |4.0 |
+---+---+-----------+-----------+
scala> val outputDF = df2020.join(df2019, Seq("A", "key"), "outer").select(
| $"A" as "market_name",
| struct($"key", $"x_this_year", $"y_this_year", $"x_last_year",
| $"y_last_year", $"z_this_year") as "cancellation_policy_booking")
outputDF: org.apache.spark.sql.DataFrame = [market_name: string, cancellation_policy_booking: struct<key: string, x_this_year: double ... 4 more fields>]
scala> outputDF.printSchema
root
|-- market_name: string (nullable = true)
|-- cancellation_policy_booking: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- x_this_year: double (nullable = true)
| |-- y_this_year: double (nullable = true)
| |-- x_last_year: double (nullable = true)
| |-- y_last_year: double (nullable = true)
| |-- z_this_year: double (nullable = true)
scala> outputDF.show(false)
+-----------+----------------------------+
|market_name|cancellation_policy_booking |
+-----------+----------------------------+
|ABC |[b, 3.0, 6.0,,, 9.0] |
|ABC |[a, 2.0, 4.0, 4.0, 8.0, 6.0]|
|ABC |[d,,, 3.0, 4.0,] |
|ABC |[c, 1.0, 2.0,,, 3.0] |
+-----------+----------------------------+
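The result above has one row per key rather than the single array-of-structs column per A that the question asks for. A final roll-up gets back to that shape; this is only a sketch on the same outputDF (column names A and B are illustrative, and the same imports as in the session above are assumed):

// Sketch: fold the per-key rows back into one array-of-structs column per id
val mergedDF = outputDF
  .groupBy($"market_name" as "A")
  .agg(collect_list($"cancellation_policy_booking") as "B")
mergedDF.printSchema
mergedDF.show(false)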