Calculating the sum with a condition - PySpark

Below is my data. I am grouping by parcel_id, and I need to sum sqft only when imprv_det_type_cd starts with MA.
Input:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
In that case, only two rows from the above are considered.
Expected output:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
Code:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
def _transfrom_imp_detail():
    w_impr = Window.partitionBy("parcel_id")
    return (
        spark.read.text(path_ade_imp_info)
        .select(
            F.trim(F.col("value").substr(1, 12)).alias("parcel_id"),
            F.trim(F.col("value").substr(86, 4)).cast("integer").alias("year"),
            F.trim(F.col("value").substr(94, 15)).cast("integer").alias("sqft"),
            F.trim(F.col("value").substr(41, 10)).alias("imprv_det_type_cd"),
        )
        .withColumn(
            "parcel_id",
            F.regexp_replace('parcel_id', r'^[0]*', '')
        )
        .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
        .withColumn("year_built", F.min("year").over(w_impr))
        .drop("sqft", "year")
        .drop_duplicates(["parcel_id"])
    )
I know the change has to be in .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)), but I am not sure what change to make. I tried the when function but it is still not working.
Thank you in advance.

I am not sure why you said you are doing a groupBy, because your code doesn't use one. Filter on the MA prefix first, then group and aggregate:
df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
  .filter("imprv_det_type_cd like 'MA%'") \
  .groupBy('parcel_id', 'year') \
  .agg(f.sum('sqft').alias('sqft'),
       f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
  .show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+

Use sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
  .agg(
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
    first("imprv_det_type_cd").as("imprv_det_type_cd"),
    first($"year").as("year_built")
  )
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/
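Since the question is tagged pyspark, here is a minimal PySpark sketch of the same sum(when(...)) idea, kept close to the window-based code in the question; df is assumed to be the DataFrame built inside _transfrom_imp_detail before the aggregation columns are added.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_impr = Window.partitionBy("parcel_id")

result = (
    df
    # sum sqft only for the rows whose imprv_det_type_cd starts with MA;
    # F.when(...) leaves the other rows as null and F.sum ignores nulls
    .withColumn(
        "structure_total_sqft",
        F.sum(F.when(F.col("imprv_det_type_cd").startswith("MA"), F.col("sqft"))).over(w_impr),
    )
    .withColumn("year_built", F.min("year").over(w_impr))
    # keep one MA row per parcel and shorten the code to its MA prefix, as in the expected output
    .filter(F.col("imprv_det_type_cd").startswith("MA"))
    .withColumn("imprv_det_type_cd", F.substring("imprv_det_type_cd", 1, 2))
    .drop("sqft", "year")
    .drop_duplicates(["parcel_id"])
)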

Related

Union can only be performed on tables with the compatible column types

u"Union can only be performed on tables with the compatible column types. map<string,int> <> structint:int,long:null at the Nth column of the second table.
Here is what the schemas look like:
Dataset 1
root
|-- name: string (nullable = true)
|-- count: struct (nullable = true)
| |-- int: integer (nullable = true)
| |-- long: null (nullable = true)
DataSet 2
root
|-- name: string (nullable = true)
|-- count: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I am unable to do the union operation on the DataFrames when using the following:
data = dataset1_df.union(dataset2_df)
How can I solve this?
Update:
I would like to change the schemas to:
DataSet 1
root
|-- name: string (nullable = true)
|-- count: long
DataSet2
root
|-- name: string (nullable = true)
|-- count: long
A simple solution would be to typecast one of the DataFrames to match the other, as below-
val df1 = spark.sql("select 'foo' name, named_struct('int', 1, 'long', null) count")
df1.show(false)
df1.printSchema()
/**
* +----+-----+
* |name|count|
* +----+-----+
* |foo |[1,] |
* +----+-----+
*
* root
* |-- name: string (nullable = false)
* |-- count: struct (nullable = false)
* | |-- int: integer (nullable = false)
* | |-- long: null (nullable = true)
*/
val df2 = spark.sql("select 'bar' name, map('2', 3) count")
df2.show(false)
df2.printSchema()
/**
* +----+--------+
* |name|count |
* +----+--------+
* |bar |[2 -> 3]|
* +----+--------+
*
* root
* |-- name: string (nullable = false)
* |-- count: map (nullable = false)
* | |-- key: string
* | |-- value: integer (valueContainsNull = false)
*/
df1.withColumn("count",
map($"count.int".cast("string"), $"count.long".cast("integer")))
.union(df2)
.show(false)
/**
* +----+--------+
* |name|count |
* +----+--------+
* |foo |[1 ->] |
* |bar |[2 -> 3]|
* +----+--------+
*/
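If, as in the update, you would rather end up with a plain long count on both sides, here is a minimal PySpark sketch. It assumes the struct's int field carries the value and that the map holds a single entry whose value is the count (element_at needs Spark >= 2.4).
from pyspark.sql import functions as F

# Dataset 1: take the struct's int field as the count
ds1 = dataset1_df.withColumn("count", F.col("count.int").cast("long"))

# Dataset 2: take the value of the map's single entry as the count
ds2 = dataset2_df.withColumn("count", F.element_at(F.map_values("count"), 1).cast("long"))

data = ds1.union(ds2)
data.printSchema()  # both sides now expose count: long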

Array operations in a map function : Spark 1.6

I have a column which is a wrapped array of structs, each containing an integer and a double value.
The schema looks like this:
|-- pricing_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
So, whenever this column's value is [[0,0.0]], I need to change it to an empty array:
[[0,0.0]] -> [[]]
How can I do this using a map function, or using a DataFrame?
Try this-
spark>=2.4
val df = Seq(Seq((0, 0.0)), Seq((1, 2.2))).toDF("pricing_data")
df.show(false)
df.printSchema()
/**
* +------------+
* |pricing_data|
* +------------+
* |[[0, 0.0]] |
* |[[1, 2.2]] |
* +------------+
*
* root
* |-- pricing_data: array (nullable = true)
* | |-- element: struct (containsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: double (nullable = false)
*/
df.withColumn("pricing_data", expr(
"TRANSFORM(pricing_data, x -> if(x._1=0 and x._2=0.0, named_struct('_1', null, '_2', null), x))"
))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
spark<2.4
// spark<2.4
import scala.collection.mutable
import org.apache.spark.sql.Row

val dataType = df.schema("pricing_data").dataType
val replace = udf((arrayOfStruct: mutable.WrappedArray[Row]) => {
  arrayOfStruct.map(row => {
    val map = row.getValuesMap[Any](row.schema.map(_.name))
    if (map("_1") == 0 && map("_2") == 0.0) {
      Row.fromTuple((null, null))
    } else row
  })
}, dataType)
df.withColumn("pricing_data", replace($"pricing_data"))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
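For reference, a PySpark take on the Spark < 2.4 branch; this is only a sketch that reuses the column's own type as the udf return type, and the name replace_zero is illustrative. df is the DataFrame from the answer above.
from pyspark.sql.functions import udf

# reuse the array<struct<_1:int,_2:double>> type of the column as the return type
pricing_type = df.schema["pricing_data"].dataType

@udf(pricing_type)
def replace_zero(array_of_struct):
    # map each [0, 0.0] struct to a struct of nulls, keep everything else
    return [(None, None) if (row["_1"] == 0 and row["_2"] == 0.0) else row
            for row in (array_of_struct or [])]

df.withColumn("pricing_data", replace_zero("pricing_data")).show(truncate=False)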

How to write Spark UDF which takes Array[StructType], StructType as input and return Array[StructType]

I have a DataFrame with the following schema :
root
|-- user_id: string (nullable = true)
|-- user_loans_arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- loan_date: string (nullable = true)
| | |-- loan_amount: string (nullable = true)
|-- new_loan: struct (nullable = true)
| |-- loan_date : string (nullable = true)
| |-- loan_amount : string (nullable = true)
I want to use a UDF which takes user_loans_arr and new_loan as inputs, adds the new_loan struct to the existing user_loans_arr, and then deletes from user_loans_arr all the elements whose loan_date is older than 12 months.
Thanks in advance.
If spark >= 2.4 then you don't need a UDF; check the example below-
Load the input data
val df = spark.sql(
"""
|select user_id, user_loans_arr, new_loan
|from values
| ('u1', array(named_struct('loan_date', '2019-01-01', 'loan_amount', 100)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100)),
| ('u2', array(named_struct('loan_date', '2020-01-01', 'loan_amount', 200)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100))
| T(user_id, user_loans_arr, new_loan)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +-------+-------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+-------------------+-----------------+
* |u1 |[[2019-01-01, 100]]|[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200]]|[2020-01-01, 100]|
* +-------+-------------------+-----------------+
*
* root
* |-- user_id: string (nullable = false)
* |-- user_loans_arr: array (nullable = false)
* | |-- element: struct (containsNull = false)
* | | |-- loan_date: string (nullable = false)
* | | |-- loan_amount: integer (nullable = false)
* |-- new_loan: struct (nullable = false)
* | |-- loan_date: string (nullable = false)
* | |-- loan_amount: integer (nullable = false)
*/
Process it as per the requirement below:
user_loans_arr and new_loan as inputs and add the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr delete all the elements whose loan_date is older than 12 months.
spark >= 2.4
df.withColumn("user_loans_arr",
expr(
"""
|FILTER(array_union(user_loans_arr, array(new_loan)),
| x -> months_between(current_date(), to_date(x.loan_date)) < 12)
""".stripMargin))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
spark < 2.4
// spark < 2.4
import java.time._
import scala.collection.mutable
import org.apache.spark.sql.Row

val outputSchema = df.schema("user_loans_arr").dataType
val add_and_filter = udf((userLoansArr: mutable.WrappedArray[Row], loan: Row) => {
  (userLoansArr :+ loan).filter(row => {
    val loanDate = LocalDate.parse(row.getAs[String]("loan_date"))
    val period = Period.between(loanDate, LocalDate.now())
    period.getYears * 12 + period.getMonths < 12
  })
}, outputSchema)
df.withColumn("user_loans_arr", add_and_filter($"user_loans_arr", $"new_loan"))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
You need to pass your array and struct columns to the UDF, either separately or wrapped together in a struct; I prefer passing them as a struct.
Inside the UDF you can manipulate the elements and return an array type.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
tst_1=(tst.withColumn("arr",F.array('col1','col2'))).withColumn("str",F.struct('col3','col4'))
# udf to return an array
@udf(ArrayType(StringType()))
def fn(row):
    if row.arr[1] > row.str.col4:
        res = []
    else:
        res = row.arr + list(row.str.asDict().values())
    return res
# calling udf with a struct of array and struct column
tst_fin = tst_1.withColumn("res",fn(F.struct('arr','str')))
The result is
tst_fin.show()
+----+----+----+----+------+------+------------+
|col1|col2|col3|col4| arr| str| res|
+----+----+----+----+------+------+------------+
| 1| 2| 3| 4|[1, 2]|[3, 4]|[1, 2, 4, 3]|
| 3| 4| 5| 1|[3, 4]|[5, 1]| []|
| 5| 6| 7| 8|[5, 6]|[7, 8]|[5, 6, 8, 7]|
| 7| 8| 9| 2|[7, 8]|[9, 2]| []|
+----+----+----+----+------+------+------------+
This example treats everything as int. Since your dates are strings, inside your udf you have to use Python's datetime functions for the comparison.
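For the date part, here is a minimal sketch of such a udf, assuming the loan_date strings are in yyyy-MM-dd format and approximating "older than 12 months" with a 365-day cutoff; the column and field names follow the question's schema, and append_and_filter is just an illustrative name.
from datetime import datetime, timedelta
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

loans_type = ArrayType(StructType([
    StructField("loan_date", StringType()),
    StructField("loan_amount", StringType()),
]))

@udf(loans_type)
def append_and_filter(user_loans_arr, new_loan):
    # append the new loan, then keep only the loans from the last ~12 months
    cutoff = datetime.now() - timedelta(days=365)
    loans = list(user_loans_arr or []) + [new_loan]
    return [loan for loan in loans
            if datetime.strptime(loan["loan_date"], "%Y-%m-%d") >= cutoff]

# usage:
# df.withColumn("user_loans_arr", append_and_filter("user_loans_arr", "new_loan"))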

Better Alternatives to EXCEPT Spark Scala

I have been told that EXCEPT is a very costly operation and one should always try to avoid using EXCEPT.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use left anti join as below-
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/

How to create a new timestamp(minute) column in a CSV file using scala/python/spark

I have a CSV file, and I want to create a new minute-level timestamp column as shown below.
Actual:
Col1, Col2
1.19185711131486, 0.26615071205963
-1.3598071336738, -0.0727811733098497
-0.966271711572087, -0.185226008082898
-0.966271711572087, -0.185226008082898
-1.15823309349523, 0.877736754848451
-0.425965884412454, 0.960523044882985
Expected:
Col1, Col2, ts
1.19185711131486, 0.26615071205963, 00:00:00
-1.3598071336738, -0.0727811733098497, 00:01:00
-0.966271711572087, -0.185226008082898, 00:02:00
-0.966271711572087, -0.185226008082898, 00:03:00
-1.15823309349523, 0.877736754848451, 00:04:00
-0.425965884412454, 0.960523044882985, 00:05:00
Thanks in advance!
perhaps this is useful -
val data =
"""
|Col1, Col2
|1.19185711131486, 0.26615071205963
|-1.3598071336738, -0.0727811733098497
|-0.966271711572087, -0.185226008082898
|-0.966271711572087, -0.185226008082898
|-1.15823309349523, 0.877736754848451
|-0.425965884412454, 0.960523044882985
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- Col1: double (nullable = true)
* |-- Col2: double (nullable = true)
*
* +------------------+-------------------+
* |Col1 |Col2 |
* +------------------+-------------------+
* |1.19185711131486 |0.26615071205963 |
* |-1.3598071336738 |-0.0727811733098497|
* |-0.966271711572087|-0.185226008082898 |
* |-0.966271711572087|-0.185226008082898 |
* |-1.15823309349523 |0.877736754848451 |
* |-0.425965884412454|0.960523044882985 |
* +------------------+-------------------+
*/
df.withColumn("ts",
date_format(to_timestamp((row_number().over(Window.orderBy(df.columns.map(col): _*)) - 1).cast("string"),
"mm")
, "00:mm:00"))
.show(false)
/**
* +------------------+-------------------+--------+
* |Col1 |Col2 |ts |
* +------------------+-------------------+--------+
* |-1.3598071336738 |-0.0727811733098497|00:00:00|
* |-1.15823309349523 |0.877736754848451 |00:01:00|
* |-0.966271711572087|-0.185226008082898 |00:02:00|
* |-0.966271711572087|-0.185226008082898 |00:03:00|
* |-0.425965884412454|0.960523044882985 |00:04:00|
* |1.19185711131486 |0.26615071205963 |00:05:00|
* +------------------+-------------------+--------+
*/
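A PySpark variant of the same idea, for reference; it is a sketch that tags the rows with monotonically_increasing_id() so the original file order is kept (the Scala output above is re-sorted by the data columns), and builds the timestamp with format_string so it also rolls over into hours after 59 rows. The path "input.csv" is a placeholder.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("input.csv"))

# number the rows in their original order, then turn the row number into hh:mm:00
w = Window.orderBy("_row_id")
minutes = F.row_number().over(w) - 1
(df.withColumn("_row_id", F.monotonically_increasing_id())
   .withColumn("ts", F.format_string("%02d:%02d:00", (minutes / 60).cast("int"), minutes % 60))
   .drop("_row_id")
   .show(truncate=False))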