I have been told that EXCEPT is a very costly operation and one should always try to avoid using EXCEPT.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use a left anti join, as below -
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/
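The same idea applied to the use case in the question (a sketch reusing rawDataDf and myFilteredDataframe from above; note, as a caveat not covered here, that except() also de-duplicates its result, while a left anti join keeps duplicate rows):
// Left anti join on every column: keep the rows of rawDataDf that have no match in myFilteredDataframe
val allOthersDataframe = rawDataDf.join(
  myFilteredDataframe,
  rawDataDf.columns.toSeq,   // join on all columns
  "leftanti"
)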
Below is my data. I am doing a groupBy on parcel_id, and I need to sum sqft only where imprv_det_type_cd starts with MA.
input:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
In that case, only two rows from the above are considered (the ones whose imprv_det_type_cd starts with MA).
expected output:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
code:
# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def _transfrom_imp_detail():
    w_impr = Window.partitionBy("parcel_id")
    return (
        spark.read.text(path_ade_imp_info)
        .select(
            F.trim(F.col("value").substr(1, 12)).alias("parcel_id"),
            F.trim(F.col("value").substr(86, 4)).cast("integer").alias("year"),
            F.trim(F.col("value").substr(94, 15)).cast("integer").alias("sqft"),
            F.trim(F.col("value").substr(41, 10)).alias("imprv_det_type_cd"),
        )
        .withColumn("parcel_id", F.regexp_replace("parcel_id", r"^[0]*", ""))
        .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr))
        .withColumn("year_built", F.min("year").over(w_impr))
        .drop("sqft", "year")
        .drop_duplicates(["parcel_id"])
    )
I know the change has to be in the .withColumn("structure_total_sqft", F.sum("sqft").over(w_impr)) line, but I am not sure what change to make. I tried the when function, but it still does not work.
Thank you in advance.
I am not sure why you say you are doing a groupBy, since your code does not actually use one. Here is a version that does:
df.withColumn('parcel_id', f.regexp_replace('parcel_id', r'^[0]*', '')) \
.filter("imprv_det_type_cd like 'MA%'") \
.groupBy('parcel_id', 'year') \
.agg(f.sum('sqft').alias('sqft'), f.first(f.substring('imprv_det_type_cd', 0, 2)).alias('imprv_det_type_cd')) \
.show(10, False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
Use sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
.agg(
sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).as("structure_total_sqft"),
first("imprv_det_type_cd").as("imprv_det_type_cd"),
first($"year").as("year_built")
)
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/
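If you prefer to keep the window-based shape of your original code, the same conditional sum can go inside the window. A sketch in Scala against the df2 above (the PySpark equivalent of the key expression is F.sum(F.when(...)).over(w_impr)); treat the trailing filter and dropDuplicates as assumptions about how you want to pick the representative row:
import org.apache.spark.sql.expressions.Window

val w_impr = Window.partitionBy("parcel_id")
df2
  .withColumn("structure_total_sqft",
    sum(when($"imprv_det_type_cd".startsWith("MA"), $"sqft")).over(w_impr))  // only MA* rows contribute
  .withColumn("year_built", min("year").over(w_impr))
  .where($"imprv_det_type_cd".startsWith("MA"))   // keep an MA row per parcel after dedup
  .dropDuplicates("parcel_id")
  .select("parcel_id", "imprv_det_type_cd", "structure_total_sqft", "year_built")
  .show(false)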
I have the dataframe below.
val df1 = Seq(
  ("1_2_3", "5_10"),
  ("4_5_6", "15_20")
).toDF("c1", "c2")
+-----+-----+
| c1| c2|
+-----+-----+
|1_2_3| 5_10|
|4_5_6|15_20|
+-----+-----+
How can I get the element-wise sum in a separate column, based on this condition -
-split both columns on the delimiter '_' and add the values position by position, i.e. 1+5 and 2+10 in the first row, and 4+15 and 5+20 in the second,
-omit the extra third value in the first column, i.e. the '_3' and '_6' in 1_2_3 and 4_5_6.
Expected output -
+-----+-----+-----+
| c1| c2| res|
+-----+-----+-----+
|1_2_3| 5_10| 6_12|
|4_5_6|15_20|19_25|
+-----+-----+-----+
Try this-
zip_with + split
val df1 = Seq(
  ("1_2_3", "5_10"),
  ("4_5_6", "15_20")
).toDF("c1", "c2")
df1.show(false)
df1.withColumn("res",
expr("concat_ws('_', zip_with(split(c1, '_'), split(c2, '_'), (x, y) -> cast(x+y as int)))"))
.show(false)
/**
* +-----+-----+-----+
* |c1 |c2 |res |
* +-----+-----+-----+
* |1_2_3|5_10 |6_12 |
* |4_5_6|15_20|19_25|
* +-----+-----+-----+
*/
Update - to do this dynamically for 50 columns:
val end = 51 // 50 cols
val df = spark.sql("select '1_2_3' as c1")
val new_df = Range(2, end).foldLeft(df){(df, i) => df.withColumn(s"c$i", $"c1")}
new_df.show(false)
/**
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
* |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |c10 |c11 |c12 |c13 |c14 |c15 |c16 |c17 |c18 |c19 |c20 |c21 |c22 |c23 |c24 |c25 |c26 |c27 |c28 |c29 |c30 |c31 |c32 |c33 |c34 |c35 |c36 |c37 |c38 |c39 |c40 |c41 |c42 |c43 |c44 |c45 |c46 |c47 |c48 |c49 |c50 |
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
* |1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
*/
val res = new_df.withColumn("res", $"c1")
Range(2, end).foldLeft(res){(df4, i) =>
df4.withColumn("res",
expr(s"concat_ws('_', zip_with(split(res, '_'), split(${s"c$i"}, '_'), (x, y) -> cast(x+y as int)))"))
}
.show(false)
/**
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
* |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |c10 |c11 |c12 |c13 |c14 |c15 |c16 |c17 |c18 |c19 |c20 |c21 |c22 |c23 |c24 |c25 |c26 |c27 |c28 |c29 |c30 |c31 |c32 |c33 |c34 |c35 |c36 |c37 |c38 |c39 |c40 |c41 |c42 |c43 |c44 |c45 |c46 |c47 |c48 |c49 |c50 |res |
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
* |1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|50_100_150|
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
*/
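The same fold can also be driven by the dataframe's own column list instead of a hard-coded range, so it adapts to however many columns are present (a sketch; it assumes every column holds underscore-delimited numbers):
// Fold over the columns, seeding res with the first column and adding the rest element-wise
val cols = new_df.columns
val summed = cols.tail.foldLeft(new_df.withColumn("res", col(cols.head))) { (acc, c) =>
  acc.withColumn("res",
    expr(s"concat_ws('_', zip_with(split(res, '_'), split($c, '_'), (x, y) -> cast(x + y as int)))"))
}
summed.select("res").show(false)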
I have this dataframe that gets generated automatically, and the names and number of columns will never be known in advance. I would like to know how I can count the occurrences of each value in each of the columns.
For example,
Col1 Col2 Col3
Row1 True False False
Row2 True True True
Row3 False False True
Row4 False False False
The result should be something like:
Col1 Count Col2 Count Col3 Count
True 2 True 1 True 2
False 2 False 3 False 2
I have tried applying groupBy, kind of like this:
df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show
But this wouldn't work for me since I wouldn't know the number or names of the columns.
Try this-
Load the test data provided
val data =
"""
|Col1 Col2 Col3
|True False False
|True True True
|False False True
|False False False
""".stripMargin
val stringDS2 = data.split(System.lineSeparator())
.map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df2 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
* +-----+-----+-----+
* |Col1 |Col2 |Col3 |
* +-----+-----+-----+
* |true |false|false|
* |true |true |true |
* |false|false|true |
* |false|false|false|
* +-----+-----+-----+
*
* root
* |-- Col1: boolean (nullable = true)
* |-- Col2: boolean (nullable = true)
* |-- Col3: boolean (nullable = true)
*/
A simple way to compute the count of each distinct value per column:
val findCounts = df2.columns.flatMap(c => Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
df2.select(findCounts: _*).distinct()
.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |false|2 |false|3 |true |2 |
* |true |2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
If you need the output in exactly the format you mentioned, try this -
// Assuming all the columns in the dataframe have the same distinct values
val columns = df2.columns
val head = columns.head
val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
columns.tail.foldLeft(zeroDF){
(df, c) => df.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
}.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
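If the columns do not all share the same distinct values, a long-format result (one row per column/value pair) avoids the join altogether. A sketch, not part of the answer above; it assumes the columns share a data type (here boolean) so the union lines up:
// One count row per (column, value) pair
df2.columns
  .map(c => df2.groupBy(col(c).as("value")).count().withColumn("column", lit(c)))
  .reduce(_ unionByName _)
  .select("column", "value", "count")
  .show(false)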
I have a CSV file, and I want to add a new timestamp column that increments by one minute per row, as shown below.
Actual:
Col1, Col2
1.19185711131486, 0.26615071205963
-1.3598071336738, -0.0727811733098497
-0.966271711572087, -0.185226008082898
-0.966271711572087, -0.185226008082898
-1.15823309349523, 0.877736754848451
-0.425965884412454, 0.960523044882985
Expected:
Col1, Col2, ts
1.19185711131486, 0.26615071205963, 00:00:00
-1.3598071336738, -0.0727811733098497, 00:01:00
-0.966271711572087, -0.185226008082898, 00:02:00
-0.966271711572087, -0.185226008082898, 00:03:00
-1.15823309349523, 0.877736754848451, 00:04:00
-0.425965884412454, 0.960523044882985, 00:05:00
Thanks in advance!
Perhaps this is useful -
val data =
"""
|Col1, Col2
|1.19185711131486, 0.26615071205963
|-1.3598071336738, -0.0727811733098497
|-0.966271711572087, -0.185226008082898
|-0.966271711572087, -0.185226008082898
|-1.15823309349523, 0.877736754848451
|-0.425965884412454, 0.960523044882985
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- Col1: double (nullable = true)
* |-- Col2: double (nullable = true)
*
* +------------------+-------------------+
* |Col1 |Col2 |
* +------------------+-------------------+
* |1.19185711131486 |0.26615071205963 |
* |-1.3598071336738 |-0.0727811733098497|
* |-0.966271711572087|-0.185226008082898 |
* |-0.966271711572087|-0.185226008082898 |
* |-1.15823309349523 |0.877736754848451 |
* |-0.425965884412454|0.960523044882985 |
* +------------------+-------------------+
*/
df.withColumn("ts",
date_format(to_timestamp((row_number().over(Window.orderBy(df.columns.map(col): _*)) - 1).cast("string"),
"mm")
, "00:mm:00"))
.show(false)
/**
* +------------------+-------------------+--------+
* |Col1 |Col2 |ts |
* +------------------+-------------------+--------+
* |-1.3598071336738 |-0.0727811733098497|00:00:00|
* |-1.15823309349523 |0.877736754848451 |00:01:00|
* |-0.966271711572087|-0.185226008082898 |00:02:00|
* |-0.966271711572087|-0.185226008082898 |00:03:00|
* |-0.425965884412454|0.960523044882985 |00:04:00|
* |1.19185711131486 |0.26615071205963 |00:05:00|
* +------------------+-------------------+--------+
*/
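Note that Window.orderBy over the data columns sorts the rows, which is why the output above is no longer in file order. If the minutes must follow the original CSV order, one option is to attach a row index first; a sketch, assuming the order produced by the read is the order you want (and, like the answer above, it only covers minutes 00-59):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

// Tag each row with its position, then derive the minute timestamp from that index
val withIdx = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("idx", LongType, nullable = false))
)
withIdx
  .withColumn("ts", date_format(to_timestamp($"idx".cast("string"), "mm"), "00:mm:00"))
  .drop("idx")
  .show(false)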
I have a dataframe, say df1, which I am trying to filter based on a date range.
Example:
| id | name | disconnect_dt_time |
|----|------|---------------------|
| 1 | "a" | 2020-05-19 00:00:00 |
| 2 | "b" | 2020-05-20 00:00:00 |
val df = spark.table("df1")
.filter(col("disconnect_dt_time").cast("timestamp").between(analysisStartDate , analysisEndDate) )
I am getting the below issue:
Reason: [ cannot resolve '((((CAST(CAST(df1.disconnect_dt_time AS
TIMESTAMP) AS STRING) >= '20200520T00:00:00+0000') AND
(CAST(CAST(df1.disconnect_date_datetime AS
TIMESTAMP) AS STRING) <= '20200530T00:00:00+0000'))
What is the reason for this double casting, CAST(CAST(df1.disconnect_dt_time AS TIMESTAMP) AS STRING), and how can it be fixed?
Try this-
val data =
"""
|id | name | disconnect_dt_time
|1 | "a" | 2020-05-10 00:00:00
|2 | "b" | 2020-05-20 00:00:00
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---+----+-------------------+
* |id |name|disconnect_dt_time |
* +---+----+-------------------+
* |1 |a |2020-05-10 00:00:00|
* |2 |b |2020-05-20 00:00:00|
* +---+----+-------------------+
*
* root
* |-- id: integer (nullable = true)
* |-- name: string (nullable = true)
* |-- disconnect_dt_time: timestamp (nullable = true)
*/
df.createOrReplaceTempView("df1")
val analysisStartDate = "20200515T00:00:00+0000"
val analysisEndDate = "20200530T00:00:00+0000"
val fmt = "yyyyMMdd'T'HH:mm:ssZ"
val processedDF = spark.table("df1")
.filter(col("disconnect_dt_time").cast("timestamp")
.between(to_timestamp(lit(analysisStartDate), fmt) , to_timestamp(lit(analysisEndDate), fmt)) )
processedDF.show(false)
/**
* +---+----+-------------------+
* |id |name|disconnect_dt_time |
* +---+----+-------------------+
* |2 |b |2020-05-20 00:00:00|
* +---+----+-------------------+
*/
The double cast depends on how you have defined analysisStartDate and analysisEndDate: when the bounds are plain strings, Spark casts the timestamp column back to string in order to compare it with them.
Case 1: if your analysisStartDate and analysisEndDate are Strings:
val df = List((1,"a","2020-05-19 00:00:00"),(2,"b","2020-05-20 00:00:00")).toDF("id","name","disconnect_dt_time")
df.filter(col("disconnect_dt_time").cast("timestamp").between( "2020-05-20 00:00:00", "2020-05-30 00:00:00" ) ).explain(true)
== Analyzed Logical Plan ==
id: int, name: string, disconnect_dt_time: string
Filter ((cast(cast(disconnect_dt_time#22 as timestamp) as string) >= 2020-05-20 00:00:00) && (cast(cast(disconnect_dt_time#22 as timestamp) as string) <= 2020-05-30 00:00:00))
+- Project [_1#16 AS id#20, _2#17 AS name#21, _3#18 AS disconnect_dt_time#22]
+- LocalRelation [_1#16, _2#17, _3#18]
+---+----+-------------------+
| id|name| disconnect_dt_time|
+---+----+-------------------+
| 2| b|2020-05-20 00:00:00|
+---+----+-------------------+
Case 2: if your analysisStartDate and analysisEndDate are timestamps:
val df = List((1,"a","2020-05-19 00:00:00"),(2,"b","2020-05-20 00:00:00")).toDF("id","name","disconnect_dt_time")
df.filter(col("disconnect_dt_time").cast("timestamp").between( lit("2020-05-20 00:00:00").cast("timestamp"), lit("2020-05-30 00:00:00").cast("timestamp") ) ).explain(true)
== Analyzed Logical Plan ==
id: int, name: string, disconnect_dt_time: string
Filter ((cast(disconnect_dt_time#22 as timestamp) >= cast(2020-05-20 00:00:00 as timestamp)) && (cast(disconnect_dt_time#22 as timestamp) <= cast(2020-05-30 00:00:00 as timestamp)))
+- Project [_1#16 AS id#20, _2#17 AS name#21, _3#18 AS disconnect_dt_time#22]
+- LocalRelation [_1#16, _2#17, _3#18]
+---+----+-------------------+
| id|name| disconnect_dt_time|
+---+----+-------------------+
| 2| b|2020-05-20 00:00:00|
+---+----+-------------------+