Scala — GroupBy column in specific formatting

DF1 is what I have now, and I want to make DF1 look like DF2.
Desired Output:
DF1 DF2
+---------+-------------------+ +---------+------------------------------+
| ID | Category | | ID | Category |
+---------+-------------------+ +---------+------------------------------+
| 31898 | Transfer | | 31898 | Transfer (e-Transfer) |
| 31898 | e-Transfer | =====> | 32614 | Transfer (e-Transfer + IMT) |
| 32614 | Transfer | =====> | 33987 | Transfer (IMT) |
| 32614 | e-Transfer + IMT | +---------+------------------------------+
| 33987 | Transfer |
| 33987 | IMT |
+---------+-------------------+
Code:
val df = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
val DF2 = df.withColumn("Category", $"CategorySet"(0) ($"CategorySet"(1)))
The code is not working; how can I solve it? And if there is any better way to do the same thing, I am open to it. Thank you in advance.

You can try this:
import org.apache.spark.sql.functions._

val sliceRight = udf((array: Seq[String], from: Int) => " (" + array.takeRight(from).mkString(",") + ")")
val df2 = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
df2.withColumn("Category", concat($"CategorySet"(0), sliceRight($"CategorySet", lit(1))))
  .show(false)
Output:
+-----+----------------------------+---------------------------+
|ID |CategorySet |Category |
+-----+----------------------------+---------------------------+
|33987|[Transfer, IMT] |Transfer (IMT) |
|32614|[Transfer, e-Transfer + IMT]|Transfer (e-Transfer + IMT)|
|31898|[Transfer, e-Transfer] |Transfer (e-Transfer) |
+-----+----------------------------+---------------------------+

The same answer with a slight modification:
df.groupBy("ID").agg(collect_set(col("Category")).as("Category")).withColumn("Category", concat(col("Category")(0), lit(" ("), col("Category")(1), lit(")"))).show
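If an ID can ever carry more than two categories, a more general sketch is possible (this is my own assumption, not part of the question; it needs Spark 2.4+ for slice, and it relies on the base category being the first element of the collected set, which collect_set does not strictly guarantee):
import org.apache.spark.sql.functions._

// Sketch only: join every element after the first one instead of indexing position 1 directly.
val grouped = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
grouped.withColumn(
  "Category",
  concat(
    $"CategorySet"(0),                                                     // base category
    lit(" ("),
    concat_ws(",", expr("slice(CategorySet, 2, size(CategorySet) - 1)")),  // the remaining entries
    lit(")")
  )
).show(false)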


pyspark dataframe check if string contains substring

I need help implementing the Python logic below on a PySpark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
You can get the following result with a simple join expression:
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+

How to find the date on which a consecutive column status of "complete" started within a 7-day period

I need to get a date from the input below on which there has been a consecutive 'complete' status for the past 7 days from that given date.
Requirement:
1. Go back 8 days (this is easy).
2. So we are on 20190111 in the data frame below; I need to check day by day from 20190111 back to 20190104 (a 7-day period) and get the date on which the status has been 'complete' for 7 consecutive days. So we should get 20190108.
I need this in Spark-Scala.
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
For Spark >= 2.4:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df= Seq((1,"20190101","complete"),(2,"20190102","complete"),
(3,"20190103","complete"),(4,"20190104","complete"), (5,"20190105","complete"),(6,"20190106","complete"),(7,"20190107","complete"),(8,"20190108","complete"),
(9,"20190109", "pending"),(10,"20190110","complete"),(11,"20190111","complete"),(12,"20190112", "pending"),(13,"20190113","complete"),(14,"20190114","complete"),(15,"20190115", "pending") , (16,"20190116", "pending"),(17,"20190117", "pending"),(18,"20190118", "pending"),(19,"20190119", "pending")).toDF("id","date","status")
val df1= df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove nulls
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// create an integer flag to denote whether the status for the current day equals the status for the previous day
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val df_new= df4.where($"previous_7_sum"===8).select($"date").select(explode(sequence(date_sub($"date",7), $"date")).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
For Spark < 2.4:
Use a UDF instead of the built-in array function "sequence":
val df1= df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove nulls
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// create an integer flag to denote whether the status for the current day equals the status for the previous day
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
// build the 8-day window of dates by simple integer arithmetic on the yyyyMMdd value
// (this assumes the window does not cross a month boundary)
val ud1 = udf((col1: Int) => ((col1 - 7) to col1).toArray)
val df_new= df4.where($"previous_7_sum"===8)
.withColumn("dt_arr", ud1($"date"))
.select(explode($"dt_arr" ).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")

Create dummy variables in a PySpark data frame

I have a spark data frame like:
|---------------------|------------------------------|
| Brand | Model |
|---------------------|------------------------------|
| Hyundai | Elentra,Creta |
|---------------------|------------------------------|
| Hyundai | Creta,Grand i10,Verna |
|---------------------|------------------------------|
| Maruti | Eritga,S-cross,Vitara Brezza|
|---------------------|------------------------------|
| Maruti | Celerio,Eritga,Ciaz |
|---------------------|------------------------------|
I want a data frame like this:
|---------------------|---------|--------|--------------|--------|---------|
| Brand | Model0 | Model1 | Model2 | Model3 | Model4 |
|---------------------|---------|--------|--------------|--------|---------|
| Hyundai | Elentra | Creta | Grand i10 | Verna | null |
|---------------------|---------|--------|--------------|--------|---------|
| Maruti | Ertiga | S-Cross| Vitara Brezza| Celerio| Ciaz |
|---------------------|---------|--------|--------------|--------|---------|
I have used this code:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Brand", StringType()),
    StructField("Model", StringType())])
tempCSV = spark.read.csv("PATH\\Cars.csv", sep='|', schema=schema)
tempDF = tempCSV.select(
"Brand",
f.split("Model", ",").alias("Model"),
f.posexplode(f.split("Model", ",")).alias("pos", "val")
)\
.drop("val")\
.select(
"Brand",
f.concat(f.lit("Model"),f.col("pos").cast("string")).alias("name"),
f.expr("Model[pos]").alias("val")
)\
.groupBy("Brand").pivot("name").agg(f.first("val")).toPandas()
But I'm not getting the desired result. Instead of the second table, it's giving:
|---------------------|---------|--------|--------------|
| Brand | Model0 | Model1 | Model2 |
|---------------------|---------|--------|--------------|
| Hyundai | Elentra | Creta | Grand i10 |
|---------------------|---------|--------|--------------|
| Maruti | Ertiga | S-Cross| Vitara Brezza|
|---------------------|---------|--------|--------------|
Thanks in advance.
This is happening because you are pivoting the data on pos, which has repeated values within the same Brand group.
You can use row_number() and then pivot your data to generate the desired result.
Here is sample code on top of the data you have provided.
from pyspark.sql import functions as f

df = sqlContext.createDataFrame([("Hyundai", "Elentra,Creta"), ("Hyundai", "Creta,Grand i10,Verna"), ("Maruti", "Eritga,S-cross,Vitara Brezza"), ("Maruti", "Celerio,Eritga,Ciaz")], ("Brand", "Model"))
tmpDf = df.select("Brand",f.split("Model", ",").alias("Model"),f.posexplode(f.split("Model", ",")).alias("pos", "val"))
tmpDf.createOrReplaceTempView("tbl")
seqDf = sqlContext.sql("select Brand, Model, pos, val, row_number() over(partition by Brand order by pos) as rnk from tbl")
seqDf.groupBy('Brand').pivot('rnk').agg(f.first('val'))
This will generate the following result.
+-------+-------+-------+-------+---------+-------------+----+
| Brand| 1| 2| 3| 4| 5| 6|
+-------+-------+-------+-------+---------+-------------+----+
| Maruti| Eritga|Celerio|S-cross| Eritga|Vitara Brezza|Ciaz|
|Hyundai|Elentra| Creta| Creta|Grand i10| Verna|null|
+-------+-------+-------+-------+---------+-------------+----+

Scala -- Conditional replace column value of a data frame

DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2.
Transfer is the big category; e-transfer and IMT are subcategories.
The logic is: for the same ID (31898), if both Transfer and e-Transfer are tagged to it, it should only be e-Transfer; if Transfer, IMT, and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or IMT is tagged to an ID (34193), it should just be e-Transfer or IMT.
I'm new to Scala and don't know how to write a good function to do this. Please help!
DataFrame 1 DataFrame 2
+---------+-------------+ +---------+------------------+
| ID | Category | | ID | Category |
+---------+-------------+ +---------+------------------+
| 31898 | Transfer | | 31898 | e-Transfer |
| 31898 | e-Transfer | | 32614 | e-Transfer + IMT|
| 32614 | Transfer | =====> | 33987 | Other |
| 32614 | e-Transfer | =====> | 34193 | e-Transfer |
| 32614 | IMT | +---------+------------------+
| 33987 | Transfer |
| 34193 | e-Transfer |
+---------+-------------+
You can group the DataFrame by ID to aggregate Category using collect_set to assemble arrays of categories, and create a new column based on content in the category arrays using array_contains:
import org.apache.spark.sql.functions._
val df = Seq(
(31898, "Transfer"),
(31898, "e-Transfer"),
(32614, "Transfer"),
(32614, "e-Transfer"),
(32614, "IMT"),
(33987, "Transfer"),
(34193, "e-Transfer")
).toDF("ID", "Category")
df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
withColumn( "Category",
when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
"e-Transfer + IMT").otherwise(
when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
"e-Transfer").otherwise(
when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("MIT"),
$"CategorySet"(0)).otherwise(
when($"CategorySet" === Array("Transfer"), "Other")
)))
).
show(false)
// +-----+---------------------------+----------------+
// |ID |CategorySet |Category |
// +-----+---------------------------+----------------+
// |33987|[Transfer] |Other |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer] |e-Transfer |
// |31898|[Transfer, e-Transfer] |e-Transfer |
// +-----+---------------------------+----------------+
Your sample data might not have covered all cases (e.g. [Transfer, IMT]). The existing sample code would generate a null Category value for any remaining cases. Simply modify or expand the conditional checks if additional cases are identified.
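As a hedged example of such an expansion (the [Transfer, IMT] -> "IMT" rule below is my own assumption, not something stated in the question), the nested conditions can also be flattened into a single when chain:
import org.apache.spark.sql.functions._

// Sketch only, reusing the df defined above; the [Transfer, IMT] branch is an assumed rule.
df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn("Category",
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").
    when(array_contains($"CategorySet", "IMT") && array_contains($"CategorySet", "Transfer"),
      "IMT"). // assumed handling for [Transfer, IMT]
    when(size($"CategorySet") === 1 && array_contains($"CategorySet", "Transfer"),
      "Other").
    otherwise($"CategorySet"(0))). // a lone e-Transfer or lone IMT keeps its own value
  show(false)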

Missing spark partition column in partition table

I am creating a partitioned parquet file in HDFS with a datasource.
The datasource looks like:
scala> sqlContext.sql("select * from parquetFile").show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| 0|LSKG5GC19BA210794|
| 0|LSKG5GC15BA210372|
| 0|LSKG5GC18BA210107|
| 0|LSKG4GC16BA211971|
| 0|LSKG4GC19BA210233|
| 0|LSKG5GC17BA210017|
| 0|LSKG4GC19BA211785|
| 0|LSKG4GC15BA210004|
| 0|LSKG4GC12BA211739|
| 0|LSKG4GC18BA210238|
| 0|LSKG4GC13BA210261|
| 0|LSKG5GC16BA210106|
| 0|LSKG4GC1XBA210287|
| 0|LSKG4GC10BA210265|
| 0|LSKG5GC10CA210118|
| 0|LSKG5GC16BA212289|
| 0|LSKG5GC1XBA211016|
| 0|LSKG5GC15CA210194|
| 0|LSKG5GC12CA210119|
| 0|LSKG4GC19BA211379|
+--------+-----------------+
I create the partition with the following commands (I did this in the spark shell):
scala>val df1 = sqlContext.sql("select * from parquetFile where area_tag=0 ")
scala>df1.write.parquet("/tmp/test_table3/area_tag=0")
scala>val p1 = sqlContext.read.parquet("/tmp/test_table3")
When I print the data by loading from the partitioned table, it shows:
scala> p1.show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| |LSKG5GC19BA210794|
| |LSKG5GC15BA210372|
| |LSKG5GC18BA210107|
| |LSKG4GC16BA211971|
| |LSKG4GC19BA210233|
| |LSKG5GC17BA210017|
| |LSKG4GC19BA211785|
| |LSKG4GC15BA210004|
| |LSKG4GC12BA211739|
| |LSKG4GC18BA210238|
| |LSKG4GC13BA210261|
| |LSKG5GC16BA210106|
| |LSKG4GC1XBA210287|
| |LSKG4GC10BA210265|
| |LSKG5GC10CA210118|
| |LSKG5GC16BA212289|
| |LSKG5GC1XBA211016|
| |LSKG5GC15CA210194|
| |LSKG5GC12CA210119|
| |LSKG4GC19BA211379|
+--------+-----------------+
only showing top 20 rows
The partition column is missing. What happened to the column? Is it a bug?
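For reference, a common way to avoid hand-building the partition directory is to let Spark write it with partitionBy and rely on partition discovery on read. The sketch below is only an illustration of that approach, not necessarily the resolution of the issue above, and /tmp/test_table3_partitioned is a placeholder path:
scala> val src = sqlContext.sql("select * from parquetFile")
scala> src.write.partitionBy("area_tag").parquet("/tmp/test_table3_partitioned")
scala> val p2 = sqlContext.read.parquet("/tmp/test_table3_partitioned")
scala> p2.printSchema()
With this layout, area_tag should come back as a partition column inferred from the area_tag=<value> directory names.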