Tell PySpark about known ordering

I have a huge dataset that includes a row_id column. I know that row_id is monotonically increasing and that the data partitions are ordered by it.
Now I want to shift a column using the lag() function:
window = sql.Window.partitionBy().orderBy('row_id')
df = df.withColumn('shifted_my_value', F.lag(df.my_value).over(window))
However, PySpark is not aware that the data is already ordered by row_id and wants to analyze and reshuffle the entire dataset (although I know that this would be a no-op).
Is there a way to tell Spark that my data is already ordered by row_id, so that it just shifts the values in my_value without trying to shuffle?

One option could be to order the window by a constant instead of a real column.
Note that PySpark resolves a string passed to orderBy() as a column name, so 'select 1' or '(select 1)' would not be parsed as SQL. A literal column is needed instead:
window = sql.Window.partitionBy().orderBy(F.lit(1))
df = df.withColumn('shifted_my_value', F.lag(df.my_value).over(window))
With a constant sort key Spark no longer sorts by row_id, but the resulting row order is not guaranteed, and an empty partitionBy() still moves all rows into a single partition.
In SQL it could work like this:
df.createOrReplaceTempView('df')
spark.sql('select LAG(df.my_value) OVER (order by (select 1)) FROM df')
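If the partitions really are ordered by row_id, the shift can in principle be done partition by partition, carrying only the last value of each partition over to the next one. Below is a minimal plain-Python sketch of that idea, not a Spark API: the partitions are modelled as lists, and the function name lag_within_ordered_partitions is hypothetical. In Spark something similar might be built on per-partition processing such as RDD.mapPartitionsWithIndex, but that is an assumption I have not benchmarked.

```python
def lag_within_ordered_partitions(partitions, default=None):
    """Shift values by one row across partitions that are already globally ordered."""
    shifted = []
    carry = default  # last value seen so far, i.e. the tail of the previous partition
    for part in partitions:
        out = []
        for value in part:
            out.append(carry)  # each row receives the value of the row before it
            carry = value
        shifted.append(out)
    return shifted


# Hypothetical data: three partitions, already ordered by row_id.
partitions = [[10, 20, 30], [40, 50], [60]]
result = lag_within_ordered_partitions(partitions)
print(result)
```

Only one value per partition boundary has to travel between partitions, which is why no full shuffle would be needed in this scheme.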

Related

groupBy Id and get multiple records for multiple columns in scala

I have a spark dataframe as below.
val df = Seq(("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),("b",4,200),("b",4,100),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
I am working in Scala with this data frame and am trying to get the result shown below.
val df = Seq(("a",1,1400),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",4,250),("b",4,200),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
Rules: group by id. If min(hierarchy) == 1, take the single row with the highest amount, then for each hierarchy >= 4 take the top 3 rows in descending order of amount. If min(hierarchy) == 2, take the two rows with the highest amount, then again take the top 3 rows of each hierarchy >= 4 in descending order of amount. And so on for all the ids in the data.
Thanks for any suggestions.
You may use window functions to generate helper columns and then filter on them, e.g.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val results = df.withColumn("minh", min("hierarchy").over(Window.partitionBy("id")))
  .withColumn("rnk", rank().over(Window.partitionBy("id").orderBy(col("amount").desc())))
  .withColumn(
    "rn4",
    // number the rows inside each (id, hierarchy) group so the top 3 are kept per hierarchy level
    when(col("hierarchy") >= 4, row_number().over(
      Window.partitionBy("id", "hierarchy").orderBy(col("amount").desc())
    )).otherwise(5)
  )
  .filter("rnk <= minh or rn4 <= 3")
  .select("id", "hierarchy", "amount")
NB. A more restrictive filter that only handles groups whose minimum hierarchy is 1 or 2: .filter("(rnk <= minh or rn4 <= 3) and (minh in (1,2))")
The temporary columns generated by the window functions to support the filter are:
minh: the minimum hierarchy of each id group; it also determines how many top rows (by amount) are selected from the group.
rnk: the rank of each row by amount, descending, within its id group.
rn4: the row number by amount, descending, within each (id, hierarchy) group, computed only for rows with hierarchy >= 4; all other rows get the placeholder value 5, so the rn4 <= 3 test never selects them.
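To sanity-check any implementation against the expected output in the question, the rules can be restated in a few lines of plain Python. This is only a sketch of my reading of the rules: it assumes the "top min(hierarchy) rows" are taken from the minimum hierarchy level itself, and the function name select_rows and its parameters are hypothetical.

```python
from collections import defaultdict

rows = [
    ("a", 1, 1400), ("a", 1, 1250), ("a", 2, 1200), ("a", 4, 1250),
    ("a", 4, 1200), ("a", 4, 1100), ("b", 2, 2500), ("b", 2, 1250),
    ("b", 2, 500), ("b", 4, 250), ("b", 4, 200), ("b", 4, 100),
    ("b", 4, 100), ("b", 5, 800),
]


def select_rows(rows, per_level=3, threshold=4):
    """Apply the selection rules to (id, hierarchy, amount) tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    selected = []
    for key in sorted(groups):
        group = groups[key]
        min_h = min(h for _, h, _ in group)
        by_level = defaultdict(list)
        for row in group:
            by_level[row[1]].append(row)
        for level in sorted(by_level):
            ranked = sorted(by_level[level], key=lambda r: r[2], reverse=True)
            if level >= threshold:
                # keep the top rows of every hierarchy level at or above the threshold
                selected.extend(ranked[:per_level])
            elif level == min_h:
                # assumption: the top min_h rows come from the minimum hierarchy level
                selected.extend(ranked[:min_h])
    return selected


expected = [
    ("a", 1, 1400), ("a", 4, 1250), ("a", 4, 1200), ("a", 4, 1100),
    ("b", 2, 2500), ("b", 2, 1250), ("b", 4, 250), ("b", 4, 200),
    ("b", 4, 100), ("b", 5, 800),
]
print(select_rows(rows) == expected)
```

On the sample data this reproduces the ten rows listed as the desired result, including (b,4,100) and (b,5,800), which shows that the "top 3" has to be counted per hierarchy level.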

Joining two clustered tables in spark dataset seems to end up with full shuffle

I have two clustered (bucketed) Hive tables, t1 and t2:
CREATE EXTERNAL TABLE `t1`(
`t1_req_id` string,
...
PARTITIONED BY (`t1_stats_date` string)
CLUSTERED BY (t1_req_id) INTO 1000 BUCKETS
// t2 looks similar with same amount of buckets
The insert part happens in hive
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table `t1` partition(t1_stats_date,t1_stats_hour)
select *
from t1_raw
where t1_stats_date='2020-05-10' and t1_stats_hour='12' AND
t1_req_id is not null
The Spark code looks like this:
val t1 = spark.table("t1").as[T1]
val t2 = spark.table("t2").as[T2]
val outDS = t1.joinWith(t2, t1("t1_req_id") === t2("t2_req_id"), "fullouter")
.map { case (t1Obj, t2Obj) =>
val t3:T3 = // do some logic
t3
}
outDS.toDF.write....
I see a projection in the DAG, but the job still seems to do a full data shuffle.
Also, in the executor logs I don't see the same bucket of the two tables being read in one chunk, which is what I would expect to find.
There are also the spark.sql.sources.bucketing.enabled, spark.sessionState.conf.bucketingEnabled and spark.sql.join.preferSortMergeJoin flags.
What am I missing? Why is there still a full shuffle if the tables are bucketed?
The current Spark version is 2.3.1.
One thing to check is whether there is a type mismatch between the join columns, e.g. the join column is a string in T1 and a BIGINT in T2. Even if both types are integers (e.g. one is INT and the other BIGINT), Spark will still add a shuffle, because different types use different hash functions for bucketing.
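A quick way to check for such a mismatch is to compare the declared types of the two join keys. The sketch below is written in Python against plain (name, type) pairs, which is the shape that PySpark's DataFrame.dtypes returns; the helper names and the sample schemas are hypothetical, and the bigint type for t2_req_id is made up for illustration.

```python
def join_key_types(left_dtypes, right_dtypes, left_key, right_key):
    """Return the (left, right) type names of the two join keys."""
    return dict(left_dtypes)[left_key], dict(right_dtypes)[right_key]


def keys_share_type(left_dtypes, right_dtypes, left_key, right_key):
    """True when both join keys are declared with exactly the same type."""
    left_type, right_type = join_key_types(left_dtypes, right_dtypes, left_key, right_key)
    return left_type == right_type


# Hypothetical schemas in the (column name, type name) shape of DataFrame.dtypes.
t1_dtypes = [("t1_req_id", "string"), ("t1_stats_date", "string")]
t2_dtypes = [("t2_req_id", "bigint"), ("t2_stats_date", "string")]

print(join_key_types(t1_dtypes, t2_dtypes, "t1_req_id", "t2_req_id"))
print(keys_share_type(t1_dtypes, t2_dtypes, "t1_req_id", "t2_req_id"))
```

With real tables the two lists would come from something like spark.table("t1").dtypes and spark.table("t2").dtypes. If the types differ, aligning them in the table definitions is probably the first thing to try.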

How to debug Spark dropDuplicates and join function calls?

There is a table with duplicated rows. I am trying to remove the duplicates and keep the row with the latest my_date (if several rows have the same my_date, it does not matter which one is used).
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But after running this code, when I count the distinct my_id values I get about 3000 fewer than in the source table. What could be the reason?
How can I debug this situation?
After dropDuplicates, run an except between the original data frame and the resulting data frame; this should give some insight into which rows are additionally being dropped. Most probably there are null or empty values in those columns that are being treated as duplicates.
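One way nulls can make whole ids disappear in a max-then-join pattern: an aggregate max ignores nulls, and an equality join condition is never true when either side is null, so an id whose date values are all null has nothing to match. A string that cannot be cast to bigint also becomes null under Spark's default (non-ANSI) settings. Here is a small plain-Python sketch of that behaviour, with None standing in for SQL NULL; the sample rows and function names are made up.

```python
rows = [
    {"my_id": 1, "my_date_int": 10},
    {"my_id": 1, "my_date_int": 20},
    {"my_id": 2, "my_date_int": None},  # e.g. a date string that failed to cast
    {"my_id": 3, "my_date_int": 5},
]


def max_per_id(rows):
    """Group by my_id and take the max of my_date_int, ignoring None like SQL max."""
    best = {}
    for r in rows:
        current = best.setdefault(r["my_id"], None)
        value = r["my_date_int"]
        if value is not None and (current is None or value > current):
            best[r["my_id"]] = value
    return best


def inner_join_latest(rows, best):
    """Keep rows whose date equals the group max; a None never matches anything."""
    return [
        r for r in rows
        if r["my_date_int"] is not None and r["my_date_int"] == best[r["my_id"]]
    ]


best = max_per_id(rows)
output = inner_join_latest(rows, best)
# ids present in the source but missing from the result
lost_ids = {r["my_id"] for r in rows} - {r["my_id"] for r in output}
print(lost_ids)
```

Computing the same set difference of ids on the real data frames (source ids minus output ids) and then looking at the source rows for those ids should show quickly whether null dates are the cause.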

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark data frame that may contain the same data or have missing values.
The code that I wrote is very slow and does not run in a distributed way.
Does anyone know how to retain a single row of unique values from duplicate rows in a PySpark DataFrame, in a way that runs distributed and with fast processing time?
I have written complete PySpark code and it works correctly.
But the processing time is really slow and it is not possible to use it on a Spark cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname']))
                            .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating DF using the previous RDD
    # This DF stores value of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(sf.col(column)).distinct().rdd.toLocalIterator()
                          if value[column] != ""])
            if len(col_value) >= 1:
                # non null or empty value of a column store here
                # This value is a no duplicate distinct value
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress']),
                                                        col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except Exception:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills empty cells in the PySpark DF with the unique value from the duplicate rows. It could even fill them with the mode of the values.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNotNull().limit(1).take(1) to improve the speed of the code and extract values of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                & (sf.col("stdaddress") == row_value['stdaddress']),
                                                col_value)
                                .otherwise(df_dedup[column])))
    except IndexError:
        # take(1) returned no row: the column has no non-null value, so leave it unchanged
        pass
"""
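The per-row loop above launches separate Spark jobs for every duplicate group and every column, which is probably where the time goes. The underlying operation is simply "group by the key columns and keep the first non-empty value of each column". Here is a minimal plain-Python sketch of that logic; the function name fill_from_duplicates and the phone and email columns are hypothetical. In PySpark the same idea might translate to a single groupBy on stdname and stdaddress with F.first(column, ignorenulls=True) per remaining column, after turning empty strings into nulls, but I have not tested that on this data.

```python
def fill_from_duplicates(rows, key_cols):
    """Collapse rows sharing key_cols into one row, keeping the first non-empty value per column."""
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # the first row seen for a key becomes the base record
        target = merged.setdefault(key, dict(row))
        for col, value in row.items():
            if target.get(col) in (None, "") and value not in (None, ""):
                target[col] = value  # fill a gap from a later duplicate
    return list(merged.values())


# Hypothetical sample: two duplicates of the same entity, each missing a different field.
rows = [
    {"stdname": "acme", "stdaddress": "1 main st", "phone": "", "email": "a@x.org"},
    {"stdname": "acme", "stdaddress": "1 main st", "phone": "555", "email": ""},
    {"stdname": "bolt", "stdaddress": "2 side st", "phone": None, "email": None},
]
deduped = fill_from_duplicates(rows, ["stdname", "stdaddress"])
print(deduped)
```

Expressed as one grouped aggregation, the work is done in a single pass over the data instead of once per duplicate group.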

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort order when using collect_list for multiple columns, all with the same date order. I need them in the same dataframe so I can use it to create a time series model input. Below is a sample of the "train_data":
I'm using a Window with partitionBy to ensure the sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
But how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = (train_data
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
    .groupBy('Syscode_Stn')
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
         F.max('MarchMadInd').alias('MarchMadInd'))
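Why taking the max works here: over an ordered window, collect_list produces a growing prefix of the full list on each row, and when arrays are compared element by element a shorter prefix sorts before the longer list that extends it. So the max per group is the complete, ordered list. The plain-Python sketch below shows the same effect with ordinary lists, which compare in the same lexicographic way; the function name running_lists and the sample values are made up, and the claim about Spark's array ordering is my understanding rather than something verified on this data.

```python
def running_lists(values):
    """Mimic collect_list over an ordered window: one growing prefix per row."""
    return [values[: i + 1] for i in range(len(values))]


spp_imp_daily = [3, 1, 2]  # hypothetical values, already in date order
prefixes = running_lists(spp_imp_daily)
full_list = max(prefixes)  # the longest prefix wins, because every other one is its prefix
print(prefixes)
print(full_list)
```

The same reasoning applies to each collected column independently, which is why one F.max per array column in the agg is enough.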