Filtering empty partitions in RDD - scala

Is there a way to filter out empty partitions in an RDD? I have some empty partitions after partitioning and I can't use them in the action method.
I use Apache Spark in Scala

This is my sample data
val sc = spark.sparkContext
val myDataFrame = spark.range(20).toDF("mycol").repartition($"mycol")
myDataFrame.show(false)
Output :
+-----+
|mycol|
+-----+
|19 |
|0 |
|7 |
|6 |
|9 |
|17 |
|5 |
|1 |
|10 |
|3 |
|12 |
|8 |
|11 |
|2 |
|4 |
|13 |
|18 |
|14 |
|15 |
|16 |
+-----+
In the above code, when you repartition on a column, 200 partitions are created because spark.sql.shuffle.partitions = 200, and most of them are unused or empty since the data is just 20 numbers (we are trying to fit 20 numbers into 200 partitions, so most of the partitions are empty... :-)).
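For reference, a quick sketch to confirm both numbers on the same myDataFrame (using the spark session from above):
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
println(s"spark.sql.shuffle.partitions : $shufflePartitions")
println(s"partitions after repartition : ${myDataFrame.rdd.getNumPartitions}")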
1) Prepare a long accumulator variable to quickly count the non-empty partitions.
2) Count every non-empty partition in the accumulator variable, as in the example below.
val nonEmptyPartitions = sc.longAccumulator("nonEmptyPartitions")
myDataFrame.foreachPartition(partition =>
  if (partition.length > 0) nonEmptyPartitions.add(1))
3) Coalesce the DataFrame down to the non-empty partition count (coalesce means less shuffle / minimum shuffle), which drops the empty partitions.
4) Print the counts.
val finalDf = myDataFrame.coalesce(nonEmptyPartitions.value.toInt)
println(s"nonEmptyPart : ${nonEmptyPartitions.value.toInt}")
println(s"df.rdd.partitions.length : ${myDataFrame.rdd.getNumPartitions}")
println(s"finalDf.rdd.partitions.length : ${finalDf.rdd.getNumPartitions}")
Result :
nonEmptyPart : 20
df.rdd.partitions.length : 200
finalDf.rdd.partitions.length : 20
Proof that only the non-empty partitions hold data (i.e. the rest of the 200 partitions are empty):
myDataFrame.withColumn("partitionId", org.apache.spark.sql.functions.spark_partition_id)
.groupBy("partitionId")
.count
.show
Result (partition-wise record count):
+-----------+-----+
|partitionId|count|
+-----------+-----+
|128 |1 |
|190 |1 |
|140 |1 |
|164 |1 |
|5 |1 |
|154 |1 |
|112 |1 |
|107 |1 |
|4 |1 |
|49 |1 |
|69 |1 |
|77 |1 |
|45 |1 |
|121 |1 |
|143 |1 |
|58 |1 |
|11 |1 |
|150 |1 |
|68 |1 |
|116 |1 |
+-----------+-----+
Note :
spark_partition_id is used here for demo/debug purposes only, not for production.
I reduced the 200 partitions (created by the repartition on a column) to 20 non-empty partitions.
Conclusion :
Finally, you got rid of the extra empty partitions that hold no data and avoided scheduling unnecessary dummy tasks for them.

From the little info you provide, I can think of two options. Use mapPartitions and simply return the empty iterators as they are, while working on the non-empty ones:
rdd.mapPartitions { iter => if (iter.isEmpty) iter else { ??? } }
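For example, a minimal sketch that fills in the ??? placeholder, assuming an RDD[Int] and an arbitrary doubling transformation:
val transformed = rdd.mapPartitions { iter =>
  if (iter.isEmpty) iter else iter.map(_ * 2)   // empty partitions pass through untouched
}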
Or you can use repartition to get rid of the empty partitions:
rdd.repartition(10) // or any proper number

If you don't know the number of distinct values within the column and wish to avoid empty partitions, you can use countApproxDistinct():
df.repartition(df.rdd.countApproxDistinct().toInt)
If you wish to filter out the existing empty partitions and then repartition, you can use the solution suggested by Sasa,
OR:
df.repartition(df.mapPartitions(part => List(part.length).iterator).collect().count(_ != 0))
However, in the latter case the partitions may or may not contain records grouped by value.
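Putting that together, a sketch (assuming the same DataFrame df) that repartitions by the approximate distinct count and then inspects the per-partition sizes:
val approxDistinct = df.rdd.countApproxDistinct().toInt
val repartitioned = df.repartition(approxDistinct)
repartitioned.rdd
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
  .foreach { case (idx, size) => println(s"partition $idx -> $size rows") }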

Related

Loop through large dataframe in Pyspark - alternative

df_hrrchy
|lefId |Lineage |
|-------|--------------------------------------|
|36326 |["36326","36465","36976","36091","82"]|
|36121 |["36121","36908","36976","36091","82"]|
|36380 |["36380","36465","36976","36091","82"]|
|36448 |["36448","36465","36976","36091","82"]|
|36683 |["36683","36465","36976","36091","82"]|
|36949 |["36949","36908","36976","36091","82"]|
|37349 |["37349","36908","36976","36091","82"]|
|37026 |["37026","36908","36976","36091","82"]|
|36879 |["36879","36465","36976","36091","82"]|
df_trans
|tranID | T_Id |
|-----------|-------------------------------------------------------------------------|
|1000540 |["36121","36326","37349","36949","36380","37026","36448","36683","36879"]|
df_creds
|T_Id |T_val |T_Goal |Parent_T_Id |Parent_Val |parent_Goal|
|-------|-------|-------|---------------|----------------|-----------|
|36448 |100 |1 |36465 |200 |1 |
|36465 |200 |1 |36976 |300 |2 |
|36326 |90 |1 |36465 |200 |1 |
|36091 |500 |19 |82 |600 |4 |
|36121 |90 |1 |36908 |200 |1 |
|36683 |90 |1 |36465 |200 |1 |
|36908 |200 |1 |36976 |300 |2 |
|36949 |90 |1 |36908 |200 |1 |
|36976 |300 |2 |36091 |500 |19 |
|37026 |90 |1 |36908 |200 |1 |
|37349 |100 |1 |36908 |200 |1 |
|36879 |90 |1 |36465 |200 |1 |
|36380 |90 |1 |36465 |200 |1 |
Desired Result
|T_id |children                                 |T_Val|T_Goal|parent_T_id|parent_Goal|trans_id|
|-----|-----------------------------------------|-----|------|-----------|-----------|--------|
|36091|["36976"]                                |500  |19    |82         |4          |1000540 |
|36465|["36448","36326","36683","36879","36380"]|200  |1     |36976      |2          |1000540 |
|36908|["36121","36949","37026","37349"]        |200  |1     |36976      |2          |1000540 |
|36976|["36465","36908"]                        |300  |2     |36091      |19         |1000540 |
|36683|null                                     |90   |1     |36465      |1          |1000540 |
|37026|null                                     |90   |1     |36908      |1          |1000540 |
|36448|null                                     |100  |1     |36465      |1          |1000540 |
|36949|null                                     |90   |1     |36908      |1          |1000540 |
|36326|null                                     |90   |1     |36465      |1          |1000540 |
|36380|null                                     |90   |1     |36465      |1          |1000540 |
|36879|null                                     |90   |1     |36465      |1          |1000540 |
|36121|null                                     |90   |1     |36908      |1          |1000540 |
|37349|null                                     |100  |1     |36908      |1          |1000540 |
Code Tried
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list,array_contains, lit
from functools import reduce
for row in df_transactions.rdd.toLocalIterator():
    # def find_nodemap(row):
    dfs = []
    df_hy_set = (df_hrrchy.filter(df_hrrchy.lefId.isin(row["T_ds"]))
                 .select(explode("Lineage").alias("Terrs"))
                 .agg(collect_set(col("Terrs")).alias("hierarchy_list"))
                 .select(F.lit(row["trans_id"]).alias("trans_id"), "hierarchy_list")
                 )
    df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                    .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                    .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                    )
    df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                       .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                       )
    df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
                  .select("A.T_id", "B.children", "A.T_val", "A.terr_Goal", "A.parent_T_id", "A.parent_Goal", "A.trans_id")
                  )
    display(df_nodemap)
    # dfs.append(df_nodemap)
    # df = reduce(DataFrame.union, dfs)
    # display(df)
My problem: this is a bad design. df_trans has millions of rows, and looping through the dataframe row by row takes forever. Can I do this without looping? I tried a couple of other methods but could not get the desired result.
You certainly need to process the entire DataFrame in batch, not iterate row by row.
The key point is to "reverse" df_hrrchy, i.e. from the parent lineage obtain the list of children for every T_Id:
val df_children = df_hrrchy.withColumn("children", slice($"Lineage", lit(1), size($"Lineage") - 1))
  .withColumn("parents", slice($"Lineage", 2, 999999))
  .select(explode(arrays_zip($"children", $"parents")).as("rels"))
  .distinct
  .groupBy($"rels.parents".as("T_Id"))
  .agg(collect_set($"rels.children").as("children"))
df_children.show(false)
+-----+-----------------------------------+
|T_Id |children |
+-----+-----------------------------------+
|36091|[36976] |
|36465|[36448, 36380, 36326, 36879, 36683]|
|36976|[36465, 36908] |
|82 |[36091] |
|36908|[36949, 37349, 36121, 37026] |
+-----+-----------------------------------+
then expand the list of T_Ids in df_trans and also include all T_Ids from the hierarchy:
val df_trans_map = df_trans.withColumn("T_Id", explode($"T_Id"))
  .join(df_hrrchy, array_contains($"Lineage", $"T_Id"))
  .select($"tranID", explode($"Lineage").as("T_Id"))
  .distinct
df_trans_map.show(false)
+-------+-----+
|tranID |T_Id |
+-------+-----+
|1000540|36976|
|1000540|82 |
|1000540|36091|
|1000540|36465|
|1000540|36326|
|1000540|36121|
|1000540|36908|
|1000540|36380|
|1000540|36448|
|1000540|36683|
|1000540|36949|
|1000540|37349|
|1000540|37026|
|1000540|36879|
+-------+-----+
With this, it is just a simple join to obtain the final result:
df_trans_map.join(df_creds, Seq("T_Id"))
  .join(df_children, Seq("T_Id"), "left_outer")
  .show(false)
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|T_Id |tranID |T_val|T_Goal|Parent_T_Id|Parent_Val|parent_Goal|children |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|36976|1000540|300 |2 |36091 |500 |19 |[36465, 36908] |
|36091|1000540|500 |19 |82 |600 |4 |[36976] |
|36465|1000540|200 |1 |36976 |300 |2 |[36448, 36380, 36326, 36879, 36683]|
|36326|1000540|90 |1 |36465 |200 |1 |null |
|36121|1000540|90 |1 |36908 |200 |1 |null |
|36908|1000540|200 |1 |36976 |300 |2 |[36949, 37349, 36121, 37026] |
|36380|1000540|90 |1 |36465 |200 |1 |null |
|36448|1000540|100 |1 |36465 |200 |1 |null |
|36683|1000540|90 |1 |36465 |200 |1 |null |
|36949|1000540|90 |1 |36908 |200 |1 |null |
|37349|1000540|100 |1 |36908 |200 |1 |null |
|37026|1000540|90 |1 |36908 |200 |1 |null |
|36879|1000540|90 |1 |36465 |200 |1 |null |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
You need to re-write this to use the full cluster; using a localIterator means that you aren't fully utilizing the cluster for shared work.
The code below was not run, as you didn't provide a workable data set to test with. If you do, I'll run the code to make sure it's sound.
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list,array_contains, lit
from functools import reduce
# uses explode -- I know this will create a lot of short-lived records, but the flip side is
# that it will use the entire cluster to complete the work instead of the driver.
df_trans_expld = df_trans.select(df_trans.tranID, explode(df_trans.T_Id).alias("T_Id"))

# uses explode
df_hrrchy_expld = df_hrrchy.select(df_hrrchy.lefId, explode(df_hrrchy.Lineage).alias("Lineage"))

# uses the exploded data to join, which is the same as a filter.
df_hy_set = (df_trans_expld.join(df_hrrchy_expld, df_hrrchy_expld.lefId == df_trans_expld.T_Id, "left")
             .groupBy("tranID")
             .agg(collect_set(col("Lineage")).alias("hierarchy_list"))
             .select(col("tranID").alias("trans_id"), "hierarchy_list"))

# logic unchanged from here down
df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                )
df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                   .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                   )
df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
              .select("A.T_id", "B.children", "A.T_val", "A.terr_Goal", "A.parent_T_id", "A.parent_Goal", "A.trans_id")
              )
# no need to append/union data as it's now just one dataframe: df_nodemap
I'd have to look into this more, but I'm pretty sure you are pulling all the data through the driver with your existing code, which will really slow things down; this approach will make use of all the executors to complete the work.
There may be another optimization to get rid of the array_contains (and use a join instead). I'd have to look at the explain plan to see if you could get even more performance out of it; I don't remember off the top of my head. You are avoiding a shuffle, so it may be better as is.
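If you want to experiment with that, here is a rough sketch of the explode-and-join idea in Scala (mirroring the column names assumed above, not tested): explode hierarchy_list so the array-membership test becomes an ordinary equi-join.
import org.apache.spark.sql.functions.{col, explode}

val df_hy_exploded = df_hy_set
  .select(col("trans_id"), explode(col("hierarchy_list")).as("T_id"))

// plain equi-join on T_id instead of array_contains
val df_filter_creds = df_creds.join(df_hy_exploded, Seq("T_id"))
  .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")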

spark sql max function not producing right value

I'm trying to find the max of a column grouped by spark partition id. I'm getting the wrong value when applying the max function though. Here is the code:
val partitionCol = uuid()
val localRankCol = "test"
df = df.withColumn(partitionCol, spark_partition_id)
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs:_*)
val rankDF = df.withColumn(localRankCol, dense_rank().over(windowSpec))
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)
sortExprs is applying an ascending sort on sales.
And the result with some dummy data is (partitionCol is 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard |US |100.0|1 |0 |1 |
|Rambo |US |100.0|1 |0 |1 |
|Die Hard |AU |200.0|1 |0 |2 |
|House of Cards|EU |400.0|1 |0 |3 |
|Summer Break |US |400.0|1 |0 |3 |
|Rambo |EU |100.0|1 |1 |1 |
|Summer Break |APAC |200.0|1 |1 |2 |
|Rambo |APAC |300.0|1 |1 |3 |
|House of Cards|US |500.0|1 |1 |4 |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5 |
+---------+
"test" column has a max value of 4 but 5 is being returned.

Delete values lower than cummax on multiple spark dataframe columns in scala

I have a data frame as shown below. There are more than 100 signals, so there will be more than 100 columns in the data frame.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |3 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |2 |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |1 |3 |1 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |3 |3 |null |
|052|2021-03-06 |2 |null |2 |
|052|2021-03-16 |3 |5 |5 |.......
+-------------------------------------------+
I have to find the cumulative max (cummax) of each signal, compare it with the respective signal column, and delete the signal values that are lower than the cummax, as well as the null values.
step1. Find the cumulative max for each signal with respect to the id column.
step2. Delete the records that have a value lower than the cummax for each signal.
step3. Take the count of records that have a cummax less than the signal value (nulls excluded), for each signal, with respect to id.
After the count the final output should be as shown below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|.....
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 | 3 | 1 |
|050|2021-01-15 |null | null | 2 |
|050|2021-02-02 |2 | 3 | 3 |
|
|051|2021-01-14 |1 | 3 | 0 |
|051|2021-01-15 |2 | null | null |
|051|2021-02-02 |3 | 3 | 2 |
|051|2021-02-03 |null | 3 | null |
|
|052|2021-03-03 |1 | 3 | 0 |
|052|2021-03-05 |3 | 3 | null |
|052|2021-03-06 |null | null | 2 |
|052|2021-03-16 |3 | 5 | 5 | ......
+----------------+--------+--------+--------+
I have tried using a window function as below, and it worked for almost all records.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}
import scala.collection.mutable.ListBuffer

val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column]()
signalList01.append(col("id"), col("date"))
for (column <- signalColumns) {
  // Apply the max aggregate (which ignores nulls) over the running window for each signal column
  signalList01 += (col(column), max(column).over(w).alias(column + "_cummax"))
}
val cumMaxDf = df.select(signalList01: _*)
But I am getting erroneous values for a few records, as shown below.
Any idea how these error records end up in the cummax column? Any leads are appreciated!
Just giving out hints here (as you suggested) to help you unblock the situation, but --WARNING-- I haven't tested the code!
The code you provided in the comments looks good. It'll get you your cummax column:
val nw_df = original_df.withColumn("signal01_cummax", max(col("signal01")).over(windowCodedSO))
Now, you need to be able to compare the two values in "signal01" and "signal01_cummax". A function like this, maybe:
def takeOutRecordsLessThanCummax(signal: java.lang.Integer, signal_cummax: java.lang.Integer): java.lang.Integer =
  if (signal == null || signal < signal_cummax) null
  else signal_cummax
Since we'll be applying it to columns, we'll wrap it up in a UDF:
val takeOutRecordsLessThanCummaxUDF: UserDefinedFunction = udf {
  (i: java.lang.Integer, j: java.lang.Integer) => takeOutRecordsLessThanCummax(i, j)
}
And then you can combine everything above so it can be applied to your original dataframe. Something like this could work:
val signal_cummax_suffix = "_cummax"
val result = original_df.columns.foldLeft(original_df)(
  (dfac, colname) => dfac
    .withColumn(colname.concat(signal_cummax_suffix),
      max(col(colname)).over(windowCodedSO))
    .withColumn(colname.concat("output"),
      takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)
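As a side note, the same null-out logic can usually be expressed without a UDF by combining when with the running max directly; a sketch assuming the window w, the df, and the signalColumns list from the question:
import org.apache.spark.sql.functions.{col, max, when}

// keep a value only when it is not below its running max; anything below it (and nulls) becomes null
val cleaned = signalColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) >= max(col(c)).over(w), col(c)))
}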

Select max common Date from differents DataFrames (Scala Spark)

I have different dataframes and I want to select the max common Date across them. For example, I have the following dataframes:
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2012-12-21 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
The selected date would be 2016-09-02 because it is the max date that exists in all 3 DFs (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but this way I just get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can do semi joins to get the common dates and then aggregate the maximum date. There is no need to group by Date because you just want its maximum.
val result = df1.join(df2, Seq("Date"), "left_semi").join(df3, Seq("Date"), "left_semi").agg(max("Date"))
You can also use intersect:
val result = df1.select("Date").intersect(df2.select("Date")).intersect(df3.select("Date")).agg(max("Date"))
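If there are more than three dataframes, the same intersect idea can be folded over a sequence (a sketch assuming each dataframe has a Date column):
val dfs = Seq(df1, df2, df3) // add as many as needed
val result = dfs.map(_.select("Date")).reduce(_ intersect _).agg(max("Date"))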

Data loss after writing in spark

I obtain a resultant dataframe after performing some computations. Say the dataframe is result. When I write it to Amazon S3, there are specific cells which show up blank. The top 5 rows of my result dataframe are:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 | |1 | | |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 | |2 | | |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 | |5 | | |
---------------------------------------------------------
But when I run the result.show() command, I am able to see the values:
_________________________________________________________
|var30 |var31 |var32 |var33 |var34 |var35 |var36|
--------------------------------------------------------
|-0.00586|0.13821 |0 |2 |1 |1 |6 |
|3.87635 |2.86702 |2.51963 |8 |11 |2 |14 |
|3.78279 |2.54833 |2.45881 |2 |2 |2 |12 |
|-0.10092|0 |0 |1 |1 |3 |1 |
|8.08797 |6.14486 |5.25718 |20 |5 |5 |34 |
---------------------------------------------------------
Also, the blanks appear in the same cells every time I run it.
Use this to save the data to your S3:
DataFrame.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("s3n://Yourpath")
For anyone who might have come across this issue, I can tell you what worked for me.
I was joining one data frame (let's say inputDF) with another df (deltaDF) based on some logic and storing the result in an output data frame (outDF). I was getting the same error, whereby I could see a record in outDF.show(), but while writing this dataFrame into a Hive table or persisting outDF (using outDF.persist(StorageLevel.MEMORY_AND_DISK)) I wasn't able to see that particular record.
SOLUTION: I persisted the inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the outDF.show() output was consistent with the Hive table where outDF was written.
P.S.: I am not sure how this solved the issue. It would be awesome if someone could explain it, but the above worked for me.
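A minimal sketch of the sequence described above (the dataframe names, join key, and table name are just placeholders):
import org.apache.spark.storage.StorageLevel

val persistedInput = inputDF.persist(StorageLevel.MEMORY_AND_DISK) // persist before the join
val outDF = persistedInput.join(deltaDF, Seq("id"), "left")        // hypothetical join key
outDF.show()
outDF.write.mode("overwrite").saveAsTable("my_db.my_table")        // hypothetical Hive table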