I am using Spark 2.4.1, to figure out some ratios on my data frame.
Where I need to find different ratio factors of ratios, different columns in given data frame(df_data) by joining to meta dataframe (i.e. resDs).
I am getting these ratio factors (i.e. ratio_1_factor, ratio_2_factor & ratio_3_factor) by using three different joins with different join conditions i.e. joinedDs , joinedDs2, joinedDs3
Is there any other alternative to reduce the number of joins ?? make it work optimum?
You can find the entire sample data in the below public URL.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3521103084252405/7035720262824085/latest.html
How to handle multi-steps instead of single step in when clause:
.withColumn("step_1_ratio_1", (col("ratio_1").minus(lit(0.00000123))).cast(DataTypes.DoubleType)) // step-2
.withColumn("step_2_ratio_1", (col("step_1_ratio_1").multiply(lit(0.02))).cast(DataTypes.DoubleType)) //step-3
.withColumn("step_3_ratio_1", (col("step_2_ratio_1").divide(col("step_1_ratio_1"))).cast(DataTypes.DoubleType)) //step-4
.withColumn("ratio_1_factor", (col("ratio_1_factor")).cast(DataTypes.DoubleType)) //step-5
i.e. "ratio_1_factor" calucated based on various other columns in the dataframe , df_data
these steps -2,3,4 , are being used in other ratio_factors calculation too. i.e. ratio_2_factor, ratio_2_factor
how this should be handled ?
You can join one time and calculate ratio_1_factor, ratio_2_factor and ratio_3_factor columns using max and when function in aggregation :
val joinedDs = df_data.as("aa")
.join(
broadcast(resDs.as("bb")),
col("aa.g_date").between(col("bb.start_date"), col("bb.end_date"))
)
.groupBy("item_id", "g_date", "ratio_1", "ratio_2", "ratio_3")
.agg(
max(when(
col("aa.ratio_1").between(col("bb.A"), col("bb.A_lead")),
col("ratio_1").multiply(lit(0.1))
)
).cast(DoubleType).as("ratio_1_factor"),
max(when(
col("aa.ratio_2").between(col("bb.A"), col("bb.A_lead")),
col("ratio_2").multiply(lit(0.2))
)
).cast(DoubleType).as("ratio_2_factor"),
max(when(
col("aa.ratio_3").between(col("bb.A"), col("bb.A_lead")),
col("ratio_3").multiply(lit(0.3))
)
).cast(DoubleType).as("ratio_3_factor")
)
joinedDs.show(false)
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|item_id|g_date |ratio_1 |ratio_2 |ratio_3 |ratio_1_factor |ratio_2_factor|ratio_3_factor|
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|50312 |2016-01-04|0.0456646|0.046899415|0.046000415|0.0045664600000000005|0.009379883 |0.0138001245 |
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
Related
The following code is intended to do this operation, however, does not work. It still joins on both columns yielding two rows when I should only get one row.
df = df.join(
msa,
(msa.msa_join_key == df.metro_join_key)
| (
(msa.msa_join_key != df.metro_join_key)
& (msa.msa_join_key == df.state_join_key)
),
"left",
)
This code should work, so I am not sure why it does not. I've seen similar questions that also cannot solve this given that they too yield additional rows. pyspark - join with OR condition
Is this a known bug in pyspark? Is there a different way to do this join?
I have serious difficulties in understanding why I cannot run a transform which, after waiting so many minutes (sometimes hours), returns the error "Serialized Results too large".
In the transform I have a list of dates that I am iterating in a for loop to proceed with delta calculations within specific time intervals.
The expected dataset is the union of the iterated datasets and should contain 450k rows, not too many, but i have a lot of computing stages, tasks and attempts!!
The profile is already set a Medium profile and i can't scale on other profile and i can't set maxResultSize = 0.
Example of code:
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataframe([], schema)
df_date = []
for date in Date_list:
tmp = df.filter(between [date, date-7days]).withColumn('example', F.lit(date))
........
df2 = df.join(tmp, 'column', 'inner').......
df_date += [df2]
df_total = df_total.unionByName(union_many(*df_date))
return df_total
Don't pay attention to the syntax. This is just an example to show that there are a series of operations inside the loop. My desidered output is an dataframe which contains the dataframe of each iteration!
Thank you!!
Initial Theory
You are hitting a known limitation of Spark, similar to the findings discussed over here.
However, there are ways to work around this by re-thinking your implementation to instead be a series of dispatched instructions describing the batches of data you wish to operate on, similar to how you create your tmp DataFrame.
This may unfortunately require quite a bit more work to re-think your logic in this way since you'll want to imagine your manipulations purely as a series of column manipulation commands given to PySpark instead of row-by-row manipulations. There are some operations you cannot do purely using PySpark calls, so this isn't always possible. In general it's worth thinking through very carefully.
Concretely
As an example, your data range calculations are possible to perform purely in PySpark and will be substantially faster if you do this operations over many years or other increased scale. Instead of using Python list comprehension or other logic, we instead use column manipulations on a small set of initial data to build up our ranges.
I've written up some example code here on how you can create your date batches, this should let you perform a join to create your tmp DataFrame, after which you can describe the types of operations you wish to do to it.
Code to create date ranges (start and end dates of each week of the year):
from pyspark.sql import types as T, functions as F, SparkSession, Window
from datetime import date
spark = SparkSession.builder.getOrCreate()
year_marker_schema = T.StructType([
T.StructField("max_year", T.IntegerType(), False),
])
year_marker_data = [
{"max_year": 2022}
]
year_marker_df = spark.createDataFrame(year_marker_data, year_marker_schema)
year_marker_df.show()
"""
+--------+
|max_year|
+--------+
| 2022|
+--------+
"""
previous_week_window = Window.partitionBy(F.col("start_year")).orderBy("start_week_index")
year_marker_df = year_marker_df.select(
(F.col("max_year") - 1).alias("start_year"),
"*"
).select(
F.to_date(F.col("max_year").cast(T.StringType()), "yyyy").alias("max_year_date"),
F.to_date(F.col("start_year").cast(T.StringType()), "yyyy").alias("start_year_date"),
"*"
).select(
F.datediff(F.col("max_year_date"), F.col("start_year_date")).alias("days_between"),
"*"
).select(
F.floor(F.col("days_between") / 7).alias("weeks_between"),
"*"
).select(
F.sequence(F.lit(0), F.col("weeks_between")).alias("week_indices"),
"*"
).select(
F.explode(F.col("week_indices")).alias("start_week_index"),
"*"
).select(
F.lead(F.col("start_week_index"), 1).over(previous_week_window).alias("end_week_index"),
"*"
).select(
((F.col("start_week_index") * 7) + 1).alias("start_day"),
((F.col("end_week_index") * 7) + 1).alias("end_day"),
"*"
).select(
F.concat_ws(
"-",
F.col("start_year"),
F.col("start_day").cast(T.StringType())
).alias("start_day_string"),
F.concat_ws(
"-",
F.col("start_year"),
F.col("end_day").cast(T.StringType())
).alias("end_day_string"),
"*"
).select(
F.to_date(
F.col("start_day_string"),
"yyyy-D"
).alias("start_date"),
F.to_date(
F.col("end_day_string"),
"yyyy-D"
).alias("end_date"),
"*"
)
year_marker_df.drop(
"max_year",
"start_year",
"weeks_between",
"days_between",
"week_indices",
"max_year_date",
"start_day_string",
"end_day_string",
"start_day",
"end_day",
"start_week_index",
"end_week_index",
"start_year_date"
).show()
"""
+----------+----------+
|start_date| end_date|
+----------+----------+
|2021-01-01|2021-01-08|
|2021-01-08|2021-01-15|
|2021-01-15|2021-01-22|
|2021-01-22|2021-01-29|
|2021-01-29|2021-02-05|
|2021-02-05|2021-02-12|
|2021-02-12|2021-02-19|
|2021-02-19|2021-02-26|
|2021-02-26|2021-03-05|
|2021-03-05|2021-03-12|
|2021-03-12|2021-03-19|
|2021-03-19|2021-03-26|
|2021-03-26|2021-04-02|
|2021-04-02|2021-04-09|
|2021-04-09|2021-04-16|
|2021-04-16|2021-04-23|
|2021-04-23|2021-04-30|
|2021-04-30|2021-05-07|
|2021-05-07|2021-05-14|
|2021-05-14|2021-05-21|
+----------+----------+
only showing top 20 rows
"""
Potential Optimizations
Once you have this code and if you are unable to express your work through joins / column derivations alone and are forced to perform the operation with the union_many, you may consider using Spark's localCheckpoint feature on your df2 result. This will allow Spark to simply calculate the resultant DataFrame and not add its query plan onto the result you will push to your df_total. This could be paired with the cache to also keep the resultant DataFrame in memory, but this will depend on your data scale.
localCheckpoint and cache are useful to avoid re-computing the same DataFrames many times over and truncating the amount of query planning done on top of your intermediate DataFrames.
You'll likely find localCheckpoint and cache can be useful on your df DataFrame as well since it will be used many times over in your loop (assuming you are unable to re-work your logic to use SQL-based operations and instead are still forced to use the loop).
As a quick and dirty summary of when to use each:
Use localCheckpoint on a DataFrame that was complex to compute and is going to be used in operations later. Oftentimes these are the nodes feeding into unions
Use cache on a DataFrame that is going to be used many times later. This often is a DataFrame sitting outside of a for/while loop that will be called in the loop
All Together
Your initial code
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataframe([], schema)
df_date = []
for date in Date_list:
tmp = df.filter(between [date, date-7days]).withColumn('example', F.lit(date))
........
df2 = df.join(tmp, 'column', 'inner').......
df_date += [df2]
df_total = df_total.unionByName(union_many(*df_date))
return df_total
Should now look like:
# year_marker_df as derived in my code above
year_marker_df = year_marker_df.cache()
df = df.join(year_marker_df, df.my_date_column between year_marker_df.start_date, year_marker_df.end_date)
# Other work previously in your for_loop, resulting in df_total
return df_total
Or, if you are unable to re-work your inner loop operations, you can do some optimizations like:
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataframe([], schema)
df_date = []
df = df.cache()
for date in Date_list:
tmp = df.filter(between [date, date-7days]).withColumn('example', F.lit(date))
........
df2 = df.join(tmp, 'column', 'inner').......
df2 = df2.localCheckpoint()
df_date += [df2]
df_total = df_total.unionByName(union_many(*df_date))
return df_total
I have a spark dataframe as below.
val df = Seq(("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),("b",4,200),("b",4,100),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
I am working in scala language to make use of this data frame and trying to get result as shown below.
val df = Seq(("a",1,1400),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",4,250),("b",4,200),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
Rules: Grouped by id, if min(hierarchy)==1 then I take the row with the highest amount and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of the amount. On the other hand, if min(hierarchy)==2 then I take two rows with the highest amount and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of the amount. And so on for all the id's in the data.
Thanks for the suggestions..
You may use window functions to generate the criteria which you will filter upon eg
val results = df.withColumn("minh",min("hierarchy").over(Window.partitionBy("id")))
.withColumn("rnk",rank().over(Window.partitionBy("id").orderBy(col("amount").desc())))
.withColumn(
"rn4",
when(col("hierarchy")>=4, row_number().over(
Window.partitionBy("id",when(col("hierarchy")>=4,1).otherwise(0)).orderBy(col("amount").desc())
) ).otherwise(5)
)
.filter("rnk <= minh or rn4 <=3")
.select("id","hierarchy","amount")
NB. More verbose filter .filter("(rnk <= minh or rn4 <=3) and (minh in (1,2))")
Above temporary columns generated by window functions to assist in the filtering criteria are
minh : used to determine the minimum hierarchy for a group id and subsequently select the top minh number of columns from the group .
rnk used to determine the rows with the highest amount in each group
rn4 used to determine the rows with the highest amount in each group with hierarchy >=4
There is some table with duplicated rows. I am trying to reduce duplicates and stay with latest my_date (if there are
rows with same my_date it is no matter which one to use)
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But after this code I when grab distinct my_id I get about 3000 less than in source table. What a reason can be?
how to debug this situation?
After doing drop duplicates do a except of this data frame with the original data frame this should give some insight on the rows which are additionally getting dropped . Most probably there are certain null or empty values for those columns which are being considered duplicates.
Data
I want to apply groupby for column1 and want to calculate the percentage of passed and failed percentage for each 1 and as well count
Example ouput I am looking for
Using pyspark I am doing the below code but I am only getting the percentage
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
for level in levels]
df = sparkSession.read.json(hdfsPath)
result1 = df1.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages in the counts into a single column.
from pyspark.sql.functions import avg, col, count, concat, lit
levels = ["passed", "failed","blocked"]
# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double")*100).alias('{}_pct'.format(level))
for level in levels]
# count aggregations
count_exprs = [sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
concat(
col('{}_pct'.format(level)).cast('string'),
lit('('),
col('{}_count'.format(level)).cast('string'),
lit(')')
).alias('{}_viz'.format(level))
for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = (
df1
.select('Column1','Column2')
.groupBy("Column1")
.agg(*exprs)
.select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.