I have serious difficulty understanding why I cannot run a transform which, after waiting many minutes (sometimes hours), returns the error "Serialized Results too large".
In the transform I have a list of dates that I iterate over in a for loop to perform delta calculations within specific time intervals.
The expected dataset is the union of the iterated datasets and should contain 450k rows, which is not too many, but I end up with a huge number of compute stages, tasks and attempts!
The profile is already set to a Medium profile; I can't scale to another profile, and I can't set maxResultSize = 0.
Example of code:
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataFrame([], schema)
df_date = []

for date in Date_list:
    tmp = df.filter(between [date, date - 7 days]).withColumn('example', F.lit(date))
    ........
    df2 = df.join(tmp, 'column', 'inner').......
    df_date += [df2]

df_total = df_total.unionByName(union_many(*df_date))
return df_total
Don't pay attention to the syntax; this is just an example to show that there is a series of operations inside the loop. My desired output is a DataFrame which contains the DataFrame from each iteration!
Thank you!!
Initial Theory
You are hitting a known limitation of Spark, similar to the findings discussed over here.
However, there are ways to work around this by re-thinking your implementation to instead be a series of dispatched instructions describing the batches of data you wish to operate on, similar to how you create your tmp DataFrame.
This may unfortunately require quite a bit more work to re-think your logic in this way since you'll want to imagine your manipulations purely as a series of column manipulation commands given to PySpark instead of row-by-row manipulations. There are some operations you cannot do purely using PySpark calls, so this isn't always possible. In general it's worth thinking through very carefully.
Concretely
As an example, your date range calculations can be performed purely in PySpark and will be substantially faster if you run these operations over many years or at otherwise increased scale. Instead of using Python list comprehensions or other driver-side logic, we use column manipulations on a small set of initial data to build up our ranges.
I've written up some example code below on how you can create your date batches; this should let you perform a join to create your tmp DataFrame, after which you can describe the types of operations you wish to apply to it.
Code to create date ranges (start and end dates of each week of the year):
from pyspark.sql import types as T, functions as F, SparkSession, Window
from datetime import date
spark = SparkSession.builder.getOrCreate()
year_marker_schema = T.StructType([
    T.StructField("max_year", T.IntegerType(), False),
])

year_marker_data = [
    {"max_year": 2022}
]

year_marker_df = spark.createDataFrame(year_marker_data, year_marker_schema)
year_marker_df.show()
"""
+--------+
|max_year|
+--------+
| 2022|
+--------+
"""
previous_week_window = Window.partitionBy(F.col("start_year")).orderBy("start_week_index")
year_marker_df = year_marker_df.select(
    (F.col("max_year") - 1).alias("start_year"),
    "*"
).select(
    F.to_date(F.col("max_year").cast(T.StringType()), "yyyy").alias("max_year_date"),
    F.to_date(F.col("start_year").cast(T.StringType()), "yyyy").alias("start_year_date"),
    "*"
).select(
    F.datediff(F.col("max_year_date"), F.col("start_year_date")).alias("days_between"),
    "*"
).select(
    F.floor(F.col("days_between") / 7).alias("weeks_between"),
    "*"
).select(
    F.sequence(F.lit(0), F.col("weeks_between")).alias("week_indices"),
    "*"
).select(
    F.explode(F.col("week_indices")).alias("start_week_index"),
    "*"
).select(
    F.lead(F.col("start_week_index"), 1).over(previous_week_window).alias("end_week_index"),
    "*"
).select(
    ((F.col("start_week_index") * 7) + 1).alias("start_day"),
    ((F.col("end_week_index") * 7) + 1).alias("end_day"),
    "*"
).select(
    F.concat_ws(
        "-",
        F.col("start_year"),
        F.col("start_day").cast(T.StringType())
    ).alias("start_day_string"),
    F.concat_ws(
        "-",
        F.col("start_year"),
        F.col("end_day").cast(T.StringType())
    ).alias("end_day_string"),
    "*"
).select(
    F.to_date(
        F.col("start_day_string"),
        "yyyy-D"
    ).alias("start_date"),
    F.to_date(
        F.col("end_day_string"),
        "yyyy-D"
    ).alias("end_date"),
    "*"
)

year_marker_df.drop(
    "max_year",
    "start_year",
    "weeks_between",
    "days_between",
    "week_indices",
    "max_year_date",
    "start_day_string",
    "end_day_string",
    "start_day",
    "end_day",
    "start_week_index",
    "end_week_index",
    "start_year_date"
).show()
"""
+----------+----------+
|start_date| end_date|
+----------+----------+
|2021-01-01|2021-01-08|
|2021-01-08|2021-01-15|
|2021-01-15|2021-01-22|
|2021-01-22|2021-01-29|
|2021-01-29|2021-02-05|
|2021-02-05|2021-02-12|
|2021-02-12|2021-02-19|
|2021-02-19|2021-02-26|
|2021-02-26|2021-03-05|
|2021-03-05|2021-03-12|
|2021-03-12|2021-03-19|
|2021-03-19|2021-03-26|
|2021-03-26|2021-04-02|
|2021-04-02|2021-04-09|
|2021-04-09|2021-04-16|
|2021-04-16|2021-04-23|
|2021-04-23|2021-04-30|
|2021-04-30|2021-05-07|
|2021-05-07|2021-05-14|
|2021-05-14|2021-05-21|
+----------+----------+
only showing top 20 rows
"""
Potential Optimizations
Once you have this code, and if you are unable to express your work through joins / column derivations alone and are forced to perform the operation with the union_many, you may consider using Spark's localCheckpoint feature on your df2 result. This lets Spark compute the resultant DataFrame without carrying its full query plan into the result you push onto df_total. It could be paired with cache to also keep the resultant DataFrame in memory, but whether that helps will depend on your data scale.
localCheckpoint and cache are useful for avoiding re-computation of the same DataFrames many times over and for truncating the amount of query planning done on top of your intermediate DataFrames.
You'll likely find localCheckpoint and cache can be useful on your df DataFrame as well since it will be used many times over in your loop (assuming you are unable to re-work your logic to use SQL-based operations and instead are still forced to use the loop).
As a quick and dirty summary of when to use each:
Use localCheckpoint on a DataFrame that was complex to compute and is going to be used in later operations. Oftentimes these are the nodes feeding into unions.
Use cache on a DataFrame that is going to be used many times later. This is often a DataFrame sitting outside of a for/while loop that will be referenced inside the loop (see the short sketch below).
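To make the difference concrete, here is a minimal, self-contained sketch (the toy DataFrame and column names are invented for illustration, not taken from the question): localCheckpoint materializes the rows and truncates the query plan, while cache keeps the lineage but reuses the computed result.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for an expensive intermediate result
base_df = spark.range(1000).withColumn("value", F.col("id") * 2)

# cache keeps the computed rows around for reuse, but the full lineage remains in the plan
cached_df = base_df.cache()
cached_df.explain()

# localCheckpoint materializes the rows on the executors and truncates the plan,
# so downstream consumers (e.g. a big union) no longer carry this DataFrame's history
checkpointed_df = base_df.localCheckpoint()
checkpointed_df.explain()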
All Together
Your initial code
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataFrame([], schema)
df_date = []

for date in Date_list:
    tmp = df.filter(between [date, date - 7 days]).withColumn('example', F.lit(date))
    ........
    df2 = df.join(tmp, 'column', 'inner').......
    df_date += [df2]

df_total = df_total.unionByName(union_many(*df_date))
return df_total
Should now look like:
# year_marker_df as derived in my code above
year_marker_df = year_marker_df.cache()
df = df.join(year_marker_df, df.my_date_column.between(year_marker_df.start_date, year_marker_df.end_date))
# Other work previously in your for_loop, resulting in df_total
return df_total
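As a small runnable illustration of that range join (the toy df and the my_date_column name are placeholders, just as in the pseudocode above), the 'example' column that the original loop added with F.lit(date) can now be derived straight from the joined start_date:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for your df; 'my_date_column' is a placeholder name
df = spark.createDataFrame(
    [("a", "2021-01-03"), ("b", "2021-02-10")],
    ["column", "my_date_column"],
).withColumn("my_date_column", F.to_date("my_date_column"))

# year_marker_df as derived above, with one (start_date, end_date) row per week
joined_df = df.join(
    year_marker_df.select("start_date", "end_date"),
    on=F.col("my_date_column").between(F.col("start_date"), F.col("end_date")),
    how="inner",
)

# Equivalent of .withColumn('example', F.lit(date)) in the original loop:
# each row now carries the start of its week directly
joined_df = joined_df.withColumn("example", F.col("start_date"))
From here, the per-week delta logic can be expressed with groupBy or window operations over start_date rather than a Python loop.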
Or, if you are unable to re-work your inner loop operations, you can do some optimizations like:
Date_list = [All weeks from: '2021-01-01', to: '2022-01-01'] --> ~50 elements
df_total = spark.createDataFrame([], schema)
df_date = []

df = df.cache()
for date in Date_list:
    tmp = df.filter(between [date, date - 7 days]).withColumn('example', F.lit(date))
    ........
    df2 = df.join(tmp, 'column', 'inner').......
    df2 = df2.localCheckpoint()
    df_date += [df2]

df_total = df_total.unionByName(union_many(*df_date))
return df_total
Related
I'm using the following code:
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to go over each DataFrame row and add an event header computed by the generate_event function:
def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)

class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a df with few items (even with 1,000,000 items), but when we have more than 6 million items the AWS Glue job fails.
Note: the RDD API seems to behave better, but I can't use it because I have a problem with dates < 1900-01-01 (issue).
Is there a way to chunk the DataFrame and consolidate at the end?
The best solution we can envision is to use Spark's built-in features, like adding new columns using the struct and create_map functions...
from pyspark.sql import functions as f

events_df = (
    df
    .withColumn(
        "header",
        f.create_map(
            f.lit("id"),
            f.lit(1)
        )
    )
    ...
So we can create as many columns as we need and apply transformations to get the required header structure.
PS: this solution (adding new columns to the DataFrame rather than iterating over it) avoids using the RDD API and brings a big advantage in terms of performance!
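For instance, a richer header can be built with struct instead of (or alongside) create_map, which also allows mixed field types; the toy DataFrame, column names and header fields below are invented for illustration:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for df; the real columns come from your delta rows
df = spark.createDataFrame([(10, "2021-05-01"), (20, "2021-06-01")], ["amount", "day"])

events_df = df.withColumn(
    "header",
    F.struct(
        F.lit(1).alias("id"),                       # constant header field
        F.current_timestamp().alias("created_at"),  # illustrative extra field
    ),
).withColumn(
    # the original row's columns become the event payload, mirroring Row(Data=delta_row)
    "payload",
    F.struct("amount", "day"),
)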
I am new to PySpark and am struggling with the following problem:
I have a DataFrame named date_range:
date_range = spark.sql("SELECT sequence(date_add(current_date, -2), current_date(), interval 1 day) as date").withColumn("date", F.explode(F.col("date")))
display(date_range)
date
2021-11-13
2021-11-14
2021-11-15
I need to loop through each date in this df to subset another df named df_base like below:
for i in range(0, date_range.count()):
    d = date_range.collect()[i]
    df_temp = df_base.where("DPTR_DATE < 'd'")
    display(df_temp)
However, it returns the msg:
Query returned no results
df_base should look like something below:
First Last DPTR_DATE
Daisy Feng 2021-08-03
Daisy Feng 2021-11-09
Daisy Feng 2021-04-27
Daisy Feng 2021-08-16
The code I used to generate df_base:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, DateType

data = [('Daisy', 'Feng', "2021-08-03"),
        ('Daisy', 'Feng', "2021-11-09"),
        ('Daisy', 'Feng', "2021-04-27"),
        ('Daisy', 'Feng', "2021-08-16")]

schema = StructType([
    StructField('First', StringType()),
    StructField('Last', StringType()),
    StructField('DPTR_DATE', StringType())
])

df_base = spark.createDataFrame(data=data, schema=schema)
df_base = df_base.withColumn('DPTR_DATE', col('DPTR_DATE').cast(DateType()))
Since each date in DPTR_DATE is less than all three dates in date_range, I expect df_temp to be displayed three times, but it looks like nothing is returned.
Any help will be appreciated.
thanks,
Daisy
There are a few issues with your code.
Your SQL query string inside .where() is not correct: you are comparing DPTR_DATE with the literal string 'd' rather than with the date from the date_range DataFrame. You should change the query inside where to
"DPTR_DATE < '{}'".format(d)
d inside the loop needs to be changed as follows:
date_range.collect()[i][0]
(or, even better, date_range.select("date").collect()[i][0]). Try printing the string query "DPTR_DATE < '{}'".format(d) to see why d needs to be modified like this and why your original definition will not work.
Also note that you are overwriting df_temp with each iteration of the loop, but you probably want to union the dataframes at each iteration.
Below is the modified version of the code based on these comments.
df = spark.createDataFrame([], schema=df_base.schema)

for i in range(0, date_range.count()):
    d = date_range.collect()[i][0]
    df_temp = df_base.where("DPTR_DATE < '{}'".format(d))
    df = df.union(df_temp)
This will produce the combined df containing the union of all three per-date subsets.
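If you do keep the loop, one further improvement (a sketch reusing the df_base and date_range DataFrames defined above) is to collect the dates once up front rather than calling collect() on every iteration, and to union the subsets at the end:
from functools import reduce

# Collect the dates once, instead of re-collecting inside every loop iteration
dates = [row[0] for row in date_range.select("date").collect()]

# Build one filtered subset per date, then union them all at the end
subsets = [df_base.where("DPTR_DATE < '{}'".format(d)) for d in dates]
df = reduce(lambda left, right: left.union(right), subsets)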
Data
I want to group by Column1 and calculate the percentage of passed and failed for each value, as well as the counts.
Example output I am looking for:
Using PySpark I am running the code below, but I am only getting the percentages:
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
for level in levels]
df = sparkSession.read.json(hdfsPath)
result1 = df1.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages and the counts into a single column.
from pyspark.sql.functions import avg, col, concat, lit, sum as spark_sum

levels = ["passed", "failed", "blocked"]

# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double") * 100).alias('{}_pct'.format(level))
             for level in levels]
# count aggregations
count_exprs = [spark_sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
               for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs

# string formatting select expressions
select_exprs = [
    concat(
        col('{}_pct'.format(level)).cast('string'),
        lit('('),
        col('{}_count'.format(level)).cast('string'),
        lit(')')
    ).alias('{}_viz'.format(level))
    for level in levels
]

df = sparkSession.read.json(hdfsPath)
result1 = (
    df
    .select('Column1', 'Column2')
    .groupBy("Column1")
    .agg(*exprs)
    .select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.
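On that last point, pulling a small aggregated result down to the driver for presentation is straightforward (a sketch, assuming the result1 DataFrame from the code above and that pandas is available on the driver):
# Convert the small aggregated result to a local pandas DataFrame and format it there
result_pdf = result1.toPandas()
print(result_pdf.to_string(index=False))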
I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs

// some operations in here to create df from df_in with two more columns "terms1" and "terms2"

val intersectUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => { seq1 intersect seq2 } ) // intersects two sequences
val symmDiffUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) // computes the symmetric difference of two sequences

val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
             .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
             .where( size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2 )
             .cache()
          ) // add the intersection and difference columns and filter the resulting DF

df1.show()
df1.count()
The app works properly and fast up to the show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full creation of df1 and that df1.count() should then be very fast. What am I missing here? Why is count() so slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would also take very long, but in this case all your operations are map operations, so there's no need to compute the whole final table. However, count needs to physically go through the whole table in order to count it, and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition; then it should take long as well.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and running count again.
show() by default displays only 20 rows. If the first partition returns more than 20 rows, the remaining partitions will not be executed.
Note that show has several variations. If you request many more rows, for example show(1000000), many more (possibly all) partitions will have to be executed and it may take more time. So show() equals show(20), which is a partial action.
I have a requirement to validate an ingest operation. Basically, I have two big files within HDFS: one is Avro formatted (the ingested files), the other is Parquet formatted (the consolidated file).
The Avro file has this schema:
filename, date, count, afield1, afield2, afield3, afield4, afield5, afield6, ..., afieldN
The Parquet file has this schema:
fileName, anotherField1, anotherField1, anotherField2, anotherField3, anotherField14, ..., anotherFieldN
If I try to load both files into DataFrames and then use a naive join-where, the job on my local machine takes more than 24 hours, which is unacceptable!
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
Which is the best way to achieve this? Dropping columns from the DataFrames before doing the join-where-count? Calculating the counts per DataFrame and then joining and summing?
PS:
I was reading about the map-side-join technique, but it looks like it would only work for me if there were a small file able to fit in RAM, and I can't guarantee that, so I would like to know which is the preferred way from the community to achieve this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping the data down to only the field I'm interested in (filename), and building a distinct set of filenames tagged with the source they come from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count