PySpark analyse execution time of queries

I use a Docker image with a Jupyter/PySpark notebook and run different queries on a huge dataset. I use both RDDs and DataFrames, and I would like to analyse the execution time of various queries. These queries might either be nested inside some function
def get_rdd_pair(rdd):
    rdd = (rdd.map(lambda x: (x[0], x[1]))
              .flatMapValues(lambda x: x))
    return rdd
or like so:
df = df.select(df.column1, explode(df.column2))
I hope you get the idea. I am now looking for a way to reliably measure the total execution time. I have tried writing a decorator combined with the time module. The problem is that this only works for functions like get_rdd_pair(rdd), and these are called a lot (once for each line) if I use something like
rdd = rdd.map(get_rdd_pair)
So this did not work at all. Any ideas? I have heard of SparkMeasure, but it seems rather involved to get it running with Docker and might not be worth the effort.

SparkSession.time() is not available in PySpark.
Instead, import time and measure the action yourself:
import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")

Related

Pyspark Dataframe count taking too long

So we have a PySpark DataFrame which has around 25k records. We are trying to perform a count/empty check on this and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us understand what we are doing wrong?
Note: There are no duplicate values in df, and we have done multiple joins to form the df.
Without looking at the df.explain() output it's challenging to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the other partitions to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
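For example, these can be set at runtime from a PySpark session (a minimal sketch; adaptive execution and its skew-join handling are available in Spark 3.0+):
# Enable Adaptive Query Execution and its skew-join handling (Spark 3.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")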
Count is not taking too long. It's taking the time it needs to complete what you asked Spark to do. To refine what it's doing, you should do things you are likely already doing: filter the data before joining so only the critical data is transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source [data/tables] to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from only one [table/data source] (and apply the limit to all others) and find the table that is the source of your problem. Then study that [table/data source] and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
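As a concrete illustration of that limit-based diagnosis (only a sketch, with hypothetical table names and join key):
# Hypothetical tables and key, just to illustrate the approach.
orders = spark.table("orders").limit(1000)
customers = spark.table("customers").limit(1000)

small = orders.join(customers, "customer_id")
small.count()  # should return quickly if the join logic itself is fine

# Next iteration: drop the limit on one source at a time, e.g.
# orders = spark.table("orders")               # full table
# customers = spark.table("customers").limit(1000)
# and re-run the count to see which source blows up the runtime.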
(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;
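If you are working from a PySpark session rather than a SQL client, the same statement can be issued through spark.sql (using the mytable name from the example above):
# Refresh table statistics so the optimizer can plan the joins better.
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")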

How to improve performance PySpark Pandas loop in Databricks

I have what I assume should be a very parallelisable problem, but I cannot seem to make it work. I am using Azure Databricks, with the 10.4 LTS ML runtime.
I have a dataset which contains sets of test results. It is around 4 GB in size and contains around 20,000 tests. Each test result contains around 5,000-10,000 data points, which form the shape of several peaks (the number of peaks and their location is different for each test). For a single test, I want to remove the space between the peaks and separate them out into different dataframes. I have some pseudocode attached here, which is applied to a single test's results:
def peak_finder(pandas_dataframe):
    # code to find peaks
    return list_of_Pandas_dataframes
In this function, I use pyspark.pandas.DataFrame.truncate, which I am unsure how to replicate in pure PySpark. I return a list of PySpark Pandas dataframes, each of which contains one peak from the test. Running this on a single dataframe takes around 0.02 seconds.
My problem is then applying this to the other 20,000 tests in the entire set. This is my current method:
# list_test_ids = list of all test ids
# all_tests = the full dataset
all_peaks = []
for single_test_id in list_test_ids:
    single_test = all_tests.where(col("TestId") == single_test_id)
    single_test = single_test.toPandas()
    peaks = peak_finder(single_test)
    all_peaks.extend(peaks)
This is incredibly slow, or causes an out-of-memory error (I've already increased the size of the driver). I think the use of .toPandas() is partly to blame, as it seems slow. Generally, though, this seems like an incredibly parallelisable problem which is currently not parallel.
My questions:
On large sets like this, should I ever use PySpark Pandas? Is it good practice to always use the regular API?
I feel like using a loop here is a mistake. However, I do not know what to replace it with. Using a map or a forEach seems more appropriate, but I can't see how I could make this work with my peak_finder function or with PySpark Pandas.
For problems like this, where I am trying to manipulate a 4 GB dataset, what worker/driver configuration do you recommend? Perhaps my current choice is not suitable?
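For what it's worth, one direction that avoids the driver-side loop is a grouped-map transform, where Spark calls a pandas function once per TestId group on the executors. This is only a sketch under assumptions: the output schema and column names are hypothetical, and it assumes peak_finder can be adapted to take and return plain pandas DataFrames (inside applyInPandas you work with pandas, not pyspark.pandas).
import pandas as pd

# Hypothetical output schema: adjust to the real columns of all_tests.
output_schema = "TestId long, x double, y double, peak_id int"

def split_peaks(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of a single test as a plain pandas DataFrame.
    # Reuse the existing peak_finder, but tag each peak with an index
    # instead of returning a list of separate dataframes.
    peaks = peak_finder(pdf)          # assumed to return pandas DataFrames
    for i, peak in enumerate(peaks):
        peak["peak_id"] = i
    return pd.concat(peaks)

all_peaks_df = all_tests.groupBy("TestId").applyInPandas(split_peaks, schema=output_schema)
This keeps the per-test logic in pandas but lets Spark distribute the ~20,000 groups across the workers, so nothing has to be collected to the driver.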

Spark grouped map UDF in Scala

I am trying to write some code that would allow me to compute some action on a group of rows of a dataframe. In PySpark, this is possible by defining a Pandas UDF of type GROUPED_MAP. However, in Scala, I only found a way to create custom aggregators (UDAFs) or classic UDFs.
My temporary solution is to generate a list of keys that would encode my groups which would allow me to filter the dataframe and perform my action for each subset of dataframe. However, this approach is not optimal and very slow.
The actions are performed sequentially, thus taking a lot of time. I could parallelize the loop, but I'm not sure this would show any improvement since Spark is already distributed.
Is there any better way to do what I want?
Edit: Tried parallelizing using Futures but there was no speed improvement, as expected
To the best of my knowledge, this is something that's not possible in Scala. Depending on what you want, I think there could be other ways of applying a transformation to a group of rows in Spark / Scala:
Do a groupBy(...).agg(collect_list(<column_names>)), and use a UDF that operates on the array of values. If desired, you can use a select statement with explode(<array_column>) to revert to the original format
Try rewriting what you want to achieve using window functions. You can add a new column with an aggregate expression like so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'columnName symbol syntax

val w = Window.partitionBy('group)
val result = spark.range(100)
  .withColumn("group", pmod('id, lit(3)))
  .withColumn("group_sum", sum('id).over(w))

pyspark data pipeline use intermediary results

In PySpark, I'm doing successive operations on dataframes and would like to get outputs from intermediate results. It always takes the same time, though, and I'm wondering if it ever caches anything. Asked differently: what's best practice for using intermediary results? In Dask you can do dd.compute(df.amount.max(), df.amount.min()), which will figure out what needs to be cached and computed. Is there an equivalent in PySpark?
In the example below, when it gets to print(), will it execute 3x?
df_purchase = spark.read.parquet("s3a://example/location")[['col1','col2']]
df_orders = df_purchase.groupby(['col1']).agg(pyspark.sql.functions.first("col2")).withColumnRenamed("first(col2, false)", "col2")
df_orders_clean = df_orders.dropna(subset=['col2'])
print(df_purchase.count(), df_orders.count(), df_orders_clean.count())
Yes, each time you run an action on the DAG, Spark executes and optimizes the full query again.
By default, Spark caches nothing.
Be careful when caching; a cache can interfere in a negative way: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
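If you do want an intermediate result reused across several actions, the usual pattern is to cache it explicitly before those actions; a minimal sketch based on the example above:
# cache() is lazy: df_orders is materialized at its first action below,
# and the cached data is then reused by df_orders_clean.count().
df_orders.cache()

print(df_purchase.count(), df_orders.count(), df_orders_clean.count())

# Free the memory once the intermediate result is no longer needed.
df_orders.unpersist()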

Spark Repeating `DataFrame` Processing Work?

I am using Apache Spark (1.6) for an ML task and I noticed that Spark seems to be repeating processing on a single DataFrame.
My code looks something like this:
val df1 = sqlContext.read.parquet("data.parquet")
val df2 = df1.withColumn("new", explode(expensiveTextProcessing($"text")))
println(df2.count)
... (no changes to df2)
println(df2.count)
So I know that my withColumn is a transformation and count is an action, so the count will seem like the longer operation.
However, I noticed that the second time I run df2.count it takes just as long as the first df2.count. Additionally, an NLP tool I am using throws a few warnings during expensiveTextProcessing, and these warnings show up during both of the count calls.
Is Spark doing all of the expensiveTextProcessing each time I use the data in df2?
(for more context you can see my actual Jupyter Notebook here)
A DataFrame, like an RDD, has a lineage that is used to rebuild the result each time an action is called. When you call count, the transformations are re-executed and the results from all executors are collected to the driver. You can check the DAG representation and stages of the DataFrame in the Spark Web UI, as well as the duration and locality of the tasks that carry out the transformations.