I am currently analyzing the execution time of a combination of queries executed with RDDs and with DataFrames in PySpark. Both take the same data and return the same result; however, the DataFrame version is almost 30% faster.
I have read in many places that PySpark DataFrames are superior, but I want to find out why. I came across the Catalyst Optimizer, which is used by DataFrames but not by RDDs. To check the extent of Catalyst's impact, I would like to disable it completely and then compare the computation times again.
Is there a way to do this? I found some guides for Scala, but nothing for Python.
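For reference, a simplified sketch of the kind of comparison I am running (the column names, the aggregation, and the input path here are just illustrative; only the DataFrame branch goes through Catalyst):

from time import perf_counter

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rdd-vs-df-timing").getOrCreate()

df = spark.read.parquet("/path/to/data")   # illustrative input with columns key, value
rdd = df.rdd

# RDD version: no Catalyst involved.
start = perf_counter()
rdd_result = (rdd.map(lambda row: (row["key"], row["value"]))
                 .reduceByKey(lambda a, b: a + b)
                 .collect())
rdd_seconds = perf_counter() - start

# DataFrame version: planned by the Catalyst optimizer.
start = perf_counter()
df_result = df.groupBy("key").agg(F.sum("value")).collect()
df_seconds = perf_counter() - start

print(rdd_seconds, df_seconds)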
So we have a PySpark DataFrame with around 25k records. We are trying to perform a count/empty check on it and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to pandas and checked pandas_df.empty
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing also took quite a long time (cancelled after 20 minutes)
Could you please help us understand what we are doing wrong?
Note: there are no duplicate values in df, and we performed multiple joins to form it.
Without looking at df.explain() it's hard to pinpoint the exact issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the other partitions to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
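For example, a minimal sketch of enabling these on a session (assuming Spark 3.0+, where adaptive query execution is available):

from pyspark.sql import SparkSession

# Enable adaptive query execution and its skew-join handling at session creation.
spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())

# They can also be toggled on an existing session:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")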
Count is not taking too long; it's taking the time it needs to complete what you asked Spark to do. To refine what it's doing, you should do things you are likely already doing: filter the data before joining so only the critical data is fed into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source data/tables to 1,000 or 10,000 records and see if it runs fast. Then, one at a time, remove the limit from a single table/data source (keeping the limit on all the others) and find the table that is the source of your problem. Then study that table/data source and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
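A rough sketch of that narrowing-down approach (the table names and the join key are just placeholders):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Limit every source to a small sample first; if the job is now fast,
# the problem is in the data, not the code.
a = spark.table("table_a").limit(1000)
b = spark.table("table_b").limit(1000)
a.join(b, "id").count()

# Then remove the limit from one source at a time to find the culprit.
a_full = spark.table("table_a")
a_full.join(b, "id").count()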
(Finally, if you are using Hive tables, you should make sure the table statistics are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;
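If you are driving this from PySpark, the same statement can be issued through spark.sql (reusing the spark session from the sketches above; the column names in the second statement are placeholders):

spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS")
# Optionally collect column-level statistics as well.
spark.sql("ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS col1, col2")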
I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The primary obstacle to using the GPU appears to be the black-box nature of UDFs. An automatic solution would be the Scala UDF compiler, but it won't work with cases where there are loops.
Doubt: Would I be able to get a GPU contribution if my dataframe has only one column and the UDF produces another column from it, as this is a trivial case? If so, at least in some cases the GPU performance benefit could be attained even with no change to the Spark code, and even when the size of the data is much larger than the GPU memory. This would be great, as sometimes it would be easy to simply merge all columns into one, making a single WrappedArray column using concat_ws that a UDF can simply convert into an Array. For all practical purposes the data would then already be in columnar fashion for the GPU, and only negligible row (on CPU) to column (on GPU) conversion overhead would be needed. The case I am referring to would look like:
val newDf = df.withColumn("colB", opaqueUdf(col("colA")))
Resources: I tried to find good sources/examples to learn the Spark-based approach to using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource/tutorial that gives sample examples of converting Spark UDFs to make them RAPIDS compatible?
Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there isn't a direct expression that we can translate them to.
It all depends on how heavy opaqueUdf is and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it's doing many arithmetic or string operations successively on that column). I am not sure why you want to "merge all columns into one"; can you clarify why you want to do that? On the conversion to Array: is that the purpose of the UDF, or do you want to take in N columns -> perform some operation likely involving loops -> produce an Array?
Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are Java or Scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF document also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way; please take a look there as well.
In PySpark, I'm doing successive operations on DataFrames and would like to get outputs from the intermediate results. Each output always takes the same time, though, so I'm wondering whether anything is ever cached. Asked differently, what is the best practice for using intermediate results? In Dask you can do dd.compute(df.amount.max(), df.amount.min()), which will figure out what needs to be cached and computed. Is there an equivalent in PySpark?
In the example below, when it gets to print(), will it execute the pipeline three times?
df_purchase = spark.read.parquet("s3a://example/location")[['col1','col2']]
df_orders = df_purchase.groupby(['col1']).agg(pyspark.sql.functions.first("col2")).withColumnRenamed("first(col2, false)", "col2")
df_orders_clean = df_orders.dropna(subset=['col2'])
print(df_purchase.count(), df_orders.count(), df_orders_clean.count())
Yes, each time you run an action on the DAG, Spark optimizes and executes the full query.
By default, Spark caches nothing.
Be careful when caching; a cache can interfere in a negative way: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
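If you do want to reuse an intermediate result across several actions, the usual pattern is to cache or persist it explicitly. A minimal sketch based on the code in the question (the S3 path and column names are the question's placeholders):

from pyspark.sql import SparkSession
import pyspark.sql.functions

spark = SparkSession.builder.getOrCreate()

# Read once, keep the two needed columns, and cache so later actions reuse it.
df_purchase = spark.read.parquet("s3a://example/location")[['col1', 'col2']]
df_purchase.cache()

df_orders = (df_purchase
             .groupby(['col1'])
             .agg(pyspark.sql.functions.first("col2"))
             .withColumnRenamed("first(col2, false)", "col2"))
df_orders_clean = df_orders.dropna(subset=['col2'])

# The first count materializes the cache; the later counts reuse the cached scan
# instead of re-reading the parquet from S3 (the aggregations still run, but on
# the cached data).
print(df_purchase.count(), df_orders.count(), df_orders_clean.count())

df_purchase.unpersist()   # release the cache when done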
I'll try my best to describe my situation and then I'm hoping another user on this site can tell me if the course I'm taking makes sense or if I need to reevaluate my approach/options.
Background:
I use PySpark since I am most familiar with Python rather than Scala, Java, or R. I have a Spark dataframe that was constructed from a Hive table using pyspark.sql to query the table. In this dataframe I have many different 'files'. Each file consists of time-series data. I need to perform a rolling regression on a subset of the data, across the entire time range of each 'file'. After doing a good bit of research, I was planning on creating a window object, making a UDF that specified how I wanted my linear regression to occur (using the Spark ML linear regression inside the function), and then returning the data to the dataframe. This would happen inside the context of a .withColumn() operation. This made sense and I feel like this approach is correct. What I discovered is that PySpark currently does not support the ability to create a UDAF (see the linked JIRA). So here is what I'm currently considering doing.
It is shown here and here that it is possible to create a UDAF in Scala and then reference that function within the context of PySpark. Furthermore, it is shown here that a UDAF (written in Scala) is able to take multiple input columns (a necessary feature, since I will be doing multiple linear regression, taking in 3 parameters). What I am unsure of is the ability of my UDAF to use org.apache.spark.ml.regression, which I plan on using for my regression. If this can't be done, I could manually execute the operation using matrices (I believe, if Scala allows that). I have virtually no experience using Scala but am certainly motivated to learn enough to write this one function.
I'm wondering if anyone has insight or suggestions about this task ahead. I feel like after the research I've done, this is both possible and the appropriate course of action to take. However, I'm scared of burning a ton of time trying to make this work when it is fundamentally impossible or way more difficult than I could imagine.
Thanks for your insight.
After doing a good bit of research I was planning on creating a window object, making a UDF that specified how I wanted my linear regression to occur (using the spark ml linear regression inside the function)
This cannot work, whether or not PySpark supports UDAFs. You are not allowed to use distributed algorithms from a UDF / UDAF.
The question is a bit vague, and it is not clear how much data you have, but I'd consider using plain RDDs with scikit-learn (or a similar tool), or trying to implement the whole thing from scratch.
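A rough sketch of the plain-RDD route, assuming the rows of each 'file' fit comfortably in memory on a single executor (the column names, the window length, and the use of numpy's least squares in place of scikit-learn are all illustrative, not from the question):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative schema: one id per 'file', a timestamp, three predictors, one target.
df = spark.sql("SELECT file_id, ts, x1, x2, x3, y FROM my_hive_table")

def rolling_regression(rows, window=50):
    # Plain local (non-distributed) rolling multiple linear regression for one file.
    rows = sorted(rows, key=lambda r: r["ts"])
    X = np.array([[r["x1"], r["x2"], r["x3"]] for r in rows])
    y = np.array([r["y"] for r in rows])
    coefs = []
    for end in range(window, len(rows) + 1):
        beta, *_ = np.linalg.lstsq(X[end - window:end], y[end - window:end], rcond=None)
        coefs.append((rows[end - 1]["ts"], beta.tolist()))
    return coefs

# Group all rows of a file onto one executor and run the local regression there.
results = (df.rdd
             .groupBy(lambda r: r["file_id"])
             .mapValues(rolling_regression)
             .collect())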
I am reaching out to the community to understand the impact of coding in certain ways in Scala for Spark. I received some review comments that I feel need discussion. Coming from a traditional Java and OOP background, I am writing my opinion and questions here. I would appreciate it if you could chime in with your wisdom. I am in a Spark 1.3.0 environment.
1. Use of for loops: Is it against the rules to use for loops?
There are distributed data structures like RDDs and DataFrames in Spark. We should not be collecting them and using for loops on them, as the computation would then happen on the driver node alone. This will have adverse effects, especially if the data is large.
But if I have a utility map that stores parameters for the job, it is fine to use a for loop on it if desired. Using a for loop or a map over the iterable is a coding choice. It is important to understand that this map is different from a map over a distributed data structure; it will still happen on the driver node alone.
2. Use of var vs val
val is an immutable reference to an object and var is a mutable reference. In the example below
val driverDf = {
var df = dataLoader.loadDriverInput()
df = df.sqlContext.createDataFrame(df.rdd, df.schema)
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
Even though we have used var for df, driverDf is an immutable reference to the data frame produced by the block. This kind of use of var is perfectly fine.
Similarly the following is also fine.
var driverDf = dataLoader.loadDriverInput();
driverDf = applyTransformations (driverDf)
def applyTransformations (driverDf:DataFrame)={...}
Are there any generic rules that say vars cannot be used in a Spark environment?
3. Use of if-else vs case, and not throwing exceptions
Is it against standard practice not to throw exceptions or not to use if-else?
4. Use of HiveContext vs SQLContext
Are there any performance implications of using SQLContext vs HiveContext (I know HiveContext extends SQLContext) for the underlying Hive tables?
Is it against standards to create multiple HiveContexts in a program? My job iterates over a part of the whole data frame of values each time. The whole data frame is cached in one HiveContext. In each iteration, a data frame is created from the whole data using a new HiveContext and cached, and this cache is purged at the end of the iteration. This approach gave me performance improvements in Spark 1.3.0. Is this approach breaking any standards?
I appreciate the responses.
Regarding loops, as you correctly mentioned, you should prefer RDD map to perform operations in parallel on multiple nodes. For smaller iterables, you can go with a for loop. Again, it comes down to the driver memory and the time it takes to iterate.
For smaller sets of around 100 elements, the distributed way of handling them will incur unnecessary network usage rather than giving a performance boost.
val vs. var is a choice at the Scala level rather than a Spark one. I have never heard of such a rule; it depends on your requirements.
Not sure what you asked. The only major negative of using if-else is that it can become cumbersome when handling nested if-else. Apart from that, all should be fine. An exception can be thrown based on a condition; I see that as one of many ways to handle issues in an otherwise happy-path flow.
As mentioned here, the compiler generates more bytecode for match..case than for a simple if. So it is a trade-off between a simple condition check and code readability for complex condition checks.
HiveContext gives the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. Please note that in Spark 2.0, both HiveContext and SQLContext are replaced by SparkSession.
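For reference, the unified entry point in Spark 2.0+ looks like this (shown in PySpark; the Scala API is analogous, and the table name is just a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example")
         .enableHiveSupport()   # HiveQL parser, Hive UDFs, and Hive table access
         .getOrCreate())

df = spark.sql("SELECT * FROM some_hive_table")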