How to improve .collect() in PySpark?

Is there any way to tune PySpark so that .collect() performs better?
I'm using map(lambda row: row.asDict(), x.collect()), which takes more than 5 seconds for 10K records.

I haven't tried it, but maybe the Apache Arrow project could help you.
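A minimal sketch of what the Arrow route could look like, assuming Spark 3.x with pandas and pyarrow installed on the driver. Arrow speeds up toPandas() rather than collect() itself, so the rows are pulled as a pandas DataFrame and then converted to dicts. The function name rows_as_dicts_via_arrow is hypothetical, and spark and df stand for an existing SparkSession and DataFrame.

```python
def rows_as_dicts_via_arrow(spark, df):
    """Return the rows of `df` as a list of plain dicts, using Arrow for the transfer.

    `spark` is an existing SparkSession and `df` a Spark DataFrame (both assumed).
    """
    # Spark 3.x key; Spark 2.3/2.4 used "spark.sql.execution.arrow.enabled".
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # toPandas() moves the data to the driver in columnar Arrow batches.
    pdf = df.toPandas()
    # One dict per row, keyed by column name.
    return pdf.to_dict(orient="records")
```

Whether this beats row.asDict() for 10K records depends on the column types and on where the 5 seconds are actually spent, so it is worth timing both.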

Related

Pyspark Dataframe count taking too long

We have a PySpark DataFrame with around 25k records. We are trying to perform a count/empty check on it and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to pandas and tried pandas_df.empty
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing also takes quite a long time (cancelled after 20 minutes)
Could you please help us understand what we are doing wrong?
Note: There are no duplicate values in df, and we have done multiple joins to form the df.
Without looking at df.explain() it's hard to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one task taking much longer than the tasks for the other partitions.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
Count is not taking too long. It is taking the time it needs to complete what you asked Spark to do. To reduce that work, do the things you are likely already doing: filter the data before joining so that only the necessary data is transferred to the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source tables to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from a single source (keeping the limit on all the others) to find the table that is the source of your problem. Then study that table and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
Finally, if you are using Hive tables, make sure the table statistics are up to date:
ANALYZE TABLE mytable COMPUTE STATISTICS;
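The suggestions above can be sketched in PySpark roughly as follows, assuming Spark 3.0 or later (where the two adaptive settings exist). The helper names enable_adaptive_skew_handling, heaviest_join_keys and limit_all_but are hypothetical, as are the spark, df and sources arguments; this is an illustration of the debugging steps, not a definitive recipe.

```python
def enable_adaptive_skew_handling(spark):
    """Turn on adaptive query execution and its skew-join handling."""
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")


def heaviest_join_keys(df, key_cols, n=10):
    """Return the n most frequent join-key values; a few huge counts suggest skew."""
    return (
        df.groupBy(*key_cols)
        .count()
        .orderBy("count", ascending=False)
        .limit(n)
        .collect()
    )


def limit_all_but(sources, keep_full, n=1000):
    """Apply limit(n) to every source DataFrame except the one named `keep_full`.

    `sources` maps a table name to its DataFrame (an assumed layout).
    """
    return {
        name: (df if name == keep_full else df.limit(n))
        for name, df in sources.items()
    }
```

A possible workflow is to rebuild the joins from limit_all_but(sources, name) once per source name and time the count each time; the source whose unlimited version makes the job slow is the one to inspect with heaviest_join_keys.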

Can you disable Catalyst Optimizer in PySpark?

I am currently analysing the execution time of a combination of queries executed with RDDs and with DataFrames in PySpark. Both take the same data and return the same result; however, the DataFrame version is almost 30% faster.
I have read a lot that the PySpark DataFrame is superior, but I want to find out why. I came across the Catalyst Optimizer, which is used by DataFrames but not by RDDs. To check the extent of Catalyst's impact, I would like to disable it completely and then compare the computation time again.
Is there a way to do this? I found some guides for Scala, but nothing for Python.

XGBClassifier fit with pyspark dataframe?

Is it possible to pass a PySpark DataFrame to an XGBClassifier like this:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your PySpark DataFrame to pandas using the toPandas() method (or even better, using pyarrow). pandas DataFrames will work just fine with xgboost. However, your data needs to fit in memory, so you might need to subsample if you're working with TB or even GB of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper using py4j, these packages can leverage SparkML built-ins, albeit typically for Scala Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
Hope this helps.
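The first option could look roughly like the sketch below. It assumes the selected data fits in driver memory and that the model follows the scikit-learn fit(X, y) convention, as XGBClassifier does. The helper name fit_on_driver and its parameters are hypothetical.

```python
def fit_on_driver(model, df, feature_cols, label_col):
    """Fit a scikit-learn style model on a Spark DataFrame by way of pandas.

    Assumes the selected columns fit in driver memory.
    """
    feature_cols = list(feature_cols)
    # Pull only the columns the model needs before converting to pandas.
    pdf = df.select(*feature_cols, label_col).toPandas()
    # X is a pandas DataFrame of features, y a pandas Series of labels.
    model.fit(pdf[feature_cols], pdf[label_col])
    return model
```

With the names from the question this would be called as fit_on_driver(XGBClassifier(), df, features, 'label'); for larger data, passing df.sample(fraction=0.1) instead of df is one way to subsample first.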

Spark - How to calculate percentiles in Spark 1.6 dataframe?

I am using Spark 1.6. I need to find multiple percentiles for a column in a DataFrame. My data is huge, with at least 10 million records. I tried using a Hive context as below:
hivecontext.sql("select percentile_approx(col,0.25),percentile_approx(col,0.5) from table")
But this approach is very slow. I heard about approxQuantile, but it seems to be available only in Spark 2.x. Is there an alternative approach in Spark 1.6 using the DataFrame API to improve performance?
I saw another approach using a Hive UDAF, as below:
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?
I used the percentile_approx(col, array(percentile_value_list)) function, then split the returned array into individual values. It improved performance because the function is not called multiple times.
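A minimal PySpark sketch of that single-call approach, assuming a HiveContext (or any object with a compatible sql() method) and a registered table. The helper names percentile_query and approx_percentiles are hypothetical, as is the pct alias.

```python
def percentile_query(table, column, percentiles):
    """Build one query that asks for all percentiles in a single pass."""
    values = ", ".join(str(p) for p in percentiles)
    return "SELECT percentile_approx({c}, array({v})) AS pct FROM {t}".format(
        c=column, v=values, t=table
    )


def approx_percentiles(sql_context, table, column, percentiles):
    """Run the query and map each requested percentile to its value."""
    row = sql_context.sql(percentile_query(table, column, percentiles)).first()
    # With an array argument, percentile_approx returns an array in the same order.
    return dict(zip(percentiles, row[0]))
```

For the question's example, approx_percentiles(hivecontext, "table", "col", [0.25, 0.5]) would replace the two separate percentile_approx calls with one.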

Do Spark DataFrames load Parquet data lazily?

I want to run SQL on my Parquet data in Spark using the following code:
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name");
println(df.count())
My question is: does this code read only the required columns from disk?
Theoretically the answer should be yes. But I need an expert opinion because, in the case of JDBC queries (MySQL), the read phase (spark.read) takes more time than the actions (this may be related to the connection, but I'm not sure). The JDBC code follows:
spark.read.format("jdbc").jdbc(jdbcUrl, query, props).createOrReplaceTempView(table_name)
val df = spark.sql("select column_1, column_4, column_10 from table_name");
df.show()
println(df.count())
If someone can explain the framework flow in both cases, it would be very helpful.
Spark version 2.3.0
Scala version 2.11.11
In both cases Spark will do its best to limit traffic to only the required data. The exact behavior depends on the format and version, and depending on the context some optimizations might not be applied, typically with deeply nested data. In fact, the spark.sql("select ...") part is not even relevant here: for count(), the actual query should be limited to something equivalent to SELECT 1 FROM table, for a given format.
This stays true as long as you don't use cache / persist. If you do, all optimizations go away and Spark will load all data eagerly (see my answers to "Any performance issues forcing eager evaluation using count in spark?" and "Caching dataframes while keeping partitions", as well as the linked example of how the execution plan changes when cache is used).