Spark - How to calculate percentiles in Spark 1.6 dataframe? - scala

I am using Spark 1.6. I need to find multiple percentiles for a column in a DataFrame. My data is huge, with at least 10 million records. I tried using a Hive context like below:
hivecontext.sql("select percentile_approx(col,0.25),percentile_approx(col,0.5) from table")
But this approach is very slow and takes a lot of time. I have heard about approxQuantile, but it seems to be available only in Spark 2.x. Is there an alternative approach in Spark 1.6, using Spark DataFrames, that improves performance?
I saw another approach using a Hive UDAF like below:
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?

I used the percentile_approx(col, array(percentile_value_list)) function, then split the returned array into individual values. It improved performance by not calling the function multiple times.
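For reference, the single-call form in Spark 1.6 would look like hivecontext.sql("select percentile_approx(col, array(0.25, 0.5, 0.75)) from table"). The underlying idea — pay for one pass over the data and read off every requested percentile — can be sketched in plain Python (exact rather than approximate; the nearest-rank index rule below is just an illustrative choice, not what percentile_approx actually does):

```python
def percentiles(values, fractions):
    # Sort once, then read every requested percentile off the sorted data,
    # analogous to percentile_approx(col, array(...)) doing a single
    # aggregation pass instead of one pass per percentile.
    s = sorted(values)
    return [s[min(int(f * len(s)), len(s) - 1)] for f in fractions]

print(percentiles(range(1, 101), [0.25, 0.5, 0.75]))  # -> [26, 51, 76]
```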

Related

Can you disable Catalyst Optimizer in PySpark?

I am currently analyzing the execution time of a combination of queries executed with RDDs and with DataFrames using PySpark. Both take the same data and return the same result; however, the DataFrame version is almost 30% faster.
I have read a lot that PySpark DataFrames are superior, but I want to find out why. I came across the Catalyst optimizer, which is used by DataFrames but not by RDDs. To check the extent of Catalyst's impact, I would like to disable it completely and then compare the execution times again.
Is there a way to do this? I found some guides for Scala, but nothing for Python.

QuantileTransformer in PySpark

How do I implement scikit-learn's QuantileTransformer in PySpark? Due to the size of my data set (~68 million rows with 100+ columns), I am forced to attempt this in PySpark rather than converting it to Pandas. I am on PySpark 2.4.
I've seen PySpark has scalers such as StandardScaler, MinMaxScaler, etc. But I would like to use an equivalent of QuantileTransformer. Is there anything off the shelf that exists for this purpose?

XGBClassifier fit with pyspark dataframe?

Is it possible to pass a PySpark DataFrame to an XGBClassifier, as in:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your PySpark DataFrame to Pandas using the toPandas() method (or even better, using pyarrow). Pandas DataFrames will work just fine with xgboost. However, your data needs to fit in memory, so you might need to subsample if you are working with GBs or even TBs of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that PySpark is a wrapper using py4j, these can leverage the SparkML built-ins, albeit typically for Scala Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
Hope this helps.
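A minimal sketch of the first option, assuming a PySpark DataFrame df that fits in driver memory and that pandas and xgboost are installed (the feature column names here are hypothetical):

```python
# Option 1 sketch: collect the Spark DataFrame to the driver, train locally.
# Assumes the selected data fits in driver memory; use df.sample(...) first
# if it does not.
import xgboost as xgb

feature_cols = ["f1", "f2", "f3"]  # hypothetical column names
pdf = df.select(feature_cols + ["label"]).toPandas()

model = xgb.XGBClassifier()
model.fit(pdf[feature_cols], pdf["label"])
```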

Did spark dataframes load parquet data lazily?

I want to run SQL on my Parquet data in Spark using the following code:
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
println(df.count())
My question is: does this code read only the required columns from disk?
Theoretically the answer should be yes, but I need an expert opinion, because in the case of JDBC queries (MySQL)
the read phase (spark.read) takes more time compared to the actions (maybe it relates to the connection, but I am not sure). The JDBC code follows:
spark.read.jdbc(jdbcUrl, query, props).createOrReplaceTempView(table_name)
val df = spark.sql("select column_1, column_4, column_10 from table_name")
df.show()
println(df.count())
If someone can explain the framework flow in both the cases, it will be very helpful.
Spark version 2.3.0
Scala version 2.11.11
In both cases Spark will do its best to limit traffic to only the required data (the exact behavior depends on the format and version, and depending on the context some optimizations might not be applied, typically with deeply nested data). In fact, the spark.sql("select ...") part is not even relevant here: since the only action is count(), the actual query can be limited to something equivalent to SELECT 1 FROM table, for a given format.
This stays true as long as you don't use cache / persist. If you do, all such optimizations go away and Spark will load all the data eagerly (see my answers to "Any performance issues forcing eager evaluation using count in spark?" and "Caching dataframes while keeping partitions"; there is also an example of how the execution plan changes when cache is used).
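The column-pruning part of the answer can be illustrated with a toy columnar layout in plain Python. Parquet stores each column contiguously, so a projection never has to touch the storage of columns the query doesn't name (this is, of course, a drastic simplification of the real reader):

```python
# Toy columnar "file": each column is stored contiguously, as in Parquet.
table = {
    "column_1":  [1, 2, 3],
    "column_4":  [10, 20, 30],
    "column_10": [100, 200, 300],
    "unused":    [0, 0, 0],
}

def scan(table, projection):
    # Only the requested columns' storage is ever touched; for a bare
    # count() even these could be skipped (hence "SELECT 1 FROM table").
    return {name: table[name] for name in projection}

result = scan(table, ["column_1", "column_4", "column_10"])
print(sorted(result))  # the "unused" column is never read
```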

Group data based on multiple columns in Spark using Scala's API

I have an RDD and want to group the data based on multiple columns. For large datasets Spark cannot cope using combineByKey, groupByKey, reduceByKey, or aggregateByKey; these give heap-space errors. Can you give another method for resolving this using Scala's API?
You may want to use treeReduce() to do an incremental reduce in Spark. However, your hypothesis that Spark cannot work on large datasets is not true, and I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.
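The difference treeReduce() makes can be sketched in plain Python: instead of folding everything into a single accumulator on one node, partial results are combined pairwise in log-depth rounds, which bounds how much any single step has to hold (a toy model of the idea, not Spark's actual implementation):

```python
def tree_reduce(xs, combine):
    # Combine elements pairwise, level by level, the way treeReduce()
    # merges partial aggregates across executors in O(log n) rounds
    # instead of pulling everything to one place.
    xs = list(xs)
    while len(xs) > 1:
        xs = [combine(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

print(tree_reduce(range(1, 11), lambda a, b: a + b))  # -> 55
```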