How can I implement the Scikit-learn QuantileTransformer in PySpark? Due to the size of my data set (~68 million rows with 100+ columns), I have to attempt this in PySpark rather than converting it to Pandas. I am on PySpark 2.4.
I've seen that PySpark has scalers such as StandardScaler, MinMaxScaler, etc., but I would like to use an equivalent of QuantileTransformer. Is there anything off the shelf for this purpose?
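A rough sketch of one possible approximation (not an off-the-shelf transformer) using pyspark.ml's QuantileDiscretizer; the column names and the bucket count below are placeholder assumptions:

from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import functions as F

# Bucket count sets the resolution of the transform (assumption: 1000 quantiles)
n_buckets = 1000
qd = QuantileDiscretizer(numBuckets=n_buckets, relativeError=0.001,
                         inputCol="value_col", outputCol="value_bucket")

bucketed = qd.fit(df).transform(df)

# Dividing the bucket index by the bucket count yields an approximately uniform
# [0, 1) value, roughly like QuantileTransformer(output_distribution='uniform')
result = bucketed.withColumn("value_quantile", F.col("value_bucket") / n_buckets)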
# Filter down to two parquet partitions, then to a single value of another column
df = df_full[df_full.part_col.isin(['part_a', 'part_b'])]
df = df[df.some_other_col == 'some_value']
# df has a shape of roughly (240k, 200)
# df_full has a shape of roughly (30m, 200)
df.to_pandas().reset_index().to_csv('testyyy.csv', index=False)
If I do any groupby operation it is amazingly fast. However, the issue arises when I try to export a small subset of this large dataset to CSV. I am eventually able to export the dataframe to CSV, but it takes far too long.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate parquet partitions. Also, I am using pyspark.pandas on Spark 3+.
So the question is: what is happening? And what is the most efficient way to export the filtered dataframe to CSV?
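A minimal sketch of one approach, assuming the goal is simply a CSV of the filtered subset (the output path 'testyyy_csv' is a placeholder; Spark writes a directory of part files):

import pyspark.pandas as ps

# The single-partition window warning likely comes from the default sequential
# index; a distributed default index avoids it (set before building df)
ps.set_option("compute.default_index_type", "distributed")

df = df_full[df_full.part_col.isin(['part_a', 'part_b'])]
df = df[df.some_other_col == 'some_value']

# Write directly with Spark instead of collecting to the driver via to_pandas();
# coalesce(1) is optional and only reasonable because the filtered result
# (~240k rows) is small
df.to_spark().coalesce(1).write.mode("overwrite").csv('testyyy_csv', header=True)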
Is it possible to pass a PySpark dataframe to an XGBClassifier as:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a PySpark dataframe to XGBoost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your pyspark dataframe to pandas using the toPandas() method (or even better, using pyarrow). Pandas dataframes work just fine with xgboost; see the sketch after this list. However, your data needs to fit in memory, so you might need to subsample if you're working with GBs or even TBs of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper using py4j, these packages can leverage the SparkML built-ins, albeit typically for Scala-Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
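A minimal sketch of the first option, assuming the selected data fits in driver memory and that features is a list of column names:

from xgboost import XGBClassifier

# Collect the Spark DataFrame to the driver as a pandas DataFrame
# (only viable when the selected columns/rows fit in driver memory)
pdf = df.select(features + ['label']).toPandas()

model = XGBClassifier()
model.fit(pdf[features], pdf['label'])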
Hope this helps.
I am using Spark 1.6. I need to find multiple percentiles for a column in a dataframe. My data is huge, with at least 10 million records. I tried using the Hive context like below
hivecontext.sql("select percentile_approx(col,0.25),percentile_approx(col,0.5) from table")
But this approach is very slow and takes a lot of time. I have heard about approxQuantile, but it seems to be available only in Spark 2.x. Is there any alternative approach in Spark 1.6 using a Spark dataframe to improve performance?
I saw another approach using a Hive UDAF like below
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?
I used the percentile_approx(col, array(percentile_value_list)) function, then split the returned array into individual values. It improved performance by not calling the function multiple times.
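A sketch of that single-call approach via Hive SQL, which also works on Spark 1.6; the table and column names are placeholders:

# One call returns all requested percentiles as an array, so the data is
# scanned once instead of once per percentile
row = hivecontext.sql(
    "SELECT percentile_approx(col, array(0.25, 0.5, 0.8)) AS pcts FROM table"
).collect()[0]

p25, p50, p80 = row.pcts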
When I clean big data with pandas, I have two methods: one is to use @pandas_udf (PySpark 2.3+) to clean the data; the other is to convert the sdf to a pdf with toPandas() and then use pandas to clean it.
I'm confused: how are these methods different?
I hope someone could explain the differences in terms of distribution, speed, and other aspects.
TL;DR: @pandas_udf and toPandas are very different.
@pandas_udf
Creates a vectorized user defined function (UDF).
which leverages the vectorization feature of pandas and serves as a faster alternative to udf, and it works on a distributed dataset. To learn more about pandas_udf performance, you can read the pandas_udf vs udf performance benchmark here.
toPandas, on the other hand, collects the distributed Spark data frame into a pandas data frame. That pandas data frame is localized and resides in the driver's memory, so:
this method should only be used if the resulting Pandas's DataFrame is expected to be small, as all the data is loaded into the driver's memory.
So if your data is large, you can't use toPandas; @pandas_udf, udf, or other built-in methods would be your only options.
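An illustrative sketch of the distributed route (Spark 2.3+ scalar pandas_udf; the column names are assumptions):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar pandas_udf: Spark ships Arrow batches of the column to Python,
# the function cleans each batch with pandas, and the work stays distributed
@pandas_udf("double", PandasUDFType.SCALAR)
def clean_col(s):
    return s.fillna(0.0).clip(lower=0.0)

sdf = sdf.withColumn("clean_col", clean_col("raw_col"))

# By contrast, the toPandas route pulls the whole DataFrame into the driver first:
# pdf = sdf.toPandas()
# pdf["clean_col"] = pdf["raw_col"].fillna(0.0).clip(lower=0.0)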
Nowadays, data comes with a large number of features. To get a short summary of the data, people load it into data frames and use the head() method to display it. It's pretty common to run experiments using Jupyter Notebooks (with Toree for Scala).
Spark (Scala) is good for handling large amounts of data, but its head() method doesn't show column headers in a horizontally scrollable notebook output.
Pandas Dataframe head
Spark Scala Dataframe head
I know you can get the column headers of a Scala dataframe by using .columns, but printing them separately doesn't display the headers alongside the data columns, which makes the output difficult to read.
Instead of df.head(20), try df.show(20, false) in Scala (or df.show(n=20, truncate=False) in PySpark). Here is the detailed documentation.
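If you specifically want the notebook's table rendering with headers rather than show()'s plain-text output, one option in PySpark, assuming the first rows fit comfortably in driver memory, is to pull only a small slice into pandas:

# Take just the first 20 rows to the driver; the notebook renders the
# resulting pandas DataFrame with column headers
df.limit(20).toPandas()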