I am using pyspark and want to improve network performance and memory tuning. Can anyone suggest which serializer is better for pyspark: MarshalSerializer or PickleSerializer?
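For reference, a minimal sketch of picking a serializer when building the context (the master and app name are placeholders); MarshalSerializer is generally faster but supports fewer data types than the default PickleSerializer:
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer
# the serializer is chosen once, when the SparkContext is created
sc = SparkContext("local", "serializer_demo", serializer=MarshalSerializer())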
How to implement the scikit-learn QuantileTransformer in PySpark? Due to the size of my data set (~68 million rows w/ 100+ columns), I am forced to attempt this in PySpark rather than converting it into Pandas. I am on PySpark 2.4.
I've seen PySpark has scalers such as StandardScaler, MinMaxScaler, etc. But I would like to use an equivalent of QuantileTransformer. Is there anything off-the-shelf that exists for this purpose?
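A minimal sketch of the built-in scalers the question mentions, for reference (the column names are placeholders):
from pyspark.ml.feature import VectorAssembler, StandardScaler
# assemble the raw columns into a vector column, then standardize it
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
assembled = assembler.transform(df)
scaled = scaler.fit(assembled).transform(assembled)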
Is it possible to pass a pyspark dataframe to an XGBClassifier as:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your pyspark dataframe to pandas using the toPandas() method (or even better, using pyarrow). pandas dataframes will work just fine with xgboost; see the sketch after this list. However, your data needs to fit in memory, so you might need to subsample if you're working with GBs or even TBs of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper built on py4j, these packages let you leverage the SparkML built-ins, albeit typically from Scala-Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
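For the first option, a minimal sketch, assuming features is a list of column names and the label column is called 'label' as in the question:
from xgboost import XGBClassifier
# collect the (possibly subsampled) Spark dataframe to the driver as pandas
pdf = df.select(features + ["label"]).toPandas()
model1 = XGBClassifier()
model1.fit(pdf[features], pdf["label"])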
Hope this helps.
I am using spark 1.6. I need to find multiple percentiles for a column in a dataframe. My data is huge, with at least 10 million records. I tried using the hive context like below:
hivecontext.sql("select percentile_approx(col,0.25),percentile_approx(col,0.5) from table")
But this approach is very slow and takes a lot of time. I heard about approxQuantile, but it seems it is only available in spark 2.x. Is there any alternate approach in spark 1.6 using spark dataframes to improve performance?
I saw another approach using a hive UDAF like below:
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?
I used the percentile_approx(col, array(percentile_value_list)) function, then split the returned array into individual values. It improved performance by not calling the function multiple times.
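A sketch of that single-call form (table and column names are placeholders):
quantiles = hivecontext.sql(
    "select percentile_approx(col, array(0.25, 0.5, 0.8)) as qs from table"
).collect()[0][0]
q25, q50, q80 = quantiles  # one scan computes all requested percentiles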
When I clean big data with pandas, I have two methods: one is to use pandas_udf (available from pyspark 2.3+) to clean the data; the other is to convert the Spark dataframe to a pandas dataframe with toPandas() and then clean it with pandas.
I'm confused: how do these methods differ?
I hope someone could explain the difference in terms of distribution, speed, and other aspects.
TL;DR: pandas_udf and toPandas are very different.
pandas_udf
Creates a vectorized user defined function (UDF).
It leverages the vectorization feature of pandas and serves as a faster alternative to udf, and it works on the distributed dataset; to learn more about pandas_udf performance, you can read the pandas_udf vs udf performance benchmark here.
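A minimal scalar pandas_udf sketch in the 2.3/2.4 style (the column name "value" and the cleaning logic are made up):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def clean_value(s):
    # s is a pandas Series holding one batch of the column; the function
    # runs on the executors, so the data stays distributed
    return s.fillna(0.0).clip(lower=0.0)

df = df.withColumn("value_clean", clean_value(df["value"]))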
toPandas, on the other hand, collects the distributed Spark dataframe into a pandas dataframe. That pandas dataframe is local and resides in the driver's memory, so:
this method should only be used if the resulting Pandas's DataFrame is expected to be small, as all the data is loaded into the driver's memory.
So if your data is large, you can't use toPandas; pandas_udf, udf, or other built-in methods would be your only options.
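A rough illustration of that constraint (the row threshold is arbitrary):
if df.count() <= 1000000:   # made-up cutoff; size it to the driver's memory
    pdf = df.toPandas()     # the whole result now lives in the driver's memory
else:
    pdf = None              # too large: stay distributed (pandas_udf / built-ins)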
I have an RDD and want to group the data based on multiple columns. For a large dataset Spark cannot cope using combineByKey, groupByKey, reduceByKey, or aggregateByKey; these give a heap space error. Can you suggest another method for resolving this using Scala's API?
You may want to use treeReduce() for doing an incremental reduce in Spark. However, your hypothesis that Spark cannot work on a large dataset is not true, and I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.
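A PySpark sketch of both suggestions (the question asks for Scala, where the same RDD methods exist; the key layout and partition count are made up):
# key on the grouping columns, then reduce with an explicit partition count
keyed = rdd.map(lambda row: ((row[0], row[1]), row[2]))
summed = keyed.reduceByKey(lambda a, b: a + b, numPartitions=400)
# for a single global aggregate, reduce tree-wise instead of all at once on the driver
total = keyed.map(lambda kv: kv[1]).treeReduce(lambda a, b: a + b, depth=3)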