XGBClassifier fit with pyspark dataframe? - pyspark

Is it possible to pass a pyspark dataframe to an XGBClassifier as follows:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks

I believe there are two ways to skin this particular cat.
You can either:
Move your pyspark dataframe to pandas using the toPandas() method (or, even better, using pyarrow). pandas dataframes work just fine with xgboost; see the sketch below. However, the data needs to fit in memory, so you might need to subsample if you're working with GBs or even TBs of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper built on py4j, these packages let you leverage the SparkML built-ins, albeit typically from Scala-Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
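For option 1, a minimal sketch, assuming features is a Python list of feature column names and the selected data fits in driver memory:
from xgboost import XGBClassifier

# Collect the selected columns to the driver as a pandas DataFrame
pdf = df.select(features + ['label']).toPandas()

model1 = XGBClassifier()
model1.fit(pdf[features], pdf['label'])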
Hope this helps.

Related

QuantileTransformer in PySpark

How to implement the scikit-learn QuantileTransformer in PySpark? Due to the size of my data set (~68 million rows w/ 100+ columns), I am forced to attempt this in PySpark rather than converting it to Pandas. I am on PySpark 2.4.
I've seen that PySpark has scalers such as StandardScaler, MinMaxScaler, etc., but I would like to use an equivalent of QuantileTransformer. Is there anything off the shelf that exists for this purpose?
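To my knowledge there is no built-in equivalent; a rough approximation, sketched below under the assumption that mapping each value to its approximate quantile bucket is acceptable, uses QuantileDiscretizer (the column name "x" and the bucket count are placeholders):
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import functions as F

# Bin "x" into approximate quantile buckets, then rescale the bucket index to [0, 1)
qd = QuantileDiscretizer(numBuckets=1000, inputCol="x", outputCol="x_bucket", relativeError=0.001)
df_q = qd.fit(df).transform(df)
df_q = df_q.withColumn("x_quantile", F.col("x_bucket") / F.lit(1000.0))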

Spark grouped map UDF in Scala

I am trying to write some code that would allow me to compute some action on a group of rows of a dataframe. In PySpark, this is possible by defining a Pandas UDF of type GROUPED_MAP. However, in Scala, I only found a way to create custom aggregators (UDAFs) or classic UDFs.
My temporary solution is to generate a list of keys that encode my groups, which allows me to filter the dataframe and perform my action for each subset of the dataframe. However, this approach is not optimal and is very slow.
The actions are performed sequentially, thus taking a lot of time. I could parallelize the loop, but I'm not sure this would show any improvement since Spark is already distributed.
Is there any better way to do what I want?
Edit: I tried parallelizing using Futures, but there was no speed improvement, as expected.
To the best of my knowledge, this is something that's not possible in Scala. Depending on what you want, I think there could be other ways of applying a transformation to a group of rows in Spark / Scala:
Do a groupBy(...).agg(collect_list(<column_names>)) and use a UDF that operates on the array of values. If desired, you can use a select statement with explode(<array_column>) to revert to the original format.
Try rewriting what you want to achieve using window functions. You can add a new column with an aggregate expression like so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, pmod, sum}
import spark.implicits._  // for the 'column symbol syntax

val w = Window.partitionBy('group)
val result = spark.range(100)
  .withColumn("group", pmod('id, lit(3)))     // assign each id to one of 3 groups
  .withColumn("group_sum", sum('id).over(w))  // per-group sum attached to every row

Spark - How to calculate percentiles in Spark 1.6 dataframe?

I am using Spark 1.6. I need to find multiple percentiles for a column in a dataframe. My data is huge, with at least 10 million records. I tried using a Hive context like below:
hivecontext.sql("select percentile_approx(col, 0.25), percentile_approx(col, 0.5) from table")
But this approach is very slow and takes a lot of time. I heard about approxQuantile, but it seems it is only available in Spark 2.x. Is there any alternative approach in Spark 1.6, using Spark dataframes, to improve performance?
I saw another approach using a Hive UDAF, like below:
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?
I used the percentile_approx(col, array(percentile_value_list)) function and then split the returned array into individual values. It improved performance by avoiding calling the function multiple times.
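A minimal sketch of that single-pass approach, assuming an existing HiveContext named hivecontext and placeholder column/table names (the Hive UDAF accepts an array of percentiles and returns the results in the same order):
# One query computes all percentiles at once instead of one query per percentile
percentiles = hivecontext.sql(
    "select percentile_approx(col, array(0.25, 0.5, 0.75)) from table"
).first()[0]

p25, p50, p75 = percentiles  # split the returned array into individual values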

What is the difference between pandas_udf from PySpark and toPandas()?

When I clean big data with pandas, I have two methods: one is to use a pandas_udf (PySpark 2.3+) to clean the data, the other is to convert the Spark dataframe to a pandas dataframe with toPandas() and then clean it with pandas.
I'm confused: how are these methods different?
I hope someone could explain the differences in terms of distribution, speed, and other aspects.
TL;DR: pandas_udf and toPandas are very different.
pandas_udf
Creates a vectorized user defined function (UDF).
It leverages the vectorization features of pandas and serves as a faster alternative to a plain udf, and it works on the distributed dataset. To learn more about pandas_udf performance, you can read the pandas_udf vs udf performance benchmark here.
toPandas, on the other hand, collects the distributed Spark dataframe as a pandas dataframe. The pandas dataframe is localized and resides in the driver's memory, so:
this method should only be used if the resulting Pandas's DataFrame is expected to be small, as all the data is loaded into the driver's memory.
So if your data is large, you can't use toPandas; pandas_udf, udf, or other built-in methods are your only options.
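To make the contrast concrete, here is a minimal sketch of both approaches (the column name "x" and the cleaning step are hypothetical; the pandas_udf path assumes Spark 2.3+ with PyArrow installed):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Distributed path: the function runs on the executors, one pandas Series per Arrow batch
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def clean_x(x):
    return x.fillna(0.0) * 2

sdf_clean = sdf.withColumn("x_clean", clean_x("x"))

# Local path: the whole dataframe is collected into the driver's memory first
pdf = sdf.toPandas()
pdf["x_clean"] = pdf["x"].fillna(0.0) * 2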

Spark Multi-class classification - Categorical Variables

I have a data set as a CSV file. It has around 50 columns, most of which are categorical. I am planning to run a RandomForest multi-class classification against a new test data set.
The pain point of this is handling the categorical variables. What would be the best way to handle them? I read the guide for Pipeline on the Spark website http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline, which creates a DataFrame from a hard-coded sequence, with the features as a space-delimited string. This looks very specific, and I wanted to achieve the same thing (the way they use HashingTF for the features) with the CSV file I have.
In short, I want to achieve the same thing as in the link, but using a CSV file.
Any suggestions?
EDIT:
Data -> 50 features, 100k rows, most of it alphanumeric categorical
I am pretty new to MLlib and hence struggling to find the proper pipeline for my data from CSV. I tried creating a DataFrame from the file, but I'm confused as to how I should encode the categorical columns. The doubts I have are as follows:
1. The example in the link above tokenizes the data and uses it, but I have a dataframe.
2. Also, even if I try using a StringIndexer, should I write an indexer for every column? Shouldn't there be one method which accepts multiple columns?
3. How will I get back the label from the StringIndexer for showing the prediction?
4. For new test data, how will I keep the encoding consistent for every column?
I would suggest having a look at the feature transformers http://spark.apache.org/docs/ml-features.html, and in particular StringIndexer and VectorAssembler.
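A minimal sketch of how those pieces could fit together end to end (the file path, the "cat1", "cat2", and "label" column names, and the Spark 2.x ML API are assumptions, not the asker's actual schema):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.classification import RandomForestClassifier

df = spark.read.csv("data.csv", header=True, inferSchema=True)
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

categorical_cols = ["cat1", "cat2"]  # in practice: every categorical column, one indexer each
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]
label_indexer = StringIndexer(inputCol="label", outputCol="label_idx").fit(df)

assembler = VectorAssembler(inputCols=[c + "_idx" for c in categorical_cols],
                            outputCol="features")
rf = RandomForestClassifier(labelCol="label_idx", featuresCol="features")

# Map the numeric predictions back to the original string labels (question 3)
label_converter = IndexToString(inputCol="prediction", outputCol="predicted_label",
                                labels=label_indexer.labels)

pipeline = Pipeline(stages=indexers + [label_indexer, assembler, rf, label_converter])
model = pipeline.fit(train_df)

# Applying the same fitted model to new data reuses the same encodings (question 4)
predictions = model.transform(test_df)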