Can I use the L-BFGS optimizer in Spark through pyspark? - pyspark

I would like to create my own estimator in python in PySpark.
I would like to fit it with the L-BFGS optimization algorithm, or with any other optimization algorithm that is available.
How can I access these optimization algorithms through pyspark?
Is there an example that shows how this is done?

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.classification.LogisticRegressionWithLBFGS.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS
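
A minimal usage sketch for the class above, assuming a running SparkContext is passed in as `sc`. The helper names `to_labeled_rows` and `train_with_lbfgs` are made up for illustration, and the pyspark imports sit inside the function so that the row-splitting helper can be used without Spark installed. Note that this trains Spark's built-in logistic regression with L-BFGS; it does not hand the optimizer to a custom Python estimator.

```python
def to_labeled_rows(rows):
    """Split (label, f1, f2, ...) tuples into (label, [features]) pairs."""
    return [(float(row[0]), [float(x) for x in row[1:]]) for row in rows]


def train_with_lbfgs(sc, rows, iterations=100):
    """Train a logistic regression model with L-BFGS on an RDD of LabeledPoint.

    `sc` is assumed to be an existing SparkContext; `rows` is a list of
    (label, f1, f2, ...) tuples small enough to parallelize from the driver.
    """
    # Imported here so the module loads even where pyspark is not installed.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    points = sc.parallelize(
        [LabeledPoint(label, features) for label, features in to_labeled_rows(rows)]
    )
    return LogisticRegressionWithLBFGS.train(points, iterations=iterations)
```

With a context obtained from `SparkContext.getOrCreate()`, calling `train_with_lbfgs(sc, [(0, 0.0, 1.0), (1, 1.0, 0.0)])` should return a `LogisticRegressionModel` whose `predict` method takes a feature vector.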

Related

How to use Apache spark to implement GraphSAGE?

I want to implement the graph algorithm GraphSAGE using Scala and Spark. How can I do this? Is there any source code available?
I am looking for code that does this.
I haven't implemented this graph algorithm on top of Spark yet. As far as I know, the only available implementation that uses deep learning for graph analysis is this. It is a spectral graph convolution for semi-supervised learning, and it is a transductive algorithm. It can be used for node classification. I plan to include more algorithms, such as GraphSAGE, in the future.

Custom loss function for Multiclass classification in Scala and Spark

I want to ask whether it is possible to write a custom loss function for multiclass classification in Spark using Scala. I want to implement multiclass logarithmic loss in Scala. I searched the Spark documentation but could not find any hint.
From the Spark 2.2.0 MLlib guide:
Currently, only binary classification is supported. This will likely change when multiclass classification is supported.
If you are not restricted to a particular classification technique, I would suggest using XGBoost. It has a Spark-compatible implementation, and it lets you use any loss function, provided you can compute its first and second derivatives.
You can find a tutorial here.
Also the explanation about why it is possible to use a custom loss function can be found here.
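
To make the derivative requirement concrete, here is a small sketch of multiclass logarithmic loss together with its first and second derivatives with respect to the raw class scores. It is written in plain Python rather than Scala, it is not XGBoost's API, and the function names are made up for illustration. It assumes a softmax over the raw scores and uses only the diagonal of the second derivative, which is the form gradient boosting libraries typically work with.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(score_rows, labels):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for scores, label in zip(score_rows, labels):
        probs = softmax(scores)
        # Clip to avoid log(0) if a probability underflows.
        total -= math.log(max(probs[label], 1e-15))
    return total / len(labels)


def grad_hess(scores, label):
    """First and (diagonal) second derivative of the loss for one example."""
    probs = softmax(scores)
    grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    hess = [p * (1.0 - p) for p in probs]
    return grad, hess
```

For three classes with equal scores, the loss is log(3) and the gradient for the true class is 1/3 - 1, which pushes that class's score up.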

Is it possible to initialize centers with specific values for spark kmeans?

I am using kmeans from sklearn and from pyspark.ml.
The Spark version is much faster. However, it doesn't seem to have an option that I need. With sklearn kmeans I can specify initial values for the cluster centers: KMeans(init=centers,...).
I don't see such an option for pyspark. Am I missing it, or am I out of luck and it doesn't exist?
Thank you

Clustering data with categorical and numeric features in Apache Spark

I am currently looking for an algorithm in Apache Spark (Scala/Java) that can cluster data with both numeric and categorical features.
As far as I have seen, there is an implementation for k-medoids and k-prototypes for pyspark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not identify something similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm that achieves something similar for Spark running Scala? Or am I overlooking something, and could I actually make use of the pyspark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply a clustering algorithm from mllib (e.g. k-means). I also recommend scaling or normalizing the features before clustering, since the algorithm is distance-sensitive.
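
The two preprocessing steps in this answer can be sketched in plain Python to show what the feature vectors look like before clustering. This is only an illustration with made-up function names; in Spark itself the encoding would be done by OneHotEncoder and the scaling by one of the scalers in the feature package, and this sketch assumes min-max scaling as the normalization.

```python
def one_hot(values):
    """Encode each categorical value as a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = [
        [1.0 if index[value] == i else 0.0 for i in range(len(categories))]
        for value in values
    ]
    return encoded, categories


def min_max_scale(values):
    """Rescale a numeric column to the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A constant column carries no distance information.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


def build_features(numeric, categorical):
    """Combine one scaled numeric column and one encoded categorical column."""
    scaled = min_max_scale(numeric)
    encoded, _ = one_hot(categorical)
    return [[s] + e for s, e in zip(scaled, encoded)]
```

For example, `build_features([10.0, 20.0, 30.0], ["a", "b", "a"])` gives `[[0.0, 1.0, 0.0], [0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]`, so every dimension lies in the same range and no single feature dominates the distance.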

Is there a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ml?

I am wondering whether there is a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ml to prevent overfitting. It exists in mllib, which works with RDDs. I am looking for the same thing for DataFrames.
I found K-fold cross-validation support in Spark. It can be done using CrossValidator with an Estimator, an Evaluator, a set of ParamMaps, and a number of folds. This helps in finding the best parameters for the model, i.e. model tuning.
Refer to http://spark.apache.org/docs/latest/ml-tuning.html for more details.
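
The idea behind K-fold tuning can be sketched in plain Python. This is not Spark's implementation: the function names are made up, the folds are assigned round-robin instead of randomly, and `fit_and_score` stands in for whatever trains a model on the training rows and scores it on the validation rows. Note that this selects parameters across folds rather than stopping boosting early on a validation set.

```python
def k_fold_indices(n_rows, num_folds):
    """Return (training, validation) row-index lists, one pair per fold."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    splits = []
    for i, validation in enumerate(folds):
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((training, validation))
    return splits


def cross_validate(param_grid, fit_and_score, n_rows, num_folds=3):
    """Pick the parameter set with the best average validation score."""
    splits = k_fold_indices(n_rows, num_folds)
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        # Average the score over all folds for this parameter set.
        average = sum(
            fit_and_score(params, training, validation)
            for training, validation in splits
        ) / num_folds
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score
```

A call such as `cross_validate([{"depth": 2}, {"depth": 3}], fit_and_score, n_rows=6)` evaluates each candidate on three train/validation splits and returns the candidate with the highest mean score.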