I want to ask: is it possible to write a custom loss function for multi-class classification in Spark using Scala? I want to implement multi-class logarithmic loss in Scala. I searched the Spark documentation but could not find any hint.
From the Spark 2.2.0 MLlib guide:
Currently, only binary classification is supported. This will likely change when multiclass classification is supported.
If you are not restricted to a particular classification technique, I would suggest XGBoost. It has a Spark-compatible implementation, and it makes it possible to use any loss function, provided you can compute its derivative twice (see the sketch below).
You can find a tutorial here.
Also the explanation about why it is possible to use a custom loss function can be found here.
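For the multi-class logarithmic loss specifically, the two derivatives XGBoost needs are easy to write down: with softmax probabilities p_k over the raw scores, the gradient with respect to score k is p_k - y_k, and the (diagonal) second derivative is p_k(1 - p_k). Here is a minimal sketch in plain Scala; the plumbing that feeds these values into XGBoost4J's custom-objective hook is omitted, as it varies by version:

```scala
// Sketch: first and second derivatives of multi-class log loss, per instance.
// For a softmax over raw scores f(0..K-1), the gradient w.r.t. f(k) is
// p(k) - y(k), and the diagonal Hessian entry is p(k) * (1 - p(k)) --
// the two quantities a twice-differentiable custom objective must return.
object MultiClassLogLoss {

  /** Numerically stable softmax over raw scores. */
  def softmax(scores: Array[Double]): Array[Double] = {
    val max  = scores.max
    val exps = scores.map(s => math.exp(s - max))
    val sum  = exps.sum
    exps.map(_ / sum)
  }

  /** Gradient and diagonal Hessian for one instance with true class `label`. */
  def gradHess(scores: Array[Double], label: Int): (Array[Double], Array[Double]) = {
    val p = softmax(scores)
    val grad = p.zipWithIndex.map { case (pk, k) =>
      if (k == label) pk - 1.0 else pk   // p(k) - y(k), with y one-hot
    }
    val hess = p.map(pk => pk * (1.0 - pk)) // diagonal approximation
    (grad, hess)
  }
}
```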
Related
I want to use Scala and Spark to implement the graph algorithm GraphSAGE. How can I do it? Is there any source code?
I want to get the code for my question.
I havenĀ“t implemented yet this graph algorithms on top of Spark, the only available implementation, as far as I know, for using deep learning for graph analysis is this. It is a spectral graph convolution for semi-supervised learning, and it is a transductive algorithm. It can be used for node classification. I have plans to include more algorithms in the future like GraphSAGE.
I am currently looking for an algorithm in Apache Spark (Scala/Java) that is able to cluster data that has numeric and categorical features.
As far as I have seen, there are implementations of k-medoids and k-prototypes for PySpark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not identify anything similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm that achieves similar things for Spark running Scala? Or am I overlooking something and could actually make use of the PySpark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply your clustering algorithm from MLlib (e.g. k-means). I also recommend scaling or normalizing the data before clustering, since k-means is distance-sensitive. A sketch is below.
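As a minimal sketch, assuming a DataFrame `df` with a numeric column `amount` and a categorical column `color` (both column names are made up for illustration), the Spark ML pipeline could look like this:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

// Index the categorical column, then one-hot encode it.
val indexer = new StringIndexer()
  .setInputCol("color").setOutputCol("colorIndex")
val encoder = new OneHotEncoder()
  .setInputCol("colorIndex").setOutputCol("colorVec")

// Assemble numeric + encoded categorical features into a single vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "colorVec")).setOutputCol("rawFeatures")

// Scale the features: k-means is distance-based, so scaling matters.
val scaler = new StandardScaler()
  .setInputCol("rawFeatures").setOutputCol("features")

val kmeans = new KMeans().setK(3).setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, scaler, kmeans))
val model = pipeline.fit(df) // df: your input DataFrame
```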
Wondering if there is a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ml to prevent overfitting. It's there in mllib, which works with RDDs. I am looking for the same for DataFrames.
Found K-fold cross-validation support in Spark. It can be done using CrossValidator with an Estimator, an Evaluator, a ParamMap, and a number of folds. This helps in finding the best parameters for the model, i.e. model tuning.
Refer http://spark.apache.org/docs/latest/ml-tuning.html for more details.
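A minimal sketch with GBTClassifier, assuming a DataFrame `trainingData` with the standard `label` and `features` columns (the grid values are placeholders):

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Grid over the parameters that control overfitting.
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(3, 5))
  .addGrid(gbt.maxIter, Array(20, 50))
  .build()

// Accuracy on the prediction column works for binary labels too.
val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData) // picks the best parameter combination
```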
I do not think the Gaussian mixture model is available in MLlib yet. I am wondering whether any good Scala/Java implementation of GMM (suitable for large data) is available elsewhere. Please let me know.
Thanks and regards,
It is available in Spark MLlib now:
http://spark.apache.org/docs/latest/mllib-clustering.html#gaussian-mixture
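A minimal usage sketch, adapted from that guide (the input path and k are placeholders, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors into an RDD[Vector].
val data = sc.textFile("data/gmm_data.txt")
  .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
  .cache()

// Fit a mixture of k = 2 Gaussians using EM.
val gmm = new GaussianMixture().setK(2).run(data)

// Inspect the fitted components.
for (i <- 0 until gmm.k) {
  println(s"weight=${gmm.weights(i)}\nmu=${gmm.gaussians(i).mu}\nsigma=\n${gmm.gaussians(i).sigma}")
}
```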
Have a look at https://issues.apache.org/jira/browse/SPARK-4156
It is still in progress. We can expect it in MLlib soon.
I am currently looking for a multilabel AdaBoost implementation for MATLAB or a technique for efficiently using a two-label implementation for the multilabel case. Any help in that matter would be appreciated.
You can use the same approach used in Support Vector Machines. SVMs are inherently binary classifiers; several approaches have been proposed for handling multiclass data:
one-against-all: construct one binary classifier per class, trained with the instances of that class as positive cases and all other instances as negative cases (i.e. 1-vs-not-1, 2-vs-not-2, 3-vs-not-3). Finally, use the posterior probability of each classifier to predict the class (see the sketch after this list).
one-against-one: construct one binary classifier for each pair of classes (i.e. 1-vs-2, 1-vs-3, 2-vs-3, ...) by simply training on the instances from both classes. Then combine the individual results using a majority vote.
Error-Correcting Output Codes: based on the theory of error correction (Hamming codes and the like), it encodes the outputs of several binary classifiers with some redundancy to increase accuracy.
Note that these are generic methods and can be applied to any binary classifier.
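Since the rest of this page is Spark-flavoured, here is the one-against-all strategy sketched in Spark ML (Scala) rather than MATLAB; OneVsRest wraps any binary classifier (logistic regression here) and trains one model per class. `train` and `test` are assumed DataFrames with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Base binary classifier; any ml.classification.Classifier works here.
val base = new LogisticRegression().setMaxIter(10)

// One-against-all: trains one binary model per class and, at prediction
// time, picks the class whose model is most confident.
val ovr = new OneVsRest().setClassifier(base)

val ovrModel = ovr.fit(train)
val predictions = ovrModel.transform(test)
```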
Otherwise, you can search for a specific implementation of multiclass AdaBoost; I'm sure there are plenty out there. A quick search revealed this one: Multiclass GentleAdaboosting.
You can use AdaBoost.M2, a multiclass variant of AdaBoost. You can find an implementation in the Balu toolbox here; the command is Bcl_adaboost. This toolbox has other useful stuff too, just remember to cite it. Hope it helps.
Theoretically speaking, the only correct multi-class boosting is the one defined in A theory of multiclass boosting