In mllib LinearRegressionWithSGD, I can not control numIterations, why? - scala

I have one ml task, 34461 samples and I use LinearRegressionWithSGD to train model. But the results of the two training result(MSE and Correlation) are the same:
train model (setNumIterations(5))
val linearRegressionWithSGD: LinearRegressionWithSGD = new LinearRegressionWithSGD()
linearRegressionWithSGD.setIntercept(true)
linearRegressionWithSGD.optimizer.setStepSize(0.1)
linearRegressionWithSGD.optimizer.setNumIterations(5)
train model (setNumIterations(10000))
val linearRegressionWithSGD: LinearRegressionWithSGD = new LinearRegressionWithSGD()
linearRegressionWithSGD.setIntercept(true)
linearRegressionWithSGD.optimizer.setStepSize(0.1)
linearRegressionWithSGD.optimizer.setNumIterations(10000)
I can not control the number of iterations, but why?

Related

AdaBoost - get prediction for the specific number of estimators

I use AdaBoostClassifier from sklearn.ensemble. I trained my model using 1000 estimators:
model = AdaBoostClassifier(
base_estimator = DecisionTreeClassifier(max_depth = 6),
n_estimators = 1000,
learning_rate = 0.2
)
model.fit(X_train, y_train)
then using generator model.staged_predict_proba(X_test) I get know that the best accuracy for X_test data is for 814 estimators.
Now I don't want to use generator model.staged_predict_proba(X_test) to make prediction for new test data X_test_2 because it is a lot of time to calculate predictions for each number of estimators (the dataset is really big). I just want to calculate predictions for model based on 814 estimators. I didn't find a way to do it. Is it possible for AdaBoostClassifier? I think it should be.

Retrieve classification performance from "crossval"

I am classifying some data as a dummy-test against a zero vector, using a Support Vector Machine (SVM), as follows:
kernel = 'linear'; C =1;
class1 = double(data(labels==1,:));
class2 = zeros([size(class1,1),size(class1,2)]);
data = [class1;class2];
theclass = [ones(size(class1,1),1); -1*ones(size(class2,1),1)];
%Train the SVM Classifier
cl = fitcsvm(data,theclass,'KernelFunction',kernel,...
'BoxConstraint',C,'ClassNames',[-1,1]);
% Cross-validation of the trained SVM
CVSVMModel = crossval(cl)
Where can I retrieve the performance of these classifications, as for instance classification accuracy, from crossval?
Edit: I am also wondering, how this kind of cross-validation works, since it is applied to an already fully trained SVM? Does it take the full dataset, and partitions it into (e.g.) 10-folds and trains new classifiers? Or is it then only predicting on the 10 test-sets?

PySpark MLLIB Random Forest: prediction always 0

Using ml, Spark 2.0 (Python) and a 1.2 million row dataset, I am trying to create a model that predicts purchase tendency with a Random Forest Classifier. However when applying the transformation to the splitted test dataset the prediction is always 0.
The dataset looks like:
[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....
tier_buyer is the field used as a label indexer. The rest of the fields contain numeric data.
Steps
1.- Load the parquet file, and fill possible null values:
parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)
2.- Create features vector:
features = VectorAssembler(
inputCols = ['N1','N2'...],
outputCol = 'features')
3.- Create string indexer:
label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')
4.- Split the train and test datasets:
(train, test) = df.randomSplit([0.7, 0.3])
Resulting train dataset
Resulting Test dataset
5.- Define the classifier:
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')
6.- Pipeline the stages and fit the train model:
pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)
7.- Transform the test dataset:
predictions = model.transform(test)
8.- Output the test result, grouped by prediction:
predictions.select("prediction", "label", "features").groupBy("prediction").count().show()
As you can see, the outcome is always 0. I have tried with multiple feature variations in hopes of reducing the noise, also trying from different sources and infering the schema, with no luck and the same results.
Questions
Is the current setup, as described above, correct?
Could the null value filling on the original Dataframe be source of failure to effectively perform the prediction?
In the screenshot shown above it looks like some features are in the form of a tuple and other of a list, why? I'm guessing this could be a possible source of error. (They are representation of Dense and Sparse Vectors)
It seems your features [N1, N2, ...] are strings. You man want to cast all your features as FloatType() or something along those lines. It may be prudent to fillna() after type casting.

How to get probability from predictions using GeneralizedLinearRegression model using spark

I'm newbie to machine-learning and I was trying to implement binomial family of GeneralizedLinearRegression model using spark.
I tried this,
val trainingData = sparkSession.read.format("libsvm").load("trainingData.txt")
val testData = sparkSession.read.format("libsvm").load("testData.txt")
val glr = new GeneralizedLinearRegression().setFamily("binomial").setLink("logit").setRegParam(0.3).setMaxIter(10)
val glrModel = glr.fit(trainingData)
model.transform(testData).show()
For my testData, I got my prediction value as 1.0E-16. And when I'm using LogisticRegression, it gives probability(0.765394663) and prediction(0.0) value.
I want to know,
How to predict classes using GeneralizedLinearRegression from prediction value. Should I find classes from prediction value by using a threshold value ?
How to find probability of the predicted value ?

How to classify new training example after model training in apache spark?

Reading the src of https://spark.apache.org/docs/1.5.2/ml-ann.html :
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
Once the model has been trained how can a new training example be classified ?
Can a new training example be added to the model where the label is not set and the classifier will try to classify this training example based on the training data ?
Why is it required that the dataframe labels are of type Double ?
Firstly, the only way to add another observation to the model is by incorporating that data point into the training data, in this case to your train variable. In order to achieve this, you can convert that point into a DataFrame (obviously of only one record) and then use the unionAll method. Nevertheless, you will have to retrain the model using this new dataset.
However, to classify observations using your model you will have to convert your unclassified data into a DataFrame with the same structure that had your training data. And then use the method transform of your model. By the way, notice that models have that method, because they are subclasses of Transformer.
Finally, you have to use Double because that is the way how the LabeledPoint class was defined. It receives a Double as label and a SparseVector or DenseVector as features. I don't know the exact motivation but in my own experience, which isn't wide, all classification and regression algorithms work with float point numbers.Furthermore, gradient descent algorithm, which is widely used to fit most models, uses numbers not letters nor classes to compute the error in each iteration.