How to increase maxMemoryInMB for DecisionTree - Scala

I am trying to train a model with a DecisionTree in Spark using Scala.
My code is as follows:
val numClasses = 19413
val categoricalFeaturesInfo = Map[Int, Int](5 -> 14)
val impurity = "gini"
val maxDepth = 5
val maxBins = 23000
val model = DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
However, when I run it, I get an IllegalArgumentException telling me the minimum maxMemoryInMB should be 8275. I tried looking up how to increase that value but have not found any results. Any help would be greatly appreciated!
Kind Regards

I had the same issue with Spark 1.6.2; the solution was to use a Strategy:
import org.apache.spark.mllib.tree.configuration.Strategy
val s = Strategy.defaultStrategy("Classification")
s.setMaxMemoryInMB(756)
... /* other settings */
val model = DecisionTree.train(trainingVector, s)
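Filling in the original question's settings, a minimal sketch might look like this (assuming the same trainData, and raising maxMemoryInMB to at least the 8275 MB the exception asked for):
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Strategy
// Start from the default classification strategy and override the values from the question.
val strategy = Strategy.defaultStrategy("Classification")
strategy.setNumClasses(19413)
strategy.setCategoricalFeaturesInfo(Map(5 -> 14))
strategy.setMaxDepth(5)
strategy.setMaxBins(23000)
strategy.setMaxMemoryInMB(8275) // the minimum reported by the exception
val model = DecisionTree.train(trainData, strategy)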

If you are using Spark 1.3.1 as I do, this code can help you:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini
val strategy = new Strategy(Algo.Classification, Gini, maxDepth1,
  numClasses1, maxBins = maxBins1,
  categoricalFeaturesInfo = categoricalFeaturesInfo1,
  maxMemoryInMB = 512)
val model1 = DecisionTree.train(trainingData, strategy)

Related

How to prepare data to apply RandomForest?

I have a CSV file which contains userId, MovieId and Rating. I want to convert this file into one containing label and features,
like in How to prepare data into a LibSVM format from DataFrame?
I need to separate the rating column and use it as the label in a LabeledPoint. To apply the random forest algorithm I need a label column in the file, but it doesn't exist.
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(3)
.fit(assembled_df)
val pcaTrainingData = pca.transform(assembled_df).select("id","features","pcaFeatures")
val labeled = pca.transform(assembled_df).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
How can I make the label column?
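A minimal sketch of one approach: derive the label from the Rating column before assembling the features. The column names (userId, movieId, rating) and the DataFrame name ratings_df are assumptions based on the question, not code from a posted answer:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
// Cast the rating to double and expose it as "label",
// so the later getAs[Double]("label") call has a column to read.
val withLabel = ratings_df.withColumn("label", col("rating").cast("double"))
// Assemble the remaining numeric columns into the "features" vector used by PCA above.
val assembled_df = new VectorAssembler()
  .setInputCols(Array("userId", "movieId"))
  .setOutputCol("features")
  .transform(withLabel)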

Spark load Data for Decision tree - Change Label in LabeledPoint

I am trying to run the decision tree example in Spark at https://spark.apache.org/docs/latest/mllib-decision-tree.html
I have downloaded a1a dataset from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a
The dataset is in LIBSVM format where the two classes have labels +1.0 and -1.0
When I try
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "/user/cloudera/testDT/a1a.t")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
I get:
java.lang.IllegalArgumentException: GiniAggregator given label -1.0 but requires label is non-negative.
So I tried to change the label -1.0 to 0.0. I tried something like
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
{ if (a.label == -1.0) {a.label = 0.0} }
Where I get error:
reassignment to val
So my question is this: How can I change the labels of my data? Or is there a workaround so that DecisionTree.trainClassifier() works with data with negative labels?
TL;DR You cannot reassign a value argument of a Product class, and even if it were possible (declared as var), you should never modify data in place in Spark.
How about:
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
if (a.label == -1.0) a.copy(label = 0.0) else a
scala> changeLabel(LabeledPoint(-1.0, Vectors.dense(1.0, 2.0, 3.0)))
res1: org.apache.spark.mllib.regression.LabeledPoint = (0.0,[1.0,2.0,3.0])
scala> changeLabel(LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0)))
res2: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[1.0,2.0,3.0])
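Applied to the question's pipeline, the helper can simply be mapped over the loaded RDD before splitting (a sketch reusing the question's variable names):
// Relabel -1.0 to 0.0 across the whole dataset, then split and train as before.
val relabeled = data.map(changeLabel)
val Array(trainingData, testData) = relabeled.randomSplit(Array(0.7, 0.3))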

Spark 2 logisticregression remove threshold

I'm using Spark 2 + Scala to train a LogisticRegression-based binary classification model, using import org.apache.spark.ml.classification.LogisticRegression, the new ml API in Spark 2. However, when I evaluated the model by AUROC, I did not find a way to use the probability (a double in 0-1) instead of the binary prediction (0/1). This was previously achieved by removeThreshold(), but in ml.LogisticRegression I have not found a similar method. So, is there a way to do that?
The evaluator I'm using is
val evaluator = new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("rawPrediction")
.setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(predictions)
If you want probability output instead of 0/1 output, try this:
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression()
.setMaxIter(100)
.setRegParam(0.3)
val lrModel = lr.fit(trainData)
val summary = lrModel.summary
summary.predictions.select("probability").show()
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3)
val lrModel = lr.fit(trainData)
val trainingSummary = lrModel.summary
val predictions = lrModel.transform(test)
predictions.select("label", "probability").show()
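If the goal is the AUROC itself, note that BinaryClassificationEvaluator already scores on the continuous values in rawPrediction rather than on thresholded 0/1 predictions; it can also be pointed at the probability column. A sketch, assuming the predictions DataFrame from above:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
// The evaluator accepts either a double score or a 2-element vector column,
// so "probability" works as the raw prediction column here.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probability")
  .setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(predictions)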

Building a random forest in spark, explanation?

I have a data frame df with the following structure:
amount gender_num marital_num
10000 1 1
20000 1 2
1400 2 1
Let's say I am building an ML model to predict the column 'gender_num' in Spark using random forest.
I am doing the following:
val df1 = df.withColumn("loan_amount", 'loan_amount.cast("Double")).withColumn("gender_num", 'gender_num.cast("String")).
withColumn("marital_num", 'marital_num.cast("String"))
val labeled = df1.map(row => LabeledPoint(df1.gender_num, Vectors.dense(df1.loan_amount, df1.marital_num)))
val numClasses = 7
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Error:
My code is failing at the second step:
138: error: value gender_num is not a member of org.apache.spark.sql.DataFrame
I would really appreciate it if someone could explain this to me; the documentation is very hard to follow. Newbie here!
That's because you are using R-like syntax on a DataFrame.
You should access row's data like this:
// Fields by position: 0 = loan_amount (feature), 1 = gender_num (label), 2 = marital_num (feature)
val labeled = df1.rdd.map { row => LabeledPoint(row.getDouble(1), Vectors.dense(row.getDouble(0), row.getDouble(2))) }
You can also create a case class and use Dataset syntax:
case class ParsedData (amount : Double, gender_num : Int, marital_num : Int)
val labeled = df1.as[ParsedData].map(row => LabeledPoint(row.gender_num, Vectors.dense(row.amount, row.marital_num)))
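Either way, the columns that go into a LabeledPoint have to end up numeric, so casting gender_num and marital_num to String (as in the question) gets in the way. A rough end-to-end sketch under that assumption, with column names taken from the question's table:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.sql.functions.col
// Keep every column numeric and build the RDD[LabeledPoint] the mllib API expects.
val labeled = df
  .select(col("amount").cast("double"), col("gender_num").cast("double"), col("marital_num").cast("double"))
  .rdd
  .map(row => LabeledPoint(row.getDouble(1), Vectors.dense(row.getDouble(0), row.getDouble(2))))
// numClasses and the remaining parameters are the values defined in the question.
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)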

Spark Multiclass Classification Example

Does anyone know where I can find examples of multiclass classification in Spark? I have spent a lot of time searching in books and on the web, and so far I only know that it is possible since the latest version, according to the documentation.
ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example random forest) you can use it directly:
val trainRawDf = trainRaw.toDF
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val transformers = Array(
new StringIndexer().setInputCol("group").setOutputCol("label"),
new Tokenizer().setInputCol("text").setOutputCol("tokens"),
new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)
model.transform(trainRawDf)
If the model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier, you can use the one-vs-rest strategy:
import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setLabelCol("label")
.setFeaturesCol("features")
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)
MLLib
According to the official documentation, at this moment (MLlib 1.6.0) the following methods support multiclass classification:
logistic regression,
decision trees,
random forests,
naive Bayes
At least some of the examples use multiclass classification:
Naive Bayes example - 3 classes
Logistic regression - 10 classes for classifier although only 2 in the example data
The general framework, ignoring method-specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-process your input to create either a data frame with columns representing label and features:
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
or RDD[LabeledPoint].
Spark provides a broad range of useful tools designed to facilitate this process, including Feature Extractors, Feature Transformers and pipelines.
You'll find a rather naive example of using Random Forest below.
First let's import the required packages and create dummy data:
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
case class LabeledRecord(group: String, text: String)
val trainRaw = sc.parallelize(
LabeledRecord("foo", "foo v a y b foo") ::
LabeledRecord("bar", "x bar y bar v") ::
LabeledRecord("bar", "x a y bar z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foo", "foo x") ::
LabeledRecord("foobar", "z y x foo a b bar v") ::
Nil
)
Now let's define the required transformers and process the training Dataset:
// Tokenizer to process text fields
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(10)
// Indexer to convert String labels to Double
val indexer = new StringIndexer()
.setInputCol("group")
.setOutputCol("label")
.fit(trainRaw.toDF)
def transform(rdd: RDD[LabeledRecord]) = {
val tokenized = tokenizer.transform(rdd.toDF)
val hashed = hashingTF.transform(tokenized)
val indexed = indexer.transform(hashed)
indexed
.select($"label", $"features")
.map{case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)}
}
val train: RDD[LabeledPoint] = transform(trainRaw)
Please note that the indexer is "fitted" on the training data. This simply means that the categorical values used as labels are converted to doubles. To use the classifier on new data you have to transform it first using this indexer.
Next we can train the RF model:
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16
val model = RandomForest.trainClassifier(
train, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins
)
and finally test it:
val testRaw = sc.parallelize(
LabeledRecord("foo", "foo foo z z z") ::
LabeledRecord("bar", "z bar y y v") ::
LabeledRecord("bar", "a a bar a z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foobar", "a foo a bar") ::
Nil
)
val test: RDD[LabeledPoint] = transform(testRaw)
val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)
metrics.precision
metrics.recall
Are you using Spark 1.6 rather than Spark 2.1?
I think the problem is that in Spark 2.1 the transform method returns a Dataset, which can be implicitly converted to a typed RDD, whereas prior to that it returned a DataFrame of Rows.
As a diagnostic, try specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.
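Concretely, that diagnostic just pins the declared return type of the helper defined above (same body, only the annotation added):
// With the return type pinned, a Dataset/RDD mismatch between Spark versions
// shows up as a compile error at this definition instead of further downstream.
def transform(rdd: RDD[LabeledRecord]): RDD[LabeledPoint] = {
  val tokenized = tokenizer.transform(rdd.toDF)
  val hashed = hashingTF.transform(tokenized)
  val indexed = indexer.transform(hashed)
  indexed
    .select($"label", $"features")
    .map { case Row(label: Double, features: Vector) =>
      LabeledPoint(label, features) }
}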