Spark Multiclass Classification Example - scala

Do you guys know where can I find examples of multiclass classification in Spark. I spent a lot of time searching in books and in the web, and so far I just know that it is possible since the latest version according the documentation.

ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib below. There are two basic options. If Estimator supports multilclass classification out-of-the-box (for example random forest) you can use it directly:
val trainRawDf = trainRaw.toDF
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val transformers = Array(
new StringIndexer().setInputCol("group").setOutputCol("label"),
new Tokenizer().setInputCol("text").setOutputCol("tokens"),
new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)
model.transform(trainRawDf)
If model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier you can use one-vs-rest strategy:
import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setLabelCol("label")
.setFeaturesCol("features")
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)
MLLib
According to the official documentation at this moment (MLlib 1.6.0) following methods support multiclass classification:
logistic regression,
decision trees,
random forests,
naive Bayes
At least some of the examples use multiclass classification:
Naive Bayes example - 3 classes
Logistic regression - 10 classes for classifier although only 2 in the example data
General framework, ignoring method specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-processes your input to create either data frame with columns representing label and features:
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
or RDD[LabeledPoint].
Spark provides broad range of useful tools designed to facilitate this process including Feature Extractors and Feature Transformers and pipelines.
You'll find a rather naive example of using Random Forest below.
First lets import required packages and create dummy data:
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
case class LabeledRecord(group: String, text: String)
val trainRaw = sc.parallelize(
LabeledRecord("foo", "foo v a y b foo") ::
LabeledRecord("bar", "x bar y bar v") ::
LabeledRecord("bar", "x a y bar z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foo", "foo x") ::
LabeledRecord("foobar", "z y x foo a b bar v") ::
Nil
)
Now let's define required transformers and process train Dataset:
// Tokenizer to process text fields
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(10)
// Indexer to convert String labels to Double
val indexer = new StringIndexer()
.setInputCol("group")
.setOutputCol("label")
.fit(trainRaw.toDF)
def transfom(rdd: RDD[LabeledRecord]) = {
val tokenized = tokenizer.transform(rdd.toDF)
val hashed = hashingTF.transform(tokenized)
val indexed = indexer.transform(hashed)
indexed
.select($"label", $"features")
.map{case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)}
}
val train: RDD[LabeledPoint] = transfom(trainRaw)
Please note that indexer is "fitted" on the train data. It simply means that categorical values used as the labels are converted to doubles. To use classifier on a new data you have to transform it first using this indexer.
Next we can train RF model:
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16
val model = RandomForest.trainClassifier(
train, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins
)
and finally test it:
val testRaw = sc.parallelize(
LabeledRecord("foo", "foo foo z z z") ::
LabeledRecord("bar", "z bar y y v") ::
LabeledRecord("bar", "a a bar a z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foobar", "a foo a bar") ::
Nil
)
val test: RDD[LabeledPoint] = transfom(testRaw)
val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)
metrics.precision
metrics.recall

Are you using Spark 1.6 rather than Spark 2.1?
I think the problem is that in spark 2.1 the transform method returns a dataset, which can be implicitly converted to a typed RDD, where as prior to that, it returns a data frame or row.
Try as a diagnostic specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.

Related

how to make faster windowing text file and machine learning over windows in spark

I'm trying to use Spark to learn multiclass logistic regression on a windowed text file. What I'm doing is first creating windows and explode them into $"word_winds". Then move the center word of each window into $"word". To fit the LogisticRegression model, I convert each different word into a class ($"label"), thereby it learns. I count the different labels to prone those with few minF samples.
The problem is that some part of the code is very very slow, even for small input files (you can use some README file to test the code). Googling, some users have been experiencing slowness by using explode. They suggest some modifications to the code in order to speed up 2x. However, I think that with a 100MB input file, this wouldn't be sufficient. Please suggest something different, probably to avoid actions that slow down the code. I'm using Spark 2.4.0 and sbt 1.2.8 on a 24-core machine.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types._
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
spark.sparkContext.setCheckpointDir("checked_dfs")
val in_file = "sample.txt"
val stratified = true
val wsize = 7
val ngram = 3
val minF = 2
val windUdf = udf{s: String => s.sliding(ngram).toList.sliding(wsize).toList}
val get_mid = udf{s: Seq[String] => s(s.size/2)}
val rm_punct = udf{s: String => s.replaceAll("""([\p{Punct}|¿|\?|¡|!]|\p{C}|\b\p{IsLetter}{1,2}\b)\s*""", "")}
// Read and remove punctuation
var df = spark.read.text(in_file)
.withColumn("value", rm_punct($"value"))
// Creating windows and explode them, and get the center word into $"word"
df = df.withColumn("char_nGrams", windUdf('value))
.withColumn("word_winds", explode($"char_nGrams"))
.withColumn("word", get_mid('word_winds))
val indexer = new StringIndexer().setInputCol("word")
.setOutputCol("label")
df = indexer.fit(df).transform(df)
val hashingTF = new HashingTF().setInputCol("word_winds")
.setOutputCol("freqFeatures")
df = hashingTF.transform(df)
val idf = new IDF().setInputCol("freqFeatures")
.setOutputCol("features")
df = idf.fit(df).transform(df)
// Remove word whose freq is less than minF
var counts = df.groupBy("label").count
.filter(col("count") > minF)
.orderBy(desc("count"))
.withColumn("id", monotonically_increasing_id())
var filtro = df.groupBy("label").count.filter(col("count") <= minF)
df = df.join(filtro, Seq("label"), "leftanti")
var dfs = if(stratified){
// Create stratified sample 'dfs'
var revs = counts.orderBy(asc("count")).select("count")
.withColumn("id", monotonically_increasing_id())
revs = revs.withColumnRenamed("count", "ascc")
// Weigh the labels (linearly) inversely ("ascc") proportional NORMALIZED weights to word ferquency
counts = counts.join(revs, Seq("id"), "inner").withColumn("weight", col("ascc")/df.count)
val minn = counts.select("weight").agg(min("weight")).first.getDouble(0) - 0.01
val maxx = counts.select("weight").agg(max("weight")).first.getDouble(0) - 0.01
counts = counts.withColumn("weight_n", (col("weight") - minn) / (maxx - minn))
counts = counts.withColumn("weight_n", when(col("weight_n") > 1.0, 1.0)
.otherwise(col("weight_n")))
var fractions = counts.select("label", "weight_n").rdd.map(x => (x(0), x(1)
.asInstanceOf[scala.Double])).collectAsMap.toMap
df.stat.sampleBy("label", fractions, 36L).select("features", "word_winds", "word", "label")
}else{ df }
dfs = dfs.checkpoint()
val lr = new LogisticRegression().setRegParam(0.01)
val Array(tr, ts) = dfs.randomSplit(Array(0.7, 0.3), seed = 12345)
val training = tr.select("word_winds", "features", "label", "word")
val test = ts.select("word_winds", "features", "label", "word")
val model = lr.fit(training)
def mapCode(m: scala.collection.Map[Any, String]) = udf( (s: Double) =>
m.getOrElse(s, "")
)
var labels = training.select("label", "word").distinct.rdd
.map(x => (x(0), x(1).asInstanceOf[String]))
.collectAsMap
var predictions = model.transform(test)
predictions = predictions.withColumn("pred_word", mapCode(labels)($"prediction"))
predictions.write.format("csv").save("spark_predictions")
spark.stop()
}
}
Since your data is somewhat small it might help if you use coalesce before explode. Sometimes it can be inefficient to have too many nodes especially if there is a lot of shuffling in your code.
Like you said, it does seem like a lot of people have issues with explode. I looked at the link you provided but no one mentioned trying flatMap instead of explode.

Spark 2 logisticregression remove threshold

I'm using Spark 2 + Scala to train LogisticRegression based binary classification model and I'm using import org.apache.spark.ml.classification.LogisticRegression, which is the new ml API in Spark 2. However, when I evaluated the model by AUROC, I did not find a way to use the probability (double in 0-1) instead of binary classification (0/1). This was previously achieved by removeThreshold(), but in ml.LogisticRegression I did not find a similar method. Thus, is there a way to do that?
The evaluator I'm using is
val evaluator = new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("rawPrediction")
.setMetricName("areaUnderROC")
val auroc = evaluator.evaluate(predictions)`
if u want to get probability output other than 0/1 output, try this:
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
val lr = new LogisticRegression()
.setMaxIter(100)
.setRegParam(0.3)
val lrModel = lr.fit(trainData)
val summary = lrModel.summary
summary.predictions.select("probability").show()
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary,
LogisticRegression}
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3)
val lrModel = lr.fit(trainData)
val trainingSummary = lrModel.summary
val predictions = lrModel.transform(test)
predictions.select("label", "probability").show()

Permutations of Models in Spark mllib

I am using Spark 1.6.1 with Scala 2.10.5 built in. I am building a set of permutation models using binary numbers but I've now hit a wall for a few days now. Here is the code (partially taken from zero323 at RDD to LabeledPoint conversion):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import scala.util.Random.{setSeed, nextDouble}
setSeed(123)
case class Record(
target: Double, x1: Double, x2: Double, x3: Double, x4:Double, x5:Double, x6:Double)
val rows = sc.parallelize(
(1 to 50).map(_ => Record(
nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble)))
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
SELECT ROUND(target, 2) target,
ROUND(x1, 2) x1,
ROUND(x2, 2) x2,
ROUND(x3, 2) x3,
ROUND(x4, 2) x4,
ROUND(x5, 2) x5,
ROUND(x6, 2) x6
FROM df""").show
Now, I want to build binary numbers with 6 digits. This equates to the number 63 in base 10:
val max_base10=63
val max_binary="%06d".format(max_base10.toBinaryString.toLong)
After that, I create a variable which will allow for column indexes to be permutated in ascending order:
def binary_permutationsSeg1 = for {a <- 1 to max_base10
} yield List(Array(0), %06d".format(a.toBinaryString.toLong).split("").map(_.toInt)) flatten // Array(0) is added as target is not an explanatory variable
This is just to make sure I get what I'm after:
binary_permutationsSeg1(62)
Now I wish to build my set of permutation models for which I MUST have an intercept.. I've learned the hard way that train() does not build an intercept in its algorithm...:
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(100)
This bit is not very elegant as I create a double def... Also, I'd like to automatically create multiModel01, multiModel02 etc in memory but I'm not able to do it....
def multiModel = for{
a <- 1 to binary_permutationsSeg1.size-1
} yield df.rdd.map(r => LabeledPoint(
r.getDouble(1),
Vectors.dense(binary_permutationsSeg1(a).zipWithIndex.filter(_._1 == 1).unzip ._2 map(r.getDouble(_)) toArray))).cache()
def run_multiModel=for{
a <- 1 to binary_permutationsSeg1.size-1
} yield regression.run(multiModel(a))
run_multiModel
When I run the line above, I don't get the 62 models I am after which leaves me perplex.
These last two paragraphs are to do with model evaluation (taken from zero323 again) but I cannot make it so that it'd automatically contrust valuesAndPreds01, valuesAndPreds02 and so forth... whilst bearig in mind that some permutations will give NaN.
val valuesAndPreds = df.rdd.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
Thanks for taking the time to look at this problem and thanks for any suggestions or tips in the right direction.
Regards,
Christian

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame but I need a mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
And:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?
The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector which DOES.
The DataFrame returned from PCA.fit() has 3 columns, and you tried to extract column number 4. For example, showing first 4 lines:
+-----+--------------------+--------------------+
|label| features| pcaFeatures|
+-----+--------------------+--------------------+
| 5.0|(780,[152,153,154...|[880.071111851977...|
| 1.0|(780,[158,159,160...|[-41.473039034112...|
| 2.0|(780,[155,156,157...|[931.444898405036...|
| 1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
So here's the transformation you need:
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support a multivariate data at this moment.
Regarding a question from the tile you can flatMap
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
Have prepared this code, please evaluate if it can help you out ->
val doubleRDD = rows.map(_.toArray).flatMap(x => x)