Naive Bayes with Apache Spark MLlib - scala

I'm using Naive Bayes with Apache Spark MLlib for text classification, following this tutorial: http://avulanov.blogspot.com/2014/08/text-classification-with-apache-spark.html
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
/* instantiate Spark context (not needed when running inside the Spark shell) */
val sc = new SparkContext("local", "test")
/* word to vector space converter, limit to 10000 words */
val htf = new HashingTF(10000)
/* load positive and negative sentences from the dataset */
/* use label 1 for the positive class, 0 for the negative class */
/* tokenize sentences and transform them into vector space model */
val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
.map { text => new LabeledPoint(1, htf.transform(text.split(" ")))}
val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
.map { text => new LabeledPoint(0, htf.transform(text.split(" ")))}
/* split the data 60% for training, 40% for testing */
val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
/* build the training set from the positive and negative splits */
val training = posSplits(0).union(negSplits(0))
/* build the test set from the positive and negative splits */
val test = posSplits(1).union(negSplits(1))
/* Multinomial Naive Bayesian classifier */
val model = NaiveBayes.train(training)
/* predict */
val predictionAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
/* metrics */
val metrics = new MulticlassMetrics(predictionAndLabels)
/* output F1-measure for all labels (0 and 1, negative and positive) */
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
But after training, what should I do if I want to know whether the sentence "Have a nice day" is positive or negative?
Thank you.

Generally speaking you need two things to make predictions on raw data:
Apply the same transformations you used for the training data. If a transformer requires fitting (like IDF, normalization, or encoding), you have to use the one fitted on the training data. Since your approach is extremely simplistic, all you need here is something like this:
val testData = htf.transform("Have a nice day".split(" "))
Use the predict method of the trained model:
model.predict(testData)
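Putting the two steps together, a minimal end-to-end sketch (reusing the htf and model values defined in the training code above; the positive/negative mapping simply mirrors the labels chosen there):
// reuse the same HashingTF instance that was used for training
val testVector = htf.transform("Have a nice day".split(" "))
// label 1.0 was used for the positive class, 0.0 for the negative one
val sentiment = if (model.predict(testVector) == 1.0) "positive" else "negative"
println(s"'Have a nice day' is predicted as $sentiment")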

Related

training a classifier with data within a partition

How can I train a classifier on the instances within a partition, when the choice of classification algorithm depends on the partition index? For example, consider the following code snippet:
val data = MLUtils.loadLibSVMFile(sc, "path to SVM file")
val r = data.mapPartitionsWithIndex((index, localdata) => {
  if (index % 2 == 0) {
    // train a NaiveBayes with localdata
    NaiveBayes.train(localdata) // Error => found: Iterator[LabeledPoint], required: RDD[LabeledPoint]
  } else {
    // train a DecisionTree classifier with localdata
    DecisionTree.train(localdata) // Error => found: Iterator[LabeledPoint], required: RDD[LabeledPoint]
  }
})
It seems to me that the error is legitimate, because the tasks are executed in their own separate JVMs and cannot themselves distribute work from within a map task. That is also why I cannot access the SparkContext in my tasks. However, does anyone have an alternative suggestion for achieving this?
Based on the discussion in the comments above, you can give this a try:
val rdd = MLUtils.loadLibSVMFile(sc, "path to SVM file")
// approach 1
val nb = rdd.sample(withReplacement = false, fraction = 0.5) // sample 50% of the records
val dt = rdd.sample(withReplacement = false, fraction = 0.5) // sample 50% of the records
// or approach 2
val Array(nb, dt) = rdd.randomSplit(Array(0.5, 0.5))
// apply the algorithms
NaiveBayes.train(nb)
DecisionTree.train(dt, strategy = ..)
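For completeness, a minimal sketch of the second approach; the DecisionTree parameters (numClasses, impurity, depth, bins) are placeholder values, and trainClassifier is used so that no Strategy object has to be constructed:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val rdd = MLUtils.loadLibSVMFile(sc, "path to SVM file")
// split once so the two training sets are disjoint
val Array(nbData, dtData) = rdd.randomSplit(Array(0.5, 0.5))

val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.trainClassifier(
  dtData,
  numClasses = 2,                       // placeholder: set to the number of classes in your data
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)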

How do I efficiently debug a long-running Spark application?

I have a Spark application which cleans and prepares a data set and then applies a K-means clustering algorithm to it. Afterwards, some metrics of the resulting clusters are calculated.
Naturally, computing the K-means clusters is a long-running task. When debugging the calculation of the cluster metrics I cannot iterate quickly on my code, because the clusters are recalculated on every execution. How do I solve this?
Ideas I have are:
Writing a unit test for the metric calculation methods, but it would be cumbersome to mock the cluster data.
Serialising the computed K-means cluster data to disk (see the sketch after the code below).
Any help is appreciated.
Code for reference:
def main(args: Array[String]): Unit = {
// -- start of long running execution
val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
val raw = rawPostings(lines)
val grouped = groupedPostings(raw)
val scored = scoredPostings(grouped)
val vectors = vectorPostings(scored)
// assert(vectors.count() == 2121822, "Incorrect number of vectors: " + vectors.count())
val means = kmeans(sampleVectors(vectors), vectors, debug = true)
// -- end of long running execution
val results = clusterResults(means, vectors) // < -- this method operates on the result of previous ops
printResults(results)
}
Implementation of clusterResults:
def clusterResults(means: Array[(Int, Int)], vectors: RDD[(LangIndex, HighScore)]): Array[(String, Double, Int, Int)] = {
  // -- note that means is quite expensive to compute
  val closest = vectors.map(p => (findClosest(p, means), p)) // -- (Int, (LangIndex, HighScore))
  val closestGrouped = closest.groupByKey() // -- (Int, Iterable[(LangIndex, HighScore)])
  vectors.take(3).foreach(println)
  val median = closestGrouped.mapValues { vs =>
    // groupBy(identity) groups equal elements together; Predef.identity is equivalent to x => x
    val langId: Int = vs.map(_._1).groupBy(identity).maxBy(_._2.size)._1 // most common language in the cluster
    val langLabel: String = langs(langId / langSpread)
    // percent of the questions in the most common language (questions in that language divided by total questions)
    val langPercent: Double = vs.map(_._1).count(_.equals(langId)) / vs.size
    val clusterSize: Int = vs.size
    val medianScore: Int = {
      val scores = vs.map(_._2).toArray.sorted
      scores(scores.size / 2) // simple median (upper middle value for even-sized clusters)
    }
    (langLabel, langPercent, clusterSize, medianScore)
  }
  median.collect().map(_._2).sortBy(_._4)
}
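To act on the second idea (serialising the computed K-means result), one option is to persist means once and reload it on later runs while iterating on clusterResults. A rough sketch, where the path and the existence check are only illustrative:
val meansPath = "target/debug/kmeans-means" // illustrative local path

val means: Array[(Int, Int)] =
  if (new java.io.File(meansPath).exists()) {
    // cheap: reload the previously computed means
    sc.objectFile[(Int, Int)](meansPath).collect()
  } else {
    // expensive: compute once, then persist for the next debugging run
    val computed = kmeans(sampleVectors(vectors), vectors, debug = true)
    sc.parallelize(computed.toSeq).saveAsObjectFile(meansPath)
    computed
  }

val results = clusterResults(means, vectors)
printResults(results)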

How can I divide rdd to specific number of rdds

I have the below code which generates RDD from a text file:
val data = sparkContext.textFile(path)
val k = 3
How can I divide data into k distinct RDDs?
You can use RDD.randomSplit, which will divide the existing RDD based on the weights passed as parameters and return an Array of RDDs.
Internally it works like this:
/**
 * Randomly splits this RDD with the provided weights.
 *
 * @param weights weights for splits, will be normalized if they don't sum to 1
 * @param seed random seed
 *
 * @return split RDDs in an array
 */
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  require(weights.forall(_ >= 0),
    s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
  require(weights.sum > 0,
    s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  withScope {
    val sum = weights.sum
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    normalizedCumWeights.sliding(2).map { x =>
      randomSampleWithRange(x(0), x(1), seed)
    }.toArray
  }
}
NOTE: the weights for the splits will be normalized if they don't sum to 1.
Based on the above behavior I created a sample snippet like the one below, which works:
def getDoubleWeights(numParts: Int): Array[Double] = {
  Array.fill[Double](numParts)(1.0d)
}
The caller would look like this:
val rddWithNumParts = yourRDD.randomSplit(getDoubleWeights(yourRDD.partitions.length))
This will divide the data uniformly into that number of RDDs.
NOTE: the same applies to DataFrame.randomSplit below as well.
You can also convert the RDD into a DataFrame by supplying a schema, for example sqlContext.createDataFrame(rddOfRow, schema).
Later you can call this method:
DataFrame[] randomSplit(double[] weights) Randomly splits this
DataFrame with the provided weights.
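For example, a minimal Scala sketch of that DataFrame variant (rddOfRow, schema and k are assumed to exist already):
import org.apache.spark.sql.DataFrame

// rddOfRow, schema and k come from your own code
val df = sqlContext.createDataFrame(rddOfRow, schema)
val parts: Array[DataFrame] = df.randomSplit(getDoubleWeights(k))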
Another thought I had is dividing based on the number of partitions, i.e. RDD.mapPartitionsWithIndex(...).
For each partition you get an Iterator (which can be turned into an RDD), so you can arrange things so that the number of partitions equals the number of RDDs.
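A possible sketch of that partition-based idea, building k RDDs that each keep every k-th partition (path is a placeholder):
import org.apache.spark.rdd.RDD

val k = 3
val data = sparkContext.textFile(path) // path is a placeholder

val parts: Seq[RDD[String]] = (0 until k).map { i =>
  data.mapPartitionsWithIndex(
    (index, iter) => if (index % k == i) iter else Iterator.empty,
    preservesPartitioning = true)
}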

Mean Squared Error (MSE) returns a huge number

I'm new to Scala and Spark in general. I'm using this code for regression (based on this link from the official Spark site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
The dataset that I'm using can be seen here: Pastebin link.
So my question is: why does the MSE come out as 889717.74, which is a huge number?
Edit: As the commenters suggested, I tried the following:
1) I changed the step size to the default and the MSE now comes back as NaN.
2) If I try this constructor: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept=True), the spark-shell returns an error (error: not found: value True).
You've passed a tiny step size and capped the number of iterations at 100, so the maximum amount by which your parameters can change is 0.00000001 * 100 = 0.000001. Try using the default step size; I imagine that will fix it.
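As a side note on your second edit: Scala booleans are lowercase (true), and the train helpers don't take an intercept flag; the intercept is usually enabled through the class-based API instead. A rough sketch, assuming the RDD-based MLlib API of Spark 1.x, with the step size and iteration count only as examples:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.setIntercept(true) // note the lowercase `true`
lr.optimizer
  .setNumIterations(100)
  .setStepSize(1.0) // the default step size; tune as needed
val model = lr.run(parsedData)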

Retrieving not only top one predictions from Multiclass Regression with Spark [duplicate]

I'm running a Bernoulli Naive Bayes using this code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is: how can I get the probability of membership in class 0 (or 1), and compute the AUC? I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1","1,1 1 0"))
// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println
// For one specific vector, I'm taking the first vector in the parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC:
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
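If you would rather feed continuous scores into the ROC computation instead of hard 0/1 predictions, a variant of the snippet above (again Spark 1.5+, since it relies on predictProbabilities) could look like this:
// use the probability of the positive class as the score
val posIndex = model.labels.indexOf(1.0)
val scoreAndLabels = test.map { point =>
  (model.predictProbabilities(point.features)(posIndex), point.label)
}
val probMetrics = new BinaryClassificationMetrics(scoreAndLabels)
val probAuROC = probMetrics.areaUnderROC()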
Concerning the inquiry from the chat:
val results = parsedData.map { lp =>
val probs: Vector = model.predictProbabilities(lp.features)
(for (i <- 0 to (probs.size - 1)) yield ((lp.label, labels(i), probs(i))))
}.flatMap(identity)
results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
And if you are only interested in the argmax classes:
val results = training.map { lp => val probs: Vector = model.predictProbabilities(lp.features)
val bestClass = probs.argmax
(labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for Pyspark users)
It seems some people are having trouble getting probabilities using pyspark and mllib. That's expected: spark-mllib doesn't expose that functionality in pyspark.
Thus you'll need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
For more information about Naive Bayes in spark-ml, please refer to the official documentation here.