How to convert spark DataFrame to RDD mllib LabeledPoints? - scala

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame but I need a mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
And:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?

The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector which DOES.
The DataFrame returned from PCA.fit() has 3 columns, and you tried to extract column number 4. For example, showing first 4 lines:
+-----+--------------------+--------------------+
|label| features| pcaFeatures|
+-----+--------------------+--------------------+
| 5.0|(780,[152,153,154...|[880.071111851977...|
| 1.0|(780,[158,159,160...|[-41.473039034112...|
| 2.0|(780,[155,156,157...|[931.444898405036...|
| 1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
So here's the transformation you need:
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))

Related

How to calculate euclidean distance of each row in a dataframe to a constant reference array

I have a dataframe which is created from parquet files that has 512 columns(all float values).
I am trying to calculate euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with spark 2.1 and Scala. Here is the zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector,
myvec:Vector)=>Vectors.sqdist(features, myvec))
//Cretae ref vector
val myScalaVec = Vectors.dense( Array.fill(512)(25.44859))
val distDF =
training2.withColumn("euc",eucDisUdf($"features",myScalaVec))
This code gives the following error for eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I appreciate any idea how to eliminate this error and compute distances properly in scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn( "euc", eucDisUdf(myScalaVec)($"features") )

Spark load Data for Decision tree - Change Label in LabelledPoint

I try to do the example for decision tree in spark at https://spark.apache.org/docs/latest/mllib-decision-tree.html
I have downloaded a1a dataset from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a
The dataset is in LIBSVM format where the two classes have labels +1.0 and -1.0
When I try
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "/user/cloudera/testDT/a1a.t")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
| impurity, maxDepth, maxBins)
I get:
java.lang.IllegalArgumentException: GiniAggregator given label -1.0 but requires label is non-negative.
So I tried to change the label -1.0 to 0.0. I tried something like
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
{ if (a.label == -1.0) {a.label = 0.0} }
Where I get error:
reassignment to val
So my question is this: How can I change the labels of my data? Or is there a workaround so DecisionTree.trainClassifier() to work with data with negative labels?
TL;DR You cannot resign value argument of a Product class, and even if it was possible (declared as var), you should never modify data in place in Spark.
How about:
def changeLabel(a: org.apache.spark.mllib.regression.LabeledPoint) =
if (a.label == -1.0) a.copy(label = 0.0) else a
scala> changeLabel(LabeledPoint(-1.0, Vectors.dense(1.0, 2.0, 3.0)))
res1: org.apache.spark.mllib.regression.LabeledPoint = (0.0,[1.0,2.0,3.0])
scala> changeLabel(LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0)))
res2: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[1.0,2.0,3.0])

Building a random forest in spark, explanation?

I have the a data frame df with the following structure:
amount gender_num marital_num
10000 1 1
20000 1 2
1400 2 1
Lets say I am building an ML to predict the column 'gender_num' in spark using random forest
I am doing the following:
val df1 = df("loan_amount", 'loan_amount.cast("Double")).withColumn("gender_num", 'gender_num.cast("String")).
withColumn("marital_num", 'marital_num.cast("String"))
val labeled = df1.map(row => LabeledPoint(df1.gender_num, Vectors.dense(df1.loan_amount, df1.marital_num)))
val numClasses = 7
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(labeled, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Error:
My code is failing at the second step:
138: error: value gender_num is not a member of org.apache.spark.sql.DataFrame
I would really appreciate if someone can explain this to me, the documentation is very hard to follow, newbie here!
That's because you are using R like syntax of DataFrame.
You should access row's data like this:
val labeled = df1.map { row => LabeledPoint(row(1).toDouble, Vectors.dense(row(0).toDouble, row(1).toDouble))}
You can also create case class and use Dataset syntax:
case class ParsedData (amount : Double, gender_num : Int, marital_num : Int)
val labeled = df1.as[ParsedData].map(row => LabeledPoint(df1.gender_num, Vectors.dense(df1.loan_amount, df1.marital_num)))

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support a multivariate data at this moment.
Regarding a question from the tile you can flatMap
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
Have prepared this code, please evaluate if it can help you out ->
val doubleRDD = rows.map(_.toArray).flatMap(x => x)

Spark Multiclass Classification Example

Do you guys know where can I find examples of multiclass classification in Spark. I spent a lot of time searching in books and in the web, and so far I just know that it is possible since the latest version according the documentation.
ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib below. There are two basic options. If Estimator supports multilclass classification out-of-the-box (for example random forest) you can use it directly:
val trainRawDf = trainRaw.toDF
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val transformers = Array(
new StringIndexer().setInputCol("group").setOutputCol("label"),
new Tokenizer().setInputCol("text").setOutputCol("tokens"),
new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)
model.transform(trainRawDf)
If model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier you can use one-vs-rest strategy:
import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setLabelCol("label")
.setFeaturesCol("features")
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)
MLLib
According to the official documentation at this moment (MLlib 1.6.0) following methods support multiclass classification:
logistic regression,
decision trees,
random forests,
naive Bayes
At least some of the examples use multiclass classification:
Naive Bayes example - 3 classes
Logistic regression - 10 classes for classifier although only 2 in the example data
General framework, ignoring method specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-processes your input to create either data frame with columns representing label and features:
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
or RDD[LabeledPoint].
Spark provides broad range of useful tools designed to facilitate this process including Feature Extractors and Feature Transformers and pipelines.
You'll find a rather naive example of using Random Forest below.
First lets import required packages and create dummy data:
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
case class LabeledRecord(group: String, text: String)
val trainRaw = sc.parallelize(
LabeledRecord("foo", "foo v a y b foo") ::
LabeledRecord("bar", "x bar y bar v") ::
LabeledRecord("bar", "x a y bar z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foo", "foo x") ::
LabeledRecord("foobar", "z y x foo a b bar v") ::
Nil
)
Now let's define required transformers and process train Dataset:
// Tokenizer to process text fields
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(10)
// Indexer to convert String labels to Double
val indexer = new StringIndexer()
.setInputCol("group")
.setOutputCol("label")
.fit(trainRaw.toDF)
def transfom(rdd: RDD[LabeledRecord]) = {
val tokenized = tokenizer.transform(rdd.toDF)
val hashed = hashingTF.transform(tokenized)
val indexed = indexer.transform(hashed)
indexed
.select($"label", $"features")
.map{case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)}
}
val train: RDD[LabeledPoint] = transfom(trainRaw)
Please note that indexer is "fitted" on the train data. It simply means that categorical values used as the labels are converted to doubles. To use classifier on a new data you have to transform it first using this indexer.
Next we can train RF model:
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16
val model = RandomForest.trainClassifier(
train, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins
)
and finally test it:
val testRaw = sc.parallelize(
LabeledRecord("foo", "foo foo z z z") ::
LabeledRecord("bar", "z bar y y v") ::
LabeledRecord("bar", "a a bar a z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foobar", "a foo a bar") ::
Nil
)
val test: RDD[LabeledPoint] = transfom(testRaw)
val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)
metrics.precision
metrics.recall
Are you using Spark 1.6 rather than Spark 2.1?
I think the problem is that in spark 2.1 the transform method returns a dataset, which can be implicitly converted to a typed RDD, where as prior to that, it returns a data frame or row.
Try as a diagnostic specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.