I am using Spark 1.6.1 with Scala 2.10.5 built in. I am building a set of permutation models using binary numbers but I've now hit a wall for a few days now. Here is the code (partially taken from zero323 at RDD to LabeledPoint conversion):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import scala.util.Random.{setSeed, nextDouble}
setSeed(123)
case class Record(
target: Double, x1: Double, x2: Double, x3: Double, x4:Double, x5:Double, x6:Double)
val rows = sc.parallelize(
(1 to 50).map(_ => Record(
nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble)))
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
SELECT ROUND(target, 2) target,
ROUND(x1, 2) x1,
ROUND(x2, 2) x2,
ROUND(x3, 2) x3,
ROUND(x4, 2) x4,
ROUND(x5, 2) x5,
ROUND(x6, 2) x6
FROM df""").show
Now, I want to build binary numbers with 6 digits. This equates to the number 63 in base 10:
val max_base10=63
val max_binary="%06d".format(max_base10.toBinaryString.toLong)
After that, I create a variable which will allow for column indexes to be permutated in ascending order:
def binary_permutationsSeg1 = for {a <- 1 to max_base10
} yield List(Array(0), %06d".format(a.toBinaryString.toLong).split("").map(_.toInt)) flatten // Array(0) is added as target is not an explanatory variable
This is just to make sure I get what I'm after:
binary_permutationsSeg1(62)
Now I wish to build my set of permutation models for which I MUST have an intercept.. I've learned the hard way that train() does not build an intercept in its algorithm...:
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(100)
This bit is not very elegant as I create a double def... Also, I'd like to automatically create multiModel01, multiModel02 etc in memory but I'm not able to do it....
def multiModel = for{
a <- 1 to binary_permutationsSeg1.size-1
} yield df.rdd.map(r => LabeledPoint(
r.getDouble(1),
Vectors.dense(binary_permutationsSeg1(a).zipWithIndex.filter(_._1 == 1).unzip ._2 map(r.getDouble(_)) toArray))).cache()
def run_multiModel=for{
a <- 1 to binary_permutationsSeg1.size-1
} yield regression.run(multiModel(a))
run_multiModel
When I run the line above, I don't get the 62 models I am after which leaves me perplex.
These last two paragraphs are to do with model evaluation (taken from zero323 again) but I cannot make it so that it'd automatically contrust valuesAndPreds01, valuesAndPreds02 and so forth... whilst bearig in mind that some permutations will give NaN.
val valuesAndPreds = df.rdd.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
Thanks for taking the time to look at this problem and thanks for any suggestions or tips in the right direction.
Regards,
Christian
Related
I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame but I need a mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
And:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?
The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector which DOES.
The DataFrame returned from PCA.fit() has 3 columns, and you tried to extract column number 4. For example, showing first 4 lines:
+-----+--------------------+--------------------+
|label| features| pcaFeatures|
+-----+--------------------+--------------------+
| 5.0|(780,[152,153,154...|[880.071111851977...|
| 1.0|(780,[158,159,160...|[-41.473039034112...|
| 2.0|(780,[155,156,157...|[931.444898405036...|
| 1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
So here's the transformation you need:
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
I have a very simple code to try Cosine Similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I run this code on Amazon AWS which has Spark 1.5 however I got the following message for the last two lines:
"Erroe: value columnSimilarities is not a memeber of org.apache.spark.rdd.RDD[(int,int)]"
Could you please help to resolve this issue?
I found the answer. I need to convert the matrix to RDD. Here is the right code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
def matrixToRDD(m: Matrix): RDD[Vector] = {
val columns = m.toArray.grouped(m.numRows)
val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
val vectors = rows.map(row => new DenseVector(row.toArray))
sc.parallelize(vectors)
}
val dm: Matrix = Matrices.dense(5, 5,Array(1,2,3,4,5,1,2,3,4,5,1,2,4,5,8,3,4,1,2,7,7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers
I'm get into spark and I have problems with Vectors
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file with contains the output of a RDD(Vector):
dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what a try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I have the error because it read [0.510736518683609 as a number.
Exist any form to load directly the vector stored in the text-file without doing the second line? How I can delete "[" in the map stage ?
I'm really new in spark, sorry if it's a very obvious question.
Given the input the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data all you need is:
rdd.map(Vectors.parse)
If you expect malformed / empty lines you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it :
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map {
s =>
val vect = s.replaceAll("\\[", "").replaceAll("\\]","").split(',').map(_.toDouble)
Vectors.dense(vect)
}
I've just broke the map into line for readability purpose.
Note: Remember, it's simple a string processing on each line.
How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support a multivariate data at this moment.
Regarding a question from the tile you can flatMap
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
Have prepared this code, please evaluate if it can help you out ->
val doubleRDD = rows.map(_.toArray).flatMap(x => x)
Do you guys know where can I find examples of multiclass classification in Spark. I spent a lot of time searching in books and in the web, and so far I just know that it is possible since the latest version according the documentation.
ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib below. There are two basic options. If Estimator supports multilclass classification out-of-the-box (for example random forest) you can use it directly:
val trainRawDf = trainRaw.toDF
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val transformers = Array(
new StringIndexer().setInputCol("group").setOutputCol("label"),
new Tokenizer().setInputCol("text").setOutputCol("tokens"),
new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)
model.transform(trainRawDf)
If model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier you can use one-vs-rest strategy:
import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setLabelCol("label")
.setFeaturesCol("features")
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)
MLLib
According to the official documentation at this moment (MLlib 1.6.0) following methods support multiclass classification:
logistic regression,
decision trees,
random forests,
naive Bayes
At least some of the examples use multiclass classification:
Naive Bayes example - 3 classes
Logistic regression - 10 classes for classifier although only 2 in the example data
General framework, ignoring method specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-processes your input to create either data frame with columns representing label and features:
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
or RDD[LabeledPoint].
Spark provides broad range of useful tools designed to facilitate this process including Feature Extractors and Feature Transformers and pipelines.
You'll find a rather naive example of using Random Forest below.
First lets import required packages and create dummy data:
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
case class LabeledRecord(group: String, text: String)
val trainRaw = sc.parallelize(
LabeledRecord("foo", "foo v a y b foo") ::
LabeledRecord("bar", "x bar y bar v") ::
LabeledRecord("bar", "x a y bar z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foo", "foo x") ::
LabeledRecord("foobar", "z y x foo a b bar v") ::
Nil
)
Now let's define required transformers and process train Dataset:
// Tokenizer to process text fields
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(10)
// Indexer to convert String labels to Double
val indexer = new StringIndexer()
.setInputCol("group")
.setOutputCol("label")
.fit(trainRaw.toDF)
def transfom(rdd: RDD[LabeledRecord]) = {
val tokenized = tokenizer.transform(rdd.toDF)
val hashed = hashingTF.transform(tokenized)
val indexed = indexer.transform(hashed)
indexed
.select($"label", $"features")
.map{case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)}
}
val train: RDD[LabeledPoint] = transfom(trainRaw)
Please note that indexer is "fitted" on the train data. It simply means that categorical values used as the labels are converted to doubles. To use classifier on a new data you have to transform it first using this indexer.
Next we can train RF model:
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16
val model = RandomForest.trainClassifier(
train, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins
)
and finally test it:
val testRaw = sc.parallelize(
LabeledRecord("foo", "foo foo z z z") ::
LabeledRecord("bar", "z bar y y v") ::
LabeledRecord("bar", "a a bar a z") ::
LabeledRecord("foobar", "foo v b bar z") ::
LabeledRecord("foobar", "a foo a bar") ::
Nil
)
val test: RDD[LabeledPoint] = transfom(testRaw)
val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)
metrics.precision
metrics.recall
Are you using Spark 1.6 rather than Spark 2.1?
I think the problem is that in spark 2.1 the transform method returns a dataset, which can be implicitly converted to a typed RDD, where as prior to that, it returns a data frame or row.
Try as a diagnostic specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.