VarianceThreshold Function For Data Cleansing - scala

I have the following function that I want to use to see how many features are selected based on different Threshold values for the variance.
import org.apache.spark.ml.feature.VarianceThresholdSelector
def varianceThreshold(df: DataFrame, thresholds: Seq[Threshold]): Seq[(Threshold, DataFrame)] = {
thresholds.map(threshold => {
val selector = new VarianceThresholdSelector()
.setVarianceThreshold(threshold)
.setFeaturesCol("features")
.setOutputCol("selectedFeatures")
(threshold, selector.fit(df).transform(df))
})
}
So far so good. I have a DataFrame that looks like this:
Now my question is: if col2 is the target variable, i.e., the value that I'm trying to predict, how can I group all the other columns so that I can pass them as a single features column? For example, I came across this example in the Spark documentation:
import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
(1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
(2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
(3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
(4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
(5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
(6, Vectors.dense(8.0, 9.0, 6.0, 0.0, 0.0, 0.0))
)
val df = spark.createDataset(data).toDF("id", "features")
val selector = new VarianceThresholdSelector()
.setVarianceThreshold(8.0)
.setFeaturesCol("features")
.setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)
println(s"Output: Features with variance lower than" +
s" ${selector.getVarianceThreshold} are removed.")
result.show()
So for my example what will be the featureCol or rather how can I get my individual columns as a featuresCol array?
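To make concrete what the selector in the documentation example keeps, here is a minimal plain-Scala sketch (no Spark required). It assumes the selector compares the unbiased sample variance of each feature against the threshold and keeps only features whose variance is strictly greater. The names `VarianceSketch`, `sampleVariance` and `selectedIndices` are hypothetical helpers, not part of Spark.

```scala
// Plain-Scala illustration of variance-based feature selection.
// Assumption: sample variance (divide by n - 1), keep features with variance > threshold.
object VarianceSketch {
  // The six feature rows from the documentation example above.
  val rows: Seq[Array[Double]] = Seq(
    Array(6.0, 7.0, 0.0, 7.0, 6.0, 0.0),
    Array(0.0, 9.0, 6.0, 0.0, 5.0, 9.0),
    Array(0.0, 9.0, 3.0, 0.0, 5.0, 5.0),
    Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0),
    Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0),
    Array(8.0, 9.0, 6.0, 0.0, 0.0, 0.0)
  )

  // Unbiased sample variance; assumes at least two values.
  def sampleVariance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
  }

  // Indices of the feature columns whose variance exceeds the threshold.
  def selectedIndices(data: Seq[Array[Double]], threshold: Double): Seq[Int] =
    data.head.indices.filter { j =>
      sampleVariance(data.map(row => row(j))) > threshold
    }
}
```

Under these assumptions, `VarianceSketch.selectedIndices(VarianceSketch.rows, 8.0)` keeps the features at indices 0, 2, 3 and 5.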

Here is what I did to get the effect that I wanted:
type Threshold = Double
import org.apache.spark.ml.feature.{VarianceThresholdSelector, VectorAssembler}
import org.apache.spark.sql.DataFrame
def varianceThreshold(df: DataFrame, thresholds: Seq[Threshold]): Seq[(Threshold, DataFrame)] = {
// Assemble every column except the first into a single vector column.
// Note: df.columns.tail assumes the first column is the one to exclude.
val assembler = new VectorAssembler()
.setInputCols(df.columns.tail)
.setOutputCol("features")
val output = assembler.transform(df)
thresholds.map(threshold => {
val selector = new VarianceThresholdSelector()
.setVarianceThreshold(threshold)
.setFeaturesCol("features")
.setOutputCol("selectedFeatures")
(threshold, selector.fit(output).transform(output))
})
}
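The function above drops whichever column happens to come first. If the column to exclude is identified by name instead (the question mentions col2), the input columns can be chosen by filtering. The helper below is a small sketch of that idea; `FeatureColumns` and `featureColumns` are hypothetical names, and the column names other than col2 are made up because the original DataFrame is not shown.

```scala
object FeatureColumns {
  // Every column except the label column, in the original order.
  def featureColumns(allColumns: Seq[String], labelCol: String): Array[String] =
    allColumns.filterNot(_ == labelCol).toArray
}
```

The result could then be passed to the assembler, for example `.setInputCols(FeatureColumns.featureColumns(df.columns, "col2"))`. This still assumes that every remaining column has a type VectorAssembler accepts (numeric, boolean or vector).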

Related

Error in types double and DenseVector[Double]

The following code is the answer to this question: Anomaly detection with PCA in Spark
import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler,VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
object SparkApp extends App {
val session = SparkSession.builder()
.appName("spark-app").master("local[*]").getOrCreate()
session.sparkContext.setLogLevel("ERROR")
import session.implicits._
val df = Seq(
(1, 4, 0),
(3, 4, 0),
(1, 3, 0),
(3, 3, 0),
(67, 37, 0) //outlier
).toDF("x", "y", "z")
val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
.setWithStd(true)
val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)
val pipeline = new Pipeline().setStages(
Array(vectorAssembler, standardScalar, pca)
)
val pcaDF = pipeline.fit(df).transform(df)
def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))
val mahalanobois = udf[Double, Vector] { v =>
val vB = DenseVector(v.toArray)
vB.t * invCovariance * vB
}
df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
}
val withMahalanobois: DataFrame = withMahalanobois(pcaDF, "pca-features")
session.close()
}
But when I try to run it, it fails on this line:
vB.t * invCovariance * vB
Error message:
type mismatch: found breeze.linalg.DenseVector[Double], required: Double
How can I solve this?
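The failing expression is meant to be the quadratic form v^T M v, which is a single number. The plain-Scala sketch below (no Breeze types; `QuadraticForm` and its methods are hypothetical names) makes the grouping explicit: first the matrix-vector product, then a dot product with the original vector. In Breeze, the same grouping can be written as `vB dot (invCovariance * vB)`; whether that resolves the type mismatch on a given Breeze version is an assumption, not something confirmed in this thread.

```scala
object QuadraticForm {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Matrix-vector product for a matrix stored as an array of rows.
  def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
    m.map(row => dot(row, v))

  // v^T * M * v, returned as a single Double.
  def quadraticForm(m: Array[Array[Double]], v: Array[Double]): Double =
    dot(v, matVec(m, v))
}
```

For example, with the 2x2 identity matrix and v = (3, 4), `quadraticForm` returns 25.0, the squared length of v.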

Scale a column of SparseVectors without UDF

I have a dataframe with a feature column of SparseVector. I need to scale each row by a scalar. I have a working implementation below that employs a UDF. The following depicts the original and scaled feature columns:
+-------------------+-------+-------------------+
|           features|weights|     scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])|    1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])|    2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])|    3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+
Is there a way to do this using Spark's native, and optimized, methods instead of a UDF?
Similarly, is there a Spark-native way to scale a SparseVector by a scalar? See the line below the "Scale the SparseVector" comment in the UDF defined below.
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
// Scaling a SparseVector column
val data = Array(
(new SparseVector(6, Array(0,1), Array(0.5, 1.0)), 1.0),
(new SparseVector(6, Array(2,3), Array(1.5, 2.0)), 2.0),
(new SparseVector(6, Array(4,5), Array(0.5, 1.0)), 3.0)
)
val df = spark.createDataFrame(data).toDF("features", "weights")
val scaleUDF = udf((sv: SparseVector, w: Double) => {
// Scale the SparseVector
val unzipped = sv.indices.zip(sv.values).map(iv => (iv._1, iv._2*w)).unzip
new SparseVector(sv.size, unzipped._1, unzipped._2)
})
val scaledDF = df.withColumn("scaledFeatures", scaleUDF(col("features"), col("weights")))
scaledDF.show()
+-------------------+-------+-------------------+
|           features|weights|     scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])|    1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])|    2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])|    3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+
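Independently of whether a UDF-free route exists, the scaling step itself only needs to touch the stored values, because multiplying by a scalar leaves the size and the indices unchanged. The sketch below shows this with a hypothetical stand-in type `Sparse` (not Spark's class), so that it runs without Spark.

```scala
object SparseScale {
  // Hypothetical stand-in for a sparse vector: logical size, indices, stored values.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  // Scale only the stored values; size and indices are reused as-is.
  def scale(sv: Sparse, w: Double): Sparse =
    sv.copy(values = sv.values.map(_ * w))
}
```

Applied to the UDF above, the same idea would read `new SparseVector(sv.size, sv.indices, sv.values.map(_ * w))`, which avoids the zip and unzip round trip.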

How to convert a Dataset[Seq[T]] to Dataset[T] in Spark

How do I convert a Dataset[Seq[T]] to Dataset[T]?
For example, Dataset[Seq[Car]] to Dataset[Car].
You can use flatMap:
val df = Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)).toDF("s").as[Seq[Int]];
df.flatMap(x => x.toList)
You can also try the explode function (shown here for a Dataset of Car, as in the full example below):
df.select(explode('s)).select("col.*").as[Car]
Full example:
import org.apache.spark.sql.functions._
case class Car(i : Int);
val df = Seq(List(Car(1), Car(2), Car(3))).toDF("s").as[List[Car]];
val df1 = df.flatMap(x => x.toList)
val df2 = df.select(explode('s)).select("col.*").as[Car]
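The flatMap call on a Dataset follows the same pattern as flatMap on ordinary Scala collections, which may help when reasoning about the result. The sketch below uses plain collections only; `FlattenSketch` and `flattenAll` are hypothetical names.

```scala
object FlattenSketch {
  final case class Car(i: Int)

  // Turn a sequence of sequences into one flat sequence, preserving order.
  def flattenAll[T](nested: Seq[Seq[T]]): Seq[T] =
    nested.flatMap(inner => inner)
}
```

For instance, `FlattenSketch.flattenAll(Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)))` yields the seven integers 1 through 7 in order.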

RandomForestClassifier was given input with invalid label column error in Apache Spark

I am trying to measure accuracy with 5-fold cross-validation using a Random Forest classifier in Scala, but I get the following error when running it:
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
The error occurs at this line: val cvModel = cv.fit(trainingData)
The code I use to cross-validate the data set with the random forest is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(41).toDouble,
Vectors.dense(parts(0).split(',').map(_.toDouble)))
}
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val trainingData = training.toDF()
val testData = test.toDF()
val nFolds: Int = 5
val NumTrees: Int = 5
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(NumTrees)
val pipeline = new Pipeline()
.setStages(Array(rf))
val paramGrid = new ParamGridBuilder()
.build()
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("precision")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val cvModel = cv.fit(trainingData)
val results = cvModel.transform(testData)
.select("label","prediction").collect
val numCorrectPredictions = results.map(row =>
if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size
println("Test set accuracy: %.3f".format(accuracy))
Can anyone please explain what the mistake is in the above code?
RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and the label values must be integral values from [0, 1, 2, ..., #classes) represented as doubles. Typically this is handled by an upstream Transformer such as StringIndexer. Since you convert the labels manually, the metadata fields are not set and the classifier cannot confirm that these requirements are satisfied.
val df = Seq(
(0.0, Vectors.dense(1, 0, 0, 0)),
(1.0, Vectors.dense(0, 1, 0, 0)),
(2.0, Vectors.dense(0, 0, 1, 0)),
(2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")
val rf = new RandomForestClassifier()
.setFeaturesCol("features")
.setNumTrees(5)
rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
You can either re-encode the label column using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("label_idx")
.fit(df)
rf.setLabelCol("label_idx").fit(indexer.transform(df))
or set the required metadata manually:
import org.apache.spark.ml.attribute.NominalAttribute
val meta = NominalAttribute
.defaultAttr
.withName("label")
.withValues("0.0", "1.0", "2.0")
.toMetadata
rf.setLabelCol("label_meta").fit(
df.withColumn("label_meta", $"label".as("", meta))
)
Note:
Labels created using StringIndexer depend on the frequency of each value, not on the value itself:
indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
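The ordering shown above can be reproduced with a short plain-Scala sketch. It assumes the default behaviour is to put the most frequent label first and to break ties by the label string; the tie-breaking rule is an assumption that happens to match the output above. `LabelOrder`, `frequencyOrder` and `toIndex` are hypothetical names, not Spark APIs.

```scala
object LabelOrder {
  // Most frequent label first; ties broken by the label string (an assumption).
  def frequencyOrder(labels: Seq[String]): Seq[String] =
    labels
      .groupBy(identity)
      .toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }

  // Map each label to its position in that ordering, as a Double index.
  def toIndex(labels: Seq[String]): Map[String, Double] =
    frequencyOrder(labels).zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
}
```

With the labels from the example DataFrame ("0.0", "1.0", "2.0", "2.0"), the label "2.0" is mapped to index 0.0 because it occurs twice, so the indexed labels no longer match the original numeric values.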
PySpark:
In Python, metadata fields can be set directly on the schema:
from pyspark.sql.types import StructField, DoubleType
StructField(
"label", DoubleType(), False,
{"ml_attr": {
"name": "label",
"type": "nominal",
"vals": ["0.0", "1.0", "2.0"]
}}
)

Spark: How to transform LabeledPoint features values from int to 0/1?

I want to run Naive Bayes in Spark, but to do this I have to transform the feature values of my LabeledPoint to 0/1. My LabeledPoint looks like this:
scala> transformedData.collect()
res29: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0...
How can I transform those feature values into 1 (it's a sparse representation, so there will be no 0)?
I guess you're looking for something like this:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val transformedData = sc.parallelize(Seq(
LabeledPoint(1.0, Vectors.sparse(5, Array(1, 3), Array(9.0, 3.2))),
LabeledPoint(5.0, Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 2.0, 3.0)))
))
def binarizeFeatures(rdd: RDD[LabeledPoint]) = rdd.map {
case LabeledPoint(label, features) =>
val v = features.toSparse
// Keep the label and the stored indices; set every stored value to 1.0
LabeledPoint(label,
Vectors.sparse(v.size, v.indices, Array.fill(v.indices.length)(1.0)))
}
binarizeFeatures(transformedData).collect
// Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(
// (1.0,(5,[1,3],[1.0,1.0])),
// (5.0,(5,[0,2,4],[1.0,1.0,1.0])))