How to make binary classication in Spark ML without StringIndexer - scala

I try to use Spark ML DecisionTreeClassifier in Pipeline without StringIndexer, because my feature is already indexed as (0.0; 1.0). DecisionTreeClassifier as label requires double values, so this code should work:
def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
import sqlc.implicits._
val trainingDF = training.toDF()
//format of this dataframe: [label: double, features: vector]
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(trainingDF)
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt))
pipeline.fit(trainingDF)
}
But actually I get
java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.
Of course I can just put StringIndexer and let him make it's work for my double "label" field, but I want to work with output rawPrediction column of DecisionTreeClassifier to get probability of 0.0 and 1.0 for each row like...
val predictions = model.transform(singletonDF)
val zeroProbability = predictions.select("rawPrediction").asInstanceOf[Vector](0)
val oneProbability = predictions.select("rawPrediction").asInstanceOf[Vector](1)
If I put StringIndexer in Pipeline - I will not know indexes of my input labels "0.0" and "1.0" in rawPrediction vector, because String indexer will index by value's frequency, which could vary.
Please, help to prepare data for DecisionTreeClassifier without using StringIndexer or suggest some another way to get probability of my original labels (0.0; 1.0) for each row.

You can always set required metadata manually:
import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute
val meta = NominalAttribute
.defaultAttr
.withName("label")
.withValues("0.0", "1.0")
.toMetadata
val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)

Related

How to calculate euclidean distance of each row in a dataframe to a constant reference array

I have a dataframe which is created from parquet files that has 512 columns(all float values).
I am trying to calculate euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with spark 2.1 and Scala. Here is the zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector,
myvec:Vector)=>Vectors.sqdist(features, myvec))
//Cretae ref vector
val myScalaVec = Vectors.dense( Array.fill(512)(25.44859))
val distDF =
training2.withColumn("euc",eucDisUdf($"features",myScalaVec))
This code gives the following error for eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I appreciate any idea how to eliminate this error and compute distances properly in scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn( "euc", eucDisUdf(myScalaVec)($"features") )

Field "features" does not exist. SparkML

I am trying to build a model in Spark ML with Zeppelin.
I am new to this area and would like some help. I think i need to set the correct datatypes to the column and set the first column as the label. Any help would be greatly appreciated, thank you
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
A snippet of the csv file i am using is:
IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,
As you have mentioned, you are missing the features column. It is a vector containing all predictor variables. You have to create it using VectorAssembler.
IsAlert is the label and all others variables (p1,p2,...) are predictor variables, you can create features column (actually you can name it anything you want instead of features) by:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//creating features column
val assembler = new VectorAssembler()
.setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFeaturesCol("features") // setting features column
.setLabelCol("IsAlert") // setting label column
//creating pipeline
val pipeline = new Pipeline().setStages(Array(assembler,lr))
//fitting the model
val lrModel = pipeline.fit(df)
Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using spark ML pipeline to set up classification models on really wide table. This means that I have to automatically generate all the code that deals with columns instead of literately typing each of them. I am pretty much a beginner on scala and spark. I was stuck at the VectorAssembler() part when I was trying to do something like following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what should I do to make the whole thing more automatic or ML pipeline does not take that many columns at this moment(which I doubt)?
I finally figured out one way, which is not very pretty. It is to create vector.dense for the features, and then create data frame out of this.
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)

Spark MLib - Create LabeledPoint from RDD[Vector] features and RDD[Vector] label

I am building a training set using two text files representing documents and labels.
Documents.txt
hello world
hello mars
Labels.txt
0
1
I have read in these files and converted my document data to a tf-idf weighted term-document matrix which is represented as a RDD[Vector]. I have also read-in and created a RDD[Vector] for my labels:
val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)
val labs: RDD[Vector] = sc.textFile("Labels.txt")
.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
I would like to use tfidf and labsto create a RDD[LabeledPoint], but I am not sure how to apply a mapping with two different RDDs. Is this even possible/efficient, or do I need to rethink my approach?
One way to handle this is to join based on indices:
import org.apache.spark.RangePartitioner
// Add indices
val idfIndexed = idf.zipWithIndex.map(_.swap)
val labelsIndexed = labels.zipWithIndex.map(_.swap)
// Create range partitioner on larger RDD
val partitioner = new RangePartitioner(idfIndexed.partitions.size, idfIndexed)
// Join with custom partitioner
labelsIndexed.join(idfIndexed, partitioner).values

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support a multivariate data at this moment.
Regarding a question from the tile you can flatMap
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
Have prepared this code, please evaluate if it can help you out ->
val doubleRDD = rows.map(_.toArray).flatMap(x => x)