Apply PCA on specific columns with Apache Spark - scala

I am trying to apply PCA to a dataset that has a header row and several fields.
Here is the code I used. How can I select the specific columns on which PCA is applied?
val inputMatrix = sc.textFile("C:/Users/mhattabi/Desktop/Realase of 01_06_2017/TopDrive_WithoutConstant.csv").map { line =>
val values = line.split(",").map(_.toDouble)
Vectors.dense(values)
}
val mat: RowMatrix = new RowMatrix(inputMatrix)
val pc: Matrix = mat.computePrincipalComponents(4)
// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
Updated version:
I tried to do this:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.format("com.databricks.spark.csv")
val columnsToUse: Seq[String] = Array("Col0","Col1", "Col2", "Col3", "Col4").toSeq
val k: Int = 2
val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(k)
val featurized = rf.fit(df).transform(df)
// principal components
val principalComponent = pca.fit(featurized).transform(featurized)
principalComponent.select("pcaFeatures").show(4,false)
+-----------------------------------------+
|pcaFeatures |
+-----------------------------------------+
|[-0.536798281241379,0.495499034754084] |
|[-0.32969328815797916,0.5672811417154808]|
|[-1.32283465170085,0.5982789033642704] |
|[-0.6199718696225502,0.3173072633712586] |
+-----------------------------------------+
This is what I got for the principal components. Now I want to save this result to a CSV file and add a header.
Any help would be appreciated. Thanks a lot.

You can use RFormula in this case:
import org.apache.spark.ml.feature.{RFormula, PCA}
val columnsToUse: Seq[String] = ???
val k: Int = ???
val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("/tmp/foo.csv")
val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setK(k)
val featurized = rf.fit(df).transform(df)
val projected = pca.fit(featurized).transform(featurized)
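For the follow-up about saving the result with a header, here is a minimal sketch in plain Scala. It assumes the projected vectors have already been turned into Array[Double] rows and are small enough to handle on the driver. The object name PcaCsv and the header names pc1, pc2, ... are made up for illustration and are not part of Spark.

```scala
import java.io.{File, PrintWriter}

object PcaCsv {
  // Header such as "pc1,pc2" for k principal components (the names are an assumption).
  def header(k: Int): String = (1 to k).map(i => s"pc$i").mkString(",")

  // Header line followed by one comma-separated line per projected row.
  def toCsvLines(k: Int, rows: Seq[Array[Double]]): Seq[String] = {
    require(rows.forall(_.length == k), s"every row must have $k values")
    header(k) +: rows.map(_.mkString(","))
  }

  // Write the lines to a local file, closing the writer even if a write fails.
  def write(file: File, lines: Seq[String]): Unit = {
    val pw = new PrintWriter(file)
    try lines.foreach(l => pw.println(l)) finally pw.close()
  }
}
```

With the DataFrame from the question, the rows could probably be obtained by collecting the pcaFeatures column and calling toArray on each vector. That only makes sense when the result fits in driver memory.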

java.lang.NumberFormatException: For input string: "DateTime"
This means that your input file contains the value DateTime, which you then try to convert to Double.
It is probably in the header of your input file.
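If you want to stay with the RDD approach, one way around this is to skip the header line and parse only the columns you need. The sketch below is plain Scala, and the helper names (CsvColumns, columnIndices, parseSelected) are hypothetical. It assumes a simple comma-separated file without quoted fields.

```scala
object CsvColumns {
  // Map the requested column names to their positions in the header line.
  def columnIndices(header: String, wanted: Seq[String]): Seq[Int] = {
    val names = header.split(",").map(_.trim)
    wanted.map { name =>
      val i = names.indexOf(name)
      require(i >= 0, s"Column not found in header: $name")
      i
    }
  }

  // Parse only the selected fields of one data line as doubles.
  def parseSelected(line: String, indices: Seq[Int]): Array[Double] = {
    val fields = line.split(",")
    indices.map(i => fields(i).trim.toDouble).toArray
  }

  // Hypothetical use with the RDD from the question (not run here):
  //   val lines  = sc.textFile(path)
  //   val header = lines.first()
  //   val idx    = columnIndices(header, Seq("Col1", "Col3"))
  //   val rows   = lines.filter(_ != header).map(l => Vectors.dense(parseSelected(l, idx)))
}
```

Because a non-numeric column such as DateTime is never selected, it is never passed to toDouble.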

Related

How to prepare data to apply RandomForest?

I have a CSV file which contains userId, movieId and rating. I want to convert this file into one containing label and features,
like in How to prepare data into a LibSVM format from DataFrame?
I need to separate the rating column from the rest of the file and use it as the label of a LabeledPoint. To apply the random forest algorithm I need a label column in the file, but it doesn't exist.
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(3)
.fit(assembled_df)
val pcaTrainingData = pca.transform(assembled_df).select("id","features","pcaFeatures")
val labeled = pca.transform(assembled_df).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
How do I create the label column?
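One possible reading of the question is that the rating itself should become the label and the remaining two fields the features. A minimal sketch under that assumption follows. The object name Ratings and the function toLabeled are hypothetical, and the input is assumed to be header-free lines of the form userId,movieId,rating.

```scala
object Ratings {
  // Split one "userId,movieId,rating" line into (label, features).
  // Throws a MatchError if the line does not have exactly three numeric fields.
  def toLabeled(line: String): (Double, Array[Double]) = {
    val Array(user, movie, rating) = line.split(",").map(_.trim.toDouble)
    (rating, Array(user, movie))
  }
}
```

Each pair could then be wrapped as LabeledPoint(label, Vectors.dense(features)). Note that for classification the labels are expected to be class indices 0 until numClasses, so fractional ratings would need to be mapped to classes first.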

How to apply kmeans for parquet file?

I want to apply k-means to my Parquet file, but the following error appears:
java.lang.ArrayIndexOutOfBoundsException: 2
Code:
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)
How to solve this error?
I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.
In the example
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
data is of type org.apache.spark.rdd.RDD. In your case, sqlContext.read.parquet returns a DataFrame, so you have to convert the DataFrame to an RDD before mapping over its rows.
To convert from a DataFrame to an RDD, you can use the sample below as a reference:
val rows: RDD[Row] = df.rdd
val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()
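The 2 in ArrayIndexOutOfBoundsException: 2 is the index that was out of range, which suggests that the rows have only two columns (indices 0 and 1) while the original code reads s.getDouble(2). Below is a small sketch of a bounds-checked conversion in plain Scala. It works on a Seq[Any], which is what Row.toSeq returns. The object name RowFeatures and the function toFeatures are made up for this example.

```scala
object RowFeatures {
  // Convert the chosen positions of a row to doubles, or explain why that failed.
  def toFeatures(row: Seq[Any], indices: Seq[Int]): Either[String, Array[Double]] = {
    val bad = indices.filter(i => i < 0 || i >= row.length)
    if (bad.nonEmpty)
      Left(s"index ${bad.mkString(",")} out of range for a row with ${row.length} columns")
    else {
      // Accept any boxed numeric type (Int, Long, Double, ...) and reject everything else.
      val values = indices.map(i => row(i) match {
        case n: java.lang.Number => Some(n.doubleValue)
        case _                   => None
      })
      if (values.forall(_.isDefined)) Right(values.flatten.toArray)
      else Left("non-numeric value in selected columns")
    }
  }
}
```

Calling Data.printSchema() first also shows how many columns there are and which types they have.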

How to speed up text file windowing and machine learning over windows in Spark

I'm trying to use Spark to learn a multiclass logistic regression on a windowed text file. I first create windows and explode them into $"word_winds", then move the center word of each window into $"word". To fit the LogisticRegression model, I convert each distinct word into a class ($"label"). I count the labels in order to prune those with at most minF samples.
The problem is that some parts of the code are very slow, even for small input files (you can use any README file to test the code). From what I found by searching, some users have experienced slowness when using explode, and they suggest code modifications that give about a 2x speedup. However, I think that would not be sufficient for a 100 MB input file. Please suggest something different, ideally a way to avoid the actions that slow the code down. I'm using Spark 2.4.0 and sbt 1.2.8 on a 24-core machine.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types._
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
spark.sparkContext.setCheckpointDir("checked_dfs")
val in_file = "sample.txt"
val stratified = true
val wsize = 7
val ngram = 3
val minF = 2
val windUdf = udf{s: String => s.sliding(ngram).toList.sliding(wsize).toList}
val get_mid = udf{s: Seq[String] => s(s.size/2)}
val rm_punct = udf{s: String => s.replaceAll("""([\p{Punct}|¿|\?|¡|!]|\p{C}|\b\p{IsLetter}{1,2}\b)\s*""", "")}
// Read and remove punctuation
var df = spark.read.text(in_file)
.withColumn("value", rm_punct($"value"))
// Creating windows and explode them, and get the center word into $"word"
df = df.withColumn("char_nGrams", windUdf('value))
.withColumn("word_winds", explode($"char_nGrams"))
.withColumn("word", get_mid('word_winds))
val indexer = new StringIndexer().setInputCol("word")
.setOutputCol("label")
df = indexer.fit(df).transform(df)
val hashingTF = new HashingTF().setInputCol("word_winds")
.setOutputCol("freqFeatures")
df = hashingTF.transform(df)
val idf = new IDF().setInputCol("freqFeatures")
.setOutputCol("features")
df = idf.fit(df).transform(df)
// Remove word whose freq is less than minF
var counts = df.groupBy("label").count
.filter(col("count") > minF)
.orderBy(desc("count"))
.withColumn("id", monotonically_increasing_id())
var filtro = df.groupBy("label").count.filter(col("count") <= minF)
df = df.join(filtro, Seq("label"), "leftanti")
var dfs = if(stratified){
// Create stratified sample 'dfs'
var revs = counts.orderBy(asc("count")).select("count")
.withColumn("id", monotonically_increasing_id())
revs = revs.withColumnRenamed("count", "ascc")
// Weight the labels linearly with NORMALIZED weights ("ascc") inversely proportional to word frequency
counts = counts.join(revs, Seq("id"), "inner").withColumn("weight", col("ascc")/df.count)
val minn = counts.select("weight").agg(min("weight")).first.getDouble(0) - 0.01
val maxx = counts.select("weight").agg(max("weight")).first.getDouble(0) - 0.01
counts = counts.withColumn("weight_n", (col("weight") - minn) / (maxx - minn))
counts = counts.withColumn("weight_n", when(col("weight_n") > 1.0, 1.0)
.otherwise(col("weight_n")))
var fractions = counts.select("label", "weight_n").rdd.map(x => (x(0), x(1)
.asInstanceOf[scala.Double])).collectAsMap.toMap
df.stat.sampleBy("label", fractions, 36L).select("features", "word_winds", "word", "label")
}else{ df }
dfs = dfs.checkpoint()
val lr = new LogisticRegression().setRegParam(0.01)
val Array(tr, ts) = dfs.randomSplit(Array(0.7, 0.3), seed = 12345)
val training = tr.select("word_winds", "features", "label", "word")
val test = ts.select("word_winds", "features", "label", "word")
val model = lr.fit(training)
def mapCode(m: scala.collection.Map[Any, String]) = udf( (s: Double) =>
m.getOrElse(s, "")
)
var labels = training.select("label", "word").distinct.rdd
.map(x => (x(0), x(1).asInstanceOf[String]))
.collectAsMap
var predictions = model.transform(test)
predictions = predictions.withColumn("pred_word", mapCode(labels)($"prediction"))
predictions.write.format("csv").save("spark_predictions")
spark.stop()
}
}
Since your data is somewhat small it might help if you use coalesce before explode. Sometimes it can be inefficient to have too many nodes especially if there is a lot of shuffling in your code.
Like you said, it does seem like a lot of people have issues with explode. I looked at the link you provided but no one mentioned trying flatMap instead of explode.
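To make the flatMap idea concrete, here is a rough sketch of the windowing logic as an ordinary Scala function, which could then be passed to flatMap on a Dataset[String] instead of combining a UDF with explode. The object name Windows and its functions are hypothetical, and I have not benchmarked this against the original code.

```scala
object Windows {
  // Character n-grams of a line, grouped into sliding windows of wsize n-grams.
  def windows(line: String, ngram: Int, wsize: Int): List[List[String]] =
    line.sliding(ngram).toList.sliding(wsize).toList

  // Pair each window with its center n-gram, as get_mid does in the question.
  def withCenter(line: String, ngram: Int, wsize: Int): List[(String, List[String])] =
    windows(line, ngram, wsize).map(w => (w(w.size / 2), w))

  // Hypothetical use (untested, needs import spark.implicits._):
  //   spark.read.textFile(in_file)
  //     .flatMap(l => withCenter(l, ngram, wsize))
  //     .toDF("word", "word_winds")
}
```

This produces the word and word_winds columns in a single pass, so the intermediate char_nGrams column is never materialized.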

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using a Spark ML pipeline to set up classification models on a really wide table. This means that I have to generate all the column-handling code automatically instead of literally typing each column name. I am pretty much a beginner with Scala and Spark. I got stuck at the VectorAssembler() part when I tried to do something like the following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline simply cannot take that many columns at the moment (which I doubt).
I finally figured out one way, which is not very pretty: create a Vectors.dense for the features and then build a DataFrame from it.
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)
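Note that df.columns returns every column, including the label column if there is one, so in practice you usually want to filter the list before handing it to the assembler. A minimal Scala sketch of that filtering step is below. The object name FeatureColumns and the function featureColumns are made up, and the column names in the usage note are only examples.

```scala
object FeatureColumns {
  // Keep every column name except the ones that must not become features
  // (for example the label or an id column). Names are trimmed first.
  def featureColumns(all: Seq[String], exclude: Set[String]): Array[String] =
    all.map(_.trim).filterNot(exclude.contains).toArray
}
```

The result can be passed directly to setInputCols, for example new VectorAssembler().setInputCols(FeatureColumns.featureColumns(df.columns, Set("target"))).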

How to change RowMatrix into Array in Spark or export it as a CSV?

I've got this code in Scala:
val mat: CoordinateMatrix = new CoordinateMatrix(data)
val rowMatrix: RowMatrix = mat.toRowMatrix()
val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(100, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val S: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val uArray: Array[Double] = U.toArray // doesn't compile, because RowMatrix has no toArray method
val sArray: Array[Double] = S.toArray // works
val vArray: Array[Double] = V.toArray // works
How can I convert U into uArray or a similar type that can be written to a CSV file?
That's a basic operation. Since U is a RowMatrix, here is what you have to do:
val U = svd.U
rows is a member of RowMatrix that gives you the matrix as an RDD[Vector], one vector per row.
You just need to take rows from your RowMatrix and map each Vector to an Array, which you then join into a comma-separated string, producing an RDD[String].
val rdd = U.rows.map( x => x.toArray.mkString(","))
All you have to do now is save the RDD:
rdd.saveAsTextFile(path)
It works:
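import java.io.PrintWriter
import org.apache.spark.rdd.RDD
// Collects the whole RDD to the driver, so this only suits matrices that fit in driver memory.
def exportRowMatrix(matrix: RDD[String], fileName: String): Unit = {
  val pw = new PrintWriter(fileName)
  // Close the writer even if collecting or writing fails.
  try matrix.collect().foreach(line => pw.println(line))
  finally pw.close()
}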
val rdd = U.rows.map( x => x.toArray.mkString(","))
exportRowMatrix(rdd, "U.csv")