How to apply k-means to a parquet file? - scala

I want to apply k-means to my parquet file, but an error appears:
java.lang.ArrayIndexOutOfBoundsException: 2
My code:
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)
How can I solve this error?

I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.
In that example:
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
data is of type org.apache.spark.rdd.RDD. In your case, sqlContext.read.parquet returns a DataFrame, so you would have to convert the DataFrame to an RDD to perform the split operation.
To convert from a DataFrame to an RDD you can use the sample below as a reference:
val rows: RDD[Row] = df.rdd

If your rows have only two columns, index 2 is out of bounds; use the column indices (and getter types) that actually match your schema, for example:
val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()
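As a minimal sketch (assuming the parquet file has at least two numeric columns; the column names c1 and c2 below are placeholders), you can check the schema first and select the columns explicitly before mapping them to vectors:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
// Confirm which columns exist and what their types are
Data.printSchema()

// Select the numeric columns to cluster on; "c1" and "c2" are placeholder names,
// and both are assumed to be DoubleType (use getInt(...).toDouble etc. otherwise)
val parsedData = Data.select("c1", "c2").rdd
  .map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
  .cache()

// 30 clusters as in the question; more than 1 iteration usually gives a better model
val userClusterModel = KMeans.train(parsedData, 30, 20)
println("WSSSE for users: " + userClusterModel.computeCost(parsedData))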

Related

How to make windowing a text file and machine learning over the windows faster in Spark

I'm trying to use Spark to learn multiclass logistic regression on a windowed text file. What I'm doing is first creating windows and exploding them into $"word_winds", then moving the center word of each window into $"word". To fit the LogisticRegression model, I convert each distinct word into a class ($"label") so that it can be learned. I count the different labels in order to prune those with at most minF samples.
The problem is that some part of the code is very, very slow, even for small input files (you can use some README file to test the code). Googling around, some users have been experiencing slowness caused by explode. They suggest some modifications to the code that speed it up by about 2x. However, I think that with a 100MB input file this wouldn't be sufficient. Please suggest something different, probably avoiding the operations that slow down the code. I'm using Spark 2.4.0 and sbt 1.2.8 on a 24-core machine.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types._
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
spark.sparkContext.setCheckpointDir("checked_dfs")
val in_file = "sample.txt"
val stratified = true
val wsize = 7
val ngram = 3
val minF = 2
val windUdf = udf{s: String => s.sliding(ngram).toList.sliding(wsize).toList}
val get_mid = udf{s: Seq[String] => s(s.size/2)}
val rm_punct = udf{s: String => s.replaceAll("""([\p{Punct}|¿|\?|¡|!]|\p{C}|\b\p{IsLetter}{1,2}\b)\s*""", "")}
// Read and remove punctuation
var df = spark.read.text(in_file)
.withColumn("value", rm_punct($"value"))
// Create the windows, explode them, and get the center word into $"word"
df = df.withColumn("char_nGrams", windUdf('value))
.withColumn("word_winds", explode($"char_nGrams"))
.withColumn("word", get_mid('word_winds))
val indexer = new StringIndexer().setInputCol("word")
.setOutputCol("label")
df = indexer.fit(df).transform(df)
val hashingTF = new HashingTF().setInputCol("word_winds")
.setOutputCol("freqFeatures")
df = hashingTF.transform(df)
val idf = new IDF().setInputCol("freqFeatures")
.setOutputCol("features")
df = idf.fit(df).transform(df)
// Remove words whose frequency is at most minF
var counts = df.groupBy("label").count
.filter(col("count") > minF)
.orderBy(desc("count"))
.withColumn("id", monotonically_increasing_id())
var filtro = df.groupBy("label").count.filter(col("count") <= minF)
df = df.join(filtro, Seq("label"), "leftanti")
var dfs = if(stratified){
// Create stratified sample 'dfs'
var revs = counts.orderBy(asc("count")).select("count")
.withColumn("id", monotonically_increasing_id())
revs = revs.withColumnRenamed("count", "ascc")
// Weigh the labels with normalized weights ("ascc") inversely (linearly) proportional to word frequency
counts = counts.join(revs, Seq("id"), "inner").withColumn("weight", col("ascc")/df.count)
val minn = counts.select("weight").agg(min("weight")).first.getDouble(0) - 0.01
val maxx = counts.select("weight").agg(max("weight")).first.getDouble(0) - 0.01
counts = counts.withColumn("weight_n", (col("weight") - minn) / (maxx - minn))
counts = counts.withColumn("weight_n", when(col("weight_n") > 1.0, 1.0)
.otherwise(col("weight_n")))
var fractions = counts.select("label", "weight_n").rdd.map(x => (x(0), x(1)
.asInstanceOf[scala.Double])).collectAsMap.toMap
df.stat.sampleBy("label", fractions, 36L).select("features", "word_winds", "word", "label")
}else{ df }
dfs = dfs.checkpoint()
val lr = new LogisticRegression().setRegParam(0.01)
val Array(tr, ts) = dfs.randomSplit(Array(0.7, 0.3), seed = 12345)
val training = tr.select("word_winds", "features", "label", "word")
val test = ts.select("word_winds", "features", "label", "word")
val model = lr.fit(training)
def mapCode(m: scala.collection.Map[Any, String]) = udf( (s: Double) =>
m.getOrElse(s, "")
)
var labels = training.select("label", "word").distinct.rdd
.map(x => (x(0), x(1).asInstanceOf[String]))
.collectAsMap
var predictions = model.transform(test)
predictions = predictions.withColumn("pred_word", mapCode(labels)($"prediction"))
predictions.write.format("csv").save("spark_predictions")
spark.stop()
}
}
Since your data is somewhat small, it might help if you use coalesce before explode. Sometimes it can be inefficient to have too many partitions, especially if there is a lot of shuffling in your code.
Like you said, it does seem like a lot of people have issues with explode. I looked at the link you provided, but no one mentioned trying flatMap instead of explode.
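A minimal sketch of the flatMap idea (untested; it assumes the same ngram, wsize, and get_mid definitions as in the question, and that spark.implicits._ is imported):
// Coalesce first since the input is small, then flatMap instead of withColumn + explode
val windows = df.coalesce(4)
  .select("value").as[String]
  .flatMap(s => s.sliding(ngram).toList.sliding(wsize).toList) // one row per window
  .toDF("word_winds")
  .withColumn("word", get_mid($"word_winds"))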

How to calculate euclidean distance of each row in a dataframe to a constant reference array

I have a dataframe, created from parquet files, that has 512 columns (all float values).
I am trying to calculate the Euclidean distance of each row in my dataframe to a constant reference array.
My development environment is Zeppelin 0.7.3 with Spark 2.1 and Scala. Here are the Zeppelin paragraphs I run:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
//Create dataframe from parquet file
val filePath = "/tmp/vector.parquet/*.parquet"
val df = spark.read.parquet(filePath)
//Create assembler and vectorize df
val assembler = new VectorAssembler()
.setInputCols(df.columns)
.setOutputCol("features")
val training = assembler.transform(df)
//Create udf
val eucDisUdf = udf((features: Vector,
myvec:Vector)=>Vectors.sqdist(features, myvec))
//Create ref vector
val myScalaVec = Vectors.dense( Array.fill(512)(25.44859))
val distDF =
training2.withColumn("euc",eucDisUdf($"features",myScalaVec))
This code gives the following error for the eucDisUdf call:
error: type mismatch; found : org.apache.spark.ml.linalg.Vector
required: org.apache.spark.sql.Column
I would appreciate any ideas on how to eliminate this error and compute the distances properly in Scala.
I think you can use currying to achieve that:
def eucDisUdf(myvec:Vector) = udf((features: Vector) => Vectors.sqdist(features, myvec))
val myScalaVec = Vectors.dense(Array.fill(512)(25.44859))
val distDF = training2.withColumn( "euc", eucDisUdf(myScalaVec)($"features") )
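Note that Vectors.sqdist returns the squared distance; if you need the true Euclidean distance, a small variation of the same curried pattern (a sketch, with the hypothetical name eucDistUdf) is to take the square root:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Curried UDF returning the actual Euclidean distance (sqrt of the squared distance)
def eucDistUdf(myvec: Vector) =
  udf((features: Vector) => math.sqrt(Vectors.sqdist(features, myvec)))

val distDF = training2.withColumn("euc", eucDistUdf(myScalaVec)($"features"))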

Apply PCA on specific columns with Apache Spark

I am trying to apply PCA to a dataset that contains a header and several fields.
Here is the code I used; any help in selecting the specific columns on which to apply PCA would be appreciated.
val inputMatrix = sc.textFile("C:/Users/mhattabi/Desktop/Realase of 01_06_2017/TopDrive_WithoutConstant.csv").map { line =>
val values = line.split(",").map(_.toDouble)
Vectors.dense(values)
}
val mat: RowMatrix = new RowMatrix(inputMatrix)
val pc: Matrix = mat.computePrincipalComponents(4)
// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
Updated version: I tried the following:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.format("com.databricks.spark.csv")
val columnsToUse: Seq[String] = Array("Col0","Col1", "Col2", "Col3", "Col4").toSeq
val k: Int = 2
val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(k)
val featurized = rf.fit(df).transform(df)
//principal component
val principalComponent = pca.fit(featurized).transform(featurized)
principalComponent.select("pcaFeatures").show(4,false)
+-----------------------------------------+
|pcaFeatures |
+-----------------------------------------+
|[-0.536798281241379,0.495499034754084] |
|[-0.32969328815797916,0.5672811417154808]|
|[-1.32283465170085,0.5982789033642704] |
|[-0.6199718696225502,0.3173072633712586] |
+-----------------------------------------+
This is what I got for the principal components. The remaining question: I want to save this in a CSV file and add a header.
Any help would be appreciated.
Thanks a lot.
You can use RFormula in this case:
import org.apache.spark.ml.feature.{RFormula, PCA}
val columnsToUse: Seq[String] = ???
val k: Int = ???
val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("/tmp/foo.csv")
val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setK(k)
val featurized = rf.fit(df).transform(df)
val projected = pca.fit(featurized).transform(featurized)
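For the follow-up about writing the principal components to CSV with a header: the CSV writer cannot store a Vector column directly, so one sketch (assuming the PCA stage sets .setOutputCol("pcaFeatures") as in the updated code above, and k = 2) is to split the vector into plain Double columns first:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Turn the Vector column into an array, then into one plain column per component
val toArr = udf((v: Vector) => v.toArray)
principalComponent
  .withColumn("arr", toArr($"pcaFeatures"))
  .selectExpr("arr[0] as pc1", "arr[1] as pc2") // pc1/pc2 are hypothetical header names
  .write.option("header", "true")
  .csv("/tmp/pca_output") // placeholder output path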
If you get
java.lang.NumberFormatException: For input string: "DateTime"
it means that in your input file there is a value "DateTime" that you then try to convert to a Double. It is probably somewhere in the header of your input file.
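A minimal sketch of one way to avoid that with the sc.textFile approach (assuming the first line of the CSV is the header row containing values such as "DateTime"):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val raw = sc.textFile("C:/Users/mhattabi/Desktop/Realase of 01_06_2017/TopDrive_WithoutConstant.csv")
// Drop the header line before parsing the remaining lines as doubles
val header = raw.first()
val inputMatrix = raw.filter(_ != header).map { line =>
  Vectors.dense(line.split(",").map(_.toDouble))
}
val mat: RowMatrix = new RowMatrix(inputMatrix)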

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using the Spark ML pipeline to set up classification models on a really wide table. This means that I have to automatically generate all the code that deals with the columns instead of literally typing each of them. I am pretty much a beginner with Scala and Spark. I was stuck at the VectorAssembler() part when I was trying to do something like the following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the elements in the header
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline cannot take that many columns at the moment (which I doubt).
I finally figured out one way, which is not very pretty: create Vectors.dense for the features, and then create a data frame out of it.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless the column names contain quotes. Try:
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
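As a further minimal sketch (assuming the label column is named target, as in the StringIndexer above), you can also derive the input columns directly from the DataFrame instead of parsing a header RDD, no matter how many columns there are:
import org.apache.spark.ml.feature.VectorAssembler

// Use every column except the label as a feature; no manual quoting needed
val featureCols = df.columns.filterNot(_ == "target")
val featureAssembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")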
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)

Convert RDD[Vector] to RDD[Double]

How do I convert a CSV to RDD[Double]? I get the error cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
Vectors.dense(values)
}.cache()
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
.setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either an RDD[Double] or a JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at this moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
I have prepared this code; please evaluate whether it can help you out:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)
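A minimal end-to-end sketch of the flatMap approach (assuming the CSV really should be flattened into a single univariate sample, and reusing sc and DATAFILE from the question):
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// Flatten every row of the CSV into one RDD of doubles
val sample: RDD[Double] = sc.textFile(DATAFILE)
  .flatMap(_.split(',').map(_.toDouble))
  .cache()

val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)

// Density estimates at a few query points
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))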