Writing the output of a Principal Component Analysis to a text file - Scala

I have performed a Principal Component Analysis on a matrix I previously loaded with sc.textFile. The output is an org.apache.spark.mllib.linalg.Matrix, which I then converted to an RDD[Vector[Double]].
With:
import java.io.PrintWriter
I did:
val pw = new PrintWriter("Matrix.csv")
rows3.collect().foreach(line => pw.println(line))
pw.flush
The output CSV is promising. The only problem is that each line is a DenseVector(some values). How do I split each line into its corresponding coefficients?
Thanks a lot

You can use the result of computePrincipalComponents together with breeze.linalg.csvwrite:
import java.io.File
import breeze.linalg.{DenseMatrix => BDM, csvwrite}

val mat: RowMatrix = ...
val pca = mat.computePrincipalComponents(...)

csvwrite(
  new File("Matrix.csv"),
  new BDM[Double](pca.numRows, pca.numCols, pca.toArray))
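Note that Matrix.toArray returns the entries in column-major order, which is the layout Breeze's DenseMatrix(rows, cols, data) constructor expects, so no transposition is needed.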

Convert each vector to a string (you can do it either on the driver or on the executors):
val pw = new PrintWriter("Matrix.csv")
// toArray exposes the underlying coefficients so mkString can join them
rows3.map(_.toArray.mkString(",")).collect().foreach(line => pw.println(line))
pw.flush
Edit:
If your data is too big to fit in the driver's memory, you can try something like this:
val rdd = rows3.map(_.toArray.mkString(",")).zipWithIndex.cache
val total = rdd.count
val step = 10000 // rows in each chunk
// Use Long bounds (count returns a Long) and extend past total so the final partial chunk is included
val range = 0L to (total + step) by step
val limits = range.zip(range.drop(1))
limits.foreach { case (start, end) =>
  rdd.filter(x => x._2 >= start && x._2 < end)
    .map(_._1)
    .collect
    .foreach(pw.println(_))
}
pw.close()
I can't try this out, but that is the general idea.
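If you don't actually need a single local file, a simpler alternative (a sketch along the same lines, not tested here) is to skip the driver entirely and let Spark write the rows from the executors:
val lines = rows3.map(_.toArray.mkString(","))
// Writes one part file per partition under the given directory
lines.saveAsTextFile("Matrix_csv")
You can then concatenate the part files, or call coalesce(1) before saving if a single file is acceptable for your data size.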

Related

How to set type of dataset when applying transformations and how to implement transformations without using spark.sql.functions._?

I am a beginner in Scala and have been working on the following problem:
Example dataset, named given_dataset, with player number and points scored:
player_no | points
1         | 25.0
1         | 20.0
1         | 21.0
2         | 15.0
2         | 18.0
3         | 24.0
3         | 25.0
3         | 29.0
Problem 1:
I have a dataset and need to calculate the total points scored, average points per game, and number of games played per player. I am unable to explicitly set the data type to "double", "int", or "float" when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help with this and correct my error?
No data type specified (but the code is able to run):
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the library spark.sql.functions, e.g. through functions such as map, groupByKey etc. If possible, could anyone provide an example and point me in the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType}, then you have to cast either at read time or with as[(Int, Double)] on the dataset. (Note that .as[Double] crashes because each grouped result has two columns, not one, so the encoder needs a tuple type.) Below is an example that reads your dataset from a CSV file:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the player_no and points fields, and convert to integer & float
  val player_no = fields(0).toInt
  val points = fields(1).toFloat
  // Create a tuple that is our result.
  (player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)

val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)

val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)

val avg_points_dataset2 =
  dataset
    .mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(x => x._1 / x._2)
    .sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I tried running both ways locally and both work fine. We can check the output below (first block: total points, second: games played, third: average points):
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")

How to speed up windowing a text file and machine learning over the windows in Spark

I'm trying to use Spark to learn multiclass logistic regression on a windowed text file. What I'm doing is first creating windows and exploding them into $"word_winds". Then I move the center word of each window into $"word". To fit the LogisticRegression model, I convert each distinct word into a class ($"label"), so the model learns over them. I count the labels to prune those with fewer than minF samples.
The problem is that some part of the code is very, very slow, even for small input files (you can use some README file to test the code). Googling, I found some users have experienced slowness with explode. They suggest some modifications to the code to get a 2x speedup. However, I think that with a 100MB input file this wouldn't be sufficient. Please suggest something different, probably one that avoids the actions that slow down the code. I'm using Spark 2.4.0 and sbt 1.2.8 on a 24-core machine.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types._

object SimpleApp {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    spark.sparkContext.setCheckpointDir("checked_dfs")

    val in_file = "sample.txt"
    val stratified = true
    val wsize = 7
    val ngram = 3
    val minF = 2

    val windUdf = udf{s: String => s.sliding(ngram).toList.sliding(wsize).toList}
    val get_mid = udf{s: Seq[String] => s(s.size/2)}
    val rm_punct = udf{s: String => s.replaceAll("""([\p{Punct}|¿|\?|¡|!]|\p{C}|\b\p{IsLetter}{1,2}\b)\s*""", "")}

    // Read and remove punctuation
    var df = spark.read.text(in_file)
      .withColumn("value", rm_punct($"value"))

    // Create windows and explode them, and get the center word into $"word"
    df = df.withColumn("char_nGrams", windUdf('value))
      .withColumn("word_winds", explode($"char_nGrams"))
      .withColumn("word", get_mid('word_winds))

    val indexer = new StringIndexer().setInputCol("word")
      .setOutputCol("label")
    df = indexer.fit(df).transform(df)

    val hashingTF = new HashingTF().setInputCol("word_winds")
      .setOutputCol("freqFeatures")
    df = hashingTF.transform(df)

    val idf = new IDF().setInputCol("freqFeatures")
      .setOutputCol("features")
    df = idf.fit(df).transform(df)

    // Remove words whose frequency is less than minF
    var counts = df.groupBy("label").count
      .filter(col("count") > minF)
      .orderBy(desc("count"))
      .withColumn("id", monotonically_increasing_id())
    var filtro = df.groupBy("label").count.filter(col("count") <= minF)
    df = df.join(filtro, Seq("label"), "leftanti")

    var dfs = if(stratified){
      // Create stratified sample 'dfs'
      var revs = counts.orderBy(asc("count")).select("count")
        .withColumn("id", monotonically_increasing_id())
      revs = revs.withColumnRenamed("count", "ascc")
      // Weigh the labels with normalized weights ("ascc") linearly inversely proportional to word frequency
      counts = counts.join(revs, Seq("id"), "inner").withColumn("weight", col("ascc")/df.count)
      val minn = counts.select("weight").agg(min("weight")).first.getDouble(0) - 0.01
      val maxx = counts.select("weight").agg(max("weight")).first.getDouble(0) - 0.01
      counts = counts.withColumn("weight_n", (col("weight") - minn) / (maxx - minn))
      counts = counts.withColumn("weight_n", when(col("weight_n") > 1.0, 1.0)
        .otherwise(col("weight_n")))
      var fractions = counts.select("label", "weight_n").rdd.map(x => (x(0), x(1)
        .asInstanceOf[scala.Double])).collectAsMap.toMap
      df.stat.sampleBy("label", fractions, 36L).select("features", "word_winds", "word", "label")
    }else{ df }
    dfs = dfs.checkpoint()

    val lr = new LogisticRegression().setRegParam(0.01)
    val Array(tr, ts) = dfs.randomSplit(Array(0.7, 0.3), seed = 12345)
    val training = tr.select("word_winds", "features", "label", "word")
    val test = ts.select("word_winds", "features", "label", "word")
    val model = lr.fit(training)

    def mapCode(m: scala.collection.Map[Any, String]) = udf( (s: Double) =>
      m.getOrElse(s, "")
    )
    var labels = training.select("label", "word").distinct.rdd
      .map(x => (x(0), x(1).asInstanceOf[String]))
      .collectAsMap
    var predictions = model.transform(test)
    predictions = predictions.withColumn("pred_word", mapCode(labels)($"prediction"))
    predictions.write.format("csv").save("spark_predictions")
    spark.stop()
  }
}
Since your data is somewhat small, it might help to coalesce before the explode. It can be inefficient to have too many partitions, especially if there is a lot of shuffling in your code.
Like you said, it does seem like a lot of people have issues with explode. I looked at the link you provided, but no one mentioned trying flatMap instead of explode.
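As a hedged, untested sketch of that idea, using the question's wsize and ngram settings (Window is an illustrative case class, and spark.implicits._ is assumed to be in scope):
case class Window(word_winds: Seq[String], word: String)

// Build (window, center word) rows in one typed pass, replacing windUdf + explode
val windows = df.select("value").as[String].flatMap { s =>
  s.sliding(ngram).toSeq   // character n-grams
    .sliding(wsize)        // windows of n-grams
    .map(w => Window(w, w(w.size / 2)))
}
The resulting Dataset[Window] exposes the same word_winds and word columns the rest of the pipeline expects.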

Increasing the hash tables in MinHashLSH decreases accuracy and F1

I have used MinHashLSH with approxSimilarityJoin in Scala and Spark 2.4 to find edges in a network (link prediction based on document similarity). My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and F1 score decrease. Everything I have read about this algorithm suggests I have an issue.
I have tried different numbers of hash tables and different Jaccard similarity thresholds, but I have the same exact problem: the accuracy decreases rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow is as follows: I concatenate all the text columns of my dataframe (title, authors, journal, and abstract), then tokenize the concatenated column into words. Then I use a CountVectorizer to transform this "bag of words" into vectors. Next, I feed this column to MinHashLSH with some number of hash tables, and finally I do an approxSimilarityJoin to find similar "papers" under my given threshold. My implementation is the following.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._

object lsh {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR) // show only errors

    // val cores=args(0).toInt
    // val partitions=args(1).toInt
    // val hashTables=args(2).toInt
    // val limit = args(3).toInt
    // val threshold = args(4).toDouble
    val cores = "*"
    val partitions = 1
    val hashTables = 16
    val limit = 1000
    val jaccardDistance = 0.89

    val master = "local[" + cores + "]"
    val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
    val sc = ss.sparkContext

    val inputFile = "resources/data/node_information.csv"
    println("reading from input file: " + inputFile)
    println()

    val schemaStruct = StructType(
      StructField("id", IntegerType) ::
      StructField("pubYear", StringType) ::
      StructField("title", StringType) ::
      StructField("authors", StringType) ::
      StructField("journal", StringType) ::
      StructField("abstract", StringType) :: Nil
    )

    // Read the contents of the csv file into a dataframe. The csv file contains a header.
    // var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
    var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
    papers.repartition(partitions)
    println("papers.rdd.getNumPartitions: " + papers.rdd.getNumPartitions)
    import ss.implicits._

    // Read the original graph edges, ground truth
    val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    }).toDF("nodeA_id", "nodeB_id").cache()
    val originalGraphCount = originalGraphDF.count()
    println("Ground truth count: " + originalGraphCount)

    val nullAuthor = ""
    val nullJournal = ""
    val nullAbstract = ""
    papers = papers.na.fill(nullAuthor, Seq("authors"))
    papers = papers.na.fill(nullJournal, Seq("journal"))
    papers = papers.na.fill(nullAbstract, Seq("abstract"))
    papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
    papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
    papers.show(false)

    val filteredGt = originalGraphDF.as("g").join(papers.as("p"),
      ($"g.nodeA_id" === $"p.id") || ($"g.nodeB_id" === $"p.id")
    ).select("g.nodeA_id", "g.nodeB_id").distinct().cache()
    filteredGt.show()
    val filteredGtCount = filteredGt.count()
    println("Filtered GroundTruth count: " + filteredGtCount)

    // TOKENIZE
    val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
    val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
    val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
    val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
    val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")

    println("Setting pipeline stages...")
    val stages = Array(
      tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
      // rTitle, rAuthors, rJournal, rAbstract
    )
    val pipeline = new Pipeline()
    pipeline.setStages(stages)

    println("Transforming dataframe\n")
    val model = pipeline.fit(papers)
    papers = model.transform(papers)
    println(papers.count())
    papers.show(false)
    papers.printSchema()

    val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
    val joinedDf = papers.withColumn(
      "paper_data",
      udf_join_cols(
        papers("pubYear_words"),
        papers("title_words"),
        papers("authors_words"),
        papers("journal_words"),
        papers("abstract_words")
      )
    ).select("id", "paper_data").cache()
    joinedDf.show(5, false)

    val vocabSize = 1000000
    val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
    val isNoneZeroVector = udf({ v: Vector => v.numNonzeros > 0 }, DataTypes.BooleanType)
    val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
    vectorizedDf.show()

    val mh = new MinHashLSH().setNumHashTables(hashTables)
      .setInputCol("features").setOutputCol("hashValues")
    val mhModel = mh.fit(vectorizedDf)
    mhModel.transform(vectorizedDf).show()
    vectorizedDf.createOrReplaceTempView("vecDf")
    println("MinHashLSH.getHashTables: " + mh.getNumHashTables)

    val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
    dfA.show(false)
    val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
    dfB.show(false)

    val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
    println("Predictions:")
    val predictionsCount = predictionsDF.count()
    predictionsDF.show()
    println("Predictions count: " + predictionsCount)

    predictionsDF.createOrReplaceTempView("predictions")
    val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
    pairs.show(false)
    val totalPredictions = pairs.count()

    println("Properties:\n")
    println("Threshold (Jaccard distance): " + jaccardDistance + "\n")
    println("Hash tables: " + hashTables + "\n")
    println("Ground truth: " + filteredGtCount)
    println("Total edges found: " + totalPredictions + " \n")

    println("EVALUATION PROCESS STARTS\n")
    println("Calculating true positives...\n")
    val truePositives = filteredGt.as("g").join(pairs.as("p"),
      ($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
    ).cache().count()
    println("True Positives: " + truePositives + "\n")

    println("Calculating false positives...\n")
    val falsePositives = predictionsCount - truePositives
    println("False Positives: " + falsePositives + "\n")

    println("Calculating true negatives...\n")
    val pairsPerTwoCount = (limit * (limit - 1)) / 2
    val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
    println("True Negatives: " + trueNegatives + "\n")

    val falseNegatives = filteredGtCount - truePositives
    println("False Negatives: " + falseNegatives)

    val truePN = (truePositives + trueNegatives).toFloat
    println("TP + TN sum: " + truePN + "\n")
    val sum = (truePN + falseNegatives + falsePositives).toFloat
    println("TP + TN + FP + FN sum: " + sum + "\n")
    val accuracy = (truePN / sum).toFloat
    println("Accuracy: " + accuracy + "\n")

    val precision = truePositives.toFloat / (truePositives + falsePositives).toFloat
    val recall = truePositives.toFloat / (truePositives + falseNegatives).toFloat
    val f1Score = 2 * (recall * precision) / (recall + precision).toFloat
    println("F1 score: " + f1Score + "\n")

    ss.stop()
  }
}
I forgot to mention that I am running this code on a cluster with 40 cores and 64 GB of RAM. Note that approximate similarity join (Spark's implementation) works with JACCARD DISTANCE and not with JACCARD INDEX, so I provide as a similarity threshold the Jaccard distance, which in my case is jaccardDistance = 1 - threshold (where threshold is the Jaccard index).
I was expecting to get higher accuracy and F1 score as I increase the number of hash tables. Do you have any idea about my issue?
Thank you all in advance!
There are multiple visible problems here, and probably more hidden ones, so just to enumerate a few:
LSH is not really a classifier, and attempting to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is a big if).
If the problem were to be framed as a classification problem, it should be treated as multi-label classification (each paper can cite or be cited by multiple sources), not multi-class classification, hence simple accuracy is not meaningful.
Even if it were a classification task and could be evaluated as such, your calculations don't include the actual negatives: pairs that don't meet the threshold never appear in the output of approxSimilarityJoin.
Also, setting the threshold to 1 restricts the join to either exact matches or cases of hash collisions, hence the preference towards LSH configurations with higher collision rates.
Additionally:
The text-processing approach you took is rather pedestrian and favors non-specific features (remember you don't optimize for your actual goal, but for text similarity).
Such an approach, especially treating everything as equal, discards the majority of useful information in the set, primarily (but not limited to) temporal relationships.

Joining two different RDDs line-by-line into one - Scala

I'm programming a K-means algorithm in Spark-Scala.
My model predicts which cluster each point belongs to.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data into ten classes using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format (no header):
[numberOfCluster (1,2,3,..,10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This is the input of a Python executable I use to pretty-print the result.
But my best effort has only got me this far
(you can see the indices come back in shuffled order: 68, 384, ...):
var i = 0
val c = sc.parallelize(data.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a Spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")
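If you also need everything in one file rather than one part file per partition, a small hedged addition (fine only when the result is small) is:
// Reduce to a single partition so saveAsTextFile emits one part file
result.coalesce(1).saveAsTextFile("/some/output/path")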

Mean Squared Error (MSE) returns a huge number

I'm new to Scala and Spark in general. I'm using this code for regression (based on this example from the Spark official site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations,stepSize )
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
The dataset I'm using can be seen here: Pastebin link.
So my question is: why does the MSE come out as 889717.74 (which is a huge number)?
Edit: As the commenters suggested, I tried these:
1) I changed the step size to the default, and the MSE now returns as NaN
2) If I try this constructor:
LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept=True), the spark-shell returns an error (error: not found: value True)
You've passed a tiny step size and capped the number of iterations at 100. The maximum value by which your parameters can change is 0.00000001 * 100 = 0.000001. Try using the default step size; I imagine that will fix it.
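For reference, a hedged sketch of both fixes (same variable names as the question; note that Scala's boolean literal is lowercase true, and the four-argument train overload takes a miniBatchFraction, not an intercept flag):
// Default step size (1.0): simply omit the stepSize argument
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// To fit an intercept, configure the estimator object instead of train(...)
val lr = new LinearRegressionWithSGD().setIntercept(true)
lr.optimizer.setNumIterations(numIterations) // keep the same iteration cap
val modelWithIntercept = lr.run(parsedData)
If the default step size still yields NaN, scaling the features first (for example with mllib's StandardScaler) usually helps SGD converge.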