Spark MLlib - Create LabeledPoint from RDD[Vector] features and RDD[Vector] label - scala

I am building a training set using two text files representing documents and labels.
Documents.txt
hello world
hello mars
Labels.txt
0
1
I have read in these files and converted my document data to a tf-idf weighted term-document matrix, which is represented as an RDD[Vector]. I have also read in the labels and created an RDD[Vector] for them:
val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)
val labs: RDD[Vector] = sc.textFile("Labels.txt")
.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
I would like to use tfidf and labs to create an RDD[LabeledPoint], but I am not sure how to apply a mapping with two different RDDs. Is this even possible/efficient, or do I need to rethink my approach?

One way to handle this is to join based on indices:
import org.apache.spark.RangePartitioner
// Add indices
val tfidfIndexed = tfidf.zipWithIndex.map(_.swap)
val labelsIndexed = labs.zipWithIndex.map(_.swap)
// Create range partitioner on larger RDD
val partitioner = new RangePartitioner(tfidfIndexed.partitions.size, tfidfIndexed)
// Join with custom partitioner; yields RDD[(Vector, Vector)] of (label, features) pairs
labelsIndexed.join(tfidfIndexed, partitioner).values
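The join above returns pairs of (label vector, feature vector), so one more map is needed to get an RDD[LabeledPoint]. A minimal sketch, assuming each label vector holds exactly one value; the toy data and the local SparkSession are stand-ins for the question's setup, not part of it:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[2]").appName("labeled-points").getOrCreate()
val sc = spark.sparkContext

// Toy stand-ins for the question's tfidf and labs RDDs
val tfidf: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 2.0)))
val labs: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(0.0), Vectors.dense(1.0)))

// Key both RDDs by row position
val featuresIndexed = tfidf.zipWithIndex.map(_.swap)
val labelsIndexed = labs.zipWithIndex.map(_.swap)

// Join on the index, then take the single label value out of its vector
val labeledPoints: RDD[LabeledPoint] = labelsIndexed
  .join(featuresIndexed)
  .values
  .map { case (label, features) => LabeledPoint(label(0), features) }
```

This relies on both files listing rows in the same order, since the index is the only thing that ties a label to its document.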

Related

IllegalArgumentException when computing a PCA with Spark ML

I have a parquet file containing the id and features columns and I want to apply the pca algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I have this exception
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
Spark's PCA transformer needs a Vector column, such as one created by a VectorAssembler. Here you create an assembler but never use it. Also, the VectorAssembler only takes numbers as input. I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to give the assembled column the same name as an original column. Indeed, the VectorAssembler does not remove input columns and you will end up with two features columns.
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
import spark.implicits._ // needed for the 'id column syntax; already in scope in spark-shell
val df = spark.range(10)
.select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")
val assembler = new VectorAssembler()
.setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)
val pca = new PCA()
.setInputCol("features").setOutputCol("pcaFeatures").setK(2)
.fit(assembled_df)
val result = pca.transform(assembled_df)
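Since the exception says the features column is ArrayType(DoubleType), another option is to convert the array column to a Vector with a UDF before running PCA. This is a rough sketch under that assumption; the toy DataFrame, the column name featuresVec and the choice of k = 2 are made up for illustration:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[2]").appName("pca-array").getOrCreate()
import spark.implicits._

// Toy data shaped like the question's: an id and an array<double> column
val df = Seq(
  (1L, Array(1.0, 2.0, 3.0)),
  (2L, Array(2.0, 4.0, 1.0)),
  (3L, Array(0.5, 1.0, 7.0)),
  (4L, Array(3.0, 0.0, 2.0))
).toDF("id", "features")

// Turn each array into a dense ml Vector
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVectors = df.withColumn("featuresVec", toVector(col("features")))

// Fit PCA on the new Vector column
val pcaModel = new PCA()
  .setInputCol("featuresVec")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(withVectors)
val pcaResult = pcaModel.transform(withVectors).select("pcaFeatures")
```

The converted column gets its own name so it does not clash with the original features column.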

Store Spark distributed matrix in MongoDB

After calculating the distance matrix for a set of points stored in a file on HDFS, I need to store the resulting matrix, which is in a distributed form (CoordinateMatrix/RowMatrix), in MongoDB through the MongoDB Connector for Apache Spark. Is there a recommended way to do this, or even a better connector for such an operation?
Here is the part of my code:
val data = sc.textFile("hdfs://localhost:54310/usrp/copy_sample_data.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value, index) => (index, value)}
val pairedSamples = indexedData.cartesian(indexedData)
val dist = pairedSamples.map{case (x,y) => ((x,y),distance(x._2,y._2))}.map{case ((x,y),z) => (((x,y),z,covariance(z)))}
val entries: RDD[MatrixEntry] = dist.map{case (((x,y),z,cov)) => MatrixEntry(x._1, y._1, cov)}
val coomat: CoordinateMatrix = new CoordinateMatrix(entries)
To further note, I have created this matrix in Spark from an RDD. So maybe it is even better/possible to save data from the RDD to MongoDB?
CoordinateMatrix and RowMatrix are basically wrappers around RDD[MatrixEntry] and RDD[Vector] respectively, and both can be saved to MongoDB relatively easily. For a coordinate matrix:
val spark: SparkSession = ???
import spark.implicits._
// For 1.x
// val sqlContext: SQLContext = ???
// import sqlContext.implicits._
val options = Map(
"uri" -> ???,
"database" -> ???
)
val coordMat = new CoordinateMatrix(sc.parallelize(Seq(
MatrixEntry(1, 3, 1.4), MatrixEntry(3, 6, 2.8))
))
coordMat.entries.toDF().write
.options(options)
.option("collection", "coordinates")
.format("com.mongodb.spark.sql")
.save()
You'll get documents of this shape:
{'_id': ObjectId('...'), 'i': 3, 'j': 6, 'value': 2.8}
which can easily be cast back to the original form:
val entries = spark.read
.options(options)
.option("collection", "coordinates")
.format("com.mongodb.spark.sql")
.schema(...) // schema is set on the reader, so it has to come before load()
.load()
.drop("_id")
.as[MatrixEntry]
new CoordinateMatrix(entries.rdd)
Pretty much the same thing can be done for RowMatrix, but you'll need a little bit more work (represent Vectors either as dense arrays or as sparse tuples (size, indices, values)).
Unfortunately, in both cases (CoordinateMatrix, RowMatrix) you'll lose information about the matrix shape.
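One way around the lost shape is to read the dimensions before writing and pass them back to the constructor when rebuilding. The sketch below leaves MongoDB out entirely and only round-trips through a DataFrame; the 10 x 10 size is an arbitrary value picked for the example, and where you keep nRows and nCols (a separate collection, a config file) is up to you:

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("matrix-shape").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// Matrix with an explicit shape larger than its highest populated index
val coordMat = new CoordinateMatrix(
  sc.parallelize(Seq(MatrixEntry(1, 3, 1.4), MatrixEntry(3, 6, 2.8))),
  10L, 10L)

// Record the shape before the entries are written out
val nRows = coordMat.numRows()
val nCols = coordMat.numCols()

// Stand-in for the write/read round trip: only the entries survive
val entriesDF = coordMat.entries.toDF()

// Rebuild with the recorded shape instead of letting Spark infer it from the entries
val restored = new CoordinateMatrix(entriesDF.as[MatrixEntry].rdd, nRows, nCols)
```

Without the two extra arguments, the rebuilt matrix takes its size from the largest row and column index present in the entries.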

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using a Spark ML pipeline to set up classification models on a really wide table. This means that I have to automatically generate all the code that deals with columns instead of literally typing each of them. I am pretty much a beginner in Scala and Spark. I got stuck at the VectorAssembler() part when I was trying to do something like the following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether ML pipelines cannot take that many columns at the moment (which I doubt).
I finally figured out one way, which is not very pretty: create a Vectors.dense for the features, and then create a data frame out of it.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)
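The same idea works in Scala: take the names from df.columns and drop whatever should not be a feature. A small sketch, assuming the label column is called target as in the question; the toy DataFrame and its feature names are invented:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("wide-assembler").getOrCreate()
import spark.implicits._

// Toy stand-in for the wide table
val df = Seq(
  (1.0, 2.0, 3.0, "a"),
  (4.0, 5.0, 6.0, "b")
).toDF("f1", "f2", "f3", "target")

// Every column except the label becomes a feature; names are used as-is, without quotes
val featureCols: Array[String] = df.columns.filter(_ != "target")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val assembled = assembler.transform(df)
```

With thousands of columns the filter can just as well be a set of excluded names or a prefix check.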

Handling unseen categorical variables and MaxBins calculation in Spark Multiclass-classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code.
I am calculating the max number of categories and then giving it as a parameter to RF. This takes a lot of time! Is there a parameter to set, or an easier way to make the model automatically infer the max categories? It can go above 1000 and I cannot omit them.
How do I handle unseen labels on new data for prediction, since StringIndexer will not work in that case? The code below is just a split of the data, but I will be introducing new data in the future as well.
// Need to predict 2 classes
val cols_to_predict=Array("Label1","Label2")
// ID col
val omit_cols=Array("Key")
// reading the csv file
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("abc.csv")
.cache()
// creating a features DF by dropping the labels so that I can run all
// the cols through StringIndexer
val features=data.drop("Label1").drop("Label2").drop("Key")
// Since I do not know my max categories possible, I find it out
// and use it for maxBins parameter in RF
val distinct_col_counts=features.columns.map(x => data.select(x).distinct().count ).max
val transformers: Array[org.apache.spark.ml.PipelineStage] = features.columns.map(
cname => new StringIndexer().setInputCol(cname).setOutputCol(s"${cname}_index").fit(features)
)
val assembler = new VectorAssembler()
.setInputCols(features.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val labelIndexer2 = new StringIndexer()
.setInputCol("prog_label2")
.setOutputCol("Label2")
.fit(data)
val labelIndexer1 = new StringIndexer()
.setInputCol("orig_label1")
.setOutputCol("Label1")
.fit(data)
val rf = new RandomForestClassifier()
.setLabelCol("Label1")
.setFeaturesCol("features")
.setNumTrees(100)
.setMaxBins(distinct_col_counts.toInt)
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer1.labels)
// Split into train and test
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
trainingData.cache()
testData.cache()
// Running only for one label for now Label1
val stages: Array[org.apache.spark.ml.PipelineStage] =transformers :+ labelIndexer1 :+ assembler :+ rf :+ labelConverter //:+ labelIndexer2
val pipeline=new Pipeline().setStages(stages)
val model=pipeline.fit(trainingData)
val predictions = model.transform(testData)
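For the unseen-category part, one option that may help is StringIndexer's handleInvalid parameter. This is a sketch rather than a full fix for the pipeline above, and it assumes Spark 2.2 or later, where the "keep" value is available; the column names and toy data are invented:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("unseen-labels").getOrCreate()
import spark.implicits._

// "c" never appears in the training data
val train = Seq("a", "b", "a").toDF("cat")
val newData = Seq("a", "c").toDF("cat")

// "keep" puts unseen values into one extra bucket instead of throwing;
// "skip" would drop those rows instead
val indexer = new StringIndexer()
  .setInputCol("cat")
  .setOutputCol("cat_index")
  .setHandleInvalid("keep")
  .fit(train)

val indexed = indexer.transform(newData)
```

In the question's code this would mean adding setHandleInvalid to each StringIndexer built in the transformers array. Note that the extra bucket adds one more category per indexed column, which matters when sizing maxBins.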

How to convert a map to Spark's RDD

I have a data set which is in the form of some nested maps, and its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.
I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
writeMapToLibSVMFile(data_map,"libsvm_data.txt") // Implemented somewhere else
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
I know it should be as easy to directly load the data variable from data_map, but I don't know how.
Any help is appreciated!
I guess you want something like this
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// If you know this upfront, otherwise it can be computed
// using flatMap
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???
val trainMap = Map(
"x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
"x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))
val trainRdd: RDD[(String, LabeledPoint)] = sc
// Convert Map to Seq so it can be passed to parallelize
.parallelize(trainMap.toSeq)
.map{case (id, (labelInt, values)) => {
// Convert nested map to Seq so it can be passed to Vector
val features = Vectors.sparse(nFeatures, values.toSeq)
// Convert label to Double so it can be used for LabeledPoint
val label = labelInt.toDouble
(id, LabeledPoint(label, features))
}}
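To get from there to the training/test split used in the question, the ids can be dropped and the feature count derived from the map itself. A sketch under the assumption that LabelType is Int; the map contents, the seed and the local SparkSession are placeholders:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("map-to-rdd").getOrCreate()
val sc = spark.sparkContext

// Toy stand-in for data_map, with LabelType assumed to be Int
val dataMap: Map[String, (Int, Map[Int, Double])] = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

// Highest feature index seen anywhere, plus one
val nFeatures: Int = dataMap.values.flatMap(_._2.keys).max + 1

// Drop the ids and build one sparse LabeledPoint per sample
val points: RDD[LabeledPoint] = sc
  .parallelize(dataMap.values.toSeq)
  .map { case (label, values) =>
    LabeledPoint(label.toDouble, Vectors.sparse(nFeatures, values.toSeq))
  }

// Same 70/30 split as in the question; the seed is arbitrary
val Array(trainingData, testData) = points.randomSplit(Array(0.7, 0.3), seed = 42L)
```

Computing nFeatures this way only sees indices present in this map, so pass a fixed value if other data sets can contain higher ones.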
It can be done in two ways:
1. Read the file and parse each line into an object: sc.textFile("libsvm_data.txt").map(s => createObject())
2. Convert the map into a collection of objects and use sc.parallelize()
The first one is preferable.