Converting a Spark DataFrame for ML processing - scala

I have written the following code to feed data to a machine learning algorithm in Spark 2.3. The code below runs fine. I need to enhance this code to be able to convert not just 3 columns but any number of columns, uploaded via the csv file. For instance, if I had loaded 5 columns, how can I put them automatically in the Vector.dense command below, or some other way to generate the same end result? Does anyone know how this can be done?
val data2 = spark.read.format("csv").option("header",
"true").load("/data/c7.csv")
val goodBadRecords = data2.map(
row =>{
val n0 = row(0).toString.toLowerCase().toDouble
val n1 = row(1).toString.toLowerCase().toDouble
val n2 = row(2).toString.toLowerCase().toDouble
val n3 = row(3).toString.toLowerCase().toDouble
(n0, Vectors.dense(n1,n2,n3))
}
).toDF("label", "features")
Thanks
Regards,
Adeel

A VectorAssembler can do the job:
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
Based on your code, the solution would look like:
val data2 = spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true") //1
.load("/data/c7.csv")
val fields = data2.schema.fieldNames
val assembler = new VectorAssembler()
.setInputCols(fields.tail) //2
.setOutputCol("features") //3
val goodBadRecords = assembler.transform(data2)
.withColumn("label", col(fields(0))) //4
.drop(fields:_*) //5
Remarks:
A schema is necessary for the input data, as the VectorAssembler only accepts the following input column types: all numeric types, boolean type, and vector type (same link). You seem to have a csv with doubles, so infering the schema should work. But of course, any other method to transform the string data to doubles is also ok.
Use all but the first column as input for the VectorAssembler
Name the result column of the VectorAssembler features
Create a new column called label as copy of the first column
Drop all orginal columns. This last step is optional as the learning algorithm usually only looks at the label and feature column and ignores all other columns

Related

IllegalArgumentException when computing a PCA with Spark ML

I have a parquet file containing the id and features columns and I want to apply the pca algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I have this exception
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually ArrayType(DoubleType,true).
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input. I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to name the assembled column the same way as an original column. Indeed, the VectorAssembler does not remove input columns and you will end up if two features columns.
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
val df = spark.range(10)
.select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")
val assembler = new VectorAssembler()
.setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)
val pca = new PCA()
.setInputCol("features").setOutputCol("pcaFeatures").setK(2)
.fit(assembled_df)
val result = pca.transform(assembled_df)

Running SVD in Spark Scala

I have a RDD in which i have word and it's vector representation. I followed following example:https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
The SingularValueDecomposition class returns the RowMatrix. It doesn't have word for which the vector was originally generated in RowMatrix. I am not getting how to use SingularValueDecomposition output now since it is just reduced matrix with no word label in it.
Anyone faced the similar issue?
I was able to do by following below steps:
// GET word and vector.
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("filteredWords").setOutputCol("features").setVocabSize(100000).setMinDF(2).fit(newSentenceData)
// Model is fitted
val fittedModel = cvModel.transform(newSentenceData)
// Converted the Dataframe to RDD as the SVD library works on RDD.
val rddVectorWithAllColumns = fittedModel.rdd
// Here, i have truncated the code and assumed that svd variable is holding the model. In this step, i am accessing the U matrix and adding the word back to the RDD so that we can get reduced vectors and word.
val test = svd.U.rows.map(row => row.toArray).zip(rddVectorWithAllColumns.map(row => row.getString(0))).map(line => line._2 + "\t" + line._1.mkString("\t"))

Select 2000+ columns as Features for Classification ML

I'm looking how I can select a lot of columns(2000+) as a feature from a Dataframe. I don't want to write the name one by one.
I'm doing classification and i have around 2000 features.
data is a Dataframe with around 2000 columns.
First, I get all of the columns name of my DF and drop 9 columns because i don't need them.
My idea was to use all the columns names to feed the VectorAssembler. The result should be something like [Value Of the 1st Feature, Value 2nd Feature, Value 3rd Feature...] for the first row and this for all of my Dataframe.
But I have this error :
java.lang.IllegalArgumentException: Field "features" does not exist.
EDIT : If something is unclear, please let me know that I can fix it.
I deleted some Transformers because it's not the point of my question.(StringIndexer, VectorIndexer, IndexToString)
val array = data.columns drop(9)
val assembler = new VectorAssembler()
.setInputCols(array)
.setOutputCol("features")
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setNumTrees(50)
val pipeline = new Pipeline()
.setStages(Array(assembler, rf))
val model = pipeline.fit(trainingData)
EDIT 2 I fix my problem. I took off the Vector Indexer and used array in the VectorAssembler and it worked perfectly.
Well at least, I get a result.

Transforming Spark Dataframe Column

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable - Only pick the top few levels which has greater than n observations (say,1000). Club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
# Extract all levels having > 1000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))
# Extract the level names
val level_names = level_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
# Define UDF
val var_transform = udf((x: String) => {
if (level_names contains x) x
else "others"
})
# Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that would be, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust catalyst and Tungsten to be optimised (e.g. making a broadcast join):
val levels_count = df
.groupBy($"Col_name".as("new_col_name"))
.count
.filter("count >10000")
val df_new = df
.join(levels_count,$"Col_name"===$"new_col_name", joinType="leftOuter")
.drop("Col_name")
.withColumn("new_col_name",coalesce($"new_col_name", lit("other")))

Preparing data for LDA in spark

I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:
doc XXXXX term XXXXX
1 x 'a' x
1 x 'a' x
1 x 'b' x
2 x 'b' x
2 x 'd' x
...
Where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As is I hope is clear from the example, there's one line per token in the raw data (so if a given term appears 5 times in a document, that corresponds to 5 lines of text).
In any case, I need to format this data as sparse term-frequency vectors for running a Spark LDA model, but am unfamiliar with Scala so having some trouble.
I start with:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val corpus:RDD[Array[String]] = sc.textFile("path/to/data")
.map(_.split('\t')).map(x => Array(x(0),x(2)))
And then I get the vocabulary data I'll need to generate the sparse vectors:
val vocab: RDD[String] = corpus.map(_(1)).distinct()
val vocabMap: Map[String, Int] = vocab.collect().zipWithIndex.toMap
What I don't know is the proper mapping function to use here such that I end up with a sparse term frequency vector for each document that I can then feed into the LDA model. I think I need something along these lines...
val documents: RDD[(Long, Vector)] = corpus.groupBy(_(0)).zipWithIndex
.map(x =>(x._2,Vectors.sparse(vocabMap.size, ???)))
At which point I can run the actual LDA:
val lda = new LDA().setK(n_topics)
val ldaModel = lda.run(documents)
Basically, I don't what function to apply to each group so that I can feed term frequency data (presumably as a map?) into a sparse vector. In other words, how do I fill in the ??? in the code snippet above to achieve the desired effect?
One way to handle this:
make sure that spark-csv package is available
load data into DataFrame and select columns of interest
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true") // Optional, providing schema is prefered
.option("delimiter", "\t")
.load("foo.csv")
.select($"doc".cast("long").alias("doc"), $"term")
index term column:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("term")
.setOutputCol("termIndexed")
val indexed = indexer.fit(df)
.transform(df)
.drop("term")
.withColumn("termIndexed", $"termIndexed".cast("integer"))
.groupBy($"doc", $"termIndexed")
.agg(count(lit(1)).alias("cnt").cast("double"))
convert to PairwiseRDD
import org.apache.spark.sql.Row
val pairs = indexed.map{case Row(doc: Long, term: Int, cnt: Double) =>
(doc, (term, cnt))}
group by doc:
val docs = pairs.groupByKey
create feature vectors
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.max
val n = indexed.select(max($"termIndexed")).first.getInt(0) + 1
val docsWithFeatures = docs.mapValues(vs => Vectors.sparse(n, vs.toSeq))
now you have all you need to create LabeledPoints or apply additional processing