Subtract a scalar from a vector column in Spark using Scala

I am using the MovieLens 20 million dataset, which contains a file called rating.csv (userId, movieId, rating). I applied Alternating Least Squares (ALS), which outputs (userId, featureVector) across 10 parquet files, as a dimensionality reduction step.
I want to normalize the featureVector column using the z-score method.
Concretely, I want to subtract the constant scalar 2.484 from each element of the vector, divide the result by 1.8305, and save the result back to the features column of the parquet files.
val df = sqlContext.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
sqlContext.sql("select features from df")
df.withColumn("output", "features" -2.484).show(20)
How do I subtract the scalar from each element of the vector column?

If you want to subtract a constant from each element of the vector, you can do it like this (the example below uses 2.3333 as the constant):
import spark.implicits._
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.rdd.RDD

val df = Seq(Vector(1.0, 2.0, 2.5, 3.0, 3.5)).toDF("features")

val rdd: RDD[DenseVector] = df.select('features)
  .rdd
  .map(row => row.getAs[Seq[Double]]("features").toArray)
  .map(arr => Vectors.dense(arr.map(x => x - 2.3333)).toDense)

rdd.take(10).foreach(println)
output:
[-1.3333,-0.33329999999999993,0.16670000000000007,0.6667000000000001,1.1667]
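For the full normalization asked about in the question ((x - 2.484) / 1.8305, written back to parquet), a UDF keeps everything inside the DataFrame API. This is only a sketch: it assumes Spark 2.x with a SparkSession named spark, that the parquet's features column is an array of numbers (ALS factor files store array<float>, hence the cast), and the output path is invented.
import org.apache.spark.sql.functions.udf
import spark.implicits._

// element-wise z-score with the constants from the question
val zscore = udf { (xs: Seq[Double]) => xs.map(x => (x - 2.484) / 1.8305) }

val df = spark.read.parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")

df.withColumn("features", zscore($"features".cast("array<double>")))
  .write.parquet("file:///usr/local/spark/dataset/model/data/user_normalized")  // hypothetical output path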

Related

Add header to correlation matrix in Spark

I am computing a correlation matrix on a CSV file using Apache Spark. When loading the data I have to skip the first row, which is the header containing the column names, otherwise I cannot parse the file.
I get the correlation computed, but I cannot attach the column names as a header to the resulting matrix. How do I get the matrix with its header? This is what I have tried:
import org.apache.spark.mllib.linalg.{ Vector, Vectors }
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD
// drop the header line
val data = sc.textFile(strfilePath).mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}

val inputMatrix = data.map { line =>
  val values = line.split(",").map(_.toDouble)
  Vectors.dense(values)
}

val correlationMatrix = Statistics.corr(inputMatrix, "pearson")
In Spark 2.0+ you can load a csv file into a dataframe using the command:
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("filePath")
The correlations between different columns can then be computed with
df.stat.corr("col1", "col2", "pearson")
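Neither snippet by itself attaches the header to the matrix. One way to print the matrix computed by the question's Statistics.corr call together with the column names is sketched below; it assumes the first line of the csv holds those names and reuses correlationMatrix and strfilePath from above.
val header: Array[String] = sc.textFile(strfilePath).first().split(",")

println(("" +: header.toSeq).mkString("\t"))
(0 until correlationMatrix.numRows).foreach { i =>
  val row = (0 until correlationMatrix.numCols).map(j => f"${correlationMatrix(i, j)}%.4f")
  println((header(i) +: row).mkString("\t"))
}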

How to convert a DataFrame with String into a DataFrame with Vectors in Scala(Spark 2.0)

I have a DataFrame with a column named KFA containing a string wrapped in square brackets. There are 4 double values in this string. I would like to convert this into a DataFrame of vectors.
This is the first element of the DataFrame:
> dataFrame1.first()
res130: org.apache.spark.sql.Row = [[.00663 .00197 .29809 .0034]]
Could you help me convert it into a dense vector with 4 double values?
I have tried this command
dataFrame1.select("KFA")
.map((x=>x.mkString("").replace("]","").replace("[","").split(" ")))
.rdd.map(x=>Vectors.dense(x(0).toDouble,x(1).toDouble,x(2).toDouble,x(3).toDouble,x(4).toDouble))
This looks very clumsy and unreadable. Could you suggest any other ways of doing this?
Here is an option with Regular expression:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val p = "[.0-9]+".r
val rddVec = dataFrame1.select("KFA")
  .rdd
  .map(x => Vectors.dense(p.findAllIn(x(0).toString).map(_.toDouble).toArray))
# rddVec: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[49] at map at <console>:39
rddVec.collect
# res43: Array[org.apache.spark.mllib.linalg.Vector] =
Array([0.00663,0.00197,0.29809,0.0034], [0.00663,0.00197,0.29809,0.0034])
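Since the question asked for a DataFrame of vectors rather than an RDD, the regex result can be wrapped back into a DataFrame. A sketch, assuming the same sqlContext as elsewhere on this page (mllib vectors carry a UDT, so Spark can infer the schema):
val vecDF = sqlContext.createDataFrame(rddVec.map(Tuple1.apply)).toDF("KFA")
vecDF.printSchema  // KFA is now a vector-typed column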

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map: Map[String, Any] which maps each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k, v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs
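The original question also asked whether one can simply loop over the columns, lapply-style. In Scala a foldLeft over the column names does exactly that; below is a sketch (the helper name imputeWithMean is made up) that replaces nulls in each listed numeric column with that column's mean.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit, mean}

// hypothetical helper: thread df through one withColumn call per column;
// assumes each listed column is numeric and has at least one non-null value
def imputeWithMean(df: DataFrame, cols: Seq[String]): DataFrame = {
  val means = df.select(cols.map(c => mean(col(c)).alias(c)): _*).first()
  cols.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, coalesce(col(c), lit(means.getAs[Double](c))))
  }
}

// usage, e.g.: imputeWithMean(df, Seq("ColA", "ColB"))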

Preparing data for LDA in spark

I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:
doc XXXXX term XXXXX
1 x 'a' x
1 x 'a' x
1 x 'b' x
2 x 'b' x
2 x 'd' x
...
Where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As I hope is clear from the example, there is one line per token in the raw data (so if a given term appears 5 times in a document, that corresponds to 5 lines of text).
In any case, I need to format this data as sparse term-frequency vectors for running a Spark LDA model, but am unfamiliar with Scala so having some trouble.
I start with:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val corpus:RDD[Array[String]] = sc.textFile("path/to/data")
.map(_.split('\t')).map(x => Array(x(0),x(2)))
And then I get the vocabulary data I'll need to generate the sparse vectors:
val vocab: RDD[String] = corpus.map(_(1)).distinct()
val vocabMap: Map[String, Int] = vocab.collect().zipWithIndex.toMap
What I don't know is the proper mapping function to use here such that I end up with a sparse term frequency vector for each document that I can then feed into the LDA model. I think I need something along these lines...
val documents: RDD[(Long, Vector)] = corpus.groupBy(_(0)).zipWithIndex
.map(x =>(x._2,Vectors.sparse(vocabMap.size, ???)))
At which point I can run the actual LDA:
val lda = new LDA().setK(n_topics)
val ldaModel = lda.run(documents)
Basically, I don't know what function to apply to each group so that I can feed term frequency data (presumably as a map?) into a sparse vector. In other words, how do I fill in the ??? in the code snippet above to achieve the desired effect?
One way to handle this:
make sure that spark-csv package is available
load data into DataFrame and select columns of interest
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true") // Optional, providing schema is prefered
.option("delimiter", "\t")
.load("foo.csv")
.select($"doc".cast("long").alias("doc"), $"term")
index term column:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.{count, lit}
val indexer = new StringIndexer()
.setInputCol("term")
.setOutputCol("termIndexed")
val indexed = indexer.fit(df)
.transform(df)
.drop("term")
.withColumn("termIndexed", $"termIndexed".cast("integer"))
.groupBy($"doc", $"termIndexed")
.agg(count(lit(1)).alias("cnt").cast("double"))
convert to PairwiseRDD
import org.apache.spark.sql.Row
val pairs = indexed.map{case Row(doc: Long, term: Int, cnt: Double) =>
(doc, (term, cnt))}
group by doc:
val docs = pairs.groupByKey
create feature vectors
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.max
val n = indexed.select(max($"termIndexed")).first.getInt(0) + 1
val docsWithFeatures = docs.mapValues(vs => Vectors.sparse(n, vs.toSeq))
now you have all you need to create LabeledPoints or apply additional processing
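For completeness, the ??? in the question can also be filled in directly with the corpus / vocabMap RDDs the asker already built. A sketch (vocabMap is assumed small enough to ship in the closure): count term occurrences per document and build a sparse vector of (termIndex, count) pairs.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val documents: RDD[(Long, Vector)] = corpus
  .map { case Array(doc, term) => (doc, vocabMap(term)) }   // (docId, termIndex)
  .groupByKey()
  .zipWithIndex()
  .map { case ((_, termIds), docIdx) =>
    // term frequency within the document as (termIndex, count) pairs
    val counts = termIds.groupBy(identity).map { case (t, ts) => (t, ts.size.toDouble) }
    (docIdx, Vectors.sparse(vocabMap.size, counts.toSeq))
  }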

Standardize an RDD

Suppose I have an RDD of doubles and I want to “standardize” it as follows:
Compute the mean and sd for each col
For each col, subtract the column mean from each entry and divide the result by the column sd
Can this be done efficiently and easily (without converting the RDD into a double array at any stage)?
Thanks and regards,
You can use StandardScaler from Spark itself
/**
* Standardizes features by removing the mean and scaling to unit variance
* using column summary
*/
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val data: RDD[Vector] = ???
val scaler = new StandardScaler(true, true).fit(data)
val scaled: RDD[Vector] = scaler.transform(data)  // scales every vector: mean removed, unit variance
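If the starting point is an RDD of numeric rows rather than an RDD[Vector], the placeholder ??? above can be filled in as sketched below, with made-up data (one value per column):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val rows: RDD[Array[Double]] = sc.parallelize(Seq(
  Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0)))

val vectors: RDD[Vector] = rows.map(r => Vectors.dense(r))
val standardized = new StandardScaler(true, true).fit(vectors).transform(vectors)
standardized.collect().foreach(println)  // each column now has mean 0 and unit (sample) standard deviation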