Efficient way of row/column sum of an IndexedRowMatrix in Apache Spark - scala

I have a matrix in CoordinateMatrix format in Scala. The matrix is sparse and the entries look like this (from coo_matrix.entries.collect):
Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0),
MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0),
MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0),
MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
This is only a small sample. The matrix is N x N (where N = 1 million), though most of it is sparse. What is an efficient way of getting the row sums of this matrix in Spark Scala? The goal is to create a new RDD of row sums, i.e. of size N, where the 1st element is the sum of row 1 and so on.
I can always convert this CoordinateMatrix to an IndexedRowMatrix and run a for loop to compute the row sums one iteration at a time, but that is not the most efficient approach.
Any idea is greatly appreciated.

It will be quite expensive due to shuffling (this is the part you cannot really avoid here) but you can convert entries to PairRDD and reduce by key:
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD
val mat: CoordinateMatrix = ???
val rowSums: RDD[(Long, Double)] = mat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
Unlike a solution based on IndexedRowMatrix:
import org.apache.spark.mllib.linalg.distributed.IndexedRow
mat.toIndexedRowMatrix.rows.map {
  case IndexedRow(i, values) => (i, values.toArray.sum)
}
it requires no groupBy transformation or intermediate SparseVectors.
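To make the reduce-by-key idea concrete, here is a minimal sketch in plain Python (no Spark required) that applies the same per-row merge to a few of the sample entries. The function name row_sums and the tuple layout (row, col, value) are assumptions made for illustration, not part of the Spark API.

```python
from collections import defaultdict

def row_sums(entries):
    """Sum values per row index from (row, col, value) triples."""
    sums = defaultdict(float)
    for row, _col, value in entries:
        # Same associative merge that reduceByKey(_ + _) applies per key.
        sums[row] += value
    return dict(sums)

# A few of the sample entries from the question, as plain tuples.
entries = [
    (0, 0, -1.0), (0, 1, -1.0), (1, 0, -1.0),
    (1, 1, -1.0), (1, 2, -1.0), (2, 1, -1.0),
    (2, 2, -1.0),
]
sums = row_sums(entries)
```

Note that a row with no stored entries does not appear in the result at all, so if a dense result of size N is needed the missing rows have to be filled with 0.0 afterwards.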


Vector vs Vectors in spark [duplicate]

I am a newbie in Spark.
I am trying to read a text file that has data like:
timestamp id counter value
00:01 1 c1 0.5
00:02 5 c3 0.3
00:03 1 c2 0.1
00:04 2 c2 0.13
and transform them to:
(id, array_of_counters):
(1, [ c1 c2 ])
[ 0.5 0.1]
So, for every id, I create a 2D array which holds every counter and every value for that specific id in the text file.
I tried to do it with Vectors, but I think that what is stored in them must be Double and that I cannot add two vectors unless they are Breeze vectors.
Then I found out there is a data structure called just Vector, but I can't find any details about it.
So, my question is: what are the main differences between Vector and Vectors in mllib?
val inputRdd = sc.textFile(inputFile).map(x => x.split(","))
val data = inputRdd.map(y => (y(1), Vector(y(2), y(3)))).reduceByKey(_++_)
I don't think a Vector is necessary or appropriate for what it appears you are trying to do here (I could be wrong, we need more specifics on what you want to accomplish). The only way it makes sense is if there is a fixed number of counters (c1, c2, etc.) for each id. If you simply want a set of every id with its corresponding list of counters and values, try this (I'm assuming counters are unique to each id):
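import org.apache.spark.sql.functions.{collect_list, map}
import spark.implicits._ // assumes a SparkSession named spark
val data = inputRdd
  .map(y => (y(1).toLong, y(2), y(3).toDouble))
  .toDF("id", "counter", "value")
  .groupBy("id") // aggregate per id so the result has (id, list) columns
  .agg(collect_list(map($"counter", $"value")))
  .as[(Long, Seq[Map[String, Double]])]
  .map(r => (r._1, r._2.reduce(_ ++ _)))
// this results in a Dataset[(Long, Map[String, Double])]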
A Spark ml.linalg.Vector is basically an Array[Double], and would require a fixed number of counters for every record. You could transform the data above into a vector by ordering the Map[String, Double] by its ._1 and creating a Vector from its .values.
ml.linalg.Vectors is just a helper object with functions for creating Vector objects.
Factory methods for org.apache.spark.ml.linalg.Vector. We don't use the name Vector because Scala imports scala.collection.immutable.Vector by default.
It's also worth noting that mllib is intended for the older RDD API while ml is intended for the newer DataFrame/Dataset API.
Edit: RDD[(Long, Seq[(String, Double)])]
val data = inputRdd
  .map(y => (y(1).toLong, Seq[(String, Double)]((y(2), y(3).toDouble))))
  .reduceByKey(_ ++ _) // concatenate the per-id sequences
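The grouping described above can be sketched in plain Python on the sample rows from the question. This only illustrates the data shape (id mapped to a counter-to-value map, then optionally to a fixed-order dense list); the helper names group_counters and to_dense, the fixed counter order, and the 0.0 default for absent counters are all assumptions for the example.

```python
from collections import defaultdict

# Sample rows from the question, whitespace-separated.
lines = [
    "00:01 1 c1 0.5",
    "00:02 5 c3 0.3",
    "00:03 1 c2 0.1",
    "00:04 2 c2 0.13",
]

def group_counters(lines):
    """Build {id: {counter: value}} from 'timestamp id counter value' rows."""
    grouped = defaultdict(dict)
    for line in lines:
        _timestamp, id_, counter, value = line.split()
        grouped[int(id_)][counter] = float(value)
    return dict(grouped)

def to_dense(counters, order):
    """Lay out one id's counters in a fixed order (assumed default: 0.0)."""
    return [counters.get(name, 0.0) for name in order]

by_id = group_counters(lines)
order = ["c1", "c2", "c3"]  # hypothetical fixed set of counters
dense_for_1 = to_dense(by_id[1], order)
```

The dense form is what a fixed-length Vector would need; the map form is the more natural fit when each id has a different set of counters.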

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA")).first().getDouble(0)).otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
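Scala:
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)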
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
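The same two-pass logic (compute one mean per column while skipping missing entries, then fill from a column-to-mean map) can be sketched on plain Python dictionaries. This is only an illustration of the idea, not Spark code: the row layout and the helper names column_means and fill_missing are assumptions, and both None and NaN are treated as missing here.

```python
import math

def is_missing(v):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return v is None or math.isnan(v)

def column_means(rows, columns):
    """Mean per column over the non-missing values only."""
    means = {}
    for c in columns:
        vals = [r[c] for r in rows if not is_missing(r[c])]
        if vals:  # a column with no usable values gets no entry
            means[c] = sum(vals) / len(vals)
    return means

def fill_missing(rows, means):
    """Replace missing values with the column mean, where one exists."""
    return [
        {c: (means[c] if c in means and is_missing(v) else v)
         for c, v in r.items()}
        for r in rows
    ]

rows = [
    {"a": 1.0, "b": None},
    {"a": None, "b": 4.0},
    {"a": 3.0, "b": float("nan")},
]
means = column_means(rows, ["a", "b"])
filled = fill_missing(rows, means)
```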
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean).
2. Calculate the mean for each column, and save it as the dictionary col_avgs.
3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing parenthesis, which is what k[4:-1] does.
4. Fill the columns of the dataframe with the averages using col_avgs.
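The key-renaming step relies on every key having exactly the form avg(name). A slightly more defensive variant, sketched below in plain Python, strips the wrapper only when it is actually present. The helper name strip_avg and the sample averages are made up for illustration.

```python
def strip_avg(name):
    """Turn 'avg(col1)' into 'col1'; leave any other name unchanged."""
    prefix, suffix = "avg(", ")"
    if name.startswith(prefix) and name.endswith(suffix):
        return name[len(prefix):-len(suffix)]
    return name

# Hypothetical aggregate result, keyed the way the answer describes.
col_avgs = {"avg(col1)": 1.5, "avg(col2)": 4.0}
cleaned = {strip_avg(k): v for k, v in col_avgs.items()}
```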

spark scala avoid shuffling after RowMatrix calculation

Scala Spark: how to avoid RDD shuffling in a join after a distributed matrix operation
I created a dense matrix as input to calculate the cosine distance between columns:
val rowMarixIn = sc.textFile("input.csv").map { line =>
  val values = line.split(" ").map(_.toDouble)
  Vectors.dense(values) // org.apache.spark.mllib.linalg.Vectors
}
I extracted a set of entries from the coordinate matrix after the cosine calculation:
val coMatrix = new RowMatrix(rowMarixIn)
val similerRows = coMatrix.columnSimilarities()
// extract entries over a specific threshold
val rowIndices = similerRows.entries.map { case MatrixEntry(row: Long, col: Long, sim: Double) =>
if (sim > someTreshold) {
We have another RDD, rdd2(key, Val2).
I just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):
val joinedRDD = rowIndices.join(rdd2)
This results in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.

How to calculate median over RDD[org.apache.spark.mllib.linalg.Vector] in Spark efficiently?

What I want to do is this:
Find the median value of each column.
It can be done by collecting the RDD to the driver, but for big data that becomes impossible.
I know Statistics.colStats() can calculate mean, variance... but median is not included.
Additionally, the vector is high-dimensional and sparse.
Well, I didn't understand the vector part, but this is my approach (I bet there are better ones):
val a = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3))
val n = a.count() / 2
println(n) // outputs 3
val b = a.sortBy(x => x).zipWithIndex()
val median = b.filter(x => x._2 == n).collect()(0)._1 // this part doesn't look nice, I hope someone tells me how to improve it, maybe zero?
println(median) // outputs 2
b.collect().foreach(println) // (-1,0) (0,1) (1,2) (2,3) (3,4) (3,5) (12,6)
The trick is to sort your dataset using sortBy, then zip the entries with their index using zipWithIndex, and then take the middle entry. Note that I used an odd number of samples for simplicity, but the essence is there. You also have to do this for every column of your dataset.
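As a plain-Python sketch of the same sort-then-pick-the-middle idea applied per column, the function below also covers the even-count case by averaging the two middle values (that averaging is an assumption; the answer above only handles an odd count). The name column_medians and the sample rows are made up for illustration, and everything is held in local lists, so this shows the logic rather than a distributed implementation.

```python
def column_medians(rows):
    """Median of each column for a list of equal-length numeric rows."""
    n = len(rows)
    medians = []
    for j in range(len(rows[0])):
        col = sorted(r[j] for r in rows)  # sort the column, like sortBy
        mid = n // 2                      # middle index, like count() / 2
        if n % 2 == 1:
            medians.append(col[mid])
        else:
            # Even count: average the two middle values (assumed convention).
            medians.append((col[mid - 1] + col[mid]) / 2.0)
    return medians

# First column reuses the numbers from the answer; second is mostly zeros,
# mimicking a sparse column.
rows = [[1, 10.0], [2, 0.0], [-1, 0.0], [12, 5.0], [3, 0.0], [0, 0.0], [3, 7.0]]
medians = column_medians(rows)
```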

How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format :
The product_idX is an integer and the product_titleX is a String, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark with Scala so far, following the tutorials I have found on the official page and the Berkeley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it to an RDD[Array[String]]:
val tuples = file.map(line => line.split(",")).cache
and after that I'm transforming the tuples into a pair RDD[(Int, String)]:
val pairs = tuples.map(line => (line(0).trim.toInt, line(1).trim))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
    # count occurrences of each token within one document
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return cnt.items()
tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
# float() avoids integer division on Python 2
idf = df.map(lambda kv: (kv[0], 1. + log10(float(num_documents) / kv[1]))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    # nested-loop join of (token, tf) with (token, idf) on the token id
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
            (k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.
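For reference, the whole tf, df, idf and tf-idf chain can be checked on a tiny in-memory corpus in plain Python. The sketch below uses the same idf formula as above (1 + log10(N / df)) but replaces the nested-loop join with a dictionary lookup. The corpus contents and variable names are made up for illustration; in Spark, one option in the same spirit would be to ship the idf dictionary to the workers as a broadcast variable, though that is a suggestion rather than something tested here.

```python
from collections import Counter
from math import log10

# Hypothetical tokenized corpus: document id -> list of tokens.
corpus = {
    "d1": ["apple", "iphone", "apple"],
    "d2": ["iphone", "case"],
}

# Term frequency: occurrences of each token within each document.
tf = {doc: Counter(tokens) for doc, tokens in corpus.items()}

# Document frequency: number of documents containing each token.
df = Counter(tok for tokens in corpus.values() for tok in set(tokens))

# Inverse document frequency, same formula as in the answer above.
n_docs = len(corpus)
idf = {tok: 1.0 + log10(n_docs / float(count)) for tok, count in df.items()}

# tf-idf via dictionary lookup instead of a nested-loop join.
tfidf = {
    doc: {tok: cnt * idf[tok] for tok, cnt in counts.items()}
    for doc, counts in tf.items()
}
```

A token that appears in every document gets an idf of exactly 1.0 under this formula, so its tf-idf equals its raw count.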