Similarity matrix using a spark dataframe - scala

For an input DataFrame the intent is to generate only half of the self-cartesian product. Since the cartesian product yields a symmetric matrix, we only need to calculate either the upper or the lower triangular portion, strictly above (resp. below) the diagonal, which need not be computed.
The DataFrame crossJoin:
val df3 = df2.crossJoin(df2)
will generate the FULL cartesian product, which we do not want.
Given that the similarity matrix is symmetric with 1's along the diagonal, we do not need to calculate the upper half or the diagonal itself; only the entries below the diagonal are needed.
Any suggestions on how to obtain the result with the least computation?

The following is not a perfect answer: it first generates the full cartesian product. But at least the output is correct.
import org.apache.spark.sql.{DataFrame, types}
import org.apache.spark.sql.types.StructField
/** Generate schema for cartesian product of an input dataframe */
def joinSchema(df: DataFrame) =
  types.StructType(df.schema.fields.map {
    f => StructField(s"${f.name}_a", f.dataType, f.nullable)
  } ++ df.schema.fields.map { f => StructField(s"${f.name}_b", f.dataType, f.nullable) })
// Create the cartesian product via crossJoin
val schema = joinSchema(dfIn)
val df3 = dfIn.crossJoin(dfIn)
val cartesianDf = spark.createDataFrame(df3.rdd, schema)
// Retain only the entries on one side of the diagonal (each pair once, no self-pairs)
cartesianDf.createOrReplaceTempView("cartesian")
val halfDf = spark.sql("select * from cartesian where id_a < id_b")
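The condition id_a < id_b keeps each unordered pair exactly once and drops the diagonal. As a minimal sketch of that idea in plain Python (not Spark code; the names `half_pairs_by_filter` and `half_pairs_direct` are hypothetical, and the ids are assumed to be unique), the filtered cross product can be compared with pairs generated directly, without building the full product first:

```python
from itertools import combinations, product


def half_pairs_by_filter(ids):
    """Build the full cross product, then keep only pairs with a < b."""
    return [(a, b) for a, b in product(ids, ids) if a < b]


def half_pairs_direct(ids):
    """Generate each unordered pair once, never building the full product."""
    return list(combinations(sorted(ids), 2))


ids = [3, 1, 2, 4]  # assumed unique
filtered = half_pairs_by_filter(ids)
direct = half_pairs_direct(ids)
```

For n unique ids both functions return n(n-1)/2 pairs; the difference is only in how much intermediate work is done to get there.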


Convert RDD of Matrix to RDD of Vector

I have a RDD[Matrix[Double]] and want to convert it to RDD[Vector] (Each row in the Matrix will be converted to a Vector).
I've seen related answers such as "Convert Matrix to RowMatrix in Apache Spark using Scala", but that converts a single Matrix to an RDD of Vector, while my case is an RDD of Matrix.
Use flatMap with a function that converts a Matrix to a Seq[Vector]:
// from
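import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
import org.apache.spark.rdd.RDD
def toSeqOfVector(m: Matrix): Seq[Vector] = {
  // toArray returns the values in column-major order
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  rows.map(row => new DenseVector(row.toArray))
}
val matrices: RDD[Matrix] = ??? // your input
val vectors: RDD[Vector] = matrices.flatMap(toSeqOfVector)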
Note: I didn't test this code, but this is the principle
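The group-then-transpose step can be illustrated outside Spark. The following is a small plain-Python sketch, assuming the matrix is given as a flat column-major list plus its row count (which is how Matrix.toArray lays values out); `to_rows` and `flat_map_rows` are hypothetical helper names:

```python
def to_rows(values, num_rows):
    """Split a flat column-major list into a list of rows."""
    # Group the flat list into columns of length num_rows.
    columns = [values[i:i + num_rows] for i in range(0, len(values), num_rows)]
    # Transpose the columns into rows.
    return [list(row) for row in zip(*columns)]


def flat_map_rows(matrices):
    """Flatten several (values, num_rows) matrices into one list of rows."""
    return [row for values, num_rows in matrices for row in to_rows(values, num_rows)]


# The 2x3 matrix [[1, 3, 5], [2, 4, 6]] stored column by column.
rows = to_rows([1, 2, 3, 4, 5, 6], 2)
all_rows = flat_map_rows([([1, 2, 3, 4, 5, 6], 2), ([7, 8], 1)])
```

`flat_map_rows` plays the role of flatMap here: every matrix contributes as many output rows as it has rows.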

Cosine similarity of two sparse vectors in Scala Spark

I have a dataframe with two columns where each row has a Sparse Vector. I am trying to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial to do it for Sparse vectors.
The only way I found is the following:
Create a k X n matrix, where n items are described as k-dimensioned vectors. For representing each item as a k dimension vector, you can use ALS which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k X n matrix can be represented as RDD[Vector].
Convert this k X n matrix to RowMatrix.
Use the columnSimilarities() function to get an n X n matrix of similarities between the n items.
I feel it is an overkill to calculate all the cosine similarities for each pair while I need it only for the specific pairs in my (quite big) dataframe.
In Spark 3 there is now a dot method on SparseVector objects, which takes another vector as its argument.
If you want to do this in earlier versions, you could create a user defined function that follows this algorithm:
Take intersection of your vectors' indices.
Get two subarrays of your vectors' values based on the indices from the intersection.
Do pairwise multiplication of the elements of those two subarrays.
Sum the values resulting from this pairwise multiplication.
Here's my implementation of it:
def dotProduct(vec: SparseVector, vecOther: SparseVector) = {
  val commonIndices = vec.indices intersect vecOther.indices
  commonIndices.map(x => vec(x) * vecOther(x)).reduce(_+_)
}
I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
And if you normalize your sparse vectors to unit L2 norm before computing the dot product, you'll get the cosine similarity in the end (by definition).
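The four steps above, plus the normalization, can be sketched in plain Python. This is an illustration only, assuming each sparse vector is given as parallel lists of indices and values; `sparse_dot` and `sparse_cosine` are hypothetical names, and returning 0.0 for an all-zero vector is a convention chosen for this sketch:

```python
import math


def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    lookup_b = dict(zip(indices_b, values_b))
    # Only indices present in both vectors contribute to the sum.
    return sum(v * lookup_b[i] for i, v in zip(indices_a, values_a) if i in lookup_b)


def sparse_cosine(indices_a, values_a, indices_b, values_b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    norm_a = math.sqrt(sum(v * v for v in values_a))
    norm_b = math.sqrt(sum(v * v for v in values_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for all-zero vectors in this sketch
    return sparse_dot(indices_a, values_a, indices_b, values_b) / (norm_a * norm_b)


# Indices 2 and 5 are shared: 2.0 * 4.0 + 3.0 * 6.0
dot = sparse_dot([0, 2, 5], [1.0, 2.0, 3.0], [2, 3, 5], [4.0, 5.0, 6.0])
```

Because `sum` over an empty sequence is 0, vectors with no shared indices give a dot product of 0 rather than an error.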
Great answer above #Sergey-Zakharov +1.
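A few add-ons:
The reduce doesn't work on empty sequences.
Make sure to compute the L2 normalization.
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{broadcast, col, udf}
val normalizer = new Normalizer()
  .setInputCol("features")      // assumed name of the vector column
  .setOutputCol("normFeatures") // assumed name for the normalized column
  .setP(2.0)                    // L2 norm
val l2NormData = normalizer.transform(df_features)
val dotProduct = udf {(v1: SparseVector, v2: SparseVector) =>
  v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}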
and then
val df = dfA.crossJoin(broadcast(dfB))
.withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))
If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Create a new table [cosine_vectors] that is a filter on the original table to only select the vectors you want the cosine similarities for. Broadcast join those two together and calculate.

Multiply each column of a rowmatrix by an integer

I'm wondering what would be a good way to multiply each column of a RowMatrix
by an integer (each row being multiplied by a different integer).
I know I could for example create a diagonal mllib "Matrix" object containing
the values a1 ... an ( ai being the coefficient I want to multiply the ith
column of the RowMatrix by ), and then I could just use the matrix multiplication of mllib (multiplying a RowMatrix by a Matrix, which yields a RowMatrix as result). However, this is probably not efficient and does not show how to
do stuff on a RowMatrix.
I'm new to writing functions on rowmatrices and tried looking a bit at some of
the already existing ones and was a bit confused.
Thanks for your help
It's unclear whether you want to multiply each row or each column by a different integer. Your title and second paragraph say each column, but your first sentence says each row. Regardless, these sorts of operations are probably most easily implemented by calling .rows and operating on the underlying RDD[Vector]. For instance:
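import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
def multiplyColumns(m: RowMatrix, xs: Array[Double]): RowMatrix = {
  val newRowsRdd = m.rows.map {
    row => Vectors.dense(row.toArray.zip(xs).map { case (a, b) => a * b })
  }
  new RowMatrix(newRowsRdd)
}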
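The per-row operation is just an element-wise product of each row with the vector of column factors. A minimal plain-Python sketch of that step, assuming rows are lists of numbers (`multiply_columns` is a hypothetical name, and the length check is an addition of this sketch):

```python
def multiply_columns(rows, factors):
    """Scale column i of every row by factors[i]."""
    if any(len(row) != len(factors) for row in rows):
        raise ValueError("each row must have exactly one entry per factor")
    # Pair every entry with the factor of its column and multiply.
    return [[a * b for a, b in zip(row, factors)] for row in rows]


scaled = multiply_columns([[1.0, 2.0], [3.0, 4.0]], [10.0, 100.0])
```

This gives the same result as multiplying by a diagonal matrix holding the factors, without building that matrix.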

Geometric mean of columns in dataframe

I use this code to compute the geometric mean of all rows within a dataframe :
from pyspark.sql.functions import rand, randn, sqrt
df = sqlContext.range(0, 10)
df = df.select(rand().alias("c1"), randn(seed=27).alias("c2"))
newdf = df.withColumn('total', sqrt(sum(df[col] for col in df.columns)))
This displays :
To compute the geometric mean of the columns instead of rows I think this code should suffice :
newdf = df.withColumn('total', sqrt(sum(df[row] for row in df.rows)))
But this throws an error: NameError: global name 'row' is not defined
So it appears the API for accessing columns is not the same as for accessing rows.
Should I format the data to convert rows to columns and then re-use working algorithm : newdf = df.withColumn('total', sqrt(sum(df[col] for col in df.columns))) or is there a solution that processes the rows and columns as is ?
I am not sure your definition of geometric mean is correct. According to Wikipedia, the geometric mean is defined as the nth root of the product of n numbers. According to the same page, the geometric mean can also be expressed as the exponential of the arithmetic mean of logarithms. I shall be using this to calculate the geometric mean of each column.
You can calculate the geometric mean, by combining the column data for c1 and c2 into a new column called value storing the source column name in column. After the data has been reformatted, the geometric mean is determined by grouping by column (c1 or c2) and calculating the exponential of the arithmetic mean of the logarithmic value for each group. In this calculation NaN values are ignored.
from pyspark.sql import functions as F
df = sqlContext.range(0, 10)
df = df.select(F.rand().alias("c1"), F.randn(seed=27).alias("c2"))
df_id = df.withColumn("id", F.monotonically_increasing_id())
kvp = F.explode(F.array([F.struct(F.lit(c).alias("column"), F.col(c).alias("value")) for c in df.columns])).alias("kvp")
df_pivoted = df_id.select(['id'] + [kvp]).select(['id'] + ["kvp.column", "kvp.value"])
df_geometric_mean = df_pivoted.groupBy(['column']).agg(F.exp(F.avg(F.log(df_pivoted.value))))
df_geometric_mean.withColumnRenamed("EXP(avg(LOG(value)))", "geometric_mean").show()
This returns:
+------+-------------------+
|column|     geometric_mean|
+------+-------------------+
|    c1|0.25618961513533134|
|    c2|  0.415119290980354|
+------+-------------------+
These geometric means, other than their precision, match the geometric means returned by scipy, provided NaN values are ignored as well.
from scipy.stats.mstats import gmean
c1=[x['c1'] for x in df.collect() if x['c1']>0]
c2=[x['c2'] for x in df.collect() if x['c2']>0]
print('c1 : {0}\r\nc2 : {1}'.format(gmean(c1), gmean(c2)))
This snippet returns:
| c1|0.256189615135|
| c2|0.41511929098|
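The identity used in this answer, that the geometric mean equals the exponential of the arithmetic mean of the logarithms, can be checked on a small list. The sketch below is plain Python rather than PySpark; `geometric_mean` and `nth_root_of_product` are hypothetical names, and non-positive values are skipped to mirror the filtering in the scipy comparison:

```python
import math


def geometric_mean(values):
    """exp(mean(log(x))) over the positive entries; other entries are skipped."""
    positives = [x for x in values if x > 0]
    if not positives:
        raise ValueError("no positive values to average")
    return math.exp(sum(math.log(x) for x in positives) / len(positives))


def nth_root_of_product(values):
    """The textbook definition: nth root of the product of n positive numbers."""
    positives = [x for x in values if x > 0]
    if not positives:
        raise ValueError("no positive values to average")
    result = 1.0
    for x in positives:
        result *= x
    return result ** (1.0 / len(positives))


gm = geometric_mean([1.0, 4.0, 16.0])
root = nth_root_of_product([1.0, 4.0, 16.0])
```

The log form is usually preferred for long columns, since the running product in the second function can overflow or underflow.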

Calculate Spearman Correlation on a Spark DataFrame

I would like to run a Spearman correlation on data that is currently in a Spark DataFrame. Currently, only the Pearson correlation calculation is available to operate on columns in a DataFrame. It appears that I can do a Spearman correlation using Spark's MLlib, but I need to pass two RDD[Double] to the function. The columns I want to compare are Double according to the current schema.
Is there a way to select the columns I want and make them be an array of Doubles so that I can use the MLlib correlation function to get the Spearman correlation coefficient?
You can simply select columns of interest, extract values and compute statistics:
import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics
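// Generate some random data; g is assumed to be a sampler with a sample(n) method, such as a Breeze Gaussian
val df = sc.parallelize(g.sample(1000).zip(g.sample(1000))).toDF("x", "y")
// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))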
val correlation: Double = Statistics.corr(rddX, rddY, "spearman")
You should be able to do something like this
val firstRDD: RDD[Double] = df.select("field1").rdd.map(row => row.getDouble(0))
val secondRDD: RDD[Double] = df.select("field2").rdd.map(row => row.getDouble(0))
val corr = Statistics.corr(firstRDD, secondRDD, "spearman")
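For intuition about what the "spearman" option computes: Spearman's coefficient is the Pearson correlation of the ranks of the two series. The following plain-Python sketch illustrates that definition on small lists; it is not how MLlib implements it, and `average_ranks`, `pearson`, and `spearman` are hypothetical names:

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 100.0, 1000.0, 10000.0])
```

Any strictly increasing relationship gives a coefficient of 1, however non-linear, which is the main practical difference from Pearson.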