Spark RowMatrix columnSimilarities preserve original index - scala

I have the following Scala Spark DataFrame df of (String, Array[Double]): Note id is of type String (A base64 hash)
id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...
The dataset is quite large (45k) and I would like to perform a pairwise cosine similarity using org.apache.spark.mllib.linalg.distributed.RowMatrix for performance. This works, but I am not able to identify the pairwise similarities as the indexes have turned into integers (output columns i and j). How do I use IndexedRowMatrix to preserve the original indexes?
val rows = df.select("values")
.rdd
.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(rows)
val simsEstimate = mat.columnSimilarities()
Ideally, the end result should look something like this:
id_x, id_y, similarity
"a", "b", 0.9
"b", "c", 0.8
...

columnSimilarities() computes similarities between the columns of the RowMatrix, not between its rows, so the ids you have are meaningless in this context: the indices i and j are positions within each feature vector.
Additionally, these methods are designed for long, narrow data, so the obvious approach (encode id with StringIndexer, create an IndexedRowMatrix, transpose, compute similarities, and map back with IndexToString) simply won't do.
Your best bet here is to use crossJoin:
df.as("a").crossJoin(df.as("b")).where($"a.id" <= $"b.id").select(
  $"a.id" as "id_x", $"b.id" as "id_y", cosine_similarity($"a.values", $"b.values")
)
where
val cosine_similarity = udf((xs: Array[Double], ys: Array[Double]) => ???)
is something you have to implement yourself.
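A minimal sketch of what the body of that function could look like, written in plain Scala so it can be tried without Spark. The name cosineSimilarity, the equal-length requirement, and the choice to return 0.0 for a zero vector are assumptions made for illustration, not something the answer specifies.

```scala
// Plain-Scala cosine similarity; assumes both inputs have the same length.
def cosineSimilarity(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length, "vectors must have the same length")
  val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
  val normX = math.sqrt(xs.map(x => x * x).sum)
  val normY = math.sqrt(ys.map(y => y * y).sum)
  // Treating a zero vector as having similarity 0.0 is a choice made here, not a standard.
  if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
}

// Vectors taken from the question's sample rows "a" and "b".
val simAB = cosineSimilarity(Seq(0.5, 0.6), Seq(0.1, 0.2))
println(simAB)
```

With Spark on the classpath, the same function could presumably be wrapped with udf(cosineSimilarity _), provided the values column really is an array of doubles.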
Alternatively you can explode the data:
import org.apache.spark.sql.functions.posexplode
val long = df.select($"id", posexplode($"values")).toDF("item", "feature", "value")
and then use the method shown in Spark Scala - How to group dataframe rows and apply complex function to the groups? to compute the similarity.

Related

Vector vs Vectors in spark [duplicate]

This question already has an answer here:
Difference between spark Vectors and scala immutable Vector?
I am a newbie in Spark.
I am trying to read a text file that has data like:
timestamp id counter value
00:01 1 c1 0.5
00:02 5 c3 0.3
00:03 1 c2 0.1
00:04 2 c2 0.13
and transform them to:
(id, array_of_counters):
(1, [ c1 c2 ])
[ 0.5 0.1]
So, for every id, I create a 2D array that holds every counter and every value for that specific id in the text file.
I tried to do it with Vectors, but I think that what is stored in them must be Double, and that I cannot add two vectors unless they are Breeze vectors.
Then I found out there is a data structure called just Vector, but I can't find any details about it.
So, my question is: what are the main differences between Vector and Vectors in mllib?
Code:
val inputRdd = sc.textFile(inputFile).map(x => x.split(","))
val data = inputRdd.map(y => (y(1), Vector(y(2), y(3)))).reduceByKey(_++_)
I don't think a Vector is necessary or appropriate for what it appears you are trying to do here (I could be wrong; we need more specifics on what you want to accomplish). The only way it makes sense is if there is a fixed number of counters (c1, c2, etc.) for each id. If you simply want a set of every id with its corresponding list of counters and values, try this (I'm assuming counters are unique to each id):
val data = inputRdd
.map(y => (y(1).toLong, y(2), y(3).toDouble))
.toDF("id", "counter", "value")
.groupBy("id")
.agg(collect_list(map($"counter", $"value")))
.as[(Long, Seq[Map[String, Double]])]
.map(r => (r._1, r._2.reduce(_++_)))
//this results in a Dataset[(Long, Map[String, Double])]
A Spark ml.linalg.Vector is basically an Array[Double], and would require a fixed number of counters for every record. You could transform the data above into a Vector by ordering the Map[String, Double] by its ._1 and creating a Vector from its .values.
ml.linalg.Vectors is just a helper object with functions for creating Vector objects.
Factory methods for org.apache.spark.ml.linalg.Vector. We don't use the name Vector because Scala imports scala.collection.immutable.Vector by default.
It's also worth noting that mllib is intended for the older RDD API while ml is intended for the newer Dataframe/Dataset API.
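The Map-to-Vector step mentioned above can be sketched in plain Scala. The counters value below is a made-up example shaped like one entry of the Dataset[(Long, Map[String, Double])], and the positions only line up across ids if every id has the same set of counters.

```scala
// Hypothetical per-id result, shaped like one entry of the Dataset above.
val counters: Map[String, Double] = Map("c2" -> 0.1, "c1" -> 0.5)

// Sort by counter name so the same counter always lands in the same position.
val ordered: Seq[(String, Double)] = counters.toSeq.sortBy(_._1)
val values: Array[Double] = ordered.map(_._2).toArray

println(ordered.map(_._1).mkString(", "))
println(values.mkString(", "))
// With Spark available, org.apache.spark.ml.linalg.Vectors.dense(values)
// would turn this array into a Vector.
```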
Edit: RDD[(Long, Seq[(String, Double)])]
val data = inputRdd
.map(y => (y(1).toLong, Seq[(String, Double)]((y(2), y(3).toDouble))))
.reduceByKey(_++_)

Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector for one of the movie's fields such as Title, Plot, Genres, Actors, etc.). For Actors and Genres, for example, the vector shows whether a given actor is present (1) or absent (0) in the movie.
The task is to find top 10 similar movies for each movie.
I managed to write a script in Scala that performs all those computations and does the job. It works for smaller sets of movies such as 1000 movies but not for the whole dataset (out of memory, etc.).
The way I do this computation is by using a cross join on the movies dataset. Then reduce the problem by only taking rows where movie1_id < movie2_id.
Still the dataset at this point will contain 46000^2/2 rows which is 1058000000.
And each row has significant amount of data.
Then I calculate similarity score for each row. After similarity is calculated I group the results where movie1_id is same and sort them in descending order by similarity score using a Window function taking top N items (similar to how it's described here: Spark get top N highest score results for each (item1, item2, score)).
The question is - can it be done more efficiently in Spark? E.g. without having to perform a crossJoin?
And another question - how does Spark deal with such huge Dataframes (1058000000 rows consisting of multiple SparseVectors)? Does it have to keep all this in memory at a time? Or does it process such dataframes piece by piece somehow?
I'm using the following function to calculate similarity between movie vectors:
def intersectionCosine(movie1Vec: SparseVector, movie2Vec: SparseVector): Double = {
val a: BSV[Double] = toBreeze(movie1Vec)
val b: BSV[Double] = toBreeze(movie2Vec)
var dot: Double = 0
var offset: Int = 0
while( offset < a.activeSize) {
val index: Int = a.indexAt(offset)
val value: Double = a.valueAt(offset)
dot += value * b(index)
offset += 1
}
val bReduced: BSV[Double] = new BSV(a.index, a.index.map(i => b(i)), a.index.length)
val maga: Double = magnitude(a)
val magb: Double = magnitude(bReduced)
if (maga == 0 || magb == 0)
return 0
else
return dot / (maga * magb)
}
Each row in the Dataframe consists of two joined classes:
final case class MovieVecData(imdbID: Int,
Title: SparseVector,
Decade: SparseVector,
Plot: SparseVector,
Genres: SparseVector,
Actors: SparseVector,
Countries: SparseVector,
Writers: SparseVector,
Directors: SparseVector,
Productions: SparseVector,
Rating: Double
)
It can be done more efficiently, as long as you are fine with approximations and don't require exact results (or an exact number of results).
Similarly to my answer to Efficient string matching in Apache Spark you can use LSH, with:
BucketedRandomProjectionLSH to approximate Euclidean distance.
MinHashLSH to approximate Jaccard Distance.
If feature space is small (or can be reasonably reduced) and each category is relatively small you can also optimize your code by hand:
explode feature array to generate #features records from a single record.
Self join result by feature, compute distance and filter out candidates (each pair of records will be compared if and only if they share specific categorical feature).
Take top records using your current code.
A minimal example would be (consider it pseudocode):
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.{explode, udf}
// This is oversimplified. In practice don't assume only sparse scenario
val indices = udf((v: SparseVector) => v.indices)
val df = Seq(
(1L, Vectors.sparse(1024, Array(1, 3, 5), Array(1.0, 1.0, 1.0))),
(2L, Vectors.sparse(1024, Array(3, 8, 12), Array(1.0, 1.0, 1.0))),
(3L, Vectors.sparse(1024, Array(3, 5), Array(1.0, 1.0))),
(4L, Vectors.sparse(1024, Array(11, 21), Array(1.0, 1.0))),
(5L, Vectors.sparse(1024, Array(21, 32), Array(1.0, 1.0)))
).toDF("id", "features")
val possibleMatches = df
.withColumn("key", explode(indices($"features")))
.transform(df => df.alias("left").join(df.alias("right"), Seq("key")))
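// intersectionCosine is the function from the question; the threshold is supplied by the caller
def closeEnough(threshold: Double) = udf((v1: SparseVector, v2: SparseVector) => intersectionCosine(v1, v2) > threshold)
// 0.5 is an arbitrary threshold used only for illustration
possibleMatches.filter(closeEnough(0.5)($"left.features", $"right.features")).select($"left.id", $"right.id").distinct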
Note that both solutions are worth the overhead only if hashing / features are selective enough (and optimally sparse). In the example shown above you'd compare only rows inside set {1, 2, 3} and {4, 5}, never between sets.
However, in the worst case (M records, N features) we can make N * M^2 comparisons instead of M^2.
Another thought: given that your matrix is relatively small and sparse, it can fit in memory using Breeze CSCMatrix[Int].
Then you can compute co-occurrences using A'B (A.transposed * B), followed by a top-N selection on the LLR (log-likelihood ratio) of each pair. Since you keep only the top 10 items per row, the output matrix will be very sparse as well.
You can look up the details here:
https://github.com/actionml/universal-recommender
You can borrow from the idea of locality sensitive hashing. Here is one approach:
Define a set of hash keys based on your matching requirements. You would use these keys to find potential matches. For example, a possible hash key could be based on the movie actor vector.
Perform reduce for each key. This will give sets of potential matches. For each of the potential matched set, perform your "exact match". The exact match will produce sets of exact matches.
Run Connected Component algorithm to perform set merge to get the sets of all exact matches.
I have implemented something similar using the above approach.
Hope this helps.
Another possible solution would be to use the built-in RowMatrix and brute-force columnSimilarities, as explained here:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
Notes:
Keep in mind that you will always have N^2 values in resulting similarity matrix
You will have to concatenate your sparse vectors
One very important suggestion, which I have used in similar scenarios: suppose some movies have the following similarity scores:
relation, similarity score
A -> B, 8/10
B -> C, 7/10
C -> D, 9/10
If
E -> A, 4 (less than some threshold or hyperparameter)
then don't calculate the similarity for
E -> B
E -> C
E -> D

Spark migrate sql window function to RDD for better performance

A function should be executed for multiple columns in a data frame
def handleBias(df: DataFrame, colName: String, target: String = target) = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
This can be written nicely as shown above in spark-SQL and a for loop. However this is causing a lot of shuffles (spark apply function to columns in parallel).
A minimal example:
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
val targetCounts = df.filter(df(target) === 1).groupBy(target)
.agg(count(target).as("cnt_foo_eq_1"))
val newDF = df.join(broadcast(targetCounts), Seq(target), "left")
val result = (columnsToDrop ++ columnsToCode).toSet.foldLeft(newDF) {
(currentDF, colName) => handleBias(currentDF, colName)
}
result.drop(columnsToDrop: _*).show
How can I formulate this more efficiently using the RDD API? aggregateByKey should be a good idea, but it is still not very clear to me how to apply it here to substitute the window functions.
(provides a bit more context / bigger example https://github.com/geoHeil/sparkContrastCoding)
edit
Initially, I started with the approach from "Spark dynamic DAG is a lot slower and different from hard coded DAG", which is shown below. The good thing is that each column seems to run independently, in parallel. The downside is that the joins (even for a small dataset of 300 MB) get "too big" and lead to an unresponsive Spark.
handleBiasOriginal("col1", df)
.join(handleBiasOriginal("col2", df), df.columns)
.join(handleBiasOriginal("col3TooMany", df), df.columns)
.drop(columnsToDrop: _*).show
def handleBiasOriginal(col: String, df: DataFrame, target: String = target): DataFrame = {
val pre1_1 = df
.filter(df(target) === 1)
.groupBy(col, target)
.agg((count("*") / df.filter(df(target) === 1).count).alias("pre_" + col))
.drop(target)
val pre2_1 = df
.groupBy(col)
.agg(mean(target).alias("pre2_" + col))
df
.join(pre1_1, Seq(col), "left")
.join(pre2_1, Seq(col), "left")
.na.fill(0)
}
This image is with spark 2.1.0, the images from Spark dynamic DAG is a lot slower and different from hard coded DAG are with 2.0.2
The DAG will be a bit simpler when caching is applied
df.cache
handleBiasOriginal("col1", df). ...
What other possibilities than window functions do you see to optimize the SQL?
At best it would be great if the SQL was generated dynamically.
The main point here is to avoid unnecessary shuffles. Right now your code shuffles twice for each column you want to include and the resulting data layout cannot be reused between columns.
For simplicity I assume that target is always binary ({0, 1}) and all remaining columns you use are of StringType. Furthermore I assume that the cardinality of the columns is low enough for the results to be grouped and handled locally. You can adjust these methods to handle other cases but it requires more work.
RDD API
Reshape data from wide to long:
import org.apache.spark.sql.functions._
val exploded = explode(array(
(columnsToDrop ++ columnsToCode).map(c =>
struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
aggregateByKey, reshape and collect:
import org.apache.spark.util.StatCounter
val lookup = long.as[((String, String), Int)].rdd
// You can use prefix partitioner (one that depends only on _._1)
// to avoid reshuffling for groupByKey
.aggregateByKey(StatCounter())(_ merge _, _ merge _)
.map { case ((c, v), s) => (c, (v, s)) }
.groupByKey
.mapValues(_.toMap)
.collectAsMap
You can use lookup to get statistics for individual columns and levels. For example:
lookup("col1")("A")
org.apache.spark.util.StatCounter =
(count: 3, mean: 0.666667, stdev: 0.471405, max: 1.000000, min: 0.000000)
Gives you data for col1, level A. Based on the binary TARGET assumption this information is complete (you get count / fractions for both classes).
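As a worked illustration of why count and mean are enough under the binary TARGET assumption, the class counts can be recovered with simple arithmetic. The numbers below are copied from the StatCounter output above; the variable names are only for illustration.

```scala
// Values copied from the StatCounter output above for col1, level "A".
val count = 3L
val mean = 0.666667

// With a 0/1 target, the mean is the fraction of ones.
val ones = math.round(count * mean) // rows with TARGET == 1
val zeros = count - ones            // rows with TARGET == 0
println(s"ones=$ones zeros=$zeros")
```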
You can use lookup like this to generate SQL expressions or pass it to udf and apply it on individual columns.
DataFrame API
Convert data to long as for RDD API.
Compute aggregates based on levels:
val stats = long
.groupBy($"level.k", $"level.v")
.agg(mean($"TARGET"), sum($"TARGET"))
Depending on your preferences you can reshape this to enable efficient joins, or convert it to a local collection and use it similarly to the RDD solution.
Using aggregateByKey
A simple explanation on aggregateByKey can be found here. Basically you use two functions: One which works inside a partition and one which works between partitions.
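The two functions can be illustrated without Spark by simulating partitions with nested collections. This is only a sketch of the idea: the data, the (count, sum) accumulator, and the names seqOp and combOp are chosen for illustration, although they mirror the parameter names of aggregateByKey(zeroValue)(seqOp, combOp).

```scala
// Two hypothetical partitions of (key, target) pairs.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq("A" -> 0, "A" -> 1, "d" -> 0),
  Seq("A" -> 1, "d" -> 0, "d" -> 1)
)

type Acc = (Long, Long) // (count, sum)
val zero: Acc = (0L, 0L)

// seqOp: folds one value into the accumulator inside a partition.
def seqOp(acc: Acc, v: Int): Acc = (acc._1 + 1, acc._2 + v)
// combOp: merges accumulators that come from different partitions.
def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

// Step 1: aggregate per key within each partition.
val perPartition: Seq[Map[String, Acc]] = partitions.map { part =>
  part.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
    m.updated(k, seqOp(m.getOrElse(k, zero), v))
  }
}

// Step 2: merge the partial results across partitions.
val merged: Map[String, Acc] = perPartition.flatten.groupBy(_._1).map {
  case (k, accs) => k -> accs.map(_._2).reduce(combOp)
}
println(merged)
```

Only the small per-key accumulators cross partition boundaries in step 2, which is the reason this pattern tends to shuffle less data than grouping all values first.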
You would need to do something like aggregate by the first column and build a data structure internally with a map for every element of the second column to aggregate and collect data there (of course you could do two aggregateByKey if you want).
This will not solve the case of doing multiple runs of the code for each column you want to work with (you can use aggregate as opposed to aggregateByKey to work on all data and put it in a map, but that will probably give you even worse performance). The result would then be one line per key; if you want to move back to the original records (as a window function does), you would need to either join this value with the original RDD or save all values internally and flatMap.
I do not believe this would provide you with any real performance improvement. You would be doing a lot of work to reimplement things that are done for you in SQL and while doing so you would be losing most of the advantages of SQL (catalyst optimization, tungsten memory management, whole stage code generation etc.)
Improving the SQL
What I would do instead is attempt to improve the SQL itself.
For example, the value computed by the window function appears to be the same for every row in a group. Do you really need a window function? You could use a groupBy instead and, if you really need the value per record, join the results back. This might perform better, as it would not necessarily mean shuffling everything twice at every step.
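The groupBy-then-join idea can be sketched with plain Scala collections. The rows below are the TARGET and col1 columns of the minimal example from the question; in Spark the two steps would roughly correspond to a groupBy(colName).agg(mean(target)) followed by a join on colName, which is an assumption about how one would translate it rather than tested Spark code.

```scala
// (TARGET, col1) pairs taken from the question's minimal example.
val rows: Seq[(Int, String)] = Seq(
  (0, "A"), (1, "A"), (0, "d"), (0, "d"), (1, "A"), (1, "d"), (1, "c")
)

// groupBy step: one aggregate per level, the mean of TARGET for each col1 value.
val meanByLevel: Map[String, Double] = rows.groupBy(_._2).map {
  case (level, group) => level -> group.map(_._1).sum.toDouble / group.size
}

// join step: attach the aggregate back to every record.
val withMean: Seq[(Int, String, Double)] = rows.map {
  case (target, level) => (target, level, meanByLevel(level))
}
withMean.foreach(println)
```

The aggregate has one row per level, so when the column has low cardinality it is small enough to broadcast for the join.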

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala:
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
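The same zip, toMap, and fill pipeline can be mirrored on plain Scala collections to make each step visible. The table below is made up for illustration, with None standing in for a null; a column with no values at all would produce NaN in this sketch.

```scala
// Hypothetical two-column table; None marks a missing value.
val columns: Seq[String] = Seq("ColA", "ColB")
val data: Seq[Seq[Option[Double]]] = Seq(
  Seq(Some(1.0), Some(10.0)),
  Seq(None, Some(20.0)),
  Seq(Some(3.0), None)
)

// Mean of the non-missing values in each column (like mean(_) per column).
val means: Seq[Double] = columns.indices.map { i =>
  val present = data.flatMap(row => row(i).toList)
  present.sum / present.size
}

// Column name -> mean (like df.columns.zip(...).toMap).
val fillValues: Map[String, Double] = columns.zip(means).toMap

// Replace each missing value with its column mean (like df.na.fill(...)).
val imputed: Seq[Seq[Double]] = data.map { row =>
  row.zip(columns).map { case (v, c) => v.getOrElse(fillValues(c)) }
}
println(imputed)
```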
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs

spark scala avoid shuffling after RowMatrix calculation

Scala Spark: how to avoid RDD shuffling in a join after a distributed matrix operation.
I created a dense matrix as input to calculate the cosine distance between columns:
val rowMarixIn = sc.textFile("input.csv").map{ line =>
val values = line.split(" ").map(_.toDouble)
Vectors.dense(values)
}
Then I extracted a set of entries from the coordinate matrix after the cosine calculation:
val coMatrix = new RowMatrix(rowMarixIn)
val similarRows = coMatrix.columnSimilarities()
// extract entries over a specific threshold
val rowIndices = similarRows.entries.collect {
  case MatrixEntry(row, col, sim) if sim > someThreshold => (col, sim)
}
We have another RDD, rdd2(key, Val2), and just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):
val joinedRDD = rowIndices.join(rdd2)
This will result in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.