RowMatrix from DataFrame containing null values - scala

I have a DataFrame of user ratings (from 1 to 5) of movies. In order to get a DataFrame where the first column is the movie id and the remaining columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null, due to the fact that most users have rated only a few movies.
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarity calculations using MLlib) from the rating column values. However, I don't know how to deal with the null values.
The following code tries to get a Vector for each row:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(ratingsPerMovieDF.columns.filter(_ != "imdbId"))
  .setOutputCol("ratings")
val ratingsDF = assembler.transform(ratingsPerMovieDF).select("imdbId", "ratings")
It gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute them with 0s using .na.fill(0) but that would produce incorrect correlation results since almost all Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore null values), but I don't know how.
I'm new to Spark and Scala so some of this might make little sense. I'm trying to understand things better.

I believe you are approaching this the wrong way. Dealing with the nuances of the Spark API is secondary to a proper problem definition: what exactly do you mean by correlation in the case of sparse data?
Filling the data with zeros in the case of explicit feedback (ratings) is problematic, not because all Vectors would become very similar (variation of the metric is driven by the existing ratings, and results can always be rescaled with a min-max scaler), but because it introduces information that is not present in the original dataset. There is a significant difference between an item that hasn't been rated and an item that has the lowest possible rating.
Overall you can approach this problem in two ways:
You can compute pairwise similarity using only entries where both items have non-missing values. This should work reasonably well if the dataset is reasonably dense. It can be expressed as a self-join on the (long-format) input dataset; in pseudocode:
imdbRatingsDF.alias("left")
.join(imdbRatingsDF.alias("right"), Seq("userId"))
.where($"left.imdbId" =!= $"right.imdbId")
.groupBy($"left.imdbId", $"right.imdbId")
.agg(simlarity($"left.rating", $"right.rating"))
where similarity implements the required similarity metric.
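For example, cosine similarity over the co-rating users can be expressed directly with built-in aggregate functions. A minimal sketch, assuming the long-format input has the columns userId, imdbId and rating, and that spark is the active SparkSession:
import org.apache.spark.sql.functions.{sum, sqrt}
import spark.implicits._

// Self-join on userId, keep distinct movie pairs, and aggregate cosine similarity
// over the users who rated both movies.
val pairwiseSim = imdbRatingsDF.alias("left")
  .join(imdbRatingsDF.alias("right"), Seq("userId"))
  .where($"left.imdbId" =!= $"right.imdbId")
  .groupBy($"left.imdbId", $"right.imdbId")
  .agg(
    (sum($"left.rating" * $"right.rating") /
      (sqrt(sum($"left.rating" * $"left.rating")) *
       sqrt(sum($"right.rating" * $"right.rating")))).alias("similarity"))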
You can impute missing ratings, for example using some measure of central tendency. Using the average (see Replace missing values with mean - Spark Dataframe) is probably the most natural choice.
More advanced imputation techniques might provide more reliable results, but likely won't scale very well in a distributed system.
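As a minimal sketch of the mean-imputation variant, Spark's ml.feature.Imputer (2.2+) can fill each pivoted rating column with that column's mean. This assumes the rating columns are of double type; the output column naming is hypothetical, and whether per-user-column means are appropriate is left to interpretation:
import org.apache.spark.ml.feature.Imputer

// Every column except imdbId is a per-user rating column produced by the pivot.
val ratingCols = ratingsPerMovieDF.columns.filter(_ != "imdbId")

val imputer = new Imputer()
  .setInputCols(ratingCols)
  .setOutputCols(ratingCols.map(_ + "_imputed"))  // Imputer requires fresh output columns
  .setStrategy("mean")                            // replace missing values with the column mean

val imputedDF = imputer.fit(ratingsPerMovieDF).transform(ratingsPerMovieDF)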
Note
Using SparseVectors is essentially equivalent to na.fill(0).

Related

PySpark MLlib approximate nearest neighbour search for multiple keys

I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches on an already transformed Spark DataFrame. But it seems that the API of BucketedRandomProjectionLSH expects only one key at a time. I also want to avoid using approxSimilarityJoin, because it only allows you to set a threshold, which would lead to a variable k for each key (it also fails in my case, saying that for some records there are no NNs for the given threshold).
Currently, the best thing I came up with is to .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but that is terribly inefficient.
Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?
Thank you.

Converting a Spark DataFrame to the type used by DL4j

Is there any convenient way to convert a Spark DataFrame to the type used by DL4j? Currently, when using a DataFrame in algorithms with DL4j, I get an error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use DataVec for that. I can point you at examples if you want. DataFrames make too many assumptions that make them too brittle to be used for real-world deep learning.
Beyond that, a data frame is typically not a good abstraction for representing linear algebra (it falls down when dealing with images, for example).
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a DataSet is just a pair of ndarrays, much like in NumPy. If you have to use Spark tools and only want to use ndarrays for the last mile, then my advice would be to get the DataFrame to match some form of purely numerical schema and map each row to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off-heap.
Spark has many limitations when it comes to working with its data pipelines and using the JVM for things it shouldn't be used for (matrix math); we took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map DataFrame rows onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray), yourLabelINDArray))
That will give you a "dataset". You need a pair of ndarrays matching your input data and a label.
The label from there depends on the kind of problem you're solving, whether that is classification or regression.
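Roughly, the "purely numerical schema -> ndarray row" idea could look like this in Scala. This is only a sketch, not the official DataVec path; numericDF is a hypothetical DataFrame whose columns are all doubles, with the last column treated as the label:
import org.apache.spark.sql.Row
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

val dataSetRDD = numericDF.rdd.map { row: Row =>
  val values = (0 until row.length).map(row.getDouble).toArray
  val features = Nd4j.create(values.init)          // every column except the last
  val label    = Nd4j.create(Array(values.last))   // last column as the label
  new DataSet(features, label)                     // a DataSet is just the (features, labels) pair of ndarrays
}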

Jaccard Similarity of an RDD with the help of Spark and Scala without Cartesian?

I am working on pair RDDs. My aim is to calculate the Jaccard similarity between the sets of RDD values and cluster them according to a Jaccard similarity threshold. The structure of my RDD is:
val a: RDD[(Key, Set[String])] // pair RDD
For example:
India, [Country, Place, ...]
USA, [Country, State, ...]
Berlin, [City, PopulatedPlace, ...]
After finding the Jaccard similarity, I will cluster the similar entities together. In the above example, India and USA will be clustered into one cluster based on some threshold value, whereas Berlin will be in another cluster.
So I took the Cartesian product of RDD a:
val filterOnjoin = a.cartesian(a).filter(f =>
  !f._1._1.toString().contentEquals(f._2._1.toString()))
// Cartesian product of RDD a, filtering out rows with the same key in both positions,
// e.g. ((India,Set[Country,Place,....]),(USA,Set[Country,State,..]))
and compared the sets of values with the help of Jaccard similarity:
val Jsim = filterOnjoin.map(f => (f._1._1,
  (f._2._1, Similarity.sim(f._1._2, f._2._2)))) // calculating Jaccard similarity
// e.g. (India, (USA, 0.8))
The code runs fine on a smaller dataset. As the size of the dataset increases, the Cartesian product takes too much time. For 100 MB of data (the size of RDD "a"), the shuffle read is around 25 GB. For 3.5 GB of data, it is in the TB range.
I have gone through various links, like Spark tuning methods and some posts on Stack Overflow, but most of them say to broadcast the smaller RDD. Here, however, both RDDs are the same size, and they are big.
Links which I followed:
Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]
Spark repartition is slow and shuffles too much data
Map key, value pair based on similarity of their value in Spark
I am new to Spark and Scala. I am unable to think beyond the Cartesian product, which is the bottleneck here. Is it possible to solve this problem without a Cartesian product?
As the Cartesian product is an expensive operation on RDDs, I tried to solve the above problem by using the HashingTF and MinHashLSH functionality in Spark MLlib to find the Jaccard similarity. Steps to find the Jaccard similarity in the RDD "a" mentioned in the question:
Convert the RDD into a DataFrame:
import sparkSession.implicits._
val dfA = a.toDF("id", "values")
Create the feature vector with the help of HashingTF
val hashingTF = new HashingTF()
.setInputCol("values").setOutputCol("features").setNumFeatures(1048576)
Feature transformation
val featurizedData = hashingTF.transform(dfA) //Feature Transformation
Create the MinHash table. The higher the number of hash tables, the more accurate the results will be, but the higher the communication cost and runtime:
val mh = new MinHashLSH()
.setNumHashTables(3)
.setInputCol("features")
.setOutputCol("hashes")
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
val model = mh.fit(featurizedData)
// Approximately join featurizedData with itself, keeping pairs with a
// Jaccard distance smaller than 0.45
val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData, 0.45)
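The joined result exposes the struct columns datasetA and datasetB plus the distance column distCol (per the Spark LSH docs). A small sketch for pulling out the id pairs and dropping the self-matches:
import org.apache.spark.sql.functions.col

val similarPairs = dffilter
  .filter(col("datasetA.id") =!= col("datasetB.id"))   // drop the self-joined duplicates
  .select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("distCol"))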
Since in Spark we have to do manual optimization in our code, like setting the number of partitions, the persist level, etc., I configured those parameters as well.
Changing the storage level from persist() to persist(StorageLevel.MEMORY_AND_DISK) helped me remove the OOM errors.
Also, while doing the join operation, I re-partitioned the data according to the RDD size. On a 16.6 GB dataset, a simple join operation used 200 partitions; increasing it to 600 also solved my OOM-related problems.
PS: the constant parameters setNumFeatures(1048576) and setNumHashTables(3) were chosen while experimenting on the 16.6 GB dataset. You can increase or decrease these values according to your dataset. The number of partitions also depends on your dataset size. With these optimizations, I got the desired results.
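In code, that tuning is roughly the following sketch; 600 partitions is simply the value that worked for the 16.6 GB run and should be adjusted to your data:
import org.apache.spark.storage.StorageLevel

val tuned = featurizedData
  .repartition(600)                        // more partitions for the big self-join
  .persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk instead of failing with OOM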
Useful links:
https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing
https://eng.uber.com/lsh/
https://data-flair.training/blogs/limitations-of-apache-spark/

Are there alternative solutions without cross-join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and about 20 million records:
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and about 9,000 records:
dict key (String)
vector of embedding result (Array, 300 dim)
For each vector in dataset 1, I want to find the dict key in dataset 2 that gives the maximum cosine similarity.
Initially, I tried to cross-join dataset1 and dataset2 and calculate the cosine similarity for all record pairs, but the amount of data is too large for my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying a UDF.
Is there any other method for this situation?
Thanks,
There might be two options. The first is to broadcast Dataset2, since you need to scan it for each row of Dataset1; this avoids the network delays of accessing it from a different node. Of course, in this case you first need to consider whether your cluster can handle the memory cost of 9,000 rows x 300 columns (not too big in my opinion). You still need your join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
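A sketch of the broadcast option, assuming hypothetical DataFrames dataset1(id, vector) and dataset2(dictKey, vector) with the embeddings stored as array<double>, and spark as the active SparkSession:
import org.apache.spark.sql.functions.{col, udf}

// Collect the small dictionary (~9,000 x 300 doubles) and broadcast it to the executors.
val dict: Array[(String, Seq[Double])] = dataset2
  .select("dictKey", "vector")
  .collect()
  .map(r => (r.getString(0), r.getSeq[Double](1)))
val dictBc = spark.sparkContext.broadcast(dict)

// Plain cosine similarity between two dense vectors.
def cosine(a: Seq[Double], b: Seq[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// For every embedding in the large dataset, return the dictionary key with the
// highest cosine similarity.
val bestKey = udf { v: Seq[Double] =>
  dictBc.value.maxBy { case (_, dv) => cosine(v, dv) }._1
}

val result = dataset1.withColumn("bestDictKey", bestKey(col("vector")))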

Join or subtractByKey on 2 huge RDDs

I am building a recommendation system for retail purposes. I use Python and Spark.
I am trying to subtract all user-product combinations of my predictions which also occur in the ratings (so I only predict values for products users never bought before).
Those two RDDs are pretty large and are giving me memory issues with 28 GB per worker node (3 nodes) when I do:
filter_predictions = predictions.subtractByKey(user_boughtproduct)
From the Spark documentation, subtractByKey is optimal when using one large and one small RDD.
I cannot make user_boughtproduct smaller (unless I loop over it), but I could instead do:
filter_predictions = predictions.join(user_nonBoughtProduct)
Any thoughts on which of them is faster or best practice? Or is there another, cleaner solution?
subtractByKey pushes filters after co-grouping and doesn't have to touch the right values, so it should be slightly more efficient than using an outer join and filtering after flattening.
If you use Spark 2.0+ and the records can be encoded using Dataset encoders, you can consider a left anti join, but depending on the rest of your code, the cost of moving the data can negate the benefits of an optimized execution.
Finally, if you can accept potential data loss, then building a Bloom filter on the right RDD and using it to filter the left one can give really good results without shuffling.
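For reference, the left anti join option could look roughly like this (a Scala sketch; the DataFrame API is nearly identical in PySpark, and predictionsDF / boughtDF with user and product columns are assumed names, not from the question):
// Keep only predictions whose (user, product) pair has no match among the purchases.
val filteredPredictions = predictionsDF
  .join(boughtDF, Seq("user", "product"), "left_anti")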