How to set the target feature dimension in Spark MLlib's HashingTF() function?

Apache Spark MLlib has a HashingTF() function that takes tokenized words as input and converts those sets
into fixed-length feature vectors.
As mentioned in the Spark MLlib documentation,
it is advisable to use a power of two as the feature dimension.
The question is whether the exponent should be the number of terms in the input.
If yes, suppose I take more than 1000 text documents as input, containing more than 5000 terms; then the feature dimension becomes 2^5000.
Is my assumption correct, or is there another way to find the exponent value?

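The HashingTF documentation says: "it is advisable to use power of two as the feature dimension". I think this means numFeatures = 2^n, where n is chosen so that 2^n is at least the vocabulary size, not so that n equals the number of terms.
For example, if your vocabulary size is 900, then numFeatures should be > 900 and a power of two, so 2^10 (= 1024) could be a good estimate. For the 5000 terms in the question, the same rule gives 2^13 (= 8192), not 2^5000.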
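The rule of thumb above can be sketched in a few lines of plain Python. The helper name next_power_of_two is made up for illustration and is not part of Spark. It returns the smallest power of two that is at least the vocabulary size, which is only a starting estimate: a larger value reduces hash collisions at the cost of wider vectors.

```python
def next_power_of_two(vocab_size: int) -> int:
    """Smallest power of two that is >= vocab_size (vocab_size must be >= 1)."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    # (vocab_size - 1).bit_length() is the smallest n with 2**n >= vocab_size.
    return 1 << (vocab_size - 1).bit_length()


if __name__ == "__main__":
    for size in (900, 5000):
        print(size, "->", next_power_of_two(size))
```

The result could then be passed to setNumFeatures; doubling it once or twice is a reasonable thing to try if collisions seem to hurt accuracy.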

Related

Jaccard Similarity of an RDD with the help of Spark and Scala without Cartesian?

I am working on pair RDDs. My aim is to calculate the Jaccard similarity
between the sets of RDD values and cluster them according to a Jaccard similarity threshold. The structure of my RDD is:
val a= [Key,Set(String)] //Pair RDD
For example:-
India,[Country,Place,....]
USA,[Country,State,..]
Berlin,[City,PopulatedPlace,..]
After finding the Jaccard similarity, I will cluster similar entities together. In the above example, India and USA will be put into one cluster based on some threshold value, whereas Berlin will be in another cluster.
So I took the Cartesian product of RDD a
val filterOnjoin = a.cartesian(a).filter(f =>
(!f._1._1.toString().contentEquals(f._2._1.toString())))
//Cartesianproduct of rdd a and filtering rows with same key at both
//the position.
//e.g. ((India,Set[Country,Place,....]),(USA,Set[Country,State,..]))
and compare the set of values with the help of jaccard similarity.
val Jsim = filterOnjoin.map(f => (f._1._1, (f._2._1,
Similarity.sim(f._1._2, f._2._2)))) //calculating jaccard similarity.
//(India,USA,0.8)
The code runs fine on a smaller dataset. As the dataset grows, the Cartesian product takes too much time. For 100 MB of data (the size of RDD "a"), the shuffle read is around 25 GB. For 3.5 GB of data, it is in the terabytes.
I have gone through various links, such as Spark tuning guides and some Stack Overflow posts. Most of them suggest broadcasting the smaller RDD, but here both RDDs are the same size, and that size is big.
Links which I followed :-
Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]
Spark repartition is slow and shuffles too much data
Map key, value pair based on similarity of their value in Spark
I am new to Spark and Scala. I cannot think beyond the Cartesian product, which is the bottleneck here. Is it possible to solve this problem without a Cartesian product?
Since the Cartesian product is an expensive operation on an RDD, I tried to solve the above problem using the HashingTF and MinHashLSH classes in Spark MLlib to find the Jaccard similarity. The steps to find the Jaccard similarity for RDD "a" from the question are:
Convert the RDD into a DataFrame
import sparkSession.implicits._
val dfA = a.toDF("id", "values")
Create the feature vector with the help of HashingTF
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH}
val hashingTF = new HashingTF()
.setInputCol("values").setOutputCol("features").setNumFeatures(1048576) // 2^20 features
Feature transformation
val featurizedData = hashingTF.transform(dfA) //Feature Transformation
Create the MinHash LSH estimator. The more hash tables you use, the more accurate
the results will be, but at the price of higher communication cost and run time.
val mh = new MinHashLSH()
.setNumHashTables(3)
.setInputCol("features")
.setOutputCol("hashes")
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
val model = mh.fit(featurizedData)
//Approximately joining featurizedData with Jaccard distance smaller
//than 0.45
val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData,
0.45)
In Spark we have to tune some things manually, such as the number of partitions and the persistence level, so I configured these parameters as well.
Changing the storage level from persist() to persist(StorageLevel.MEMORY_AND_DISK)
helped me get rid of the OOM error.
Also, for the join operation, I repartitioned the data according to the RDD
size. On a 16.6 GB dataset, I was using 200 partitions for a simple join;
increasing this to 600 also solved my OOM problem.
PS: the constants in setNumFeatures(1048576) and setNumHashTables(3) were chosen while experimenting on the 16.6 GB dataset. You can increase or decrease these values according to your dataset. The number of partitions also depends on your dataset size. With these optimizations, I got the results I wanted.
Useful links:-
[https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing]
[https://eng.uber.com/lsh/]
[https://data-flair.training/blogs/limitations-of-apache-spark/]
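To give some intuition for why MinHash can stand in for the exact pairwise comparison, here is a small plain-Python sketch. It is not Spark's MinHashLSH implementation: the function names are made up, and the seeded BLAKE2b hash is just an assumed stand-in for the random hash family that MinHash requires. The idea is that the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _hash(token: str, seed: int) -> int:
    # Seeded 64-bit hash; an assumed stand-in for a random hash family.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set, num_hashes: int) -> list:
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    india = {"Country", "Place", "Republic"}
    usa = {"Country", "State", "Republic"}
    print("exact:   ", jaccard(india, usa))
    print("estimate:", estimated_jaccard(minhash_signature(india, 128),
                                         minhash_signature(usa, 128)))
```

Note that approxSimilarityJoin thresholds on Jaccard distance, which is 1 minus the similarity, so the 0.45 used above keeps pairs whose similarity is above 0.55.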

RowMatrix from DataFrame containing null values

I have a DataFrame of user ratings (from 1 to 5) for movies. To get a DataFrame where the first column is the movie id and the remaining columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null because most users have rated only a few movies.
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarity calculations using mllib) from the rating column values. However, I don't know how to deal with null values.
The following code, where I try to get a Vector for each row:
val assembler = new VectorAssembler()
// Array.drop takes a count, not a name, so filter the id column out instead
.setInputCols(movieRatingsDF.columns.filter(_ != "imdbId"))
.setOutputCol("ratings")
val ratingsDF = assembler.transform(movieRatingsDF).select("imdbId", "ratings")
Gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute them with 0s using .na.fill(0), but that would produce incorrect correlation results since almost all Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore null values), but I don't know how.
I'm new to Spark and Scala so some of this might make little sense. I'm trying to understand things better.
I believe you are approaching this the wrong way. Dealing with the nuances of the Spark API is secondary to a proper problem definition: what exactly do you mean by correlation in the case of sparse data?
Filling data with zeros in the case of explicit feedback (ratings) is problematic, not because all Vectors would become very similar (the variation of the metric will be driven by the existing ratings, and the results can always be rescaled using a min-max scaler), but because it introduces information which is not present in the original dataset. There is a significant difference between an item which hasn't been rated and an item which has the lowest possible rating.
Overall you can approach this problem in two ways:
You can compute pairwise similarity using only the entries where both items have non-missing values. This should work reasonably well if the dataset is reasonably dense. It can be expressed as a self-join on the input dataset. In pseudocode:
imdbRatingsDF.alias("left")
.join(imdbRatingsDF.alias("right"), Seq("userId"))
.where($"left.imdbId" =!= $"right.imdbId")
.groupBy($"left.imdbId", $"right.imdbId")
.agg(similarity($"left.rating", $"right.rating"))
where similarity implements the required similarity metric.
You can impute missing ratings, for example using some measure of central tendency. Using the average (see "Replace missing values with mean - Spark Dataframe") is probably the most natural choice.
More advanced imputation techniques might provide more reliable results, but likely won't scale very well in a distributed system.
Note
Using SparseVectors is essentially equivalent to na.fill(0).
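As a rough illustration of the first approach (similarity over co-rated entries only), here is an in-memory plain-Python sketch. It is not Spark code: the function names are hypothetical, and cosine similarity is only an assumed choice of metric, which you would swap for whatever "correlation" means in your problem.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(xs, ys):
    """Cosine similarity of two equal-length rating lists."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0


def co_rated_similarity(ratings, similarity=cosine):
    """ratings: iterable of (user_id, item_id, rating) triples.

    Returns {(item_a, item_b): score}, using only users who rated both items.
    """
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    # Collect paired ratings per item pair: the in-memory analogue of the
    # self-join on userId followed by a groupBy on the two item ids.
    paired = defaultdict(lambda: ([], []))
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            left, right = paired[(a, b)]
            left.append(items[a])
            right.append(items[b])

    return {pair: similarity(l, r) for pair, (l, r) in paired.items()}
```

Unlike the self-join, which emits both (a, b) and (b, a), this sketch keeps each unordered pair once. Missing ratings never enter the computation, so nothing has to be filled in.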

Some questions about train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state) give me a different test set and training set each time I re-run my program? Does the first one give me the same test and training sets every time I re-run my program? And what is the difference between setting random_state=42 and random_state=43, for example?
In Python's scikit-learn, train_test_split splits your input data into two sets: (i) train and (ii) test. Its random_state argument seeds the random shuffling that is applied before the split.
If the argument is not given, the shuffle draws on NumPy's global random state, so you will generally get a different split each time you re-run the program. The split is stratified only if you pass the stratify argument.
Suppose you want to split the data randomly so that you can measure the performance of your regression on the same data with different splits. You can use random_state to achieve this: each value, such as 42 or 43, gives you a different but reproducible pseudo-random split of your initial data. To keep track of performance and reproduce it later on the same data, reuse the random_state value you used before.
This is useful for cross-validation in machine learning.
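The effect of the seed can be seen with a small standard-library sketch. This is not scikit-learn's implementation: split_indices is a hypothetical helper that only mimics the idea of "shuffle with a seeded generator, then cut off a test set".

```python
import random


def split_indices(n_rows, test_size, seed=None):
    """Shuffle row indices with an optional seed, then cut off a test set.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    rng = random.Random(seed)  # seed=None: seeded from OS entropy or the clock
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)


if __name__ == "__main__":
    train_a, test_a = split_indices(1000, 0.2, seed=42)
    train_b, test_b = split_indices(1000, 0.2, seed=42)
    train_c, test_c = split_indices(1000, 0.2, seed=43)
    print(test_a == test_b)  # same seed: identical split
    print(test_a == test_c)  # different seed: a different shuffle
```

There is nothing special about 42 versus 43; each integer just selects a different, repeatable shuffle.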

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is the day of the week. Then you create a one-hot encoding with one position per day.
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
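The encoding above can be produced with a few lines of plain Python. The names DAYS and one_hot are made up for this sketch, which assumes the full list of categories is known in advance.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]


def one_hot(value, categories):
    """Return a list with a single 1 at the position of value in categories."""
    index = categories.index(value)  # raises ValueError for unknown values
    return [1 if i == index else 0 for i in range(len(categories))]


if __name__ == "__main__":
    for day in ("Sunday", "Monday", "Saturday"):
        print("".join(map(str, one_hot(day, DAYS))), day)
```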
Feature hashing is mostly used to allow significant storage compression of parameter vectors: one hashes the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can then live in the lower-dimensional space instead of the original input space. This can be used as a method of dimensionality reduction, so you usually trade a small loss in performance for a significant storage benefit.
The example on Wikipedia is a good one. Suppose you have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
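Using a bag-of-words model, you first create the document-term matrix below (each row is a document, and each entry indicates whether a word appears in the document).
      John likes to watch movies Mary too also football
doc1: 1    1     1  1     1      0    0   0    0
doc2: 0    1     0  0     1      1    1   0    0
doc3: 1    1     0  0     0      0    0   1    1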
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate the hashed features below with 3 buckets (you apply k different hash functions to the original features and count how many times a hashed value hits each bucket).
bucket1 bucket2 bucket3
doc1: 3 2 0
doc2: 2 2 0
doc3: 1 0 2
You have now reduced the features from 9 dimensions to 3 dimensions.
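A minimal plain-Python sketch of the hashing trick follows. It makes two assumptions that differ from the bucket table above: it uses a single hash function instead of k of them, and SHA-256 is an arbitrary choice of stable hash, so the counts it prints will not match that table. The function names are hypothetical.

```python
import hashlib


def bucket_of(token: str, num_buckets: int) -> int:
    """Map a token to a bucket index with a stable hash (SHA-256, arbitrary choice)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(tokens, num_buckets):
    """Count how many tokens fall into each bucket; no dictionary is stored."""
    vector = [0] * num_buckets
    for token in tokens:
        vector[bucket_of(token, num_buckets)] += 1
    return vector


if __name__ == "__main__":
    docs = ["John likes to watch movies",
            "Mary likes movies too",
            "John also likes football"]
    for doc in docs:
        print(hash_features(doc.lower().split(), num_buckets=3))
```

The vector length is fixed up front, so new words never grow the feature space; they only add to the collisions in existing buckets.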
A more interesting application of feature hashing is personalization. The original feature hashing paper contains a nice example.
Imagine you want to design a spam filter that is customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible for both training (training and updating every personalized model) and serving (holding all classifiers in memory). A smarter way is illustrated below:
Each token is duplicated, and one copy is individualized by concatenating the word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag-of-words model now holds the common keywords and also user-specific keywords.
All words are then hashed into a low-dimensional feature space where the document is trained and classified.
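The three steps above might look roughly like this in plain Python. The helper names, the underscore separator and the use of SHA-256 are all assumptions made for this sketch; the paper's actual setup may differ.

```python
import hashlib


def personalized_tokens(tokens, user_id):
    """Keep each token and add a copy prefixed with the user id."""
    return list(tokens) + [f"{user_id}_{t}" for t in tokens]


def hashed_vector(tokens, num_buckets):
    """Hash tokens into a fixed-length count vector (SHA-256 is an arbitrary choice)."""
    vector = [0] * num_buckets
    for t in tokens:
        h = int.from_bytes(hashlib.sha256(t.encode("utf-8")).digest()[:8], "big")
        vector[h % num_buckets] += 1
    return vector


if __name__ == "__main__":
    tokens = personalized_tokens(["NEU", "Votre"], "USER123")
    print(tokens)
    print(hashed_vector(tokens, num_buckets=16))
```

A single classifier trained on such vectors shares the weights of the common tokens across users, while the prefixed copies let it learn per-user adjustments.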
Now to answer your questions:
Yes, one-hot encoding should come first, since it transforms a categorical feature into binary features that linear models can consume.
You can certainly apply both to the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit recognition problem such as MNIST, each image is represented by 28x28 pixels, so the input dimension is only 784. Feature hashing won't have any benefit in this case.

SVM-pref package from Cornell university

I'm using SVM-pref (http://svmlight.joachims.org) for a binary classification problem. I don't have much experience with this package and so I seek help with the following questions:
(1) My features are all discrete/nominal. Is there a special way to represent the feature vectors, such as a way to convert the nominal values into continuous values, or do we just replace the nominal values with dummy numbers like 1, 2, 3, etc.?
(2) If the answer to the first question is that we replace nominal values with dummy numbers, then my second question is whether we should start numbering feature values from 1, so that we have 1:1 but never 1:0, because otherwise the learner will treat a zero-valued feature as non-existent. Is that correct?
(3) How do we find the best -c value and the values for the rest of the parameters? Is it only by trial and error, or are there other approaches for deciding on these parameters?
(1) To use categorical features in an SVM you must encode them using dummy variables, e.g. one-hot coding. For every level of the category, you should introduce a dimension. Something like this for a feature with levels A, B and C:
A -> [1,0,0]
B -> [0,1,0]
C -> [0,0,1]
(2) See the answer to the previous question: use one dimension per categorical level.
(3) Typically this is done by testing possible values in a cross-validation setting.
Here is also another useful and informative discussion about representing nominal features for SVM classifiers.
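To connect the one-hot idea to the input file, here is a hedged plain-Python sketch that writes one example per line in the sparse "label index:value" style used by SVM-light, with feature indices starting at 1 and zero-valued features simply left out. The function name and the way levels are passed in are made up for illustration; check the package documentation for the exact format your version expects.

```python
def to_svmlight_line(label, row, levels):
    """Encode one example as an SVM-light style line: '<label> <index>:<value> ...'.

    row    -- list of categorical values, one per feature
    levels -- list of lists: the known levels of each feature, in fixed order
    Every (feature, level) pair gets its own 1-based index; only the active
    level of each feature is written, with value 1.
    """
    parts = [str(label)]
    offset = 0
    for value, feature_levels in zip(row, levels):
        index = offset + feature_levels.index(value) + 1  # indices start at 1
        parts.append(f"{index}:1")
        offset += len(feature_levels)
    return " ".join(parts)


if __name__ == "__main__":
    levels = [["A", "B", "C"], ["x", "y"]]
    print(to_svmlight_line(1, ["B", "y"], levels))
    print(to_svmlight_line(-1, ["A", "x"], levels))
```

Because each level has its own dimension, the 1:0 situation from question (2) never needs to be written: an inactive level is just an omitted index.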