TF-IDF RDDs into a readable format using Spark (Scala)

I am trying to calculate TF-IDF for documents of strings, following the example at http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf.
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
output:
Array((1048576,[1088,35436,98482,1024805],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]), (1048576,[49,34227,39165,114066,125344,240472,312955,388260,436506,469864,493361,496101,566174,747007,802226],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]),...
With this I get an RDD of vectors, but I cannot get any information from these vectors about the original strings. Can anyone help me map the indices back to the original strings?

It is hard to answer your question without more information. My best guess is that you want to extract the TF-IDF value for some term of some document.
The tfidf you get on your last line is an RDD of Vector: for every document in your corpus (which is an RDD[Seq[String]]), you get back a Vector representing the document. Every term in the document has a specific TF-IDF value in this vector.
To find the position of a term in the vector, and so retrieve its TF-IDF value:
val position = hashingTF.indexOf("term")
Then use it to retrieve the TF-IDF value for a given document by calling the apply method on the Vector (the first document in documents in this example):
tfidf.first.apply(position)
Raw frequencies may be extracted the same way using tf instead of tfidf in the line above.
Because Spark's implementation uses the hashing trick (see the documentation and the Wikipedia article), my understanding is that it is not possible to retrieve the terms from the Vector: the hashing function is one-way by definition, and the "trick" may cause collisions (several terms may produce the same hash).
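To make the hashing trick concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a stand-in hash (zlib.crc32), which is not the function Spark uses, and the helper names index_of and build_reverse_index are hypothetical. It illustrates that although a hash cannot be inverted, you still hold the original documents, so you can hash every known term and record which terms landed in each bucket:

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 1 << 20  # 1048576, the vector size visible in the question's output


def index_of(term, num_features=NUM_FEATURES):
    """Stand-in for HashingTF.indexOf (assumption: crc32 is NOT Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def build_reverse_index(documents, num_features=NUM_FEATURES):
    """Map each bucket index to the set of known terms that hash into it."""
    reverse = defaultdict(set)
    for doc in documents:
        for term in doc:
            reverse[index_of(term, num_features)].add(term)
    return reverse


# Illustrative corpus, already tokenized like the RDD[Seq[String]] in the question.
docs = [["spark", "makes", "tf", "idf"], ["spark", "hashing", "trick"]]
reverse = build_reverse_index(docs)

# With very few buckets, several terms are forced to share an index (collisions).
tiny = build_reverse_index(docs, num_features=2)
```

A bucket that holds more than one term cannot be attributed to a single word, which is the collision caveat mentioned above; a larger number of features makes that less likely but never impossible.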

Related

How to model tf-idf spark

I'm trying to rewrite some code that I wrote in Python, but now in Spark.
# pandas / scikit-learn
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
From what I read in the Spark documentation, is it necessary to use Tokenizer, HashingTF and then IDF to model tf-idf in PySpark?
#pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
tokenizer = Tokenizer(inputCol = "sentence", outputCol = "words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol = "words", outputCol="rawFeatures", numFeatures = 20)
tf = hashingTF.transform(wordsData)
idf = IDF(inputCol = "rawFeatures", outputCol = "features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure you fully understand how the tf-idf model works, since tokenizing is essential to tf-idf in both the sklearn and the spark.ml versions. Your post actually covers two questions:
Why tf-idf needs the sentence to be tokenized: I won't copy the mathematical equation, since it's easy to find online. In short, tf-idf is a statistical measure of how relevant a word is to a document within a collection of documents. It is calculated from how frequently the word appears in the document (tf) and the inverse frequency of the word across the set of documents (idf). Since everything is computed over the vocabulary, if your input is a sentence, as in your sklearn version, you must tokenize it before the calculation; otherwise the whole methodology is no longer valid.
How tf-idf works in sklearn: If you understand how tf-idf works, you will see that the separate steps in the example from the official Spark documentation are all essential. Thanks to the sklearn developers for creating such a convenient API, you can call .fit_transform() directly on the Series of sentences. In fact, if you check the source code of TfidfVectorizer in sklearn, you can see that it does perform the tokenization, just in a different way:
It inherits from CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717)
It uses the ._count_vocab() method of CountVectorizer to transform your sentences. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338)
In ._count_vocab(), it goes through each sentence and creates a sparse matrix storing the frequency of each vocabulary term in each sentence, before the tf-idf calculation. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192)
To conclude, tokenizing the sentence is essential for the tf-idf calculation, and the example in the official Spark documentation is efficient enough for building your model. Use the built-in function or method whenever Spark provides one, and DON'T try to build a user-defined function or class to achieve the same goal; otherwise you may reduce computing performance or trigger other issues such as out-of-memory errors.
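To show why tokenization has to come first, here is a small sketch of the calculation in plain Python, with no Spark or sklearn involved. The function names and the sample sentences are made up for illustration. It assumes the smoothed formula idf = log((N + 1) / (df + 1)) given in the Spark MLlib documentation; sklearn's TfidfVectorizer uses a slightly different formula and normalizes the rows, so its numbers will not match these:

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace, roughly what a basic tokenizer does."""
    return sentence.lower().split()


def tf_idf(sentences):
    """Return one {term: weight} dict per sentence."""
    docs = [tokenize(s) for s in sentences]  # nothing can be counted before this step
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency is the raw count of the term in the document.
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


# Hypothetical input, standing in for the 'sentence' column.
sentences = ["spark is fast", "spark is distributed", "pandas is local"]
weights = tf_idf(sentences)
```

A word that appears in every sentence ("is" here) gets a weight of zero, while a word unique to one sentence gets the largest weight. Both tf and idf are counts over tokens, which is why neither library can skip the tokenizing step.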

spark loop in matrix to run linear regression

I have a Spark data frame dt as below. BAB is the ID, and I would like to run a linear regression on columns AAB and AAD for every value of BAB.
This is how I run it. Because it filters the whole data frame for every BAB value, it gets really slow. Is there a way of looping over the data like a 3-dimensional matrix and running a regression for every BAB, so that I only need to go through BAB once? It does not have to be Spark MLlib; any other machine learning tool that can be used from Scala is fine.
val arrColu = Array("AAB", "AAD");
val assFeat = new VectorAssembler().setInputCols(arrColu).setOutputCol("features");
val arrBAB=dt.select("BAB").collect.map(_ (0)).map(x => x.toString);
for (a<-0 to arrBAB.length-1){
val vecDF: DataFrame = assFeat.transform(dt.filter("BAB='"+arrBAB(a)+"'").select("AAB","AAD"));
val lr1=new LinearRegression();
val lr2=lr1.setFeaturesCol("features").setLabelCol("AAD").setFitIntercept(true).
setMaxIter(10).setRegParam(.3).setElasticNetParam(.8);
val fitD1=lr2.fit(vecDF);
...
}
One way is to convert the data frame into a list of tuples, List((BAB1,AAB1,AAD1),(BAB2,AAB2,AAD2),...), then slice the list by BAB value and run a regression on each slice.
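The slicing idea can be sketched in plain Python as follows. This is only an illustration under some assumptions: the rows are already collected as (BAB, AAB, AAD) tuples, AAB is treated as the single predictor and AAD as the label, and the fit is ordinary least squares without the regularization (regParam, elasticNetParam) used in the Spark code. The names fit_line and fit_per_group are hypothetical:

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept.

    Assumes at least two distinct x values; otherwise the slope is undefined.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_group(rows):
    """Group the rows by BAB in a single pass, then fit one line per group."""
    groups = defaultdict(list)
    for bab, aab, aad in rows:
        groups[bab].append((aab, aad))
    return {bab: fit_line(points) for bab, points in groups.items()}


# Made-up rows in the (BAB, AAB, AAD) layout described above.
rows = [("b1", 1.0, 2.0), ("b1", 2.0, 4.0), ("b1", 3.0, 6.0),
        ("b2", 0.0, 1.0), ("b2", 1.0, 0.0), ("b2", 2.0, -1.0)]
models = fit_per_group(rows)  # {BAB value: (slope, intercept)}
```

The data is traversed once to build the groups, rather than once per BAB value as in the filter-based loop. The same shape (group first, then fit inside each group) is what a grouped operation in Spark would need to reproduce.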

How do I properly combine numerical features with text (bag of words) in Spark?

My question is similar to this one but for Spark and the original question does not have a satisfactory answer.
I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's text (that has been pre-processed) as hash-tfidf and also its month as follows:
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
.setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
.setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
.setOutputCol("month-idx")
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
.setOutputCol("features")
If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)
VectorAssembler simply combines all the data into a single vector; it does nothing with weights or anything else.
Since the 30,000 word vector is very sparse it is very likely that the more dense features (the months) will have a greater impact on the result, so these features would likely not get "swamped" as you put it. You can train a model and check the weights of the features to confirm this. Simply use the provided coefficients method of the LinearSVCModel to see how much the features influence the final sum:
val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients
The features with higher coefficients will have a higher influence on the final result.
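As a rough illustration of reading those coefficients, here is a plain Python sketch that ranks features by the absolute value of their coefficient. The coefficient values and the helper name top_features are made up. It relies on VectorAssembler concatenating its inputs in the order given, so position 0 is month-idx and the remaining positions are the hash-tfidf buckets:

```python
def top_features(coefficients, names, k=3):
    """Return the k (name, coefficient) pairs with the largest absolute coefficient."""
    ranked = sorted(zip(names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


# Hypothetical example with 4 hash buckets instead of 30,000.
names = ["month-idx"] + [f"hash-tfidf[{i}]" for i in range(4)]
coefficients = [0.8, -1.5, 0.0, 0.2, 0.05]  # made-up values, e.g. from model.coefficients.toArray

best = top_features(coefficients, names, k=2)
```

The absolute value is used because a strongly negative coefficient influences the decision as much as a strongly positive one. Comparing magnitudes like this is only a rough guide when the features are on different scales.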
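If the weight given to the months is too low or too high, be aware that the setWeightCol() method assigns a weight to each training row (instance), not to individual features; to change how much a feature counts, rescale that feature's column before assembling the vector.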

Fitting Spark ML Kmeans for subsets or groups of data

I've got a Dataset where each Row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a kmeans model in Spark MLLib per class. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it's how Spark does it in the fit method of the OneVsRest classifier here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect
val kmeans = new KMeans().setK(5)
// `class` is a reserved word in Scala, so the lambda parameter is named cls
val models = classes.par.map { cls =>
  val training_data = unpacked_data.filter($"label" === cls)
  val model = kmeans.fit(training_data)
  (cls, model)
}
It seems that the KMeans fit method needs the data to be a Dataset with one-row-per-vector, which suggests normalizing / exploding the data, but what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row and/or group on the label to use just these points without explicitly filtering the entire dataset for every class I want to build a model for?
PS- I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.

How does spark LDA handle non-integer token counts (e.g. TF-IDF)

I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this:
for (n_topics <- Range(65, 301, 5)) {
  val s = n_topics.toString
  val lda = new LDA().setK(n_topics).setMaxIterations(20) // .setAlpha(), .setBeta()
  val ldaModel = lda.run(docsWithFeatures)
  // now do some eval, save results to file, etc...
}
This has been working great, but I also want to compare results if I first normalize my data with TF-IDF. Now, to the best of my knowledge, LDA strictly expects a bag-of-words format where term frequencies are integer values. But in principle (and I've seen plenty of examples of this), the math works out fine if we first convert integer term frequencies to float TF-IDF values. My current approach is the following (again given my docsWithFeatures RDD):
val index_reset = docsWithFeatures.map(_._2).cache()
val idf = new IDF().fit(index_reset)
val tfidf = idf.transform(index_reset).zipWithIndex.map(x => (x._2,x._1))
I can then run the same code as in the first block, substituting tfidf for docsWithFeatures. This works without any crashes, but my main question here is whether this is OK to do. That is, I want to make sure Spark isn't doing anything funky under the hood, like converting the float values coming out of the TF-IDF step to integers or something.