I'm trying to rewrite some code I wrote in Python (pandas + scikit-learn), but now in Spark.
# pandas / scikit-learn
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
From the Spark documentation it looks like I need to use Tokenizer, HashingTF and then IDF to build a tf-idf model in PySpark. Is that really necessary?
# pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(df)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
tf = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure you fully understand how the tf-idf model works: tokenizing is essential and fundamental to tf-idf, no matter whether you use the sklearn or the spark.ml version. Your post actually covers two questions:
Why tf-idf needs to tokenize the sentence: I won't reproduce the full derivation since it's easy to find, but a sketch of the standard formula is included below. In short, tf-idf is a statistical measure of how relevant a word is to a document within a collection of documents. It is calculated from how frequently a word appears in a document (tf) and the inverse frequency of that word across the set of documents (idf). Since the essence of the method is the vocabulary and every calculation is based on it, if your input is a raw sentence, as in your sklearn version, you must tokenize it before the calculation; otherwise the whole methodology is no longer valid.
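For reference, one common form of the formula (a sketch only; sklearn and Spark each layer their own smoothing and normalization on top of it):

tfidf(t, d) = tf(t, d) * idf(t), with idf(t) = log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.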
How tf-idf works in sklearn: If you understand how tf-idf works, then you should see that the different steps in the example from the official Spark documentation are all essential. Thanks to the sklearn developers for creating such a convenient API, you can call .fit_transform() directly on a Series of sentences. In fact, if you check the source code of TfidfVectorizer in sklearn, you can see that it does perform the "tokenization", just in a different way (a small sketch of the equivalence follows the list below):
It inherits from CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717).
It uses the ._count_vocab() method of CountVectorizer to transform your sentences (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338).
In ._count_vocab(), it walks through each sentence and builds a sparse matrix storing the frequency of every vocabulary term in every sentence, before the tf-idf calculation (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192).
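A minimal sketch of that equivalence (toy sentences invented for illustration; the point is only that the one-step TfidfVectorizer tokenizes and counts internally, exactly like an explicit CountVectorizer followed by TfidfTransformer):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

sentences = ["spark is great", "sklearn is great too"]

# One-step API: tokenization + counting + idf weighting all inside fit_transform
one_step = TfidfVectorizer().fit_transform(sentences)

# Explicit two-step version: tokenization/counting happens in CountVectorizer
counts = CountVectorizer().fit_transform(sentences)   # term frequencies (tf)
two_step = TfidfTransformer().fit_transform(counts)   # apply idf weighting

# Both yield the same sparse tf-idf matrix (with default parameters)
print((one_step != two_step).nnz == 0)  # True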
To conclude, tokenizing the sentences is essential for the tf-idf calculation, and the example in the official Spark documentation is efficient enough for your model building. Remember to use the functions and methods Spark already provides whenever such an API exists, and DON'T try to build a user-defined function/class to achieve the same goal; otherwise you may hurt computing performance or trigger other issues such as out-of-memory errors.
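If you like, the three stages can also be wrapped in a Pipeline so the whole transformation is fit and reused as a single object. A minimal sketch, assuming a DataFrame df with a string column "sentence" (the numFeatures value here is just an illustrative choice, not a recommendation):

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18)
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])
tfidf_model = pipeline.fit(df)
df_final = tfidf_model.transform(df)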
I have developed a clustering model using PySpark and I want to predict the cluster of just one vector. Here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "file:///C:/temp") \
    .appName("Kmeans").getOrCreate()

vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')

k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()

predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say I have a vector of features V and I want to predict which cluster it belongs to.
I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession, not a SparkContext.
I see that you covered the most basic steps of the model creation; what you still need is to apply your k-means model to the vector you want to cluster, exactly as you already did with model.transform(df_kmeans), and then read off the prediction. In other words, redo that same transform step, but on a DataFrame containing your new feature vector V. To understand this better, I invite you to read this answer on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to the use of SparkSession vs. SparkContext: both are just entry points to the Spark APIs, and since Spark 2.0 they are unified, so you can reach the SparkContext through a SparkSession. The PySpark k-means works much like the scikit-learn one; the only difference is the set of predefined functions in the Spark Python API (PySpark).
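For example (a one-line illustration, assuming spark is the SparkSession you already built):

# Since Spark 2.0 the entry points are unified: the SparkContext is reachable from the SparkSession
sc = spark.sparkContext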
You can call the predict method of the KMeansModel directly with a Spark ML Vector (this single-vector predict method is available on the pyspark.ml KMeansModel from Spark 3.0 onwards):
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.
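If you are on an older Spark version, or prefer staying in the DataFrame API, a minimal sketch of the alternative is to build a one-row DataFrame with the raw feature columns, push it through the same VectorAssembler, and call model.transform on it (new_values is a hypothetical list with one value per column in FEATURES_COL):

# Build a one-row DataFrame for the new observation and reuse the fitted stages
new_row = spark.createDataFrame([tuple(new_values)], FEATURES_COL)
new_vec = vecAssembler.transform(new_row)
prediction = model.transform(new_vec).select("prediction").collect()[0][0]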
I'm working on implementing in PySpark a logistic regression that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for a complex survey design involving clusters, strata, and sample weights.
There are some avenues out there for at least getting the model into Python: for example, I was able to get a close match of both coefficients and standard errors using the statsmodels package from this research project on GitHub.
However, my data is big, so I'd like to take advantage of Spark's distributed capabilities through MLlib. For example, the current setup to run the logit in Spark is:
import pyspark.ml.feature as ft
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline

featuresCreator = ft.VectorAssembler(
    inputCols=X_features_list,
    outputCol="features")

glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit",
                                           maxIter=25, regParam=0,
                                           labelCol='df',
                                           weightCol='wgt_panel')

pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(encoded_df_nonan)
The "weightcol" works for just simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.
My question is similar to this one but for Spark and the original question does not have a satisfactory answer.
I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's text (which has been pre-processed) as hashed tf-idf, and also its month, as follows:
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
.setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
.setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
.setOutputCol("month-idx")
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
.setOutputCol("features")
With 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And if possible, how do I get the best features of this model?)
VectorAssembler simply combines all the input columns into a single vector; it does nothing with weights or anything else.
Since the 30,000-element word vector is very sparse, it is very likely that the denser features (the months) will have a greater impact on the result, so these features would likely not get "swamped" as you put it. You can train a model and check the coefficients of the features to confirm this. Simply use the coefficients member provided by LinearSVCModel to see how much each feature influences the final sum:
val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients
The features with higher coefficients will have a higher influence on the final result.
Note that setWeightCol() applies weights per training example (row), not per feature, so it cannot be used to boost or dampen the month columns directly. If their influence turns out to be too low or too high, the usual approach is to rescale those features (for example with one of the scalers in spark.ml.feature) before they go into the VectorAssembler.
I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this:
import org.apache.spark.mllib.clustering.LDA

for (n_topics <- Range(65, 301, 5)) {
  val s = n_topics.toString
  val lda = new LDA().setK(n_topics).setMaxIterations(20) // .setAlpha(), .setBeta()
  val ldaModel = lda.run(docsWithFeatures)
  // now do some eval, save results to file, etc...
}
This has been working great, but I also want to compare results if I first normalize my data with TF-IDF. Now, to the best of my knowledge, LDA strictly expects a bag-of-words format where term frequencies are integer values. But in principle (and I've seen plenty of examples of this), the math works out fine if we first convert the integer term frequencies to float TF-IDF values. My approach at the moment is the following (again given my docsWithFeatures RDD):
import org.apache.spark.mllib.feature.IDF

val index_reset = docsWithFeatures.map(_._2).cache()
val idf = new IDF().fit(index_reset)
val tfidf = idf.transform(index_reset).zipWithIndex.map(x => (x._2, x._1))
I can then run the same code as in the first block, substituting tfidf for docsWithFeatures. This works without any crashes, but my main question here is whether this is OK to do. That is, I want to make sure Spark isn't doing anything funky under the hood, like converting the float values coming out of the TF-IDF to integers or something.
Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e., the DistributedLDAModel). As a side note, the primary (only?) reason I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable computing the topic mixing weights for a new (out-of-training) set of documents (for more on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
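To make the arithmetic explicit, here is a minimal sketch (written in Python only to show the relations; n_tokens is the total token count of the corpus, and the values below are illustrative guesses chosen to roughly reproduce the 0.367 log-perplexity shown above):

import math

log_likelihood = -3672913.27                    # e.g. localModel.logLikelihood(docsWithFeatures)
n_tokens = 10_000_000                           # hypothetical total token count
log_perplexity = -log_likelihood / n_tokens     # what logPerplexity reports
perplexity = math.exp(log_perplexity)           # plain perplexity, if you want to report that
bits_per_token = log_perplexity / math.log(2.0)

So yes, to report plain perplexity you just exponentiate the logPerplexity value.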