Fitting Spark ML KMeans for subsets or groups of data - Scala

I've got a Dataset where each Row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a k-means model in Spark MLlib per class. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it's how Spark does it in the fit method of the OneVsRest classifier here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect

val kmeans = new KMeans().setK(5)

val models = classes.par.map { cls =>  // `class` is a reserved word, so use another name
  val training_data = normalized_data.filter($"class" === cls)
  val model = kmeans.fit(training_data)
  (cls, model)
}
It seems that the KMeans fit method needs the data to be a Dataset with one row per vector, which suggests normalizing / exploding the data. But what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row, and/or group on the class, to use just those points without explicitly filtering the entire dataset for every class I want to build a model for?
PS- I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.
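For reference, here is a minimal sketch of the explode-and-convert step described above, i.e. one way the normalized_data used in the snippet could be produced (data is a placeholder name for the original Dataset; this is just a sketch, not necessarily the best approach):
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, explode, udf}

// Turn each inner Array[Float] into an ml Vector that KMeans.fit can consume.
val toVec = udf((xs: Seq[Float]) => Vectors.dense(xs.map(_.toDouble).toArray))

val normalized_data = data
  .withColumn("vector", explode(col("vectors"))) // one row per data point
  .withColumn("features", toVec(col("vector")))  // KMeans reads "features" by default
  .select("class", "features")
  .cache()                                       // reused once per class in the loop above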

Related

Use own dataset in Denoising Diffusion Implicit Models

Can you tell me how to create and use my own dataset in this code?
def prepare_dataset(split):
    # the validation dataset is shuffled as well, because data order matters
    # for the KID estimation
    return (
        tfds.load(dataset_name, split=split, shuffle_files=True)
        .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )
# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
It always pushes me toward using TensorFlow Datasets. The code is from the Keras notebook 'Denoising Diffusion Implicit Models'.
I tried following the Keras guide on how to create your own dataset, but I'm not sure how to use it in this code. I also tried this way of creating your own dataset, but I don't know whether it can be plugged into the code above.

Spark: loop over a matrix to run linear regression

I have a Spark DataFrame dt as below. BAB is an ID, and I would like to run a linear regression with columns AAB and AAD for every value of BAB.
This is how I run it. Filtering the whole DataFrame for every BAB value gets really slow. Is there a way to loop over the data like a 3-dimensional matrix and run a regression for every BAB, so that I only need to go through the data once? It does not have to be Spark MLlib; any other machine learning tool with Scala is fine.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

val arrColu = Array("AAB", "AAD")
val assFeat = new VectorAssembler().setInputCols(arrColu).setOutputCol("features")
val arrBAB = dt.select("BAB").distinct.collect.map(_(0)).map(_.toString)  // distinct BAB values
for (a <- arrBAB.indices) {
  val vecDF: DataFrame = assFeat.transform(dt.filter("BAB='" + arrBAB(a) + "'").select("AAB", "AAD"))
  val lr1 = new LinearRegression()
  val lr2 = lr1.setFeaturesCol("features").setLabelCol("AAD").setFitIntercept(true)
    .setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
  val fitD1 = lr2.fit(vecDF)
  ...
}
One way would be to convert the DataFrame into a list of tuples, List((BAB1, AAB1, AAD1), (BAB2, AAB2, AAD2), ...), then slice the list by each individual BAB and run a regression on each slice.
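In the same spirit, here is a minimal sketch that scans the data only once by grouping on BAB with Dataset.groupByKey / mapGroups and fitting a plain least-squares line of AAD on AAB locally per group. It is not the elastic-net LinearRegression from the snippet above, and it assumes BAB can be read as a string, AAB/AAD as doubles, and that each BAB group fits in a single task's memory:
import spark.implicits._

// One closed-form fit per BAB group: AAD ~ slope * AAB + intercept.
case class GroupFit(bab: String, slope: Double, intercept: Double)

val fits = dt
  .select($"BAB".cast("string"), $"AAB".cast("double"), $"AAD".cast("double"))
  .as[(String, Double, Double)]
  .groupByKey(_._1)                  // one pass over the data, grouped by BAB
  .mapGroups { (bab, rows) =>
    val pts = rows.map { case (_, x, y) => (x, y) }.toArray
    val n   = pts.length.toDouble
    val sx  = pts.map(_._1).sum
    val sy  = pts.map(_._2).sum
    val sxx = pts.map(p => p._1 * p._1).sum
    val sxy = pts.map(p => p._1 * p._2).sum
    val slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    val intercept = (sy - slope * sx) / n
    GroupFit(bab, slope, intercept)
  }

fits.show()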

how to get the prediction of a model in pyspark

I have developed a clustering model using PySpark, and I want to predict the cluster of a single vector. Here is the code:
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "file:///C:/temp") \
    .appName("Kmeans") \
    .getOrCreate()

vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')

k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()

predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V, and I want to predict which cluster it belongs to.
I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession rather than a SparkContext.
I see that you have covered the basic steps of building the model; what you still need is to apply your k-means model to the vector you want to cluster (just as you did with model.transform(df_kmeans) above) and then read off the prediction. In other words, redo that same transform step, but on a DataFrame containing the new feature vector V. To understand this better, I invite you to read this answer on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to using SparkSession rather than SparkContext; both are just entry points to the Spark APIs, and you can access a SparkContext through a SparkSession, since the entry points were unified in Spark 2.0. The PySpark k-means API is much like scikit-learn's; the only difference is the set of predefined functions in the Spark Python API (PySpark).
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.

How do I properly combine numerical features with text (bag of words) in Spark?

My question is similar to this one, but for Spark, and the original question does not have a satisfactory answer.
I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's text (which has been pre-processed) as hashed TF-IDF features, plus its month, as follows:
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")
If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)
VectorAssembler will simply combine all the data into a single vector; it does nothing with weights or anything else.
Since the 30,000-element word vector is very sparse, it is very likely that the denser features (the months) will have a greater impact on the result, so these features would likely not get "swamped", as you put it. You can train a model and check the weights of the features to confirm this. Simply use the coefficients method provided by LinearSVCModel to see how much each feature influences the final sum:
val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients
The features with higher coefficients will have a higher influence on the final result.
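For a concrete illustration of checking this (using the columns above): VectorAssembler preserves the order of its input columns, so with inputCols = Array("month-idx", "hash-tfidf") the first coefficient corresponds to the month feature and the remaining 30,000 correspond to the hashed TF-IDF terms:
val coeffs = model.coefficients.toArray   // LinearSVCModel.coefficients is an ml Vector
val monthCoeff = coeffs.head              // the single month-idx column comes first
val textCoeffs = coeffs.tail              // the 30,000 hashed TF-IDF slots
println(s"month coefficient: $monthCoeff")
println(s"largest |text coefficient|: ${textCoeffs.map(math.abs).max}")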
If the influence given to the month feature turns out to be too low or too high, you can rescale that column (for example with a scaler or a constant factor) before assembling the features; note that LinearSVC's setWeightCol() method assigns weights to rows (training instances), not to individual features.

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this, see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), then the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
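For completeness, a small sketch of point 2, reusing localModel and docsWithFeatures from the question: plain perplexity is the exponential of logPerplexity, and dividing the log-perplexity by math.log(2.0) gives the approximate bits-per-token figure described above.
val logPerp = localModel.logPerplexity(docsWithFeatures)
val perplexity = math.exp(logPerp)          // plain (non-log) perplexity
val bitsPerToken = logPerp / math.log(2.0)  // approx. bits needed per token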