Spark: loop over a matrix to run linear regression - Scala

I have a Spark DataFrame dt as below. BAB is an ID column, and I would like to run a linear regression on columns AAB and AAD for every value of BAB.
This is how I run it now. Filtering the whole DataFrame for every BAB value gets really slow. Is there a way to loop over the data like a 3-dimensional matrix and run a regression for every BAB, so that I only need to go through BAB once? It does not have to be Spark MLlib; any other machine learning tool with Scala is fine.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

val arrColu = Array("AAB", "AAD")
val assFeat = new VectorAssembler().setInputCols(arrColu).setOutputCol("features")
// Collect the distinct BAB values to the driver
val arrBAB = dt.select("BAB").distinct.collect.map(_(0).toString)

for (a <- arrBAB.indices) {
  // Filtering the whole DataFrame once per BAB value -- this is the slow part
  val vecDF: DataFrame = assFeat.transform(dt.filter(s"BAB='${arrBAB(a)}'").select("AAB", "AAD"))
  val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("AAD")
    .setFitIntercept(true).setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
  val fitD1 = lr.fit(vecDF)
  ...
}

One way is to convert the DataFrame into a list of tuples, List((BAB1,AAB1,AAD1),(BAB2,AAB2,AAD2),...), then slice the list by each individual BAB and run a regression on each slice.
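A sketch of that idea (hedged: it assumes each BAB group fits in one executor's memory, that BAB is a string ID, and that AAB is the intended single feature with AAD as the label): group the rows by BAB in a single shuffle and solve the closed-form least-squares fit locally inside each group, so the data is scanned only once:

// One pass: group by BAB (one shuffle instead of one filter per BAB
// value) and fit y = slope * x + intercept per group in closed form.
val spark = dt.sparkSession
import spark.implicits._

val perGroupFits = dt.select("BAB", "AAB", "AAD")
  .as[(String, Double, Double)]
  .rdd
  .groupBy(_._1)
  .mapValues { rows =>
    val pts = rows.map(r => (r._2, r._3)).toArray
    val n = pts.length.toDouble
    val xMean = pts.map(_._1).sum / n
    val yMean = pts.map(_._2).sum / n
    val slope = pts.map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
                pts.map { case (x, _) => (x - xMean) * (x - xMean) }.sum
    val intercept = yMean - slope * xMean
    (slope, intercept)
  }
  .collect()  // Array[(BAB, (slope, intercept))]

If you need regularization or several features, the same grouping still works; just run a small local solver (Breeze, for example) inside mapValues instead of the closed-form fit.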

Related

How to get the prediction of a model in PySpark

I have developed a clustering model using PySpark, and I want to predict the class of a single vector. Here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.config("spark.sql.warehouse.dir",
                                    "file:///C:/temp").appName("Kmeans").getOrCreate()

# FEATURES_COL is the list of feature column names, defined elsewhere
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')

k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)

centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V and I want to predict which class it belongs to.
I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession, not a SparkContext.
I see that you have covered the most basic steps of model creation; what you still need is to apply your k-means model to the vector you want to cluster, just as you already did with model.transform(df_kmeans), and then read off the prediction. In other words, redo that same transform step, but on a DataFrame containing your new feature vector V. To understand this better, I invite you to read this answer posted on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to the use of SparkSession or SparkContext, as those are just entry points to the Spark APIs; since Spark 2.0 they have been unified, so you can access a SparkContext through a SparkSession. The PySpark k-means API works much like scikit-learn's; the only difference is the set of predefined functions in the Spark Python API (PySpark).
You can also call the predict method of the KMeansModel using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.

How do I properly combine numerical features with text (bag of words) in Spark?

My question is similar to this one, but for Spark, and the original question does not have a satisfactory answer.
I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's pre-processed text as hashed TF-IDF, and also its month, as follows:
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, VectorAssembler}

val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")
With 30,000 word features, won't these swamp the month feature? Or is VectorAssembler smart enough to handle this? (And if possible, how do I get the best features of this model?)
VectorAssembler simply combines all the data into a single vector; it does nothing with weights or anything else.
Since the 30,000-word vector is very sparse, it is very likely that the denser features (the months) will have a greater impact on the result, so these features would likely not get "swamped" as you put it. You can train a model and check the feature weights to confirm this. Simply use the coefficients method provided by LinearSVCModel to see how much each feature influences the final sum:
import org.apache.spark.ml.classification.LinearSVC

val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients
The features with higher coefficients will have a higher influence on the final result.
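For instance (a sketch, relying only on the assembler order above, which puts month-idx at index 0 and the 30,000 hashed terms after it):

// month-idx was assembled first, so its coefficient sits at index 0;
// indices 1..30000 hold the hashed TF-IDF terms.
val monthCoeff = coeffs(0)
val topTextCoeffs = coeffs.toArray.zipWithIndex
  .drop(1)                                  // skip the month slot
  .sortBy { case (w, _) => -math.abs(w) }
  .take(10)                                 // ten most influential hashed terms
println(s"month coefficient: $monthCoeff")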
If the weight given to the months is too low or too high, it is possible to adjust it using the setWeightCol() method.

Triggering execution of a LinearRegression model in Flink -> slower than Spark?

I've developed a multiple linear regression and a k-means in both Spark and Flink to compare their performance in batch mode (I'm using Zeppelin to program and execute, and Ganglia to measure).
I read in the answer to this post that I have to trigger the execution of the train method, so I did.
However, for linear regression Flink takes 3 minutes 27 seconds (just for the trigger part) while Spark takes only around 30 seconds (for the whole execution), so I think I'm doing something wrong, because this should not be possible.
Flink is also slower when comparing the k-means algorithms.
This is my code:
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.regression.MultipleLinearRegression

// Read the data (benv is Zeppelin's batch ExecutionEnvironment)
val data: DataSet[LabeledVector] = MLUtils.readLibSVM(benv, "/.../quake_test_I.libsvm")

// Example of the data:
// 6.1 1:33.0 2:53.26 3:-161.74
// 5.8 1:45.0 2:51.34 3:173.44
// 5.9 1:17.0 2:28.62 3:142.42
// 5.8 1:28.0 2:52.73 3:171.99

// Create the multiple linear regression learner
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)
  .setConvergenceThreshold(0.001)

// Train the model
val model = mlr.fit(data)

// Trigger its execution
val weights = mlr.weightsOption match {
  case Some(weights) => weights.collect()
  case None => throw new Exception("Could not calculate the weights.")
}
How should I trigger the execution of this model?
Thanks for your help! :)
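One likely explanation (a sketch, not a confirmed diagnosis): Flink's DataSet API is lazy, so mlr.fit(data) only builds the dataflow graph; nothing runs until a data sink or collect() is invoked. The weights.collect() call is therefore what executes the whole job, training included, which means timing "just the trigger part" actually times the full training:

// Hedged sketch: collect() is the action that runs the job, so the
// elapsed time below covers reading, training, and result transfer.
val t0 = System.nanoTime()
mlr.fit(data)                                  // builds the plan, returns quickly
val weights = mlr.weightsOption.get.collect()  // executes the whole dataflow
val elapsedSec = (System.nanoTime() - t0) / 1e9
println(s"End-to-end training time: $elapsedSec s; weights: $weights")

If that is what is happening, the 3 minutes 27 seconds is Flink's end-to-end training time, so the comparison with Spark's whole execution is at least like-for-like, even though Flink is still slower.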

Fitting Spark ML KMeans for subsets or groups of data

I've got a Dataset where each Row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a k-means model in Spark MLlib per class. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it is how Spark does it in the fit method of the OneVsRest classifier: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
import org.apache.spark.ml.clustering.KMeans
import spark.implicits._  // spark: the active SparkSession, for $"label"

val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect
val kmeans = new KMeans().setK(5)
val models = classes.par.map { cls =>  // `class` is a reserved word in Scala
  val training_data = unpacked_data.filter($"label" === cls)
  val model = kmeans.fit(training_data)
  (cls, model)
}
It seems that the KMeans fit method needs the data as a Dataset with one row per vector, which suggests normalizing / exploding the data, but what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row, and/or group on the label, to use just those points without explicitly filtering the entire dataset for every class I want to build a model for?
PS - I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.
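A sketch of one way to exploit that (hedged: it bypasses Spark ML's KMeans entirely, assumes each class's points fit in one task's memory, and localKMeans is a hypothetical helper, not a Spark API): since every row already carries all the points for one class, a plain map over the rows can run a small local Lloyd's k-means per class, with no per-class filtering of the full Dataset:

// Minimal local Lloyd's k-means; naive init with the first k points.
def localKMeans(points: Array[Array[Float]], k: Int, iters: Int = 20): Array[Array[Float]] = {
  val dim = points.head.length
  var centers = points.take(k).map(_.clone)
  for (_ <- 1 to iters) {
    // Assign every point to its nearest center (squared distance).
    val assigned = points.groupBy { p =>
      centers.indices.minBy { c =>
        var d = 0f; var i = 0
        while (i < dim) { val diff = p(i) - centers(c)(i); d += diff * diff; i += 1 }
        d
      }
    }
    // Move each center to the mean of its assigned points.
    centers = centers.indices.map { c =>
      assigned.get(c)
        .map(pts => Array.tabulate(dim)(i => pts.map(_(i)).sum / pts.length))
        .getOrElse(centers(c))
    }.toArray
  }
  centers
}

// One pass over the Dataset: each row already holds (class, vectors).
val centersPerClass = data.rdd.map { row =>  // data: the Dataset described above
  val cls = row.getAs[String]("class")
  val pts = row.getAs[Seq[Seq[Float]]]("vectors").map(_.toArray).toArray
  (cls, localKMeans(pts, k = 5))
}.collect()

The trade-off is that you give up Spark ML's k-means|| initialization and its Vector-based API, so this only makes sense when each class's data is comfortably single-machine sized.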

How does Spark LDA handle non-integer token counts (e.g. TF-IDF)?

I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this:
import org.apache.spark.mllib.clustering.LDA

for (n_topics <- Range(65, 301, 5)) {
  val s = n_topics.toString
  val lda = new LDA().setK(n_topics).setMaxIterations(20)  // .setAlpha(), .setBeta()
  val ldaModel = lda.run(docsWithFeatures)
  // now do some eval, save results to file, etc...
}
This has been working great, but I also want to compare results when I first normalize my data with TF-IDF. Now, to the best of my knowledge, LDA strictly expects a bag-of-words format where term frequencies are integer values. But in principle (and I've seen plenty of examples of this), the math works out fine if we first convert integer term frequencies to float TF-IDF values. My current approach is the following (again, given my docsWithFeatures RDD):
import org.apache.spark.mllib.feature.IDF

val index_reset = docsWithFeatures.map(_._2).cache()
val idf = new IDF().fit(index_reset)
val tfidf = idf.transform(index_reset).zipWithIndex.map(x => (x._2, x._1))
I can then run the same code as in the first block, substituting tfidf for docsWithFeatures. This works without any crashes, but my main question is whether this is OK to do. That is, I want to make sure Spark isn't doing anything funky under the hood, like converting the float values coming out of the TF-IDF step to integers.
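A quick sanity check (a sketch; MLlib vectors store Double values, so the fractional TF-IDF weights should reach LDA unrounded): print a few non-zero entries of the transformed vectors and confirm they are still non-integer:

// Inspect a few non-zero TF-IDF weights exactly as LDA will see them.
tfidf.take(3).foreach { case (id, vec) =>
  val sample = vec.toArray.filter(_ != 0.0).take(5)
  println(s"doc $id -> ${sample.mkString(", ")}")
}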