Running KMeans clustering in PySpark

This is my very first time trying to run a KMeans cluster analysis in Spark, so I apologize if this is a basic question.
I have a Spark DataFrame, mydataframe, with many columns. I want to run KMeans on only two columns, lat and long (latitude and longitude), using them as plain numeric values. I want to extract 7 clusters based on just those 2 columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train?
Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!

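The method KMeans.train takes an RDD as input, not a DataFrame (data). So you have to convert data to an RDD: data.rdd. Each element of that RDD is a Row, so you may also need to map the rows to numeric arrays first, for example data.rdd.map(lambda row: array([row['lat'], row['long']])).
Hope it helps.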

Related

spark loop in matrix to run linear regression

I have a Spark data frame dt as below. BAB is an ID column, and I would like to run a linear regression on columns AAB and AAD for every value of BAB.
This is how I run it. Because it filters the whole data frame for every BAB value, it gets really slow. Is there a way to loop over the data like a 3-dimensional matrix and run a regression for every BAB, so that I only need to go through BAB once? It does not have to be Spark MLlib. Any other machine learning tool that can be used from Scala is fine.
val arrColu = Array("AAB", "AAD");
val assFeat = new VectorAssembler().setInputCols(arrColu).setOutputCol("features");
val arrBAB=dt.select("BAB").collect.map(_ (0)).map(x => x.toString);
for (a <- 0 to arrBAB.length - 1) {
  val vecDF: DataFrame = assFeat.transform(dt.filter("BAB='" + arrBAB(a) + "'").select("AAB", "AAD"));
  val lr1 = new LinearRegression();
  val lr2 = lr1.setFeaturesCol("features").setLabelCol("AAD").setFitIntercept(true).
    setMaxIter(10).setRegParam(.3).setElasticNetParam(.8);
  val fitD1 = lr2.fit(vecDF);
  ...
}

One way is to convert the data frame into a list with tuples as elements, List((BAB1,AAB1,AAD1),(BAB2,AAB2,AAD2),...), then slice the list by each individual BAB value and run a regression on each slice.
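
The group-then-fit idea can be sketched in a few lines. This is only an illustration in plain Python, not Spark or Scala code: it assumes the rows are already collected as (BAB, AAB, AAD) tuples, that AAD is regressed on AAB alone, and that an unregularized least-squares line is acceptable (the question's elastic-net settings are not reproduced). The names fit_line and fit_per_group are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes x is not constant within the group
    return slope, mean_y - slope * mean_x


def fit_per_group(rows):
    """Pass over the rows once to slice them by BAB, then fit each slice."""
    groups = defaultdict(list)
    for bab, aab, aad in rows:
        groups[bab].append((aab, aad))
    return {bab: fit_line(points) for bab, points in groups.items()}


# Made-up rows in the (BAB, AAB, AAD) layout described above.
rows = [
    ("b1", 0.0, 1.0), ("b2", 0.0, 0.0), ("b1", 1.0, 3.0),
    ("b2", 2.0, -2.0), ("b1", 2.0, 5.0), ("b2", 4.0, -4.0),
]
models = fit_per_group(rows)  # maps each BAB to (slope, intercept)
```

The data is scanned once to build the slices, rather than once per BAB value as in the filter-based loop.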

how to get the prediction of a model in pyspark

I have developed a clustering model using PySpark and I just want to predict the class of a single vector. Here is the code:
spark = SparkSession.builder.config("spark.sql.warehouse.dir",
"file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say I have a feature vector V and I want to predict which class it belongs to.
I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession rather than a SparkContext.

You have covered the basic steps of building the model. What remains is to apply the k-means model to the vector you want to cluster, as you did on line 10 (the model.transform call), and read off the prediction. In other words, repeat the work done on line 10, but on the new feature vector V. For more detail, I suggest reading this answer on Stack Overflow:
KMeans clustering in PySpark.
I would also add that the problem with the example you are following is not caused by using SparkSession rather than SparkContext. Both are just entry points to the Spark APIs, and you can reach a SparkContext through a SparkSession, since the entry points were unified in Spark 2.0. PySpark's k-means works much like scikit-learn's; the main difference is the set of predefined functions in the Spark Python API (PySpark).

You can call the predict method of the KMeans model with a Spark ML Vector (pyspark.ml's KMeansModel has had predict since Spark 3.0):
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. The vector must have the same length as your feature vector.
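
To make clear what such a prediction amounts to, here is a minimal sketch in plain Python of assigning a vector to the nearest cluster center. It assumes Euclidean distance and uses made-up centers; with a fitted model the centers would come from model.clusterCenters(). The function name nearest_center is hypothetical and is not part of the Spark API.

```python
def nearest_center(centers, point):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    if any(len(center) != len(point) for center in centers):
        raise ValueError("point must have the same length as the cluster centers")

    def sq_dist(center):
        return sum((c - p) ** 2 for c, p in zip(center, point))

    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


# Made-up centers standing in for model.clusterCenters().
centers = [[0.0, 0.0], [4.0, 4.0]]
first = nearest_center(centers, [1.0, 0.0])   # closer to [0.0, 0.0]
second = nearest_center(centers, [3.0, 5.0])  # closer to [4.0, 4.0]
```

The length check mirrors the requirement above that the vector match the dimensionality of the training features.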

Exception while trying to explain model with MMLSpark's scala LIME library

I am trying to explain the predictions made by my XGBoost model using MMLSpark's LIME package for Scala.
This is my first time using the LIME library. I am able to perform the fit operation on the dataset, but when I try to perform the transform operation, the program stops with an exception:
Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
I have around 200 features, and many of them have zero as their value.

You are likely using VectorAssembler to create your feature vector column. To save space, its transform function outputs a sparse vector when the feature set contains many zeros, and that is what causes the error in LIME.
More info on VectorAssembler output - Spark ML VectorAssembler returns strange output
The solution is to convert the column back to a dense vector so that mmlspark's LIME can interpret it.
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector
// Convert any ML vector (sparse or dense) to a DenseVector
val asDense = udf((v: Vector) => v.toDense)
val denseDF = featuresDF.withColumn("features", asDense(col("features")))
Then you can fit your model on the resulting DataFrame.
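
For readers unfamiliar with the two representations, the following plain-Python sketch shows what a sparse-to-dense conversion does conceptually. A sparse vector is taken here to be a size plus parallel lists of zero-based indices and values; the function to_dense is hypothetical and only illustrates the idea behind v.toDense.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Two non-zero entries out of five; every other position is an implicit zero.
dense = to_dense(5, [1, 3], [2.0, 7.0])
```

With around 200 features that are mostly zero, the sparse form stores only the non-zero entries, which is why it gets chosen to save space.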

Fitting Spark ML Kmeans for subsets or groups of data

I've got a Dataset where each Row is a (class: String, vectors: Array[Array[Float]]), and I'd like to fit a kmeans model in Spark MLLib per class. I can explode the vectors to normalize the data, loop through the classes, filter the entire dataset by class, and fit a model per iteration of the loop, but that's horribly inefficient (although it's how Spark does it in the fit method of the OneVsRest classifier here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala).
Here's a snippet that accomplishes this with a ParArray, inspired by OneVsRest's approach:
val classes = normalized_data.select("class").distinct.map(_.getString(0)).collect
val kmeans = new KMeans().setK(5)
// `class` is a reserved word in Scala, so the loop variable is named cls
val models = classes.par.map { cls =>
  val training_data = unpacked_data.filter($"label" === cls)
  val model = kmeans.fit(training_data)
  (cls, model)
}
It seems that the KMeans fit method needs the data to be a Dataset with one-row-per-vector, which suggests normalizing / exploding the data, but what's the best way to go about this? Can't I somehow leverage the fact that I start with all of my data points in each row and/or group on the label to use just these points without explicitly filtering the entire dataset for every class I want to build a model for?
PS- I know KMeans.fit actually needs org.apache.spark.ml.linalg.Vector; presume I've transformed my Array[Float] accordingly.

HDFS Files as input to Spark Mllib

All the examples in the tutorial use files in LibSVM format as input to Spark MLlib (http://spark.apache.org/docs/latest/mllib-ensembles.html):
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
But I have a file with millions of rows on HDFS that I want to use as input to Spark MLlib from PySpark, and I do not want to convert it to LibSVM format.
Can anyone please guide me on how to do this?

Generally, when you give input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or a vector). MLUtils.loadLibSVMFile converts your data into an RDD of LabeledPoint for you.
You can instead transform your data directly into whatever format the algorithm needs and then pass the resulting RDD to your MLlib algorithm.
http://spark.apache.org/docs/latest/mllib-data-types.html
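
As a rough illustration of that transformation step, the plain-Python sketch below parses one line of text into a label and its features, which is the shape a LabeledPoint holds. It assumes a comma-separated numeric file with the label in the first column, which the question does not actually specify; parse_csv_line and parse_libsvm_line are hypothetical helpers. In PySpark, a function like this would typically be applied to each line with sc.textFile(...).map(...) and its result wrapped in LabeledPoint(label, features).

```python
def parse_csv_line(line, label_index=0):
    """Split a comma-separated numeric line into (label, dense feature list)."""
    fields = [float(field) for field in line.split(",")]
    label = fields[label_index]
    features = fields[:label_index] + fields[label_index + 1:]
    return label, features


def parse_libsvm_line(line):
    """For comparison: parse 'label index:value ...' into (label, sparse dict)."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # LibSVM indices are 1-based
    return label, features


csv_point = parse_csv_line("1,0.5,2.0")
libsvm_point = parse_libsvm_line("1 1:0.5 3:2.0")
```

Both parsers carry the same information (a label plus features), so converting the file to LibSVM format first is unnecessary as long as each line can be mapped to that shape.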

I agree with @Rishi, with a few additions:
The LibSVM format represents an org.apache.spark.mllib.regression.LabeledPoint, which contains a label and a feature vector. If your data is not in LibSVM format, you can build a DataFrame that has a column of type LabeledPoint.
val trainingData = spark.read.text(<path to data folder or file>)
val trainingLabelPoints = trainingData.map { row =>
  // LabeledPoint(<Label as a Double>, Vectors.sparse(....) )
  LabeledPoint(row.getAs[Double]("column 1"), Vectors.sparse(row.getAs[Double]("column 2")...)
}.toDF("labelpoints")
// trainingLabelPoints can be used as input to an MLlib library
Clustering algorithms like k-means don't need LabeledPoints; a Vector column is enough.
Some classification algorithms, like LinearSVC, take two columns, a label and a feature vector; a LabeledPoint would work too.
If you have words in your training documents, you can use org.apache.spark.ml.feature.Word2Vec to convert the words to vectors.
So you have quite a lot of choices.
