I have developed a clustering model using PySpark, and I want to predict the class of a single vector. Here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.config("spark.sql.warehouse.dir",
                                    "file:///C:/temp").appName("Kmeans").getOrCreate()

# FEATURES_COL is the list of feature column names; df is the input DataFrame
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')

# Fit a k-means model with 6 clusters
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()

# Assign each row of the training data to a cluster
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V, and I want to predict which cluster it belongs to. I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/ but it doesn't work, since I'm working with a SparkSession, not a SparkContext.
You have covered the most basic steps of model creation. What you still need is to apply your k-means model to the vector you want to cluster (as you did with model.transform(df_kmeans)) and then read off its prediction. In other words, redo the same transform step, but on a DataFrame containing the new feature vector V. To understand this better, I invite you to read this answer posted on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to the use of SparkSession rather than SparkContext; both are just entry points to the Spark APIs, and you can access a SparkContext through a SparkSession, since the two were unified in Spark 2.0. The PySpark k-means API works much like scikit-learn's; the only difference is the set of predefined functions in the Spark Python API (PySpark).
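As a minimal sketch of that transform-based approach (the two-feature vector below is just an illustrative assumption; V must have the same length as FEATURES_COL):

from pyspark.ml.linalg import Vectors

# Wrap the new vector V in a one-row DataFrame with a "features" column
new_df = spark.createDataFrame([(Vectors.dense([1.0, 0.0]),)], ["features"])

# Apply the trained model exactly as was done on df_kmeans
model.transform(new_df).select("prediction").show()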
Alternatively, you can call the predict method of the k-means model directly on a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example; it must have the same length as the feature vectors the model was trained on. (Note that the predict method on the pyspark.ml KMeansModel was added in Spark 3.0; on older versions, use the transform approach shown above.)
Can you tell me how to create and use my own dataset in this code?
import tensorflow as tf
import tensorflow_datasets as tfds

# dataset_name, preprocess_image, dataset_repetitions and batch_size
# are defined elsewhere in the notebook
def prepare_dataset(split):
    # the validation dataset is shuffled as well, because data order matters
    # for the KID estimation
    return (
        tfds.load(dataset_name, split=split, shuffle_files=True)
        .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )
# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
It always redirects me to using TensorFlow Datasets. The code is from the Keras notebook 'Denoising Diffusion Implicit Models'.
I tried following the Keras guide on how to create your own dataset, but I'm not sure how to plug it into this code. I also tried this way of creating your own dataset, but I don't know if it's possible to fit it into the code above.
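One possible direction, as a rough sketch: if your images live in a local folder, you could build the same tf.data pipeline from tf.keras.utils.image_dataset_from_directory instead of tfds.load and keep the rest of the chain unchanged. The directory path, image size, batch size, split fraction, and the simplified scaling step below are all placeholder assumptions, not part of the original notebook:

import tensorflow as tf

# placeholder values; adjust to your data
image_dir = "path/to/your/images"
image_size = 64
batch_size = 64
dataset_repetitions = 5

def prepare_custom_dataset(subset):
    ds = tf.keras.utils.image_dataset_from_directory(
        image_dir,
        labels=None,              # unconditional generation needs no labels
        validation_split=0.2,
        subset=subset,            # "training" or "validation"
        seed=42,
        image_size=(image_size, image_size),
        batch_size=None,          # batch later, as in the original pipeline
    )
    return (
        # simplified stand-in for the notebook's preprocess_image: scale to [0, 1]
        ds.map(lambda x: x / 255.0, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )

train_dataset = prepare_custom_dataset("training")
val_dataset = prepare_custom_dataset("validation")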
I am using the Random Forest algorithm for classification in Spark MLlib with PySpark. My code is as follows:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(trnData, numClasses=3, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(tst_dataRDD.map(lambda x: x.features))
labelsAndPredictions = tst_dataRDD.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda x: x[0] != x[1]).count() / float(tst_dataRDD.count())
I got: IllegalArgumentException: GiniAggregator given label -0.0625 but requires label to be non-negative.
How can I solve this problem? Thanks.
It seems that for Gini impurity in multiclass classification, the labels must be non-negative (>= 0). Please check whether any negative labels are present (see the quick check sketched below).
Reference: the Spark repo.
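As a quick sketch of that check, assuming trnData is an RDD of LabeledPoint as in the question:

# count training points whose label is negative
num_negative = trnData.filter(lambda lp: lp.label < 0).count()
print("negative labels:", num_negative)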
Also, as a side note, please use the algorithm from the ml package rather than the legacy mllib one. Use RandomForestClassifier instead, and see the docs:
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
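A minimal sketch of the ml-package equivalent, assuming train_df and test_df are DataFrames with 'features' and 'label' columns (those names and DataFrames are assumptions, not taken from the question):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=3, maxDepth=4, maxBins=32, impurity="gini")
rf_model = rf.fit(train_df)
predictions = rf_model.transform(test_df)

# Test error = 1 - accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test error = %g" % (1.0 - evaluator.evaluate(predictions)))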
I am trying to explain the predictions made by my XGBoost model using MMLSpark's LIME package for Scala.
This is my first time using the LIME library. I am able to perform a fit operation on the dataset, but when I try to perform the transform operation, the program stops with an exception:
Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
I have around 200 features, and many of them contain zero as their value.
You are likely using VectorAssembler to create your feature vector column. Its transform function outputs a sparse vector when the feature set contains many zeros, to save space, and that is what causes the error in LIME.
More info on VectorAssembler output: Spark ML VectorAssembler returns strange output
The solution is to convert the column back to a dense vector so that mmlspark's LIME can interpret it.
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector

// UDF that converts any ML vector to its dense representation
val asDense = udf((v: Vector) => v.toDense)

val denseFeaturesDF = featuresDF.withColumn("features", asDense(col("features")))
Then you can fit your model on the resulting DataFrame.
I'm working to implement a logistic regression in PySpark that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for complex survey designs involving clusters, strata, and sample weights.
There are some avenues for at least getting the model into Python: for example, I was able to get a close match of both the coefficients and the standard errors using the statsmodels package, following this research project on GitHub.
However, my data is big, so I'd like to take advantage of Spark's distributed capabilities through the MLlib package. For example, the current setup to run the logit in Spark is:
import pyspark.ml.feature as ft
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline

featuresCreator = ft.VectorAssembler(
    inputCols=X_features_list,
    outputCol="features")

glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit",
                                           maxIter=25, regParam=0,
                                           labelCol='df',
                                           weightCol='wgt_panel')

pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(encoded_df_nonan)
The "weightcol" works for just simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.
It's my very first time trying to run a KMeans cluster analysis in Spark, so I am sorry for a stupid question.
I have a Spark DataFrame mydataframe with many columns. I want to run k-means on only two of them, lat and long (latitude and longitude), using them as plain numeric values, and extract 7 clusters based on just those two columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train?
Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!
The method KMeans.train takes an RDD as input, not a DataFrame (data). So you first have to convert data to an RDD: data.rdd. Note that the resulting RDD contains Row objects rather than numeric vectors, so you will likely also need to map each row to an array of its values, as sketched below.
Hope it helps.
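A minimal sketch of that conversion, reusing the names from the question:

from numpy import array
from pyspark.mllib.clustering import KMeans

# Convert the DataFrame to an RDD of numeric arrays before training
rdd = data.rdd.map(lambda row: array([row['lat'], row['long']]))
clusters = KMeans.train(rdd, 7, maxIterations=15, initializationMode="random")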