I am trying to explain the predictions made by my XGBoost model using MMLSpark's LIME package for Scala.
This is my first time using the LIME library. I am able to perform the fit operation on the dataset, but when I try to perform the transform operation the program stops with an exception:
Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
I have around 200 features, and many of them have zero as their value.
You are likely using VectorAssembler to create your feature vector column. Its transform method outputs a sparse vector when the feature set contains many zeros, to save space, and that sparse representation is what causes the error in LIME.
More info on VectorAssembler output - Spark ML VectorAssembler returns strange output
The solution is to convert the column back to a dense vector so that MMLSpark's LIME can interpret it:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector

// UDF that converts any ML Vector (sparse or dense) to its dense representation
val asDense = udf((v: Vector) => v.toDense)

// DataFrames are immutable, so keep the DataFrame returned by withColumn
val denseFeaturesDF = featuresDF.withColumn("features", asDense(col("features")))
Then you can fit your model.
I have developed a clustering model using PySpark, and I want to predict the cluster of a single vector. Here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.config("spark.sql.warehouse.dir",
                                    "file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say I have a vector of features V, and I want to predict which cluster it belongs to.
I tried a method I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession rather than a SparkContext.
I see that you have covered the basic steps of model creation. What you still need to do is apply your k-means model to the vector you want to cluster, exactly as you did with model.transform(df_kmeans), and then read off the prediction; in other words, redo the same transform step, but on the new vector of features V. To understand this better, I invite you to read this answer posted on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem in the example you are following is not due to using a SparkSession rather than a SparkContext; both are just entry points to the Spark APIs, and you can access a SparkContext through a SparkSession, since the two were unified in Spark 2.0. PySpark's k-means works much like scikit-learn's; the only difference is the set of predefined functions in the Spark Python API (PySpark).
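As a minimal sketch of that re-transform step, assuming V is a plain Python list holding the raw values of the columns in FEATURES_COL (the variable names reuse those from your code):

# Hypothetical single observation: raw values in the same order as FEATURES_COL
new_row = spark.createDataFrame([tuple(V)], FEATURES_COL)

# Reuse the fitted assembler and model, then read off the prediction
new_features = vecAssembler.transform(new_row).select("features")
print(model.transform(new_features).select("prediction").collect())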
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.
I am using the Random Forest algorithm for classification in Spark MLlib with PySpark. My code is as follows:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(trnData, numClasses=3, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(tst_dataRDD.map(lambda x: x.features))
labelsAndPredictions = tst_dataRDD.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda x: x[0] != x[1]).count() / float(tst_dataRDD.count())
I got IllegalArgumentException: GiniAggregator given label -0.0625 but requires label to be non-negative.
How can I solve this problem? Thanks
It seems that for Gini impurity in multiclass classification, the labels must be non-negative (>= 0). Please check whether any negative labels are present.
ref - spark repo
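A quick way to check, as a minimal sketch assuming trnData is an RDD of LabeledPoint (the format RandomForest.trainClassifier expects):

# Count how many training labels are negative; this should be 0 for classification
negative_labels = trnData.filter(lambda lp: lp.label < 0).count()
print(negative_labels)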
Also, as a side note, please use the algorithm from the ml package rather than the legacy mllib package.
Please use RandomForestClassifier instead and see the docs:
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
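For instance, here is a minimal sketch with the DataFrame-based API; train_df and test_df are hypothetical DataFrames with "label" and "features" columns:

from pyspark.ml.classification import RandomForestClassifier

# DataFrame-based Random Forest; labels must be non-negative, 0-indexed doubles
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=3, maxDepth=4, maxBins=32, impurity="gini")
rf_model = rf.fit(train_df)
predictions = rf_model.transform(test_df)  # adds prediction and probability columns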
It's my very first time trying to run a k-means cluster analysis in Spark, so I am sorry for a stupid question.
I have a Spark DataFrame mydataframe with many columns. I want to run k-means on only two columns, lat and long (latitude and longitude), using them as simple values, and extract 7 clusters based on just those two columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train?
Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!
The method KMeans.train takes an RDD as input, not a DataFrame (data). So you have to convert data to an RDD with data.rdd; you will likely also need to map each Row to a plain list of values, since MLlib expects vector-like elements rather than Rows.
Hope it helps.
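A minimal sketch of that conversion, keeping your lat and long columns:

from pyspark.mllib.clustering import KMeans

# Turn the two numeric columns into an RDD of plain [lat, long] lists
data_rdd = mydataframe.select('lat', 'long').rdd.map(lambda row: [row['lat'], row['long']])
clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")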
The documentation for Random Forests does not include feature importances. However, it is listed on the Jira as resolved and is in the source code. HERE also says "The main differences between this API and the original MLlib ensembles API are:
support for DataFrames and ML Pipelines
separation of classification vs. regression
use of DataFrame metadata to distinguish continuous and categorical features
more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification."
However, I cannot figure out a syntax that works to call this new feature.
scala> model
res13: org.apache.spark.mllib.tree.model.RandomForestModel =
TreeEnsembleModel classifier with 10 trees
scala> model.featureImportances
<console>:60: error: value featureImportances is not a member of org.apache.spark.mllib.tree.model.RandomForestModel
model.featureImportances
You have to use the new Random Forests. Check your imports.
The OLD:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
The NEW Random Forests use:
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
This S.O. answer provides code for extracting the importances.
This S.O. answer explains the sparse vector that is returned.
All the examples in the tutorial (http://spark.apache.org/docs/latest/mllib-ensembles.html) use files in LibSVM format as input to Spark MLlib:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
But I have a file with millions of rows located on HDFS, and I want to give it as input to Spark MLlib using PySpark without converting it to LibSVM format.
Can anyone please guide me on how to do this?
Generally, when you give input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or a Vector); MLUtils.loadLibSVMFile simply converts your data into an RDD of LabeledPoint for you.
You can directly transform your data into whatever format the algorithm needs and then give the resulting RDD as input to your MLlib algorithm.
http://spark.apache.org/docs/latest/mllib-data-types.html
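For example, here is a minimal PySpark sketch, assuming a comma-separated file on HDFS with the label in the first column (the path is made up):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

# Hypothetical input: comma-separated rows, label first, features after
raw = sc.textFile("hdfs:///path/to/data.csv")

def parse_line(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

training_rdd = raw.map(parse_line)  # RDD of LabeledPoint, ready for MLlib trainers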
I agree with @Rishi, with a few additions:
The LibSVM format represents an org.apache.spark.mllib.regression.LabeledPoint; it contains a label and a feature vector. If you don't have data in LibSVM format, you can create the equivalent by building a DataFrame holding a column of type LabeledPoint.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import spark.implicits._

val trainingData = spark.read.text(<path to data folder or file>)
val trainingLabelPoints = trainingData.map { row =>
  // LabeledPoint(<label as a Double>, Vectors.sparse(....))
  LabeledPoint(row.getAs[Double]("column 1"), Vectors.sparse(row.getAs[Double]("column 2")...))
}.toDF()
// trainingLabelPoints can be used as input to an MLlib algorithm
Clustering algorithms like k-means don't need LabeledPoints; a Vector column is enough.
Some classification algorithms like LinearSVC can take two columns, label and feature vector, so a LabeledPoint would work too.
If you have words in your training documents, you can use org.apache.spark.ml.feature.Word2Vec to convert the words to vectors.
So you have quite a lot of choices.
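For the Word2Vec option mentioned above, a minimal PySpark sketch (the sample sentences are made up) might look like this:

from pyspark.ml.feature import Word2Vec

# Hypothetical DataFrame with a "text" column of tokenized words
docs = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "),),
    ("Logistic regression models are neat".split(" "),),
], ["text"])

word2vec = Word2Vec(vectorSize=50, minCount=0, inputCol="text", outputCol="features")
w2v_model = word2vec.fit(docs)
vectors = w2v_model.transform(docs)  # adds a "features" vector column per document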