Illegal Argument Exception using Random Forest in PySpark mllib - pyspark

I am using Random Forest algorithm for classification in Spark MLlib using PySpark. My codes are as follows:\
model = RandomForest.trainClassifier(trnData, numClasses=3, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(tst_dataRDD.map(lambda x: x.features))
labelsAndPredictions = tst_dataRDD.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda x: x[0] != x[1]).count() / float(tst_dataRDD.count())
I got IllegalArgumentException: GiniAggregator given label -0.0625but requires label to be non-negative.
How can I solve this problem? Thanks

It seems for Gini impurity during multiclass classification, the labels must be positive (>=0). Please check if there are any negative labels present.
ref - spark repo
Also, on side note, please use algorithm from ml package and not from legacy mllib

Please use RandomForestClassifier instead and see the docs:
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier

Related

Use own dataset in Denoising Diffusion Implicit Models

Can you tell me how I create and implement my own dataset in this code?
def prepare_dataset(split):
# the validation dataset is shuffled as well, because data order matters
# for the KID estimation
return (
tfds.load(dataset_name, split=split, shuffle_files=True)
.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
.cache()
.repeat(dataset_repetitions)
.shuffle(10 * batch_size)
.batch(batch_size, drop_remainder=True)
.prefetch(buffer_size=tf.data.AUTOTUNE)
)
# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
It always redirects me to use the TensorFlow datasets. It's from the Keras Notebook 'Denoising Diffusion Implicit Models'.
I tried following the keras guide on how to create your own dataset, but I'm not sure how to implement it in this code. I also tried this way of creating your own dataset, but I don't know if it's possible to implement it into the code above.

how to get the prediction of a model in pyspark

i have developed a clustering model using pyspark and i want to just predict the class of one vector and here is the code
spark = SparkSession.builder.config("spark.sql.warehouse.dir",
"file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
say that i have a vector of features V and i want to predict in which class it belongs
i tried a method that i found in this link http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work since i'm working with SparkSession not in sparkContext
I see that you dealt with the most basic steps in your model creation, what you still need is to apply your k-means model on the vector that you want to make the clustering on (like what you did in line 10) then get your prediction, I mean what you have to do is to reDo the same work done in line 10 but on the new vector of features V. To understand this more I invite you to read this posted answer in StackOveflow:
KMeans clustering in PySpark.
I want to add also that the problem in the example that you are following is not due to the use of SparkSession or SparkContext as those are just an entry point to the Spark APIs, you can also get access to a sparContext through a sparkSession since it is unified by Databricks since Spark 2.0. The pyspark k-means is like the Scikit learn the only difference is the predefined functions in spark python API (PySpark).
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.

Exception while trying to explain model with MMLSpark's scala LIME library

I am trying to explain the predictions made by my XGboost model using MMLSparks Lime package for scala.
This is my first time using LIME library, I am able to perform a fit operation on the dataset and when I am trying to perform the transform operation, the program stops with an exception,
Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
I have around 200 features and many of them contain zero as its feature value.
You are likely using VectorAssembler to create your feature vector column. The transform function outputs a sparse vector if there are lots of zeros in your feature set to save computational space. This causes the error for LIME.
More info on VectorAssembler output - Spark ML VectorAssembler returns strange output
The solution is to convert the column back to a dense vector in order for mmlspark LIME to interpret.
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vector
val asDense = udf((v: Vector) => v.toDense)
featuresDF.withColumn("features", asDense(col("features")))
Then you can fit your model.

Is there any plan to implement complex survey design within Spark's MLLIB package?

I'm working to implement a logistic regression in Pyspark that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for complex survey design involving clusters/strata/sample weights.
There are some avenues out there for at least getting the model into Python: for example, I was able to get a close match of both coefficients and standard errors using the statsmodels package from this research project on Github
However, my data is big and so I'd like to take advantage of Spark's distributed capabilities through the MLLIB package. For example, the current setup to run the logit in Spark is:
import pyspark.ml.feature as ft
featuresCreator = ft.VectorAssembler(
inputCols = X_features_list,
outputCol = "features")
glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit", maxIter=25, regParam = 0,
labelCol='df',
weightCol='wgt_panel')
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(encoded_df_nonan)
The "weightcol" works for just simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.

Spark ML Pipelines: unseen label exception when classifying new examples

I cannot find how to use a Spark ML Pipeline to classify a new set of instances (with unknown labels).
All the examples I find are based on a test set with already known labels (which are only used to evaluate the performance of the classification).
I have the following pipeline:
StringIndexer indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex");
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
HashingTF hashingTF = new HashingTF().setNumFeatures(NUM_FEATURES).setInputCol(tokenizer.getOutputCol())
.setOutputCol("rawFeatures");
IDF idf = new IDF().setInputCol(hashingTF.getOutputCol()).setOutputCol("rescaledFeatures");
NaiveBayes naiveBayes = new NaiveBayes().setFeaturesCol(idf.getOutputCol()).setLabelCol("categoryIndex");
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[] { indexer, tokenizer, hashingTF, idf, naiveBayes });
It works perfect, both fitting the estimators and running the transformers (against a test set, which contains gold labels)
But when I try to use the same pipeline for a "real" example for which there is no gold label (the label is precisely what we want to obtain), the StringIndexer which is part of the pipeline throws an exception:
Caused by: org.apache.spark.SparkException: Unseen label: UNKNOWN.
UNKNOWN is the fake label I have set when programatically creating a new Dataset element with the new unseen examples, and of course such label was not present in the training set. I understand why this error is raised, but is there a way to tell the pipeline that I am no longer "training" or "evaluating", but using it for real classification?
How can I proceed to build a valid input for the pipeline up from a new example (with no known label) to classify it?
Being this my very first question in stackoverflow, hope I have explained it clear enough.
Thanks in advance :-)
I have found a solution myself.
TL;DR: StringIndexer should be placed outside the pipeline.
Transform the initial dataset with the StringIndexer to obtain each label index, but do not include the transformation in the pipeline. Then set an IndexToString transformer at the end of the pipeline to convert the predicted indexes (the outcomes of the ML algorithm employed for classification/regression) back to categorical labels.
This way, when the pipeline model is stored for later use in production, there will be no StringIndexer causing the aforementioned problem, and the IndexToString will interpret the outcomes of the prediction model to output meaningful labels.