PySpark PCA: avoiding NotConvergedException - pyspark

I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA through the ml.linalg method as follows:
1) Named my columns as one list:
features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in{'indi_nbr','label'}]).columns
2) Imported the necessary libraries
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
3) Collapsed the features to a DenseVector
indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-2]))).toDF(['indi_nbr','label','features'])
4) Dropped everything but the features to retain index:
dftest = indi_feat.drop('indi_nbr','label')
5) Instantiated the PCA object
dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")
6) And attempted to fit the model
PCAout = dfPCA.fit(dftest)
But my model fails to converge (error below).
Things I've tried:
- Mean-filling or zero-filling NA and Null values (as appropriate)
- Reducing the number of features (to 25, then I switched to SKlearn's PCA)
Py4JJavaError: An error occurred while calling o2242.fit.
: breeze.linalg.NotConvergedException:
at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:110)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:40)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.svd$.apply(svd.scala:23)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:389)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
My configuration is for 50 executors with 6GB/executor, so I don't think it's a matter of not having enough resources (and I don't see anything about resources here).
My input factors are a mixture of percentages, integers and 2-decimal floats, all positive and all ordinal. Could that be causing difficulty with convergence?
I had no trouble with the SKLearn method converging, and quickly, once I converted the PySpark DF to a Pandas DF.

Related

Percentile rank in pyspark using QuantileDiscretizer

I am wondering if it's possible to obtain the result of percentile_rank using the QuantileDiscretizer transformer in pyspark.
The purpose is that I am trying to avoid computing the percent_rank over the entire column, as it generates the following error:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count()+1,\
inputCol="x",\
outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),\
F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values.__getitem__("maxval"), agg_values.__getitem__("minval")
normalize = F.udf(lambda x: (x-xmin)/(xmax-xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer")- F.col("perc_rank")),6) )
print(df.select(F.max("error")).show())
df.show(5)
However, it seems that increasing the number of datapoints the error grows, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percentile_rank?
Alternatively is there a way to compute percentile_rank over an entire column in an efficient way?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
but nonetheless, the data would still be moved to a single worker, I don't think so there is any better alternative to avoid the shuffle here since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.

Issue/Bug when loading and applying MultilayerPerceptronClassifier in Spark Version 3.0.0

IllegalArgumentException: MultilayerPerceptronClassifier_... parameter solver given invalid value auto
I believe I have discovered a bug when loading MultilayerPerceptronClassificationModel in spark 3.0.0, scala 2.1.2 which I have tested and can see is not there in at least Spark 2.4.3, Scala 2.11. .
I am using pyspark on a databricks cluster and importing the library “from pyspark.ml.classification import MultilayerPerceptronClassificationModel”
When running model=MultilayerPerceptronClassificationModel.(“load”) and then model. transform (df) I get the following error: IllegalArgumentException: MultilayerPerceptronClassifier_8055d1368e78 parameter solver given invalid value auto.
This issue can be easily replicated by running the example given on the spark documents: http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
Then adding a save model, load model and transform statement as such:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load training data
data = spark.read.format("libsvm")\
.load("data/mllib/sample_multiclass_classification_data.txt")
# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
# train the model
model = trainer.fit(train)
# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
from pyspark.ml.classification import MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel
model.save(Save_location)
model2. MultilayerPerceptronClassificationModel.load(Save_location)
result_from_loaded = model2.transform(test)
Bug has been confirmed Jira opened: : https://issues.apache.org/jira/browse/SPARK-32232

Text mining : Bad predictions of toxic comments using Word2Vec

I have a dataset containing sentences and boolean columns (0 or 1) to classify the type of the comment (toxic|severe_toxic|obscene|threat|insult|identity_hate).
You can download the dataset here : https://ufile.io/nqns7
I filtered the words with spacy to only keep useful words, i kept : Adjectives, Adverbs, Verbs and Nouns using this function :
def filter_words(words) :
vec = []
conditions = ('ADV','NOUN','ADJ','VERB')
for token in nlp(words):
if not token.is_stop and token.pos_ in conditions:
vec.append(token.lemma_)
return vec
Then i converted the dataframe to a parquet file to speed up the performances.
I ended up with a dataframe which looks like this :
I used a Word2Vec on this DF to create a features column in order to use RandomForestClassifier to predict if model works well.
Here is the code :
from pyspark.ml.feature import Word2Vec
from pyspark.sql.functions import *
word2vec = Word2Vec(inputCol="vector_words",outputCol="features")
model = word2vec.fit(sentences)
result = model.transform(sentences)
result = result.withColumn("toxic", result["toxic"].cast(IntegerType()))
rf =RandomForestClassifier(labelCol="toxic",featuresCol="features")
result = result.dropna()
(trainingSet, testSet) = result.randomSplit([0.7,0.3])
model_toxic = rf.fit(trainingSet)
predictions = model_toxic.transform(testSet)
But the problem i have here, is that i only have 16 predictions that are considered toxic from which 13 are really identified as toxic while there are about 4000 toxic comments in the set.
I don't understand why. Is it because of the filter i applied on the words, which might be too restrictive( i don't know why though ) or is it because the parameters of my Word2Vec and RandomForestClassifier aren't precise enough?
I'm new to pyspark and i couldn't find any information about bad models, basically people on internet are pretty happy about the results. Any help would be appreciated.

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train decision tree classifier, random forest and gradient boosted trees. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark is telling me that DecisionTree currently only supports maxDepth <= 30. What is the reason to limit it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDFs, so I want to try higher values for the maximum depth. Sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
paramGrid_rf = ParamGridBuilder() \
.addGrid(rf.maxDepth, [50,100,150,250,300]) \
.build()
crossval_rf = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid_rf,
evaluator=BinaryClassificationEvaluator(),
numFolds= 5)
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implmentation imposes a restriction of maxDepth <= 30:
https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
You could ask to increase that limit in github forum!

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I have a confusion regarding BinaryClassificationMetrics (Mllib) inputs. As per Apache Spark 1.6.0, we need to pass predictedandlabel of Type (RDD[(Double,Double)]) from transformed DataFrame that having predicted, probability(vector) & rawPrediction(vector).
I have created RDD[(Double,Double)] from Predicted and label columns. After performing BinaryClassificationMetrics evaluation on NavieBayesModel, I'm able to retrieve ROC, PR etc. But the values are limited, I can't able plot the curve using the value generated from this. Roc contains 4 values and PR contains 3 value.
Is it the right way of preparing PredictedandLabel or do I need to use rawPrediction column or Probability column instead of Predicted column?
Prepare like this:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)
val preds = predictions.select("probability", "label").rdd.map(row =>
(row.getAs[Vector](0)(0), row.getAs[Double](1)))
And evaluate:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
new BinaryClassificationMetrics(preds, 10).roc
If predictions are only 0 or 1 number of buckets can be lower like in your case. Try more complex data like this:
val anotherPreds = df1.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc