How to fit a MultilayerPerceptronClassifier in PySpark? - pyspark

Hi, I am trying to fit a MultilayerPerceptronClassifier with the PySpark 2.4.3 machine learning library, but every time I try to fit it I get the following error:
Py4JJavaError: An error occurred while calling o4105.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 784.0 failed 4 times, most recent failure: Lost task 0.3 in stage 784.0 (TID 11663, hdpdncwy87013.dpp.acxiom.net, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$org$apache$spark$ml$feature$OneHotEncoderModel$$encoder$1: (double, int) => struct,values:array>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
The code I am using:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier
df = sqlContext.read.format("csv").options(header='true', sep=",", inferschema='true').load(location)
exclude = ["Target"]
inputs = [column for column in df.columns if (column not in exclude)]
vectorAssembler = VectorAssembler(inputCols=inputs, outputCol='Features')
vdf = vectorAssembler.transform(df)
vdf = vdf.select(['Features'] + exclude)
# Feature Scaling
scaler = MinMaxScaler(inputCol="Features", outputCol="scaledFeatures")
scalerModel = scaler.fit(vdf)
scaledData = scalerModel.transform(vdf)
# train-test split
splits = scaledData.randomSplit([0.7, 0.3], seed=2020)
train_df = splits[0]
test_df = splits[1]
layers = [len(inputs), 3, 3, 3, 5]
mlpc = MultilayerPerceptronClassifier(labelCol="Target", featuresCol="scaledFeatures", layers=layers,
blockSize=128, stepSize=0.03, seed=2020, maxIter=1000)
model = mlpc.fit(train_df)
Do you have an idea? Thank you in advance. The number of inputs is 1902 and the number of classes to predict is 5.
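One thing worth checking before fitting is the set of label values, since the classifier expects them to run from 0 to numClasses - 1; a minimal sketch using the Target column and train_df from the code above:
# For 5 classes the distinct label values should be exactly 0.0 through 4.0
train_df.groupBy("Target").count().orderBy("Target").show()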

This is an old question, but we encountered the exact same error recently. We didn't have any issue with binary classification, but this exception was thrown for multi-class classification problems, just like yours.
The problem with multi-class classification for us was that our labels were 1, 2, 3. It turns out MultilayerPerceptronClassifier expects the labels to start from 0, so when we subtracted 1 from our labels (making them 0, 1, 2), the model trained successfully without any exception. If you are seeing this exception for a multi-class classification problem with labels that do not start at 0, this might be your problem.
Hope this saves someone's hours of debugging time.
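For reference, a minimal PySpark sketch of that fix, assuming a numeric label column named Target holding the values 1 through 5 (the column name and the subtraction are illustrative; adapt them to your own labels):
from pyspark.sql import functions as F
# Shift 1-based labels (1..5) to the 0-based labels (0..4) the MLP expects
train_df = train_df.withColumn("Target", F.col("Target") - 1)
test_df = test_df.withColumn("Target", F.col("Target") - 1)
# Alternatively, StringIndexer always produces 0-based label indices:
# from pyspark.ml.feature import StringIndexer
# indexed = StringIndexer(inputCol="Target", outputCol="label").fit(vdf).transform(vdf)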

Related

Issue/Bug when loading and applying MultilayerPerceptronClassifier in Spark Version 3.0.0

IllegalArgumentException: MultilayerPerceptronClassifier_... parameter solver given invalid value auto
I believe I have discovered a bug when loading a MultilayerPerceptronClassificationModel in Spark 3.0.0, Scala 2.12, which I have tested and can see is not there in at least Spark 2.4.3, Scala 2.11.
I am using PySpark on a Databricks cluster and importing the library with "from pyspark.ml.classification import MultilayerPerceptronClassificationModel".
When running model = MultilayerPerceptronClassificationModel.load(Save_location) and then model.transform(df), I get the following error: IllegalArgumentException: MultilayerPerceptronClassifier_8055d1368e78 parameter solver given invalid value auto.
This issue can be easily replicated by running the example given in the Spark documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
and then adding save, load, and transform statements as follows:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load training data
data = spark.read.format("libsvm")\
.load("data/mllib/sample_multiclass_classification_data.txt")
# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
# train the model
model = trainer.fit(train)
# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
from pyspark.ml.classification import MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel
model.save(Save_location)
model2 = MultilayerPerceptronClassificationModel.load(Save_location)
result_from_loaded = model2.transform(test)
The bug has been confirmed and a Jira ticket has been opened: https://issues.apache.org/jira/browse/SPARK-32232

spark - extract elements from an RDD[Row] when reading Hive table in Spark

I want to read a Hive table in Spark using Scala, extract some or all of its fields, and then save the data to HDFS.
My code is as follows:
import scala.collection.mutable.ArrayBuffer
val data = spark.sql("select * from table1 limit 1000")
val new_rdd = data.rdd.map(row => {
var arr = new ArrayBuffer[String]
val len = row.size
for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
arr.toArray
})
new_rdd.take(10).foreach(println)
new_rdd.map(_.mkString("\t")).saveAsTextFile(dataOutputPath)
The above chunk is the one that finally worked.
I had written another version, where this line:
for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
was replaced by this line:
for(i <- 0 to len-1) arr.+=(row.get(i).toString)
To me, both lines do exactly the same thing: for each row, I get the i-th element as a string and put it into the ArrayBuffer, which is turned into an Array at the end.
However, the two methods have different results.
The first line works well; the data was saved correctly to HDFS.
With the second line, however, an error was thrown when saving the data:
ERROR ApplicationMaster: User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 56
in stage 3.0 failed 4 times, most recent failure: Lost task 56.3 in stage
3.0 (TID 98, ip-172-31-18-87.ec2.internal, executor 6):
java.lang.NullPointerException
Therefore, I wonder whether there is some intrinsic difference between
getAs[String](i)
and
get(i).toString
?
Many thanks
getAs[String](i) is the same as
get(i).asInstanceOf[String]
so it is just a type cast, and casting a null value to String simply yields null. get(i).toString, on the other hand, calls a method on the value itself, so when the value is null it throws the NullPointerException you are seeing.

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train a decision tree classifier, a random forest, and gradient boosted trees. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark is telling me that DecisionTree currently only supports maxDepth <= 30. What is the reason for limiting it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDF vectors, so I want to try higher values for the maximum depth. Sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
paramGrid_rf = ParamGridBuilder() \
.addGrid(rf.maxDepth, [50,100,150,250,300]) \
.build()
crossval_rf = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid_rf,
evaluator=BinaryClassificationEvaluator(),
numFolds= 5)
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implementation imposes a restriction of maxDepth <= 30:
https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
You could ask for that limit to be increased on GitHub!
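Until that limit changes, the parameter grid has to stay within the supported range; a minimal sketch of the same grid with the depth values capped at 30 (the specific depths are illustrative, and the snippet reuses rf, pipeline, and the imports from the code above):
# Every maxDepth candidate must be <= 30 or CrossValidator.fit will fail with the same error
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [10, 20, 30]) \
    .build()
crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(),
                             numFolds=5)
cvModel_rf = crossval_rf.fit(trainingData)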

SparkML MultilayerPerceptron error: java.lang.ArrayIndexOutOfBoundsException

I have the following model that I would like to estimate using SparkML MultilayerPerceptronClassifier().
val formula = new RFormula()
.setFormula("vtplus15predict~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
.setFeaturesCol("features")
.setLabelCol("label")
formula.fit(data).transform(data)
Note: features is a vector and label is a double, as the schema shows:
root
|-- features: vector (nullable = true)
|-- label: double (nullable = false)
I define my MLP estimator as follows:
val layers = Array[Int](6, 5, 8, 1) //I suspect this is where it went wrong
val mlp = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = mlp.fit(train)
Unfortunately, I got the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
...
org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
This tells us that an array is out of bounds in the MultilayerPerceptronClassifier.scala file; let's look at the code there:
def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
val output = Array.fill(labelCount)(0.0)
output(labeledPoint.label.toInt) = 1.0
(labeledPoint.features, Vectors.dense(output))
}
It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs since the output array is too short.
By going back in the code, it's possible to find that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes should be the same as the number of classes. Looking at the documentation for MLP there is the following line:
The number of nodes N in the output layer corresponds to the number of classes.
The solution is therefore to either:
Change the number of nodes in the final layer of the network (output nodes)
Reconstruct the data to have the same number of classes as your network output nodes.
Note: The final output layer should always be 2 or more, never 1, since there should be one node per class and a problem with a single class does not make sense.
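The question uses the Scala API, but the first option is easy to sketch in PySpark: size the output layer from the number of distinct labels (the column names label and features, the hidden-layer sizes, and the train DataFrame below mirror the question and are illustrative):
from pyspark.ml.classification import MultilayerPerceptronClassifier
# Make the output layer exactly as wide as the number of classes
num_classes = train.select("label").distinct().count()
num_features = len(train.first()["features"])  # size of the feature vector
layers = [num_features, 5, 8, num_classes]  # hidden sizes 5 and 8 are illustrative
mlp = MultilayerPerceptronClassifier(layers=layers, blockSize=128, seed=1234, maxIter=100)
model = mlp.fit(train)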
Rearrange your dataset: as the error shows, the output array has fewer entries than your labels require, or your data set contains nulls, which prompted the error. I came across this type of error while working on my MLP project; hope my answer helps you.
Thanks for reaching out.
The solution is to first find a local optimum that lets you escape the ArrayIndexOutOfBoundsException and then use a brute-force search to find the global optimum. Shaido suggested finding n; for example:
val layers = Array[Int](6, 5, 8, n)  // this assumes the length of the feature vectors is 6 – Shaido
So make n a large integer (n = 100), then manually use a brute-force search to arrive at a good value (n = 50, then n = 32 - error, n = 35 - perfect).
Credit to Shaido.

Apache Spark reduceByKey to sum decimals

I'm trying to map an RDD as shown below (see the output for results) and then reduce by key over the decimal values, but I keep getting an error. When I tried reduceByKey() with a word count it worked fine. Are decimal values summed differently?
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i=> i(0).split("/")(2)=="2008")
.map(i=> (i(0).split("/")(2),i(2).toFloat)).take(5)
Output:
voltageRDD: Array[(String, Float)] = Array((2008,1.62), (2008,1.626), (2008,1.622), (2008,1.612), (2008,1.612))
When trying to reduce:
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i=> i(0).split("/")(2)=="2008")
.map(i=> (i(0).split("/")(2),i(2).toFloat)).reduceByKey(_+_).take(5)
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2954.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2954.0 (TID 15696, 10.19.240.54): java.lang.NumberFormatException: For input string: "?"
If your data contains values that are not parseable as a float, you should either filter them out beforehand or treat them accordingly. Such a treatment could mean assigning a value of 0.0f whenever you see a non-parseable entry. The following code does exactly this.
import scala.util.Try
val voltageRDD= myRDD.map(i=> i.split(";"))
.filter(i => i(0).split("/")(2)=="2008")
.map(i => (i(0).split("/")(2), Try{ i(2).toFloat }.toOption.getOrElse(0.0f)))
.reduceByKey(_ + _).take(5)
Short version: you probably have a line for which i(2) equals "?".
As per my comment, your data most probably isn't consistent, which isn't a problem in the first snippet because of the take(5) and the absence of actions that require Spark to process the whole data set. Spark is lazy and will therefore perform computations only until it gets 5 results from the map -> filter -> map chain.
The second snippet, on the other hand, has to process your whole data set in order to perform the reduceByKey, and only then does it take 5 results, so it can hit problems that sit too far into your data set for the first snippet to ever reach.
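For readers on PySpark rather than Scala, a minimal sketch of the same defensive parse (my_rdd, the field positions, and the ';' separator mirror the question and are illustrative):
def to_float_or_zero(s):
    # Non-parseable entries such as "?" become 0.0 instead of raising an exception
    try:
        return float(s)
    except ValueError:
        return 0.0

voltage_rdd = (my_rdd.map(lambda line: line.split(";"))
               .filter(lambda i: i[0].split("/")[2] == "2008")
               .map(lambda i: (i[0].split("/")[2], to_float_or_zero(i[2])))
               .reduceByKey(lambda a, b: a + b))
print(voltage_rdd.take(5))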