SparkML MultilayerPerceptron error: java.lang.ArrayIndexOutOfBoundsException - scala

I have the following model that I would like to estimate using SparkML MultilayerPerceptronClassifier().
val formula = new RFormula()
  .setFormula("vtplus15predict~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
  .setFeaturesCol("features")
  .setLabelCol("label")
formula.fit(data).transform(data)
Note: features is a vector and label is a double:
root
|-- features: vector (nullable = true)
|-- label: double (nullable = false)
I define my MLP estimator as follows:
val layers = Array[Int](6, 5, 8, 1) // I suspect this is where it went wrong
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
// train the model
val model = mlp.fit(train)
Unfortunately, I got the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
...

org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
This tells us that an array index is out of bounds in the MultilayerPerceptronClassifier.scala file. Let's look at the code there:
def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
val output = Array.fill(labelCount)(0.0)
output(labeledPoint.label.toInt) = 1.0
(labeledPoint.features, Vectors.dense(output))
}
It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs because labeledPoint.label.toInt is used as an index into the output array, which only has labelCount entries. Here a label of 11 was encountered (the 11 in the exception message), while the array is shorter than that.
Tracing back through the code, it's possible to find that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes should be the same as the number of classes. The documentation for MLP contains the following line:
The number of nodes N in the output layer corresponds to the number of classes.
The solution is therefore to either:
Change the number of nodes in the final layer of the network (the output nodes) to match the number of classes, or
Reconstruct the data so that it has the same number of classes as the network has output nodes.
Note: The output layer should always have 2 or more nodes, never 1, since there should be one node per class and a problem with a single class does not make sense.
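The failure can be reproduced outside Spark. The following is a minimal sketch in plain Python, not Spark code: encode_label is a hypothetical stand-in for encodeLabeledPoint, and the label value 11.0 is taken from the exception message purely for illustration.

```python
def encode_label(label, label_count):
    """One-hot encode `label` into a list of length `label_count`.

    Mirrors the logic of encodeLabeledPoint: the label is truncated to an
    integer and used directly as an index into the output array.
    """
    output = [0.0] * label_count
    index = int(label)
    if not 0 <= index < label_count:
        # Python lists accept negative indices, so the bounds are checked
        # explicitly to mimic the JVM's ArrayIndexOutOfBoundsException.
        raise IndexError(f"label {label} does not fit in {label_count} output nodes")
    output[index] = 1.0
    return output


# With 12 output nodes, label 11.0 is the last valid class.
print(encode_label(11.0, 12))

# With a single output node (as in layers = 6, 5, 8, 1) the same label fails.
try:
    encode_label(11.0, 1)
except IndexError as err:
    print("encoding failed:", err)
```

The point of the sketch is that the label value itself is the index, so every label must be smaller than the size of the output layer.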

Rearrange your dataset: the error shows that you have fewer arrays than you have in your feature set, or that your dataset has a null set, which prompted the error. I came across this type of error while working on my MLP project. Hope my answer helps you.
Thanks for reaching out.

The solution is to first find a value of n that avoids the ArrayIndexOutOfBoundsException, and then use brute-force search to find the best value. Shaido suggests finding n:
For example, val layers = Array[Int](6, 5, 8, n). This assumes the length of the feature vectors is 6. – Shaido
So start with a large integer (n = 100), then search manually for a good value (n = 50, then n = 32 gives the error, n = 35 works perfectly).
Credit to Shaido.
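Instead of searching for n by trial and error, the size of the output layer can be derived from the labels, since the one-hot encoder uses each label as an array index. A sketch in plain Python, assuming the labels have already been collected into a list (the values below are invented, and output_layer_size is a hypothetical helper); in Spark the maximum would come from an aggregation over the label column.

```python
def output_layer_size(labels):
    """Smallest output-layer size for which every label is a valid index."""
    if any(label < 0 or label != int(label) for label in labels):
        raise ValueError("labels must be non-negative whole numbers")
    return int(max(labels)) + 1


labels = [0.0, 3.0, 11.0, 7.0]   # invented example labels
n = output_layer_size(labels)
layers = [6, 5, 8, n]            # 6 input features, as in the question
print(layers)                    # [6, 5, 8, 12]
```

If the check raises because the labels are not whole numbers, the label column does not describe classes, and no choice of n will make the encoding meaningful.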

Related

How to fit a MultiLayerPerceptron Classifier in Pyspark?

Hi, I am trying to fit a MultilayerPerceptronClassifier with the PySpark 2.4.3 machine learning library, but every time I try to fit the algorithm I get the following error:
Py4JJavaError: An error occurred while calling o4105.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 784.0 failed 4 times, most recent failure: Lost task 0.3 in stage 784.0 (TID 11663, hdpdncwy87013.dpp.acxiom.net, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$org$apache$spark$ml$feature$OneHotEncoderModel$$encoder$1: (double, int) => struct,values:array>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
# sqlContext and location come from the asker's environment
df = sqlContext.read.format("csv").options(header='true', sep=",", inferschema='true').load(location)
exclude = ["Target"]
inputs = [column for column in df.columns if (column not in exclude)]
vectorAssembler = VectorAssembler(inputCols=inputs, outputCol='Features')
vdf = vectorAssembler.transform(df)
vdf = vdf.select(['Features'] + exclude)
# Feature Scaling
scaler = MinMaxScaler(inputCol="Features", outputCol="scaledFeatures")
scalerModel = scaler.fit(vdf)
scaledData = scalerModel.transform(vdf)
# train-test split
splits = scaledData.randomSplit([0.7, 0.3], seed=2020)
train_df = splits[0]
test_df = splits[1]
layers = [len(inputs), 3, 3, 3, 5]
mlpc = MultilayerPerceptronClassifier(labelCol="Target", featuresCol="scaledFeatures", layers=layers,
                                      blockSize=128, stepSize=0.03, seed=2020, maxIter=1000)
model = mlpc.fit(train_df)
Do you have an idea? Thank you in advance. Number of inputs 1902, number of classes to predict 5.
It's an old question, but we have just encountered the exact same error. We didn't have any issue with binary classification, but this exception was thrown for multiclass classification problems, just like yours.
The problem with multiclass classification for us was that our labels were 1, 2, 3. It turns out MultilayerPerceptronClassifier expects the labels to start from 0. So when we subtracted 1 from our labels (making them 0, 1, 2), the model trained successfully without any exception. If you're getting this exception for a multiclass classification problem with labels that don't start at 0, this might be your problem.
Hope this saves someone's hours of debugging time.
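Subtracting 1 works when the labels are exactly 1, 2, 3. A more general remapping, sketched here in plain Python with invented labels (zero_based_labels is a hypothetical helper, not a Spark function), assigns consecutive indices starting at 0 to whatever distinct labels occur. Spark ML's StringIndexer performs a comparable remapping on a DataFrame column, although by default it orders labels by frequency rather than by value.

```python
def zero_based_labels(labels):
    """Map arbitrary class labels to consecutive indices 0..k-1."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


original = [1, 2, 3, 3, 1]           # invented labels starting at 1
remapped, mapping = zero_based_labels(original)
print(remapped)                      # [0, 1, 2, 2, 0]
print(mapping)                       # {1: 0, 2: 1, 3: 2}
```

Keeping the mapping around makes it possible to translate predictions back to the original labels afterwards.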

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations. I have noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in again, and do the operation in a completely separate script. I hope there is a better solution than this!

spark - extract elements from an RDD[Row] when reading Hive table in Spark

I want to read a Hive table in Spark using Scala, extract some or all of its fields, and then save the data to HDFS.
My code is as follows:
import scala.collection.mutable.ArrayBuffer
val data = spark.sql("select * from table1 limit 1000")
val new_rdd = data.rdd.map(row => {
  var arr = new ArrayBuffer[String]
  val len = row.size
  for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
  arr.toArray
})
new_rdd.take(10).foreach(println)
new_rdd.map(_.mkString("\t")).saveAsTextFile(dataOutputPath)
The above chunk is the one that finally worked.
I had written another version, where this line:
for(i <- 0 to len-1) arr.+=(row.getAs[String](i))
was replaced by this line:
for(i <- 0 to len-1) arr.+=(row.get(i).toString)
To me, both lines do exactly the same thing: for each row, I get the ith element as a string and put it into the ArrayBuffer, which is converted to an Array at the end.
However, the two versions give different results.
The first line works well. The data were saved correctly on HDFS.
With the second line, however, an error was thrown when saving the data:
ERROR ApplicationMaster: User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 56
in stage 3.0 failed 4 times, most recent failure: Lost task 56.3 in stage
3.0 (TID 98, ip-172-31-18-87.ec2.internal, executor 6):
java.lang.NullPointerException
Therefore, I wonder if there is some intrinsic differences in between
getAs[String](i)
and
get(i).toString
?
Many thanks
getAs[String](i) is the same as
get(i).asInstanceOf[String]
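Therefore it is just a type cast, and casting a null value succeeds and yields null. toString is not a cast: it is a method call, so get(i).toString throws a NullPointerException when the field is null.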

spark No space left on device when working on extremely large data

The following is my Scala Spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v=>(v.srcId, v.dstId)).toDF("key","value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
//calculate FM of the next step
breakable {
  for (i <- 1 to MaxIter) {
    var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    var adjacency = edges.join(
      encodedVertex,//FMvertex.toDF("keyR", "valueR"),
      $"value" === $"keyR"
    ).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a,b)=>HLLCounter.Union(a,b))
    FMvertex = FMvertex.union(adjacency).reduceByKey((a,b)=>HLLCounter.Union(a,b))
    // update vertex encode
    encodedVertex = FMvertex.toDF("keyR", "valueR")
    var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    lastRes = N_curr
    var middleAns = N_curr.union(N_pre).reduceByKey((a,b)=>Math.abs(a-b))//.mapValues(x => x._1 - x._2)
    if (middleAns.values.sum() == 0){
      println(i)
      break
    }
    Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2)//.map(identity)
  }
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
It uses several functions that I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;
public class HLLCounter implements Serializable {
    private static int seed = 1234567;
    private static HashFunction hs = Hashing.murmur3_128(seed);
    private static int log2m = 15;
    private static int regwidth = 5;
    public static byte[] encode(Long id) {
        HLL hll = new HLL(log2m, regwidth);
        Hasher myhash = hs.newHasher();
        hll.addRaw(myhash.putLong(id).hash().asLong());
        return hll.toBytes();
    }
    public static byte[] Union(byte[] byteA, byte[] byteB) {
        HLL hllA = HLL.fromBytes(byteA);
        HLL hllB = HLL.fromBytes(byteB);
        hllA.union(hllB);
        return hllA.toBytes();
    }
    public static long decode(byte[] bytes) {
        HLL hll = HLL.fromBytes(bytes);
        return hll.cardinality();
    }
}
This code calculates Effective Closeness on a large graph, using a HyperLogLog package.
The code works fine when I run it on a graph with about ten million vertices and a hundred million edges. However, when I ran it on a graph with thousands of millions of vertices and billions of edges, after several hours of running on the cluster it failed with:
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I have only been using Spark for a few days. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about the first two numbers to add. For example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value that is held for the number within the parenthesis. It is not possible to continue the computation across all three numbers without the intermediary number.
The RDD is a space efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five" while others represent a chain of operations, like "add five, then multiply by six, and subtract by seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
Occasionally, a computation cannot finish on a single worker without some of the intermediate values from other workers. Communication always happens between the head node, which distributes the data to the workers and collects and aggregates their results, and the workers themselves; but, depending on how the algorithm is structured, it can also occur between workers mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a manner that a single node cannot store the intermediate values it has to exchange with the other nodes: the stack trace shows the failure while writing to a file, so it is the node's local disk, not just its RAM, that has run out of space.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to:
Tune the partitions in the RDD to reduce the cross talk (partitions that are too small might require more cross talk in shuffling, as a fully connected inter-transfer grows at O(N^2); partitions that are too big might run out of RAM within a compute node).
Restructure the algorithm such that full shuffling is not required (sometimes you can reduce in stages, so that there are more reduction phases, each with less data to combine).
Restructure the algorithm such that shuffling is not required (it is possible, but unlikely, that the algorithm is simply mis-written and that factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not to the head node's console, but to a distributed file system that can accommodate the data (like HDFS).
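The idea of reducing in stages can be illustrated without Spark. The sketch below is plain Python with invented data and hypothetical helper names: each "partition" is first combined locally, so that only one partial result per key leaves the partition, and the partials are merged afterwards. It shows the general pattern of combining before exchanging data, not a fix for the program in the question.

```python
from collections import defaultdict


def combine_locally(partition):
    """Sum the values per key within a single partition."""
    partial = defaultdict(int)
    for key, value in partition:
        partial[key] += value
    return dict(partial)


def merge_partials(partials):
    """Merge the per-partition partial sums into the final result."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


partitions = [
    [("a", 1), ("a", 3), ("a", 2), ("b", 2)],
    [("b", 4), ("b", 1), ("a", 5)],
]
partials = [combine_locally(p) for p in partitions]

records_before = sum(len(p) for p in partitions)  # 7 records without combining
records_after = sum(len(p) for p in partials)     # 4 records after combining
result = merge_partials(partials)
print(records_before, records_after, result)
```

Fewer records cross partition boundaries in the second stage, which is what reduces the volume of intermediate data.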
Without the nuts-and-bolts of your Apache Spark program, and access to your data set, and access to your Spark cluster and its logs, it's hard to know which of these common approaches would benefit you the most, so I listed them all.
Good Luck!

Apache Spark reduceByKey to sum decimals

I'm trying to map an RDD as shown below (see the output for results) and reduce by key over the decimal values, but I keep getting an error. When I tried using reduceByKey() with word count it worked fine. Are decimal values summed differently?
val voltageRDD= myRDD.map(i=> i.split(";"))
  .filter(i=> i(0).split("/")(2)=="2008")
  .map(i=> (i(0).split("/")(2),i(2).toFloat)).take(5)
Output:
voltageRDD: Array[(String, Float)] = Array((2008,1.62), (2008,1.626), (2008,1.622), (2008,1.612), (2008,1.612))
When trying to reduce:
val voltageRDD= myRDD.map(i=> i.split(";"))
  .filter(i=> i(0).split("/")(2)=="2008")
  .map(i=> (i(0).split("/")(2),i(2).toFloat)).reduceByKey(_+_).take(5)
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2954.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2954.0 (TID 15696, 10.19.240.54): java.lang.NumberFormatException: For input string: "?"
If your data contains entries which are not parseable as a float, then you should either filter them out beforehand or handle them explicitly. Such handling could mean assigning a value of 0.0f whenever you see a non-parseable entry. The following code does exactly this.
import scala.util.Try
val voltageRDD= myRDD.map(i=> i.split(";"))
  .filter(i => i(0).split("/")(2)=="2008")
  .map(i => (i(0).split("/")(2), Try{ i(2).toFloat }.toOption.getOrElse(0.0f)))
  .reduceByKey(_ + _).take(5)
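The same defaulting idea, sketched in plain Python rather than Spark so it can be run anywhere. The input lines are invented (only the layout, a date in field 0 and the value in field 2, follows the question), and to_float and sum_by_year are hypothetical helpers.

```python
from collections import defaultdict


def to_float(text, default=0.0):
    """Parse text as a float, returning default when it is not a number."""
    try:
        return float(text)
    except ValueError:
        return default


def sum_by_year(lines):
    """Sum the third field per year, treating unparseable values as 0.0."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split(";")
        year = fields[0].split("/")[2]
        totals[year] += to_float(fields[2])
    return dict(totals)


lines = [
    "16/12/2008;17:24:00;1.5",
    "17/12/2008;17:25:00;?",      # unparseable entry, counted as 0.0
    "18/12/2008;17:26:00;2.25",
]
print(sum_by_year(lines))         # {'2008': 3.75}
```

Defaulting to 0.0 leaves a sum unchanged, but it would distort a count or an average, in which case filtering the bad entries out is the safer choice.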
Short version: you probably have a line for which i(2) equals ?.
As per my comment, your data most probably isn't consistent. That isn't a problem in the first snippet because of the take(5): there is no action that requires Spark to process the whole data set. Spark is lazy and therefore performs computations only until it gets 5 results from the map -> filter -> map chain.
The second snippet, on the other hand, has to process your whole data set to perform the reduceByKey, and only then takes 5 results. It can therefore hit problems that lie too far into your data set for the first snippet to reach.
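Python's lazy iterators behave in a loosely analogous way, which makes the effect easy to reproduce without Spark. In this sketch the data is invented and parse is a hypothetical helper; the bad record sits third, so asking for two results never touches it, while a full pass does.

```python
from itertools import islice


def parse(line):
    """Return (year, value) for one semicolon-separated record."""
    fields = line.split(";")
    return fields[0].split("/")[2], float(fields[2])


lines = [
    "16/12/2008;17:24:00;1.5",
    "16/12/2008;17:25:00;2.25",
    "17/12/2008;17:26:00;?",      # bad record, third in the data
]

# Lazy pipeline: map() parses a record only when a result is requested,
# so taking two results never reaches the bad record.
first_two = list(islice(map(parse, lines), 2))
print(first_two)                  # [('2008', 1.5), ('2008', 2.25)]

# A full aggregation has to consume every record and hits the bad one.
failed = False
try:
    total = sum(value for _, value in map(parse, lines))
except ValueError as err:
    failed = True
    print("full pass failed:", err)
```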