Error when using a user-defined case class in Scala

I need to find the distinct repos accessed by a particular user, using an inverted-index-based method in Scala. I wrote a case class to extract the data from the input file, built the RDD from it, and joined it against the user I am interested in. However, after collecting all the repos that user accessed, calling distinct and then count fails with the error pasted below.
The code is as follows:
import org.apache.spark.rdd.RDD
import java.text.SimpleDateFormat
import java.util.Date
case class LogLine(debug_level: String, timestamp: Date, download_id: String, retrieval_stage: String, rest: String)
val regex = """([^\s]+), ([^\s]+)\+00:00, ([^\s]+) -- ([^\s]+): (.*$)""".r
val rdd = sc.
  textFile("file.txt").
  flatMap ( x => x match {
    case regex(debug_level,dateTime,downloadId,retrievalStage,rest) =>
      val df = new SimpleDateFormat("yyyy-MM-dd:HH:mm:ss")
      Some(LogLine(debug_level, df.parse(dateTime.replace("T", ":")), downloadId, retrievalStage, rest))
    case _ => None
  })
rdd.cache()
val inverted2 = rdd.groupBy(element => element.download_id).cache
val list2 = List[String]("ghtorrent-22")
val user = sc.parallelize(list2).keyBy(x => x)
val poss = inverted2.join(user).flatMap{x => x._2._1}.map(line => line.rest)
After this, whenever I run the next step to evaluate the result, I get the error.
The code and error are as follows:
Code:
poss.distinct.count
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 8.0 failed 1 times, most recent failure: Lost task 5.0 in stage 8.0 (TID 59) (LAPTOP-Q81MN7NI executor driver): java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.count(RDD.scala:1253)
... 37 elided
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)

Related

Failed to execute user defined function in Spark-Scala

Below is the UDF to convert multivalued column into map.
def convertToMapFn (c: String): Map[String,String] = {
  val str = Option(c).getOrElse(return Map[String, String]())
  val arr = str.split(",")
  val l = arr.toList
  val regexPattern = ".*(=).*".r
  s"$c".toString match {
    case regexPattern(a) => l.map(x => x.split("=")).map(a => {if(a.size==2) (a(0).toString -> a(1).toString) else "ip_adr" -> a(0).toString} ).toMap
    case "null" => Map[String, String]()
  }
}
val convertToMapUDF = udf(convertToMapFn _)
I am able to display the data, but while trying to insert the data into Delta table, I am getting the below error.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 97.0 failed 4 times, most recent failure: Lost task 9.3 in stage 97.0 (TID 2561, 10.73.244.39, executor 5): org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2326/1884779796: (string) => map<string,string>)
Caused by: scala.MatchError: a8:9f:e (of class java.lang.String)
at line396de0100d5344c9994f63f7de7884fe49.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.convertToMapFn
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2326/1884779796: (string) => map<string,string
Could someone please let me know how to fix this? Thank you.
You can see in the error message that you have a MatchError. This happens when a match does not account for every possible input: here the value a8:9f:e contains no "=", so it does not match the regex, and it is not the string "null" either. A basic fix is to change case "null" => to case _ =>, which will match anything that the regex doesn't.
Other matters:
s"$c".toString is equivalent to writing c in this case.
I think you mean to match on str and not c.
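Putting those points together, here is a minimal sketch of how the function could look with an exhaustive match on the null-checked value. It is only one possible rewrite under stated assumptions: the wrapper object ConvertToMap is a name I made up, the "contains =" guard stands in for the original regex, and returning an empty map for unmatched input follows the case _ => suggestion above rather than anything confirmed by the asker.

```scala
object ConvertToMap {
  // Hedged rewrite of convertToMapFn: null-safe, and every input hits some case,
  // so no MatchError can be thrown.
  def convertToMapFn(c: String): Map[String, String] =
    Option(c) match {
      // null column value or the literal string "null": nothing to parse
      case None | Some("null") => Map.empty[String, String]
      // at least one key=value pair is present
      case Some(str) if str.contains("=") =>
        str.split(",").toList.map(_.split("=")).map {
          case Array(k, v) => k -> v
          // same fallback key as the original code for entries without a proper pair
          case other       => "ip_adr" -> other.headOption.getOrElse("")
        }.toMap
      // anything else (for example "a8:9f:e") falls through to an empty map
      case Some(_) => Map.empty[String, String]
    }
}
```

It can then be wrapped exactly as before with udf(ConvertToMap.convertToMapFn _).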

NullPointerException when using Word2VecModel with UserDefinedFunction

I am trying to pass a word2vec model object to my spark udf. Basically I have a test set with movie Ids and I want to pass the ids along with the model object to get an array of recommended movies for each row.
def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
  udf((col : String) => {
    model.findSynonymsArray("20", 1)
  })
However, this gives me a NullPointerException. When I run model.findSynonymsArray("20", 1) outside the udf I get the expected answer. For some reason the call fails inside the udf but runs fine outside it.
Note: I added "20" here just to get a fixed answer to see if that would work. It does the same when I replace "20" with col.
Thanks for the help!
StackTrace:
SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more
The SQL and udf API is a bit limited and I am not sure if there is a way to use custom types as columns or as inputs to udfs. A bit of googling didn't turn up anything too useful.
Instead, you can use the Dataset or RDD API and just use a regular Scala function instead of a udf, something like:
val model: Word2VecModel = ...
val inputs: Dataset[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))
Alternatively, I guess you could serialize the model to and from a string, but that seems much uglier.
I think this issue happens because wordVectors is a transient variable:
class Word2VecModel private[ml] (
    @Since("1.4.0") override val uid: String,
    @transient private val wordVectors: feature.Word2VecModel)
  extends Model[Word2VecModel] with Word2VecBase with MLWritable {
I have solved this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel model inside each partition.
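To illustrate the transient explanation without needing a cluster, here is a small sketch in plain Scala. ToyModel and roundTrip are hypothetical names of mine, not Spark classes; the round trip through Java serialization is only a rough stand-in for what happens when a closure holding the model is shipped to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientDemo {
  // Hypothetical stand-in for a model whose payload is marked @transient.
  class ToyModel(val uid: String, @transient val vectors: Map[String, Array[Double]])
      extends Serializable

  // Java-serialize a value and read it back.
  def roundTrip[T](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  val original = new ToyModel("w2v_1", Map("20" -> Array(0.1, 0.2)))
  // The copy keeps uid, but the @transient field is skipped and comes back as null.
  val copy = roundTrip(original)
}
```

Any method on the copy that dereferences vectors would throw a NullPointerException, which matches the symptom in the question. Broadcasting the vectors themselves, as described above, sidesteps the problem because the data then travels explicitly instead of through the transient field.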

How to use flatmap in a Play framework controller with SparkContext?

I have a web app using Play 2.6, Scala 2.11 and Spark 2.2.0.
I am getting the exception: org.apache.spark.SparkException: Task not serializable when I execute a flatmap transformation on some variable. I know that I have to implement Serializable in some class, but I don't know the best practice to do that.
The exception happens on the line var namesRdd = names.flatMap(parseNames). If I use MyController with Serializable I have another error: class invalid for deserialization. So I suppose that this is not the solution.
Does anyone know how to serialize a Controller to use Spark Context and flatmap?
class SparkMarvelController @Inject()(cc: ControllerComponents) extends AbstractController(cc) with I18nSupport {
  def mostPopularSuperHero() = Action { implicit request: Request[AnyContent] =>
    val sparkContext = SparkCommons.sparkSession.sparkContext // got sparkContext
    var names = sparkContext
      .textFile("resource/marvel/Marvel-names.txt") // build up a hero ID - name RDD
    var namesRdd = names.flatMap(parseNames)
    val mostPopularHero = sparkContext
      .textFile("resource/marvel/Marvel-graph.txt") // build up superhero co-appearance data
      .map(countCoOccurrences) // convert to (hero ID, number of connections) RDD
      .reduceByKey((x, y) => x + y) // combine entries that span more than one line
      .map(x => (x._2, x._1)) // flip it to (number of connections, hero ID)
      .max // find the max connections
    // Look up the name (lookup returns an array of results, so we need to access the first result with (0))
    val mostPopularHeroName = namesRdd.lookup(mostPopularHero._2)(0)
    Ok(s"The most popular superhero is [$mostPopularHeroName] with [${mostPopularHero._1}] co-appearances.")
  }
  // Function to extract the hero ID and number of connections from each line
  def countCoOccurrences(line: String) = {
    // split on any run of whitespace in the line
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }
  // function to extract hero ID -> hero name tuples (or None in case of failure)
  def parseNames(line: String): Option[(Int, String)] = {
    val fields = line.split('\"')
    if (fields.length > 1) Some((fields(0).trim.toInt, fields(1)))
    else None
  }
}
error:
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:255)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:180)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:311)
at play.core.server.AkkaHttpServer$$anonfun$3.applyOrElse(AkkaHttpServer.scala:309)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
Caused by: java.lang.ClassNotFoundException: controllers.SparkMarvelController$$anonfun$mostPopularSuperHero$1$$anonfun$2
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:429)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
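For readers hitting the "Task not serializable" part of this question, here is a hedged sketch in plain Scala (no Spark or Play) of why names.flatMap(parseNames) can drag the whole controller into the closure. FakeController, NameParser and isJavaSerializable are hypothetical names for illustration. This only addresses closure capture; whether it also resolves the ClassNotFoundException in the trace above is an assumption I have not verified.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Standalone parser. It is Serializable, so a function defined on it
  // can be shipped even if it ends up referencing the object.
  object NameParser extends Serializable {
    def parseNames(line: String): Option[(Int, String)] = {
      val fields = line.split('"')
      if (fields.length > 1) Some((fields(0).trim.toInt, fields(1))) else None
    }
    val asFunction: String => Option[(Int, String)] = line => parseNames(line)
  }

  // Hypothetical stand-in for a controller; deliberately not Serializable.
  class FakeController {
    def parseNames(line: String): Option[(Int, String)] = NameParser.parseNames(line)
    // Roughly what flatMap(parseNames) expands to: the function calls this.parseNames,
    // so it holds a reference to the controller instance.
    def viaInstanceMethod: String => Option[(Int, String)] = line => parseNames(line)
    // The function comes from the standalone object, so the controller is not involved.
    def viaObject: String => Option[(Int, String)] = NameParser.asFunction
  }

  // True if Java serialization accepts the value, false on NotSerializableException.
  def isJavaSerializable(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(value) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Under that reading, moving parseNames and countCoOccurrences out of the controller into a standalone object is the usual way to keep the controller out of the Spark closure, rather than making the controller itself Serializable.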

Spark UnaryTransformer implementation fails with scala.MatchError

I'm implementing a UnaryTransformer in Spark 1.6.2 with this interface:
class myUT(override val uid: String) extends UnaryTransformer[Seq[String], Seq[String], myUT] {
...
override protected def createTransformFunc: Seq[String] => Seq[String] = {
_ => _.map(x => x + "s")
}
This compiles fine, but at runtime it fails with this error:
17/07/21 22:29:33 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, myhost.com.au): scala.MatchError: ArrayBuffer(<contents of my array>) (of class scala.collection.mutable.ArrayBuffer)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
The next thing I tried was to replace
_ => _.map(x => x + "s")
with
_ => _
So, theoretically, it should mean no change to the data at all! But the error I got was:
17/07/21 22:11:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, myhost.com.au): scala.MatchError: WrappedArray(<contains of my array>) (of class scala.collection.mutable.WrappedArray$ofRef)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
So it looks like the type of the outgoing data gets changed anyway. How do I avoid this?
Update: Next, I tried adding .toArray to the map. The error is now this:
[error] /sparkprj/src/main/scala/sp_txt.scala:43: polymorphic expression cannot be instantiated to expected type;
[error] found : [B >: String]Array[B]
[error] required: Seq[String]
[error] ).toArray
It may add some detail, but doesn't add much to my understanding. After reviewing a few examples of mllib UnaryTransformer implementations, I tend to believe it's a bug in Catalyst.
The outputDataType line in the definition of class myUT was incorrect. It should have been:
override protected def outputDataType: DataType = new ArrayType(StringType, true)
As I copied this class definition from a String->String transformer, I had the output DataType defined as just StringType. My bad.

GraphX VertexRDD NullPointerException

I am trying to do some message passing on a graph to calculate recursive features.
I get an error when I define a graph whose vertices are the output of aggregateMessages. Code for context:
> val newGraph = Graph(newVertices, edges)
newGraph: org.apache.spark.graphx.Graph[List[Double],Int] = org.apache.spark.graphx.impl.GraphImpl@2091594b
//This is the RDD that causes the problem
> val result = newGraph.aggregateMessages[List[Double]](
{triplet => triplet.sendToDst(triplet.srcAttr)},
{(a,b) => a.zip(b).map { case (x, y) => x + y }},
{TripletFields.Src})
result: org.apache.spark.graphx.VertexRDD[List[Double]] = VertexRDDImpl[1990] at RDD at VertexRDD.scala:57
> result.take(1)
res121: Array[(org.apache.spark.graphx.VertexId, List[Double])] = Array((1944425548,List(0.0, 0.0, 137.0, 292793.0)))
So far no problem, but when I try
> val newGraph2 = Graph(result, edges)
newGraph2: org.apache.spark.graphx.Graph[List[Double],Int] = org.apache.spark.graphx.impl.GraphImpl@710919e1
> val result2 = newGraph2.aggregateMessages[List[Double]](
{triplet => triplet.sendToDst(triplet.srcAttr)},
{(a,b) => a.zip(b).map { case (x, y) => x + y }},
{TripletFields.Src})
> result2.count
I get the following (trimmed) error:
result2: org.apache.spark.graphx.VertexRDD[List[Double]] = VertexRDDImpl[2009] at RDD at VertexRDD.scala:57
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4839.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4839.0 (TID 735, 10.0.2.15): java.lang.NullPointerException
at $anonfun$2.apply(<console>:62)
at $anonfun$2.apply(<console>:62)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.send(EdgePartition.scala:536)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.sendToDst(EdgePartition.scala:531)
at $anonfun$1.apply(<console>:61)
at $anonfun$1.apply(<console>:61)
at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(EdgePartition.scala:409)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:237)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:207)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
...
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
...
Caused by: java.lang.NullPointerException
at $anonfun$2.apply(<console>:62)
at $anonfun$2.apply(<console>:62)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.send(EdgePartition.scala:536)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.sendToDst(EdgePartition.scala:531)
at $anonfun$1.apply(<console>:61)
at $anonfun$1.apply(<console>:61)
at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(EdgePartition.scala:409)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:237)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:207)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
I don't think this is a type mismatch error, because aggregateMessages returns a VertexRDD. Any ideas why I am getting this problem?
Not all the nodes in the graph are returned by aggregateMessages, only the ones that receive a message. When the second graph is built from that result, its edges still point at the missing nodes, and because the graph definition supplies no default node value, their attribute is null, which causes the NullPointerException.
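The effect can be reproduced with plain Scala collections. The sketch below is a simplified model of the message passing in the question, not GraphX itself: aggregateToDst and withDefaults are hypothetical helpers, and the three-vertex graph in the usage note is made up.

```scala
object MissingVertexDemo {
  type VertexId = Long

  // Plain-collections stand-in for aggregateMessages with sendToDst(srcAttr):
  // only vertices that receive at least one message appear in the result.
  def aggregateToDst(
      vertices: Map[VertexId, List[Double]],
      edges: Seq[(VertexId, VertexId)]
  ): Map[VertexId, List[Double]] =
    edges
      .flatMap { case (src, dst) => vertices.get(src).map(attr => dst -> attr) }
      .groupBy(_._1)
      .map { case (dst, msgs) =>
        // same merge function as in the question: element-wise sum
        dst -> msgs.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }

  // Give every vertex of the original graph a value, falling back to a default
  // for the ones that received no message.
  def withDefaults(
      allIds: Set[VertexId],
      aggregated: Map[VertexId, List[Double]],
      default: List[Double]
  ): Map[VertexId, List[Double]] =
    allIds.map(id => id -> aggregated.getOrElse(id, default)).toMap
}
```

With vertices 1, 2, 3 and edges (1,2), (2,3), (1,3), vertex 1 has no incoming edge and is absent from the aggregated result, yet edges still start from it. In GraphX itself, one way to get the same protection is to pass a default vertex attribute as the third argument of Graph(vertices, edges, defaultVertexAttr), for example a list of zeros of the right length.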