ArrayIndexOutOfBoundsException when accessing triplets of a Graph - scala

I'm playing a bit with GraphX and got stuck with an Exception I can't explain.
My code generates 10 random nodes on a graph (of type Point) and then connects some of them. The logic itself doesn't really matter (and actually doesn't have any meaning). I just wanted to build a somewhat connected graph and got this Exception. The code goes as follows:
import scala.util.Random
case class Point(x:Double, y:Double, z:Double)
val vertices = sc.parallelize(
(1 to 10).map(i => (i.toLong, Point(Random.nextInt(10), Random.nextInt(10), Random.nextInt(10))))
)
val tmpGroups = vertices.map(x => (Random.nextInt(5), x) )
val edges = tmpGroups.cartesian(tmpGroups)
.filter{case(x,y) => x._2._1 != y._2._1}
.filter{case(x,y) => Math.abs(x._1 - y._1) <= 1}
.map{case(x,y) => Edge(x._2._1, y._2._1, 1.)}
val graph = Graph(vertices, edges)
So far everything works, and when I collect vertices and edges they look fine:
graph.vertices.collect.foreach(println)
=> (4,Point(6.0,7.0,7.0))
(8,Point(6.0,0.0,5.0))
(1,Point(8.0,4.0,7.0))
...
graph.edges.collect.foreach(println)
=> Edge(1,2,1.0)
Edge(2,1,1.0)
Edge(1,3,1.0)
...
Their types are (as expected):
org.apache.spark.graphx.VertexRDD[Point]
org.apache.spark.graphx.EdgeRDD[Double]
But when I try to collect triplets I get the following error:
graph.triplets.collect
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 40431.0 failed 1 times, most recent failure: Lost task 2.0 in stage 40431.0 (TID 8029, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$afab7c86681139df3241c999f2dafc38$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:214)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$afab7c86681139df3241c999f2dafc38$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:219)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$afab7c86681139df3241c999f2dafc38$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:221)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$afab7c86681139df3241c999f2dafc38$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:223)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$afab7c86681139df3241c999f2dafc38$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:225)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3baf9f919752f0ab1f5a31ad94af9f4$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:227)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3baf9f919752f0ab1f5a31ad94af9f4$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:229)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3baf9f919752f0ab1f5a31ad94af9f4$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:231)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3baf9f919752f0ab1f5a31ad94af9f4$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:233)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3baf9f919752f0ab1f5a31ad94af9f4$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:235)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:237)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:239)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:241)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:243)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:245)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$b968e173293ba7cd5c79f2d1143fd$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:247)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$17f9c57b34a761248de8af38492ff086$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:249)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$17f9c57b34a761248de8af38492ff086$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:251)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$17f9c57b34a761248de8af38492ff086$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:253)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$17f9c57b34a761248de8af38492ff086$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:255)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$17f9c57b34a761248de8af38492ff086$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:257)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$bec1ee5c9e2e4d5af247761bdfbc3b3$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:259)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$bec1ee5c9e2e4d5af247761bdfbc3b3$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:261)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$bec1ee5c9e2e4d5af247761bdfbc3b3$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:263)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$bec1ee5c9e2e4d5af247761bdfbc3b3$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:265)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$bec1ee5c9e2e4d5af247761bdfbc3b3$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:267)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$5acc5a6ce0af8ab20753597dcc84fc0$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:269)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$5acc5a6ce0af8ab20753597dcc84fc0$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:271)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$5acc5a6ce0af8ab20753597dcc84fc0$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:273)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$5acc5a6ce0af8ab20753597dcc84fc0$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:275)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$5acc5a6ce0af8ab20753597dcc84fc0$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:277)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:279)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:281)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:283)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:285)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:287)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$33d793dde4292884a4720419646f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:289)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$725d9ae18728ec9520b65ad133e3b55$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:291)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$725d9ae18728ec9520b65ad133e3b55$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:293)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$725d9ae18728ec9520b65ad133e3b55$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:295)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$725d9ae18728ec9520b65ad133e3b55$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:297)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$725d9ae18728ec9520b65ad133e3b55$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:299)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3d99ae6e19b65c7f617b22f29b431fb$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:301)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3d99ae6e19b65c7f617b22f29b431fb$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:303)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3d99ae6e19b65c7f617b22f29b431fb$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:305)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3d99ae6e19b65c7f617b22f29b431fb$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:307)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$3d99ae6e19b65c7f617b22f29b431fb$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:309)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$ad149dbdbd963d0c9dc9b1d6f07f5e$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:311)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$ad149dbdbd963d0c9dc9b1d6f07f5e$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:313)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$ad149dbdbd963d0c9dc9b1d6f07f5e$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:315)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$ad149dbdbd963d0c9dc9b1d6f07f5e$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:317)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$ad149dbdbd963d0c9dc9b1d6f07f5e$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:319)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$6e49527b15a75f3b188beeb1837a4f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:321)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$6e49527b15a75f3b188beeb1837a4f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:323)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$6e49527b15a75f3b188beeb1837a4f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:325)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$6e49527b15a75f3b188beeb1837a4f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:327)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$6e49527b15a75f3b188beeb1837a4f1$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:329)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$93297bcd59dca476dd569cf51abed168$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:331)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$93297bcd59dca476dd569cf51abed168$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:333)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$93297bcd59dca476dd569cf51abed168$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:335)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$93297bcd59dca476dd569cf51abed168$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:337)
at $iwC$$iwC$$iwC$$iwC$$iwC$$$$93297bcd59dca476dd569cf51abed168$$$$$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:339)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:341)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:343)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:345)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:347)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:349)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:351)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:353)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:355)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:357)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:359)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:361)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:363)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:365)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:367)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:369)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:371)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:373)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:375)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:377)
at $iwC$$iwC$$iwC$$iwC.(:379)
at $iwC$$iwC$$iwC.(:381)
at $iwC$$iwC.(:383)
at $iwC.(:385)
at (:387)
at .(:391)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:810)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:753)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:746)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
I can only guess the $iwC stuff are the anonymous functions in my code (filter and map). I've tried to test them separately and collect with/without some of them and nothing led me to a solution.
What am I missing?
EDIT:
I got it to work, but now it's even weirder...
If I collect and re-parallelize the edges RDD, it seems to work fine:
val graph = Graph(vertices, sc.parallelize(edges.collect))
graph.triplets.collect.foreach(println)
=> ((2,Point(5.0,9.0,0.0)),(3,Point(3.0,7.0,8.0)),1.0)
((2,Point(5.0,9.0,0.0)),(4,Point(2.0,5.0,4.0)),1.0)
((2,Point(5.0,9.0,0.0)),(5,Point(0.0,3.0,3.0)),1.0)
...
Can someone please explain this? I don't like voodoos...
What happens on re-parallelization? I guess partitions may change? Does that have anything to do with the original problem?

It's Spark's laziness along with the usage of Random that bites you:
Since tmpGroups and edges aren't cached and are accessed twice, the map operation that is used to create them creates random values more than once, ending up with different values each time.
One way to solve this (more elegant usable than collecting and parallelizing again) is caching. By adding .cache() at the end of the line that creates tmpGroups, the result will be created exactly once:
val tmpGroups = vertices.map(x => (Random.nextInt(5), x) ).cache()

Related

java.lang.ArrayIndexOutOfBoundsException: 1 while saving data frame in spark Scala

In EMR, we are using Salesforce Bulk API call to fetch records from salesforce object. For one of the Object(TASK) data frame while saving to parquet getting below error.
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
at org.apache.spark.sql.Row$class.apply(Row.scala:163)
at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
val sfdcObjectSchema = StructType(
nonCompoundMetas.map(_.Name).map(
fieldName => StructField(fieldName, StringType, true)
)
)
val sfdcObjectDF = spark.read.format("com.springml.spark.salesforce").option("username", userName).
option("password", s"$sfdcPassword$sfdcToken").option("soql", retrievingSOQL).
option("version", JavaUtils.getConfigProps(runtimeEnvironment).getProperty("sfdc.api.version")).
option("sfObject", sfdcObject).option("bulk", "true").option("pkChunking", pkChunking).
option("chunkSize", checkingSize).
option("timeout", bulkTimeoutMillis.toString).option("maxCharsPerColumn", "-1").option("maxColumns", nonCompoundMetas.size.toString).
schema(sfdcObjectSchema).load()
sfdcObjectDF.na.drop("all").write.mode(SaveMode.Overwrite).parquet(s"${JavaUtils.getConfigProps(runtimeEnvironment).getProperty("etl.dataset.root")}/$accountName/$sfdcObject")
Please help us how to debug this issue further.
This issue is caused by that your Salesforce "SOQL" returns an empty resultset, which trigger this runtime error.
The root cause I believe is that when https://github.com/springml/spark-salesforce designs the Spark DataSource API, it doesn't handle for the empty DataFrame case, so this bug exists. Maybe you want to create an issue in the git to raise this up.
For a temporary solution, you "could" use a "select count(id) ...." SOQL, make sure the result is "> 0", before you generate the Dataframe and use it in the Spark.

Failed to execute user defined function($anonfun$9: (string) => double) on using String Indexer for multiple columns

I am trying to apply string indexer on multiple columns. Here is my code
val stringIndexers = Categorical_Model.map { colName =>new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")}
var dfStringIndexed = stringIndexers(0).fit(df3).transform(df3) // 'fit's a model then 'transform's data
for(x<-1 to stringIndexers.length-1)
{dfStringIndexed = stringIndexers(x).fit(dfStringIndexed).transform(dfStringIndexed)
}
dfStringIndexed = dfStringIndexed.drop(Categorical_Model: _*)
The Schema shows up with all columns having nullable as false
The stringIndexers array shows up like this
stringIndexers: Array[org.apache.spark.ml.feature.StringIndexer] = Array(strIdx_c53c3bdf464c, strIdx_61e685c520f7, strIdx_d6e59b2fc69d, ......)
dfStringIndexed.show(10)
This throws the following error
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
Why is it that print schema is showing up but no data is available .
Update: If I loop the string Indexers manually like so instead of the loop. This code works. Which is wierd.
var dfStringIndexed = stringIndexers(0).fit(df3).transform(df3) // 'fit's a model then 'transform's data
dfStringIndexed = stringIndexers(1).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(2).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(3).fit(dfStringIndexed).transform(dfStringIndexed)
dfStringIndexed = stringIndexers(4).fit(dfStringIndexed).transform(dfStringIndexed)
Adding Stacktrace on request
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
... 63 elided
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
Caused by: org.apache.spark.SparkException: StringIndexer encountered NULL value. To handle or skip NULLS, try setting StringIndexer.handleInvalid.
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:251)
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:246)
... 19 more
I have also been getting a similar issue, even on a tiny subset of 50 rows, none of which have nulls in the column I am string indexing. But it didn't work even when I ran it manually.
I can avoid the error by including .setHandleInvalid("keep"), and I've checked the outputs and it's not doing anything strange like setting everything to be 0 or the same value or anything. I'm still unhappy about that being the resolution as it seems quite unsafe. Would be very interested to know if you found a more reasonable answer and resolution!
dfStringIndexed = stringIndexers(1).setHandleInvalid("keep").fit(dfStringIndexed).transform(dfStringIndexed)
I think it might also be fixed by changing the nullability of your column, even if it doesn't contain nulls in it, which I did as per here
Can I change the nullability of a column in my Spark dataframe?

SparkException: Job aborted due to stage failure: NullPointerException when working with Spark-Graphx

I'm new in scala and I'm looking for solving this error.
The scenario I'm working on is this. I've 3 tables:
user: containing ID and name
business: containing ID and name
reviews: containing user.ID and business.ID
Only users make a review and only business receive a review. The graph will be something like this:
What I'm looking for is:
For each user I want to know the other users that made a review to the same business
I did this actions to create the graph:
val users = sqlContext.sql("Select user_id as ID from user")
val business= sqlContext.sql("Select business_id as ID from business")
users.write.mode(SaveMode.Append).saveAsTable("user_busin_db")
business.write.mode(SaveMode.Append).saveAsTable("user_busin_db")
val user_bus = sqlContext.sql("Select ID from user_busin_db")
val reviews = sqlContext.sql("Select user_id, business_id from review")
The table user_bus will be used for vertexs creation.
After that I created the graph with GraphX with this code:
def str2Long(s: String) = s.##.toLong
val vertex: RDD[(VertexId, String)] = user_bus.rdd.map(x => (str2Long(x(0).asInstanceOf[String]),(x(0).asInstanceOf[String])))
val edge:RDD[Edge[String]] = reviews.rdd.map(row => Edge(str2Long(row(0).asInstanceOf[String]), str2Long(row(1).asInstanceOf[String]), "review"))
val default = "missing"
val myGraph = Graph(vertex, edge, default)
myGraph.cache()
Now to answer my question I tried to do a aggregateMessages for eaither users and business with this code:
val userAggregate: VertexRDD[(List[Long])] = myGraph.aggregateMessages[(List[Long])](triplet => {
triplet.sendToSrc((List(triplet.dstId)))
},
(a,b) => (a.union(b))
)
val businessAggregate: VertexRDD[(List[Long])] = myGraph.aggregateMessages[(List[Long])](triplet => {
triplet.sendToDst((List(triplet.srcId)))
},
(a,b) => (a.union(b))
)
And then the code that gives me the error. To collect for each user what are the other users that made a reviews at same business I wrote this:
userAggregate.map(userAggr =>
(userAggr._1, userAggr._2.flatMap(userAggrListElem =>
userAggr._2.patch(0,businessAggregate.filter(busAggr => busAggr._1 == userAggrListElem).map(row => row._2).take(1)(0),userAggr._2.size+1))))
If I try to use .collect or .count on it i got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 138.0 failed 1 times, most recent failure: Lost task 1.0 in stage 138.0 (TID 2807, localhost): java.lang.NullPointerException
at org.apache.spark.graphx.impl.VertexRDDImpl.mapVertexPartitions(VertexRDDImpl.scala:94)
at org.apache.spark.graphx.VertexRDD.filter(VertexRDD.scala:98)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5$$anonfun$apply$1.apply(<console>:102)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5$$anonfun$apply$1.apply(<console>:101)
at scala.collection.immutable.List.flatMap(List.scala:327)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5.apply(<console>:101)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5.apply(<console>:100)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1769)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.rdd.RDD.count(RDD.scala:1134)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:105)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:115)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:117)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:119)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:121)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:123)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:125)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:127)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:129)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:131)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:133)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:135)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:137)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:139)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:141)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:143)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:145)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:147)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw.<init>(<console>:149)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw.<init>(<console>:151)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw.<init>(<console>:153)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw.<init>(<console>:155)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$eval$.$print$lzycompute(<console>:7)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$eval$.$print(<console>:6)
Caused by: java.lang.NullPointerException
at org.apache.spark.graphx.impl.VertexRDDImpl.mapVertexPartitions(VertexRDDImpl.scala:94)
at org.apache.spark.graphx.VertexRDD.filter(VertexRDD.scala:98)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5$$anonfun$apply$1.apply(<console>:102)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5$$anonfun$apply$1.apply(<console>:101)
at scala.collection.immutable.List.flatMap(List.scala:327)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5.apply(<console>:101)
at linea6ec9c0b0ced4184a0288c57eb3bdda585.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$5.apply(<console>:100)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1769)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:314)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
The algorithm works well if I use a subset of userAggregate, indeed if I use take(1) I get this result:
Array[(org.apache.spark.graphx.VertexId, List[Long])] = Array((-1324024017,List(-1851582020, -1799460264, -1614007919, -1573604682, ...)))
Which is: (user_ID, List(user_id that made a review to the same business,...)
Now I think there is a problem with the Vertexs, there is somewhere an unconnected vertex that gives me NullPointer error, but I'm not able to find it and delete from my grapf. What can I do for solving this problem?
TL;DR It is not a valid Spark code.
This is an expected outcome. It is not allowed to nest transformations in Apache Spark, hence you cannot access businessAggregate inside the closure of userAggregate.map.

Spark-Shell error: "spark.dynamicAllocation.{min/max}Executors must be set

I am trying to start spark-shell after setting up Spark 1.2.1 on cloudera quick start VM. I am getting the below error.Looking for help in resolving this issue. Appreciate any quick help on this to resolve the issue. The log of the error is mentioned below:
16/03/03 09:40:37 INFO EventLoggingListener: Logging events to hdfs://quickstart.cloudera:8020/user/spark/applicationHistory/local-1457026830824
org.apache.spark.SparkException: spark.dynamicAllocation.{min/max}Executors must be set!
at org.apache.spark.ExecutorAllocationManager.validateSettings(ExecutorAllocationManager.scala:135)
at org.apache.spark.ExecutorAllocationManager.<init>(ExecutorAllocationManager.scala:98)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:377)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
scala>
The exception is pretty clear. It seems that you've set the spark.dynamicAllocation.enabled property to true, but failed to set spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. The spark 1.2.1 documentation clearly states this (from spark.dynamicAllocation.enabled description, emphasis mine):
This requires the following configurations to be set:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.shuffle.service.enabled
If you look at the 1.2 branch of Spark, you'll see that if you don't specify those values, the default defers to -1:
// Lower and upper bounds on the number of executors. These are required.
private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", -1)
private val maxNumExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", -1)
This behavior has changed. If you look at the updated 1.6 branch of Spark, you'll see that they defer to 0 and Integer.MAX_VALUE, respectively:
// Lower and upper bounds on the number of executors.
private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
private val maxNumExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors",
Integer.MAX_VALUE)
This simply means, you need to add these either to the SparkConf settings, or to any other configuration file you're providing to the spark-shell:
val sparkConf = new SparkConf()
.set("spark.dynamicAllocation.minExecutors", minExecutors)
.set("spark.dynamicAllocation.maxExecutors", maxExecutors)

dataframe filter gives NullPointerException

In Spark 1.6.0 I have a data frame with a column that holds a job description, like:
Description
bartender
bartender
employee
taxi-driver
...
I retrieve a list of unique values from that column with:
val jobs = people.select("Description").distinct().rdd.map(r => r(0).asInstanceOf[String]).repartition(4)
I then try, for each job description, to retrieve people with that job and do something, but I get a NullPointerException:
jobs.foreach {
ajob =>
var peoplewithjob = people.filter($"Description" === ajob)
// ... do stuff
}
I don't understand why this happens, because every job has been extracted from the people data frame, so there should be at least one with that job... any hint more that welcome! Here's the stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4.0 (TID 206, localhost): java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at jago.Run$$anonfun$main$1.apply(Run.scala:89)
at jago.Run$$anonfun$main$1.apply(Run.scala:82)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It happens because Spark doesn't support nested actions or transformations. If you want to operate on distinct values extracted from the DataFrame you have to fetch the results to the driver and iterate locally:
// or toLocalIterator
jobs.collect.foreach {
ajob =>
var peoplewithjob = people.filter($"Description" === ajob)
}
Depending on what kind of transformations you apply as "do stuff" it can be a better idea to simply grouBy and aggregate:
people.groupBy($"Description").agg(...)