GC overhead limit exceeded in Sparkjob with join - scala

I am writing a Spark job to get the latest student records, filtered by the student date. When I try this with a hundred thousand records it works fine, but when I run it with a larger number of records my Spark job returns the error below.
I guess this error happens because I load all the data from the table and put it in an RDD; my table contains about 4.2 million records. If so, is there a better way to load the data efficiently and continue my operation successfully?
Please can anybody help me resolve this?
WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.10.10.10): java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2157)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1964)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3316)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:463)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3040)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2681)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:408)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/03/09 10:54:09 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, 10.10.10.10, partition 0, PROCESS_LOCAL, 5288 bytes)
17/03/09 10:54:09 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 2 on executor id: 1 hostname: 10.10.10.10.
17/03/09 10:54:09 WARN TransportChannelHandler: Exception in connection from /10.10.10.10:48464
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
17/03/09 10:54:09 ERROR TaskSchedulerImpl: Lost executor 1 on 10.10.10.10: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/03/09 10:54:09 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170309105209-0032/1 is now EXITED (Command exited with code 52)
Code
object StudentDataPerformanceEnhancerImpl extends studentDataPerformanceEnhancer {
val LOG = LoggerFactory.getLogger(this.getClass.getName)
val USER_PRIMARY_KEY = "user_id";
val COURSE_PRIMARY_KEY = "course_id";
override def extractData(sparkContext: SparkContext, sparkSession: SparkSession, jobConfiguration: JobConfiguration): Unit = {
val context = sparkSession.read.format("jdbc")
.option("driver", "com.mysql.jdbc.Driver")
.option("url", jobConfiguration.jdbcURL)
.option("dbtable", "student_student")
.option("user", test_user)
.option("password", test_password)
.load()
context.cache()
val mainRDD = context.rdd.map(k => ((k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
k.getTimestamp(k.fieldIndex("student_date_time"))),
(k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
k.getTimestamp(k.fieldIndex("student_date_time")),
k.getString(k.fieldIndex("student_student_index")),
k.getLong(k.fieldIndex("student_category_pk")),
k.getString(k.fieldIndex("effeciency")),
k.getString(k.fieldIndex("net_score")),
k.getString(k.fieldIndex("avg_score")),
k.getString(k.fieldIndex("test_score"))))).persist(StorageLevel.DISK_ONLY)
LOG.info("Data extractions started....!")
try {
val studentCompositeRDD = context.rdd.map(r => ((r.getLong(r.fieldIndex(USER_PRIMARY_KEY)),
r.getLong(r.fieldIndex(COURSE_PRIMARY_KEY))),
r.getTimestamp(r.fieldIndex("student_date_time")))).reduceByKey((t1, t2) => if (t1.after(t2)) t1 else t2)
.map(t => ((t._1._1, t._1._2, t._2), t._2)).persist(StorageLevel.DISK_ONLY)
val filteredRDD = mainRDD.join(studentCompositeRDD).map(k => k._2._1)
DataWriter.persistLatestData(filteredRDD)
} catch {
case e: Exception => LOG.error("Error in spark job: " + e.getMessage)
}
}
}
My DataWriter class related to database persistence is below
object DataWriter {
def persistLatestStudentRiskData(rDD: RDD[(Long, Long, Timestamp, String, Long, String, String, String, String)]): Unit = {
var jdbcConnection: java.sql.Connection = null
try {
jdbcConnection = DatabaseUtil.getConnection
if (jdbcConnection != null) {
val statement = "{call insert_latest_student_risk (?,?,?,?,?,?,?,?,?)}"
val callableStatement = jdbcConnection.prepareCall(statement)
rDD.collect().foreach(x => sendLatestStudentRiskData(callableStatement, x))
}
} catch {
case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
case e: RuntimeException => LOG.error("Error in the latest student persistence: " + e.getMessage)
case e: Exception => LOG.error("Error in the latest student persistence: " + e.getMessage)
} finally {
if (jdbcConnection != null) {
try {
jdbcConnection.close()
} catch {
case e: SQLException => LOG.error("Error in jdbc connection close : " + e.getMessage)
case e: Exception => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
}
}
}
}
def sendLatestStudentRiskData(callableStatement: java.sql.CallableStatement,
latestStudentData: (Long, Long, Timestamp, String, Long,
String, String, String, String)): Unit = {
try {
callableStatement.setLong(1, latestStudentData._1)
callableStatement.setLong(2, latestStudentData._2)
callableStatement.setTimestamp(3, latestStudentData._3)
callableStatement.setString(4, latestStudentData._4)
callableStatement.setLong(5, latestStudentData._5)
callableStatement.setString(6, latestStudentData._6)
callableStatement.setString(7, latestStudentData._7)
callableStatement.setString(8, latestStudentData._8)
callableStatement.setString(9, latestStudentData._9)
callableStatement.executeUpdate
} catch {
case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
}
}
}

The problem isn't that you are putting data into the RDD; it is that you are taking the data out of the RDD and into the driver's memory. Specifically, the problem is the collect call you use to persist the data. You should remove it. collect brings the entire RDD into memory on the driver, so you are no longer using Spark and your cluster to process the data, and you quickly run out of memory unless the data is very small. collect should rarely be used in Spark jobs; it is mostly useful for development and debugging with small amounts of data. It has some uses in production applications for supporting operations, but not as the main data flow.
Spark can write to JDBC directly if you use Spark SQL; leverage that and remove the calls to collect.
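As a rough sketch of that suggestion, assuming the job can stay in the DataFrame API and that the result may be appended to a plain table: the newest rows per (user_id, course_id) can be picked with a window function and written through Spark's JDBC data source, with no collect. The names LatestStudentWriter, latestPerStudent, writeLatest and targetTable are hypothetical, and the original code calls a stored procedure rather than writing to a table, so treat this as an illustration and not a drop-in replacement.

```scala
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Hypothetical helper object; the column names are taken from the question.
object LatestStudentWriter {

  // Keep the rows carrying the newest student_date_time per (user_id, course_id).
  // rank() keeps ties, which mirrors the join on (user, course, timestamp) in the question.
  def latestPerStudent(students: DataFrame): DataFrame = {
    val newestFirst = Window
      .partitionBy(col("user_id"), col("course_id"))
      .orderBy(col("student_date_time").desc)

    students
      .withColumn("rnk", rank().over(newestFirst))
      .filter(col("rnk") === 1)
      .drop("rnk")
  }

  // Write from the executors through the JDBC data source instead of
  // collecting to the driver. targetTable is an assumed table name.
  def writeLatest(latest: DataFrame, jdbcUrl: String, targetTable: String, props: Properties): Unit =
    latest.write.mode(SaveMode.Append).jdbc(jdbcUrl, targetTable, props)
}
```

With the DataFrame loaded in the question (called context there), this would be used as LatestStudentWriter.writeLatest(LatestStudentWriter.latestPerStudent(context), jobConfiguration.jdbcURL, "some_target_table", props), where the table name and props are placeholders.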

Related

Error comes when using User Defined Case Class in Scala

I need to find the distinct repos accessed by a particular user, using an inverted-index-based method in Scala. I have written the case class needed to extract the data from the input file, built the RDD, and joined it on the required user. But after finding all the repos accessed, when I call distinct and then count, I get an error, which I paste below.
The code is as follows:
import org.apache.spark.rdd.RDD
import java.text.SimpleDateFormat
import java.util.Date
case class LogLine(debug_level: String, timestamp: Date, download_id: String,retrieval_stage: String, rest: String);
val regex = """([^\s]+), ([^\s]+)\+00:00, ([^\s]+) -- ([^\s]+): (.*$)""".r
val rdd = sc.
textFile("file.txt").
flatMap ( x => x match {
case regex(debug_level,dateTime,downloadId,retrievalStage,rest) =>
val df = new SimpleDateFormat("yyyy-MM-dd:HH:mm:ss")
Some(LogLine(debug_level, df.parse(dateTime.replace("T", ":")), downloadId, retrievalStage, rest))
case _ => None
})
rdd.cache()
val inverted2 = rdd.groupBy(element => element.download_id).cache
val list2 = List[String]("ghtorrent-22")
val user = sc.parallelize(list2).keyBy(x => x)
val poss = inverted2.join(user).flatMap{x => x._2._1}.map(line => line.rest)
After this, whenever I run the next part to evaluate the result, I get the error.
The code and error are as follows:
Code:
poss.distinct.count
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 8.0 failed 1 times, most recent failure: Lost task 5.0 in stage 8.0 (TID 59) (LAPTOP-Q81MN7NI executor driver): java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.count(RDD.scala:1253)
... 37 elided
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)

Spark Launcher: Can't see the complete stack trace for failed SQL query

I'm using SparkLauncher to connect to Spark in cluster mode on top of Yarn. I'm running some SQL code using Scala like this:
def execute(code: String): Unit = {
try {
val resultDataframe = spark.sql(code)
resultDataframe.write.json("s3://some/prefix")
} catch {
case NonFatal(f) =>
log.warn(s"Fail to execute query $code", f)
log.info(f.getMessage, getNestedStackTrace(f, Seq[String]()))
}
}
def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
if (e.getCause == null) return msg
getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}
Now when I run a query that should fail with the execute() method, for example, querying a partitioned table without a partition predicate - select * from partitioned_table_on_dt limit 1;, I get an incorrect stack trace back.
Correct stack trace when I run spark.sql(code).write.json() manually from spark-shell:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(1) LocalLimit 1
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
...
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: No partition predicate found for partitioned table
partitioned_table_on_dt.
If the table is cached in memory then turn off this check by setting
hive.mapred.mode to nonstrict
at org.apache.spark.sql.hive.execution.HiveTableScanExec.prunePartitions(HiveTableScanExec.scala:155)
...
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
...
Incorrect stack trace from the execute() method above:
Job Aborted:
"org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)",
...
"org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)",
"org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)",
...
The spark-shell stack trace has three nested exceptions, SparkException(SemanticException (TreeNodeException)), but the traceback I see with my code comes only from the SparkException and the TreeNodeException. The most valuable one, the SemanticException traceback, is missing even after fetching the nested stack traces in the getNestedStackTrace() method.
Can any Spark/Scala experts tell me what I am doing wrong, or how I can fetch the complete stack trace with all the exceptions?
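The recursive method getNestedStackTrace() had a bug: the base case tested e.getCause instead of e, so the recursion stopped at the innermost exception without adding its stack trace.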
def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
if (e == null) return msg // this should be e not e.getCause
getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}
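A self-contained sketch of the same idea, runnable without Spark, may make the difference easier to see. The names CauseChain and nestedStackTrace are hypothetical, and the header line with the exception class and message is an addition that the original method does not produce.

```scala
import scala.annotation.tailrec

// Hypothetical helper, not part of the original code.
object CauseChain {

  // Walk e, e.getCause, e.getCause.getCause, ... until the chain ends.
  // For every exception in the chain, append one header line
  // ("<class name>: <message>") followed by its stack frames.
  @tailrec
  def nestedStackTrace(e: Throwable, acc: Vector[String] = Vector.empty): Vector[String] =
    if (e == null) acc // stop only after the innermost exception has been added
    else {
      val header = s"${e.getClass.getName}: ${e.getMessage}"
      nestedStackTrace(e.getCause, (acc :+ header) ++ e.getStackTrace.map(_.toString))
    }
}
```

Calling CauseChain.nestedStackTrace(f) inside the catch block would then include the innermost cause (the SemanticException in the question) as well as the outer exceptions.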

Spark java.lang.NullPointerException when using tuples

I am using the GraphX API for Spark to build a graph and process it with the Pregel API. The error does not happen if I return the argument tuple from the vprog function, but if I return a new tuple built from the same tuple's members, I get a NullPointerException.
Here is the relevant code:
val verticesRDD = cleanDtaDF.select("ChildHash", "DN").rdd.map(row => (row(0).toString.toLong, (row(1).toString.toDouble,row(0).toString.toLong)))
val edgesRDD = (rawDtaDF.select("ChildHash", "ParentHash", "dealer_code", "dealer_customer_number", "parent_dealer_cust_number").rdd
.map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, (row(3) + " is a child of " + row(4), " when dealer is " + row.get(2)))))
val myGraph = Graph(verticesRDD, edgesRDD)
def vprog(vertexId: VertexId, vertexDTA:(Double, Long), msg: Double): (Double, Long) = {
(vertexDTA._1, vertexDTA._2)
}
val result = myGraph.pregel(0.0, 1, activeDirection = EdgeDirection.Out)(vprog,t => Iterator((t.dstId, t.srcAttr._2)),(x, y) => x + y)
The error does not happen if I make a simple change to vprog(...) so that it does not access the tuple's members:
def vprog(vertexId: VertexId, vertexDTA:(Double, Long), msg: Double): (Double, Long) = {
vertexDTA
}
The error is
[Stage 101:> (0 + 0) / 200][Stage 102:> (0 + 4) / 200]18/03/10 20:43:16 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 102.0 (TID 5959, ue1lslaved25.na.aws.cat.com, executor 146): java.lang.NullPointerException
at $line69.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.vprog(<console>:60)
at $line70.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:75)
at $line70.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:75)
at org.apache.spark.graphx.Pregel$$anonfun$1.apply(Pregel.scala:125)
at org.apache.spark.graphx.Pregel$$anonfun$1.apply(Pregel.scala:125)
at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:988)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:979)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:979)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:697)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This issue has a simple explanation. It's not related to Spark or GraphX.
Take the function (with the irrelevant parameters of the original stripped out):
def vprog(vertexDTA:(Double, Long)): (Double, Long) = {
(vertexDTA._1, vertexDTA._2)
}
If the argument vertexDTA is null, accessing vertexDTA._1 or vertexDTA._2 throws a NullPointerException.
If we change the function to
def vprog(vertexDTA:(Double, Long)): (Double, Long) = {
vertexDTA
}
then when the argument is null the function simply returns it; there is no access to the tuple's members, so no NPE.
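The behaviour can be reproduced in plain Scala, without Spark. In the sketch below the names NullTupleDemo, rebuild, passThrough and rebuildOrDefault are hypothetical, and the Option guard is just one possible way to handle a missing vertex attribute; it is not something the original code does.

```scala
// Hypothetical demo object, independent of Spark and GraphX.
object NullTupleDemo {

  // Rebuilds the tuple from its fields: this dereferences the argument,
  // so a null argument throws a NullPointerException.
  def rebuild(vertexDTA: (Double, Long)): (Double, Long) =
    (vertexDTA._1, vertexDTA._2)

  // Returns the reference untouched: a null argument passes straight through.
  def passThrough(vertexDTA: (Double, Long)): (Double, Long) =
    vertexDTA

  // One possible guard: fall back to a caller-supplied default when the attribute is null.
  def rebuildOrDefault(vertexDTA: (Double, Long), default: (Double, Long)): (Double, Long) =
    Option(vertexDTA).map(t => (t._1, t._2)).getOrElse(default)
}
```

This only hides the symptom. Finding out why some vertices carry a null attribute in the first place is still worthwhile.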

Reading and writing to Cassandra from Spark worker throws error

I am using the Datastax Cassandra java driver to write to Cassandra from spark workers. Code snippet
rdd.foreachPartition(record => {
val cluster = SimpleApp.connect_cluster(Spark.cassandraip)
val session = cluster.connect()
record.foreach { case (bin_key: (Int, Int), kpi_map_seq: Iterable[Map[String, String]]) => {
kpi_map_seq.foreach { kpi_map: Map[String, String] => {
update_tables(session, bin_key, kpi_map)
}
}
}
} //record.foreach
session.close()
cluster.close()
})
While reading I am using the spark cassandra connector (which uses the same driver internally I assume)
val bin_table = javaFunctions(Spark.sc).cassandraTable("keyspace", "bin_1")
.select("bin").where("cell = ?", cellname) // assuming this will run on worker nodes
println(s"get_bins_for_cell Count of Bins for Cell $cellname is ${bin_table.count()}")
return bin_table
Doing either of these on its own does not cause any problem. Doing both together throws this stack trace.
My main goal is to avoid doing the write or read directly from the Spark driver program. Still, it seems to have something to do with the context; are two contexts being used?
16/07/06 06:21:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 22, euca-10-254-179-202.eucalyptus.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The Spark context was getting closed after using the Cassandra session as shown below.
Example
def update_table_using_cassandra_driver() ={
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
val statement_4: Statement = QueryBuilder.insertInto("keyspace", "table")
.value("bin", my_tuple_value)
.value("cell", my_val("CName"))
session.executeAsync(statement_4)
...
}
}
So the next time I called this in the loop I got the exception. It looks like a bug in the Cassandra driver; I will have to check this. For the time being, I did the following to work around it:
for(a <- 1 to 1000) {
val sc = new SparkContext(SparkWriter.conf)
update_table_using_cassandra_driver()
sc.stop()
...sleep(xxx)
}

Can't access broadcast variable in transformation

I'm having problems accessing a variable from inside a transformation function. Could someone help me out?
Here are my relevant classes and functions.
@SerialVersionUID(889949215L)
object MyCache extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
@volatile var cache: Broadcast[Map[UUID, Definition]] = null
def getInstance(sparkContext: SparkContext) : Broadcast[Map[UUID, Definition]] = {
if (cache == null) {
synchronized {
val map = sparkContext.cassandraTable("keyspace", "table")
.collect()
.map(m => m.getUUID("id") ->
Definition(m.getString("c1"), m.getString("c2"), m.getString("c3"),
m.getString("c4"))).toMap
cache = sparkContext.broadcast(map)
}
}
cache
}
}
In a different file:
object Processor extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
def processData[T: ClassTag](rawStream: DStream[(String, String)], ssc: StreamingContext,
processor: (String, Broadcast[Map[UUID, Definition]]) => T): DStream[T] = {
MyCache.getInstance(ssc.sparkContext)
var newCacheValues = Map[UUID, Definition]()
rawStream.cache()
rawStream
.transform(rdd => {
val array = rdd.collect()
array.foreach(r => {
val value = getNewCacheValue(r._2, rdd.context)
if (value.isDefined) {
newCacheValues = newCacheValues + value.get
}
})
rdd
})
if (newCacheValues.nonEmpty) {
logger.info(s"Rebroadcasting. There are ${newCacheValues.size} new values")
logger.info("Destroying old cache")
MyCache.cache.destroy()
// this is probably wrong here, destroying object, but then referencing it. But I haven't gotten to this part yet.
MyCache.cache = ssc.sparkContext.broadcast(MyCache.cache.value ++ newCacheValues)
}
rawStream
.map(r => {
println("######################")
println(MyCache.cache.value)
r
})
.map(r => processor(r._2, MyCache.cache.value))
.filter(r => null != r)
}
}
Every time I run this I get SparkException: Failed to get broadcast_1_piece0 of broadcast_1 when trying to access cache.value.
When I add a println(MyCache.cache.values) right after the .getInstance call, I'm able to access the broadcast variable, but when I deploy it to a Mesos cluster I'm again unable to access the broadcast values, this time with a null pointer exception.
Update:
The error I'm seeing is on println(MyCache.cache.value). I shouldn't have added the if statement containing the destroy, because my tests never hit it.
The basics of my application are as follows. I have a table in Cassandra that won't be updated very much, but I need to validate some streaming data against it. So I want to pull all the data from this rarely updated table into memory. getInstance pulls the whole table in on startup, and then I check all my streaming data to see whether I need to pull from Cassandra again (which will be very rare). The transform and collect are where I check whether I need to pull new data in. But since there is a chance that my table will be updated, I will need to update the broadcast occasionally. So my idea was to destroy it and then rebroadcast. I will update that once I get the other stuff working.
I get the same error if I comment out the destroy and rebroadcast.
Another update:
I need to access the broadcast variable in processor, on this line: .map(r => processor(r._2, MyCache.cache.value)).
I'm able to access the broadcast variable in the transform, and if I do println(MyCache.cache.value) in the transform, then all my tests pass and I'm then able to access the broadcast in processor as well.
Update:
rawStream
  .map(r => {
    println("$$$$$$$$$$$$$$$$$$$")
    println(metrics.value)
    r
  })
This is the stack trace I get when it hits this line.
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 135.0 (TID 114)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:160)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:414)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 24 more
[Updated answer]
You're getting this error because the code inside rawStream.map, i.e. MyCache.cache.value, is executed on an executor, and there MyCache.cache is still null!
When you called MyCache.getInstance, it created the MyCache.cache value on the driver and broadcast it correctly. But your map function doesn't refer to that same object, so it is never sent to the executors. Instead, because you refer to MyCache directly, each executor reads MyCache.cache from its own copy of the MyCache object, where the field was never set, so it is null.
You can get this to work as expected by first obtaining the broadcast handle on the driver and then using that local reference inside the map. The following code should work for you:
val cache = MYCache.getInstance(ssc.sparkContext)
rawStream.map(r => {
  println(cache.value)
  r
})
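For the occasional refresh described in the question, one possible shape is sketched below. It is only a sketch under assumptions: BroadcastHolder, its get/refresh methods, the load parameter (standing in for the Cassandra read) and the Definition field names are hypothetical and not taken from your code. It relies on two things: the body of transform runs on the driver once per batch, so it can pick up the current broadcast handle, and the map closure captures that local handle instead of touching a singleton field on the executor.

```scala
import java.util.UUID
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

// Stand-in for the question's Definition class; the field names are assumptions.
case class Definition(c1: String, c2: String, c3: String, c4: String)

// Hypothetical driver-side holder. Executors never read `current`; they only
// see Broadcast handles that were captured by a closure and shipped with a task.
object BroadcastHolder {
  type Defs = Map[UUID, Definition]

  @volatile private var current: Broadcast[Defs] = _

  // Create the broadcast on first use; `load` stands in for the Cassandra read.
  def get(sc: SparkContext, load: () => Defs): Broadcast[Defs] = synchronized {
    if (current == null) current = sc.broadcast(load())
    current
  }

  // Merge new entries and rebroadcast. The old value is read before the old
  // broadcast is released; unpersist (rather than destroy) leaves the old
  // handle usable by tasks that are still running.
  def refresh(sc: SparkContext, extra: Defs): Broadcast[Defs] = synchronized {
    require(current != null, "call get before refresh")
    val merged = current.value ++ extra
    current.unpersist(blocking = false)
    current = sc.broadcast(merged)
    current
  }
}

object Processor {
  def processData[T: ClassTag](
      rawStream: DStream[(String, String)],
      load: () => BroadcastHolder.Defs,
      processor: (String, Broadcast[BroadcastHolder.Defs]) => T): DStream[T] =
    rawStream.transform { rdd =>
      // Runs on the driver once per batch, so this is the current handle.
      val cache = BroadcastHolder.get(rdd.sparkContext, load)
      // The map closure captures the local `cache`, not the singleton field.
      rdd.map { case (_, payload) => processor(payload, cache) }
    }.filter(r => null != r)
}
```

A call to BroadcastHolder.refresh would also have to happen on the driver, for example inside the same transform body before get is called. Batches scheduled after the refresh then pick up the new handle, while tasks already running keep the old one.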