Reading and writing to Cassandra from Spark worker throws error - scala

I am using the DataStax Cassandra Java driver to write to Cassandra from Spark workers. Code snippet:
rdd.foreachPartition(record => {
  val cluster = SimpleApp.connect_cluster(Spark.cassandraip)
  val session = cluster.connect()
  record.foreach { case (bin_key: (Int, Int), kpi_map_seq: Iterable[Map[String, String]]) =>
    kpi_map_seq.foreach { kpi_map: Map[String, String] =>
      update_tables(session, bin_key, kpi_map)
    }
  } // record.foreach
  session.close()
  cluster.close()
})
For reading I am using the Spark Cassandra connector (which I assume uses the same driver internally):
val bin_table = javaFunctions(Spark.sc).cassandraTable("keyspace", "bin_1")
  .select("bin").where("cell = ?", cellname) // assuming this will run on the worker nodes
println(s"get_bins_for_cell: count of bins for cell $cellname is ${bin_table.count()}")
return bin_table
Doing each of these on its own does not cause any problem; doing them together throws the stack trace below.
My main goal is to avoid doing the write or read directly from the Spark driver program. Still, it seems to have something to do with the context; are two contexts being used?
16/07/06 06:21:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 22, euca-10-254-179-202.eucalyptus.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The SparkContext was getting closed after using the session with Cassandra, as in the example below:
def update_table_using_cassandra_driver() = {
  CassandraConnector(SparkWriter.conf).withSessionDo { session =>
    val statement_4: Statement = QueryBuilder.insertInto("keyspace", "table")
      .value("bin", my_tuple_value)
      .value("cell", my_val("CName"))
    session.executeAsync(statement_4)
    ...
  }
So the next time I called this in the loop I was getting the exception. It looks like a bug in the Cassandra driver; I will have to check that. For the time being I did the following to work around it:
for (a <- 1 to 1000) {
  val sc = new SparkContext(SparkWriter.conf)
  update_table_using_cassandra_driver()
  sc.stop()
  ...sleep(xxx)
}
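In case it helps anyone else, here is a cleaner sketch of the same write path (untested against the exact setup above; it reuses rdd, update_tables and SparkWriter.conf from the code in this post): create the SparkContext once and let CassandraConnector manage sessions on the workers. withSessionDo hands out pooled sessions, so nothing needs to be torn down between iterations.
import com.datastax.spark.connector.cql.CassandraConnector

// Sketch only: one SparkContext for the whole job. CassandraConnector pools
// sessions on each worker, so there is no need to stop/recreate the context
// or to open a Cluster/Session manually inside foreachPartition.
val sc = new SparkContext(SparkWriter.conf)
val connector = CassandraConnector(SparkWriter.conf)

rdd.foreachPartition { partition =>
  connector.withSessionDo { session =>          // borrows a pooled session on this worker
    partition.foreach { case (bin_key, kpi_map_seq) =>
      kpi_map_seq.foreach(kpi_map => update_tables(session, bin_key, kpi_map))
    }
  }
}
sc.stop() // once, at the very end of the job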

Related

Receiving Asynchronous Exception in Flink 1.7.2 Stateful-processing with KeyedProcessFunction and RocksDB state-backend

I've written a simple word count application using Flink 1.7.2 with Kafka 2.2 as both consumer and producer. I use Exactly-Once semantic for the Kafka producer, KeyedProcessFunction for stateful processing, MapState to keep my state and RocksDB with incremental checkpointing as my state-backend.
The application works fine when I run it from IntelliJ, but when I submit it to my local Flink cluster I receive the AsynchronousException below and the Flink application keeps retrying every 0-20 seconds. Has anyone encountered this issue before? Am I missing anything from a configuration perspective?
Here is my code:
class KeyedProcFuncWordCount extends KeyedProcessFunction[String, String, (String, Int)] {

  private var state: MapState[String, Int] = _

  override def open(parameters: Configuration): Unit = {
    state = getRuntimeContext
      .getMapState(new MapStateDescriptor[String, Int]("wordCountState",
        createTypeInformation[String], createTypeInformation[Int]))
  }

  override def processElement(value: String,
                              ctx: KeyedProcessFunction[String, String, (String, Int)]#Context,
                              out: Collector[(String, Int)]): Unit = {
    val currentSum =
      if (state.contains(value)) state.get(value)
      else 0
    val newSum = currentSum + 1
    state.put(value, newSum)
    out.collect((value, newSum))
  }
}

object KafkaProcFuncWordCount {

  val bootstrapServers = "localhost:9092"
  val inTopic = "test"
  val outTopic = "test-out"

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(30000)
    env.setStateBackend(new RocksDBStateBackend("file:///tmp/data/db.rdb", true))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    val consumerProps = new Properties
    consumerProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    consumerProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "KafkaProcFuncWordCount")
    consumerProps.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")
    val kafkaConsumer = new FlinkKafkaConsumer011[String](inTopic, new SimpleStringSchema, consumerProps)

    val producerProps = new Properties
    producerProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    producerProps.setProperty(ProducerConfig.RETRIES_CONFIG, "2147483647")
    producerProps.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
    producerProps.setProperty(ProducerConfig.ACKS_CONFIG, "all")
    producerProps.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    val kafkaProducer = new FlinkKafkaProducer011[String](
      outTopic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema),
      producerProps,
      Optional.of(new FlinkFixedPartitioner[String]),
      FlinkKafkaProducer011.Semantic.EXACTLY_ONCE,
      5
    )

    val text = env.addSource(kafkaConsumer)
    val runningCounts = text
      .keyBy(_.toString)
      .process(new KeyedProcFuncWordCount)
      .map(_.toString())
    runningCounts
      .addSink(kafkaProducer)

    env.execute("KafkaProcFuncWordCount")
  }
}
Here is the part of the Flink taskexecutor log that keeps repeating:
2019-07-05 14:05:47,548 INFO org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer - Flushing new partitions
2019-07-05 14:05:47,552 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011 - Starting FlinkKafkaProducer (1/1) to produce into default topic test-out
2019-07-05 14:05:47,775 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally KeyedProcess -> Map -> Sink: Unnamed (1/1) (f61d24c993f400394eaa028981a26bfe).
2019-07-05 14:05:47,776 INFO org.apache.flink.runtime.taskmanager.Task - KeyedProcess -> Map -> Sink: Unnamed (1/1) (f61d24c993f400394eaa028981a26bfe) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator KeyedProcess -> Map -> Sink: Unnamed (1/1).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator KeyedProcess -> Map -> Sink: Unnamed (1/1).
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.flink.api.common.typeutils.SimpleTypeSerializerSnapshot.<init>(Ljava/util/function/Supplier;)V
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
... 5 more
Caused by: java.lang.NoSuchMethodError: org.apache.flink.api.common.typeutils.SimpleTypeSerializerSnapshot.<init>(Ljava/util/function/Supplier;)V
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011$TransactionStateSerializer$TransactionStateSerializerSnapshot.<init>(FlinkKafkaProducer011.java:1244)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011$TransactionStateSerializer.snapshotConfiguration(FlinkKafkaProducer011.java:1235)
at org.apache.flink.api.common.typeutils.CompositeTypeSerializerConfigSnapshot.<init>(CompositeTypeSerializerConfigSnapshot.java:53)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializerConfigSnapshot.<init>(TwoPhaseCommitSinkFunction.java:847)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializer.snapshotConfiguration(TwoPhaseCommitSinkFunction.java:792)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializer.snapshotConfiguration(TwoPhaseCommitSinkFunction.java:615)
at org.apache.flink.runtime.state.RegisteredOperatorStateBackendMetaInfo.computeSnapshot(RegisteredOperatorStateBackendMetaInfo.java:170)
at org.apache.flink.runtime.state.RegisteredOperatorStateBackendMetaInfo.snapshot(RegisteredOperatorStateBackendMetaInfo.java:103)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:711)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
... 7 more
Thank you very much in advance for your help.
Can you double-check that you are not packaging Flink core dependencies (flink-java, flink-streaming-java, flink-runtime, ...) in your jar? Also double-check that you are running the same version of Flink in your cluster as the one the Kafka connector dependency (flink-kafka-connector) was built against. The flink-kafka-connector (like all connectors) needs to be part of your fat jar. A minimal sbt sketch of that split is below.
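For reference, a hedged sbt sketch of that dependency split; the version number and artifact names are assumptions for a Flink 1.7.2 / Scala 2.11 setup and must match your cluster:
// build.sbt (sketch): keep the Flink core dependencies "provided" so the
// cluster's own runtime classes are used, and bundle only the connector
// into the fat jar.
val flinkVersion = "1.7.2"

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"           % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
  // the Kafka connector is not part of the Flink distribution, so it goes into the fat jar
  "org.apache.flink" %% "flink-connector-kafka-0.11" % flinkVersion
)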
Hope this helps.
Cheers,
Konstantin

NullPointerException on DataFrame where

I have the following method written in Scala:
def fillEmptyCells: Unit = {
  val hourIndex = _weather.schema.fieldIndex("Hour")
  val dateIndex = _weather.schema.fieldIndex("Date")
  val groundSurfaceIndex = _weather.schema.fieldIndex("GroundSurface")
  val snowyGroundIndex = _weather.schema.fieldIndex("SnowyGroundSurface")
  val precipitationIndex = _weather.schema.fieldIndex("catPrec")
  val snowDepthIndex = _weather.schema.fieldIndex("catSnowDepth")

  var resultDf: DataFrame = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], _weather.schema)
  val days = _weather.select("Date").distinct().rdd

  _weather.where("Date = '2014-08-01'").show()

  days.foreach(x => {
    println(s"Date = '${x.getDate(0)}'")
    _weather.where(s"Date = '${x.getDate(0)}'").show()
    val day = _weather.where(s"Date = '${x.getDate(0)}'")
    val dayValues = day.where("Hour = 6").first()
    val grSur = dayValues.getString(groundSurfaceIndex)
    val snSur = dayValues.getString(snowyGroundIndex)
    val prec = dayValues.getString(precipitationIndex)
    val snowDepth = dayValues.getString(snowDepthIndex)
    val dayRddMapped = day.rdd.map(y => (y(0), y(1), grSur, snSur, y(4), y(5), y(6), y(7), prec, snowDepth))
      .foreach(z => {
        resultDf = resultDf.union(Seq(z).toDF())
      })
  })
  resultDf.show(20)
  Unit
}
The problem is this line: _weather.where(s"Date = '${x.getDate(0)}'").show(), which is where the NullPointerException occurs. As you can see on the line above it, I print the where clause to the console (it looks like Date = '2014-06-03'), and the line just before the foreach takes one of these values as a parameter and works fine. _weather is a class variable and does not change while this method is running. The debugger shows something even stranger: _weather becomes null after the first iteration.
What is the source of this magic and how can I avoid it?
Moreover, if you have any suggestions regarding architecture and code quality, they are welcome too.
Stacktrace:
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.where(Dataset.scala:1344)
at org.[package].WeatherHelper$$anonfun$fillEmptyCells$1.apply(WeatherHelper.scala:148)
at org.[package].WeatherHelper$$anonfun$fillEmptyCells$1.apply(WeatherHelper.scala:146)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
19/01/10 13:39:35 ERROR Executor: Exception in task 6.0 in stage 10.0 (TID 420)
The class name is WeatherHelper; this is just part of the whole stack trace, which repeats ~20 times.
You cannot use DataFrames in RDD code (you use DataFrames inside days.foreach). The DataFrames are null there because a DataFrame only lives on the driver, not on the executors. A sketch of one way to restructure this is below.
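As a sketch only (assuming the intent is to copy the Hour = 6 values of each date onto every row of that date), you can stay in the DataFrame API and replace the loop with a join, so nothing DataFrame-related ever runs inside an RDD closure:
import org.apache.spark.sql.functions.col

// Sketch: take the Hour = 6 row of each date as the reference values,
// then join them back onto every row of the same date.
val referenceByDate = _weather
  .where(col("Hour") === 6)
  .select(
    col("Date"),
    col("GroundSurface").as("refGroundSurface"),
    col("SnowyGroundSurface").as("refSnowyGroundSurface"),
    col("catPrec").as("refCatPrec"),
    col("catSnowDepth").as("refCatSnowDepth"))

val resultDf = _weather
  .join(referenceByDate, Seq("Date"), "left")
  .drop("GroundSurface", "SnowyGroundSurface", "catPrec", "catSnowDepth")
  .withColumnRenamed("refGroundSurface", "GroundSurface")
  .withColumnRenamed("refSnowyGroundSurface", "SnowyGroundSurface")
  .withColumnRenamed("refCatPrec", "catPrec")
  .withColumnRenamed("refCatSnowDepth", "catSnowDepth")

resultDf.show(20)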

GC overhead limit exceeded in Spark job with join

I am writing a Spark job to get the latest student records filtered by the student date. When I try this with a hundred thousand records it works fine, but when I run it with a larger number of records my Spark job returns the error below.
I guess this error happens because I load all the data from the table and put it in an RDD; my table contains about 4.2 million records. If that is so, is there a better way to load this data efficiently and complete the operation successfully?
Can anybody please help me resolve this?
WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.10.10.10): java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2157)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1964)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3316)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:463)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3040)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2681)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:408)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/03/09 10:54:09 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, 10.10.10.10, partition 0, PROCESS_LOCAL, 5288 bytes)
17/03/09 10:54:09 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 2 on executor id: 1 hostname: 10.10.10.10.
17/03/09 10:54:09 WARN TransportChannelHandler: Exception in connection from /10.10.10.10:48464
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
17/03/09 10:54:09 ERROR TaskSchedulerImpl: Lost executor 1 on 10.10.10.10: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/03/09 10:54:09 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170309105209-0032/1 is now EXITED (Command exited with code 52)
Code
object StudentDataPerformanceEnhancerImpl extends studentDataPerformanceEnhancer {

  val LOG = LoggerFactory.getLogger(this.getClass.getName)
  val USER_PRIMARY_KEY = "user_id";
  val COURSE_PRIMARY_KEY = "course_id";

  override def extractData(sparkContext: SparkContext, sparkSession: SparkSession, jobConfiguration: JobConfiguration): Unit = {
    val context = sparkSession.read.format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", jobConfiguration.jdbcURL)
      .option("dbtable", "student_student")
      .option("user", test_user)
      .option("password", test_password)
      .load()
    context.cache()

    val mainRDD = context.rdd.map(k => ((k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
      k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
      k.getTimestamp(k.fieldIndex("student_date_time"))),
      (k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
        k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
        k.getTimestamp(k.fieldIndex("student_date_time")),
        k.getString(k.fieldIndex("student_student_index")),
        k.getLong(k.fieldIndex("student_category_pk")),
        k.getString(k.fieldIndex("effeciency")),
        k.getString(k.fieldIndex("net_score")),
        k.getString(k.fieldIndex("avg_score")),
        k.getString(k.fieldIndex("test_score"))))).persist(StorageLevel.DISK_ONLY)

    LOG.info("Data extractions started....!")
    try {
      val studentCompositeRDD = context.rdd.map(r => ((r.getLong(r.fieldIndex(USER_PRIMARY_KEY)),
        r.getLong(r.fieldIndex(COURSE_PRIMARY_KEY))),
        r.getTimestamp(r.fieldIndex("student_date_time")))).reduceByKey((t1, t2) => if (t1.after(t2)) t1 else t2)
        .map(t => ((t._1._1, t._1._2, t._2), t._2)).persist(StorageLevel.DISK_ONLY)
      val filteredRDD = mainRDD.join(studentCompositeRDD).map(k => k._2._1)
      DataWriter.persistLatestData(filteredRDD)
    } catch {
      case e: Exception => LOG.error("Error in spark job: " + e.getMessage)
    }
  }
}
My DataWriter class, which handles the database persistence, is below:
object DataWriter {

  def persistLatestStudentRiskData(rDD: RDD[(Long, Long, Timestamp, String, Long, String, String, String, String)]): Unit = {
    var jdbcConnection: java.sql.Connection = null
    try {
      jdbcConnection = DatabaseUtil.getConnection
      if (jdbcConnection != null) {
        val statement = "{call insert_latest_student_risk (?,?,?,?,?,?,?,?,?)}"
        val callableStatement = jdbcConnection.prepareCall(statement)
        rDD.collect().foreach(x => sendLatestStudentRiskData(callableStatement, x))
      }
    } catch {
      case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
      case e: RuntimeException => LOG.error("Error in the latest student persistence: " + e.getMessage)
      case e: Exception => LOG.error("Error in the latest student persistence: " + e.getMessage)
    } finally {
      if (jdbcConnection != null) {
        try {
          jdbcConnection.close()
        } catch {
          case e: SQLException => LOG.error("Error in jdbc connection close : " + e.getMessage)
          case e: Exception => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
        }
      }
    }
  }

  def sendLatestStudentRiskData(callableStatement: java.sql.CallableStatement,
                                latestStudentData: (Long, Long, Timestamp, String, Long,
                                  String, String, String, String)): Unit = {
    try {
      callableStatement.setLong(1, latestStudentData._1)
      callableStatement.setLong(2, latestStudentData._2)
      callableStatement.setTimestamp(3, latestStudentData._3)
      callableStatement.setString(4, latestStudentData._4)
      callableStatement.setLong(5, latestStudentData._5)
      callableStatement.setString(6, latestStudentData._6)
      callableStatement.setString(7, latestStudentData._7)
      callableStatement.setString(8, latestStudentData._8)
      callableStatement.setString(9, latestStudentData._9)
      callableStatement.executeUpdate
    } catch {
      case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
    }
  }
}
The problem isn't that you are putting data into the RDD; it's that you are taking the data out of the RDD and into the driver's memory. Specifically, the problem is the collect call you are using to persist the data. You should remove it. collect brings the entire RDD into memory on the driver, so you are no longer using Spark and your cluster to process data, and you quickly run out of memory unless your data size is very small. collect should rarely be used in Spark jobs; it is mostly useful for development and debugging with small amounts of data. It has some uses in production applications for supporting operations, but not as the main data flow.
Spark is able to write to JDBC directly if you use Spark SQL; leverage that and remove the calls to collect. A sketch is below.
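A minimal sketch of that idea, assuming the rows can be written straight into a table instead of going through the insert_latest_student_risk stored procedure (the target table name "latest_student_risk" below is a placeholder; everything else reuses names from the question):
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Sketch: turn the filtered RDD back into a DataFrame and let Spark SQL write
// it over JDBC from the executors, instead of collect()-ing it to the driver.
import sparkSession.implicits._

val filteredDf = filteredRDD.toDF(
  "user_id", "course_id", "student_date_time", "student_student_index",
  "student_category_pk", "effeciency", "net_score", "avg_score", "test_score")

val connectionProps = new Properties()
connectionProps.setProperty("driver", "com.mysql.jdbc.Driver")
connectionProps.setProperty("user", test_user)         // same credentials as the read
connectionProps.setProperty("password", test_password)

filteredDf.write
  .mode(SaveMode.Append)
  .jdbc(jobConfiguration.jdbcURL, "latest_student_risk", connectionProps) // placeholder table name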

Can't access broadcast variable in transformation

I'm having problems accessing a variable from inside a transformation function. Could someone help me out?
Here are my relevant classes and functions.
@SerialVersionUID(889949215L)
object MyCache extends Serializable {

  @transient lazy val logger = Logger(getClass.getName)

  @volatile var cache: Broadcast[Map[UUID, Definition]] = null

  def getInstance(sparkContext: SparkContext): Broadcast[Map[UUID, Definition]] = {
    if (cache == null) {
      synchronized {
        val map = sparkContext.cassandraTable("keyspace", "table")
          .collect()
          .map(m => m.getUUID("id") ->
            Definition(m.getString("c1"), m.getString("c2"), m.getString("c3"),
              m.getString("c4"))).toMap
        cache = sparkContext.broadcast(map)
      }
    }
    cache
  }
}
In a different file:
object Processor extends Serializable {

  @transient lazy val logger = Logger(getClass.getName)

  def processData[T: ClassTag](rawStream: DStream[(String, String)], ssc: StreamingContext,
                               processor: (String, Broadcast[Map[UUID, Definition]]) => T): DStream[T] = {
    MyCache.getInstance(ssc.sparkContext)
    var newCacheValues = Map[UUID, Definition]()
    rawStream.cache()
    rawStream
      .transform(rdd => {
        val array = rdd.collect()
        array.foreach(r => {
          val value = getNewCacheValue(r._2, rdd.context)
          if (value.isDefined) {
            newCacheValues = newCacheValues + value.get
          }
        })
        rdd
      })
    if (newCacheValues.nonEmpty) {
      logger.info(s"Rebroadcasting. There are ${newCacheValues.size} new values")
      logger.info("Destroying old cache")
      MyCache.cache.destroy()
      // this is probably wrong here, destroying the object but then referencing it. But I haven't gotten to this part yet.
      MyCache.cache = ssc.sparkContext.broadcast(MyCache.cache.value ++ newCacheValues)
    }
    rawStream
      .map(r => {
        println("######################")
        println(MyCache.cache.value)
        r
      })
      .map(r => processor(r._2, MyCache.cache.value))
      .filter(r => null != r)
  }
}
Every time I run this I get SparkException: Failed to get broadcast_1_piece0 of broadcast_1 when trying to access cache.value.
When I add a println(MyCache.cache.value) right after the getInstance call I'm able to access the broadcast variable, but when I deploy it to a Mesos cluster I'm unable to access the broadcast values, this time with a null pointer exception.
Update:
The error I'm seeing is on println(MyCache.cache.value). I shouldn't have included the if statement containing the destroy, because my tests never hit it.
The basics of my application are: I have a table in Cassandra that won't be updated very often, but I need to do some validation on streaming data, so I want to pull all the data from that table into memory. getInstance pulls the whole table in on startup, and then I check my streaming data to see whether I need to pull from Cassandra again (which will happen very rarely). The transform and collect are where I check whether I need to pull new data in. But since there is a chance that my table will be updated, I will need to refresh the broadcast occasionally, so my idea was to destroy it and then rebroadcast. I will get to that once the other stuff is working.
I get the same error if I comment out the destroy and rebroadcast.
Another update:
I need to access the broadcast variable in processor on this line: .map(r => processor(r._2, MyCache.cache.value)).
I'm able to access the broadcast variable in the transform, and if I do println(MyCache.cache.value) in the transform, then all my tests pass and I'm then able to access the broadcast in processor.
Update:
rawStream
  .map(r => {
    println("$$$$$$$$$$$$$$$$$$$")
    println(metrics.value)
    r
  })
This is the stack trace I get when it hits this line.
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 135.0 (TID 114)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:160)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:414)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 24 more
[Updated answer]
You're getting the error because the code inside rawStream.map, i.e. MyCache.cache.value, is getting executed on one of the executors, and there MyCache.cache is still null!
When you did MyCache.getInstance, it created the MyCache.cache value on the driver and broadcast it just fine. But you're not referring to that same object in your map method, so it doesn't get sent over to the executors. Instead, since you are referring to MyCache directly, the executors invoke MyCache.cache on their own copy of the MyCache object, and that is obviously null.
You can get this to work as expected by first getting a reference to the cache broadcast object on the driver and using that object in the map. The following code should work for you --
val cache = MyCache.getInstance(ssc.sparkContext)
rawStream.map(r => {
  println(cache.value)
  r
})
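If you later do need the occasional re-broadcast you mentioned, here is a hedged sketch (assuming it sits inside processData, where processor and the ClassTag are in scope, and that your refresh logic replaces MyCache.cache on the driver between batches): resolve the current broadcast on the driver once per batch inside transform and capture it as a local val.
// Sketch: the function passed to transform runs on the driver for every batch,
// so it always picks up whatever broadcast is current at that moment; the local
// val is what gets serialized into the executor-side closure.
val processed: DStream[T] = rawStream.transform { rdd =>
  val currentCache = MyCache.getInstance(rdd.sparkContext)
  rdd.map(r => processor(r._2, currentCache))
}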

Apache Spark - java.lang.NoSuchMethodError: breeze.linalg.DenseVector

I am having issues running Apache Spark 1.0.1 within a Play! app. Currently, I am trying to run Spark within the Play! application and use some of the basic Machine Learning within Spark.
Here's my app creation:
def sparkFactory: SparkContext = {
  val logFile = "public/README.md" // Should be some file on your system
  val driverHost = "localhost"
  val conf = new SparkConf(false) // skip loading external settings
    .setMaster("local[4]") // run locally with enough threads
    .setAppName("firstSparkApp")
    .set("spark.logConf", "true")
    .set("spark.driver.host", s"$driverHost")
  new SparkContext(conf)
}
And here's an error when I try to do some basic discovery of a Tall and Skinny Matrix:
[error] o.a.s.e.ExecutorUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.NoSuchMethodError: breeze.linalg.DenseVector$.dv_v_ZeroIdempotent_InPlaceOp_Double_OpAdd()Lbreeze/linalg/operators/BinaryUpdateRegistry;
at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$5.apply(RowMatrix.scala:313) ~[spark-mllib_2.10-1.0.1.jar:1.0.1]
at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$5.apply(RowMatrix.scala:313) ~[spark-mllib_2.10-1.0.1.jar:1.0.1]
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) ~[scala-library-2.10.4.jar:na]
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) ~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
The error above is triggered by the following:
def computePrincipalComponents(datasetId: String) = Action {
  val datapoints = DataPoint.listByDataset(datasetId)

  // load the data into spark
  val rows = datapoints.map(_.data).map { row =>
    row.map(_.toDouble)
  }
  val RDDRows = WorkingSpark.context.makeRDD(rows).map { line =>
    Vectors.dense(line)
  }
  val mat = new RowMatrix(RDDRows)
  val result = mat.computePrincipalComponents(mat.numCols().toInt)
  Ok(result.toString)
}
It looks like a dependency issue, but no idea where it starts. Any ideas?
Ah, this was indeed caused by a dependency conflict. Apparently the newer Spark uses Breeze methods that were not available in the version I had pulled in. By removing Breeze from my Play! build file I was able to run the function above just fine; a sketch of the change is below.
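As a hedged sketch of the build change (the commented-out Breeze coordinates and version are assumptions; the point is simply to stop declaring Breeze yourself and let spark-mllib bring in the version it was built against):
// build.sbt (sketch): no explicit Breeze dependency; spark-mllib pulls in its own.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.1",
  "org.apache.spark" %% "spark-mllib" % "1.0.1"
  // "org.scalanlp" %% "breeze" % "0.8.1"   // removed: conflicted with spark-mllib's Breeze
)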
For those interested, here's the output:
-0.23490049167080018 0.4371989078912155 0.5344916752692394 ... (6 total)
-0.43624389448418854 0.531880914138611 0.1854269324452522 ...
-0.5312372137092107 0.17954211389001487 -0.456583286485726 ...
-0.5172743086226219 -0.2726152326516076 -0.36740474569706394 ...
-0.3996400343756039 -0.5147253632175663 0.303449047782936 ...
-0.21216780828347453 -0.39301803119012546 0.4943679121187219 ...