ZIO kafka producer closed - scala

This is my code
trait KafkaPublisher {
  def publishA(a: Seq[A]): Task[RecordMetadata]
}

case class KafkaPublisherLive(producer: Producer) extends KafkaPublisher {

  def publishA(a: Seq[A]): Task[RecordMetadata] =
    publish(aTopic, a)

  private def publish[T](topic: String, value: T)(implicit format: Format[T]): Task[RecordMetadata] =
    producer.produce(
      topic = topic,
      key = "",
      value = serialize(value),
      keySerializer = Serde.string,
      valueSerializer = Serde.string
    )

  private def serialize[T](value: T)(implicit format: Format[T]): String = format.writes(value).toString()
}
object KafkaPublisher {
  val layer: ZLayer[KafkaConfig, Throwable, KafkaPublisherLive] =
    ZLayer.scoped {
      for {
        config   <- ZIO.service[KafkaConfig]
        _        <- ZIO.logWarning("Connecting to kafka on " + config.brokers)
        producer <- Producer.make(
                      ProducerSettings(config.brokers.split(",").toList)
                    )
      } yield KafkaPublisherLive(producer)
    }
}
When I run the application, after about 3 minutes I get the following messages:
[ZScheduler-Worker-3] INFO org.apache.kafka.clients.producer.KafkaProducer - [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 30000 ms.
[kafka-producer-network-thread | producer-1] INFO org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Disconnecting from node -1 due to socket connection setup timeout. The timeout value is 10759 ms.
[kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Bootstrap broker 10.72.26.19:9092 (id: -1 rack: null) disconnected
[ZScheduler-Worker-3] INFO org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
[ZScheduler-Worker-3] INFO org.apache.kafka.common.metrics.Metrics - Closing reporter org.apache.kafka.common.metrics.JmxReporter
[ZScheduler-Worker-3] INFO org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
[ZScheduler-Worker-3] INFO org.apache.kafka.common.utils.AppInfoParser - App info kafka.producer for producer-1 unregistered
What am I doing wrong? Did I scope it incorrectly? I intend to keep the publisher open "forever", so I don't really need it managed.
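For context, a minimal sketch of how such a scoped layer is typically provided to a ZIO application so that the producer lives for the whole app's lifetime (the Main object, KafkaConfig.layer, and the placeholder program below are illustrative assumptions, not the actual application code):
import zio._

// Illustrative wiring only: when the scoped layer is provided to the whole program,
// the Producer acquired inside ZLayer.scoped is released when the application's
// root scope closes, i.e. at shutdown.
object Main extends ZIOAppDefault {

  // Placeholder program; the real publishing logic is elided.
  val program: ZIO[KafkaPublisherLive, Throwable, Unit] =
    ZIO.serviceWithZIO[KafkaPublisherLive](_ => ZIO.unit)

  override def run =
    program.provide(
      KafkaConfig.layer,    // hypothetical config layer
      KafkaPublisher.layer
    )
}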

Related

Spark kafka producer Introducing duplicate records during kafka ingestion

I have written a Spark Kafka producer which pulls messages from Hive and pushes them into Kafka. Most of the records (messages) get duplicated when we ingest into Kafka, even though there are no duplicates before pushing into Kafka. I have added the configurations related to exactly-once semantics to make the Kafka producer idempotent.
Below is the code snippet I am using for the Kafka producer:
import java.util.{Properties, UUID}
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}
import org.apache.log4j.LogManager
import org.apache.spark.sql.DataFrame

object KafkaWriter {

  lazy val log = LogManager.getLogger("KafkaWriter")

  private def writeKafkaWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int,
                                           kafkaBootstrapServer: String): Unit = {
    log.info("Inside Method KafkaWriter::writeKafkaWithoutRepartition no of partitions =" + noOfPartitions)
    df.foreachPartition(
      iter => {
        val properties = getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer)
        properties.setProperty(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString + "-" + System.currentTimeMillis().toString)
        log.info("Inside Method writeKafkaWithoutRepartition:: inside writekafka :: kafka properties ::" + properties)
        val kafkaProducer = new KafkaProducer[String, String](properties)
        try {
          log.info("kafka producer property enable.idempotence ::" + properties.getProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG))
          kafkaProducer.initTransactions
          kafkaProducer.beginTransaction
          log.info("Inside Method writeKafkaWithoutRepartition:: inside each partition kafka transactions started")
          iter.foreach(row => {
            log.info("Inside Method writeKafkaWithoutRepartition:: inside each iterator record kafka transactions started")
            kafkaProducer.send(new ProducerRecord(topic, row.getAs[String]("value")))
          })
          kafkaProducer.commitTransaction
          log.info("Inside Method writeKafkaWithoutRepartition:: kafka transactions completed")
        } catch {
          case e @ (_: ProducerFencedException) =>
            // We can't recover from these exceptions, so our only option is to close the producer and exit.
            log.error("Exception occured while sending records to kafka ::" + e.getMessage)
            kafkaProducer.close
          case e: KafkaException =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occured while sending records to kafka ::" + e.getMessage)
            kafkaProducer.abortTransaction
          case ex: Exception =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occured while sending records to kafka ::" + ex.getMessage)
            kafkaProducer.abortTransaction
        } finally {
          kafkaProducer.close
        }
      })
  }

  def writeWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int, kafkaBootstrapServer: String): Unit = {
    var repartitionedDF = df.selectExpr("to_json(struct(*)) AS value")
    log.info("Inside KafkaWriter::writeWithoutRepartition ")
    writeKafkaWithoutRepartition(repartitionedDF, topic, noOfPartitions, kafkaBootstrapServer)
  }

  def getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer: String): Properties = {
    val properties = new Properties
    properties.setProperty(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServer)
    properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    // properties.setProperty(CommonClientConfigs.REQUEST_TIMEOUT_MS_CONFIG, "400000")
    // properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "300000")
    properties.put(ProducerConfig.RETRIES_CONFIG, "1000")
    properties.put(ProducerConfig.ACKS_CONFIG, "all")
    properties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
    properties
  }
}
I set isolation.level to read_committed on the Kafka consumer end.
I also tried setting min.insync.replicas to 2 (in my opinion this property might not play an important role here, but I tried it anyway).
Spark version: 2.3.1
Kafka client version: 2.2.1
I am also using transactions while producing messages into Kafka: init, begin, and commit the transaction for each message. I am ingesting around 100 million records at a time, and I split the data into smaller chunks, say 100 million divided into 1 million at once, before ingesting into Kafka.
I also tried using Structured Streaming, still with no luck:
df.selectExpr(s""" '${key}' as key """, "to_json(struct(*)) AS value").write.format("kafka").options(getKafkaProducerProperties(topic)).save
I am not sure if I am missing any configurations on the Kafka producer, the broker, or the consumer end, or whether there are any others I should add.
Thanks in advance.

Receiving Asynchronous Exception in Flink 1.7.2 Stateful-processing with KeyedProcessFunction and RocksDB state-backend

I've written a simple word-count application using Flink 1.7.2 with Kafka 2.2 as both consumer and producer. I use exactly-once semantics for the Kafka producer, a KeyedProcessFunction for stateful processing, MapState to keep my state, and RocksDB with incremental checkpointing as my state backend.
The application works fine when I run it from IntelliJ, but when I submit it to my local Flink cluster I receive an AsynchronousException and the Flink application keeps retrying every 0-20 seconds. Has anyone encountered this issue before? Am I missing anything from a configuration perspective?
Here is my code:
class KeyedProcFuncWordCount extends KeyedProcessFunction[String, String, (String, Int)]
{
  private var state: MapState[String, Int] = _

  override def open(parameters: Configuration): Unit =
  {
    state = getRuntimeContext
      .getMapState(new MapStateDescriptor[String, Int]("wordCountState", createTypeInformation[String],
        createTypeInformation[Int]))
  }

  override def processElement(value: String,
                              ctx: KeyedProcessFunction[String, String, (String, Int)]#Context,
                              out: Collector[(String, Int)]): Unit =
  {
    val currentSum =
      if (state.contains(value)) state.get(value)
      else 0
    val newSum = currentSum + 1
    state.put(value, newSum)
    out.collect((value, newSum))
  }
}

object KafkaProcFuncWordCount
{
  val bootstrapServers = "localhost:9092"
  val inTopic = "test"
  val outTopic = "test-out"

  def main(args: Array[String]): Unit =
  {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(30000)
    env.setStateBackend(new RocksDBStateBackend("file:///tmp/data/db.rdb", true))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    val consumerProps = new Properties
    consumerProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    consumerProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "KafkaProcFuncWordCount")
    consumerProps.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")
    val kafkaConsumer = new FlinkKafkaConsumer011[String](inTopic, new SimpleStringSchema, consumerProps)

    val producerProps = new Properties
    producerProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    producerProps.setProperty(ProducerConfig.RETRIES_CONFIG, "2147483647")
    producerProps.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
    producerProps.setProperty(ProducerConfig.ACKS_CONFIG, "all")
    producerProps.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    val kafkaProducer = new FlinkKafkaProducer011[String](
      outTopic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema),
      producerProps,
      Optional.of(new FlinkFixedPartitioner[String]),
      FlinkKafkaProducer011.Semantic.EXACTLY_ONCE,
      5
    )

    val text = env.addSource(kafkaConsumer)
    val runningCounts = text
      .keyBy(_.toString)
      .process(new KeyedProcFuncWordCount)
      .map(_.toString())
    runningCounts
      .addSink(kafkaProducer)

    env.execute("KafkaProcFuncWordCount")
  }
}
Here is the part of the Flink task executor log that keeps repeating:
2019-07-05 14:05:47,548 INFO org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer - Flushing new partitions
2019-07-05 14:05:47,552 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011 - Starting FlinkKafkaProducer (1/1) to produce into default topic test-out
2019-07-05 14:05:47,775 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally KeyedProcess -> Map -> Sink: Unnamed (1/1) (f61d24c993f400394eaa028981a26bfe).
2019-07-05 14:05:47,776 INFO org.apache.flink.runtime.taskmanager.Task - KeyedProcess -> Map -> Sink: Unnamed (1/1) (f61d24c993f400394eaa028981a26bfe) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator KeyedProcess -> Map -> Sink: Unnamed (1/1).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator KeyedProcess -> Map -> Sink: Unnamed (1/1).
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.flink.api.common.typeutils.SimpleTypeSerializerSnapshot.<init>(Ljava/util/function/Supplier;)V
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
... 5 more
Caused by: java.lang.NoSuchMethodError: org.apache.flink.api.common.typeutils.SimpleTypeSerializerSnapshot.<init>(Ljava/util/function/Supplier;)V
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011$TransactionStateSerializer$TransactionStateSerializerSnapshot.<init>(FlinkKafkaProducer011.java:1244)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011$TransactionStateSerializer.snapshotConfiguration(FlinkKafkaProducer011.java:1235)
at org.apache.flink.api.common.typeutils.CompositeTypeSerializerConfigSnapshot.<init>(CompositeTypeSerializerConfigSnapshot.java:53)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializerConfigSnapshot.<init>(TwoPhaseCommitSinkFunction.java:847)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializer.snapshotConfiguration(TwoPhaseCommitSinkFunction.java:792)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$StateSerializer.snapshotConfiguration(TwoPhaseCommitSinkFunction.java:615)
at org.apache.flink.runtime.state.RegisteredOperatorStateBackendMetaInfo.computeSnapshot(RegisteredOperatorStateBackendMetaInfo.java:170)
at org.apache.flink.runtime.state.RegisteredOperatorStateBackendMetaInfo.snapshot(RegisteredOperatorStateBackendMetaInfo.java:103)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:711)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
... 7 more
Thank you very much in advance for your help.
Can you double-check that you are not packaging Flink core dependencies (flink-java, flink-streaming-java, flink-runtime, ...) in your jar? Also double-check that you're running the same version of Flink in your cluster as the dependency of the Kafka connector (flink-kafka-connector). The flink-kafka-connector (like all connectors) needs to be part of your fat jar.
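For example, in an sbt build that usually looks like the sketch below (artifact names correspond to the FlinkKafkaProducer011 used above; the version is a placeholder and must match your cluster):
// build.sbt sketch: core Flink modules are provided by the cluster at runtime,
// so mark them "provided"; the Kafka connector must be bundled into the fat jar.
val flinkVersion = "1.7.2" // placeholder, must match the cluster version

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"                % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala"      % flinkVersion % "provided",
  "org.apache.flink" %% "flink-connector-kafka-0.11" % flinkVersion // goes into the fat jar
)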
Hope this helps.
Cheers,
Konstantin

GC overhead limit exceeded in Sparkjob with join

I am writing a Spark job to get the latest student records filtered by the student date. When I try this with a hundred thousand records it works fine, but when I run it with a large number of records my Spark job returns the error below.
I guess this error happens because I load all the data from the table and put it into an RDD, since my table contains about 4.2 million records. If that is so, is there a better way to load that data efficiently and complete my operation successfully?
Can anybody please help me resolve this?
WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.10.10.10): java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2157)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1964)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3316)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:463)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3040)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2681)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:408)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/03/09 10:54:09 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, 10.10.10.10, partition 0, PROCESS_LOCAL, 5288 bytes)
17/03/09 10:54:09 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 2 on executor id: 1 hostname: 10.10.10.10.
17/03/09 10:54:09 WARN TransportChannelHandler: Exception in connection from /10.10.10.10:48464
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
17/03/09 10:54:09 ERROR TaskSchedulerImpl: Lost executor 1 on 10.10.10.10: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/03/09 10:54:09 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170309105209-0032/1 is now EXITED (Command exited with code 52)
Code
object StudentDataPerformanceEnhancerImpl extends studentDataPerformanceEnhancer {

  val LOG = LoggerFactory.getLogger(this.getClass.getName)
  val USER_PRIMARY_KEY = "user_id"
  val COURSE_PRIMARY_KEY = "course_id"

  override def extractData(sparkContext: SparkContext, sparkSession: SparkSession, jobConfiguration: JobConfiguration): Unit = {
    val context = sparkSession.read.format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", jobConfiguration.jdbcURL)
      .option("dbtable", "student_student")
      .option("user", test_user)
      .option("password", test_password)
      .load()
    context.cache()

    val mainRDD = context.rdd.map(k => ((k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
      k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
      k.getTimestamp(k.fieldIndex("student_date_time"))),
      (k.getLong(k.fieldIndex(USER_PRIMARY_KEY)),
        k.getLong(k.fieldIndex(COURSE_PRIMARY_KEY)),
        k.getTimestamp(k.fieldIndex("student_date_time")),
        k.getString(k.fieldIndex("student_student_index")),
        k.getLong(k.fieldIndex("student_category_pk")),
        k.getString(k.fieldIndex("effeciency")),
        k.getString(k.fieldIndex("net_score")),
        k.getString(k.fieldIndex("avg_score")),
        k.getString(k.fieldIndex("test_score"))))).persist(StorageLevel.DISK_ONLY)

    LOG.info("Data extractions started....!")
    try {
      val studentCompositeRDD = context.rdd.map(r => ((r.getLong(r.fieldIndex(USER_PRIMARY_KEY)),
        r.getLong(r.fieldIndex(COURSE_PRIMARY_KEY))),
        r.getTimestamp(r.fieldIndex("student_date_time")))).reduceByKey((t1, t2) => if (t1.after(t2)) t1 else t2)
        .map(t => ((t._1._1, t._1._2, t._2), t._2)).persist(StorageLevel.DISK_ONLY)
      val filteredRDD = mainRDD.join(studentCompositeRDD).map(k => k._2._1)
      DataWriter.persistLatestData(filteredRDD)
    } catch {
      case e: Exception => LOG.error("Error in spark job: " + e.getMessage)
    }
  }
}
My DataWriter class related to database persistence is below
object DataWriter {

  def persistLatestStudentRiskData(rDD: RDD[(Long, Long, Timestamp, String, Long, String, String, String, String)]): Unit = {
    var jdbcConnection: java.sql.Connection = null
    try {
      jdbcConnection = DatabaseUtil.getConnection
      if (jdbcConnection != null) {
        val statement = "{call insert_latest_student_risk (?,?,?,?,?,?,?,?,?)}"
        val callableStatement = jdbcConnection.prepareCall(statement)
        rDD.collect().foreach(x => sendLatestStudentRiskData(callableStatement, x))
      }
    } catch {
      case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
      case e: RuntimeException => LOG.error("Error in the latest student persistence: " + e.getMessage)
      case e: Exception => LOG.error("Error in the latest student persistence: " + e.getMessage)
    } finally {
      if (jdbcConnection != null) {
        try {
          jdbcConnection.close()
        } catch {
          case e: SQLException => LOG.error("Error in jdbc connection close : " + e.getMessage)
          case e: Exception => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
        }
      }
    }
  }

  def sendLatestStudentRiskData(callableStatement: java.sql.CallableStatement,
                                latestStudentData: (Long, Long, Timestamp, String, Long,
                                  String, String, String, String)): Unit = {
    try {
      callableStatement.setLong(1, latestStudentData._1)
      callableStatement.setLong(2, latestStudentData._2)
      callableStatement.setTimestamp(3, latestStudentData._3)
      callableStatement.setString(4, latestStudentData._4)
      callableStatement.setLong(5, latestStudentData._5)
      callableStatement.setString(6, latestStudentData._6)
      callableStatement.setString(7, latestStudentData._7)
      callableStatement.setString(8, latestStudentData._8)
      callableStatement.setString(9, latestStudentData._9)
      callableStatement.executeUpdate
    } catch {
      case e: SQLException => LOG.error("Error in executing insert_latest_student_risk stored procedure : " + e.getMessage)
    }
  }
}
The problem isn't that you are putting data into the RDD; it's that you are taking the data out of the RDD and into driver memory. Specifically, the problem is the collect call you are using to persist the data. You should remove it. collect brings the entire RDD into memory on the driver, so you are no longer using Spark and your cluster to process the data, and you quickly run out of memory unless your data size is very small. collect should rarely be used by Spark processes; it's mostly useful for development and debugging with small amounts of data. It has some uses in production applications for supporting operations, but not as the main data flow.
Spark is able to write to JDBC directly if you use Spark SQL; leverage that and remove the calls to collect.
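As a rough sketch of that approach (converting the joined RDD back to a DataFrame, the JDBC URL, and the target table are illustrative assumptions; if you really need the stored procedure, do the JDBC work inside foreachPartition rather than on the driver):
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("latest-student-writer").getOrCreate()
import spark.implicits._

// `filteredRDD` is the joined RDD from extractData above; give its tuple fields column names.
val filteredDF = filteredRDD.toDF(
  "user_id", "course_id", "student_date_time", "student_student_index",
  "student_category_pk", "effeciency", "net_score", "avg_score", "test_score")

val props = new Properties()
props.setProperty("user", "db_user")          // placeholder credentials
props.setProperty("password", "db_password")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// The write stays distributed across the executors instead of collect()-ing to the driver.
filteredDF.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mysql://db-host:3306/school", "latest_student_risk", props) // placeholder URL and table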

mutable.Buffer does not work with Scalding JobTest for Type Safe API

I have almost finished my Scalding project, which uses the Type Safe API instead of the Fields API. The last issue that remains in my overall project setup is the integration test of the entire Scalding job itself (I have finished the unit tests for the Type Safe External Operations pattern, yay!). This means running the complete job and testing the output of the job's various sinks.
However, something very peculiar is happening in my
typedSink { scala.collection.mutable.Buffer[] => Unit }
blocks: it seems that my program never sees the buffer or does anything with it, so the integration test always passes even when it should not. Below are both the job itself and the test, to help illuminate what is going on:
object MyJob {
  val inputArgPath = "input"
  val validOutputArgPath = "validOutput"
  val invalidOutputArgPath = "invalidOutput"
}

class MyJob(args: Args) extends Job(args) {

  import OperationWrappers._

  implicit lazy val uId: Some[UniqueID] = Some(UniqueID.getIDFor(flowDef))

  val inputPath: String = args(MyJob.inputArgPath)
  val validOutputPath: String = args(MyJob.validOutputArgPath)
  val invalidOutputPath: String = args(MyJob.invalidOutputArgPath)

  val eventInput: TypedPipe[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => TypedPipe.from(WritableSequenceFile[LongWritable, Text](inputPath))
    case _ => TypedPipe.from(TypedTsv[(LongWritable, Text)](inputPath))
  }

  def returnOutputPipe(outputString: String): FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = {
    val eventOutput: FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = this.mode match {
      case m: HadoopMode => WritableSequenceFile[LongWritable, Text](outputString)
      case _ => TypedTsv[(LongWritable, Text)](outputString)
    }
    eventOutput
  }

  val validatedEvents: TypedPipe[(LongWritable, Either[Text, Event])] = eventInput.convertJsonToEither.forceToDisk

  validatedEvents.removeInvalidTuples.removeEitherWrapper.write(returnOutputPipe(invalidOutputPath))
  validatedEvents.keepValidTuples.removeEitherWrapper.write(returnOutputPipe(validOutputPath))

  override protected def handleStats(statsData: CascadingStats) = {
    // This is code to handle counters.
  }
}
Below is the integration test:
class MyJobTest extends FlatSpec with Matchers {

  private val LOG = LoggerFactory.getLogger(classOf[MyJobTest])

  val validEvents: List[(LongWritable, Text)] = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/validEvents.txt")).getLines().toList.map(s => {
    val eventText = new Text
    val typedFields = s.split(Constants.TAB)
    eventText.set(typedFields(1))
    (new LongWritable(typedFields(0).toLong), eventText)
  })

  "Integrate-Test: My Job" should "run test" in {
    LOG.info("Before Job Test starts.")
    JobTest(classOf[MyJob].getName)
      .arg(MyJob.inputArgPath, "input")
      .arg(MyJob.invalidOutputArgPath, "invalidOutput")
      .arg(MyJob.validOutputArgPath, "validOutput")
      .source(TypedTsv[(LongWritable, Text)]("input"), validEvents)
      .typedSink[(LongWritable, Text)](TypedTsv[(LongWritable, Text)]("invalidOutput")) {
        (buffer: mutable.Buffer[(LongWritable, Text)]) => {
          LOG.info("This is inside the buffer1.")
          buffer.size should equal(1000000)
        }
      }
      .typedSink[(LongWritable, Text)](TypedTsv[(LongWritable, Text)]("validOutput")) {
        (buffer: mutable.Buffer[(LongWritable, Text)]) => {
          LOG.info("This is inside the buffer2.")
          buffer.size should equal(1000000000)
        }
      }
      .run
      .finish
  }
}
And finally, the output:
[INFO] --- maven-surefire-plugin:2.7:test (default-test) @ MyJob ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (test) @ MyJob ---
Discovery starting.
16/01/28 10:06:42 INFO jobs.MyJobTest: Before Job Test starts.
16/01/28 10:06:42 INFO property.AppProps: using app.id: A98C9B84C79348F8A7784D8247410C13
16/01/28 10:06:42 INFO util.Version: Concurrent, Inc - Cascading 2.6.1
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] starting
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] source: MemoryTap["NullScheme"]["0.2996348736498404"]
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] sink: MemoryTap["NullScheme"]["0.8393418014297485"]
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] sink: MemoryTap["NullScheme"]["0.20643450953780684"]
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] parallel execution is enabled: true
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] starting jobs: 1
16/01/28 10:06:42 INFO flow.Flow: [com.myCompany.myProject.c...] allocating threads: 1
16/01/28 10:06:42 INFO flow.FlowStep: [com.myCompany.myProject.c...] starting step: local
16/01/28 10:06:42 INFO util.Version: HV000001: Hibernate Validator 5.0.3.Final
Dumping custom counters:
rawEvent 6
validEvent 6
16/01/28 10:06:42 INFO jobs.MyJob: RawEvents: 6
16/01/28 10:06:42 INFO jobs.MyJob: ValidEvents: 6
16/01/28 10:06:42 INFO jobs.MyJob: InvalidEvents: 0
16/01/28 10:06:42 INFO jobs.MyJob: Job has valid counters and is exiting successfully.
As you can see, the logger logs "Before Job Test starts." but nothing happens inside the typedSink blocks. This is frustrating because my code looks like all of the other code I have seen for this, yet it does not work. The test should fail, but everything runs successfully. Additionally, the logger inside the typedSink never outputs anything. Lastly, if you look at the output, you can see that the counters were handled correctly, so the job is running to completion. I have spent many hours trying new things but nothing seems to work. Hopefully the community will be able to help me. Thanks!
So, while I don't have a great answer to this post, I have what worked for me. Basically, my problem was that I was using ScalaTest to run my Scalding jobs, following this link: Using the ScalaTest Maven plugin. This worked fine for my unit tests on operations, but it caused weirdness when using ScalaTest with JobTest. After talking to the Scalding devs and finally acknowledging my team's own success with JUnitRunner, I decided to go with that. I changed my POM to support JUnitRunner and added @RunWith(classOf[JUnitRunner]) annotations to my tests. Everything worked and behaved like I wanted it to.
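For anyone who finds this later, the relevant change to the test class was just the annotation; a trimmed sketch (ScalaTest 2.x-era imports shown, test body elided):
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{FlatSpec, Matchers}

// JUnit drives the ScalaTest spec, so the JUnit/surefire machinery runs the JobTest
// instead of the ScalaTest Maven plugin.
@RunWith(classOf[JUnitRunner])
class MyJobTest extends FlatSpec with Matchers {
  "Integrate-Test: My Job" should "run test" in {
    // JobTest(...) setup exactly as in the question
  }
}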

Scala/Akka/ReactiveMongo: process does not terminate after system.shutdown()

I just started learning Scala and Akka, and now I am trying to develop an application that uses the ReactiveMongo framework to connect to a MongoDB server.
The problem is that when I call system.shutdown() in the end of my App object, the process does not terminate and just hangs forever.
I am now testing the case when there is no connection available, so my MongoDB server is not running. I have the following actor class for querying the database:
class MongoDb(val db: String, val nodes: Seq[String], val authentications: Seq[Authenticate] = Seq.empty, val nbChannelsPerNode: Int = 10) extends Actor with ActorLogging {

  def this(config: Config) = this(config.getString("db"), config.getStringList("nodes").asScala.toSeq,
    config.getList("authenticate").asScala.toSeq.map(c => {
      val l = c.unwrapped().asInstanceOf[java.util.HashMap[String, String]]; Authenticate(l.get("db"), l.get("user"), l.get("password"))
    }),
    config.getInt("nbChannelsPerNode"))

  implicit val ec = context.system.dispatcher

  val driver = new MongoDriver(context.system)
  val connection = driver.connection(nodes, authentications, nbChannelsPerNode)

  connection.monitor.ask(WaitForPrimary)(Timeout(30.seconds)).onFailure {
    case reason =>
      log.error("Waiting for MongoDB primary connection timed out: {}", reason)
      log.error("MongoDb actor kills itself as there is no connection available")
      self ! PoisonPill
  }

  val dbConnection = connection(db)
  val tasksCollection = dbConnection("tasks")
  val taskTargetsCollection = dbConnection("taskTargets")

  import Protocol._

  override def receive: Receive = {
    case GetPendingTask =>
      sender ! NoPendingTask
  }
}
My app class looks like this:
object HelloAkkaScala extends App with LazyLogging {

  import scala.concurrent.duration._

  // Create the 'helloakka' actor system
  val system = ActorSystem("helloakka")
  implicit val ec = system.dispatcher

  //val config = ConfigFactory.load(ConfigFactory.load.getString("my.helloakka.app.environment"))
  val config = ConfigFactory.load

  logger.info("Creating MongoDb actor")
  val db = system.actorOf(Props(new MongoDb(config.getConfig("my.helloakka.db.MongoDb"))))

  system.scheduler.scheduleOnce(Duration.create(60, TimeUnit.SECONDS), new Runnable() { def run() = {
    logger.info("Shutting down the system")
    system.shutdown()
    logger.info("System has been shut down!")
  }})
}
And the log output in my terminal looks like this:
[DEBUG] [08/07/2014 00:32:06.358] [run-main-0] [EventStream(akka://helloakka)] logger log1-Logging$DefaultLogger started
[DEBUG] [08/07/2014 00:32:06.358] [run-main-0] [EventStream(akka://helloakka)] Default Loggers started
00:32:06.443 INFO [run-main-0] [HelloAkkaScala$] - Creating MongoDb actor
00:32:06.518 DEBUG [helloakka-akka.actor.default-dispatcher-3] [reactivemongo.core.actors.MonitorActor] - Actor[akka://helloakka/temp/$a] is waiting for a primary... not available, warning as soon a primary is available.
00:32:06.595 DEBUG [helloakka-akka.actor.default-dispatcher-2] [reactivemongo.core.actors.MongoDBSystem] - Channel #-774976050 unavailable (ChannelClosed(-774976050)).
00:32:06.599 DEBUG [helloakka-akka.actor.default-dispatcher-2] [reactivemongo.core.actors.MongoDBSystem] - The entire node set is still unreachable, is there a network problem?
00:32:06.599 DEBUG [helloakka-akka.actor.default-dispatcher-2] [reactivemongo.core.actors.MongoDBSystem] - -774976050 is disconnected
00:32:08.573 DEBUG [helloakka-akka.actor.default-dispatcher-3] [reactivemongo.core.actors.MongoDBSystem] - ConnectAll Job running... Status: Node[localhost: Unknown (0/10 available connections), latency=0], auth=Set()
00:32:08.574 DEBUG [helloakka-akka.actor.default-dispatcher-3] [reactivemongo.core.actors.MongoDBSystem] - Channel #-73322193 unavailable (ChannelClosed(-73322193)).
00:32:08.575 DEBUG [helloakka-akka.actor.default-dispatcher-3] [reactivemongo.core.actors.MongoDBSystem] - The entire node set is still unreachable, is there a network problem?
00:32:08.575 DEBUG [helloakka-akka.actor.default-dispatcher-3] [reactivemongo.core.actors.MongoDBSystem] - -73322193 is disconnected
... (the last 3 messages repeated many times as per documentation the MongoDriver tries to re-connect with 2 seconds interval)
[ERROR] [08/07/2014 00:32:36.474] [helloakka-akka.actor.default-dispatcher-3] [akka://helloakka/user/$a] Waiting for MongoDB primary connection timed out: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://helloakka/user/$c#1684233695]] after [30000 ms]
[ERROR] [08/07/2014 00:32:36.475] [helloakka-akka.actor.default-dispatcher-3] [akka://helloakka/user/$a] MongoDb actor kills itself as there is no connection available
... (the same 3 messages repeated again)
00:32:46.461 INFO [helloakka-akka.actor.default-dispatcher-4] [HelloAkkaScala$] - Shutting down the system
00:32:46.461 INFO [helloakka-akka.actor.default-dispatcher-4] [HelloAkkaScala$] - Awaiting system termination...
00:32:46.465 WARN [helloakka-akka.actor.default-dispatcher-2] [reactivemongo.core.actors.MongoDBSystem] - MongoDBSystem Actor[akka://helloakka/user/$b#537715233] stopped.
00:32:46.465 DEBUG [helloakka-akka.actor.default-dispatcher-5] [reactivemongo.core.actors.MonitorActor] - Monitor Actor[akka://helloakka/user/$c#1684233695] stopped.
[DEBUG] [08/07/2014 00:32:46.468] [helloakka-akka.actor.default-dispatcher-2] [EventStream] shutting down: StandardOutLogger started
00:32:46.483 INFO [helloakka-akka.actor.default-dispatcher-4] [HelloAkkaScala$] - System has been terminated!
And after that the process hangs forever and never terminates. What am I doing wrong?
You aren't doing anything incorrect. This is a known issue.
https://github.com/ReactiveMongo/ReactiveMongo/issues/148