Spark Streaming store method only works within the Duration window but not during the foreachRDD workflow in a custom receiver - scala

I defined a receiver to read data from Redis.
Part of the receiver code, simplified:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// simplified; the element type String is an assumption
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    while (!isStopped) {
      val res = readMethod()
      if (res != null) store(res.toIterator)
      // using res.foreach(r => store(r)) the performance is almost the same
    }
  }
  override def onStop(): Unit = {}
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)

s.foreachRDD { r =>
  r.persist()
  if (!r.isEmpty) {
    // some short operations, about 1 s in total
    // note this line ######1
  }
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis now; I tested with 10000 records. While debugging, all records could be read quickly by readMethod() above once they were in Redis. However, in each microbatch I can only get 30 records. (If store were fast enough, it should get all 10000.)
With this suspicion, I added a 10-second sleep, Thread.sleep(10000), at ######1 above. Each microbatch still gets about 30 records, and each microbatch's processing time increases by 10 seconds. And if I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
Does all of this show that Spark Streaming only generates an RDD per Duration, and that once the RDD is handed to the main workflow, the store method is temporarily stopped? If true, this is a great waste. I want it to keep generating RDDs (via store) while the main workflow is running.
Any ideas?

I cannot leave a comment simply because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
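For reference, that rate limit is normally set on the SparkConf. A hedged sketch follows; the value 600 is purely hypothetical, but it would match the numbers above (600 records/s × 0.05 s ≈ 30 per 50 ms batch, 600 × 0.2 s ≈ 120 per 200 ms batch):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, StreamingContext}

// Hypothetical configuration; check your SparkConf / spark-defaults.conf for this key.
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.receiver.maxRate", "600") // each receiver stores at most 600 records/second
val ssc = new StreamingContext(conf, new Duration(50))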

Related

Does a Flink streaming job maintain its keyed value state between job runs?

Our use case is to use Flink streaming for a de-duplication job, which reads its data from a source (Kafka topic) and writes unique records into an HDFS file sink.
The Kafka topic could have duplicate data, which can be identified by using the composite key
(adserver_id, unix_timestamp of the record)
so I decided to use a Flink keyed state stream to achieve de-duplication.
// assuming the Kafka consumer deserializes records into DCNRecord
val messageStream: DataStream[DCNRecord] = env.addSource(flinkKafkaConsumer)

messageStream
  .map { record =>
    val key = record.adserver_id.get + record.event_timestamp.get
    (key, record)
  }
  .keyBy(_._1)
  .flatMap(new DedupDCNRecord())
  .map(_.toString)
  .addSink(sink)

// execute the stream
env.execute(applicationName)
Here is the de-duplication code, using Flink's value state.
class DedupDCNRecord extends RichFlatMapFunction[(String, DCNRecord), DCNRecord] {
  private var operatorState: ValueState[String] = _

  override def open(configuration: Configuration): Unit = {
    operatorState = getRuntimeContext.getState(
      DedupDCNRecord.descriptor
    )
  }

  @throws[Exception]
  override def flatMap(value: (String, DCNRecord), out: Collector[DCNRecord]): Unit = {
    if (operatorState.value == null) { // we haven't seen this key yet
      out.collect(value._2)
      // remember the key so that we don't emit elements with this key again
      operatorState.update(value._1)
    }
  }
}
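The DedupDCNRecord.descriptor used in open() comes from a companion object that is not shown in the question. A minimal sketch of what such a descriptor presumably looks like (the state name "seen-key" is an assumption):

import org.apache.flink.api.common.state.ValueStateDescriptor

object DedupDCNRecord {
  // ValueStateDescriptor for the key we have already seen; the name "seen-key" is an assumption
  val descriptor = new ValueStateDescriptor[String]("seen-key", classOf[String])
}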
This approach works fine as long as the streaming job is running, maintaining the set of unique keys in valueState and performing de-duplication.
But as soon as I cancel the job, Flink loses the state (the unique keys seen in the previous run) and only keeps the unique keys of the current run, letting through records that were already processed in the previous run of the job.
Is there a way to force Flink to maintain the valueState (unique keys) seen so far?
Appreciate your help.
This requires you to capture a snapshot of the state before shutting down the job, and then restart from that snapshot:
1. Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
2. Relaunch, using the savepoint as the starting point.
For a step-by-step tutorial, see "Upgrading & Rescaling a Job" in the Flink Operations Playground. The section on "Observing Failure & Recovery" is also relevant here.

Handling an infinite loop in an Apache Kafka consumer

I am new to Kafka. I want to implement a Kafka message pipeline for our Play-Scala project. I created a topic into which records are inserted, and I wrote consumer code like this:
val recordBatch = new ListBuffer[CountryModel]
consumer.subscribe(Arrays.asList("session-queue"))

while (true) {
  val records = consumer.poll(1000)
  val iter = records.iterator()
  while (iter.hasNext()) {
    val record: ConsumerRecord[String, String] = iter.next()
    recordBatch += Json.parse(record.value()).as[CountryModel]
  }
  processData(recordBatch)
  // Thread.sleep(5 * 1000)
}
But after a certain time the service stops: the processor goes to 100% and eventually the machine halts. How can I handle this infinite loop?
In a production environment I cannot rely on this loop. I tried sleeping the thread for some time, but that is not an elegant solution.
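One thing visible in the snippet itself, assuming processData does not drain the buffer, is that recordBatch is never cleared, so it grows on every poll. A minimal sketch of the same loop that clears the batch once it has been processed:

while (true) {
  val records = consumer.poll(1000)
  val iter = records.iterator()
  while (iter.hasNext()) {
    val record: ConsumerRecord[String, String] = iter.next()
    recordBatch += Json.parse(record.value()).as[CountryModel]
  }
  if (recordBatch.nonEmpty) {
    processData(recordBatch)
    recordBatch.clear() // release the processed records so the buffer does not grow forever
  }
}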

Why memory sink writes nothing in append mode?

I used Spark's Structured Streaming to stream messages from Kafka. The data was then aggregated and written to a memory sink in append mode. However, when I tried to query the memory table, it returned nothing. Below is the code:
Dataset<Row> result = model
    .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "5 minutes")
    .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
    .agg(count("*").alias("total"));

// writing to memory
StreamingQuery query = result.writeStream()
    .outputMode(OutputMode.Append())
    .queryName("datatable")
    .format("memory")
    .start();

// query data in memory
new Timer().scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        sparkSession.sql("SELECT * FROM datatable").show();
    }
}, 10000, 10000);
The result is always:
+------+-----+
|window|total|
+------+-----+
+------+-----+
If I use outputMode = complete, I can get the aggregated data, but that is not an option here, as the requirement is to use append mode.
Is there any problem with the code?
Thanks,
In append mode, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark().
In your case the delay is 5 minutes. I know nothing about your input data, but I guess you probably need to wait at least 5 minutes (plus the window length, since a window is only emitted once the watermark passes its end) before anything appears in the sink.
I suggest you read (again?) the Structured Streaming docs on watermarking and output modes.
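To make that concrete, here is a small Scala sketch using the built-in rate source that reproduces the same behavior; the session setup and query name are just examples:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, window}

val spark = SparkSession.builder().appName("append-mode-demo").master("local[*]").getOrCreate()

// the built-in "rate" source produces rows with columns `timestamp` and `value`
val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val counts = stream
  .withWatermark("timestamp", "5 minutes")
  .groupBy(window(col("timestamp"), "5 minutes").alias("window"))
  .agg(count("*").alias("total"))

counts.writeStream
  .outputMode("append")
  .queryName("rate_counts")
  .format("memory")
  .start()

// "SELECT * FROM rate_counts" stays empty for roughly window length + watermark delay
// (~10 minutes here): in append mode a window is only written once the watermark
// passes the end of that window.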

Apache Spark: how to cancel job in code and kill running tasks?

I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration

val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1, 2, 3)

// setting up an infinite action
val future = sc.parallelize(lst).map { x => while (true) {}; x }.collectAsync()

try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    future.cancel()
    println("timeout")
}

// sleep for 1 hour to allow inspecting the application in YARN
Thread.sleep(60 * 60 * 1000)
sc.stop()
The timeout is set for 30 seconds, but of course the computation is infinite, and so Awaiting on the result of the future will throw an Exception, which will be caught and then the future will be canceled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark job/stage from within the Spark application, we simply logged the stage IDs of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI for killing said stages.
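A rough sketch of that workaround; the driver host/port and the kill path are assumptions modeled on the kill link in the Web UI (which requires spark.ui.killEnabled=true), so copy the exact URL from your own Spark version's Stages page:

import scala.io.Source

// IDs of the stages that are currently running
val activeStages = sc.statusTracker.getActiveStageIds

activeStages.foreach { stageId =>
  // path and parameters are version-dependent assumptions; mirror your UI's own kill link
  val url = s"http://localhost:4040/stages/stage/kill/?id=$stageId&terminate=true"
  Source.fromURL(url).mkString
}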
I don't know if this answers your question.
My need was to kill jobs that hang for too long (my jobs extract data from Oracle tables, and for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.JobExecutionStatus
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker

val sparkListener = new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val jobId = jobStart.jobId
    // watchdog: polls the job status once a second and cancels it after MAX_JOB_SECONDS
    val f = Future {
      var c = MAX_JOB_SECONDS
      var mustCancel = false
      var running = true
      while (!mustCancel && running) {
        Thread.sleep(1000)
        c = c - 1
        mustCancel = c <= 0
        val jobInfo = statusTracker.getJobInfo(jobId) // returns an Option
        if (jobInfo.isDefined) {
          val v = jobInfo.get.status()
          running = v == JobExecutionStatus.RUNNING
        } else {
          running = false
        }
      }
      if (mustCancel) {
        sc.cancelJob(jobId)
      }
    }
  }
}

sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled"))
      throw new Exception("Job forcibly cancelled")
    else
      throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors, Spark has included the task reaper since version 2.0.3, which addresses this scenario (more or less) as a built-in solution.
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to cancellation.
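For reference, a hedged sketch of how the task reaper is enabled through configuration; the timeout value is an example:

import org.apache.spark.SparkConf

// Enable the task reaper so that killed/interrupted tasks are actively monitored.
val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")
  // Example value: if a killed task is still running after 2 minutes, the executor JVM is killed.
  .set("spark.task.reaper.killTimeout", "120s")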
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
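For completeness, a rough sketch of how the job group itself would be registered and cancelled around that action; the group name "timeout-group" is a placeholder:

// register a cancellable job group before starting the action;
// interruptOnCancel = true makes cancellation call Thread.interrupt() on the task threads
sc.setJobGroup("timeout-group", "long computation with timeout", interruptOnCancel = true)

// ... run the interruptible action from above, and on timeout:
sc.cancelJobGroup("timeout-group") // interrupts the running tasks instead of only blocking new ones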

Why don't Scala futures work faster even with more threads in the thread pool?

I have the following algorithm in Scala:
1. Do an initial call to the DB to initialize the cursor.
2. Get 1000 entities from the DB (returns a Future).
3. For every entity, send one additional request to the database and get the modified entity (returns a Future).
4. Transform the original entity.
5. Put the transformed entity into the Future callback from step 3.
6. Wait for all Futures.
In Scala it is something like:
val client = ...
val size = 1000

val init = client.firstSearch(size) // request over network, returns a Future
val initResult = Await.result(init, 30.seconds)

var cursorId: String = initResult.getCursorId
while (!cursorId.isEmpty) {
  val page = client.grabWithSize(cursorId) // request over network, returns a Future of a response
  val batch = page.flatMap { response =>
    val futures: Seq[Future[Boolean]] = response.getAllResults.map { result =>
      val grabbedOne: Future[Entity] = client.grabOneEntity(result.id) // request over network
      val resultMap: Map[String, Any] = buildMap(result)
      val transformed: Map[String, Any] = transform(resultMap) // no Future here
      grabbedOne.map { grabbed =>
        buildMap(grabbed) == transformed
      }
    }
    Future.sequence(futures).map(_ => response.getNewCursorId)
  }
  cursorId = Await.result(batch, 30.seconds) // wait for all Futures of this page before moving on
}

def buildMap(...): Map[String, Any] // sync call
I noticed that if I increase size, say by a factor of two, every iteration of the while loop takes about 1.5 times longer. But I do not see my PC's processor being loaded more; it stays near zero, yet the time still increases by ~1.5x. Why? I have set up:
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1024))
I think that not all Futures are executed in parallel. But why? And how do I fix it?
I see that in your code the Futures don't block each other, so it's more likely the database that is the bottleneck.
Is it possible to do a SQL join, so that you make O(1) rather than O(n) database calls? (If you're using Slick, have a look at the joins part of the queries section.)
If the CPU load is low, it's probably that the connection pool is maxed out; you'd need to increase it for the database and check the network.
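To illustrate the point about reducing per-entity round trips (a bulk fetch rather than a SQL join, but the idea is the same), here is a rough sketch; the grabManyEntities method is hypothetical and not part of the question's client:

// Hypothetical bulk variant: one round trip per page instead of one per entity.
val ids: Seq[String] = response.getAllResults.map(_.id)
val grabbedAll: Future[Seq[Entity]] = client.grabManyEntities(ids) // hypothetical bulk API

val checks: Future[Seq[Boolean]] = grabbedAll.map { entities =>
  entities.zip(response.getAllResults).map { case (grabbed, original) =>
    buildMap(grabbed) == transform(buildMap(original))
  }
}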