Debugging Futures and Iteratees in Play 2.2 - scala

I'm having some strange behavior in my Play 2.2 app and I'm not sure how to go about debugging it. My code was running fine until I started using iteratees.
My actor creates an enumerator like below and sends it back to the caller:
val emailFeed = Concurrent.unicast[Message] (
onStart = {
pushee => {
log.debug("Pushing 1")
pushee.push(messages.apply(0))
log.debug("Pushed 1")
}
},
onComplete = {
log.debug("Done with pushee")
},
onError = {
(msg, in) => log.error(msg)
}
)
The caller then consumes it like this:
f.map { reply =>
val emailFeed = reply.asInstanceOf[Enumerator[Message]]
val iter = Iteratee.fold[Message,String] ("") {
(result, msg) => {
Logger.debug("appending subj")
// Strings are immutable, so keep the concatenated value instead of discarding it
val appended = result ++ msg.getSubject()
Logger.debug(" subj appended")
appended
}
}
val result: Future[String] = emailFeed |>>> iter
Await.result(result, 10 minutes)
}
The problem is that Await.result times out. I have stepped through the iteratee code and can see it being called. There is only one chunk to process, so it is quick. I also see the enumerator and iteratee debug statements in the console. I just don't know why it doesn't complete:
[debug] application - Pushing 1
[debug] application - Pushed 1
[debug] application - appending subj
[debug] application - subj appended

Apparently, there isn't much feedback in Play if you forget to end your enumerator properly. You must call:
pushee.end
or
pushee.eofAndEnd
to trigger the enumerator to close and the Future to complete.
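To make that concrete, here is a minimal sketch of a unicast enumerator that ends its channel, assuming the Play 2.2 iteratee library (play.api.libs.iteratee) is on the classpath. The object name UnicastEndSketch and the String payload are made up for illustration; the question pushes Message objects from an actor instead.

```scala
import play.api.libs.iteratee.{Concurrent, Enumerator, Iteratee}
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object UnicastEndSketch {
  // Push one element, then signal EOF so that downstream iteratees can finish
  val feed: Enumerator[String] = Concurrent.unicast[String](
    onStart = channel => {
      channel.push("first subject")
      channel.eofAndEnd() // without this call the fold below never completes
    }
  )

  // Concatenate everything the enumerator produces
  val subjects: Iteratee[String, String] =
    Iteratee.fold[String, String]("")((acc, subject) => acc + subject)

  // Blocking is only for the demo; in a controller you would return the Future
  def run(): String = Await.result(feed |>>> subjects, 10.seconds)
}
```

If the eofAndEnd() call is removed, the same run() should hit the timeout, which matches the behavior described in the question.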

Apache Spark: how to cancel job in code and kill running tasks?

I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1,2,3)
// set up an infinite action
val future = sc.parallelize(lst).map { _ => while (true) {} }.collectAsync()
try {
Await.result(future, Duration(30, TimeUnit.SECONDS))
println("success!")
} catch {
case _:Throwable =>
future.cancel()
println("timeout")
}
// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60*60*1000)
sc.stop()
The timeout is set for 30 seconds, but of course the computation is infinite, and so Awaiting on the result of the future will throw an Exception, which will be caught and then the future will be canceled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark Job/Stage from within the Spark application, we simply logged the stage ID of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI used for killing said stages.
I don't know if this answers your question.
My need was to kill jobs that hang for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker;
val sparkListener = new SparkListener()
{
override def onJobStart(jobStart : SparkListenerJobStart)
{
val jobId = jobStart.jobId
val f = Future
{
var c = MAX_JOB_SECONDS;
var mustCancel = false;
var running = true;
while(!mustCancel && running)
{
Thread.sleep(1000);
c = c - 1;
mustCancel = c <= 0;
// getJobInfo returns an Option, so a null check never fires; treat a missing job as finished
running = statusTracker.getJobInfo(jobId).exists(_.status() == JobExecutionStatus.RUNNING)
}
if(mustCancel)
{
sc.cancelJob(jobId)
}
}
}
}
sc.addSparkListener(sparkListener)
try
{
val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") //just an example of long-running-job
println(df.count)
}
catch
{
case exc: org.apache.spark.SparkException =>
{
if(exc.getMessage.contains("cancelled"))
throw new Exception("Job forcibly cancelled")
else
throw exc
}
case ex : Throwable =>
{
println(s"Another exception: $ex")
}
}
finally
{
sc.removeSparkListener(sparkListener)
}
For the sake of future visitors: Spark introduced the task reaper in 2.0.3, which addresses this scenario (more or less) and is a built-in solution.
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to Spark.
For version 1.6.0, Zohar's solution is a "messy but efficient" one.
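As a rough sketch of what enabling the task reaper might look like, the snippet below sets the spark.task.reaper.* properties on a SparkConf. The object name ReaperConf and the interval and timeout values are arbitrary choices for illustration, not recommendations, and the properties only have an effect on Spark versions that ship the reaper.

```scala
import org.apache.spark.SparkConf

object ReaperConf {
  def build(): SparkConf = new SparkConf()
    .setAppName("TIMEOUT_TEST")
    // Monitor killed tasks until they actually stop running
    .set("spark.task.reaper.enabled", "true")
    // How often the reaper checks whether a killed task is still alive
    .set("spark.task.reaper.pollingInterval", "10s")
    // After this long, the executor JVM kills itself if the task has not stopped
    .set("spark.task.reaper.killTimeout", "120s")
}
```

The killTimeout setting is the one that leads to the executor being killed when a task does not respond, so it is worth choosing with care.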
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So, with interruptOnCancel set to true for the job group, the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { _ => while (!Thread.interrupted()) {} }.collectAsync()
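The effect of Thread.interrupt() on such a loop can be seen outside Spark with plain JVM threads. This is only an illustration of the mechanism, not Spark code; InterruptibleLoop and spin are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicLong

object InterruptibleLoop {
  // Starts a busy loop that checks the interrupt flag on every iteration
  def spin(counter: AtomicLong): Thread = {
    val worker = new Thread(new Runnable {
      def run(): Unit =
        while (!Thread.currentThread().isInterrupted) counter.incrementAndGet()
    })
    worker.setDaemon(true)
    worker.start()
    worker
  }

  def main(args: Array[String]): Unit = {
    val counter = new AtomicLong(0)
    val worker = spin(counter)
    Thread.sleep(100)
    worker.interrupt() // comparable to what job cancellation does when interruptOnCancel is true
    worker.join(1000)
    println(s"worker alive after interrupt: ${worker.isAlive}")
  }
}
```

A loop written as while (true) never looks at the flag, so the interrupt has no effect on it; that is why the task keeps running after the job is cancelled.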

continuously fetch database results with scalaz.stream

I'm new to Scala and extremely new to scalaz. Through a different Stack Overflow answer and some handholding, I was able to use scalaz.stream to implement a Process that continuously fetches Twitter API results. Now I'd like to do the same thing for the Cassandra DB where the Twitter handles are stored.
The code for fetching the twitter results is here:
def urls: Seq[(Handle,URL)] = {
Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
}
val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
url => Task.delay {
val finalResult = callTwitter(url)
if (finalResult.tweets.nonEmpty) {
connection.updateTwitter(finalResult)
} else {
println("\n" + finalResult.handle + " does not have new tweets")
}
s"\ntwitter Fetch & database update completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url).
through(fetchUrl)
val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle,URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor to run the above twitter fetching code.
My question is: what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then wait before getting ALL the database results again. Should I use the same architecture as the Twitter fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance
Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the twitter processing. What I actually have is a bit more complex as it retrieves the database results continuously and sends them off to an actor for the twitter processing. The style of retrieving the results follows my original code from my question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")
def myEffect = channel.lift[Task, simpleClient, String]{
connection: simpleClient => Task.delay{
val results = Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
println("Query Successful, results= " +results +" at " + format.print(System.currentTimeMillis()))
twitterFetch ! fetched(connection, results)
s"database fetch completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the cassandra connection.
In the line
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
I changed zipWith to flatMap because I wanted to retrieve the results continuously instead of once.
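For comparison, the "cleanest way" mentioned at the start of this answer (emit the database results as a stream and attach a sink) might look roughly like the sketch below. It assumes a scalaz-stream version that provides time.awakeEvery and sink.lift, as the code above implies; fetchAll and handleRows are hypothetical stand-ins for the Cassandra query and the Twitter processing.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService}
import scala.concurrent.duration._
import scalaz.concurrent.Task
import scalaz.stream.{Process, Sink, sink, time}

object PollingSketch {
  // awakeEvery needs a scheduler in implicit scope
  implicit val scheduler: ScheduledExecutorService = Executors.newScheduledThreadPool(2)

  // Hypothetical stand-in for the Cassandra query returning all rows
  def fetchAll: Task[Seq[String]] = Task.delay(Seq("handleA", "handleB"))

  // Hypothetical stand-in for the Twitter processing step
  val handleRows: Sink[Task, Seq[String]] =
    sink.lift[Task, Seq[String]](rows => Task.delay(println(s"processing $rows")))

  // On every tick, run the query once and feed the whole result set to the sink
  def polling(interval: FiniteDuration): Process[Task, Unit] =
    time.awakeEvery(interval).flatMap(_ => Process.eval(fetchAll)).to(handleRows)
}
```

Something like PollingSketch.polling(3.seconds).run.run would then poll until the process is interrupted.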

Consuming a service using WS in Play

I was hoping someone can briefly go over the various ways of consuming a service (this one just returns a string, normally it would be JSON but I just want to understand the concepts here).
My service:
def ping = Action {
Ok("pong")
}
Now, in my Play (2.3.x) application, I want to call this service and display the response.
Since the call returns a Future, I want to display the value once it is available.
I am a bit confused about all the ways I could call this method. For example, some of the approaches I have seen use Success/Failure:
val futureResponse: Future[String] = WS.url(url + "/ping").get().map { response =>
response.body
}
var resp = ""
futureResponse.onComplete {
case Success(str) => {
Logger.trace(s"future success $str")
resp = str
}
case Failure(ex) => {
Logger.trace(s"future failed")
resp = ex.toString
}
}
Ok(resp)
I can see the trace in STDOUT for success/failure, but my controller action just returns "" to my browser.
I understand that this is because it returns a FUTURE and my action finishes before the future returns.
How can I force it to wait?
What options do I have with error handling?
If you really want to block until the future is completed, look at the Await.ready() and Await.result() methods. But you shouldn't.
The point of a Future is that you tell it how to use the result once it arrives and then move on, with no blocking required.
A Future can be the result of an Action; in that case the framework takes care of it:
def index = Action.async {
WS.url(url + "/ping").get()
.map(response => Ok("Got result: " + response.body))
}
Look at the documentation, it describes the topic very well.
As for error handling, you can use the Future.recover() method. You tell it what to return in case of an error, and it gives you a new Future that you return from your action.
def index = Action.async {
WS.url(url + "/ping").get()
.map(response => Ok("Got result: " + response.body))
.recover{ case e: Exception => InternalServerError(e.getMessage) }
}
So the basic way to consume a service is to get the resulting Future, transform it the way you want using monadic methods (methods that return a new, transformed Future, such as map and recover), and return it as the result of an Action.
You may also want to look at the questions "Play 2.2 - Scala - How to chain Futures in Controller Action" and "Dealing with failed futures".
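The same map/recover chaining can be tried without Play, using only the standard library. In this sketch callService is a hypothetical stand-in for WS.url(...).get(), and the returned String stands in for the Result.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureChaining {
  // Hypothetical stand-in for the WS call: succeeds or fails on demand
  def callService(fail: Boolean): Future[String] =
    if (fail) Future.failed(new RuntimeException("service unavailable"))
    else Future.successful("pong")

  // Transform the success case with map, turn a failure into a value with recover
  def render(fail: Boolean): Future[String] =
    callService(fail)
      .map(body => s"Got result: $body")
      .recover { case e: Exception => s"Error: ${e.getMessage}" }

  def main(args: Array[String]): Unit = {
    // Await is used here only so the demo can print before the JVM exits
    println(Await.result(render(fail = false), 1.second))
    println(Await.result(render(fail = true), 1.second))
  }
}
```

Nothing in render blocks; the only blocking is in main, which plays the role the framework plays for Action.async.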

spray and actor non deterministic tests

Hello,
First of all, I would like to apologize for my English :)
akka=2.3.6
spray=1.3.2
scalatest=2.2.1
I have run into strange behavior when testing routes that ask actors inside a handleWith directive.
I have a route with a handleWith directive:
pathPrefix("firstPath") {
pathEnd {
get(complete("Hello from this api")) ~
post(handleWith { (data: Data) =>{ println("receiving data")
(dataCalculator ? data).collect {
case Success(_) =>
Right(Created -> "")
case throwable: MyInternalValidatationException =>
Left(BadRequest -> s"""{"${throwable.subject}" : "${throwable.cause}"}""")
}
}})
}
}
and a simple actor which always responds when it receives a Data object. Its receive block is wrapped in LoggingReceive, so I should see logs whenever the actor receives a message.
I test it with the following (I think simple) code:
class SampleStarngeTest extends WordSpec with ThisAppTestBase with OneInstancePerTest
with routeTestingSugar {
val url = "/firstPath/"
implicit val routeTestTimeout = RouteTestTimeout(5 seconds)
def postTest(data: String) = Post(url).withJson(data) ~> routes
"posting" should {
"pass" when {
"data is valid and comes from the identified user" in {
postTest(correctData.copy(createdAt = System.currentTimeMillis()).asJson) ~> check {
print(entity)
status shouldBe Created
}
}
"report is valid and comes from the anonymous" in {
postTest(correctData.copy(createdAt = System.currentTimeMillis(), adid = "anonymous").asJson) ~> check {
status shouldBe Created
}
}
}
}
}
The behavior:
When I run all the tests in the package (using IntelliJ IDEA 14 Ultimate) or run sbt test, I get the same results:
one execution -> all tests pass
the next one -> not all pass. For the tests that fail, I can see:
1. a failure because "Request was neither completed nor rejected within X seconds" (X up to 60)
2. console output from the route, from the line post(handleWith { (data: Data) =>{ println("receiving data"), so the code in handleWith was executed
3. an ask timeout exception from the route code, but not always (among the failed tests)
4. no logs from the actor's LoggingReceive, so the actor has no chance to respond
5. when I rerun the tests, the results differ again from the previous run
Is there a problem with threading? Or with the test modules, or thread blocking inside the libraries? Or something else? I have no idea why it doesn't work :(

Why does scheduleOnce run every time on app shutdown?

I have a Play 2 app with code something like the following.
I run the app in both start mode and run mode.
val calendar = Calendar.getInstance()
calendar.set(Calendar.HOUR_OF_DAY, 19)
calendar.set(Calendar.MINUTE, 0)
calendar.set(Calendar.SECOND, 0)
calendar.set(Calendar.MILLISECOND, 0)
val now = new Date()
val timeDifference = calendar.getTime.getTime - now.getTime
if (timeDifference >= 0) {
val initialDelay = Duration.create(timeDifference, duration.MILLISECONDS)
Akka.system.scheduler.scheduleOnce(initialDelay, new Runnable {
override def run() {
LOGGER.error(System.lineSeparator() +
"=================================================" + System.lineSeparator() +
"| Forming and send report |" + System.lineSeparator() +
"=================================================" + System.lineSeparator()
)
}
})
}
Why does scheduleOnce run on shutdown?
The behavior you are seeing here actually affects all tasks scheduled in the default scheduler (LightArrayRevolverScheduler), whether they were scheduled with schedule or scheduleOnce. If you look at the close method on LightArrayRevolverScheduler, you will see this logic:
override def close(): Unit = Await.result(stop(), getShutdownTimeout) foreach {
task =>
try task.run() catch {
case e: InterruptedException => throw e
case _: SchedulerException => // ignore terminated actors
case NonFatal(e) => log.error(e, "exception while executing timer task")
}
}
So basically, it looks like the scheduler collects all outstanding tasks during shutdown and executes them serially. I suppose this would not be a big deal if your scheduling only sent messages to actors (as opposed to using a Runnable), as those actors would hopefully be terminated by this point and nothing would happen (dead letters).
If you want to avoid this behavior, you could store the result of the call to scheduleOnce and explicitly cancel it first before you initiate your shutdown process.
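A minimal sketch of that idea, assuming Akka's Scheduler and Cancellable: keep the handle returned by scheduleOnce and cancel it from whatever shutdown hook your application has. ReportTimer is a hypothetical helper, and where you call cancel() (for example a Play stop hook) is up to you.

```scala
import akka.actor.{Cancellable, Scheduler}
import scala.concurrent.ExecutionContext
import scala.concurrent.duration.FiniteDuration

class ReportTimer(scheduler: Scheduler)(implicit ec: ExecutionContext) {
  // Handle of the task that has been scheduled but has not run yet
  @volatile private var pending: Option[Cancellable] = None

  // Schedule the task once and remember its handle
  def start(delay: FiniteDuration)(task: => Unit): Unit =
    pending = Some(scheduler.scheduleOnce(delay)(task))

  // Call this before the actor system shuts down so the task is not run on close
  def cancel(): Unit = {
    pending.foreach(_.cancel())
    pending = None
  }
}
```

With the code from the question, this would mean something like new ReportTimer(Akka.system.scheduler) at startup, start(initialDelay) { ... } for the report, and cancel() before shutdown.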