Celery single task persistent data

Let's say a single task is enough to keep a machine very busy for a few minutes.
I want to get the result of the task and then, depending on that result, have the worker perform the same task again.
The question I cannot find an answer to is this: can I keep data in memory on the worker machine in order to use it in the next task?

Yes you can. The documentation (http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation) is a bit vague and I'm not sure if this is the best way, but you can do something like this:
This is what you'd like to do, but it doesn't work:
# This doesn't work
a = 0

@celery.task
def mytask(x):
    a += x  # raises UnboundLocalError; module-level state is not usable this way
    return a
This is how to make it work:
from celery import Task, registry

class MyTask(Task):
    def __init__(self):
        self.a = 0

    def run(self, x):
        self.a += x
        return self.a

# Task subclasses are registered automatically; fetch the global instance
mytask = registry.tasks[MyTask.name]
According to the docs:
A task is not instantiated for every request, but is registered in the task registry as a global instance. This means that the __init__ constructor will only be called once per process, and that the task class is semantically closer to an Actor. ... If you have a task, ... and you route every request to the same process, then it will keep state between requests.
I've had success with this but I don't know if there are better methods.
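For illustration, here is a minimal sketch of how that per-process state behaves when the task is invoked repeatedly. The app name, broker URL, and register_task wiring are assumptions for a Celery 4-style setup, and the results shown assume both calls land on the same worker process:
from celery import Celery, Task

app = Celery('demo', broker='redis://localhost:6379/0')  # hypothetical broker

class Accumulator(Task):
    name = 'demo.accumulate'

    def __init__(self):
        self.a = 0  # lives for the lifetime of the worker process

    def run(self, x):
        self.a += x
        return self.a

accumulate = app.register_task(Accumulator())

# Both calls routed to the same worker process hit the same instance:
# accumulate.delay(1).get()  # -> 1
# accumulate.delay(2).get()  # -> 3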

Related

How to make a single and batched RPC call in Apache Beam

In my pipeline I have to make a single RPC call, as well as a batched RPC
call, to fetch data for enrichment. I could not find any reference on how
to make these calls within a pipeline. I am still covering my grounds in
Apache Beam and would appreciate it if anyone who has done this could share
sample code or details on how to do it.
Thanks.
Setting up the answer
I will help by writing example DoFns that do the things you want. I will write them in Python, but the Java code would be similar.
Let's suppose we have two functions that internally perform an RPC:
def perform_rpc(client, element):
    ...  # some code to run one RPC for the element using the client

def perform_batched_rpc(client, elements):
    ...  # some code that runs a single RPC for a batch of elements using the client
Let's also suppose that you have a function create_client() that returns a client for your external system. We assume that creating this client is somewhat expensive, and that it is not possible to maintain many clients in a single worker (due to e.g. memory constraints).
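As a concrete stand-in, create_client() might look something like this; the ExternalClient class and endpoint are made up purely for illustration:
class ExternalClient:
    """Hypothetical client for the external enrichment service."""
    def __init__(self, endpoint):
        self.endpoint = endpoint  # assume an expensive connection/auth handshake here

def create_client():
    return ExternalClient(endpoint='enrichment.example.com:443')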
Performing a single RPC per element
It is usually fine to perform a blocking RPC for each element, but this may lead to low CPU usage:
from apache_beam import DoFn

class IndividualBlockingRpc(DoFn):
    def setup(self):
        # Create the client only once per DoFn instance
        self.client = create_client()

    def process(self, element):
        perform_rpc(self.client, element)
If you would like to be more sophisticated, you could also try to run asynchronous RPCs by buffering. Note that in this case your client would need to be thread-safe:
from concurrent.futures import ThreadPoolExecutor
from apache_beam import DoFn

MAX_BUFFER_SIZE = 100  # example value; tune it to your RPC's batch limits

class AsyncRpcs(DoFn):
    def __init__(self):
        self.buffer = []
        self.client = None
        self.executor = None

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) > MAX_BUFFER_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()

    def _flush(self):
        if not self.client:
            self.client = create_client()
        if not self.executor:
            self.executor = ThreadPoolExecutor()  # use a configured executor
        futures = [self.executor.submit(perform_rpc, self.client, elm)
                   for elm in self.buffer]
        for f in futures:
            f.result()  # finalize all the futures
        self.buffer = []
Performing a single RPC for a batch of elements
For most runners, a batch pipeline has large bundles. This means that it makes sense to simply buffer elements as they come into process, and to flush them every now and then, like so:
class BatchAndRpc(DoFn):
    def __init__(self):
        self.buffer = []
        self.client = None

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) > MAX_BUFFER_SIZE:
            self._flush()

    def finish_bundle(self):
        self._flush()

    def _flush(self):
        if not self.client:
            self.client = create_client()
        perform_batched_rpc(self.client, self.buffer)
        self.buffer = []
For streaming pipelines, or for pipelines where your bundles are not large enough for this strategy to work well, you may need to try other tricks, but this strategy should be enough for most scenarios.
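To tie these together, here is a minimal sketch of wiring one of the DoFns above into a pipeline, reusing the definitions from earlier; the Create input is a placeholder for whatever source you actually read from:
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create(['id1', 'id2', 'id3'])  # placeholder input
     | 'EnrichViaRpc' >> beam.ParDo(BatchAndRpc()))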
If these strategies don't work, please let me know, and I'll detail others.

How to throttle the execution of futures?

I have basically a list of unique ids, and for every id I make a call to a function which returns a future.
The problem is that the number of futures in a single call is variable.
list.map(id -> futureCall)
There will be too much parallelism, which can affect my system. I want to configure the number of futures executed in parallel.
I want a testable design, so I can't do this. After searching a lot, I found this. I didn't get how to use it, and when I tried it, it didn't work.
After that I just imported it in the class where I make the call, used the same snippet, and set the default maxConcurrent to 4. I replaced the imported global execution context with ThrottledExecutionContext.
You have to wrap your ExecutionContext with ThrottledExecutionContext.
Here is a little sample:
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object TestApp extends App {
  // ThrottledExecutionContext is the wrapper from the snippet linked in the question
  implicit val ec = ThrottledExecutionContext(maxConcurrents = 10)(ExecutionContext.global)

  def futureCall(id: Int) = Future {
    println(s"executing $id")
    Thread.sleep(500)
    id
  }

  val list = 1 to 1000
  val results = list.map(futureCall)
  Await.result(Future.sequence(results), 100.seconds)
}
Alternatively you can also try a FixedThreadPool:
implicit val ec = ExecutionContext.fromExecutor(java.util.concurrent.Executors.newFixedThreadPool(10))
I am not sure what you are trying to do here. The default global ExecutionContext uses as many threads as you have CPU cores, so that is your parallelism. If that's still "too many" for you, you can control the number with the system property "scala.concurrent.context.maxThreads" and set it to a lower value.
That will be the maximum number of futures executed in parallel at any given time. You should not need to throttle anything explicitly.
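As a sketch of that approach (the property must be set before the global context is first touched, so passing -Dscala.concurrent.context.maxThreads=4 on the command line is the safer route):
object MaxThreadsDemo extends App {
  // Works only if set before ExecutionContext.global is initialized
  sys.props("scala.concurrent.context.maxThreads") = "4"

  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.Future

  // At most 4 of these run in parallel at any given time
  (1 to 16).foreach(i => Future(println(s"running $i on ${Thread.currentThread.getName}")))
  Thread.sleep(2000) // crude wait so the demo prints before the JVM exits
}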
Alternatively, you can create your own executor and give it a BlockingQueue with limited capacity. That would block on the producer side (when a work item is being submitted), like your implementation does, but I would very strongly advise against it: it is dangerous, prone to deadlocks, and much less efficient than the default ForkJoinPool implementation.

What can cause Akka's Scheduler to execute scheduled tasks before the scheduled time?

I'm experiencing a strange behaviour when using Akka's scheduler. My code looks roughly like this:
import java.util.{Calendar, GregorianCalendar, TimeZone}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._

val s = ActorSystem("scheduler")

def doSomething(): Future[Unit] = Future {
  val now = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  println(s"${now.get(Calendar.MINUTE)}:${now.get(Calendar.SECOND)}:${now.get(Calendar.MILLISECOND)}")
  // Do many things that include an http request using "dispatch" and
  // manipulation of the response and saving it in a file.
}

val futures: Seq[Future[Unit]] = for (i <- 1 to 500) yield {
  println(s"$i : ${i * 600}")
  // AlphaVantage recommends 100 API calls per minute
  akka.pattern.after((i * 600).milliseconds, s.scheduler) { doSomething() }
}

Future.sequence(futures).onComplete(_ => s.terminate())
When I execute my code, doSomething is initially called repeatedly with 600 milliseconds between successive calls, as expected. However, after a while, all remaining scheduled calls are suddenly executed simultaneously.
I suspect that something inside my doSomething might be interfering with the scheduling, but I don't know what. My doSomething just makes an http request using dispatch and manipulates the result; it does not interact directly with Akka or the scheduler in any way. So, my question is:
What can cause the Scheduler's schedule to fail and suddenly trigger the immediate execution of all remaining scheduled tasks?
(PS: I tried to simplify my doSomething to post a minimal non-working example here, but my simplifications resulted in working examples.)
OK, I figured it out. Future.sequence fails fast: as soon as one of the futures fails, the line
Future.sequence(futures).onComplete(_ => s.terminate())
will terminate the actor system, and all remaining scheduled tasks are executed immediately on shutdown.
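If the intent is to let the remaining scheduled calls run even when some fail, one option (a sketch reusing futures and s from the snippet above, assuming you want to terminate only after every task has finished) is to convert failures into values before sequencing, so Future.sequence never fails fast:
import scala.util.{Failure, Success, Try}

// Wrap each future so it always succeeds, carrying the real outcome in a Try
val guarded: Seq[Future[Try[Unit]]] =
  futures.map(f => f.map(v => Success(v): Try[Unit]).recover { case e => Failure(e) })

// Now the actor system is terminated only after all 500 tasks have run
Future.sequence(guarded).onComplete(_ => s.terminate())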

How to compose two parallel Tasks to cancel one task if another one fails?

I would like to implement my asynchronous processing with
scalaz.concurrent.Task. I need a function (Task[A], Task[B]) => Task[(A, B)] to return a new task that works as follows:
run Task[A] and Task[B] in parallel and wait for the results;
if one of the tasks fails then cancel the second one and wait until it terminates;
return the results of both tasks.
How would you implement such a function?
As I mentioned above, if you don't care about actually stopping the non-failed computation, you can use Nondeterminism. For example:
import scalaz._, scalaz.Scalaz._, scalaz.concurrent._

def pairFailSlow[A, B](a: Task[A], b: Task[B]): Task[(A, B)] = a.tuple(b)

def pairFailFast[A, B](a: Task[A], b: Task[B]): Task[(A, B)] =
  Nondeterminism[Task].both(a, b)

val divByZero: Task[Int] = Task(1 / 0)

val waitALongTime: Task[String] = Task {
  Thread.sleep(10000)
  println("foo")
  "foo"
}
And then:
pairFailSlow(divByZero, waitALongTime).run // fails immediately
pairFailSlow(waitALongTime, divByZero).run // hangs while sleeping
pairFailFast(divByZero, waitALongTime).run // fails immediately
pairFailFast(waitALongTime, divByZero).run // fails immediately
In every case except the first the side effect in waitALongTime will happen. If you wanted to attempt to stop that computation, you'd need to use something like Task's runAsyncInterruptibly.
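For completeness, here is a rough sketch of that interruption mechanism, assuming the scalaz 7.1-era overload of runAsyncInterruptibly that takes an AtomicBoolean cancellation flag (check the exact signature in your scalaz version):
import java.util.concurrent.atomic.AtomicBoolean
import scalaz.{-\/, \/-}
import scalaz.concurrent.Task

val cancel = new AtomicBoolean(false)

val interruptible: Task[String] = Task {
  Thread.sleep(10000)
  "foo"
}

// Start the task; the callback receives either the error or the result
interruptible.runAsyncInterruptibly({
  case -\/(e)   => println(s"failed or interrupted: $e")
  case \/-(res) => println(s"result: $res")
}, cancel)

// Later, e.g. when the sibling task fails:
cancel.set(true) // cancellation is checked between trampoline steps, so a
                 // single blocking sleep will not be preempted mid-call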
There is a strange conviction among Java developers that you should not cancel parallel tasks. They condemn Thread.stop() and have marked it deprecated, but without Thread.stop() you cannot really cancel a future. All you can do is send some signal to the future, or modify some shared variable, and make the code inside the future check it periodically. So all libraries that provide futures can suggest only one way to cancel a future: do it cooperatively.
I'm facing the same problem now and am in the middle of writing my own library for futures that can be cancelled. There are some difficulties, but they can be solved. You just cannot call Thread.stop() at any arbitrary position: the thread may be updating shared variables. Locks would be released normally, but an update may be stopped halfway, e.g. having written only half of a double value, and so on. So I'm introducing a kind of lock: if the thread is in a guarded state, it should not be killed by Thread.stop() but instead sent a specific message. The guarded state is assumed to always be fast enough to wait for. At any other time, in the middle of a computation, the thread may be safely stopped and replaced with a new one.
So the answer is: you should not want to cancel futures, otherwise you are a heretic and no one in the Java community will lend you a willing hand. You would have to define your own execution context that can kill threads, and write your own futures library to run on top of it.

Concurrent Akka Agents in Scala

I'm working on a Scala project right now, and I've decided to use Akka's agent library over the actor model, because it allows a more functional approach to concurrency. However, I'm having a problem running many different agents at a time. It seems like I'm capped at only three or four agents running at once.
import akka.actor._
import akka.agent._
import scala.concurrent.ExecutionContext.Implicits.global

object AgentTester extends App {
  // Create the system for the actors that power the agents
  implicit val system = ActorSystem("ActorSystem")

  // Create an agent for each int between 1 and 10
  val agents = Vector.tabulate[Agent[Int]](10)(x => Agent[Int](1 + x))

  // Define a function for each agent to execute
  def printRecur(a: Agent[Int])(x: Int): Int = {
    // Print out the stored number and sleep.
    println(x)
    Thread.sleep(250)

    // Recur the agent
    a sendOff printRecur(a) _

    // Keep the agent's value the same
    x
  }

  // Start each agent
  for (a <- agents) {
    Thread.sleep(10)
    a sendOff printRecur(a) _
  }
}
The above code creates an agent holding each integer between 1 and 10. The loop at the bottom sends the printRecur function to every agent. The output of the program should show the numbers 1 through 10 being printed every quarter of a second (although not in any order). However, for some reason my output only shows the numbers 1 through 4.
Is there a more canonical way to use agents in Akka that will work? I come from a clojure background and have used this pattern successfully there before, so I naively used the same pattern in Scala.
My guess is that you are running on a 4-core box, and that is part of the reason why you only ever see the numbers 1-4. The big thing at play here is that you are using the default execution context, which on your system presumably uses a thread pool with only 4 threads (one for each core). With the way you've coded this recursive pattern, the first 4 agents never relinquish their threads, and they are the only ones that will ever print anything.
You can easily fix this by removing this line:
import scala.concurrent.ExecutionContext.Implicits.global
And adding this line after you create the ActorSystem
import system.dispatcher
This will use the actor system's default dispatcher, a fork-join dispatcher, which does not seem to have the same issue as the default execution context you imported in your sample.
You could also consider using send as opposed to sendOff, as send uses the execution context that was available when the agent was constructed. One would typically use sendOff when explicitly wanting to run a job on a different execution context.
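Putting the fix together, here is a minimal sketch of the corrected program: the same logic as the question's code, with the global context replaced by the system dispatcher:
import akka.actor._
import akka.agent._

object AgentTesterFixed extends App {
  implicit val system = ActorSystem("ActorSystem")
  import system.dispatcher // fork-join dispatcher instead of the global context

  val agents = Vector.tabulate[Agent[Int]](10)(x => Agent[Int](1 + x))

  def printRecur(a: Agent[Int])(x: Int): Int = {
    println(x)
    Thread.sleep(250)
    a sendOff printRecur(a) _
    x
  }

  for (a <- agents) a sendOff printRecur(a) _
}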