I am trying to understand how state management in Spark Streaming works in general. If I run this example program twice, will the second run see state from the first run?
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
Is there a way to achieve this? I am thinking about redeploying the application, and I would like not to lose the current state.
tl;dr It depends on what you need the other instance to see. Checkpointing is usually a solution.
ssc.checkpoint(".") (at line 50 in StatefulNetworkWordCount) enables checkpointing, which (quoting the official documentation):
Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.
A failure can be considered a form of redeployment. This is described in the official documentation under Upgrading Application Code, which lists two cases:
Two instances run in parallel
One is gracefully brought down, and the other reads state from the checkpoint directory.
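To make this concrete, here is a minimal sketch of how a second run can pick the state up, assuming the checkpoint directory is made durable (the HDFS path below is hypothetical; the example's "." only works as long as the local directory survives):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RecoverableWordCount {
      // Hypothetical durable checkpoint location shared by both runs.
      val checkpointDir = "hdfs:///checkpoints/stateful-wordcount"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("StatefulNetworkWordCount")
        val ssc = new StreamingContext(conf, Seconds(1))
        ssc.checkpoint(checkpointDir)
        // ... build the stateful stream (updateStateByKey) here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        // First run: createContext() runs and the fresh context is checkpointed.
        // Second run: the context, including the accumulated state, is rebuilt
        // from the checkpoint data instead of being created from scratch.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

The important part is that the same directory is passed both to ssc.checkpoint and to StreamingContext.getOrCreate.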
I am using Kafka 0.8.2 and have implemented the standard poll and push calls. Now I want to move to pollByIndex methods, which require implementing a SimpleConsumer.
Does somebody know of a library that already deals with methods like this, since implementing a SimpleConsumer can be a lot of work? :)
Upgrading to 0.9 to use the new Consumer API is not an option for me yet.
OK, so implementing a SimpleConsumer is not really easy. I went with another solution: a command that creates a different consumer group each time I query for results, stores it in ZooKeeper, and then assigns the offset I need on the topic. That way the default streaming consumer is not disturbed; it creates a small load on ZooKeeper, but nothing significant.
Also, after every query I take care to remove the group from ZooKeeper to keep it clean.
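A rough sketch of that idea, assuming the ZooKeeper-backed offset storage of the 0.8 high-level consumer (the group name, topic, partition, offset, and the exact ZkUtils helpers below are assumptions, not a tested recipe):

    import java.util.{Properties, UUID}
    import kafka.consumer.{Consumer, ConsumerConfig}
    import kafka.utils.{ZKStringSerializer, ZkUtils}
    import org.I0Itec.zkclient.ZkClient

    object OffsetQuery {
      def readFrom(zkConnect: String, topic: String, partition: Int, offset: Long): Unit = {
        // Throw-away group so the default streaming consumer is left untouched.
        val group = "query-" + UUID.randomUUID().toString
        val zkClient = new ZkClient(zkConnect, 30000, 30000, ZKStringSerializer)

        // Pre-seed the starting offset for the new group; the 0.8 high-level
        // consumer reads it from this path when it starts.
        ZkUtils.updatePersistentPath(
          zkClient, s"/consumers/$group/offsets/$topic/$partition", offset.toString)

        val props = new Properties()
        props.put("zookeeper.connect", zkConnect)
        props.put("group.id", group)
        val connector = Consumer.create(new ConsumerConfig(props))
        val stream = connector.createMessageStreams(Map(topic -> 1))(topic).head

        // ... read as many messages as needed from `stream` here ...

        connector.shutdown()
        // Remove the throw-away group so it does not accumulate in ZooKeeper.
        ZkUtils.deletePathRecursive(zkClient, s"/consumers/$group")
        zkClient.close()
      }
    }

With the default offsets.storage=zookeeper, the 0.8 consumer reads its starting offset for a group from /consumers/<group>/offsets/<topic>/<partition>, which is why pre-seeding that path works.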
About a day of coding, more or less.
With versions 0.9 and newer we will have the new Consumer API, which has its own calls for this situation. Until then, more work is needed on the developer's side.
In a typical web application, there are some things that I would prefer to run as delayed jobs/tasks. They tend to have some or all of the following properties:
Takes a long time (anywhere from multiple seconds to multiple minutes to multiple hours).
Occupies some resource heavily (CPU, network, disk, external API limits, etc.)
Result not immediately necessary. Can complete HTTP response without it. OK (and possibly even preferable) to delay until later.
Can be (and may be preferable to) run on (a) different machine(s) than the web server(s). The machine(s) are potentially dedicated job/task runners.
Should be run in response to other event(s), or started periodically.
What would be the preferred way(s) to set up, enqueue, schedule, and run delayed jobs/tasks in a Scala + Play Framework 2.x app?
For more details...
The pattern I have used in the past, and which I would like to replicate if applicable, is:
In the handler of a web request, or in a cron-like call, enqueue job(s)
In job runner(s), repeatedly dequeue and run one job at a time
Possibly handle recording job results
This seems to be a relatively simple yet still relatively flexible pattern.
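Purely for illustration, the shape of that pattern with nothing more than a plain Akka actor as the runner (Job and the work inside it are made up; this is not a framework, just the enqueue/run-one-at-a-time loop described above):

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical job type; a real one would carry whatever the task needs.
    case class Job(id: Long, payload: String)

    // The actor's mailbox acts as the queue: jobs pile up there and are handled
    // one at a time, which is the "dequeue and run one job" behaviour above.
    class JobRunner extends Actor {
      def receive = {
        case Job(id, payload) =>
          // ... long-running work goes here ...
          println(s"finished job $id ($payload)") // or record the result somewhere
      }
    }

    object Enqueue extends App {
      val system = ActorSystem("jobs")
      val runner = system.actorOf(Props[JobRunner], "runner")

      // From a web request handler or a cron-like call:
      runner ! Job(1L, "update-derived-data")
    }

The missing pieces (a durable queue, running the runner on a different machine, retries, result tracking) are exactly what a dedicated framework would add.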
Examples I have encountered in the past include:
Updating derived data in DB
Analytics/tracking API calls for a web request
Deleting expired sessions or other stale/outdated DB records
Periodic batch ETLs
In other languages/frameworks, I would typically use a job/task framework. Examples include:
Resque in a Ruby + Rails app
Celery in a Python + Django app
I have found the following existing materials, but unfortunately, I don't think they fit my use case directly.
Play 1.x asynchronous jobs API (+ various SO questions referencing it). Appears to have been removed in the 2.x line, with no reference to what replaced it.
Play 2.x Akka integration. Seems very general-purpose. I'd imagine it's possible to use Akka for the above, but I'd prefer not to write a jobs/tasks framework if one already exists. Also, no info on how to separate the job runner machine(s) from your web server(s).
This SO answer. Seems potentially promising for the "short to medium duration IO bound" case, e.g. analytics calls, but not necessarily for the "CPU bound" case (probably shouldn't tie up CPU on web server, prefer to ship off to different node), the "lots of network" case, or the "multiple hour" case (probably shouldn't leave that in the background on the web server, even if it isn't eating up too many resources).
This SO question, and related questions. Similar to above, it seems to me that this covers only the cases where it would be appropriate to run on the same web server.
Some further clarification on use-cases (as per commenters' request). There are two main use-cases that I have experienced with something like resque or celery that I am trying to replicate here:
Some event on the site triggers a task (most often, an incoming web request causes a task to be enqueued).
A task should run periodically (most often implemented as: periodically, enqueue a task to be run as above).
In the case of resque or celery, the tasks enqueued by both use-cases enter queues the same way and are treated the same way by the runner/worker process. Barring other Scala or Play-specific considerations, that would be my initial guess for how to approach this.
Some further clarification on why I do not believe the Akka scheduler fits my use case out-of-the-box (as per commenters' request):
While it is no doubt possible to construct a fitting solution using some combination of the Akka scheduler (for periodic jobs), akka-remote, and akka-cluster (for communicating between the job caller and the job runner), that approach requires a certain amount of glue code which is almost a delayed-job framework in and of itself. If one exists, I would prefer to use an existing out-of-the-box solution rather than reinvent the wheel.
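For the record, the glue I mean looks roughly like this; the remote address, actor name, and message are hypothetical, and durability, retries, and result tracking would still have to be built on top:

    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    object SchedulerGlue extends App {
      val system = ActorSystem("web")
      import system.dispatcher // ExecutionContext used by the scheduler

      // Hypothetical runner actor on a dedicated job node, reached via akka-remote.
      val runner = system.actorSelection("akka.tcp://jobs@job-node-1:2552/user/runner")

      // Event-driven use case: a request handler would simply do `runner ! someJob`.
      // Periodic use case: enqueue a job on the runner every hour.
      system.scheduler.schedule(1.minute, 1.hour) {
        runner ! "delete-expired-sessions"
      }
    }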
In these slides: http://www.slideshare.net/jboner/introducing-akka I've read that Akka supports hot deployment. The way I understand this term, I'll be able to make code changes without restarting my application and losing its current state.
That's exactly what I may need for my scala/akka application. But how do I actually do a hot deployment? What tools and techniques should I use?
It isn't clear what state you want to maintain. The mailboxes of the actors? The remoting configuration? All of that is non-trivial to reason about in normal circumstances, not to mention during hot swapping.
If you are thinking of something along the lines of OSGi hot deployment, then no, in general you cannot. You have a few options, though.
You can change an actor's behavior at runtime using a variety of methods; the easiest would be become/unbecome. This is sometimes what is meant by hot swapping.
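A minimal sketch of the become/unbecome style of swap (the behaviours and messages are made up):

    import akka.actor.Actor

    class Swappable extends Actor {
      def receive: Receive = oldBehaviour

      def oldBehaviour: Receive = {
        case "swap" =>
          // Push the new behaviour onto the behaviour stack.
          context.become(newBehaviour, discardOld = false)
        case msg =>
          println(s"old handling of $msg")
      }

      def newBehaviour: Receive = {
        case "swap back" =>
          // Pop back to the previous behaviour.
          context.unbecome()
        case msg =>
          println(s"new handling of $msg")
      }
    }

Note that this only switches between behaviours already compiled into the running application; it does not load new classes, which is the OSGi-style scenario ruled out above.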
A generic approach might be to deploy your new code to new nodes, have them join a cluster, and then kill off the previous nodes.
I am using Celery to process and deploy some data, and I would like to be able to send this data in batches.
I found this:
http://docs.celeryproject.org/en/latest/reference/celery.contrib.batches.html
But I am having problems with it, such as:
It does not play nicely with eventlet, putting exceptions in the log files stating that the timer is null after the queue is empty.
It seems to leave additional hanging threads after calling celery multi stop.
It does not appear to adhere to the standard logging of a typical Task.
It does not appear to retry the task when raise mytask.retry() is called.
I am wondering if others have experienced these problems, and whether there is a solution.
I am fine with implementing batch processing on my own, but I do not know a good strategy to make sure that all items are deployed (i.e. even those at the end of the thread).
Also, if the batch fails, I would want to retry the entire batch. I am not sure of any elegant way to do that.
Basically, I am looking for any viable solution for doing real batch processing with celery.
I am using Celery v3.0.21
Thanks!
I have looked at the documentation for both, but I am not sure which is the best choice for a given application. I have looked more closely at Celery, so the example will be given in those terms.
My use case is similar to this question, with each worker loading a large file remotely (one file per machine); however, I also need the workers to hold persistent objects. So if a worker completes a task, returns a result, and is then called again, I need to reuse a previously created variable for the new task.
Recreating the object on each task call is far too wasteful. I haven't seen a Celery example that leads me to believe this is possible; I was hoping to use the worker_init signal to accomplish it.
Finally, I need a central hub to keep track of what all the workers are doing. This seems to imply a client-server architecture rather than the one provided by Celery. Is this correct? If so, would IPython Parallel be a good choice given these requirements?
I'm currently evaluating Celery vs. IPython Parallel as well. Regarding a central hub to keep track of what the workers are doing, have you checked out the Celery Flower project? It provides a web page that allows you to view the status of all tasks in the queue.