Batch processing and functional programming - Scala

As a Java developer, I'm used to using Spring Batch for batch processing, generally exporting large XML files with a streaming library such as StAX.
I'm now developing a Scala application and wonder whether there's any framework, tool, or guideline for batch processing.
My Scala application uses the Cake Pattern, and I'm not sure how I could integrate it with Spring Batch. I'd also like to follow the guidelines described in Functional Programming in Scala and try to preserve functional purity, using constructs like the IO monad.
I know this is kind of an open question, but I've never read anything about it.
Has anyone here already done functional batch processing? How did it work out? Am I supposed to have a main that builds a batch processing operation inside an IO monad and then runs it? Is there any tool or guideline to help with monitoring or restartability, the way we use Spring Batch in Java?
Do you use Spring Batch in Scala?
How do you handle the integration part, for example waiting for a JMS/AMQP message to start the processing that produces an XML file?
Any feedback on the subject is welcome.
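To make the "main that builds an IO and runs it" idea concrete, here is a minimal, hand-rolled IO sketch in the spirit of Functional Programming in Scala (this is not the book's code; the record source and XML writer are placeholders):

    // A minimal IO type: it describes an effect without running it.
    final case class IO[A](unsafeRun: () => A) {
      def map[B](f: A => B): IO[B]         = IO(() => f(unsafeRun()))
      def flatMap[B](f: A => IO[B]): IO[B] = IO(() => f(unsafeRun()).unsafeRun())
    }

    object BatchMain {
      // placeholder steps for "read records, write them out as XML"
      def readRecords: IO[List[String]] =
        IO(() => List("record-1", "record-2"))

      def writeXml(records: List[String]): IO[Unit] =
        IO(() => println(s"<records count='${records.size}'/>"))

      // the whole batch job is a pure value until main actually runs it
      val job: IO[Unit] = readRecords.flatMap(writeXml)

      def main(args: Array[String]): Unit = job.unsafeRun()
    }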

You don't mention what kind of app you're developing with Scala, so I'm going to take a wild guess and assume it's a server-side one. Going further with the wild guessing, let's say you're using Akka... because you are using it, aren't you? :)
In that case, I think what you're looking for is Akka Quartz Scheduler, the official Quartz extension and utilities for cron-style scheduling in Akka. I haven't tried it myself, but from your requirements it seems that Akka plus this module would be a good fit. Take into account that Akka already provides hooks to handle restartability of failed actors, and I don't think it would be that difficult to add monitoring of batch processes by leveraging the lifecycle callbacks built into actors.
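As a rough illustration, a minimal scheduling sketch, assuming akka-quartz-scheduler is on the classpath; the schedule name, cron expression and actor are illustrative, and the cron expression lives in application.conf under akka.quartz.schedules:

    import akka.actor.{Actor, ActorSystem, Props}
    import com.typesafe.akka.extension.quartz.QuartzSchedulerExtension

    case object RunBatch

    class BatchRunner extends Actor {
      def receive = {
        case RunBatch => println("starting nightly export") // kick off the batch job here
      }
    }

    object SchedulerMain extends App {
      val system = ActorSystem("batch")
      val runner = system.actorOf(Props[BatchRunner], "batch-runner")

      // application.conf: akka.quartz.schedules.Nightly.expression = "0 0 2 * * ?"
      QuartzSchedulerExtension(system).schedule("Nightly", runner, RunBatch)
    }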
Regarding integration with JMS/AMQP messaging, you could use the Akka Camel module, which provides support for sending and receiving messages over a lot of protocols, including JMS. Using this module you could have a consumer actor receive messages from some JMS endpoint and fire whatever process you want from there, probably by forwarding or sending a new message to the actor responsible for that process. Whether the process is fired by a cron-style timer or by an incoming message, you can reuse the same actor to accomplish the task.
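A minimal consumer sketch along those lines, assuming akka-camel and a JMS Camel component are configured; the queue name and worker actor are placeholders:

    import akka.actor.ActorRef
    import akka.camel.{CamelMessage, Consumer}

    // Receives JMS messages and hands the payload to the actor that runs the export.
    class ExportTrigger(worker: ActorRef) extends Consumer {
      def endpointUri = "jms:queue:export-requests" // placeholder queue name

      def receive = {
        case CamelMessage(body, _) => worker ! body // e.g. an export request payload
      }
    }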

Related

Build workflow engine with Akka

In our Scala/Play application we use Activiti (we're also experimenting with Camunda); users can create workflows (like the one shown in this picture: http://camunda.com/). All calls to these external workflow engines are wrapped in a Scala Future, since the Activiti and Camunda APIs are blocking Java APIs.
Is there any library for implementing workflows entirely with Akka/actors, avoiding heavyweight toolkits like Activiti/Camunda? Or any ideas on how best to use Akka with Activiti/Camunda?
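For context, a minimal sketch of the wrapping described above: a blocking engine call (here Activiti's RuntimeService, which Camunda mirrors) run in a Future on a dedicated thread pool so it doesn't starve the default dispatcher; the pool size and process key are placeholders.

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}
    import org.activiti.engine.RuntimeService

    class WorkflowClient(runtimeService: RuntimeService) {
      // dedicated pool for blocking engine calls
      private val blockingEc =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

      def startWorkflow(processKey: String): Future[String] =
        Future {
          runtimeService.startProcessInstanceByKey(processKey).getId // blocking Java API
        }(blockingEc)
    }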
You could try using the Akka FSM DSL to do the same thing, bypassing Activiti and its blocking APIs. See http://doc.akka.io/docs/akka/snapshot/scala/fsm.html
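A minimal sketch of what a hand-rolled workflow step could look like with the classic Akka FSM DSL; the states, messages and step names are placeholders for whatever your process needs:

    import akka.actor.FSM

    sealed trait State
    case object Idle extends State
    case object Running extends State

    sealed trait Data
    case object Empty extends Data
    final case class InFlight(stepsLeft: List[String]) extends Data

    case object Start
    case object StepDone

    class WorkflowFsm extends FSM[State, Data] {
      startWith(Idle, Empty)

      when(Idle) {
        case Event(Start, Empty) =>
          goto(Running) using InFlight(List("validate", "export", "notify"))
      }

      when(Running) {
        case Event(StepDone, InFlight(_ :: rest)) if rest.nonEmpty =>
          stay() using InFlight(rest)          // more steps to run
        case Event(StepDone, _) =>
          goto(Idle) using Empty               // all steps completed
      }

      initialize()
    }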
Note that camunda has very powerful asynchronous continuation features which allow you to delegate any long-running processing to background threads. This allows very flexible configuration of "how much work" is done synchronously in the client (possibly HTTP) thread. This can give you a good balance between performance and fault tolerance.
I know of the existence of the Catify BPMN Engine, built using Akka (Java). I do not have any experience with it, nor do I know for sure whether API calls are asynchronous, but I would expect so. Since it is written in Akka it should combine well with Play!.

Difference in message-passing model of Akka and Vert.x

I am a Scala programmer and understand Akka from a developer's point of view; I have not looked into the Akka library's code. I have read about the two types of actors in the Akka model, thread-based and event-based, but not having run Akka at large scale I don't have experience configuring it for production. And I am completely new to Vert.x. So, from the perspective of choosing a reactive application stack, I want to know:
Is the message-passing model of Akka and Vert.x very different? How?
Are the data-structures behind Akka's actors and Vert.x's verticles to buffer messages very different?
Superficially they're really similar, although I personally find Vert.x's ideas closer to an MQ system than to Akka. The Vert.x topology is flatter: a verticle shares a message with another verticle and receives a response. Akka, instead, is more like a tree, where you have several actors and can supervise actors with other actors. For simple projects this may not be a big deal, but for big projects you may appreciate a more hierarchical system.
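To make the "tree" point concrete, a minimal Akka supervision sketch (the actor names and restart policy are illustrative):

    import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
    import scala.concurrent.duration._

    class Worker extends Actor {
      def receive = {
        case job: String => println(s"processing $job") // may throw; the parent decides what happens then
      }
    }

    class Supervisor extends Actor {
      // restart a failing child up to 3 times per minute
      override val supervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
          case _: RuntimeException => SupervisorStrategy.Restart
        }

      private val worker = context.actorOf(Props[Worker], "worker")

      def receive = {
        case job => worker forward job
      }
    }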
Vert.x, on the other hand, offers better interoperability between very popular languages*. For me that is a big point: where you would otherwise need to mix actors with an MQ system and deal with more complexity, Vert.x makes it simple and elegant. So which is better? It depends: if your system will be built only in Scala, then Akka could be the best way; if you need communication with JavaScript, Ruby, Python, Java, etc. and don't need a complex hierarchy, then Vert.x is the way to go.
*(using JSON, which can be either an advantage or a disadvantage)
Also consider that Vert.x is a full solution: TCP, an HTTP server, routing, even WebSockets! That is pretty amazing, because they offer a full stack and the API is very clean. If you choose Akka you would need to use a framework like Play, Xitrum, or Spray; personally I don't like any of them.
Also remember that Vert.x is not an opinionated platform; you can use Akka or Kafka with it, for example, with almost no overhead. The way every part of the system is decoupled inside a verticle keeps things simple.
Vert.x is a big project with an amazing outlook, but it's really new; if you need a solution right now it may not be the best option. Fortunately, you can learn both and use both in the same project.
After a bit of Google searching, I've found that a detailed comparison of Akka vs Vert.x hasn't been done yet (at least I couldn't find one).
Computation model:
Vert.x is based on an event-driven model.
Akka is based on the actor model of concurrency.
Reactive Streams:
Vert.x has Reactive Streams built in.
Akka supports Reactive Streams via Akka Streams; its stream operators (exposed through a Scala DSL) are very concise and clean (see the short sketch after this comparison).
HTTP support:
Vert.x has built-in support for creating network services (HTTP, TCP, etc.).
Akka has Akka HTTP for that.
Scala support:
Vert.x is written in Java.
Akka is written in Scala and is fun to work with.
Remote services:
Vert.x supports services, which you need to create explicitly.
Akka has actors, which can be deployed anywhere on the network, with support for clustering, replication, load balancing, supervision, etc.
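A tiny example of the Akka Streams Scala DSL mentioned above, assuming a recent Akka version where the ActorSystem provides the stream materializer implicitly:

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object StreamsSketch extends App {
      implicit val system: ActorSystem = ActorSystem("streams-sketch")
      import system.dispatcher

      Source(1 to 100)        // emit the numbers 1..100
        .map(_ * 2)           // transform each element
        .filter(_ % 3 == 0)   // keep only multiples of 3
        .runWith(Sink.foreach(println))
        .onComplete(_ => system.terminate())
    }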
References:
https://groups.google.com/forum/#!topic/vertx/ppSKmBoOAoQ
https://blog.openshift.com/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift/
https://en.wikipedia.org/wiki/Vert.x
http://akka.io/

Finagle and Akka, why not use them together?

I have not used either Finagle or Akka in practice, but I have been reading a lot about them.
Finagle being an RPC system and Akka a toolkit for highly concurrent applications, why do people compare them as two alternative solutions that cannot be used together? All the searches I've done propose one or the other; nobody proposes using them together.
Finagle, for example, has a very interesting way of defining endpoints via Thrift and its IDL. With this IDL we could define a custom endpoint and, through Scrooge or whatever code generation tool, have a service up with very little effort. A client to connect to this service is also generated, with a lot of common client concerns resolved automatically (reconnection, timeouts, retries, load balancing, connection pooling, ...).
Akka, on the other hand, solves a lot of concurrency headaches and scales extremely well without all the complexity of hand-managed threading.
As a summary, why not use them together?
Finagle + Thrift (with its IDL): facilitates service design, development and deployment (which includes ease of scaling out).
Akka: uses all of the server's power through its actor system and scales extremely well if I change server properties (for example, if it's deployed on EC2 and I upgrade my node from m1.small to m1.large).
What do you think?
NOTE: Assume that the issue of mapping Futures and Promises is resolved, as is the mismatch between FuturePools and ExecutionContexts. The pattern would be to convert Finagle's futures to the Scala way of using Futures.
You are right in that service discovery and service implementation are orthogonal concerns, and I can follow your argument about using Finagle for the former and Akka for the latter. You could in principle use the two together without seeking a grand unification of futures, since you only need to send the service’s reply back to the requesting Actor in a message, i.e. you would need to add your own little “pipeTo” pattern on top of Twitter futures.
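A minimal sketch of that bridge, assuming a hypothetical Scrooge-generated client with a lookup method returning a Twitter Future: convert it to a Scala Future and pipe the result back to the asking actor.

    import akka.actor.Actor
    import akka.pattern.pipe
    import com.twitter.util.{Return, Throw, Future => TwitterFuture}
    import scala.concurrent.{Promise, Future => ScalaFuture}

    object FutureBridge {
      // the "little pipeTo pattern" on top of Twitter futures
      def toScala[A](tf: TwitterFuture[A]): ScalaFuture[A] = {
        val p = Promise[A]()
        tf.respond {
          case Return(a) => p.success(a)
          case Throw(e)  => p.failure(e)
        }
        p.future
      }
    }

    // hypothetical Scrooge-generated client interface
    trait UserServiceClient {
      def lookup(id: Long): TwitterFuture[String]
    }

    final case class Lookup(id: Long)

    class UserServiceBridge(client: UserServiceClient) extends Actor {
      import FutureBridge._
      import context.dispatcher

      def receive = {
        case Lookup(id) =>
          toScala(client.lookup(id)).pipeTo(sender())
      }
    }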

What actor based web frameworks are available for Scala?

I need to build a very concurrent web service that will expose a REST-based API to JavaScript (front end) and Rails (back end). The web service will essentially be a data access API in front of MongoDB.
I already wrote an initial implementation using Node.js and would like to try a Scala-based solution. I'm also considering Erlang, where every web framework is actor based.
So I'm looking for a web framework explicitly built on actors in order to support a massive load of requests. I'm very new to Scala and I don't quite understand how actors would help when almost all Scala frameworks are based on Java servlets, which create a thread per request and would just exhaust all resources in my scenario.
If you're really going to have 10k+ long-lived active connections at a time, then any standard Java application server/framework (except, maybe, Netty) will not work for you: they all consume lots of memory (even when some kind of smart NIO is used). You'd be better off sticking to a clustered, event-loop-based solution (like the Node.js one you've already tried), Mongrel backed with ZeroMQ, nginx writing into an MQ polled by Scala actors, etc.
Among the Scala/Java frameworks, Lift has good async support for REST (though it's not directly tied to actors). OTOH, LinkedIn uses Scalatra + standard-library actors for their REST services behind Signal, and it feels just fine.
Another option is the Play framework. The latest 1.1 release supports Scala, and it also supports Akka as a module.
As for Scalatra itself, they have been working on a new request abstraction called SSGI (akin to the Servlet/Rack/WSGI/WAI layer), which they said should enable them to break from running solely as a servlet and also run on top of something built with Netty. See the thread here:
http://github.com/scalatra/ssgi
There are some other interesting frameworks at Scalatra's level of simplicity that were designed from the ground up to support asynchronous web services (they won't tie up a thread per request):
https://github.com/jdegoes/blueeyes - Not a servlet; built on Netty ("loosely inspired by ... Scalatra").
http://spray.cc/ - Built on Akka actors and Akka Mist; runs on Servlet 3.0 or Jetty continuations ("spray was heavily inspired by BlueEyes and Scalatra"). A minimal routing sketch follows this list.
And at a lower level:
https://github.com/rschildmeijer/loft - "Continuation based non-blocking, asynchronous, single threaded web server." Not production-ready, but rather interesting-looking. Continuations require the compiler plugin.
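Later spray releases ship a SimpleRoutingApp helper; for a flavour of the routing style, a minimal sketch (host, port and route are placeholders):

    import akka.actor.ActorSystem
    import spray.routing.SimpleRoutingApp

    object PingService extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("ping-service")

      // a single route handled asynchronously on top of Akka, no thread tied up per request
      startServer(interface = "localhost", port = 8080) {
        path("ping") {
          get {
            complete("pong")
          }
        }
      }
    }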
http://liftweb.net/ - Indeed, a request starts off as a servlet request, but Lift then uses the Comet support found in many servlet containers to break away from the thread, keeping the request context (which the container then doesn't destroy) so that it can later be used to output data from actors.
http://akkasource.org also has support for REST, but it will block the thread until the actor finishes its work.

Need opinion regarding design/architecture of a web application

I am working on a web application that needs to get data from some local and some non-local resources and then display it. Since it could take an arbitrary amount of time to get the data from these resources, I am thinking of using actors, so that each actor is responsible for getting data from its respective resource. The request thread will wait for each actor to finish its task and then use Ajax to update only the portion of the web page that depends on that data. This way the user starts seeing data as soon as it is received, rather than waiting for everything to finish before getting a first look at it.
I am planning to look into the Scala/Lift framework for this. I have read some articles on the web about Scala/Lift and want to find out whether this is the right way to approach the problem, and whether Scala/Lift is a good platform choice. I have worked in Java and C# previously. Any opinions, comments, and suggestions are welcome.
Thanks,
gary
Take a look at message queue technology such as Java's JMS. Message queues allow you to handle long-running background tasks asynchronously and reliably. This is the technique sites like Flickr and YouTube use to do media transcoding asynchronously. You can use a Java EE server, or a JMS broker like Apache's ActiveMQ, and then layer your Scala/Lift code on top of it.
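A minimal sketch of enqueueing a background job from Scala over plain JMS with ActiveMQ; the broker URL, queue name and payload are placeholders:

    import javax.jms.{Session, TextMessage}
    import org.apache.activemq.ActiveMQConnectionFactory

    object TranscodeQueue {
      private val factory = new ActiveMQConnectionFactory("tcp://localhost:61616") // placeholder broker

      def enqueue(jobJson: String): Unit = {
        val connection = factory.createConnection()
        try {
          connection.start()
          val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
          val producer = session.createProducer(session.createQueue("transcode-jobs"))
          val message: TextMessage = session.createTextMessage(jobJson)
          producer.send(message)
        } finally connection.close()
      }
    }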
Richard Monson-Haefel's book on JMS covers it well.
For more general help with web site scaling and construction, take a look at Todd Hoff's excellent blog, highscalability.com/. There are some good pointers to using message queues to offload long-running tasks this way.
BTW, Twitter uses Scala for something much like what you're considering. Here's an interview with some of their developers; they describe one way they use Scala:
Robey Pointer: A lot of our architecture is based on letting Rails do what it does best, which is the AJAX, the web front ends, the website—what the user sees. Anything we can offload out of the request/response cycle, we do. So we queue those tasks into a messaging system and have back-end daemons handle them.
If the non-local resources originate in some other service or system, an Event-Driven Architecture might work for you. Instead of pulling from the non-local resources, you could set up this web application as a subscriber to the events published by those services. Upon receiving a message relevant to part of its functionality, it would cache locally the data it's interested in. This should let you escape the issue of asynchronously updating parts of the page (all the data would be accessible locally).
Udi Dahan blogs about this approach a lot and is also an author of a .NET message bus (NServiceBus) that can be used in such scenarios. See for example http://msdn.microsoft.com/en-us/architecture/aa699424.aspx
Actors would be a way to go; you're essentially setting up a lightweight version of JMS. And Lift does the Comet stuff very well.
In addition to the Scala actors and the Lift actors, you also have Akka actors. When Scala Swarm becomes production-ready, you'll be ready for that too.
If the delayed information is distinct from the information that needs to be displayed immediately, with Lift you can use a LazyLoad snippet that performs the longer-running computation (calling the web service) as part of its logic. Lift will take care of inserting the result into the page when it's ready.