Clojure futures in context of Scala's concurrency models

After being exposed to Scala's actors and Clojure's futures, I feel like both languages have excellent support for multicore data processing.
However, I still have not been able to determine the real engineering differences between their concurrency features and the pros/cons of the two models. Are these languages complementary or opposed in their treatment of concurrent process abstractions?
Secondarily, regarding big data issues, it's not clear whether the Scala community continues to support Hadoop explicitly (whereas the Clojure community clearly does). How do Scala developers interface with the Hadoop ecosystem?

Some problems are well solved by agents/actors and some are not. This distinction is less about languages than about how specific problems fit within general classes of solutions. This is a (very short) comparison of actors/agents vs. references, to clarify the point that the tool must fit the concurrency problem.
Actors excel in distributed situations where no data needs to be concurrently modified. If your problem can be expressed purely by passing messages, then actors will do the trick. Actors work poorly when they need to modify several related data structures at the same time. The canonical example of this is moving money between bank accounts.
Clojure's refs are a great solution to the problem of many threads needing to modify the same thing at the same time. They excel on shared-memory multiprocessor systems like today's PCs and servers. In addition to the bank account example, Rich Hickey (the author of Clojure) uses the example of a baseball game to explain why this matters. If you wanted to use actors to represent a baseball game, then before anyone could move the ball, all the fans would have to send it a message asking where it was... and if they wanted to watch a player catching the ball, things get even more complex.
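Staying in Scala for the sake of one language, the same coordinated-update idea can be sketched with the ScalaSTM library, which provides Clojure-style refs on the JVM (the account names and balances below are made up):

    import scala.concurrent.stm._

    // Two pieces of shared state that must always be updated together.
    val alice = Ref(100)
    val bob   = Ref(50)

    def transfer(from: Ref[Int], to: Ref[Int], amount: Int): Unit =
      atomic { implicit txn =>
        if (from() < amount) sys.error("insufficient funds")
        from() = from() - amount // both writes commit atomically,
        to()   = to()   + amount // or the whole transaction retries
      }

    transfer(alice, bob, 30)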
Clojure also has Cascalog, which makes writing Hadoop jobs look a lot like writing Clojure.

Actors provide a way of handling the potential interleaving and synchronization that inevitably come up when trying to get multiple threads to work together. Each actor has a queue of messages that it processes in order, one at a time, which avoids the need for explicit locks. In this model, a Future provides a way of waiting for a response from an actor.
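A minimal sketch of that pattern with Akka (classic) actors; the Counter actor and its string messages are invented for illustration:

    import akka.actor.{Actor, ActorSystem, Props}
    import akka.pattern.ask
    import akka.util.Timeout
    import scala.concurrent.Await
    import scala.concurrent.duration._

    // Messages are queued and handled one at a time, so no explicit locks are needed.
    class Counter extends Actor {
      private var count = 0
      def receive = {
        case "inc" => count += 1
        case "get" => sender() ! count
      }
    }

    object Demo extends App {
      val system  = ActorSystem("demo")
      val counter = system.actorOf(Props[Counter], "counter")
      counter ! "inc" // fire-and-forget message
      implicit val timeout: Timeout = Timeout(2.seconds)
      val reply = counter ? "get" // a Future holding the actor's eventual response
      println(Await.result(reply, 2.seconds))
      system.terminate()
    }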
As far as Hadoop is concerned, Twitter just released a library specifically for Hadoop called Scalding, but any library written for the JVM should work with either language.
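For a flavor of Scalding, here is roughly the word-count job from its documentation (the input and output paths are supplied on the command line):

    import com.twitter.scalding._

    // Counts word occurrences in a text file as a Hadoop job.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }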


Why are Scala Actors deprecated in 2.10?

I was just comparing different Scala actor implementations, and now I'm wondering what the motivation was to deprecate the existing Scala actor implementation in 2.10 and replace the default actors with the Akka implementation. Neither the migration guide nor the first announcement gives any explanation.
According to the comparison, the two solutions were different enough that keeping both would have been a benefit. Thus, I'm wondering whether there were any major problems with the existing implementation that caused this decision. In other words, was it a technical or a political decision?
I can only offer an educated guess:
Akka provides a stable and powerful library for working with actors, along with lots of features that deal with high concurrency (futures, agents, transactional actors, STM, FSM, non-blocking I/O, ...).
It also implements actors in a safer way than Scala's, in that client code only has access to a generic ActorRef. This makes it impossible to interact with actors other than through message passing.
[edited: As Roland pointed out, this also enables additional features like fault tolerance through a supervision hierarchy, and location transparency: the ability to deploy an actor locally or remotely with no change needed in the client code.
The overall design more closely resembles the original one in Erlang.]
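A minimal sketch of that encapsulation (the Greeter actor is made up): actorOf hands back an ActorRef rather than the actor instance itself, so message passing is the only way in:

    import akka.actor.{Actor, ActorRef, ActorSystem, Props}

    class Greeter extends Actor {
      private var greeted = 0 // internal state, unreachable from outside
      def receive = {
        case name: String =>
          greeted += 1
          println(s"hello, $name")
      }
    }

    val system = ActorSystem("demo")
    val greeter: ActorRef = system.actorOf(Props[Greeter], "greeter")
    greeter ! "world"
    // greeter.greeted would not compile: the ActorRef exposes no internals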
Many of the core features were duplicated between Scala and Akka actors, so unification seems the most sensible choice (given, too, that the development teams of both libraries are now part of the same company: Typesafe).
The main gain is avoiding duplication of the same core functionality, which would only create confusion and compatibility issues.
Given that a choice had to be made, it only remained to decide which would be the standard implementation.
It's evident to me that Akka has more to offer in this respect, being a full-blown framework with many enterprise-level features already included and more to come in the near future.
I can't think of a specific case where scala.actors is capable of accomplishing what akka can't.
P.S. Similar reasoning led to the unification of the standard future/promise implementation in 2.10.
The whole Scala language and community stand to gain from a simplified interface to base language features, instead of a fragmented scene made up of different frameworks, each with its own syntax and model to learn.
The same can't be said for other, higher-level areas, like web frameworks, where the developer gains from a richer panorama of available solutions.
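To illustrate what that unified interface looks like, a small sketch using the standard scala.concurrent API introduced in 2.10:

    import scala.concurrent.{Future, Promise}
    import scala.concurrent.ExecutionContext.Implicits.global

    // One standard Future type, shared by the standard library, Akka, and other frameworks.
    val f: Future[Int] = Future { 21 * 2 }
    f.foreach(n => println(s"computed $n"))

    // A Promise is the write side: a handle for completing a Future by hand.
    val p = Promise[String]()
    p.success("done")
    p.future.foreach(println)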

In which way is akka real-time?

In a couple of places it is stated that Akka is somehow "real-time". E.g.:
http://doc.akka.io/docs/akka/2.0/intro/what-is-akka.html
Unfortunately I was not able to find a deeper explanation of the way in which Akka is "real-time". So this is the question:
In which way is Akka real-time?
I assume Akka is not really a real-time computing system in the sense of the following definition, is it? https://en.wikipedia.org/wiki/Real-time_computing
No language built on the JVM can be real-time in the sense that it is guaranteed to react within a certain amount of time, unless it is using a JVM that supports real-time extensions (and takes advantage of them). It just isn't technically possible, and Akka is no exception.
However, Akka does provide support for running things quickly and with pretty good timing compared to what is possible. And in the docs, the other definitions of real-time (meaning online, while-running, with-good-average-latency, fast-enough-for-you-not-to-notice-the-delay, etc.) may be used on occasion.
Since Akka is a message-driven system, the use of "real-time" relates to one of the definitions in the Wikipedia article you mention: in the domain of data transfer, media processing, and enterprise systems, the term is used to mean "without perceivable delay".
"Real time" here equates to "going with the flow": events/messages are efficiently processed/consumed as they are produced (as opposed to "batch processing").
Akka can be a foundation for a soft real-time system, but not for a hard one, because of the limitations of the JVM. If you scroll a bit down in the Wikipedia article, you will find the section "Criteria for real-time computing", and there is a nice explanation about the different "real-timeness" criteria.
"systems that are subject to a 'real-time constraint', e.g. operational deadlines from event to system response" (en.wikipedia.org/wiki/Real-time_computing)
The Akka folks might be referring to features like futures, which allow you to put a time constraint on expectations about a computation.
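For example, with the standard futures API you can bound how long you are willing to wait for a result; note that the deadline is enforced by the caller, not by the scheduler, so this is a soft constraint at best:

    import scala.concurrent.{Await, Future, TimeoutException}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val work: Future[Int] = Future {
      Thread.sleep(5000) // simulate a slow computation
      42
    }

    try {
      println(s"got ${Await.result(work, 1.second)}") // wait at most one second
    } catch {
      case _: TimeoutException => println("gave up after 1 second")
    }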
Also, Akka's clustering model may be what is meant by an online, real-time system (abstracted so that it looks like it is running locally).
My take is that the Akka platform can support a form of real-time constraint by delivering responsive applications through the use of (I'm quoting here):
Asynchronous, non-blocking and highly performant event-driven programming model
Fault tolerance through supervisor hierarchies with “let-it-crash” semantics
Definition of time-out policies in the response delivery
As already said, all these features combined provide a platform with a form of response-time guarantee, especially compared to the mainstream applications and tools available on the JVM today.
It's still arguable whether Akka could strictly be defined as a real-time computing system, as per Wikipedia's definition.
For such claims to be settled, you would be better off asking the Akka team itself.

What would be a good application for an enhanced version of MapReduce that shares information between Mappers?

I am building an enhancement to the Spark framework (http://www.spark-project.org/). Spark is a project out of UC Berkeley that does MapReduce quickly, in RAM. Spark is built in Scala.
The enhancement I'm building allows some data to be shared between the mappers while they are computing. This can be useful, for example, if each of the mappers is looking for an optimal solution and they all want to share the current best solution (to prune out bad solutions early). The shared solution may be slightly out of date as it propagates, but this should still speed things up. In general, this is called the branch-and-bound approach.
We can share monotonically increasing numbers, but also arrays and dictionaries.
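To make the pruning idea concrete, here is a thread-level sketch of branch and bound with a shared bound; this is not the actual Spark or enhancement API, and parallel-collection workers merely stand in for mappers:

    import java.util.concurrent.atomic.AtomicInteger

    // Shared, monotonically decreasing best-cost bound.
    val bestSoFar = new AtomicInteger(Int.MaxValue)

    // Publish an improved bound with a compare-and-set loop.
    def publish(cost: Int): Unit = {
      var done = false
      while (!done) {
        val old = bestSoFar.get
        done = cost >= old || bestSoFar.compareAndSet(old, cost)
      }
    }

    def explore(costs: Seq[Int]): Unit =
      for (cost <- costs)
        // A stale bound is harmless: it only makes pruning less eager.
        if (cost < bestSoFar.get) publish(cost) // otherwise prune this branch

    val partitions = Seq(Seq(90, 40, 70), Seq(60, 20, 80), Seq(50, 30, 10))
    partitions.par.foreach(explore) // each parallel task plays the role of a mapper
    println(s"best = ${bestSoFar.get}")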
We are also looking at machine learning applications where the mappers describe local natural gradient information, and then a new best current optimal solution is shared among all nodes.
What are some other good real-world applications of this kind of enhancement? What kinds of real, useful applications might benefit from a MapReduce computation with just a little bit of information sharing between mappers? What applications use MapReduce or Hadoop right now but are just a little too slow because of the independence restriction of the map phase?
The benefit can be to either speed up the map phase, or improve the solution.
The enhancement I'm building allows some data to be shared between the mappers while they are computing.
Apache Giraph is based on Google's Pregel, which is based on BSP, and is used for graph processing. In BSP, there is data sharing between the processes in the communication phase.
Giraph depends on Hadoop for its implementation. In general there is no communication between the mappers in MapReduce, but in Giraph the mappers communicate with each other during the communication phase of BSP.
You might also be interested in Apache Hama, which implements BSP and can be used for more than graph processing.
There might be good reasons why mappers don't communicate in MapReduce. Have you considered those factors in your enhancement?
What are some other good real-world applications of this kind of enhancement?
Graph processing is one thing I can think of, similar to Giraph. Check out the different use cases for BSP; some might be applicable to this kind of enhancement. I am also very interested in what others have to say on this.

What kinds of applications/services/components is the actor model (Scala, Erlang) best suited for?

Besides the benefits of this model over the shared-memory model, I'm just trying to understand where to apply it in higher-level use cases.
As for Scala, the actor model fits most of the multi-threaded cases one can think of:
Swing GUI application
Web Applications (see Lift framework)
Application Server in multicore environment:
Batch processing of requests/data
Background tracking tasks
Notifications & Scheduled tasks
The actor model makes designs much clearer and greatly simplifies interprocess communication.
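For example, a background tracking task can be written as a (pre-Akka) scala.actors actor; the event strings below are made up:

    import scala.actors.Actor._

    // Each message is queued and handled one at a time by the actor.
    val tracker = actor {
      loop {
        react {
          case "stop"        => exit()
          case event: String => println(s"tracked: $event")
        }
      }
    }

    tracker ! "user logged in"
    tracker ! "page viewed"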
OTP framework: provides a really good framework for network-based applications.
Helps in making fault-tolerant applications (process restarts using supervisors in OTP; see the Akka analogue sketched after this list).
Both synchronous and asynchronous modes of communication are available using gen_server.
Event-based callbacks can be used via gen_event.
State machines can be programmed easily using gen_fsm (in case you need to track states in your application).
A process crash does not bring the whole application down; only that particular process crashes.
Functional programming language.
A lot easier to program at the binary level.
Garbage collection.
Native compilation option.
A fair number of good, useful modules are available.
Able to make good solid concurrent applications easily.
And lots more... I really enjoyed working on some applications in Erlang; building those in C/C++ would have been very difficult.
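Since the question also mentions Scala: Akka brings the OTP supervision idea to the JVM. A minimal sketch in Akka (classic), with a made-up worker that crashes on bad input:

    import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
    import akka.actor.SupervisorStrategy.Restart

    class Worker extends Actor {
      def receive = {
        case "boom" => throw new RuntimeException("crash")
        case msg    => println(s"handled $msg")
      }
    }

    // The supervisor restarts the crashed worker, in the OTP "let it crash" style.
    class Supervisor extends Actor {
      override val supervisorStrategy: SupervisorStrategy =
        OneForOneStrategy() { case _: RuntimeException => Restart }

      private val worker = context.actorOf(Props[Worker], "worker")
      def receive = { case msg => worker forward msg }
    }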

Is OpenCL good for agent based simulation?

I'm learning Scala with the aim of writing agent-based simulations using actor concurrency. I currently know very little about OpenCL, and before I dive in, can anyone tell me whether it is likely to be appropriate for/compatible with agent-based simulations?
If so, then ScalaCL looks very attractive.
You should use OpenCL if you have some heavyweight computations that can be parallelized and you want to use your graphics card to do them (or parts of them).
It has a somewhat strange model of computation (at least if you only know "general" programming and not how the GPU works, or unless you have a strong background in certain areas of math), and quite a few limitations on what you can do and how.
So I think it's quite unlikely that's what you are looking for.
Actors have very little to do with OpenCL; I think the only commonality between the two is that they both address the problem of parallel computation, but from very different perspectives. IMO the actor model is much easier to understand and probably also to use (but that's just a guess, as I haven't really dealt with OpenCL so far).
If you want to implement an agent-based system then actors can be quite useful. You could have a look at the standard Scala actors, or at alternative implementations:
Akka, which offers a lot of additional functionality on top of actors, plus nice docs with some tutorials
actors in scalaz
OpenCL is generally only good for speeding up programs that involve doing the same thing many times with different data. If your agents will all be doing the same thing at the same time, then yes, it might be appropriate and compatible.
Otherwise, the two just don't fit well together, and OpenCL will probably make things run slower rather than faster.