What would be a good application for an enhanced version of MapReduce that shares information between Mappers? - scala

I am building an enhancement to the Spark framework (http://www.spark-project.org/). Spark is a project out of UC Berkeley that does MapReduce quickly in RAM. Spark is built in Scala.
The enhancement I'm building allows some data to be shared between the mappers while they are computing. This can be useful, for example, if each mapper is searching for an optimal solution and they all want to share the current best solution (to prune bad solutions early). The shared value may be slightly out of date as it propagates, but this should still speed up the search. In general, this is the branch-and-bound approach.
We can share monotonically increasing numbers, but we can also share arrays and dictionaries.
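As a minimal sketch of how such a shared bound might look, suppose the enhancement exposes a primitive like `SharedMin` below (a hypothetical name; in the real system reads may be slightly stale while updates propagate). A mapper doing branch-and-bound could then prune against it:

```scala
import java.util.concurrent.atomic.AtomicLong

// Local stand-in for the proposed shared bound; in the actual enhancement,
// updates would propagate asynchronously to every mapper.
class SharedMin(initial: Long) {
  private val value = new AtomicLong(initial)
  def get: Long = value.get()                        // may be slightly stale remotely
  def update(candidate: Long): Unit =
    value.getAndUpdate(v => math.min(v, candidate))  // keep the best (lowest) cost
}

object BranchAndBound {
  // Assume cost never decreases along a path, so pruning is sound.
  case class Node(cost: Long, children: Seq[Node] = Nil)

  def search(node: Node, bound: SharedMin): Unit = {
    if (node.cost >= bound.get) return                 // prune via the shared bound
    if (node.children.isEmpty) bound.update(node.cost) // leaf: publish new best
    else node.children.foreach(search(_, bound))
  }
}
```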
We are also looking at machine learning applications where the mappers compute local natural gradient information and the new current best solution is then shared among all nodes.
What are some other good real-world applications of this kind of enhancement? What kinds of real, useful applications might benefit from a MapReduce computation with just a little information-sharing between mappers? Which applications use MapReduce or Hadoop right now but are just a little too slow because of the independence restriction of the map phase?
The benefit can be to either speed up the map phase, or improve the solution.

The enhancement I'm building allows some data to be shared between the mappers while they are computing.
Apache Giraph is based on Google Pregel which is based on BSP and is used for graph processing. In BSP, there is data sharing between the processes in the communication phase.
Giraph depends on Hadoop for implementation. In general there is no communication between the mappers in MapReduce, but in Giraph the mappers communicate with each other during the communication phase of BSP.
You might also be interested in Apache Hama, which implements BSP and can be used for more than graph processing.
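To give a feel for the BSP model those projects implement, here is a rough Scala sketch of a Pregel-style superstep (illustrative only; Giraph's actual API is Java, and Hama's differs). Each vertex consumes the messages sent to it in the previous superstep, updates its value, and emits messages for the next one, with a global barrier between supersteps:

```scala
object PregelSketch {
  // Single-source shortest paths with unit edge weights, the canonical example.
  case class Vertex(id: Int, var dist: Double, neighbours: Seq[Int])

  // One superstep: returns (targetVertexId, message) pairs for the next round.
  def superstep(v: Vertex, incoming: Seq[Double]): Seq[(Int, Double)] = {
    val best = (v.dist +: incoming).min
    if (best < v.dist) {
      v.dist = best
      v.neighbours.map(n => n -> (best + 1.0)) // propagate the improved distance
    } else {
      Seq.empty                                // no change: vote to halt
    }
  }
}
```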
There might be some reasons why mappers don't communicate in MapReduce. Have you considered these factors in your enhancement?
What are some other good real-world applications of this kind of enhancement?
Graph processing is one thing I can think of, similar to Giraph. Check out the different use cases for BSP; some might be applicable to this kind of enhancement. I am also very interested in what others have to say on this.

Related

How to profile Akka applications?

I have a small Akka application that passes many messages between its actors, and each actor does some calculations on the data it receives. What I want is to profile this application in order to see which parts of the code take up the most time, and so on.
I tried VisualVM but I cannot really understand what's going on. I added a picture of the profiler output.
My questions are
What for example is this first line and why does it take up so much time? (scala.concurrent.forkjoin.ForkJoinPool.scan())
Can Akka applications, given their asynchronous behaviour, be profiled well at all?
Can I see, for instance, how long one specific actor(-type) works on one specific message(-type) it receives?
Are there other best-practices for profiling Akka applications?
There are packages that are not profiled by default, and it is their time that gets accounted to scala.concurrent.forkjoin.ForkJoinPool.scan() in the profile. If all the hidden packages are allowed to be sampled, the true CPU-time consumers will be revealed. For example, comparing such before/after profiles uncovers that threads are put to sleep most of the time by sun.misc.Unsafe.park, waiting to be unparked.
Akka applications can be profiled quite well with proper instrumentation and call tracing. Google's Dapper paper ("Dapper, a Large-Scale Distributed Systems Tracing Infrastructure") contains a detailed explanation of the technique. Twitter created Zipkin based on it; Zipkin is open source and has an extension for distributed tracing of Akka. Follow its wiki for a good explanation of how to set up a system that allows you to:
trace call hierarchies inside an actor system;
debug request processing pipelines (you can log to traces, annotate them with custom key-value pairs);
see dependencies between derived requests and their contribution to resulting response time;
find and analyse the slowest requests in your system.
There is also a new kid on the block, Kamon. It is a reactive-friendly toolkit for monitoring applications that run on the JVM, with particular emphasis on applications built with the Typesafe Reactive Platform. That definitely means yes for Akka, and the integration comes in the form of the kamon-akka and kamon-akka-remote modules, which use bytecode instrumentation to gather metrics and perform automatic trace-context propagation on your behalf. Explore the documentation starting from the Akka Integration Overview to understand what it can do and how to achieve it.
Just a couple of days ago Typesafe announced that the Typesafe Console is now free. I don't know what could be better for profiling Scala/Akka applications. Of course you can try JProfiler for JVM languages; I've used it with Java projects, but it's not free and it is aimed at Java.
I was thinking about profiling/metrics in code since I also use Akka/Scala a lot for building production applications, and I am also eager to hear about alternative ways to make sure an application is healthy.
Metrics (like Dropwizard)
Very good tool for collecting metrics in the code, with good documentation and embedded support for Graphite, Ganglia, Logback, etc.
It has comprehensive tools for collecting in-app statistics like gauges, counters, histograms and timers: information to figure out the current state of your app, such as how many actors were created, whether they are alive, and what state the majority of actors are in.
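For instance, timing an actor's message handling with a Dropwizard Timer might look like the following sketch in Scala (the actor and metric names are made up for illustration):

```scala
import java.util.concurrent.TimeUnit
import akka.actor.Actor
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

object Metrics {
  val registry = new MetricRegistry()
  // Dump a snapshot of all metrics to stdout every 10 seconds.
  ConsoleReporter.forRegistry(registry)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build()
    .start(10, TimeUnit.SECONDS)
}

class Worker extends Actor {
  private val timer = Metrics.registry.timer("worker.compute")

  def receive = {
    case n: Int =>
      val ctx = timer.time()        // start timing this message
      try sender() ! (1 to n).sum   // the "work" being measured
      finally ctx.stop()            // records the duration into a histogram
  }
}
```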
Agreed, it's a bit different from profiling, but it helps a lot in finding the root of a problem, especially if integrated with some chart-building tool.
Profilers (like VisualVM, XRebel)
Since I'm a big fan of monitoring, it still answers a slightly different question: what is going on inside my application right now?
But quite another question may concern us: which part of my code is slow (or sloppy)?
For that reason, we have VisualVM and the other answers to this question, such as how to profile Akka actors with VisualVM.
Also, I'd suggest trying the XRebel profiler, which adds a bit more firepower to the process of figuring out which code makes the app slow. It's also paid, but on my project it saved a lot of time spent dealing with sloppy code.
New Relic
I'd suggest it for playground projects, since you can get some monitoring/profiling solutions for free, but for more serious projects I'd go with the things I highlighted above.
I hope this overview was helpful.

In which way is akka real-time?

In a couple of places it is stated that Akka is somehow "real-time". E.g.:
http://doc.akka.io/docs/akka/2.0/intro/what-is-akka.html
Unfortunately I was not able to find a deeper explanation of the way in which Akka is "real-time". So this is the question:
In which way is akka real-time?
I assume Akka is not really a real-time computing system in the sense of the following definition, is it? https://en.wikipedia.org/wiki/Real-time_computing
No language built on the JVM can be real-time in the sense that it's guaranteed to react within a certain amount of time unless it is using a JVM that supports real-time extensions (and takes advantage of them). It just isn't technically possible--and Akka is no exception.
However, Akka does provide support for running things quickly and with pretty good timing compared to what is possible. And in the docs, the other definitions of real-time (meaning online, while-running, with-good-average-latency, fast-enough-for-you-not-to-notice-the-delay, etc.) may be used on occasion.
Since Akka is a message-driven system, the use of "real-time" relates to one of the definitions in the Wikipedia article you mention: in the domain of data transfer, media processing and enterprise systems, the term is used to mean "without perceivable delay".
"real time" here equates to "going with the flow": events/messages are efficiently processed/consumed as they are produced (in opposition to "batch processing").
Akka can be a foundation for a soft real-time system, but not for a hard one, because of the limitations of the JVM. If you scroll a bit down in the Wikipedia article, you will find the section "Criteria for real-time computing", and there is a nice explanation about the different "real-timeness" criteria.
"systems that are subject to a 'real-time constraint', e.g. operational deadlines from event to system response." (en.wikipedia.org/wiki/Real-time_computing)
The Akka guys might be referring to features like futures, which allow you to add a time constraint on expectations from a computation.
Also, the clustering model of Akka may be used to mean an online system which is real-time (abstracted so as to look like it's running locally).
My take is that the Akka platform can support a form of real-time constraint by delivering responsive applications through the use of (I'm quoting here):
Asynchronous, non-blocking and highly performant event-driven programming model
Fault tolerance through supervisor hierarchies with “let-it-crash” semantics
Definition of time-out policies in the response delivery
As already said, all these features combined provide a platform with a form of response-time guarantee, especially compared to mainstream applications and tools available nowadays on the JVM.
It's still arguable whether Akka could strictly be defined as a real-time computing system, as per Wikipedia's definition.
For such claims to be proven, you would be better off referring to the Akka team itself.
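To make the time-out point above concrete, here is a small Akka sketch using the ask pattern; note that the deadline is a soft one enforced by the library (the future fails with an AskTimeoutException), not a hard real-time guarantee:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

class Echo extends Actor {
  def receive = { case msg => sender() ! msg } // reply with whatever arrives
}

object TimeoutDemo extends App {
  val system = ActorSystem("demo")
  val echo = system.actorOf(Props[Echo], "echo")

  implicit val timeout: Timeout = Timeout(1.second) // response-delivery deadline
  val reply = echo ? "ping"                         // a Future that may time out
  println(Await.result(reply, 2.seconds))
  system.terminate()
}
```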

What makes a web-based framework scalable?

Thank you very much in advance.
First of all, I conceive of scalability as the ability to design a system that does not need to change when the demand for its services, whatever they are, increases considerably. Should you need more hardware (vertically or horizontally)? Fine, add it at your leisure, because the system is prepared and has been designed to cope with it.
My question is simple to ask but presumably very complex to answer. I would like to know what I should look at in a framework to make sure it will scale accordingly, both in number of hits and in number of sessions running simultaneously.
This question is not about technology nor a particular framework at all, it is more a theoretical question.
I know that this depends very much on having a good database design and proper hardware behind it with replication, etc. Let's assume all that exists; still, my framework must meet some criteria. Which ones?
Provide a memcache?
Ability to run across multiple machines (at the web-server level) and use many replicated databases? But what is it in the software that makes that possible?
etc...
Please, let's not relate the answers with any particular programming language or technology behind.
Thanks again,
D.
I think scalability depends most of all on the use case: if you expect huge amounts of data, you should focus on the database; if it's about traffic, focus on the server; if it's about adding new features, focus on your data model and the framework you are using...
Comparing a microposts service like Twitter to a university website or a web service like Google Docs, you will find quite different requirements.
First of all, the common notion of scalability is the ability of a piece of software to improve in throughput or capacity when more hardware resources are added (CPUs, memory, bandwidth, etc.). Software that does not improve with increased resources is not scalable.
Definitions aside, I think your question relates to evaluating frameworks you plan to introduce into your implementation that may affect your software's ability to scale.
IMHO the most important factor to evaluate when introducing a framework is whether there is hidden serialization in it (serialization that in effect transfers to, and constrains, your software). If a framework introduces serialization into your application, that can limit your ability to scale.
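This is essentially Amdahl's law: the serial fraction of a workload bounds the speedup you can get from adding resources. A toy Scala illustration of a hidden serialization point (all names illustrative):

```scala
// On Scala 2.13+ this import needs the scala-parallel-collections module;
// on 2.12 and earlier, .par is built in and the import is unnecessary.
import scala.collection.parallel.CollectionConverters._

object HiddenSerialization {
  private var total = 0L // shared mutable state guarded by a lock

  private def expensiveWork(i: Int): Long =
    (1 to 100000).foldLeft(i.toLong)(_ + _)

  // Every thread funnels through one lock, so throughput stops improving
  // as cores are added, even though the "work" itself runs in parallel.
  def serializedSum(items: Seq[Int]): Long = {
    total = 0L
    items.par.foreach { i =>
      val r = expensiveWork(i)
      synchronized { total += r } // the hidden serialization point
    }
    total
  }

  // Combining results without shared state scales with the available cores.
  def scalableSum(items: Seq[Int]): Long =
    items.par.map(expensiveWork).sum
}
```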
How to evaluate?
Careful source code inspection (if open source)
Are there any performance guarantees offered by those who build the framework?
Do measurements yourself to see how introducing the framework affects your performance, and replace it if you are not satisfied (see the sketch below).
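For the last point, even a crude harness is enough to compare the same code path with and without the candidate framework (a sketch; handleDirectly and handleViaFramework are placeholders for your own code):

```scala
object Throughput {
  // Run `work` repeatedly and report operations per second.
  def measure(label: String, iterations: Int)(work: => Unit): Unit = {
    (1 to 1000).foreach(_ => work)   // warm up the JIT first
    val start = System.nanoTime()
    (1 to iterations).foreach(_ => work)
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(f"$label: ${iterations / elapsedSec}%.0f ops/s")
  }
}

// Usage (with your real code paths substituted in):
//   Throughput.measure("baseline", 100000) { handleDirectly(request) }
//   Throughput.measure("with framework", 100000) { handleViaFramework(request) }
```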

Clojure futures in context of Scala's concurrency models

After being exposed to Scala's actors and Clojure's futures, I feel like both languages have excellent support for multi-core data processing.
However, I still have not been able to determine the real engineering differences between their concurrency features, or the pros and cons of the two models. Are these languages complementary or opposed in their treatment of concurrent process abstractions?
Secondly, regarding big-data issues, it's not clear whether the Scala community continues to support Hadoop explicitly (whereas the Clojure community clearly does). How do Scala developers interface with the Hadoop ecosystem?
Some problems are well solved by agents/actors and some are not. The distinction is not really about languages so much as about how specific problems fit within general classes of solutions. This is a (very short) comparison of actors/agents vs. references to clarify the point that the tool must fit the concurrency problem.
Actors excel in distributed situations where no data needs to be concurrently modified. If your problem can be expressed purely by passing messages, then actors will do the trick. Actors work poorly where they need to modify several related data structures at the same time; the canonical example of this is moving money between bank accounts.
Clojure's refs are a great solution to the problem of many threads needing to modify the same thing at the same time. They excel on shared-memory multiprocessor systems like today's PCs and servers. In addition to the bank-account example, Rich Hickey (the author of Clojure) uses the example of a baseball game to explain why this is important. If you wanted to use actors to represent a baseball game, then before you moved the ball, all the fans would have to send it a message asking where it was... and if they wanted to watch a player catching the ball, things would get even more complex.
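To illustrate the bank-account point in Scala: the usual actor-style workaround is to make a single actor own all the related state, so both updates happen atomically inside one message handler (a sketch with illustrative names; Clojure's refs achieve the same atomicity with an STM transaction instead):

```scala
import akka.actor.Actor

case class Transfer(from: String, to: String, amount: Long)

class Bank extends Actor {
  // Both accounts live inside one actor, so transfers are serialized
  // through its mailbox and no intermediate state is ever observable.
  private var accounts = Map("alice" -> 100L, "bob" -> 50L)

  def receive = {
    case Transfer(from, to, amount) if accounts.getOrElse(from, 0L) >= amount =>
      accounts = accounts
        .updated(from, accounts(from) - amount)
        .updated(to, accounts(to) + amount)
      sender() ! "ok"
    case _: Transfer =>
      sender() ! "insufficient funds"
  }
}
```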
Clojure has Cascalog, which makes writing Hadoop jobs look a lot like writing Clojure.
Actors provide a way of handling the potential interleaving and synchronization control that inevitably comes up when trying to get multiple threads to work together. Each actor has a queue of messages that it processes in order, one at a time, so as to avoid the need for explicit locks. In this context a future provides a way of waiting for a response from an actor.
As far as Hadoop is concerned, Twitter just released Scalding, a library specifically for Hadoop, but as long as a library is written for the JVM, it should work with either language.
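The classic Scalding example is a word count, which reads much like ordinary Scala collections code (this follows the fields-based API from Scalding's tutorial; input and output paths come in as job arguments):

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                  // one tuple per line of text
    .flatMap('line -> 'word) { line: String =>
      line.toLowerCase.split("\\s+")       // split each line into words
    }
    .groupBy('word) { _.size }             // count occurrences of each word
    .write(Tsv(args("output")))            // tab-separated (word, count) output
}
```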

Looking for a mature, scalable GraphDB with .NET or C++ binding

My basic requirements from a GraphDB:
Mature (production-ready)
Native .NET or C++ language binding
Horizontal scalability: both
Automated data redundancy and sharding
Distributed graph algorithms / query execution
Currently I have disqualified the following:
InfiniteGraph: no C++ / .NET language binding
HyperGraphDB: no C++ / .NET language binding
Microsoft Trinity: Not mature
Neo4j: not distributed
I'm not sure about the scalability of the following:
Sparsity DEX
Franz Inc. AllegroGraph
Sones GraphDB
I found the available information about horizontal scalability capabilities quite general. I guess there are good reasons for this.
Any information would be appreciated.
Unfortunately, your basic requirements already go beyond today's general understanding of graphs, even in academia. No pure graph database listed here will be able to satisfy all your needs. Distributed graph algorithms that are aware of large, distributed but interconnected graphs are still a big research issue. So for your application it might be best to find a well-matching graph database, graph-processing stack or RDF store and implement the missing parts on your own.
If your application is mostly Online Transactional Graph Processing (OLTP) (read/write heavy) with a focus on the vertices, and you can forgo the distributed algorithms for a moment, then use one of these:
Neo4j
OrientDB
DEX
HyperGraphDB
InfiniteGraph
InfoGrid
Microsoft Horton
If it is more Online Analytical Processing (OLAP) (mostly read), still with a focus on the vertices, and distribution really matters, then:
Apache Hama (early stage project)
Microsoft Trinity (research project)
Golden Orb (good, but Java only)
Signal/Collect (http://www.ifi.uzh.ch/ddis/research/sc, but a research project)
Or, if the focus is more on the edges and on logical reasoning/pattern matching, and you need (or can better live with) distribution at the edge level, as in the Semantic Web, then use one of these RDF/triple/quad stores:
AllegroGraph (okay, they are a graphdb/rdf store hybrid ;)
Jena
Sesame
Stardog
Virtuoso
...and many more RDF stores
Good starting points might be DEX or Neo4j: if you're looking for a good and really fast graph-database kernel for C++, DEX might be best, but you would have to implement a lot of the networking and distribution yourself. Neo4j offers a lot of distribution and fault tolerance, though at the moment more at a vertex-sharding level, and its kernel is Java. For ideas and inspiration on implementing distributed graph algorithms, perhaps take a look at Golden Orb and Signal/Collect.
An alternative approach might be to start with AllegroGraph or Stardog. AllegroGraph in particular might be a bit tricky in the beginning until you get adapted to its way of thinking. Stardog is still young and Java-based, but fast and already quite mature.