Distributing Scala over a cluster?

So I've recently started learning Scala and have been using graphs as sort of my project-to-improve-my-Scala, and it's going well - I've since managed to easily parallelize some graph algorithms (that benefit from data parallelization) courtesy of Scala 2.9's amazing support for parallel collections.
However, I want to take this one step further and have it parallelized not just on a single machine but across several. Does Scala offer any clean way to do this like it does with parallel collections, or will I have to wait until I get to the chapter in my book on Actors/learn more about Akka?
Thanks!
-kstruct

There was an attempt at creating distributed collections, but the project is currently frozen.
Alternatives would be Akka (which recently got a really cool addition: Akka Cluster), which you've already mentioned, or full-fledged cluster engines. These are not parallel collections in any sense - they are closer to distributing your computation over a cluster than to a collections API - but they could still be used for your task: Scoobi for Hadoop, Storm, or even Spark (specifically Bagel, for graph processing).
There is also Swarm, which was built on top of delimited continuations.
Last but not least is Menthor - the authors claim it is an especially good fit for graph processing, and it makes use of Actors.
Since you're aiming to work with graphs, you may also want to look at Cassovary, which was recently open-sourced by Twitter.
Signal/Collect is a framework for parallel data processing backed by Akka.

You can use Akka ( http://akka.io ) - it has long been the most advanced and powerful actor and concurrency framework for Scala, and the fresh-baked version 2.0 adds nice transparent actor remoting, hierarchies and supervision. The canonical way to do parallel computations is to create as many actors as there are parallel parts in your algorithm, optionally spread them over several machines, send them the data to process, and then gather the results (see here).
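Below is a minimal single-machine sketch of that scatter/gather pattern with classic Akka actors; the Work/Result messages, the chunking, and the worker count are all invented for the example (with remoting you would obtain the worker ActorRefs differently, but the shape stays the same):
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages for this sketch only.
case class Work(chunk: Seq[Int])
case class Result(partialSum: Long)

class Worker extends Actor {
  def receive = {
    case Work(chunk) => sender() ! Result(chunk.map(_.toLong).sum)
  }
}

class Master(numWorkers: Int, data: Seq[Int]) extends Actor {
  private val workers = Vector.fill(numWorkers)(context.actorOf(Props(new Worker)))
  // Ceiling division so we never produce more chunks than workers.
  private val chunks  = data.grouped((data.size + numWorkers - 1) / numWorkers).toVector
  private var pending = chunks.size
  private var total   = 0L

  // Scatter: one chunk per worker.
  chunks.zipWithIndex.foreach { case (chunk, i) => workers(i) ! Work(chunk) }

  // Gather: accumulate partial results until every worker has replied.
  def receive = {
    case Result(partialSum) =>
      total += partialSum
      pending -= 1
      if (pending == 0) {
        println(s"total = $total")
        context.system.terminate()
      }
  }
}

object ScatterGather extends App {
  val system = ActorSystem("scatter-gather")
  system.actorOf(Props(new Master(numWorkers = 4, data = 1 to 1000000)))
}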

Related

Is it possible in Spark to run concurrent jobs on the same SparkSession?

I am an amateur user of Spark and Scala. Although I did numerous searches, I could not find an answer.
Is it possible to assign different tasks to different executors at the same time from a single driver program?
For example, suppose we have 10 nodes. I want to write code that classifies a dataset using the Naive Bayes algorithm with five workers, and at the same time assigns the other five workers the task of classifying the dataset with a decision tree algorithm. Afterward, I will combine the answers.
HamidReza,
What you want to achieve is running two actions in parallel from your driver. It's definitely possible, but it only makes sense if each action does not need the whole cluster by itself (it is really a question of better resource management).
You can use concurrency for this. There are many ways of implementing a concurrent program, starting with Futures (not an approach I can really recommend, but it seems to be the most popular choice in Scala), up to more advanced types like Tasks (take a look at popular functional libraries such as Monix, Cats Effect, or ZIO).
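A rough sketch of the Futures approach (trainNaiveBayes, trainDecisionTree and data are placeholders for the real Spark/MLlib calls): both actions are submitted from the driver at once, and the driver blocks only when it needs the combined result.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Placeholders for whatever Spark actions you actually run, e.g. fitting
// two MLlib models on the same cached DataFrame.
def trainNaiveBayes(data: Seq[Double]): String   = s"nb-model(${data.size})"
def trainDecisionTree(data: Seq[Double]): String = s"dt-model(${data.size})"
val data = Seq(1.0, 2.0, 3.0)

// Kick off both jobs concurrently from the driver...
val naiveBayesF   = Future(trainNaiveBayes(data))
val decisionTreeF = Future(trainDecisionTree(data))

// ...and block once, when both results are needed.
val (nbModel, dtModel) = Await.result(naiveBayesF.zip(decisionTreeF), Duration.Inf)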

What considerations should be taken when deciding whether or not to use Apache Spark?

In the past, for jobs that required a heavy processing load, I would use Scala and parallel collections.
I'm currently experimenting with Spark and find it interesting, but the learning curve is steep. I find development slower, as I have to work with a reduced Scala API.
What do I need to determine before deciding whether or not to use Spark?
The current Spark job I'm trying to implement processes approximately 5GB of data. This data is not huge, but I'm running a Cartesian product over it, which generates data in excess of 50GB. Maybe using Scala parallel collections would be just as fast, and I know the development time for the job would be shorter from my point of view.
So what considerations should I take into account before deciding to use Spark?
The main advantages Spark has over traditional high-performance computing frameworks (e.g. MPI) are fault-tolerance, easy integration into the Hadoop stack, and a remarkably active mailing list http://mail-archives.apache.org/mod_mbox/spark-user/ . Getting distributed fault-tolerant in-memory computations to work efficiently isn't easy and it's definitely not something I'd want to implement myself. There's a review of other approaches to the problem in the original paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf .
However, when my work is I/O bound, I still tend to rely primarily on Pig scripts, as Pig is more mature and I think the scripts are easier to write. Spark has been great when Pig scripts won't cut it (e.g. iterative algorithms, graphs, lots of joins).
Now, if you've only got 50GB of data, you probably don't care about distributed fault-tolerant computations (if everything sits on a single node then there's no framework in the world that can save you from a node failure :) ), so parallel collections will work just fine.
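For comparison, a single-node parallel-collections version of that kind of Cartesian-product job can stay tiny (the data and scoring function here are invented; the point is that parallelising the outer loop and reducing per element means the full pairwise product never has to be materialised):
// Hypothetical inputs and scoring function, just to show the shape of the job.
val left  = (1 to 5000).toVector
val right = (1 to 5000).toVector
def score(a: Int, b: Int): Double = -math.abs(a - b).toDouble

// Parallelise the outer loop; each task scans `right` and keeps only its
// best pair, so nothing close to 50GB ever has to live in memory at once.
val best = left.par
  .map(a => right.map(b => (a, b, score(a, b))).maxBy(_._3))
  .maxBy(_._3)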

What are the options for Hadoop on Scala?

We are starting a big-data analytics project and we are considering adopting Scala (the Typesafe stack). I would like to know about the various Scala APIs/projects that are available for writing Hadoop MapReduce programs.
Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is also meant to be very compatible with the standard Scala collections API. Just as you can call flatMap, map, or groupBy on normal collections, you can do the same on Scalding Pipes, which you can imagine as a distributed List of tuples. There's also a typed version of the API which provides stronger type-safety guarantees (sketched below). I haven't used Scoobi, but the API seems similar to what they have.
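For a feel of the typed API, here is roughly the canonical word-count job; the input/output paths come from the job's Args, and the whitespace split is my choice for the sketch:
import com.twitter.scalding._

// A word-count job in the typed API: read lines, split into words,
// sum the counts per word, and write (word, count) pairs out as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(line => line.split("\\s+").filter(_.nonEmpty))
    .map(word => (word, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("output")))
}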
Additionally, there are a few other benefits:
Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
It has several active contributors both inside and outside Twitter that are committed to making it great.
It is interoperable with your existing Cascading jobs.
In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
It provides a robust Matrix Library.
I've had success with Scoobi. It's straightforward to use, strongly typed, hides most of the Hadoop mess (by doing things like automatically serializing your objects for you), and totally Scala. One of the things I like about its API is that the designers wanted the Scoobi collections to feel just like the standard Scala collections, so you actually use them much the same way, except that operations run on Hadoop instead of locally. This actually makes it pretty easy to switch between Scoobi collections and Scala collections while you're developing and testing.
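From memory, a Scoobi word count looks roughly like the following; the exact combinator names (mapFlatten vs flatMap, the Sum.int reduction) have shifted between Scoobi releases, so treat this as a sketch of the shape rather than copy-paste code:
import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    // A DList behaves much like a Scala collection, but each stage
    // runs as Hadoop MapReduce jobs.
    val lines = fromTextFile(args(0))

    val counts = lines.mapFlatten(_.split(" "))
                      .map(word => (word, 1))
                      .groupByKey
                      .combine(Sum.int)

    persist(counts.toTextFile(args(1)))
  }
}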
I've also used Scrunch, which is built on top of the Java-based Crunch. I haven't used it in a while, but it's now part of Apache.
Twitter is investing a lot of effort into Scalding, including a nice Matrix library that could be used for various machine learning tasks. I need to give Scoobi a try, too.
For completeness, if you're not wedded to MapReduce, have a look at the Spark project. It performs far better in many scenarios, including in their port of Hive to Spark, appropriately called Shark. As a frequent Hive user, I'm excited about that one.
The first two I would likely investigate are Scalding (which builds on top of Cascading) and Scoobi. I have not used either, but Scalding, in particular, looks like it provides a really nice API.
Another option is Stratosphere. It offers a Scala API that converts Scala types to Stratosphere's internal data types.
The API is quite similar to Scalding, but Stratosphere natively supports advanced data flows (so you don't have to chain MapReduce jobs). You should see much better performance with Stratosphere than with Scalding.
Stratosphere does not run on Hadoop MapReduce but on Hadoop YARN, so you can use your existing YARN cluster.
This is the word count example in Stratosphere (with the Scala API):
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

How are Scala 2.9 parallel collections working behind the scenes?

Scala 2.9 introduced parallel collections. They are a really great tool for certain tasks. However, how do they work internally and am I able to influence the behavior/configuration?
What method do they use to figure out the optimal number of threads? If I am not satisfied with the result are there any configuration parameters to adjust?
I'm not only interested in how many threads are actually created; I'm also interested in how the actual work is distributed amongst them, how the results are collected, and how much magic is going on behind the scenes. Does Scala somehow test whether a collection is large enough to benefit from parallel processing?
Briefly, there are two orthogonal aspects to how your operations are parallelized:
The extent to which your collection is split into chunks (i.e. the size of the chunks) for a parallelizable operation (such as map or filter)
The number of threads to use for the underlying fork-join pool (on which the parallel tasks are executed)
For #2, this is managed by the pool itself, which discovers the "ideal" level of parallelism at runtime (see java.lang.Runtime.getRuntime.availableProcessors)
For #1, this is a separate problem and the scala parallel collections API does this via the concept of work-stealing (adaptive scheduling). That is, when a particular piece of work is done, a worker will attempt to steal work from other work-queues. If none is available, this is an indication that all of the processors are very busy and hence a bigger chunk of work should be taken.
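As for configuration knobs: the size of the underlying pool can be tuned per collection. On Scala 2.10 and 2.11 you attach your own fork-join pool through tasksupport (on 2.9 the setting lived on the global default pool instead, and on 2.12+ the pool class comes from java.util.concurrent rather than scala.concurrent.forkjoin). A small sketch:
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val numbers = (1 to 1000000).toVector.par

// Run this collection's operations on a 4-thread pool instead of the
// default one sized from Runtime.getRuntime.availableProcessors.
numbers.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

val squares = numbers.map(x => x.toLong * x)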
Aleksandar Prokopec, who implemented the library, gave a talk at this year's ScalaDays which should be online shortly. He also gave a great talk at ScalaDays 2010 where he describes in detail how the operations are split and re-joined (there are a number of issues that are not immediately obvious, and some lovely bits of cleverness in there too!).
A more comprehensive answer is available in the PDF describing the parallel collections API.

Parallelization/Cluster Options For Code Execution

I'm coming from a Java background and have a CPU-bound problem that I'm trying to parallelize to improve performance. I have broken up my code so that it works in a modular way, so that it can (hopefully) be distributed and run in parallel.
@Transactional(readOnly = false, propagation = Propagation.REQUIRES_NEW)
public void runMyJob(List<String> someParams) {
    doComplexEnoughStuffAndWriteToMysqlDB();
}
Now, I have been thinking of the following options for parallelizing this problem and I'd like people's thoughts/experience in this area.
Options I am currently thinking of:
1) Use Java EE (e.g. JBoss) clustering and MessageDrivenBeans. The MDBs sit on the slave nodes in the cluster. Each MDB can pick up an event which kicks off a job as above. AFAIK Java EE MDBs are multithreaded by the app server, so this should hopefully also be able to take advantage of multiple cores. Thus it should be vertically and horizontally scalable.
2) I could look at using something like Hadoop and MapReduce. My concern here is that my job-processing logic is actually quite high level, so I'm not sure how well it translates to MapReduce. Also, I'm a total newbie to MR.
3) I could look at something like Scala, which I believe makes concurrent programming much simpler. However, while this is vertically scalable, it's not a cluster/horizontally scalable solution on its own.
Anyway, hope all that makes sense and thank you very much for any help provided.
The solution you are looking for is Akka. Clustering is a feature under development and should be included in Akka 2.1.
Excellent Scala and Java APIs, extremely complete
Purely message-oriented pattern, with no shared state
Fault resistant and scalable
Extremely easy to distribute jobs
Please get rid of J2EE, if you still can. You are very welcome to join the Akka mailing list to ask your questions.
You should have a look at Spark.
It is a cluster computing framework written in Scala that aims to be a viable alternative to Hadoop.
It has a number of nice features:
In-Memory Computations: You can control the degree of caching
Hadoop input/output interoperability: Spark can read and write data from all the Hadoop input sources, such as HDFS and S3
The concept of "Resilient Distributed Datasets" (RDD) which allows you to directly execute most of MR style workloads in parallel on a cluster as you would do locally
Primary API is Scala, with optional Python and Java APIs
It makes use of Akka :)
If I understand your question correctly, Spark would combine your options 2) and 3).
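For a flavour of what that looks like, here is a minimal RDD sketch (the HDFS path and the "src dst" line format are invented for the example): the parsed edge list is cached once and reused by two different computations, combining the in-memory computation and Hadoop I/O points above.
import org.apache.spark.{SparkConf, SparkContext}

object EdgeStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("edge-stats"))

    // Hypothetical input: one "src dst" pair per line.
    val edges = sc.textFile("hdfs:///data/edges.txt")
      .map { line => val Array(src, dst) = line.split(" "); (src, dst) }
      .cache() // keep the parsed RDD in memory for both jobs below

    val outDegrees  = edges.map { case (src, _) => (src, 1L) }.reduceByKey(_ + _)
    val vertexCount = edges.flatMap { case (s, d) => Seq(s, d) }.distinct().count()

    println(s"vertices: $vertexCount, top out-degree: " +
      outDegrees.map(_.swap).max()._1)

    sc.stop()
  }
}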