I know parallel collections will become available.
What form will these take, and what else are we likely to see?
For the full list, see: Beyond 2.8 - A Roadmap
The main addition seems to be parallel collections. They are drop-in replacements for the standard Scala collections, but their methods are executed in parallel.
From the Scala Days presentation by Aleksandar Prokopec:
"Scala parallel collections that will be introduced in 2.8 reimplement standard collection operations while keeping compatibility with existing Scala collection framework. They also introduce new operations characteristic for parallel algorithms, and a few contracts the programmer should be aware of."
For a good video explanation of parallel collections, see Scala Parallel Collections - Aleksandar Prokopec
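To make the "drop-in replacement" idea concrete, here is a minimal sketch. It assumes Scala 2.9 through 2.12, where .par is part of the standard library; on 2.13 and later, parallel collections live in the separate scala-parallel-collections module and additionally need import scala.collection.parallel.CollectionConverters._. The object and method names are made up for illustration.

```scala
object ParSketch {
  // Sequential version: ordinary collection operations.
  def sumOfSquares(xs: Vector[Int]): Int =
    xs.map(x => x * x).sum

  // Parallel version: the same calls after .par, executed on several threads.
  def parSumOfSquares(xs: Vector[Int]): Int =
    xs.par.map(x => x * x).sum

  // One of the "contracts": the operator given to reduce should be associative,
  // because the order in which partial results are combined is not fixed.
  // max is associative, so this is safe; subtraction would not be.
  def parMax(xs: Vector[Int]): Int =
    xs.par.reduce((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    val xs = (1 to 1000).toVector
    println(sumOfSquares(xs) == parSumOfSquares(xs))
    println(parMax(xs))
  }
}
```

The work is kept inside methods rather than in vals of the object body on purpose: running parallel operations from an object's initializer is a known way to deadlock on some Scala versions.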
Have a look at this: Changes between Scala 2.8 and Scala 2.9
http://www.infoq.com/interviews/martin-odersky-scala-future
It's from January 2011, so fairly recent. It might help you ^_^.
Official site: Scala 2.9.0 RC1 (from 2011-03-25, scala-lang.org)
Is there any way to run Tinkerpop Gremlin 3.1 traversals in OrientDB?
I've noticed that the DBMS currently supports only the previous version (2.x) of the Tinkerpop traversal language, which, for example, allows filtering edges directly by label, but not vertices :( .
I was quite satisfied with gremlin-scala and orientDB-gremlin, but I found that not all my queries were executed efficiently (some indexes were ignored).
Is there any other way?
Thanks in advance :)
Orientdb-gremlin is indeed the only available driver, and while it works pretty well for basic cases, there's some work left to do on index usage. If you report your cases in a GitHub issue, we can have a look. Best of all, obviously, is if you submit a PR :)
We are starting a big-data analytics project and are considering adopting Scala (the Typesafe stack). I would like to know which Scala APIs and projects are available for writing Hadoop MapReduce programs.
Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is also meant to be very compatible with the standard Scala collections API. Just as you can call flatMap, map, or groupBy on normal collections, you can do the same on scalding Pipes, which you can imagine as a distributed List of tuples. There's also a typed version of the API which provides stronger type-safety guarantees. I haven't used Scoobi, but the API seems similar to what they have.
Additionally, there are a few other benefits:
Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
It has several active contributors both inside and outside Twitter that are committed to making it great.
It is interoperable with your existing Cascading jobs.
In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
It provides a robust Matrix Library.
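As a point of reference for the "just like normal collections" comparison, here is a word count written against plain, local Scala collections. This is not Scalding code and uses no Scalding API; it only shows the flatMap / groupBy shape that, according to the answer above, carries over to Scalding Pipes. The object and method names are hypothetical.

```scala
object LocalWordCount {
  // Word count on an in-memory Seq, using only standard collection methods.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))   // one element per word
      .filter(_.nonEmpty)                     // drop empty tokens from leading spaces
      .groupBy(word => word)                  // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be", "to scald or not")).foreach(println)
}
```

Keeping the logic in a function over ordinary collections like this also makes it easy to unit-test locally before moving it to a cluster framework.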
I've had success with Scoobi. It's straightforward to use, strongly typed, hides most of the Hadoop mess (by doing things like automatically serializing your objects for you), and totally Scala. One of the things I like about its API is that the designers wanted the Scoobi collections to feel just like the standard Scala collections, so you actually use them much the same way, except that operations run on Hadoop instead of locally. This actually makes it pretty easy to switch between Scoobi collections and Scala collections while you're developing and testing.
I've also used Scrunch, which is built on top of the Java-based Crunch. I haven't used it in a while, but it's now part of Apache.
Twitter is investing a lot of effort into Scalding, including a nice Matrix library that could be used for various machine learning tasks. I need to give Scoobi a try, too.
For completeness, if you're not wedded to MapReduce, have a look at the Spark project. It performs far better in many scenarios, including in their port of Hive to Spark, appropriately called Shark. As a frequent Hive user, I'm excited about that one.
The first two I would investigate are Scalding (which builds on top of Cascading) and Scoobi. I have not used either, but Scalding in particular looks like it provides a really nice API.
Another option is Stratosphere. It offers a Scala API that converts the Scala types to Stratosphere's internal data types.
The API is quite similar to Scalding but Stratosphere natively supports advanced data flows (so you don't have to chain MapReduce Jobs). You will have much better performance with Stratosphere than with Scalding.
Stratosphere does not run on Hadoop MapReduce but on Hadoop YARN, so you can use your existing YARN cluster.
This is the word count example in Stratosphere (with the Scala API):
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
So I've recently started learning Scala and have been using graphs as sort of my project-to-improve-my-Scala, and it's going well - I've since managed to easily parallelize some graph algorithms (that benefit from data parallelization) courtesy of Scala 2.9's amazing support for parallel collections.
However, I want to take this one step further and have it parallelized not just on a single machine but across several. Does Scala offer any clean way to do this like it does with parallel collections, or will I have to wait until I get to the chapter in my book on Actors/learn more about Akka?
Thanks!
-kstruct
There was an attempt at creating distributed collections (the project is currently frozen).
Alternatives would be Akka, which you've already mentioned (and which recently got a really cool addition, Akka Cluster), or full-fledged cluster engines. These are not parallel collections in any sense, more like cluster computing frameworks usable from Scala, but they could help with your task in some way: for example Scoobi for Hadoop, Storm, or even Spark (specifically Bagel, for graph processing).
There is also Swarm, which was built on top of delimited continuations.
Last but not least is Menthor: its authors claim that it is especially well suited to graph processing, and it makes use of actors.
Since you're aiming to work with graphs, you may also consider looking at Cassovary, which was recently open-sourced by Twitter.
Signal-collect is a framework for parallel data processing backed by Akka.
You can use Akka ( http://akka.io ) - it has always been the most advanced and powerful actor and concurrency framework for Scala, and the fresh-baked version 2.0 allows for nice transparent actor remoting, hierarchies and supervision. The canonical way to do parallel computations is to create as many actors as there are parallel parts in your algorithm, optionally spreading them over several machines, send them data to process and then gather the results (see here).
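The split / process / gather shape described above can be sketched on a single machine without Akka. The example below deliberately swaps in scala.concurrent.Future from the standard library (Scala 2.10 or later) in place of actors, so it shows only the pattern, not Akka's API and not distribution across machines. All names are hypothetical.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ScatterGatherSketch {
  // Splits the data into roughly `parts` chunks, sums each chunk concurrently,
  // then gathers the partial sums into the final result.
  def parallelSum(data: Vector[Int], parts: Int): Long = {
    val chunkSize = math.max(1, (data.size + parts - 1) / parts)

    // Scatter: one Future per chunk (with Akka this would be one message per worker actor).
    val partials: Vector[Future[Long]] =
      data.grouped(chunkSize).toVector.map(chunk => Future(chunk.map(_.toLong).sum))

    // Gather: wait for all partial results and combine them.
    Await.result(Future.sequence(partials), 30.seconds).sum
  }

  def main(args: Array[String]): Unit =
    println(parallelSum((1 to 100).toVector, 4))
}
```

With actors, each chunk would instead be sent as a message to a worker, and the partial sums would come back as reply messages to a collecting actor.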
Since 2.9, we can create a parallel collection with a single method call, par. It is easy and simple, but how can we control the level of concurrency for a parallel collection?
On 2.9.1, the following approach worked for me:
collection.parallel.ForkJoinTasks
  .defaultForkJoinPool.setParallelism(<number of needed threads>)
See this question.
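On later releases (2.10 onward, as far as I know) the usual way to do this is per collection, through its tasksupport field. The sketch below assumes Scala 2.12, where ForkJoinTaskSupport accepts a java.util.concurrent.ForkJoinPool; on 2.10 and 2.11 the pool class is scala.concurrent.forkjoin.ForkJoinPool instead, and on 2.13 the separate scala-parallel-collections module is required. The object and method names are made up.

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object ParallelismSketch {
  // Doubles every element using a parallel array whose operations are
  // scheduled on a dedicated pool with `threads` worker threads.
  def doubledSum(xs: Array[Int], threads: Int): Int = {
    val pool = new ForkJoinPool(threads)
    try {
      val pc = xs.par
      pc.tasksupport = new ForkJoinTaskSupport(pool)
      // The map call on pc runs on `pool`; .seq converts the result back to a
      // sequential collection before it is summed.
      pc.map(_ * 2).seq.sum
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(doubledSum((1 to 1000).toArray, 4))
}
```

Unlike changing the default pool, this only affects the one collection it is set on.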
I'd like to find a good, robust MapReduce framework that can be used from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011):
There is also the Scoobi framework, which is very expressive.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark
http://spark-project.org/
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive mapreduce operations.
You may be interested in scouchdb, a Scala interface to using CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi. It is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining a key as a String with a key as a Long, lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix standard streams: your code (in any language) just has to be able to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
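A minimal sketch of what such a mapper could look like in Scala, assuming a word-count job: it reads lines from stdin and prints one "word<TAB>1" record per word. The object and method names are hypothetical, and how the program is packaged and passed to Hadoop Streaming is left out.

```scala
import scala.io.Source

object StreamingWordCountMapper {
  // Pure function: turns one input line into tab-delimited "word\t1" records.
  def mapLine(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).map(word => word + "\t1").toSeq

  // Streaming entry point: read lines from stdin, write records to stdout.
  def main(args: Array[String]): Unit =
    Source.stdin.getLines().foreach(line => mapLine(line).foreach(println))
}
```

A matching reducer would read the same tab-delimited records from stdin, which the framework delivers sorted by key, and add up the counts for each word.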
I've put a MapReduce implementation using Hadoop, with a few test cases, on GitHub here: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.