MapReduce implementation in Scala

I'd like to find a good, robust MapReduce framework that can be used from Scala.

To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011): there is also the Scoobi framework, which is remarkably expressive.

http://hadoop.apache.org/ is language agnostic.

Personally, I've become a big fan of Spark
http://spark-project.org/
It gives you in-memory cluster computing, which significantly reduces the overhead you would otherwise incur from disk-intensive MapReduce operations.

You may be interested in scouchdb, a Scala interface to CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.

A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.

For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading works on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining on a key as a String in one place and as a Long in another, only show up as run-time failures.

To further jshen's point:
Hadoop Streaming simply uses Unix streams: your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure it as the combiner).
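As a hypothetical illustration (the object name and the word-count logic are mine, not from the answer), a streaming mapper in Scala is just a program that reads lines from stdin and prints tab-separated key/value pairs:

// Minimal Hadoop Streaming mapper sketch; package it as an executable and pass it via -mapper.
object StreamingWordCountMapper extends App {
  scala.io.Source.stdin.getLines()
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)
    .foreach(word => println(word + "\t1")) // emit key <TAB> value, one pair per line
}

The matching reducer would read these lines back from stdin, already grouped by key, and sum the counts.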

I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.

Related

Can Cascading code be rewritten in or replaced with Apache Spark & Scala? Is it more optimal?

I have to replace MapReduce code written in Pig and Java with Apache Spark & Scala as far as possible, and reuse or find an alternative where that is not possible.
I can find Spark equivalents for most of the Pig code. Now I have run into Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how the plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are a few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Assuming RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should it be used to convert the Java code to Scala, which would remove the Java source code dependency? Would this be more optimal?
Cascading works on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there limitations or special features that only one or the other can provide?
I am very new to the Big Data field and not yet familiar with the concepts and comparisons of all the Big Data terminology: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on these Big Data responsibilities with my new job profile and have minimal senior knowledge/experience to draw on. Please give an explanatory answer if possible. Thanks.

replacing sql database using akka-persistence - why is it not happening?

There are ways to replace SQL databases in Haskell and Clojure:
http://www.datomic.com/ (Clojure)
https://github.com/dmbarbour/haskell-vcache
https://hackage.haskell.org/package/acid-state
However, I cannot find a library for doing so in Scala using akka-persistence.
I wonder why?
I heard that https://www.querki.net/ is doing something similar (https://github.com/jducoeur/Querki), but it is not a copyleft library (unlike acid-state for Haskell).
I wonder if I am looking at this from the wrong angle: why do other languages have these solutions while Scala does not seem to? Maybe there is a fundamental reason for that? Am I missing something?
The libraries you mention do quite different things:
akka-persistence stores the state of an actor. It is useful if you have an actor with internal state, but it is quite specialized.
acid-state serializes Haskell data to disk.
Datomic is a system for recording temporal data in a way that does not destroy the original data.
Object stores work well with dynamic languages like Clojure and Python, since they deal with dynamic data that can be serialized to disk.
I found it much nicer to work with MongoDB in Python than in Scala.
When the NoSQL movement started there was initial excitement, but after using these systems some people realized that you are giving up good properties that databases have.
Datomic is an interesting project with new ideas. There is a Scala clone of it, though I am not sure how stable it is:
https://github.com/dwhjames/datomisca

what are the options for hadoop on scala

We are starting a big-data analytics project and are considering adopting Scala (the Typesafe stack). I would like to know about the various Scala APIs/projects available for writing Hadoop MapReduce programs.
Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is also meant to be very compatible with the standard Scala collections API: just as you can call flatMap, map, or groupBy on normal collections, you can do the same on Scalding Pipes, which you can think of as a distributed List of tuples. There's also a typed version of the API which provides stronger type-safety guarantees (see the word-count sketch after the list below). I haven't used Scoobi, but its API seems similar to what they have.
Additionally, there are a few other benefits:
Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
It has several active contributors both inside and outside Twitter that are committed to making it great.
It is interoperable with your existing Cascading jobs.
In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
It provides a robust Matrix Library.
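For illustration, here is a rough word-count sketch in the style of Scalding's Typed API; the argument names and paths are placeholders rather than anything from the original answer:

import com.twitter.scalding._

// A hypothetical Scalding job: counts words in a text file using the Typed API.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))           // one String per input line
    .flatMap(_.split("\\s+")).filter(_.nonEmpty)    // used just like a Scala collection
    .map(word => (word, 1L))
    .sumByKey                                       // groups by word and sums the counts
    .write(TypedTsv[(String, Long)](args("output")))
}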
I've had success with Scoobi. It's straightforward to use, strongly typed, hides most of the Hadoop mess (by doing things like automatically serializing your objects for you), and is totally Scala. One of the things I like about its API is that the designers wanted Scoobi collections to feel just like the standard Scala collections, so you actually use them in much the same way, except that operations run on Hadoop instead of locally. This makes it pretty easy to switch between Scoobi collections and Scala collections while you're developing and testing.
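As a rough sketch of that collections feel, here is a word count adapted from Scoobi's own examples; the exact method names (particularly the combine and persist calls) vary between Scoobi versions, so treat this as an approximation rather than the definitive API:

import com.nicta.scoobi.Scoobi._

// Hypothetical Scoobi word count: a DList is used like a Scala collection,
// but the operations are compiled down to Hadoop MapReduce jobs.
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))
    val counts: DList[(String, Int)] =
      lines.flatMap(_.split(" "))
           .map(word => (word, 1))
           .groupByKey
           .combine((a: Int, b: Int) => a + b)
    persist(toTextFile(counts, args(1)))
  }
}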
I've also used Scrunch, which is built on top of the Java-based Crunch. I haven't used it in a while, but it's now part of Apache.
Twitter is investing a lot of effort into Scalding, including a nice Matrix library that could be used for various machine learning tasks. I need to give Scoobi a try, too.
For completeness, if you're not wedded to MapReduce, have a look at the Spark project. It performs far better in many scenarios, including in their port of Hive to Spark, appropriately called Shark. As a frequent Hive user, I'm excited about that one.
The first two I would likely investigate are Scalding (which builds on top of Cascading) and Scoobi. I have not used either yet, but Scalding, in particular, looks like it provides a really nice API.
Another option is Stratosphere. It offers a Scala API that converts Scala types to Stratosphere's internal data types.
The API is quite similar to Scalding, but Stratosphere natively supports advanced data flows (so you don't have to chain MapReduce jobs). You will get much better performance with Stratosphere than with Scalding.
Stratosphere does not run on Hadoop MapReduce but on Hadoop YARN, so you can use your existing YARN cluster.
This is the word-count example in Stratosphere (with the Scala API):
val input = TextFile(textInput)                        // read the input as a collection of lines
val words = input.flatMap { line => line.split(" ") }  // split each line into words
val counts = words
  .groupBy { word => word }
  .count()                                             // count occurrences per word
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))                  // the plan is what gets submitted for execution

Scala equivalent to pyTables?

I'm looking for a Scala library that offers something similar to what PyTables provides. PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data.
Any suggestions?
I had a quick look at pyTables, and I don't think there's anything remotely like it in Scalaland (or indeed Javaland), but we have a few of the ingredients necessary to make it a possibility if you want to invest the time:
scala.Dynamic to do idiomatic selection on data-driven structures
A bunch of graph databases to provide the underlying navigational persistence substrate (I've had acceptable results from OrientDB, which has a better license than most)
PyTables is a Python implementation of HDF5 with some added niceties to let you work on it in a Pythonic way and get good indexing support. I'm not sure if there's a package implemented in a similar way in Scala, but you can get the same HDF5-based hierarchical data storage using the HDF5 implementation in Java: HDF Java.

Parallelization/Cluster Options For Code Execution

I'm coming from a Java background and have a CPU-bound problem that I'm trying to parallelize to improve performance. I have broken up my code so that it works in a modular way and can (hopefully) be distributed and run in parallel.
@Transactional(readOnly = false, propagation = Propagation.REQUIRES_NEW)
public void runMyJob(List<String> someParams) {
    doComplexEnoughStuffAndWriteToMysqlDB();
}
Now, I have been thinking of the following options for parallelizing this problem and I'd like people's thoughts/experience in this area.
Options I am currently thinking of:
1) Use Java EE (eg JBoss) clustering and MessageDrivenBeans. The MDBs are on the slave nodes in the cluster. Each MDB can pick up an event which kicks off a job as above. AFAIK Java EE MDBs are multithreaded by the app server so this should hopefully also be able to take advantage of multicores. Thus it should be vertically and horizontally scalable.
2) I could look at using something like Hadoop and Map Reduce. Concerns I would have here is that my job processing logic is actually quite high level so I'm not sure how translatable that is to Map Reduce. Also, I'm a total newbie to MR.
3) I could look at something like Scala which I believe makes concurrency programming much simpler. However, while this is vertically scalable, it's not a cluster/horizontally scalable solution.
Anyway, hope all that makes sense and thank you very much for any help provided.
The solution you are looking for is Akka. Clustering is a feature under development and will normally be included in Akka 2.1.
Excellent Scala and Java APIs, extremely complete
Purely message-oriented pattern, with no shared state
Fault tolerant and scalable
Extremely easy to distribute jobs
Please get rid of J2EE if there is still time. You are very welcome to join the Akka mailing list to ask your questions. A minimal worker sketch follows.
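For illustration only, here is a rough sketch (with made-up names) of handing such a job to an Akka actor; your existing doComplexEnoughStuffAndWriteToMysqlDB logic would be invoked inside the worker, and in a cluster the worker actors could live on remote nodes:

import akka.actor._

// Hypothetical worker: each message is one CPU-bound job; actors share no state.
class JobWorker extends Actor {
  def receive = {
    case params: List[_] =>
      // call your existing job logic here, e.g. doComplexEnoughStuffAndWriteToMysqlDB()
      sender ! "done"
  }
}

object JobMaster extends App {
  val system = ActorSystem("jobs")
  val worker = system.actorOf(Props[JobWorker], name = "worker")
  worker ! List("some", "params") // fire off one job
}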
You should have a look at Spark.
It is a cluster computing framework written in Scala, aiming to be a viable alternative to Hadoop.
It has a number of nice features:
In-Memory Computations: You can control the degree of caching
Hadoop Input/Output interoperability: Spark can read/write data from all the Hadoop input sources such as HDFS, EC2, etc.
The concept of "Resilient Distributed Datasets" (RDD) which allows you to directly execute most of MR style workloads in parallel on a cluster as you would do locally
Primary API = Scala, optional python and Java APIs
It makes use of Akka :)
If I understand your question correctly, Spark would combine your options 2) and 3).
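To make that concrete, here is a rough Spark word-count sketch; the paths and the word-count logic are placeholder examples rather than anything from the original answer, and the package names have shifted across Spark versions:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount extends App {
  val sc = new SparkContext("local[*]", "wordcount")        // or a cluster master URL
  val lines = sc.textFile("hdfs:///path/to/input").cache()  // RDD kept in memory across operations
  val counts = lines
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)                                     // MR-style shuffle, written like a collection op
  counts.saveAsTextFile("hdfs:///path/to/output")
}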