Parallelization/Cluster Options For Code Execution - scala

I'm coming from a Java background and have a CPU-bound problem that I'm trying to parallelize to improve performance. I have broken my code up into modular pieces so that it can (hopefully) be distributed and run in parallel.
@Transactional(readOnly = false, propagation = Propagation.REQUIRES_NEW)
public void runMyJob(List<String> someParams) {
    doComplexEnoughStuffAndWriteToMysqlDB();
}
Now, I have been thinking of the following options for parallelizing this problem and I'd like people's thoughts/experience in this area.
Options I am currently thinking of:
1) Use Java EE (e.g. JBoss) clustering and MessageDrivenBeans. The MDBs sit on the slave nodes in the cluster. Each MDB can pick up an event which kicks off a job as above. AFAIK Java EE MDBs are multithreaded by the app server, so this should hopefully also be able to take advantage of multicores. Thus it should be vertically and horizontally scalable.
2) I could look at using something like Hadoop and MapReduce. A concern I have here is that my job-processing logic is actually quite high level, so I'm not sure how well it translates to MapReduce. Also, I'm a total newbie to MR.
3) I could look at something like Scala, which I believe makes concurrent programming much simpler. However, while this is vertically scalable, it's not a cluster/horizontally scalable solution.
Anyway, hope all that makes sense and thank you very much for any help provided.

The solution you are looking for is Akka. Clustering is a feature under development that should be included in Akka 2.1.
Excellent, extremely complete Scala and Java APIs
Purely message-oriented pattern, with no shared state
Fault resistant and scalable
Extremely easy to distribute jobs
Please get rid of J2EE if you still can. You are very welcome to join the Akka mailing list to ask your questions.
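For a flavour of the message-passing style, here is a minimal sketch using the classic (untyped) Akka actor API; the RunJob message and JobWorker actor are hypothetical names, and doComplexEnoughStuffAndWriteToMysqlDB stands in for your existing job method.

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message carrying the job parameters
case class RunJob(params: List[String])

// One worker actor; a router (or several nodes, once clustering lands) can run many of these in parallel
class JobWorker extends Actor {
  def receive = {
    case RunJob(params) =>
      // call your existing logic here, e.g. doComplexEnoughStuffAndWriteToMysqlDB()
      sender() ! "done"
  }
}

object JobMain extends App {
  val system = ActorSystem("jobs")
  val worker = system.actorOf(Props[JobWorker], "worker-1")
  worker ! RunJob(List("param1", "param2"))
}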

You should have a look at Spark.
It is a cluster computing framework written in Scala that aims to be a viable alternative to Hadoop.
It has a number of nice features:
In-Memory Computations: You can control the degree of caching
Hadoop Input/Output interoperability: Spark can read/write data from all the Hadoop input sources such as HDFS, EC2, etc.
The concept of "Resilient Distributed Datasets" (RDD) which allows you to directly execute most of MR style workloads in parallel on a cluster as you would do locally
Primary API in Scala, with optional Python and Java APIs
It makes use of Akka :)
If I understand your question correctly, Spark would combine your options 2) and 3).
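As a rough illustration of what combining options 2) and 3) looks like in Spark, here is a classic word-count sketch over the RDD API; the input/output paths are placeholders you would replace with your own.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] uses all cores of one machine; point the master at a cluster to scale out
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))
    val counts = sc.textFile("hdfs:///data/input")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // runs in parallel across the cluster
    counts.saveAsTextFile("hdfs:///data/output")     // placeholder output path
    sc.stop()
  }
}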

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is attempting to use Spark to generate data and then work with the data internally. That is, I have a program that does a thing, and it generates "rows" of data. Can I leverage Spark to parallelize that work across the worker nodes and have them each contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
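A minimal sketch of that seed-and-flatMap approach, assuming Spark SQL is available; the row shape, fan-out factor and output path are made up for illustration.

import org.apache.spark.sql.SparkSession

object GenerateRows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("generate-rows").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical seed set: one element per unit of generation work, spread over 100 partitions
    val seeds = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

    // Each seed fans out into many generated rows on the workers
    val rows = seeds.flatMap { seed =>
      (1 to 10000).map(i => (seed, i, s"row-$seed-$i"))
    }

    // Each worker writes its rows straight to the underlying store
    rows.toDF("seed", "index", "payload")
      .write.mode("overwrite").parquet("/tmp/generated")   // placeholder sink
    spark.stop()
  }
}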

Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social network, incorporating various features, many powered by big-data-intensive workloads (such as machine learning), e.g. recommender systems, search engines and time-series sequence matchers.
Given that I currently have 5< users (but foresee significant growth), what metrics should I use to decide between:
Spark (with/without HBase over Hadoop)
MongoDB or Postgres
I'm looking at Postgres as a means of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features to be quite limiting.
I think you are heading in the right direction in searching for a software stack/architecture that can:
handle different types of load: batch, real-time computing, etc.
scale in size and speed along with business growth
be a live software stack that is well maintained and supported
have common library support for domain-specific computing such as machine learning
On those merits, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in a batch manner. It supports reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you can leverage that storage (HDFS) plus the real-time computing performance that Spark adds.
In terms of development, both systems are natively supported by Java/Scala. Advice on library support and performance tuning for both is abundant here on Stack Overflow and elsewhere. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
For deployment, AWS and other cloud providers offer hosted solutions for Hadoop/Spark. Not an issue there either.
I guess you should separate data storage and data processing. In particular, "Spark or MongoDB?" is not a good thing to ask, but rather "Spark or Hadoop or Storm?" and also "MongoDB or Postgres or HDFS?"
In any case, I would refrain from having the database do processing.
I have to admit that I'm a little biased, but if you want to learn something new, you have serious spare time, you're willing to read a lot, and you have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters in real time.
*Alongside Hadoop, Hive, Spark...
In my opinion, it depends more on your requirements and the data volume you will have than on the number of users (which is also a requirement). Hadoop (i.e. the ecosystem: Hive/Impala, HBase, MapReduce, Spark, etc.) works fine with large amounts of data (GB/TB per day) and scales horizontally very well.
In the Big Data environments I have worked with, I have always used Hadoop HDFS to store the raw data and leveraged the distributed file system to analyse the data with Apache Spark. The results were stored in a database system like MongoDB to obtain low-latency queries or fast aggregates with many concurrent users. Then we used Impala for on-demand analytics. The main question when using so many technologies is scaling the infrastructure and the resources given to each one appropriately. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it's a bad idea to put a MongoDB instance on the same machine.
I would also suggest a graph database since you are building a social-network architecture, but I don't have any experience with those...
Are you looking to stay purely open-sourced? If you are going to go enterprise at some point, a lot of the enterprise distributions of Hadoop include Spark analytics bundled in.
I have a bias, but, there is also the Datastax Enterprise product, which bundles Cassandra, Hadoop and Spark, Apache SOLR, and other components together. It is in use at many of the major internet entities, specifically for the applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
You want to think about how you will be hosting this as well.
If you are staying in the cloud, you will not have to choose: you will be able (depending on your cloud environment, but with AWS, for example) to use Spark for continuous batch processing, Hadoop MapReduce for long-timeline analytics (analyzing data accumulated over a long time), etc., because the storage will be decoupled from the collection and processing. Put data in S3, and then process it later with whatever engine you need.
If you will be hosting the hardware, building a Hadoop cluster will give you the ability to mix hardware (heterogeneous hardware is supported by the framework), a robust and flexible storage platform, and a mix of tools for analysis, including HBase and Hive; it has ports for most of the other things you've mentioned, such as Spark on Hadoop (not a port, actually the original design of Spark). It is probably the most versatile platform, and can be deployed/expanded cheaply, since the hardware does not need to be the same for every node.
If you are self-hosting, going with other cluster options will force hardware requirements on you that may be difficult to scale with later.
We use Spark + HBase + Apache Phoenix + Kafka + Elasticsearch, and scaling has been easy so far.
*Phoenix is a JDBC driver for HBase; it allows you to use java.sql with HBase, Spark (via JdbcRDD) and Elasticsearch (via the JDBC river), which really simplifies integration.
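To give an idea of what that looks like in practice, here is a minimal java.sql sketch against Phoenix; the ZooKeeper quorum, table and columns are hypothetical.

import java.sql.DriverManager

object PhoenixQuery {
  def main(args: Array[String]): Unit = {
    // Phoenix connection strings take the form jdbc:phoenix:<zookeeper quorum>
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT user_id, score FROM user_scores LIMIT 10") // hypothetical table
    while (rs.next()) {
      println(s"${rs.getString("user_id")} -> ${rs.getLong("score")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}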

What considerations should be taken when deciding whether or not to use Apache Spark?

In the past, for jobs that required a heavy processing load, I would use Scala and parallel collections.
I'm currently experimenting with Spark and find it interesting, but it has a steep learning curve. I find development slower, as I have to use a reduced Scala API.
What do I need to determine before deciding whether or not to use Spark?
The current Spark job I'm trying to implement processes approximately 5 GB of data. This data is not huge, but I'm running a Cartesian product of it, which generates data in excess of 50 GB. Maybe using Scala parallel collections will be just as fast; I know the dev time to implement the job will be shorter from my point of view.
So what considerations should I take into account before deciding to use Spark ?
The main advantages Spark has over traditional high-performance computing frameworks (e.g. MPI) are fault-tolerance, easy integration into the Hadoop stack, and a remarkably active mailing list http://mail-archives.apache.org/mod_mbox/spark-user/ . Getting distributed fault-tolerant in-memory computations to work efficiently isn't easy and it's definitely not something I'd want to implement myself. There's a review of other approaches to the problem in the original paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf .
However, when my work is I/O bound, I still tend to rely primarily on Pig scripts, as Pig is more mature and I think the scripts are easier to write. Spark has been great when Pig scripts won't cut it (e.g. iterative algorithms, graphs, lots of joins).
Now, if you've only got 50 GB of data, you probably don't care about distributed fault-tolerant computations (if all your stuff is on a single node, then there's no framework in the world that can save you from a node failure :) ), so parallel collections will work just fine.
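For comparison, here is a rough single-node sketch of a Cartesian-product style workload with Scala parallel collections (assuming Scala 2.12 or earlier, where .par is built in); the data and the matching predicate are stand-ins.

object CartesianLocal {
  def main(args: Array[String]): Unit = {
    // Stand-ins for the two sides of the Cartesian product
    val left  = (1 to 10000).toVector
    val right = (1 to 10000).toVector

    // .par spreads the outer loop over all available cores on this one machine
    val matches = left.par.flatMap { a =>
      right.collect { case b if isMatch(a, b) => (a, b) }
    }
    println(s"pairs kept: ${matches.size}")
  }

  // Hypothetical predicate standing in for the real pairwise computation
  def isMatch(a: Int, b: Int): Boolean = (a + b) % 1000 == 0
}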

Distributing Scala over a cluster?

So I've recently started learning Scala and have been using graphs as sort of my project-to-improve-my-Scala, and it's going well - I've since managed to easily parallelize some graph algorithms (that benefit from data parallelization) courtesy of Scala 2.9's amazing support for parallel collections.
However, I want to take this one step further and have it parallelized not just on a single machine but across several. Does Scala offer any clean way to do this like it does with parallel collections, or will I have to wait until I get to the chapter in my book on Actors/learn more about Akka?
Thanks!
-kstruct
There was an attempt at creating distributed collections (the project is currently frozen).
Alternatives would be Akka (which recently got a really cool addition: Akka Cluster), which you've already mentioned, or full-fledged cluster engines. The latter are not parallel collections in any sense and are more like bringing a cluster to Scala, but they could be used for your task in some way: Scoobi for Hadoop, Storm, or even Spark (specifically, Bagel for graph processing).
There is also Swarm, which was built on top of delimited continuations.
Last but not least is Menthor; its authors claim that it especially fits graph processing, and it makes use of actors.
Since you're aiming to work with graphs, you may also want to look at Cassovary, which was recently open-sourced by Twitter.
Signal/Collect is a framework for parallel data processing backed by Akka.
You can use Akka ( http://akka.io ) - it has always been the most advanced and powerful actor and concurrency framework for Scala, and the fresh-baked version 2.0 allows for nice transparent actor remoting, hierarchies and supervision. The canonical way to do parallel computations is to create as many actors as there are parallel parts in your algorithm, optionally spreading them over several machines, send them data to process and then gather the results (see here).
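A compact scatter/gather sketch along those lines, using the classic Akka actor API with ask; the chunking, message types and per-chunk computation are invented for illustration, and a real multi-machine setup would add remote deployment configuration.

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

case class Chunk(vertices: Seq[Int])   // hypothetical slice of the graph data
case class PartialResult(value: Long)

class GraphWorker extends Actor {
  def receive = {
    case Chunk(vertices) =>
      // Stand-in for the real per-chunk graph computation
      sender() ! PartialResult(vertices.map(_.toLong).sum)
  }
}

object ScatterGather extends App {
  implicit val timeout: Timeout = Timeout(30.seconds)
  val system = ActorSystem("graph")
  import system.dispatcher

  // As many workers as parallel parts; remoting config would place them on other machines
  val workers = (1 to 4).map(i => system.actorOf(Props[GraphWorker], s"worker-$i"))
  val chunks  = (1 to 100000).grouped(25000).toSeq

  val partials: Seq[Future[PartialResult]] =
    workers.zip(chunks).map { case (w, c) => (w ? Chunk(c)).mapTo[PartialResult] }

  val total = Await.result(Future.sequence(partials), 1.minute).map(_.value).sum
  println(s"total = $total")
  system.terminate()
}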

MapReduce implementation in Scala

I'd like to find a good and robust MapReduce framework to be used from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update, 5 Oct 2011:
There is also the Scoobi framework, which has awesome expressiveness.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark
http://spark-project.org/
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive MapReduce operations.
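A small sketch of why that matters: cache() keeps a dataset in memory across iterations instead of re-reading it from disk on each pass, as a chain of MapReduce jobs would. The input path and the toy computation are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[*]"))

    // Placeholder input; cache() pins the parsed data in memory for the loop below
    val values = sc.textFile("/tmp/values.txt").map(_.toDouble).cache()

    var threshold = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the cached data instead of rereading it from disk
      threshold = values.filter(_ > threshold).mean()
    }
    println(s"final threshold = $threshold")
    sc.stop()
  }
}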
You may be interested in scouchdb, a Scala interface to using CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors (e.g. joining on a key as a String and a key as a Long) lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix streams: your code (in any language) just has to be able to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
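For example, a streaming mapper in Scala is just a stdin-to-stdout program; the word-count logic here is a stand-in, and the exact hadoop jar invocation depends on your distribution.

// Read lines from stdin, emit tab-delimited key/value pairs on stdout;
// Hadoop Streaming wires this up as the map phase.
object StreamingMapper {
  def main(args: Array[String]): Unit = {
    for (line <- scala.io.Source.stdin.getLines()) {
      for (word <- line.split("\\s+") if word.nonEmpty) {
        println(s"$word\t1")
      }
    }
  }
}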
I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub here: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.