Running SBT (Scala) on several (cluster) machines at the same time

So I've been playing with Akka Actors for a while now and have written some code that can distribute computation across several machines in a cluster. Before I run the "main" code, I need an ActorSystem waiting on each machine I will be deploying over, and I usually do this via a Python script that SSHes into all the machines and starts the process by doing something like cd /into/the/proper/folder/ followed by sbt 'run-main ActorSystemCode'.
I run this Python script on one of the machines (call it "Machine X"), so the output from SSHing into all the other machines shows up in my Machine X session. Whenever I run the script, all the machines seem to recompile the entire codebase before actually running it, leaving me waiting several minutes before anything useful happens.
My questions are these:
Why do they need to recompile at all? The same JVM is available on all machines, so shouldn't the code just run immediately?
How do I get around this problem of making each machine compile "its own copy"?

sbt is a build tool, not an application runner. Use sbt-assembly to build an all-in-one jar, put the jar on each machine, and run it with the scala or java command.
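For a concrete sketch (the plugin version and jar name below are assumptions; ActorSystemCode is the main class from the question), the build.sbt additions look roughly like this:

// project/plugins.sbt needs the plugin, e.g.:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
assembly / mainClass := Some("ActorSystemCode")  // entry point baked into the manifest
assembly / assemblyJarName := "actor-system.jar" // hypothetical jar name

Build once with sbt assembly, copy the resulting jar from target/ to each machine, and start it with java -jar actor-system.jar; nothing gets recompiled on the nodes.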

It's usual for a cluster to have a single shared partition mounted on every node (via NFS or Samba). You just need to copy the artifact onto that partition and it will be directly accessible from each node. If that's not the case, you should ask your sysadmin to set one up.
Then you will need to launch the application. Again, most clusters come with MPI, and the tools mpirun (or mpiexec) are not restricted to real MPI applications: they will launch any script you want on several nodes.
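If MPI isn't available, the launch step can also be scripted directly. Here is a minimal Scala sketch of what the question's Python script does, under the assumptions that the assembled jar sits on the shared partition, passwordless SSH is set up, and the hostnames below are placeholders:

import scala.sys.process._

object LaunchCluster extends App {
  // Hypothetical node names; replace with the real cluster hosts.
  val nodes = Seq("node01", "node02", "node03")
  // Start one remote JVM per node; no sbt involved, so nothing recompiles.
  val procs = nodes.map { host =>
    Process(Seq("ssh", host, "java", "-jar", "/shared/actor-system.jar")).run()
  }
  procs.foreach(_.exitValue()) // block until every remote process exits
}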

Related

Application kill of spark on yarn via Zeppelin

Is there a recommended way to kill a Spark application on YARN from inside Zeppelin (using Scala)? In the spark shell I use
:q
and it cleanly exits the shell, kills the application on yarn, and unreserves the cores I was using.
I've found that I can use
sys.exit
which does kill the application on yarn successfully, but it also throws an error and requires that I restart the interpreter if I want to start a new session. If I'm actively running another notebook with a separate instance of the same interpreter then sys.exit isn't ideal because I can't restart the interpreter until I've finished the work in the second notebook.
You probably want to go to the YARN UI and kill the application there. The YARN ResourceManager UI should be running on port 8088 of your primary master node. However, this will require a restart of the service as well.
Ideally you let YARN deal with this, though. Just because Zeppelin will start Spark with a specified number of executors and cores doesn't mean these are "reserved" in the way you think. These cores are still available for other containers. YARN manages these resources very well. Unless you have a limited cluster and/or are doing something that requires every last drop of resource management from YARN, you should be fine to leave the Spark application that Zeppelin is using alone.
You could try restarting the Zeppelin Spark interpreter (which can be done from within the interpreter settings page). This should kill the Zeppelin app, but will only restart the interpreter (and hence the Zeppelin app), when you try executing a paragraph again.
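A further option worth trying (my assumption, not something the answers above confirm): stop the SparkContext itself from a paragraph. This releases the YARN application and its executors without calling sys.exit, so the interpreter JVM survives:

// Run in a Zeppelin paragraph; assumes sc is the SparkContext Zeppelin injects.
// Stopping the context ends the YARN application, though the interpreter may
// still need a restart before a new context can be created.
sc.stop()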

Run Spark/Cloudera application in remote machine with Eclipse

I'm having some trouble understanding the logical architecture of the environment in which I develop with Scala/spark-shell and Hadoop.
To describe the logical architecture better, I drew a small diagram:
As the figure shows, I have Eclipse installed on my personal PC, and I would like to run Scala scripts from my PC against the remote Hadoop cluster.
At the moment I have a VPN connection and can run my Scala programs through PuTTY from the shell. In practice, every time I have to launch a Scala script, I transfer the .scala file from my PC to the remote machine with WinSCP and then launch the program directly from the remote machine. Having to transfer the file every time makes this workflow wasteful.
Now the question: is there a way to launch the script on the remote cluster from my personal PC, without going through PuTTY?

Achieve SBT Run startup speed while executing through command line

I've been working on a small set of command line programs in Scala. While developing I used SBT, and tested the programs with run within the console. At that point the programs had a fast startup time (when re-run after the initial compilation): nearly instant, even with additional dependencies.
Now that I'm trying to actually use them on my system outside of SBT, startup has noticeable lag. I'm looking for ways to reduce this, since the nature of these utilities requires little to no delay.
The best speed I've achieved so far has been through Drip. I include all dependencies in a lib directory using Pack and then run by executing a shell script like this:
#!/bin/sh
SCRIPT=$(readlink -f "$0")
SCRIPT_PATH=$(dirname "$SCRIPT")
PROG_HOME=$(cd "$SCRIPT_PATH/../" && pwd)
CLASSPATH_SUFFIX=""
# Path separator used in EXTRA_CLASSPATH
PSEP=":"
# Add the lib directory to the classpath; TagWorkspace is the main class,
# and "$@" passes the script's arguments through to it.
exec drip \
  -cp "${PROG_HOME}/lib/*${CLASSPATH_SUFFIX}" \
  TagWorkspace "$@"
This is still noticeably slower than invoking run from within SBT.
I'm curious as to why SBT is able to start the application so much faster, and whether there is some way for me to leverage its strategy, or SBT itself, even if that means keeping a long-lived process around to actually run commands through.
Unless you have forking turned on for your run task, this is likely due to VM startup time. When you run from inside an active SBT session, you have an already initialized VM pointing at your classes - all SBT needs to do is create a new ClassLoader and point it at your build output directory. This bypasses all of the other (not insignificant) stuff that happens when you fire up a new VM.
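As a rough illustration of that strategy (this is not SBT's actual code; the directory and class names are placeholders), a resident JVM could run each command like this:

import java.net.{URL, URLClassLoader}
import java.nio.file.Paths

object WarmRunner {
  // Point a fresh ClassLoader at the compiled output and invoke the program's
  // main method, reusing this already-warm JVM instead of starting a new one.
  def runMain(classesDir: String, mainClass: String, args: Array[String]): Unit = {
    val urls: Array[URL] = Array(Paths.get(classesDir).toUri.toURL)
    // Parent is this JVM's loader, so the JDK and shared libraries stay loaded.
    val loader = new URLClassLoader(urls, getClass.getClassLoader)
    val main = loader.loadClass(mainClass).getMethod("main", classOf[Array[String]])
    main.invoke(null, args) // main is static, so the receiver is null
  }
}

A long-lived server process that accepts a class name and arguments over a socket and then calls something like this is essentially what tools such as Nailgun provide.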
Have you tried using the client VM to start your utility from the command line? Sadly, this isn't an option with 64-bit Java, since Oracle apparently doesn't want to support it, but if you're using a 32-bit VM, try adding the -client argument to the list that you give the VM from the command line.
If you are using a 64-bit VM, some googling will find you some unofficial forks of OpenJDK that have the client VM re-enabled. It's really just a #define in the JVM build itself - it works fine once it's been compiled in.
The only slowness I have is launching SBT. Running a hello-world Scala app with java (no Drip) version 1.8 on a 7381-bogomips CPU takes only 0.2 seconds.
If you're not in that magnitude, I suspect your application startup requires loading thousands of classes, and creating instances of them.

Deployment of files to Virtual Machines

During our development process the developers make code modifications, compile the code, and then need to deploy it to a remote machine to test or debug it remotely.
Several manual steps are usually needed: stop one or more services, copy the compiled files to a specific place on the destination machine, and other steps (maybe delete some folder, etc.).
I was wondering if there is a tool that takes as input the IP of a remote machine and a set of predefined steps (stop a service, copy local files to the remote machine, etc.) and just does the deployment automatically for the developer? I'd like to automate this tiring process a bit...
Thanks.
Ant is a common tool for such tasks in Java development. You can use Ant to compile your code, use an scp task to copy your binaries to a server, and run scripts on that server. The configuration is done in XML and is pretty easy. You should google or search on Stack Overflow for some examples.
I use Rundeck to control my deployments. I like its simplicity and the fact that all that's required is SSH access to my servers, enabling me to upload files and run whatever scripts I require.
It has a simple XML configuration file listing the servers in my network. This makes it really easy to integrate with other CM tools.
For Windows deployments you're going to require an SSH implementation installed on each node, or a more complicated deployment tool.

Deploying Perl to a shared-nothing cluster

Does anyone have suggestions for methods of deploying Perl modules to a shared-nothing cluster?
Our current method is very manual.
1. Take down half the cluster.
2. Copy the Perl modules (CPAN-style modules) to the downed cluster members.
3. ssh to each member and run perl Makefile.PL; make; make install for each module to be installed.
4. Confirm the deployment.
5. Bring the newly deployed cluster members into service, take the old members out of service, and repeat steps 2-4.
This is obviously far from optimal. Does anyone have, or know of, a good tool chain for deploying Perl modules to a shared-nothing cluster?
Take one node offline, install Perl, and then use it to reimage the other nodes.
At least, that's how I imagine you'd want to install software in a shared-nothing cluster. Perl is just the application you happen to be installing.
Assuming all the machines are identical, you should be able to keep one canonical installation and use rsync or something similar to keep the others updated.
I have, in the past, developed a Perl program which used the Expect module (from CPAN) to automate basically the process you described, automatically sshing to each host, copying any necessary files, and performing the installations. Unfortunately, this was developed on-site for a client, so I do not have access to the code to share. If you're familiar with Expect, it shouldn't be too difficult to set up, though.
We currently have a clustered Perl application that does data processing. We also have numerous CPAN modules and modules that we've developed that the software depends on. When you say 'shared nothing', I'm assuming you're referring to things like NFS mounts.
If the machines have identical configurations, then you may be able to build your entire application into a single directory structure (e.g. /opt/my-app), tar it up, and that could become the only thing you need to push to the boxes.
As far as deploying it to the boxes, you might be able to use Capistrano. We developed a couple of our own cluster utilities that piggybacked off of ssh - I've released one form of that utility: parallel-jobs. Its README shows an example of executing multiple parallel ssh commands. It's a small step to extend that program to be able to know about your cluster and then be able to execute the same command across the cluster (as opposed to a series of different commands).
If you are using a Debian or Ubuntu OS, you could package your Perl modules - I have open-sourced some code to help with this: Perl module builder. It's still very rough but does work, and it can be made to work on your own code as well as CPAN modules; this then makes deployment much easier.
There is also a project to get Red Hat RPMs for all of CPAN; Dave Cross gave a talk, Perl in RPM-Land, which may be of use.
If you are on some other system that doesn't have packaging, then the rsync option (install on one machine and then rsync to the others) should work as well. Note that you can mount a Windows share and rsync to it from Unix if needed.
Using a central manager like Puppet makes creating and maintaining machines in a cluster a lot easier, from installing code to managing users and email configuration. There is also a Perl project in the pipeline to do something similar, but this has not been made public yet.
Capistrano is a tool that allows you to run commands on a group of servers; it is perfectly suited to making your task considerably easier.
Further down the line of automation, but also of complexity, is Puppet, which allows you to define a group of servers, give them roles, and then push out sets of code to every machine subscribing to a certain role.
I am not sure exactly what a shared-nothing cluster is, but if it uses some base *nix system like Fedora, Mandriva, or Ubuntu, then many of the Perl modules are precompiled for specific architectures, and you can easily run these.
If these systems are of the same architecture, you can do as someone else said and just copy the compiled modules from system to system; just make sure you have all of the dependencies on the recipient system as well.