Application kill of Spark on YARN via Zeppelin - Scala

Is there a recommended way to kill a Spark application on YARN from inside Zeppelin (using Scala)? In the Spark shell I use
:q
and it cleanly exits the shell, kills the application on YARN, and releases the cores I was using.
I've found that I can use
sys.exit
which does kill the application on YARN successfully, but it also throws an error and requires that I restart the interpreter if I want to start a new session. If I'm actively running another notebook with a separate instance of the same interpreter, then sys.exit isn't ideal because I can't restart the interpreter until I've finished the work in that second notebook.
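For reference, a minimal Zeppelin paragraph that stops only the SparkContext rather than exiting the JVM looks like the sketch below; it assumes the usual sc handle that the Spark interpreter provides, and whether the YARN application is released cleanly depends on the Zeppelin version and interpreter settings:
// Run in a Zeppelin paragraph: sc is the SparkContext the Spark interpreter already provides.
// Stopping it asks Spark to shut down its YARN application; the interpreter may still need a
// restart before it can create a new context.
sc.stop()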

You probably want to go to the YARN UI and kill the application there. The UI is served by the ResourceManager on port 8088, which is often the same host as your primary name node. However, this will still require a restart of the Zeppelin interpreter as well.
Ideally you let YARN deal with this, though. Just because Zeppelin starts Spark with a specified number of executors and cores doesn't mean these are "reserved" in the way you think. Those cores are still available to other containers, and YARN manages these resources very well. Unless you have a limited cluster and/or are doing something that requires every last drop of resources from YARN, you should be fine leaving the Spark application that Zeppelin is using alone.
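If you do want to kill the application from Scala code rather than through the UI, a rough sketch using the YARN client API could look like the following; the application ID string is a placeholder, and this assumes the Hadoop YARN client classes and cluster configuration (yarn-site.xml) are visible to the interpreter:
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.api.records.ApplicationId

// Picks up yarn-site.xml from HADOOP_CONF_DIR / the classpath
val conf = new YarnConfiguration()
val yarn = YarnClient.createYarnClient()
yarn.init(conf)
yarn.start()
// "application_1234567890123_0042" is a placeholder -- use the real ID from the YARN UI
yarn.killApplication(ApplicationId.fromString("application_1234567890123_0042"))
yarn.stop()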

You could try restarting the Zeppelin Spark interpreter (which can be done from the interpreter settings page). This should kill the Zeppelin app on YARN, but the interpreter (and hence the Zeppelin app) will only start again when you next execute a paragraph.
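If clicking through the settings page is inconvenient, Zeppelin also has a REST endpoint for restarting an interpreter setting. The sketch below uses a hypothetical host, port, and setting ID that you would replace with the values from your own deployment:
import java.net.{HttpURLConnection, URL}

// PUT /api/interpreter/setting/restart/<setting id>; the id is listed on the
// interpreter settings page (or via GET /api/interpreter/setting).
val url = new URL("http://zeppelin-host:8080/api/interpreter/setting/restart/SPARK_SETTING_ID")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("PUT")
println(s"Zeppelin responded with HTTP ${conn.getResponseCode}")
conn.disconnect()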

Related

Run Python mrjob on a Hadoop cluster in Kubernetes

I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly.
I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I also managed to run mrjob successfully from inside the name-node pod.
Now I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace), and I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.
The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:
runners:
  hadoop:
    cmdenv:
      PATH: <pod name>:/opt/hadoop
However, mrjob is still unable to detect the hadoop binary and gives the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
So is there a way in which I can configure mrjob to run with my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.
mrjob is a wrapper around hadoop-streaming, and therefore requires the Hadoop binaries to be installed on the server(s) where the code will run (pods here, I guess), including the Jupyter pod that submits the application.
IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications in k8s than hadoop-streaming, since you don't "need" Hadoop in k8s to run such distributed processes.
Beam would be my recommendation since it is compatible with GCP Dataflow.

PySpark always uses the system's Python

We know that a system can have two Pythons:
① the system's Python: /usr/bin/python
② the user's Python: ~/anaconda3/envs/Python3.6/bin/python3
Now I have a cluster with my desktop (master) and laptop (slave).
The PySpark shell works fine in its different modes if I set the following in both nodes' ~/.bashrc:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
However, I want to use it with a Jupyter notebook, so I set this in each node's ~/.bashrc:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then I get an error in the log.
My Spark version is:
spark-3.0.0-preview2-bin-hadoop3.2
I have read all the answers in
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
and
different version 2.7 than that in driver 3.6 in jupyter/all-spark-notebook
But no luck.
I guess the slave's Python 2.7 is the system's Python, not Anaconda's Python.
How do I force Spark's slave node to use Anaconda's Python?
Thanks!
Jupyter is looking for IPython; you probably only have IPython installed in your system Python.
To use Jupyter with a different Python version, you need a Python version manager (pyenv) and a Python environment manager (virtualenv). Together they let you choose which Python version you are going to use and which environment you are going to install Jupyter into, with Python versions and packages fully isolated.
Install ipykernel in your chosen Python environment and install Jupyter.
After you finish the steps above, you need to make sure that the Spark worker switches to your chosen Python version and environment every time the Spark ResourceManager launches a worker executor. To switch the Python version and environment when the worker executor starts, make sure a little script runs right after the ResourceManager connects to the worker over SSH:
go to the python environment directory
source 'whatever/bin/activate'
After you have done the steps above, the Spark worker executors should run with your chosen Python version and Jupyter.

Is there a way to run a Scala worksheet in IntelliJ on a remote environment?

I am looking for a way to run some Scala code in a Spark shell on a cluster. Is there a way to do this? Or even inside a simple Scala shell where I can instantiate my own Spark context?
I tried to look for some kind of remote setup for the Scala worksheet in IntelliJ, but I wasn't able to find anything useful.
So far the only way I can connect to a remote environment is to run the debugger.
The best solution I have come across is to install Jupyter notebook on the Spark cluster.
Then you can use the browser and work remotely on the cluster. Otherwise, good old telnet also works.
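If you would rather stay in a plain Scala shell or worksheet and instantiate your own Spark context, as the question mentions, here is a minimal sketch; it assumes a standalone Spark master reachable at spark://cluster-host:7077 (a placeholder address) and that the local Spark version matches the cluster's:
import org.apache.spark.sql.SparkSession

// The driver runs locally (in the shell/worksheet); executors run on the cluster.
val spark = SparkSession.builder()
  .appName("worksheet-experiments")
  .master("spark://cluster-host:7077") // placeholder master URL
  .getOrCreate()

// A small sanity check that work is actually distributed
val counts = spark.range(0L, 1000000L)
  .selectExpr("id % 10 as bucket")
  .groupBy("bucket")
  .count()
counts.show()

spark.stop()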

Achieve SBT run startup speed when executing from the command line

I've been working on a small set of command line programs in Scala. While
developing I used SBT, and tested the program with run within the console. At
this point the programs had a fast startup time (when re-run after initial compilation); nearly instant, even
with additional dependencies.
Now that I'm trying to actually use them on my system outside of SBT, startup has a noticeable lag. I'm looking for ways to
reduce this, since the nature of these utilities requires little to no delay.
The best speed I've achieved so far has been through utilizing Drip. I include all dependencies in a lib directory by utilizing Pack and then run by executing a shell script like this:
#!/bin/sh
SCRIPT=$(readlink -f "$0")
SCRIPT_PATH=$(dirname "$SCRIPT")
PROG_HOME=$(cd "$SCRIPT_PATH/.." && pwd)
CLASSPATH_SUFFIX=""
# Path separator used in EXTRA_CLASSPATH
PSEP=":"
# Add the lib directory to the classpath; TagWorkspace is the main class
exec drip \
  -cp "${PROG_HOME}/lib/*${CLASSPATH_SUFFIX}" \
  TagWorkspace "$@"
This is still noticeably slower than invoking run from within SBT.
I'm curious as to why SBT is able to start the application so much faster, and whether there is some way for me to leverage its strategy, or SBT itself, even if that means keeping a long-lived process around to actually run a command through.
Unless you have forking turned on for your run task, this is likely due to VM startup time. When you run from inside an active SBT session, you have an already initialized VM pointing at your classes - all SBT needs to do is create a new ClassLoader and point it at your build output directory. This bypasses all of the other (not insignificant) stuff that happens when you fire up a new VM.
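To get a feel for that strategy, here is a rough, self-contained sketch of the "warm JVM plus a fresh ClassLoader" idea; the output directory is a typical SBT path, the WarmRunner object itself is hypothetical, and TagWorkspace is the main class from the script above:
import java.io.File
import java.net.URLClassLoader

// A long-lived JVM could keep this around and re-run the app on demand, roughly
// the way an active SBT session re-runs `run` without paying JVM startup again.
object WarmRunner {
  def run(classesDir: String, mainClassName: String, args: Array[String]): Unit = {
    val loader = new URLClassLoader(
      Array(new File(classesDir).toURI.toURL),
      getClass.getClassLoader
    )
    val mainClass  = loader.loadClass(mainClassName)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    // Pass the whole String[] as the single argument of main
    mainMethod.invoke(null, Array[AnyRef](args): _*)
  }
}

// e.g. WarmRunner.run("target/scala-2.13/classes", "TagWorkspace", Array.empty[String])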
Have you tried using the client VM to start your utility from the command line? Sadly, this isn't an option with 64-bit Java, since Oracle apparently doesn't want to support it, but if you're using a 32-bit VM, try adding the -client argument to the list that you give the VM from the command line.
If you are using a 64-bit VM, some googling will find you some unofficial forks of OpenJDK that have the client VM re-enabled. It's really just a #define in the JVM build itself - it works fine once it's been compiled in.
The only slowness I have is launching SBT. Running a hello-world Scala app with java (no Drip) version 1.8 on a 7381 bogomips CPU takes only 0.2 seconds.
If you're not in that order of magnitude, I suspect your application startup requires loading thousands of classes and creating instances of them.

Running SBT (Scala) on several (cluster) machines at the same time

So I've been playing with Akka Actors for a while now, and have written some code that can distribute computation across several machines in a cluster. Before I run the "main" code, I need to have an ActorSystem waiting on each machine I will be deploying over, and I usually do this via a Python script that SSH's into all the machines and starts the process by doing something like cd /into/the/proper/folder/ and then sbt 'run-main ActorSystemCode'.
I run this Python script on one of the machines (call it "Machine X"), so I will see the output of SSH'ing into all the other machines in my Machine X SSH session. Whenever I do run the script, it seems all the machines are re-compiling the entire code before actually running it, making me sit there for a few minutes before anything useful is done.
My question is this:
Why do they need to re-compile at all? The same JVM is available on all machines, so shouldn't it just run immediately?
How do I get around this problem of making each machine compile "its own copy"?
sbt is a build tool and not an application runner. Use sbt-assembly to build an all-in-one jar, put the jar on each machine, and run it with the scala or java command.
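A minimal sketch of that route, with an illustrative plugin version (check the sbt-assembly releases for the one matching your SBT version); ActorSystemCode is the main class from the question:
// project/plugins.sbt -- version is illustrative
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt -- record the main class so the jar's manifest points at it
assembly / mainClass := Some("ActorSystemCode")
Then run sbt assembly once on the build machine, copy the jar from target/scala-<scala version>/<project>-assembly-<version>.jar to each node, and start it there with java -jar, so no compilation happens on the cluster machines.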
It's usual for a cluster to have a single partition mounted on every node (via NFS or Samba). You just need to copy the artifact to that partition and it will be directly accessible on each node. If that's not the case, you should ask your sysadmin to set it up.
Then you will need to launch the application. Again, most clusters come
with MPI. The tools mpirun (or mpiexec) are not restricted to real MPI applications and will launch any script you want on several nodes.