Is there a way to run a Scala worksheet in IntelliJ on a remote environment? - scala

I am looking for a way to run some Scala code in a Spark shell on a cluster. Is there a way to do this? Or even inside a simple Scala shell where I can instantiate my own Spark context?
I tried to look for some kind of remote setup for the Scala worksheet in IntelliJ, but I wasn't able to find anything useful.
So far the only way I can connect to a remote environment is to run the debugger.

The best solution I have come across is to install Jupyter Notebook on the Spark cluster.
You can then use the browser and work remotely on the cluster. Otherwise, good old telnet also works.

Related

Jupyter for Scala with spylon-kernel without having to install Spark

Based on web searches and strong recommendations, I am trying to run Jupyter locally for Scala (using spylon-kernel).
I was able to create a notebook, but when I try to run a Scala code snippet I see the message "initializing scala interpreter" and, in the console, this error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
I am not planning to install Spark. Is there a way I can still use Jupyter for Scala without installing Spark?
I am new to Jupyter and the ecosystem. Pardon me for the amateur question.
Thanks

Can I use JupyterLab to interact with a Databricks Spark cluster using Scala?

Can I use JupyterLab to connect to a Databricks Spark cluster that is hosted remotely?
There are KB articles about Databricks Connect, which allows a Scala or Java client process to control a Spark cluster. Here is an example:
https://docs.databricks.com/dev-tools/databricks-connect.html
While that KB article covers a lot of scenarios, it doesn't explain how to use Jupyter notebooks to interact with a Databricks cluster using the Scala programming language. I'm familiar with Scala programming, but not with Python.
Yes, it appears to be possible, although it is not well documented. These steps worked for me on Windows. I used Databricks Runtime 7.1 with Scala 2.12.10.
Step 1. Install Anaconda: https://repo.anaconda.com/
Step 2. Because Python seems to be the language of choice for notebooks, you will need to manually install and configure a Scala kernel. I was able to get things working with the almond kernel: https://almond.sh/
When you install almond, be careful to pick a version of Scala that corresponds to the Databricks Runtime you will be connected to in the remote cluster.
Step 3. Now follow the databricks-connect docs to get a Scala program to compile and connect to the remote cluster via the IntelliJ / sbt environment. The documentation can be found here: https://docs.databricks.com/dev-tools/databricks-connect.html
This is a fully supported and fairly conventional approach that can be used to develop custom modules.
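For illustration, the kind of standalone program that Step 3 produces looks roughly like the sketch below. The object name is hypothetical and the query is just a placeholder; the only point is that, with databricks-connect configured, the trivial job runs on the remote cluster.
import org.apache.spark.sql.SparkSession

object ConnectTest {
  def main(args: Array[String]): Unit = {
    // databricks-connect picks up the cluster coordinates from its own configuration,
    // so nothing cluster-specific needs to appear here.
    val spark = SparkSession.builder()
      .appName("databricks-connect-test")
      .getOrCreate()
    // A trivial job that should execute on the remote cluster if the setup is correct.
    println(spark.range(0, 100).count())
    spark.stop()
  }
}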
Step 4. Once you have created a working Scala process, you will be familiar with sbt. The build.sbt is used to reference the "databricks-connect" distribution, which will be in a location like so:
unmanagedBase := new java.io.File("C:\\Users\\minime\\AppData\\Local\\Programs\\Python\\Python37\\Lib\\site-packages\\pyspark\\jars")
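To put that line in context, a minimal build.sbt sketch might look like the following; the project name is hypothetical, the Scala version must match the remote Databricks Runtime, and the jar directory is whatever databricks-connect get-jar-dir reports on your machine.
name := "dbconnect-sandbox"   // hypothetical project name
scalaVersion := "2.12.10"     // must match the Scala version of the remote DBR
// Pull the databricks-connect jars in as unmanaged dependencies
// (path taken from databricks-connect get-jar-dir):
unmanagedBase := new java.io.File("C:\\Users\\minime\\AppData\\Local\\Programs\\Python\\Python37\\Lib\\site-packages\\pyspark\\jars")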
While it is straightforward for IntelliJ / sbt to compile those dependencies into your program, it will take a bit more work to do the equivalent thing in the almond/Jupyter kernel.
Before you go back to your Jupyter notebook, run your new Scala process and allow it to create a Spark session. Then, before the process dies, use Process Explorer to find the related java.exe, show handles in the lower view/pane, and copy all the handles into Notepad (Ctrl+A in Process Explorer, Ctrl+V in Notepad). This gives you the subset of modules from the databricks-connect distribution that are actually being loaded into your process at runtime.
Step 5. Now that you know which modules are relevant, you need to configure your almond Scala kernel to load them into memory. Create a new Jupyter notebook, select the Scala kernel, and use code like the following to load all your modules:
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever001-1.1.1.jar")))
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever002-1.1.1.jar")))
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever003-1.1.1.jar")))
...
Please note that there are lots and lots of jars in the distribution (maybe 100!?).
You may also wish to load other libraries directly from Maven (assuming they are compatible with Scala 2.12.10 and your databricks-connect distribution):
// Microsoft JDBC
interp.load.ivy("com.microsoft.sqlserver" % "mssql-jdbc" % "8.2.1.jre8")
// Other libraries
interp.load.ivy("joda-time" % "joda-time" % "2.10.5")
interp.load.ivy("org.scalaj" %% "scalaj-http" % "2.3.0")
interp.load.ivy("org.json4s" %% "json4s-native" % "3.5.3")
interp.load.ivy("com.microsoft.azure" % "msal4j" % "1.6.1")
// Other libraries
interp.load.ivy("org.apache.hadoop" % "hadoop-azure" % "3.2.1")
Fair warning... when loading libraries into the almond kernel, it is sometimes important to load them in a specific order. My examples above aren't intended to tell you what order to load them via interp.load.
Step 6. If everything went as planned, you should now be able to create a Spark session in a Jupyter notebook, using code similar to what you were writing in "Step 3" above.
import org.apache.spark.sql._
val p_SparkSession = SparkSession.builder()
  .appName("APP_" + java.util.UUID.randomUUID().toString)
  .master("local")
  .config("spark.cores.max", "4")
  .getOrCreate()
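As a quick sanity check, something like the following should run a trivial query through the session just created (this is only a sketch; any simple query will do):
// If the databricks-connect session is healthy, these trivial operations
// should be planned against the remote cluster rather than a purely local Spark.
p_SparkSession.range(0, 10).count()
p_SparkSession.sql("SHOW DATABASES").show()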
Your almond kernel is now connected to the remote cluster, via the databricks-connect distribution. Everything works as long as you don't need to serialize any functions or data types out to the remote cluster. In that case you will probably get a variety of serialization errors and null pointer exceptions. Here is an example:
java.lang.NullPointerException
com.databricks.service.SparkServiceClassSync$.checkSynced(SparkServiceClassSync.scala:244)
org.apache.spark.sql.util.SparkServiceObjectOutputStream.writeReplaceClassDescriptor(SparkServiceObjectOutputStream.scala:82)
...
org.apache.spark.sql.util.ProtoSerializer.serializePlan(ProtoSerializer.scala:377)
com.databricks.service.SparkServiceRPCClientStub.$anonfun$executePlan$1(SparkServiceRPCClientStub.scala:193)
This answer will be the first of several. I'm hoping that there are other scala/spark/databricks experts who can help work out the remaining kinks in this configuration, so that any of the functions and data types that are declared in my notebooks can be used by the remote cluster as well!
In my first answer I pointed out that the primary challenge in using Scala notebooks (in JupyterLab with almond) is that we are missing the functionality to serialize any functions or data types and send them out to the remote cluster that is being hosted by Databricks.
I should point out that there are two workarounds that I use regularly when I encounter this limitation.
I revert to using the "spark-shell". It is a standard component of the databricks-connect distribution. I can then load the relevant parts of my Scala code using the :load and :paste commands. For some happy reason the "spark-shell" is fully capable of serializing functions and data types in order to dynamically send them to the remote cluster. This is something that the almond kernel is not able to do for us within the context of Jupyter notebooks.
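For example, the kind of code I :paste into the spark-shell looks roughly like this (the case class and transformation are made up; the point is that the closure serializes cleanly to the remote executors):
import spark.implicits._

// A notebook-style type and transformation; spark-shell serializes these
// out to the remote cluster without complaint.
case class Person(id: Long, name: String)
val people = Seq(Person(1, "alice"), Person(2, "bob")).toDS()
val upper = people.map(p => p.copy(name = p.name.toUpperCase))
upper.show()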
The other workaround is to .collect() the DataFrames back to the driver (within the memory of the Jupyter notebook kernel). Once they are collected, I can perform additional transformations on them, even with the help of "original" functions and "original" data types that are only found within my Jupyter notebook. In this case I won't get the performance benefits of distributed processing. But while the code is still under development, I'm typically not working with very large datasets, so it doesn't make that much of a difference whether the driver is running my functions or the workers are.
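A sketch of that second workaround, reusing p_SparkSession from Step 6 (the table name and the local helper are hypothetical):
import org.apache.spark.sql.Row

// Pull a small result set back into the notebook kernel's memory.
val rows: Array[Row] = p_SparkSession.sql("SELECT id, name FROM some_table LIMIT 1000").collect()

// A function that exists only in the notebook; it never needs to be serialized
// to the cluster, because it runs on the driver over the collected rows.
def label(r: Row): String = s"${r.getAs[Long]("id")}: ${r.getAs[String]("name")}"

val labels = rows.map(label)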
Hope this is clear. I'm hoping that Databricks may eventually see the benefit of allowing Scala programmers to develop code remotely in JupyterLab. I think they need to be the ones to select one of the Scala kernels and do the heavy lifting to support this scenario. As of now, they probably believe their own notebook experience in their own portal is sufficient for the needs of all Scala programmers.
To add on to David's first answer, I did this additional step:
Step 5.5. Programmatically add the Databricks jar dependencies to the Scala kernel.
Using the directory you get from databricks-connect get-jar-dir, I used the following code:
import $ivy.`com.lihaoyi::os-lib:0.2.7`

def importJars(): Unit = {
  val myJars = os.list(os.Path("/Users/me/miniconda3/envs/dbx-p40/lib/python3.7/site-packages/pyspark/jars/"))
  for (j <- myJars) {
    interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(j.toString)))
  }
}

importJars()

PySpark and PDB don't seem to mix

I'm building standalone Python programs that will use PySpark (and the elasticsearch-hadoop connector). I am also addicted to the Python Debugger (PDB) and want to be able to step through my code.
It appears I can't run pyspark with PDB the way I normally would:
./pyspark -m pdb testCode.py
I get the error "pyspark does not support any application options".
Is it possible to run PySpark code from the standard Python interpreter, or do I need to give up PDB?
I also saw online that I need to include py4j-0.9-src.zip in my PYTHONPATH. When I do that, I can use the Python interpreter and step through my code, but I get the error "Py4JavaError: Py4JJava...t id=o18)" whenever it runs any of the PySpark code. That error seemed to indicate that I wasn't really interacting with Spark.
How do I approach this?

IPython starting ipclusters

I'm using the amazing IPython notebook. I'm very interested in parallel computing right now and would like to use MPI with IPython (and mpi4py). But I can't start a cluster with
ipcluster start -n 4
on Windows 7; I just get back "failed to create process". If I use the notebook and start a cluster from the "Clusters" tab, it all works fine, but with cmd (even with admin rights) I just get this message. The same happens with all attempts at using MPI (MPICH2). All path variables are set. Maybe this problem has no connection to Python at all...
I can't say anything about IPython's parallel features, but if you're having problems with MPI on Windows in general, I would offer these suggestions. I've had quite a few issues in the past trying to get MPI working on Windows. The most convenient method for me has been to use an OpenMPI Windows binary: http://www.open-mpi.org/software/ompi/v1.6/. These are now only available in previous releases, and even then, you might have to try more than one before you find one that works. I don't know why, but the latest didn't work on my machine; the release before that one did, however. After this, you have to call mpicc and mpiexec from the Microsoft Visual Studio Command Prompt or it won't work (without a lot of other stuff).
After you have verified that MPI is working, you can try installing mpi4py separately and see if that works. In my experience, sometimes this has worked fine and sometimes I've had to wrestle with configurations. You might just try your luck with an unofficial, prepackaged binary (for example, http://www.lfd.uci.edu/~gohlke/pythonlibs/).
Hope this helps!

Use Apache Cascading in windows

I am starting to use the Cascading library, but all the information I can find is about Cascading on Linux... I have run the Impatient examples fine on an Ubuntu server.
But I want to develop and test my application using Eclipse on Windows...
Is that possible? How can I do it?
Thanks
Glad to hear the "Impatient" examples helped out -
There are two concerns: (1) Windows and (2) Eclipse.
Hadoop runs in Java and is primarily meant for running apps on clusters. You must be careful on Windows, because Hadoop's support for Windows is problematic. I've seen many students attempt to use Cygwin, thinking that it would provide a Linux-like layer -- it does not. Running Hadoop atop Cygwin is typically more trouble than it's worth. Obviously the HDInsight work by Microsoft is a great way to run Hadoop on Windows, on Azure. To run Hadoop on your Windows desktop, it's best to use a virtual machine. Then be certain to run in "Standalone Mode" instead of pseudo-distributed mode or attempting to create a cluster on your desktop. Otherwise, it'd be better to run Cascading apps in HDInsight for Hadoop on Azure.
Eclipse is a much simpler answer. The Gradle build scripts for the "Impatient" series show how to use "gradle eclipse" to generate a project to import into your IDE. Even so, you may have to clean up some paths -- Eclipse doesn't handle Gradle imports as cleanly as it should, from what I've seen.
Hope that helps -
To develop and test your Cascading application using Eclipse on Windows, you need to apply a patch (https://github.com/congainc/patch-hadoop_7682-1.0.x-win). Download the patch jar, then add it to your application's CLASSPATH. In your code, you need to set the property "fs.file.impl" as shown below:
Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
if (System.getProperty("os.name").toLowerCase().indexOf("win") >= 0) {
    // On Windows, swap in the patched local file system from HADOOP-7682
    properties.put("fs.file.impl",
        "com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem");
}
HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);