Jupyter for Scala with spylon-kernel without having to install Spark

Based on web searches and strong recommendations, I am trying to run Jupyter locally for Scala (using spylon-kernel).
I was able to create a notebook, but while trying to run a Scala code snippet I see the message "Initializing Scala interpreter", and in the console I see this error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
I am not planning to install Spark. Is there a way I can still use Jupyter for Scala without installing Spark?
I am new to Jupyter and its ecosystem, so pardon the amateur question.
Thanks

Related

<console>:25: error: object databricks is not a member of package com

I am currently working in Zeppelin with Spark and Scala. I want to import the library that provides import com.databricks.spark.xml.
I tried, but I still get the same error in Zeppelin: <console>:25: error: object databricks is not a member of package com.
What have I done so far? I created a note in Zeppelin with this code: %dep
z.load("com.databricks:spark-xml_2.11:jar:0.5.0"). Even with that, the interpreter doesn't work. It seems it does not succeed in loading the library.
Do you have any idea why it doesn't work?
Thanks for your help and have a nice day!
Your problem is very common and not intuitive to solve. I resolved a similar issue (I wanted to load the PostgreSQL JDBC connector on AWS EMR, working from a Linux terminal). Your issue can be resolved by checking whether you can:
load the jar file manually into the environment that is hosting Zeppelin.
add the path of the jar file to your CLASSPATH environment variable. I don't know where you manage your CLASSPATH, but on EMR my file, viewed from the Zeppelin root directory, was here: /usr/lib/zeppelin/conf/zeppelin-env.sh
download the Zeppelin interpreter with
$ sudo ./bin/install-interpreter.sh --name "" --artifact
add the interpreter in Zeppelin by going to the Zeppelin Interpreter GUI and adding it to the interpreter group.
Reboot Zeppelin with:
$ sudo stop zeppelin
$ sudo start zeppelin
It's very likely that your configuration will vary slightly, but I hope this provides some structure and relevance.
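For reference, Zeppelin's %dep loader normally takes a plain group:artifact:version coordinate and has to run before the Spark interpreter starts. A minimal sketch (using the same spark-xml version as in the question, without the extra :jar: qualifier) would be:
%dep
z.load("com.databricks:spark-xml_2.11:0.5.0")
// in a later Scala paragraph the import should then resolve
import com.databricks.spark.xml._
If the dependency paragraph is run after the Spark interpreter has already started, restarting the interpreter and re-running the %dep paragraph first is usually necessary.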

How to add customized jar in Jupyter Notebook in Scala

I need to use a third party jar (mysql) in my Scala script, if I use spark shell, I can specify the jar in the starting command like below:
spark2-shell --driver-class-path mysql-connector-java-5.1.15.jar --jars /opt/cloudera/parcels/SPARK2/lib/spark2/jars/mysql-connector-java-5.1.15.jar
However, how can I do this in a Jupyter notebook? I remember there is a magic way to do it in PySpark, but I am using Scala, and I can't change the environment settings of the kernel I am using.
I have the solution now, and it is very simple indeed:
Use a Toree-based Scala kernel (which is what I am using).
Use %AddJar in the notebook and run it; the jar will be downloaded and voila!
That's it.
%AddJar http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.15/mysql-connector-java-5.1.15.jar
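Once the jar is on the kernel's classpath, using the MySQL driver is ordinary Spark JDBC code. The following is only a sketch, assuming the spark session that Toree exposes; the URL, table, and credentials are hypothetical placeholders:
// hypothetical connection details -- replace with your own host, database, table, and credentials
val jdbcUrl = "jdbc:mysql://db-host:3306/mydb"
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "my_table")
  .option("user", "my_user")
  .option("password", "my_password")
  .option("driver", "com.mysql.jdbc.Driver")  // driver class for the 5.1.x connector
  .load()
df.show(5)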

Is there a way to run a Scala worksheet in IntelliJ on a remote environment?

I am looking for a way to run some Scala code in a Spark shell on a cluster. Is there a way to do this? Or even inside a plain Scala shell where I can instantiate my own Spark context?
I tried to look for some kind of remote setup for Scala worksheets in IntelliJ, but I wasn't able to find anything useful.
So far the only way I can connect to a remote environment is to run the debugger.
The best solution I have come across is to install Jupyter notebook on the Spark cluster.
Now you can use the browser and work remotely on the cluster. Otherwise, good old telnet also works.
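If you do want to instantiate your own context from a plain Scala shell or a small sbt project instead of a notebook, a minimal sketch looks like this; the master URL and app name are placeholders, and the Spark jars must be on the classpath:
import org.apache.spark.sql.SparkSession

// hypothetical cluster endpoint -- replace with your spark:// (or yarn) master URL
val spark = SparkSession.builder()
  .appName("remote-worksheet")
  .master("spark://cluster-host:7077")
  .getOrCreate()

println(spark.range(10).count())  // quick sanity check that work runs on the remote cluster
spark.stop()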

Setting Specific Python in Zeppelin Interpreter

What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use a Python subprocess to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
From inside your activated venv, in Python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to output of `sys.executable`

Scala is a must for Spark?

I am new to Spark. Its docs say it is available in either Scala or Python.
And some blogs say Spark depends on Scala (for instance, http://cn.soulmachine.me/blog/20130614/). Therefore, I am wondering: is Scala a must for Spark? (Do I have to install Scala first due to the dependency?)
The Spark API has the following language bindings:
Scala
Java
Python
Scala is a natural fit, since it strongly supports functional programming, which is an obvious benefit in the area of Big Data. Most tutorials and code snippets that you find on the net are written in Scala.
Concerning the runtime dependencies, please have a look at
the project download page
"Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark 1.2.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x)."
Java is a must for Spark, plus many other transitive dependencies (the Scala compiler is just a library for the JVM). PySpark just connects remotely (over a socket) to the JVM using Py4J (Python-Java interoperation). Py4J is included in PySpark.
PySpark requires Python 2.6 or higher. PySpark applications are
executed using a standard CPython interpreter in order to support
Python modules that use C extensions. We have not tested PySpark with
Python 3 or with alternative Python interpreters, such as PyPy or
Jython.
All of PySpark’s library dependencies, including Py4J, are bundled
with PySpark and automatically imported.
Standalone PySpark applications should be run using the bin/pyspark
script, which automatically configures the Java and Python environment
using the settings in conf/spark-env.sh or .cmd. The script
automatically adds the bin/pyspark package to the PYTHONPATH.
https://spark.apache.org/docs/0.9.1/python-programming-guide.html - this guide shows how to build and run all of this with the Scala/Java build tool sbt, which will download all dependencies (including Scala) automatically from a remote repository. You can also use Maven.
If you don't want Java on your machine, you can start it on another machine and configure PySpark to use it (via SparkConf().setMaster).
So you need Java on the master node running Spark itself (along with all JVM dependencies such as Scala), and Python 2.6+ for the Python client.
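To illustrate the point about sbt pulling in Scala and Spark automatically, a minimal build.sbt for a Scala Spark application might look like the sketch below; the versions simply mirror the quote above (Spark 1.2.0 on Scala 2.10) and should be adjusted to your cluster:
// build.sbt -- illustrative versions; adjust to match your cluster
name := "spark-example"

scalaVersion := "2.10.4"

// sbt downloads Spark, the Scala library, and their transitive dependencies from Maven Central
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
With this in place there is no separate Scala installation step; sbt fetches the Scala compiler and library it needs along with Spark.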