How to integrate the Jupyter notebook Scala kernel with Apache Spark?

I have installed Scala kernel based on this doc: https://github.com/jupyter-scala/jupyter-scala
Kernel is there:
$ jupyter kernelspec list
Available kernels:
python3 /usr/local/homebrew/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/resources
scala /Users/bobyfarell/Library/Jupyter/kernels/scala
When I try to use Spark in the notebook I get this:
val sparkHome = "/opt/spark-2.3.0-bin-hadoop2.7"
val scalaVersion = scala.util.Properties.versionNumberString
import org.apache.spark.ml.Pipeline
Compilation Failed
Main.scala:57: object apache is not a member of package org
; import org.apache.spark.ml.Pipeline
^
I tried:
Setting SPARK_HOME and CLASSPATH to the location of $SPARK_HOME/jars
Setting -cp option pointing to $SPARK_HOME/jars in kernel.json
Setting classpath.add call before imports
None of these helped. Please note that I don't want to use Toree; I want to use standalone Spark and the Scala kernel with Jupyter. A similar issue is reported here too: https://github.com/jupyter-scala/jupyter-scala/issues/63

It doesn't look like you are following the jupyter-scala directions for using Spark: you have to load Spark into the kernel through its special imports rather than via SPARK_HOME or a classpath setting.
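For reference, a minimal sketch of what those special imports look like with almond (the successor of jupyter-scala); the artifact names and version numbers below are illustrative placeholders and have to match the kernel release you actually installed:

// Resolve Spark and the kernel's Spark integration via Ivy inside the notebook.
// Versions are placeholders -- align them with your Spark and kernel install.
import $ivy.`org.apache.spark::spark-sql:2.3.0`
import $ivy.`org.apache.spark::spark-mllib:2.3.0`  // needed for org.apache.spark.ml.Pipeline
import $ivy.`sh.almond::almond-spark:0.6.0`

import org.apache.spark.sql._

// NotebookSparkSession (from almond-spark) wires the session into the notebook,
// so the kernel's classpath is propagated correctly.
val spark = NotebookSparkSession.builder()
  .master("local[*]")
  .getOrCreate()

// Once the session exists, Spark classes resolve normally:
import org.apache.spark.ml.Pipeline

With this route there is no need to point SPARK_HOME or -cp at $SPARK_HOME/jars at all.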

Related

How to install sbt package for Jupyter notebook with Scala kernel?

I am using a Jupyter notebook with the Scala kernel. How can I install an sbt package and load it in the notebook?
It seems this https://www.scala-sbt.org/1.x/docs/Scripts.html is related.
Late answer, but you likely want to use Almond as the Scala kernel. Simply import library dependencies via Ivy coordinates, e.g.:
import $ivy.`org.scalaz::scalaz-core:7.2.27`, scalaz._, Scalaz._

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a Scala Jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
gives error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
Which from what I can tell means that it runs the bash command, but then still cannot find the retrieved package.
I am doing this in an EMR Notebook running the Spark Scala kernel.
Do I have to set some sort of spark library path in the jupyter environment?
That simply shouldn't work. What your code does is merely attempt to launch a new, independent spark-shell as a subprocess. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or set the equivalent through SparkConf / SparkSession.Builder.config before the SparkSession is initialized, as sketched below.
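In Scala that could look roughly like the following (the package coordinate is the one from the question, the rest is illustrative; note that on an EMR Notebook the session is typically managed for you, so the configuration-file route above may be the one that actually applies there):

import org.apache.spark.sql.SparkSession

// spark.jars.packages is only honoured when the SparkContext is created,
// so it must be set before the first SparkSession/SparkContext exists.
val spark = SparkSession.builder()
  .appName("graphframes-example")
  .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
  .getOrCreate()

// Now the package is on the driver and executor classpaths:
import org.graphframes._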

How to add jar files in pyspark anaconda?

from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g")
sc=SparkContext.getOrCreate(conf)
dfv = sc.textFile("./part-001*.gz")
I have installed pyspark through Anaconda and I can import pyspark in Anaconda's Python, but I don't know how to add jar files to the conf.
I tried
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.packages','file:///XXX,jar')
but it doesn't work.
What is the proper way to add a jar file here?
The docs say:
spark.jars.packages: Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
Instead, you should simply use spark.jars:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
So:
conf=SparkConf().setAppName("2048roject").setMaster("local[*]")\
.set("spark.driver.maxResultSize", "80g").set("spark.executor.memory", "5g").set("spark.driver.memory", "60g").set('spark.jars.files','file:///XXX.jar')

How do I make packages available to the Scala REPL?

I'm trying to get familiar with Scala. I am using macOS.
I've installed Scala using brew install scala, which is hassle-free, and once it completes I can launch the Scala REPL simply by issuing scala, which puts me at the scala> prompt.
I now want to import some packages, so I try:
import org.apache.spark.sql.Column
and unsurprisingly it fails with
error: object apache is not a member of package org
This makes sense, how would it know where to get that package from? Thing is, I don't know what I need to do to make that package available. Is there anything I can do from the command-line that would allow me to import org.apache.spark.sql.Column?
I have googled around a little but not found anything that explains in a jargon-free way. Complete Scala noob here so jargon-free responses would be appreciated.
Here are two ways to start a REPL with dependencies that I'm aware of:
Use SBT to manage dependencies, use console to start a REPL with those dependencies
Use Ammonite REPL
You could create a separate directory with a build.sbt where you set
scalaVersion := "2.11.12"
and then copy the
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
snippet from Maven Central. Then you can run the REPL with sbt console. Note that this will create project and target subdirectories, so it "leaves traces" and can't be used like the standalone Scala REPL. You could also omit the build.sbt and add the library dependencies by typing them into the sbt shell itself.
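Put together, a minimal build.sbt for such a throwaway console project could look like this (Scala and Spark versions as above, adjust to taste):

// build.sbt -- the only purpose of this project is `sbt console`
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

Running sbt console in that directory then starts a REPL with spark-sql (and its transitive dependencies) on the classpath, so import org.apache.spark.sql.Column works.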
Alternatively you can just use Ammonite REPL that has been created exactly for that purpose.
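With Ammonite the dependency can be pulled straight from the running REPL, with no project directory required (same version assumptions as above):

// Inside the Ammonite REPL (started with `amm`):
import $ivy.`org.apache.spark::spark-sql:2.3.0`

// Once the Ivy resolution finishes, the Spark classes are importable:
import org.apache.spark.sql.Column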
You can use the classpath to make the library available, i.e. download the jar locally and pass it to scala with -cp as follows (here I'm using the Apache Commons IO library to move files from the Scala prompt):
C0:Desktop pvangala$ scala -cp /Users/pvangala/Downloads/commons-io-2.6/commons-io-2.6.jar
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161).
Type in expressions for evaluation. Or try :help.
scala> import java.io.File
import java.io.File
scala> val src = new File("/Users/pvangala/Downloads/commons-io-2.6-bin.tar")
src: java.io.File = /Users/pvangala/Downloads/commons-io-2.6-bin.tar
scala> val dst = new File("/Users/pvangala/Desktop")
dst: java.io.File = /Users/pvangala/Desktop
scala> org.apache.commons.io.FileUtils.moveFileToDirectory(src, dst, true)
If you want to use Spark I'd recommend the spark-shell that comes with the Spark installation. I don't know macOS, so I can't help you much with installing Spark there.
For plain Scala I recommend Ammonite (http://ammonite.io/#Ammonite-REPL), which has built-in syntax for pulling packages/dependencies.
If you want to use Spark, you should use the spark-shell instead of the Scala REPL. It has almost the same behaviour but includes all the Spark dependencies by default.
You should download the Spark binaries from here.
Then, if you are using Linux, create the variable SPARK_HOME pointing to the downloaded folder and include its bin folder in PATH.
You can then start it in any console with the command spark-shell.
On Windows I'm not sure, but I think there is a spark-shell.cmd file or something similar which you can use to start the spark-shell.
I did the following in Windows:
for /f "tokens=*" %%a in ('java -jar coursier fetch -p "com.lihaoyi::requests:0.2.0" "com.lihaoyi::upickle:0.7.5"') do set SCP=%%a
scala -nc -classpath %SCP% %1 %2 %3
Instead of the two libraries listed here you can use any number of other libraries you need; they must be available in Maven Central, though. The %1 could be a Scala script (".sc" extension), but you can also leave it empty and the REPL will start with the libraries on the classpath.

SparkContext cannot be initialized in 'yarn-client' mode called from Scala-IDE

I have installed the Cloudera VM (single node) and inside this VM I have Spark running on top of YARN. I would like to use the Eclipse IDE (with the Scala plugin) for testing/learning with Spark.
If I instantiate the SparkContext as follows, everything works as I expect:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
However, if I now want to connect to the local cluster by changing the master to 'yarn-client', it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically, I'm getting the following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things I have tried so far:
1. Dependencies
I added all the dependencies through the Maven repository.
The Cloudera version is 5.5; the corresponding Hadoop version is 2.6.0 and the Spark version is 1.5.0.
2. Configurations
I added 3 path variables to the Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody clarify what the problem is here and how to solve it?
I worked around it! I still don't understand what the exact problem is, but I created a folder with my username in HDFS, i.e. the /user/myusername directory, and it worked. Anyway, I have since switched to the Hortonworks distribution and found it much smoother to get started with than the Cloudera distribution.
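For completeness, a hedged sketch of creating that directory with the Hadoop FileSystem API; the more usual route is simply hdfs dfs -mkdir -p /user/myusername on the cluster, and creating anything under /user normally requires HDFS superuser rights:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up core-site.xml / hdfs-site.xml from HADOOP_CONF_DIR on the classpath.
val conf = new Configuration()
val fs = FileSystem.get(conf)

// YARN stages job resources under the submitting user's home directory,
// so /user/<username> has to exist before a yarn-client job can start.
val home = new Path("/user/myusername")
if (!fs.exists(home)) fs.mkdirs(home)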