EMR Notebook Scala kernel import graphframes library - scala

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a scala jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
gives error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
From what I can tell, this means it runs the bash command but still cannot find the retrieved package.
I am doing this on an EMR Notebook running a spark scala kernel.
Do I have to set some sort of spark library path in the jupyter environment?

That simply shouldn't work. All your code does is attempt to start a new, independent Spark shell from inside the notebook. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are the correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or use the equivalent in your SparkConf / SparkSession.Builder.config before the SparkSession is initialized.
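A sketch of the programmatic variant, assuming a Spark 2.4 distribution on the classpath (the app name and this exact builder chain are illustrative, not the only way to do it):

```scala
import org.apache.spark.sql.SparkSession

// spark.jars.packages is only honored when the very first
// SparkContext is created; setting it on a running session is a no-op.
val spark = SparkSession.builder()
  .appName("graphframes-example") // illustrative name
  .config("spark.jars.packages",
          "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
  .getOrCreate()

// Only after this does the import resolve:
// import org.graphframes._
```

On an EMR Notebook the session is typically created for you by the kernel, so the spark.jars.packages entry in the Spark configuration files is usually the practical route there.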

Related

run pyspark locally with pycharm

I wrote the following very simple python script with my Pycharm IDE
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql.types import LongType, FloatType,IntegerType,StringType,DoubleType
from pyspark.sql.functions import udf
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import abs
from pyspark.sql import HiveContext
spark = SparkSession.builder.config("requiredconfig").appName("SparkSran").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
when I click on run on my IDE or run the following command
spark-submit --py-files /home/user/PycharmProjects/helloSparkWorld/test.py
I get
/usr/lib/spark/bin/spark-class: line 71: /usr/local/java/jdk10.0.1/bin/java: No such file or directory
my JAVA_HOME and SPARK_HOME are set as follows
echo $SPARK_HOME gives /usr/lib/spark
and
echo $JAVA_HOME gives
/usr/local/java/jdk10.0.1
You can just do a pip install pyspark in the environment you are using with your PyCharm installation. If you're running locally, you can then run your pyspark .py files directly with python filename.py.
Basically, just provide your pip or Python interpreter with the pyspark pip package and you'll be able to run the script via PyCharm using the same interpreter.

error not found value spark import spark.implicits._ import spark.sql

I am using Hadoop 2.7.2, HBase 1.4.9, Spark 2.2.0, Scala 2.11.8 and Java 1.8 on a Hadoop cluster composed of one master and two slaves.
When I run spark-shell after starting the cluster, it works fine.
I am trying to connect to HBase from Scala by following this tutorial: https://www.youtube.com/watch?v=gGwB0kCcdu0.
But when, like he does, I run spark-shell with those jars passed as arguments, I get this error:
spark-shell --jars
"hbase-annotations-1.4.9.jar,hbase-common-1.4.9.jar,hbase-protocol-1.4.9.jar,htrace-core-3.1.0-incubating.jar,zookeeper-3.4.6.jar,hbase-client-1.4.9.jar,hbase-hadoop2-compat-1.4.9.jar,metrics-json-3.1.2.jar,hbase-server-1.4.9.jar"
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
After that, even if I log out and run spark-shell again, I have the same issue.
Can anyone please tell me what the cause is and how to fix it?
In your import statement, spark should be an object of type SparkSession. That object should have been created for you previously, or you need to create it yourself (see the Spark docs). I didn't watch your tutorial video.
The point is that it doesn't have to be called spark. It could, for instance, be called sparkSession, and then you could do import sparkSession.implicits._
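The mechanics can be illustrated with plain Scala and no Spark at all (Session and greeting below are made-up stand-ins): Scala lets you import members from any stable value, which is exactly what import spark.implicits._ does with whatever SparkSession instance you created, e.g. via SparkSession.builder().getOrCreate().

```scala
// Stand-in for SparkSession: any class exposing a member object works.
class Session {
  object implicits {
    val greeting = "imported" // stand-in for Spark's implicit conversions
  }
}

val sparkSession = new Session   // the name is entirely up to you
import sparkSession.implicits._  // import from this particular instance
println(greeting)                // prints "imported"
```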

How to import play framework jar in spark-shell in windows?

I am using a Windows machine and installed Spark and Scala for my learning. For spark-sql, I need to process JSON input data.
scala> sc
res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7431f4b8
scala> import play.api.libs.json._
<console>:23: error: not found: value play
import play.api.libs.json._
^
scala>
How can I import the play API in my spark-shell command?
If you want to use other libraries while you are using spark-shell, you need to run the spark-shell command with --jars and/or --packages. For example, to use play in your Spark shell, run the following command:
spark-shell --packages "com.typesafe.play":"play_2.11":"2.6.19"
For more information, you can use spark-shell -h. I hope it helps!
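Once the shell has started with that package, the import resolves and the standard play-json calls work; a quick sketch (the JSON content here is made up):

```scala
import play.api.libs.json._

// Parse a JSON string and extract typed fields with the `\` lookup.
val js: JsValue = Json.parse("""{"name": "spark", "version": 2}""")
val name = (js \ "name").as[String]
val version = (js \ "version").as[Int]
```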

How to integrate Jupyter notebook scala kernel with apache spark?

I have installed Scala kernel based on this doc: https://github.com/jupyter-scala/jupyter-scala
Kernel is there:
$ jupyter kernelspec list
Available kernels:
python3 /usr/local/homebrew/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/resources
scala /Users/bobyfarell/Library/Jupyter/kernels/scala
When I try to use Spark in the notebook I get this:
val sparkHome = "/opt/spark-2.3.0-bin-hadoop2.7"
val scalaVersion = scala.util.Properties.versionNumberString
import org.apache.spark.ml.Pipeline
Compilation Failed
Main.scala:57: object apache is not a member of package org
; import org.apache.spark.ml.Pipeline
^
I tried:
Setting SPARK_HOME and CLASSPATH to the location of $SPARK_HOME/jars
Setting -cp option pointing to $SPARK_HOME/jars in kernel.json
Setting classpath.add call before imports
None of these helped. Please note I don't want to use Toree, I want to use standalone spark and Scala kernel with Jupyter. A similar issue is reported here too: https://github.com/jupyter-scala/jupyter-scala/issues/63
It doesn't look like you are following the jupyter-scala directions for using Spark. You have to load Spark into the kernel using its special imports, rather than by pointing the kernel at an existing Spark installation on disk.

--files in SPARK_SUBMIT_OPTIONS not working in zeppelin

I have a python package with many modules built into an .egg file and I want to use this inside a zeppelin notebook. According to the zeppelin documentation, to pass this package to the zeppelin spark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh.
When I add the .egg through the --files option in SPARK_SUBMIT_OPTIONS, the zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook.
What's the correct way to pass an .egg file to the zeppelin spark interpreter?
Spark version is 1.6.2 and zeppelin version is 0.6.0
The zeppelin-env.sh file contains the following:
export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"
You also need to adjust the PYTHONPATH on the executor nodes:
export SPARK_SUBMIT_OPTIONS="... --conf 'spark.executorEnv.PYTHONPATH=fly_libs-1.1-py2.7.egg:pyspark.zip:py4j-0.10.3-src.zip' ..."
It does not seem to be possible to append to an existing Python path, so make sure you list all the required dependencies.