run pyspark locally with pycharm - pyspark

I wrote the following very simple Python script in my PyCharm IDE:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql.types import LongType, FloatType,IntegerType,StringType,DoubleType
from pyspark.sql.functions import udf
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import abs
from pyspark.sql import HiveContext
spark = SparkSession.builder.config("requiredconfig").appName("SparkSran").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
When I click Run in my IDE or run the following command:
spark-submit --py-files /home/user/PycharmProjects/helloSparkWorld/test.py
I get
/usr/lib/spark/bin/spark-class: line 71: /usr/local/java/jdk10.0.1/bin/java: No such file or directory
My JAVA_HOME and SPARK_HOME are set as follows:
echo $SPARK_HOME gives /usr/lib/spark
and
echo $JAVA_HOME gives
/usr/local/java/jdk10.0.1

You can just do a pip install pyspark in the environment that your PyCharm installation uses to run Python programs. If you're running locally, you can then run your PySpark .py files with python filename.py itself.
Basically, just provide your pip or Python interpreter with the pyspark pip package and you'll be able to run it from PyCharm using the same interpreter.
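As a minimal sketch (assuming pyspark has been pip-installed into the interpreter PyCharm uses, and that JAVA_HOME points at a JDK that actually exists), a purely local run can look like this; the master and app name below are illustrative, not from the original post:
from pyspark.sql import SparkSession

# build a purely local SparkSession; no cluster or Hive setup is required
spark = (SparkSession.builder
         .master("local[*]")
         .appName("localTest")
         .getOrCreate())

# quick smoke test to confirm the installation works
spark.range(5).show()
spark.stop()
Running python test.py with that interpreter behaves the same as clicking Run in PyCharm, since both use the same pyspark package.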

Related

object apache is not a member of package org - jupyter notebook

I am trying to launch a Jupyter notebook with Scala. For that I used almond, but I am running into a problem when trying to import:
import org.apache.spark._
the error is:
object apache is not a member of package org

How to import Delta Lake module in Zeppelin notebook and pyspark?

I am trying to use Delta Lake in a Zeppelin notebook with pyspark, and it seems it cannot import the module successfully, e.g.
%pyspark
from delta.tables import *
It fails with the following error:
ModuleNotFoundError: No module named 'delta'
However, there is no problem saving/reading the data frame using the delta format, and the module can be loaded successfully when using Scala Spark (%spark).
Is there any way to use Delta Lake in Zeppelin and pyspark?
Finally managed to load it in Zeppelin pyspark. You have to explicitly include the jar file:
%pyspark
sc.addPyFile("**LOCATION_OF_DELTA_LAKE_JAR_FILE**")
from delta.tables import *
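As a hedged follow-up sketch (the jar location and the table path below are placeholders, not values from the post above), once the import succeeds the Delta Lake API is usable from the same %pyspark paragraph:
%pyspark
sc.addPyFile("/path/to/delta-core_2.11-x.y.z.jar")  # placeholder jar path, not the real location
from delta.tables import DeltaTable

# write a small DataFrame in delta format, then read it back through DeltaTable
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
DeltaTable.forPath(spark, "/tmp/delta-demo").toDF().show()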

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a Scala Jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
it gives the error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
From what I can tell, this means that it runs the bash command but then still cannot find the retrieved package.
I am doing this on an EMR Notebook running the Spark Scala kernel.
Do I have to set some sort of Spark library path in the Jupyter environment?
That simply shouldn't work. What your code does is attempt to start a new, independent Spark shell. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or use the equivalent in your SparkConf / SparkSession.Builder.config before the SparkSession is initialized, as sketched below.
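For illustration only, here is a PySpark-flavoured sketch of the second option (the Scala SparkSession builder takes the same spark.jars.packages key; the app name is made up, the package coordinates are the ones from the question):
from pyspark.sql import SparkSession

# the package has to be declared before the first SparkSession/SparkContext is created
spark = (SparkSession.builder
         .appName("graphframes-demo")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
         .getOrCreate())

from graphframes import GraphFrame  # resolves only because the package was pulled in at session startup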

How to import a Play Framework jar in spark-shell on Windows?

I am using a Windows machine and installed Spark and Scala for my learning. For spark-sql I need to process JSON input data.
scala> sc
res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7431f4b8
scala> import play.api.libs.json._
<console>:23: error: not found: value play
import play.api.libs.json._
^
scala>
How can I import the Play API in my spark-shell command?
If you want to use other libraries while you are using spark-shell, you need to run the spark-shell command with --jars and/or --packages. For example, to use Play in your spark-shell, run the following command:
spark-shell --packages "com.typesafe.play":"play_2.11":"2.6.19"
For more information, you can use spark-shell -h. I hope it helps!

How to integrate Jupyter notebook scala kernel with apache spark?

I have installed the Scala kernel based on this doc: https://github.com/jupyter-scala/jupyter-scala
Kernel is there:
$ jupyter kernelspec list
Available kernels:
python3 /usr/local/homebrew/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/resources
scala /Users/bobyfarell/Library/Jupyter/kernels/scala
When I try to use Spark in the notebook I get this:
val sparkHome = "/opt/spark-2.3.0-bin-hadoop2.7"
val scalaVersion = scala.util.Properties.versionNumberString
import org.apache.spark.ml.Pipeline
Compilation Failed
Main.scala:57: object apache is not a member of package org
; import org.apache.spark.ml.Pipeline
^
I tried:
Setting SPARK_HOME and CLASSPATH to the location of $SPARK_HOME/jars
Setting -cp option pointing to $SPARK_HOME/jars in kernel.json
Setting classpath.add call before imports
None of these helped. Please note I don't want to use Toree; I want to use a standalone Spark and Scala kernel with Jupyter. A similar issue is reported here too: https://github.com/jupyter-scala/jupyter-scala/issues/63
It doesn't look like you are following the jupyter-scala directions for using Spark. You have to load Spark into the kernel using the special imports.