pyspark pip installation hive-site.xml - pyspark

I installed pyspark with pipenv (pipenv install pyspark) and ran pyspark after activating the environment with 'pipenv shell'.
I am able to open the pyspark shell and run some Spark code.
However, I am trying to figure out how to enable Hive. Where do I need to place hive-site.xml (with the MySQL metastore properties)? I cannot see any spark/config folder in which to place it.
Unfortunately, the existing application relies heavily on the Pipfile, so I have to stick with pipenv install pyspark.
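A minimal sketch of how one might locate the configuration directory for a pip-installed PySpark. It rests on two assumptions that this thread does not confirm: that the pip package directory acts as its own SPARK_HOME, and that Spark picks up hive-site.xml from the directory named by SPARK_CONF_DIR or, failing that, from a conf folder under SPARK_HOME (which may have to be created by hand). The helper names pip_spark_home and conf_dir are hypothetical.

```python
import importlib.util
import os


def pip_spark_home():
    """Return the directory of the installed pyspark package, or None if absent."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


def conf_dir(spark_home, env=None):
    """Assumed lookup order: SPARK_CONF_DIR if set, else <spark_home>/conf."""
    env = os.environ if env is None else env
    return env.get("SPARK_CONF_DIR") or os.path.join(spark_home, "conf")


if __name__ == "__main__":
    home = pip_spark_home()
    if home is None:
        print("pyspark is not installed in this environment")
    else:
        print("candidate location for hive-site.xml:", conf_dir(home))
```

Run inside 'pipenv shell', this prints a candidate directory; whether Hive support then works also depends on the metastore driver jar being available, which the sketch does not check.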

Related

How to install and use pyspark on mac

I'm taking a machine learning course and am trying to install pyspark to complete some of the class assignments. I downloaded pyspark from this link, unzipped it and put it in my home directory, and added the following lines to my .bash_profile.
export SPARK_PATH=~/spark-3.3.0-bin-hadoop2.6
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
However, when I try to run the command:
pyspark
to start a session, I get the error:
-bash: pyspark: command not found
Can someone tell me what I need to do to get pyspark working on my local machine? Thank you.
You are probably missing the PATH entry. Here are the environment variable changes I made to get pyspark working on my Mac:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.6.jdk/Contents/Home/
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
Also make sure you have Java SE 8+ and Python 3.5+ installed.
Start the master by running /opt/spark-3.3.0-bin-hadoop3/sbin/start-master.sh.
Then run pyspark and paste the URL displayed on screen into a web browser.
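To check whether the "command not found" error really comes from a missing PATH entry, a small diagnostic along these lines may help. It is only a sketch: spark_bin_on_path is a hypothetical helper, and it assumes the pyspark launcher lives in the bin folder under SPARK_HOME, as in the exports above.

```python
import os
import shutil


def spark_bin_on_path(spark_home, path=None):
    """Return True if <spark_home>/bin is one of the PATH entries."""
    path = os.environ.get("PATH", "") if path is None else path
    wanted = os.path.normpath(os.path.join(spark_home, "bin"))
    entries = [p for p in path.split(os.pathsep) if p]
    return any(os.path.normpath(p) == wanted for p in entries)


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    print("SPARK_HOME:", home)
    # shutil.which returns None when the shell would say "command not found".
    print("pyspark launcher:", shutil.which("pyspark"))
    if home:
        print("SPARK_HOME/bin on PATH:", spark_bin_on_path(home))
```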

Jupyter for Scala with spylon-kernel without having to install Spark

Based on a web search and strong recommendations, I am trying to run Jupyter locally for Scala (using spylon-kernel).
I was able to create a notebook, but when I try to run a Scala code snippet, I see the message "initializing scala interpreter", and in the console I see this error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
I am not planning to install Spark. Is there a way I can still use Jupyter for Scala without installing Spark?
I am new to Jupyter and the ecosystem. Pardon me for the amateur question.
Thanks

ToreeInstall ERROR | Unknown interpreter PySpark. toree can not install PySpark

When I install PySpark for Jupyter notebook, I use this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get this message:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means and am still looking for a Jupyter PySpark kernel.
As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. Per the release notes, users are asked to "use specific kernels". For PySpark, this means manually installing pyspark for use with Jupyter.
The steps are as follows:
Install pyspark, either with pip install pyspark or by downloading the Apache Spark binary package and extracting it into a folder of your choice.
Add the following three environment variables. How to do this depends on your OS. For example, on macOS, I added the following lines to ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter notebook, simply run "pyspark" from the command line and choose the "Python" kernel.
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.
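If editing ~/.bash_profile is not an option, the same three variables could be set for a single launch from Python. This is a sketch under assumptions: jupyter_pyspark_env and pyspark_launcher are hypothetical helpers, and the launcher script is assumed to sit in the bin folder under SPARK_HOME.

```python
import os


def jupyter_pyspark_env(spark_home, base_env=None):
    """Return a copy of the environment with the three variables from the steps above."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def pyspark_launcher(spark_home):
    """Assumed location of the pyspark launcher script: <spark_home>/bin/pyspark."""
    return os.path.join(spark_home, "bin", "pyspark")
```

With home set to your Spark folder, subprocess.run([pyspark_launcher(home)], env=jupyter_pyspark_env(home)) should then start the notebook server in the same way as running "pyspark" in a shell with the exports in place.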

How to add customized jar in Jupyter Notebook in Scala

I need to use a third-party jar (MySQL) in my Scala script. If I use the Spark shell, I can specify the jar in the start command, like below:
spark2-shell --driver-class-path mysql-connector-java-5.1.15.jar --jars /opt/cloudera/parcels/SPARK2/lib/spark2/jars/mysql-connector-java-5.1.15.jar
However, how can I do this in a Jupyter notebook? I remember there is a magic command for this in pyspark, but I am using Scala, and I can't change the environment settings of the kernel I am using.
I have the solution now, and it is very simple:
Use a Toree-based Scala kernel (which is what I am using).
Use the %AddJar magic in the notebook and run it; the jar will be downloaded, and voila!
That's it.
%AddJar http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.15/mysql-connector-java-5.1.15.jar

adding packages to pyspark using jupyter notebook

I am able to run jupyter with pyspark successfully using https://cloud.google.com/dataproc/tutorials/jupyter-notebook
My question is: if I had to add packages to pyspark (like spark-csv or graphframes) and use them through the notebook, what is the best practice to follow?
I can add the package in a new pyspark job using the --packages option, but how do I connect that new pyspark context to the notebook?
To get the notebook working, you'll really want the notebook setup to pick up the right packages itself. Since the initialization action you linked works to ensure Jupyter will be using the cluster's configured Spark directories and thus pick up all the necessary YARN/filesystem/lib configurations, the best way to do this is to add the property at cluster-creation time instead of job-submission time:
gcloud dataproc clusters create \
--properties spark:spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
Per this StackOverflow answer, setting the spark-defaults.conf property spark.jars.packages is the more portable equivalent of specifying the --packages option, since --packages is just syntactic sugar in the spark-shell/spark-submit/pyspark wrappers that sets the spark.jars.packages configuration entry anyway.
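The equivalence described above can be made concrete with a small sketch that renders the same package list in both forms. The function names are hypothetical, and the sketch assumes that spark-defaults.conf takes whitespace-separated key/value lines and that multiple Maven coordinates are joined with commas.

```python
def packages_as_conf_line(coordinates):
    """Render a spark-defaults.conf line that sets spark.jars.packages."""
    return "spark.jars.packages " + ",".join(coordinates)


def packages_as_cli_args(coordinates):
    """Render the equivalent --packages arguments for spark-submit or pyspark."""
    return ["--packages", ",".join(coordinates)]


if __name__ == "__main__":
    coords = ["com.databricks:spark-csv_2.11:1.2.0"]  # coordinate from the answer above
    print(packages_as_conf_line(coords))
    print(" ".join(packages_as_cli_args(coords)))
```

Both outputs carry the same comma-joined value; only the place where Spark reads it differs.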