I am able to run Jupyter with PySpark successfully using https://cloud.google.com/dataproc/tutorials/jupyter-notebook
My question is: if I had to add packages to PySpark (like spark-csv or graphframes) and use them through the notebook, what is the best practice to follow?
I can add a package to a new PySpark job using the --packages option, but how do I connect that new PySpark context to the notebook?
To get the notebook working, you'll want the notebook setup itself to pick up the right packages. Since the initialization action you linked ensures Jupyter uses the cluster's configured Spark directories (and thus picks up all the necessary YARN/filesystem/lib configurations), the best way to do this is to add the property at cluster-creation time instead of job-submission time:
gcloud dataproc clusters create <cluster-name> \
--properties spark:spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
Per this StackOverflow answer, setting the spark-defaults.conf property spark.jars.packages is the more portable equivalent of specifying the --packages option, since --packages is just syntactic sugar in the spark-shell/spark-submit/pyspark wrappers, which sets the spark.jars.packages configuration entry anyway.
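Once the cluster is up with that property, the package can be used directly from a notebook cell. A minimal sketch, assuming the notebook setup already provides a sqlContext and using a placeholder GCS path:
# Assumes `sqlContext` is provided by the notebook setup and the cluster was
# created with the spark.jars.packages property above; the gs:// path is a placeholder.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("gs://your-bucket/path/to/data.csv"))
df.show(5)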
Related
I have installed PySpark using pipenv (pipenv install pyspark) and type pyspark after activating the pipenv shell.
I am able to open the PySpark terminal and run some Spark code.
But I am trying to figure out how to enable Hive: where do I need to place hive-site.xml (with the MySQL metastore properties)? I can't see any spark/conf folder in which to place hive-site.xml.
Unfortunately, the existing application relies heavily on the Pipfile (so I have to stick with pipenv install pyspark).
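One way to find where a pip-installed PySpark looks for its configuration is to start from the package location itself. A minimal sketch (the exact path depends on your virtualenv, and it assumes you create the conf directory if it does not exist):
# The PySpark package dir acts as SPARK_HOME for pip installs; Spark picks up
# hive-site.xml from $SPARK_CONF_DIR if set, otherwise from $SPARK_HOME/conf.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
conf_dir = os.environ.get("SPARK_CONF_DIR", os.path.join(spark_home, "conf"))
print(conf_dir)  # place hive-site.xml here (create the folder if it is missing)

# Then build a Hive-enabled session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()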
I need to use a third-party jar (mysql) in my Scala script. If I use spark-shell, I can specify the jar in the startup command like below:
spark2-shell --driver-class-path mysql-connector-java-5.1.15.jar --jars /opt/cloudera/parcels/SPARK2/lib/spark2/jars/mysql-connector-java-5.1.15.jar
However, how can I do this in a Jupyter notebook? I remember there is a magic way to do it in pyspark, but I am using Scala, and I can't change the environment settings of the kernel I am using.
I have the solution now, and it is very simple indeed:
Use a Toree-based Scala kernel (which is what I am using).
Use the %AddJar magic in the notebook and run it; the jar will be downloaded and voila!
That's it.
%AddJar http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.15/mysql-connector-java-5.1.15.jar
I want to share UDFs I created in Scala with another cluster that our data scientists use with pyspark and Jupyter on EMR.
Is this possible? How?
This answer indeed helps:
Create an uber jar, put it in S3, and in a bootstrap action copy it from S3 to Spark's local jars folder; it should work.
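Once the jar is on the cluster's Spark classpath, one way for the PySpark users to call the Scala UDFs is to register them from Python. A minimal sketch, assuming Spark 2.3+ and that the uber jar contains a class implementing org.apache.spark.sql.api.java.UDF1 (com.example.MyUpper is a hypothetical name):
# `spark` is the notebook's SparkSession; "com.example.MyUpper" is a
# hypothetical UDF class shipped in the uber jar.
from pyspark.sql.types import StringType

spark.udf.registerJavaFunction("my_upper", "com.example.MyUpper", StringType())
spark.sql("SELECT my_upper('hello')").show()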
What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use Python's subprocess to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
With your virtualenv activated, check the interpreter path in Python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to output of `sys.executable`
I am trying to add extra libraries to the Scala used through spark-shell on an Elastic MapReduce instance, but I am unsure how to go about this. Is there a build tool that is used when spark-shell runs?
All I need to do is install a Scala library and have it run through the spark-shell version of Scala. I'm not sure how to go about this since I'm not sure how the EMR instance installs Scala and Spark.
I think that this answer will evolve with the information you give. For now, assuming you have an AWS EMR cluster deployed on which you wish to use spark-shell, there are several options:
Option 1: You can copy your libraries to the cluster with the scp command and add them to your spark-shell with the --jars option, e.g.:
From your local machine:
scp -i awskey.pem /path/to/jar/lib.jar hadoop@emr-cluster-address:/path/to/destination
On your EMR cluster:
spark-shell --master yarn --jars lib.jar
Spark uses the following URL schemes to allow different strategies for disseminating jars (see the example after this list):
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
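As a sketch of how those schemes look on the command line (the paths are placeholders):
spark-shell --jars local:/opt/libs/lib.jar,hdfs:///user/hadoop/lib2.jar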
Option 2: You can copy your libraries from S3 to the cluster and add them with the --jars option, e.g.:
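A minimal sketch, run on the EMR master node (the bucket name and paths are placeholders):
aws s3 cp s3://your-bucket/lib.jar /home/hadoop/lib.jar
spark-shell --master yarn --jars /home/hadoop/lib.jar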
Option 3: You can use the --packages option to load a library from a remote repository (see the example below). You can include any other dependencies by supplying a comma-delimited list of Maven coordinates. All transitive dependencies will be handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
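For instance, Option 3 comes down to a single flag; the coordinates below just reuse the spark-csv package from the first answer above as an illustration:
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0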