I am trying to add extra libraries to the Scala used through spark-shell on an Elastic MapReduce instance, but I am unsure how to go about this. Is there a build tool that is used when spark-shell runs?
All I need to do is install a Scala library and have it run through the spark-shell version of Scala. I'm not sure how to go about this since I'm not sure how the EMR instance installs Scala and Spark.
I think that this answer will evolve with the information you give. For now, assuming that you have an AWS EMR cluster deployed on which you wish to use the spark-shell, there are several options:
Option 1: You can copy your libraries to the cluster with the scp command and add them to your spark-shell with the --jars option, e.g.:
From your local machine:
scp -i awskey.pem /path/to/jar/lib.jar hadoop@emr-cluster-address:/path/to/destination
On your EMR cluster:
spark-shell --master yarn --jars lib.jar
Spark uses the following URL schemes to allow different strategies for disseminating jars (a combined example follows the list):
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
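To illustrate (the paths below are just placeholders), a jar that already sits on every node and another one kept on HDFS could be combined in a single --jars list:
spark-shell --master yarn --jars local:/opt/libs/lib.jar,hdfs:///user/hadoop/libs/other-lib.jar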
Option 2: You can copy your libraries from S3 to the cluster and add them with the --jars option.
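For instance (bucket and paths are placeholders), you could pull the jar down with the AWS CLI and then point --jars at the local copy:
aws s3 cp s3://my-bucket/libs/lib.jar /home/hadoop/lib.jar
spark-shell --master yarn --jars /home/hadoop/lib.jar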
Option 3: You can use the --packages option to load libraries from a remote repository. You can include any other dependencies by supplying a comma-delimited list of Maven coordinates; all transitive dependencies will be handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These options can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
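For example (the coordinate and repository URL below are only placeholders; substitute the package you actually need):
spark-shell --packages com.databricks:spark-xml_2.11:0.5.0 --repositories https://repo.example.com/maven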
For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
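A minimal sketch of that (file names are hypothetical):
pyspark --py-files deps.zip,helpers.py
spark-submit --py-files deps.zip,helpers.py my_job.py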
I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly.
I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I also managed to run mrjob successfully from inside the name-node pod.
Now, I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace). I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.
The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:
runners:
  hadoop:
    cmdenv:
      PATH: <pod name>:/opt/hadoop
However, mrjob is still unable to detect the hadoop binary and gives the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
So is there a way in which I can configure mrjob to run with my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.
mrjob is a wrapper around hadoop-streaming, and therefore requires the Hadoop binaries to be installed on the server(s) where the code will run (pods here, I guess), including the Jupyter pod that submits the application.
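If you do install the Hadoop client binaries inside the Jupyter pod, a minimal mrjob.conf pointing mrjob at them could look like the sketch below (the /opt/hadoop path is an assumption; hadoop_bin and hadoop_streaming_jar are standard options of mrjob's hadoop runner):
runners:
  hadoop:
    # assumption: a Hadoop 3.3 client unpacked under /opt/hadoop inside this pod
    hadoop_bin: /opt/hadoop/bin/hadoop
    # optional; mrjob can usually locate the streaming jar once hadoop_bin resolves
    # hadoop_streaming_jar: /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.x.jar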
IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications in k8s than hadoop-streaming since you don't "need" Hadoop in k8s to run such distributed processes.
Beam would be recommended since it is compatible with GCP Dataflow.
I have installed pyspark using pipenv install pyspark, and I type pyspark after activating pipenv shell.
I am able to open the pyspark shell and run some Spark code,
but I am trying to figure out how to enable Hive: where do I need to place hive-site.xml (with the MySQL metastore properties)? I am not able to see any spark/conf folder in which to place hive-site.xml.
Unfortunately, the existing application relies heavily on the Pipfile (so I have to stick with pipenv install pyspark).
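One possible approach (a sketch, assuming a pip/pipenv-installed pyspark, not tested against your exact setup): the pyspark package directory acts as SPARK_HOME, and the launcher also honours SPARK_CONF_DIR, so you can either create a conf folder inside the package and drop hive-site.xml there, or keep the file in a directory of your own and export SPARK_CONF_DIR:
# find where pipenv installed pyspark (this directory acts as SPARK_HOME)
pipenv run python -c "import pyspark; print(pyspark.__path__[0])"
# option A: create a conf folder inside that directory and copy hive-site.xml into it
mkdir -p <pyspark-package-dir>/conf && cp hive-site.xml <pyspark-package-dir>/conf/
# option B: keep hive-site.xml in your own directory and point Spark at it
export SPARK_CONF_DIR=/path/to/my/spark-conf    # directory containing hive-site.xml
pipenv run pyspark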
I am currently working on Zeppelin with Spark and Scala. I want to import the library that contains: import com.databricks.spark.xml.
I tried, but I still get the same error in Zeppelin: <console>:25: error: object databricks is not a member of package com.
What have I done so far? I created a note in Zeppelin with this code: %dep
z.load("com.databricks:spark-xml_2.11:jar:0.5.0"). Even with that, the interpreter doesn't work; it seems it does not manage to load the library.
Do you have any idea why it doesn't work?
Thanks for your help and have a nice day!
Your problem is very common and not intuitive to solve. I resolved an issue similar to this (I wanted to load the Postgres JDBC connector on AWS EMR and I was using a Linux terminal). Your issue can be resolved by checking whether you can:
load the jar file manually to the environment that is hosting Zeppelin.
add the path of the jar file to your CLASSPATH environment variable. I don't know where you're hosting your files that manage your CLASSPATH env, but in EMR, my file, viewed from the Zeppelin root directory, was here: /usr/lib/zeppelin/conf/zeppelin-env.sh
download the zeppelin interpreter with
$ sudo ./bin/install-interpreter.sh --name "" --artifact
add the interpreter in Zeppelin by going to the Zeppelin Interpreter GUI and adding it to the interpreter group.
Reboot Zeppelin with:
$ sudo stop zeppelin
$ sudo start zeppelin
It's very likely that your configurations may vary slightly, but I hope this helps provide some structure and relevance.
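On the %dep side specifically, it may also be worth retrying the load with the plain group:artifact:version coordinate (dropping the jar qualifier), in a fresh note executed before the Spark interpreter has started, for example:
%dep
z.reset()
z.load("com.databricks:spark-xml_2.11:0.5.0")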
I need to use a third-party jar (mysql) in my Scala script. If I use spark-shell, I can specify the jar in the startup command like below:
spark2-shell --driver-class-path mysql-connector-java-5.1.15.jar --jars /opt/cloudera/parcels/SPARK2/lib/spark2/jars/mysql-connector-java-5.1.15.jar
However, how can I do this in a Jupyter notebook? I remember there is a magic way to do it in pyspark, but I am using Scala, and I can't change the environment settings of the kernel I am using.
I have the solution now, and it is very simple indeed, as below:
Use a Toree-based Scala kernel (which is what I am using).
Use the AddJar magic in the notebook and run it; the jar will be downloaded, and voila!
That's it.
%AddJar http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.15/mysql-connector-java-5.1.15.jar
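Once the jar is loaded you can check it from the same notebook by reading a table through the JDBC driver; the connection details below are placeholders, not values from the question:
// hypothetical connection details, only to show the freshly loaded driver in use
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "my_table")
  .option("user", "my_user")
  .option("password", "my_password")
  .load()
df.show(5)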
I am able to run Jupyter with pyspark successfully using https://cloud.google.com/dataproc/tutorials/jupyter-notebook
My question is: if I had to add packages to pyspark (like spark-csv or graphframes) and use them through the notebook, what is the best practice to follow?
I can add the package in a new pyspark job using --packages option, but how do i connect that new pyspark context to the notebook ?
To get the notebook working, you'll really want the notebook setup to pick up the right packages itself. Since the initialization action you linked works to ensure Jupyter will be using the cluster's configured Spark directories and thus pick up all the necessary YARN/filesystem/lib configurations, the best way to do this is to add the property at cluster-creation time instead of job-submission time:
gcloud dataproc clusters create \
--properties spark:spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
Per this StackOverflow error, setting the spark-defaults.conf property spark.jars.packages is the more portable equivalent of specifying the --packages option, since --packages is just syntactic sugar in spark-shell/spark-submit/pyspark wrappers which sets the spark.jars.packages configuration entry anyways.
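For reference, with the property from the command above, the resulting line in spark-defaults.conf (typically /etc/spark/conf/spark-defaults.conf on a Dataproc node) would simply be:
spark.jars.packages com.databricks:spark-csv_2.11:1.2.0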