Submit a PySpark job to a cluster with the '--py-files' argument - google-cloud-dataproc

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value.
This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip.
e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?

This is a good question. To answer this question, I am going to use the PySpark wordcount example.
In this case, I created two files, one called test.py which is the file I want to execute and another called wordcount.py.zip which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys
if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
    ...
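For reference, a self-contained version of such a wctest function might look like the following; this is a hypothetical sketch along the lines of the standard PySpark wordcount example, not the author's exact file.
from pyspark import SparkContext

def wctest(path):
    # Count words in the text file at `path` and print the totals.
    sc = SparkContext(appName="PythonWordCount")
    counts = (sc.textFile(path)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.collect():
        print(word, count)
    sc.stop()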
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example, <bucket> is the name (or path) of my bucket and <cluster-name> is the name of my Dataproc cluster.

Related

Read local/linux files in Spark Scala code executing in Yarn Cluster Mode

How to access and read local file data in Spark executing in Yarn Cluster Mode.
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
Spark code to read csv:
val test_data = spark.read.option("inferSchema", "true").option("header", "true").csv("/home/test_dir/test_file.csv")
val test_file_data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:///home/test_dir/test_file.csv")
The above sample spark-submit fails with a local file not-found error (/home/test_dir/test_file.csv).
Spark by default checks for the file in hdfs://, but my file is local; it should not be copied into HDFS and should be read only from the local file system.
Any suggestions to resolve this error?
Using the file:// prefix will pull files from the YARN NodeManager's filesystem, not from the system where you submitted the code.
To access your --files use csv("#test_file.csv")
should not be copied into hdfs
Using --files will copy the files into a temporary location that is mounted by the YARN executors, and you can see them in the YARN UI.
Below solution worked for me:
local/linux file: /home/test_dir/test_file.csv
spark-submit --class "" --master yarn --deploy-mode cluster --files /home/test_dir/test_file.csv test.jar
To access the file passed in spark-submit:
import scala.io.Source
val lines = Source.fromFile("test_file.csv").getLines.toList
Instead of specifying the complete path, specify only the name of the file you want to read. Since Spark already distributes a copy of the file to every node, we can access the file's data with just the file name.
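For reference, the PySpark analogue of this driver-side pattern might look as follows; a hedged sketch, with SparkFiles.get used to resolve the local path of the file shipped via --files.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("readSubmittedFile").getOrCreate()

# A file shipped with --files is localized into the container's working
# directory; SparkFiles.get resolves the driver-local copy of it.
local_path = SparkFiles.get("test_file.csv")
with open(local_path) as f:
    lines = f.read().splitlines()
print(lines[:5])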

sequence files from sqoop import

I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile(), but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass,
but it did not work. I am using pyspark for reading the files.
In Python it's not working; however, in Scala it works.
You need to do the following steps:
step1:
If you are importing as a sequence file from sqoop, a jar file is generated, and you need to use it as the valueClass while reading the sequence file. This jar file is generally placed in the /tmp folder, but you can redirect it to a specific folder (i.e., a local folder, not HDFS) using the --bindir option.
example:
sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export \
  --username retail_user --password itversity --table customers -m 1 \
  --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' \
  --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2:
Also, you need to download the sqoop jar file from the link below:
http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3:
Suppose the customers table has been imported using sqoop as a sequence file.
Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar
step4: Now run the commands below inside the spark-shell:
scala> import org.apache.hadoop.io.LongWritable
scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")
scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
You can use SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (aka keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4
$ pyspark --jars seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()
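For completeness, sc.sequenceFile can also be called directly from PySpark, but only when the key and value are standard Hadoop Writables; the sqoop --as-sequencefile output fails there because its value class is the sqoop-generated record class, which PySpark cannot deserialize without the Scala route above. A hedged sketch for the standard-Writable case, assuming an active SparkContext sc and a placeholder path:
# Works only when the sequence file uses standard Hadoop Writables.
rdd = sc.sequenceFile(
    "hdfs:///path/to/seqfile",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
)
print(rdd.take(5))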

Is there a spark-defaults.conf when installed with pip install pyspark

I installed pyspark with pip.
I code in Jupyter notebooks. Everything works fine, but now I get a Java heap space error when exporting a large .csv file.
Here someone suggested editing spark-defaults.conf. Also, the Spark documentation says:
"Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
Am I right? How do I solve this?
Thanks!
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
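As an illustration of what such a file could contain for the heap-space error above (the value is just an example; adjust it to your machine):
# spark-defaults.conf (example)
spark.driver.memory    4g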
The spark-defaults.conf file should be located in:
$SPARK_HOME/conf
If no file is present, create one (a template should be available in the same directory).
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf but just a template file was present!
Still I could look at Spark properties, either in the “Environment” tab of the Web UI http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed, but only values explicitly specified through spark-defaults.conf, SparkConf, or the command line. For all other configuration properties, you can assume the default value is used.
For instance, consider the default parallelism, which in my case is:
spark._sc.defaultParallelism
8
This is the default for local mode, namely the number of cores on the local machine--see https://spark.apache.org/docs/latest/configuration.html. In my case 8 = 2 x 4 cores because of hyper-threading.
If the property spark.default.parallelism is passed when launching the app
spark = SparkSession \
    .builder \
    .appName("Set parallelism") \
    .config("spark.default.parallelism", 4) \
    .getOrCreate()
then the property is shown in the Web UI and in the list returned by
spark.sparkContext.getConf().getAll()
Precedence of configuration settings
Spark will consider given properties in this order (spark-defaults.conf comes last):
SparkConf
flags passed to spark-submit
spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Note
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
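For example, a quick way to check this from inside the notebook:
import os

# Shows the spark-submit flags the kernel was started with, if any
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))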
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark
The spark-defaults.conf file is needed when we have to change any of the default configs for Spark.
As @niuer suggested, it should be present in the $SPARK_HOME/conf/ directory. But that might not be the case for you. By default, a template config file will be present there. You can just add a new spark-defaults.conf file in $SPARK_HOME/conf/.
Check your spark path. There are configuration files under:
$SPARK_HOME/conf/, e.g.
spark-defaults.conf.
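With a pip install, $SPARK_HOME may not be set in your shell; in that case the directory under which a conf/ folder can be created can be found from the package itself (a quick sketch):
import os
import pyspark

# The pip-installed package directory acts as the Spark home
print(os.path.join(os.path.dirname(pyspark.__file__), "conf"))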

AWS EMR import pyfile from S3

I'm struggling to understand how to import files as libraries with pyspark.
Let's say that I have the following
HappyBirthday.py
def run():
print('Happy Birthday!')
sparky.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import HappyBirthday
sc = SparkContext(appName="kmeans")
HappyBirthday.run()
sc.stop()
And both of them are stored in the same folder in S3.
How do I make sure that, when I use
spark-submit --deploy-mode cluster s3://<PATH TO FILE>/sparky.py
HappyBirthday.py is also imported?
If you are trying to run sparky.py and use a function inside HappyBirthday.py, you can try something like this.
spark-submit \
--deploy-mode cluster --master yarn \
--py-files s3://<PATH TO FILE>/HappyBirthday.py \
s3://<PATH TO FILE>/sparky.py
Just remember that s3 does not have the concept of "folders", so you just need to provide the exact path of the files or the group of files.
In case you have a whole bunch of dependencies in your project, you can bundle them all up into a single .zip file with the necessary __init__.py files, and you can import any of the functions inside the libraries.
For example - I have the sqlparse library as a dependency, with a bunch of Python files inside it. I have a package zip file, like below.
unzip -l packages.zip
Archive: packages.zip
0 05-05-2019 12:44 sqlparse/
2249 05-05-2019 12:44 sqlparse/__init__.py
5916 05-05-2019 12:44 sqlparse/cli.py
...
110 05-05-2019 12:44 sqlparse-0.3.0.dist-info/WHEEL
--------- -------
125034 38 files
This is uploaded to S3 and then used in the job.
spark-submit --deploy-mode cluster --master yarn --py-files s3://my0-test-bucket/artifacts/packages.zip s3://my-test-script/script/script.py
My file can contain imports like below.
import pyspark
import sqlparse # Importing the library
from pprint import pprint
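For reference, one way such a packages.zip could be assembled with Python's zipfile module; a hedged sketch, where the source directory is a placeholder for wherever the dependency is installed.
import os
import zipfile

# Placeholder path to the installed dependency you want to bundle
src_dir = "venv/lib/python3.7/site-packages/sqlparse"

with zipfile.ZipFile("packages.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # Keep the top-level package name so `import sqlparse` works from the zip
            arcname = os.path.relpath(full_path, os.path.dirname(src_dir))
            zf.write(full_path, arcname)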
What you want to use here is the --py-files argument for spark-submit. From the submitting applications page in the Spark documentation:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
For your example, this would be:
spark-submit --deploy-mode cluster --py-files s3://<PATH TO FILE>/HappyBirthday.py s3://<PATH TO FILE>/sparky.py

Passing parameters into dataproc pyspark job

How do you pass parameters into the python script being called in a dataproc pyspark job submit? Here is a cmd I've been mucking with:
gcloud dataproc jobs submit pyspark --cluster my-dataproc \
file:///usr/test-pyspark.py \
--properties=^:^p1="7day":p2="2017-10-01"
This is the output returned:
Job [vvvvvvv-vvvv-vvvv-vvvv-0vvvvvv] submitted. Waiting for job output...
Warning: Ignoring non-spark config property: p2=2017-10-01
Warning: Ignoring non-spark config property: p1=7day
Found script=/usr/test-pyspark.py
Traceback (most recent call last):
File "/usr/test-pyspark.py", line 52, in <module>
print(sys.argv[1])
IndexError: list index out of range
Clearly doesn't recognize the 2 params I'm trying to pass in. I also tried:
me#my-dataproc-m:~$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc test-pyspark.py 7day 2017-11-01
But that returned with:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) unrecognized arguments:
7day
2017-11-01
The pattern I use to pass params with the hive jobs doesn't work for pyspark.
Any help appreciated!
Thanks,
Melissa
The second form is close; use '--' to separate arguments to your job from arguments to gcloud:
$ gcloud dataproc jobs submit pyspark --cluster=my-dataproc \
test-pyspark.py -- 7day 2017-11-01
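Inside test-pyspark.py, the values after -- then show up as plain positional arguments; a minimal sketch:
import sys

if __name__ == "__main__":
    p1 = sys.argv[1]  # "7day"
    p2 = sys.argv[2]  # "2017-11-01"
    print(p1, p2)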
This worked for me for passing multiple arguments -
gcloud dataproc jobs submit pyspark --cluster <cluster_name> --region europe-west1 \
  --properties spark.master=yarn \
  --properties spark.submit.deployMode=client \
  --properties spark.sql.adaptive.enabled=true \
  --properties spark.executor.memoryOverhead=8192 \
  --properties spark.driver.memoryOverhead=4096 \
  <name-of-the-script>.py -- --arg1=value1 --arg2=value2
or simply saying -
gcloud dataproc jobs submit pyspark --cluster <cluster_name> <name-of-the-script>.py -- --arg1=value1 --arg2=value2
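With the flag-style form (--arg1=value1 --arg2=value2), the script would typically parse the job arguments with argparse; a minimal sketch, where the flag names just mirror the command above:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--arg1")
parser.add_argument("--arg2")
args = parser.parse_args()
print(args.arg1, args.arg2)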
All the Best !