I am running a PySpark application using Dataproc Serverless for Spark, and my config file looks like this:
spark = (
    pyspark.sql.SparkSession.builder.appName("app_name")
    .config("spark.logConf", "true")
    .config("spark.sql.broadcastTimeout", broadcast_timeout)
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.ui.showConsoleProgress", progress_bar)
    .getOrCreate()
)
But the appName used is not reflected in the Dataproc batch job console:
In Dataproc -> Batches -> Clicking on Job Id -> Details tab -> Properties: spark:spark.app.name gives me a random ID.
The Dataproc UI reflects the properties set during batch submission; it does not reflect properties that are set in Spark application code. The spark.app.name value that you see is the default value for this property, which your Spark app then overrides at runtime.
If you want the name to show up in the UI, set this property when submitting the batch job:
gcloud dataproc batches submit \
. . . \
--properties=spark.app.name="<MY_CUSTOM_APP_NAME>"
There are Scala Spark jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep them for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, inside the job details page, and are not saved to a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to send these logs in the log4j properties file, for example by giving a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar \
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine whether a job has finished is gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
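As a rough sketch of that approach, the snippet below interprets the JSON printed by gcloud dataproc jobs describe <job-id> --format=json. It assumes the job state sits under status.state and the driver log location under driverOutputResourceUri, as in the Dataproc Job resource. The helper name summarize_job and the sample payload are made up for illustration, and the set of terminal states is an assumption you may want to widen.

```python
import json

# States treated as "the job is over" (assumption; adjust to your needs).
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def summarize_job(describe_json: str) -> dict:
    """Pull the state and driver output URI out of `jobs describe` JSON output."""
    job = json.loads(describe_json)
    state = job.get("status", {}).get("state", "UNKNOWN")
    return {
        "state": state,
        "finished": state in TERMINAL_STATES,
        "succeeded": state == "DONE",
        "driver_output": job.get("driverOutputResourceUri"),
    }


# Hypothetical payload, trimmed to the fields used above.
sample = (
    '{"status": {"state": "DONE"}, '
    '"driverOutputResourceUri": "gs://example-bucket/driveroutput"}'
)
summary = summarize_job(sample)
print(summary["finished"], summary["succeeded"], summary["driver_output"])
```

A notification script could run the describe command on a schedule, feed its output to a helper like this, and send the message once finished is true, instead of grepping the driver log for a completion line.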
I know this question has a duplicate, but my use case is a little specific. I want to run my Spark job (compiled to a .jar) on EMR (via spark-submit) and pass 2 options like this:
spark-submit --master yarn --deploy-mode cluster <rest of command>
To achieve this, I wrote the code like this:
val sc = new SparkContext(new SparkConf())
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
However, this gives the following error while building the jar:
org.apache.spark.SparkException: A master URL must be set in your configuration
So what's a workaround? How do I set up these 2 variables in code so that the master and deploy mode options are picked up at submit time, while I can still use the variables sc and spark in my code (e.g. val x = spark.read())?
You can simply access the command-line arguments as below and pass as many values as you want.
val spark = SparkSession.builder().appName("Test App")
.master(args(0))
.getOrCreate()
spark-submit --master yarn --deploy-mode cluster master-url
If you need a fancier command-line parser, take a look at https://github.com/scopt/scopt
My existing project is kafka-spark-cassandra. I now have a GCP account and have to migrate the Spark jobs to Dataproc. In my existing Spark jobs, parameters like master IP, memory, cores, etc. are passed on the command line by a Linux shell script that triggers the job, and they are used to create a new SparkConf.
val conf = new SparkConf(true)
.setMaster(master)
.setAppName("xxxx")
.setJars(List(path+"/xxxx.jar"))
.set("spark.executor.memory", memory)
.set("spark.cores.max",cores)
.set("spark.scheduler.mode", "FAIR")
.set("spark.cassandra.connection.host", cassandra_ip)
1) How can this be configured in Dataproc?
2) Will there be any compatibility issues between Spark 1.3 (existing project) and Spark 1.6 provided by Dataproc? How can they be resolved?
3) Is any other connector needed for Dataproc to connect to Kafka and Cassandra? I couldn't find any.
1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying master directly in code, instead letting it come from the spark.master value inside of spark-defaults.conf, and then your local setup would have that config set to local while Dataproc would automatically have it set to yarn-client with the necessary yarn config settings alongside it.
Likewise, keys like spark.executor.memory, etc., should be set through Spark's first-class command-line options if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flags for jars instead of specifying it in code.
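To make the mapping concrete, here is a minimal sketch that takes the settings previously hard-coded in SparkConf and renders them in both forms shown above. The helper names are hypothetical, and it assumes no property value contains a comma or needs special escaping.

```python
import shlex


def to_spark_submit_conf(props: dict) -> list:
    """Render each property as a separate `--conf key=value` pair."""
    args = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


def to_gcloud_properties(props: dict) -> str:
    """Render all properties as one comma-separated value for `--properties`."""
    return ",".join(f"{key}={value}" for key, value in props.items())


# The same settings that used to be set on SparkConf in code.
props = {"spark.executor.memory": "42G", "spark.scheduler.mode": "FAIR"}

print(shlex.join(["spark-submit"] + to_spark_submit_conf(props)))
print(shlex.join(["gcloud", "dataproc", "jobs", "submit", "spark",
                  "--properties", to_gcloud_properties(props)]))
```

Keeping the settings in one dictionary outside the application code lets the same jar run locally and on Dataproc without edits, which is the point of not calling setMaster or set(...) in code.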
2) When building your project to deploy, ensure you exclude spark (e.g., in maven, mark spark as provided). You may hit compatibility issues, but without knowing all APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general Spark core is considered GA and should thus be mostly backwards compatible in 1.X versions, but the compatibility guidelines didn't apply yet to subprojects like mllib and SparkSQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified as --jars, or installed onto the cluster at creation via initialization actions.
I run several batch jobs and I would like to reference the jobId from dataproc to the saved output files.
That would allow all logs for arguments and output to be associated with the results. A downside remains: once the executors in YARN have gone away, the logs for an individual executor can no longer be obtained.
Google Dataproc passes its context into Spark jobs by using tags. Therefore the relevant information is present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like the following:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be assigned to our export folder and output is saved with Dataproc reference:
df.select("*").write \
    .format('com.databricks.spark.csv').options(header='true') \
    .save(export_folder)
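How export_folder gets built from the tag is not shown above; the following is one possible sketch. It assumes the tags value may be a comma-separated list, that the Dataproc tag starts with dataproc_job_, and that the text after that prefix is the job ID. The function names and the bucket path are hypothetical.

```python
import posixpath

JOB_TAG_PREFIX = "dataproc_job_"


def job_id_from_tags(tags: str, default: str = "unknown") -> str:
    """Return the part after 'dataproc_job_' from a comma-separated tag string."""
    for tag in tags.split(","):
        tag = tag.strip()
        if tag.startswith(JOB_TAG_PREFIX):
            return tag[len(JOB_TAG_PREFIX):]
    return default


def export_folder_for(base_uri: str, tags: str) -> str:
    """Append the job ID (or the default) to the base output location."""
    return posixpath.join(base_uri, job_id_from_tags(tags))


# In a real job the tags would come from
# pyspark.SparkConf().get("spark.yarn.tags", "unknown").
tags = "dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d"
export_folder = export_folder_for("gs://example-bucket/exports", tags)
print(export_folder)
```

When the tag is missing, for example when the script runs outside Dataproc, the output lands in an "unknown" subfolder rather than failing.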
Here is the file that I submit as a PySpark job in Dataproc, through the UI:
# Load file data from Google Cloud Storage to Dataproc cluster, creating an RDD
# Because Spark transforms are 'lazy', we do a 'count()' action to make sure
# we successfully loaded the main data file
allFlt = sc.textFile("gs://mybucket/mydatafile")
allFlt.count()
# Remove header from file so we can work with data only
header = allFlt.take(1)[0]
dataOnly = allFlt.filter(lambda line: line != header)
It starts and then errors out with
allFlt = sc.textFile("gs://thomtect/flightinfo")
NameError: name 'sc' is not defined
Why is this? Shouldn't a Spark context have already been established by Dataproc? What do I need to add to my code so that it is accepted as Spark commands?
https://cloud.google.com/dataproc/submit-job has an example python spark job submission.
The short answer is to add the following to the top of your script:
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
And to expand a bit on why this is required: when Dataproc runs python scripts, it uses spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html) instead of running the pyspark shell.