Google Cloud Dataproc - job file erroring on sc.textFile() command - google-cloud-dataproc

Here is my file that I submit as a PySpark job in Dataproc, through the UI:
# Load file data from Google Cloud Storage to Dataproc cluster, creating an RDD
# Because Spark transforms are 'lazy', we do a 'count()' action to make sure
# we successfully loaded the main data file
allFlt = sc.textFile("gs://mybucket/mydatafile")
allFlt.count()
# Remove header from file so we can work with data only
header = allFlt.take(1)[0]
dataOnly = allFlt.filter(lambda line: line != header)
It starts and then errors out with
allFlt = sc.textFile("gs://thomtect/flightinfo")
NameError: name 'sc' is not defined
Why is this? Shouldn't a Spark context have already been established by Dataproc? What do I need to add to my code so that it is accepted as Spark commands?

https://cloud.google.com/dataproc/submit-job has an example python spark job submission.
The short answer is to add the following to the top of your script:
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
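And to expand a bit on why this is required: when Dataproc runs Python scripts, it uses spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html) instead of running the pyspark shell. The pyspark shell creates sc for you on startup; a script run through spark-submit has to create its own.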
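Putting the fix together with the script from the question could look roughly like the sketch below. It is an illustration only: drop_header and main are names made up here, SparkContext.getOrCreate() is used instead of the bare constructor so that an existing context is reused, and the bucket path is the placeholder from the question.

```python
def drop_header(rdd):
    """Return the RDD without its first line, assumed to be the header."""
    header = rdd.first()
    return rdd.filter(lambda line: line != header)


def main():
    # Imported inside main() so the helper above can be tried without Spark.
    from pyspark import SparkContext

    # spark-submit does not predefine `sc`, so create (or reuse) one.
    sc = SparkContext.getOrCreate()

    all_flt = sc.textFile("gs://mybucket/mydatafile")  # placeholder path
    data_only = drop_header(all_flt)
    print(data_only.count())  # an action, which forces the lazy load to run

    sc.stop()

# When submitting this file as a Dataproc job, finish the script with: main()
```

Like the filter in the question, drop_header removes every line equal to the header, not just the first one.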

Related

Finding the location of my spark job output file

I am testing pyspark jobs in an EMR cluster on AWS. The goal is to use a Lambda function to fire the spark job, but for now I am manually running the spark job. So, I SSH to the master node and then run the spark job as below:
spark-submit /home/hadoop/testspark.py mybucket
mybucket - parameter passed to the spark job.
The line that saves the RDD is
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")
The spark job seems to run but it puts the output file in some location - Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt.
Where is this exactly located and how can I view the contents? Forgive my ignorance on HDFS and Hadoop.
Eventually, I want to rename output.txt to something meaningful and then transfer to S3, just haven't gotten there yet.
If I re-run the spark job it says "Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt already exists". How do I prevent this or at least overwrite the file?
Thanks
Based on the EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
If you do not specify a URI prefix, Spark writes data to HDFS by default. You can check the EMR HDFS with this command:
hadoop fs -ls /home/hadoop/
You can also transfer from HDFS to S3 with S3DistCp:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Unfortunately you cannot overwrite the existing file using saveAsTextFile:
https://spark-project.atlassian.net/browse/SPARK-1100
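As far as I can see, you repartitioned the data into one partition, so you can write it to the local file system as well:
with open("/home/hadoop/output.txt", "w") as f:
    f.write("\n".join(str(row) for row in rddFiltered.collect()))
Note that on a distributed cluster you have to collect() back to the driver first. collect() returns a plain Python list, which has no saveAsTextFile method, so the rows are written with ordinary file I/O. This only works when the result fits in driver memory.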
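Because saveAsTextFile refuses to write into an existing directory, one common workaround for the re-run error is to delete the old output before the job writes again. The sketch below shells out to hadoop fs -rm -r -f. The helper names are invented for illustration, and it assumes the hadoop CLI is on the PATH of the node running the driver.

```python
import subprocess


def hdfs_rm_command(path):
    """Build the command that recursively deletes an HDFS path.

    -r removes directories recursively; -f suppresses the error if the
    path does not exist yet (for example on the first run).
    """
    return ["hadoop", "fs", "-rm", "-r", "-f", path]


def clear_output_dir(path):
    """Delete a previous output directory so saveAsTextFile can recreate it."""
    subprocess.run(hdfs_rm_command(path), check=True)
```

Calling clear_output_dir("/home/hadoop/output.txt") before the saveAsTextFile line should let the job be re-run. Keep in mind that output.txt is a directory containing part files, not a single file.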

Use Spark fileoutputcommitter.algorithm.version=2 with AWS Glue

I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Is it possible to use this configuration with AWS Glue?
Option 1 :
Glue uses a Spark context, so you can set Hadoop configuration in AWS Glue as well, since internally a DynamicFrame is a kind of DataFrame.
sc._jsc.hadoopConfiguration().set("mykey","myvalue")
I think you need to add the corresponding class as well, like this:
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
example snippet :
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")
glueContext = GlueContext(sc)
spark = glueContext.spark_session
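To check that the configuration is set:
Debug in Python:
print(sc._conf.getAll())  # Spark properties only
print(sc._jsc.hadoopConfiguration().get("mapreduce.fileoutputcommitter.algorithm.version"))  # value set through hadoopConfiguration()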
Debug in scala :
sc.getConf.getAll.foreach(println)
Option 2:
Alternatively, try using the Glue job parameters:
https://docs.aws.amazon.com/glue/latest/dg/add-job.html
These are key/value properties, as mentioned in the docs:
'--myKey' : 'value-for-myKey'
When editing the job, specify the parameter with the key --conf.
Option 3:
If you are using the AWS CLI, you can try the job arguments described here:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
Oddly, the docs say not to set --conf, though I don't know why it is exposed in that case.
To sum up: I personally prefer option 1, since it gives you programmatic control.
Go to the Glue job console and edit your job as follows:
Glue > Jobs > Edit your job > Script libraries and job parameters (optional) > Job parameters
Set the following:
key: --conf
value: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
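If the job is started from code rather than from the console, the same key/value pair can be passed as a job argument. The following is only a sketch. It assumes boto3 is installed and configured with credentials, and the helper names are invented for this example. As noted above, the Glue docs advise against setting --conf, so treat this as something to test rather than a supported recipe.

```python
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def committer_arguments(version=2):
    """Build the Glue job arguments that pass the committer setting via --conf."""
    return {"--conf": f"{COMMITTER_KEY}={version}"}


def start_job(job_name, version=2):
    """Start a Glue job run with the committer setting applied."""
    import boto3  # imported lazily; only needed when actually calling AWS

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=committer_arguments(version),
    )
    return response["JobRunId"]
```

A call such as start_job("my-glue-job"), where my-glue-job is a placeholder name, would then return the ID of the new run.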

How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x

I had some problems running an s3-dist-cp command in my pyspark script, as I needed to move some data from S3 to HDFS for better performance, so I am sharing what worked here.
import os
os.system("/usr/bin/s3-dist-cp --src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ --dest=/de_pulse/ --groupBy='.*(additional).*' --targetSize=64 --outputCodec=none")
Note: make sure that you give the full path of s3-dist-cp (/usr/bin/s3-dist-cp).
I think we can also use subprocess.
If you're running a pyspark application, you'll have to stop the spark application first. The s3-dist-cp will hang because the pyspark application is blocking.
spark.stop() # spark context
os.system("/usr/bin/s3-dist-cp ...")
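The subprocess idea mentioned above could be sketched as follows. Unlike os.system, subprocess.run with check=True raises an exception when the command exits with a non-zero status, so a failed copy does not go unnoticed. The function names and the two optional parameters are choices made for this example, not part of s3-dist-cp itself.

```python
import subprocess


def s3_dist_cp_command(src, dest, group_by=None, target_size=None):
    """Build the argument list for /usr/bin/s3-dist-cp."""
    cmd = ["/usr/bin/s3-dist-cp", f"--src={src}", f"--dest={dest}"]
    if group_by is not None:
        # No shell is involved, so the regex needs no extra quoting.
        cmd.append(f"--groupBy={group_by}")
    if target_size is not None:
        cmd.append(f"--targetSize={target_size}")
    return cmd


def run_s3_dist_cp(src, dest, **options):
    """Run s3-dist-cp and raise CalledProcessError if it exits non-zero."""
    subprocess.run(s3_dist_cp_command(src, dest, **options), check=True)
```

For example, run_s3_dist_cp("s3://my-bucket/input/", "/de_pulse/", target_size=64) would copy from a hypothetical bucket into HDFS. As with os.system, stop the Spark application first if the copy hangs.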

How to retrieve Dataproc's jobId within a PySpark job

I run several batch jobs and I would like to reference the jobId from dataproc to the saved output files.
That would allow all logs for arguments and output to be associated with the results. One downside remains: once the YARN executors are gone, the logs of the individual executors can no longer be obtained.
The Google Dataproc context is passed into Spark jobs through tags. All the relevant information is therefore present in the SparkConf and can be accessed:
pyspark.SparkConf().get("spark.yarn.application.tags", "unknown")
pyspark.SparkConf().get("spark.yarn.tags", "unknown")
The output looks like this:
dataproc_job_3f4025a0-bce1-a254-9ddc-518a4d8b2f3d
That information can then be added to the export folder path, so the output is saved with a reference to the Dataproc job:
df.select("*").write \
    .format('com.databricks.spark.csv').options(header='true') \
    .save(export_folder)
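How the tag value ends up in export_folder is not shown above, so here is one possible way to do it. This sketch assumes the tags value may be a comma-separated list in which one entry starts with dataproc_job_, as in the sample output. The function names and the gs://my-bucket path in the usage note are hypothetical.

```python
def dataproc_job_id(tags, default="unknown"):
    """Return the Dataproc job ID found in a YARN tags string."""
    prefix = "dataproc_job_"
    for tag in tags.split(","):
        tag = tag.strip()
        if tag.startswith(prefix):
            return tag[len(prefix):]
    return default


def export_path(base, tags):
    """Append the job ID to a base output path."""
    return f"{base.rstrip('/')}/{dataproc_job_id(tags)}"
```

With tags read from pyspark.SparkConf().get("spark.yarn.tags", "unknown"), export_folder could then be set to export_path("gs://my-bucket/exports", tags).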

spark on yarn; how to send metrics to graphite sink?

I am new to spark and we are running spark on yarn. I can run my test applications just fine. I am trying to collect the spark metrics in Graphite. I know what changes to make to metrics.properties file. But how will my spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metric settings from its config properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
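The "prepend spark.metrics.conf. to each key" rule is mechanical, so an existing metrics.properties file can be converted rather than retyped. Below is a small Python sketch of that conversion. The function name is made up, and it assumes simple key=value lines with # comments.

```python
def metrics_properties_to_conf(text):
    """Map metrics.properties lines to spark.metrics.conf.* key/value pairs."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        conf["spark.metrics.conf." + key.strip()] = value.strip()
    return conf
```

In PySpark the result could then be applied with something like SparkConf().setAll(list(conf.items())) before the context is created. The sink name keeps whatever spelling the file uses (Graphite or graphite), so use one spelling consistently across all keys.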
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes it so your /path/to/metrics.properties file ends up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify more complex directory structure there, or have two files with the same basename.
Related, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.