Where are the Spark logs on EMR? - scala

I'm not able to locate error logs or message's from println calls in Scala while running jobs on Spark in EMR.
Where can I access these?
I'm submitting the Spark job, written in Scala to EMR using script-runner.jar with arguments --deploy-mode set to cluster and --master set to yarn. It runs the job fine.
However I do not see my println statements in the Amazon EMR UI where it lists "stderr, stdoutetc. Furthermore if my job errors I don't see why it had an error. All I see is this in thestderr`:
15/05/27 20:24:44 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1432754139536_0002
appId: 2
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-10-185-87-217.ec2.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1432758272973
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://10.150.67.62:9046/proxy/application_1432754139536_0002/A
appUser: hadoop
`

With the deploy mode of cluster on yarn the Spark driver and hence the user code executed will be within the Application Master container. It sounds like you had EMR debugging enabled on the cluster so logs should have also pushed to S3. In the S3 location look at task-attempts/<applicationid>/<firstcontainer>/*.

If you SSH into the master node of your cluster then you should be able to find the stdout, stderr, syslog and controller logs under:
/mnt/var/log/hadoop/steps/<stepname>

I also spent a lot of time figuring this out. Found logs in the following location:
EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.

The event logs, the ones required for the spark-history-server can be found at :
hdfs:///var/log/spark/apps

If you submit your job with emr-bootstrap you can specify the log directory as an s3 bucket with --log-uri

Related

gcloud CLI application logs to bucket

There are Scala application Spark jobs that run daily in GCP. I am trying to set up a notification to be sent when run is compeleted. So, one way I thought of doing that was to get the logs and grep for a specific completion message from it (not sure if there's a better way). But I figured out the logs are just being shown in the console, inside the job details page and not being saved on a file.
Is there a way to route these logs to a file in a bucket so that I can search in it? Do I have to specify where to show these logs in the log4j properties file, like give a bucket location to
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
I tried to submit a job with this but it's giving me this error: grep:**-2022-07-08.log: No such file or directory
...
gcloud dataproc jobs submit spark \
--project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
--jars=gs://**.jar\
--properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
--gcp.conf > gs://***-$date.log 2>&1
By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See this doc for more details.
But IMHO, a better way to determine if a job has finished is through gcloud dataproc jobs describe <job-id> 1, or the jobs.get REST API 2.

How I make Scala code runs on EMR cluster by using SDK?

I wrote code with Scala to run a Cluster in EMR. Also, I have a Spark application written in Scala. I want to run this Spark application on EMR Cluster. But is it possible for me to do this in the first script (that launch EMR Cluster)? I want to do all of them with the SDK, not through the console or CLI. It has to be a kind of automatization, not a single manual job (or minimize manual job).
Basically;
Launch EMR Cluster -> Run Spark Job on EMR -> Terminate after job finished
How do I do it if possible?
Thanks.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(params);
final StepConfig sparkStep = new StepConfig()
.withName("Spark Step")
.withActionOnFailure("CONTINUE")
.withHadoopJarStep(sparkStepConf);
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
.withSteps(new ArrayList<StepConfig>(){{add(sparkStep);}});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
If you are looking just for automation you should read about Pipeline Orchestration-
EMR is the AWS service which allows you to run distributed applications
AWS DataPipeline is an Orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2)
If you'd just like to run a spark job consistently, I would suggest creating a data pipeline, and configuring your pipeline to have one step, which is to run the Scala spark jar on the master node using a "shellcommandactivity". Another benefit is that the jar you are running can be stored in AWS S3 (object storage service) and you'd just provide the s3 path to your DataPipeline and it will pick up that jar, log onto the EMR service it has brought up (with the configurations you've provided)- clone that jar on the master node, run the jar with the configuration provided in the "shellcommandactivity", and once the the job exits (successfully or with an error) it will kill the EMR cluster so you aren't paying for it and log the output
Please read more into it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule

GCP Dataproc: Directly working with Spark over Yarn Cluster

I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files HADOOP_CONF_DIR which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is
Submit a Spark job to the Dataproc cluster.
Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, if you can configure your network such that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke that when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.

Apache spark in cluster mode where to run the jobs. In Master or in worker node?

I have installed the spark in cluster mode. 1 master and 2 workers.And When I start spark shell in master node it is countinously running without getting the scala shell.
But when I run spark-shell on a worker node I am getting scala shell.And I am able to do the jobs.
val file=sc.textFile(“hdfs://192.168.1.20:9000/user/1gbdata”)
file.count()
And for this I got the output.
So My doubt is actually where to run the spark jobs.
Is it in worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command : spark-shell --master spark://IP:PORT. This url can be retrieved from the master's UI or log file.
You should be able to launch the spark-shell on the master node (machine), make sure to check out the UI to see if the spark-shell is effectively running and that the prompt is shown (you might need to press enter on your keyboard after issuing spark-shell).
Please note that when you are using spark-submit in cluster mode, the driver will be submitted directly from one of the worker nodes, contrary to client mode where it will run as a client process. Refer to the documentation for more details.

Web UI(http://localhost:8088) is not showing Spark Applications

I have pseudo distributed hadoop 2.2.0 Environment setup in my laptop.I can run mapreduce applications(including Pig and Hive jobs) and the status of the applications can be seen from the Web UI http://localhost:8088
I have downloaded the Spark library and just used the file system-HDFS for the spark applications.when I launch a spark application,it is getting launched and the execution also gets completed successfully as expected.
But the Web UI http://localhost:8088 is not listing the Spark application completed/launched.
Please suggest if there is any other additional configuration is required for seeing Spark applications in the Web UI.
(Note: http://localhost:50070this Web UI shows the files correctly,when tried writing files to HDFS via Spark applications)
You might have figured it out but for others who are starting with Spark.You can see all the spark jobs on
http://localhost:4040
after your spark context is initiated(port can be different eg 4041). Based on Standalone installation you can see the master and slave status on
http://localhost:8080
(for slave port is usually 8081 onward). you need to Spark-submit jobs to yarn-cluster or client to see the same on hadoop webservices.