Where does Spark store its profiling results? - scala

One can view the profiling statistics of a Spark program through a browser on port 4040. However, the cluster I am running on doesn't have a browser, and I'm not the admin. Is this information also logged to some file, so that I can build my own tool to visualize the statistics by reading that file?
Note that I run Spark on YARN, but sometimes I also run in local or standalone mode, so any answer related to those is also appreciated.
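For reference, the data behind the port-4040 UI can also be written out as Spark event logs, which are plain JSON (typically one file per application) and can be parsed by your own tool or replayed by the history server. A minimal sketch, assuming Spark 2.x or later and an event-log directory you can write to (the path below is an assumption):

    import org.apache.spark.sql.SparkSession

    // Event logging records the same stage/task/executor information the 4040 UI
    // shows, as JSON events in the configured directory.
    val spark = SparkSession.builder()
      .appName("profiling-demo")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // assumed path
      .getOrCreate()

The same two settings can also be passed with --conf on spark-submit, and they work under YARN as well as in local mode.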

Related

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job I want to see its progress; ideally I would see it the Spark way, i.e. the Spark progress UI with the number of partitions and everything. I would also like to get a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id> but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly I can find this console/history page.
As a side note, is there any detailed technical documentation of this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
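As an aside, if no browser is available, the same UI data is exposed as JSON by Spark's monitoring REST API on the UI port, under /api/v1. A minimal sketch (the host name is a placeholder):

    import scala.io.Source

    // Lists the applications known to the running driver's UI, as JSON.
    // "driver-node" is a placeholder for the host where the driver runs.
    val apps = Source.fromURL("http://driver-node:4040/api/v1/applications").mkString
    println(apps)

From there, endpoints such as /api/v1/applications/{app-id}/stages return per-stage details.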

How can I execute an S3-dist-cp command within a spark-submit application

I have a jar file that is provided to spark-submit. Within a method in that jar, I'm trying to do:
import sys.process._
"s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!
I also installed s3-dist-cp on all the slaves along with the master.
The application starts and succeeds without error, but it does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though (when accounting for the time taken by the additional write to HDFS that is required in order to use hadoop distcp). I'm also very interested in the answer to your question, though; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
s3-dist-cp now comes installed by default on the master node of an EMR cluster.
I was able to run s3-dist-cp from within spark-submit successfully when the Spark application is submitted in "client" mode.
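For what it's worth, the silent "success" described in the question is consistent with never checking the child process's exit code. A hedged sketch of how the call could be made and verified from the driver (which, in client mode, runs on the EMR master node where s3-dist-cp is installed):

    import sys.process._

    // Build the command as a Seq so no shell quoting is involved.
    val cmd = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>")

    // ! runs the process and returns its exit code; fail loudly if the copy did not work.
    val exitCode = cmd.!
    require(exitCode == 0, s"s3-dist-cp exited with code $exitCode")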

Can I catch events such as on Executor start in Apache Spark?

What I want to do is for the executor to start a program, such as a profiling tool, when it starts (that is, before it starts executing any task). In this way, it would be possible to monitor things like the CPU usage of an executor. Does Spark provide such hooks/callbacks? I have used SparkListener, but that is used on the driver side. Do we have something similar for executors?
This should work for your requirement.
http://spark.apache.org/developer-tools.html#profiling
Set up YourKit to work with both drivers and slaves (executors). It doesn't start profiling unless you tell it to. Connect to the master or a slave, start profiling, and then run your tests.
Happy profiling!!
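If you need a programmatic hook rather than a manually attached profiler, newer Spark versions (3.x is an assumption about what you run) also expose a plugin API whose executor side is initialized when each executor starts, before any task runs. A rough sketch; the class name and the launched command are purely hypothetical:

    import java.util.{Map => JMap}
    import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

    // Register with: --conf spark.plugins=com.example.StartupMonitorPlugin
    class StartupMonitorPlugin extends SparkPlugin {
      override def driverPlugin(): DriverPlugin = null  // no driver-side component needed

      override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
        override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
          // Runs once per executor at startup; launch an external monitor here.
          // The script path is hypothetical.
          sys.process.Process(Seq("/opt/tools/start-monitor.sh", ctx.executorID())).run()
        }
        override def shutdown(): Unit = ()
      }
    }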

Web UI (http://localhost:8088) is not showing Spark Applications

I have a pseudo-distributed Hadoop 2.2.0 environment set up on my laptop. I can run MapReduce applications (including Pig and Hive jobs), and the status of the applications can be seen from the web UI at http://localhost:8088.
I have downloaded the Spark library and just used the file system (HDFS) for the Spark applications. When I launch a Spark application, it gets launched and the execution also completes successfully as expected.
But the web UI at http://localhost:8088 does not list the Spark applications that were launched or completed.
Please suggest whether any additional configuration is required to see Spark applications in the web UI.
(Note: the web UI at http://localhost:50070 shows the files correctly when I tried writing files to HDFS via Spark applications.)
You might have figured it out already, but for others who are starting with Spark: you can see all the Spark jobs at
http://localhost:4040
after your Spark context is initiated (the port can be different, e.g. 4041). With a standalone installation you can see the master and slave status at
http://localhost:8080
(for slaves the port is usually 8081 onward). You need to spark-submit jobs to YARN (cluster or client mode) to see them in the Hadoop web UI.
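For completeness, applications only register with the ResourceManager (and hence appear on port 8088) when they are submitted with YARN as the master, e.g. something like spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar (the class and jar names here are placeholders); runs with a local master will only ever show up on the driver's own 4040 UI.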

Unable to run YCSB successfully for ElasticSearch

I am new to both YCSB and ElasticSearch. I was able to run YCSB easily for Cassandra. However, I have not been able to do the same with ES (or perhaps I have but I am not sure).
Following the steps documented in YCSB/elasticsearch, I was able to start the test and I even got results. However, I am not sure which ES instance it is running on. For Cassandra, I had to start the Cassandra server myself and then run the tests (providing the host details along with the ycsb command). ES, on the other hand, does not require anything of that sort. So how does YCSB run these tests? I didn't even have my local ES instance up, but the tests still gave results.
Any insights would really help.
Thanks!
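For anyone reproducing this, the invocation itself is just the standard YCSB load/run pair against the elasticsearch binding:

    ./bin/ycsb load elasticsearch -s -P workloads/workloada
    ./bin/ycsb run elasticsearch -s -P workloads/workloada

Whether that binding connects to a cluster you started yourself or brings up its own node is governed by the connection properties documented in the binding's README (I have not verified its defaults), which would also be the place to look for why results appear even without a locally running ES instance.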