Scala code execution on master of spark cluster? - scala

The spark application uses some API calls which do not use spark-session. I believe when the piece of code doesn't use spark it is getting executed on the master node!
Why do I want to know this?
I am getting a java heap space error while I am trying to POST some files using API calls and I believe if I upgrade the master and increase driver mem it can be solved.
I want to understand how this type of application is executed on the Spark cluster?
Is my understanding right or am I missing something?

It depends - closures/functions passed to the built-in function transform or any code in udfs you create, code in forEachBatch (and maybe a few other places) will run on the workers. Other code runs on driver

Related

Use pyspark on yarn cluster without creating the context

I'll try to do my best to explain my self. I'm using JupyterHub to connect to the cluster of my university and write some code. Basically i'm using pyspark but, since i'va always used "yarn kernel" (i'm not sure of what i'm sying) i've never defined the spark context or the spark session. Now, for some reason, it doesn't work anymore and when i try to use spark this error appears:
Code ->
df = spark.read.csv('file:///%s/.....
Error ->
name 'spark' is not defined
It already happend to me but i just solved by installing another version of pyspark. Now i don't know what to do.

How can I execute a S3-dist-cp command within a spark-submit application

I have a jar file that is being provided to spark-submit.With in the method in a jar. I’m trying to do a
Import sys.process._
s3-dist-cp —src hdfs:///tasks/ —dest s3://<destination-bucket>
I also installed s3-dist-cp on all salves along with master.
The application starts and succeeded without error but does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it sucessfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path) though (when accounting in the time taken by the additional write to hdfs that is required in order to use hadoop distcp). I'm also very interested in the answer to your question though; I think s3-dist-cp might be faster given the aditional optimizations done by Amazon.
s3-dist-cp is now a default thing on the Master node of the EMR cluster.
I was able to do an s3-dist-cp from with in the spark-submit successfully if the spark application is submitted in "client" mode.

Is there a way to submit a Spark application to a cluster without creating a jar?

I am building a service which takes in a string of Spark code to execute on a cluster. Is there any way for me to set the Spark context to the cluster and execute without building a jar and submitting it?
Indeed, you can use the spark shell, or look at something like the IBM Spark Kernel, Zeppelin, etc. to have a long running Spark Context you can submit code to and have it run. As you are almost certainly already aware of, be very careful with accepting strings and executing them on the cluster (e.g. only from a trusted source).

How does work the spark application listens to port?

Can I run the program, listens at the port in the cluster?
I want to write an application that accepts http requests and performs some calculation using spark
Yes, you can run any code you want on driver node. You can use, for example, spray.io http server and connect to spark actor system:
import org.apache.spark.SparkEnv
implicit val system = SparkEnv.get.actorSystem
But there is no way to execute arbitrary code on workers. Workers run only code blocks inside RDD's map-reduce functions.
It is hard to understand your English, but, if I understood you correctly you are looking for something like Spark-JobServer

How to trigger a spark job without using "spark-submit"? real-time instead of batch

I have a spark job, which I normally run with spark-submit with the input file name as the argument. Now I want to make the job available for the team, so people can submit an input file (probably through some web-API), then the spark job will be trigger, and it will return user the result file (probably also through web-API). (I am using Java/Scala)
What do I need to build in order to trigger the spark job in such scenario? Is there some tutorial somewhere? Should I use spark-streaming for such case? Thanks!
One way to go is have a web server listening for jobs, and each web request potentially triggering an execution of a spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking spark jobs other than through spark-submit.
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST Server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multi users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be ran by Livy in the cluster while the end user is manipulating them at his own convenience through a REST api. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this url for info.
https://github.com/cloudera/livy
You can use SparkLauncher class to do this. You will need to have a REST API that will take file from the user and after that trigger the spark job using SparkLauncher.
Process spark = new SparkLauncher()
.setAppResource(job.getJarPath())
.setMainClass(job.getMainClass())
.setMaster("master spark://"+this.serverHost + ":" + this.port)
.launch();