How to execute an application uploaded to worker nodes with --files option? - scala

I am uploading a file to my worker nodes using spark-submit, and I would like to access this file. The file is a binary that I would like to execute. I already know how to execute a file from Scala, but I keep getting a "File not found" exception and I can't find a way to access it.
I use the following command to submit my job.
spark-submit --class Main --master yarn --deploy-mode cluster --files las2las myjar.jar
While the job was executing, I noticed that the file was uploaded to the staging directory of the running application, but when I tried to run the following, it didn't work:
val command = "hdfs://url/user/username/.sparkStaging/" + sparkContext.applicationId + "/las2las" !!
This is the exception that gets thrown:
17/10/22 18:15:57 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "hdfs://url/user/username/.sparkStaging/application_1486393309284_26788/las2las": error=2, No such file or directory
So, my question is, how can I access the las2las file?

Use SparkFiles:
val path = org.apache.spark.SparkFiles.get("las2las")
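For completeness, a minimal sketch of how this could look in the driver (untested; the chmod step is an assumption, since files shipped with --files are not marked executable, as the other answer below shows):
import org.apache.spark.SparkFiles
import scala.sys.process._

// Resolve the localized copy of the file shipped with --files.
val las2lasPath = SparkFiles.get("las2las")

// Files distributed via --files lose the executable bit, so restore it.
s"chmod +x $las2lasPath".!!

// Run the binary and capture its standard output.
val output = Seq(las2lasPath).!!
println(output)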

How can I access the las2las file?
When you go to the YARN UI at http://localhost:8088/cluster and click on the application ID for the Spark application, you'll get redirected to the page with the container logs. Click Logs. In stderr you should find lines that look similar to the following:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> file:/Users/jacek/.sparkStaging/application_1508700955259_0002
SPARK_USER -> jacek
SPARK_YARN_MODE -> true
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.worker.ui.port=44444' \
'-Dspark.driver.port=55365' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@192.168.1.6:55365 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1508700955259_0002 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_libs__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_libs__618005180363157241.zip" } size: 218111116 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_conf__.zip" } size: 105328 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
===============================================================================
I executed my Spark application as follows:
YARN_CONF_DIR=/tmp \
./bin/spark-shell --master yarn --deploy-mode client --files hello.sh
so the line of interest is:
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
You should find a similar line with the path to the shell script (mine is /Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh).
This file is a binary, which I would like to execute.
With that path, you can try to execute the file.
import scala.sys.process._
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh" !!
warning: there was one feature warning; re-run with -feature for details
java.io.IOException: Cannot run program "/Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:113)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:129)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 50 elided
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 54 more
It won't work by default since the file is not marked as executable.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rw-r--r-- 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
(I don't know whether you can tell Spark or YARN to mark a file as executable.)
Let's make the file executable.
scala> s"chmod +x /Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
res2: String = ""
It is indeed an executable shell script.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rwxr-xr-x 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
Let's execute it then.
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
+ echo 'Hello world'
res3: String =
"Hello world
"
It worked fine given the following hello.sh:
#!/bin/sh -x
echo "Hello world"

Related

How to pass different filenames to Spark using Scala

I have the following code on the cluster:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I want to pass different files to spark.read.format using the spark-submit command.
The files are on my Linux box.
I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just looks for the path relative to the root folder of the HDFS cluster and throws a file-not-found exception.
Can anyone please help me use the file from the path I specify? I want my Spark program to read the file from the path I give it, not from the root.
I tried:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load(filepath) // Here I want to accept a parameter
  df.show()
  spark.stop()
}
I used the command below to submit, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
The local file URL format should be:
csv_file="file:///usr/usr1/Test.csv".
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
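For example, in the driver code from the question this would look roughly like the following (a sketch, assuming Test.csv really is present at that path on every node):
// Read the CSV through the explicit local-file scheme.
val df = spark.read.format("csv").load("file:///usr/usr1/Test.csv")
df.show()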
I don't have a cluster at hand right now, so I cannot test this. However:
You submit the code to YARN, so the Spark driver will be deployed on one of the cluster's nodes, but you don't know which one in advance.
When reading a path that starts with "file://" (or has no scheme), Spark looks for the file on the file system of the node the driver is running on.
As you've seen, spark-submit --files copies the file into the working directory of the Spark driver on whichever node it starts. That path is somewhat arbitrary, and you should not try to infer it.
But it might work to pass just the file name as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
... \
--files ...,/path/to/your/file.csv \
abc.jar file.csv
=> The proper/standard way to do it is: first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then give the Spark app the HDFS path to use. Something like (again, not tested):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, in case you don't know: to use the hdfs command you need the HDFS client installed on your machine (the actual hdfs command), with suitable configuration pointing to the HDFS cluster. There is also usually security configuration needed on the cluster so the client can communicate with it. But that is another issue, and it depends on where HDFS is running (local, AWS, ...).
Replace ${csv_file} at the end of your spark-submit command with the output of basename ${csv_file}:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
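For reference, a minimal sketch of the driver side that goes with this command (the object name is illustrative; args(0) is expected to receive the bare file name, e.g. Test.csv):
import org.apache.spark.sql.SparkSession

object DriverClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkData").getOrCreate()

    // In yarn cluster mode the --files entries are localized into the
    // container's working directory, so the bare name resolves there.
    val df = spark.read.format("csv").load(args(0))
    df.show()

    spark.stop()
  }
}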

pyspark container - spark-submitting a pyspark script throws a file not found error

Solution:
Add the following env variables to the container:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
I am trying to create a Spark container and spark-submit a PySpark script.
I am able to create the container, but running the PySpark script throws the following error:
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
Questions:
Any idea why this error is occurring?
Do I need to install Python separately, or does it come bundled with the Spark download?
Do I need to install PySpark separately, or does it come bundled with the Spark download?
What is preferable for the Python installation: download and put it under /opt/python, or use apt-get?
pyspark script:
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize([
    "scala",
    "java",
    "hadoop",
    "spark",
    "akka",
    "spark vs hadoop",
    "pyspark",
    "pyspark and spark"
])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
output of spark-submit:
newuser@c1f28230da16:~$ spark-submit count.py
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
output of printenv:
newuser@c1f28230da16:~$ printenv
HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv
myspark dockerfile:
ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz
#MAINTAINER demo#gmail.com
#LABEL maintainer="demo#foo.com"

############################################
### Install openjava
############################################
# Base image stage 1
FROM ubuntu as jdk

ARG JAVA_HOME
ARG JDK_PACKAGE

WORKDIR /opt/

## download open java
# ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
# ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .

RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1 -C $JAVA_HOME && \
    rm -f $JDK_PACKAGE

############################################
### Install spark search
############################################
# Base image stage 2
From ubuntu as spark

#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

## download spark
COPY $SPARK_PACKAGE .

RUN mkdir -p $SPARK_HOME/ && \
    tar -zxf $SPARK_PACKAGE --strip-components 1 -C $SPARK_HOME && \
    rm -f $SPARK_PACKAGE

# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml

############################################
### final
############################################
From ubuntu as finalbuild

ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME

# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME

# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip

# Expose ports
# EXPOSE 9200
# EXPOSE 9300

# Define mountable directories.
#VOLUME ["/data"]

## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
    #chown -R newuser:newuser /home/newuser
    #chown -R newuser /home/newuser && \

# Install Python
RUN apt-get update && \
    apt-get install -yq curl && \
    apt-get install -yq vim && \
    apt-get install -yq python3.9

## Install PySpark and Numpy
#RUN \
#  pip install --upgrade pip && \
#  pip install numpy && \
#  pip install pyspark
#

USER newuser
WORKDIR /home/newuser
# RUN chown -R newuser /home/newuser
I added the following env variables to the container and it works:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9

Google Cloud Endpoint Error when creating service config

I am trying to configure Google Cloud Endpoints using Cloud Functions. For this I am following the instructions at: https://cloud.google.com/endpoints/docs/openapi/get-started-cloud-functions
I have followed the steps given and have come to the point of building the service config into a new ESPv2 Beta docker image. When I give the command:
chmod +x gcloud_build_image
./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p ESP_PROJECT_ID
After replacing the hostname, config ID, and project ID, I get the following error:
> -c service-host-name-xxx -p project-id
Using base image: gcr.io/endpoints-release/endpoints-runtime-serverless:2
++ mktemp -d /tmp/docker.XXXX
+ cd /tmp/docker.5l3t
+ gcloud endpoints configs describe service-host-name-xxx.run.app --project=project-id --service=service-host-name-xxx.app --format=json
ERROR: (gcloud.endpoints.configs.describe) NOT_FOUND: Service configuration 'services/service-host-name-xxx.run.app/configs/service-host-name-xxx' not found.
+ error_exit 'Failed to download service config'
+ echo './gcloud_build_image: line 46: Failed to download service config (exit 1)'
./gcloud_build_image: line 46: Failed to download service config (exit 1)
+ exit 1
Any idea what I am doing wrong? Thanks.
My bad. I repeated the steps and got it working, so I guess I made some mistake while trying it out. The documentation works as stated.
I had the same error. When running the script twice, it works. This means you have to already have a service endpoint configured, which does not yet exist when the script tries to fetch the endpoint information with:
gcloud endpoints configs describe service-host-name-xxx.run.app
What I would do (in Cloud Build) is to supply some sort of "empty" container first. I used the following example at the top of my cloudbuild.yaml:
gcloud run services list \
--platform managed \
--project ${PROJECT_ID} \
--region europe-west1 \
--filter=${PROJECT_ID}-esp-svc \
--format yaml | grep . ||
gcloud run deploy ${PROJECT_ID}-esp-svc \
--image="gcr.io/endpoints-release/endpoints-runtime-serverless:2" \
--allow-unauthenticated \
--platform managed \
--project=${PROJECT_ID} \
--region=europe-west1 \
--timeout=120

How to start master and slave on EMR

I'm new to EMR and I can't run my Spark application on EMR.
My question is how I can run start-master.sh and start-slave.sh on EMR.
I put these 2 commands into a bash file and uploaded it to S3 as a bootstrap action.
aws emr create-cluster --release-label $release_label \
--instance-groups InstanceGroupType="MASTER",InstanceCount=$instance_count,InstanceType=$instance_type,BidPrice=0.2,Name="MASTER" \
InstanceGroupType="CORE",InstanceType=$instance_type,InstanceCount=$instance_count,BidPrice=0.2,Name="CORE" \
--auto-terminate \
--use-default-roles \
--name knx-attribution-spark-$product-$environment-$build_number \
--log-uri s3://knx-logs/emr/knx-attribution-spark-$product-$environment \
--ec2-attributes KeyName=$keypair,SubnetId=$subnet,EmrManagedMasterSecurityGroup=$sg1,EmrManagedSlaveSecurityGroup=$sg1,AdditionalMasterSecurityGroups=$sg2,AdditionalSlaveSecurityGroups=$sg2 \
--tags Name="knx-emr-attribution-spark-$product-$environment" Environment=$environment \
--applications Name=Spark Name=Hadoop \
--bootstrap-actions Path="s3://${BOOTSTRAP_FILE}" \
--steps Type=Spark,Name=Stage,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--packages,org.mongodb.spark:mongo-spark-connector_2.11:2.3.0,--driver-memory,8g,--executor-memory,4g,--num-executors,4,--py-files,s3://${FILE_ZIP},--master,spark://127.0.0.1:7077,s3://${BUCKET}]
the bootstrap file is:
./spark/sbin/start-master.sh
./spark/sbin/start-slave.sh spark://127.0.0.1:7077
and it always throws this error:
/emr/instance-controller/lib/bootstrap-actions/1/install_lib.sh: line 4: start-master.sh: command not found
Make sure the .sh files are executable before you try.
If not, use this command to make one executable:
chmod +x start-master.sh
And then try running the scripts.
You cannot run Spark on Amazon EMR in standalone mode.
On EMR, Spark is run on YARN rather than in Standalone mode. Unless there’s something else wrong that I’m not seeing, the only thing you should need to change is to remove “--master,spark://127.0.0.1:7077” from the arguments of your spark-submit step. The correct args to use would be “--master,yarn”, but that is the default on EMR, so you don’t need to specify that.

Specify hbase-site.xml to spark-submit

I have a spark job (written in Scala) that retrieves data from an HBase table found on another server. In order to do this I first create the HBaseContext like this:
val hBaseContext:HBaseContext = new HBaseContext(sparkContext, HBaseConfiguration.create())
When I run the spark job I use spark-submit and specify the arguments needed. Something like this:
spark-submit --master=local[*] --executor-memory 4g --executor-cores 2 --num-executors 2 --jars $(for x in `ls -1 ~/spark_libs/*.jar`; do readlink -f $x; done | paste -s | sed -e 's/\t/,/g') --class com.sparksJob.MyMainClass myJarFile.jar "$#"
The problem is that this connects to ZooKeeper on localhost, but I want it to connect to the ZooKeeper on another server (the one where HBase is).
Hardcoding this information works:
val configuration: Configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "10.190.144.8")
configuration.set("hbase.zookeeper.property.clientPort", "2181")
val hBaseContext:HBaseContext = new HBaseContext(sparkContext, HBaseConfiguration.create(configuration))
However, I want it to be configurable.
How can I tell spark-submit the path to an hbase-site.xml file to use?
You can pass hbase-site.xml as a parameter of the --files option. Your example would become:
spark-submit --master yarn-cluster --files /etc/hbase/conf/hbase-site.xml --executor-memory 4g --executor-cores 2 --num-executors 2 --jars $(for x in `ls -1 ~/spark_libs/*.jar`; do readlink -f $x; done | paste -s | sed -e 's/\t/,/g') --class com.sparksJob.MyMainClass myJarFile.jar "$#"
Notice that the master is set to yarn-cluster. Any other option would cause the hbase-site.xml to be ignored.
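If you prefer not to rely on the classpath pickup, a hedged alternative (not from the answer above, and assuming the hbase-spark module's HBaseContext) is to add the localized file to the Configuration explicitly; --files places hbase-site.xml in the container's working directory, so a relative path should resolve there:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Start from the default HBase configuration and overlay the shipped file.
val conf: Configuration = HBaseConfiguration.create()
conf.addResource(new Path("hbase-site.xml"))  // localized by --files

val hBaseContext = new HBaseContext(sparkContext, conf)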