How to start master and slave on EMR - pyspark

I'm new to EMR and I can't run my Spark application on EMR.
My question is how I can run start-master.sh and start-slave.sh on EMR.
I put these two commands into a bash file and uploaded it to S3 as a bootstrap action. This is how I create the cluster:
aws emr create-cluster --release-label $release_label \
--instance-groups InstanceGroupType="MASTER",InstanceCount=$instance_count,InstanceType=$instance_type,BidPrice=0.2,Name="MASTER" \
InstanceGroupType="CORE",InstanceType=$instance_type,InstanceCount=$instance_count,BidPrice=0.2,Name="CORE" \
--auto-terminate \
--use-default-roles \
--name knx-attribution-spark-$product-$environment-$build_number \
--log-uri s3://knx-logs/emr/knx-attribution-spark-$product-$environment \
--ec2-attributes KeyName=$keypair,SubnetId=$subnet,EmrManagedMasterSecurityGroup=$sg1,EmrManagedSlaveSecurityGroup=$sg1,AdditionalMasterSecurityGroups=$sg2,AdditionalSlaveSecurityGroups=$sg2 \
--tags Name="knx-emr-attribution-spark-$product-$environment" Environment=$environment \
--applications Name=Spark Name=Hadoop \
--bootstrap-actions Path="s3://${BOOTSTRAP_FILE}" \
--steps Type=Spark,Name=Stage,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--packages,org.mongodb.spark:mongo-spark-connector_2.11:2.3.0,--driver-memory,8g,--executor-memory,4g,--num-executors,4,--py-files,s3://${FILE_ZIP},--master,spark://127.0.0.1:7077,s3://${BUCKET}]
The bootstrap file is:
./spark/sbin/start-master.sh
./spark/sbin/start-slave.sh spark://127.0.0.1:7077
and it always throws this error:
/emr/instance-controller/lib/bootstrap-actions/1/install_lib.sh: line 4: start-master.sh: command not found

Make sure the .sh files are executable before you try.
If not, try this command to make them executable:
chmod +x start-master.sh
Then try running the scripts.
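For example, a minimal sketch using the paths from your bootstrap file (assuming the scripts really are under ./spark/sbin on the node):
chmod +x ./spark/sbin/start-master.sh ./spark/sbin/start-slave.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slave.sh spark://127.0.0.1:7077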

You cannot run Spark on Amazon EMR in standalone mode.

On EMR, Spark is run on YARN rather than in Standalone mode. Unless there’s something else wrong that I’m not seeing, the only thing you should need to change is to remove “--master,spark://127.0.0.1:7077” from the arguments of your spark-submit step. The correct args to use would be “--master,yarn”, but that is the default on EMR, so you don’t need to specify that.
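For reference, a sketch of the --steps argument from the question with only the standalone master removed (everything else unchanged):
--steps Type=Spark,Name=Stage,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--packages,org.mongodb.spark:mongo-spark-connector_2.11:2.3.0,--driver-memory,8g,--executor-memory,4g,--num-executors,4,--py-files,s3://${FILE_ZIP},s3://${BUCKET}]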

Related

How to pass different filenames to Spark using Scala

I have the below code on the cluster:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  // Read the csv file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop
}
I want to pass different files to spark.read.format using the spark-submit command.
The files are on my Linux box.
I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just tries to look for the path from the root folder of the HDFS cluster and throws a file-not-found exception.
Can anyone please help me use the file from the filepath I mention? I want my Spark program to read the file from the path I specify, not from the root.
I tried:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  // Read the csv file
  val df = spark.read.format("csv").load(filepath) // The file path is accepted as a parameter
  df.show()
  spark.stop
}
I used the below to submit, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
The local file URL format should be:
csv_file="file:///usr/usr1/Test.csv"
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
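For instance, a sketch of the submission from the question with only the URL form changed (other flags elided, and the same file must exist at that path on every node):
csv_file="file:///usr/usr1/Test.csv"
spark2-submit ... --files myprop.properties abc.jar ${csv_file}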
I don't have a cluster at hand right now, so I cannot test it. However:
You submit the code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
When reading a path that starts with "file://" (or has no scheme), Spark will look for the file on the local file system of the node the driver is running on.
As you've seen, using spark-submit --files will copy the file into the working directory of the Spark driver (so on the master node). That path is kind of arbitrary, and you should not try to infer it.
But maybe it would work to pass just the filename as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
... \
--files ...,/path/to/your/file.csv \
abc.jar file.csv
=> The proper/standard way to do it is: first copy your file(s) to HDFS, or another distributed file system the Spark cluster has access to. Then you can give the Spark app the HDFS path to use. Something like (again, I didn't test it):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, if you don't know: to use the hdfs command, you need the HDFS client installed on your machine (the actual hdfs command), with suitable configuration pointing to the HDFS cluster. There is also usually some security configuration to do on the cluster so the client can communicate with it. But that is another issue, and it depends on where HDFS is running (local, AWS, ...).
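To sanity-check that the client is configured and the file landed where expected (same example paths as above):
hdfs dfs -ls /user/your/data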
Replace ${csv_file} at the end of your spark-submit command with `basename ${csv_file}` (in backticks, for command substitution):
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.

Getting error in configuring S3 sink connector

I have cloned the landoop fast-data-dev Docker repo from this GitHub repo
and built the image using the command: docker build --tag=landoop .
After building the image, I ran it using:
docker run --rm -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 -p 9581-9585:9581-9585 -p 9092:9092 -e ADV_HOST=10.10.X.X -e DEBUG=1 -e AWS_ACCESS_KEY_ID=XXX -e AWS_SECRET_ACCESS_KEY=XXX landoop
Once the UI was up, I tried to create an S3 sink connector but it failed saying:
Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
Also, I don't see the libnss3.so file at that location. However, if I run the Docker container directly using the command below, the file is present and there is no error when creating the S3 sink connector.
docker run --rm --net=host landoop/fast-data-dev
Has anyone faced this error?
Answering my own question so that others can benefit; if it's not appropriate, please leave a comment and I will convert this answer to a comment. I figured out that the libnss3 library was missing from the Debian image and had to be installed while building the image. For this I edited setup-and-run.sh and added libnss3; the script looks like:
FROM debian as compile-lkd
RUN apt-get update \
&& apt-get install -y \
unzip \
wget \
libnss3 \

How to get output of gcloud composer command?

I'm executing gcloud composer commands:
gcloud composer environments run airflow-composer \
--location europe-west1 --user-output-enabled=true \
backfill -- -s 20171201 -e 20171208 dags.my_dag_name
which prints:
kubeconfig entry generated for europe-west1-airflow-compos-007-gke.
It's a regular Airflow backfill. The command above prints the results only at the end of the whole backfill range. Is there any way to get the output in a streaming manner, so that each time a DAG run is backfilled it is printed to standard output, like with the regular Airflow CLI?

Run PySpark job from .egg instead of .py

I am trying to run a PySpark job using Dataproc. The only difference compared to all the examples out there is that I want to submit the job from a .egg instead of a .py file.
To submit the PySpark job on a regular commodity cluster, I would use something like:
spark2-submit --master yarn \
--driver-memory 20g \
--deploy-mode client \
--conf parquet.compression=SNAPPY \
--jars spark-avro_2.11-3.2.0.jar \
--py-files dummyproject-1_spark-py2.7.egg \
dummyproject-1_spark-py2.7.egg#__main__.py "param1" "param2"
Now, I want to submit exactly the same job but using Dataproc.
In order to accomplish this I am using the following command:
gcloud dataproc jobs submit pyspark \
file:///dummyproject-1_spark-py2.7.egg#__main__.py \
--cluster=my-cluster-001 \
--py-files=file:///dummyproject-1_spark-py2.7.egg
The error I am getting is:
Error: Cannot load main class from JAR
file:/dummyproject-1_spark-py2.7.egg
It is important to mention that when I try to run a simple PySpark job using a .py file, it works correctly.
Can somebody tell me how I can run a PySpark job from a .egg file instead of a .py file?
It looks like there is a bug in how gcloud dataproc parses the arguments, making Spark try to execute your file like a Java JAR file. A workaround is to copy your __main__.py file outside of your egg file and execute it independently, like this:
gcloud dataproc jobs submit pyspark \
--cluster=my-cluster-001 \
--py-files=file:///dummyproject-1_spark-py2.7.egg \
file:///__main__.py
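If __main__.py only exists inside the egg, one way to get a standalone copy is to extract it first; this is a sketch assuming the file sits at the top level of the archive (eggs are plain zip files):
unzip -p dummyproject-1_spark-py2.7.egg __main__.py > __main__.py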

How to pass -D parameter or environment variable to Spark job?

I want to change Typesafe config of a Spark job in dev/prod environment. It seems to me that the easiest way to accomplish this is to pass -Dconfig.resource=ENVNAME to the job. Then Typesafe config library will do the job for me.
Is there a way to pass that option directly to the job? Or maybe there is a better way to change the job config at runtime?
EDIT:
Nothing happens when I add the --conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev" option to the spark-submit command.
I get Error: Unrecognized option '-Dconfig.resource=dev'. when I pass -Dconfig.resource=dev to the spark-submit command.
Change the spark-submit command line, adding three options:
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
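Applied to the question's config.resource setup, the submit would look roughly like this (class name, jar and conf path are placeholders):
spark-submit \
--class your.main.Class \
--master yarn \
--files /path/to/app.conf \
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app' \
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' \
your-app.jar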
Here is how I run my Spark program with additional Java options:
/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log
As you can see:
the custom config file:
--files /home/spark/jobs/fact_stats_ad.conf
the executor Java options:
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
the driver Java options:
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'
Hope it helps.
I had a lot of problems with passing -D parameters to the Spark executors and the driver, so I've added a quote from my blog post about it:
"
The right way to pass the parameter is through the property:
“spark.driver.extraJavaOptions” and “spark.executor.extraJavaOptions”:
I’ve passed both the log4J configurations property and the parameter that I needed for the configurations. (To the Driver I was able to pass only the log4j configuration).
For example (this was written in a properties file passed to spark-submit with "--properties-file"):
"
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
"
You can read my blog post about the overall configuration of Spark.
I'm running on YARN as well.
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
If you write it this way, the later --conf will overwrite the previous one; you can verify this by looking at the Environment tab in the Spark UI after the job starts.
So the correct way is to put the options on the same line, like this:
--conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'
If you do this, you will find all your settings shown in the Spark UI.
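Concretely, for the config.resource example from this question it would be a single --conf per side, with any additional -D options appended on the same line (-Dlog4j.configuration here is just a placeholder second option):
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=dev -Dlog4j.configuration=file:///path/to/log4j.properties'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=dev -Dlog4j.configuration=file:///path/to/log4j.properties'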
I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like
Array(".../spark-submit", ..., "--conf", confValues, ...)
where confValues is:
for yarn-cluster mode:
"spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
for local[*] mode:
"run.mode=development"
It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
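For illustration, a minimal Scala sketch of launching such a command with scala.sys.process (the spark-submit path, master and jar are placeholders, not from the original post):

import scala.sys.process._

object SubmitSparkJob {
  def main(args: Array[String]): Unit = {
    // In yarn-cluster style deployments the -D options go through spark.driver.extraJavaOptions.
    val confValues = "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=x"
    val cmd = Seq(
      "/path/to/bin/spark-submit",   // placeholder path to spark-submit
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "--conf", confValues,
      "/path/to/app.jar"             // placeholder application jar
    )
    // Passing one argument per Seq element sidesteps most of the shell quoting/escaping headaches.
    val exitCode = Process(cmd).!
    println(s"spark-submit exited with $exitCode")
  }
}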
spark-submit --driver-java-options "-Denv=DEV -Dmode=local" --class co.xxx.datapipeline.jobs.EventlogAggregator target/datapipeline-jobs-1.0-SNAPSHOT.jar
The above command works for me:
-Denv=DEV => to read the DEV env properties file, and
-Dmode=local => to create the SparkContext in local mode: .setMaster("local[*]")
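For illustration, a rough sketch of how the driver code might read those flags (only the property names come from the command above; the object name and logic are made up):

import org.apache.spark.sql.SparkSession

object EventlogAggregatorSketch {
  def main(args: Array[String]): Unit = {
    // System properties set via --driver-java-options are visible on the driver JVM.
    val env  = sys.props.getOrElse("env", "DEV")
    val mode = sys.props.getOrElse("mode", "cluster")

    val builder = SparkSession.builder.appName(s"EventlogAggregator-$env")
    val spark =
      if (mode == "local") builder.master("local[*]").getOrCreate()
      else builder.getOrCreate()

    // ... load the <env> properties file and run the job ...

    spark.stop()
  }
}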
Using the method as in the command below may be helpful for you:
spark-submit --master local[2] \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' \
--class com.test.spark.application.TestSparkJob target/application-0.0.1-SNAPSHOT-jar-with-dependencies.jar prod
I have tried it and it worked for me. I would also suggest going through the Spark page linked below, which is really helpful:
https://spark.apache.org/docs/latest/running-on-yarn.html
I originally had this config file:
my-app {
  environment: dev
  other: xxx
}
This is how I'm loading my config in my Spark Scala code:
import java.io.File
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load())
  .resolve
  .getConfig("my-app")
With this setup, despite what the Typesafe Config documentation and all the other answers say, the system property override didn't work for me when I launched my spark job like so:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Dmy-app.environment=prod' \
--files my-app.conf \
my-app.jar
To get it to work I had to change my config file to:
my-app {
  environment: dev
  environment: ${?env.override}
  other: xxx
}
and then launch it like so:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Denv.override=prod' \
--files my-app.conf \
my-app.jar
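With that change, the override is resolved when the config is loaded (see the parseFile/withFallback snippet above); roughly:

// "prod" when -Denv.override=prod is passed via the driver Java options, "dev" otherwise
val environment = config.getString("environment")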