Running applications don't appear in the Spark web UI but they run - scala

I need your help. I created two apps: one uses the Spray framework, and the other receives messages from Kafka and sends them to Cassandra.
Both run all the time and should never stop.
I'm running in standalone mode on the server, and my configuration is:
- In spark-env.sh :
SPARK_MASTER_IP=MYIP
SPARK_EXECUTOR_CORES=2
SPARK_MASTER_PORT=7077
SPARK_EXECUTOR_MEMORY=4g
#SPARK_WORKER_PORT=65000
MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
SPARK_LOCAL_IP=MYIP
SPARK_MASTER_WEBUI_PORT=8080
- In spark-defaults.conf :
spark.master spark://MYIPMASTER:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark-1.6.1-bin-hadoop2.6/spark-events
spark.history.fs.logDirectory /opt/spark-1.6.1-bin-hadoop2.6/logs
spark.io.compression.codec lzf
spark.cassandra.connection.host MYIPMASTER
spark.cassandra.auth.username LOGIN
spark.cassandra.auth.password PASSWORD
I can access both pages:
MYIP:8080/ and MYIP:4040/
But on http://MYIP:8080/ I see only my workers; I can't see my running applications.
When I submit, I use this:
/opt/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class MYCLASS --verbose --conf spark.eventLog.enable=true --conf spark.master.ui.port=8080 --master local[2] /opt/spark-1.6.1-bin-hadoop2.6/jars/MYJAR.jar
Why ?
Could you help me?
Thanks a lot :)

In your spark-submit command you pass --master local[2], which submits the application in local mode. If you want to run it on your standalone cluster, pass the Spark master URL in the --master option instead, i.e. --master spark://MYIPMASTER:7077

For the master, spark-submit resolves the setting in the following order of precedence:
The master URL set in your application code, i.e.
SparkSession.builder().master("...")
The --master parameter for the spark-submit command
The default configuration in your spark-defaults.conf
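The precedence order above can be sketched as a small lookup. This is only an illustration of the described order, not Spark's own resolution code; the function name `resolve_master` and its three parameters are hypothetical.

```python
def resolve_master(code_master, cli_master, defaults_master):
    """Return the first master URL that is set, following the order above.

    code_master:     value passed to SparkSession.builder().master(...)
    cli_master:      value of the --master option of spark-submit
    defaults_master: spark.master from spark-defaults.conf
    (All three names are hypothetical and used for illustration only.)
    """
    for candidate in (code_master, cli_master, defaults_master):
        if candidate:
            return candidate
    return None


# The situation from the first question: nothing set in code,
# --master local[2] on the command line, a cluster URL in spark-defaults.conf.
print(resolve_master(None, "local[2]", "spark://MYIPMASTER:7077"))
```

Under these assumptions the command-line value wins over spark-defaults.conf, which is consistent with the application running in local mode and not showing up on the standalone master's page.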

Mode: Standalone cluster
1> bin/spark-submit --class com.deepak.spark.App ../spark-0.0.2-SNAPSHOT.jar --master spark://172.29.44.63:7077 did not work, because --master was specified after the jar. Everything after the jar is passed to the application as its own arguments.
2> bin/spark-submit --class com.deepak.spark.App --master spark://172.29.44.63:7077 ../spark-0.0.2-SNAPSHOT.jar worked.
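To see why the position of the jar matters, here is a rough sketch of how the argument list is divided. It is a simplification written for this example, not the real spark-submit parser: the helper `split_submit_args` is hypothetical, and it assumes that every option except a few boolean flags takes exactly one value.

```python
# Options assumed to take no value; every other option is assumed to take one.
BOOLEAN_FLAGS = {"--verbose", "-v", "--supervise", "--help", "-h", "--version"}


def split_submit_args(args):
    """Return (submit_options, primary_resource, application_args).

    The first token that is not an option (or an option's value) is treated
    as the primary resource (the jar); everything after it belongs to the
    application, not to spark-submit.
    """
    i = 0
    while i < len(args):
        token = args[i]
        if not token.startswith("-"):
            return args[:i], token, args[i + 1:]
        # Skip the option itself, plus its value unless it is a boolean flag.
        i += 1 if token in BOOLEAN_FLAGS else 2
    return args, None, []


jar = "../spark-0.0.2-SNAPSHOT.jar"
master = ["--master", "spark://172.29.44.63:7077"]
cls = ["--class", "com.deepak.spark.App"]

# Case 1: --master ends up in the application arguments.
print(split_submit_args(cls + [jar] + master))
# Case 2: --master is seen as a spark-submit option.
print(split_submit_args(cls + master + [jar]))
```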

Related

Spark submit truncates arguments in yarn cluster mode

I am running a Spark application on a YARN cluster in cluster deploy mode using the following command:
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=cluster --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
This command does not pass the entire argument to the HelloWorld class.
Main method argument passed : {"sample":{
Main method argument expected: {"sample":{}}
The same command runs properly in client deploy mode:
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=client --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
Upon inspecting the launch_container.sh script on the YARN worker node, I found that the command there also contained the truncated string (--arg '{\"sample\":{')
Spark Version: 2.3
Hadoop Version: 2.7.3
YARN treats {{ and }} as parameter-expansion markers, so any occurrence is interpreted as an environment variable reference and replaced with the corresponding value. Since there is no matching environment variable here, the trailing }} is lost from the argument.
This causes an issue in cluster deploy mode because the driver runs inside the YARN cluster.
Reference: YarnApplicationConstants
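One possible workaround, which is not part of the answer above, is to avoid adjacent braces in the argument before submitting. The sketch below assumes that YARN only reacts to the exact sequences {{ and }}, and it relies on the fact that whitespace between JSON tokens is insignificant. The helper name `space_out_braces` is hypothetical.

```python
import json


def space_out_braces(json_arg):
    """Insert a space between adjacent braces so no '{{' or '}}' remains.

    Caveat: this is a plain text replacement, so braces inside JSON string
    values would also get a space inserted, which changes those values.
    """
    for pair, spaced in (("{{", "{ {"), ("}}", "} }")):
        # Repeat until no pair is left, so runs like '}}}' are fully split.
        while pair in json_arg:
            json_arg = json_arg.replace(pair, spaced)
    return json_arg


original = '{"sample":{}}'
safe = space_out_braces(original)
print(safe)
# The rewritten argument still parses to the same JSON document.
print(json.loads(safe) == json.loads(original))
```

The rewritten string would then be passed as the application argument in place of the original one.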

Pyspark submit master yarn cluster deploy - logs location

I submitted a PySpark job with the spark-submit command on a Hadoop cluster. The command is as follows:
spark-submit --master yarn --deploy-mode cluster --driver-memory 1g --num-executors 2 --executor-memory 1g --executor-cores 2 --py-files module_stm_extracts.py,module_table_compare.py datacheck.py
The job completed, but I never got the application ID in the console.
How do I find the application log so that I can review it?
You can find it in the YARN ResourceManager web UI, which by default is accessible on port 8088 of the master node: http://<master_node_ip>:8088
Or you can list the applications from the command line:
yarn application -list -appStates ALL
Then, with the application ID, fetch the log with the following command:
yarn logs -applicationId <application_id>
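If you capture the output of the list command in a script, the application IDs can be pulled out with a regular expression, since YARN IDs have the form application_<timestamp>_<sequence>. This is a minimal sketch: the function name `extract_application_ids` is hypothetical, and the sample listing is made up for illustration rather than real output.

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def extract_application_ids(listing):
    """Return the distinct application IDs found in the text, in order."""
    seen = []
    for app_id in APP_ID_PATTERN.findall(listing):
        if app_id not in seen:
            seen.append(app_id)
    return seen


# Invented sample text standing in for the output of the list command.
sample_listing = """
application_1500000000000_0001  datacheck.py  SPARK  FINISHED
application_1500000000000_0002  other-job     SPARK  RUNNING
"""

for app_id in extract_application_ids(sample_listing):
    # Each ID can then be passed to the yarn logs command shown above.
    print(app_id)
```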

scala spark to read file from hdfs cluster

I am learning to develop Spark applications using Scala, and I am taking my very first steps.
I have my Scala IDE on Windows. It is configured and runs smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me, please?
You can use SBT to package your code in a .jar file. Copy the file to your node with scp, then submit it with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine that way.

How to pass external configuration file to pyspark(Spark 2.x) program?

When I run my PySpark program in the interactive shell, I am able to read the configuration file (config.ini) inside the PySpark script.
But when I run the same script using spark-submit with master yarn and deploy mode cluster, it gives me an error saying the config file does not exist. I checked the YARN log and can see the same error. Below is the command for running the PySpark job:
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make the properties file available to the job.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Read it in PySpark:
from pyspark import SparkFiles
with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
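Once the file can be opened on the driver, its contents can be parsed with Python's standard configparser module. The sketch below is independent of Spark so that it runs on its own: the helper `load_settings`, the section name, and the keys are hypothetical, and in a real job the path would come from wherever the file was shipped (for example the value returned by SparkFiles.get).

```python
import configparser
import os
import tempfile


def load_settings(path, section):
    """Parse an ini file and return one section as a plain dict."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse; empty means missing.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return dict(parser[section])


# Demonstration with a throwaway file; section and keys are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    ini_path = os.path.join(tmp_dir, "config.ini")
    with open(ini_path, "w") as handle:
        handle.write("[database]\nhost = example-host\nport = 5432\n")
    print(load_settings(ini_path, "database"))
```

Raising an error when the file is missing makes the "config file does not exist" case explicit instead of silently returning an empty configuration.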

Master must start with yarn,spark

I am getting this error when I try to run the SparkPi example.
beyhan@beyhan:~/spark-1.2.0-bin-hadoop2.4$ /home/beyhan/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --master ego-client --class org.apache.spark.examples.SparkPi /home/beyhan/spark-1.2.0-bin-hadoop2.4/lib/spark-examples-1.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
I have also already started my master in another terminal:
>./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/beyhan/spark-1.2.0-bin-hadoop2.4/sbin/../logs/spark-beyhan-org.apache.spark.deploy.master.Master-1-beyhan.out
Any suggestion ?
Thanks.
Download and extract Spark:
$ cd ~/Downloads
$ wget -c http://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
$ cd /tmp
$ tar zxf ~/Downloads/spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4/
Start master:
$ sbin/start-master.sh
Find the master's URL in the log file that the above command printed. Let's assume that the master is: spark://ego-server:7077
In this case, you can also find your master URL by visiting: http://localhost:8080/
Start one slave, and connect it to master:
$ sbin/start-slave.sh --master spark://ego-server:7077
Another way to check that the master is up and running is to start a shell bound to that master:
$ bin/spark-shell --master "spark://ego-server:7077"
If you get a Spark shell, then everything seems fine.
Now execute your job:
$ find . -name "spark-example*jar"
./lib/spark-examples-1.2.0-hadoop2.4.0.jar
$ bin/spark-submit --master "spark://ego-server:7077" --class org.apache.spark.examples.SparkPi ./lib/spark-examples-1.2.0-hadoop2.4.0.jar
The error you're getting
Error: Master must start with yarn, spark, mesos, or local
means that --master ego-client is not recognized by Spark.
Use
--master local
for a local execution of Spark, or
--master spark://your-spark-master-ip:7077
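The check behind that error message can be pictured as a simple prefix test. This is a simplified illustration based only on the wording of the error above, not Spark's actual validation code; the accepted list may differ in other Spark versions, and the name `is_valid_master` is hypothetical.

```python
# Prefixes taken from the error message quoted above.
ACCEPTED_PREFIXES = ("yarn", "spark", "mesos", "local")


def is_valid_master(master):
    """Return True if the master string starts with an accepted prefix."""
    return master.startswith(ACCEPTED_PREFIXES)


for candidate in ("ego-client", "local", "spark://your-spark-master-ip:7077"):
    print(candidate, is_valid_master(candidate))
```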