Why can I run a Spark app directly in Eclipse without spark-submit?

1. My Spark (standalone) cluster: spmaster, spslave1, spslave2
2. My simple Spark app, which selects some records from MySQL:
public static void main(String[] args) {
    SparkConf conf = new SparkConf()
            .setMaster("spark://spmaster:7077")
            .setAppName("SparkApp")
            // the MySQL driver jar was uploaded to all nodes
            .set("spark.driver.extraClassPath", "/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.24.jar")
            .set("spark.executor.extraClassPath", "/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.24.jar");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
    String url = "jdbc:mysql://192.168.31.43:3306/mytest";
    Map<String, String> options = new HashMap<String, String>();
    options.put("url", url);
    options.put("dbtable", "mytable");
    options.put("user", "root");
    options.put("password", "password");
    DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
    jdbcDF.registerTempTable("mytable"); // register under the name the query below uses
    DataFrame sql = sqlContext.sql("select * from mytable limit 10");
    sql.show(); // show the result on the Eclipse console
    sc.close();
}
3. My question: when I right-click and choose Run As > 'Java Application', it works successfully, and I can find the job on the web UI <spark://spmaster:7077>. I don't understand how this works, or how it differs from using spark-submit.sh.

When we submit an application with spark-submit.sh, spark-submit creates the Spark context (i.e. the driver) for us by default.
But when we use the Java API (JavaSparkContext) to connect to the master, the Java application itself becomes the driver, and all jobs are submitted to the master through this driver.

The spark-submit.sh script is just a wrapper around a ${JAVA_HOME}/bin/java execution command. It sets up the environment details and then runs something like:
${JAVA_HOME}/bin/java -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
When you click Run As > 'Java Application' you are also triggering a java execution command, but without all the environment settings done by spark-submit.sh and with the differences mentioned by @Sheel.
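For comparison, a rough sketch of what that launcher invocation expands to for the app above (the paths and jar name are illustrative assumptions, not the script's exact output):

```shell
# org.apache.spark.launcher.Main builds the final command line for
# org.apache.spark.deploy.SparkSubmit, which then starts the driver JVM:
"$JAVA_HOME/bin/java" -cp "$SPARK_HOME/conf:$SPARK_HOME/lib/*" \
  org.apache.spark.launcher.Main \
  org.apache.spark.deploy.SparkSubmit \
  --master spark://spmaster:7077 \
  --class SparkApp \
  sparkapp.jar
```

Running from Eclipse skips all of this: the IDE launches your main class directly, so any classpath or environment setup normally done by the script must come from the SparkConf instead.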

Related

Spark program taking hadoop configurations from an unspecified location

I have a few test cases, such as reading/writing a file on HDFS, that I want to automate using Scala and run with Maven. I have taken the Hadoop configuration files of the test environment and put them in the resources directory of my Maven project. The project also runs fine against the desired cluster from any machine I run it from.
One thing I don't get is how Spark is picking up the Hadoop configuration from the resources directory even though I have not specified it anywhere in the project. Below is a code snippet from the project.
def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  val hdfsCoreSitePath = new Path("/etc/hadoop/conf/core-site.xml", "core-site.xml")
  val hdfsHDFSSitePath = new Path("/etc/hadoop/conf/hdfs-site.xml", "hdfs-site.xml")
  val hdfsYarnSitePath = new Path("/etc/hadoop/conf/yarn-site.xml", "yarn-site.xml")
  val hdfsMapredSitePath = new Path("/etc/hadoop/conf/mapred-site.xml", "mapred-site.xml")
  hadoopConfiguration.addResource(hdfsCoreSitePath)
  hadoopConfiguration.addResource(hdfsHDFSSitePath)
  hadoopConfiguration.addResource(hdfsYarnSitePath)
  hadoopConfiguration.addResource(hdfsMapredSitePath)
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.setConfiguration(hadoopConfiguration)
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  println("-----------------Logged-in via keytab---------------------")
  FileSystem.get(hadoopConfiguration)
  val sc = new SparkContext(conf)
  sc
}
@Test
def testCase(): Unit = {
  val hadoopConfiguration: Configuration = new Configuration()
  val sc = getSparkContext(hadoopConfiguration)
  // rest of the code
  // ...
}
Here I build a hadoopConfiguration object, but I never pass it to the SparkContext, so the tests run against the cluster I launch the project from rather than against some remote test environment.
Is this not the correct way? Can anyone explain how I should run my Spark test cases against the test environment from some remote cluster?
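A likely explanation for the "mystery" pickup (hedged, since it depends on the build setup): Hadoop's Configuration class loads core-default.xml and core-site.xml from the classpath by default, and Maven copies everything under src/main/resources onto the classpath. A minimal sketch:

```scala
import org.apache.hadoop.conf.Configuration

// core-site.xml placed in src/main/resources ends up on the classpath,
// and new Configuration() loads core-default.xml and core-site.xml from
// the classpath automatically -- no explicit path is needed anywhere.
val conf = new Configuration()
println(conf.get("fs.defaultFS")) // reflects the core-site.xml found on the classpath
```

So the files in the resources directory win (or conflict) silently; to target a specific remote environment, the usual approach is to keep only that environment's *-site.xml files on the test classpath.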

Spark Cassandra tuning

How do I set the following Cassandra write parameters in Spark Scala code, for DataStax Spark Cassandra Connector 1.6.3 and Spark 1.6.2?
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "192.168.123.10")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  .set("spark.cassandra.output.batch.size.rows", "100")
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.output.batch.size.bytes", "100")
  .set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to this readme for more information.
The most flexible way is to put those properties in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read the configuration from spark.conf when creating the Spark context.
That way, you can modify the properties in spark.conf without needing to recompile your code each time.
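As a per-write alternative (a sketch: the keyspace ks, table kv, and rdd here are hypothetical), connector 1.6.x also accepts a WriteConf argument on saveToCassandra, which overrides the global SparkConf keys for that one write:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}

// batchSize corresponds to spark.cassandra.output.batch.size.rows and
// parallelismLevel to spark.cassandra.output.concurrent.writes:
rdd.saveToCassandra("ks", "kv",
  writeConf = WriteConf(batchSize = RowsInBatch(100), parallelismLevel = 100))
```

This is handy when different writes in the same job need different batching behaviour.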

Can I call the SparkContext constructor twice?

I need to do something like the following.
val conf = new SparkConf().setAppName("MyApp")
val master = new SparkContext(conf).master
if (master == "local[*]") { // running locally
  conf.set(...)
  conf.set(...)
} else { // running on a cluster
  conf.set(...)
  conf.set(...)
}
val sc = new SparkContext(conf)
I first check whether I am running in local mode or cluster mode, and set the conf properties accordingly. But just to learn the master, I first have to create a SparkContext object, and after setting the conf properties I obviously create another SparkContext object. Is this fine, or would Spark just ignore my second constructor call? If so, how else can I find out the master (local or cluster mode) before creating the SparkContext object?
Starting multiple contexts at the same time will give an error.
You can get around this by stopping the first context before creating the second. Note that in your snippet the first SparkContext is never assigned to a variable (master is just a String), so you would need to keep a reference to it:
val firstSc = new SparkContext(conf)
val master = firstSc.master
firstSc.stop()
val sc = new SparkContext(conf)
It's unnecessary to do this, though: you can read the master from the SparkConf without ever starting a SparkContext.
conf.get("spark.master")
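Putting that together, the mode check can be done entirely on the SparkConf before any context exists (the property set in each branch is a hypothetical placeholder, not a recommendation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp")
// getOption avoids an exception when spark.master has not been set yet,
// e.g. when the app is launched without spark-submit:
val master = conf.getOption("spark.master").getOrElse("local[*]")
if (master.startsWith("local")) {
  conf.set("spark.sql.shuffle.partitions", "4")   // hypothetical local setting
} else {
  conf.set("spark.sql.shuffle.partitions", "200") // hypothetical cluster setting
}
val sc = new SparkContext(conf) // the only SparkContext ever created
```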

Programmatically setting (remote) master address for launching Spark

Note that the following local setting does work:
sc = new SparkContext("local[8]", testName)
But setting the remote master programmatically does not work:
sc = new SparkContext(master, testName)
or (same end result)
val sconf = new SparkConf()
.setAppName(testName)
.setMaster(master)
sc = new SparkContext(sconf)
In both of the latter cases the result is:
[16:25:33,427][INFO ][AppClient$ClientActor] Connecting to master akka.tcp://sparkMaster@mellyrn:7077/user/Master...
[16:25:33,439][WARN ][ReliableDeliverySupervisor] Association with remote system [akka.tcp://sparkMaster@mellyrn:7077]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
The following command-line approach for setting the Spark master consistently works (verified on multiple projects):
$SPARK_HOME/bin/spark-submit --master spark://mellyrn.local:7077 \
  --class $1 $curdir/sparkclass.jar
Clearly there is some additional configuration happening related to the command line spark-submit. Anyone want to posit what that might be?
In the UNIX shell script below:
SP_MAST_URL=$("$CASSANDRA_HOME"/dse client-tool spark master-address)
echo "$SP_MAST_URL"
This prints the master URL of your Spark cluster environment. You can try this command-line utility provided by DSE and pass the result to the spark-submit command.
Note: CASSANDRA_HOME is the path of the Apache Cassandra (DSE) installation; it can be any UNIX file path, depending on the environment.

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream with Spark on local mode

I have used Spark in yarn-cluster mode before and it has been fine so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven, and then tried to run the app like a normal application.
However, I get the above exception on the very first line, where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two-line app:
val sparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("MLPipeline.AutomatedBinner")
  // .set("spark.default.parallelism", "300")
  // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // .set("spark.kryoserializer.buffer.mb", "256")
  // .set("spark.akka.frameSize", "256")
  // .set("spark.akka.timeout", "1000")
  // .set("spark.akka.threads", "300")
val sc = new SparkContext(sparkConf)
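A common cause (an assumption here, since the pom is not shown) is that the Spark dependency is declared with provided scope, which keeps the Spark jars, and the Hadoop jars they pull in transitively, off the runtime classpath when you launch the app directly. A sketch of the relevant Maven fragment, with illustrative version numbers:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
  <!-- With <scope>provided</scope>, this jar (and Hadoop, transitively)
       is missing at runtime, producing NoClassDefFoundError for
       org/apache/hadoop/fs/FSDataInputStream. Use the default (compile)
       scope when running in local mode as a plain application. -->
</dependency>
```

spark-submit supplies these jars from the Spark installation, which is why the same code works in yarn-cluster mode.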