Spark program taking Hadoop configuration from an unspecified location - Scala

I have a few test cases, such as reading and writing a file on HDFS, that I want to automate using Scala and run with Maven. I have taken the Hadoop configuration files of the test environment and put them in the resources directory of my Maven project. The project also runs fine against the desired cluster from whichever cluster I run it on.
One thing I don't understand is how Spark picks up the Hadoop configuration from the resources directory even though I have not specified it anywhere in the project. Below is a code snippet from the project.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")

  val hdfsCoreSitePath   = new Path("/etc/hadoop/conf/core-site.xml", "core-site.xml")
  val hdfsHDFSSitePath   = new Path("/etc/hadoop/conf/hdfs-site.xml", "hdfs-site.xml")
  val hdfsYarnSitePath   = new Path("/etc/hadoop/conf/yarn-site.xml", "yarn-site.xml")
  val hdfsMapredSitePath = new Path("/etc/hadoop/conf/mapred-site.xml", "mapred-site.xml")

  hadoopConfiguration.addResource(hdfsCoreSitePath)
  hadoopConfiguration.addResource(hdfsHDFSSitePath)
  hadoopConfiguration.addResource(hdfsYarnSitePath)
  hadoopConfiguration.addResource(hdfsMapredSitePath)

  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.setConfiguration(hadoopConfiguration)
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  println("-----------------Logged-in via keytab---------------------")

  FileSystem.get(hadoopConfiguration)
  val sc = new SparkContext(conf)
  sc
}
@Test
def testCase(): Unit = {
  val hadoopConfiguration: Configuration = new Configuration()
  val sc = getSparkContext(hadoopConfiguration)
  // rest of the code
  // ...
  // ...
}
Here I have created a hadoopConfiguration object, but I am not passing it to the SparkContext anywhere, so the tests run on the cluster from which I am running the project and not on the remote test environment.
Is this not the correct way? Can anyone explain how I should achieve my goal of running Spark test cases against the test environment from a remote cluster?
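Note on the first part of the question: Hadoop's Configuration class registers core-site.xml as a default resource and loads it from the classpath, and Spark builds its Hadoop configuration on top of a plain new Configuration(), so site files placed in the test resources directory are found without being referenced anywhere. A minimal sketch to verify this (assuming core-site.xml sits in src/test/resources and ends up on the test classpath):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()              // loads core-default.xml and core-site.xml from the classpath
println(conf.getResource("core-site.xml"))  // URL of the copy that was actually found (the test resource here)
println(conf.get("fs.defaultFS"))           // value taken from that classpath copy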

Related

Can I call the SparkContext constructor twice?

I need to do something like the following.
val conf = new SparkConf().setAppName("MyApp")
val master = new SparkContext(conf).master

if (master == "local[*]") {
  // running locally
  conf.set(...)
  conf.set(...)
} else {
  // running on a cluster
  conf.set(...)
  conf.set(...)
}

val sc = new SparkContext(conf)
I first check whether I am running in local mode or cluster mode, and set the conf properties accordingly. But just to find out the master, I first have to create a SparkContext object, and after setting the conf properties I obviously create another SparkContext object. Is this fine, or would Spark just ignore my second constructor? If so, how else can I find out the master (whether local or cluster mode) before creating the SparkContext object?
Starting multiple contexts at the same time will give an error.
You can get around this by stopping the first context before creating the second. Note that you need a reference to the context itself (say, firstContext); in your snippet master is only the string returned by .master.
firstContext.stop()
val sc = new SparkContext(conf)
It's silly to do this, though; you can get the master from the SparkConf without needing to start a SparkContext at all:
conf.get("spark.master")

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B, and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster and they run fine.
I need to write code and run it from IntelliJ on my laptop. How do I connect and run it? I know I am making a mistake in the code; I have used placeholder values and need help writing the specific code. For example, localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}
val conf = new SparkConf().setAppName("APP_NAME")
.setMaster("local")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.auth.username", "")
.set("spark.cassandra.auth.password", "")
Use the code above to connect to local Spark and Cassandra. If your Cassandra cluster has authentication enabled, fill in the username and password.
If you want to connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host, and in setMaster use spark://SPARK_HOST:7077 (see the sketch below).
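A sketch of that remote variant (SPARK_HOST and CASSANDRA_HOST are placeholders, and the spark-cassandra-connector must be on the classpath for cassandraTable to resolve):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("APP_NAME")
  .setMaster("spark://SPARK_HOST:7077")                     // standalone master URL
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST") // any Cassandra contact point

val sc = new SparkContext(conf)
val data = sc.cassandraTable("my_keyspace", "my_table")     // provided by the connector import
println(data.count())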

Running Sparkling-Water with external H2O backend

I was following the steps for running Sparkling Water with an external backend from here. I am using Spark 1.4.1 and sparkling-water-1.4.16; I've built the extended H2O jar and exported the H2O_ORIGINAL_JAR and H2O_EXTENDED_JAR environment variables. I start the H2O backend with
java -jar $H2O_EXTENDED_JAR -md5skip -name test
But when I start sparkling water via
./bin/sparkling-shell
and in it try to get the H2OConf with
import org.apache.spark.h2o._
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(sc, conf)
it fails on the second line with
<console>:24: error: trait H2OConf is abstract; cannot be instantiated
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
^
I've tried adding the newly built extended H2O jar with the --jars parameter to either sparkling-shell or standalone Spark, with no progress. Does anyone have any hints?
This is unsupported for versions of Spark earlier than 2.0.
Download the latest version of the Sparkling Water jar and add it when starting sparkling-shell:
./bin/sparkling-shell --master yarn-client --jars "<path to the jar>"
Then run the code, setting the path to the extended H2O driver:
import org.apache.spark.h2o._
val conf = new H2OConf(spark)
  .setExternalClusterMode()
  .useAutoClusterStart()
  .setH2ODriverPath("//home//xyz//sparkling-water-2.2.5/bin//h2odriver-sw2.2.5-hdp2.6-extended.jar")
  .setNumOfExternalH2ONodes(2)
  .setMapperXmx("6G")
val hc = H2OContext.getOrCreate(spark, conf)
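For comparison, a manual-start sketch that uses only the methods already shown in the question; it assumes the external H2O cloud named test was started beforehand with the extended jar (java -jar $H2O_EXTENDED_JAR -md5skip -name test):

import org.apache.spark.h2o._

val conf = new H2OConf(spark)
  .setExternalClusterMode()
  .useManualClusterStart()
  .setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)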

Why can I run a Spark app in Eclipse directly without spark-submit

1. My Spark (standalone) cluster: spmaster, spslave1, spslave2
2. My simple Spark app, which selects some records from MySQL:
public static void main(String[] args) {
    SparkConf conf = new SparkConf()
            .setMaster("spark://spmaster:7077")
            .setAppName("SparkApp")
            // the MySQL driver jar was uploaded to all nodes
            .set("spark.driver.extraClassPath", "/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.24.jar")
            .set("spark.executor.extraClassPath", "/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.24.jar");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    String url = "jdbc:mysql://192.168.31.43:3306/mytest";
    Map<String, String> options = new HashMap<String, String>();
    options.put("url", url);
    options.put("dbtable", "mytable");
    options.put("user", "root");
    options.put("password", "password");

    DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
    jdbcDF.registerTempTable("mytable");

    DataFrame sql = sqlContext.sql("select * from mytable limit 10");
    sql.show(); // show the result on the Eclipse console
    sc.close();
}
3. My question: when I right-click and run as 'Java Application', it works successfully, and I can find the job on the web UI (spark://spmaster:7077). I don't understand how this works, and what the difference is compared with using spark-submit.sh.
When we use spark-submit.sh to submit an application, spark-submit creates the Spark context (i.e., the driver) for us by default.
But when we use the Java API (JavaSparkContext) to connect to the master, the Java application itself becomes the driver, and all jobs are submitted to the master through that driver.
The spark-submit.sh script is just a wrapper around a ${JAVA_HOME}/bin/java execution command. It sets up the environment details and then runs something like:
${JAVA_HOME}/bin/java -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
When you run as 'Java Application' you are also triggering a java execution command, but without all the environment setup done by spark-submit.sh, and with the differences mentioned by @Sheel.
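To make the point about the application becoming the driver concrete, here is a minimal Scala sketch (the object name is made up; the master URL mirrors the question). Launching this class with a plain java command, or from the IDE, makes that JVM the driver, which registers with the standalone master directly:

import org.apache.spark.{SparkConf, SparkContext}

object IdeDriverExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://spmaster:7077") // connect straight to the standalone master
      .setAppName("IdeDriverExample")
    val sc = new SparkContext(conf)       // this JVM is now the driver
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}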

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream with Spark on local mode

I have used Spark before in yarn-cluster mode and it's been good so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven, and then tried to run the app like a normal application.
But I get the above exception on the very first line, where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two-line app:
val sparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("MLPipeline.AutomatedBinner")
  // .set("spark.default.parallelism", "300")
  // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // .set("spark.kryoserializer.buffer.mb", "256")
  // .set("spark.akka.frameSize", "256")
  // .set("spark.akka.timeout", "1000")
  // .set("spark.akka.threads", "300")
val sc = new SparkContext(sparkConf)
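One common cause worth checking (an assumption, not a confirmed diagnosis for this project): FSDataInputStream lives in hadoop-common, which spark-core pulls in transitively, so the error usually means the Spark and Hadoop jars are missing from the runtime classpath, typically because the Spark dependency was declared with "provided" scope. That scope is fine under spark-submit, which supplies those jars itself, but not when the app is run as a plain application. Expressed as an sbt sketch (the Maven equivalent is the <scope> element; the version is a placeholder):

// compile scope: Spark and its Hadoop dependencies are on the classpath when run from the IDE
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"

// "provided" only works under spark-submit, which supplies the jars itself
// libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"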