Running Sparkling-Water with external H2O backend - scala

I was following the steps for running Sparkling Water with an external backend from here. I am using Spark 1.4.1 and sparkling-water-1.4.16. I've built the extended H2O jar and exported the H2O_ORIGINAL_JAR and H2O_EXTENDED_JAR environment variables. I start the H2O backend with
java -jar $H2O_EXTENDED_JAR -md5skip -name test
But when I start Sparkling Water via
./bin/sparkling-shell
and in it try to get the H2OConf with
import org.apache.spark.h2o._
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(sc, conf)
it fails on the second line with
<console>:24: error: trait H2OConf is abstract; cannot be instantiated
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
^
I've tried adding the newly built extended H2O jar with the --jars parameter, both to Sparkling Water and to standalone Spark, with no progress. Does anyone have any hints?

The external H2O backend is unsupported for versions of Spark earlier than 2.0.

Download the latest version of the Sparkling Water jar and add it when starting the sparkling-shell:
./bin/sparkling-shell --master yarn-client --jars "<path to the jar located>"
Then set the path to the extended H2O driver in the configuration and run the code:
import org.apache.spark.h2o._
val conf = new H2OConf(spark).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("//home//xyz//sparkling-water-2.2.5/bin//h2odriver-sw2.2.5-hdp2.6-extended.jar").setNumOfExternalH2ONodes(2).setMapperXmx("6G")
val hc = H2OContext.getOrCreate(spark, conf)

Related

Spark Local Session with custom maven library

In my Scala code, which I run through the sbt run command, I create a local Spark session, and I need to use the following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
  .master("local")
  .appName("RandomForestClassifierExample")
  .getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
  .setEventHubName("energinet")
  .build
val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)
  .setConsumerGroup("$default")
val eventhubs = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
Of course it fails, because the Event Hubs library is missing. I know I can run spark-submit and pull in the library with the --packages parameter, but I want to run my app with the sbt run command. Is there a way to make the library available to local Spark sessions that I create from Scala code?
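One possible approach, sketched under assumptions: sbt run builds its classpath from libraryDependencies, so declaring the connector in build.sbt can play the role that --packages plays for spark-submit. The Maven coordinates below are the ones from the question; scalaVersion and sparkVersion are assumptions and should be matched to your own setup.

```scala
// build.sbt (a sketch; scalaVersion and sparkVersion are assumptions, not from the question)
scalaVersion := "2.12.10"

val sparkVersion = "2.4.8" // assumption: replace with the Spark version you actually use

libraryDependencies ++= Seq(
  // Spark itself has to be on the run classpath for `sbt run`, so it is not marked "provided" here
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  // Coordinates from the question: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
  // A single % is used because the artifact name already carries the _2.12 suffix
  "com.microsoft.azure" % "azure-eventhubs-spark_2.12" % "2.3.17"
)
```

With the dependency declared this way, the import org.apache.spark.eventhubs._ and format("eventhubs") in the code above should resolve when the app is started with sbt run, provided the connector release is compatible with the chosen Spark version.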

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When attempting to read the files from the cluster by running the code within spark-shell it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is Spark 2.2 on HDI 3.6.

In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR:
If you are using RDDs through the Spark context, you can tell the Hadoop configuration where to find the file system implementation, org.apache.hadoop.fs.adl.AdlFileSystem.
The key comes in the format fs.<fs-prefix>.impl, and the value is the fully qualified name of a class that extends org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl, which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure the Spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")

Well, I found that if I package the jar and spark-submit it, it works fine, so that will do for the meantime. I'm still surprised it would not work in local[*] mode, though.

Spark program taking hadoop configurations from an unspecified location

I have a few test cases, such as reading/writing a file on HDFS, that I want to automate using Scala and run using Maven. I have taken the Hadoop configuration files of the test environment and put them in the resources directory of my Maven project. The project also runs fine against the desired cluster from whichever cluster I run it from.
One thing I don't understand is how Spark picks up the Hadoop configuration from the resources directory even though I have not specified it anywhere in the project. Below is a code snippet from the project.
def getSparkContext(hadoopConfiguration: Configuration): SparkContext ={
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  val hdfsCoreSitePath = new Path("/etc/hadoop/conf/core-site.xml","core-site.xml")
  val hdfsHDFSSitePath = new Path("/etc/hadoop/conf/hdfs-site.xml","hdfs-site.xml")
  val hdfsYarnSitePath = new Path("/etc/hadoop/conf/yarn-site.xml","yarn-site.xml")
  val hdfsMapredSitePath = new Path("/etc/hadoop/conf/mapred-site.xml","mapred-site.xml")
  hadoopConfiguration.addResource(hdfsCoreSitePath)
  hadoopConfiguration.addResource(hdfsHDFSSitePath)
  hadoopConfiguration.addResource(hdfsYarnSitePath)
  hadoopConfiguration.addResource(hdfsMapredSitePath)
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.setConfiguration(hadoopConfiguration)
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  println("-----------------Logged-in via keytab---------------------")
  FileSystem.get(hadoopConfiguration)
  val sc=new SparkContext(conf)
  return sc
}
@Test
def testCase(): Unit = {
  var hadoopConfiguration: Configuration = new Configuration()
  val sc=getSparkContext(hadoopConfiguration)
  //rest of the code
  //...
  //...
}
Here, I have used a hadoopConfiguration object, but I am not passing it to the SparkContext anywhere, as this will run the tests on the cluster which I am using for running the project and not on some remote test environment.
If this is not the correct way, can anyone please explain how I should go about running Spark test cases against the test environment from a remote cluster?

How to use mesos master url in a self-contained Scala Spark program

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through mesos.
I create the Spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to specify MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But, this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?

Add the spark-mesos jar to your classpath.
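A sketch of what that could look like in build.sbt, assuming a Spark release in which Mesos support ships as a separate spark-mesos artifact (to my knowledge, 2.1 and later). A spark-submit distribution typically bundles that module, while a build that depends only on spark-core does not, which would explain why the mesos:// master URL cannot be parsed under sbt run. The sparkVersion value is an assumption and should match your cluster.

```scala
// build.sbt (a sketch; sparkVersion is an assumption, not from the question)
val sparkVersion = "2.1.0" // assumption: use the version that matches your cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  // Mesos scheduler support, published as its own module in newer Spark releases
  "org.apache.spark" %% "spark-mesos" % sparkVersion
)
```

MESOS_NATIVE_JAVA_LIBRARY still has to point at libmesos when running sbt run, as in the command shown in the question.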

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream with Spark on local mode

I have used Spark before in yarn-cluster mode and it's been good so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven, and then tried to run the app like a normal application.
However, I get the above exception on the very first line, where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two line app:
val sparkConf = new SparkConf().setMaster("local").setAppName("MLPipeline.AutomatedBinner")//.set("spark.default.parallelism", "300").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.mb", "256").set("spark.akka.frameSize", "256").set("spark.akka.timeout", "1000") //.set("spark.akka.threads", "300")//.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") //.set("spark.akka.timeout", "1000")
val sc = new SparkContext(sparkConf)