java.lang.InterruptedException when creating SparkSession in Scala - scala

If I clone this gist: https://gist.github.com/jamiekt/cea2dab3ea8de91489b31045b302e011
and then issue sbt run it fails on the line
val spark = SparkSession.builder()
.config(new SparkConf().setMaster("local[*]"))
.enableHiveSupport()
.getOrCreate()
with error:
Java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
No clue why this might be happening. Anyone got a suggestion?
Scala version is 2.11.12 (see in build.sbt in the gist)
Spark version is 2.3.0 (again, see in build.sbt)
Java Version
$ java -version
java version "1.8.0_161"

The error is because you have not stopped the sparkSession instance created and the instance is removed from memory without being closed as soon as sbt run completes i.e. after the successful completion of your code.
So all you require is
spark.stop()
at the end of the scope where the instance is created as
object Application extends App{
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
//import spark.implicits._
//val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
//val groupBy = Seq("number","word")
//val asAt = LocalDate.now()
//val joinedDf = Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt)).joinDataFramesOnColumns(groupBy)
//joinedDf.show
spark.stop()
}
Just before the
Java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
You must have following message too
ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
which gives clue to the cause of the error.

Related

Spark Local Session with custom maven library

In my scala code, which I run thru sbt run command I am creating local spark session and I need to make use of following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
.master("local")
.appName("RandomForestClassifierExample")
.getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
.setEventHubName("energinet")
.build
val eventHubsConf = EventHubsConf(connectionString)
.setStartingPosition(EventPosition.fromEndOfStream)
.setConsumerGroup("$default")
val eventhubs = spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
Of course it fails, because of missing event hubs library. I know I can run spark-submit and pull the library by setting --packages parameter, however I want to run my app using sbt run command. Please is there a way, how to make the library available for local spark sessions I create from scala code?

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When attempting to read the files from the cluster by running the code within spark-shell it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is spark 2.2 on HDI 3.6
In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR;
If you are using RDD through spark context you can tell Hadoop Configuration where to find the implementation of your org.apache.hadoop.fs.adl.AdlFileSystem.
The key come in the format fs.<fs-prefix>.impl, and the value is a full class name that implements the class org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
Well, I found if I package the jar and spark-submit it that it works fine so that will work for the mean time. I'm still surprised it would not work in local[*] mode though.

How to load a local file with a local spark session

I'm running a local spark session on my mac via my intellij sbt console and I get a
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/myuser/Documents/data/dataset.csv; error.
my current code looks like this:
val data = spark.read.csv("file:///Users/myuser/Documents/data/dataset.csv")
I've also tried:
val data = spark.read.csv("/Users/myuser/Documents/data/dataset.csv")
my spark session looks like this
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("avro_test")
.getOrCreate()
}
}
I know this is the same issue as the one found here: How to load local file in sc.textFile, instead of HDFS
but none of the answers here (and others i've looked at) are helping me or else i'm not fully understanding them. any suggestions?

How to use SparkSession and StreamingContext together?

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream on which I need to impose a CSV schema. Would appreciate any clarifications on how to get this working.
You cannot have a spark session and spark context together. With the release of Spark 2.0.0 there is a new abstraction available to developers - the Spark Session - which can be instantiated and called upon just like the Spark Context that was previously available.
You can still access spark context from the spark session builder:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
One more thing that is causing your job to fail is you are performing the transformation and no action is called. Some action should be called in the end such as inputDF.show()

Error instantiating 'org.apache.spark.sql.hive.HiveSessionState': on Linux server

I have a Scala Spark application that I'm trying to run on a Linux server using a shell script. I am getting the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error
while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
However, I don't understand what is wrong. I am doing this to instantiate Spark:
val sparkConf = new SparkConf().setAppName("HDFStoES").setMaster("local")
val spark: SparkSession = SparkSession.builder.enableHiveSupport().config(sparkConf).getOrCreate()
Am I doing this correctly, if so what could be the error?
sparkSession = SparkSession.builder().appName("Test App").master("local[*])
.config("hive.metastore.warehouse.dir", hiveWareHouseDir)
.config("spark.sql.warehouse.dir", hiveWareHouseDir).enableHiveSupport().getOrCreate();
Use above, you need to specify the "hive.metastore.warehouse.dir" directory to enable hive support in spark session.