Spark Local Session with custom maven library - scala

In my Scala code, which I run with the sbt run command, I create a local Spark session, and I need to use the following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
  .master("local")
  .appName("RandomForestClassifierExample")
  .getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
  .setEventHubName("energinet")
  .build
val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)
  .setConsumerGroup("$default")
val eventhubs = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
Of course it fails, because the Event Hubs library is missing. I know I can run spark-submit and pull the library in with the --packages parameter, but I want to run my app with the sbt run command. Is there a way to make the library available to local Spark sessions that I create from Scala code?
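One possible approach, as a sketch rather than a confirmed fix: since sbt run builds its classpath from the project's declared dependencies, the connector can be listed in build.sbt instead of being passed through --packages. The connector coordinates below come from the question; the Scala and Spark versions are assumptions and should be replaced with the ones actually in use.

```scala
// build.sbt (sketch; every version except the connector's is an assumption)

// Must be a 2.12.x release so that %% resolves the _2.12 artifacts
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // Spark itself; assumed version, pick the one you develop against
  "org.apache.spark" %% "spark-sql" % "2.4.8",
  // Connector from the question; %% appends _2.12 based on scalaVersion
  "com.microsoft.azure" %% "azure-eventhubs-spark" % "2.3.17"
)
```

With the dependency on the compile classpath, the import of org.apache.spark.eventhubs._ and the "eventhubs" format should both resolve when the session is created from sbt run, provided the connector release is compatible with the chosen Spark version.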

Related

NoSuchMethodError in google dataproc cluster for excel files

While reading an Excel file on a Dataproc cluster, I get a java.lang.NoSuchMethodError.
Note: the schema is printed, but not the actual data.
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o74.showString.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:74)
Code:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from google.cloud import storage
from google.cloud import bigquery
import pyspark
client = storage.Client()
bucket_name = "test_bucket"
path=f"gs://{bucket_name}/test_file.xlsx"
def make_spark_session(app_name, jars=[]):
    configuration = (SparkConf()
                     .set("spark.jars", ','.join(jars)))
    spark = SparkSession.builder.appName(app_name) \
        .config(conf=configuration).getOrCreate()
    return spark
app_name = 'test_app'
jars = ['gs://bucket/spark-excel_2.11_uber-0.12.0.jar']
spark = make_spark_session(app_name,jars)
df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader","true") \
    .load(path)
df.show(1)
This appears to be a Scala version mismatch between your job jars and the cluster. Both Dataproc 1.5 and 2.0 come with Scala 2.12. The gs://bucket/spark-excel_2.11_uber-0.12.0.jar in your code is built for Scala 2.11, so you should use a spark-excel_2.12_... jar instead. In addition, make sure your Spark application is also built with Scala 2.12.

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When attempting to read the files from the cluster by running the code within spark-shell it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is spark 2.2 on HDI 3.6
In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR;
If you are using RDDs through the Spark context, you can tell the Hadoop configuration which class implements the file system for the adl scheme.
The key has the format fs.<fs-prefix>.impl, and the value is the fully qualified name of a class that extends org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
Well, I found that if I package the jar and spark-submit it, it works fine, so that will do for the meantime. I'm still surprised it would not work in local[*] mode, though.
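A possible explanation for the difference between sbt run and spark-submit, offered as an assumption rather than something confirmed in this thread: dependencies marked Provided are left off the classpath that sbt run uses, so the hadoop-azure-datalake classes would not be visible there even though they are on the cluster. A commonly used workaround is to make run use the compile classpath, sketched here in sbt 1.x syntax:

```scala
// build.sbt (sketch, sbt 1.x slash syntax)

// Dependency as suggested in the answer above
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided

// Make `sbt run` use the compile classpath, which includes Provided dependencies
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```

This keeps the dependency out of a packaged jar while still letting a local run load the adl file system classes.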

How to load a local file with a local spark session

I'm running a local Spark session on my Mac via my IntelliJ sbt console, and I get this error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/myuser/Documents/data/dataset.csv;
My current code looks like this:
val data = spark.read.csv("file:///Users/myuser/Documents/data/dataset.csv")
I've also tried:
val data = spark.read.csv("/Users/myuser/Documents/data/dataset.csv")
My Spark session looks like this:
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("avro_test")
      .getOrCreate()
  }
}
I know this is the same issue as the one found here: How to load local file in sc.textFile, instead of HDFS
but none of the answers there (or the others I've looked at) are helping me, or else I'm not fully understanding them. Any suggestions?
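One way to narrow this down, as a diagnostic sketch and not a confirmed fix: check from plain JVM code whether the path is visible to the process at all, before Spark is involved. The object and method names below (LocalPathCheck, existingFileUri) are made up for illustration; only java.nio.file from the standard library is used.

```scala
import java.nio.file.{Files, Paths}

object LocalPathCheck {
  // Returns a file: URI string if the path exists on this machine's
  // file system as seen by the current JVM, otherwise None.
  def existingFileUri(rawPath: String): Option[String] = {
    val path = Paths.get(rawPath).toAbsolutePath.normalize()
    if (Files.exists(path)) Some(path.toUri.toString) else None
  }

  def main(args: Array[String]): Unit = {
    // Path taken from the question
    val candidate = "/Users/myuser/Documents/data/dataset.csv"
    existingFileUri(candidate) match {
      case Some(uri) =>
        println(s"Found; this URI can be passed to spark.read.csv: $uri")
      case None =>
        val cwd = Paths.get("").toAbsolutePath
        println(s"Not visible to this JVM: $candidate (working directory: $cwd)")
    }
  }
}
```

If the check reports the file as missing, the problem is likely a typo in the path, a different user directory, or a permissions issue rather than anything in the Spark session itself.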

How spark-shell or Zeppelin notebook set HiveContext to SparkSession?

Does anyone know why I can access an existing Hive table from spark-shell or a Zeppelin notebook by doing this
val df = spark.sql("select * from hive_table")
But when I submit a spark jar with a spark object created this way,
val spark = SparkSession
  .builder()
  .appName("Yet another spark app")
  .config("spark.sql.shuffle.partitions", 18)
  .config("spark.executor.memory", "2g")
  .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
I got this
Table or view not found
What I really want is to understand what the shell and the notebooks do for us in order to provide a Hive context to the SparkSession.
When working with Hive, one must instantiate SparkSession with Hive support.
You need to call enableHiveSupport() on the session builder.
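Applied to the builder from the question, that could look like the sketch below. It assumes the spark-hive module is on the application's classpath and that the cluster's Hive configuration (for example hive-site.xml) is visible to the job; the object name HiveEnabledApp is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object HiveEnabledApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Yet another spark app")
      .config("spark.sql.shuffle.partitions", 18L)
      // Use the Hive metastore as the catalog instead of the in-memory one
      .enableHiveSupport()
      .getOrCreate()

    // Table name taken from the question
    spark.sql("select * from hive_table").show()

    spark.stop()
  }
}
```

As far as I know, spark-shell and Zeppelin create their session with Hive support when the Hive classes are available, which would explain why the same query works there without any extra setup.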

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, and I am working on ETL programming.
My ETL proof-of-concept project requires Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system in /usr/local/Cassandra/*, and after that I installed Spark and Scala. I am now using the Scala editor, and I have written a simple job that loads a file from a landing location. However, when I try to connect to Cassandra from Scala, I can't find any help on how to connect and process the data in the destination database.
Can anyone tell me whether this is the correct approach, or where I am going wrong? Please help me achieve this with the combination above.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt build by following its instructions, then work this way.
Import this in your file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077") // Spark master URL
      // This setting is expressed in MB; 67108864 looks like a byte count for 64 MB
      .set("spark.cassandra.input.split.size_in_mb", "67108864")
      .set("spark.cassandra.connection.host", "192.168.3.167") // Cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")
    // Create a Spark session that carries the Cassandra connection settings
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    // Load the Cassandra table into a DataFrame (table name first, then keyspace)
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2.
You can do this easily with spark-cassandra-connector.