How to load a local file with a local spark session - scala

I'm running a local spark session on my mac via my intellij sbt console and I get a
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/myuser/Documents/data/dataset.csv; error.
my current code looks like this:
val data = spark.read.csv("file:///Users/myuser/Documents/data/dataset.csv")
I've also tried:
val data = spark.read.csv("/Users/myuser/Documents/data/dataset.csv")
my spark session looks like this
import org.apache.spark.sql.SparkSession
trait SparkSessionWrapper {
lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("avro_test")
.getOrCreate()
}
}
I know this is the same issue as the one found here: How to load local file in sc.textFile, instead of HDFS
but none of the answers here (and others i've looked at) are helping me or else i'm not fully understanding them. any suggestions?

Related

Hdinsight Spark Session issue with Parquet

Using HDinsight to run spark and a scala script.
I'm using the example scripts provided by the Azure plugin in intellij.
It provides me with the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
Fair enough. And I can do things like:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
and I can save files:
rdd1.saveAsTextFile("wasb:///HVACout2")
However, I am looking to load in a parquet file. The code I have found (elsewhere) for parquet files coming in is:
val df = spark.read.parquet("resources/Parquet/MyFile.parquet/")
Line above gives an error on this in HDinsight (when I submit the jar via intellij).
Why don't you use?:
val spark = SparkSession.builder
.master("local[*]") // adjust accordingly
.config("spark.sql.warehouse.dir", "E:/Exp/") //change accordingly
.appName("MySparkSession") //change accordingly
.getOrCreate()
When I put in spark session and get rid of spark context, HD insight breaks.
What am I doing wrong?
How using HdInsight do I go about creating either a spark session or context, that allows me to read in text files, parquet and all the rest? How do I get the best of both worlds
My understanding is SparkSession, is the better and more recent way. And what we should be using. So how do I get it running in HDInsight?
Thanks in advance
Turns out if I add
val spark = SparkSession.builder().appName("Spark SQL basic").getOrCreate()
After the spark context line and before the parquet, read part, it works.

Spark Local Session with custom maven library

In my scala code, which I run thru sbt run command I am creating local spark session and I need to make use of following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
.master("local")
.appName("RandomForestClassifierExample")
.getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
.setEventHubName("energinet")
.build
val eventHubsConf = EventHubsConf(connectionString)
.setStartingPosition(EventPosition.fromEndOfStream)
.setConsumerGroup("$default")
val eventhubs = spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
Of course it fails, because of missing event hubs library. I know I can run spark-submit and pull the library by setting --packages parameter, however I want to run my app using sbt run command. Please is there a way, how to make the library available for local spark sessions I create from scala code?

Spark on HDInsights - No FileSystem for scheme: adl

I am writing an application that processes files from ADLS. When attempting to read the files from the cluster by running the code within spark-shell it has no problem accessing the files. However, when I attempt to sbt run the project on the cluster it gives me:
[error] java.io.IOException: No FileSystem for scheme: adl
implicit val spark = SparkSession.builder().master("local[*]").appName("AppMain").getOrCreate()
import spark.implicits._
val listOfFiles = spark.sparkContext.binaryFiles("adl://adlAddressHere/FolderHere/")
val fileList = listOfFiles.collect()
This is spark 2.2 on HDI 3.6
In your build.sbt add:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure-datalake" % "2.8.0" % Provided
I use Spark 2.3.1 instead of 2.2. That version works well with hadoop-azure-datalake 2.8.0.
Then, configure your spark context:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hadoopConf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl")
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hadoopConf.set("dfs.adls.oauth2.client.id", clientId)
hadoopConf.set("dfs.adls.oauth2.credential", clientSecret)
hadoopConf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
TL;DR;
If you are using RDD through spark context you can tell Hadoop Configuration where to find the implementation of your org.apache.hadoop.fs.adl.AdlFileSystem.
The key come in the format fs.<fs-prefix>.impl, and the value is a full class name that implements the class org.apache.hadoop.fs.FileSystem.
In your case, you need fs.adl.impl which is implemented by org.apache.hadoop.fs.adl.AdlFileSystem.
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
I usually work with Spark SQL, so I need to configure spark session too:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", clientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", s"https://login.microsoftonline.com/$tenantId/oauth2/token")
Well, I found if I package the jar and spark-submit it that it works fine so that will work for the mean time. I'm still surprised it would not work in local[*] mode though.

How to use SparkSession and StreamingContext together?

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream on which I need to impose a CSV schema. Would appreciate any clarifications on how to get this working.
You cannot have a spark session and spark context together. With the release of Spark 2.0.0 there is a new abstraction available to developers - the Spark Session - which can be instantiated and called upon just like the Spark Context that was previously available.
You can still access spark context from the spark session builder:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
One more thing that is causing your job to fail is you are performing the transformation and no action is called. Some action should be called in the end such as inputDF.show()

Spark-Scala with Cassandra

I am beginner with Spark, Scala and Cassandra. I am working with ETL programming.
Now my project ETL POCs required Spark, Scala and Cassandra. I configured Cassandra with my ubuntu system in /usr/local/Cassandra/* and after that I installed Spark and Scala. Now I am using Scala editor to start my work, I created simply load a file in landing location, but after that I am trying to connect with cassandra in scala but I am not getting an help how we can connect and process the data in destination database?.
Any one help me Is this correct way? or some where I am wrong? please help me to how we can achieve this process with above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt by reading instruction, then work this way
Import this in your file
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
spark scala file
object SparkCassandraConnector {
def main(args: Array[String]) {
val conf = new SparkConf(true)
.setAppName("UpdateCassandra")
.setMaster("spark://spark:7077") // spark server
.set("spark.cassandra.input.split.size_in_mb","67108864")
.set("spark.cassandra.connection.host", "192.168.3.167") // cassandra host
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
// connecting with cassandra for spark and sql query
val spark = SparkSession.builder()
.config(conf)
.getOrCreate()
// Load data from node publish table
val df = spark
.read
.cassandraFormat( "table_nmae", "keyspace_name")
.load()
}
}
This will work for spark 2.2 and cassandra 2
you can perform this easly with spark-cassandra-connector