Scala: Cannot resolve symbol AnalysisException

Scala newbie here. I am trying to catch some exceptions while reading files from S3 using Spark, and I want my code to do nothing if a "Path does not exist" exception occurs. For that I have code like the following:
import org.apache.spark.sql.{Row, SQLContext, SparkSession, AnalysisException}

var partnerData: org.apache.spark.sql.DataFrame = null
if (fileType == "cssv") {
  try {
    partnerData = spark.read
      .format("csv")
      .schema(inputSchema)
      .load(inputPath)
  } catch (AnalysisException e) {
  }
} else {
  partnerData = spark.read
    .format("parquet")
    .load(inputPath)
}
Although I have imported AnalysisException, I keep seeing this message:
Cannot resolve symbol AnalysisException
Anything that I'm missing here? I am planning to detect the exception type first and then decide what to do next based on the message text. I'd be grateful if you could point me in the right direction in case this doesn't make sense; I've read multiple threads and haven't been able to find a better solution. This is much easier to do in Python, though.

Your catch syntax is wrong; in Scala it should be:
try {
  ???
} catch {
  case e: AnalysisException => // handler for AnalysisException
  case _ => // default handler
}
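Applied to your snippet, a minimal sketch (assuming spark, fileType, inputSchema and inputPath are in scope as in your code) could look like this; on an AnalysisException it deliberately does nothing, leaving partnerData as null:

import org.apache.spark.sql.{AnalysisException, DataFrame}

var partnerData: DataFrame = null
if (fileType == "csv") {
  try {
    partnerData = spark.read
      .format("csv")
      .schema(inputSchema)
      .load(inputPath)
  } catch {
    case e: AnalysisException =>
      // path does not exist: deliberately do nothing, partnerData stays null
  }
} else {
  partnerData = spark.read
    .format("parquet")
    .load(inputPath)
}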

Related

Issue with try-finally in Scala

I have the following Scala code:
val file = new FileReader("myfile.txt")
try {
  // do operations on file
} finally {
  file.close() // close the file
}
How do I handle the FileNotFoundException thrown when I read the file? If I put that line inside the try block, I am not able to access the file variable inside finally.
For Scala 2.13:
You can simply use Using to acquire a resource and release it automatically, without manual error handling, as long as it is an AutoCloseable:
import java.io.FileReader
import scala.util.{Try, Using}

val newStyle: Try[String] = Using(new FileReader("myfile.txt")) { reader: FileReader =>
  // do something with the reader
  "something"
}
newStyle
// will be
// Failure(java.io.FileNotFoundException: myfile.txt (No such file or directory))
// if the file is not found, or a Success with the computed value if nothing fails
For Scala 2.12:
You can wrap the reader creation in scala.util.Try; if it fails during creation, you will get a Failure with the FileNotFoundException inside.
import java.io.FileReader
import scala.util.Try

val oldStyle: Try[String] = Try {
  val file = new FileReader("myfile.txt")
  try {
    // do operations on file
    "something"
  } finally {
    file.close() // close the file
  }
}
oldStyle
// will be
// Failure(java.io.FileNotFoundException: myfile.txt (No such file or directory))
// or a Success with the result of your file reading inside
I recommend avoiding try ... catch blocks in Scala code. They are not type-safe in some cases and can lead to non-obvious results, but for releasing a resource in older Scala versions, try-finally is the only way to do it.
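If you specifically want to react to a missing file, one sketch (building on the newStyle value from the Using example above) is to pattern match on the resulting Try:

import java.io.FileNotFoundException
import scala.util.{Failure, Success}

newStyle match {
  case Success(result)                   => println(s"read: $result")
  case Failure(e: FileNotFoundException) => println(s"file not found: ${e.getMessage}")
  case Failure(other)                    => throw other // rethrow anything unexpected
}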

How to start a spark session only if needed

I am fairly new to Spark. I have a case where I don't need the executors and other infrastructure until a condition is met. I have the following code:
def main(args: Array[String]): Unit = {
  try {
    val request = args(0).toString
    // Get the spark session
    val spark = getSparkSession()
    log.info("Running etl Job")
    // Pipeline builder
    val pipeline = new PipelineBuilder().build(request)
    pipeline.execute(spark)
    spark.stop()
  } catch {
    case e: Exception => {
      throw new RuntimeException("Failed to successfully run", e)
    }
  }
}
The above code creates a Spark session and executes an ETL pipeline.
However, I have a requirement to start the pipeline only if a condition is met. In the code below, I want to start the SparkSession only if the condition is true.
def main(args: Array[String]): Unit = {
  try {
    val request = args(0).toString
    if (condition) {
      val spark = getSparkSession()
      log.info("Running etl Job")
      // Pipeline builder
      val pipeline = new PipelineBuilder().build(request)
      pipeline.execute(spark)
      spark.stop()
    } else {
      // Do nothing
    }
  } catch {
    case e: Exception => {
      throw new RuntimeException("Failed to successfully run", e)
    }
  }
}
Does this ensure that no SparkSession is initiated and no executors are spun up if the condition is false? If not, is there any other way to solve this?
You can make use of lazy evaluation in Scala.
In your getSparkSession() function, define:
lazy val spark: SparkSession = ....
As per Wikipedia, "lazy evaluation is an evaluation strategy which delays the evaluation of an expression until its value is needed".
A few benefits of lazy evaluation:
Lazy evaluation can help to resolve circular dependencies.
It can provide a performance boost by not doing calculations until they are needed; if the result is never used, the calculation may not be done at all.
It can improve the response time of applications by postponing heavy operations until they are required.
Please refer to https://dzone.com/articles/scala-lazy-evaluation to learn more.
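A minimal sketch of how this could look (PipelineBuilder, condition and the ETL flow are assumed to exist as in the question; the object and app name are hypothetical): the key point is that the lazy val is only forced inside the if branch, so when the condition is false no session or executors are ever started.

import org.apache.spark.sql.SparkSession

object EtlJob {
  // Not evaluated until the first time `spark` is referenced
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("etl-job") // hypothetical app name
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    val request = args(0)
    if (condition) {
      val pipeline = new PipelineBuilder().build(request)
      pipeline.execute(spark) // first reference: the session is created here
      spark.stop()
    }
    // condition == false: `spark` is never referenced, so no session or executors are started
  }
}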

NullPointerException while creating a DataFrame inside foreach()

I have to read certain files from S3, so I created a CSV containing the paths of those files on S3. I am reading the created CSV file using the code below:
val listofFilesRDD = sparkSession.read.textFile("s3://"+ file)
This is working fine.
Then I am trying to read each of those paths and create a DataFrame, like:
listofFilesRDD.foreach(iter => {
  val pathDF = sparkSession.read
    .schema(testSchema)
    .option("headers", true)
    .csv("s3://" + iter)
  pathDF.printSchema()
})
However, the above code throws a NullPointerException.
How can I fix it?
You can solve the above problem as follows: simply collect the S3 file paths into an array on the driver, then iterate over that array and create the DataFrames there:
val listofFilesRDD = sparkSession.read.textFile("s3://" + file)
val listOfPaths = listofFilesRDD.collect()

listOfPaths.foreach(iter => {
  val pathDF = sparkSession.read
    .schema(testSchema)
    .option("headers", true)
    .csv("s3://" + iter)
  pathDF.printSchema()
})
You cannot access an RDD inside another RDD! That's the rule. You have to structure your logic differently to make it work.
You can find more about it here: NullPointerException in Scala Spark, appears to be caused be collection type?
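As a side note, if all the listed files share the same schema, another option (sketched here, reusing listofFilesRDD and testSchema from above) is to skip the driver-side loop entirely and hand every path to a single read, since DataFrameReader.csv accepts multiple paths:

val paths = listofFilesRDD.collect().map(p => "s3://" + p)
val allFilesDF = sparkSession.read
  .schema(testSchema)
  .option("header", true) // note: the option key is "header", not "headers"
  .csv(paths: _*)
allFilesDF.printSchema()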
If anyone encounters this problem with a DataFrame, it can be solved like this:
def parameterjsonParser(queryDF: DataFrame, spark: SparkSession): Unit = {
  queryDF.show()
  val otherDF = queryDF.collect()
  otherDF.foreach { row =>
    row.toSeq.foreach { col =>
      println(col)
      mainJsonParser(col.toString, spark)
    }
  }
}
Thank you @Sandeep Purohit

HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight Spark using Tableau. I was following this video, where they describe how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvLines is an RDD of strings, one per line of the CSV file */
val csvLines = sc.textFile("wasb://mycontainer#mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
).toDF()

// Register as a temporary table called "test_table"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately, I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried to use the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still fails with an error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave this as an answer since it was not readily available in any of the blogs or forum answers. Hopefully it will help someone like me who is starting with Spark.
I figured out that .toDF() actually creates a SQLContext-based DataFrame rather than a HiveContext-based one, so I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
)

// Register as a temporary table called "mydata_stored" and save it via the HiveContext
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
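For reference, on Spark 2.x and later HiveContext is folded into SparkSession; a rough sketch of the same flow under that API (assuming Hive support is available on the cluster, and reusing the csvLines RDD and MyData case class from above) would be:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hdinsight-example") // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val myDataFrame = csvLines
  .map(_.split(","))
  .filter(s => s(0) != "Timestamp")
  .map(s => MyData(s(0), s(1), s(2), s(3), s(4).toDouble, s(5)))
  .toDF()

myDataFrame.createOrReplaceTempView("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")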
Also make sure that s(4) holds a proper double value, or add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: NumberFormatException => 0.00 }
parseDouble(s(4))
Regards
Kiran

How to download and save a file from the internet using Scala?

Basically I have a url/link to a text file online and I am trying to download it locally. For some reason, the text file that gets created/downloaded is blank. Open to any suggestions. Thanks!
import java.io.{BufferedOutputStream, FileOutputStream, InputStream, OutputStream}
import java.net.{HttpURLConnection, URL}

def downloadFile(token: String, fileToDownload: String): Unit = {
  val url = new URL("http://randomwebsite.com/docs?t=" + token + "&p=tsr%2F" + fileToDownload)
  val connection = url.openConnection().asInstanceOf[HttpURLConnection]
  connection.setRequestMethod("GET")
  val in: InputStream = connection.getInputStream
  val fileToDownloadAs = new java.io.File("src/test/resources/testingUpload1.txt")
  val out: OutputStream = new BufferedOutputStream(new FileOutputStream(fileToDownloadAs))
  val byteArray = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
  out.write(byteArray)
}
I know this is an old question, but I just came across a really nice way of doing this:
import sys.process._
import java.net.URL
import java.io.File

def fileDownloader(url: String, filename: String) = {
  new URL(url) #> new File(filename) !!
}
Hope this helps. Source.
You can now simply use fileDownloader function to download the files.
fileDownloader("http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words", "stop-words-en.txt")
Here is a naive implementation using scala.io.Source.fromURL and java.io.FileWriter:
def downloadFile(token: String, fileToDownload: String): Unit = {
  try {
    val src = scala.io.Source.fromURL("http://randomwebsite.com/docs?t=" + token + "&p=tsr%2F" + fileToDownload)
    val out = new java.io.FileWriter("src/test/resources/testingUpload1.txt")
    out.write(src.mkString)
    out.close()
  } catch {
    case e: java.io.IOException => "error occurred" // consider logging the failure instead of discarding it
  }
}
Your code works for me... There are other possibilities that could produce an empty file.
Here is a safer alternative to new URL(url) #> new File(filename) !!:
val url = new URL(urlOfFileToDownload)
val connection = url.openConnection().asInstanceOf[HttpURLConnection]
connection.setConnectTimeout(5000)
connection.setReadTimeout(5000)
connection.connect()

if (connection.getResponseCode >= 400)
  println("error")
else
  url #> new File(fileName) !!
Two things:
When downloading from a URL object, if an error (a 404, for instance) is returned, the URL object will throw a FileNotFoundException. Since this exception is generated from another thread (as URL happens to run on a separate thread), a simple Try or try/catch won't be able to catch it. Hence the preliminary check on the response code: if (connection.getResponseCode >= 400).
As a consequence of checking the response code, the connection might sometimes stay open indefinitely for improper pages (as explained here). This can be avoided by setting a timeout on the connection: connection.setReadTimeout(5000).
Flush the buffer and then close your output stream.
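Tying that back to the original downloadFile, a minimal sketch with the streams flushed and closed in a finally block (same hypothetical URL and file path as in the question) might look like:

import java.io.{BufferedOutputStream, FileOutputStream, InputStream, OutputStream}
import java.net.{HttpURLConnection, URL}

def downloadFile(token: String, fileToDownload: String): Unit = {
  val url = new URL("http://randomwebsite.com/docs?t=" + token + "&p=tsr%2F" + fileToDownload)
  val connection = url.openConnection().asInstanceOf[HttpURLConnection]
  connection.setRequestMethod("GET")
  val in: InputStream = connection.getInputStream
  val out: OutputStream = new BufferedOutputStream(
    new FileOutputStream("src/test/resources/testingUpload1.txt"))
  try {
    val byteArray = Stream.continually(in.read).takeWhile(_ != -1).map(_.toByte).toArray
    out.write(byteArray)
  } finally {
    out.flush() // push any buffered bytes to the file
    out.close()
    in.close()
  }
}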