Catching Spark exceptions in PySpark

I have a Databricks notebook that reads CSV files as the first step in an ETL pipeline.
Sometimes the CSV files do not have the required schema, and this causes the notebook to crash. I need to handle these errors when they occur instead of letting the entire pipeline crash.
Below is my code where I attempt to handle these errors. When a faulty CSV file is read, I expect the output to be "Exception caught".
try:
    newData = (spark.read
        .format("csv")
        .option("delimiter", "|")
        .option("mode", "FAILFAST")
        .option("inferSchema", "false")
        .option("enforceSchema", "true")
        .option("header", "True")
        .schema(schema)
        .load(bronzePath + "/" + fileName + "*")
    )
    newData.display()
except FileReadException as e:
    # Do stuff, handle exception
    print("Exception caught")
But the exception is never caught and I get the full exception as output:
FileReadException: Error while reading file dbfs:/mnt/datalake/bronze/<myPath>/<myFile.csv>.
Caused by: SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
Caused by: BadRecordException: org.apache.spark.sql.catalyst.csv.MalformedCSVException: Malformed CSV record
Caused by: MalformedCSVException: Malformed CSV record
Googling the issue helped me understand that it is not possible to catch Scala exceptions directly in PySpark.
Is there some other way that I can catch this exception? What other alternatives do I have?
I need some way to handle any faulty CSV files that are received. Using PERMISSIVE or DROPMALFORMED mode is not an option for me, since I need to react to and treat the faulty files.
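One pragmatic workaround, purely as a sketch: the JVM-side FileReadException usually reaches Python wrapped in py4j.protocol.Py4JJavaError (or, depending on the runtime, another wrapper), so you can force evaluation inside the try block and match on the wrapped error text rather than on the Scala class. The function name, markers, and placeholder path below are assumptions to adapt to your pipeline.

from py4j.protocol import Py4JJavaError

def read_bronze_csv(path, schema):
    # Sketch only: force the read inside the try block so FAILFAST errors
    # surface here, then inspect the wrapped JVM exception text.
    try:
        df = (spark.read
              .format("csv")
              .option("delimiter", "|")
              .option("mode", "FAILFAST")
              .option("header", "true")
              .schema(schema)
              .load(path))
        df.count()  # triggers the actual read; display()/show() would work too
        return df
    except Py4JJavaError as e:  # raw JVM exceptions bubble up through py4j
        msg = str(e.java_exception)
        if "FileReadException" in msg or "Malformed" in msg:
            print("Exception caught")  # react to the faulty file here
            return None
        raise
    except Exception as e:  # some runtimes wrap Spark errors differently
        if "FileReadException" in str(e) or "Malformed" in str(e):
            print("Exception caught")
            return None
        raise

The broad except at the end is deliberately defensive: which wrapper type you actually get depends on the PySpark/Databricks version, so matching on the message text is the part that carries the logic.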

Related

Getting error in spark-sftp, no such file

In a Databricks cluster (Spark 2.4.5, Scala 2.1.1) I am trying to read a file into a Spark DataFrame using the following code.
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("/my_file.csv")
However, I get the following error:
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to save that file temporarily, but I can't find a way to do so. How can I solve that?
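The connector first copies the remote file to a temporary location and then points Spark at it, and on Databricks that temp path ends up resolved against dbfs:, which matches the error above. If I remember the spark-sftp README correctly, there is a tempLocation (and an hdfsTempLocation) option to control where that copy goes, but treat both option names as assumptions to verify against the README for your connector version. A sketch in PySpark form (the same options apply from Scala); credentials and paths are placeholders:

# Assumption: tempLocation / hdfsTempLocation are the spark-sftp options that
# control where the downloaded file is staged before Spark reads it back.
df = (spark.read
      .format("com.springml.spark.sftp")
      .option("host", "*")
      .option("username", "*")
      .option("password", "*")
      .option("fileType", "csv")
      .option("delimiter", ";")
      .option("inferSchema", "true")
      .option("tempLocation", "/dbfs/tmp/sftp")      # placeholder staging directory
      .option("hdfsTempLocation", "dbfs:/tmp/sftp")  # placeholder, verify in README
      .load("/my_file.csv"))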

java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file by pyspark

I have a large Parquet file (~5 GB) and I want to load it in Spark. The following command executes without any error:
df = spark.read.parquet("path/to/file.parquet")
But when I try to do any operation like .show() or .repartition(n) I run into the following error:
java.lang.IllegalArgumentException: Illegal Capacity: -102
Any ideas on how I can fix this?
It's an integer overflow bug in the underlying Parquet reader: https://issues.apache.org/jira/browse/PARQUET-1633
Upgrade PySpark to 3.2.1; the bundled parquet-hadoop-1.12.2 jar contains the actual fix.
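To confirm the upgrade actually took effect before retrying the heavy read, a minimal check along these lines can help (the file path is a placeholder):

import pyspark
from pyspark.sql import SparkSession

# PARQUET-1633 is fixed in the Parquet reader shipped with Spark 3.2.1+,
# so verify the version before re-running the failing operation.
print(pyspark.__version__)  # expect 3.2.1 or newer

spark = SparkSession.builder.appName("parquet-check").getOrCreate()
df = spark.read.parquet("path/to/file.parquet")
df.show(5)  # previously failed with "Illegal Capacity: -102" on affected versions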

How to read .DAT file using SCALA

I am trying to read a DAT file using the syntax below but am getting the following error:
spark.read.format("dat").option("header", "true").option("delimiter","!^")
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: dat. Please find packages at http://spark.apache.org/third-party-projects.html
You can try this:
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv(spark.read.textFile("filename")
    .map(line => line.split("YOUR DOUBLE DELIMITER").mkString("\t")))
The answer is from here:
How to use double pipe as delimiter in CSV?
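Since the rest of this thread is PySpark, here is the same pre-split idea expressed in Python, purely as a sketch: read the raw lines, replace the multi-character delimiter with a tab, and hand the cleaned lines to the CSV parser (the file name is a placeholder; "!^" is the delimiter from the question above).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dat-read").getOrCreate()

# Read raw lines, rewrite the two-character delimiter "!^" to a tab,
# then parse the cleaned lines as tab-separated CSV.
lines = spark.sparkContext.textFile("path/to/file.dat")
cleaned = lines.map(lambda line: line.replace("!^", "\t"))
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", "\t")
      .csv(cleaned))  # PySpark's csv() also accepts an RDD of strings
df.printSchema()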

ERROR: org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV

I try this basic command to read a CSV in Scala:
val df = spark.read
  .option("header", "true")
  .option("sep", "|")
  .option("inferSchema", "true")
  .csv("path/to/_34File.csv")
And I get:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
What could be the solution?
The solution is to rename the file from "_34File.csv" to "34File.csv". Spark (via Hadoop's input handling) treats paths whose names start with an underscore or a dot as hidden metadata files and skips them, so there is nothing left to infer a schema from. It's a peculiar case and that worked for me.

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS instance (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame, so I tried this:
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import sparkSession.implicits._
spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()
But it fails at runtime with the below exception stack trace:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies, but I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on Hadoop HDFS 2.7.2 with Scala 2.11.
EDIT: One additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0 and a ticket has been raised:
https://issues.apache.org/jira/browse/SPARK-15899
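For anyone hitting this on Windows with Spark 2.0.x, the workaround usually mentioned alongside SPARK-15899 is to give the warehouse directory an explicit file: URI when building the session, so the catalog initialisation does not trip over the C:/ path. A PySpark-flavoured sketch; the master, app name, and warehouse directory are placeholders:

from pyspark.sql import SparkSession

# Workaround sketch for SPARK-15899 on Windows: point spark.sql.warehouse.dir
# at a well-formed file: URI so the session catalog can qualify the path.
spark = (SparkSession.builder
         .master("local[*]")                # placeholder master
         .appName("hdfs-csv-read")          # placeholder app name
         .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
         .getOrCreate())

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs://localhost:54310/tmp/home/mycsv.csv"))
df.show()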