Apache Zeppelin Can't Write Delta Table with Spark - scala

I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin:
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
Which yields this output (truncated to omit repeated output):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 192.168.64.3, executor 2): java.io.FileNotFoundException: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
...
I can't figure out why this is happening, as I'm not very familiar with Spark. Any tips? Thanks for your help.
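One common cause (an assumption here, since the question doesn't describe the storage layout) is that file:/tmp/delta-table is a driver-local path: the executor at 192.168.64.3 cannot see the _delta_log directory the driver wrote to its own /tmp. Below is a minimal sketch of the same write against shared storage; the hdfs://namenode:8020 URI is hypothetical and should be replaced with a path every node can reach (HDFS, S3, a mounted volume, etc.):
// Write to a location visible to both the driver and every executor.
val data = spark.range(0, 5)
data.write.format("delta").save("hdfs://namenode:8020/tmp/delta-table")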

Related

Why does Apache Spark perform some checks and raise those exceptions at job runtime, but never throw them during unit tests?

There was a bug in my Scala code that formatted the timestamp as a date string, which was then concatenated as a String onto a non-timestamp column of the Spark Streaming job:
concat(date_format(col("timestamp"), "yyyy-MM-DD'T'HH:mm:ss.SSS'Z'"), ...)
So during the tests everything was OK: the tests sending the messages to Kafka passed, and I was able to see those messages in Kafka Tool.
Note the "292th of October" in those messages, caused by DD instead of dd in the formatter.
But then in the executor there was some extra check that wasn't passed, and the job crashed:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 12, kafkadatageneratorjob-driver, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to format it to '2021-10-292T14:27:12.577Z' in the new formatter. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
How can I enable the same strict check in the unit tests, so that they also fail on those checks without an explicit assertion on the value, simply by having timeParserPolicy enforced in the tests as well?
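One way to reproduce the cluster's behaviour locally (a sketch, assuming the test SparkSession was falling back to the legacy formatter; the session setup below is illustrative) is to set spark.sql.legacy.timeParserPolicy explicitly on the session the tests use, so the same formatting check runs there too:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

// Build the test session with the strict policy (EXCEPTION is the Spark 3 default),
// so patterns the new formatter handles differently from the legacy one throw
// SparkUpgradeException instead of passing silently.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
  .getOrCreate()
import spark.implicits._

// The DD-vs-dd bug from above now fails as soon as an action runs:
val df = Seq(java.sql.Timestamp.valueOf("2021-10-19 14:27:12.577")).toDF("timestamp")
df.select(date_format(col("timestamp"), "yyyy-MM-DD'T'HH:mm:ss.SSS'Z'")).collect()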

ParquetDecodingException using pyspark

I saved a parquet file, then loaded it and tried to join it with another dataframe, using the regular parquet read/write methods.
Then I got the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 925 in stage 37.0 failed 1 times, most recent failure: Lost
task 925.0 in stage 37.0 (TID 6376, localhost, executor driver):
org.apache.parquet.io.ParquetDecodingException: Can not read value at
113652 in block 0 in file
file:/my_parquet_path/part-00031-b9b3442d-8459-4591-956c-9ef2299095cd-c000.snappy.parquet
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048635, valueCount=30917, uncompressedSize=1048635]
in col [my_field_name, list, element] optional binary element (UTF8)
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
Any idea why this happens? How can I filter out the problematic rows?
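No answer is shown here, but one way to narrow it down (a sketch in Scala rather than pyspark; the /my_parquet_path placeholder is taken from the error message) is to force a full read of each part file on its own, which usually isolates the file behind FAILED_TO_UNCOMPRESS:
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Try to fully read every part file individually and report the ones that fail.
val parts = new java.io.File("/my_parquet_path")
  .listFiles()
  .filter(_.getName.endsWith(".parquet"))

parts.foreach { f =>
  val readable = Try(spark.read.parquet(f.getPath).foreach(_ => ())).isSuccess
  if (!readable) println(s"Corrupted or unreadable: ${f.getPath}")
}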

Spark read failed due to duplicate column

I am trying to read a parquet file from S3 in Databricks, using Scala.
Below is the simple read code:
val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)
MountName is the dbfs where data is mounted from S3.
But I am getting an error, which is due to a duplicate key in the file.
SparkException: Job aborted due to stage failure: Task 0 in stage 813.0 failed 4 times, most recent failure: Lost task 0.3 in stage 813.0 (TID 79285, 10.179.245.218, executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/Alibaba_data/tstamp=2020_03_25/ts-1585154320710.parquet.gz.
Caused by: java.lang.RuntimeException: Found duplicate field(s) "subtype": [subtype, subType] in case-insensitive mode
Now I need to overcome it, perhaps by making the read case sensitive, by dropping the column while reading, or by any other means you can suggest.
Suggestions please.
Try with case sensitivity enabled.
spark.sql.caseSensitive should be set to true.
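For example (a minimal sketch; on Databricks the config can be set on the existing session right before the read):
// Treat "subtype" and "subType" as distinct columns by enabling case-sensitive resolution.
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)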

SparkException: Job aborted due to stage failure: Task 0 in stage 3.0

Does anyone know why I get this error when I try to load multiple dataframes using different threads in Scala?
SparkException: Job aborted due to stage failure: Task 0 in stage 3.0
Here is my code:
val dataFrame = connector.createDataFrame(query)
dataFrame.printSchema()
println(dataFrame.count())
val dataFrame2 = connector.createDataFrame(query2)
dataFrame2.printSchema()
println(dataFrame2.count())
The problem occurs when I try to perform any operation, and it is not due to lack of memory, because each DataFrame has only 100 rows.
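For reference, a minimal sketch of what loading the two dataframes on separate threads might look like (the connector and its createDataFrame method come from the question and are not shown here; the Future-based wrapping is an assumption about how the threading is done):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each Future runs on its own thread; both counts execute concurrently
// against the same SparkSession.
val f1 = Future {
  val dataFrame = connector.createDataFrame(query)
  dataFrame.printSchema()
  println(dataFrame.count())
}
val f2 = Future {
  val dataFrame2 = connector.createDataFrame(query2)
  dataFrame2.printSchema()
  println(dataFrame2.count())
}
Await.result(Future.sequence(Seq(f1, f2)), 10.minutes)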

apache zeppelin fails on reading csv using pyspark

I'm using Zeppelin-Sandbox 0.5.6 with Spark 1.6.1 on Amazon EMR.
I am reading a csv file located on S3.
The problem is that sometimes I get an error reading the file and have to restart the interpreter several times until it works. Nothing in my code changes; I can't reproduce the error on demand and can't tell when it will happen.
My code goes as follows.
Defining the dependencies:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.4.0")
using spark-csv:
%pyspark
import pyspark.sql.functions as func
df = sqlc.read.format("com.databricks.spark.csv").option("header", "true").load("s3://some_location/some_csv.csv")
error msg:
Py4JJavaError: An error occurred while calling o61.load. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-172-22-2-187.ec2.internal):
java.io.InvalidClassException: com.databricks.spark.csv.CsvRelation;
local class incompatible: stream classdesc serialVersionUID =
2004612352657595167, local class serialVersionUID =
6879416841002809418
...
Caused by: java.io.InvalidClassException:
com.databricks.spark.csv.CsvRelation; local class incompatible
Once the csv has been read into the dataframe, the rest of the code works fine.
Any advice?
Thanks!
You need to launch Spark with the spark-csv package added to it, like this:
$ pyspark --packages com.databricks:spark-csv_2.10:1.2.0
Now spark-csv will be on your classpath.
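In Zeppelin the same thing can also be done through the interpreter configuration rather than a shell session (a sketch, assuming a standard install where conf/zeppelin-env.sh is read at startup). Pick a single spark-csv version and use it everywhere: the InvalidClassException above is a serialVersionUID mismatch, i.e. two different spark-csv versions ended up on the classpath.
# in conf/zeppelin-env.sh, then restart the Spark interpreter
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.4.0"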