How to resolve an error while reading Parquet files - PySpark

While reading Parquet files in PySpark I'm getting the following error.
How can I resolve it?
Error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 131.0 failed 4 times, most recent failure: Lost task
0.3 in stage 131.0 (TID 16100, 10.107.97.154, executor 56): org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=; isDirectory=false; length=32203696; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
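A footer that cannot be read usually points to a zero-byte or partially written file somewhere in the input directory. A minimal sketch, assuming Spark 2.1+ and a hypothetical input path, that skips such files so the rest of the job can proceed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Skip files whose Parquet footer cannot be read instead of failing the job.
# Caveat: corrupt files are dropped silently, so verify the row count afterwards.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.parquet("/path/to/parquet")  # hypothetical input path
print(df.count())

If the count comes up short, the real fix is to locate and regenerate the zero-length files rather than permanently skipping them.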

Related

Getting Exception thrown in awaitResult in Azure databricks notebook

I am getting the below error while trying to write an imported table from an Azure container path to Delta in a Databricks notebook:
Job aborted.
Caused by: Exception thrown in awaitResult:
Caused by: Job aborted due to stage failure.
at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:607)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:359)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$7(TransactionalWriteEdge.scala:352)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:189)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:336)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:148)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:428)
at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.awaitShuffleMapStage$1(DeltaOptimizedWriterExec.scala:189)
at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.getShuffleStats(DeltaOptimizedWriterExec.scala:194)
at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.computeBins(DeltaOptimizedWriterExec.scala:136)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 855 in stage 2.0 failed 4 times, most recent failure: Lost task 855.3 in stage 2.0 (TID 1527) (10.94.102.5 executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2979)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2926)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2920)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2920)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1340)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1340)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1340)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3129)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3117)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Below is the code:
%scala
spark.read.parquet(s"<Azure container path>")
.write.format("delta").mode("overwrite")
.option("delta.autoOptimize", "true")
.option("delta.autoOptimize.optimizeWrite", "true")
.option("delta.targetFileSize", "1024mb")
.option("delta.dataSkippingNumIndexedCols", "-1")
.option("path", s"<target_path>")
.partitionBy("week_id")
.saveAsTable(s"${table}")
I have tried increasing driver and executor memory, but it still throws the same error. Could someone please help with this issue?
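The stage fails inside DeltaOptimizedWriterExec's shuffle, so beyond memory, one thing worth trying is taking that shuffle out of the picture and controlling partitioning explicitly. A sketch, assuming a Databricks runtime where the optimizeWrite config applies (paths and table name are placeholders carried over from the question):

# Disable Databricks' optimized-write shuffle, which is where the stage fails,
# and pre-shuffle on the partition column instead.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "false")

df = spark.read.parquet("<Azure container path>")  # placeholder from the question

(df.repartition("week_id")                  # align the shuffle with partitionBy
   .write.format("delta")
   .mode("overwrite")
   .option("path", "<target_path>")         # placeholder from the question
   .partitionBy("week_id")
   .saveAsTable("table_name"))              # hypothetical table name

Note that the delta.* keys in the original snippet look like Delta table properties rather than writer options, so they are likely ignored when passed via .option(); if they are needed, setting them with ALTER TABLE ... SET TBLPROPERTIES after the write seems the safer route.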

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Error while writing to Parquet File

When I try to write the data to a Parquet file I get the error below. I read a post saying this error appears when two Parquet files have different datatypes, but I have tried individually casting all the columns in the DataFrame, and I am also writing to a new directory that doesn't contain any files.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 787 in stage 76.0 failed 4 times, most recent failure: Lost task 787.3 in stage 76.0 (TID 77007) (100.100.191.241 executor 145): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:400)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:87)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
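Although it surfaces during a write, the trace fails inside the vectorized Parquet reader: getUTF8String is being called on a column backed by an integer dictionary, i.e. at least one source file stores that column as INT32 while Spark expects a string. A sketch of two commonly suggested workarounds, with hypothetical paths and a hypothetical column name col_x:

# Workaround 1: turn off the vectorized reader; the row-based reader
# sometimes tolerates the mismatch (or at least fails with a clearer error).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Workaround 2: read the differently-typed files separately, cast to a
# common type, and rewrite them once with a consistent schema.
from pyspark.sql import functions as F

old = spark.read.parquet("/data/int_encoded")     # hypothetical path
new = spark.read.parquet("/data/string_encoded")  # hypothetical path
unified = (
    old.withColumn("col_x", F.col("col_x").cast("string"))
       .unionByName(new.withColumn("col_x", F.col("col_x").cast("string")))
)
unified.write.mode("overwrite").parquet("/data/consistent")  # hypothetical output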

How to resolve the "It is possible the underlying files have been updated" issue in Databricks

val mnt_point_write = "/mnt/pnt"
// .json() already sets the format; the CSV-only "header" option is a no-op for JSON
ord_JsonDF.write.mode("overwrite").json(mnt_point_write + "/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 114.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27):
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I keep coming across the above issue. Can anyone help in resolving it, please?
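The message itself names the fix: Spark cached the file listing for the source path, and files under it changed before the job finished reading them. A sketch invalidating the cached metadata before re-running the write, assuming PySpark 2.2+ (the path is taken from the error message):

# Invalidate Spark's cached file listing for the source path.
spark.catalog.refreshByPath("dbfs:/mnt-02/AF_tab/Processed")

# If the source is registered as a table instead (hypothetical name):
spark.sql("REFRESH TABLE my_source_table")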

Write parquet to S3 using Spark Scala: java.lang.NullPointerException

When I try to write Parquet to S3 with Scala using the code below:
ab.write.mode("overwrite").parquet(path)
this error is thrown:
org.apache.spark.SparkException: Job aborted. (Job aborted.) ......
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 112.0 failed 1 times, most recent failure: Lost task 1.0 in stage 112.0 (TID 6273, localhost): java.lang.NullPointerException
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:412)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
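The NPE originates in LocalDirAllocator.confChanged, which resolves the local directory S3A buffers uploads in; a commonly reported cause is that fs.s3a.buffer.dir (or the hadoop.tmp.dir it falls back to) is unset or points somewhere unusable. A sketch, assuming any writable local path of your choosing:

# Point S3A's upload buffer at a writable local directory
# ("/tmp/s3a" is an arbitrary choice, not a required value).
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.buffer.dir", "/tmp/s3a")

# Equivalently, at session construction:
# SparkSession.builder.config("spark.hadoop.fs.s3a.buffer.dir", "/tmp/s3a")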

Error reading an RDD locally using PySpark

I have an RDD stored locally, and I am using a Jupyter notebook with PySpark. I tried to load the RDD, but it crashed with a file-not-found error, although the files are there. Do you know the reason?
a=sc.textFile('file:///myfile/test5.rdd')
Error message:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 303, abc.com, executor 3):
java.io.FileNotFoundException: File file:/myfile/test5.rdd/part-00000 does not exist
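With a file:// URI, every executor looks for the path on its own local disk, so a file that exists only on the driver (or on the notebook host) raises FileNotFoundException as soon as a task runs on another node. A sketch of two workarounds, under that assumption:

# Option 1: copy the data to storage visible to all executors
# (HDFS, S3, DBFS, ...) and read from there instead of file://.

# Option 2 (small files only): read on the driver, then distribute.
with open("/myfile/test5.rdd/part-00000") as f:  # path from the error message
    lines = f.read().splitlines()
a = sc.parallelize(lines)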