I have an RDD stored locally and I am using a Jupyter notebook with PySpark. I tried to load the RDD, but the job crashed with a file-not-found error, even though the files are there. Do you know the reason?
a=sc.textFile('file:///myfile/test5.rdd')
Error message
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 303, abc.com, executor 3):
java.io.FileNotFoundException: File file:/myfile/test5.rdd/part-00000 does not exist
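A common cause of this (an assumption here, since the cluster layout isn't shown above) is that a file:/// path is resolved on every executor node, not just on the driver; if test5.rdd exists only on the driver's local disk, tasks on remote executors fail with exactly this FileNotFoundException. A minimal sketch that sidesteps the problem by running in local mode, where driver and executors share one filesystem:

from pyspark import SparkContext

# In local mode all tasks run on this machine, so file:/// paths work
# as long as the directory exists here. On a real cluster, the data
# would instead need to live on shared storage (HDFS, S3, DBFS, ...).
sc = SparkContext(master="local[*]", appName="read-local-rdd")
a = sc.textFile("file:///myfile/test5.rdd")
print(a.count())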
Related
I keep getting org.apache.spark.SparkException: Job aborted when I try to save my flattened JSON file to Azure Blob Storage as CSV. Some answers I have found recommend increasing the executor memory, which I have done here:
I still get this error when I try to save with that config:
What do I need to do to solve this issue?
EDIT
Adding part of the stack trace that is causing org.apache.spark.SparkException: Job aborted. I have also tried saving my flattened dataframe both with and without coalesce:
ERROR FileFormatWriter: Aborting job 0d8c01f9-9ff3-4297-b677-401355dca6c4.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 4 times, most recent failure: Lost task 0.3 in stage 79.0 (TID 236) (10.139.64.7 executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
I experienced a similar error when setting spark.executor.memory to 4g on a cluster with a similar worker node.
The cause of the error is that the executor memory limit on that specific cluster node is 3 GB, while you are passing 4 GB, as the error message suggests.
Resolution:
Set spark.executor.memory to less than 3 GB (a config sketch follows below), or
Select a bigger worker type, such as Standard_F8 or Standard_F16.
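For reference, a minimal sketch of how the memory setting above can be applied when building the session (the 2g value is just an example that stays under the 3 GB node limit described above):

from pyspark.sql import SparkSession

# Request less executor memory than the worker node can actually
# provide (under 3 GB here, per the resolution above). This must be
# set before the Spark context is created.
spark = (
    SparkSession.builder
    .appName("flatten-and-save")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)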
When I am trying to write the data to a Parquet file, I am facing the error mentioned below. I read a post saying that this error can appear when two Parquet files have different data types for the same column, but I tried individually casting all the columns in the dataframe, and I am also writing to a new directory that doesn't contain any files.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 787 in stage 76.0 failed 4 times, most recent failure: Lost task 787.3 in stage 76.0 (TID 77007) (100.100.191.241 executor 145): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:400)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:87)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
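The stack trace shows the vectorized Parquet reader calling decodeToBinary on an integer dictionary, i.e. a column that is physically INT32 in the source files being read as a string. A sketch of two commonly suggested workarounds; the dataframe, the column name "id", and the paths below are hypothetical stand-ins, since the original schema isn't shown:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Workaround 1: fall back to the row-based Parquet reader, which can
# tolerate or better surface type mismatches the vectorized path cannot.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Workaround 2: make the disputed column's type explicit before writing.
df = spark.read.parquet("/path/to/source")            # hypothetical input
df = df.withColumn("id", F.col("id").cast("string"))  # hypothetical column
df.write.mode("overwrite").parquet("/path/to/new_dir")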
While reading Parquet files in PySpark I'm getting the following error. How can I resolve it?
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 131.0 failed 4 times, most recent failure: Lost task 0.3 in stage 131.0 (TID 16100, 10.107.97.154, executor 56): org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=; isDirectory=false; length=32203696; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
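"Could not read footer" usually means at least one file in the directory is truncated or is not valid Parquet at all. One way to confirm this (a sketch with a hypothetical path) is to tell Spark to skip unreadable files and see whether the read then succeeds:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Skip files whose footer cannot be read instead of failing the whole
# job; if the count succeeds afterwards, the directory contains at
# least one corrupt or non-Parquet file that needs to be replaced.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.parquet("/path/to/parquet_dir")  # hypothetical path
print(df.count())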
val mnt_point_write="/mnt/pnt"
ord_JsonDF.write.mode("overwrite").format("json").option("header",true).json(mnt_point_write+"/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 4 times, most recent failure: Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am running into the issue above. Can anyone help me resolve it, please?
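Following the hint in the error message itself, the cached file metadata for the mount can be invalidated before re-reading and writing. A sketch: the read path comes from the error message above, while the CSV options and schema handling are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Drop Spark's cached file listing for the path whose underlying
# files were replaced, then re-read before writing out as JSON.
spark.catalog.refreshByPath("dbfs:/mnt-02/AF_tab/Processed")

mnt_point_write = "/mnt/pnt"
ord_JsonDF = spark.read.csv("dbfs:/mnt-02/AF_tab/Processed", header=True)
ord_JsonDF.write.mode("overwrite").json(mnt_point_write + "/Processed_file")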
When I tried to write Parquet to S3 with Scala using the code below:
ab.write.mode("overwrite").parquet(path)
And the error shows:
org.apache.spark.SparkException: Job aborted. (Job aborted.) ......
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 112.0 failed 1 times, most recent failure: Lost task 1.0 in stage 112.0 (TID 6273, localhost): java.lang.NullPointerException
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:412)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
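A NullPointerException inside LocalDirAllocator.confChanged during an S3A write typically points at the local buffer directory configuration (fs.s3a.buffer.dir / hadoop.tmp.dir) rather than at the data itself. A sketch of the commonly suggested fix, written in PySpark for consistency with the rest of this page even though the original snippet is Scala; the buffer path, source path, and bucket name are assumptions:

from pyspark.sql import SparkSession

# S3A stages uploads in a local temp directory; if fs.s3a.buffer.dir
# resolves to nothing usable, LocalDirAllocator can fail with an NPE.
# Point it at a writable local directory before creating the session.
spark = (
    SparkSession.builder
    .appName("write-parquet-s3")
    .config("spark.hadoop.fs.s3a.buffer.dir", "/tmp/s3a")  # assumed path
    .getOrCreate()
)

ab = spark.read.parquet("/path/to/input")                  # hypothetical source
ab.write.mode("overwrite").parquet("s3a://my-bucket/out")  # hypothetical bucket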