Error reading RDD locally using PySpark

I have an RDD stored locally, and I am using a Jupyter notebook with PySpark. I tried to load the RDD, but the job crashed with a file-not-found error even though the files are there. Do you know the reason?
a = sc.textFile('file:///myfile/test5.rdd')
Error message:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 303, abc.com, executor 3):
java.io.FileNotFoundException: File file:/myfile/test5.rdd/part-00000 does not exist
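The trace shows the failure on an executor (abc.com), which matters: a file:// URI is resolved against each executor's local filesystem, so in cluster mode every worker needs /myfile/test5.rdd at that same path, and a FileNotFoundException on part-00000 typically means the data exists only on the driver machine. A minimal sketch of two ways around that, assuming the data currently lives only on the driver (the HDFS path is an illustrative assumption):

from pyspark.sql import SparkSession

# Option 1: run in local mode, so the driver and the executors share
# one filesystem and the local path is visible to every task.
spark = SparkSession.builder.master('local[*]').getOrCreate()
rdd = spark.sparkContext.textFile('file:///myfile/test5.rdd')

# Option 2: copy the data to storage every executor can reach
# (HDFS, S3, a mounted share) and read it from there instead.
rdd = spark.sparkContext.textFile('hdfs:///myfile/test5.rdd')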

Related

Azure Databricks: Error, Specified heap memory (4096MB) is above the maximum executor memory (3157MB) allowed for node type Standard_F4

I keep getting org.apache.spark.SparkException: Job aborted when I try to save my flattened JSON file in Azure Blob storage as CSV. Some answers that I have found recommend increasing the executor memory, which I have done, but then I get the heap-memory error quoted in the title when I try to save the cluster config.
What do I need to do to solve this issue?
EDIT
Adding part of the stack trace that is causing org.apache.spark.SparkException: Job aborted. I have also tried with and without coalesce when saving my flattened DataFrame:
ERROR FileFormatWriter: Aborting job 0d8c01f9-9ff3-4297-b677-401355dca6c4.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 4 times, most recent failure: Lost task 0.3 in stage 79.0 (TID 236) (10.139.64.7 executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
I experienced a similar error when setting spark.executor.memory to 4g on a cluster with a similar worker node.
The cause of the error is that the maximum executor memory allowed on this node type is 3157 MB, while you are passing 4096 MB, as the error message says.
Resolution:
Give spark.executor.memory a value below 3157 MB (e.g. 3g).
Or select a bigger worker type, such as Standard_F8 or Standard_F16.
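As a sketch of the first option (the 3g value is chosen to fit under the 3157 MB limit quoted above; on Databricks the same setting usually goes in the cluster's Spark config rather than in code):

from pyspark.sql import SparkSession

# Request 3g per executor, which fits under the 3157 MB maximum
# reported for a Standard_F4 node.
spark = (
    SparkSession.builder
    .config('spark.executor.memory', '3g')
    .getOrCreate()
)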

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Error while writing to Parquet File

When I am trying to write the data to a Parquet file I am facing the error mentioned below. I read a post saying that this error appears when two Parquet files have different data types for the same column, but I have tried individually casting all the columns in the DataFrame, and I am writing to a new directory that doesn't have any files.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 787 in stage 76.0 failed 4 times, most recent failure: Lost task 787.3 in stage 76.0 (TID 77007) (100.100.191.241 executor 145): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:400)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:87)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
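The decodeToBinary frames say the reader expected string (binary) values while the file's dictionary holds plain integers, i.e. at least one column's type in the files differs from the type Spark expects. A hedged sketch of forcing one explicit type before writing (df and the column name are hypothetical); disabling the vectorized reader via spark.sql.parquet.enableVectorizedReader=false is another commonly cited workaround when the mismatch shows up on read:

from pyspark.sql import functions as F

# Cast the suspect column to a single explicit type so every output
# file carries the same schema for it.
aligned = df.withColumn('code', F.col('code').cast('string'))
aligned.write.mode('overwrite').parquet('/tmp/parquet_out')  # illustrative path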

How to resolve an error while reading Parquet files

While reading Parquet files in PySpark I'm getting the following error. How can I resolve it?
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 131.0 failed 4 times, most recent failure: Lost task 0.3 in stage 131.0 (TID 16100, 10.107.97.154, executor 56): org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=; isDirectory=false; length=32203696; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
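A "Could not read footer" failure usually points at a truncated or corrupt Parquet file in the directory. As a sketch for isolating it, Spark can be told to skip unreadable files with a standard config (the input path is an illustrative assumption):

# Skip files whose Parquet footer cannot be read instead of failing
# the whole job; comparing results with and without this setting
# helps locate the corrupt file.
spark.conf.set('spark.sql.files.ignoreCorruptFiles', 'true')
df = spark.read.parquet('/data/parquet_dir')
df.count()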

Resolving the "It is possible the underlying files have been updated" issue in Databricks

val mnt_point_write = "/mnt/pnt"
ord_JsonDF.write.mode("overwrite").format("json").option("header", true).json(mnt_point_write + "/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 114.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27):
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am coming across the above issue. Can anyone please help me resolve it?
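As the message itself suggests, Spark's cached file listing for the path no longer matches what is on storage, which commonly happens when a job overwrites files that an earlier query has already read. A sketch of invalidating that cache before re-reading (the table name is hypothetical; the path is the directory from the trace):

# If the data is read through a table, refresh its cached metadata.
spark.sql('REFRESH TABLE my_table')

# If the data is read by path, refresh the cache for that path instead.
spark.catalog.refreshByPath('/mnt-02/AF_tab/Processed')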

Write parquet to S3 using Spark Scala: java.lang.NullPointerException

When I tried to write Parquet to S3 with Scala using the code below:
ab.write.mode("overwrite").parquet(path)
And the error shows:
org.apache.spark.SparkException: Job aborted. (Job aborted.) ......
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 112.0 failed 1 times, most recent failure: Lost task 1.0 in stage 112.0 (TID 6273, localhost): java.lang.NullPointerException
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:412)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
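This NullPointerException inside LocalDirAllocator.confChanged is commonly attributed to the S3A connector finding no usable local buffer directory (fs.s3a.buffer.dir defaults to a location under hadoop.tmp.dir, which may be unset). A sketch, written in PySpark for consistency with the rest of this page, of pointing it at a writable directory when the session is created (the directory itself is an assumption):

from pyspark.sql import SparkSession

# The spark.hadoop. prefix forwards the setting into the Hadoop
# Configuration that the S3A filesystem reads.
spark = (
    SparkSession.builder
    .config('spark.hadoop.fs.s3a.buffer.dir', '/tmp/s3a')
    .getOrCreate()
)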