'java.lang.OutOfMemoryError: Java heap space' error in spark application while trying to read the avro file and performing Actions - scala

The avro size is around 44MB.
Below is the yarn logs error :
20/03/30 06:55:04 INFO spark.ExecutorAllocationManager: Existing executor 18 has been removed (new total is 0)
20/03/30 06:55:04 INFO cluster.YarnClusterScheduler: Cancelling stage 5
20/03/30 06:55:04 INFO scheduler.DAGScheduler: ResultStage 5 (head at IrdsFIInstrumentEnricher.scala:15) failed in 213.391 s due to Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Driver stacktrace:
20/03/30 06:55:04 INFO scheduler.DAGScheduler: Job 3 failed: head at IrdsFIInstrumentEnricher.scala:15, took 213.427308 s
20/03/30 06:55:04 ERROR CCOIrdsEnrichmentService: Unexpected error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
...
20/03/30 06:48:19 INFO storage.DiskBlockManager: Shutdown hook called
20/03/30 06:48:19 INFO util.ShutdownHookManager: Shutdown hook called
LogType:stdout
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:124
Log Contents:
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill %p"
Executing /bin/sh -c "kill 62191"...
LogType:container-localizer-syslog
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:0
Log Contents:
Below is the code I am using:
val fiDF = spark.read
  .format("com.databricks.spark.avro")
  .load("C:\\Users\\kativikb\\Downloads\\Temp\\cco-irds\\rds_db_global_rds_fi-instrument_20200328000000_v1_block3_snapshot-inc.avro")
  .limit(1)
val tempDF = fiDF.select("payload.identifier.id")
tempDF.show(10) // ******* Error at this line *******

This was because the Avro schema was too large, and I was using Spark 2.1.0, which apparently has a bug with very large schemas; this has been fixed in 2.4.0.
I solved the error by supplying my own custom schema that contains only the fields I actually need.
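For reference, a minimal sketch of the kind of pruned schema I mean. The nested structure below is only an assumption inferred from the payload.identifier.id select, the path is a placeholder, and whether com.databricks.spark.avro honours a user-supplied schema depends on the package version:

import org.apache.spark.sql.types._

// Sketch: keep only the fields the job actually reads instead of the full Avro schema.
// The nesting payload -> identifier -> id mirrors the select above; the types are assumed.
val requiredSchema = StructType(Seq(
  StructField("payload", StructType(Seq(
    StructField("identifier", StructType(Seq(
      StructField("id", StringType, nullable = true)
    )), nullable = true)
  )), nullable = true)
))

val fiDF = spark.read
  .format("com.databricks.spark.avro")
  .schema(requiredSchema)               // avoids materialising the huge inferred schema
  .load("/path/to/snapshot.avro")       // placeholder path
  .limit(1)

fiDF.select("payload.identifier.id").show(10)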

Related

Azure Databricks: Error, Specified heap memory (4096MB) is above the maximum executor memory (3157MB) allowed for node type Standard_F4

I keep getting org.apache.spark.SparkException: Job aborted when I try to save my flattened JSON file to Azure Blob as CSV. Some answers I have found recommend increasing the executor memory, which I have done here:
I get this error when I try to save the config:
What do I need to do to solve this issue?
EDIT
Adding part of the stacktrace that is causing org.apache.spark.SparkException: Job aborted. I have also tried with and without coalesce when saving my flattened dataframe:
ERROR FileFormatWriter: Aborting job 0d8c01f9-9ff3-4297-b677-401355dca6c4.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 4 times, most recent failure: Lost task 0.3 in stage 79.0 (TID 236) (10.139.64.7 executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
I experienced a similar error when setting spark.executor.memory to 4g on a cluster with a comparable worker node.
The cause is that the executor memory limit on this node type is about 3 GB, while you are asking for 4 GB, as the error message suggests.
Resolution:
Set spark.executor.memory to less than 3 GB, or
select a bigger worker type such as Standard_F8, Standard_F16, etc.
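A minimal sketch of the first option, assuming the session is built in code; on Databricks this is normally set in the cluster's Spark config rather than in the notebook, and 2g is only an example value below the ~3157 MB cap:

import org.apache.spark.sql.SparkSession

// Sketch: keep the executor heap below the per-node limit reported in the error.
// "2g" is an example value, not a tuned recommendation.
val spark = SparkSession.builder()
  .config("spark.executor.memory", "2g")
  .getOrCreate()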

Spark Error - Exit status: 143. Diagnostics: Container killed on request

I am getting the error below:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 653 in stage 7.0 failed 4 times, most recent failure: Lost task 653.3 in stage 7.0 (TID 27294, ip-10-0-57-16.ec2.internal, executor 34): ExecutorLostFailure (executor 34 exited caused by one of the running tasks) Reason: Container marked as failed: container_1602898457220_0001_01_000370 on host: ip-10-0-57-16.ec2.internal. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
My data set is 80 GB.
The operation I did was creating some squared and interaction features, so the number of columns roughly doubled.
I am using 20 m4.16xlarge instances (64 CPUs, 256 GB each, https://aws.amazon.com/ec2/instance-types/) with
spark.yarn.executor.memoryOverhead = '16384'
Can I do something to fix this? And why did I get an OOM error even though my dataset is much smaller than the total memory of my instances?
I increased the following two parameters and avoided the error:
spark.default.parallelism = '128'
spark.executor.cores = '16'
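A rough sketch of how those two settings could be applied when building the session; the values are the ones from the answer, not universally correct numbers, and spark.executor.cores is usually most effective when set at submit time:

import org.apache.spark.sql.SparkSession

// Sketch: apply the two settings mentioned above; tune the values for your own cluster.
val spark = SparkSession.builder()
  .config("spark.default.parallelism", "128")
  .config("spark.executor.cores", "16")
  .getOrCreate()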

How to resolve error while reading parquet files

While reading parquet files in PySpark I'm getting the following error.
How can I resolve it?
Error :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 131.0 failed 4 times, most recent failure: Lost task
0.3 in stage 131.0 (TID 16100, 10.107.97.154, executor 56): org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=; isDirectory=false; length=32203696; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

How to resolve the "It is possible the underlying files have been updated" error in Databricks

val mnt_point_write="/mnt/pnt"
ord_JsonDF.write.mode("overwrite").format("json").option("header",true).json(mnt_point_write+"/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 114.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27):
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am coming across the above issue. Can anyone help in resolving it, please?
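The error text itself points at the usual remedy; a minimal sketch of both variants, where my_table is a placeholder name and the path is taken from the error message:

// Sketch: invalidate Spark's cached file listing before re-reading the data.
// Use REFRESH TABLE when the data is registered as a table (my_table is a placeholder)...
spark.sql("REFRESH TABLE my_table")

// ...or refresh by path when the files are read directly from the mount.
spark.catalog.refreshByPath("/mnt-02/AF_tab/Processed")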

Error reading rdd locally using pyspark

I have an RDD stored locally and I am using a Jupyter notebook with PySpark. I tried to load the RDD but it crashed with a file-not-found error, although the files are there. Do you know the reason?
a=sc.textFile('file:///myfile/test5.rdd')
Error message
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 303, abc.com, executor 3):
java.io.FileNotFoundException: File file:/myfile/test5.rdd/part-00000 does not exist