Getting "File ready exists" error in AWS Glue when writing a dynamic frame to Redshift - amazon-redshift

I have a system with over 20 ETL jobs. I'm adding a new job which is similar in concept to the others, so I copied and modified the code. At the end, I write a dynamic frame to a Redshift table. The frame has about 130,000 records. Here is the error:
2020-07-14 00:06:11,563 ERROR [Thread-9] datasources.FileFormatWriter (Logging.scala:logError(91)) - Aborting job 1e89bbc9-ffe6-41a9-9bbc-f015cf6da2a5.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 114 in stage 55.0 failed 4 times, most recent failure: Lost task 114.3 in stage 55.0 (TID 2631, ip-10-204-19-115.us-west-2.compute.internal, executor 4): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://c2l-dwh-temp/dwh_dev_file_download_fact/3c9d8ef3-c831-4642-9e54-2d2c6f24e155/part-00114-5aacdd19-3f62-4ee8-bf53-ab17d117dd0f-c000.csv
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:810)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:212)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:177)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The temp directory was empty at the start of the Glue job. Now it has a bunch of folders, some with lots of CSV files. I've never seen this error with my other similar jobs.
Any pointers?
Thanks

Found the problem. Further down in the traceback was a reference to my code, where I had an index error. The PySpark task kept retrying, which is why it got the "file already exists" error; the retries could never succeed because of the coding error in my user-defined function.
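For anyone who hits the same thing: the FileAlreadyExistsException is only the symptom of a task retry (the first attempt already wrote its part file to the temp location); the real cause sits further down in the traceback. As a minimal sketch of that failure mode and the defensive fix, here is an unguarded index access inside a UDF and a bounds-checked version (in Scala rather than the original PySpark; the column name and delimiter are made up):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("udf-retry-demo").getOrCreate()
import spark.implicits._

// Throws ArrayIndexOutOfBoundsException for any row with fewer than three fields;
// Spark then retries the task, and the retry fails with "File already exists".
val brittle = udf((s: String) => s.split(",")(2))

// Defensive version: check the length before indexing.
val safe = udf((s: String) => {
  val parts = s.split(",")
  if (parts.length > 2) parts(2) else null
})

val df = Seq("a,b,c", "a,b").toDF("raw")
df.select(safe($"raw").as("third_field")).show()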

Related

Delta Lake: File Not Found Exception

I am using Delta Lake to perform a merge operation, for which I am trying to convert my Parquet files, which are partitioned by time, to Delta format:
val source = spark.read.parquet("s3a://data-lake/source/")
source
.write
.option("maxRecordsPerFile",20000)
.mode("overwrite")
.partitionBy("time")
//.option("fs.s3a.committer.name", "partitioned") (I even tried using s3a committers)
.format("delta")
.save("s3a://data-lake/target/")
The data is around 250 GB and my Spark configs are:
spark.cores.max 420
spark.default.parallelism 10000
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.driver.extraJavaOptions -Xms20g
spark.driver.memory 28g
spark.executor.cores 2
spark.executor.memory 28G
The logs show a FileNotFoundException, and executors are eventually killed after the job runs for some time:
20/05/16 11:51:02 WARN TaskSetManager: Lost task 208.0 in stage 2.0 (TID 3294, 172.16.145.25, executor 14): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://data-lake/target/time=20190101/part-00208-2c9b5ddd-f2c1-4b8c-9d77-eacb0055ff82.c000.snappy.parquet
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
... 53 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried saving it in plain Parquet format, and that works as expected, but it takes significantly more time to process.
If the data is already in Parquet, you can copy the folder in S3 to the target location without Spark and convert it to a Delta table in place:
import io.delta.tables.DeltaTable
// "time string" assumes the partition column is a string; adjust to its actual type
DeltaTable.convertToDelta(spark, "parquet.`s3a://data-lake/target/`", "time string")
More info here.
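Once the target path holds a Delta table, the merge mentioned in the question can run against it through the Delta Scala API; a rough sketch, where the updates DataFrame and the id/time join keys are placeholders for whatever the real merge condition is:
import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, "s3a://data-lake/target/")
target.as("t")
  .merge(updates.as("s"), "t.id = s.id AND t.time = s.time")  // placeholder keys
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()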

Out of memory exception or worker node lost during the Spark Scala job

I am executing a Spark Scala job using spark-shell, and the problem I am facing is at the end of the final stage and final mapper: in stage 5, for example, it allocates 50 tasks and completes 49 very quickly, then the 50th task runs for 5 minutes, reports out of memory, and fails. I am using SPARK_MAJOR_VERSION=2.
I am using the below command
spark-shell --master yarn --conf spark.driver.memory=30G --conf spark.executor.memory=40G --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=false --conf spark.sql.broadcastTimeout=36000 --conf spark.shuffle.compress=true --conf spark.executor.heartbeatInterval=3600s --conf spark.executor.instances=160
With the above configuration I have also tried dynamic allocation set to true, and started the driver and executor memory from 1 GB. I have 6.78 TB of RAM and 1300 vcores overall (this is my entire Hadoop hardware).
The table I am reading is 40 GB and I am joining 6 tables to it, so overall it might be 60 GB. Spark initializes 4 stages for this, and it fails at the end of the final stage. I am using Spark SQL to execute the query.
Below are the errors:
19/04/26 14:29:02 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 125967 ms exceeds timeout 120000 ms
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 2 on worker03.some.com: Executor heartbeat timed out after 125967 ms
19/04/26 14:29:02 WARN TaskSetManager: Lost task 5.0 in stage 2.0 (TID 119, worker03.some.com, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125967 ms
19/04/26 14:29:02 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 126225 ms exceeds timeout 120000 ms
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 1 on ncednhpwrka0008.devhadoop.charter.com: Executor heartbeat timed out after 126225 ms
19/04/26 14:29:02 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e1223_1556277056929_0976_01_000003 on host: worker03.some.com. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e1223_1556277056929_0976_01_000003
Exit code: 52
Shell output: main : command provided 1
main : run as user is svc-bd-xdladmrw-dev
main : requested yarn user is svc-bd-xdladmrw-dev
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data/00/yarn/local/nmPrivate/application_1556277056929_0976/container_e1223_1556277056929_0976_01_000003/container_e1223_1556277056929_0976_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 52. Last 4096 bytes of stderr :
0 in stage 2.0 (TID 119)
19/04/26 14:27:37 INFO HadoopRDD: Input split: hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2:0+6623042
19/04/26 14:27:37 INFO OrcRawRecordMerger: min key = null, max key = null
19/04/26 14:27:37 INFO ReaderImpl: Reading ORC rows from hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2 with {include: [true, true, true], offset: 0, length: 9223372036854775807}
19/04/26 14:29:00 ERROR Executor: Exception in task 5.0 in stage 2.0 (TID 119)
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 119,5,main]
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 INFO DiskBlockManager: Shutdown hook called
19/04/26 14:29:00 INFO ShutdownHookManager: Shutdown hook called
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 2 on worker03.some.com: Container marked as failed: container_e1223_1556277056929_0976_01_000003 on host: worker03.some.com. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e1223_1556277056929_0976_01_000003
Exit code: 52
Shell output: main : command provided 1
main : run as user is svc-bd-xdladmrw-dev
main : requested yarn user is svc-bd-xdladmrw-dev
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data/00/yarn/local/nmPrivate/application_1556277056929_0976/container_e1223_1556277056929_0976_01_000003/container_e1223_1556277056929_0976_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 52. Last 4096 bytes of stderr :
0 in stage 2.0 (TID 119)
19/04/26 14:27:37 INFO HadoopRDD: Input split: hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2:0+6623042
19/04/26 14:27:37 INFO OrcRawRecordMerger: min key = null, max key = null
19/04/26 14:27:37 INFO ReaderImpl: Reading ORC rows from hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2 with {include: [true, true, true], offset: 0, length: 9223372036854775807}
19/04/26 14:29:00 ERROR Executor: Exception in task 5.0 in stage 2.0 (TID 119)
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 119,5,main]
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 INFO DiskBlockManager: Shutdown hook called
19/04/26 14:29:00 INFO ShutdownHookManager: Shutdown hook called
Can anyone let me know if I am doing anything wrong here, like the memory allocation or something? Please suggest any alternatives to complete this job without getting the out-of-memory exception or the worker-node-lost error. Any help or info is greatly appreciated.
Thanks!
at the end of the final stage and final mapper like in stage 5 it allocates 50 and completed 49 very quickly and at the 50th it takes 5 minutes and says that out of memory and fails.
The table I am reading is 40GB and I am joining 6 tables to that 40GB table
It sounds like skewed data to me: most of the keys used for joining fall into one partition, so instead of spreading the work among multiple executors, Spark uses just one and overloads it. That hurts both memory consumption and performance.
There are a few ways to deal with it (a salted-join sketch follows these links):
Skewed dataset join in Spark?
How to repartition a dataframe in Spark scala on a skewed column?
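As a rough illustration of the salting technique those answers describe (a sketch only: bigDf, smallDf and join_key are placeholders, and the bucket count has to be tuned to the actual skew):
import org.apache.spark.sql.functions._

val saltBuckets = 32  // assumed value; tune to the observed skew

// Add a random salt to the large, skewed side so the hot key spreads across many partitions.
val bigSalted = bigDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the smaller side once per salt value so every salted key still finds its match.
val smallSalted = smallDf.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Join on the original key plus the salt, then drop the helper column.
val joined = bigSalted.join(smallSalted, Seq("join_key", "salt")).drop("salt")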

Can't Consume Messages When Using Kafka v.0.10.0.x

I'm using Kafka v0.10.2 on my cluster.
I can produce messages fine using clients v0.8.x and v0.10.2,
BUT when consuming messages using client v0.10.0.x I get the errors below:
WARN [ConsumerFetcherThread-console-consumer-myconsumer-0-1002], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest#16090d7a. Possible cause: java.nio.BufferUnderflowException (kafka.consumer.ConsumerFetcherThread)
OK, now my Kafka client is v0.8.x,
but I have a new problem:
6 15:07:13 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoop11, executor 10): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: kafka.common.KafkaException: String exceeds the maximum size of 32767.
at kafka.api.ApiUtils$.shortStringLength(ApiUtils.scala:73)
at kafka.api.TopicData$.headerSize(FetchResponse.scala:107)
at kafka.api.TopicData.<init>(FetchResponse.scala:113)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:103)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:196)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:212)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
What should I do about this?
String exceeds the maximum size of 32767.
As for my final plan: I upgraded the client version to 0.10.2.
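If it helps anyone else: the 0.8-era SimpleConsumer path visible in the stack trace (kafka.consumer / kafka.api) does not speak the 0.10 fetch protocol, so with an upgraded client the usual route in Spark Streaming is the spark-streaming-kafka-0-10 direct stream. A rough sketch, with the broker list, group id and topic name as placeholders:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-010-consumer")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // placeholder broker list
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "myconsumer",                       // placeholder group id
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Direct stream via the new consumer API, so no 0.8 fetch-protocol mismatch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("mytopic"), kafkaParams))

stream.map(record => (record.key, record.value)).print()
ssc.start()
ssc.awaitTermination()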

Failed to get broadcast_22_piece0 of broadcast_22

When I run a Scala application on a Spark cluster in YARN mode (Spark version 2.2.0), the application uses the Pregel model and each vertex in the data graph sends messages. The exception information is as follows:
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Task 29 in stage 25.0 failed 4 times,
most recent failure: Lost task 29.3 in stage 25.0 (TID 1632, 192.168.1.5, executor 1): java.io.IOException: org.apache.spark.SparkException:
Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 12 more
I have searched for the exception online, and one of the suggestions is adding the statement .set("spark.cleaner.ttl","2000"),
but it still does not work.
Can you help me? Thanks a lot.
Some snippets that may cause the above exception are as follows:
val spark = SparkSession.builder.master("spark://node01.:7077").appName("ioce").getOrCreate()
.......
and in the program, DataFrame joins are used (which, from what I read online, may also be relevant to the exception).

Spark exception after re-submitting stopped application

I'm running a Spark job (from a Spark notebook) using dynamic allocation with the options
"spark.master": "yarn-client",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.executorIdleTimeout": "30s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "1h",
"spark.dynamicAllocation.minExecutors": "0",
"spark.dynamicAllocation.maxExecutors": "20",
"spark.executor.cores": 2
(Note: I'm not sure yet whether the issue is caused by dynamicAllocation or not)
I'm using Spark version 1.6.1.
If I cancel a running job/app (either by pressing the cancel button on the cell in the notebook, or by shutting down the notebook server and thus the app) and restart the same app shortly (some minutes) after, I often get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, i89810.sbb.ch): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 11 more
Using the YARN ResourceManager, I verified that the old job was no longer running before re-submitting the job. Still, I suppose the problem arises because the killed job is not yet fully cleaned up and interferes with the newly launched job?
Has anybody encountered the same issue and figured out how to solve it?
You need to set up the external shuffle service when dynamic allocation is enabled; otherwise shuffle files are deleted when executors are removed, which is why the Failed to get broadcast_3_piece0 of broadcast_3 exception is thrown.
For more information, see the official Spark documentation on Dynamic Resource Allocation. The settings involved are sketched below.
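For reference, a minimal sketch of that setup (property names as given in the Spark-on-YARN documentation; jar paths and versions differ per cluster): put the spark-<version>-yarn-shuffle.jar on every NodeManager's classpath, register the aux-service in yarn-site.xml, restart the NodeManagers, and keep the Spark-side flags you already have:
yarn.nodemanager.aux-services mapreduce_shuffle,spark_shuffle
yarn.nodemanager.aux-services.spark_shuffle.class org.apache.spark.network.yarn.YarnShuffleService
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true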