java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Error while writing to Parquet File - scala

When I am trying to write the data to Parquet file I am facing below mentioned error. I read post about if two Parquet files have different datatypes then we will see this error. But I tried individually casting all the columns in the dataframe also I am trying to write to a new directory that doesn't have any files.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 787 in stage 76.0 failed 4 times, most recent failure: Lost task 787.3 in stage 76.0 (TID 77007) (100.100.191.241 executor 145): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:400)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:87)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)

Related

How to resolve error while reading parquet files

While reading parquet files in pyspark I'm getting following error.
How can I resolve it.
Error :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 131.0 failed 4 times, most recent failure: Lost task
0.3 in stage 131.0 (TID 16100, 10.107.97.154, executor 56): org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=; isDirectory=false; length=32203696; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

resolve the issue of possible the underlying files have been updated in data bricks

val mnt_point_write="/mnt/pnt"
ord_JsonDF.write.mode("overwrite").format("json").option("header",true).json(mnt_point_write+"/Processed_file")
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 114.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 114.0 (TID 13876, 10.139.64.40, executor 27):
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt-02/AF_tab/Processed/YYYY=2019/MM=07/DD=21/part-00000-tid-677839983764655717-ahfuhaufhehfhurfawefkjfaffadfe-2685-1.c000.csv.
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Coming across the above issue. Can anyone help in resolving the issue please.

Failed to get broadcast_22_piece0 of broadcast_22

when I run Scala application on Spark cluster in yarn mode(spark version 2.2.0),the application is using the pregel model, each vertex in the data graph sends message. the Exception information as follows:
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Task 29 in stage 25.0 failed 4 times,
most recent failure: Lost task 29.3 in stage 25.0 (TID 1632, 192.168.1.5, executor 1): java.io.IOException: org.apache.spark.SparkException:
Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_22_piece0 of broadcast_22
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:222)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 12 more
I have searched the exception online, and one of the suggestions is adding statement .set("spark.cleaner.ttl","2000")
but it still does not work well.
Can you help me? thanks a lot.
some sinnpets that may cause the above exception as follows:
val spark = SparkSession.builder.master("spark://node01.:7077").appName("ioce").getOrCreate()
.......
and in the program, joining dataframe is used(which I looked through online warns that may be also relevant to the exception).

Write parquet to S3 using Spark Scala: java.lang.NullPointerException

When I tried to write parquet to s3 with Scala using the code below:
ab.write.mode("overwrite").parquet(path)
And the error shows:
org.apache.spark.SparkException: Job aborted. (Job aborted.) ......
rg.apache.spark.SparkException: Job aborted due to stage failure: Task
1 in stage 112.0 failed 1 times, most recent failure: Lost task 1.0 in
stage 112.0 (TID 6273, localhost): java.lang.NullPointerException at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:412)
at
org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at
org.apache.hadoop.fs.s3a.S3AOutputStream.(S3AOutputStream.java:87)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905) at
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886) at
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)

Error while trying to save the data to Hive tables from dataframe

We have the following issue when we try to insert data into Hive table.
Job aborted due to stage failure: Task 5 in stage 65.0 failed 4
times, most recent failure: Lost task 5.3 in stage 65.0 (TID 987,
tnblf585.test.sprint.com): java.lang.ArrayIndexOutOfBoundsException:
45 at
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.genericGet(rows.scala:254)
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
at
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.isNullAt(rows.scala:248)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:107)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:104)
at scala.collection.Iterator$class.foreach(Iterator.scala:727) at
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
I have figured it out that one of the column name in dataframe and hive table are not same , after the column name correction it has loaded correctly