java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file by pyspark - pyspark

I have a large parquet file (~5GB) and I want to load it in spark. The following command executes without any error:
df = spark.read.parquet("path/to/file.parquet")
But when I try to do any operation like .show() or .repartition(n) I run into the following error:
java.lang.IllegalArgumentException: Illegal Capacity: -102
any ideas on how I can fix this?

It's an integer overflow bug in the underlying parquet reader. https://issues.apache.org/jira/browse/PARQUET-1633
Upgrade PySpark to 3.2.1. The jar file parquet-hadoop-1.12.2 contains the code/actual fix.

Related

ERROR AzureNativeFileSystemStore: DirectoryIsNotEmpty

I am trying to execute this code in Azure HdInsigth. I have a cluster Spark that is connected with Data Lake Storage.
spark.conf.set(
"fs.azure.sas.data.spmdevsharedstorage.blob.core.windows.net",
"xxxxxxxxxxx key xxxxxxxxxxx"
)
val shared_data = "wasbs://data#spmdevsharedstorage.blob.core.windows.net/"
//Read Csv
val dfCsv = spark.read.option("inferSchema", "true").option("header", true).csv(shared_data + "/test/4G-pixel.csv")
val dfCsv_final_withcolumn = dfCsv.select($"latitude",$"longitude")
val dfCsv_final = dfCsv_final_withcolumn.withColumn("new_latitude",col("latitude")*100)
//write
dfCsv_final.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(shared_data + "/test/4G-pixel_edit.csv")
The code reads the csv file well. So, when write the new file csv I see the following error:
20/04/03 14:58:12 ERROR AzureNativeFileSystemStore: Encountered Storage Exception for delete on Blob: https://spmdevsharedstorage.blob.core.windows.net/data/test/4G-pixel_edit.csv/_temporary/0, Exception Details: This operation is not permitted on a non-empty directory. Error Code: DirectoryIsNotEmpty
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2627)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2637)
The new file csv is written to the Data Lake but the code stops. I need you to not see this error.
How can I fix it?
I faced a similar issue.
I resolved it by using the below configuration.. set this to true.
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
or
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped","true")

IOException while overwriting parquet

I have parquet file let's say file name abc/A.parquet and few records are filtered out based on certain condition and create DF and I am trying overwrite file with resulted filtered DF with saveMode overwrite option, but throwing below exception:
command used to overwrite
filterDF.coalesce(1).write.mode("overwrite").parquet("file:/home/psub2/cls_parquet2/file:/home/psub7/abc/A.parquet")
failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/home/psub7/abc/A.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Pls help, Thanks in advance
Conceptually you can't read and write dataframe from same file. IOException thrown when you try to read df from file A and try write same df into the same file A. You can overwrite A parquet file only if you didn't read dataframe from file A.
For example you can read dataframe from file A and overwrite file B.

pyspark : AnalysisException when reading csv file

I am new to pyspark . I am migrating my project to pyspark . I am trying to read csv file from S3 and create df out of it. file name is assigned to variable cfg_file and I am using key variable for reading from S3. I am able to do same using pandas but get AnalysisException when I read using spark . I am using boto lib for S3 connection
df = spark.read.csv(StringIO.StringIO(Key(bucket,cfg_file).get_contents_as_string()), sep=',')
AnalysisException: u'Path does not exist: file:

Scala Spark - Overwrite parquet File on HDFS

I was trying to append the data frame to existing parquet file found option to have the saveMode to append. But when I was trying to append it throws the error it was not the directory.
data.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest");
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user/root/AppendTest":root:root:-rw-r--r-- (Ancestor /user/root/AppendTest is not a directory).
P.S: While was creating the new file it was generated to the folder and then I have renamed to the desired file.
I have checked How to overwrite the output directory in spark but that doesn't solve my problem here. I have tried the ways mentioned in the above questions(issue mentioned is also different).

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input Files Directory: hdfs:///user/learner/20151223/.lzo*
Note : The Input Avro Files are lzo compressed.
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo");
When I run the above command.It throws an error :
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This make sense,because the method read.avro() is expecting .avro extension files as input.So I extract and rename the input .lzo file to .avro.I am able to read the data in avro file properly.
Is there any way to read lzo compressed Avro files in spark ?
Solution worked, But !
I have found a way to solve this issue. I created a shell wrapper in which I have decompressed the .lzo into .avro file format using following way:
hadoop fs -text <file_path>*.lzo | hadoop fs - put - <file_path>.avro
I am successfull in decompressing lzo files but the problem is I am having atleast 5000 files in compressed format.Uncompressing and Converting one by one is taking nearly 1+ hours to run this Job.
Is there any way to do this Decompression in bulk way ?
Thanks again !