Reading a .parquet file from Spark throws an exception when the file is empty - pyspark

While reading a .parquet file as below:
src = spark.read.parquet(filePath)
when the file is empty, it throws the error below:
filename.parquet is not a Parquet file (too small length: 0)
How can I check whether the file is empty before reading it?
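One way to guard against this is to check the file length through the same Hadoop FileSystem that Spark reads from, and skip the read when it is zero. A minimal sketch, not an official API: the helper name is made up, and it relies on PySpark's internal _jvm/_jsc handles to reach the Hadoop classes.

def read_parquet_if_nonempty(spark, file_path):
    # Return a DataFrame, or None if the file is missing or has zero length.
    path = spark._jvm.org.apache.hadoop.fs.Path(file_path)
    fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
    if not fs.exists(path) or fs.getFileStatus(path).getLen() == 0:
        return None
    return spark.read.parquet(file_path)

src = read_parquet_if_nonempty(spark, filePath)

Alternatively, wrapping the read in a try/except and treating this particular failure as "empty" avoids touching the Hadoop API at all.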

Related

java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file by pyspark

I have a large parquet file (~5GB) and I want to load it in Spark. The following command executes without any error:
df = spark.read.parquet("path/to/file.parquet")
But when I try to do any operation like .show() or .repartition(n), I run into the following error:
java.lang.IllegalArgumentException: Illegal Capacity: -102
Any ideas on how I can fix this?
It's an integer overflow bug in the underlying Parquet reader: https://issues.apache.org/jira/browse/PARQUET-1633
Upgrade PySpark to 3.2.1; the bundled parquet-hadoop-1.12.2 jar contains the actual fix.
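A quick way to confirm the upgrade took effect, as a sketch (the version numbers come from the answer above):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Spark 3.2.1 bundles parquet-hadoop 1.12.2, which carries the PARQUET-1633 fix.
print(pyspark.__version__)   # expect 3.2.1 or later
print(spark.version)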

Scala Spark - Overwrite parquet File on HDFS

I was trying to append a data frame to an existing parquet file and found the option of setting the save mode to Append. But when I try to append, it throws an error saying the target is not a directory.
data.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest");
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user/root/AppendTest":root:root:-rw-r--r-- (Ancestor /user/root/AppendTest is not a directory).
P.S.: When I originally created the file, it was generated inside a folder, and I then renamed the part file to the desired file name.
I have checked "How to overwrite the output directory in spark", but that doesn't solve my problem here. I have tried the approaches mentioned in that question (and the issue described there is also different).
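For context, append mode expects the target path to be a Spark-managed directory of part files; once the output has been renamed to a plain file, later appends hit exactly the "is not a directory" error above. A minimal sketch of the directory-based pattern, shown in PySpark to match the other sketches in this thread (the df here is only a stand-in for the data frame in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # stand-in for the data frame in the question

# Keep /user/root/AppendTest as a directory and let Spark manage the part files inside it.
df.coalesce(1).write.mode("append").parquet("/user/root/AppendTest")

If a single named file is really needed, it is safer to produce it as a separate export step than to rename the directory Spark appends into.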

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory; I know this because when I run "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using sparkContext.addFile("out.las") and then access its location using val location = SparkFiles.get("out.las"), but this didn't work either.
I even ran val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column "x" exists because I've downloaded some of the files from HDFS, decompressed them locally, and run some tests.
I need to decompress the files one by one because I have 1.6 TB of data, so I cannot decompress it all in one go and access the files later.
Can anyone tell me what I can do to access files that are written to the worker node directory? Or should I be doing this some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it through the SQL context from HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below, but it should give an idea of what to do afterward: size the byte array from the file status before reading, otherwise readFully has nothing to fill.
// Allocate a buffer the size of the file (assumes it fits in an Int-sized array), then read it fully.
val path = new Path(filePath + "out.las")
val readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(path)
fileIn.readFully(0, readIn)
fileIn.close()

Problems with non-"UTF-8" file collection using flume - Spooldir type

My Flume spool directory contains non-"UTF-8" files.
So I get a java.nio.charset.MalformedInputException when I try to collect them.
Changing the encoding option in the .conf file also causes an error.
And I have to use the spooldir source type.
How can I collect non-"UTF-8" files?
The encoding of our log files was Latin-5 (which is Turkish).
I fixed it by adding the line below to the conf file:
AGENTNAME.sources.SOURCENAME.inputCharset = ISO-8859-9

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input files directory: hdfs:///user/learner/20151223/*.lzo
Note: the input Avro files are LZO compressed.
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo");
When I run the above command, it throws an error:
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This makes sense, because the read.avro() method expects files with a .avro extension as input. So I extracted and renamed the input .lzo file to .avro, and I was able to read the data in the Avro file properly.
Is there any way to read LZO-compressed Avro files in Spark?
Solution worked, but!
I have found a way to solve this issue. I created a shell wrapper in which I decompress the .lzo files into .avro format as follows:
hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro
I am able to decompress the LZO files successfully, but the problem is that I have at least 5000 files in compressed format, and uncompressing and converting them one by one takes over an hour to run this job.
Is there any way to do this decompression in bulk?
Thanks again!
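One way to do the conversion in bulk, as a sketch: run the same hadoop fs -text | hadoop fs -put pipeline for many files in parallel from a small driver script. The file list and the worker count below are made-up placeholders; only the pipeline itself comes from the wrapper above.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def lzo_to_avro(src):
    # Reuse the shell pipeline from the wrapper above for a single .lzo file.
    dst = src[:-len(".lzo")] + ".avro"
    subprocess.check_call("hadoop fs -text {} | hadoop fs -put - {}".format(src, dst), shell=True)

# Hypothetical file list; in practice it would come from listing /user/learner/20151223/*.lzo.
lzo_files = ["/user/learner/20151223/part-0000{}.lzo".format(i) for i in range(5)]

# Each task mostly waits on the hadoop CLI, so a thread pool is enough to get parallelism.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lzo_to_avro, lzo_files))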