Error saving a parquet file with partitionBy - scala

I'm trying to save a parquet file in append mode and I'm running into an issue when doing this from a Windows system and then a Linux system. Consider the following code.
val df = Seq(
  (1, "test name1"),
  (2, "test name2"),
  (3, "test name3")).toDF("id", "name")
df.write.mode("append").partitionBy("name").parquet("D:\\path\\data.parquet")
When I run this code on a Windows system, I get the parquet file with three partitions, as expected.
Further, when I run this on a Linux system, it also works fine, except that the space character in the partition names is not encoded as %20.
Now, if I first create the parquet file (data.parquet) from Windows and then try to append to the same file from Linux, it creates three new partitions and also outputs an error saying:
java.io.FileNotFoundException: /path/data.parquet/_SUCCESS (Permission denied)
If I manually encode the space character before I append to the file from Linux, I get %2520 instead, because the % character itself gets encoded.
df.withColumn("newName", regexp_replace(col("name"), " ", "%20"))
  .drop("name")
  .withColumnRenamed("newName", "name")
  .write.mode("append").partitionBy("name").parquet("/path/data.parquet")
Any idea how to handle this and make it work so that both Windows and Linux can append to the same file? What I'm trying to do is encode the space character as %20 in the partition names when I save the file from Linux.
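For what it's worth, here is a hedged workaround sketch rather than a confirmed fix: Spark escapes special characters in partition directory names itself (which is why the pre-encoded % turns into %25), so pre-encoding the value only double-encodes it. If the exact %20 form is not strictly required, replacing the space with a character that needs no escaping, such as an underscore, should give identical directory names on both Windows and Linux. The underscore choice is my assumption, not something from the original post.
// Hypothetical workaround: normalize the partition value instead of URL-encoding it,
// so both operating systems produce the same partition directory names.
import org.apache.spark.sql.functions.{col, regexp_replace}

df.withColumn("name", regexp_replace(col("name"), " ", "_"))
  .write.mode("append")
  .partitionBy("name")
  .parquet("/path/data.parquet")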

Related

SciSpark: Reading from local folder instead of HDFS folder

I use an enterprise cluster that has both local and HDFS filesystems. The files I need to process are in netcdf format, and hence I use SciSpark to load them. On a workstation that has no HDFS, the code reads from the local folder. However, when HDFS folders are present, it attempts to read from HDFS only. As the size of the files present in the folder is huge (running cumulatively into hundreds of GBs to TBs), I have to copy them to HDFS, which is inefficient and inconvenient. The (Scala) code I use for loading the files is shown below:
val ncFilesRDD = sc.netcdfDFSFiles(ncDirectoryPath, List("x1", "x2", "x3"))
val ncFileCRDArrayRDD = ncFilesRDD.map(x => (
  x.variables.get("x1").get.data.toArray,
  x.variables.get("x2").get.data.toArray,
  x.variables.get("x3").get.data.toArray
))
I would very much appreciate any help in modifying the code so that it reads from a local directory instead of HDFS.
The source code comment for netcdfDFSFiles says that the files are read from HDFS, so I'm not sure you can use netcdfDFSFiles to read from a local directory.
But there is another function, netcdfFileList, whose documentation says: "The URI could be an OpenDapURL or a filesystem path."
Since it can take a filesystem path, you can use it like this:
val ncFilesRDD = sc.netcdfFileList("file:///your/path", List("x1", "x2", "x3"))
The file:// scheme will look at the local filesystem only.
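Assuming netcdfFileList returns the same record type as netcdfDFSFiles (I have not verified this against the SciSpark API), the mapping from the question should carry over unchanged:
// Sketch: same variable extraction as in the question, but reading from the local filesystem.
// Assumes netcdfFileList yields records with the same variables accessor as netcdfDFSFiles.
val ncFilesRDD = sc.netcdfFileList("file:///your/path", List("x1", "x2", "x3"))
val ncFileCRDArrayRDD = ncFilesRDD.map(x => (
  x.variables.get("x1").get.data.toArray,
  x.variables.get("x2").get.data.toArray,
  x.variables.get("x3").get.data.toArray
))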

pyspark - capture malformed JSON file name after load fails with FAILFAST option

To detect malformed/corrupt/incomplete JSON files, I have used the FAILFAST option so that the process fails. How do I capture the corrupted file name out of hundreds of files? I need to remove that file from the path and copy a good version of the file from the S3 bucket.
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
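One possible way to track down the offending files, sketched here in Scala rather than pyspark (the same options exist in both APIs): read with PERMISSIVE mode and a corrupt-record column instead of FAILFAST, then use input_file_name() to list the files that produced corrupt rows. The column name _corrupt_record is Spark's default; the cache() call is there because, since Spark 2.3, a query that references only the corrupt-record column is rejected unless the data is cached first. Treat this as a sketch, not a drop-in replacement for the FAILFAST pipeline.
// Hedged sketch: find which files contain malformed JSON instead of failing fast.
// loadPath stands in for table.load_path from the question.
import org.apache.spark.sql.functions.{col, input_file_name}

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(loadPath)
  .cache() // works around the restriction on querying only the corrupt-record column

// The _corrupt_record column is only added to the inferred schema when corrupt rows exist.
val badFiles = df
  .filter(col("_corrupt_record").isNotNull)
  .select(input_file_name().alias("file"))
  .distinct()

badFiles.show(truncate = false)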

Spark - how to skip or ignore empty gzip files when reading

I have a couple of hundred folders, each with some thousands of gzipped text files, in S3 and I'm trying to read them into a dataframe with spark.read.csv().
Among the files, there are some with zero length, resulting in the error:
java.io.EOFException: Unexpected end of input stream
Code:
df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz',sep='\t',schema=schema)
I've tried setting the mode to DROPMALFORMED and reading with sc.textFile() but no luck.
What's the best way to handle empty or broken gzip files?
Starting from Spark 2.1 you can ignore corrupt files by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:
--conf spark.sql.files.ignoreCorruptFiles=true
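If you prefer to set it from inside the application rather than on the command line, the same setting can be applied at runtime. A Scala sketch, assuming Spark 2.1+ and reusing the schema and path from the question:
// Same effect as the --conf flag, set programmatically on an existing SparkSession.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Zero-length or truncated gzip files are then skipped instead of failing the whole read.
val df = spark.read
  .option("sep", "\t")
  .schema(schema)
  .csv("s3n://my-bucket/folder*/logfiles*.log.gz")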

Spark Streaming - textFileStream: Tmp file error

I have a Spark Streaming application that scans a declared directory via the textFileStream method. I have the batch interval set to 10 seconds; I process the new files and concatenate their content into bigger parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot get around.
Let's take the following scenario as an example:
The directory is empty most of the time, then around 100 small files arrive at the same moment the streaming batch triggers. The stream finds 70 files with a .json extension and 30 files with a _tmp suffix (still being created/downloaded). This obviously crashes my app, since the process tries to work with a _tmp file which has in the meantime been renamed to a fully created/downloaded .json file.
I've tried to filter the RDD with the following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD{ rdd=>
/.../
}
to process _tmp files and throw a "no such file" exception for *_tmp
Question
I've read about using a staging directory from which files should be moved once the create/download process finishes, but neither a copy nor a move operation (why is that? a copy creates a new file, after all...) triggers textFileStream, so the process ignores those files. Is there any other way to filter out those files and wait until they are complete before processing?
Env spec
The directory is on a Windows 8.1 file system; eventually this will be a Unix-like machine
Spark 1.5 for Hadoop 2.6
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside C:\foo, i.e. the data, line by line by default. If it finds a line (not a file name) ending with tmp, it removes that line from the stream.
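To filter by file name rather than by line content, one option (a sketch on my part, not something from the original answer) is to drop down from textFileStream to the underlying fileStream, which accepts a path filter:
// Sketch: same behaviour as textFileStream, but with a custom filter on the file path,
// so *_tmp files are never picked up by a batch in the first place.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val jsonData = ssc
  .fileStream[LongWritable, Text, TextInputFormat](
    "C:\\foo",
    (path: Path) => !path.getName.endsWith("_tmp"),
    newFilesOnly = true)
  .map { case (_, text) => text.toString }
Note that this only helps if the _tmp files are renamed to .json once complete; the rename should then make them visible to a later batch as new files.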

Spark coalesce loses file when program is aborted

In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But using the console and Hadoop's ls command, I noticed that while the coalesce is running, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, you get "no such file or directory". After the coalesce is done, the path is there again, along with the coalesced file.
This might happen because the coalesce needs to delete the existing file and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have any idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
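Not an authoritative answer, but a common pattern for this situation: SaveMode.Overwrite deletes the target path before the new data is written, which is why the path disappears while the job runs. A hedged sketch of one way around it, assuming HDFS paths and the Hadoop FileSystem API, is to write to a staging path first and only swap it into place once the write has finished:
// Sketch: write to a temporary location, then replace the final path in one step,
// so the previous output survives if the job is killed mid-write.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val finalPath = new Path(filePath)            // filePath as in the question
val stagingPath = new Path(filePath + "_tmp") // staging location is an assumption

dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(stagingPath.toString)

val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
fs.delete(finalPath, true)        // drop the old output only after the new one is complete
fs.rename(stagingPath, finalPath) // rename is a cheap metadata operation on HDFS
There is still a brief window between the delete and the rename, but the long-running write itself is no longer inside it.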