Scala Spark - overwrite parquet file failed to delete file or dir

I'm trying to create parquet files for several days locally. The first time I run the code, everything works fine. The second time, it fails to delete a file. The third time, it fails to delete another file. It's completely random which file cannot be deleted.
The reason I need this to work is that I want to create parquet files every day for the last seven days, so the parquet files that are already there should be overwritten with the updated data.
I use Project SDK 1.8, Scala version 2.11.8 and Spark version 2.0.2.
After running that line of code the second time:
newDF.repartition(1).write.mode(SaveMode.Overwrite).parquet(
OutputFilePath + "/day=" + DateOfData)
this error occurs:
WARN FileUtil:
Failed to delete file or dir [C:\Users\...\day=2018-07-15\._SUCCESS.crc]:
it still exists.
Exception in thread "main" java.io.IOException:
Unable to clear output directory file:/C:/Users/.../day=2018-07-15
prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
After the third time:
WARN FileUtil: Failed to delete file or dir
[C:\Users\day=2018-07-20\part-r-00000-8d1a2bde-c39a-47b2-81bb-decdef8ea2f9.snappy.parquet]: it still exists.
Exception in thread "main" java.io.IOException: Unable to clear output directory file:/C:/Users/day=2018-07-20 prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
As you can see, it's a different file than the one from the second run.
And so on. After deleting the files manually, all parquet files can be created.
Does somebody know this issue and how to fix it?
Edit: it's always a .crc file that can't be deleted.

Thanks for your answers. :)
The solution is not to write to the Users directory. There seems to be a permission problem. So I created a new folder directly under C: and it works perfectly.
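For reference, a minimal sketch of that workaround (the output root C:/spark-output is a hypothetical example; any folder outside C:\Users should behave the same way):

import org.apache.spark.sql.SaveMode

// Hypothetical output root outside the Users directory
val OutputFilePath = "C:/spark-output/parquet"

newDF.repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .parquet(OutputFilePath + "/day=" + DateOfData)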

This problem occurs when you have the destination directory open in Windows. You just need to close the directory.

Perhaps another Windows process has a lock on the file so it can't be deleted.

Related

PostgreSQL 14.5 pg_read_binary_file could not open file for reading: Invalid argument

Yesterday I installed PostgreSQL 14.5 on a Windows 10 laptop.
I then ran an old script to load images into a table.
The script uses the pg_read_binary_file function.
Some of the images are .jpg files and some are .png files.
Of the 34 files, only 5 were successfully processed (1 .jpg and 4 .png). The other 29 failed with the following error:
[Exception, Error code 0, SQLState XX000] ERROR: could not open file "file absolute path" for reading: Invalid argument
For instance, the following statement executes without errors
select pg_read_binary_file('C:\Users\Jorge\OneDrive\Documents\000\020-logos\adalid.png') as adalid_png;
... and the following statement fails
select pg_read_binary_file('C:\Users\Jorge\OneDrive\Documents\000\020-logos\oper.png') as oper_png;
... with the following error message
[Exception, Error code 0, SQLState XX000] ERROR: could not open file "C:/Users/Jorge/OneDrive/Documents/000/020-logos/oper.png" for reading: Invalid argument
So far, I have not been able to identify any difference in the files that could be the cause of the error. Also, I'm pretty sure the script works on earlier releases of version 14. Unfortunately I have not been able to find a website to download any of those earlier releases to test it again.
Has anyone else found this problem, and its solution?
I think the issue is somehow caused by OneDrive. This laptop is new. When I logged in with my Microsoft account, the OneDrive directory was automatically created and updated. Apparently this operation only updates the directory entries, leaving the contents of the files in the cloud until they are opened. When I zipped the directory that contains all my images, a message from OneDrive appeared saying that it would restore some files at that moment. After that, all the commands in my scripts work.
My theory is that pg_read_binary_file gets the file entry from the directory, so it doesn't give the "No such file or directory" message; but then fails reading the contents, giving the "Invalid argument" message instead.
The unanswered question would be: why does 7-Zip make OneDrive restore the files but pg_read_binary_file does not?
UPDATE
After more testing, and reading Save disk space with OneDrive Files On-Demand for Windows, I am now sure that pg_read_binary_file can fail with the "Invalid argument" message when the OneDrive file is not locally available. In Windows File Explorer such a file has a blue cloud icon next to it.
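A small sketch (in Scala, matching the rest of this page) of the distinction described above: the directory entry is visible even for a Files On-Demand placeholder, but the bytes only become locally available once something actually opens the file. The hydration behaviour noted in the comments is an assumption about OneDrive, not something verified against pg_read_binary_file:

import java.nio.file.{Files, Paths}

// Path taken from the failing statement above
val p = Paths.get("C:\\Users\\Jorge\\OneDrive\\Documents\\000\\020-logos\\oper.png")

println(Files.exists(p))           // true: the directory entry exists even if the content is cloud-only
val bytes = Files.readAllBytes(p)  // a normal open/read is assumed to make OneDrive download the file
println(s"read ${bytes.length} bytes")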

How does pyspark read from an entire directory under the hood?

How does pyspark read from a directory under the hood?
Asking because I have a situation where I am attempting to read from a directory and then write to that same location later in overwrite mode. Doing this throws java.io.FileNotFoundException errors (even when I manually place files in the directory), and I am assuming that this read-then-overwrite is the cause. So I am curious about the underlying process Spark uses to read from a directory, and what could be happening when I then try to overwrite that same directory.
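In general, Spark reads lazily: spark.read.parquet(path) only records the source path, and the files are scanned when an action runs. With SaveMode.Overwrite on the same path, the target directory is cleared before the job actually reads the data, so the lazy scan then finds nothing, which matches the FileNotFoundException. A common workaround, sketched below in Scala with hypothetical paths, is to materialize the result to a staging location first and only then overwrite the original directory:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("overwrite-same-path").getOrCreate()

// Hypothetical paths, for illustration only
val sourcePath  = "/data/events"
val stagingPath = "/data/events_staging"

// Lazy: only the path is recorded here, no data is copied yet
val df = spark.read.parquet(sourcePath)

// Materialize the result somewhere else first...
df.write.mode(SaveMode.Overwrite).parquet(stagingPath)

// ...then overwrite the original location from the staged copy
spark.read.parquet(stagingPath)
  .write.mode(SaveMode.Overwrite).parquet(sourcePath)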

input file or path doesn't exist when executing scala script

I just started learning Spark/Scala; here is a confusing issue I came across in my very first practice exercise:
I created a test file: input.txt in /etc/spark/bin
I created a RDD
I started to do a word count but received the error saying Input path does not exist
Why is input.txt not picked up by Scala? If it is permission related: the file was created by root, but I am also running Spark/Scala as root.
Thank you very much.
Spark reads the file from HDFS. Copy it there with:
hadoop fs -put /etc/spark/bin/input.txt /hdfs/path
If it's a local installation, the file will be copied to the Hadoop folder.
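Alternatively, if the file is meant to stay on the local filesystem, the path can be given an explicit file:// scheme so Spark does not resolve it against HDFS. A sketch of the word count, assuming a SparkContext named sc as in the spark-shell (with a real cluster the file would have to exist on every worker):

// Read from the local filesystem instead of HDFS
val lines = sc.textFile("file:///etc/spark/bin/input.txt")

val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.collect().foreach(println)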

Spark Streaming - textFileStream: Tmp file error

I have a Spark Streaming application that scans a declared directory via the textFileStream method. I have the batch interval set to 10 seconds, and I'm processing new files and concatenating their content into bigger parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot bypass.
Let's take the following scenario as an example:
The directory is empty most of the time, then around 100 small files arrive at the same time as the streaming batch triggers. Spark Streaming finds 70 files with the .json extension and 30 files with the _tmp suffix (still under creation/download). This situation obviously crashes my app, since the process tries to work with a _tmp file which in the meantime has changed into a fully created/downloaded .json file.
I've tried to filter out the rdd with following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD { rdd =>
  // ...
}
to process _tmp files and throw an exception saying there is no such file as *_tmp.
Question
I've read about using a staging directory from which files should be moved after the create/download process finishes, but neither a copy nor a move operation (why is that? a copy operation creates a new file, so...) triggers textFileStream, and the process ignores those files. Is there any other method to filter out those files and wait until they are complete before processing?
Env spec
Directory on a Windows 8.1 file system; eventually this will be a Unix-like machine
Spark 1.5 for Hadoop 2.6
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside C:\foo, i.e. the data, by default line by line. If it finds a line (not a filename) ending with tmp, it removes that line from the stream.
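If the goal is to filter by filename rather than by line content, the lower-level StreamingContext.fileStream API accepts a per-file Path filter. A sketch of that approach (the _tmp check simply mirrors the naming described in the question):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fileStream exposes a Path filter per file, unlike textFileStream
val jsonData = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "C:\\foo",
  (path: Path) => !path.getName.endsWith("_tmp"),
  newFilesOnly = true
).map { case (_, line) => line.toString }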

Spark coalesce looses file when program is aborted

In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But I figured out, using the console and Hadoop's ls command, that while it is being coalesced, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, there is no such file or directory. After the coalesce is done, the path is there again, along with the file that was coalesced.
This might happen because the coalesce needs to delete the file and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
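One common pattern to reduce the risk (a sketch, not from the original thread, assuming filePath, dataFrame and rdd are the values from the snippet above) is to write the coalesced output to a temporary path and only swap it into place once the write has fully succeeded, so the live copy is never missing while the job is still running. The final delete-and-rename is not perfectly atomic, but the window for loss shrinks from the whole write to a single rename:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Write the coalesced output to a temporary sibling path first
val tmpPath = filePath + "_tmp"
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(tmpPath)

// Swap it into place only after the write has completed
val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
fs.delete(new Path(filePath), true)
fs.rename(new Path(tmpPath), new Path(filePath))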