I have a Spark Streaming application that scans a declared directory via the textFileStream method. The batch interval is set to 10 seconds, and I process new files and concatenate their content into bigger parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot get around.
Take the following scenario as an example:
The directory is empty most of the time, then around 100 small files arrive at the same moment a streaming batch triggers. Spark Streaming finds 70 files with the .json extension and 30 files with _tmp (still being created/downloaded). This obviously crashes my app, since the process tries to work with a _tmp file which in the meantime has turned into a fully created/downloaded .json file.
I've tried to filter the RDD with the following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD{ rdd=>
/.../
}
to process the _tmp files and throw a "no such file" exception for *_tmp.
Question
I've read about using a staging directory from which I should move files once the create/download process finishes, but neither a copy nor a move operation triggers textFileStream (why is that? a copy operation creates a new file, so...), and the process ignores those files. Is there any other way to filter out those files and wait until they are complete before processing them?
Env spec
Directory on a Windows 8.1 file system; eventually that will be a Unix-like machine
Spark 1.5 for Hadoop 2.6
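The staging-directory idea from the question could be sketched roughly as below, using only the Python standard library. One possible explanation for moved files being ignored is that a move keeps the file's original modification time while the directory stream selects files by modification time; that is an assumption about this setup, not something the question confirms. The helper name publish and the directory names are hypothetical.

```python
import os
import tempfile
import time


def publish(staging_path, watched_dir):
    """Move a finished file from staging into the watched directory."""
    # Refresh the modification time first, because a rename keeps the old one.
    now = time.time()
    os.utime(staging_path, (now, now))
    target = os.path.join(watched_dir, os.path.basename(staging_path))
    # On the same file system this is a single rename, so readers never
    # see a half-written file under its final name.
    os.replace(staging_path, target)
    return target


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    staging = os.path.join(root, "staging")   # hypothetical download location
    watched = os.path.join(root, "watched")   # hypothetical streamed directory
    os.makedirs(staging)
    os.makedirs(watched)

    part = os.path.join(staging, "events.json")
    with open(part, "w") as fh:
        fh.write('{"id": 1}\n')               # the download finishes here
    print(publish(part, watched))
```

The staging and watched directories need to be on the same file system for the rename to be a single step; across file systems the move degrades to a copy and the partial-file problem can come back.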
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside C:\foo, that is, the data, which by default is read line by line. If it finds a line (not a filename) ending with tmp, it removes that line from the stream.
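To make the difference concrete, here is a small plain-Python sketch (no Spark involved) that contrasts filtering lines with filtering file names. The file names, the sample lines and the helper is_complete are made up for illustration. In the Scala API, the name-level check would go into the path filter argument of fileStream rather than into a filter on the stream of lines.

```python
def is_complete(name):
    """True only for finished files; in-progress *_tmp names are rejected."""
    return name.endswith(".json")


names = ["a.json", "b.json_tmp", "c_tmp"]       # hypothetical file names
lines = ['{"id": 1}', "this line ends with tmp"]  # hypothetical file content

# A filter on the stream sees lines, so it only drops matching lines.
print([l for l in lines if not l.endswith("tmp")])  # ['{"id": 1}']

# A filter on names is what actually keeps *_tmp files out.
print([n for n in names if is_complete(n)])         # ['a.json']
```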
Related
How does pyspark read from a directory under the hood?
I'm asking because I am attempting to read from a directory and then later write to that same location in overwrite mode. Doing this throws java.io.FileNotFoundException errors (even when I manually place files in the directory). I assume the read followed by the overwrite is the cause, so I am curious about the underlying process Spark uses to read from a directory and what could be happening when I then try to overwrite that same directory.
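A plain-Python analogy may help picture one likely cause, assuming the read is lazy: the list of files is fixed when the read is planned, but each file is only opened when the data is actually consumed. If the directory is cleared in between, as an overwrite does, the planned paths no longer exist. This is only a sketch of the general mechanism, not Spark's actual code; plan_read is a hypothetical helper.

```python
import os
import shutil
import tempfile


def plan_read(directory):
    """List the files now, but open them only when the result is consumed."""
    paths = [os.path.join(directory, n) for n in sorted(os.listdir(directory))]

    def run():
        for path in paths:
            with open(path) as fh:   # the file is opened here, not at planning
                yield from fh

    return run


if __name__ == "__main__":
    data_dir = tempfile.mkdtemp()
    with open(os.path.join(data_dir, "part-0.txt"), "w") as fh:
        fh.write("hello\n")

    read = plan_read(data_dir)       # "read" the directory: nothing opened yet
    shutil.rmtree(data_dir)          # an overwrite first clears the target
    os.makedirs(data_dir)
    try:
        print(list(read()))
    except FileNotFoundError as err:
        print("planned file is gone:", err.filename)
```

Under that assumption, writing the result to a different location first, and only then replacing the original directory, avoids reading from files that have already been deleted.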
I'm trying to create parquet files for several days locally. The first time I run the code, everything works fine. The second time, it fails to delete a file. The third time, it fails to delete another file. It's totally random which file cannot be deleted.
I need this to work because I want to create parquet files every day for the last seven days, so the parquet files that are already there should be overwritten with the updated data.
I use Project SDK 1.8, Scala version 2.11.8 and Spark version 2.0.2.
After running that line of code the second time:
newDF.repartition(1).write.mode(SaveMode.Overwrite).parquet(
OutputFilePath + "/day=" + DateOfData)
this error occurs:
WARN FileUtil:
Failed to delete file or dir [C:\Users\...\day=2018-07-15\._SUCCESS.crc]:
it still exists.
Exception in thread "main" java.io.IOException:
Unable to clear output directory file:/C:/Users/.../day=2018-07-15
prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
After the third time:
WARN FileUtil: Failed to delete file or dir
[C:\Users\day=2018-07-20\part-r-00000-8d1a2bde-c39a-47b2-81bb-decdef8ea2f9.snappy.parquet]: it still exists.
Exception in thread "main" java.io.IOException: Unable to clear output directory file:/C:/Users/day=2018-07-20 prior to writing to it
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:91)
As you can see, it's a different file from the one in the second run.
And so on. After deleting the files manually, all parquet files can be created.
Does anybody know this issue and how to fix it?
Edit: It's always a crc-file that can't be deleted.
Thanks for your answers. :)
The solution is not to write in the Users directory. There seems to be a permission problem, so I created a new folder in the C: directory and it works perfectly.
This problem occurs when the destination directory is open in Windows. You just need to close the directory.
Perhaps another Windows process has a lock on the file so it can't be deleted.
To detect malformed/corrupt/incomplete JSON files, I have used the FAILFAST option so that the process fails. How do I capture the name of the corrupted file out of hundreds of files? I need to remove that file from the path and copy a good version of it from an S3 bucket.
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
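One way to find the offending file names outside of Spark is to pre-validate each file before loading, sketched below with the standard library. It assumes the files are local, UTF-8, and in JSON Lines form (one JSON document per line); the function names are hypothetical. Python's json module may not accept or reject exactly the same inputs as Spark's parser, so treat the result as a first pass.

```python
import json
import os
import tempfile


def is_valid_json_lines(path):
    """True if every non-empty line of the file parses as JSON."""
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    json.loads(line)
    except ValueError:   # covers JSONDecodeError and UnicodeDecodeError
        return False
    return True


def find_corrupt_files(directory):
    """Return the names of files in `directory` that fail validation."""
    return [
        name
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
        and not is_valid_json_lines(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    load_path = tempfile.mkdtemp()   # stands in for table.load_path
    with open(os.path.join(load_path, "good.json"), "w") as fh:
        fh.write('{"id": 1}\n{"id": 2}\n')
    with open(os.path.join(load_path, "truncated.json"), "w") as fh:
        fh.write('{"id": 1}\n{"id": \n')
    print(find_corrupt_files(load_path))   # ['truncated.json']
```

The returned names can then be removed and replaced with good copies before the FAILFAST read is retried.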
I just started learning Spark/Scala; here is a confusing issue I came across in my very first exercise:
I created a test file, input.txt, in /etc/spark/bin
I created an RDD
I started a word count but received an error saying Input path does not exist
Here is a screenshot:
Why is input.txt not picked up by Scala? If it is permission related: the file was created by root, but I am also running Spark/Scala as root.
Thank you very much.
Spark reads the file from HDFS. Copy it with:
hadoop fs -put /etc/spark/bin/input.txt /hdfs/path
If it's a local installation, the file will be moved to the Hadoop folder.
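If the path is indeed being resolved against HDFS, an alternative to copying is to give Spark a fully qualified local URI with the file:// scheme. The small sketch below only builds such a URI with the standard library; local_uri is a hypothetical helper and the output shown is for a POSIX system.

```python
from pathlib import Path


def local_uri(path):
    """Turn an absolute local path into a file:// URI."""
    return Path(path).as_uri()


print(local_uri("/etc/spark/bin/input.txt"))  # file:///etc/spark/bin/input.txt
```

The resulting string would be passed wherever the bare path was used before. On a cluster the file would still have to exist at that path on every worker.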
In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But using the console and Hadoop's ls command, I found that while the coalesce is running, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, there is no such file or directory. After the coalesce is done, the path is there again, along with the coalesced file.
Might this happen because the coalesce needs to delete the file and create a new one?
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
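A common way to protect the existing data is to write to a temporary location and swap it into place only once the write has finished. The sketch below shows that pattern on a local file system with the standard library; safe_overwrite and the ".old" backup suffix are hypothetical, and on HDFS the same steps would be done with the file system's own rename operation. The swap consists of two renames, so it is not strictly atomic, but the old data survives a failure during the slow write.

```python
import os
import shutil
import tempfile


def safe_overwrite(write_fn, final_dir):
    """Write into a sibling temp directory, then swap it into place."""
    final_dir = os.path.abspath(final_dir)
    parent = os.path.dirname(final_dir)
    tmp_dir = tempfile.mkdtemp(prefix=".tmp-", dir=parent)
    try:
        write_fn(tmp_dir)                 # slow part; final_dir is untouched
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    backup = final_dir + ".old"
    shutil.rmtree(backup, ignore_errors=True)   # clear any stale backup
    if os.path.exists(final_dir):
        os.rename(final_dir, backup)      # quick rename, old data kept
    os.rename(tmp_dir, final_dir)         # new data becomes visible
    shutil.rmtree(backup, ignore_errors=True)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    target = os.path.join(root, "output")

    def write_demo(directory):
        # Stands in for the real write, e.g. pointing the parquet writer here.
        with open(os.path.join(directory, "part-0.txt"), "w") as fh:
            fh.write("data\n")

    safe_overwrite(write_demo, target)
    print(os.listdir(target))             # ['part-0.txt']
```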