Overwriting the parquet file throws exception in spark - scala

I am trying to read a Parquet file from an HDFS location, apply some transformations, and overwrite the file at the same location. I have to overwrite it in place because I need to run the same code multiple times.
Here is the code I have written:
val df = spark.read.option("header", "true").option("inferSchema", "true").parquet("hdfs://master:8020/persist/local/")
// After applying some transformations, say the final DataFrame is transDF, which I want to write back to the same location.
transDF.write.mode("overwrite").parquet("hdfs://master:8020/persist/local/")
The problem is that, because of overwrite mode, Spark seems to delete the files at the given location before it has actually read them. So when I execute the code I get the following error:
File does not exist: hdfs://master:8020/persist/local/part-00000-e73c4dfd-d008-4007-8274-d445bdea3fc8-c000.snappy.parquet
Any suggestions on how to solve this problem? Thanks.

The simple answer is that you cannot overwrite what you are reading. Overwrite has to delete everything at the target, but since Spark works in parallel (and reads lazily), some portions might still be reading at that time. Furthermore, even if everything had been read, Spark needs the original files to recompute any tasks that fail.
Since you need the input for multiple iterations, I would simply make the input and output paths arguments of the function that performs one iteration, and delete the previous iteration's data only once the write has succeeded.
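The idea of passing input and output paths into a one-iteration function can be sketched as follows. This is only a sketch in PySpark style: the iter_N directory layout, the function names and the delete_path helper are all assumptions made for illustration, not something from the question.

```python
def iteration_paths(base_dir, iteration):
    """Return (input_path, output_path) for one iteration.

    Hypothetical layout: iteration N reads <base_dir>/iter_N and writes
    <base_dir>/iter_N+1, so the input and the output never coincide.
    """
    base = base_dir.rstrip("/")
    return f"{base}/iter_{iteration}", f"{base}/iter_{iteration + 1}"


def run_iteration(spark, transform, input_path, output_path, delete_path):
    """Read input_path, apply transform, write output_path, then clean up.

    spark       -- an active SparkSession
    transform   -- function DataFrame -> DataFrame (your transformations)
    delete_path -- callable that removes a directory (assumed helper, for
                   example a wrapper around `hdfs dfs -rm -r`)
    """
    if input_path == output_path:
        raise ValueError("input and output paths must differ")
    df = spark.read.parquet(input_path)
    transform(df).write.mode("overwrite").parquet(output_path)
    # Reached only if the write did not raise, so the previous data is
    # removed only after the new data is in place.
    delete_path(input_path)
    return output_path
```

Each run then calls iteration_paths with the current iteration number and feeds the returned output path to the next run as its input.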

This is what I tried, and it worked. My requirement was almost the same: an upsert.
By the way, the property spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic') was set, and even then the transform job was failing.
Take a backup of the S3 folder (the final curated layer) before every batch operation.
Using the DataFrame operations, first delete the S3 Parquet file location instead of overwriting it.
Then append to that location.
Previously the entire job ran for 1.5 hours and failed frequently. Now the entire operation takes 10-15 minutes.

Related

Change spark _temporary directory path to avoid deletion of parquets

When two or more Spark jobs share the same output directory, they will inevitably delete each other's files.
I'm writing a DataFrame in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's tmp dir to avoid these deletions.
Example:
my Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet
the same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet
I want the first job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file1.parquet
and the other job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file2.parquet, and then move the Parquet files to hdfs:/outputFile/
df
.write
.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.partitionBy("XXXXXXXX")
.mode(SaveMode.Append)
.format(fileFormat)
.save(path)
When Spark appends data to an existing dataset, Spark uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has direct impact on the performance of jobs that write data.
A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. For this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory and, when all tasks complete, commitJob moves data from the job temporary directory to the final destination.
Because the driver does the work of commitJob, this operation can take a long time on cloud storage. You may often think that your cell is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination and commitJob is basically a no-op.
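As an alternative to the per-write option shown in the question, the same Hadoop key can be set once for the whole session. The sketch below assumes PySpark is installed and relies on Spark's convention of copying properties prefixed with spark.hadoop. into the Hadoop configuration. The application name and the build_session helper are hypothetical.

```python
# The Hadoop key from the text, prefixed with "spark.hadoop." so that
# Spark passes it through to the Hadoop configuration.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def build_session(app_name="committer-v2-example"):
    """Create (or reuse) a SparkSession with the committer settings applied."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in COMMITTER_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Note that version 2 only changes where task output is moved at commit time. Because tasks commit straight to the final destination, a job that fails midway can leave partial output there. The text does not establish whether this setting helps when two jobs share one output directory.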

Behavior of the overwrite in spark

I regularly upload data to a Parquet file that I use for my data analysis, and I want to ensure that the data in my Parquet file is not duplicated. The command I use to do this is:
df.write.parquet('my_directory/', mode='overwrite')
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
Cheers
Overwrite, as the name implies, rewrites all the data at the path that you specify.
Rewrite in the sense that the data in the df is written to the path after any old files at that path are removed. So you can think of this as a DELETE and LOAD scenario: you read all the records from the data source (say, Oracle), do your transformations, delete the Parquet files, and write the new content of the DataFrame.
DataFrame.write supports several modes for writing content to the target.
mode: specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data already exists.
If your intention is to add new data to the Parquet files, then you have to use append, but this brings a new challenge of duplicates if you are dealing with changing data.
Does this ensure that all my non-duplicated data will not be deleted accidentally at some point?
No. mode='overwrite' only ensures that if data already exists in the target directory, then the existing data would be deleted and new data would be written (analogous to truncate and load in RDBMS tables).
If you want to ensure there are no record-level duplicates, the easiest thing to do is this:
df1 = df.dropDuplicates()
df1.write.parquet('my_directory/', mode='overwrite')

Spark Structured Streaming Processing Previous Files

I am implementing the file source in Spark Structured Streaming and want to process the same file name again if the file has been modified, that is, when the file is updated. Currently Spark will not process the same file name again once it has been processed. This seems limited compared to Spark Streaming with DStreams. Is there a way to do this? The Spark Structured Streaming documentation doesn't cover this anywhere; the source only processes new files with different names.
I believe this is somewhat of an anti-pattern, but you may be able to dig through the checkpoint data and remove the entry for that original file.
Try looking for the original file name in the /checkpoint/sources// files and delete the file or entry. That might cause the stream to pick up the file name again. I haven't tried this myself.
If this is a one-time manual update, I would just change the file name to something new and drop it in the source directory. This approach won't be maintainable or automated.
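The rename-and-drop suggestion could be scripted roughly as below. This is a sketch under assumptions: the source directory is on a local or mounted filesystem, and both the republish helper and its naming scheme (modification time embedded in the file name) are hypothetical.

```python
import shutil
from pathlib import Path


def republish(modified_file, source_dir):
    """Copy modified_file into source_dir under a name the stream has not seen.

    The new name embeds the file's modification time in nanoseconds,
    so data.csv becomes data.<mtime_ns>.csv (hypothetical scheme).
    """
    src = Path(modified_file)
    stamp = src.stat().st_mtime_ns
    target = Path(source_dir) / f"{src.stem}.{stamp}{src.suffix}"
    # Copy under a temporary name first, then rename, so a partially
    # copied file never appears under its final name.
    tmp = target.with_name("." + target.name + ".tmp")
    shutil.copyfile(src, tmp)
    tmp.replace(target)
    return target
```

Every republished version is processed as a brand-new file, so the streaming query, or whatever consumes its output, has to cope with seeing the same logical file more than once.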

Append/concatenate two files using spark/scala

I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those files to the source file.
I found that FileUtil provides the copyMerge function, but it doesn't allow appending two files.
Thank you for your help
You can do this in two ways. sc.textFile accepts a comma-separated list of paths in a single string:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
import org.apache.spark.rdd.UnionRDD
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), sc.textFile("path/file2"))).coalesce(1).saveAsTextFile("path/newSource") // add further files to the Seq as needed
If you don't want to create a new source and would rather overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see "How to overwrite the output directory in spark").

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), it writes one file per partition. So by calling coalesce(1) you are telling Spark you want 1 partition.
Hope this helps.
After you have finished provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebook for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
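Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV. (That note applies to the RDD API and Spark 1.x; Spark 2.0 and later ship a built-in CSV data source for DataFrames.)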
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
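One caveat with joining fields on "," is that a field which itself contains a comma or a quote produces a malformed row. A minimal sketch of a safer per-row formatter, written in Python with the standard csv module, is shown below. The to_csv_line name is hypothetical, and using it from PySpark as rdd.map(to_csv_line).saveAsTextFile(...) is an assumption about your setup, not part of the notebook.

```python
import csv
import io


def to_csv_line(fields):
    """Format one row as a single CSV line, quoting fields where needed."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    # The writer appends a line terminator; saveAsTextFile adds its own
    # newline per record, so strip it here.
    return buf.getvalue().rstrip("\r\n")
```

Fields without special characters come out exactly as with a plain join, so the change only affects rows that would otherwise be ambiguous.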
Hope this helps!