Append/concatenate two files using spark/scala - scala

I have multiple files stored in HDFS, and I need to merge them into one file using spark. However, because this operation is done frequently (every hour). I need to append those multiple files to the source file.
I found that there is the FileUtil that gives the 'copymerge' function. but it doesn't allow to append two files.
Thank you for your help

You can do this with two methods:
sc.textFile("path/source", "path/file1", "path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or as #Pushkr has proposed
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source and overwrite the same source every hour, you can use dataframe with save mode overwrite ( How to overwrite the output directory in spark)

Related

How to save and merge RDD data in scala?

I want to merge existing data in hdfs with new comings from RDD. (Not by filename instead by real data inside them)
I found out there is no way to control output files' names in rdd.saveAsTextFile API, so I can not save both just by naming them with different names.
I try to merge them by Hadoop's FileUtil.copyMerge function, but I'm using Hadoop 3, which means this API is not supported ever more.

Read part files from hdfs into data frame using pySpark

I have multiple files stored in a hdfs location as follows
/user/project/202005/part-01798
/user/project/202005/part-01799
There are 2000 such part files. Each file is of the format
{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}
and so on . I have 2 questions
1) How to check whether these are multiple files or multiple partitions of the same file
2) How to read these in a data frame using pyspark
As these files are in one directory, and these are named as part-xxxxx files, so you can safely assume these are multiple part files of the same dataset. If these are partitions, they should be saved like this /user/project/date=202005/*
You can specify the dir "/user/project/202005" as input for spark like below assuming these are csv files
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)

Overwriting the parquet file throws exception in spark

I am trying to read the parquet file from hdfs location, do some transformations and overwrite the file in the same location. I had to overwrite the file in the same location because I had to run the same code multiple times.
Here is the code I have written
val df = spark.read.option("header", "true").option("inferSchema", "true").parquet("hdfs://master:8020/persist/local/")
//after applying some transformations lets say the final dataframe is transDF which I want to overwrite at the same location.
transDF.write.mode("overwrite").parquet("hdfs://master:8020/persist/local/")
Now the problem is before reading the parquet file from the given location, spark for some reason I believe it deletes the file at the given location because of overwrite mode. So when executing the code I get the following error.
File does not exist: hdfs://master:8020/persist/local/part-00000-e73c4dfd-d008-4007-8274-d445bdea3fc8-c000.snappy.parquet
Any suggestions on how to solve this problem? Thanks.
The simple answer is that you cannot overwrite what you are reading. The reason behind this is that overwrite would need to delete everything, however, since spark is working in parallel, some portions might still be reading at the time. Furthermore, even if everything was read, spark needs the original file to recalculate tasks which are failed.
Since you need the input for multiple iterations, I would simply make the name of the input and the output into arguments for the function that does one iteration and delete the previous iteration only once the writing is successful.
This is what I have tried and it worked. My requirement was almost same. It was upsert option.
by the way, spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic') property was set. Even then also the Transform job was failing
Took a backup of S3 folder (final curated layer) before every batch operation
using the dataframe operations, first delete the S3 parquet file location before overwrite
then Append to the particular location
Previously the entire job was running for 1.5Hrs and failing frequently. Now it's taking 10-15mins for the entire operations

Spark Structured Streaming Processing Previous Files

I am implementing the file source in Spark Structures Streaming and want to process the same file name again if the file has been modified. Basically an update to the file. Currently right now Spark will not process the same file name again once processed. Seems limited compared to Spark Streaming with Dstream. Is there a way to do this? Spark Structured Streaming doesn't document this anywhere it only process new file with different names.
I believe this is somewhat of an anti pattern, but you may be able to dig through the checkpoint data and remove the entry for that original file.
Try looking for the original file name in the /checkpoint/sources// files delete the file or entry. That might cause the stream to pick up the file name again. I haven't tried this myself.
If this is a one time manual update, I would just change the file name to something new and drop it in the source directory. This approach won't be maintainable or automated.

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well I am not sure you can get just one file without a directory. If you do
rdd1 .coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut" the file should like something like part-00001. This is because your rdd is a disturbed on in your cluster with what they call partitions. When you do a call to save (all save functions) it is going to make a file per partition. So by call coalesce(1) your telling you want 1 partition.
Hope this helps.
After finished provisioning a Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebook for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebook with example on how to do this.
Specifically, for scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, so there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
Hope this helps!