Spark saveAsTextFile to Azure Blob creates a blob instead of a text file - scala

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob Storage:
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not a single text file but a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?

Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what Spark calls partitions. When you call save (any of the save functions), it writes one file per partition. So by calling coalesce(1) you are telling Spark you want a single partition.
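If you really need one plainly named text file instead of a part-00000 inside that directory, a possible follow-up step (just a sketch using the Hadoop FileSystem API; it assumes wasb is the cluster's default file system and the target file name is made up) is to copy the lone part file out and drop the directory:
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
// Sketch only: copy the single part file to a plain file name, then remove the directory.
val hadoopConf = sc.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)
val partFile = fs.globStatus(new Path("wasb:///HVACOut/part-*"))(0).getPath
FileUtil.copy(fs, partFile, fs, new Path("wasb:///HVAC-single.txt"), false, hadoopConf)
fs.delete(new Path("wasb:///HVACOut"), true)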
Hope this helps.

After you finish provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebooks for your cluster at https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
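For context, here is a minimal, hypothetical end-to-end version of that workaround, tied back to the HVAC example above (csvFile here is an RDD of field arrays; the paths mirror the snippets earlier in this post):
// Sketch: read the raw text, split each line into fields, filter,
// then join the fields back with commas before saveAsTextFile.
val raw = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
val csvFile = raw.map(line => line.split(","))
val singleDigit = csvFile.filter(fields => fields(6).length == 1)
singleDigit.map(fields => fields.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")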
Hope this helps!

Related

ADF Data Flow creates empty file

I have set up a very basic data transformation using a "Data flow". I'm taking in a CSV file, modifying one of the columns, and writing to a new CSV file in an "output" directory. I noticed that after the pipeline runs, not only does it create the output folder, but it also creates an empty file with the same name as the output folder.
Did I set something up wrong, or is this empty file normal?
(Screenshots attached: the sink settings — Sink, Settings, Mapping, Optimize tabs — and the resulting files in storage.)
Thank you,
Scott

Stitch Part Files to one with custom name

A Data Fusion pipeline gives us one or more part files as output when the sink is a GCS bucket. My question is: how can we combine those part files into one, and also give it a meaningful name?
The Data Fusion transformations run on Dataproc clusters, executing either Spark or MapReduce jobs. Your final output is split into many files because the jobs partition your data based on the HDFS partitions (this is the default behavior for Spark/Hadoop).
When writing a Spark script, you can change this default behavior and produce a different output. However, Data Fusion was built to abstract away the code layer and give you the experience of a fully managed data integrator. Using split files should not be a problem, but if you really need to merge them, I suggest the following approach:
At the top of your Pipeline Studio, click Hub -> Plugins, search for Dynamic Spark Plugin, click Deploy and then Finish (you can ignore the JAR file).
Back in your pipeline, select Spark in the sink section.
Replace your GCS plugin with the Spark plugin.
In your Spark plugin, set Compile at Deployment Time to false and replace the code with some Spark code that does what you want. The code below, for example, is hardcoded but works:
def sink(df: DataFrame): Unit = {
  val new_df = df.coalesce(1)
  new_df.write.format("csv").save("gs://your/path/")
}
This function receives the data from your pipeline as a DataFrame. The coalesce call reduces the number of partitions to 1, and the last line writes it to GCS.
Deploy your pipeline and it will be ready to run.
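If you also need that single output file to carry a meaningful name, a hypothetical extension of the sink function above (the output directory and the final file name are assumptions) could rename the part file with the Hadoop FileSystem API:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def sink(df: DataFrame): Unit = {
  val outDir = "gs://your/path/"  // assumed output directory
  df.coalesce(1).write.format("csv").save(outDir)
  // Find the single part file and rename it to something meaningful.
  val fs = FileSystem.get(new java.net.URI(outDir), df.sparkSession.sparkContext.hadoopConfiguration)
  val partFile = fs.globStatus(new Path(outDir + "part-*"))(0).getPath
  fs.rename(partFile, new Path(outDir + "output.csv"))  // "output.csv" is an assumption
}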

How to Write a Spark Dataframe (in DataBricks) to Blob Storage (in Azure)?

I am working in DataBricks, where I have a DataFrame.
type(df)
Out: pyspark.sql.dataframe.DataFrame
The only thing I want is to write this complete Spark DataFrame into Azure Blob Storage.
I found this post. So I tried that code:
# Configure blob storage account access key globally
spark.conf.set(
"fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
sas_key)
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path
# write the dataframe as a single file to blob storage
(datafiles
.coalesce(1)
.write
.mode("overwrite")
.option("header", "true")
.format("com.databricks.spark.csv")
.save(output_blob_folder))
Running that code leads to the error below. Changing the "csv" part to parquet and other formats also fails.
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;
Therefore my question (and my assumption is that this should be easy):
How can I write my spark dataframe from DataBricks to an Azure Blob Storage?
My Azure folder structure is like this:
Account = MainStorage
Container 1 is called "Data" (contains all the data; irrelevant here because I have already read it in).
Container 2 is called "Output" (this is where I want to store my Spark DataFrame).
Many thanks in advance!
EDIT
I am using Python. However, I don't mind if the solution is in another language (as long as DataBricks supports it, like R/Scala etc.). If it works, it is perfect :-)
Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format.
Please note that the newly created file will have a default name (something like part-00000-...) with a csv extension, so you might need to rename it to a consistent name.
// output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
val mount_root = "/mnt/ContainerName/DirectoryName"
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(s"dbfs:$mount_root/")
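If the storage is not mounted yet, here is a hypothetical Scala-notebook sketch of the mount plus the rename step mentioned above. The container and account names are taken from the question; the secret scope, mount point, output folder, and final file name are all assumptions.
// Sketch only: mount the "Output" container, write a single CSV,
// then copy the generated part file to a stable name.
// df stands for your DataFrame.
val storageName   = "MainStorage"
val containerName = "Output"
val storageKey    = dbutils.secrets.get(scope = "my-scope", key = "storage-key")  // assumed secret scope

dbutils.fs.mount(
  source = s"wasbs://$containerName@$storageName.blob.core.windows.net/",
  mountPoint = s"/mnt/$containerName",
  extraConfigs = Map(s"fs.azure.account.key.$storageName.blob.core.windows.net" -> storageKey))

df.coalesce(1)
  .write.format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save(s"/mnt/$containerName/wrangled_data_folder/")

// Rename the generated part file to something predictable.
val partFile = dbutils.fs.ls(s"/mnt/$containerName/wrangled_data_folder/")
  .map(_.path).filter(_.contains("part-")).head
dbutils.fs.cp(partFile, s"/mnt/$containerName/wrangled_data.csv")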

Overwriting the parquet file throws exception in spark

I am trying to read a parquet file from an HDFS location, do some transformations, and overwrite the file in the same location. I have to overwrite the file in the same location because I have to run the same code multiple times.
Here is the code I have written
val df = spark.read.option("header", "true").option("inferSchema", "true").parquet("hdfs://master:8020/persist/local/")
//after applying some transformations lets say the final dataframe is transDF which I want to overwrite at the same location.
transDF.write.mode("overwrite").parquet("hdfs://master:8020/persist/local/")
Now the problem is that, before reading the parquet file from the given location, Spark (I believe because of the overwrite mode) deletes the files at that location. So when executing the code I get the following error.
File does not exist: hdfs://master:8020/persist/local/part-00000-e73c4dfd-d008-4007-8274-d445bdea3fc8-c000.snappy.parquet
Any suggestions on how to solve this problem? Thanks.
The simple answer is that you cannot overwrite what you are reading. The reason is that overwrite needs to delete everything; however, since Spark works in parallel, some portions might still be being read at that point. Furthermore, even if everything had already been read, Spark would still need the original files to recompute tasks that fail.
Since you need the input for multiple iterations, I would simply make the input and output names arguments of the function that does one iteration, and delete the previous iteration's data only once the write has succeeded.
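A rough sketch of that idea (the paths and the transformation itself are placeholders; the two calls at the end alternate which directory plays the input and output role):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def runIteration(spark: SparkSession, inputPath: String, outputPath: String): Unit = {
  val df = spark.read.parquet(inputPath)
  val transDF = df  // ... apply your transformations here ...
  transDF.write.mode("overwrite").parquet(outputPath)
  // Only after the write has succeeded is it safe to drop the previous iteration's data.
  val fs = FileSystem.get(new java.net.URI(inputPath), spark.sparkContext.hadoopConfiguration)
  fs.delete(new Path(inputPath), true)
}

// Alternate between two locations so you never overwrite what you are reading:
runIteration(spark, "hdfs://master:8020/persist/local_a/", "hdfs://master:8020/persist/local_b/")
runIteration(spark, "hdfs://master:8020/persist/local_b/", "hdfs://master:8020/persist/local_a/")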
This is what I tried and it worked. My requirement was almost the same; it was an upsert operation.
By the way, the spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic') property was set. Even then the transform job kept failing.
I took a backup of the S3 folder (the final curated layer) before every batch operation,
then, using DataFrame operations, first deleted the S3 parquet file location before the overwrite,
and then appended to that particular location.
Previously the entire job ran for 1.5 hours and failed frequently. Now it takes 10-15 minutes for the whole operation.
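As a rough, hypothetical sketch of that backup / delete / append flow (the paths are placeholders; reading from the backup copy is what makes it safe to delete the original):
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val curated = new Path("s3a://my-bucket/curated/")        // assumed final curated layer
val backup  = new Path("s3a://my-bucket/curated_backup/") // assumed backup location
val conf    = spark.sparkContext.hadoopConfiguration
val fs      = FileSystem.get(curated.toUri, conf)

// 1. Back up the curated folder before the batch operation.
FileUtil.copy(fs, curated, fs, backup, false, true, conf)

// 2. Read from the backup so the original location can be safely deleted.
val transDF = spark.read.parquet(backup.toString)  // ... apply the batch transformations ...

// 3. Delete the original parquet location, then append the result there.
fs.delete(curated, true)
transDF.write.mode("append").parquet(curated.toString)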

Append/concatenate two files using spark/scala

I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those multiple files to the source file.
I found that FileUtil provides a 'copymerge' function, but it doesn't allow appending two files.
Thank you for your help
You can do this with two methods:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
(sc.textFile accepts a comma-separated list of paths in a single string.)
Or, as @Pushkr proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), ..)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source directory every hour and would rather overwrite the same source, you can use a DataFrame with save mode overwrite (see: How to overwrite the output directory in Spark).
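A minimal, hypothetical sketch of that DataFrame variant (the paths are placeholders; a staging directory is used because Spark cannot safely overwrite a directory it is still reading from, as discussed in the parquet overwrite question above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the existing source plus the new hourly files and collapse to one partition.
val merged = spark.read.text("path/source", "path/file1", "path/file2").coalesce(1)

// Write to a staging directory first, then swap it in place of "path/source"
// (for example with the Hadoop FileSystem rename API).
merged.write.mode("overwrite").text("path/source_staging")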