Change file name in Azure Databricks [duplicate] - pyspark

This question already has answers here:
Change output filename prefix for DataFrame.write()
(3 answers)
Closed 12 months ago.
I've written data from a PySpark DataFrame using the DataFrame writer APIs.
How can I change the name of the generated CSV file?

It seems you are trying to get a single CSV file out of a Spark DataFrame using the df.write.csv() method. This will create a directory of part files by default.
I would recommend the following instead if you want a single file with a specific name.
df.toPandas().to_csv('/dbfs/path_of_your_file/filename.csv')
using whatever Pandas to_csv arguments fit your needs.
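If you want to stay with the distributed writer (for example, if the data is too large to collect into Pandas), a common workaround on Databricks is to write to a temporary folder and then rename the single part file with dbutils. A minimal sketch, assuming hypothetical DBFS paths and the notebook-provided dbutils object:
# Write one part file to a temporary folder, then rename it (paths are placeholders)
output_dir = "dbfs:/path_of_your_file/tmp_output"
final_path = "dbfs:/path_of_your_file/filename.csv"

(df.coalesce(1)                      # one partition -> one part file
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(output_dir))

# Spark names the file part-xxxxx.csv; find it and move it to the desired name
part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(output_dir, True)      # clean up the temporary folder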

Related

Is there a way to validate each row of a spark dataframe?

My setup:
I have a text file containing multiple JSON entries.
I want to access each JSON entry and verify its key-value pairs.
Is there a way to do this using PySpark?
I tried to load the txt file by reading it into a Spark session and validating its schema via dataframe.schema. But I recently learned that schema inference does data sampling and doesn't validate all the records in the dataframe.
You're probably better off using a framework like Deequ (https://github.com/awslabs/python-deequ) to test your dataset.
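If you prefer to stay in plain PySpark, one common pattern is to read the file with an explicit schema in PERMISSIVE mode and route malformed entries into a corrupt-record column, then inspect those rows. A minimal sketch, assuming hypothetical field names and a placeholder file path:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical expected schema, plus a column to capture malformed records
expected_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(expected_schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/path/to/entries.txt"))   # one JSON object per line

df.cache()  # cache before filtering on the corrupt-record column
bad_rows = df.filter(df._corrupt_record.isNotNull())
print("malformed entries:", bad_rows.count())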

How to save and merge RDD data in scala?

I want to merge existing data in HDFS with new data coming from an RDD (not by file name, but by the actual data inside the files).
I found out there is no way to control the output file names in the rdd.saveAsTextFile API, so I cannot save both simply by giving them different names.
I tried to merge them with Hadoop's FileUtil.copyMerge function, but I'm using Hadoop 3, where this API is no longer supported.
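One possible workaround, offered only as a hedged sketch rather than a confirmed answer, is to merge at the data level instead of the file level: read the existing HDFS data, union it with the new RDD, deduplicate, and write everything to a fresh directory. A PySpark sketch with placeholder paths (the same idea applies in Scala):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

existing = sc.textFile("hdfs:///data/current")            # placeholder: existing data
incoming = sc.parallelize(["new line 1", "new line 2"])    # placeholder: new RDD

# Merge by content: union the two datasets and drop duplicate lines
merged = existing.union(incoming).distinct()

# Write to a new directory; swap it in place of the old one afterwards
merged.saveAsTextFile("hdfs:///data/current_merged")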

How to Write a Spark Dataframe (in DataBricks) to Blob Storage (in Azure)?

I am working in DataBricks, where I have a DataFrame.
type(df)
Out: pyspark.sql.dataframe.DataFrame
All I want is to write this complete Spark DataFrame to Azure Blob Storage.
I found this post. So I tried that code:
# Configure blob storage account access key globally
spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
    sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(datafiles
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))
Running that code leads to the error below. Changing the "csv" part to parquet and other formats also fails.
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;
Therefore my question (which I assume should be easy):
How can I write my spark dataframe from DataBricks to an Azure Blob Storage?
My Azure folder structure is like this:
Account = MainStorage
Container 1 is called "Data" # containing all the data; irrelevant because I already read this in.
Container 2 is called "Output" # here I want to store my Spark Dataframe.
Many thanks in advance!
EDIT
I am using Python. However, I don't mind if the solution is in another language (as long as Databricks supports it, like R/Scala etc.). If it works, it is perfect :-)
Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format.
Please note the newly created file will get a default part-* file name with a csv extension, so you might need to rename it to a consistent name.
// output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
val mount_root = "/mnt/ContainerName/DirectoryName"
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(s"dbfs:$mount_root/")
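The AnalysisException itself is the real blocker here: the CSV data source cannot serialize struct or array columns. A hedged PySpark sketch of one way around it is to convert every complex column to a JSON string with to_json before writing; the column handling below is an illustration under that assumption, not part of the original answer, and reuses the output_blob_folder path from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType

# df is the DataFrame from the question (called datafiles in the write snippet)
# Convert every struct/array column to a JSON string so the CSV writer accepts it
flat_df = df
for field in df.schema.fields:
    if isinstance(field.dataType, (StructType, ArrayType)):
        flat_df = flat_df.withColumn(field.name, F.to_json(F.col(field.name)))

(flat_df
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv(output_blob_folder))   # same wasbs:// output path as above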

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob Storage.
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is a distributed one in your cluster, split into what Spark calls partitions. When you call save (any of the save functions), it is going to write one file per partition. So by calling coalesce(1) you are telling Spark you want a single partition.
Hope this helps.
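If you then want that single part file as one plainly named text file, one option is to copy it out with the Hadoop FileSystem API after the save. A PySpark sketch under those assumptions (file names are placeholders, and the same Hadoop calls work from Scala):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
rdd1 = rdd.filter(lambda s: len(s.split(",")[6]) == 1)

# coalesce(1) still writes a directory with a single part-00000 file inside
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")

# Copy that part file to a single named text file via the Hadoop FileSystem API
hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)
src = hadoop.fs.Path("wasb:///HVACOut/part-00000")
dst = hadoop.fs.Path("wasb:///HVAC_single.txt")
hadoop.fs.FileUtil.copy(fs, src, fs, dst, False, conf)   # False = keep the source file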
After you finish provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebooks for your cluster at https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark's RDD API, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
Hope this helps!

Executing Mongo DB raw queries present in file in my java code [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
MongoDB (Java) - How to run query saved in javascript file?
I have written Java code to search/insert values into a collection. But what I want is this: I have a .js file containing raw MongoDB queries,
e.g. db.collectionname.find({fieldname: fieldvalue}),
and now I want to read the .js file line by line and execute the raw MongoDB queries.
Please help me with any idea or functions that will execute raw MongoDB statements in Java, like statement.executeQuery("select * from tablename"); in SQL.
You could use a JSON interpreter, like this one implemented in ANTLR, and output nested BasicDBObjects. The only remaining piece is parsing the db.collectionname.find(...) wrapper, which should be relatively straightforward. Instead of ANTLR, you could also pick one of the probably many Java JSON parsers and instantiate the BasicDBObjects yourself.