I have an Azure Data Lake gen1 and an Azure Data Lake gen2 (Blob Storage w/hierarchical) and I am trying to create a Databricks notebook (Scala) that reads 2 files and writes a new file back into the Data Lake. In both Gen1 and Gen2 I am experiencing the same issue where the file name of the output csv I have specified is getting saved as a directory and inside that directory it's writing 4 files "committed, started, _SUCCESS, and part-00000-tid-
For the life of me, I can't figure out why it's doing it and not actually saving the csv to the location.
Here's an example of the code I've written. If I do a .show() on the df_join dataframe then it outputs the correct looking results. But the .write is not working correctly.
val df_names = spark.read.option("header", "true").csv("/mnt/datalake/raw/names.csv")
val df_addresses = spark.read.option("header", "true").csv("/mnt/datalake/raw/addresses.csv")
val df_join = df_names.join(df_addresses, df_names.col("pk") === df_addresses.col("namepk"))
df_join.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save("/mnt/datalake/reports/testoutput.csv")
If I understand for your needs correctly, you just want to write the Spark DataFrame data to a single csv file named testoutput.csv into Azure Data Lake, not a directory named testoutput.csv with some partition files.
So you can not directly realize it via use these Spark functions like DataFrameWriter.save, because actually the dataframe writer writes data to HDFS based on Azure Data Lake. The HDFS persists data as a directory named yours and some partition files. Please see some documents about HDFS like The Hadoop FileSystem API Definition to know it.
Then, per my experience, you can try to use Azure Data Lake SDK for Jave within your Scala program to directly write data from DataFrame to Azure Data Lake as a single file. And you can refer to some samples https://github.com/Azure-Samples?utf8=%E2%9C%93&q=data-lake&type=&language=java.
The reason why it's creating a directory with multiple files, is because each partition is saved and written to the data lake individually. To save a single output file you need to re partition your dataframe
Let's use the dataframe API
confKey = "fs.azure.account.key.srcAcctName.blob.core.windows.net"
secretKey = "==" #your secret key
spark.conf.set(confKey,secretKey)
blobUrl = 'wasbs://MyContainerName#srcAcctName.blob.core.windows.net'
Coalesce your dataframe
df_join.coalesce(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save("blobUrl" + "/reports/")
Change the file name
files = dbutils.fs.ls(blobUrl + '/reports/')
output_file = [x for x in files if x.name.startswith("part-")]
dbutils.fs.mv(output_file[0].path, "%s/reports/testoutput.csv" % (blobUrl))
Try this :
df_join.to_csv('/dbfs/mnt/....../df.csv', sep=',', header=True, index=False)
Related
I am trying to write to a csv file from this Scala code. I'm using HDFS as a temp directory, then just writer.write to create a new file in an existing subfolder. I get the following error message:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
val hconf = new Configuration() // initialize new hadoop configuration
new Path(path).getFileSystem(hconf) // get new filesystem to handle data
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
same happens if I choose new file or exiting one, I've checked the path is correct, just want to create a new file in there.
Problem is in order to write data using file system based source you'll need a temporal directory, this is a part of the commit mechanism used by Spark, i.e data is first written to a temporary directory, and once the tasks are finished, automatically moved the processed file to the final path.
Should I change the path to the temp folder for each Spark application to S3? I think is better to process locally (Local Files HDFS) then upload the processed output file to S3
Also I just see there is no "No Spark configuration set" in the databricks cluster I'm using, this interferes with the issue?
If you are able to read the raw data using spark/scala in the form of the DataFrame then you could perform transformations on your dataframe to build the final dataframe. Once you have the final dataframe then needs to be written as csv file you can just use the below single line of code to save the csv file to s3 bucket path or the hdfs path.
df.write.format('csv').option('header','true').mode('overwrite').option('sep',',').save('s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv')
I am trying to execute this code in Azure HdInsigth. I have a cluster Spark that is connected with Data Lake Storage.
spark.conf.set(
"fs.azure.sas.data.spmdevsharedstorage.blob.core.windows.net",
"xxxxxxxxxxx key xxxxxxxxxxx"
)
val shared_data = "wasbs://data#spmdevsharedstorage.blob.core.windows.net/"
//Read Csv
val dfCsv = spark.read.option("inferSchema", "true").option("header", true).csv(shared_data + "/test/4G-pixel.csv")
val dfCsv_final_withcolumn = dfCsv.select($"latitude",$"longitude")
val dfCsv_final = dfCsv_final_withcolumn.withColumn("new_latitude",col("latitude")*100)
//write
dfCsv_final.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(shared_data + "/test/4G-pixel_edit.csv")
The code reads the csv file well. So, when write the new file csv I see the following error:
20/04/03 14:58:12 ERROR AzureNativeFileSystemStore: Encountered Storage Exception for delete on Blob: https://spmdevsharedstorage.blob.core.windows.net/data/test/4G-pixel_edit.csv/_temporary/0, Exception Details: This operation is not permitted on a non-empty directory. Error Code: DirectoryIsNotEmpty
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2627)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2637)
The new file csv is written to the Data Lake but the code stops. I need you to not see this error.
How can I fix it?
I faced a similar issue.
I resolved it by using the below configuration.. set this to true.
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
or
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped","true")
I saw this example code to overwrite a partition through spark 2.3 really nicely
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet") it is not being written as parquet rather .c000 .
The compaction and overwriting of the partition if working but not the writing as parquet.
Fullc code here
val sparkSession = SparkSession.builder //.master("local[2]")
.config("spark.hadoop.parquet.enable.summary-metadata", "false")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("parquet.compression", "snappy")
.enableHiveSupport() //can just comment out hive support
.getOrCreate
sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")
val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterdays partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong
val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")
if (!dfPartition.take(1).isEmpty) {
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
sparkSession.sql(s"msck repair table $tblName")
Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
}
else {
"echo invalid partition"
}
here is the question where I got the suggestion to use this code Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns which is really good nice. I can easily use it in many cases
Using scala 2.11 , cdh 5.12 , spark 2.3
Any suggestions
The extension .c000 relates to the executor who did the file, not to the actual file format. The file could be parquet and end with .c000, or .snappy, or .zip... To know the actual file format, run this command:
hadoop dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the hdfs path to your file. You will see some strange simbols, and you should see parquet there somewhere if its actually a parquet file.
I am new to pyspark . I am migrating my project to pyspark . I am trying to read csv file from S3 and create df out of it. file name is assigned to variable cfg_file and I am using key variable for reading from S3. I am able to do same using pandas but get AnalysisException when I read using spark . I am using boto lib for S3 connection
df = spark.read.csv(StringIO.StringIO(Key(bucket,cfg_file).get_contents_as_string()), sep=',')
AnalysisException: u'Path does not exist: file:
I want spark to continuously monitor a directory and read the CSV files by using spark.readStream as soon as the file appears in that directory.
Please don't include a solution of Spark Streaming. I am looking for a way to do it by using spark structured streaming.
Here is the complete Solution for this use Case:
If you are running in stand alone mode. You can increase the driver memory as:
bin/spark-shell --driver-memory 4G
No need to set the executor memory as in Stand Alone mode executor runs within the Driver.
As Completing the solution of #T.Gaweda, find the solution below:
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
csvDf.writeStream.format("console").option("truncate","false").start()
now the spark will continuously monitor the specified directory and as soon as you add any csv file in the directory your DataFrame operation "csvDF" will be executed on that file.
Note: If you want spark to inferschema you have to first set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference","true")
where spark is your spark session.
As written in official documentation you should use "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
.readStream
.option("sep", ";")
.schema(userSchema) // Specify schema of the csv files
.csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify trigger, Spark will read new files as soon as possible