I am trying to execute this code in Azure HDInsight. I have a Spark cluster that is connected to Data Lake Storage.
spark.conf.set(
"fs.azure.sas.data.spmdevsharedstorage.blob.core.windows.net",
"xxxxxxxxxxx key xxxxxxxxxxx"
)
val shared_data = "wasbs://data@spmdevsharedstorage.blob.core.windows.net/"
//Read Csv
val dfCsv = spark.read.option("inferSchema", "true").option("header", true).csv(shared_data + "/test/4G-pixel.csv")
val dfCsv_final_withcolumn = dfCsv.select($"latitude",$"longitude")
val dfCsv_final = dfCsv_final_withcolumn.withColumn("new_latitude",col("latitude")*100)
//write
dfCsv_final.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(shared_data + "/test/4G-pixel_edit.csv")
The code reads the CSV file correctly. However, when writing the new CSV file I see the following error:
20/04/03 14:58:12 ERROR AzureNativeFileSystemStore: Encountered Storage Exception for delete on Blob: https://spmdevsharedstorage.blob.core.windows.net/data/test/4G-pixel_edit.csv/_temporary/0, Exception Details: This operation is not permitted on a non-empty directory. Error Code: DirectoryIsNotEmpty
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2627)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2637)
The new CSV file is written to the Data Lake, but the job stops with this error. I need to avoid this error.
How can I fix it?
I faced a similar issue.
I resolved it by setting the configuration below to true.
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
or
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped","true")
Related
I am trying to write to a CSV file from this Scala code. I'm using HDFS as a temp directory, then just writer.write to create a new file in an existing subfolder. I get the error message shown after the code:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
val hconf = new Configuration() // initialize new hadoop configuration
new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
The same happens whether I choose a new file or an existing one. I've checked that the path is correct; I just want to create a new file in there.
The problem is that in order to write data using a file-system-based source you need a temporary directory. This is part of the commit mechanism used by Spark: data is first written to a temporary directory, and once the tasks are finished, the processed files are automatically moved to the final path.
Should I change the temp folder path for each Spark application to S3? I think it is better to process locally (local files / HDFS) and then upload the processed output file to S3.
Also, I just see there is no Spark configuration set in the Databricks cluster I'm using; could this be related to the issue?
If you are able to read the raw data using Spark/Scala into a DataFrame, you can perform transformations on that DataFrame to build the final DataFrame. Once you have the final DataFrame that needs to be written as a CSV file, you can use the single line of code below to save it to an S3 bucket path or an HDFS path.
df.write.format("csv").option("header", "true").mode("overwrite").option("sep", ",").save("s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv")
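For reference, a longer hedged sketch of that read-transform-write flow in Scala. The selected column names are taken from the header the question writes and may not match the real raw data, and note that df.write produces a directory of part files at the given path rather than a single CSV file:
val inputFile  = "s3a://tfsdl-ghd-wb/raidnd/rawdata.csv"
val outputPath = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20"

// read the raw CSV into a DataFrame
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inputFile)

// build the final DataFrame with whatever transformations are needed
// (the selected columns here are assumptions based on the question's header line)
val finalDf = raw.select("Sales", "cust", "Number", "Date", "Credit", "SKU")

// write it back as CSV
finalDf.write
  .format("csv")
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .save(outputPath)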
I have a script that writes Hive table content into a CSV file in HDFS.
The target folder name is given in a JSON parameter file. When I launch the script I notice that the folder that I already created is deleted automatically, and then an error is thrown saying that the target file does not exist. This is my script:
sigma.cache // sigma is the df that contains the hive table. Tested OK
sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
//Parametre_vigiliste.cible is the variable inside the JSON file that contains the target folder name
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.txt"));
sigma.unpersist()
ERROR THROWN:
exception caught: java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could this code be deleting the folder for some reason? Thank you.
So, as Prateek suggested, I tried sigma.printSchema and discovered some null-typed columns. I rectified that and it worked.
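For reference, a hedged sketch of that check and fix, reusing sigma and the write call from the script above; casting NullType columns to string is just one way to handle them:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

sigma.printSchema() // inspect the inferred schema first

// collect the names of columns that were inferred as NullType
val nullCols = sigma.schema.fields.collect { case f if f.dataType == NullType => f.name }

// cast them to string so the CSV writer accepts them
val fixed = nullCols.foldLeft(sigma)((df, c) => df.withColumn(c, col(c).cast(StringType)))

fixed.repartition(1).write.mode(SaveMode.Overwrite).format("csv")
  .option("header", true).option("delimiter", "|")
  .save(Parametre_vigiliste.cible)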
I'm reading metrics data from JSON files on S3. What is the right way to handle the case when the path to a file doesn't exist? Currently I'm getting an AnalysisException: Path does not exist when there is no file with a given $metricsData name.
I think one way is to throw an exception, but how should I correctly check whether the path to the file exists?
val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
.json(s"$dataPath/$metricsData.json")
I wouldn't use java.nio.file; it doesn't have proper bindings for S3 and/or HDFS. If you want your code to be applicable to all filesystems (local, in Docker (CI/CD), S3, HDFS, etc.), try the Apache Hadoop utilities:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
val path = new Path("base/path/to/data")
val fs = path.getFileSystem(new Configuration())
// applicable for local and remote FS
if (fs.exists(path)) {
sparkSession.read(...)
}
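Applied to the question's JSON read, a hedged sketch might look like the following; $dataPath and $metricsData come from the question, and returning an Option is just one way to handle the missing-path case (throwing, as the question suggests, would work too):
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

def readMetricsIfPresent(dataPath: String, metricsData: String): Option[DataFrame] = {
  val path = new Path(s"$dataPath/$metricsData.json")
  // reuse the Hadoop configuration of the running Spark application
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  if (fs.exists(path))
    Some(spark.read.option("multiline", "true").json(path.toString))
  else
    None // or throw a custom exception here
}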
You can use java.nio.file :
import java.nio.file.{Paths, Files}
if (Files.exists(Paths.get(s"$dataPath/$metricsData.json"))) {
  val metricsDataDF: DataFrame = spark.read.option("multiline", "true")
    .json(s"$dataPath/$metricsData.json")
}
I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is output in the current worker node directory, and I know this because when I execute "ls -a"!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using sparkContext.addFile("out.las") and then access the location using val location = SparkFiles.get("out.las"), but this didn't work either.
I even ran the command val locationPt = "pwd"!!, then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column "x" exists because I've downloaded some of the files from HDFS, decompressed them locally, and run some tests.
I need to decompress the files one by one because I have 1.6 TB of data, so I cannot decompress it all in one go and access the files later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS, and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
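A hedged sketch of that workflow, reusing laszippath, inputFileName and the .las reader from the question; the HDFS target path is only illustrative:
import org.apache.hadoop.fs.{FileSystem, Path}
import sys.process._

// decompress to the worker-local directory, as in the question
Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!

// copy the local file into HDFS, overwriting the previous copy
val fs = FileSystem.get(sc.hadoopConfiguration)
val hdfsTarget = new Path("/user/me/dataForHDFS/out.las") // illustrative path
fs.copyFromLocalFile(false, true, new Path("out.las"), hdfsTarget)

// read it back through the SQL context over HDFS
val dataFrame = sqlContext.read.las(hdfsTarget.toString)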
I have used the Hadoop API before to get to files; I don't know if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, so take it just as an idea of what to do afterward:
val file = new Path(filePath + "out.las") // path to the file in HDFS
val status = fs.getFileStatus(file) // needed to size the read buffer
val readIn: Array[Byte] = new Array[Byte](status.getLen.toInt)
val fileIn: FSDataInputStream = fs.open(file)
fileIn.readFully(0, readIn) // read the whole file into the buffer
fileIn.close()
I am reading a SAS file from Azure Blob storage, converting it to CSV, and trying to upload the CSV back to Azure Blob. For small files (MBs) I am able to do this successfully with the following Spark Scala code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.github.saurfang.sas.spark._
val sqlContext = new SQLContext(sc)
val df=sqlContext.sasFile("wasbs://container@storageaccount/input.sas7bdat")
df.write.format("csv").save("wasbs://container@storageaccount/output.csv");
But for large files (GBs) it gives me an AnalysisException: wasbs://container@storageaccount/output.csv file already exists. I have tried overwrite also, but no luck. Any help would be appreciated.
Actually, you normally cannot overwrite an existing file on HDFS, even for small files in MBs.
Please try the code below to overwrite; also check your Spark version, because there are some differences in how this method is used across Spark versions.
df.write.format("csv").mode("overwrite").save("wasbs://container#storageaccount/output.csv");
I don't know whether you had already tried overwrite mode with the code above, as you said.
So there is another way to do it: first delete the existing files before doing the overwrite operation.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("<hdfs://<namenodehost>/ or wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path> >"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
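Putting the delete step and the write together, a rough sketch; the placeholder path mirrors the question, df is the DataFrame read from the SAS file, and using spark.sparkContext.hadoopConfiguration assumes the storage credentials are already configured on the cluster:
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = "wasbs://container@storageaccount/output.csv" // placeholder path from the question

// delete any previous output first, ignoring the error if it does not exist
val fs = FileSystem.get(new java.net.URI(outputPath), spark.sparkContext.hadoopConfiguration)
try { fs.delete(new Path(outputPath), true) } catch { case _: Throwable => () }

// then write with overwrite mode
df.write.format("csv").mode("overwrite").save(outputPath)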
And there is a Spark mailing list thread that discusses a similar issue; please see http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html.