Spark: java.io.FileNotFoundException: File does not exist in copyMerge - scala

I am trying to merge all spark output part files in a directory and create a single file in Scala.
Here is my code:
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
And then at last step, I am writing data frame output like below.
dfMainOutputFinalWithoutNull.repartition(10).write.partitionBy("DataPartition","StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("header", "true")
.option("codec", "gzip")
.mode("overwrite")
.save(outputfile)
merge(mergeFindGlob, mergedFileName )
dfMainOutputFinalWithoutNull.unpersist()
When I run this I get below exception
java.io.FileNotFoundException: File does not exist: hdfs:/user/zeppelin/FinancialLineItem/temp_FinancialLineItem
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
This is how I get my output
Instead of the folder, I want to merge all files inside a folder and create a single file.

There is a copyMerge API in Hadoop 2 :
https://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/fs/FileUtil.html#line.382
Unfortunately this is going to be deprecated and removed in Hadoop 3.0.
Here's re-implementation of copyMerge (in PySpark though) I had to write as we couldn't find a better solution:
https://github.com/Tagar/stuff/blob/master/copyMerge.py
Hope it helps somebody else too.

Related

Make “Partitioned” Staging Committer work on Databricks

Currently, I have a working Spark ETL application that reads data from an S3 bucket, applies transformation, and writes them back to another S3 bucket in a parquet format. But depending on the input data volume, it takes up to a few hours to complete the job.
The processing takes ~20-25 minutes and the rest of the time job waits until S3 files are copied inside the S3.
I want to use the modern Partitioned Staging Committer (https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_.E2.80.9CPartitioned.E2.80.9D_Staging_Committer) to speed up the job and avoid wasting cluster time during the S3 routine. I've implemented a simple POC job:
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.io.Source
// scalastyle:off
object PartitionedStagingCommitterPoc {
private val spark = SparkSession.builder()
.appName("PartitionedStagingCommitterPoc")
.config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
.config("spark.sql.parquet.mergeSchema", "false")
.config("spark.sql.parquet.filterPushdown", "true")
.config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
.config("spark.sql.hive.metastorePartitionPruning", "true")
.config("spark.hadoop.parquet.enable.summary-metadata", "false")
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
.config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
.config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("spark.hadoop.fs.s3a.committer.name", "partitioned")
.config("spark.hadoop.fs.s3a.committer.magic.enabled", "false")
.config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "true")
.config("spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads", "true")
.config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
.getOrCreate()
def main(args: Array[String]): Unit = {
import spark.implicits._
val resourceStream = Thread.currentThread().getContextClassLoader.getResourceAsStream("input.csv")
val linesDataset = Source.fromInputStream(resourceStream).getLines().toSeq.toDS()
val dataFrame = spark.read
.option("header", value = true)
.option("inferSchema", value = true)
.csv(linesDataset)
dataFrame.write
.mode(SaveMode.Append)
.partitionBy("date", "country_code")
.parquet("s3a://my-sandbox/data")
}
}
My POC application works fine when I run it on my local workstation, but it fails with an exception when I try to run the same code on the Databricks using an uber jar in a "spark-submit" job:
Caused by: org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-sandbox/data': Filesystem not supported by this committer
at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.createOutputCommitter(AbstractS3ACommitterFactory.java:51)
at org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter.<init>(BindingPathOutputCommitter.java:87)
at org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter.<init>(BindingParquetOutputCommitter.scala:44)
... 76 more
Unfortunately, the exception is not descriptive, and I cannot understand what's wrong. Since the same jar works without issues when I submit it to my local Spark, I assume, that some implicit pre-configuration causes the problem in the Databricks runtime. Any help would be much appreciated.

How to get number of files read from S3 path in Spark

So, I'm using the most generic S3 read code in Spark, it reads multiple files in my specified directory into a single dataframe:
val df=spark.read.option("sep", "\t")
.option("inferSchema", "true")
.option("encoding","UTF-8")
.schema(sch)
.csv("s3://my-bucket/my-directory/")
What would be the best way (if any) to get the number of files that were read from this path?
You can try to count distinct input_file_name() :
val nbFiles = df.select(input_file_name()).distinct.count
Or using Hadoop FileSystem:
import org.apache.hadoop.fs.Path
val s3Path = new Path("s3://my-bucket/my-directory/")
val contentSummary = s3Path.getFileSystem(sc.hadoopConfiguration).getContentSummary(s3Path)
val nbFiles = contentSummary.getFileCount()

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

I have following simple Scala class , which i will later modify to fit some machine learning models.
I need to create a jar file out of this as i am going to run these models in amazon-emr . I am a beginner in this process. So i first tested whether i can successfully import the following csv file and write to another file by creating a jar file using the Scala class mention below.
The csv file looks like this and its include a Date column as one of the variables.
+-------------------+-------------+-------+---------+-----+
| Date| x1 | y | x2 | x3 |
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29| 57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41| 53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49| 61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
Data set link : https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOTPMriGgE
I created a jar file out of this using intelliJ IDEA. And it was successfully done.
object jar1 {
def main(args: Array[String]): Unit = {
val sc: SparkSession = SparkSession.builder()
.appName("SparkByExample")
.getOrCreate()
val data = sc.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load(args(0))
data.write.format("text").save(args(1))
}
}
After that I upload this jar file along with the csv file mentioned above in amazon-s3 and tried to ran this in a cluster of amazon-emr .
But it was failed and i got following error message :
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;
I am sure this error is something to do with the Date variable in the data set. But i dont know how to fix this.
Can anyone help me to figure this out ?
Updated :
I tried to open the same csv file that i mentioned earlier without the date column . In this case i am getting this error :
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;
Thank you
As I paid attention later that your are going to write to a text file. Spark's .format(text) doesn't support any specific types except String/Text. So to achive a goal you need to first convert the all the types to String and store:
df.rdd.map(_.toString().replace("[","").replace("]", "")).saveAsTextFile("textfilename")
If it's you could consider other oprions to store the data as file based, then you can have benefits of types. For example using CSV or JSON.
This is working code example based on your csv file for csv.
val spark = SparkSession.builder
.appName("Simple Application")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
import spark.sqlContext.implicits._
val df = spark.read
.format("csv")
.option("delimiter", ",")
.option("header", "true")
.option("inferSchema", "true")
.option("dateFormat", "yyyy-MM-dd")
.load("datat.csv")
df.printSchema()
df.show()
df.write
.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "\t")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
.option("escape", "\\")
.save("another")
There is no need custom encoder/decoder.

Merge Spark output CSV files with a single header

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.
I have a Scala script that takes raw data from S3, processes it and writes it to HDFS or even S3 with Spark-CSV. I think I can use multiple files as input if I want to use AWS Machine Learning tool for training a prediction model. But if I want to use something else, I presume it is best if I receive a single CSV output file.
Currently, as I do not want to use repartition(1) nor coalesce(1) for performance purposes, I have used hadoop fs -getmerge for manual testing, but as it just merges the contents of the job output files, I am running into a small problem. I need a single row of headers in the data file for training the prediction model.
If I use .option("header","true") for the spark-csv, then it writes the headers to every output file and after merging I have as many lines of headers in the data as there were output files. But if the header option is false, then it does not add any headers.
Now I found an option to merge the files inside the Scala script with Hadoop API FileUtil.copyMerge. I tried this in spark-shell with the code below.
import org.apache.hadoop.fs.FileUtil
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
val configuration = new Configuration();
val fs = FileSystem.get(configuration);
FileUtil.copyMerge(fs, new Path("smallheaders"), fs, new Path("/home/hadoop/smallheaders2"), false, configuration, "")
But this solution still just concatenates the files on top of each other and does not handle headers. How can I get an output file with only one row of headers?
I even tried adding df.columns.mkString(",") as the last argument for copyMerge, but this added the headers still multiple times, not once.
you can walk around like this.
1.Create a new DataFrame(headerDF) containing header names.
2.Union it with the DataFrame(dataDF) containing the data.
3.Output the union-ed DataFrame to disk with option("header", "false").
4.merge partition files(part-0000**0.csv) using hadoop FileUtil
In this ways, all partitions have no header except for a single partition's content has a row of header names from the headerDF. When all partitions are merged together, there is a single header in the top of the file. Sample code are the following
//dataFrame is the data to save on disk
//cast types of all columns to String
val dataDF = dataFrame.select(dataFrame.columns.map(c => dataFrame.col(c).cast("string")): _*)
//create a new data frame containing only header names
import scala.collection.JavaConverters._
val headerDF = sparkSession.createDataFrame(List(Row.fromSeq(dataDF.columns.toSeq)).asJava, dataDF.schema)
//merge header names with data
headerDF.union(dataDF).write.mode(SaveMode.Overwrite).option("header", "false").csv(outputFolder)
//use hadoop FileUtil to merge all partition csv files into a single file
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
FileUtil.copyMerge(fs, new Path(outputFolder), fs, new Path("/folder/target.csv"), true, spark.sparkContext.hadoopConfiguration, null)
Output the header using dataframe.schema
( val header = dataDF.schema.fieldNames.reduce(_ + "," + _))
create a file with the header on dsefs
append all the partition files (headerless) to the file in #2 using hadoop Filesystem API
We had a similar issue, following the below approach to get single output file-
Write dataframe to hdfs with headers and without using coalesce or repartition (after the transformations).
dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
Read the files from the previous step and write back to different location on hdfs with coalesce(1).
dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
This way, you will avoid performance issues related to coalesce or repartition while execution of transformations (Step 1).
And the second step provides single output file with one header line.
To merge files in a folder into one file:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
If you want to merge all files into one file, but still in the same folder (but this brings all data to the driver node):
dataFrame
.coalesce(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save(out)
Another solution would be to use solution #2 then move the one file inside the folder to another path (with the name of our CSV file).
def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
val tmpDir = "tmpDir"
df.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", header.toString)
.option("delimiter", sep)
.save(tmpDir)
val dir = new File(tmpDir)
val tmpCsvFile = tmpDir + File.separatorChar + "part-00000"
(new File(tmpCsvFile)).renameTo(new File(fileName))
dir.listFiles.foreach( f => f.delete )
dir.delete
}
Try to specify the schema of the header and read all file from the folder using the option drop malformed of spark-csv. This should let you read all the files in the folder keeping only the headers (because you drop the malformed).
Example:
val headerSchema = List(
StructField("example1", StringType, true),
StructField("example2", StringType, true),
StructField("example3", StringType, true)
)
val header_DF =sqlCtx.read
.option("delimiter", ",")
.option("header", "false")
.option("mode","DROPMALFORMED")
.option("inferSchema","false")
.schema(StructType(headerSchema))
.format("com.databricks.spark.csv")
.load("folder containg the files")
In header_DF you will have only the rows of the headers, from this you can trasform the dataframe the way you need.
// Convert JavaRDD to CSV and save as text file
outputDataframe.write()
.format("com.databricks.spark.csv")
// Header => true, will enable to have header in each file
.option("header", "true")
Please follow the link with Integration test on how to write a single header
http://bytepadding.com/big-data/spark/write-a-csv-text-file-from-spark/

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv , I am trying to write a single CSV, but not able to, it is making a folder.
Need a Scala function which will take parameter like path and file name and write that CSV file.
It is creating a folder with multiple files, because each partition is saved individually. If you need a single output file (still in a folder) you can repartition (preferred if upstream data is large, but requires a shuffle):
df
.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
or coalesce:
df
.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
data frame before saving:
All data will be written to mydata.csv/part-00000. Before you use this option be sure you understand what is going on and what is the cost of transferring all data to a single worker. If you use distributed file system with replication, data will be transfered multiple times - first fetched to a single worker and subsequently distributed over storage nodes.
Alternatively you can leave your code as it is and use general purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
If you are running Spark with HDFS, I've been solving the problem by writing csv files normally and leveraging HDFS to do the merging. I'm doing that in Spark (1.6) directly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
val newData = << create your dataframe >>
val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob = outputFileName
newData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.mode("overwrite")
.save(outputFileName)
merge(mergeFindGlob, mergedFileName )
newData.unpersist()
Can't remember where I learned this trick, but it might work for you.
I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data-sets, but large data-sets would all be thrown into one partition on one node. This is likely to throw OOM errors, or at best, to process slowly.
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.
EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.
EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following stack overflow article for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?
If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:
val fileprefix= "/mnt/aws/path/file-prefix"
dataset
.coalesce(1)
.write
//.mode("overwrite") // I usually don't use this, but you may want to.
.option("header", "true")
.option("delimiter","\t")
.csv(fileprefix+".tmp")
val partition_path = dbutils.fs.ls(fileprefix+".tmp/")
.filter(file=>file.name.endsWith(".csv"))(0).path
dbutils.fs.cp(partition_path,fileprefix+".tab")
dbutils.fs.rm(fileprefix+".tmp",recurse=true)
If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtils.copyMerge(). I have not done this, and don't yet know if is possible or not, e.g., on S3.
This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.
The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.
spark's df.write() API will create multiple part files inside given path ... to force spark write only a single part file use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...) as coalesce is a narrow transformation whereas repartition is a wide transformation see Spark - repartition() vs coalesce()
df.coalesce(1).write.csv(filepath,header=True)
will create folder in given filepath with one part-0001-...-c000.csv file
use
cat filepath/part-0001-...-c000.csv > filename_you_want.csv
to have a user friendly filename
This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.
More context on accepted answer
The accepted answer might give you the impression the sample code outputs a single mydata.csv file and that's not the case. Let's demonstrate:
val df = Seq("one", "two", "three").toDF("num")
df
.repartition(1)
.write.csv(sys.env("HOME")+ "/Documents/tmp/mydata.csv")
Here's what's outputted:
Documents/
tmp/
mydata.csv/
_SUCCESS
part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv
N.B. mydata.csv is a folder in the accepted answer - it's not a file!
How to output a single file with a specific name
We can use spark-daria to write out a single mydata.csv file.
import com.github.mrpowers.spark.daria.sql.DariaWriters
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = sys.env("HOME") + "/Documents/better/staging",
filename = sys.env("HOME") + "/Documents/better/mydata.csv"
)
This'll output the file as follows:
Documents/
better/
mydata.csv
S3 paths
You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = "s3a://bucket/data/src",
filename = "s3a://bucket/data/dest/my_cool_file.csv"
)
See here for more info.
Avoiding copyMerge
copyMerge was removed from Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename, as described here. Spark 3 still used Hadoop 2, so copyMerge implementations will work in 2020. I'm not sure when Spark will upgrade to Hadoop 3, but better to avoid any copyMerge approach that'll cause your code to break when Spark upgrades Hadoop.
Source code
Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.
PySpark implementation
It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.
from pathlib import Path
home = str(Path.home())
data = [
("jellyfish", "JALYF"),
("li", "L"),
("luisa", "LAS"),
(None, None)
]
df = spark.createDataFrame(data, ["word", "expected"])
df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)
Limitations
The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets. Huge datasets can not be written out as single files. Writing out data as a single file isn't optimal from a performance perspective because the data can't be written in parallel.
I'm using this in Python to get a single file:
df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)
A solution that works for S3 modified from Minkymorgan.
Simply pass the temporary partitioned directory path (with different name than final path) as the srcPath and single final csv/txt as destPath Specify also deleteSource if you want to remove the original directory.
/**
* Merges multiple partitions of spark text file output into single file.
* #param srcPath source directory of partitioned files
* #param dstPath output path of individual path
* #param deleteSource whether or not to delete source directory after merging
* #param spark sparkSession
*/
def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
import org.apache.hadoop.fs.FileUtil
import java.net.URI
val config = spark.sparkContext.hadoopConfiguration
val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
FileUtil.copyMerge(
fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
)
}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame,SaveMode,SparkSession}
import org.apache.spark.sql.functions._
I solved using below approach (hdfs rename file name):-
Step 1:- (Crate Data Frame and write to HDFS)
df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")
Step 2:- (Create Hadoop Config)
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
Step3 :- (Get path in hdfs folder path)
val pathFiles = new Path("/hdfsfolder/blah/")
Step4:- (Get spark file names from hdfs folder)
val fileNames = hdfs.listFiles(pathFiles, false)
println(fileNames)
setp5:- (create scala mutable list to save all the file names and add it to the list)
var fileNamesList = scala.collection.mutable.MutableList[String]()
while (fileNames.hasNext) {
fileNamesList += fileNames.next().getPath.getName
}
println(fileNamesList)
Step 6:- (filter _SUCESS file order from file names scala list)
// get files name which are not _SUCCESS
val partFileName = fileNamesList.filterNot(filenames => filenames == "_SUCCESS")
step 7:- (convert scala list to string and add desired file name to hdfs folder string and then apply rename)
val partFileSourcePath = new Path("/yourhdfsfolder/"+ partFileName.mkString(""))
val desiredCsvTargetPath = new Path(/yourhdfsfolder/+ "op_"+ ".csv")
hdfs.rename(partFileSourcePath , desiredCsvTargetPath)
spark.sql("select * from df").coalesce(1).write.option("mode","append").option("header","true").csv("/your/hdfs/path/")
spark.sql("select * from df") --> this is dataframe
coalesce(1) or repartition(1) --> this will make your output file to 1 part file only
write --> writing data
option("mode","append") --> appending data to existing directory
option("header","true") --> enabling header
csv("<hdfs dir>") --> write as CSV file & its output location in HDFS
repartition/coalesce to 1 partition before you save (you'd still get a folder but it would have one part file in it)
you can use rdd.coalesce(1, true).saveAsTextFile(path)
it will store data as singile file in path/part-00000
Here is a helper function with which you can get a single result-file without the part-0000 and without a subdirectory on S3 and AWS EMR:
def renameSinglePartToParentFolder(directoryUrl: String): Unit = {
import sys.process._
val lsResult = s"aws s3 ls ${directoryUrl}/" !!
val partFilename = lsResult.split("\n").map(_.split(" ").last).filter(_.contains("part-0000")).last
s"aws s3 rm ${directoryUrl}/_SUCCESS" !
s"aws s3 mv ${directoryUrl}/${partFilename} ${directoryUrl}" !
}
val targetPath = "s3://my-bucket/my-folder/my-file.csv"
df.coalesce(1).write.csv(targetPath)
renameSinglePartToParentFolder(targetPath)
Write to a single part-0000... file.
Use AWS CLI to list all files and rename the single file accordingly.
by using Listbuffer we can save data into single file:
import java.io.FileWriter
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer
val text = spark.read.textFile("filepath")
var data = ListBuffer[String]()
for(line:String <- text.collect()){
data += line
}
val writer = new FileWriter("filepath")
data.foreach(line => writer.write(line.toString+"\n"))
writer.close()
def export_csv(
fileName: String,
filePath: String
) = {
val filePathDestTemp = filePath + ".dir/"
val merstageout_df = spark.sql(merstageout)
merstageout_df
.coalesce(1)
.write
.option("header", "true")
.mode("overwrite")
.csv(filePathDestTemp)
val listFiles = dbutils.fs.ls(filePathDestTemp)
for(subFiles <- listFiles){
val subFiles_name: String = subFiles.name
if (subFiles_name.slice(subFiles_name.length() - 4,subFiles_name.length()) == ".csv") {
dbutils.fs.cp (filePathDestTemp + subFiles_name, filePath + fileName+ ".csv")
dbutils.fs.rm(filePathDestTemp, recurse=true)
}}}
There is one more way to use Java
import java.io._
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit)
{
val p = new java.io.PrintWriter(f);
try { op(p) }
finally { p.close() }
}
printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println)}