Spark writing compressed CSV with custom path to S3 - scala

I'm trying to simply write a CSV to S3 using Spark, written in Scala.
I notice the following file in my output bucket:
...PROCESSED/montfh-04.csv/part-00000-723a3d72-56f6-4e62-b627-9a181a820f6a-c000.csv.snappy
when it should only be montfh-04.csv
Code:
val processedMetadataDf = spark.read.csv("s3://" + metadataPath + "/PROCESSED/" + "month-04" + ".csv")
val processCount = processedMetadataDf.count()
if (processCount == 0) {
// Initial frame is 0B -> Overwrite with path
val newDat = Seq("dummy-row-data")
val unknown_df = newDat.toDF()
unknown_df.write.mode("overwrite").option("header","false").csv("s3://" + metadataPath + "/PROCESSED/" + "montfh-04" + ".csv")
}
Here I notice two strange things:
It puts it in a directory
It adds that weird part sub-path to the file with snappy compression
All I am trying to do is simply write a flat CSV file with that name to the specified path. What are my options?

This is how Spark works. The location you provide when saving a Dataset/DataFrame is the directory where Spark writes all of its partitions.
The number of part files will be equal to the number of partitions, which in your case is only 1.
Now, if you want the filename to be montfh-04.csv only, then you can rename it afterwards.
Note: renaming in S3 is a costly operation (copy and delete). As you are writing with Spark, it will be roughly 3x the I/O: 2x for the output commit operation and 1x for the rename. It is better to write to HDFS and upload the file from there under the required key name.
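If you do stay on S3, the rename can be scripted with the Hadoop FileSystem API. A minimal sketch, assuming an s3a-compatible configuration and a single output partition; the staging path name, the compression option, and coalesce(1) are illustrative choices, not part of the original answer:
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a staging directory first (the _tmp suffix is an arbitrary choice)
val staging = new Path("s3a://" + metadataPath + "/PROCESSED/montfh-04_tmp")
val target  = new Path("s3a://" + metadataPath + "/PROCESSED/montfh-04.csv")

unknown_df.coalesce(1)
  .write.mode("overwrite")
  .option("header", "false")
  .option("compression", "none")   // avoids the .snappy suffix seen in the question
  .csv(staging.toString)

val fs = staging.getFileSystem(spark.sparkContext.hadoopConfiguration)

// coalesce(1) guarantees exactly one part file inside the staging directory
val partFile = fs.globStatus(new Path(staging, "part-*"))(0).getPath

// On S3 a rename is still a copy + delete under the hood
fs.delete(target, true)
fs.rename(partFile, target)
fs.delete(staging, true)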

Related

Fast file writing in scala?

So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge-list file for use with GraphX.
The biggest slowdown is actually creating this text file; we're talking maybe a million records that it writes to this text file. Is there a way I can parallelize this task or make it faster in any way, for example by storing it in memory?
More info:
I am using a Hadoop cluster to iterate through a graph, and here is the code snippet I am currently using to create the text file and write it to HDFS:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}
os.close()
fs.close()
Writing files to HDFS is never fast. Your tags suggest that you are already using Spark anyway, so you might as well take advantage of it.
// assumes a SparkSession is in scope and spark.implicits._ has been imported (needed for .toDF)
sparkContext
  .makeRDD(edges.toStream, 20)
  .map(e => e.inVertex.id -> e.outVertex.id)
  .toDF
  .write
  .option("sep", " ")   // DataFrameWriter has no .delimiter method; the CSV "sep" option sets the delimiter
  .csv(path)
This splits your input into 20 partitions (you can control that number with the numeric parameter to makeRDD above) and writes them in parallel as 20 different chunks in HDFS that together represent your resulting file.
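If you really need a single output file rather than 20 chunks, a minimal variation (trading away parallelism in the final write) is to coalesce to one partition before writing; the output path below is illustrative:
sparkContext
  .makeRDD(edges.toStream, 20)
  .map(e => e.inVertex.id -> e.outVertex.id)
  .toDF
  .coalesce(1)          // a single task writes a single part file
  .write
  .option("sep", " ")
  .csv("/home/user/graph/edgelist-single")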

scala loop through multiple files in the path

I am new to Spark and Scala. I have the following requirement: I need to process all the files under a path that has sub-directories. I guess I need to write for-loop logic to process all the files.
Below is an example of my case:
src/proj_fldr/dataset1/20170624/file1.txt
src/proj_fldr/dataset1/20170624/file2.txt
src/proj_fldr/dataset1/20170624/file3.txt
src/proj_fldr/dataset1/20170625/file1.txt
src/proj_fldr/dataset1/20170625/file2.txt
src/proj_fldr/dataset1/20170625/file3.txt
src/proj_fldr/dataset1/20170626/file1.txt
src/proj_fldr/dataset1/20170626/file2.txt
src/proj_fldr/dataset1/20170626/file3.txt
src/proj_fldr/dataset2/20170624/file1.txt
src/proj_fldr/dataset2/20170624/file2.txt
src/proj_fldr/dataset2/20170624/file3.txt
src/proj_fldr/dataset2/20170625/file1.txt
src/proj_fldr/dataset2/20170625/file2.txt
src/proj_fldr/dataset2/20170625/file3.txt
src/proj_fldr/dataset2/20170626/file1.txt
src/proj_fldr/dataset2/20170626/file2.txt
src/proj_fldr/dataset2/20170626/file3.txt
I need the code to iterate the files like
In src
loop (proj_fldr
loop(dataset
loop(datefolder
loop(file1 then, file2....))))
Since you have a regular file structure you can use the wildcard * when reading the files. You can do the following to read all the files into a single RDD:
val spark = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.wholeTextFiles("src/*/*/*/*.txt")
The result will be an RDD[(String, String)] with the path and the content in a tuple for each processed file.
To explicitly choose between local and HDFS files, you can prefix the path with "hdfs://" or "file://".
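From there you can work with the (path, content) pairs directly instead of writing nested loops. A small sketch (the parsing of dataset and date out of the path is an illustrative assumption about the layout shown above):
// each element is (fullPath, fileContent)
val perFile = rdd.map { case (path, content) =>
  val parts = path.split("/")
  val dataset = parts(parts.length - 3)   // e.g. "dataset1"
  val date    = parts(parts.length - 2)   // e.g. "20170624"
  (dataset, date, content.split("\n").length)
}

perFile.collect().foreach { case (dataset, date, lineCount) =>
  println(s"$dataset/$date: $lineCount lines")
}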

How to write Spark Streaming output to HDFS without overwriting

After some processing I have a DStream[String, ArrayList[String]]. When I write it to HDFS using saveAsTextFile, the data gets overwritten after every batch, so how do I write the new results by appending to the previous results?
output.foreachRDD(r => {
r.saveAsTextFile(path)
})
Edit: It would also help if someone could show how to convert the output to Avro format and then write it to HDFS with append.
saveAsTextFile does not support append. If called with a fixed filename, it will overwrite it every time.
We could do saveAsTextFile(path+timestamp) to save to a new file every time. That's the basic functionality of DStream.saveAsTextFiles(path)
An easily accessible format that supports append is Parquet. We first transform our data RDD to a DataFrame or Dataset and then we can benefit from the write support offered on top of that abstraction.
case class DataStructure(field1, ..., fieldn)

... streaming setup, dstream declaration, ...

val structuredOutput = outputDStream.map(record => mapFunctionRecordToDataStructure)

structuredOutput.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF()
  df.write.format("parquet").mode("append").save(s"$workDir/$targetFile")
}
Note that appending to Parquet files gets more expensive over time, so rotating the target file from time to time is still a requirement.
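One simple way to rotate is to append into a per-day directory so that no single Parquet dataset grows forever; the date-based scheme below is an illustrative choice, not part of the original answer:
import java.time.LocalDate

structuredOutput.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF()
  val day = LocalDate.now.toString   // e.g. "2017-06-24"
  df.write.format("parquet").mode("append").save(s"$workDir/$targetFile/dt=$day")
}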
If you want to keep appending to the same location on the file system, store it as a Parquet file. You can do it like this:
kafkaData.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val df = rdd.toDF()
    df.write.mode(SaveMode.Append).save("/path")
  }
}
Storing the streaming output to HDFS will always create new files, even when you use append with Parquet, which leads to a small-files problem on the NameNode. I would recommend writing your output to sequence files, where you can keep appending to the same file.
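A minimal sketch of appending to a single sequence file with the Hadoop API, assuming Hadoop 2.6.1+ (which added Writer.appendIfExists) and Text keys/values; the path and the batch variable are illustrative:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.Writer

val conf = new Configuration()
val writer = SequenceFile.createWriter(
  conf,
  Writer.file(new Path("/data/stream-output.seq")),
  Writer.keyClass(classOf[Text]),
  Writer.valueClass(classOf[Text]),
  Writer.appendIfExists(true)   // reopen and append instead of failing when the file exists
)

// "batch" stands for a small Seq[(String, String)] collected to the driver from each RDD
batch.foreach { case (k, v) => writer.append(new Text(k), new Text(v)) }
writer.close()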
Here is how I solved the issue without a DataFrame:
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime

messages.foreachRDD { rdd =>
  // repartition returns a new RDD, so chain it rather than discarding the result
  val eachRdd = rdd.repartition(1).map(record => record.value)
  if (!eachRdd.isEmpty) {
    eachRdd.saveAsTextFile(hdfs_storage + DateTimeFormatter.ofPattern("yyyyMMddHHmmss").format(LocalDateTime.now) + "/")
  }
}

Specifying the filename when saving a DataFrame as a CSV [duplicate]

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can obtain a DataFrameWriter from a DataFrame (Dataset[Row]) and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
path: the location/folder name, not the file name.
Spark stores the CSV output at the specified location as files named part-*.csv.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code :
df.coalesce(1).write.csv("sample_path")
Current Output :
sample_path
|
+-- part-r-00000.csv
Desired Output :
sample_path
|
+-- my_file.csv
Note: coalesce(1) is used to produce a single output file, and the executor has enough memory to hold the DataFrame without a memory error.
It's not possible to do this directly in Spark's save.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you have part- files. You can easily change the filename after processing, just like in this question.
In Scala it will look like:
import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath.getName
fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
fs.delete(new Path("csvDirectory/data.csv"), true)
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for information on using this approach to set the file name.
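As a hedged illustration of that OutputFormat approach (the class name, the hard-coded my_file.csv, and the NullWritable/Text choice are all assumptions; it only makes sense with a single partition, otherwise tasks would collide on the same file name):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, TextOutputFormat}

// Names the task's output file "my_file.csv" instead of part-r-*
class SingleNameOutputFormat extends TextOutputFormat[NullWritable, Text] {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    val committer = getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    new Path(committer.getWorkPath, "my_file.csv")
  }
}

// Usage: turn the DataFrame into (key, line) pairs and save with the custom format
df.coalesce(1)
  .rdd
  .map(row => (NullWritable.get(), new Text(row.mkString(","))))
  .saveAsNewAPIHadoopFile(
    "sample_path",
    classOf[NullWritable],
    classOf[Text],
    classOf[SingleNameOutputFormat])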

Does reading multiple files & collect bring them to driver in spark

Code snippet :
val inp = sc.textFile("C:\\mk\\logdir\\foldera\\foldera1\\log.txt").collect.mkString(" ")
I know the above code reads the entire file, combines it into one string, and executes it on the driver node (a single execution, not a parallel one).
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
code block{ }
sc.stop
Q1) Here I am reading multiple files (which are present in the above folder structure). I believe in this case each file will be created as a partition & will be sent to a separate node & executed in parallel. Am I correct in my understanding? Can someone confirm this? Or is there any way I can confirm it systematically?
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
val cont = inp.collect.mkString(" ")
code block{ }
sc.stop
Q2) How does Spark handle this case? Though I am doing collect, I assume that it will not collect all content from all files but just one file. Am I right? Can someone help me understand this?
Thank you very much in advance for your time & help.
Q1) Here I am reading multiple files (which are present in the above folder structure). I believe in this case each file will be created as a partition & will be sent to a separate node & executed in parallel. Am I correct in my understanding? Can someone confirm this? Or is there any way I can confirm it systematically?
ANSWER:
SparkContext's textFile method, i.e. sc.textFile, creates an RDD with each line as an element. If there are 10 files in your data folder (say you read it into an RDD called yourtextfilesfolder), 10 partitions will be created. You can verify the number of partitions with:
yourtextfilesfolder.partitions.length
However, partitioning is determined by data locality, which may result in too few partitions by default. AFAIK there is no guarantee that one partition per file will be created; please see the code of SparkContext.textFile.
minPartitions is the suggested minimum number of partitions for the resulting RDD.
For a better understanding, see the method below.
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
You can pass minPartitions explicitly, as shown above in SparkContext.scala.
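For example, to request more parallelism when reading the logs (the number 10 is just an illustration):
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt", minPartitions = 10)
println(inp.partitions.length)   // usually at least 10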
Q2) How does Spark handle this case? Though I am doing collect, I assume that it will not collect all content from all files but just one file. Am I right? Can someone help me understand this?
ANSWER: Your RDD is constructed from multiple text files, so collect will gather data from all partitions, across all the files, back to the driver, not one file at a time.
You can verify this using rdd.collect.
However, if you want to read multiple text files you can also use wholeTextFiles.
Please see the @note in the method below: small files are preferred; large files are also allowable, but may cause bad performance.
See spark-core-sc-textfile-vs-sc-wholetextfiles
Doc:
wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
/**
 * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI. Each file is read as a single record and returned in a
 * key-value pair, where the key is the path of each file, the value is the content of each file.
 *
 * <p> For example, if you have the following files:
 * {{{
 *   hdfs://a-hdfs-path/part-00000
 *   hdfs://a-hdfs-path/part-00001
 *   ...
 *   hdfs://a-hdfs-path/part-nnnnn
 * }}}
 *
 * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
 *
 * <p> then `rdd` contains
 * {{{
 *   (a-hdfs-path/part-00000, its content)
 *   (a-hdfs-path/part-00001, its content)
 *   ...
 *   (a-hdfs-path/part-nnnnn, its content)
 * }}}
 *
 * @note Small files are preferred, large file is also allowable, but may cause bad performance.
 * @note On some filesystems, `.../path/*` can be a more efficient way to read all files
 *       in a directory rather than `.../path/` or `.../path`
 * @note Partitioning is determined by data locality. This may result in too few partitions
 *       by default.
 *
 * @param path Directory to the input data files, the path can be comma separated paths as the
 *             list of inputs.
 * @param minPartitions A suggestion value of the minimal splitting number for input data.
 * @return RDD representing tuples of file path and the corresponding file content
 */
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
  .....
}
Examples :
val distFile = sc.textFile("data.txt")
The above command returns the content of the file:
scala> distFile.collect()
res16: Array[String] = Array(1,2,3, 4,5,6)
SparkContext.wholeTextFiles can return (filename, content).
val distFile = sc.wholeTextFiles("/tmp/tmpdir")
scala> distFile.collect()
res17: Array[(String, String)] =
Array((maprfs:/tmp/tmpdir/data3.txt,"1,2,3
4,5,6
"), (maprfs:/tmp/tmpdir/data.txt,"1,2,3
4,5,6
"), (maprfs:/tmp/tmpdir/data2.txt,"1,2,3
4,5,6
"))
In your case I'd prefer SparkContext.wholeTextFiles, where you can get the filename and its content after collect, as described above, if that's what you wanted.
Spark is a fast and general engine for large-scale data processing. It processes all the data in parallel. So, to answer the first question: yes, in the following case:
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
code block{ }
sc.stop
Each file will be created as a partition & sent to a separate node & executed in parallel. But, depending on the size of a file, the number of partitions can be greater than the number of files being processed. For example, if log.txt in folder1 and folder2 are each only a few KB in size, then only 2 partitions are created, since there are 2 files, and they will be processed in parallel.
But if log.txt in folder1 has a size in GB(s), then multiple partitions will be created for it, and the number of partitions will be greater than the number of files.
However, we can always change the number of partitions of an RDD using the repartition() or coalesce() methods.
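For instance (the number 8 is arbitrary; getNumPartitions shows what Spark actually created):
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
println(inp.getNumPartitions)            // how many partitions Spark created
val repartitioned = inp.repartition(8)   // spread the data across 8 partitions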
To answer the second question, in the following case:
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
val cont = inp.collect.mkString(" ")
code block{ }
sc.stop
Spark will collect content from all files and not just from one file, since collect() means gathering all the content stored in an RDD and bringing it back to the driver in the form of a collection.
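If you want to confirm this systematically, a small check (wholeTextFiles is used here only for verification; the line counting is illustrative) is to compare per-file line counts against the total seen by textFile:
val perFileLines = sc.wholeTextFiles("C:\\mk\\logdir\\*\\*\\log.txt")
  .map { case (path, content) => path -> content.split("\n").length }
  .collect()

val totalLines = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt").count()

perFileLines.foreach { case (path, n) => println(s"$path -> $n lines") }
println(s"total lines across all files: $totalLines")   // should match the sum of the per-file counts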