Spark: Write each record in RDD to individual files in HDFS directory - scala

I have a requirement where I want to write each individual records in an RDD to an individual file in HDFS.
I did it for the normal filesystem but obviously, it doesn't work for HDFS.
stream.foreachRDD{ rdd =>
if(!rdd.isEmpty()) {
rdd.foreach{
msg =>
val value = msg._2
println(value)
val fname = java.util.UUID.randomUUID.toString
val path = dir + fname
write(path, value)
}
}
}
where write is a function which writes to the filesystem.
Is there a way to do it within spark so that for each record I can natively write to the HDFS, without using any other tool like Kafka Connect or Flume??
EDIT: More Explanation
For eg:
If my DstreamRDD has the following records,
abcd
efgh
ijkl
mnop
I need different files for each record, so different file for "abcd", different for "efgh" and so on.
I tried creating an RDD within the streamRDD but I learnt it's not allowed as the RDD's are not serializable.

You can forcefully repartition the rdd to no. of partitions as many no. of records and then save
val rddCount = rdd.count()
rdd.repartition(rddCount).saveAsTextFile("your/hdfs/loc")

You can do in couple of ways..
From rdd, you can get the sparkCOntext, once you got the sparkCOntext, you can use parallelize method and pass the String as List of String.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the string to DF then write in the file.
for Example:
import sqlContext.implicits._
Seq(("some string")).toDF

Related

How read a csv with quotes using sparkcontext

I've started recently to use scala spark, in particular I'm trying to use GraphX in order to make a graph from a csv. To read a csv file with spark context I always do this:
val rdd = sc.textFile("file/path")
.map(line => line.split(","))
In this way I obtain an RDD of objects Array[String].
My problem is that the csv file contains strings delimited by quotes ("") and number without quotes, an example of some lines inside the file is the following:
"Luke",32,"Rome"
"Mary",43,"London"
"Mario",33,"Berlin"
If I use the method split(",") I obtain String objects that inside contain quotes, for instance the string Luke is saved as "Luke" and not as Luke.
How can I do to not consider quotes and make the correct string objects?
I hope I was clear to explain my problem
you can let the Spark DataFrame level CSV parser resolve that for you
val rdd=spark.read.csv("file/path").rdd.map(_.mkString(",")).map(_.split(","))
by the way, you can transform the Row directly to VertexId, (String,String) in the first map based on the Row fields
Try with below example.
import org.apache.spark.sql.SparkSession
object DataFrameFromCSVFile {
def main(args:Array[String]):Unit= {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val filePath="C://zipcodes.csv"
//Chaining multiple options
val df2 = spark.read.options(Map("inferSchema"->"true","sep"->",","header"->"true")).csv(filePath)
df2.show(false)
df2.printSchema()
}
}

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sc.sequenceFile[LongWritable,String](src)
val jsonRecs = file.map((record: (String, String)) => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format as the sequence files (A timestamp, a tab char, then the json). But the problem is textFile() returns an RDD[String] instead of an RDD[LongWritable,String] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[LongWritable,String]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use following code for reading a CSV file in a Dataframe where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like header option there are multiple options you can provide depending upon your requirement. Check this for more details.
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text file for my use case. This code ends up setting the variable 'file' to a RDD[String,String] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
sc.textFile(src).map(line => (line.split("\t")(0), line.split("\t")(1)))
} else { // Default to assuming sequence files are input
sc.sequenceFile[String,String](src)
}

Spark Scala - Process files in parallel and get dataframe for each file

I have thousands of large files to process.
I'm trying to read these files in parallel, and then convert each of them to DataFrame so that I can aggregate data and extract numerical features etc.
I tried sc.wholeTextFiles() which gives us file name and content tuple RDD, but i'm not allowed to use sparkContext/sqlContext to create dataframe which in the RDD map.
val allFiles = sc.wholeTextFiles(inputDir)
val rows = allFiles.map {
case (filename, content) => {
val s = content.split("\n")
// Convert the content to RDD and dataframe later
//val r = sc.parallelize(s); <-- Serialization Error
}
}
Access sc about throws serialization error as we are not supposed to sc to tasks.
I also considered:
val input = sc.textFile(inputDir)
input.mapPartitions(iter => <process partition data>)
But with above approach, when files are large it would split into several partitions and i loose ability to process file as a "whole"
So is there any other option where I can process whole files in parallel, convert each of the dataframe etc?

get size of parquet file in HDFS for repartition with Spark in Scala

I have many parquet file directories on HDFS that contain a few thousands of small(most < 100kb) parquet files each. They slow down my Spark job, so I want to combine them.
With the following code I can repartition the local parquet file to smaller number of parts:
val pqFile = sqlContext.read.parquet("file:/home/hadoop/data/file.parquet")
pqFile.coalesce(4).write.save("file:/home/hadoop/data/fileSmaller.parquet")
But I don't know how to get the size of a directory on HDFS through Scala code programmatically, hence I can't work out the number of partitions to pass to the coalesce function for the real data set.
How can I do this? Or is there a convenient way within Spark so that I can configure the writer to write fixed size of parquet partition?
You could try
pqFile.inputFiles.size
which returns "a best-effort snapshot of the files that compose this DataFrame" according to the documentation.
As an alternative, directly on the HDFS level:
val hdfs: org.apache.hadoop.fs.FileSystem =
org.apache.hadoop.fs.FileSystem.get(
new org.apache.hadoop.conf.Configuration())
val hadoopPath= new org.apache.hadoop.fs.Path("hdfs://localhost:9000/tmp")
val recursive = false
val ri = hdfs.listFiles(hadoopPath, recursive)
val it = new Iterator[org.apache.hadoop.fs.LocatedFileStatus]() {
override def hasNext = ri.hasNext
override def next() = ri.next()
}
// Materialize iterator
val files = it.toList
println(files.size)
println(files.map(_.getLen).sum)
This way you get the file sizes as well.

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv , I am trying to write a single CSV, but not able to, it is making a folder.
Need a Scala function which will take parameter like path and file name and write that CSV file.
It is creating a folder with multiple files, because each partition is saved individually. If you need a single output file (still in a folder) you can repartition (preferred if upstream data is large, but requires a shuffle):
df
.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
or coalesce:
df
.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("mydata.csv")
data frame before saving:
All data will be written to mydata.csv/part-00000. Before you use this option be sure you understand what is going on and what is the cost of transferring all data to a single worker. If you use distributed file system with replication, data will be transfered multiple times - first fetched to a single worker and subsequently distributed over storage nodes.
Alternatively you can leave your code as it is and use general purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
If you are running Spark with HDFS, I've been solving the problem by writing csv files normally and leveraging HDFS to do the merging. I'm doing that in Spark (1.6) directly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
val newData = << create your dataframe >>
val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob = outputFileName
newData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.mode("overwrite")
.save(outputFileName)
merge(mergeFindGlob, mergedFileName )
newData.unpersist()
Can't remember where I learned this trick, but it might work for you.
I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data-sets, but large data-sets would all be thrown into one partition on one node. This is likely to throw OOM errors, or at best, to process slowly.
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.
EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.
EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following stack overflow article for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?
If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:
val fileprefix= "/mnt/aws/path/file-prefix"
dataset
.coalesce(1)
.write
//.mode("overwrite") // I usually don't use this, but you may want to.
.option("header", "true")
.option("delimiter","\t")
.csv(fileprefix+".tmp")
val partition_path = dbutils.fs.ls(fileprefix+".tmp/")
.filter(file=>file.name.endsWith(".csv"))(0).path
dbutils.fs.cp(partition_path,fileprefix+".tab")
dbutils.fs.rm(fileprefix+".tmp",recurse=true)
If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtils.copyMerge(). I have not done this, and don't yet know if is possible or not, e.g., on S3.
This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.
The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.
spark's df.write() API will create multiple part files inside given path ... to force spark write only a single part file use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...) as coalesce is a narrow transformation whereas repartition is a wide transformation see Spark - repartition() vs coalesce()
df.coalesce(1).write.csv(filepath,header=True)
will create folder in given filepath with one part-0001-...-c000.csv file
use
cat filepath/part-0001-...-c000.csv > filename_you_want.csv
to have a user friendly filename
This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.
More context on accepted answer
The accepted answer might give you the impression the sample code outputs a single mydata.csv file and that's not the case. Let's demonstrate:
val df = Seq("one", "two", "three").toDF("num")
df
.repartition(1)
.write.csv(sys.env("HOME")+ "/Documents/tmp/mydata.csv")
Here's what's outputted:
Documents/
tmp/
mydata.csv/
_SUCCESS
part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv
N.B. mydata.csv is a folder in the accepted answer - it's not a file!
How to output a single file with a specific name
We can use spark-daria to write out a single mydata.csv file.
import com.github.mrpowers.spark.daria.sql.DariaWriters
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = sys.env("HOME") + "/Documents/better/staging",
filename = sys.env("HOME") + "/Documents/better/mydata.csv"
)
This'll output the file as follows:
Documents/
better/
mydata.csv
S3 paths
You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:
DariaWriters.writeSingleFile(
df = df,
format = "csv",
sc = spark.sparkContext,
tmpFolder = "s3a://bucket/data/src",
filename = "s3a://bucket/data/dest/my_cool_file.csv"
)
See here for more info.
Avoiding copyMerge
copyMerge was removed from Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename, as described here. Spark 3 still used Hadoop 2, so copyMerge implementations will work in 2020. I'm not sure when Spark will upgrade to Hadoop 3, but better to avoid any copyMerge approach that'll cause your code to break when Spark upgrades Hadoop.
Source code
Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.
PySpark implementation
It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.
from pathlib import Path
home = str(Path.home())
data = [
("jellyfish", "JALYF"),
("li", "L"),
("luisa", "LAS"),
(None, None)
]
df = spark.createDataFrame(data, ["word", "expected"])
df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)
Limitations
The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets. Huge datasets can not be written out as single files. Writing out data as a single file isn't optimal from a performance perspective because the data can't be written in parallel.
I'm using this in Python to get a single file:
df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)
A solution that works for S3 modified from Minkymorgan.
Simply pass the temporary partitioned directory path (with different name than final path) as the srcPath and single final csv/txt as destPath Specify also deleteSource if you want to remove the original directory.
/**
* Merges multiple partitions of spark text file output into single file.
* #param srcPath source directory of partitioned files
* #param dstPath output path of individual path
* #param deleteSource whether or not to delete source directory after merging
* #param spark sparkSession
*/
def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
import org.apache.hadoop.fs.FileUtil
import java.net.URI
val config = spark.sparkContext.hadoopConfiguration
val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
FileUtil.copyMerge(
fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
)
}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame,SaveMode,SparkSession}
import org.apache.spark.sql.functions._
I solved using below approach (hdfs rename file name):-
Step 1:- (Crate Data Frame and write to HDFS)
df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")
Step 2:- (Create Hadoop Config)
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
Step3 :- (Get path in hdfs folder path)
val pathFiles = new Path("/hdfsfolder/blah/")
Step4:- (Get spark file names from hdfs folder)
val fileNames = hdfs.listFiles(pathFiles, false)
println(fileNames)
setp5:- (create scala mutable list to save all the file names and add it to the list)
var fileNamesList = scala.collection.mutable.MutableList[String]()
while (fileNames.hasNext) {
fileNamesList += fileNames.next().getPath.getName
}
println(fileNamesList)
Step 6:- (filter _SUCESS file order from file names scala list)
// get files name which are not _SUCCESS
val partFileName = fileNamesList.filterNot(filenames => filenames == "_SUCCESS")
step 7:- (convert scala list to string and add desired file name to hdfs folder string and then apply rename)
val partFileSourcePath = new Path("/yourhdfsfolder/"+ partFileName.mkString(""))
val desiredCsvTargetPath = new Path(/yourhdfsfolder/+ "op_"+ ".csv")
hdfs.rename(partFileSourcePath , desiredCsvTargetPath)
spark.sql("select * from df").coalesce(1).write.option("mode","append").option("header","true").csv("/your/hdfs/path/")
spark.sql("select * from df") --> this is dataframe
coalesce(1) or repartition(1) --> this will make your output file to 1 part file only
write --> writing data
option("mode","append") --> appending data to existing directory
option("header","true") --> enabling header
csv("<hdfs dir>") --> write as CSV file & its output location in HDFS
repartition/coalesce to 1 partition before you save (you'd still get a folder but it would have one part file in it)
you can use rdd.coalesce(1, true).saveAsTextFile(path)
it will store data as singile file in path/part-00000
Here is a helper function with which you can get a single result-file without the part-0000 and without a subdirectory on S3 and AWS EMR:
def renameSinglePartToParentFolder(directoryUrl: String): Unit = {
import sys.process._
val lsResult = s"aws s3 ls ${directoryUrl}/" !!
val partFilename = lsResult.split("\n").map(_.split(" ").last).filter(_.contains("part-0000")).last
s"aws s3 rm ${directoryUrl}/_SUCCESS" !
s"aws s3 mv ${directoryUrl}/${partFilename} ${directoryUrl}" !
}
val targetPath = "s3://my-bucket/my-folder/my-file.csv"
df.coalesce(1).write.csv(targetPath)
renameSinglePartToParentFolder(targetPath)
Write to a single part-0000... file.
Use AWS CLI to list all files and rename the single file accordingly.
by using Listbuffer we can save data into single file:
import java.io.FileWriter
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer
val text = spark.read.textFile("filepath")
var data = ListBuffer[String]()
for(line:String <- text.collect()){
data += line
}
val writer = new FileWriter("filepath")
data.foreach(line => writer.write(line.toString+"\n"))
writer.close()
def export_csv(
fileName: String,
filePath: String
) = {
val filePathDestTemp = filePath + ".dir/"
val merstageout_df = spark.sql(merstageout)
merstageout_df
.coalesce(1)
.write
.option("header", "true")
.mode("overwrite")
.csv(filePathDestTemp)
val listFiles = dbutils.fs.ls(filePathDestTemp)
for(subFiles <- listFiles){
val subFiles_name: String = subFiles.name
if (subFiles_name.slice(subFiles_name.length() - 4,subFiles_name.length()) == ".csv") {
dbutils.fs.cp (filePathDestTemp + subFiles_name, filePath + fileName+ ".csv")
dbutils.fs.rm(filePathDestTemp, recurse=true)
}}}
There is one more way to use Java
import java.io._
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit)
{
val p = new java.io.PrintWriter(f);
try { op(p) }
finally { p.close() }
}
printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println)}