Specifying the filename when saving a DataFrame as a CSV [duplicate] - scala

This question already has an answer here:
Spark dataframe save in single file on hdfs location [duplicate]
(1 answer)
Closed 5 years ago.
Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can get a DataFrameWriter from a DataFrame (Dataset[Row]) and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
path: the location/folder name, not the file name.
Spark stores the CSV at the specified location by creating files named part-*.csv.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code :
df.coalesce(1).write.csv("sample_path")
Current Output :
sample_path
|
+-- part-r-00000.csv
Desired Output :
sample_path
|
+-- my_file.csv
Note: the coalesce function is used to output a single file, and the executor has enough memory to collect the DataFrame without a memory error.

It's not possible to do it directly in Spark's save.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you have part- files. You can easily change the filename after processing, just as in this question.
In Scala it will look like:
import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
// Find the single part-* file Spark wrote inside the output directory
val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath.getName
// Move it out of the directory under the desired name
fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
// Remove the now-obsolete output directory
fs.delete(new Path("csvDirectory/data.csv"), true)
or just:
import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
// The exact part-file name varies (in Spark 2.x it looks like part-00000-<uuid>.csv), so check the directory first
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for information about using this approach to set the file name.
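For reference, here is a rough sketch of that approach (not part of the original answer): the old mapred API's MultipleTextOutputFormat exposes a generateFileNameForKeyValue hook that controls the generated file name. The class name and the naive comma-join serialization below are illustrative assumptions.
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route every record to a fixed file name instead of part-*
class SingleFileNameOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    "my_file.csv"
}

df.coalesce(1)
  .rdd
  .map(row => (NullWritable.get(), row.mkString(",")))  // NullWritable keys are skipped by TextOutputFormat
  .saveAsHadoopFile(
    "sample_path",
    classOf[NullWritable],
    classOf[String],
    classOf[SingleFileNameOutputFormat]
  )
Note that the output still lands inside the sample_path directory (along with a _SUCCESS marker); only the part-* name is replaced.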

Spark - Write DataFrame with custom file name [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 1 year ago.
I have a Spark (2.4) DataFrame that I want to write as a pipe-separated file. It should be pretty straightforward, like so:
val myDF = spark.table("mySchema.myTable")
myDF.coalesce(1).write.format("csv").option("header", "true").option("delimiter", "|").save("/tmp/myDF")
I get a part-*.csv file in /tmp/myDF.
So far, so good. But I actually want the file name to be something specific, e.g. /tmp/myDF.csv.
But passing that string to save will just create a directory called myDF.csv and put the part-*.csv file inside it.
Is there a way to write the DataFrame with a specific name?
You can't do that with Spark.
You can rename the file afterwards by accessing the file system:
import java.io.File

val directory = new File("/tmp/myDF")
if (directory.exists && directory.isDirectory) {
  // Pick the part file Spark produced and rename it to the desired name
  val file = directory.listFiles.filter(_.getName.endsWith(".csv")).head
  file.renameTo(new File("/tmp/myDF.csv"))
}
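Note that java.io.File only works when /tmp/myDF sits on the local filesystem. If the output directory lives on HDFS, the same rename can be done through the Hadoop FileSystem API, along the lines of this sketch (paths are illustrative; see also the first answer above):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Grab the single part file inside the output directory and move it out under the desired name
val part = fs.globStatus(new Path("/tmp/myDF/part*.csv"))(0).getPath
fs.rename(part, new Path("/tmp/myDF.csv"))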

Spark Empty JSON Files reading from Directory

I'm reading from a path say /json//myfiles_.json
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual empty files by checking whether head is empty, but I need to do this for the whole collection of files matched by the wildcard path when building the dataframe.
So the answer seems to be that I need to provide a schema explicitly, because it can't infer one from an empty file - as you would expect!
e.g.
// Infer the schema from a file that has data (or build it manually)
val schemadf = sqlContext.read.json(schemapath)
val schema = schemadf.schema

// Apply that schema when reading the full (possibly empty) file set
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw
  .withColumn("MyArray", explode($"MyArray"))
  .select($"ID", $"name", $"CreatedAt")

display(prep)
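If you would rather skip the empty files entirely, a hedged alternative is to list the input directory with the Hadoop FileSystem API and pass only the non-empty paths to the reader. The directory name below is illustrative, and this assumes Spark 2.x, where the json reader accepts varargs paths:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// Keep only files that actually contain bytes
val nonEmptyFiles = fs.listStatus(new Path("/json/"))
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)

val raw = sqlContext.read.schema(schema).json(nonEmptyFiles: _*)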

How to read data from HDFS using Scala [duplicate]

This question already has answers here:
Spark - load CSV file as DataFrame?
(14 answers)
Closed 4 years ago.
How do I read data sets from HDFS using Scala? The data is a CSV file with a limited number of records.
You tagged the question with Spark, so I'm assuming you are trying to use that. I would recommend you start by reading through the Spark documentation here to get an idea of how to use Spark to interact with your data.
https://spark.apache.org/docs/latest/quick-start.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
But, to answer your specific question, in Spark you would read in the CSV file using code like this:
val csvDf = spark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "true")
  .load("hdfs://some/path/to/data.csv/")
The path you provide will be to a CSV file on HDFS, or to a folder containing multiple CSV files. Spark will also accept other types of file systems. For example, you could use "file://" to access the local file system, or "s3://" to use S3. Once you have loaded the data you will have a Spark DataFrame object with SQL-like methods available to interact with it.
Note: I provided the separator option just to show how to do it, but it defaults to "," anyway, so it is not required. Also, if your CSV files do not include a header, you will need to specify the schema yourself and set header to false instead.
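For the header-less case, a minimal sketch of providing the schema yourself might look like this (the column names are made up for illustration):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = true),
  StructField("name", StringType,  nullable = true)
))

val csvDf = spark.read
  .format("csv")
  .option("header", "false")
  .schema(schema)
  .load("hdfs://some/path/to/data.csv/")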
You can also read data from HDFS directly by following this approach:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(new URI("hdfs://hdfsUrl:port/"), new Configuration())
val path = new Path("/pathOfTheFileInHDFS/")
val stream = hdfs.open(path)
// Lazily read lines until readLine returns null (end of file)
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))
// This example checks each line for null and prints every existing line in order
readLines.takeWhile(_ != null).foreach(line => println(line))
Also please have a look at this article https://blog.matthewrathbone.com/2013/12/28/reading-data-from-hdfs-even-if-it-is-compressed
Please let me know if this answers your question.

Fast file writing in Scala?

So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge-list file for use with GraphX.
The biggest slowdown is actually creating this text file; we're talking maybe a million records written to it. Is there a way I can parallelize this task or make it faster in any way, for example by somehow storing it in memory?
More info:
I am using a Hadoop cluster to iterate through the graph, and here is the code snippet I'm currently using to create the text file and write it to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
// Write one "inVertexId outVertexId" pair per line
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}
os.close()
fs.close()
Writing files to HDFS is never fast. Your tags seem to suggest that you are already using Spark anyway, so you might as well take advantage of it.
import spark.implicits._  // needed for .toDF

spark.sparkContext
  .makeRDD(edges.toStream, 20)  // the sequence first, then the number of partitions
  .map(e => (e.inVertex().id().toString, e.outVertex().id().toString))
  .toDF("src", "dst")
  .write
  .option("sep", " ")  // space-separated edge list
  .csv(path.toString)  // the output path as a string
This splits your input into 20 partitions (you can control that number with the numeric argument to makeRDD above) and writes them in parallel as 20 different chunks in HDFS that together represent your resulting file.
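If you do need to keep the single-file, iterator-based approach from the question, a smaller win is to buffer the writes so each record is not a separate call against the HDFS stream. This is only a sketch reusing the question's edges, fs and path variables:
import java.io.{BufferedWriter, OutputStreamWriter}

val writer = new BufferedWriter(new OutputStreamWriter(fs.create(path)))
while (edges.hasNext) {
  val current = edges.next()
  writer.write(s"${current.inVertex().id()} ${current.outVertex().id()}\n")
}
writer.close()  // flushes the buffer and closes the underlying HDFS stream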

Reading parquet files from multiple directories in Pyspark

I need to read parquet files from multiple paths that are not parent or child directories.
For example:
dir1 ---
        |
        ------- dir1_1
        |
        ------- dir1_2
dir2 ---
        |
        ------- dir2_1
        |
        ------- dir2_2
sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2
Right now I'm reading each directory and merging the dataframes using unionAll.
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancy way using unionAll?
Thanks
A little late but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet()
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
This is convenient if you want to pass a few blobs into the path argument:
basePath = 's3://bucket/'
paths = [
    's3://bucket/partition_value1=*/partition_value2=2017-04-*',
    's3://bucket/partition_value1=*/partition_value2=2017-05-*',
]
df = spark.read.option("basePath", basePath).parquet(*paths)
This is cool because you don't need to list all the files in the basePath, and you still get partition inference.
Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
or
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
In case you have a list of files you can do:
files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
For ORC
spark.read.orc("/dir1/*","/dir2/*")
Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.
For Parquet,
spark.read.parquet("/dir1/*","/dir2/*")
Just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.
from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')

import posixpath as psp
fpaths = [
    psp.join("hdfs://localhost:9000" + dpath, fname)
    for dpath, _, fnames in client.walk('/eta/myHdfsPath')
    for fname in fnames
]
# At this point fpaths contains all hdfs files

parquetFile = sqlContext.read.parquet(*fpaths)

import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted
pdf
In Spark-Scala you can do this.
val df = spark.read.option("header","true").option("basePath", "s3://bucket/").csv("s3://bucket/{sub-dir1,sub-dir2}/")
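The same brace-expansion glob works for Parquet sources as well; here is a minimal Scala sketch, assuming the same bucket layout as the CSV example above:
val df = spark.read
  .option("basePath", "s3://bucket/")
  .parquet("s3://bucket/{sub-dir1,sub-dir2}/")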