Load DataFrame as Text File into HDFS and S3 [duplicate] - scala

I am trying to load a DataFrame into HDFS and S3 as a text file using the code below. The DataFrame name is finalData.
import java.text.SimpleDateFormat
import java.util.Calendar

val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)
With the above code the data is written successfully, but the file name is not the one I provided and the output is not a plain text file: a directory was created with the name I specified.
Directory Name - /user/test/File/test_20170918055206.txt
-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt
Found 2 items
/user/test/File/test_20170918055206.txt/_SUCCESS
/user/test/File/test_20170918055206.txt/part-00000
I want to create a file with the name I specified instead of a directory. Can anyone please assist me?

In my opinion, this is working as designed.
The repartition just before the save triggers a shuffle of the whole dataset and produces a new RDD with only one partition.
That single partition is what saveAsTextFile then stores in HDFS, as one part file inside the output directory.
The method is designed this way so that an arbitrary number of partitions can be written in a uniform way.
For example, if your RDD had 100 partitions and you did no coalesce or repartition before writing to HDFS, you would get a directory containing a _SUCCESS flag and 100 part files.
If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform and elegant way? Would the user have to supply a file name for every partition? That would quickly become tedious.
I hope this explanation helps you.
If you then need a single, complete file on your local file system, just use the Hadoop client command:
hadoop fs -getmerge [src] [dest]
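If the single part file should instead stay on HDFS under the originally intended name, a small sketch with the Hadoop FileSystem API can move and rename it after the write. It reuses sc, targetPath, targetFile and timeStampAfterFormatting from the question; the temporary file name is an assumption.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// saveAsTextFile wrote a directory; pick up its single part file
val partFile = fs.globStatus(new Path(s"$targetFile/part-*"))(0).getPath
val tmpFile  = new Path(s"$targetPath/.tmp_$timeStampAfterFormatting")

fs.rename(partFile, tmpFile)             // move the part file out of the directory
fs.delete(new Path(targetFile), true)    // drop the directory (part file + _SUCCESS)
fs.rename(tmpFile, new Path(targetFile)) // give it the originally intended name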

Related

how to save a Dataset[row] as text file in spark? [duplicate]

I would like to save a Dataset[Row] as a text file with a specific name in a specific location.
Can anybody help me?
I have tried this, but it produces a folder (LOCAL_FOLDER_TEMP/filename) with a parquet file inside it:
Dataset.write.save(LOCAL_FOLDER_TEMP+filename)
Thanks
You can't save your dataset to a specific filename using the Spark API; there are several workarounds:
As Vladislav suggested, collect your dataset, then write it to your filesystem using the Scala/Java/Python API.
Apply repartition/coalesce(1), write your dataset, and then rename the output file.
Neither approach is really recommended, because on large datasets it can cause OOM errors or simply throws away the power of Spark's parallelism.
The second issue, that you are getting a parquet file, is because parquet is Spark's default output format. To write text, use:
df.write.format("text").save("/path/to/save")
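Note that the text source only accepts a single string column, so a multi-column Dataset usually has to be flattened first. A minimal sketch, assuming a DataFrame named df and a comma as separator:
import org.apache.spark.sql.functions.concat_ws

// Concatenate all columns into one string column before writing as text
df.select(concat_ws(",", df.columns.map(df(_)): _*).as("value"))
  .write
  .format("text")
  .save("/path/to/save")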
Please use
RDD.saveAsTextFile()
It writes the elements of the dataset as a text file (or a set of text files) in a given directory on the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Reference: rdd-programming-guide
Spark always creates multiple files, one file per partition. If you want a single file, you need to collect() the data and then write it to a file the usual way, for example:
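A minimal sketch of the collect-and-write approach, assuming a Dataset[Row] named ds that is small enough to fit in driver memory; the output path is a placeholder:
import java.io.PrintWriter

val lines = ds.collect().map(_.mkString(","))   // one Row becomes one line of text

val writer = new PrintWriter("/tmp/exact-file-name.txt")
try lines.foreach(line => writer.println(line)) finally writer.close()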

Spark textFile vs wholeTextFiles

I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pairs, where the key is the path of each file and the value is the content of that file.
Now, from a technical point of view, what's the difference between :
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
When reading uncompressed files with textFile, it will split the data into chunks of 32 MB. This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, wholeTextFiles should be used.
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core and the data for each file will sit on a single machine, making it harder to distribute the load.
textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values
That's not accurate:
textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depends on the number of partitions requested, Spark's default number of partitions, and the underlying File System). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of files, and the order of records within the files (order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (might cause bad performance or OutOfMemoryError since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration - with multiple files potentially loaded into a single partition.
Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.
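A small contrast sketch of the two calls, assuming a SparkContext sc and a hypothetical directory my/path containing several CSV files:
// One record per line; the originating file name is not retained.
val lines = sc.textFile("my/path/*.csv", 8)        // RDD[String]

// One record per file: (fullFilePath, entireFileContent).
val files = sc.wholeTextFiles("my/path/*.csv", 8)  // RDD[(String, String)]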
As of Spark 2.1.1, the following is the code for textFile:
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
It internally uses hadoopFile, so it can read local files, HDFS files or S3 files using URI schemes such as file://, hdfs:// and s3a://.
Whereas the signature of wholeTextFiles is as below:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope
The two signatures look very similar, but textFile is useful for reading files, whereas wholeTextFiles is meant for reading directories of small files. It can also be used with larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas for a directory of smaller files wholeTextFiles is better.
textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.
wholeTextFiles() reads a directory of text files and returns a pair RDD.
For example, if there are a few files in a directory, wholeTextFiles() will create a pair RDD with the file name and path as the key and the whole file content as the value.
See the example below for clarity:
textFile = sc.textFile("ml-100k/u1.data")
textFile.getNumPartitions()
Output- 2
i.e. 2 partitions
textFile = sc.wholeTextFiles("ml-100k/u1.data")
textFile.getNumPartitions()
Output - 1
i.e. Only one partition.
So, in short, wholeTextFiles:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
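A hedged sketch of the most common follow-up: recovering per-line records from wholeTextFiles while keeping the originating file path (sc and the ml-100k path from the example above are assumed):
val linesWithFile = sc.wholeTextFiles("ml-100k/u1.data")
  .flatMap { case (path, content) =>
    content.split("\n").map(line => (path, line))
  }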

How to write data as single (normal) csv file in Spark? [duplicate]

I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with the actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR You are trying to enforce sequential, in-core concepts on a distributed environment. It cannot end well.
Spark doesn't provide a utility like this. To be able to create one in a semi-distributed fashion, you'd have to implement a multi-step, source-dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV writer (Univocity, Apache Commons) and then put it in the storage of your choice. This is sequential and requires multiple data transfers.
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
If the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala (see the sketch below).
But both solutions kind of destroy parallelism and thus the goal of Spark.
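A minimal sketch of the second option, assuming the DataFrame is small enough to collect on the driver and reusing the dataframe name and the E:/dataframe.csv path from the question; note that this naive version does no CSV quoting or escaping:
import java.io.PrintWriter

val header = dataframe.columns.mkString(",")
val rows   = dataframe.collect().map(_.mkString(","))

val out = new PrintWriter("E:/dataframe.csv")
try {
  out.println(header)
  rows.foreach(row => out.println(row))
} finally out.close()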
It is not possible directly, but you can do something like this:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")

import org.apache.hadoop.fs._

// Find the single part file Spark wrote and rename it to the desired name
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"dataframe.csv"))

"sqlContext.read.json" takes very long time to read 30,000 small JSON files (400 Kb) from S3

I am stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. These files are very small; each one is only 400-500 KB, but there are a great many of them.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or, at the very least, is it possible to read the last 1000 files instead of all of them? I found out that one can pass options via sqlContext.read.format("json").options, however I cannot figure out how to read only the N newest files.
If you can get the last 1000 modified file names into a simple list, you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure schema options.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?
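A hedged sketch of that idea using the Hadoop FileSystem API rather than the AWS SDK, assuming sc and sqlContext are available and reusing the s3n://path_to_bucket placeholder; it also assumes a Spark version whose DataFrameReader.json accepts multiple paths, as used above:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3n://path_to_bucket"), sc.hadoopConfiguration)

// List the bucket, keep only JSON files, sort by modification time (newest first)
val newest1000 = fs.listStatus(new Path("s3n://path_to_bucket/"))
  .filter(_.getPath.getName.endsWith(".json"))
  .sortBy(-_.getModificationTime)
  .take(1000)
  .map(_.getPath.toString)

val df = sqlContext.read.json(newest1000: _*)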
You are loading about 13 GB of information. Are you sure it takes that long just to create the DataFrame? Maybe the rest of the application is running and the UI only makes it look that way.
Try just loading and printing the first row of the DataFrame.
Anyway, what is the configuration of the cluster?

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up not knowing how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my future queries. Also, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking with parquet or another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
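Putting the two suggestions together with append mode, a minimal sketch (assuming an active sc and sqlContext, and reusing data, validPartnerIds and saveDestination from the question) might look like:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

data
  .filter(validPartnerIds($"partnerID"))
  .write
  .mode("append")
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)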
If you're using unsorted partitioning, your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data by your partition columns before writing, so that all the data for each output file lands in the same partition:
data
  .filter(validPartnerIds($"partnerID"))
  .repartition($"partnerID", $"year", $"month", $"day")  // optionally pass a target number of partitions as the first argument
  .write
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)
See: DataFrame.repartition